I have elements that I've scraped off of a website and when I print them using the following code, they show up neatly as spaced out elements.
print("\n" + time_element)
prints like this
F
4pm-5:50pm
but when I pass time_element into a dataframe as a column and convert it to a string, the output looks like this
# b' \n F\n \n 4pm-5:50pm\n
I am having trouble understanding why it appears so and how to get rid of this "\n" character. I tried using regex to match the "F" and the "4pm-5:50pm" and I thought this way I could separate out the data I need. But using various methods including
# Define the list and the regex pattern to match
time = df['Time']
pattern = '[A-Z]+'
# Filter out all elements that match the pattern
filtered = [x for x in time if re.match(pattern, x)]
print(filtered)
I get back an empty list.
From my research, I understand that "\n" represents a new line and that there might be invisible characters. However, I don't understand how they behave well enough to remove them or work around them and extract the data that I need.
When I pass the data to csv format, it prints like this all in one cell
F
4pm-5:50pm
but I still end up in the similar place when it comes to separating out the data that I need.
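You can call strip() on the text when you extract it from the website to remove the leading and trailing "\n" characters. Note that strip() only trims the ends of the string; to separate the two values, split the stripped text on whitespace with split().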
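As a minimal sketch of that idea, assuming the scraped value is a bytes object like the one shown in the question (the variable names are made up for illustration): decode it, then let split() with no arguments break it on any run of whitespace. It also shows one likely reason the regex filter came back empty: re.match only matches at the very start of the string, which here is whitespace, whereas re.search scans the whole string.

```python
import re

# Assumed sample, copied from the output shown in the question.
raw = b" \n F\n \n 4pm-5:50pm\n"

text = raw.decode("utf-8")   # bytes -> str
parts = text.split()         # split on any run of whitespace, dropping empties
day, hours = parts

anchored = re.match(r"[A-Z]+", text)   # None: the string starts with a space
scanned = re.search(r"[A-Z]+", text)   # finds the first run of capitals anywhere

print(parts)
print(scanned.group())
```

If the column already holds str values, the decode step is not needed and split() can be applied directly.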
How do I convert data into comma-separated values? I want to convert it like this:
I have this data in Excel in a single cell
"ABCD x3 ABC, BAC x 3"
Want to convert to
ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
I can't find an easy way to do that.
I am trying to solve it in Python so I can get structured data.
Hi Zeeshan, sorting the string into usable data while also multiplying certain parts of it is kind of tricky for me.
The best solution I can think of is kind of gross, but it seems to work. Hopefully my comments aren't too confusing <3
import re

data = "ABCD x3 AB BAC x2"
# Split the string into a list of words that you can iterate through.
datalist = re.findall(r'\w+', data)
# Create a new list for the final result.
newlist = []
for item in datalist:
    # If the item is a multiplier: 'x' followed only by digits, e.g. 'x3'.
    if re.fullmatch(r'x\d+', item):
        # Split the 'x' from the multiplier number string.
        xvalue = item.split('x')
        # Grab and remove the last item added to newlist because it hasn't been multiplied yet.
        lastitem = newlist.pop()
        # Now add the last item back in as many times as the multiplier says.
        newlist.extend([lastitem] * int(xvalue[1]))
    else:
        # If the item isn't a multiplier, just add it to the list.
        newlist.append(item)
# Print the result.
print(newlist)
# re.findall() - returns all non-overlapping matches in a string as a list
# re.fullmatch() - matches only if the whole string fits the pattern
# .split() - splits a string into multiple substrings
# .pop() - removes the last item from a list and returns that item
# .extend() - adds every item of an iterable to the end of a list
# .append() - adds a single item to the end of a list
Keep in mind that to find the multiplier it looks for x immediately followed by a number (x1). If there is a space in between, for example (x 1), then the x and the number end up as separate items and the multiplier is not picked up.
There might be multiple ways around this issue, and I think the best fix is to restructure how the data is formatted in the cell.
Here are a couple of ways you can work with the data. They won't directly solve your issue, but I hope they help you think about how to approach it (not being rude, I don't actually have a good way to handle your example <3 )
split() will split your string on a separator (here a space) and return a list of substrings you can iterate over.
data = 'ABCD ABCD ABCD ABC BAC BAC BAC'
splitdata = data.split(' ')
print(splitdata)
#prints - ['ABCD', 'ABCD', 'ABCD', 'ABC', 'BAC', 'BAC', 'BAC']
You could also try to match strings from the data:
import re

data2 = "ABCD x3 ABC BAC x3"
result = []
for match in re.finditer(r'(\w+) x(\d+)', data2):
    substring, count = match.groups()
    result.extend([substring] * int(count))
print(result)
Use re.finditer to go through the string and match the data against the pattern '(\w+) x(\d+)'.
Each match then gets added to the list; items without a multiplier (ABC here) do not match and are skipped.
'\w' matches a word character (a letter, digit, or underscore).
'\d' matches a digit.
'+' is the quantifier and means one or more.
So the pattern '(\w+) x(\d+)' matches one or more word characters (\w+), followed by a space, then 'x', followed by one or more digits (\d+).
Because your cell data is essentially a string followed by a multiplier, then a string followed by another string and then another multiplier, the data feels too irregular for a general solution. I think it needs a direct solution that only works if you know exactly what is already in the cell, which is why I think the best fix is to rework the data in the cell first. I'm in no way an expert, and this answer is meant to help you think of ways around the problem and to add to the discussion :) If someone wants to correct me and offer a better solution, I would love to know myself.
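For what it's worth, here is one possible sketch that also copes with the "x 3" spacing and the comma from the original question. It assumes every item is a run of word characters optionally followed by whitespace, an 'x', and a number; the function name expand is made up for this example, and other cell formats would need a different pattern.

```python
import re

def expand(cell):
    """Expand 'NAME xN' groups; names without a multiplier count once."""
    result = []
    # (\w+) is the name; the optional group is whitespace, 'x', optional
    # whitespace, and the digits of the multiplier.
    for match in re.finditer(r"(\w+)(?:\s+x\s*(\d+)\b)?", cell):
        name, count = match.groups()
        # count is None when the name has no multiplier.
        result.extend([name] * int(count or 1))
    return result

expanded = expand("ABCD x3 ABC, BAC x 3")
csv_text = ",".join(expanded)
print(csv_text)
```

A stray 'x' that is not followed by a number would be treated as an ordinary name, so this still depends on the cell following the assumed format.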
I have been handed a bunch of data and am trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to replace all of these instances with an empty string, "". Is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
Assuming the digits are always at the end of the string, have a look at the solutions below.
With regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)
print(s)
It should print:
xyz
Without regex; keep in mind that this solution removes all digits, not only the ones at the end of the string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
It should print:
xyz
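For the markers described in the question ("^0", "^1", "^2", ...), a single re.sub call can cover every number at once. This is a small sketch with a made-up sample string; the caret has to be escaped because an unescaped ^ means "start of string" in a regular expression.

```python
import re

# Made-up sample containing the "^{number}" markers from the question.
text = "alpha^0 beta^12 gamma^2"

# \^ is a literal caret, \d+ is one or more digits.
cleaned = re.sub(r"\^\d+", "", text)
print(cleaned)
```

If only single digits should be removed, \d can be used in place of \d+.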
For a data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be possible to do this inside the pandas.read_table() function itself. Below is the solution I ended up using to fix the problem:
import re

def get_headers(file, headerline, regexstring, exclude):
    # file is opened with file.open(), so it should be a path object such as pathlib.Path
    # Get the string of the selected header line
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline - 1:
                headerstring = line
            elif i > headerline - 1:
                break
    # Parse headerstring
    reglist = re.split(regexstring, headerstring)
    # Filter entries in reglist
    # Filter out blank strings
    filteredlist = list(filter(None, reglist))
    # Filter out items in the exclude list (an empty or None exclude keeps everything)
    exclude = exclude or []
    headerslist = [entry for entry in filteredlist if entry not in exclude]
    return headerslist

get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is a path object for the file that contains the header (it is opened with file.open(), so for example a pathlib.Path). headerline is the line number (starting at 1) where the header names are. regexstring is the pattern that will be fed into re.split(); it is highly recommended to write it as a raw string by prefixing it with r. exclude is a list of miscellaneous strings that you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. It separates the "normal" split pattern (the " ") from the other characters that need to be removed (namely the quotes and parentheses).
Starting with the first group: (?:" "). The (...) groups the characters we want to match in order. The " " (quote, space, quote) is what we want to split on. The ?: says not to capture the contents of the group. This matters because otherwise re.split() would keep the captured text as separate items in the result. See re.split() in the documentation.
The second group is simply the other characters: a character class containing the quote and the two parentheses. Without it, the first and last items would be '("Time Step' and 'flow-time")\n'. Note that this causes \n to be treated as a separate entry in the list. This is why we use the exclude argument to clean that up after the fact.
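To see the pattern at work without a file, here is a small sketch that applies it to the header line from the question. It also shows an alternative that is not part of the solution above: pulling out whatever sits between each pair of double quotes with re.findall, which assumes every header name is quoted.

```python
import re

# Header line as shown in the question, including the trailing newline.
header_line = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'

# Same pattern as above: drop the empty strings and the leftover newline.
pieces = re.split(r'(?:" ")|["\)\(]', header_line)
split_headers = [p for p in pieces if p and p != '\n']

# Alternative: capture the text between each pair of double quotes.
quoted_headers = re.findall(r'"([^"]*)"', header_line)

print(split_headers)
print(quoted_headers)
```

Either list could then be handed to read_table() through its names argument, with the original header row skipped.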
I am trying to split a string I extract on the first occurrence of a comma. I have tried using the split, but something is wrong, as it doesn't split.
for i in range(len(items)):
    alldata = items[i].getText().encode('utf-8').split(',', 1)
    csvfile.writerow(alldata)
The variable items contains the data I extract from a URL. The output in the CSV file is put in one column, but I want it in two columns. An example of the data (alldata) I get in the CSV file looks like this:
['\n\n\n1958\n\n\nGeorge Lees\n']
Using this data as an example, I need the year 1958 to be on one column, and the name George Lees to be on another column instead of the new lines.
EDIT
Forgot to mention what I meant with the commas. The reason why I mentioned the commas is that I also tried splitting on whitespaces. When I did that I got the data:
['1958', 'George', 'Lees']
So what I tried to achieve was to split the data on the first comma occurrence. That's why I used split(',', 1), forgetting that I also need to split on whitespace. So my problem is that I don't know how to split so that the year is in one column and the whole name is in another column. I got
['\n\n\n1958\n\n\nGeorge Lees\n']
When I tried to split with split(',', 1)
You can use strip() to remove all whitespace at the start and end, and then split on "\n" to get the required output. I have also used filter to remove any empty strings.
Ex:
A = ['\n\n\n1958\n\n\nGeorge Lees\n']
print(list(filter(None, A[0].strip().split("\n"))))
Output:
['1958', 'George Lees']
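To get the two values into separate CSV columns, the cleaned list can be passed straight to a csv writer. The following is a sketch in Python 3 that writes to an in-memory buffer (io.StringIO) as a stand-in for the real output file; the variable names are made up for illustration.

```python
import csv
import io

raw = '\n\n\n1958\n\n\nGeorge Lees\n'

# Keep only the non-empty lines, trimmed of surrounding whitespace.
fields = [line.strip() for line in raw.splitlines() if line.strip()]

# Stand-in for the real file; each list element becomes one column.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)

print(fields)
print(buffer.getvalue())
```

Because the name stays a single list element, "George Lees" ends up in one column rather than being split on the space.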
I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file regardless of which ID value is in between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search?
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to have a function for replacing text. The function gets the match object from re.sub and inserts the id captured from the string being replaced.
import re

s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
# Capture the quoted id, whatever its value is
pat = re.compile(r'ref_layerid_mapping=("[^"]*") lyvis="off" toc_visible="off"')

def replacer(m):
    # Reinsert the captured id and switch both flags to "on"
    return 'ref_layerid_mapping=' + m.group(1) + ' lyvis="on" toc_visible="on"'

re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in the replacement pattern (see http://www.regular-expressions.info/replacebackref.html).
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
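Applied to one of the tags from the question, the back-reference approach could look like the sketch below. It assumes the ids are purely numeric (\d+) and that the attributes always appear in this order; the variable names are made up for illustration.

```python
import re

line = '<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>'

# Group 1 keeps everything up to use=, group 2 keeps the closing "/>".
pattern = re.compile(
    r'(<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use=)"false"(/>)'
)

# \1 and \2 put the captured parts back; only "false" is swapped for "true".
updated = pattern.sub(r'\1"true"\2', line)
print(updated)
```

Since the ids are matched by \d+ rather than written out, the same pattern keeps working when RowID or id_tool_base change.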