I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well. There are empty elements? They are not grouped together? And I can't figure out how to grab the last pattern text without it affecting the first 2 patterns. I'd also like it if all 3 matches were in a tuple together rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]
Don't use alternatives. Put the name and number patterns after each other in a single alternative, and add another group for the match up to the next **.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
all = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(all, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
Related
I am trying to get all numerical value (integers,decimal,float,scientific notation) from an expression and want to differentiate them from digits that are not realy number but part of a name. For example in the expression below.
230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1
the first 230 is not a numerical value as it is part of a tag (230FIC100.PV).
Using the web tool regexp.com I come up with the following expression that works for the expression above.
(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$
However when I try to use the above expression in python re.findall() I receive as result a list with 5 tuples with 6 elements on each.
import re
pat = r'(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$'
exp = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1 '
matches = re.findall(pat,exp)
The result is
special variables
function variables
0:('2', '', '', '2', 'e3', ' ')
1:('20', '', '', '20', '', ' ')
2:('20.4', '20.4', '', '', '', ' ')
3:('45', '', '', '45', '', ' ')
4:('2', '', '', '2', 'e4', ' ')
len():5
I would like some help to undestand what is happening and if there is any way to get this done in a similar way that happen on the regexp.com.
This should take care of it. (All the items are strings)
import re
st = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1'
re.findall(r'-?[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)|-?\d+\.\d+|\b\d+\b', st)
referred: How to extract numbers from strings,
Extracting scientific numbers from string,
and Extracting decimal values from string
I'm currently trying to scrape a website for some information but am running into some issues.
I currently have a bs4.element.Tag element with some html and text in it, and when I do "variable.text", I get the following text:
\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t
What I want is to get rid of all the white space characters (\n and \t) to get the relevant information in a list or any iterable form.
I've tried a bunch of regex commands already, but the one that got me closest to my goal was: re.split('[\t\n]',variable.text), I got the following:
['',
'',
'Ulmstead Club',
'',
'',
'',
'',
'',
'911 Lynch Dr',
'',
'',
'',
'',
'',
'',
'',
'Arnold, Maryland',
'',
'',
'',
'',
I've cut off a lot of the output to save some space.
I'm super lost and any help would be greatly appreciated
Try splitting on [\t\n]+:
re.split('[\t\n]+', variable.text.strip())
This would seem to work as it would eliminate the empty string entries in the output array.
My guess is that, this simple expression might be also helpful,
(?:\\n|\\t)
Demo
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:\\n|\\t)"
test_str = "\\n\\nUlmstead Club\\n\\t\\t\\t\\t\\t911 Lynch Dr\\n\\n\\t\\t\\t\\t\\t\\tArnold, Maryland\\t\\t\\t\\t\\t 21012\\n\\t\\t\\t\\t\\tUnited States\\n(410) 757-9836 \\n\\n Get directions\\n\\n Favorite court \\n\\n\\n\\nTennis Court Details\\n\\n\\n\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tLocation type:\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tClub\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tMatches played here:\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t0\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
You could use string.replace() function to get rid of the \n and \t, no really needing a regular expression to do so (I have replaced the \n and \t with 2 whitespaces for the next step):
variable.text = variable.text.replace("\n"," ")
variable.text = variable.text.replace("\t"," ")
if you want then to split your data into a list, you could split it through whitespaces, and use remove() to delete any extra empty strings in the list (note that I am not 100% sure of how you want your data separated, I have just made the solution that fitted my logic of how it should be split) :
result = re.split("[\s]\s+",variable.text)
while ('' in result):
result.remove('')
Here is the full code example:
import re
teststring ="\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t"
teststring = teststring.replace("\n"," ")
teststring = teststring.replace("\t"," ")
#split any fields with more than 1 whitespace between them
result = re.split("[\s]\s+",teststring)
#remove any empty string fields of the list
while ('' in result):
result.remove('')
print(result)
Result is:
['Ulmstead Club', '911 Lynch Dr', 'Arnold, Maryland', '21012', 'United States', '(410) 757-9836', 'Get directions', 'Favorite court', 'Tennis Court Details', 'Location type:', 'Club', 'Matches played here:', '0']
I would run 2 regex on the string starting with 1 then 2
Find \s*(?:\r?\n)\s*
Replace \n
https://regex101.com/r/EGTyKB/1
Find [ ]*\t+[ ]*
Replace \t
https://regex101.com/r/XIyi44/1
This clears out all the whitespace cruft and turns it into
a readable block of text.
Ulmstead Club
911 Lynch Dr
Arnold, Maryland 21012
United States
(410) 757-9836
Get directions
Favorite court
Tennis Court Details
Location type:
Club
Matches played here:
0
So this input file already has line breaks. It's the natural setting in which it's created. When I attempt to identify certain lines so that I can go back and call the values from said lines i get,
name = line[2]
IndexError: list index out of range
Any thoughts? I know there has to be an easy solution as this is fairly basic but I have sifted through every entry on splitting and splitting with ('\n') and nothing has worked. Any help from you folks would be greatly appreciated!
-Ut prosim
Input:
ID rpmI_bact
AC TIGR00001
DE ribosomal protein L35
Script
for i in info.readlines():
line = i.split('\n')
id_hit = line[0]
ac = line[1]
name = line[2]
print(name)
Error
name = line[2]
IndexError: list index out of range
First of all, when you do readlines, you will get back a list of all the lines your file, which will probably look something like this:
[' ID rpmI_bact', ' AC TIGR00001', ' DE ribosomal protein L35']
If you take one of these values and then try to split on newlines, you won't get anything split:
' ID rpmI_bact'.split('\n')
[' ID rpmI_bact']
Notice that the return value is a list with one element, so that's why you get your IndexError.
Now, it seems like you want to take each line and split on whitespace? If so, the way to do that is to use split(' '), but this is going to give you potentially unreliable content back:
In [8]: for line in lines:
...: print(line.split(' '))
...:
['', '', '', '', 'ID', '', 'rpmI_bact']
['', '', '', '', 'AC', '', 'TIGR00001']
['', '', '', '', 'DE', '', 'ribosomal', 'protein', 'L35']
Notice how it's not obvious where the "content" is? We can solve this in a few ways. One is to introduce regexes, while the other way is to simply take the values that are not empty strings (note that empty strings in Python are Falsey values):
In [9]: bool("")
Out[9]: False
In [10]: for line in lines:
...: print([elem for elem in line.split(' ') if elem])
...:
['ID', 'rpmI_bact']
['AC', 'TIGR00001']
['DE', 'ribosomal', 'protein', 'L35']
Now you have to figure out what you want to do with those lists. I didn't really get that from the question, though.
I'd probably consider just making it a dictionary. Then you can query it by the 2 letter key. No need for the .readlines() either.
d = dict(line.strip().split(' ', 2) for line in info)
That should give you a dictionary looking like
{'AC': 'TIGR00001', 'DE': 'ribosomal protein L35', 'ID': 'rpmI_bact'}
Then you can just access the ID you're interested in
name = d['DE']
I have some raw data like this:
Dear John Buy 1 of Coke, cost 10 dollars
Ivan Buy 20 of Milk
Dear Tina Buy 10 of Coke, cost 100 dollars
Mary Buy 5 of Milk
The rule of the data is:
Not everyone will start with "Dear", while if there is any, it must end with costs
The item may not always normal words, it could be written without limits (including str, num, etc.)
I want to group the information, and I tried to use regex. That's what I tried before:
for line in file.readlines():
match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)',line)
if match is not None:
print(match.groups())
file.close()
Now the output looks like:
('John', '1', 'Coke', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')
Showing above is what I want. However, if the item is replaced by some strange string like A1~A10, some of outputs will get wrong info:
('Ivan', '20', 'A1', '10')
('Mary', '5', 'A1', '10')
I think the constant format in the item field is that it will always end with , (if there is any). But I just don't know how to use the advantage.
Thought it's temporarily success by using the code above, I thought the (?P<item>\w+) has to be replaced like (?P<item>.+). If I do so, it'll take wrong string in the tuple like:
('John', '1', 'Coke, cost 10 dollars', '')
How could I read the data into the format I want by using the regex in Python?
I have tried this regular expression
^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?
Explanation
^(Dear)? match line starting either with Dear if exists
(?P<name>\w*) a name capture group to capture the name
\D* match any non-digit characters
(?P<num>\d+) named capture group to get the num.
\sof\s matching string of
(?P<drink>\w*) to get the drink
(,\D*(?P<cost>\d+)\D*)? this is an optional group to get the cost of the drink
with
>>> reobject = re.compile('^(Dear)?\s*(\w*)[\sa-zA-Z]*(\d+)\s*\w*\s*(\w*)(,[\sa-zA-Z]*(\d+)[\s\w]*)?')
First data snippet
>>> data1 = 'Dear John Buy 1 of Coke, cost 10 dollars'
>>> match_object = reobject.search(data1)
>>> print (match_object.group('name') , match_object.group('num'), match_object.group('drink'), match_object.group('cost'))
('John', '1', 'Coke', '10')
Second data snippet
>>> data2 = ' Ivan Buy 20 of Milk'
>>> match_object = reobject.search(data2)
>>> print (match_object.group('name') , match_object.group('num'), match_object.group('drink'), match_object.group('cost'))
('Ivan', '20', 'Milk', None)
Without regex:
with open('commandes.txt') as f:
results = []
for line in f:
parts = line.split(None, 5)
price = ''
if parts[0] == 'Dear':
tmp = parts[5].split(',', 1)
for tok in tmp[1].split():
if tok.isnumeric():
price = tok
break
results.append((parts[1], parts[3], tmp[0], price))
else:
results.append((parts[0], parts[2], parts[4].split(',')[0], price))
print(results)
It doesn't care what characters are used except spaces until the product name, that's why each line is splitted by spaces in 5 parts. When the line starts with "Dear", the last part is separated by the comma to extract the product name and the price. Note that if the price is always at the same place (ie: after "cost"), you can avoid the innermost for loop and replace it with price = tmp[1].split()[1]
Note: if you want to prevent empty lines to be processed, you can change the first for loop to:
for line in (x for x in f if x.rstrip()):
I would use this regex:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'
Demo
>>> line = 'Dear Tina Buy 10 of A1~A10'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', None)
>>> line = 'Dear Tina Buy 10 of A1~A10, cost 100 dollars'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', '100')
Explanation
The first section of your regex is perfectly fine, here’s the tricky part:
(?P<item>[^,]+) As we're sure that the string will contain a comma when the cost string is present, here we say that we want anything but comma to set the item value.
(?:,\D+)?(?P<costs>\d+)? Here we're using two groups. The important thing is the ? after the parenthesis enclosing the groups:
'?' Causes the resulting RE to match 0 or 1 repetitions of the
preceding RE. ab? will match either ‘a’ or ‘ab’.
So we use ? to match both possibilities (with the cost string present or not)
(?:,\D+) is a non-capturing that will match a comma followed by anything but a digit.
(?P<costs>\d+) will capture any digit in the named group cost.
If you use .+, the subpattern will grab the whole rest of the line as . matches any character but a newline without the re.S flag.
You can replace the \w+ with a negated character class subpattern [^,]+ to match one or more characters other than a comma:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)'
^^^^^
See the IDEONE demo:
import re
file = "Dear John Buy 1 of A1~A10, cost 10 dollars\n Ivan Buy 20 of Milk\nDear Tina Buy 10 of Coke, cost 100 dollars\n Mary Buy 5 of Milk"
for line in file.split("\n"):
match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,\W]+)\D*(?P<costs>\d*)',line)
if match:
print(match.groups())
Output:
('John', '1', 'A1~A10', '10')
('Ivan', '20', 'Mil', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Mil', '')
I have a string like:
'Agendas / Schedules meetings and speakers 4 F 1928-1209 Box 2'
And I am trying to split it on what appear to be tabs. Though if I print with print repr(str) I only see special characters at the end:
'Agendas / Schedules meetings and speakers 4 F 1928-1209 Box 2\r\n'
And if I try things like print re.split('\t+', str) or print re.split('\s+', str), nothing is split, ie output is still:
['Agendas / Schedules meetings and speakers 4 F 1928-1209 Box 2\r\n']
Is there a way to isolate these fixed width items if regex is not working out?
Update: am hoping to split exclusively on the larger white spaces, so .split() creating a list element of every word is not what I'm looking for.
I've ran across this a few times in the past, you may have a case of Zero-Width-Space.
>>> s = 'Agendas / Schedules meetings and speakers 4 F 1928-1209 Box 2'
>>> re.split(ur'[\u200b\s]+', s, flags=re.UNICODE)
['Agendas', '/', 'Schedules', 'meetings', 'and', 'speakers', '4', 'F', '1928-1209', 'Box', '2']
The split() method of string will split on whitespace by default. So:
print str.split()
should do the trick.
Thanks everyone for the input. Didn't realize that python had a zero-width-space bug (http://bugs.python.org/issue13391)
Anyway, it appears that matching on more than one whitespace did the trick:
>>>re.split('\s{2,}', s)
['Agendas / Schedules meetings and speakers', '4 F', '1928-1209', 'Box 2']
>>>