Confused on this regular expression pattern in Python - python

I want to find a 6-digit number in my webpage:
<td style="width:40px;">705214</td>
My code is:
s = f.read()
m = re.search(r'\A>\d{6}\Z<', s)
l = m.group(0)

If you just want to find 6 digits in between a > and < symbol, use the following regex:
import re
s = '<td style="width:40px;">705214</td>'
m = re.search(r'>(\d{6})<', s)
l = m.groups()[0]
Note the use of parentheses ( and ) to denote a capturing group.
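One likely reason the original pattern never matched: \A and \Z anchor to the very start and very end of the whole string, so they cannot sit in the middle of a pattern that is searched for inside a page. Below is a minimal sketch, assuming the page text is already in a string. The helper name first_six_digits is made up for illustration. It also guards against re.search returning None:

```python
import re

def first_six_digits(html):
    """Return the first 6-digit run found between '>' and '<', or None."""
    m = re.search(r'>(\d{6})<', html)
    # re.search returns None when nothing matches, so check before using it
    return m.group(1) if m else None

s = '<td style="width:40px;">705214</td>'
print(first_six_digits(s))
# The anchored pattern from the question finds nothing in the same string
print(re.search(r'\A>\d{6}\Z<', s))
```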

You can also use a look-ahead and a look-behind for the checking:
m = re.search(r'(?<=>)\d{6}(?=<)', s)
l = m.group(0)
This regex matches 6 digits that are preceded by a > and followed by a <.
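As a rough comparison of the two approaches, the sketch below runs both patterns over a made-up row of cells (the sample string is an assumption, not from the question). Because lookarounds do not consume characters, the whole match is already just the digits:

```python
import re

s = '<td>705214</td><td>123</td><td>998877</td>'

# Lookarounds assert the '>' and '<' without consuming them,
# so each match is only the six digits.
with_lookaround = re.findall(r'(?<=>)\d{6}(?=<)', s)

# The capturing-group version yields the same values via group 1.
with_group = re.findall(r'>(\d{6})<', s)

print(with_lookaround)
print(with_group)
```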

You may want to check for any whitespace (tabs, space, newlines) between the tags. \s* means zero or more whitespace.
s='<td style="width:40px;">\n\n705214\t\n</td>'
m=re.search(r'>\s*(\d{6})\s*<',s)
m.groups()
('705214',)
Parsing HTML is a blast. Usually you treat the file as one long line and remove the leading and trailing whitespace around the values inside the tags. Looking into an HTML table parsing module may help, especially if you need to parse several columns.
One Stack Overflow answer does this with lxml etree.
Also, html.parser was suggested. Food for thought.
(Still learning what modules Python has to offer :) )
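For what it's worth, here is a minimal sketch of the html.parser route using only the standard library. The class name CellCollector and the sample markup are assumptions for illustration, and the sketch does not try to handle nested tables:

```python
import re
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell
        if self.in_td:
            self.cells[-1] += data

parser = CellCollector()
parser.feed('<tr><td style="width:40px;">\n\n705214\t\n</td><td>name</td></tr>')
cells = [c.strip() for c in parser.cells]
six_digit_cells = [c for c in cells if re.fullmatch(r'\d{6}', c)]
print(six_digit_cells)
```

The parser takes care of the tags and whitespace, so the regex only has to check that a cell is exactly six digits.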

I think you want something like this:
m = re.search(r'>(\d{6})<', s)
l = m.group(1)
The ( ) around \d{6} indicate a subgroup of the result.
If you want to find multiple instances of 6-digit substrings between > and < then try this:
s = '<tag1>111111</tag1> <tag2>222222</tag2>'
m = re.findall(r'>(\d{6})<', s)
In this case, m will be ['111111','222222'].

Related

regex match special case

I have a cinematic scenario with a bunch of strings like this:
80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla
And my goal is to match all the numbers up to : in selected sequence (selected is 80101_ in this example, strings #2, #3, #5, #6), matching strings without existing numbers (like 80101_:Blablab, string #4) but without matching the string with _intertitle (string #1).
My current regex looks like this (code in Python):
selection = "80101"; # I'm getting this from elsewhere
pattern = selection + "_" + "\d*";
This matches all the strings with/without numbers but also a string with _intertitle. If I modify my pattern like this "\d[^:]*", it doesn't match _intertitle but also doesn't match the string without numbers... I can't get the right pattern, could anyone please lead me in the right direction? Thanks.
I think you should add "(?=:)" at the end of your pattern:
r"80101_\d*(?=:)"
This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.
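A small sketch of that idea, building the pattern from the selection variable as in the question. The shortened list of lines and the use of re.escape are my additions; re.escape only matters if the selection could ever contain regex metacharacters:

```python
import re

lines = [
    "80101_intertitle:Blablabla",
    "80101_1:BlablablaBlablabla",
    "80101_:BlablablaBlablablaBlablabla",
    "80101_11:Blablabla",
    "801_1:Blablabla",
]

selection = "80101"
# Digits are optional, but whatever follows them must be a colon
pattern = re.compile(re.escape(selection) + r"_\d*(?=:)")

matched = [m.group(0) for m in map(pattern.match, lines) if m]
print(matched)
```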
You could use a negative lookahead:
80101_\d*(?!intertitle)
That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used.
regex101 demo
Your pattern could be written as:
pattern = selection + r"_\d*(?!intertitle)"
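One thing worth checking with the negative lookahead is that it only rules out the literal word intertitle. The sketch below tries it on a few lines; the last one, with _credits, is a hypothetical label that does not appear in the question:

```python
import re

pattern = re.compile(r"80101_\d*(?!intertitle)")

samples = ["80101_intertitle:Bla", "80101_7:Bla", "80101_:Bla", "80101_credits:Bla"]
results = {}
for line in samples:
    m = pattern.match(line)
    results[line] = m.group(0) if m else None
    print(line, "->", results[line])
```

If other non-numeric labels can occur, requiring the colon right after the digits may be the safer choice.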
You need anchors and multiline flag. Also, you should add the :.* at the end of the regex as well to match the whole string.
^80101_\d*:.*$
See the Demo: https://regex101.com/r/yqGgrv/1
Here is the respective python code as well:
In [1]: s = """80101_intertitle:Blablabla
...: 80101_1:BlablablaBlablabla
...: 80101_2:Blablabla
...: 80101_:BlablablaBlablablaBlablabla
...: 80101_3:BlablablaBlablabla
...: 80101_11:Blablabla
...: 801_1:Blablabla
...: 801_2:Blablabla"""
In [2]: import re
In [4]: re.findall(r'^80101_\d*:.*$', s, re.M)
Out[4]:
['80101_1:BlablablaBlablabla',
'80101_2:Blablabla',
'80101_:BlablablaBlablablaBlablabla',
'80101_3:BlablablaBlablabla',
'80101_11:Blablabla']
Yes, that is easily done:
import re
s = '''80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla'''
matches = re.findall(r'(80101_\d+:.*)', s)
for match in matches:
    print(match)
matches = re.findall(r'(80101_:.*)', s)
for match in matches:
    print(match)

Splitting a string with delimiters and conditions

I'm trying to split a general string of chemical reactions delimited by whitespace, +, = where there may be an arbitrary number of whitespaces. This is the general case but I also need it to split conditionally on the parentheses characters () when there is a + found inside the ().
For example:
reaction= 'C5H6 + O = NC4H5 + CO + H'
Should be split such that the result is
splitresult=['C5H6','O','NC4H5','CO','H']
This case seems simple when using filter(None,re.split('[\s+=]',reaction)). But now comes the conditional splitting. Some reactions will have a (+M), which I'd also like to split on, leaving only the M. In this case, there will always be a +M inside the parentheses.
For example:
reaction='C5H5 + H (+M)= C5H6 (+M)'
splitresult=['C5H5','H','M','C5H6','M']
However, there will be some cases where the parentheses will not be delimiters. In these cases, there will not be a +M but something else that doesn't matter.
For example:
reaction='C5H5 + HO2 = C5H5O(2,4) + OH'
splitresult=['C5H5','HO2','C5H5O(2,4)','OH']
My best guess is to use negative lookahead and lookbehind to match the +M but I'm not sure how to incorporate that into the regex expression I used above on the simple case. My intuition is to use something like filter(None,re.split('[(?<=M)\)\((?=\+)=+\s]',reaction)). Any help is much appreciated.
You could use re.findall() instead:
re.findall(pattern, string, flags=0)
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction0)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction1)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction2)
but, if you prefer re.split() and filter(), then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction0))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction1))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction2))
The pattern for findall is different from the pattern for split because the two functions look for opposite things:
findall looks for what you want to keep.
split looks for what you want to get rid of.
In findall, the pattern is '[A-Z0-9]+(?:\([1-9],[1-9]\))?':
[A-Z0-9] matches any upper-case letter or digit,
+ repeats that one or more times,
\([1-9],[1-9]\) matches a pair of digits with a comma in the middle, inside parentheses (literal parentheses outside a character class must be escaped with a backslash),
and the trailing ? makes that parenthesized part optional.
\([1-9],[1-9]\) sits inside (?: ) so that the ? applies to all of it. Plain ( ) would change what findall returns (the group's contents instead of the whole match), so the non-capturing group (?: ) is the right choice here.
Try the same kind of breakdown with the regex used in the split.
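To see that the two patterns really are opposite views of the same result, here is a short sketch that runs both on the three reactions from the question. The names keep and drop are mine. Note that in Python 3 filter() is lazy, so it is wrapped in list():

```python
import re

reactions = [
    'C5H6 + O = NC4H5 + CO + H',
    'C5H5 + H (+M)= C5H6 (+M)',
    'C5H5 + HO2 = C5H5O(2,4) + OH',
]

keep = re.compile(r'[A-Z0-9]+(?:\([1-9],[1-9]\))?')    # what to keep
drop = re.compile(r'(?<!,[1-9])[\s+=()]+(?![1-9,])')    # what to discard

found_all = []
pieces_all = []
for reaction in reactions:
    found = keep.findall(reaction)
    pieces = list(filter(None, drop.split(reaction)))
    found_all.append(found)
    pieces_all.append(pieces)
    print(found, found == pieces)
```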
That seems overly complicated to handle with a single regular expression to split the string. It'd be much easier to handle the special case of (+M) separately:
halfway = re.sub("\(\+M\)", "M", reaction)
result = filter(None, re.split('[\s+=]', halfway))
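Wrapped up as a function, the two-step idea might look like the sketch below. The name split_reaction is made up, and replacing (+M) with " M " (with spaces) rather than a bare "M" is my own tweak, so that the M stays separate even if the parentheses touch a species name:

```python
import re

def split_reaction(reaction):
    """Split a reaction string into species, treating (+M) as the species M."""
    # Step 1: turn the special case into an ordinary, space-separated token
    halfway = re.sub(r"\(\+M\)", " M ", reaction)
    # Step 2: split on runs of whitespace, + and =, dropping empty pieces
    return [piece for piece in re.split(r"[\s+=]+", halfway) if piece]

print(split_reaction('C5H5 + H (+M)= C5H6 (+M)'))
print(split_reaction('C5H5 + HO2 = C5H5O(2,4) + OH'))
```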
So here is the regex which you are looking for.
Regex: ((?=\(\+)\()|[\s+=]|((?<=M)\))
Flags used:
g for global search. Or use them as per your situation.
Explanation:
((?=\(\+)\() matches a ( only when it is immediately followed by +. This covers the first part of your (+M) problem.
((?<=M)\)) matches a ) only when it is immediately preceded by M. This covers the second part of your (+M) problem.
[\s+=] matches the remaining whitespace, + and = characters. This covers the last part of your problem.
Note: Parentheses that enclose digits, as in (2,4), are left alone because neither the positive lookahead nor the positive lookbehind assertion matches there.
Check Regex101 demo for working
P.S: Make it suitable for yourself as I am not a python programmer yet.
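A possible Python adaptation of this regex is sketched below. One change is mine rather than the answer's: the capturing groups are dropped, because re.split returns the text of capturing groups along with the pieces, which would put stray ( and ) into the result. The function name split_species is also an assumption:

```python
import re

# The same three alternatives, written without capturing groups so that
# re.split does not hand the matched parentheses back in the result.
delimiter = re.compile(r"\((?=\+)|[\s+=]|(?<=M)\)")

def split_species(reaction):
    # Each delimiter match is a single character; drop the empty pieces
    return [piece for piece in delimiter.split(reaction) if piece]

print(split_species('C5H5 + H (+M)= C5H6 (+M)'))
print(split_species('C5H5 + HO2 = C5H5O(2,4) + OH'))
```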

Python regex, how to search for multiple strings?

I'm new to Python and am trying to figure out a Python regex to find any strings of the form prefix-digits. For example, 'type1-001' and 'type2-001' should be a match, but 'type3-asdf001' shouldn't be a match. I would like to be able to match with a regex like [type1|type2|type3]-\d+ to find any strings that start with type1, type2, or type3 and then are appended with '-' and digits. Also, it would be cool to know how to search for any upper case text appended with '-' and digits.
Here's what I think should work, but I can't seem to get it right...
pref_num = re.compile(r'[type1|type2]-\d+')
[] will match any of the set of characters appearing between the brackets. To group regexes you need to use (). So, I think your regex should be something like:
pref_num = re.compile(r'(type1|type2)-\d+')
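A quick way to see the difference is to run both patterns on the same input. In the sketch below (variable names are mine), the bracket version matches only a single character from the set before the hyphen:

```python
import re

char_class = re.compile(r'[type1|type2]-\d+')   # one character from the set
grouped = re.compile(r'(type1|type2)-\d+')      # one of two whole words

text = 'type1-001'
class_match = char_class.search(text).group(0)
group_match = grouped.search(text).group(0)
print(class_match)
print(group_match)
```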
As to how to search any uppercase text appended with - and digits, I would suggest:
[A-Z]+-\d+
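As a small sketch of the upper-case pattern in use, assuming each candidate should match in its entirety (the candidate strings are made up), re.fullmatch avoids accepting strings that merely contain a matching piece:

```python
import re

upper_num = re.compile(r'[A-Z]+-\d+')

candidates = ['ABC-123', 'QA-07', 'abc-123', 'ABC-x12', 'ABC-123x']
# fullmatch requires the entire string to fit the pattern
valid = [c for c in candidates if upper_num.fullmatch(c)]
print(valid)
```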
If you only want the digit after "type" to be variable then you should put only those digits in the square brackets, without a | (inside brackets, | is a literal character), like so:
re.compile(r'type[12]-\d+')
You can use the pattern
'type[1-3]-[0-9]{3}'
Demo
>>> import re
>>> p = 'type[1-3]-[0-9]{3}'
>>> s = 'type2-005 with some text type1-101 and then type1-asdf001'
>>> re.findall(p, s)
['type2-005', 'type1-101']
pref_num = re.compile(r'(type1|type2|type3)-\d+')
m = pref_num.search('type1-000')
if m is not None: print(m.string)
m = pref_num.search('type2-000')
if m is not None: print(m.string)
m = pref_num.search('type3-abc000')
if m is not None: print(m.string)  # no match here, so nothing is printed

Get text between last forward slash and then before first hyphen

I need to parse a URL, and get 1585710 from :
http://www.example.com/0/100013573/1585710-key-description-goes-here
So that means it's between the last / and before the first -
I have very little experience with regex, it's a really hard concept for me to understand.
Any help or assistance would be much appreciated
Edit: Using Python.
Use the below regex and get the number from group index 1.
^.*\/([^-]*)-.*$
DEMO
Code:
>>> import re
>>> s = "http://www.example.com/0/100013573/1585710-key-description-goes-here"
>>> m = re.search(r'^.*\/([^-]*)-.*$', s, re.M)
>>> m
<_sre.SRE_Match object at 0x7f8a51f07558>
>>> m.group(1)
'1585710'
>>> m = re.search(r'.*\/([^-]*)-.*', s)
>>> m.group(1)
'1585710'
>>> m = re.search(r'.*\/([^-]*)', s)
>>> m.group(1)
'1585710'
Explanation:
.*\/ Matches all the characters up to and including the last / symbol.
([^-]*) Captures zero or more characters that are not -.
-.* Matches the - and all the remaining characters.
group(1) contains the characters captured by the first capturing group. Printing group(1) gives the desired result.
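Put together as a helper, this might look like the sketch below. The function names are made up. The second function is a plain string-method alternative, not something the answer proposes, included only for comparison:

```python
import re

def leading_id(url):
    """Return the text between the last '/' and the first '-' after it."""
    m = re.search(r'.*/([^-]*)-', url)
    return m.group(1) if m else None

def leading_id_no_regex(url):
    # Take everything after the last '/', then everything before the first '-'
    return url.rsplit('/', 1)[-1].split('-', 1)[0]

url = "http://www.example.com/0/100013573/1585710-key-description-goes-here"
print(leading_id(url))
print(leading_id_no_regex(url))
```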
You can use matching groups in order to extract the number with the regex \/(\d+)-:
import re
s = 'http://www.example.com/0/100013573/1585710-key-description-goes-here'
m = re.search(r'\/(\d+)-', s)
print(m.group(1)) # 1585710
Check out the Fiddler
Well, if you need to find any strings between a / and a -, you could simply do:
/.*-
Since . is any char, and * is any amount. However, this poses a problem, because you could get the whole /www.example.com/0/100013573/1585710-key-description-goes, which is between / and a -. So, what you need to do is to search for anything that is not a / and -:
/[^/-]*-
Inside [], a leading ^ negates the set, and the characters listed between the brackets act, roughly, as an OR list.
Hope that helps.
EDIT: No, it doesn't help, as user rici mentioned, when you have a - in your URL's host name (as in www.lala-lele.com).
To make sure you got the last /, you can match the rest of the string and require that it contains no / up to the end ($), as in:
/[^/-]*-[^/]*$
And, to get just the string inside it, wrap that part in a capturing group:
/([^/-]*)-[^/]*$
In Python's re module, unescaped ( and ) mark the group you want as output; \( and \) would match literal parentheses instead.
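To illustrate the hyphenated-host problem in Python, the sketch below compares the unanchored and the end-anchored patterns. Python's re uses unescaped parentheses for grouping. The URL is a hypothetical variant of the question's example, with the host name from the EDIT swapped in:

```python
import re

url = 'http://www.lala-lele.com/0/100013573/1585710-key-description-goes-here'

# Without the end anchor, the first '/...-' stretch wins: the host name
naive = re.search(r'/([^/-]*)-', url)
# Requiring no further '/' up to the end pins the match to the last segment
anchored = re.search(r'/([^/-]*)-[^/]*$', url)

print(naive.group(1))
print(anchored.group(1))
```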

Regex help to match groups

I am trying to write a regex for matching a text file that has multiple lines such as :
* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469
My intention is to match the "0050.56aa.3480 " and "Veth1379" and put them in group(1) & group(2) for using later on.
The regex I wrote is :
\*\s*\d{1,}\s*(\d{1,}\.(?:[a-z][a-z]*[0-9]+[a-z0-9]*)\.\d{1,})\s*(?:[a-z][a-z]+)\s*\d{1,}\s*.\s*.\s*((?:[a-z][a-z]*[0-9]+[a-z0-9]*))
But it does not seem to be working when I test at:
http://www.pythonregex.com/
Could someone point to any obvious error I am doing here.
Thanks,
~Newbie
Try this:
^\* [0-9]{3} +([0-9]{4}\.[0-9a-z]{4}\.[0-9a-z]{4}).*(Veth[0-9]{4})$
Debuggex Demo
The first part is in capture group one, the "Veth" code in capture group two.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a list of online testers in the bottom section.
I don't think you need a regex for this:
with open('myfile') as f:
    for line in f:
        fields = line.split()
        # the leading * is fields[0], so the address is fields[2] and the interface is the last field
        print("\n" + fields[2] + "\n" + fields[-1])
A very strict version would look something like this:
^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$
Debuggex Demo
Here I assume that:
the columns will be pretty much the same,
your first match group contains a group of decimal digits and two groups of lower-case hex digits,
and the last word can be anything.
A few notes:
\d+ is equivalent to \d{1,} or [0-9]{1,}, but reads better (imo)
use \. to match a literal ., as . would simply match anything
[a-z]{2} is equivalent to [a-z][a-z], but reads better (my opinion, again)
however, you might want to use \w instead to match a word character
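For reference, here is a minimal sketch that applies the strict pattern to the two sample lines from the question. It assumes the whole file has been read into one string, hence re.MULTILINE so that ^ and $ work per line. The variable names are mine:

```python
import re

text = """* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469"""

row = re.compile(
    r'^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$',
    re.MULTILINE,
)

# group(1) is the dotted address, group(2) the interface name
pairs = [(m.group(1), m.group(2)) for m in row.finditer(text)]
print(pairs)
```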
This will do it:
reobj = re.compile(r"^.*?([\w]{4}\.[\w]{4}\.[\w]{4}).*?([\w]+)$", re.IGNORECASE | re.MULTILINE)
match = reobj.search(subject)
if match:
    group1 = match.group(1)
    group2 = match.group(2)
else:
    result = ""
