I have a web scraper that scrapes prices. For that, I need it to find prices like the following in strings:
762,50
1.843,75
In my first naive implementation, I didn't take the . into consideration and matched the first number with this regex perfectly:
re.findall("\d+,\d+", string)[0]
Now I need to match both cases and my initial idea was this:
re.findall("(\d+.\d+,\d+|\d+,\d+)", string)[0]
My idea was that the or operator would find either the first or the second form, but this doesn't work. Any suggestions?
No need to use an or; just make the first part optional:
(?:\d+\.)?\d+,\d+
The ? after (?:\d+\.) makes that group optional.
The ?: tells the engine not to capture this group, only to match it.
>>> re.findall(r'(?:\d+\.)?\d+,\d+', '1.843,75 762,50')
['1.843,75', '762,50']
Also note that you have to escape the . (dot), which would otherwise match any character except a newline (see http://docs.python.org/2/library/re.html#regular-expression-syntax)
In regular expressions, dot (.) matches any character (except a newline, unless the DOTALL flag is set). Escape it to match . literally:
\d+\.\d+,\d+|\d+,\d+
^^
To match multiple leading digits, the regular expression should be:
>>> re.findall(r'(?:\d+\.)*\d+,\d+', '1,23 1.843,75 123.456.762,50')
['1,23', '1.843,75', '123.456.762,50']
NOTE: I used a non-capturing group because re.findall returns a list of groups if one or more groups are present in the pattern.
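The effect described in the note is easy to see by running the same pattern once with a capturing group and once with a non-capturing group. This is only an illustrative sketch; the sample string is the one used in the first answer.

```python
import re

text = '1.843,75 762,50'

# With a capturing group, findall returns the group contents, not the whole match.
# A group that did not take part in a match is reported as an empty string.
captured = re.findall(r'(\d+\.)?\d+,\d+', text)

# With a non-capturing group, findall returns the full matches.
whole = re.findall(r'(?:\d+\.)?\d+,\d+', text)

print(captured)
print(whole)
```

The first list contains only the optional thousands part of each match, which is rarely what a price scraper wants.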
UPDATE
>>> re.findall(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+',
... '1,23 1.843,75 123.456.762,50 1.2.3.4.5.6.789,123')
['1,23', '1.843,75', '123.456.762,50']
How about:
(\d+[,.]\d+(?:[.,]\d+)?)
Matches:
- some digits followed by , or . and some digits
OR
- some digits followed by , or . and some digits followed by , or . and some digits
It matches: 762,50 and 1.843,75 and 1,75
It will also match 1.843.75. Are you OK with that?
See it in action.
I'd use this:
\d{1,3}(?:\.\d{3})*,\d\d
This will match numbers that use a dot as the thousands separator:
\d*\.?\d{3},\d{2}
See the working example here
This might be slower than regex, but given that the strings you are parsing are probably short, it should not matter.
Since the solution below does not use regex, it is simpler, and you can be more sure you are finding valid floats. Moreover, it parses the digit-strings into Python floats which is probably the next step you intend to perform anyway.
import locale
locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
def float_filter(iterable):
    result = []
    for item in iterable:
        try:
            result.append(locale.atof(item))
        except ValueError:
            pass
    return result
text = 'The price is 762,50 kroner'
print(float_filter(text.split()))
yields
[762.5]
The basic idea: by setting a Danish locale, locale.atof parses commas as the decimal marker and dots as the grouping separator.
In [107]: import locale
In [108]: locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
Out[108]: 'en_DK.UTF-8'
In [109]: locale.atof('762,50')
Out[109]: 762.5
In [110]: locale.atof('1.843,75')
Out[110]: 1843.75
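If changing the process-wide locale is not an option (for example because the Danish locale is not installed on the machine), the same conversion can be approximated by hand. The sketch below is an assumption-laden alternative, not part of the answer above: parse_price and find_prices are made-up names, and the regex is the thousands-separator pattern from the earlier UPDATE.

```python
import re

# Digits grouped in threes by dots, then a comma and the decimal digits.
PRICE_RE = re.compile(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+')

def parse_price(token):
    """Convert a string like '1.843,75' to a float (dot = grouping, comma = decimal)."""
    return float(token.replace('.', '').replace(',', '.'))

def find_prices(text):
    """Return every price found in text as a float."""
    return [parse_price(token) for token in PRICE_RE.findall(text)]

print(find_prices('The price is 762,50 kroner, was 1.843,75'))
```

Unlike locale.atof, this only understands the one number format it was written for, which may be acceptable for a scraper aimed at a single site.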
In general, you have zero or more groups of the form XXX. followed by one group of the form XXX, (each group up to 3 digits), always followed by two digits. Do you also want to support numbers like 1,375 (without 'cents')? You also need to avoid some false-detection cases.
That looks like this:
matcher=r'((?:(?:(?:\d{1,3}\.)?(?:\d{3}\.)*\d{3}\,)|(?:(?<![.0-9])\d{1,3},))\d\d)'
re.findall(matcher, '1.843,75 762,50')
This detects a lot of boundary cases, but may not catch everything....
In my code I want the answer [('22', '254', '15', '36')] but I get [('15', '36')]. My regex (?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3} does not seem to capture all 3 repetitions.
import re
def fun(st):
    print(re.findall(r"(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)", st))
ip="22.254.15.36"
print(fun(ip))
Overview
As I mentioned in the comments below your question, most regex engines only capture the last match. So when you do (...){3}, only the last match is captured: E.g. (.){3} used against abc will only return c.
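A minimal sketch of that behaviour in Python's re module, using a simplified digit pattern rather than the full octet expression:

```python
import re

# A repeated group keeps only its last repetition.
print(re.findall(r'(.){3}', 'abc'))

# Same effect with a simplified version of the IP pattern:
# group 1 ends up holding the third octet, group 2 the fourth.
m = re.match(r'(?:(\d+)\.){3}(\d+)', '22.254.15.36')
print(m.groups())
print(m.group(0))  # the whole match is still the complete address
```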
Also, note that changing your regex to (2[0-4]\d|25[0-5]|[01]?\d{1,2}) performs much better and catches full numbers (currently you'll grab 25 instead of 255 on the last octet for example - unless you anchor it to the end).
To give you a fully functional regex for capturing each octet of the IP:
(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})
Personally, however, I'd separate the logic from the validation. The code below first validates the format of the string and then, while splitting the string on the dot character, checks whether the logic (no octets greater than 255) passes.
Code
See code in use here
import re
ip='22.254.15.36'
if re.match(r"(?:\d{1,3}\.){3}\d{1,3}$", ip):
    print([octet for octet in ip.split('.') if int(octet) < 256])
Result: ['22', '254', '15', '36']
If you're using this method to extract IPs from an arbitrary string, you can replace re.match() with re.search() or re.findall(). In that case you may want to remove $ and add some logic to ensure you're not matching special cases like 11.11.11.11.11: (?<!\d\.)\b(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)
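As a rough sketch of that extraction variant: the look-around pattern below is the one quoted above, while extract_ips and the sample text are made up for illustration. The range check is kept as a separate step, as in the code above.

```python
import re

# Four dot-separated groups of 1-3 digits that are not part of a longer dotted run.
IP_CANDIDATE = re.compile(r'(?<!\d\.)\b(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)')

def extract_ips(text):
    """Return the candidates in text whose octets are all below 256."""
    found = []
    for candidate in IP_CANDIDATE.findall(text):
        if all(int(octet) < 256 for octet in candidate.split('.')):
            found.append(candidate)
    return found

sample = 'hosts 10.0.0.1 and 22.254.15.36, not 11.11.11.11.11 or 300.1.1.1'
print(extract_ips(sample))
```

Here 11.11.11.11.11 is rejected by the look-arounds and 300.1.1.1 by the numeric check.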
You only have two capturing groups in your regex:
(?:                # non-capturing group
    (              # group 1
        [0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
    )\.
){3}
(                  # group 2
    [0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)
That the first group can be repeated 3 times doesn't make it capture 3 times. The regex engine will only ever return 2 groups, and the last match in a given group will fill that group.
If you want to capture each of the parts of an IP address into separate groups, you'll have to explicitly define groups for each:
pattern = (
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)')
def fun(st, p=re.compile(pattern)):
    return p.findall(st)
You could avoid that much repetition with a little string and list manipulation:
octet = r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)'
pattern = r'\.'.join([octet] * 4)
Next, the pattern will just as happily match the 25 portion of 255. It is better to try the 200-255 range first, before the alternatives for smaller numbers:
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
pattern = r'\.'.join([octet] * 4)
This still allows leading 0 digits, by the way.
If all you are doing is passing in single IP addresses, then re.findall() is overkill; just use p.match() (matching only at the string start) or p.search(), and return the .groups() result if there is a match:
def fun(st, p=re.compile(pattern + '$')):
    match = p.match(st)
    return match and match.groups()
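Putting the pieces of this answer together gives roughly the following. Treat it as a sketch: parse_ip is an illustrative name, and re.fullmatch is used here in place of match plus the $ anchor (the two differ only in how a trailing newline is treated).

```python
import re

# One octet: try 200-255 first, then fall back to one to three digits starting with 0 or 1.
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
pattern = re.compile(r'\.'.join([octet] * 4))

def parse_ip(st):
    """Return the four octets as a tuple of strings, or None if st is not an IP address."""
    match = pattern.fullmatch(st)
    return match.groups() if match else None

print(parse_ip('22.254.15.36'))
print(parse_ip('256.1.1.1'))
```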
Note that no validation is done on the surrounding data. If you are trying to extract IP addresses from a larger body of text, you can't use re.match() and can't add the $ anchor, and the match could come from a longer run of octets (e.g. 22.22.22.22.22.22). You'd have to add some look-around assertions for that:
# only match an IP address if there is no indication that it is part of a larger
# set of octets; no leading or trailing dot or digits
pattern = r'(?<![\.\d])' + pattern + r'(?![\.\d])'
I encountered a very similar issue.
I found two solutions, using the official documentation.
The answer by @ctwheels above did mention the cause of the problem, and I really appreciate it, but it did not provide a solution.
Even when trying the lookbehind and the lookahead, it did not work.
First solution:
re.finditer
re.finditer iterates over match objects !!
You can use each one's 'group' method !
>>> def fun(st):
    pr=re.finditer(r"(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st)
    for p in pr:
        print(p.group(),end="")
>>> fun(ip)
22.254.15.36
Or !!!
Another solution, haha: you can still use findall, but you'll have to make every group a non-capturing group! The main problem is not with findall itself but with the groups it returns (a repeated group, as we all know, only keeps its last match):
"re.findall:
...If one or more groups are present in the pattern, return a list of groups"
(Python 3.8 Manuals)
So:
>>> def fun(st):
    print(re.findall(r"(?:(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st))
>>> fun(ip)
['22.254.15.36']
Have fun !
Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can I match the whole string by repeating either of the 2 patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"
You can use a character class to match the whole of string :
[\d()]+
But if you want to match the separate parts in separate groups you can use re.findall with a special regex based on your need, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
>>>
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly you can use the following regex:
(?:\d+\(\d+\))+
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
demo
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: with a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match the beginning of B or A, respectively), you can rewrite the sub-pattern as A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the steps needed to obtain a match and avoids a large part of the potential backtracking.
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
demo (compare the number of steps needed with the previous version and check the debugger to see what happens)
Note: if you don't want two consecutive numbers enclosed in parentheses, you only need to replace the quantifier * with + inside the non-capturing group and add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
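To compare the variants from this answer side by side, here is a small sketch. The three patterns are copied from above; the names unrolled, atomic_like, no_adjacent and check, as well as the sample strings other than the one from the question, are made up for illustration.

```python
import re

unrolled = re.compile(r'^(?=.)\d*(?:\(\d+\)\d*)*$')
atomic_like = re.compile(r'^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$')
no_adjacent = re.compile(r'^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$')

def check(s):
    """Return whether s matches (unrolled, atomic_like, no_adjacent)."""
    return tuple(bool(p.match(s)) for p in (unrolled, atomic_like, no_adjacent))

samples = ['123(432)123(342)2348(34)', '(1)(2)', '1(2)3(4)', '12(3', '()', '']
for s in samples:
    print(repr(s), check(s))
```

Only the last pattern rejects '(1)(2)', since it forbids two parenthesised numbers in a row; all three reject the malformed and empty inputs.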
I have the following 2 strings of train station IDs (showing the direction of travel) separated by "-".
String A (strA):
NS1-NS2-NS3-NS4-NS5-NS7-NS8-NS9-NS10-NS11-NS13-NS14-NS15-NS16-NS17-NS18-NS19-NS20-NS21-NS22-NS23-NS24-NS25-NS26-NS27
String B (strB):
NS27-NS26-NS25-NS24-NS23-NS22-NS21-NS20-NS19-NS18-NS17-NS16-NS15-NS14-NS13-NS11-NS10-NS9-NS8-NS7-NS5-NS4-NS3-NS2-NS1
I want to find out which of String A or B contains stations "NS4" followed by "NS1" (answer should be String B).
My current code as follows:
searchStr = ".*NS4-.*NS1(-.*|)"
re.search(searchStr, strA)
re.search(searchStr, strB)
But the result keeps returning a match in String A.
May I know how to specify 'searchStr' in order to match only String B?
Two ways to do it: tokenizing and improving the regex.
Tokenizing
tokA = strA.split('-')
tokB = strB.split('-')
print('NS4' in tokA and tokA.index('NS1') > tokA.index('NS4'))
print('NS4' in tokB and tokB.index('NS1') > tokB.index('NS4'))
# False
# True
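The tokenizing idea can be wrapped in a small helper that also copes with a station missing from the route (list.index raises ValueError in that case). This is a sketch under assumptions: visits_in_order is a made-up name, and the two routes are rebuilt programmatically to equal the strings in the question (NS1 to NS27 without NS6 and NS12).

```python
def visits_in_order(route, first, second, sep='-'):
    """Return True if station `second` appears somewhere after station `first`."""
    stops = route.split(sep)
    try:
        start = stops.index(first)
    except ValueError:
        return False  # `first` is not on this route at all
    return second in stops[start + 1:]

# Rebuild the question's strings: NS1..NS27 with NS6 and NS12 missing.
numbers = [n for n in range(1, 28) if n not in (6, 12)]
strA = '-'.join('NS%d' % n for n in numbers)
strB = '-'.join('NS%d' % n for n in reversed(numbers))

print(visits_in_order(strA, 'NS4', 'NS1'))
print(visits_in_order(strB, 'NS4', 'NS1'))
```

Because whole tokens are compared, NS10 or NS11 can never be mistaken for NS1.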
Regex
import re
pattern = '(^|-)NS4.+NS1(-|$)'
print(re.search(pattern, strA) is not None)
print(re.search(pattern, strB) is not None)
# False
# True
Performance
Tokenization: 2.3072939129997394
Regex: 11.138173280000046
But if you really need performance, I'm sure there are faster ways. Even the tokenization method does multiple passes.
As an alternative to tokenizing, you could use the following expression.
NS4(?=.*?NS1(?!\d))
It literally means:
The characters "NS4" literally.
Followed by any characters, until it finds NS1.
NS1 cannot be followed by a digit.
To educate readers as to what I've used:
(?=) is a Positive Lookahead.
Whatever you place inside this token must be found for the match to be True.
I placed .*? to match anything, as few times as possible using the ? quantifier, followed by NS1 since that is what we want to find.
(?!) is a Negative Lookahead
Whatever you place inside this token, as you might guess, must NOT be found for the match to be True.
I placed a digit in here, so that things like NS10 or NS11 or NS19 are never matched.
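If the station IDs are not known in advance, a pattern of this kind can be built on the fly. The sketch below is a variation rather than the exact expression above: it anchors both IDs on the separator so that only whole tokens match, and ordered_pattern is a hypothetical helper name. The routes are rebuilt to equal the question's strings.

```python
import re

def ordered_pattern(first, second, sep='-'):
    """Compile a regex that matches when token `first` is later followed by token `second`."""
    a, b, s = re.escape(first), re.escape(second), re.escape(sep)
    # first token, a separator, optionally more tokens, then the second token
    return re.compile(rf'(?:^|{s}){a}{s}(?:.*{s})?{b}(?:{s}|$)')

numbers = [n for n in range(1, 28) if n not in (6, 12)]
strA = '-'.join('NS%d' % n for n in numbers)
strB = '-'.join('NS%d' % n for n in reversed(numbers))

pattern = ordered_pattern('NS4', 'NS1')
print(pattern.search(strA) is not None)
print(pattern.search(strB) is not None)
```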
Easiest way to explain this is an example:
I have this string: 'Docs/src/Scripts/temp'
Which I know how to split two different ways:
re.split('/', 'Docs/src/Scripts/temp') -> ['Docs', 'src', 'Scripts', 'temp']
re.split('(/)', 'Docs/src/Scripts/temp') -> ['Docs', '/', 'src', '/', 'Scripts', '/', 'temp']
Is there a way to split by the forward slash, but keep the slashes as part of the words?
For example, I want the above string to look like this:
['Docs/', '/src/', '/Scripts/', '/temp']
Any help would be appreciated!
Interesting question, I would suggest doing something like this:
>>> 'Docs/src/Scripts/temp'.replace('/', '/\x00/').split('\x00')
['Docs/', '/src/', '/Scripts/', '/temp']
The idea here is to first replace all / characters by two / characters separated by a special character that would not be a part of the original string. I used a null byte ('\x00'), but you could change this to something else, then finally split on that special character.
Regex isn't actually great here because you cannot split on zero-length matches, and re.findall() does not find overlapping matches, so you would potentially need to do several passes over the string.
Also, re.split('/', s) will do the same thing as s.split('/'), but the second is more efficient.
A solution without split() but with lookaheads:
>>> s = 'Docs/src/Scripts/temp'
>>> r = re.compile(r"(?=((?:^|/)[^/]*/?))")
>>> r.findall(s)
['Docs/', '/src/', '/Scripts/', '/temp']
Explanation:
(?= # Assert that it's possible to match...
( # and capture...
(?:^|/) # the start of the string or a slash
[^/]* # any number of non-slash characters
/? # and (optionally) an ending slash.
) # End of capturing group
) # End of lookahead
Since a lookahead assertion is tried at every position in the string and doesn't consume any characters, it doesn't have a problem with overlapping matches.
1) You do not need regular expressions to split on a single fixed character:
>>> 'Docs/src/Scripts/temp'.split('/')
['Docs', 'src', 'Scripts', 'temp']
2) Consider using this method:
import os.path
def components(path):
    start = 0
    for end, c in enumerate(path):
        if c == os.path.sep:
            yield path[start:end+1]
            start = end
    yield path[start:]
It doesn't rely on clever tricks like split-join-splitting, which makes it much more readable, in my opinion.
If you don't insist on having slashes on both sides, it's actually quite simple:
>>> re.findall(r"([^/]*/)", 'Docs/src/Scripts/temp')
['Docs/', 'src/', 'Scripts/']
Neither re nor split are really cut out for overlapping strings, so if that's what you really want, I'd just add a slash to the start of every result except the first.
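A sketch of that last suggestion, built on a plain str.split rather than on findall; split_keep_slashes is just an illustrative name:

```python
def split_keep_slashes(path, sep='/'):
    """Split on sep, keeping a copy of the separator on every side where one was cut."""
    parts = path.split(sep)
    last = len(parts) - 1
    return [
        (sep if i > 0 else '') + part + (sep if i < last else '')
        for i, part in enumerate(parts)
    ]

print(split_keep_slashes('Docs/src/Scripts/temp'))
```

A string without any slash comes back unchanged as a single-element list.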
Try this:
re.split(r'(/)', 'Docs/src/Scripts/temp')
From python's documentation
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the
occurrences of pattern. If capturing parentheses are used in pattern,
then the text of all groups in the pattern are also returned as part
of the resulting list. If maxsplit is nonzero, at most maxsplit splits
occur, and the remainder of the string is returned as the final
element of the list. (Incompatibility note: in the original Python 1.5
release, maxsplit was ignored. This has been fixed in later releases.)
I'm not sure there is an easy way to do this. This is the best I could come up with...
import re
lSplit = re.split('/', 'Docs/src/Scripts/temp')
print([lSplit[0]+'/'] + ['/'+x+'/' for x in lSplit][1:-1] + ['/'+lSplit[len(lSplit)-1]])
Kind of a mess, but it does do what you wanted.
I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df"))) # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df"))) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} three repetitions of the preceding element
(...)* zero or more repetitions of the group in the parentheses
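As a quick check of this pattern against the examples from the question, here is a minimal sketch. It uses re.fullmatch instead of the explicit ^ and $ anchors and a non-capturing group; is_valid is a made-up helper name.

```python
import re

# A triple of a/b/c, followed by zero or more ",triple" repetitions.
TRIPLES = re.compile(r'[abc]{3}(?:,[abc]{3})*')

def is_valid(s):
    """Return True if s consists only of comma-separated triples of a, b, c."""
    return TRIPLES.fullmatch(s) is not None

for s in ['abc,bca,cbb', 'ccc,abc,aab,baa', 'bcb', 'abc,defx,df', 'abc,']:
    print(s, is_valid(s))
```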
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,defx,df")
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^[abc]{3}(,[abc]{3})*$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, at the start of the string, and does not care what comes after that. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
    total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
    for i in x.split(','):
        if i not in total:
            return False
    return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, ' ', bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that the pattern has match groups. To actually get the groups, use str.extract. Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.