import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search('(have )(-\w[.]+)( dollars\w+)',sequence)
print m.group(0)
print m.group(1)
print m.group(2)
I'm looking for a way to extract text between two occurrences. In this case, the format is 'i have ' followed by a negative float and then ' dollars\w+'.
How do I use re.search to extract this float?
Why don't the groups work this way? I know there's something I can tweak to get it to work with these groups. Any help would be greatly appreciated.
I thought I could use groups with parentheses, but I got an error.
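-\w[.]+ does not match -0.03: \w matches exactly one word character, and [.]+ matches one or more literal dots (inside [...] the . is not a wildcard), so the digits after the dot are never matched.
The \w+ after dollars also prevents the pattern from matching the sequence: there is no word character immediately after dollars, only a space.
Use (-?\d+\.\d+) as the pattern: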
import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search(r'(have )(-?\d+\.\d+)( dollars)', sequence)
print(m.group(1))  # captured group numbers start from 1
print(m.group(2))
print(m.group(3))
BTW, captured group numbers start from 1; group(0) returns the entire matched string.
Your regex doesn't match for several reasons:
it always requires a - (OK in this case, questionable in general).
it requires exactly one character before the ., and that character can be any word character, not just a digit (A would do).
it allows any number of dots, but no digits after the dots.
it requires one or more word characters immediately after dollars.
So it would match "I have -X.... dollarsFOO in my hand" but not "I have 0.10 dollars in my hand".
Also, there is no use in putting fixed texts into capturing parentheses.
m = re.search(r'\bhave (-?\d+\.\d+) dollars\b', sequence)
would make much more sense.
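As a quick check of the points above, here is a minimal sketch that runs the pattern from the question and the suggested pattern against the sample sentences. The variable names (broken, fixed, value) are only illustrative.

```python
import re

sequence = 'i have -0.03 dollars in my hand'

# Pattern from the question: one word character, then dots, then \w+ after "dollars".
broken = r'(have )(-\w[.]+)( dollars\w+)'
# Suggested pattern: optional sign, digits, a literal dot, digits.
fixed = r'\bhave (-?\d+\.\d+) dollars\b'

broken_match = re.search(broken, sequence)
odd_match = re.search(broken, 'I have -X.... dollarsFOO in my hand')
fixed_match = re.search(fixed, sequence)

print(broken_match)            # None: calling .group() on it raises AttributeError
print(odd_match.group(2))      # -X....
value = float(fixed_match.group(1))
print(value)                   # -0.03
```

Because re.search returns None when nothing matches, it is usually worth testing the result before calling .group() on it.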
This question has already been asked in many formulations before. You're looking for a regular expression that will find a number. Since number formats may include decimals, commas, exponents, plus/minus signs, and leading zeros, you'll need a robust regular expression. Fortunately, this regular expression has already been written for you.
See How to extract a floating number from a string and Regular expression to match numbers with or without commas and decimals in text
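For illustration only, here is one possible number pattern of that kind. It is a sketch under an assumed format (optional sign, optional fraction, optional exponent, no thousands separators) and is not necessarily the pattern given in the linked answers.

```python
import re

# Assumed number format: optional sign, digits with an optional fraction
# (or a bare fraction like .5), and an optional exponent. Thousands
# separators are deliberately not handled here.
NUMBER = re.compile(r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?')

text = 'i have -0.03 dollars, 1e3 cents and .5 euros'
values = [float(token) for token in NUMBER.findall(text)]
print(values)  # [-0.03, 1000.0, 0.5]
```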
Related
I want to extract the number before "2022" in a set of strings. I currently do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails to return a='202' when mystring=' 20220220519AX'. Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string.
Can you please guide me with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find, and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
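To make the difference between the greedy lookahead and a lazy match anchored at the start concrete, here is a small sketch. The third sample string is made up for illustration; it contains 2022 twice, so the two patterns disagree on it.

```python
import re

samples = [' 1020220519AX', ' 20220220519AX', ' 1202220225']

results = {}
for s in samples:
    # Greedy: the longest run of digits that is still followed by 2022.
    greedy = re.findall(r'\d+(?=2022)', s)
    # Lazy: the shortest run of leading digits that is followed by 2022.
    lazy = re.search(r'^\s*(\d+?)2022', s)
    results[s] = (greedy, lazy.group(1) if lazy else None)
    print(repr(s), results[s])

# ' 1020220519AX'  (['10'], '10')
# ' 20220220519AX' (['202'], '202')
# ' 1202220225'    (['12022'], '1')
```

Which of the two is right depends on whether the first or the last 2022 is the one that should act as the separator.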
You could use split() with an assertion that the match is not at the start of the string (after using strip()), or you can lazily match the first run of 1 or more digits from the start of the string, in case there are more occurrences of 2022:
import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; the string is only split.
If the string consists only of word characters, splitting on \B2022, where \B means "not a word boundary", will also prevent splitting at the start of the example string.
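A minimal sketch of the \B variant, using the two example strings from the question. The maxsplit=1 argument is an addition here, so that only the first eligible 2022 is used as the separator.

```python
import re

split_results = []
for s in [' 1020220519AX', ' 20220220519AX']:
    # \B matches only where there is no word boundary, for example between
    # two digits, so a 2022 at the very start of the stripped string is skipped.
    parts = re.split(r'\B2022', s.strip(), maxsplit=1)
    split_results.append(parts)
    print(parts)

# ['10', '0519AX']
# ['202', '0519AX']
```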
I am learning regex using Python and am a little confused by this tutorial I am following. Here is the example:
rand_str_2 = "doctor doctors doctor's"
# Match doctor doctors or doctor's
regex = re.compile("[doctor]+['s]*")
matches = re.findall(regex, rand_str_2)
print("Matches :", len(matches))
I get 3 matches
When I do the same thing but replace the * with a ? I still get three matches
regex = re.compile("[doctor]+['s]?")
When I look into the documentation I see that the * finds 0 or more and ? finds 0 or 1
My understanding of this is that it would not return "3 matches" because it is only looking for 0 or 1.
Can someone offer a better understanding of what I should expect out of these two Quantifiers?
Thank you
You are correct about the behavior of the two quantifiers. When using the *, the three matches are "doctor", "doctors" and "doctor's". When using the ?, the three matches are "doctor", "doctors" and "doctor'". With the *, it tries to match the characters in the character class (' and s) 0 or more times. Thus, for the final match it is greedy and matches as many times as possible, matching both ' and s. However, the ? will match at most one character in the character class, so it matches only the '. The leftover s is not matched on its own, because [doctor]+ needs at least one of the letters d, o, c, t or r, so the count stays at three.
The reason this happens is the character class in that specific expression. The square brackets tell the regex engine to "match any single character in this list". This means that it is looking for either a ' or an s to satisfy the expression.
Now you can see how the quantifier affects this. ['s]? tells the pattern to "match ' or s between 0 and 1 times, as many times as possible", so it matches the ' and stops right before the s.
Doing ['s]* on the other hand is telling it to "match ' or s between 0 and infinity, as many times as possible". In this case it will match both the ' and the s because they're both in the list of characters it's trying to match.
I hope this makes sense. If not, feel free to leave a comment and I'll try my best to clarify it.
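Printing the matches instead of only counting them makes the difference visible. The sketch below uses the string from the question; the last pattern is a hypothetical stricter alternative that spells the word out, since [doctor]+ is a character class and accepts any mix of those letters.

```python
import re

rand_str_2 = "doctor doctors doctor's"

star = re.findall(r"[doctor]+['s]*", rand_str_2)
optional = re.findall(r"[doctor]+['s]?", rand_str_2)
print(star)      # ['doctor', 'doctors', "doctor's"]
print(optional)  # ['doctor', 'doctors', "doctor'"]

# The character class also accepts other words built from d, o, c, t, r.
other_words = re.findall(r"[doctor]+['s]*", "rod odd")
print(other_words)  # ['rod', 'odd']

# Hypothetical stricter pattern: the literal word, optionally followed by s or 's.
literal = re.findall(r"doctor(?:'?s)?", rand_str_2)
print(literal)   # ['doctor', 'doctors', "doctor's"]
```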
I used re.findall to extract a decimal number from a string like this:
size = "Koko33,5 m²"
numbers = re.findall("\d+\,*\d+", size)
print(numbers)  # ['33,5']
Then I was trying to get only the number 33,5 out of that ['33,5'].
And by guessing I did this:
numbers = re.findall("\d+\,*\d+", size)[0]
And it worked. But I don't understand why it worked?
I'm new to programming so every help is good :)
It works because it finds the pattern where there is a number, then an optional comma, then another number.
\d matches a single digit, and + repeats the previous expression (\d) so that it matches a whole run of consecutive digits. Then \, matches the comma, and the * after it allows the comma to occur zero or more times. Then there is another \d+.
Finally, the indexing part ([0]) gets the first matched pattern out of the list that re.findall returns (in this case there is only one).
More explanation
You guessed well.
\d+ Find 1 or more digits (0-9)
,* Find 0, 1 or more commas
\d+ Find 1 or more digits (0-9)
The pattern should find 33,5 or 999,123. Any "number comma number" pattern.
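The steps above can be sketched as follows. The conversion to float and the single-digit example are additions for illustration; they are not part of the original question.

```python
import re

size = "Koko33,5 m²"
pattern = r"\d+,*\d+"

numbers = re.findall(pattern, size)   # always a list, even for a single match
first = numbers[0]                    # indexing picks the first string out of the list
value = float(first.replace(",", "."))
print(numbers, first, value)          # ['33,5'] 33,5 33.5

# The pattern needs at least two digits, so a single-digit size is not found,
# and numbers[0] would raise IndexError on the empty list.
single_digit = re.findall(pattern, "Koko7 m")
print(single_digit)                   # []
```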
Best source on Regex that I have found is "Mastering Regular Expressions" by Jeffrey E. F. Friedl.
I have a webscraper that scrapes prices, for that I need it to find following prices in strings:
762,50
1.843,75
In my first naive implementation, I didn't take the . into consideration and matched the first number with this regex perfectly:
re.findall("\d+,\d+", string)[0]
Now I need to match both cases and my initial idea was this:
re.findall("(\d+.\d+,\d+|\d+,\d+)", string)[0]
The idea was that the or operator would find either the first or the second form, but that doesn't work. Any suggestions?
No need to use an or; just make the first part an optional group:
(?:\d+\.)?\d+,\d+
The ? after (?:\d+\.) makes that group optional.
The '?:' tells the engine not to capture this group, just match it.
>>> re.findall(r'(?:\d+\.)?\d+,\d+', '1.843,75 762,50')
['1.843,75', '762,50']
Also note that you have to escape the . (dot) that would match any character except a newline (see http://docs.python.org/2/library/re.html#regular-expression-syntax)
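The reason the group should be non-capturing shows up as soon as re.findall is used. A small sketch comparing the two forms on the same input:

```python
import re

text = '1.843,75 762,50'

# With a capturing group, findall returns the group contents, not the whole match.
with_capture = re.findall(r'(\d+\.)?\d+,\d+', text)
# With a non-capturing group, findall returns the full matches.
without_capture = re.findall(r'(?:\d+\.)?\d+,\d+', text)

print(with_capture)     # ['1.', '']
print(without_capture)  # ['1.843,75', '762,50']
```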
In regular expressions, the dot (.) matches any character (except a newline, unless the DOTALL flag is set). Escape it to match . literally:
\d+\.\d+,\d+|\d+,\d+
   ^^
To match multiple leading digits, the regular expression should be:
>>> re.findall(r'(?:\d+\.)*\d+,\d+', '1,23 1.843,75 123.456.762,50')
['1,23', '1.843,75', '123.456.762,50']
NOTE: I used a non-capturing group because re.findall returns a list of groups if one or more groups are present in the pattern.
UPDATE
>>> re.findall(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+',
... '1,23 1.843,75 123.456.762,50 1.2.3.4.5.6.789,123')
['1,23', '1.843,75', '123.456.762,50']
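The stricter pattern is easier to read when written with re.VERBOSE. The sketch below is the same expression, only spread over several lines with comments; the name PRICE is illustrative.

```python
import re

PRICE = re.compile(r"""
    (?<![\d.])      # not preceded by a digit or a dot
    \d{1,3}         # leading group of one to three digits
    (?:\.\d{3})*    # any number of '.ddd' thousands groups
    ,\d+            # decimal comma followed by at least one digit
""", re.VERBOSE)

text = '1,23 1.843,75 123.456.762,50 1.2.3.4.5.6.789,123'
prices = PRICE.findall(text)
print(prices)  # ['1,23', '1.843,75', '123.456.762,50']
```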
How about:
(\d+[,.]\d+(?:[.,]\d+)?)
Matches:
- some digits followed by , or . and some digits
OR
- some digits followed by , or . and some digits followed by , or . and some digits
It matches: 762,50 and 1.843,75 and 1,75
It will also match 1.843.75. Are you OK with that?
See it in action.
I'd use this:
\d{1,3}(?:\.\d{3})*,\d\d
This will match numbers that use a dot as the thousands separator
\d*\.?\d{3},\d{2}
See the working example here
This might be slower than regex, but given that the strings you are parsing are probably short, it should not matter.
Since the solution below does not use regex, it is simpler, and you can be more sure you are finding valid floats. Moreover, it parses the digit-strings into Python floats which is probably the next step you intend to perform anyway.
import locale
locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
def float_filter(iterable):
    result = []
    for item in iterable:
        try:
            result.append(locale.atof(item))
        except ValueError:
            pass
    return result
text = 'The price is 762,50 kroner'
print(float_filter(text.split()))
yields
[762.5]
The basic idea: by setting a Danish locale, locale.atof parses commas as the decimal marker and dots as the grouping separator.
In [107]: import locale
In [108]: locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
Out[108]: 'en_DK.UTF-8'
In [109]: locale.atof('762,50')
Out[109]: 762.5
In [110]: locale.atof('1.843,75')
Out[110]: 1843.75
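If the en_DK locale is not installed on the machine, locale.setlocale raises locale.Error. A rough locale-free sketch of the same idea is shown below; parse_decimal_comma is a hypothetical helper, and it assumes that every dot is a grouping separator and every comma is the decimal mark.

```python
def parse_decimal_comma(token):
    """Hypothetical helper: treat '.' as grouping and ',' as the decimal mark."""
    return float(token.replace('.', '').replace(',', '.'))


def float_filter(iterable):
    result = []
    for item in iterable:
        try:
            result.append(parse_decimal_comma(item))
        except ValueError:
            pass  # not a number, skip it
    return result


text = 'The price is 1.843,75 kroner or 762,50 on sale'
prices = float_filter(text.split())
print(prices)  # [1843.75, 762.5]
```

This version does not check that the dots sit at real thousands positions, and float() also accepts tokens such as nan or inf, so it is looser than a strict regex.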
In general, you have zero or more groups of the form XXX. followed by a final group of the form XXX, (each group up to 3 digits), followed by two digits (always). Do you also want to support numbers like 1,375 (without 'cents')? You also need to avoid some false detection cases.
That looks like this:
matcher=r'((?:(?:(?:\d{1,3}\.)?(?:\d{3}\.)*\d{3}\,)|(?:(?<![.0-9])\d{1,3},))\d\d)'
re.findall(matcher, '1.843,75 762,50')
This detects a lot of boundary cases, but may not catch everything....
I'm attempting to string match 5-digit coupon codes spread throughout a HTML web page. For example, 53232, 21032, 40021 etc... I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5 digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Instead of padding the string to handle the special cases at the start and end of the string, as in John La Rooy's answer, one can use a negative lookbehind and a negative lookahead to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
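To see how the lookaround version differs from the other common suggestions, here is a small comparison on a made-up string that has a code at the very start, a code glued to letters, and two codes separated by a single comma.

```python
import re

s = "88888 999999 code12345 01234,56789"

lookaround = re.findall(r"(?<!\d)\d{5}(?!\d)", s)
word_boundary = re.findall(r"\b\d{5}\b", s)
non_digit = re.findall(r"\D(\d{5})\D", s)

print(lookaround)     # ['88888', '12345', '01234', '56789']
print(word_boundary)  # ['88888', '01234', '56789']
print(non_digit)      # ['12345']
```

The \b version misses 12345 because letters and digits are both word characters, so there is no boundary between code and 12345. The \D version misses codes at the start or end of the string, and it consumes the separator characters on either side of each match.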
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
Note: There is a problem with using \D: \D matches (and consumes) a character that is not a digit, so it fails at the start or end of the string and when two numbers are separated by a single character. Use \b instead.
\b is important here because it matches the empty string, but only at the beginning or end of a word.
import re
s = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", s)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", s)
output : ['56789', '01234']
\D is unable to handle numbers separated by a single comma, or a number at the very end of the string.
More documentation: https://docs.python.org/2/library/re.html
More Clarification on usage of \D vs \b:
This example uses \D but it doesn't capture all the five digits number.
This example uses \b while capturing all five digits number.
Cheers
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
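A minimal sketch of that filtering approach, reusing the sample string from the first answer:

```python
import re

s = "four digits 1234 five digits 56789 six digits 012345"

# Collect every run of digits, then keep only the runs that are exactly 5 long.
codes = [run for run in re.findall(r"\d+", s) if len(run) == 5]
print(codes)  # ['56789']
```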
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how python treats line-endings and whitespace there though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a simpler regular expression:
re.findall(r"\d{5}", mystring)
It searches for runs of 5 digits. Be aware that it also matches inside longer digit runs, so it is only safe when the string contains no numbers with more than 5 digits.