I used re.findall to extract a decimal number from a string like this:
size = "Koko33,5 m²"
numbers = re.findall(r"\d+\,*\d+", size)
print(numbers)  # ['33,5']
Then I tried to get only the number 33,5 out of that ['33,5'].
By guessing, I did this:
numbers = re.findall(r"\d+\,*\d+", size)[0]
And it worked. But I don't understand why it worked.
I'm new to programming, so any help is appreciated :)
It works because the pattern finds one or more digits, then any number of commas, then one or more digits.
\d matches a single digit, and + repeats the previous expression (\d) one or more times, so \d+ matches a run of consecutive digits. \, matches a literal comma, and the * after it allows that comma to appear zero or more times. Then there is another \d+.
The last part, the indexing ([0]), takes the first item of the list that re.findall returns (in this case there is only one match).
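To make the role of [0] concrete, here is a small sketch. The conversion to float at the end is an extra step that is not in the question, and it assumes the comma is a decimal separator.

```python
import re

size = "Koko33,5 m²"

# re.findall always returns a list of every non-overlapping match.
matches = re.findall(r"\d+,*\d+", size)
print(matches)      # ['33,5']

# Indexing with [0] picks the first string out of that list.
first = matches[0]
print(first)        # 33,5

# Assumption: the comma is a decimal separator, so swap it for a dot
# before converting to float.
value = float(first.replace(",", "."))
print(value)        # 33.5
```

If the string contains no number at all, re.findall returns an empty list and [0] raises IndexError, so it may be worth checking the list first.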
More explanation
You guessed well.
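\d+ Find 1 or more digits (0-9)
,* Find 0 or more commas
\d+ Find 1 or more digits (0-9)
The pattern should find 33,5 or 999,123: any "digits, optional commas, digits" pattern. Because it contains \d+ twice, it needs at least two digits, so a lone single digit such as 7 is not matched.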
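A quick way to see what the pattern accepts is to run it over a few strings. The sample strings below are made up for illustration.

```python
import re

pattern = re.compile(r"\d+,*\d+")

# Hypothetical sample inputs, not taken from the question.
samples = ["33,5", "999,123", "42", "7", "1,,2"]

# Map each sample to the list of matches found in it.
results = {s: pattern.findall(s) for s in samples}
for s, found in results.items():
    print(repr(s), "->", found)
```

Note that "42" matches (one digit for each \d+, no comma), "7" does not (two digits are required), and "1,,2" matches because ,* allows any number of commas.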
Best source on Regex that I have found is "Mastering Regular Expressions" by Jeffrey E. F. Friedl.
Related
I have code that takes a string as input, discards all the letters, and prints only the numbers that don't contain a 9 in any position.
I decided to do it with a regex but couldn't find a working expression. How does it need to be modified?
I also tried [^9], but it doesn't work.
import re
s = input().lstrip().rstrip()
updatedStr = s.replace(' ', '')
nums = re.findall('[0-8][0-8]', updatedStr)
print(nums)
The code should completely discard any number that contains a 9 in any position.
For example, if the input is:
"This is 67 and 98"
output: ['67']
input:
"This is the number 678975 or 56783 or 87290 thats it"
output: ['56783'] (as the other two numbers contain a 9)
I think you should try using:
nums = re.findall('[0-8]+', updatedStr)
instead.
[0-8]+ means "one or more occurrences of a digit from 0 to 8".
I tried: 12313491 a asfasgf 12340 asfasf 123159
And got: ['123134', '1', '12340', '12315']
(Your code returns a list. If you want to join the numbers, you should add some code.) Note that this splits a number at each 9 rather than discarding the whole number.
It sounds like you want to match all numbers that don't contain a 9.
Your pattern should match any run of digits that doesn't contain a nine and that starts after and ends before a non-digit.
pattern = re.compile(r'(?<=[^\d])[0-8]+(?=[^\d])')
pattern.findall(inputString) # Finds all the matches
Here the pattern is doing a couple of things.
(?<=...) is a positive lookbehind. This means we will only get matches that have a non-digit before them.
[0-8]+ will match 1 or more digits except 9.
(?=...) is a positive lookahead. We will only get matches that are followed by a non-digit.
Note:
inputString does not need to be stripped. In fact, this pattern will miss a number at the very beginning or end of the string, because the lookbehind and lookahead each require an actual character there. To prevent this, simply pad the string with any non-digit characters.
inputString = ' ' + inputString + ' '
Look at the Python re docs for more info.
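As an alternative to padding, a sketch with negative lookarounds is shown here. This swaps in a different technique from the one in the answer, and the function name numbers_without_nine is made up for illustration.

```python
import re

# Negative lookarounds: "not preceded by a digit" / "not followed by a digit".
# Unlike (?<=[^\d]) and (?=[^\d]), these also succeed at the string edges.
no_nine = re.compile(r"(?<!\d)[0-8]+(?!\d)")

def numbers_without_nine(text):
    """Return every whole number in text that contains no 9."""
    return no_nine.findall(text)

print(numbers_without_nine("This is 67 and 98"))
print(numbers_without_nine("This is the number 678975 or 56783 or 87290 thats it"))
print(numbers_without_nine("56783 first and last 1234"))
```

Because the lookarounds only assert and never consume characters, no stripping, space removal, or padding of the input should be needed.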
I'm trying to parse a text file and extract certain integers out of it. Each line in my text file is of this format:
a and b
where a is an integer and b could be a float or an integer
eg. '4 and 10.2356' or '400 and 25'
I need to extract both a and b. I'm trying to use re.findall() to do this:
print(re.findall("\d+", txt)[0])  # extract a
# extract b
try:
    print(float(re.findall("\d+.\d+", txt)[1]))
except IndexError:
    print(float(re.findall("\d+.\d+", txt)[0]))
Here txt is a single line from the file. The reason for the try and except block is as follows:
if a is a single-digit integer, e.g. 4, the try part of the code just returns b. However, if a is not a single-digit integer, e.g. 400, the try part of the code returns both a and b. I found this weird.
However, I don't know how to modify the above code to extract b when it is an integer. I tried putting another try and except block inside the existing except block, but it gave me weird results (in some instances a and b got concatenated). Please help me out.
Also, can anyone please tell me the difference between \d+ and \d+.\d+, and why \d+.\d+ returns 400 and not 4 even when both are integers?
Just make the part of the pattern that matches the decimal part optional.
>>> s = '4 and 10.2356'
>>> re.findall(r'\d+(?:\.\d+)?', s)
['4', '10.2356']
>>> print(int(re.findall(r'\d+(?:\.\d+)?', s)[0]))
4
>>> print(float(re.findall(r'\d+(?:\.\d+)?', s)[1]))
10.2356
\d+ matches one or more digits.
\d+.\d+ matches one or more digits, then any single character, then one or more digits. That is why it matches 400 (4, then 0 as the "any character", then 0) but not 4.
\d+\.\d+ matches one or more digits, then a literal dot, then one or more digits.
\d+(?:\.\d+)? matches integers as well as floating-point numbers because the part of the pattern that matches the decimal part is optional. A ? after a capturing or non-capturing group makes the whole group optional.
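The three patterns can be compared side by side on the two example lines from the question. The helper parse_line is a hypothetical name, and it assumes each line contains exactly two numbers.

```python
import re

lines = ["4 and 10.2356", "400 and 25"]

for txt in lines:
    unescaped = re.findall(r"\d+.\d+", txt)        # . is any character
    escaped = re.findall(r"\d+\.\d+", txt)         # \. is a literal dot
    optional = re.findall(r"\d+(?:\.\d+)?", txt)   # decimal part optional
    print(txt, unescaped, escaped, optional)

def parse_line(txt):
    """Return (a, b) from a line like '4 and 10.2356'.

    Assumes the line holds exactly two numbers; otherwise the
    unpacking below raises ValueError.
    """
    a, b = re.findall(r"\d+(?:\.\d+)?", txt)
    return int(a), float(b)

print(parse_line("4 and 10.2356"))
print(parse_line("400 and 25"))
```

With the optional group, the same call works whether b is an integer or a float, so the try/except block from the question is no longer needed.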
import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search('(have )(-\w[.]+)( dollars\w+)',sequence)
print m.group(0)
print m.group(1)
print m.group(2)
I'm looking for a way to extract text between two occurrences. In this case, the format is 'i have ' followed by a negative float and then followed by ' dollars\w+'.
How do I use re.search to extract this float?
Why don't the groups work this way? I know there's something I can tweak to get it to work with these groups. Any help would be greatly appreciated.
I thought I could use groups with parentheses, but I got an error.
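-\w[.]+ does not match -0.03: \w matches the single character 0, and [.]+ matches one or more literal dots (inside [...], . is a literal dot), so nothing in the pattern matches the digits 03 that follow the dot.
The \w+ after dollars also prevents the pattern from matching the sequence, because there is no word character directly after dollars (the next character is a space).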
Use (-?\d+\.\d+) as pattern:
import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search(r'(have )(-?\d+\.\d+)( dollars)', sequence)
print(m.group(1))  # captured groups are numbered from 1
print(m.group(2))
print(m.group(3))
BTW, captured group numbers start from 1. (group(0) returns the entire matched string.)
Your regex doesn't match for several reasons:
it always requires a - (OK in this case, questionable in general)
it requires exactly one digit before the . (and it even allows non-digits like A).
it allows any number of dots, but no more digits after the dots.
it requires one or more alphanumerics immediately after dollars.
So it would match "I have -X.... dollarsFOO in my hand" but not "I have 0.10 dollars in my hand".
Also, there is no use in putting fixed texts into capturing parentheses.
m = re.search(r'\bhave (-?\d+\.\d+) dollars\b', sequence)
would make much more sense.
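A minimal sketch of how that single-group pattern might be used, including the case where nothing matches. The function name extract_dollars is hypothetical.

```python
import re

def extract_dollars(sequence):
    """Return the amount in 'have <float> dollars' as a float, or None."""
    m = re.search(r"\bhave (-?\d+\.\d+) dollars\b", sequence)
    if m is None:
        # re.search returns None when the pattern is not found, so
        # calling m.group() without this check raises AttributeError.
        return None
    return float(m.group(1))

print(extract_dollars("i have -0.03 dollars in my hand"))
print(extract_dollars("I have 0.10 dollars in my hand"))
print(extract_dollars("i have no dollars in my hand"))
```

An unchecked None from re.search is a likely cause of the error mentioned in the question, since the original pattern never matches the sample string.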
This question has already been asked in many formulations before. You're looking for a regular expression that will find a number. Since number formats may include decimals, commas, exponents, plus/minus signs, and leading zeros, you'll need a robust regular expression. Fortunately, this regular expression has already been written for you.
See How to extract a floating number from a string and Regular expression to match numbers with or without commas and decimals in text
I have a text file with text that looks like below
Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}
I need to look for Num_row_labels >=10 in my text file. How do I do that using Python 3.2 regex?
Thanks.
Assuming that the data is formatted as above and there are no leading 0's in the number:
Num_row_labels=\d{2,}
A more liberal regex that allows arbitrary spaces, still assuming no leading 0's:
Num_row_labels\s*=\s*\d{2,}
An even more liberal regex that allows arbitrary spaces and leading 0's:
Num_row_labels\s*=\s*0*[1-9]\d+
If you need to capture the number, just surround \d{2,} (in the 1st and 2nd regex) or [1-9]\d+ (in the 3rd regex) with parentheses () and refer to it as the 1st capture group.
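Here is a sketch of the third regex with the capture group added. The sample strings are invented to exercise spaces and leading zeros.

```python
import re

# Third regex from above, with [1-9]\d+ wrapped in a capture group.
at_least_10 = re.compile(r"Num_row_labels\s*=\s*0*([1-9]\d+)")

# Hypothetical inputs for illustration.
samples = [
    "Num_row_labels=10",
    "Num_row_labels = 007",
    "Num_row_labels= 0125",
    "Num_row_labels=9",
]

captured = []
for text in samples:
    m = at_least_10.search(text)
    # group(1) is the number without its leading zeros; None if no match.
    captured.append(m.group(1) if m else None)

print(captured)
```

The values 007 and 9 are rejected because, after any leading zeros, fewer than two digits remain.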
Use:
match = re.search(r"Num_row_labels=(\d+)", line)
The (\d+) matches at least one decimal digit (0-9) and captures all digits matched as a group (groups are stored in the object returned by re.search and re.match, which I'm assigning to match here). To access the group and compare it against 10, use:
if int(match.group(1)) >= 10:
    print("Num_row_labels is at least 10")
This will allow you to easily change the value of your threshold, unlike the answers that do everything in the regex. Additionally, I believe this is more readable in that it is very obvious that you are comparing a value against 10, rather than matching a nonzero digit in the regex followed by at least one other digit. What the code above does is ask for the 1st group that was matched (match.group(1) returns the string that was matched by \d+), and then, with the call to int(), converts the string to an integer. The integer returned by int() is then compared against 10.
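Putting the pieces together, a sketch of scanning several lines with an adjustable threshold might look like this. The names THRESHOLD and lines_at_or_above are hypothetical, and the sample lines are adapted from the question.

```python
import re

THRESHOLD = 10  # hypothetical name; change this to move the cutoff

def lines_at_or_above(lines, threshold=THRESHOLD):
    """Return the lines whose Num_row_labels value is >= threshold."""
    hits = []
    for line in lines:
        match = re.search(r"Num_row_labels=(\d+)", line)
        # re.search returns None for lines without the key, so test
        # the match object before calling group().
        if match and int(match.group(1)) >= threshold:
            hits.append(line)
    return hits

sample = [
    'Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10',
    '}',
    'Num_row_labels=9',
    'Num_row_labels=0012',
]
print(lines_at_or_above(sample))
```

Since int() ignores leading zeros, a value such as 0012 is compared as 12 without any extra work in the regex.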
The regex is Num_row_labels=[1-9][0-9]{1}.*
Now you can use Python's re module to analyze your text and extract the matches.
The regex looks like:
Num_row_labels=[0-9]*[1-9][0-9]+
Example of usage:
if re.search('Num_row_labels=[0-9]*[1-9][0-9]+', line):
    print(line)
The regular expression [0-9]*[1-9][0-9]+ means that the string must contain at least
one digit from 1 to 9 ([1-9]; a character class [] in a regular expression matches any single character from the range given in the brackets);
and, after it, at least one digit from 0 to 9 (there can be more of them) ([0-9]+; the + sign means that the preceding character or expression can be repeated 1 or more times).
Any other digits can come before these ([0-9]*, meaning any digit, 0 or more times). Once you have a nonzero digit followed by at least one more digit, any digits in front do not matter: the number is greater than or equal to 10 anyway.
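The claim can be spot-checked by comparing the regex against a plain integer comparison. The test values below are chosen for illustration.

```python
import re

pattern = re.compile(r"Num_row_labels=[0-9]*[1-9][0-9]+")

# Hypothetical values, including leading zeros and single digits.
values = ["9", "09", "00", "10", "010", "100"]

# Values the regex accepts.
matched = [v for v in values if pattern.search("Num_row_labels=" + v)]

# Values that are numerically >= 10, for cross-checking.
expected = [v for v in values if int(v) >= 10]

print(matched)
print(expected)
```

For these samples the two lists agree, which is consistent with the explanation above.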
I'm attempting to string match 5-digit coupon codes spread throughout an HTML web page. For example, 53232, 21032, 40021, etc. I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5-digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases.
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Without padding the string for the special cases of the start and end of the string, as in John La Rooy's answer, one can use negative lookahead and lookbehind to handle both cases with a single regular expression.
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
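Note: There is a problem with using \D: \D matches (and consumes) a character that is not a digit, so a separator used by one match cannot also serve the next match. Use \b instead.
\b is important here because it matches the empty string at a word boundary, that is, only at the beginning or end of a word, without consuming any characters.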
import re
s = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", s)
result: ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", s)
output: ['56789', '01234']
\D is unable to handle numbers separated by a single comma or numerals entered one right after another.
More documentation: https://docs.python.org/2/library/re.html
More Clarification on usage of \D vs \b:
This example uses \D but it doesn't capture all the five digits number.
This example uses \b while capturing all five digits number.
Cheers
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
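A minimal sketch of that filter-afterwards approach. The function name five_digit_codes is made up for illustration.

```python
import re

def five_digit_codes(text):
    """Return every run of digits in text that is exactly 5 long."""
    # \d+ grabs each maximal run of digits; the length check then
    # discards runs that are shorter or longer than five.
    return [run for run in re.findall(r"\d+", text) if len(run) == 5]

s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
print(five_digit_codes(s))
```

Because \d+ is greedy and always takes the whole run, a 6-digit number can never contribute a 5-digit piece.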
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how Python treats line endings and whitespace there, though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a regex with a simpler expression:
re.findall(r"\d{5}", mystring)
It searches for 5 consecutive digits. But you have to be sure the string contains no longer runs of digits, because this pattern also matches 5-digit pieces of them.
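The difference shows up as soon as the text contains a longer number. In this sketch the sample text is invented, and the stricter pattern is the lookaround version from an earlier answer.

```python
import re

# Hypothetical text containing one real code and one 10-digit number.
text = "code 12345 and order 1234567890"

# Plain \d{5} also carves 5-digit pieces out of the longer number.
loose = re.findall(r"\d{5}", text)

# Lookarounds require that no digit touches either side of the match.
strict = re.findall(r"(?<!\d)\d{5}(?!\d)", text)

print(loose)
print(strict)
```

So the plain pattern is only safe when the input is known to contain no digit runs longer than five.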