Problems with using re.findall() in python - python

I'm trying to parse a text file and extract certain integers out of it. Each line in my text file is of this format:
a and b
where a is an integer and b could be a float or an integer
eg. '4 and 10.2356' or '400 and 25'
I need to extract both a and b. I'm trying to use re.findall() to do this:
print(re.findall("\d+", txt)[0])  # extract a
# Extract b
try:
    print(float(re.findall("\d+.\d+", txt)[1]))
except IndexError:
    print(float(re.findall("\d+.\d+", txt)[0]))
here txt is a single line from the file. The reason for the try and except block is as follows:
if a is a single digit integer, eg. 4, the try part of the code just returns b. However, if a is not a single digit integer, eg. 400, the try part of the code returns both a and b. I found this weird.
However, I don't know how to modify the above code to extract b when it is an integer. I tried putting another try and except block inside the existing except block, but it gave me weird results (in some instances a and b got concatenated). Please help me out.
Also, can anyone please tell me the difference between \d+ and \d+.\d+, and why \d+.\d+ returns 400 and not 4 even though both are integers?

Just make the part of the pattern that matches the decimal part optional.
>>> s = '4 and 10.2356'
>>> re.findall(r'\d+(?:\.\d+)?', s)
['4', '10.2356']
>>> print(int(re.findall(r'\d+(?:\.\d+)?', s)[0]))
4
>>> print(float(re.findall(r'\d+(?:\.\d+)?', s)[1]))
10.2356
\d+ matches one or more digits.
\d+.\d+ matches one or more digits plus any single character plus one or more digits.
\d+\.\d+ matches one or more digits plus a literal dot plus one or more digits.
\d+(?:\.\d+)? matches integers as well as floating-point numbers because the group that matches the decimal part is optional. A ? after a capturing or non-capturing group makes the whole group optional.
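As a minimal sketch of the above, assuming every line contains exactly two numbers in the form "a and b" (parse_line is a hypothetical helper name, not part of the question), the optional-decimal pattern can extract both values in one call. The last line shows why the unescaped dot in the original pattern behaved so differently for 4 and 400.

```python
import re

# An integer with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def parse_line(txt):
    """Return (a, b) from a line shaped like 'a and b' (hypothetical helper)."""
    a_text, b_text = NUMBER.findall(txt)[:2]
    return int(a_text), float(b_text)

print(parse_line("4 and 10.2356"))  # (4, 10.2356)
print(parse_line("400 and 25"))     # (400, 25.0)

# The unescaped dot matches ANY character, so "\d+.\d+" needs at least
# three characters: "4" alone cannot match, but "400" can (4, then 0, then 0).
print(re.findall(r"\d+.\d+", "400 and 10.2356"))  # ['400', '10.2356']
```

A line with fewer than two numbers raises ValueError at the unpacking step, which may or may not be the behaviour you want for malformed input.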

Related

Using re.findAll for decimal numbers

I used re.findall to extract a decimal number from a string like this:
size = "Koko33,5 m²"
numbers = re.findall("\d+\,*\d+", size)
print(numbers)  # ['33,5']
Then I tried to get just the number 33,5 out of that ['33,5'].
On a guess, I did this:
numbers = re.findall("\d+\,*\d+", size)[0]
And it worked, but I don't understand why.
I'm new to programming so every help is good :)
It works because the pattern finds a number, then a comma, then another number.
\d matches a single digit, and + repeats the preceding expression (\d) one or more times, so \d+ matches a run of consecutive digits. Then \, matches a literal comma, and the * after it lets that comma occur zero or more times. Then there is another \d+.
The last thing, the indexing part ([0]), takes the first match from the list that findall returns (in this case there is only one).
More explanation
You guessed well.
\d+ Find 1 or more digits (0-9)
,* Find 0, 1 or more commas
\d+ Find 1 or more digits (0-9)
The pattern finds 33,5 or 999,123: any "number comma number" pattern. Because the comma is optional, it also matches a plain run of two or more digits, such as 335.
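A short sketch of what is going on, reusing the string from the question: re.findall returns a list of every match, so [0] simply picks the first element. The second sample string and the conversion to float are illustrative assumptions, not part of the original question.

```python
import re

size = "Koko33,5 m²"
numbers = re.findall(r"\d+,*\d+", size)
print(numbers)     # ['33,5']  (a list with one match)
print(numbers[0])  # 33,5      (the first match, as a string)

# Because ",*" also allows zero commas, plain multi-digit numbers match too,
# while a single digit such as 7 does not (the pattern needs two digits).
print(re.findall(r"\d+,*\d+", "12 apples, 7 pears, 3,25 kg"))  # ['12', '3,25']

# Assumption: the comma is a decimal separator, so swap it before converting.
value = float(numbers[0].replace(",", "."))
print(value)       # 33.5
```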
Best source on Regex that I have found is "Mastering Regular Expressions" by Jeffrey E. F. Friedl.

Regex expression to exclude any number from any place

I have code that takes a string as input, discards all the letters, and prints only the numbers that don't contain a 9 anywhere.
I decided to do it with a regex but couldn't find a working expression. How does the expression below need to be modified?
I have also tried [^9], but it doesn't work.
import re
s = input().lstrip().rstrip()
updatedStr = s.replace(' ', '')
nums = re.findall('[0-8][0-8]', updatedStr)
print(nums)
The code should completely discard the number which contains 9 at any place.
for example - if the input is:
"This is 67 and 98"
output: ['67']
input:
"This is the number 678975 or 56783 or 87290 thats it"
output: ['56783'] (as the other two numbers contain 9 at some places)
I think you should try using:
nums=re.findall('[0-8]+',updatedStr)
Instead.
[0-8]+ means "one or more occurrences of a digit from 0 to 8"
I tried : 12313491 a asfasgf 12340 asfasf 123159
And got: ['123134', '1', '12340', '12315']
(Your code returns the array. If you want to join the numbers you should add some code)
It sounds like you want to match all numbers that don't contain a 9.
Your pattern should match any run of digits that contains no 9 and is preceded and followed by a non-digit:
pattern = re.compile(r'(?<=[^\d])[0-8]+(?=[^\d])')
pattern.findall(inputString) # Finds all the matches
Here the pattern is doing a couple of things.
(?<=...) is a positive lookbehind. We will only get matches that have a non-digit immediately before them.
[0-8]+ will match 1 or more digits other than 9.
(?=...) is a positive lookahead. We will only get matches that are immediately followed by a non-digit.
Note:
inputString does not need to be stripped. In fact, this pattern will miss a number at the very beginning or end of the string, because both lookarounds require an actual non-digit character there. To prevent this, simply pad the string with any non-digit characters:
inputString = ' ' + inputString + ' '
Look at the python re docs for more info
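An alternative sketch, swapping the positive lookarounds for negative ones: (?<!\d) and (?!\d) only require that no digit is adjacent, so they also succeed at the start and end of the string and no padding is needed. The function name numbers_without_nine is made up for this example.

```python
import re

# Digits 0-8 only, with no digit directly before or after the run.
NO_NINE = re.compile(r"(?<!\d)[0-8]+(?!\d)")

def numbers_without_nine(text):
    """Return every whole number in text that contains no 9 (hypothetical helper)."""
    return NO_NINE.findall(text)

print(numbers_without_nine("This is 67 and 98"))
# ['67']
print(numbers_without_nine("This is the number 678975 or 56783 or 87290 thats it"))
# ['56783']
print(numbers_without_nine("45 at the start and 123"))
# ['45', '123']
```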

Regex sub phone number format multiple times on same string

I'm trying to use reg expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)\d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall, and re.sub but had no luck. I'm confused on how to deal with the circumstance of there being multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]
Here's a simpler solution that works for all of those cases, though it is a little naïve (it doesn't check that the brackets are matched).
\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\1)\2-\3
Explanation:
Works by first checking for 3 digits, optionally surrounded by brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character (a space, dot, or hyphen; the hyphen comes last in the class so that it is literal rather than a range), and then another 3 digits, also stored in a capturing group: [ .-]?(\d{3}).
And lastly, it does the previous step again, but with 4 digits instead of 3: [ .-]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
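Putting it together, here is a sketch that runs the simple pattern over the sample list from the question. The names PHONE, numbers and formatted are chosen for this example, and the separator class is written as [ .-] so that the hyphen is a literal character rather than a range. Because sub replaces every non-overlapping match, several numbers on one line need no special handling.

```python
import re

# Optional brackets around the area code, optional space/dot/hyphen separators.
PHONE = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")

numbers = [
    "(123)456-7890 (321)-654-0987",
    "(111) 111-1111",
    "222-222-2222",
    "(333)333.3333",
    "(444).444.4444",
    "blah blah blah (555) 555.5555",
    "666.666.6666 random text",
]

# sub() rewrites every match in the string, not just the first one.
formatted = [PHONE.sub(r"(\1)\2-\3", s) for s in numbers]

for line in formatted:
    print(line)
```

Like the pattern it is built on, this does not enforce the "space or start/end of string" rule from the question, so ten consecutive digits inside a longer digit string would also be rewritten.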
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\2)\3-\4

python regular expression numbers in a row

I'm trying to check a string for a maximum of 3 numbers in a row for which I used:
regex = re.compile("\d{0,3}")
but this does not work. For instance, the string 1234 would be accepted by this regex even though the digit string is over length 3.
If you want to check that a string has a maximum of 3 digits in a row, you need to search for '\d{4,}' instead, as you are only interested in digit strings over a length of 3.
import re
s = '123abc1234def12'
print(re.findall(r'\d{4,}', s))
# prints ['1234']
If you use {0,3}:
s = '123456'
print(re.findall(r'\d{0,3}', s))
# prints ['123', '456', '']
The regex matches digit strings of maximum length 3, and empty strings, so it cannot be used to test correctness. You can't check this way that every digit string is within the limit, but you can easily check for a digit string that is over the limit.
So to test, do something like this (re.search is used rather than re.match, because re.match only looks at the start of the string):
s = '1234'
if re.search(r'\d{4,}', s):
    print('Max digit string too long!')
# prints Max digit string too long!
Because {0,3} allows zero repetitions, \d{0,3} matches the empty string and therefore finds a match in every possible string. It's not clear what you mean by "doesn't work", but if you expect to match a string with digits, increase the repetition operator to {1,3}.
If you wish to exclude runs of 4 or more, try something like (?:^|\D)\d{1,3}(?:\D|$) and of course, if you want to capture the match, use capturing parentheses around \d{1,3}.
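One possible way to sketch this, using lookarounds in place of the (?:^|\D) and (?:\D|$) groups so that the neighbouring characters are checked but not consumed. Both function names are invented for the example.

```python
import re

def short_digit_runs(text):
    """Runs of 1-3 digits that are not part of a longer run (hypothetical helper)."""
    return re.findall(r"(?<!\d)\d{1,3}(?!\d)", text)

def has_long_digit_run(text):
    """True if a run of 4 or more digits appears anywhere in text."""
    return re.search(r"\d{4,}", text) is not None

print(short_digit_runs("123abc1234def12"))    # ['123', '12']
print(has_long_digit_run("123abc1234def12"))  # True
print(has_long_digit_run("ab12 cd345"))       # False
```

The lookaround form matters with findall: a consuming \D would swallow the separator between two neighbouring runs such as "12 34", and the second run would then be missed.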
The pattern you used finds substrings of 0-3 digits, so it can't do what you expect.
My solution:
>>> import re
>>> re.findall('\d','ds1hg2jh4jh5')
['1', '2', '4', '5']
>>> res = re.findall('\d','ds1hg2jh4jh5')
>>> len(res)
4
>>> res = re.findall('\d','23425')
>>> len(res)
5
So next you just need to use 'if' to check the number of digits.
There could be a couple of reasons:
Since you want \d to search for digits or numbers, you should probably spell that as "\\d" or r"\d". "\d" might happen to work, but only because d isn't special (yet) in a string. "\n" or "\f" or "\r" will do something totally different. Check out the re module documentation and search for "raw strings".
"\\d{0,3}" will match just about anything, because {0,3} means "zero or up to three". So, it will match the start of any string, since any string starts with the empty string.
Or perhaps you want to be searching for strings that are only zero to three digits, and nothing else. In this case, you want to use something like r"^\d{0,3}$". The reason is that regular expressions match anywhere in a string (or only at the beginning if you are using re.match and not re.search). ^ matches the start of the string, and $ matches the end, so by putting those at each end you are not matching anything that has anything before or after \d{0,3}.
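The difference between these matching modes can be seen in a small sketch. re.fullmatch (available since Python 3.4) is used here as an alternative to writing ^ and $ by hand; it is not mentioned in the answers above.

```python
import re

pattern = r"\d{0,3}"  # raw string, so the backslash reaches the regex engine

# search() succeeds on anything: it finds an empty match at position 0.
print(repr(re.search(pattern, "abc1234").group()))  # ''

# match() anchors only at the start, so it accepts the prefix '123' of '1234'.
print(re.match(pattern, "1234").group())            # 123

# Anchoring both ends rejects the four-digit string.
print(re.search(r"^\d{0,3}$", "1234"))              # None
print(re.fullmatch(pattern, "1234"))                # None
print(re.fullmatch(pattern, "123").group())         # 123
```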

Python Regex to look for string

I have a text file with text that looks like below
Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}
I need to look for Num_row_labels >=10 in my text file. How do I do that using Python 3.2 regex?
Thanks.
Assuming that the data is formatted as above, and there are no leading 0s in the number:
Num_row_labels=\d{2,}
A more liberal regex that allows arbitrary spaces, still assuming no leading 0s:
Num_row_labels\s*=\s*\d{2,}
An even more liberal regex that allows arbitrary spaces and leading 0s:
Num_row_labels\s*=\s*0*[1-9]\d+
If you need to capture the number, just surround \d{2,} (in the 1st and 2nd regexes) or [1-9]\d+ (in the 3rd regex) with parentheses () and refer to it as the 1st capture group.
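A quick sketch of the third regex with the capture group added, run against a few made-up sample lines (the lines list and the AT_LEAST_TEN name are assumptions for illustration).

```python
import re

# Leading zeros are skipped by 0*, then a nonzero digit plus at least one
# more digit guarantees the value is 10 or greater.
AT_LEAST_TEN = re.compile(r"Num_row_labels\s*=\s*0*([1-9]\d+)")

lines = [
    "Num_row_labels=10",
    "Num_row_labels = 007",
    "Num_row_labels=9",
    "Num_row_labels = 0125",
]

captured = []
for line in lines:
    m = AT_LEAST_TEN.search(line)
    captured.append(m.group(1) if m else None)

print(captured)  # ['10', None, None, '125']
```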
Use:
match = re.search(r"Num_row_labels=(\d+)", line)
The (\d+) matches at least one decimal digit (0-9) and captures all digits matched as a group (groups are stored in the object returned by re.search and re.match, which I'm assigning to match here). re.search returns None when the line does not match, so check for that first. To access the group and compare it against 10, use:
if match and int(match.group(1)) >= 10:
    print("Num_row_labels is at least 10")
This will allow you to easily change the value of your threshold, unlike the answers that do everything in the regex. Additionally, I believe this is more readable in that it is very obvious that you are comparing a value against 10, rather than matching a nonzero digit in the regex followed by at least one other digit. What the code above does is ask for the 1st group that was matched (match.group(1) returns the string that was matched by \d+), and then, with the call to int(), converts the string to an integer. The integer returned by int() is then compared against 10.
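As a sketch of this approach with the threshold made configurable: row_labels_at_least and its threshold parameter are hypothetical names, and the \s* around the equals sign is an added tolerance for spaces that the original pattern does not have.

```python
import re

ROW_LABELS = re.compile(r"Num_row_labels\s*=\s*(\d+)")

def row_labels_at_least(line, threshold=10):
    """True if the line sets Num_row_labels to a value >= threshold."""
    match = ROW_LABELS.search(line)
    # search() returns None when the key is absent, so guard before group().
    return match is not None and int(match.group(1)) >= threshold

text = '''Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}'''

for line in text.splitlines():
    if row_labels_at_least(line):
        print(line)
```

Leading zeros need no special handling here, since int("007") is simply 7.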
The regex is Num_row_labels=[1-9][0-9]{1}.*
Now you can use Python's re module to analyze your text and extract those
The regex looks like:
Num_row_labels=[0-9]*[1-9][0-9]+
Example of usage:
if re.search(r'Num_row_labels=[0-9]*[1-9][0-9]+', line):
    print(line)
The regular expression [0-9]*[1-9][0-9]+ means that the string must contain at least
one digit from 1 to 9 ([1-9]; a character class [] in a regular expression matches any one character from the range given in the brackets);
and at least one digit from 0 to 9, possibly more ([0-9]+; the + sign in a regular expression means that the character or expression before it can be repeated 1 or more times).
Any other digits can come before these ([0-9]* means any digit, 0 or more times). Once you already have these two digits, it doesn't matter which digits come before them: the number is greater than or equal to 10 anyway.
