Python Regex to look for string - python

I have a text file with text that looks like below
Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}
I need to look for Num_row_labels >=10 in my text file. How do I do that using Python 3.2 regex?
Thanks.

Assuming that the data is formatted as above and the number has no leading zeros:
Num_row_labels=\d{2,}
A more liberal regex that allows arbitrary whitespace, still assuming no leading zeros:
Num_row_labels\s*=\s*\d{2,}
An even more liberal regex that allows arbitrary whitespace and also allows leading zeros:
Num_row_labels\s*=\s*0*[1-9]\d+
If you need to capture the number, just surround \d{2,} (in the 1st and 2nd regex) or [1-9]\d+ (in the 3rd regex) with parentheses () and refer to it as the 1st capture group.
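As a minimal sketch of the capture-group suggestion, the third regex could be wrapped like this. The helper name extract_row_labels is made up for illustration and is not part of the re module.

```python
import re

# Third regex from above, with [1-9]\d+ wrapped in parentheses (capture group 1).
ROW_LABELS = re.compile(r"Num_row_labels\s*=\s*0*([1-9]\d+)")

def extract_row_labels(text):
    """Return the captured number as an int, or None if no value >= 10 is found.

    extract_row_labels is a hypothetical helper name used only for this example.
    """
    m = ROW_LABELS.search(text)
    return int(m.group(1)) if m else None

print(extract_row_labels("Tabular={ Num_row_labels=10"))  # 10
print(extract_row_labels("Num_row_labels = 0012"))        # 12
print(extract_row_labels("Num_row_labels=9"))             # None
```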

Use:
match = re.search(r"Num_row_labels=(\d+)", line)
The (\d+) matches at least one decimal digit (0-9) and captures all digits matched as a group (groups are stored in the object returned by re.search and re.match, which I'm assigning to match here). To access the group and compare it against 10, use:
if match and int(match.group(1)) >= 10:
    print("Num_row_labels is at least 10")
This will allow you to easily change the value of your threshold, unlike the answers that do everything in the regex. Additionally, I believe this is more readable in that it is very obvious that you are comparing a value against 10, rather than matching a nonzero digit in the regex followed by at least one other digit. What the code above does is ask for the 1st group that was matched (match.group(1) returns the string that was matched by \d+), and then, with the call to int(), converts the string to an integer. The integer returned by int() is then compared against 10.
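Put together, the approach might look like the sketch below. The function name has_enough_row_labels and its threshold parameter are assumptions for illustration; the sample text is the one from the question.

```python
import re

# Capture the digits and leave the comparison to Python.
LABEL_RE = re.compile(r"Num_row_labels\s*=\s*(\d+)")

def has_enough_row_labels(text, threshold=10):
    """Return True if any Num_row_labels value in text is >= threshold.

    Hypothetical helper: the threshold is a parameter, not baked into the regex.
    """
    return any(int(m.group(1)) >= threshold for m in LABEL_RE.finditer(text))

sample = 'Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10\n}\n}'
print(has_enough_row_labels(sample))                  # True
print(has_enough_row_labels("Num_row_labels=9"))      # False
print(has_enough_row_labels("Num_row_labels=9", 5))   # True
```

Changing the threshold is then a matter of passing a different number, with no change to the pattern.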

The regex is Num_row_labels=[1-9][0-9]{1}.*
Now you can use Python's re module to analyze your text and extract the matches.

The regex looks like:
Num_row_labels=[0-9]*[1-9][0-9]+
Example of usage:
if re.search(r'Num_row_labels=[0-9]*[1-9][0-9]+', line):
    print(line)
The regular expression [0-9]*[1-9][0-9]+ means that the string must contain at least
one digit from 1 to 9 ([1-9]; a character class [] matches any single character from the range given in the brackets);
and, after it, at least one digit from 0 to 9, possibly more ([0-9]+; the + sign means that the preceding character or expression is repeated one or more times).
Any other digits may come before these ([0-9]*, meaning any digit, zero or more times). Once you have a nonzero digit followed by at least one more digit, the number is greater than or equal to 10 no matter which digits precede it.
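To see which values the pattern accepts, here is a small check. The helper name accepts and the sample values are chosen only for illustration.

```python
import re

pattern = re.compile(r"Num_row_labels=[0-9]*[1-9][0-9]+")

def accepts(value):
    """Return True if 'Num_row_labels=<value>' is matched by the pattern."""
    return pattern.search("Num_row_labels=" + value) is not None

# One-digit values and zero-padded one-digit values are rejected;
# anything with a nonzero digit followed by another digit is accepted.
for value in ["9", "09", "10", "010", "250"]:
    print(value, accepts(value))
```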

Related

Using re.findall for decimal numbers

I used re.findall to extract a decimal number from a string like this:
size = "Koko33,5 m²"
numbers = re.findall(r"\d+\,*\d+", size)
print(numbers)  # ['33,5']
Then I tried to get just the number 33,5 out of that ['33,5'].
As a guess, I did this:
numbers = re.findall(r"\d+\,*\d+", size)[0]
And it worked. But I don't understand why it worked.
I'm new to programming, so any help is welcome :)
It works because the pattern finds a run of digits, then an optional comma, then another run of digits.
\d matches a single digit, and + repeats the preceding expression (\d) one or more times, so \d+ matches a run of consecutive digits. \, matches a literal comma, and the * after it lets the comma occur zero or more times. Then there is another \d+.
Finally, the indexing part ([0]) takes the first element of the list that re.findall returns (in this case there is only one match).
More explanation
You guessed well.
\d+ Find 1 or more digits (0-9)
,* Find 0, 1 or more commas
\d+ Find 1 or more digits (0-9)
The pattern should find 33,5 or 999,123: any "number comma number" pattern. Because the comma is optional, it also finds plain numbers of two or more digits, such as 42.
The best source on regex that I have found is "Mastering Regular Expressions" by Jeffrey E. F. Friedl.
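Here is a short sketch of what happens step by step. The last pattern is an assumed variant, not something from the question: it makes the comma-plus-digits part optional as a unit, so that a single digit matches too.

```python
import re

size = "Koko33,5 m²"

# re.findall returns a list of every non-overlapping match.
numbers = re.findall(r"\d+,*\d+", size)
print(numbers)  # ['33,5']

# Indexing with [0] picks the first (here, the only) match as a plain string.
first = numbers[0]
print(first)    # 33,5

# Assumed variant: an optional ",digits" group lets single-digit numbers match as well.
variant = re.findall(r"\d+(?:,\d+)?", "33,5 m and 7 rooms")
print(variant)  # ['33,5', '7']
```

If the list could be empty, indexing with [0] raises IndexError, so checking the list first is safer.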

Capturing entire repeated string based on a repeated pattern

The following regex matches both 59-59-59 and 59-59-59-59, and captures only 59-
The intent is to match exactly four numbers separated by -, with the maximum number being 59. Numbers less than 10 are represented as 00-09.
print(re.match(r'(\b[0-5][0-9]-{1,4}\b)','59-59-59').groups())
--> output ('59-',)
I need a pattern that matches exactly 59-59-59-59
and does not match 59--59-59 or 59-59-59-59-59
Try using the following pattern, if using re.match:
[0-5][0-9](?:-[0-5][0-9]){3}$
This matches an initial two-digit number whose first digit is 0 through 5 and whose second digit is any digit. It is then followed, exactly three times, by a dash and another number with the same rules. Note that re.match anchors at the beginning by default, so we only need an ending anchor $.
Code:
print(re.match(r'([0-5][0-9](?:-[0-5][0-9]){3})$', '59-59-59-59').groups())
('59-59-59-59',)
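As a sketch of a validation helper, the same pattern can be used with re.fullmatch (available since Python 3.4), which requires the whole string to match and so needs no explicit anchors. This swaps re.match plus $ for fullmatch; the helper name is_four_numbers is made up.

```python
import re

FOUR_NUMBERS = re.compile(r"[0-5][0-9](?:-[0-5][0-9]){3}")

def is_four_numbers(text):
    """True only if the whole string is four 00-59 numbers joined by single hyphens."""
    return FOUR_NUMBERS.fullmatch(text) is not None

for candidate in ["59-59-59-59", "00-15-30-45", "59--59-59",
                  "59-59-59-59-59", "60-00-00-00"]:
    print(candidate, is_four_numbers(candidate))
```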
If you intend to actually match the same number four times in a row, then see the answer by @Thefourthbird.
If you want to find such a string in a larger text, then consider using re.search. In that case, use this pattern:
(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)
Note that instead of using word boundaries \b I used lookarounds to enforce the end of the "word" here. This means that the above pattern will not match something like 59-59-59-59-59.
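For instance, a quick check of the lookaround version on a longer string (the sample text is invented for illustration):

```python
import re

pattern = re.compile(r"(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)")

# The middle item has five groups, so neither its first four nor its last four
# groups are surrounded by whitespace (or string boundaries) on both sides.
text = "a 12-34-56-07 b 59-59-59-59-59 c 00-00-00-00"
found = pattern.findall(text)
print(found)  # ['12-34-56-07', '00-00-00-00']
```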
In your pattern, this part -{1,4} matches 1-4 times a hyphen so 59-- will match.
If all the matches should be the same as 59, you could use a backreference to the first capturing group and repeat that 3 times with a prepended hyphen.
\b([0-5][0-9])(?:-\1){3}\b
Your code might look like:
import re
res = re.match(r'\b([0-5][0-9])(?:-\1){3}\b', '59-59-59-59')
if res:
    print(res.group())
If there should not be partial matches, you could use anchors to assert the start ^ and the end $ of the string:
^([0-5][0-9])(?:-\1){3}$
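A small sketch of the anchored backreference pattern in use; the helper name is_same_number_four_times is hypothetical.

```python
import re

# \1 must repeat exactly what group 1 captured, so all four numbers are identical.
SAME_NUMBER = re.compile(r"^([0-5][0-9])(?:-\1){3}$")

def is_same_number_four_times(text):
    """True if text is one 00-59 number repeated four times, joined by hyphens."""
    return SAME_NUMBER.match(text) is not None

print(is_same_number_four_times("59-59-59-59"))     # True
print(is_same_number_four_times("59-58-59-59"))     # False
print(is_same_number_four_times("59-59-59-59-59"))  # False
```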

Regular expression to convert given number in the required format

This is my first time using regular expressions, so I need help with a slightly complex one. I have an input list of around 100-150 strings (numbers).
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
Expected output = ['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
## I have tried this -
import re
new_list = []
for i in range(0, len(input)):
    new_list.append(re.sub('\d+-\d+-\d+','0000\\1', input[i]))
    ## the problem is with the second argument '0000\\1'. I know it's wrong but I'm unable to solve it
print(new_list) ## new_list should be the expected output.
As you can see, I need to convert strings of numbers that come in different formats into 15-digit numbers by adding leading zeros.
But there is a catch: some numbers, e.g. '000480087800784', are already 15 digits long and should be left unchanged (that's why I cannot use Python's string formatting (.format) option). A regex has to be used here, which will modify only the required numbers. I have already tried the following answers but have not been able to solve it.
Using regex to add leading zeroes
using regular expression substitution command to insert leading zeros in front of numbers less than 10 in a string of filenames
Regular expression to match defined length number with leading zeros
Your regex does not work as you used \1 in the replacement, but the regex pattern has no corresponding capturing group. \1 refers to the first capturing group in the pattern.
If you want to try your hand at regex, you may use
re.sub(r'^(\d+)-(\d+)-(\d+)$', lambda x: "{}-{}-{}".format(x.group(1).zfill(5), x.group(2).zfill(5), x.group(3).zfill(5)), input[i])
Here, ^(\d+)-(\d+)-(\d+)$ matches a string that starts with 1+ digits, then has -, then 1+ digits, - and again 1+ digits followed by the end of string. There are three capturing groups whose values can be referred to with \1, \2 and \3 backreferences from the replacement pattern. However, since we need to apply .zfill(5) on each captured text, a lambda expression is used as the replacement argument, and the captures are accessed via the match data object group() method.
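As a self-contained sketch of the same idea, here a named function replaces the lambda. The names pad_groups, numbers and padded are chosen for illustration (numbers avoids shadowing the built-in input).

```python
import re

numbers = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197',
           '1-345-1715', '9-23-4532', '000200007100272']

def pad_groups(match):
    """Replacement callback: zero-pad each captured group to 5 digits."""
    return "-".join(part.zfill(5) for part in match.groups())

# Strings without dashes do not match the pattern, so re.sub returns them unchanged.
padded = [re.sub(r'^(\d+)-(\d+)-(\d+)$', pad_groups, s) for s in numbers]
print(padded)
```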
However, if your strings are already in correct format, you may just split the strings and format as necessary:
new_list = []
for i in range(0, len(input)):
    splits = input[i].split('-')
    if len(splits) == 1:
        new_list.append(input[i])
    else:
        new_list.append("{}-{}-{}".format(splits[0].zfill(5), splits[1].zfill(5), splits[2].zfill(5)))
Both solutions yield
['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
How about analysing the string for numbers and dashes, then adding leading zeros?
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
output = []
for inp in input:
    # calculate length of string
    inpLen = len(inp)
    # calculate num of dashes
    inpDashes = inp.count('-')
    # add specific number of leading zeros
    zeros = "0" * (15-(inpLen-inpDashes))
    output.append(zeros + inp)
print(output)
>>> ['00000090-10-07457', '000480087800784', '00000001-713-0926', '00000012-710-8197', '00000001-345-1715', '000000009-23-4532', '000200007100272']
Note that this pads only the first group, so the dashed entries differ from the expected output in the question.

How can I check if a string has 9 or more digits?

I'm trying to detect if a string has 9 or more digits. What's the best way to approach this?
I want to be able to detect a phone number inside a string like this:
Call me # (123)123-1234
What's the best way to pull those numbers regardless of their positioning in the string?
Since it sounds like you just want to check whether there are 9 or more digits in the string, you can use the pattern
^(\D*\d){9}
It starts at the beginning of the string, and repeats a group composed of zero or more non-digit characters, followed by a digit character. Repeat that group 9 times, and you know that the string has at least 9 digits in it.
import re
pattern = re.compile(r'^(?:\D*\d){9}')
print(pattern.match('Call me # (123)123-1234'))
print(pattern.match('Call me # (123)123-12'))
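A sketch that wraps this check in a function and also pulls the digits out, regardless of where they sit in the string. Both helper names (has_at_least_n_digits, digits_only) are assumptions for illustration.

```python
import re

def has_at_least_n_digits(text, n=9):
    """Return True if text contains at least n digit characters anywhere."""
    # Same idea as the pattern above, with the repeat count filled in from n.
    return re.match(r"^(?:\D*\d){%d}" % n, text) is not None

def digits_only(text):
    """Strip every non-digit character, leaving just the digits."""
    return re.sub(r"\D", "", text)

print(has_at_least_n_digits('Call me # (123)123-1234'))  # True
print(has_at_least_n_digits('Call me # (123)123-12'))    # False
print(digits_only('Call me # (123)123-1234'))            # 1231231234
```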
# Import the regular expressions library
import re
# Set our string variable equal to yours above
string = 'Call me # (123)123-1234'
# Create a list, using regular expressions, of all digits in the string
a = re.findall("[0-9]", string)
# Examine the list to see if it holds 9 digits or more, and print it if so
if len(a) >= 9:
    print(a)
Or without regex (slower for big strings):
print(sum(letter.isdigit() for letter in my_string)>=9)
Or part regex:
print(len(re.findall("[0-9]",my_string))>=9)
Just use plain Python to check whether there are nine (9) or more digits.

Regex sub phone number format multiple times on same string

I'm trying to use regular expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)\d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall and re.sub but had no luck. I'm confused about how to deal with multiple matches on one line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]
Here's a simpler solution that works for all of those cases, though it is a little naïve (and doesn't care about matching brackets).
\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\1)\2-\3
Explanation:
It works by first checking for 3 digits, optionally surrounded by brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character, and then another 3 digits, also stored in a capturing group: [ .-]?(\d{3}). The hyphen is written last in the character class so that it is a literal hyphen; written as [ -.] it would denote the whole range of characters from space to full stop.
And lastly, it does the previous step again, but with 4 digits instead of 3: [ .-]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
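A runnable sketch over the sample list from the question. The separator class is written here with the hyphen last, [ .-], so that it is a literal hyphen rather than a range; the variable names are chosen for illustration.

```python
import re

phone = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")

samples = ["(123)456-7890 (321)-654-0987",
           "(111) 111-1111",
           "222-222-2222",
           "(333)333.3333",
           "(444).444.4444",
           "blah blah blah (555) 555.5555",
           "666.666.6666 random text"]

# sub() rewrites every non-overlapping match, so several numbers on one line are all handled.
formatted = [phone.sub(r"(\1)\2-\3", s) for s in samples]
for line in formatted:
    print(line)
```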
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\2)\3-\4
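A brief sketch of the conditional pattern in Python, which supports the (?(1)yes|no) syntax. As above, the separator class is written [ .-] so the hyphen is literal, and the helper name normalize_strict is made up.

```python
import re

# Group 1 is the optional "(". If it matched, the area code must be followed by ")";
# if it did not, the area code must NOT be followed by ")".
strict = re.compile(
    r"(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ .-]?(\d{3})[ .-]?(\d{4})"
)

def normalize_strict(text):
    """Rewrite phone numbers as (123)456-7890, skipping a stray closing bracket."""
    return strict.sub(r"(\2)\3-\4", text)

print(normalize_strict("(123) 456.7890"))  # (123)456-7890
print(normalize_strict("222-222-2222"))    # (222)222-2222
print(normalize_strict("123)456-7890"))    # 123)456-7890 (left unchanged)
```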
