I want to write code that can parse American phone numbers (ie. "(664)298-4397") . Below are the constraints:
allow leading and trailing white spaces
allow white spaces that appear between area code and local numbers
no white spaces in area code or the seven digit number XXX-XXXX
Ultimately I want to print a tuple of strings (area_code, first_three_digits_local, last_four_digits_local)
I have two sets of questions.
Question 1:
Below are inputs my code should accept and print the tuple for:
'(664) 298-4397', '(664)298-4397', ' (664) 298-4397'
Below is the code I tried:
regex_parse1 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664) 298-4397')
print (f' groups are: {regex_parse1.groups()} \n')
regex_parse2 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664)298-4397')
print (f' groups are: {regex_parse2.groups()} \n')
regex_parse3 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' (664) 298-4397')
print (f' groups are: {regex_parse3.groups()}')
The string input for all three are valid and should return the tuple:
('664', '298', '4397')
But instead I'm getting the output below for all three:
groups are: ('', '', '4397')
What am I doing wrong?
Question 2:
The following two chunks of code should output an 'NoneType' object has no attribute 'group' error because the input phone number string violates the constraints. But instead, I get outputs for all three.
regex_parse4 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(404)555 -1212')
print (f' groups are: {regex_parse4.groups()}')
regex_parse5 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' ( 404)121-2121')
print (f' groups are: {regex_parse5.groups()}')
Expected output: should be an error but I get this instead for all three:
groups are: ('', '', '2121')
What is wrong with my regex code?
In general, your regex overuse the asterisk *. Details as follows:
You have 3 capturing groups:
([\s]*[(]*[0-9]*[)]*[\s]*)
([\s]*[0-9]*)
([0-9]*[\s]*)
You use asterisk on every single item, including the open and close parenthesis. Actually, almost everything in your regex is quoted with asterisk. Thus, the capturing groups match also null strings. That's why your first and second capturing groups return the null strings. The only item you don't use asterisk is the hyphen sign - just before the third capturing group. This is also the reason why your regex can capture the third capturing group as in the 4397 and 2121
To solve your problem, you have to use asterisk only when needed.
In fact, your regex still has plenty of rooms for improvement. For example, it now matches numeric digits of any length (instead of 3 or 4 digits chunks). It also allows the area code not enclosed in parenthesis (because of your use of asterisk around parenthesis symbols.
For this kind of common regex, I suggest you don't need to reinvent the wheel. You can refer to some already made regex easily found from the Internet. For example, you can refer to this post Although the post is using javascript instead of Python, the regex is just similar.
Try:
regex_parse4 = re.match(r'([(]*[0-9]{3}[)])\s*([0-9]{3}).([0-9]{4})', number)
Assumes 3 digit area code in parentheses, proceeded by XXX-XXXX.
Python returns 'NoneType' when there are no matches.
If above does not work, here is a helpful regex tool:
https://regex101.com
Edit:
Another suggestion is to clean data prior to applying a new regex. This helps with instances of abnormal spacing, gets rid of parentheses, and '-'.
clean_number = re.sub("[^0-9]", "", original_number)
regex_parse = re.match(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)
print(f'groups are: {regex_parse}.groups()}')
>>> ('xxx', 'xxx', 'xxxx')
I'm hoping to get some regex assistance. I've got lines of columnar text that I'd like to split with regexes. Each column can be phrases of arbitrary characters, separated by a whitespace or maybe even two. Columns are separated by a larger number of whitespaces, perhaps at least 4.
Ultimately, I need to match a date if its in the second column.
Here's an example. I need the date in this column to be the group important_date
Rain in Spain 11/01/2000 90 Days
important_date should not match the date in this next line:
Another line of text 10/15/1990
# EXAMPLE:
import re
regex = r"(.*)\s(?P<important_date>\d{1,2}\/\d{1,2}\/\d{4}).*"
match_this = " Rain in Spain 11/01/2000 90 Days"
not_this = " Another line of text 10/15/1990"
print(f"Finding this date is good:{re.search(regex, match_this).group('important_date')}" )
print(f"But this one should throw an error:{re.search(regex,not_this).group('important_date')}")
I'm also comparing these regexes against lots of other lines of text with various structures, so this is why I don't want to just split on a string of " ". To know I've got the important_date, I need to know that the whole line looks like: one-column, second column is date, maybe another column after the date too.
Doing this with a single regex would also just fit much more easily into the rest of the application. I'm worried that line.split(" ") and checking the resulting list would interfere with other checks going on in this app.
I have not been able to figure out how to write the first part of the regex that captures words with no-more-than-2 spaces between them. Can I use lookaheads for this somehow?
Thank you!
Try this: (?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$ (https://regex101.com/r/PnIU3e/3).
I assume that the first column consists of words separated by single spaces, and is separated from the second column by more than one space.
You can split on fields of 2 or more spaces and only use the data if it is the second field:
for x in (match_this, not_this):
te=re.split(r'[ \t]{2,}',x)
if re.match(r'\d{1,2}\/\d{1,2}\/\d{4}', te[2]):
# you have an important date
print(te[2])
else:
# you don't
print('no match')
I have multiple file names that are either a movie title or an episode in a TV show. For the movie titles I want to match the year the movie came out, and for the episode I want to match the season and episode number in the format S00E00. However, I can't known that the string contains either or, sometimes it can contain both the season and episode and the year. I also don't known what comes first in the string, the year or the season and episode.
I tried with the following pattern: (\d{4})|S(\d\d)E(\d\d), however that only returns a match for the one that came first. For the string 2012.S01E02, it returns 2012, and for the string S01E02.2012 it returns S01E02. The rest of the capture groups is None (I'm using Python 3.5).
I have a solution which uses two separate matches, if-statements and generally looks ugly. Is there's a way to have one regex pattern that returns a list (or tuple) witch contains (year, season, episode), regardless of what comes first in the string?
You could use the following regular expression:
.*?(\d{4}).*?(S\d{2}E\d{2}).*?|.*?(S\d{2}E\d{2}).*?(\d{4}).*?|.*?(S\d{2}E\d{2}).*?|.*?(\d{4}).*?
.*?(\d{4}).*?(S(\d\d)E(\d\d)).*?: This will first match the combination of the year and episode number in this order.
.*?(S(\d\d)E(\d\d)).*?(\d{4}).*?: This will match the reverse order
.*?(S(\d\d)E(\d\d)).*?: This will match the episode number
.*?(\d{4}).*?: This will match the year.
If you execute the regular expression in this order, you will always get both the year and the episode number.
var regex = /.*?(\d{4}).*?(S\d{2}E\d{2}).*?|.*?(S\d{2}E\d{2}).*?(\d{4}).*?|.*?(S\d{2}E\d{2}).*?|.*?(\d{4}).*?/;
var matches = "test|S02E12|2012_test".match(regex);
matches = matches.filter(function(item) {
return item !== undefined;
}).splice(1).sort();
console.log(matches);
I have a text file with 32 articles. Each article starts with the expression: <Number> of 32 DOCUMENTS, for example: 1 of 32 DOCUMENTS, 2 of 32 DOCUMENTS, etc. In order to find each article I have used the following code:
import re
sections = []
current = []
with open("Aberdeen2005.txt") as f:
for line in f:
if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
sections.append("".join(current))
current = [line]
else:
current.append(line)
print(len(sections))
So now, articles are represented by the expression sections
The next thing I want to do, is to subgroup the articles in 2 groups. Those articles containing the words: economy OR economic AND uncertainty OR uncertain AND tax OR policy, identify them with the number 1.
Whereas those articles containing the following words: economy OR economic AND uncertain OR uncertainty AND regulation OR spending, identify them with the number 2. This is what I have tried so far:
for i in range(len(sections)):
group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", , sections[i])
group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", , sections[i])
Nevertheless, it does not seem to work. Any ideas why?
It's a bit wordy, but you can get away without using regular expressions here, for example:
# Take a lowercase copy for comparisons
s = sections[i].lower()
if (('economic' in s or 'economy' in s) and
('uncertainty' in s or 'uncertain' in s) and
('tax' in s or 'policy' in s)):
do_stuff()
It is possible to write this as a single regular expression, but it is a bit tricky. For each and you'd use a zero-width lookahead assertion (?= ), and for each or you'd use a branch. Also, we'd have to use the \b for a word boundary. We'd use re.match instead of re.search.
belongs_to_group1 = bool(re.match(
r'(?=.*\b(?:economic|economy)\b)'
r'(?=.*\b(?:uncertain|uncertainty)\b)'
r'(?=.*\b(?:tax|policy)\b)', text, re.I))
Thus not very readable.
A more fruitful approach would be to find all words and put them into a set
words = set(re.findall(r'\w+', text.lower()))
belongs_to_group1 = (('uncertainty' in words or 'uncertain' in words)
and ('economic' in words or 'economy' in words)
and ('tax' in words or 'policy' in words))
You can use re.search to find those words. Then you can use if statements and python's and and or statements for the logic, and then store group one and two as two lists with the section index number as a value.
One thing you might want to note is that your logic may need brackets.
By
economy OR economic AND uncertainty OR uncertain AND tax OR policy
I assume you mean
(economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy)
which is different to (for example)
economy OR (economic AND uncertainty) OR (uncertain AND tax) OR policy
EDIT1:
Python will evaluate your statement without brackets from left to right, i.e.:
( ( ( ( (economy OR economic) AND uncertainty) OR uncertain) AND tax) OR policy)
Which I imagine is not what you want (e.g. the above evaluates true if it includes the word policy but none of the others)
EDIT2:
As pointed out in comments, EDIT1 is incorrect, although you would still need brackets to achieve case 1, if you don't have them you will get case 2 instead (and case 3 is a load of rubbish)
To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because thee will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action:http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print matches
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.
you don't have to use regex
split the string to array
check item 0 == 0 , item 1==2
check last item == X
check item[2:-2] each one of them is a number (is_digit)
that's all