I'm having trouble with the needed regular expression... I'm sure I need to probably be using some combination of 'lookaround' or conditional expressions, but I'm at a loss.
I have a data string like:
pattern1 pattern2 pattern3 unwanted-groups pattern4 random number of tokens pattern5 optional1 optional2 more unknown unwanted junk separated with white spaces optional3 optional4 etc
where I have a matching expression for each of the 'pattern#' and 'optional#' groups (optional groups being groups that are not required in the data and therefore not always present), but I don't have any pattern (text is free-form) or group count to skip for the other sections other than all 'tokens' are separated by white space.
I've managed to figure out how to skip the unwanted stuff between the required groups, but when I hit the optional groups, I'm lost. Any suggestions on where I should be looking for hints/help?
Thanks
this is what I currently have:
pattern = re.compile(r'(?:(METAR|SPECI)\s*)*(?P<ICAO>[\w]{4}\s)*'
r'(?P<NIL>(NIL)\s)*(?P<UTC>[\d]{6}Z\s)*(?P<AUTOCOR>(AUTO|COR)*\s)*'
r'(?P<WINDS>[\w]{5,6}G*[\d]{0,2}(MPS|KT|KMH)\s)\s*'
r'.*?\s' #skip miscellaneous between winds and thermal data
r'(?P<THERM>[\d]{2}/[\d]{2}\s)\s*(?P<PRESS>A[\d]{4}\s)\s*'
r'(?:RMK\s)\s*(?P<AUTO>AO\d\s)*'
r'(?P<PEAK>(PK\sWND\s[\d]{5,6}/[\d]{2,4}))*'
r'(?P<SLP>SLP[\d]{3}\s)*'
r'(?P<PRECIP>P[\d]{4}\s)*'
r'(?P<remains>.*)'
)
example = "METAR KCSM 162353Z AUTO 07011KT 10SM TS SCT100 28/19 A3000 RMK AO2 PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR SLP135 T02780189 10389 20272 53007="
data = pattern.match(example)
It seems to work for the first 10 groups, but that is about it....
again thanks everybody
If all the data is in that format I'd go with split instead. I think it will be faster.
text = "regex1 regex2 regex3 unwanted-regex regex4 random number of tokens regex5 optregex1 optregex2 more unknown unwanted junk separated with white spaces optregex3 optregex4 etc"
parts = text.split()  # now each whitespace-separated token is an element of the list
for index, item in enumerate(parts):
    if index == 3:
        continue  # this is unwanted-regex
    else:
        pass  # do what you want with the information here
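The split idea can be taken one step further for data where the position of each token is not known: test every whitespace-separated token against a small table of patterns. The following is only a minimal sketch under stated assumptions. The table `TOKEN_PATTERNS` and the helper `parse_tokens` are hypothetical names, the per-token patterns are simplified guesses based on the example report in the question, and groups that span several tokens (such as PK WND ...) are not handled.

```python
import re

# Hypothetical per-token patterns (assumptions, simplified from the question).
TOKEN_PATTERNS = {
    "UTC": re.compile(r"\d{6}Z"),
    "WINDS": re.compile(r"(?:\d{3}|VRB)\d{2,3}(?:G\d{2,3})?(?:MPS|KT|KMH)"),
    "THERM": re.compile(r"M?\d{2}/M?\d{2}"),
    "PRESS": re.compile(r"A\d{4}"),
    "AUTO": re.compile(r"AO\d"),
    "SLP": re.compile(r"SLP\d{3}"),
}


def parse_tokens(report):
    """Return {field name: token} for the first token matching each pattern."""
    found = {}
    for token in report.split():
        for name, pattern in TOKEN_PATTERNS.items():
            # fullmatch: the whole token must fit the pattern, not just a prefix
            if name not in found and pattern.fullmatch(token):
                found[name] = token
                break
    return found


example = ("METAR KCSM 162353Z AUTO 07011KT 10SM TS SCT100 28/19 A3000 RMK AO2 "
           "PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR SLP135 T02780189 "
           "10389 20272 53007=")
fields = parse_tokens(example)
print(fields)
```

Optional fields that are absent simply never appear in the result, so no lookaround or conditional expression is needed for them.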
You need to use the | operator and findall:
>>> regex = re.compile(r"(regex\d+|optregex\d+)")
>>> regex.findall(string)
[u'regex1', u'regex2', u'regex3', u'regex4', u'regex5', u'optregex1', u'optregex2', u'optregex3', u'optregex4']
A piece of advice: there are several GUI tools that let you experiment with regular expressions (and actually help you write them). For Python, I'm quite fond of Kodos.
If all of your targets consist of things like "foo1", "bar22" etc (in other words a sequence of letters followed by a sequence of digits) and everything else (sequences of digits, "words" without numeric suffixes, etc) is "junk" then the following seems to be sufficient:
re.findall(r'[A-Za-z]+\d+', targetstr)
(We can't use just r'\w+\d+' because \w matches digits and _ (underscores) as well as letters).
If you're looking for a limited number of key patterns, or some of the junk might also look like "foo123", then you'll obviously have to be more specific.
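As a quick illustration of the difference between the two character classes, here is a small sketch. The variable names and the second test string (`mixed`) are made up for the example; the first string is the sample data from the question.

```python
import re

targetstr = ("pattern1 pattern2 pattern3 unwanted-groups pattern4 random number "
             "of tokens pattern5 optional1 optional2 more unknown unwanted junk "
             "separated with white spaces optional3 optional4 etc")

# Letters followed by digits: only the numbered groups survive.
wanted = re.findall(r'[A-Za-z]+\d+', targetstr)
print(wanted)

# \w also matches digits and underscores, so it lets junk through.
mixed = "foo_1 12345 bar22"
loose = re.findall(r'\w+\d+', mixed)
strict = re.findall(r'[A-Za-z]+\d+', mixed)
print(loose, strict)
```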
Related
I have some strings which contain a pairs of frequencies or pairs of frequency ranges
My regex function gets the following list from that example string:
example_string = ':2.400-2.483ghz;5.725-5.850ghz\ntransmissionpower(eirp),2'
re.findall(r"(\d+\.\d+.hz)", example_string)
# example output: ['2.483ghz', '5.850ghz']
How can I extract the range of frequencies rather than just the single float after the - character?
Output should be ['2.400-2.483ghz', '5.725-5.850ghz']
Something like this should (mostly) work to find all the occurrences of those ranges in the string (it should handle any number of ranges in the line):
>>> example_string = ':2.400-2.483ghz;5.725-5.850ghz\ntransmissionpower(eirp),2'
>>> re.findall('([0-9.]+-[0-9.]+.?hz)', example_string)
['2.400-2.483ghz', '5.725-5.850ghz']
To break it down:
[0-9.]+ - will find 1 or more numbers and .s together (e.g. 2.400)
.?hz finds 0 or 1 characters followed by 'hz' so it should handle most units (e.g. hz, ghz, etc.)
The whole thing essentially looks for <number><dash><number><units> zero or more times per line.
It's worth pointing out that, like most regexes, this is still pretty brittle: if the string is malformed, if it's GHz instead of ghz, if the numbers are in scientific notation, etc., it will break, but hopefully you can adjust as needed.
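One possible way to reduce that brittleness is sketched below. It assumes the units are always an optional k/m/g prefix followed by hz in any letter case; the name `FREQ_RANGE` and the mixed-case sample string are invented for the illustration, and scientific notation is still not handled.

```python
import re

# Case-insensitive, with explicit number and unit shapes (an assumption
# about the data, not something stated in the question).
FREQ_RANGE = re.compile(r'\d+(?:\.\d+)?-\d+(?:\.\d+)?\s?[kmg]?hz', re.IGNORECASE)

s = ':2.400-2.483GHz;5.725-5.850ghz\ntransmissionpower(eirp),2'
ranges = FREQ_RANGE.findall(s)
print(ranges)
```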
You may use this regex:
(?:\d+\.\d+-)?\d+\.\d+.hz
Code:
>>> import re
>>> s = ':2.400-2.483ghz;5.725-5.850ghz\ntransmissionpower(eirp),2'
>>> re.findall(r'(?:\d+\.\d+-)?\d+\.\d+.hz', s)
['2.400-2.483ghz', '5.725-5.850ghz']
Explanation:
(?:\d+\.\d+-)?: In an optional group match a floating point number followed by hyphen
\d+\.\d+: Match a floating point number
.hz: Match any character followed by hz
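Because the leading group is optional, the same pattern should also pick up a lone frequency that is not part of a range. A small sketch, using a made-up input string (`mixed`) that contains one range and one single value:

```python
import re

FREQ = re.compile(r'(?:\d+\.\d+-)?\d+\.\d+.hz')

# Hypothetical input: one range followed by one single frequency.
mixed = '2.400-2.483ghz;5.8ghz'
found = FREQ.findall(mixed)
print(found)
```

Since the group is non-capturing, findall returns the whole match rather than just the optional prefix.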
I want to write code that can parse American phone numbers (ie. "(664)298-4397") . Below are the constraints:
allow leading and trailing white spaces
allow white spaces that appear between area code and local numbers
no white spaces in area code or the seven digit number XXX-XXXX
Ultimately I want to print a tuple of strings (area_code, first_three_digits_local, last_four_digits_local)
I have two sets of questions.
Question 1:
Below are inputs my code should accept and print the tuple for:
'(664) 298-4397', '(664)298-4397', ' (664) 298-4397'
Below is the code I tried:
regex_parse1 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664) 298-4397')
print (f' groups are: {regex_parse1.groups()} \n')
regex_parse2 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664)298-4397')
print (f' groups are: {regex_parse2.groups()} \n')
regex_parse3 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' (664) 298-4397')
print (f' groups are: {regex_parse3.groups()}')
The string input for all three are valid and should return the tuple:
('664', '298', '4397')
But instead I'm getting the output below for all three:
groups are: ('', '', '4397')
What am I doing wrong?
Question 2:
The following two chunks of code should output an 'NoneType' object has no attribute 'group' error because the input phone number string violates the constraints. But instead, I get outputs for all three.
regex_parse4 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(404)555 -1212')
print (f' groups are: {regex_parse4.groups()}')
regex_parse5 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' ( 404)121-2121')
print (f' groups are: {regex_parse5.groups()}')
Expected output: should be an error but I get this instead for all three:
groups are: ('', '', '2121')
What is wrong with my regex code?
In general, your regex overuses the asterisk *. Details as follows:
You have 3 capturing groups:
([\s]*[(]*[0-9]*[)]*[\s]*)
([\s]*[0-9]*)
([0-9]*[\s]*)
You put an asterisk on every single item, including the opening and closing parentheses. In fact, almost everything in your regex is quantified with an asterisk, so the capturing groups can also match empty strings. That's why your first and second capturing groups return empty strings. The only item without an asterisk is the hyphen - just before the third capturing group, which is also why your regex does capture the third group, as in 4397 and 2121.
To solve your problem, use the asterisk only where it is needed.
In fact, your regex still has plenty of room for improvement. For example, it currently matches digit runs of any length (instead of chunks of 3 or 4 digits). It also allows an area code that is not enclosed in parentheses (because of the asterisks on the parenthesis symbols).
For this kind of common regex, you don't need to reinvent the wheel. You can refer to ready-made regexes that are easy to find on the Internet. For example, you can refer to this post. Although the post uses JavaScript instead of Python, the regex is much the same.
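To make the advice concrete, here is a minimal sketch of a stricter pattern that uses the asterisk only for the whitespace that is allowed to vary. The names `PHONE` and `parse_phone` are made up for the example, and the pattern assumes the area code is always in parentheses and the local number is always XXX-XXXX.

```python
import re

# \s* only where whitespace is allowed: at both ends and after the area code.
PHONE = re.compile(r'\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*')


def parse_phone(text):
    """Return (area, first_three, last_four), or None if the input is invalid."""
    match = PHONE.fullmatch(text)
    return match.groups() if match else None


for candidate in ['(664) 298-4397', '(664)298-4397', ' (664) 298-4397',
                  '(404)555 -1212', ' ( 404)121-2121']:
    print(repr(candidate), parse_phone(candidate))
```

Returning None instead of letting .groups() fail keeps the check explicit; callers that prefer an exception can raise ValueError at that point.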
Try:
regex_parse4 = re.match(r'\s*[(]([0-9]{3})[)]\s*([0-9]{3})-([0-9]{4})\s*$', number)
Assumes a 3-digit area code in parentheses, followed by XXX-XXXX.
re.match returns None when there is no match, which is why calling .groups() on the result then fails with the 'NoneType' error.
If above does not work, here is a helpful regex tool:
https://regex101.com
Edit:
Another suggestion is to clean data prior to applying a new regex. This helps with instances of abnormal spacing, gets rid of parentheses, and '-'.
clean_number = re.sub("[^0-9]", "", original_number)
regex_parse = re.match(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)
print(f'groups are: {regex_parse.groups()}')
>>> ('xxx', 'xxx', 'xxxx')
I have a list in Python with values
['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
I want to match only strings where length is 8 and there are 3 characters before underscore and 4 digits after underscore so I eliminate values not required. I am interested only in the MMM_YYYY values from above list.
I tried the code below, but I am not able to filter out values like YTD_TY_1, which have multiple underscores.
for c in col_headers:
    d = re.match(r'^(?=.*\d)(?=.*[A-Z0-9])[A-Z_0-9\d]{8}$', c)
    if d:
        data_period.append(d[0])
Update: based on #WiktorStribiżew observation that re.match does not require a full string match in Python
The regex I am using is based upon the one that #dvo provided in a comment:
import re
REGEX = '^[A-Z]{3}_[0-9]{4}$'
col_headers = ['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
regex = re.compile(REGEX)
data_period = list(filter(regex.search, col_headers))
Once again, based on a comment made by #WiktorStribiżew, if you do not want to match something as "SXX_0012" or "XYZ_0000", you should use the regex he has provided in a comment:
REGEX = r'^(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)_[0-9]{4}$'
Rather than use regex for this, you should just try to parse it as a date in the first place:
from datetime import datetime
date_fmt = "%b_%Y"
data_period = []
for c in col_headers:
    try:
        d = datetime.strptime(c, date_fmt)
        data_period.append(c)  # Or just save the datetime object directly
    except ValueError:
        pass
The part of this code that is actually doing the matching in your solution is this
[A-Z_0-9\d]{8}
The problem with this is that you're asking for exactly 8 characters drawn from A-Z, _, 0-9, and \d. Now, \d is equivalent to 0-9, so you can drop it, but that doesn't solve the whole problem. The real issue is that you've put everything inside a single character class []. As a result, your pattern matches any 8-character string made up of those characters, e.g. A_19_KJ9
What you need to do is specify that you want exactly 3 A-Z characters, then a single _, then 4 \d, see below:
[A-Z]{3}_\d{4}
This will match anything with exactly 3 A-Z characters, then a single _, then 4 \d(any numeric digit)
For a better understanding of regex, I'd encourage you to use an online tool, like regex101
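Applied to the list from the question, the pattern could be used roughly like this. This is a sketch rather than the only way to do it: `MONTH_YEAR` is a made-up name, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

col_headers = ['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018',
               'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019',
               'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1',
               'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']

# Exactly three capital letters, one underscore, four digits.
MONTH_YEAR = re.compile(r'[A-Z]{3}_\d{4}')

data_period = [c for c in col_headers if MONTH_YEAR.fullmatch(c)]
print(data_period)
```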
I have a text file with 32 articles. Each article starts with the expression: <Number> of 32 DOCUMENTS, for example: 1 of 32 DOCUMENTS, 2 of 32 DOCUMENTS, etc. In order to find each article I have used the following code:
import re
sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
print(len(sections))
So now, articles are represented by the expression sections
The next thing I want to do, is to subgroup the articles in 2 groups. Those articles containing the words: economy OR economic AND uncertainty OR uncertain AND tax OR policy, identify them with the number 1.
Whereas those articles containing the following words: economy OR economic AND uncertain OR uncertainty AND regulation OR spending, identify them with the number 2. This is what I have tried so far:
for i in range(len(sections)):
    group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", sections[i])
    group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", sections[i])
Nevertheless, it does not seem to work. Any ideas why?
It's a bit wordy, but you can get away without using regular expressions here, for example:
# Take a lowercase copy for comparisons
s = sections[i].lower()
if (('economic' in s or 'economy' in s) and
        ('uncertainty' in s or 'uncertain' in s) and
        ('tax' in s or 'policy' in s)):
    do_stuff()
It is possible to write this as a single regular expression, but it is a bit tricky. For each and you'd use a zero-width lookahead assertion (?= ), and for each or you'd use a branch. We'd also have to use \b for word boundaries, and pass re.S so that . also matches the newlines inside an article. We'd use re.match instead of re.search, since every lookahead only needs to be anchored at the start of the text.
belongs_to_group1 = bool(re.match(
    r'(?=.*\b(?:economic|economy)\b)'
    r'(?=.*\b(?:uncertain|uncertainty)\b)'
    r'(?=.*\b(?:tax|policy)\b)', text, re.I | re.S))
Thus not very readable.
A more fruitful approach would be to find all words and put them into a set
words = set(re.findall(r'\w+', text.lower()))
belongs_to_group1 = (('uncertainty' in words or 'uncertain' in words)
and ('economic' in words or 'economy' in words)
and ('tax' in words or 'policy' in words))
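Building on the word-set idea, both groups from the question could be assigned in one pass. The helper below is only a sketch: `classify` is a hypothetical name, the sample sentences are invented, and it assumes an article may belong to both groups at once, which the question does not specify.

```python
import re


def classify(text):
    """Return the list of group numbers (1 and/or 2) that the text belongs to."""
    words = set(re.findall(r'\w+', text.lower()))
    # Shared condition: (economy OR economic) AND (uncertainty OR uncertain)
    base = bool(words & {'economy', 'economic'}) and \
        bool(words & {'uncertainty', 'uncertain'})
    groups = []
    if base and words & {'tax', 'policy'}:
        groups.append(1)
    if base and words & {'regulation', 'spending'}:
        groups.append(2)
    return groups


print(classify("Economic uncertainty over tax policy."))
print(classify("The economy is uncertain; spending fell."))
print(classify("Tax policy only."))
```

With the list from the question this could be applied as `[classify(section) for section in sections]`.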
You can use re.search to find those words. Then you can use if statements and python's and and or statements for the logic, and then store group one and two as two lists with the section index number as a value.
One thing you might want to note is that your logic may need brackets.
By
economy OR economic AND uncertainty OR uncertain AND tax OR policy
I assume you mean
(economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy)
which is different to (for example)
economy OR (economic AND uncertainty) OR (uncertain AND tax) OR policy
EDIT1:
Python will evaluate your statement without brackets from left to right, i.e.:
( ( ( ( (economy OR economic) AND uncertainty) OR uncertain) AND tax) OR policy)
Which I imagine is not what you want (e.g. the above evaluates true if it includes the word policy but none of the others)
EDIT2:
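As pointed out in the comments, EDIT1 is incorrect: in Python, and binds more tightly than or, so evaluation is not simply left to right. You still need brackets to get the first grouping above; without them you get the second grouping instead (and the left-to-right grouping shown in EDIT1 is simply wrong).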
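The effect of the missing brackets can be checked with plain booleans. In this small sketch the variable names stand for "the word occurs in the article", and only policy is set to True:

```python
# Only the word "policy" is present in this hypothetical article.
economy = economic = uncertainty = uncertain = tax = False
policy = True

# Without brackets, `and` binds tighter than `or`, so this is parsed as
# economy or (economic and uncertainty) or (uncertain and tax) or policy.
unbracketed = economy or economic and uncertainty or uncertain and tax or policy

# The intended logic needs explicit brackets.
bracketed = (economy or economic) and (uncertainty or uncertain) and (tax or policy)

print(unbracketed, bracketed)
```

The unbracketed form accepts an article that mentions only policy, while the bracketed form rejects it.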
To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
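A repeated capturing group only keeps the text of its last repetition, which is why the original pattern returned just '8'. By wrapping everything between 2, and ,X in an outer (), you capture the whole run of numbers. I then changed the inner group to a non-capturing (?: ) so that only the outer group is returned.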
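The difference between the two patterns, and one possible way to pick the shortest match afterwards, can be sketched as follows. The three-line input string is invented for the example, and "shortest" is assumed to mean the fewest numbers.

```python
import re

s = 'O,2,9,6,7,11,8,X\nO,2,5,1,X\nX,3,2,7,X'

# Repeated capturing group: only the last repetition of each match is kept.
last_only = re.findall(r'O,2,([0-9],?){0,},X', s)

# Outer capturing group around a non-capturing repeat: the whole run is kept.
full_runs = re.findall(r'O,2,((?:[0-9],?){0,}),X', s)

# Assumption: the shortest sequence is the one with the fewest numbers.
shortest = min(full_runs, key=lambda run: len(run.split(',')))

print(last_only, full_runs, shortest)
```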
You don't have to use regex:
split the string into a list on commas
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each item in items[2:-1] is a number (str.isdigit)
That's all.
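Those steps could be written out roughly like this. It is a sketch only: `numbers_after` and its parameters are made-up names, and it handles one comma-separated line at a time.

```python
def numbers_after(line, first='O', second='2', last='X'):
    """Return the numbers between the given prefix and suffix, or None."""
    items = line.strip().split(',')
    if len(items) < 3:
        return None
    if items[0] != first or items[1] != second or items[-1] != last:
        return None
    middle = items[2:-1]
    # Every remaining item must consist of digits only.
    if not all(item.isdigit() for item in middle):
        return None
    return middle


lines = ['O,4,1,8,6,7,9,5,3,X', 'O,2,9,6,7,11,8,X', 'X,3,2,7,1,9,4,6,X']
results = [numbers_after(line) for line in lines]
print(results)
```

To find the shortest candidate, the non-None results can be passed to `min(..., key=len)`.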