Python regex get pairs of floats from string if present

I have some strings which contain pairs of frequencies or pairs of frequency ranges.
My regex gets the following list from this example string:
example_string = ':2.400-2.483ghz;5.725-5.850ghz\n\ntransmissionpower(eirp),2'
re.findall(r"(\d+\.\d+.hz)", example_string)
# example output: ['2.483ghz', '5.850ghz']
How can I extract the range of frequencies rather than just the single float after the - character?
Output should be ['2.400-2.483ghz', '5.725-5.850ghz']

Something like this should (mostly) work to find all the occurrences of those strings in the text (it should handle any number of ranges in the line):
>>> import re
>>> example_string = ':2.400-2.483ghz;5.725-5.850ghz\n\ntransmissionpower(eirp),2'
>>> re.findall(r'([0-9.]+-[0-9.]+.?hz)', example_string)
['2.400-2.483ghz', '5.725-5.850ghz']
To break it down:
[0-9.]+ finds 1 or more digits and .s together (e.g. 2.400)
- matches the literal dash between the two numbers
.?hz finds 0 or 1 characters followed by 'hz' so it should handle most units (e.g. hz, ghz, etc.)
The whole thing essentially looks for <number><dash><number><units>, and findall returns every such match in the string.
It's worth pointing out that, like most regexes, this is still pretty brittle so if the string is malformatted, if it's GHz instead of ghz, if the numbers are in scientific notation, etc., it will break, but hopefully you can adjust as needed.
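As a sketch of how the pattern could be adjusted for some of those cases, the variant below matches case-insensitively and returns both ends of each range as floats. The helper name parse_ranges and the [kmgt]?hz unit list are assumptions made for illustration, not part of the answer above:

```python
import re

example_string = ':2.400-2.483ghz;5.725-5.850ghz\n\ntransmissionpower(eirp),2'

# <number>-<number><unit>, where the unit is hz with an optional k/m/g/t prefix.
# re.IGNORECASE lets GHz, ghz and GHZ all match.
RANGE_RE = re.compile(
    r'(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\s*([kmgt]?hz)',
    re.IGNORECASE,
)

def parse_ranges(text):
    """Return (low, high, unit) tuples for every frequency range in text."""
    return [
        (float(low), float(high), unit.lower())
        for low, high, unit in RANGE_RE.findall(text)
    ]

print(parse_ranges(example_string))
# [(2.4, 2.483, 'ghz'), (5.725, 5.85, 'ghz')]
```

Scientific notation is still not handled here; that would need a wider number pattern.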

You may use this regex:
(?:\d+\.\d+-)?\d+\.\d+.hz
Code:
>>> import re
>>> s = ':2.400-2.483ghz;5.725-5.850ghz\n\ntransmissionpower(eirp),2'
>>> re.findall(r'(?:\d+\.\d+-)?\d+\.\d+.hz', s)
['2.400-2.483ghz', '5.725-5.850ghz']
Explanation:
(?:\d+\.\d+-)?: In an optional group match a floating point number followed by hyphen
\d+\.\d+: Match a floating point number
.hz: Match any character followed by hz
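Because the first group is optional, this pattern should also pick up a lone frequency that has no range. A small check, using a made-up input string (mixed is not from the question):

```python
import re

PATTERN = re.compile(r'(?:\d+\.\d+-)?\d+\.\d+.hz')

# Hypothetical input: one range followed by one single frequency
mixed = ':2.400-2.483ghz;5.850ghz'

found = PATTERN.findall(mixed)
print(found)  # ['2.400-2.483ghz', '5.850ghz']
```

findall returns whole matches here because the only group in the pattern is non-capturing.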

Related

How To Extract Three Letters Followed By Five Digits Using Regex in Python

I have the following dataframe in Python:
abc12345
abc1234
abc1324.
How do I extract only the ones that have three letters followed by five digits?
The desired result would be:
abc12345.
df.column.str.extract('[^0-9](\d\d\d\d\d)$')
I think this works, but is there any better way to modify (\d\d\d\d\d) ?
What if I had like 30 digits. Then I'll have to type \d 30 times, which is inefficient.
You should be able to use:
'[a-zA-Z]{3}\d{5}'
If the strings don't include capital letters this can reduce to:
'[a-z]{3}\d{5}'
Change the values in the {x} to adjust the number of chars to capture.
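Or, to match the five digits starting at a fixed offset, like the following code:
import re
s = "abc12345"
p = re.compile(r"\d{5}")
c = p.match(s, 3)  # start matching at index 3, right after the three letters
print(c.group())  # 12345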
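Note that an unanchored pattern can also match inside a longer string. A minimal sketch of filtering with re.fullmatch, which only succeeds when the whole string fits the pattern ("xabc12345" is an extra value added here for illustration):

```python
import re

# The three values from the question plus one hypothetical longer value
values = ["abc12345", "abc1234", "abc1324.", "xabc12345"]

THREE_LETTERS_FIVE_DIGITS = re.compile(r"[a-zA-Z]{3}\d{5}")

# search() accepts a match anywhere in the string
loose = [v for v in values if THREE_LETTERS_FIVE_DIGITS.search(v)]

# fullmatch() requires the entire string to be three letters + five digits
exact = [v for v in values if THREE_LETTERS_FIVE_DIGITS.fullmatch(v)]

print(loose)  # ['abc12345', 'xabc12345']
print(exact)  # ['abc12345']
```

For 30 digits, only the count changes: \d{30}.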

Match characters and digits of fixed length and one occurrence in Python

I have a list in Python with values
['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
I want to match only strings where length is 8 and there are 3 characters before underscore and 4 digits after underscore so I eliminate values not required. I am interested only in the MMM_YYYY values from above list.
Tried below and I am not able to filter values like YTD_TY_1 which has multiple underscores.
for c in col_headers:
    d = re.match('^(?=.*\d)(?=.*[A-Z0-9])[A-Z_0-9\d]{8}$', c)
    if d:
        data_period.append(d[0])
Update: based on @WiktorStribiżew's observation that re.match does not require a full string match in Python, the pattern below is anchored with ^ and $.
The regex I am using is based upon the one that @dvo provided in a comment:
import re
REGEX = '^[A-Z]{3}_[0-9]{4}$'
col_headers = ['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
regex = re.compile(REGEX)
data_period = list(filter(regex.search, col_headers))
Once again, based on a comment made by @WiktorStribiżew, if you do not want to match something like "SXX_0012" or "XYZ_0000", you should use the regex he has provided in a comment (with _ as the separator, to agree with the data):
REGEX = r'^(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)_[0-9]{4}$'
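To see the difference between the two patterns, here is a small comparison. It assumes the separator is an underscore, as in the list from the question; the candidates list is made up for the example:

```python
import re

GENERIC = re.compile(r'^[A-Z]{3}_[0-9]{4}$')
MONTHS = re.compile(
    r'^(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)_[0-9]{4}$'
)

# Hypothetical headers: two real periods and two look-alikes
candidates = ['JUL_2018', 'XYZ_0000', 'SXX_0012', 'DEC_2018']

generic_hits = list(filter(GENERIC.search, candidates))
month_hits = list(filter(MONTHS.search, candidates))

print(generic_hits)  # ['JUL_2018', 'XYZ_0000', 'SXX_0012', 'DEC_2018']
print(month_hits)    # ['JUL_2018', 'DEC_2018']
```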
Rather than use regex for this, you should just try to parse it as a date in the first place:
from datetime import datetime
date_fmt = "%b_%Y"
for c in col_headers:
    try:
        d = datetime.strptime(c, date_fmt)
        data_period.append(c)  # Or just save the datetime object directly
    except ValueError:
        pass
The part of your pattern that actually does the matching is this:
[A-Z_0-9\d]{8}
The problem is that this asks for exactly 8 characters, each of which may be A-Z, _, 0-9, or \d. \d is equivalent to 0-9, so you can drop it, but that doesn't solve the whole problem: the issue is that you've put every allowed character inside a single character class []. Your pattern will therefore match any 8-character string made up of those characters in any order, e.g. A_19_KJ9
What you need to do is specify that you want exactly 3 A-Z characters, then a single _, then 4 \d, see below:
[A-Z]{3}_\d{4}
This will match exactly 3 A-Z characters, then a single _, then 4 \d (any numeric digit).
For a better understanding of regex, I'd encourage you to use an online tool, like regex101
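A quick way to see the character-class problem is to run both patterns over a few values. The list below is a subset of the question's headers plus the A_19_KJ9 example; the ^ and $ anchors are added so each pattern has to cover the whole string:

```python
import re

headers = ['JUL_2018', 'JUN_2019', 'MAT_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'A_19_KJ9']

# One character class repeated 8 times: order of characters is not checked
loose = re.compile(r'^[A-Z_0-9]{8}$')

# Three letters, one underscore, four digits, in that order
strict = re.compile(r'^[A-Z]{3}_\d{4}$')

loose_hits = [h for h in headers if loose.match(h)]
strict_hits = [h for h in headers if strict.match(h)]

print(len(loose_hits))  # 6, every header passes
print(strict_hits)      # ['JUL_2018', 'JUN_2019']
```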

I have a regex statement to pull all numbers out of a text file, but it only finds 77 out of the 81 numbers in the file

I have a text file with a lot of numbers that go to the 16th decimal place. There are 81 numbers in total. There are commas and brackets throughout the file, so I (being new to regular expressions) tried to write one to pull out the numbers. To put it simply, I need a regular expression that can find numbers that have 1 digit (either positive or negative), followed by a decimal point, followed by 16 more digits. Some examples of the format of the numbers in the text file: -0.12345676890987654 or 0.7564738273839182. Sorry, I do not have any examples of numbers that don't match, but I can guarantee that all numbers are written the same way as the two examples I just gave.
I have already tried loading it as a string, splitting it at the brackets and comma, but all of these methods are not as elegant and take up way more lines. This is why I have chosen to learn regex.
from re import findall
File = open("Data.txt", 'r')
Data = File.read()
File.close()
Values = findall(r"(-\d\.|\d\.)(\d{16})", Data)
Data = [float(Item[0] + Item[1]) for Item in Values]
for Thing in Data:
    print(Thing)
print(len(Data))
From my understanding, my regex will find any number, preceded by a "-" or not, followed by a period, that also has 16 digits after it (e.g. -0.12345676890987654 or 0.7564738273839182). Here is a short snippet of the file I am working with.
[[-0.8433461106676767, 0.5111623521263733, -0.39797568745771605,
0.8150308209141626, -0.9157151911545942, -0.4870281951128881],
[0.49680176773207174, -0.18390655568106262...
When I print len(Data) I get 77. I have counted the numbers in the file (and did the math as to how many I put in there) and both came out to be 81. So 4 numbers are not being found. A little more information: these numbers were generated randomly, so there is very little chance that two of them will be identical. I'm not sure if that makes a difference, as the function called is named "findall". What I am looking for (in order of importance) is:
Why didn't this work?
What does a regex expression that works for this scenario look like?
Your regex is working as you wrote it, and it is finding a pattern matching:
a negative sign (optional)
one digit
a decimal point (.)
exactly 16 digits after the decimal point.
Given that your numbers are random, some (statistically, around 10%) of them have a last digit of 0, which is not printed, so they only have 15 (or fewer!) digits after the decimal point and are skipped.
If the data was generated in Python, there will probably also be some numbers with more than 16 digits after the decimal point; your pattern still matches those, but truncates them to 16 digits.
The solution is probably just to allow any number of digits: -?\d\.\d+
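The effect can be reproduced on a small made-up sample. The string below is not the asker's file; it just contains one 17-digit value, one 15-digit value, and one short value alongside two 16-digit ones:

```python
import re

# Hypothetical sample in the same layout as the file
sample = (
    "[[-0.8433461106676767, 0.5111623521263733, -0.39797568745771605],\n"
    " [0.815030820914162, -0.5]]"
)

# Exactly 16 digits: skips shorter numbers and truncates longer ones
fixed_width = re.findall(r"-?\d\.\d{16}", sample)

# Any number of digits after the decimal point
any_width = re.findall(r"-?\d\.\d+", sample)

print(len(fixed_width), len(any_width))  # 3 5
print(fixed_width[2])                    # -0.3979756874577160 (last digit lost)
values = [float(n) for n in any_width]
```

Using a single pattern without capture groups also means findall returns the full number strings directly, so there is no need to join Item[0] and Item[1].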

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
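When a capturing group is repeated, only its last repetition is kept, which is why the original pattern returned just '8'. By wrapping everything between 2, and ,X in an outer (), you capture the whole run of numbers instead. The inner group is changed to (?: ), a non-capturing group, so that findall returns only the outer capture.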
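A side-by-side run of the two patterns on the sample data shows the difference, and how the captured text can then be split into individual numbers:

```python
import re

s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''

# Repeated capturing group: findall keeps only the last repetition
repeated = re.findall(r"O,2,([0-9],?){0,},X", s)

# Outer capturing group around a non-capturing repeated group
wrapped = re.findall(r"O,2,((?:[0-9],?){0,}),X", s)

print(repeated)  # ['8']
print(wrapped)   # ['9,6,7,11,8']

# Turn each captured run into a list of ints
moves = [[int(n) for n in run.split(',')] for run in wrapped]
print(moves)     # [[9, 6, 7, 11, 8]]
```

With many matches, something like min(moves, key=len) would pick the run with the fewest numbers, assuming that is what "shortest" means here.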
You don't have to use a regex:
split the string into a list on the commas
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each item in items[2:-1] is a number (str.isdigit)
That's all.
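Those steps could be sketched roughly as follows. The function name moves_after and its parameters are invented for this example:

```python
def moves_after(line, first="O", second="2", last="X"):
    """Return the numbers between the given prefix and the final marker,
    or None if the line does not have the expected shape."""
    parts = line.strip().split(",")
    if len(parts) < 3:
        return None
    # The line must start with the given player/move and end with the marker
    if parts[0] != first or parts[1] != second or parts[-1] != last:
        return None
    middle = parts[2:-1]
    # Everything in between must be a plain number
    if not all(p.isdigit() for p in middle):
        return None
    return middle

print(moves_after("O,2,9,6,7,11,8,X"))     # ['9', '6', '7', '11', '8']
print(moves_after("O,4,1,8,6,7,9,5,3,X"))  # None
```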

Regex that match when beginning & end is of the same length

How do you make a regex that match when the beginning and the end is of the same length?
For example
>>> String = '[[A], [[B]], [C], [[D]]]'
>>> Result = re.findall(pattern, String)
>>> Result
['[A]', '[[B]]', '[C]', '[[D]]']
Currently I use the pattern \[.*?\] but it results in
['[[A]', '[[B]', '[C]', '[[D]']
Thanks in advance.
You can define such a regular expression for a finite number of beginning/end characters (ie, something like "if it starts and ends with 1, or starts and ends with 2, or etc"). You, however, cannot do this for an unlimited number of characters. This is simply a fact of regular expressions. Regular expressions are the language of finite-state machines, and finite-state machines cannot do counting; at least the power of a pushdown-automaton (context-free grammar) is needed for that.
Put simply, a regular expression can say: "I saw x and then I saw y" but it cannot say "I saw x and then I saw y the same number of times" because it cannot remember how many times it saw x.
However, you can easily do this using the full power of the Python programming language, which is Turing-complete! Turing-complete languages can definitely do counting:
>>> import re
>>> string = '[[A], [[B]], [C], [[D]]]'
>>> sameBrackets = lambda s: len(re.findall(r'\[', s)) == len(re.findall(r'\]', s))
>>> list(filter(sameBrackets, string.split(", ")))
['[[B]]', '[C]']
You can't. Sorry.
Python's regular expressions are an extension of "finite state automata", which only allow a finite amount of memory to be kept as you scan through the string for a match. This example requires an arbitrary amount of memory, depending on how many repetitions there are.
The only way in which Python allows more than just finite state is with "backreferences", which let you match an identical copy of a previously matched portion of the string -- but they don't allow you to match something with, say, the same number of characters.
You should try writing this by hand, instead.
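One way to write it by hand is to walk the string and keep a bracket depth counter. This is only a sketch: split_top_level is a made-up name, and it assumes the items you want sit exactly one level inside the outermost brackets, as in the question's example:

```python
def split_top_level(text):
    """Return the bracketed items nested one level inside the outer brackets."""
    items = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == '[':
            depth += 1
            if depth == 2:
                # An item begins when we enter the second nesting level
                start = i
        elif ch == ']':
            if depth == 2 and start is not None:
                # The item ends when we are about to leave that level
                items.append(text[start:i + 1])
                start = None
            depth -= 1
    return items

print(split_top_level('[[A], [[B]], [C], [[D]]]'))
# ['[A]', '[[B]]', '[C]', '[[D]]']
```

The counter plays the role of the memory that a plain regular expression lacks.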
To match balanced brackets you need a recursive regular expression. The stock re module doesn't support this syntax, but the alternative, third-party regex module does:
import regex
r = r'\[(([^\[\]]+)|(?R))*\]'
print(regex.match(r, '[[A], [[B]], [C], [[D]] ]'))  # matches
print(regex.match(r, '[[A], [[B]], [C , [[D]] ]'))  # None
That expression basically says: match something surrounded by brackets, where "something" is either a series of non-brackets ([^\[\]]+) or the whole thing once again (?R).
