only keep digits after ":" in regular expression in python [duplicate] - python

I want to extract the numbers for each parameter below:
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
The desired output should be:['42602','42401','42101']
I first tried re.findall(r'\d+',parameters), but it also returns the "2" from "NO2" and "SO2".
Then I tried re.findall(':.*',parameters), but it returns [': 42602', ': 42401', ': 42101']
If I can not rename the "NO2" to "Nitrogen dioxide", is there a way just to collect numbers on the right (after ":")?
Many thanks.

If you do not want to use capturing groups, you could use look behind.
(?<=:\s)\d+
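Details:
(?<=:\s): a lookbehind that requires a colon and one whitespace character immediately before the match, without including them in the result
\d+: matches one or more digits
I also tested the result in Python.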
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
result = re.findall(r'(?<=:\s)\d+',parameters)
print(result)
Result
['42602', '42401', '42101']
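One caveat with the lookbehind: Python's built-in re module only accepts fixed-width lookbehinds, so the pattern above requires exactly one whitespace character after the colon. If the spacing can vary, a capturing group is a simple alternative. A minimal sketch, where the input with irregular spacing is made up for illustration:

```python
import re

# Hypothetical input with irregular spacing after the colon.
parameters = "NO2:42602\nSO2:  42401\nCO: 42101\n"

# A variable-width lookbehind such as (?<=:\s*) is rejected by the re module.
try:
    re.compile(r'(?<=:\s*)\d+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# A capturing group has no width restriction; findall returns only the group.
result = re.findall(r':\s*(\d+)', parameters)
print(lookbehind_ok, result)
```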

You can use the following regex to capture the numbers
^\s*\w+:\s(\d+)$
Here, ^ asserts the position at the start of the line. \s* allows zero or more whitespace characters before the content. \w+:\s matches one or more word characters followed by ":" and a whitespace character, that is "NO2: ".
Finally, (\d+) matches the following digits you want as a group. $ matches the end of the line.
To get all the matches as a list you can use
matches = re.findall(r'^\s*\w+:\s(\d+)$', parameters, re.MULTILINE)
As re.MULTILINE is specified,
the pattern character '^' matches at the beginning of the string and
at the beginning of each line.
as stated in the docs.
The result is as follows
>>> print(matches)
['42602', '42401', '42101']
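To see what the flag changes, here is a small sketch that runs the same pattern with and without re.MULTILINE on the question's input. The variable names are only illustrative.

```python
import re

parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''

pattern = r'^\s*\w+:\s(\d+)$'

# Default mode: ^ and $ only anchor at the ends of the whole string,
# so the line-by-line pattern cannot match inside a multi-line string.
without_flag = re.findall(pattern, parameters)

# re.MULTILINE: ^ and $ also anchor at the start and end of every line.
with_flag = re.findall(pattern, parameters, re.MULTILINE)

print(without_flag)
print(with_flag)
```

Without the flag nothing matches, because the first run of digits is not followed by the end of the string.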

To put my two cents in, you could simply use
re.findall(r'(\b\d+\b)', parameters)
See a demo on regex101.com.
If you happen to have other digits floating around somewhere in your string, be more precise with
\w+:\s*(\d+)
See another demo on regex101.com.
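If the parameter names are useful too, the more precise pattern can capture both sides of the colon. This is only a sketch of one option, not something the question asked for; the name codes is made up.

```python
import re

parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''

# With two groups, findall returns (name, code) tuples, which dict() accepts.
codes = dict(re.findall(r'(\w+):\s*(\d+)', parameters))

print(codes)
print(list(codes.values()))
```

The list of values is the output the question wants, and the dictionary keeps track of which parameter each number belongs to.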

re.findall(r'(?<=:\s)\d+', parameters)
Should work. You can learn more about look-behind from here.

You just need to specify where in your string you want to search for digits. You can use:
re.findall(r': (\d+)', parameters)
This tells Python to look for digits in the part of the string after ":" and the "space".

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings. I currently do
a = mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX': it returns an empty string instead of a='202'. Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string.
Can you please guide me with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find, and since nothing stops it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
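To make the greedy behaviour concrete, this sketch compares \d+ with its lazy counterpart \d+? on the same example string. The lazy variant is an addition for comparison, not part of the answer above.

```python
import re

sample = ' 920220220519AX 12022'

# Greedy: the longest run of digits that is still followed by 2022.
greedy = re.findall(r'\d+(?=2022)', sample)

# Lazy: the shortest run of digits that is followed by 2022.
lazy = re.findall(r'\d+?(?=2022)', sample)

print(greedy)
print(lazy)
```

The greedy pattern gives ['9202', '1'] as described, while the lazy one stops at the first 2022 it can reach and gives ['9', '202', '1'].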
After strip(), you could use re.split() with a lookbehind asserting that the 2022 is not at the start of the string. Alternatively, you can match one or more digits from the start of the string up to the first following 2022, in case 2022 occurs more than once.
import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; the string is only split.
If the string consists of only word characters, splitting on \B2022, where \B means not a word boundary, will also prevent splitting at the start of the example string.
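The \B variant could look like the following sketch. maxsplit=1 is an addition here so that only the first qualifying 2022 is used; the name heads is illustrative.

```python
import re

strings = [' 1020220519AX', ' 20220220519AX']

# \B2022 only matches a 2022 that directly follows another word character,
# so a 2022 at the very start of the stripped string is skipped.
heads = [re.split(r'\B2022', s.strip(), maxsplit=1)[0] for s in strings]

print(heads)
```

As with the lookbehind split, nothing checks that the returned part is made of digits.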

using Python re to check a string

I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
    status=False
    uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    if uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID):
        status=True
    return status
correctIDs=[]
for prot in potential_uniprots:
    if is_uniprot(prot) == True:
        correctIDs.append(prot)
print(correctIDs)
Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:
You can omit {1} and the comma's from the character class (If you don't want to match comma's) The patterns by them selves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact amount of characters. To match an optional hyphen and digit, you can use an optional non capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets as the brackets dictate that the regex should match all characters in the square brackets. For example, a regex such as [A-Z,0-9] is going to match an uppercase character, comma, or a digit whereas a regex such as [A-Z0-9] is going to match an uppercase character or a digit. Furthermore, you don't need the {1} as the regex will match one by default if no quantifiers are specified. This means that you can just delete the {1} from the expression.
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
status = (len(string) == 6 or len(string) == 8)
But you can also force the regex to match certain lengths using \b (word boundary), which you have already done. You can alternatively use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
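Putting the fixes together, a complete checker might look like the sketch below. It assumes the three layouts exactly as listed in the question and an optional suffix of a dash plus one digit. It uses re.fullmatch instead of explicit ^ and $, which is a swapped-in but equivalent way to require that the whole string matches. The names UNIPROT and correct_ids are made up.

```python
import re

# The three layouts from the question, without commas and {1},
# followed by an optional "-digit" suffix.
UNIPROT = re.compile(
    r'(?:'
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]'
    r'|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]'
    r'|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9][A-Z][A-Z0-9]{2}[0-9]'
    r')(?:-[0-9])?'
)

def is_uniprot(candidate):
    """Return True if the whole string has one of the three layouts."""
    return UNIPROT.fullmatch(candidate) is not None

potential_uniprots = ['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25',
                      'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
correct_ids = [p for p in potential_uniprots if is_uniprot(p)]
print(correct_ids)
```

Matching the whole string also rejects 'C3KQY5-V'. A \b-delimited search would accept it, because the dash counts as a word boundary.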
Code Suggestions
Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID):
    status = True
return status
To this:
return bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
# -OR-
status = bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
return status
search() returns a match object (truthy) or None (falsy), so wrapping the expression in bool() yields exactly True or False, and it is safe to return that directly.

Regex to fix (all the matches or none) at the end to one

I'm trying to reduce the dots at the end of a string to exactly one. For example,
line = "python...is...fun..."
I have the regex \.*$ in Ruby, which is to be replaced by a single ., as in this demo, but it doesn't seem to work as expected. I've searched for similar posts, and the closest I've got is this answer in Python, which suggests the following:
>>> text1 = 'python...is...fun...'
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> new_text
'python...is...fun.'
But it fails if there is no . at the end. So I've tried \b\.*$, as seen here, but this fails on the 3rd test, which has some ?'s at the end.
My question is: why doesn't \.*$ match all the .'s (despite being greedy), and how do I solve the problem correctly?
Expected output:
python...is...fun.
python...is...fun.
python...is...fun??.
You might use an alternation that either matches 2 or more dots, or asserts that what is directly to the left is not a dot (you can add other characters, for example ! or ?, to that assertion if they should not get a dot appended).
In the replacement use a single dot.
(?:\.{2,}|(?<!\.))$
Explanation
(?: Non capture group for the alternation
\.{2,} Match 2 or more dots
| Or
(?<!\.) Get the position where directly to the left is not a . (which you can extend with other characters as desired)
) Close non capture group
$ End of string (Or use \Z if there can be no newline following)
For example
import re
strings = [
    "python...is...fun...",
    "python...is...fun",
    "python...is...fun??"
]
for s in strings:
    new_text = re.sub(r"(?:\.{2,}|(?<!\.))$", ".", s)
    print(new_text)
Output
python...is...fun.
python...is...fun.
python...is...fun??.
If an empty string should not be replaced by a dot, you can use a positive lookbehind.
(?:\.{2,}|(?<=[^\s.]))$
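As for why \.*$ gives an unexpected result: the pattern can also match an empty string, so after consuming the trailing dots it matches once more at the very end, and the replacement is inserted twice. The sketch below shows this with Python's re module (the behaviour of re.sub described here applies to Python 3.7 and later; the question used Ruby, so this is only an analogy). Limiting the substitution to one match with count=1 is a possible alternative fix; the helper name is made up.

```python
import re

line = "python...is...fun..."

# \.*$ matches the trailing "..." and then once more as an empty string at
# the very end of the line, so there are two matches to replace.
spans = [m.span() for m in re.finditer(r"\.*$", line)]
doubled = re.sub(r"\.*$", ".", line)

def single_trailing_dot(text):
    """Replace only the first match, which skips the extra empty match."""
    return re.sub(r"\.*$", ".", text, count=1)

tests = ["python...is...fun...", "python...is...fun", "python...is...fun??"]
fixed = [single_trailing_dot(t) for t in tests]
print(spans, doubled, fixed)
```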

Regex to exclude specific special characters, spaces and alphabets [duplicate]

I want a regular expression which converts this:
91009-01-28-00 Maximum (c/s)................ 1543.5
to this:
91009-01-28-00 1543.5
So basically, a regular expression that removes the letters, spaces, forward slashes and brackets.
I have written the following python code so far:
with open('lcstats.txt', 'r') as lcstats_file:
    with open(lcstats_full_path + '_lcstats_full.txt', "a+") as lcstats_full_file:
        lcstats_full_file.write(obsid)
        for line in lcstats_file.readlines():
            if not re.search(r'Maximum [(c/s)]', line):
                continue
            line = (re.sub(**REGEX**,'',line))
            lcstats_full_file.write(line)
It appears you want the first and last parts of the string. If that is the case for every line, then splitting it accordingly can be helpful, as in the following code:
import re
line = "91009-01-28-00 Maximum (c/s) ................ 1543.5"
line=line.split(' ')
line=line[0]+' '+ line[-1]
print(line)
Output:
91009-01-28-00 1543.5
In your code you are using search to check if you can match Maximum (c/s) and then you want to use a regex to remove that.
I think with your regex Maximum [(c/s)] you mean Maximum \(c/s\). The square brackets make it a character class and (c/s) captures c/s in a capturing group which is not required if you only want to match it.
What you could do is match Maximum (c/s), then match a space or a dot one or more times using the character class [ .]+, and replace the whole match with an empty string.
Maximum \(c/s\)[ .]+
import re
s = "91009-01-28-00 Maximum (c/s)................ 1543.5"
print( re.sub(r"Maximum \(c/s\)[ .]+", "", s))
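The difference between the character class and the escaped parentheses can be checked directly. A small sketch on the sample line, with illustrative variable names:

```python
import re

s = "91009-01-28-00 Maximum (c/s)................ 1543.5"

# [(c/s)] is a character class: it matches ONE character out of
# "(", "c", "/", "s" and ")".
loose = re.search(r"Maximum [(c/s)]", s).group()

# Escaped parentheses match the literal text "(c/s)".
strict = re.search(r"Maximum \(c/s\)", s).group()

# Remove the label together with the run of spaces and dots after it.
cleaned = re.sub(r"Maximum \(c/s\)[ .]+", "", s)

print(repr(loose), repr(strict), repr(cleaned))
```

The character-class version still finds the line, but only because "(" happens to be one of the characters in the class.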
Try using this regex /\s[^0-9]+/ This will match from the first space followed by 1 or more not digit characters. You will need to add a space in the replacement string to keep the two bits of remaining data separate.
Regex:
((?<!\d)\D)
Matches every non-digit (\D) that is not preceded by a digit (\d); replace the matches with an empty string.
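For comparison, both of these last two suggestions can be tried on the sample line. This is a sketch only, and it assumes every line looks like the sample. Variable names are illustrative.

```python
import re

s = "91009-01-28-00 Maximum (c/s)................ 1543.5"

# \s[^0-9]+ also swallows the first space, so one space is put back.
via_non_digits = re.sub(r"\s[^0-9]+", " ", s)

# (?<!\d)\D removes every non-digit that does not directly follow a digit.
# The dashes, the decimal point and the first space survive only because
# each of them comes right after a digit.
via_lookbehind = re.sub(r"(?<!\d)\D", "", s)

print(via_non_digits)
print(via_lookbehind)
```

Both depend on the exact layout of the line, so the more explicit Maximum \(c/s\)[ .]+ pattern may be easier to maintain.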

extracting items using regular expression in python

I have a file which has the following:
new=['{"TES1":"=TES0"}}', '{"""TES1:IDD""": """=0x3C""", """TES1:VCC""": """=0x00"""}']
I am trying to extract the first item, TES1:=TES0, from the list. I am trying to use a regular expression to do this. This is what I tried, but I am not able to grab the second item, TES0.
import re
TES = re.compile(r'(TES[\d].)+')
for item in new:
    result = TES.search(item)
    print(result.groups())
The result of the print was ('TES1:',). I have tried various ways to extract it but am always getting the same result. Any suggestion or help is appreciated. Thanks!
I think you are looking for findall:
import re
TES = re.compile(r'TES[\d].')
for item in new:
    result = TES.findall(item)
    print(result)
First Option (with quotes)
To match "TES1":"=TES0", you can use this regex:
"TES\d+":"=TES\d+"
like this:
match = re.search(r'"TES\d+":"=TES\d+"', subject)
if match:
    result = match.group()
Second Option (without quotes)
If you want to get rid of the quotes, as in TES1:=TES0, you use this regex:
Search: "(TES\d+)":"(=TES\d+)"
Replace: \1:\2
like this:
result = re.sub(r'"(TES\d+)":"(=TES\d+)"', r"\1:\2", subject)
How does it work?
"(TES\d+)":"(=TES\d+)"
  Match the character “"” literally: "
  Match the regex below and capture its match into backreference number 1: (TES\d+)
    Match the character string “TES” literally (case sensitive): TES
    Match a single character that is a “digit” (0–9 in any Unicode script): \d+
      Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
  Match the character string “":"” literally: ":"
  Match the regex below and capture its match into backreference number 2: (=TES\d+)
    Match the character string “=TES” literally (case sensitive): =TES
    Match a single character that is a “digit” (0–9 in any Unicode script): \d+
      Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
  Match the character “"” literally: "
\1:\2
  Insert the text that was last matched by capturing group number 1: \1
  Insert the character “:” literally: :
  Insert the text that was last matched by capturing group number 2: \2
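Note that re.sub keeps everything outside the match, so the surrounding braces stay in the result. If only TES1:=TES0 is wanted, the two groups can be joined directly. A minimal sketch using the list from the question; the helper name extract_pair is made up:

```python
import re

new = ['{"TES1":"=TES0"}}',
       '{"""TES1:IDD""": """=0x3C""", """TES1:VCC""": """=0x00"""}']

pair = re.compile(r'"(TES\d+)":"(=TES\d+)"')

def extract_pair(item):
    """Return 'TESn:=TESm' for a matching item, else None."""
    match = pair.search(item)
    return "{}:{}".format(*match.groups()) if match else None

results = [extract_pair(item) for item in new]
print(results)
```

The second list item has a different layout, so it yields None and can be filtered out.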
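You can use a single replacement, for example (note that Python backreferences in the replacement are written \1 and \2, not $1 and $2):
import re
result = re.sub(r'{"(TES\d)":"(=TES\d)"}}', r'\1:\2', yourstr, count=1)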
