I'm stuck with regular expressions [duplicate] - python

This question already has answers here:
Regexp to remove specific number of occurrences of character only
(2 answers)
How to only match a single instance of a character?
(3 answers)
Closed 11 months ago.
I'm stuck with regular expressions in Python...
#!/usr/bin/python3
import re
combi="ABBAEAADCA"
one_a = len(re.findall('[^A](A)[^A]', combi))
print("A:"+str(one_a))
I am trying to make this variable (one_a) hold the number of A's that appear alone (3), but the pattern does not count those at the beginning and end of the string, so I tried:
one_a = len(re.findall('\A(A)[^A]', combi))
print("A ini:"+str(one_a))
one_a += len(re.findall('[^A](A)[^A]', combi))
print("A_cen:"+str(one_a))
one_a += len(re.findall('[^A](A)\Z', combi))
print("A_end:"+str(one_a))
but it didn't work either when in this particular case the value that should stay in the variable should be 3.
I would appreciate knowing what I am missing or what mistake I am making.
Thank you very much

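A negated character class [^A] has to consume a single character other than A, so it cannot match at the very start or end of the string where there is no character to consume; \A only asserts the start of the string.
To get the A's that stand alone you can use negative lookarounds, asserting that there is no A directly to the left and right: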
(?<!A)A(?!A)
See a regex demo and a Python demo.
import re
combi="ABBAEAADCA"
one_a = len(re.findall('(?<!A)A(?!A)', combi))
print("A:"+str(one_a))
Output
A:3

You can combine start-of-string (^) and end of string ($) with regular character classes through the or (|) operator.
re.findall(r'(?:^|[^A])A(?:$|[^A])', combi)
This gives you all substrings where A is surrounded by start of string and end of string, start of string and not-A, not-A and end of string, or not-A and not-A.
>>> re.findall(r'(?:^|[^A])A(?:$|[^A])', combi)
['AB', 'BAE', 'CA']
Applying len to this list gives you the count of single A's.
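One caveat worth checking, offered as a hedged sketch and not as part of either answer: the alternation pattern consumes the neighbouring characters, so two lone A's separated by a single character share that neighbour and one of them is skipped, while the lookaround pattern consumes only the A itself. The helper names and the extra sample string "ABABA" are made up for illustration.

```python
import re

# Lookaround version: only the A itself is consumed.
LOOKAROUND = re.compile(r'(?<!A)A(?!A)')
# Alternation version: the neighbouring non-A characters are consumed too.
CONSUMING = re.compile(r'(?:^|[^A])A(?:$|[^A])')


def count_lone_a_lookaround(text):
    """Count A's that have no A directly before or after them."""
    return len(LOOKAROUND.findall(text))


def count_lone_a_consuming(text):
    """Same idea, but neighbours are part of each match and cannot be reused."""
    return len(CONSUMING.findall(text))


for sample in ("ABBAEAADCA", "ABABA"):
    print(sample, count_lone_a_lookaround(sample), count_lone_a_consuming(sample))
# ABBAEAADCA 3 3
# ABABA 3 2
```

For the string in the question both patterns agree; they differ only when lone A's sit one character apart.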

Related

only keep digits after ":" in regular expression in python [duplicate]

This question already has answers here:
How to grab number after word in python
(4 answers)
Closed 2 years ago.
I want to extract the numbers for each parameter below:
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
The desired output should be:['42602','42401','42101']
I first tried re.findall(r'\d+',parameters), but it also returns the "2" from "NO2" and "SO2".
Then I tried re.findall(':.*',parameters), but it returns [': 42602', ': 42401', ': 42101']
If I can not rename the "NO2" to "Nitrogen dioxide", is there a way just to collect numbers on the right (after ":")?
Many thanks.
If you do not want to use capturing groups, you could use look behind.
(?<=:\s)\d+
Details:
(?<=:\s): a lookbehind asserting that the match is directly preceded by ":" and one whitespace character
\d+: matches one or more digits
I also tested the pattern in Python.
import re
parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''
result = re.findall(r'(?<=:\s)\d+',parameters)
print (result)
Result
['42602', '42401', '42101']
You can use the following regex to capture the numbers
^\s*\w+:\s(\d+)$
Here, ^ at the beginning asserts the position at the start of the line. \s* allows zero or more whitespace characters before the content. \w+:\s matches one or more word characters followed by ":" and a whitespace character, that is "NO2: ".
Finally, (\d+) matches the following digits you want as a group. $ matches the end of the line.
To get all the matches as a list you can use
matches = re.findall(r'^\s*\w+:\s(\d+)$', parameters, re.MULTILINE)
As re.MULTILINE is specified,
the pattern character '^' matches at the beginning of the string and
at the beginning of each line.
as stated in the docs.
The result is as follows
>>> print(matches)
['42602', '42401', '42101']
To put my two cents in, you could simply use
re.findall(r'(\b\d+\b)', parameters)
See a demo on regex101.com.
If you happen to have other digits floating around somewhere in your string, be more precise with
\w+:\s*(\d+)
See another demo on regex101.com.
re.findall(r'(?<=:\s)\d+', parameters)
Should work. You can learn more about look-behind from here.
You just need to specify where in your string you want to search for digits. You can use:
re.findall(r': (\d+)', parameters)
This tells Python to capture only the digits that come right after ":" and a space.
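As a small sketch building on the capturing-group answers above: Python's re module only accepts fixed-width lookbehinds, so (?<=:\s) insists on exactly one whitespace character, whereas a capturing group with \s* tolerates any amount. The variable names pairs, codes and by_name are illustrative, and capturing the parameter name as well is an optional extra the question did not ask for.

```python
import re

parameters = '''
NO2: 42602
SO2: 42401
CO: 42101
'''

# Capture the name and the digits; \s* allows zero or more spaces after ":".
pairs = re.findall(r'(\w+):\s*(\d+)', parameters)

# Keep only the numbers, in order of appearance.
codes = [code for _name, code in pairs]
# Optionally keep the association between name and number.
by_name = dict(pairs)

print(codes)           # ['42602', '42401', '42101']
print(by_name["NO2"])  # 42602
```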

How to find values in specific format including parenthesis using regular expression in python [duplicate]

This question already has answers here:
Regular expression to return text between parenthesis
(11 answers)
Closed 2 years ago.
I have a long string S, and I want to find numeric values in the format "Value(**)", where ** is the value I want to extract.
For example, S is "abcdef Value(34) Value(56) Value(13)", then I want to extract values 34, 56, 13 from S.
I tried to use regex as follows.
import re
regex = re.compile('\Value(.*'))
re.findall(regex, S)
But the code yields the result I did not expect.
Edit. I edited some mistakes.
You should escape the parentheses, correct the typo of Value (as opposed to Values), use a lazy repeater *? instead of *, add the missing right parenthesis, and capture what's enclosed in the escaped parentheses with a pair of parentheses:
regex = re.compile(r'Value\((.*?)\)')
Only one of your numbers follows the word 'Value', so you can extract anything inside parentheses. You also need to escape the parentheses which are special characters.
regex = re.compile(r'\(.*?\)')
re.findall(regex, S)
Output:
['(34)', '(56)', '(13)']
I think what you're looking for is a capturing group that can return multiple matches. The pattern is: (\(\d{2}\))?. \d matches a digit and {2} requires exactly 2 of them; {1,2} would match 1 or 2 digits, etc. The trailing ? makes the group optional, that is, it matches 0 or 1 times. Each group can be indexed and combined into a list. This is a simple implementation and will return the two-digit numbers within parentheses.
eg. 'asdasd Value(102222), fgdf(20), he(77)' will match 20 and 77 but not 102222.
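For comparison, here is a minimal sketch that ties the match to the literal word Value and accepts any number of digits, combining the escaping advice from the first answer with \d+. Converting the results to int is an assumption about what the asker wants to do with them.

```python
import re

S = "abcdef Value(34) Value(56) Value(13)"

# \( and \) match literal parentheses; (\d+) captures the digits between them.
values = [int(v) for v in re.findall(r'Value\((\d+)\)', S)]
print(values)  # [34, 56, 13]

# Parenthesised numbers that do not follow "Value" are ignored.
other = 'asdasd Value(102222), fgdf(20), he(77)'
print(re.findall(r'Value\((\d+)\)', other))  # ['102222']
```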

Regex to exclude specific special characters, spaces and alphabets [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 4 years ago.
I want a regular expression which converts this:
91009-01-28-00 Maximum (c/s)................ 1543.5
to this:
91009-01-28-00 1543.5
So basically, a regular expression that escapes alphabets, spaces, forward slashes and brackets.
I have written the following python code so far:
with open('lcstats.txt', 'r') as lcstats_file:
    with open(lcstats_full_path + '_lcstats_full.txt', "a+") as lcstats_full_file:
        lcstats_full_file.write(obsid)
        for line in lcstats_file.readlines():
            if not re.search(r'Maximum [(c/s)]', line):
                continue
            line = (re.sub(**REGEX**,'',line))
            lcstats_full_file.write(line)
It appears you want to keep the first and last parts of the string. If that is the case for every line, then splitting it accordingly can be helpful, as in the following code:
import re
line = "91009-01-28-00 Maximum (c/s) ................ 1543.5"
line=line.split(' ')
line=line[0]+' '+ line[-1]
print(line)
Output:
91009-01-28-00 1543.5
In your code you are using search to check if you can match Maximum (c/s) and then you want to use a regex to remove that.
I think with your regex Maximum [(c/s)] you mean Maximum \(c/s\). The square brackets turn (c/s) into a character class that matches just one of the characters (, c, /, s or ), and unescaped parentheses outside a class would form a capturing group, which is not required if you only want to match the literal text.
What you could do is match Maximum (c/s) followed by one or more spaces or dots using the character class [ .]+ and replace the match with an empty string.
Maximum \(c/s\)[ .]+
import re
s = "91009-01-28-00 Maximum (c/s)................ 1543.5"
print( re.sub(r"Maximum \(c/s\)[ .]+", "", s))
Demo
Try using the regex \s[^0-9]+ (written /\s[^0-9]+/ with delimiters). It matches from the first space through one or more non-digit characters. You will need to put a space in the replacement string to keep the two remaining pieces of data separate.
Regex:
((?<!\d)\D)
Match every non-digit \D that is not preceded by a digit \d; (?<!\d) is a negative lookbehind.
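An alternative to deleting the middle of the line is to capture the two pieces that should survive. The sketch below assumes every relevant line looks like the sample in the question (an observation ID, the literal text Maximum (c/s), filler dots and spaces, then a value that does not start with a dot); LINE_RE and obsid_and_maximum are hypothetical names.

```python
import re

# Group 1: the leading ID; group 2: the trailing value.
# [ .]* skips the filler dots and spaces between the label and the value.
LINE_RE = re.compile(r'(\S+)\s+Maximum \(c/s\)[ .]*(\S+)')


def obsid_and_maximum(line):
    """Return 'ID value' for a 'Maximum (c/s)' line, or None for other lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return f"{m.group(1)} {m.group(2)}"


s = "91009-01-28-00 Maximum (c/s)................ 1543.5"
print(obsid_and_maximum(s))  # 91009-01-28-00 1543.5
```

Returning None for non-matching lines lets the caller skip them with a single check, which replaces the separate re.search test in the question's loop.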

Working With Python Regex [duplicate]

This question already has answers here:
Checking whole string with a regex
(5 answers)
Closed 5 years ago.
I am trying to compile a regex on python but am having limited success. I am doing the following
import re
pattern = re.compile("([a-zA-Z0-9_])([a-zA-Z0-9_-]*)")
m=pattern.match("gb,&^(#)")
if m: print(1)
else: print(2)
I am expecting the output of the above to print 2, but instead it is printing one. The regex should match strings as follows:
The first letter is alphanumeric or an underscore. All characters after that can be alphanumeric, an underscore, or a dash and there can be 0 or more characters after the first.
I was thinking that this thing should fail as soon as it sees the comma, but it is not.
What am I doing wrong here?
import re
pattern = re.compile("^([a-zA-Z0-9_])([a-zA-Z0-9_-]*)$")  # without $ at the end, match() succeeds as soon as a prefix of the string satisfies the regex
m = pattern.match("gb,&^(#)")
if m:
    print(1)
else:
    print(2)
pat = re.compile("^([a-zA-Z0-9_])([a-zA-Z0-9_-]*)")  # your original pattern: matches any string that starts with an alphanumeric character or an underscore
if pat.match("_#"):
    print('match')
else:
    print('no match 1')
This should also help you understand the explanation by @Wiktor with an example.
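In Python 3.4 and later, re.fullmatch expresses the same requirement without explicit anchors. This is a sketch of that alternative, not what the answer above uses; NAME_RE and is_valid are illustrative names.

```python
import re

# No ^ or $ needed: fullmatch() requires the whole string to match.
NAME_RE = re.compile(r'[a-zA-Z0-9_][a-zA-Z0-9_-]*')


def is_valid(name):
    """True if the entire string is a valid name under the rules in the question."""
    return NAME_RE.fullmatch(name) is not None


print(is_valid("gb,&^(#)"))  # False: the comma is not allowed
print(is_valid("gb"))        # True
print(is_valid("_a-b"))      # True
print(is_valid("-ab"))       # False: a dash cannot come first
print(is_valid("gb\n"))      # False: a trailing newline is rejected too
```

One difference from the anchored version: $ also matches just before a newline at the end of the string, so a pattern ending in $ accepts "gb\n" while fullmatch does not.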

My regular expression is not getting matched exactly in python [duplicate]

This question already has answers here:
Checking whole string with a regex
(5 answers)
Closed 6 years ago.
Here's my code...
import re
l=["chap","chap11","chapa","chapb","chapc","chap3","chap2","chapf","chap4","chap55","chapf","chap33","chap54","chapgk"]
for i in l:
    matchobj=re.match(r'chap[0-9]',i,re.M|re.I)
    if matchobj:
        print(i)
Since I have written chap[0-9], it should match only those strings that have exactly one digit after chap,
so I should get the following output:
chap3
chap2
chap4
but I am getting the following output...
chap11
chap3
chap2
chap4
chap55
chap33
chap54
re.match only anchors your pattern at the beginning of the string, so anything may follow the matched part. Append, e.g., the end-of-string anchor '$' or a word boundary '\b' to your pattern:
matchobj=re.match(r'chap\d$',i,re.M|re.I)
# \d (digit) is shortcut for [0-9]
From the docs on re.match:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.
You should add a dollar sign to the end of your regular expression. The dollar ($) means the end of the string, and for future reference, the caret (^) signifies the beginning.
import re
l=["chap","chap11","chapa","chapb","chapc","chap3","chap2","chapf","chap4","chap55","chapf","chap33","chap54","chapgk"]
for i in l:
    matchobj=re.match(r'chap[0-9]$',i,re.M|re.I)
    if matchobj:
        print(i)
Output
chap3
chap2
chap4
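A side note on the re.M flag used above, as a hedged sketch: with re.M, $ also matches at the end of each line, so a multi-line string whose first line is chap3 would still pass. If that matters, re.fullmatch with only re.I is stricter. The shortened list and the sample "chap3\nnotes" are made up for illustration.

```python
import re

names = ["chap", "chap11", "chapa", "chap3", "chap2", "chap4", "chap55"]

# Keep only names that consist of "chap" plus exactly one digit.
single_digit = [n for n in names if re.fullmatch(r'chap\d', n, re.I)]
print(single_digit)  # ['chap3', 'chap2', 'chap4']

# With re.M, $ matches before the embedded newline, so this still matches.
print(bool(re.match(r'chap\d$', "chap3\nnotes", re.M)))     # True
# fullmatch() has to consume the whole string, so this does not.
print(bool(re.fullmatch(r'chap\d', "chap3\nnotes", re.I)))  # False
```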
