Regex: remove the unwanted substring after the second occurrence of a hyphen in Python

Below are the strings from which I need to extract the meaningful IDs:
'12345-1-abcde-aBCD'
'123-Abcdefghi abcdefghijkl'
'1234567-1-AB-ABC A/1 ABC (AB1234)'
'12345-ABC-Abcdefghijkl'
'123456-Abcdefgh'
'12345-AB1CDE'
The regex should handle all of the above cases and produce the following output:
12345-1
123
1234567-1
12345
123456
12345
The regex should drop everything from a hyphen onward if that hyphen is followed by letters.

You can do this:
import re
l = ['12345-1-abcde-aBCD',
'123-Abcdefghi abcdefghijkl',
'1234567-1-AB-ABC A/1 ABC (AB1234)',
'12345-ABC-Abcdefghijkl',
'123456-Abcdefgh',
'12345-AB1CDE',]
In [10]: for s in l:
    ...:     print(re.match(r'^(\d+[-]?\d+?)',s))
    ...:
<re.Match object; span=(0, 7), match='12345-1'>
<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(0, 9), match='1234567-1'>
<re.Match object; span=(0, 5), match='12345'>
<re.Match object; span=(0, 6), match='123456'>
<re.Match object; span=(0, 5), match='12345'>
If the ID can contain several hyphen-separated groups of digits, you can do something like:
l = ['12345-1-abcde-aBCD',
'123-Abcdefghi abcdefghijkl',
'1234567-1-AB-ABC A/1 ABC (AB1234)',
'12345-ABC-Abcdefghijkl',
'123456-Abcdefgh',
'12345-AB1CDE',
'12345-1-1-ABC',
'1-2-3-4-5-A-B-C-D-E-F-/-(AB12345)0',
'12345-1A Abcd',]
In [31]: for s in l:
    ...:     match = re.match(r'^([\d-]*)(?![A-Za-z])',s)
    ...:     print(match.group(0).rstrip('-'))
    ...:
12345-1
123
1234567-1
12345
123456
12345
12345-1-1
1-2-3-4-5
12345
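As a hedged alternative, the same nine inputs can be handled without the rstrip step by accepting a hyphen only when it is followed by a run of digits that ends at a word boundary. The helper name extract_id and the pattern below are illustrative assumptions, not part of the original answer.

```python
import re

# Leading digits, then zero or more "-digits" groups. The \b rejects a
# group that runs straight into a letter (e.g. the "-1A" in "12345-1A").
ID_RE = re.compile(r"\d+(?:-\d+\b)*")

def extract_id(s):
    """Return the leading numeric ID of s, or None if s does not start with a digit."""
    m = ID_RE.match(s)
    return m.group(0) if m else None

samples = [
    '12345-1-abcde-aBCD',
    '123-Abcdefghi abcdefghijkl',
    '1234567-1-AB-ABC A/1 ABC (AB1234)',
    '12345-ABC-Abcdefghijkl',
    '123456-Abcdefgh',
    '12345-AB1CDE',
    '12345-1-1-ABC',
    '1-2-3-4-5-A-B-C-D-E-F-/-(AB12345)0',
    '12345-1A Abcd',
]
ids = [extract_id(s) for s in samples]
print(ids)
```

Because the hyphen is only consumed together with the digits that follow it, no trailing hyphen is ever left to strip.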

Related

RegEx to validate for-loop header in Python

I take a string from the user that should be a for-loop header and need to validate it with a regex. My pattern matches successfully up to the in keyword, but how do I match the ending, i.e. range(): or variable:?
import re
loop_header = input("Enter the for loop header: ")
print(re.search(r'for[\s]+[a-zA-Z]+[\S][0-9]*[\s]+in[\s]+', loop_header))
I am not able to figure out how to validate the ending part of loop header.
With the regex pattern slightly modified (now taking variable number of non-space characters in the iterator, and ensuring a colon at the end), the following 4 different for-loop headers can be correctly matched:
import re
loop_headers = [
"for i in [1, 2, 3]:",
"for num in [1, 2, 3]:",
"for i in (1, 2, 3):",
"for word in enumerate(['and', 'now', 'for', 'something', 'completely', 'different']):",
]
regex = re.compile(r"for[\s]+[a-zA-Z]+[\S]*[0-9]*[\s]*in[\s]*(range\(.*\)|enumerate\(.*\)|zip\(.*\))?.*:$")
for loop_header in loop_headers:
    print(re.search(regex, loop_header))
Returning
<re.Match object; span=(0, 19), match='for i in [1, 2, 3]:'>
<re.Match object; span=(0, 21), match='for num in [1, 2, 3]:'>
<re.Match object; span=(0, 19), match='for i in (1, 2, 3):'>
<re.Match object; span=(0, 85), match="for word in enumerate(['and', 'now', 'for', 'something', 'completely', 'different']):">
You could then additionally have a second regex to require range, enumerate, zip etc. around the iterable, e.g.:
regex_2 = re.compile(
r"for[\s]+[a-zA-Z]+[\S]*[0-9]*[\s]*in[\s]*(range\(.*\)|enumerate\(.*\)|zip\(.*\))+:$"
)
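The patterns above accept any run of non-space characters as the loop variable. A stricter sketch might require a plain identifier and a non-empty iterable expression. The names HEADER_RE and is_for_header are hypothetical, and the sketch assumes a single loop target (no tuple unpacking such as "for i, x in ..."); it checks only that an iterable is present, not that it is valid Python.

```python
import re

# Assumptions: one plain identifier as the loop target, then "in", then at
# least one non-space character for the iterable, then a closing colon.
HEADER_RE = re.compile(r"for\s+[A-Za-z_]\w*\s+in\s+\S.*:\s*")

def is_for_header(line):
    """Return True if the whole line looks like a simple for-loop header."""
    return HEADER_RE.fullmatch(line) is not None

for header in [
    "for i in [1, 2, 3]:",
    "for num in range(10):",
    "for i in:",          # missing iterable
    "for 1x in y:",       # invalid identifier
    "for i in range(3)",  # missing colon
]:
    print(header, "->", is_for_header(header))
```

A regex can only approximate Python's grammar, so this is a shape check rather than a full syntax check.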

Finding a given substring in a string with Regular Expression in Python

I am trying to find all the occurrences of a substring in a string like below:
import re
S = 'aaadaa'
matches = re.finditer('(aa)', S)
if matches:
    #print(matches)
    for match in matches:
        print(match)
else:
    print("No match")
The current output is:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
But I am expecting that it should give the values as:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(1, 3), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
Could someone please help me on this?
Taken from the answer I linked in the comments, here is the pattern you need:
(?=(aa)).
You’ll have to access the matched substring using match_obj.group(1), and the match indices using match_obj.span(1).
The problem here is that once the re module matches a double aa, it will also consume both of the letters. But, you want overlapping matches. One trick you could use here would be to search for a(?=a):
S = 'aaadaa'
matches = re.findall(r'a(?=a)', S)
matches = [s + "a" for s in matches]
print(matches)
['aa', 'aa', 'aa']
Note that we tag on the second a to the output list, since only the first letter is actually matched at each step.
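For completeness, here is a minimal sketch of the capturing-lookahead approach from the first answer, written without the trailing dot and applied to the string from the question. The variable name overlapping is only illustrative.

```python
import re

S = 'aaadaa'

# The lookahead consumes nothing, so the scan advances one character at a
# time; group 1 holds the text that the lookahead saw at that position.
overlapping = [(m.span(1), m.group(1)) for m in re.finditer(r'(?=(aa))', S)]
print(overlapping)
# [((0, 2), 'aa'), ((1, 3), 'aa'), ((4, 6), 'aa')]
```

Note that m.span() (group 0) would be empty at each position, which is why group 1 is used for both the text and the indices.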

How to negate default IGNORECASE by string modifier in python re

I have a script that searches through a text file for lines that match a regex passed in as a command line argument.
By default the script does a case-insensitive search (I want it this way).
How do I pass the -i flag inside the regex argument to force a case-sensitive search? I have tried the following but could not figure it out.
It always performs a case insensitive search.
I have tried this on both python 2.7 and 3.6.
>>> import re
>>> res1 = 'TEST'
>>> res2 = 'test'
>>> res3 = '(?-i:)TEST'
>>> res4 = '(?-i:)TeSt'
>>> res5 = '((?-i:)TeSt)'
>>>
>>> string = 'TeSt'
>>>
>>> def str_match(re_str = ''):
...     print(re.search(r'(?i)' + re_str, string))
...
>>>
>>> str_match(res1)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match(res2)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match(res3)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match(res4)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match(res5)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> def str_match_2(re_str = ''):
...     print(re.search(re_str, string, re.I))
...
>>>
>>> str_match_2(res1)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match_2(res2)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match_2(res3)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match_2(res4)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match_2(res5)
<_sre.SRE_Match object; span=(0, 4), match='TeSt'>
>>>
>>> str_match('none')
None
>>>
>>> str_match_2('none')
None
I think I figured it out; I should have read the documentation more carefully.
The part of the pattern that must be case sensitive has to go inside the (?-i:...) group, not after it:
res5 = '(?-i:TeSt)'
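A short sketch of how the scoped flag behaves when the search is case insensitive by default. It assumes Python 3.6 or later, where scoped inline flags such as (?-i:...) are available; the helper name search_ci is hypothetical and mirrors str_match from the question.

```python
import re

def search_ci(pattern, text):
    # Case-insensitive by default, like str_match in the question.
    return re.search(r'(?i)' + pattern, text)

print(search_ci('test', 'TeSt'))        # matches: whole pattern ignores case
print(search_ci('(?-i:TeSt)', 'TeSt'))  # matches: exact case inside the group
print(search_ci('(?-i:TeSt)', 'TEST'))  # None: the group is case sensitive
print(search_ci('(?-i:Te)st', 'TeST'))  # matches: only "Te" is case sensitive
```

The earlier attempts such as '(?-i:)TEST' left the group empty, so the flag was switched off for nothing and TEST was still matched case insensitively.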

How can I compare a regex object as if I were using flex, in Python?

I have the following code. Given the input in text_to_search, I want to classify each token as an instruction, an ID (any word that is not an instruction), or an operator. So far it prints when it finds an instruction, but for the ID part it also prints Set instead of, for example, jaja. How can I achieve this?
text_to_search="Set Sets UnionShowSets jaja:={hi};"
import re
t=re.search(r'Sets?|ShowSet|ShowSets|Union|Intersect|SetUnion|SetIntersect',text_to_search)
s=re.search(r':=|{|}|;',text_to_search)
d=t=re.search(r'[a-zA-Z0-9]+',text_to_search)
if t:
    print("Instruction: ")
    print(t)
else:
    print("ID: ")
    print(d)
if s:
    print("Operator: ")
    print(s)
Print result:
Instruction:
<_sre.SRE_Match object; span=(0, 3), match='Set'>
Operator:
<_sre.SRE_Match object; span=(27, 29), match=':='>
Desired print result:
Instruction:
<_sre.SRE_Match object; span=(0, 3), match='Set'>
Instruction:
<_sre.SRE_Match object; span=(0, 3), match='Sets'>
Instruction:
<_sre.SRE_Match object; span=(0, 3), match='Union'>
Instruction:
<_sre.SRE_Match object; span=(0, 3), match='ShowSets'>
ID:
<_sre.SRE_Match object; span=(0, 3), match='jaja'>
ID:
<_sre.SRE_Match object; span=(0, 3), match='hi'>
Operator:
<_sre.SRE_Match object; span=(0, 3), match='{'>
Operator:
<_sre.SRE_Match object; span=(0, 3), match='}'>
Operator:
<_sre.SRE_Match object; span=(27, 29), match=':='>
Operator:
<_sre.SRE_Match object; span=(27, 29), match=';'>
I fixed it by saving in a list the elements I didn't want to print again:
text_to_search="Set Sets UnionShowSets jaja:={hola};"
import re
x=[]
for match in re.finditer('Sets?|ShowSet|ShowSets|Union|Intersect|SetUnion|SetIntersect',text_to_search):
    print("Instruction: ")
    print(match)
    # Store the span: Match objects from separate finditer calls never compare equal.
    x.append(match.span())
for match in re.finditer(r':=|{|}|;',text_to_search):
    print("Operator: ")
    print(match)
for match in re.finditer(r'[a-zA-Z0-9]+',text_to_search):
    if match.span() in x:
        continue
    else:
        print("ID: ")
        print(match)
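A closer analogue to a flex scanner is a single pass over the input with one combined pattern, where each alternative is a named group. The sketch below is one possible way to do that, not the poster's method; TOKEN_RE and tokenize are hypothetical names. Unlike flex, which picks the longest match, Python's alternation takes the first alternative that matches, so the longer keywords are listed first.

```python
import re

# One named group per token class. Longer keywords come first because
# Python alternation is first-match, not longest-match as in flex.
TOKEN_RE = re.compile(
    r"(?P<INSTRUCTION>SetIntersect|SetUnion|ShowSets|ShowSet|Intersect|Union|Sets|Set)"
    r"|(?P<OPERATOR>:=|[{};])"
    r"|(?P<ID>[A-Za-z0-9]+)"
)

def tokenize(text):
    """Return (token_class, lexeme) pairs in input order; other characters are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

tokens = tokenize("Set Sets UnionShowSets jaja:={hi};")
for kind, lexeme in tokens:
    print(kind, lexeme)
```

One caveat of this design: because keywords are matched without a word boundary (so that UnionShowSets splits into Union and ShowSets), an identifier that merely starts with a keyword, such as Setting, would be split into Set and ting.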

Python Regular Expressions - How is "+?" equivalent to "*"

* : 0 or more occurrences of the pattern to its left
+ : 1 or more occurrences of the pattern to its left
? : 0 or 1 occurrences of the pattern to its left
How is "+?" equivalent to "*" ?
Consider a search for any 3-letter word, if it exists.
re.search(r'(\w\w\w)*', "abc")
In the first case, * tries to match 0 or more occurrences of the pattern to its left, which here is the group of 3 letters. So it will either try to find a 3-letter word or fail.
re.search(r'(\w\w\w)+?', "abc")
The second case is supposed to give the same output, but I'm confused as to why "*" and "+?" are equivalent. Can you please explain this?
* and +? are not equivalent. The ? takes on a special meaning if it follows a quantifier, making that quantifier lazy.
Usually, quantifiers are greedy, meaning they will try to match as many repetitions as they can; lazy quantifiers match as few as they can. But a+? will still match at least one a.
In [1]: re.search("(a*)(.*)", "aaaaaa").groups()
Out[1]: ('aaaaaa', '')
In [2]: re.search("(a+?)(.*)", "aaaaaa").groups()
Out[2]: ('a', 'aaaaa')
In your example, both regexes happen to match the same text because both (\w\w\w)* and (\w\w\w)+? can match three letters, and there are exactly three letters in your input. But they will differ in other strings:
In [12]: re.search(r"(\w\w\w)+?", "abcdef")
Out[12]: <_sre.SRE_Match object; span=(0, 3), match='abc'>
In [13]: re.search(r"(\w\w\w)+?", "ab") # No match
In [14]: re.search(r"(\w\w\w)*", "abcdef")
Out[14]: <_sre.SRE_Match object; span=(0, 6), match='abcdef'>
In [15]: re.search(r"(\w\w\w)*", "ab")
Out[15]: <_sre.SRE_Match object; span=(0, 0), match=''>
If you try a simpler expression you will see that they are not the same:
>>> import re
>>> re.search("[0-9]*", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search("[0-9]*", "")
<_sre.SRE_Match object; span=(0, 0), match=''>
>>> re.search("[0-9]+", "")
>>> re.search("[0-9]+", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
The problem in your code is that (\w\w\w)+? means one or more occurrences (matched lazily), not zero or more, so it cannot match nothing the way * can.
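To see the four variants side by side, here is a small sketch that applies a*, a*?, a+ and a+? to the same inputs. The helper matched and the results dictionary are illustrative names only.

```python
import re

def matched(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    return m.group() if m else None

results = {
    (q, text): matched("a" + q, text)
    for text in ("aaa", "")
    for q in ("*", "*?", "+", "+?")
}
for (q, text), value in results.items():
    print(f"a{q} on {text!r}: {value!r}")
```

On "aaa" the greedy forms take all three letters, a*? takes nothing and a+? takes exactly one; on the empty string only the starred forms match at all.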
