Regex works fine on Pythex, but not in Python


I used the following regular expression on pythex to test it:
(\d|t)(_\d+){1}\.
It works fine, and I am primarily interested in group 2.
However, I can't get Python to show me the correct results. Here's an MWE:
import re
fn_list = ['IMG_0064.png',
'IMG_0064.JPG',
'IMG_0064_1.JPG',
'IMG_0064_2.JPG',
'IMG_0064_2.PNG',
'IMG_0064_2.BMP',
'IMG_0064_3.JPEG',
'IMG_0065.JPG',
'IMG_0065.JPEG',
'IMG-20150623-00176-preview-left.jpg',
'IMG-20150623-00176-preview-left_2.jpg',
'thumb_2595.bmp',
'thumb_2595_1.bmp',
'thumb_2595_15.bmp']
pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
for line in fn_list:
    search_obj = re.match(pattern, line)
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)
The output is nothing.
However, pythex clearly shows two groups returned for each match; the second group should be present, and the pattern should hit many more files. What am I doing wrong?

You need to use re.search(), not re.match(). re.search() matches anywhere in the string, whereas re.match() matches only at the beginning.
import re
fn_list = ['IMG_0064.png',
'IMG_0064.JPG',
'IMG_0064_1.JPG',
'IMG_0064_2.JPG',
'IMG_0064_2.PNG',
'IMG_0064_2.BMP',
'IMG_0064_3.JPEG',
'IMG_0065.JPG',
'IMG_0065.JPEG',
'IMG-20150623-00176-preview-left.jpg',
'IMG-20150623-00176-preview-left_2.jpg',
'thumb_2595.bmp',
'thumb_2595_1.bmp',
'thumb_2595_15.bmp']
pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
for line in fn_list:
    search_obj = re.search(pattern, line)  # CHANGED HERE
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)
Result:
('4', '_1')
('4', '_2')
('4', '_2')
('4', '_2')
('4', '_3')
('t', '_2')
('5', '_1')
('5', '_15')
Since you are compiling the regular expression, you can do search_obj = pattern.search(line) instead of search_obj = re.search(pattern, line). As for your regular expression itself, r'([\dt])(_\d+)\.' is equivalent to the one you're using, and a bit cleaner.

You need to use the following code:
import re
fn_list = ['IMG_0064.png',
'IMG_0064.JPG',
'IMG_0064_1.JPG',
'IMG_0064_2.JPG',
'IMG_0064_2.PNG',
'IMG_0064_2.BMP',
'IMG_0064_3.JPEG',
'IMG_0065.JPG',
'IMG_0065.JPEG',
'IMG-20150623-00176-preview-left.jpg',
'IMG-20150623-00176-preview-left_2.jpg',
'thumb_2595.bmp',
'thumb_2595_1.bmp',
'thumb_2595_15.bmp']
pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE) # OPTIMIZED REGEX A BIT
for line in fn_list:
    search_obj = pattern.search(line)  # YOU NEED SEARCH WITH THE COMPILED REGEX
    if search_obj:
        matching_group = search_obj.group(2)  # YOU NEED TO ACCESS GROUP 2 IF YOU ARE INTERESTED JUST IN GROUP 2
        print(matching_group)
As for the regex, (\d|t) is the same as ([\dt]), but the latter is more efficient. Also, {1} is redundant in regex.
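To see both points on a single file name, here is a minimal sketch. The variable names (original, simplified, at_start, anywhere, same) are only illustrative; the patterns and the file name are taken from the question and answers above.

```python
import re

# The asker's pattern and the simplified one suggested in the answers.
original = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
simplified = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)

name = 'thumb_2595_15.bmp'

# match() only tries at position 0: 't' fits group 1, but 'h' is not '_'.
at_start = original.match(name)

# search() scans the whole string and finds '5_15.' near the end.
anywhere = original.search(name)
same = simplified.search(name)

print(at_start)            # None
print(anywhere.groups())   # ('5', '_15')
print(same.groups())       # ('5', '_15')
```

Both patterns return the same groups here, so the only change that matters for the question is calling search() instead of match().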

Related

Regex pattern to match multiple characters and split

I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well. There are empty elements? They are not grouped together? And I can't figure out how to grab the last pattern text without it affecting the first 2 patterns. I'd also like it if all 3 matches were in a tuple together rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]

Don't use alternation. Put the name and number patterns one after the other in a single pattern, and add a third group that captures the rest of the line. Since . does not match a newline by default, (.*) stops at the end of each line.
import re
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(pattern, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
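Note that the trailing https://google.com line is not part of any tuple, because the last group stops at the end of its line. If the text should instead run up to the next **Name** (as the question asks), one possible variant is sketched below. It assumes the number is always digits and that every new entry starts on its own line; the name entry is hypothetical.

```python
import re

note = ("**Jane Greiz** `#1`: Should be open here .\n"
        "**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n"
        "**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com")

# DOTALL lets '.' cross newlines; the lazy last group stops right before
# the next line that starts with '**', or at the end of the string.
entry = re.compile(r"\*\*(.+?)\*\*\s*`#(\d+)`:\s*(.*?)(?=\n\*\*|\Z)", re.DOTALL)

entries = entry.findall(note)
for name, number, text in entries:
    print((name, number, text))
```

With this variant the third tuple keeps the URL line as part of its text, and the leading space of each message is dropped by the \s* before the last group.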

Regex to catch only the certain part of the string

Is there a universal regex to catch only the names of companies?
Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc
Q4_2018_Control4_Corp
The output should be:
American_Airlines_Group_Inc
Apple_Inc
Alcoa_Inc
Arconic_Inc
Orkla_ASA
AGCO_Corp
Autodesk_Inc
Note:
The name of the company may contain symbols or numbers

You can use this regex,
[a-zA-Z]+(?:_[a-zA-Z]+)*$
Your company names all consist of alphabetical words separated by underscores, running to the end of the string, and the regex above works fine for that.
Here, [a-zA-Z]+ starts matching an alphabetical word of the company name, (?:_[a-zA-Z]+)* matches any further alphabetical words that are preceded by an underscore, and $ ensures that the match ends at the end of the string.
Python code,
import re
arr = ['Q4_2017_American_Airlines_Group_Inc','Q1_2016_Apple_Inc','Q4_2014_Alcoa_Inc','Q3_2015_Arconic_Inc','Q3_2017_Orkla_ASA','Q2_2018_AGCO_Corp','Quarter_3_2018_Autodesk_Inc']
for s in arr:
    m = re.search(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$', s)
    print(s, '-->', m.group())
Prints,
Q4_2017_American_Airlines_Group_Inc --> American_Airlines_Group_Inc
Q1_2016_Apple_Inc --> Apple_Inc
Q4_2014_Alcoa_Inc --> Alcoa_Inc
Q3_2015_Arconic_Inc --> Arconic_Inc
Q3_2017_Orkla_ASA --> Orkla_ASA
Q2_2018_AGCO_Corp --> AGCO_Corp
Quarter_3_2018_Autodesk_Inc --> Autodesk_Inc
Also, if you have a single string of those company names, then you can use following code and use re.findall to list all company names,
import re
s = '''Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc'''
print(re.findall(r'(?m)[a-zA-Z]+(?:_[a-zA-Z]+)*$', s))
Prints,
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
Edit:
As Chyngyz Akmatov pointed out, if the name can contain numbers and, in general, any symbol, then the following regex gets the name properly. It assumes that the company name starts right after the four-digit year and an underscore.
(?<=\d{4}_).*$
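A short sketch of the difference, using Q4_2018_Control4_Corp from the question (a name that contains a digit). The variable names are only illustrative, and the lookbehind version assumes a four-digit year directly before the name.

```python
import re

names = ['Q4_2017_American_Airlines_Group_Inc',
         'Quarter_3_2018_Autodesk_Inc',
         'Q4_2018_Control4_Corp']

letters_only = re.compile(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$')
after_year = re.compile(r'(?<=\d{4}_).*$')  # fixed-width lookbehind: 4 digits + '_'

# The letters-only pattern cannot cross the '4' in 'Control4'.
truncated = letters_only.search('Q4_2018_Control4_Corp').group()

# The lookbehind pattern takes everything after the year, whatever it contains.
companies = [after_year.search(n).group() for n in names]

print(truncated)   # Corp
print(companies)   # ['American_Airlines_Group_Inc', 'Autodesk_Inc', 'Control4_Corp']
```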

You can use re.sub:
import re
data = [re.sub(r'\w+\d{4}_', '', i) for i in filter(None, content.split('\n'))]  # content is assumed to hold the newline-separated names
Output:
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']

You can also use this regex:
_\d+(?:_\d+)*_(.*)
Code:
import re
lst = ['Q4_2017_American_Airlines_Group_Inc', 'Q1_2016_Apple_Inc', 'Q4_2014_Alcoa_Inc', 'Q3_2015_Arconic_Inc', 'Q3_2017_Orkla_ASA', 'Q2_2018_AGCO_Corp', 'Quarter_3_2018_Autodesk_Inc']
for x in lst:
    print(re.search(r'_\d+(?:_\d+)*_(.*)', x).group(1))
# American_Airlines_Group_Inc
# Apple_Inc
# Alcoa_Inc
# Arconic_Inc
# Orkla_ASA
# AGCO_Corp
# Autodesk_Inc

Assuming the names contain only plain letters and underscores and sit at the end of each line:
grep -o '[A-Za-z][A-Za-z_]*$' names

Python Regular Expression Named Capture Groups

I'm learning regular expressions, specifically named capture groups.
I'm having an issue where I'm not able to figure out how to write an if/else statement for my function findVul().
Basically, how the code works (or should work) is that findVul() goes through data1 and data2, which have been added to the list myDATA.
If the regex finds a match for the entire named group, then it should print out the results. It currently works perfectly.
CODE:
import re
data1 = '''
dwadawa231d .2 vulnerabilities discovered dasdfadfad .One vulnerability discovered 123e2121d21 .12 vulnerabilities discovered sgwegew342 dawdwadasf
2r3232r32ee
'''
data2 = ''' d21d21 .2 vul discovered adqdwdawd .One vulnerability disc d12d21d .two vulnerabilities discovered 2e1e21d1d f21f21
'''
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.finditer(pattern, data)
    for x in match:
        print(x.group())

myDATA = [data1, data2]
count_data = 1
for x in myDATA:
    print('\n--->Reading data{0}\n'.format(count_data))
    count_data += 1
    findVul(x)
OUTPUT:
--->Reading data1
2 vulnerabilities discovered
One vulnerability discovered
12 vulnerabilities discovered
--->Reading data2
Now I want to add an if/else statement to check if there are any matches for the entire named group.
I tried something like this, but it doesn't seem to be working.
CODE:
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.finditer(pattern, data)
    if len(list(match)) != 0:
        print('\nVulnerabilities Found!\n')
        for x in match:
            print(x.group())
    else:
        print('No Vulnerabilities Found!\n')
OUTPUT:
--->Reading data1
Vulnerabilities Found!
--->Reading data2
No Vulnerabilities Found!
As you can see it does not print the vulnerabilities that should be in data1.
Could someone please explain the correct way to do this and why my logic is wrong.
Thanks so much :) !!

The problem is that re.finditer() returns an iterator that is evaluated when you do the len(list(match)) != 0 test; when you iterate over it again in the for-loop, it is already exhausted and there are no items left. The simple fix is just to add a match = list(match) line after the finditer() call.
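A minimal sketch of both the exhaustion and the fix. The sample string and the names PATTERN, find_vul, first_pass and second_pass are made up for illustration; only the regex comes from the question.

```python
import re

PATTERN = re.compile(
    r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')

sample = 'x .2 vulnerabilities discovered y .One vulnerability discovered z'

# An iterator can be consumed only once.
it = PATTERN.finditer(sample)
first_pass = len(list(it))    # walks through every match
second_pass = len(list(it))   # nothing is left to iterate over

def find_vul(data):
    """Print the matches (or a 'not found' message) and return them."""
    matches = list(PATTERN.finditer(data))  # materialize once, reuse freely
    if matches:
        print('Vulnerabilities Found!')
        for m in matches:
            print(m.group('VUL'))
    else:
        print('No Vulnerabilities Found!')
    return [m.group('VUL') for m in matches]

found = find_vul(sample)
print(first_pass, second_pass)   # 2 0
```

Because matches is a list, it can be tested for emptiness and then looped over without losing anything.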

I did some more research after @AdamKG's response.
I wanted to utilize the re.findall() function.
re.findall() will return a list of all matched substrings. In my case I have capture groups inside of my named capture group. This will return a list with tuples.
For example the following regex with data1:
pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
match = re.findall(pattern, data)
Will return a list with tuples:
[('2 vulnerabilities discovered', '2', 'vulnerabilities'), ('One vulnerability discovered', 'One', 'vulnerability'), ('12 vulnerabilities discovered', '12', 'vulnerabilities')]
My final code for findVul():
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.findall(pattern, data)
    if len(match) != 0:
        print('Vulnerabilities Found!\n')
        for x in match:
            print('--> {0}'.format(x[0]))
    else:
        print('No Vulnerability Found!\n')

Using Regular expressions to match a portion of the string?(python)

What regular expression can I use to match the genes (everything after GENE_LIST:) in the gene list string:
GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8
I tried GENE_List:((( \w+).(\w+));)+* but it only captures the last gene.

Given:
>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
You can use Python string methods to do:
>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
For a regex:
(?<=[:;]\s)([^\s;]+)
Or, in Python:
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']

You can use the following:
\s([^;\s]+)
The captured group, ([^;\s]+), will contain the desired substrings, each of which is preceded by whitespace (\s).
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']

UPDATE
It's in fact much simpler:
[^\s;]+
However, first take a substring containing only the part you need (the genes, without the GENE_LIST: prefix).
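A minimal sketch of that two-step approach. Using str.partition() to drop the prefix is just one way to take the substring, and the variable names are illustrative.

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Keep only what follows the first ':' so the label itself is never matched.
_, _, gene_part = s.partition(':')

# Every run of characters that is neither whitespace nor ';' is one gene.
genes = re.findall(r'[^\s;]+', gene_part)
print(genes)
```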

string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)
The output is:
['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?

You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings: each character is consumed once and for all. However, a lookahead (?=...) (i.e. "followed by") is only a check and consumes nothing, so it can test the same part of the string several times. Thus, if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
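To make the contrast concrete, here is a small sketch on the title from the question (without the extra "TEst" word). The names plain and overlapping are illustrative only.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# Ordinary matching consumes both words, so the second word of one pair
# can never start the next pair.
plain = re.findall(r'[A-Z][a-z.]*[\s-][A-Z][a-z.]*', title)

# Inside a lookahead nothing is consumed; the capture group still records
# the pair, and the scan resumes one character further on.
overlapping = re.findall(
    r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))', title)

print(plain)
# ['Announcing Elasticsearch.js', 'For Node.js', 'And The']
print(overlapping)
# ['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js',
#  'Node.js And', 'And The', 'The Browser']
```

The negative lookbehind (?<![A-Za-z.]) keeps a pair from starting in the middle of a word.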

There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate over the matches, testing each consecutive pair against this regex: (^[A-Z][a-z.-]+$), to ensure that both the current word and the previous one are capitalized.
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
if m:
    # start at 1 so that m[i - 1] never wraps around to the last word
    for i in range(1, len(m)):
        if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
            matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]

If your Python code at the moment is this
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every other pair, because findall() only returns non-overlapping matches. An easy solution would be to search again after skipping the first word. Note that the first word of the pattern must then also allow dots, as in [a-z.]+; otherwise a pair starting with 'Elasticsearch.js' can never match:
m = re.match(r"[A-Z][a-z.]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall(r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2 together.
