Regex to catch only the certain part of the string - python

Is there universal regex to catch only the names of companies?
Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc
Q4_2018_Control4_Corp
The output should be:
American_Airlines_Group_Inc
Apple_Inc
Alcoa_Inc
Arconic_Inc
Orkla_ASA
AGCO_Corp
Autodesk_Inc
Note:
The name of the company may contain symbols or numbers

You can use this regex,
[a-zA-Z]+(?:_[a-zA-Z]+)*$
Your company names all consist of alphabetical words separated by underscores and running to the end of the string, so the regex above works for them.
Here, [a-zA-Z]+ matches the first alphabetical word of the company name, (?:_[a-zA-Z]+)* matches any further alphabetical words preceded by an underscore, and $ ensures the match ends at the end of the string.
Python code,
import re
arr = ['Q4_2017_American_Airlines_Group_Inc','Q1_2016_Apple_Inc','Q4_2014_Alcoa_Inc','Q3_2015_Arconic_Inc','Q3_2017_Orkla_ASA','Q2_2018_AGCO_Corp','Quarter_3_2018_Autodesk_Inc']
for s in arr:
    m = re.search(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$', s)
    print(s, '-->', m.group())
Prints,
Q4_2017_American_Airlines_Group_Inc --> American_Airlines_Group_Inc
Q1_2016_Apple_Inc --> Apple_Inc
Q4_2014_Alcoa_Inc --> Alcoa_Inc
Q3_2015_Arconic_Inc --> Arconic_Inc
Q3_2017_Orkla_ASA --> Orkla_ASA
Q2_2018_AGCO_Corp --> AGCO_Corp
Quarter_3_2018_Autodesk_Inc --> Autodesk_Inc
Also, if the company names are all in a single multi-line string, you can use re.findall with the (?m) (multiline) flag, so that $ matches at the end of every line, to list all company names:
import re
s = '''Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc'''
print(re.findall(r'(?m)[a-zA-Z]+(?:_[a-zA-Z]+)*$', s))
Prints,
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
Edit:
As Chyngyz Akmatov pointed out, if the name can contain numbers or, in general, any symbol, then the following regex extracts it correctly. It assumes the company name starts right after the four-digit year and an underscore.
(?<=\d{4}_).*$
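A minimal sketch of how this lookbehind pattern might be applied in Python, assuming every input contains a four-digit year followed by an underscore just before the company name. The list name `names` is only illustrative; the last entry is the question's Q4_2018_Control4_Corp, whose name contains a digit.

```python
import re

# Sample inputs taken from the question; the last one has a digit in the name.
names = [
    'Q4_2017_American_Airlines_Group_Inc',
    'Quarter_3_2018_Autodesk_Inc',
    'Q4_2018_Control4_Corp',
]

# The lookbehind requires four digits and an underscore right before the match.
pattern = re.compile(r'(?<=\d{4}_).*$')

extracted = []
for s in names:
    m = pattern.search(s)
    # Keep None for inputs without a "<year>_" prefix instead of raising.
    extracted.append(m.group() if m else None)

print(extracted)
```

The lookbehind is fixed-width (five characters), which Python's re module requires.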

You can use re.sub:
import re
# content is assumed to hold the newline-separated input strings
data = [re.sub(r'\w+\d{4}_', '', i) for i in filter(None, content.split('\n'))]
Output:
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']

You can also use this regex:
_\d+(?:_\d+)*_(.*)
Code:
import re
lst = ['Q4_2017_American_Airlines_Group_Inc', 'Q1_2016_Apple_Inc', 'Q4_2014_Alcoa_Inc', 'Q3_2015_Arconic_Inc', 'Q3_2017_Orkla_ASA', 'Q2_2018_AGCO_Corp', 'Quarter_3_2018_Autodesk_Inc']
for x in lst:
    print(re.search(r'_\d+(?:_\d+)*_(.*)', x).group(1))
# American_Airlines_Group_Inc
# Apple_Inc
# Alcoa_Inc
# Arconic_Inc
# Orkla_ASA
# AGCO_Corp
# Autodesk_Inc

Assuming there are only normal letters and the names are the end of each line :
grep -o '[A-Za-z][A-Za-z_]*$' names

Related

How to extract person name using regular expression?

I am new to regular expressions and I have a kind of phone directory. I want to extract the names from it. I wrote the code below, but it extracts lots of unwanted text rather than just names. Can you tell me what I am doing wrong and how to correct it? Here is my code:
import re
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA#anet.com
Cellular: 612-393-0029
Dustin Andrews'''
nameRegex = re.compile('''
(
[A-Za-z]{2,25}
\s
([A-Za-z]{2,25})+
)
''',re.VERBOSE)
print(nameRegex.findall(directory))
the output it gives is:
[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]
Would be really grateful for help!
Your problem is that \s will also match newlines. Instead of \s just add a space. That is
name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names) then you may want to expand this to something like:
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)
This looks for one or more words and will stretch from the beginning to end of a line (e.g. will not just get 'John Paul' from 'John Paul Jones')
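As a rough illustration, the multi-line pattern above can be run against a shortened copy of the directory from the question. This sketch uses finditer and the whole match, group(0), rather than findall, so the result does not depend on what the repeated capturing group happens to hold; the variable `found` is just an illustrative name.

```python
import re

# Shortened copy of the directory text from the question.
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
Cellular: 612-393-0029
Dustin Andrews'''

# ^ and $ match at line boundaries because of re.MULTILINE.
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)

# Take the whole match so a full line is returned for each name.
found = [m.group(0) for m in name_regex.finditer(directory)]
print(found)
```

Lines containing digits, colons or parentheses are rejected because those characters are not in the character class.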
I suggest trying the following regex; it works for me:
"([A-Z][a-z]+\s[A-Z][a-z]+)"
The following regex works as expected.
Related part of the code:
nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)
print(nameRegex.findall(directory))
Output:
$ python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
Try:
nameRegex = re.compile(r'^((?:\w+\s*){2,})$', flags=re.MULTILINE)
This will only choose complete lines that are made up of two or more names composed of 'word' characters.

Repeated regex groups of arbitrary number

I have this example text snippet
headline:
Status[apphmi]: blubb, 'Statustext1'
Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'
Popup[apphmi]: blaaa, 'Popuptext1'
and I want to extract the words within '', but sorted with the context (status, main, popup).
My current regex is (example at pythex.org):
headline:(?:\n +Status\[apphmi\]:.* '(.*)')*(?:\n +Main\[apphmi\]:.* '(.*)')*(?:\n +Popup\[apphmi\]:.* '(.*)')*
but with this I only get 'Maintext2' and not both. I don't know how to repeat the groups to an arbitrary number.
You can try with this:
r"(.*?]):(?:[^']*)'([^']*)'"g
Group 1 and group 2 of each match contain your key/value pair.
A single regex match cannot collect the repeated values into one group; once you have all the pairs, you can merge duplicate keys in code.
Here I have used a dictionary of lists: if a key already exists in the dictionary, append the value to its list; otherwise insert the key with a new list holding the value.
This is how it can be done (tested in Python 3):
import re
d = dict()
regex = r"(.*?]):(?:[^']*)'([^']*)'"
test_str = ("headline: \n"
"Status[apphmi]: blubb, 'Statustext1'\n"
"Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'\n"
"Popup[apphmi]: blaaa, 'Popuptext1'")
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    if match.group(1) in d:
        d[match.group(1)].append(match.group(2))
    else:
        d[match.group(1)] = [match.group(2),]
print(d)
Output:
{
'Popup[apphmi]': ['Popuptext1'],
'Main[apphmi]': ['Maintext1', 'Maintext2'],
'Status[apphmi]': ['Statustext1']
}
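As a possible variation on the same idea, collections.defaultdict removes the explicit "key already exists" check. This is only a sketch that reuses the regex and test string from the answer; the name `grouped` is illustrative.

```python
import re
from collections import defaultdict

regex = r"(.*?]):(?:[^']*)'([^']*)'"
test_str = ("headline: \n"
            "Status[apphmi]: blubb, 'Statustext1'\n"
            "Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'\n"
            "Popup[apphmi]: blaaa, 'Popuptext1'")

# Missing keys start out as an empty list automatically.
grouped = defaultdict(list)

# With two capturing groups, findall yields (key, value) tuples.
for key, value in re.findall(regex, test_str):
    grouped[key].append(value)

print(dict(grouped))
```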

Using Regular expressions to match a portion of the string?(python)

What regular expression can i use to match genes(in bold) in the gene list string:
GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8
I tried : GENE_List:((( \w+).(\w+));)+* but it only captures the last gene
Given:
>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
You can use Python string methods to do:
>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
For a regex:
(?<=[:;]\s)([^\s;]+)
Or, in Python:
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
You can use the following:
\s([^;\s]+)
The captured group, ([^;\s]+), will contain the desired substrings, each of which is preceded by whitespace (\s).
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
UPDATE
It's in fact much simpler:
[^\s;]+
However, first take a substring containing only the part you need (the genes, without the GENE_LIST: prefix).
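A short sketch of that two-step approach, assuming the prefix is always separated from the genes by the first colon; `gene_part` and `genes` are illustrative names.

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Step 1: drop everything up to and including the first colon.
_, _, gene_part = s.partition(':')

# Step 2: every run of characters that is neither whitespace nor ';' is a gene.
genes = re.findall(r'[^\s;]+', gene_part)
print(genes)
```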
string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)
The output is:
['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']

removing the \n when extracted the program

I made a regex for the number of followers on twitter and i have to extract it
# Create a regex for number of followers
followersRegex = re.compile(r'''
(
(\s|-) # first separator
\d\d # first 2 digits
, # separator
\d\d\d # hundred thousands
, # separator
\d\d\d # hundreds
)
''', re.VERBOSE)
# Extract username/followers from this text
extractedFollowers = followersRegex.findall(text)
allFollowers = []
for followerCount in extractedFollowers:
    allFollowers.append(followerCount[0])
but whenever i run it, this appears:
['\n90,280,191', '\n84,239,451', '\n79,215,375', '\n75,925,596', '\n62,869,696']
How do i remove the \n?
>>> lst = ['\n90,280,191', '\n84,239,451', '\n79,215,375', '\n75,925,596', '\n62,869,696']
>>> [i.replace('\n', '') for i in lst]
# ['90,280,191', '84,239,451', '79,215,375', '75,925,596', '62,869,696']
If you provide more information about the original strings you are applying the regex to, maybe I could help with the regex part.
You can use replace or lstrip.
>>> lst = ['\n90,280,191', '\n84,239,451', '\n79,215,375', '\n75,925,596', '\n62,869,696']
>>> [i.lstrip('\n') for i in lst]
['90,280,191', '84,239,451', '79,215,375', '75,925,596', '62,869,696']
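Another option is to keep the separator out of the captured text in the first place. The sketch below assumes the input looks roughly like the hypothetical `text` shown, since the original text was not included in the question. The separator sits in a non-capturing group, so findall returns only the digits.

```python
import re

# Hypothetical sample input; the original text was not shown in the question.
text = "user_a\n90,280,191\nuser_b\n84,239,451"

followers_regex = re.compile(r'''
    (?:\s|-)                # separator, matched but not captured
    (\d\d,\d\d\d,\d\d\d)    # the follower count itself
''', re.VERBOSE)

# With a single capturing group, findall returns a list of plain strings.
all_followers = followers_regex.findall(text)
print(all_followers)
```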

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?
You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings, because each character is consumed by at most one match. However, a lookahead (?=..) (i.e. "followed by") is only a check and consumes nothing, so the same characters can be examined several times. Thus, if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
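The effect is easiest to see on a smaller, made-up example. This sketch compares a plain pattern with the same pattern wrapped in a lookahead and a capturing group, using a string of digits chosen purely for illustration.

```python
import re

digits = "12345"

# Plain pattern: each character is consumed once, so pairs cannot overlap.
plain = re.findall(r'\d\d', digits)

# Lookahead with a capturing group: nothing is consumed, so the engine
# retries at every position and the captured pairs overlap.
overlapping = re.findall(r'(?=(\d\d))', digits)

print(plain)        # ['12', '34']
print(overlapping)  # ['12', '23', '34', '45']
```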
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the matches, testing each consecutive pair against this regex: (^[A-Z][a-z.-]+$), to ensure that both the previous word and the current word are capitalized.
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
if m:
    # start at 1 so that m[i - 1] never wraps around to the last word
    for i in range(1, len(m)):
        if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
            matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]
If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every second pair, because findall returns non-overlapping matches. An easy solution is to search for the pattern again after skipping the first word, allowing a dot in the first word of each pair so that Elasticsearch.js and Node.js can start a pair:
m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall(r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2.
