Regex Python [python-2.7]

I'm working on a Python program that sifts through a .txt file to find the genus and species name. The lines are formatted like this (yes, the equals signs are consistently around the common name):
1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.
I can't seem to figure out a regex that will match only the genus and species and not the common name. I know the equals signs (=) will probably help in some way, but I cannot think of how to use them.
Edit: Some real data:
1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.
2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.
3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.
4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.

You probably don't need regex for this one. If the order and count of the words are always the same, you can just split each line into a list of substrings and pick the genus and species out by index; for the sample format, they are the fourth and fifth tokens. The code will probably look like this:
with open('myfilename.txt') as myfile:
    for line in myfile:
        words = line.split()
        # "1. =Common Name= Genus Species ..." -> genus and species are
        # the tokens at indices 3 and 4
        genus, species = words[3], words[4]
It just looks a little more "pythonic" to me.
If the common name can consist of multiple words, the code above will return an incorrect result. To get the right result in that case too, you can split on the equals signs instead:
with open('myfilename.txt') as myfile:
    for line in myfile:
        # Take everything after the closing "=" and split it on whitespace.
        # If the program returns wrong results, try changing the index from
        # 2 to 1 or 3; the right number depends on whether any symbols can
        # appear before the first "=".
        words = line.split('=')[2].split()
        genus, species = words[0], words[1]
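A quick check against the real data from the edit (note the species token keeps its trailing period, which you may want to strip):
line = '1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species.'
words = line.split('=')[2].split()
genus, species = words[0], words[1].rstrip('.')
print(genus, species)  # ÆCHMOPHORUS OCCIDENTALIS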

If it is enough to capture the words in groups (and you don't want a direct match) you can try:
(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))
The desired values will be in the groups <genus> and <species>. The whole regex is a positive lookahead, so it matches a zero-width position at the beginning of the string, but it still captures content into the groups.
(?=\d\.\s*=[^=]+=\s - a digit and a dot, followed by some content between equals signs and a space,
(?:(?P<genus>\w+)\s(?P<species>\w+))) - captures the first word into the genus group and the second word into the species group.
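A minimal sketch of reading the named groups out in Python (the line and variable names are mine):
import re

pattern = re.compile(r'(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))')
line = '1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species.'
m = pattern.search(line)
if m:
    # Æ counts as a word character under Python 3's default Unicode matching
    print(m.group('genus'), m.group('species'))  # ÆCHMOPHORUS OCCIDENTALIS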

You can try something like:
import re

txt = '1. =Common Name= Genus Species some other words that I don\'t want.'

re1 = '.*?'                # Non-greedy match on filler
re2 = '(?:[a-z][a-z]+)'    # Uninteresting: word
re3 = '.*?'                # Non-greedy match on filler
re4 = '(?:[a-z][a-z]+)'    # Uninteresting: word
re5 = '.*?'                # Non-greedy match on filler
re6 = '((?:[a-z][a-z]+))'  # Word 1
re7 = '.*?'                # Non-greedy match on filler
re8 = '((?:[a-z][a-z]+))'  # Word 2

rg = re.compile(re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8, re.IGNORECASE | re.DOTALL)
m = rg.search(txt)
if m:
    word1 = m.group(1)
    word2 = m.group(2)
    print "(" + word1 + ")" + "(" + word2 + ")" + "\n"
In your test input as shown in txt, this will print
(Genus)(Species)
You can use this awesome site to help build regexes like this!
Hope this helps

Related

Removing words and symbols from columns which do not match specific criteria

I need to remove from each row the words which are not in English, specific symbols like | or -, and three dots (...) if they are at the end of the row.
To do this, I was considering using the googletranslate or langdetect packages in Python to detect and remove non-English words from the text, and creating a list for the symbols.
To apply them, I was doing the following:
df['Text'] == df['Text'].apply(lambda x: detect(x) == 'en') # but this just detect the rows. I would like to remove only not English words within rows, not the whole rows.
df['Text'] = df['Text'].map(lambda x: str(x)[:-4]) # I would need to consider however a logical condition: if the last three characters are ..., then remove these three dots from the string.
to_remove=['|','-', '(',')']
df['Text'] = df['Text'].str.contains(|, to_remove)
english_data = [word for word in df['Text'].tolist() if detect_language(word) == 'English']
The column I should apply these changes to is:
Text
The is in with a... - KIDS ...
BoneMA – Synthesis and Characterization of a Methacrylated ...
新型冠状病毒肺炎诊疗方案 (试行第七版) - Law Translate
Expected output:
Text
The is in with a... KIDS
BoneMA Synthesis and Characterization of a Methacrylated
Law Translate
Any help and suggestions would be appreciated.
With a regex:
df['Text'].str.replace(r'[^0-9a-zA-Z.]|[.]+$', ' ', regex=True).str.replace(r'\s{2,}', ' ', regex=True)
Output
0 The is in with a... KIDS
1 BoneMA Synthesis and Characterization of a Methacrylated
2 Law Translate
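A self-contained sketch of that one-liner (assuming a recent pandas, where regex=True must be passed explicitly):
import pandas as pd

df = pd.DataFrame({'Text': [
    'The is in with a... - KIDS ...',
    'BoneMA – Synthesis and Characterization of a Methacrylated ...',
    '新型冠状病毒肺炎诊疗方案 (试行第七版) - Law Translate',
]})

# Replace anything that is not alphanumeric or a dot, plus any trailing run
# of dots, with a space, then collapse repeated whitespace into one space.
cleaned = (df['Text']
           .str.replace(r'[^0-9a-zA-Z.]|[.]+$', ' ', regex=True)
           .str.replace(r'\s{2,}', ' ', regex=True)
           .str.strip())
print(cleaned)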

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python; apologies for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have a text variable, which contains a lot of text information.
I did
test = text.split()
sorted(test)
As a result, I receive a list which starts with symbols like $ and numbers.
How do I get to the words and print N of them?
I'm assuming by "word" you mean strings that consist of only alphabetical characters. In that case, you can use filter() to first get rid of the unwanted strings, turn the result into a set, sort it, and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll be going with this regex - ^[A-Za-z']+$ - which means the string must only contain letters and '; you may add more to this regex according to what you deem as "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re

WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky when you have a string like hi! What's your name?. Here, hi! and name? are words, except they are not fully alphabetic. The trick is to split the string in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
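Still, as a rough sketch of that kind of splitting, re.findall with the same character class pulls word-like runs out directly, so punctuation never sticks to the words:
import re

text = "hi! What's your name?"
# Find runs of letters/apostrophes instead of splitting on spaces
words = re.findall(r"[A-Za-z']+", text)
print(words)  # ['hi', "What's", 'your', 'name']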
I am a newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted list to take everything up to position 5:
sorted(test)[:5]
or if looking only for words
sorted([i for i in test if i.isalpha()])[:5]
or by regex (remember to import re):
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
By slicing the list you get all elements up to a specific index, in this case 5.

Regex sub only removes certain expressions

I'm running a program which creates product labels based on CSV data. The function I am struggling with takes a data structure consisting of a number combination (the width of a wooden plank) and a string (the name of the product). Possible combinations I search for are as follows:
5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF
My function needs to take in the data, split the width from the name and return them both as separate variables as follows:
desc = row[1]
if filter.lower() in desc.lower():
    size = re.search(r'(\d{1})(\-*)(\d{0,1})(\/*)(\d{0,2})(\+*)(\d{0,1})(\-*)(\d{0,1})(\/*)(\d{0,2})', desc)
    if size:
        # remove size from description
        desc = re.sub(size.group(), '', desc)
        size = size.group()  # extract match from obj
    else:
        size = "None"
The function does as intended with the first two samples; however, when it comes across the last product, it recognizes the size but does not remove it from the description. The screenshot below shows the output after I print size + '\n' + desc.
Is there an issue with my re expression or elsewhere?
Thanks
re.sub() expects its first argument to be a regex. It works for the first two samples because they don't contain any characters that have special meaning in that context; however, the third contains +, which is special.
There's not actually any reason to use regex there... regular string replacement should work:
desc = desc.replace(size.group(), '')
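(If you did want to keep re.sub, escaping the matched text first also works; a small sketch using re.escape on the third sample:)
import re

desc = '2-1/4+4-1/4 MAPLE TIMBERWOLF'
size = '2-1/4+4-1/4'  # what size.group() returns for this sample
# re.escape neutralizes the '+' and any other metacharacters in the match
desc = re.sub(re.escape(size), '', desc).strip()
print(desc)  # MAPLE TIMBERWOLF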
Why replace and not simply match what you need?
import re

text = """5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF""".split('\n')

pattern = r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
for t in text:
    m = re.search(pattern, t)
    print(m.group('size'))
    print(m.group('species'))
Output:
5
MAPLE PEPPER-ANTIQUE
3-1/4
MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4
MAPLE TIMBERWOLF
Regex:
r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
Two named groups, with zero or more spaces between them:
1st group: only 0123456789-+/ allowed
2nd group: anything but 0123456789 allowed

Delete regex matches within loop and continue with updated string version

I have a string that I want to run through four wordlists: one with four-grams, one with trigrams, one with bigrams, and one with single terms. To avoid a word from the single-term wordlist being counted twice when it also forms part of a bigram or trigram, for example, I start by counting four-grams, then want to update the string by removing those matches so that only the remaining part of the string is checked for trigrams, bigrams, and single terms, respectively. I have used the following code, illustrated here starting with four-grams and then trigrams:
import re

financial_trigrams_count = 0
financial_fourgrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

for i in pattern_fourgrams:
    financial_fourgrams_count = financial_fourgrams_count + strn.count(i)

new_strn = strn
def clean_text1(pattern_fourgrams, new_strn):
    for r in pattern_fourgrams:
        new_strn = re.sub(r, '', new_strn)
    return new_strn

for i in pattern_trigrams:
    financial_trigrams_count = financial_trigrams_count + new_strn.count(i)

new_strn1 = new_strn
def clean_text2(pattern_trigrams, new_strn1):
    for r in pattern_trigrams:
        new_strn1 = re.sub(r, '', new_strn1)
    return new_strn1

print(financial_fourgrams_count)
print(financial_trigrams_count)

word_count_wostop = len(strn.split())
print(word_count_wostop)
For four-grams there is no match, so new_strn will be identical to strn. However, there is one match for the trigrams ("chief financial officer"), yet I do not succeed in deleting the match from new_strn1. Instead, I again get the full string, namely strn (or new_strn, which is the same).
Could someone help me find the mistake here?
(As a complement to Tilak Putta's answer)
Note that you are searching the string twice: once when counting the occurrences of the ngrams with .count() and once more when you remove the matches using re.sub().
You can increase performance by counting and removing at the same time.
This can be done using re.subn. This function takes the same parameters as re.sub but returns a tuple containing the cleaned string as well as the number of matches.
Example:
for i in pattern_fourgrams:
    new_strn, n = re.subn(i, '', new_strn)
    financial_fourgrams_count += n
Note that this assumes the n-grams are pairwise different (for fixed n), i.e. they shouldn't have a common word, since subn will delete that word the first time it sees it and thus won't be able to find occurrences of other n-grams containing that particular word.
You need to remove the def statements:
import re

financial_trigrams_count = 0
financial_fourgrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

for i in pattern_fourgrams:
    financial_fourgrams_count = financial_fourgrams_count + strn.count(i)

new_strn = strn
for r in pattern_fourgrams:
    new_strn = re.sub(r, '', new_strn)

for i in pattern_trigrams:
    financial_trigrams_count = financial_trigrams_count + new_strn.count(i)

new_strn1 = new_strn
for r in pattern_trigrams:
    new_strn1 = re.sub(r, '', new_strn1)

print(new_strn1)
print(financial_fourgrams_count)
print(financial_trigrams_count)

word_count_wostop = len(strn.split())
print(word_count_wostop)

python, re.search / re.split for phrases which look like a title, i.e. starting with an upper case letter

I have a list of phrases (input by the user) that I'd like to locate in a text file, for example:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1./ I can find all the occurrences of these phrases like so
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
list = [ t for t in titre.split(text) if titre.search(t) ]
(For simplicity, I am assuming a perfect spacing.)
2./ I can also find variants of these phrases, e.g. 'Blue team', 'final Match', 'best player' ... using re.I, if they ever appear in the text.
But I want to restrict the matches to variants of the input phrases with their first letter upper-cased, e.g. 'Blue team' in the text, regardless of how they were entered, e.g. 'bluE tEAm'.
Is it possible to write something to "block" the re.I flag for a portion of a phrase? In pseudo code, I imagine generating something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, calculating frequency of the input phrases in the text but extracting and analyzing the text fragments between or around them.
I would use re.I and modify the list comprehension to:
l = [ t for t in titre.split(text) if titre.search(t) and t[0].isupper() ]
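Put together, a quick sketch using the question's own setup (the variable names are mine):
import re

titles = ['Blue Team', 'Final Match', 'Best Player']
text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.I)
# Keep only the fragments that match a title AND start with an upper-case letter
l = [t for t in titre.split(text) if titre.search(t) and t[0].isupper()]
print(l)  # ['Final match', 'Best player', 'Blue Team']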
I think regular expressions won't let you specify just a region where the ignore case flag is applicable (scoped inline flags such as (?i:...) only arrived in Python 3.6). However, you can generate a new version of the text in which all the characters have been lower cased, except the first one of every word:
new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])
This way, a regular expression without the ignore flag will match taking into account the casing only for the first character of each word.
How about modifying the input so that it is in the correct case before you use it in the regular expression?
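A sketch of that suggestion: re-case each phrase up front and, assuming Python 3.6+ (where scoped inline flags are available), let the rest of each phrase match case-insensitively with (?i:...):
import re

titles = ['bluE tEAm', 'final Match']

# First letter must match as an upper-case literal; the remainder of the
# phrase is wrapped in a scoped (?i:...) group so it matches any casing.
parts = [re.escape(t[0].upper()) + '(?i:' + re.escape(t[1:].lower()) + ')'
         for t in titles]
titre = re.compile('|'.join(parts))

print(titre.findall('the Blue team beat the bluE tEAm in the Final match'))
# ['Blue team', 'Final match']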
