extracting strings using regular expression - python

I have the following strings:
LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]
How can I extract the following substrings from the strings above?
cysteine proteinase 5-like
uncharacterized protein LOC107059219
peroxidase 40-like
Retrovirus-related Pol polyprotein from transposon TNT 1-94
hypothetical protein VITISV_035070

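I don't think this problem needs a regex. I would prefer the following solution because it is easy to understand:
st = "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]"
print(st.split(":")[-1].split("[")[0].strip())  # uncharacterized protein LOC107059219
Note that a line without a colon, such as the RVW92024.1 one, keeps its leading accession ID with this approach.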
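If the accession IDs at the start of some lines (for example RVW92024.1) also have to go, a regex can still be handy. The following is only a sketch: the shape of the accession ID and of the upper-case labels is an assumption inferred from the five sample lines, and the names PATTERN and extract_description are made up for illustration.

```python
import re

# Assumed prefix shapes (not stated in the question): an optional accession ID
# such as "XP_019244624.1" or "RVW92024.1", followed by any number of
# upper-case labels ending in a colon ("PREDICTED:", "LOW QUALITY PROTEIN:").
PATTERN = re.compile(
    r"^(?:[A-Z]{2,3}_?\d+\.\d+\s+)?"  # optional accession ID
    r"(?:[A-Z ]+:\s*)*"               # optional upper-case labels
    r"(.*?)"                          # the description to keep
    r"\s*\[[^\]]*\]$"                 # trailing [species name]
)

def extract_description(line):
    """Return the description part of a line, or None if it does not match."""
    match = PATTERN.match(line.strip())
    return match.group(1) if match else None

lines = [
    "LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]",
    "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]",
    "XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]",
    "RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]",
    "hypothetical protein VITISV_035070 [Vitis vinifera]",
]
descriptions = [extract_description(line) for line in lines]
for description in descriptions:
    print(description)
```

Lines that do not end in a bracketed species name return None, so they can be inspected separately instead of being silently mangled.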

Related

How to match complete words for acronym using regex?

I want to get only the complete words that an acronym in parentheses stands for.
For example, given the sentence
'Lung cancer screening (LCS) reduces NSCLC mortality'
I want to get 'Lung cancer screening' as the result.
How can I do this with a regex?
Original question:
I want to remove runs of consecutive uppercase letters:
"HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer" => " acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer"

Assuming you want to target 2 or more capital letters, I would use re.sub here:
import re
inp = "Lung cancer screening (LCS) reduces NSCLC mortality"
output = re.sub(r'\s*(?:\([A-Z]+\)|[A-Z]{2,})\s*', ' ', inp).strip()
print(output)  # Lung cancer screening reduces mortality

import re
s = 'HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer'
# remove whole words made of two or more capital letters
print(re.sub(r'\b[A-Z]{2,}\b', '', s).strip())
Based on @jensgram's answer.
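The edited question asks for the long form in front of a parenthesised acronym rather than for removing the acronym. One possible sketch is shown below. It assumes that each letter of the acronym is the initial of one word directly before the parentheses, which will not hold for every acronym. The function name expand_acronyms is made up for illustration.

```python
import re

def expand_acronyms(text):
    """Map each parenthesised acronym to the words in front of it.

    Assumption: the acronym has one letter per word, and those words
    immediately precede the opening parenthesis.
    """
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        words_before = text[:match.start()].split()
        candidate = words_before[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            found[acronym] = " ".join(candidate)
    return found

print(expand_acronyms("Lung cancer screening (LCS) reduces NSCLC mortality"))
```

Acronyms whose initials do not line up with the preceding words are skipped rather than guessed at.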

Find best matches of substring from list in corpus

I have a corpus that looks something like this
LETTER AGREEMENT N°5 CHINA SOUTHERN AIRLINES COMPANY LIMITED Bai Yun
Airport, Guangzhou 510405, People's Republic of China Subject: Delays
CHINA SOUTHERN AIRLINES COMPANY LIMITED (the ""Buyer"") and AIRBUS
S.A.S. (the ""Seller"") have entered into a purchase agreement (the
""Agreement"") dated as of even date
And a list of company names that looks like this
l = [ 'airbus', 'airbus internal', 'china southern airlines', ... ]
The elements of this list do not always have exact matches in the corpus, because of different formulations or just typos: for this reason I want to perform fuzzy matching.
What is the most efficient way of finding the best matches of l in the corpus? In theory the task is not super difficult but I don't see a way of solving it that does not entail looping through both the corpus and list of matches, which could cause huge slowdowns.

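You can combine the names in your list l into a single regular expression, then use the third-party regex module, which supports fuzzy matching (https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109), to search for them in the corpus.
Something like
import regex  # third-party: pip install regex (the built-in re module has no fuzzy matching)
# (?:name){e<=3} lets the whole name match with at most 3 errors; an exact match is also allowed
my_regex = "|".join(f"(?:{regex.escape(pattern)}){{e<=3}}" for pattern in l)
# corpus is the text to search; BESTMATCH prefers the match with the fewest errors
matches = regex.findall(my_regex, corpus, flags=regex.IGNORECASE | regex.BESTMATCH)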
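As a baseline to compare against, the same idea can be sketched with the standard library only. This swaps in a different technique from the fuzzy regex above: it slides a window of words over the corpus and scores each window with difflib.SequenceMatcher. It does loop over both the names and the corpus, so it is only reasonable for small inputs. The function name best_match and the shortened corpus are illustrative assumptions.

```python
import difflib

def best_match(name, corpus):
    """Return (score, window) for the word window most similar to name."""
    words = corpus.lower().split()
    size = len(name.split())
    best = (0.0, "")
    for start in range(len(words) - size + 1):
        window = " ".join(words[start:start + size])
        score = difflib.SequenceMatcher(None, name.lower(), window).ratio()
        if score > best[0]:
            best = (score, window)
    return best

corpus = ("CHINA SOUTHERN AIRLINES COMPANY LIMITED (the Buyer) and AIRBUS "
          "S.A.S. (the Seller) have entered into a purchase agreement")
for name in ["airbus", "china southen airlines"]:  # second name has a typo
    print(name, "->", best_match(name, corpus))
```

A score threshold (for example, rejecting anything under 0.8) would be needed to discard names that do not really occur in the corpus; the right value depends on the data.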

Python regex - get contents in between

I have a word/text file containing,
1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.
(A)15kJ
(B)23kJ
(C)32kJ
(D)50kJ
[Answer]:(B)
[QuestionType]:single_correct
2. Which of the following statement is correct
(A)Li is hander than the other alkali metals.
(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.
(C)Na2CO3 is pearl ash.
(D)Berylium and Aluminium ions do not have strong tendency to form complexes like
[Answer]:(C)
[QuestionType]:single_correct
I need to get each question as a separate list, from the question number through the [QuestionType] line
(from "1." to [QuestionType]).
Output :
[[1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.,(A)15kJ,(B)23kJ,(C)32kJ,(D)50kJ,[Answer]:(B),[QuestionType]:single_correct],
[2. Which of the following statement is correct,(A)Li is hander than the other alkali metals.,(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.,(C)Na2CO3 is pearl ash.,(D)Berylium and Aluminium ions do not have strong tendency to form complexes like ,[Answer]:(C),[QuestionType]:single_correct]]
I tried a for loop but could not get the contents in between:
import docx
import re
doc = docx.Document("QnA.docx")
for i in doc.paragraphs:
    if re.match(r"^[0-9]+[.]+", i.text):
        print(i.text)  # matched number condition
    if re.match(r"(^\[QuestionType\])", i.text):
        print(i.text)  # matched QuestionType condition

You might use a single pattern, starting the match with 1 or more digits and a dot.
Then continue matching all the lines that do not start with [QuestionType] and finally match that line.
^\d+\..*(?:\r?\n(?!\[QuestionType]).*)*\r?\n\[QuestionType]:.*
For example
import re
regex = r"^\d+\..*(?:\r?\n(?!\[QuestionType]).*)*\r?\n\[QuestionType]:.*"
s = ("1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.\n"
"(A)15kJ\n"
"(B)23kJ\n"
"(C)32kJ\n"
"(D)50kJ\n\n"
"[Answer]:(B)\n\n"
"[QuestionType]:single_correct\n\n"
"2. Which of the following statement is correct\n\n"
"(A)Li is hander than the other alkali metals.\n"
"(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.\n"
"(C)Na2CO3 is pearl ash.\n"
"(D)Berylium and Aluminium ions do not have strong tendency to form complexes like \n\n"
"[Answer]:(C)\n\n"
"[QuestionType]:single_correct")
print(re.findall(regex, s, re.M))
Output
['1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.\n(A)15kJ\n(B)23kJ\n(C)32kJ\n(D)50kJ\n\n[Answer]:(B)\n\n[QuestionType]:single_correct', '2. Which of the following statement is correct\n\n(A)Li is hander than the other alkali metals.\n(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.\n(C)Na2CO3 is pearl ash.\n(D)Berylium and Aluminium ions do not have strong tendency to form complexes like \n\n[Answer]:(C)\n\n[QuestionType]:single_correct']

First, get the content of each question using a regex. Then split each question's content on \n.
You could try the following regex.
\d+\.[\s\S]+?QuestionType.*
I also tested it in Python.
import re
content = '''1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.
(A)15kJ
(B)23kJ
(C)32kJ
(D)50kJ
[Answer]:(B)
[QuestionType]:single_correct
2. Which of the following statement is correct
(A)Li is hander than the other alkali metals.
(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.
(C)Na2CO3 is pearl ash.
(D)Berylium and Aluminium ions do not have strong tendency to form complexes like
[Answer]:(C)
[QuestionType]:single_correct
'''
splitQuestion = re.findall(r"\d+\.[\s\S]+?QuestionType.*", content)
result = []
for eachQuestion in splitQuestion:
    result.append(eachQuestion.split("\n"))
print(result)
Result:
[['1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.', '(A)15kJ', '(B)23kJ', '(C)32kJ', '(D)50kJ', '', '[Answer]:(B)', '', '[QuestionType]:single_correct'], ['2. Which of the following statement is correct', '', '(A)Li is hander than the other alkali metals.', '(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.', '(C)Na2CO3 is pearl ash.', '(D)Berylium and Aluminium ions do not have strong tendency to form complexes like ', '', '[Answer]:(C)', '', '[QuestionType]:single_correct']]
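Since the question starts from a loop over paragraphs, the grouping can also be done line by line without one large regex. The sketch below is one way to do that: it takes any list of strings, so with python-docx it could be fed [p.text for p in doc.paragraphs]. The function name group_questions is made up for illustration, and blank lines are dropped on the assumption that they are not wanted in the output.

```python
import re

def group_questions(lines):
    """Group lines into questions: from a 'N.' line through '[QuestionType]'."""
    questions = []
    current = None  # lines of the question being collected, if any
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are not wanted in the result
        if current is None:
            if re.match(r"\d+\.", text):
                current = [text]  # a numbered line opens a new question
        else:
            current.append(text)
            if text.startswith("[QuestionType]"):
                questions.append(current)  # this line closes the question
                current = None
    return questions

sample = [
    "1. First question", "(A)15kJ", "(B)23kJ", "", "[Answer]:(B)",
    "[QuestionType]:single_correct",
    "2. Second question", "(A)yes", "(B)no", "[Answer]:(A)",
    "[QuestionType]:single_correct",
]
for question in group_questions(sample):
    print(question)
```

A question that never reaches a [QuestionType] line is left out of the result, which makes truncated input easy to notice.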

Python: Finding The Longest/Shortest Sentence In A Random Paragraph?

I am using Python 2.7 and need 2 functions to find the longest and shortest sentence (in terms of word count) in a random paragraph. For example, if I choose to put in this paragraph:
"Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
The output for this should be 36 and 16 with 36 meaning there are 36 words in the longest sentence and 16 words in the shortest sentence.

def MaxMinWords(paragraph):
    # skip the empty piece that split('.') leaves after the final period
    numWords = [len(sentence.split()) for sentence in paragraph.split('.') if sentence.strip()]
    return max(numWords), min(numWords)
EDIT : As many have pointed out in the comments, this solution is far from robust. The point of this snippet is to simply serve as a pointer to the OP.

You need a way to split the paragraph into sentences and a way to count the words in each sentence. You could use the nltk package for both:
from nltk.tokenize import sent_tokenize, word_tokenize # $ pip install nltk
sentences = sent_tokenize(paragraph)
word_count = lambda sentence: len(word_tokenize(sentence))
print(min(sentences, key=word_count)) # the shortest sentence by word count
print(max(sentences, key=word_count)) # the longest sentence by word count

EDIT: As has been mentioned in the comments below, programmatically determining what constitutes a sentence in a paragraph is quite a complex task. However, given the example you provided, the following is a reasonable starting point.
First, we want to tokenize the paragraph into sentences. We do this by splitting the text on every occurrence of ". " (a period followed by a space). This returns a list of strings, each of which is a sentence.
We then want to break each sentence into its corresponding list of words. Then, using this list of lists, we want the sentence (represented as a list of words) whose length is a maximum and the sentence whose length is a minimum. Consider the following code:
par = "Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
# split paragraph into sentences
sentences = par.split(". ")
# split each sentence into words
tokenized_sentences = [sentence.split(" ") for sentence in sentences]
# get longest sentence and its length
longest_sen = max(tokenized_sentences, key=len)
longest_sen_len = len(longest_sen)
# get shortest sentence and its length
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_len = len(shortest_sen)
print(longest_sen_len)
print(shortest_sen_len)
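A slightly more tolerant split, still using only the standard library, is to break after ., ! or ? when whitespace follows. This remains a heuristic (abbreviations such as "e.g." would still be split) and is a sketch, not a full sentence tokenizer. The function name max_min_words is made up for illustration.

```python
import re

def max_min_words(paragraph):
    """Return (longest, shortest) sentence length in words."""
    # split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    counts = [len(sentence.split()) for sentence in sentences if sentence]
    return max(counts), min(counts)

sample = "One two three. Four five! Six seven eight nine?"
print(max_min_words(sample))  # (4, 2)
```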

Extract text between last occurrence of braces

I have strings like this,
Protein XVZ [Human]
Protein ABC [Mouse]
Protein CDY [Chicken [type1]]
Protein BBC [type 2] [Bacteria]
Output should be,
Human
Mouse
Chicken [type1]
Bacteria
Thus, I want everything inside the last pair of braces. Braces that precede that pair must be ignored as in last example. Is there an effective way to do this in Python? Thanks in advance for your help.

How about this:
import re
strings = ["Protein XVZ [Human]","Protein ABC [Mouse]","go UDP[3] glucosamine N-acyltransferase [virus1]","Protein CDY [Chicken [type1]]","Protein BBC [type 2] [Bacteria] [cat] [mat]","gi p19-gag protein [2] [Human T-lymphotropic virus 2]"]
# everything from the first "[" up to the final "]" at the end of the string
pattern = re.compile(r"\[(.*?)\]$")
for string in strings:
    match = re.search(pattern, string)
    # drop any earlier bracket groups, keeping only the last one
    lastBracket = re.split(r"\].*\[", match.group(1))[-1]
    print(lastBracket)
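When bracket groups can be both nested and repeated in the same string, counting bracket depth from the right avoids relying on a regex. The following is a sketch under the assumption that the brackets are balanced. The function name last_bracket_group is made up for illustration.

```python
def last_bracket_group(text):
    """Return the contents of the last top-level [...] group, or None."""
    depth = 0
    end = None  # index of the "]" that closes the last group
    for i in range(len(text) - 1, -1, -1):  # scan from right to left
        ch = text[i]
        if ch == "]":
            if depth == 0:
                end = i
            depth += 1
        elif ch == "[" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[i + 1:end]
    return None

for s in ["Protein XVZ [Human]", "Protein CDY [Chicken [type1]]",
          "Protein BBC [type 2] [Bacteria]"]:
    print(last_bracket_group(s))
```

Strings without a complete bracket pair return None instead of raising an error.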
