Extract text between last occurrence of braces - python

I have strings like this,
Protein XVZ [Human]
Protein ABC [Mouse]
Protein CDY [Chicken [type1]]
Protein BBC [type 2] [Bacteria]
Output should be,
Human
Mouse
Chicken [type1]
Bacteria
Thus, I want everything inside the last pair of braces. Braces that precede that pair must be ignored as in last example. Is there an effective way to do this in Python? Thanks in advance for your help.

How about this:
import re
strings = ["Protein XVZ [Human]","Protein ABC [Mouse]","go UDP[3] glucosamine N-acyltransferase [virus1]","Protein CDY [Chicken [type1]]","Protein BBC [type 2] [Bacteria] [cat] [mat]","gi p19-gag protein [2] [Human T-lymphotropic virus 2]"]
# Lazily capture from the first "[" up to the "]" that ends the string
pattern = re.compile(r"\[(.*?)\]$")
for string in strings:
    match = pattern.search(string)
    # Split on "]...[" to drop any earlier bracket groups and keep the last one
    last_bracket = re.split(r"\].*\[", match.group(1))[-1]
    print(last_bracket)
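As an alternative to the regex, the last bracket pair can also be found by scanning backwards from the final closing bracket and counting nesting depth. This is only a minimal sketch and not part of the answer above; the function name last_bracket_group is made up for illustration, and it assumes the brackets in the final group are balanced.

```python
def last_bracket_group(text):
    """Return the text inside the last balanced [...] pair, or None."""
    end = text.rfind("]")
    if end == -1:
        return None  # no closing bracket at all
    depth = 0
    # Walk backwards from the last "]" until its matching "[" is found.
    for i in range(end, -1, -1):
        if text[i] == "]":
            depth += 1
        elif text[i] == "[":
            depth -= 1
            if depth == 0:
                return text[i + 1:end]
    return None  # unbalanced brackets


for line in ["Protein XVZ [Human]",
             "Protein CDY [Chicken [type1]]",
             "Protein BBC [type 2] [Bacteria]"]:
    print(last_bracket_group(line))
```

Because nesting is tracked explicitly, the inner [type1] stays part of the result, while earlier sibling groups such as [type 2] are ignored.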

Related

Extract full sentence with list of words

I want to extract the full sentence if it contains certain keywords (like or love).
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]* like|love [^.]*\.'
re.findall(pattern,text)
Using | as the divider, I expected ['I like blueberry icecream.']
But I only got ['I like']
I also tried pattern = '[^.]*(like|love)[^.]*\.' but only got ['like']
What did I do wrong? I know a single word works with the following regex: '[^.]* like [^.]*\.'
You need to put a group around like|love. Otherwise the | applies to the entire pattern on either side of it, so it matches either a string ending with " like" or a string beginning with "love ".
pattern = '[^.]* (?:like|love) [^.]*\.'
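After more research, I found out I was missing ?:. With a plain group (like|love), re.findall returns only what the group captured; (?:like|love) makes the group non-capturing, so the whole match is returned.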
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]*(?:like|love)[^.]*\.'
Output
['I like blueberry icecream.']
Source: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
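To make the difference concrete, here is a small sketch comparing the capturing and non-capturing versions of the pattern on the text from the question. The variable names are only illustrative.

```python
import re

text = 'I like blueberry icecream. He has a green car. She has blue car.'

# Capturing group: findall returns only what the group captured.
with_capture = re.findall(r'[^.]*(like|love)[^.]*\.', text)

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r'[^.]*(?:like|love)[^.]*\.', text)

print(with_capture)     # ['like']
print(without_capture)  # ['I like blueberry icecream.']
```

Note that neither pattern checks word boundaries, so a word such as "dislike" would also count as a hit; wrapping the group in \b...\b is one way to tighten that if needed.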
I actually think it would be easier to do this without regex. Just my two cents.
text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x for x in text.split('.') if any(y in x for y in ['like', 'love'])])
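You can use the regex below. It is written as a JavaScript-style literal; in Python, pass the part between the slashes to re.findall, which already returns all matches without a g flag.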
regex = /[^.]* (?:like|love) [^.]*\./g

Removing words and symbols from columns which do not match specific criteria

I need to remove non-English words and specific symbols, such as | or -, from each row, along with three dots (...) when they appear at the end of a row.
To do this, I was considering using the googletranslate or langdetect packages in Python to detect and remove non-English words from the text, and creating a list of the symbols.
To apply them, I tried the following:
df['Text'] == df['Text'].apply(lambda x: detect(x) == 'en') # but this only detects whole rows. I would like to remove only the non-English words within rows, not the whole rows.
df['Text'] = df['Text'].map(lambda x: str(x)[:-4]) # however, I need a condition here: only if the last three characters are ... should those three dots be removed from the string.
to_remove=['|','-', '(',')']
df['Text'] = df['Text'].str.contains(|, to_remove)
english_data = [word for word in df['Text'].tolist() if detect_language(word) == 'English']
The column I need to apply these changes to is:
Text
The is in with a... - KIDS ...
BoneMA – Synthesis and Characterization of a Methacrylated ...
新型冠状病毒肺炎诊疗方案 (试行第七版) - Law Translate
Expected output:
Text
The is in with a... KIDS
BoneMA Synthesis and Characterization of a Methacrylated
Law Translate
Any help and suggestions would be appreciated.
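You can do this with a regex (regex=True is needed in recent pandas versions, where str.replace treats the pattern as a literal string by default):
df['Text'].str.replace(r'[^0-9a-zA-Z.]|[.]+$', ' ', regex=True).str.replace(r'\s{2,}', ' ', regex=True)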
Output
0 The is in with a... KIDS
1 BoneMA Synthesis and Characterization of a Methacrylated
2 Law Translate
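The same idea can be sketched with the standard re module on a single string, which makes each step easier to see. This is an illustration under assumptions, not the answer's exact code: the helper name clean_text is hypothetical, and it treats "not English" as "not an ASCII letter or digit", which is only an approximation and does not do real language detection.

```python
import re


def clean_text(value):
    # Replace everything except ASCII letters, digits, dots and whitespace.
    cleaned = re.sub(r'[^0-9a-zA-Z.\s]', ' ', str(value))
    # Drop dots (and any spaces around them) only at the very end.
    cleaned = re.sub(r'[.\s]+$', '', cleaned)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r'\s+', ' ', cleaned).strip()


rows = [
    "The is in with a... - KIDS ...",
    "BoneMA – Synthesis and Characterization of a Methacrylated ...",
    "新型冠状病毒肺炎诊疗方案 (试行第七版) - Law Translate",
]
for row in rows:
    print(clean_text(row))
```

With a DataFrame, such a function could be applied per row with df['Text'].apply(clean_text). Dots in the middle of a row (as in "a...") are kept because the trailing-dot rule is anchored with $.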

Regular expression in Python, 2-3 numbers then 2 letters

I am trying to do autodetection of bra size in a list of clothes. While I managed to extract only the bra items, I am now looking at extracting the size information and I think I am almost there (thanks to the stackoverflow community). However, there is a particular case that I could not find on another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So
Possible white space if not at the beginning of the description
two or three numbers
Another possible white space or not
Any letters (lower or upper case) between A and F
and then either another letter, for the two special cases AA and FF, or the end of the string.
My question is: is there a way to require the second letter to match the first (as in AA or FF)? With my current code, I get sizes like BA and CA, which do not exist.
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" returns "42 ba" instead of not found
"puma, sport-bh, strl: 34cd, svart/grå", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!
You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match 2-3 digits and an optional space (put a space in front of the pattern, as in the question, to also require one before the digits)
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
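A quick way to check the pattern is to run it against the four example descriptions from the question. This is a small sketch; the helper name find_size is made up for illustration, and the pattern is used exactly as written above, without a leading space.

```python
import re

SIZE_RE = re.compile(r'\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])')


def find_size(description):
    """Return the matched size text, or None if no size is found."""
    match = SIZE_RE.search(description)
    return match.group(0) if match else None


examples = [
    "bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain",
    "puma, sport-bh, strl: 34cd, svart/grå",
    "victoria's secret, bh, strl: 32c, gul/vit",
    "pink victorias secret bh 75dd burgundy",
]
for text in examples:
    print(find_size(text))
```

The first two descriptions yield None and the last two yield 32c and 75dd. One limitation to keep in mind: the lookahead only rejects a following a-f, so a word such as "42 fun" would still match as "42 f"; widening the lookahead to (?![a-zA-Z]) is one possible way to rule that out.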

Parsing variable length data

I'm using Python 3 and I'm relatively new to regex.
I'm struggling to come up with a good way to tackle the following problem.
I have a text string (which can include line breaks etc.) that contains several sets of information.
For example:
TAG1/123456 TAG2/ABCDEFG HISTAG3/A1B1C1D1 QWERTY TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT MYTAG6/FLINTSTONE
TAG7/99887766AA
I need this parsed to the following
TAG1/123456
TAG2/ABCDEFG
HISTAG3/A1B1C1D1 QWERTY
TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT
MYTAG6/FLINTSTONE
TAG7/99887766AA
I can't seem to work out how to deal with the variable length tags :( TAG3 and TAG5
I always end up capturing the next tag i.e.
TAG5/THE CAT SAT ON THE MAT TAG6
In reality the TAGs themselves are also variable. Most are 3 characters followed by '/' but not all. Some are 4, 5 and 6 characters long. But all are followed by '/' and all EXCEPT the first one are preceded by a space
Updated Information
I have updated the example to show these variable tags. But to clarify a tag can be 1-8 alphabetic characters, preceded by a space and terminated by '/'
The data after the tag can be one or more words (alphanumeric) and is defined as all the data that follows the '/' of the tag up until the start of the next tag or the end of the string.
Any pointers would be greatly appreciated.
This is one way to achieve what you want I think:
import re
s = """TAG1/123456 TAG2/ABCDEFG TAG3/A1B1C1D1 QWERTY TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT TAG6/FLINTSTONE
TAG7/99887766AA"""
r = re.compile(r'\w+/.+?(?=$|\s+\w+/)')
tags = r.findall(s)
print(*tags, sep='\n')
Output:
TAG1/123456
TAG2/ABCDEFG
TAG3/A1B1C1D1 QWERTY
TAG4/0987654321
TAG5/THE CAT SAT ON THE MAT
TAG6/FLINTSTONE
TAG7/99887766AA
The important bits are the non-greedy qualifier +? and the lookahead (?=$|\s+\w+/).
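If the tag and its data are needed separately, the same lookahead idea can be combined with two capturing groups. The sketch below is an illustration rather than part of the answer: it uses the updated sample from the question (with HISTAG3 and MYTAG6) and assumes that a value never spans a line break, since . does not match newlines by default.

```python
import re

data = (
    "TAG1/123456 TAG2/ABCDEFG HISTAG3/A1B1C1D1 QWERTY TAG4/0987654321\n"
    "TAG5/THE CAT SAT ON THE MAT MYTAG6/FLINTSTONE\n"
    "TAG7/99887766AA"
)

# Group 1 is the tag, group 2 is the data. The lazy .+? stops as soon as
# the lookahead sees whitespace followed by the next "tag/" or the end.
TAG_RE = re.compile(r'(\w+)/(.+?)(?=\s+\w+/|$)')

pairs = TAG_RE.findall(data)
for tag, value in pairs:
    print(f"{tag}/{value}")
```

With two groups in the pattern, findall returns a list of (tag, value) tuples, which can be passed to dict() if the tags are unique.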

Python matching regex multiple times in a row (not the findall way)

This question is not asking about finding 'a' multiple times in a string etc.
What I would like to do is match:
[ a-zA-Z0-9]{1,3}\.
multiple times in a row. One way to do this is with |:
'[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.'
so this matches the regexp 4 or 3 or 2 times.
Matches stuff like:
a. v. b.
m a.b.
Is there a more concise way to write this?
I tried
([ a-zA-Z0-9]{1,3}\.){2,4}
but it does not behave as I expected. This one returns:
regex.findall(string)
[u' b.', u'b.']
string is:
a. v. b. split them a.b. split somethinf words. THen we say some more words, like ten
Is there any way to do this? The goal is to match possible English abbreviations and names like Mary J. E., which the sentence tokenizer treats as sentence punctuation even though they are not.
I want to match all of this:
U.S. , c.v.a.b. , a. v. p.
First of all, your regex will work as you expect:
>>> s="aa2.jhf.jev.d23.llo."
>>> import re
>>> re.search(r'([ a-zA-Z0-9]{1,3}\.){2,4}',s).group(0)
'aa2.jhf.jev.d23.'
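But findall returns only the contents of the capture group, and a repeated group keeps just its last repetition. If you want findall to return substrings like U.S. , c.v.a.b. , a. v. p. you need to put the whole regex in a capture group:
>>> s = 'a. v. b. split them a.b. split somethinf words. THen we say some more'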
>>> re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)
[('a. v. b.', ' b.'), ('m a.b.', 'b.')]
Then use a list comprehension to get the first element of each tuple:
>>> [i[0] for i in re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)]
['a. v. b.', 'm a.b.']
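Another option, sketched here as an alternative to the extra capture group, is to make the repeated group non-capturing with (?:...). When a pattern has no capturing groups, findall returns the full matches directly, so no list comprehension is needed. The name ABBREV_RE is only illustrative.

```python
import re

s = 'a. v. b. split them a.b. split somethinf words. THen we say some more'

# (?:...) groups the repeated unit without capturing it, so findall
# returns each whole match instead of the last repetition of the group.
ABBREV_RE = re.compile(r'(?:[ a-zA-Z0-9]{1,3}\.){2,4}')

matches = ABBREV_RE.findall(s)
print(matches)  # ['a. v. b.', 'm a.b.']
```

As in the answer above, the second match still begins with the "m " from "them", because a space and a letter both fit the character class.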
