Regex match key except one - python

I have a Python list containing strings of the form items[<number>].<some field>. I want to get all of these strings except the ones that match items[<number>].classification. How can I do this with a regex, or is there another way?
As an example, I have something like:
data.items.[0].deliveryAddress.region
data.items.[0].classification.scheme
data.items.[0].classification.id
data.items.[0].description
And I want to keep only:
data.items.[0].description
data.items.[0].deliveryAddress.region

To do this, I used this regex to match the strings you want to discard (the dots are escaped so that they only match a literal "."):
data\.items\.\[\d+\]\.classification
Say I have a Python list called l containing those items:
l = ["data.items.[0].deliveryAddress.region",
     "data.items.[0].classification.scheme",
     "data.items.[0].classification.id",
     "data.items.[0].description"]
I can then use a list comprehension to keep only the values that don't match the regex. re.match anchors the pattern at the start of each string:
>>> import re
>>> [x for x in l if not re.match(r"data\.items\.\[\d+\]\.classification", x)]
['data.items.[0].deliveryAddress.region', 'data.items.[0].description']
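The same filter can be wrapped in a small reusable function. This is only a sketch: the helper name drop_classification, the extra sample keys, and the trailing (?:\.|$) are assumptions added here so that a field that merely starts with "classification" is not discarded by accident.

```python
import re

# Hypothetical helper, not part of the original answer.
# (?:\.|$) requires "classification" to be a complete path segment.
CLASSIFICATION = re.compile(r"data\.items\.\[\d+\]\.classification(?:\.|$)")

def drop_classification(keys):
    """Return the keys that do not point into an item's classification."""
    return [key for key in keys if not CLASSIFICATION.match(key)]

keys = [
    "data.items.[0].deliveryAddress.region",
    "data.items.[0].classification.scheme",
    "data.items.[12].classification.id",
    "data.items.[3].classificationNote",   # made-up field, kept on purpose
    "data.items.[0].description",
]
kept = drop_classification(keys)
print(kept)
```

Compiling the pattern once avoids re-parsing it for every element when the list is long.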

You could go for a negative lookahead combined with anchors:
^((?:.(?!classification))+)$
In Python code this would be:
import re
string = """
data.items.[0].deliveryAddress.region
data.items.[0].classification.scheme
data.items.[0].classification.id
data.items.[0].description
"""
rx = re.compile(r'^((?:.(?!classification))+)$', re.MULTILINE)
matches = rx.findall(string)
print(matches)
# ['data.items.[0].deliveryAddress.region', 'data.items.[0].description']
Obviously, this will work with a list as well:
import re
lst = ['data.items.[0].deliveryAddress.region',
'data.items.[0].classification.scheme',
'data.items.[0].classification.id',
'data.items.[0].description']
# no need for re.MULTILINE here
rx = re.compile(r'^((?:.(?!classification))+)$')
matches = [x for x in lst if rx.match(x)]
print(matches)
# ['data.items.[0].deliveryAddress.region', 'data.items.[0].description']
See a demo on regex101.com.
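One caveat with the pattern above: the lookahead is only checked after each consumed character, so a string that starts with "classification" still matches. The sketch below compares it with a variant that checks the lookahead once at the start of the string. The names tempered and guarded and the first sample key are made up for illustration.

```python
import re

# Pattern from the answer above: the lookahead runs after every character.
tempered = re.compile(r'^((?:.(?!classification))+)$')
# Variant: a single lookahead at the start rejects the word anywhere in the string.
guarded = re.compile(r'^(?!.*classification).+$')

keys = [
    'classification.id',                  # hypothetical key starting with the word
    'data.items.[0].classification.id',
    'data.items.[0].description',
]
tempered_hits = [k for k in keys if tempered.match(k)]
guarded_hits = [k for k in keys if guarded.match(k)]
print(tempered_hits)
print(guarded_hits)
```

For the data in the question both patterns give the same result, since every key starts with "data".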

Related

Insert Colon between each element of a list python

[x[1] for x in matches]
x
newtest = [x2[-2:] for x2 in x]
newtest
I have a list:
[u'asvbsMasd', u'abdhesMrty', u'ahdksC', u'ahdeO', u'ahdnL', u'ahddsS']
Now I want a colon inserted wherever a lowercase letter is followed by an uppercase letter:
[u'asvbs:Masd', u'abdhes:Mrty', u'ahdks:C', u'ahde:O', u'ahdn:L', u'ahdds:S']
You need to write a regex that matches a <lowercase><uppercase> pair:
>>> import re
>>> r = re.compile(r'([a-z])([A-Z])')
Note that each letter is marked as a group via (). Since the regex captures the two neighboring letters as separate groups, you can simply use a substitution (\1 and \2 are where the matched groups are put into the replacement string):
>>> r.sub(r'\1:\2', u'asvbsMasd')
u'asvbs:Masd'
Then you can use list comprehension to apply that substitution to each element of a list:
>>> l = [u'asvbsMasd', u'abdhesMrty', u'ahdksC', u'ahdeO', u'ahdnL', u'ahddsS']
>>> [r.sub(r'\1:\2', s) for s in l]
[u'asvbs:Masd', u'abdhes:Mrty', u'ahdks:C', u'ahde:O', u'ahdn:L', u'ahdds:S']
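An alternative that avoids capture groups is to match the empty position between the two letters with a lookbehind and a lookahead, and substitute the colon there. This is a sketch assuming Python 3; the name boundary is made up here.

```python
import re

# Zero-width match: preceded by a lowercase letter, followed by an uppercase one.
boundary = re.compile(r'(?<=[a-z])(?=[A-Z])')

words = ['asvbsMasd', 'abdhesMrty', 'ahdksC', 'ahdeO', 'ahdnL', 'ahddsS']
with_colons = [boundary.sub(':', w) for w in words]
print(with_colons)
```

Because nothing is consumed by the match, the replacement string does not need \1 and \2.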
Or, if you want it wrapped in a function:
import re
re_lowerupper = re.compile(r'([a-z])([A-Z])')
def add_colons(l):
    # Insert a colon between every lowercase/uppercase pair in each string.
    return [re_lowerupper.sub(r'\1:\2', s) for s in l]
print(add_colons([u'asvbsMasd', u'abdhesMrty', u'ahdksC', u'ahdeO', u'ahdnL', u'ahddsS']))
You may of course simplify it to a single lambda, as in the next example.
One important disclaimer, as I see you use Unicode strings: Python's re module offers no easy way to match an arbitrary Unicode uppercase or lowercase character. There is no shorthand for it like there is for any digit (\d) or any alphanumeric character (\w). If you need to match letters with diacritics too, you may need to list the lowercase and uppercase letters of your language explicitly in the regex, like:
re_lower = ur'[a-zßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿāăąćĉčēĕėęěğģĥĩīĭįĵķļľŀłņňŋōŏőœŕŗřśŝşţťũūŭůűųŵŷźžǎǐǒǔǖǘǚǜǩǫǵǹȟȧȩȯȳəḅḋḍḑḟḡḣḥḧḩḱḳṃṕṗṙṛṡṣṫṭṽẁẃẅẇẉẍẏẑẓẗẘẙạẹẽịọụỳỵỹ]'
re_upper = ur'[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞĀĂĄĆĈČĒĔĖĘĚĞĢĤĨĪĬĮİĴĶĻĽĿŁŅŇŊŌŎŐŒŔŖŘŚŜŞŢŤŨŪŬŮŰŲŴŶŸŹŽƏǍǏǑǓǕǗǙǛǨǪǴǸȞȦȨȮȲḄḊḌḐḞḠḢḤḦḨḰḲṂṔṖṘṚṠṢṪṬṼẀẂẄẆẈẌẎẐẒẠẸẼỊỌỤỲỴỸ]'
re_lowerupper = re.compile('(%s)(%s)' % (re_lower, re_upper))
add_colons = lambda l: [re_lowerupper.sub(r'\1:\2', s) for s in l]
This should do the job for the Latin script European languages.
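If listing the letters by hand is impractical, one possible alternative is to skip the regex and rely on str.islower() and str.isupper(), which are Unicode-aware. The sketch below assumes Python 3; the function name add_colons_unicode and the third sample word are made up for illustration.

```python
def add_colons_unicode(words):
    """Insert ':' wherever a lowercase character is followed by an uppercase one."""
    result = []
    for word in words:
        pieces = []
        # Pair every character with its predecessor; the first one gets a space.
        for prev, cur in zip(" " + word, word):
            if prev.islower() and cur.isupper():
                pieces.append(":")
            pieces.append(cur)
        result.append("".join(pieces))
    return result

print(add_colons_unicode(['asvbsMasd', 'ahdksC', 'žluťoučkýKůň']))
```

This trades the compact regex for a plain loop, but it does not depend on any particular alphabet.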

Using Regexp to catch substring python

Let's assume I have a string like this:
x = 'Wish she could have told me herself. #NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'
From that, I want to get:
:heavy_black_heart:
:smiling_face:
To do that, I did the following:
import re
result = re.search(':(.*?):', x)
result.group()
It only gives me the first ':heavy_black_heart:'. How can I make it find all of them? If possible, I want to store them in a dictionary after I have found them all.
print(re.findall(':.*?:', x)) does the job.
Output:
[':heavy_black_heart:', ':heavy_black_heart:', ':smiling_face:']
If you want to remove the duplicates, put the matches in a set:
res = re.findall(':.*?:', x)
unique = set(res)
print(list(unique))
Output (a set is unordered, so the order may vary):
[':heavy_black_heart:', ':smiling_face:']
You seem to want to match smilies, i.e. some symbols in between two colons. The .*? can match zero symbols, so your regex can also match ::, which I think is not what you want. Besides, re.search only returns one match (the first); to get multiple matches, you usually use re.findall or re.finditer.
I think you need
set(re.findall(r':[^:]+:', x))
or, if you only need to match word characters between the two colons:
set(re.findall(r':\w+:', x))
or, if you want to match any non-whitespace characters between the two colons:
set(re.findall(r':[^\s:]+:', x))
The re.findall will find all non-overlapping occurrences and set will remove dupes.
The patterns will match :, then 1+ chars other than : ([^:]+) (or 1 or more letters, digits and _) and again :.
>>> import re
>>> x = 'Wish she could have told me herself. #NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'
>>> print(set(re.findall(r':[^:]+:', x)))
{':smiling_face:', ':heavy_black_heart:'}
>>>
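Since the question also mentions storing the results in a dictionary, one possible approach is to count the matches with collections.Counter, which is a dict subclass. This is only a sketch; counting is an assumption about what the dictionary should hold.

```python
import re
from collections import Counter

x = ('Wish she could have told me herself. #NicoleScherzy #nicolescherzinger '
     '#OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: '
     'some string too :smiling_face:')

# Map each :name: token to the number of times it occurs.
counts = Counter(re.findall(r':\w+:', x))
print(dict(counts))
```

The keys of counts are the unique smilies, so it also covers the de-duplication done with set above.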
Try this regex:
:[a-zA-Z0-9_]+:
import re
x = 'Wish she could have told me herself. #NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'
print(set(re.findall(r':[a-zA-Z0-9_]+:', x)))
output:
{':heavy_black_heart:', ':smiling_face:'}
Just for fun, here's a simple solution without regex. It splits around ':' and keeps the elements with odd index:
>>> text = 'Wish she could have told me herself. #NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'
>>> text.split(':')[1::2]
['heavy_black_heart', 'heavy_black_heart', 'smiling_face']
>>> set(text.split(':')[1::2])
set(['heavy_black_heart', 'smiling_face'])

How to parse values appear after the same string in python?

I have an input text like this (the actual text file contains tons of garbage characters surrounding these two strings, too):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote Python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how I can parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
See the regex demo
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
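To get closer to the value1="xxx" and value2="yyy" form asked for in the question, the part after the equals sign can be captured and numbered. This is a sketch; the dictionary name named and the "value1", "value2" keys are assumptions based on the question's wording.

```python
import re

s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# Capture only what follows "value=" up to the next ";".
values = re.findall(r"\bvalue=([^;]*)", s)

# Number the captured values starting from 1.
named = {"value%d" % i: v for i, v in enumerate(values, start=1)}
print(named)
```

With a capture group in the pattern, re.findall returns the group contents rather than the whole match.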
Use a regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx;38u952035983049;3^&^*(^%$value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy
>>>
Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
Note that alphanumeric junk directly adjacent to a key or value becomes part of the match, so this only works when the surrounding junk is non-alphanumeric.
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (performance?) you prefer to use str methods, it is indeed possible. But you must start the search for the second string past the end of the first one:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
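The same idea generalizes to any number of occurrences by restarting each search where the previous match ended. This is a minimal sketch; the function name find_values, its parameters, and the sample string are hypothetical.

```python
def find_values(content, key="value", terminator=";"):
    """Collect every 'key...' chunk up to the next terminator using str.find."""
    found = []
    start = content.find(key)
    while start != -1:
        end = content.find(terminator, start)
        if end == -1:
            # No terminator left: take the rest of the text.
            end = len(content)
        found.append(content[start:end])
        # Continue searching past the end of the chunk just collected.
        start = content.find(key, end)
    return found

print(find_values("$%value=xxx;#^value=yyy;%"))
```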

search patterns with variable gaps in python

I am looking for patterns in a list containing different strings as:
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
I would like to select the string that has the pattern 'T--T' (no matter how the string starts), so those elements would be selected and appended to a new list as:
namesSelected = ['TAATGH', 'ATGTTKKKK']
Using grep I could:
grep "T[[:alpha:]]\{2\}T"
Is there a similar feature in Python's re module?
Thanks for any help!
I think this is most likely what you want:
re.search(r'T[A-Z]{2}T', inputString)
The equivalent in Python for [[:alpha:]] would be [a-zA-Z]. You may replace [A-Z] with [a-zA-Z] in the code snippet above if you wish to allow lowercase alphabet.
Documentation for re.search.
Yep, you can use re.search:
>>> import re
>>> names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
>>> reslist = []
>>> for i in names:
...     res = re.search(r'T[A-Z]{2}T', i)
...     if res:
...         reslist.append(i)
...
>>> print(reslist)
['TAATGH', 'ATGTTKKKK']
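import re
def grep(l, pattern):
    r = re.compile(pattern)
    # Search each element of the list, not the pattern itself.
    return [s for s in l if r.search(s)]
namesSelected = grep(names, r"T\w{2}T")
Note the use of \w instead of [[:alpha:]]; unlike [[:alpha:]], \w also matches digits and the underscore.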
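The difference between \w and a letters-only class only shows up when the gap contains digits or underscores. The sketch below illustrates it; the helper name select and the extra entry 'T1_T' are made up for the comparison.

```python
import re

def select(strings, pattern):
    """Return the strings in which the pattern is found anywhere."""
    rx = re.compile(pattern)
    return [s for s in strings if rx.search(s)]

# 'T1_T' is a hypothetical entry added to show the difference.
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF', 'T1_T']

letters_only = select(names, r'T[A-Za-z]{2}T')   # closest to [[:alpha:]]
word_chars = select(names, r'T\w{2}T')           # also allows digits and "_"
print(letters_only)
print(word_chars)
```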

python regular expression to find something in between two strings or phrases

How can I use regex in python to capture something between two strings or phrases, and removing everything else on the line?
For example, the following is a protein sequence preceded by a one-line header. How can I extract "CG33289-PC" from the header below, given that it occurs after the phrase "FlyBase_Annotation_IDs:" and before the next comma ","?
I need to substitute the header with this simplified result "CG33289-PC" and not destroy the protein sequence (found below the header-line in all caps).
This is what each protein sequence entry looks like - a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
Using regexps:
>>> s = """>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel; MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV"""
>>> import re
>>> print(re.sub(r'.*FlyBase_Annotation_IDs:([\w-]+).*;\s*', r'\1\n', s))
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
>>>
Not an elegant solution, but this should work for you:
>>> fly = 'FlyBase_Annotation_IDs'
>>> repl = 'CG33289-PC'
>>> part1, part2 = protein.split(fly)
>>> part2 = part2.replace(repl, "FooBar")
>>> protein = fly.join([part1, part2])
This assumes that FlyBase_Annotation_IDs appears only once in the data.
I'm not sure about the format of the file, but this regex will capture the data in your example:
"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);"
Use the findall function to get the match.
Assuming there is a newline after the header:
>>> import re
>>> protein = "..."
>>> r = re.compile(r"^.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE)
>>> r.sub(r"\1", protein)
The group ([A-Z0-9a-z-]*) in the regular expression extracts any alphanumeric character and the dash. If ids can have other characters, just add them.
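For a file with many entries, one possible approach is to go line by line and only rewrite the lines that start with ">". This is a sketch under stated assumptions: the function name simplify_headers is hypothetical, the header in the sample is shortened from the one in the question, and headers without an annotation ID are passed through unchanged.

```python
import re

# The ID runs until the next comma, semicolon or whitespace.
ID_PATTERN = re.compile(r"FlyBase_Annotation_IDs:([^,;\s]+)")

def simplify_headers(lines):
    """Yield sequence lines unchanged and replace each header by its annotation ID."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = ID_PATTERN.search(line)
            # Keep the original header if no ID is found (an assumption).
            yield match.group(1) if match else line
        else:
            yield line

record = [
    ">FBpp0293870 type=protein; dbxref=FlyBase:FBpp0293870,"
    "FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c;\n",
    "MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII\n",
    "FSRAV\n",
]
for out_line in simplify_headers(record):
    print(out_line)
```

Because it is a generator, the same function can be fed an open file object without loading the whole file into memory.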
