Python regex parsing string containing braced items


So I have a set of strings which look like:
Callable {option-1} {option-2} {option-3} {option-n}
Callable
Callable {option-1}
There may be none or n options.
What I want to do is to parse out the options from this string in a list ([option-1, option-2, option-3, option-n]), or None if there were no braced options. What is the best way of doing it? At present I do lots of split('{') and then strip/clean the output. This feels very ugly.
What is the clean(est) method for doing this?

Use re.findall():
re.findall(r'{([^}]+)}', inputtext)
This pattern treats anything that isn't a closing brace as option text. Alternatively, you can restrict options to word characters and dashes (\w already covers digits, so \d is redundant):
re.findall(r'{([\w-]+)}', inputtext)
Demo:
>>> import re
>>> samples = '''\
... Callable {option-1} {option-2} {option-3} {option-n}
... Callable
... Callable {option-1}
... '''
>>> for line in samples.splitlines():
...     print(re.findall(r'{([^}]+)}', line))
...
['option-1', 'option-2', 'option-3', 'option-n']
[]
['option-1']
This produces a list of matches for each line; a line with no matches yields an empty list. If you need None instead of an empty list, append "or None" to the call.
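To get exactly what the question asks for (a list, or None when no braced options are present), the findall call can be wrapped in a small helper. This is a minimal sketch; the names parse_options and OPTION_RE are made up for illustration.

```python
import re

# Hypothetical helper; the pattern is the one used in the answer above.
OPTION_RE = re.compile(r'{([^}]+)}')

def parse_options(text):
    """Return the braced options in text as a list, or None if there are none."""
    # findall returns [] when nothing matches; "or None" turns that into None.
    return OPTION_RE.findall(text) or None

print(parse_options("Callable {option-1} {option-2}"))  # ['option-1', 'option-2']
print(parse_options("Callable"))                        # None
```

Compiling the pattern once is only a convenience when many strings are parsed; the re module caches recently used patterns anyway.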

Related

Regular expression to retrieve string parts within parentheses separated by commas

I have a string from which I want to take the values within the parentheses, and then split out the values that are separated by commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31

This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall (r'([^(,)]+)(?!.*\()', string)
print (l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a run of characters other than (, a comma, or ). To keep the leading x from being picked up, the run may not be followed by a ( anywhere later in the string. This also makes it work when the preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.

You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
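The search-then-split approach above can be wrapped in a function that also copes with input containing no parentheses. This is a sketch only; split_parenthesized is a hypothetical name, and it looks at the first parenthesized group only.

```python
import re

def split_parenthesized(text):
    """Return the comma-separated values inside the first (...) in text."""
    match = re.search(r'\((.*?)\)', text)
    if match is None:
        # No parentheses at all: nothing to split.
        return []
    # group(1) is the text between the parentheses, without the parentheses.
    return [item.strip() for item in match.group(1).split(',')]

print(split_parenthesized("x(142,1,23ERWA31)"))  # ['142', '1', '23ERWA31']
```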

This looks a little wonky, and it assumes there is always a group of exactly 3 values in the parentheses, but try this regex:
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object - your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult tuple like a list:
>>> print(firstResult[2])
23ERWA31

How to read access-log hosts with regex?

I have such entries:
e179206120.adsl.alicedsl.de
safecamp-plus-2098.unibw-hamburg.de
p5B30EBFE.dip0.t-ipconnect.de
and I would like to match only the main domain names like
alicedsl.de
unibw-hamburg.de
t-ipconnect.de
I tried this \.\w+\.\w+\.\w{2,3} but that matches .adsl.alicedsl.de

How about [^.]+\.\w+$
Or, in Python:
import re
tgt='''\
e179206120.adsl.alicedsl.de
safecamp-plus-2098.unibw-hamburg.de
p5B30EBFE.dip0.t-ipconnect.de'''
print(re.findall(r'([^.]+\.\w+$)', tgt, re.M | re.S))
# ['alicedsl.de', 'unibw-hamburg.de', 't-ipconnect.de']
Regex explanation:
[^.]+ 1 or more characters EXCEPT a literal .
\. literal . The backslash is needed because an unescaped . matches any character
\w+ 1 or more characters in the ranges [a-z] [A-Z] [0-9] [_]. A potentially better pattern for ASCII TLDs is [a-zA-Z]+, since none of the older TLDs contain non-ASCII characters. If you want to handle the newer internationalized TLDs, you need a different regex.
$ assertion for the end of the line
You should know that your definition of TLDs is incomplete. For example, this regex approach breaks on the legitimate host bbc.co.uk and many others that include a common SLD. Use a library if you can for more general applicability. You can also use the Mozilla list of TLDs and SLDs to know when it is appropriate to include two periods in the definition of a host.
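To illustrate the bbc.co.uk caveat, here is a rough sketch that keeps three labels when the last two form a known two-part suffix. The TWO_PART_SUFFIXES set is a tiny, hand-picked assumption for demonstration and main_domain is a hypothetical name; real code should rely on the full public suffix list or a library.

```python
# Hypothetical, deliberately incomplete suffix set, for illustration only.
TWO_PART_SUFFIXES = {"co.uk", "org.uk", "com.au"}

def main_domain(host):
    """Return the registrable-looking part of host under the assumptions above."""
    labels = host.lower().rstrip(".").split(".")
    # Keep three labels when the last two form a known two-part suffix.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(labels[-3:])
    # Otherwise fall back to the last two labels, as the regex does.
    return ".".join(labels[-2:])

print(main_domain("e179206120.adsl.alicedsl.de"))  # alicedsl.de
print(main_domain("www.bbc.co.uk"))                # bbc.co.uk
```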

You could use the following with your given data.
[^.]+\.[^.]+$

If you don't have restrictions on using external libraries, check out the tldextract library:
https://pypi.python.org/pypi/tldextract
import tldextract
for host in ["e179206120.adsl.alicedsl.de", "safecamp-plus-2098.unibw-hamburg.de", "p5B30EBFE.dip0.t-ipconnect.de"]:
    host_tld = tldextract.extract(host)
    print(host_tld.domain + "." + host_tld.suffix)

You actually do not need Regex for this. A list comprehension will be far more efficient:
>>> mystr = """
... e179206120.adsl.alicedsl.de
... safecamp-plus-2098.unibw-hamburg.de
... p5B30EBFE.dip0.t-ipconnect.de
... """
>>> [".".join(line.rsplit(".", 2)[-2:]) for line in mystr.splitlines() if line]
['alicedsl.de', 'unibw-hamburg.de', 't-ipconnect.de']
>>>
Also, if you want it, here is a reference on Python's string methods (it explains str.splitlines, str.rsplit, and str.join).
If you run a speed test using timeit.timeit, you will see that the list comprehension is much faster:
>>> from timeit import timeit
>>> mystr = """
... e179206120.adsl.alicedsl.de
... safecamp-plus-2098.unibw-hamburg.de
... p5B30EBFE.dip0.t-ipconnect.de
... """
>>> def func():
...     import re
...     re.findall(r'([^.]+\.\w+$)', mystr, re.M | re.S)
...
>>> timeit("func()", "from __main__ import func") # Regex's time
51.85605544838802
>>> def func():
...     [".".join(line.rsplit(".", 2)[-2:]) for line in mystr.splitlines() if line]
...
>>> timeit("func()", "from __main__ import func") # List comp.'s time
12.113929004943316
>>>

string.translate() with unicode data in python

I have 3 APIs that return JSON data to 3 dictionary variables. I am taking some of the values from the dictionaries to process them. I read the specific values that I want into the list valuelist. One of the steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this, but because the dictionary data is unicode I get the error:
wordlist = [s.translate(None, string.punctuation) for s in valuelist]
TypeError: translate() takes exactly one argument (2 given)
Is there a way around this? Either by encoding the unicode or a replacement for string.translate?

The translate method works differently on Unicode objects than on byte-string objects:
>>> help(unicode.translate)
S.translate(table) -> unicode
Return a copy of the string S, where all characters have been mapped
through the given translation table, which must be a mapping of
Unicode ordinals to Unicode ordinals, Unicode strings or None.
Unmapped characters are left untouched. Characters mapped to None
are deleted.
So your example would become:
import string
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
word_list = [s.translate(remove_punctuation_map) for s in value_list]
Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
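If non-ASCII punctuation matters, one option is to build the table from Unicode categories instead of string.punctuation. The following is a sketch assuming Python 3 (where str is Unicode); PUNCT_TABLE and strip_punctuation are names invented here.

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with "P" (punctuation)
# to None, so that str.translate deletes it.
PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

def strip_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("«hello», world!"))  # hello world
```

Note that this is not a superset of string.punctuation: characters such as $ and + belong to the symbol categories (S*), not punctuation, so they are left in place unless added to the table separately.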

I noticed that the module-level string.translate function is deprecated (the translate methods on str and unicode objects are not). Since you are removing punctuation rather than actually translating characters, you can also use the re.sub function.
>>> import re
>>> s1="this.is a.string, with; (punctuation)."
>>> s1
'this.is a.string, with; (punctuation).'
>>> re.sub(r"[.\t,:;()]", "", s1)
'thisis astring with punctuation'
>>>

This version maps each letter of one string to the corresponding letter of another:
def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
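On Python 3 (an assumption; the code above targets Python 2 unicode objects), str.maketrans builds the ordinal-keyed table directly, and its optional third argument lists characters to delete. A short sketch, with the same function name as above and two invented table names:

```python
import string

# Character-for-character mapping, equivalent to the dict built by hand above.
swap_table = str.maketrans('привет', 'тевирп')

# With two empty strings, the third argument names the characters to delete.
remove_table = str.maketrans('', '', string.punctuation)

def trans(to_translate):
    return to_translate.translate(swap_table)

print(trans('привет'))                    # тевирп
print('a.b,c!'.translate(remove_table))  # abc
```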

Python's re module allows a function as the replacement argument; it takes a Match object and returns a suitable replacement string. We can use this to build a custom character translation function:
import re
def mk_replacer(oldchars, newchars):
    """A function to build a replacement function"""
    mapping = dict(zip(oldchars, newchars))
    def replacer(match):
        """A replacement function to pass to re.sub()"""
        return mapping.get(match.group(0), "")
    return replacer
An example. Match all lower-case letters ([a-z]), translate 'h' and 'i' to 'H' and 'I' respectively, delete other matches:
>>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
'HI'
As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.
A Unicode example (Python 2; without the re.UNICODE flag, \W also matches non-ASCII letters such as Cyrillic):
>>> re.sub(r"[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
u'PRIVET'

As I stumbled upon the same problem and Simon's answer was the one that helped me to solve my case, I thought of showing an easier example just for clarification:
Say you'd like to remove the '#' and '\r' characters from a Unicode string. The table must be keyed by ordinals, so a plain dict built with ord() is enough (no defaultdict needed):
remove_chars_map = {ord('#'): None, ord('\r'): None}
new_string = old_string.translate(remove_chars_map)
And an example:
old_string = u"word1#\r word2#\r word3#\r"
new_string = u"word1 word2 word3"  # '#' and '\r' removed

python regular expression to find something in between two strings or phrases

How can I use regex in Python to capture something between two strings or phrases, and remove everything else on the line?
For example, the following is a protein sequence preceded by a one-line header. How can I extract "CG33289-PC" from the header below, based on the stipulation that it occurs after the phrase "FlyBase_Annotation_IDs:" and before the next comma ","?
I need to substitute the header with this simplified result "CG33289-PC" and not destroy the protein sequence (found below the header-line in all caps).
This is what each protein sequence entry looks like - a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV

Using regexps:
>>> s = """>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel; MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV"""
>>> import re
>>> print(re.sub(r'.*FlyBase_Annotation_IDs:([\w-]+).*;', r'\1\n', s))
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
>>>

Not an elegant solution, but this should work for you:
>>> fly = 'FlyBase_Annotation_IDs'
>>> repl = 'CG33289-PC'
>>> part1, part2 = protein.split(fly)
>>> part2 = part2.replace(repl, "FooBar")
>>> protein = fly.join([part1, part2])
This assumes FlyBase_Annotation_IDs appears only once in the data.

I'm not sure about the format of the file, but this regex will capture the data in your example:
"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);"
Use the re.findall function to get the match.

Assuming there is a newline after the header:
>>> import re
>>> protein = "..."
>>> r = re.compile(r"^.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE)
>>> r.sub(r"\1", protein)
The group ([A-Z0-9a-z-]*) in the regular expression captures any run of alphanumeric characters and dashes. If IDs can contain other characters, just add them to the character class.
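For a file with many entries, it may be simpler to process it line by line and rewrite only the header lines. The sketch below assumes headers start with ">" as in the question; simplify_headers and ID_RE are hypothetical names, and the sample header is shortened from the one in the question.

```python
import re

ID_RE = re.compile(r'FlyBase_Annotation_IDs:([\w-]+)')

def simplify_headers(lines):
    """Yield each line, with header lines replaced by their annotation ID."""
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            match = ID_RE.search(line)
            # Leave the header untouched if it carries no annotation ID.
            yield match.group(1) if match else line
        else:
            # Sequence lines pass through unchanged.
            yield line

sample = [
    ">FBpp0293870 type=protein; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; length=405;\n",
    "MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII\n",
    "FSRAV\n",
]
print("\n".join(simplify_headers(sample)))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the sample list.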

Python: Regex to find but not include an alphanumeric

Is there a regular expression to find, for example, ">ab" but not include ">" in the result?
I want to replace some strings using re.sub, and I want to find strings starting with ">" without removing the ">".

You want a positive lookbehind assertion. See the docs.
r'(?<=>)ab'
The lookbehind must be a fixed-length expression; it can't match a variable number of characters. The general form is:
r'(?<=stringiwanttobebeforethematch)stringiwanttomatch'
So, an example:
import re
# replace 'ab' with 'e' if it has '>' before it
# here we've got '>ab' so we'll get '>ecd'
print(re.sub(r'(?<=>)ab', 'e', '>abcd'))
# here we've got 'ab' but no '>' so we'll get 'abcd'
print(re.sub(r'(?<=>)ab', 'e', 'abcd'))
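The fixed-length restriction can be seen directly: the re module refuses to compile a lookbehind such as (?<=>+). When the prefix length varies, capturing the prefix and putting it back is a common workaround. A small sketch (the variable name variable_width_ok is made up here):

```python
import re

# A fixed-width lookbehind compiles and works as described above.
print(re.sub(r'(?<=>)ab', 'e', '>abcd'))  # >ecd

# A variable-width lookbehind is rejected by the re module at compile time.
try:
    re.compile(r'(?<=>+)ab')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False

# Capturing the prefix and re-inserting it handles any prefix length.
print(re.sub(r'(>+)ab', r'\1e', '>>>abcd'))  # >>>ecd
```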

You can use a back reference in sub:
import re
test = """
>word
>word2
don't replace
"""
print(re.sub('(>).*', r'\1replace!', test))
Outputs:
>replace!
>replace!
don't replace
I believe this accomplishes what you actually want when you say "I want to replace some strings using re.sub, and I want to find strings starting with '>' without remove the '>'."

If you want to avoid using the re module, you can also use the startswith() string method.
>>> foo = [ '>12', '>54', '34' ]
>>> for line in foo:
...     if line.startswith('>'):
...         line = line.lstrip('>')
...     print(line)
...
12
54
34
>>>
