Hopefully someone can help. I'm trying to use a regular expression to extract the text that follows a pattern in a string, but it's not working and I'm not sure why. The regex works fine in Linux...
import re
s = "GeneID:5408878;gbkey=CDS;product=carboxynorspermidinedecarboxylase;protein_id=YP_001405731.1"
>>> x = re.search(r'(?<=protein_id=)[^;]*',s)
>>> print(x)
<_sre.SRE_Match object at 0x000000000345B7E8>
Use .group() on the search result to print the captured groups:
>>> print(x.group(0))
YP_001405731.1
As Martijn pointed out, you created a match object, so the regular expression is correct. If it had not matched, print(x) would have printed None.
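To make the "match object or None" point concrete, here is a minimal sketch. The helper name extract_protein_id is made up for illustration; the pattern and the sample string are the ones from the question.

```python
import re

s = "GeneID:5408878;gbkey=CDS;product=carboxynorspermidinedecarboxylase;protein_id=YP_001405731.1"

def extract_protein_id(text):
    """Return the protein_id value, or None when the field is absent."""
    match = re.search(r"(?<=protein_id=)[^;]*", text)
    # re.search returns None when nothing matches, so check before calling .group()
    return match.group(0) if match else None

protein_id = extract_protein_id(s)
missing = extract_protein_id("gbkey=CDS;product=unknown")
print(protein_id)  # YP_001405731.1
print(missing)     # None
```

Checking for None first avoids the AttributeError you would get from calling .group() on a failed search.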
You should probably think about re-writing your regex so that you find all pairs so you don't have to muck around with specific groups and hard-coded look behinds...
import re
kv = dict(re.findall(r'(\w+)=([^;]+)', s))
# {'gbkey': 'CDS', 'product': 'carboxynorspermidinedecarboxylase', 'protein_id': 'YP_001405731.1'}
print(kv['protein_id'])
# YP_001405731.1
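If the fields are always separated by ; and written as key=value, the same dictionary can probably be built without a regex at all. This is only a sketch under that assumption; fields without an = (like the leading GeneID:5408878) are simply skipped.

```python
s = "GeneID:5408878;gbkey=CDS;product=carboxynorspermidinedecarboxylase;protein_id=YP_001405731.1"

# Split into fields on ';', then split each field once on '='.
# Assumption: values never contain ';'.
kv = dict(field.split("=", 1) for field in s.split(";") if "=" in field)

print(kv["protein_id"])  # YP_001405731.1
```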
I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
cleaned = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", cleaned).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
In the case you described, you need to use a capture group:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is some sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(re.search(p, test_str).group(1))
Unless you need overlapping matches, capturing groups usage instead of a look-behind is rather straightforward.
print(re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)",test_str))
You can also use findall directly; when the regex contains a group, it returns the contents of that group for every match.
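For anyone who wants to see the two behaviours side by side, here is a small sketch. The variable names (text, lookbehind_ok, acct) are just illustrative; the patterns are the ones from the question and the answers above.

```python
import re

text = "ORIG : / AB123"

# The standard re module rejects a look-behind whose width can vary,
# so compiling the R-style pattern raises re.error.
try:
    re.compile(r"(?<=ORIG\s?:\s?/\s?)[A-Z0-9]+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# The capture-group version matches the prefix as part of the pattern
# and pulls out only the part in parentheses.
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", text)
acct = match.group(1) if match else None

print(lookbehind_ok)  # False
print(acct)           # AB123
```

The prefix is still consumed by the match here, which is why this does not help when you need overlapping matches.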
I'm trying to create a regex to catch all hexadecimal colors in a string literal. I'm using Python 3, and that's what I have:
import re
pattern = re.compile(r"#[a-fA-F\d]{3}([a-fA-F\d]{3})?")
However, when I apply the findall regex method on #abcdef here's what I get:
>>> re.findall(pattern,"#abcdef")
['def']
Can someone explain why I get that? I actually need to get ['#abcdef'].
Thank you in advance
According to http://regex101.com:
It looks like this regex is looking for a # followed by three characters (a through f, A through F, or a digit), then an optional captured group of three more such characters.
When that group is present, it is what gets returned from the match instead of the whole string.
If you are looking to match the whole six-digit color string, I would recommend this instead:
#[a-fA-F\d]{6}
Thanks to Andrej Kesely, I got the answer to my question, that is:
When the pattern contains a capturing group, re.findall returns the group instead of the whole match.
To bypass this, just change the regex from:
r"#[a-fA-F\d]{3}([a-fA-F\d]{3})?"
to:
r"#[a-fA-F\d]{3}(?:[a-fA-F\d]{3})?"
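A quick sketch of the difference, in case it helps someone else. The sample text is made up; the two patterns are the ones above. Using re.finditer with group(0) is another option if you want to keep the capturing group.

```python
import re

text = "colors: #abcdef, #123 and #A1B2C3"

# With a capturing group, findall returns only what the group matched
# (an empty string when the optional group did not take part).
capturing = re.findall(r"#[a-fA-F\d]{3}([a-fA-F\d]{3})?", text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"#[a-fA-F\d]{3}(?:[a-fA-F\d]{3})?", text)

# finditer gives match objects, so group(0) is always the whole match.
via_finditer = [m.group(0) for m in re.finditer(r"#[a-fA-F\d]{3}([a-fA-F\d]{3})?", text)]

print(capturing)      # ['def', '', '2C3']
print(non_capturing)  # ['#abcdef', '#123', '#A1B2C3']
```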
I want to normalize strings like
'1:2:3','10:20:30'
to
'01:02:03','10:20:30'
using Python's re module.
My idea is to match each single-digit number ('1', '2', '3', ...) in a string like '1:2:3'. Here is my pattern:
^\d(?=\D)|(?<=\D)\d(?=\D)|(?<=\D)\d$
It works, but I think the pattern is not simple enough. Could anybody help me simplify it? Or should I use map()/split() if that is the better approach?
\b matches between a word character and a non-word character.
>>> import re
>>> l = ['1:2:3','10:20:30']
>>> [re.sub(r'\b(\d)\b', r'0\1', i) for i in l]
['01:02:03', '10:20:30']
re.sub(r"(?<!\d)(\d)(?!\d)",r"0\1",test_str)
You can simplify it to this. See demo.
https://regex101.com/r/nD5jY4/4#python
If the string is like
x="""'1:2:3','10:20:30'"""
Then do
print(",".join([re.sub(r"(?<!\d)(\d)(?!\d)",r"0\1",i) for i in x.split(",")]))
You could do this with re, but pretty much nobody will know how it works afterwards. I'd recommend this instead:
':'.join("%02d" % int(x) for x in original_string.split(':'))
It's clearer how it works.
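For comparison, here are the regex and the split approaches wrapped in two small functions. The function names are hypothetical, chosen only for this sketch; both assume the input is made of non-negative integers separated by colons.

```python
import re

def pad_with_regex(value):
    # A lone digit has a word boundary on both sides, so only single
    # digits get a leading zero.
    return re.sub(r"\b(\d)\b", r"0\1", value)

def pad_with_split(value):
    # Split on ':', format each part as a two-digit number, and rejoin.
    return ":".join("%02d" % int(part) for part in value.split(":"))

samples = ["1:2:3", "10:20:30"]
print([pad_with_regex(v) for v in samples])  # ['01:02:03', '10:20:30']
print([pad_with_split(v) for v in samples])  # ['01:02:03', '10:20:30']
```

The split version raises ValueError on non-numeric parts, which may or may not be what you want; the regex version leaves them untouched.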
I would like to use regular expressions in Python to get everything that comes after a </html> tag and put it in a string. I tried to understand how to do it in Python, but I was not able to make it work. Can anyone explain how to do this ridiculously simple task?
You can do this without a regular expression:
text[text.find('</html>')+7:]
m = re.match(r".*</html>(.*)", my_html_string, re.DOTALL)
print(m.groups())
or even better
print(my_html_string.split("</html>")[-1])
import re
text = 'foo</html>bar'
m = re.search('</html>(.*)', text)
print(m.group(1))
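One thing to watch for with the regex versions: by default . does not match a newline, so on a real multi-line page (.*) stops at the end of the line. A rough sketch, with a made-up sample page, comparing str.partition with re.search plus re.DOTALL:

```python
import re

page = "<html>\n<body>hi</body>\n</html>\ntrailing line 1\ntrailing line 2"

# str.partition splits at the first occurrence; sep is '' when the tag is missing.
before, sep, after = page.partition("</html>")
tail_partition = after if sep else ""

# re.DOTALL lets '.' match newlines, so (.*) runs to the end of the text.
m = re.search(r"</html>(.*)", page, re.DOTALL)
tail_regex = m.group(1) if m else ""

# Without re.DOTALL the group stops at the first newline.
tail_single_line = re.search(r"</html>(.*)", page).group(1)

print(repr(tail_partition))    # '\ntrailing line 1\ntrailing line 2'
print(repr(tail_single_line))  # ''
```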
I've got a string that might look like this:
"myFunc('element','node','elementVersion','ext',12,0,0)"
I'm currently checking it for validity with this pattern, which works fine:
myFunc\((.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\)
Now I'd like to replace whatever string is in the 3rd parameter.
Unfortunately I can't just do a plain string replace on the 3rd substring, since the same substring could appear anywhere else in that string.
With this pattern and re.findall,
myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)
I was able to get the contents of the substring in the 3rd position, but re.sub does not replace that substring; it just returns the replacement string on its own :/
here's my code
myRe = re.compile(r"myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)")
val = "myFunc('element','node','elementVersion','ext',12,0,0)"
print(myRe.findall(val))
print(myRe.sub("noVersion",val))
Any idea what I've missed?
thanks!
Seb
In re.sub, you need to specify a substitution for the whole matching string. That means that you need to repeat the parts that you don't want to replace. This works:
myRe = re.compile(r"(myFunc\(.+?\,.+?\,)(.+?)(\,.+?\,.+?\,.+?\,.+?\))")
print(myRe.sub(r'\1"noversion"\3', val))
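To show why the original call returned only the replacement text, and an alternative way to write the fix, here is a sketch that uses a replacement function instead of backreferences. The function name replace_third is made up for this example.

```python
import re

val = "myFunc('element','node','elementVersion','ext',12,0,0)"

# The original pattern matches the ENTIRE string, so sub() swaps the
# whole thing for the replacement text.
naive = re.compile(r"myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)").sub("noVersion", val)

# Capture the parts to keep (groups 1 and 3) and rebuild around the new value.
pattern = re.compile(r"(myFunc\(.+?,.+?,)(.+?)(,.+?,.+?,.+?,.+?\))")

def replace_third(match):
    return match.group(1) + "'noVersion'" + match.group(3)

result = pattern.sub(replace_third, val)

print(naive)   # noVersion
print(result)  # myFunc('element','node','noVersion','ext',12,0,0)
```

A function replacement is handy when the new value contains backslashes or has to be computed from the old one.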
If your only tool is a hammer, all problems look like nails. A regular expression is a powerful hammer but is not the best tool for every task.
Some tasks are better handled by a parser. In this case the argument list in the string is just like a Python tuple, so you can cheat: use the Python builtin parser:
>>> strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> args = re.search(r'\(([^\)]+)\)', strdata).group(1)
>>> eval(args)
('element', 'node', 'elementVersion', 'ext', 12, 0, 0)
If you can't trust the input, ast.literal_eval is safer than eval for this. Once you have the argument list in the string deconstructed, I think you can figure out how to manipulate and reassemble it again, if needed.
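A possible way to do the whole round trip with ast.literal_eval is sketched below. It assumes the call has at least two arguments that are all plain literals (a single argument would not parse as a tuple), and it rebuilds the string with repr(), so string arguments come back single-quoted.

```python
import ast
import re

strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"

# Split the call into the function name and the text between the parentheses.
m = re.fullmatch(r"(\w+)\((.*)\)", strdata)
name, arg_text = m.group(1), m.group(2)

# literal_eval only accepts Python literals, so it is safe on untrusted input.
args = list(ast.literal_eval(arg_text))
args[2] = "noVersion"

# Reassemble the call.
rebuilt = "%s(%s)" % (name, ",".join(repr(a) for a in args))

print(rebuilt)  # myFunc('element','node','noVersion','ext',12,0,0)
```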
Read the documentation: re.sub returns a copy of the string where every occurrence of the entire pattern is replaced with the replacement. It cannot in any case modify the original string, because Python strings are immutable.
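Try using look-ahead and look-behind assertions to construct a regex that only matches the element itself. Be aware that Python's built-in re module requires fixed-width look-behinds and raises an error for the pattern below, so this approach needs an engine that allows variable-width look-behinds, such as the third-party regex module: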
myRe = re.compile(r"(?<=myFunc\(.+?\,.+?\,)(.+?)(?=\,.+?\,.+?\,.+?\,.+?\))")
Have you tried using named groups? http://docs.python.org/howto/regex.html#search-and-replace
Hopefully that will let you just target the 3rd match.
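Something along these lines might work with named groups. The group names head, third and tail are arbitrary choices for this sketch; \g<name> in the replacement refers back to a named group.

```python
import re

val = "myFunc('element','node','elementVersion','ext',12,0,0)"

pattern = re.compile(
    r"(?P<head>myFunc\(.+?,.+?,)"   # everything up to the 3rd argument
    r"(?P<third>.+?)"               # the 3rd argument itself
    r"(?P<tail>,.+?,.+?,.+?,.+?\))" # the remaining four arguments
)

third = pattern.search(val).group("third")
result = pattern.sub(r"\g<head>'noVersion'\g<tail>", val)

print(third)   # 'elementVersion'
print(result)  # myFunc('element','node','noVersion','ext',12,0,0)
```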
If you want to do this without using regex:
>>> s = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> l = s.split(",")
>>> l[2]="'noVersion'"
>>> s = ",".join(l)
>>> s
"myFunc('element','node','noVersion','ext',12,0,0)"