I would like to use the regular expressions in Python to get everything that is after a </html> tag, an put it in a string. So I tried to understand how to do it in Python but I was not able to make it work. Can anyone explain me how to do this ridiculous simple task ?
You can do this without a regular expression:
text[text.find('</html>')+7:]
m = re.match(".*<\html>(.*)",my_html_text_string)
print m.groups()
or even better
print my_html_string.split("</html>")[-1]
import re
text = 'foo</html>bar'
m = re.search('</html>(.*)', text)
print m.group(1)
Related
I want to normalize strings like
'1:2:3','10:20:30'
to
'01:02:03','10:20:30'
by using re module of Python,
so I am trying to select the string like '1:2:3' then match the single number '1','2','3'..., here is my pattern:
^\d(?=\D)|(?<=\D)\d(?=\D)|(?<=\D)\d$
it works but I think the pattern is not simple enough, anybody could help me simplify it? or use map()/split() if it's more sophisticated.
\b matches between a word character and a non-word character.
>>> import re
>>> l = ['1:2:3','10:20:30']
>>> [re.sub(r'\b(\d)\b', r'0\1', i) for i in l]
['01:02:03', '10:20:30']
DEMO
re.sub(r"(?<!\d)(\d)(?!\d)",r"0\1",test_str)
You can simplify it to this.See demo.
https://regex101.com/r/nD5jY4/4#python
If the string is like
x="""'1:2:3','10:20:30'"""
Then do
print ",".join([re.sub(r"(?<!\d)(\d)(?!\d)",r"0\1",i) for i in x.split(",")])
You could do this with re, but pretty much nobody will know how it works afterwards. I'd recommend this instead:
':'.join("%02d" % int(x) for x in original_string.split(':'))
It's more clear how it works.
Hopefully someone can help, I'm trying to use a regular expression to extract something from a string that occurs after a pattern, but it's not working and I'm not sure why. The regex works fine in linux...
import re
s = "GeneID:5408878;gbkey=CDS;product=carboxynorspermidinedecarboxylase;protein_id=YP_001405731.1"
>>> x = re.search(r'(?<=protein_id=)[^;]*',s)
>>> print(x)
<_sre.SRE_Match object at 0x000000000345B7E8>
Use .group() on the search result to print the captured groups:
>>> print(x.group(0))
YP_001405731.1
As Martijn has had pointed out, you created a match object. The regular expression is correct. If it was wrong, print(x) would have printed None.
You should probably think about re-writing your regex so that you find all pairs so you don't have to muck around with specific groups and hard-coded look behinds...
import re
kv = dict(re.findall('(\w+)=([^;]+)', s))
# {'gbkey': 'CDS', 'product': 'carboxynorspermidinedecarboxylase', 'protein_id': 'YP_001405731.1'}
print kv['protein_id']
# YP_001405731.1
Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?
Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you wanted all tags, then you should consider changing it to be non-greedy (ie - .*?):
print re.findall(r'<title>(.*?)</title>', s)
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.
Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.
It will be much easier using BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
I have to split a string into a list of substrings according to the criteria that all the parenthesis strings should be split .
Lets say I have (9+2-(3*(4+2))) then I should get (4+2), (3*6) and (9+2-18).
The basic objective is that I learn which of the inner parenthesis is going to be executed first and then execute it.
Please help....
It would be helpful if you could suggest a method using re module. Just so this is for everyone it is not homework and I understand Polish notation. What I am looking for is using the power of Python and re module to use it in less lines of code.
Thanks a lot....
The eval is insecure, so you have to check input string for dangerous things.
>>> import re
>>> e = "(9+2-(3*(4+2)))"
>>> while '(' in e:
... inner = re.search('(\([^\(\)]+\))', e).group(1)
... e = re.sub(re.escape(inner), eval('str'+inner), e)
... print inner,
...
(4+2) (3*6) (9+2-18)
Try something like this:
import re
a = "(9+2-(3*(4+2)))"
s,r = a,re.compile(r'\([^(]*?\)')
while('(' in s):
g = r.search(s).group(0)
s = r.sub(str(eval(g)),s)
print g
print s
This sounds very homeworkish so I am going to reply with some good reading that might lead you down the right path. Take a peek at http://en.wikipedia.org/wiki/Polish_notation. It's not exactly what you want but understanding will lead you pretty close to the answer.
i don't know exactly what you want to do, but if you want to add other operations and if you want to have more control over the expression, i suggest you to use a parser
http://www.dabeaz.com/ply/ <-- ply, for example
If I am finding & replacing some text how can I get it to replace some text that will change each day so ie anything between (( & )) whatever it is?
Cheers!
Use regular expressions (http://docs.python.org/library/re.html)?
Could you please be more specific, I don't think I fully understand what you are trying to accomplish.
EDIT:
Ok, now I see. This may be done even easier, but here goes:
>>> import re
>>> s = "foo(bar)whatever"
>>> r = re.compile(r"(\()(.+?)(\))")
>>> r.sub(r"\1baz\3",s)
'foo(baz)whatever'
For multiple levels of parentheses this will not work, or rather it WILL work, but will do something you probably don't want it to do.
Oh hey, as a bonus here's the same regular expression, only now it will replace the string in the innermost parentheses:
r1 = re.compile(r"(\()([^)^(]+?)(\))")