Python regex help needed - python

I need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
I can get it when I use
start = '<font color="red">'
end = '</font>'
expression = start + '(.*?)' + end
match = re.compile(expression).search(web_source_code)
needed_info = match.group(1)
but then I have to pick either <font> or <span>, and the match fails whenever the site uses the other tag.
How do I modify the regular expression so it would always succeed?

Don't parse HTML with regex.
Regex is not the right tool to use for this problem. Look up BeautifulSoup or lxml.
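For illustration, here is a minimal sketch of the parser approach using only the standard library's html.parser, so nothing has to be installed. The names HighlightExtractor and extract_highlighted are made up for this example, and it assumes the attribute values appear exactly as in the question and that the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class HighlightExtractor(HTMLParser):
    """Collect the text inside <font color="red"> or bold-styled <span> tags."""

    # (tag, attribute, value) triples taken from the question
    WANTED = {("font", "color", "red"), ("span", "style", "font-weight:bold;")}

    def __init__(self):
        super().__init__()
        self.results = []       # finished text fragments
        self._open_tag = None   # name of the wanted tag we are inside, if any
        self._buffer = []       # text pieces seen since that tag opened

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if self._open_tag is None and any((tag, k, v) in self.WANTED for k, v in attrs):
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.results.append("".join(self._buffer))
            self._open_tag = None


def extract_highlighted(html):
    parser = HighlightExtractor()
    parser.feed(html)
    parser.close()
    return parser.results
```

BeautifulSoup or lxml would do the same job with less code and cope better with messy markup; this only shows the idea of letting a parser find the tags.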

You can join two alternatives with a vertical bar:
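start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'
Each alternation is wrapped in a non-capturing group (?:...) so that the | applies only to the tags and not to the rest of the expression. This is safe because you know that a font tag will always be closed by </font> and a span tag always by </span>.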
However, consider using a solid HTML parser such as BeautifulSoup rather than rolling your own regular expressions; HTML is, in general, particularly unsuitable for parsing with regular expressions.
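A minimal sketch of the alternation approach, assuming the page source is already in a string. The helper name find_needed_info is hypothetical, and re.DOTALL is my addition so that the wanted text may span several lines.

```python
import re

# Each alternation sits in a non-capturing group (?:...), so the "|" applies
# only to the tags and the wanted text stays in capture group 1.
START = r'(?:<font color="red">|<span style="font-weight:bold;">)'
END = r'(?:</font>|</span>)'
PATTERN = re.compile(START + r'(.*?)' + END, re.DOTALL)


def find_needed_info(source):
    """Return the first highlighted text, or None when neither tag is present."""
    match = PATTERN.search(source)
    return match.group(1) if match else None
```

Returning None instead of calling group(1) on a failed match avoids the AttributeError you would otherwise get when the site changes its markup.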

Regular expressions are not your best choice for parsing HTML, but for the sake of education, here is a possible answer to your question:
start = '<(?P<tag>font|span) (?:color="red"|style="font-weight:bold;")>'
end = '</(?P=tag)>'
expression = start + '(.*?)' + end
The named group tag is group 1, so the text you want is in match.group(2).

expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
needed_info = match.group(2) if match.group(2) is not None else match.group(3)
This gets the job done (group 2 is filled when the font tag matched, group 3 when the span tag matched), but you shouldn't really be using regex to parse HTML.

Regex and HTML are not such a good match: HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool to employ here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.
Here is how to address your question using pyparsing:
html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and
<span style="font-weight:normal;">dont want this either</span>
"""
from pyparsing import *
font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))
# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd
# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern
# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
    print(text.body)
Prints:
needed-info-here
needed-info-here

I haven't used Python, but if you make expressions equal to the following, it should work:
/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi
Then just access your needed info with the name "info".
PS - I also agree about the "not parsing HTML with regex" rule, but if you know that it will appear in either font or span tags, then so be it...
Also, why use the font tag? I haven't used a font tag since I learned CSS.
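Since the expression above is written in /.../gi notation, here is a rough Python translation as a sketch. The i flag becomes re.IGNORECASE and the g flag becomes finditer; the backreference (?P=tag) and the helper name all_info are my additions. Like the original, it accepts any font or span tag regardless of its attributes.

```python
import re

TAGGED = re.compile(
    r'<(?P<tag>font|span)\b[^>]*>'   # opening tag with any attributes
    r'(?P<info>[^<]+)'               # the text we want (no nested tags)
    r'</(?P=tag)\s*>',               # closing tag must match the opening one
    re.IGNORECASE,
)


def all_info(source):
    """Return the text of every font/span element found in source."""
    return [m.group("info") for m in TAGGED.finditer(source)]
```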

Related

Alternatives to Python's re.search

I am using re.search to check whether a string of text is found in an HTML page. Sometimes it does not find the string although it is definitely there. For example, I would like to find: <div class="dlInfo-Speed">. Does anyone know how to create a regex to find that string?
Does anyone know of any good alternatives to re.search?
Thanks
If you just want to determine if a substring is present, you can use in for that.
if some_substring in some_string:
    do_something_exciting()
As for a regex, this is the best I got right now:
if re.search(r'<[dD][iI][vV]\s+.*?class="dlInfo-Speed".*?>(.*?)</[dD][iI][vV]>',
             html_doc,
             re.DOTALL):
    print("found")
else:
    print("not found")
http://regexr.com?37iqr
I found that regexes are not the best solution for 99% of problems like this.
My alternative is BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
Here's how to solve it with bs4:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.find("div", class_="dlInfo-Speed")
print(tag.string)  # one way to get the contents
As noted, it is possible that the string is not found because other HTML is mixed in with it. It's also possible that it's formatted in such a way that there are newlines in between the tag attributes, like:
some text goes here <div
class="dlInfo-Speed"> More text
or even
some text goes here <div class="dlInfo-Speed"
> More text
You can write a regex that will account for whitespace (including newlines and tabs) in all the places it may occur:
re.search(r'<div \s+ class="dlInfo-Speed" \s* >', text, re.VERBOSE)
But overall I strongly agree with the comment that for anything more than very simple, well-defined searches, it is usually best to parse the HTML properly and walk the document tree to find what you're looking for.
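As a sketch of the whitespace-tolerant search, the pattern can be compiled once and wrapped in a small helper. The name has_speed_div is hypothetical, and allowing whitespace around the = sign and ignoring case are my additions.

```python
import re

DIV_OPEN = re.compile(
    r"""
    <div \s+                    # tag name, then at least one whitespace char
    class \s* = \s*             # attribute name, optional space around the equals sign
    "dlInfo-Speed"              # the class value from the question
    \s* >                       # optional whitespace before the tag closes
    """,
    re.VERBOSE | re.IGNORECASE,
)


def has_speed_div(html):
    """True if the opening div tag occurs, even when split across lines."""
    return DIV_OPEN.search(html) is not None
```

In re.VERBOSE mode, literal spaces in the pattern are ignored, which is why \s has to be spelled out wherever whitespace may occur.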
There is a chance that the string that fails to be found is mixed with some html tags:
<div>string you are <span class="x">looking</span> for</div>
Maybe you should try removing html tags (unless they contain the string you search for) so the text is easier to search through. A simple way to do it using regex:
text = re.sub('<[^<]+?>', '', html_page)
if some_substring in text:
    do_something(text)
As for re.search alternatives, you can use the string index method:
try:
    index = html_data.index(some_substring)
    do_something(html_data)
except ValueError:
    # string not found
    pass
or even the find method:
if html_data.find(some_substring) >= 0:
    do_something(html_data)

python match image tags from large content string using regular expressions

I'm really a noob with regular expressions. I tried to do this on my own, but I couldn't understand from the manuals how to approach it. I'm trying to find all img tags in some given content. I wrote the code below, but it returns None:
content = i.content[0].value
prog = re.compile(r'^<img')
result = prog.match(content)
print result
any suggestions?
Multipurpose solution:
image_re = re.compile(r"""
(?P<img_tag><img)\s+ #tag starts
[^>]*? #other attributes
src= #start of src attribute
(?P<quote>["''])? #optional open quote
(?P<image>[^"'>]+) #image file name
(?(quote)(?P=quote)) #close quote
[^>]*? #other attributes
> #end of tag
""", re.IGNORECASE|re.VERBOSE) #re.VERBOSE allows to define regex in readable format with comments
image_tags = []
for match in image_re.finditer(content):
    image_tags.append(match.group("img_tag"))
# print found image_tags
for image_tag in image_tags:
    print(image_tag)
As you can see in regex definition, it contains
(?P<group_name>regex)
It allows you to access found groups by group_name rather than by number, which improves readability. So, if you want to collect the src attributes of all img tags, just write:
for match in image_re.finditer(content):
    image_tags.append(match.group("image"))
After this, the image_tags list will contain the src values of the image tags.
Also, if you need to parse HTML, there are tools designed exactly for that purpose, for example lxml, which supports XPath expressions.
I don't know Python but assuming it uses normal Perl compatible regular expressions...
You probably want to look for "<img[^>]+>" which is: "<img", followed by anything that is not ">", followed by ">". Each match should give you a complete image tag.
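In Python, that suggestion might look like the sketch below. The question's code returned None because match() only tries the very start of the string and the pattern began with ^; findall() scans the whole content. The helper name find_img_tags is hypothetical, and \b and re.IGNORECASE are small additions of mine.

```python
import re

# "<img", then anything that is not ">", then ">"
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def find_img_tags(content):
    """Return every complete img tag found anywhere in content."""
    return IMG_TAG.findall(content)
```

This breaks if an attribute value itself contains a > character, which is one more reason to prefer a real parser for untrusted HTML.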

Python regex: how to extract inner data from regex

I want to extract data from such regex:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found a related question, "extract contents of regex", but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
     ^^^^^^^^^           ^^^^^^^^^^^           ^^^^^           ^^^^^^^^^^^
     group 1             group 2               group 3         group 4
And then access the groups by number. (Only the first line is the pattern; the line of '^'s and the line naming the groups are just there to help you see the capture groups as specified by the parentheses.)
dataPattern = re.compile(r"<td>[a-zA-Z]+</td>... etc.")
match = dataPattern.search(htmlstring)
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using re's to crack HTML source is one of the paths toward madness. Many potential surprises lurk in your input HTML that are perfectly valid HTML but will easily defeat your re:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
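Since the question asks how to iterate, here is a sketch using finditer over every matching row. It escapes the decimal point as \. (the unescaped . in the original matches any character), and tolerates uppercase tags and whitespace, two of the surprises listed above. The names ROW and extract_rows are hypothetical.

```python
import re

ROW = re.compile(
    r"<td>\s*([a-zA-Z]+)\s*</td>\s*"   # group 1: a word
    r"<td>\s*(\d+\.\d+)\s*</td>\s*"    # group 2: a decimal number
    r"<td>\s*(\d+)\s*</td>\s*"         # group 3: an integer
    r"<td>\s*(\d+\.\d+)\s*</td>",      # group 4: a decimal number
    re.IGNORECASE,
)


def extract_rows(html):
    """Return one tuple of four strings per matching table row."""
    return [m.groups() for m in ROW.finditer(html)]
```

It still fails on &nbsp; padding or on td tags that carry attributes, so treat it as a stopgap rather than a scraper.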
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, you should be aware that whatever regular expression [string] is left, what you can parse with that is probably not what you want, i.e. it's not going to automatically match anything that it matched before without <td> tags!

replace URLs in text with links to URLs

Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does.
Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
You can load the document up with a DOM/HTML parsing library ( see html5lib ), grab all text nodes, match them against a regular expression and replace the text nodes with a regex replacement of the URI with anchors around it using a PCRE such as:
/(https?:[;\/?\\#&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\#\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
I'm quite sure you can scourge through and find some sort of utility that does this, I can't think of any off the top of my head though.
Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?
import re
urlfinder = re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+):[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:#&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]")
def urlify2(value):
    return urlfinder.sub(r'<a href="\1">\1</a>', value)
call urlify2 on a string and I think that's it if you aren't dealing with a DOM object.
I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:
_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')
def linkify(text, maxlinklength):
    def replacewithlink(matchobj):
        url = matchobj.group(0)
        text = str(url)
        if text.startswith('http://'):
            text = text.replace('http://', '', 1)
        elif text.startswith('https://'):
            text = text.replace('https://', '', 1)
        if text.startswith('www.'):
            text = text.replace('www.', '', 1)
        if len(text) > maxlinklength:
            halflength = maxlinklength // 2
            text = text[0:halflength] + '...' + text[len(text) - halflength:]
        return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'
    if text is not None and text != '':
        return _urlfinderregex.sub(replacewithlink, text)
    else:
        return ''
You'll need to get a link out image, but that's pretty easy. This is specifically for user submitted text like comments which I assume is usually what people are dealing with.
/\w+:\/\/[^\s]+/
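A sketch of how that pattern could be applied to plain text in Python. The helper name linkify_plain is hypothetical, and the HTML escaping is my addition so that characters like & and < in the surrounding text do not end up as raw markup.

```python
import html
import re

# scheme, "://", then everything up to the next whitespace
URL = re.compile(r"\w+://\S+")


def linkify_plain(text):
    """Escape plain text for HTML and wrap each URL in an anchor tag."""
    parts = []
    last = 0
    for m in URL.finditer(text):
        parts.append(html.escape(text[last:m.start()]))   # text before the URL
        url = html.escape(m.group(0), quote=True)
        parts.append('<a href="{0}">{0}</a>'.format(url))
        last = m.end()
    parts.append(html.escape(text[last:]))                # text after the last URL
    return "".join(parts)
```

Because the pattern stops only at whitespace, trailing punctuation such as a full stop right after a URL becomes part of the link; a stricter pattern would be needed to avoid that.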
When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.
Matching the actual URLs is probably best done with the urlparse module (urllib.parse in Python 3). Full discussion here: How do you validate a URL with a regular expression in Python?
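A rough sketch of that idea for Python 3, where the module is urllib.parse: let a loose regex propose candidates, then keep only those that split into a scheme and a host. The helper name looks_like_url and the set of allowed schemes are assumptions for this example.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "ftp"}  # assumption: adjust to your needs


def looks_like_url(candidate):
    """True if candidate has an allowed scheme and a non-empty host part."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit rejects some malformed inputs outright
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

This is a plausibility check only; it does not verify that the host exists or that the URL is fully valid.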
Gmail is a lot more open, when it comes to URLs, but it is not always right either. e.g. it will make www.a.b into a hyperlink as well as http://a.b but it often fails because of wrapped text and uncommon (but valid) URL characters.
See Appendix A, "Collected BNF for URI", of RFC 2396 for the syntax, and use that to build a reasonable regular expression that also considers what surrounds the URL. You would be well advised to consider a couple of scenarios where URLs might end up.

Extract part of a regex match

I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?
Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)
Note that starting in Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-use it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
# hello
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
May I recommend Beautiful Soup? It is a very good library for parsing an entire HTML document.
soup = BeautifulSoup(html_doc)
titleName = soup.title.string
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
The provided pieces of code do not cope with Exceptions
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns an empty string by default if the pattern has not been found, or the first match.
I'd think this should suffice:
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with others recommending an HTML parser - not only, but also to handle non-standard use of HTML tags.
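For comparison, here is a minimal parser-based sketch using only the standard library's html.parser. The names TitleParser and parse_title are made up for this example; it returns the first title only and None when there is none.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Remember the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.title = None        # set once the first title has closed
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # tag names arrive lowercased, so <TITLE> is handled as well
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self.title = "".join(self._chunks)
            self._in_title = False


def parse_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

BeautifulSoup or lxml remain the more forgiving choice for badly broken markup.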
I needed something to match package-0.0.1 (name, version) but want to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name: {name}')
    print(f'Version: {version}')
else:
    raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing, and (?<=<title>).+(?=<\/title>) works great. It only matches what's between the tags, so you don't have to do the whole group thing.
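A sketch of the lookaround variant in Python. I changed .+ to the non-greedy .+? because the greedy form runs through to the last </title> when a page contains two titles, and added re.DOTALL so the title may contain newlines. The helper name title_via_lookaround is hypothetical.

```python
import re

# Lookbehind and lookahead assert the tags without consuming them,
# so group(0) is just the title text.
TITLE_LOOKAROUND = re.compile(
    r"(?<=<title>).+?(?=</title>)",
    re.IGNORECASE | re.DOTALL,
)


def title_via_lookaround(html):
    m = TITLE_LOOKAROUND.search(html)
    return m.group(0) if m else None
```

One limitation: Python's re module requires a fixed-width lookbehind, so this cannot tolerate variants such as <title > with a space, which the capture-group versions above handle.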
