Regexp to parse HTML imgs - python

I'm crawling through an HTML page and I want to extract the img srcs and the a hrefs.
On the particular site, all of them are encapsulated in double quotes.
I've tried a wide variety of regexps with no success. Assume characters inside the double-quotes will be [-\w/] (printable characters [a-zA-Z\d-_] and / and .)
In python:
re.search(r'img\s+src="(?P<src>[\w-/]+_"', line)
Doesn't return anything, but
re.search(r'img\s+src="(?P[-\w[/]]+)"', line)
Returns way too much (i.e., does not stop at the ").
I need help creating the right regexp. Thanks in advance!

I need help creating the right regexp.
No, you need help in finding the right tool.
Try BeautifulSoup.
(If you insist on using regular expressions - and I'd advise against it - try changing the greedy + to non-greedy +?).
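To make the "use a parser" advice concrete, here is a minimal sketch. It swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything; the LinkCollector class, the collect_links helper, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects <img src> and <a href> values."""

    def __init__(self):
        super().__init__()
        self.srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def collect_links(html):
    """Return (list of img srcs, list of a hrefs) found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs, parser.hrefs


# Sample markup invented for this example
sample = '<p><a href="/gallery/index"><img src="/images/cat-01.png" alt="x"></a></p>'
srcs, hrefs = collect_links(sample)
print(srcs)
print(hrefs)
```

Because the parser hands over attribute values already unquoted, attribute order and quoting style stop mattering.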

Here's an example of a better way to do it than with regex, using the excellent lxml library and xpath
In [1]: import lxml.html
In [2]: doc = lxml.html.parse('http://www.google.com/search?q=kittens&tbm=isch')
In [3]: doc.xpath('//img/@src')
Out[3]:
['/images/nav_logo_hp2.png',
'http://t1.gstatic.com/images?q=tbn:ANd9GcQhajNZimPGLw9iTfzrAF_HV5UogY-KGep5WYgw-VHZ15oaAwGquNb5Q2I',
'http://t2.gstatic.com/images?q=tbn:ANd9GcS1LgVIlDgoIfNzwU4xBz9fL32ZJjZU26aB4aynRsEcz2VuXmjCtvxUonM',
'http://t1.gstatic.com/images?q=tbn:ANd9GcRgouJt5Moe8uTnDPUFTo4csZOcBtEDA_B7WdRPe8pdZroR5QB2q_-LT59G',
[...]
]

A good trick for finding things inside quotes is "([^"]+)". It matches any run of characters other than the quote, so the match stops at the closing quote.
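A small sketch of that trick applied to the question's pattern, assuming the attributes really are always double-quoted as stated. The sample line is invented; the greedy variant is included only to show the "returns too much" behaviour.

```python
import re

# Sample line invented for this example
line = '<a href="/gallery/index"><img src="/images/cat-01.png" alt="a cat"></a>'

# Greedy .+ runs on to the LAST double quote on the line
greedy = re.search(r'img\s+src="(?P<src>.+)"', line)

# [^"]+ cannot cross a quote, so it stops at the closing one
match = re.search(r'img\s+src="(?P<src>[^"]+)"', line)

# The same idea works for the a hrefs
hrefs = re.findall(r'<a\s+href="([^"]+)"', line)

print(greedy.group("src"))
print(match.group("src"))
print(hrefs)
```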
For help with creating regular expressions I can strongly recommend Expresso ( http://www.ultrapico.com/Expresso.htm )

Related

Processing a HTML file using Python

I wanted to remove all the tags in HTML file. For that I used re module of python.
For example, consider the line <h1>Hello World!</h1>. I want to retain only "Hello World!". In order to remove the tags, I used re.sub('<.*>','',string). For obvious reasons the result I get is an empty string (the regexp matches from the first angle bracket to the last one and removes everything in between). How can I get around this issue?
You can make the match non-greedy: '<.*?>'
You also need to be careful, HTML is a crafty beast, and can thwart your regexes.
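A quick sketch of the difference, using the line from the question. The last input is a made-up example of markup that defeats the non-greedy pattern too.

```python
import re

line = "<h1>Hello World!</h1>"

# Greedy: .* stretches from the first < to the last >
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: .*? stops at the first > it can
lazy = re.sub(r"<.*?>", "", line)

# Invented input with a > inside an attribute value
tricky = "<p title='a > b'>text</p>"
broken = re.sub(r"<.*?>", "", tricky)

print(repr(greedy))
print(repr(lazy))
print(repr(broken))
```

The third result still contains part of the attribute, which is the kind of case where a parser is the safer choice.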
Parse the HTML using BeautifulSoup, then only retrieve the text.
make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy
Off-topic: the regular-expression approach is error-prone; it cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
Use a parser, either lxml or BeautifulSoup:
import lxml.html
print(lxml.html.fromstring(mystring).text_content())
Related questions:
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
Beautiful Soup is great for parsing html!
You might not require it now, but it's worth learning to use it. Will help you in the future too.

What is wrong with this Regular Expression?

I am trying to create a test to verify that a link is rendered on a webpage.
I'm not understanding what I'm doing wrong on this assertion test:
self.assertRegexpMatches( response.content, r'elite')
I know that the markup is on the page because I copied it from response.content
I tried to use the regular expression in the Python shell:
In [27]: links = """<div class="tabsA">activenewesthottestmost votedelite</div>"""
In [28]: re.search(r'elite', links)
For some reason it's not working there either.
How do I create the regular expression so it works?
Why are you using a regex here? There's absolutely no reason to. You're just matching a simple string. Use:
self.assertContains(response, 'elite')
The ? in your regex is getting interpreted as a ? quantifier (end of this part):
<a href="/questions/?...
Thus the engine never matches the literal ? that appears in the string, and instead matches an optional / at that position. Escape it with a backslash like so:
<a href="/questions/\?...
You should escape "?", because that symbol has a special meaning on regex.
>>> re.search(r'elite', links)
The ? character is a special RegEx Character and must be escaped.
The following regexp would work:
elite
Note the \ before the ?
A great tool for messing around with RegEx can be found here:
http://regexpal.com/
It can save you an awful lot of time and headaches...
It's probably the "<" and ">" characters. In some regular expression syntaxes they are special characters that indicate beginning and end of line.
You might look at a regular expression tester tool to help you learn them.

De-greedifying a regular expression in python

I'm trying to write a regular expression that will convert a full path filename to a short filename for a given filetype, minus the file extension.
For example, I'm trying to get just the name of the .bar file from a string using
re.search('/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
According to the Python re docs, *? is the ungreedy version of *, so I was expecting to get
'foo'
returned for match.group(1) but instead I got
'def_params/param_1M56/param/foo'
What am I missing here about greediness?
What you're missing isn't so much about greediness as about regular expression engines: they work from left to right, so the / matches as early as possible and the .*? is then forced to work from there. In this case, the best regex doesn't involve greediness at all (you need backtracking for that to work; it will, but could take a really long time to run if there are a lot of slashes), but a more explicit pattern:
'/([^/]*)\.bar$'
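A short sketch comparing the two patterns on the path from the question, to show that the leftmost match wins regardless of laziness:

```python
import re

path = "/def_params/param_1M56/param/foo.bar"

# The match starts at the FIRST "/", then .*? grows until \.bar$ fits
lazy = re.search(r"/(.*?)\.bar$", path)

# [^/]* cannot cross a slash, so only the last "/" can start a match
explicit = re.search(r"/([^/]*)\.bar$", path)

print(lazy.group(1))
print(explicit.group(1))
```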
I would suggest changing your regex so that it doesn't rely on greediness.
You want only the filename before the extension .bar and everything after the final /. This should do:
re.search(r'/([^/]*)\.bar$', '/def_params/param_1M56/param/foo.bar')
What this does is it matches /, then zero or more characters (as much as possible) that are not / and then .bar.
I don't claim to understand the non-greedy operators all that well, but a solution for that particular problem would be to use ([^/]*?)
The regular expression engine starts matching from the left. Put a greedy .* at the start and it should work.
I like regex but there is no need of one here.
path = '/def_params/param_1M56/param/foo.bar'
print(path.rsplit('/',1)[1].rsplit('.')[0])
path = '/def_params/param_1M56/param/fululu'
print(path.rsplit('/',1)[1].rsplit('.')[0])
path = '/def_params/param_1M56/param/one.before.two.dat'
print(path.rsplit('/',1)[1].rsplit('.',1)[0])
result
foo
fululu
one.before.two
Other people have answered the regex question, but in this case there's a more efficient way than regex:
file_name = path[path.rindex('/')+1 : path.rindex('.')]
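Another regex-free option is the standard library's os.path, sketched below. This is a swapped-in alternative to the slicing shown above, and short_name is a made-up helper name. Unlike rindex('.'), it does not raise when the file name has no extension.

```python
import os.path


def short_name(path):
    """Hypothetical helper: file name without directory or last extension."""
    return os.path.splitext(os.path.basename(path))[0]


print(short_name('/def_params/param_1M56/param/foo.bar'))
print(short_name('/def_params/param_1M56/param/fululu'))
print(short_name('/def_params/param_1M56/param/one.before.two.dat'))
```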
try this one on for size:
match = re.search(r'.*/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')

HTML code processing

I want to process some HTML code and remove the tags as in the example:
"<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting paragraph."
I'm using Python as technology; do you know any framework I may use to remove the HTML tags?
Thanks!
This question may help you: Strip HTML from strings in Python
No matter what solution you choose, I'd recommend avoiding regular expressions. They can be slow when processing large strings, they might not work due to invalid HTML, and stripping HTML with regex isn't always secure or reliable.
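If installing a third-party library is not an option, the standard library's html.parser can do the same job. The sketch below is only one possible approach; TextExtractor and strip_tags are names invented for the example. Note that it keeps every text node, so the contents of script or style elements would be included too.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of html with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


text = "<p><b>This</b> is a very interesting paragraph.</p>"
print(strip_tags(text))
```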
BeautifulSoup
import libxml2
text = "<p><b>This</b> is a very interesting paragraph.</p>"
root = libxml2.parseDoc(text)
print(root.content)
# 'This is a very interesting paragraph.'
Depending on your needs, you could just use the regular expression /<(.|\n)*?>/ and replace all matches with empty strings. This works perfectly for manual cases, but if you're building this as an application feature then you'll need a more robust and secure option.
you can use lxml.

How to remove tags from a string in python using regular expressions? (NOT in HTML)

I need to remove tags from a string in python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in python. I'm using this particularly for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
This should work:
import re
re.sub('<[^>]*>', '', mystring)
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
Please avoid using regex. Even though a regex will work on your simple string, you will run into problems later if you get a more complex one.
You can use BeautifulSoup get_text() feature.
from bs4 import BeautifulSoup
text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, 'html.parser')
print(soup.get_text())
Searching this regex and replacing it with an empty string should work.
/<[A-Za-z\/][^>]*>/
Example (from python shell):
>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z\/][^>]*>', '', my_string))
Title
If it's only for parsing and retrieving value, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # 'Title'
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
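One way to combine the two suggestions is sketched below: try ElementTree first and fall back to the simple regex when the text does not parse. strip_markup is a made-up helper, the nested and unclosed inputs are invented for illustration, and the regex fallback is an assumption of this example, not something ElementTree provides.

```python
import re
import xml.etree.ElementTree as ET


def strip_markup(text):
    """Hypothetical helper: return text with its tags removed."""
    try:
        # itertext() walks all nested text, not just the root's own
        return "".join(ET.fromstring(text).itertext())
    except ET.ParseError:
        # Not well-formed XML: fall back to the plain regex
        return re.sub(r"<[^>]*>", "", text)


print(strip_markup('<FNT name="Century Schoolbook" size="22">Title</FNT>'))
print(strip_markup('<FNT size="22">Big <B>Title</B></FNT>'))  # invented nesting
print(strip_markup('<FNT size="22">Title'))                   # invented, unclosed
```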
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.
