python regex: extract contents of an HTML element [duplicate] - python

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
I have elements in an HTML page in this format:
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td
class="cell7">Philadelphia</td>
I want to use Python to extract the "Dave Mason's Traffic Jam" part, the "Scottish Rite Auditorium" part, etc. individually from the text.
Using this regular expression '.*' returns everything from the first tag to the last tag before the next newline. How can I change the expression so that it only returns the chunk between each tag pair?
Edit: #HenryKeiter & #Hakiko that'd be grand, but this is for an assignment that requires me to use Python regex.

Here is a hint, not a full solution: you'll need to use a non-greedy regexp in your case. Basically, you'll need to use
.*?
instead of
.*
Non-greedy means that the shortest possible match is used. By default, quantifiers are greedy and match as much as possible.
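To make the difference concrete, here is a minimal sketch on a shortened, made-up version of the row from the question. The surrounding pattern `<td[^>]*>...</td>` is an assumption for illustration, since the question does not show the full expression.

```python
import re

# Shortened, hypothetical version of the row from the question.
row = ('<td class="cell1"><b>Dave Mason\'s Traffic Jam</b></td>'
       '<td class="cell2">Scottish Rite Auditorium</td>')

# Greedy: .* runs to the last </td> on the line, so both cells
# come back glued together as a single match.
greedy = re.findall(r'<td[^>]*>(.*)</td>', row)

# Non-greedy: .*? stops at the first </td> it can reach,
# which gives one match per cell.
lazy = re.findall(r'<td[^>]*>(.*?)</td>', row)

print(len(greedy))  # 1
print(lazy)
```

Note that the first cell still contains its inner `<b>` tags; removing those is left as a separate step.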

Use Beautiful Soup:
from bs4 import BeautifulSoup
html = '''
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td
class="cell7">Philadelphia</td>
'''.strip()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
tds = soup.find_all('td')
contentList = []
for td in tds:
    contentList.append(td.get_text())
print(contentList)
Returns
["Dave Mason's Traffic Jam", 'Scottish Rite\nAuditorium', '$29-$45', 'On sale now', 'TIX', 'AA', 'Philadelphia']

Related

how to look behind in regex without matching a pattern itself?

Let's say we want to extract the link from a tag like this:
input:
<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>
desired output:
http://www.google.com/home/etc
The first solution is to capture the link in a group, using the regex href=[\'"]?([^\'" >]+).
But what I want is a match that contains only the link that follows href. Trying (?=href\")... (a lookahead assertion: it matches without consuming) still matches the href itself.
This is a regex-only question.
One of many regex based solutions would be a capturing group:
>>> re.search(r'href="([^"]*)"', s).group(1)
'http://www.google.com/home/etc'
[^"]* matches any number of characters other than ".
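Since the question asks specifically about looking behind, here is a minimal sketch of a lookbehind assertion, `(?<=...)`, which checks what comes before the match without including it. The sample string `s` is an assumption, modeled on the desired output in the question.

```python
import re

# Assumed sample input: a link wrapped around the bold text.
s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# (?<=href=") asserts that href=" sits immediately before the match
# but does not make it part of the match.
m = re.search(r'(?<=href=")[^"]*', s)
url = m.group(0)
print(url)  # http://www.google.com/home/etc
```

Python's `re` module requires the lookbehind pattern to have a fixed width, which is why this sketch only handles double-quoted attributes.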
A solution could be:
(?:href=)('|")(.*)\1
(?:href=) is a non-capturing group. The engine uses href= during matching but does not store it in a group; if you try this pattern you will see that no group holds it.
Every other pair of round brackets creates a capturing group. As a consequence, ('|") defines group #1 and the URL you want will be in group #2. How you retrieve them depends on the programming language.
Finally, \1 is a backreference: it matches the same text that group #1 matched (in this case "), so the URL must end with the quote character it started with. Because (.*) is greedy, it runs to the last such quote on the line; (.*?) stops at the first one.
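In Python, the groups from this pattern can be read as sketched below. The sample string is made up, and it deliberately contains a second single-quoted attribute to show how the greedy `(.*)` and the non-greedy `(.*?)` differ.

```python
import re

# Hypothetical input with two single-quoted attributes.
s = "<a href='http://www.google.com/home/etc' title='home'>some text</a>"

# Pattern from the answer: group 1 is the quote, group 2 the URL,
# and \1 demands the same quote character again.
greedy = re.search(r"""(?:href=)('|")(.*)\1""", s)

# Same pattern with a non-greedy quantifier.
lazy = re.search(r"""(?:href=)('|")(.*?)\1""", s)

print(greedy.group(2))  # overshoots into the title attribute
print(lazy.group(2))    # http://www.google.com/home/etc
```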
Make yourself comfortable with a parser, e.g. with BeautifulSoup.
With this, it could be achieved with
from bs4 import BeautifulSoup
html = """<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>"""
soup = BeautifulSoup(html, "html5lib")
print(soup.find('a').text)
# some text
BeautifulSoup supports a number of selectors including CSS selectors.

Python regex to extract html paragraph

I'm trying to extract paragraphs from HTML by using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the HTML contains paragraphs. Why?
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <p>text1</p>
... <p></p>
... <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True helps to filter out empty paragraphs.
Make sure you use re.search (or re.findall) instead of re.match, which only matches at the very beginning of the string (your HTML almost certainly does not begin with a <p> tag).
Note also that your pattern is greedy, meaning it will return everything between the first <p> tag and the last </p> tag, which is something you definitely do not want. Try
re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
instead. The question mark after .* makes the regex stop at the first closing </p> tag, re.DOTALL lets . match newlines inside a paragraph, and findall returns every match, whereas search returns only the first.
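The three functions can be compared side by side on a small made-up document. This is only a sketch; the `html` string is an assumption, and the simplified pattern ignores attributes on the `<p>` tag.

```python
import re

# Hypothetical input: the second paragraph spans two lines.
html = "<div>\n<p>text1</p>\n<p>text2\ncontinued</p>\n</div>"

# re.match only tries position 0, where the string starts with <div>.
at_start = re.match(r'<p>.+?</p>', html)

# re.search scans forward and returns the first match only.
first = re.search(r'<p>.+?</p>', html)

# re.findall returns every match; DOTALL lets . cross the line break.
all_paragraphs = re.findall(r'<p>(.*?)</p>', html, flags=re.DOTALL)

print(at_start)        # None
print(first.group(0))  # <p>text1</p>
print(all_paragraphs)
```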
You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
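For the standard-library route, a minimal sketch with `html.parser.HTMLParser` (the Python 3 name of the HTMLParser module) might look like this. The class name `ParagraphCollector` and the sample input are made up for illustration, and nested `<p>` handling is not attempted.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of every non-empty <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a paragraph.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


collector = ParagraphCollector()
collector.feed("<div>\n<p>text1</p>\n<p></p>\n<p>text2</p>\n</div>")
collector.close()
print(collector.paragraphs)  # ['text1', 'text2']
```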

Looking for the right RE expression (python) [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 7 years ago.
I want to write a Python script that looks for:
<span class="toujours_cacher">(.)*?</span>
I use this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?\<\/span\>"
However, in some of my pages, I found this kind of expression
<span class="toujours_cacher">*
<span class="exposant" size="1">*</span> *</span>
so I tried this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?(\<\/span\>|\<\/span\>(.|\n)*?<\/span>)"
which is not good, because when there is no span in between, it looks for the next </span>.
I need to delete the content between the span with the class "toujours_cacher".
Is there any way to do it with one RE?
I will be pleased to hear any of your suggestions :)
This is (provably) impossible with regular expressions - they cannot match delimiters to arbitrary depth. You'll need to move to using an actual parser instead.
Please do not use regex to parse HTML, as it is not regular. You could use BeautifulSoup. Here is an example of BeautifulSoup finding the tag <span class="toujours_cacher">(.)*?</span>.
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlCode, "html.parser")  # htmlCode holds the page source
spanTags = soup.find_all('span', attrs={'class': 'toujours_cacher'})
This will return a list of all the span tags that have the class toujours_cacher.
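If an extra dependency is not an option, the nesting can also be tracked with the standard library by counting span depth, which is exactly what a single regular expression cannot do. The sketch below assumes the goal is to drop the hidden span together with its contents; the class name `HiddenSpanStripper` is made up, and comments and doctype declarations are not preserved.

```python
from html.parser import HTMLParser


class HiddenSpanStripper(HTMLParser):
    """Rebuilds the markup, skipping spans of a given class and their contents."""

    def __init__(self, hidden_class="toujours_cacher"):
        super().__init__(convert_charrefs=False)
        self.hidden_class = hidden_class
        self.depth = 0  # span nesting depth inside a hidden span; 0 = outside
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "span":
                self.depth += 1  # nested span inside the hidden one
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.hidden_class in classes:
            self.depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "span":
                self.depth -= 1
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.depth:
            self.out.append("&#%s;" % name)


stripper = HiddenSpanStripper()
stripper.feed('<p>keep <span class="toujours_cacher">* '
              '<span class="exposant" size="1">*</span> *</span> this</p>')
stripper.close()
result = "".join(stripper.out)
print(result)
```

The inner `</span>` only lowers the counter, so output resumes after the outer closing tag rather than after the first one.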

Using Python and Regex, how do you remove <sup> tags from HTML? [duplicate]

This question already has answers here:
Strip HTML from strings in Python
(28 answers)
Closed 8 years ago.
Using Python regex, how do I remove all <sup> tags in HTML? The tags sometimes have styling, such as below:
<sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>
I would like to remove everything between and including the sup tags in a larger string of html.
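I would use an HTML parser instead. For example, BeautifulSoup and unwrap() can handle your beautiful sup. From the documentation:
Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever's inside that tag. It's good for stripping out markup.
Note that unwrap() keeps the contents, (1) in this case; to remove the contents along with the tag, call decompose() instead.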
from bs4 import BeautifulSoup
data = """
<div>
<sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>
</div>
"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
for sup in soup.find_all('sup'):
    sup.unwrap()  # drop the <sup> tag, keep its contents
print(soup.prettify())
Prints:
<div>
(1)
</div>
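For completeness, since the question asks for a regex that removes the tags and everything between them, a minimal sketch is shown below. It assumes `<sup>` elements are never nested and that no attribute value contains a `>` character; the sample sentence is made up.

```python
import re

# Hypothetical input using the tag from the question.
html = ('Revenue<sup style="vertical-align:top;line-height:120%;'
        'font-size:7pt">(1)</sup> grew.')

# \b keeps <sup from matching longer tag names, [^>]* skips any attributes,
# and .*? with DOTALL stops at the first </sup>, even across line breaks.
SUP_RE = re.compile(r'<sup\b[^>]*>.*?</sup\s*>', re.IGNORECASE | re.DOTALL)

cleaned = SUP_RE.sub('', html)
print(cleaned)  # Revenue grew.
```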

matching elements using OR with regex in python [duplicate]

This question already has answers here:
Regex in python not taking the specified data in td element
(3 answers)
Closed 8 years ago.
I'm using regex in Python to extract data from HTML. The regex that I've written is like this:
result = re.findall(r'<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+|<td align="lef(.*?)" >(.*?)</td>\s+', webpage)
assuming that this will match a td that follows either of these formats:
<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+
OR
<td align="lef(.*?)" >(.*?)</td>
This is because the td can take different formats in that particular cell (it either has data with a link, or has no data at all).
I assume that the OR condition I've used is incorrect: I believe the OR applies only to the expression just before it and the expression just after it, not to the two entire td patterns.
My question is, how do I group it (for example with parentheses) so that the OR applies to the two entire td patterns?
You are using a regular expression, but matching XML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
ElementTree example:
from xml.etree import ElementTree
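# ElementTree is an XML parser, so the file must be well-formed XML.
tree = ElementTree.parse('filename.html')
for elem in tree.iter('tr'):  # iter() walks the whole tree; findall('tr') only checks the root's children
    print(ElementTree.tostring(elem, encoding='unicode'))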
In <td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+ the .?* should be replaced with .*?.
And, to answer your question, you can use non-capturing grouping to do what you want as follows:
(?:first_regex)|(?:second_regex)
BTW. You can also replace \d\d\d\d with \d{4}, which I think is easier to read.
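Putting these suggestions together gives something like the sketch below. The `webpage` rows are made up, and the second alternative is simplified to a plain `align="left"` cell. Note that `|` has the lowest precedence in a regex, so it already separates the whole left side from the whole right side; the non-capturing groups mainly make that explicit and become necessary once the alternation is embedded in a larger pattern.

```python
import re

# Hypothetical rows; the csk value and link target are made up.
webpage = (
    '<td align="left" csk="20140115"><a href="/players/x.html">Some Name</a></td>\n'
    '<td align="left" >n/a</td>\n'
)

pattern = re.compile(
    r'(?:<td align="left" csk="(\d{4})\d{4}"><a href=.*?>(.*?)</a></td>)'
    r'|'
    r'(?:<td align="left" >(.*?)</td>)'
)

# findall returns one tuple per match, with '' for groups
# that belong to the alternative that did not match.
rows = pattern.findall(webpage)
print(rows)
```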
