This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 7 years ago.
I want to make a Python script that looks for:
<span class="toujours_cacher">(.)*?</span>
I use this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?\<\/span\>"
However, in some of my pages I found this kind of expression:
<span class="toujours_cacher">*
<span class="exposant" size="1">*</span> *</span>
so I tried this RE:
r"(?i)\<span (\n|\t| )*?class=\"toujours_cacher\"(.|\n)*?\>(.|\n)*?(\<\/span\>|\<\/span\>(.|\n)*?<\/span>)"
which is not good, because when there is no nested span in between, it still matches up to the next </span>.
I need to delete the content between the span with the class "toujours_cacher".
Is there any way to do it with one RE?
I will be pleased to hear any of your suggestions :)
This is (provably) impossible with regular expressions - they cannot match delimiters to arbitrary depth. You'll need to move to using an actual parser instead.
Please do not use regex to parse HTML, as HTML is not a regular language. You could use BeautifulSoup instead. Here is an example of BeautifulSoup finding the tags that <span class="toujours_cacher">(.)*?</span> was meant to match.
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlCode, 'html.parser')
spanTags = soup.find_all('span', attrs={'class': 'toujours_cacher'})
This will return a list of all the span tags that have the class toujours_cacher.
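Since the actual goal is to delete those spans, decompose() removes a matched tag and everything nested inside it. A minimal sketch, using a made-up snippet that mirrors the asker's nested-span case:

```python
from bs4 import BeautifulSoup

# Hypothetical input echoing the nested-span case from the question
html = ('<p>avant <span class="toujours_cacher">caché '
        '<span class="exposant" size="1">1</span> fin</span> après</p>')

soup = BeautifulSoup(html, 'html.parser')
# decompose() deletes each matching tag along with all of its children
for span in soup.find_all('span', attrs={'class': 'toujours_cacher'}):
    span.decompose()

print(soup)  # <p>avant  après</p>
```

Note that the nested span with class "exposant" disappears too, because it sits inside the decomposed tag; no second regex alternative is needed.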
This question already has answers here:
Strip HTML from strings in Python
(28 answers)
Closed 8 years ago.
Using Python regex, how do I remove all tags in HTML? The tags sometimes have inline styling, such as below:
<sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>
I would like to remove everything between and including the sup tags in a larger string of html.
I would use an HTML Parser instead (why). For example, BeautifulSoup and unwrap() can handle your beautiful sup:
Tag.unwrap() is the opposite of wrap(). It replaces a tag with
whatever’s inside that tag. It’s good for stripping out markup.
from bs4 import BeautifulSoup
data = """
<div>
<sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for sup in soup.find_all('sup'):
    sup.unwrap()
print(soup.prettify())
Prints:
<div>
(1)
</div>
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
I have elements in an HTML page in this format:
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td
class="cell7">Philadelphia</td>
I want to use Python to extract the "Dave Mason's Traffic Jam" part, the "Scottish Rite Auditorium" part, etc. individually from the text.
Using the regular expression '.*' returns everything from the first tag to the last tag before the next newline. How can I change the expression so that it only returns the chunk between each tag pair?
Edit: #HenryKeiter & #Hakiko that'd be grand, but this is for an assignment that requires me to use Python regex.
Here is a hint, not a full solution: you'll need to use a non-greedy regexp in your case. Basically, you'll need to use
.*?
instead of
.*
Non-greedy means that the shortest possible string will be matched; by default, matching is greedy and takes the longest.
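The difference is easy to see on a small made-up line with two cells:

```python
import re

s = '<td>first</td><td>second</td>'

# Greedy: .* runs as far as it can, up to the last </td> on the line
print(re.findall(r'<td>(.*)</td>', s))   # ['first</td><td>second']

# Non-greedy: .*? stops at the first possible </td>
print(re.findall(r'<td>(.*?)</td>', s))  # ['first', 'second']
```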
Use Beautiful Soup:
from bs4 import BeautifulSoup
html = '''
<td class="cell1"><b>Dave Mason's Traffic Jam</b></td><td class="cell2">Scottish Rite
Auditorium</td><td class="cell3">$29-$45</td><td class="cell4">On sale now</td><td class="cell5"><a
href="http://www.ticketmaster.com/dave-masons-traffic-jam-collingswood-new-jersey-11-29-2014/event
/02004B48C416D202?artistid=1033927&majorcatid=10001&minorcatid=1&tm_link=venue_msg-
1_02004B48C416D202" target="_blank">TIX</a></td><td class="cell6">AA</td><td
class="cell7">Philadelphia</td>
'''.strip()
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')
contentList = []
for td in tds:
    contentList.append(td.get_text())
print(contentList)
Returns
["Dave Mason's Traffic Jam", 'Scottish Rite\nAuditorium', '$29-$45', 'On sale now', 'TIX', 'AA', 'Philadelphia']
This question already has answers here:
Regex in python not taking the specified data in td element
(3 answers)
Closed 8 years ago.
I'm using regex in Python to extract data from HTML. The regex that I've written is like this:
result = re.findall(r'<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+|<td align="lef(.*?)" >(.*?)</td>\s+', webpage)
assuming that this will match the td which follows either of these formats -
<td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+
OR
<td align="lef(.*?)" >(.*?)</td>
This is because the td can take a different format in that particular cell (it either has data with a link, or has no data at all).
I assume that the OR condition I've used is incorrect - I believe the OR matches only the immediately preceding and immediately following parts of the regex, not the two entire td patterns.
My question is: how do I group it (for example with parentheses) so that the OR applies to the entire td patterns?
You are using a regular expression, but matching XML with such expressions gets too complicated, too fast.
Use a HTML parser instead, Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
ElementTree example:
from xml.etree import ElementTree

tree = ElementTree.parse('filename.html')
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem))
In <td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+ the .?* should be replaced with .*?.
And, to answer your question, you can use non-capturing grouping to do what you want as follows:
(?:first_regex)|(?:second_regex)
BTW. You can also replace \d\d\d\d with \d{4}, which I think is easier to read.
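Putting those pieces together (non-capturing groups around each alternative, \d{4}, and .*? instead of .?*), a sketch against a made-up two-row snippet (not the asker's actual page) could look like:

```python
import re

# Hypothetical input: one cell with a link, one plain cell
html = ('<td align="left" csk="19991231"><a href="/x">With link</a></td>\n'
        '<td align="left" >Plain text</td>\n')

# Each whole td pattern sits in its own non-capturing group,
# so the | applies to the entire alternatives
pattern = (r'(?:<td align="left" csk="(\d{4})\d{4}"><a href=.*?>(.*?)</a></td>)'
           r'|(?:<td align="lef(.*?)" >(.*?)</td>)')

print(re.findall(pattern, html))
# [('1999', 'With link', '', ''), ('', '', 't', 'Plain text')]
```

Each match is a 4-tuple; the groups belonging to the alternative that did not fire come back as empty strings.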
I'm working on a small Python script to clean up HTML documents. It works by accepting a list of tags to KEEP, then parsing through the HTML code and trashing tags that are not in the list. I've been using regular expressions to do it, and I've been able to match opening tags and self-closing tags, but not closing tags.
The pattern I've been experimenting with to match closing tags is </(?!a)>. This seems logical to me, so why is it not working? The (?!a) should match anything that is NOT an anchor tag (note that the "a" can be anything; it's just an example).
Edit: AGG! I guess the regex didn't show!
Read:
RegEx match open tags except XHTML self-contained tags
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Repent.
Use a real HTML parser, like BeautifulSoup.
<TAG\b[^>]*>(.*?)</TAG>
Matches the opening and closing pair of a specific HTML tag.
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Will match the opening and closing pair of any HTML tag.
See here.
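A quick sketch of the second pattern used from Python, compiled with re.IGNORECASE (the pattern is written in upper case) plus re.DOTALL so content may span newlines. Note it still cannot handle nesting of the same tag, which is why the parser advice above stands:

```python
import re

# The tag-pair pattern from above, made case-insensitive
pattern = re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>',
                     re.IGNORECASE | re.DOTALL)

html = '<b>bold</b> and <span class="x">text</span>'
print(pattern.findall(html))  # [('b', 'bold'), ('span', 'text')]
```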
Don't use regex to parse HTML. It will only give you headaches.
Use an XML parser instead. Try BeautifulSoup or lxml.
You may also consider using the html parser that is built into python (Documentation for Python 2 and Python 3)
This will help you home in on the specific area of the HTML Document you would like to work on - and use regular expressions on it.
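A minimal sketch of that built-in parser used to collect only the text content, with a made-up input; handle_starttag/handle_endtag hooks could be added the same way to keep a whitelist of tags:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)

stripper = TagStripper()
stripper.feed('<div><sup style="font-size:7pt">(1)</sup> note</div>')
print(stripper.text())  # (1) note
```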
This question already has answers here:
Strip HTML from strings in Python
(28 answers)
Closed 10 months ago.
I have downloaded a page using urlopen. How do I remove all html tags from it? Is there any regexp to replace all <*> tags?
I can also recommend BeautifulSoup which is an easy to use html parser. There you would do something like:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
all_text = ''.join(soup.find_all(text=True))
This way you get all the text from an HTML document.
There's a great python library called bleach. This call below will remove all html tags, leaving everything else (but not removing the content inside tags that are not visible).
bleach.clean(thestring, tags=[], attributes={}, styles=[], strip=True)
Try this:
import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
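A quick usage check of that function, on a made-up input:

```python
import re

def remove_html_tags(data):
    # Non-greedy match of anything between < and >, deleted outright
    p = re.compile(r'<.*?>')
    return p.sub('', data)

print(remove_html_tags('<p>Hello <b>world</b></p>'))  # Hello world
```

Because the pattern is non-greedy, each tag is removed individually rather than everything between the first < and the last > on the line.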
You could use html2text which is supposed to make a readable text equivalent from an HTML source (programatically with Python or as a command-line tool).
Thus I may extrapolate your needs from your question...
If you need HTML parsing, Python has a module for you!
There are multiple options for filtering HTML tags out of data. You can use a regex, or remove_tags from w3lib, a third-party Python library.
from w3lib.html import remove_tags

data_to_remove = '<p>hello\t\t, \tworld\n</p>'
print(remove_tags(data_to_remove))
OUTPUT: hello , world (the tags are removed; the whitespace inside them is kept)
Note: remove_tags accepts a string. If you have some other object, pass remove_tags(str(data_to_remove)).
A very simple regexp would be:
import re
notag = re.sub("<.*?>", " ", html)
The drawback of this solution is that it doesn't remove javascript or css, but only tags.
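A small demonstration of that drawback on a made-up snippet: the tags vanish, but the CSS rule and the JavaScript call remain as plain text:

```python
import re

html = '<style>p{color:red}</style><p>Visible</p><script>alert(1)</script>'
stripped = re.sub("<.*?>", " ", html)

# No tags survive, but the style and script *contents* do
print(stripped)
```

A parser-based approach (or a library like bleach, mentioned above) is needed if script and style bodies must go too.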