Python regex to extract HTML paragraph

I'm trying to extract paragraphs from HTML using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the paragraphs are there. Why?

Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <p>text1</p>
... <p></p>
... <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True helps to filter out empty paragraphs.

Make sure you use re.search (or re.findall) instead of re.match, which only attempts a match at the very beginning of the string (and your html almost certainly does not start with a <p> tag).
Also note that your pattern is currently greedy, meaning it will match everything between the first <p> tag and the last </p>, which is something you definitely do not want. Try
re.findall(r'<p(\s.*?)?>(.*?)</p>', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
instead. The question mark makes the match non-greedy, so it stops at the first closing </p> tag, and findall returns all matches rather than just the first.

You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
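For illustration, here is a minimal sketch of the difference between the three functions (the html string below is just a made-up example):
import re

html = "<div><p>first</p><p>second</p></div>"

# re.match only tries at position 0, so it finds nothing here
print(re.match(r'<p>.+?</p>', html))            # None
# re.search scans the whole string for the first occurrence
print(re.search(r'<p>.+?</p>', html).group())   # <p>first</p>
# re.findall returns every non-greedy match
print(re.findall(r'<p>(.+?)</p>', html))        # ['first', 'second']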

Related

Python remove certain elements in source

I have some source from which I am trying to remove some tags. I do know that using regular expressions to remove tags and such is not advised, but I figured this would be the easiest route to take.
What I need to do is remove all img and a tags (along with the contents of the a tags), but only those inside a p tag. I am unsure how to do this using a regular expression.
For example when it comes across:
<p><img src="center.jpg"><a href="#">center</a>TEXT<img src="right.jpg"><a href="#">right</a> MORE TEXT<img src="another.jpg"></p>
The output should be the following, where all a tags (including their content) and img tags are removed.
<p>TEXT MORE TEXT</p>
The problem is, like I stated, I'm not sure how to do this; my regular expression removes all of the a and img tags in the source, not just the ones inside of a p tag.
re.sub(r'<(img|a).*?>|</a>', '', text)
Without some kind of assertion, your regular expression will indeed remove all such tags. Although you could possibly use a regular expression to do this, I advise against going that route for many reasons.
You can simply pass BeautifulSoup a list of what to remove.
>>> from BeautifulSoup import BeautifulSoup
>>> html = '<p><img src="center.jpg"><a href="#">center</a>TEXT<img src="right.jpg"><a href="#">right</a> MORE TEXT<img src="another.jpg"></p>'
>>> soup = BeautifulSoup(html)
>>> for m in soup.findAll(['a', 'img']):
...     if m.parent.name == 'p':
...         m.replaceWith('')
>>> print soup
<p>TEXT MORE TEXT</p>
Note: This will remove all <a> and <img> elements (including their contents) that are inside of a <p> element, leaving the rest untouched. If you have BS4, use find_all() and replace_with().
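For reference, a rough BS4 version of the snippet above, using find_all() and replace_with() as mentioned in the note (the "html.parser" argument and the sample string are my own):
from bs4 import BeautifulSoup

html = '<p><img src="center.jpg"><a href="#">center</a>TEXT<img src="right.jpg"><a href="#">right</a> MORE TEXT<img src="another.jpg"></p>'
soup = BeautifulSoup(html, "html.parser")
# remove <a> and <img> elements (and their contents) that sit directly inside a <p>
for m in soup.find_all(['a', 'img']):
    if m.parent.name == 'p':
        m.replace_with('')
print(soup)  # <p>TEXT MORE TEXT</p>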

Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:
<p><span class=love><p>miracle</p>...</span></p><br>love</br>
And I want to remove the string:
<span class=love><p>miracle</p>...</span>
and maybe some other HTML tags. At the same time, the other tags and the contents in them should be preserved.
The result should be like this:
<p></p><br>love</br>
I want to know how to do this using a regex pattern.
What I have tried:
r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)
but it will leave the
</span>
Can you help me using the re module this time? I will learn an HTML parser next.
First things first: Don’t parse HTML using regular expressions
That being said, if there is no additional span tag within that span tag, then you could do it like this:
text = re.sub('<span class=love>.*?</span>', '', text)
On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).
The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) only checks for the closing span tag; it never consumes it. So the match stops immediately before the closing tag, and re.sub leaves it in place. You could manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that's not really necessary: .*? is a non-greedy expression and will try to match as little as possible. So in <span class=love>.*?</span>, the .*? will only match until a closing span tag is found, where it immediately stops.
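A quick demonstration of both patterns using the string from the question (a minimal sketch):
import re

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'
# lookahead version: the closing tag is checked but not consumed, so </span> remains
print(re.sub(r'<span class=love>.*?(?=</span>)', '', s))
# <p></span></p><br>love</br>
# non-greedy version that consumes the closing tag as well
print(re.sub(r'<span class=love>.*?</span>', '', s))
# <p></p><br>love</br>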

How to strip characters interfering with Beautiful Soup returning links with specified text?

I am trying to do two things with Beautiful Soup:
Find and print divs with a certain class
Find and print links that contain certain text
The first part is working. The second part is returning an empty list, that is, []. In trying to troubleshoot this, I created the following which works as intended:
from bs4 import BeautifulSoup
def my_funct():
content = "<div class=\"class1 class2\">some text</div> \
<a href='#' title='Text blah5454' onclick='blahblahblah'>Text blah5454</a>"
soup = BeautifulSoup(content)
thing1 = soup("div", "class1 class2")
thing2 = soup("a", text="Text")
print thing1
print thing2
my_funct()
However, looking at the source of the original content (of my actual implementation) in the SciTE editor, one difference is that there is an LF and four tab markers (->) on a new line between Text and blah5454 in the link text.
And therefore I think that is the reason that I am getting an empty [].
My questions are:
Is this the likely cause?
If so, is the best solution to 'strip' these characters and if so what is the best way to do that?
The text parameter only matches on the whole text content. You need to use a regular expression instead:
import re
thing2 = soup("a", text=re.compile(r"\bText\b"))
The \b word boundary anchors make sure you only match the whole word, not part of a word. Do mind the r'' raw string literal used here; \b means something different when interpreted as a normal string, and you'd have to double the backslashes if you didn't use a raw string literal.
Demo:
>>> from bs4 import BeautifulSoup
>>> content = "<div class=\"class1 class2\">some text</div> \
... <a href='#' title='wooh!' onclick='blahblahblah'>Text blah5454</a>"
>>> soup = BeautifulSoup(content)
>>> soup("a", text='Text')
[]
>>> soup("a", text=re.compile(r"\bText\b"))
[Text blah5454]

BeautifulSoup, simple regex issue

I just hit a snag with regex and have no idea why this is not working.
Here is what the BeautifulSoup docs say:
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]
Here is my html:
Aouate</span><span class="pos_text pos3_l_4">
and I'm trying to match the span tag (last position).
>>> if soup.find(class_=re.compile("pos_text pos3_l_\d{1}")):
print "Yes"
# prints nothing - indicating there is no such pattern in the html
So, I'm just repeating the BS4 docs, except my regex is not working. Sure enough, if I replace the \d{1} with 4 (as originally in the html), it succeeds.
Try "\\d" in your regex. It's probably interpreting "\d" as trying to escape 'd'.
Alternatively, a raw string ought to work. Just put an 'r' in front of the regex, like this:
re.compile(r"pos_text pos3_l_\d{1}")
I'm not entirely sure, but this worked for me:
soup.find(attrs={'class':re.compile('pos_text pos3_l_\d{1}')})
You are matching not for a class but for a specific combination of classes in a specific order.
From the documentation:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>] But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
So you should probably first match for pos_text and then, within those results, check the class against a regexp.
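A minimal sketch of that two-step approach, adapted from the snippet in the question (the html string and the "html.parser" argument here are illustrative):
import re
from bs4 import BeautifulSoup

html = '<span class="pos_text pos3_l_4">Aouate</span>'
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"pos3_l_\d")
# first narrow the search down to elements carrying the pos_text class,
# then check the remaining classes against the regex
for span in soup.find_all(class_="pos_text"):
    if any(pattern.match(cls) for cls in span.get("class", [])):
        print("Yes")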

Extract part of a regex match

I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?
Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn't find a match, so don't call group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)
Note that starting in Python 3.8, with the introduction of assignment expressions (PEP 572) (the := operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and reusing it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
# hello
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
May I recommend Beautiful Soup to you? Soup is a very good lib for parsing your entire HTML document.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
titleName = soup.title.string
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
The previously provided pieces of code do not cope with exceptions (they raise an AttributeError when no match is found).
May I suggest:
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns the first match, or an empty string by default if the pattern is not found.
I'd think this should suffice:
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, and redundant, work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is lxml, which is written for properly structured (standards-conformant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with the others in recommending an HTML parser, not least because it also handles non-standard use of HTML tags.
I needed something to match package-0.0.1 (name, version) but want to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name: {name}')
    print(f'Version: {version}')
else:
    raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing, and (?<=<title>).+(?=<\/title>) works great. It will only match what's between the tags, so you don't have to deal with groups at all.
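A quick sketch of that lookaround approach (the html string here is just a made-up example):
import re

html = "<html><head><title>My Page</title></head></html>"
m = re.search(r"(?<=<title>).+(?=</title>)", html, re.IGNORECASE)
if m:
    print(m.group())  # My Page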
