Python remove certain elements in source - python

I have some source from which I am trying to remove some tags. I know that using regular expressions to remove tags is not advised, but I figured this would be the easiest route to take.
What I need to do is remove all img and a tags, along with the contents of the a tags, but only those that are inside a p tag. I am unsure how to do this using a regular expression.
For example when it comes across:
<p><img src="center.jpg"><a>center</a>TEXT<img src="right.jpg"><a>right</a> MORE TEXT<img src="another.jpg"></p>
The output should be the following where all a tags and content and img tags are removed.
<p>TEXT MORE TEXT</p>
The problem, as I stated, is that I'm not sure how to do this: my regular expression removes all of the a and img tags in the source, not just the ones inside a p tag.
re.sub(r'<(img|a).*?>|</a>', '', text)

Your regular expression will indeed remove all of the tags, because nothing in it (such as a lookaround assertion) restricts the match to a p element. Although you possibly could use a regular expression for this, I advise against going that route for many reasons.
You could simply use BeautifulSoup and pass it a list of the tags to remove.
>>> from BeautifulSoup import BeautifulSoup
>>> html = '<p><img src="center.jpg"><a>center</a>TEXT<img src="right.jpg"><a>right</a> MORE TEXT<img src="another.jpg"></p>'
>>> soup = BeautifulSoup(html)
>>> for m in soup.findAll(['a', 'img']):
...     if m.parent.name == 'p':
...         m.replaceWith('')
...
>>> print soup
<p>TEXT MORE TEXT</p>
Note: This removes every <a> element (including its content) and every <img> element whose parent is a <p> element, leaving the rest untouched. If you have BS4, use find_all() and replace_with().
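For readers on BS4, the same loop might look like the sketch below. It assumes Python 3 and the bs4 package with the built-in "html.parser". The sample markup assumes that center and right were link text inside a tags (which the expected output implies), and the trailing div is a hypothetical addition to show that tags outside a p are left alone. decompose() is used here in place of replace_with('') to drop each tag together with its contents.

```python
from bs4 import BeautifulSoup

# Sample markup; the <a> wrappers and the trailing <div> are assumptions.
html = (
    '<p><img src="center.jpg"><a>center</a>TEXT'
    '<img src="right.jpg"><a>right</a> MORE TEXT'
    '<img src="another.jpg"></p>'
    '<div><a>kept</a></div>'
)

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a finished list, so removing tags while looping is safe.
for tag in soup.find_all(["a", "img"]):
    # Only touch tags whose direct parent is a <p> element.
    if tag.parent is not None and tag.parent.name == "p":
        tag.decompose()  # removes the tag and everything inside it

result = str(soup)
print(result)
```

Note that this only checks the direct parent, as the answer does; an a tag nested deeper inside a p (for example inside a span) would need a check such as tag.find_parent("p") instead.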

Related

Python regex to extract html paragraph

I'm trying to extract paragraphs from HTML by using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the HTML contains paragraphs. Why?
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <p>text1</p>
... <p></p>
... <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True helps to filter out empty paragraphs.
Make sure you use re.search (or re.findall) instead of re.match, which only matches at the beginning of the string (your html definitely does not begin with a <p> tag).
You should also note that your search is currently greedy, meaning it will return everything between the first <p> tag and the last </p> tag, which is something you definitely do not want. Try
re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
instead. The question mark after .* makes the match non-greedy, so it stops at the first closing </p> tag, and findall returns all matches, whereas search returns only the first one.
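Both points can be checked with a small script. This is only a sketch: the html string is a made-up sample, and the patterns are the ones from the question and from this answer.

```python
import re

# Hypothetical sample; it starts with <div>, not <p>.
html = '<div><p>first</p><p class="x">second</p></div>'

# re.match anchors at the start of the string, so it finds nothing here.
at_start = re.match(r'<p>.{1,}</p>', html)

# re.search scans the whole string, but the greedy .{1,} runs to the LAST </p>.
greedy = re.search(r'<p>.{1,}</p>', html).group(0)

# Non-greedy version from the answer; findall returns one tuple per match,
# (attributes, text), because the pattern has two groups.
pairs = re.findall(r'<p(\s.*?)?>(.*?)</p>', html,
                   flags=re.IGNORECASE | re.DOTALL)
texts = [text for _attrs, text in pairs]

print(at_start)  # None
print(greedy)    # <p>first</p><p class="x">second</p>
print(texts)     # ['first', 'second']
```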
You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.

extracting text from mangled html tag with <br> separating the elements

So I have this html piece:
<p class="tbtx">
MWF
<br></br>
TH
</p>
which is completely mangled it seems. I need to extract the data i.e. ['MWF', 'TH'].
The only solution I could think of is to strip all newlines and spaces from the html, split it at the <br></br>, rebuild the html structure and then extract .text, but that's a bit ridiculous.
Any proper solutions for this?
.stripped_strings is what you are looking for: it removes unnecessary whitespace and returns the strings.
Demo:
from bs4 import BeautifulSoup
data = """<p class="tbtx">
MWF
<br></br>
TH
</p>"""
soup = BeautifulSoup(data)
print list(soup.stripped_strings) # prints [u'MWF', u'TH']
You can do this using filter and BeautifulSoup to pull out just the text from your HTML snippet.
from bs4 import BeautifulSoup
html = """<p class="tbtx">
MWF
<br></br>
TH
</p>"""
print filter(None,BeautifulSoup(html).get_text().strip().split("\n"))
Outputs:
[u'MWF', u'TH']
I would recommend extracting the text using regular expressions.
For instance if your html was as you noted:
"
<p class="tbtx">
MWF
<br></br>
TH
</p>
"
We can see that the desired text ("MWF", "TH") is surrounded by whitespace characters.
Thus the regular expression "\s\w+\s" reads "find any run of word characters that is surrounded by whitespace characters" and would identify the desired text.
Here is a cheat sheet for creating Regular Expressions: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1
And you can test your Regular Expression on desired text here: http://regexpal.com/
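As a quick check of the pattern suggested in this answer, the sketch below runs it against the snippet from the question. The capture group is an addition so that the surrounding whitespace is not returned. It has only been checked against this snippet; words separated by a single shared space, for instance, would not all be found.

```python
import re

html = """<p class="tbtx">
MWF
<br></br>
TH
</p>"""

# \s(\w+)\s : a run of word characters with whitespace on both sides;
# only the word itself is captured.
words = re.findall(r'\s(\w+)\s', html)
print(words)  # ['MWF', 'TH']
```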

Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:
<p><span class=love><p>miracle</p>...</span></p><br>love</br>
And I want to remove the string:
<span class=love><p>miracle</p>...</span>
and maybe some other HTML tags. At the same time, the other tags and their contents should be preserved.
The result should be like this:
<p></p><br>love</br>
How can I do this using a regex pattern?
What I have tried:
r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)
but it leaves the </span> behind.
Can you help me do this with the re module this time? I will learn an HTML parser next.
First things first: don't parse HTML using regular expressions.
That being said, if there is no additional span tag within that span tag, then you could do it like this:
text = re.sub('<span class=love>.*?</span>', '', text)
On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).
The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) never consumes what it looks ahead for: it only asserts that </span> follows. So the match stops immediately before the closing span tag, which is therefore left in place. You could now manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that's not really necessary: the .*? is a non-greedy expression that tries to match as little as possible, so in .*?</span> the .*? only matches up to the first closing span tag, where it immediately stops.
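The difference between the two expressions can be seen side by side. This is a minimal sketch using the string from the question; like the answer, it assumes there is no further span nested inside the span being removed.

```python
import re

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# The lookahead only asserts that </span> follows; it is not part of the
# match, so the closing tag survives the substitution.
with_lookahead = re.sub(r'<span class=love>.*?(?=</span>)', '', s)

# Matching </span> directly makes it part of the removed text.
without_lookahead = re.sub(r'<span class=love>.*?</span>', '', s)

print(with_lookahead)     # <p></span></p><br>love</br>
print(without_lookahead)  # <p></p><br>love</br>
```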

How to strip characters interfering with Beautiful Soup returning links with specified text?

I am trying to do two things with Beautiful Soup:
Find and print divs with a certain class
Find and print links that contain certain text
The first part is working. The second part is returning an empty list, that is, []. In trying to troubleshoot this, I created the following which works as intended:
from bs4 import BeautifulSoup
def my_funct():
    content = "<div class=\"class1 class2\">some text</div> \
    <a href='#' title='Text blah5454' onclick='blahblahblah'>Text blah5454</a>"
    soup = BeautifulSoup(content)
    thing1 = soup("div", "class1 class2")
    thing2 = soup("a", text="Text")
    print thing1
    print thing2

my_funct()
Looking at the source of the original content (of my actual implementation) in the SciTE editor, I noticed one difference: there is an LF and four ->'s on a new line between Text and blah5454 in the link text.
I think that is the reason I am getting an empty [].
My questions are:
Is this the likely cause?
If so, is the best solution to 'strip' these characters and if so what is the best way to do that?
The text parameter only matches on the whole text content. You need to use a regular expression instead:
import re
thing2 = soup("a", text=re.compile(r"\bText\b"))
The \b word boundary anchors make sure you only match the whole word, not a partial word. Do mind the r'' raw string literal used here, \b means something different when interpreted as a normal string; you'd have to double the backslashes if you don't use a raw string literal here.
Demo:
>>> from bs4 import BeautifulSoup
>>> content = "<div class=\"class1 class2\">some text</div> \
... <a href='#' title='wooh!' onclick='blahblahblah'>Text blah5454</a>"
>>> soup = BeautifulSoup(content)
>>> soup("a", text='Text')
[]
>>> soup("a", text=re.compile(r"\bText\b"))
[Text blah5454]
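The remark about raw strings can be verified without Beautiful Soup. The sketch below is an illustration under assumptions: link_text is a hypothetical stand-in for the real link text, and the four ->'s from the question are assumed to be tab characters.

```python
import re

plain = "\bText\b"   # normal string: each \b becomes a backspace (0x08)
raw = r"\bText\b"    # raw string: backslash and b reach the regex engine

# Hypothetical link text: "Text", an LF, four tabs (assumed), then the rest.
link_text = "Text\n\t\t\t\tblah5454"

whole_text_equal = (link_text == "Text")             # exact comparison
raw_hit = re.search(raw, link_text) is not None      # word-boundary search
plain_hit = re.search(plain, link_text) is not None  # looks for backspaces

print(len(plain), len(raw))                   # 6 8
print(whole_text_equal, raw_hit, plain_hit)   # False True False
```

The exact comparison fails for the same reason text="Text" returns nothing, while the word-boundary pattern still finds Text regardless of the whitespace that follows it.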

how to find specific strings, using the python regular expression

I have this HTML:
<li class="news_list_bo">URLHunter 프로그램 버퍼오버플로우 취약점 발견!
<ul class="new_liview">
<li class="img"><img height="45" width="65" src="/image_article/458226972502b655fa1b7b.jpg" /></li>
<li class="text">웹페이지를 구성하는 그림파일, 플래쉬파일, 미디어파일들과 같은 구성요소를 사용자에게 보여주는 URLHunter 프로그램에서 버퍼오...</li>
</ul>
I'm trying to retrieve the text in the a tags like this:
>>> tmp_title = re.findall(r'(.*?)',tmp_str,re.I|re.DOTALL)
However, it doesn't find anything:
>>> print tmp_title
[]
How can I find the text between <li class="text"> and </li>?
I'd recommend using an HTML parser like Beautiful Soup to handle most of this rather than trying to wrangle regular expressions into doing all of it. Regular expressions may be good for matching the URLs once the HTML is parsed, though.
We can start by constructing a regular expression to match the URLs you want. Your problem was that ? has a special meaning in regular expressions. If you need to literally match a ?, you'll need to escape it. Anyway, here's a regular expression for matching the URLs you want:
^/news_view\.php\?article_id=[0-9]+$
When you need to find the strings, you can first parse the HTML into a soup:
soup = bs4.BeautifulSoup(html)
See the documentation's section on SoupStrainers to improve performance.
Then you can match all a tags with a href you're interested in:
links = soup.find_all('a', href=NEWS_URL_RE)
Then you can get all of the text out of the links:
link_texts = [link.get_text() for link in links]
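The effect of escaping the ? can be checked on its own. In this sketch NEWS_URL_RE is assumed to be the pattern above compiled with re.compile (the answer uses the name but does not show its definition), and the sample URL with its article id is made up for illustration.

```python
import re

# Assumed definition of the NEWS_URL_RE name used in the answer.
NEWS_URL_RE = re.compile(r'^/news_view\.php\?article_id=[0-9]+$')

# Same pattern with the ? left unescaped: it then means "optional p".
UNESCAPED_RE = re.compile(r'^/news_view\.php?article_id=[0-9]+$')

url = '/news_view.php?article_id=12345'  # hypothetical URL

escaped_ok = NEWS_URL_RE.match(url) is not None
unescaped_ok = UNESCAPED_RE.match(url) is not None

print(escaped_ok)    # True
print(unescaped_ok)  # False: the literal ? in the URL is never matched
```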
