Parsing an HTML file with BeautifulSoup - python

I'm trying to parse a local html file with BeautifulSoup but having trouble navigating the tree.
The file is in the following format:
<div class="contents">
<h1>
USERNAME
</h1>
<div>
<div class="thread">
N1, N2
<div class="message">
<div class="message_header">
<span class="user">
USERNAME
</span>
<span class="meta">
Thursday, 1 January 2015 at 19:52 UTC
</span>
</div>
</div>
<p>
They're just friends
</p>
<div class="message">
<div class="message_header">
<span class="user">
USERNAME
</span>
<span class="meta">
Thursday, 1 January 2015 at 19:52 UTC
</span>
</div>
</div>
<p>
MESSAGE
</p>
...
I want to extract, for each thread:
for each div class='message':
the span class='user' and meta data
the message in the p directly after
This is a long file with many of these threads and many messages within each thread.
So far I've just opened the file and turned it into a soup
raw_data = open('file.html', 'r')
soup = BeautifulSoup(raw_data)
contents = soup.find('div', {'class' : 'contents'})
I'm looking at storing this data in a dictionary in the format
dict[USERNAME] = [(MESSAGE1, time1), (MESSAGE2, time2)]

The username and meta info are relatively easy to grab, as they are nicely contained within their own span tags, with a class identifier. The message itself is hanging around in loose paragraph tags, this is the more tricky beast...
The "Going sideways" section of the Beautiful Soup documentation says: "You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree".
With this in mind, you can extract the parts you want with this:
from bs4 import BeautifulSoup

your_html = "your html data"
souped_data = BeautifulSoup(your_html, "html.parser")

for message in souped_data.find_all("div", {"class": "message"}):
    username = message.find('span', attrs={'class': 'user'}).get_text()
    meta = message.find('span', attrs={'class': 'meta'}).get_text()
    # .next_sibling alone may return the newline between </div> and <p>,
    # so ask for the next <p> sibling explicitly.
    text = message.find_next_sibling('p').get_text()
First, find all the message tags. Within each, you can search for the user and meta class names. However, find() just returns the tag itself; use .get_text() to get the text inside it. Finally, use the magical sibling navigation to get your message content, in the lonely old 'p' tags. A bare .next_sibling often lands on the whitespace between the two tags, so find_next_sibling('p') is the safer call.
That gets you the data you need. As for the dictionary structure. Hmmm... I would throw them all in a list of dictionary objects. Then JSONify that badboy! However, maybe that's not what you need?
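For the dictionary layout asked about in the question, one possible sketch follows. It assumes every message div is followed by its own p sibling (otherwise find_next_sibling would pick up a later message's paragraph). The inline SAMPLE_HTML, the usernames alice and bob, and the name messages_by_user are made up for illustration.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE_HTML = """
<div class="thread">
N1, N2
<div class="message">
<div class="message_header">
<span class="user">alice</span>
<span class="meta">Thursday, 1 January 2015 at 19:52 UTC</span>
</div>
</div>
<p>They're just friends</p>
<div class="message">
<div class="message_header">
<span class="user">bob</span>
<span class="meta">Thursday, 1 January 2015 at 19:53 UTC</span>
</div>
</div>
<p>MESSAGE</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# username -> list of (message text, timestamp) tuples
messages_by_user = defaultdict(list)

for message in soup.find_all("div", class_="message"):
    username = message.find("span", class_="user").get_text(strip=True)
    meta = message.find("span", class_="meta").get_text(strip=True)
    # find_next_sibling skips the whitespace text node between </div> and <p>.
    body = message.find_next_sibling("p")
    text = body.get_text(strip=True) if body is not None else ""
    messages_by_user[username].append((text, meta))

print(dict(messages_by_user))
```

A defaultdict(list) avoids checking whether a username has been seen before; converting it with dict() gives a plain dictionary that json.dumps can serialise.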

Related

Delete block in HTML based on text

I have an HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose that way because the blocks have different attributes and tag names, while the text inside follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
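import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# find the text node containing "Name", then remove the tag that holds it
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())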
Side note: notice that I fixed the closing tag for your "container" div!
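If the match has to cover the whole "Name: John" pattern, which is split across a text node and a b tag, a sketch along these lines may help. It compares against each tag's combined text instead of a single text node. The second block (Mary), the matches helper and the exact pattern are assumptions added for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<p class="other">
Name:<br>
<b>Mary</b>
</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Name:\s*John")


def matches(tag):
    # Collapse whitespace in the tag's full text so that "Name:" and "John"
    # may live in different child nodes.
    text = " ".join(tag.get_text().split())
    return pattern.fullmatch(text) is not None


for tag in soup.find_all(matches):
    # An outer wrapper can match too when it holds nothing else;
    # only remove a tag when none of its descendants match.
    if tag.find(matches) is None:
        tag.decompose()

print(soup.prettify())
```

Because the function is passed to find_all, the tag name and attributes of the block do not matter; only its text does.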

How to wrap string by tag in Beautifulsoup?

I want to wrap the content of a lot of div-elements/blocks with p tags:
<div class='value'>
some content
</div>
It should become:
<div class='value'>
<p>
some content
</p>
</div>
My idea was to get the content (using bs4) by filtering strings with find_all and then wrap it with the new tag. I don't know if that works. I can't filter content from tags with specific attributes/values.
I can do this instead of bs4 with regex. But I'd like to do all transformations (there are some more beside this one) in bs4.
Believe it or not, you can use wrap. :-)
Because you might, or might not, want to wrap inner div elements I decided to alter your HTML code a little bit, so that I could give you code that shows how to alter an inner div without changing the one 'outside' it. You will see how to alter all divs, I'm sure.
Here's how.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('pjoern.htm').read(), 'lxml')
>>> inner_div = soup.findAll('div')[1]
>>> inner_div
<div>
some content
</div>
>>> inner_div.contents[0].wrap(soup.new_tag('p'))
<p>
some content
</p>
>>> print(soup.prettify())
<html>
<body>
<div class="value">
<div>
<p>
some content
</p>
</div>
</div>
</body>
</html>
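To apply this to every div with class value, a loop like the following might do. Note that it swaps wrap on a single child for moving all children into a new p, so that divs with several child nodes are handled too. The sample markup and the extra div with class other are assumptions for the sake of the example.

```python
from bs4 import BeautifulSoup

html = """
<div class='value'>
some content
</div>
<div class='other'>
left alone
</div>
<div class='value'>
more content
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all("div", class_="value"):
    paragraph = soup.new_tag("p")
    # Copy the child list first, because extract() mutates div.contents.
    for child in list(div.contents):
        paragraph.append(child.extract())
    # The div is now empty; put the filled <p> back inside it.
    div.append(paragraph)

print(soup.prettify())
```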

web parsing using selenium and classes

I am trying to parse several items from a blog but I am unable to reach the last two items I need.
The html is:
<div class="post">
<div class="postHeader">
<h2 class="postTitle"><span></span>cuba and the cameraman</h2>
<span class="postMonth" title="2017">Nov</span>
<span class="postDay" title="2017">24</span>
<div class="postSubTitle"><span class="postCategories">TV Shows</span></div>
</div>
<div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a> <br />
n/A<br />
<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
</p>
The data I need is the "cuba and the cameraman" (code below), the "https://image.com/test.jpg" url and the "http://www.imdb.com/title/tt7320560/" IMDB link.
I managed to correctly parse only the postTitle values for the website:
all_titles = []
url = 'http://test.com'
browser.get(url)
titles = browser.find_elements_by_class_name('postHeader')
for title in titles:
    link = title.find_element_by_tag_name('a')
    all_titles.append(link.text)
But I can't get the image and IMDB links using the same method as above (class name).
Could you help me with this? Thanks.
You need a more precise search. There is a whole family of find_element_by_XX functions built in; try XPath:
for post in driver.find_elements_by_xpath('//div[@class="post"]'):
    title = post.find_element_by_xpath('.//h2[@class="postTitle"]//a').text
    img_src = post.find_element_by_xpath('.//div[@class="postContent"]//img').get_attribute('src')
    link = post.find_element_by_xpath('.//div[@class="postContent"]//a[last()]').get_attribute('href')
(In Selenium 4 these helpers were replaced by find_element(By.XPATH, ...) and find_elements(By.XPATH, ...).)
Remember that you can always get the HTML source from driver.page_source and parse it with whatever tool you like.
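As a rough illustration of the page_source route, the sketch below feeds the markup from the question to BeautifulSoup. The page_source string stands in for driver.page_source, the closing div tags were added to make the sample complete, and taking the last a tag as the IMDB link is an assumption that holds for this sample only.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; markup copied from the question.
page_source = """
<div class="post">
<div class="postHeader">
<h2 class="postTitle"><span></span>cuba and the cameraman</h2>
<span class="postMonth" title="2017">Nov</span>
<span class="postDay" title="2017">24</span>
<div class="postSubTitle"><span class="postCategories">TV Shows</span></div>
</div>
<div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a> <br />
n/A<br />
<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
</p>
</div>
</div>
"""

soup = BeautifulSoup(page_source, "html.parser")

posts = []
for post in soup.find_all("div", class_="post"):
    content = post.find("div", class_="postContent")
    posts.append({
        "title": post.find("h2", class_="postTitle").get_text(strip=True),
        "image": content.find("img")["src"],
        # Assumption: the last <a> inside postContent is the IMDB link.
        "imdb": content.find_all("a")[-1]["href"],
    })

print(posts)
```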

how to access elements by path?

I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below:
import bs4
with open("smartradio.html") as f:
    html = f.read()
soup = bs4.BeautifulSoup(html)
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (line 5 and 14)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I am mentioning that because while it grabs the right information, I am not sure that this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)
# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
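or, using regular expressions (this needs import re):
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index (find_all does not include y itself, so the hour is the fourth descendant div):
y.find_all('div')[3].text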
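Putting a couple of these together on the fragments from the question, a self-contained sketch could look like this. It also shows one way to walk "by path" using recursive=False, which restricts find_all to direct children. The variable names and the stricter time pattern are choices made for this example.

```python
import re

from bs4 import BeautifulSoup

html = """
<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>
<div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
time_pattern = re.compile(r"^\d{1,2}:\d{2}$")

alarms = []
for alarm in soup.find_all("div", class_="ue-alarm-status"):
    # Option 1: find the div whose only content looks like a time.
    hour_by_text = alarm.find("div", string=time_pattern).get_text()
    # Option 2: walk by position: first child div, then its third child div.
    first_block = alarm.find_all("div", recursive=False)[0]
    hour_by_path = first_block.find_all("div", recursive=False)[2].get_text()
    days = alarm.find("div", class_="ue-alarm-dow").get_text(strip=True)
    alarms.append((hour_by_text, hour_by_path, days))

print(alarms)
```

The positional version breaks as soon as the page layout changes, so matching on text or class is usually the more robust choice.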

Exclude hidden tags while scraping using bs4

I have a website that has plenty of hidden tags in the html.
I have pasted the source code below.
The challenge is that there are 2 types of hidden tags:
1. Ones with style="display:none"
2. Ones hidden through a list of styles declared under every td tag.
This list changes with every td tag.
For the example below, it has the following styles:
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
So the elements with class hLcj, kUC-, mXJU, rr9s, etc. are hidden elements.
I want to extract the text of entire tr but exclude these hidden tags.
I have been scratching my head for hours and still no success.
Any help would be much appreciated. Thanks
I am using bs4 and python 2.7
<td class="leftborder timestamp" rel="1416853322">
<td>
<span>
<style>
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
</style>
<span class="rr9s">35</span>
<span></span>
<div style="display:none">121</div>
<span class="226">199</span>
.
<span class="rr9s">116</span>
<div style="display:none">116</div>
<span></span>
<span class="Dzkb">200</span>
<span style="display: inline">.</span>
<span style="display:none">86</span>
<span class="kUC-">86</span>
<span></span>
120
<span class="kUC-">134</span>
<div style="display:none">134</div>
<span class="mXJU">151</span>
<div style="display:none">151</div>
<span class="rr9s">154</span>
<span class="Dzkb">.</span>
<span class="119">36</span>
<span class="kUC-">157</span>
<div style="display:none">157</div>
<span class="rr9s">249</span>
<div style="display:none">249</div>
</span>
</td>
<td> 7808</td>
Using selenium would make the task much easier since it knows what elements are hidden and which aren't.
But, anyway, here's a basic code that you would probably need to improve more. The idea here is to parse the style tag and get the list of classes to exclude, have a list of tags to exclude and check the style attribute of each child element in tr:
import re
from bs4 import BeautifulSoup

data = """ your html here """
soup = BeautifulSoup(data, 'html.parser')
tr = soup.tr

# get classes to exclude: every ".name{display:none}" rule in the <style> tag
classes_to_exclude = []
for line in tr.style.string.split():
    match = re.match(r'^\.(.*?)\{display:none\}', line)
    if match:
        classes_to_exclude.append(match.group(1))

tags_to_exclude = ['style', 'script']
texts = []
for item in tr.find_all(text=True):
    # skip CSS and JavaScript source
    if item.parent.name in tags_to_exclude:
        continue
    # skip elements whose class is hidden by the stylesheet
    class_ = item.parent.get('class')
    if class_ and class_[0] in classes_to_exclude:
        continue
    # skip elements hidden inline
    if item.parent.get('style') == 'display:none':
        continue
    texts.append(item)

# strip each piece individually; a list has no .strip()
print(''.join(text.strip() for text in texts))
Prints:
199.200.120.36
Also see:
BeautifulSoup Grab Visible Webpage Text
