BeautifulSoup scraping information from multiple divs using loops into JSON

I am scraping titles, descriptions, links, and people's names from multiple divs that follow the same structure. I am using BeautifulSoup, and I am able to scrape everything out of the first div. However, I'm having trouble scraping from my long list of divs and getting the data into a portable format like CSV or JSON.
How can I scrape each item from my long list of divs, and store that information in JSON objects together for each mp3?
The divs look like this:
<div class="audioBoxWrap clearBoth">
<h3>Title 1</h3>
<p>Description 1</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ Right-click to download] </div>
</div>
<div class="audioBoxWrap clearBoth">
<h3>Title 2</h3>
<p>Description 2</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ Right-click to download] </div>
</div>
I've figured out how to scrape from the first div, but I cannot grab the info for each div. For example, my code below only spits out the h3 for the first div over and over.
I know that I can create a Python list for titles, descriptions, etc., but how do I keep the metadata structured like JSON, so that title1, link1, and description1 stay together, and likewise for title2's information?
with open('soup.html', 'r') as myfile:
    html_doc = myfile.read()
soup = BeautifulSoup(html_doc, 'html.parser')
audio_div = soup.find_all('div', {'class': "audioBoxWrap clearBoth"})
print len(audio_div)
# create dictionary for storing scraped data. I don't know how to store the values for each mp3 separately.
for i in audio_div:
    print soup.find('h3').text
I want my JSON to look something like this:
{
"podcasts":[
{
"title":"title1",
"description":"description1",
"link":"link1"
},
{
"title":"title2",
"description":"description2",
"link":"link2"
}
]
}

Iterate over every track and make context-specific searches, calling the lookups on each track element rather than on the whole soup:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<div>
<div class="audioBoxWrap clearBoth">
<h3>Title 1</h3>
<p>Description 1</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link1.mp3">Right-click to download</a>] </div>
</div>
<div class="audioBoxWrap clearBoth">
<h3>Title 2</h3>
<p>Description 2</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link2.mp3">Right-click to download</a>] </div>
</div>
</div>"""
soup = BeautifulSoup(data, "html.parser")
# each wrapper div is one track; search inside it, not inside the whole soup
tracks = soup.find_all('div', {'class': "audioBoxWrap clearBoth"})
result = {
    "podcasts": [
        {
            "title": track.h3.get_text(strip=True),
            "description": track.p.get_text(strip=True),
            "link": track.a["href"]
        }
        for track in tracks
    ]
}
pprint(result)
Prints:
{'podcasts': [{'description': 'Description 1',
'link': 'link1.mp3',
'title': 'Title 1'},
{'description': 'Description 2',
'link': 'link2.mp3',
'title': 'Title 2'}]}
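Since the question asks for JSON rather than a printed dictionary, here is a minimal sketch of the last step using the standard json module. The sample markup, the placement of the a tag, and the variable names (podcasts, json_text) are assumptions for illustration; a track without a link gets null instead of raising an error.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample markup; where the <a href> sits is an assumption.
html = """
<div class="audioBoxWrap clearBoth">
  <h3>Title 1</h3>
  <p>Description 1</p>
  <div>[ <a href="link1.mp3">Right-click to download</a> ]</div>
</div>
<div class="audioBoxWrap clearBoth">
  <h3>Title 2</h3>
  <p>Description 2</p>
  <div>[ no download link ]</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

podcasts = []
for track in soup.find_all("div", class_="audioBoxWrap"):
    # find() returns None when the track has no anchor with an href
    link = track.find("a", href=True)
    podcasts.append({
        "title": track.h3.get_text(strip=True),
        "description": track.p.get_text(strip=True),
        "link": link["href"] if link is not None else None,
    })

# Serialize the whole structure; indent only makes the output readable
json_text = json.dumps({"podcasts": podcasts}, indent=2)
print(json_text)
```

To write the result to a file instead, json.dump({"podcasts": podcasts}, f) with an open file object f does the same thing.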

Related

Extracting text from generic span tag nested within multiple div tags

I need to grab the text (shown as a date) in the following piece of html code. The code a1v2v3 changes depending on the page, so I cannot use that as a reference or use a css selector.
Relevant HTML:
<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="" class="mvp-row"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-default mvp-tag-checked" style="margin-left: -16px; visibility: hidden;">
<!----> <span class="mvp-tag-text">LIVE</span>
<!----></div><span data-v-a1v2v3="" style="display: inline-block;">
2019.06.12 17:09
<br data-v-a1v2v3="">
Full HTML:
<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="">
<div data-v-a1v2v3="" class="mvp-collapse mvp-collapse-simple">
<div data-v-a1v2v3="" class="mvp-collapse-item mvp-collapse-item-active" style="padding-left: 6px;">
<div class="mvp-collapse-header">
<!----> <div data-v-a1v2v3="" class="mvp-row-flex mvp-row-flex-middle">
<div data-v-a1v2v3="" class="mvp-col mvp-col-span-18 mvp-col-span-xs-16 mvp-col-span-sm-18 mvp-col-span-md-18 mvp-col-span-lg-18"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-blue mvp-tag-checked" style="margin-left: -16px;">
<!----> <span class="mvp-tag-text mvp-tag-color-white">LIVE</span>
<!----></div><div data-v-a1v2v3="" class="versionAndMemo">
<span data-v-a1v2v3="" style="display: inline-block; line-height: 26px; vertical-align: middle; margin: 0px 1px; font-weight: bold; font-size: 14px;">1.2.3.44</span>
<!----></div></div>
<div data-v-a1v2v3="" class="mvp-col mvp-col-span-6 mvp-col-span-xs-8 mvp-col-span-sm-6 mvp-col-span-md-6 mvp-col-span-lg-6"><div data-v-a1v2v3="" style="display: inline-block; float: right; margin-right: 6px;"><i data-v-a1v2v3="" class="lal la-download" style="font-size: 1.8em; margin-top: 8px; display: table-cell; vertical-align: middle;"></i></div>
<div data-v-a1v2v3="" style="float: right; margin-right: 22px;"><i data-v-a1v2v3="" class="lal la-link" style="font-size: 1.8em; margin-top: 8px; display: table-cell; vertical-align: middle;"></i></div></div></div></div> <div class="mvp-collapse-content" style="" data-old-padding-top="" data-old-padding-bottom="" data-old-overflow="">
<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="" class="mvp-row"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-default mvp-tag-checked" style="margin-left: -16px; visibility: hidden;">
<!----> <span class="mvp-tag-text">LIVE</span>
<!----></div><span data-v-a1v2v3="" style="display: inline-block;">
2019.06.12 17:09
<br data-v-a1v2v3="">
Here is what I have so far:
page = requests.get(app, headers=headers, cookies=cookies).text
soup = BeautifulSoup(page, 'html.parser')
for spantime in soup.findAll("div", {"class": "mvp-collapse-content-box"}):
    print(spantime)
But nothing is being printed. I have also tried adding the following:
page = requests.get(app, headers=headers, cookies=cookies).text
soup = BeautifulSoup(page, 'html.parser')
for spantime in soup.findAll("div", {"class": "mvp-collapse-content-box"}):
    print(spantime.text)
    for span in spantime.find_all('span', recursive=True):
        print(span.text)
But neither of them prints anything. I have a feeling that it might have something to do with the mvp-collapse-content-box class that I've used: some of the div tags with that class do not necessarily have span tags, as shown in the Full HTML.

Use find_next() to step to the second span tag after the div, then use its text property.
from bs4 import BeautifulSoup
html='''<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="" class="mvp-row"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-default mvp-tag-checked" style="margin-left: -16px; visibility: hidden;">
<!----> <span class="mvp-tag-text">LIVE</span>
<!----></div><span data-v-a1v2v3="" style="display: inline-block;">
2019.06.12 17:09
<br data-v-a1v2v3="">'''
soup=BeautifulSoup(html,'html.parser')
div=soup.find('div',class_="mvp-collapse-content-box")
print(div.find_next('span').find_next('span').text.strip())
Output:
2019.06.12 17:09
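If some boxes have no span at all, chaining find_next() can land on an unrelated tag or on None. A hedged alternative is to collect every span under the boxes and keep only text that looks like a date. This is a sketch, not the page's actual structure: the trimmed sample markup and the DATE_RE pattern are assumptions based on the 2019.06.12 17:09 format shown above.

```python
import re
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample: one box with a date span, one without.
html = """
<div class="mvp-collapse-content-box">
  <div class="mvp-row">
    <div class="mvp-tag"><span class="mvp-tag-text">LIVE</span></div>
    <span style="display: inline-block;">
      2019.06.12 17:09
      <br>
    </span>
  </div>
</div>
<div class="mvp-collapse-content-box"><div class="mvp-row">no span here</div></div>
"""

# Assumed date layout: YYYY.MM.DD HH:MM
DATE_RE = re.compile(r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}")

soup = BeautifulSoup(html, "html.parser")

dates = []
# select() returns each matching span once, even when boxes are nested
for span in soup.select("div.mvp-collapse-content-box span"):
    match = DATE_RE.search(span.get_text())
    if match:
        dates.append(match.group())

print(dates)
```

If the loop over the divs prints nothing at all, it may be worth checking whether the class name appears in the downloaded page text in the first place; that is only a guess about the original page.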

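You can also select child tags with findChildren(). Note that recursive=False returns only direct children, and in the HTML above the date span sits inside the nested mvp-row div, so the default recursive search is needed to reach it from the outer div:
div = soup.find("div", {"class": "mvp-collapse-content-box"})
spans = div.findChildren("span")  # recursive=False would match direct children only
for span in spans:
    print(span)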

How to access data within nested span tags

I've tried replacing each string but I can't get it to work. I can get all the data between <span>...</span>, but I can't get only the text that comes after the inner span is closed. How could I do it? I've tried replacing the text afterwards, but I am not able to do it. I am quite new to Python.
I have also tried using for x in soup.find_all('/span', class_ = "textLarge textWhite") but that won't display anything.
Relevant html:
<div style="width:100%; display:inline-block; position:relative; text-
align:center; border-top:thin solid #fff; background-image:linear-
gradient(#333,#000);">
<div style="width:100%; max-width:1400px; display:inline-block;
position:relative; text-align:left; padding:20px 15px 20px 15px;">
<a href="/manpower-fit-for-military-service.asp" title="Manpower
Fit for Military Service ranked by country">
<div class="smGraphContainer"><img class="noBorder"
src="/imgs/graph.gif" alt="Small graph icon"></div>
</a>
<span class="textLarge textWhite"><span
class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
</div>
<div class="blockSheen"></div>
</div>
Relevant python code:
for y in soup.find_all('span', class_ = "textBold"):
    print(y.text) #this gets FIT-FOR-SERVICE:
for x in soup.find_all('span', class_ = "textLarge textWhite"):
    print(x.text) #this gets FIT-FOR-SERVICE: 18,740,382 but i only want the number
Expected result: "18,740,382"

I believe you have two options here:
1 - Use a regex on the parent span's text to extract only the digits.
2 - Use the decompose() method to remove the child span tag from the tree, and extract the text afterwards, like this:
from bs4 import BeautifulSoup
h = """<div style="width:100%; display:inline-block; position:relative; text-
align:center; border-top:thin solid #fff; background-image:linear-
gradient(#333,#000);">
<div style="width:100%; max-width:1400px; display:inline-block;
position:relative; text-align:left; padding:20px 15px 20px 15px;">
<a href="/manpower-fit-for-military-service.asp" title="Manpower
Fit for Military Service ranked by country">
<div class="smGraphContainer"><img class="noBorder"
src="/imgs/graph.gif" alt="Small graph icon"></div>
</a>
<span class="textLarge textWhite"><span
class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
</div>
<div class="blockSheen"></div>
</div>"""
soup = BeautifulSoup(h, "lxml")
soup.find('span', class_ = "textLarge textWhite").span.decompose()
res = soup.find('span', class_ = "textLarge textWhite").text.strip()
print(res)
#18,740,382
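Option 1 is not shown above, so here is a minimal sketch of the regex route. It leaves the tree untouched, which may matter if you still need the label later. The one-line sample markup and the pattern are assumptions: the pattern expects a number made of digits and thousands separators, as in the question.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question
html = (
    '<span class="textLarge textWhite">'
    '<span class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>'
)

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("span", class_="textLarge textWhite")

# First run of digits and commas in the combined text of the parent span
match = re.search(r"\d[\d,]*", parent.get_text())
number = match.group() if match else None

print(number)
```

If the label itself could contain digits, this pattern would pick those up first, so decompose() or the top-level text approach is the safer choice in that case.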

Here is how you could do it:
soup.find('span', {'class':'textLarge textWhite'}).find('span').extract()
output = soup.find('span', {'class':'textLarge textWhite'}).text.strip()
output:
18,740,382

Instead of grabbing the text using x.text you can use x.find_all(text=True, recursive=False) which will give you all the top-level text (in a list of strings) for a node without going into the children. Here's an example using your data:
for x in soup.find_all('span', class_ = "textLarge textWhite"):
    res = x.find_all(text=True, recursive=False)
    # join and strip the strings then print
    print(" ".join(map(str.strip, res)))
    #outputs: '18,740,382'

How to prettify HTML so tag attributes will remain in one single line?

I got this little piece of code:
text = """<html><head></head><body>
<h1 style="
text-align: center;
">Main site</h1>
<div>
<p style="
color: blue;
text-align: center;
">text1
</p>
<p style="
color: blueviolet;
text-align: center;
">text2
</p>
</div>
<div>
<p style="text-align:center">
<img src="./foo/test.jpg" alt="Testing static images" style="
">
</p>
</div>
</body></html>
"""
import sys
import re
import bs4
def prettify(soup, indent_width=4):
    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())
soup = bs4.BeautifulSoup(text, "html.parser")
print(prettify(soup))
The output of the above snippet right now is:
<html>
<head>
</head>
<body>
<h1 style="
text-align: center;
">
Main site
</h1>
<div>
<p style="
color: blue;
text-align: center;
">
text1
</p>
<p style="
color: blueviolet;
text-align: center;
">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img alt="Testing static images" src="./foo/test.jpg" style="
"/>
</p>
</div>
</body>
</html>
I'd like to figure out how to format the output so it becomes this instead:
<html>
<head>
</head>
<body>
<h1 style="text-align: center;">
Main site
</h1>
<div>
<p style="color: blue;text-align: center;">
text1
</p>
<p style="color: blueviolet;text-align: center;">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img alt="Testing static images" src="./foo/test.jpg" style=""/>
</p>
</div>
</body>
</html>
Said otherwise, I'd like to keep html statements such as <tag attrib1=value1 attrib2=value2 ... attribn=valuen> in one single line if possible. When I say "if possible" I mean without screwing up the value of the attributes themselves (value1, value2, ..., valuen).
Is this possible to achieve with beautifulsoup4? As far as I've read in the docs, it seems you can use a custom formatter, but I don't know how to write one that meets the described requirements.
EDIT:
@alecxe's solution is quite simple, but unfortunately it fails in some more complex cases like the one below:
test1 = """
<div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
<div id="sessionsGrid" data-columns="[
{ field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 },
{ field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}},
{ field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80},
{ field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 },
{ field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}},
{ field: 'note', title:'Note'}
]">
</div>
</div>
"""
from bs4 import BeautifulSoup
import re
def prettify(soup, indent_width=4, single_lines=True):
    if single_lines:
        for tag in soup():
            for attr in tag.attrs:
                print(tag.attrs[attr], tag.attrs[attr].__class__)
                tag.attrs[attr] = " ".join(
                    tag.attrs[attr].replace("\n", " ").split())
    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())
def html_beautify(text):
    soup = BeautifulSoup(text, "html.parser")
    return prettify(soup)
print(html_beautify(test1))
TRACEBACK:
dialer-capmaign-console <class 'str'>
['fill-vertically'] <class 'list'>
Traceback (most recent call last):
File "d:\mcve\x.py", line 35, in <module>
print(html_beautify(test1))
File "d:\mcve\x.py", line 33, in html_beautify
return prettify(soup)
File "d:\mcve\x.py", line 25, in prettify
tag.attrs[attr].replace("\n", " ").split())
AttributeError: 'list' object has no attribute 'replace'

BeautifulSoup tried to preserve the newlines and multiple spaces you had in the attribute values in the input HTML.
One workaround here would be to iterate over the element attributes and clean them up prior to prettifying - removing the newlines and replacing multiple consecutive spaces with a single space:
for tag in soup():
    for attr in tag.attrs:
        tag.attrs[attr] = " ".join(tag.attrs[attr].replace("\n", " ").split())
print(soup.prettify())
Prints:
<html>
<head>
</head>
<body>
<h1 style="text-align: center;">
Main site
</h1>
<div>
<p style="color: blue; text-align: center;">
text1
</p>
<p style="color: blueviolet; text-align: center;">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img alt="Testing static images" src="./foo/test.jpg" style=""/>
</p>
</div>
</body>
</html>
Update (to address the multi-valued attributes like class):
Multi-valued attributes such as class are parsed into a list of strings rather than a single string, so handle that case separately:
for tag in soup():
    tag.attrs = {
        attr: [" ".join(attr_value.replace("\n", " ").split()) for attr_value in value]
        if isinstance(value, list)
        else " ".join(value.replace("\n", " ").split())
        for attr, value in tag.attrs.items()
    }
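Putting the two cases together, a small self-contained sketch might look like this. The helper names normalize and collapse_attribute_whitespace are hypothetical, and the one-line sample HTML is made up for the demonstration; the logic is the same whitespace cleanup as above, applied to both string and list attribute values.

```python
from bs4 import BeautifulSoup


def normalize(value):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(value.split())


def collapse_attribute_whitespace(soup):
    """Rewrite every attribute value in place so it fits on one line."""
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {
            name: [normalize(v) for v in value]
            if isinstance(value, list)   # e.g. class is a list of strings
            else normalize(value)
            for name, value in tag.attrs.items()
        }
    return soup


# Hypothetical sample with a multi-line style and a multi-valued class
html = '<div class="a\n b" style="\n  color: blue;\n  text-align: center;\n">x</div>'
soup = collapse_attribute_whitespace(BeautifulSoup(html, "html.parser"))

print(soup.prettify())
```

Because str.split() with no arguments already splits on newlines, the separate replace("\n", " ") step is not needed in this version.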

While BeautifulSoup is more commonly used, HTML Tidy may be a better choice if you're working with quirks and have more specific requirements.
After installing the library for Python (pip install pytidylib) try the following code:
from tidylib import Tidy
tidy = Tidy()
# assign string to text
config = {
"doctype": "omit",
# "show-body-only": True
}
print(tidy.tidy_document(text, options=config)[0])
tidy.tidy_document returns a tuple with the HTML and any errors that may have occurred. This code will output:
<html>
<head>
<title></title>
</head>
<body>
<h1 style="text-align: center;">
Main site
</h1>
<div>
<p style="color: blue; text-align: center;">
text1
</p>
<p style="color: blueviolet; text-align: center;">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img src="./foo/test.jpg" alt="Testing static images" style="">
</p>
</div>
</body>
</html>
Uncommenting "show-body-only": True gives the following output for the second sample:
<div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
<div id="sessionsGrid" data-columns="[ { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 }, { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}}, { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80}, { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 }, { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}}, { field: 'note', title:'Note'} ]"></div>
</div>
See more configuration for further options and customization. There are wrapping options specific to attributes which may help. As you can see, empty elements will only take up one line, and html-tidy will automatically try to add things like DOCTYPE, head and title tags.

Python html div class

I'm trying to write a simple program which saves the values of a table in a matrix (later I want to send the matrix to a database).
Here is my code:
pfad = "https://business.facebook.com/ads/manager/account/ads/?act=516059741896803&pid=p2&report_spec=6056690557117&business_id=401807279988717"
html = urlopen(pfad)
r=requests.get(pfad)
soup = BeautifulSoup(html.read(),'html.parser')
mydivs = soup.findAll("div", { "class" : "ellipsis_1ha3" })
# no output:
for div in mydivs:
    if (div["class"]=="ellipsis_1ha3"):
        print div
# output: []
print(mydivs)
I want the values inside of the divs with class ellipsis _1ha3, but I don't know why it doesn't work. Can anyone help me?
Here is an example html which is like the original
<!DOCTYPE html>
<html>
<head>
<style>
.ellipsis_1ha3
{
width: 100px;
border: 1px solid black;
}
.a
{
width: 100px;
border: 1px solid black;
}
</style>
</head>
<body>
<div>
<div style="display: inline-flex;">
<div class="a">Purchase</div>
<div class="a">Clicks</div>
</div>
</br>
<div style="display: inline-flex;">
<div class="ellipsis_1ha3">20</div>
<div class="ellipsis_1ha3">30</div>
</div>
</br>
<div style="display: inline-flex;">
<div class="ellipsis_1ha3">10</div>
<div class="ellipsis_1ha3">50</div>
</div>
</div>
</body>
</html>
SECOND EXAMPLE
pfad = "http://www.bundesliga.de/de/liga/tabelle/"
html = urlopen(pfad)
soup = BeautifulSoup(html.read(),'html.parser')
mydivs = soup.findAll('div', { 'class' : 'wwe-cursor-pointer' })
for div in mydivs:
    if ("wwe-cursor-pointer" in div["class"]):
        print div

Try using lxml and XPath expressions to pull out the relevant information. (Beautiful Soup is a separate library, but it can use lxml as its underlying parser.) Assuming you loaded the document into a string called html_string:
from lxml import html
h = html.fromstring(html_string)
h.xpath('//div[@class="ellipsis_1ha3"]/node()')
#output:
['20', '30', '10', '50']
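The same extraction also works with BeautifulSoup alone. One thing to note is that div["class"] is a list of class names, so comparing it to the string "ellipsis_1ha3" is never true. Below is a minimal sketch that builds the matrix the question asks for, one inner list per row. It assumes the example HTML from the question (trimmed here) and that each row is a div with the inline-flex style; the variable name matrix is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example HTML from the question
html_string = """
<div>
  <div style="display: inline-flex;">
    <div class="a">Purchase</div>
    <div class="a">Clicks</div>
  </div>
  <div style="display: inline-flex;">
    <div class="ellipsis_1ha3">20</div>
    <div class="ellipsis_1ha3">30</div>
  </div>
  <div style="display: inline-flex;">
    <div class="ellipsis_1ha3">10</div>
    <div class="ellipsis_1ha3">50</div>
  </div>
</div>
"""

soup = BeautifulSoup(html_string, "html.parser")

matrix = []
# Assumption: every table row is a div with exactly this style value
for row in soup.find_all("div", style="display: inline-flex;"):
    cells = row.find_all("div", class_="ellipsis_1ha3")
    if cells:  # the header row has no value cells, so it is skipped
        matrix.append([int(cell.get_text(strip=True)) for cell in cells])

print(matrix)
```

This only helps if the divs are present in the HTML your request actually receives; if the page fills the table in with JavaScript or requires a login, the list will still be empty.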

Using beautifulsoup4 in my django app how do I get the "a" href and image src?

I am using beautifulsoup4 in my django app to scrape data. I was able to get the data from this html structure
<div class="thumbnail thumb">
<h6 id="date">May 9, 2016</h6>
<img src="http://assets.system.jpg" class="img-responsive post">
<div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>
<div class="caption" id="cap">
<a href="/blog/homeland-security-attack/">
<h5 class="post-title" id="title">Homeland Security </h5>
</a>
<p>
delete
edit
</p>
</div>
</div>
using this in my view
url = 'http://www.hispanicheights.com/'
google = requests.get(url)
bs = BeautifulSoup(google.content, 'html.parser')
divs = bs.findAll('div', 'thumbnail')
entries = [{'text': div.text,
'href': div.find('a').get('href'),
'src': div.find('img').get('src')
} for div in divs][:6]
But when I tried to scrape this html structure
<div class="entry entry-pos-1" id="entry-217985">
<a href="/article/murder" data-page="1">
<p class="entry-comments">6</p>
<img data-original="/images17985.jpg" alt="Chicago Rapper & OTF Aff Murder" width="320" height="179" class="image-load" src="/images/size_mb/video-217985.jpg" style="display: block;">
</a>
<p class="entry-title">
Chicago Rapper & OT Murder
</p>
<p class="entry-meta">97 views</p>
<p class="entry-date">
<span class="entry-recent">11 Mins Ago</span>
</p>
</div>
using the same thing
ad_url = 'http://www.ad.com/'
ad_get = requests.get(ad_url, headers=headers)
ad_soup = BeautifulSoup(ad_get.content, 'html.parser')
ad_div = ad_soup.findAll('div', 'entry')
ad_entry = [{'text': div.text,
'href': div.find('a').get('href'),
'src': div.find('img').get('src')
} for div in ad_div]
I get the error: 'NoneType' object has no attribute 'get'
What's the proper syntax to grab the href and src?

If you call div.find('a') on a div that does not contain an anchor, it returns None, and calling .get() on None raises that error. Your code has to handle this. For example, you could do:
entries = []
for div in ad_div:
    a = div.find('a')
    img = div.find('img')
    # skip divs that lack either the link or the image
    if a is not None and img is not None:
        entry = {
            'text': div.text,
            'href': a.get('href'),
            'src': img.get('src')
        }
        entries.append(entry)
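If you would rather keep every entry and just record missing pieces as None, a variation on the loop above could look like this. It is a sketch under assumptions: the two-entry sample HTML is made up, and storing None for a missing href or src is a design choice, not something the original code did.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second entry has neither a link nor an image
html = """
<div class="entry">
  <a href="/article/one"><img src="/images/one.jpg"></a>
  <p class="entry-title">One</p>
</div>
<div class="entry">
  <p class="entry-title">No link or image here</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

entries = []
for div in soup.find_all("div", class_="entry"):
    a = div.find("a")      # None when the div has no anchor
    img = div.find("img")  # None when the div has no image
    entries.append({
        "text": div.get_text(" ", strip=True),
        "href": a.get("href") if a is not None else None,
        "src": img.get("src") if img is not None else None,
    })

print(entries)
```

Whether to skip incomplete entries or keep them with None depends on what the template expects; keeping them makes it easier to spot which divs did not match.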
