I'm trying to write a simple program which saves the values of a table in a matrix (later I want to send the matrix to a database).
Here is my code:
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

pfad = "https://business.facebook.com/ads/manager/account/ads/?act=516059741896803&pid=p2&report_spec=6056690557117&business_id=401807279988717"
html = urlopen(pfad)
r = requests.get(pfad)
soup = BeautifulSoup(html.read(), 'html.parser')
mydivs = soup.findAll("div", {"class": "ellipsis_1ha3"})
# no output:
for div in mydivs:
    # div["class"] is a list of class names, so test membership
    if "ellipsis_1ha3" in div["class"]:
        print(div)
# output: []
print(mydivs)
I want the values inside the divs with class ellipsis_1ha3, but I don't know why it doesn't work. Can anyone help me?
Here is an example HTML document that resembles the original:
<!DOCTYPE html>
<html>
<head>
<style>
.ellipsis_1ha3
{
width: 100px;
border: 1px solid black;
}
.a
{
width: 100px;
border: 1px solid black;
}
</style>
</head>
<body>
<div>
<div style="display: inline-flex;">
<div class="a">Purchase</div>
<div class="a">Clicks</div>
</div>
</br>
<div style="display: inline-flex;">
<div class="ellipsis_1ha3">20</div>
<div class="ellipsis_1ha3">30</div>
</div>
</br>
<div style="display: inline-flex;">
<div class="ellipsis_1ha3">10</div>
<div class="ellipsis_1ha3">50</div>
</div>
</div>
</body>
</html>
SECOND EXAMPLE
pfad = "http://www.bundesliga.de/de/liga/tabelle/"
html = urlopen(pfad)
soup = BeautifulSoup(html.read(), 'html.parser')
mydivs = soup.findAll('div', {'class': 'wwe-cursor-pointer'})
for div in mydivs:
    if "wwe-cursor-pointer" in div["class"]:
        print(div)
Try using lxml and XPath expressions to pull out the relevant information. BeautifulSoup is not built on lxml, but it can use lxml as its parser. This assumes you have loaded the document into a string called html_string.
from lxml import html
h = html.fromstring(html_string)
h.xpath('//div[@class="ellipsis_1ha3"]/node()')
#output:
['20', '30', '10', '50']
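Since the question asks for a matrix, the flat list returned by the XPath query can be regrouped into rows. The following is only a sketch: the helper name to_matrix is hypothetical, and it assumes the number of columns equals the number of header cells in the example HTML (Purchase and Clicks).

```python
def to_matrix(values, n_columns):
    """Split a flat list of cell values into rows of n_columns items each."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    return [values[i:i + n_columns] for i in range(0, len(values), n_columns)]


headers = ["Purchase", "Clicks"]   # column titles taken from the example HTML
cells = ["20", "30", "10", "50"]   # flat result of the XPath query above

# Convert each cell to int so the rows are ready for a database insert
matrix = [[int(cell) for cell in row] for row in to_matrix(cells, len(headers))]
print(matrix)
```

Each inner list is one table row, so it can be passed directly to something like a parameterized INSERT statement.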
I am having issues comparing two HTML files using Python's difflib. I was able to generate a comparison file that highlights the changes, but when I open it, it displays the raw HTML and CSS markup instead of the plain text.
E.g.
<Html><Body><div class="a"> Text Here</div></Body></html>
instead of
Text Here
My Python Script is as follows:
import difflib
file1 = open('file1.html', 'r').readlines()
file2 = open('file2.html', 'r').readlines()
diffHTML = difflib.HtmlDiff()
htmldiffs = diffHTML.make_file(file1,file2)
with open('Comparison.html', 'w') as outFile:
    outFile.write(htmldiffs)
My input files look something like this:
<!DOCTYPE html>
<html>
<head>
<title>Text here</title>
<style type="text/css">
@media all {
h1 {
margin: 0px;
color: #222222;
}
#page-title {
color: #222222;
font-size: 1.4em;
font-weight: bold;
}
body {
font: 0.875em/1.231 tahoma, geneva, verdana, sans-serif;
padding: 30px;
min-width: 800px;
}
.divider {
margin: 0.5em 15% 0.5em 10%;
border-bottom: 1px solid #000;
}
}
.section.header {
background-color: #303030;
color: #ffffff;
font-weight: bold;
margin: 0 0 5px 0;
padding: 5px 0 5px 5px;
}
.section.subheader {
background-color: #CFCFCF;
color: #000;
font-weight: bold;
padding: 1px 5px;
margin: 0px auto 5px 0px;
}
.response_rule_prefix {
font-style: italic;
}
.exception-scope
{
color: #666666;
padding-bottom: 5px;
}
.where-clause-header
{
color:#AAAA99;
}
.section {
padding: 0em 0em 1.2em 0em;
}
#generated-Time {
padding-top:5px;
float:right;
}
#page-title, #generated-Time {
display: inline-block;
}
</style></head>
<body>
<div id="title-section" class="section ">
<div id="page-branding">
<h1>Title</h1>
</div>
<div id="page-title">
Sub title
</div>
<div id="generated-Time">
Date & Time : Jul 2, 2020 2:42:48 PM
</div>
</div>
<div class="section header">General</div>
<div id="general-section" class="section">
<div class="general-detail-label-container">
<label id="policy-name-label">Text here</label>
</div>
<div class="general-detail-content-container">
<span id="policy-name-content" >Text here</span>
</div>
<div class="general-detail-label-container">
<label id="policy-description-label">Description A :</label>
</div>
<div class="general-detail-content-container">
<span id="policy-description-content""></span>Text here</span>
</div>
<div class="general-detail-label-container">
<label id="policy-label-label" class="general-detail-label">Description b:</label>
</div>
<div class="general-detail-content-container">
<span id="policy-label-content" class="wrapping-text"></span>
</div>
<div class="general-detail-label-container">
<label id="policy-group-label" class="general-detail-label">Group:</label>
</div>
<div class="general-detail-content-container">
<span id="policy-group-content" class="wrapping-text">Text here</span>
</div>
<div class="general-detail-label-container">
<label id="policy-status-label" class="general-detail-label">Status:</label>
</div>
<div class="general-detail-content-container">
<span id="policy-status-content">
<label id="policy-status-message">Active</label>
</span>
</div>
<div class="general-detail-label-container">
<label id="policy-version-label" class="general-detail-label">Version:</label>
</div>
<div class="general-detail-content-container">
<span id="policy-version-content" class="wrapping-text">7</span>
</div>
<div class="general-detail-label-container">
<label id="policy-last-modified-label" class="general-detail-label">Last Modified:</label>
</div>
<div class="general-detail-content-container">
<span id="policy-last-modified-content" class="wrapping-text">Jun 15, 2020 2:41:48 PM</span>
</div>
</div>
</body>
</html>
Assuming that you are only looking for text changes and not changes to the HTML, you could strip the HTML from the output after the comparison. There are a number of ways to achieve this. The two that I first thought of were:
RegEx, because the re module is part of Python's standard library, and
BeautifulSoup, because it was created to parse web pages.
Create a function, using either of the above methods, to strip the HTML from the output.
UPDATE
Reading through the documentation again, it seems that the comparison output will always contain the actual HTML. However, you could create an additional output that shows only the text changes.
Also, to avoid showing the whole document, I've set the context parameter to True.
Using BeautifulSoup:
import difflib
from bs4 import BeautifulSoup, Comment

def remove_html_bs(raw_html):
    "Returns the visible text nodes of an HTML fragment as a list"
    data = BeautifulSoup(raw_html, 'html.parser')
    data = data.findAll(text=True)

    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        # Comments are parsed into Comment nodes, so check the node type
        elif isinstance(element, Comment):
            return False
        return True

    result = list(filter(visible, data))
    return result

def compare_files(file1, file2):
    "Creates two comparison files"
    file1 = file1.readlines()
    file2 = file2.readlines()
    # Difference line by line - HTML
    difference_html = difflib.HtmlDiff(tabsize=8).make_file(file1, file2, context=True, numlines=5)
    # Difference line by line - raw
    difference_file = set(file1).difference(file2)
    # List of differences by line index
    difference_index = []
    for dt in difference_file:
        for index, t in enumerate(file1):
            if dt in t:
                difference_index.append(f'{index}, {remove_html_bs(dt)[0]}, {remove_html_bs(file2[index])[0]}')
    # Write entire line with changes
    with open('comparison.html', 'w') as outFile:
        outFile.write(difference_html)
    # Write only text changes, by index
    with open('comparison.txt', 'w') as outFile:
        outFile.write('LineNo, File1, File2\n')
        outFile.write('\n'.join(difference_index))
    return difference_html, difference_file

with open('file1.html', 'r') as file1, open('file2.html', 'r') as file2:
    difference_html, difference_file = compare_files(file1, file2)
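A different approach, not the one used above, is to strip the HTML before the comparison, so that difflib only ever sees the visible text. The sketch below swaps BeautifulSoup for the standard library's html.parser; the names TextExtractor, visible_lines and text_diff are hypothetical, and the two input strings are made-up stand-ins for file1.html and file2.html.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of style, script and title."""

    SKIPPED = {"style", "script", "title"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.lines.append(text)


def visible_lines(html_text):
    """Return the visible text of an HTML document, one entry per text node."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.lines


def text_diff(html_a, html_b):
    """Unified diff of the visible text only (no context lines)."""
    return list(difflib.unified_diff(
        visible_lines(html_a), visible_lines(html_b),
        fromfile="file1", tofile="file2", lineterm="", n=0))


# Made-up inputs standing in for the contents of file1.html and file2.html
old = '<html><body><div class="a">Version: 7</div><div>Active</div></body></html>'
new = '<html><body><div class="a">Version: 8</div><div>Active</div></body></html>'
for line in text_diff(old, new):
    print(line)
```

The lists returned by visible_lines could also be passed to difflib.HtmlDiff().make_file(...) if a side-by-side HTML report of the text is preferred over a unified diff.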
I can get all the text between <span>...</span>, but I can't get only the text that follows the closing tag of the nested span. How could I do that? I've tried replacing the unwanted text afterwards, but I couldn't get it to work. I am quite new to Python.
I have also tried using for x in soup.find_all('/span', class_ = "textLarge textWhite"), but that doesn't display anything.
Relevant html:
<div style="width:100%; display:inline-block; position:relative; text-
align:center; border-top:thin solid #fff; background-image:linear-
gradient(#333,#000);">
<div style="width:100%; max-width:1400px; display:inline-block;
position:relative; text-align:left; padding:20px 15px 20px 15px;">
<a href="/manpower-fit-for-military-service.asp" title="Manpower
Fit for Military Service ranked by country">
<div class="smGraphContainer"><img class="noBorder"
src="/imgs/graph.gif" alt="Small graph icon"></div>
</a>
<span class="textLarge textWhite"><span
class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
</div>
<div class="blockSheen"></div>
</div>
Relevant python code:
for y in soup.find_all('span', class_ = "textBold"):
    print(y.text) #this gets FIT-FOR-SERVICE:
for x in soup.find_all('span', class_ = "textLarge textWhite"):
    print(x.text) #this gets FIT-FOR-SERVICE: 18,740,382 but i only want the number
Expected result: "18,740,382"
I believe you have two options here:
1 - Use a regex on the text of the parent span tag to extract only the digits.
2 - Use the decompose() function to remove the child span tag from the tree, and extract the text afterwards, like this:
from bs4 import BeautifulSoup
h = """<div style="width:100%; display:inline-block; position:relative; text-
align:center; border-top:thin solid #fff; background-image:linear-
gradient(#333,#000);">
<div style="width:100%; max-width:1400px; display:inline-block;
position:relative; text-align:left; padding:20px 15px 20px 15px;">
<a href="/manpower-fit-for-military-service.asp" title="Manpower
Fit for Military Service ranked by country">
<div class="smGraphContainer"><img class="noBorder"
src="/imgs/graph.gif" alt="Small graph icon"></div>
</a>
<span class="textLarge textWhite"><span
class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
</div>
<div class="blockSheen"></div>
</div>"""
soup = BeautifulSoup(h, "lxml")
soup.find('span', class_ = "textLarge textWhite").span.decompose()
res = soup.find('span', class_ = "textLarge textWhite").text.strip()
print(res)
#18,740,382
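Option 1 is not shown above, so here is a minimal sketch of it. It assumes span_text holds the value of x.text from the question; the helper name extract_number is hypothetical.

```python
import re


def extract_number(text):
    """Return the first run of digits with optional thousands commas, or None."""
    match = re.search(r"\d+(?:,\d+)*", text)
    return match.group(0) if match else None


# Assumed to be the text of the outer span, as printed in the question
span_text = "FIT-FOR-SERVICE: 18,740,382"

number_text = extract_number(span_text)
print(number_text)
# Drop the commas if an integer is needed
print(int(number_text.replace(",", "")))
```

This leaves the parse tree untouched, which may matter if the label inside the child span is still needed later.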
Here is how you could do it:
soup.find('span', {'class':'textLarge textWhite'}).find('span').extract()
output = soup.find('span', {'class':'textLarge textWhite'}).text.strip()
output:
18,740,382
Instead of grabbing the text using x.text you can use x.find_all(text=True, recursive=False) which will give you all the top-level text (in a list of strings) for a node without going into the children. Here's an example using your data:
for x in soup.find_all('span', class_ = "textLarge textWhite"):
    res = x.find_all(text=True, recursive=False)
    # join and strip the strings then print
    print(" ".join(map(str.strip, res)))
#outputs: '18,740,382'
I got this little piece of code:
text = """<html><head></head><body>
<h1 style="
text-align: center;
">Main site</h1>
<div>
<p style="
color: blue;
text-align: center;
">text1
</p>
<p style="
color: blueviolet;
text-align: center;
">text2
</p>
</div>
<div>
<p style="text-align:center">
<img src="./foo/test.jpg" alt="Testing static images" style="
">
</p>
</div>
</body></html>
"""
import sys
import re
import bs4
def prettify(soup, indent_width=4):
    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())
soup = bs4.BeautifulSoup(text, "html.parser")
print(prettify(soup))
The output of the above snippet right now is:
<html>
<head>
</head>
<body>
<h1 style="
text-align: center;
">
Main site
</h1>
<div>
<p style="
color: blue;
text-align: center;
">
text1
</p>
<p style="
color: blueviolet;
text-align: center;
">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img alt="Testing static images" src="./foo/test.jpg" style="
"/>
</p>
</div>
</body>
</html>
I'd like to figure out how to format the output so it becomes this instead:
<html>
<head>
</head>
<body>
<h1 style="text-align: center;">
Main site
</h1>
<div>
<p style="color: blue;text-align: center;">
text1
</p>
<p style="color: blueviolet;text-align: center;">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img alt="Testing static images" src="./foo/test.jpg" style=""/>
</p>
</div>
</body>
</html>
Said otherwise, I'd like to keep html statements such as <tag attrib1=value1 attrib2=value2 ... attribn=valuen> in one single line if possible. When I say "if possible" I mean without screwing up the value of the attributes themselves (value1, value2, ..., valuen).
Is this possible to achieve with beautifulsoup4? As far as I've read in the docs, it seems you can use a custom formatter, but I don't know how to write one that meets the requirements described above.
EDIT:
@alecxe's solution is quite simple, but unfortunately it fails in some more complex cases like the one below:
test1 = """
<div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
<div id="sessionsGrid" data-columns="[
{ field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 },
{ field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}},
{ field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80},
{ field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 },
{ field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}},
{ field: 'note', title:'Note'}
]">
</div>
</div>
"""
from bs4 import BeautifulSoup
import re
def prettify(soup, indent_width=4, single_lines=True):
    if single_lines:
        for tag in soup():
            for attr in tag.attrs:
                print(tag.attrs[attr], tag.attrs[attr].__class__)
                tag.attrs[attr] = " ".join(
                    tag.attrs[attr].replace("\n", " ").split())
    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())
def html_beautify(text):
    soup = BeautifulSoup(text, "html.parser")
    return prettify(soup)
print(html_beautify(test1))
TRACEBACK:
dialer-capmaign-console <class 'str'>
['fill-vertically'] <class 'list'>
Traceback (most recent call last):
File "d:\mcve\x.py", line 35, in <module>
print(html_beautify(test1))
File "d:\mcve\x.py", line 33, in html_beautify
return prettify(soup)
File "d:\mcve\x.py", line 25, in prettify
tag.attrs[attr].replace("\n", " ").split())
AttributeError: 'list' object has no attribute 'replace'
BeautifulSoup tried to preserve the newlines and multiple spaces you had in the attribute values in the input HTML.
One workaround here would be to iterate over the element attributes and clean them up prior to prettifying - removing the newlines and replacing multiple consecutive spaces with a single space:
for tag in soup():
    for attr in tag.attrs:
        tag.attrs[attr] = " ".join(tag.attrs[attr].replace("\n", " ").split())
print(soup.prettify())
Prints:
<html>
<head>
</head>
<body>
<h1 style="text-align: center;">
Main site
</h1>
<div>
<p style="color: blue; text-align: center;">
text1
</p>
<p style="color: blueviolet; text-align: center;">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img alt="Testing static images" src="./foo/test.jpg" style=""/>
</p>
</div>
</body>
</html>
Update (to address the multi-valued attributes like class):
You just need a slight modification that adds special handling for the case where an attribute value is a list:
for tag in soup():
    tag.attrs = {
        attr: [" ".join(attr_value.replace("\n", " ").split()) for attr_value in value]
              if isinstance(value, list)
              else " ".join(value.replace("\n", " ").split())
        for attr, value in tag.attrs.items()
    }
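The same normalization can be pulled out into a plain function and checked without parsing any HTML. This is only a sketch: collapse_whitespace and normalize_attrs are hypothetical names, and the sample dictionary imitates what tag.attrs holds for a tag with a multi-line style and a multi-valued class attribute.

```python
def collapse_whitespace(value):
    """Replace every run of whitespace (newlines included) with one space."""
    return " ".join(value.split())


def normalize_attrs(attrs):
    """Clean attribute values; multi-valued attributes such as class are lists."""
    return {
        name: [collapse_whitespace(item) for item in value]
        if isinstance(value, list)
        else collapse_whitespace(value)
        for name, value in attrs.items()
    }


# Imitates tag.attrs for a tag with a multi-line style attribute
sample_attrs = {
    "style": "\n    color: blue;\n    text-align: center;\n",
    "class": ["fill-vertically"],
}
print(normalize_attrs(sample_attrs))
```

Because str.split() with no arguments already splits on newlines, the explicit replace("\n", " ") step is not needed in this version. With a helper like this, the loop above reduces to tag.attrs = normalize_attrs(tag.attrs).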
While BeautifulSoup is more commonly used, HTML Tidy may be a better choice if you're working with quirks and have more specific requirements.
After installing the library for Python (pip install pytidylib) try the following code:
from tidylib import Tidy
tidy = Tidy()
# assign string to text
config = {
"doctype": "omit",
# "show-body-only": True
}
print(tidy.tidy_document(text, options=config)[0])
tidy.tidy_document returns a tuple with the HTML and any errors that may have occurred. This code will output:
<html>
<head>
<title></title>
</head>
<body>
<h1 style="text-align: center;">
Main site
</h1>
<div>
<p style="color: blue; text-align: center;">
text1
</p>
<p style="color: blueviolet; text-align: center;">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img src="./foo/test.jpg" alt="Testing static images" style="">
</p>
</div>
</body>
</html>
Uncommenting "show-body-only": True gives the following for the second sample:
<div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
<div id="sessionsGrid" data-columns="[ { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 }, { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}}, { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80}, { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 }, { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}}, { field: 'note', title:'Note'} ]"></div>
</div>
See the HTML Tidy configuration documentation for further options and customization. There are wrapping options specific to attributes which may help. As you can see, empty elements will only take up one line, and html-tidy will automatically try to add things like DOCTYPE, head and title tags.
I have a web page that is set up like this:
//a bunch of container divs....
<a class="food cat2 isotope-item" href="#" style="position: absolute; left: 45px; top: 0px;">
<div class="background"></div>
<div class="image">
<img src="/assets/score-images/cereal2.png" alt="">
</div>
<div class="score">1148</div>
<div class="name">Cereal with Banana</div>
</a>
<a class="food cat1 isotope-item" href="#" style="position: absolute; left: 215px; top: 0px;">
<div class="background"></div>
<div class="image">
<img src="/assets/score-images/burrito-all.png" alt="">
</div>
<div class="score">2257</div>
<div class="name">Beef & Cheese Burrito</div>
</a>
//hundreds more a tags....
</div>
I'm running this code to extract the name and score from each "a" tag.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.eatlowcarbon.org/food-scores')
soup = BeautifulSoup(page.content, 'html.parser')
print('HEllO')
foodDict = {}
aTag = soup.findAll('a')
for tag in aTag:
    print('HELLO 2')
    name = tag.find("div", {"class": "name"}).text
    score = tag.find("div", {"class": "score"}).text
    foodDict[name] = score
print('hello')
Both print statements are successfully executed, and so the second one tells me that I've entered the for loop at least. However, I get the error,
File "scrapeRecipe.py", line 40, in <module>
name = tag.find("div", {"class": "name"}).text
AttributeError: 'NoneType' object has no attribute 'text'
From this post, I'm assuming that my code doesn't find any div with a class type equal to "name", or "score" for that matter. I'm completely new to python. Does anyone have any advice?
The problem is not with your tag.find('div', ...), but rather with your soup.findAll('a'). You are pulling every a tag, including those that lack the child divs you are trying to read, so tag.find(...) returns None for them.
Judging by what you need, you should add a class filter to your findAll as well:
aTag = soup.findAll('a', {'class': 'food'})
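If some matching tags might still lack one of the child divs, checking for None before reading the text avoids the AttributeError. Below is a minimal sketch, assuming a trimmed, made-up version of the page from the question stored in html_doc; the variable name food_scores is also hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed, made-up stand-in for the page; the first link has no child divs
html_doc = """
<a href="/about">About</a>
<a class="food cat2 isotope-item" href="#">
  <div class="score">1148</div>
  <div class="name">Cereal with Banana</div>
</a>
<a class="food cat1 isotope-item" href="#">
  <div class="score">2257</div>
  <div class="name">Beef &amp; Cheese Burrito</div>
</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")
food_scores = {}
for tag in soup.find_all("a", class_="food"):
    name = tag.find("div", class_="name")
    score = tag.find("div", class_="score")
    # find() returns None when no matching child exists, so skip such tags
    if name is None or score is None:
        continue
    food_scores[name.get_text(strip=True)] = int(score.get_text(strip=True))

print(food_scores)
```

Filtering by class narrows the search, and the None check keeps the loop from failing if the markup is inconsistent.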
I am scraping titles, descriptions, links, and people's names from multiple divs that follow the same structure. I am using BeautifulSoup, and I am able to scrape everything out of the first div. However, I'm having trouble scraping from my long list of divs and getting the data into a portable format like CSV or JSON.
How can I scrape each item from my long list of divs, and store that information in JSON objects together for each mp3?
The divs look like this:
<div class="audioBoxWrap clearBoth">
<h3>Title 1</h3>
<p>Description 1</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ Right-click to download] </div>
</div>
<div class="audioBoxWrap clearBoth">
<h3>Title 2</h3>
<p>Description 2</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ Right-click to download] </div>
</div>
I've figured out how to scrape from the first div, but I cannot grab the info for each div. For example, my code below only spits out the h3 for the first div over and over.
I know that I can create a python list for titles, descriptions, etc, but how do I keep the metadata structure like JSON, so that title1, link1, and description1 stay together, as well as title2's information.
with open('soup.html', 'r') as myfile:
    html_doc = myfile.read()
soup = BeautifulSoup(html_doc, 'html.parser')
audio_div = soup.find_all('div', {'class':"audioBoxWrap clearBoth"})
print(len(audio_div))
#create dictionary for storing scraped data. I don't know how to store the values for each mp3 separately.
for i in audio_div:
    print(soup.find('h3').text)
I want my JSON to look something like this:
{
"podcasts":[
{
"title":"title1",
"description":"description1",
"link":"link1"
},
{
"title":"title2",
"description":"description2",
"link":"link2"
}
]
}
Iterate over every track and make context-specific searches:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<div>
<div class="audioBoxWrap clearBoth">
<h3>Title 1</h3>
<p>Description 1</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link1.mp3">Right-click to download</a>] </div>
</div>
<div class="audioBoxWrap clearBoth">
<h3>Title 2</h3>
<p>Description 2</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link2.mp3">Right-click to download</a>] </div>
</div>
</div>"""
soup = BeautifulSoup(data, "html.parser")
tracks = soup.find_all('div', {'class':"audioBoxWrap clearBoth"})
result = {
"podcasts": [
{
"title": track.h3.get_text(strip=True),
"description": track.p.get_text(strip=True),
"link": track.a["href"]
}
for track in tracks
]
}
pprint(result)
Prints:
{'podcasts': [{'description': 'Description 1',
'link': 'link1.mp3',
'title': 'Title 1'},
{'description': 'Description 2',
'link': 'link2.mp3',
'title': 'Title 2'}]}
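Since the question also asks for a portable format, the resulting dictionary can be serialized with the standard library. The sketch below assumes result has the shape printed above and writes to in-memory strings; in practice you would probably write to files instead.

```python
import csv
import io
import json

# Same shape as the dictionary printed above
result = {
    "podcasts": [
        {"title": "Title 1", "description": "Description 1", "link": "link1.mp3"},
        {"title": "Title 2", "description": "Description 2", "link": "link2.mp3"},
    ]
}

# JSON: one nested document, matching the structure asked for in the question
json_text = json.dumps(result, indent=2)
print(json_text)

# CSV: one row per podcast, with the dictionary keys as column names
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "description", "link"])
writer.writeheader()
writer.writerows(result["podcasts"])
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping each podcast as its own dictionary is what holds the title, description and link together, so both formats can be produced from the same list.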