Find and retrieve content from html text using BeautifulSoup - python

I have the following HTML (or at least I think it's HTML) that I am working on with BeautifulSoup in Python.
I have parsed the HTML with BeautifulSoup correctly. Next, I would like to retrieve the content of the 'div' that goes with a given data-label (for example, data-label="Relation" near the bottom of the code). In particular, I would like to obtain a dictionary whose key is the text of the data-label ("Relation" in my example) and whose value is the content of the corresponding 'div', i.e. in my example the href "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
I have tried several approaches, but as far as I know data-label is not a standard attribute, so I am not sure how to handle it.
(Note that this is just an example; I will have to do the same for thousands, if not millions, of web pages with a similar structure.)
Any help is appreciated. Thank you!
<div id="directs">
<label class="c1"><a data-comment="A human-readable name for the subject." data-label="label" href="http://www.w3.org/2000/01/rdf-schema#label">
rdfs:<span>label</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:string</span>
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="" data-label="" href="http://lod.xdams.org/ontologies/ods/modified">
ods:<span>modified</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:dateTime</span>
2016-07-05T12:26:02Z
</div>
</div>
</div>
<label class="c1"><a data-comment="The subject is an instance of a class." data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
rdf:<span>type</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/intervento" title="<http://dati.camera.it/ocd/intervento>">
ocd:intervento
</a>
</div>
</div>
<label class="c1"><a data-comment="propriet generica utilizzata per puntare alla risorsa deputato in vari punti dell'ontologia" data-label="rierimento a deputato" href="http://dati.camera.it/ocd/rif_deputato">
ocd:<span>rif_deputato</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/deputato.rdf/d15080_17" title="<http://dati.camera.it/ocd/deputato.rdf/d15080_17>">
http://dati.camera.it/ocd/deputato.rdf/d15080_17
</a>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
target="_blank" title="<http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010>">
http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010
</a>
</div>
</div>
</div>

You can find the data-labels in one pass and the div content in another. Then, the results can be zipped together to create the dictionary:
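from bs4 import BeautifulSoup as soup
import re
# content holds the HTML string from the question
d = soup(content, 'html.parser').find('div', {'id':'directs'})
# data-label of the <a> inside each <label>
_labels = [i.a['data-label'] for i in d.find_all('label')]
# text of each value <div>; the raw string avoids an invalid-escape warning for \s
_content = [i.text for i in d.find_all('div', {'class':re.compile(r'c2 value\s*')})]
result = dict(zip(_labels, _content))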
Output:
{'label': '\n\n\nxsd:string \n intervento di Fabrizio CICCHITTO\n \n\n',
'Title': '\n\n\n intervento di Fabrizio CICCHITTO\n \n\n',
'': '\n\n\nxsd:dateTime \n 2016-07-05T12:26:02Z\n \n\n',
'type': '\n\n\n ocd:intervento\n \n\n',
'rierimento a deputato': '\n\n\n http://dati.camera.it/ocd/deputato.rdf/d15080_17\n \n\n',
'Relation': '\n\n\n http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010\n \n\n'}
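If you want the values without the surrounding whitespace, and the href rather than the link text where a link exists, one option is to walk from each label to the div that follows it. This is only a sketch under a few assumptions: SAMPLE_HTML is a trimmed-down, hypothetical stand-in for the page (the example.org URL is made up), extract_labels is a name I chose, and every value div is assumed to be the next div sibling of its label.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page in the question.
SAMPLE_HTML = """
<div id="directs">
<label class="c1"><a data-label="Title" href="http://purl.org/dc/elements/1.1/title">dc:<span>title</span></a></label>
<div class="c2 value ">
<div class="toMultiLine "><div class="fixed">
intervento di Fabrizio CICCHITTO
</div></div>
</div>
<label class="c1"><a data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">dc:<span>relation</span></a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://example.org/doc#frag" target="_blank">http://example.org/doc#frag</a>
</div>
</div>
</div>
"""


def extract_labels(html):
    """Map each data-label to the href (if any) or the stripped text of its value div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="directs")
    result = {}
    if container is None:
        return result
    for anchor in container.select("label a[data-label]"):
        # Assumption: the value div is the next <div> sibling of the enclosing <label>.
        value_div = anchor.find_parent("label").find_next_sibling("div")
        if value_div is None:
            continue
        link = value_div.find("a", href=True)
        if link is not None:
            result[anchor["data-label"]] = link["href"]
        else:
            result[anchor["data-label"]] = value_div.get_text(" ", strip=True)
    return result


labels = extract_labels(SAMPLE_HTML)
print(labels)
```

Note that a dictionary keeps only one entry per key, so pages with an empty or repeated data-label (like the data-label="" in the question) would need a different key, for example the href of the label link.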

Related

Adding multiple loop outputs to single dictionary

I'm learning how to use Python and trying to use Beautiful Soup to do some web scraping. I want to pull the product name and product number from the saved page I'm referencing in my Python code; below is a snippet of the section the script is looking at. The values are located in a div with the class name and in a span with the id product_id.
My Python script does collect all the product names, but once it gets to the product_id loop, it overwrites the values from my first loop. Can anyone point me in the right direction?
OUTPUT
After first loop
{'name': 'ADA Hi-Lo Power Plinth Table'}
{'name': 'Adjustable Headrest Couch - Chrome-Plated Steel Legs'}
{'name': 'Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)'}
After second loop
{'name': 'Weekender Folding Cot', 'product_ID': '55984'}
{'name': 'Weekender Folding Cot', 'product_ID': '31350'}
{'name': 'Weekender Folding Cot', 'product_ID': '31351'}
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="1496" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="ADA-Hi-Lo-Power-Plinth-Table_p_1496.html">
<img alt="ADA Hi-Lo Power Plinth Table" class="img-responsive" src="assets/images/thumbnails/55984_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="ADA-Hi-Lo-Power-Plinth-Table_p_1496.html">
ADA Hi-Lo Power Plinth Table
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
55984
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$2,849.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=1496&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="2878" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs_p_2878.html">
<img alt="Adjustable Headrest Couch - Chrome-Plated Steel Legs" class="img-responsive" src="assets/images/thumbnails/31350_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs_p_2878.html">
Adjustable Headrest Couch - Chrome-Plated Steel Legs
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
31350
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$729.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=2878&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="2879" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs-X-Large_p_2879.html">
<img alt="Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)" class="img-responsive" src="assets/images/thumbnails/31350_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs-X-Large_p_2879.html">
Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
31351
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$769.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=2879&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
BEGINNING OF PYTHON SCRIPT
import requests
from bs4 import BeautifulSoup
with open('recoveryCouches','r') as html_file:
    content= html_file.read()
soup = BeautifulSoup(content,'lxml')
allProductDivs = soup.find('div', class_='product-items product-items-4')
#get names of products on page
nameDiv = soup.find_all('div',class_='name')
prodID = soup.find_all('span', id='product_id')
records=[]
d=dict()
for name in nameDiv:
    d['name'] = name.find('a').text
    records.append(d)
    print(d)
for productId in prodID:
    d['product_ID'] = productId.text
    records.append(d)
    print(d)
Try this:
nameDiv = soup.find_all('div',class_='name')
prodID = soup.find_all('span', id='product_id')
records=[]
for i in range(len(nameDiv)):
    records.append({
        "name": nameDiv[i].find('a').text.strip(),
        "product_ID": prodID[i].text.strip()
    })
To write the data to a CSV file:
import csv
# newline='' stops the csv module from writing blank rows on Windows
with open("file.csv", 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=records[0].keys())
    writer.writeheader()
    for record in records:
        writer.writerow(record)
If I understand the question correctly, you're trying to get all the names and productIds and store them. The problem you're running into is, in the dictionary, your values are getting overwritten.
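To make the overwriting visible, here is a small sketch that needs no HTML at all. The names shared, aliased and fresh are mine, and the product names are just sample strings taken from the question's output. The point is that appending the same dict object several times stores several references to one dict, so every entry shows the last value written.

```python
names = ["ADA Hi-Lo Power Plinth Table", "Weekender Folding Cot"]

# Pattern from the question: one dict is mutated and appended on every iteration.
shared = {}
aliased = []
for name in names:
    shared["name"] = name
    aliased.append(shared)

# Alternative: build a new dict on every iteration.
fresh = [{"name": name} for name in names]

print(aliased)  # both entries show the last name
print(fresh)    # each entry keeps its own name
print(aliased[0] is aliased[1])  # True: the list holds the same object twice
```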
One solution to that problem would be to initialize your python dictionary values as lists, like so:
d = {
'name': [],
'product_ID': []
}
Then, in each of the loops, you can append the new value to the corresponding list. What you currently have assigns to the same key every time, which overwrites the previous value.
for name in nameDiv:
    d['name'].append(name.find('a').text)
for productId in prodID:
    d['product_ID'].append(productId.text)
This will result in a list of all names and product_IDs stored in that dictionary.
If you want to put these lists together in a format like this:
[(name0, productId0), (name1, productId1), ...]
Then you can make use of zip, which will basically combine your lists as long as they are equal length. For example:
zipped_results = list(zip(d['name'], d['product_ID']))
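Another option, if each name and id sit inside the same product container as in the snippet from the question, is to loop over the containers and read both values from each one. That way a product with a missing name or id cannot shift the pairing. This is a sketch, not a drop-in replacement: SAMPLE_HTML is a shortened, hypothetical version of the question's markup, and the a.html and b.html links are placeholders.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup in the question.
SAMPLE_HTML = """
<div class="revealOnScroll product-item">
<div class="name"><a href="a.html"> ADA Hi-Lo Power Plinth Table </a></div>
<div class="product-id">Item Number: <strong><span id="product_id"> 55984 </span></strong></div>
</div>
<div class="revealOnScroll product-item">
<div class="name"><a href="b.html"> Adjustable Headrest Couch - Chrome-Plated Steel Legs </a></div>
<div class="product-id">Item Number: <strong><span id="product_id"> 31350 </span></strong></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for item in soup.find_all("div", class_="product-item"):
    name_link = item.select_one("div.name a")
    id_span = item.find("span", id="product_id")
    if name_link is None or id_span is None:
        continue  # skip incomplete products instead of misaligning the lists
    # A new dict per product, so nothing is overwritten.
    records.append({
        "name": name_link.get_text(strip=True),
        "product_ID": id_span.get_text(strip=True),
    })

print(records)
```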

Scrape everything between two unnested tags

Is it possible to scrape everything between two unnested tags?
For instance:
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
So I would like to scrape just what is located under Title 1, up to Title 2. Is this possible using bs4?
Right now I have something like this (the problem is that it scrapes everything, since the classes are all the same):
for i in soup.findAll("div",{"class":"div"}):
print(i.span.text)
Now I get:
span1
span2
span3
span4
I'd like to get:
span1
span2
I don't know if this is the best solution to this problem, but you can split your text and scrape only the part that you need.
text = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
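from bs4 import BeautifulSoup
# parse the full text once to locate the second heading
soup = BeautifulSoup(text, 'lxml')
# string= is the current name of the older text= argument
sub_text = text.split(soup.find('h3', string="Title 2").string)[0]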
This will give:
'\n<h3>Title 1</h3>\n<div class="div">\n <span class="span">span1</span>\n <label class="label">label1</label>\n</div>\n<div class="div">\n <span class="span">span2</span>\n</div>\n<h3>'
After converting that string into a bs4 object, you can scrape all you need:
scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.findAll("div", class_="div"):
    print(i.span.text)
# -> span1 span2
One approach is to find the second class="span", then navigate backwards with find_all_previous() to collect the div tags before it. The tags come back in reverse order, so use the reversed() function, and then find the <span> tag in each div.
from bs4 import BeautifulSoup
html = """
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for tag in reversed(
    soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
    print(tag.find("span").text)
Output:
span1
span2
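A third possibility is to start at the heading and walk over its following siblings until the next heading shows up. The sketch below assumes the h3 tags are properly closed (</h3>) and that the headings and divs are siblings; spans_under is a helper name I made up, and html.parser is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Same structure as in the question, with the <h3> tags closed.
HTML = """
<h3>Title 1</h3>
<div class="div"><span class="span">span1</span><label class="label">label1</label></div>
<div class="div"><span class="span">span2</span></div>
<h3>Title 2</h3>
<div class="div"><span class="span">span3</span><label class="label">label2</label></div>
<div id="div"><span id="span">span4</span></div>
"""


def spans_under(soup, title):
    """Collect the span texts between the <h3> with this title and the next <h3>."""
    start = soup.find("h3", string=title)
    if start is None:
        return []
    found = []
    for sibling in start.find_next_siblings():
        if sibling.name == "h3":
            break  # reached the next section
        span = sibling.find("span")
        if span is not None:
            found.append(span.get_text(strip=True))
    return found


soup = BeautifulSoup(HTML, "html.parser")
section = spans_under(soup, "Title 1")
print(section)
```

Because the walk stops at the next h3, it does not rely on the position of the divs or on their class names.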

BeautifulSoup find_all method returns the same elements

Hi here is my soup object:
<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>
How can I get all the c-codes and their corresponding text from the object?
For example: c-code: "c5ff5b1d0dc93c" and its corresponding text: "Herren" for the first row...
My code looks like this (categories is the soup object):
for category in categories.find_all('div'):
    category = categories.find('div')
    print(category)
I only receive the information of the first row....
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
What happens?
categories holds your HTML.
Inside your loop you call category = categories.find('div'). find('div') always returns the first occurrence, so category is always <div data-navi-cat="c5ff5b1d0dc93c">Herren</div>, no matter which element the loop is on.
Use the loop variable instead: iterate with for element in categories.find_all('div'), then category = element.get_text() gives the text and code = element.get('data-navi-cat') gives the code.
Example
from bs4 import BeautifulSoup
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>'''
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all('div'):
    category = element.get_text()
    code = element.get('data-navi-cat')
    print(category, code)
Output
Herren
c5ff5b1d0dc93c
Frauen
c5ff5b1d0dc95f
A-Jugend (U19)
c5ff5b1d0dc978
B-Jugend (U17)
c5ff5b1d0dc98c
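If you would rather have the codes and texts in a single lookup table, a dict comprehension is one way to do it. This is a sketch on a shortened copy of the markup (the table and tr wrappers are my addition); get_text(strip=True) drops the newlines around each label, and the attribute selector skips any div that has no data-navi-cat.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
HTML = '''<table><tr><td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div>
no code here
</div>
</td></tr></table>'''

soup = BeautifulSoup(HTML, "html.parser")

# Map each code to its label; only divs that carry the attribute are selected.
categories = {
    div["data-navi-cat"]: div.get_text(strip=True)
    for div in soup.select("td.kategorie div[data-navi-cat]")
}

print(categories)
```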

Return text surrounded by double tag with BeautifulSoup

I am looping through a list of URLs. Each page has between 1 and n descriptions, each of which is surrounded by a double p tag.
BeautifulSoup.find(class_='view-content')
# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>
# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
When I use
for d in soup.find(class_='view-content').find_all('p'):
    dd = d.contents[0]
    print(dd)
I get
<p>One animal</p>One animal
<p>One person</p>One person
<p>Two people</p>Two people
Instead of expected
One animal
One person
Two people
Any way to retrieve content surrounded by double p tags?
Edit: The following returns the same, but at least without p tags.
for d in soup.find_all("div",class_="view-content"):
    print(' '.join(i.text for i in d.find_all('p')[1:]))
Another solution.
from simplified_scrapy import SimplifiedDoc
html = '''
# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>
# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.selects('div.view-content')
datas=[]
for div in divs:
    datas.extend([p.text for p in div.ps])
print(datas)
Result:
['One animal', 'One person', 'Two people']
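If you prefer to stay with BeautifulSoup, one idea is to keep only the innermost p tags, i.e. those that do not contain another p, and to skip empty ones. Treat this as a sketch: descriptions is a name I picked, and how the doubled tags end up in the tree depends on the parser (as far as I can tell, html.parser keeps the inner p nested in the outer one, while other parsers may close the outer p early and leave empty paragraphs). The two checks below are meant to cope with both cases.

```python
from bs4 import BeautifulSoup

HTML = '''
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
'''


def descriptions(html):
    """Return the text of every innermost, non-empty <p> inside div.view-content."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for p in soup.select("div.view-content p"):
        if p.find("p") is not None:
            continue  # wrapper paragraph: its text is reported by the inner <p>
        text = p.get_text(strip=True)
        if text:  # ignore empty paragraphs left over from the doubled tags
            texts.append(text)
    return texts


result = descriptions(HTML)
print(result)
```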

How to get all values of a nested div

I'd like to take all the values of a nested div.
<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
For each element with the class upcoming-events__event, I'd like to print the following:
upcoming-events__location, upcoming-events__date.
For more information: upcoming-events__event-link
Using bs4, you can get the text.
from bs4 import BeautifulSoup
html='''<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-region="AMER" data-eventdate="2019-07-13" data-latlon="40.167207,-105.101928" data-distance="8693671.480264762" style="display: block;">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<img src="/assets/img/events/md-2019-7-longmontcousa.jpg" class="upcoming-events__image">
<div class="upcoming-events__content">
<img src="/assets/img/missionday.png" class="event-icon">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>
'''
soup=BeautifulSoup(html,'lxml')
for i in (soup.select('.upcoming-events__event')):
    location=i.select('.upcoming-events__location')[0].string
    date=i.select('.upcoming-events__date')[0].string
    link=i.select('.upcoming-events__event-link')[0].get('href')
    print(f'{location}, {date}\nFor more info :{link}')
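When there are many events and some of them may lack a date, location or link, indexing with [0] raises an IndexError. A slightly more defensive sketch uses select_one, which returns None when nothing matches, and collects one dict per event. text_of is a hypothetical helper name, the HTML is a shortened copy of the snippet from the question, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the event markup from the question.
HTML = '''<div class="upcoming-events__event js-event-filter" data-eventtype="Mission Day" data-eventdate="2019-07-13">
<a href="http://events.ingress.com/MissionDay/Longmont2019" class="upcoming-events__event-link">
<div class="upcoming-events__content">
<div class="upcoming-events__date">Jul 13, 2019</div>
<div class="upcoming-events__location">Longmont, CO, USA</div>
</div>
</a>
</div>'''


def text_of(parent, selector):
    """Stripped text of the first match inside parent, or None if there is no match."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(HTML, "html.parser")

events = []
for event in soup.select(".upcoming-events__event"):
    link = event.select_one("a.upcoming-events__event-link")
    events.append({
        "location": text_of(event, ".upcoming-events__location"),
        "date": text_of(event, ".upcoming-events__date"),
        "link": link.get("href") if link is not None else None,
        # data-* attributes are read like any other attribute
        "event_date": event.get("data-eventdate"),
    })

for e in events:
    print(f"{e['location']}, {e['date']}\nFor more info: {e['link']}")
```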
