I am trying to save my Python console output to an excel file. Unfortunately, it only saves the first line, but truncates the rest.
html = '''<div class="dynamicBottom">
<div class="dynamicLeft">
<div class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<div class="improve_listing_btn ui_button primary small">improve this entry</div>
<h3 class="tabs_header">Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Location</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
</li>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for div in soup.find_all('div', class_="ratingRow wrap"):
text = div.text.strip()
alt = div.find('img').get('alt')
print(text, alt)
This gives out:
Location 45 out of fifty points
Service 45 out of fifty points
I tried saving this to excel adding the last two lines:
for div in soup.find_all('div', class_="ratingRow wrap"):
text = div.text.strip()
alt = div.find('img').get('alt')
print(text, alt)
worksheet.write_string(row, col+15, text)
worksheet.write_string(row, col+16, alt)
But this only saves "Location; 45 out of fifty points". I would like to save the info given in the output to excel using one row but different columns. Is there any way I can do that?
Thank you very much for your help!
Related
Is it possible to scrape everything between two unested tags ?
For instance:
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
So I would like to scrape just what is located under Title 1 until Title 2. Is this possible using bs4 ?
Right now I have something like this (problem is it scrape everything since classes are all the same):
for i in soup.findAll("div",{"class":"div"}):
print(i.span.text)
Now I get:
span1
span2
span3
span4
I'd like to get:
span1
span2
I don't know if this is best solution to this problem but you can split your text and scrape only the part that you need.
text = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
sub_text = text.split(soup.find('h3', text="Title 2").string)[0]
This will give:
'"\n<h3>Title 1</h3>\n<div class="div">\n <span class="span">span1</span>\n <label class="label">label1</label>\n</div>\n<div class="div">\n <span class="span">span2</span>\n</div>\n<h3>'
After converting that string into a bs4 object, you can scrape all you need:
scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.findAll("div", class_="div"):
print(i.span.text)
# -> span1 span2
One approach is:
find the second class="span", then navigate backwards, and find_all_previous() div.
The tags are in backward order, so use the reversed() function..
find the <span> tags
from bs4 import BeautifulSoup
html = """
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for tag in reversed(
soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
print(tag.find("span").text)
Output:
span1
span2
Hi here is my soup object:
<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>
How can I get all the c-codes and its corresponding text from the object?
For example: c-code: "c5ff5b1d0dc93c" and its corresponding text: "Herren" for the first row...
My code looks like this (categories is the soup object):
for category in categories.find_all('div'):
category = categories.find('div')
print(category)
I only receive the information of the first row....
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
What happens?
categories holds your html
in your loop you do category = categories.find('div') - find('div') always returns the first occurrence, so category will always be <div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
You should do category = element.get_text() to get the text and code = element.get('data-navi-cat') to get the code.
Example
from bs4 import BeautifulSoup
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>'''
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all('div'):
category = element.get_text()
code = element.get('data-navi-cat')
print(category, code)
Output
Herren
c5ff5b1d0dc93c
Frauen
c5ff5b1d0dc95f
A-Jugend (U19)
c5ff5b1d0dc978
B-Jugend (U17)
c5ff5b1d0dc98c
I am looping through list with urls. On each page there is between 1 and n descriptions which are surrounded by double p tag.
BeautifulSoup.find(class_='view-content')
# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>
# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
When I use
for d in soup.find(class_='view-content').find_all('p'):
dd = d.contents[0]
print(dd)
I get
<p>One animal</p>One animal
<p>One person</p>One person
<p>Two people</p>Two people
Instead of expected
One animal
One person
Two people
Any way to retrieve content surrounded by double p tags?
Edit: The following returns the same, but at least without p tags.
for d in soup.find_all("div",class_="view-content"):
print(' '.join(i.text for i in review.find_all('p')[1:]))
Another solution.
from simplified_scrapy import SimplifiedDoc
html = '''
# url 1
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One animal</p>
</p>
</div>
</div>
</div>
# url 2
<div class="view-content">
<div class="row">
<div class="description">
<p><p>One person</p>
</p>
</div>
</div>
<div class="row">
<div class="description">
<p><p>Two people </p>
</p>
</div>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.selects('div.view-content')
datas=[]
for div in divs:
datas.extend ([p.text for p in div.ps])
print (datas)
Result:
['One animal', 'One person', 'Two people']
I'm trying to extract information from a webpage using beautifulsoup and python. I want to extract the information right below a particular tag. To know if its the right tag I would like to do a comparison of its text and then extract the text in the next immediate tag.
Say for example, if the following is a part of an HTML page-source,
<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>
I want to check if the <p class="title"> has text value as Procurement type then I want to print out Services Similarly, if the <p class="title"> has text value as Reference then I want to print out ANAJSKJD23423-Commission and if <p class="title"> has value as Countries then print out all the countries i.e. Belgium,France,Luxembourg.
I know I can extract all the texts with <p class="data strong"> and append them to a list and later fetch all values using indexing. But the thing is, the order of the occurrence of these <p class="title> is not fixed....at some places countries could be mentioned before procurement-type. I, therefore, want to perform a check on the text values and then extract the next immediate tag's text value. I'm still new to BeautifulSoup so any help is appreciated. Thanks
You can do it many ways.Here you go.
from bs4 import BeautifulSoup
htmldata='''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
soup=BeautifulSoup(htmldata,'html.parser')
items=soup.find_all('p', class_='title')
for item in items:
if ('Procurement type' in item.text) or ('Reference' in item.text):
print(item.findNext('p').text)
You can also use :contains pseudo class with bs4 4.7.1. Although I have passed as a list you can separate out each condition
from bs4 import BeautifulSoup as bs
import re
html = 'yourHTML'
soup = bs(html, 'lxml')
items=[re.sub(r'\n\s+','', item.text.strip()) for item in soup.select('p.title:contains("Procurement type") + p, p.title:contains(Reference) + p, p.title:contains(Countries) + p')]
print(items)
Output:
You can add the argument to check for specific text when you use .find() or .find_all() then use .next_sibling or findNext() to grab the next tags with the content
Ie:
soup.find('p', {'class':'title'}, text = 'Procurement type')
Given:
html = '''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
you could do something like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
alpha = soup.find('p', {'class':'title'}, text = 'Procurement type')
for sibling in alpha.next_siblings:
try:
print (sibling.text)
except:
continue
Output:
Services
or
ref = soup.find('p', {'class':'title'}, text = 'Reference')
for sibling in ref.next_siblings:
try:
print (sibling.text)
except:
continue
Output:
ANAJSKJD23423-Commission
or
countries = soup.find('p', {'class':'title'}, text = 'Countries')
names = countries.findNext('p', {'class':'data strong'}).text.replace('", "','').strip().split('\n')
names = [name.strip() for name in names if not name.isspace()]
for country in names:
print (country)
Output:
Belgium
France
Luxembourg
I have the following html code (or at least I think it's html) that I am working on with BeautifulSoup on Python.
I have parsed the html using Beautiful soup correctly. What I would like to do next is to retrieve the content associated with the 'div' containing a certain data-label (for example, in the bottom part of the code, data-label="Relation"). In particular I would like to obtain a dictionary that has as key the text of the data-label, i.e. in my example "Relation", and as value the content of the same 'div', i.e. in my example the href "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
I have tried several approaches but data-label, as far as I know, does not appear to be a valid attribute, so I am not sure how to handle this.
(Note that this is just an example, but I will have to do the same for thousands, if not millions, of these webpages, with this similar structure).
Any help is appreciated. Thank you!
<div id="directs">
<label class="c1"><a data-comment="A human-readable name for the subject." data-label="label" href="http://www.w3.org/2000/01/rdf-schema#label">
rdfs:<span>label</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:string</span>
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="" data-label="" href="http://lod.xdams.org/ontologies/ods/modified">
ods:<span>modified</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:dateTime</span>
2016-07-05T12:26:02Z
</div>
</div>
</div>
<label class="c1"><a data-comment="The subject is an instance of a class." data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
rdf:<span>type</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/intervento" title="<http://dati.camera.it/ocd/intervento>">
ocd:intervento
</a>
</div>
</div>
<label class="c1"><a data-comment="propriet generica utilizzata per puntare alla risorsa deputato in vari punti dell'ontologia" data-label="rierimento a deputato" href="http://dati.camera.it/ocd/rif_deputato">
ocd:<span>rif_deputato</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/deputato.rdf/d15080_17" title="<http://dati.camera.it/ocd/deputato.rdf/d15080_17>">
http://dati.camera.it/ocd/deputato.rdf/d15080_17
</a>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
target="_blank" title="<http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010>">
http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010
</a>
</div>
</div>
</div>
You can find the data-labels in one pass and the div content in another. Then, the results can be zipped together to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
d = soup(content, 'html.parser').find('div', {'id':'directs'})
_labels = [i.a['data-label'] for i in d.find_all('label')]
_content = [i.text for i in d.find_all('div', {'class':re.compile('c2 value\s*')})]
result = dict(zip(_labels, _content))
Output:
{'label': '\n\n\nxsd:string \n intervento di Fabrizio CICCHITTO\n \n\n',
'Title': '\n\n\n intervento di Fabrizio CICCHITTO\n \n\n',
'': '\n\n\nxsd:dateTime \n 2016-07-05T12:26:02Z\n \n\n',
'type': '\n\n\n ocd:intervento\n \n\n',
'rierimento a deputato': '\n\n\n http://dati.camera.it/ocd/deputato.rdf/d15080_17\n \n\n',
'Relation': '\n\n\n http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010\n \n\n'}