Trying to loop through profile lists using Selenium - python

I'm trying to loop through all profiles and store each person's name, job title and location in a list. I'm on the LinkedIn people-search results page.
Here is the li HTML element that I'll have to loop over:
<li class="reusable-search__result-container ">
<div class="entity-result ">
<div class="entity-result__item">
<div class="entity-result__image">
<div class="display-flex align-items-center">
<a class="app-aware-link" aria-hidden="true" href="https://www.linkedin.com/search/results/people/headless?geoUrn=%5B103644278%5D&origin=FACETED_SEARCH&keywords=python%20developer">
<div id="ember522" class="ivm-image-view-model ember-view"> <div class="
ivm-view-attr__img-wrapper ivm-view-attr__img-wrapper--use-img-tag display-flex
">
<div class="EntityPhoto-circle-3-ghost-person ivm-view-attr__ghost-entity ">
<!----> </div>
</div>
</div>
</a>
</div>
</div>
<div class="entity-result__content entity-result__divider pt3 pb3 t-12 t-black--light">
<div class="mb1">
<div class="linked-area flex-1 cursor-pointer">
<div class="t-roman t-sans">
<span class="entity-result__title">
<div class="display-flex">
<span class="entity-result__title-line flex-shrink-1 entity-result__title-text--black ">
<span class="entity-result__title-text t-16">
<a class="app-aware-link" href="https://www.linkedin.com/search/results/people/headless?geoUrn=%5B103644278%5D&origin=FACETED_SEARCH&keywords=python%20developer">
<!---->LinkedIn Member<!---->
</a>
<!----> </span>
</span>
<!----></div>
</span>
</div>
<div>
<div class="entity-result__primary-subtitle t-14 t-black">
<!---->Software Developer<!---->
</div>
<div class="entity-result__secondary-subtitle t-14">
<!---->United States<!---->
</div>
</div>
</div>
</div>
<div class="linked-area flex-1 cursor-pointer">
<p class="entity-result__summary entity-result__summary--2-lines t-12 t-black--light ">
<!---->Current: Full Stack Software<span class="white-space-pre"> </span><strong><!---->Developer<!----></strong><span class="white-space-pre"> </span>at GE Healthcare<!---->
</p>
</div>
<!----> </div>
<div class="entity-result__actions entity-result__divider entity-result__actions--empty">
<!----> <!---->
</div>
</div>
</div>
</li>
Currently, I'm able to get the profile names using this code:
profile_names = []
linkedin_members = browser.find_elements_by_xpath('//span[@class="entity-result__title"]')
for linkedin_member in linkedin_members:
    name = linkedin_member.find_element_by_xpath('.//a[@class="app-aware-link"]').get_attribute('text').strip()
    profile_names.append(name)
But I'm unable to get the job locations and job profiles. Can anyone guide me on the code for that?
I tried something like this but it threw an error:
profile_names = []
job_profiles = []
linkedin_members = browser.find_elements_by_xpath('//div[@class="linked-area flex-1 cursor-pointer"]')
for linkedin_member in linkedin_members:
    name = linkedin_member.find_element_by_xpath('.//a[@class="app-aware-link"]').get_attribute('text').strip()
    job_profile = linkedin_member.find_element_by_xpath('.//div[@class="entity-result__primary-subtitle"]').text
    profile_names.append(name)
    job_profiles.append(job_profiles)

Another way to do this is:
members_search_results_xpath = '//div[@class="entity-result__item"]'
member_name_xpath = '//span[contains(@class,"entity-result__title-text")]//span[@dir]'
member_location_xpath = '//div[contains(@class,"entity-result__secondary-subtitle")]'
member_job_title_xpath = '//div[contains(@class,"entity-result__primary-subtitle")]'
profile_names = []
profile_addresses = []
profile_job_titles = []
linkedin_members = browser.find_elements_by_xpath(members_search_results_xpath)
for linkedin_member in linkedin_members:
    name = linkedin_member.find_element_by_xpath('.' + member_name_xpath).text.strip()
    profile_names.append(name)
    address = linkedin_member.find_element_by_xpath('.' + member_location_xpath).text.strip()
    profile_addresses.append(address)
    job_title = linkedin_member.find_element_by_xpath('.' + member_job_title_xpath).text.strip()
    profile_job_titles.append(job_title)
Here I pulled the locators out into variables instead of hardcoding them where they are used.
It's considered good practice not to hardcode locators inside the methods that use them.

You just have to identify those elements (and I think you can do so using the class with a CSS selector), then loop through the elements and append the text to the appropriate list.
profile_names = []
linkedin_members = browser.find_elements_by_xpath('//span[@class="entity-result__title"]')
for linkedin_member in linkedin_members:
    name = linkedin_member.find_element_by_xpath('.//a[@class="app-aware-link"]').get_attribute('text').strip()
    profile_names.append(name)

user_positions = []
positions = browser.find_elements_by_css_selector('div.entity-result__primary-subtitle')
for position in positions:
    user_positions.append(position.text.strip())

user_locations = []
locations = browser.find_elements_by_css_selector('div.entity-result__secondary-subtitle')
for location in locations:
    user_locations.append(location.text.strip())
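If you want the name, job title and location from each result, one option is to grab each result container once and read all three fields inside it, so the lists stay aligned even when a card is missing a subtitle. Here is a minimal sketch using the same pre-Selenium-4 find_element* API as the snippets above; it assumes browser is already on the search results page, and text_or_blank is just an illustrative helper name:
from selenium.common.exceptions import NoSuchElementException

def text_or_blank(scope, xpath):
    # Return the stripped text of the first match inside scope, or '' if it is absent.
    try:
        return scope.find_element_by_xpath(xpath).text.strip()
    except NoSuchElementException:
        return ''

results = []
containers = browser.find_elements_by_xpath('//li[contains(@class,"reusable-search__result-container")]')
for container in containers:
    results.append({
        'name': text_or_blank(container, './/span[@class="entity-result__title"]//a'),
        'job_title': text_or_blank(container, './/div[contains(@class,"entity-result__primary-subtitle")]'),
        'location': text_or_blank(container, './/div[contains(@class,"entity-result__secondary-subtitle")]'),
    })
Each entry in results is a dict, so a missing field simply comes through as an empty string instead of shifting the other lists out of step.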


Adding multiple loop outputs to single dictionary

I'm learning how to use Python and trying to use Beautiful Soup to do some web scraping. I want to pull the product name and product number from the saved page I'm referencing in my Python code; I've provided a snippet of the section this script is looking at. The names are located under a div with class name and the numbers under a span with the id product_id.
Essentially, my Python script does pull in all the product names, but once it gets to the product_id loop, it overwrites the initial values from my first loop. Looking to see if anyone can point me in the right direction.
OUTPUT
After first loop
{'name': 'ADA Hi-Lo Power Plinth Table'}
{'name': 'Adjustable Headrest Couch - Chrome-Plated Steel Legs'}
{'name': 'Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)'}
After second loop
{'name': 'Weekender Folding Cot', 'product_ID': '55984'}
{'name': 'Weekender Folding Cot', 'product_ID': '31350'}
{'name': 'Weekender Folding Cot', 'product_ID': '31351'}
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="1496" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="ADA-Hi-Lo-Power-Plinth-Table_p_1496.html">
<img alt="ADA Hi-Lo Power Plinth Table" class="img-responsive" src="assets/images/thumbnails/55984_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="ADA-Hi-Lo-Power-Plinth-Table_p_1496.html">
ADA Hi-Lo Power Plinth Table
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
55984
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$2,849.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=1496&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="2878" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs_p_2878.html">
<img alt="Adjustable Headrest Couch - Chrome-Plated Steel Legs" class="img-responsive" src="assets/images/thumbnails/31350_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs_p_2878.html">
Adjustable Headrest Couch - Chrome-Plated Steel Legs
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
31350
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$729.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=2878&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="2879" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs-X-Large_p_2879.html">
<img alt="Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)" class="img-responsive" src="assets/images/thumbnails/31350_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs-X-Large_p_2879.html">
Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
31351
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$769.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=2879&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
BEGINNING OF PYTHON SCRIPT
import requests
from bs4 import BeautifulSoup

with open('recoveryCouches', 'r') as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, 'lxml')
allProductDivs = soup.find('div', class_='product-items product-items-4')

# get names of products on page
nameDiv = soup.find_all('div', class_='name')
prodID = soup.find_all('span', id='product_id')

records = []
d = dict()
for name in nameDiv:
    d['name'] = name.find('a').text
    records.append(d)
    print(d)
for productId in prodID:
    d['product_ID'] = productId.text
    records.append(d)
    print(d)
Try this:
nameDiv = soup.find_all('div',class_='name')
prodID = soup.find_all('span', id='product_id')
records=[]
for i in range(len(nameDiv)):
    records.append({
        "name": nameDiv[i].find('a').text.strip(),
        "product_ID": prodID[i].text.strip()
    })
To write the data to a CSV file:
import csv
with open("file.csv", 'w') as csv_file:
writer = csv.DictWriter(csv_file, fieldnames=records[0].keys())
writer.writeheader()
for record in records:
writer.writerow(record)
If I understand the question correctly, you're trying to get all the names and product IDs and store them. The problem you're running into is that your dictionary values are getting overwritten.
One solution to that problem would be to initialize your python dictionary values as lists, like so:
d = {
    'name': [],
    'product_ID': []
}
Then in each of the loops, you can append the new value to that list. What you currently have will overwrite the previous value.
for name in nameDiv:
    d['name'].append(name.find('a').text)

for productId in prodID:
    d['product_ID'].append(productId.text)
This will result in a list of all names and product_IDs stored in that dictionary.
If you want to put these lists together in a format like this:
[(name0, productId0), (name1, productId1), ...]
Then you can make use of zip, which combines your lists element by element as long as they are the same length. For example:
zipped_results = list(zip(d['name'], d['product_ID']))
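If you then want the same list-of-dicts shape that the CSV example above expects, a small sketch (assuming d has been filled by the two loops above):
# Turn the two parallel lists into one record per product.
records = [
    {"name": name, "product_ID": product_id}
    for name, product_id in zip(d['name'], d['product_ID'])
]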

Scrape everything between two un-nested tags

Is it possible to scrape everything between two un-nested tags?
For instance:
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
So I would like to scrape just what is located under Title 1, up until Title 2. Is this possible using bs4?
Right now I have something like this (the problem is it scrapes everything, since the classes are all the same):
for i in soup.findAll("div", {"class": "div"}):
    print(i.span.text)
Now I get:
span1
span2
span3
span4
I'd like to get:
span1
span2
I don't know if this is the best solution to this problem, but you can split your text and scrape only the part that you need.
text = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
sub_text = text.split(soup.find('h3', text="Title 2").string)[0]
This will give:
'"\n<h3>Title 1</h3>\n<div class="div">\n <span class="span">span1</span>\n <label class="label">label1</label>\n</div>\n<div class="div">\n <span class="span">span2</span>\n</div>\n<h3>'
After converting that string into a bs4 object, you can scrape all you need:
scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.findAll("div", class_="div"):
    print(i.span.text)
# -> span1 span2
One approach is to find the second class="span" element and navigate backwards with find_all_previous("div"). Those tags come back in reverse document order, so wrap them in reversed() before reading each <span> tag's text.
from bs4 import BeautifulSoup
html = """
<h3>Title 1<h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2<h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for tag in reversed(
    soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
    print(tag.find("span").text)
Output:
span1
span2
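Another common pattern, sketched here assuming the headings are properly closed with </h3> (as in the text variable from the first answer, since lxml may mangle the unclosed <h3> tags), is to walk the siblings of the first heading and stop at the next one:
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "lxml")  # text is the well-formed HTML string above

start = soup.find("h3", string="Title 1")
for sibling in start.find_next_siblings():
    if sibling.name == "h3":  # reached "Title 2", stop here
        break
    if sibling.name == "div" and sibling.span:
        print(sibling.span.text)
# -> span1 span2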

BeautifulSoup find_all method returns the same elements

Hi, here is my soup object:
<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>
How can I get all the c-codes and their corresponding text from the object?
For example: c-code "c5ff5b1d0dc93c" and its corresponding text "Herren" for the first row...
My code looks like this (categories is the soup object):
for category in categories.find_all('div'):
    category = categories.find('div')
    print(category)
I only receive the information from the first row, repeated:
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
<div data-navi-cat="c5ff5b1d0dc93c">Herren</div>
What happens: categories holds your HTML, and in your loop you do category = categories.find('div'). find('div') always returns the first occurrence, so category will always be <div data-navi-cat="c5ff5b1d0dc93c">Herren</div>.
Instead, you should do category = element.get_text() to get the text and code = element.get('data-navi-cat') to get the code.
Example
from bs4 import BeautifulSoup
html = '''<td class="kategorie">
<div data-navi-cat="c5ff5b1d0dc93c">
Herren
</div>
<div data-navi-cat="c5ff5b1d0dc95f">
Frauen
</div>
<div data-navi-cat="c5ff5b1d0dc978">
A-Jugend (U19)
</div>
<div data-navi-cat="c5ff5b1d0dc98c">
B-Jugend (U17)
</div>
<div data-navi-cat="c5ff5b1d0dc9a2">
C-Jugend (U15)
</div>
<div data-navi-cat="c5ff5b1d0dc9b1">
U17-Juniorinnen
</div>
<div data-navi-cat="c5ff5b1d0dc9b6">
Futsal
</div>
<div data-navi-cat="c5ff5b1d0dc9bd">
eSport
</div>
</td>'''
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all('div'):
    category = element.get_text()
    code = element.get('data-navi-cat')
    print(category, code)
Output
Herren
c5ff5b1d0dc93c
Frauen
c5ff5b1d0dc95f
A-Jugend (U19)
c5ff5b1d0dc978
B-Jugend (U17)
c5ff5b1d0dc98c
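If you would rather collect the codes and labels into a single mapping than print them, a small sketch using the same soup object (the attrs={'data-navi-cat': True} filter keeps only the divs that actually carry a code):
# Map each data-navi-cat code to its stripped label text.
categories_by_code = {
    div['data-navi-cat']: div.get_text(strip=True)
    for div in soup.find_all('div', attrs={'data-navi-cat': True})
}
# {'c5ff5b1d0dc93c': 'Herren', 'c5ff5b1d0dc95f': 'Frauen', ...}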

Selenium Python Link Text-var to Value-var in subchild under different conditions

I have the following code:
<div class="_97KWR">
<div class="_12x3s">
<div class="_1PF0y">
<div class="_1xNd0 CTwrR _3ASIK">
<a class="_3bf_k _1WIu4" href="/en/gilgamesh/hola-dola/gof-turkey-championship-2019-555/live/gazorpazorp/log-arena-fams-5551565"><span>Franco Pescatore</span></a>
<div class="_2cZpo">
<svg width="32px" height="32px" xmlns="http://www.w3.org/2000/svg">
<use xlink:href="dist/icons.svg#hula-hula"></use>
</svg>
</div>
<div class="_1sT8o _2ALVH">
<div class="_2Li3l _1fAz9"><img src="/dist/disabled-gg.png" alt="Disabled gg"></div>
</div>
</div>
<div class="_2b9x1">
<div>
<svg width="18px" height="14px" xmlns="http://www.w3.org/2000/svg">
<use xlink:href="dist/icons.svg#vs"></use>
</svg>
</div>
</div>
<div class="_1xNd0 CTwrR _3ASIK">
<a class="_3bf_k _1WIu4" href="/en/gilgamesh/hola-dola/gof-turkey-championship-2019-555/live/gazorpazorp/log-arena-fams-5551565"><span>Giorgio Pescato</span></a><img class="_25rWl _2cZpo" alt="Giorgio Pescato" src="https://ultramedia.com/Media/GiorgioPescato_f6f84978-6da4-4e14-a36b-860dce530d08.png">
<div class="_1sT8o _2ALVH">
<div class="_2Li3l _1fAz9"><img src="/dist/disabled-gg.png" alt="Disabled gg"></div>
</div>
</div>
</div>
<div class="_1xNd0 CTwrR">
<a class="_3bf_k _1WIu4" href="'/en/gilgamesh/hola-dola/swing/fun#4844844">
<span>BipolarFun</span></a><img class="_25rWl _2cZpo" alt="BipolarFun" src="https://promedia.com/Media//en/gilgamesh/hola-dola/swing/4844844.png">
<div class="_1sT8o _2ALVH">
<button class="_2Li3l _1fAz9"><span>1,30</span></button>
<div class="_1CeGR">
<div class="_1T5lR"></div>
</div>
</div>
</div>
And it continues dynamically in this way.
Since the class of the DIV is dynamic, I 'fixed' that by using the XPath
//a[starts-with(@href,'/en/gilgamesh/hola-dola/')]
With this I can access the name, like Franco Pescatore or Giorgio Pescato,
and with the XPath
//button[starts-with(@class,'_')]
I can access the value, but not when it's "Disabled".
Also, I cannot query the first ones and then the others, because they don't return the same number of elements, so I'm having trouble linking each text element with its value element, which is in a subchild.
At the end the result should look like this:
n[0] = 'BipolarFun'
a[0] = 1,30
n[1] = 'Franco Pescatore'
a[1] = 0 #disabled
n[2] = 'Giorgio Pescato'
a[2] = 0 #disabled
Please help, I'm stuck :|
Code you can try:
# get all labels
labels = driver.find_elements_by_xpath("//a[contains(@href,'/en/gilgamesh/hola-dola/')]")
for label in labels:
    # get buttons inside the label's parent DIV
    buttons = label.find_elements_by_xpath("./ancestor::div[1]//button")
    # if there is at least one button, take its text, otherwise use 0
    button_value = buttons[0].text if len(buttons) > 0 else 0
    print(f"Label: {label.text}, value: {button_value}")

Find and retrieve content from html text using BeautifulSoup

I have the following HTML code (or at least I think it's HTML) that I am working on with BeautifulSoup in Python.
I have parsed the HTML with BeautifulSoup correctly. What I would like to do next is retrieve the content associated with a certain data-label (for example, in the bottom part of the code, data-label="Relation"). In particular, I would like to obtain a dictionary whose key is the text of the data-label, i.e. in my example "Relation", and whose value is the content of the corresponding 'div', i.e. in my example the href "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
I have tried several approaches, but data-label, as far as I know, does not appear to be a valid attribute, so I am not sure how to handle this.
(Note that this is just an example; I will have to do the same for thousands, if not millions, of webpages with a similar structure.)
Any help is appreciated. Thank you!
<div id="directs">
<label class="c1"><a data-comment="A human-readable name for the subject." data-label="label" href="http://www.w3.org/2000/01/rdf-schema#label">
rdfs:<span>label</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:string</span>
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="" data-label="" href="http://lod.xdams.org/ontologies/ods/modified">
ods:<span>modified</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:dateTime</span>
2016-07-05T12:26:02Z
</div>
</div>
</div>
<label class="c1"><a data-comment="The subject is an instance of a class." data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
rdf:<span>type</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/intervento" title="<http://dati.camera.it/ocd/intervento>">
ocd:intervento
</a>
</div>
</div>
<label class="c1"><a data-comment="propriet generica utilizzata per puntare alla risorsa deputato in vari punti dell'ontologia" data-label="rierimento a deputato" href="http://dati.camera.it/ocd/rif_deputato">
ocd:<span>rif_deputato</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/deputato.rdf/d15080_17" title="<http://dati.camera.it/ocd/deputato.rdf/d15080_17>">
http://dati.camera.it/ocd/deputato.rdf/d15080_17
</a>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
target="_blank" title="<http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010>">
http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010
</a>
</div>
</div>
</div>
You can find the data-labels in one pass and the div content in another. Then, the results can be zipped together to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
d = soup(content, 'html.parser').find('div', {'id':'directs'})
_labels = [i.a['data-label'] for i in d.find_all('label')]
_content = [i.text for i in d.find_all('div', {'class': re.compile(r'c2 value\s*')})]
result = dict(zip(_labels, _content))
Output:
{'label': '\n\n\nxsd:string \n intervento di Fabrizio CICCHITTO\n \n\n',
'Title': '\n\n\n intervento di Fabrizio CICCHITTO\n \n\n',
'': '\n\n\nxsd:dateTime \n 2016-07-05T12:26:02Z\n \n\n',
'type': '\n\n\n ocd:intervento\n \n\n',
'rierimento a deputato': '\n\n\n http://dati.camera.it/ocd/deputato.rdf/d15080_17\n \n\n',
'Relation': '\n\n\n http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010\n \n\n'}
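Since the value you actually want for "Relation" is the href rather than the visible text, here is a small variant sketched under the same assumptions as the answer above (content holds the page HTML): for each c1 label it takes the next sibling value div and prefers an <a href> when one is present.
from bs4 import BeautifulSoup

d = BeautifulSoup(content, 'html.parser').find('div', {'id': 'directs'})

result = {}
for label in d.find_all('label', class_='c1'):
    key = label.a.get('data-label', '')
    value_div = label.find_next_sibling('div')  # the matching "c2 value" div
    link = value_div.find('a', href=True) if value_div else None
    # prefer the href when the value is a link, otherwise fall back to the text
    result[key] = link['href'] if link else value_div.get_text(" ", strip=True)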
