find(): how to extract attribute="value" in Python

I want to extract the attribute value "705-419-1151" from this element:
<a href="javascript:void(0)" class="mlr__item__cta jsMlrMenu" title="Get the Phone Number" data-phone="705-419-1151">
from bs4 import BeautifulSoup

url = 'https://www.yellowpages.ca/search/si/2/hvac+services/Ontario+ON'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('div', class_='listing__content__wrapper')

for item in articles:
    tel = item.find('li', {'data-phone': 'attr(data-phone)'}).get()
    print(tel)
How can I do this?

While processing the data, select your elements more specifically and always check that an element is available before calling methods on it:
e.get('data-phone') if(e := item.select_one('[data-phone]')) else None
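Note that the := walrus operator requires Python 3.8+; on older versions the same guarded lookup can be written in two steps (a minimal sketch using the same selector):
# equivalent without the walrus operator (works on Python < 3.8)
e = item.select_one('[data-phone]')
tel = e.get('data-phone') if e else None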
Example
This example stores the results in a list of dicts, so you can easily create a DataFrame and save it to a specific format.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.yellowpages.ca/search/si/2/hvac+services/Ontario+ON'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('div', class_='listing__content__wrapper')

data = []
for item in articles:
    com = e.get_text(strip=True, separator='\n') if (e := item.select_one('[itemprop="name"]')) else None
    add = e.text.strip() if (e := item.select_one('[itemprop="address"]')) else None
    tel = e.get('data-phone') if (e := item.select_one('[data-phone]')) else None

    data.append({
        'com': com,
        'add': add,
        'tel': tel
    })

# create a csv file with the results
pd.DataFrame(data).to_csv('filename.csv', index=False)
Output of data
[{'com': '1\nCity Experts',
'add': '17 Raffia Ave, Richmond Hill, ON L4E 4M9',
'tel': '416-858-3051'},
{'com': '2\nAssociateair Mechanical Systems Ltd',
'add': '40-81 Auriga Dr, Nepean, ON K2E 7Y5',
'tel': '343-700-1174'},
{'com': '3\nAffordable Comfort Heating & Cooling',
'add': '54 Cedar Pointe Dr, Unit 1207 Suite 022, Barrie, ON L4N 5R7',
'tel': '705-300-9536'},
{'com': '4\nHenderson Metal Fabricating Co Ltd',
'add': '76 Industrial Park Cres, Sault Ste Marie, ON P6B 5P2',
'tel': '705-910-5895'},...]

Related

Amazon scraper no data display

import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
headers = ({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})
search_query = 'home office'.replace(' ', '+')
base_url = 'https://www.amazon.com/s?k={0}'.format(search_query)

items = []
for i in range(1, 11):
    print('Processing {0}...'.format(base_url + '&page={0}'.format(i)))
    response = requests.get(base_url + '&page={0}'.format(i), headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    results = soup.find_all('div', {'class': 's-result-item', 'data-component-type': 's-search-result'})
I don't know why, but each time I run the code, it only appends the strings together and gives me the links to the pages. It doesn't scrape any data from the page at all. I attached a screenshot of my screen as well.
The main issue is that you never append / return / print your ResultSet:
...
for i in range(1, 11):
    print('Processing {0}...'.format(base_url + '&page={0}'.format(i)))
    response = requests.get(base_url + '&page={0}'.format(i), headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    items.extend(soup.find_all('div', {'class': 's-result-item', 'data-component-type': 's-search-result'}))

print(items)
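Optionally, since sleep is imported but never used, you could pause between page requests to reduce load on the server (an optional addition, not part of the original fix):
from time import sleep
...
    items.extend(soup.find_all('div', {'class': 's-result-item', 'data-component-type': 's-search-result'}))
    sleep(1)  # wait a second before requesting the next page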
Example
This will iterate the ResultSet and store each item as a dict with specific information in your list:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
headers = ({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})
search_query = 'home office'.replace(' ', '+')
base_url = 'https://www.amazon.com/s?k={0}'.format(search_query)

items = []
for i in range(1, 2):
    print('Processing {0}...'.format(base_url + '&page={0}'.format(i)))
    response = requests.get(base_url + '&page={0}'.format(i), headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    for item in soup.find_all('div', {'class': 's-result-item', 'data-component-type': 's-search-result'}):
        items.append({
            'title': item.h2.text,
            'url': item.a.get('href')
        })
items
Output
[{'title': 'Raven Pro Document Scanner - Huge Touchscreen, High Speed Color Duplex Feeder (ADF), Wireless Scan to Cloud, WiFi, Ethernet, USB, Home or Office Desktop ',
'url': '/sspa/click?ie=UTF8&spc=MTo4NzYzMDkwMjIzMjg0MTI3OjE2NjAzOTk5ODA6c3BfYXRmOjIwMDAyMzU4Mzg0OTg2MTo6MDo6&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY&url=%2FRaven-Pro-Document-Scanner-Touchscreen%2Fdp%2FB07MFRJWY6%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Dhome%2Boffice%26qid%3D1660399980%26sr%3D8-1-spons%26psc%3D1'},
{'title': 'Home Office Desk Chair, Ergonomic Mesh Executive Office Chair with 3 Position Tilt Function, Comfortable High Back Black Computer Chair with 3D Adjustable Armrest & Lumbar Support, FANMEN ',
'url': '/Ergonomic-Executive-Comfortable-Adjustable-FANMEN/dp/B09KRKX9FT/ref=sr_1_2?keywords=home+office&qid=1660399980&sr=8-2'},
{'title': 'bonsaii Paper Shredder for Home Use,6-Sheet Crosscut Paper and Credit Card Shredder for Home Office,Home Shredder with Handle for Document,Mail,Staple,Clip-3.4 Gal Wastebasket(C237-B) ',
'url': '/bonsaii-Paper-Shredder-6-Sheet-Crosscut-Paper-Design-Home-Shredder-Clip-3-4-Gal-Wastebasket-C237-B/dp/B0834J2SVR/ref=sr_1_3?keywords=home+office&qid=1660399980&sr=8-3'},...]

BeautifulSoup Amazon Product Detail

I can't scrape the HTML of the "Product Details" section (scroll down the webpage and you'll find it) using requests or requests_html.
find_all returns a 0-size object... Any help?
from bs4 import BeautifulSoup
from requests import session
from requests_html import HTMLSession

s = HTMLSession()
# s = session()
r = s.get("https://www.amazon.com/dp/B094HWN66Y")
soup = BeautifulSoup(r.text, 'html.parser')
len(soup.find_all("div", {"id": "detailBulletsWrapper_feature_div"}))
Product details with different information:
Code:
from bs4 import BeautifulSoup
import requests
cookies = {'session': '131-1062572-6801905'}
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
r = requests.get("https://www.amazon.com/dp/B094HWN66Y",headers=headers,cookies=cookies)
print(r)
soup = BeautifulSoup(r.text, 'lxml')
key = [x.get_text(strip=True).replace('\u200f\n','').replace('\u200e','').replace(':\n','').replace('\n', '').strip() for x in soup.select('ul.a-unordered-list.a-nostyle.a-vertical.a-spacing-none.detail-bullet-list > li > span > span.a-text-bold')][:13]
#print(key)
value = [x.get_text(strip=True) for x in soup.select('ul.a-unordered-list.a-nostyle.a-vertical.a-spacing-none.detail-bullet-list > li > span > span:nth-child(2)')]
#print(value)
product_details = {k: v for k, v in zip(key, value)}
print(product_details)
Output:
{'ASIN': 'B094HWN66Y', 'Publisher': 'Boldwood Books (September 7, 2021)', 'Publication date':
'September 7, 2021', 'Language': 'English', 'File size': '1883 KB', 'Text-to-Speech': 'Enabled', 'Screen Reader': 'Supported', 'Enhanced typesetting': 'Enabled', 'X-Ray': 'Enabled', 'Word
Wise': 'Enabled', 'Print length': '332 pages', 'Page numbers source ISBN': '1800487622', 'Lending': 'Not Enabled'}
This is an example of how to scrape the title of the product using bs4 and requests, easily expandable to getting other info from the product.
The reason yours doesn't work is that your request has no headers, so Amazon realises you're a bot and doesn't want you scraping their site. This is shown by your request being returned as <Response [503]> and explained in r.text.
I believe Amazon has an API for this (that they'd probably like you to use), but it'll be fine to scrape like this for small-scale stuff.
import requests
import bs4

# Amazon don't like you scraping them, however these headers should stop them from noticing a small number of requests
HEADERS = ({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)AppleWebKit/537.36 (KHTML, like Gecko)Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})

def main():
    url = "https://www.amazon.com/dp/B094HWN66Y"
    title = get_title(url)
    print("The title of %s is: %s" % (url, title))

def get_title(url: str) -> str:
    """Returns the title of the amazon product."""
    # The request
    r = requests.get(url, headers=HEADERS)
    # Parse the content
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    title = soup.find("span", attrs={"id": 'productTitle'}).string
    return title

if __name__ == "__main__":
    main()
Output:
The title of https://www.amazon.com/dp/B094HWN66Y is: Will They, Won't They?

Tablescraping from a website with ID using beautifulsoup

I'm having a problem with scraping the table of this website. I should be getting the heading, but instead I am getting
AttributeError: 'NoneType' object has no attribute 'tbody'
I'm a bit new to web scraping, so if you could help me out that would be great.
import requests
from bs4 import BeautifulSoup

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
page = s.get(URL)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)
What happens?
Note: Always look at your soup first - therein lies the truth. The content can always be slightly to extremely different from the view in the dev tools.
Access Revoked
Your IP address has been blocked. We detected irregular, bot-like usage of our Property Search originating from your IP address. This block was instated to reduce stress on our webserver, to ensure that we're providing optimal site performance to the taxpayers of Collin County. We have not blocked your ability to download our data exports, which you can still use to acquire bulk property data.
How to fix?
Add a user-agent to your request so that it looks like you're requesting with a "browser".
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
page = s.get(URL,headers=headers)
Or, as an alternative, just download the data exports.
Example (scraping table)
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

data = []
for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ', strip=True) for c in row.select('td')])

pd.DataFrame(data[1:], columns=data[0])
Output
   Property ID ↓ Geographic ID ↓ | Owner Name | Property Address | Legal Description | 2021 Market Value
1  2709013 R-10644-00H-0010-1 | PARTHASARATHY SURESH & ANITHA HARIKRISHNAN | 12209 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 | $513,019
2  2709018 R-10644-00H-0020-1 | JOSHI PRASHANT & SHWETA PANT | 12235 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 2 | $546,254
3  2709019 R-10644-00H-0030-1 | THALLAPUREDDY RAVENDRA & UMA MAHESWARI VEMULA | 12261 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 3 | $550,768
4  2709020 R-10644-00H-0040-1 | KULKARNI BHEEMSEN T & GOURI R | 12287 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 4 | $509,593
5  2709021 R-10644-00H-0050-1 | BALAM GANESH & SHANTHIREKHA LOKULA | 12313 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 5 | $553,949
...
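As a side note, pandas can often parse a plain <table> directly with read_html; a minimal sketch, assuming the table keeps the id propertysearchresults and reusing the same headers and URL as above:
import requests
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

page = requests.get(URL, headers=headers)
# read_html returns a list of DataFrames, one per matching table
df = pd.read_html(page.text, attrs={'id': 'propertysearchresults'})[0]
print(df.head())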
import requests
from bs4 import BeautifulSoup
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
s = requests.Session()
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
page = s.get(URL,headers=headers)
soup = BeautifulSoup(page.content, "lxml")
Finding Table Data:
column_data = soup.find("table").find_all("tr")[0]
column = [i.get_text() for i in column_data.find_all("td") if i.get_text() != ""]

row = soup.find("table").find_all("tr")[1:]
main_lst = []
for row_details in row:
    lst = []
    for i in row_details.find_all("td")[1:]:
        if i.get_text() != "":
            lst.append(i.get_text())
    main_lst.append(lst)
Converting to pandas DataFrame:
import pandas as pd
df=pd.DataFrame(main_lst,columns=column)
Output:
Property ID↓ Geographic ID ↓ Owner Name Property Address Legal Description 2021 Market Value
0 2709013R-10644-00H-0010-1 PARTHASARATHY SURESH & ANITHA HARIKRISHNAN 12209 Willowgate DrFrisco, TX 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 $513,019
.....
If you look at page.content, you will see that "Your IP address has been blocked".
You should add some headers to your request because the website is blocking your request. In your specific case, it will be enough to add a User-Agent:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)
If you add headers, you will still get an error, but in this row:
headings.append(td.b.text.replace('\n', ' ').strip())
You should change it to
headings.append(td.text.replace('\n', ' ').strip())
because a td doesn't always contain a b tag.
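If you still want to keep the bold text where it exists, a defensive variant could look like this (a sketch, not part of the original answer):
headings = []
for td in table_data[0].find_all("td"):
    cell = td.b if td.b else td  # fall back to the <td> itself when there is no <b>
    headings.append(cell.text.replace('\n', ' ').strip())
print(headings)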

Webscraping information could be in 1 of 3 places on a page, not sure how to use an if statement to eliminate the nonetype results

The following code is scraping data from an Amazon product page. I have since discovered that the data for the price could be in one of three places, depending on the type of product and how the price has been added to the page; the other two CSS selectors are then not present.
for the website url = f'https://www.amazon.co.uk/dp/B083PHB6XX' I am using price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
However, for the website url = f'https://www.amazon.co.uk/dp/B089SQHDMR' I need to use price = soup.find('span', {'id':'priceblock_pospromoprice'}).text.strip()
Finally, for the website url = f'https://www.amazon.co.uk/dp/B0813QVVSW' I need to use price = soup.find('span', {'class':'a-size-base a-color-price'}).text.strip()
I think a solution to handle this is, for each URL, to try to find priceblock_ourprice, and if that fails, try to find priceblock_pospromoprice, and if that fails, go on to find a-size-base a-color-price. I am not able to understand how to put this into an if statement which will not stop when an element is not found.
My original code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"}
urls = [f'https://www.amazon.co.uk/dp/B083PHB6XX', f'https://www.amazon.co.uk/dp/B089SQHDMR', f'https://www.amazon.co.uk/dp/B0813QVVSW']

for url in urls:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find('span', {'id': 'productTitle'}).text.strip()
    price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
    print(name)
    print(price)
You were already on the right track with your thoughts. First check for the existence of the element in the condition, and then set the value for the price.
if soup.find('span', {'id': 'priceblock_ourprice'}):
    price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
elif soup.find('span', {'id': 'priceblock_pospromoprice'}):
    price = soup.find('span', {'id': 'priceblock_pospromoprice'}).text.strip()
elif soup.find('span', {'class': 'a-size-base a-color-price'}):
    price = soup.find('span', {'class': 'a-size-base a-color-price'}).text.strip()
else:
    price = ''
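The same priority order can also be written more compactly by chaining the find calls with or and guarding the final access (a variant sketch, not from the original answer):
e = (soup.find('span', {'id': 'priceblock_ourprice'})
     or soup.find('span', {'id': 'priceblock_pospromoprice'})
     or soup.find('span', {'class': 'a-size-base a-color-price'}))
price = e.text.strip() if e else ''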
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"}
urls = [f'https://www.amazon.co.uk/dp/B083PHB6XX', f'https://www.amazon.co.uk/dp/B089SQHDMR', f'https://www.amazon.co.uk/dp/B0813QVVSW']

for url in urls:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')

    name = soup.find('span', {'id': 'productTitle'}).text.strip()

    if soup.find('span', {'id': 'priceblock_ourprice'}):
        price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
    elif soup.find('span', {'id': 'priceblock_pospromoprice'}):
        price = soup.find('span', {'id': 'priceblock_pospromoprice'}).text.strip()
    elif soup.find('span', {'class': 'a-size-base a-color-price'}):
        price = soup.find('span', {'class': 'a-size-base a-color-price'}).text.strip()
    else:
        price = ''

    print(name)
    print(price)
Output
Boti 36475 Shark Hand Puppet Baby Shark with Sound Function and Speed Control, Approx. 27 x 21 x 14 cm, Soft Polyester, Battery Operated
£19.50
LEGO 71376 Super Mario Thwomp Drop Expansion Set
£34.99
LEGO 75286 Star Wars General Grievous’s Starfighter Set
£86.99

not able to create a dictionary from a one-element list

I am new to Python programming and web scraping. I am able to get the relevant information from the website, but it generates only one element in the list containing all the needed information. The problem is that I cannot delete the unwanted things from this one-element list, and I am not sure it is even possible to do so from a single-element list. Is there any way to create a Python dictionary as in the example below?
{Kabul: River Kabul, Tirana: River Tirane, etc}
Any help will be really appreciated. Thanks in advance.
from bs4 import BeautifulSoup
import urllib.request
url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read()
soup = BeautifulSoup(html, "html.parser")
attr = {"class":"sites-layout-tile sites-tile-name-content-1"}
rivers = soup.find_all(["table", "tr", "td","div","div","div"], attrs=attr)
data = [div.text for div in rivers]
print(data[0])
Code:
from bs4 import BeautifulSoup
import urllib.request

url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}

req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read()
soup = BeautifulSoup(html, "html.parser")

rivers = soup.select_one("td.sites-layout-tile.sites-tile-name-content-1")
data = [
    div.text.split('-')[1:]
    for div in rivers.find_all('div', style='font-size:small')
    if div.text.strip()
][4:-4]
data = {k.strip(): v.strip() for k, v in data}
print(data)
Steps:
Select the container tag ('td.sites-layout-tile.sites-tile-name-content-1').
Find all <div style='font-size:small'> children tags, select the text and split by '-'.
Create a dictionary from the items in data.
If you can figure out a better way to pull your data from the webpage you might want to, but assuming you don't, this will get you a usable and modifiable dictionary:
web_ele = ['COUNTRY - CAPITAL CITY - RIVER A Afghanistan - Kabul - River Kabul. Albania - Tirana - River Tirane. Andorra - Andorra La Vella - The Gran Valira. Argentina - Buenos Aries - River Plate. ']

web_ele[0] = web_ele[0].replace('COUNTRY - CAPITAL CITY - RIVER A ', '')
rows = web_ele[0].split('.')

data_dict = {}
for row in rows:
    data = row.split(' - ')
    if len(data) == 3:
        data_dict[data[0].strip()] = {
            'Capital': data[1].strip(),
            'River': data[2].strip(),
        }

print(data_dict)
# output: {'Afghanistan': {'Capital': 'Kabul', 'River': 'River Kabul'}, 'Albania': {'Capital': 'Tirana', 'River': 'River Tirane'}, 'Andorra': {'Capital': 'Andorra La Vella', 'River': 'The Gran Valira'}, 'Argentina': {'Capital': 'Buenos Aries', 'River': 'River Plate'}}
You'll probably have to account for the various 'A', 'B', 'C' ... elements that seem to be part of your string. The header shouldn't pop back up more than the one time it did, but if it does, you should be able to parse it out.
Again, I would probably suggest finding a cleaner way to pull your data but this will get you something to work with.
Another way you can get required result (dictionary with city: river pairs) is to use requests and lxml as below:
import requests
from lxml import html

url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}

req = requests.get(url, headers=headers)
source = html.fromstring(req.content)

xpath = '//b[.="COUNTRY - CAPITAL CITY - RIVER"]/following::div[b and following-sibling::hr]'
rivers = [item.text_content().strip() for item in source.xpath(xpath) if item.text_content().strip()]

rivers_dict = {}
for river in rivers:
    rivers_dict[river.split("-")[1].strip()] = river.split("-")[2].strip()

print(rivers_dict)
Output:
{'Asuncion': 'River Paraguay.', 'La Paz': 'River Choqueapu.', 'Kinshasa': 'River Congo.', ...}
...147 items total
