How to read a specific table from a given url? - python

I am new to Python and am trying to download countries' GDP per capita data. I am trying to read the data from this website: https://worldpopulationreview.com/countries/by-gdp
I tried to read the data but got a "no tables found" error.
I can see the data is in r.text, but somehow pandas cannot read that table.
How can I solve this and read the data?
MWE
import pandas as pd
import requests
url = "https://worldpopulationreview.com/countries/by-gdp"
r = requests.get(url)
raw_html = r.text # I can see the data is here, but pd.read_html says no tables found
df_list = pd.read_html(raw_html)
print(len(df_list))

Data is embedded via <script id="__NEXT_DATA__" type="application/json"> and is only rendered by the browser, so you have to adjust your script a bit:
pd.json_normalize(
    json.loads(
        BeautifulSoup(
            requests.get(url).text,
            'html.parser'
        ).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)
Example
import pandas as pd
import requests, json
from bs4 import BeautifulSoup

url = "https://worldpopulationreview.com/countries/by-gdp"

df = pd.json_normalize(
    json.loads(
        BeautifulSoup(
            requests.get(url).text,
            'html.parser'
        ).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)
df[['continent', 'country', 'pop', 'imfGDP', 'unGDP', 'gdpPerCapita']]
Output
         continent                   country          pop       imfGDP           unGDP  gdpPerCapita
0    North America             United States       338290  2.08938e+13  18624475000000       61762.9
1             Asia                     China  1.42589e+06  1.48626e+13  11218281029298       10423.4
..             ...                       ...          ...          ...             ...           ...
210           Asia                     Syria      22125.2            0     22163075121       1001.71
211  North America  Turks and Caicos Islands       45.703            0       917550492       20076.4
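Since the goal in the question is to download the data, you can then write the selected columns to a file; a minimal sketch (the filename is just an example):
df[['continent', 'country', 'pop', 'imfGDP', 'unGDP', 'gdpPerCapita']].to_csv("gdp_per_capita.csv", index=False)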

Related

Python inside Power BI

Good afternoon, guys.
I have a Python project which gives me the continent of a country. It's web scraping of a Wikipedia page. How can I apply this as a new column in Power BI, turning the country (in the example below, "United_States") into a parameter driven by the countries of my Power BI report (the countries are in the first column of my report)?
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Geography_of_United_States"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
continent = soup.select_one("th:-soup-contains(Continent) + td").text
print(continent)
If you want to do this in Power BI, you can use the Web.Page function on top of Web.Contents to fetch and parse the web page.
Here's a simple query that loads the Wikipedia page and exposes its parsed HTML tables:
let
    Source = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/Geography_of_United_States")),
    Data = Source{0}[Data]
in
    Data
You can then navigate to the table that contains the Continent row and shape it into a data set.
If you want to use Python to do this, you should use the pandas library to load the data into a DataFrame and then use the to_csv() function to write the data to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://en.wikipedia.org/wiki/Geography_of_United_States"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
continent = soup.select_one("th:-soup-contains(Continent) + td").text
print(continent)
df = pd.DataFrame([continent])
df.to_csv("continent.csv", index=False, header=False)
If you want to do this in R, you should use the rvest library to parse the HTML and then use the readr library to write the result to a CSV file.
library(rvest)
library(readr)

url <- "https://en.wikipedia.org/wiki/Geography_of_United_States"
html <- read_html(url)
# rvest does not support BeautifulSoup's :-soup-contains() pseudo-class,
# so select the cell next to the "Continent" header with XPath instead
continent <- html_nodes(html, xpath = "//th[contains(., 'Continent')]/following-sibling::td[1]") %>% html_text()
df <- data.frame(continent)
write_csv(df, "continent.csv")
You can get the continent of any country with the following code:
#load libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
#country name
country_name = "United States"
#build url
url_start = "https://en.wikipedia.org/wiki/"
url_country = country_name.replace(" ", "_")
url = url_start + url_country
#get HTML
r = requests.get(url)
#parse HTML
soup = BeautifulSoup(r.content, "html.parser")
#get continent
continent = soup.select_one("th:-soup-contains(Continent) + td").text
#print continent
print(continent)
#save to csv
df = pd.DataFrame([continent], columns = ["continent"])
df.to_csv("continent.csv", index = False, header = False)

Geopandas and beautiful soup - web scraping and writing to shapefile

When using the below code I get the desired column with the title content in the shapefile, but only if the shapefile has one row/feature. When running it on a shapefile with more than one feature, no column is written at all. Any tips/help greatly appreciated!
import geopandas as gpd
import requests
from bs4 import BeautifulSoup

gdf = gpd.read_file("Test404_PhotosMeta.shp", driver="ESRI Shapefile", encoding="utf8")
for url in gdf['url']:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for title in soup.find_all('title'):
        gdf['HTitle'] = title
gdf.to_file("HTitle.shp", driver="ESRI Shapefile")
You have not provided sample data / a shapefile, so I have used an inbuilt sample shapefile and constructed a url column for the purpose of demonstration.
Really this is a pandas question, not a geopandas one.
Your looping logic is the problem: only the last url ends up in the response object, so only that title is ever assigned.
I have refactored to use apply() so that the request is run for each row.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import geopandas as gpd
from pathlib import Path

# use included sample geodataframe
gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
# add a url column
gdf["url"] = (
    "https://simplemaps.com/data/"
    + gdf.loc[~gdf["iso_a3"].eq("-99"), "iso_a3"].str[0:2].str.lower()
    + "-cities"
)

# utility function to get titles for a URL
def get_titles(url):
    if pd.isna(url):
        return ""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # if there are multiple titles for a row, join them
    return ",".join([str(t) for t in soup.find_all("title")])

# get the titles
gdf['HTitle'] = gdf["url"].apply(get_titles)
gdf.to_file(Path.cwd().joinpath("HTitle.shp"), driver="ESRI Shapefile")
sample output
gpd.read_file(Path.cwd().joinpath("HTitle.shp")).drop(columns="geometry").sample(5)
        pop_est continent      name iso_a3 gdp_md_est                                    url                    HTitle
98   1281935911      Asia     India    IND  8.721e+06  https://simplemaps.com/data/in-cities     India Cities Database
157    28036829      Asia     Yemen    YEM      73450  https://simplemaps.com/data/ye-cities     Yemen Cities Database
129    11491346    Europe   Belgium    BEL     508600  https://simplemaps.com/data/be-cities   Belgium Cities Database
113    38476269    Europe    Poland    POL  1.052e+06  https://simplemaps.com/data/po-cities                       404
57     24994885    Africa  Cameroon    CMR      77240  https://simplemaps.com/data/cm-cities  Cameroon Cities Database

Scraping Data with Beautiful Soup Issues

I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order. I am using BeautifulSoup to perform this task, but I'm having some issues. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
tags = soup.find_all('div', class_='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip=True)
    data.append([name, country])
df = pd.DataFrame(data)
df
df is returning an empty list. Not sure what is going on. When I take the code out of the for loop, it can't seem to find the select_one function. Function should be coming from bs4 - not sure why that's not working. Also, is there a repeatable pattern for web scraping that I'm missing? Seems like it's a different beast every time I try to tackle these kinds of problems.
Any help would be appreciated! Thank you!
The URL's data is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data. So you can use an automation tool such as Selenium together with BeautifulSoup. Here I apply Selenium with BeautifulSoup; please just run the code.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(5)
driver.get(url)
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()

tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    #print(name.text)
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    #print(country)
    data.append([name, country])

cols = ['name', 'country']
df = pd.DataFrame(data, columns=cols)
print(df)
Output:
name country
0 Bess, Cameron United States of America
1 Bess, Lane United States of America
2 Dick, Evan United States of America
3 Taylor, Dylan United States of America
4 Strahan, Michael United States of America
.. ... ...
295 Jones, Thomas United States of America
296 Sega, Ronald United States of America
297 Usachov, Yury Russia
298 Fettman, Martin United States of America
299 Wolf, David United States of America
[300 rows x 2 columns]
The page is dynamically loaded using javascript, so requests can't get to it directly. The data is loaded from another address and is received in json format. You can get to it this way:
url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once you have it loaded, you can iterate through it and retrieve relevant information. For example:
for astro in data['astronauts']:
    print(astro['astroNumber'], astro['firstName'], astro['lastName'], astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
etc.
You can then load the output into a pandas DataFrame or whatever else you need.
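For example, a minimal sketch that flattens the astronaut records into a DataFrame, keeping only the fields shown above (any other column names would need to be checked against the actual JSON payload):
import requests
import json
import pandas as pd

url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
data = json.loads(requests.get(url).text)

# flatten the list of astronaut records into a DataFrame
df = pd.json_normalize(data['astronauts'])

# keep only the fields used in the loop above
print(df[['astroNumber', 'firstName', 'lastName', 'rank']].head())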

Beautiful Soup cannot find table on iShares

I have been trying to scrape ETF data from iShares.com for an ongoing project for a while now. I am trying to create web scrapers for multiple websites but they are all identical. Essentially I run into two issues:
I keep getting the error "AttributeError: 'NoneType' object has no attribute 'tr'", although I am quite sure that I have chosen the correct table.
When I use "Inspect element" on some of the websites, I have to click "Show more" in order to see the code for all of the rows.
I am not a computer scientist, but I have tried many different approaches, which have sadly all been unsuccessful, so I hope you can help.
The URL: https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf
The table can be found on the URL under "Holdings". Alternatively, it can be found under the following paths:
JS path: document.querySelector("#allHoldingsTable > tbody")
XPath: //*[@id="allHoldingsTable"]/tbody
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf'
]

all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".allHoldingsTable")
    # remove the first row (it's not header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl), thousands='.', decimal=',')[0]
    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])

df = pd.concat(all_data, ignore_index=True)
print(df)

from openpyxl import load_workbook

path = '/Users/karlemilthulstrup/Downloads/ishares.xlsx'
book = load_workbook(path, read_only=False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book

df.to_excel(writer, index=False)
writer.save()
writer.close()
As stated in the comments, the data is dynamically rendered. If you don't want to go the route of accessing the data directly, you could use something like Selenium, which will let the page render; then you can go in there the way you have it above.
Also, there's a button that will download this into a CSV for you. Why not just do that?
But if you must scrape the page, you can get the data in JSON format. Just parse it:
import requests
import json
import pandas as pd

url = 'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf/1506575576011.ajax?tab=all&fileType=json'
r = requests.get(url)
r.encoding = 'utf-8-sig'
jsonData = json.loads(r.text)

rows = []
for each in jsonData['aaData']:
    row = {'Issuer Ticker': each[0],
           'Name': each[1],
           'Sector': each[2],
           'Asset Class': each[3],
           'Market Value': each[4]['display'],
           'Market Value Raw': each[4]['raw'],
           'Weight (%)': each[5]['display'],
           'Weight (%) Raw': each[5]['raw'],
           'Notional Value': each[6]['display'],
           'Notional Value Raw': each[6]['raw'],
           'Nominal': each[7]['display'],
           'Nominal Raw': each[7]['raw'],
           'ISIN': each[8],
           'Price': each[9]['display'],
           'Price Raw': each[9]['raw'],
           'Location': each[10],
           'Exchange': each[11],
           'Market Currency': each[12]}
    rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df)
Issuer Ticker ... Market Currency
0 VZ ... USD
1 ROG ... CHF
2 NESN ... CHF
3 WM ... USD
4 PEP ... USD
.. ... ... ...
309 ESH2 ... USD
310 TUH2 ... USD
311 JPY ... USD
312 MARGIN_JPY ... JPY
313 MARGIN_SGD ... SGD
[314 rows x 18 columns]
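If the end goal is to push the holdings into the existing workbook, as in the question, one option is pandas' append mode for ExcelWriter; a minimal sketch, assuming pandas 1.3+ with the openpyxl engine, where the path is taken from the question, the sheet name is a placeholder, and df is the DataFrame built above:
import pandas as pd

path = '/Users/karlemilthulstrup/Downloads/ishares.xlsx'
# mode="a" appends to the existing workbook; if_sheet_exists="replace" overwrites only this sheet
with pd.ExcelWriter(path, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='holdings', index=False)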

Scrape zoho-analytics externally stored table. Is it possible?

I am trying to scrape a zoho-analytics table from this webpage for a university project. For the moment I have no ideas. I can't see the values in the Inspect panel, and therefore I cannot use BeautifulSoup in Python (my favourite).
Does anybody have any idea?
Thanks a lot,
Joseph
I tried it with BeautifulSoup; it seems like you can't soup these values inside the table because they are not on the website itself but stored externally(?)
EDIT:
https://analytics.zoho.com/open-view/938032000481034014
This is the link where the table and its data are stored.
So I tried scraping from it with bs4 and it works.
The class of the rows is "zdbDataRowDiv"
Try:
container = page_soup.findAll("div", {"class": "zdbDataRowDiv"})
Code explanation:
container # the variable where your data is stored, name it how you like
page_soup # your html page you souped with BeautifulSoup
findAll("tag",{"attribute":"value"}) # this function finds every tag which has the specific value inside its attribute
They are stored within the <script> tags in json format. Just a matter of pulling those out and parsing:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json

url = 'https://flo.uri.sh/visualisation/4540617/embed'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script')
for script in scripts:
    if 'var _Flourish_data_column_names = ' in script.text:
        json_str = script.text
        col_names = json_str.split('var _Flourish_data_column_names = ')[-1].split(',\n')[0]
        cols = json.loads(col_names)
        data = json_str.split('_Flourish_data = ')[-1].split(',\n')[0]

loop = True
while loop == True:
    try:
        jsonData = json.loads(data)
        loop = False
        break
    except:
        data = data.rsplit(';', 1)[0]

rows = []
headers = cols['rows']['columns']
for row in jsonData['rows']:
    rows.append(row['columns'])

table = pd.DataFrame(rows, columns=headers)
for col in headers[1:]:
    table.loc[table[col] != '', col] = 'A'
Output:
print (table)
Company Climate change Forests Water security
0 Danone A A A
1 FIRMENICH SA A A A
2 FUJI OIL HOLDINGS INC. A A A
3 HP Inc A A A
4 KAO Corporation A A A
.. ... ... ... ...
308 Woolworths Limited A
309 Workspace Group A
310 Yokogawa Electric Corporation A A
311 Yuanta Financial Holdings A
312 Zalando SE A
[313 rows x 4 columns]
