When using the code below, I get the desired column in the shapefile with the title content, but only if the shapefile has one row/feature. When running it on a shapefile with more than one feature, no column is written at all. Any tips/help greatly appreciated!
import geopandas as gpd
import requests
from bs4 import BeautifulSoup
gdf = gpd.read_file("Test404_PhotosMeta.shp", driver="ESRI Shapefile", encoding="utf8")

for url in gdf['url']:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for title in soup.find_all('title'):
        gdf['HTitle'] = title

gdf.to_file("HTitle.shp", driver="ESRI Shapefile")
You have not provided sample data / a shapefile, so I have used an inbuilt shapefile and constructed a url column for demonstration purposes.

Really this is a pandas question, not a geopandas one.

Your looping logic is the issue - gdf['HTitle'] = title assigns a single title to every row on each pass, so only the response from the last url survives.

I have refactored to use apply() so that it is run for each row.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import geopandas as gpd
from pathlib import Path

# use included sample geodataframe
gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# add a url column
gdf["url"] = (
    "https://simplemaps.com/data/"
    + gdf.loc[~gdf["iso_a3"].eq("-99"), "iso_a3"].str[0:2].str.lower()
    + "-cities"
)

# utility function to get titles for a URL
def get_titles(url):
    if pd.isna(url):
        return ""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # if there are multiple titles for a row join them
    return ",".join([str(t) for t in soup.find_all("title")])

# get the titles
gdf['HTitle'] = gdf["url"].apply(get_titles)
gdf.to_file(Path.cwd().joinpath("HTitle.shp"), driver="ESRI Shapefile")
sample output
gpd.read_file(Path.cwd().joinpath("HTitle.shp")).drop(columns="geometry").sample(5)
        pop_est continent      name iso_a3  gdp_md_est                                    url                    HTitle
98   1281935911      Asia     India    IND   8.721e+06  https://simplemaps.com/data/in-cities     India Cities Database
157    28036829      Asia     Yemen    YEM       73450  https://simplemaps.com/data/ye-cities     Yemen Cities Database
129    11491346    Europe   Belgium    BEL      508600  https://simplemaps.com/data/be-cities   Belgium Cities Database
113    38476269    Europe    Poland    POL   1.052e+06  https://simplemaps.com/data/po-cities                       404
57     24994885    Africa  Cameroon    CMR       77240  https://simplemaps.com/data/cm-cities  Cameroon Cities Database
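Note the Poland row: the constructed slug "po" apparently does not exist on simplemaps, so the title of the error page ("404") is what gets stored. If you would rather leave such rows blank, a small variant of get_titles can check the response status first (a sketch, assuming an empty string is an acceptable placeholder):

def get_titles(url):
    if pd.isna(url):
        return ""
    response = requests.get(url)
    # skip error pages (e.g. the Poland row above) rather than storing their titles
    if not response.ok:
        return ""
    soup = BeautifulSoup(response.content, "html.parser")
    return ",".join([str(t) for t in soup.find_all("title")])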
I am new to Python and trying to download the countries' GDP per capita data. I am trying to read the data from this website: https://worldpopulationreview.com/countries/by-gdp

I tried to read the data, but I get a "no tables found" error. I can see the data in r.text, but somehow pandas cannot read that table. How do I solve the problem and read the data?
MWE
import pandas as pd
import requests
url = "https://worldpopulationreview.com/countries/by-gdp"
r = requests.get(url)
raw_html = r.text # I can see the data is here, but pd.read_html says no tables found
df_list = pd.read_html(raw_html)
print(len(df_list))
Data is embedded via <script id="__NEXT_DATA__" type="application/json"> and only rendered by the browser, so you have to adjust your script a bit:

pd.json_normalize(
    json.loads(
        BeautifulSoup(
            requests.get(url).text, "html.parser"
        ).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)
Example
import pandas as pd
import requests, json
from bs4 import BeautifulSoup

url = "https://worldpopulationreview.com/countries/by-gdp"

df = pd.json_normalize(
    json.loads(
        BeautifulSoup(
            requests.get(url).text, "html.parser"
        ).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)

df[['continent', 'country', 'pop', 'imfGDP', 'unGDP', 'gdpPerCapita']]
Output
         continent                   country          pop       imfGDP           unGDP  gdpPerCapita
0    North America             United States       338290  2.08938e+13  18624475000000       61762.9
1             Asia                     China  1.42589e+06  1.48626e+13  11218281029298       10423.4
..             ...                       ...          ...          ...             ...           ...
210           Asia                     Syria      22125.2            0     22163075121       1001.71
211  North America  Turks and Caicos Islands       45.703            0       917550492       20076.4
Good afternoon, guys.

I have a Python project which gets me the continent of a country. It's web scraping of the Wikipedia site. How can I apply this as a new column in Power BI, turning the country (in the example below, "United_States") into a parameter fed by the countries of my Power BI report (the countries are in the 1st column of my BI report)?
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Geography_of_United_States"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
continent = soup.select_one("th:-soup-contains(Continent) + td").text
print(continent)
If you want to do this in Power BI, you should use the Web.Contents function to fetch the web page.

Here's a simple query that gets the HTML from the Wikipedia page and parses it:

let
    Source = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/Geography_of_United_States"))
in
    Source

You can then expand the tables that Web.Page parses out of the HTML to create a data set.
If you want to use Python to do this, you should use the pandas library to load the data into a DataFrame and then use the to_csv() function to write the data to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://en.wikipedia.org/wiki/Geography_of_United_States"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
continent = soup.select_one("th:-soup-contains(Continent) + td").text
print(continent)
df = pd.DataFrame([continent])
df.to_csv("continent.csv", index=False, header=False)
If you want to do this in R, you should use the rvest library to parse the HTML and then use the readr library to read the data into a data frame.
library(rvest)
library(readr)
url <- "https://en.wikipedia.org/wiki/Geography_of_United_States"
html <- read_html(url)
continent <- html_nodes(html, xpath = "//th[contains(., 'Continent')]/following-sibling::td[1]") %>% html_text()
df <- data.frame(continent)
write_csv(df, "continent.csv")
You can get the continent of any country with the following code:
#load libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
#country name
country_name = "United States"
#build url
url_start = "https://en.wikipedia.org/wiki/Geography_of_"
url_country = country_name.replace(" ", "_")
url = url_start + url_country
#get HTML
r = requests.get(url)
#parse HTML
soup = BeautifulSoup(r.content, "html.parser")
#get continent
continent = soup.select_one("th:-soup-contains(Continent) + td").text
#print continent
print(continent)
#save to csv
df = pd.DataFrame([continent], columns = ["continent"])
df.to_csv("continent.csv", index = False, header = False)
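To cover a whole column of countries (as in the Power BI report), the same lookup can be mapped over a list. A minimal sketch, where country_list is a hypothetical stand-in for your report's first column and each Geography_of_<country> page is assumed to have a Continent row in its infobox:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# hypothetical stand-in for the report's country column
country_list = ["United States", "Brazil", "Japan"]

records = []
for name in country_list:
    url = "https://en.wikipedia.org/wiki/Geography_of_" + name.replace(" ", "_")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    cell = soup.select_one("th:-soup-contains(Continent) + td")
    # guard against pages whose infobox lacks a Continent row
    records.append({"country": name, "continent": cell.text.strip() if cell else ""})

pd.DataFrame(records).to_csv("continents.csv", index=False)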
I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order. I am using BeautifulSoup to perform this task, but I'm having some issues. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name,country])
df = pd.DataFrame(data)
df
df is returning empty. Not sure what is going on. When I take the code out of the for loop, it can't seem to find the select_one function. The function should be coming from bs4 - not sure why that's not working. Also, is there a repeatable pattern for web scraping that I'm missing? It seems like it's a different beast every time I try to tackle these kinds of problems.
Any help would be appreciated! Thank you!
The url's data is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data, so you can use an automation tool such as Selenium together with BeautifulSoup. Here I apply Selenium with BeautifulSoup. Please just run the code.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(5)
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    #print(name.text)
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    #print(country)
    data.append([name, country])
cols=['name','country']
df = pd.DataFrame(data,columns=cols)
print(df)
Output:
name country
0 Bess, Cameron United States of America
1 Bess, Lane United States of America
2 Dick, Evan United States of America
3 Taylor, Dylan United States of America
4 Strahan, Michael United States of America
.. ... ...
295 Jones, Thomas United States of America
296 Sega, Ronald United States of America
297 Usachov, Yury Russia
298 Fettman, Martin United States of America
299 Wolf, David United States of America
[300 rows x 2 columns]
The page is dynamically loaded using javascript, so requests can't get to it directly. The data is loaded from another address and is received in json format. You can get to it this way:
import requests
import json

url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once you have it loaded, you can iterate through it and retrieve relevant information. For example:
for astro in data['astronauts']:
    print(astro['astroNumber'], astro['firstName'], astro['lastName'], astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
etc.
You can then load the output to a pandas dataframe or whatever.
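For example, a minimal sketch with pd.json_normalize (field names taken from the records printed above; anything else about the feed's schema is an assumption):

import pandas as pd

# flatten the astronaut records into a DataFrame
df = pd.json_normalize(data['astronauts'])
# keep the fields shown above (assumes every record carries them)
print(df[['astroNumber', 'firstName', 'lastName', 'rank']].head())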
I am trying to scrape a Zoho Analytics table from this webpage for a project at university. For the moment I have no ideas. I can't see the values in the inspector, and therefore I cannot use BeautifulSoup in Python (my favourite one).
Does anybody have any idea?
Thanks a lot,
Joseph
I tried it with BeautifulSoup; it seems you can't soup these values inside the table because they are not on the website itself but stored externally.
EDIT:
https://analytics.zoho.com/open-view/938032000481034014
This is the link where the table and its data are stored.
So I tried scraping from it with bs4 and it works.
The class of the rows is "zdbDataRowDiv"
Try:
container = page_soup.findAll("div", {"class": "zdbDataRowDiv"})
Code explanation:
container # the variable where your data is stored, name it how you like
page_soup # your html page you souped with BeautifulSoup
findAll("tag",{"attribute":"value"}) # this function finds every tag which has the specific value inside its attribute
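Putting it together, a minimal sketch (assuming, as above, that the open-view page serves those row divs as static HTML):

import requests
from bs4 import BeautifulSoup

# fetch the externally hosted Zoho Analytics view
page = requests.get("https://analytics.zoho.com/open-view/938032000481034014")
page_soup = BeautifulSoup(page.text, "html.parser")

# each table row lives in a div with class zdbDataRowDiv
container = page_soup.findAll("div", {"class": "zdbDataRowDiv"})
for row in container:
    print(row.get_text(" ", strip=True))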
They are stored within the <script> tags in json format. Just a matter of pulling those out and parsing:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
url = 'https://flo.uri.sh/visualisation/4540617/embed'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'var _Flourish_data_column_names = ' in script.text:
        json_str = script.text
        col_names = json_str.split('var _Flourish_data_column_names = ')[-1].split(',\n')[0]
        cols = json.loads(col_names)
        data = json_str.split('_Flourish_data = ')[-1].split(',\n')[0]

# trim trailing javascript off the end until the remainder parses as json
loop = True
while loop == True:
    try:
        jsonData = json.loads(data)
        loop = False
        break
    except:
        data = data.rsplit(';', 1)[0]

rows = []
headers = cols['rows']['columns']
for row in jsonData['rows']:
    rows.append(row['columns'])

table = pd.DataFrame(rows, columns=headers)
# replace non-empty cells with 'A'
for col in headers[1:]:
    table.loc[table[col] != '', col] = 'A'
Output:
print (table)
Company Climate change Forests Water security
0 Danone A A A
1 FIRMENICH SA A A A
2 FUJI OIL HOLDINGS INC. A A A
3 HP Inc A A A
4 KAO Corporation A A A
.. ... ... ... ...
308 Woolworths Limited A
309 Workspace Group A
310 Yokogawa Electric Corporation A A
311 Yuanta Financial Holdings A
312 Zalando SE A
[313 rows x 4 columns]
I'm trying to web-scrape a website with Python and I'm having some trouble. I've already read a looooot of articles online and questions here and I still can't do what I need to do.
I have this website:
https://beta.nhs.uk/find-a-pharmacy/results?latitude=51.2457238068354&location=Little%20London%2C%20Hampshire%2C%20SP11&longitude=-1.45959328501975
and I need to print the name of the store and its address, and save it to a file (can be csv or excel). I've tried with selenium, pandas, beautiful soup and nothing worked :(
Can someone help me please?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://beta.nhs.uk/find-a-pharmacy/results?latitude=51.2457238068354&location=Little%20London%2C%20Hampshire%2C%20SP11&longitude=-1.45959328501975")
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find_all("div", class_="results__details")
for container in data:
    Pharmacyname = container.find_all("h2")
    Pharmacyadd = container.find_all("p")
    for pharmacy in Pharmacyname:
        for add in Pharmacyadd:
            print(add.text)
            continue
        print(pharmacy.text)
OUTPUT:
Shepherds Spring Pharmacy Ltd is 1.8 miles away
The Oval,
Cricketers Way,
Andover,
Hampshire,
SP10 5DN
01264 355700
Map and directions for Shepherds Spring Pharmacy Ltd at The Oval
Services available in Shepherds Spring Pharmacy Ltd at The Oval
Open until 6:15pm today
Shepherds Spring Pharmacy Ltd
Tesco Instore Pharmacy is 2.1 miles away
Tesco Superstore,
River Way,
Andover,
Hampshire,
SP10 1UZ
0345 677 9007
.
.
.
Note: You could create separate lists for pharmacy_name and pharmacy_add to store the data and then write those to the files. PS. You could also strip off the unwanted text from the lists (say, the text after the phone number for each pharmacy).
import requests
from bs4 import BeautifulSoup
import re
import xlsxwriter
workbook = xlsxwriter.Workbook('File.xlsx')
worksheet = workbook.add_worksheet()
request = requests.get("https://beta.nhs.uk/find-a-pharmacy/results?latitude=51.2457238068354&location=Little%20London%2C%20Hampshire%2C%20SP11&longitude=-1.45959328501975")
soup = BeautifulSoup(request.content, 'html.parser')
data = soup.find_all("div", class_="results__details")
formed_data = []
for results_details in data:
    # pair each pharmacy's name (first h2) with its address (second p), collapsing whitespace
    formed_data.append([
        results_details.find_all("h2")[0].text,
        re.sub(' +', ' ', results_details.find_all("p")[1].text.replace('\n', ''))
    ])

row = col = 0
for name, address in formed_data:
    worksheet.write(row, col, name)
    worksheet.write(row, col + 1, address)
    row += 1

workbook.close()
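Since the question allows csv or excel, an equivalent sketch using pandas on the same formed_data list (to_excel needs an engine such as openpyxl installed):

import pandas as pd

# same rows as the xlsxwriter version, written via pandas instead
df = pd.DataFrame(formed_data, columns=["name", "address"])
df.to_excel("File.xlsx", index=False, header=False)
# or: df.to_csv("File.csv", index=False)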