I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order. I am using BeautifulSoup to perform this task, but I'm having some issues. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name,country])

df = pd.DataFrame(data)
df
df is coming back empty, and I'm not sure what is going on. When I take the code out of the for loop, it can't seem to find the select_one function, even though that function should be coming from bs4 - not sure why that's not working. Also, is there a repeatable pattern for web scraping that I'm missing? It seems like a different beast every time I try to tackle these kinds of problems.
Any help would be appreciated! Thank you!
The page's data is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data on its own. You can pair it with an automation tool such as Selenium; here I use Selenium to render the page and BeautifulSoup to parse it. Just run the code below.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(5)
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    #print(name.text)
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    #print(country)
    data.append([name, country])

cols = ['name', 'country']
df = pd.DataFrame(data, columns=cols)
print(df)
Output:
name country
0 Bess, Cameron United States of America
1 Bess, Lane United States of America
2 Dick, Evan United States of America
3 Taylor, Dylan United States of America
4 Strahan, Michael United States of America
.. ... ...
295 Jones, Thomas United States of America
296 Sega, Ronald United States of America
297 Usachov, Yury Russia
298 Fettman, Martin United States of America
299 Wolf, David United States of America
[300 rows x 2 columns]
The page is dynamically loaded using javascript, so requests can't get to it directly. The data is loaded from another address and is received in json format. You can get to it this way:
url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once you have it loaded, you can iterate through it and retrieve relevant information. For example:
for astro in data['astronauts']:
    print(astro['astroNumber'], astro['firstName'], astro['lastName'], astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
etc.
You can then load the output to a pandas dataframe or whatever.
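For example, a minimal sketch that puts only the fields shown above into a DataFrame (the country field the original question needs should also be in the JSON - inspect one astronaut entry to find its exact key):
import pandas as pd

# Build a DataFrame from the fields already shown above; inspect one entry
# (e.g. print(data['astronauts'][0].keys())) to find the key holding the country.
rows = [
    [a['astroNumber'], a['firstName'], a['lastName'], a['rank']]
    for a in data['astronauts']
]
df = pd.DataFrame(rows, columns=['astroNumber', 'firstName', 'lastName', 'rank'])
print(df.head())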
I am new to Python and trying to download countries' GDP per capita data. I am trying to read the data from this website: https://worldpopulationreview.com/countries/by-gdp
I tried to read the data, but I get a "no tables found" error.
I can see the data is in r.text, but somehow pandas cannot read the table.
How can I solve the problem and read the data?
MWE
import pandas as pd
import requests
url = "https://worldpopulationreview.com/countries/by-gdp"
r = requests.get(url)
raw_html = r.text # I can see the data is here, but pd.read_html says no tables found
df_list = pd.read_html(raw_html)
print(len(df_list))
Data is embedded in a <script id="__NEXT_DATA__" type="application/json"> tag and only rendered into a table by the browser, so you have to adjust your script a bit:
pd.json_normalize(
    json.loads(
        BeautifulSoup(requests.get(url).text).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)
Example
import pandas as pd
import requests,json
from bs4 import BeautifulSoup
url = "https://worldpopulationreview.com/countries/by-gdp"
df = pd.json_normalize(
    json.loads(
        BeautifulSoup(requests.get(url).text).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)
df[['continent', 'country', 'pop','imfGDP', 'unGDP', 'gdpPerCapita']]
Output
         continent                   country          pop       imfGDP           unGDP  gdpPerCapita
0    North America             United States       338290  2.08938e+13  18624475000000       61762.9
1             Asia                     China  1.42589e+06  1.48626e+13  11218281029298       10423.4
..             ...                       ...          ...          ...             ...           ...
210           Asia                     Syria      22125.2            0     22163075121       1001.71
211  North America  Turks and Caicos Islands       45.703            0       917550492       20076.4
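Since the goal was to download the GDP per capita data, the resulting frame can simply be written out; a minimal sketch (the filename is just an example):
# Save the selected columns to disk; 'gdp_per_capita.csv' is an arbitrary example filename
df[['continent', 'country', 'pop', 'imfGDP', 'unGDP', 'gdpPerCapita']].to_csv(
    'gdp_per_capita.csv', index=False)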
When using the code below I get the desired column in the shapefile, populated with the page title - but only if the shapefile has one row/feature. When running it on a shapefile with more than one feature, no column is written at all. Any tips/help greatly appreciated!
import geopandas as gpd
import requests
from bs4 import BeautifulSoup
gdf = gpd.read_file("Test404_PhotosMeta.shp", driver="ESRI Shapefile", encoding ="utf8")
for url in gdf['url']:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

for title in soup.find_all('title'):
    gdf['HTitle'] = title

gdf.to_file("HTitle.shp", driver="ESRI Shapefile")
You have not provided sample data / a shapefile, so I have used an inbuilt shapefile and constructed a url column for the purpose of demonstration.
Really this is a pandas question, not a geopandas one.
Your looping logic is the problem - only the last url ends up in the response object.
I have refactored to use apply() so that it runs for each row.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import geopandas as gpd
from pathlib import Path

# use included sample geodataframe
gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# add a url column
gdf["url"] = (
    "https://simplemaps.com/data/"
    + gdf.loc[~gdf["iso_a3"].eq("-99"), "iso_a3"].str[0:2].str.lower()
    + "-cities"
)

# utility function to get titles for a URL
def get_titles(url):
    if pd.isna(url):
        return ""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # if there are multiple titles for a row join them
    return ",".join([str(t) for t in soup.find_all("title")])

# get the titles
gdf['HTitle'] = gdf["url"].apply(get_titles)
gdf.to_file(Path.cwd().joinpath("HTitle.shp"), driver="ESRI Shapefile")
sample output
gpd.read_file(Path.cwd().joinpath("HTitle.shp")).drop(columns="geometry").sample(5)
        pop_est continent      name iso_a3 gdp_md_est                                    url                    HTitle
98   1281935911      Asia     India    IND  8.721e+06  https://simplemaps.com/data/in-cities     India Cities Database
157    28036829      Asia     Yemen    YEM      73450  https://simplemaps.com/data/ye-cities     Yemen Cities Database
129    11491346    Europe   Belgium    BEL     508600  https://simplemaps.com/data/be-cities   Belgium Cities Database
113    38476269    Europe    Poland    POL  1.052e+06  https://simplemaps.com/data/po-cities                       404
57     24994885    Africa  Cameroon    CMR      77240  https://simplemaps.com/data/cm-cities  Cameroon Cities Database
I have one of those nightmare tables with no class given for the tr and td tags.
A sample page is here: https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m
(You'll see in the code below that I'm getting multiple pages, but that's not the problem.)
I want the team name (nothing else) from each bracket. The output should be:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
etc.
I've been able to get every td in the specified tables. But every attempt to use [0] to get the first td of every row gives me an "index out of range" error.
The code is:
import requests
import csv
from bs4 import BeautifulSoup
batch_size = 2
urls = ['https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m', 'https://system.gotsport.com/org_event/events/1271/schedules?age=17&gender=m']
# iterate through urls
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # iterate through leagues and teams
    leagues = soup.find_all('table', class_='table table-bordered table-hover table-condensed')
    for league in leagues:
        row = ''
        rows = league.find_all('tr')
        for row in rows:
            team = row.find_all('td')
            teamName = team[0].text.strip()
            print(teamName)
After a couple of hours of work, I feel like I'm just one syntax change away from getting this right. Yes?
You can use the CSS selector nth-of-type(n). It works for both links:
import requests
from bs4 import BeautifulSoup
url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for tag in soup.select(".small-margin-bottom td:nth-of-type(1)"):
    print(tag.text.strip())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
...
...
Real Salt Lake U19
Real Colorado
Empire United Soccer Academy
Each bracket corresponds to one "panel", and the first table in each panel lists all the teams that appear in that bracket's match tables; printing the first td of every row in that table gives the team names.
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m"

    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")

    for panel in soup.find_all("div", {"class": "panel-body"}):
        for row in panel.find("tbody").find_all("tr"):
            print(row.find("td").text.strip())

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
OCYS
FL Rush
Jacksonville FC
Atlanta United
SSA
Miami Rush Kendall SC
IMG
Tampa Bay United
Weston FC
Chargers SC
South Florida FA
Solar SC
RISE SC
...
I think the problem is the header row of the table, which contains th elements instead of td elements. That leads to the index out of range error when you try to retrieve the first element from an empty list. Try adding a check on the length of the td list:
for row in rows:
    team = row.find_all('td')
    if len(team) > 0:
        teamName = team[0].text.strip()
        print(teamName)
It should print you the team names.
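For completeness, here is a sketch of the same check dropped back into the original loop, writing the names to a CSV file (the filename and the csv usage are assumptions - the original script imports csv but never uses it):
import csv
import requests
from bs4 import BeautifulSoup

urls = ['https://system.gotsport.com/org_event/events/1271/schedules?age=19&gender=m',
        'https://system.gotsport.com/org_event/events/1271/schedules?age=17&gender=m']

with open('teams.csv', 'w', newline='') as f:   # 'teams.csv' is an example filename
    writer = csv.writer(f)
    for url in urls:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for league in soup.find_all('table', class_='table table-bordered table-hover table-condensed'):
            for row in league.find_all('tr'):
                team = row.find_all('td')
                if len(team) > 0:               # skip header rows that only contain th cells
                    writer.writerow([team[0].text.strip()])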
I am trying to write a Python program to gather data from Google Trends (GT) - specifically, I want to automatically open URLs and access the specific values that are displayed in the title.
I have written the code and I am able to scrape data successfully, but when I compare the data returned by the code with what is shown at the URL, the results are only partially returned.
For example, in the image below, the code returns only the first title "Manchester United F.C. • Tottenham Hotspur F.C.", but the actual website has 4 results: "Manchester United F.C. • Tottenham Hotspur F.C.", International Champions Cup, Manchester ...
(screenshots: the Google Trends page and the code's output)
We have tried all possible ways of locating elements on the page but we are still unable to find a fix for this. We didn't want to use Scrapy or BeautifulSoup for this.
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

links = ["https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"]

for link in links:
    Title_temp = []
    Title = ''
    seleniumDriver = r"C:/Users/Downloads/chromedriver_win32/chromedriver.exe"
    chrome_options = Options()
    brow = webdriver.Chrome(executable_path=seleniumDriver, chrome_options=chrome_options)
    try:
        brow.get(link)  ## getting the url
        try:
            content = brow.find_elements_by_class_name("details-top")
            for element in content:
                Title_temp.append(element.text)
            Title = ' '.join(Title_temp)
        except:
            Title = ''
        brow.quit()
    except Exception as error:
        print(error)
        break

Final_df = pd.DataFrame(
    {'Title': Title_temp
    })
From what I see, the data is retrieved from an API endpoint you can call directly. I show how to call it and then extract only the title (note that the API call returns more information than just the title). You can explore the breadth of what is returned (which includes article snippets, urls, image links etc.) here.
import requests
import json

r = requests.get('https://trends.google.com/trends/api/realtimetrends?hl=en-GB&tz=-60&cat=s&fi=0&fs=0&geo=DE&ri=300&rs=20&sort=0')
# strip the non-JSON prefix Google prepends to the response before parsing
data = json.loads(r.text[5:])
titles = [story['title'] for story in data['storySummaries']['trendingStories']]
print(titles)
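Before relying on any field other than title, it is worth checking what each trending story actually contains; a quick exploratory sketch (nothing beyond 'title' is assumed here):
# Look at the keys of the first trending story to discover the other fields
# (article snippets, urls, image links, ...) before hard-coding their names.
first_story = data['storySummaries']['trendingStories'][0]
print(sorted(first_story.keys()))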
Here is the code which printed all the information.
url = "https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"
driver.get(url)
WebDriverWait(driver,30).until(EC.presence_of_element_located((By.CLASS_NAME,'details-top')))
Title_temp = []
try:
content = driver.find_elements_by_class_name("details-top")
for element in content:
Title_temp.append(element.text)
Title=' '.join(Title_temp)
except:
Title=''
print(Title_temp)
driver.close()
Here is the output.
['Hertha BSC • Fenerbahçe S.K. • Bundesliga • Ante Čović • Berlin', 'Eintracht Frankfurt • UEFA Europa League • Tallinn • Estonia • Frankfurt', 'FC Augsburg • Galatasaray S.K. • Martin Schmidt • Bundesliga • Stefan Reuter', 'Austria national football team • FIFA • Austria • FIFA World Rankings', 'Lechia Gdańsk • Brøndby IF • 2019–20 UEFA Europa League • Gdańsk', 'Alexander Zverev • Hamburg', 'Julian Lenz • Association of Tennis Professionals • Alexander Zverev', 'UEFA Europa League • Diego • Nairo Quintana • Tour de France']
We were able to find a fix for this. We had to scrape the data from the inner HTML and then do a bit of data cleaning to get the required records.
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

#html parser
def parse_html(content):
    from bs4 import BeautifulSoup
    from bs4.element import Comment
    soup = BeautifulSoup(content, 'html.parser')
    text_elements = soup.findAll(text=True)
    tag_blacklist = ['style', 'script', 'head', 'title', 'meta', '[document]','img']
    clean_text = []
    for element in text_elements:
        if element.parent.name in tag_blacklist or isinstance(element, Comment):
            continue
        else:
            text_ = element.strip()
            clean_text.append(text_)
    result_text = " ".join(clean_text)
    result_text = result_text.replace(r'[\r\n]','')
    tag_remove_pattern = re.compile(r'<[^>]+>')
    result_text = tag_remove_pattern.sub('', result_text)
    result_text = re.sub(r'\\','',result_text)
    return result_text

seleniumDriver = r"./chromedriver.exe"
chrome_options = Options()
brow = webdriver.Chrome(executable_path=seleniumDriver, chrome_options=chrome_options)
links = ["https://trends.google.com/trends/trendingsearches/realtime?geo=DE&category=s"]
title_temp = []

for link in links:
    try:
        brow.get(link)
        try:
            elements = brow.find_elements_by_class_name('details-top')
            for element in elements:
                html_text = parse_html(element.get_attribute("innerHTML"))
                title_temp.append(html_text.replace('share','').strip())
        except Exception as error:
            print(error)
        time.sleep(1)
        brow.quit()
    except Exception as error:
        print(error)
        break

Final_df = pd.DataFrame(
    {'Title': title_temp
    })
print(Final_df)
I'm trying to web-scrape a website with Python and I'm having some trouble. I've already read a lot of articles online and questions here, and I still can't do what I need to do.
I have this website:
https://beta.nhs.uk/find-a-pharmacy/results?latitude=51.2457238068354&location=Little%20London%2C%20Hampshire%2C%20SP11&longitude=-1.45959328501975
and I need to print the name of each store and its address, and save it to a file (can be csv or excel). I've tried with Selenium, pandas and BeautifulSoup, and nothing worked :(
Can someone help me please?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://beta.nhs.uk/find-a-pharmacy/results?latitude=51.2457238068354&location=Little%20London%2C%20Hampshire%2C%20SP11&longitude=-1.45959328501975")
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find_all("div", class_="results__details")
for container in data:
    Pharmacyname = container.find_all("h2")
    Pharmacyadd = container.find_all("p")
    for pharmacy in Pharmacyname:
        for add in Pharmacyadd:
            print(add.text)
            continue
        print(pharmacy.text)
OUTPUT:
Shepherds Spring Pharmacy Ltd is 1.8 miles away
The Oval,
Cricketers Way,
Andover,
Hampshire,
SP10 5DN
01264 355700
Map and directions for Shepherds Spring Pharmacy Ltd at The Oval
Services available in Shepherds Spring Pharmacy Ltd at The Oval
Open until 6:15pm today
Shepherds Spring Pharmacy Ltd
Tesco Instore Pharmacy is 2.1 miles away
Tesco Superstore,
River Way,
Andover,
Hampshire,
SP10 1UZ
0345 677 9007
.
.
.
Note: You could create separate lists for pharmacy_name and pharmacy_add to store the data and then write them to the files. PS: you could also strip off the unwanted text from the lists (say, the text after the phone number of each pharmacy).
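As a rough sketch of that note (the list names and the CSV filename are assumptions; the second p in each result block is taken as the address, and trimming text after the phone number is left out):
import csv
import requests
from bs4 import BeautifulSoup

page = requests.get("https://beta.nhs.uk/find-a-pharmacy/results?latitude=51.2457238068354&location=Little%20London%2C%20Hampshire%2C%20SP11&longitude=-1.45959328501975")
soup = BeautifulSoup(page.content, 'html.parser')

pharmacy_name = []
pharmacy_add = []
for container in soup.find_all("div", class_="results__details"):
    # first h2 holds the pharmacy name, second p is assumed to hold the address block
    pharmacy_name.append(container.find_all("h2")[0].get_text(strip=True))
    pharmacy_add.append(" ".join(container.find_all("p")[1].get_text().split()))

with open("pharmacies.csv", "w", newline="") as f:   # 'pharmacies.csv' is an example filename
    writer = csv.writer(f)
    writer.writerow(["name", "address"])
    writer.writerows(zip(pharmacy_name, pharmacy_add))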
import requests
from bs4 import BeautifulSoup
import re
import xlsxwriter
workbook = xlsxwriter.Workbook('File.xlsx')
worksheet = workbook.add_worksheet()
request = requests.get("https://beta.nhs.uk/find-a-pharmacy/results?latitude=51.2457238068354&location=Little%20London%2C%20Hampshire%2C%20SP11&longitude=-1.45959328501975")
soup = BeautifulSoup(request.content, 'html.parser')
data = soup.find_all("div", class_="results__details")
formed_data=[]
for results_details in data:
    formed_data.append([results_details.find_all("h2")[0].text,
                        re.sub(' +', ' ', results_details.find_all("p")[1].text.replace('\n', ''))])

row = col = 0
for name, adress in formed_data:
    worksheet.write(row, col, name)
    worksheet.write(row, col + 1, adress)
    row += 1

workbook.close()