beautifulsoup problems extracting a table


This is my first time using BeautifulSoup, and I'm having trouble extracting the table I want to work with (https://ansm.sante.fr/disponibilites-des-produits-de-sante/medicaments). I want to extract the table whose class is "table table-products sortable searchable".
import requests
from bs4 import BeautifulSoup
url="https://ansm.sante.fr/disponibilites-des-produits-de-sante/medicaments"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "html.parser")
table = soup.find("table", class_="table table-products sortable searchable ")
table_data = table.tbody.find_all("tr")
This outputs:
AttributeError: 'NoneType' object has no attribute 'tbody'.
I guess I'm not reaching the table correctly, which is why it comes out as 'NoneType'.

Something is wrong with the CSS class filter. Without it, the lookup works:
table = soup.find("table")
table_data = table.tbody.find_all("tr")
Add the class filter back but remove the trailing space:
table = soup.find("table", class_="table table-products sortable searchable") # last space at the end removed
Works too.
See:
BeautifulSoup and class with spaces
Beautiful Soup find element with multiple classes.
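To see why the trailing space matters, here is a minimal sketch on an inline HTML snippet. The markup is a made-up stand-in for the real page, not copied from it. A class_ string that contains spaces is compared against the whole class attribute as one exact string, whereas a CSS selector checks each class on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<table class="table table-products sortable searchable">
  <tbody><tr><td>Rupture de stock</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string comparison: the trailing space makes the lookup fail.
with_space = soup.find("table", class_="table table-products sortable searchable ")
print(with_space)  # None

# The same string without the trailing space matches.
exact = soup.find("table", class_="table table-products sortable searchable")

# A CSS selector tests classes individually, so order and extra classes do not matter.
by_selector = soup.select_one("table.table-products.searchable")

print(exact is by_selector)  # True
```

The selector form is usually the more forgiving choice, since it keeps working if the page adds or reorders classes.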

You have an extra space at the end of the class string: table = soup.find("table", class_="table table-products sortable searchable "). But you can get the table more simply with pandas:
import pandas as pd
df = pd.read_html('https://ansm.sante.fr/disponibilites-des-produits-de-sante/medicaments')[1]
print(df)
OUTPUT:
Statut ... Remise à disposition
0 Rupture de stock ... NaN
1 Rupture de stock ... NaN
2 Rupture de stock ... NaN
3 Tension d'approvisionnement ... NaN
4 Remise à disposition ... NaN
.. ... ... ...
373 Arrêt de commercialisation ... NaN
374 Rupture de stock ... NaN
375 Rupture de stock ... NaN
376 Remise à disposition ... 2 mars 2021
377 Rupture de stock ... NaN

There are two tables on the page, and the class value table table-products sortable searchable selects both of them. The desired table is the second one (index 1), so I use pandas to pull the complete table data:
import pandas as pd
df = pd.read_html('https://ansm.sante.fr/disponibilites-des-produits-de-sante/medicaments')[1]
print(df)
Output:
Statut ... Remise à disposition
0 Rupture de stock ... NaN
1 Rupture de stock ... NaN
2 Rupture de stock ... NaN
3 Tension d'approvisionnement ... NaN
4 Remise à disposition ... NaN
.. ... ... ...
373 Arrêt de commercialisation ... NaN
374 Rupture de stock ... NaN
375 Rupture de stock ... NaN
376 Remise à disposition ... 2 mars 2021
377 Rupture de stock ... NaN
[378 rows x 4 columns]
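If you would rather stay in BeautifulSoup, the same "take the second matching table" idea can be sketched as follows. The HTML is an invented two-table stand-in for the page, and the "Nom" column and product names are illustrative assumptions only.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented markup: two tables that share the same class list.
html = """
<table class="table table-products sortable searchable">
  <thead><tr><th>Statut</th><th>Nom</th></tr></thead>
  <tbody><tr><td>ignored</td><td>first table</td></tr></tbody>
</table>
<table class="table table-products sortable searchable">
  <thead><tr><th>Statut</th><th>Nom</th></tr></thead>
  <tbody>
    <tr><td>Rupture de stock</td><td>Produit A</td></tr>
    <tr><td>Remise à disposition</td><td>Produit B</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns every match, so the class filter alone is not enough.
tables = soup.select("table.table-products")
print(len(tables))  # 2

# Pick the second table and turn header and body cells into a DataFrame.
table = tables[1]
columns = [th.get_text(strip=True) for th in table.thead.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.tbody.find_all("tr")
]
df = pd.DataFrame(rows, columns=columns)
print(df)
```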

Related

Generate a pandas dataframe from collections.OrderedDict

I have to open this xml file from this website and make a dataframe.
I tried to parse the XML into a dict and then pass that to a dataframe:
from urllib.request import urlopen
import xmltodict
import pandas as pd
# Download the XML document and parse it into nested OrderedDicts
file = urlopen('https://analisi.transparenciacatalunya.cat/download/8s6p-h233/text%2Fxml')
data_bytes = file.read()
orderDictListData = xmltodict.parse(data_bytes)
orderDictListData
df = pd.DataFrame(orderDictListData)
I need a dataframe with the columns from key "id" through "codiINEmunicipi", like this:

How about simply using pandas.read_xml:
import pandas as pd
url = 'https://analisi.transparenciacatalunya.cat/download/8s6p-h233/text%2Fxml'
df = pd.read_xml(url)
output:
id nom carrec tractament resp iddep dep idpare codidep nif ordre datamodificacio datacreacio centres sinonims
0 535 012 Atenció Ciutadana None None None 3392 Departament de la Vicepresidència i de Polítiques Digitals i Territori 6564 PTO None 912000 02/06/2021 19/06/1997 NaN NaN
1 3383 061 Salut Respon None None None 2803 Departament de Salut 7021 SLT None 1000 23/02/2021 19/06/1997 NaN NaN
2 5500 ACCIÓ - Agència per a la Competitivitat de l'Empresa consellera delegada de l'Agència per a la Competitivitat de l'Empresa, ACCIÓ Sra. Natàlia Mas Guix 19775 Departament d'Empresa i Treball 19035 EMO S-0800476-D 323699 28/02/2022 19/06/1997 NaN NaN
3 5504 ACCIÓ a Girona delegat d'ACCIÓ a Girona Sr. Ferran Rodero 19775 Departament d'Empresa i Treball 5500 EMO None 10500 25/01/2016 19/06/1997 NaN NaN
4 5505 ACCIÓ a Lleida delegada d'ACCIÓ a Lleida Sra. Clara Porta Sànchez 19775 Departament d'Empresa i Treball 5500 EMO None 11500 25/01/2016 19/06/1997 NaN NaN
...
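For experimenting without the download, here is a small sketch of pandas.read_xml on an inline document. The XML layout (a root element with one child per record) is an assumption about the feed, not something confirmed by the question; the values are taken from the output above. parser="etree" is used so the example does not depend on lxml.

```python
from io import StringIO

import pandas as pd

# Assumed layout: one <row> element per record, one child element per column.
xml = """<response>
  <row><id>535</id><nom>012 Atenció Ciutadana</nom><codidep>PTO</codidep></row>
  <row><id>3383</id><nom>061 Salut Respon</nom><codidep>SLT</codidep></row>
</response>
"""

# By default read_xml treats each child of the root as one row.
df = pd.read_xml(StringIO(xml), parser="etree")
print(df)
```

If the real records sit deeper in the tree, the xpath argument of read_xml can point at them instead.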

How can I scrape the team names and goals from this site into a table? I've been trying a few different methods but can't quite figure it out.

import requests
from bs4 import BeautifulSoup
URL = "https://www.hockey-reference.com/leagues/NHL_2021_games.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="all_games")
table = soup.find('div', attrs = {'id':'div_games'})
print(table.prettify())

Select the table not the div to print the table:
table = soup.find('table', attrs = {'id':'games'})
print(table.prettify())
Or use pandas.read_html() to get the table and transform into a dataframe:
import pandas as pd
pd.read_html('https://www.hockey-reference.com/leagues/NHL_2021_games.html', attrs={'id':'games'})[0].iloc[:,:5]
Output:
Date        Visitor              G    Home                 G.1
2021-01-13  St. Louis Blues      4    Colorado Avalanche   1
2021-01-13  Vancouver Canucks    5    Edmonton Oilers      3
2021-01-13  Pittsburgh Penguins  3    Philadelphia Flyers  6
2021-01-13  Chicago Blackhawks   1    Tampa Bay Lightning  5
2021-01-13  Montreal Canadiens   4    Toronto Maple Leafs  5
...         ...                  ...  ...                  ...

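table = soup.find('div', attrs = {'id':'div_games'})
trs = table.find_all('tr')
gamestats = []
for tr in trs:
    home = tr.find('td', attrs = {'data-stat' : 'home_team_name'})
    visit = tr.find('td', attrs = {'data-stat' : 'visit_team_name'})
    # Rows without team cells (for example header rows) are skipped
    if home is None or visit is None:
        continue
    gamestat = {}
    gamestat['home_team_name'] = home.get_text(strip=True)
    gamestat['visit_team_name'] = visit.get_text(strip=True)
    gamestats.append(gamestat)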
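A self-contained sketch of the same data-stat lookup, run against an inline fragment and collected into a DataFrame. The data-stat names home_team_name and visit_team_name come from the snippet above; visit_goals, home_goals and date_game are assumed names for illustration and may not match the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment shaped like the games table; goal attribute names are assumptions.
html = """
<div id="div_games"><table id="games">
  <thead><tr><th>Date</th><th>Visitor</th><th>G</th><th>Home</th><th>G</th></tr></thead>
  <tbody>
    <tr><th data-stat="date_game">2021-01-13</th>
        <td data-stat="visit_team_name">St. Louis Blues</td><td data-stat="visit_goals">4</td>
        <td data-stat="home_team_name">Colorado Avalanche</td><td data-stat="home_goals">1</td></tr>
    <tr><th data-stat="date_game">2021-01-13</th>
        <td data-stat="visit_team_name">Vancouver Canucks</td><td data-stat="visit_goals">5</td>
        <td data-stat="home_team_name">Edmonton Oilers</td><td data-stat="home_goals">3</td></tr>
  </tbody>
</table></div>
"""
soup = BeautifulSoup(html, "html.parser")

wanted = ["visit_team_name", "visit_goals", "home_team_name", "home_goals"]
records = []
for tr in soup.select("table#games tbody tr"):
    # Map each cell's data-stat name to its text.
    cells = {
        td["data-stat"]: td.get_text(strip=True)
        for td in tr.find_all("td", attrs={"data-stat": True})
    }
    # Keep only rows that carry every wanted field.
    if all(name in cells for name in wanted):
        records.append({name: cells[name] for name in wanted})

df = pd.DataFrame(records, columns=wanted)
print(df)
```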

Not getting any output - webscraping using beautifulsoup

I am trying to scrape data from the following website. For the year 1993, for example, this is the link:
https://www.ugc.ac.in/jobportal/search.aspx?tid=MTk5Mw==
Firstly, I am not sure how to navigate between the pages as the url for every page is the same.
Secondly, I wrote the following code to scrape information on any given page.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
url = "https://www.ugc.ac.in/jobportal/search.aspx?tid=MTk5Mw=="
File = []
response = requests.get(url)
soup = bs(response.text, "html.parser")
entries = soup.find_all('tr', {'class': 'odd'})
for entry in entries:
    columns = {}
    Cells = entry.find_all("td")
    columns['Gender'] = Cells[3].get_text()
    columns['Category'] = Cells[4].get_text()
    columns['Subject'] = Cells[5].get_text()
    columns['NET Qualified'] = Cells[6].get_text()
    columns['Month/Year'] = Cells[7].get_text()
    File.append(columns)
df = pd.DataFrame(File)
df
I am not getting any error when running the code, but I am not getting any output either. I can't figure out what mistake I am making here. I would appreciate any input. Thanks!

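All the data is stored inside a <script> element on that HTML page. To read it into a pandas dataframe, you can use the following example:
import re
import requests
import pandas as pd
from io import StringIO
url = "https://www.ugc.ac.in/jobportal/search.aspx?tid=MTk5Mw=="
html_doc = requests.get(url).text
# Capture the JSON array assigned to "aaData" inside the <script> block
data = re.search(r'"aaData":(\[\{.*?\}]),', html_doc, flags=re.S).group(1)
# Wrap the string in StringIO: recent pandas versions deprecate passing literal JSON
df = pd.read_json(StringIO(data))
print(df)
df.to_csv("data.csv", index=False)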
Prints:
ugc_net_ref_no candidate_name roll_no subject gender jrf_lec cat vh_ph dob fname mname address exam_date result_date emailid mobile_no subject_code year
0 1/JRF (DEC.93) SHRI VEERENDRA KUMAR SHARMA N035010 ECONOMICS Male JRF GEN NaN NaN SHRI SATYA NARAIN SHARMA None 27 E KARANPUR (PRAYAG) ALLAHABAD U.P. 19th DEC.93 NULL NaN NaN NaN 1993
1 1/JRF (JUNE,93) SH MD TARIQUE R020005 ECONOMICS Male JRF GEN NaN NaN MD. ZAFIR ALAM None D-32, R.M. HALL, A.M.U. ALIGARH 20th June,93 NULL NaN NaN NaN 1993
2 10/JRF (DEC.93) SHRI ARGHYA GHOSH A245015 ECONOMICS Male JRF GEN NaN NaN SHRI BHOLANATH GHOSH None C/O SH.B. GHOSH 10,BAMACHARAN GHOSH LANE P.O.-BHADRESWAR,DIST.HOOGHLY CALCUTTA-24. 19th DEC.93 NULL NaN NaN NaN 1993
3 10/JRF (JUNE,93) SH SANTADAS GHOSH T210024 ECONOMICS Male JRF GEN NaN NaN SHRI HARIDAS GHOSH None P-112, MOTILAL COLONY, NO.-1 P.O. DUM DUM, CALCUTTA - 700 028 20th June,93 NULL NaN NaN NaN 1993
...
and saves the data to data.csv.
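The regex step can be tried offline on a small inline page. This is only a sketch: the surrounding script text and the second record are invented, and only the "aaData" key and the pattern come from the answer above. It uses json.loads rather than pandas.read_json, which is an alternative way to get the same records.

```python
import json
import re

import pandas as pd

# Invented page: a script block embedding a JSON array under "aaData".
html_doc = """
<html><body><script>
var settings = {"bPaginate": true, "aaData":[{"subject":"ECONOMICS","gender":"Male","year":1993},{"subject":"HISTORY","gender":"Female","year":1993}], "bSort": false};
</script></body></html>
"""

# Non-greedy match up to the first "}]," that closes the array.
match = re.search(r'"aaData":(\[\{.*?\}]),', html_doc, flags=re.S)
records = json.loads(match.group(1))

df = pd.DataFrame(records)
print(df)
```

Note that the pattern relies on "}]," not appearing inside any record; a page with nested arrays would need a sturdier extraction.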

How to properly place web-scraped data into a pandas data frame?

Problem: I have used BeautifulSoup to scrape a Wikipedia page for the meat consumption per capita of each country in the world. I'm having trouble putting it into a data frame using pandas: my data frame comes up blank.
Wikipedia page: https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption
Goal: Place web scraped data into a data frame
Code:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
url_meat1 = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
page = urllib.request.urlopen(url_meat1)
soup = BeautifulSoup(page, "lxml")  # parse the HTML from our URL into the BeautifulSoup parse tree format
print(soup.prettify())  # print results of the web page scrape
table_meat1 = soup.find('table', class_='wikitable sortable')
A = []
B = []
C = []
for row in table_meat1.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
df_meat1 = pd.DataFrame(A, columns=['Country'])
df_meat1['kg/person (2009)'] = B
df_meat1['kg/person (2017)'] = C
df_meat1
I get a blank data frame...

Replace your for loop with this for loop:
for row in table_meat1.findAll('tr'):
    cells = row.find_all('td')
    if len(cells) == 4:
        A.append(cells[0].a['title'])
        B.append(cells[2].find(text=True))
        C.append(cells[3].find(text=True).strip())
Output:
Country kg/person (2009) kg/person (2017)
0 Albania None
1 Algeria 19.5 17.33
2 American Samoa 26.8
3 Angola 22.4
4 Antigua and Barbuda 84.3
.. ... ... ...
183 Venezuela 76.8
184 Vietnam 49.9 52.90
185 Yemen 17.9
186 Zambia 12.3
187 Zimbabwe 21.3 13.64
[188 rows x 3 columns]
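A quick way to diagnose an empty result like this is to count how many td cells each row really has before hard-coding a length check. The sketch below uses an inline table; the header labels and the first numeric column are invented, while the 2009 and 2017 figures are taken from the output above.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Invented four-column table; the second column's values are placeholders.
html = """
<table class="wikitable sortable">
  <tr><th>Country</th><th>Earlier year</th><th>2009</th><th>2017</th></tr>
  <tr><td><a title="Algeria">Algeria</a></td><td>18.3</td><td>19.5</td><td>17.33</td></tr>
  <tr><td><a title="Vietnam">Vietnam</a></td><td>28.6</td><td>49.9</td><td>52.90</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

# Tally the number of <td> cells per row; the header row has none.
cell_counts = Counter(len(tr.find_all("td")) for tr in table.find_all("tr"))
print(cell_counts)

# No row has exactly 3 cells, so a len(cells) == 3 check would keep nothing.
rows = [
    (tds[0].a["title"], tds[2].get_text(strip=True), tds[3].get_text(strip=True))
    for tds in (tr.find_all("td") for tr in table.find_all("tr"))
    if len(tds) == 4
]
print(rows)
```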

How to scrape data through the table?

import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get("https://www.worldometers.info/coronavirus/#countries")
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("table", {"id":"main_table_countries_today"})
columns = [i.get_text(strip=True) for i in table.find("thead").find_all("th")]
rows = []
for row in table.find("tbody").find_all("tr"):
    rows.append([i.get_text(strip=True) for i in row.find_all("td")])
df = pd.DataFrame(rows, columns=columns)
df.to_csv("data.csv", index=False)
print(df)
Output:
# Country,Other ... 1 Deathevery X ppl 1 Testevery X ppl
0 North America ...
1 South America ...
2 Asia ...
3 Europe ...
4 Africa ...
.. ... ... ... ... ...
218 211 St. Barth ... 8
219 212 British Virgin Islands ... 30,249 24
220 213 Saint Pierre Miquelon ...
221 214 Anguilla ... 40
222 215 China ... 310,601 16
[223 rows x 19 columns]
I changed my code to the above, but why is only part of the data shown instead of the whole table? And how can I select columns by index? I would like to keep five columns: 'Country', 'Total Cases', 'Total Deaths', 'Total Recover' and 'Population'.

import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get("https://www.worldometers.info/coronavirus/#countries")
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("table", {"id":"main_table_countries_today"})
columns = [i.get_text(strip=True) for i in table.find("thead").find_all("th")]
rows = []
for row in table.find("tbody").find_all("tr"):
    rows.append([i.get_text(strip=True) for i in row.find_all("td")])
df = pd.DataFrame(rows, columns=columns)
df.to_csv("data.csv", index=False)
print(df)
Output:
#,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl
,North America,"5,657,552","+5,378","222,196",+295,"2,919,610","+4,662","2,515,746","26,013",,,,,,North America,,,
,South America,"4,245,834","+1,360","146,906",+89,"2,851,587",+188,"1,247,341","14,300",,,,,,South America,,,
,Asia,"4,453,650","+3,721","99,365",+41,"3,301,717","+5,326","1,052,568","19,086",,,,,,Asia,,,
,Europe,"2,898,953",+456,"203,794",,"1,748,496",+41,"946,663","5,143",,,,,,Europe,,,
,Africa,"961,388",,"20,350",,"615,346",+2,"325,692","1,150",,,,,,Africa,,,
,Oceania,"20,106",+397,246,+13,"12,276",+202,"7,584",43,,,,,,Australia/Oceania,,,
,,721,,15,,651,,55,4,,,,,,,,,
,World,"18,238,204","+11,312","692,872",+438,"11,449,683","+10,421","6,095,649","65,739","2,340",88.9,,,,All,,,
1,USA,"4,813,647",,"158,365",,"2,380,217",,"2,275,065","18,623","14,535",478,"59,935,508","180,977","331,176,957",North America,69,"2,091",6
2,Brazil,"2,733,677",,"94,130",,"1,884,051",,"755,496","8,318","12,853",443,"13,096,132","61,573","212,694,204",South America,78,"2,260",16
3,India,"1,805,838","+1,136","38,176",+15,"1,188,389","+1,161","579,273","8,944","1,307",28,"20,202,858","14,627","1,381,196,835",Asia,765,"36,180",68
4,Russia,"850,870",,"14,128",,"650,173",,"186,569","2,300","5,830",97,"28,793,260","197,295","145,940,242",Europe,172,"10,330",5
5,South Africa,"511,485",,"8,366",,"347,227",,"155,892",539,"8,615",141,"3,036,779","51,147","59,373,395",Africa,116,"7,097",20
...
...
...
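On the two follow-up points: print(df) abbreviates wide frames with "..." while the CSV file holds every column, and columns can be picked by position with DataFrame.iloc. The sketch below rebuilds a two-row frame from the CSV lines shown above. The positions 1, 2, 4, 6 and 14 are my reading of that header row, so check them against your own header before relying on them.

```python
from io import StringIO

import pandas as pd

# Header and two data rows copied from the CSV output above.
csv_text = '''#,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl
1,USA,"4,813,647",,"158,365",,"2,380,217",,"2,275,065","18,623","14,535",478,"59,935,508","180,977","331,176,957",North America,69,"2,091",6
2,Brazil,"2,733,677",,"94,130",,"1,884,051",,"755,496","8,318","12,853",443,"13,096,132","61,573","212,694,204",South America,78,"2,260",16
'''
df = pd.read_csv(StringIO(csv_text))

# Positional selection: country, total cases, total deaths, total recovered, population.
wanted = [1, 2, 4, 6, 14]
subset = df.iloc[:, wanted]

# Ask pandas to print every column instead of eliding the middle ones.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)
print(subset)
```

Selecting by name, for example df[["Country,Other", "TotalCases"]], is often sturdier than positions if the site reorders its columns.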
