I'm trying to parse this webpage into a pandas dataframe to analyze, but the page is set up such that the table only has two useful columns: one with the name, and another containing all of the remaining information as a single cell.
For example, with my code below:
import bs4
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

url = "https://education.scripps.edu/alumni/graduate-alumni-list/index.html"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tbody')
td = table.find_all('td')

data = []
for element in td:
    sub_data = []
    for sub_element in element:
        try:
            sub_data.append(sub_element.get_text())
        except:
            continue
    data.append(sub_data)

dataFrame = pd.DataFrame(data = data)
df = dataFrame[[1,3]]
df = df.dropna()
So df.iat[0,1] contains the program, defense year, advisor, dissertation title, and undergraduate institution all in one cell. The HTML only uses <br> and <strong> tags to separate these values, and I am wondering whether there is any way to split this text into separate columns, so the columns would be "name", "program", "defense year" and such, instead of one cell containing all the information.
Thank you so much!
After try: and before the sub_data.append line in your code, you should split your sub_element text on the <br> boundaries. Note that .get_text() strips the tags themselves, so a literal "<br>" never appears in its output; pass it as a separator instead. You can try the following:

sub_data_splitted = sub_element.get_text("<br>").split("<br>")

# After that you are able to use each field of the data, i.e.
program = sub_data_splitted[0].split(":")[1]
defense_year = sub_data_splitted[1].split(":")[1]
advisor = sub_data_splitted[2].split(":")[1]
dissertation_title = sub_data_splitted[3].split(":")[1]
ug_institution = sub_data_splitted[4].split(":")[1]
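If each label sits inside its own <strong> tag with the value in the text node that follows it (which is what the question's description suggests, though I haven't verified the live markup), a small self-contained sketch that walks the labels directly may be more robust than index-based splitting:

from urllib.request import urlopen
from bs4 import BeautifulSoup, NavigableString
import pandas as pd

url = "https://education.scripps.edu/alumni/graduate-alumni-list/index.html"
soup = BeautifulSoup(urlopen(url).read().decode("utf-8"), "html.parser")

records = []
for td in soup.find('tbody').find_all('td'):
    fields = {}
    for strong in td.find_all('strong'):
        label = strong.get_text(strip=True).rstrip(':')
        value = strong.next_sibling  # the text node right after the label
        # only keep it if a plain text node actually follows the label
        if isinstance(value, NavigableString) and str(value).strip():
            fields[label] = str(value).strip()
    if fields:
        records.append(fields)

df = pd.DataFrame(records)  # one column per <strong> label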
You can do it like this.
You can use .stripped_strings (a generator attribute, not a method) to get the text fragments from each <tr> of the table.
Since you only need the values and not the titles (like Name, Defense Year, etc.), use a list comprehension to select only the required values.
Append the list to a dataframe.
Here is how it is done.
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://education.scripps.edu/alumni/graduate-alumni-list/index.html"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")

t = soup.find('table').find('tbody')
trs = t.find_all('tr')
data = []
for tr in trs:
    # titles and values alternate, so the even-indexed strings are the values
    row = [x for i, x in enumerate(tr.stripped_strings) if i % 2 == 0]
    data.append(row)
df = pd.DataFrame(data=data)
0 ... 6
0 Abbott, PhD, Jason ... None
1 Adam, PhD, Gregory Charles ... None
2 Adhikari, PhD, Pramisha ... None
3 Al-Bassam, PhD, Jawdat M. H. ... None
4 Albertshofer, PhD, Klaus ... None
.. ... ... ...
682 Zhou, PhD, Jiacheng ... None
683 Zhou, PhD, Zhaohui (Sunny) ... None
684 Zhu, PhD, Ruyi ... None
685 Zhu, PhD, Yan ... None
686 Zuhl, PhD, Andrea M. ... None
[687 rows x 7 columns]
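The columns come back numbered 0 to 6. If you want named columns, and assuming the fields arrive in the order described in the question (the names below are a guess, and the all-None trailing column appears to be padding from rows with fewer fields), you could rename them:

df.columns = ["Name", "Program", "Defense Year", "Advisor",
              "Dissertation Title", "Undergraduate Institution", "Extra"]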
Is this what you're trying to do?
import bs4
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

url = "https://education.scripps.edu/alumni/graduate-alumni-list/index.html"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tbody')
td = table.find_all('td')

data = {}
names = []
prev_name = None
for element in td:
    sub_data = {}
    for sub_element in element:
        try:
            # a child tag with an 'alt' attribute (the portrait image) carries the name
            data[sub_element['alt']] = {}
            prev_name = sub_element['alt']
        except:
            # rewrite "<strong>Label:</strong> value<br/>" runs into
            # "<strong>Label: value</strong>" so label and value can be read together
            sub_element = str(sub_element).replace('</strong>', '').replace('<br/>', '</strong>')
            temp = BeautifulSoup(sub_element, 'html.parser')
            if len(temp.find_all('strong')) > 0:
                temp = [str(i.string) for i in temp.find_all('strong') if i.string is not None]
                temp = {i.split(':', 1)[0]: i.split(':', 1)[1] for i in temp if ':' in i}
                data[prev_name] = temp

df = pd.DataFrame(data=data)
df = df.T.reset_index()
df.rename(columns={'index': 'Name'}, inplace=True)
I'm trying to scrape a table on the markets.ft website which unfortunately has a number of icons in it (table: 'Lipper Leader Scorecard' - https://markets.ft.com/data/funds/tearsheet/ratings?s=LU0526609390:EUR).
When I use BeautifulSoup, I can grab the table but all the values are NaN.
Is there a way to scrape the icons inside the table and convert them into a numerical number?
My code is:
import requests
import pandas as pd
from bs4 import BeautifulSoup

id_list = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR', 'LU1116896876:EUR']
urls = ['https://markets.ft.com/data/funds/tearsheet/ratings?s=' + x for x in id_list]

dfs = []
for url in urls:
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    # Some funds in the list do not have any data.
    try:
        table = soup.find_all('table')[0]
        print(table)
    except Exception:
        continue
    df = pd.read_html(str(table), index_col=0)[0]
    dfs.append(df)
print(dfs)
Required Output for fund (LU0526609390):
                Total return  Consistent return  Preservation  Expense
Overall rating             3                  3             5        5
3 year rating              3                  3             5        5
5 year rating              2                  3             5        5
You can use a dictionary to map each icon's class value to the corresponding integer:
import requests, bs4
import pandas as pd
from io import StringIO

options = {
    'mod-sprite-lipper-1': 1,
    'mod-sprite-lipper-2': 2,
    'mod-sprite-lipper-3': 3,
    'mod-sprite-lipper-4': 4,
    'mod-sprite-lipper-5': 5,
}

soup = bs4.BeautifulSoup(requests.get(
    url='https://markets.ft.com/data/funds/tearsheet/ratings?s=LU0526609390:EUR'
).content, 'html.parser').find('table', {'class': 'mod-ui-table'})

header = [x.text.strip() for x in soup.find('thead').find_all('th')]
data = [header] + [
    # the first cell is the row label; each remaining cell holds an <i> icon
    # whose last class name encodes the rating
    [x.find('td').text.strip()] + [
        options[e.find('i').get('class')[-1]]
        for e in x.find_all('td')[1:]
    ]
    for x in soup.find('tbody').find_all('tr')
]

df = pd.read_csv(
    StringIO('\n'.join([','.join(str(x) for x in xs) for xs in data])),
    index_col=0,
)
print(df)
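As a side note, the CSV round trip through StringIO isn't strictly necessary; with the same data list you can build the frame directly (this assumes, as the CSV version does, that the first header cell is the corner cell above the row labels):

df = pd.DataFrame(data[1:], columns=data[0]).set_index(data[0][0])
print(df)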
I am trying to get the second table element for a list of links and store the results as a pandas dataframe. To accomplish this task I defined a function getCitySalaryTable():
from bs4 import BeautifulSoup
import lxml
import requests
import pandas as pd

job_title_urls = ['https://www.salario.com.br/profissao/abacaxicultor-cbo-612510',
                  'https://www.salario.com.br/profissao/abade-cbo-263105']

def getCitySalaryTable(job_title_urls, city_salary_df):
    for url in job_title_urls:
        original_url = url
        url = requests.get(url)
        soup = BeautifulSoup(url.text, 'lxml')
        tables = soup.find_all('table', attrs={'class': 'listas'})

        # I suspect the problem is here #
        city_salary_table = tables[1]
        #################################

        # extracting column names
        heads = city_salary_table.find('thead').find('tr').find_all('th')
        colnames = [hdr.text for hdr in heads]

        # extracting rows
        data = {k: [] for k in colnames}
        rows = city_salary_table.find('tbody').find_all('tr')
        for rw in rows:
            for col in colnames:
                cell = rw.find('td', attrs={'data-label': '{}'.format(col)})
                data[col].append(cell.text)
        #print(data)

        # Constructing a pandas dataframe using the data just parsed,
        # adding keys: cbo, job_title
        cbo = original_url.split('/')[-1].split('-')[-1]
        job_title = original_url.split('/')[-1].split('-')[0]
        df = pd.DataFrame.from_dict(data)
        df.insert(0, 'cbo', '')
        df['cbo'] = cbo
        df.insert(1, 'job_title', '')
        df['job_title'] = job_title
        city_salary_df = pd.concat([city_salary_df, df], ignore_index=True)
        return city_salary_df
However, when applied:
city_salary_df = pd.DataFrame()
city_salary_df = getCitySalaryTable(job_title_urls, city_salary_df)
It returns a dataframe just for the first link. I suspect that the table index used in the function (city_salary_table = tables[1]) is not correct for the other links.
# cbo job_title ... Salário/Hora Total
#0 612510 abacaxicultor ... 6,16 29
#1 612510 abacaxicultor ... 5,96 6
#2 612510 abacaxicultor ... 6,03 4
#3 612510 abacaxicultor ... 16,02 4
#4 612510 abacaxicultor ... 4,75 3
#5 612510 abacaxicultor ... 5,13 3
#[6 rows x 9 columns]
How could I properly tell the function to return me just the second table for all links?
Use nth-of-type if it is truly the 2nd table:

soup.select_one('table:nth-of-type(2)')

Though a class selector is faster than a type selector:

soup.select_one('.listas:nth-of-type(2)')

import requests
from bs4 import BeautifulSoup as bs

soup = bs(requests.get('https://www.salario.com.br/profissao/abacaxicultor-cbo-612510').text, 'lxml')
soup.select_one('.listas:nth-of-type(2)')
Your last link doesn't have that table, so add a check on whether city_salary_table is None:
from bs4 import BeautifulSoup
import lxml
import requests
import pandas as pd

job_title_urls = ['https://www.salario.com.br/profissao/abacaxicultor-cbo-612510',
                  'https://www.salario.com.br/profissao/abade-cbo-263105',
                  'https://www.salario.com.br/profissao/abadessa-cbo-263105',
                  'https://www.salario.com.br/profissao/abanador-na-agricultura-cbo-622020']

def getCitySalaryTable(job_title_urls, city_salary_df):
    for url in job_title_urls:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')

        # select the second table of its class
        city_salary_table = soup.select_one('.listas:nth-of-type(2)')

        if city_salary_table is not None:
            # extracting column names
            heads = city_salary_table.find('thead').find('tr').find_all('th')
            colnames = [hdr.text for hdr in heads]
            # extracting rows
            data = {k: [] for k in colnames}
            rows = city_salary_table.find('tbody').find_all('tr')
            for rw in rows:
                for col in colnames:
                    cell = rw.find('td', attrs={'data-label': '{}'.format(col)})
                    data[col].append(cell.text)
            # constructing a pandas dataframe using the data just parsed,
            # adding keys: cbo, job_title
            cbo = url.split('/')[-1].split('-')[-1]
            job_title = url.split('/')[-1].split('-')[0]
            df = pd.DataFrame.from_dict(data)
            df.insert(0, 'cbo', '')
            df['cbo'] = cbo
            df.insert(1, 'job_title', '')
            df['job_title'] = job_title
            city_salary_df = pd.concat([city_salary_df, df], ignore_index=True)
    return city_salary_df

city_salary_df = pd.DataFrame()
city_salary_df = getCitySalaryTable(job_title_urls, city_salary_df)
print(city_salary_df)
Google Colab note: I think Colab is using an ancient version of soupsieve, so the not-implemented error that newer versions report for this use of nth-of-type never surfaces there. Instead, you can use city_salary_table = soup.select_one('table + table').
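A minimal sketch of that workaround (assuming the city table is the one that immediately follows another table in the markup):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.salario.com.br/profissao/abacaxicultor-cbo-612510').text, 'lxml')
# 'table + table' matches a <table> that directly follows a sibling <table>
city_salary_table = soup.select_one('table + table')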
I want to scrape the urls from an HTML table on this website. I was able to gather LOCATION | DATE | SUMMARY | DEADLINE, but the SUMMARY field contains a url to another page. I want to scrape the entire table along with this url, so my scraped data becomes LOCATION | DATE | SUMMARY | DEADLINE | URLS.
import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.tendersinfo.com/global-information-technology-tenders-{}.php'
amount_of_pages = 4796  #5194

rows = []
for i in range(1, amount_of_pages):
    response = rq.get(url.format(i))
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('table', {'id': 'datatable'})
        headers = []
        for th in table.find("tr").find_all("th"):
            headers.append(th.text.strip())
        for tr in table.find_all("tr")[1:]:
            cells = []
            tds = tr.find_all("td")
            if len(tds) == 0:
                ths = tr.find_all("th")
                for th in ths:
                    cells.append(th.text.strip())
                    links = [th.findAll('a')]
            else:
                for td in tds:
                    cells.append(td.text.strip())
                    links = [td.findAll('a')]
            rows.append(cells)
You'll need to get the <a> tag under the <td> tag, and pull out the href attribute.
import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.tendersinfo.com/global-information-technology-tenders-{}.php'
amount_of_pages = 4796  #5194

rows = []
headers = []
for i in range(1, amount_of_pages + 1):  # <-- if there are 4796 pages, your range needs to go to 4797: range(start, end) is not inclusive of the end value
    response = rq.get(url.format(i))
    print(i)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('table', {'id': 'datatable'})
        if len(headers) == 0:
            for th in table.find("tr").find_all("th"):
                headers.append(th.text.strip())
            headers.append('URL')
        for tr in table.find_all("tr")[1:]:
            cells = []
            tds = tr.find_all("td")
            for td in tds:
                cells.append(td.text.strip())
                if td.find('a'):
                    link = td.find('a')['href']
                    cells = cells + [link]
            rows.append(cells)

df = pd.DataFrame(rows, columns=headers)
I'm trying to web scrape a data table from Wikipedia using python bs4, but I'm stuck on this problem: when collecting the data values, my code is not getting the first column (index zero). I feel there is something wrong with the index but I can't figure it out. Please help; my code is below.
import requests
from bs4 import BeautifulSoup
import pandas as pd

response_obj = requests.get('https://en.wikipedia.org/wiki/Metro_Manila').text
soup = BeautifulSoup(response_obj, 'lxml')

Neighborhoods_MM_Table = soup.find('table', {'class': 'wikitable sortable'})
rows = Neighborhoods_MM_Table.select("tbody > tr")[3:8]

cities = []
for row in rows:
    city = {}
    tds = row.select('td')
    city["City or Municipal"] = tds[0].text.strip()
    city["%_Population"] = tds[1].text.strip()
    city["Population"] = float(tds[2].text.strip().replace(",", ""))
    city["area_sqkm"] = float(tds[3].text.strip().replace(",", ""))
    city["area_sqm"] = float(tds[4].text.strip().replace(",", ""))
    city["density_sqm"] = float(tds[5].text.strip().replace(",", ""))
    city["density_sqkm"] = float(tds[6].text.strip().replace(",", ""))
    cities.append(city)

print(cities)
df = pd.DataFrame(cities)
df.head()
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # collect the right-aligned percentage cells to rebuild the population column
    target = [item.get_text(strip=True) for item in soup.findAll(
        "td", style="text-align:right") if "%" in item.text] + [""]
    df = pd.read_html(r.content, header=0)[5]
    df = df.iloc[1:-1]
    df['Population (2015)[3]'] = target
    print(df)
    df.to_csv("data.csv", index=False)

main("https://en.wikipedia.org/wiki/Metro_Manila")
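As a side note on the original symptom: a common reason the first column goes missing when scraping a wikitable is that the row-label cell is marked up as a <th> rather than a <td>, so row.select('td') skips it. If that is the case here (an assumption, not verified against this page), selecting both cell types would fix the original loop:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/Metro_Manila').text, 'lxml')
table = soup.find('table', {'class': 'wikitable sortable'})
for row in table.select("tbody > tr")[3:8]:
    cells = row.select('th, td')  # 'th' catches row-header cells that plain 'td' misses
    print([c.get_text(strip=True) for c in cells])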
I'm trying to scrape a table into a dataframe. My attempt only returns the table name and not the data within rows for each region.
This is what I have so far:
from bs4 import BeautifulSoup as bs4
import requests

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")

table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
    print(row)
The ideal outcome I'd like to get:

region         | price
---------------|-------
new england    | 2.59
new york city  | 2.52
Thanks for any assistance.
If you check your html response (soup), you will see that the table tag you get in the line table_regions = soup.find('table', {'class': "t4"}) is closed before the rows that contain the information you need (the ones whose td's carry the class names up, dn, d1 and s1).
So how about working from the raw tr/td tags directly, like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")

a = soup.find_all('tr')
rows = []
subel = []
for tr in a[42:50]:
    b = tr.find_all('td')
    for td in b:
        subel.append(td.string)
    rows.append(subel)
    subel = []

df = pd.DataFrame(rows, columns=['Region', 'Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use just the a[42:50] slice of the results, because a contains every tr on the page, not only those of the prices table. You can use the rest too if you need to.
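The hard-coded slice is brittle if the page layout shifts. Since the data rows carry td's with the class names mentioned above, one sketch of a more robust filter (assuming the s1 class stays on cells in the rows you want):

from bs4 import BeautifulSoup as bs4
import requests

soup = bs4(requests.get('https://www.eia.gov/todayinenergy/prices.php').text, "html.parser")
# keep only the rows whose cells carry the "s1" class noted above
region_rows = [tr for tr in soup.find_all('tr') if tr.find('td', class_='s1')]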