Python: How to Webscrape All Rows from a Specific Table

Python: How to Webscrape All Rows from a Specific Table - python

For practice, I am trying to webscrape financial data from one table in this url: https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue
I'd like to save the data from the "Tesla Quarterly Revenue" table into a data frame and return two columns: Data, Revenue.
Currently the code as it runs now is grabbing data from the adjacent table, "Tesla Annual Revenue." Since the tables don't seem to have unique id's from which to separate them in this instance, how would I select elements only from the "Tesla Quarterly Revenue" table?
Any help or insight on how to remedy this would be deeply appreciated.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html5lib')
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all("tr"):
col = row.find_all("td")
date = col[0].text
revenue = col[1].text
tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue},ignore_index=True)
tesla_revenue.head()
Below are the results when I run this code:

You can let pandas do all the work
import pandas as pd
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
tables = pd.read_html(url)
for df in tables:
# loop over all found tables
pass
# quarterly revenue is the second table
df = tables[1]
df.columns = ['Date', 'Revenue'] # rename the columns if you want to
print(df)

Related

Cannot find the table tag in the website to scrap information using Beautiful soup

I am trying to obtain the values of this columns (Year, Mom Dy, Hr, Mn, Sec) from the following [website https://www.ngdc.noaa.gov/hazel/view/hazards/tsunami/event-data?maxYear=2022&minYear=2010&country=USA] but I am new using Beautiful soup and I cannot find the table tag in the inspection to obtain the information.
This are the columns
I have tried using this code:
url = 'https://www.ngdc.noaa.gov/hazel/view/hazards/tsunami/event-data?
maxYear=2022&minYear=2010&country=USA'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
soup.find('class',attrs={'ReactVirtualized__Grid__innerScrollContainer'})
But nothing is returned.

Data comes from an API you can call. You can optionally create a date index and sort on that as well after generating a DataFrame from the returned json.
import requests
import pandas as pd
df = pd.DataFrame(requests.get('https://www.ngdc.noaa.gov/hazel/hazard-service/api/v1/tsunamis/events?++maxYear=2022&minYear=2010&country=USA').json()['items'])
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.set_index('date', inplace=True)
df.sort_index(inplace=True)
df
You can read about the API options here:
https://www.ngdc.noaa.gov/hazel/view/swagger#/Tsunami%20Events#
There is also a search tool here:
https://www.ngdc.noaa.gov/hazel/view/hazards/tsunami/event-search

Getting headers from html (parsing)

The source is https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. I am looking to use the table called "COVID-19 pandemic in the United States by state and territory" which is the third diagram on the page.
Here is my code so far
from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
soup=BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
This last line with header1 i giving me the error "list index out of range". What it is supposed to print is "U.S State or territory....."
I don't know anything about html, and everything gets me stuck and confused. The soup.find could also be referencing the wrong part of the webpage.

Can you just use
headers = [element.text.strip() for element in data_table.find_all("th")]
To get the text in the headers?
To get the entire table as a pandas dataframe, you can do:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_file)
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
rows = data_table.find_all("tr")
# Delete first row as it's not part of the table and confuses pandas
# this removes it from both soup and data_table
rows[0].decompose()
# Same for third row
rows[2].decompose()
# Same for last two rows
rows[-1].decompose()
rows[-2].decompose()
# Read html with pandas
df = pd.read_html(str(data_table))[0]
# Keep only the useful columns
df = df[['U.S. state or territory[i].1', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]']]
# Rename columns
df.columns = ["State", "Cases", "Deaths", "Recov.", "Hosp."]

It's probably easier in these cases to try to read tables with pandas, and go from there:
import pandas as pd
table = soup.select_one("div#covid19-container table")
df = pd.read_html(str(table))[0]
df
The output is the target table.

by looking at your code, I think you should call the html tag by find, not by find_all in the title tag

Html table scraping using beautifulsoup

I am trying to a table from SEC filling 10-K, I think it is going allright except the part where pandas converting it to dataframes as I am new to data frames so I think making mistake in indexing, Please help me on this as I am getting fallowing error "IndexError: index 2 is out of bounds for axis 0 with size 2"
I am using this programming
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1022344/000155837017000934/spg-20161231x10k.htm#Item8FinancialStatementsandSupplementary'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'lxml')
table = soup.find_all('table')[0]
new_table = pd.DataFrame(columns=range(0,2), index = [0])
row_marker = 0
for row in table.find_all('tr'):
column_marker = 0
columns = row.find_all('td')
for column in columns:
new_table.iat[row_marker,column_marker] = column.get_text()
column_marker += 1
new_table
If dataframe issue isn't resolvable than please suggest any other substitute like writing data to csv/excel also any sugesstion for extracting mutiple table at once will be really helpful

Get xml from webservice?

I'm trying to get a data from this site
and then use some of it. Sorry for not copy-paste it but it's a long xml. So far I tried to get this data those ways:
from urllib.request import urlopen
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
s = urlopen(url)
content = s.read()
as print(content) looks good, now I would like to get a data from it
<tabela_rozklad data-aktualizacji="1480583567">
<DZIEN>2</DZIEN>
<GODZ>3</GODZ>
<ILOSC>2</ILOSC>
<TYG>0</TYG>
<ID_NAUCZ>66</ID_NAUCZ>
<ID_SALA>79</ID_SALA>
<ID_PRZ>104</ID_PRZ>
<RODZ>W</RODZ>
<GRUPA>1</GRUPA>
<ID_ST>13</ID_ST>
<SEM>1</SEM>
<ID_SPEC>0</ID_SPEC>
</tabela_rozklad>
How can I handle this data to easy use it?

You can use Beautiful soup and capture the tags you want. The code below should get you started!
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
# secure url content
response = requests.get(url).content
soup = BeautifulSoup(response)
# find each tabela_rozklad
tables = soup.find_all('tabela_rozklad')
# for each tabela_rozklad looks like there is 12 nested corresponding tags
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']
# initialize empty dataframe
df = pd.DataFrame()
# iterate over each tabela_rozklad and extract each tag and append to pandas dataframe
for table in tables:
all = map(lambda x: table.find(x).text, tags)
df = df.append([all])
# insert tags as columns
df.columns = tags
# display first 5 rows of table
df.head()
# and the shape of the data
df.shape # 665 rows, 12 columns
# and now you can get to the information using traditional pandas functionality
# for instance, count observations by rodz
df.groupby('rodz').count()
# or subset only observations where rodz = J
J = df[df.rodz == 'J']

Convert a cell into different columns & rows of the DataFrame

I am trying to scrape a table out of a site. I have tried to convert data_row to a Pandas DataFrame; however, all the data are lumped in one cell of the DataFrame. Would you guys please help me convert the data_row into a Pandas DataFrame with "Business Mileage, "Charitable Mileage," "Medical mileage," and "Moving mileage" as rows and "2016," "2015," "2014," "2013," "2012," "2011," and "2010" as columns ?
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
r = urllib2.urlopen('http://www.smbiz.com/sbrl003.html#cmv')
soup = BeautifulSoup(r)
print soup.prettify()
data_row = soup.findAll('pre')[0:1]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: How to Webscrape All Rows from a Specific Table - python

Related

Cannot find the table tag in the website to scrap information using Beautiful soup

Getting headers from html (parsing)

Html table scraping using beautifulsoup

Get xml from webservice?

Convert a cell into different columns & rows of the DataFrame

Categories

Resources