Convert a cell into different columns & rows of the DataFrame - python

I am trying to scrape a table out of a site. I have tried to convert data_row to a pandas DataFrame; however, all the data ends up lumped into one cell of the DataFrame. Could you please help me convert data_row into a pandas DataFrame with "Business Mileage," "Charitable Mileage," "Medical Mileage," and "Moving Mileage" as rows and "2016," "2015," "2014," "2013," "2012," "2011," and "2010" as columns?
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
r = urllib2.urlopen('http://www.smbiz.com/sbrl003.html#cmv')
soup = BeautifulSoup(r)
print soup.prettify()
data_row = soup.findAll('pre')[0:1]
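One way to get there (a minimal sketch, in Python 3): the rates sit inside a <pre> block as plain text, so you can grab the block's text and match the lines that begin with each mileage label. The labels and years below come from the question itself, but the assumption that each matching line ends with one rate per year is a guess about the page's layout.
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

URL = 'http://www.smbiz.com/sbrl003.html#cmv'
soup = BeautifulSoup(urllib.request.urlopen(URL), 'html.parser')
text = soup.find('pre').get_text()

years = ['2016', '2015', '2014', '2013', '2012', '2011', '2010']
labels = ['Business Mileage', 'Charitable Mileage', 'Medical Mileage', 'Moving Mileage']

data = {}
for line in text.splitlines():
    stripped = line.strip()
    for label in labels:
        if stripped.lower().startswith(label.lower()):
            # assume the rest of the line holds one rate per year
            data[label] = stripped[len(label):].split()[:len(years)]

df = pd.DataFrame.from_dict(data, orient='index', columns=years)
print(df)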

Related

Python: How to Webscrape All Rows from a Specific Table

For practice, I am trying to webscrape financial data from one table in this URL: https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue
I'd like to save the data from the "Tesla Quarterly Revenue" table into a data frame and return two columns: Date, Revenue.
Currently the code is grabbing data from the adjacent table, "Tesla Annual Revenue." Since the tables don't seem to have unique IDs to separate them by in this instance, how would I select elements only from the "Tesla Quarterly Revenue" table?
Any help or insight on how to remedy this would be deeply appreciated.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html5lib')
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    tesla_revenue = tesla_revenue.append({"Date": date, "Revenue": revenue}, ignore_index=True)
tesla_revenue.head()
You can let pandas do all the work:
import pandas as pd
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
tables = pd.read_html(url)
for df in tables:
    # loop over all found tables
    pass
# quarterly revenue is the second table
df = tables[1]
df.columns = ['Date', 'Revenue'] # rename the columns if you want to
print(df)
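If the table's position on the page ever changes, tables[1] will silently pick the wrong one; read_html's match argument selects tables by their text instead. match is a standard pandas parameter, though whether the caption text on this site stays stable is an assumption. (As an aside, the DataFrame.append used in the question was deprecated and later removed from pandas, one more reason to let read_html build the frame.)
import pandas as pd

url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
# keep only tables whose text matches the given pattern
quarterly = pd.read_html(url, match="Tesla Quarterly Revenue")[0]
quarterly.columns = ["Date", "Revenue"]
print(quarterly.head())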

Adding href to pandas .read_html DF

I want to create a table with the information available on this website. I want the table to have 3 columns: 0 series/date, 1 title and 2 links. I already managed to get the first two columns but I don't know how to get the link for each entry.
import pandas as pd
import requests
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
Will it be possible to get what I want by only using pandas?
As far as I know, it's not possible with pandas only. It can be done with BeautifulSoup, though:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
html_table = BeautifulSoup(r.text).find('table')
r.close()
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
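One caveat: the list comprehension assumes exactly one <a> per table row, in the same order as the rows pandas parsed. A sketch of a per-row variant that tolerates rows without a link (skipping the first row as a header is an assumption about this table's markup):
links = []
for tr in html_table.find_all('tr')[1:]:  # skip the header row
    a = tr.find('a')
    links.append(a.get('href') if a else None)
df['Link'] = links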

Getting headers from html (parsing)

The source is https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. I am looking to use the table called "COVID-19 pandemic in the United States by state and territory", which is the third table on the page.
Here is my code so far
from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
    soup = BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
This last line with header1 is giving me the error "list index out of range". What it is supposed to print is "U.S State or territory....."
I don't know anything about HTML, and everything gets me stuck and confused. The soup.find could also be referencing the wrong part of the webpage.
Can you just use
headers = [element.text.strip() for element in data_table.find_all("th")]
to get the text in the headers?
To get the entire table as a pandas dataframe, you can do:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_file)  # html_file: the page HTML, loaded as in the question
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
rows = data_table.find_all("tr")
# Delete first row as it's not part of the table and confuses pandas
# this removes it from both soup and data_table
rows[0].decompose()
# Same for third row
rows[2].decompose()
# Same for last two rows
rows[-1].decompose()
rows[-2].decompose()
# Read html with pandas
df = pd.read_html(str(data_table))[0]
# Keep only the useful columns
df = df[['U.S. state or territory[i].1', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]']]
# Rename columns
df.columns = ["State", "Cases", "Deaths", "Recov.", "Hosp."]
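Those hardcoded names embed Wikipedia's footnote markers ("[i]", "[ii]", ...), which shift whenever the article's footnotes are edited. A small sketch of a more tolerant cleanup, stripping the markers from whatever read_html parsed (the regex is an assumption about the markers' shape):
import re

# drop bracketed footnote markers such as "[i]" or "[iv]" from the parsed column names
df.columns = [re.sub(r'\[\w+\]', '', str(col)).strip() for col in df.columns]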
It's probably easier in these cases to try to read tables with pandas, and go from there:
import pandas as pd
table = soup.select_one("div#covid19-container table")
df = pd.read_html(str(table))[0]
df
The output is the target table.
Looking at your code, I think you should select the title tag with find, not with find_all.

Crawling data from the web, then restructuring it into a Pandas DataFrame

I have a code like this:
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
from datetime import datetime, timedelta
URL_TEMPLATES = 'https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date={}-{:02d}-{:02d}&language=vietnamese&country=vietnam'
urls = URL_TEMPLATES.format(2015, 1, 1)
html_docs = requests.get(urls).text
soups = BeautifulSoup(html_docs)
tables = soups.find(class_='table-list')
tables
Then the result looks like this:
<div class="table-list">
<ul><li>Yên Phụ</li>
<li>Hữu Tiệp</li>
Can anyone help me turn this into a pandas DataFrame that's easier to handle, e.g. so I can select the 'Yên Phụ' string? Thanks
You can parse the tables directly with pandas.
import pandas as pd
url = 'https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date={}-{:02d}-{:02d}&language=vietnamese&country=vietnam'.format(2015, 1, 1)
tables = pd.read_html(url)
That will give you a list of dataframes. Each table on the page will be one dataframe.
Then you can query a dataframe like:
tables[0].query("Tên == 'Hà Nội'")
If you just wanted the cities in table-list div:
resp = requests.get(url)
soup = BeautifulSoup(resp.text)
table_list = soup.find('div', {'class': 'table-list'})
names, links = [], []
for city in table_list.find_all('a'):
    names.append(city.text)
    links.append(city['href'])
To turn the above 2 lists into a dataframe:
df = pd.DataFrame(zip(names, links), columns=['City', 'Link'])
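To then pull out a single city, as the question asks, a plain boolean filter on the frame works (assuming 'Yên Phụ' ended up in the City column built above):
print(df[df['City'] == 'Yên Phụ'])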

A table into a graph (BeautifulSoup in Python)

Is it possible (is there an easy way) to get a table out of a website and then turn it into a graph rather than a table?
Here is the code; it extracts the table into a DataFrame.
#import the library used to query a website
import urllib2
#specify the url
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page)
all_tables=soup.find_all('table')
right_table=soup.find('table', class_='wikitable sortable plainrowheaders')
right_table
#Generate lists
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    states = row.findAll('th') #To store second column data
    if len(cells)==6: #Only extract table body not heading
        A.append(cells[0].find(text=True))
        B.append(states[0].find(text=True))
        C.append(cells[1].find(text=True))
        D.append(cells[2].find(text=True))
        E.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))
        G.append(cells[5].find(text=True))
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Number'])
df['State/UT']=B
df['Admin_Capital']=C
df['Legislative_Capital']=D
df['Judiciary_Capital']=E
df['Year_Capital']=F
df['Former_Capital']=G
df
You can use read_html and select the second table with [1] (read_html returns a list of DataFrames, one for each table in the webpage), then plot with DataFrame.plot:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India', header=0, index_col=0)[1]
print (df)
import matplotlib.pyplot as plt
#there are 2 values of year; if you need the first, add [0], if the second, add [1] after split()
df.loc[2, 'Year capital was established'] = df.loc[2, 'Year capital was established'].split()[0]
df.loc[21, 'Year capital was established'] = df.loc[21, 'Year capital was established'].split()[0]
#convert to number years
df['Year capital was established'] = df['Year capital was established'].astype(int)
df.plot(x='Judiciary capitals', y='Year capital was established')
plt.show()
You can use pandas' read_html function, you just need a table with some good numeric data (see the one in the snippet below). Then use the plot function and you have a good starting point.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area', header=0, index_col=0, skiprows=1)[1]
df.plot(x='sq mi', y='sq mi.2', kind='scatter')
plt.xlabel('Total area [sq mi]')
plt.ylabel('Water [sq mi]')
plt.show()
