I am creating a web scraping program using Python, BeautifulSoup, pandas and Google Sheets.
So far I have managed to scrape data tables from URLs that I get from a list in Google Sheets, and I have created a dataframe for each dataset. Some of the cells in the URL column are empty, which gives me the following error when I try to import the dataframes into another sheet:
MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?
What I'd like to achieve is that, for every empty cell in the URL sheet, an empty dataframe is created, just like the ones with data inside them. Is that possible?
My code so far looks like this:
import gspread
from df2gspread import df2gspread as d2g
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
from bs4 import BeautifulSoup
import pandas as pd
import requests

credentials = service_account.Credentials.from_service_account_file(
    'credentials.json')
scoped_credentials = credentials.with_scopes(
    ['https://spreadsheets.google.com/feeds',
     'https://www.googleapis.com/auth/drive']
)
gc = gspread.Client(auth=scoped_credentials)
gc.session = AuthorizedSession(scoped_credentials)
spreadsheet_key = gc.open_by_key('api_key')

# Data import
data_worksheet = spreadsheet_key.worksheet("Data")

# Urls
url_worksheet = spreadsheet_key.worksheet("Urls")
link_list = url_worksheet.col_values(2)

def get_info(linkIndex):
    page = requests.get(link_list[linkIndex])
    soup = BeautifulSoup(page.content, 'html.parser')
    try:
        tbl = soup.find('table')
        labels = []
        results = []
        for tr in tbl.findAll('tr'):
            headers = [th.text.strip() for th in tr.findAll('th')]
            data = [td.text.strip() for td in tr.findAll('td')]
            labels.append(headers)
            results.append(data)
        final_results = []
        for final_labels, final_data in zip(labels, results):
            final_results.append({'Labels': final_labels, 'Data': final_data})
        df = pd.DataFrame(final_results)
        df['Labels'] = df['Labels'].str[0]
        df['Data'] = df['Data'].str[0]
        indexNames = df[df['Labels'] == 'Links'].index
        df.drop(indexNames, inplace=True)
        set_with_dataframe(data_worksheet, df, col=(linkIndex*6)+1, row=2,
                           include_column_header=False)
    except Exception as e:
        print(e)

for linkInd in range(len(link_list))[1:]:
    get_info(linkInd)
It depends on what you mean by an empty dataframe. If it is a dataframe containing no data at all, it can be created with pd.DataFrame(). If it is a dataframe containing np.NaN / None values in the same columns as the other dataframes, it can be created from a dict:
import numpy as np
import pandas as pd

# x is the number of rows in the dataframe
d = {
    'column1': [np.NaN] * x,
    'column2': [np.NaN] * x,
    'column3': [np.NaN] * x
}
df = pd.DataFrame(d)
At the beginning of the get_info() function, a check should be added:
if link_list[linkIndex] is not None:  # or: if link_list[linkIndex] != '' (depending on the format of an empty cell)
The already existing logic goes into the if branch, and the empty dataframe is created in the else branch. set_with_dataframe() should then be called after the if / else statement, because it is executed in both cases.
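Putting the pieces together, a minimal sketch of the restructured get_info() might look like this (assuming the empty dataframe should mirror the Labels / Data columns produced by the scraping branch; the number of placeholder rows x is illustrative, and numpy is imported as np as in the snippet above):

def get_info(linkIndex):
    if link_list[linkIndex] is not None and link_list[linkIndex] != '':
        # existing scraping logic, condensed for brevity
        page = requests.get(link_list[linkIndex])
        soup = BeautifulSoup(page.content, 'html.parser')
        tbl = soup.find('table')
        rows = [{'Labels': [th.text.strip() for th in tr.findAll('th')],
                 'Data': [td.text.strip() for td in tr.findAll('td')]}
                for tr in tbl.findAll('tr')]
        df = pd.DataFrame(rows)
        df['Labels'] = df['Labels'].str[0]
        df['Data'] = df['Data'].str[0]
        df.drop(df[df['Labels'] == 'Links'].index, inplace=True)
    else:
        # placeholder dataframe with the same columns for an empty cell
        x = 10  # illustrative number of rows
        df = pd.DataFrame({'Labels': [np.NaN] * x, 'Data': [np.NaN] * x})
    # executed in both cases
    set_with_dataframe(data_worksheet, df, col=(linkIndex * 6) + 1, row=2,
                       include_column_header=False)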
# Import libs
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Form data for passing to the request body
formdata = {'objid': '14'}

# URL
url = "https://www.sec.kerala.gov.in/public/getalllbcmp/byd"

# Query
for i in range(1, 15):
    formdata["objid"] = str(i)
    response = requests.request("POST", url, data=formdata, timeout=1500)
    out = response.content
    soup = BeautifulSoup(out, "html.parser")
    bat = json.loads(soup.text)
    df = pd.DataFrame(bat["ops1"])
    df.to_csv(str(i) + ".csv")
Right now this query creates 14 CSV files. What I want is for the for loop to remove the column-header row and append the data to a dataframe I created outside the loop, so that I can get the result as a single CSV file.
I am using BeautifulSoup and pandas.
This is one way of achieving your goal:
# Import libs
import pandas as pd
import requests
from tqdm import tqdm  # if using Jupyter: from tqdm.notebook import tqdm

final_df = pd.DataFrame()

# URL
url = "https://www.sec.kerala.gov.in/public/getalllbcmp/byd"

# Query
for i in tqdm(range(1, 15)):
    formdata = {'objid': i}
    r = requests.post(url, data=formdata)
    df = pd.json_normalize(r.json()["ops1"])
    final_df = pd.concat([final_df, df], axis=0, ignore_index=True)

final_df.to_csv('some_data_saved.csv')
print(final_df)
The data will be saved to a CSV file and also printed in the terminal:
100% 14/14 [00:14<00:00, 1.05s/it]
value text
0 8o7LEdvX2e G14001-Kumbadaje
1 jw2XOQyZ4K G14002-Bellur
2 0lMB1O4LbV G14003-Karadka
3 zodLro2Z39 G14004-Muliyar
4 dWxLYn8ZME G14005-Delampady
... ... ...
1029 Qy6Z09bBKE G01073-Ottoor
1030 ywoXG8wLxV M01001-Neyyattinkara
1031 Kk8Xvz7XO9 M01002-Nedumangad
1032 r7eXQYgX8m M01003-Attingal
1033 b3KXlO2B8g M01004-Varkala
1034 rows × 2 columns
Requests can return responses in JSON format, so you don't need to import bs4 & json.
For tqdm, please see https://pypi.org/project/tqdm/
For the pandas documentation, visit https://pandas.pydata.org/docs/
And for Requests: https://requests.readthedocs.io/en/latest/
I would use a function to get the data and return a DataFrame, then use it within concat:
def get_data(i):
    formdata["objid"] = str(i)
    response = requests.request("POST", url, data=formdata, timeout=1500)
    out = response.content
    soup = BeautifulSoup(out, "html.parser")
    bat = json.loads(soup.text)
    return pd.DataFrame(bat["ops1"])

df = pd.concat([get_data(i) for i in range(1, 15)])
df.to_csv('all_data.csv')
NB: if this gives you unsatisfactory results, please provide a short extract of 2-3 dataframes and the expected merged output.
The following is Python code which prints live data from a data feed vendor's API. I want the data in a pandas DataFrame, but it prints only the following result:
Empty DataFrame
Columns: []
Index: []
from truedata_ws.websocket.TD import TD
import time
import logging
import pandas as pd

username = ''
password = ''
realtime_port = 8084
url = 'push.truedata.in'
symbols = []

td_obj = TD(username, password, live_port=realtime_port, url=url,
            log_level=logging.DEBUG, log_format="%(message)s")

print('\nStarting Real Time Feed.... ')
req_ids = td_obj.start_live_data(symbols)
live_data_objs = {}
time.sleep(1)

for req_id in req_ids:
    print(f'touchlinedata -> {td_obj.touchline_data[req_id]}')

df = pd.DataFrame(live_data_objs)
print(df)

@td_obj.trade_callback
def strategy_callback(symbol_id, tick_data):
    print(f'Trade update > {tick_data}')

while True:
    time.sleep(120)
In your code you pass an empty dictionary (live_data_objs = {}) as the argument when creating the DataFrame; the DataFrame you get back from an empty dictionary will always be empty.
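To illustrate the point with a minimal, library-independent sketch (the column names and values below are purely made up): the dictionary has to be filled with column -> values pairs before it is passed to pd.DataFrame.

import pandas as pd

live_data_objs = {}
print(pd.DataFrame(live_data_objs))  # Empty DataFrame, Columns: [], Index: []

# populate the dict first, e.g. from whatever live ticks you collect
live_data_objs = {
    'symbol': ['AAA', 'BBB'],  # illustrative values
    'ltp': [101.5, 202.0],
}
print(pd.DataFrame(live_data_objs))  # now a 2-row DataFrame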
I am trying to convert multiple HTML tables to a pandas dataframe.
For this task I've defined a function that should return all these HTML tables as a pandas dataframe.
However, the function returns an empty list [] when the idea is that it returns a pandas dataframe.
Here's what I've tried so far:
Getting all the needed links as a list
import requests
from bs4 import BeautifulSoup
import lxml
import html5lib
import pandas as pd
import string

### defining a list for all the needed links ###
first_url = 'https://www.salario.com.br/tabela-salarial/?cargos='
second_url = '#listaSalarial'
allTheLetters = string.ascii_uppercase

links = []
for letter in allTheLetters:
    links.append(first_url + letter + second_url)
Defining a function
### defining function to parse html objects ###
def getUrlTables(links):
    for link in links:
        # requesting link, parsing and finding tag:table #
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        tab_div = soup.find_all('table', {'class': 'listas'})
        # writing html files into directory #
        with open('listas_salariales.html', "w") as file:
            file.write(str(tab_div))
    # reading html file as a pandas dataframe #
    tables = pd.read_html('listas_salariales.html')
    return tables
Testing output
getUrlTables(links)
[]
Am I missing something in getUrlTables()?
Is there an easier way to accomplish this task?
The following code fetches the HTML from all the links, parses it to extract the table data, and constructs one large combined dataframe (I have not stored the intermediate dataframes to disk, which might be needed if the tables become too large):
import requests
from bs4 import BeautifulSoup
import lxml
import html5lib
import pandas as pd
import string

### defining a list for all the needed links ###
first_url = 'https://www.salario.com.br/tabela-salarial/?cargos='
second_url = '#listaSalarial'
allTheLetters = string.ascii_uppercase

links = []
for letter in allTheLetters:
    links.append(first_url + letter + second_url)

### defining function to parse html objects ###
def getUrlTables(links, master_df):
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'lxml')  # using the lxml parser
        try:
            table = soup.find('table', attrs={'class': 'listas'})
            # finding table headers
            heads = table.find('thead').find('tr').find_all('th')
            colnames = [hdr.text for hdr in heads]
            #print(colnames)
            # now extracting the values
            data = {k: [] for k in colnames}
            rows = table.find('tbody').find_all('tr')
            for rw in rows:
                for col in colnames:
                    cell = rw.find('td', attrs={'data-label': '{}'.format(col)})
                    data[col].append(cell.text)
            # constructing a pandas dataframe using the data just parsed
            df = pd.DataFrame.from_dict(data)
            master_df = pd.concat([master_df, df], ignore_index=True)
        except AttributeError as e:
            print('No data from the link: {}'.format(link))
    return master_df

master_df = pd.DataFrame()
master_df = getUrlTables(links, master_df)
print(master_df)
The output from the above code is as follows:
CBO Cargo ... Teto Salarial Salário Hora
0 612510 Abacaxicultor ... 2.116,16 6,86
1 263105 Abade ... 5.031,47 17,25
2 263105 Abadessa ... 5.031,47 17,25
3 622020 Abanador na Agricultura ... 2.075,81 6,27
4 862120 Abastecedor de Caldeira ... 3.793,98 11,65
... ... ... ... ... ...
9345 263110 Zenji (missionário) ... 3.888,52 12,65
9346 723235 Zincador ... 2.583,20 7,78
9347 203010 Zoologista ... 4.615,45 14,21
9348 203010 Zoólogo ... 4.615,45 14,21
9349 223310 Zootecnista ... 5.369,59 16,50
[9350 rows x 8 columns]
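As a side note, pandas.read_html can often parse such tables directly, without the manual data-label parsing above. Whether the resulting columns match the ones produced by the code above depends on how the site renders its headers, so treat this as an optional shortcut to try rather than a drop-in replacement:

import requests
import pandas as pd

dfs = []
for link in links:  # the same list of letter URLs built earlier
    html = requests.get(link).text
    try:
        # read_html returns a list of DataFrames; attrs narrows it to the 'listas' table
        dfs.extend(pd.read_html(html, attrs={'class': 'listas'}))
    except ValueError:
        # raised when no matching table is found on the page
        print('No table found in: {}'.format(link))

combined = pd.concat(dfs, ignore_index=True)
print(combined)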
I am analyzing the balance sheet of Amazon on Yahoo Finance. It contains nested rows, and I cannot extract all of them.
I used BeautifulSoup4 and the Selenium web driver, but my output is missing some of the rows.
The following is the code:
import pandas as pd
from bs4 import BeautifulSoup
import re
from selenium import webdriver
import string
import time

# chart display specifications w/ Panda
pd.options.display.float_format = '{:.0f}'.format
pd.set_option('display.width', None)

is_link = 'https://finance.yahoo.com/quote/AMZN/balance-sheet/'

chrome_path = r"C:\\Users\\hecto\\Documents\\python\\drivers\\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get(is_link)

html = driver.execute_script('return document.body.innerHTML;')
soup = BeautifulSoup(html, 'lxml')
features = soup.find_all('div', class_='D(tbr)')

headers = []
temp_list = []
label_list = []
final = []
index = 0

#create headers
for item in features[0].find_all('div', class_='D(ib)'):
    headers.append(item.text)

#statement contents
while index <= len(features)-1:
    #filter for each line of the statement
    temp = features[index].find_all('div', class_='D(tbc)')
    for line in temp:
        #each item adding to a temporary list
        temp_list.append(line.text)
    #temp_list added to final list
    final.append(temp_list)
    #clear temp_list
    temp_list = []
    index += 1

df = pd.DataFrame(final[1:])
df.columns = headers

#function to make all values numerical
def convert_to_numeric(column):
    first_col = [i.replace(',', '') for i in column]
    second_col = [i.replace('-', '') for i in first_col]
    final_col = pd.to_numeric(second_col)
    return final_col

for column in headers[1:]:
    df[column] = convert_to_numeric(df[column])

final_df = df.fillna('-')
print(df)
Again, I cannot seem to get all the rows of the balance sheet in my output (e.g. Cash, Total Current Assets). Where did I go wrong? Am I missing something?
You may have to click the "Expand All" button to see the additional rows. Refer to this thread to see how to simulate the click in Selenium: python selenium click on button
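A rough sketch of how that click could be simulated before the page source is parsed (the XPath below is an assumption about Yahoo Finance's current markup and may need adjusting; the wait times are also illustrative):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Chrome(chrome_path) as in the question
driver.get('https://finance.yahoo.com/quote/AMZN/balance-sheet/')
time.sleep(5)  # crude wait for the page to render; WebDriverWait would be more robust

# assumed locator: a button whose visible text contains "Expand All"
expand_button = driver.find_element(By.XPATH, "//button[contains(., 'Expand All')]")
expand_button.click()
time.sleep(2)  # give the extra rows time to appear

html = driver.execute_script('return document.body.innerHTML;')
# ...continue with the BeautifulSoup parsing from the original code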
I am currently trying to scrape data from 1001TrackLists, a website that lists tracks in DJ mixes, using BeautifulSoup.
I wrote a script to collect all the track information and create a dataframe. It worked perfectly when I first finished it and returned the dataframe as expected. However, after I closed my Jupyter notebook and restarted Python, the script returns a blank dataframe that contains only the column headers. Each of the lists I populate in the for loops to build the dataframe is also blank.
I've tried restarting my kernel, restarting/clearing output, and restarting my computer - nothing seems to work.
Here's my code so far:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import numpy as np
import re
import urllib.request
import matplotlib.pyplot as plt
url_list = ['https://www.1001tracklists.com/tracklist/yj03rk/joy-orbison-resident-advisor-podcast-331-2012-10-01.html', 'https://www.1001tracklists.com/tracklist/50khrzt/joy-orbison-greenmoney-radio-2009-08-16.html', 'https://www.1001tracklists.com/tracklist/7mzt0y9/boddika-joy-orbison-rinse-fm-hessle-audio-cover-show-2014-01-16.html', 'https://www.1001tracklists.com/tracklist/6l8q8l9/joy-orbison-bbc-radio-1-essential-mix-2014-07-26.html', 'https://www.1001tracklists.com/tracklist/5y6fl1k/kerri-chandler-joy-orbison-ben-ufo-bbc-radio-1-essential-mix-07-18-live-from-lovebox-festival-2015-07-24.html', 'https://www.1001tracklists.com/tracklist/1p6g9u49/joy-orbison-andrew-lyster-nts-radio-2016-07-23.html', 'https://www.1001tracklists.com/tracklist/qgz18zk/joy-orbison-dekmantel-podcast-081-2016-08-01.html', 'https://www.1001tracklists.com/tracklist/26wlts2k/george-fitzgerald-joy-orbison-bbc-radio-1-residency-2016-11-03.html', 'https://www.1001tracklists.com/tracklist/t9gkru9/james-blake-joy-orbison-bbc-radio-1-residency-2018-02-22.html', 'https://www.1001tracklists.com/tracklist/2gfzrxw1/joy-orbison-felix-hall-nts-radio-2019-08-23.html']
djnames = []
tracknumbers = []
tracknames = []
artistnames = []
mixnames = []
dates = []
url_scrape = []
for url in url_list:
    count = 0
    headers = {'User-Agent': 'Chrome/51.0.2704.103'}
    page_link = url
    page_response = requests.get(page_link, headers=headers)
    soup = bs(page_response.content, "html.parser")
    title = (page_link[48:-15])
    title = title.replace('-', ' ')
    title = (title[:-1])
    title = title.title()
    date = (page_link[-15:-5])
    tracknames_scrape = soup.find_all("div", class_="tlToogleData")
    artistnames_scrape = soup.find_all("meta", itemprop="byArtist")
    for (i, track) in enumerate(tracknames_scrape):
        if track.meta:
            trackname = track.meta['content']
            tracknames.append(trackname)
            mixnames.append(title)
            dates.append(date)
            djnames.append('Joy Orbison')
            url_scrape.append(url2)
            count += 1
            tracknumbers.append(count)
        else:
            continue
    for artist in artistnames_scrape:
        artistname = artist["content"]
        artistnames.append(artistname)

df = pd.DataFrame({'DJ Name': djnames, 'Date': dates, 'Mix Name': mixnames, 'Track Number': tracknumbers, 'Track Names': tracknames, 'Artist Names': artistnames, 'URL': url_scrape})
Change the line url_scrape.append(url2) inside the for loop to the following and it works:
url_scrape.append(url)
Otherwise you get NameError: name 'url2' is not defined.