Scrape Embedded Google Sheet from HTML in Python

This one has been relatively tricky for me. I am trying to extract an embedded table, sourced from Google Sheets, in Python.
Here is the link
I do not own the sheet, but it is publicly available.
Here is my code so far. When I print the headers, each one shows up as "". Any help would be greatly appreciated. The end goal is to convert this table into a pandas DataFrame. Thanks, guys.
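import lxml.html as lh
import pandas as pd
import requests  # missing from the original snippet; requests.get() is used below

url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'
page = requests.get(url)
doc = lh.fromstring(page.content)  # parse the raw HTML
tr_elements = doc.xpath('//tr')    # every table row in the document

# Print the text of each cell in the first row and start one (name, values) pair per column
col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))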

If you just want the data in a DataFrame, you can load it directly with pandas:
import pandas as pd
df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727',
                  header=1)[0]
df.drop(columns='1', inplace=True)  # remove the unnecessary index column called "1"
This will give you:
   Target                              Ticker  Acquirer  \
0  Acacia Communications Inc Com       ACIA    Cisco Systems Inc Com
1  Advanced Disposal Services Inc Com  ADSW    Waste Management Inc Com
2  Allergan Plc Com                    AGN     Abbvie Inc Com
3  Ak Steel Holding Corp Com           AKS     Cleveland Cliffs Inc Com
4  Td Ameritrade Holding Corp Com      AMTD    Schwab (Charles) Corp Com

   Ticker.1  Current Price  Take Over Price  Price Diff  % Diff  Date Announced  \
0  CSCO      $68.79         $70.00           $1.21       1.76%   7/9/2019
1  WM        $32.93         $33.15           $0.22       0.67%   4/15/2019
2  ABBV      $197.05        $200.22          $3.17       1.61%   6/25/2019
3  CLF       $2.98          $3.02            $0.04       1.34%   12/3/2019
4  SCHW      $49.31         $51.27           $1.96       3.97%   11/25/2019

   Deal Type
0  Cash
1  Cash
2  C&S
3  Stock
4  Stock
Note that read_html returns a list of DataFrames. In this case there is only one DataFrame, so we select the first (and only) element with [0].
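To see what header=1 and the trailing [0] are doing without hitting the network, here is a minimal sketch. The HTML string is a made-up stand-in for the published sheet (a row of column letters, then the real header row, with row numbers in the first column); the real page's markup may differ. It assumes a parser that read_html can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the published sheet's markup (not the real page)
html = """
<table>
  <tr><td></td><td>A</td><td>B</td></tr>
  <tr><td>1</td><td>Target</td><td>Ticker</td></tr>
  <tr><td>2</td><td>Acacia Communications Inc Com</td><td>ACIA</td></tr>
  <tr><td>3</td><td>Allergan Plc Com</td><td>AGN</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
# header=1 uses the second row as column names and discards the row above it.
tables = pd.read_html(StringIO(html), header=1)
print(len(tables))

df = tables[0]
# Drop the leading row-number column by position rather than by name
df = df.drop(columns=df.columns[0])
print(df)
```

Dropping the first column by position is a small design choice: it avoids depending on whether that column's label comes back as the string "1" or something else.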

Related

Parsing idx file or string to Pandas DataFrame

I would like to parse the following idx file: https://www.sec.gov/Archives/edgar/daily-index/2022/QTR1/company.20220112.idx into a pandas DataFrame.
I use the following code to check what it looks like as a text file:
import os, requests

base_path = '/Users/GunardiLin/Desktop/Insider_Ranking/temp/'
current_dirs = os.listdir(path=base_path)
local_filename = '20200102'
local_file_path = os.path.join(base_path, local_filename)
# Check the directory listing (the original tested `in base_path`, i.e. the path string itself)
if local_filename in current_dirs:
    print(f'Skipping index file for {local_filename} because it is already saved.')
else:
    url = 'https://www.sec.gov/Archives/edgar/daily-index/2020/QTR1/company.20200102.idx'
    r = requests.get(url, stream=True, headers={'user-agent': 'MyName myname#outlook.com'})
    # Stream the response to disk in 10 KB chunks
    with open(local_file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10240):
            f.write(chunk)
Next, I would like to build a fault-tolerant parser, because it should parse a new idx file into a pd.DataFrame every day.
My idea was to use string manipulation, but that would be very complicated and not fault-tolerant.
I would be thankful if someone could show the best practice for parsing this and provide some boilerplate code.

Since this is mostly a fixed-width file, you could use pandas read_fwf to read it. You can skip over the leading information (via skiprows=) and get straight to the data. The column names are predefined and assigned when the file is read:
import pandas as pd
idx_path = 'company.20220112.idx'
names = ['Company Name','Form Type','CIK','Date Filed','File Name']
# colspecs gives the half-open [start, end) character range of each column
df = pd.read_fwf(idx_path, colspecs=[(0,61),(62,74),(74,84),(86,94),(98,146)], names=names, skiprows=11)
df.head(10)
   Company Name                                       Form Type  CIK      Date Filed  File Name
0  005 - Series of IPOSharks Venture Master Fund,...  D          1888451  20220112    edgar/data/1888451/0001888451-22-000002.txt
1  10X Capital Venture Acquisition Corp. III          EFFECT     1848948  20220111    edgar/data/1848948/9999999995-22-000102.txt
2  110 White Partners LLC                             D          1903845  20220112    edgar/data/1903845/0001884293-22-000001.txt
3  15 Beach, MHC                                      3          1903509  20220112    edgar/data/1903509/0001567619-22-001073.txt
4  15 Beach, MHC                                      SC 13D     1903509  20220112    edgar/data/1903509/0000943374-22-000014.txt
5  170 Valley LLC                                     D          1903913  20220112    edgar/data/1903913/0001903913-22-000001.txt
6  1st FRANKLIN FINANCIAL CORP                        424B3      38723    20220112    edgar/data/38723/0000038723-22-000003.txt
7  1st FRANKLIN FINANCIAL CORP                        424B3      38723    20220112    edgar/data/38723/0000038723-22-000004.txt
8  215 BF Associates LLC                              D          1904145  20220112    edgar/data/1904145/0001904145-22-000001.txt
9  2401 Midpoint Drive REIT, LLC                      D          1903337  20220112    edgar/data/1903337/0001903337-22-000001.txt
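Since the question asks for something fault-tolerant, one possible refinement is to locate the dashed rule that precedes the data instead of hard-coding skiprows. The sketch below runs on a made-up sample built in memory; the helper name read_idx, the sample rows, and the column widths are all assumptions for illustration, not the real EDGAR layout.

```python
from io import StringIO

import pandas as pd

names = ["Company Name", "Form Type", "CIK", "Date Filed", "File Name"]
widths = (30, 12, 12, 12)  # assumed widths of the first four columns (sample only)


def fmt(fields):
    """Left-justify the first four fields to fixed widths, then append the last."""
    return "".join(f"{v:<{w}}" for v, w in zip(fields, widths)) + fields[-1]


# Made-up sample: free-text preamble, header line, dashed rule, fixed-width rows
rows = [
    ("ACME HOLDINGS INC", "8-K", "1234567", "20220112",
     "edgar/data/1234567/0000000000-22-000001.txt"),
    ("ZETA PARTNERS LLC", "SC 13D", "7654", "20220112",
     "edgar/data/7654/0000000000-22-000002.txt"),
]
sample = "\n".join(
    ["Description: sample daily index", "Comments: made-up data",
     fmt(names), "-" * 110] + [fmt(r) for r in rows]
) + "\n"


def read_idx(text, colspecs, names):
    """Parse idx-style text, finding the dashed rule rather than fixing skiprows."""
    lines = text.splitlines()
    # Index of the first line that consists only of dashes; data starts after it
    rule = next(i for i, line in enumerate(lines) if set(line.strip()) == {"-"})
    return pd.read_fwf(StringIO(text), colspecs=colspecs, names=names,
                       skiprows=rule + 1, dtype=str)


colspecs = [(0, 30), (30, 42), (42, 54), (54, 66), (66, 120)]
df = read_idx(sample, colspecs, names)
print(df)
```

Reading everything as str keeps identifiers such as the CIK and the filing date from being converted to integers; convert them afterwards if numeric or date types are needed.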

How can I use pd.read_html for scraping HTML tables with % values?

I'm trying to scrape the 'Profile and investment' table from the following url: https://markets.ft.com/data/funds/tearsheet/summary?s=LU0526609390:EUR, using the following code:
import requests
import pandas as pd

# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']
for url in urls:
    r = requests.get(url).content
    df = pd.read_html(r)[0]
    print(df)
However, when I use the pd.read_html function, I get the following error:
ValueError: invalid literal for int() with base 10: '100%'
because the table has entries in %. Is there a way to make pandas accept % values?
My required output is a table with the following format:
Fund_ID Fund_type Income_treatment Morningstar category ......
LU0526609390:EUR ... ... ....
IE00BHBX0Z19:EUR ... ... ....
LU1076093779:EUR ... ... ....
LU1116896363:EUR ... ... ....

The issue is that the site sets the 'colspan' attribute to a percentage ('100%') instead of an integer. As AsishM mentions in the comments:
browsers are usually more lenient with things like %, but the html spec for colspan clearly mentions this should be an integer. Browsers treat 100% as 100. mdn link. It's not a pandas problem per se.
The attribute should be an integer, and while some browsers will accommodate other values, pandas expects the proper syntax:
<td colspan="number">
There are a few ways to approach this:
1. Use BeautifulSoup to fix those attributes.
2. Since the offending attribute is not within the table you actually want to parse, use BeautifulSoup to grab that first table, and then you don't need to worry about it.
3. See if the table has a specific attribute that you could pass to .read_html() as a parameter, so that it grabs only that specific table.
I chose option 2 here:
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']
frames = []
for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    table = soup.find('table')  # first table on the page
    # Wrap the markup in StringIO; recent pandas versions deprecate passing literal HTML strings
    df = pd.read_html(StringIO(str(table)), index_col=0)[0].T
    frames.append(df)
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
results = pd.concat(frames, sort=False).reset_index(drop=True)
print(results)
Output:
print(results.to_string())
0 Fund type Income treatment Morningstar category IMA sector Launch date Price currency Domicile ISIN Manager & start date Investment style (bonds) Investment style (stocks)
0 SICAV Income Global Bond - EUR Hedged -- 06 Aug 2010 GBP Luxembourg LU0526609390 Jonathan Gregory01 Nov 2012Vivek Acharya09 Dec 2015Simon Foster01 Nov 2012 NaN NaN
1 Open Ended Investment Company Income EUR Diversified Bond -- 21 Feb 2014 EUR Ireland IE00BHBX0Z19 Lorenzo Pagani12 May 2017Konstantin Veit01 Jul 2019 Credit Quality: HighInterest-Rate Sensitivity: Mod NaN
2 SICAV Income Eurozone Large-Cap Equity -- 11 Jul 2014 GBP Luxembourg LU1076093779 NaN NaN Market Cap: LargeInvestment Style: Blend
3 SICAV Income EUR Flexible Bond -- 01 Dec 2014 EUR Luxembourg LU1116896363 NaN NaN NaN
Here's how you could use BeautifulSoup to fix those colspan attributes:
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']
for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    # Strip the '%' from every colspan attribute so that it parses as an integer
    all_colspan = soup.find_all(attrs={'colspan': True})
    for colspan in all_colspan:
        colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
    # read_html returns a list containing every table on the page
    df = pd.read_html(StringIO(str(soup)))
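The third option from the list above (passing a table attribute to read_html) is not shown in the answer, so here is a minimal sketch of it. The HTML and the class names "layout" and "profile" are made up for illustration; the real FT page would need to be inspected to find an attribute that identifies the wanted table. It assumes that read_html only parses the tables matched by attrs, which is my understanding of its behaviour rather than something the answer states.

```python
from io import StringIO

import pandas as pd

# Made-up page: the first table carries the invalid colspan="100%",
# the second is the one we want. Both class names are hypothetical.
html = """
<table class="layout"><tr><td colspan="100%">banner</td></tr></table>
<table class="profile">
  <tr><th>Fund type</th><td>SICAV</td></tr>
  <tr><th>Domicile</th><td>Luxembourg</td></tr>
</table>
"""

# attrs restricts read_html to tables whose attributes match
tables = pd.read_html(StringIO(html), attrs={"class": "profile"}, index_col=0)

# Transpose so that each field becomes a column, as in the answer above
profile = tables[0].T
print(profile)
```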

Python BeautifulSoup filter data while parsing a URL

I'm trying to parse these on a daily basis, before the market opens, and I successfully get the list. Now I want to add an additional filter for "strong buys" and "Volume" > 5000000, using the underlying data at https://www.tradingview.com/markets/stocks-usa/market-movers-gainers/
Full code below:
import requests
from bs4 import BeautifulSoup

url = "https://www.tradingview.com/markets/stocks-usa/market-movers-gainers/"
siteinfo = requests.get(url)
i = 0
content = siteinfo.content
html = content
parsed_html = BeautifulSoup(html, features="lxml")
doneList = []
for link in parsed_html.find_all('a'):
    a = link.get('href')
    if "symbol" in str(a) and "-" in str(a):
        if i < 25:
            i += 1
        else:
            x = a.split("-")
            x = x[1].split("/")
            doneList.append(x[0])
            i += 1
print(doneList)

In this particular case, you're probably better off using pandas with a filter that combines multiple conditions:
import pandas as pd
url = 'https://www.tradingview.com/markets/stocks-usa/market-movers-gainers/'
df = pd.read_html(url)[0]
# Create a helper function as a filter - applied row by row, it yields a series of boolean values
def filter_out(row):
    # 'Unnamed: 4' is the buy recommendation and the next one is volume
    if 'Strong' in row['Unnamed: 4'] and 'M' in row['Unnamed: 5']:
        # The condition is a volume above 5M, so compare the number in front of the 'M'.
        # float() is used so that values such as 5.4M are not truncated to 5.
        if float(row['Unnamed: 5'].rstrip('M')) > 5:
            return True
        else:
            return False
    else:
        return False
# Use the boolean values to filter the dataframe:
bulls = df.apply(filter_out, axis=1)
df[bulls]
Output (pardon the formatting):
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 M MRIN Marin Software Incorporated 7.50 96.85% 3.69 Strong Buy 263.387M 41.786M — -1.59 162.00 Technology Services
2 NTLA Intellia Therapeutics, Inc. 133.43 50.21% 44.60 Strong Buy 21.740M 6.054B — -2.46 312.00 Health Technology
3 A AUUD Auddia Inc. 5.89 43.66% 1.79 Strong Buy 36.281M 46.296M — — 11.00 Technology Services
etc. You can then change the column names or do other processing.
EDIT:
To get only the tickers of these companies, use:
ticks = df[bulls]['Unnamed: 0'].to_list()
for tick in ticks:
    print(tick.split(' ')[-2])
Output:
MRIN
NTLA
AUUD
WTT
etc.

First of all, exit code 0 means there was no error in your code. Thus, the index out of range message must come from the print statement print(e), where the exception is handled.
Looking at your code, this is the part most vulnerable to a list index out of range error:
quote = get_quote(stock)
I don't know the inner workings of get_quote, but I guess this is where the error occurred.

Python:Pandas dataframe object unable to convert to string

I would like to get the string "A" instead of the object "A".
>>> comp
                      1
0
marketCapitalization  27879.5
name                  Agilent Technologies Inc
exchange              NEW YORK STOCK EXCHANGE, INC.
country               US
weburl                https://www.agilent.com/
ipo                   1999-11-18
phone                 14083458886
currency              USD
logo                  https://static.finnhub.io/logo/5f1f8412-80eb-1...
ticker                A
marketCapitalization  27879.5
finnhubIndustry       Life Sciences Tools & Services
shareOutstanding      308.777
>>> comp.loc['ticker']
1    A
Name: ticker, dtype: object
I am trying comp.loc['ticker'].astype(str), but it still returns an object. I need it to show only "A".

Maybe try to select the first element with .iloc[0]:
comp.loc['ticker'].iloc[0]
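To illustrate why .iloc[0] helps: comp.loc['ticker'] selects a whole row, which pandas returns as a Series, and astype(str) only changes the element type of that Series, not the container. The sketch below uses a small made-up stand-in for comp with a single column labelled 1, as in the printed output above.

```python
import pandas as pd

# Small stand-in for the `comp` frame: one column labelled 1, fields as the index
comp = pd.DataFrame(
    {1: ["Agilent Technologies Inc", "A", "USD"]},
    index=["name", "ticker", "currency"],
)

as_series = comp.loc["ticker"]           # one row of a DataFrame is a Series
as_scalar = comp.loc["ticker"].iloc[0]   # first element of that Series is a plain str
same = comp.loc["ticker", 1]             # or address row and column label in one step

print(type(as_series).__name__, type(as_scalar).__name__, repr(same))
```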

scraping data from wikipedia table

I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print(Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks

You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df.head()
   Postcode  Borough           Neighbourhood
0  M1A       Not assigned      Not assigned
1  M2A       Not assigned      Not assigned
2  M3A       North York        Parkwoods
3  M4A       North York        Victoria Village
4  M5A       Downtown Toronto  Harbourfront

You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup

website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # to filter out bad rows
Then:
>>> df.head()
   PostalCode  Borough           Neighbourhood
1  M1A         Not assigned      Not assigned
2  M2A         Not assigned      Not assigned
3  M3A         North York        Parkwoods
4  M4A         North York        Victoria Village
5  M5A         Downtown Toronto  Harbourfront
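The same row-by-row idea can be tried offline. The sketch below is a variation, not the answer's exact code: it runs on a made-up excerpt shaped like the Wikipedia table, uses the built-in 'html.parser' (my choice, so that no extra parser is required), and skips the header row while collecting instead of filtering it out afterwards.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up excerpt shaped like the Wikipedia table (not fetched from the site)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

data = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        data.append(cells)

df = pd.DataFrame(data, columns=["PostalCode", "Borough", "Neighbourhood"])
print(df)
```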

Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to the Wikipedia source: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps
