I need to scrape the table of top-level domains from iana.org.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.iana.org/domains/root/db'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='tld-table')
How can I get this into a pandas DataFrame with the same structure as on the website (DOMAIN, TYPE, TLD MANAGER)?
Pandas already comes with a function to read tables from HTML, so there is no need for BeautifulSoup:
import pandas as pd
url = "https://www.iana.org/domains/root/db"
# read_html returns a list of DataFrames, one per table on the page; take the first.
df = pd.read_html(url)[0]
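If read_html ever picks up more than one table on that page, you can target the TLD table directly through the attrs parameter, reusing the id that the question's own soup.find call already relies on (a sketch):

import pandas as pd

url = 'https://www.iana.org/domains/root/db'
# Select the table whose id is 'tld-table' rather than relying on index 0.
df = pd.read_html(url, attrs={'id': 'tld-table'})[0]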
You can use pandas' pd.read_html:
import pandas as pd
URL = "https://www.iana.org/domains/root/db"
df = pd.read_html(URL)[0]
print(df.head())
Domain Type TLD Manager
0 .aaa generic American Automobile Association, Inc.
1 .aarp generic AARP
2 .abarth generic Fiat Chrysler Automobiles N.V.
3 .abb generic ABB Ltd
4 .abbott generic Abbott Laboratories, Inc.
I would like to create a dataframe by pulling only certain information from this website.
https://www.stockrover.com/build/production/Research/tail.js?1644930560
I would like to pull all the entries like this one. ["0005.HK","HSBC HOLDINGS","",""]
Another problem: I only want the first 20,000 lines, which contain the stock information; there is other information after line 20,000 that I don't want in the dataframe.
To summarize, could someone show me how to pull out just the information I'm trying to extract and create a dataframe from those results, if this is possible?
A sample of the website results
function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],["0006.HK","Power Assets Holdings Ltd","",""],["000660.KS","SK hynix","",""],["004370.KS","Nongshim","",""],["005930.KS","Samsung Electroni","",""],["0123.HK","YUEXIU PROPERTY","",""],["0336.HK","HUABAO INTL","",""],["0408.HK","YIP'S CHEMICAL","",""],["0522.HK","ASM PACIFIC","",""],["0688.HK","CHINA OVERSEAS","",""],["0700.HK","TENCENT","",""],["0762.HK","CHINA UNICOM","",""],["0808.HK","PROSPERITY REIT","",""],["0813.HK","SHIMAO PROPERTY",
Code that pulls all lines, including the ones I don't want:
import requests
import pandas as pd

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
response = requests.get(url)
print(response.text)
Use a regex to extract the details, followed by literal_eval to convert the string to a Python object:
import re
from ast import literal_eval

import pandas as pd
import requests

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
response = requests.get(url)

# Non-greedy capture of everything between "return" and the function's closing brace.
regex_ = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)}", re.DOTALL)

# literal_eval turns the JavaScript array literal into a Python list of lists.
print(pd.DataFrame(literal_eval(regex_.search(response.text).group(1))))
0 1 2 3
0 0005.HK HSBC HOLDINGS
1 0006.HK Power Assets Holdings Ltd
2 000660.KS SK hynix
3 004370.KS Nongshim
4 005930.KS Samsung Electroni
... ... ... ... ..
21426 ZZHGF ZhongAn Online P&C _INSUP
21427 ZZHGY ZhongAn Online P&C _INSUP
21428 ZZLL ZZLL Information Tech _INTEC
21429 ZZZ.TO Sleep Country Canada _SPECR
21430 ZZZOF Zinc One Resources _OTHEI
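As for the second part of the question: the non-greedy regex already stops at the first closing brace of getStocksLibraryArray, so content after the stock array is excluded. If you still want to cap the result at the first 20,000 rows, slicing the DataFrame is enough; the column names below are assumptions, since the source arrays are unlabeled:

df = pd.DataFrame(literal_eval(regex_.search(response.text).group(1)),
                  columns=['Symbol', 'Name', 'Extra1', 'Extra2'])  # assumed column names
df = df.iloc[:20000]  # keep at most the first 20,000 rows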
I'm trying to scrape the 'Profile and investment' table from the following url: https://markets.ft.com/data/funds/tearsheet/summary?s=LU0526609390:EUR, using the following code:
import requests
import pandas as pd
# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']

for url in urls:
    r = requests.get(url).content
    df = pd.read_html(r)[0]
    print(df)
However, when I use the pd.read_html function, I get the following error:
ValueError: invalid literal for int() with base 10: '100%'
because the table has entries in %. Is there a way to make Pandas accept % values?
My required output is to get a table with the following format:
Fund_ID Fund_type Income_treatment Morningstar category ......
LU0526609390:EUR ... ... ....
IE00BHBX0Z19:EUR ... ... ....
LU1076093779:EUR ... ... ....
LU1116896363:EUR ... ... ....
The issue is that the site's 'colspan' attributes use a % value instead of an int. As AsishM mentions in the comments:
browsers are usually more lenient with things like %, but the html spec for colspan clearly mentions this should be an integer. Browsers treat 100% as 100. mdn link. It's not a pandas problem per se.
These should be ints, and while some browsers will accommodate the % form, pandas specifically wants the appropriate syntax:
<td colspan="number">
Ways to approach this:
Use BeautifulSoup to fix those attributes
Since the offending colspan isn't within the table you actually want to parse, use BeautifulSoup to grab just that first table, and then you don't need to worry about it.
See if the table has a distinguishing attribute you could pass to .read_html() as a parameter so it grabs only that specific table (see the sketch after this list).
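For option 3, a minimal sketch using read_html's attrs parameter; the class name below is a hypothetical placeholder, so inspect the page to find whatever attribute the 'Profile and investment' table actually carries:

import requests
import pandas as pd

r = requests.get('https://markets.ft.com/data/funds/tearsheet/summary?s=LU0526609390:EUR').content
# attrs restricts parsing to tables matching these HTML attributes;
# 'mod-ui-table' is an assumed class name, not confirmed from the page.
df = pd.read_html(r, attrs={'class': 'mod-ui-table'})[0]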
I chose option 2 here:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all urls required for data scraping from the FT website - if a new fund is added, simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']

frames = []
for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    table = soup.find('table')  # the first table on the page is 'Profile and investment'
    df = pd.read_html(str(table), index_col=0)[0].T
    frames.append(df)

# DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead.
results = pd.concat(frames, sort=False).reset_index(drop=True)
print(results)
Output:
print(results.to_string())
0 Fund type Income treatment Morningstar category IMA sector Launch date Price currency Domicile ISIN Manager & start date Investment style (bonds) Investment style (stocks)
0 SICAV Income Global Bond - EUR Hedged -- 06 Aug 2010 GBP Luxembourg LU0526609390 Jonathan Gregory01 Nov 2012Vivek Acharya09 Dec 2015Simon Foster01 Nov 2012 NaN NaN
1 Open Ended Investment Company Income EUR Diversified Bond -- 21 Feb 2014 EUR Ireland IE00BHBX0Z19 Lorenzo Pagani12 May 2017Konstantin Veit01 Jul 2019 Credit Quality: HighInterest-Rate Sensitivity: Mod NaN
2 SICAV Income Eurozone Large-Cap Equity -- 11 Jul 2014 GBP Luxembourg LU1076093779 NaN NaN Market Cap: LargeInvestment Style: Blend
3 SICAV Income EUR Flexible Bond -- 01 Dec 2014 EUR Luxembourg LU1116896363 NaN NaN NaN
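The requested output also has a Fund_ID column, which the transpose above drops. One way to add it (a sketch reusing the same loop, with the ID taken from the List entry) is to insert it into each frame before concatenating:

frames = []
for fund_id in List:
    r = requests.get('https://markets.ft.com/data/funds/tearsheet/summary?s=' + fund_id).content
    soup = BeautifulSoup(r, 'html.parser')
    df = pd.read_html(str(soup.find('table')), index_col=0)[0].T
    df.insert(0, 'Fund_ID', fund_id)  # prepend the fund ID as its own column
    frames.append(df)

results = pd.concat(frames, sort=False).reset_index(drop=True)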
Here's how you could use BeautifulSoup to fix those colspan attributes.
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Define all urls required for data scraping from the FT website - if a new fund is added, simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']

for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    # Strip the '%' from every colspan so pd.read_html can parse the whole page.
    all_colspan = soup.find_all(attrs={'colspan': True})
    for colspan in all_colspan:
        colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
    df = pd.read_html(str(soup))
I'm just trying to scrape data from a wikipedia table into a pandas dataframe.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print(Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
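If the table you want is not the first on the page, read_html can also filter by text content via its match parameter, a sketch:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Only tables whose text matches this regex are returned.
df = pd.read_html(url, header=0, match='Postcode')[0]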
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup

website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text, 'xml')

table = soup.find('table', {'class': 'wikitable sortable'})
table_rows = table.find_all('tr')

data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # to filter out bad rows
then
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
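Note that the header row of a wikitable uses th cells, which the td-only query above skips (hence the isnull filter and the missing row 0). find_all also accepts a list of tag names if you want to capture those cells too, a small variation:

for row in table_rows:
    # 'td' and 'th' together pick up the header row as well.
    data.append([cell.text.strip() for cell in row.find_all(['td', 'th'])])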
Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to their Wikipedia section: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps
I'm just beginning to dabble with Python, and as many have done I am starting with a web-scraping example to try the language.
I have seen many examples of using zip and map to combine lists, but I am having issues attempting to have that list print.
Again, I am new so please be gentle.
The code gathers everything from 2 certain tag types (the date and title of a post) and returns them as 2 lists. For this I am using BeautifulSoup and requests.
The site I am practicing on for this test is the blog for a small game called 'Staxel'
I can get my code to print a full list of one tag using soup.find and print in a for loop, but when I attempt to add a 2nd list to print, it simply terminates with no error.
Any tips on how to correctly print the 2 lists?
I am looking for output like
Entry 2019-01-06 New Years
Entry 2018-11-30 Staxel Changelog for 1.3.52
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# grab the post titles and dates
title_box = soup.find_all('h1',attrs={'class':'entry-title'})
date_box = soup.find_all('span',attrs={'class':'entry-date published'})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip()for date in date_box]
date_list = zip(dates, titles)
for heading in date_list:
    print ("Entry {}")
The problem is that your query for dates returns an empty list, so the zipped result is also empty. To extract the date from that page, you want to look for tags of type time, not span, with class entry-date published, like this:
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
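A quick way to catch this kind of silent failure is to compare the list lengths before zipping, since zip simply stops at the shorter input, a sketch:

print(len(titles), len(dates))  # a 0 here means a selector matched nothing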
With the corrected selector, the full code becomes:
import requests
from bs4 import BeautifulSoup
quote_page = "https://blog.playstaxel.com"
page = requests.get(quote_page)
soup = BeautifulSoup(page.content, "lxml")
title_box = soup.find_all("h1", attrs={"class": "entry-title"})
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]
for date, title in zip(dates, titles):
    print(f"{date}: {title}")
The result becomes:
2019-01-10: Magic update – feature preview
2019-01-06: New Years
2018-11-30: Staxel Changelog for 1.3.52
2018-11-13: Staxel Changelog for 1.3.49
2018-10-21: Staxel Changelog for 1.3.48
2018-10-12: Halloween Update & GOG
I am trying to use BeautifulSoup to scrape a list of properties from a real estate website and pass them into a data table. I am using Python 3.
The following code works to print the required data, but I need a way to output the data into a table. The li tags come in groups of 3: a property number (1-50), a tenant name and a square footage. Ideally the output would be structured as a data frame with the column headers number, tenant and square footage.
from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get("http://properties.kimcorealty.com/properties/0014/")
soup = BeautifulSoup(page.content, 'html.parser')

start = soup.find('div', {'id': 'units_box_1'})
for litag in start.find_all('li'):
    print(litag.text)

start = soup.find('div', {'id': 'units_box_2'})
for litag in start.find_all('li'):
    print(litag.text)

start = soup.find('div', {'id': 'units_box_3'})
for litag in start.find_all('li'):
    print(litag.text)
You can do it like this: get all the divs in one go, then find the enclosing "a" tags, each of which wraps a group of 3 "li" tags containing one set of data.
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/properties/0014/")
soup = BeautifulSoup(page.content, 'html.parser')

table = []
# Find all the divs we need in one go.
divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
for div in divs:
    # Find all the enclosing a tags.
    anchors = div.find_all('a')
    for anchor in anchors:
        # Now we have groups of 3 list item (li) tags.
        lis = anchor.find_all('li')
        # Clean up the text from the group of 3 li tags and add them as a row to our table list.
        table.append([unicodedata.normalize("NFKD", lis[0].text).strip(),
                      lis[1].text,
                      lis[2].text.strip()])

# We have all the data, so we add it to a DataFrame.
headers = ['Number', 'Tenant', 'Square Footage']
df = DataFrame(table, columns=headers)
print(df)
Outputs:
Number Tenant Square Footage
0 1 Nordstrom Rack 34,032
1 2 Total Wine & More 29,981
2 3 Thomasville Furniture 10,628
...
47 49 Jo-Ann Fabrics 45,940
48 50 Available 32,572
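If the enclosing anchor structure ever changes, a hedged alternative is to collect every li under the three divs and chunk the flat list into rows of three; this assumes the items really do repeat in strict number/tenant/footage order:

lis = [li.text.strip() for div in divs for li in div.find_all('li')]
# Chunk the flat list into rows of three: number, tenant, square footage.
rows = [lis[i:i + 3] for i in range(0, len(lis), 3)]
df = DataFrame(rows, columns=['Number', 'Tenant', 'Square Footage'])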