I have been trying to scrape ETF data from iShares.com for an ongoing project for a while now. I am trying to create web scrapers for multiple websites, but they are all essentially identical. I run into two issues:
1. I keep getting the error "AttributeError: 'NoneType' object has no attribute 'tr'", although I am quite sure that I have chosen the correct table.
2. When I use "Inspect element" on some of the websites, I have to click "Show more" in order to see the code for all of the rows.
I am not a computer scientist, but I have tried many different approaches which have sadly all been unsuccessful, so I hope you can help.
The URL: https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf
The table can be found on the page under "Holdings". Alternatively, it can be reached via the following paths:
JS path: document.querySelector("#allHoldingsTable > tbody")
XPath: //*[@id="allHoldingsTable"]/tbody
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
    'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf'
]

all_data = []
for url in urls:
    print("Loading URL {}".format(url))

    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # find correct table:
    tbl = soup.select_one(".allHoldingsTable")

    # remove the first row (it's not header):
    tbl.tr.extract()

    # convert the html to pandas DF:
    df = pd.read_html(str(tbl), thousands='.', decimal=',')[0]

    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])

    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])
df = pd.concat(all_data, ignore_index=True)
print(df)
from openpyxl import load_workbook

path = '/Users/karlemilthulstrup/Downloads/ishares.xlsx'
book = load_workbook(path, read_only=False, keep_vba=True)

writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book

df.to_excel(writer, index=False)
writer.save()
writer.close()
As stated in the comments, the data is dynamically rendered. If you don't want to go the route of accessing the data directly, you could use something like Selenium, which lets the page render first; then you can parse it the way you do above.
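If you go the Selenium route, a minimal sketch could look like the following (assumptions: ChromeDriver is installed, the table keeps the id "allHoldingsTable" from your JS path, and a fixed sleep is enough for the holdings table to render; an explicit wait would be more robust):
import time
import pandas as pd
from selenium import webdriver

url = 'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf'

driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)  # crude wait for the JavaScript-rendered holdings table

# once rendered, the page source can be parsed like static HTML
dfs = pd.read_html(driver.page_source, attrs={'id': 'allHoldingsTable'})
driver.quit()

df = dfs[0]
print(df.head())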
Also, there's a button on the page that will download the holdings as a CSV for you. Why not just use that?
But if you must scrape the page, the data is available in JSON format. Just fetch and parse it:
import requests
import json
import pandas as pd
url = 'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf/1506575576011.ajax?tab=all&fileType=json'
r = requests.get(url)
r.encoding='utf-8-sig'
jsonData = json.loads(r.text)
rows = []
for each in jsonData['aaData']:
    row = {'Issuer Ticker': each[0],
           'Name': each[1],
           'Sector': each[2],
           'Asset Class': each[3],
           'Market Value': each[4]['display'],
           'Market Value Raw': each[4]['raw'],
           'Weight (%)': each[5]['display'],
           'Weight (%) Raw': each[5]['raw'],
           'Notional Value': each[6]['display'],
           'Notional Value Raw': each[6]['raw'],
           'Nominal': each[7]['display'],
           'Nominal Raw': each[7]['raw'],
           'ISIN': each[8],
           'Price': each[9]['display'],
           'Price Raw': each[9]['raw'],
           'Location': each[10],
           'Exchange': each[11],
           'Market Currency': each[12]}
    rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df)
Issuer Ticker ... Market Currency
0 VZ ... USD
1 ROG ... CHF
2 NESN ... CHF
3 WM ... USD
4 PEP ... USD
.. ... ... ...
309 ESH2 ... USD
310 TUH2 ... USD
311 JPY ... USD
312 MARGIN_JPY ... JPY
313 MARGIN_SGD ... SGD
[314 rows x 18 columns]
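If you still want the result in an Excel file, as in your original script, a short follow-up continuing from the df built above (this writes a new workbook rather than appending to an existing one; the filename is only an example):
# write the scraped holdings to a new workbook; the filename is only an example
df.to_excel('ishares_holdings.xlsx', index=False)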
I am new to Python and trying to download countries' GDP per capita data. I am trying to read the data from this website: https://worldpopulationreview.com/countries/by-gdp
I tried to read the data, but I get a "No tables found" error.
I can see the data is in r.text, but somehow pandas cannot read the table.
How can I solve the problem and read the data?
MWE
import pandas as pd
import requests
url = "https://worldpopulationreview.com/countries/by-gdp"
r = requests.get(url)
raw_html = r.text # I can see the data is here, but pd.read_html says no tables found
df_list = pd.read_html(raw_html)
print(len(df_list))
The data is embedded in a <script id="__NEXT_DATA__" type="application/json"> tag and only rendered into a table by the browser, so you have to adjust your script a bit:
pd.json_normalize(
    json.loads(
        BeautifulSoup(
            requests.get(url).text
        ).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)
Example
import pandas as pd
import requests,json
from bs4 import BeautifulSoup
url = "https://worldpopulationreview.com/countries/by-gdp"
df = pd.json_normalize(
    json.loads(
        BeautifulSoup(
            requests.get(url).text
        ).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['data']
)

df[['continent', 'country', 'pop', 'imfGDP', 'unGDP', 'gdpPerCapita']]
Output
         continent                   country          pop       imfGDP           unGDP  gdpPerCapita
0    North America             United States       338290  2.08938e+13  18624475000000       61762.9
1             Asia                     China  1.42589e+06  1.48626e+13  11218281029298       10423.4
..             ...                       ...          ...          ...             ...           ...
210           Asia                     Syria      22125.2            0     22163075121       1001.71
211  North America  Turks and Caicos Islands       45.703            0       917550492       20076.4
I have scraped some day-by-day updated data (only numbers). I want to show it in a proper table (DataFrame). I don't know how to use pandas. I am using Python, and the end result should look like a table with defined keys. Thanks.
And here is my Python code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/country/Austria/'
page = requests.get(url)
soup = BeautifulSoup(page.text , 'html.parser')
# RECOVERED, DEATHS AND TOTAL CASES
Covid_Cases_Array = []
get_Covid_Cases = soup.find_all(class_='maincounter-number')
for item in get_Covid_Cases:
    Covid_Cases_Array.append(item.text)
    print(item.text)

# ACTIVE AND CLOSED DATA
Covid_Active_Closed = []
get_Activ_Closed = soup.find_all(class_='number-table-main')
for item in get_Activ_Closed:
    Covid_Active_Closed.append(item.text)
    print(item.text)
And the result of that code:
600,089
9,997
563,256
26,836
573,253
You can use this example of how to get the data from the page into a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.worldometers.info/coronavirus/country/Austria/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
cases, deaths, recovered = soup.select(".maincounter-number")
active_cases, closed_cases = soup.select(".number-table-main")
active_cases_mild, active_cases_serious, _, _ = soup.select(".number-table")
df = pd.DataFrame(
    {
        "Coronavirus Cases": [cases.get_text(strip=True)],
        "Deaths": [deaths.get_text(strip=True)],
        "Recovered": [recovered.get_text(strip=True)],
        "Currently infected": [active_cases.get_text(strip=True)],
        "Closed cases": [closed_cases.get_text(strip=True)],
        "Active cases (mild)": [active_cases_mild.get_text(strip=True)],
        "Active cases (serious)": [active_cases_serious.get_text(strip=True)],
    }
)
print(df)
Prints:
Coronavirus Cases Deaths Recovered Currently infected Closed cases Active cases (mild) Active cases (serious)
0 600,089 9,997 563,256 26,836 573,253 26,279 557
I am trying to scrape a Zoho Analytics table from this webpage for a project at university. For the moment I have no ideas. I can't see the values in the inspector, and therefore I cannot use BeautifulSoup in Python (my favourite).
Does anybody have any idea?
Thanks a lot,
Joseph
I tried it with BeautifulSoup; it seems you can't soup the values inside the table because they are not in the page itself but stored externally.
EDIT:
https://analytics.zoho.com/open-view/938032000481034014
This is the link where the table and its data are stored.
So I tried scraping from it with bs4 and it works.
The class of the rows is "zdbDataRowDiv"
Try:
container = page_soup.findAll("div", {"class": "zdbDataRowDiv"})
Code explanation:
container # the variable where your data is stored, name it how you like
page_soup # your html page you souped with BeautifulSoup
findAll("tag",{"attribute":"value"}) # this function finds every tag which has the specific value inside its attribute
They are stored within the <script> tags in JSON format. It's just a matter of pulling those out and parsing them:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
url = 'https://flo.uri.sh/visualisation/4540617/embed'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'var _Flourish_data_column_names = ' in script.text:
        json_str = script.text
        col_names = json_str.split('var _Flourish_data_column_names = ')[-1].split(',\n')[0]
        cols = json.loads(col_names)
        data = json_str.split('_Flourish_data = ')[-1].split(',\n')[0]

loop = True
while loop == True:
    try:
        jsonData = json.loads(data)
        loop = False
        break
    except:
        data = data.rsplit(';', 1)[0]

rows = []
headers = cols['rows']['columns']
for row in jsonData['rows']:
    rows.append(row['columns'])

table = pd.DataFrame(rows, columns=headers)

for col in headers[1:]:
    table.loc[table[col] != '', col] = 'A'
Output:
print (table)
Company Climate change Forests Water security
0 Danone A A A
1 FIRMENICH SA A A A
2 FUJI OIL HOLDINGS INC. A A A
3 HP Inc A A A
4 KAO Corporation A A A
.. ... ... ... ...
308 Woolworths Limited A
309 Workspace Group A
310 Yokogawa Electric Corporation A A
311 Yuanta Financial Holdings A
312 Zalando SE A
[313 rows x 4 columns]
I am scraping a website table from https://csr.gov.in/companyprofile.php?year=FY+2015-16&CIN=L00000CH1990PLC010573, but I am not getting the exact result I am looking for. I want 11 columns from this link: "Company Name", "Class", "State", "Company Type", "RoC", "Sub Category", "Listing Status". Those are 7 columns; after that you can see an expand button "CSR Details of FY 2017-18", and when you click on it you get 4 more columns: "Average Net Profit", "CSR Prescribed Expenditure", "CSR Spent", "Local Area Spent". I want all these columns in a CSV file. I wrote some code, but it is not working properly. I am attaching an image of the result for reference. Here is my code; please help me get this data.
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import csv
driver = webdriver.Chrome()
url_file = "csrdata.txt"
with open(url_file, "r") as url:
    url_pages = url.read()

# we need to split the urls into a list to make it iterable
pages = url_pages.split("\n")  # split by lines using \n

data = []
# now we run a for loop to visit the urls one by one
for single_page in pages:
    driver.get(single_page)
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')

    driver.find_element_by_link_text("CSR Details of FY 2017-18").click()
    table = driver.find_elements_by_xpath("//*[contains(@id,'colfy4')]")
    about = table.__getitem__(0).text
    x = about.split('\n')
    print(x)
    data.append(x)

df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('csr.csv')
You don't need to use Selenium, since all the information is inside the HTML code. You can also use pandas' built-in function pd.read_html() to transform an HTML table directly into a DataFrame.
data = []
for single_page in pages:
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')
    table = soup.find_all('table')  # finds all tables
    table_top = pd.read_html(str(table))[0]  # the top table
    try:  # try to get the other table if it exists
        table_extra = pd.read_html(str(table))[7]
    except:
        table_extra = pd.DataFrame()
    result = pd.concat([table_top, table_extra])
    data.append(result)

pd.concat(data).to_csv('test.csv')
output:
0 1
0 Class Public
1 State Chandigarh
2 Company Type Other than Govt.
3 RoC RoC-Chandigarh
4 Sub Category Company limited by shares
5 Listing Status Listed
0 Average Net Profit 0
1 CSR Prescribed Expenditure 0
2 CSR Spent 0
3 Local Area Spent 0
I wanted to scrape some specific columns (the Company Details column) of the CNBC Nasdaq 100 website, specifically for the Adobe stock. Below is a snippet of my code:
# Importing Libraries
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
def get_company_info(url):
    original_url = url
    key = {}
    l = []
    page_response = requests.get(url, timeout=240)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    name = page_content.find('div', {"class": "quote-section-header large-header"}).find("span", {"class": "symbol"}).text
    description = page_content.find_all('div', {"class": "moduleBox"})

    for items in description:
        for i in range(len(items.find_all("tr")) - 1):
            # Gather data
            key["stock_desc"] = items.find_all("td", {"class": "desc"})[i].find('div', attrs={'id': 'descLong'}).text
            shares = items.find_all("td").find("table", attrs={"id": "shares"})
            for rest_of_items in shares:
                for i in range(len(items.find_all("tr")) - 1):
                    key["stock_outstanding-shares"] = items.find_all("td", {"class": "bold aRit"})[i].text
                    key["stock_ownership"] = items.find_all("td", {"class": "bold aRit"})[i].text
                    key["stock_market_cap"] = items.find_all("td", {"class": "bold aRit"})[i].text
                    key["stock_lastSplit"] = items.find_all("td", {"class": "bold aRit"})[i].text
        l.append(key)
        key['name'] = name

    df = pd.DataFrame(l)
    print(df)
    return key, df

get_company_info("https://www.cnbc.com/quotes/?symbol=ADBE&tab=profile")
So, I'm keen to get the result in a DataFrame so that I can convert it to a CSV file, but my code keeps showing an empty DataFrame. Below is the error shown:
The result I want looks something like this:
The information you are looking for is not available in the URL you requested. This is because the information is fetched by the page using JavaScript, which in turn requests a different URL that provides the data.
Example code
from bs4 import BeautifulSoup
import requests

page = requests.get("https://apps.cnbc.com/view.asp?symbol=ADBE.O&uid=stocks/summary")
soup = BeautifulSoup(page.content, 'html.parser')

Name = soup.find("h5", id="companyName").text
stock_desc = soup.find("div", id="descLong").text
table = soup.find("table", id="shares")
details = table.find_all("td", class_="bold aRit")
stock_outstanding_shares = details[0].text
stock_ownership = details[1].text
stock_market_cap = details[2].text
stock_lastSplit = details[3].text
You can then create a DataFrame from these values and export it to CSV.
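For example, a minimal sketch continuing from the variables scraped above (the column names and the output filename are only placeholders):
import pandas as pd

# build a one-row DataFrame from the scraped values and export it;
# column names and the filename are only examples
df = pd.DataFrame([{
    "Name": Name,
    "Description": stock_desc,
    "Outstanding Shares": stock_outstanding_shares,
    "Ownership": stock_ownership,
    "Market Cap": stock_market_cap,
    "Last Split": stock_lastSplit,
}])

df.to_csv("adbe_profile.csv", index=False)
print(df)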