Web Scraping data with BS4 - Python - python

I have been trying to export a web scraped document from the below code.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url="https://www.marketwatch.com/tools/markets/stocks/country/sri-lanka/1"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
cse = pd.DataFrame(columns=["Name", "Exchange", "Sector"])
for row in soup.find('tbody').find('tr'): ##for row in soup.find("tbody").find_all('tr'):
col = row.find("td")
Name = col[0].text
Exchange = col[1].text
Sector = col[2].text
cse = cse.append({"Name":Company_Name,"Exchange":Exchange_code,"Sector":Industry}, ignore_index=True)
but I am receiving an error 'TypeError: 'int' object is not subscriptable'. Can anyone help me to crack this out?

You need to know the difference between .find() and .find_all().
The only difference is that find_all() returns a list containing the single result, and find() just returns the result.
Since you are using col = row.find_all("td"), col is not a list. So you get this error -
'TypeError: 'int' object is not subscriptable'
Since you need to iterate over all the <tr> and inturn <td> inside every <tr>, you have to use find_all().
You can try this out.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url="https://www.marketwatch.com/tools/markets/stocks/country/sri-lanka/1"
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')
cse = pd.DataFrame(columns=["Name", "Exchange", "Sector"])
for row in soup.find('tbody').find_all('tr'):
col = row.find_all("td")
Company_Name = col[0].text
Exchange_code = col[1].text
Industry = col[2].text
cse = cse.append({"Name":Company_Name,"Exchange":Exchange_code,"Sector":Industry}, ignore_index=True)
Name ... Sector
0 Abans Electricals PLC (ABAN.N0000) ... Housewares
1 Abans Finance PLC (AFSL.N0000) ... Finance Companies
2 Access Engineering PLC (AEL.N0000) ... Construction
3 ACL Cables PLC (ACL.N0000) ... Industrial Electronics
4 ACL Plastics PLC (APLA.N0000) ... Industrial Products
.. ... ... ...
145 Lanka Hospital Corp. PLC (LHCL.N0000) ... Healthcare Provision
146 Lanka IOC PLC (LIOC.N0000) ... Specialty Retail
147 Lanka Milk Foods (CWE) PLC (LMF.N0000) ... Food Products
148 Lanka Realty Investments PLC (ASCO.N0000) ... Real Estate Developers
149 Lanka Tiles PLC (TILE.N0000) ... Building Materials/Products
[150 rows x 3 columns]

Related

Extracting values from HTML in python

My project involves web scraping using python. In my project I need to get data about a given its registration. I have managed to get the html from the site into python but I am struggling to extract the values.
I am using this website: https://www.carcheck.co.uk/audi/N18CTN
from bs4 import BeautifulSoup
import requests
url = "https://www.carcheck.co.uk/audi/N18CTN"
r= requests.get(url)
soup = BeautifulSoup(r.text)
print(soup)
I need to get this information about the vehicle
<td>AUDI</td>
</tr>
<tr>
<th>Model</th>
<td>A3</td>
</tr>
<tr>
<th>Colour</th>
<td>Red</td>
</tr>
<tr>
<th>Year of manufacture</th>
<td>2017</td>
</tr>
<tr>
<th>Top speed</th>
<td>147 mph</td>
</tr>
<tr>
<th>Gearbox</th>
<td>6 speed automatic</td>
How would I go about doing this?
Since you don't have extensive experience with BeautifulSoup, you can effortlessly match the table containing the car information using a CSS selector and then you can extract the header and data rows to combine them into a dictionary:
import requests
from bs4 import BeautifulSoup
url = "https://www.carcheck.co.uk/audi/N18CTN"
soup = BeautifulSoup(requests.get(url).text, "lxml")
# Select the table containing the car information using CSS selector
table = soup.select_one("div.page:nth-child(2) > div:nth-child(4) > div:nth-child(1) > table:nth-child(1)")
# Extract header rows from the table and store them in a list
headers = [th.text for th in table.select("th")]
# Extract data rows from the table and store them in a list
data = [td.text for td in table.select("td")]
# Combine header rows and data rows into a dictionary using a dict comprehension
car_info = {key: value for key, value in zip(headers, data)}
print(car_info)
Ouput:
{'Make': 'AUDI', 'Model': 'A3', 'Colour': 'Red', 'Year of manufacture': '2017', 'Top speed': '147 mph', 'Gearbox': '6 speed automatic'}
In order to obtain the CSS selector pattern of the table you can use the devtools of your browser:
You can use this example to get you started how to get information from this page:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.carcheck.co.uk/audi/N18CTN'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('tr:has(th):has(td):not(:has(table))'):
header = row.find_previous('h1').text.strip()
title = row.th.text.strip()
text = row.td.text.strip()
all_data.append((header, title, text))
df = pd.DataFrame(all_data, columns = ['Header', 'Title', 'Value'])
print(df.head(20).to_markdown(index=False))
Prints:
Header
Title
Value
General information
Make
AUDI
General information
Model
A3
General information
Colour
Red
General information
Year of manufacture
2017
General information
Top speed
147 mph
General information
Gearbox
6 speed automatic
Engine & fuel consumption
Power
135 kW / 184 HP
Engine & fuel consumption
Engine capacity
1.968 cc
Engine & fuel consumption
Cylinders
4
Engine & fuel consumption
Fuel type
Diesel
Engine & fuel consumption
Consumption city
42.0 mpg
Engine & fuel consumption
Consumption extra urban
52.3 mpg
Engine & fuel consumption
Consumption combined
48.0 mpg
Engine & fuel consumption
CO2 emission
129 g/km
Engine & fuel consumption
CO2 label
D
MOT history
MOT expiry date
2023-10-27
MOT history
MOT pass rate
83 %
MOT history
MOT passed
5
MOT history
Failed MOT tests
1
MOT history
Total advice items
11

Scraping Data with Beautiful Soup Issues

I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order. I am using BeautifulSoup to perform this task, but I'm having some issues. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
name = item.select_one('bau astronaut_cell__title bold mr05')
country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
data.append([name,country])
df = pd.DataFrame(data)
df
df is returning an empty list. Not sure what is going on. When I take the code out of the for loop, it can't seem to find the select_one function. Function should be coming from bs4 - not sure why that's not working. Also, is there a repeatable pattern for web scraping that I'm missing? Seems like it's a different beast every time I try to tackle these kinds of problems.
Any help would be appreciated! Thank you!
The url's data is generated dynamically by javascript and Beautifulsoup can't grab dynamic data.So, You can use automation tool something like selenium with Beautifulsoup.Here I apply selenium with Beautifulsoup.Please just run the code.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(5)
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
#print(name.text)
country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
if country:
country=country.get_text()
#print(country)
data.append([name, country])
cols=['name','country']
df = pd.DataFrame(data,columns=cols)
print(df)
Output:
name country
0 Bess, Cameron United States of America
1 Bess, Lane United States of America
2 Dick, Evan United States of America
3 Taylor, Dylan United States of America
4 Strahan, Michael United States of America
.. ... ...
295 Jones, Thomas United States of America
296 Sega, Ronald United States of America
297 Usachov, Yury Russia
298 Fettman, Martin United States of America
299 Wolf, David United States of America
[300 rows x 2 columns]
The page is dynamically loaded using javascript, so requests can't get to it directly. The data is loaded from another address and is received in json format. You can get to it this way:
url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once you have it loaded, you can iterate through it and retrieve relevant information. For example:
for astro in data['astronauts']:
print(astro['astroNumber'],astro['firstName'],astro['lastName'],astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
etc.
You can then load the output to a pandas dataframe or whatever.

Scrape zoho-analitics externally stored table. Is it possible?

I am trying to scrape a zoho-analytics table from this webpage for a project at the university. For the moment I have no ideas. I can't see the values in the inspect, and therefore I cannot use Beautifulsoup in Python (my favourite one).
enter image description here
Does anybody have any idea?
Thanks a lot,
Joseph
I tried it with BeautifulSoup, seems like you can't soup these values that are inside the table because they are not on the website but stored externally(?)
EDIT:
https://analytics.zoho.com/open-view/938032000481034014
This is the link the table and its data are stored.
So I tried scraping from it with bs4 and it works.
The class of the rows is "zdbDataRowDiv"
Try:
container = page_soup.findAll("div","class":"zdbDataRowDiv")
Code explanation:
container # the variable where your data is stored, name it how you like
page_soup # your html page you souped with BeautifulSoup
findAll("tag",{"attribute":"value"}) # this function finds every tag which has the specific value inside its attribute
They are stored within the <script> tags in json format. Just a matter of pulling those out and parsing:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
url = 'https://flo.uri.sh/visualisation/4540617/embed'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
if 'var _Flourish_data_column_names = ' in script.text:
json_str = script.text
col_names = json_str.split('var _Flourish_data_column_names = ')[-1].split(',\n')[0]
cols = json.loads(col_names)
data = json_str.split('_Flourish_data = ')[-1].split(',\n')[0]
loop=True
while loop == True:
try:
jsonData = json.loads(data)
loop = False
break
except:
data = data.rsplit(';',1)[0]
rows = []
headers = cols['rows']['columns']
for row in jsonData['rows']:
rows.append(row['columns'])
table = pd.DataFrame(rows,columns=headers)
for col in headers[1:]:
table.loc[table[col] != '', col] = 'A'
Output:
print (table)
Company Climate change Forests Water security
0 Danone A A A
1 FIRMENICH SA A A A
2 FUJI OIL HOLDINGS INC. A A A
3 HP Inc A A A
4 KAO Corporation A A A
.. ... ... ... ...
308 Woolworths Limited A
309 Workspace Group A
310 Yokogawa Electric Corporation A A
311 Yuanta Financial Holdings A
312 Zalando SE A
[313 rows x 4 columns]

Web Scraping a page with multiple tables

I am trying to web scrape the second table from this website:
https://fbref.com/en/comps/9/stats/Premier-League-Stats
However, I have only ever managed to extract the information from the first table when trying to access the information by finding the table tag. Would anyone be able to explain to me why I cannot access the second table or show me how to do it.
import requests
from bs4 import BeautifulSoup
url = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
pl_table = soup.find_all("table")
player_table = tables[0]
Something along these lines should do it
tables = soup.find_all("table") # returns a list of tables
second_table = tables[1]
The table is inside HTML comments <!-- ... -->.
To get the table from comments, you can use this example:
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = BeautifulSoup(soup.select_one('#all_stats_standard').find_next(text=lambda x: isinstance(x, Comment)), 'html.parser')
#print some information from the table to screen:
for tr in table.select('tr:has(td)'):
tds = [td.get_text(strip=True) for td in tr.select('td')]
print('{:<30}{:<20}{:<10}'.format(tds[0], tds[3], tds[5]))
Prints:
Patrick van Aanholt Crystal Palace 1990
Max Aarons Norwich City 2000
Tammy Abraham Chelsea 1997
Che Adams Southampton 1996
Adrián Liverpool 1987
Sergio Agüero Manchester City 1988
Albian Ajeti West Ham 1997
Nathan Aké Bournemouth 1995
Marc Albrighton Leicester City 1989
Toby Alderweireld Tottenham 1989
...and so on.

incomplete result in table scraping

I am trying to scrape http://bifr.nic.in/asp/list.asp this page with beautifulsoup and get the table from it.
Following is my code
from bs4 import BeautifulSoup
import urllib.request
base_url = "http://bifr.nic.in/asp/list.asp"
page = urllib.request.urlopen(base_url)
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table",{"class":"forumline"})
tr = table.find_all("tr")
for rows in tr:
print(rows.get_text())
It shows no error, but when i execute it i am only able to get first row of content from the table.
List of Companies
Case
No
Company
Name
359 2000 A & F OVERSEAS LTD.
359 2000 A & F OVERSEAS LTD.
359 2000 A & F OVERSEAS LTD.
This is the result i am getting. I can't understand what's going on.
Try this to get all the data from that table:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("http://bifr.nic.in/asp/list.asp")
soup = BeautifulSoup(page, "html5lib")
table = soup.select_one("table.forumline")
for items in table.select("tr")[4:]:
data = ' '.join([item.get_text(" ",strip=True) for item in items.select("td")])
print(data)
Partial Output:
359 2000 A & F OVERSEAS LTD.
99 1988 A B C PRODUCTS LTD.
103 1989 A INFRASTRUCTURE LTD.
3 2006 A V ALLOYS LTD.
13 1988 A V J WIRES LTD.
Probably the page code contains some errors in html markup, try use html5lib instead of html.parser, but before you need install it:
pip install html5lib
soup = BeautifulSoup(page, "html5lib")

Categories

Resources