I'm just 3 months into learning Python and I ran into a little problem while building a Yahoo Finance web scraper.
import pandas as pd
from bs4 import BeautifulSoup
import lxml
import requests
import openpyxl
index = 'MSFT'
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT'
read_data = requests.get(url,headers=headers, timeout=5)
content = read_data.content
soup_is = BeautifulSoup(content,'lxml')
ls = []
for l in soup_is.find_all('div') and soup_is.find_all('span'):
    ls.append(l.string)
new_ls = list(filter(None,ls))
new_ls = new_ls[45:]
is_data = list(zip(*[iter(new_ls)]*6))
Income_st = pd.DataFrame(is_data[0:])
print(Income_st)
Everything went smoothly until I noticed that the contents of the rows "Diluted EPS" and "Basic EPS" weren't copied.
While inspecting the source code I noticed that the EPS values are stored directly in the div tag, if I can say it like that, instead of in the <span>"Value"</span> underneath it:
<div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>39,240,000</span></div>
<div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col">9.70</div>
Any idea how I can fix the code to get those values out? Also, any idea how I can extract the data separately for the two different views, "Annually" and "Quarterly"?
I tried changing the tags, attributes, etc., but to no avail. :(
Try to select your elements more specifically and use stripped_strings in this case to extract the information from the data rows:
[e.stripped_strings for e in soup.select('[data-test="fin-row"]')]
and the columns:
soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
index = 'MSFT'
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT'
soup = BeautifulSoup(requests.get(url,headers=headers, timeout=5).text)
pd.DataFrame(
[e.stripped_strings for e in soup.select('[data-test="fin-row"]')],
columns=soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings
)
Output
| | Breakdown | ttm | 6/30/2022 | 6/30/2021 | 6/30/2020 | 6/30/2019 |
|---|---|---|---|---|---|---|
| 0 | Total Revenue | 204,094,000 | 198,270,000 | 168,088,000 | 143,015,000 | 125,843,000 |
| 1 | Cost of Revenue | 64,984,000 | 62,650,000 | 52,232,000 | 46,078,000 | 42,910,000 |
| 2 | Gross Profit | 139,110,000 | 135,620,000 | 115,856,000 | 96,937,000 | 82,933,000 |
| 3 | Operating Expense | 56,295,000 | 52,237,000 | 45,940,000 | 43,978,000 | 39,974,000 |
| 4 | Operating Income | 82,815,000 | 83,383,000 | 69,916,000 | 52,959,000 | 42,959,000 |
| 5 | Net Non Operating Interest Income Expense | 423,000 | 31,000 | -215,000 | 89,000 | 76,000 |
| 6 | Other Income Expense | -650,000 | 302,000 | 1,401,000 | -12,000 | 653,000 |
| 7 | Pretax Income | 82,588,000 | 83,716,000 | 71,102,000 | 53,036,000 | 43,688,000 |
| 8 | Tax Provision | 15,139,000 | 10,978,000 | 9,831,000 | 8,755,000 | 4,448,000 |
| 9 | Net Income Common Stockholders | 67,449,000 | 72,738,000 | 61,271,000 | 44,281,000 | 39,240,000 |
| 10 | Diluted NI Available to Com Stockholders | 67,449,000 | 72,738,000 | 61,271,000 | 44,281,000 | 39,240,000 |
| 11 | Basic EPS | - | 9.70 | 8.12 | 5.82 | 5.11 |
| 12 | Diluted EPS | - | 9.65 | 8.05 | 5.76 | 5.06 |
| 13 | Basic Average Shares | - | 7,496,000 | 7,547,000 | 7,610,000 | 7,673,000 |
| 14 | Diluted Average Shares | - | 7,540,000 | 7,608,000 | 7,683,000 | 7,753,000 |
| ... | ... | ... | ... | ... | ... | ... |
| 26 | Net Income from Continuing Operation Net Minority Interest | 67,449,000 | 72,738,000 | 61,271,000 | 44,281,000 | 39,240,000 |
| 27 | Total Unusual Items Excluding Goodwill | -547,000 | 334,000 | 1,303,000 | 28,000 | 710,000 |
| 28 | Total Unusual Items | -547,000 | 334,000 | 1,303,000 | 28,000 | 710,000 |
| 29 | Normalized EBITDA | 99,314,000 | 99,905,000 | 83,831,000 | 68,395,000 | 57,346,000 |
| 30 | Tax Rate for Calcs | 0 | 0 | 0 | 0 | 0 |
| 31 | Tax Effect of Unusual Items | -100,269 | 43,420 | 182,420 | 4,620 | 72,420 |
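Since your original script also imports openpyxl, you can assign the DataFrame from the example above to a variable and write it straight to Excel. A small usage sketch (the output file name is just an example):
df = pd.DataFrame(
    [e.stripped_strings for e in soup.select('[data-test="fin-row"]')],
    columns=soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings
)
df.to_excel('msft_income_statement.xlsx', index=False)  # example file name; .xlsx export uses openpyxl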
To extract the EPS values, you can try modifying your code to search for the div tags with data-test="fin-col" whose class matches the EPS cells in your snippet (the variant without the Bgc(...) classes), and take the text of the div directly, since those cells have no span inside. Here's an example:
eps_values = []
eps_divs = soup_is.find_all('div', {'data-test': 'fin-col', 'class': 'Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)'})
for div in eps_divs:
    eps_value = div.get_text(strip=True)
    eps_values.append(eps_value)
print(eps_values)
Regarding extracting data from different pages, you can change the URL in your requests.get call to the URL of the desired page, then process the data as you did for the original page. Here's an example for the "Annually" page:
url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT&annual'
read_data = requests.get(url,headers=headers, timeout=5)
content = read_data.content
soup_is = BeautifulSoup(content,'lxml')
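If you want to keep the two views in separate frames, one option is to wrap the fetch-and-parse step in a small helper and call it once per URL. This is only a minimal sketch building on the code above; the &annual and &quarterly query parameters are assumptions, so verify them against the URL your browser shows when you switch views:
def get_soup(url):
    # fetch one financials view and return the parsed document
    read_data = requests.get(url, headers=headers, timeout=5)
    return BeautifulSoup(read_data.content, 'lxml')

soup_annual = get_soup('https://finance.yahoo.com/quote/MSFT/financials?p=MSFT&annual')
soup_quarterly = get_soup('https://finance.yahoo.com/quote/MSFT/financials?p=MSFT&quarterly')  # assumed parameter
# ...then run the same row-extraction logic on each soup separately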
Could anyone assist me with my code? I am trying to scrape products and prices from a patisserie website; however, it only retrieves the products on the main page. The rest of the products, which are classified into categories, have the same tags and attributes, but when I run my code only the products on the main page appear. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
cakes = []
url = "https://mrbakeregypt.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(requests.get(url).content, "html.parser")
articles = soup.find_all("div", class_="grid-view-item product-card")

for item in articles:
    product = item.find("div", class_="h4 grid-view-item__title product-card__title").text
    price_regular = item.find("div", class_="price__regular").text.strip().replace('\n', '')
    item_cost = {"name": product,
                 "cost": price_regular
                 }
    cakes.append(item_cost)
As mentioned, you have to process all collections / categories, and one approach could be to collect the links from your baseUrl. Note that I used a set comprehension to get only unique URLs and avoid iterating the same category more than once:
urlList = list(set(baseUrl+a['href'] for a in soup.select('a[href*="collection"]')))
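If some of the collected hrefs turn out to be absolute URLs, a slightly more robust variant of the same idea (a small sketch using urljoin instead of plain string concatenation) would be:
from urllib.parse import urljoin

urlList = list(set(urljoin(baseUrl, a['href']) for a in soup.select('a[href*="collection"]')))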
Now you can iterate this urlList to scrape your information:
...
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")
...
Example
Take a look - it also handles the type / category of the product and both prices, so you can filter based on these in your dataframe (see the small filtering sketch after the output below).
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
baseUrl = "https://mrbakeregypt.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

r = requests.get(baseUrl, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

urlList = list(set(baseUrl+a['href'] for a in soup.select('a[href*="collection"]')))

data = []
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")

    for item in articles:
        data.append({
            'name': item.a.text.strip(),
            'price_regular': item.find("div", class_="price__regular").dd.text.split()[-1].strip(),
            'price_sale': item.find("div", class_="price__sale").dd.text.split()[-1].strip(),
            'type': url.split('/')[-1],
            'url': baseUrl+item.a.get('href')
        })

df = pd.DataFrame(data)
Output
| | name | price_regular | price_sale | type | url |
|---|---|---|---|---|---|
| 0 | Mini Sandwiches Mix - 20 Pieces Bread Basket | 402 | 402 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/mini-sandwiches-mix-bread-basket |
| 1 | Spiced Aubergine Mini Sandwiches - Box 2 Pieces | 35 | 35 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/spiced-aubergine-mini-sandwich |
| 2 | Tuna Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/tuna-mini-sandwich |
| 3 | Turkey Coleslaw Mini Sandwiches - Box 2 Pieces | 45 | 45 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/turkey-coleslaw-mini-sandwich |
| 4 | Roast Beef Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/roast-beef-mini-sandwich |
| ... | ... | ... | ... | ... | ... |
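As mentioned, you can then filter the resulting dataframe on these columns, for example by product type or by items whose sale price differs from the regular price. A small usage sketch based on the columns shown in the output above:
# keep only the sandwiches
sandwiches = df[df['type'] == 'sandwiches']

# items where the sale price differs from the regular price
on_sale = df[df['price_regular'] != df['price_sale']]

print(sandwiches.head())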
I'm developing a web scraper to collect some information from AllMusic. However, I'm having difficulty returning the information correctly when there is more than one option inside the tag (e.g. href).
Question: I need to return the first music genre for each artist. In the case of one value per artist, my code works. However, in situations with more than one music genre, I'm not able to select just the first one.
Here is the code I created:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
performer = []
links = []
genre = []
for artist in artists:
    url = urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    div = soup.select("div.name")[0]
    link = div.find_all('a')[0]['href']
    links.append(link)
    for l in links:
        soup = BeautifulSoup(requests.get(l, headers=headers).content, "html.parser")
        divGenre = soup.select("div.genre")[0]
        genres = divGenre.find('a')
        performer.append(artist)
        genre.append(genres.text)
df = pd.DataFrame(zip(performer, genre, links), columns=["artist", "genre", "link"])
df
Hopefully I understand your question right - the main issue is that you iterate the links inside your for loop, and that causes the repetition.
You may want to change your strategy: try to get all the information in one iteration and store it in a more structured way.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
data = []
for artist in artists:
    url = urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    link = soup.select_one("div.name a").get('href')

    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")

    data.append({
        'artist': artist,
        'genre': soup.select_one("div.genre a").text,
        'link': link
    })
print(pd.DataFrame(data).to_markdown(index=False))
Output
| artist | genre | link |
|---|---|---|
| Alexander 23 | Pop/Rock | https://www.allmusic.com/artist/alexander-23-mn0003823464 |
| Alex & Sierra | Folk | https://www.allmusic.com/artist/alex-sierra-mn0003280540 |
| Tion Wayne | Rap | https://www.allmusic.com/artist/tion-wayne-mn0003666177 |
| Tom Cochrane | Pop/Rock | https://www.allmusic.com/artist/tom-cochrane-mn0000931015 |
| The Waked | Electronic | https://www.allmusic.com/artist/the-waked-mn0004025091 |
I'm having a problem scraping the table on this website. I should be getting the headings, but instead I'm getting
AttributeError: 'NoneType' object has no attribute 'tbody'
I'm a bit new to web scraping, so if you could help me out that would be great.
import requests
from bs4 import BeautifulSoup
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
s = requests.Session()
page = s.get(URL)
soup = BeautifulSoup(page.content, "lxml")
table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")
headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())
print(headings)
What happens?
Note: Always look at your soup first - therein lies the truth. The content can differ slightly to extremely from what you see in the dev tools.
Access Revoked
Your IP address has been blocked. We detected irregular, bot-like usage of our Property Search originating from your IP address. This block was instated to reduce stress on our webserver, to ensure that we're providing optimal site performance to the taxpayers of Collin County. We have not blocked your ability to download our data exports, which you can still use to acquire bulk property data.
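A quick way to check this yourself is to print part of the parsed response before digging further (a tiny sketch reusing the soup from your own code) - the block message shows up right there:
# inspect what the server actually returned
print(soup.get_text(' ', strip=True)[:300])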
How to fix?
Add a user-agent to your requests so that it looks like you're requesting with a "browser".
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
page = s.get(URL,headers=headers)
Or, as an alternative, just download the data exports mentioned in the message.
Example (scraping table)
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
s = requests.Session()
page = s.get(URL,headers=headers)
soup = BeautifulSoup(page.content, "lxml")
data = []
for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ', strip=True) for c in row.select('td')])
pd.DataFrame(data[1:], columns=data[0])
Output
| | Property ID ↓ Geographic ID ↓ | Owner Name | Property Address | Legal Description | 2021 Market Value |
|---|---|---|---|---|---|
| 1 | 2709013 R-10644-00H-0010-1 | PARTHASARATHY SURESH & ANITHA HARIKRISHNAN | 12209 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 | $513,019 |
| 2 | 2709018 R-10644-00H-0020-1 | JOSHI PRASHANT & SHWETA PANT | 12235 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 2 | $546,254 |
| 3 | 2709019 R-10644-00H-0030-1 | THALLAPUREDDY RAVENDRA & UMA MAHESWARI VEMULA | 12261 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 3 | $550,768 |
| 4 | 2709020 R-10644-00H-0040-1 | KULKARNI BHEEMSEN T & GOURI R | 12287 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 4 | $509,593 |
| 5 | 2709021 R-10644-00H-0050-1 | BALAM GANESH & SHANTHIREKHA LOKULA | 12313 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 5 | $553,949 |
| ... | ... | ... | ... | ... | ... |
import requests
from bs4 import BeautifulSoup
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
s = requests.Session()
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
page = s.get(URL,headers=headers)
soup = BeautifulSoup(page.content, "lxml")
Finding Table Data:
column_data = soup.find("table").find_all("tr")[0]
column = [i.get_text() for i in column_data.find_all("td") if i.get_text() != ""]

row = soup.find("table").find_all("tr")[1:]
main_lst = []
for row_details in row:
    lst = []
    for i in row_details.find_all("td")[1:]:
        if i.get_text() != "":
            lst.append(i.get_text())
    main_lst.append(lst)
Converting to pandas DataFrame:
import pandas as pd
df=pd.DataFrame(main_lst,columns=column)
Output:
Property ID↓ Geographic ID ↓ Owner Name Property Address Legal Description 2021 Market Value
0 2709013R-10644-00H-0010-1 PARTHASARATHY SURESH & ANITHA HARIKRISHNAN 12209 Willowgate DrFrisco, TX 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 $513,019
.....
If you look at page.content, you will see that "Your IP address has been blocked".
You should add some headers to your request because the website is blocking your request. In your specific case, it will be enough to add a User-Agent:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
s = requests.Session()
page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")
headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())
print(headings)
If you add headers, you will still get an error, but in the line:
headings.append(td.b.text.replace('\n', ' ').strip())
You should change it to
headings.append(td.text.replace('\n', ' ').strip())
because not every td has a b tag.
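If you still want the bold text whenever it is present, a defensive variant could look like this (a small sketch that falls back to the td itself when there is no b tag):
headings = []
for td in table_data[0].find_all("td"):
    cell = td.b if td.b else td  # use the <b> tag when it exists, otherwise the <td> itself
    headings.append(cell.text.replace('\n', ' ').strip())
print(headings)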
I'm scraping the activities to do in Paris from TripAdvisor (https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html).
The code I've written works well, but I still haven't found a way to obtain the rating of each activity. The rating on TripAdvisor is represented by 5 circles, and I need to know how many of these circles are filled in.
I obtain nothing in the "rating" field.
Here is the code:
wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
wd.get("https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html")
import pprint
detail_tours = []
for tour in list_tours:
    url = tour.find_elements_by_css_selector("a")[0].get_attribute("href")
    title = ""
    reviews = ""
    rating = ""
    if(len(tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")) > 0):
        title = tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")[0].text
    if(len(tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")) > 0):
        reviews = tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")[0].text
    if(len(tour.find_elements_by_css_selector(".zWXXYhVR")) > 0):
        rating = tour.find_elements_by_css_selector(".zWXXYhVR")[0].text
    detail_tours.append({'url': url,
                         'title': title,
                         'reviews': reviews,
                         'rating': rating})
I would use BeautifulSoup in a way similar to the suggested code. (I would also recommend you study the structure of the html, but seeing the original code I don't think that's necessary.)
import requests
from bs4 import BeautifulSoup
import re
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
resp = requests.get('https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html', headers=header)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')
    cards = soup.find_all('div', {'data-automation': 'cardWrapper'})
    for card in cards:
        rating = card.find('svg', {'class': 'zWXXYhVR'})
        match = re.match('Punteggio ([0-9,]+)', rating.attrs['aria-label'])[1]
        print(float(match.replace(',', '.')))
And a small bonus: the part of the link preceded by oa (in the example below, oa60) indicates the starting offset, which runs in 30-result increments. So if you want to change pages, you can change your link to include oa30, oa60, oa90, etc.: https://www.tripadvisor.it/Attractions-g187147-Activities-c42-oa60-Paris_Ile_de_France.html
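Based on that offset scheme, a minimal sketch for building the URLs of the following pages (how many pages you fetch is up to you) could look like this:
# build paginated URLs from the oa offset described above (30 results per page)
base = "https://www.tripadvisor.it/Attractions-g187147-Activities-c42-oa{}-Paris_Ile_de_France.html"
page_urls = [base.format(offset) for offset in range(30, 121, 30)]  # oa30, oa60, oa90, oa120 - adjust as needed
for page_url in page_urls:
    print(page_url)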