I'm creating an Amazon web scraper that just returns the name and price of all products on the search results page. It will iterate through a dictionary of strings (products) and collect the titles and prices for all results. I'm doing this to calculate the average/mean of a product's pricing and also to find the highest and lowest prices for that product found on Amazon.
So making the scraper was easy enough. Here's a snippet so you understand the code I am using.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1")
# retrieving item titles
shoes = driver.find_elements(By.XPATH, '//span[@class="a-size-base-plus a-color-base a-text-normal"]')
shoes_list = []
for s in range(len(shoes)):
    shoes_list.append(shoes[s].text)
# retrieving prices
price = driver.find_elements(By.XPATH, '//span[@class="a-price"]')
price_list = []
for p in range(len(price)):
    price_list.append(price[p].text)
# prices are returned with a newline instead of a decimal
# example: £9\n99 instead of £9.99
# so fixing that
temp_price_list = []
for price in price_list:
    price = price.replace("\n", ".")
    temp_price_list.append(price)
price_list = temp_price_list
So here's the issue. Almost without fail, Amazon has a handful of products with no price, and this really messes things up, because once I've sorted the data into a dataframe
from pandas import DataFrame
title_and_price = list(zip(shoes_list[0:], price_list[0:]))
df = DataFrame(title_and_price, columns=['Product', 'Price'])
At some point the data gets mixed up and the price will be sorted next to the wrong product. I have left screenshots below for you to see.
Missing prices on Amazon site
Incorrect data
Unfortunately, when pulling the price data, it does not return a 'blank' entry when the price is missing. If it did, I wouldn't need to ask for help, as I could just display a blank price next to the item and everything would still remain in order.
Is there any way to alter the code so that it can detect a non-displayed price and therefore keep all the data in order? The data stays in order right up until there's a product with no price, which in every single Amazon search there is. I'd really appreciate any insight on this.
To make sure the price is married to the shoe name, you should locate the parent element of both the shoe name and the price, and add them as a tuple to a list (which is to become a dataframe), as in the example below:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
df_list = []
url = 'https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1'
browser.get(url)
shoes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".s-result-item")))
for shoe in shoes:
    # print(shoe.get_attribute('outerHTML'))
    try:
        shoe_title = shoe.find_element(By.CSS_SELECTOR, ".a-text-normal")
    except Exception as e:
        continue
    try:
        shoe_price = shoe.find_element(By.CSS_SELECTOR, 'span[class="a-price"]')
    except Exception as e:
        continue
    df_list.append((shoe_title.text.strip(), shoe_price.text.strip()))
df = pd.DataFrame(df_list, columns = ['Shoe', 'Price'])
print(df)
This would return (depending on Amazon's appetite for serving ads in html tags similar to products):
Shoe Price
0 Nike NIKE AIR MAX MOTION 2, Men's Running Shoe... £79\n99
1 Nike Air Max 270 React Se GS Running Trainers ... £69\n99
2 NIKE Women's Air Max Low Sneakers £69\n99
3 NIKE Men's React Miler Running Shoes £109\n99
4 NIKE Men's Revolution 5 Flyease Running Shoe £38\n70
5 NIKE Women's W Revolution 6 Nn Running Shoe £48\n00
6 NIKE Men's Downshifter 10 Running Shoe £54\n99
7 NIKE Women's Court Vision Low Better Basketbal... £30\n00
8 NIKE Team Hustle D 10 Gs Gymnastics Shoe, Blac... £20\n72
9 NIKE Men's Air Max Wright Gs Running Shoe £68\n51
10 NIKE Men's Air Max Sc Trainers £54\n99
11 NIKE Pegasus Trail 3 Gore-TEX Men's Waterproof... £134\n95
12 NIKE Women's W Superrep Go 2 Sneakers £54\n00
13 NIKE Boys Tanjun Running Shoes £35\n53
14 NIKE Women's Air Max Bella Tr 4 Gymnastics Sho... £28\n00
15 NIKE Men's Defy All Day Gymnastics Shoe £54\n95
16 NIKE Men's Venture Runner Sneaker £45\n90
17 Nike Nike Court Borough Low 2 (gs), Boy's Bask... £24\n00
18 NIKE Men's Court Royale 2 Better Essential Tra... £25\n81
19 NIKE Men's Quest 4 Running Shoe £38\n00
20 Women Trainers Running Shoes - Air Cushion Sne... £35\n69
21 Men Women Walking Trainers Light Running Breat... £42\n99
22 JSLEAP Mens Running Shoes Fashion Non Slip Ath... £44\n99
[...]
You should pay attention to a couple of things:
I am waiting for the elements to load in the page before trying to locate them; see the imports (WebDriverWait etc.)
Your results may vary, depending on your advertising profile
You can select more details for each item and use different CSS/XPath/etc. selectors; this is meant to give you a head start only
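As a small follow-up (my addition, not part of the original answer): the prices in the dataframe above still contain the newline Amazon puts between pounds and pence. A minimal pandas cleanup, assuming the df built above:
# "£79\n99" -> "£79.99"
df['Price'] = df['Price'].str.replace('\n', '.', regex=False)
print(df.head())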
I found the answer to this question. Thank you to @platipus_on_fire for the answer. You certainly guided me towards the correct place, but I did find that the issue persisted in that prices were still being mismatched with the names.
The solution I found was to find the price first, before the name. If there is no price, then it shouldn't log the name. In your example you had it search for the name first, which was still causing issues, as sometimes the name is there but not the price.
Here is the code that finally works for me:
for shoe in shoes:
    try:
        price_list.append(shoe.find_element(By.CSS_SELECTOR, 'span[class="a-price"]').text)
        shoe_list.append(shoe.find_element(By.CSS_SELECTOR, ".a-text-normal").text)
    except Exception as e:
        continue
shoe_list and price_list now match regardless of whether some items don't display a price. Those items are now removed from the lists entirely.
Thanks again to those who answered.
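For reference, a minimal sketch (my addition, not from the original posts) of the final step the question set out to do: converting the collected price strings to numbers and computing the mean, highest and lowest prices. It assumes price_list holds strings like '£9.99' or '£9\n99':
# Strip the currency symbol, turn any remaining newline into a decimal point,
# then convert to float so the stats can be computed
prices = [float(p.replace("£", "").replace("\n", ".")) for p in price_list]
if prices:
    print("average price:", sum(prices) / len(prices))
    print("highest price:", max(prices))
    print("lowest price:", min(prices))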
Related
I am trying to extract a table from the ESPN website using the code below; however, I am unable to extract the whole table and write it to a new CSV file. I am getting the error AttributeError: 'list' object has no attribute 'to_csv'.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import requests
service = Service(r"C:\Users\Sai Ram\Documents\Python\Driver\chromedriver.exe")
def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("disable-infobars")
    options.add_argument("start-maximized")
    options.add_argument("disable-dev-shm-usage")
    options.add_argument("no-sandbox")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(service=service, options=options)
    driver.get("https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year")
    return driver
def main():
    get_driver()
    url = "https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year"
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list
    df.to_csv(r'C:\Users\Sai Ram\Documents\Power BI\End to End T-20 World Cup DashBoard\CSV Data\scappying.csv')

main()
Could you please help in resolving this issue and, if possible, suggest the best way to extract data using web scraping code in Python?
You can simply do this:
import pandas as pd
url = "https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year"
df = pd.read_html(url)[0] # <- this index is important
df.head()
Output
Team 1 Team 2 Winner Margin Ground Match Date Scorecard
0 West Indies England West Indies 9 wickets Bridgetown Jan 22, 2022 T20I # 1453
1 West Indies England England 1 run Bridgetown Jan 23, 2022 T20I # 1454
2 West Indies England West Indies 20 runs Bridgetown Jan 26, 2022 T20I # 1455
3 West Indies England England 34 runs Bridgetown Jan 29, 2022 T20I # 1456
4 West Indies England West Indies 17 runs Bridgetown Jan 30, 2022 T20I # 1457
To save your dataframe as a CSV file:
df.to_csv("your_data_file.csv")
By loading the webpage into pandas you get 2 dataframes (in this case, for this particular site). The first dataframe is the one you want (at least that is what I understand from your post), with the large table of results. The second one is a smaller list. So by indexing with [0] you retrieve the first dataframe.
You get the attribute error because you try to treat the list of (2) dataframes as a dataframe.
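To make that concrete, a small sketch (my addition, using the same URL as above) that shows the list pandas returns and saves only the table you want:
import pandas as pd

url = "https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=3;id=2022;type=year"
tables = pd.read_html(url)   # read_html always returns a list of DataFrames
print(len(tables))           # 2 for this page, per the answer above
tables[0].to_csv("match_results.csv", index=False)  # save only the large results table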
I got this code to almost work, despite much ignorance. Please help on the home run!
Problem 1: INPUT:
I have a long list of URLs (1000+) to read from, and they are in a single column of a .csv file. I would prefer to read from that file rather than paste them into the code, like below.
Problem 2: OUTPUT:
The source files actually have 3 drivers and 3 challenges each. In a separate Python file, the code below finds, prints and saves all 3, but not when I'm using the dataframe below (see below: it only saves 2).
Problem 3: OUTPUT:
I want the output (both files) to have the URLs in column 0 and then the drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also shift across 2 columns.
At the end I'm showing both the inputs and the current & desired output. Sorry for the long question. I'll be very grateful for any help!
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []
for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)
    get_drivers()

    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)
    get_challenges()
The inputs look like this in each URL. They are just lists:
Market drivers
Growing investment in fabs
Miniaturization of electronic products
Increasing demand for IoT devices
Market challenges
Rapid technological changes in semiconductor industry
Volatility in semiconductor industry
Impact of technology chasm Table Impact of drivers and challenges
My desired output for drivers is:
0 | 1 | 2 | 3
http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices
But instead I get:
0 | 1 | 2 | 3 | 4 | 5 | 6
http/.../Global-Induction-Hobs-30196623/ | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances | | | |
http/.../Global-Human-Capital-Management-30196628/ | | | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity | |
http/.../Global-Probe-Card-30196643/ | | | | | Miniaturization of electronic products | Increasing demand for IoT devices
Store your data in a list of dicts and create a data frame from it. Split the list of drivers/challenges into single columns and concat it to the final data frame.
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []
for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })
    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.text.replace('Table Impact of drivers and challenges', '') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })
    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())], axis=1)#.to_csv(sep='|')
Output
url | type | 0 | 1 | 2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasm
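Problem 1 (reading the 1000+ URLs from a .csv rather than hard-coding them) isn't covered above. A minimal sketch, assuming a hypothetical file named urls.csv with the links in its first column and no header row:
import csv

urls = []
with open('urls.csv', newline='') as f:   # 'urls.csv' is an assumed filename
    for row in csv.reader(f):
        if row:                            # skip blank lines
            urls.append(row[0].strip())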
I am scraping data on mutual funds from the vanguard website and my code is giving me some inconsistencies in my data between runs. How can I make my scraping code more robust to avoid these?
I am scraping data from this page and trying to get the average duration in the characteristic table.
Sometimes all the tickers will go through with no problem and other times it will miss some of the data on the page. I assume this has to do with the scraping happening before the data fully loads but it only happens sometimes.
Here is the output for 2 back-to-back runs showing it successfully scraping a ticker and then missing the data on the following run.
VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX # Here the data for VFSTX is successfully scraped
Fund total net assets $79.3 billion
Number of bonds 2519
Average effective maturity 2.8 years
Average duration 2.7 years
Yield to maturity 1.0%
VFISX
Fund total net assets $7.8 billion
Number of bonds 75
2.2 years
2.2 years
Yield to maturity 0.3%
# Here data is missing for VFISX. Second run:
VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX
Fund total net assets $79.3 billion
Number of bonds 2519
2.8 years
2.7 years
Yield to maturity 1.0%
# Here data is missing for VFSTX even though it worked in the previous run
The main issue is that for certain tickers the table is a different length, so I am using a dictionary to store the data, using the relevant label as a key. On some runs, the 'Average effective maturity' and 'Average duration' labels go missing, which breaks how I access the data.
As you can see from my output, the code will work sometimes, and I am not sure whether choosing to wait for a different element to load on the page would fix it. How should I go about identifying my problem?
Here is the relevant code I am using:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import os
import csv
def extractOverviewTable(htmlTable):
    table = htmlTable.find('tbody')
    rows = table.findAll('tr')
    returnDict = {}
    for row in rows:
        cols = row.findAll('td')
        key = cols[0].find('span').text.replace('\n', '')
        value = cols[1].text.replace('\n', '')
        if 'Layer' in key:
            key = key[:key.index('Layer')]
        print(key, value)
        returnDict[key] = value
    return returnDict
def main():
    dirname = os.path.dirname(__file__)
    symbols = []
    with open(os.path.join(dirname, 'symbols.csv')) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if row:
                symbols.append(row[0])
    symbols = [s.strip() for s in symbols if s.startswith('V')]
    options = webdriver.ChromeOptions()
    options.page_load_strategy = 'normal'
    options.add_argument('--headless')
    browser = webdriver.Chrome(options=options, executable_path=os.path.join(dirname, 'chromedriver'))
    url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
    for symbol in symbols:
        browser.get(url_vanguard.format(symbol))
        print(symbol)
        WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, '/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[2]/div[4]/div[2]/div[2]/div[1]/div/table/tbody/tr[4]')))
        html = browser.page_source
        mySoup = BeautifulSoup(html, 'html.parser')
        htmlData = mySoup.findAll('table', {'role': 'presentation'})
        overviewDataList = extractOverviewTable(htmlData[2])
Here is a subset of the symbols.csv file I am using:
VBIRX
VSGBX
VFSTX
VFISX
VMLTX
VWSTX
VFIIX
VWEHX
VBILX
VFICX
VFITX
Try EC.visibility_of_element_located instead of EC.presence_of_element_located; if that doesn't work, try adding a time.sleep() of 1-2 seconds after the WebDriverWait statement.
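A minimal sketch of what that change might look like inside the loop from the question (it reuses the question's browser object and XPath; the sleep is the fallback mentioned above):
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

xpath = ('/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]'
         '/div/div[2]/div[4]/div[2]/div[2]/div[1]/div/table/tbody/tr[4]')
# Wait until the row is actually visible, not merely present in the DOM
WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, xpath)))
time.sleep(1)  # optional extra settle time if the data still comes back incomplete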
I am trying to get all the links from this page to the incident reports, in CSV format. However, they don't seem to be "real links" (if you open one in a new tab you get an "about:blank" page). They do have their own links, visible in inspect element. I'm pretty confused. I did find some code online to do this, but just got "Javascript.void()" as every link.
Surely there must be a way to do this?
https://www.avalanche.state.co.us/accidents/us/
To load all the links into a DataFrame and save it to CSV, you can use this example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.avalanche.state.co.us/caic/acc/acc_us.php?'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
r = re.compile(r"window\.open\('(.*?)'")
data = []
for link in soup.select('a[onclick*="window"]'):
    data.append({'text': link.get_text(strip=True),
                 'link': r.search(link['onclick']).group(1)})
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
text link
0 Mount Emmons, west of Crested Butte https://www.avalanche.state.co.us/caic/acc/acc...
1 Point 12885 near Red Peak, west of Silverthorne https://www.avalanche.state.co.us/caic/acc/acc...
2 Austin Canyon, Snake River Range https://www.avalanche.state.co.us/caic/acc/acc...
3 Taylor Mountain, northwest of Teton Pass https://www.avalanche.state.co.us/caic/acc/acc...
4 North of Skyline Peak https://www.avalanche.state.co.us/caic/acc/acc...
.. ... ...
238 Battle Mountain - outside Vail Mountain ski area https://www.avalanche.state.co.us/caic/acc/acc...
239 Scotch Bonnet Mountain, near Lulu Pass https://www.avalanche.state.co.us/caic/acc/acc...
240 Near Paulina Peak https://www.avalanche.state.co.us/caic/acc/acc...
241 Rock Lake, Cascade, Idaho https://www.avalanche.state.co.us/caic/acc/acc...
242 Hyalite Drainage, northern Gallatins, Bozeman https://www.avalanche.state.co.us/caic/acc/acc...
[243 rows x 2 columns]
And saves data.csv (screenshot from LibreOffice not included here).
Look at the onclick property of these links and get the "real" address from them.
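If you'd rather stay in Selenium than use requests/BeautifulSoup, a sketch of the same idea; it assumes, as in the answer above, that each link's onclick attribute contains a window.open('...') call:
import re
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.avalanche.state.co.us/caic/acc/acc_us.php?')
pattern = re.compile(r"window\.open\('(.*?)'")

links = []
for a in driver.find_elements(By.CSS_SELECTOR, 'a[onclick*="window"]'):
    m = pattern.search(a.get_attribute('onclick') or '')
    if m:
        # keep the link text together with the extracted report URL
        links.append((a.text.strip(), m.group(1)))
driver.quit()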
I can download the annual data from this link with the following code, but it's not the same as what's shown on the website, because it's the June data:
Now I have two questions:
How do I specify the date so the annual data is the same as in the following picture (September instead of June, as shown in the red rectangle)?
When clicking Quarterly, as shown in the orange rectangle, the link doesn't change. How do I grab the quarterly data?
Thanks.
Just curious, but why write the html to file first and then read it with pandas? Pandas can take in the html request directly:
import pandas as pd
symbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/%s/financials?p=%s' %(symbol, symbol)
dfs = pd.read_html(url)
print(dfs[0])
Secondly, I'm not sure why yours is popping up with the yearly dates. Doing it the way I have it above shows September.
print(dfs[0])
0 ... 4
0 Revenue ... 9/26/2015
1 Total Revenue ... 233715000
2 Cost of Revenue ... 140089000
3 Gross Profit ... 93626000
4 Operating Expenses ... Operating Expenses
5 Research Development ... 8067000
6 Selling General and Administrative ... 14329000
7 Non Recurring ... -
8 Others ... -
9 Total Operating Expenses ... 162485000
10 Operating Income or Loss ... 71230000
11 Income from Continuing Operations ... Income from Continuing Operations
12 Total Other Income/Expenses Net ... 1285000
13 Earnings Before Interest and Taxes ... 71230000
14 Interest Expense ... -733000
15 Income Before Tax ... 72515000
16 Income Tax Expense ... 19121000
17 Minority Interest ... -
18 Net Income From Continuing Ops ... 53394000
19 Non-recurring Events ... Non-recurring Events
20 Discontinued Operations ... -
21 Extraordinary Items ... -
22 Effect Of Accounting Changes ... -
23 Other Items ... -
24 Net Income ... Net Income
25 Net Income ... 53394000
26 Preferred Stock And Other Adjustments ... -
27 Net Income Applicable To Common Shares ... 53394000
[28 rows x 5 columns]
For the second part, you could try to find the data one of a few ways:
1) Check the XHR requests and get the data you want by including parameters in the request URL that generates that data, which can be returned to you in JSON format (when I looked, I could not find it right off the bat, so I moved on to the next option)
2) Search through the <script> tags, as the JSON data can sometimes be within those tags (I didn't search through them very thoroughly, and think Selenium would just be a more direct way, since pandas can read in the tables)
3) Use Selenium to simulate opening the browser, getting the table, clicking on "Quarterly", and then getting that table
I went with option 3:
from selenium import webdriver
import pandas as pd
symbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/%s/financials?p=%s' %(symbol, symbol)
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get(url)
# Get Table shown in browser
dfs_annual = pd.read_html(driver.page_source)
print(dfs_annual[0])
# Click "Quarterly"
driver.find_element_by_xpath("//span[text()='Quarterly']").click()
# Get Table shown in browser
dfs_quarter = pd.read_html(driver.page_source)
print(dfs_quarter[0])
driver.close()
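Note (my addition, not part of the original answer): the driver.find_element_by_xpath helper was removed in later Selenium 4 releases, so on a current install the "Quarterly" click would be written roughly like this instead:
from selenium.webdriver.common.by import By

# Selenium 4 style: locate the "Quarterly" toggle and click it
driver.find_element(By.XPATH, "//span[text()='Quarterly']").click()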