Dealing with run inconsistencies in web scraping - Python

I am scraping data on mutual funds from the Vanguard website, and my code gives me inconsistent data between runs. How can I make my scraping code more robust to avoid this?
I am scraping data from this page and trying to get the average duration from the characteristics table.
Sometimes all the tickers go through with no problem, and other times some of the data on the page is missed. I assume this happens because the scrape runs before the data has fully loaded, but it only happens intermittently.
Here is the output for 2 back-to-back runs showing it successfully scraping a ticker and then missing the data on the following run.
VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX # Here the data for VFSTX is successfully scraped
Fund total net assets $79.3 billion
Number of bonds 2519
Average effective maturity 2.8 years
Average duration 2.7 years
Yield to maturity 1.0%
VFISX
Fund total net assets $7.8 billion
Number of bonds 75
2.2 years
2.2 years
Yield to maturity 0.3%
# Here data is missing for VFISX

Second run:
VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX
Fund total net assets $79.3 billion
Number of bonds 2519
2.8 years
2.7 years
Yield to maturity 1.0%
# Here data is missing for VFSTX even though it worked in the previous run
The main issue is that the table has a different length for certain tickers, so I store the data in a dictionary keyed by the relevant label. On some runs, the 'Average effective maturity' and 'Average duration' labels go missing, which breaks how I access the data.
As you can see from my output, the code works only some of the time, and I am not sure whether waiting for a different element to load on the page would fix it. How should I go about identifying the problem?
Here is the relevant code I am using:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import os
import csv

def extractOverviewTable(htmlTable):
    table = htmlTable.find('tbody')
    rows = table.findAll('tr')
    returnDict = {}
    for row in rows:
        cols = row.findAll('td')
        key = cols[0].find('span').text.replace('\n', '')
        value = cols[1].text.replace('\n', '')
        if 'Layer' in key:
            key = key[:key.index('Layer')]
        print(key, value)
        returnDict[key] = value
    return returnDict

def main():
    dirname = os.path.dirname(__file__)
    symbols = []
    with open(os.path.join(dirname, 'symbols.csv')) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if row:
                symbols.append(row[0])
    symbols = [s.strip() for s in symbols if s.startswith('V')]

    options = webdriver.ChromeOptions()
    options.page_load_strategy = 'normal'
    options.add_argument('--headless')
    browser = webdriver.Chrome(options=options, executable_path=os.path.join(dirname, 'chromedriver'))

    url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
    for symbol in symbols:
        browser.get(url_vanguard.format(symbol))
        print(symbol)
        WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH,'/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[2]/div[4]/div[2]/div[2]/div[1]/div/table/tbody/tr[4]')))
        html = browser.page_source
        mySoup = BeautifulSoup(html, 'html.parser')
        htmlData = mySoup.findAll('table',{'role':'presentation'})
        overviewDataList = extractOverviewTable(htmlData[2])
Here is a subset of the symbols.csv file I am using:
VBIRX
VSGBX
VFSTX
VFISX
VMLTX
VWSTX
VFIIX
VWEHX
VBILX
VFICX
VFITX

Try EC.visibility_of_element_located instead of EC.presence_of_element_located; if that doesn't work, try adding a time.sleep() of 1-2 seconds after the WebDriverWait statement.
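For instance, the loop from the question could look like this (a minimal sketch; it reuses the browser, symbols, url_vanguard and XPath from the question's code, and the extra one-second sleep is only a fallback if the explicit wait alone is not enough):

import time

for symbol in symbols:
    browser.get(url_vanguard.format(symbol))
    print(symbol)
    # wait until the table row is actually visible, not merely present in the DOM
    WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[2]/div[4]/div[2]/div[2]/div[1]/div/table/tbody/tr[4]')))
    # optional extra cushion in case the figures are filled in after the row appears
    time.sleep(1)
    html = browser.page_source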

Related

Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas

I got this code to almost work, despite much ignorance. Please help on the home run!
Problem 1: INPUT:
I have a long list of URLs (1000+) to read from, stored in a single column of a .csv file. I would prefer to read them from that file rather than paste them into the code, as below.
Problem 2: OUTPUT:
The source pages actually have 3 drivers and 3 challenges each. In a separate Python file, the code below finds, prints, and saves all 3, but not when I'm using the dataframe approach below (it only saves 2).
Problem 3: OUTPUT:
I want the output (both files) to have the URL in column 0 and then the drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') not only drops one row but also shifts each URL's values across 2 columns.
At the end I'm showing both the inputs and the current & desired output. Sorry for the long question. I'll be very grateful for any help!
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)
    get_drivers()

    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)
    get_challenges()
The inputs look like this in each URL. They are just lists:
Market drivers
Growing investment in fabs
Miniaturization of electronic products
Increasing demand for IoT devices
Market challenges
Rapid technological changes in semiconductor industry
Volatility in semiconductor industry
Impact of technology chasm Table Impact of drivers and challenges
My desired output for drivers is:

0 | 1 | 2 | 3
http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices
But instead I get (each URL loses its first driver and the remaining values shift two columns to the right):

0 | 1 | 2 | 3 | 4 | 5 | 6
http/.../Global-Induction-Hobs-30196623/ | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances | | | |
http/.../Global-Human-Capital-Management-30196628/ | | | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity | |
http/.../Global-Probe-Card-30196643/ | | | | | Miniaturization of electronic products | Increasing demand for IoT devices
Store your data in a list of dicts and create a data frame from it. Then split the list of drivers / challenges into individual columns and concat them onto the final data frame.
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })
    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.text.replace('Table Impact of drivers and challenges', '') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })
    get_challenges()

pd.concat([pd.DataFrame(data)[['url','type']], pd.DataFrame(pd.DataFrame(data).list.tolist())], axis=1)#.to_csv(sep='|')
Output
url | type | 0 | 1 | 2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasm

How can I do multiple Python Selenium loops from one element?

I want to extract price data from the same element via two concurrent while loops: a first, slow loop (with time.sleep(1)) that extracts the price every 1 second, and, at the same time, a second, fast loop (with time.sleep(0.2)) that extracts the price every 0.2 seconds.
Code for extracting price data with only one loop that works:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from datetime import datetime
import time

#~~~~~~~~~~~~~~~# Cookies Saver:
chrome_options = Options()
chrome_options.add_argument("user-data-dir=selenium")
driver = webdriver.Chrome(chrome_options=chrome_options)

#~~~~~~~~~~~~~~~# Get to the link:
driver.get('https://coininfoline.com/currencies/ETH/ethereum/')
input('ENTER AFTER PAGE LOADED:')

#~~~~~~~~~~~~~~~# Get price data:
while True:
    price_extractor = driver.find_elements_by_xpath('//span[@class="cmc-formatted-price"]')
    for price_raw in price_extractor:
        price = price_raw.text
    #~~~~~~~~~~~~~~~# Time Stamp:
    timestamper = datetime.now()
    timestamper.microsecond
    #~~~~~~~~~~~~~~~# Date and Price Printer:
    print(timestamper, str(price))
    time.sleep(1)
I expect:
A second while loop that will extract price data AT THE SAME TIME, but faster, with time.sleep(0.2). How can this be done, or is it even possible?
Maybe trying multiprocessing would make this possible?
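One possible way to do this (a rough sketch, not taken from the thread) is with two threads sharing the driver from the code above and polling at different intervals; a lock serializes the WebDriver calls, since the driver is not thread-safe:

import threading
import time
from datetime import datetime

lock = threading.Lock()

def poll_price(interval, label):
    # read the price repeatedly, sleeping `interval` seconds between reads
    while True:
        with lock:  # only one thread talks to the driver at a time
            elements = driver.find_elements_by_xpath('//span[@class="cmc-formatted-price"]')
            price = elements[0].text if elements else None
        print(datetime.now(), label, price)
        time.sleep(interval)

# slow loop: every 1 second; fast loop: every 0.2 seconds
threading.Thread(target=poll_price, args=(1, 'slow'), daemon=True).start()
threading.Thread(target=poll_price, args=(0.2, 'fast'), daemon=True).start()

# keep the main thread alive while the pollers run
while True:
    time.sleep(60)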

How to grab quarterly data and specify the date of Yahoo financial data with Python?

I can download the annual data from this link with the following code, but it's not the same as what's shown on the website, because it's the June data:
Now I have two questions:
How do I specify the date so the annual data matches the following picture (September instead of June, as shown in the red rectangle)?
When I click "Quarterly" (as shown in the orange rectangle), the link doesn't change. How do I grab the quarterly data?
Thanks.
Just curious, but why write the html to file first and then read it with pandas? Pandas can take in the html request directly:
import pandas as pd
symbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/%s/financials?p=%s' %(symbol, symbol)
dfs = pd.read_html(url)
print(dfs[0])
Secondly, I'm not sure why yours is popping up with the yearly dates; doing it the way I have it above shows September.
print(dfs[0])
0 ... 4
0 Revenue ... 9/26/2015
1 Total Revenue ... 233715000
2 Cost of Revenue ... 140089000
3 Gross Profit ... 93626000
4 Operating Expenses ... Operating Expenses
5 Research Development ... 8067000
6 Selling General and Administrative ... 14329000
7 Non Recurring ... -
8 Others ... -
9 Total Operating Expenses ... 162485000
10 Operating Income or Loss ... 71230000
11 Income from Continuing Operations ... Income from Continuing Operations
12 Total Other Income/Expenses Net ... 1285000
13 Earnings Before Interest and Taxes ... 71230000
14 Interest Expense ... -733000
15 Income Before Tax ... 72515000
16 Income Tax Expense ... 19121000
17 Minority Interest ... -
18 Net Income From Continuing Ops ... 53394000
19 Non-recurring Events ... Non-recurring Events
20 Discontinued Operations ... -
21 Extraordinary Items ... -
22 Effect Of Accounting Changes ... -
23 Other Items ... -
24 Net Income ... Net Income
25 Net Income ... 53394000
26 Preferred Stock And Other Adjustments ... -
27 Net Income Applicable To Common Shares ... 53394000
[28 rows x 5 columns]
For the second part, you could try to find the data one of a few ways:
1) Check the XHR requests and get the data you want by adding parameters to the request URL that generates that data, which can return it to you in JSON format (when I looked, I could not find it right off the bat, so I moved on to the next option)
2) Search through the <script> tags, as the JSON can sometimes be embedded there (I didn't search very thoroughly, and think Selenium is the more direct way, since pandas can read in the tables)
3) Use Selenium to simulate opening the browser, get the table, click on "Quarterly", then get that table
I went with option 3:
from selenium import webdriver
import pandas as pd
symbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/%s/financials?p=%s' %(symbol, symbol)
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get(url)
# Get Table shown in browser
dfs_annual = pd.read_html(driver.page_source)
print(dfs_annual[0])
# Click "Quarterly"
driver.find_element_by_xpath("//span[text()='Quarterly']").click()
# Get Table shown in browser
dfs_quarter = pd.read_html(driver.page_source)
print(dfs_quarter[0])
driver.close()

Extract using Beautiful Soup

I want to fetch the stock price from the website: http://www.bseindia.com/
For example, the stock price appears as "S&P BSE :25,489.57". I want to fetch the numeric part of it as "25489.57".
This is the code I have written so far. It fetches the entire div in which this amount appears, but not the amount.
Below is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = "http://www.bseindia.com"
html_page = urlopen(page)
html_text = html_page.read()
soup = BeautifulSoup(html_text, "html.parser")
divtag = soup.find_all("div", {"class": "sensexquotearea"})

for oye in divtag:
    tdidTags = oye.find_all("div", {"class": "sensexvalue2"})
    for tag in tdidTags:
        tdTags = tag.find_all("div", {"class": "newsensexvaluearea"})
        for newtag in tdTags:
            tdnewtags = newtag.find_all("div", {"class": "sensextext"})
            for rakesh in tdnewtags:
                tdtdsp1 = rakesh.find_all("div", {"id": "tdsp"})
                for texts in tdtdsp1:
                    print(texts)
I had a look at what is going on when that page loads the information, and I was able to simulate what the JavaScript is doing in Python.
I found out it is referencing a page called IndexMovers.aspx?ln=en; check it out here.
It looks like this page is a comma-separated list of values. First comes the name, next comes the price, and then a couple of other things you don't care about.
To simulate this in Python, we request the page, split it on commas, then step through every 6th value in the list, adding that value and the one after it to a new list called stockInformation.
Now we can just loop through stockInformation and get the name with item[0] and the price with item[1].
import requests

newUrl = "http://www.bseindia.com/Msource/IndexMovers.aspx?ln=en"
response = requests.get(newUrl).text
commaItems = response.split(",")

#create list of stocks, each one containing information
#index 0 is the name, index 1 is the price
#the last item is not included because for some reason it has no price info on indexMovers page
stockInformation = []
for i, item in enumerate(commaItems[:-1]):
    if i % 6 == 0:
        newList = [item, commaItems[i+1]]
        stockInformation.append(newList)

#print each item and its price from your list
for item in stockInformation:
    print(item[0], "has a price of", item[1])
This prints out:
S&P BSE SENSEX has a price of 25489.57
SENSEX#S&P BSE 100 has a price of 7944.50
BSE-100#S&P BSE 200 has a price of 3315.87
BSE-200#S&P BSE MidCap has a price of 11156.07
MIDCAP#S&P BSE SmallCap has a price of 11113.30
SMLCAP#S&P BSE 500 has a price of 10399.54
BSE-500#S&P BSE GREENEX has a price of 2234.30
GREENX#S&P BSE CARBONEX has a price of 1283.85
CARBON#S&P BSE India Infrastructure Index has a price of 152.35
INFRA#S&P BSE CPSE has a price of 1190.25
CPSE#S&P BSE IPO has a price of 3038.32
#and many more... (total of 40 items)
This is clearly equivalent to the values shown on the page.
So there you have it: you can simulate exactly what the JavaScript on that page is doing to load the information. In fact, you now have even more information than was shown to you on the page, and the request is faster because we are downloading just the data, not all that extraneous HTML.
If you look at the source code of your page (e.g. by saving it to a file and opening it in an editor), you will see that the actual stock price 25,489.57 does not show up directly. The price is not in the stored HTML but is loaded in a different way.
You could use the linked page where the numbers show up:
http://www.bseindia.com/sensexview/indexview_new.aspx?index_Code=16&iname=BSE30
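For example, something along these lines could be a starting point (a sketch only; the exact element that holds the value on that page is not shown in this thread, so here price-like numbers are simply pulled out of the raw HTML with a regex):

import re
import requests

# the page linked above, where the index value is rendered
url = "http://www.bseindia.com/sensexview/indexview_new.aspx?index_Code=16&iname=BSE30"
html = requests.get(url).text

# placeholder: grab anything formatted like a price; the proper selector
# (id/class of the value element) would need to be confirmed by inspecting the page
prices = re.findall(r'\d{1,3}(?:,\d{3})+\.\d{2}', html)
print(prices[:5])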

How can Python auto-scrape on a daily basis given certain URL change rules?

Hi, I'm a newcomer to Python. As a salesman I travel a lot and would like to save some money on hotel bookings, so I am using Python to scrape prices for certain hotels on certain days, for personal use.
I can use Python to scrape a specific webpage, but I'm having trouble making a serial search.
The single webpage scrape goes like this:
import requests
from bs4 import BeautifulSoup

url = "http://hotelname.com/arrivalDate=05%2F23%2F2016**&departureDate=05%2F24%2F2016"  # means arrive on May 23 and leave on May 24
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
names = soup.select('.PropertyName')
prices = soup.select('.RateSection ')
for name, price in zip(names, prices):
    data = {
        "name": name.get_text(),
        "price": price.get_text()
    }
    print(data)
By doing this I can get the prices of the hotels on that day. But I would like to know the prices over a longer period (say 15 days), so I can arrange my travel and save some money. The question is: how can I make the search loop automatically?
eg. hotelname('') price(200USD) May 1 Check in(CI) and May 2 check out(CO)
hotelname('') price(150USD) May 2 CI May 3 CO
..........
hotelname('') price(170USD) May30 CI May 31 CO
I hope my intentions are clear. Can someone guide me on how to achieve this automatic search? It is too much work to manually change the dates in the URLs. Thanks.
You can use the datetime lib to get the dates and increment a day at a time in the loop for n days:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

def n_booking(n):
    # start tomorrow
    bk = datetime.now() + timedelta(days=1)
    # check the next n days
    for i in range(n):
        mon, day, year = bk.month, bk.day, bk.year
        # go to the next day for the departure date
        bk = bk + timedelta(days=1)
        d_mon, d_day, d_year = bk.month, bk.day, bk.year
        url = "http://hotelname.com/arrivalDate={mon}%2F{day}%2F{year}**&departureDate={d_mon}%2F{d_day}%2F{d_year}"\
            .format(mon=mon, day=day, year=year, d_day=d_day, d_mon=d_mon, d_year=d_year)
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        names = soup.select('.PropertyName')
        prices = soup.select('.RateSection ')
        for name, price in zip(names, prices):
            yield {
                "name": name.get_text(),
                "price": price.get_text()
            }
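Since n_booking is a generator, it could then be driven like this (hypothetical usage for a 15-day window, still against the placeholder hotel URL from the question):

# print the name and price for each night over the next 15 days
for booking in n_booking(15):
    print(booking)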
