How can I run multiple Python Selenium loops on one element? - python

I want to extract price data from the same element via two while loops running at the same time: a slow loop (with time.sleep(1)) that reads the price every 1 second, and a fast loop (with time.sleep(0.2)) that reads the price every 0.2 seconds.
Code for extracting price data with only one loop that works:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from datetime import datetime
import time
#~~~~~~~~~~~~~~~# Cookies Saver:
chrome_options = Options()
chrome_options.add_argument("user-data-dir=selenium")
driver = webdriver.Chrome(chrome_options=chrome_options)
#~~~~~~~~~~~~~~~# Get to the link:
driver.get('https://coininfoline.com/currencies/ETH/ethereum/')
input('ENTER AFTER PAGE LOADED:')
#~~~~~~~~~~~~~~~# Get price data:
while True:
    price_extractor = driver.find_elements_by_xpath('//span[@class="cmc-formatted-price"]')
    for price_raw in price_extractor:
        price = price_raw.text
    #~~~~~~~~~~~~~~~# Time Stamp:
    timestamper = datetime.now()
    timestamper.microsecond
    #~~~~~~~~~~~~~~~# Date and Price Printer:
    print(timestamper, str(price))
    time.sleep(1)
What I expect:
A second while loop that extracts price data AT THE SAME TIME, but faster, with time.sleep(0.2). How can this be done, or is it even possible?
Maybe it would be possible with multiprocessing?
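This isn't answered in the thread, but one minimal sketch of the idea is to run the two polling loops in separate threads that share the driver. Treat it as an experiment rather than a guaranteed solution: Selenium's WebDriver is not documented as thread-safe, the URL and XPath are taken from the question above, and the old-style find_elements_by_xpath calls assume Selenium 3. With multiprocessing, each process would typically need its own driver instance, since a WebDriver object cannot be shared between processes.
import threading
import time
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("user-data-dir=selenium")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://coininfoline.com/currencies/ETH/ethereum/')
input('ENTER AFTER PAGE LOADED:')

def poll_price(interval, label):
    # Each thread reads the same price element at its own interval.
    # NOTE: both threads share one WebDriver, which is not guaranteed to be thread-safe.
    while True:
        for price_raw in driver.find_elements_by_xpath('//span[@class="cmc-formatted-price"]'):
            print(label, datetime.now(), price_raw.text)
        time.sleep(interval)

# Slow loop every 1 second, fast loop every 0.2 seconds, running at the same time.
threading.Thread(target=poll_price, args=(1, 'slow'), daemon=True).start()
threading.Thread(target=poll_price, args=(0.2, 'fast'), daemon=True).start()

# Keep the main thread alive while the daemon threads poll.
while True:
    time.sleep(10)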

Related

How to retrieve the table up to the last page?

In the code below I don't know where the last page is; it only works up to PAGE 25, which I entered manually. Sometimes there are 60 or 70 pages! How can I change the code so it gets the table up to the last page?
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Chrome('C:\Webdriver\chromedriver.exe')
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)
Canada_Result = []
for J in range(25):
    commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[2]/a')
    Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[4]')
    quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[7]')
    weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[8]/abbr')
    for i in range(25):
        temporary_data = {'Commodity': commodities[i].text, 'Country': Countries[i].text, 'quantity': quantities[i].text, 'weight': weights[i].text}
        Canada_Result.append(temporary_data)
    df_data = pd.DataFrame(Canada_Result)
    df_data.to_excel('Canada_scrapping_result.xlsx', index=False)
    # click on the Next button
    driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
    time.sleep(1)
I used the variable pages to find how many pagination buttons there were, and then worked out how many pages there are by reading the text of the last non-"Next" button on the page. This works well for multi-page selections, and I have also implemented a fallback in case the selection only has one page, e.g. fewer than 25 rows retrieved.
I added an if statement near the start for the categories that have fewer than 25 rows, for example this one https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=9&r3=1&r4=02&r5=0&r7=0&r8=2022-01-01&r9=2022-04-01 which only has 19 rows retrieved.
The variable period_entries just counts how many row entries there are on each page. I added this mainly because the last page had only 22 entries instead of 25, which initially broke the program when it reached the end.
The last if statement ensures the program still scrapes the very last page but does not try to click the Next button, since it is no longer available.
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Chrome()
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)
Canada_Result = []
# Work out how many pages there are from the text of the last pagination button.
pages = len(driver.find_elements_by_xpath('//a[@class="paginate_button" or @class="paginate_button current" and @title]'))
pages = driver.find_element_by_xpath('//a[@onclick and @class="paginate_button" or @class="paginate_button current" and @title][%d]' % (pages)).text.strip("Page\n")
if pages == '':
    # Single-page selections (fewer than 25 rows) have no numbered buttons.
    pages = 1
for J in range(int(pages)):
    commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[2]/a')
    Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[4]')
    quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[7]')
    weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[8]/abbr')
    # Number of row entries on this page (the last page can have fewer than 25).
    period_entries = len(commodities)
    for i in range(period_entries):
        temporary_data = {'Commodity': commodities[i].text, 'Country': Countries[i].text, 'quantity': quantities[i].text, 'weight': weights[i].text}
        Canada_Result.append(temporary_data)
    df_data = pd.DataFrame(Canada_Result)
    df_data.to_excel('Canada_scrapping_result.xlsx', index=False)
    # On the last page there is no Next button to click, so stop here.
    if J == int(pages) - 1:
        print("Done")
        break
    # click on the Next button
    driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
    time.sleep(1)

Dealing with run inconsistencies with web scraping

I am scraping data on mutual funds from the Vanguard website and my code gives me some inconsistencies in the data between runs. How can I make my scraping code more robust to avoid these?
I am scraping data from this page and trying to get the average duration from the characteristics table.
Sometimes all the tickers go through with no problem, and other times some of the data on the page is missed. I assume this has to do with the scraping happening before the data has fully loaded, but it only happens sometimes.
Here is the output for two back-to-back runs, showing a ticker scraped successfully in one run and its data missing in the next.
VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX # Here the data for VFSTX is successfully scraped
Fund total net assets $79.3 billion
Number of bonds 2519
Average effective maturity 2.8 years
Average duration 2.7 years
Yield to maturity 1.0%
VFISX
Fund total net assets $7.8 billion
Number of bonds 75
2.2 years
2.2 years
Yield to maturity 0.3%
# here data is missing for VFISX

Second run:
VBIRX
Fund total net assets $74.7 billion
Number of bonds 2654
Average effective maturity 2.9 years
Average duration 2.8 years
Yield to maturity 0.5%
VSGBX
Fund total net assets $8.5 billion
Number of bonds 195
Average effective maturity 3.4 years
Average duration 1.7 years
Yield to maturity 0.4%
VFSTX
Fund total net assets $79.3 billion
Number of bonds 2519
2.8 years
2.7 years
Yield to maturity 1.0%
# Here data is missing for VFSTX even though it worked in the previous run
The main issue is that for certain tickers the table has a different length, so I am using a dictionary to store the data with the relevant label as a key. On some runs, the 'Average effective maturity' and 'Average duration' labels go missing, which breaks how I access the data.
As you can see from my output, the code works only sometimes, and I am not sure whether waiting for a different element on the page to load would fix it. How should I go about identifying the problem?
Here is the relevant code I am using:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import os
import csv

def extractOverviewTable(htmlTable):
    table = htmlTable.find('tbody')
    rows = table.findAll('tr')
    returnDict = {}
    for row in rows:
        cols = row.findAll('td')
        key = cols[0].find('span').text.replace('\n', '')
        value = cols[1].text.replace('\n', '')
        if 'Layer' in key:
            key = key[:key.index('Layer')]
        print(key, value)
        returnDict[key] = value
    return returnDict

def main():
    dirname = os.path.dirname(__file__)
    symbols = []
    with open(os.path.join(dirname, 'symbols.csv')) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if row:
                symbols.append(row[0])
    symbols = [s.strip() for s in symbols if s.startswith('V')]
    options = webdriver.ChromeOptions()
    options.page_load_strategy = 'normal'
    options.add_argument('--headless')
    browser = webdriver.Chrome(options=options, executable_path=os.path.join(dirname, 'chromedriver'))
    url_vanguard = 'https://investor.vanguard.com/mutual-funds/profile/overview/{}'
    for symbol in symbols:
        browser.get(url_vanguard.format(symbol))
        print(symbol)
        WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, '/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[2]/div[4]/div[2]/div[2]/div[1]/div/table/tbody/tr[4]')))
        html = browser.page_source
        mySoup = BeautifulSoup(html, 'html.parser')
        htmlData = mySoup.findAll('table', {'role': 'presentation'})
        overviewDataList = extractOverviewTable(htmlData[2])
Here is a subset of the symbols.csv file I am using:
VBIRX
VSGBX
VFSTX
VFISX
VMLTX
VWSTX
VFIIX
VWEHX
VBILX
VFICX
VFITX
Try EC.visibility_of_element_located instead of EC.presence_of_element_located; if that doesn't work, try adding a time.sleep() of 1-2 seconds after the WebDriverWait statement.
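For illustration, the suggested change drops into the question's loop roughly like this (a sketch only; it reuses the long XPath and imports from the question, and the commented fallback would need import time):
# Wait until the table row is actually visible, not merely present in the DOM.
WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH,
    '/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[2]/div[4]/div[2]/div[2]/div[1]/div/table/tbody/tr[4]')))
# Optional fallback if some runs still miss data:
# time.sleep(2)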

How do I stop this print(val.text) from printing the same table 8 times?

I am scraping tables from a Matchday dropdown. When print(val.text) is executed it prints each table 8 times (the same as the number of rows) instead of just once, then moves on to the next Matchday and again prints the table 8 times. I just want each table printed once. Can someone help me find where the problem is?
I would also like to append the scraped tables to an Excel file; any assistance would be appreciated.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome('C:\Chrome\chromedriver.exe')
driver.get('https://www.betika.com/mobile/#/virtuals/results')
#Selecting the Season dropdown box
Seasons = Select(driver.find_element_by_css_selector('td:nth-of-type(1) > .virtuals__results__header__select'))
#Selecting the Matchday dropdown box
Matchdays = Select(driver.find_element_by_css_selector('td:nth-of-type(2) > .virtuals__results__header__select'))
import time
#selecting all the options the Seasons dropdowns.
Season = len(Seasons.options)
for items in reversed(range(Season)):
    Seasons.select_by_index(items)
    time.sleep(2)
    #selecting all the options the Matchdays dropdowns.
    Matchday = len(Matchdays.options)
    for items in range(Matchday):
        Matchdays.select_by_index(items)
        time.sleep(2)
        rows = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr'))  #count number of rows in the table
        columns = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr[2]/td'))  #count number of columns in the table
        # print(rows)
        # print(columns)
        for r in range(3, rows+1):
            for c in range(1, columns+1):
                value = driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table["+str(r)+"]/tbody/tr["+str(c)+"]')
                for val in value:
                    print(val.text)
It looks like this line is your problem:
value = driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table["+str(r)+"]/tbody/tr["+str(c)+"]')
Because the whole expression sits inside one string literal, "+str(r)+" is passed to XPath as literal text rather than a row number. I would suggest using .format() so the indices end up as numeric predicates. See below; I commented out the problematic line and added my updated line.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome('C:\Chrome\chromedriver.exe')
driver.get('https://www.betika.com/mobile/#/virtuals/results')
#Selecting the Season dropdown box
Seasons = Select(driver.find_element_by_css_selector('td:nth-of-type(1) > .virtuals__results__header__select'))
#Selecting the Matchday dropdown box
Matchdays = Select(driver.find_element_by_css_selector('td:nth-of-type(2) > .virtuals__results__header__select'))
import time
#selecting all the options the Seasons dropdowns.
Season = len(Seasons.options)
for items in reversed(range(Season)):
    Seasons.select_by_index(items)
    time.sleep(2)
    #selecting all the options the Matchdays dropdowns.
    Matchday = len(Matchdays.options)
    for items in range(Matchday):
        Matchdays.select_by_index(items)
        time.sleep(2)
        rows = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr'))  #count number of rows in the table
        columns = len(driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table[2]/tbody/tr[2]/td'))  #count number of columns in the table
        # print(rows)
        # print(columns)
        for r in range(3, rows+1):
            for c in range(1, columns+1):
                #value = driver.find_elements_by_xpath('//*[@id="app"]/main/div/div[2]/table["+str(r)+"]/tbody/tr["+str(c)+"]')
                value = driver.find_elements_by_xpath("//*[@id='app']/main/div/div[2]/table[{}]/tbody/tr[{}]".format(r, c))
                for val in value:
                    print(val.text)
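The question also asks about appending the scraped tables to an Excel file. That part is not covered above, but one common pattern (a sketch, using pandas in the same way as the Statistics Canada example on this page) is to collect every scraped value into a list of dicts while the loops run and write the list out once at the end:
import pandas as pd

all_rows = []

# Inside the scraping loops, record each value as well as printing it, e.g.:
#     all_rows.append({'season': ..., 'matchday': ..., 'text': val.text})

# After all loops finish, write everything to a single Excel file.
pd.DataFrame(all_rows).to_excel('matchday_results.xlsx', index=False)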

Pass two parameters into a URL while running a loop to web-scrape data

import requests

for i in range(len(lat_lon_df)):
    lat, lon = lat_lon_df.iloc[i]
    try:
        page = requests.get("https://forecast.weather.gov/MapClick.php?lat={}&lon={}&unit=0&lg=english&FcstType=graphical".format(lat).format(lon))
        print(page)
    except:
        continue
Lat Long
0 55.999722 -161.207778
1 60.891854 -161.392330
2 60.890632 -161.199325
3 54.143012 -165.785368
4 62.746967 -164.602280
I am trying to run a loop to scrape data while taking new latitude and longitude parameters for every iteration.
Use a single .format() call with both arguments:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat={}&lon={}&unit=0&lg=english&FcstType=graphical".format(lat, lon))
Alternative:
page = requests.get(f"https://forecast.weather.gov/MapClick.php?lat={lat}&lon={lon}&unit=0&lg=english&FcstType=graphical")

How can Python auto-scrape on a daily basis given certain URL change rules?

Hi, I'm a newcomer to Python. As a salesman I travel a lot and would like to save some money on hotel bookings, so I am using Python to scrape certain hotels on certain days for personal use.
I can use Python to scrape a specific webpage, but I'm having trouble making a serial search.
The single-webpage scrape goes like this:
import requests
from bs4 import BeautifulSoup

url = "http://hotelname.com/arrivalDate=05%2F23%2F2016**&departureDate=05%2F24%2F2016"  # means arrive on May 23 and leave on May 24
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
names = soup.select('.PropertyName')
prices = soup.select('.RateSection ')
for name, price in zip(names, prices):
    data = {
        "name": name.get_text(),
        "price": price.get_text()
    }
    print(data)
By doing this I can get the hotel prices for that day. But I would like to know the prices over a longer period (say 15 days), so I can plan my travel and save some money. The question is: how can I make the search loop automatically?
e.g. hotelname('') price(200USD) May 1 check-in (CI) and May 2 check-out (CO)
hotelname('') price(150USD) May 2 CI May 3 CO
..........
hotelname('') price(170USD) May 30 CI May 31 CO
I hope that makes my intention clear. Can someone suggest how I should go about this automatic search? It is too much work to change the dates in the URLs manually. Thanks.
You can use the datetime lib to get the dates and increment a day at a time in the loop for n days:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

def n_booking(n):
    # start tomorrow
    bk = datetime.now() + timedelta(days=1)
    # check next n days
    for i in range(n):
        # check-in date
        mon, day, year = bk.month, bk.day, bk.year
        # go to next day (check-out date)
        bk = bk + timedelta(days=1)
        d_mon, d_day, d_year = bk.month, bk.day, bk.year
        url = "http://hotelname.com/arrivalDate={mon}%2F{day}%2F{year}**&departureDate={d_mon}%2F{d_day}%2F{d_year}"\
            .format(mon=mon, day=day, year=year, d_day=d_day, d_mon=d_mon, d_year=d_year)
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        names = soup.select('.PropertyName')
        prices = soup.select('.RateSection ')
        for name, price in zip(names, prices):
            yield {
                "name": name.get_text(),
                "price": price.get_text()
            }
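Since n_booking is a generator, you iterate over it to consume the results, e.g. for the next 15 nights (each yielded dict has the name and price keys built above):
for result in n_booking(15):
    print(result)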
