I am trying to scrape finance.yahoo.com and download a data file. Specifically, this url: https://finance.yahoo.com/quote/AAPL/history?p=AAPL
I would like to complete two objectives here:
1. Set the data time period parameter to "Max", which I believe requires Selenium.
2. Download and save the data file that sits behind the href that appears when inspecting "Download Data".
So far, I am unable to access the drop-down required to click "Max" and also cannot locate the href required to download the file.
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
options = webdriver.ChromeOptions()
options.add_argument('--log-level=3')
stock = input()
base_url = 'https://finance.yahoo.com/quote/{}/history?p={}'.format(stock, stock)
driver = webdriver.Chrome()
driver.get(base_url)
driver.maximize_window()
driver.implicitly_wait(4)
driver.find_element_by_class_name("Fl(end) Mt(3px) Cur(p)").click()
time.sleep(4)
driver.quit()
The following shows selectors you can use. I haven't added any wait conditions, because the only one needed in my test runs, a wait for all the new data to be present after pressing the Apply button, is one I couldn't find. Instead, I use a hard-coded time.sleep(5), which should be replaced with a better condition-based wait if possible.
from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
import time
d = webdriver.Chrome()
d.get('https://finance.yahoo.com/quote/AAPL/history?p=AAPL')
try:
    d.find_element_by_css_selector('[name=agree]').click() #oauth
except:
    pass
d.find_element_by_css_selector('[data-icon=CoreArrowDown]').click() #dropdown
d.find_element_by_css_selector('[data-value=MAX]').click() #max
d.find_element_by_css_selector('button.Fl\(start\)').click() # done
d.find_element_by_css_selector('button.Fl\(end\) span').click() #apply
time.sleep(5)
d.find_element_by_css_selector('[download]').click() #download
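If you want to replace the hard-coded sleep, one condition-based option is to grab a reference to the first row of the history table before clicking Apply and wait for that element to go stale, on the assumption that the table is re-rendered with the new data. A rough sketch (the table row selector is an assumption, not verified against the current markup):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

first_row = d.find_element_by_css_selector('table tbody tr')  # first data row before Apply (selector is a guess)
d.find_element_by_css_selector('button.Fl\(end\) span').click()  # apply
WebDriverWait(d, 15).until(EC.staleness_of(first_row))  # old row detaches once the table re-renders
d.find_element_by_css_selector('[download]').click()  # download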
You can eliminate #1 right off the bat -- just load the page directly, passing the parameters in the URL.
The base URI is: https://finance.yahoo.com/quote/AAPL/history
The available parameters are: period1, period2, interval, filter and frequency.
Pretty simple: just grab "now" as an epoch timestamp and use it as the period2 parameter, while period1 can simply be the beginning of the epoch, 0. The interval and frequency can be whatever you want: daily 1d, weekly 1wk, or monthly 1mo. Lastly, the filter should be history.
The completed URI:
https://finance.yahoo.com/quote/AAPL/history?period1=0&period2=1555905600&interval=1d&filter=history&frequency=1d
From there, use Selenium to locate and click the Download Data link.
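A minimal sketch of that approach, reusing the [download] selector from the answer above (the epoch arithmetic just uses the standard library; leave the browser open long enough for the download to finish):
import time
from selenium import webdriver

period2 = int(time.time())  # "now" as an epoch timestamp
url = ('https://finance.yahoo.com/quote/AAPL/history'
       '?period1=0&period2={}&interval=1d&filter=history&frequency=1d').format(period2)

driver = webdriver.Chrome()
driver.get(url)
driver.find_element_by_css_selector('[download]').click()  # the "Download Data" link
time.sleep(10)  # crude pause so the CSV download can complete before quitting
driver.quit()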
UPDATE:
As @QHarr also said, there are numerous questions all over Stack Overflow detailing how to work with Yahoo Finance. I also recommend you give searching a whirl.
I'm trying to pull the airline names and prices of a specific flight. I'm having trouble with the XPath and/or using the right HTML tags, because when I run the code below, all I get back is 14 empty lists.
from selenium import webdriver
from lxml import html
from time import sleep
driver = webdriver.Chrome(r"C:\Users\14074\Python\chromedriver")
URL = 'https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
sleep(1)
tree = html.fromstring(driver.page_source)
for flight_tree in tree.xpath('//div[@class="TQqf0e sSHqwe tPgKwe ogfYpf"]'):
    title = flight_tree.xpath('.//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[2]/div/div[2]/div[6]/div/div[2]/div/div[1]/div/div[1]/div/div[2]/div[2]/div[2]/span/text()')
    price = flight_tree.xpath('.//span[contains(@data-gs, "CjR")]')
    print(title, price)
#driver.close()
This is just the first part of my code, but I can't really continue without getting this to work. If anyone has any ideas on what I'm doing wrong, that would be amazing! It's been driving me crazy. Thank you!
I noticed a few issues with your code. First of all, I believe that when entering this page, Google will first show you the "I agree to terms and conditions" popup before showing the content of the page, so you need to click that button first.
Also, you should use the find_elements_by_xpath function directly on the driver instead of on the parsed page source, as this also lets you work with the JavaScript-rendered content. You can find more info here: python tree.xpath return empty list
To get more info on how to scrape using selenium and python you could check out this guide: https://www.webscrapingapi.com/python-selenium-web-scraper/
I used the following code to scrape the titles. (I also changed the XPaths to do so, extracting them directly from Google Chrome. You can do that by right-clicking an element -> Inspect, then in the Elements tab, right-clicking the element -> Copy -> Copy XPath.)
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
# I used these for the code to work on my windows subsystem linux
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=option)
URL = 'https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span').click() # this is necessary to press the I agree button
elements = driver.find_elements_by_xpath('//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[3]/div[3]/c-wiz/div/div[2]/div[1]/div/div/ol/li')
for flight_tree in elements:
    title = flight_tree.find_element_by_xpath('.//*[@class="W6bZuc YMlIz"]').text
    print(title)
I tried the code below, with the screen maximized and with explicit waits, and could successfully extract the information; please see below:
Sample code :
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.get("https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA")
wait = WebDriverWait(driver, 10)
titles = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div/descendant::h3")))
for name in titles:
    print(name.text)
    price = name.find_element(By.XPATH, "./../following-sibling::div/descendant::span[2]").text
    print(price)
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output :
Tokyo
₹38,473
Mumbai
₹3,515
Dubai
₹15,846
import os
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--start-maximized')
options.page_load_strategy = 'eager'
driver = webdriver.Chrome(options=options)
url = "https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24#MS24"
driver.get(url)
wait = WebDriverWait(driver, 20)
I want to find the value of cash EPS (standalone as well as consolidated), but the main problem is that only 5 values are shown on the page at a time; the remaining values are retrieved with the arrow button until it ends.
How can I retrieve all such values in one go?
Taking my comment further into code.
Comment:
This is a paging element; its href becomes "javascript:void();" once the clicks go past the page count. If there is still data, it has a page number there instead (4 in this case): moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/…. So either condition can be used for the exit!
The comments in the code below refer to this suggestion.
import pandas as pd  # needed for read_html

df_list = pd.read_html(driver.page_source) # read the table through pandas
result = df_list[0] # load the result, which will eventually be appended to for the next pages
current_page = driver.find_element_by_class_name('nextpaging') # find the span element
while True:
    current_page.click()
    time.sleep(20) # sleep for 20 seconds
    current_page = driver.find_element_by_class_name('nextpaging')
    paging_link = current_page.find_element_by_xpath('..') # get the parent of this span, which has the href
    print(f"Current url : { driver.current_url } Next paging link : { paging_link.get_attribute('href')} ")
    if "void" in paging_link.get_attribute('href'):
        print(f"Time to exit {paging_link.get_attribute('href')}")
        break # exit rule
    df_list = pd.read_html(driver.page_source)
    result = result.append(df_list[0]) # append the result
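Once the loop exits, result holds all the pages stitched together and can be written out, for example (the filename is just an example):
result.to_csv('maruti_ratios.csv', index=False)  # example filename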
Based on looking at the URL while navigating through this site:
https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/1#MS24
It appears the arrows navigate to a new URL, incrementing a number in the URL in front of the # symbol.
so, navigating through pages looks like this:
Page1: https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/1#MS24
Page2: https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/2#MS24
Page3: https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/3#MS24
etc...
These separate URLs can be used to navigate through this particular website. Something like this would probably work:
def get_pg_url(pgnum):
    return 'https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/{}#MS24'.format(pgnum)
Web scraping requires tuning to fit the target site. I entered pgnum=10000, which resulted in the text "Data Not Available for Key Financial Ratios" being displayed. You can probably use this text to tell you when there are no remaining pages.
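Putting those two observations together, a rough sketch of the paging loop might look like this (it assumes the ratios table is the first table pandas finds on each page):
import pandas as pd

frames = []
pgnum = 1
while True:
    driver.get(get_pg_url(pgnum))
    if "Data Not Available for Key Financial Ratios" in driver.page_source:
        break  # no more pages
    frames.append(pd.read_html(driver.page_source)[0])  # first table on the page (assumption)
    pgnum += 1
result = pd.concat(frames)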
The page I need to scrape data from: Digikey search result
Issue
Only 100 rows are allowed in each table, so I have to move between multiple tables using the NextPageButton.
As illustrated in the code below, I do that, but every time the results returned are from the first table; they never move on to the next table's results after my click action ActionChains(driver).click(element).perform().
Keep in mind that NO new page is opened; the click is intercepted by some JavaScript that does rich UI work on the same page to load a new table of data.
My Expectations
I am just trying to validate that I can move to the next table; then I will edit the code to loop through all of them.
This piece of code should return the data in the second table of results, BUT it actually returns the values from the first table, which loaded initially with the URL. This means the click action either didn't occur, or it did occur but the WebDriver content isn't being updated when interacting with the dynamic JavaScript elements on the page.
I would appreciate any help. Thanks!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver import ActionChains
import time
import sys
url = "https://www.digikey.com/en/products/filter/coaxial-connectors-rf-terminators/382?s=N4IgrCBcoA5QjAGhDOl4AYMF9tA"
chrome_driver_path = "..PATH\\chromedriver"
chrome_options = Options()
chrome_options.add_argument ("--headless")
webdriver = webdriver.Chrome(
executable_path= chrome_driver_path
,options= chrome_options
)
with webdriver as driver:
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    wait.until(presence_of_element_located((By.CSS_SELECTOR, "tbody")))
    element = driver.find_element_by_css_selector("button[data-testid='btn-next-page']")
    ActionChains(driver).click(element).perform()
    time.sleep(10) #too much time i know, but to make sure it is not a waiting issue. something needs to be updated
    results = driver.find_elements_by_css_selector("tbody")
    for count in results:
        countArr = count.text
        print(countArr)
        print()
    driver.close()
Finally found a SOLUTION!
Source of the solution.
As expected, the issue was in the clicking action itself. It is somehow not being done right, or not being done at all, as illustrated in the source question of the solution.
The solution is to click the button using JavaScript execution.
Change line 30
ActionChains(driver).click(element).perform()
to be as following:
driver.execute_script("arguments[0].click();",element)
That's it..
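With that change, the paging part of the script might look roughly like this (same selectors as in the question; the sleep is still a blunt wait and could be replaced with a condition-based one):
for _ in range(3):  # move through a few result tables as a sanity check
    next_btn = driver.find_element_by_css_selector("button[data-testid='btn-next-page']")
    driver.execute_script("arguments[0].click();", next_btn)
    time.sleep(10)
    for table in driver.find_elements_by_css_selector("tbody"):
        print(table.text)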
I have been trying to increase the date range of a table that gets generated on a different web page than the one where I set the parameters (that is how it is set up and there is nothing I can do to change that), and then scrape it to a text or CSV file. However, I always receive this timeout error once the date range gets up to more than about 10 days.
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: -0.001
(Session info: chrome=81.0.4044.122)
I eventually want this dataset to cover more than 20 years' worth of data, but that will take forever if I can only download a few days at a time! So I've been trying to use an explicit wait from WebDriverWait, and I have been trying to use --headless, but I have no idea whether I am using any of these features correctly.
What I can present is the very beginning of my code:
from selenium import webdriver
#from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import requests
import os
import html5lib
import time
#Load webpage faster
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
...and the following code, where I try to have my script switch its attention from one web page to another. The '/html/body/p[2]/text()[2]' XPath is the very last item on the web page I would like to load completely.
#Click Run Button From First Webpage
driver.find_element_by_partial_link_text('Run').click()
#time.sleep(2)
driver.implicitly_wait(2)
#Scrape Second Web Page
driver.switch_to.window(driver.window_handles[1])
WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, '/html/body/p[2]/text()[2]')))
df_url = driver.current_url
I've even tried By.CLASS_NAME = 'info' for good measure.
So my questions are the following: how can I use Selenium to download a bigger dataset? In particular, how can I best use explicit waits or headless mode for what I am dealing with right now (building my dataset bit by bit from data covering every 10-15 days)? Any assistance is truly appreciated.
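For what it's worth, a minimal sketch of how a headless session, a longer page-load timeout, and an explicit wait could fit together, reusing the imports above (the '/html/body/p[2]' locator is the XPath above with the text() step dropped, since Selenium waits need an element rather than a text node; whether this is the right element to wait on is an assumption):
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.set_page_load_timeout(300)  # allow the slow results page more time before the renderer timeout fires

# ... load the parameter page and set the date range here (URL not shown in the question) ...

driver.find_element_by_partial_link_text('Run').click()
driver.switch_to.window(driver.window_handles[1])
WebDriverWait(driver, 300).until(EC.visibility_of_element_located((By.XPATH, '/html/body/p[2]')))
df_url = driver.current_url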
I am trying to automate a YouTube search, click on a search result, and then follow the recommended videos on the right-hand side. The code I wrote goes to youtube, makes the search, and clicks on a video and the video gets opened up on the browser. However, I cannot bring it to click on one of the recommended videos.
The problem seems to be that, when I use recommended_videos = driver.find_elements_by_id("video-title") to get a list of elements of recommended videos from the right, what I get instead is a list from the previous page (when I first type in the word and get search results).
The code does work properly when I go directly to a video link with driver.get(url), instead of doing a search first.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import random
seed = 1
random.seed(seed)
driver = webdriver.Firefox()
driver.get("http://www.youtube.com/")
element = driver.find_element_by_tag_name("input")
# Put the word "history" in the search box and hit enter
element.send_keys("history")
element.send_keys(Keys.RETURN)
time.sleep(5)
# Get a list of elements (videos) that get returned by the search
search_results = driver.find_elements_by_id("video-title")
# Click randomly on one of the first five results
search_results[random.randint(0,4)].click()
# Go to the end of the page (I don't know if this is necessary)
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)
time.sleep(10)
# Get the recommended videos the same way as above. This is where the problem starts, because recommended_videos essentially becomes the same thing as the previous page's search_results, even though the browser is in a new page now.
recommended_videos = driver.find_elements_by_id("video-title")
recommended_videos[random.randint(0,4)].click()
So when I click (last line), I get the error
ElementNotInteractableException: Element <a id="video-title" class="yt-simple-endpoint style-scope ytd-video-renderer" href="/watch?v=1oean5l__Cc"> could not be scrolled into view
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import random
seed = 1
random.seed(seed)
driver = webdriver.Chrome()
driver.get("https://www.youtube.com")
element = driver.find_element_by_tag_name("input")
# Put the word "history" in the search box and hit enter
element.send_keys("history")
element.send_keys(Keys.RETURN)
time.sleep(5)
# Get a list of elements (videos) that get returned by the search
search_results = driver.find_elements_by_id("video-title")
# Click randomly on one of the first five results
search_results[random.randint(0,10)].click()
# Go to the end of the page (I don't know if this is necessary)
time.sleep(4)
# Get the recommended videos the same way as above. This is where the problem starts, because recommended_videos essentially becomes the same thing as the previous page's search_results, even though the browser is in a new page now.
while True:
    recommended_videos = driver.find_elements_by_xpath("//*[@id='dismissable']/div/a")
    print(recommended_videos)
    recommended_videos[random.randint(1,4)].click()
    time.sleep(4)
I am also completely new, and thank you for getting me interested in Selenium. I hope this code helps you. Change the driver to Firefox if you want.
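One further tweak worth trying: instead of the fixed time.sleep(4), wait until the recommended-video links are actually present before clicking (same XPath as above, using the WebDriverWait imports already in the script):
wait = WebDriverWait(driver, 15)
recommended_videos = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//*[@id='dismissable']/div/a"))
)
recommended_videos[random.randint(1, 4)].click()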