import os
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--start-maximized')
options.page_load_strategy = 'eager'
driver = webdriver.Chrome(options=options)
url = "https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24#MS24"
driver.get(url)
wait = WebDriverWait(driver, 20)
I want to find the value of Cash EPS (standalone as well as consolidated), but the main problem is that only 5 values are shown on the page; the remaining values are retrieved by clicking the arrow button until it ends.
How can I retrieve all such values in one go?
Taking my comment further into code.
Comment:
This is a paging element; its href becomes "javascript:void();" once the clicks go past the page count. If there is still data, the href holds a page number (4 in this case): moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/…. So either condition can be used for the exit!
The comments in the code refer to this suggestion.
import pandas as pd  # needed for read_html

df_list = pd.read_html(driver.page_source)  # read the table through pandas
result = df_list[0]  # load the result, which will eventually be appended to for the next pages
current_page = driver.find_element_by_class_name('nextpaging')  # find the paging span
while True:
    current_page.click()
    time.sleep(20)  # sleep for 20 seconds
    current_page = driver.find_element_by_class_name('nextpaging')
    paging_link = current_page.find_element_by_xpath('..')  # get the parent of this span, which has the href
    print(f"Current url : {driver.current_url} Next paging link : {paging_link.get_attribute('href')}")
    if "void" in paging_link.get_attribute('href'):
        print(f"Time to exit {paging_link.get_attribute('href')}")
        break  # exit rule
    df_list = pd.read_html(driver.page_source)
    result = pd.concat([result, df_list[0]])  # append the result (DataFrame.append is removed in newer pandas)
Based on looking at the URL while navigating through this site:
https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/1#MS24
It appears the arrows navigate to a new URL, incrementing the number that appears just before the # symbol.
So navigating through pages looks like this:
Page1: https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/1#MS24
Page2: https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/2#MS24
Page3: https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/3#MS24
etc...
These separate URLs can be used to navigate through this particular website. Something like this would probably work:
def get_pg_url(pgnum):
    return 'https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/{}#MS24'.format(pgnum)
Web scraping requires tuning to fit the target site. I entered pgnum=10000, which resulted in the text "Data Not Available for Key Financial Ratios" being displayed. You can probably use this text to tell you when there are no remaining pages.
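Putting the pieces together, a minimal sketch of the URL-based approach might look like the following. It reuses the driver from the setup above; the "Data Not Available" exit text is taken from the observation just mentioned, so verify it against the live page.

import pandas as pd

def get_pg_url(pgnum):
    return 'https://www.moneycontrol.com/financials/marutisuzukiindia/ratiosVI/MS24/{}#MS24'.format(pgnum)

frames = []
pgnum = 1
while True:
    driver.get(get_pg_url(pgnum))
    # Assumed exit condition: this text appears once a page has no more data
    if "Data Not Available" in driver.page_source:
        break
    frames.append(pd.read_html(driver.page_source)[0])  # first table on the page
    pgnum += 1

result = pd.concat(frames, ignore_index=True)
print(result)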
Page I need to scrape data from: Digikey Search result
Issue
Only 100 rows can be shown in each table, so I have to move between multiple tables using the NextPageButton.
As illustrated in the code below, I actually do click it, but every time I get back the first table's results; it doesn't move on to the next table's results after my click action ActionChains(driver).click(element).perform().
Keep in mind that NO new page is opened; the click is intercepted by some JavaScript that does rich UI work on the same page to load a new table of data.
My Expectations
I am just trying to validate that I can move to the next table; then I will edit the code to loop through all of them.
This piece of code should return the data in the second table of results, BUT it actually returns the values from the first table that was loaded initially with the URL. This means that either the click action didn't occur, or it did occur but the WebDriver content isn't being updated when interacting with the page's dynamic JavaScript elements.
I will appreciate any help. Thanks.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver import ActionChains
import time
import sys
url = "https://www.digikey.com/en/products/filter/coaxial-connectors-rf-terminators/382?s=N4IgrCBcoA5QjAGhDOl4AYMF9tA"
chrome_driver_path = "..PATH\\chromedriver"
chrome_options = Options()
chrome_options.add_argument("--headless")
webdriver = webdriver.Chrome(
    executable_path=chrome_driver_path,
    options=chrome_options
)
with webdriver as driver:
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    wait.until(presence_of_element_located((By.CSS_SELECTOR, "tbody")))
    element = driver.find_element_by_css_selector("button[data-testid='btn-next-page']")
    ActionChains(driver).click(element).perform()
    time.sleep(10)  # too much time, I know, but to make sure it is not a waiting issue; something needs to be updated
    results = driver.find_elements_by_css_selector("tbody")
    for count in results:
        countArr = count.text
        print(countArr)
        print()
    driver.close()
Finally found a SOLUTION!
Source of the solution.
As expected, the issue was in the clicking action itself. Somehow it is not being done right, or it is not being done at all, as illustrated in the linked source question.
The solution is to click the button using JavaScript execution.
Change the line
ActionChains(driver).click(element).perform()
to be as following:
driver.execute_script("arguments[0].click();",element)
That's it.
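Building on that fix, here is a hedged sketch of looping through all the result pages with the JavaScript click, assuming a driver that is already on the results page. The btn-next-page selector comes from the question; the assumption that the button carries a disabled attribute on the last page is mine and should be verified against the site.

import time

all_rows = []
while True:
    # Collect the text of every table body currently shown
    for tbody in driver.find_elements_by_css_selector("tbody"):
        all_rows.append(tbody.text)

    next_btn = driver.find_element_by_css_selector("button[data-testid='btn-next-page']")
    # Assumption: the button is disabled on the last results page
    if next_btn.get_attribute("disabled"):
        break

    driver.execute_script("arguments[0].click();", next_btn)  # JavaScript click, as above
    time.sleep(10)  # ideally replace with an explicit wait for the new rows

print(len(all_rows))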
I am working on a project where I am required to fetch data from a site using Selenium.
The website has a "load more" clickable div.
I have managed to make Selenium click the div, and it works; you can see it do the clicking when it's running in non-headless mode.
However, when I try to get all the items, I don't get the newly loaded
items after clicking.
Here is my code snippet:
driver.get('https://jamboshop.com/search/tv')
i = 1
maximum = 4
while i < maximum:
    try:
        i += 1
        el = driver.find_element_by_css_selector("div.showMoreLoaderPanel")
        action = ActionChains(driver)
        action.move_to_element(el).click().perform()
        driver.implicitly_wait(3)
    except:
        break
products = driver.find_elements_by_css_selector("div.col-xs-6.col-sm-4.col-md-4.col-lg-3")
for product in products:
    print({"item_name": product.find_element_by_css_selector("h6.prd-title").text})
This only prints the items that were present before the clicks... how do I get all the items on the page, including the ones loaded after clicking "load more"?
Extra:
# My imports and chrome settings
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
#chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(chrome_options=chrome_options)
I think this is a lazy-loading application. So when you go to the bottom of the page it seems to lose the previously captured elements, which is why you can only see the elements currently available on the page.
There is an alternative way to handle this: keep a list and capture the data while iterating in the while loop.
Code:
import time

driver.get('https://jamboshop.com/search/tv')
i = 1
maximum = 4
itemlist = []
while i < maximum:
    try:
        products = driver.find_elements_by_css_selector("div.col-xs-6.col-sm-4.col-md-4.col-lg-3")
        for product in products:
            if product.find_element_by_css_selector("h6.prd-title").text in itemlist:
                continue
            else:
                itemlist.append(product.find_element_by_css_selector("h6.prd-title").text)
        i += 1
        el = driver.find_element_by_css_selector("div.showMoreLoaderPanel")
        action = ActionChains(driver)
        action.move_to_element(el).click().perform()
        time.sleep(3)
    except:
        break
print(len(itemlist))
print(itemlist)
Let me know if this works for you. The website is not accessible at my end.
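Another hedged sketch, under the assumption that the product count stops growing once everything is loaded: keep clicking "load more", wait, and stop when no new items appear, then read the whole page once at the end. The selectors are taken from the snippets above; the stop-when-count-stalls logic is my own assumption.

import time
from selenium.webdriver.common.action_chains import ActionChains

driver.get('https://jamboshop.com/search/tv')
card_selector = "div.col-xs-6.col-sm-4.col-md-4.col-lg-3"

while True:
    before = len(driver.find_elements_by_css_selector(card_selector))
    try:
        el = driver.find_element_by_css_selector("div.showMoreLoaderPanel")
        ActionChains(driver).move_to_element(el).click().perform()
    except Exception:
        break  # no "load more" button left to click
    time.sleep(3)  # give the newly loaded items time to render
    after = len(driver.find_elements_by_css_selector(card_selector))
    if after == before:
        break  # nothing new appeared, so assume everything is loaded

for product in driver.find_elements_by_css_selector(card_selector):
    print({"item_name": product.find_element_by_css_selector("h6.prd-title").text})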
I am trying to automate a YouTube search, click on a search result, and then follow the recommended videos on the right-hand side. The code I wrote goes to YouTube, makes the search, and clicks on a video, and the video gets opened in the browser. However, I cannot get it to click on one of the recommended videos.
The problem seems to be that, when I use recommended_videos = driver.find_elements_by_id("video-title") to get a list of elements of recommended videos from the right, what I get instead is a list from the previous page (when I first type in the word and get search results).
The code does work properly when I go directly to a video link with driver.get(url), instead of doing a search first.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import random
seed = 1
random.seed(seed)
driver = webdriver.Firefox()
driver.get("http://www.youtube.com/")
element = driver.find_element_by_tag_name("input")
# Put the word "history" in the search box and hit enter
element.send_keys("history")
element.send_keys(Keys.RETURN)
time.sleep(5)
# Get a list of elements (videos) that get returned by the search
search_results = driver.find_elements_by_id("video-title")
# Click randomly on one of the first five results
search_results[random.randint(0,4)].click()
# Go to the end of the page (I don't know if this is necessary)
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)
time.sleep(10)
# Get the recommended videos the same way as above. This is where the problem starts, because recommended_videos essentially becomes the same thing as the previous page's search_results, even though the browser is in a new page now.
recommended_videos = driver.find_elements_by_id("video-title")
recommended_videos[random.randint(0,4)].click()
So when I click (last line), I get the error
ElementNotInteractableException: Element <a id="video-title" class="yt-simple-endpoint style-scope ytd-video-renderer" href="/watch?v=1oean5l__Cc"> could not be scrolled into view
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import random
seed = 1
random.seed(seed)
driver = webdriver.Chrome()
driver.get("https://www.youtube.com")
element = driver.find_element_by_tag_name("input")
# Put the word "history" in the search box and hit enter
element.send_keys("history")
element.send_keys(Keys.RETURN)
time.sleep(5)
# Get a list of elements (videos) that get returned by the search
search_results = driver.find_elements_by_id("video-title")
# Click randomly on one of the first five results
search_results[random.randint(0,10)].click()
# Go to the end of the page (I don't know if this is necessary)
time.sleep(4)
# Get the recommended videos the same way as above. This is where the problem starts, because recommended_videos essentially becomes the same thing as the previous page's search_results, even though the browser is in a new page now.
while True:
    recommended_videos = driver.find_elements_by_xpath("//*[@id='dismissable']/div/a")
    print(recommended_videos)
    recommended_videos[random.randint(1, 4)].click()
    time.sleep(4)
I am also completely new, and thank you for getting me interested in Selenium. May this code help you. Change the driver to Firefox if you want.
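If you would rather stay close to the original Firefox code, a hedged sketch is to wait until the watch page has loaded, scope the lookup to the sidebar, and scroll the chosen link into view before clicking. The "#secondary" container id is my assumption about the watch-page markup (the old search results appear to stay hidden in the DOM, which is why the bare video-title lookup still returns them), so confirm it in the inspector.

import random
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)

# Wait until we are actually on a watch page before looking for recommendations
wait.until(EC.url_contains("/watch"))

# Assumption: the recommended-videos sidebar lives under the element with id "secondary";
# scoping the search there avoids picking up the hidden results from the previous page.
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#secondary #video-title")))
recommended = [e for e in driver.find_elements_by_css_selector("#secondary #video-title")
               if e.is_displayed()]

choice = recommended[random.randint(0, min(4, len(recommended) - 1))]
# Scroll the chosen link into view first to avoid "could not be scrolled into view"
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", choice)
time.sleep(1)
choice.click()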
I am trying to scrape finance.yahoo.com and download a data file. Specifically, this url: https://finance.yahoo.com/quote/AAPL/history?p=AAPL
I would like to complete two objectives here:
I would like to set the data time period parameter to "Max", for which I believe I need to use Selenium, and
I would like to download and save the data file that is embedded in the href that appears when inspecting "Download Data".
So far, I am unable to access the drop-down required to click "Max" and also cannot locate the href required to download the file.
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
options = webdriver.ChromeOptions()
options.add_argument('--log-level=3')
stock = input()
base_url = 'https://finance.yahoo.com/quote/{}/history?p={}'.format(stock, stock)
driver = webdriver.Chrome()
driver.get(base_url)
driver.maximize_window()
driver.implicitly_wait(4)
driver.find_element_by_class_name("Fl(end) Mt(3px) Cur(p)").click()
time.sleep(4)
driver.quit()
The following shows selectors you can use. I haven't added any wait conditions, because the only one needed, a wait for all the new data to be present after pressing the Apply button, was one I couldn't find in my test runs. Instead, I use a hard-coded time.sleep(5), which should be replaced with a better condition-based wait if possible.
from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
import time
d = webdriver.Chrome()
d.get('https://finance.yahoo.com/quote/AAPL/history?p=AAPL')
try:
    d.find_element_by_css_selector('[name=agree]').click()  # oauth consent, if shown
except:
    pass
d.find_element_by_css_selector('[data-icon=CoreArrowDown]').click() #dropdown
d.find_element_by_css_selector('[data-value=MAX]').click() #max
d.find_element_by_css_selector('button.Fl\(start\)').click() # done
d.find_element_by_css_selector('button.Fl\(end\) span').click() #apply
time.sleep(5)
d.find_element_by_css_selector('[download]').click() #download
You can eliminate #1 right off the bat -- just view the page directly, passing the parameters as requested.
The base URI is: https://finance.yahoo.com/quote/AAPL/history
The available parameters are: period1, period2, interval, filter and frequency.
Pretty simple: just grab "now" as an epoch timestamp and use it as the period2 parameter, while period1 can simply be the beginning epoch, 0. The interval and frequency can be whatever you want: daily 1d, weekly 1wk, or monthly 1mo. Lastly, the filter should be history.
The completed URI:
https://finance.yahoo.com/quote/AAPL/history?period1=0&period2=1555905600&interval=1d&filter=history&frequency=1d
From there, use Selenium to locate and click the Download Data link.
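For illustration, a hedged sketch combining both ideas: build the URI with period1=0 and period2 set to the current epoch time, then have Selenium click the element carrying the download attribute. The [name=agree] and [download] selectors are borrowed from the answer above and may change if Yahoo updates the page.

import time
from selenium import webdriver

stock = 'AAPL'
period2 = int(time.time())  # "now" as an epoch timestamp
url = ('https://finance.yahoo.com/quote/{0}/history'
       '?period1=0&period2={1}&interval=1d&filter=history&frequency=1d').format(stock, period2)

driver = webdriver.Chrome()
driver.get(url)
try:
    driver.find_element_by_css_selector('[name=agree]').click()  # consent page, if it appears
except Exception:
    pass
time.sleep(5)  # crude wait for the history table; replace with an explicit wait if possible
driver.find_element_by_css_selector('[download]').click()  # the "Download Data" link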
UPDATE:
As @QHarr also said, there are numerous questions all over Stack Overflow detailing how to work with Yahoo Finance. I also recommend you give searching a whirl.
I've written a script in Python in combination with Selenium to get some names and corresponding addresses displayed upon a search; the search keyword is "Saskatoon". However, the data, in this case, traverse multiple pages. My script does almost everything except for one thing.
It still runs even though there are no more pages to traverse. The last page also holds the ">" sign for the next-page option and it is not grayed out.
Here is the link: Page_link
Search_keyword: Saskatoon (in the city/town field).
Here is what I've written:
from selenium import webdriver; import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("above_link")
time.sleep(3)
search_input = driver.find_element_by_id("cityField")
search_input.clear()
search_input.send_keys("Saskatoon")
search_input.send_keys(Keys.ENTER)
while True:
    try:
        wait.until(EC.visibility_of_element_located((By.LINK_TEXT, "›"))).click()
        time.sleep(2)
    except:
        break
driver.quit()
BTW, I've just taken out the name and address part from this script, which I suppose is not relevant here. Thanks.
You can use the class attribute of the "›" button: on the last page it is "ng-scope disabled", while on the rest of the pages it is "ng-scope":
wait.until(EC.visibility_of_element_located((By.XPATH, "//li[@class='ng-scope']/a[.='›']"))).click()
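Plugged into the question's loop (reusing the driver and wait defined there), that condition might look like the following hedged sketch; a TimeoutException ends the loop once the enabled "›" link can no longer be found.

from selenium.common.exceptions import TimeoutException

while True:
    try:
        # Matches the "›" link only while its parent <li> still has class "ng-scope"
        # (on the last page the class becomes "ng-scope disabled").
        wait.until(EC.visibility_of_element_located(
            (By.XPATH, "//li[@class='ng-scope']/a[.='›']"))).click()
        time.sleep(2)
    except TimeoutException:
        break
driver.quit()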