Scraping an updating JavaScript page in Python

I've been working on a research project that needs a list of reference articles from the Brazilian Hemeroteca Digital. The desired page reference, for example http://memoria.bn.br/DocReader/720887x/839, has to be assembled from two hidden elements on the following page: http://memoria.bn.br/DocReader/docreader.aspx?bib=720887x&pasta=ano%20189&pesq=Milho. I asked a question a few weeks back that was answered, and that part is now running well, but I've hit a new snag and I'm not sure how to get around it.
After the first form is filled in, the site redirects to a second, JavaScript/AJAX-driven page. I need to step through all of the matches on it, which is done by clicking a button at the top of the page. When I click the next-page button, the elements I'm reading are being updated, which leads to stale element errors. I've tried a few pieces of code to detect when this "stale" effect occurs, as a signal that the page has changed, but without much luck. Here is the code I've implemented:
import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time
saveDir = "C:/tmp"
print("Opening Page...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Searching for elements")
fLink = ""
fails = 0
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
iframe = browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
xpath_form = "//input[@name=\'PesquisarBtn1\']"
xpath_journal = "//li[text()=\'"+search_journal+"\']"
xpath_timeRange = "//input[@name=\'PeriodoCmb1\' and not(@disabled)]"
xpath_timeSelect = "//li[text()=\'"+search_timeRange+"\']"
xpath_searchTerm = "//input[@name=\'PesquisaTxt1\']"
print("Locating Journal/Periodical")
journal.click()
dropDownJournal = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.XPATH, xpath_journal)))
dropDownJournal.click()
print("Waiting for Time Selection")
try:
    timeRange = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeRange)))
    timeRange.click()
    time.sleep(1)
    print("Locating Time Range")
    dropDownTime = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeSelect)))
    dropDownTime.click()
    time.sleep(1)
except:
    print("Failed...")
print("Adding Search Term")
searchTerm = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_searchTerm)))
searchTerm.clear()
searchTerm.send_keys(search_text)
time.sleep(5)
print("Perform search")
submitButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_form)))
submitButton.click()
# Wait for the second page to load, pull what we need from it.
download_list = []
browser.switch_to_window(browser.window_handles[-1])
print("Waiting for next page to load...")
matches = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//span[@id=\'OcorNroLbl\']")))
print("Next page ready, found match element... counting")
countText = matches.text
countTotal = int(countText[countText.find("/")+1:])
print("A total of " + str(countTotal) + " matches have been found, standing by for page load.")
for i in range(1, countTotal+2):
    print("Waiting for page " + str(i-1) + " to load...")
    while(fLink in download_list):
        try:
            jIDElement = browser.find_element_by_xpath("//input[@name=\'HiddenBibAlias\']")
            jPageElement = browser.find_element_by_xpath("//input[@name=\'hPagFis\']")
            fLink = "http://memoria.bn.br/DocReader/" + jIDElement.get_attribute('value') + "/" + jPageElement.get_attribute('value') + "&pesq=" + search_text
        except:
            fails += 1
            time.sleep(1)
            if(fails == 10):
                print("Locked on a page, attempting to push to next.")
                nextPageButton = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
                nextPageButton.click()
                #raise
    while(fLink == ""):
        jIDElement = browser.find_element_by_xpath("//input[@name=\'HiddenBibAlias\']")
        jPageElement = browser.find_element_by_xpath("//input[@name=\'hPagFis\']")
        fLink = "http://memoria.bn.br/DocReader/" + jIDElement.get_attribute('value') + "/" + jPageElement.get_attribute('value') + "&pesq=" + search_text
    fails = 0
    print("Link obtained: " + fLink)
    download_list.append(fLink)
    if(i != countTotal):
        print("Moving to next page...")
        nextPageButton = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
        nextPageButton.click()
There are two "bugs" I'm trying to solve in this block. First, the very first page is always skipped in the loop (i.e., fLink = ""), even though there is a test for it, and I'm not sure why. Second, the code hangs on specific pages seemingly at random, and the only way out is to interrupt execution.
This block has been modified a few times, so I know it's not the most "elegant" of solutions, but I'm starting to run out of time.

After taking a day off from this to think about it (and get some more sleep), I was able to figure out what was going on. The above code has three big faults. The first is that it does not distinguish a StaleElementReferenceException from a NoSuchElementException, both of which can occur while the page is changing. Second, the inner loop condition checked whether the current link was already in the list; on the first pass the list was empty, so the loop body never ran and the blank link was let through directly (I should have used a do-while there, but I made more modifications). Finally, I made the silly error of only checking whether the first hidden element was changing, when that element is in fact the journal ID and is pretty much constant throughout.
The revisions began with an adaptation of code from another SO answer to implement a "hold" condition until either of the hidden elements changed:
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import NoSuchElementException
def hold_until_element_changed(driver, element1_xpath, element2_xpath, old_element1_text, old_element2_text):
    while True:
        try:
            element1 = driver.find_element_by_xpath(element1_xpath)
            element2 = driver.find_element_by_xpath(element2_xpath)
            if (element1.get_attribute('value') != old_element1_text) or (element2.get_attribute('value') != old_element2_text):
                break
        except StaleElementReferenceException:
            break
        except NoSuchElementException:
            return False
        time.sleep(1)
    return True
I then modified the original loop, going back to the plain "for" counter without an inner loop and calling the function above to "hold" until the page has flipped, and voilà, it worked like a charm. (NOTE: I also raised the timeout on the next-page button, as that was what caused the locking condition.)
for i in range(1, countTotal+1):
    print("Waiting for page " + str(i) + " to load...")
    bibxpath = "//input[@name=\'HiddenBibAlias\']"
    pagexpath = "//input[@name=\'hPagFis\']"
    jIDElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, bibxpath)))
    jPageElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, pagexpath)))
    jidtext = jIDElement.get_attribute('value')
    jpagetext = jPageElement.get_attribute('value')
    fLink = "http://memoria.bn.br/DocReader/" + jidtext + "/" + jpagetext + "&pesq=" + search_text
    print("Link obtained: " + fLink)
    download_list.append(fLink)
    if(i != countTotal):
        print("Moving to next page...")
        nextPageButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
        nextPageButton.click()
        # Wait for next page to be ready
        change = hold_until_element_changed(browser, bibxpath, pagexpath, jidtext, jpagetext)
        if(change == False):
            print("Something went wrong.")
All in all, a good exercise in thought and some helpful links for me to consider when posting future questions. Thanks!
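One caveat with the hold function above is that it loops forever if neither hidden value ever changes. A minimal sketch of the same idea with a deadline is shown below. The name `wait_for_change` and its parameters are hypothetical, and `read_value` stands in for whatever re-locates the hidden input and reads its `value` attribute. The exception classes to swallow are passed in by the caller, so the helper itself does not depend on Selenium.

```python
import time


def wait_for_change(read_value, old_value, timeout=20.0, poll=0.5, ignored=()):
    """Poll read_value() until it returns something other than old_value.

    Returns the new value, or None if the timeout expires first.
    Exceptions listed in `ignored` are treated as "page still updating".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            current = read_value()
            if current != old_value:
                return current
        except ignored:
            pass  # element vanished or went stale mid-update; try again
        time.sleep(poll)
    return None
```

With the code above, `read_value` could be something like `lambda: browser.find_element_by_xpath(pagexpath).get_attribute('value')`, with `ignored=(StaleElementReferenceException, NoSuchElementException)`.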

Related

How to iterate over and collect a dynamic list?

For a few days I've been trying hard to understand why my code raises an IndexError. I want to get 150 Instagram likers into my list, and I recently discovered that when I scroll inside the likes pop-up, it generates a new div as I scroll:
Highlighted in yellow: the new div that appears when I scroll.
To handle this I made a loop that runs a scroll script every fifth liker.
My code:
fBody = browser.find_element_by_xpath("/html/body/div[5]/div/div/div[2]/div")
raw_elems = browser.find_elements_by_xpath("//body//div//span[@class='Jv7Aj mArmR MqpiF ']//a[@class='FPmhX notranslate MBL3Z']")
for i in range(0,150):
    i += 1
    if(i%6) == 5 :
        browser.execute_script('arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;', fBody)
        print("-------------------------SCROLL-----------------------------------")
        sleep(10)
    current = raw_elems[i].get_attribute('href')
    followers.append(current)
    print(current)
    sleep(1)
print(followers)
I really don't understand why, after the 10th username, it returns "list index out of range".
Peace ^^

I cannot test this code because it requires logging in.
In general, though, you should wait for the list of elements to be visible or present and then iterate over it.
Create an empty list and append values to it. Check the scrolling part yourself.
I did not remove your waits, but you should think about using Selenium's WebDriverWait methods.
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
# do login part here
#
# Adding wait here
wait = WebDriverWait(driver, 15)
fBody = driver.find_element_by_xpath("/html/body/div[5]/div/div/div[2]/div")
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//body//div//span[@class='Jv7Aj mArmR MqpiF ']//a[@class='FPmhX notranslate MBL3Z']"))) # waiting for the list to become present
raw_elems = driver.find_elements_by_xpath("//body//div//span[@class='Jv7Aj mArmR MqpiF ']//a[@class='FPmhX notranslate MBL3Z']")
followers = []
for el in raw_elems:
    # driver.execute_script('arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;', fBody)
    # print("-------------------------SCROLL-----------------------------------")
    # time.sleep(10)
    current = el.get_attribute('href')
    followers.append(current)
    print(current)
    time.sleep(1)
print(followers)

When the browser runs JavaScript that adds new elements, you have to call find_elements again:
if i % 6 == 5 :
    browser.execute_script('arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;', fBody)
    print("-------------------------SCROLL-----------------------------------")
    sleep(10)
    # get all elements again
    raw_elems = browser.find_elements_by_xpath("//body//div//span[@class='Jv7Aj mArmR MqpiF ']//a[@class='FPmhX notranslate MBL3Z']")
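Because each scroll can both add and drop entries, indexing into a list fetched earlier is fragile. One possible approach, sketched below under stated assumptions, is to re-query after every scroll and keep only hrefs not seen before. The function name `collect_hrefs` and the two callables are hypothetical: `find_hrefs` would wrap the `find_elements_by_xpath` call plus `get_attribute('href')`, and `scroll` would wrap the `execute_script` call.

```python
def collect_hrefs(find_hrefs, scroll, target, max_idle_rounds=3):
    """Gather unique hrefs, re-querying the page after every scroll.

    find_hrefs: callable returning the hrefs currently in the DOM.
    scroll:     callable that scrolls the list so more entries load.
    Stops at `target` items, or after `max_idle_rounds` rounds in a row
    that reveal nothing new.
    """
    seen = []
    idle = 0
    while len(seen) < target and idle < max_idle_rounds:
        before = len(seen)
        for href in find_hrefs():  # fresh lookup every round
            if href not in seen:
                seen.append(href)
        idle = idle + 1 if len(seen) == before else 0
        scroll()
    return seen[:target]
```

The idle counter is there so the loop ends even when the pop-up holds fewer than `target` entries.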

How to stop selenium scraper from redirecting to another internal weblink of the scraped website?

Was wondering if anyone knows of a way for instructing a selenium script to avoid visiting/redirecting to an internal page that wasn't part of the code. Essentially, my code opens up this page:
https://cryptwerk.com/companies/?coins=1,6,11,2,3,8,17,7,13,4,25,29,24,32,9,38,15,30,43,42,41,12,40,44,20
The script keeps clicking the show-more button until there is none left (at the end of the page). By then it should have collected the links of all the products listed on the page, and it should then visit each one in turn.
What happens instead: it successfully clicks show more until the end of the page, but then it lands on a promotion page of the same website instead of following each of the gathered links and scraping further data points from each of them.
In a nutshell, I would really appreciate it if someone could explain how to avoid this automatic redirection. Here is the code, in case someone can nudge me in the right direction :)
from selenium.webdriver import Chrome
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
from selenium.common.exceptions import NoSuchElementException, ElementNotVisibleException
import json
import selenium.common.exceptions as exception
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
webdriver = '/Users/karimnabil/projects/selenium_js/chromedriver-1'
driver = Chrome(webdriver)
driver.implicitly_wait(5)
url = 'https://cryptwerk.com/companies/?coins=1,6,11,2,3,8,17,7,13,4,25,29,24,32,9,38,15,30,43,42,41,12,40,44,20'
driver.get(url)
links_list = []
coins_list = []
all_names = []
all_cryptos = []
all_links = []
all_twitter = []
all_locations = []
all_categories = []
all_categories2 = []
wait = WebDriverWait(driver, 2)
sign_in = driver.find_element_by_xpath("//li[@class='nav-item nav-guest']/a")
sign_in.click()
time.sleep(2)
user_name = wait.until(EC.presence_of_element_located((By.XPATH, "//input[@name='login']")))
user_name.send_keys("YOUR_LOGIN")  # credentials redacted
password = wait.until(EC.presence_of_element_located((By.XPATH, "//input[@name='password']")))
password.send_keys("YOUR_PASSWORD")  # credentials redacted
signIn_Leave = driver.find_element_by_xpath("//div[@class='form-group text-center']/button")
signIn_Leave.click()
time.sleep(3)
while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='btn btn-outline-primary']")
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(2)
    except exception.StaleElementReferenceException:
        print('stale element')
        break
print('no more elements to show')
try:
    company_links = driver.find_elements_by_xpath("//div[@class='companies-list items-infinity']/div[position() > 3]/div[@class='media-body']/div[@class='title']/a")
    for link in company_links:
        links_list.append(link.get_attribute('href'))
except:
    pass
try:
    with open("links_list.json", "w") as f:
        json.dump(links_list, f)
    with open("links_list.json", "r") as f:
        links_list = json.load(f)
except:
    pass
try:
    for link in links_list:
        driver.get(link)
        name = driver.find_element_by_xpath("//div[@class='title']/h1").text
        try:
            show_more_coins = driver.find_element_by_xpath("//a[@data-original-title='Show more']")
            show_more_coins.click()
            time.sleep(1)
        except:
            pass
        try:
            categories = driver.find_elements_by_xpath("//div[contains(@class, 'categories-list')]/a")
            categories_list = []
            for category in categories:
                categories_list.append(category.text)
        except:
            pass
        try:
            top_page_categories = driver.find_elements_by_xpath("//ol[@class='breadcrumb']/li/a")
            top_page_categories_list = []
            for category in top_page_categories:
                top_page_categories_list.append(category.text)
        except:
            pass
        coins_links = driver.find_elements_by_xpath("//div[contains(@class, 'company-coins')]/a")
        all_coins = []
        for coin in coins_links:
            all_coins.append(coin.get_attribute('href'))
        try:
            location = driver.find_element_by_xpath("//div[@class='addresses mt-3']/div/div/div/div/a").text
        except:
            pass
        try:
            twitter = driver.find_element_by_xpath("//div[@class='links mt-2']/a[2]").get_attribute('href')
        except:
            pass
        try:
            print('-----------')
            print('Company name is: {}'.format(name))
            print('Potential Categories are: {}'.format(categories_list))
            print('Potential top page categories are: {}'.format(top_page_categories_list))
            print('Supporting Crypto is:{}'.format(all_coins))
            print('Registered location is: {}'.format(location))
            print('Company twitter profile is: {}'.format(twitter))
            time.sleep(1)
        except:
            pass
        all_names.append(name)
        all_categories.append(categories_list)
        all_categories2.append(top_page_categories_list)
        all_cryptos.append(all_coins)
        all_twitter.append(twitter)
        all_locations.append(location)
except:
    pass
df = pd.DataFrame(list(zip(all_names, all_categories, all_categories2, all_cryptos, all_twitter, all_locations)), columns=['Company name', 'Categories1', 'Categories2', 'Supporting Crypto', 'Twitter Handle', 'Registered Location'])
CryptoWerk_Data = df.to_csv('CryptoWerk4.csv', index=False)

Redirects happen for one of two reasons: either JavaScript runs when you click the load-more button for the last time, or the server answers with an HTTP 3xx code, which is the less likely cause in your case.
So you need to identify when that JavaScript runs, send an Escape key press before the new page loads, and then execute the rest of your script.
Alternatively, scrape the links and append them to your list before each click on the load-more button. After each click, check the URL of the page you are on: if it is the promotion page, run the rest of your code; otherwise click load more again.
while page_is_same:
    scrape_elements_add_to_list()
    click_load_more()
    verify_current_page_link()
    if current_link_is_same != link_of_scraped_page:
        page_is_same = False
# rest of the code here
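The pseudocode above can be turned into something runnable roughly as follows. This is only a sketch: `gather_links` and its three callables are hypothetical names, and it assumes `click_load_more` returns False once the button is gone. In a Selenium script, `current_url` would wrap `driver.current_url`, and the other two callables would wrap the XPath lookups from the question.

```python
def gather_links(current_url, scrape_links, click_load_more, max_clicks=500):
    """Scrape before every click and stop once the browser leaves the listing."""
    start_url = current_url()
    links = []
    for _ in range(max_clicks):
        # Scrape first, so nothing is lost if the next click redirects.
        for link in scrape_links():
            if link not in links:
                links.append(link)
        if not click_load_more():       # no button left: listing is complete
            break
        if current_url() != start_url:  # redirected, e.g. to a promo page
            break
    return links
```

After the function returns, the script can visit each collected link with `driver.get(link)` regardless of where the last click ended up.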

In selenium how to find out the exact number of XPATH links with different ids?

With Python3 and selenium I want to automate the search on a public information site. In this site it is necessary to enter the name of a person, then select the spelling chosen for that name (without or with accents or name variations), access a page with the list of lawsuits found and in this list you can access the page of each case.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
import re
Name that will be searched
name = 'JOSE ROBERTO ARRUDA'
Create path, search start link, and empty list to store information
firefoxPath="/home/abraji/Documentos/Code/geckodriver"
link = 'https://ww2.stj.jus.br/processo/pesquisa/?aplicacao=processos.ea'
processos = []
Call driver and go to first search page
driver = webdriver.Firefox(executable_path=firefoxPath)
driver.get(link)
Position cursor, fill and click
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#idParteNome'))).click()
time.sleep(1)
driver.find_element_by_xpath('//*[@id="idParteNome"]').send_keys(name)
time.sleep(6)
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#idBotaoPesquisarFormularioExtendido'))).click()
Mark all spelling possibilities for searching
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#idBotaoMarcarTodos'))).click()
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#idBotaoPesquisarMarcados'))).click()
time.sleep(1)
Check how many pages of data there are - to be used in "for range"
capta = driver.find_element_by_xpath('//*[@id="idDivBlocoPaginacaoTopo"]/div/span/span[2]').text
print(capta)
paginas = int(re.search(r'\d+', capta).group(0))
paginas = int(paginas) + 1
print(paginas)
Capture routine
for acumula in range(1, paginas):
    # Fill the field with the page number and press enter
    driver.find_element_by_xpath('//*[@id="idDivBlocoPaginacaoTopo"]/div/span/span[2]/input').send_keys(acumula)
    driver.find_element_by_xpath('//*[@id="idDivBlocoPaginacaoTopo"]/div/span/span[2]/input').send_keys(Keys.RETURN)
    time.sleep(2)
    # Captures the number of processes found on the current page - qt
    qt = driver.find_element_by_xpath('//*[@id="idDivBlocoMensagem"]/div/b').text
    qt = int(qt) + 2
    print(qt)
    # Iterate from found number of processes
    for item in range(2, qt):
        # Find the XPATH of each process link - start at number 2
        vez = '//*[@id="idBlocoInternoLinhasProcesso"]/div[' + str(item) + ']/span[1]/span[1]/span[1]/span[2]/a'
        print(vez)
        # Access the direct link and click
        element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, vez)))
        element.click()
        # Run tests to get data
        try:
            num_unico = driver.find_element_by_xpath('//*[@id="idProcessoDetalhesBloco1"]/div[6]/span[2]/a').text
        except NoSuchElementException:
            num_unico = "sem_numero_unico"
        try:
            nome_proc = driver.find_element_by_xpath('//*[@id="idSpanClasseDescricao"]').text
        except NoSuchElementException:
            nome_proc = "sem_nome_encontrado"
        try:
            data_autu = driver.find_element_by_xpath('//*[@id="idProcessoDetalhesBloco1"]/div[5]/span[2]').text
        except NoSuchElementException:
            data_autu = "sem_data_encontrada"
        # Fills dictionary and list
        dicionario = {"num_unico": num_unico,
                      "nome_proc": nome_proc,
                      "data_autu": data_autu
                      }
        processos.append(dicionario)
        # Return a page to click on next process
        driver.execute_script("window.history.go(-1)")
# Close driver
driver.quit()
In this case I captured the number of result pages (3) and the total number of links (84). My initial idea was to run the outer "for" three times and split the 84 links among the iterations.
The direct address of each link is an XPath (//*[@id="idBlocoInternoLinhasProcesso"]/div[41]/span[1]/span[1]/span[1]/span[2]/a), in which I substitute "item" for the div index to click.
When it reaches number 42 I get an error, because the first page only goes up to 41.
My problem is how to move to the second page and then restart only the inner "for".
I think the ideal would be to know the exact number of links on each of the three pages.
Does anyone have any ideas?

The code below replaces the "Capture routine":
wait = WebDriverWait(driver, 20)
#...
while True:
    links = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//span[contains(@class,'classSpanNumeroRegistro')]")))
    print("links len", len(links))
    for i in range(1, len(links) + 1):
        # Access the direct link and click
        wait.until(EC.element_to_be_clickable((By.XPATH, f"(//span[contains(@class,'classSpanNumeroRegistro')])[{i}]//a"))).click()
        # Run tests to get data
        try:
            num_unico = driver.find_element_by_xpath('//*[@id="idProcessoDetalhesBloco1"]/div[6]/span[2]/a').text
        except NoSuchElementException:
            num_unico = "sem_numero_unico"
        try:
            nome_proc = driver.find_element_by_xpath('//*[@id="idSpanClasseDescricao"]').text
        except NoSuchElementException:
            nome_proc = "sem_nome_encontrado"
        try:
            data_autu = driver.find_element_by_xpath('//*[@id="idProcessoDetalhesBloco1"]/div[5]/span[2]').text
        except NoSuchElementException:
            data_autu = "sem_data_encontrada"
        # Fills dictionary and list
        dicionario = {"num_unico": num_unico,
                      "nome_proc": nome_proc,
                      "data_autu": data_autu
                      }
        processos.append(dicionario)
        # Return a page to click on next process
        driver.execute_script("window.history.go(-1)")
    # wait.until(EC.presence_of_element_located((By.CLASS_NAME, "classSpanPaginacaoImagensDireita")))
    next_page = driver.find_elements_by_css_selector(".classSpanPaginacaoProximaPagina")
    if len(next_page) == 0:
        break
    next_page[0].click()

You can keep looping as long as the Next button is present on the screen. The logic looks like this:
try:
    next_page = driver.find_element_by_class_name('classSpanPaginacaoProximaPagina')
    if(next_page.is_displayed()):
        next_page.click()
except NoSuchElementException:
    print('next page does not exist')
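Both answers rest on the same idea: count the links on the page you are currently on, and stop when there is no Next button, instead of computing totals up front. A minimal sketch of that control flow, independent of Selenium, might look like this. The name `walk_pages` and its three callables are hypothetical, and the sketch assumes `go_to_next_page` returns False on the last page.

```python
def walk_pages(count_links, visit_link, go_to_next_page):
    """Visit every link on every page without knowing the totals in advance.

    count_links:     returns how many result links the current page shows.
    visit_link:      opens link number i (1-based) and returns its record.
    go_to_next_page: clicks Next; returns False when there is no next page.
    """
    records = []
    while True:
        for i in range(1, count_links() + 1):  # counted fresh on each page
            records.append(visit_link(i))
        if not go_to_next_page():
            return records
```

Here `count_links` would wrap the `len(links)` lookup, `visit_link` the click, scrape, and `window.history.go(-1)` steps, and `go_to_next_page` the try/except around the Next button.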

How to iterate through a list of web elements that is refreshing every 10 sec?

I am trying to iterate through a list that refreshes every 10 seconds.
This is what I have tried:
driver.get("https://www.winmasters.ro/ro/live-betting/")
events = driver.find_elements_by_css_selector('.event-wrapper.v1.event-live.odds-hidden.event-sport-1')
for i in range(len(events)):
    try:
        event = events[i]
        name = event.find_element_by_css_selector('.event-details-team-name.event-details-team-a')  # the error occurs here
    except:  # NoSuchElementException or StaleElementReferenceException
        time.sleep(3)  # I have tried up to 20 sec
        event = events[i]
        name = event.find_element_by_css_selector('.event-details-team-name.event-details-team-a')
This did not work, so I tried a different except block:
    except:  # second try that also did not work
        element = WebDriverWait(driver, 20).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.event-details-team-name.event-details-team-a'))
        )
        name = event.find_element_by_css_selector('.event-details-team-name.event-details-team-a')
Now I am assigning a placeholder that I will never use to name, like this:
    try:
        event = events[i]
        name = event.find_element_by_css_selector('.event-details-team-name.event-details-team-a')
    except:
        name = "blablabla"
With this code, when the page refreshes I get about 7 or 8 of the "blablabla" values until it finds my selector on the page again.

You can get all of the required data using JavaScript.
The code below gives you a list of event maps with all details instantly, without NoSuchElementException or StaleElementReferenceException errors:
me_id : unique identifier
href : link to the event, which you can use to get details
team_a : name of the first team
team_a_score : score of the first team
team_b : name of the second team
team_b_score : score of the second team
event_status : status of the event
event_clock : time of the event
events = driver.execute_script('return [...document.querySelectorAll(\'[data-uat="live-betting-overview-leagues"] .events-for-league .event-live\')].map(e=>{return {me_id:e.getAttribute("me_id"), href:e.querySelector("a.event-details-live").href, team_a:e.querySelector(".event-details-team-a").textContent, team_a_score:e.querySelector(".event-details-score-1").textContent, team_b:e.querySelector(".event-details-team-b").textContent, team_b_score:e.querySelector(".event-details-score-2").textContent, event_status:e.querySelector(\'[data-uat="event-status"]\').textContent, event_clock:e.querySelector(\'[data-uat="event-clock"]\').textContent}})')
for event in events:
    print(event.get('me_id'))
    print(event.get('href')) #using href you can open event details using: driver.get(event.get('href'))
    print(event.get('team_a'))
    print(event.get('team_a_score'))
    print(event.get('team_b'))
    print(event.get('team_b_score'))
    print(event.get('event_status'))
    print(event.get('event_clock'))

One primary problem is that you are acquiring all of the elements up front and then iterating through that list. As the page itself updates frequently, the elements you've already acquired go "stale", meaning they are no longer associated with current DOM objects. When you try to use those stale elements, Selenium throws a StaleElementReferenceException because it has no way of doing anything with those now out-of-date objects.
One way to overcome this is to acquire and use an element only at the moment you need it, rather than fetching them all up front. I personally feel the cleanest way to do that is the CSS :nth-child() selector:
from selenium import webdriver
def main():
    base_css = '.event-wrapper.v1.event-live.odds-hidden.event-sport-1'
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.winmasters.ro/ro/live-betting/")
        # Get a list of all elements
        events = driver.find_elements_by_css_selector(base_css)
        print("Found {} events".format(len(events)))
        # Iterate through the list, keeping track of the index
        # note that nth-child referencing begins at index 1, not 0
        for index, _ in enumerate(events, 1):
            name = driver.find_element_by_css_selector("{}:nth-child({}) {}".format(
                base_css,
                index,
                '.event-details-team-name.event-details-team-a'
            ))
            print(name.text)
    finally:
        driver.quit()
if __name__ == "__main__":
    main()
If I run the above script, I get this output:
$ python script.py
Found 2 events
Hapoel Haifa
FC Ashdod
Now, as the underlying webpage really does update a lot, there is still a decent chance you will hit a StaleElementReferenceException (SERE). To overcome that you can use a retry decorator (pip install retry to get the package) to handle the SERE and reacquire the element:
import retry
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
@retry.retry(StaleElementReferenceException, tries=3)
def get_name(driver, selector):
    elem = driver.find_element_by_css_selector(selector)
    return elem.text
def main():
    base_css = '.event-wrapper.v1.event-live.odds-hidden.event-sport-1'
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.winmasters.ro/ro/live-betting/")
        events = driver.find_elements_by_css_selector(base_css)
        print("Found {} events".format(len(events)))
        for index, _ in enumerate(events, 1):
            name = get_name(
                driver,
                "{}:nth-child({}) {}".format(
                    base_css,
                    index,
                    '.event-details-team-name.event-details-team-a'
                )
            )
            print(name)
    finally:
        driver.quit()
if __name__ == "__main__":
    main()
Now, despite the above examples, I think you still have issues with your CSS selectors, which is the primary reason for the NoSuchElement exceptions. I can't help with that without a better description of what you are actually trying to accomplish with this script.
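If installing the retry package is not an option, a small hand-rolled decorator can play the same role. The sketch below is not that package's implementation; `retry_on` and its parameters are hypothetical names chosen for illustration.

```python
import functools
import time


def retry_on(exceptions, tries=3, delay=0.0):
    """Re-run the decorated function when it raises one of `exceptions`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator
```

Decorating get_name with `@retry_on(StaleElementReferenceException, tries=3)` would then re-run the lookup, and therefore reacquire the element, each time the reference goes stale.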

python selenium: how to visit a web page many times on the same page

I'm using selenium with python to test my web server. Here is my test code:
i = 0
msg = 'abc'
while i < 10:
    driver = webdriver.Chrome()
    driver.get("http://www.example.com")
    txt = driver.find_element_by_id('input-text')
    txt.clear()
    txt.send_keys(msg)
    btn = driver.find_element_by_id('input-search')
    btn.click()
    driver.quit()
    i += 1
The code works well except for one thing: it launches Chrome, runs the test, and closes Chrome on every loop iteration. Obviously that's unnecessary. What I need is to launch Chrome once and make many requests. I've tried the following, but it doesn't work:
i = 0
msg = 'abc'
driver = webdriver.Chrome()
while i < 10:
    driver.get("http://www.example.com")
    txt = driver.find_element_by_id('input-text')
    txt.clear()
    txt.send_keys(msg)
    btn = driver.find_element_by_id('input-search')
    btn.click()
    i += 1
driver.quit()
I think it's because in my test, there are two things:
1) fill abc in input-text;
2) click a button, submit the abc and open a new web page.
On the new page, there is also an input-text and a button input-search, so it will fill the abc and click the button on the new page, which is not what I want.

btn.click() does not wait for the new page to load, because the button is neither a hyperlink nor a form submit. That can make your script fail. Wait for an element that only exists on the result page before reloading the start page. See the code below.
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
try:
    driver.get('http://www.example.com')
    txt = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "input-text")))
    txt.clear()
    txt.send_keys(msg)
    btn = driver.find_element_by_id('input-search')
    btn.click()
    # Block until an element from the result page shows up
    countryDescriptionElement = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "country-description")))
    #print(driver.page_source.encode('utf-8'))
except WebDriverException as ex:
    print("Enter: " + msg + ", Error: " + str(ex) + ", Found: " + driver.page_source)
