I am trying to scrape the headers of Wikipedia pages as an exercise, and I want to be able to distinguish between headers with "h2" and "h3" tags.
Therefore I wrote this code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys #For being able to input key presses
import time #Useful for if your browser is faster than your code
PATH = r"C:\Users\Alireza\Desktop\chromedriver\chromedriver.exe" #Location of the chromedriver
driver = webdriver.Chrome(PATH)
driver.get("https://de.wikipedia.org/wiki/Alpha-Beta-Suche") #Open website in Chrome
print(driver.title) #Print title of the website to console
h1Header = driver.find_element_by_tag_name("h1") #Find the first heading in the article
h2HeaderTexts = driver.find_elements_by_tag_name("h2") #List of all other major headers in the article
h3HeaderTexts = driver.find_elements_by_tag_name("h3") #List of all minor headers in the article
for items in h2HeaderTexts:
    scor = items.find_element_by_class_name("mw-headline")
driver.quit()
However, this does not work and the program does not terminate.
Does anybody have a solution for this?
The problem lies in the for loop! Apparently I cannot scrape any elements by class name (or anything else) from the elements in h2HeaderTexts, although this should be possible.
You can filter in the XPath itself:
PATH = r"C:\Users\Alireza\Desktop\chromedriver\chromedriver.exe" #Location of the chromedriver
driver = webdriver.Chrome(PATH)
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://de.wikipedia.org/wiki/Alpha-Beta-Suche") #Open website in Chrome
print(driver.title)
for item in driver.find_elements(By.XPATH, "//h2/span[#class='mw-headline']"):
print(item.text)
This should give you the h2 headings that contain a span with the mw-headline class.
Output:
Informelle Beschreibung
Der Algorithmus
Implementierung
Optimierungen
Vergleich von Minimax und AlphaBeta
Geschichte
Literatur
Weblinks
Fußnoten
Process finished with exit code 0
Update 1:
The reason why your loop is still running and the program does not terminate: if you look at the page's HTML source, the first h2 tag does not have a child span with mw-headline, so Selenium tries to locate an element that is not in the HTML DOM. Also, find_elements returns a list of web elements if found and an empty list if not, which is why you do not see an exception either.
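A minimal sketch of that point (assuming the By import): because find_elements returns an empty list instead of raising, you can use it to skip the h2 tags that have no mw-headline child:

for h2 in driver.find_elements(By.TAG_NAME, "h2"):
    spans = h2.find_elements(By.XPATH, ".//span[@class='mw-headline']")
    if spans:  # empty list for the first h2, which has no mw-headline child
        print(spans[0].text)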
You should wait until elements appear on the page before accessing them.
Also, there are several elements with the h1 tag name there.
To search for elements inside another element, you should use an XPath starting with a dot. Otherwise the search will look for the first match on the entire page, even when called on an element.
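A quick illustration of the difference, with item being any element already found on the page:

item.find_elements_by_xpath(".//span")  # searches only inside item
item.find_elements_by_xpath("//span")   # searches the entire page, even though it is called on item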
The first h2 element on that page has no element with class name mw-headline inside it, so you should handle this case too.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys #For being able to input key presses
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time #Useful for if your browser is faster than your code

PATH = r"C:\Users\Alireza\Desktop\chromedriver\chromedriver.exe" #Location of the chromedriver
driver = webdriver.Chrome(PATH)
wait = WebDriverWait(driver, 20)
driver.get("https://de.wikipedia.org/wiki/Alpha-Beta-Suche") #Open website in Chrome
print(driver.title) #Print title of the website to console
wait.until(EC.visibility_of_element_located((By.XPATH, "//h1")))
h1Headers = driver.find_elements_by_tag_name("h1") #All h1 headings on the page
h2HeaderTexts = driver.find_elements_by_tag_name("h2") #List of all other major headers in the article
h3HeaderTexts = driver.find_elements_by_tag_name("h3") #List of all minor headers in the article
for items in h2HeaderTexts:
    scor = items.find_elements_by_xpath(".//span[@class='mw-headline']")
    if scor:
        print(scor[0].text) #do what you need with the scor[0] element
driver.quit()
Your version does not finish executing because Selenium raises an exception when it cannot locate an element, which stops the script.
Devs do not like to use try/except, but I personally have not found a better way to work around it. If you do:
for items in h2HeaderTexts:
    try:
        scor = items.find_element_by_class_name('mw-headline').text
        print(scor)
    except:
        print('nothing found')
You will notice that it executes to the end and you get a result.
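A slightly narrower variant only swallows the lookup failure instead of every possible error; the exception class lives in selenium.common.exceptions:

from selenium.common.exceptions import NoSuchElementException

for items in h2HeaderTexts:
    try:
        print(items.find_element_by_class_name('mw-headline').text)
    except NoSuchElementException:  # raised only when the span is missing
        print('nothing found')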
I'm trying to find all the elements on a page with a specific selector but, as it seems, not all the elements are found.
Code I'm using:
from selenium import webdriver
from selenium.webdriver.common.by import By

PATH = r"C:\Program Files (x86)\chromedriver.exe" #path of chrome driver
driver = webdriver.Chrome(PATH) #accesses the chrome driver
driver.get("https://www.eduqas.co.uk/qualifications/computer-science-as-a-level/#tab_pastpapers") #website
driver.maximize_window()
driver.implicitly_wait(3)
driver.execute_script("window.scrollTo(0, 540)")
driver.implicitly_wait(3)
elements = driver.find_elements(By.CSS_SELECTOR, ".css-13punl2")
driver.implicitly_wait(3)
for x in elements:
    x.click()
print(len(elements))
When I print the length of the list "elements" it returns 1, even though there are multiple elements on the web page with the selector ".css-13punl2", as seen in a screenshot of the page's markup (image not reproduced here).
link to the website: https://www.eduqas.co.uk/qualifications/computer-science-as-a-level/#tab_pastpapers
For some reason, when I inspect the web page there will sometimes be 6 elements with the selector ".css-13punl2" and sometimes 7, but I'm not sure why.
Is the selector stable?
I'm not very familiar with Selenium in Python, but from what I know, some element attributes change at runtime.
My advice:
Put a sleep for 30 seconds, open the console (F12) in the opened driver, and run the following command:
$$(".css-13punl2")
If it gives you only 1 element, then you have found the problem.
It may also give you 6 elements, but with most of them invisible.
Could you also provide a screenshot of the site itself, or even a link to it?
EDITED ANSWER:
Try this selector:
#pastpapers_content button
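A minimal sketch of how that selector could be used, assuming driver is already on the page and By is imported:

buttons = driver.find_elements(By.CSS_SELECTOR, "#pastpapers_content button")
print(len(buttons))  # counts every button inside the past-papers tab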
You are trying to wait for the elements with driver.implicitly_wait(3). This method activates the implicit wait; it only needs to be turned on once, and you don't need to set it again. When it's on, Selenium will try to find the element you want for up to 3 seconds (in this case) instead of making a single attempt immediately after page load. In your case it finds one element quickly and doesn't wait for all the elements to appear on the page.
You need to give the page some time to load all the content. For that you can use the sleep function, for example, which pauses script execution for the number of seconds you set.
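For contrast, a two-line sketch of the difference between the two waits (assuming from time import sleep):

driver.implicitly_wait(3)  # configure once; every later find_* call retries for up to 3 seconds
sleep(3)                   # unconditional pause; always blocks for the full 3 seconds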
Also, the elements on this page disappear from view, so you need to scroll each time you click an element. Additionally, you need to close the 'Cookie' prompt, as it intercepts the clicks.
And I guess you need the elements with years, so it's better to skip clicking 'GCSE', which is also found by this selector.
So the code will look like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

PATH = r"C:\Program Files (x86)\chromedriver.exe" # path of chrome driver
driver = webdriver.Chrome(PATH) # accesses the chrome driver
driver.get("https://www.eduqas.co.uk/qualifications/computer-science-as-a-level/#tab_pastpapers") # website
driver.maximize_window()
driver.implicitly_wait(3)
driver.execute_script("window.scrollTo(0, 540)")
sleep(3) # Giving time to fully load the content
elements = driver.find_elements(By.CSS_SELECTOR, ".css-13punl2")
driver.find_element(By.ID, 'accept-cookies').click() # Closes the cookies prompt
for x in elements:
    if x.text == 'GCSE':
        continue
    x.click()
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # this scrolls the page to the bottom
    sleep(1) # This sleep is necessary to give time to finish scrolling
print(len(elements))
Don't use implicitly_wait() more than once; try the code below, which will click on each year:
driver.get("https://www.eduqas.co.uk/qualifications/computer-science-as-a-level/#tab_pastpapers")
wait = WebDriverWait(driver, 10)
time.sleep(1)
# to click on Accept Cookies button
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#accept-cookies"))).click()
# waiting for list of years to appear and scrolling to it
content = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#pastpapers_content")))
driver.execute_script("arguments[0].scrollIntoView(true)", content)
# get all the elements
elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, ".//*[#id='pastpapers_content']//button[#type='button']")))
print("Total elements: ", len(elements))
for i in range(len(elements)):
ele = driver.find_element(By.XPATH, "(.//*[#id='pastpapers_content']//button[#type='button'])[" + str(i + 1) + "]")
time.sleep(1)
ele.location_once_scrolled_into_view
time.sleep(1)
ele.click()
I'm trying to pull the airline names and prices of a specific flight. I'm having trouble with the XPath and/or using the right HTML tags, because when I run the code below, all I get back is 14 empty lists.
from selenium import webdriver
from lxml import html
from time import sleep

driver = webdriver.Chrome(r"C:\Users\14074\Python\chromedriver")
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
sleep(1)
tree = html.fromstring(driver.page_source)
for flight_tree in tree.xpath('//div[@class="TQqf0e sSHqwe tPgKwe ogfYpf"]'):
    title = flight_tree.xpath('.//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[2]/div/div[2]/div[6]/div/div[2]/div/div[1]/div/div[1]/div/div[2]/div[2]/div[2]/span/text()')
    price = flight_tree.xpath('.//span[contains(@data-gs, "CjR")]')
    print(title, price)
#driver.close()
This is just the first part of my code but I can't really continue without getting this to work. If anyone has some ideas on what I'm doing wrong that would be amazing! It's been driving me crazy. Thank you!
I noticed a few issues with your code. First of all, I believe that when entering this page, Google will first show you the "I agree to terms and conditions" popup before showing the content of the page, so you need to click on that button first.
Also, you should use the find_elements_by_xpath function directly on the driver instead of on the page content, as this also allows you to work with the JavaScript-rendered content. You can find more info here: python tree.xpath return empty list
To get more info on how to scrape using Selenium and Python, you could check out this guide: https://www.webscrapingapi.com/python-selenium-web-scraper/
I used the following code to scrape the titles. (I also changed the XPaths, extracting them directly from Google Chrome. You can do that by right-clicking on an element -> Inspect, and in the Elements tab where the element is, right-click -> Copy -> Copy XPath.)
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# I used these for the code to work on my Windows Subsystem for Linux
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=option)

URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span').click() # this is necessary to press the I agree button
elements = driver.find_elements_by_xpath('//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[3]/div[3]/c-wiz/div/div[2]/div[1]/div/div/ol/li')
for flight_tree in elements:
    title = flight_tree.find_element_by_xpath('.//*[@class="W6bZuc YMlIz"]').text
    print(title)
I tried the code below, with the screen maximized and explicit waits, and could successfully extract the information; please see below.
Sample code:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.get("https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA")
wait = WebDriverWait(driver, 10)
titles = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div/descendant::h3")))
for name in titles:
    print(name.text)
    price = name.find_element(By.XPATH, "./../following-sibling::div/descendant::span[2]").text
    print(price)
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output:
Tokyo
₹38,473
Mumbai
₹3,515
Dubai
₹15,846
I'm trying to get data from the Polish Wiktionary. Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://pl.wiktionary.org/wiki/Kategoria:J%C4%99zyk_polski_-_rzeczowniki")
page = driver.find_element_by_xpath('//*[@id="mw-pages"]/div/div')
words = page.find_elements_by_tag_name('li') #loading all the words
delay = 30
for word in words:
    myElem = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="mw-pages"]/a[2]')))
    word.find_element_by_tag_name('a').click() #entering word
    #COLLECTING DATA
    driver.back()
    # also tried with driver.execute_script("window.history.go(-1)") - same result
    time.sleep(5) #added to make sure that time is not an obstacle
I get this error while trying to enter the next word:
StaleElementReferenceException: stale element reference: element is not attached to the page document
(Session info: chrome=88.0.4324.190)
When you click, you're changing the page, which renders the previous elements stale.
So you need to either collect the pages you want to go to FIRST and step through them, or keep track of which element you're viewing and increment the index when you go back:
i = 0
for word in words:
    myElem = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, '//*[@id="mw-pages"]/a[2]')))
    word.find_elements_by_tag_name('a')[i].click() #entering word
    #COLLECTING DATA
    driver.back()
    i += 1
    # also tried with driver.execute_script("window.history.go(-1)") - same result
    time.sleep(5) #added to make sure that time is not an obstacle
But, as you can find elsewhere on Stack Overflow, there are ways to launch each link in a NEW window, switch_to that window, grab the data, then close that window and proceed to the next link element.
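A hedged sketch of that new-window approach, assuming hrefs is a list of link URLs collected up front (an illustrative name, not from the code above):

main = driver.current_window_handle
for href in hrefs:
    driver.execute_script("window.open(arguments[0]);", href)  # open the link in a new tab
    driver.switch_to.window(driver.window_handles[-1])         # switch to the new tab
    #COLLECTING DATA
    driver.close()                                             # close the tab
    driver.switch_to.window(main)                              # return to the results window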
Normally, when working with anchor (a) tags, we collect their href values and then loop over them with driver.get():
driver.get("https://pl.wiktionary.org/wiki/Kategoria:J%C4%99zyk_polski_-_rzeczowniki")
ahrefs= [x.get_attribute('href') for x in driver.find_elements_by_xpath('//*[#id="mw-pages"]/div/div//li/a')]
for ahref in ahrefs:
driver.get(ahref)
I am running a loop that performs a search and grabs an element. The element on each search page appears to have the same CSS selector. However, it always prints the element associated with one search: the search I first began testing the script with. I'm not sure if this is a CSS selector issue, or perhaps a cookie issue?
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located

EXE_PATH = r'C:\\geckodriver.exe'
tickers = ["bitcoin", "ethereum", "litecoin"]
for t in tickers:
    with webdriver.Firefox(executable_path = EXE_PATH) as driver:
        wait = WebDriverWait(driver, 10)
        driver.get("https://coingecko.com/en")
        driver.find_element_by_css_selector(".px-2").send_keys(t + Keys.RETURN)
        first_result = wait.until(presence_of_element_located((By.CSS_SELECTOR, "div.text-3xl > span:nth-child(1)")))
        price = first_result.get_attribute("innerHTML")
        print(price)
I found the root cause of the issue, my friend. When you send the ticker to the search field, it takes some time to load the options, because the search field is an auto-suggest dropdown. But as per your script, as soon as you send the ticker to the search field you hit the Enter key, and what happens in the background is that it selects bitcoin by default: if you look at the trending list, bitcoin is at rank 1, and because of the lack of a delay between sending the ticker and hitting Enter, bitcoin gets selected every time. I have modified the script; you can view it below. If you don't want to use sleep, then add a WebDriverWait and wait for the desired option to be displayed in the search field dropdown. Hope that helps you. Please mark it as accepted if you are happy with my answer.
import time

tickers = ["bitcoin", "ethereum", "litecoin"]
for t in tickers:
    with webdriver.Chrome(executable_path = EXE_PATH) as driver:
        wait = WebDriverWait(driver, 10)
        driver.get("https://coingecko.com/en")
        driver.find_element_by_css_selector(".px-2").send_keys(t)
        time.sleep(5)
        driver.find_element_by_css_selector(".px-2").send_keys(Keys.RETURN)
        first_result = wait.until(presence_of_element_located((By.CSS_SELECTOR, "div.text-3xl > span:nth-child(1)")))
        price = first_result.get_attribute("innerHTML")
        print(price)
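If you'd rather avoid the fixed five-second sleep, here is a hedged sketch of the WebDriverWait alternative, meant to replace the send_keys/sleep lines inside the loop. The XPath for the suggestion entry is an assumption about CoinGecko's autocomplete markup, not a verified selector (it also assumes By and expected_conditions as EC are imported):

search = driver.find_element_by_css_selector(".px-2")
search.send_keys(t)
wait.until(EC.visibility_of_element_located(
    (By.XPATH, "//div[contains(@class, 'suggestion')]//*[contains(text(), '" + t.capitalize() + "')]")))  # assumed markup
search.send_keys(Keys.RETURN)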
I am trying to scrape data for several bills initiated in the Peruvian Congress from the following website: http://www.congreso.gob.pe/pley-2016-2021
Basically, I want to click on each link in the search results, scrape the relevant information for the bill, return to the search results, and then click on the next link for the next bill and repeat the process. Obviously, with so many bills across congressional sessions, it would be great if I could automate this.
So far I've been able to accomplish everything up to clicking on the next bill. I've been able to use Selenium to open a browser that displays the search results, click on the first link using the XPath embedded in the iframe, scrape the content with BeautifulSoup, and then navigate back to the search results. What I'm having trouble with is clicking on the next bill in the search results, because I'm not sure how to iterate over the XPath (or how to iterate over something that would take me to each subsequent bill). I'd like to be able to scrape the information for all of the bills on each page and then navigate to the next page of the search results.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests

driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
WebDriverWait(driver, 50).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, "//a[@href='/Sicr/TraDocEstProc/CLProLey2016.nsf/641842f7e5d631bd052578e20058a231/243a65573d33ecc905258449007d20cc?OpenDocument']")))
elem.click()
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find('table', {'bordercolor' : '#6583A0'})
table_items = table.findAll('font')
table_authors = table.findAll('a')
for item in table_items:
    content = item.contents[0]
    print(content)
for author in table_authors:
    authors = author.contents[0]
    print(authors)
driver.back()
So far this is the code I have that launches the web browser, clicks on the first link of the search results, scrapes the necessary data, and then returns to the search results.
The following code will navigate to different pages in the search results:
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, "//a[contains(@onclick,'D32')]/img[contains(@src,'Sicr/TraDocEstProc/CLProLey')]")))
elem.click()
I guess my specific problem is figuring out how to automate clicking on subsequent bills in the iframe, because once I can do that, I assume I could just loop over the bills on each page and then nest that loop within a function that loops over the pages of search results.
UPDATE: Partly with the help of the answer below, I applied its logic but used BeautifulSoup to scrape the href links in the iframe, storing them in a list and concatenating the necessary string elements to create XPaths for all of the bills on the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests

driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
WebDriverWait(driver, 50).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find('table', {'cellpadding' : '2'})
table_items = table.find_all('a')
for item in table_items:
    elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
        (By.XPATH, "//a[@href='" + item.get('href') + "']")))
    elem.click()
    driver.back()
Now my issue is that it clicks the link for the first item in the loop and clicks back to the search results, but does not progress to the next item in the loop (the code just times out). I'm also pretty new to writing loops in Python, so I was wondering: is there a way to iterate through the list of XPaths so that I can click an XPath, scrape the info on that page, click back to the search results, and then click on the next item in the list?
Here is my logic for this problem:
1. First get into the iframe using switch_to.
2. Get the web elements for the XPath "//a" using driver.find_elements into a variable billLinks, as this frame has links only for bills.
3. Now iterate through billLinks and perform the desired actions; a sketch follows below.
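A minimal Python sketch of those three steps (billLinks as named above; collecting the href values up front also avoids the stale-element problem discussed in the earlier answers):

driver.switch_to.frame('ventana02')  # 1. enter the iframe by its name
billLinks = [a.get_attribute('href')
             for a in driver.find_elements(By.XPATH, '//a')]  # 2. every link in the frame
for link in billLinks:  # 3. visit each bill page and scrape it
    driver.get(link)
    #COLLECTING DATA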
I hope this solution helps.