I am trying to scrape a Twitter page. There is a long list of comments, so using Selenium I scrolled down to the end:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get(url)

for i in range(30):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
Now when I try to get elements by the tag name article, not all of the tags are captured.
> len(driver.find_elements_by_tag_name('article'))
16
When I scroll the page manually and run the same code:
> len(driver.find_elements_by_tag_name('article'))
20
The same is the case for page_source. When I save driver.page_source to a file and search that file for an existing Twitter username, the name is not found. Only the usernames near the end of the HTML are present.
At first I thought it might be a browser issue, so I tried the same thing with ChromeDriver, but the results were similar.
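One workaround worth trying (only a sketch, not verified against Twitter): instead of a fixed time.sleep(2) after each scroll, wait until the number of article elements grows and stop scrolling once it no longer does. This reuses the driver from the code above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

last_count = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # wait up to 10 s for more <article> elements to appear after the scroll
        WebDriverWait(driver, 10).until(
            lambda d: len(d.find_elements_by_tag_name('article')) > last_count)
    except TimeoutException:
        break  # nothing new loaded; assume we reached the end
    last_count = len(driver.find_elements_by_tag_name('article'))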
I am trying to scrape the headers of Wikipedia pages as an exercise, and I want to be able to distinguish between headers with "h2" and "h3" tags.
Therefore I wrote this code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys #For being able to input key presses
import time #Useful for if your browser is faster than your code
PATH = r"C:\Users\Alireza\Desktop\chromedriver\chromedriver.exe" #Location of the chromedriver
driver = webdriver.Chrome(PATH)
driver.get("https://de.wikipedia.org/wiki/Alpha-Beta-Suche") #Open website in Chrome
print(driver.title) #Print title of the website to console
h1Header = driver.find_element_by_tag_name("h1") #Find the first heading in the article
h2HeaderTexts = driver.find_elements_by_tag_name("h2") #List of all other major headers in the article
h3HeaderTexts = driver.find_elements_by_tag_name("h3") #List of all minor headers in the article
for items in h2HeaderTexts:
    scor = items.find_element_by_class_name("mw-headline")
driver.quit()
However, this does not work and the program does not terminate.
Anybody have a solution for this?
The problem lies in the for loop! Apparently I cannot scrape any elements by class name (or anything else) from the elements in h2HeaderTexts, although this should be possible.
You can filter in the XPath itself:
from selenium import webdriver
from selenium.webdriver.common.by import By

PATH = r"C:\Users\Alireza\Desktop\chromedriver\chromedriver.exe"  # Location of the chromedriver
driver = webdriver.Chrome(PATH)
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://de.wikipedia.org/wiki/Alpha-Beta-Suche")  # Open website in Chrome
print(driver.title)

# Select only the mw-headline spans that sit inside h2 headings
for item in driver.find_elements(By.XPATH, "//h2/span[@class='mw-headline']"):
    print(item.text)
This should give you the h2 headings that contain a mw-headline span.
Output:
Informelle Beschreibung
Der Algorithmus
Implementierung
Optimierungen
Vergleich von Minimax und AlphaBeta
Geschichte
Literatur
Weblinks
Fußnoten
Process finished with exit code 0
Update 1:
The reason your loop keeps running and the program does not terminate is that, if you look at the page's HTML source, the first h2 tag does not have a child span with mw-headline, so Selenium is trying to locate an element that is not in the HTML DOM. Also, find_elements returns a list of web elements if any are found and an empty list otherwise, which is why you do not see an exception either.
You should wait until the elements appear on the page before accessing them.
Also, there are several elements with the h1 tag name on that page.
To search for elements inside another element you should use an XPath starting with a dot. Otherwise the search will match against the entire page.
The first h2 element on that page has no element with the class name mw-headline inside it, so you should handle that case as well.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # For being able to input key presses
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time  # Useful for if your browser is faster than your code

PATH = r"C:\Users\Alireza\Desktop\chromedriver\chromedriver.exe"  # Location of the chromedriver
driver = webdriver.Chrome(PATH)
wait = WebDriverWait(driver, 20)
driver.get("https://de.wikipedia.org/wiki/Alpha-Beta-Suche")  # Open website in Chrome
print(driver.title)  # Print title of the website to console

wait.until(EC.visibility_of_element_located((By.XPATH, "//h1")))

h1Headers = driver.find_elements_by_tag_name("h1")  # All h1 headings (there are several on this page)
h2HeaderTexts = driver.find_elements_by_tag_name("h2")  # List of all other major headers in the article
h3HeaderTexts = driver.find_elements_by_tag_name("h3")  # List of all minor headers in the article

for items in h2HeaderTexts:
    scor = items.find_elements_by_xpath(".//span[@class='mw-headline']")
    if scor:
        print(scor[0].text)  # do what you need with the scor[0] element

driver.quit()
Your version does not finish executing because Selenium aborts when it cannot locate an element.
Devs do not like to use try/except, but I personally have not found a better way to work around it. If you do:
for items in h2HeaderTexts:
    try:
        scor = items.find_element_by_class_name('mw-headline').text
        print(scor)
    except:
        print('nothing found')
You will notice that it executes to the end and you get a result.
I was trying to search on Google using Selenium: when I get the results, open the first link, go back to the previous page, open the second link, and so on. I wanted to do the same with the other links, but I don't know where the problem is.
Any help, please.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Edge()
url = "https://www.google.com/"
driver.get(url)
search_field = driver.find_element(By.XPATH, '/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')
search_field.send_keys("Anime")
search_field.send_keys(Keys.ENTER)
all_links = driver.find_elements(By.CLASS_NAME, "yuRUbf")
a = 0
for link in all_links:
    link[a].click()
    a += 1
    time.sleep(10)
    driver.back()
When you click a link that opens a new tab, the browser's focus jumps to the newly opened tab, but the Selenium driver does not switch to it automatically. So you need to switch the driver to that opened tab. After that you can close the tab and switch the driver back to the first one.
browser.getAllWindowHandles().then(function (handles) {
    browser.driver.switchTo().window(handles[1]);
    // do what you want on the new tab here
    browser.driver.close();
    browser.driver.switchTo().window(handles[0]);
});
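The snippet above is JavaScript (Protractor-style). A rough Python equivalent for the question's setup might look like the sketch below; it assumes the clicked result actually opens a second tab and reuses the driver from the question:

original = driver.current_window_handle
handles = driver.window_handles        # all open tabs/windows
driver.switch_to.window(handles[1])    # focus the newly opened tab
# do what you want on the new tab here
driver.close()                         # close the new tab
driver.switch_to.window(original)      # back to the first tab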
I am trying to webscrape data for several bills initiated in the Peruvian Congress from the following website: http://www.congreso.gob.pe/pley-2016-2021
Basically, I want to click on each link in the search results, scrape the relevant information for the bill, return to the search results and then click on the next link for the next bill and repeat the process. Obviously with so many bills over congressional sessions it would be great if I could automate this.
So far I've been able to accomplish everything up until clicking on the next bill. I've been able to use Selenium to initiate a web browser that displays the search results, click on the first link using the xpath embedded in the iframe and then scrape the content with beautifulsoup and then navigate back to the search results. What I'm having trouble with is being able to click on the next bill in the search results because I'm not sure how to iterate over the xpath (or how to iterate over something that would take me to each subsequent bill). I'd like to be able to scrape the information for all of the bills on each page and then be able to navigate to the next page of the search results.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
WebDriverWait(driver, 50).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, "//a[@href='/Sicr/TraDocEstProc/CLProLey2016.nsf/641842f7e5d631bd052578e20058a231/243a65573d33ecc905258449007d20cc?OpenDocument']")))
elem.click()
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find('table', {'bordercolor' : '#6583A0'})
table_items = table.findAll('font')
table_authors = table.findAll('a')
for item in table_items:
    content = item.contents[0]
    print(content)

for author in table_authors:
    authors = author.contents[0]
    print(authors)
driver.back()
So far this is the code I have that launches the web browser, clicks on the first link of the search results, scrapes the necessary data, and then returns to the search results.
The following code will navigate to different pages in the search results:
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, "//a[contains(@onclick,'D32')]/img[contains(@src,'Sicr/TraDocEstProc/CLProLey')]")))
elem.click()
I guess my specific problem is being able to figure out how to automate clicking on subsequent bills in the iframe because once I'm able to do that, I'm assuming I could just loop over the bills on each page and then nest that loop within a function that loops over the pages of search results.
UPDATE: Partly with the help of the answer below, I applied that logic but used BeautifulSoup to scrape the href links in the iframe and stored them in a list, concatenating the necessary string pieces to build XPaths for all of the bills on the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
WebDriverWait(driver, 50).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find('table', {'cellpadding' : '2'})
table_items = table.find_all('a')
for item in table_items:
    elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
        (By.XPATH, "//a[@href='" + item.get('href') + "']")))
    elem.click()
    driver.back()
Now my issue is that it clicks the link for the first item in the loop and goes back to the search results, but it does not progress to the next item in the loop (the code just times out). I'm also pretty new to writing loops in Python, so I was wondering whether there is a way to iterate through the list of XPaths so that I can click one, scrape the info on that page, go back to the search results, and then click on the next item in the list.
Here is my logic for this problem:
1. First get into the iframe using switch_to.
2. Get the web elements for the XPath "//a" with driver.find_elements into a variable billLinks, as this frame has links only for bills.
3. Now iterate through billLinks and perform the desired actions.
I hope this solution helps.
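A minimal Python sketch of that logic, reusing the URL and frame name from the question (untested against the site; the links are re-found by index after each driver.back() because the previously located elements go stale):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
wait = WebDriverWait(driver, 50)

# 1. Switch into the iframe that holds the search results.
wait.until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))

# 2. Count the bill links inside the frame.
n_links = len(driver.find_elements(By.XPATH, "//a"))

# 3. Iterate by index, re-finding the links on every pass: after driver.back()
#    the previously located elements are stale, so they must be looked up again.
for i in range(n_links):
    links = driver.find_elements(By.XPATH, "//a")
    links[i].click()
    # ... scrape the bill page here, e.g. with BeautifulSoup as in the question ...
    driver.back()
    # Re-enter the iframe after navigating back.
    driver.switch_to.default_content()
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))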
I am trying to scrape data from a website that returns search results spanning multiple pages, using Selenium and BeautifulSoup in Python. The first page is easy to read. Moving to the next page requires clicking the '>' button. The element looks like this:
<a href ng-click="selectPage(page + 1, $event)" class="ng-binding">Next
I tried the following:
browser = webdriver.Chrome()
browser.get ("https:www....com/search/?lat=dfdfd ")
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
# scraping the first page
#now need to click on the ">" , so that it can take me to the next page
Control should go to the next page so that I can scrape it. There are about 250 pages of results.
In Chrome, if you right-click the page, there is an option in the context menu called "Inspect". Click that and find the element in the HTML. Once you find it, right-click it and choose Copy > Copy XPath. You can then use the browser.find_element_by_xpath method to assign that element to a variable and call element.click() to click it.
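For example, a small sketch (the XPath below is built from the ng-click attribute shown in the question rather than an absolute path copied from the inspector; substitute whatever Copy XPath actually gives you):

# Hypothetical locator for the ">" (next page) button, reusing the browser from the question
next_button = browser.find_element_by_xpath('//a[@ng-click="selectPage(page + 1, $event)"]')
next_button.click()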
Well, since you didn't provide the URL, I'll show an example of how to solve this.
I'm assuming the button has an ID, but you can change this to find it by class, etc.
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = Chrome()
browser.get("https:www....com/search/?lat=dfdfd ")
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
wait = WebDriverWait(browser, 30)
wait.until(EC.visibility_of_element_located((By.ID, 'next-button')))
# Next page
browser.find_element_by_id('next-button').click()
# Continue with your code ...
I've written a scraper in Python in combination with Selenium to get all the product names from redmart.com. Every time I run my code, I get only 27 names from that page although the page has numerous names. FYI, the page uses lazy loading. My scraper can reach the bottom of the page but scrapes only 27 names. I can't see where the logic I've applied in my scraper goes wrong. Hope to get a workaround.
Here is the script I've written so far:
from selenium import webdriver; import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://redmart.com/new")
check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        wait.until(lambda driver: driver.execute_script("return document.body.scrollHeight;") > check_height)
        check_height = driver.execute_script("return document.body.scrollHeight;")
    except:
        break

for names in driver.find_elements_by_css_selector('.description'):
    item_name = names.find_element_by_css_selector('h4 a').text
    print(item_name)
driver.quit()
You have to wait for new content to be loaded.
Here is a very simple example:
driver.get('https://redmart.com/new')
products = driver.find_elements_by_xpath('//div[@class="description"]/h4/a')
print(len(products)) # 18 products
driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
time.sleep(5) # wait for new content to be loaded
products = driver.find_elements_by_xpath('//div[@class="description"]/h4/a')
print(len(products)) # 36 products
It works.
You can also look at the XHR requests and try to scrape whatever you want without using "time.sleep()" and "driver.execute_script".
For example, while scrolling their website, new products are loaded from this URL:
https://api.redmart.com/v1.6.0/catalog/search?q=new&pageSize=18&page=1
As you can see, it is possible to modify parameters like pageSize (max 100 products) and page. With this URL you can scrape all products without even using Selenium and Chrome. You can do all of this with the Python requests library.
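For instance, a minimal sketch with requests (the JSON field names 'products' and 'title' are not shown above, so they are assumptions to verify against the actual response):

import requests

# Fetch one page of products straight from the XHR endpoint mentioned above
url = "https://api.redmart.com/v1.6.0/catalog/search"
params = {"q": "new", "pageSize": 100, "page": 1}  # pageSize can go up to 100
response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()

# 'products' and 'title' are assumed keys; inspect the real JSON to confirm the structure
for product in data.get("products", []):
    print(product.get("title"))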