scrape websites with infinite scrolling - python

I have written many scrapers, but I am not really sure how to handle infinite scrolling. These days most websites, e.g. Facebook and Pinterest, have infinite scrolling.

You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.
Step 1: Install Selenium using pip
pip install selenium
Step 2: Use the code below to automate the infinite scroll and extract the page source
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import unittest, time

class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        driver.find_element(By.LINK_TEXT, "All").click()
        # scroll to the bottom repeatedly so the page keeps loading more results
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')

if __name__ == "__main__":
    unittest.main()
Step 3: Print or parse the data as required.
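For example, you could hand the collected page source to BeautifulSoup and pull out whatever elements you need. A minimal sketch; the selector below is a placeholder you would replace with whatever you find in your browser's dev tools:
from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup(html_source, "html.parser")
# 'div.tweet' is a placeholder selector; inspect the page for the real one
for item in soup.select("div.tweet"):
    print(item.get_text(strip=True))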

Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.
But if you must scrape...
Such sites are using JavaScript to request additional content from the site when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it. Figuring out the required URL can be done by inspecting the script, by using the Firefox Web console, or by using a debug proxy.
For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
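Once you have spotted that URL, fetching the extra content is plain HTTP. A minimal sketch, assuming a hypothetical JSON endpoint with an offset parameter; substitute the real URL and query parameters you observed in the Web Console:
import requests

# hypothetical endpoint and parameters - substitute what you actually see in the Web Console
url = "https://example.com/api/items"
for offset in range(0, 100, 20):
    response = requests.get(url, params={"offset": offset, "limit": 20})
    response.raise_for_status()
    data = response.json()  # most of these endpoints return JSON
    for item in data.get("items", []):
        print(item)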

Finding the URL of the AJAX source will be the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser like QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.
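For a rough idea of what that looks like, here is a minimal sketch using PyQt5's QWebEnginePage (the successor to QtWebKit in current PyQt releases, so the API differs slightly from the QWebKit one mentioned above). It loads the page, lets its JavaScript run, and hands you the rendered HTML; for infinite scroll you would additionally call page.runJavaScript() to scroll before grabbing the DOM:
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage  # pip install PyQtWebEngine

app = QApplication(sys.argv)
page = QWebEnginePage()

def handle_html(html):
    # called asynchronously with the rendered DOM as a string
    print(html)
    app.quit()

def on_load_finished(ok):
    # the page (including its JavaScript) has finished loading
    page.toHtml(handle_html)

page.loadFinished.connect(on_load_finished)
page.load(QUrl("https://example.com"))
sys.exit(app.exec_())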

Related

How can I get Html of a website as seen on browser?

A website loads part of its content after the page is opened. When I use libraries such as requests and urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as it appears in the browser? I can't open a browser with Selenium and get the HTML, because that would slow the process down.
I tried httpx, httplib2, urllib, and urllib3, but I couldn't get the later-loaded section.
You can use the BeautifulSoup library or Selenium to simulate user-like page loading and wait for the additional HTML elements to load.
I would suggest using Selenium, since it provides the WebDriverWait class, which can help you wait for and scrape the additional HTML elements.
This is my simple example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Replace with the URL of the website you want
url = "https://www.example.com"

# Add the option for a headless browser
options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Create a new instance of the Chrome webdriver
driver = webdriver.Chrome(options=options)
driver.get(url)

# Wait for the additional HTML elements to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))

# Get the HTML
html = driver.page_source
print(html)

driver.close()
In the example above you can see that I'm using an explicit wait of 10 seconds for a specific condition to occur. More specifically, I'm waiting until the elements whose class contains 'lazy-load' are located by XPath, and then I retrieve the HTML.
Finally, I would recommend checking out both BeautifulSoup and Selenium, since both have tremendous capabilities for scraping websites and automating web-based tasks.

Struggling to click the load more button with Selenium

I plan to build a scraper that'll utilize both Selenium and BeautifulSoup.
I'm struggling to click the load more button with Selenium. I've managed to detect the button, scroll to it, etc., but I can't seem to figure out a way to continuously click the button.
Any suggestions on how to pass this hurdle?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time, requests
from bs4 import BeautifulSoup
def search_agent(zip):
    location = bot.find_element_by_name('hheroquotezip')
    time.sleep(3)
    location.clear()
    location.send_keys(zip)
    location.submit()

def load_all_agents():
    # click "more" until there are no more results to load
    while True:
        try:
            #more_button = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'results.length'))).click()
            more_button = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="searchResults"]/div[3]/button'))).click()
        except TimeoutException:
            break
    # wait for results to load
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.seclection-result .partners-detail')))
    print("Complete")
    bot.quit()

# define the ZIP code for the search query
zip = 20855

bot = webdriver.Safari()
wait = WebDriverWait(bot, 10)

# fetch the agents page
bot.get('https://www.erieinsurance.com/find-an-insurance-agent')
search_agent(zip)
load_all_agents()
With the above approach, the console spits out these errors:
[Error] Refused to load https://9203275.fls.doubleclick.net/activityi;src=9203275;type=agent0;cat=agent0;ord=7817740349177;gtm=2wg783;auiddc=373080108.1594822533;~oref=https%3A%2F%2Fwww.erieinsurance.com%2Ffind-an-insurance-agent-results%3Fzipcode%3D20855? because it does not appear in the frame-src directive of the Content Security Policy.
[Error] Refused to connect to https://api.levelaccess.net/analytics/3.0/results because it does not appear in the connect-src directive of the Content Security Policy.
Creating an answer to post a couple of images.
When I ran the attached script in Chrome it worked fine.
When @furas did the same in Firefox he had the same result.
I ran the same script 10 times back to back and I wasn't refused.
What I note based on the error is that the iframe seems browser-sensitive:
In Chrome this header contains Chromium scripts:
In Firefox it contains no scripts:
Have a look and see what you get manually in your Safari.
A simple answer might be to not use Safari and use Chrome or Firefox instead. Is that an option? (If it MUST be Safari, just say so and I'll look again.)
Finally, a couple of quick additional notes.
The site is using Angular, so you might want to consider Protractor if you're struggling with synchronisation. (Protractor helps with some script-syncing capabilities.)
Also worth a note: don't feel you have to land on the home page and then navigate as a user. Update your URL to go straight to the search results page with the zip code and save yourself some time (a small sketch follows the link below):
https://www.erieinsurance.com/find-an-insurance-agent-results?zipcode=20855
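A minimal sketch of that shortcut, reusing load_all_agents() from the script above (and Chrome rather than Safari, per the note above):
zip = 20855
bot = webdriver.Chrome()  # or webdriver.Firefox()
wait = WebDriverWait(bot, 10)

# go straight to the results page instead of submitting the search form
bot.get(f'https://www.erieinsurance.com/find-an-insurance-agent-results?zipcode={zip}')
load_all_agents()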
[edit/update]
Is this the same thing? https://github.com/SeleniumHQ/selenium/issues/458
A closed bug from 2016 around "Content Security Policies", logged as an Apple issue.

Python print Xpath element gives empty array

I'm trying to get the XPath of an element on the site https://www.tradingview.com/symbols/BTCUSD/technicals/
Specifically, the result under the summary speedometer: whether it's buy or sell.
Using Google Chrome's "Copy XPath" I get the result
//*[@id="technicals-root"]/div/div/div[2]/div[2]/span[2]
and to try to get that data in Python I plugged it into:
from lxml import html
import requests
page = requests.get('https://www.tradingview.com/symbols/BTCUSD/technicals/')
tree = html.fromstring(page.content)
status = tree.xpath('//*[@id="technicals-root"]/div/div/div[2]/div[2]/span[2]/text()')
When I print status I get an empty list, but nothing seems to be wrong with the XPath. I've read that Chrome does some shenanigans with incorrectly written HTML tables which will output the wrong XPath, but that doesn't seem to be the issue.
When I run your code, the "technicals-root" div is empty. I assume javascript is filling it in. When you can't get a page statically, you can always turn to Selenium to run a browser and let it figure everything out. You may have to tweak the driver path to get it working in your environment but this works for me:
import time
import contextlib
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

option = webdriver.ChromeOptions()
option.add_argument("--incognito")

with contextlib.closing(webdriver.Chrome(
        executable_path='/usr/lib/chromium-browser/chromedriver',
        chrome_options=option)) as browser:
    browser.get('https://www.tradingview.com/symbols/BTCUSD/technicals/')
    # wait until js has filled in the element - and a bit longer for js churn
    WebDriverWait(browser, 20).until(EC.visibility_of_element_located(
        (By.XPATH,
         '//*[@id="technicals-root"]/div/div/div[2]/div[2]/span')))
    time.sleep(1)
    status = browser.find_elements_by_xpath(
        '//*[@id="technicals-root"]/div/div/div[2]/div[2]/span[2]')
    print(status[0].text)

Not able to scrape Google Adsense

I am trying to scrape a website and want to get the URLs and images from Google AdSense, but it seems I am not getting any AdSense details.
Here is what I want:
If we search "refrigerator" in Google, we will get some ads there which I need to fetch, or some blogs/websites showing Google Ads (see image).
But when I inspect the page I can find the related divs and URL, yet when I hit the URL I get only static HTML data.
Here is the code which I need to fetch (see image).
Here is the script which I have written in Python with Selenium.
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.common.by import By
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.compiletimeerror.com/"

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get(url) # load page
    delay = 10 # seconds
    try:
        WebDriverWait(browser, delay).until(EC.presence_of_element_located(browser.find_element_by_xpath("(//div[@class='pla-unit'])[0]")))
        print "Page is ready!"
        Element = browser.find_element(By.ID, value="google_image_div")
        print Element
        print Element.text
    except TimeoutException:
        print "Loading took too much time!"
But I'm still unable to get data. Please give me any reference or hint.
You need to first select the frame which contains the elements you want to work with.
select_frame("id=google_ads_frame1");
Note: I am not sure about the Python syntax, but it should be something similar to this.
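In Python with Selenium the rough equivalent would be the following (switch_to.frame accepts the frame's name or id as a string, an index, or a WebElement):
browser.switch_to.frame("google_ads_frame1")
# ... locate the ad elements inside the iframe ...
browser.switch_to.default_content()  # switch back to the main document afterwards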
Use Selenium's switch_to.frame method to direct your browser to the iframe in your HTML, before selecting your element variable (untested):
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.common.by import By
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.compiletimeerror.com/"

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get(url) # load page
    delay = 10 # seconds
    try:
        # wait for the ad unit to appear (note: a locator tuple, and XPath indexing starts at 1)
        WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.XPATH, "(//div[@class='pla-unit'])[1]")))
        print "Page is ready!"
        # switch into the AdSense iframe before locating elements inside it
        browser.switch_to.frame(browser.find_element_by_id('google_ads_frame1'))
        element = browser.find_element(By.ID, value="google_image_div")
        print element
        print element.text
    except TimeoutException:
        print "Loading took too much time!"
http://elementalselenium.com/tips/3-work-with-frames
A note on Python style best practices: use lowercase when declaring local variables (element vs. Element).

Selenium - Python bindings - Detect new AJAX data

As a beginner programmer, I have found a lot of useful information on this site, but could not find an answer to my specific question. I want to scrape data from a webpage, but some of the data I am interested in scraping can only be obtained after clicking a "more" button. The below code executes without producing an error, but it does not appear to click the "more" button and display the additional data on the page. I am only interested in viewing the information on the "Transcripts" tab, which seems to complicate things a bit for me because there are "more" buttons on the other tabs. The relevant portion of my code is as follows:
from mechanize import Browser
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
import urllib2
import mechanize
import logging
import time
import httplib
import os
import selenium
url="http://seekingalpha.com/symbol/IBM/transcripts"
ua='Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)'
br=Browser()
br.addheaders=[('User-Agent', ua), ('Accept', '*/*')]
br.set_debug_http(True)
br.set_debug_responses(True)
logging.getLogger('mechanize').setLevel(logging.DEBUG)
br.set_handle_robots(False)
chromedriver="~/chromedriver"
os.environ["webdriver.chrome.driver"]=chromedriver
driver=webdriver.Chrome(chromedriver)
time.sleep(1)
httplib.HTTPConnection._http_vsn=10
httplib.HTTPConnection._http_vsn_str='HTTP/1.0'
page=br.open(url)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
actions=ActionChains(driver)
elem=driver.find_element_by_css_selector("div #transcripts_show_more div#more.older_archives")
actions.move_to_element(elem).click()
A couple of things:
Given you're using selenium, you don't need either mechanize or urllib2 as selenium is doing the actual page loading. As for the other imports (httplib, logging, os and time), they're either unused or redundant.
For my own convenience, I changed the code to use Firefox; you can change it back to Chrome (or any other browser).
In regards to the ActionChains, you don't need them here as you're only doing a single click (nothing to chain, really).
Given the browser is receiving data (via AJAX) instead of loading a new page, we don't know when the new data has appeared; so we need to detect the change.
We know that 'clicking' the button loads more <li> tags, so we can check if the number of <li> tags has changed. That's what this line does:
WebDriverWait(selenium_browser, 10).until(lambda driver: len(driver.find_elements_by_xpath("//div[@id='headlines_transcripts']//li")) != old_count)
It will wait up to 10 seconds, periodically comparing the current number of <li> tags from before and during the button click.
import selenium
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import WebDriverException
from selenium.common.exceptions import TimeoutException as SeleniumTimeoutException
from selenium.webdriver.support.ui import WebDriverWait

url = "http://seekingalpha.com/symbol/IBM/transcripts"

selenium_browser = webdriver.Firefox()
selenium_browser.set_page_load_timeout(30)
selenium_browser.get(url)
selenium_browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

elem = selenium_browser.find_element_by_css_selector("div #transcripts_show_more div#more.older_archives")
old_count = len(selenium_browser.find_elements_by_xpath("//div[@id='headlines_transcripts']//li"))
elem.click()

try:
    WebDriverWait(selenium_browser, 10).until(lambda driver: len(driver.find_elements_by_xpath("//div[@id='headlines_transcripts']//li")) != old_count)
except StaleElementReferenceException:
    pass
except SeleniumTimeoutException:
    pass

print(selenium_browser.page_source.encode("ascii", "ignore"))
I'm on python2.7; if you're on python3.X, you probably won't need .encode("ascii", "ignore").
