lambda_options = [
'--autoplay-policy=user-gesture-required',
'--disable-background-networking',
'--disable-background-timer-throttling',
'--disable-backgrounding-occluded-windows',
'--disable-breakpad',
'--disable-client-side-phishing-detection',
'--disable-default-apps',
'--disable-dev-shm-usage',
'--disable-extensions',
'--disable-features=AudioServiceOutOfProcess',
'--disable-hang-monitor',
'--disable-notifications',
'--disable-offer-store-unmasked-wallet-cards',
'--disable-print-preview',
'--disable-prompt-on-repost',
'--disable-speech-api',
'--disable-sync',
'--ignore-gpu-blacklist',
'--ignore-certificate-errors',
'--mute-audio',
'--no-default-browser-check',
'--no-first-run',
'--no-pings',
'--no-sandbox',
'--no-zygote',
'--password-store=basic',
'--use-gl=swiftshader',
'--use-mock-keychain',
'--single-process',
'--headless']
for argument in lambda_options:
options.add_argument(argument)
process_ids = list(range(10))
drivers = {i: webdriver.Chrome(executable_path='path', options=options) for i in process_ids}
So these are my chrome options and how i set them up to have 10 instances running in a single lambda invocation. When I run it non headless on my pc, the crawler misses very few sites due to errors with page loading or selenium not responding, but in lambdas I am missing a ton of data. What can I do to rectify this?
I am scraping mostly with python selenium and some pages with beautifulsoup, and the sites I visit require some actions to be done before I can grab the data I want.
Related
following many article one can log XHR calls in an automated browser (using selenium) as bellow:
capabilities = DesiredCapabilities.CHROME
capabilities["loggingPrefs"] = {"performance": "ALL"} # newer: goog:loggingPrefs driver = webdriver.Chrome(
desired_capabilities=capabilities, executable_path="./chromedriver" )
...
logs_raw = driver.get_log("performance")
my probleme is the target request is performed by a "WebWorker", so it's not listed in the performance list of the browser main thread.
getting in chrome and manualy selecting the webworkers scope in the dev console "performance.getEntries()" gets me the request i want;
my question is how can someone perform such an action in selenium ? (python preferable).
no where in the doc of python selenium or Devtool Api have i found something similar
i'm so greatful in advance.
Edit: after some deggin i found that it has something to do with execution context of javascrip, i have no clue how to switch that in selenium
I am having trouble getting a page source HTML out of a site with selenium through a proxy. Here is my code
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time
import shutil
proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'
PROXY = proxy_username+":"+proxy_password+"#"+hostname+":"+port
options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)
driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)
driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
f = codecs.open('dummy.html', "w", "utf-8")
f.write(html)
driver.close()
This results in a very incomplete HTML, showing only outer brackets of head and body:
html
Out[3]: '<html><head></head><body></body></html>'
Also the dummy.html file written to disk does not show any other content that what is displayed in the line above.
I am lost, here is what I tried
It does work when I run it without options.add_argument('--proxy-server=%s' %PROXY) line. So I am sure it is the proxy. But the proxy connection itself seems to be ok (I do not get any proxy connection erros - plus I do get the outer frame from the website, right? So the driver request gets through & back to me)
Different URLs: Not only whatismyip.com fails, also any other pages - tried different news outlets such as CNN or even google - virtually nothing comes back from any website, except for head and body brackets. It cannot be any javascript/iframe issue, right?
Different wait times (this article does not help: Make Selenium wait 10 seconds), up to 60 seconds -- plus my connection is super fast, <1 second should be enough (in browser)
What am I getting wrong about the connection?
driver.page_source does not always return what you expect via selenium. It's likely NOT the full dom. This is documented in the selenium doc and in various SO answers, e.g.:
https://stackoverflow.com/a/45247539/1387701
Selenium gives a best effort to provide the page source as it is fetched. Only highly dynamic pages this can often be limited in it's return.
I have recently started learning web scraping with Scrapy and as a practice, I decided to scrape a weather data table from this url.
By inspecting the table element of the page, I copy its XPath into my code but I only get an empty list when running the code. I tried to check which tables are present in the HTML using this code:
from scrapy import Selector
import requests
import pandas as pd
url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).content
sel = Selector(text=html)
table = sel.xpath('//table')
It only returns one table and it is not the one I wanted.
After some research, I found out that it might have something to do with JavaScript rendering in the page source code and that Python requests can't handle JavaScript.
After going through a number of SO Q&As, I came upon a certain requests-html library which can apparently handle JS execution so I tried acquiring the table using this code snippet:
from requests_html import HTMLSession
from scrapy import Selector
session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')
resp.html.render()
html = resp.html.html
sel = Selector(text=html)
tables = sel.xpath('//table')
print(tables)
But the result doesn't change. How can I acquire that table?
Problem
Multiple problems may be at play here—not only javascript execution, but HTML5 APIs, cookies, user agent, etc.
Solution
Consider using Selenium with headless Chrome or Firefox web driver. Using selenium with a web driver ensures that page will be loaded as intended. Headless mode ensures that you can run your code without spawning the GUI browser—you can, of course, disable headless mode to see what's being done to the page in realtime and even add a breakpoint so that you can debug beyond pdb in the browser's console.
Example Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")
tables = driver.find_elements_by_xpath('//table') # There are several APIs to locate elements available.
print(tables)
References
Selenium Github: https://github.com/SeleniumHQ/selenium
Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html
Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html
you can use scrapy-splash plugin to work scrapy with Splash (scrapinghub's javascript browser)
Using splash you can render javascript and also execute user events like mouse click
I am using the latest chromedriver 2.45. I am currently building a program which scrape stocks data from a website. I have a list of around 3000 stocks to scrape, so I used multithreading to speed up my work. My program seems to work fine if I turn off the headless browser, but when I turn headless browser to true(with the aim to speed up the script), sometimes the thread will get stuck when running the following line:
browser.get(url)
For each stock, prior running the above script, the following script will run:
options = Options()
chrome_prefs = {}
options.experimental_options["prefs"] = chrome_prefs
chrome_prefs["profile.default_content_settings"] = {"images": 2}
chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
options.add_argument('--headless')
options.add_argument("–no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--disable-extensions")
options.add_argument("disable-infobars")
options.add_argument('--disable-useAutomationExtension')
options.Proxy = None
options.add_argument("–disable-dev-shm-usage")
options.add_argument('blink-settings=imagesEnabled=false')
browser = webdriver.Chrome(options=options)
browser.minimize_window()
The sad thing is when it get stuck in the line, it does not raise any exception. I believe that the thread is trying to access the url but the site does not load so it just keep waiting and waiting? Could that be the case? How to stop the problem? Or maybe a way out is to make a timer for browser.get(url), let say for 10 seconds, if it does not get any data, it will refresh the link again and continue on the script?
Is there also any ways or settings that I can speed up the script? And is it possible to make the program run in the background when I execute the script as it keep popping up (although it minimize itself a second later but the chromedriver is still on the front..)
Thank you for your time!
I'm trying to use PhantomJS to write a scraper but even the example in the documentation of morph.io is not working. I guess the problem is "https", I tested it with http and it is working. Can you please give me a solution?
I tested it using firefox and it works.
from splinter import Browser
with Browser("phantomjs") as browser:
# Optional, but make sure large enough that responsive pages don't
# hide elements on you...
browser.driver.set_window_size(1280, 1024)
# Open the page you want...
browser.visit("https://morph.io")
# submit the search form...
browser.fill("q", "parliament")
button = browser.find_by_css("button[type='submit']")
button.click()
# Scrape the data you like...
links = browser.find_by_css(".search-results .list-group-item")
for link in links:
print link['href']
PhantomJS is not working on https urls?
Splinter uses the Selenium WebDriver bindings (example) for Python under the hood, so you can simply pass the necessary options like this:
with Browser("phantomjs", service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any']) as browser:
...
See PhantomJS failing to open HTTPS site for why those options might be necessary. Take a look at the PhantomJS commandline interface for more options.