Selenium with proxy returns empty website - python

I am having trouble getting a page source HTML out of a site with selenium through a proxy. Here is my code
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time
import shutil
proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'
PROXY = proxy_username+":"+proxy_password+"#"+hostname+":"+port
options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)
driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)
driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
f = codecs.open('dummy.html', "w", "utf-8")
f.write(html)
driver.close()
This results in a very incomplete HTML, showing only outer brackets of head and body:
html
Out[3]: '<html><head></head><body></body></html>'
Also the dummy.html file written to disk does not show any other content that what is displayed in the line above.
I am lost, here is what I tried
It does work when I run it without options.add_argument('--proxy-server=%s' %PROXY) line. So I am sure it is the proxy. But the proxy connection itself seems to be ok (I do not get any proxy connection erros - plus I do get the outer frame from the website, right? So the driver request gets through & back to me)
Different URLs: Not only whatismyip.com fails, also any other pages - tried different news outlets such as CNN or even google - virtually nothing comes back from any website, except for head and body brackets. It cannot be any javascript/iframe issue, right?
Different wait times (this article does not help: Make Selenium wait 10 seconds), up to 60 seconds -- plus my connection is super fast, <1 second should be enough (in browser)
What am I getting wrong about the connection?

driver.page_source does not always return what you expect via selenium. It's likely NOT the full dom. This is documented in the selenium doc and in various SO answers, e.g.:
https://stackoverflow.com/a/45247539/1387701
Selenium gives a best effort to provide the page source as it is fetched. Only highly dynamic pages this can often be limited in it's return.

Related

HTML acquired in Python code is not the same as displayed webpage

I have recently started learning web scraping with Scrapy and as a practice, I decided to scrape a weather data table from this url.
By inspecting the table element of the page, I copy its XPath into my code but I only get an empty list when running the code. I tried to check which tables are present in the HTML using this code:
from scrapy import Selector
import requests
import pandas as pd
url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).content
sel = Selector(text=html)
table = sel.xpath('//table')
It only returns one table and it is not the one I wanted.
After some research, I found out that it might have something to do with JavaScript rendering in the page source code and that Python requests can't handle JavaScript.
After going through a number of SO Q&As, I came upon a certain requests-html library which can apparently handle JS execution so I tried acquiring the table using this code snippet:
from requests_html import HTMLSession
from scrapy import Selector
session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')
resp.html.render()
html = resp.html.html
sel = Selector(text=html)
tables = sel.xpath('//table')
print(tables)
But the result doesn't change. How can I acquire that table?
Problem
Multiple problems may be at play here—not only javascript execution, but HTML5 APIs, cookies, user agent, etc.
Solution
Consider using Selenium with headless Chrome or Firefox web driver. Using selenium with a web driver ensures that page will be loaded as intended. Headless mode ensures that you can run your code without spawning the GUI browser—you can, of course, disable headless mode to see what's being done to the page in realtime and even add a breakpoint so that you can debug beyond pdb in the browser's console.
Example Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")
tables = driver.find_elements_by_xpath('//table') # There are several APIs to locate elements available.
print(tables)
References
Selenium Github: https://github.com/SeleniumHQ/selenium
Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html
Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html
you can use scrapy-splash plugin to work scrapy with Splash (scrapinghub's javascript browser)
Using splash you can render javascript and also execute user events like mouse click

How to get python requests or urllib2 to wait for final page

I'm using python requests and beautifulsoup to verify a html document. However, the server for the landing page has some backend code that delays several seconds before presenting the final html document. I've tried the redirect=true approach but I end up with the original document. When loading the url on a browser, there is a 2-3 second delay while the page is created by the server. I've tried various samples like url2.geturl() after page load but all of these return the original url (and do so well before the 2-3 seconds elapses). I need something that emulates a browser and grabs the final document.
Btw, I am able to view the correct DOM elements in Chrome, just not problematically in python.
Figured this out after a few cycles. This requires a combination of 2 solutions (use python selenium package and time.sleep). Sets the background chrome process to run headless, get the url, wait for server side code to complete, then load the document. Here, I'm using BeautifulSoup to parse the DOM.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
def run():
url = "http://192.168.1.55"
options = webdriver.ChromeOptions()
options.add_argument('headless')
browser = webdriver.Chrome(chrome_options=options)
browser.get(url)
time.sleep(5)
bs = BeautifulSoup(browser.page_source, 'html.parser')
data = bs.find_all('h3')
if __name__ == "__main__":
run()

Getting page source by Python + Selenium not works, connection refused

I try to get page source by using Selenium.
My code looks like below:
#!/usr/bin/env python
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://python.org')
html_source = browser.page_source
print html_source
When I run the script, it opens browser but nothing happens. When I'm waiting without doing anything it throws "Connection refused", after about 15 seconds.
If I enter the address and go to the website, nothing happens too.
Why doesn't it work? Script looks good in my opinion and it should work
I'm doing it because I need to get page source after JS scripts are executed and I suspect that it can be done by Selenium.
Or maybe you know any other ways to get page source after JavaScript is loaded?
As per your question you have invoked get() method passing the argument as https://python.org. Instead you must have passed the argument as https://www.python.org/ as follows :
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://www.python.org/')
html_source = browser.page_source
print (html_source)
Note : Ensure that you are using the latest Selenium-Python v3.8.0 clients, GeckoDriver v0.19.1 binary along with latest Firefox Quantum v57.x Web Browser.

Selenium Webdriver Python- Page loads incompletely/ sometimes freezes on refresh

I am scraping a website with a lot of javascript that is generated when the page is called. As a result, traditional web scraping methods (beautifulsoup, ect.) are not working for my purposes (at least I have been unsuccessful in getting them to work, all of the important data is in the javascript parts). As a result I have started using selenium webdriver. I need to scrape a few hundred pages, each of which has between 10 and 80 data points (each with about 12 fields), so it is important that this script (is that the right terminology?) can run for quite awhile without me having to babysit it.
I have the code working for a single page, and I have a controlling section that tells the scraping section what page to scrape. The problem is that sometimes the javascript portions of the page load, and sometimes they don't when they don't(~1/7), a refresh fixes things, but occasionally the refresh will freeze webdriver and thus the python runtime environment as well. Annoyingly, when it freezes like this, the code fails to time out. What is going on?
Here is a stripped down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple
def main(url_full):
driver = webdriver.Firefox()
driver.implicitly_wait(15)
driver.set_page_load_timeout(30)
#create HealthPlan namedtuple
HealthPlan = namedtuple( "HealthPlan", ("State, County, FamType, Provider, PlanType, Tier,") +
(" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
(" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))
#check whether the page has loaded and handle page load and time out errors
pageNotLoaded= bool(True)
while pageNotLoaded:
try:
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
except TimeoutException:
driver.quit()
time.sleep(3+ abs(random.normalvariate(1.8,3)))
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
# Handle page load error by testing presence of showAll,
# an important feature of the page, which only appears if everything else loads
try:
driver.find_element_by_xpath('//*[#id="showAll"]').text
# catch NoSuchElementException=>refresh page
except NoSuchElementException:
try:
driver.refresh()
# catch TimeoutException => quit and load the page
# in a new instance of firefox,
# I don't think the code ever gets here, because it freezes in the refresh
# and will not throw the timeout exception like I would like
except TimeoutException:
driver.quit()
time.sleep(3+ abs(random.normalvariate(1.8,3)))
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
pageNotLoaded= False
scrapePage() # this is a dummy function, everything from here down works fine,
I have looked extensively for similar problems, and I do not think anyone else has posted about this on so, or anywhere else that I have looked. I am using python 2.7, selenium 2.39.0 and I am trying to scrape Healthcare.gov 's get premium estimate's pages
EDIT: (as an example,this page) It may also be worth mentioning that the page fails to load completely more often when the computer has been on/ doing this for awhile (i'm guessing that the free ram is getting full, and it glitches while loading) this is kind of beside the point though, because this should be handled by the try/except.
EDIT2: I should also mention that this is being run on windows7 64bit, with firefox 17 (which I believe is the newest supported version)
Dude, time.sleep it's a fail!
What's this?
time.sleep(3+ abs(random.normalvariate(1.8,3)))
Try this:
class TestPy(unittest.TestCase):
def waits(self):
self.implicit_wait = 30
Or this:
(self.)driver.implicitly_wait(10)
Or this:
WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('some_xpath'))
Or, instead of driver.refresh() you can trick it :
driver.get(your url)
Also you can cick the cookie :
driver.delete_all_cookies()
scrapePage() # this is a dummy function, everything from here down works fine, :
http://scrapy.org

How to retrieve the values of dynamic html content using Python

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I'm finding a template instead e.g."{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print "Finding..."
for tag in soup.find_all('formatPrice median'):
print tag.text
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the ajax JSON directly
use an offline Javascript interpreter to process the request SpiderMonkey, crowbar
use a browser automation tool splinter
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and its pretty handy.
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
print tag.text
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
I used selenium + chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')`
Building off another answer. I had a similar issue. wget and curl do not work well anymore to get the content of a web page. It's particularly broken with dynamic and lazy content. Using Chrome (or Firefox or Chromium version of Edge) allows you to deal with redirects and scripting.
Below will launch an instance of Chrome, increase the timeout to 5 sec, and navigate this browser instance to a url. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and screenshots of the anchored (hence the "a" tag) HTML object, another name for hyperlink:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
location = element.location;
size = element.size;
# Collect coordinates of object: left/right, top/bottom
x1 = location['x'];
y1 = location['y'];
x2 = location['x']+size['width'];
y2 = location['y']+size['height'];
locations.append([element,x1,y1,x2,y2, x2-x1, y2-y1])
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [ (el,x1,y1,x2,y2, width,height)
for el,x1,y1,x2,y2,width,height in locations
if not (
## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
x2 <= x1 or y2 <= y1 or x2<0 or y2<0
## Further restrict if you expect the objects to be around a specific size
## or width<200 or height<100
)
]
for el,x1,y1,x2,y2,width,height in tqdm(locations[:10]):
try:
print('-'*100,f'({width},{height})')
print(el.text[:100])
element_png = el.screenshot_as_png
with open('/tmp/_pageImage.png', 'wb') as f:
f.write(element_png)
img = Image.open('/tmp/_pageImage.png')
display(img)
except Exception as err:
print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I was using Mac for the original answer and Ubuntu + Windows 11 preview via WSL2 after updating. Chrome ran from Linux side with X service on Windows to render the UI.
Regarding responsibility, please respect robots.txt on each site.
I know this is an old question, but sometimes there is a better solution than using heavy selenium.
This request module for python comes with JS support (in the background it is still chromium) and you can still use beautifulsoup like normal.
Though, sometimes if you have to click elements or sth, I guess selenium is the only option.

Categories

Resources