Can't capture HAR using a Python Selenium script with BrowserMob-Proxy

Goal:
I want to run a Selenium Python script through BrowserMob-Proxy, which will capture the network traffic and output it as a HAR file.
Problem:
I have a functional (very basic) Python script (shown below). However, when it is altered to use BrowserMob-Proxy to capture a HAR, it fails. Below I provide two different scripts that both fail, but for different reasons (details provided after the code snippets).
BrowserMob-Proxy Explanation:
As noted in the specs below, I am using both 0.6.0 and 2.0-beta-8. The reasoning for this is that A) Lightbody (lead developer of BMP) recently indicated that the most current release (2.0-beta-9) is not functional and advises users to use 2.0-beta-8 instead, and B) from what I can tell from various sites and Stack Overflow posts, 0.6.0 (installed through pip) is used to make calls to Client.py/Server.py, whereas 2.0-beta-8 is used to start the server itself. To be honest, this confuses me. Importing BMP's Server requires a batch (.bat) file to start the server, which is not provided in 0.6.0 but is provided with 2.0-beta-8. If anyone can shed some light on this area of confusion (I suspect it is the root of my problems described below), I'd be most appreciative.
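To make my (possibly incorrect) mental model concrete, here is how I understand the two pieces fitting together; treat this as an illustration of my assumption rather than established fact:
from browsermobproxy import Server  # from the 0.6.0 pip package (the Python client)

# The path points at the .bat launcher shipped in the 2.0-beta-8 download;
# the pip package itself does not include this launcher.
server = Server("C:\\Users\\Matt\\Desktop\\browsermob-proxy-2.0-beta-8\\bin\\browsermob-proxy")
server.start()                 # spawns the Java proxy process
proxy = server.create_proxy()  # talks to that process over its REST API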
Software Specs:
Operating System: Windows 7 (64x) -- running in VirtualBox
Browser: Firefox (32.0.2)
Script Language: Python (2.7.8)
Automated Web Browser: Selenium (2.43.0) -- installed via PIP
BrowserMob-Proxy: 0.6.0 AND 2.0-beta-8 -- see explanation above
Selenium Script (this script works):
"""This script utilizes Selenium to obtain the Google homepage"""
from selenium import webdriver
driver = webdriver.Firefox() # Opens FireFox browser.
driver.get('https://google.com/') # Gets google.com and loads page in browser.
driver.quit() # Closes Firefox browser
This script succeeds in running and does not produce any errors. It is provided for illustrative purposes to indicate it works before adding BMP logic.
Script ALPHA with BMP (does not work):
"""Using the same functional Selenium script, produce ALPHA_HAR.har output"""
from browsermobproxy import Server
server = Server("C:\\Users\\Matt\\Desktop\\browsermob-proxy-2.0-beta-8\\bin\\browsermob-proxy")
server.start()
proxy = server.create_proxy()
from selenium import webdriver
driver = webdriver.Firefox() # Opens FireFox browser.
proxy.new_har("ALPHA_HAR") # Creates a new HAR
driver.get("https://www.google.com/") # Gets google.com and loads page in browser.
proxy.har # Returns a HAR JSON blob
server.stop()
This script runs without producing any errors. However, after searching my entire hard drive, I have never been able to locate ALPHA_HAR.har.
Script BETA with BMP (does not work):
"""Using the same functional Selenium script, produce BETA_HAR.har output"""
from browsermobproxy import Server
server = Server("C:\\Users\\Matt\\Desktop\\browsermob-proxy-2.0-beta-8\\bin\\browsermob-proxy")
server.start()
proxy = server.create_proxy()
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)
proxy.new_har("BETA_HAR") # Creates a new HAR
driver.get("https://www.google.com/") # Gets google.com and loads page in browser.
proxy.har # Returns a HAR JSON blob
server.stop()
This code was taken from http://browsermob-proxy-py.readthedocs.org/en/latest/. When running it, Firefox attempts to get google.com but never succeeds in loading the page. Eventually it times out without producing any errors, and BETA_HAR.har can't be found anywhere on my hard drive. I have also noticed that, when trying to use this browser to visit any other site, it similarly fails to load (I suspect this is due to the proxy not being configured properly).

Try this:
from browsermobproxy import Server
from selenium import webdriver
import json
server = Server("path/to/browsermob-proxy")
server.start()
proxy = server.create_proxy()
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)
proxy.new_har("http://stackoverflow.com", options={'captureHeaders': True})
driver.get("http://stackoverflow.com")
result = json.dumps(proxy.har, ensure_ascii=False)
print result
proxy.stop()
driver.quit()
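If you want an actual .har file on disk rather than console output (which was the original goal), the same JSON string can simply be written out; the file name here is just an example:
with open("stackoverflow.har", "w") as harfile:
    harfile.write(result)  # 'result' is the json.dumps output from above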

I use PhantomJS; here is an example of how to use it with Python:
import browsermobproxy as mob
import json
from selenium import webdriver
BROWSERMOB_PROXY_PATH = '/usr/share/browsermob/bin/browsermob-proxy'
url = 'http://google.com'
s = mob.Server(BROWSERMOB_PROXY_PATH)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes']  # so that I can do HTTPS connections
driver = webdriver.PhantomJS(service_args=service_args)
driver.set_window_size(1400, 1050)
proxy.new_har(url)
driver.get(url)
har_data = json.dumps(proxy.har, indent=4)
screenshot = driver.get_screenshot_as_png()
imgname = "google.png"
harname = "google.har"
save_img = open(imgname, 'wb')  # binary mode so the PNG bytes aren't mangled
save_img.write(screenshot)
save_img.close()
save_har = open(harname, 'w')   # overwrite rather than append on re-runs
save_har.write(har_data)
save_har.close()
driver.quit()
s.stop()

What worked for me was to downgrade my Java version to Java 11. I used jenv to install and manage multiple Java versions.

When you do:
proxy.har
You need to serialize that response: proxy.har gives you the HAR as a parsed JSON object (a Python dict), so if you want to generate a file you need to write it out yourself, like this:
import json

myFile = open('BETA_HAR.har', 'w')
myFile.write(json.dumps(proxy.har))  # serialize back to JSON before writing
myFile.close()
Then you will find your .har

Finding your HAR file
Inherently, the HAR object generated by the proxy is just that: an object in memory. The reason you can't find it on your hard drive is that it never gets saved there unless you write it there yourself. This is a pretty simple operation, as the HAR is just JSON.
import json

with open("harfile", "w") as harfile:
    harfile.write(json.dumps(proxy.har))
Why does ALPHA not work?
Once you start dumping your HAR to disk, you'll find that it is empty for the ALPHA script. This is because you are not adding the proxy to Firefox's settings, meaning the browser just connects directly and bypasses your proxy.
What about BETA?
This code is written correctly as far as connecting to the proxy, although personally I prefer adding the proxy to the capabilities and passing those through. The code for that is:
cap = webdriver.DesiredCapabilities.FIREFOX.copy()
proxy.add_to_capabilities(cap)
driver = webdriver.Firefox(capabilities=cap)
I would guess that your issue lies with the proxy itself. Check the bmp.log and/or server.log files in the directory you ran the Python script from and see what they report when something goes wrong.
Another possibility is that Selenium is reporting that the page has loaded before it has actually finished fetching all of its elements, so your proxy is shutting down too early. Try making the script wait a bit longer before shutting down the proxy (as sketched below), or run it interactively through the interpreter.
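As a rough sketch of that last suggestion (the sleep length and the readyState poll are arbitrary choices, not tuned values), you could delay the shutdown like this:
import json
import time

driver.get("https://www.google.com/")
# crude: give outstanding requests a chance to finish before reading the HAR
while driver.execute_script("return document.readyState") != "complete":
    time.sleep(0.5)
time.sleep(5)

with open("BETA_HAR.har", "w") as harfile:
    harfile.write(json.dumps(proxy.har))
server.stop()
driver.quit()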

Related

Selenium Firefox browser is stuck after downloading pdf

Was hoping someone could help me understand what's going on:
I'm using Selenium with Firefox browser to download a pdf (need Selenium to login to the corresponding website):
le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
time.sleep(5)
if le:
    pdf_link = le[0].get_attribute("href")
    browser.get(pdf_link)
The code does download the pdf, but after that just stays idle.
This seems to be related to the following browser settings:
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
If I disable the first, it doesn't hang but opens the PDF instead of downloading it. If I disable the second, a "Save As" pop-up window shows up. Could someone explain how to handle this?
For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then send a subsequent fetch via the Python requests library with the selenium cookies attached. More explanation below:
There are several ways to render a PDF via Firefox + Selenium. If you're using the most recent version of Firefox, it'll most likely render the PDF via pdf.js so you can view it inline. This isn't ideal because now we can't download the file.
You can disable pdf.js via Selenium options but this will likely lead to the issue in this question where the browser gets stuck. This might be because of an unknown MIME-Type but I'm not totally sure. (There's another StackOverflow answer that says this is also due to Firefox versions.)
However, we can bypass this by passing Selenium's cookie session to requests.session().
Here's a toy example:
import requests
from selenium import webdriver
pdf_url = "/url/to/some/file.pdf"
# setup driver with options
driver = webdriver.Firefox(..options)
# do whatever you need to do to auth/login/click/etc.
# navigate to the PDF URL in case the PDF link issues a
# redirect because requests.session() does not persist cookies
driver.get(pdf_url)
# get the URL from Selenium
current_pdf_url = driver.current_url
# create a requests session
session = requests.session()
# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
session.cookies.set(cookie["name"], cookie["value"])
# Note: If headers are also important, you'll need to use
# something like seleniumwire to get the headers from Selenium
# Finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)
# access the bytes response from the session
pdf_bytes = pdf_response.content
I highly recommend using seleniumwire over regular selenium because it extends Python Selenium to let you return headers, wait for requests to finish, use proxies, and much more.
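For completeness, here is a minimal sketch of what that looks like with selenium-wire (assuming the package is installed; the URL is just a placeholder):
from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver

driver = webdriver.Firefox()
driver.get("https://example.com/some/file.pdf")

# every request the browser made is captured, including responses
for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code)
driver.quit()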

How Do I Reset The Har File Used By Browser-Mob-Proxy Module For Python?

I'm using the browser-mob-proxy module and Selenium for Python in order to record some of the things a website sends while you are using it.
What I am doing is creating a server using browser-mob-proxy, then creating a proxy for that server. I then create a HAR to record the data. I later use this data for something else.
I want to know if there is a way to reset the HAR file so that it is empty or if I will have to create a new HAR to store new data.
The proxy is assigned to the selenium browser which is using the chrome driver.
I do this in my testing framework so that each test has its own HAR for debugging purposes, even when the tests share the same browser.
The command you are looking for is "new_har". It starts a new session and begins logging a fresh HAR. You can also specify a name for the session. I normally get the old HAR first and save it before clearing and starting a new session, but you don't have to do that if all you want to do is clear the proxy log.
Here is an example using the Python module.
from browsermobproxy import Server
server = Server("path/to/browsermob-proxy")
server.start()
proxy = server.create_proxy()
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)
proxy.new_har("google") # Start first session
driver.get("http://www.google.co.uk")
proxy.har # returns a HAR JSON blob for first session
proxy.new_har("Yahoo") # Start second session
driver.get("http://www.yahoo.co.uk")
proxy.har # returns a HAR JSON blob for second session
server.stop()
driver.quit()
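If, as mentioned above, you want to keep the first session's data rather than just clearing it, a minimal sketch (the file names are just examples) is to dump each HAR before starting the next session:
import json

proxy.new_har("google")
driver.get("http://www.google.co.uk")
with open("google.har", "w") as f:
    json.dump(proxy.har, f)   # persist the first session before resetting

proxy.new_har("Yahoo")        # resets the proxy's log
driver.get("http://www.yahoo.co.uk")
with open("yahoo.har", "w") as f:
    json.dump(proxy.har, f)   # persist the second session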

Getting page source with Python + Selenium does not work, connection refused

I am trying to get the page source using Selenium.
My code looks like this:
#!/usr/bin/env python
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://python.org')
html_source = browser.page_source
print html_source
When I run the script, it opens the browser but nothing happens. If I wait without doing anything, it throws "Connection refused" after about 15 seconds.
If I enter the address manually and go to the website, nothing happens either.
Why doesn't it work? The script looks fine to me and it should work.
I'm doing this because I need to get the page source after the JS scripts are executed, and I suspect that can be done with Selenium.
Or maybe you know another way to get the page source after the JavaScript has run?
As per your question, you have invoked the get() method passing the argument https://python.org. Instead you should pass the argument https://www.python.org/, as follows:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://www.python.org/')
html_source = browser.page_source
print (html_source)
Note: Ensure that you are using the latest Selenium Python client (v3.8.0) and GeckoDriver binary (v0.19.1), along with the latest Firefox Quantum (v57.x) browser.

How to return a Selenium browser (or how to import a def that returns a Selenium browser)

I would like to start a Selenium browser with a particular setup (Privoxy, Tor, random user agent...) in a function and then call this function in my code. I have created a Python script mybrowser.py with this inside:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from fake_useragent import UserAgent
from stem import Signal
from stem.control import Controller
class MyBrowserClass:
    def start_browser():
        service_args = [
            '--proxy=127.0.0.1:8118',
            '--proxy-type= http',
        ]
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (UserAgent().random)
        browser = webdriver.PhantomJS(service_args = service_args, desired_capabilities=dcap)
        return browser

    def set_new_ip():
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password=password)
            controller.signal(Signal.NEWNYM)
Then I import it into another script myscraping.py with this inside:
import mybrowser
import time
browser= mybrowser.MyBrowserClass.start_browser()
browser.get("https://canihazip.com/s")
print(browser.page_source)
mybrowser.MyBrowserClass.set_new_ip()
time.sleep(12)
browser.get("https://canihazip.com/s")
print(browser.page_source)
The browser is working - I can access the page and retrieve it with .page_source.
But the IP doesn't change between the first and the second print. If I move the content of the function into myscraping.py (and remove the import + function call), then the IP changes.
Why? Is it a problem with returning the browser? How can I fix this?
Actually, the situation is a bit more complex. When I connect to https://check.torproject.org before and after the call to mybrowser.set_new_ip() and the 12-second wait (cf. the lines below), the IP given by that page changes between the first and the second call. So my IP is changed (according to Tor), but neither https://httpbin.org/ip nor icanhazip.com detects the change in IP.
...
browser.get("https://canihazip.com/s")
print(browser.page_source)
browser.get("https://check.torproject.org/")
print(browser.find_element_by_xpath('//div[@class="content"]').text)
mybrowser.set_new_ip()
time.sleep(12)
browser.get("https://check.torproject.org/")
print(browser.find_element_by_xpath('//div[@class="content"]').text)
browser.get("https://canihazip.com/s")
print(browser.page_source)
So the IPs that are printed look like this:
42.38.215.198 (canihazip before mybrowser.set_new_ip() )
42.38.215.198 (check.torproject before mybrowser.set_new_ip() )
106.184.130.30 (check.torproject after mybrowser.set_new_ip() )
42.38.215.198 (canihazip after mybrowser.set_new_ip())
Privoxy configuration: in C:\Program Files (x86)\Privoxy\config.txt, I have uncommented this line (9050 is the port Tor uses):
forward-socks5t / 127.0.0.1:9050
Tor configuration: in torrc, I have this:
ControlPort 9051
HashedControlPassword : xxxx
This is probably because of PhantomJS keeping a memory cache of requested content. So your first visit using a PhantomJS browser can have a dynamic result but that result is then cached and each consecutive visit uses that cached page.
This memory cache has caused issues like CSRF tokens not changing on refresh, and now I believe it is the root cause of your problem. The issue was raised and resolved in 2013, but the solution is a method, clearMemoryCache, found in PhantomJS's WebPage class. Sadly, we are dealing with a Selenium webdriver.PhantomJS instance.
So, unless I am overlooking something, it would be tough to access this method through Selenium's abstraction.
The only solution I see fit is to use another webdriver that doesn't have a memory cache like PhantomJS's. I have tested it using Chrome and it works perfectly:
103.***.**.***
72.***.***.***
Also, as a side note, Selenium is phasing out PhantomJS:
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
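For reference, a minimal sketch of a Chrome setup pointed at the same Privoxy port (127.0.0.1:8118) used above; treat this as an illustration, not the exact configuration that was tested:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://127.0.0.1:8118')  # Privoxy, forwarding to Tor
browser = webdriver.Chrome(chrome_options=options)
browser.get("https://canihazip.com/s")
print(browser.page_source)
browser.quit()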
You might need to check if a new nym is available or not.
is_newnym_available - true if tor would currently accept a NEWNYM signal
if controller.is_newnym_available():
    controller.signal(Signal.NEWNYM)
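A slightly fuller sketch of set_new_ip() along those lines (how the password is stored and passed in is assumed; adjust to your setup):
import time
from stem import Signal
from stem.control import Controller

def set_new_ip(password):
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        if not controller.is_newnym_available():
            # wait until Tor is willing to build new circuits again
            time.sleep(controller.get_newnym_wait())
        controller.signal(Signal.NEWNYM)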

How to capture network traffic using selenium webdriver and browsermob proxy on Python?

I would like to capture network traffic using Selenium WebDriver in Python. Therefore, I must use a proxy (like BrowserMob Proxy).
When I use webdriver.Chrome:
from browsermobproxy import Server
server = Server("~/browsermob-proxy")
server.start()
proxy = server.create_proxy()
from selenium import webdriver
co = webdriver.ChromeOptions()
co.add_argument('--proxy-server={host}:{port}'.format(host='localhost', port=proxy.port))
driver = webdriver.Chrome(executable_path = "~/chromedriver", chrome_options=co)
proxy.new_har()
driver.get(url)
proxy.har # returns a HAR
for ent in proxy.har['log']['entries']:
    print ent['request']['url']
the webpage is loaded properly and all requests are available and accessible in the HAR file.
But when I use webdriver.Firefox:
# The same as above
# ...
from selenium import webdriver
profile = webdriver.FirefoxProfile()
driver = webdriver.Firefox(firefox_profile=profile, proxy = proxy.selenium_proxy())
proxy.new_har()
driver.get(url)
proxy.har # returns a HAR
for ent in proxy.har['log']['entries']:
    print ent['request']['url']
The webpage cannot be loaded properly and the number of requests in the HAR file is smaller than it should be.
Do you have any idea what the problem with the proxy settings in the second snippet is? How should I fix it so that webdriver.Firefox works for my purpose?
Just stumbled across this project https://github.com/derekargueta/selenium-profiler. Spits out all network data for a URL. Shouldn't be hard to hack and integrate into whatever tests you're running.
Original source: https://www.openhub.net/p/selenium-profiler
For me, the following code works just fine.
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)
I am sharing my solution. It does not write any logs to a file, but you can collect all sorts of messages such as errors, warnings, logs, info, debug, CSS, XHR, as well as requests (traffic).
1. We are going to create a Firefox profile so that we can enable the "Persist Logs" option in Firefox (you can try enabling it on your default browser first and see whether it launches with "Persist Logs" without creating a Firefox profile).
2. We need to modify the Firefox initialization code, where this line does the magic: options.AddArgument("--jsconsole");
The complete Selenium Firefox code would then be as follows; it will open the Browser Console every time you run your automation:
else if (browser.Equals(Constant.Firefox))
{
var profileManager = new FirefoxProfileManager();
FirefoxProfile profile = profileManager.GetProfile("ConsoleLogs");
FirefoxDriverService service = FirefoxDriverService.CreateDefaultService(DrivePath);
service.FirefoxBinaryPath = DrivePath;
profile.SetPreference("security.sandbox.content.level", 5);
profile.SetPreference("dom.webnotifications.enabled", false);
profile.AcceptUntrustedCertificates = true;
FirefoxOptions options = new FirefoxOptions();
options.AddArgument("--jsconsole");
options.AcceptInsecureCertificates = true;
options.Profile = profile;
options.SetPreference("browser.popups.showPopupBlocker", false);
driver = new FirefoxDriver(service.FirefoxBinaryPath, options, TimeSpan.FromSeconds(100));
driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(10);
}
3. Now you can write your logic. Since you have the traffic/logging window open, don't move on to the next execution if a test fails; that way the Browser Console will keep your error messages and help you troubleshoot further.
Browser: Firefox v61
How can you launch the Browser Console for Firefox:
1. Open Firefox (and go to any URL)
2. Press Ctrl+Shift+J (or Cmd+Shift+J on a Mac)
Link: https://developer.mozilla.org/en-US/docs/Tools/Browser_Console
