I have a Selenium WebDriver script that opens a browser for me, points it to an IP address, does a bunch of stuff, and closes.
I want to know all of the URLs accessed during this time: any ads that are loaded, any CSS requests made to any URL, and so on.
Here is the code I'm using:
from selenium import webdriver
browser = webdriver.Firefox(profile) # Get local session of firefox
browser.get(url) # Open a url and wait for it to finish
I did it by loading the Firefox plugins Firebug and NetExport. The first lets you see every exchange of information; the second writes all of it to a file (.har extension). So Selenium has to load the plugins, open the website, and wait as long as you want; when the browser closes, you get a file with the result.
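Since a .har file is plain JSON, the URLs can be pulled out afterwards with the standard library. This is only a minimal sketch, assuming the usual HAR layout (log -> entries -> request -> url); the function name urls_from_har is a placeholder of mine, not part of NetExport or Selenium.

```python
import json


def urls_from_har(har_text):
    """Return the request URLs recorded in a HAR document, in order, without duplicates."""
    har = json.loads(har_text)
    seen = set()
    urls = []
    # Each entry describes one request/response pair captured by the browser.
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

To use it, read the exported file into a string (for example with open(path, encoding="utf-8").read()) and pass that string to urls_from_har.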
It's not a Python solution, but you can add the Fiddler plug-in to Firefox. We needed to do the exact same thing about a year ago. We used Selenium to open the browser and drive the UI, while Fiddler captured all traffic (HTTP and HTTPS) in the background. It also lists every JS and CSS source, and you can debug later with the inspector to see what request was sent and what response was received.
The thing is, I can log in to the site from only one device, which is my browser, so I can't use Selenium. I need something else that opens my own browser and copies the text from an element using Python.
I think that you may use one of two solutions:
Use Selenium to open a browser every time, and automate the login process.
If the site keeps you logged in when you visit it with your default browser (Stack Overflow, for example, does not ask you to log in every time you open it from your device), you can load that same browser profile in Selenium and it should do the same job: you are logged in to the website automatically.
references:
what's a browser profile?
how to load a chrome's browser profile with python3 and selenium?
Long story short all I am trying to do is scrape the contents of a certain page. Unfortunately, the specific info I need on that page is within an iFrame and I have tried several headless browser options, all yielding the same response which is the HTML displaying:
<iframe>Your browser does not support iframe</iframe>
In Python I have tried both Selenium (even with the --web-security=no and --disable-web-security flags) and PhantomJS (so I know it's not JavaScript related), and in NodeJS I've tried Puppeteer. None of them work.
Is there anything else out there I can try that may work?
Also, no, a direct GET request is useless: the page detects that it's not a real user and loads nothing at all, regardless of user-agent and so on, so I really need a browser solution, preferably a headless one.
I'm trying to scrape articles on PubChem, such as this one, for instance. PubChem requires browsers to have JavaScript enabled, or else it redirects to a page with virtually no content that says "This application requires Javascript. Please turn on Javascript in order to use this application". To get around this, I used the Chrome web driver from the Selenium library to obtain the HTML that PubChem generates with JavaScript.
It does that about half the time. The rest of the time it does not render the full HTML and redirects to the JavaScript warning page instead. How do I make the script retrieve the JS version of the site consistently?
I've also tried to work around this with PhantomJS, but PhantomJS does not work on my machine after installation.
from bs4 import BeautifulSoup
from selenium import webdriver
# Start Chrome through the local chromedriver binary
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
url = "https://pubchem.ncbi.nlm.nih.gov/compound/"
browser.get(url)
# Read the DOM after JavaScript has run, rather than the raw page source
innerHTML = browser.execute_script("return document.body.innerHTML")
# The "html5lib" parser requires the html5lib package to be installed
soup = BeautifulSoup(innerHTML, "html5lib")
There are no error messages whatsoever. The only issue is that sometimes the web scraper cannot obtain the JS-rendered webpage as expected. Thank you so much!
Answering my own question because why not.
You need to quit your browser by
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
# stuff
browser.quit()
Do so right after the last operation that involves the browser; otherwise you risk having the browser cache affect the output the next time the script runs.
Hope that whoever has this issue finds this helpful!
UPDATE EDIT:
So closing the browser does increase the frequency of success, but doesn't make it consistent. Another thing that was helpful in making it work more frequently was running
sudo purge
in the terminal. I'm still not getting consistent results, however. If anyone has an idea of how to do it without using brute force (i.e. opening and closing the WebDriver until it renders the proper page), please let me know! Many thanks
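Until someone suggests a cleaner fix, the brute-force fallback can at least be kept in one place. The following is only a sketch under my own assumptions: fetch_rendered_html, make_browser, attempts and delay are names I made up, and the check simply looks for the warning text quoted in the question. It relies on nothing beyond the get, execute_script and quit calls already used above.

```python
import time

# Text shown on the fallback page, as quoted in the question.
JS_WARNING = "This application requires Javascript"


def fetch_rendered_html(make_browser, url, attempts=5, delay=2.0):
    """Open a fresh browser per attempt until the page is not the JavaScript warning."""
    for attempt in range(attempts):
        browser = make_browser()
        try:
            browser.get(url)
            html = browser.execute_script("return document.body.innerHTML")
        finally:
            # Always quit, so a stale session cannot affect the next attempt.
            browser.quit()
        if JS_WARNING not in html:
            return html
        if attempt < attempts - 1:
            time.sleep(delay)
    raise RuntimeError(f"no rendered page after {attempts} attempts")
```

It could be called as fetch_rendered_html(lambda: webdriver.Chrome('/Users/user/Documents/chromedriver'), url), and the returned string passed to BeautifulSoup as before.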
I am struggling to find a method in Python that lets you read data from a web browser that is currently in use. Effectively, I am trying to download a massive table of data from a locally controlled company webpage and load it into a pandas DataFrame. The issue is that the website has a fairly complex authentication token process, which I have not been able to get past with Selenium (using a slew of webdrivers), Requests, urllib, or cookielib, with a variety of user parameters. I have given up on this front entirely, as I am almost positive that there is more to the authentication process than can be achieved easily with these libraries.
However, I did manage to bypass the required tokenization process when I quickly tested opening a new tab, using the webbrowser module, in a browser that was already logged in. Unfortunately, webbrowser does not offer a read function, so even though the page can be opened, the data on it cannot be read into a pandas DataFrame. This got me thinking I could use win32com to open a browser, log in, and then run the rest of the script, but again, the dispatch object for Internet Explorer has no general read ability, so I can't send the information I want to pandas. I'm stumped. Any ideas?
I could acquire the necessary authentication token scripts, but I am sure that it would take a week or two before anything happened on that front. I would obviously prefer to get something working in the meantime while I wait for the actual auth scripts from the company.
Update: I received authentication tokens from the company, but using them requires a Python package on another server that I do not have access to, mostly because it is an oddity that I am using Python in my department. Thus the above still applies: I need a method for reading and manipulating an open browser.
Step-by-step
1) Start the browser with Selenium.
2) The script starts waiting for a certain element that tells it you have reached the required page and are logged in.
3) You use this new browser window to log in to the page manually.
4) The script detects that you are on the required page and logged in.
5) The script processes the page the way you like.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# start webdriver (opens Chrome in new window)
chrome = webdriver.Chrome()
# initialize waiter with a maximum wait of 300 seconds
waiter = WebDriverWait(chrome, 300)
# Wait for the #logout element to appear.
# I assume it shows that you are logged in.
waiter.until(EC.presence_of_element_located((By.ID, "logout")))
# Extract data etc.
It might be easier to use your own Chrome user profile. That way your previous session may still be active, so you will not need to log in at all.
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=FULL_PATH__TO_PROFILE")
# current Selenium releases take options=; chrome_options= is the older spelling
chrome = webdriver.Chrome(options=options)
chrome.get("https://your_page_here")
Does WebDriver (Firefox) have the ability to disable requests by MIME type?
I have one HTML page that loads CSS and JS files plus some images.
I need to make one browser request to this page without loading any text/css content. After that I need to make another request to the page with text/css content but without application/javascript content. Finally, one more page load, this time without image/png only. Is it possible? If not, maybe some Firefox extension can help?
I tried a solution using a small proxy in Python (filtering requests by content type), but that approach has many problems.
Your basic options are:
Configure your target browser to disable CSS, JavaScript or images. In Firefox, for example, this can be done by tweaking about:config, see:
Do not want images to load and CSS to render on Firefox in Selenium WebDriver tests with Python
Install a browser add-on and configure its preferences. For Firefox there are Web Developer and QuickJava, and I'm sure there are others.
Use a proxy like polipo, squid or privoxy; there are many (this is your current approach), see: Webdriver and proxy server for firefox
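For the proxy route, the filtering decision itself is small, whichever proxy ends up calling it. Here is a rough sketch of that decision only, not a working proxy; mime_type, should_block and PASSES are hypothetical names, and the three sets mirror the three page loads described in the question.

```python
def mime_type(content_type_header):
    """Extract the bare MIME type from a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    return content_type_header.split(";", 1)[0].strip().lower()


def should_block(content_type_header, blocked_types):
    """Return True when the response's MIME type is in the blocked set."""
    return mime_type(content_type_header) in blocked_types


# One blocked set per page load, in the order the question describes.
PASSES = [
    {"text/css"},
    {"application/javascript"},
    {"image/png"},
]
```

Some servers label scripts as text/javascript instead of application/javascript, so the second set may need both entries, depending on what the page actually sends.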