Headless Browser in Azure Notebooks - Python

I want to scrape a website whose content is generated dynamically by JavaScript. The scraping runs on Microsoft Azure Notebooks so that I can continue processing the results with Python and Jupyter.
A headless browser is therefore needed to render the content during scraping. I'm considering PhantomJS or CasperJS, but both require installation with root permission, which I don't have.
What other options can I use in Azure Notebooks for dynamically generated content?
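One option that avoids root entirely is pyppeteer, a Python port of Puppeteer that pip-installs into the user environment and fetches its own Chromium build on first run. A minimal sketch, assuming pip installs are allowed in the notebook (the URL is a placeholder):

import asyncio
from pyppeteer import launch

async def fetch_rendered_html(url):
    # the first call downloads a private Chromium build into a
    # user-writable cache directory, so no root permission is needed
    browser = await launch(headless=True, args=["--no-sandbox"])
    page = await browser.newPage()
    await page.goto(url, waitUntil="networkidle0")  # wait for JS-driven content
    html = await page.content()
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(
    fetch_rendered_html("https://example.com"))  # placeholder URL
print(html[:200])

In a notebook cell you can usually await fetch_rendered_html(...) directly instead of driving the event loop by hand.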

Related

Load local Chrome user profile on Heroku to use it with Selenium

I want to build a simple web scraper using Python and Selenium and deploy it on Heroku. I've already completed the deployment with the chromedriver and chrome buildpacks, and everything works fine. But I still need to implement one thing: I want to use my local Chrome profile so that I don't have to sign in to Google. This works fine locally using
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument(r"user-data-dir=C:\Users\user\AppData\Local\Google\Chrome\User Data")
To access my Chrome profile on Heroku, I uploaded the whole folder into the same directory as the code. After deployment I can access the folder under /app/User Data and can see all the files. However, if I pass
options.add_argument(r"user-data-dir=/app/User Data")
to the driver, it doesn't load the profile and the login process fails. I tried to get more information by printing the page source of "chrome://version", but that's just an empty page.
Do you have any suggestions for what I can try instead to get this working? Thank you!
If you just need to log in, you can use selenium-stealth (https://pypi.org/project/selenium-stealth/). It is still Selenium, but it is much less likely to be detected as automation when signing in.
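A minimal sketch of how selenium-stealth is typically wired up, with the arguments taken from its PyPI example (the login URL is a placeholder):

from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = webdriver.Chrome(options=options)

# patch the session before navigating so automation fingerprints
# (navigator.webdriver and friends) are masked
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

driver.get("https://accounts.google.com")  # placeholder: your login flow here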

Can a backend web service use Selenium with headless Chromium?

I'm developing a chatbot for a school project that will use a web service in the backend, intending to deploy it onto a third-party cloud host such as Heroku.
The web service will do periodic web scraping. I was developing with BeautifulSoup until I discovered dynamically loaded content in the pages I need to scrape, so I have to switch to Selenium.
The problem is that Selenium requires a browser, but the cloud server doesn't have a GUI and probably doesn't allow installation of applications either.
So one solution I thought of is to use Chromium, the open-source browser Chrome is based on, which can run from a portable build without installation, in headless mode so that no GUI is needed.
How to connect to Chromium Headless using Selenium
Can Selenium Webdriver open browser windows silently in background?
I'm still a long way from figuring out how to deploy onto a cloud host, let alone test my idea, so I thought I'd seek professional input in advance. Will host servers permit my web service to run in this manner?
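For what it's worth, this pattern is commonly run on Heroku via the chromedriver and chrome buildpacks mentioned in the Heroku question above. A minimal sketch, assuming the buildpacks put a Chrome/Chromium binary and a matching chromedriver on the PATH (the target URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")           # run without any GUI
options.add_argument("--no-sandbox")             # often required in containers
options.add_argument("--disable-dev-shm-usage")  # dynos have a small /dev/shm

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder scraping target
print(driver.title)
driver.quit()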

How to use Selenium with the Wappalyzer browser plugin

I want to make a Python CLI for Wappalyzer (https://www.wappalyzer.com), but it is a browser plugin. The plugin identifies programs/frameworks running on a webpage, and I want to be able to get that information from a Python script. While they do have a paid API, I was wondering if it is possible to use Selenium and ChromeDriver to visit a page with the Chrome extension loaded, and then retrieve the data generated by Wappalyzer.
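Loading an extension into a ChromeDriver session is supported through ChromeOptions, though reading Wappalyzer's results back out is the harder part, since the extension shows them in its own popup page. A minimal sketch of the loading step, assuming a locally downloaded .crx file (the file name and URL are placeholders):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_extension("wappalyzer.crx")  # assumed path to a downloaded .crx

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder page to fingerprint

Note that classic headless Chrome does not load extensions, so this needs a headed (or virtual-display) session; alternatively, the unofficial python-Wappalyzer package reimplements the fingerprinting in pure Python and avoids the browser entirely.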

I need to save WordPress dynamically loaded HTML pages using Python code

I have deployed WordPress on my localhost and I am using Selenium WebDriver to automatically navigate to each and every link. I need to save each dynamically loaded HTML page of that WordPress site using a Python script. Please help me. I am using Ubuntu 14.04.
If you're just trying to get a plain HTML version of your WordPress site, you usually won't need to go at it with a full browser, as very few WordPress sites are Ajax-heavy.
Give WebHTTrack (or plain httrack) a try.
Install webhttrack on your machine and run it via the menu (it's usually found under Internet / Network) or the terminal (by simply running "webhttrack").
It will start a local web server and point your browser to a web interface you can use to set up your download project. Once you run it, you will find a plain copy of the WordPress site in ~/websites/.
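If the pages really are JavaScript-heavy and the Selenium route from the question is needed after all, here is a minimal sketch that collects links from the start page and saves each rendered page (the local URL and the Firefox driver are assumptions):

from urllib.parse import urlparse
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any WebDriver will do
driver.get("http://localhost/wordpress/")  # assumed local site URL
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]

for url in filter(None, links):
    driver.get(url)
    # page_source returns the DOM after JavaScript has run, unlike raw HTTP
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    with open(name + ".html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)

driver.quit()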

How do websites identify that my web browser doesn't have a plugin installed?

I am trying to scrape things from a website. Whenever I access the website using curl or python-requests, it keeps telling me that I need to install a plugin to continue. But when I use a web browser which has the plugin installed, everything works fine.
I want to understand: how does a website identify whether a web browser has the plugin installed or not?
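In general the check runs client-side: a script on the page inspects the browser environment (for example navigator.plugins, or an object the plugin itself injects) and only then reveals the content, and curl and python-requests never execute that script. They also announce themselves in the User-Agent header. A minimal sketch showing the header side of this (the URL is a placeholder):

import requests

url = "https://example.com"  # placeholder

# A plain request identifies itself as "python-requests/<version>"
r = requests.get(url)
print(r.request.headers["User-Agent"])

# Spoofing a browser User-Agent only helps when the check is header-based;
# a JavaScript plugin check still requires a real (headless) browser
browser_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0.0.0 Safari/537.36")
r = requests.get(url, headers={"User-Agent": browser_ua})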
