I want to scrape a website whose content is generated dynamically by JavaScript. The scraping runs in a Microsoft Azure Notebook so that I can continue processing the results with Python and Jupyter.
Therefore, a headless browser is needed to render the content during scraping. I'm considering PhantomJS or CasperJS, but both require installation with root permission, which I don't have.
What other options can I use in Azure Notebooks for dynamically generated content?
I want to build a simple web scraper using Python and Selenium and deploy it on Heroku. I've already completed the deployment with the chromedriver and chrome buildpacks, and everything works fine. But I still need to implement one thing: I want to use my local Chrome profile so that I don't have to sign in to Google. This works fine locally by using
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument(r"user-data-dir=C:\Users\user\AppData\Local\Google\Chrome\User Data")
To access my Chrome profile on Heroku, I uploaded the whole folder into the same directory as the code. After deployment I can access the folder under /app/User Data and can see all the files. However, if I pass
options.add_argument(r"user-data-dir=/app/User Data")
to the driver, it doesn't load the profile and the login fails. I tried to get more information by printing the page source of chrome://version, but that's just an empty page.
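For reference, the check I ran was roughly this minimal sketch (the options mirror the snippet above):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument(r"user-data-dir=/app/User Data")
driver = webdriver.Chrome(options=options)

# chrome://version reports which profile path Chrome actually loaded
driver.get("chrome://version")
print(driver.page_source)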
Do you have any suggestions as to what I can try instead to get it working? Thank you!
If you just need to log in, you can use https://pypi.org/project/selenium-stealth/. It drives Selenium as usual but avoids being detected as automation when signing in.
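A minimal sketch of its usage, based on the selenium-stealth README (the vendor/renderer values are the README's example values, not requirements):

from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)

# patch the driver so common automation fingerprints are masked
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

driver.get("https://accounts.google.com")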
I'm developing a chatbot for a school project that will use a web service on the backend, and I intend to deploy it to a third-party cloud host such as Heroku.
The web service will do periodic web scraping in real time. I was developing with BeautifulSoup until I discovered dynamically loaded content in the pages I need to scrape, so I had to switch to Selenium.
The problem is that Selenium requires a browser, but the cloud server has no GUI and probably doesn't allow installing applications either.
So one solution I thought of is to run Chromium, a portable build of Chrome that needs no installation, in headless mode, which needs no GUI (a sketch follows the links below). These questions pointed me in that direction:
How to connect to Chromium Headless using Selenium
Can Selenium Webdriver open browser windows silently in background?
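As a minimal sketch of what I have in mind, assuming Selenium and a chromedriver are available (the Chromium binary path is a placeholder, not a real location):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.binary_location = "/path/to/chromium"  # placeholder: portable Chromium binary
options.add_argument("--headless")             # render pages without a GUI
options.add_argument("--no-sandbox")           # often required on containerized hosts
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()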
I'm still a long way from figuring out how to deploy to a cloud host, let alone test this idea, so I thought I'd seek professional input in advance. Will host servers permit my web service to run in this manner?
I want to make a Python CLI for Wappalyzer (https://www.wappalyzer.com), but it is a browser plugin. The plugin identifies the programs and frameworks running on a web page, and I want to be able to get that information from a Python script. While they do have a paid API, I was wondering whether it is possible to use Selenium and ChromeDriver to visit the page with the Chrome extension loaded, and then retrieve the data generated by Wappalyzer.
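Loading the packed extension into a Selenium-driven Chrome is the half I can sketch, assuming the Wappalyzer .crx has been downloaded (the path is hypothetical); reading the results back out of the extension is the part I'm unsure about:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_extension("/path/to/wappalyzer.crx")  # hypothetical path to the packed extension
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
# the extension now runs against the page; extracting its output is the open question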
I have deployed WordPress on my localhost and I am using Selenium WebDriver to navigate automatically to each and every link. I need to save each dynamically loaded HTML page of that WordPress site using a Python script. Please help me. I am using Ubuntu 14.04.
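What I have in mind for the saving step is a minimal sketch like this, assuming the driver is already on the target page (per-URL file naming is omitted):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://localhost/wordpress/")
# page_source returns the rendered DOM, including dynamically loaded content
with open("page.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)
driver.quit()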
If you're just trying to get a plain HTML copy of your WordPress site, you usually won't need to go at it via a full browser, as very few WordPress sites are AJAX-heavy.
Give Webhttrack (or plain httrack) a try.
Install Webhttrack on your machine and run it via the menu (it's usually found under Internet / Network) or from a terminal (by simply running "webhttrack").
It will start a local web server and point your browser to a web interface which you can use to set up your download project. Once you run it, you will find a plain copy of the WordPress site in ~/websites/.
I am trying to scrape things from a website. Whenever I access it using curl or python-requests, it keeps telling me that I need to install a plugin to continue. But when I use a web browser that has the plugin installed, everything works fine.
I want to understand: how does a website identify whether or not a web browser has the plugin installed?
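My guess, which I'd like confirmed, is that the check runs client-side: the page ships JavaScript that inspects the browser (e.g. navigator.plugins) and only clears the "install a plugin" message when the plugin is found. curl and python-requests fetch the raw HTML but never execute that script. A sketch of the difference, with example.com as a placeholder:

import requests
from selenium import webdriver

URL = "https://example.com"  # placeholder for the site in question

# plain HTTP client: fetches raw HTML, the page's plugin-check JS never runs
print(requests.get(URL).text[:200])

# real browser: executes the page's JavaScript, so plugin checks can run
driver = webdriver.Chrome()
driver.get(URL)
print(driver.execute_script("return navigator.plugins.length"))
driver.quit()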