I am trying to scrape things out of a website. Whenever I access the website using curl or python-requests, it keeps telling me that I need to install a plugin to continue. But when I use a web browser which has the plugin installed, everything works fine.
I want to understand how a website identifies whether a web browser has the plugin installed or not.
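For reference, this is roughly the kind of request I'm making (the URL is a placeholder for the real site, and the headers are copied from a normal browser session); even with browser-like headers I still get the "install a plugin" page:

```python
import requests

# Placeholder URL -- the real site is omitted here.
url = "https://example.com/protected-page"

# Headers copied from a regular browser; the default
# python-requests User-Agent is easy for a site to spot.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

resp = requests.get(url, headers=headers)
print(resp.status_code)
print(resp.text[:300])  # still the "install a plugin" message
```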
You may know that Heroku will stop their free dynos, free Postgres, etc. from November. So I have been looking for an alternative to run my Python web apps. I have almost 10 web apps that I use regularly, like a URL shortener, a keyword research tool, a Google Drive direct link generator site, and more. All of these are hosted on Heroku, but I'm moving to Vercel now. I have set up all the projects on Vercel, but the last one is complicated. My last project is a Python Selenium bot: my keyword research web app. I used some buildpacks, e.g. Headless Chrome (https://github.com/heroku/heroku-buildpack-google-chrome) and Chromedriver (https://github.com/heroku/heroku-buildpack-chromedriver), to make this project run properly. The problem is that I could not find anything like buildpacks in Vercel to add Chrome and ChromeDriver.
Anyone know about this?
Edit:
That was a bit of a story, and many people didn't understand what I was asking.
So: my project uses Selenium (Python). Selenium needs the Google Chrome browser and a ChromeDriver installed in order to run. An alternative to installing Chrome system-wide is to set the Chrome binary location in webdriver.ChromeOptions(). I want to host this Selenium project on vercel.com, which is Linux based.
So my question is: how can I install the Chrome browser and ChromeDriver on Vercel?
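For context, this is roughly how I wire up the binary location, per the webdriver.ChromeOptions() note above. The paths below are placeholders; the Vercel equivalents are exactly what I can't figure out:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Placeholder paths -- on Heroku the buildpacks provide the binaries;
# where (or whether) they can live on Vercel is the open question.
CHROME_BINARY = "/path/to/google-chrome"
CHROMEDRIVER = "/path/to/chromedriver"

options = Options()
options.binary_location = CHROME_BINARY   # point Selenium at a non-default Chrome
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=Service(CHROMEDRIVER), options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```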
I have created a website that scrapes multiple hockey websites for game scores. It runs perfectly on my local server and I am now in the process of trying to deploy it. I have tried using pythonanywhere.com, but Selenium does not seem to be working on it. For anyone who has deployed a website that uses Selenium/WebDriver: what is the easiest/best platform to deploy a website like this? (It does not have to be free like PythonAnywhere, as long as it is not too expensive, lol!) Thanks in advance.
Selenium does work on PythonAnywhere. If you use a free account, you'd have restricted internet access, though. Also, it's recommended to scrape outside of the web app, since scraping would slow the views down; use a Scheduled or Always-on task for that instead. You can also refer to these PythonAnywhere help pages:
Using Selenium
Async work in web apps
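As a rough illustration (not a PythonAnywhere-specific guarantee, just the usual headless setup along the lines of those help pages):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")    # no display available on the server
options.add_argument("--no-sandbox")  # often required in shared hosting
options.add_argument("--disable-gpu")

browser = webdriver.Chrome(options=options)
try:
    browser.get("https://example.com")
    print(browser.title)
finally:
    browser.quit()  # always close the browser so processes don't pile up
```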
You can use AWS, GCP, or DigitalOcean Linux servers. In that case, you first have to install Chrome on Linux and then put the matching version of ChromeDriver in your project directory. Make sure to check the installed Chrome version first so that the ChromeDriver you put on the machine matches it.
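For instance, with Chrome installed system-wide and a matching chromedriver dropped into the project directory, the driver path can be passed explicitly (a sketch; the paths and URL are illustrative):

```python
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# chromedriver shipped in the project directory; its major version
# must match the installed Chrome (check with `google-chrome --version`).
driver_path = Path(__file__).resolve().parent / "chromedriver"

options = Options()
options.add_argument("--headless")  # typical for a server with no display

driver = webdriver.Chrome(service=Service(str(driver_path)), options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```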
I want to make a Python CLI for Wappalyzer (https://www.wappalyzer.com), but it is a browser plugin. The plugin identifies programs/frameworks running on a webpage, and I want to be able to get that information from a Python script. While they do have a paid API, I was wondering whether it is possible to use Selenium and ChromeDriver to visit the page with the Chrome extension loaded, and then retrieve the data generated by Wappalyzer.
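Loading the extension itself seems straightforward (a sketch; the .crx path is a placeholder); what I can't figure out is how to read back whatever Wappalyzer detects:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Placeholder path to a packed copy of the Wappalyzer extension.
options.add_extension("/path/to/wappalyzer.crx")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

# The extension now runs against the page, but its results live in the
# extension's own UI/storage -- how do I get at them from this script?

driver.quit()
```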
I want to scrape a website whose content is generated dynamically by JavaScript. The scraping runs on Microsoft Azure Notebooks so that I can continue processing the results in Python and Jupyter.
Therefore, a headless browser is needed to render the content during scraping. I'm thinking of either PhantomJS or CasperJS, but they require installation with root permission, and I cannot install them.
What other options can I use in Azure Notebooks for dynamically generated content?
I have deployed WordPress on my localhost and I am using Selenium WebDriver to automatically navigate to each and every link. I need to save each dynamically loaded HTML page of that WordPress site using a Python script. Please help me. I am using Ubuntu 14.04.
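What I have so far is roughly this sketch (the base URL and output directory are placeholders):

```python
import os

from selenium import webdriver
from selenium.webdriver.common.by import By

BASE_URL = "http://localhost/wordpress"  # placeholder for my local site
OUT_DIR = "saved_pages"
os.makedirs(OUT_DIR, exist_ok=True)

driver = webdriver.Firefox()
driver.get(BASE_URL)

# Grab the href strings first, so navigating away doesn't invalidate
# the element references.
links = sorted({
    a.get_attribute("href")
    for a in driver.find_elements(By.TAG_NAME, "a")
    if a.get_attribute("href") and a.get_attribute("href").startswith(BASE_URL)
})

for i, url in enumerate(links):
    driver.get(url)
    with open(os.path.join(OUT_DIR, f"page_{i}.html"), "w", encoding="utf-8") as f:
        f.write(driver.page_source)  # the rendered (post-JavaScript) HTML

driver.quit()
```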
If you're just trying to get a plain HTML version of your WordPress site, you usually won't need to go at it via a full browser, as very few WordPress sites are AJAX-heavy.
Give WebHTTrack (or plain HTTrack) a try.
Install webhttrack on your machine and run it via the menu (it's usually found under Internet / Network) or from a terminal (by simply running "webhttrack").
It will start a local webserver and point your browser to a web interface which you can use to set up your download project. Once you run it, you will find a plain copy of the WordPress site in ~/websites/.