I am scraping some websites that seem to have pretty good protection against it. The only way I can get it to work is to use Selenium to load the page and then scrape stuff from that.
Currently this works on my local computer (a Firefox window opens and closes when I access my page, and its HTML is processed further in my script). However, I need my scraper to be accessible on the web. The scraper is embedded within a Flask app on Heroku. Is there a way to make the Selenium browser work on Heroku servers? Or are there any hosting providers where it can work?
Heroku, wonderful as it is, has a major limitation in that one cannot use custom software or in many cases, libraries. In providing an easy to use, centrally-controlled, managed stack, Heroku strips their servers down to prevent other usage.
What this boils down to is there is no Xorg on a Heroku dyno. Lack of Xorg and lack of ability to install custom software means no xvfb either, and no ability to run the browser that selenium expects to exist. Further, the browser is not generally available.
You'll have better luck with a cloud offering like AWS, where you can install custom software, including firefox, xvfb (to keep from needing all the Xorg overhead), and of course the rest of your scraping stack. This answer explains how to do it properly.
There are buildpacks that make Selenium work on Heroku.
Add the following buildpacks:
1) heroku buildpacks:add https://github.com/kevinsawicki/heroku-buildpack-xvfb-google-chrome/
2) heroku buildpacks:add https://github.com/heroku/heroku-buildpack-chromedriver
Then set the Heroku stack to cedar-14 as shown below, because the xvfb buildpack works only with cedar-14:
heroku stack:set cedar-14 -a stocksdata
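Then point Selenium at the Google Chrome binary as shown below:
from selenium import webdriver
from selenium.webdriver import ChromeOptions
options = ChromeOptions()
options.binary_location = "/app/.apt/usr/bin/google-chrome-stable"
driver = webdriver.Chrome(options=options)  # older Selenium releases spelled this keyword chrome_options=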
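A minimal sketch of the same setup that lets the binary path be overridden, so one script can run both locally and on the dyno. The environment variable name `GOOGLE_CHROME_BIN` is an assumption (check what your buildpack actually exports), and `resolve_chrome_binary` and `make_driver` are hypothetical helper names, not part of Selenium.

```python
import os

# Path taken from the answer above; used when no override is present.
DEFAULT_CHROME_BINARY = "/app/.apt/usr/bin/google-chrome-stable"


def resolve_chrome_binary(env=None):
    """Return the Chrome binary path, preferring an environment override."""
    env = os.environ if env is None else env
    # GOOGLE_CHROME_BIN is an assumed variable name, not confirmed by the answer.
    return env.get("GOOGLE_CHROME_BIN") or DEFAULT_CHROME_BINARY


def make_driver():
    """Start Chrome through Selenium using the resolved binary."""
    # Imported lazily so the path logic works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.binary_location = resolve_chrome_binary()
    return webdriver.Chrome(options=options)
```

Calling `make_driver()` on the dyno should then behave like the three lines above, while a local run can point the variable at a different browser install.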
You may know that Heroku will discontinue its free dynos, free Postgres, and other free plans from November, so I have been looking for an alternative to run my Python web apps. I have almost 10 web apps that I use regularly, such as a URL shortener, a keyword research tool, a Google Drive direct link generator, and more. All of them are hosted on Heroku, but I'm moving to Vercel now. I have set up all of the projects on Vercel except the last one, which is complicated. It is a Python Selenium bot that powers my keyword research web app. On Heroku I used buildpacks, e.g. Headless Chrome (https://github.com/heroku/heroku-buildpack-google-chrome) and Chromedriver (https://github.com/heroku/heroku-buildpack-chromedriver), to make this project run properly. The problem is that I could not find anything like a buildpack on Vercel to add Chrome and Chromedriver.
Does anyone know about this?
Edit:
That was a bit of a story, and many people didn't understand what I was asking.
So, my project uses Selenium (Python). Selenium needs the Google Chrome browser installed and a chromedriver to run. An alternative to installing Chrome system-wide is to set the Chrome binary location in webdriver.ChromeOptions(). I want to host this Selenium project on vercel.com, which is Linux based.
So my question is: how can I install the Chrome browser and ChromeDriver on Vercel?
I have created a website that scrapes multiple hockey websites for game scores. It runs perfectly on my local server and I am now trying to deploy it. I have tried pythonanywhere.com, but Selenium does not seem to work there. For anyone who has deployed a website that uses Selenium/WebDriver: what is the easiest or best platform for deploying a site like this? (It does not have to be free like PythonAnywhere, as long as it is not too expensive, lol!) Thanks in advance.
Selenium does work on PythonAnywhere. If you use a free account, you will have restricted internet access, though. It is also recommended to scrape outside of the web app, since scraping inside a view would slow the views down. Use a scheduled or always-on task for that instead. You can also refer to these PythonAnywhere help pages:
Using Selenium
Async work in web apps
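One way to keep scraping out of the views, sketched under assumptions: the scheduled task writes its results to a small JSON cache and the view only reads that file. The file name `scores_cache.json` and the helper names `save_scores` and `load_scores` are hypothetical, and the scraping step itself is left out.

```python
import json
import time
from pathlib import Path

CACHE_FILE = Path("scores_cache.json")  # hypothetical cache location


def save_scores(scores, path=CACHE_FILE):
    """Called by the scheduled task after it finishes scraping."""
    payload = {"fetched_at": time.time(), "scores": scores}
    # Write to a temporary file first, then rename it into place,
    # so a view never reads a half-written cache.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload))
    tmp.replace(path)


def load_scores(path=CACHE_FILE):
    """Called by the web view; returns an empty list if nothing is cached yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text())["scores"]
```

The view then returns whatever `load_scores()` gives it, so a slow or failing Selenium run only delays the next cache refresh rather than the page load.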
You can use a Linux server on AWS, GCP, or DigitalOcean. In that case, first install Chrome on the server, then put the matching version of ChromeDriver in your project directory. Check the installed Chrome version first, and download the ChromeDriver release that corresponds to it.
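The version check can be sketched as a comparison of major version numbers. This assumes both binaries print a dotted version when run with `--version` (for example `Google Chrome 114.0.5735.90`), and the command names `google-chrome` and `chromedriver` are assumptions that may differ on your server. The helper names are hypothetical.

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def check_installed(chrome_cmd="google-chrome", driver_cmd="chromedriver"):
    """Run both binaries with --version and compare the results.

    The default command names are assumptions; pass full paths if needed.
    """
    chrome = subprocess.run(
        [chrome_cmd, "--version"], capture_output=True, text=True, check=True
    ).stdout
    driver = subprocess.run(
        [driver_cmd, "--version"], capture_output=True, text=True, check=True
    ).stdout
    return versions_match(chrome, driver)
```

Running `check_installed()` once at deploy time gives an early failure instead of a confusing session error the first time the scraper starts.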
I'm developing a chatbot for a school project, which will utilise a web service in the backend, intending to deploy it onto a third party cloud server host such as Heroku.
The web service will be doing periodic web scraping in real time. I was developing with BeautifulSoup until I discovered dynamically loaded content in the pages I need to scrape, so I had to switch to Selenium.
The problem is that Selenium requires a browser, but the cloud server doesn't have a GUI and probably doesn't allow installation of applications too.
So one solution I thought of is to use Chromium, a portable version of Chrome which doesn't need installation, in headless mode, which doesn't need a GUI.
How to connect to Chromium Headless using Selenium
Can Selenium Webdriver open browser windows silently in background?
I'm still a long way from figuring out how to deploy onto a cloud hosting server, let alone test my idea, so I thought to just seek professional input in advance. Will my web service be permitted by host servers to run in this manner?
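For reference, the headless idea described in this question might look roughly like the sketch below. Whether a given host allows launching the browser process is a separate matter that the sketch does not settle. The flag list is a common starting point rather than a requirement, and `build_headless_options` and `fetch_rendered_html` are hypothetical helper names.

```python
HEADLESS_FLAGS = [
    "--headless",               # run without opening a window
    "--no-sandbox",             # often needed in containers; weakens isolation
    "--disable-dev-shm-usage",  # work around a small /dev/shm in containers
]


def build_headless_options(options, binary_location=None):
    """Add the headless flags to a Selenium ChromeOptions-like object."""
    for flag in HEADLESS_FLAGS:
        options.add_argument(flag)
    if binary_location:
        # Point at an unpacked Chromium instead of a system-wide install.
        options.binary_location = binary_location
    return options


def fetch_rendered_html(url, binary_location=None):
    """Load a page in headless Chrome and return the rendered HTML."""
    # Imported lazily so the option logic works without Selenium installed.
    from selenium import webdriver

    options = build_headless_options(webdriver.ChromeOptions(), binary_location)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned HTML can be passed to BeautifulSoup as before, so only the fetching step changes.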
I'm trying to use ChromeDriver for web testing with Selenium on Heroku.
But I've found that Heroku doesn't support ChromeDriver.
I've searched Google many times for how to use ChromeDriver on Heroku.
I have added buildpacks such as https://github.com/jimmynguyc/heroku-buildpack-chromedriver.git, https://github.com/tstachl/heroku-buildpack-selenium and so on, but I could not get ChromeDriver to work at all.
I would like to know how to handle this.
Did you already install Chrome? The Heroku CI site has a page that points out you'll need the buildpack "https://github.com/heroku/heroku-buildpack-google-chrome"
I have a Python program that works with Selenium and PhantomJS, and I’d like to distribute it. The functionality is quite simple; it goes onto a website, fills certain forms and returns the outcome, without any visible browser action.
The problem is that I can’t expect an arbitrary user to have PhantomJS installed on their computers. How should I approach the distribution process?
I already checked Setuptools and PythonAnywhere, but I don’t think they work for what I want.
Edit: This may be too hopeful, but I'd like to be able to distribute it for Windows, OSX and Ubuntu.
The way I do it is through a web application built on Flask (one of many great python web frameworks) and hosted on PythonAnywhere.
To use PhantomJS and Selenium in PythonAnywhere you have to ask for Docker Consoles. Instructions here: https://www.pythonanywhere.com/forums/topic/1320/