I'm developing a chatbot for a school project, which will use a web service in the backend, and I intend to deploy it onto a third-party cloud host such as Heroku.
The web service will do periodic web scraping in real time. I was developing with BeautifulSoup until I discovered dynamically loaded content in the pages I need to scrape, so I have to switch to Selenium.
The problem is that Selenium requires a browser, but the cloud server doesn't have a GUI and probably doesn't allow installing applications either.
So one solution I thought of is to use Chromium (a portable build of Chrome that doesn't need installation) in headless mode, which doesn't need a GUI.
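To make the idea concrete, this is roughly what I have in mind (just a sketch; the Chromium path is a placeholder for wherever the portable build would sit on the server, and it assumes a matching chromedriver is on the PATH):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.binary_location = "/path/to/portable/chromium"  # placeholder for the unpacked Chromium binary
options.add_argument("--headless")               # no GUI needed
options.add_argument("--no-sandbox")             # often needed on locked-down servers
options.add_argument("--disable-dev-shm-usage")  # avoids /dev/shm issues in containers

driver = webdriver.Chrome(options=options)       # assumes chromedriver is on the PATH
driver.get("https://example.com")
print(driver.title)
driver.quit()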
How to connect to Chromium Headless using Selenium
Can Selenium Webdriver open browser windows silently in background?
I'm still a long way from figuring out how to deploy onto a cloud hosting server, let alone test my idea, so I thought I'd seek professional input in advance. Will hosting providers permit my web service to run in this manner?
Related
I'm trying to create a screenshot-taking web app, meant for user work monitoring. It is supposed to take screenshots of the whole visible screen, not just the browser, at 5-minute intervals. The web app is built with Python Django and is hosted on Amazon Linux 2 (CentOS-based).
I tried using pyautogui, but it doesn't seem to work on CLI-only or headless servers. Selenium and the other Python and JavaScript packages I've checked are limited to the browser (as far as I know; correct me if I'm wrong). I need to be able to get screenshots of the whole visible screen regardless of which application is open. I know it sounds hacker-ish, so please guide me to the correct path. How do I implement this? Thank you!
You may know that Heroku will stop its free dynos, free Postgres, etc. from November, so I was looking for an alternative to run my Python web apps. I have almost 10 web apps that I use regularly, like a URL shortener, keyword research, a Google Drive direct-link generator site, and many more. All of these are hosted on Heroku, but I'm moving to Vercel now. I set up all the projects on Vercel, but the last one is complicated: it is a Python Selenium bot, my keyword research web app. I used some buildpacks, e.g. Headless Chrome (https://github.com/heroku/heroku-buildpack-google-chrome) and Chromedriver (https://github.com/heroku/heroku-buildpack-chromedriver), to make this project run properly. The problem is that I could not find anything like a buildpack on Vercel to add Chrome and ChromeDriver.
Anyone know about this?
Edit:
That was a bit of a story and many people didn't understand what I was asking.
So, my project uses Selenium (Python). Selenium needs the Google Chrome browser installed and a chromedriver in order to run. Another option, instead of installing Chrome, is to set the Chrome binary location in webdriver.ChromeOptions() (see the sketch below). I want to host this Selenium project on vercel.com, which is Linux based.
So my question is: how can I install the Chrome browser and ChromeDriver on Vercel?
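For reference, the binary-location option mentioned above looks roughly like this (a sketch assuming Selenium 4; both paths are placeholders, not real Vercel locations):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.binary_location = "/path/to/chrome"   # placeholder: wherever the Chrome binary is unpacked
options.add_argument("--headless")

# placeholder: wherever the matching chromedriver is unpacked
driver = webdriver.Chrome(service=Service("/path/to/chromedriver"), options=options)
driver.get("https://example.com")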
I have created a website that scrapes multiple hockey websites for game scores. It runs perfectly on my local server and I am now in the process of trying to deploy it. I have tried using pythonanywhere.com but Selenium does not seem to be working on it. For anyone who has deployed a website that uses Selenium/WebDriver: what is the easiest/best platform to deploy a website like this? (It does not have to be free like PythonAnywhere, as long as it is not too expensive, lol!) Thanks in advance.
Selenium does work on PythonAnywhere. If you use a free account, you'll have restricted internet access, though. Also, it's recommended to scrape outside of the web app, since scraping would slow the views down; use a Scheduled/Always-on task for that instead (see the sketch after these help links). You can also refer to these PythonAnywhere help pages:
Using Selenium
Async work in web apps
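A rough sketch of that split (the URL, selector, and file paths are illustrative): a scheduled task runs the scraper and writes its results to disk, and the web app's views only read the saved results.

# scrape_task.py -- run by a Scheduled/Always-on task, not from a view
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/scores")   # placeholder URL
scores = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".score")]
driver.quit()

# the web app's views just read this file instead of scraping on every request
with open("/home/yourusername/scores.json", "w") as f:
    json.dump(scores, f)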
You can use AWS, GCP, or DigitalOcean Linux servers. In that case, you first have to install Chrome on the Linux server, check the installed Chrome version, and then put the matching version of ChromeDriver in your project directory.
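As a sketch of what that looks like in code (assuming Selenium 4 and a chromedriver binary, matching the installed Chrome version, placed in the project directory as ./chromedriver):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")     # the Linux server has no display
options.add_argument("--no-sandbox")

# chromedriver from the project directory, matching the installed Chrome version
driver = webdriver.Chrome(service=Service("./chromedriver"), options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()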
I have deployed WordPress on my localhost and I am using Selenium WebDriver to automatically navigate to each and every link. I need to save each dynamically loaded HTML page of that WordPress site using a Python script. Please help me. I am using Ubuntu 14.04.
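For concreteness, saving each rendered page could look roughly like this (just a sketch; the URL list and output file names are placeholders):

from selenium import webdriver

driver = webdriver.Firefox()   # or Chrome, whichever is installed locally
urls = ["http://localhost/", "http://localhost/sample-page/"]   # placeholder list of links to visit

for i, url in enumerate(urls):
    driver.get(url)
    # page_source is the DOM after JavaScript has run, not the raw server response
    with open("page_%d.html" % i, "w", encoding="utf-8") as f:
        f.write(driver.page_source)

driver.quit()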
If you're just trying to get a plain HTML version of your WordPress site, you usually won't need to go at it via a full browser, as very few WordPress sites are Ajax-heavy.
Give Webhttrack (or plain httrack) a try.
Install webhttrack on your machine and run it via the menu (it's usually found under Internet / Network) or the terminal (by simply running "webhttrack").
It will start a local web server and point your browser to a web interface that you can use to set up your download project. Once it has run, you will find a plain copy of the WordPress site in ~/websites/.
I am scraping some websites that seem to have pretty good protection against it. The only way I can get it to work is to use Selenium to load the page and then scrape stuff from that.
Currently this works on my local computer (a Firefox window opens and closes when I access my page, and its HTML is processed further in my script). However, I need my scraper to be accessible on the web. The scraper is embedded within a Flask app on Heroku. Is there a way to make the Selenium browser work on Heroku servers? Or are there any hosting providers where it can work?
Heroku, wonderful as it is, has a major limitation in that one cannot use custom software or, in many cases, libraries. In providing an easy-to-use, centrally controlled, managed stack, Heroku strips its servers down to prevent other usage.
What this boils down to is that there is no Xorg on a Heroku dyno. The lack of Xorg, and of the ability to install custom software, means no xvfb either, and no way to run the browser that Selenium expects to exist; the browser itself is not generally available either.
You'll have better luck with a cloud offering like AWS, where you can install custom software, including Firefox, xvfb (to avoid needing a full Xorg setup), and of course the rest of your scraping stack. This answer explains how to do it properly.
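As a sketch of that Firefox-plus-xvfb combination (assuming xvfb, Firefox, geckodriver, and the pyvirtualdisplay package are installed on the instance):

from pyvirtualdisplay import Display
from selenium import webdriver

# xvfb-backed virtual display so Firefox can run on a server with no GUI
display = Display(visible=0, size=(1280, 800))
display.start()

driver = webdriver.Firefox()
driver.get("https://example.com")
print(driver.title)

driver.quit()
display.stop()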
There are buildpacks to make Selenium work on Heroku.
Add the buildpacks below:
1) heroku buildpacks:add https://github.com/kevinsawicki/heroku-buildpack-xvfb-google-chrome/
2) heroku buildpacks:add https://github.com/heroku/heroku-buildpack-chromedriver
And set the Heroku stack to cedar-14 as shown below, since the xvfb buildpack works only with cedar-14.
heroku stack:set cedar-14 -a stocksdata
Then point Selenium at the Google Chrome binary location as below:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Chrome installed by the buildpack lives at this path on the dyno
options.binary_location = "/app/.apt/usr/bin/google-chrome-stable"
driver = webdriver.Chrome(chrome_options=options)  # use options=options on Selenium 4+