Migrating Python Selenium web scrapers to remote server

I have a number of web scrapers built with Selenium in Python that run 24/7 for several months of the year. At the moment, all scrapers run locally on my PC. This comes at quite a cost to the performance of my PC, which is basically working fairly hard the whole time.
Ideally I would like to use remote servers instead, but I'm not really sure where to start. I would like to be able to run these things remotely, and speed is also quite important - the pages need to load fairly quickly. What would be worth looking into? I'm willing to pay for a quality service.
Thanks

Related

Advice on running flask app ONLY locally forever

I want to create a web form that stays on forever on a single computer. Users can come to the computer, fill out the form and submit it. After submitting, it will record the responses in an Excel file and send emails. The next user can then come and fill out a new form automatically. I was planning on using Flask for this task since it is simple to create, but since I am not doing this on some production server, I will just have it running locally in development mode on the single computer.
I have never seen anyone do something like this with Flask, so I was wondering if my idea is possible or if I should avoid it. I am also new to web development, so I was wondering what problems there could be with keeping a Flask application running 24/7 on a local development computer.
Thanks
There is nothing wrong with doing this in principle; however, it is likely not the best solution in terms of time-to-reward payoff.
First, to answer your question: this could easily be done. Even a beginner with minimal Python and HTML experience could complete it in a few hours. Your app could crash in the background for many reasons (running out of disk space, memory errors, etc.), but most likely you will be fine.
As for actually building it, it is all possible: there are libraries you can use to write the results to an Excel file, or you can simply append to a CSV (which is what I would recommend). Creating and sending an email is similarly straightforward, although, again, doing it without Python would be easier.
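A minimal sketch of the CSV approach (the form fields and the output path are placeholders, and email sending is left as a comment):

    # Minimal sketch: a Flask form that appends each submission to a CSV file.
    import csv
    from flask import Flask, request

    app = Flask(__name__)

    FORM_HTML = """
    <form method="post">
      Name: <input name="name">
      Comment: <input name="comment">
      <button type="submit">Submit</button>
    </form>
    """

    @app.route("/", methods=["GET", "POST"])
    def form():
        if request.method == "POST":
            with open("responses.csv", "a", newline="") as f:
                csv.writer(f).writerow([request.form["name"], request.form["comment"]])
            # an email could be sent here, e.g. with smtplib
            return "Thanks! " + FORM_HTML  # ready for the next user
        return FORM_HTML

    if __name__ == "__main__":
        app.run(host="127.0.0.1", port=5000)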
If you are not set on Flask/Python, you could check out Google Forms, but if you are set on Python or want to use it as a learning experience, it can definitely be done.
Your idea is possible and while there are many ways to do this kind of thing, what you are suggesting is not necessarily to be avoided.
All apps that run on a computer over a long period of time start a process and keep it going until closed. That is essentially what you are doing.
Having done this myself (and still currently doing it) at my business, I can say that it works great.
The only caveat is that to ensure that it will always be available, you need to have the process monitored by some tool to make sure that it gets restarted if it ever closes due to a variety of reasons.
On Linux, supervisor is a great tool for doing that. On Windows, you could register it as a service. But you could also just create an easy way to restart the app and make it easy for the user to do so if it is down when they need it.
Yes, this could be done. It's very similar to the applications that run on the servers in data centers.
To keep the application running forever, or to restart it after your system starts, you'll need a service manager similar to systemd on Unix. On Windows you could use NSSM (the Non-Sucking Service Manager)
or Service Control to monitor your application and restart it if it crashes. The service will also have to be enabled on startup.
Other than this, you could use Waitress to serve your Flask application. Waitress is a WSGI web server with which you can easily configure the number of threads so it can serve multiple users at the same time.
In a production environment, it's always suggested to use a web server interface like Gunicorn or Waitress.
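For illustration, a minimal sketch of serving the app with Waitress (the module and app names are assumptions about your project layout):

    # Minimal sketch: serve the Flask app with Waitress instead of the dev server.
    from waitress import serve

    from app import app  # hypothetical: point this at wherever your Flask app object lives

    if __name__ == "__main__":
        # Listen on all interfaces, port 8080, with 8 worker threads.
        serve(app, host="0.0.0.0", port=8080, threads=8)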

Setup VPN through python script for web crawling

I have been using Selenium to do some web scraping and I need to change my IP. After doing some research into this, I have discovered that it is fairly easy to set up and use a proxy. However, I am already paying for a VPN and therefore I would like to use it for this application as well. The free proxy lists that I have found have been way too slow to be useful for me.
I did some googling and found vpnc and other libraries, but I couldn't get them to work all the way. I'm fairly new to web scraping and Python, so I would appreciate it if someone could help me at my level of knowledge.
Is it possible to do this, or am I trying to achieve something that is way too difficult for an amateur like me? I'm trying to set this up on macOS as well as Windows 7.
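For reference, the proxy route mentioned above really is only a few lines with Selenium; a minimal sketch with a placeholder proxy address (a VPN, by contrast, is normally connected at the operating-system level rather than from the Python script):

    # Minimal sketch: point Chrome at an HTTP proxy from Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server=http://203.0.113.10:3128")  # placeholder proxy address

    driver = webdriver.Chrome(options=options)
    driver.get("https://httpbin.org/ip")  # shows the IP address the target site sees
    print(driver.page_source)
    driver.quit()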

Python web scraper, json output, framework, server

I want to create a Python web scraper to get and format some data for me and output it in JSON format so that other web pages can access it. I want to put this service on one of the free Python hosts out there.
Because this is my first Python project, I have some questions.
1. Should I use any of the Python web frameworks for this? As I am not really concerned about security (I will only have a couple of pages with one input), I thought I would leave it just as a script.
2. I do need some small database. What library can you suggest for this?
3. Are there cron jobs on Python web servers?
4. Do free servers allow scraping a site every X minutes?
5. I have Python 2.7 as the default on my Linux machine. Can/should I work with it, or should I try to get a newer version up and running?
1. Yes, it makes life easier. But you have to check which frameworks can be used on the free server; sometimes you can't install your own modules.
2. SQLite doesn't need installation; MySQL and Postgres are mostly preinstalled on servers, but you have to check. (A minimal SQLite sketch follows this list.)
3. Mostly yes, but you have to check.
4. Some servers may not allow scraping any sites at all, but you have to check.
5. Use the version that is installed on the server - so, again, you have to check.
One more thing: some free servers run your page 18 hours a day and freeze it for the other 6 hours - but you have to check that too.
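A minimal sketch of the SQLite option (the table and field names are hypothetical placeholders), storing scraped items and dumping them as JSON for other pages to consume:

    # Minimal sketch: store scraped items in SQLite and output them as JSON.
    import json
    import sqlite3

    conn = sqlite3.connect("scraper.db")
    conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)")

    # ... after scraping, insert whatever you collected ...
    conn.execute("INSERT INTO items VALUES (?, ?)", ("Example title", "https://example.com"))
    conn.commit()

    # Emit everything as JSON so other web pages can access it.
    rows = conn.execute("SELECT title, url FROM items").fetchall()
    print(json.dumps([{"title": t, "url": u} for t, u in rows]))
    conn.close()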

Web Server load testing

I am supposed to test a web server's load, performance and stress. There will be over 100 client machines connecting to it.
I am using Python Selenium WebDriver to start a grid: the server acts as the Selenium 'hub' and the clients as 'nodes'. This part is working fine so far.
Now the hard part: I need to monitor the server's performance, load and stress from another third-party system while the scripts are running.
Is there any possibility this can work? I tried many open-source tools like FunkLoad, Locust and Web Server Stress Tool 8, but none of them can monitor the load tests that are swarming onto the server dynamically.
While browsing this site, I came across this project: https://github.com/djangofan/WebDriverTestingTemplate. Will this be helpful for my project?
Selenium is a functional testing tool, so it's not a good idea to use it for performance tests.
To achieve this you can go with JMeter, as it is a good open-source tool.
Still, if you want to use Selenium, there are scripts to integrate JMeter with Selenium. I never tried it, but you can.
See the links below:
https://blazemeter.com/blog/jmeter-webdriver-sampler
http://seleniummaster.com/sitecontent/index.php/performance-test-menu/selenium-load-test-menu/174-use-selenium-webdriver-with-jmeter
Hope it will help :)
It is possible to do this with Selenium, but it will take a lot more resources (especially your time).
I would also recommend trying out LoadComplete from SmartBear.
It is a very simple and intuitive tool that lets you run and schedule your tests and send a report with the execution results.
You can use Apache JMeter to generate the load from a third-party web server.
With the PerfMon Metrics Collector plugin you will be able to get server-side health metrics along with the load test results.
See the Getting Started: Scripting with JMeter guide and Learn JMeter in 60 Minutes for a quick ramp-up on Apache JMeter.
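Since the question already mentions Locust, which is Python-native, a minimal locustfile sketch is shown below for comparison (the host and endpoint are placeholders); server-side monitoring would still have to come from a separate tool:

    # locustfile.py - minimal sketch; the endpoint is a placeholder.
    from locust import HttpUser, task, between

    class WebsiteUser(HttpUser):
        wait_time = between(1, 3)  # each simulated user waits 1-3 seconds between requests

        @task
        def load_front_page(self):
            self.client.get("/")  # hypothetical endpoint on the server under test

Run it with something like locust -f locustfile.py --host https://your-server.example and scale the number of simulated users from the Locust web UI.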

What is the easiest way to run python scripts in a cloud server?

I have a web-crawling Python script that takes hours to complete and is infeasible to run in its entirety on my local machine. Is there a convenient way to deploy this to a simple web server? The script basically downloads webpages into text files. How would this best be accomplished?
Thanks!
Since you said that performance is a problem and you are doing web scraping, the first thing to try is the Scrapy framework - it is a very fast and easy-to-use web-scraping framework. The scrapyd tool would allow you to distribute the crawling - you can have multiple scrapyd services running on different servers and split the load between them. See:
Distributed crawls
Running Scrapy on Amazon EC2
There is also a Scrapy Cloud service out there:
Scrapy Cloud bridges the highly efficient Scrapy development environment with a robust, fully-featured production environment to deploy and run your crawls. It's like a Heroku for Scrapy, although other technologies will be supported in the near future. It runs on top of the Scrapinghub platform, which means your project can scale on demand, as needed.
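As a minimal sketch of what the crawler could look like in Scrapy (the start URL is a placeholder), saving each page to a text file as the question describes:

    # Minimal Scrapy spider sketch; the start URL is a placeholder.
    import scrapy

    class PageSpider(scrapy.Spider):
        name = "pages"
        start_urls = ["https://example.com"]  # hypothetical starting point

        def parse(self, response):
            # Save the raw page body to a text file, as the question describes.
            filename = response.url.rstrip("/").split("/")[-1] or "index"
            with open(filename + ".txt", "wb") as f:
                f.write(response.body)

It can be run locally with scrapy runspider spider.py, and deployed to one or more servers with scrapyd or Scrapy Cloud.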
As an alternative to the solutions already given, I would suggest Heroku. You can easily deploy not only a website, but also scripts for bots to run.
The basic account is free and pretty flexible.
This blog entry, this one and this video contain practical examples of how to make it work.
There are multiple places where you can do that. Just google "python in the cloud" and you will come up with a few, for example https://www.pythonanywhere.com/.
In addition, there are also several cloud IDEs that essentially give you a small VM for free where you can develop your code in a web-based IDE and also run it in the VM; one example is http://www.c9.io.
In 2021, Replit.com makes it very easy to write and run Python in the cloud.
If you have a Google e-mail account, you have access to Google Drive and its utilities. Choose Colaboratory (or find it under the "more..." options first). This "Colab" is essentially a Python notebook on Google Drive with full access to the files on your drive, and also with access to your GitHub. So, in addition to your local stuff, you can edit your GitHub scripts as well.
