Web Scraping with Google Compute Engine / App Engine

I've written a Python script that uses Selenium to scrape information from a website and store it in a CSV file. It works well on my local machine when I run it manually, but I now want to run it automatically once per hour for several weeks and save the data in a database. Each run may take about 5-10 minutes.
I've just started with Google Cloud, and there appear to be several ways to implement this with either Compute Engine or App Engine. So far, I get stuck at some point with each of the three approaches I found (e.g. getting the scheduled task to call a URL on my backend instance and getting that instance to kick off the script). I've tried to:
Execute the script on Compute Engine and use Datastore or Cloud SQL. It is unclear to me whether crontab can easily be set up.
Use Task Queues and Scheduled Tasks on App Engine.
Use a backend instance and Scheduled Tasks on App Engine.
I'd be curious to hear from others what they would recommend as the easiest and most appropriate way given that this is truly a backend script that does not need a user front end.

App Engine is feasible but only if you limit your use of Selenium to a .remote out to a site such as http://crossbrowsertesting.com/ -- feasible but messy.
I'd use Compute Engine: cron is trivial to use on any Linux image, see e.g. http://www.thegeekstuff.com/2009/06/15-practical-crontab-examples/
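
As a rough illustration of the cron approach, here is a minimal sketch of a job script that appends each run to a database. Everything specific in it is an assumption: the crontab paths are placeholders, `scrape()` stands in for the existing Selenium code, and SQLite is used only as a local stand-in for Cloud SQL or Datastore.

```python
import os
import sqlite3
import tempfile
from datetime import datetime, timezone

# Hypothetical crontab entry (edit with `crontab -e`) that runs the job at
# minute 0 of every hour; the paths are placeholders:
#   0 * * * * /usr/bin/python3 /home/me/scrape_job.py >> /home/me/scrape.log 2>&1

# Assumed location of the database file; override with the SCRAPE_DB variable.
DB_PATH = os.environ.get(
    "SCRAPE_DB", os.path.join(tempfile.gettempdir(), "scrape.db")
)


def scrape():
    """Placeholder for the existing Selenium code; returns (name, value) rows."""
    return [("example-item", "42")]


def save_rows(conn, rows):
    """Append one scrape run to the table, stamping each row with the run time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped (run_at TEXT, name TEXT, value TEXT)"
    )
    run_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO scraped (run_at, name, value) VALUES (?, ?, ?)",
        [(run_at, name, value) for name, value in rows],
    )
    conn.commit()
    return len(rows)


def main():
    conn = sqlite3.connect(DB_PATH)
    try:
        save_rows(conn, scrape())
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```

Because cron starts a fresh process for every run, the script needs no scheduling logic of its own; redirecting output to a log file, as in the crontab comment, keeps a record of failures.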

Related

A Python function being triggered by Google Cloud Scheduler fails to work on scheduled time but executes perfectly fine when I run it manually

I have a project on Google Cloud App Engine. I have set up a Cloud Scheduler job to make a GET request every 24 hours to an endpoint on App Engine, which invokes a simple Python script. The script reads a Google Sheet and updates Cloud Firestore with the data from the sheet. It was working perfectly, but for the past couple of days it has failed to update the database at the scheduled time and gives an error. When I trigger it manually from the console it works just fine, so the problem does not seem to be with my script. Does anyone have an idea what could be causing the problem?

I don't think there is enough information in your question, but you should start by analyzing the logs. In Cloud Scheduler, the Jobs list has a "Logs" column that contains a link for every job. From there you can open Stackdriver Logging for that particular job directly.
I hope this will help!
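
The logs are only useful if the failure actually reaches them. A minimal sketch, assuming the endpoint calls some job function (the `run_job` wrapper and its status codes are hypothetical, not part of the question), that records the full traceback with the standard `logging` module:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scheduled_job")


def run_job(job):
    """Run `job`, log the outcome, and return an HTTP-style status code."""
    try:
        job()
    except Exception:
        # logging.exception records the message plus the full traceback
        # at ERROR level, so the cause of the failure shows up in the logs.
        log.exception("scheduled job failed")
        return 500
    log.info("scheduled job finished")
    return 200
```

Returning a non-2xx status on failure should also let the scheduler record the run as failed rather than as a success.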

Celery with Google Cloud MemoryStore for a Flask website

We are building a simple single-page website using Flask, to be deployed on GKE. It relies on queries against MSSQL databases (used by another application). We want to use Celery with a Google Cloud Memorystore Redis instance to run those queries on a schedule once a day, then use that day's result data on the website, because we do not want to query the databases every time someone visits the site (the data is mostly static for a day).
I am quite new to software development, and particularly to DevOps. After reading resources online, I couldn't find much about this and I am still unsure how it works.
Is the result data stored in the Redis result backend (Google Cloud Memorystore) for the entire day after the Celery task completes, so that my Python code can access it through the Celery task variable whenever a user visits the site? Or should I access the data stored in the Redis result backend using another query to the Google Cloud database in my code? Or is the data in the Redis result backend only temporary, kept until the task is marked as done, and not accessible throughout the day? How do I move forward? Can someone please point me in the right direction?

How to turn a simple python script to basic webapp?

I would like to know what is the fastest way to turn a simple Python script into a basic web app.
For example, say I would like to create a web app that takes a keyword from the user and displays the most retweeted tweet on Twitter. If I write a Python script that performs that task using Twitter's API, how would I go about turning it into a web app for people to access?
I have looked at frameworks such as Django, but it would take me weeks or months to learn how to use it. I just need something quick and simple. Any such alternatives?

Make a CGI script out of it. You basically get the request information from the webserver via environment variables and you print the desired HTML to stdout. There are helper libraries such as Werkzeug which help with abstracting away the handling of the environment variables by wrapping them in a Request object.
This technique is quite outdated and isn't normally used nowadays as the script has to be run on every request and thus incurs the startup cost all the time.
Nevertheless this may actually be a good solution for you because it is quick and every webserver supports it.
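
A minimal sketch of such a CGI script, using only the standard library. The `keyword` parameter and the `build_response` helper are illustrative assumptions, and the actual Twitter lookup is left out:

```python
#!/usr/bin/env python3
import html
import os
from urllib.parse import parse_qs


def build_response(environ):
    """Build a full CGI response (headers, blank line, body) from the environment."""
    # The web server passes the part of the URL after "?" in QUERY_STRING.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    keyword = query.get("keyword", [""])[0]
    # Escape user input before echoing it back into HTML.
    body = "<p>You searched for: {}</p>".format(html.escape(keyword))
    return (
        "Content-Type: text/html; charset=utf-8\r\n\r\n"
        "<html><body>" + body + "</body></html>"
    )


if __name__ == "__main__":
    # A CGI script answers the request by printing to stdout.
    print(build_response(os.environ), end="")
```

Keeping the response logic in a function that takes the environment as an argument makes it possible to test the script without a web server.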

choosing an application framework to handle offline analysis with web requests

I am trying to design a web based app at the moment, that involves requests being made by users to trigger analysis of their previously entered data. The background analysis could be done on the same machine as the web server or be run on remote machines, and should not significantly impede the performance of the website, so that other users can also make analysis requests while the background analysis is being done. The requests should go into some form of queueing system, and once an analysis is finished, the results should be returned and viewable by the user in their account.
Please could someone advise me of the most efficient framework to handle this project? I am currently working on Linux, the analysis software is written in Python, and I have previously designed dynamic sites using Django. Is there something compatible with this that could work?

Given your background and the analysis code already being written in Python, Django + Celery seems like an obvious candidate here. We're currently using this solution for a very processing-heavy app with one front-end Django server, one dedicated database server, and two distinct Celery servers for the background processing. Having the Celery processes on distinct servers keeps the Django front end responsive whatever the load on the Celery servers (and we can add new Celery servers if required).
I don't know if it's "the most efficient" solution, but it does work.
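
The submit-then-fetch pattern behind this setup can be illustrated without installing Celery. The sketch below is not Celery code: it swaps in a standard-library thread pool as a stand-in for the workers, and `analyse`, `submit_analysis` and `get_result` are hypothetical names.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the Celery workers; in the setup described above these would
# be separate processes on separate servers.
executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job id -> Future


def analyse(data):
    """Hypothetical analysis; here it just sums the user's numbers."""
    return sum(data)


def submit_analysis(data):
    """Queue the analysis and return immediately with a job id."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = executor.submit(analyse, data)
    return job_id


def get_result(job_id, timeout=None):
    """Return the result, blocking up to `timeout` seconds if it is not ready."""
    return jobs[job_id].result(timeout=timeout)
```

The web request handler would call `submit_analysis` and return right away, and a later request would fetch the result by job id. With Celery, the queue and the results live in a broker and result backend rather than in process memory, which is what allows the workers to run on other machines.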

Run some code whenever I upload the project to the App Engine server

I've built an App Engine project. How can I run a piece of code on the app server only once, i.e. whenever I upload the whole project to the server?
How should I achieve this?

There isn't an official way to discover whether your application has been modified. Each time you upload your application it gets a unique version number {app version.(some unique number)}, but since there isn't a documented API for reading it, I wouldn't take the risk of relying on it.
What you need to do is have a script that uploads your application and, when the upload is done, calls a handler in your application that sets a value in the datastore marking the application as new.
Once you have that, your handlers can look for the value in the datastore and run the code if they find it.
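
The marker logic described above can be sketched as follows. A plain dictionary stands in for the datastore, and the function names and the `fresh_deploy` key are hypothetical:

```python
# In-memory stand-in for the datastore; a real app would read and write an
# entity instead of a dictionary key.
datastore = {}


def mark_new_deploy():
    """Handler that the upload script calls once the upload has finished."""
    datastore["fresh_deploy"] = True


def run_once_after_deploy(setup):
    """Run `setup` if the deploy marker is set, then clear it.

    Returns True if `setup` ran, False otherwise.
    """
    if datastore.pop("fresh_deploy", False):
        setup()
        return True
    return False
```

With a real datastore, the check and the clearing of the marker would need to happen in one transaction, otherwise two concurrent requests could both see the marker and run the code twice.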
