I have a Python script that is intended to run on my local machine every night. Its goal is to pull data from a third-party server, do some processing on it, and perform a bulk upload to the GAE datastore.
My issue, though, is how to run the bulk upload from a Python script. All the examples I have seen (including Google's documentation) use the command line "appcfg.py upload_data ...", and as far as I can see appcfg.py and bulkloader.py do not expose any API that is guaranteed not to change.
My two options, as I see them now, are either to execute the "appcfg.py upload_data ..." command from my Python script, which seems a roundabout way of doing things, or to call appcfg.py's internal methods directly, which means I have to recode things in case they change.
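For reference, the first option would look something like this from my nightly script (just a sketch; the config file, CSV name, kind, and URL below are placeholders for my real values):

```python
import subprocess

# Option 1: shell out to appcfg.py; the flag values below are placeholders
# that would be replaced with the real kind, CSV file, and application URL.
subprocess.check_call([
    'appcfg.py', 'upload_data',
    '--config_file=bulkloader.yaml',
    '--filename=data.csv',
    '--kind=MyEntity',
    '--url=http://myapp.appspot.com/_ah/remote_api',
])
```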
App Engine can run cron jobs. All you need to write is a single handler that pulls the data from the third-party server and uploads it to the datastore; App Engine will do the rest for you. Appengine cron has everything you need to know about running a cron job on App Engine; a minimal cron.yaml is sketched below.
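A minimal sketch of that configuration, assuming a hypothetical /tasks/nightly-import handler in your app (adjust the URL and time to your own handler and schedule):

```yaml
cron:
- description: nightly import from the third-party server
  url: /tasks/nightly-import
  schedule: every day 03:00
```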
This answer is now outdated. Please see the link below for my latest answer on bulk uploading data to App Engine.
How to upload data in bulk to the appengine datastore? Older methods do not work
I'm looking to create a publisher that streams and sends tweets containing a certain hashtag to a Pub/Sub topic.
The tweets will then be ingested with Cloud Dataflow and loaded into a BigQuery database.
In the following article they do something similar, where the publisher is hosted in a Docker image on a Google Compute Engine instance.
Can anyone recommend alternative Google Cloud resources that could host the publisher code more simply, avoiding the need to create a Dockerfile etc.?
The publisher would need to run constantly. Would Cloud Run, for example, be a suitable alternative?
There are some workarounds I can think of:
A quick way to avoid a container architecture is to have the on_data method inside a loop, for example by using something like while(True), or to start a Stream as explained in Create your Python script, and run the code on a Compute Engine instance in the background with nohup python -u myscript.py. Alternatively, follow the steps described in Script on GCE to capture tweets, which uses tweepy.Stream to start the streaming. A minimal publishing loop is sketched after this list.
You might want to reconsider the Dockerfile option, since its configuration may not be that difficult; see Tweets & pipelines, where there is a script that reads the data and publishes it to Pub/Sub. You will see that only 9 lines are used for the Dockerfile and that it is deployed to App Engine using Cloud Build. Another implementation with a Dockerfile that requires more steps is twitter-for-bigquery; in case it helps, you will see that there are more specific steps and more configurations.
Cloud Functions is also another option; in the guide Serverless Twitter with Google Cloud you can check the Design section to see whether it fits your use case.
Airflow with Twitter Scraper could work for you, since Cloud Composer is a managed service for Airflow and you can create an Airflow environment quickly. It uses the Twint library; check the Technical section in the link for more details.
Stream Twitter Data into BigQuery with Cloud Dataprep is a workaround that puts aside complex configurations. In this case the job won't run constantly, but it can be scheduled to run every few minutes.
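If you go with the loop-on-a-VM approach from the first item, a minimal sketch of the publishing side could look like the following; fetch_matching_tweets, the project ID, and the topic ID are placeholders for whatever Twitter client and Pub/Sub resources you actually use:

```python
import json
import time

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"   # placeholder: your GCP project
TOPIC_ID = "tweets"         # placeholder: your Pub/Sub topic

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def fetch_matching_tweets(hashtag):
    """Hypothetical helper: return a list of tweet dicts for the hashtag."""
    return []


while True:
    for tweet in fetch_matching_tweets("#example"):
        # Pub/Sub messages must be bytes; publish() returns a future.
        future = publisher.publish(topic_path, data=json.dumps(tweet).encode("utf-8"))
        future.result()  # block until the message is accepted by the service
    time.sleep(10)  # simple polling interval
```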
We have just signed up with Azure and were wondering how to schedule and run Python scripts that extract data from various sources such as APIs, web-scraping scripts, etc. What is the best tool on Azure that can run and schedule those scripts, as well as save the output to a target destination?
The output of the scripts will be saved to either a data lake and/or an Azure SQL database.
Thank you.
There are several services in Azure that can do this task.
I suggest you make use of Azure WebJobs (it supports Python and can run on a schedule).
The rough guidelines are as below:
1. Develop your Python scripts locally and make sure they work locally (e.g. extract data from the other sources, save to the Azure database).
2. In the Azure portal, create a scheduled WebJob. During creation you need to upload the .py file (zip all the files into a .zip file); for "Type", select "Triggered"; in the "Triggers" dropdown, select "Scheduled"; then specify when to run the .py file by using a CRON expression (an example is shown after these steps).
3. It's done.
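For reference, the schedule is a six-field CRON expression (the first field is seconds). A nightly run at 02:00 can be entered in the portal, or packaged into the .zip as a settings.job file like the sketch below; the time is just an example:

```json
{
  "schedule": "0 0 2 * * *"
}
```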
You can also consider other Azure services such as Azure Functions with a timer trigger, but the WebJob is much easier. A minimal timer-trigger sketch is shown below.
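For completeness, a rough sketch of the timer-trigger alternative; the schedule and logic below are placeholders, not a full solution. A Python function consists of a function.json binding plus an __init__.py:

```json
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "mytimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 0 2 * * *"
    }
  ]
}
```

```python
import logging

import azure.functions as func


def main(mytimer: func.TimerRequest) -> None:
    # Placeholder: call your APIs / scrapers here and write the results
    # to the data lake or Azure SQL database.
    logging.info("Nightly extraction triggered.")
```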
Hope it helps, and please let me know if you still have issues with it.
I have tried to follow Google's documentation on how to set up local development using a database (https://cloud.google.com/appengine/docs/standard/python/tools/using-local-server#Python_Using_the_Datastore). However, I do not have the experience level to follow along, and I am not even sure that was the right guide. The application is a Django project that uses Python 2.7. To run the local host, I usually type dev_appserver.py --host 127.0.0.1 .
My questions are:
How do I download the Datastore database from Google Cloud? I do not want to download the entire database, just enough data to populate localhost so I can run tests.
Once the database is downloaded, what do I need to do to connect it to localhost? Do I have to change a parameter somewhere?
Do I need to download the datastore at all? Can I just make a duplicate in the cloud and then connect to that datastore?
When I run localhost, should it not already be connected to the datastore, since the site works when it is running in the cloud? Where can I find the connection URI?
Thanks for the help
The development server is meant to simulate the whole App Engine Environment, if you examine the output of the dev_appserver.py command you'll see something like Starting Cloud Datastore emulator at: http://localhost:PORT. Your code will interact with that bundled Datastore automatically, pushing and retrieving data according to the code you wrote. Your data will be saved on a file in local storage and will persist across different runs of the development server unless it's explicitly deleted.
This option doesn't provide facilities to import data from your existing Cloud Datastore instance, although it's a ready-to-go solution if your testing procedures can afford populating the local database with mock data through a custom script that does so programmatically. If you decide on this approach, just write the data-creation script and execute it before running the tests; a minimal sketch follows.
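A minimal sketch of such a data-creation script, assuming the app uses ndb and a hypothetical Person model; adapt the kind and fields to your own schema:

```python
from google.appengine.ext import ndb


class Person(ndb.Model):
    name = ndb.StringProperty()
    city = ndb.StringProperty()


def populate():
    # put_multi writes the mock entities in a single batch.
    ndb.put_multi([
        Person(name='Alice', city='Lisbon'),
        Person(name='Bob', city='Porto'),
    ])


# Run this while dev_appserver.py is running (for example from the local
# interactive console at http://localhost:8000/console) so the entities
# land in the bundled local Datastore.
populate()
```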
Now, there is another option for simulating the Datastore locally using the Cloud SDK, which comes with handy features for your purposes. You can find the available information under the Running the Datastore Emulator documentation page. This emulator supports importing entities downloaded from your production Cloud Datastore as well as exporting them into files.
Back to your questions:
Export the data from the Cloud instance into a GCS bucket following this, then download the data from the bucket to your filesystem following this, and finally import the data into the emulator with the command shown here (the full command sequence is sketched after these answers).
To use the emulator you need to first run gcloud beta emulators datastore start in a Cloud Shell and then in a separate tab run dev_appserver.py --support_datastore_emulator=true --datastore_emulator_port=8081 app.yaml.
The development server uses one of the two aforementioned emulators; in both cases it is not connected to your Cloud Datastore. You could also create another project for development purposes with a copy of your database and deploy your application there, so that you don't use the emulator at all.
Requests to Datastore are made through the endpoint https://datastore.googleapis.com/v1/projects/project-id, although this is not related to how the emulators manage the connections in your local server.
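Putting points 1 and 2 together, the sequence could look roughly like this; the project, bucket, and folder names are placeholders, and the exact name of the .overall_export_metadata file depends on your export:

```bash
# 1. Export the production Datastore into a GCS bucket.
gcloud datastore export gs://MY_BUCKET/my-export --project=MY_PROJECT

# 2. Download the export to your local filesystem.
gsutil -m cp -r gs://MY_BUCKET/my-export ./my-export

# 3. Start the emulator (keep it running in its own terminal).
gcloud beta emulators datastore start

# 4. Import the downloaded export into the emulator (default port 8081);
#    point input_url at the .overall_export_metadata file you downloaded.
curl -X POST localhost:8081/v1/projects/MY_PROJECT:import \
    -H 'Content-Type: application/json' \
    -d '{"input_url": "/absolute/path/to/my-export/my-export.overall_export_metadata"}'

# 5. Run the development server against the emulator.
dev_appserver.py --support_datastore_emulator=true --datastore_emulator_port=8081 app.yaml
```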
Hope this helps.
I have a Python 2.7 script at https://github.com/jhsu802701/dopplervalueinvesting . When I run the screen.py script locally, the end result is a new screen-output sub-directory (within the root directory) and a results.csv file within it.
What I'm trying to do is put this script on a remote server, run this screen.py script every night, and make the results.csv file publicly readable.
I've tried to do this on Google App Engine, but I can't get it to work. The Google App Engine tutorial revolves around trying to dynamically create a web site, and I haven't been able to figure out how to make anything other than an index.html file in the root directory work. HOW DO I MAKE OTHER FILES PUBLICLY READABLE?
Is Google App Engine the way to go, or am I barking up the wrong tree? I understand that another route is using WebFaction, a web hosting provider that offers a whole Linux system. (Running my app on my current web host, MDDHosting, is not an option because lxml is not available without a much more costly VPS.)
In summary, my questions are:
How do I run my Python script in Google App Engine and make the output results.csv file publicly available?
If Google App Engine isn't the solution for me, should I use WebFaction? (I already tried Heroku, and it didn't work for me.)
What are my other options?
I'm willing to pay for a solution, but only if I get web hosting as well. (I'm not willing to pay for MDDHosting for my dopplervalueinvesting.com web site AND another host for running my script.)
I think GAE should be good for what you want, but you may need to work differently because, as a comment pointed out, you can't write to the file system but have to use the datastore instead.
So in your app.yaml list of handlers you need something like:
- url: /results.csv
  script: deliver_results_file.py
- url: /screen
  login: admin
  script: screen.py
screen.py needs to save the results to the datastore in some convenient format. deliver_results_file.py then queries the datastore, and if the results are not already in CSV format it converts them accordingly. It then writes the formatted data directly to the output (usually using self.response.out.write within a webapp request handler) as if it were a dynamically generated web page. A rough sketch of such a handler follows.
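This is only a sketch of what deliver_results_file.py could look like, assuming the old CGI-style webapp framework (to match the app.yaml above) and a hypothetical ScreenResult model; adapt the fields to the real columns of results.csv:

```python
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app


class ScreenResult(db.Model):
    # Hypothetical model: screen.py would store one entity per output row.
    ticker = db.StringProperty()
    score = db.FloatProperty()


class ResultsCsvHandler(webapp.RequestHandler):
    def get(self):
        # Serve the datastore contents as a CSV "file".
        self.response.headers['Content-Type'] = 'text/csv'
        self.response.out.write('ticker,score\n')
        for row in ScreenResult.all():
            self.response.out.write('%s,%s\n' % (row.ticker, row.score))


application = webapp.WSGIApplication([('/results.csv', ResultsCsvHandler)])


def main():
    run_wsgi_app(application)


if __name__ == '__main__':
    main()
```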
Finally, you want to schedule it to run once a night; I believe this is possible using App Engine cron jobs, configured in a cron.yaml file.
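A sketch of that cron.yaml, hitting the /screen handler above; the time of day is just an example:

```yaml
cron:
- description: nightly screening run
  url: /screen
  schedule: every day 02:00
```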
I just deployed a site on GAE which requires me to stage some data for dropdown fields (i.e. us states, status, etc.).
In development, I have created an entity for each type of data (US State entity for example) and was able to preload the data using the interactive console by creating the entity and then calling the put() method.
Now that the application is deployed I don’t know of a way to preload this data. How would you recommend doing this in a deployed instance?
I am using SDK version 1.7.0, python 2.7, High Replication Datastore (HRD), and memcache when data is retrieved.
Thanks in advance for your help!
If you want to do it programmatically, you may use the interactive console in production. Check out How do I activate the Interactive Console on App Engine?
You may also create a temporary request handler that'll do the job, deploy it (e.g. as a different version of the app to make it easy to delete) and launch the respective URL in your browser.
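A rough sketch of such a temporary handler, assuming webapp2 and ndb and a hypothetical UsState model; adapt the kind, fields, and route to your own data:

```python
import webapp2
from google.appengine.ext import ndb


class UsState(ndb.Model):
    code = ndb.StringProperty()
    name = ndb.StringProperty()


class PreloadHandler(webapp2.RequestHandler):
    def get(self):
        # put_multi writes all entities in one batch; add the remaining states here.
        ndb.put_multi([
            UsState(code='AL', name='Alabama'),
            UsState(code='AK', name='Alaska'),
        ])
        self.response.write('Preload complete.')


app = webapp2.WSGIApplication([('/admin/preload', PreloadHandler)])
```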
You can use the bulkloader to upload your entities to your deployed version. See the doc Uploading and Downloading Data for details and examples.
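For reference, the bulkloader upload is driven from the command line; a sketch of the invocation, where the config file, CSV file, and kind are assumptions and the remote_api endpoint is assumed to be enabled for your app:

```bash
appcfg.py upload_data --config_file=bulkloader.yaml --filename=us_states.csv \
    --kind=UsState --url=http://YOUR_APP_ID.appspot.com/_ah/remote_api
```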