I'm looking to create a publisher that streams tweets containing a certain hashtag and sends them to a Pub/Sub topic.
The tweets will then be ingested with Cloud Dataflow and loaded into a BigQuery database.
In the following article they do something similar, with the publisher hosted in a Docker image on a Google Compute Engine instance.
Can anyone recommend alternative Google Cloud resources that could host the publisher code more simply, avoiding the need to create a Dockerfile, etc.?
The publisher would need to run constantly. Would Cloud Run, for example, be a suitable alternative?
There are some workarounds I can think of:
A quick way to avoid a container architecture is to put the on_data method inside a loop, for example by using something like while True, or to start a Stream as explained in Create your Python script, and then run the code on a Compute Engine instance in the background with nohup python -u myscript.py. Alternatively, follow the steps described in Script on GCE to capture tweets, which uses tweepy.Stream to start the streaming. (A minimal sketch of this approach is shown after these options.)
You might want to reconsider the Dockerfile option, since its configuration does not have to be complicated. See Tweets & pipelines, where a script reads the data and publishes it to Pub/Sub; only 9 lines are used for the Dockerfile and it is deployed to App Engine using Cloud Build. Another implementation with a Dockerfile that requires more steps is twitter-for-bigquery, in case it helps; you will see that it has more specific steps and more configuration.
Cloud Functions is another option; in the guide Serverless Twitter with Google Cloud you can check the Design section to see whether it fits your use case.
Airflow with Twitter Scraper could work for you, since Cloud Composer is a managed service for Airflow and you can create an Airflow environment quickly. It uses the Twint library; check the Technical section in the link for more details.
Stream Twitter Data into BigQuery with Cloud Dataprep is a workaround that avoids complex configuration. In this case the job won't run constantly, but it can be scheduled to run every few minutes.
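If you go with the first option (a long-running script instead of a container), a minimal sketch could look like the following. It assumes tweepy's pre-4.0 StreamListener API, an existing Pub/Sub topic, application default credentials, and placeholder names for the project, topic, hashtag and Twitter keys:

import tweepy
from google.cloud import pubsub_v1

# Placeholder project and topic; the topic must already exist.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'tweets')


class PubSubListener(tweepy.StreamListener):
    def on_data(self, data):
        # Forward the raw tweet JSON straight to Pub/Sub.
        publisher.publish(topic_path, data=data.encode('utf-8'))
        return True

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on rate limiting (420).
        return False


auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

stream = tweepy.Stream(auth, PubSubListener())
stream.filter(track=['#myhashtag'])  # blocks, so the script runs constantly

Run it in the background with nohup python -u myscript.py as mentioned above and it keeps publishing until the process is stopped.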
I have tried to follow Google's documentation on how to set up local development using a database (https://cloud.google.com/appengine/docs/standard/python/tools/using-local-server#Python_Using_the_Datastore). However, I do not have the experience to follow along, and I am not even sure that was the right guide. The application is a Django project that uses Python 2.7. To run the local server, I usually type dev_appserver.py --host 127.0.0.1.
My questions are:
How do I download the Datastore database from Google Cloud? I do not want to download the entire database, just enough data to populate the local host so I can run tests.
Once the database is downloaded, what do I need to do to connect it to localhost? Do I have to change a parameter somewhere?
Do I need to download the Datastore at all? Can I just make a duplicate in the cloud and then connect to that Datastore?
When I run localhost, should it not already be connected to the Datastore, since the site works when it is running in the cloud? Where can I find the connection URI?
Thanks for the help
The development server is meant to simulate the whole App Engine environment. If you examine the output of the dev_appserver.py command you'll see something like Starting Cloud Datastore emulator at: http://localhost:PORT. Your code will interact with that bundled Datastore automatically, pushing and retrieving data according to the code you wrote. Your data will be saved to a file in local storage and will persist across different runs of the development server unless it's explicitly deleted.
This option doesn't provide facilities to import data from your existing Cloud Datastore instance, but it is a ready-to-go solution if your testing procedures can afford populating the local database with mock data through a custom script that does so programmatically. If you decide on this approach, just write the data creation script and execute it before running the tests.
Now, there is another option to simulate a local Datastore using the Cloud SDK that comes with handy features for your purposes. You can find the available information under the Running the Datastore Emulator documentation page. This emulator supports importing entities downloaded from your production Cloud Datastore, as well as exporting them into files.
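If you go the mock-data route from the previous paragraph and use the Cloud SDK emulator, the data creation script can be a plain google-cloud-datastore client pointed at the emulator. This is only a sketch; the kind, properties and project id are made up, and it assumes the emulator is listening on localhost:8081:

import os

from google.cloud import datastore

# The client library talks to the local emulator when this variable is set.
os.environ.setdefault('DATASTORE_EMULATOR_HOST', 'localhost:8081')

client = datastore.Client(project='my-dev-project')  # arbitrary id for local use

# Create a handful of fake entities so the local app has something to show.
for i in range(20):
    entity = datastore.Entity(key=client.key('Article'))
    entity.update({'title': 'Test article %d' % i, 'views': i})
    client.put(entity)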
Back to your questions:
Export the data from the Cloud instance into a GCS bucket following this, then download the data from the bucket to your filesystem following this, and finally import the data into the emulator with the command shown here.
To use the emulator you need to first run gcloud beta emulators datastore start in a Cloud Shell and then in a separate tab run dev_appserver.py --support_datastore_emulator=true --datastore_emulator_port=8081 app.yaml.
The development server uses one of the two aforementioned emulators; in both cases it is not connected to your Cloud Datastore. You might create another project intended for development purposes with a copy of your database and deploy your application there, so you don't use the emulator at all.
Requests to Datastore are made through the endpoint https://datastore.googleapis.com/v1/projects/project-id, although this is not related to how the emulators manage the connections in your local server.
Hope this helps.
How might one send data from Twitter directly to Google Cloud data storage? I would like to skip the step of first downloading it to my local machine and then uploading it to the cloud. It would run once. Not looking for full code, but any pointers or tutorials that someone might have learned from. I am using Python to interact with google-cloud and storage.
Any help would be appreciated.
Here's a blog post which describes the following architecture:
Run a Python script on Compute Engine
Move your data to BigQuery for storage
Here's another one that describes a somewhat more complex architecture, including the ability to analyze tweets:
Use Google Cloud Dataflow templates
Launch Dataflow pipelines from a Google App Engine (GAE) app
In order to support MapReduce jobs
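Since the question asks about sending the data directly to cloud storage without a local download, here is a rough one-off sketch: fetch the tweets with tweepy and upload them straight to a Cloud Storage object from memory, so nothing is written to local disk. The bucket name, hashtag, object name and the pre-4.0 tweepy API used here are assumptions, and application default credentials are assumed for the storage client:

import json

import tweepy
from google.cloud import storage

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

# Collect a batch of tweets for the one-off run.
tweets = [status._json for status in
          tweepy.Cursor(api.search, q='#myhashtag').items(100)]

# Upload the batch directly to Cloud Storage; no local file is written.
client = storage.Client()
bucket = client.bucket('my-tweet-bucket')
blob = bucket.blob('tweets/batch-001.json')
blob.upload_from_string(json.dumps(tweets), content_type='application/json')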
I have a huge amount of data (hundreds of gigabytes) on Google BigQuery, and for ease of use (many post-query treatments) I'm working with the bigquery Python package. The problem is that I have to run all my queries again whenever I shut my laptop down, which is very expensive as my dataset is about one terabyte. I thought of Google Compute Engine, but that is a poor solution as I will still be paying for my machines if I don't stop them. My last solution is to mount a Docker image on our own sandbox, which is cheaper and can do exactly what I'm looking for. So I would like to know if someone has ever mounted a Docker image for BigQuery? Thanks for helping!
We mount all of our Python/BigQuery projects into Docker containers and push them to Google Container Registry.
Automated scheduling, dependency graphing, and logging can be handled with Google Cloud Composer (Airflow). It's pretty simple to set up, and Airflow has a KubernetesPodOperator that allows you to specify a Python file to run in your Docker image on GCR. You can use this workflow to make sure all of your queries and Python scripts run on GCP without having to worry about Google Compute Engine or other DevOps concerns; a sketch of such a DAG follows the links below.
https://cloud.google.com/composer/docs/how-to/using/using-kubernetes-pod-operator
https://cloud.google.com/composer/
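A minimal DAG along those lines might look like the sketch below. The Airflow 1.x import path, the image name (gcr.io/my-project/bq-jobs) and the script name are assumptions, not a definitive setup:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id='nightly_bigquery_jobs',
    schedule_interval='@daily',
    start_date=datetime(2019, 1, 1),
    catchup=False,
) as dag:

    # Runs a Python script baked into a GCR image on the Composer cluster,
    # so no Compute Engine instances need to be managed by hand.
    run_queries = KubernetesPodOperator(
        task_id='run_queries',
        name='run-queries',
        namespace='default',
        image='gcr.io/my-project/bq-jobs:latest',
        cmds=['python', 'run_queries.py'],
    )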
I'm working on a Flask API, one of whose endpoints receives a message and publishes it to Pub/Sub. Currently, in order to test that endpoint, I have to manually spin up a Pub/Sub emulator from the command line and keep it running during the test. It works just fine, but it isn't ideal for automated tests.
I wonder if anyone knows a way to spin up a test Pub/Sub emulator from Python? Or does anyone have a better solution for testing such an API?
As far as I know, there is no Python-native Google Cloud Pub/Sub emulator available.
You have a few options, all of which require launching an external program from Python:
Invoke the gcloud command you mentioned, gcloud beta emulators pubsub start [options], directly from your Python application to start it as an external program (see the subprocess sketch after these options).
The Pub/Sub emulator that comes as part of the Cloud SDK is a JAR file bootstrapped by the bash script at CLOUD_SDK_INSTALL_DIR/platform/pubsub-emulator/bin/cloud-pubsub-emulator. You could possibly run this bash script directly.
Here is a StackOverflow answer which covers multiple ways to launch an external program from Python.
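As a sketch of the first option, a test fixture could start the emulator with subprocess and point the client libraries at it through PUBSUB_EMULATOR_HOST. This assumes the Cloud SDK is installed and on the PATH; the crude sleep would normally be replaced by polling the port:

import os
import subprocess
import time


def start_pubsub_emulator(port=8085):
    """Launch the Pub/Sub emulator as an external process for tests."""
    proc = subprocess.Popen(
        ['gcloud', 'beta', 'emulators', 'pubsub', 'start',
         '--host-port=localhost:%d' % port],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Client libraries route their calls to the emulator when this is set.
    os.environ['PUBSUB_EMULATOR_HOST'] = 'localhost:%d' % port
    time.sleep(5)  # give the emulator a moment to come up
    return proc


# In the test teardown: proc.terminate(); proc.wait()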
Also, it is not quite clear from your question how you're calling the PubSub APIs in Python.
For unit tests, you could consider setting up a wrapper over the code which actually invokes the Cloud PubSub APIs, and inject a fake for this API wrapper. This way, you can test the rest of the code which invokes just your fake API wrapper instead of the real API wrapper and not worry about starting any external programs.
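A hedged sketch of that wrapper-and-fake idea is shown below; the class names are illustrative, and the Flask endpoint would receive the publisher through dependency injection so a unit test can pass in the fake:

from google.cloud import pubsub_v1


class PubSubPublisher(object):
    """Thin wrapper around the real Pub/Sub client, used in production."""

    def __init__(self, project_id, topic_id):
        self._client = pubsub_v1.PublisherClient()
        self._topic_path = self._client.topic_path(project_id, topic_id)

    def publish(self, message):
        return self._client.publish(self._topic_path, data=message.encode('utf-8'))


class FakePublisher(object):
    """In-memory stand-in injected by unit tests; no emulator required."""

    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)


# In a unit test the endpoint is exercised with FakePublisher() and the test
# simply asserts on fake.messages instead of talking to Pub/Sub.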
For integration tests, the PubSub emulator will definitely be useful.
This is how I usually do it:
1. I create a Python client class which publishes and subscribes using the topic, project and subscription configured in the emulator.
Note: You need to set PUBSUB_EMULATOR_HOST=localhost:8085 as an environment variable in your Python project.
2. I spin up a Pub/Sub emulator as a Docker container.
Note: You need to set some environment variables, mount volumes and expose port 8085.
Set the following environment variables for the container:
PUBSUB_EMULATOR_HOST
PUBSUB_PROJECT_ID
PUBSUB_TOPIC_ID
PUBSUB_SUBSCRIPTION_ID
3. Write whatever integration tests you want. Use the publisher or subscriber from the client depending on your test requirements; a sketch of such a client is shown below.
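A sketch of that client, assuming the emulator container is already listening on localhost:8085 and the topic and subscription below have been created in it (the keyword-argument style of pull/acknowledge matches recent google-cloud-pubsub releases):

import os

from google.cloud import pubsub_v1

# With this set, the client libraries talk to the emulator, not production.
os.environ.setdefault('PUBSUB_EMULATOR_HOST', 'localhost:8085')

PROJECT_ID = os.environ.get('PUBSUB_PROJECT_ID', 'test-project')
TOPIC_ID = os.environ.get('PUBSUB_TOPIC_ID', 'test-topic')
SUBSCRIPTION_ID = os.environ.get('PUBSUB_SUBSCRIPTION_ID', 'test-subscription')

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)


def publish(message):
    # Returns a future; the test can call .result() to wait for the publish.
    return publisher.publish(topic_path, data=message.encode('utf-8'))


def pull(max_messages=10):
    # Synchronously pull and acknowledge messages for assertions in tests.
    response = subscriber.pull(subscription=subscription_path,
                               max_messages=max_messages)
    ack_ids = [m.ack_id for m in response.received_messages]
    if ack_ids:
        subscriber.acknowledge(subscription=subscription_path, ack_ids=ack_ids)
    return [m.message.data for m in response.received_messages]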
I have a Python script that is intended to run on my local machine every night. Its goal is to pull data from a third-party server, do some processing on it, and execute a bulk upload to the GAE Datastore.
My issue, though, is how to run a bulk upload from a Python script. All the examples I have seen (including Google's documentation) use the command line appcfg.py upload_data ..., and as far as I can see appcfg.py and bulkloader.py do not expose any API that is guaranteed not to change.
The two options I see right now are to either execute the appcfg.py upload_data ... command from my Python script, which seems a roundabout way of doing things, or to call appcfg.py's internal methods directly, which means I have to recode things in case they change.
App Engine can run cron jobs. All you need to do is write a single script that pulls the data from the third-party server and uploads it to App Engine; App Engine will do the rest for you. Appengine cron has everything you need to know about running a cron job in App Engine.
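A minimal sketch of that setup on the Python 2.7 runtime, with a hypothetical /tasks/import URL, a made-up source URL and a throwaway model (the cron.yaml entry is shown in the comment), might look roughly like this:

# main.py (Python 2.7 / webapp2): a cron-triggered handler that pulls data
# from the third-party server and writes it to the Datastore.
# A cron.yaml entry such as:
#   cron:
#   - description: nightly import
#     url: /tasks/import
#     schedule: every day 03:00
# points App Engine's cron service at this handler.
import webapp2
from google.appengine.api import urlfetch
from google.appengine.ext import ndb


class ImportedRecord(ndb.Model):
    payload = ndb.TextProperty()


class ImportHandler(webapp2.RequestHandler):
    def get(self):
        # Fetch data from the third-party server (placeholder URL).
        result = urlfetch.fetch('https://example.com/export.json')
        # Store it; real code would parse and batch the entities.
        ImportedRecord(payload=result.content).put()
        self.response.write('done')


app = webapp2.WSGIApplication([('/tasks/import', ImportHandler)])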
This answer is now outdated. Please see the below link for my latest answer for bulk upload data to app engine.
How to upload data in bulk to the appengine datastore? Older methods do not work