I'd like to automate the Data Import process to Google Analytics, so that every night raw CSVs are uploaded to Google Cloud Storage, then my app reads these files and imports the data into Google Analytics.
I believe this could be achieved with an App Engine app:
running a nightly cron job to read the data CSVs from GCS,
and potentially using the Management API's uploadData method (management/uploads/uploadData) to perform the Data Import: POST https://www.googleapis.com/upload/analytics/v3/management/accounts/accountId/webproperties/webPropertyId/customDataSources/customDataSourceId/uploads
But the problem I encounter is that the Python client library for the uploadData API only accepts a file name to upload the data:
media = MediaFileUpload('custom_data.csv', mimetype=mimetype, resumable=False)
daily_upload = analytics.management().uploads().uploadData(
    accountId=accountId, webPropertyId=webPropertyId,
    customDataSourceId=customDataSourceId, media_body=media).execute()
Whereas an App Engine app doesn't have any persistent disk it can read a file from (I tried to pass the GCS bucket name instead of the file name, but it doesn't work).
Any suggestions on how I could automate the process? Or am I on the right track?
Thanks a lot for any clue.. I'm stuck here :)
The client library and the API do not require that it be a file; it's only your MediaFileUpload that does. You should be able to create a media uploader that takes a stream. The documentation mentions it's possible:
https://developers.google.com/api-client-library/python/guide/media_upload#extending-mediaupload
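For instance, a minimal sketch of that idea, assuming the google-cloud-storage client library (or any other way to read the object's bytes) and reusing the analytics service object and IDs from the question; the bucket and object names are placeholders:

import io
from google.cloud import storage
from googleapiclient.http import MediaIoBaseUpload

# Placeholder bucket/object names; replace with your own.
BUCKET_NAME = 'my-analytics-csvs'
OBJECT_NAME = 'custom_data.csv'

# Read the CSV from GCS into memory (no local disk needed).
blob = storage.Client().bucket(BUCKET_NAME).blob(OBJECT_NAME)
csv_stream = io.BytesIO(blob.download_as_bytes())

# Wrap the in-memory stream in a media uploader the client library accepts.
media = MediaIoBaseUpload(csv_stream, mimetype='application/octet-stream', resumable=False)

daily_upload = analytics.management().uploads().uploadData(
    accountId=accountId, webPropertyId=webPropertyId,
    customDataSourceId=customDataSourceId, media_body=media).execute()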
I have built a dashboard using Dash and Plotly which plots data from an Excel file located on my local PC. The data in the Excel file will be updated on a daily basis, but I don't know what I should do so that the dashboard gets updated as well without me redeploying it every day. Maybe you can give some advice or ideas? I am a newbie in Python, so maybe there is something pretty obvious that I haven't come across yet.
P.S. The dashboard is currently deployed using Heroku.
The best solution for you will be to use a cloud storage service like AWS S3 or Google Cloud Storage. Here is a link to another Stack Overflow post on uploading to Google Cloud and an in-depth post on reading a CSV from Google Cloud.
You will need to create a service on your local computer to upload your CSV file automatically, and a service in your Dash app that queries Google Cloud for the new CSV / pandas DataFrame at given intervals.
One solution could be to use a scheduler to run the update check from your app in a new thread:
from apscheduler.schedulers.background import BackgroundScheduler

def start():
    scheduler = BackgroundScheduler(timezone="Pacific/Auckland")
    # loadCSVData is your own function that re-reads the CSV and refreshes the app's data
    scheduler.add_job(loadCSVData, 'cron', day_of_week='mon-fri', hour=10, minute=47)
    scheduler.start()
    print("Scheduler started - loadCSVData will refresh the CSV data in the app")
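A minimal sketch of what loadCSVData could look like, assuming the pandas and google-cloud-storage libraries and placeholder bucket/file names:

import io
import pandas as pd
from google.cloud import storage

df = pd.DataFrame()  # module-level frame your Dash callbacks read from

def loadCSVData():
    global df
    # Placeholder bucket and object names; replace with your own.
    blob = storage.Client().bucket('my-dashboard-data').blob('daily_export.csv')
    # Pull the latest CSV straight from GCS into the shared DataFrame.
    df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))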
For weather-processing purposes, I am looking to automatically retrieve daily weather forecast data into Google Cloud Storage.
The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 megabytes). The size of the files is the main issue.
After looking at previous Stack Overflow topics, I have tried two unsuccessful methods:
1/ First attempt via urlfetch in Google App Engine
from google.appengine.api import urlfetch
url = "http://dcpc-nwp.meteo.fr/servic..."
result = urlfetch.fetch(url)
[...] # Code to save in a Google Cloud Storage bucket
But I get the following error message on the urlfetch line:
DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL
2/ Second attempt via the Cloud Storage Transfer Service
According to the documentation, it is possible to retrieve HTTP data into Cloud Storage directly via the Cloud Storage Transfer Service:
https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata
But it requires the size and MD5 of the files before the download. This option cannot work in my case because the website does not provide that information.
3/ Any ideas?
Do you see any solution to automatically retrieve large files over HTTP into my Cloud Storage bucket?
3/ Workaround with a Compute Engine instance
Since it was not possible to retrieve large files from external HTTP with App Engine or directly with Cloud Storage, I have used a workaround with an always-running Compute Engine instance.
This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
For scalability, maintenance and cost reasons, I would have preferred to use only serverless services, but fortunately:
It works well on a fresh f1-micro Compute Engine instance (no extra packages required and only $4/month if running 24/7)
The network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month)
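For reference, the check/download/upload job running on that instance could look roughly like the sketch below (assuming the requests and google-cloud-storage libraries; the URL and bucket name are placeholders), scheduled for example via cron on the instance:

import requests
from google.cloud import storage

# Placeholder URL and bucket; replace with the real forecast file URL and your bucket.
FILE_URL = 'http://dcpc-nwp.meteo.fr/path/to/forecast-file'
BUCKET_NAME = 'my-weather-bucket'

def fetch_and_store(url=FILE_URL, bucket_name=BUCKET_NAME):
    # Stream the (potentially 300 MB) file to local disk to keep memory usage low.
    local_path = '/tmp/' + url.split('/')[-1]
    with requests.get(url, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(local_path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
    # Upload the downloaded file to the Cloud Storage bucket.
    storage.Client().bucket(bucket_name).blob(url.split('/')[-1]).upload_from_filename(local_path)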
The MD5 and size of the file can be retrieved easily and quickly using a curl -I command, as mentioned in this link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
The Storage Transfer Service can then be configured to use that information.
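A small Python equivalent of that curl -I check, assuming the server actually exposes the Content-Length and Content-MD5 headers (not all servers do):

import requests

def get_remote_size_and_md5(url):
    # Equivalent of `curl -I`: a HEAD request returns only the headers.
    resp = requests.head(url, allow_redirects=True, timeout=30)
    resp.raise_for_status()
    size = resp.headers.get('Content-Length')  # file size in bytes, if exposed
    md5 = resp.headers.get('Content-MD5')      # base64-encoded MD5 digest, if exposed
    return size, md5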
Another option would be to use a serverless Cloud Function. It could look something like the code below, in Python.
import requests

def download_url_file(url):
    try:
        print('[ INFO ] Downloading {}'.format(url))
        req = requests.get(url)
        if req.status_code == 200:
            # Download and save to /tmp (the only writable path in a Cloud Function)
            output_filepath = '/tmp/{}'.format(url.split('/')[-1])
            output_filename = url.split('/')[-1]
            with open(output_filepath, 'wb') as f:
                f.write(req.content)
            print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
            return output_filename
        else:
            print('[ ERROR ] Status Code: {}'.format(req.status_code))
            return None
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
        return None
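Since the goal is to land the file in Cloud Storage rather than in /tmp, the function would also need to upload the result. A minimal sketch, assuming the google-cloud-storage library and a placeholder bucket name:

from google.cloud import storage

def upload_to_gcs(local_filename, bucket_name='my-weather-bucket'):
    # 'my-weather-bucket' is a placeholder; use your own bucket.
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(local_filename).upload_from_filename('/tmp/{}'.format(local_filename))

Keep in mind that /tmp in Cloud Functions is an in-memory filesystem, so a 300 MB download counts against the function's memory allocation.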
Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.
Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.
Is there any way to take a small data JSON object from memory and upload it to Google BigQuery without using the file system?
We have working code that uploads files to BigQuery. This code no longer works for us because our new script runs on Google App Engine which doesn't have write access to the file system.
We notice that our BigQuery upload configuration has a "MediaFileUpload" option with a data_path. We have been unable to find if there is some way to change the data_path to something in memory - or if there is another method apart from MediaFileUpload that could upload data from memory to BigQuery.
media_body=MediaFileUpload(
    data_path,
    mimetype='application/octet-stream'))
job = insert_request.execute()
We have seen solutions which recommend uploading data to Google Cloud Storage then referencing that file to upload to BigQuery for when datasets are large. Our dataset is small so this step seems like an unnecessary use of bandwidth, processing and time.
As a result, we are wondering if there is a way to take our small JSON object and upload it directly to BigQuery using Python?
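For what it's worth, the client library also ships stream-based uploaders, so a sketch along these lines (keeping your existing insert_request job configuration, and assuming it declares newline-delimited JSON as the source format; the sample rows are hypothetical) may avoid the file system entirely:

import io
import json
from googleapiclient.http import MediaIoBaseUpload

# Serialize the in-memory JSON object to newline-delimited JSON bytes.
rows = [{'id': 1, 'name': 'example'}]  # hypothetical data
payload = '\n'.join(json.dumps(row) for row in rows).encode('utf-8')

media_body = MediaIoBaseUpload(
    io.BytesIO(payload),
    mimetype='application/octet-stream')

# Pass media_body to the same jobs().insert(...) call you already use,
# then job = insert_request.execute() as before.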
I'm pretty new with Google App Engine.
What I need to do is upload a pretty large CSV to CloudSQL.
I've got an HTML page that has a file upload module which when uploaded reaches the Blobstore.
After that I open the CSV with the blob reader and execute each line against CloudSQL using cursor.execute("insert into table values"). The problem here is that I can only execute the HTTP request for a minute, and not all the data gets inserted in that short a time. It also keeps the screen in a loading state throughout, which I would like to avoid by making the code run in the back end, if that's possible.
I also tried going the "LOAD DATA LOCAL INFILE" way.
"LOAD DATA LOCAL INFILE" works from my local machine when i'm connected to CloudSQL via the terminal. And its pretty quick.
How would i go about using this within App Engine?
Or is there a better way to import a large CSV into CloudSQL through the Blobstore or Google Cloud Storage directly after uploading the CSV from the HTML?
Also, is it possible to use Task Queues with Blob Store and then insert the data into CloudSQL on the backend?
I have used a similar approach for Datastore and not CloudSQL but the same approach can be applied to your scenario.
Set up a non-default module (previously called a backend, now deprecated) of your application
Send an HTTP request which will trigger the module endpoint through a task queue (to avoid the 60-second deadline)
Use MapReduce with the CSV as input and do the operation on each line of the CSV within the map function (to avoid memory errors and to resume the pipeline from where it left off in case of any errors during the operation)
EDIT: Elaborating on MapReduce as per the OP's request, and also eliminating the use of the task queue
Read the mapreduce basics from the docs found here
Download the dependency folders for mapreduce to work (simplejson, graphy, mapreduce)
Download this file to your project folder and save as "custom_input_reader.py"
Now copy the code below to your main_app.py file.
main_app.py
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from custom_input_reader import GoogleStorageLineInputReader

def testMapperFunc(row):
    # process the CSV row here
    return

class TestGCSReaderPipeline(base_handler.PipelineBase):
    def run(self):
        yield mapreduce_pipeline.MapPipeline(
            "gcs_csv_reader_job",
            "main_app.testMapperFunc",
            "custom_input_reader.GoogleStorageLineInputReader",
            params={
                "input_reader": {
                    # bucketname and filename must be supplied, e.g. as run() parameters (see below)
                    "file_paths": ['/' + bucketname + '/' + filename]
                }
            })
Create an HTTP handler which will initiate the map job
main_app.py
class BeginUpload(webapp2.RequestHandler):
    def post(self):
        # do whatever you want
        upload_task = TestGCSReaderPipeline()
        upload_task.start()
        # do whatever you want
If you want to pass any parameters, add the parameters to the "run" method and provide the values when creating the pipeline object, as in the example below.
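For example, with placeholder bucket and file names (this replaces the run method shown above):

class TestGCSReaderPipeline(base_handler.PipelineBase):
    def run(self, bucketname, filename):
        yield mapreduce_pipeline.MapPipeline(
            "gcs_csv_reader_job",
            "main_app.testMapperFunc",
            "custom_input_reader.GoogleStorageLineInputReader",
            params={
                "input_reader": {
                    "file_paths": ['/' + bucketname + '/' + filename]
                }
            })

# 'my-bucket' and 'uploaded_data.csv' are placeholders
upload_task = TestGCSReaderPipeline('my-bucket', 'uploaded_data.csv')
upload_task.start()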
You can try importing CSV data via the Cloud Console:
https://cloud.google.com/sql/docs/import-export?hl=en#import-csv
With the old Blobstore, blobstore_handlers.BlobstoreUploadHandler took care of long requests; for example, a user could upload a 100 MB file taking 2 minutes.
However, I've not seen any example or document anywhere that explains a similar process using Google Cloud Storage. All the examples provided just upload sample text files from App Engine to GCS; there are no "user -> App Engine -> GCS" examples. I hope I didn't miss anything.
You can upload a file to GCS using the Blobstore API by specifying the gs_bucket_name argument of the create_upload_url function, for example:
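A minimal sketch of what that can look like, assuming webapp2 handlers on App Engine standard; 'my-bucket' is a placeholder bucket name:

import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class UploadFormHandler(webapp2.RequestHandler):
    def get(self):
        # The generated URL makes the Blobstore upload machinery write into GCS.
        upload_url = blobstore.create_upload_url(
            '/upload_complete', gs_bucket_name='my-bucket')
        self.response.write(
            '<form action="%s" method="POST" enctype="multipart/form-data">'
            '<input type="file" name="file"><input type="submit"></form>' % upload_url)

class UploadCompleteHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # The uploaded object now lives in GCS; its path is exposed on the FileInfo record.
        file_info = self.get_file_infos()[0]
        self.response.write('Stored at: %s' % file_info.gs_object_name)

The user's browser posts directly to the generated upload URL, so the long upload itself does not count against your App Engine request deadline.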