I'm pretty new with Google App Engine.
What I need to do is upload a fairly large CSV to CloudSQL.
I've got an HTML page with a file upload form; the uploaded file ends up in the Blobstore.
After that I open the CSV with the Blob reader and insert each line into CloudSQL using cursor.execute("insert into table values"). The problem is that an HTTP request can only run for a minute, and not all the data gets inserted in that time. It also keeps the page in a loading state throughout, which I would like to avoid by running the code on the back end, if that's possible.
I also tried going the "LOAD DATA LOCAL INFILE" way.
"LOAD DATA LOCAL INFILE" works from my local machine when I'm connected to CloudSQL via the terminal, and it's pretty quick.
How would I go about using this within App Engine?
Or is there a better way to import a large CSV into CloudSQL through the Blobstore or Google Cloud Storage directly after uploading the CSV from the HTML?
Also, is it possible to use Task Queues with Blob Store and then insert the data into CloudSQL on the backend?
I have used a similar approach for Datastore and not CloudSQL but the same approach can be applied to your scenario.
Set up a non-default module (previously called a backend, now deprecated) for your application
Send an HTTP request that triggers the module endpoint through a task queue (to avoid the 60-second request deadline)
Use mapreduce with the CSV as input and process each line within the map function (to avoid memory errors, and so the pipeline can resume from where it left off if an error occurs)
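Whichever background mechanism runs the import, the per-line cursor.execute calls from the question are one likely reason it is slow. As a rough sketch of the work a task handler or map step could do, the code below reads the CSV in fixed-size batches and inserts each batch with executemany. It uses the standard library's sqlite3 module purely as a stand-in for the CloudSQL connection; the `records` table, its two columns, and the batch size are assumptions for illustration, and with MySQLdb the placeholder would be `%s` rather than `?`.

```python
import csv
import io
import sqlite3
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def import_csv(conn, csv_file, batch_size=500):
    """Insert every CSV row into the (hypothetical) records table in batches."""
    reader = csv.reader(csv_file)
    total = 0
    for batch in iter_batches(reader, batch_size):
        conn.executemany("INSERT INTO records (name, value) VALUES (?, ?)", batch)
        conn.commit()  # commit per batch so a retried task has less work to redo
        total += len(batch)
    return total


# Stand-in database and data; a real handler would open the CloudSQL
# connection and the uploaded CSV instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, value TEXT)")
sample = io.StringIO("a,1\nb,2\nc,3\n")
inserted = import_csv(conn, sample, batch_size=2)
```

Batching keeps memory use bounded, because only one batch of rows is held at a time, and it cuts the number of round trips to the database.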
EDIT: Elaborating map reduce as per OP request, and also eliminating the use of taskqueue
Read the mapreduce basics from the docs found here
Download the dependency folders for mapreduce to work (simplejson, graphy, mapreduce)
Download this file to your project folder and save as "custom_input_reader.py"
Now copy the code below to your main_app.py file.
main_app.py
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from custom_input_reader import GoogleStorageLineInputReader

def testMapperFunc(row):
    # do process with csv row
    return

class TestGCSReaderPipeline(base_handler.PipelineBase):
    def run(self):
        yield mapreduce_pipeline.MapPipeline(
            "gcs_csv_reader_job",
            "main_app.testMapperFunc",
            "custom_input_reader.GoogleStorageLineInputReader",
            params={
                "input_reader": {
                    "file_paths": ['/' + bucketname + '/' + filename]
                }
            })
Create a http handler which will initiate the map job
main_app.py
import webapp2

class BeginUpload(webapp2.RequestHandler):
    def get(self):  # or post(), depending on how the upload is triggered
        # do whatever you want
        upload_task = TestGCSReaderPipeline()
        upload_task.start()
        # do whatever you want
If you want to pass any parameters (for example the bucket name and file name), add them to the "run" method and provide the values when creating the pipeline object
You can try importing CSV data via cloud console:
https://cloud.google.com/sql/docs/import-export?hl=en#import-csv
Related
What I basically want to happen is my on demand scheduled query will run when a new file lands in my google cloud storage bucket. This query will load the CSV file into a temporary table, perform some transformation/cleaning and then append to a table.
Just to try and get the first part running, my on demand scheduled query looks like this. The idea being it will pick up the CSV file from the bucket and dump it into a table.
LOAD DATA INTO spreadsheep-20220603.Case_Studies.loading_test
FROM FILES
(
  format='CSV',
  uris=['gs://triggered_upload/*.csv']
);
I was in the process of setting up a Google Cloud Function that triggers when a file lands in the storage bucket. That part seems to be fine, but I haven't had any luck working out how the function can trigger the scheduled query.
Any idea what bit of Python code is needed in the function to trigger the query?
It seems to me that it's not really a scheduled query you want at all. You don't want one to run at regular intervals, you want to run a query in response to a certain event.
Now, you've rigged up a cloud function to execute some code whenever a new file is added to a bucket. What this cloud function needs is the BigQuery Python client library. Here's an example of how it's used.
All that remains is to wrap this code in an appropriate function and specify dependencies and permissions using the Cloud Functions Python framework. Here is a guide on how to do that.
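As a minimal sketch of what that wrapped function might look like: the code below assumes a background function with the (event, context) signature, triggered by a Cloud Storage "finalize" event, and the google-cloud-bigquery package listed in requirements.txt. The function names, the TABLE constant, and the optional client argument (added only so the logic can be exercised without credentials) are assumptions, not part of the original answer.

```python
TABLE = "spreadsheep-20220603.Case_Studies.loading_test"


def build_load_statement(table, bucket, object_name):
    """Return a LOAD DATA statement for one CSV object in Cloud Storage."""
    uri = "gs://{}/{}".format(bucket, object_name)
    return (
        "LOAD DATA INTO `{}` "
        "FROM FILES (format = 'CSV', uris = ['{}'])"
    ).format(table, uri)


def on_file_uploaded(event, context, client=None):
    """Entry point: run the load query for the file that triggered the event."""
    if client is None:
        # Imported lazily; requires the google-cloud-bigquery package.
        from google.cloud import bigquery
        client = bigquery.Client()
    # The storage event payload carries the bucket and object name.
    sql = build_load_statement(TABLE, event["bucket"], event["name"])
    job = client.query(sql)
    job.result()  # wait for completion so failures show up in the function logs
    return sql
```

The object name is interpolated straight into the SQL here, so this sketch is only suitable when the bucket's object names are trusted. The transformation and append steps from the question could be further statements passed to client.query in the same function.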
I'd like to automate Data Import to Google Analytics, so that every night raw CSVs are uploaded to Google Cloud Storage, and my app then reads these files and imports the data into Google Analytics.
I believe this could be achieved with an App Engine app:
running a nightly cron job to read the CSVs from GCS
and potentially using the Management API uploadData method (management/uploads/uploadData) to make the Data Import: POST https://www.googleapis.com/upload/analytics/v3/management/accounts/accountId/webproperties/webPropertyId/customDataSources/customDataSourceId/uploads
The problem I encounter is that the Python client library for the uploadData API only accepts a file name for the data to upload:
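from googleapiclient.http import MediaFileUpload
media = MediaFileUpload('custom_data.csv', mimetype='application/octet-stream', resumable=False)
daily_upload = analytics.management().uploads().uploadData(accountId=accountId, webPropertyId=webPropertyId, customDataSourceId=customDataSourceId, media_body=media).execute()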
Whereas an App Engine app doesn't have any persistent disk to read the file from (I tried passing the GCS bucket name instead of the file name, but it doesn't work).
Any suggestions on how I could automate the process? Or am I on the right track?
Thanks a lot for any clue.. I'm stuck here :)
The client library and the API do not require that it be a file. It's only your MediaFileUpload that does. You should be able to create a media uploader that takes a stream; the documentation mentions that it's possible:
https://developers.google.com/api-client-library/python/guide/media_upload#extending-mediaupload
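A minimal sketch of the stream-based approach, assuming the CSV content is already in memory as bytes (how it is read from Cloud Storage is left out). It uses MediaIoBaseUpload from the google-api-python-client package, which wraps any file-like object; the helper name build_media_from_bytes is hypothetical.

```python
import io


def build_media_from_bytes(data, mimetype="application/octet-stream"):
    """Wrap in-memory bytes in an uploader that can be passed as media_body."""
    # Imported lazily; requires the google-api-python-client package.
    from googleapiclient.http import MediaIoBaseUpload
    return MediaIoBaseUpload(io.BytesIO(data), mimetype=mimetype, resumable=False)
```

The returned object can then replace the MediaFileUpload instance in the uploadData call, passed as media_body, so nothing has to be written to disk.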
Is there any way to take a small data JSON object from memory and upload it to Google BigQuery without using the file system?
We have working code that uploads files to BigQuery. This code no longer works for us because our new script runs on Google App Engine which doesn't have write access to the file system.
We notice that our BigQuery upload configuration has a "MediaFileUpload" option with a data_path. We have been unable to find if there is some way to change the data_path to something in memory - or if there is another method apart from MediaFileUpload that could upload data from memory to BigQuery.
media_body=MediaFileUpload(
data_path,
mimetype='application/octet-stream'))
job = insert_request.execute()
We have seen solutions which recommend uploading data to Google Cloud Storage then referencing that file to upload to BigQuery for when datasets are large. Our dataset is small so this step seems like an unnecessary use of bandwidth, processing and time.
As a result, we are wondering if there is a way to take our small JSON object and upload it directly to BigQuery using Python?
I know this is a little open-ended, but I am unsure which strategy to apply for a large-file upload service built with Flask and boto3. For smaller files it works fine, but I would like to hear what you think for files larger than 100 MB.
What I have in mind are following -
a) Stream the file to the Flask app using some kind of AJAX uploader. (What I am trying to build is just a REST interface using Flask-Restful. Any examples using these components, e.g. Flask-Restful, boto3 and streaming large files, are welcome.) The upload app is going to be (I believe) part of a microservices platform that we are building. I do not know whether there will be an Nginx proxy in front of the Flask app or whether it will be served directly from a Kubernetes pod/service. If it is served directly, is there something I have to change for large file uploads in the Kubernetes and/or Flask layer?
b) Use a direct JS uploader (like http://www.plupload.com/) to stream the file directly into the S3 bucket; when finished, get the URL, pass it to the Flask API app, and store it in the DB. The problem with this is that the credentials need to be somewhere in the JS, which is a security threat. (Not sure if there are other concerns.)
Which of these (or something different I have not thought of at all) do you think is the best way, and where can I find a code example for it?
Thanks in advance.
[EDIT]
I have found this - http://blog.pelicandd.com/article/80/streaming-input-and-output-in-flask where the author deals with a similar situation and proposes a solution. But he is opening a file already present on disk. What if I want to upload the file directly, as it comes in, as one single object in an S3 bucket? I feel that this can be the basis of a solution but not the solution itself.
Alternatively, you can use the Minio-py client library. It's open source and compatible with the S3 API, and it handles multipart upload for you natively.
A simple put_object.py example:
import os
from minio import Minio
from minio.error import ResponseError

client = Minio('s3.amazonaws.com',
               access_key='YOUR-ACCESSKEYID',
               secret_key='YOUR-SECRETACCESSKEY')

# Put a file with default content-type.
try:
    file_stat = os.stat('my-testfile')
    file_data = open('my-testfile', 'rb')
    client.put_object('my-bucketname', 'my-objectname', file_data, file_stat.st_size)
except ResponseError as err:
    print(err)

# Put a file with 'application/csv'
try:
    file_stat = os.stat('my-testfile.csv')
    file_data = open('my-testfile.csv', 'rb')
    client.put_object('my-bucketname', 'my-objectname', file_data,
                      file_stat.st_size, content_type='application/csv')
except ResponseError as err:
    print(err)
You can find list of complete API operations with examples here
Installing Minio-Py library
$ pip install minio
Hope it helps.
Disclaimer: I work for Minio
As far as I know, Flask keeps the whole HTTP request body in memory; there is no built-in disk buffering.
The Nginx upload module is a really good way to handle large file uploads. The documentation is here.
You can also use HTML5 or Flash to send chunked file data and process the chunks in Flask, but it's complicated.
Check whether S3 offers a one-time token.
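S3 does offer something close to a one-time token: a presigned POST, which lets the browser upload straight to the bucket without ever seeing the AWS credentials. A minimal sketch using boto3's generate_presigned_post follows; the helper name, the size limit, and the expiry are assumptions for illustration, and the s3_client argument exists only so the function can be exercised without real credentials.

```python
def create_upload_form(bucket, key, s3_client=None, max_bytes=1024 ** 3, expires=600):
    """Return the URL and form fields a browser needs to POST a file to S3."""
    if s3_client is None:
        # Imported lazily; requires boto3 and configured AWS credentials.
        import boto3
        s3_client = boto3.client("s3")
    return s3_client.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        # Reject uploads larger than max_bytes.
        Conditions=[["content-length-range", 0, max_bytes]],
        ExpiresIn=expires,  # seconds the signed form stays valid
    )
```

A Flask endpoint could return this dictionary as JSON. The JS uploader then posts the file to the returned "url" with the returned "fields" as extra form data, and notifies the API once the upload completes.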
Using the link I have posted above I finally ended up doing the following. Please tell me if you think it is a good solution
import boto3
from flask import Flask, request
.
.
.
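@app.route('/upload', methods=['POST'])
def upload():
    s3 = boto3.resource('s3', aws_access_key_id="key", aws_secret_access_key='secret', region_name='us-east-1')
    # upload_fileobj reads the request stream in chunks until it is exhausted;
    # a single request.stream.read(CHUNK_SIZE) would store only the first chunk
    s3.Object('bucket-name', 'filename').upload_fileobj(request.stream)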
.
.
.
I am using this file storage engine to store files to Amazon S3 when they are uploaded:
http://code.welldev.org/django-storages/wiki/Home
It takes quite a long time to upload because the file must first be uploaded from client to web server, and then web server to Amazon S3 before a response is returned to the client.
I would like to make the process of sending the file to S3 asynchronous, so the response can be returned to the user much faster. What is the best way to do this with the file storage engine?
Thanks for your advice!
I've taken another approach to this problem.
My models have two file fields: one uses the standard file storage backend and the other uses the S3 file storage backend. When the user uploads a file, it gets stored locally.
I have a management command in my application that uploads all the locally stored files to S3 and updates the models.
So when a request comes in for the file, I check whether the model object uses the S3 storage field. If so, I send a redirect to the correct URL on S3; if not, I send a redirect so that nginx can serve the file from disk.
This management command can of course be triggered by any event, a cron job, or whatever.
It's possible to have your users upload files directly to S3 from their browser using a special form (with a signed, base64-encoded policy document in a hidden field). They will be redirected back to your application once the upload completes.
More information here: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1434
There is an app for that :-)
https://github.com/jezdez/django-queued-storage
It does exactly what you need - and much more, because you can set any "local" storage and any "remote" storage. This app will store your file in fast "local" storage (for example MogileFS storage) and then using Celery (django-celery), will attempt asynchronous uploading to the "remote" storage.
A few remarks:
The tricky thing is that you can set it up with either a copy-and-upload or an upload-and-delete strategy; the latter deletes the local file once it is uploaded.
The second tricky thing is that it will serve the file from the "local" storage until it has been uploaded to the "remote" storage.
It can also be configured to make a number of retries when an upload fails.
Installation & usage is also very simple and straightforward:
pip install django-queued-storage
append to INSTALLED_APPS:
INSTALLED_APPS += ('queued_storage',)
in models.py:
from django.db import models
from queued_storage.backends import QueuedStorage

queued_s3storage = QueuedStorage(
    'django.core.files.storage.FileSystemStorage',
    'storages.backends.s3boto.S3BotoStorage',
    task='queued_storage.tasks.TransferAndDelete')

class MyModel(models.Model):
    my_file = models.FileField(upload_to='files', storage=queued_s3storage)
You could decouple the process:
The user selects a file to upload and sends it to your server. After this he sees a page saying "Thank you for uploading foofile.txt, it is now stored in our storage backend".
Once the user has uploaded the file, it is stored in a temporary directory on your server and, if needed, some metadata is stored in your database.
A background process on your server then uploads the file to S3. This is only possible if you have full access to your server, so that you can create some kind of daemon to do this (or simply use a cron job).*
The page that is displayed polls asynchronously and shows some kind of progress bar to the user (or a simple "please wait" message). This is only needed if the user should be able to use the file (put it in a message, or something like that) directly after uploading.
[*: If you only have shared hosting, you could possibly build a solution that uses a hidden iframe in the user's browser to start a script which then uploads the file to S3.]
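The decoupling described above can be sketched with nothing but the standard library. This is an illustration of the flow rather than production code: the in-process queue and thread stand in for a daemon, cron job, or task queue, store_remotely is a placeholder for the real S3 transfer, and all names are hypothetical.

```python
import queue
import threading

status = {}            # upload id -> "pending" | "done" | "failed"
jobs = queue.Queue()   # files waiting to be moved to remote storage


def store_remotely(path):
    """Placeholder for the real transfer, e.g. an S3 upload of the temp file."""
    return True


def worker():
    """Background loop: take queued files and push them to remote storage."""
    while True:
        upload_id, path = jobs.get()
        try:
            status[upload_id] = "done" if store_remotely(path) else "failed"
        except Exception:
            status[upload_id] = "failed"
        finally:
            jobs.task_done()


def enqueue_upload(upload_id, path):
    """Called by the request handler right after saving the temp file."""
    status[upload_id] = "pending"
    jobs.put((upload_id, path))


threading.Thread(target=worker, daemon=True).start()
enqueue_upload("42", "/tmp/foofile.txt")
jobs.join()  # only for this demo; a web request would return immediately
```

The polling page would simply read status[upload_id] through a small endpoint. In a real deployment the status would live in the database rather than in process memory, so that it survives restarts and is visible to every worker.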
You can directly upload media to the s3 server without using your web application server.
See the following references:
Amazon API Reference : http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingHTTPPOST.html
A django implementation : https://github.com/sbc/django-uploadify-s3
As some of the answers here suggest uploading directly to S3, here's a Django S3 Mixin using plupload:
https://github.com/burgalon/plupload-s3mixin
I encountered the same issue with uploaded images. You cannot pass along files to a Celery worker because Celery needs to be able to pickle the arguments to a task. My solution was to deconstruct the image data into a string and get all other info from the file, passing this data and info to the task, where I reconstructed the image. After that you can save it, which will send it to your storage backend (such as S3). If you want to associate the image with a model, just pass along the id of the instance to the task and retrieve it there, bind the image to the instance and save the instance.
When a file has been uploaded via a form, it is available in your view as an UploadedFile file-like object. You can get it directly out of request.FILES or, better, first bind it to your form, run is_valid and retrieve the file-like object from form.cleaned_data. At that point you at least know it is the kind of file you want it to be. After that you can get the data using read(), and get the other info using other methods/attributes. See https://docs.djangoproject.com/en/1.4/topics/http/file-uploads/
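The deconstruct/reconstruct idea can be sketched without Django or Celery. The helper names and payload keys below are hypothetical; the point is only that the task receives plain, picklable data instead of the file object itself.

```python
import io
import pickle


def deconstruct(uploaded_file):
    """Turn a file-like upload into plain data that can be pickled."""
    return {
        "name": getattr(uploaded_file, "name", "upload.bin"),
        "content_type": getattr(uploaded_file, "content_type", "application/octet-stream"),
        "data": uploaded_file.read(),
    }


def reconstruct(payload):
    """Rebuild a file-like object inside the task from the plain data."""
    stream = io.BytesIO(payload["data"])
    stream.name = payload["name"]
    return stream


# Simulate the round trip a pickled task argument goes through.
source = io.BytesIO(b"fake image bytes")
source.name = "photo.png"
payload = deconstruct(source)
restored = reconstruct(pickle.loads(pickle.dumps(payload)))
```

In the task, the restored stream would be wrapped in whatever file class the storage backend expects before saving. Note that newer Celery versions default to a JSON serializer, in which case the raw bytes would additionally need to be base64-encoded.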
I actually ended up writing and distributing a little package to save an image asynchronously. Have a look at https://github.com/gterzian/django_async. Right now it's just for images, but you could fork it and add functionality for your situation. I'm using it with https://github.com/duointeractive/django-athumb and S3.