I've got a GCP pipeline which contains several containers on Cloud Run, each of which requires a given type of input (XML, PDF, etc.). A couple of Cloud Functions written in Python help orchestrate the workflow.
These inputs are stored in Cloud Storage, and I can retrieve them as binary. However, since the XML or PDF object comes back as a blob, it cannot directly serve as input to my containers.
I could convert the blob to XML or PDF, store the file in the Cloud Function's temp directory and then pass it to the relevant container. However, I'd like to avoid saving the file and pass the blob to the container with the minimum number of intermediary steps.
What would be the most efficient way?
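One option, sketched below under assumptions (the Cloud Run URL and endpoint path are placeholders, and authentication to the Cloud Run service is omitted): the Cloud Function reads the object into memory with download_as_bytes and forwards the raw bytes to the container over HTTP, so nothing is written to the temp directory.

import requests
from google.cloud import storage

CLOUD_RUN_URL = "https://my-service-abc123-uc.a.run.app/process"  # placeholder

def forward_object(event, context):
    """Cloud Function (GCS trigger): pass a stored object to a Cloud Run container as raw bytes."""
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])

    # Keep the object in memory; nothing is written to the function's temp directory.
    payload = blob.download_as_bytes()
    content_type = "application/pdf" if event["name"].lower().endswith(".pdf") else "application/xml"

    # Authentication to the Cloud Run service (ID token) is omitted in this sketch.
    resp = requests.post(CLOUD_RUN_URL, data=payload, headers={"Content-Type": content_type})
    resp.raise_for_status()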
I am a newbie to ETL. I just managed to extract a lot of information in the form of JSON files to GCS. Each JSON file includes identical key-value pairs, and now I would like to transform them into dataframes on the basis of certain key values.
The next step would be loading this into a data warehouse like ClickHouse, I guess? I was not able to find any tutorials on this process.
TLDR 1) Is there a way to transform JSON data on GCS in Python without downloading the whole data?
TLDR 2) How can I set this up to run periodically or in real time?
TLDR 3) How can I go about loading the data into a warehouse?
If these are too many questions, I would love it if you could point me to resources on this. I appreciate the help.
There are some ways to do this.
You can add files to storage, then a Cloud Function is triggered every time a new file is added (https://cloud.google.com/functions/docs/calling/storage) and calls an endpoint on Cloud Run (the container service - https://cloud.google.com/run/docs/building/containers) running a Python application that transforms these JSONs into a dataframe. Note that the container image will be stored in Container Registry. The Python application running on Cloud Run then saves the rows incrementally to BigQuery (the warehouse). After that, you can do analytics with Looker Studio.
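A minimal sketch of that first path, assuming for brevity that a single Cloud Function handles both the trigger and the BigQuery write (the Cloud Function + Cloud Run split is collapsed, and the table ID is a placeholder):

import json
from google.cloud import bigquery, storage

bq_client = bigquery.Client()
storage_client = storage.Client()
TABLE_ID = "my-project.my_dataset.my_table"  # placeholder

def on_new_json(event, context):
    """Triggered by google.storage.object.finalize on the bucket."""
    blob = storage_client.bucket(event["bucket"]).blob(event["name"])

    # Read the JSON object straight from GCS into memory.
    record = json.loads(blob.download_as_bytes())

    # Append the record to BigQuery (streaming insert); the JSON keys must match the table schema.
    errors = bq_client.insert_rows_json(TABLE_ID, [record])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")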
If you need to scale the solution to millions/billions of rows, you can add files to storage, a Cloud Function is triggered and calls Dataproc, a service where you can run Python, Anaconda, etc. (How to call google dataproc job from google cloud function). This Dataproc cluster will then structure the JSONs as a dataframe and save them to the warehouse (BigQuery).
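And a rough sketch of the second path, assuming a Dataproc cluster already exists and the PySpark job script lives in a bucket (all names are placeholders):

from google.cloud import dataproc_v1

PROJECT = "my-project"    # placeholder
REGION = "us-central1"    # placeholder
CLUSTER = "etl-cluster"   # placeholder

def on_new_file(event, context):
    """Cloud Function: submit a PySpark job to Dataproc for the newly added file."""
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": CLUSTER},
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/jobs/transform_json.py",  # placeholder
            "args": [f"gs://{event['bucket']}/{event['name']}"],
        },
    }
    job_client.submit_job(project_id=PROJECT, region=REGION, job=job)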
I have some datasets (27 CSV files, separated by semicolons, totalling 150+ GB) that get uploaded every week to my Cloud Storage bucket.
Currently, I use the BigQuery console to organize that data manually, declaring the variables and changing the filenames 27 times. The first file replaces the entire previous database, then the other 26 get appended to it. The filenames are always the same.
How can I do it using Python?
Please check out the Cloud Functions functionality. It allows you to use Python. After the function is deployed, cron jobs can be created. Here is a related question:
Run a python script on schedule on Google App Engine
Also, here is an article which describes how to load data from Cloud Storage: Loading CSV data from Cloud Storage
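As a rough sketch of what the function could do, assuming placeholder bucket/table names, that the 27 filenames stay the same every week, and that the files have a header row:

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"            # placeholder
FILENAMES = [f"file_{i:02d}.csv" for i in range(27)]   # placeholder names

for i, name in enumerate(FILENAMES):
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter=";",
        skip_leading_rows=1,
        autodetect=True,
        # The first file replaces the table, the other 26 are appended.
        write_disposition=(
            bigquery.WriteDisposition.WRITE_TRUNCATE if i == 0
            else bigquery.WriteDisposition.WRITE_APPEND
        ),
    )
    load_job = client.load_table_from_uri(f"gs://my-bucket/{name}", TABLE_ID, job_config=job_config)
    load_job.result()  # wait for each load before starting the next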
I have downloaded some weather data (wind, wave, etc.) covering a 40-year period and stored these data files, which come in NetCDF format, in Azure File Shares. I have about 8 TB of data stored in total. Each weather parameter, say wind speed, for one year for the whole earth surface is saved in a single file of about 35 GB.
Next, I developed a simple Azure website using Python and the Dash package, where the user can define a location (latitude, longitude), select a weather parameter and a date range, and submit the request.
Now, I would like to be able to run a script once the user clicks the submit button to extract the specified data, save it in a CSV file, and provide a download link to the file.
The Azure Storage File Share client library for Python (azure-storage-file-share) allows connecting to the share and downloading files. Since one year of data is a 35 GB file, downloading a whole year of data just to extract a single grid point is not an option.
Is there any way I can run scripts directly against Azure File Shares to extract the required data, and then retrieve it from the webpage?
I am trying to avoid the situation where I need to extract data from NetCDF files and push it into a SQL database, which a website can access easily.
You could mount the file store in an Azure VM. Here's an example of how someone did the same locally: Read NetCDF file from Azure file storage
Or, you might want to look at Azure Blob Storage instead. As far as I know, Azure File Storage is really meant as a replacement for networked file shares, like on a local area network. Azure Blob Storage, on the other hand, is much better suited for streaming large files like this from the cloud. There's a good set of examples for when to use which in the Azure Blob Storage overview.
Here's an example of how to download a blob from the Azure Blob Python reference:
# Assumes blob_client, local_path and local_file_name were set up earlier,
# as in the Azure Blob Storage quickstart this snippet comes from.
import os

# Download the blob to a local file
# Add 'DOWNLOAD' before the .txt extension so you can see both files in the data directory
download_file_path = os.path.join(local_path, str.replace(local_file_name, '.txt', 'DOWNLOAD.txt'))
print("\nDownloading blob to \n\t" + download_file_path)

with open(download_file_path, "wb") as download_file:
    download_file.write(blob_client.download_blob().readall())
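If the files do end up in Blob Storage, note that download_blob also accepts offset/length, so in principle you can pull just a byte range rather than the whole ~35 GB file (whether that is usable depends on how the NetCDF variables are laid out). A minimal sketch with placeholder names:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",   # placeholder
    container_name="weather",         # placeholder
    blob_name="wind_speed_1990.nc",   # placeholder
)

# Download only a 1 MiB slice of the blob instead of the whole file.
chunk = blob.download_blob(offset=0, length=1024 * 1024).readall()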
Overview
I have a GCP storage bucket which contains a .json file and 5 JPEG files. In the .json file, the image names match the JPEG file names. I want to know a way in which I can access each of the objects within the storage bucket based upon the image name.
Method 1 (Current Method):
Currently, a Python script is being used to get the images from the storage bucket. This is done by looping through the .json file of image names, getting each individual image name, then building a URL based on the bucket/image name and retrieving the image and displaying it on a Flask App Engine site.
This current method requires the bucket objects to be public, which poses a security issue, with the internet granted access to this bucket. Secondly, it is computationally expensive, with each image having to be pulled down from the bucket separately. The bucket will eventually contain 10000 images, which will result in the images being slow to load and display on the web page.
Requirement (New Method):
Is there a method by which I can pull down images from the bucket (not all the images at once) and display them on a web page? I want to be able to access individual images from the bucket and display their corresponding image data, retrieved from the .json file.
Lastly, I want to ensure that neither the bucket nor the objects are public, and that they can only be accessed via the App Engine app.
Thanks
It would be helpful to see the Python code that's doing the work right now. You shouldn't need the storage objects to be public. They should be able to be retrieved using the Google Cloud Storage (GCS) API and a service account token that has view-only permissions on storage (although depending on whether or not you know the object names and need to get the bucket name, it might require more permissions on the service account).
As for the performance, you could either do things on the front end to be smart about how many you're showing and fetch only what you want to display as the user scrolls, or you could paginate your results from the GCS bucket.
Links to the service account and API pieces here:
https://cloud.google.com/iam/docs/service-accounts
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
Information about pagination for retrieving GCS objects here:
How does paging work in the list_blobs function in Google Cloud Storage Python Client Library
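A hedged sketch of what the private-bucket version could look like on App Engine, assuming the default service account has roles/storage.objectViewer on the bucket (bucket name and routes are placeholders):

from flask import Flask, Response
from google.cloud import storage

app = Flask(__name__)
client = storage.Client()          # uses the App Engine service account credentials
BUCKET = "my-image-bucket"         # placeholder

@app.route("/images")
def list_images():
    # Paginate rather than listing all 10000 objects at once.
    blobs = client.list_blobs(BUCKET, max_results=20)
    names = [b.name for b in blobs]
    return {"names": names, "next_page_token": blobs.next_page_token}

@app.route("/image/<name>")
def serve_image(name):
    # The image bytes are fetched server-side, so the bucket can stay private.
    blob = client.bucket(BUCKET).blob(name)
    return Response(blob.download_as_bytes(), mimetype="image/jpeg")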
Is there any way to take a small data JSON object from memory and upload it to Google BigQuery without using the file system?
We have working code that uploads files to BigQuery. This code no longer works for us because our new script runs on Google App Engine which doesn't have write access to the file system.
We notice that our BigQuery upload configuration has a "MediaFileUpload" option with a data_path. We have been unable to find whether there is a way to change the data_path to something in memory, or whether there is another method apart from MediaFileUpload that could upload data from memory to BigQuery.
media_body=MediaFileUpload(
    data_path,
    mimetype='application/octet-stream'))
job = insert_request.execute()
We have seen solutions which recommend uploading data to Google Cloud Storage then referencing that file to upload to BigQuery for when datasets are large. Our dataset is small so this step seems like an unnecessary use of bandwidth, processing and time.
As a result, we are wondering if there is a way to take our small JSON object and upload it directly to BigQuery using Python?
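One possibility, sketched under assumptions (project, dataset and table IDs are placeholders, and the discovery client is built the same way as in the existing script): googleapiclient also ships MediaIoBaseUpload, which wraps an in-memory file-like object, so the existing jobs().insert(...) call can be kept and only the media_body swapped.

import io
import json
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload

bigquery_service = build("bigquery", "v2")               # uses application default credentials

rows = [{"id": 1, "name": "example"}]                    # the small JSON object, one dict per row
payload = "\n".join(json.dumps(r) for r in rows)         # newline-delimited JSON

media_body = MediaIoBaseUpload(
    io.BytesIO(payload.encode("utf-8")),                 # in memory, no filesystem access needed
    mimetype="application/octet-stream",
)

job = bigquery_service.jobs().insert(
    projectId="my-project",                              # placeholder
    body={
        "configuration": {
            "load": {
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": {
                    "projectId": "my-project",           # placeholders
                    "datasetId": "my_dataset",
                    "tableId": "my_table",
                },
            }
        }
    },
    media_body=media_body,
).execute()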