Error: upload data to Cosmos DB using pydocumentdb - Python

I'm using pydocumentdb to upload some processed data to Cosmos DB as a document on Azure with a Python script. The files all come from the same source. The ingestion works well for some files, but for files larger than 1000 KB it raises the following error:
pydocumentdb.errors.HTTPFailure: Status code: 413
"code":"RequestEntityTooLarge","message":"Message: {\"Errors\":[\"Request
size is too large\"]
I'm using the SQL API, and this is how I create the document inside a collection:
client = document_client.DocumentClient(uri, {'masterKey': cosmos_key})
... I get the Db link and Collection link ...
client.CreateDocument(collection_link, data)
How can I solve this error?

In my experience, the best practice for storing large data or files with Azure Cosmos DB is to upload the data to Azure Blob Storage (or another external store) and create an attachment in a Cosmos DB document that holds a reference to the blob along with any associated metadata.
You can refer to the REST API for Attachments to see how this works, and implement it with the pydocumentdb methods CreateAttachment, ReplaceAttachment, QueryAttachments, and so on.
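For example, a minimal sketch of that pattern, assuming the legacy azure-storage BlockBlobService for the blob upload; the storage account, container, blob, and document names below are placeholders, not from the original question:

import pydocumentdb.document_client as document_client
from azure.storage.blob import BlockBlobService

# Upload the large payload to Blob Storage (legacy azure-storage SDK).
blob_service = BlockBlobService(account_name='mystorageaccount', account_key=storage_key)
blob_service.create_blob_from_path('payloads', 'large_payload.json', '/tmp/large_payload.json')
media_url = blob_service.make_blob_url('payloads', 'large_payload.json')

# Create a small Cosmos DB document holding only metadata.
client = document_client.DocumentClient(uri, {'masterKey': cosmos_key})
doc = client.CreateDocument(collection_link, {'id': 'payload-001', 'source': 'processed-data'})

# Attach the blob to the document by reference (unmanaged media).
client.CreateAttachment(doc['_self'], {
    'id': 'payload-blob',
    'contentType': 'application/json',
    'media': media_url
})

Because the document itself stays small, the 413 RequestEntityTooLarge limit no longer applies; only the metadata and the media link live in Cosmos DB.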
Hope it helps.

Related

Python data ingestion (starting with a GET call to an API, landing in an AWS S3 bucket): how to manage the username/password/API key and a token that expires after a short time window

The data source is a SaaS server's API endpoints, and the aim is to use Python (the boto3 library) to move the data into an AWS S3 bucket.
Access to the API is granted through an authorized username/password combination and a unique API key.
Each run then has to call the API first to get a token before any further information can be fetched.
I have two questions:
1. How should I manage those secrets: save them in a config file (*.ini, *.json, *.yaml) or store them via AWS Secrets Manager?
2. The token handling is a bit challenging. The basic approach is to fetch a new token for each endpoint and then make the API call, but that ends up as far too many pipelines: if, say, 100 endpoints are needed for downstream business needs, I would have to repeat a universal template 100 times.
I am new to the Python programming world, so feel free to comment and share any use cases.
Much appreciated!
I searched and read these posts:
Saving from API to S3 bucket
How to write a file or data to an S3 object using boto3
I found this helpful:
python-decouple: store parameters in .ini or .env files.
A few options for managing (hiding) sensitive info:
a. IAM role
b. Store secrets using Parameter Store
c. Store secrets using Secrets Manager - the method currently recommended by AWS (a rough sketch of this option follows below)
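As a rough illustration of option (c) combined with reusing one token across endpoints; the secret ID, URLs, and bucket name are placeholders, not details from the original question:

import json
import boto3
import requests

def get_api_credentials(secret_id='saas/api-credentials'):
    # Fetch username/password/api-key from AWS Secrets Manager.
    sm = boto3.client('secretsmanager')
    return json.loads(sm.get_secret_value(SecretId=secret_id)['SecretString'])

def get_token(creds, auth_url='https://saas.example.com/auth/token'):
    # Exchange credentials for a short-lived token once, then reuse it.
    resp = requests.post(auth_url,
                         auth=(creds['username'], creds['password']),
                         headers={'x-api-key': creds['api_key']})
    resp.raise_for_status()
    return resp.json()['token']

def ingest_endpoints(endpoints, bucket='my-ingest-bucket'):
    creds = get_api_credentials()
    token = get_token(creds)           # one token shared by all endpoints
    s3 = boto3.client('s3')
    for path in endpoints:             # one generic loop instead of 100 copied pipelines
        resp = requests.get('https://saas.example.com' + path,
                            headers={'Authorization': 'Bearer ' + token})
        resp.raise_for_status()
        s3.put_object(Bucket=bucket, Key=path.strip('/') + '.json', Body=resp.content)

If the token expires mid-run, you can catch the 401 response and call get_token again, rather than building a separate pipeline per endpoint.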

MS Access data into Azure Blob

The data is in MS Access, on one of the shared drives on the network. I need this data in Azure Blob Storage as CSV files. Can anyone please suggest how this can be done?
You can move data to Azure Blob Storage in several ways. You could use either AzCopy (https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10) or Storage Explorer, a GUI tool (https://azure.microsoft.com/en-us/features/storage-explorer/).
Or use the Python SDK:
block_blob_service.create_blob_from_path(container, file, file)
The Python SDK can be found here: https://github.com/Azure/azure-sdk-for-python
Converting from Access to CSV is not something Azure Storage handles itself; you can try existing libraries for that conversion and then upload the result to Blob Storage.
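As a rough sketch of both steps, assuming the Microsoft Access ODBC driver is installed on a machine that can see the shared drive; the table, file, container, and account names are placeholders:

import csv
import pyodbc
from azure.storage.blob import BlockBlobService

# Export one Access table to CSV via ODBC.
conn = pyodbc.connect(r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
                      r'DBQ=\\shared-drive\data\sales.accdb')
cursor = conn.cursor()
cursor.execute('SELECT * FROM Orders')
with open('orders.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([column[0] for column in cursor.description])  # header row
    writer.writerows(cursor.fetchall())

# Upload the CSV to Blob Storage (legacy SDK, matching create_blob_from_path above).
blob_service = BlockBlobService(account_name='mystorageaccount', account_key='<account-key>')
blob_service.create_blob_from_path('access-exports', 'orders.csv', 'orders.csv')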

Connecting to Azure Table Storage from Azure Databricks

I am trying to connect to Azure Table Storage from Databricks. I can't seem to find any resources that don't go through blob containers, but I have tried modifying that approach for tables.
spark.conf.set(
    "fs.azure.account.key.accountname.table.core.windows.net",
    "accountkey")

blobDirectPath = "wasbs://accountname.table.core.windows.net/TableName"
df = spark.read.parquet(blobDirectPath)
I am making an assumption for now that tables are parquet files. I am getting authentication errors on this code now.
Based on my research, Azure Databricks does not support Azure Table Storage as a data source. For more details, please refer to https://docs.azuredatabricks.net/spark/latest/data-sources/index.html.
If you still want table-style storage, you can use the Azure Cosmos DB Table API instead, although the two have some differences. For more details, please refer to https://learn.microsoft.com/en-us/azure/cosmos-db/faq#where-is-table-api-not-identical-with-azure-table-storage-behavior.
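If you only need the table's rows inside Databricks, one possible workaround (a sketch, not an officially supported connector: it pulls entities through the azure-cosmosdb-table SDK, which also works against classic Azure Table Storage, and the account/table names are placeholders) is to read the entities client-side and convert them to a Spark DataFrame:

import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

# Read entities through the Table SDK instead of a wasbs:// path.
table_service = TableService(account_name='accountname', account_key='accountkey')
entities = table_service.query_entities('TableName')

# Convert to a Spark DataFrame via pandas (reasonable for small/medium tables).
pdf = pd.DataFrame([dict(e) for e in entities])
df = spark.createDataFrame(pdf)
df.show()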

Automatically retrieving large files via public HTTP into Google Cloud Storage

For weather processing purposes, I am looking to automatically retrieve daily weather forecast data into Google Cloud Storage.
The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 megabytes). The file size is the main issue.
After looking at previous stackoverflow topics, I have tried two unsuccessful methods:
1/ First attempt via urlfetch in Google App Engine
from google.appengine.api import urlfetch
url = "http://dcpc-nwp.meteo.fr/servic..."
result = urlfetch.fetch(url)
[...] # Code to save in a Google Cloud Storage bucket
But I get the following error message on the urlfetch line :
DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL
2/ Second attempt via the Storage Transfer Service
According to the documentation, it is possible to retrieve HTTP data into Cloud Storage directly via the Storage Transfer Service:
https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata
But it requires the size and MD5 of the files before the download. This option cannot work in my case because the website does not provide that information.
3/ Any ideas ?
Do you see any solution for automatically retrieving large files over HTTP into my Cloud Storage bucket?
3/ Workaround with a Compute Engine instance
Since it was not possible to retrieve large files from external HTTP with App Engine or directly with Cloud Storage, I have used a workaround with an always-running Compute Engine instance.
This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
For scalability, maintenance and cost reasons, I would have preferred to use only serverless services, but fortunately:
It works well on a fresh f1-micro Compute Engine instance (no extra packages required and only about $4/month if running 24/7).
Network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month).
The MD5 and size of the file can be retrieved easily and quickly with a curl -I command, as mentioned in this link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests (a Python equivalent is sketched after this list).
The Storage Transfer Service can then be configured to use that information.
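For example, a small sketch of that check in Python, equivalent to curl -I (the URL is a placeholder, and Content-MD5 is only present if the server chooses to send it):

import requests

# A HEAD request returns only the headers, so it is fast even for 300 MB files.
resp = requests.head('http://dcpc-nwp.meteo.fr/...')   # placeholder URL
size = resp.headers.get('Content-Length')   # size in bytes, as a string
md5 = resp.headers.get('Content-MD5')       # base64-encoded MD5, if provided
print(size, md5)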
Another option would be to use a serverless Cloud Function. It could look like something below in Python.
import requests

def download_url_file(url):
    try:
        print('[ INFO ] Downloading {}'.format(url))
        req = requests.get(url)
        if req.status_code == 200:
            # Download and save to /tmp (the only writable path in a Cloud Function)
            output_filepath = '/tmp/{}'.format(url.split('/')[-1])
            output_filename = '{}'.format(url.split('/')[-1])
            with open(output_filepath, 'wb') as f:
                f.write(req.content)
            print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
            return output_filename
        else:
            print('[ ERROR ] Status Code: {}'.format(req.status_code))
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
    return None  # nothing downloaded (bad status or exception)
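The file saved in /tmp still has to be copied into a bucket to complete the function; a minimal sketch of that step with the google-cloud-storage client (the bucket name is a placeholder) could be:

from google.cloud import storage

def upload_to_gcs(local_path, blob_name, bucket_name='my-weather-bucket'):
    # Copy the downloaded file from /tmp into the Cloud Storage bucket.
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)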
Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.
Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.

How to upload a small JSON object in memory to Google BigQuery in Python?

Is there any way to take a small data JSON object from memory and upload it to Google BigQuery without using the file system?
We have working code that uploads files to BigQuery. This code no longer works for us because our new script runs on Google App Engine which doesn't have write access to the file system.
We notice that our BigQuery upload configuration has a "MediaFileUpload" option with a data_path. We have been unable to find if there is some way to change the data_path to something in memory - or if there is another method apart from MediaFileUpload that could upload data from memory to BigQuery.
media_body=MediaFileUpload(
    data_path,
    mimetype='application/octet-stream'))
job = insert_request.execute()
We have seen solutions which recommend uploading data to Google Cloud Storage then referencing that file to upload to BigQuery for when datasets are large. Our dataset is small so this step seems like an unnecessary use of bandwidth, processing and time.
As a result, we are wondering if there is a way to take our small JSON object and upload it directly to BigQuery using Python?
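One direction worth trying (a sketch, not a confirmed answer: it keeps the same google-api-python-client style as the snippet above but swaps MediaFileUpload for MediaIoBaseUpload over an in-memory buffer; bigquery_service and job_config are assumed to be the existing service object and load-job configuration, with sourceFormat set to NEWLINE_DELIMITED_JSON):

import io
import json
from googleapiclient.http import MediaIoBaseUpload

# Serialize the small JSON object to newline-delimited JSON in memory.
rows = [{'name': 'alice', 'score': 42}]
buf = io.BytesIO('\n'.join(json.dumps(r) for r in rows).encode('utf-8'))

media_body = MediaIoBaseUpload(buf, mimetype='application/octet-stream')

insert_request = bigquery_service.jobs().insert(
    projectId='my-project',          # placeholder project ID
    body=job_config,                 # existing load-job configuration
    media_body=media_body)
job = insert_request.execute()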
