I am completely new to Azure. I have a Python script that does a few operations and gives me an output.
I have an Azure account, and I would like to connect to Blob Storage from the Python script to upload and read files.
1) I created an App Service and changed a few settings, e.g. set it to use Python 3.4.
2) Created a Blob Storage account with a container.
3) I connected the Blob Storage to my App Service using "Data connections" from the Mobile section.
I now want to write a Python script that will upload a file and read it from the blob for processing. I came across here and here.
I am wondering where I can write my Python script to connect to the blob and read from it. All I am seeing are options for connecting to GitHub, OneDrive, and Dropbox. Is there a way to write a Python script inside Azure? I tried reading the Azure documentation; all it says is to connect to GitHub or use the Azure SDK for Python, which is not clear to me.
I saw the Azure console, where I learned to pip install packages. Where can I open a Python environment, write code, and run and test it?
I'd recommend checking out this documentation: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python
It shows how to upload, download, and list blobs using Python (in your case, you will list and then download the blobs for processing):
Listing blobs:
# List the blobs in the container
print("\nList blobs in the container")
generator = block_blob_service.list_blobs(container_name)
for blob in generator:
    print("\t Blob name: " + blob.name)
Downloading:
# Download the blob(s).
# Add '_DOWNLOADED' before '.txt' so you can see both files in Documents.
full_path_to_file2 = os.path.join(local_path, local_file_name.replace('.txt', '_DOWNLOADED.txt'))
print("\nDownloading blob to " + full_path_to_file2)
block_blob_service.get_blob_to_path(container_name, local_file_name, full_path_to_file2)
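Both snippets assume a block_blob_service client already exists. As a rough sketch based on the same quickstart (the account name, key, and file names below are placeholders), creating the client and uploading a file looks roughly like this:
import os
from azure.storage.blob import BlockBlobService  # legacy azure-storage SDK used by the quickstart

# Placeholder credentials; copy these from the storage account's "Access keys" blade.
block_blob_service = BlockBlobService(account_name='mystorageaccount', account_key='mykey')

container_name = 'quickstartblobs'
block_blob_service.create_container(container_name)

# Upload a local file to the container.
local_path = os.path.expanduser('~/Documents')
local_file_name = 'QuickStart.txt'
full_path_to_file = os.path.join(local_path, local_file_name)
block_blob_service.create_blob_from_path(container_name, local_file_name, full_path_to_file)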
I can also suggest using Logic Apps with the Blob Storage connectors; more details can be found here: https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage
You can use triggers and actions to perform specific tasks with blobs.
Related
I am trying to upload a file from an SFTP server to a GCS bucket using a Cloud Function, but this code is not working. I am able to connect over SFTP, but when I try to upload the file to the GCS bucket it doesn't work, and the requirement is to use a Cloud Function with Python.
Any help will be appreciated. Here is the sample code I am trying. This code works except for sftp.get("test_report.csv", bucket_destination). Please help.
destination_bucket ="gs://test-bucket/reports"
with pysftp.Connection(host, username, password=sftp_password) as sftp:
print ("Connection successfully established ... ")
# Switch to a remote directory
sftp.cwd('/test/outgoing/')
bucket_destination = "destination_bucket"
sftp.cwd('/test/outgoing/')
if sftp.exists("test_report.csv"):
sftp.get("test_report.csv", bucket_destination)
else:
print("doesnt exist")
pysftp cannot work with GCS directly.
In my opinion, you cannot actually upload a file directly from SFTP to GCS anyhow, at least not from code running on yet another machine. But you can transfer the file without storing it on the intermediate machine, using pysftp Connection.open (or better, Paramiko SFTPClient.open) and the GCS API Blob.upload_from_file. That's what many people actually mean by "directly".
from google.cloud import storage

client = storage.Client(credentials=credentials, project='myproject')
bucket = client.get_bucket('mybucket')
blob = bucket.blob('test_report.csv')

with sftp.open('test_report.csv', bufsize=32768) as f:
    blob.upload_from_file(f)
For the rest of the GCP code, see How to upload a file to Google Cloud Storage on Python 3?
For the purpose of bufsize, see Reading file opened with Python Paramiko SFTPClient.open method is slow.
Consider not using pysftp; it's a dead project. Use Paramiko directly (the code will be mostly the same). See pysftp vs. Paramiko.
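Putting it together, a rough end-to-end sketch with Paramiko and the GCS client (the host, credentials, paths, and bucket name below are placeholders) could look like this:
import paramiko
from google.cloud import storage

# Placeholder SFTP connection details.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sftp.example.com", username="user", password="password")
sftp = ssh.open_sftp()

# Placeholder GCS project/bucket/object names.
client = storage.Client(project="my-project")
bucket = client.get_bucket("my-bucket")
blob = bucket.blob("reports/test_report.csv")

# Stream the remote file straight into the blob, without a local copy.
with sftp.open("/test/outgoing/test_report.csv", bufsize=32768) as f:
    blob.upload_from_file(f)

sftp.close()
ssh.close()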
I got the solution based on your reply, thanks. Here is the code:
bucket = client.get_bucket(destination_bucket)
blob = bucket.blob(destination_folder + filename)
with sftp.open(sftp_filename, bufsize=32768) as f:
    blob.upload_from_file(f)
This is exactly what we built SFTP Gateway for. We have a lot of customers who still want to use SFTP, but we needed to write files directly to Google Cloud Storage. Files don't get saved temporarily on another machine; the data is streamed directly from the SFTP client (Python, FileZilla, or any other client) straight to GCS.
https://console.cloud.google.com/marketplace/product/thorn-technologies-public/sftp-gateway?project=thorn-technologies-public
Full disclosure: this is our product, and we use it for all our consulting clients. We are happy to help you get it set up if you want to try it.
I am new to Azure Functions and am trying to write a function (blob trigger). The function reads a file uploaded to the blob (a binary .dat file), does some conversions to turn the data into a pandas DataFrame, then converts it to the .parquet file format and saves it to Azure Data Lake.
Everything works well when I run the function locally in VS Code, but when I deploy it to Azure Functions in the cloud, I get an error and the function fails.
After looking into the code and checking each step by uploading it to the cloud individually, I have found that the code works fine up to converting the .dat file to a DataFrame, but fails when I add the function that saves it to the data lake. I am using the following function, found in the Microsoft tutorials.
def write_dataframe_to_datalake(df, datalake_service_client, filesystem_name, dir_name, filename):
    file_path = f'{dir_name}/{filename}'
    file_client = datalake_service_client.get_file_client(filesystem_name, file_path)
    processed_df = df.to_parquet(index=False)
    file_client.upload_data(data=processed_df, overwrite=True, length=len(processed_df))
    file_client.flush_data(len(processed_df))
    return True
reference: https://learn.microsoft.com/en-us/azure/developer/python/tutorial-deploy-serverless-cloud-etl-05
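For context, the datalake_service_client passed in is created along these lines (the account name and key below are placeholders):
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account details.
datalake_service_client = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="my-account-key",
)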
This function works fine on Azure when I run it individually, but not with my converted DataFrame.
Can anyone identify what the problem could be? Thanks a lot!
Based on the error message we could identify the exact issue, but when an Azure Function works fine locally it should generally work after deployment as well. In a few cases, we get internal server errors when running the code after deploying the function.
Below are two ways to get rid of it:
Add Application Settings (add all the variables from local.settings.json under Configuration).
If your Azure Function interacts with Azure Storage or Blob Storage, add the API URL under CORS (or you can try adding "*", which allows all origins).
Refer to this MS Docs article for adding Application Settings.
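For context on the first point: values in local.settings.json are only read when running locally; after deployment the function reads them from the Function App's Application Settings as environment variables. A minimal sketch (the setting name STORAGE_CONNECTION_STRING is just an example):
import os
from azure.storage.filedatalake import DataLakeServiceClient

# Works locally via local.settings.json; after deployment the same setting must
# exist under Configuration > Application settings, or this will be None.
conn_str = os.environ.get("STORAGE_CONNECTION_STRING")
datalake_service_client = DataLakeServiceClient.from_connection_string(conn_str)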
The Google Cloud Storage client library is returning a 500 error when I attempt to upload via the development server.
ServerError: Expect status [200] from Google Storage. But got status 500.
I haven't changed anything with the project and the code still works correctly in production.
I've attempted gcloud components update to get the latest dev_server and I've updated to the latest google cloud storage client library.
I've run gcloud init again to make sure credentials are loaded and I've made sure I'm using the correct bucket.
The project is running on Windows 10 with Python 2.7.
Any idea why this is happening?
Thanks
Turns out this has been a problem for a while.
It has to do with how blobstore filenames are generated.
https://issuetracker.google.com/issues/35900575
The fix is to monkeypatch this file:
google-cloud-sdk\platform\google_appengine\google\appengine\api\blobstore\file_blob_storage.py
def _FileForBlob(self, blob_key):
    """Calculate full filename to store blob contents in.

    This method does not check to see if the file actually exists.

    Args:
      blob_key: Blob key of blob to calculate file for.

    Returns:
      Complete path for file used for storing blob.
    """
    blob_key = self._BlobKey(blob_key)
    # Remove bad characters.
    import re
    blob_fname = re.sub(r"[^\w\./\\]", "_", str(blob_key))
    # Make sure it's a relative directory.
    if blob_fname and blob_fname[0] in "/\\":
        blob_fname = blob_fname[1:]
    return os.path.join(self._DirectoryForBlob(blob_key), blob_fname)
I'm brand new to Azure Databricks, and my mentor suggested I complete the Machine Learning Bootcamp at
https://aischool.microsoft.com/en-us/machine-learning/learning-paths/ai-platform-engineering-bootcamps/custom-machine-learning-bootcamp
Unfortunately, after successfully setting up Azure Databricks, I've run into some issues in step 2. I successfully added the 1_01_introduction file to my workspace as a notebook. However, while the tutorial talks about teaching how to mount data in Azure Blob Storage, it seems to skip that step, which causes all of the following tutorial coding steps to throw errors. The first code bit (which the tutorial tells me to run) and the error that comes up afterwards are included below.
%run "../presenter/includes/mnt_blob"
Notebook not found: presenter/includes/mnt_blob. Notebooks can be specified via a relative path (./Notebook or ../folder/Notebook) or via an absolute path (/Abs/Path/to/Notebook). Make sure you are specifying the path correctly.
Stacktrace:
/1_01_introduction: python
As far as I can tell, the Azure Blob Storage just isn't set up yet, so the code I run (as well as the code in all of the following steps) can't find the tutorial items that are supposed to be stored in the blob. Any help you fine folks can provide would be most appreciated.
Setting up and mounting Blob Storage in Azure Databricks does take a few steps.
First, create a storage account and then create a container inside of it.
Next, make a note of the following items:
Storage account name: the name you gave the storage account when you created it.
Storage account key: this can be found in the Azure Portal on the resource page.
Container name: the name of the container.
In an Azure Databricks notebook, create variables for the above items.
storage_account_name = "Storage account name"
storage_account_key = "Storage account key"
container = "Container name"
Then, use the below code to set a Spark config to point to your instance of Azure Blob Storage.
spark.conf.set("fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name), storage_account_key)
To mount it to Azure Databricks, use the dbutils.fs.mount method. The source is the address to your instance of Azure Blob Storage and a specific container. The mount point is where it will be mounted in the Databricks File Storage on Azure Databricks. The extra configs is where you pass in the Spark config so it doesn't always need to be set.
dbutils.fs.mount(
    source = "wasbs://{0}@{1}.blob.core.windows.net".format(container, storage_account_name),
    mount_point = "/mnt/<Mount name>",
    extra_configs = {"fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name): storage_account_key}
)
With those set, you can now start using the mount. To check that it can see files in the storage account, use the dbutils.fs.ls command.
dbutils.fs.ls("dbfs:/mnt/<Mount name>")
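From there, reading data out of the mount works like reading any other DBFS path. For example, with a hypothetical CSV file in the container:
# Load a file from the mounted container into a Spark DataFrame.
df = spark.read.csv("dbfs:/mnt/<Mount name>/sample.csv", header=True, inferSchema=True)
display(df)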
Hope that helps!
I have my data on Google Cloud Platform and I want to be able to download it locally. This is my first time trying that, and eventually I'll use the downloaded data with my Python code.
I have checked the docs, like https://cloud.google.com/genomics/downloading-credentials-for-api-access and https://cloud.google.com/storage/docs/cloud-console. I have successfully got the JSON file for my first link; the second one is where I'm struggling. I'm using Python 3.5, and assuming my JSON file's name is data.json, I have added the following code:
os.environ["file"] = "data.json"
urllib.request.urlopen('https://storage.googleapis.com/[bucket_name]/[filename]')
First of all, I don't even know what the key next to environ should be called, so I just called it "file"; I'm not sure how I'm supposed to fill it in, and I got "access denied" on the second line. Obviously that's not how to download my file, as there is no destination local directory or anything in that command. Any guidance will be appreciated.
Edit:
from google.cloud.storage import Blob
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/client_secret.json"
storage_client = storage.Client.from_service_account_json('service_account.json')
client = storage.Client(project='my-project')
bucket = client.get_bucket('my-bucket')
blob = Blob('path/to/my-object', bucket)
download_to_filename('local/path/to/my-file')
I'm getting an unresolved reference for storage and download_to_filename. Should I replace service_account.json with credentials/client_secret.json? Plus, I tried to print the content of os.environ["GOOGLE_APPLICATION_CREDENTIALS"]['installed'] like I'd do with any JSON, but it just said I should give numbers, meaning it read the input path as regular text only.
You should use the idiomatic Google Cloud client library to run operations in GCS.
With the example there, and knowing that the client library will pick up the application default credentials, first we have to set the application default credentials with
gcloud auth application-default login
===EDIT===
That was the old way. Now you should use the instructions in this link.
This means downloading a service account key file from the console, and setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded JSON.
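If you prefer to set it from Python rather than the shell, you can do that in your script before creating the client (the path below is a placeholder):
import os

# Point the client library at the downloaded service account key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account.json"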
Also, make sure that this service account has the proper permissions on the project of the bucket.
Or you can create the client with explicit credentials. You'll need to download the key file all the same, but when creating the client, use:
storage_client = storage.Client.from_service_account_json('service_account.json')
==========
And then, following the example code:
from google.cloud import storage
client = storage.Client(project='project-id')
bucket = client.get_bucket('bucket-id')
blob = storage.Blob('bucket/file/path', bucket)
blob.download_to_filename('/path/to/local/save')
Or, if this is a one-off download, just install the SDK and use gsutil to download:
gsutil cp gs://bucket/file .