I'm brand new to Azure Databricks, and my mentor suggested I complete the Machine Learning Bootcamp at
https://aischool.microsoft.com/en-us/machine-learning/learning-paths/ai-platform-engineering-bootcamps/custom-machine-learning-bootcamp
Unfortunately, after successfully setting up Azure Databricks, I've run into some issues in step 2. I successfully added the 1_01_introduction file to my workspace as a notebook. However, while the tutorial says it will teach how to mount data in Azure Blob Storage, it seems to skip that step, which causes all of the subsequent coding steps to throw errors. The first code snippet (which the tutorial tells me to run) and the error that comes up afterwards are included below.
%run "../presenter/includes/mnt_blob"
Notebook not found: presenter/includes/mnt_blob. Notebooks can be specified via a relative path (./Notebook or ../folder/Notebook) or via an absolute path (/Abs/Path/to/Notebook). Make sure you are specifying the path correctly.
Stacktrace:
/1_01_introduction: python
As far as I can tell, the Azure Blob storage just isn't set up yet, and so the code I run (as well as the code in all of the following steps) can't find the tutorial items that are supposed to be stored in the blob. Any help you fine folks can provide would be most appreciated.
Setting up and mounting Blob Storage in Azure Databricks does take a few steps.
First, create a storage account and then create a container inside it.
Next, make a note of the following items:
Storage account name: The name you gave the storage account when you created it
Storage account key: This can be found in the Azure Portal on the resource page
Container name: The name of the container you created
In an Azure Databricks notebook, create variables for the above items.
storage_account_name = "Storage account name"
storage_account_key = "Storage account key"
container = "Container name"
Then, use the below code to set a Spark config to point to your instance of Azure Blob Storage.
spark.conf.set("fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name), storage_account_key)
To mount it to Azure Databricks, use the dbutils.fs.mount method. The source is the address of your instance of Azure Blob Storage and a specific container. The mount point is where it will be mounted in the Databricks File System (DBFS) on Azure Databricks. The extra_configs argument is where you pass in the Spark config so it doesn't need to be set every time.
dbutils.fs.mount(
  source = "wasbs://{0}@{1}.blob.core.windows.net".format(container, storage_account_name),
  mount_point = "/mnt/<Mount name>",
  extra_configs = {"fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name): storage_account_key}
)
With those set, you can now start using the mount. To check that it can see the files in the storage account, use the dbutils.fs.ls command.
dbutils.fs.ls("dbfs:/mnt/<Mount name>")
Hope that helps!
I am looking to connect to a delta lake in one databricks instance from a different databricks instance. I have downloaded the sparksimba jar from the downloads page. When I use the following code:
result = spark.read.format("jdbc").option('user', 'token').option('password', <password>).option('query', query).option("url", <url>).option('driver','com.simba.spark.jdbc42.Driver').load()
I get the following error:
Py4JJavaError: An error occurred while calling o287.load.: java.lang.ClassNotFoundException: com.simba.spark.jdbc42.Driver
From reading around it seems I need to register driver-class-path, but I can't find a way where this works.
I have tried the following code, but the bin/pyspark dir does not exist in my databricks env:
%sh bin/pyspark --driver-class-path $/dbfs/driver/simbaspark/simbaspark.jar --jars /dbfs/driver/simbaspark/simbaspark.jar
I have also tried:
java -jar /dbfs/driver/simbaspark/simbaspark.jar
but I get this error back: no main manifest attribute, in dbfs/driver/simbaspark/simbaspark
If you want to do that (it's really not recommended), then you just need to upload the library to DBFS and attach it to the cluster via the UI or an init script. After that it will be available to both the driver and the executors.
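As a rough sketch of the init-script route (assuming the jar is already at the DBFS path from your attempt; adjust paths to your setup), you can write a cluster-scoped init script from a notebook and then reference it in the cluster configuration:
# Write a cluster-scoped init script that copies the Simba JDBC jar
# (already uploaded to DBFS) onto every node's jar directory.
dbutils.fs.put(
  "dbfs:/databricks/scripts/install-simbaspark.sh",
  """#!/bin/bash
cp /dbfs/driver/simbaspark/simbaspark.jar /databricks/jars/
""",
  True  # overwrite if the script already exists
)
# Then add dbfs:/databricks/scripts/install-simbaspark.sh as an init script
# in the cluster configuration and restart the cluster.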
But really, as I understand it, your data is stored on DBFS in the default location (the so-called DBFS root). Storing data in the DBFS root isn't recommended, and this is pointed out in the documentation:
Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.
So you need to create a separate storage account, or a container in an existing storage account, and mount it to the Databricks workspace. This can be done from multiple workspaces, which solves the problem of sharing data between them. It's a standard recommendation for Databricks deployments in any cloud.
Here's an example code block that I use (hope it helps):
hostURL = "jdbc:mysql://xxxx.mysql.database.azure.com:3306/acme_db?useSSL=true&requireSSL=false"
databaseName = "acme_db"
tableName = "01_dim_customers"
userName = "xxxadmin#xxxmysql"
password = "xxxxxx"
df = (
  spark.read
    .format("jdbc")
    .option("url", f"{hostURL}")
    .option("databaseName", f"{databaseName}")
    .option("dbTable", f"{tableName}")
    .option("user", f"{userName}")
    .option("password", f"{password}")
    .option("ssl", True)
    .load()
)
display(df)
I have a very big folder in Google Cloud Storage, and I am currently deleting it with the following Django/Python code on Google App Engine, within the default 30-second HTTP timeout.
import logging

def deleteStorageFolder(bucketName, folder):
    from google.cloud import storage
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    logging.info("Deleting : " + folder)
    try:
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        logging.info(str(e.message))
It is really unbelievable that Google Cloud expects the application to request the information for the objects inside the folder one by one and then delete them one by one.
Obviously, this fails due to the timeout. What would be the best strategy here?
(There should be a way to delete the parent object in the bucket and have it delete all the associated child objects somewhere in the background, while we remove the associated data from our model; then Google Storage is free to delete the data whenever it wants. Yet, per my understanding, this is not how things are implemented.)
Two simple options come to mind until the client library supports deleting in batch (see https://issuetracker.google.com/issues/142641783); a sketch of both follows this list:
if the GAE image includes the gsutil CLI, you could execute gsutil -m rm ... in a subprocess
my favorite: use the gcsfs library instead of the Google library. It supports batch deleting by default - see https://gcsfs.readthedocs.io/en/latest/_modules/gcsfs/core.html#GCSFileSystem.rm
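A minimal sketch of both options, assuming application default credentials are available and using placeholder bucket/folder names:
import subprocess

import gcsfs

# Option 1: shell out to gsutil (only works if the CLI is present in the image).
subprocess.run(
    ["gsutil", "-m", "rm", "-r", "gs://my-bucket/my-folder"],
    check=True,
)

# Option 2: gcsfs batches the delete requests for you.
fs = gcsfs.GCSFileSystem()  # picks up application default credentials
fs.rm("my-bucket/my-folder", recursive=True)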
There is a workaround. You can do this in two steps:
"Move" the files you want to delete into another bucket with Storage Transfer Service.
Create a transfer from your bucket, with the filters that you want, to another bucket (create a temporary one if needed). Check the "delete from source after transfer" checkbox.
After the transfer succeeds, delete the temporary bucket. If that takes too long, you have another workaround:
Go to the bucket page
Click on Lifecycle
Set up a lifecycle rule that deletes files with age > 0 days
In both cases, you rely on Google Cloud's batch features, because doing it yourself is far, far too slow!
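If you'd rather set that lifecycle rule from code instead of the console, here is a minimal sketch with the google-cloud-storage client (the bucket name is a placeholder, and add_lifecycle_delete_rule assumes a reasonably recent version of the library):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-temp-bucket")

# Delete every object older than 0 days, i.e. everything in the bucket.
bucket.add_lifecycle_delete_rule(age=0)
bucket.patch()  # push the updated lifecycle configuration to GCS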
I am using EPH ( EventProcessorHost) class of Azure python SDK to receive events from the eventhub.
It actually uses AzureStorageCheckpointLeaseManager for checkpointing and partitioning in the storage account, but I cannot see where I can set the full path in the storage account. It directly creates files inside the specified container; I would like to give a full path inside the container. Where can I do that?
Here is my research:
In AzureStorageCheckpointLeaseManager, there is a parameter storage_blob_prefix, which should be used to set the blob prefix (i.e., a directory for the checkpoint blobs). But it does not actually work.
After going through the source code of azure_storage_checkpoint_manager.py, I can see that storage_blob_prefix is assigned to consumer_group_directory, but consumer_group_directory is never used when creating the checkpoint blob. Instead, the blob is created directly inside the container.
So the fix is to modify azure_storage_checkpoint_manager.py to use lease_container_name + consumer_group_directory when creating the checkpoint blob. I modified it and uploaded it to GitHub. It works as expected and creates a directory for the checkpoint blob.
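For context, this is roughly where storage_blob_prefix is supplied when setting up EPH. The parameter names storage_blob_prefix and lease_container_name come from the discussion above; the other constructor arguments here are assumptions based on the v1 SDK and may differ in your version:
from azure.eventprocessorhost import AzureStorageCheckpointLeaseManager

# With the patched azure_storage_checkpoint_manager.py, storage_blob_prefix
# becomes the directory inside the lease container where checkpoint blobs land.
storage_manager = AzureStorageCheckpointLeaseManager(
    storage_account_name="<storage account name>",
    storage_account_key="<storage account key>",
    lease_container_name="<container name>",
    storage_blob_prefix="checkpoints/my-consumer-group",  # hypothetical prefix
)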
I am completely new to Azure. I have a Python script that does a few operations and gives me an output.
I have an Azure account, and I would like to connect to Blob Storage from the Python script to upload and read files.
1) I created an App Service where I changed a few settings, such as using Python 3.4.
2) I created a Blob Storage account with a container.
3) I connected the Blob Storage to my App Service using "data connection" from the mobile option.
I now want to write a Python script that will upload a file and read it from the blob for processing. I came across here, here
I am wondering where I can write my Python script to connect to the blob and read from it. All I am seeing is connecting to GitHub, OneDrive, or Dropbox. Is there a way to write a Python script inside Azure? I tried reading the Azure documentation. All it says is to connect to GitHub or use the Azure SDK for Python, which is not clear to me.
I saw the Azure console, where I learned to pip install packages. Where can I open a Python environment, write code, and run and test it?
I'd recommend checking out this documentation: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python
It shows how to upload, download, and list blobs using Python (in your case you will list and then download the blobs for processing):
Listing blobs:
# List the blobs in the container
print("\nList blobs in the container")
generator = block_blob_service.list_blobs(container_name)
for blob in generator:
    print("\t Blob name: " + blob.name)
Downloading:
# Download the blob(s).
# Add '_DOWNLOADED' before '.txt' so you can see both files in Documents.
full_path_to_file2 = os.path.join(local_path, local_file_name.replace('.txt', '_DOWNLOADED.txt'))
print("\nDownloading blob to " + full_path_to_file2)
block_blob_service.get_blob_to_path(container_name, local_file_name, full_path_to_file2)
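The same quickstart also covers uploading. Here's a minimal sketch using the same legacy block_blob_service client (the local file name and paths are placeholders):
# Upload a local file to the container so it can be processed later.
local_file_name = "input.txt"
full_path_to_file = os.path.join(local_path, local_file_name)
block_blob_service.create_blob_from_path(container_name, local_file_name, full_path_to_file)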
I can suggest using Logic Apps with the blob connectors, more details can be found here: https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage
You can use triggers (actions) to perform specific tasks with blobs.
I have my data on Google Cloud Platform and I want to be able to download it locally; this is my first time trying that, and eventually I'll use the downloaded data with my Python code.
I have checked the docs, like https://cloud.google.com/genomics/downloading-credentials-for-api-access and https://cloud.google.com/storage/docs/cloud-console. I have successfully got the JSON file for the first link; the second one is where I'm struggling. I'm using Python 3.5, and assuming my JSON file's name is data.json, I have added the following code:
os.environ["file"] = "data.json"
urllib.request.urlopen('https://storage.googleapis.com/[bucket_name]/[filename]')
First of all, I don't even know what I should call the value next to environ, so I just called it file; I'm not sure how I'm supposed to fill it in. I also got access denied on the second line. Obviously this is not how to download my file, as there is no destination local directory or anything in that command. Any guidance will be appreciated.
Edit:
from google.cloud.storage import Blob
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/client_secret.json"
storage_client = storage.Client.from_service_account_json('service_account.json')
client = storage.Client(project='my-project')
bucket = client.get_bucket('my-bucket')
blob = Blob('path/to/my-object', bucket)
download_to_filename('local/path/to/my-file')
I'm getting unresolved references for storage and download_to_filename, and should I replace service_account.json with credentials/client_secret.json? Also, I tried to print the content of os.environ["GOOGLE_APPLICATION_CREDENTIALS"]['installed'] like I would with any JSON, but it just said I should give numbers, meaning it read the input path as regular text only.
You should use the idiomatic Google Cloud library to run operations in GCS.
With the example there, and knowing that the client library will pick up the application default credentials, first we have to set the application default credentials with
gcloud auth application-default login
===EDIT===
That was the old way. Now you should use the instructions in this link.
This means downloading a service account key file from the console, and setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded JSON.
Also, make sure that this service account has the proper permissions on the project of the bucket.
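For example, a short sketch of the env-var approach (the key path and project ID are placeholders):
import os
from google.cloud import storage

# Point the client library at the downloaded service account key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account.json"

client = storage.Client(project="project-id")  # now authenticates with that key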
Or you can create the client with explicit credentials. You'll need to download the key file all the same, but when creating the client, use:
storage_client = storage.Client.from_service_account_json('service_account.json')
==========
And then, following the example code:
from google.cloud import storage
client = storage.Client(project='project-id')
bucket = client.get_bucket('bucket-id')
blob = storage.Blob('bucket/file/path', bucket)
blob.download_to_filename('/path/to/local/save')
Or, if this is a one-off download, just install the SDK and use gsutil to download:
gsutil cp gs://bucket/file .