I am looking to connect to a delta lake in one databricks instance from a different databricks instance. I have downloaded the sparksimba jar from the downloads page. When I use the following code:
result = spark.read.format("jdbc").option('user', 'token').option('password', <password>).option('query', query).option("url", <url>).option('driver','com.simba.spark.jdbc42.Driver').load()
I get the following error:
Py4JJavaError: An error occurred while calling o287.load.: java.lang.ClassNotFoundException: com.simba.spark.jdbc42.Driver
From reading around it seems I need to register driver-class-path, but I can't find a way where this works.
I have tried the following code, but the bin/pyspark dir does not exist in my databricks env:
%sh bin/pyspark --driver-class-path $/dbfs/driver/simbaspark/simbaspark.jar --jars /dbfs/driver/simbaspark/simbaspark.jar
I have also tried:
java -jar /dbfs/driver/simbaspark/simbaspark.jar
but I get this error back: no main manifest attribute, in dbfs/driver/simbaspark/simbaspark
If you want to do that (it's really not recommended), then you just need to upload this library to DBFS, and attach it to the cluster via UI or the init script. After that it will be available for both driver & executors.
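Once the JAR is attached to the cluster as a library, the read from the question should be able to resolve the driver class. A minimal sketch of that read follows; the helper names `build_jdbc_options` and `read_remote_query` are made up for illustration, and the URL and password are whatever you already pass in the question:

```python
def build_jdbc_options(url, password, query):
    """Collect the JDBC reader options from the question into one dict."""
    return {
        "url": url,
        "user": "token",        # literal user name, as in the question
        "password": password,   # the password from the question
        "query": query,
        "driver": "com.simba.spark.jdbc42.Driver",
    }


def read_remote_query(spark, url, password, query):
    """Run `query` over JDBC; assumes the Simba JAR is attached to the cluster."""
    options = build_jdbc_options(url, password, query)
    return spark.read.format("jdbc").options(**options).load()
```

In a notebook you would call `read_remote_query(spark, "<url>", "<password>", query)`. If the `ClassNotFoundException` persists, the library is most likely not attached to the cluster that runs the notebook.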
But really, as I understand it, your data is stored on DBFS in the default location (the so-called DBFS root). Storing data in the DBFS root isn't recommended, and this is pointed out in the documentation:
Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.
So you need to create a separate storage account, or a container in an existing storage account, and mount it to the Databricks workspace. The same storage can be mounted to multiple workspaces, which solves the problem of sharing data between them. It's a standard recommendation for Databricks deployments in any cloud.
Here's an example code block that I use (hope it helps)
hostURL = "jdbc:mysql://xxxx.mysql.database.azure.com:3306/acme_db?useSSL=true&requireSSL=false"
databaseName = "acme_db"
tableName = "01_dim_customers"
userName = "xxxadmin@xxxmysql"
password = "xxxxxx"
df = (
    spark.read
    .format("jdbc")
    .option("url", hostURL)
    .option("databaseName", databaseName)
    .option("dbTable", tableName)
    .option("user", userName)
    .option("password", password)
    .option("ssl", True)
    .load()
)
display(df)
I've been following this tutorial which lets me connect to Databricks from Python and then run delta table queries. However, I've stumbled upon a problem. When I run it for the FIRST time, I get the following error:
Container container-name in account
storage-account.blob.core.windows.net not found, and we can't create
it using anoynomous credentials, and no credentials found for them in
the configuration.
When I go back to my Databricks cluster and run this code snippet
from pyspark import SparkContext
spark_context = SparkContext.getOrCreate()
if StorageAccountName is not None and StorageAccountAccessKey is not None:
    print('Configuring the spark context...')
    spark_context._jsc.hadoopConfiguration().set(
        f"fs.azure.account.key.{StorageAccountName}.blob.core.windows.net",
        StorageAccountAccessKey)
(where StorageAccountName and AccessKey are known) then run my Python app once again, it runs successfully without throwing the previous error. I'd like to ask, is there a way to run this code snippet from my Python app and at the same time reflect it on my Databricks cluster?
You just need to add these configuration options to the cluster itself, as described in the docs. You need to set the following Spark property, the same one you set in your code:
fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
For security, it's better to put the access key into a secret scope and refer to it from the Spark configuration (see docs).
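To make the secret-scope variant concrete, here is a small sketch that builds the line you would paste into the cluster's Spark config. It relies on the `{{secrets/<scope>/<name>}}` reference syntax of Databricks cluster configuration. The function name and the example account, scope, and secret names are placeholders I made up:

```python
def storage_key_spark_conf(storage_account, secret_scope, secret_name):
    """Return one 'property value' line for the cluster's Spark config.

    The value is a secret reference, so the access key itself never
    appears in the cluster definition.
    """
    prop = "fs.azure.account.key." + storage_account + ".blob.core.windows.net"
    secret_ref = "{{secrets/" + secret_scope + "/" + secret_name + "}}"
    return prop + " " + secret_ref


# Hypothetical names, for illustration only.
print(storage_key_spark_conf("mystorageaccount", "my-scope", "storage-key"))
```

The secret has to exist in that scope before the cluster starts, otherwise the reference cannot be resolved.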
I'm brand new to Azure Databricks, and my mentor suggested I complete the Machine Learning Bootcamp at
https://aischool.microsoft.com/en-us/machine-learning/learning-paths/ai-platform-engineering-bootcamps/custom-machine-learning-bootcamp
Unfortunately, after successfully setting up Azure Databricks, I've run into some issues in step 2. I successfully added the 1_01_introduction file to my workspace as a notebook. However, while the tutorial talks about teaching how to mount data in Azure Blob Storage, it seems to skip that step, which causes all of the next tutorial coding steps to throw errors. The first code bit (which the tutorial tells me to run), and the error that comes up afterwards, are included below.
%run "../presenter/includes/mnt_blob"
Notebook not found: presenter/includes/mnt_blob. Notebooks can be specified via a relative path (./Notebook or ../folder/Notebook) or via an absolute path (/Abs/Path/to/Notebook). Make sure you are specifying the path correctly.
Stacktrace:
/1_01_introduction: python
As far as I can tell, the Azure Blob storage just isn't set up yet, and so the code I run (as well as the code in all of the following steps) can't find the tutorial items that are supposed to be stored in the blob. Any help you fine folks can provide would be most appreciated.
Setting up and mounting Blob Storage in Azure Databricks does take a few steps.
First, create a storage account and then create a container inside of it.
Next, keep a note of the following items:
Storage account name: The name of the storage account when you created it
Storage account key: This can be found in the Azure Portal on the resource page.
Container name: The name of the container
In an Azure Databricks notebook, create variables for the above items.
storage_account_name = "Storage account name"
storage_account_key = "Storage account key"
container = "Container name"
Then, use the below code to set a Spark config to point to your instance of Azure Blob Storage.
spark.conf.set("fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name), storage_account_key)
To mount it to Azure Databricks, use the dbutils.fs.mount method. The source is the address of your Azure Blob Storage instance and a specific container. The mount point is where it will be mounted in the Databricks File System (DBFS). The extra_configs argument is where you pass in the Spark config, so it doesn't need to be set again every time.
dbutils.fs.mount(
    source = "wasbs://{0}@{1}.blob.core.windows.net".format(container, storage_account_name),
    mount_point = "/mnt/<Mount name>",
    extra_configs = {"fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name): storage_account_key}
)
With those set, you can now start using the mount. To check it can see files in the storage account, use the dbutils.fs.ls command.
dbutils.fs.ls("dbfs:/mnt/<Mount name>")
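Running the mount cell a second time fails because the mount point already exists, so it can help to check first. A minimal sketch, assuming a hypothetical helper name `mount_blob_container` and passing `dbutils` in explicitly so the function can also be exercised outside a notebook:

```python
def mount_blob_container(dbutils, container, storage_account, account_key, mount_point):
    """Mount a Blob Storage container unless `mount_point` is already mounted.

    Returns True if a new mount was created, False if it already existed.
    """
    # dbutils.fs.mounts() lists the current mounts; each entry has a mountPoint.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False
    dbutils.fs.mount(
        source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
        mount_point=mount_point,
        extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net": account_key
        },
    )
    return True
```

In a notebook this would be called as `mount_blob_container(dbutils, container, storage_account_name, storage_account_key, "/mnt/<Mount name>")`.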
Hope that helps!
I have several txt and csv datasets in one s3 bucket, my_bucket, and a deep learning ubuntu ec2 instance. I am using Jupyter notebook on this instance. I need to read data from s3 to Jupyter.
I searched (almost) everywhere in the AWS documentation and their forum, together with other blogs. This is the best I could do. However, even after getting both keys and restarting the instance (and AWS too), I still get an error for aws_key.
I'm wondering if anyone has run into this, or if you have a better idea for getting the data from there. I'm open to anything as long as it's not using http (which requires the data to be public). Thank you.
import pandas as pd
from smart_open import smart_open
import os
aws_key = os.environ['aws_key']
aws_secret = os.environ['aws_secret']
bucket_name = 'my_bucket'
object_key = 'data.csv'
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
Your code sample would work if you first export aws_key and aws_secret. Something like this would work (assuming bash is your shell):
export aws_key=<your key>
export aws_secret=<your aws secret>
python yourscript.py
It is best practice to export things like keys and secrets so that you are not storing confidential/secret things in your source code. If you were to hard code those values into your script and accidentally commit them to a public repo, it would be easy for someone to take over your aws account.
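On the Python side, a bare `os.environ['aws_key']` raises a `KeyError` when the variable was not exported. A small sketch that fails with a clearer message instead; the function name `load_aws_credentials` is just for illustration, and the variable names are the ones from the question:

```python
import os


def load_aws_credentials(env=None):
    """Return (key, secret) from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    # Collect every variable that is unset or empty so the error names them all.
    missing = [name for name in ("aws_key", "aws_secret") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["aws_key"], env["aws_secret"]
```

With Jupyter, the variables need to be exported in the shell that starts the notebook server, since the kernel inherits its environment from that process.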
I am answering my own question here and would like to hear from community too on different solutions: Directly access S3 data from the Ubuntu Deep Learning instance by
cd ~/.aws
aws configure
Then update the AWS key and secret key for the instance, just to make sure. Check the awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
In the above code, "aws_key" and "aws_secret" are not listed as environment variables on the Ubuntu instance, and hence os.environ cannot be used to read them. They are assigned directly instead:
aws_key = 'aws_key'
aws_secret = 'aws_secret'
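Since `aws configure` writes the keys to the `~/.aws/credentials` file in INI format, another option is to read them from there rather than typing them into the notebook. A rough sketch using only the standard library; the function name `read_aws_profile` is made up, and it assumes the default file location and a `[default]` profile:

```python
import configparser
import os


def read_aws_profile(profile="default", path=None):
    """Return (access_key_id, secret_access_key) for a profile in the credentials file."""
    path = path or os.path.expanduser("~/.aws/credentials")
    parser = configparser.ConfigParser()
    # parser.read() returns the list of files it managed to parse.
    if not parser.read(path) or not parser.has_section(profile):
        raise RuntimeError(f"Profile {profile!r} not found in {path}")
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]
```

The returned pair can then be used in place of the hard-coded `aws_key` and `aws_secret` values.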
I am completely new to Azure. I have a Python script that does a few operations and gives me an output.
I have an Azure subscription, and I would like to connect to blob storage from the Python script to upload and read files.
1) I created a app service where i changed few settings like python3.4 to use
2) created a blob storage account with container.
3) I connected the blob storage to my app service using "data connection" from mobile option.
I now want to write a python that will upload a file and reads it from the blob to process. I came across here, here
I am wondering where I can write my Python script to connect to the blob storage and read from it. All I am seeing is connecting to GitHub, OneDrive, or Dropbox. Is there a way I can write a Python script inside Azure? I tried reading the Azure documentation. All it says is to connect to GitHub or use the Azure SDK for Python, which is not clear to me.
I saw the Azure console, where I learned to pip install packages. Where can I open a Python environment, write code, run it, and test it?
I'd recommend checking out this documentation: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python
It shows how to Upload, download, and list blobs using Python (in your case you will list then download the blobs for processing):
Listing blobs:
# List the blobs in the container
print("\nList blobs in the container")
generator = block_blob_service.list_blobs(container_name)
for blob in generator:
    print("\t Blob name: " + blob.name)
Downloading:
# Download the blob(s).
# Add '_DOWNLOADED' before the '.txt' extension so you can see both files in Documents.
full_path_to_file2 = os.path.join(local_path, local_file_name.replace('.txt', '_DOWNLOADED.txt'))
print("\nDownloading blob to " + full_path_to_file2)
block_blob_service.get_blob_to_path(container_name, local_file_name, full_path_to_file2)
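The snippets above use the older `BlockBlobService` client. For the "list, then download for processing" case, here is a rough sketch written against the container client of the newer `azure-storage-blob` v12 package instead. That is a swap on my part, not what the quickstart code above uses. The function name `download_all_blobs` is hypothetical, and the client is passed in so the function itself has no SDK import:

```python
import os


def download_all_blobs(container_client, dest_dir):
    """Download every blob in the container into `dest_dir`; return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for blob in container_client.list_blobs():
        # Flatten virtual folders so "a/b.txt" becomes "a_b.txt" locally.
        local_path = os.path.join(dest_dir, blob.name.replace("/", "_"))
        with open(local_path, "wb") as handle:
            handle.write(container_client.download_blob(blob.name).readall())
        paths.append(local_path)
    return paths


# Usage with azure-storage-blob v12 (connection string and container are placeholders):
# from azure.storage.blob import BlobServiceClient
# service = BlobServiceClient.from_connection_string("<connection string>")
# download_all_blobs(service.get_container_client("<container>"), "downloads")
```

The script can run anywhere Python and the package are installed, for example on your own machine while you test.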
I can suggest using Logic Apps with the blob connectors, more details can be found here: https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage
You can use triggers (actions) to perform specific tasks with blobs.
I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. This is fine when using boto (as it's part of the API), but I can't find a definitive answer as to if PySpark supports this out of the box.
Ideally, I'd like to be able to assume a role when running in standalone mode locally and point my SparkContext to that s3 path. I've seen that non-IAM calls usually follow :
spark_conf = SparkConf().setMaster('local[*]').setAppName('MyApp')
sc = SparkContext(conf=spark_conf)
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>@some-bucket/some-key')
Does something like this exist for providing IAM info? :
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>:<MY-SESSION>@some-bucket/some-key')
or
rdd = sc.textFile('s3://<ROLE-ARN>:<ROLE-SESSION-NAME>@some-bucket/some-key')
If not, what are the best practices for working with IAM creds? Is it even possible?
I'm using Python 2.7 and PySpark 1.6.0
Thanks!
IAM role access to s3 is only supported by s3a, because it uses the AWS SDK.
You need to put hadoop-aws JAR and aws-java-sdk JAR (and third-party Jars in its package) into your CLASSPATH.
hadoop-aws link.
aws-java-sdk link.
Then set this in core-site.xml:
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
Hadoop 2.8+'s s3a connector supports IAM roles via a new credential provider; it's not in the Hadoop 2.7 release.
To use it you need to change the credential provider.
fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
fs.s3a.access.key = <your access key>
fs.s3a.secret.key = <session secret>
fs.s3a.session.token = <session token>
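From PySpark, the four properties above can be passed as Spark properties with the `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration. A sketch, assuming the session credentials come from an STS `assume_role` call made with boto3; the function name `s3a_session_conf` is made up, and the role ARN and session name are the placeholders from the question:

```python
def s3a_session_conf(credentials):
    """Map an STS `Credentials` dict to Spark properties for s3a session credentials."""
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": credentials["AccessKeyId"],
        "spark.hadoop.fs.s3a.secret.key": credentials["SecretAccessKey"],
        "spark.hadoop.fs.s3a.session.token": credentials["SessionToken"],
    }


# Usage (requires boto3, pyspark and a Hadoop 2.8+ s3a connector):
# import boto3
# from pyspark import SparkConf, SparkContext
# creds = boto3.client("sts").assume_role(
#     RoleArn="<ROLE-ARN>", RoleSessionName="<ROLE-SESSION-NAME>")["Credentials"]
# conf = SparkConf().setMaster("local[*]").setAppName("MyApp")
# conf.setAll(list(s3a_session_conf(creds).items()))
# sc = SparkContext(conf=conf)
# rdd = sc.textFile("s3a://some-bucket/some-key")
```

Session credentials expire, so a long-running job would need a fresh context (or fresh settings) once the assumed-role session runs out.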
What is in Hadoop 2.7 (and enabled by default) is the picking up of the AWS_ environment variables.
If you set the AWS env vars for session login on your local system and the remote ones then they should get picked up.
I know it's a pain, but as far as the Hadoop team are concerned, Hadoop 2.7 shipped mid-2016 and we've done a lot since then, stuff which we aren't going to backport.
IAM Role-based access to files in S3 is supported by Spark, you just need to be careful with your config. Specifically, you need:
Compatible versions of aws-java-sdk and hadoop-aws. This is quite brittle so only specific combinations work.
You must use S3AFileSystem, not NativeS3FileSystem. The former permits role-based access, whereas the latter only allows user credentials.
To find out which combinations work, go to hadoop-aws on mvnrepository here. Click through to the version of hadoop-aws you have and look for the version of the aws-java-sdk compile dependency.
To find out what version of hadoop-aws you are using, in PySpark you can execute:
sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion()
where sc is the SparkContext
This is what worked for me:
import os
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'
sc = SparkContext.getOrCreate()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark = SparkSession(sc)
df = spark.read.csv("s3a://mybucket/spark/iris/",header=True)
df.show()
It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that made it work. There is good guidance on troubleshooting s3a access here
Note especially that:
Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.
Here is a useful post containing further information.
Here's some more useful information about compatibility between the java libraries
I was trying to get this to work in the jupyter pyspark notebook. Note that the hadoop-aws version had to match the Hadoop install in the Dockerfile, i.e. here.
You could try the approach in Locally reading S3 files through Spark (or better: pyspark).
However I've had better luck with setting environment variables (AWS_ACCESS_KEY_ID etc) in Bash ... pyspark will automatically pick these up for your session.
After more research, I'm convinced this is not yet supported as evidenced here.
Others have suggested taking a more manual approach (see this blog post) which suggests to list s3 keys using boto, then parallelize that list using Spark to read each object.
The problem here (and I don't yet see how they themselves get around it) is that the s3 objects given back from listing within a bucket are not serializable/pickle-able (remember : it's suggested that these objects are given to the workers to read in independent processes via map or flatMap). Furthering the problem is that the boto s3 client itself isn't serializable (which is reasonable in my opinion).
What we're left with is the only choice of recreating the assumed-role s3 client per file, which isn't optimal or feasible past a certain point.
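One way to soften that cost, sketched here as an untested idea rather than a confirmed fix: ship only the key names (plain strings, which pickle fine) and build the client once per partition instead of once per file, using `mapPartitions`. The sketch uses boto3 rather than the older boto, and the names `read_keys_in_partition` and `make_client` are hypothetical:

```python
def read_keys_in_partition(keys, make_client, bucket):
    """Yield (key, bytes) for each key, building the S3 client once per partition."""
    client = make_client()  # created on the worker, so the client is never pickled
    for key in keys:
        yield key, client.get_object(Bucket=bucket, Key=key)["Body"].read()


# Usage with Spark (requires boto3 on the workers; names are placeholders):
# def make_client():
#     import boto3
#     creds = boto3.client("sts").assume_role(
#         RoleArn="<ROLE-ARN>", RoleSessionName="<ROLE-SESSION-NAME>")["Credentials"]
#     return boto3.client(
#         "s3",
#         aws_access_key_id=creds["AccessKeyId"],
#         aws_secret_access_key=creds["SecretAccessKey"],
#         aws_session_token=creds["SessionToken"])
#
# key_names = ["some-key-1", "some-key-2"]
# rdd = sc.parallelize(key_names).mapPartitions(
#     lambda ks: read_keys_in_partition(ks, make_client, "some-bucket"))
```

This still assumes each worker has base credentials that are allowed to assume the role, and the number of STS calls now scales with the number of partitions rather than the number of objects.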
If anyone sees any flaws in this reasoning or an alternative solution/approach, I'd love to hear it.