Connecting to Azure table storage from Azure Databricks - Python

I am trying to connect to Azure table storage from Databricks. I can't seem to find any resources that don't go through blob containers, but I have tried modifying the blob approach for tables.
spark.conf.set(
    "fs.azure.account.key.accountname.table.core.windows.net",
    "accountkey")

blobDirectPath = "wasbs://accountname.table.core.windows.net/TableName"
df = spark.read.parquet(blobDirectPath)
I am assuming for now that tables are stored as Parquet files. With this code I am currently getting authentication errors.

According to my research, Azure Databricks does not support Azure table storage as a data source. For more details, please refer to https://docs.azuredatabricks.net/spark/latest/data-sources/index.html.
Besides, if you still want to use table storage, you can use the Azure Cosmos DB Table API. But they have some differences. For more details, please refer to https://learn.microsoft.com/en-us/azure/cosmos-db/faq#where-is-table-api-not-identical-with-azure-table-storage-behavior.
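If you just need the table's rows in a Spark dataframe, one workaround is to read the table with the Azure Tables SDK on the driver and convert the result. This is only a minimal sketch, assuming the azure-data-tables package is installed on the cluster and treating the connection string and table name as placeholders:
# Read the table with the Azure Tables SDK, then convert to a Spark dataframe.
# list_entities() pulls all rows to the driver, so this only suits small tables.
import pandas as pd
from azure.data.tables import TableServiceClient

connection_string = "DefaultEndpointsProtocol=https;AccountName=accountname;AccountKey=accountkey;EndpointSuffix=core.windows.net"
service = TableServiceClient.from_connection_string(connection_string)
table_client = service.get_table_client("TableName")

entities = [dict(e) for e in table_client.list_entities()]
df = spark.createDataFrame(pd.DataFrame(entities))
df.show()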

Related

Copy an Azure table (SAS) to a database on Microsoft SQL Server

Just that: is there a way to copy an Azure table (with a SAS connection) to a database on Microsoft SQL Server? Could it be done with Python?
Thank you all!
I've tried with SSIS in Visual Studio 2019 with no success.
You can use **Azure Data Factory** or Azure Synapse to copy the data from Azure table storage to an Azure SQL database. Refer to the MS document Introduction to Azure Data Factory - Azure Data Factory | Microsoft Learn if you are new to Data Factory.
Also refer to the MS document Copy data to and from Azure Table storage - Azure Data Factory & Azure Synapse | Microsoft Learn.
I tried to repro this in my environment.
Linked services are created for Azure table storage and Azure SQL Database.
In the linked service for Azure table storage, SAS URI is selected as the authentication method, and the URL and token are given.
Similarly, the linked service for Azure SQL Database is created by giving the server name, database name, username and password.
Then a Copy activity is added, a source dataset for table storage is created, and it is set in the source settings.
Similarly, a sink dataset is created.
Once the source and sink datasets are configured in the Copy activity, the pipeline is run to copy data from table storage to Azure SQL DB.
In this way, data can be copied from Azure table storage with a SAS key to an Azure SQL Database.
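Since the question also asks about Python: a pure-Python alternative would be to read the entities with the azure-data-tables SDK using the SAS token and insert them with pyodbc. This is only a rough sketch; the server, table and column names below are placeholders, and both packages are assumed to be installed.
# Copy rows from Azure table storage (via SAS) into a SQL Server table
import pyodbc
from azure.core.credentials import AzureSasCredential
from azure.data.tables import TableClient

table = TableClient(
    endpoint="https://accountname.table.core.windows.net",
    table_name="SourceTable",
    credential=AzureSasCredential("<sas-token>"),
)

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
cursor = conn.cursor()

for entity in table.list_entities():
    # Adjust the column list to match the target table's schema
    cursor.execute(
        "INSERT INTO dbo.TargetTable (PartitionKey, RowKey, SomeColumn) VALUES (?, ?, ?)",
        entity["PartitionKey"], entity["RowKey"], entity.get("SomeColumn"),
    )

conn.commit()
conn.close()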

"SparkException: Job aborted" when Koalas writes to Azure blob storage

I am using Koalas (pandas API on Apache Spark) to write a dataframe out to a mounted Azure blob storage. When calling the df.to_csv API, Spark throws an exception and aborts the job.
Only a few of the stages seem to fail with the following error:
This request is not authorized to perform this operation using this
permission.
I am handling the data with Databricks on Azure using PySpark. The data products reside in a mounted Azure Blob storage. A service principal for Databricks was created and is assigned as "Contributor" on the Azure storage account.
When looking into the storage account, I notice that some of the first blobs were already written to the directory. Moreover, I am able to place the output in the blob storage using a "pure Python" approach with pandas. Therefore, I doubt that it has to do with authorization issues for Databricks.
This is the minimal coding example of what I used to create the error.
<Test to see if the blob storage is mounted>
# Import koalas
import databricks.koalas as ks
# Load the flatfile
df = ks.read_csv('/dbfs/spam/eggs.csv')
# Apply transformations
# Write out the dataframe
df.to_csv('/dbfs/bacon/eggs.csv')
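For reference, the mount check placeholder at the top of the snippet might look something like this (an assumption, since the original omits it; the path mirrors the output directory used above):
# List the configured mounts and the target directory to confirm the mount works
display(dbutils.fs.mounts())
display(dbutils.fs.ls("dbfs:/bacon"))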
Since there are many facets to this issue, I am uncertain where to start:
Authorization issue between the blob storage and Databricks
Incorrect setup of the Databricks cluster
Applying the wrong API method
Issue with the file content
Any leads on where to look?

Write to Google Cloud Storage using Spark to an absolute path

I am trying to write a Spark dataframe into Google Cloud Storage. This dataframe has some updates, so I need a partitioning strategy, and I need to write it to an exact file in GCS.
I have created a Spark session as follows:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("fs.gs.project.id", project_id)
    .config("fs.gs.auth.service.account.enable", "true")
    .config("fs.gs.auth.service.account.project.id", project_id)
    .config("fs.gs.auth.service.account.private.key.id", private_key_id)
    .config("fs.gs.auth.service.account.private.key", private_key)
    .config("fs.gs.auth.service.account.client.email", client_email)
    .config("fs.gs.auth.service.account.email", client_email)
    .config("fs.gs.auth.service.account.client.id", client_id)
    .config("fs.gs.auth.service.account.auth.uri", auth_uri)
    .config("fs.gs.auth.service.account.token.uri", token_uri)
    .config("fs.gs.auth.service.account.auth.provider.x509.cert.url", auth_provider_x509_cert_url)
    .config("fs.gs.auth.service.account.client_x509_cert_url", client_x509_cert_url)
    .config("spark.sql.avro.compression.codec", "deflate")
    .config("spark.sql.avro.deflate.level", "5")
    .getOrCreate())
and I am writing into GCS using
df.write.format(file_format).save('gs://'+bucket_name+path+'/'+table_name+'/file_name.avro')
Now I see that the file written to GCS is at the path
gs://bucket_name/table_name/file_name.avro/--auto assigned name--.avro
What I am expecting is for the file to be written as in Hadoop, with the final data file at
gs://bucket_name/table_name/file_name.avro
Can anyone help me achieve this?
It looks like a limitation of the standard Spark library. Maybe this answer will help.
You may also want to check an alternative way of interacting with Google Cloud Storage from Spark, using the Cloud Storage Connector with Apache Spark.
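Beyond that, a common workaround (only a sketch, reusing the variables from the question) is to write to a temporary directory and then rename the single part file to the exact target path via the Hadoop FileSystem API:
# Write to a temporary directory, then move the single part file to the exact path
temp_path = 'gs://' + bucket_name + path + '/' + table_name + '/_tmp'
final_path = 'gs://' + bucket_name + path + '/' + table_name + '/file_name.avro'

df.coalesce(1).write.format(file_format).mode("overwrite").save(temp_path)

# Reach the Hadoop FileSystem through the JVM gateway of the active SparkSession
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
jvm = spark.sparkContext._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI.create(temp_path), hadoop_conf)

# Find the part file Spark produced and rename it to the final name
part_file = [
    status.getPath()
    for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(temp_path))
    if status.getPath().getName().startswith("part-")
][0]
fs.rename(part_file, jvm.org.apache.hadoop.fs.Path(final_path))
fs.delete(jvm.org.apache.hadoop.fs.Path(temp_path), True)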

Error: upload data to Cosmos DB using pydocumentdb

I'm using pydocumentdb to upload some processed data to Cosmos DB as documents on Azure with a Python script. The files are coming from the same source. The ingestion works well for some files, but gives the following error for files that are larger than 1000 KB:
pydocumentdb.errors.HTTPFailure: Status code: 413
"code":"RequestEntityTooLarge","message":"Message: {\"Errors\":[\"Request
size is too large\"]
I'm using the SQL API and this is how I create the document inside a collection:
client = document_client.DocumentClient(uri, {'masterKey': cosmos_key})
... I get the Db link and Collection link ...
client.CreateDocument(collection_link, data)
How can I solve this error?
Per my experience, to store large data or files with Azure Cosmos DB, the best practice is to upload the data to Azure Blob Storage or another external storage and create an attachment with its reference or associated metadata in a document in Azure Cosmos DB.
You can refer to the REST API for Attachments to learn about it and achieve what you need using the methods of the pydocumentdb API, including CreateAttachment, ReplaceAttachment, QueryAttachments and so on.
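A rough sketch of that pattern, assuming the azure-storage-blob package and treating the container, blob name and connection string as placeholders (client, data and collection_link come from the snippet above): the large payload goes to Blob storage and only a small reference document goes to Cosmos DB.
# Upload the oversized payload to Blob storage, keep only a reference in Cosmos DB
import json
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<blob-connection-string>")
blob_client = blob_service.get_blob_client(container="payloads", blob="my_payload.json")
blob_client.upload_blob(json.dumps(data), overwrite=True)

# The document stored in Cosmos DB stays well under the request size limit
document = {
    "id": "my_payload",
    "payloadUrl": blob_client.url,
}
client.CreateDocument(collection_link, document)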
Hope it helps.

Access Data Lake from Azure Data Factory V2 using an on-demand HDInsight cluster

I am trying to execute a Spark job from an on-demand HDInsight cluster using Azure Data Factory.
The documentation indicates clearly that ADF (v2) does not support a Data Lake linked service for an on-demand HDInsight cluster, and one has to copy the data onto blob storage with a Copy activity and then execute the job. But this workaround seems hugely resource-expensive in the case of a billion files on a Data Lake. Is there any efficient way to access the Data Lake files, either from the Python script that executes the Spark jobs or in any other way that accesses the files directly?
P.S. Is there a possibility of doing a similar thing from v1? If yes, then how? "Create on-demand Hadoop clusters in HDInsight using Azure Data Factory" describes an on-demand Hadoop cluster that accesses blob storage, but I want an on-demand Spark cluster that accesses a Data Lake.
P.P.S. Thanks in advance.
Currently, we don't have support for the ADLS data store with the HDI Spark cluster in ADF v2. We plan to add that in the coming months. Till then, you will have to continue using the workaround you mentioned in your post above. Sorry for the inconvenience.
The Blob storage is used for the scripts and config files that the on-demand cluster will use. The scripts you write and store in the attached Blob storage can, for example, read from ADLS and write to SQL DB.
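As an illustration of that, the Spark script run by the on-demand cluster could read Data Lake Store (Gen1) directly with a service principal. This is only a rough sketch; all IDs, secrets and paths are placeholders, and the data format is assumed.
# Configure ADLS Gen1 (adl://) OAuth access on the cluster's Hadoop configuration
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-read").getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
hconf.set("fs.adl.oauth2.client.id", "<service-principal-app-id>")
hconf.set("fs.adl.oauth2.credential", "<service-principal-secret>")
hconf.set("fs.adl.oauth2.refresh.url",
          "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Read directly from the Data Lake Store account (format assumed to be Parquet)
df = spark.read.parquet("adl://<datalakestore>.azuredatalakestore.net/data/input")
df.show()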
