"SparkException: Job aborted" when Koalas writes to Azure blob storage - python

I am using Koalas (pandas API on Apache Spark) to write a dataframe out to a mounted Azure blob storage. When calling the df.to_csv API, Spark throws an exception and aborts the job.
Only a few of the stages seem to fail with the following error:
This request is not authorized to perform this operation using this
permission.
I am handling the data with Databricks on Azure using PySpark. The data products reside in a mounted Azure Blob storage. A service principal for Databricks was created and assigned the "Contributor" role on the Azure storage account.
When looking into the storage account, I notice that some of the first blobs were already created in the directory. Moreover, I am able to place the output in the blob storage using a "pure Python" approach with pandas. Therefore, I doubt that this is an authorization issue on the Databricks side.
This is a minimal code example of what I used to reproduce the error.
# Test to see if the blob storage is mounted
# Import koalas
import databricks.koalas as ks
# Load the flatfile
df = ks.read_csv('/dbfs/spam/eggs.csv')
# Apply transformations
# Write out the dataframe
df.to_csv('/dbfs/bacon/eggs.csv')
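As a side note on the paths involved (an assumption on my part, not something the error message confirms): on Databricks, Spark-based APIs such as Koalas address DBFS with `dbfs:/...` scheme paths, while the `/dbfs/...` prefix is the local FUSE mount that plain-Python libraries like pandas use. A small sketch of translating between the two conventions:

```python
def fuse_to_scheme(path: str) -> str:
    """Translate a /dbfs/... FUSE-mount path (used by local file APIs
    such as pandas) into the dbfs:/... scheme path that Spark-based
    APIs expect on Databricks."""
    prefix = "/dbfs/"
    if path.startswith(prefix):
        return "dbfs:/" + path[len(prefix):]
    return path

# pandas reads the FUSE mount; Koalas/Spark should get the scheme path.
print(fuse_to_scheme("/dbfs/spam/eggs.csv"))  # dbfs:/spam/eggs.csv
```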
Since there are many facets to this issue, I am uncertain where to start:
- an authorization issue between the blob storage and Databricks
- an incorrect setup of the Databricks cluster
- applying the wrong API method
- an issue with the file content
Any leads on where to look?

Related

Google Cloud Storage JSONs to Pandas Dataframe to Warehouse

I am a newbie in ETL. I just managed to extract a lot of information in the form of JSON files to GCS. Each JSON file contains identical key-value pairs, and now I would like to transform them into dataframes on the basis of certain key values.
The next step would be loading this into a data warehouse like Clickhouse, I guess? I was not able to find any tutorials on this process.
TLDR 1) Is there a way to transform JSON data on GCS in Python without downloading the whole data?
TLDR 2) How can I set this up to run periodically or in real time?
TLDR 3) How can I go about loading the data into a warehouse?
If these are too much, I would love it if you could point me to resources on this. Appreciate the help.
There are some ways to do this.
You can add files to storage; then a Cloud Function is triggered every time a new file is added (https://cloud.google.com/functions/docs/calling/storage) and calls an endpoint in Cloud Run (the container service - https://cloud.google.com/run/docs/building/containers) running a Python application that transforms these JSONs into a dataframe. Note that the container image will be stored in Container Registry. The Python application running on Cloud Run then saves the rows incrementally to BigQuery (the warehouse). After that you can do analytics with Looker Studio.
If you need to scale the solution to millions/billions of rows, you can add files to storage; a Cloud Function is triggered and calls Dataproc, a service where you can run Python, Anaconda, etc. (How to call google dataproc job from google cloud function). This Dataproc cluster will then structure the JSONs as a dataframe and save them to the warehouse (BigQuery).
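The transform step described above - picking certain keys out of identical-schema JSON documents to build rows - could look roughly like this (a sketch with illustrative names; the GCS read and the BigQuery load are omitted):

```python
import json

def jsons_to_rows(json_blobs, keys):
    """Pick the given keys out of each JSON document and return a list
    of row dicts, ready to be turned into a dataframe or streamed to a
    warehouse. `json_blobs` is any iterable of JSON strings, e.g. the
    contents of GCS objects."""
    rows = []
    for blob in json_blobs:
        record = json.loads(blob)
        rows.append({k: record.get(k) for k in keys})
    return rows

# Two documents sharing the same key set, as in the question.
docs = ['{"id": 1, "name": "a", "extra": true}',
        '{"id": 2, "name": "b", "extra": false}']
rows = jsons_to_rows(docs, ["id", "name"])
print(rows)  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```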

Connecting to Azure table storage from Azure databricks

I am trying to connect to Azure Table storage from Databricks. I can't seem to find any resources that don't deal with blob containers, but I have tried modifying that code for tables.
spark.conf.set(
    "fs.azure.account.key.accountname.table.core.windows.net",
    "accountkey")
blobDirectPath = "wasbs://accountname.table.core.windows.net/TableName"
df = spark.read.parquet(blobDirectPath)
I am assuming for now that tables are stored as parquet files. I am currently getting authentication errors with this code.
According to my research, Azure Databricks does not support the data source of Azure table storage. For more details, please refer to https://docs.azuredatabricks.net/spark/latest/data-sources/index.html.
Besides, if you still want to use table storage, you can use the Azure Cosmos DB Table API. But they have some differences. For more details, please refer to https://learn.microsoft.com/en-us/azure/cosmos-db/faq#where-is-table-api-not-identical-with-azure-table-storage-behavior.

Write to Google Cloud Storage using Spark to an absolute path

I am trying to write a Spark dataframe into Google Cloud Storage. This dataframe has some updates, so I need a partition strategy, and I need to write it to an exact file in GCS.
I have created a Spark session as follows:
.config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")\
.config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")\
.config("fs.gs.project.id", project_id)\
.config("fs.gs.auth.service.account.enable", "true")\
.config("fs.gs.auth.service.account.project.id",project_id)\
.config("fs.gs.auth.service.account.private.key.id",private_key_id)\
.config("fs.gs.auth.service.account.private.key",private_key)\
.config("fs.gs.auth.service.account.client.email",client_email)\
.config("fs.gs.auth.service.account.email",client_email)\
.config("fs.gs.auth.service.account.client.id",client_id)\
.config("fs.gs.auth.service.account.auth.uri",auth_uri)\
.config("fs.gs.auth.service.account.token.uri",token_uri)\
.config("fs.gs.auth.service.account.auth.provider.x509.cert.url",auth_provider_x509_cert_url)\
.config("fs.gs.auth.service.account.client_x509_cert_url",client_x509_cert_url)\
.config("spark.sql.avro.compression.codec", "deflate")\
.config("spark.sql.avro.deflate.level", "5")\
.getOrCreate())
and I am writing to GCS using:
df.write.format(file_format).save('gs://'+bucket_name+path+'/'+table_name+'/file_name.avro')
Now I see that the file written to GCS ends up at the path
gs://bucket_name/table_name/file_name.avro/--auto assigned name--.avro
What I am expecting, as in Hadoop, is the final data file at
gs://bucket_name/table_name/file_name.avro
Can anyone help me achieve this?
It looks like a limitation of the standard Spark library. Maybe this answer will help.
You may also want to check an alternative way of interacting with Google Cloud Storage from Spark, using the Cloud Storage Connector with Apache Spark.
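Since `save(path)` treats the path as a directory and emits an auto-named part file inside it, a common workaround is to write to a temporary directory and then move the single part file to the exact target name. A sketch against the local filesystem (on GCS the same rename can be done with the Hadoop FileSystem API or `gsutil mv`; names here are illustrative):

```python
import glob
import os
import shutil
import tempfile

def collect_single_part(output_dir: str, target_path: str) -> str:
    """Move the lone part-* file Spark wrote into `output_dir`
    to the exact `target_path` the caller asked for."""
    parts = glob.glob(os.path.join(output_dir, "part-*"))
    # Write with df.coalesce(1) so exactly one part file exists.
    assert len(parts) == 1, "expected a single part file"
    shutil.move(parts[0], target_path)
    return target_path

# Simulate Spark's directory-style output locally.
tmp = tempfile.mkdtemp()
out_dir = os.path.join(tmp, "file_name.avro")   # Spark made this a directory
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000-abc.avro"), "wb") as f:
    f.write(b"avro bytes")

final = collect_single_part(out_dir, os.path.join(tmp, "final.avro"))
print(os.path.basename(final))  # final.avro
```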

Access Data Lake from Azure Data Factory v2 using an on-demand HDInsight cluster

I am trying to execute a Spark job from an on-demand HDInsight cluster using Azure Data Factory.
The documentation clearly indicates that ADF (v2) does not support a Data Lake linked service for on-demand HDInsight clusters, and that one has to copy the data onto blob storage with a copy activity and then execute the job. But this workaround seems hugely resource-expensive in the case of a billion files on a data lake. Is there any efficient way to access Data Lake files, either from the Python script that executes the Spark jobs or in some other direct way?
P.S. Is there a possibility of doing something similar from v1? If yes, then how? "Create on-demand Hadoop clusters in HDInsight using Azure Data Factory" describes an on-demand Hadoop cluster that accesses blob storage, but I want an on-demand Spark cluster that accesses a data lake.
P.P.S. Thanks in advance
Currently, we don't have support for an ADLS data store with HDI Spark clusters in ADF v2. We plan to add that in the coming months. Until then, you will have to continue using the workaround you mentioned in your post above. Sorry for the inconvenience.
The Blob storage is used for the scripts and config files that the on-demand cluster will use. The scripts you write and store in the attached Blob storage can, for example, write from ADLS to SQL DB.

How to upload a small JSON object in memory to Google BigQuery in Python?

Is there any way to take a small data JSON object from memory and upload it to Google BigQuery without using the file system?
We have working code that uploads files to BigQuery. This code no longer works for us because our new script runs on Google App Engine which doesn't have write access to the file system.
We notice that our BigQuery upload configuration has a "MediaFileUpload" option with a data_path. We have been unable to find whether there is some way to change the data_path to something in memory, or whether there is another method apart from MediaFileUpload that could upload data from memory to BigQuery.
media_body=MediaFileUpload(
data_path,
mimetype='application/octet-stream'))
job = insert_request.execute()
We have seen solutions which recommend uploading data to Google Cloud Storage then referencing that file to upload to BigQuery for when datasets are large. Our dataset is small so this step seems like an unnecessary use of bandwidth, processing and time.
As a result, we are wondering if there is a way to take our small JSON object and upload it directly to BigQuery using Python?
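For what it's worth, the `googleapiclient` library that provides `MediaFileUpload` also has an in-memory counterpart, `MediaIoBaseUpload`, which wraps a file-like object such as `io.BytesIO`. A hedged sketch of building a newline-delimited JSON payload in memory (the client setup and job configuration are assumed to match the existing upload code, so the actual upload call is only indicated in comments):

```python
import io
import json

def to_ndjson_bytes(records) -> io.BytesIO:
    """Serialize a small list of dicts as newline-delimited JSON in
    memory - the format BigQuery load jobs accept for JSON data."""
    buf = io.BytesIO()
    for rec in records:
        buf.write(json.dumps(rec).encode("utf-8") + b"\n")
    buf.seek(0)
    return buf

payload = to_ndjson_bytes([{"id": 1}, {"id": 2}])

# With googleapiclient, the in-memory buffer replaces the file path:
# from googleapiclient.http import MediaIoBaseUpload
# media_body = MediaIoBaseUpload(payload, mimetype="application/octet-stream")
# job = insert_request.execute()
print(payload.getvalue().decode("utf-8"))
```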
