Write to Google Cloud Storage using Spark to an absolute path - python

I am trying to write a Spark dataframe to Google Cloud Storage. This dataframe receives updates, so I need a partition strategy, which means I need to write it to an exact file path in GCS.
I have created a Spark session as follows:
from pyspark.sql import SparkSession

spark = (SparkSession.builder\
.config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")\
.config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")\
.config("fs.gs.project.id", project_id)\
.config("fs.gs.auth.service.account.enable", "true")\
.config("fs.gs.auth.service.account.project.id",project_id)\
.config("fs.gs.auth.service.account.private.key.id",private_key_id)\
.config("fs.gs.auth.service.account.private.key",private_key)\
.config("fs.gs.auth.service.account.client.email",client_email)\
.config("fs.gs.auth.service.account.email",client_email)\
.config("fs.gs.auth.service.account.client.id",client_id)\
.config("fs.gs.auth.service.account.auth.uri",auth_uri)\
.config("fs.gs.auth.service.account.token.uri",token_uri)\
.config("fs.gs.auth.service.account.auth.provider.x509.cert.url",auth_provider_x509_cert_url)\
.config("fs.gs.auth.service.account.client_x509_cert_url",client_x509_cert_url)\
.config("spark.sql.avro.compression.codec", "deflate")\
.config("spark.sql.avro.deflate.level", "5")\
.getOrCreate())
and I am writing to GCS using:
df.write.format(file_format).save('gs://'+bucket_name+path+'/'+table_name+'/file_name.avro')
Now I see that the file written to GCS is at the path
gs://bucket_name/table_name/file_name.avro/--auto assigned name--.avro
What I expect is for the file to be written as it would be in Hadoop, so that the final data file is
gs://bucket_name/table_name/file_name.avro
Can anyone help me achieve this?

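This looks like a limitation of the standard Spark library: Spark treats the path passed to save() as a directory and writes one or more automatically named part files inside it. Maybe this answer will help.
You may also want to check an alternative way of interacting with Google Cloud Storage from Spark: the Cloud Storage connector for Apache Spark.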
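One common workaround, sketched here under assumptions rather than taken from the answer above, is to write a single part file to a temporary directory and then rename it to the exact path through the Hadoop FileSystem API that Spark already uses. The function names, tmp_dir and final_path are hypothetical. The sketch reaches the JVM through spark.sparkContext._jvm, which is an internal PySpark handle, and it assumes the GCS connector is configured as in the question.

```python
import posixpath


def find_single_part_file(paths, extension):
    """Pick the one data file out of a Spark output directory listing."""
    parts = [
        p for p in paths
        if posixpath.basename(p).startswith("part-") and p.endswith(extension)
    ]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


def save_as_single_file(spark, df, tmp_dir, final_path, file_format="avro"):
    """Write df as one file at final_path (hypothetical helper, not a Spark API)."""
    # coalesce(1) sends all rows through one task, so only one part file is written.
    df.coalesce(1).write.format(file_format).mode("overwrite").save(tmp_dir)

    # Internal PySpark handles to the JVM and to the Hadoop configuration.
    jvm = spark.sparkContext._jvm
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    Path = jvm.org.apache.hadoop.fs.Path
    fs = Path(tmp_dir).getFileSystem(hadoop_conf)

    listing = [s.getPath().toString() for s in fs.listStatus(Path(tmp_dir))]
    part = find_single_part_file(listing, "." + file_format)

    target = Path(final_path)
    if fs.exists(target):
        fs.delete(target, True)  # replace a previous copy of the file
    if not fs.rename(Path(part), target):
        raise IOError(f"could not rename {part} to {final_path}")
    fs.delete(Path(tmp_dir), True)  # remove the temporary directory and _SUCCESS marker
```

A call might look like save_as_single_file(spark, df, 'gs://bucket_name/tmp/table_name', 'gs://bucket_name/table_name/file_name.avro'). Because every row passes through one task, this only suits outputs small enough for a single executor, and on GCS a rename is a copy followed by a delete rather than an atomic operation.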

Related

Google Cloud Storage JSONs to Pandas Dataframe to Warehouse

I am new to ETL. I just managed to extract a lot of information to GCS in the form of JSON files. Each JSON file includes identical key-value pairs, and now I would like to transform them into dataframes on the basis of certain key values.
The next step would be loading this into a data warehouse like ClickHouse, I guess? I was not able to find any tutorials on this process.
TLDR 1) Is there a way to transform JSON data on GCS in Python without downloading the whole data?
TLDR 2) How can I set this up to run periodically or in real time?
TLDR 3) How can I go about loading the data into a warehouse?
If this is too much, I would love it if you could point me to resources on this. I appreciate the help.
There are a few ways to do this.
You can add files to Cloud Storage so that a Cloud Function is triggered every time a new file is added (https://cloud.google.com/functions/docs/calling/storage). The function calls an endpoint in Cloud Run (a container service, https://cloud.google.com/run/docs/building/containers) running a Python application that transforms these JSON files into a dataframe. Note that the container image will be stored in Container Registry. The Python application running on Cloud Run then saves the rows incrementally to BigQuery (the warehouse). After that, you can do analytics with Looker Studio.
If you need to scale the solution to millions or billions of rows, you can add files to Cloud Storage and have the Cloud Function call Dataproc, a service where you can run Python, Anaconda, etc. (see "How to call google dataproc job from google cloud function"). The Dataproc cluster then structures the JSON files as a dataframe and saves it to the warehouse (BigQuery).
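As a rough illustration of the first pipeline, the sketch below handles a Cloud Storage event, keeps only selected keys from the new JSON file, and streams the rows into BigQuery. It is a simplification under assumptions: the transform runs inside the function itself instead of calling Cloud Run, the entry point uses the background-function signature (event, context), and the key names and table ID are hypothetical placeholders.

```python
import json


def select_rows(json_text, wanted_keys):
    """Parse one JSON document (an object or a list of objects) into flat rows."""
    data = json.loads(json_text)
    records = data if isinstance(data, list) else [data]
    return [{key: record.get(key) for key in wanted_keys} for record in records]


def on_new_file(event, context):
    """Entry point for a Cloud Storage trigger; event carries the bucket and object name."""
    # Imported lazily so select_rows stays usable without the Google client libraries.
    from google.cloud import bigquery, storage

    wanted_keys = ["id", "name"]                 # hypothetical keys
    table_id = "my-project.my_dataset.my_table"  # hypothetical table

    # Only the newly added object is read, not the whole bucket.
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    rows = select_rows(blob.download_as_text(), wanted_keys)
    if rows:
        errors = bigquery.Client().insert_rows_json(table_id, rows)
        if errors:
            raise RuntimeError(f"BigQuery rejected rows: {errors}")
```

Since each invocation touches only the object named in the event, new data is processed as it arrives without downloading everything again.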

Is there any way to replicate real-time streaming from Azure Blob Storage to Azure MySQL?

We can basically use Databricks as an intermediary, but I'm stuck on the Python script to replicate data from Blob Storage to Azure MySQL every 30 seconds. We are using CSV files here. The script needs to store the CSVs with current timestamps.
There is no ready-made streaming option for MySQL in Spark/Databricks, as MySQL is not a streaming source/sink technology.
In Databricks you can use the writeStream .foreach() or .foreachBatch() option. With foreachBatch, each micro-batch is passed to your function as a regular dataframe, which you can save to the place of your choice (so write it to MySQL).
Personally, I would go for a simple solution. In Azure Data Factory it is enough to create two datasets (this can even be done without them), one for MySQL and one for Blob Storage, and use a pipeline with a Copy activity to transfer the data.
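The foreachBatch route mentioned above could look roughly like the following sketch. It assumes the MySQL JDBC driver is installed on the cluster, that the CSV files have a header row, and that the schema is supplied by the caller. The helper names, the loaded_at column and all connection values are hypothetical.

```python
def build_jdbc_options(host, database, table, user, password):
    """Spark JDBC writer options for a MySQL table."""
    return {
        "url": f"jdbc:mysql://{host}:3306/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def start_csv_to_mysql_stream(spark, schema, source_dir, checkpoint_dir, jdbc_options):
    """Pick up new CSV files every 30 seconds and append them to MySQL."""
    from pyspark.sql import functions as F  # imported lazily; needs a Spark runtime

    def write_batch(batch_df, batch_id):
        # Each micro-batch arrives as an ordinary dataframe; stamp it and append it.
        (batch_df.withColumn("loaded_at", F.current_timestamp())
            .write.format("jdbc")
            .options(**jdbc_options)
            .mode("append")
            .save())

    # Streaming file sources require an explicit schema.
    stream = spark.readStream.schema(schema).option("header", "true").csv(source_dir)
    return (stream.writeStream
            .foreachBatch(write_batch)
            .option("checkpointLocation", checkpoint_dir)
            .trigger(processingTime="30 seconds")
            .start())
```

The checkpoint location lets the stream remember which files it has already processed, so a restart does not reload them.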

How can I load Cloud Storage data into BigQuery using Python?

I have some datasets (27 CSV files, separated by semicolons, totaling 150+ GB) that get uploaded every week to my Cloud Storage bucket.
Currently, I use the BigQuery console to organize that data manually, declaring the variables and changing the filenames 27 times. The first file replaces the entire previous database, then the other 26 get appended to it. The filenames are always the same.
How can I do it using Python?
Please check out Cloud Functions, which lets you run Python. After the function is deployed, cron jobs can be created to run it on a schedule. Here is a related question:
Run a python script on schedule on Google App Engine
Also, here is an article that describes how to load data from Cloud Storage: Loading CSV data from Cloud Storage
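For the load step itself, a minimal sketch with the google-cloud-bigquery client might look like this. It mirrors the manual routine from the question: the first file replaces the table and the remaining files are appended. The bucket name, file names and table ID are placeholders, and the header row and schema autodetection are assumptions about the data.

```python
def plan_loads(bucket, filenames):
    """Pair each gs:// URI with a write disposition: replace first, then append."""
    plan = []
    for index, name in enumerate(filenames):
        disposition = "WRITE_TRUNCATE" if index == 0 else "WRITE_APPEND"
        plan.append((f"gs://{bucket}/{name}", disposition))
    return plan


def run_loads(table_id, bucket, filenames):
    """Run one BigQuery load job per file, in order."""
    from google.cloud import bigquery  # imported lazily; needs google-cloud-bigquery

    client = bigquery.Client()
    for uri, disposition in plan_loads(bucket, filenames):
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            field_delimiter=";",      # files are semicolon-separated
            skip_leading_rows=1,      # assumption: each file starts with a header row
            autodetect=True,          # assumption: let BigQuery infer the schema
            write_disposition=disposition,
        )
        # result() blocks until the load job finishes and raises if it failed.
        client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```

Because the filenames are always the same, the list can be fixed in the script and the function triggered on the weekly schedule.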

"SparkException: Job aborted" when Koalas writes to Azure blob storage

I am using Koalas (pandas API on Apache Spark) to write a dataframe out to a mounted Azure blob storage. When calling the df.to_csv API, Spark throws an exception and aborts the job.
Only a few of the stages seem to fail with the following error:
This request is not authorized to perform this operation using this permission.
I am handling the data with Databricks on Azure using PySpark. The data products reside in a mounted Azure Blob storage. A service principal for Databricks was created and assigned the "Contributor" role on the Azure storage account.
When looking into the storage account, I notice that some of the first blobs were already prepared in the directory. Moreover, I am able to place the output in the blob storage using a "pure Python" approach with pandas. Therefore, I doubt that it has to do with authorization issues for Databricks.
This is a minimal code example that reproduces the error.
<Test to see if the blob storage is mounted>
# Import koalas
import databricks.koalas as ks
# Load the flatfile
df = ks.read_csv('/dbfs/spam/eggs.csv')
# Apply transformations
# Write out the dataframe
df.to_csv('/dbfs/bacon/eggs.csv')
Since there are many facets to this issue, I am uncertain where to start:
Authorization issue between the blob storage and Databricks
Incorrect setup of the Databricks cluster
Applying the wrong API method
Issue with the file content
Any leads on where to look?

Access Data Lake from Azure Data Factory V2 using an on-demand HDInsight cluster

I am trying to execute a Spark job on an on-demand HDInsight cluster using Azure Data Factory.
The documentation clearly indicates that ADF (v2) does not support a Data Lake linked service for an on-demand HDInsight cluster, and that one has to copy the data to Blob storage with a Copy activity and then execute the job. But this workaround seems hugely resource-expensive when there are a billion files in the data lake. Is there an efficient way to access the Data Lake files, either from the Python script that executes the Spark jobs or in some other way that accesses the files directly?
P.S. Is there a possibility of doing something similar from v1? If yes, then how? "Create on-demand Hadoop clusters in HDInsight using Azure Data Factory" describes an on-demand Hadoop cluster that accesses Blob storage, but I want an on-demand Spark cluster that accesses the data lake.
P.P.S. Thanks in advance.
Currently, we don't have support for the ADLS data store with an HDI Spark cluster in ADF v2. We plan to add that in the coming months. Until then, you will have to continue using the workaround you mentioned in your post above. Sorry for the inconvenience.
The Blob storage is used for the scripts and config files that the on-demand cluster will use. The scripts you write and store in the attached Blob storage can themselves read from ADLS and write to SQL DB, for example.
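A script stored in the attached Blob storage could reach the data lake directly along the lines of the sketch below. It assumes ADLS Gen1 (adl:// URIs), a service principal with access to the store, and a cluster image that ships the Hadoop ADLS connector. The function names and all credential values are hypothetical, and the Hadoop configuration is reached through an internal PySpark handle.

```python
def adls_gen1_oauth_conf(tenant_id, client_id, client_secret):
    """Hadoop settings for service-principal access to ADLS Gen1."""
    return {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": client_id,
        "fs.adl.oauth2.credential": client_secret,
        "fs.adl.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def read_csv_from_adls(spark, conf, adl_path):
    """Apply the settings to the running session and read CSV files from ADLS."""
    # _jsc is an internal PySpark handle to the Java SparkContext.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    for key, value in conf.items():
        hadoop_conf.set(key, value)
    return spark.read.csv(adl_path, header=True)
```

In practice the client secret would come from a secure store rather than being written into the script.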
