We are trying to load a largish Excel file from a mounted Azure Data Lake location using PySpark on Databricks.
We have tried both pyspark.pandas and spark-excel to load it, without much success.
PySpark.Pandas
import pyspark.pandas as ps
df = ps.read_excel("dbfs:/mnt/aadata/ds/data/test.xlsx", engine="openpyxl")
We get the following conversion error:
ArrowTypeError: Expected bytes, got a 'int' object
spark-excel
df=spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema","false") \
.load('dbfs:/mnt/aadata/ds/data/test.xlsx')
We are able to load a smaller file, but a larger file gives the following error
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 185,568,653, but the maximum length for this record type is 100,000,000.
Is there any other way to load Excel files in Databricks with PySpark?
There is probably some unusual formatting or a special character in your Excel file that is preventing it from loading. Save the Excel file as a CSV file and try again. A CSV file is plain text and should load easily, whereas an Excel workbook can contain all kinds of embedded formatting and other artifacts that the readers may trip over.
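Once the workbook has been exported to CSV and uploaded to the same mounted location, a read along these lines should work (a minimal sketch; the test.csv path and the options are assumptions, not taken from the question):
# Read the exported CSV from the mounted Data Lake location
df = (
    spark.read
    .option("header", "true")       # first row holds the column names
    .option("inferSchema", "true")  # let Spark guess the column types
    .csv("dbfs:/mnt/aadata/ds/data/test.csv")
)
df.printSchema()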
Related
I am working on a VPN server for a company, and they gave us their data as a large number of parquet files. However, we are having problems loading all of the data. Loading a few files works, but we have 103 parquet files to load to get the full dataset. When we try this, the server throws a kernel error.
We are working with Python.
Has anyone had this problem before and found a solution?
We have tried concatenating a few parquet files and converting the result to CSV so it is easier to load, but this also gives a kernel error. The files contain a lot of string columns, so they take up a lot of memory.
We have also tried various options such as low_memory and loading only certain columns, but these give the same error.
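Roughly what we tried for loading only certain columns (a sketch; the file location and column names are placeholders, not our real data):
import glob
import pandas as pd

# Read only the columns we actually need from each parquet file,
# then concatenate the pieces into one DataFrame.
files = glob.glob("data/*.parquet")
parts = [pd.read_parquet(f, columns=["col_a", "col_b"]) for f in files]
df = pd.concat(parts, ignore_index=True)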
I am trying to create a table from a CSV file stored in an Azure Storage Account, using the code below in Azure Databricks. The notebook is in Python.
%sql
drop table if exists customer;
create table customer
using csv
options ( path "/mnt/datalake/data/Customer.csv", header "True", mode "FAILFAST", inferSchema "True");
I am getting the below error.
Unable to infer schema for CSV. It must be specified manually.
Does anyone have any idea how to resolve this error?
I have reproduced the above and got the results below.
This is my CSV file in the data container.
This is my mount:
I mounted this, and when I tried to create the table from the CSV file, I got the same error.
The above error arises when we don't give the correct path to the CSV file. In the file path, after /mnt, give the mount point (here onemount for me), not the container name, since we have already mounted up to the container.
Result:
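For example, with the mount point above, the create statement would look roughly like this (a sketch run from a Python cell; /mnt/onemount/data/Customer.csv is illustrative, so substitute your own mount point and file path):
# Path goes through the mount point, not the container name
spark.sql("drop table if exists customer")
spark.sql("""
    create table customer
    using csv
    options (
        path "/mnt/onemount/data/Customer.csv",
        header "True",
        mode "FAILFAST",
        inferSchema "True"
    )
""")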
I am quite new to PySpark. I am trying to read and then save a CSV file using Azure Databricks.
After saving the file I see many other files like "_Committed", "_Started", "_Success", and finally the CSV file with a totally different name.
I have already tried DataFrame repartition(1) and coalesce(1), but these only help when the CSV output itself was split into multiple partitions by Spark. Is there anything that can be done using PySpark?
You can do the following:
df.toPandas().to_csv("path/to/file.csv")
This will create a single CSV file, as you expect.
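Note that toPandas() collects the entire DataFrame onto the driver, so this approach is only suitable when the data fits in driver memory.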
Those are default log files created when saving from PySpark; we can't eliminate them.
Using coalesce(1) you can save the output as a single part file instead of multiple partitions.
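If you also need the part file to have a predictable name, one common workaround looks roughly like this (a sketch; the /mnt/output paths and final_name.csv are placeholders):
# Write a single part file to a temporary folder, then copy it to a stable name
tmp_dir = "/mnt/output/tmp_csv"
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)

# Locate the part-*.csv file Spark produced and copy it to the final name
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, "/mnt/output/final_name.csv")
dbutils.fs.rm(tmp_dir, True)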
I have a UDF which does the following with a DataFrame where a column contains the locations of zip files in Azure Blob Storage (I tested the UDF without Spark and it worked):
downloads the specified file from the blob and saves it somewhere on the executor/driver
extracts a certain file from the zip and saves it on the executor/driver
With this UDF, I find it is just as slow as simply looping over the files in Python. So is it even possible to do this kind of task in Spark? I wanted to use Spark to parallelize the download and unzipping to speed it up.
I connected via SSH to the executor and the driver (it is a test cluster, so it only has one of each) and found that the data was processed only on the executor; the driver did not do anything at all. Why is that?
The next step would be to read the extracted files (normal CSVs) into a Spark DataFrame. But how can this be done if the files are distributed over the executor and driver? I have not yet found a way to access the storage of the executors. Or is it somehow possible to define a common location within the UDF to write the files back to a location on the driver?
I would then like to read the extracted files with:
data_frame = (
    spark
    .read
    .format('csv')
    .option('header', True)
    .option('delimiter', ',')
    .load("/mydriverpath/*.csv")
)
If there is another method to parallelize the download and unzipping of the files I would be happy to hear about it.
PySpark readers / writers make it easy to read and write files in parallel. When working in Spark, you generally should not loop over files or save data on the driver node.
Suppose you have 100 gzipped CSV files in the my-bucket/my-folder directory. Here's how you can read them into a DataFrame in parallel:
df = spark.read.csv("my-bucket/my-folder")
And here's how you can write them to 50 Snappy compressed Parquet files (in parallel):
df.repartition(50).write.parquet("my-bucket/another-folder")
The readers / writers do all the heavy lifting for you. See here for more info about repartition.
I'm trying to save a DataFrame to a path as Parquet files. The issue is that the display() function shows a bunch of results under "Prop_0", but whenever I try to save them, only the first one gets converted and written to the path.
The code I'm using is:
dbutils.fs.rm(Path_1, True)
avroFile = spark.read.format('com.databricks.spark.avro').load(Path_1)
avroFile.write.mode("overwrite").save(Path_2, format="parquet")
This is expected behaviour: Spark uses the Hadoop file format, which writes the data in partitions; that's why you get part- files.
I'm able to run the above code without any issue.
You may use the method below to save a Spark DataFrame as Parquet files.
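As a rough sketch (reusing the avroFile and Path_2 names from the question; not necessarily the exact method meant here), a Parquet write of this shape does the same thing:
# Write the DataFrame out as Parquet; Spark creates one part-*.parquet file per partition under Path_2
avroFile.write.mode("overwrite").parquet(Path_2)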