Creating a Parquet file with PySpark on an AWS EMR cluster - python

I'm trying to spin up a Spark cluster with Databricks' CSV package so that I can create Parquet files, and also do some other work with Spark.
This is being done within AWS EMR, so I don't think I'm putting these options in the correct place.
This is the command I want to send to the cluster as it spins up: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --master yarn --driver-memory 4g --executor-memory 2g. I've tried putting this on a Spark step - is this correct?
If the cluster spun up without that being properly installed, how do I start up PySpark with that package? Is this correct: pyspark --packages com.databricks:spark-csv_2.10:1.4.0? I can't tell whether it was installed properly or not, and I'm not sure which functions to test.
And in regards to actually using the package, is this correct for creating a parquet file:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
# Is it this (option 1)?
df.write.parquet("s3n://bucketname/nation_parquet.parquet")
# Or this (option 2)?
df.select('nation_id', 'name', 'some_int', 'comment').write.parquet('com.databricks.spark.csv').save('s3n://bucketname/nation_parquet.tbl')
I'm not able to find any recent (mid 2015 and later) documentation regarding writing a Parquet file.
Edit:
Okay, now I'm not sure if I'm creating my DataFrame correctly. If I try to run some select queries on it and show the result set, I don't get anything back, just an error. Here's what I tried running:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
df.registerTempTable("region2")
tcp_interactions = sqlContext.sql(""" SELECT nation_id, name, comment FROM region2 WHERE nation_id > 1 """)
tcp_interactions.show()
# I get some weird Java error:
#Caused by: java.lang.NumberFormatException: For input string: "0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|"
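A minimal sketch of what option 1 would look like with an explicit delimiter, assuming the .tbl file is pipe-delimited (as the error above suggests) and that customSchema is already defined; untested against this exact cluster:
# Sketch, assuming a pipe-delimited TPC-H style .tbl file and an existing customSchema.
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='false', delimiter='|') \
    .load('s3n://bucketname/nation.tbl', schema=customSchema)

# Option 1 style: write the DataFrame (or a .select(...) of it) straight to Parquet.
df.write.parquet('s3n://bucketname/nation_parquet.parquet')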

Related

PySpark DataFrame writing empty (zero bytes) files

I'm working with the PySpark DataFrame API with Spark version 3.1.1 on a local setup. After reading in data, performing some transformations, etc., I save the DataFrame to disk. The output directories get created, along with a part-0000* file, and a _SUCCESS file is present in the output directory as well. However, my part-0000* file is always empty, i.e. zero bytes.
I've tried writing it in both parquet as well as csv formats with the same result. Just before writing, I called df.show() to make sure there is data in the DataFrame.
### code.py ###
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import configs
spark = SparkSession.builder.appName('My Spark App').getOrCreate()
data = spark.read.csv(configs.dataset_path, sep=configs.data_delim)
rdd = data.rdd.map(...)
data = spark.createDataFrame(rdd)
data = data.withColumn('col1', F.lit(1))
data.show() # Shows top 20 rows with data
data.write.parquet(save_path + '/dataset_parquet/', mode='overwrite') # Zero Bytes
data.write.csv(save_path + '/dataset_csv/', mode='overwrite') # Zero Bytes
I'm running this code as follows:
export PYSPARK_PYTHON=python3
$SPARK_HOME/bin/spark-submit \
--master local[*] \
code.py
So I ran into a similar issue with PySpark, and one thing I also noticed is that when I set the mode to overwrite, it failed as well. The overwrite failed partway through the write: it would create the file, fail, retry, and the retry would then fail with 'file already exists' because it was already past the point in its process where it handles the overwrite.
So I added a cache to force evaluation; like your .show() above, I was doing a data.cache().count(). The count showed records, but any further evaluation using show or write treated the DataFrame as empty.
So try adding .cache() to the first reference of that DataFrame and see if it fixes your issue. It did for me.
from pyspark.sql import functions as F

df_bad = df_cln.filter(F.col('isInvalid')) \
    .select(F.concat(F.col('line'),
                     F.lit(">> LINE:"),
                     F.col('monotonically_increasing_id')).alias("line"),
            F.col('monotonically_increasing_id'))

removed_file_cnt = df_bad.cache().count()
print(f"The count of the records still containing udf chars in the file: {removed_file_cnt}")

if removed_file_cnt > 0:
    df_bad.coalesce(1) \
        .orderBy('monotonically_increasing_id') \
        .drop('monotonically_increasing_id') \
        .write.option("ignoreTrailingWhiteSpace", "false").option("encoding", "UTF-8") \
        .format('text').save(s3_error_bucket_path, mode='overwrite')
Alternatively, consider using .localCheckpoint() on the DataFrame. It is fast and convenient; since we can always restart the job, there is essentially no critical need for a full checkpoint.
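A minimal sketch of the localCheckpoint() variant, assuming data is the DataFrame from the question; localCheckpoint() returns a new, materialized DataFrame, so write out its return value:
# Sketch only: truncate the lineage and materialize the DataFrame before writing.
data = data.localCheckpoint()
data.write.parquet(save_path + '/dataset_parquet/', mode='overwrite')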

read parquet file from s3 using pyspark issue

I am trying to read Parquet files from S3, but it kills my server (it processes for a very long time and I must reset the machine in order to continue working).
There is no issue writing the Parquet file to S3, and writing and reading locally works perfectly. Reading small files from S3 also causes no issues.
As seen in many threads, Spark's "s3a" file system client (the 2nd config here) should be able to handle it, but in fact I get a 'NoSuchMethodError' when trying to use s3a (with the proper s3a configuration listed below):
Py4JJavaError: An error occurred while calling o155.json.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)
The s3 configuration below works, but only for small files; the s3a configuration is the one that throws the error. Here are both SparkSession configs:
s3 config:
spark = SparkSession.builder.appName('JSON2parquet') \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config('fs.s3.awsAccessKeyId', 'myAccessId') \
    .config('fs.s3.awsSecretAccessKey', 'myAccessKey') \
    .config('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem') \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .config("spark.sql.parquet.mergeSchema", "false") \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .config("spark.speculation", "false") \
    .getOrCreate()
s3a config:
spark = SparkSession.builder.appName('JSON2parquet') \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config('spark.hadoop.fs.s3a.access.key', 'myAccessId') \
    .config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey') \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .config("spark.sql.parquet.mergeSchema", "false") \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .config("spark.speculation", "false") \
    .getOrCreate()
JARs for S3 read/write (spark.driver.extraClassPath):
hadoop-aws-2.7.3.jar
hadoop-common-2.7.3.jar (added in order to use s3a)
aws-java-sdk-s3-1.11.156.jar
Is there any other .config I can use to solve this issue?
Thanks,
Mosh.
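One thing worth checking, offered as an assumption rather than a confirmed fix: hadoop-aws 2.7.x was compiled against the monolithic aws-java-sdk 1.7.4, so mixing it with aws-java-sdk-s3 1.11.x is a common cause of exactly this TransferManager NoSuchMethodError. A sketch of letting Spark resolve a matching pair instead of hand-picking jars (versions illustrative):
from pyspark.sql import SparkSession

# Sketch: spark.jars.packages pulls a consistent hadoop-aws / aws-java-sdk pair.
# This only takes effect if set before the JVM starts (i.e. when the session is
# created fresh); otherwise pass the same coordinates via --packages to spark-submit.
spark = SparkSession.builder.appName('JSON2parquet') \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4") \
    .config("spark.hadoop.fs.s3a.access.key", "myAccessId") \
    .config("spark.hadoop.fs.s3a.secret.key", "myAccessKey") \
    .getOrCreate()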

Can't read .xlsx file on Azure Databricks

I'm using Azure Databricks notebooks with Python, and I'm having trouble reading an Excel file and putting it into a Spark DataFrame.
I saw that there are other topics about the same problem, but the solutions don't seem to work for me.
I tried the following solution:
https://sauget-ch.fr/2019/06/databricks-charger-des-fichiers-excel-at-scale/
I did add the credentials to access my files on Azure Data Lake.
After installing all the libraries I needed, I'm running this code:
import pandas as pd
import xlrd
import azure.datalake.store
filePathBsp = projectFullPath + "BalanceShipmentPlan_20190724_19h31m37s.xlsx";
bspDf = pd.read_excel(AzureDLFileSystem.open(filePathBsp))
There, I use AzureDLFileSystem.open to get the file from Azure Data Lake, because pd.read_excel doesn't let me fetch the file from the Lake directly.
The problem is, it gives me this error:
TypeError: open() missing 1 required positional argument: 'path'
I'm sure I can access this file, because when I try:
spark.read.csv(filePathBsp)
it finds my file.
Any ideas?
OK, after long days of research, I've finally found the solution.
Here it is!
First, you have to install the "spark-excel" library on your cluster.
Here's the page for this library: https://github.com/crealytics/spark-excel
You also need the "spark_hadoopOffice" library, or you'll get the following exception later:
java.io.IOException: org/apache/commons/collections4/IteratorUtils
Pay attention to the Scala version of your cluster when you download the libraries; it matters.
Then you have to mount the credentials for Azure Data Lake Storage (ADLS) this way:
# Mount point
udbRoot = "****"

configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "****",
    "dfs.adls.oauth2.credential": "****",
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/****/oauth2/token"
}

# Unmount if needed
# dbutils.fs.unmount(udbRoot)

# Mount
dbutils.fs.mount(
    source = "adl://****",
    mount_point = udbRoot,
    extra_configs = configs
)
You need to do the mount command only once.
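As a small, hedged addition: re-mounting an existing mount point raises an error, so you can guard the call with dbutils.fs.mounts(), which lists the current mounts:
# Sketch: mount only if this mount point is not already mounted.
if not any(m.mountPoint == udbRoot for m in dbutils.fs.mounts()):
    dbutils.fs.mount(source="adl://****", mount_point=udbRoot, extra_configs=configs)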
Then you can run this line of code:
testDf = spark.read.format("com.crealytics.spark.excel").option("useHeader", True).load(fileTest)
display(testDf)
Here you go! You have a Spark DataFrame from an Excel file in Azure Data Lake Storage!
It worked for me; hopefully it will help someone else.
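One caveat, hedged because the option names are version-dependent: newer releases of crealytics spark-excel renamed useHeader to header and accept a dataAddress option to pick a sheet or range, so a read on a recent version might look more like this:
# Sketch for newer spark-excel releases; check the project README for your exact version.
testDf = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("dataAddress", "'Sheet1'!A1") \
    .load(fileTest)
display(testDf)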

How to get csv on s3 with pyspark (No FileSystem for scheme: s3n)

There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.
Trying to load a simple test csv file from my s3.
Doing it locally, like below, works.
from pyspark.sql import SparkSession
from pyspark import SparkContext as sc
logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
But if I add this below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()
I get:
No FileSystem for scheme: s3n
I've also tried changing sc to spark.sparkContext, without any difference,
and swapping // and /// in the URL.
Even better, I'd rather do this and go straight to data frame:
dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
Also I am slightly AWS ignorant, so I have tried s3, s3n, and s3a to no avail.
I've been around the internet and back but can't seem to resolve the scheme error. Thanks!
I think your Spark environment doesn't have the AWS jars; you need to add them in order to use s3 or s3n.
You have to copy the required jar files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work for me.
Here my Spark version is 2.3.0 and Hadoop is 2.7.6, so you have to copy these jars from (hadoop dir)/share/hadoop/tools/lib/ to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
You must check which versions of the hadoop-* jar files are bound to the specific version of PySpark installed on your system: look in the pyspark/jars folder for files named hadoop-*.
Then pass the versions you find into your PySpark script like this:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
This is a bit tricky for newcomers to PySpark (I ran into it on my very first day with PySpark :-)).
Otherwise, I am on a Gentoo system with local Spark 2.4.2. Some suggested also installing Hadoop and copying the jars directly into Spark; they still need to be the same versions PySpark is using. So I am creating a Gentoo ebuild for these versions...
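Putting the two answers together as a hedged sketch: once matching hadoop-aws and aws-java-sdk jars are on the classpath, reading the CSV through s3a on a live SparkContext (rather than the SparkContext class imported as sc) looks roughly like this; bucket name and keys are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
sc = spark.sparkContext  # the live context, not the SparkContext class

# Credentials go into the Hadoop configuration; s3a is the maintained connector.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "bar")

dataFrame = spark.read.csv("s3a://mybucket-sparkexample/sparkexamplefile.csv")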

creating spark dataframe of .dat.gz file present in aws s3 using aws glue

I have written some PySpark code which runs in AWS Glue and tries to read a .dat.gz file. The DataFrame is created successfully, but Trim(BOTH FROM ...) is being added to the column names. Below is my code snippet.
df = spark.read.format("csv").option("header", 'false').option("delimiter", '|').load("s3://xxxxxx/xxxx/xxxxx/xxx/xxxxxxxxxx.dat.gz")
Output: the column names come back as
Trim(BOTH FROM EFF_DT), Trim(BOTH FROM SITE_NUM), Trim(BOTH FROM ARTCL_NUM), Trim(BOTH FROM SL_UOM_CD), Trim(BOTH FROM COND_TY_CD), Trim(BOTH FROM EXP_DT), Trim(BOTH FROM COND_REC_NUM), Trim(BOTH FROM MAIN_SCAN_CD), Trim(BOTH FROM PRC_COND_PRRTY_NUM), Trim(BOTH FROM PRC_COND_WIN_IND), Trim(BOTH FROM PRC_RSN_CD), Trim(BOTH FROM PRC_METH_CD), Trim(BOTH FROM PRC_AMT), Trim(BOTH FROM PRC_QTY), Trim(BOTH FROM UT_PRC_AMT), Trim(BOTH FROM PROMO_NUM), Trim(BOTH FROM BNS_BUY_NUM), Trim(BOTH FROM CURRN_CD), Trim(BOTH FROM BBY_TY_CD), Trim(BOTH FROM BBY_AMT), Trim(BOTH FROM BBY_PCT), Trim(BOTH FROM BBY_LEV_CD), Trim(BOTH FROM BBY_PRC_QTY)
But when reading any other file, I am getting the correct output.
Can anyone help me on this?
This is not a file problem, because I tried the same code on my local machine and it runs fine.
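If the goal is simply to get usable column names while the root cause is unclear, a rename pass like the following sketch strips the Trim(BOTH FROM ...) wrapper; the regex is an assumption about the exact pattern in the header:
import re

# Sketch: strip a "Trim(BOTH FROM <name>)" wrapper from every column name.
pattern = re.compile(r'^Trim\(BOTH FROM\s*(.+)\)$')
df = df.toDF(*[pattern.sub(r'\1', c) for c in df.columns])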
