read parquet file from s3 using pyspark issue - python

I am trying to read parquet files from S3, but doing so kills my server (it processes for a very long time and I must reset the machine in order to continue working).
There is no issue writing the parquet files to S3, and reading and writing locally works perfectly. Reading small files from S3 also causes no issues.
As seen in many threads, Spark's "s3a" file system client (the second config below) should be able to handle it, but in fact I get a NoSuchMethodError when trying to use s3a (with the proper s3a configuration listed below):
Py4JJavaError: An error occurred while calling o155.json.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)
The first SparkSession config below works, but only for small files:
s3 config:
spark = SparkSession.builder.appName('JSON2parquet')\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
    .config('fs.s3.awsAccessKeyId', 'myAccessId')\
    .config('fs.s3.awsSecretAccessKey', 'myAccessKey')\
    .config('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem')\
    .config("spark.sql.parquet.filterPushdown", "true")\
    .config("spark.sql.parquet.mergeSchema", "false")\
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
    .config("spark.speculation", "false")\
    .getOrCreate()
s3a config:
spark = SparkSession.builder.appName('JSON2parquet')\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
    .config('spark.hadoop.fs.s3a.access.key', 'myAccessId')\
    .config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey')\
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')\
    .config("spark.sql.parquet.filterPushdown", "true")\
    .config("spark.sql.parquet.mergeSchema", "false")\
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
    .config("spark.speculation", "false")\
    .getOrCreate()
JARs for S3 read/write (spark.driver.extraClassPath):
hadoop-aws-2.7.3.jar,
hadoop-common-2.7.3.jar (added in order to use s3a),
aws-java-sdk-s3-1.11.156.jar
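As an aside (not part of the original question): one way to avoid hand-picking JARs is to let Spark resolve a matched pair via spark.jars.packages. A minimal sketch, assuming the Hadoop 2.7.x line, which was built against aws-java-sdk 1.7.4; mixing hadoop-aws 2.7.3 with a much newer aws-java-sdk-s3, as in the list above, is exactly the kind of pairing that can produce the TransferManager NoSuchMethodError:

```python
def s3a_packages(hadoop_version: str) -> str:
    """Build a spark.jars.packages value pairing hadoop-aws with the AWS SDK
    it was compiled against. The 1.7.4 pairing is an assumption that holds
    for the Hadoop 2.7.x line; check the hadoop-aws POM for other lines."""
    return (f"org.apache.hadoop:hadoop-aws:{hadoop_version},"
            f"com.amazonaws:aws-java-sdk:1.7.4")

# Hypothetical usage:
# spark = SparkSession.builder \
#     .config("spark.jars.packages", s3a_packages("2.7.3")) \
#     .getOrCreate()
print(s3a_packages("2.7.3"))
```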
Is there any other .config I can use to solve this issue?
Thanks,
Mosh.

Related

Can't access files in S3 using PySpark code running in PyCharm

I have a basic PySpark script that tries to access S3 files.
The code is below; it reads a CSV file from one of my buckets. I keep getting the error below, although I can do the same thing with boto. Please suggest how I can fix this. I don't want to use boto3 because that defeats my purpose of using Spark.
from pyspark.sql import *
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", \
"com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key","mykey")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key","mykey")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
data = "s3a://atharv-test/huditest.csv"
authorsDf = spark.read.format('csv').option("header","true").option("inferSchema","true").load(data)
authorsDf.show()
ERROR
py4j.protocol.Py4JJavaError: An error occurred while calling o36.load.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)
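Not part of the original post, but a common diagnosis for this particular NoSuchMethodError: the Preconditions.checkArgument overload it references exists only in newer Guava, so it usually means the hadoop-aws JAR on the classpath was built for a different Hadoop/Guava generation than the one bundled with pyspark. A small sketch for spotting conflicting Guava JARs in a jars directory (the directory path is whatever your install uses):

```python
import glob
import os
import re

def guava_jars(jars_dir):
    """Return the Guava versions found as guava-*.jar files in jars_dir.
    More than one entry (or a very old one next to a hadoop-aws 3.x jar)
    points at a classpath version conflict."""
    versions = []
    for path in glob.glob(os.path.join(jars_dir, "guava-*.jar")):
        m = re.search(r"guava-([\d.]+?)(?:-jre)?\.jar$", os.path.basename(path))
        if m:
            versions.append(m.group(1))
    return sorted(versions)
```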

Read files from S3 - Pyspark [duplicate]

This question already has answers here:
Spark Scala read csv file using s3a
(1 answer)
How to access s3a:// files from Apache Spark?
(11 answers)
S3A: fails while S3: works in Spark EMR
(2 answers)
Closed 3 years ago.
I have been looking for a clear answer to this question all morning but couldn't find anything understandable.
I just started using pyspark (installed with pip) a while ago and have a simple .py file that reads data from local storage, does some processing, and writes the results locally. I currently run it with: python my_file.py
What I'm trying to do :
Use files from AWS S3 as the input and write the results to an S3 bucket.
I am able to create a bucket and load files using "boto3", but saw some options using "spark.read.csv", which is what I want to use.
What I have tried :
I tried to set up the credentials with :
spark = SparkSession.builder \
    .appName("my_app") \
    .config('spark.sql.codegen.wholeStage', False) \
    .getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "my_key_id")
spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "my_secret_key")
then :
df = spark.read.option("delimiter", ",").csv("s3a://bucket/key/filename.csv", header = True)
But get the error :
java.io.IOException: No FileSystem for scheme: s3a
Questions :
Do I need to install something in particular to make pyspark S3-enabled?
Should I somehow package my code and run a special command using the pyspark console ?
Thank you all, sorry for the duplicated issue
SOLVED :
The solution is the following :
To link a local Spark instance to S3, you must add the JAR files for the AWS SDK and hadoop-aws to your classpath and run your app with: spark-submit --jars my_jars.jar
Be careful with the SDK versions you use; not all of them are compatible: aws-java-sdk-1.7.4 and hadoop-aws-2.7.4 worked for me.
The configuration I used is :
spark = SparkSession.builder \
    .appName("my_app") \
    .config('spark.sql.codegen.wholeStage', False) \
    .getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "mykey")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "mysecret")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "eu-west-3.amazonaws.com")
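For concreteness, the full invocation implied by the fix above can be assembled like this (a sketch: the JAR filenames are the compatible pair reported to work, while the script name my_file.py and the working-directory-relative paths are assumptions):

```python
# Sketch: build the spark-submit command for linking a local Spark to S3.
# The JAR filenames are the compatible pair from the answer; my_file.py
# and the relative paths are assumptions.
jars = ["aws-java-sdk-1.7.4.jar", "hadoop-aws-2.7.4.jar"]
cmd = f"spark-submit --jars {','.join(jars)} my_file.py"
print(cmd)
```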

How to get csv on s3 with pyspark (No FileSystem for scheme: s3n)

There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.
Trying to load a simple test csv file from my s3.
Doing it locally, like below, works.
from pyspark.sql import SparkSession
from pyspark import SparkContext as sc
logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
But if I add this below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()
I get:
No FileSystem for scheme: s3n
I've also tried changing sc to spark.sparkContext, without any difference, as well as swapping // and /// in the URL.
Even better, I'd rather do this and go straight to data frame:
dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
Also I am slightly AWS ignorant, so I have tried s3, s3n, and s3a to no avail.
I've been around the internet and back but can't seem to resolve the scheme error. Thanks!
I think your Spark environment didn't get the AWS JARs. You need to add them to use s3 or s3n.
You have to copy the required JAR files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work.
My Spark version is Spark 2.3.0 and Hadoop 2.7.6, so you have to copy the JARs from (hadoop dir)/share/hadoop/tools/lib/ to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
You must check which hadoop*.jar files are bound to the specific version of pyspark installed on your system: search the pyspark/jars folder for the hadoop* files.
Then pass the observed version into your pyspark invocation like this:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
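The version check described above (find the hadoop*.jar files bundled with your pyspark install) can be sketched as a small helper; the jars directory path varies by install, so it is taken as a parameter:

```python
import glob
import os
import re

def bundled_hadoop_version(jars_dir):
    """Return the version of the hadoop-common jar bundled with a pyspark
    install (e.g. '2.7.3'), or None if no such jar is found. Use this
    version for the org.apache.hadoop:hadoop-aws package you pass in."""
    for path in glob.glob(os.path.join(jars_dir, "hadoop-common-*.jar")):
        m = re.search(r"hadoop-common-(\d+(?:\.\d+)*)\.jar$",
                      os.path.basename(path))
        if m:
            return m.group(1)
    return None
```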
This is a bit tricky for newcomers to pyspark (I ran into it on my first day with pyspark :-)).
Otherwise, I am on a Gentoo system with a local Spark 2.4.2. Some suggested also installing Hadoop and copying its JARs directly into Spark, but they should still be the same versions pyspark is using, so I am creating a Gentoo ebuild for these versions...

creating spark dataframe of .dat.gz file present in aws s3 using aws glue

I have written pyspark code that runs in AWS Glue and tries to read a .dat.gz file. The DataFrame is created successfully, but Trim(BOTH FROM ...) gets added to every column name. Below is my code snippet.
df = spark.read.format("csv").option("header", 'false').option("delimiter", '|').load("s3://xxxxxx/xxxx/xxxxx/xxx/xxxxxxxxxx.dat.gz")
output
(truncated; every column name comes back wrapped in Trim(BOTH FROM ...)):
+----------------------+------------------------+-------------------------+-----+--------------------------+---------------------------+
|Trim(BOTH FROM EFF_DT)|Trim(BOTH FROM SITE_NUM)|Trim(BOTH FROM ARTCL_NUM)| ... |Trim(BOTH FROM BBY_LEV_CD)|Trim(BOTH FROM BBY_PRC_QTY)|
+----------------------+------------------------+-------------------------+-----+--------------------------+---------------------------+
But when reading any other file, I am getting the correct output.
Can anyone help me on this?
This is not a file problem because I tried the same code in my local machine and it is running fine.
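Not from the thread, but as a workaround sketch: if the wrapped names are otherwise correct, the Trim(BOTH FROM ...) wrapper can be stripped off after the read; the pattern below assumes the wrapper looks exactly as in the output above:

```python
import re

def strip_trim_wrapper(name):
    """Strip a 'Trim(BOTH FROM col)' wrapper from a column name,
    leaving names without the wrapper untouched."""
    m = re.fullmatch(r"Trim\(BOTH FROM (.+)\)", name)
    return m.group(1) if m else name

# Hypothetical usage on the DataFrame from the question:
# df = df.toDF(*[strip_trim_wrapper(c) for c in df.columns])
```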

Creating a Parquet file with PySpark on an AWS EMR cluster

I'm trying to spin up a Spark cluster with Databricks' CSV package so that I can create parquet files, and also do some stuff with Spark, obviously.
This is being done within AWS EMR, so I don't think I'm putting these options in the correct place.
This is the command I want to send to the cluster as it spins up: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --master yarn --driver-memory 4g --executor-memory 2g. I've tried putting this on a Spark step - is this correct?
If the cluster spun up without the package being properly installed, how do I start PySpark with it? Is this correct: pyspark --packages com.databricks:spark-csv_2.10:1.4.0? I can't tell whether it was installed properly, and I'm not sure what functions to test.
And in regards to actually using the package, is this correct for creating a parquet file:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
#is it this option1
df.write.parquet("s3n://bucketname/nation_parquet.parquet")
#or this option2
df.select('nation_id', 'name', 'some_int', 'comment').write.parquet('com.databricks.spark.csv').save('s3n://bucketname/nation_parquet.tbl')
I'm not able to find any recent (mid 2015 and later) documentation regarding writing a Parquet file.
Edit:
Okay, now I'm not sure if I'm creating my DataFrame correctly. If I try to run some select queries on it and show the result set, I don't get anything, just an error. Here's what I tried running:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
df.registerTempTable("region2")
tcp_interactions = sqlContext.sql(""" SELECT nation_id, name, comment FROM region2 WHERE nation_id > 1 """)
tcp_interactions.show()
#get some weird Java error:
#Caused by: java.lang.NumberFormatException: For input string: "0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|"
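One reading of that NumberFormatException (an assumption, not stated in the question): the string it chokes on is an entire pipe-delimited row, which suggests the .tbl file was parsed with the default comma delimiter, so every row landed in a single field. The stdlib csv module shows the difference (the row text is taken from the error message); with spark-csv the equivalent would be adding delimiter='|' to .options(...):

```python
import csv
import io

# Row text taken from the NumberFormatException above
row_text = "0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|"

# Default comma delimiter: the whole row is one field
comma = next(csv.reader(io.StringIO(row_text)))
# Pipe delimiter: the row splits into its actual fields
pipe = next(csv.reader(io.StringIO(row_text), delimiter="|"))
print(len(comma), len(pipe))  # prints: 1 5
```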
