Can't access files in S3 using PySpark code running in PyCharm - python

I have some basic PySpark code where I am trying to access S3 files.
The code is below; it reads a CSV file from one of my buckets. I've been at this for a while, but I keep getting the error shown below, even though I can do the same thing with boto. Please suggest how I can fix this. I don't want to use boto3 because it defeats my purpose of using Spark.
from pyspark.sql import *
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", \
"com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key","mykey")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key","mykey")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
data = "s3a://atharv-test/huditest.csv"
authorsDf = spark.read.format('csv').option("header","true").option("inferSchema","true").load(data)
authorsDf.show()
ERROR
py4j.protocol.Py4JJavaError: An error occurred while calling o36.load.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)
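The NoSuchMethodError on com.google.common.base.Preconditions.checkArgument typically means the hadoop-aws jar on the classpath was built against a different Hadoop/Guava version than the jars bundled with your Spark. A minimal sketch of one common fix, assuming a pip-installed PySpark whose bundled jars are Hadoop 3.3.x (check your pyspark/jars folder; the hadoop-aws version below is an assumption and must match what you find there):
from pyspark.sql import SparkSession

# Let Spark resolve a hadoop-aws version that matches its bundled hadoop-* jars;
# it pulls in a compatible aws-java-sdk-bundle transitively.
spark = (
    SparkSession.builder
    .master("local")
    .appName("test")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")  # assumed version
    .config("spark.hadoop.fs.s3a.access.key", "mykey")
    .config("spark.hadoop.fs.s3a.secret.key", "mykey")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
    .getOrCreate()
)

authorsDf = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3a://atharv-test/huditest.csv")
)
authorsDf.show()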

Related

read parquet file from s3 using pyspark issue

I am trying to read parquet files from S3, but it kills my server (it processes for a very long time and I must reset the machine in order to continue working).
There is no issue when writing the parquet file to S3, and writing and reading locally works perfectly. Reading small files from S3 also causes no problems.
As seen in many threads, Spark's "s3a" file system client (the second config here) should be able to handle it, but in fact I get a NoSuchMethodError when trying to use s3a (with the s3a configuration listed below):
Py4JJavaError: An error occurred while calling o155.json.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)
The following SparkSession configuration works, but only for small files:
s3 config:
spark = SparkSession.builder.appName('JSON2parquet')\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config('fs.s3.awsAccessKeyId', 'myAccessId')\
.config('fs.s3.awsSecretAccessKey', 'myAccessKey')\
.config('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem')\
.config("spark.sql.parquet.filterPushdown", "true")\
.config("spark.sql.parquet.mergeSchema", "false")\
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
.config("spark.speculation", "false")\
.getOrCreate()
s3a config:
spark = SparkSession.builder.appName('JSON2parquet')\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config('spark.hadoop.fs.s3a.access.key', 'myAccessId')\
.config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey')\
.config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')\
.config("spark.sql.parquet.filterPushdown", "true")\
.config("spark.sql.parquet.mergeSchema", "false")\
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
.config("spark.speculation", "false")\
.getOrCreate()
JARs for s3 read-write (spark.driver.extraClassPath):
hadoop-aws-2.7.3.jar,
hadoop-common-2.7.3.jar (added in order to use s3a),
aws-java-sdk-s3-1.11.156.jar
Is there any other .config I can use to solve this issue?
Thanks,
Mosh.
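For what it's worth, the TransferManager NoSuchMethodError usually indicates that the aws-java-sdk jars do not match the hadoop-aws version: hadoop-aws 2.7.x was compiled against the monolithic aws-java-sdk 1.7.4, not the split 1.11.x artifacts such as aws-java-sdk-s3-1.11.156. A hedged sketch of letting Spark resolve a matching pair instead of hand-picking jars (versions are assumptions; they must match the hadoop-* jars your Spark ships with, and the paths are hypothetical):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('JSON2parquet')
    # hadoop-aws declares the aws-java-sdk version it was built against
    # (1.7.4 for the 2.7.x line), so letting the resolver pull it keeps the two in sync
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.access.key", "myAccessId")
    .config("spark.hadoop.fs.s3a.secret.key", "myAccessKey")
    .getOrCreate()
)

df = spark.read.json("s3a://my-bucket/input/")   # hypothetical path
df.write.parquet("s3a://my-bucket/output/")      # hypothetical path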

How to get csv on s3 with pyspark (No FileSystem for scheme: s3n)

There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.
Trying to load a simple test csv file from my s3.
Doing it locally, like below, works.
from pyspark.sql import SparkSession
from pyspark import SparkContext as sc
logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
But if I add this below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()
I get:
No FileSystem for scheme: s3n
I've also tried changing sc to spark.sparkContext, without any difference,
and also swapping // and /// in the URL.
Even better, I'd rather do this and go straight to data frame:
dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
Also I am slightly AWS ignorant, so I have tried s3, s3n, and s3a to no avail.
I've been around the internet and back but can't seem to resolve the scheme error. Thanks!
I think your Spark environment didn't get the AWS jars. You need to add them in order to use s3 or s3n.
You have to copy the required jar files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work for me.
Here my Spark version is 2.3.0 and Hadoop is 2.7.6, so you have to copy these jars from (hadoop dir)/share/hadoop/tools/lib/ to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
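Once matching jars are in $SPARK_HOME/jars, a minimal sketch of going straight to a dataframe with the s3a scheme (the credentials are placeholders and the bucket name is the one from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "foo")  # placeholder credentials
hconf.set("fs.s3a.secret.key", "bar")

dataFrame = spark.read.csv("s3a://mybucket-sparkexample/sparkexamplefile.csv")
dataFrame.show()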
You must check which version of the hadoop-*.jar files is bound to the specific version of pyspark installed on your system: look for the pyspark/jars folder and the hadoop-* files in it.
Pass the version you observe into your pyspark script like this (before the SparkSession is created):
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
This is a bit tricky for newcomers to pyspark (I faced it directly on my first day with pyspark :-)).
Otherwise, I am on a Gentoo system with a local Spark 2.4.2. Some suggested also installing Hadoop and copying the jars directly into Spark; they still need to be the same version that PySpark uses. So I am creating a Gentoo ebuild for these versions...
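As a quick way to see which Hadoop version your installed pyspark actually bundles, a small sketch (assuming a standard pip install; it only lists the hadoop-* jars shipped inside the package):
import glob
import os
import pyspark

# The versions in these file names are what your --packages coordinates should match.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*"))):
    print(os.path.basename(jar))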

creating spark dataframe of .dat.gz file present in aws s3 using aws glue

I have written a pyspark job that runs in AWS Glue and tries to read a .dat.gz file. The dataframe is created successfully, but Trim(BOTH FROM ...) is getting added to the column names. Below is my code snippet.
df = spark.read.format("csv").option("header", 'false').option("delimiter", '|').load("s3://xxxxxx/xxxx/xxxxx/xxx/xxxxxxxxxx.dat.gz")
Output: every column name comes back wrapped in Trim(BOTH FROM ...):
|Trim(BOTH FROM EFF_DT)|Trim(BOTH FROM SITE_NUM)|Trim(BOTH FROM ARTCL_NUM)|Trim(BOTH FROM SL_UOM_CD)|Trim(BOTH FROM COND_TY_CD)|Trim(BOTH FROM EXP_DT)|Trim(BOTH FROM COND_REC_NUM)|Trim(BOTH FROM MAIN_SCAN_CD)|Trim(BOTH FROM PRC_COND_PRRTY_NUM)|Trim(BOTH FROM PRC_COND_WIN_IND)|Trim(BOTH FROM PRC_RSN_CD)|Trim(BOTH FROM PRC_METH_CD)|Trim(BOTH FROM PRC_AMT)|Trim(BOTH FROM PRC_QTY)|Trim(BOTH FROM UT_PRC_AMT)|Trim(BOTH FROM PROMO_NUM)|Trim(BOTH FROM BNS_BUY_NUM)|Trim(BOTH FROM CURRN_CD)|Trim(BOTH FROM BBY_TY_CD)|Trim(BOTH FROM BBY_AMT)|Trim(BOTH FROM BBY_PCT)|Trim(BOTH FROM BBY_LEV_CD)|Trim(BOTH FROM BBY_PRC_QTY)|
But when reading any other file, I am getting the correct output.
Can anyone help me on this?
This is not a problem with the file, because I tried the same code on my local machine and it runs fine.
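If the goal is simply to get clean column names, one hedged workaround (it does not address why Glue adds the wrapper, it only renames the columns after the read shown above) is:
import re

# Strip a leading "Trim(BOTH FROM " and the trailing ")" from each column name,
# leaving any other column names untouched.
clean_cols = [re.sub(r'^Trim\(BOTH FROM (.+)\)$', r'\1', c) for c in df.columns]
df = df.toDF(*clean_cols)
df.printSchema()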

PySpark Throwing error Method __getnewargs__([]) does not exist

I have a set of files. The paths to the files are saved in a file, say all_files.txt. Using Apache Spark, I need to do an operation on all the files and club the results.
The steps that I want to do are:
Create an RDD by reading all_files.txt
For each line in all_files.txt (Each line is a path to some file),
read the contents of each of the files into a single RDD
Then do an operation on all the contents
This is the code I wrote for the same:
def return_contents_from_file(file_name):
    return spark.read.text(file_name).rdd.map(lambda r: r[0])

def run_spark():
    file_name = 'path_to_file'
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .getOrCreate()

    # The first map is supposed to return the path to each file,
    # the first flatMap is expected to club together the contents of all the files,
    # and the second flatMap does an operation on each line of all the files.
    counts = spark.read.text(file_name).rdd.map(lambda r: r[0]) \
        .flatMap(return_contents_from_file) \
        .flatMap(do_operation_on_each_line_of_all_files)
This is throwing the error:
line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o25.__getnewargs__.
Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Can someone please tell me what I am doing wrong and how I should proceed further. Thanks in advance.
Using spark inside flatMap, or inside any transformation that occurs on the executors, is not allowed (the Spark session is available on the driver only). It is also not possible to create an RDD of RDDs (see: Is it possible to create nested RDDs in Apache Spark?).
But you can achieve this transformation in another way: read all the content of all_files.txt into a dataframe, use a local map to turn each path into a dataframe, and a local reduce to union them all. See the example:
>>> from functools import reduce  # needed on Python 3
>>> filenames = spark.read.text('all_files.txt').collect()
>>> dataframes = map(lambda r: spark.read.text(r[0]), filenames)
>>> all_lines_df = reduce(lambda df1, df2: df1.unionAll(df2), dataframes)
I met this problem today and finally figured out that I referred to a spark DataFrame object inside a pandas_udf, which results in this error.
The conclusion:
You can't use a SparkSession object, a Spark DataFrame object, or other Spark distributed objects in a udf or pandas_udf, because they cannot be pickled.
If you meet this error and you are using a udf, check it carefully; it is most likely a related problem.
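For illustration, a minimal sketch of the anti-pattern described above and one way around it (names such as lookup_df are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
lookup_df = spark.range(10)  # a Spark DataFrame living on the driver

# Broken: this udf's closure captures lookup_df (and, through it, the SparkContext),
# which cannot be pickled for shipping to executors, so using it raises the py4j error.
# bad_udf = udf(lambda x: lookup_df.count())

# Working alternative: collect what you need into plain Python objects first,
# and let the closure capture only those.
lookup_values = {row.id for row in lookup_df.collect()}
good_udf = udf(lambda x: x in lookup_values, "boolean")

spark.range(20).select(good_udf("id").alias("in_lookup")).show()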
I also got this error when trying to log my model with MLflow using mlflow.sklearn.log_model while the model itself was a pyspark.ml.classification model. Using mlflow.spark.log_model solved the issue.

Pig script cannot register UDF

I have a simple Pig script that uses a Python UDF I have created. The script completes fine if I remove the UDF portion. But when I try to register my UDF I get the following error:
ERROR 2997: Encountered IOException. File pig_test/py_udf_substr.py does not exist
This is my UDF:
@outputSchema("chararray")
def get_fistsn(data, n):
    return data[:n]
This is my Pig script:
REGISTER 'pig_test/py_udf_substr.py' USING jython as pyudf;
A = load 'pig_test/sf.txt' using PigStorage(',')
as (Unique_flight_ID,Year,Month,Day,DOW,
Scheduled_departure_time,Scheduled_arrival_time,
Airline,Flight_number,Tail_number,Plane_model,
Seat_configuration,Departure_delay,Origin_airport,
Destination_airport,Distance_travelled,Taxi_time_in,
Taxi_time_out,Cancelled,Cancellation_code,target);
B = FOREACH A GENERATE Unique_flight_ID, pyudf.get_fistsn($0,3);
DUMP B;
I'm using HUE to run Pig. Both the data and the UDF are in the same HDFS location (pig_test).
Based on the error you get, the issue is not with the script itself. It's an IOException: the framework is unable to read the UDF file. You can try giving the complete path of the UDF and see if that works. In a new terminal, try opening the file with the cat command (or similar) to check whether the path is correct.
ERROR 2997: Encountered IOException. File pig_test/py_udf_substr.py does not exist
In the Editor, could you click on Properties, then Resources and select py_udf_substr.py as a file?
Then in your script, just do:
REGISTER 'py_udf_substr.py' USING jython as pyudf;
