Read files from S3 - Pyspark [duplicate] - python

This question already has answers here:
Spark Scala read csv file using s3a
(1 answer)
How to access s3a:// files from Apache Spark?
(11 answers)
S3A: fails while S3: works in Spark EMR
(2 answers)
Closed 3 years ago.
I have been looking for a clear answer to this question all morning but couldn't find anything understandable.
I started using pyspark (installed with pip) a while ago, and I have a simple .py file that reads data from local storage, does some processing, and writes the results locally. I'm currently running it with: python my_file.py
What I'm trying to do:
Use files from AWS S3 as the input and write the results to a bucket on AWS S3.
I am able to create a bucket and load files using "boto3", but I saw some options using "spark.read.csv", which is what I want to use.
What I have tried:
I tried to set up the credentials with:
spark = SparkSession.builder \
    .appName("my_app") \
    .config('spark.sql.codegen.wholeStage', False) \
    .getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "my_key_id")
spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "my_secret_key")
then:
df = spark.read.option("delimiter", ",").csv("s3a://bucket/key/filename.csv", header = True)
But I get the error:
java.io.IOException: No FileSystem for scheme: s3a
Questions:
Do I need to install something in particular to make PySpark S3-enabled?
Should I somehow package my code and run it with a special command from the pyspark console?
Thank you all, and sorry for the duplicate issue.
SOLVED:
The solution is the following:
To link a local Spark instance to S3, you must add the aws-java-sdk and hadoop-aws JAR files to your classpath and run your app with: spark-submit --jars my_jars.jar
Be careful with the versions you use for the SDKs; not all of them are compatible. aws-java-sdk-1.7.4 and hadoop-aws-2.7.4 worked for me.
The configuration I used is:
spark = SparkSession.builder \
    .appName("my_app") \
    .config('spark.sql.codegen.wholeStage', False) \
    .getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "mykey")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "mysecret")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "eu-west-3.amazonaws.com")
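For completeness, here is a minimal end-to-end sketch built on that configuration; the bucket and key names are placeholders, and pulling the connector through spark.jars.packages (instead of spark-submit --jars) is an assumption on my part, not something from the original answer:

from pyspark.sql import SparkSession

# hadoop-aws 2.7.4 should pull in the aws-java-sdk version it was built against (1.7.4),
# which matches the compatibility note above; adjust the version to your Hadoop build.
spark = SparkSession.builder \
    .appName("my_app") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4") \
    .getOrCreate()

hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "mykey")
hconf.set("fs.s3a.secret.key", "mysecret")
hconf.set("fs.s3a.endpoint", "eu-west-3.amazonaws.com")

# Read the input from S3, do some processing, and write the results back to S3.
df = spark.read.option("delimiter", ",").csv("s3a://bucket/key/filename.csv", header=True)
result = df.dropDuplicates()  # placeholder for the real processing
result.write.mode("overwrite").csv("s3a://bucket/output/results/", header=True)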

Related

Can't access files in S3 using PySpark code running in PyCharm

I have a basic PySpark script where I am trying to access S3 files from my PySpark code.
The code is below; here I am reading a CSV file from my bucket. It's been a while and I keep getting the error below, although when I use boto I can do the same thing. Please suggest how I can fix this. I don't want to use boto3 because it defeats the purpose of using Spark.
from pyspark.sql import *

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "mykey")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "mykey")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")

data = "s3a://atharv-test/huditest.csv"
authorsDf = spark.read.format('csv').option("header", "true").option("inferSchema", "true").load(data)
authorsDf.show()
ERROR
py4j.protocol.Py4JJavaError: An error occurred while calling o36.load.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)

Reading Excel file Using PySpark: Failed to find data source: com.crealytics.spark.excel [duplicate]

This question already has answers here:
How to set up Spark on Windows?
(10 answers)
Closed 1 year ago.
I'm trying to read an Excel file with Spark using Jupyter in VS Code, with Java version 1.8.0_311 (Oracle Corporation) and Scala version 2.12.15.
Here is the code:
# import necessary libraries
import pandas as pd
from pyspark.sql.types import StructType

# entry point for spark's functionality
from pyspark import SparkContext, SparkConf, SQLContext
configure = SparkConf().setAppName("name").setMaster("local")
sc = SparkContext(conf=configure)
sql = SQLContext(sc)

# entry point for spark's dataframes
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

# reading the excel file
df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Sheet1") \
    .load("./../data/raw-data/generika.xlsx")
Unfortunately, it produces an error:
Py4JJavaError: An error occurred while calling o36.load.
: java.lang.ClassNotFoundException:
Failed to find data source: com.crealytics.spark.excel. Please find packages at
http://spark.apache.org/third-party-projects.html
Turns out winutils isn't installed.
Check your classpath: you must have the JAR containing com.crealytics.spark.excel on it.
With Spark, the architecture is a bit different from traditional applications. You may need to have the JAR in different locations: in your application, at the master level, and/or at the worker level. Ingestion (what you're doing) is done by the workers, so make sure they have this JAR on their classpath.
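For illustration, a minimal sketch (not from the original answer) of one way to get the package onto every classpath via spark.jars.packages; the artifact version below is an assumption, the Scala suffix must match your Spark build (the question reports Scala 2.12 while the snippet above requests a _2.11 artifact), and the setting only takes effect if no SparkContext has been created yet in the process:

from pyspark.sql import SparkSession

# Build the session first, with the package coordinates, so the driver and the
# workers download the JAR and have it on their classpath before any read happens.
# The version is an assumption; pick one published for your Scala/Spark combination.
spark = SparkSession.builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
    .getOrCreate()

df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Sheet1") \
    .load("./../data/raw-data/generika.xlsx")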

read parquet file from s3 using pyspark issue

I am trying to read parquet files from S3, but it kills my server (it processes for a very long time and I must reset the machine in order to continue working).
There is no issue writing the parquet file to S3, and when writing to and reading from local storage it works perfectly. When reading small files from S3 there are no issues either.
As seen in many threads, Spark's "s3a" file system client (the 2nd config here) should be able to handle it, but in fact I get a 'NoSuchMethodError' when trying to use s3a (with the proper s3a configuration listed below):
Py4JJavaError: An error occurred while calling o155.json.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)
The following configuration works only for small files. Here are the SparkSession configs I used:
s3 config:
spark = SparkSession.builder.appName('JSON2parquet') \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config('fs.s3.awsAccessKeyId', 'myAccessId') \
    .config('fs.s3.awsSecretAccessKey', 'myAccessKey') \
    .config('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem') \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .config("spark.sql.parquet.mergeSchema", "false") \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .config("spark.speculation", "false") \
    .getOrCreate()
s3a config:
spark = SparkSession.builder.appName('JSON2parquet') \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config('spark.hadoop.fs.s3a.access.key', 'myAccessId') \
    .config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey') \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .config("spark.sql.parquet.mergeSchema", "false") \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .config("spark.speculation", "false") \
    .getOrCreate()
JARs for S3 read-write (spark.driver.extraClassPath):
hadoop-aws-2.7.3.jar
hadoop-common-2.7.3.jar (added in order to use s3a)
aws-java-sdk-s3-1.11.156.jar
Is there any other .config I can use to solve this issue?
Thanks,
Mosh.
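(A hedged aside, not part of the original question:) the solved question at the top of this page reports that the aws-java-sdk and hadoop-aws versions must be compatible (aws-java-sdk-1.7.4 with hadoop-aws-2.7.4 worked there), and the TransferManager NoSuchMethodError above is typically that kind of binary mismatch. Below is a sketch of letting Maven resolve matching artifacts instead of hand-picking JARs; the coordinates and the S3 path are assumptions:

from pyspark.sql import SparkSession

# Letting Maven resolve hadoop-aws together with its own aws-java-sdk dependency
# keeps the two in step, which is the compatibility concern raised above.
spark = SparkSession.builder.appName('JSON2parquet') \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3") \
    .config('spark.hadoop.fs.s3a.access.key', 'myAccessId') \
    .config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey') \
    .getOrCreate()

df = spark.read.parquet("s3a://some-bucket/some-prefix/")  # hypothetical path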

How to get csv on s3 with pyspark (No FileSystem for scheme: s3n)

There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.
I'm trying to load a simple test CSV file from my S3 bucket.
Doing it locally, like below, works.
from pyspark.sql import SparkSession
from pyspark import SparkContext as sc
logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
But if I add this below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()
I get:
No FileSystem for scheme: s3n
I've also tried changing sc to spark.sparkContext, without any difference.
I've also tried swapping // and /// in the URL.
Even better, I'd rather do this and go straight to data frame:
dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
Also I am slightly AWS ignorant, so I have tried s3, s3n, and s3a to no avail.
I've been around the internet and back but can't seem to resolve the scheme error. Thanks!
I think your Spark environment didn't get the AWS JARs. You need to add them in order to use s3 or s3n.
You have to copy the required JAR files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag with spark-submit didn't work.
My Spark version here is 2.3.0 and Hadoop is 2.7.6,
so you have to copy the JARs from (hadoop dir)/share/hadoop/tools/lib/
to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
You must check which version of the hadoop-*.jar files is bundled with the specific version of pyspark installed on your system: look in the pyspark/jars folder for files named hadoop-*.
You then pass the observed version to your pyspark script like this:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
This is a bit tricky for newcomers to pyspark (I ran into it on my very first day with pyspark :-)).
Otherwise, I am on a Gentoo system with local Spark 2.4.2. Some suggested also installing Hadoop and copying the JARs directly into Spark; they should still be the same version PySpark is using. So I am creating a Gentoo ebuild for these versions...
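For context, a minimal sketch of the PYSPARK_SUBMIT_ARGS approach from the answer above; the key point is that the variable must be set before the first SparkSession/SparkContext is created, and the hadoop-aws version should match the hadoop-* JARs bundled with your pyspark install (the coordinates below are the ones from the answer, reused as an example; the bucket is hypothetical):

import os

# Must be set before the JVM is launched, i.e. before any SparkContext/SparkSession exists.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,'
    'org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# Note the s3a scheme and the bucket name directly after "//".
df = spark.read.csv("s3a://mybucket-sparkexample/sparkexamplefile.csv", header=True)
df.show()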

Creating a Parquet file with PySpark on an AWS EMR cluster

I'm trying to spin up a Spark cluster with Databricks' CSV package so that I can create parquet files and also do some other things with Spark.
This is being done within AWS EMR, so I don't think I'm putting these options in the correct place.
This is the command I want to send to the cluster as it spins up: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 --master yarn --driver-memory 4g --executor-memory 2g. I've tried putting this on a Spark step - is this correct?
If the cluster spun up without that being properly installed, how do I start up PySpark with that package? Is this correct: pyspark --packages com.databricks:spark-csv_2.10:1.4.0? I can't tell whether it was installed properly or not, and I'm not sure what functions to test.
And in regards to actually using the package, is this correct for creating a parquet file:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
#is it this option1
df.write.parquet("s3n://bucketname/nation_parquet.parquet")
#or this option2
df.select('nation_id', 'name', 'some_int', 'comment').write.parquet('com.databricks.spark.csv').save('s3n://bucketname/nation_parquet.tbl')
I'm not able to find any recent (mid 2015 and later) documentation regarding writing a Parquet file.
Edit:
Okay, now I'm not sure if I'm creating my dataframe correctly. If I try to run some select queries on it and show the result set, I don't get anything and instead get an error. Here's what I tried running:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load('s3n://bucketname/nation.tbl', schema = customSchema)
df.registerTempTable("region2")
tcp_interactions = sqlContext.sql(""" SELECT nation_id, name, comment FROM region2 WHERE nation_id > 1 """)
tcp_interactions.show()
#get some weird Java error:
#Caused by: java.lang.NumberFormatException: For input string: "0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|"
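(A hedged aside, not part of the original question:) the sample row in that error looks pipe-delimited rather than comma-delimited, so the spark-csv reader may need an explicit delimiter option. A sketch under that assumption, reusing the paths and the customSchema from the question:

# customSchema is assumed to be defined as in the question.
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='false', delimiter='|') \
    .load('s3n://bucketname/nation.tbl', schema=customSchema)

# Writing the DataFrame out as Parquet (option 1 from the question):
df.write.parquet("s3n://bucketname/nation_parquet.parquet")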
