pyspark - Error while loading .csv file from url to Spark - python

I am trying to load data into PySpark from a URL:
url = "https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
However, the following error occurred:
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
[Stage 0:> (0 + 1) / 1]
20/06/30 19:10:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: File /tmp/spark-1ee8b00f-8657-4cdc-8d7b-e3bc473bbce7/userFiles-f9e0a88d-8678-48c4-a21b-c06ce76d528b/cities.csv exists and does not match contents of https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv
20/06/30 19:10:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 499, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o31.csv.
How should I solve this problem?

The problem is with your URL.
To read data from GitHub you have to pass the raw URL instead.
On the data page, click "Raw" and then copy that URL to get the data:
url = 'https://raw.githubusercontent.com/jokecamp/FootballData/master/openFootballData/cities.csv'
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("cities.csv"), header=True)
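For illustration only (this transformation is my addition, not part of the original answer), the raw URL can also be derived from the blob URL in code:
url_blob = "https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv"
url_raw = url_blob.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")
# url_raw == 'https://raw.githubusercontent.com/jokecamp/FootballData/master/openFootballData/cities.csv'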

Related

Why am I not able to write a Spark dataframe into a Parquet file?

I'm trying to write a Spark dataframe to a Parquet file, but I am not able to; I even tried CSV with the same result.
df is my dataframe
CUST_ID
---------------
00000082MM778Q49X
00000372QM8890MX7
00000424M09X729MQ
0000062Q028M05MX
My dataframe looks as above.
df_parquet = (tempDir + "/" + "df.parquet")  # file path

customerQuery = f"SELECT DISTINCT(m.customer_ID) FROM ada_customer m INNER JOIN customer_nol mr ON m.customer_ID = mr.customer_ID \
    WHERE mr.MODEL <> 'X' and m.STATUS = 'Process' AND m.YEAR = {year} AND mr.YEAR = {year}"

customer_df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/dbkl",
    driver="com.mysql.jdbc.Driver",
    query=customerQuery, user="root", password="root").load()

# the lines above work; only the write below fails
customer_df.write.mode("overwrite").parquet(df_parquet)
I'm getting this error and don't know exactly what's wrong. Can someone help with this?
Traceback (most recent call last):
File "F:/SparkBook/HG.py", line 135, in <module>
customer_xdf.write.mode("overwrite").parquet(customer_parquet)
File "C:\spark3\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 1372, in csv
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1305, in __call__
File "C:\spark3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 111, in deco
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220)
... 33 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "F:/SparkBook/HG.py", line 148, in <module>
logger.error(e)
File "F:\SparkBook\lib\logger.py", line 16, in error
self.logger.error(message)
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1296, in __call__
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1266, in _build_args
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1266, in <listcomp>
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 298, in get_command_part
AttributeError: 'Py4JJavaError' object has no attribute '_get_object_id'
Process finished with exit code 1

Spark SQL cannot output the dataframe

I tried running the following code, but I cannot get the result; the error message is shown below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hive').enableHiveSupport().getOrCreate()
list = spark.read.format("csv").option("header", "true").load(r"mypath/mydata.csv")
list.createOrReplaceTempView("mydata")
df = spark.sql("""select * from mydata""")
Error info:
Traceback (most recent call last):
File "<ipython-input-31-61851d7298cc>", line 1, in <module>
df = spark.sql("""select * from mydata""")
File "C:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "C:\ProgramData\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Can anyone help me figure out how to resolve this? I am using Spyder with Python 3.7.
Thank you!
Remove enableHiveSupport if you are not using it:
spark = SparkSession.builder.appName('hive').getOrCreate()
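A minimal sketch of the full flow under that assumption (reusing the question's placeholder path, and naming the dataframe data only to avoid shadowing the Python builtin list):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hive').getOrCreate()  # no enableHiveSupport()
data = spark.read.format("csv").option("header", "true").load(r"mypath/mydata.csv")
data.createOrReplaceTempView("mydata")
df = spark.sql("select * from mydata")
df.show()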

ImportError when running NuPIC model on PySpark

I am trying to run NuPIC on PySpark but I am getting an ImportError. Does anyone have any ideas for how I can fix it?
The code runs fine when I don't use PySpark, but I am trying to run it from a Spark Dataset now.
I am trying to run it using the source code I have in my directory, since running it by installing the NuPIC package causes some other errors.
Thank you for your help!
I am trying to run this function:
input_data.rdd.foreach(lambda row: iterateRDD(row, model))

def iterateRDD(record, model):
    modelInput = record.asDict(False)
    modelInput["value"] = float(modelInput["value"])
    modelInput["timestamp"] = datetime.datetime.strptime(modelInput["timestamp"], "%Y-%m-%d %H:%M:%S")
    print "modelInput", modelInput
    result = model.run(modelInput)
    anomalyScore = result.inferences['anomalyScore']
    print "Anomaly score is", anomalyScore
However, I get this error and don't understand it.
  File "C:/Users/rakshit.trn/Documents/Nupic/nupic-master/examples/anomaly.py", line 100, in runAnomaly
    input_data.rdd.foreach(lambda row: iterateRDD(row, model))
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 789, in foreach
    self.mapPartitions(processPartition).count()  # Force evaluation
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1055, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1046, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 917, in fold
    vals = self.mapPartitions(func).collect()
  File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Python\Python27\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Python\Python27\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Python\Python27\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 364, in main
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 69, in read_command
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 583, in loads
    return pickle.loads(obj)
ImportError: No module named frameworks.opf.htm_prediction_model
I guess that NuPIC is not able to access the frameworks/opf/htm_prediction_model.py file.
You might be running an old version of NuPIC. See https://discourse.numenta.org/t/warning-0-7-0-breaking-changes/2200 and check what version you are using (https://discourse.numenta.org/t/how-to-check-what-version-of-nupic-is-installed/1045)
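If NuPIC was installed with pip, one generic way to check the installed version is (this snippet is my addition, not from the linked posts):
import pkg_resources
print(pkg_resources.get_distribution("nupic").version)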

Pyspark celery task: toPandas() throwing Pickling error

I have a web application that runs long-running tasks in PySpark. I am using Django and Celery to run the tasks asynchronously.
I have a piece of code that works great when I execute it in the console, but I am getting quite a few errors when I run it through the Celery task.
First, my UDFs don't work for some reason. I put the code in a try-except block and it always goes into the except block.
try:
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except:
    raise ValueError("No valid date format found.")
The error:
[2018-04-05 07:47:37,223: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[afbda586-0929-4d51-87f1-d612cbdb4c5e] raised unexpected: Py4JError('An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:\npy4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)\n\tat py4j.Gateway.invoke(Gateway.java:235)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\n\n',)
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 68, in outlier_algorithm
spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 179, in wrapper
return self(*args)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 157, in __call__
judf = self._judf
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 141, in _judf
self._judf_placeholder = self._create_judf()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 153, in _create_judf
self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Further, I am using toPandas() to convert the dataframe and run some pandas functions on it, but it throws the following error:
[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
data_frame_new = data_frame_1.toPandas()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
return self.sparkSession.conf.get(key, defaultValue)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
return self._jconf.get(key)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.
I ran into this problem and was having a hard time pinning it down. As it turns out this error can occur if the version of Spark you are running does not match the version of PySpark you are executing it with. In my case I am running Spark 2.2.3.4 and was trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3 the problem went away. I ran into another problem caused by the code using functionality in PySpark that was added after 2.2.3, but that's another issue.
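As a quick sanity check (my addition, assuming an active SparkSession named spark), you can compare the two versions directly:
import pyspark
print("PySpark package version:", pyspark.__version__)
print("Spark runtime version:", spark.version)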
This is just not going to work. Spark uses complex state, including JVM state, which cannot simply be serialized and sent to the worker. If you want to run your code asynchronously, use a thread pool to submit jobs.
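A minimal sketch of that suggestion (the file names and helper function are illustrative, not from the post): keep the SparkSession in the driver process and hand work to threads instead of separate Celery worker processes.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("async-jobs").getOrCreate()

def run_job(path):
    # every thread shares the same driver-side SparkSession and JVM gateway
    return spark.read.csv(path, header=True).count()

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_job, p) for p in ["a.csv", "b.csv"]]
    results = [f.result() for f in futures]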
I'm answering my own question: it was probably a PySpark 2.3 bug.
I was using PySpark 2.3.0 and for some reason it did not work well with Python 3.5.
I downgraded to PySpark 2.1.2 and everything worked fine.

'Error -5 while decompressing data' in Spark, in PyPDF2 lib

Does anybody know why this occurs:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1089.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1089.0 (TID 1951, ip-10-0-208-38.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 101, in main
process()
File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 236, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-762-e1c8f006c3c2>", line 4, in getPdfData
File "<ipython-input-762-e1c8f006c3c2>", line 85, in extractData
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2566, in extractText
content = ContentStream(content, self.pdf)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2644, in __init__
stream = BytesIO(b_(stream.getData()))
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/generic.py", line 837, in getData
decoded._data = filters.decodeStreamData(self)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 346, in decodeStreamData
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 111, in decode
data = decompress(data)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 49, in decompress
return zlib.decompress(data)
error: Error -5 while decompressing data: incomplete or truncated stream
I process PDF files on the workers using PyPDF2, creating a PyPDF2 pdf object, calling getDocumentInfo(), and calling extract_text() on the page objects. I'm not explicitly using the zlib module, where this 'compression' error usually occurs according to the oracle called the internet.
My code runs perfectly fine for a smaller number of PDFs (around 500 or so) stored in an RDD (3 workers), but when I scale it up to 5000 or more it goes wrong. Any ideas?
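For context, a minimal sketch of the per-record flow described above (it assumes each RDD element is raw PDF bytes, the helper name is illustrative, and the camelCase PyPDF2 method names match the traceback):
from io import BytesIO
from PyPDF2 import PdfFileReader

def extract_pdf_data(pdf_bytes):
    reader = PdfFileReader(BytesIO(pdf_bytes))
    info = reader.getDocumentInfo()
    text = "".join(reader.getPage(i).extractText() for i in range(reader.getNumPages()))
    return info, text

# pdf_rdd.map(extract_pdf_data).collect() exercises the same worker-side code path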
