Spark SQL cannot output the dataframe - python

I tried running the following code, but I cannot get the result; the error message is shown below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hive').enableHiveSupport().getOrCreate()
list = spark.read.format("csv").option("header", "true").load(r"mypath/mydata.csv")
list.createOrReplaceTempView("mydata")
df = spark.sql("""select * from mydata""")
Error info:
Traceback (most recent call last):
File "<ipython-input-31-61851d7298cc>", line 1, in <module>
df = spark.sql("""select * from mydata""")
File "C:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "C:\ProgramData\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Can anyone help me figure out how to resolve this? I am using Spyder with Python 3.7.
Thank you!

Remove enableHiveSupport() if you are not using Hive:
spark = SparkSession.builder.appName('hive').getOrCreate()
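If you only need SQL over the CSV file, a temporary view does not require Hive at all. Here is a minimal sketch of the same flow with Hive support dropped (the path is the placeholder from the question, the app name 'csv_sql' is just illustrative, and the variable is renamed from list to data to avoid shadowing the Python builtin):
from pyspark.sql import SparkSession

# Plain session; no Hive metastore is needed for a temp view over a CSV file
spark = SparkSession.builder.appName('csv_sql').getOrCreate()

# Read the CSV with a header row and expose it to SQL as a temp view
data = spark.read.format("csv").option("header", "true").load(r"mypath/mydata.csv")
data.createOrReplaceTempView("mydata")

df = spark.sql("select * from mydata")
df.show(5)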

Related

pyspark - Error while loading .csv file from url to Spark

I am trying to load data from a URL in PySpark:
url = "https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
However, the following error occurred:
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
[Stage 0:> (0 + 1) / 1]20/06/30 19:10:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: File /tmp/spark-1ee8b00f-8657-4cdc-8d7b-e3bc473bbce7/userFiles-f9e0a88d-8678-48c4-a21b-c06ce76d528b/cities.csv exists and does not match contents of https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv
20/06/30 19:10:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 499, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o31.csv.
How should I solve the problem?
The problem is with your URL.
In order to read data from GitHub you have to pass the raw URL instead.
On the data page, click Raw and then copy that URL to get the data:
url = 'https://raw.githubusercontent.com/jokecamp/FootballData/master/openFootballData/cities.csv'
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("cities.csv"), header=True)
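To confirm that the raw file (and not the GitHub HTML page) was actually loaded, a quick sanity check on the df from the snippet above can help:
df.printSchema()   # column names should come from the CSV header, not from HTML markup
df.show(5)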

How to calculate or manage streaming data in Pyspark

I want to calculate values from streaming data and then send them to a web page. For example, I will calculate the sum of the TotalSales column in the streaming data. But it errors at summary = dataStream.select('TotalSales').groupby().sum().toPandas(). This is my code:
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("Python Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
schema = StructType().add("_c0", "integer").add("InvoiceNo", "string").add("Quantity","integer").add("InvoiceDate","date").add("UnitPrice","integer").add("CustomerID","double").add("TotalSales","integer")
INPUT_DIRECTORY = "C:/Users/HP/Desktop/test/jsonFile"
dataStream = spark.readStream.format("json").schema(schema).load(INPUT_DIRECTORY)
query = dataStream.writeStream.format("console").start()
summary = dataStream.select('TotalSales').groupby().sum().toPandas()
print(query.id)
query.awaitTermination();
And this is the error shown on the command line:
Traceback (most recent call last):
File "testStreaming.py", line 12, in <module>
dataStream = dataStream.toPandas()
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\dataframe.py", line 2150, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\dataframe.py", line 534, in collect
sock_info = self._jdf.collectToPython()
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[C:/Users/HP/Desktop/test/jsonFile]'
Thanks for your answers.
Why are you trying to create a pandas DataFrame?
toPandas() will create a DataFrame that is local to your driver node. I am not sure what you are trying to achieve here. A pandas DataFrame represents a fixed set of tuples, whereas a structured stream is a continuous stream of data.
One possible solution to this problem is to complete the entire process that you want to do, send the output to a parquet/csv file, and then use that file to create a pandas DataFrame:
summary = dataStream.select('TotalSales').groupby().sum()
# write the aggregated stream (summary), not the raw dataStream, to the output directory
query = summary.writeStream.format("parquet").outputMode("complete").start(outputPathDir)
query.awaitTermination()
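If the goal is to get each result into something pandas-friendly for a web page, foreachBatch (available from Spark 2.4) is one way to do it without calling toPandas() on the streaming DataFrame itself. A minimal sketch, assuming Spark 2.4+, the same input directory as above, and a schema trimmed to the columns used here:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("streaming_sum_sketch").getOrCreate()

# Trimmed schema; extend it with the remaining columns as in the question
schema = StructType().add("InvoiceNo", "string").add("Quantity", "integer").add("TotalSales", "integer")

dataStream = spark.readStream.format("json").schema(schema).load("C:/Users/HP/Desktop/test/jsonFile")

def publish_batch(batch_df, batch_id):
    # batch_df is a normal, non-streaming DataFrame for this micro-batch,
    # so aggregations and toPandas() are allowed here
    totals = batch_df.agg(F.sum("TotalSales").alias("total_sales")).toPandas()
    print(batch_id, totals)  # replace the print with whatever pushes data to the web page

query = dataStream.writeStream.foreachBatch(publish_batch).start()
query.awaitTermination()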

ImportError when running NuPIC model on PySpark

I am trying to run NuPIC on PySpark but I am getting an ImportError. Does anyone have any ideas for how I can fix it?
The code runs fine when I don't use PySpark, but I am trying to run it from a Spark Dataset now.
I am trying to run it using the source code I have in my directory, since running it by installing the Nupic package causes some other errors.
Thank you for your help!!
I am trying to run this function
input_data.rdd.foreach(lambda row: iterateRDD(row, model))

def iterateRDD(record, model):
    modelInput = record.asDict(False)
    modelInput["value"] = float(modelInput["value"])
    modelInput["timestamp"] = datetime.datetime.strptime(modelInput["timestamp"], "%Y-%m-%d %H:%M:%S")
    print "modelInput", modelInput
    result = model.run(modelInput)
    anomalyScore = result.inferences['anomalyScore']
    print "Anomaly score is", anomalyScore
However, I get this error and don't understand it.
File "C:/Users/rakshit.trn/Documents/Nupic/nupic-master/examples/anomaly.py", line 100, in runAnomaly
    input_data.rdd.foreach(lambda row: iterateRDD(row, model))
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 789, in foreach
    self.mapPartitions(processPartition).count()  # Force evaluation
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1055, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1046, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 917, in fold
    vals = self.mapPartitions(func).collect()
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Python\Python27\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
File "C:\Python\Python27\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
File "C:\Python\Python27\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 364, in main
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 69, in read_command
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 583, in loads
    return pickle.loads(obj)
ImportError: No module named frameworks.opf.htm_prediction_model
I guess that NuPIC is not able to access the frameworks/opf/htm_prediction_model.py file.
You might be running an old version of NuPIC. See https://discourse.numenta.org/t/warning-0-7-0-breaking-changes/2200 and check what version you are using (https://discourse.numenta.org/t/how-to-check-what-version-of-nupic-is-installed/1045)
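If the root cause is instead that the executors cannot import your local source tree (which is what the "No module named frameworks.opf.htm_prediction_model" message suggests), shipping the code to the workers is another thing to try. A rough sketch, assuming spark is your active SparkSession and that the source tree has been zipped into a hypothetical nupic_src.zip:
# Hypothetical zip of the local NuPIC source tree; addPyFile distributes it to every executor
spark.sparkContext.addPyFile("C:/Users/rakshit.trn/Documents/Nupic/nupic_src.zip")

# The lambda now runs on workers that can import frameworks.opf.* from the shipped zip
input_data.rdd.foreach(lambda row: iterateRDD(row, model))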

system error when creating date_range with pandas

I would like to create a date_range() using pandas. I am fairly sure it worked before I updated the pandas package.
With the following line of code, I am trying to create the date_range():
date_time_index = pd.date_range(start='1/1/2018', periods=8760, freq='H')
And here is the error message:
ValueError: Error parsing datetime string "1/1/2018" at position 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 36, in <module>
date_time_index = pd.date_range(start='1/1/2018', periods=8760, freq='H')
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 2024, in date_range
closed=closed, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 301, in __new__
ambiguous=ambiguous)
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 403, in _generate
start = Timestamp(start)
File "pandas/tslib.pyx", line 406, in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:9940)
File "pandas/tslib.pyx", line 1401, in pandas.tslib.convert_to_tsobject (pandas/tslib.c:25239)
File "pandas/tslib.pyx", line 1516, in pandas.tslib.convert_str_to_tsobject (pandas/tslib.c:26859)
File "pandas/src/datetime.pxd", line 141, in datetime._string_t
SystemError: <class 'str'> returned a result with an error set
What am I doing wrong?
With pandas version 0.19.1, date_range() does not work with the input I gave. I updated pandas to 0.23.4 and now everything is fine.
Meanwhile:
pip3 install --upgrade pandas
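As a quick check after upgrading, you can confirm which pandas is actually imported and that the call now works (a small sketch; the exact output depends on your installed version):
import pandas as pd

print(pd.__version__)  # should report 0.23.4 or later after the upgrade

# 8760 hourly timestamps covering all of 2018
date_time_index = pd.date_range(start='1/1/2018', periods=8760, freq='H')
print(date_time_index[:3])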

Pyspark celery task : toPandas() throwing Pickling error

I have a web application that runs long-running tasks in PySpark. I am using Django and Celery to run the tasks asynchronously.
I have a piece of code that works great when I execute it in the console, but I am getting several errors when I run it through the Celery task.
Firstly, my UDFs don't work for some reason. I put the code in a try-except block and it always goes into the except block:
try:
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except:
    raise ValueError("No valid date format found.")
The error:
[2018-04-05 07:47:37,223: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[afbda586-0929-4d51-87f1-d612cbdb4c5e] raised unexpected: Py4JError('An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:\npy4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)\n\tat py4j.Gateway.invoke(Gateway.java:235)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\n\n',)
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 68, in outlier_algorithm
spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 179, in wrapper
return self(*args)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 157, in __call__
judf = self._judf
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 141, in _judf
self._judf_placeholder = self._create_judf()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 153, in _create_judf
self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Further, I am using toPandas() to convert the DataFrame and run some pandas functions on it, but it throws the following error:
[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
data_frame_new = data_frame_1.toPandas()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
return self.sparkSession.conf.get(key, defaultValue)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
return self._jconf.get(key)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.
I ran into this problem and was having a hard time pinning it down. As it turns out this error can occur if the version of Spark you are running does not match the version of PySpark you are executing it with. In my case I am running Spark 2.2.3.4 and was trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3 the problem went away. I ran into another problem caused by the code using functionality in PySpark that was added after 2.2.3, but that's another issue.
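One quick way to spot this kind of mismatch is to compare the version of the PySpark package with the version of the Spark installation the session actually talks to (a small check, not specific to this stack):
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("pyspark package:", pyspark.__version__)  # version of the pip-installed Python package
print("spark runtime:  ", spark.version)        # version reported by the JVM side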
This is just not going to work. Spark uses complex state, including JVM state, which cannot simply be serialized and sent to the worker. If you want to run your code asynchronously, use a thread pool to submit jobs.
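To illustrate that suggestion: instead of handing Spark objects to a separate Celery worker process, the long-running job can be submitted from a thread inside the driver process. A rough sketch, assuming a hypothetical run_outlier_job function that does the Spark work:
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("async_jobs").getOrCreate()

def run_outlier_job(path):
    # All Spark calls stay inside the driver process; only this Python thread blocks
    df = spark.read.csv(path, header=True)
    return df.count()

executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(run_outlier_job, "/data/input.csv")  # hypothetical input path
# ... do other work, then collect the result when it is ready
print(future.result())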
I'm answering my own question. It was probably a PySpark 2.3 bug.
I was using PySpark 2.3.0 and for some reason it did not work well with Python 3.5.
I downgraded to PySpark 2.1.2 and everything worked fine.
