How to calculate or manage streaming data in PySpark - python

I want to calculate aggregates over streaming data and then send them to a web page. For example, I want to compute the sum of the TotalSales column in the stream. But it errors at summary = dataStream.select('TotalSales').groupby().sum().toPandas(). This is my code:
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("Python Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
schema = StructType().add("_c0", "integer").add("InvoiceNo", "string").add("Quantity","integer").add("InvoiceDate","date").add("UnitPrice","integer").add("CustomerID","double").add("TotalSales","integer")
INPUT_DIRECTORY = "C:/Users/HP/Desktop/test/jsonFile"
dataStream = spark.readStream.format("json").schema(schema).load(INPUT_DIRECTORY)
query = dataStream.writeStream.format("console").start()
summary = dataStream.select('TotalSales').groupby().sum().toPandas()
print(query.id)
query.awaitTermination();
And this is the error shown on the command line:
Traceback (most recent call last):
File "testStreaming.py", line 12, in <module>
dataStream = dataStream.toPandas()
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\dataframe.py", line 2150, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\dataframe.py", line 534, in collect
sock_info = self._jdf.collectToPython()
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[C:/Users/HP/Desktop/test/jsonFile]'
Thanks for your answers.

Why are you trying to create a pandas DataFrame?
toPandas creates a DataFrame that is local to your driver node. I am not sure what you are trying to achieve here. A pandas DataFrame represents a fixed set of tuples, whereas a structured stream is a continuous stream of data.
One possible solution is to complete the whole transformation you want inside the stream, write the output to a parquet/csv file, and then use that parquet/csv file to create a pandas DataFrame:
summary = dataStream.select('TotalSales').groupby().sum()
# write the aggregated stream (summary), not the raw dataStream;
# note: the parquet/csv file sink only supports append mode, so for a
# running aggregation like this, foreachBatch (sketched below) is more robust
query = summary.writeStream.format("parquet").outputMode("complete").start(outputPathDir)
query.awaitTermination()
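If your Spark version has foreachBatch (2.4+), a more direct route is to convert each micro-batch, since inside foreachBatch the batch is a plain, non-streaming DataFrame. A minimal sketch, reusing the schema and INPUT_DIRECTORY from the question; send_to_web_page is a hypothetical stand-in for however you push results to your page:
def handle_batch(batch_df, batch_id):
    # each micro-batch is a regular DataFrame, so toPandas() is allowed here
    summary = batch_df.select('TotalSales').groupBy().sum().toPandas()
    print(batch_id, summary)  # or: send_to_web_page(summary)

dataStream = spark.readStream.format("json").schema(schema).load(INPUT_DIRECTORY)
query = dataStream.writeStream.foreachBatch(handle_batch).start()
query.awaitTermination()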

Related

How do I debug my Python code to identify whether my issues are caused by my code or my installation of pandas

I'm trying to use pandas for the first time. My script opens a sqlite database, reads the contents of the test_t table, then converts the Unix millisecond timestamps in the 'timestamp' column to dd/mm/yyyy hh:mm:ss before outputting the results to an Excel file. I can't tell whether the traceback issues are caused by my code, my installation of pandas, or the format of the strings in the 'timestamp' column, an example of which is 1670258569783. I have created a test script which correctly reports the pandas version, so pandas seems to be installed. I've included my code and the debug output below.
Any guidance on resolving this issue would be greatly appreciated.
My Code
import sqlite3
import pandas as pd
from datetime import datetime
# Connect to the database
conn = sqlite3.connect('test.db')
# Query the database
query = '''SELECT * FROM test_t'''
df = pd.read_sql_query(query, conn)
# Convert the Unix millisecond timestamps to human-readable dates and times
df['human_readable_time'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000).strftime('%d/%m/%Y %H:%M:%S'))
# Export to an Excel file
df.to_excel('message_view.xlsx', index=False)
# Close the connection
conn.close()
Traceback Errors
C:\Users\xxxx\AppData\Local\Microsoft\WindowsApps\python3.9.exe C:\temp\MyPython\testscript.py
Traceback (most recent call last):
File "C:\temp\MyPython\testscript.py", line 13, in <module>
df['human_readable_time'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000).strftime('%d/%m/%Y %H:%M:%S'))
File "C:\Users\xxxx\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\series.py", line 4771, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "C:\Users\xxxx\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\apply.py", line 1123, in apply
return self.apply_standard()
File "C:\Users\xxxx\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\apply.py", line 1174, in apply_standard
mapped = lib.map_infer(
File "pandas\_libs\lib.pyx", line 2924, in pandas._libs.lib.map_infer
File "C:\temp\MyPython\testscript.py", line 13, in <lambda>
df['human_readable_time'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000).strftime('%d/%m/%Y %H:%M:%S'))
ValueError: Invalid value NaN (not a number)
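The final ValueError points at NaN values in the 'timestamp' column rather than at the pandas installation itself. A minimal sketch of one way to tolerate them, using pd.to_datetime with errors='coerce' so missing timestamps become NaT instead of raising:
import sqlite3
import pandas as pd

conn = sqlite3.connect('test.db')
df = pd.read_sql_query('SELECT * FROM test_t', conn)
# convert the whole column at once; errors='coerce' turns NaN and other
# unparseable values into NaT instead of raising ValueError
df['human_readable_time'] = (
    pd.to_datetime(df['timestamp'], unit='ms', errors='coerce')
      .dt.strftime('%d/%m/%Y %H:%M:%S')
)
df.to_excel('message_view.xlsx', index=False)
conn.close()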

Spark SQL cannot output the dataframe

I tried running the following code, but I cannot get a result; the error message is shown below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hive').enableHiveSupport().getOrCreate()
list = spark.read.format("csv").option("header", "true").load(r"mypath/mydata.csv")
list.createOrReplaceTempView("mydata")
df = spark.sql("""select * from mydata""")
Error info:
Traceback (most recent call last):
File "<ipython-input-31-61851d7298cc>", line 1, in <module>
df = spark.sql("""select * from mydata""")
File "C:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "C:\ProgramData\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Can anyone help me figure out how to resolve this? I am using Spyder with Python 3.7.
Thank you!
Remove enableHiveSupport() if you are not actually using Hive:
spark = SparkSession.builder.appName('hive').getOrCreate()
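For reference, a minimal end-to-end sketch without Hive support, based on the question's own code (the path is a placeholder, and the variable is renamed so it no longer shadows the Python builtin list):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('csv-demo').getOrCreate()
# read the CSV with a header row into a DataFrame
mydata_df = spark.read.format("csv").option("header", "true").load("mypath/mydata.csv")
mydata_df.createOrReplaceTempView("mydata")
df = spark.sql("select * from mydata")
df.show()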

How to convert a document type to a Spark RDD

I'm trying to convert a document type into a Spark RDD, but I have no idea how to do that.
Basically, I'm trying to use the Google Cloud NLP API in Apache Spark. Below is my code:
EDITED
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import six
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
spark = SparkSession.builder.master('yarn-client').appName('SparkNLP').getOrCreate()
gcs_uri = 'gs://mybucket/reddit.json'
document = types.Document(gcs_content_uri=gcs_uri,type=enums.Document.Type.PLAIN_TEXT)
readRDD = spark.read.text(document)
Of course, the spark.read.text line gives an error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 328, in text
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: _get_object_id
Can anyone please point me in the right direction?
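One likely cause: spark.read.text expects path strings, not a google.cloud.language Document object, which is why py4j fails trying to serialize the argument. A minimal sketch, assuming the cluster has the GCS connector so Spark can read gs:// paths directly:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkNLP').getOrCreate()
gcs_uri = 'gs://mybucket/reddit.json'
# pass the gs:// URI itself; each resulting Row holds one line of the file
# in its `value` field
lines_df = spark.read.text(gcs_uri)
lines_rdd = lines_df.rdd.map(lambda row: row.value)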

Replacing Bad Data in Pandas Data Frame

In Python 2.7, I'm connecting to an external data source using the following:
import pypyodbc
import pandas as pd
import datetime
import csv
import boto3
import os
# Connect to the DataSource
conn = pypyodbc.connect("DSN = FAKE DATA SOURCE; UID=FAKEID; PWD=FAKEPASSWORD")
# Specify the query we're going to run on it
script = ("SELECT * FROM table")
# Create a dataframe from the above query
df = pd.read_sql_query(script, conn)
I get the following error:
C:\Python27\python.exe "C:/Thing.py"
Traceback (most recent call last):
File "C:/Thing.py", line 30, in <module>
df = pd.read_sql_query(script,conn)
File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 431, in read_sql_query
parse_dates=parse_dates, chunksize=chunksize)
File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 1608, in read_query
data = self._fetchall_as_list(cursor)
File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 1617, in _fetchall_as_list
result = cur.fetchall()
File "build\bdist.win32\egg\pypyodbc.py", line 1819, in fetchall
File "build\bdist.win32\egg\pypyodbc.py", line 1871, in fetchone
ValueError: could not convert string to float: ?
It seems to me that one of the float columns contains a '?' symbol for some reason. I've reached out to the owner of the data source, but they cannot change the underlying table.
Is there a way to replace incorrect data like this using pandas? I've tried using replace after the read_sql_query statement, but I get the same error.
Hard to know for certain without your data, obviously, but you could try setting coerce_float to False, i.e. replace your last line with:
df = pd.read_sql_query(script, conn, coerce_float=False)
See the documentation of read_sql_query.
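If that loads cleanly, one hedged follow-up sketch for converting the affected column back to floats afterwards ('bad_column' is a placeholder for whichever column holds the '?' values):
import numpy as np
import pandas as pd

df = pd.read_sql_query(script, conn, coerce_float=False)
# 'bad_column' is a placeholder; swap the '?' markers for NaN, then let
# to_numeric coerce anything else unparseable to NaN as well
df['bad_column'] = pd.to_numeric(df['bad_column'].replace('?', np.nan), errors='coerce')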

How can I format this array of arrays into a pandas data frame?

Here is the data
Originally I was using openpyxl and the .split() method to separate the arrays of data. This still leaves some formatting, but most of all I would really like to be able to do this with pandas.
Any help would be great, thanks!
EDIT: Also, if anyone knows some good tutorials for pandas beginners, that would be great!
EDIT2:
Ami Tavory's answer throws this error:
Traceback (most recent call last):
File "C:\Users\David\Desktop\Python\Coursera\Computational Finance\CAPM\Scatter\JSONparser.py", line 7, in <module>
data = json.load(open('ETH_USD.txt'))
File "C:\Python27\lib\json\__init__.py", line 290, in load
**kw)
File "C:\Python27\lib\json\__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "C:\Python27\lib\json\decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 13409 - line 1 column 13426 (char 13408 - 13425)
EDIT3: this is my code:
# Import the JSON parser
import json
# and pandas
import pandas as pd
# Assuming the data is in stuff.txt
data = json.load(open('ETH_USD.txt'))
#pd.DataFrame(data)
[Finished in 1.1s]
EDIT4: this worked like a treat:
# Import the JSON parser
import json
# and pandas
import pandas as pd
URL = 'http://cryptocoincharts.info/fast/period.php?pair=ETH-USDT&market=poloniex&time=alltime&resolution=1d'
data = pd.read_json(URL)
data = pd.DataFrame(data)
data.to_csv('ETH_USD_PANDAS.csv')
There are several ways. Based on the format of the text to which you linked, here is the one I think is easiest:
# Import the JSON parser
import json
# and pandas
import pandas as pd
# Assuming the data is in stuff.txt
data = json.load(open('stuff.txt'))
pd.DataFrame(data)
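As a side note on the asker's EDIT2 traceback: json.load raises "Extra data" when a file holds more than one JSON value back to back, which is why fetching the URL directly with pd.read_json (EDIT4 above) worked. If you ever do need to parse such a file yourself, a hedged sketch using raw_decode (assuming each value is itself table-like):
import json
import pandas as pd

decoder = json.JSONDecoder()
frames = []
with open('ETH_USD.txt') as f:
    buf = f.read()
idx = 0
while idx < len(buf):
    # raw_decode parses one JSON value and reports where it ended,
    # so we can step through several concatenated values
    value, idx = decoder.raw_decode(buf, idx)
    frames.append(pd.DataFrame(value))
    # skip any separators between values
    while idx < len(buf) and buf[idx] in ' \t\r\n,':
        idx += 1
result = pd.concat(frames, ignore_index=True)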
