So far, Spark does not provide DataFrames for streaming data, but for anomaly detection it is more convenient and faster to use DataFrames for the analysis. I have finished that part, but when I try to do real-time anomaly detection on streaming data, the problems appear. I have tried several approaches and still cannot convert a DStream into a DataFrame, nor convert the RDDs inside the DStream into DataFrames.
Here's part of my latest version of the code:
import sys
import re
from pyspark import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import KMeans, KMeansModel, StreamingKMeans
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import operator

sc = SparkContext(appName="test")
ssc = StreamingContext(sc, 5)
sqlContext = SQLContext(sc)
model_inputs = sys.argv[1]

def streamrdd_to_df(srdd):
    sdf = sqlContext.createDataFrame(srdd)
    sdf.show(n=2, truncate=False)
    return sdf

def main():
    indata = ssc.socketTextStream(sys.argv[2], int(sys.argv[3]))
    inrdd = indata.map(lambda r: get_tuple(r))
    Features = Row('rawFeatures')
    features_rdd = inrdd.map(lambda r: Features(r))
    features_rdd.pprint(num=3)
    streaming_df = features_rdd.flatMap(streamrdd_to_df)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
As you can see in the main() function, reading the input streaming data with the ssc.socketTextStream() method produces a DStream. I then tried to convert each element of the DStream into a Row, hoping the data could be converted into a DataFrame later.
If I use pprint() to print features_rdd here, it works, which makes me think each element of features_rdd is the RDD of one batch, while the whole features_rdd is a DStream.
Then I created the streamrdd_to_df() method, hoping to convert each batch RDD into a DataFrame, but it gives me the following error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Are there any thoughts on how I can do DataFrame operations on Spark streaming data?
Spark provides Structured Streaming, which solves exactly this kind of problem. It produces a streaming DataFrame, i.e. a DataFrame that is appended to continuously. Please check the link below:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
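For illustration only, here is a minimal sketch of that approach (not part of the original answer), assuming Spark 2.x, a SparkSession named spark, and text lines arriving on a socket at localhost:9999:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured_streaming_sketch").getOrCreate()

# lines is a streaming DataFrame: new rows are appended as lines arrive on the socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Any sink registered via writeStream counts as an output operation; console is the simplest.
query = (lines.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()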
Read the error carefully: it says there are no output operations registered. Spark is lazy and executes the job/code only when it has something to produce as a result. Your program has no "output operation", and that is what Spark is complaining about.
Define a foreach() or a raw SQL query over the DataFrame and then print the results; it will work fine.
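For example, here is a minimal sketch (reusing the sqlContext and features_rdd from the question) that registers foreachRDD as the output operation and turns each micro-batch RDD into a DataFrame:
# foreachRDD is an output operation, so the StreamingContext now has something to execute.
def show_batch(rdd):
    if not rdd.isEmpty():  # schema inference fails on empty batches
        sqlContext.createDataFrame(rdd).show(n=2, truncate=False)

features_rdd.foreachRDD(show_batch)  # use this instead of flatMap(streamrdd_to_df)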
Why don't you use something like this:
def socket_streamer(session):  # returns a streaming DataFrame (session is a SparkSession)
    streamer = session.readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()
    return streamer
The output of the function above (or of readStream in general) is itself a DataFrame. You don't need to worry about creating the df there; it is already created automatically by Spark.
See the Spark Structured Streaming Programming Guide
After a year, I started exploring Spark 2.0 streaming methods and finally solved my anomaly detection problem. Here's my code in IPython, where you can also see what my raw input data looks like.
There is no need to convert a DStream into an RDD: by definition, a DStream is a sequence of RDDs. Just use the DStream method foreachRDD() to loop over each RDD and take an action.
val conf = new SparkConf()
  .setAppName("Sample")
val spark = SparkSession.builder.config(conf).getOrCreate()

// sampleStream is your DStream of JSON strings
sampleStream.foreachRDD(rdd => {
  val sampleDataFrame = spark.read.json(rdd)
})
The Spark documentation has an introduction to working with DStreams. Basically, you have to use foreachRDD on your stream object to interact with it.
Here is an example (ensure you create a spark session object):
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext


def process_stream(record, spark):
    if not record.isEmpty():
        df = spark.createDataFrame(record)
        df.show()


def main():
    sc = SparkContext(appName="PysparkStreaming")
    spark = SparkSession(sc)
    ssc = StreamingContext(sc, 5)

    dstream = ssc.textFileStream(folder_path)
    transformed_dstream = dstream  # apply your transformations here
    transformed_dstream.foreachRDD(lambda rdd: process_stream(rdd, spark))
    #                   ^^^^^^^^^^
    ssc.start()
    ssc.awaitTermination()
With Spark 2.3 / Python 3 / Scala 2.11 (using Databricks), I was able to use temporary tables and a Scala code snippet (using notebook magics):
Python Part:
ddf.createOrReplaceTempView("TempItems")
Then in a new cell:
%scala
import java.sql.DriverManager
import org.apache.spark.sql.{ForeachWriter, Row}

// Create the query to be persisted...
val tempItemsDF = spark.sql("SELECT field1, field2, field3 FROM TempItems")

val itemsQuery = tempItemsDF.writeStream.foreach(new ForeachWriter[Row] {

  def open(partitionId: Long, version: Long): Boolean = {
    // Initializing DB connection / etc...
    true
  }

  def process(value: Row): Unit = {
    val field1 = value(0)
    val field2 = value(1)
    val field3 = value(2)
    // Processing values ...
  }

  def close(errorOrNull: Throwable): Unit = {
    // Closing connections etc...
  }
})
val streamingQuery = itemsQuery.start()
For the output below, I want to run multiple SQL queries, something like what is shown in the code below, but Spark does not support multiple SQL statements. Can you please suggest some workaround for this? It would be really helpful, thanks :)
Expected output:
Col_name  Max_val  Min_value
Name      Null     Null
Age       15       5
height    100      8
Code:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *

df = sc.parallelize([
    Row(name='Alice', age=5, height=80),
    Row(name='Kate', age=10, height=90),
    Row(name='Brain', age=15, height=100)]).toDF()
df.createOrReplaceTempView("Test")

df3 = spark.sql("select max(name) as name, max(age) as age, max(height) as height from Test")
df4 = df.selectExpr("stack(3, 'name', bigint(name), 'age', bigint(age), 'height', bigint(height)) as (col_name, max_data)")
df5 = spark.sql("select min(name) as name, min(age) as age, min(height) as height from Test")
df6 = df.selectExpr("stack(3, 'name', bigint(name), 'age', bigint(age), 'height', bigint(height)) as (col_name, min_data)")
df7 = df4.join(df6, ['col_name'], 'inner').groupBy("col_name").orderBy("col_name")
df7.show()
If you don't need exactly the same structure in the result, you can simply compute multiple aggregations in the same step (which is also more efficient):
from pyspark.sql import Row
from pyspark.sql.types import *
import pyspark.sql.functions as F
df = sc.parallelize([
    Row(name='Alice', age=5, height=80),
    Row(name='Kate', age=10, height=90),
    Row(name='Brain', age=15, height=100)]).toDF()

df2 = df.agg(
    F.max(F.col("height")).alias("max_height"),
    F.max(F.col("age")).alias("max_age"),
    F.min(F.col("height")).alias("min_height"),
    F.min(F.col("age")).alias("min_age")
)
df2.collect()
This gives a result of: [Row(max_height=100, max_age=15, min_height=80, min_age=5)]
To get this in the format above, you would have to use explode.
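For example, here is a hedged sketch of that reshaping on the df2 computed above, using stack (the function already used in the question) as an alternative to explode; the name column is omitted because it was not aggregated here:
# Unpivot the single aggregated row into one (Col_name, Max_val, Min_value) row per column.
df2.selectExpr(
    "stack(2, 'Age', max_age, min_age, 'height', max_height, min_height) "
    "as (Col_name, Max_val, Min_value)"
).show()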
In Scala you can achieve this via the Futures API; you could then expose your Scala code to PySpark (see below).
Something like this:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val futures = for (q <- queries) yield Future {
  spark.sql(q)
}
val results = futures.map(Await.result(_, Duration("+Inf")))
Note that "+Inf" is just illustrative, dont use Inf because timeout will never happen and your code might hang forever.
This will of course not support .show() since that would be ran on top of a DataFrame and here I assume queries are a collection of queries.
You could then wrap this in a spark.ml.Transformer and pass the list of queries as Params.
Then you could pass your jar to pyspark at spark-submit.
Lastly, you could access your transformer from PySpark via spark._jvm.
It is quite a workaround, and I am only proposing it because I am aware it can work.
Could I ask why it is essential that the statements in your example run in parallel? That could help in finding a better suggestion.
Hi, I am following Frank Kane's course on Apache Spark with Python. Here I am trying to calculate the total amount spent by each customer. I have mentioned the error below; kindly help. Following is my code:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("MaxTemperatures")
sc = SparkContext(conf=conf)

def parseline(lines):
    fields = lines.split(',')
    customerId = int(fields[0])
    dollars = float(fields[2])
    return (customerId, dollars)

text = sc.textFile("file:///Sparkcourse/SparkCourse/customer-orders.csv")
rdd = text.map(parseline)
reduction = rdd.map(lambda x: x[1]).reduceByKey(lambda x, y: x + y)
sortedvalues = reduction.sortByKey()
final = sortedvalues.collect()

for i, j in final:
    print(i, j)
TypeError: cannot unpack non-iterable float object
I am not sure exactly what you want to do, but there is at least one error in your code: reduceByKey operates on an RDD of (key, value) pairs, yet you call it after .map(lambda x: x[1]), which strips the keys and leaves bare floats (this is where the "cannot unpack non-iterable float object" comes from). Apply it directly to the (customerId, dollars) pairs:
rdd.reduceByKey(lambda x, y: x + y)
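As a sketch (reusing the variable names from the question), the corrected tail of the job would then look roughly like this:
# reduceByKey needs (key, value) pairs, so do not strip the keys first.
totals = rdd.reduceByKey(lambda x, y: x + y)  # total dollars per customerId
for customer_id, dollars in totals.sortByKey().collect():
    print(customer_id, dollars)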
That said, I think every step you need here can be translated into the DataFrame API, which is easier to use. RDDs are not the easiest structure to handle for simple operations such as sums (and DataFrames will be faster).
So I would propose something like the following. You will probably have to change the schema to match your CSV structure. Assuming your Spark session is named spark:
import pyspark.sql.types as pst
import pyspark.sql.functions as psf

schema = pst.StructType([
    pst.StructField("customerId", pst.IntegerType(), True),
    pst.StructField("dollars", pst.IntegerType(), True),
    pst.StructField("productid", pst.IntegerType(), True)])

(spark.read
    .csv("file:///Sparkcourse/SparkCourse/customer-orders.csv", header=False, schema=schema)
    .groupBy('customerId')
    .agg(psf.sum("dollars").alias("dollars"))
    .orderBy('dollars')
)
I would like to use our Spark cluster to run programs in parallel. My idea is to do something like the following:
def simulate():
    # some magic happening in here
    return 0

spark = (
    SparkSession.builder
    .appName('my_simulation')
    .enableHiveSupport()
    .getOrCreate())

sc = spark.sparkContext

no_parallel_instances = sc.parallelize(xrange(500))
res = no_parallel_instances.map(lambda row: simulate())
print res.collect()
The question I have is whether there is a way to execute simulate() with different parameters. The only way I can currently imagine is to have a dataframe specifying the parameters, so something like this:
parameter_list = [[5,2.3,3], [3,0.2,4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(row))
print res.collect()
Is there another, more elegant way to run parallel functions with spark?
If the data you are looking to parameterize your call with differs between each row, then yes you will need to include that with each row.
However, if you are looking to set global parameters that affect every row, then you can use a broadcast variable.
http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
Broadcast variables are created once in your script and cannot be modified after that. Spark will efficiently distribute those values to every executor to make them available to your transformations. To create one you provide the data to spark and it gives you back a handle you can use to access it on the executors. For example:
settings_bc = sc.broadcast({
    'num_of_donkeys': 3,
    'donkey_color': 'brown'
})

def simulate(settings, n):
    # do magic
    return n

no_parallel_instances = sc.parallelize(xrange(500))
res = no_parallel_instances.map(lambda row: simulate(settings_bc.value, row))
print res.collect()
I'm new to Spark and I'm trying to develop a python script that reads a csv file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
The script should then check whether a user showed some strange behavior, for example two consecutive 'login' events without a 'logoff' in between. I've loaded the csv as a Spark DataFrame and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login-login, logoff-logoff). I'm looking for a 'map-reduce' way of doing it, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
sc = SparkContext("local", "Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000

dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userID", "C1 as timestamp", "C2 as ip", "C3 as event")

wrongUsers = []
for i in range(0, N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if prevEvent == currEvent:
            wrongUsers.append(row[0])
        prevEvent = currEvent

badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
First (not directly related, but still): make sure the number of entries per user is not too big, because the collect in for row in userDataFrame.rdd.collect(): is dangerous.
Second, you don't need to leave the DataFrame area here to use classical Python, just stick to Spark.
Now, your problem. It's basically "for each line I want to know something from the previous line", which belongs to the concept of Window functions, and more precisely to the lag function. Here are two interesting articles about Window functions in Spark: one from Databricks with code in Python and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll manage to translate it to Python (a sketch of a translation is also given after the explanation below):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._
val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)

val wSpec = Window.partitionBy("userId").orderBy("timestamp")

val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct

badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically, you just retrieve the value from the previous line and compare it to the value on your current line; if they match, that is a wrong behavior and you keep the userId. For the first line in each userId's "block" of lines, the previous value will be null: when comparing it with the current value, the boolean expression will be false, so no problem there.
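As mentioned above, the Scala code should translate almost mechanically to Python; here is a hedged sketch of such a translation, assuming the sqlContext and the two path variables from the question:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag

data = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(LOG_FILE_PATH))

# For each user, ordered by timestamp, look one event back.
w_spec = Window.partitionBy("userId").orderBy("timestamp")

bad_users = (data
             .withColumn("previousEvent", lag(col("event"), 1).over(w_spec))
             .filter(col("previousEvent") == col("event"))
             .select("userId")
             .distinct())

bad_users.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)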
I'm using PySpark to load data from Google BigQuery.
I've loaded data by using:
dfRates = sc.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
Where conf is defined as in https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example.
I need this data as a DataFrame, so I tried,
row = Row(['userId','accoId','rating']) # or row = Row(('userId','accoId','rating'))
dataRDD = dfRates.map(row).toDF()
and
dataRDD = sqlContext.createDataFrame(dfRates,['userId','accoId','rating'])
But it does not convert the data into a DataFrame. Is there a way to convert it into a DataFrame?
As long as the types can be represented using Spark SQL types, there is no reason it couldn't be. The only problem here seems to be your code.
newAPIHadoopRDD returns an RDD of pairs (tuples of length two). In this particular context it looks like you'll get (int, str) in Python, which clearly cannot be unpacked into ['userId', 'accoId', 'rating'].
According to the doc you've linked, com.google.gson.JsonObject is represented as a JSON string, which can be parsed either on the Python side using standard Python utilities (the json module):
def parse(v, fields=["userId", "accoId", "rating"]):
    row = Row(*fields)
    try:
        parsed = json.loads(v)
    except json.JSONDecodeError:
        parsed = {}
    return row(*[parsed.get(x) for x in fields])

# parse only the JSON value of each (key, value) pair
dfRates.values().map(parse).toDF()
or on the Scala / DataFrame side using get_json_object:
from pyspark.sql.functions import col, get_json_object
dfRates.toDF(["id", "json_string"]).select(
# This assumes you expect userId field
get_json_object(col("json_string"), "$.userId"),
...
)
Please note the differences in the syntax I've used to define and create rows.
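To make that note concrete, here is a small sketch (illustrative values only) of the two ways Row can be used:
from pyspark.sql import Row

# Row(*names) builds a reusable row "class" with named fields ...
Rating = Row("userId", "accoId", "rating")
Rating(1, 2, 5.0)  # -> Row(userId=1, accoId=2, rating=5.0)

# ... whereas Row(['userId', 'accoId', 'rating']), as in the question,
# creates a single row whose only value is that list.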
hbase table rows:
hbase(main):008:0> scan 'test_hbase_table'
ROW COLUMN+CELL
dalin column=cf:age, timestamp=1464101679601, value=40
tangtang column=cf:age, timestamp=1464101649704, value=9
tangtang column=cf:name, timestamp=1464108263191, value=zitang
2 row(s) in 0.0170 seconds
here we go
import json
host = '172.23.18.139'
table = 'test_hbase_table'
conf = {"hbase.zookeeper.quorum": host, "zookeeper.znode.parent": "/hbase-unsecure", "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)

hbase_rdd1 = hbase_rdd.flatMapValues(lambda v: v.split("\n"))
and here are the results:
tt = sqlContext.jsonRDD(hbase_rdd1.values())
In [113]: tt.show()
+------------+---------+--------+-------------+----+------+
|columnFamily|qualifier| row| timestamp|type| value|
+------------+---------+--------+-------------+----+------+
| cf| age| dalin|1464101679601| Put| 40|
| cf| age|tangtang|1464101649704| Put| 9|
| cf| name|tangtang|1464108263191| Put|zitang|
+------------+---------+--------+-------------+----+------+