For the output below, I want to run multiple SQL queries, something like the code shown below, but Spark does not support multiple SQL statements in a single call. Can you please suggest some other workaround for this? It would be really helpful, thanks :)
Expected output:
Col_name  Max_val  Min_val
Name      Null     Null
Age       15       5
height    100      8
Code:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *

df = sc.parallelize([
    Row(name='Alice', age=5, height=80),
    Row(name='Kate', age=10, height=90),
    Row(name='Brain', age=15, height=100)]).toDF()
df.createOrReplaceTempView("Test")

# max of every column, then unpivot to (col_name, max_data)
df3 = spark.sql("select max(name) as name, max(age) as age, max(height) as height from Test")
df4 = df3.selectExpr("stack(3, 'name', bigint(name), 'age', bigint(age), 'height', bigint(height)) as (col_name, max_data)")

# min of every column, then unpivot to (col_name, min_data)
df5 = spark.sql("select min(name) as name, min(age) as age, min(height) as height from Test")
df6 = df5.selectExpr("stack(3, 'name', bigint(name), 'age', bigint(age), 'height', bigint(height)) as (col_name, min_data)")

df7 = df4.join(df6, ['col_name'], 'inner').orderBy("col_name")
df7.show()
If you don't need the exact same structure for the result, you can simply have multiple aggregations in the same step (which would also be more efficient):
from pyspark.sql import Row
from pyspark.sql.types import *
import pyspark.sql.functions as F
df = sc.parallelize([ \
Row(name='Alice', age=5, height=80), \
Row(name='Kate', age=10, height=90), \
Row(name='Brain', age=15, height=100)]).toDF()
df2 = df.agg(
F.max(F.col("height")).alias("max_height"),
F.max(F.col("age")).alias("max_age"),
F.min(F.col("height")).alias("min_height"),
F.min(F.col("age")).alias("min_age")
)
df2.collect()
This gives a result of: [Row(max_height=100, max_age=15, min_height=80, min_age=5)]
To get this in the format above, you would have to use explode.
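For example, a minimal sketch of that reshaping step (using stack in selectExpr instead of explode; df2 and the column names come from the snippet above):

df_reshaped = df2.selectExpr(
    "stack(2, 'age', max_age, min_age, 'height', max_height, min_height) as (col_name, max_val, min_val)"
)
df_reshaped.show()
# rows: ('age', 15, 5) and ('height', 100, 80)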
In Scala you can achieve this via the Futures API; you could then expose your Scala code to PySpark (more on that below). Something like this:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val results = queries
  .map(q => Future { spark.sql(q) })
  .map(Await.result(_, Duration("+Inf")))
Note that "+Inf" is just illustrative, dont use Inf because timeout will never happen and your code might hang forever.
This will of course not support .show() since that would be ran on top of a DataFrame and here I assume queries are a collection of queries.
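For reference, a rough PySpark-side equivalent of this parallel-submission idea (my sketch, not part of the original answer) can use concurrent.futures; queries is again assumed to be a collection of SQL strings:

from concurrent.futures import ThreadPoolExecutor

def run_query(q):
    # collect() forces execution; without an action nothing actually runs
    return spark.sql(q).collect()

# Submit each query from its own thread; Spark schedules the resulting jobs concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_query, queries))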
You could then wrap this in a spark.ml.Transformer and pass the list of queries as Params.
Then you could pass your jar to pyspark at spark-submit.
Lastly, you could then access your transformer via spark._jvm.
It is quite a workaround, and I am only proposing it because I am aware it could work.
Could I ask why it is essential that the statements in your example are run in parallel? That could help in finding a better suggestion.
I have a blob storage container in Azure and I want to load all of the .csv files in the container into a single Spark dataframe. All the files have the same first two columns ('name', 'time'). I do some transformations on the time column to convert it into a datetime field, and I also create a new id column based on the filename and move it so it is the first column. The remaining columns follow a consistent naming format; however, some files have more of them than others. For example:
One file could be like this:
id   name   time                 LV01   LV02   LV03
abc  name1  01/01/1900 01:00:00  47.96  23.10  43.00
Whereas the next file might have columns that go up to LV15, and another might have columns that go up to LV25 etc.
I am using the following code to load my data and initially it seems to be working:
from pyspark.sql.functions import *
file_location = 'dbfs:/mnt/<container>/<foldername>'
file_type = "csv"
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
df3 = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location) \
.withColumn("id",substring(input_file_name(), 45, 3)) #create id column from filename
#df3.select([df3.columns[-1]] + df3.columns[:-1]).show() #move id to first column
df3 = df3.select([df3.columns[-1]] + df3.columns[:-1]) #move id column to beginning
df3 = df3.withColumn("time", (col('time')/1000000000)) #convert nanoseconds into seconds as pyspark doesn't have nanoseconds
df3 = df3.withColumn("time",from_unixtime(col('time'))) #convert seconds into datetime values
df3.show()
I've checked that the files have loaded into the dataframe by checking the id column and seeing if the ids have been generated correctly (which they have). The issue is with the number of columns: when I display the dataframe, I'm getting inconsistent results. I only get columns up to LV14, and I know some files go up to LV25 etc. Am I missing something? I need all of the columns to be present, as next I need to perform the equivalent of the pandas melt operation (which I have done in Python as follows):
cols = df.iloc[:,3:-1:]
col_names = list(cols.columns.values)
col_names
df_long = df.melt(id_vars=['id','name', 'time'], var_name='channel',value_vars=col_names, value_name='value')
df_long.head()
I can't perform this step in Databricks yet (or some equivalent) until I know the columns are being pulled through correctly. When I'm loading the files, does Spark only load the columns that are consistent across all files?
Recent versions of PySpark (3.1+) have the following feature:
df = df1.unionByName(df2, allowMissingColumns=True)
This unions two DataFrames with different columns. More details in the API docs.
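For example, a minimal sketch along those lines (the container path and the dbutils file listing are assumptions about the setup):

from functools import reduce

# List the csv files in the mounted container (Databricks utility)
paths = [f.path for f in dbutils.fs.ls(file_location) if f.path.endswith(".csv")]

# Read each file separately, then union them; missing columns are filled with null
dfs = [spark.read.option("header", "true").option("inferSchema", "true").csv(p) for p in paths]
df_all = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)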
I have figured out a solution to this. Instead of using the infer_schema option, I can simply set the schema manually so that it includes all of the columns across each of the files:
from pyspark.sql.types import StructType, StringType

schema = StructType() \
    .add("name", StringType(), True) \
    .add("time", StringType(), True) \
    .add("LV01", StringType(), True) \
    .add("LV02", StringType(), True) \
    .add("LV03", StringType(), True) \
    .add("LV04", StringType(), True) \
    .add("LV05", StringType(), True) \
    .add("LV06", StringType(), True) \
    .add("LV07", StringType(), True)
    # etc etc
Once this is done, you can just pass the schema option into the load code:
first_row_is_header = "true"
delimiter = ","
df = spark.read.format(file_type) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.schema(schema) \
.load(file_location)
This works great for my example, although for files with 100s of columns, there will probably be a more efficient method. Still, this works for what I need to do.
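For instance, if the extra columns really are just LV01 up to some LVnn, the schema could be built in a loop instead of typing each field by hand (a sketch, assuming the highest column index is known; 25 is used here as an example):

from pyspark.sql.types import StructType, StructField, StringType

base_fields = [StructField("name", StringType(), True),
               StructField("time", StringType(), True)]
lv_fields = [StructField("LV{:02d}".format(i), StringType(), True) for i in range(1, 26)]
schema = StructType(base_fields + lv_fields)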
I'm new to Spark and I'm trying to develop a python script that reads a csv file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
And checks whether a user has shown some strange behaviour, for example doing two consecutive 'login' events without a 'logoff' in between. I've loaded the csv as a Spark DataFrame and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login - login, logoff - logoff). I'm looking to do it in a 'map-reduce' way, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
sc = SparkContext("local", "Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000

dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userId", "C1 as timestamp", "C2 as ip", "C3 as event")

wrongUsers = []
for i in range(0, N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if prevEvent == currEvent:
            wrongUsers.append(row[0])
        prevEvent = currEvent

badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
First (not related, but still): be sure that the number of entries per user is not that big, because the collect in for row in userDataFrame.rdd.collect(): is dangerous.
Second, you don't need to leave the DataFrame area here to use classical Python, just stick to Spark.
Now, your problem. It's basically "for each line I want to know something from the previous line": that belongs to the concept of Window functions and to be precise the lag function. Here are two interesting articles about Window functions in Spark: one from Databricks with code in Python and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll manage to translate it to Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"

val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)

val wSpec = Window.partitionBy("userId").orderBy("timestamp")

val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct

badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically you just retrieve the value from the previous line and compare it to the value on your current line; if they match, that is a wrong behaviour and you keep the userId. For the first line in each userId's "block" of lines, the previous value will be null: when comparing with the current value the boolean expression will be false, so no problem there.
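For reference, a rough PySpark translation of that Scala snippet (my sketch, not from the original answer; column names as in the question) could look like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

data = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(LOG_FILE_PATH))

w_spec = Window.partitionBy("userId").orderBy("timestamp")

bad_users = (data
             .withColumn("previousEvent", F.lag("event", 1).over(w_spec))
             .filter(F.col("previousEvent") == F.col("event"))
             .select("userId")
             .distinct())

bad_users.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)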
I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly, using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
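Tying this back to the question, where a is a tuple produced by another job, the same approach would be (a minimal sketch):

from pyspark.sql.functions import col

a = (1, 2, 3)  # e.g. the tuple coming from the other job
my_df.where(col("field1").isin(list(a))).show()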
Reiterating what @zero323 has mentioned above: we can do the same thing using a list as well (not only a set), like below
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe df so that you keep only the rows where column "v" takes values from choice_list, then:
from pyspark.sql.functions import col
df_filtered = df.where(col("v").isin(choice_list))
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
A slightly different approach that worked for me is to filter with a custom filter function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

def filter_func(a):
    """wrapper function to pass a in udf"""
    def filter_func_(value):
        """filtering function"""
        if value in a.value:
            return True
        return False
    return udf(filter_func_, BooleanType())

# Broadcasting allows passing large variables efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark = spark.read.csv('datasets/myData.csv', header=True, inferSchema=True)
df_pyspark.createOrReplaceTempView("df")  # we need to create a temp view first
spark.sql("SELECT * FROM df WHERE Departments IN ('IOT','Big Data') ORDER BY Departments").show()
So far, Spark doesn't create a DataFrame for streaming data, but for my anomaly detection it is more convenient and faster to use DataFrames for the analysis. I have done that part, but when I try to do real-time anomaly detection on streaming data, the problems appear. I tried several ways and still could not convert the DStream into a DataFrame, nor convert the RDDs inside the DStream into DataFrames.
Here's part of my latest version of the code:
import sys
import re
from pyspark import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import KMeans, KMeansModel, StreamingKMeans
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import operator

sc = SparkContext(appName="test")
ssc = StreamingContext(sc, 5)
sqlContext = SQLContext(sc)
model_inputs = sys.argv[1]

def streamrdd_to_df(srdd):
    sdf = sqlContext.createDataFrame(srdd)
    sdf.show(n=2, truncate=False)
    return sdf

def main():
    indata = ssc.socketTextStream(sys.argv[2], int(sys.argv[3]))
    inrdd = indata.map(lambda r: get_tuple(r))
    Features = Row('rawFeatures')
    features_rdd = inrdd.map(lambda r: Features(r))
    features_rdd.pprint(num=3)
    streaming_df = features_rdd.flatMap(streamrdd_to_df)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
As you can see in the main() function, when I read the input streaming data using the ssc.socketTextStream() method, it generates a DStream; then I tried to convert each element in the DStream into a Row, hoping I could convert the data into a DataFrame later.
If I use pprint() to print out features_rdd here, it works, which makes me think each element of features_rdd is a batch RDD while the whole features_rdd is a DStream.
Then I created the streamrdd_to_df() method, hoping to convert each batch RDD into a dataframe, but it gives me this error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Is there any thought about how can I do DataFrame operations on Spark streaming data?
Spark has provided us with Structured Streaming, which can solve such problems. It can generate a streaming DataFrame, i.e. a DataFrame that is continuously appended to. Please check the link below:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
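As a minimal sketch of that approach (assuming a socket source on localhost:9999, as in the other answers here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# A streaming DataFrame that grows as lines arrive on the socket
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Ordinary DataFrame operations work on the streaming DataFrame
counts = lines.groupBy("value").count()

# An output sink must be defined to actually start the query
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()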
Read the error carefully: it says there are no output operations registered. Spark is lazy and executes the job/code only when it has something to produce as a result. In your program there is no "output operation", and that is what Spark is complaining about.
Define a foreachRDD() or a raw SQL query over the DataFrame and then print the results. It will work fine.
Why don't you use something like this:
def socket_streamer(session):  # returns a streaming dataframe
    streamer = session.readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()
    return streamer
The output of this function (or of readStream in general) is itself a DataFrame. You don't need to worry about the df; it is already created automatically by Spark.
See the Spark Structured Streaming Programming Guide
After a year, I started exploring Spark 2.0 streaming methods and finally solved my anomaly detection problem. Here's my code in IPython; you can also see what my raw data input looks like.
There is no need to convert the DStream into an RDD. By definition a DStream is a collection of RDDs. Just use the DStream's foreachRDD() method to loop over each RDD and take action.
val conf = new SparkConf()
  .setAppName("Sample")
val spark = SparkSession.builder.config(conf).getOrCreate()

sampleStream.foreachRDD(rdd => {
  val sampleDataFrame = spark.read.json(rdd)
})
The spark documentation has an introduction to working with DStream. Basically, you have to use foreachRDD on your stream object to interact with it.
Here is an example (ensure you create a spark session object):
def process_stream(record, spark):
    if not record.isEmpty():
        df = spark.createDataFrame(record)
        df.show()

def main():
    sc = SparkContext(appName="PysparkStreaming")
    spark = SparkSession(sc)
    ssc = StreamingContext(sc, 5)
    dstream = ssc.textFileStream(folder_path)
    transformed_dstream = # transformations
    transformed_dstream.foreachRDD(lambda rdd: process_stream(rdd, spark))
    #                   ^^^^^^^^^^
    ssc.start()
    ssc.awaitTermination()
With Spark 2.3 / Python 3 / Scala 2.11 (using Databricks) I was able to use temporary tables and a Scala code snippet (using the %scala magic in notebooks):
Python Part:
ddf.createOrReplaceTempView("TempItems")
Then on a new cell:
%scala
import java.sql.DriverManager
import org.apache.spark.sql.{ForeachWriter, Row}

// Create the query to be persisted...
val tempItemsDF = spark.sql("SELECT field1, field2, field3 FROM TempItems")

val itemsQuery = tempItemsDF.writeStream.foreach(new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = {
    // Initializing DB connection / etc...
    true // signal that this partition should be processed
  }

  def process(value: Row): Unit = {
    val field1 = value(0)
    val field2 = value(1)
    val field3 = value(2)
    // Processing values ...
  }

  def close(errorOrNull: Throwable): Unit = {
    // Closing connections etc...
  }
})

val streamingQuery = itemsQuery.start()
I have data like this:
In pandas, to a list of tuples:
b = df.toPandas()
b.groupby(['product_id','store_id']).apply(lambda df:df.assign(date=lambda x:x.date.apply(lambda x:x.strftime('%Y%m%d') ) )[['date', 'yhat']].values)
To a dict:
b.groupby(['product_id','store_id']).apply(lambda df:dict(df.assign(date=lambda x:x.date.apply(lambda x:x.strftime('%Y%m%d') ) )[['date', 'yhat']].values) )
My purpose:
I'd prefer not to use pandas_udf; is there any way to do this with Spark alone?
I figured it out: use create_map + map_concat.
import pyspark.sql.functions as F

date_cols = df.select(F.date_format('date', 'yyyyMMdd')).dropDuplicates().toPandas().values.ravel().tolist()

df.withColumn('date', F.date_format('date', 'yyyyMMdd'))\
    .withColumn('date_map', F.create_map('date', 'yhat'))\
    .groupby(['product_id', 'store_id']).pivot('date').agg(F.first('date_map'))\
    .select('product_id', 'store_id', F.map_concat(date_cols).alias('date_sale_count')).show()
However, I doubt the efficiency of my code, because date_cols needs a collect first. Any improvement is welcome.
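For what it's worth, one possible way to avoid that extra collect (a sketch, assuming Spark 2.4+ for map_from_entries) is to build the map directly from the collected date/yhat pairs per group:

import pyspark.sql.functions as F

result = (df
          .withColumn('date', F.date_format('date', 'yyyyMMdd'))
          .groupBy('product_id', 'store_id')
          .agg(F.map_from_entries(F.collect_list(F.struct('date', 'yhat'))).alias('date_sale_count')))
result.show()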