I have a data frame in pyspark like below. I want to do a groupBy and count of the category column in the data frame.
df.show()
+--------+----+
|category| val|
+--------+----+
| cat1| 13|
| cat2| 12|
| cat2| 14|
| cat3| 23|
| cat1| 20|
| cat1| 10|
| cat2| 30|
| cat3| 11|
| cat1| 7|
| cat1| 8|
+--------+----+
res = df.groupBy('category').count()
res.show()
+--------+-----+
|category|count|
+--------+-----+
| cat2| 3|
| cat3| 2|
| cat1| 5|
+--------+-----+
I am getting my desired result. Now I want to calculate the average count per category. The data frame has records for 3 days, and I want the average of the count over these 3 days.
The result I want is below. Basically I want to do count / no. of days
+--------+-----+
|category|count|
+--------+-----+
| cat2| 1|
| cat3| 1|
| cat1| 2|
+--------+-----+
How can I do that?
I believe what you want is
from pyspark.sql import functions as F
df.groupby('category').agg((F.count('val') / 3).alias('average'))
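If the data frame also has a date column (an assumption, since one isn't shown above), you can avoid hard-coding the 3 and divide by the number of distinct days instead. A minimal sketch, assuming that column is called 'date':
from pyspark.sql import functions as F
# 'date' is a hypothetical column name; use whatever marks the day in your data
n_days = df.select(F.countDistinct('date')).first()[0]
df.groupBy('category').agg((F.count('val') / n_days).alias('average')).show()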
I am a little confused about the method pyspark.sql.Window.rowsBetween that accepts Window.unboundedPreceding, Window.unboundedFollowing, and Window.currentRow objects as start and end arguments. Could you please explain how the function works and how to use Window objects correctly, with some examples? Thank you!
Basically, rowsBetween/rangeBetween, as the names suggest, limit the set of rows considered inside a window.
Let's take a simple example, starting with this data:
dfw=spark.createDataFrame([("abc",1,100),("abc",2,200),("abc",3,300),("abc",4,200),("abc",5,100)],"name string,id int,price int")
#output
+----+---+-----+
|name| id|price|
+----+---+-----+
| abc| 1| 100|
| abc| 2| 200|
| abc| 3| 300|
| abc| 4| 200|
| abc| 5| 100|
+----+---+-----+
Now, over this data, let's find the running max, i.e. the max seen so far at each row:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

dfw.withColumn("rm", F.max("price").over(Window.partitionBy("name").orderBy("id"))).show()
#output
+----+---+-----+---+
|name| id|price| rm|
+----+---+-----+---+
| abc| 1| 100|100|
| abc| 2| 200|200|
| abc| 3| 300|300|
| abc| 4| 200|300|
| abc| 5| 100|300|
+----+---+-----+---+
So, as expected, it looked at each price from top to bottom, one by one, and populated the max value seen so far. This behavior corresponds to start=Window.unboundedPreceding and end=Window.currentRow.
Now, changing the rowsBetween values to start=Window.unboundedPreceding and end=Window.unboundedFollowing, we get the following:
dfw.withColumn("rm",F.max("price").over(Window.partitionBy("name").orderBy("id").rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing))).show()
#output
+----+---+-----+---+
|name| id|price| rm|
+----+---+-----+---+
| abc| 1| 100|300|
| abc| 2| 200|300|
| abc| 3| 300|300|
| abc| 4| 200|300|
| abc| 5| 100|300|
+----+---+-----+---+
Now, as you can see, within the same window it looks across all values (above and below) for the max instead of stopping at the current row.
The third case is start=Window.currentRow and end=Window.unboundedFollowing:
dfw.withColumn("rm",F.max("price").over(Window.partitionBy("name").orderBy("id").rowsBetween(Window.currentRow,Window.unboundedFollowing))).show()
#output
+----+---+-----+---+
|name| id|price| rm|
+----+---+-----+---+
| abc| 1| 100|300|
| abc| 2| 200|300|
| abc| 3| 300|300|
| abc| 4| 200|200|
| abc| 5| 100|100|
+----+---+-----+---+
Now it only looks downwards for the max, starting from the current row.
Also, you are not limited to just these three constants. You can, for example, use start=Window.currentRow-1 and end=Window.currentRow+1, so instead of looking at all values above or below, it only looks 1 row above and 1 row below, like this:
dfw.withColumn("rm",F.max("price").over(Window.partitionBy("name").orderBy("id").rowsBetween(Window.currentRow-1,Window.currentRow+1))).show()
# output
+----+---+-----+---+
|name| id|price| rm|
+----+---+-----+---+
| abc| 1| 100|200|
| abc| 2| 200|300|
| abc| 3| 300|300|
| abc| 4| 200|300|
| abc| 5| 100|200|
+----+---+-----+---+
So you can think of it as a window inside the window, sliding around the current row being processed.
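Since the answer mentions rangeBetween as well: it frames the window by the values of the orderBy column rather than by row positions. A minimal sketch over the same dfw (the ids here are consecutive, so it happens to give the same result as rowsBetween(-1, 1)):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# the frame covers every row whose id value is within +/- 1 of the current row's id
w = Window.partitionBy("name").orderBy("id").rangeBetween(-1, 1)
dfw.withColumn("rm", F.max("price").over(w)).show()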
I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they don't overlap (i.e. when the df2 entry is null, the df1 entry isn't, and vice versa).
I know that if I use a plain join I won't get them in the same column and will instead get two "col_name" columns. I just want the one column. How do I do this? Thanks
Try this:
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
If you are working with pandas DataFrames rather than Spark ones, you could do this:
import numpy as np
mydf = df1.copy()  # make a copy of the first frame
idx = np.where(df1['col_name'].isna())[0]  # positional indices where df1 is null (compare to the string 'null' only if the values are literally that text)
val = df2['col_name'].values[idx]  # values from df2 where df1 is null
mydf.loc[mydf.index[idx], 'col_name'] = val  # assign those values in mydf without chained assignment
mydf  # display mydf
You should be able to utilize the coalesce function to achieve this:
from pyspark.sql.functions import coalesce

renamedDF1 = df1.withColumnRenamed("col_name", "col_name_a")
renamedDF2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
).drop("col_name_a", "col_name_b")  # keep only the single merged column
I am trying to sort values in my pyspark dataframe, but it's showing me strange output. Instead of sorting by the entire number, it is sorting by the first digit of the number.
I have tried the sort and orderBy methods; both give the same result.
sdf=spark.read.csv("dummy.txt", header=True)
sdf.sort('1',ascending=False).show()
I am getting the following output:
+---+
|  1|
+---+
| 98|
| 9|
| 8|
| 76|
| 7|
| 68|
| 6|
| 54|
| 5|
| 43|
| 4|
| 35|
| 34|
| 34|
| 3|
| 2|
| 2|
| 2|
| 10|
+---+
Can anyone explain this to me?
As your column contains String data, the values are compared character by character, so they are sorted lexicographically rather than numerically.
So you can cast the column to a numeric type and then apply orderBy to get your required result:
>>> df
DataFrame[Numb: string]
>>> df.show()
+----+
|Numb|
+----+
| 20|
| 19|
| 1|
| 200|
| 60|
+----+
>>> df.orderBy(df.Numb.cast('int'),ascending=False).show()
+----+
|Numb|
+----+
| 200|
| 60|
| 20|
| 19|
| 1|
+----+
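Applied to the data frame from the question, a sketch (assuming the column read from dummy.txt really is named '1', as the sort('1', ...) call suggests):
from pyspark.sql import functions as F
sdf = spark.read.csv("dummy.txt", header=True)
sdf.orderBy(F.col('1').cast('int'), ascending=False).show()
Alternatively, reading with spark.read.csv("dummy.txt", header=True, inferSchema=True) should load the column as an integer in the first place, so no cast is needed.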
I need help getting conditional output from pyspark when using groupBy. I have the following input table:
+----+-----------+-------+
|time|auth_orient|success|
+----+-----------+-------+
| 1| LogOn|Success|
| 1| LogOff|Success|
| 1| LogOff|Success|
| 1| LogOn|Success|
| 1| LogOn| Fail|
| 1| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn| Fail|
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn| Fail|
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
+----+-----------+-------+
The table below shows what I want, which only displays the logon stats:
+----+----+-------+
|time|Fail|success|
+----+----+-------+
| 1| 1| 3|
| 2| 2| 8|
+----+----+-------+
Overall, I am trying to group on time and populate new columns with the counts. Preferably I would like the code to derive the column names itself, as I will not always have a complete list of them.
I know a portion of what I am trying to do is possible with MultilabelBinarizer, but from what I have seen that is not currently available in pyspark.
Filter the data frame down to LogOn only first and then do groupBy.pivot:
import pyspark.sql.functions as F
df.filter(
df.auth_orient == 'LogOn'
).groupBy('time').pivot('success').agg(F.count('*')).show()
+----+----+-------+
|time|Fail|Success|
+----+----+-------+
| 1| 1| 3|
| 2| 2| 8|
+----+----+-------+
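If you already know the possible values, you can pass them to pivot explicitly; Spark then skips the extra job it otherwise runs to determine the distinct values. A small variation of the same code:
df.filter(
    df.auth_orient == 'LogOn'
).groupBy('time').pivot('success', ['Fail', 'Success']).agg(F.count('*')).show()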
With a dataframe like this,
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,None,"201602"),(1,20,3003,"201601"), (1,20,None,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, None,"201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {column: df_data.agg({column:"mean"}).flatMap(list).collect()[0] for column in df_data.columns if column not in ['date','id']}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
- compute the statistics
- fill the blanks
That pretty much limits what you can really improve here. Still, you can:
- replace flatMap(list).collect()[0] with first()[0] or structure unpacking
- compute all the stats with a single action
- use the built-in Row.asDict() method to extract a dictionary
The final result could look like this:
from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
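For completeness, a minimal Imputer sketch (Spark 2.2+). The output column name cost_imputed is made up here, and Imputer expects float/double inputs, so the integer cost column is cast first:
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F
# Imputer requires FloatType/DoubleType input columns
df_cast = df_data.withColumn("cost", F.col("cost").cast("double"))
imputer = Imputer(
    inputCols=["cost"],
    outputCols=["cost_imputed"],  # hypothetical name for the filled column
    strategy="mean"               # "median" is also supported
)
imputer.fit(df_cast).transform(df_cast).show()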