I have a dataframe with the following columns: ID, event_name, event_date
Goal: for every unique ID, if it has a row with event_name == 'attended_book_event', I want to create a new column attended_book_event with the value 1 for all of that ID's rows. If the ID has no such row, the value in the new column should be 0.
Sample:
ID| event_name          | event_date
1 | joined_club         | 12-12-03
1 | attended_book_event | 12-27-03
1 | elite_member        | 03-01-05
2 | joined_club         | 12-12-03
2 | elite_member        | 03-01-05
I tried grouping by ID and then creating the new column with a condition, but the results were not what I was looking for:
df['attended_book_event'] = [1 if df['event_name'] == 'attended_book_event' else 0]
I want a new column:
ID| event_name          | event_date| attended_book_event
1 | joined_club         | 12-12-03  | 1
1 | attended_book_event | 12-27-03  | 1
1 | elite_member        | 03-01-05  | 1
2 | joined_club         | 12-12-03  | 0
2 | elite_member        | 03-01-05  | 0
Using groupby on the event_name column with transform:
df['attended_book_event'] = df.groupby('ID')['event_name'].transform(lambda x: 'attended_book_event' in set(x)).astype(int)
Output:
ID event_name event_date attended_book_event
0 1 joined_club 12-12-03 1
1 1 attended_book_event 12-27-03 1
2 1 elite_member 03-01-05 1
3 2 joined_club 12-12-03 0
4 2 elite_member 03-01-05 0
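An equivalent variant (just a sketch, assuming the same df) skips building a Python set per group by comparing the column first and broadcasting the group-wise maximum back to every row:

# Mark matching rows, then take the group-wise max of the boolean flag per ID
df['attended_book_event'] = (
    df['event_name'].eq('attended_book_event')
      .groupby(df['ID'])
      .transform('max')
      .astype(int)
)

This stays vectorised end to end, which tends to matter once the frame has many groups.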
I have a table with dates in MM/DD/YYYY format, like below.
+---+-----------+----------+
| id| startdate| enddate|
+---+-----------+----------+
| 1| 01/01/2022|01/31/2022|
| 1| 02/01/2022|02/28/2022|
| 1| 03/01/2022|03/31/2022|
| 2| 01/01/2022|03/01/2022|
| 2| 03/05/2022|03/31/2022|
| 2| 04/01/2022|04/05/2022|
+---+-----------+----------+
How do I group based on the id column when the start and end dates are continuous?
If there is more than a one-day gap, keep the row on a new line, so the above table becomes:
+---+-----------+----------+
| id| startdate| enddate|
+---+-----------+----------+
| 1| 01/01/2022|03/31/2022|
| 2| 01/01/2022|03/01/2022|
| 2| 03/05/2022|04/05/2022|
+---+-----------+----------+
id = 1 becomes one row because all the dates for id = 1 are continuous, i.e. no gap > 1 day, but id = 2 has two rows because there is a gap between 03/01/2022 and 03/05/2022.
This is a particular case of the sessionization problem (i.e. identify sessions in data based on some conditions).
Here is a possible solution that uses windows.
The logic behind the solution:
1. Associate with each row the temporally previous enddate for the same id.
2. Calculate the difference in days between each startdate and that previous enddate.
3. Identify the rows that either have no previous row or start at least two days after the previous enddate.
4. Assign each row a session_index, i.e. the number of new sessions seen up to that row.
5. Aggregate, grouping by id and session_index.
from pyspark.sql import functions as F, Window

# Sort each id's rows chronologically
w = Window.partitionBy("id") \
    .orderBy("startdate")

df = df \
    .select(
        F.col("id"),
        F.to_date("startdate", "MM/dd/yyyy").alias("startdate"),
        F.to_date("enddate", "MM/dd/yyyy").alias("enddate")
    ) \
    .withColumn("previous_enddate", F.lag("enddate", offset=1).over(w)) \
    .withColumn("date_diff", F.datediff(F.col("startdate"), F.col("previous_enddate"))) \
    .withColumn("is_new_session", F.col("date_diff").isNull() | (F.col("date_diff") > 1)) \
    .withColumn("session_index", F.sum(F.col("is_new_session").cast("int")).over(w))  # running count of session starts

result = df.groupBy("id", "session_index") \
    .agg(
        F.min("startdate").alias("startdate"),
        F.max("enddate").alias("enddate")
    ) \
    .drop("session_index")
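Sorting by id and startdate and showing the result reproduces the expected table from the question (note that to_date parses the strings into DateType, so the dates display in Spark's default yyyy-MM-dd format rather than the original MM/dd/yyyy):

result.orderBy("id", "startdate").show()

+---+----------+----------+
| id| startdate|   enddate|
+---+----------+----------+
|  1|2022-01-01|2022-03-31|
|  2|2022-01-01|2022-03-01|
|  2|2022-03-05|2022-04-05|
+---+----------+----------+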
+---+----+-------+
| id|val1|   val2|
+---+----+-------+
|  1|   Y|Flagged|
|  1|   N|Flagged|
|  2|   N|Flagged|
|  2|   Y|Flagged|
|  2|   N|Flagged|
+---+----+-------+
I have the table above. I want to check the rows in val1 that share the same id: if there is at least one Y and one N, then all the rows with that id should be marked Flagged in val2. In addition, to make the code more efficient, I want it to jump to the next id once it finds a Y.
Assuming the val1 column contains only Y and N as unique values, you can group the dataframe by id and aggregate val1 with countDistinct to count the unique values, then create a new column flagged for the condition distinct count > 1, and finally join this new column back to the original dataframe to get the result:
from pyspark.sql import functions as F
counts = df.groupBy('id').agg(F.countDistinct('val1').alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
If column val1 may contain other values along with Y and N, then first mask the values which are not Y or N:
vals = F.when(F.col('val1').isin(['Y', 'N']), F.col('val1'))
counts = df.groupBy('id').agg(F.countDistinct(vals).alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
>>> df.show()
+---+----+-------+
| id|val1|flagged|
+---+----+-------+
| 1| Y| true|
| 1| N| true|
| 2| N| true|
| 2| Y| true|
| 2| N| true|
+---+----+-------+
PS: I have also modified your output slightly as having a column named flagged with boolean values makes more sense
You can also use a window to collect the set of values and compare to the array of Y and N values:
from pyspark.sql import functions as F, Window as W
a = F.array(F.lit('N'), F.lit('Y'))
out = df.withColumn(
    "Flagged",
    F.array_intersect(a, F.collect_set("val1").over(W.partitionBy("id"))) == a
)
out.show()
+---+----+-------+-------+
| id|val1| val2|Flagged|
+---+----+-------+-------+
| 1| Y|Flagged| true|
| 1| N|Flagged| true|
| 2| N|Flagged| true|
| 2| Y|Flagged| true|
| 2| N|Flagged| true|
+---+----+-------+-------+
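On the efficiency point: Spark evaluates aggregates per partition rather than row by row, so there is no way to short-circuit and "jump to the next id" once a Y is found. Another option (a sketch, assuming val1 only holds Y and N) is conditional aggregation over the same window, which avoids collecting a set per row:

from pyspark.sql import functions as F, Window as W

w = W.partitionBy("id")
out = df.withColumn(
    "Flagged",
    (F.max(F.when(F.col("val1") == "Y", 1).otherwise(0)).over(w) == 1)
    & (F.max(F.when(F.col("val1") == "N", 1).otherwise(0)).over(w) == 1)
)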
I have a pyspark dataframe that looks like this:
id| score
1 | 0.5
1 | 2.5
2 | 4.45
3 | 8.5
3 | 3.25
3 | 5.55
And I want to create a new column rank based on the value of the score column in descending order, meaning the highest value gets rank 0, with the count restarting for each id.
Something like this:
id| score| rank
1 | 2.5  | 0
1 | 0.5  | 1
2 | 4.45 | 0
3 | 8.5  | 0
3 | 5.55 | 1
3 | 3.25 | 2
Thanks in advance!
You can use pyspark.sql.functions.dense_rank, which returns the rank of rows within a window partition.
Note that for this to work we have to add an orderBy, since dense_rank() requires the window to be ordered. Finally, subtract 1 from the outcome (as the rank starts from 1 by default).
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank, desc

df = df.withColumn(
    "rank", dense_rank().over(Window.partitionBy("id").orderBy(desc("score"))) - 1)
>>> df.show()
+---+-----+----+
| id|score|rank|
+---+-----+----+
| 1| 2.5| 0|
| 1| 0.5| 1|
| 2| 4.45| 0|
| 3| 8.5| 0|
| 3| 5.55| 1|
| 3| 3.25| 2|
+---+-----+----+
SQL syntax:
SELECT dense_rank() OVER (PARTITION BY id ORDER BY score DESC) - 1 AS rank FROM table
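If tied scores should still get distinct, gap-free positions (0, 1, 2, ... with no repeats), row_number can be swapped in for dense_rank; a minimal sketch against the same df:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

df = df.withColumn(
    "rank", row_number().over(Window.partitionBy("id").orderBy(desc("score"))) - 1)

With dense_rank, two equal scores share a rank; with row_number, the tie is broken arbitrarily by the window ordering.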
I have a pyspark dataframe that looks like this:
key key2 category ip_address
1 a desktop 111
1 a desktop 222
1 b desktop 333
1 c mobile 444
2 d cell 555
And I want to groupBy key to get the total number of unique ip_address values, the total number of unique key2 values, and then the number of unique ip_address values contributed by each category (assume the set of categories is fixed, so category can only be one of desktop, mobile, cell).
So, I'm looking for a resulting dataframe like this:
key num_ips num_key2 num_desktop num_mobile num_cell
1   4       3        3           1          0
2   1       1        0           0          1
I've been trying code like this, but the part for num_desktop, num_mobile, and num_cell isn't quite right:
import pyspark.sql.functions as F
df_agg = df.groupBy('key') \
    .agg(F.countDistinct('ip_address').alias('num_ips'), \
         F.countDistinct('key2').alias('num_key2'), \
         F.countDistinct('ip_address').where(F.col('category')=='desktop').alias('num_desktop'), \
         F.countDistinct('ip_address').where(F.col('category')=='mobile').alias('num_mobile'), \
         F.countDistinct('ip_address').where(F.col('category')=='cell').alias('num_cell'))
Do I have to do some type of nested groupBy, or maybe a Window function? Any help is greatly appreciated!
I had to split the dataframe and join the pieces back together for the desktop, mobile, and cell counts:
import pyspark.sql.functions as F
from pyspark.sql.functions import col

# Overall distinct counts per key
df1 = df.groupBy('key') \
    .agg(F.countDistinct('ip_address').alias('num_ips'), \
         F.countDistinct('key2').alias('num_key2'))

# Per-category distinct ip_address counts, each with the key renamed for the join
de = df.filter(col("category")=="desktop").groupBy('key')\
    .agg(F.countDistinct('ip_address').alias('num_desktop')).withColumnRenamed("key", "key1")
dm = df.filter(col("category")=="mobile").groupBy('key')\
    .agg(F.countDistinct('ip_address').alias('num_mobile')).withColumnRenamed("key", "key1")
dc = df.filter(col("category")=="cell").groupBy('key')\
    .agg(F.countDistinct('ip_address').alias('num_cell')).withColumnRenamed("key", "key1")

# Left-join the per-category counts back onto the overall counts and fill the gaps with 0
join_df = df1.join(de, (df1.key == de.key1), "left").drop("key1")\
    .join(dm, (df1.key == dm.key1), "left").drop("key1")\
    .join(dc, (df1.key == dc.key1), "left").drop("key1")\
    .fillna(0)
Output:
+---+-------+--------+-----------+----------+--------+
|key|num_ips|num_key2|num_desktop|num_mobile|num_cell|
+---+-------+--------+-----------+----------+--------+
| 1| 4| 3| 3| 1| 0|
| 2| 1| 1| 0| 0| 1|
+---+-------+--------+-----------+----------+--------+
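For reference, the split-and-join can usually be collapsed into a single aggregation with conditional distinct counts (a sketch, assuming the column names from the sample data): countDistinct ignores the nulls that when() produces for non-matching categories, so each category gets its own distinct ip_address count.

import pyspark.sql.functions as F

df_agg = df.groupBy('key').agg(
    F.countDistinct('ip_address').alias('num_ips'),
    F.countDistinct('key2').alias('num_key2'),
    F.countDistinct(F.when(F.col('category') == 'desktop', F.col('ip_address'))).alias('num_desktop'),
    F.countDistinct(F.when(F.col('category') == 'mobile', F.col('ip_address'))).alias('num_mobile'),
    F.countDistinct(F.when(F.col('category') == 'cell', F.col('ip_address'))).alias('num_cell')
)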
I have a spark data frame like below:
User Item Purchased
1 A 1
1 B 2
2 A 3
2 C 4
3 A 3
3 B 2
3 D 6
Each user has a row for an item they have purchased; assume Purchased is the quantity purchased (a count).
However, there are items which a user might not have purchased, and for those items that user has no row; we only have rows for items a user has purchased. So if user 1 purchased items A and B, there are 2 rows for user 1 corresponding to those two items, but if user 2 purchased A and C, then user 2 has rows for items A and C and none for B. In the end, every user should have a row for every item in the table, with the corresponding count.
I want to convert this data frame into one like the above, but also having rows for items a user has not purchased, with the corresponding count set to zero.
Like below:
User Item Purchased
1 A 1
1 B 2
1 C 0
1 D 0
2 A 3
2 C 4
2 B 0
2 D 0
3 A 3
3 B 2
3 D 6
3 C 0
One way I thought of: in Spark, if I use the crosstab method of sqlContext on the first data frame, I can convert each row into a column with the corresponding values. For an item a user doesn't have, it will create a column and put a zero there.
But then how do I convert those columns back to rows? It might also be a roundabout way.
Thanks
We can achieve this using only DataFrame functions as well.
orders = [(1,"A",1),(1,"B",2),(2,"A",3),(2,"C",4),(3,"A",3),(3,"B",2),(3,"D",6)]
df = sqlContext.createDataFrame(orders, ["user","item","purchased"])

df_items = df.select("item").distinct().repartition(5).withColumnRenamed("item", "item_1")
df_users = df.select("user").distinct().repartition(5).withColumnRenamed("user", "user_1")

# Cartesian product of the users and items dataframes
# (in Spark 2+ a join with no condition may require crossJoin or spark.sql.crossJoin.enabled)
df_cartesian = df_users.join(df_items)

# Left-join the actual purchases onto every (user, item) pair
joined_df = df_cartesian.join(df, [df_cartesian.user_1==df.user, df_cartesian.item_1==df.item], "left_outer").drop("user").drop("item")

# Pairs with no purchase come back as null, so fill them with 0 and restore the column names
result_df = joined_df.fillna(0, ["purchased"]).withColumnRenamed("item_1", "item").withColumnRenamed("user_1", "user")
Finally, result_df.show() produces the desired output shown below:
+----+----+---------+
|user|item|purchased|
+----+----+---------+
| 2| A| 3|
| 2| B| 0|
| 2| C| 4|
| 2| D| 0|
| 3| A| 3|
| 3| B| 2|
| 3| C| 0|
| 3| D| 6|
| 1| A| 1|
| 1| B| 2|
| 1| C| 0|
| 1| D| 0|
+----+----+---------+
df = sqlContext.createDataFrame([(1, 'A', 2), (1, 'B', 3), (2, 'A', 2)], ['user', 'item', 'purchased'])
# Pivot items into columns; missing (user, item) combinations become 0
pivot = df.groupBy('user').pivot('item').sum('purchased').fillna(0)
# Flatten each pivoted row back into (user, item, purchased) tuples
items = [i['item'] for i in df.select('item').distinct().collect()]
flattened_rdd = pivot.rdd.flatMap(lambda x: [(x['user'], i, x[i]) for i in items])
sqlContext.createDataFrame(flattened_rdd, ["user", "item", "purchased"]).show()
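A DataFrame-only way to flatten the pivoted frame back into rows (a sketch reusing the pivot and items variables above) is to explode an array of (item, purchased) structs instead of dropping down to the RDD API:

from pyspark.sql import functions as F

# Build one struct per item column, collect them into an array, and explode it into rows
unpivoted = pivot.select(
    'user',
    F.explode(F.array(*[
        F.struct(F.lit(i).alias('item'), F.col(i).alias('purchased')) for i in items
    ])).alias('kv')
).select('user', 'kv.item', 'kv.purchased')

unpivoted.show()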