So, I have a pyspark dataframe organised in this way:
ID  timestamp  value1  value2
1   1          a       x
2   1          a       y
1   2          b       x
2   2          b       y
1   3          c       y
2   3          d       y
1   4          l       y
2   4          s       y
and let's say that the timestamp is the number of days from the beginning of time. What I'd like to do is, for each row, collect into a list the value1 entries from up to x days back for the current ID, so as to have:
ID  timestamp  value1  value2  list_value_1
1   1          a       x       a
2   1          a       y       a
1   2          b       x       a,b
2   2          b       y       a,b
1   3          c       y       a,b,c
2   3          d       y       a,b,d
1   4          l       y       b,c,l
2   4          s       y       b,d,s
I imagine I should do that with a Window, but I'm not sure how to proceed (I'm quite bad with Windows for some reason).
You can do a collect_list over a Window between the current row and the two preceding rows, and combine the list into a comma-separated string using concat_ws:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'list_value_1',
    F.concat_ws(
        ',',
        F.collect_list('value1').over(
            Window.partitionBy('ID').orderBy('timestamp').rowsBetween(-2, 0)
        )
    )
)
df2.show()
+---+---------+------+------+------------+
| ID|timestamp|value1|value2|list_value_1|
+---+---------+------+------+------------+
| 1| 1| a| x| a|
| 1| 2| b| x| a,b|
| 1| 3| c| y| a,b,c|
| 1| 4| l| y| b,c,l|
| 2| 1| a| y| a|
| 2| 2| b| y| a,b|
| 2| 3| d| y| a,b,d|
| 2| 4| s| y| b,d,s|
+---+---------+------+------+------------+
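If by "up to -x days" you mean a window measured in days rather than in rows (so gaps in the timestamps matter), you can swap rowsBetween for rangeBetween on the ordering column. A minimal sketch, not part of the answer above, assuming x = 2 and that timestamp is an integer day count:

# value-based window: all rows of the same ID whose timestamp lies within the last 2 days
w = Window.partitionBy('ID').orderBy('timestamp').rangeBetween(-2, 0)
df3 = df.withColumn('list_value_1', F.concat_ws(',', F.collect_list('value1').over(w)))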
Imagine you have two datasets df and df2 like the following:
df:
ID Size Condition
1 2 1
2 3 0
3 5 0
4 7 1
df2:
aux_ID Scalar
1 2
3 2
I want to get an output where, if the Condition in df is 1, we multiply Size by the matching Scalar and then return df with the changed values.
I would like to do this as efficiently as possible, perhaps avoiding the join if that's possible.
output_df:
ID Size Condition
1 4 1
2 3 0
3 5 0
4 7 1
Not sure why you would want to avoid joins in the first place; they can be efficient in their own right.
That said, this can easily be done by merging the two datasets and building a case-when expression against the condition.
Data Preparation
# `sql` here is assumed to be your SparkSession (or SQLContext)
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F

df1 = pd.read_csv(
    StringIO("""ID,Size,Condition
1,2,1
2,3,0
3,5,0
4,7,1
"""),
    delimiter=','
)
df2 = pd.read_csv(
    StringIO("""aux_ID,Scalar
1,2
3,2
"""),
    delimiter=','
)

sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---+----+---------+
| ID|Size|Condition|
+---+----+---------+
| 1| 2| 1|
| 2| 3| 0|
| 3| 5| 0|
| 4| 7| 1|
+---+----+---------+
sparkDF2.show()
+------+------+
|aux_ID|Scalar|
+------+------+
| 1| 2|
| 3| 2|
+------+------+
Case When
finalDF = sparkDF1.join(
    sparkDF2,
    sparkDF1['ID'] == sparkDF2['aux_ID'],
    'left'
).select(
    sparkDF1['*'],
    sparkDF2['Scalar'],
    sparkDF2['aux_ID']
).withColumn(
    'Size_Changed',
    F.when(
        (F.col('Condition') == 1) & (F.col('aux_ID').isNotNull()),
        F.col('Size') * F.col('Scalar')
    ).otherwise(F.col('Size'))
)
finalDF.show()
+---+----+---------+------+------+------------+
| ID|Size|Condition|Scalar|aux_ID|Size_Changed|
+---+----+---------+------+------+------------+
| 1| 2| 1| 2| 1| 4|
| 3| 5| 0| 2| 3| 5|
| 2| 3| 0| null| null| 3|
| 4| 7| 1| null| null| 7|
+---+----+---------+------+------+------------+
You can drop the unnecessary columns; I kept them here for illustration.
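For instance, a minimal sketch that keeps only the columns from your expected output (column names as in the example above):

# keep ID / Size / Condition, with Size replaced by the recomputed value
outputDF = finalDF.select('ID', F.col('Size_Changed').alias('Size'), 'Condition')
outputDF.show()

If df2 stays small, wrapping it as F.broadcast(sparkDF2) in the join above also keeps the lookup cheap without changing the logic.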
I have a dataframe built as in the code below. To do the required manipulations in standard pandas, I would do the following:
import pandas as pd
case = pd.Series(['A', 'A', 'A', 'A',
'B', 'B', 'B', 'B',
'C', 'C', 'C', 'C'])
y = pd.Series([0, 1, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0])
year = [2016, 2017, 2018, 2019,
2016, 2017, 2018, 2019,
2016, 2017, 2018, 2019]
dict = {'case': case, 'y': y, 'year': year}
df = pd.DataFrame(dict)
# the transformations of interest
df['case_id'] = ((~(df.case == df.case.shift())) | (df.y.shift()==1)).cumsum()
df['counter'] = df.groupby(((df['case_id'] != df['case_id'].shift(1))).cumsum()).cumcount()
I am looking for help as to how to translate these two commands into a PySpark dataframe.
df['case_id'] = ((~(df.case == df.case.shift())) | (df.y.shift()==1)).cumsum()
df['counter'] = df.groupby(((df['case_id'] != df['case_id'].shift(1))).cumsum()).cumcount()
Expected output looks like:
case y year case_id counter
A 0 2016 1 0
A 1 2017 1 1
A 0 2018 2 0
A 0 2019 2 1
B 0 2016 3 0
B 1 2017 3 1
B 0 2018 4 0
B 0 2019 4 1
C 0 2016 5 0
C 0 2017 5 1
C 1 2018 5 2
C 0 2019 6 0
This is almost like an FAQ, see also another example from my old post. For this example, you can try the following:
from pyspark.sql import functions as F
from pyspark.sql import Window
pdf = spark.createDataFrame(df)
w1 = Window.partitionBy().orderBy('case', 'year')
w2 = Window.partitionBy('case_id').orderBy('case', 'year')
df_new = pdf.withColumn(
    "case_id",
    F.sum(
        F.when(
            ~(F.col("case") == F.lag("case").over(w1)) | (F.lag("y", 1, 0).over(w1) == 1),
            1
        ).otherwise(0)
    ).over(w1) + 1
).withColumn('counter', F.count('*').over(w2) - 1)
df_new.show()
+----+---+----+-------+-------+
|case| y|year|case_id|counter|
+----+---+----+-------+-------+
| A| 0|2016| 1| 0|
| A| 1|2017| 1| 1|
| A| 0|2018| 2| 0|
| A| 0|2019| 2| 1|
| B| 0|2016| 3| 0|
| B| 1|2017| 3| 1|
| B| 0|2018| 4| 0|
| B| 0|2019| 4| 1|
| C| 0|2016| 5| 0|
| C| 0|2017| 5| 1|
| C| 1|2018| 5| 2|
| C| 0|2019| 6| 0|
+----+---+----+-------+-------+
Where:
set up WindowSpec w1 to sort rows by case, year and then use the lag function to find the previous value (similar to shift in pandas).
pandas: (~(df.case == df.case.shift())) | (df.y.shift()==1)
pyspark: ~(F.col("case") == F.lag("case").over(w1)) | (F.lag("y",1,0).over(w1) == 1)
Note:
(1) orderBy in w1 is important, since partitionBy triggers data shuffling and the order of the resulting rows is not guaranteed otherwise. (2) be cautious about the null values produced by the lag function; use the 3rd argument of lag or the coalesce function to set a default if needed (see the sketch after this list).
use F.when(..,1).otherwise(0) to convert the result of (1) from boolean into int and then do cumsum:
pandas: df.c.cumsum()
pyspark: F.sum(c).over(w1)+1
add case_id into partitionBy to set up w2 and then do cumcount (no need to do another cumsum and then groupby):
pandas: df.groupby(..).cumcount()
pyspark: F.count('*').over(w2)-1
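As a small illustration of note (2), these two ways of defaulting the lag result behave the same on the first row of each partition (same w1 as above):

prev_y = F.lag("y", 1, 0).over(w1)                       # default supplied to lag directly
prev_y_alt = F.coalesce(F.lag("y").over(w1), F.lit(0))   # same effect via coalesce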
For a large dataframe, setting a WindowSpec without partitionBy will move all data into a single partition, which could yield an OOM error. In fact, if you are just looking for the cumcount inside each combination of case + case_id, you would more likely do the following:
w1 = Window.partitionBy('case').orderBy('year')
w2 = Window.partitionBy('case', 'case_id').orderBy('year')
df_new = pdf.withColumn(
    "case_id",
    F.sum(F.when(F.lag("y", 1, 0).over(w1) == 1, 1).otherwise(0)).over(w1)
).withColumn('counter', F.count('*').over(w2) - 1)
df_new.show()
+----+---+----+-------+-------+
|case| y|year|case_id|counter|
+----+---+----+-------+-------+
| B| 0|2016| 0| 0|
| B| 1|2017| 0| 1|
| B| 0|2018| 1| 0|
| B| 0|2019| 1| 1|
| C| 0|2016| 0| 0|
| C| 0|2017| 0| 1|
| C| 1|2018| 0| 2|
| C| 0|2019| 1| 0|
| A| 0|2016| 0| 0|
| A| 1|2017| 0| 1|
| A| 0|2018| 1| 0|
| A| 0|2019| 1| 1|
+----+---+----+-------+-------+
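With partitionBy('case') the rows come back in whatever order the shuffle produces them; if you want the display sorted as in the first variant, you can add an orderBy before showing (purely cosmetic):

df_new.orderBy('case', 'year').show()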
I have a pyspark dataframe with a list of customers, days, and transaction types.
+----------+-----+------+
| Customer | Day | Type |
+----------+-----+------+
| A | 2 | X11 |
| A | 4 | X2 |
| A | 9 | Y4 |
| A | 11 | X1 |
| B | 3 | Y4 |
| B | 7 | X1 |
+----------+-----+------+
I'd like to create a column that has "most recent X type" for each customer, like so:
+----------+-----+------+-------------+
| Customer | Day | Type | MostRecentX |
+----------+-----+------+-------------+
| A | 2 | X11 | X11 |
| A | 4 | X2 | X2 |
| A | 9 | Y4 | X2 |
| A | 11 | X1 | X1 |
| B | 3 | Y4 | - |
| B | 7 | X1 | X1 |
+----------+-----+------+-------------+
So for the X types it just takes the one from the current row, but for the Y types it takes the Type from the most recent X row for that customer (and if there isn't one, it gets a blank or something). I imagine I need some sort of window function, but I'm not very familiar with PySpark.
You can achieve this by taking the last value of Type that starts with the letter "X" over a Window that partitions by Customer and orders by Day. Specify the Window to start at the beginning of the partition and stop at the current row.
from pyspark.sql import Window
from pyspark.sql.functions import col, last, when
w = Window.partitionBy("Customer").orderBy("Day").rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn(
"MostRecentX",
last(when(col("Type").startswith("X"), col("Type")), ignorenulls=True).over(w)
)
df.show()
#+--------+---+----+-----------+
#|Customer|Day|Type|MostRecentX|
#+--------+---+----+-----------+
#| A| 2| X11| X11|
#| A| 4| X2| X2|
#| A| 9| Y4| X2|
#| A| 11| X1| X1|
#| B| 3| Y4| null|
#| B| 7| X1| X1|
#+--------+---+----+-----------+
The trick here is to use when to return the Type column only if it starts with "X". By default, when will return null. Then we can use last with ignorenulls=True to get the value for MostRecentX.
If you want to replace the null with "-" as shown in your question, just call fillna on the MostRecentX column:
df.fillna("-", subset=["MostRecentX"]).show()
#+--------+---+----+-----------+
#|Customer|Day|Type|MostRecentX|
#+--------+---+----+-----------+
#| A| 2| X11| X11|
#| A| 4| X2| X2|
#| A| 9| Y4| X2|
#| A| 11| X1| X1|
#| B| 3| Y4| -|
#| B| 7| X1| X1|
#+--------+---+----+-----------+
I have a spark dataframe that looks like this.
id  cd1  version1  dt1       cd2  version2  dt2       cd3  version3  dt3
1   100  1         20100101  101  1         20100101  102            20100301
1   101  1         20100102  102            20100201  100  1         20100302
2   201  1         20100103  100  1         20100301  100  1         20100303
2   202  2         20100104  100  1         20100105
I need to transpose all the codes into a single column, with the following conditions:
If the corresponding version code is 1, add a decimal point after the first digit
Each patient should have distinct codes
For the above example, the output should look like this.
id code dt
1 1.00 20100101
1 1.01 20100101
1 102 20100301
1 1.01 20100102
1 102 20100201
1 10.0 20100302
2 2.01 20100103
2 1.00 20100301
2 1.00 20100303
2 202 20100104
2 10.0 20100105
I am using Pyspark to do this. In the above example, I have shown only 3 codes with their corresponding version columns but I have 30 such columns. Also, this data has around 25 million rows.
Any ideas on how to accomplish this will be extremely helpful.
You can explode a list of these columns so that there is only one (cd, version) pair per row.
First, let's create the dataframe:
df = sc.parallelize([[1,100,1,101,1,102,None],[1,101,1,102,None,100,1],[2,201,1,100,1,100,1],
[2,202,2,100,1,None,None]]).toDF(["id","cd1","version1","cd2","version2","cd3","version3"])
Using posexplode:
import pyspark.sql.functions as psf
from itertools import chain
nb_versions = 4
df = df.na.fill(-1).select(
    "id",
    psf.posexplode(psf.create_map(list(chain(*[
        (psf.col("cd" + str(i)), psf.col("version" + str(i)))
        for i in range(1, nb_versions)
    ])))).alias("pos", "cd", "version")
).drop("pos").filter("cd != -1")
+---+---+-------+
| id| cd|version|
+---+---+-------+
| 1|100| 1|
| 1|101| 1|
| 1|102| -1|
| 1|101| 1|
| 1|102| -1|
| 1|100| 1|
| 2|201| 1|
| 2|100| 1|
| 2|100| 1|
| 2|202| 2|
| 2|100| 1|
+---+---+-------+
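You mention having 30 such column pairs rather than the 3 shown; assuming they follow the same cd<i>/version<i> naming, the only change needed in either variant below is the range bound:

# 30 (cd, version) pairs -> iterate i = 1..30
nb_versions = 31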
Using explode:
nb_versions = 4
df = df.select(
"id",
psf.explode(psf.array(
[psf.struct(
psf.col("cd" + str(i)).alias("cd"),
psf.col("version" + str(i)).alias("version")) for i in range(1, nb_versions)])).alias("temp"))\
.select("id", "temp.*")
+---+----+-------+
| id| cd|version|
+---+----+-------+
| 1| 100| 1|
| 1| 101| 1|
| 1| 102| null|
| 1| 101| 1|
| 1| 102| null|
| 1| 100| 1|
| 2| 201| 1|
| 2| 100| 1|
| 2| 100| 1|
| 2| 202| 2|
| 2| 100| 1|
| 2|null| null|
+---+----+-------+
Now we can implement your conditions:
division by 100 for version == 1
distinct values
We'll use when and otherwise for the condition, and distinct for the deduplication:
df.withColumn("cd", psf.when(df.version == 1, df.cd/100).otherwise(df.cd))\
.distinct().drop("version")
+---+-----+
| id| cd|
+---+-----+
| 1| 1.0|
| 1| 1.01|
| 1|102.0|
| 2| 1.0|
| 2| 2.01|
| 2|202.0|
+---+-----+
This is how I did it. I am sure there is a better way to do this.
from itertools import chain
from pyspark.sql import functions as psf
from pyspark.sql.functions import concat, lit, substring, when

def process_code(raw_data):
    for i in range(1, 4):
        cd_col_name = "cd" + str(i)
        version_col_name = "version" + str(i)
        raw_data = raw_data.withColumn("mod_cd" + str(i), when(
            raw_data[version_col_name] == 1,
            concat(substring(raw_data[cd_col_name], 1, 1), lit("."),
                   substring(raw_data[cd_col_name], 2, 20))
        ).otherwise(raw_data[cd_col_name]))
    mod_cols = [col for col in raw_data.columns if 'mod_cd' in col]
    nb_versions = 4  # iterate i = 1..3 so all three (mod_cd, dt) pairs are exploded
    new = raw_data.fillna('9999', subset=mod_cols).select("id", psf.posexplode(
        psf.create_map(list(chain(*[(psf.col("mod_cd" + str(i)), psf.col("dt" + str(i)))
                                    for i in range(1, nb_versions)])))
    ).alias("pos", "final_cd", "final_date")).drop("pos")
    return new

test = process_code(df)
test = test.filter(test.final_cd != '9999')
test.show(100, False)
I have a spark data frame like below:
User Item Purchased
1 A 1
1 B 2
2 A 3
2 C 4
3 A 3
3 B 2
3 D 6
Each user has a row for each item they have purchased. Assume Purchased is the quantity purchased (a count).
However, there are items a user might not have purchased, so that user has no row for them; we only have rows for items a user has purchased. So if user 1 has purchased items A and B, we have 2 rows for user 1 corresponding to these two items. But if user 2 has purchased A and C, then user 2 has rows for items A and C but none for B. In the end, each user should have a row for every item in the table with the corresponding count.
I want to convert this data frame into one like the above, but also having rows for the items a user has not purchased, with the corresponding count set to zero.
Like below:
User Item Purchased
1 A 1
1 B 2
1 C 0
1 D 0
2 A 3
2 C 4
2 B 0
2 D 0
3 A 3
3 B 2
3 D 6
3 C 0
One way I thought of: if I use the crosstab method of sqlContext on the first data frame, I can convert each row into columns with the corresponding values. For an item a user doesn't have, it will create a column and put a zero there.
But then how do I convert those columns back into rows? It might also be a roundabout way.
Thanks
We can achieve this using only DataFrame functions as well.
orders = [(1,"A",1),(1,"B",2),(2,"A",3),(2,"C",4),(3,"A",3),(3,"B",2),(3,"D",6)]
df = sqlContext.createDataFrame(orders, ["user", "item", "purchased"])
df_items = df.select("item").distinct().repartition(5).withColumnRenamed("item", "item_1")
df_users = df.select("user").distinct().repartition(5).withColumnRenamed("user", "user_1")
# the join below (no condition) returns the cartesian product of the users and items dfs
df_cartesian = df_users.join(df_items)
joined_df = df_cartesian.join(
    df,
    [df_cartesian.user_1 == df.user, df_cartesian.item_1 == df.item],
    "left_outer"
).drop("user").drop("item")
result_df = joined_df.fillna(0, ["purchased"]) \
    .withColumnRenamed("item_1", "item") \
    .withColumnRenamed("user_1", "user")
Finally, result_df.show() produces the desired output shown below:
+----+----+---------+
|user|item|purchased|
+----+----+---------+
| 2| A| 3|
| 2| B| 0|
| 2| C| 4|
| 2| D| 0|
| 3| A| 3|
| 3| B| 2|
| 3| C| 0|
| 3| D| 6|
| 1| A| 1|
| 1| B| 2|
| 1| C| 0|
| 1| D| 0|
+----+----+---------+
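A side note on the unconditioned join above (this depends on your Spark version, not on the answer itself): Spark 2.x may refuse an implicit cartesian product unless spark.sql.crossJoin.enabled is set; making the intent explicit avoids that:

# explicit cartesian product on Spark >= 2.1
df_cartesian = df_users.crossJoin(df_items)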
df = sqlContext.createDataFrame([(1, 'A', 2), (1, 'B', 3), (2, 'A', 2)], ['user', 'item', 'purchased'])
# pivot items into columns; users without a purchase of an item get 0
pivot = df.groupBy('user').pivot('item').sum('purchased').fillna(0)
items = [i['item'] for i in df.select('item').distinct().collect()]
# melt the pivoted columns back into (user, item, purchased) rows
flattened_rdd = pivot.rdd.flatMap(lambda x: [(x['user'], i, x[i]) for i in items])
sqlContext.createDataFrame(flattened_rdd, ["user", "item", "purchased"]).show()
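If you'd rather stay in the DataFrame API than drop to the RDD for the un-pivot step, a stack expression does the same thing. A minimal sketch, building the expression from the same items list:

from pyspark.sql import functions as F

# e.g. "stack(2, 'A', `A`, 'B', `B`) as (item, purchased)" for items A and B
stack_expr = "stack({}, {}) as (item, purchased)".format(
    len(items), ", ".join("'{0}', `{0}`".format(i) for i in items))
pivot.select('user', F.expr(stack_expr)).show()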