I have a PySpark dataframe as follows
Customer_ID Address_ID Order_ID Order_Date
Cust_1 Addr_1 1 31-Dec-20
Cust_1 Addr_1 2 23-Jan-21
Cust_1 Addr_1 3 06-Feb-21
Cust_1 Addr_2 4 13-Feb-21
Cust_1 Addr_2 5 20-Feb-21
Cust_1 Addr_3 6 18-Mar-21
Cust_1 Addr_3 7 23-Mar-21
Cust_2 Addr_4 8 31-Dec-20
Cust_2 Addr_4 9 23-Jan-21
Cust_2 Addr_4 10 06-Feb-21
Cust_2 Addr_4 11 13-Feb-21
Cust_2 Addr_4 12 20-Feb-21
Cust_2 Addr_5 13 18-Mar-21
Cust_2 Addr_5 14 23-Mar-21
The columns are customer id, address id, order id, and the date the order was placed, respectively.
The Order_ID is always unique.
For every order (every row), I need to calculate the order share for a (customer c1, address a1) pair.
Denoting the order share by ord_share(c1,a1), it is defined by the formula below:
The total number of orders placed by c1 from a1 between (Order_Date - 90 days) and (Order_Date)
------------------------------------------------------------------------------------------------
The total number of orders placed by c1 from all addresses between (Order_Date - 90 days) and (Order_Date)
Note - the horizontal line in the formula above denotes division (numerator over denominator).
90 days here is the window size.
Some examples from the above table (I am leaving order_share as a fraction for ease of understanding):
For ORDER_ID 7:
The total number of orders by Cust_1 in the window is 7
ord_share(Cust_1,Addr_1) = 3/7, ord_share(Cust_1,Addr_2) = 2/7, ord_share(Cust_1,Addr_3) = 2/7
For ORDER_ID 6:
The total number of orders by Cust_1 in the window is 6
ord_share(Cust_1,Addr_1) = 3/6, ord_share(Cust_1,Addr_2) = 2/6, ord_share(Cust_1,Addr_3) = 1/6
For ORDER_ID 5:
The total number of orders by Cust_1 in the window is 5
ord_share(Cust_1,Addr_1) = 3/5, ord_share(Cust_1,Addr_2) = 2/5, ord_share(Cust_1,Addr_3) = 0/5
And so on... I need to store these for all the rows. My output format should be something like the following
(Is_original_address - this column refers to whether the Address_ID was the original address from which the order was placed)
Customer_ID Address_ID Order_ID Order_Share Is_original_address
Cust_1 Addr_1 7 3/7 0
Cust_1 Addr_2 7 2/7 0
Cust_1 Addr_3 7 2/7 1
Cust_1 Addr_1 6 3/6 0
Cust_1 Addr_2 6 2/6 0
Cust_1 Addr_3 6 1/6 1
Cust_1 Addr_1 5 3/5 0
Cust_1 Addr_2 5 2/5 1
Cust_1 Addr_3 5 0/5 0
.
.
.
For all rows
So basically, each row in the input expands to multiple rows in the output, depending on the number of addresses the customer has.
Note - the rows in the initial dataframe are not sorted or grouped in any particular order; I just chose this example to help with the explanation.
I am finding it very hard to approach this problem. I have thought about it a lot, and I can't seem to find a way of joining/grouping the data to do this, since every row is essentially unique. I am really not sure how to get the output dataframe.
From what I can tell, I would have to clone the original dataframe and, for each row, do multiple group-bys or joins. I am really unsure how to even start the implementation.
Any help would be appreciated. Thanks!
Please do let me know if any other information is needed.
As @Christophe commented, this uses window functions, but only to calculate the denominator.
from pyspark.sql import functions as F

data = [
('c1','a1', 1,'2020-12-31'),
('c1','a1', 2,'2021-01-23'),
('c1','a1', 3,'2021-02-06'),
('c1','a2', 4,'2021-02-13'),
('c1','a2', 5,'2021-02-20'),
('c1','a3', 6,'2021-03-18'),
('c1','a3', 7,'2021-03-23'),
('c2','a4', 8,'2020-12-31'),
('c2','a4', 9,'2021-01-23'),
('c2','a4',10,'2021-02-06'),
('c2','a4',11,'2021-02-13'),
('c2','a4',12,'2021-02-20'),
('c2','a5',13,'2021-03-18'),
('c2','a5',14,'2021-03-23'),
]
df = spark.createDataFrame(data=data, schema = ['c_id','a_id','order_id','order_date'])
df=df.select('c_id','a_id','order_id',F.to_date(F.col('order_date')).alias('date'))
df.createOrReplaceTempView('orders')
spark.sql("""
WITH address_combinations AS (
SELECT o1.order_id, o2.c_id, o2.a_id
, CASE WHEN o1.a_id=o2.a_id THEN 1 ELSE 0 END AS is_original_address
, COUNT(CASE WHEN DATEDIFF(o1.date, o2.date) BETWEEN 0 AND 90 THEN 1 END) AS num_orders
FROM orders o1
JOIN orders o2 ON o1.c_id=o2.c_id
GROUP BY o1.order_id, o2.c_id, o2.a_id, is_original_address
)
SELECT c_id, a_id, order_id
, CONCAT(num_orders, '/', SUM(num_orders) OVER (PARTITION BY order_id)) AS order_share
, is_original_address
FROM address_combinations
ORDER BY order_id, a_id
""").show(200)
output:
+----+----+--------+-----------+-------------------+
|c_id|a_id|order_id|order_share|is_original_address|
+----+----+--------+-----------+-------------------+
| c1| a1| 1| 1/1| 1|
| c1| a2| 1| 0/1| 0|
| c1| a3| 1| 0/1| 0|
| c1| a1| 2| 2/2| 1|
| c1| a2| 2| 0/2| 0|
| c1| a3| 2| 0/2| 0|
| c1| a1| 3| 3/3| 1|
| c1| a2| 3| 0/3| 0|
| c1| a3| 3| 0/3| 0|
| c1| a1| 4| 3/4| 0|
| c1| a2| 4| 1/4| 1|
| c1| a3| 4| 0/4| 0|
| c1| a1| 5| 3/5| 0|
| c1| a2| 5| 2/5| 1|
| c1| a3| 5| 0/5| 0|
| c1| a1| 6| 3/6| 0|
| c1| a2| 6| 2/6| 0|
| c1| a3| 6| 1/6| 1|
| c1| a1| 7| 3/7| 0|
| c1| a2| 7| 2/7| 0|
| c1| a3| 7| 2/7| 1|
| c2| a4| 8| 1/1| 1|
| c2| a5| 8| 0/1| 0|
| c2| a4| 9| 2/2| 1|
| c2| a5| 9| 0/2| 0|
| c2| a4| 10| 3/3| 1|
| c2| a5| 10| 0/3| 0|
| c2| a4| 11| 4/4| 1|
| c2| a5| 11| 0/4| 0|
| c2| a4| 12| 5/5| 1|
| c2| a5| 12| 0/5| 0|
| c2| a4| 13| 5/6| 0|
| c2| a5| 13| 1/6| 1|
| c2| a4| 14| 5/7| 0|
| c2| a5| 14| 2/7| 1|
+----+----+--------+-----------+-------------------+
Not sure if this is needed, but here is the exact same SQL re-implemented using Python df APIs:
from pyspark.sql import Window
(
df.alias('o1')
.join(df.alias('o2')
, on=F.col('o1.c_id')==F.col('o2.c_id')
, how='inner'
)
.select(F.col('o1.order_id'), F.col('o2.c_id'), F.col('o2.a_id')
, F.when(F.datediff(F.col('o1.date'),F.col('o2.date')).between(0, 90), 1).alias('is_order')
, F.when(F.col('o1.a_id')==F.col('o2.a_id'), 1).otherwise(0).alias('is_original_address')
)
.groupby('order_id','c_id','a_id','is_original_address')
.agg(F.count('is_order').alias('num_orders'))
.select('c_id','a_id','order_id'
, F.concat(F.col('num_orders')
, F.lit('/')
, F.sum('num_orders').over(Window.partitionBy('order_id'))
).alias('order_share')
, 'is_original_address'
)
.sort('order_id','a_id')
).show(200)
Quick explanation:
address_combinations self-joins the orders table so that, for each order, we get every (customer, address) combination for that customer. There may be duplicates, so we GROUP BY and COUNT the number of o2 orders that fall within the 90-day window ending on o1's order date; this count is the numerator.
The window function in the outer SELECT then gives us the denominator (the total across all of the customer's addresses for that order), and CONCAT formats the result as desired ("x/y").
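If a numeric share is more convenient than the "x/y" string (my addition, not part of the original answer), the CONCAT / F.concat step can be swapped for a plain division. A minimal sketch of the replacement column expression, reusing the num_orders column built above:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Hypothetical replacement for the F.concat(...) expression in the final select:
order_share_numeric = (
    F.col('num_orders') / F.sum('num_orders').over(Window.partitionBy('order_id'))
).alias('order_share')
In SQL, the equivalent would be num_orders / SUM(num_orders) OVER (PARTITION BY order_id).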
Hope this meets your requirements!
Related
I have a PySpark dataframe containing multiple rows for each user:
userId | action | time
----------------------
1      | buy    | 8 AM
1      | buy    | 9 AM
1      | sell   | 2 PM
1      | sell   | 3 PM
2      | sell   | 10 AM
2      | buy    | 11 AM
2      | sell   | 2 PM
2      | sell   | 3 PM
My goal is to split this dataset into a training and a test set in such a way that, for each userId, N % of the rows are in the training set and 100-N % rows are in the test set. For example, given N=75%, the training set will be
userId | action | time
----------------------
1      | buy    | 8 AM
1      | buy    | 9 AM
1      | sell   | 2 PM
2      | sell   | 10 AM
2      | buy    | 11 AM
2      | sell   | 2 PM
and the test set will be
userId | action | time
----------------------
1      | sell   | 3 PM
2      | sell   | 3 PM
Any suggestions? Rows are ordered by the time column, and I don't think Spark's randomSplit will help, as I cannot stratify the split on specific columns.
We had a similar requirement and solved it in the following way:
data = [
(1, "buy"),
(1, "buy"),
(1, "sell"),
(1, "sell"),
(2, "sell"),
(2, "buy"),
(2, "sell"),
(2, "sell"),
]
df = spark.createDataFrame(data, ["userId", "action"])
Use Window functionality to assign serial row numbers. Also compute the count of records per userId; this is needed to work out what percentage of records to keep.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
# Ordering within each userId partition; with real data you would order by the
# time column so the split respects chronology.
window = Window.partitionBy(df["userId"]).orderBy(df["userId"])
df_count = df.groupBy("userId").count().withColumnRenamed("userId", "userId_grp")
df = df.join(df_count, col("userId") == col("userId_grp"), "left").drop("userId_grp")
df = df.select("userId", "action", "count", row_number().over(window).alias("row_number"))
df.show()
+------+------+-----+----------+
|userId|action|count|row_number|
+------+------+-----+----------+
| 1| buy| 4| 1|
| 1| buy| 4| 2|
| 1| sell| 4| 3|
| 1| sell| 4| 4|
| 2| sell| 4| 1|
| 2| buy| 4| 2|
| 2| sell| 4| 3|
| 2| sell| 4| 4|
+------+------+-----+----------+
Filter training records by required percentage:
n = 75
df_train = df.filter(col("row_number") <= col("count") * n / 100)
df_train.show()
+------+------+-----+----------+
|userId|action|count|row_number|
+------+------+-----+----------+
| 1| buy| 4| 1|
| 1| buy| 4| 2|
| 1| sell| 4| 3|
| 2| sell| 4| 1|
| 2| buy| 4| 2|
| 2| sell| 4| 3|
+------+------+-----+----------+
And remaining records go to the test set:
df_test = df.alias("df").join(
    df_train.alias("tr"),
    (col("df.userId") == col("tr.userId")) & (col("df.row_number") == col("tr.row_number")),
    "leftanti",
)
df_test.show()
+------+------+-----+----------+
|userId|action|count|row_number|
+------+------+-----+----------+
| 1| sell| 4| 4|
| 2| sell| 4| 4|
+------+------+-----+----------+
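A small simplification (my addition, not part of the original answer): the per-user count can also be computed with a window function, which avoids the separate groupBy and join. A sketch starting from the df built in the data preparation step above (before the join):
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, lit, row_number

n = 75
w = Window.partitionBy("userId").orderBy("userId")  # order by your time column in practice
counted = (
    df.withColumn("row_number", row_number().over(w))
      .withColumn("count", count(lit(1)).over(Window.partitionBy("userId")))
)
df_train = counted.filter(col("row_number") <= col("count") * n / 100)
df_test = counted.filter(col("row_number") > col("count") * n / 100)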
You can use ntile:
from pyspark.sql.functions import expr, col

# ntile(4) splits each partition into 4 roughly equal buckets; with real data,
# order by your time column inside the window rather than by id.
ds = ds.withColumn("tile", expr("ntile(4) over (partition by id order by id)"))
The dataset where tile=4 is your test set, and tile<4 is your train set:
test = ds.filter(col("tile") == 4)
train = ds.filter(col("tile") < 4)
test.show()
+---+------+----+----+
| id|action|time|tile|
+---+------+----+----+
| 1| sell|3 PM| 4|
| 2| sell|3 PM| 4|
+---+------+----+----+
train.show()
+---+------+-----+----+
| id|action| time|tile|
+---+------+-----+----+
| 1| buy| 8 AM| 1|
| 1| buy| 9 AM| 2|
| 1| sell| 2 PM| 3|
| 2| sell|10 AM| 1|
| 2| buy|11 AM| 2|
| 2| sell| 2 PM| 3|
+---+------+-----+----+
Good luck!
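A note on granularity (my addition, not part of the original answer): ntile(4) gives an exact 75/25 split only when each user's row count is a multiple of 4. For an arbitrary N you could use percent_rank instead; a sketch assuming the question's userId column and a sortable time column:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("userId").orderBy("time")
ds = ds.withColumn("pct", F.percent_rank().over(w))  # 0 for the first row, 1 for the last
train = ds.filter(F.col("pct") < 0.75)
test = ds.filter(F.col("pct") >= 0.75)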
I have a dataframe like:
id Name Rank Course
1 S1 21 Physics
2 S2 22 Chemistry
3 S3 24 Math
4 S2 22 English
5 S2 22 Social
6 S1 21 Geography
I want to group this dataset by Name and Rank and compute a group number. In pandas, I can easily do:
df['ngrp'] = df.groupby(['Name', 'Rank']).ngroup()
After computing the above, I get the following output:
id Name Rank Course ngrp
1 S1 21 Physics 0
6 S1 21 Geography 0
2 S2 22 Chemistry 1
4 S2 22 English 1
5 S2 22 Social 1
3 S3 24 Math 2
Is there a method in PySpark that will achieve the same output? I tried the following, but it doesn't seem to work:
from pyspark.sql import Window
w = Window.partitionBy(['Name', 'Rank'])
df.select(['Name', 'Rank'], ['Course'], f.count(['Name', 'Rank']).over(w).alias('ngroup')).show()
You can opt for DENSE_RANK -
Data Preparation
import pandas as pd
from io import StringIO
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = pd.read_csv(StringIO("""
id,Name,Rank,Course
1,S1,21,Physics
2,S2,22,Chemistry
3,S3,24,Math
4,S2,22,English
5,S2,22,Social
6,S1,21,Geography
"""),delimiter=',')
sparkDF = sql.createDataFrame(df)  # 'sql' here is the SparkSession/SQLContext handle used in this answer
sparkDF.show()
+---+----+----+---------+
| id|Name|Rank| Course|
+---+----+----+---------+
| 1| S1| 21| Physics|
| 2| S2| 22|Chemistry|
| 3| S3| 24| Math|
| 4| S2| 22| English|
| 5| S2| 22| Social|
| 6| S1| 21|Geography|
+---+----+----+---------+
Dense Rank
# Note: an unpartitioned window moves all rows to a single partition
# (a join-based alternative is sketched after this answer).
window = Window.orderBy(['Name','Rank'])
sparkDF = sparkDF.withColumn('ngroup',F.dense_rank().over(window) - 1)
sparkDF.orderBy(['Name','ngroup']).show()
+---+----+----+---------+------+
| id|Name|Rank| Course|ngroup|
+---+----+----+---------+------+
| 6| S1| 21|Geography| 0|
| 1| S1| 21| Physics| 0|
| 4| S2| 22| English| 1|
| 2| S2| 22|Chemistry| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
Dense Rank - SparkSQL
sparkDF.createOrReplaceTempView('TB1')

sql.sql("""
SELECT
ID,
NAME,
RANK,
COURSE,
DENSE_RANK() OVER(ORDER BY NAME,RANK) - 1 as NGROUP
FROM TB1
""").show()
+---+----+----+---------+------+
| ID|NAME|RANK| COURSE|NGROUP|
+---+----+----+---------+------+
| 1| S1| 21| Physics| 0|
| 6| S1| 21|Geography| 0|
| 2| S2| 22|Chemistry| 1|
| 4| S2| 22| English| 1|
| 5| S2| 22| Social| 1|
| 3| S3| 24| Math| 2|
+---+----+----+---------+------+
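A scalability note (my addition, not part of the original answer): since the window above has no PARTITION BY, Spark pulls all rows onto a single partition to compute the dense rank. A sketch that ranks only the distinct (Name, Rank) pairs and joins the group number back, which keeps the unpartitioned window small:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

group_ids = (
    sparkDF.select('Name', 'Rank').distinct()
           .withColumn('ngroup', F.dense_rank().over(Window.orderBy('Name', 'Rank')) - 1)
)
result = sparkDF.join(group_ids, on=['Name', 'Rank'], how='left')
result.orderBy('ngroup', 'id').show()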
Imagine you have two datasets df and df2 like the following:
df:
ID Size Condition
1 2 1
2 3 0
3 5 0
4 7 1
df2:
aux_ID Scalar
1 2
3 2
I want to get an output where, if the Condition in df is 1, we multiply Size by the Scalar, and then return df with the changed values.
I would like to do this as efficiently as possible, perhaps avoiding the join if that is possible.
output_df:
ID Size Condition
1 4 1
2 3 0
3 5 0
4 7 1
Not sure why you would want to avoid joins in the first place; they can be efficient in their own right.
With that said, this can easily be done by merging the 2 datasets and building a case-when statement against the condition.
Data Preparation
import pandas as pd
from io import StringIO
from pyspark.sql import functions as F

df1 = pd.read_csv(StringIO("""ID,Size,Condition
1,2,1
2,3,0
3,5,0
4,7,1
""")
,delimiter=','
)
df2 = pd.read_csv(StringIO("""aux_ID,Scalar
1,2
3,2
""")
,delimiter=','
)
sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---+----+---------+
| ID|Size|Condition|
+---+----+---------+
| 1| 2| 1|
| 2| 3| 0|
| 3| 5| 0|
| 4| 7| 1|
+---+----+---------+
sparkDF2.show()
+------+------+
|aux_ID|Scalar|
+------+------+
| 1| 2|
| 3| 2|
+------+------+
Case When
finalDF = (
    sparkDF1.join(sparkDF2, sparkDF1['ID'] == sparkDF2['aux_ID'], 'left')
            .select(sparkDF1['*'], sparkDF2['Scalar'], sparkDF2['aux_ID'])
            .withColumn(
                'Size_Changed',
                F.when(
                    (F.col('Condition') == 1) & (F.col('aux_ID').isNotNull()),
                    F.col('Size') * F.col('Scalar')
                ).otherwise(F.col('Size'))
            )
)
finalDF.show()
+---+----+---------+------+------+------------+
| ID|Size|Condition|Scalar|aux_ID|Size_Changed|
+---+----+---------+------+------+------------+
| 1| 2| 1| 2| 1| 4|
| 3| 5| 0| 2| 3| 5|
| 2| 3| 0| null| null| 3|
| 4| 7| 1| null| null| 7|
+---+----+---------+------+------+------------+
You can drop the unnecessary columns; I kept them here for illustration.
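For completeness (my addition, not part of the original answer), a minimal cleanup sketch that returns the original schema with the adjusted Size, using the finalDF built above:
output_df = (
    finalDF
    .drop('Size', 'Scalar', 'aux_ID')
    .withColumnRenamed('Size_Changed', 'Size')
    .select('ID', 'Size', 'Condition')
)
output_df.show()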
I have a spark dataframe
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
3 | CA | berry | 10
I want to get rows where fruits are apple or orange. So I use Spark SQL:
SELECT * FROM table WHERE fruit LIKE '%apple%' OR fruit LIKE '%orange%';
It returns
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
2 | CA | orange| 20
But it is supposed to return
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
as row 1 is just a misspelling.
So I plan on using fuzzywuzzy for string matching.
I know that
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
print(fuzz.partial_ratio('apple', 'apple')) -> 100
print(fuzz.partial_ratio('apple', 'appel')) -> 83
But I am not sure how to apply this to a column in the dataframe to get the relevant rows.
Since you are interested in implementing fuzzy matching as a filter, you must first decide on a threshold for how similar you would like the matches to be.
Approach 1
For your fuzzywuzzy import, this could be 80 for the purposes of this demonstration (adjust based on your needs). You could then implement a UDF to apply your imported fuzzy-matching code, e.g.
from pyspark.sql import functions as F
from pyspark.sql import types as T
@F.udf(T.BooleanType())
def is_fuzzy_match(field_value,search_value, threshold=80):
from fuzzywuzzy import fuzz
return fuzz.partial_ratio(field_value, search_value) > threshold
Then apply your udf as a filter on your dataframe
df = (
df.where(
is_fuzzy_match(F.col("fruit"),F.lit("apple")) |
is_fuzzy_match(F.col("fruit"),F.lit("orange"))
)
)
Approach 2: Recommended
However, UDFs can be expensive to execute on Spark, and Spark already implements the levenshtein function, which is also useful here. You may want to read more about how the Levenshtein distance accomplishes fuzzy matching.
With this approach, your code could look like the following, using a threshold of 3:
from pyspark.sql import functions as F
df = df.where(
(
F.levenshtein(
F.col("fruit"),
F.lit("apple")
) < 3
) |
(
F.levenshtein(
F.col("fruit"),
F.lit("orange")
) < 3
)
)
df.show()
+---+----+------+--------+
| id|city| fruit|quantity|
+---+----+------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA|orange| 20|
+---+----+------+--------+
For debugging purposes, the result of the levenshtein function is included below:
df.withColumn("diff",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).show()
+---+----+------+--------+----+
| id|city| fruit|quantity|diff|
+---+----+------+--------+----+
| 0| CA| apple| 300| 0|
| 1| CA| appel| 100| 2|
| 2| CA|orange| 20| 5|
| 3| CA| berry| 10| 5|
+---+----+------+--------+----+
Update 1
In response to additional sample data provided by the OP in the comments:
If I have a fruit like kashmir apple and want it to match with apple
Approach 3
You could try the following approach and adjust the threshold as desired.
Since you are interested in catching a possibly misspelled fruit anywhere in the text, you could apply the Levenshtein distance to every piece of the fruit name. The functions below (not UDFs, but plain helpers that simplify applying the logic) implement this approach: matches_fruit_ratio estimates how much of a match is found, while matches_fruit takes the maximum matches_fruit_ratio over every piece of the fruit name split by a space.
from pyspark.sql import functions as F
def matches_fruit_ratio(fruit_column, fruit_search, threshold=0.3):
    # Fraction of characters that "survive" the edit distance: 1.0 is an exact
    # match; values near 0 (or negative) mean the strings are very different.
    return (F.length(fruit_column) - F.levenshtein(
        fruit_column,
        F.lit(fruit_search)
    )) / F.length(fruit_column)

def matches_fruit(fruit_column, fruit_search, threshold=0.6):
    # Split the fruit name on spaces, score each piece, and keep the best score.
    return F.array_max(F.transform(
        F.split(fruit_column, " "),
        lambda fruit_piece: matches_fruit_ratio(fruit_piece, fruit_search)
    )) >= threshold
This can be used as follows:
df = df.where(
matches_fruit(
F.col("fruit"),
"apple"
) | matches_fruit(
F.col("fruit"),
"orange"
)
)
df.show()
+---+----+-------------+--------+
| id|city| fruit|quantity|
+---+----+-------------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA| orange| 20|
| 4| CA| apply berry| 3|
| 5| CA| apple berry| 1|
| 6| CA|kashmir apple| 5|
| 7| CA|kashmir appel| 8|
+---+----+-------------+--------+
For debugging purposes, I have added extra sample data and output columns for the different components of each function, to demonstrate how they may be used:
df.withColumn("length",
F.length(
"fruit"
)
).withColumn("levenshtein",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn("length - levenshtein",
F.length(
"fruit"
) - F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn(
"matches_fruit_ratio",
matches_fruit_ratio(
F.col("fruit"),
"apple"
)
).withColumn(
"matches_fruit_values_before_threshold",
F.array_max(F.transform(
F.split("fruit"," "),
lambda fruit_piece : matches_fruit_ratio(fruit_piece,"apple")
))
).withColumn(
"matches_fruit",
matches_fruit(
F.col("fruit"),
"apple"
)
).show()
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| id|city| fruit|quantity|length|levenshtein|length - levenshtein|matches_fruit_ratio|matches_fruit_values_before_threshold|matches_fruit|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| 0| CA| apple| 300| 5| 0| 5| 1.0| 1.0| true|
| 1| CA| appel| 100| 5| 2| 3| 0.6| 0.6| true|
| 2| CA| orange| 20| 6| 5| 1|0.16666666666666666| 0.16666666666666666| false|
| 3| CA| berry| 10| 5| 5| 0| 0.0| 0.0| false|
| 4| CA| apply berry| 3| 11| 6| 5|0.45454545454545453| 0.8| true|
| 5| CA| apple berry| 1| 11| 6| 5|0.45454545454545453| 1.0| true|
| 6| CA|kashmir apple| 5| 13| 8| 5|0.38461538461538464| 1.0| true|
| 7| CA|kashmir appel| 8| 13| 10| 3|0.23076923076923078| 0.6| true|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
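A version note (my addition, not part of the original answer): F.transform with a Python lambda requires Spark 3.1+. On older versions the same idea can be written as a SQL expression, since higher-order functions have been available in Spark SQL since 2.4; a sketch with the column name and threshold assumed from above:
from pyspark.sql import functions as F

matches_apple = F.expr("""
    array_max(transform(
        split(fruit, ' '),
        piece -> (length(piece) - levenshtein(piece, 'apple')) / length(piece)
    )) >= 0.6
""")
df.where(matches_apple).show()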
With a dataframe like this,
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,None,"201602"),(1,20,3003,"201601"), (1,20,None,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, None,"201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {column: df_data.agg({column:"mean"}).flatMap(list).collect()[0] for column in df_data.columns if column not in ['date','id']}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
compute statistics
fill the blanks
It pretty much limits what you can really improve here, still:
replace flatMap(list).collect()[0] with first()[0] or structure unpacking
compute all stats with a single action
use the built-in Row methods to extract a dictionary
The final result could look like this:
from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
stats = df.agg(*(
avg(c).alias(c) for c in df.columns if c not in exclude
))
return df.na.fill(stats.first().asDict())
fill_with_mean(df_data, ["id", "date"])
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
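For completeness (my addition, not part of the original answer), a minimal Imputer sketch for the example above; Imputer expects float/double input columns, so cost is cast first:
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F

df_num = df_data.withColumn("cost", F.col("cost").cast("double"))
imputer = Imputer(strategy="mean", inputCols=["cost"], outputCols=["cost"])
df_filled = imputer.fit(df_num).transform(df_num)
df_filled.show()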