Suppose I have 5 TB of data with the following schema, and I am using PySpark.
| id | date | Month | KPI_1 | ... | KPI_n
For 90% of the KPIs, I only need to know the sum/min/max aggregated to the (id, Month) level. For the remaining 10%, I need to know the first value based on date.
One option is to use a window. For example, I can do
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy("id", "Month").orderBy(F.desc("date"))
# for the 90% kpi
agg_df = df.withColumn("kpi_1", F.sum("kpi_1").over(w))
agg_df = agg_df.withColumn("kpi_2", F.max("kpi_2").over(w))
agg_df = agg_df.withColumn("kpi_3", F.min("kpi_3").over(w))
...
# Select the last row of each window to get the full accumulated sum for the 90% KPIs
# and the last value for the 10% KPIs (equivalent to the first value if ranked ascending).
# Then continue processing agg_df with filters based on the sum/max/min values of the 90% KPIs.
But I am not sure how to select the last row of each window. Does anyone have any suggestions, or is there a better way to aggregate?
Let's assume we have this data
+---+----------+-------+-----+-----+
| id| date| month|kpi_1|kpi_2|
+---+----------+-------+-----+-----+
| 1|2000-01-01|2000-01| 1| 100|
| 1|2000-01-02|2000-01| 2| 200|
| 1|2000-01-03|2000-01| 3| 300|
| 1|2000-01-04|2000-01| 4| 400|
| 1|2000-01-05|2000-01| 5| 500|
| 1|2000-02-01|2000-02| 10| 11|
| 1|2000-02-02|2000-02| 20| 21|
| 1|2000-02-03|2000-02| 30| 31|
| 1|2000-02-04|2000-02| 40| 41|
+---+----------+-------+-----+-----+
and we want to calculate the min, max and sum for kpi_1 and get the last value of kpi_2 for each group.
Getting the min, max and sum of kpi_1 can be achieved by grouping the data by id and month. With Spark >= 3.0.0, max_by can be used to get the latest value of kpi_2:
df_avg = df \
    .groupBy("id", "month") \
    .agg(F.sum("kpi_1"), F.min("kpi_1"), F.max("kpi_1"), F.expr("max_by(kpi_2, date)"))
df_avg.show()
prints
+---+-------+----------+----------+----------+-------------------+
| id| month|sum(kpi_1)|min(kpi_1)|max(kpi_1)|max_by(kpi_2, date)|
+---+-------+----------+----------+----------+-------------------+
| 1|2000-02| 100| 10| 40| 41|
| 1|2000-01| 15| 1| 5| 500|
+---+-------+----------+----------+----------+-------------------+
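As a small usage note, the aggregate columns can be given friendlier names with alias(); a possible variant of the same aggregation (the alias names here are just examples):
df_avg = df \
    .groupBy("id", "month") \
    .agg(
        F.sum("kpi_1").alias("kpi_1_sum"),
        F.min("kpi_1").alias("kpi_1_min"),
        F.max("kpi_1").alias("kpi_1_max"),
        F.expr("max_by(kpi_2, date)").alias("kpi_2_last"),
    )
df_avg.show()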
For Spark versions < 3.0.0, max_by is not available, so getting the last value of kpi_2 for each group is more difficult. A first idea could be to use the aggregation function first() on a data frame ordered by descending date. A simple test gave me the correct result, but unfortunately the documentation states: "The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle".
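For reference, that first()-based attempt could look like the sketch below (my own illustration, not code from the answer); it may happen to return the right values, but because of the warning above it is not guaranteed to:
import pyspark.sql.functions as F

# Not guaranteed to be correct: the sort order is not necessarily preserved
# through the groupBy shuffle, so first() may pick an arbitrary row per group.
df_first_attempt = (df.orderBy(F.desc("date"))
                      .groupBy("id", "month")
                      .agg(F.first("kpi_2").alias("kpi_2")))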
A better approach to get the last value of kpi_2 is to use a window, as shown in the question. The window function row_number() works:
w = Window.partitionBy("id", "month").orderBy(F.desc("date"))
df_first = df.withColumn("row_number", F.row_number().over(w)) \
    .where("row_number = 1") \
    .drop("row_number") \
    .select("id", "month", "KPI_2")
df_first.show()
prints
+---+-------+-----+
| id| month|KPI_2|
+---+-------+-----+
| 1|2000-02| 41|
| 1|2000-01| 500|
+---+-------+-----+
Joining the first part (computed without the max_by column) with the second part gives the desired result:
df_result = df_avg.join(df_first, ['id', 'month'])
df_result.show()
prints
+---+-------+----------+----------+----------+-----+
| id| month|sum(kpi_1)|min(kpi_1)|max(kpi_1)|KPI_2|
+---+-------+----------+----------+----------+-----+
| 1|2000-02| 100| 10| 40| 41|
| 1|2000-01| 15| 1| 5| 500|
+---+-------+----------+----------+----------+-----+
Suppose I have a PySpark dataframe with a number of unique account values, each of which has a unique number of entries, like so:
+--------+-----+--------+--------+
| account| col1|    col2|    col3|
+--------+-----+--------+--------+
|  325235|   59|      -6|  625.64|
|  325235|   23|    -282|  923.47|
|  325235|   77|-1310.89| 3603.48|
|  245623|  120|    1.53| 1985.63|
|  245623|  106|     -12| 1985.06|
|  658567|   84|     -12|  194.67|
+--------+-----+--------+--------+
I want to specify a batch size and assign multiple accounts to the same batch based on the batch size. Let's suppose I choose batch size = 2; then the output should be the following:
+--------+-----+--------+--------+------------+
| account| col1|    col2|    col3|batch_number|
+--------+-----+--------+--------+------------+
|  325235|   59|      -6|  625.64|           1|
|  325235|   23|    -282|  923.47|           1|
|  325235|   77|-1310.89| 3603.48|           1|
|  245623|  120|    1.53| 1985.63|           1|
|  245623|  106|     -12| 1985.06|           1|
|  658567|   84|     -12|  194.67|           2|
+--------+-----+--------+--------+------------+
I can then do a groupby on the batch_number column and have multiple accounts per batch. Here is my working code, but it is too slow since I am doing a toPandas().
import pandas as pd
from pyspark.sql.functions import lit

# Get unique accounts in source data
accounts = [row.account for row in source_data.select("account").distinct().collect()]
# Find the number of batches. The last batch will have size = remainder
num_batches, remainder = divmod(len(accounts), batchsize)
# Create batch dataframe where a batch number is assigned to each account.
batches = [i for _ in range(batchsize) for i in range(1, int(num_batches) + 1)] + [num_batches + 1 for i in range(remainder)]
batch_df = pd.DataFrame({"account": accounts, "batch_number": batches}, columns=["account", "batch_number"]).set_index("account")
# Add a zero column for batch number to source data which will be populated
source_data = source_data.withColumn("batch_number", lit(0))
# Map batch numbers of accounts back into the source data
source_data_p = source_data.toPandas()
for ind in source_data_p.index:
    source_data_p.at[ind, "batch_number"] = batch_df.at[source_data_p.at[ind, "account"], "batch_number"]
# Convert mapped pandas df back to spark df
batched_df = sqlcontext.createDataFrame(source_data_p)
I would ideally like to get rid of the toPandas() call, and do the mapping in pyspark. I have seen a few related posts, like this one: How to batch up items from a PySpark DataFrame, but this doesn't fit into the flow of my code, so I will have to re-write the whole project just to implement this.
From what I understand, you can assign an index to each account (for example with StringIndexer from pyspark.ml, or in any other way) and then use floor division:
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

n = 2
idx = StringIndexer(inputCol="account", outputCol="batch_number")
(idx.fit(df).transform(df)
    .withColumn("batch_number", F.floor(F.col("batch_number") / n) + 1)).show()
+-------+----+--------+-------+------------+
|account|col1| col2| col3|batch_number|
+-------+----+--------+-------+------------+
| 325235| 59| -6.0| 625.64| 1|
| 325235| 23| -282.0| 923.47| 1|
| 325235| 77|-1310.89|3603.48| 1|
| 245623| 120| 1.53|1985.63| 1|
| 245623| 106| -12.0|1985.06| 1|
| 658567| 84| -12.0| 194.67| 2|
+-------+----+--------+-------+------------+
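If you would rather stay in the DataFrame API and avoid pyspark.ml, a similar effect can be achieved with a window function. This is only a sketch of the idea (not tested against your data): dense_rank gives each distinct account a consecutive index, which is then floor-divided into batches.
import pyspark.sql.functions as F
from pyspark.sql import Window

n = 2  # batch size: number of accounts per batch

# dense_rank assigns consecutive indexes 1, 2, 3, ... to distinct accounts;
# dividing (index - 1) by the batch size groups them into batches.
# Note: a window without partitionBy moves all rows through a single partition.
w = Window.orderBy("account")
batched_df = (df.withColumn("acct_idx", F.dense_rank().over(w))
                .withColumn("batch_number", F.floor((F.col("acct_idx") - 1) / n) + 1)
                .drop("acct_idx"))
batched_df.show()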
I have a dataframe like this...
+----------+-----+
| date|price|
+----------+-----+
|2019-01-01| 25|
|2019-01-02| 22|
|2019-01-03| 20|
|2019-01-04| -5|
|2019-01-05| -1|
|2019-01-06| -2|
|2019-01-07| 5|
|2019-01-08| -11|
+----------+-----+
I want to create a new column based on logic that needs to look back at other rows, not just the column values of the same row.
I tried some UDFs, but a UDF only receives the corresponding row's column values; I do not know how to look at other rows...
With an example:
I would like to create a new column "newprice", which will be something like this...
+----------+-----+---------+
|      date|price|new price|
+----------+-----+---------+
|2019-01-01|   25|       25|
|2019-01-02|   22|       22|
|2019-01-03|   20|       20|
|2019-01-04|   -5|       20|
|2019-01-05|   -1|       20|
|2019-01-06|   -2|       20|
|2019-01-07|    5|        5|
|2019-01-08|  -11|        5|
+----------+-----+---------+
Essentially, every value in the new column is based not on that row's values but on other rows' values...
Logic: if the price is negative, look back at previous days; if that day has a positive value, take it, otherwise go back one more day until a positive value is available...
dateprice = [('2019-01-01',25),('2019-01-02',22),('2019-01-03',20),('2019-01-04', -5),\
('2019-01-05',-1),('2019-01-06',-2),('2019-01-07',5),('2019-01-08', -11)]
dataDF = sqlContext.createDataFrame(dateprice, ('date', 'price'))
Any help will be highly appreciated.
First populate the new price column from the price column, but replace the negative values with nulls. Then you can use the technique shown in Fill in null with previously known good value with pyspark to get the last non-null value, which in this case will be the last positive value.
For example:
from pyspark.sql.functions import col, last, when
from pyspark.sql import Window
w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
dataDF.withColumn("new_price", when(col("price") >= 0, col("price"))) \
    .withColumn(
        "new_price",
        last('new_price', True).over(w)
    ) \
    .show()
#+----------+-----+---------+
#| date|price|new_price|
#+----------+-----+---------+
#|2019-01-01| 25| 25|
#|2019-01-02| 22| 22|
#|2019-01-03| 20| 20|
#|2019-01-04| -5| 20|
#|2019-01-05| -1| 20|
#|2019-01-06| -2| 20|
#|2019-01-07| 5| 5|
#|2019-01-08| -11| 5|
#+----------+-----+---------+
Here I have taken advantage of the fact that when returns null by default if the condition doesn't match and no otherwise is specified.
I tried this using Spark SQL. Let me explain my solution in two parts.
First, when the price is negative we can fetch the most recent date on which the price was positive; otherwise we populate the price itself, as shown below:
spark.sql("""
    select *,
           case when price < 0 then
               max(lag(case when price < 0 then null else date end) over(order by date))
                   over(order by date rows between unbounded preceding and current row)
           else price end as price_or_date
    from dataset
""").show()
Output:
+----------+-----+-------------+
| date|price|price_or_date|
+----------+-----+-------------+
|2019-01-01| 25| 25|
|2019-01-02| 22| 22|
|2019-01-03| 20| 20|
|2019-01-04| -5| 2019-01-03|
|2019-01-05| -1| 2019-01-03|
|2019-01-06| -2| 2019-01-03|
|2019-01-07| 5| 5|
|2019-01-08| -11| 2019-01-07|
+----------+-----+-------------+
Second, you can do a left join on the same dataset, matching this derived column against the date column. The rows whose price_or_date column holds a price (rather than a date) will not find a match and come out as null. Finally, we apply a simple coalesce on them.
Combining the two parts, we arrive at the final query shown below, which generates the desired output:
spark.sql("""
    select
        a.date
        , a.price
        , coalesce(b.price, a.price) as new_price
    from
    (
        select *,
               case when price < 0 then
                   max(lag(case when price < 0 then null else date end) over(order by date))
                       over(order by date rows between unbounded preceding and current row)
               else price end as price_or_date
        from dataset
    ) a
    left join dataset b
        on a.price_or_date = b.date
    order by a.date""").show()
Output:
+----------+-----+---------+
| date|price|new_price|
+----------+-----+---------+
|2019-01-01| 25| 25|
|2019-01-02| 22| 22|
|2019-01-03| 20| 20|
|2019-01-04| -5| 20|
|2019-01-05| -1| 20|
|2019-01-06| -2| 20|
|2019-01-07| 5| 5|
|2019-01-08| -11| 5|
+----------+-----+---------+
Hope this helps.
I have a data frame in pyspark like below.
+---+-------------+------------+
| id| device| model|
+---+-------------+------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 3| cctv| cctv|
| 2| iphone|apple iphone|
| 3| spy camera| |
+---+-------------+------------+
I want to create a column by concatenating the unique values in the device and model columns for each id.
I have done it like below.
First, I concatenated both the device and model columns:
from pyspark.sql.functions import col, concat, lit

df1 = df.select(col("id"), concat(col("model"), lit(","), col("device")).alias('con'))
+---+--------------------+
| id| con|
+---+--------------------+
| 3| mac,mac pro|
| 1| iphone5,iphone|
| 1|android,android p...|
| 1| windows,windows pc|
| 1|spy camera,spy ca...|
| 2| camera,|
| 3| cctv,cctv|
| 2| apple iphone,iphone|
| 3| ,spy camera|
+---+--------------------+
Then I did a groupBy on id:
from pyspark.sql import functions as f

df2 = df1.groupBy("id").agg(f.concat_ws(",", f.collect_set(df1.con)).alias('Group_con'))
+---+-----------------------------------------------------------------------------+
| id| Group_con|
+---+-----------------------------------------------------------------------------+
| 1|iphone5,iphone,android,android phone,windows,windows pc,spy camera,spy camera|
| 2| camera,,apple iphone,iphone|
| 3| mac,mac pro,cctv,cctv,,spy camera|
+---+-----------------------------------------------------------------------------+
But I am getting duplicate values in the result. How can I avoid populating duplicate values in the final data frame?
Use F.array(), F.explode() and F.collect_set():
from pyspark.sql import functions as F
df.withColumn('con', F.explode(F.array('device', 'model'))) \
.groupby('id').agg(F.collect_set('con').alias('Group_con')) \
.show(3,0)
# +---+--------------------------------------------------------------------------+
# |id |Group_con |
# +---+--------------------------------------------------------------------------+
# |3 |[cctv, mac pro, spy camera, mac] |
# |1 |[windows pc, iphone5, windows, iphone, android phone, spy camera, android]|
# |2 |[apple iphone, camera, iphone] |
# +---+--------------------------------------------------------------------------+
(tested on spark version 2.2.1)
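If the final column should be a single comma-separated string (like the Group_con in the question) rather than an array, the collected set can be wrapped in concat_ws; a small variation of the same code:
from pyspark.sql import functions as F

df.withColumn('con', F.explode(F.array('device', 'model'))) \
  .groupby('id') \
  .agg(F.concat_ws(',', F.collect_set('con')).alias('Group_con')) \
  .show(3, 0)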
You can remove the duplicates by using collect_set and a udf function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
def uniqueStringUdf(device, model):
    return ','.join(set(filter(bool, device + model)))

uniqueStringUdfCall = f.udf(uniqueStringUdf, t.StringType())

df.groupBy("id") \
    .agg(uniqueStringUdfCall(f.collect_set("device"), f.collect_set("model")).alias("con")) \
    .show(truncate=False)
which should give you
+---+------------------------------------------------------------------+
|id |con |
+---+------------------------------------------------------------------+
|3 |spy camera,mac,mac pro,cctv |
|1 |spy camera,windows,iphone5,windows pc,iphone,android phone,android|
|2  |camera,iphone,apple iphone                                        |
+---+------------------------------------------------------------------+
where,
device + model is the concatenation of the two collected sets,
filter(bool, device + model) removes empty strings from the concatenated list,
set(filter(bool, device + model)) removes the duplicate strings in the concatenated list,
','.join(set(filter(bool, device + model))) joins all the elements of the concatenated list into a comma-separated string.
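As a quick illustration of that chain on plain Python lists (made-up values, not the actual collected sets):
device = ["mac pro", "cctv", "spy camera"]
model = ["mac", "cctv", ""]

# filter(bool, ...) drops the empty string, set(...) removes the duplicate "cctv",
# and ','.join(...) builds the comma-separated result.
print(','.join(set(filter(bool, device + model))))
# prints something like 'spy camera,mac,mac pro,cctv' (set order is not guaranteed)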
I hope the answer is helpful
Not sure if this is going to be very helpful, but one solution I could think of is to check for the duplicate values in the column and then delete them by using their position/index.
Or
Split all the values at the comma "," into a list and remove the duplicates by comparing each value. Or count() the occurrences of a value and, if it is more than 1, delete all the duplicates other than the first one.
Sorry if this wasn't helpful. These are the two ways I could think of to solve your problem.
I have a data frame in Pyspark like below. I want to count values in two columns based on some lists and populate new columns for each list
df.show()
+---+-------------+--------------+
| id|       device|  device_model|
+---+-------------+--------------+
|  3|      mac pro|           mac|
|  1|       iphone|       iphone5|
|  1|android phone|       android|
|  1|   windows pc|       windows|
|  1|   spy camera|    spy camera|
|  2|             |        camera|
|  2|       iphone|  apple iphone|
|  3|   spy camera|              |
|  3|         cctv|          cctv|
+---+-------------+--------------+
lists are below:
phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']
security_list = ['camera', 'cctv']
I want to count the device and device_model values for each id and pivot the results into a new data frame.
I want to count the values in both the device_model and device columns for each id that match the strings in the lists.
For example: phone_list contains the string iphone; this should count both iphone and iphone5.
The result I want
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 4| 2| 2|
| 2| 2|null| 1|
| 3| null| 2| 3|
+---+------+----+--------+
I have done it like below:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
Using the above I can only handle the device column, and only when the string matches exactly. I am unable to figure out how to do it for both columns and when the value merely contains the string.
How can I achieve the result I want?
Here is a working solution. I have used a udf function for checking the strings and calculating the sum; you could use built-in functions if possible. (Comments are provided as explanation.)
#creating dictionary for the lists with names for columns
columnLists = {'phone':phone_list, 'pc':pc_list, 'security':security_list}

#udf function for checking the strings and summing them
from pyspark.sql import functions as F
from pyspark.sql import types as t

def checkDevices(device, deviceModel, name):
    sum = 0
    for x in columnLists[name]:
        if x in device:
            sum += 1
        if x in deviceModel:
            sum += 1
    return sum

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

#populating the sum returned from udf function to respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

#finally grouping and sum
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()
which should give you
+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
| 3| 0| 2| 3|
| 1| 4| 2| 2|
| 2| 2| 0| 1|
+---+-----+---+--------+
The aggregation part can be generalized like the rest. Improvements and modifications are all in your hands. :)
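Since the answer mentions that built-in functions could be used instead of a udf, here is a rough sketch of one such variant (my own, untested suggestion, assuming the same columnLists dictionary as above); Column.contains gives the substring matching the question asks for:
import pyspark.sql.functions as F

counted = df
for name, keywords in columnLists.items():
    counted = counted.withColumn(
        name,
        # Python's sum over Column expressions builds one big "+" expression:
        # add 1 for every (column, keyword) pair where the column contains the keyword.
        sum(
            F.when(F.col(c).contains(k), 1).otherwise(0)
            for c in ("device", "device_model")
            for k in keywords
        ),
    )

counted.groupBy("id").agg(*[F.sum(name).alias(name) for name in columnLists]).show()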
Given a dataframe :
+-------+-------+
| A | B |
+-------+-------+
| a| 1|
+-------+-------+
| b| 2|
+-------+-------+
| c| 5|
+-------+-------+
| d| 7|
+-------+-------+
| e| 11|
+-------+-------+
I want to assign ranks to the records based on these conditions:
Start the rank at 1.
Assign rank = rank of the previous record if (B of current record - B of previous record) is <= 2.
Increment the rank when (B of current record - B of previous record) is > 2.
So I want the result to be like this:
+-------+-------+------+
| A | B | rank |
+-------+-------+------+
| a| 1| 1|
+-------+-------+------+
| b| 2| 1|
+-------+-------+------+
| c| 5| 2|
+-------+-------+------+
| d| 7| 2|
+-------+-------+------+
| e| 11| 3|
+-------+-------+------+
Inbuilt functions in Spark like rowNumber, rank and dense_rank don't provide any functionality to achieve this.
I tried doing it by using a global variable rank and fetching previous record values using the lag function, but it does not give consistent results due to distributed processing in Spark, unlike in SQL.
One more method I tried was passing lag values of records to a UDF while generating a new column and applying conditions in UDF. But the problem I am facing is I can get lag values for columns A as well as B but not for column rank.
This gives an error, as it cannot resolve the column name rank:
HiveContext.sql("SELECT df.*,LAG(df.rank, 1) OVER (ORDER BY B , 0) AS rank_lag, udfGetVisitNo(B,rank_lag) as rank FROM df")
I cannot get the lag value of a column which I am currently adding.
Also, I don't want methods that require using df.collect(), as this dataframe is quite large and collecting it on a single worker node results in memory errors.
Is there any other method by which I can achieve the same?
I would like a solution with time complexity O(n), n being the number of records.
A SQL solution would be
select a, b, 1 + sum(col) over(order by a) as rnk
from
(
    select t.*
         , case when b - lag(b, 1, b) over(order by a) <= 2 then 0 else 1 end as col
    from t
) x
The solution assumes the ordering is based on column a.
SQL Server example
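For completeness, a PySpark rendering of the same idea might look like the sketch below (my own translation, using the question's column names A and B; note that a window without partitionBy pulls all rows through a single partition):
from pyspark.sql import Window
import pyspark.sql.functions as F

# Flag rows where the gap to the previous B exceeds 2, then take a running sum
# of the flags and add 1 to build the rank.
w = Window.orderBy("A")
running = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)

ranked = (df.withColumn("prev_b", F.lag("B", 1).over(w))
            .withColumn("step", F.when(F.col("B") - F.col("prev_b") > 2, 1).otherwise(0))
            .withColumn("rank", F.sum("step").over(running) + 1)
            .drop("prev_b", "step"))
ranked.show()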