Suppose I have a PySpark dataframe with a number of unique account values, each of which has a different number of entries, like so:
+---------+------+---------+---------+
| account | col1 |    col2 |    col3 |
+---------+------+---------+---------+
|  325235 |   59 |      -6 |  625.64 |
|  325235 |   23 |    -282 |  923.47 |
|  325235 |   77 |-1310.89 | 3603.48 |
|  245623 |  120 |    1.53 | 1985.63 |
|  245623 |  106 |     -12 | 1985.06 |
|  658567 |   84 |     -12 |  194.67 |
+---------+------+---------+---------+
I want to specify a batch size and assign multiple accounts to the same batch based on that size. Let's suppose I choose batch size = 2; then the output should be the following:
+---------+------+---------+---------+--------------+
| account | col1 |    col2 |    col3 | batch_number |
+---------+------+---------+---------+--------------+
|  325235 |   59 |      -6 |  625.64 |            1 |
|  325235 |   23 |    -282 |  923.47 |            1 |
|  325235 |   77 |-1310.89 | 3603.48 |            1 |
|  245623 |  120 |    1.53 | 1985.63 |            1 |
|  245623 |  106 |     -12 | 1985.06 |            1 |
|  658567 |   84 |     -12 |  194.67 |            2 |
+---------+------+---------+---------+--------------+
I can then do a groupby on the batch_number column and have multiple accounts per batch. Here is my working code, but it is too slow since I am doing a toPandas().
import pandas as pd
from pyspark.sql.functions import lit

# Get unique accounts in the source data
accounts = [row.account for row in source_data.select("account").distinct().collect()]
# Find the number of batches; the last batch will have size = remainder
num_batches, remainder = divmod(len(accounts), batchsize)
# Assign a batch number to each account: the sequence 1..num_batches repeated batchsize
# times (so each batch gets batchsize accounts), plus one extra batch for the remainder
batches = [i for _ in range(batchsize) for i in range(1, num_batches + 1)] + [num_batches + 1 for _ in range(remainder)]
# Create a batch dataframe mapping each account to its batch number
batch_df = pd.DataFrame({"account": accounts, "batch_number": batches}, columns=["account", "batch_number"]).set_index("account")
# Add a zero column for batch number to the source data, to be populated below
source_data = source_data.withColumn("batch_number", lit(0))
# Map batch numbers of accounts back into the source data
source_data_p = source_data.toPandas()
for ind in source_data_p.index:
    source_data_p.at[ind, "batch_number"] = batch_df.at[source_data_p.at[ind, "account"], "batch_number"]
# Convert the mapped pandas df back to a Spark df
batched_df = sqlcontext.createDataFrame(source_data_p)
I would ideally like to get rid of the toPandas() call and do the mapping in PySpark. I have seen a few related posts, like this one: How to batch up items from a PySpark DataFrame, but it doesn't fit into the flow of my code, so I would have to re-write the whole project just to implement it.
From what I understand, you can use an indexer from pyspark.ml (or assign an index to each account any other way) and then apply floor division:
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

n = 2  # batch size: number of distinct accounts per batch

# StringIndexer assigns each distinct account a 0-based numeric index;
# floor-dividing that index by n puts every n accounts into the same batch.
idx = StringIndexer(inputCol="account", outputCol="batch_number")
(idx.fit(df).transform(df)
    .withColumn("batch_number", F.floor(F.col("batch_number") / n) + 1)).show()
+-------+----+--------+-------+------------+
|account|col1| col2| col3|batch_number|
+-------+----+--------+-------+------------+
| 325235| 59| -6.0| 625.64| 1|
| 325235| 23| -282.0| 923.47| 1|
| 325235| 77|-1310.89|3603.48| 1|
| 245623| 120| 1.53|1985.63| 1|
| 245623| 106| -12.0|1985.06| 1|
| 658567| 84| -12.0| 194.67| 2|
+-------+----+--------+-------+------------+
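If you would rather not pull in pyspark.ml, a plain window function over the distinct accounts gives you an equivalent index. This is only a sketch using the same column names as above; note that the un-partitioned window collects all distinct accounts onto a single partition, which is usually fine because there are far fewer accounts than rows:
import pyspark.sql.functions as F
from pyspark.sql import Window

n = 2  # batch size: number of distinct accounts per batch

# Number the distinct accounts (1-based), then floor-divide by the batch size.
w = Window.orderBy("account")
account_batches = (
    df.select("account").distinct()
      .withColumn("batch_number", F.floor((F.row_number().over(w) - 1) / n) + 1)
)

# Join the batch numbers back onto the full data.
batched_df = df.join(account_batches, on="account", how="left")
batched_df.show()
You can then group by batch_number exactly as before.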
Suppose I have 5 TB of data with the following schema, and I am using PySpark.
| id | date | Month | KPI_1 | ... | KPI_n |
For 90% of the KPIs, I only need to know the sum/min/max value aggregated to the (id, Month) level. For the remaining 10%, I need to know the first value based on date.
One option is to use a window. For example, I can do
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy("id", "Month").orderBy(F.desc("date"))
# for the 90% kpi
agg_df = df.withColumn("kpi_1", F.sum("kpi_1").over(w))
agg_df = agg_df.withColumn("kpi_2", F.max("kpi_2").over(w))
agg_df = agg_df.withColumn("kpi_3", F.min("kpi_3").over(w))
...
# Select last row for each window to get last accumulated sum for 90% kpis and last value for 10% kpi (which is equivalent to first value if ranked ascending).
# continue process agg_df with filters based on sum/max/min values of 90% KIPs.
But I am not sure how to select the last row of each window. Does anyone have any suggestions, or is there a better way to aggregate?
Let's assume we have this data
+---+----------+-------+-----+-----+
| id| date| month|kpi_1|kpi_2|
+---+----------+-------+-----+-----+
| 1|2000-01-01|2000-01| 1| 100|
| 1|2000-01-02|2000-01| 2| 200|
| 1|2000-01-03|2000-01| 3| 300|
| 1|2000-01-04|2000-01| 4| 400|
| 1|2000-01-05|2000-01| 5| 500|
| 1|2000-02-01|2000-02| 10| 11|
| 1|2000-02-02|2000-02| 20| 21|
| 1|2000-02-03|2000-02| 30| 31|
| 1|2000-02-04|2000-02| 40| 41|
+---+----------+-------+-----+-----+
and we want to calculate the min, max and sum for kpi_1 and get the last value of kpi_2 for each group.
Getting the min, max and sum of kpi_1 can be achieved by grouping the data by id and month. With Spark >= 3.0.0, max_by can be used to get the latest value of kpi_2:
from pyspark.sql import functions as F

df_avg = df \
    .groupBy("id", "month") \
    .agg(F.sum("kpi_1"), F.min("kpi_1"), F.max("kpi_1"), F.expr("max_by(kpi_2, date)"))
df_avg.show()
prints
+---+-------+----------+----------+----------+-------------------+
| id| month|sum(kpi_1)|min(kpi_1)|max(kpi_1)|max_by(kpi_2, date)|
+---+-------+----------+----------+----------+-------------------+
| 1|2000-02| 100| 10| 40| 41|
| 1|2000-01| 15| 1| 5| 500|
+---+-------+----------+----------+----------+-------------------+
For Spark versions < 3.0.0, max_by is not available, so getting the last value of kpi_2 for each group is more difficult. A first idea could be to use the aggregation function first() on a descending-ordered data frame. A simple test gave me the correct result, but unfortunately the documentation states "The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle".
A better approach to get the last value of kpi_2 is to use a window, as shown in the question. The window function row_number() would work:
w = Window.partitionBy("id", "Month").orderBy(F.desc("date"))
df_first = df.withColumn("row_number", F.row_number().over(w)).where("row_number = 1")\
.drop("row_number") \
.select("id", "month", "KPI_2")
df_first.show()
prints
+---+-------+-----+
| id| month|KPI_2|
+---+-------+-----+
| 1|2000-02| 41|
| 1|2000-01| 500|
+---+-------+-----+
Joining the first part (without the max_by column) and the second part gives the desired result:
df_result = df_avg.join(df_first, ['id', 'month'])
df_result.show()
prints
+---+-------+----------+----------+----------+-----+
| id| month|sum(kpi_1)|min(kpi_1)|max(kpi_1)|KPI_2|
+---+-------+----------+----------+----------+-----+
| 1|2000-02| 100| 10| 40| 41|
| 1|2000-01| 15| 1| 5| 500|
+---+-------+----------+----------+----------+-----+
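Another Spark < 3.0.0 option, if you want to avoid the extra window and join, is to emulate max_by by taking the max of a (date, kpi_2) struct, since structs are compared field by field from the left. This is only a sketch over the same data frame:
# The max of the (date, kpi_2) pairs is driven by date, so its kpi_2 field
# is the value belonging to the latest date in each group.
df_result = (
    df.groupBy("id", "month")
      .agg(
          F.sum("kpi_1").alias("sum_kpi_1"),
          F.min("kpi_1").alias("min_kpi_1"),
          F.max("kpi_1").alias("max_kpi_1"),
          F.max(F.struct("date", "kpi_2")).getField("kpi_2").alias("kpi_2"),
      )
)
df_result.show()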
I am trying to create a new column in my df called indexCP using a Window. I want to compute the previous value of indexCP * (current_df['return'] + 1); if there is no previous indexCP, use 100 * (current_df['return'] + 1) instead.
column_list = ["id","secname"]
windowval = (Window.partitionBy(column_list).orderBy(col('calendarday').cast("timestamp").cast("long")).rangeBetween(Window.unboundedPreceding, 0))
spark_df = spark_df.withColumn('indexCP', when(spark_df["PreviousYearUnique"] == spark_df["yearUnique"], 100 * (current_df['return']+1)).otherwise(last('indexCP').over(windowval) * (current_df['return']+1)))
When I run the above code I get the error "AnalysisException: cannot resolve 'indexCP' given input columns:", which I believe means you can't reference a column that has not been created yet, but I am unsure how to fix it.
Starting Data Frame
## +---+-----------+-------+-------+
## | id|calendarday|secName| return|
## +---+-----------+-------+-------+
## |  1| 2015-01-01|      1| 0.0076|
## |  1| 2015-01-02|      1| 0.0026|
## |  1| 2015-01-01|      2| 0.0016|
## |  1| 2015-01-02|      2| 0.0006|
## |  2| 2015-01-01|      3| 0.0012|
## |  2| 2015-01-02|      3| 0.0014|
## +---+-----------+-------+-------+
New Data Frame with IndexCP added
## +---+-----------+-------+-------+-----------+
## | id|calendarday|secName| return|    IndexCP|
## +---+-----------+-------+-------+-----------+
## |  1| 2015-01-01|      1| 0.0076|     100.76| (1st: 100*(return+1))
## |  1| 2015-01-02|      1| 0.0026| 101.021976| (2nd: 100.76*(return+1))
## |  2| 2015-01-01|      2| 0.0016|     100.16| (1st: 100*(return+1))
## |  2| 2015-01-02|      2| 0.0006| 100.220096| (2nd: 100.16*(return+1))
## |  3| 2015-01-01|      3| 0.0012|     100.12| (1st: 100*(return+1))
## |  3| 2015-01-02|      3| 0.0014| 100.260168| (2nd: 100.12*(return+1))
## +---+-----------+-------+-------+-----------+
EDIT: This should be the final answer; I've extended it by another row for the secName column.
What you're looking for is a rolling product function using your formula of IndexCP * (current_return + 1).
First you need to collect all existing returns per window into an ArrayType column and then reduce over it. This can be done with the Spark SQL aggregate function, like so:
from pyspark.sql import Window
from pyspark.sql import functions as f

column_list = ["id", "secname"]
windowval = (
    Window.partitionBy(column_list)
    .orderBy(f.col('calendarday').cast("timestamp"))
    .rangeBetween(Window.unboundedPreceding, 0)
)
df1.show()
+---+-----------+-------+------+
| id|calendarday|secName|return|
+---+-----------+-------+------+
| 1| 2015-01-01| 1|0.0076|
| 1| 2015-01-02| 1|0.0026|
| 1| 2015-01-03| 1|0.0014|
| 2| 2015-01-01| 2|0.0016|
| 2| 2015-01-02| 2|6.0E-4|
| 2| 2015-01-03| 2| 0.0|
| 3| 2015-01-01| 3|0.0012|
| 3| 2015-01-02| 3|0.0014|
+---+-----------+-------+------+
# f.collect_list(...) gets all your returns - this must be windowed
# cast(1 as double) is your base of 1 to begin with
# (acc, x) -> acc * (1 + x) is your formula translated to Spark SQL
# where acc is the accumulated value and x is the incoming value
df1.withColumn(
"rolling_returns",
f.collect_list("return").over(windowval)
).withColumn("IndexCP",
100 * f.expr("""
aggregate(
rolling_returns,
cast(1 as double),
(acc, x) -> acc * (1+x))
""")
).orderBy("id", "calendarday").show(truncate=False)
+---+-----------+-------+------+------------------------+------------------+
|id |calendarday|secName|return|rolling_returns |IndexCP |
+---+-----------+-------+------+------------------------+------------------+
|1 |2015-01-01 |1 |0.0076|[0.0076] |100.76 |
|1 |2015-01-02 |1 |0.0026|[0.0076, 0.0026] |101.021976 |
|1 |2015-01-03 |1 |0.0014|[0.0076, 0.0026, 0.0014]|101.16340676640002|
|2 |2015-01-01 |2 |0.0016|[0.0016] |100.16000000000001|
|2 |2015-01-02 |2 |6.0E-4|[0.0016, 6.0E-4] |100.220096 |
|2 |2015-01-03 |2 |0.0 |[0.0016, 6.0E-4, 0.0] |100.220096 |
|3 |2015-01-01 |3 |0.0012|[0.0012] |100.12 |
|3 |2015-01-02 |3 |0.0014|[0.0012, 0.0014] |100.26016800000002|
+---+-----------+-------+------+------------------------+------------------+
Explanation: The starting value must be 1 and the multiplier of 100 must be on the outside of the expression, otherwise you indeed start drifting by a factor of 100 above expected returns.
I have verified the values now adhere to your formula, for instance for secName == 1 and id == 1:
100 * ((1 + 0.0076) * (1 + 0.0026) * (1 + 0.0014)) = 101.1634067664
Which is indeed correct according to the formula (acc, x) -> acc * (1+x). Hope this helps!
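As an aside, if you would rather not materialize the array of returns on every row, an equivalent rolling product can be written as the exponential of a windowed sum of logarithms. This is just a sketch over the same windowval, assuming every return is greater than -1 so the logarithm is defined:
# log turns the running product into a running sum, which Spark can compute
# directly over the window; exp converts it back.
df1.withColumn(
    "IndexCP",
    100 * f.exp(f.sum(f.log(f.col("return") + 1)).over(windowval))
).orderBy("id", "calendarday").show(truncate=False)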
I have a data frame that looks like
+-------+-------+
| Code1 | Code2 |
+-------+-------+
| A | 1 |
| B | 1 |
| A | 2 |
| B | 2 |
| C | 2 |
| D | 2 |
| D | 3 |
| F | 3 |
| G | 3 |
+-------+-------+
I then want to apply a unique set of filters like so:
Scenario 1 -> filter on Code1 IN (A,B)
Scenario 2 -> filter on Code1 IN (A,D) and Code2 IN (2,3)
Scenario 3 -> filter on Code2 = 2
The result of applying the filter should be a data frame that looks like:
+-------+-------+----------+
| Code1 | Code2 | Scenario |
+-------+-------+----------+
| A | 1 | 1 |
| B | 1 | 1 |
| A | 2 | 1 |
| B | 2 | 1 |
| A | 2 | 2 |
| D | 2 | 2 |
| D | 3 | 2 |
| A | 2 | 3 |
| B | 2 | 3 |
| C | 2 | 3 |
| D | 2 | 3 |
+-------+-------+----------+
QUESTION: What is the most efficient way to do this with spark via python?
I am new to Spark, so I am really asking from a conceptual level and don't need an explicit solution. I am aiming to achieve as much parallelism as possible in the operation. My real-life example involves an initial data frame with 38 columns that is on the order of 100 MB to a couple of GB as a CSV file, and I typically have at most 100-150 scenarios.
The original design of the solution was to process each scenario filter sequentially and union the resulting filtered data frames together, but I feel like that negates the whole point of using spark.
EDIT: Does it though? For each scenario, I would filter and then union, which are both transformations (lazy eval). Would the eventual execution plan be smart enough to automatically parallelize the multiple unique filters?
Isn't there a way we can apply the filters in parallel, e.g., apply scenario filter 1 at the same time as applying filters 2 and 3? Would we have to "blow up" the initial dataframe N times, where N = # of scenario filters, append a Scenario # column to the new data frame, and apply one big filter that looks something like:
WHERE (Scenario = 1 AND Code1 IN (A,B)) OR
(Scenario = 2 AND Code1 IN (A,D) AND Code2 IN (2,3)) OR
(Scenario = 3 AND Code2 = 2)
And if that does end up being the most efficient way, isn't it also dependent on how much memory the "blown up" data frame takes? If the "blown up" data frame takes up more memory than what my cluster has, am I going to have to process only as many scenarios as can fit in memory?
You can apply all filters at once (the snippets below are Scala, but they translate directly to PySpark):
data.withColumn("scenario",
when('code1.isin("A", "B"), 1).otherwise(
when('code1.isin("A", "D") && 'code2.isin("2","3"), 2).otherwise(
when('code2==="2",3)
)
)
).show()
But you have one more problem: for example, the value (A, 2) could fall into all of your scenarios 1, 2 and 3. In that case you could try something like this:
data.withColumn("s1", when('code1.isin("A", "B"), 1).otherwise(0))
.withColumn("s2",when('code1.isin("A", "D") && 'code2.isin("2","3"), 1).otherwise(0))
.withColumn("s3",when('code2==="2",1).otherwise(0))
.show()
output:
+-----+-----+---+---+---+
|code1|code2| s1| s2| s3|
+-----+-----+---+---+---+
| A| 1| 1| 0| 0|
| B| 1| 1| 0| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| A| 2| 1| 1| 1|
| D| 2| 0| 1| 1|
| D| 3| 0| 1| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| C| 2| 0| 0| 1|
| D| 2| 0| 1| 1|
+-----+-----+---+---+---+
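A PySpark translation of the flag-column snippet might look like the following sketch. It assumes the column names from the question and that Code2 is numeric; compare against strings instead if your schema stores it as a string:
from pyspark.sql import functions as F

# One indicator column per scenario, so a row can belong to several scenarios at once.
flagged = (
    data.withColumn("s1", F.when(F.col("Code1").isin("A", "B"), 1).otherwise(0))
        .withColumn("s2", F.when(F.col("Code1").isin("A", "D") & F.col("Code2").isin(2, 3), 1).otherwise(0))
        .withColumn("s3", F.when(F.col("Code2") == 2, 1).otherwise(0))
)
flagged.show()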
In the EDIT to my question, I questioned whether the lazy evaluation is the key to the problem. After doing some research in the Spark UI, I have concluded that even though my original solution looks like it is applying transformations (filter then union) sequentially for each scenario, it is actually applying all the transformations simultaneously, once an action (e.g., dataframe.count()) is called. The screenshot here represents the Event Timeline from the transformation phase of the dataframe.count() job.
The job includes 96 scenarios each with a unique filter on the original data frame. You can see that my local machine is running 8 tasks simultaneously, where each task represents a filter from one of the scenarios.
In conclusion, Spark takes care of optimizing the filters to run in parallel, once an action has been called on the resulting dataframe.
I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find the column in which the value of Y appears as the first element. So, ideally, I want to retrieve a list like [Name2, Name1, Name2].
I am not sure how (or whether it would work) to first convert to an RDD, then use a map function, and then convert the result back to a DataFrame.
Any ideas are welcome.
You can probably try this piece of code:
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F

name_cols = ["Name1", "Name2", "Name3"]

# Build one chained when() expression: for each name column, check whether the part
# before the comma equals Y and, if so, return that column's name.
cond = F
for col in name_cols:
    cond = cond.when(F.split(F.col(col), ',').getItem(0) == F.col("Y"), col)

df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+
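Rows where Y does not match the first element of any of the name columns end up with a null whichName; if you prefer an explicit default, you can finish the chain with .otherwise (a small sketch, the "no_match" label being arbitrary):
df.withColumn("whichName", cond.otherwise("no_match")).show()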
I have a data frame in pyspark like below.
+---+-------------+------------+
| id| device| model|
+---+-------------+------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 3| cctv| cctv|
| 2| iphone|apple iphone|
| 3| spy camera| |
+---+-------------+------------+
I want to create a column by concatenating the unique values in the device and model columns for each id.
I have done it like below.
First, I concatenated both the device and model columns:
from pyspark.sql.functions import col, concat, lit

df1 = df.select(col("id"), concat(col("model"), lit(","), col("device")).alias('con'))
+---+--------------------+
| id| con|
+---+--------------------+
| 3| mac,mac pro|
| 1| iphone5,iphone|
| 1|android,android p...|
| 1| windows,windows pc|
| 1|spy camera,spy ca...|
| 2| camera,|
| 3| cctv,cctv|
| 2| apple iphone,iphone|
| 3| ,spy camera|
+---+--------------------+
Then I did a groupBy on id:
df2 = df1.groupBy("id").agg(f.concat_ws(",", f.collect_set(df1.con)).alias('Group_con'))
+---+-----------------------------------------------------------------------------+
| id| Group_con|
+---+-----------------------------------------------------------------------------+
| 1|iphone5,iphone,android,android phone,windows,windows pc,spy camera,spy camera|
| 2| camera,,apple iphone,iphone|
| 3| mac,mac pro,cctv,cctv,,spy camera|
+---+-----------------------------------------------------------------------------+
But I am getting duplicate values in the result. How can I avoid populating duplicate values in the final data frame?
Use F.array(), F.explode() and F.collect_set():
from pyspark.sql import functions as F
df.withColumn('con', F.explode(F.array('device', 'model'))) \
.groupby('id').agg(F.collect_set('con').alias('Group_con')) \
.show(3,0)
# +---+--------------------------------------------------------------------------+
# |id |Group_con |
# +---+--------------------------------------------------------------------------+
# |3 |[cctv, mac pro, spy camera, mac] |
# |1 |[windows pc, iphone5, windows, iphone, android phone, spy camera, android]|
# |2 |[apple iphone, camera, iphone] |
# +---+--------------------------------------------------------------------------+
(tested on spark version 2.2.1)
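If you need the result as a comma-separated string like your Group_con column, and also want to drop the blank device/model values, you can filter before collecting and wrap the set with concat_ws. A sketch building on the same approach:
from pyspark.sql import functions as F

(df.withColumn('con', F.explode(F.array('device', 'model')))
   .filter(F.trim(F.col('con')) != '')   # drop empty device/model entries
   .groupby('id')
   .agg(F.concat_ws(',', F.collect_set('con')).alias('Group_con'))
   .show(truncate=False))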
You can remove the duplicates by using collect_set and a udf function, as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t

def uniqueStringUdf(device, model):
    # device and model arrive as the two collected lists
    return ','.join(set(filter(bool, device + model)))

uniqueStringUdfCall = f.udf(uniqueStringUdf, t.StringType())

df.groupBy("id") \
  .agg(uniqueStringUdfCall(f.collect_set("device"), f.collect_set("model")).alias("con")) \
  .show(truncate=False)
which should give you
+---+------------------------------------------------------------------+
|id |con |
+---+------------------------------------------------------------------+
|3 |spy camera,mac,mac pro,cctv |
|1 |spy camera,windows,iphone5,windows pc,iphone,android phone,android|
|2  |camera,iphone,apple iphone                                        |
+---+------------------------------------------------------------------+
where,
device + model concatenates the two collected lists,
filter(bool, device + model) removes the empty strings from the concatenated list,
set(filter(bool, device + model)) removes the duplicate strings from the concatenated list, and
','.join(set(filter(bool, device + model))) joins all the remaining elements into a comma-separated string.
I hope the answer is helpful
Not sure if this is going to be very helpful, but one solution I could think of is to check for the duplicate values in the column and then delete them using their position/index.
Or
Split all the values at the comma "," into a list and remove the duplicates by comparing each value. Or count() the occurrences of each value; if it is more than 1, delete all the duplicates other than the first one.
Sorry if this wasn't helpful. These are the two ways I could think of to solve your problem.