Choosing random items from a Spark GroupedData Object - python

I'm new to using Spark in Python and have been unable to solve this problem: After running groupBy on a pyspark.sql.dataframe.DataFrame
df = sqlContext.read.json("data.json")
how can you choose N random samples from each resulting group (grouped by teamId) without replacement?
I'm basically trying to choose N random users from each team. Maybe using groupBy is the wrong way to start?

Well, it is kind of wrong. GroupedData is not really designed for data access. It just describes grouping criteria and provides aggregation methods. See my answer to Using groupBy in Spark and getting back to a DataFrame for more details.
Another problem with this idea is selecting N random samples. It is a task which is really hard to achieve in parallel without physical grouping of the data, and that is not something that happens when you call groupBy on a DataFrame.
There are at least two ways to handle this:
convert to RDD, groupBy and perform local sampling
import random
n = 3
def sample(iter, n):
    rs = random.Random()  # We should probably use os.urandom as a seed
    items = list(iter)
    # Cap at the group size: random.sample raises ValueError if n > len(items)
    return rs.sample(items, min(n, len(items)))
df = sqlContext.createDataFrame(
    [(x, y, random.random()) for x in (1, 2, 3) for y in "abcdefghi"],
    ("teamId", "x1", "x2"))
grouped = df.rdd.map(lambda row: (row.teamId, row)).groupByKey()
sampled = sqlContext.createDataFrame(
    grouped.flatMap(lambda kv: sample(kv[1], n)))
## +------+---+-------------------+
## |teamId| x1| x2|
## +------+---+-------------------+
## | 1| g| 0.81921738561455|
## | 1| f| 0.8563875814036598|
## | 1| a| 0.9010425238735935|
## | 2| c| 0.3864428179837973|
## | 2| g|0.06233470405822805|
## | 2| d|0.37620872770129155|
## | 3| f| 0.7518901502732027|
## | 3| e| 0.5142305439671874|
## | 3| d| 0.6250620479303716|
## +------+---+-------------------+
use window functions
from pyspark.sql import Window
# Note: rowNumber was renamed row_number in later Spark releases
from pyspark.sql.functions import col, rand, rowNumber
w = Window.partitionBy(col("teamId")).orderBy(col("rnd_"))
sampled = (df
    .withColumn("rnd_", rand())  # Add random numbers column
    .withColumn("rn_", rowNumber().over(w))  # Add rowNumber over window
    .where(col("rn_") <= n)  # Take n observations
    .drop("rn_")  # drop helper columns
    .drop("rnd_"))
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 1| i| 0.8173912535268248|
## | 2| h| 0.10862995810038856|
## | 2| c| 0.3864428179837973|
## | 2| a| 0.6695356657072442|
## | 3| b|0.012329360826023095|
## | 3| a| 0.6450777858109182|
## | 3| e| 0.5142305439671874|
## +------+---+--------------------+
However, I am afraid both will be rather expensive. If the size of the individual groups is balanced and relatively large, I would simply use DataFrame.randomSplit.
If the number of groups is relatively small, it is possible to try something else:
from pyspark.sql.functions import col, count, rand, udf
from pyspark.sql.types import BooleanType
from operator import truediv
counts = (df
    .groupBy(col("teamId"))
    .agg(count("*").alias("n"))
    .rdd.map(lambda r: (r.teamId, r.n))
    .collectAsMap())
# This defines fraction of observations from a group which should
# be taken to get n values
counts_bd = sc.broadcast({k: truediv(n, v) for (k, v) in counts.items()})
to_take = udf(lambda k, rnd: rnd <= counts_bd.value.get(k), BooleanType())
sampled = (df
    .withColumn("rnd_", rand())
    .where(to_take(col("teamId"), col("rnd_")))
    .drop("rnd_"))
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| d| 0.14815204548854788|
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 2| a| 0.6695356657072442|
## | 2| d| 0.37620872770129155|
## | 2| g| 0.06233470405822805|
## | 3| b|0.012329360826023095|
## | 3| h| 0.9022527556458557|
## +------+---+--------------------+
In Spark 1.5+ you can replace udf with a call to sampleBy method:
df.sampleBy("teamId", counts_bd.value)
It won't give you an exact number of observations, but it should be good enough most of the time, as long as the number of observations per group is large enough to get proper samples. You can also use sampleByKey on an RDD in a similar way.
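To make the "not exact" caveat concrete, here is a minimal plain-Python sketch of the same fraction-based idea, with no Spark involved. The helper name `sample_by_fraction` and the `min(1.0, ...)` cap are assumptions added for illustration; they are not part of the answer above.

```python
import random
from collections import Counter

def sample_by_fraction(rows, key, n, seed=None):
    """Keep each row independently with probability n / (size of its group)."""
    rng = random.Random(seed)
    counts = Counter(key(row) for row in rows)
    # Fraction of each group to keep, capped at 1.0 for groups smaller than n
    fractions = {k: min(1.0, n / c) for k, c in counts.items()}
    sampled = [row for row in rows if rng.random() < fractions[key(row)]]
    return fractions, sampled

rows = [(team, letter) for team in (1, 2, 3) for letter in "abcdefghi"]
fractions, sampled = sample_by_fraction(rows, key=lambda r: r[0], n=3, seed=42)
per_team = Counter(team for team, _ in sampled)
print(fractions)       # every team gets the fraction 3/9
print(dict(per_team))  # counts per team are close to 3 on average, not guaranteed to be 3
```

Because each row is kept independently, the per-group count only equals n on average, which is why this approach suits large groups better than small ones.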

I found this approach more DataFrame-idiomatic than going the RDD route.
You can use a window function to create a ranking within each group, where the ranking is random to suit your case. Then you can filter on the number of samples (N) you want for each group.
from pyspark.sql import Window, functions as F
window_1 = Window.partitionBy(data['teamId']).orderBy(F.rand())
data_1 = data.select('*', F.rank().over(window_1).alias('rank')).filter(F.col('rank') <= N).drop('rank')
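The logic behind the random-ranking approach can be sketched in plain Python, without Spark: give every row a random sort key, order each group by that key, and keep the first N rows. The function name `top_n_random_per_group` and the sample rows are hypothetical, chosen only for this illustration.

```python
import random
from collections import Counter, defaultdict

def top_n_random_per_group(rows, key, n, seed=None):
    """Return up to n rows per group, chosen without replacement."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        # The random number plays the role of the random ordering column
        groups[key(row)].append((rng.random(), row))
    result = []
    for members in groups.values():
        members.sort(key=lambda pair: pair[0])              # order within the group
        result.extend(row for _, row in members[:n])        # keep the first n
    return result

rows = [(team, letter) for team in (1, 2, 3) for letter in "abcdefghi"]
rows += [(4, "a"), (4, "b")]  # a group smaller than n
picked = top_n_random_per_group(rows, key=lambda r: r[0], n=3, seed=7)
picked_per_team = Counter(team for team, _ in picked)
print(dict(picked_per_team))
```

One design note: a rank-style function assigns the same rank to tied rows, so a filter on rank can in principle return more than N rows for a group, whereas a row-number-style function cannot. With continuous random keys ties are very unlikely, so the difference rarely matters in practice.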

Here's an alternative using the pandas DataFrame.sample method. It uses Spark's applyInPandas method, available from Spark 3.0.0, to distribute the groups. This allows you to select an exact number of rows per group.
I've added args and kwargs to the function so you can access the other arguments of DataFrame.sample.
def sample_n_per_group(n, *args, **kwargs):
    def sample_per_group(pdf):
        return pdf.sample(n, *args, **kwargs)
    return sample_per_group
df = spark.createDataFrame(
    [
        (1, 1.0),
        (1, 2.0),
        (2, 3.0),
        (2, 5.0),
        (2, 10.0)
    ],
    ("id", "v")
)
(df.groupBy("id")
   .applyInPandas(
        sample_n_per_group(2, random_state=2),
        schema=df.schema
   )
)
Be aware of the limitations for very large groups. From the documentation:
This function requires a full shuffle. All the data of a group will be
loaded into memory, so the user should be aware of the potential OOM
risk if data is skewed and certain groups are too large to fit in memory.
See also here:
How take a random row from a PySpark DataFrame?


Pyspark use partition or groupby with agg and datediff

I'm new to Pyspark.
I would like to find the products not seen after 10 days from the first day they entered the store, and create a column in the DataFrame that is set to 1 for these products and 0 for the rest.
First I need to group the data by product_id and find the maximum of seen_date, then calculate the difference between import_date and max(seen_date) within each group. Finally, I create a new column based on the value of date_diff in each group.
The following is the code I used to first get the difference between import_date and seen_date, but it raises an error:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = (Window()
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("date_diff", F.datediff(F.max(F.from_unixtime(F.col("import_date")).over(w)), F.from_unixtime(F.col("seen_date"))))
AnalysisException: It is not allowed to use a window function inside an aggregate function. Please use the inner window function in a sub-query.
This is the rest of my code to define a new column based on the date_diff:
not_seen = udf(lambda x: 0 if x >10 else 1, IntegerType())
df = df.withColumn('not_seen', not_seen("date_diff"))
Q: Can someone provide a fix for this code or a better approach to solve this problem?
Sample data generation:
columns = ["product_id", "import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
        ("123", "2014-05-06", "2014-06-11"),
        ("125", "2015-01-02", "2015-01-03"),
        ("125", "2015-01-02", "2015-01-04"),
        ("128", "2015-08-06", "2015-08-25")]
dfFromData2 = spark.createDataFrame(data).toDF(*columns)
dfFromData2 = dfFromData2.withColumn("import_date", F.unix_timestamp(F.col("import_date"), 'yyyy-MM-dd'))
dfFromData2 = dfFromData2.withColumn("seen_date", F.unix_timestamp(F.col("seen_date"), 'yyyy-MM-dd'))
+----------+-----------+----------+
|product_id|import_date| seen_date|
+----------+-----------+----------+
|       123| 1399334400|1399420800|
|       123| 1399334400|1402444800|
|       125| 1420156800|1420243200|
|       125| 1420156800|1420329600|
|       128| 1438819200|1440460800|
+----------+-----------+----------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
columns = ["product_id", "import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
        ("123", "2014-05-06", "2014-06-11"),
        ("125", "2015-01-02", "2015-01-03"),
        ("125", "2015-01-02", "2015-01-04"),
        ("128", "2015-08-06", "2015-08-25")]
df = spark.createDataFrame(data).toDF(*columns)
df = df.withColumn("import_date", F.to_date(F.col("import_date"), 'yyyy-MM-dd'))
df = df.withColumn("seen_date", F.to_date(F.col("seen_date"), 'yyyy-MM-dd'))
# Partition by product so that the max is computed per product_id
w = (Window()
     .partitionBy('product_id')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
(df
    .withColumn('max_import_date', F.max(F.col("import_date")).over(w))
    .withColumn("date_diff", F.datediff(F.col('seen_date'), F.col('max_import_date')))
    .withColumn('not_seen', F.when(F.col('date_diff') > 10, 0).otherwise(1))
    .show())
+----------+-----------+----------+---------------+---------+--------+
|product_id|import_date| seen_date|max_import_date|date_diff|not_seen|
+----------+-----------+----------+---------------+---------+--------+
|       123| 2014-05-06|2014-05-07|     2014-05-06|        1|       1|
|       123| 2014-05-06|2014-06-11|     2014-05-06|       36|       0|
|       125| 2015-01-02|2015-01-03|     2015-01-02|        1|       1|
|       125| 2015-01-02|2015-01-04|     2015-01-02|        2|       1|
|       128| 2015-08-06|2015-08-25|     2015-08-06|       19|       0|
+----------+-----------+----------+---------------+---------+--------+
You can use the max window function to extract the max date.
dfFromData2 = dfFromData2.withColumn(
    'not_seen',
    F.expr('if(datediff(max(from_unixtime(seen_date)) over (partition by product_id), from_unixtime(import_date)) > 10, 1, 0)')
)
# +----------+-----------+----------+--------+
# |product_id|import_date|seen_date |not_seen|
# +----------+-----------+----------+--------+
# |125 |1420128000 |1420214400|0 |
# |125 |1420128000 |1420300800|0 |
# |123 |1399305600 |1399392000|1 |
# |123 |1399305600 |1402416000|1 |
# |128 |1438790400 |1440432000|1 |
# +----------+-----------+----------+--------+
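As a cross-check of the per-product logic, here is a small plain-Python sketch that mirrors the expression above: take the latest seen_date per product, subtract import_date, and flag the product when the gap exceeds 10 days. The function name `flag_by_last_seen` is hypothetical; the rows are the sample data from the question.

```python
from datetime import date

rows = [
    ("123", date(2014, 5, 6), date(2014, 5, 7)),
    ("123", date(2014, 5, 6), date(2014, 6, 11)),
    ("125", date(2015, 1, 2), date(2015, 1, 3)),
    ("125", date(2015, 1, 2), date(2015, 1, 4)),
    ("128", date(2015, 8, 6), date(2015, 8, 25)),
]

def flag_by_last_seen(rows, threshold_days=10):
    """Append 1 to every row of a product whose latest seen_date is more
    than threshold_days after its import_date, else 0."""
    last_seen = {}
    for product_id, _, seen in rows:
        # Equivalent of max(seen_date) over (partition by product_id)
        last_seen[product_id] = max(seen, last_seen.get(product_id, seen))
    return [
        (pid, imported, seen,
         1 if (last_seen[pid] - imported).days > threshold_days else 0)
        for pid, imported, seen in rows
    ]

flags = {pid: flag for pid, _, _, flag in flag_by_last_seen(rows)}
print(flags)
```

Note that the flag is the same for every row of a product, because it depends only on the group-level maximum.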

How to remove elements with UDF function and Pandas instead of using for loop Python

I have a problem ... how do I make such a for loop as a UDF function?
import cld3
ind_err = []
cnt = 0
cnt_NOT = 0
for index, row in pandasDF.iterrows():
    lan, probability, is_reliable, proportion = cld3.get_language(row["content"])
    if (lan != 'en'):
        cnt_NOT += 1
        ind_err.append(index)
    elif(lan == 'en' and probability < 0.85):
        cnt += 1
        ind_err.append(index)
pandasDF = pandasDF.drop(labels=ind_err, axis=0)
This loop iterates over every row of the pandas DataFrame and uses cld3 to check which rows are in English and which are not, in order to clean up the data. It saves the indexes of the rows to remove in a list and then deletes them with .drop(labels=ind_err, axis=0).
This is the data that I have:
| content|score|
| what sapp| 1|
| right| 5|
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
This is the data that I would remove:
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
And this is the dataframe that I would like to have:
| content|score|
| what sapp| 1|
| right| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
The problem with this cycle is that it is very slow since there are 1,119,778 rows.
I know PySpark's withColumn is much faster, but I honestly can't figure out how to select the row to delete and get it deleted.
How can I turn that for loop into a function and make language detection a lot faster?
My environment is Google Colab.
Many thanks in advance!!
You can probably do something like this:
import cld3
from pyspark.sql import functions as F, types as T
# assuming df is your dataframe
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False
Actually, I do not really understand why you want to go through Spark for that. Using pandas properly should solve your problem:
# I used your example, so I only have partial text...
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False
8 unable to update ...
# That's the only line from your truncated example that matches your criteria
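The key change from the loop in the question is to express the rule as a predicate and filter in one pass, instead of collecting indexes and dropping them afterwards. Below is a minimal plain-Python sketch of that pattern. `fake_get_language` is a hypothetical stand-in for the detector so that the example runs without cld3 installed; it only mimics the four-value result that the code above unpacks, and its word list is invented. With pandas, the same predicate could presumably be passed to `Series.apply` to build a boolean mask.

```python
def fake_get_language(text):
    """Hypothetical stand-in for the real detector.
    Returns (language, probability, is_reliable, proportion)."""
    english_words = {"right", "excellent", "thank", "you", "so", "frustrating"}
    words = text.lower().split()
    hits = sum(word in english_words for word in words)
    probability = hits / len(words) if words else 0.0
    return ("en" if hits else "und", probability, True, 1.0)

def is_english(text, detect=fake_get_language, min_probability=0.85):
    """Predicate: keep a row only if it is English with enough confidence."""
    lan, probability, _is_reliable, _proportion = detect(text)
    return lan == "en" and probability >= min_probability

rows = [
    {"content": "right", "score": 5},
    {"content": "ciao mamma", "score": 1},
    {"content": "hola amigos", "score": 5},
    {"content": "excellent thank you", "score": 5},
    {"content": "so frustrating whatsapp", "score": 1},  # "en" but low probability
]

# Single pass: keep the rows that satisfy the predicate
kept = [row for row in rows if is_english(row["content"])]
print([row["content"] for row in kept])
```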

How to modify a column for a join in spark dataframe when the join key are given as a list?

I have been trying to join two dataframes, df_1 and df_2, using a list of join keys, and I want to add the ability to join on a subset of the keys when one of the key values is null.
data1 = [[1,'2018-07-31',215,'a'],
df_1 = sqlCtx.createDataFrame(data1,
data2 = [[1,'2018-07-31',215,'aaa'],
df_2 = sqlCtx.createDataFrame(data2,
The code I use to join is this:
key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')
The output for the same is:
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b|null|
My concern here is, in the case where account_id is null, the join should still work by comparing other 2 keys.
The required output should be like this:
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b| bbb|
I have found a similar approach to do so by using the statement:
join_elem = "df_1.application_number ==
df_2.application_number|df_1.application_dt ==
df_2.application_dt|F.coalesce(df_1.account_id,F.lit(0)) ==
join_elem_column = [eval(x) for x in join_elem]
But the design considerations do not allow me to use a full join expression, and I am stuck with using the list of column names as the join key.
I have been trying to find a way to accommodate this coalesce thing into this list itself but have not found any success so far.
I would call this solution a workaround.
The issue here is that we have a Null value for one of the keys in the DataFrame, and the OP wants the rest of the key columns to be used instead. Why not assign an arbitrary value to this Null and then apply the join? Effectively, this is the same as joining on the remaining two keys.
from pyspark.sql.functions import col, when
# Let's replace Null with an arbitrary value which has
# little chance of occurring in the dataset, e.g. -100000
df1 = df1.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
df2 = df2.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
# Do a FULL Join
df = df1.join(df2, ['application_number', 'application_dt', 'account_id'], 'full')
# Replace the arbitrary value back with Null.
df = df.withColumn('account_id', when(col('account_id') == -100000, None).otherwise(col('account_id')))
| 1| 2018-07-31| 215| a| aaa|
| 2| 2018-07-30| null| b| bbb|
| 3| 2017-10-28| 201| c| ccc|
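Why the sentinel helps can be shown with a small plain-Python sketch that imitates SQL join semantics, where a null key never equals anything, including another null. The helpers `left_join` and `replace_at` are hypothetical, and the rows are reconstructed from the output tables in the question; that the second table also has a null account_id for application 2 is an assumption.

```python
SENTINEL = -100000  # arbitrary value assumed never to occur as a real account_id

def left_join(left, right, n_keys):
    """Left join on the first n_keys fields; a None key never matches (SQL-like)."""
    index = {}
    for row in right:
        key = row[:n_keys]
        if None not in key:
            index.setdefault(key, []).append(row[n_keys:])
    joined = []
    for row in left:
        key = row[:n_keys]
        matches = [] if None in key else index.get(key, [])
        if matches:
            joined.extend(row + extra for extra in matches)
        else:
            joined.append(row + (None,))  # no match: right-hand column is null
    return joined

def replace_at(rows, position, old, new):
    """Replace the value `old` with `new` in the given column position."""
    return [tuple(new if i == position and v == old else v
                  for i, v in enumerate(row)) for row in rows]

df_1_rows = [(1, "2018-07-31", 215, "a"), (2, "2018-07-30", None, "b"), (3, "2017-10-28", 201, "c")]
df_2_rows = [(1, "2018-07-31", 215, "aaa"), (2, "2018-07-30", None, "bbb"), (3, "2017-10-28", 201, "ccc")]

plain = left_join(df_1_rows, df_2_rows, n_keys=3)
patched = left_join(replace_at(df_1_rows, 2, None, SENTINEL),
                    replace_at(df_2_rows, 2, None, SENTINEL), n_keys=3)
restored = replace_at(patched, 2, SENTINEL, None)
print(plain[1])     # null key: no match
print(restored[1])  # sentinel key: matched, then restored to null
```

The workaround only treats null as equal to null; a row whose account_id is null on one side and set on the other would still not match.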

How to add my own function as a custom stage in a ML pyspark Pipeline? [duplicate]

The sample code from Florian
|ball_column|keep_the |hall_column|
| 0| 7| 14|
| 1| 8| 15|
| 2| 9| 16|
| 3| 10| 17|
| 4| 11| 18|
| 5| 12| 19|
| 6| 13| 20|
The first part of the code drops the columns whose names contain a word from the banned list:
#first part of the code
banned_list = ["ball","fall","hall"]
condition = lambda col: any(word in col for word in banned_list)
new_df = df.drop(*filter(condition, df.columns))
So the above piece of code should drop the ball_column and hall_column.
The second part of the code buckets specific columns in the list. For this example, we will bucket the only one remaining, keep_column.
bagging =
splits=[-float("inf"), 10, 100, float("inf")],
Now bagging the columns using pipeline was as follows
model = Pipeline(stages=bagging).fit(df)
bucketedData = model.transform(df)
How can I add the first block of the code (banned list, condition, new_df) to the ml pipeline as a stage?
I believe this does what you want. You can create a custom Transformer, and add that to the stages in the Pipeline. Note that I slightly changed your functions because we do not have access to all variables you mentioned, but the concept remains the same.
Hope this helps!
import pyspark.sql.functions as F
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import Bucketizer
from pyspark.sql import DataFrame
from typing import Iterable
import pandas as pd
# CUSTOM TRANSFORMER ----------------------------------------------------------------
class ColumnDropper(Transformer):
    """
    A custom Transformer which drops all columns that have at least one of the
    words from the banned_list in the name.
    """
    def __init__(self, banned_list: Iterable[str]):
        super(ColumnDropper, self).__init__()
        self.banned_list = banned_list
    def _transform(self, df: DataFrame) -> DataFrame:
        df = df.drop(*[x for x in df.columns if any(y in x for y in self.banned_list)])
        return df
# SAMPLE DATA -----------------------------------------------------------------------
df = pd.DataFrame({'ball_column': [0,1,2,3,4,5,6],
                   'keep_the': [6,5,4,3,2,1,0],
                   'hall_column': [2,2,2,2,2,2,2] })
df = spark.createDataFrame(df)
# EXAMPLE 1: USE THE TRANSFORMER WITHOUT PIPELINE -----------------------------------
column_dropper = ColumnDropper(banned_list = ["ball","fall","hall"])
df_example = column_dropper.transform(df)
# EXAMPLE 2: USE THE TRANSFORMER WITH PIPELINE --------------------------------------
column_dropper = ColumnDropper(banned_list = ["ball","fall","hall"])
bagging = Bucketizer(
    splits=[-float("inf"), 3, float("inf")],
    inputCol='keep_the',
    outputCol='keep_the_bucket')  # output column name is an assumption; it is cut off in the source
model = Pipeline(stages=[column_dropper, bagging]).fit(df)
bucketedData = model.transform(df)
| 6| 1.0|
| 5| 1.0|
| 4| 1.0|
| 3| 1.0|
| 2| 0.0|
| 1| 0.0|
| 0| 0.0|
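To see why the output above looks the way it does, the two pipeline stages can be mimicked in plain Python. This is only an illustrative sketch: `drop_banned` and `bucket_index` are hypothetical helpers, not part of the Spark API, and the bucket rule assumes half-open intervals with the last bucket also including its upper bound.

```python
from bisect import bisect_right

def drop_banned(columns, banned_list):
    """Keep only the column names that contain none of the banned words."""
    return [c for c in columns if not any(word in c for word in banned_list)]

def bucket_index(value, splits):
    """Index i such that splits[i] <= value < splits[i + 1]; the last bucket
    is assumed to include its upper bound as well."""
    idx = bisect_right(splits, value) - 1
    return float(min(idx, len(splits) - 2))

columns = ["ball_column", "keep_the", "hall_column"]
kept_columns = drop_banned(columns, ["ball", "fall", "hall"])

splits = [-float("inf"), 3, float("inf")]
buckets = [bucket_index(v, splits) for v in [6, 5, 4, 3, 2, 1, 0]]
print(kept_columns)
print(buckets)
```

The value 3 lands in bucket 1.0 because it sits exactly on a split point and each bucket includes its lower bound.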
Also, note that if your custom method needs to be fitted (e.g. a custom StringIndexer), you should also create a custom Estimator:
from pyspark.ml import Estimator
class CustomTransformer(Transformer):
    def _transform(self, df) -> DataFrame:
        ...  # body not shown in the original
class CustomEstimator(Estimator):
    def _fit(self, df) -> CustomTransformer:
        ...  # body not shown in the original

The right way to aggregate and combine RDDs

I have a customer table that hosts information about several processes for each customer.
The goal is to extract features for each customer and each process. This means every feature is mostly an aggregate or sort-compare computation on a .groupby(customerID, processID) object.
However, the goal is to be able to add more and more features over time. So basically the user should be able to define new functions with some filters, metrics and aggregations, and add them to a pool of functions which operate on the table.
The output should be a customerID, processID table, with all features.
So I started with a minimal working example:
l = [('CM1','aa1', 100,0.1),('CM1','aa1', 110,0.2),\
('CM1','aa1', 110,0.9),('CM1','aa1', 100,1.5),\
('CX2','bb9', 100,0.1),('CX2','bb9', 100,0.2),\
('CX2','bb9', 110,6.0),('CX2','bb9', 100,0.18)]
rdd = sc.parallelize(l)
df = sqlContext.createDataFrame(rdd,['customid','procid','speed','timestamp'])
| CM1| aa1| 100| 0.1|
| CM1| aa1| 110| 0.2|
| CM1| aa1| 110| 0.9|
| CM1| aa1| 100| 1.5|
| CX2| bb9| 100| 0.1|
| CX2| bb9| 100| 0.2|
| CX2| bb9| 110| 6.0|
| CX2| bb9| 100| 0.18|
Then I define two arbitrary features, which are extracted by these functions:
from pyspark.sql.functions import col, count, max as spark_max, min as spark_min
def extr_ft_1(proc_data, limit=100):
    proc_data = proc_data.filter(proc_data.speed > limit).agg(count(proc_data.speed))
    proc_data = proc_data.select(col('count(speed)').alias('speed_feature'))
    return proc_data
def extr_ft_0(proc_data):
    max_t = proc_data.agg(spark_max(proc_data.timestamp))
    min_t = proc_data.agg(spark_min(proc_data.timestamp))
    max_t = max_t.select(col('max(timestamp)').alias('max'))
    min_t = min_t.select(col('min(timestamp)').alias('min'))
    X = max_t.crossJoin(min_t)
    X = X.withColumn('time_feature', X.max + X.min)
    X = X.drop(X.min).drop(X.max)
    return X
They return 1-element RDDs which just hold an aggregate value.
Next, all feature functions are applied for a given process and combined in a result RDD for each process:
def get_proc_features(proc, data, *features):
    proc_data = data.filter(data.customid == proc)
    features_for_proc = [feature_value(proc_data) for feature_value in features]
    for number, feature in enumerate(features_for_proc):
        if number == 0:
            l = [(proc, 'dummy')]
            rdd = sc.parallelize(l)
            df = sqlContext.createDataFrame(rdd, ['customid', 'dummy'])
            df = df.drop(df.dummy)
            features_for_proc_rdd = feature
            features_for_proc_rdd = features_for_proc_rdd.crossJoin(df)
        else:
            features_for_proc_rdd = features_for_proc_rdd.crossJoin(feature)
    return features_for_proc_rdd
The last step is to append all rows which contain the features for each process to one dataframe:
for number, proc in enumerate(customer_list_1):
    if number == 0:
        #results = get_trip_features(trip, df, extr_ft_0, extr_ft_1)
        results = get_proc_features(proc, df, *extr_feature_funcs)
    else:
        results = results.unionAll(get_proc_features(proc, df, *extr_feature_funcs))
The chain of transformations goes like this:
get features 1 and 2 for customer 1:
| 1.6|
| 2|
Combine them to:
| 1.6| CM1| 2|
Do the same for customer 2 and append all RDDs to the final result RDD:
| 1.6| CM1| 2|
| 6.1| CX2| 1|
If I run the code on the cluster, it works for 2 customers.
But when I tested it on a reasonable number of customers, I got mostly GC and heap memory errors.
Am I working with too many RDDs here? I am afraid my code is very inefficient, but I don't know where to start optimizing it. I think I just call one action at the end (I drop all show() calls in live mode and just collect() the very last RDD).
I would really appreciate your help.
Your code needs refactoring. The problem is not the RDD itself but the fact that you filter it down to individual keys and then cross join. Iterating through values makes you lose the distributed aspect of PySpark. Keep in mind that you should stay within your one working table unless you need features from another one.
The best way to do it is using dataframes and window functions.
First let's rewrite your functions:
import pyspark.sql.functions as psf
def extr_ft_1(proc_data, w, limit=100):
    return proc_data.withColumn(
        "speed_feature",
        psf.sum((proc_data.speed > limit).cast("int")).over(w)
    )
def extr_ft_0(proc_data, w):
    return proc_data.withColumn(
        "time_feature",
        psf.min(proc_data.timestamp).over(w) + psf.max(proc_data.timestamp).over(w)
    )
Where w is a window spec:
from pyspark.sql import Window
w = Window.partitionBy("customid")
df1 = extr_ft_1(df, w)
df0 = extr_ft_0(df1, w)
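+--------+------+-----+---------+-------------+------------+
|customid|procid|speed|timestamp|speed_feature|time_feature|
+--------+------+-----+---------+-------------+------------+
|     CM1|   aa1|  100|      0.1|            2|         1.6|
|     CM1|   aa1|  110|      0.2|            2|         1.6|
|     CM1|   aa1|  110|      0.9|            2|         1.6|
|     CM1|   aa1|  100|      1.5|            2|         1.6|
|     CX2|   bb9|  100|      0.1|            1|         6.1|
|     CX2|   bb9|  100|      0.2|            1|         6.1|
|     CX2|   bb9|  110|      6.0|            1|         6.1|
|     CX2|   bb9|  100|     0.18|            1|         6.1|
+--------+------+-----+---------+-------------+------------+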
Here we never lose information (we keep all the rows), so you can add extra features whenever you want. If you want a final aggregated result, just run it through a groupBy("customid").
Note that you can also modify the aggregation key in the window spec to include procid for instance.
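The "pool of feature functions" from the question can be kept while still grouping the data only once. The following plain-Python sketch illustrates that shape without Spark: rows are grouped in a single pass and every registered feature function is applied to each group. All names here (`FEATURES`, `extract_features`, and the two feature functions) are hypothetical; the data is the sample from the question.

```python
from collections import defaultdict

def speed_feature(rows, limit=100):
    """Number of rows in the group whose speed exceeds the limit."""
    return sum(1 for r in rows if r["speed"] > limit)

def time_feature(rows):
    """Smallest plus largest timestamp in the group."""
    stamps = [r["timestamp"] for r in rows]
    return min(stamps) + max(stamps)

# Pool of features: adding a feature means adding one entry here
FEATURES = {"speed_feature": speed_feature, "time_feature": time_feature}

def extract_features(rows, keys=("customid",), features=None):
    """Group the rows once, then apply every feature function to each group."""
    features = FEATURES if features is None else features
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append(r)
    return {key: {name: fn(members) for name, fn in features.items()}
            for key, members in groups.items()}

raw = [("CM1", "aa1", 100, 0.1), ("CM1", "aa1", 110, 0.2),
       ("CM1", "aa1", 110, 0.9), ("CM1", "aa1", 100, 1.5),
       ("CX2", "bb9", 100, 0.1), ("CX2", "bb9", 100, 0.2),
       ("CX2", "bb9", 110, 6.0), ("CX2", "bb9", 100, 0.18)]
rows = [dict(zip(("customid", "procid", "speed", "timestamp"), t)) for t in raw]

result = extract_features(rows)
by_proc = extract_features(rows, keys=("customid", "procid"))
print(result)
```

Passing `keys=("customid", "procid")` corresponds to widening the aggregation key, as mentioned above.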

