Spark DataFrame aggregation: windows + partitioning versus groupBy operations

I'm looking to perform about five different summarizing techniques over a lot of data. Generally, I'm looking to calculate mean, min, max, stddev, and sum over certain time windows and other dimensions.
Here's about as reproducible an example as I can make:
import random
import string
import datetime
from pyspark.sql import SparkSession, functions as func
from pyspark.conf import SparkConf
from pyspark.sql.types import StringType, DoubleType, IntegerType
from pyspark.sql.window import Window
########## Setting up DataFrame ##########
def random_date(start, n):
    current = start
    for _ in range(n):
        current = current + datetime.timedelta(seconds=random.randrange(60))
        yield current
start_date = datetime.datetime(2013, 9, 20, 13, 00)
n_records = 50000000
dates = list(random_date(start_date, n_records))
other_data = []
for d in dates:
    categorical_data = tuple(random.choice(string.ascii_lowercase) for _ in range(1))
    numerical_data = tuple(random.randrange(100) for _ in range(20))
    other_data.append(categorical_data + numerical_data + (d,))
categorical_columns = ['cat_{}'.format(n) for n in range(1)]
numerical_columns = ['num_{}'.format(n) for n in range(20)]
date_column = ['date']
columns = categorical_columns + numerical_columns + date_column
df = sc.parallelize(other_data).toDF(columns)
df = df.withColumn('date_window', func.window('date', '5 minutes'))
df.registerTempTable('df')
########## End DataFrame setup ##########
To date, I've tried two techniques: one using the built-in DataFrame.groupBy mechanism; the other using pyspark.sql.window.Window's orderBy and partitionBy methods.
Generally, the pipeline I've developed looks like this:
For each numeric column, group-by the categorical columns cat_0 and date_window, and calculate the five summary statistics listed previously.
For the pyspark.sql.window.Window approach, join the calculated column directly, using df.withColumn. For the DataFrame.groupBy approach, track each result DataFrame (each with three columns: two for the grouping columns, and one for the calculated column) - at the end, join each DataFrame by basically performing a reduce operation with the grouping columns as keys.
I've left some of the pipeline code below, but am primarily interested in opinions about 1) whether either of these are "best practices" for this type of work and 2) if not, am I missing something major within the Spark ecosystem that could help me do this much more quickly/with fewer resources?
Currently, the groupBy approach has performed a little bit better, but is kind of cumbersome with having to track each grouped-by DataFrame and reduce-join them all at the end. The Window approach hasn't been very great, although it's syntactically a little bit cleaner and more maintainable, IMO. In either case, I'm having to allocate a massive amount of compute to get the job to run and write to disk at the end (without repartitioning/coalescing).
gb_cols = ['cat_0', 'date_window']
strategies = {'sum', 'mean', 'stddev', 'max', 'min'}
Xcols = [col for col in df.columns if col.startswith('num')]
for col in Xcols[:]:
    for s in strategies:
        new_col = '{}_{}'.format(col, s)
        Xcols.append(new_col)
        if s == 'mean':
            calc_col_series = func.mean(col)
        elif s == 'stddev':
            calc_col_series = func.stddev(col)
        elif s == 'max':
            calc_col_series = func.max(col)
        elif s == 'min':
            calc_col_series = func.min(col)
        elif s == 'sum':
            calc_col_series = func.sum(col)
        elif s == 'median':
            query = '''
            SELECT
                PERCENTILE_APPROX({}, 0.5)
            FROM df
            GROUP BY {}
            '''.format(col, ','.join(gb_cols))
            calc_col_series = spark.sql(query)
        df = df.withColumn(new_col, calc_col_series.over(agg_window))
        # Differencing inputs
        for difference in range(1, 3 + 1):
            # Last period's datapoints... moved to the future
            led_series = func.lag(df[new_col], difference).over(agg_window.orderBy(window_cols['orderBy']))
            diff_series = df[new_col] - led_series
            new_col_diff = '{}_{}_diff'.format(new_col, difference)
            df = df.withColumn(new_col_diff, diff_series)
            Xcols.append(new_col_diff)

Related

Calculation of the removal percentage for chemical parameters (faster code)

I have to calculate the removal percentages of chemical/biological parameters (e.g. after an oxidation process) in a wastewater treatment plant.
My code works so far and does exactly what it should do, but it is really slow.
On my laptop the calculation for the original dataset took about 10 s, and on my PC 4 s, for a 15x80 data frame. That is too long, especially if I have to deal with more rows.
What the code does:
The formula for the single removal is defined as: 1 - n(i)/n(i-1)
and for the total removal: 1 - n(i)/n(0)
Every measuring point has its own ID. The code searches for the IDs, performs the calculation, and saves the result in the data frame.
Here is an example (I can't post the original data):
import pandas as pd
import numpy as np
data = {"ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002", "X4_P0002","X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
        "Measurement": [100, 80, 60, 120,90,70,50,25, 85,65,35]}
df = pd.DataFrame(data)
df["S_removal"]= np.nan
df["T_removal"]= np.nan
Data Frame before calculation
this is my function for the calculation:
def removal_TEST(Rem1, Measure, Rem2):
    lst = [i.split("_")[1] for i in df["ID"]] #takes relevant ID information
    y = np.unique(lst) #stores unique ID values to loop over them
    for ID in y:
        id_list = []
        for i in range(0, len(df["ID"])):
            if ID in df["ID"][i]:
                id_list.append(i)
            else: # this stores only the relevant id in a new list
                id_list.append(np.nan)
        indexlist = pd.Series(id_list)
        first_index = indexlist.first_valid_index() #gets the first and last index of the id list
        last_index = indexlist.last_valid_index()
        col_indizes = []
        for i in range(first_index, last_index+1):
            col_indizes.append(i)
        for i in col_indizes:
            if i == 0:
                continue # for i=0 there is no 0-1 element, so i=0 should be skipped
            else:
                Rem1[i]= 1-(Measure[i]/Measure[i-1])
        Rem1[first_index]= np.nan #first entry of an ID must be NaN value
        for i in range(first_index, last_index+1):
            col_indizes.append(i)
        for i in range(len(Rem2)):
            for i in col_indizes:
                Rem2[i]= 1-(Measure[i]/Measure[first_index])
        Rem2[first_index]= np.nan
this is the result:
Final Data Frame
I am new to Python and to stackoverflow (so sorry if my code and question are not so good to read). Are there any good libraries to speed up my code, or do you have some suggestions?
Thank you :)

Your use of Pandas seems to be getting in the way of solving the problem. The only relevant state seems to be when the group changes and the first and previous measurement values for each row.
I'd be tempted to solve this just using Python primitives, but you could solve this in other ways if you had lots of data (i.e. millions of rows).
import pandas as pd
df = pd.DataFrame({
"ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002", "X4_P0002","X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
"Measurement": [100, 80, 60, 120,90,70,50,25, 85,65,35],
"S_removal": float('nan'),
"T_removal": float('nan'),
})
# somewhere keep track of the last group identifier
last = None
# iterate over rows
for idx, ID, meas in zip(df.index, df['ID'], df['Measurement']):
    # what's the current group name
    _, grp = ID.split('_', 1)
    # see if we're in a new group
    if grp != last:
        last = grp
        # track the group's measurement
        grp_meas = meas
    else:
        # calculate things
        df.loc[idx, 'S_removal'] = 1 - meas / last_meas
        df.loc[idx, 'T_removal'] = 1 - meas / grp_meas
    # keep track of the last measurement
    last_meas = meas
I've commented the code in the hopes it makes sense. This takes ~2 seconds for 1000 copies of your example data, so 11000 rows.
Given that OP has said this needs to be done for a wide dataset, here's another version that reduces runtime to ~30ms for 11000 rows and 2 columns:
import numpy as np
import pandas as pd
data = {
"ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002", "X4_P0002","X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
"M1": [100, 80, 60, 120,90,70,50,25, 85,65,35],
"M2": [100, 80, 60, 120,90,70,50,25, 85,65,35],
}
# reset_index() because code below assumes they are unique
df = pd.concat([pd.DataFrame(data)]*1000).reset_index()
# column names
measurement_col_names = ['M1', 'M2']
single_output_names = ['S1', 'S2']
total_output_names = ['T1', 'T2']
# somewhere keep track of the last group identifier
last = None
# somewhere to store intermediate state
vals_idx = []
meas_vals = []
last_vals = []
grp_vals = []
# iterate over rows
for idx, ID, meas in zip(df.index, df['ID'], df.loc[:,measurement_col_names].values):
    # what's the current group name
    _, grp = ID.split('_', 1)
    # we're in a new group
    if grp != last:
        last = grp
        # track the group's measurement
        grp_meas = meas
    else:
        # track values and which rows they apply to
        vals_idx.append(idx)
        meas_vals.append(meas)
        last_vals.append(last_meas)
        grp_vals.append(grp_meas)
    # keep track of the last measurement
    last_meas = meas
# convert to numpy array so it vectorises nicely
meas_vals = np.array(meas_vals)
# perform calculation using fast numpy operations
df.loc[vals_idx, single_output_names] = 1 - (meas_vals / last_vals)
df.loc[vals_idx, total_output_names] = 1 - (meas_vals / grp_vals)
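
As one of the "other ways" hinted at above, the same two formulas can also be expressed with pandas groupby operations and no explicit Python loop. This is a sketch rather than a drop-in replacement: it assumes the group is the part of the ID after the first underscore, and it groups by that label wherever it appears, whereas the loops above start a new group whenever the label changes between consecutive rows. The names grp, by_grp and first are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002",
           "X4_P0002", "X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
    "Measurement": [100, 80, 60, 120, 90, 70, 50, 25, 85, 65, 35],
})

# Group label is the part after the first underscore, e.g. 'P0001'.
grp = df["ID"].str.split("_", n=1).str[1]
by_grp = df.groupby(grp, sort=False)["Measurement"]

# Single removal: compare with the previous row of the same group (NaN on the first row).
df["S_removal"] = 1 - df["Measurement"] / by_grp.shift(1)

# Total removal: compare with the group's first row, then blank out that first row itself.
first = by_grp.transform("first")
df["T_removal"] = (1 - df["Measurement"] / first).where(by_grp.cumcount() > 0)
print(df)
```

For several measurement columns, the same expressions could be applied per column; no timing is claimed for this variant.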

pmdarima: Apply .predict method via .groupby and .apply to auto_arima output stored rowwise in a pd.DataFrame

I'm using auto_arima via pmdarima to fit multiple time series via a groupby. This is to say, I have a pd.DataFrame of stacked time-indexed data, grouped by variable variable, and have successfully applied transform(pm.auto_arima) to each. The reproducible example finds boring best ARIMA models, but the idea seems to work. I now want to apply .predict() similarly, but cannot get it to play nice with apply / lambda(x) / their combinations.
The code below works until the # Forecasting - help! section. I'm having trouble catching the correct object (apparently) in the apply. How might I adapt one of test1, test2, or test3 to get what I want? Or, is there some other best-practice construct to consider? Is it better across columns (without a melt)? Or via a loop?
Ultimately, I hope that test1, say, is a stacked pd.DataFrame (or pd.Series at least) with 8 rows: 4 forecasted values for each of the 2 time series in this example, with an identifier column variable (possibly tacked on after the fact).
import pandas as pd
import pmdarima as pm
import itertools
# Get data - this is OK.
url = 'https://raw.githubusercontent.com/nickdcox/learn-airline-delays/main/delays_2018.csv'
keep = ['arr_flights', 'arr_cancelled']
# Setup data - this is OK.
df = pd.read_csv(url, index_col=0)
df.index = pd.to_datetime(df.index, format = "%Y-%m")
df = df[keep]
df = df.sort_index()
df = df.loc['2018']
df = df.groupby(df.index).sum()
df.reset_index(inplace = True)
df = df.melt(id_vars = 'date', value_vars = df.columns.to_list()[1:])
# Fit auto.arima for each time series - this is OK.
fit = df.groupby('variable')['value'].transform(pm.auto_arima).drop_duplicates()
fit = fit.to_frame(name = 'model')
fit['variable'] = keep
fit.reset_index(drop = True, inplace = True)
# Setup forecasts - this is OK.
max_date = df.date.max()
dr = pd.to_datetime(pd.date_range(max_date, periods = 4 + 1, freq = 'MS').tolist()[1:])
yhat = pd.DataFrame(list(itertools.product(keep, dr)), columns = ['variable', 'date'])
yhat.set_index('date', inplace = True)
# Forecasting - help! - Can't get any of these to work.
def predict_fn(obj):
    return(obj.loc[0].predict(4))
predict_fn(fit.loc[fit['variable'] == 'arr_flights']['model']) # Appears to work!
test1 = fit.groupby('variable')['model'].apply(lambda x: x.predict(n_periods = 4)) # Try 1: 'Series' object has no attribute 'predict'.
test2 = fit.groupby('variable')['model'].apply(lambda x: x.loc[0].predict(n_periods = 4)) # Try 2: KeyError
test3 = fit.groupby('variable')['model'].apply(predict_fn) # Try 3: KeyError
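
A likely reason for the errors above: each group handed to apply is a one-element Series that keeps its original index label, so it has no predict method of its own, and .loc[0] only exists in the group that happens to hold row 0. One way around this is to skip groupby and call predict on each stored object directly. The sketch below uses a hypothetical StandInModel class in place of a fitted pmdarima model so that it runs without pmdarima; it only mimics a predict(n_periods) method, and the names forecasts, step and yhat are also made up.

```python
import numpy as np
import pandas as pd


class StandInModel:
    """Hypothetical stand-in for a fitted model; only mimics predict(n_periods)."""

    def __init__(self, level):
        self.level = level

    def predict(self, n_periods):
        return np.full(n_periods, self.level, dtype=float)


# Same layout as `fit` in the question: one fitted model object per row.
fit = pd.DataFrame({
    "variable": ["arr_flights", "arr_cancelled"],
    "model": [StandInModel(100.0), StandInModel(5.0)],
})

# Call predict on each stored object and stack the results under its variable name.
forecasts = pd.concat(
    {
        var: pd.Series(model.predict(n_periods=4))
        for var, model in zip(fit["variable"], fit["model"])
    },
    names=["variable", "step"],
).rename("yhat").reset_index()
print(forecasts)
```

With two models and four periods this gives the stacked 8-row frame with a variable identifier column that the question asks for.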

Can this pandas workflow be converted to dask?

Please be nice - I'm not a proper programmer, I'm a scientist and I've read as many docs on this as I can find (they're a bit sparse).
I'm trying to convert this pandas code into dask because my input file is ~0.5 TB gzipped and it loads too slowly in native pandas. I have a 3 TB machine, btw.
This is an example of what I'm doing with pandas:
df = pd.DataFrame([['chr1',33329,17,'''33)'6'4?1&AB=?+..''','''X%&=E&!%,0("&"Y&!'''],
['chr1',33330,15,'''6+'/7=1#><C1*'*''','''X%=E!%,("&"Y&&!'''],
['chr1',33331,13,'''2*3A#/9#CC3--''','''X%E!%,("&"Y&!'''],
['chr1',33332,1,'''4**(,:3)+7-#<(0-''','''X%&E&!%,0("&"Y&!'''],
['chr1',33333,2,'''66(/C=*42A:.&*''','''X%=&!%0("&"&&!''']],
columns = ['chrom','pos','depth','phred','map'])
df.loc[:,'phred'] = [(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df.loc[:,"map"] = [(sum(map(ord,i)))/len(i) for i in df.loc[:,"map"]]
df = df.astype({'phred': 'int32', 'map': 'int32'})
df.query('(depth < 10) | (phred < 7) | (map < 10)', inplace=True)
for chrom, df_tmp in df.groupby('chrom'):
    df_end = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(-1)-1))]
    df_start = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(+1)+1))]
    for start, end in zip(df_start.pos, df_end.pos):
        print (start, end)
Gives
33332 33333
This works (to find regions of a cancer genome with no data) and it's optimised as much as I know how.
I load the real thing like:
df = pd.read_csv(
'/Users/liamm/Downloads/test_head33333.tsv.gz',
sep='\t',
header=None,
index_col=None,
usecols=[0,1,3,5,6],
names = ['chrom','pos','depth','phred','map']
)
and I can do the same with Dask (way faster!):
df = dd.read_csv(
'/Users/liamm/Downloads/test_head33333.tsv.gz',
sep='\t',
header=None,
usecols=[0,1,3,5,6],
compression='gzip',
blocksize=None,
names = ['chrom','pos','depth','phred','map']
)
but i'm stuck here:
ff=[(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df['phred'] = ff
Error: Column assignment doesn't support type list
Question - is this sort of thing possible? If so are there good tutes somewhere? I need to convert the whole block of pandas code above.
Thanks in advance!

You created list comprehensions to transform 'phred' and 'map'; I converted these list comprehensions to functions and wrapped the functions in np.vectorize().
def func_p(p):
    return (sum(map(ord, p)) - len(p) * 33) / len(p)
def func_m(m):
    return (sum(map(ord, m))) / len(m)
vec_func_p = np.vectorize(func_p)
vec_func_m = np.vectorize(func_m)
np.vectorize() does not make code faster, but it does let you write a function with scalar inputs and outputs, and convert it to a function that takes array inputs and outputs.
The benefit is that we can now pass pandas Series to these functions (I also added the type conversion to this step):
df.loc[:, 'phred'] = vec_func_p( df.loc[:, 'phred']).astype(np.int32)
df.loc[:, 'map'] = vec_func_m( df.loc[:, 'map']).astype(np.int32)
Replacing the list comprehensions with these new functions gives the same results as your version (33332 33333).
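
Since np.vectorize() is only a convenience wrapper, the same scalar functions can also be applied with the Series.map method, which avoids building a Python list and assigning it to a column. The sketch below runs on a made-up two-row pandas frame (demo and its strings are assumptions for illustration). Dask's Series also offers a map method that takes a meta argument describing the output type, so the same call pattern may carry over, but that part is not shown or tested here.

```python
import pandas as pd


def func_p(p):
    # Mean Phred score: ASCII codes offset by 33, averaged over the string.
    return (sum(map(ord, p)) - len(p) * 33) / len(p)


def func_m(m):
    # Mean raw ASCII code of the mapping-quality string.
    return sum(map(ord, m)) / len(m)


# Made-up two-row frame, only to show the call pattern.
demo = pd.DataFrame({"phred": ["II", "5?"], "map": ["XX", "!#"]})

# Apply each scalar function element-wise, then cast as in the question.
demo["phred"] = demo["phred"].map(func_p).astype("int32")
demo["map"] = demo["map"].map(func_m).astype("int32")
print(demo)
```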

@rpanai noted that you could eliminate the for loops. The following example uses groupby() (and a couple of helper columns) to find the start and end position for each contiguous sequence of positions.
Using only pandas built-in functions should be compatible with Dask (and fast).
First, create demo data frame with multiple chromosomes and multiple contiguous blocks of positions:
data1 = {
'chrom' : 'chrom_1',
'pos' : [1000, 1001, 1002,
2000, 2001, 2002, 2003]}
data2 = {
'chrom' : 'chrom_2',
'pos' : [30000, 30001, 30002, 30003, 30004,
40000, 40001, 40002, 40003, 40004, 40005]}
df = pd.concat([pd.DataFrame(data1), pd.DataFrame(data2)])
Second, create two helper columns:
rank is a sequential counter within each group;
key is constant for positions in a contiguous 'run' of positions.
df['rank'] = df.groupby('chrom')['pos'].rank(method='first')
df['key'] = df['pos'] - df['rank']
Third, group by chrom and key to create a groupby object for each contiguous block of positions, then use min and max to find start and end value for the positions.
result = (df.groupby(['chrom', 'key'])['pos']
.agg(['min', 'max'])
.droplevel('key')
.rename(columns={'min': 'start', 'max': 'end'})
)
print(result)
         start    end
chrom
chrom_1   1000   1002
chrom_1   2000   2003
chrom_2  30000  30004
chrom_2  40000  40005

I want to run a loop with condition and save all outputs as dataframes with different names

I wrote a function that depends only on a dataframe, and its output is also a dataframe. I would like to build a separate dataframe for each value of a condition and save them as different datasets with different names. However, I couldn't save them as dataframes with different names, so I currently do the process manually. Is there code that would do the same thing automatically?
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
    df['Disbursement_Date']=pd.to_datetime(df.Disbursement_Date)
    df['Closing_Date']=pd.to_datetime(df.Closing_Date)
    df['NPL_date']=pd.to_datetime(df.NPL_date, errors='ignore')
    df['NPL_date_period']=df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
    df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
    df['diff']=((df.NPL_date-df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
    df=df.groupby(['Dis_date_period','NPL_date_period']).agg({'Dis_amount' : 'sum', 'NPL_amount' : 'sum', 'diff' : 'mean'})
    df.reset_index(level=0, inplace=True)
    df['Vintage_Ratio']=df['NPL_amount']/df['Dis_amount']
    table=pd.pivot_table(df,values='Vintage_Ratio',index='Dis_date_period',columns=['diff'],).fillna(0)
    return
The above is the function
#for e in product_list:
# sub = data1[data1['product_types'] == e]
# print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part. Is there a better way to do the same process?

You could have your vintage_table() function return a dataframe (for example, end it with return table) instead of just modifying one dataframe over and over. That way the assignments in your second code block would work:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
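
To avoid naming each result by hand, the outputs could also be collected in a dictionary keyed by product type. This is a minimal sketch under stated assumptions: summarize() is a hypothetical stand-in for a vintage_table() that returns its result, and the data and the period and amount columns are made up for illustration.

```python
import pandas as pd


def summarize(df):
    # Hypothetical stand-in for vintage_table(); the key point is that it returns its result.
    return df.groupby("period", as_index=False)["amount"].sum()


# Made-up data with the same 'product_types' column as in the question.
data1 = pd.DataFrame({
    "product_types": ["consumer", "mortgage", "consumer", "vehicle"],
    "period": ["2015-01", "2015-01", "2015-02", "2015-01"],
    "amount": [10, 20, 30, 40],
})

# One result per product type, keyed by the product name instead of a separate variable.
tables = {
    product: summarize(sub)
    for product, sub in data1.groupby("product_types")
}
print(tables["consumer"])
```

Each table is then available as tables["consumer"], tables["mortgage"] and so on, and new product types are picked up without further code changes.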

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift and
thanks to this previous question and answer: How to speed up Pandas multilevel dataframe shift by group? I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index to Nan. And now I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(by=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',axis=1,inplace=True)
Thanks!

I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp,col,N,value):
    if (N > 0):
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp
df = df.groupby(level=0).apply(replace_tail,'tmpShift',2,np.nan)
So the final code is:
def replace_tail(grp,col,N,value):
    if (N > 0):
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(by=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',axis=1,inplace=True)
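
The groupby apply above still runs a Python function once per group. A possible alternative, sketched here under assumptions, is to number the rows from the end of each group with cumcount(ascending=False) and blank out the last N rows with one boolean mask. The deterministic colB values, N = 2, and the name from_end are chosen only for illustration.

```python
import numpy as np
import pandas as pd

length, groups, N = 5, 3, 2
dates = pd.date_range("1990-01-01", periods=length, freq="D")
frames = [
    pd.DataFrame({"date": dates, "category": g, "colB": np.arange(length) + g * length})
    for g in range(groups)
]
df = pd.concat(frames).set_index(["category", "date"]).sort_index()

# Shift once over the whole frame (looking N rows ahead), ignoring group boundaries.
df["tmpShift"] = df["colB"].shift(-N)

# cumcount(ascending=False) numbers rows from the end of each group: 0 is the last row.
from_end = df.groupby(level=0).cumcount(ascending=False)

# The last N rows of every group picked up values from the next group; blank them out.
df.loc[(from_end < N).to_numpy(), "tmpShift"] = np.nan
print(df)
```

For a backwards shift, cumcount() with its default ascending order would mark the first N rows of each group in the same way.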
