Efficient, large-scale competition scoring in Python
Consider a large dataframe of scores S containing entries like the following. Each row represents a contest between a subset of the participants A, B, C and D.
A B C D
0.1 0.3 0.8 1
1 0.2 NaN NaN
0.7 NaN 2 0.5
NaN 4 0.6 0.8
The way to read the matrix above: in the first row (contest), participant A scored 0.1, B scored 0.3, and so forth.
I need to build a triangular matrix C where C[X,Y] stores how much better participant X was than participant Y. More specifically, C[X,Y] would hold the mean % difference in score between X and Y.
From the example above:
C[A,B] = 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) ≈ 333.33%
My matrix S is huge, so I am hoping to take advantage of JIT (Numba?) or built-in methods in numpy or pandas. I certainly want to avoid having a nested loop, since S has millions of rows.
Does an efficient algorithm for the above have a name?
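For concreteness, the nested-loop version I want to avoid looks roughly like the sketch below (naive_C is a hypothetical helper, assuming S is the DataFrame above, with one column per participant and the same sum-of-%-differences convention as the example):

import itertools
import numpy as np

def naive_C(S):
    # brute force: loop over every pair of participants and every contest
    C = {}
    for x, y in itertools.combinations(S.columns, 2):
        diffs = []
        for sx, sy in zip(S[x].values, S[y].values):
            if not (np.isnan(sx) or np.isnan(sy)):
                diffs.append(100 * (sx - sy) / sy)
        C[(x, y)] = sum(diffs)   # or a mean over the valid contests
    return C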
Let's look at a NumPy-based solution, assuming the input data is in an array named a. The number of pairwise combinations of 4 participants is 4*3/2 = 6. We can generate the column indices for those combinations with np.triu_indices(), index into the columns of a with them, perform the subtractions and divisions, and then sum over the rows while ignoring the NaN-affected results with np.nansum() to get the desired output.
Thus, we would have an implementation like so -
import numpy as np

R,C = np.triu_indices(a.shape[1],1)                        # column index pairs for all combinations
out = 100*np.nansum((a[:,R] - a[:,C])/a[:,C],0)            # sum % differences over contests, ignoring NaNs
Sample run -
In [121]: a
Out[121]:
array([[ 0.1, 0.3, 0.8, 1. ],
[ 1. , 0.2, nan, nan],
[ 0.7, nan, 2. , 0.5],
[ nan, 4. , 0.6, 0.8]])
In [122]: out
Out[122]:
array([ 333.33333333, -152.5 , -50. , 504.16666667,
330. , 255. ])
In [123]: 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) # Sample's first o/p elem
Out[123]: 333.33333333333337
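Note that np.nansum adds up the per-contest % differences; if you instead want the mean % difference mentioned in the question, swapping in np.nanmean (same a, R and C as above) should be all that's needed -

out_mean = 100*np.nanmean((a[:,R] - a[:,C])/a[:,C],0)   # averages over the contests where both scores exist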
If you need the output as a (4,4) array, we can use SciPy's squareform -
In [124]: from scipy.spatial.distance import squareform
In [125]: out2D = squareform(out)
Let's convert to a pandas dataframe for better visual feedback -
In [126]: pd.DataFrame(out2D,index=list('ABCD'),columns=list('ABCD'))
Out[126]:
A B C D
A 0.000000 333.333333 -152.500000 -50
B 333.333333 0.000000 504.166667 330
C -152.500000 504.166667 0.000000 255
D -50.000000 330.000000 255.000000 0
Let's compute [B,C] manually and check back -
In [127]: 100 * ((0.3 - 0.8)/0.8 + (4 - 0.6)/0.6)
Out[127]: 504.1666666666667
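Since the question also mentions Numba, here is a rough JIT sketch of the same computation (pairwise_pct is a hypothetical helper; it uses the same pair ordering as np.triu_indices and the same sum-of-%-differences semantics). Treat it as a starting point rather than a drop-in - the vectorised NumPy version above is usually enough unless the temporary a[:,R] - a[:,C] arrays no longer fit in memory -

import numpy as np
from numba import njit

@njit
def pairwise_pct(a):
    n = a.shape[1]
    out = np.zeros(n * (n - 1) // 2)
    for i in range(a.shape[0]):            # loop over contests
        k = 0
        for r in range(n):                 # loop over participant pairs (r, c)
            for c in range(r + 1, n):
                x, y = a[i, r], a[i, c]
                if not (np.isnan(x) or np.isnan(y)):
                    out[k] += 100.0 * (x - y) / y
                k += 1
    return out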
Related
Finding counts of relative and absolute fluctuations in dataframe where each row contains a timeseries
I have a dataframe containing a table of financial timeseries, with each row having: an ID for that timeseries, a Target value (against which we want to measure deviations, both relative and absolute), and a timeseries of values for various dates: 1/01, 1/02, 1/03, ...

We want to calculate the fluctuation counts, both relative and absolute, for every row/ID's timeseries, and then find which row/ID has the most fluctuations/'spikes', as follows:

First, we find the difference between two consecutive timeseries values and pick a threshold. The threshold represents how much difference is allowed between two values before we declare a 'fluctuation' or 'spike'. If the difference between any two columns' values is higher than the threshold, it's a spike. However, the threshold needs to be generic and work with both % and absolute values between any two values in any row. So basically, we pick a threshold in percentage form (an educated prediction), since one row's values are represented in "%" form, and the '%' form will also work properly with the absolute values.

The output should be a new column of fluctuation counts (FCount), both relative and absolute, for every row/ID.

Code:

import pandas as pd

# Create sample dataframe
raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'],
            'Domain': ['Finance', 'IT', 'IT', 'Finance'],
            'Target': [1, 2, 3, '0.9%'],
            'Criteria': ['<=', '<=', '>=', '>='],
            "1/01": [0.9, 1.1, 2.1, 1],
            "1/02": [0.4, 0.3, 0.5, 0.9],
            "1/03": [1, 1, 4, 1.1],
            "1/04": [0.7, 0.7, 0.1, 0.7],
            "1/05": [0.7, 0.7, 0.1, 1],
            "1/06": [0.9, 1.1, 2.1, 0.6]}
df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target', 'Criteria',
                                       '1/01', '1/02', '1/03', '1/04', '1/05', '1/06'])

   ID   Domain Target Criteria  1/01  1/02  1/03  1/04  1/05  1/06
0  A1  Finance      1       <=   0.9   0.4   1.0   0.7   0.7   0.9
1  B1       IT      2       <=   1.1   0.3   1.0   0.7   0.7   1.1
2  C1       IT      3       >=   2.1   0.5   4.0   0.1   0.1   2.1
3  D1  Finance   0.9%       >=   1.0   0.9   1.1   0.7   1.0   0.6

And here's the expected output with a fluctuation count (FCount) column. We can then get whichever ID has the largest FCount.

   ID   Domain Target Criteria  1/01  1/02  1/03  1/04  1/05  1/06 FCount
0  A1  Finance      1       <=   0.9   0.4   1.0   0.7   0.7   0.9      -
1  B1       IT      2       <=   1.1   0.3   1.0   0.7   0.7   1.1      -
2  C1       IT      3       >=   2.1   0.5   4.0   0.1   0.1   2.1      -
3  D1  Finance   0.9%       >=   1.0   0.9   1.1   0.7   1.0   0.6      -
Given,

# importing pandas as pd
import pandas as pd
import numpy as np

# Create sample dataframe
raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'],
            'Domain': ['Finance', 'IT', 'IT', 'Finance'],
            'Target': [1, 2, 3, '0.9%'],
            'Criteria': ['<=', '<=', '>=', '>='],
            "1/01": [0.9, 1.1, 2.1, 1],
            "1/02": [0.4, 0.3, 0.5, 0.9],
            "1/03": [1, 1, 4, 1.1],
            "1/04": [0.7, 0.7, 0.1, 0.7],
            "1/05": [0.7, 0.7, 0.1, 1],
            "1/06": [0.9, 1.1, 2.1, 0.6]}
df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target', 'Criteria',
                                       '1/01', '1/02', '1/03', '1/04', '1/05', '1/06'])

it is easier to tackle this problem by breaking it into two parts (absolute thresholds and relative thresholds) and going through it step by step on the underlying numpy arrays.

EDIT: Long explanation ahead, skip to the end for just the final function.

First, create a list of date columns to access only the relevant columns in every row.

date_columns = ['1/01', '1/02', '1/03', '1/04', '1/05', '1/06']
df[date_columns].values

#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7, 0.9],
       [1.1, 0.3, 1. , 0.7, 0.7, 1.1],
       [2.1, 0.5, 4. , 0.1, 0.1, 2.1],
       [1. , 0.9, 1.1, 0.7, 1. , 0.6]])

Then we can use np.diff to easily get the differences between the dates on the underlying array. We also take the absolute value, because that is what we are interested in.

np.abs(np.diff(df[date_columns].values))

#Output:
array([[0.5, 0.6, 0.3, 0. , 0.2],
       [0.8, 0.7, 0.3, 0. , 0.4],
       [1.6, 3.5, 3.9, 0. , 2. ],
       [0.1, 0.2, 0.4, 0.3, 0.4]])

Now, worrying only about the absolute thresholds, it is as simple as checking whether the values in the differences are greater than a limit.

abs_threshold = 0.5
np.abs(np.diff(df[date_columns].values)) > abs_threshold

#Output:
array([[False,  True, False, False, False],
       [ True,  True, False, False, False],
       [ True,  True,  True, False,  True],
       [False, False, False, False, False]])

The sum over this array for every row gives us the result we need (a sum over a boolean array uses the underlying True=1 and False=0, so you are effectively counting how many True values are present). For percentage thresholds, we just need one additional step: dividing all differences by the original values before the comparison. To elaborate, the sum along each row gives the counts of values crossing the absolute threshold as follows.

abs_fluctuations = np.abs(np.diff(df[date_columns].values)) > abs_threshold
print(abs_fluctuations.sum(-1))

#Output: [1 2 4 0]

To start on the relative thresholds, we create the differences array same as before.

dates = df[date_columns].values         #same as before, but just assigned
differences = np.abs(np.diff(dates))    #same as before, just assigned
pct_threshold = 0.5                     #aka 50%

print(differences.shape)   #(4, 5) aka 4 rows, 5 columns
print(dates.shape)         #(4, 6) aka 4 rows, 6 columns

Note that the differences array has one column fewer, which makes sense: for 6 dates there are 5 "differences", one for each gap. Focusing on just one row, we can see that calculating percent changes is simple.

print(dates[0][:2])        #for the first row[0], take the first two dates[:2]
#Output: array([0.9, 0.4])

print(differences[0][0])   #for the first row[0], take the first difference[0]
#Output: 0.5

A change from 0.9 to 0.4 is a change of 0.5 in absolute terms, but in percentage terms it is a change of 0.5/0.9 (difference/original), i.e. about 55.55% or 0.5555... (I omit the multiplication by 100 to keep things simpler).
The main thing to realise at this step is that we need to do this division against the "original" values for all differences to get percent changes. However, the dates array has one "column" too many, so we do a simple slice.

dates[:,:-1]   #For all rows(:), take all columns except the last one(:-1).

#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7],
       [1.1, 0.3, 1. , 0.7, 0.7],
       [2.1, 0.5, 4. , 0.1, 0.1],
       [1. , 0.9, 1.1, 0.7, 1. ]])

Now I can calculate relative or percentage changes by element-wise division:

relative_differences = differences / dates[:,:-1]

Then, same as before: pick a threshold, see if it's crossed.

rel_fluctuations = relative_differences > pct_threshold

#Output:
array([[ True,  True, False, False, False],
       [ True,  True, False, False,  True],
       [ True,  True,  True, False,  True],
       [False, False, False, False, False]])

Now, if we want to consider whether either the absolute or the relative threshold is crossed, we just take a bitwise OR | (it's even there in the sentence!) and then take the sum along rows.

Putting all this together, we can create a function that is ready to use. Note that functions are nothing special, just a way of grouping lines of code together for ease of use; using a function is as simple as calling it, and you have been using functions/methods all along without realising it.

date_columns = ['1/01', '1/02', '1/03', '1/04', '1/05', '1/06']   #if hardcoded
date_columns = df.columns[4:]   #if you wish to assign dynamically; the date columns start at position 4

def get_FCount(df, date_columns, abs_threshold=0.5, pct_threshold=0.5):
    '''Expects a list of date columns with at least two values.
    Returns a 1D array with FCounts for every row.
    pct_threshold: percentage, where 1 means 100%
    '''
    dates = df[date_columns].values
    differences = np.abs(np.diff(dates))
    abs_fluctuations = differences > abs_threshold
    rel_fluctuations = differences / dates[:,:-1] > pct_threshold
    return (abs_fluctuations | rel_fluctuations).sum(-1)   #bitwise OR, since we count values that cross either threshold

df['FCount'] = get_FCount(df, date_columns)   #call our function and assign the result array to a new column
print(df['FCount'])

#Output:
0    2
1    3
2    4
3    0
Name: FCount, dtype: int32
Assuming you want percentage changes across all date columns in a row checked against a threshold, you can also try pct_change() on axis=1:

thresh_ = 0.5
s = pd.to_datetime(df.columns, format='%d/%m', errors='coerce').notna()   #mask of all date cols
df = df.assign(Count=df.loc[:, s].pct_change(axis=1).abs().gt(thresh_).sum(axis=1))

Or:

df.assign(Count=df.iloc[:, 4:].pct_change(axis=1).abs().gt(thresh_).sum(axis=1))

   ID   Domain  Target Criteria  1/01  1/02  1/03  1/04  1/05  1/06  Count
0  A1  Finance     1.0       <=   0.9   0.4   1.0   0.7   0.7   0.9      2
1  B1       IT     2.0       <=   1.1   0.3   1.0   0.7   0.7   1.1      3
2  C1       IT     3.0       >=   2.1   0.5   4.0   0.1   0.1   2.1      4
3  D1  Finance     0.9       >=   1.0   0.9   1.1   0.7   1.0   0.6      0
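Note that pct_change() only covers the relative threshold; if the absolute threshold from the question is also needed, one possible sketch in the same style (assuming the date columns start at position 4 and 0.5 for both thresholds) is -

d = df.iloc[:, 4:]
df.assign(Count=(d.diff(axis=1).abs().gt(0.5) | d.pct_change(axis=1).abs().gt(0.5)).sum(axis=1))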
Try a loc and an iloc and a sub and an abs and a sum and an idxmin:

print(df.loc[df.iloc[:, 4:].sub(df['Target'].tolist(), axis='rows').abs().sum(1).idxmin(), 'ID'])

Output:

D1

Explanation: I first get the columns starting from position 4, then simply subtract the corresponding Target value from each row. Then I take the absolute value, so -1.1 becomes 1.1 and 1.1 stays 1.1, then sum each row and get the row with the lowest number. Then I use a loc to get that index in the actual dataframe and take its ID column, which gives you D1.
The following is much cleaner pandas idiom and improves on @ParitoshSingh's version. It's much cleaner to keep two separate dataframes:

a ts (metadata) dataframe for the timeseries columns 'ID', 'Domain', 'Target', 'Criteria'
a values dataframe for the timeseries values (or 'dates' as the OP keeps calling them)

and to use ID as the common index for both dataframes. Now you get seamless merge/join, also on any results such as what compute_FCount_df() returns, and there's no need to pass around ugly lists of column names or indices. This is way better deduplication, as mentioned in the comments. The code for this setup is at the bottom.

Doing this reduces compute_FCount_df to a four-liner (I also improved @ParitoshSingh's version to use the pandas builtins df.diff(axis=1) and .abs(); note that the resulting series is returned with the correct ID index, not 0:3, hence it can be used directly in assignment/insertion/merge/join):

def compute_FCount_df(dat, abs_threshold=0.5, pct_threshold=0.5):
    """Compute FluctuationCount for all timeseries/rows"""
    differences = dat.diff(axis=1).iloc[:, 1:].abs()
    abs_fluctuations = differences > abs_threshold
    rel_fluctuations = differences / dat.iloc[:, :-1].values > pct_threshold   # .values sidesteps column-label alignment
    return (abs_fluctuations | rel_fluctuations).sum(1)

where the boilerplate to set up the two separate dataframes is at the bottom.

Also note it's cleaner not to put the fcounts series/column in either values (where it definitely doesn't belong) or ts (where it would be kind of kludgy); keep it as its own series:

fcounts = compute_FCount_df(values)

>>> fcounts
A1    2
B1    3
C1    4
D1    0

and this allows you to directly get the index (ID) of the timeseries with the most 'fluctuations':

>>> fcounts.idxmax()
'C1'

But really, since conceptually we're applying the function separately row-wise to each row of timeseries values, we should use values.apply(..., axis=1):

values.apply(compute_FCount_ts, axis=1)

with

def compute_FCount_ts(dat, abs_threshold=0.5, pct_threshold=0.5):
    """Compute FluctuationCount for a single timeseries (row)"""
    differences = dat.diff().iloc[1:].abs()
    abs_fluctuations = differences > abs_threshold
    rel_fluctuations = differences.values / dat.iloc[:-1].values > pct_threshold
    return (abs_fluctuations.values | rel_fluctuations).sum()

Last, here's the boilerplate code to set up the two separate dataframes with the shared index ID:

import pandas as pd
import numpy as np

ts = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
    'Domain': ['Finance', 'IT', 'IT', 'Finance'],
    'Target': [1, 2, 3, '0.9%'],
    'Criteria': ['<=', '<=', '>=', '>=']})

values = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
    "1/01": [0.9, 1.1, 2.1, 1],
    "1/02": [0.4, 0.3, 0.5, 0.9],
    "1/03": [1, 1, 4, 1.1],
    "1/04": [0.7, 0.7, 0.1, 0.7],
    "1/05": [0.7, 0.7, 0.1, 1],
    "1/06": [0.9, 1.1, 2.1, 0.6]})
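As a quick usage sketch of the join point above (using the ts, values and fcounts objects from this answer): attaching the counts to the metadata frame, and pulling the metadata for the most volatile series, is just -

ts_with_counts = ts.join(fcounts.rename('FCount'))   # aligns on the shared ID index
ts_with_counts.loc[fcounts.idxmax()]                 # metadata row for the series with the most fluctuations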
Increase value of several rows based on condition fulfilling all rows
I have a pandas dataframe with three columns and want to multiply/increase the float numbers of each row by the same amount until the sum of all three cells of that row fulfils the criteria (a value equal to or greater than 0.9).

df = pd.DataFrame({'A': [0.03, 0.0, 0.4],
                   'B': [0.1234, 0.4, 0.333],
                   'C': [0.5, 0.4, 0.0333]})

Outcome: the cells in each row are multiplied so that the sum of all three cells of each row is 0.9. (The sums below are not exactly 0.9 because I only approximated with simple multiplication; the actual outcome should reach 0.9.) It is important that cells which are 0 stay 0.

print(df)
        A         B         C
0  0.0414  0.170292  0.690000
1  0.0000  0.452000  0.452000
2  0.4720  0.392940  0.039294
You can take the sum on axis=1, subtract it from 0.9, then divide by df.shape[1] and add that back:

df.add((0.9-df.sum(axis=1))/df.shape[1], axis=0)

          A         B         C
0  0.112200  0.205600  0.582200
1  0.033333  0.433333  0.433333
2  0.444567  0.377567  0.077867
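A quick check of that claim (same df): adding (0.9 - row_sum)/3 to every cell makes each row sum to 0.9, though note it also moves the 0.0 cell in row 1 to 0.0333 -

out = df.add((0.9 - df.sum(axis=1)) / df.shape[1], axis=0)
out.sum(axis=1)   # 0.9 for every row (up to float precision)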
You want to apply a scaling function along the rows:

def scale(xs, target=0.9):
    """Scale the features such that their sum equals the target."""
    xs_sum = xs.sum()
    if xs_sum < target:
        return xs * (target / xs_sum)
    else:
        return xs

df.apply(scale, axis=1)

For example:

df = pd.DataFrame({'A': [0.03, 0.0, 0.4],
                   'B': [0.1234, 0.4, 0.333],
                   'C': [0.5, 0.4, 0.0333]})

df.apply(scale, axis=1)

Should give:

          A         B         C
0  0.041322  0.169972  0.688705
1  0.000000  0.450000  0.450000
2  0.469790  0.391100  0.039110

The rows of that dataframe all sum to 0.9:

df.apply(scale, axis=1).sum(axis=1)

0    0.9
1    0.9
2    0.9
dtype: float64
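If df is large, the same scaling can be done without apply by building a per-row factor and broadcasting it (a sketch, assuming the same 0.9 target and the rule of leaving rows already at or above the target untouched) -

import numpy as np

row_sums = df.sum(axis=1)
factor = np.where(row_sums < 0.9, 0.9 / row_sums, 1.0)   # scale up only the rows below the target
scaled = df.mul(factor, axis=0)                          # zeros stay zero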
Efficiently combine min/max on different columns of a pandas dataframe
I have a pandas dataframe that contains the results of a computation and need to:

take the maximum value of a column, and for that value find the maximum value of another column
take the minimum value of a column, and for that value find the maximum value of another column

Is there a more efficient way to do it?

Setup:

import pandas as pd
from collections import namedtuple

metrictuple = namedtuple('metrics', 'prob m1 m2')
l1 = [metrictuple(0.1, 0.4, 0.04), metrictuple(0.2, 0.4, 0.04), metrictuple(0.4, 0.4, 0.1),
      metrictuple(0.7, 0.2, 0.3), metrictuple(1.0, 0.1, 0.5)]
df = pd.DataFrame(l1)

# df
#   prob   m1    m2
#0   0.1  0.4  0.04
#1   0.2  0.4  0.04
#2   0.4  0.4  0.10
#3   0.7  0.2  0.30
#4   1.0  0.1  0.50

tmp = df.loc[(df.m1.max() == df.m1), ['prob','m1']]
res1 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]   #(0.4, 0.4)

tmp = df.loc[(df.m2.min() == df.m2), ['prob','m2']]
res2 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]   #(0.2, 0.04)
Pandas isn't ideal for numerical computations like this, because there is significant overhead in slicing and selecting data, in this example df.loc.

The good news is that pandas interacts well with numpy, so you can easily drop down to the underlying numpy arrays. Below I've defined some helper functions which make the code more readable. Note that numpy slicing is performed via row and column numbers starting from 0.

arr = df.values

def arr_max(x, col):
    return x[x[:,col] == x[:,col].max()]

def arr_min(x, col):
    return x[x[:,col] == x[:,col].min()]

res1 = arr_max(arr_max(arr, 1), 0)[:,:2]      # array([[ 0.4,  0.4]])
res2 = arr_max(arr_min(arr, 2), 0)[:,[0,2]]   # array([[ 0.2 ,  0.04]])
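If you would rather stay in pandas, the two lookups can also be written as a single sort per metric, which avoids building the boolean masks twice (a sketch on the same df) -

res1 = df.sort_values(['m1', 'prob'], ascending=False).iloc[0][['prob', 'm1']]            # (0.4, 0.4)
res2 = df.sort_values(['m2', 'prob'], ascending=[True, False]).iloc[0][['prob', 'm2']]    # (0.2, 0.04)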
calculate cosine similarity for two columns in a group by in a dataframe
I have a dataframe df:

AID  VID  FID  APerc  VPerc
1    A    X    0.2    0.5
1    A    Z    0.1    0.3
1    A    Y    0.4    0.9
2    A    X    0.2    0.3
2    A    Z    0.9    0.1
1    B    Z    0.1    0.2
1    B    Y    0.8    0.3
1    B    W    0.5    0.4
1    B    X    0.6    0.3

I want to calculate the cosine similarity of the values APerc and VPerc for all pairs of AID and VID. So the result for the above should be:

AID  VID  CosSim
1    A    0.997
2    A    0.514
1    B    0.925

I know how to group by:

df.groupby(['AID','VID'])

and I know how to generate cosine similarity for a whole column:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df['APerc'], df['VPerc'])

What's the best and fastest way to do this, given that I have a really large file?
Not sure if it is the fastest, but groupby.apply is usually the way to do this:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'], g['VPerc'])[0][0]))

#AID  VID
#1    A      0.997097
#     B      0.924917
#2    A      0.514496
#dtype: float64
Pairwise cosine_similarity is designed for 2D arrays, so you'll need to do some reshaping before and after. Instead of that, use scipy's cosine distance:

from scipy.spatial.distance import cosine

df.groupby(['AID','VID']).apply(lambda x: 1 - cosine(x['APerc'], x['VPerc']))

Out:
AID  VID
1    A      0.997097
     B      0.924917
2    A      0.514496
dtype: float64

Timing on a df of shape (10k, 5) gives 2.87ms for scipy and 4.08ms for sklearn. A fair amount of that 4.08ms is probably due to the warnings it outputs, because with Alexander's version it drops down to 3.31ms. I suspect the sklearn version becomes much faster when called on a single 2D array.
Extend the solution of @Psidom to convert the series to numpy arrays before calculating cosine_similarity, and also reshape:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'].values.reshape(1, -1),
                                      g['VPerc'].values.reshape(1, -1))[0][0]))
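For a really large file the per-group Python apply can itself become the bottleneck. Since cosine similarity is just sum(a*b) / sqrt(sum(a*a) * sum(b*b)), one fully vectorised sketch using only groupby sums (no apply; tmp, g and cos_sim are hypothetical names) would be -

import numpy as np

tmp = df.assign(ab=df['APerc'] * df['VPerc'],
                aa=df['APerc'] ** 2,
                bb=df['VPerc'] ** 2)
g = tmp.groupby(['AID', 'VID'])[['ab', 'aa', 'bb']].sum()
cos_sim = g['ab'] / np.sqrt(g['aa'] * g['bb'])   # same 0.997 / 0.514 / 0.925 values as above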
Probability tensor multiplication using pandas.DataFrame
I'm looking for a good way to store and use conditional probabilities in python. I'm thinking of using a pandas dataframe. If the conditional probabilities of some X are P(X=A|P1=1, P2=1) = 0.2, P(X=A|P1=2, P2=1) = 0.9 etc., I would use the dataframe

         A    B
P1 P2
1  1   0.2  0.8
   2   0.5  0.5
2  1   0.9  0.1
   2   0.9  0.1

and given the marginal probabilities of P1 and P2 as Series

1    0.4
2    0.6
Name: P1

1    0.7
2    0.3
Name: P2

I would like to obtain the Series of marginal probabilities of X, i.e. the series

A    0.656
B    0.344
Name: X

I can get what I want by

X = sum(
      sum(
        X.xs(i, level="P1")*P1[i]
        for i in P1.index
      ).xs(j)*P2[j]
      for j in P2.index
    )
X.name = "X"

but this is not easily generalizable to more dependencies, the asymmetry between the first xs with level and the second one without looks weird, and, as usual when working with pandas, I'm quite sure there is a better solution using its tricks and methods.

Is pandas a good tool for this? Should I represent my data in another way? And what is the best way to do this calculation, which is essentially an indexed tensor product, in pandas?
One way to vectorize is to access the values in Series P1 and P2 by indexing with an array of labels.

In [20]: df = X.reset_index()

In [21]: mP1 = P1[df.P1].values

In [22]: mP2 = P2[df.P2].values

In [23]: mP1
Out[23]: array([ 0.4,  0.4,  0.6,  0.6])

In [24]: mP2
Out[24]: array([ 0.7,  0.3,  0.7,  0.3])

In [25]: mp = mP1 * mP2

In [26]: mp
Out[26]: array([ 0.28,  0.12,  0.42,  0.18])

In [27]: X.mul(mp, axis=0)
Out[27]:
           A      B
P1 P2
1  1   0.056  0.224
   2   0.060  0.060
2  1   0.378  0.042
   2   0.162  0.018

In [28]: X.mul(mp, axis=0).sum()
Out[28]:
A    0.656
B    0.344

In [29]: sum( sum( X.xs(i, level="P1")*P1[i] for i in P1.index ).xs(j)*P2[j] for j in P2.index )
Out[29]:
A    0.656
B    0.344

(Alternately, access the values of a MultiIndex without resetting the index as follows.)

In [38]: P1[X.index.get_level_values("P1")].values
Out[38]: array([ 0.4,  0.4,  0.6,  0.6])
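To address the generalisation concern in the question: the same labels-into-values lookup extends to any number of parent variables. A sketch (marginalize is a hypothetical helper, assuming each marginal Series is named after the index level it corresponds to, as P1 and P2 are above) -

import numpy as np

def marginalize(X, marginals):
    # multiply in each parent's marginal, looked up via the matching index level, then sum it out
    weights = np.ones(len(X))
    for m in marginals:                                        # e.g. marginals = [P1, P2]
        weights = weights * m[X.index.get_level_values(m.name)].values
    return X.mul(weights, axis=0).sum()

marginalize(X, [P1, P2])   # A 0.656, B 0.344 as in Out[28]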