Dataframe column: to find (cumulative) local maxima

Dataframe column: to find (cumulative) local maxima - python

In the below dataframe the column "CumRetperTrade" is a column which consists of a few vertical vectors (=sequences of numbers) separated by zeros. (= these vectors correspond to non-zero elements of column "Portfolio"). I would like to find the cumulative local maxima of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorization - or other - methods) column "CumRetperTrade" to the column "PeakCumRet" (desired result) which gives for every vector ( = subset corresponding to ’Portfolio =1 ’) contained in column "CumRetperTrade" the cumulative maximum value of (all its previous) values. The numeric example is below. Thanks in advance!
PS In other words, I guess that we need to use cummax() but to apply it only to the consequent (where 'Portfolio' = 1) subsets of 'CumRetperTrade'
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio": [1, 1, 1, 1, 0 , 0, 0, 1, 1, 1],
"CumRetperTrade": [2, 3, 2, 1, 0 , 0, 0, 4, 2, 1],
"PeakCumRet": [2, 3, 3, 3, 0 , 0, 0, 4, 4, 4]})
df1
Portfolio CumRetperTrade PeakCumRet
0 1 2 2
1 1 3 3
2 1 2 3
3 1 1 3
4 0 0 0
5 0 0 0
6 0 0 0
7 1 4 4
8 1 2 4
9 1 1 4
PPS I already asked a similar question previously (Dataframe column: to find local maxima) and received a correct answer to my question, however in my question I did not explicitly mention the requirement of cumulative local maxima

You only need a small modification to the previous answer:
df1["PeakCumRet"] = (
df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
["CumRetperTrade"].expanding().max()
.droplevel(0)
)
expanding().max() is what produces the local maxima.

Related

Pandas dataframe from numpy array with multiindex

I'm working with a numpy array called array_test with shape (5, 359, 2). This is checked with array_test.shape. The array reflects mean and uncertainty for observations in 5 repetitions of an experiment.
The goal of this is to be able to estimate the mean value of each observation across the 5 repetitions of the experiment, and to estimate the total uncertainty per observation also a mean across the 5 repetitions.
I would need to create a pandas dataframe from it, I believe with a multiindex in which the first level would have 5 values from the first dimension (named simply '1', '2', etc.), and a second one which would be 'mean' and 'uncertainty'.
Suggestions are more than welcome!

IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:
a = np.random.random((5, 359, 2))
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
columns=['mean', 'uncertainty']).stack()
Output (a Series):
1 mean 0.499102
uncertainty 0.511757
2 mean 0.480295
uncertainty 0.473132
3 mean 0.500507
uncertainty 0.519352
4 mean 0.505443
uncertainty 0.493672
5 mean 0.514302
uncertainty 0.519299
dtype: float64
For a DataFrame:
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
columns=['mean', 'uncertainty']).stack().to_frame('value')
Output:
value
1 mean 0.499102
uncertainty 0.511757
2 mean 0.480295
uncertainty 0.473132
3 mean 0.500507
uncertainty 0.519352
4 mean 0.505443
uncertainty 0.493672
5 mean 0.514302
uncertainty 0.519299

I would approach it by using a normal Dataframe, but adding columns for the observation and experiment number.
import numpy as np
import pandas as pd
a = np.random.rand(5, 10, 2)
# Get the shape
n_experiments, n_observations, n_values = a.shape
# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)
# Create Dataframe and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])
# This returns an array, like [0, 0, 0, 0, 0, 1, 1, 1, ..., 4, 4]
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# This returns an array like [0, 1, 2, 3, 4, 0, 1, 2, ..., 3, 4]
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation
The Dataframe now looks like this:
print(df.head(15))
mean uncertainty experiment observation
0 0.741436 0.775086 0 0
1 0.401934 0.277716 0 1
2 0.148269 0.406040 0 2
3 0.852485 0.702986 0 3
4 0.240930 0.644746 0 4
5 0.309648 0.914761 0 5
6 0.479186 0.495845 0 6
7 0.154647 0.422658 0 7
8 0.381012 0.756473 0 8
9 0.939797 0.764821 0 9
10 0.994342 0.019140 1 0
11 0.300225 0.992146 1 1
12 0.265698 0.823469 1 2
13 0.791907 0.555051 1 3
14 0.503281 0.249237 1 4
Now you can analyze the Dataframe (with groupby and mean):
# Only the mean
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())
mean uncertainty
observation
0 0.699324 0.506369
1 0.382288 0.456324
2 0.333396 0.324469
3 0.690545 0.564583
4 0.365198 0.555231
5 0.453545 0.596149
6 0.526988 0.395162
7 0.565689 0.569904
8 0.425595 0.415944
9 0.731776 0.375612
Or with more advanced aggregate functions, which are probably useful for your usecase:
# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))
mean uncertainty
mean min max mean min max
observation
0 0.699324 0.297030 0.994342 0.506369 0.019140 0.974842
1 0.382288 0.063046 0.810411 0.456324 0.108774 0.992146
2 0.333396 0.148269 0.698921 0.324469 0.009539 0.823469
3 0.690545 0.175471 0.895190 0.564583 0.260557 0.721265
4 0.365198 0.015501 0.726352 0.555231 0.249237 0.929258
5 0.453545 0.111355 0.807582 0.596149 0.101421 0.914761
6 0.526988 0.323945 0.786167 0.395162 0.007105 0.691998
7 0.565689 0.154647 0.813336 0.569904 0.302157 0.964782
8 0.425595 0.116968 0.567544 0.415944 0.014439 0.756473
9 0.731776 0.411324 0.939797 0.375612 0.085988 0.764821

Change some values in column if condition is true in Pandas dataframe without loop

I have the following dataframe:
d_test = {
'random_staff' : ['gfda', 'fsd','gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
cluster_number column contains values from 1 to n. Some values could have repetition but no missing values are presented. For example above such values are: 1, 2, 3, 4.
I want to be able to select some value from cluster_number column and change every occurrence of this value to set of unique values. No missing value should be presented. For example if we select value 2 then desirable outcome for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]. Note we had three 2 in the column. We kept first one as 2 we change next occurrence of 2 to 5 and we changed last occurrence of 2 to 6.
I wrote code for the logic above and it works fine:
cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change
for index, row in df_test.iterrows():
if row['cluster_number'] == cluster_number_to_change:
df_test.loc[index, 'cluster_number'] = i
if first_iter:
i = max_cluster + 1
first_iter = False
else:
i += 1
But it is written as for-loop and I am trying understand if can be transformed in form of pandas .apply method (or any other effective vectorized solution).

Using boolean indexing:
# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()
# increment duplicates using the max as reference
df_test.loc[m1&m2, 'cluster_number'] = (
m2.where(m1).cumsum()
.add(df_test['cluster_number'].max())
.convert_dtypes()
)
print(df_test)
Output:
random_staff cluster_number
0 gfda 1
1 fsd 2
2 gec 3
3 erw 3
4 gd 5
5 kjhk 1
6 fd 4
7 kui 6

Get index of row where column value changes from previous row

I have a pandas dataframe with a column such as :
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
I have 2 types of events that can be detected from this column, I wanna label them 1 and 2 .
I need to get the indexes of each label , and to do so I need to find where the 'val' column has changed a lot (± 7 ) from previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4 ]

Use Series.diff with mask for test less values like 0, last use boolean indexing with indices:
m = df1.val.diff().lt(0)
#if need test less like -7
#m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print (one)
Int64Index([0, 1, 3, 5], dtype='int64')
print (two)
nt64Index([2, 4], dtype='int64')
If need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print (df1.val.diff())
0 NaN
1 0.02
2 -8.80
3 10.55
4 -15.06
5 917.49
Name: val, dtype: float64

Creating a pandas column of values with a calculation, but change the calculation every x times to a different one

I'm currently creating a new column in my pandas dataframe, which calculates a value based on a simple calculation using a value in another column, and a simple value subtracting from it. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
subtraction_value = 3
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]}
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9,1,2,1,-2,0,-1,3,7,6]
However, if I wanted to use a different value to subtract on the column, from position [0], then use the original subtraction value on positions [1:3] of the column, before using the second value on position [4] again, and repeat this pattern, how would I do this iteratively? I realize I could use a for loop to achieve this, but for performance reasons I'd like to do this another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6,1,2,1,-5,0,-1,3,4,6]

You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
data['new_column'] = np.where(data.index%4,
data['test']-subtraction_value,
data['test']-subtraction_value_2)
output:
test new_column
0 12 6
1 4 1
2 5 2
3 4 1
4 1 -5
5 3 0
6 2 -1
7 5 2
8 10 4
9 9 6

subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
data['new_column'][::4] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]

Python pandas: Return indices of all rows like another row

Suppose we have a toy example like below.
np.random.seed(seed=1)
df = pd.DataFrame(np.random.randint(low=0,
high=2,
size=(5, 2)))
df
0 1
0 1 1
1 0 0
2 1 1
3 1 1
4 1 0
We want to return the indices of all rows like a certain row. Suppose I want the indices of all rows like row 0, which has a 1 in both column 0 and column 1.
I would want a data structure that has: (0, 2, 3).

I think you can do it like this
df.index[df.eq(df.iloc[0]).all(1)].tolist()
[0, 2, 3]

One way may be to use lambda:
df.index[df.apply(lambda row: all(row == df.iloc[0]), axis=1)].tolist()
Other way may be to use mask :
df.index[df[df == df.iloc[0].values].notnull().all(axis=1)].tolist()
Result:
[0, 2, 3]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Dataframe column: to find (cumulative) local maxima - python

You only need a small modification to the previous answer: df1["PeakCumRet"] = ( df1.groupby(df1["Portfolio"].diff().ne(0).cumsum()) ["CumRetperTrade"].expanding().max() .droplevel(0) ) expanding().max() is what produces the local maxima.

Related

Pandas dataframe from numpy array with multiindex

Change some values in column if condition is true in Pandas dataframe without loop

Get index of row where column value changes from previous row

Creating a pandas column of values with a calculation, but change the calculation every x times to a different one

Python pandas: Return indices of all rows like another row

Categories

Resources