A complex computation of the average value in pandas - Python

This is my first question on this forum.
I am conducting experiments in which I measure the current-voltage curve of a device under different experimental conditions.
The different experimental conditions are encoded in a parameter K.
I am performing measurements of the current I using back & forth voltage sweeps, with V varying from 0 to 2V, then from 2V to -2V, and then back to 0V.
Measurements are conducted several times for each value of K to get an average of the current at each voltage point (backward and forward values). Each measurement is labelled with a parameter named iter (varying from 0 to 3, for instance).
I have collected all data into a pandas dataframe df, and below is code able to produce a typical df like the one I have (the real one is way too large):
import numpy as np
import pandas as pd

K_col = []
iter_col = []
V_col = []
I_col = []
niter = 3
V_val = [0, 1, 2, 1, 0, -1, -2, -1, 0]   # one back & forth sweep: 0 -> 2 -> -2 -> 0
K_val = [1, 2]
for K in K_val:
    for it in range(niter):
        for V in V_val:
            K_col.append(K)
            iter_col.append(it + 1)
            V_col.append(V)
            I_col.append((2*K + np.random.random()) * V)
d = {'K': K_col, 'iter': iter_col, 'V': V_col, 'I': I_col}
df = pd.DataFrame(d)
I would like to compute the average value of I at each voltage and compare the impact of the experimental condition K.
For example, let's look at 2 measurements conducted for K=1:
df[(df.K==1)&(df.iter.isin([1,2]))]
output:
K iter V I
0 1 1 0 0.000000
1 1 1 1 2.513330
2 1 1 2 4.778719
3 1 1 1 2.430393
4 1 1 0 0.000000
5 1 1 -1 -2.705487
6 1 1 -2 -4.235055
7 1 1 -1 -2.278295
8 1 1 0 0.000000
9 1 2 0 0.000000
10 1 2 1 2.535058
11 1 2 2 4.529292
12 1 2 1 2.426209
13 1 2 0 0.000000
14 1 2 -1 -2.878359
15 1 2 -2 -4.061515
16 1 2 -1 -2.294630
17 1 2 0 0.000000
We can see that for experiment 1 (iter=1), V passes through 0 multiple times (indexes 0, 4 and 8). I do not want to lose these distinct data points.
The first data point for avg_I should be (I[0]+I[9])/2, which corresponds to the first measurement at 0V. The second data point should be (I[1]+I[10])/2, corresponding to avg_I measured at 1V with increasing values of V, etc., up to (I[8]+I[17])/2, which would be my last data point at 0V.
My first thought was to use the groupby() method with K and V as keys, but this wouldn't work: V varies back & forth, so each measurement contains duplicate values of V, and the groupby would collapse them into the unique values of V, as illustrated just below.
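For illustration (a minimal sketch using the df generated above), the naive groupby collapses each sweep to its unique voltages:
df.groupby(['K', 'V'], as_index=False)['I'].mean()
# returns only 5 rows per K (one per unique V) instead of the 9 sweep positions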
The final dataframe I would like to have should look like this:
K V avg_I
0 1 0 0.000000
1 1 1 2.513330
2 1 2 4.778719
3 1 1 2.430393
4 1 0 0.000000
5 1 -1 -2.705487
6 1 -2 -4.235055
7 1 -1 -2.278295
8 1 0 0.000000
9 2 0 0.000000
10 2 1 2.513330
11 2 2 4.778719
12 2 1 2.430393
13 2 0 0.000000
14 2 -1 -2.705487
15 2 -2 -4.235055
16 2 -1 -2.278295
17 2 0 0.000000
Would anyone have an idea of how to do this?

In order to compute the mean while also taking into account the position of each observation within an iteration, you can add an extra column containing this information:
len_iter = 9                                                  # number of voltage points in one sweep
num_iter = len(df['iter'].unique())
num_K = len(df['K'].unique())
df['index'] = np.tile(np.arange(len_iter), num_iter*num_K)    # position 0-8 within each sweep
Then group by ['K', 'V', 'index'], take the mean, and drop the helper column to get the desired result:
df.groupby(['K', 'V', 'index'])['I'].mean().reset_index().drop(['index'], axis=1)
K V I
0 1 -2 -5.070126
1 1 -1 -2.598104
2 1 -1 -2.576927
3 1 0 0.000000
4 1 0 0.000000
5 1 0 0.000000
6 1 1 2.232128
7 1 1 2.359398
8 1 2 4.824657
9 2 -2 -9.031487
10 2 -1 -4.125880
11 2 -1 -4.350776
12 2 0 0.000000
13 2 0 0.000000
14 2 0 0.000000
15 2 1 4.535478
16 2 1 4.492122
17 2 2 8.569701
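As a variant of this answer (my sketch, under the same assumptions), the position within a sweep can also be derived with cumcount(), which avoids hard-coding len_iter and keeps the rows in sweep order instead of sorted by V:
df['pos'] = df.groupby(['K', 'iter']).cumcount()          # 0..8 within each (K, iter) sweep
avg = (df.groupby(['K', 'pos'], as_index=False)
         .agg(V=('V', 'first'), avg_I=('I', 'mean'))      # V is identical within each (K, pos) group
         .drop(columns='pos'))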

If I understand this correctly, you want to have a new datapoint that represents the average I for each V category. We can achieve this by computing the average value of I for each V and then mapping it onto the full dataframe.
avg_I = df.groupby(['V'], as_index=False).mean()[['V', 'I']]
df['avg_I'] = df.apply(lambda x: float(avg_I['I'][avg_I['V'] == x['V']]), axis=1)
df.head()
output:
K iter V I avg_I
0 1 1 0 0.00 0.00
1 1 1 1 2.34 3.55
2 1 1 2 4.54 6.89
3 1 1 1 2.02 3.55
4 1 1 0 0.00 0.00
df.plot()
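As a side note (an alternative not in the original answer), the per-V average can also be attached with map instead of a row-wise apply, which is usually faster on large frames; note that grouping on V alone averages over both K values and both sweep directions:
df['avg_I'] = df['V'].map(df.groupby('V')['I'].mean())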

Related

Rolling sum on a column while weighting by other column and relative position

I have a table like this:
import pandas as pd
import numpy as np   # used below for np.nan

values = [0,0,0,2000,0,0,700,0,0,530,1000,820,0,0,200]
durations = [0,0,0,12,0,0,8,0,0,2,5,15,0,0,3]
ex = pd.DataFrame({'col_to_roll': values, 'max_duration': durations})
col_to_roll max_duration
0 0 0
1 0 0
2 0 0
3 2000 12
4 0 0
5 0 0
6 700 8
7 0 0
8 0 0
9 530 2
10 1000 5
11 820 15
12 0 0
13 0 0
14 200 3
For each row position i, I want to do a rolling sum of col_to_roll between indexes i-7 and i-4 (both included). The caveat is that I want the values "further in the past" to be counted more, depending on the column max_duration (which tells for how many timesteps in the future that value can still have an effect).
There's an upper bound, which is the number of remaining timesteps to be counted (min 1, max 4). So if I'm on row number 7 doing the roll-up sum: the value on row 0 is weighted by min(max_duration[0], 4), the value on row 1 by min(max_duration[1], 3), etc.
I could do it the brute force way:
new_col = []
for i in range(7, len(ex)):
    rolled_val = sum([ex.iloc[j].col_to_roll * min(ex.iloc[j].max_duration, i-j+1-4)
                      for j in range(i-7, i-3)])
    new_col.append(rolled_val)
ex['rolled_col'] = [np.nan]*7 + new_col
Which yields the following results for the example above:
col_to_roll max_duration rolled_col
0 0 0 NaN
1 0 0 NaN
2 0 0 NaN
3 2000 12 NaN
4 0 0 NaN
5 0 0 NaN
6 700 8 NaN
7 0 0 2000.0
8 0 0 4000.0
9 530 2 6000.0
10 1000 5 8700.0
11 820 15 1400.0
12 0 0 2100.0
13 0 0 3330.0
14 200 3 2060.0
That being said, I'd appreciate a more elegant (and more importantly, more efficient) way to get this result with some pandas magic.
Just to share my ideas: this can be solved using numpy, without an explicit row-by-row loop over the dataframe.
import numpy as np

ex_len = ex.shape[0]
# row indices of the window [i-7, i-4] for every target row i
inds = np.vstack([range(i-7, i-3) for i in range(7, ex_len)])
# part one: gather the windowed values of col_to_roll
col_to_roll = np.take(ex.col_to_roll.values, inds)
# part two: cap max_duration by the remaining timesteps i-j+1-4
max_duration = np.take(ex.max_duration.values, inds)
duration_to_compare = np.array([[i-j+1-4 for j in range(i-7, i-3)] for i in range(7, ex_len)])
min_mask = max_duration > duration_to_compare
max_duration[min_mask] = duration_to_compare[min_mask]
new_col = np.sum(col_to_roll * max_duration, axis=1)
ex['rolled_col'] = np.concatenate(([np.nan]*7, new_col))
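A small follow-up observation (mine, not part of the original answer): for this window the per-row weights i-j+1-4 are always [4, 3, 2, 1], so duration_to_compare can also be built without the double comprehension:
duration_to_compare = np.tile([4, 3, 2, 1], (ex_len - 7, 1))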
Here is my humble idea for an elegant and efficient method for this task. To avoid reinventing the wheel, let's install pandarallel by invoking pip install pandarallel. I am a fan of multiprocessing, and it should help with larger data.
import pandas as pd
import numpy as np
from operator import mul                 # element-wise products in rocknroll
from pandarallel import pandarallel

def rocknroll(index):
    if index >= 7:
        a = ex['col_to_roll'].iloc[index-7:index-3]
        b = map(min, ex['max_duration'].iloc[index-7:index-3], [4, 3, 2, 1])
        return sum(map(mul, a, b))
    else:
        return np.nan

pandarallel.initialize()
values = [0,0,0,2000,0,0,700,0,0,530,1000,820,0,0,200]
durations = [0,0,0,12,0,0,8,0,0,2,5,15,0,0,3]
ex = pd.DataFrame({'col_to_roll': values, 'max_duration': durations})
ex['index_copy'] = list(range(0, len(ex)))
ex['rolled_col'] = ex['index_copy'].parallel_apply(rocknroll)   # use the pandarallel workers initialized above
ex.drop(columns=['index_copy'], inplace=True)
print(ex)
Output:
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
col_to_roll max_duration rolled_col
0 0 0 NaN
1 0 0 NaN
2 0 0 NaN
3 2000 12 NaN
4 0 0 NaN
5 0 0 NaN
6 700 8 NaN
7 0 0 2000.0
8 0 0 4000.0
9 530 2 6000.0
10 1000 5 8700.0
11 820 15 1400.0
12 0 0 2100.0
13 0 0 3330.0
14 200 3 2060.0
Further information about proper element-wise operation can be found here Element-wise addition of 2 lists?
You can use rolling(), in combination with apply, to calculate the rolled_col sum over the specified rolling windows.
First calculate the window size using the lower & upper bound (and add 1 to include both indices). This enables you to play around with different time intervals.
lower_bound = -7
upper_bound = -4
window_size = upper_bound - lower_bound + 1
Second, define the function to apply on each rolling window. In your case, take the product of col_to_roll and the element-wise minimum of max_duration and the list [4, 3, 2, 1], and sum all values in the sliding window.
def calculate_rolled_count(series, ex):
    index = series.index
    min_values = np.minimum(ex.loc[index, 'max_duration'].values, list(range(4, 0, -1)))
    return np.sum(ex.loc[index, 'col_to_roll'] * min_values)
Finally, assign a new column rolled_col to your original dataframe and apply the defined function over all rolling windows. We have to shift the column to make each value correspond to the desired row (as the rolling window by default places the value at the right bound of the window).
ex.assign(rolled_col=lambda x: x.rolling(window_size)
                                .apply(lambda w: calculate_rolled_count(w, ex))
                                .shift(-upper_bound)['max_duration'])
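One caveat (my note, not from the original answer): this relies on each rolling window being passed to the function as a Series so that series.index is available, i.e. apply(..., raw=False), which is the default in recent pandas versions.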
Result
col_to_roll max_duration rolled_col
0 0 0 NaN
1 0 0 NaN
2 0 0 NaN
3 2000 12 NaN
4 0 0 NaN
5 0 0 NaN
6 700 8 NaN
7 0 0 2000.0
8 0 0 4000.0
9 530 2 6000.0
10 1000 5 8700.0
11 820 15 1400.0
12 0 0 2100.0
13 0 0 3330.0
14 200 3 2060.0

Diminishing weights for pandas group by

I have the following df: a visitor can make multiple visits, and the number of page views is recorded for each visit.
df = pd.DataFrame({'visitor_id':[1,1,2,1],'visit_id':[1,2,1,3], 'page_views':[10,20,30,40]})
page_views visit_id visitor_id
0 10 1 1
1 20 2 1
2 30 1 2
3 40 3 1
What I need is to create an additional column called weight, which will diminish with a certain parameter. For example, if this parameter is 1/2, the newest visit has a weight of 1, 2nd newest visit a weight of 1/2, 3rd is 1/4 and so on.
E.g. I want my dataframe to look like:
page_views visit_id visitor_id weight
0 10 1(oldest) 1 0.25
1 20 2 1 0.5
2 30 1(newest) 2 1
3 40 3(newest) 1 1
Then I will be able to group using their weight, e.g.
df.groupby(['visitor_id']).weight.sum(), to get weighted page views per visitor (see the sketch just below).
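For instance (a minimal sketch, assuming the weight column has been computed as described), the weighted page views per visitor could be obtained with:
df['weighted_views'] = df['page_views'] * df['weight']   # weight each visit's page views
df.groupby('visitor_id')['weighted_views'].sum()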
Doesn't work as expected:
df = pd.DataFrame({'visitor_id':[1,1,2,2,1,1],'visit_id':[5,6,1,2,7,8], 'page_views':[10,20,30,30,40,50]})
df['New']=df.groupby('visitor_id').visit_id.transform('max') - df.visit_id
df['weight'] = pd.Series([1/2]*len(df)).pow(df.New.values)
df
page_views visit_id visitor_id New weight
0 10 5 1 3 0
1 20 6 1 2 0
2 30 1 2 1 0
3 30 2 2 0 1
4 40 7 1 1 0
5 50 8 1 0 1
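A likely explanation (my note, not in the original post): under Python 2, 1/2 is integer division and evaluates to 0, so the base series is all zeros and 0**New is 1 only where New is 0. Using a float base avoids this:
df['weight'] = pd.Series([0.5] * len(df)).pow(df.New.values)
# or simply: df['weight'] = 0.5 ** df['New']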
Is this what you need?
df.groupby('visitor_id').visit_id.apply(lambda x: 1*1/2**(max(x)-x))
Out[1349]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: visit_id, dtype: float64
Maybe try this
df['New']=df.groupby('visitor_id').visit_id.transform('max')-df.visit_id
pd.Series([1/2]*len(df)).pow(df.New.values)
Out[45]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: New, dtype: float64

Replacing values in dataframe with 0s and 1s based on conditions

I would like to filter and replace. For the values which are lower or higher than zero (i.e. nonzero) and not NaN, I would like to set them to one, and set all the others to zero.
mask = (ts[x] > 0) | (ts[x] < 0)
ts[mask] = 1
ts[ts[x] == 1]
I did this and it is working, but I still have to deal with the values that do not meet this condition by replacing them with zero.
Any recommendations? I am quite confused; also, would it be better to use the where function in this case?
Thanks all!
Sample Data
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
Expected result
asset.relativeSetpoint.350
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
You can do this by applying a logical AND on the two conditions and converting the resultant mask to integer.
df
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
(df['asset.relativeSetpoint.350'].ne(0)
& df['asset.relativeSetpoint.350'].notnull()).astype(int)
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
Name: asset.relativeSetpoint.350, dtype: int64
The first condition df['asset.relativeSetpoint.350'].ne(0) gets a boolean mask of all elements that are not equal to 0 (this would include <0, >0, and NaN).
The second condition df['asset.relativeSetpoint.350'].notnull() will get a boolean mask of elements that are not NaNs.
The two masks are ANDed, and converted to integer.
How about using apply?
df[COLUMN_NAME] = df[COLUMN_NAME].apply(lambda x: 1 if x != 0 and pd.notna(x) else 0)  # NaN also maps to 0
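Since the question also asks about a where-style function, here is a small sketch (not from the original answers) doing the same thing with np.where in a vectorized way:
import numpy as np
col = 'asset.relativeSetpoint.350'
df[col] = np.where(df[col].notnull() & df[col].ne(0), 1, 0)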

Count occurrence of items in column depending on value in other column - Python

I currently have a table similar to this one:
CRED | ACBA
1 | 2
0 | 3
1 | 4
1 | 2
0 | 1
etc...
I was able to get information on the frequency of occurrence of a category (1,2,3,4) in column ACBA depending on the value in CRED (1,0) using:
pd.crosstab(df.CRED, df.ACBA)
ACBA 1 2 3 4
CRED
0 9 11 1 7
1 18 22 4 28
Now I would like to sum the counts for a specific value of CRED and then divide each count in that row by the sum, creating a new table with the result. Ex:
For CRED = 0 --> 9+11+1+7=28, then --> 9/28 11/28 1/28 7/28, to reach the final table:
1 2 3 4
CRED0 0.32 0.39 0.04 0.25
Does anyone have an idea of how to do this? I am new to Python and completely stuck on this. The idea is that I would repeat this technique across 22 other columns. Thanks
a = {'CRED': [1,0,1,1,0], 'ACBA': [2,3,4,2,1]}
df = pd.DataFrame(a)
output
ACBA CRED
0 2 1
1 3 0
2 4 1
3 2 1
4 1 0
then, as you did, crosstab it:
df1 = pd.crosstab(df.CRED, df.ACBA)
ACBA 1 2 3 4
CRED
0 1 0 1 0
1 0 2 0 1
then get the percentage
df1.apply(lambda a: a / a.sum() * 100, axis=1)
ACBA 1 2 3 4
CRED
0 50.0 0.000000 50.0 0.000000
1 0.0 66.666667 0.0 33.333333
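As a side note (not part of the original answer), recent pandas versions can perform the row-wise normalization directly inside crosstab via the normalize argument:
pd.crosstab(df.CRED, df.ACBA, normalize='index')   # fractions per CRED row instead of raw counts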

How to normalize values in a dataframe column in different ranges

I have a dataframe like this:
T data
0 0 10
1 1 20
2 2 30
3 3 40
4 4 50
5 0 5
6 1 13
7 2 21
8 0 3
9 1 7
10 2 11
11 3 15
12 4 19
The values in T are sequences which all range from 0 up to a certain value whereby the maximal number can differ between the sequences.
Normally, the values in data are NOT equally spaced; they are here just for demonstration purposes.
What I want to achieve is to add a third column called dataDiv where each value in data of a certain sequence is divided by the value at T = 0 that belongs to the respective sequence. In my case, I have 3 sequences and for the first sequence I want to divide each value by 10, in the second sequence each value should be divided by 5 and in the third by 3.
So the expected outcome would look like this:
T data dataDiv
0 0 10 1.000000
1 1 20 2.000000
2 2 30 3.000000
3 3 40 4.000000
4 4 50 5.000000
5 0 5 1.000000
6 1 13 2.600000
7 2 21 4.200000
8 0 3 1.000000
9 1 7 2.333333
10 2 11 3.666667
11 3 15 5.000000
12 4 19 6.333333
The way I currently implement it is as follows:
I first determine the indices at which T = 0. Then I loop through these indices and divide the data in data by the value at T=0 of the respective sequence which gives me the desired output (which is shown above). The code looks as follows:
import pandas as pd
df = pd.DataFrame({'T': list(range(5)) + list(range(3)) + list(range(5)),
                   'data': list(range(10, 60, 10)) + list(range(5, 25, 8)) + list(range(3, 21, 4))})
# get indices where T = 0
idZE = df[df['T'] == 0].index.tolist()
# last index of dataframe
idZE.append(max(df.index) + 1)
# add the column with normalized values
df['dataDiv'] = df['data']
# loop through indices where T = 0 and normalize values
for ix, indi in enumerate(idZE[:-1]):
    df['dataDiv'].iloc[indi:idZE[ix + 1]] = df['data'].iloc[indi:idZE[ix + 1]] / df['data'].iloc[indi]
My question is: Is there any smarter solution than this which avoids the loop?
The following approach avoids loops in favour of vectorized computations and should perform faster. The basic idea is to label runs of integers in column 'T', find the first value in each of these groups and then divide the values in 'data' by the appropriate first value.
df['grp'] = (df['T'] == 0).cumsum() # label consecutive runs of integers
x = df.groupby('grp')['data'].first() # first value in each group
df['dataDiv'] = df['data'] / df['grp'].map(x) # divide
This gives the DataFrame with the desired column:
T data grp dataDiv
0 0 10 1 1.000000
1 1 20 1 2.000000
2 2 30 1 3.000000
3 3 40 1 4.000000
4 4 50 1 5.000000
5 0 5 2 1.000000
6 1 13 2 2.600000
7 2 21 2 4.200000
8 0 3 3 1.000000
9 1 7 3 2.333333
10 2 11 3 3.666667
11 3 15 3 5.000000
12 4 19 3 6.333333
(You can then drop the 'grp' column if you wish: df.drop('grp', axis=1).)
As @DSM points out below, the three lines of code could be collapsed into one with the use of groupby.transform:
df['dataDiv'] = df['data'] / df.groupby((df['T'] == 0).cumsum())['data'].transform('first')
