Create new column based on max value of groupby pandas - python

I'm trying to create a new column based on a groupby function, but I'm running into an error. In the sample dataframe below, I want the new column to have a value only in the rows corresponding to each user's maximum seq value. So, for instance, user122 would only have a number in the 3rd row, where seq is 3 (this user's highest seq number).
import numpy as np
import pandas as pd

df = pd.DataFrame({
'user':
{0: 'user122',
1: 'user122',
2: 'user122',
3: 'user124',
4: 'user125',
5: 'user125',
6: 'user126',
7: 'user126',
8: 'user126'},
'baseline':
{0: 4.0,
1: 4.0,
2: 4.0,
3: 2,
4: 4,
5: 4,
6: 5,
7: 5,
8: 5},
'score':
{0: np.nan,
1: 3,
2: 2,
3: 5,
4: np.nan,
5: 6,
6: 3,
7: 2,
8: 1},
'binary':
{0: 1,
1: 1,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 0,
8: 1},
'var1':
{0: 3,
1: 5,
2: 5,
3: 1,
4: 1,
5: 1,
6: 1,
7: 3,
8: 5},
'seq':
{0: 1,
1: 2,
2: 3,
3: 1,
4: 1,
5: 2,
6: 1,
7: 2,
8: 3},
})
The function I used is below
df['newnum'] = np.where(df.groupby('user')['seq'].max(), random.randint(4, 9), 'NA')
The shapes of the groupby result and the original column are not the same, so I run into an error. I thought that if I specified multiple conditions in np.where it would put "NA" in all of the places where seq was not the max value, but this didn't happen:
ValueError: Length of values (4) does not match length of index (9)
Anyone else have a better idea?
And, if possible, I'd ideally like the newnum variable to be a multiple of the baseline (but that seemed too complicated, so I just used a random digit).
Thanks for any help!

The groupby returns one row per user, so it doesn't match 1:1 with your dataframe; hence the error.
Here is how you can accomplish it:
# using transform with the groupby to broadcast the max
# back against each of the items in the group
df['newnum'] = np.where(df.groupby('user')['seq'].transform('max').eq(df['seq']),
                        np.random.randint(4, 9),
                        np.nan)
df
user baseline score binary var1 seq newnum
0 user122 4.0 NaN 1 3 1 NaN
1 user122 4.0 3.0 1 5 2 NaN
2 user122 4.0 2.0 0 5 3 6.0
3 user124 2.0 5.0 0 1 1 6.0
4 user125 4.0 NaN 0 1 1 NaN
5 user125 4.0 6.0 0 1 2 6.0
6 user126 5.0 3.0 1 1 1 NaN
7 user126 5.0 2.0 0 3 2 NaN
8 user126 5.0 1.0 1 5 3 6.0
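Note that np.random.randint(4, 9) is evaluated once and broadcast, which is why every matching row gets the same draw (6.0 here). To get closer to the follow-up wish of making newnum a multiple of the baseline, a sketch reusing the same mask (the per-row multiplier range 1-3 is an assumption for illustration):
# hypothetical variant: newnum = baseline * a random per-row multiplier
mask = df.groupby('user')['seq'].transform('max').eq(df['seq'])
df['newnum'] = np.where(mask,
                        df['baseline'] * np.random.randint(1, 4, size=len(df)),
                        np.nan)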

idxmax = df.groupby('user')['seq'].idxmax()
df.loc[idxmax, 'newnum'] = ...
Notes:
In the first line of the code above, we get the indexes of df where the maximum seq is reached for each user.
In the second line, we create a new column newnum and, at the same time, assign values to it at the idxmax positions. The other values are NaN by default.
Update
When we assign a numpy.ndarray vector to a new column of a pandas.DataFrame, all data frame indexes are used by default to populate the column with values from the vector. If the number of indexes differs from the vector's length, you get a ValueError about the size mismatch, as in your case. To avoid it, we have to restrict the data frame indexes to those used in the assignment. That's the meaning of df.loc[idxmax, 'newnum'], where we address df cells in the new column 'newnum' with indexes from idxmax.
There's a restriction: only the first occurrence of a maximum per user will be used. If there are two or more occurrences of the max value for some user, all but the first will be ignored. You've provided test values with growing seq for each user. If this assumption (about growing seq) is right, then we can use idxmax. Taking the last value in each group would work just as well in this case:
idxmax = df.groupby('user').tail(1).index
If more than one occurrence of the max is expected, then we should use some other method, e.g.:
df.groupby('user')['seq'].transform('max').eq(df['seq'])
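A sketch of how that mask can be combined with .loc assignment (the same pattern as idxmax above, but it also covers ties on the max; here the random values are drawn per row):
mask = df.groupby('user')['seq'].transform('max').eq(df['seq'])
df.loc[mask, 'newnum'] = np.random.randint(4, 9, size=mask.sum())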

Related

How to vectorize this for loop?

I need help vectorizing this for loop.
I couldn't come up with my own solution.
The general idea is that I want to calculate the number of bars since the last time the condition was true.
I have a DataFrame with initial values 0 and 1, where 0 is the anchor point at which counting starts and stops (0 means that the condition was met for the index in this cell).
For example, the initial DataFrame would look like this (I am typing only the series' raw values and omitting column names etc.):
NaN, NaN, NaN, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0
The output should look like this:
NaN, NaN, NaN, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3, 4, 5, 0
My current code:
cond_count = pd.DataFrame(index=range(cond.shape[0]), columns=range(1))
cond_count = cond_count.rename(columns={0: 'Bars Since'})  # rename returns a copy
cond_count['Bars Since'] = 'NaN'
cond_count['Bars Since'].iloc[indices_cond_met] = 0
cond_count['Bars Since'].iloc[indices_cond_cut] = 1
for i in range(cond_count.shape[0]):
    if cond_count['Bars Since'].iloc[i] == 'NaN':
        pass
    elif cond_count['Bars Since'].iloc[i] == 0:
        j = i + 1  # walk forward until the next anchor
        while j < cond_count.shape[0] and cond_count['Bars Since'].iloc[j] != 0:
            cond_count['Bars Since'].iloc[j] = cond_count['Bars Since'].shift(1).iloc[j] + 1
            j += 1
    else:
        pass
import numpy as np
import pandas as pd
df = pd.DataFrame({'data': [np.nan, np.nan, np.nan, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})
df['cs'] = df['data'].le(0).cumsum()
aaa = df.groupby(['data', 'cs'])['data'].apply(lambda x: x.cumsum())
df.loc[aaa.index[0]:, 'data'] = aaa
df = df.drop(['cs'], axis=1)  # if you need to remove the auxiliary column
print(df)
Output
data
0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
5 2.0
6 3.0
7 4.0
8 0.0
9 1.0
10 2.0
11 0.0
12 1.0
13 2.0
14 3.0
15 4.0
16 5.0
17 0.0
Here le is used to get True where the value is 0.
Then cumsum() is applied, thereby marking the rows into groups.
In the series aaa, the grouping is applied over the columns 'data' and 'cs', and the data is passed to apply, which applies cumsum() within each group.
The slice df.loc[aaa.index[0]:, 'data'] in loc overwrites the rows from the first non-NaN index onward.
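An alternative formulation of the same idea (a sketch, not the answer above): start a new group at each 0 via cumsum, number the rows within each group with cumcount, and mask the leading NaNs:
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [np.nan, np.nan, np.nan, 0, 1, 1, 1, 1, 0,
                            1, 1, 0, 1, 1, 1, 1, 1, 0]})
groups = df['data'].eq(0).cumsum()                  # a new group starts at each 0
df['bars_since'] = (df.groupby(groups).cumcount()   # 0, 1, 2, ... within each group
                      .where(df['data'].notna()))   # keep the leading NaNs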

Calculating the proportion of total 'yes' values in a group

I have a dataframe that looks like this:
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster
1    1000   1005   6     7      13     Y           3                 6     36346
1    1007   10012  3     1      4      N           3                 6     36346
1    10014  10020  0     1      1      Y           3                 6     36346
2    33532  33554  1     1      2      N           1                 2     22123
cluster is an ID assigned to each row; in this case, cluster 36346 has 3 "sites".
In this cluster, two of these sites are in the control (in_control == Y).
I want to create an additional column that tells me what proportion of the sites are in the control, i.e. (sum(in_control == Y) for a cluster) / sites_in_cluster.
In this example, cluster 36346 has two rows with in_control == Y and 3 sites_in_cluster, so cluster_sites_in_control would be 2/3 = 0.66, whereas cluster 22123 has only one site, which isn't in the control, so it would be 0/1 = 0:
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster  cluster_sites_in_control
1    1000   1005   6     7      13     Y           3                 6     36346    0.66
1    1007   10012  3     1      4      N           3                 6     36346    0.66
1    10014  10020  0     1      1      Y           3                 6     36346    0.66
2    33532  33554  1     1      2      N           1                 2     22123    0.00
I have created code which seemingly accomplishes this; however, it seems extremely roundabout and I'm certain there's a better solution out there:
intersect_in_control
# %%
import pandas as pd
#get the number of sites in a control that are 'Y'
number_in_control = pd.DataFrame(intersect_in_control.groupby(['cluster']).in_control.value_counts().unstack(fill_value=0).loc[:,'Y'])
#get the number of breaksites for that cluster
number_of_breaksites = pd.DataFrame(intersect_in_control.groupby(['cluster'])['sites_in_cluster'].count())
#combine these two dataframes
combined_dataframe = pd.concat([number_in_control.reset_index(drop=False), number_of_breaksites.reset_index(drop=True)], axis=1)
#calculate the desired column
combined_dataframe["proportion_in_control"] = combined_dataframe["Y"]/combined_dataframe["sites_in_cluster"]
#left join this new dataframe to the original whilst dropping undesired columns.
cluster_in_control = intersect_in_control.merge((combined_dataframe.drop(["Y","sites_in_cluster"], axis = 1)), on='cluster', how='left')
10 rows of the df as example data:
{'chr': {0: 'chr14',
1: 'chr2',
2: 'chr1',
3: 'chr10',
4: 'chr17',
5: 'chr17',
6: 'chr2',
7: 'chr2',
8: 'chr2',
9: 'chr1',
10: 'chr1'},
'start': {0: 23016497,
1: 133031338,
2: 64081726,
3: 28671025,
4: 45219225,
5: 45219225,
6: 133026750,
7: 133026761,
8: 133026769,
9: 1510391,
10: 15853061},
'end': {0: 23016501,
1: 133031342,
2: 64081732,
3: 28671030,
4: 45219234,
5: 45219234,
6: 133026755,
7: 133026763,
8: 133026770,
9: 1510395,
10: 15853067},
'plus_count': {0: 2,
1: 0,
2: 5,
3: 1,
4: 6,
5: 6,
6: 14,
7: 2,
8: 0,
9: 2,
10: 4},
'minus_count': {0: 6,
1: 7,
2: 1,
3: 5,
4: 0,
5: 0,
6: 0,
7: 0,
8: 2,
9: 3,
10: 1},
'count': {0: 8, 1: 7, 2: 6, 3: 6, 4: 6, 5: 6, 6: 14, 7: 2, 8: 2, 9: 5, 10: 5},
'in_control': {0: 'N',
1: 'N',
2: 'Y',
3: 'N',
4: 'Y',
5: 'Y',
6: 'N',
7: 'Y',
8: 'N',
9: 'Y',
10: 'Y'},
'total_breaks': {0: 8,
1: 7,
2: 6,
3: 6,
4: 6,
5: 6,
6: 18,
7: 18,
8: 18,
9: 5,
10: 5},
'sites_in_cluster': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 3,
7: 3,
8: 3,
9: 1,
10: 1},
'mean_breaks_per_site': {0: 8.0,
1: 7.0,
2: 6.0,
3: 6.0,
4: 6.0,
5: 6.0,
6: 6.0,
7: 6.0,
8: 6.0,
9: 5.0,
10: 5.0},
'cluster': {0: 22665,
1: 24664,
2: 3484,
3: 13818,
4: 23640,
5: 23640,
6: 24652,
7: 24652,
8: 24652,
9: 48,
10: 769}}
Thanks in advance for any help :)
For the percentage, it is possible to simplify the solution by taking the mean of the boolean column per group; to create the new column, use GroupBy.transform. It works well because True values are processed like 1:
df['cluster_sites_in_control'] = (df['in_control'].eq('Y')
                                    .groupby(df['cluster']).transform('mean'))
print (df)
chr start end plus minus total in_control sites_in_cluster mean \
0 1 1000 1005 6 7 13 Y 3 6
1 1 1007 10012 3 1 4 N 3 6
2 1 10014 10020 0 1 1 Y 3 6
3 2 33532 33554 1 1 2 N 1 2
cluster cluster_sites_in_control
0 36346 0.666667
1 36346 0.666667
2 36346 0.666667
3 22123 0.000000
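Written out more explicitly (a sketch of the equivalent computation): count the Y rows per cluster with transform('sum') and divide by the group size, which in this data equals sites_in_cluster:
is_control = df['in_control'].eq('Y')
df['cluster_sites_in_control'] = (is_control.groupby(df['cluster']).transform('sum')
                                  / df.groupby('cluster')['cluster'].transform('size'))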

In the data frame of probabilities over time return first column name where value is < .5 for each row

Given a pandas data frame like the following, where the column names are the time, the rows are the subjects, and the values are probabilities, return the column name (or time) of the first time the probability is less than .50 for each subject. The probabilities always descend from 1 to 0. I have tried looping through the data frame, but it is not computationally efficient.
subject id  0  1         2         3         4         5         6         7         …  669       670       671
1           1  0.997913  0.993116  0.989017  0.976157  0.973078  0.968056  0.963685  …  0.156092  0.156092  0.156092
2           1  0.990335  0.988685  0.983145  0.964912  0.958     0.952     0.946995  …  0.148434  0.148434  0.148434
3           1  0.996231  0.990571  0.985775  0.976809  0.972736  0.969633  0.966116  …  0.17037   0.17037   0.17037
4           1  0.997129  0.994417  0.991054  0.978795  0.974216  0.96806   0.963039  …  0.15192   0.15192   0.15192
5           1  0.997728  0.993598  0.986641  0.98246   0.977371  0.972874  0.96816   …  0.154545  0.154545  0.154545
6           1  0.998134  0.995564  0.989901  0.986941  0.982313  0.972951  0.969645  …  0.17473   0.17473   0.17473
7           1  0.995681  0.994131  0.990401  0.974494  0.967941  0.961859  0.956636  …  0.144753  0.144753  0.144753
8           1  0.997541  0.994904  0.991941  0.983389  0.979375  0.973158  0.966358  …  0.158763  0.158763  0.158763
9           1  0.992253  0.989064  0.979258  0.955747  0.948842  0.942899  0.935784  …  0.150291  0.150291  0.150291
Goal output:
subject id  time prob < .5
1           100
2           99
3           34
4           19
5           600
6           500
7           222
8           111
9           332
Since the probabilities are always descending you can do this:
>>> df.set_index("subject id").gt(.98).sum(1)
subject id
1 4
2 4
3 4
4 4
5 5
6 6
7 4
8 5
9 3
dtype: int64
note: I'm using .98 instead of .5 because I'm using only a portion of the data.
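If you need the column label rather than the count: because the values descend, the count of values above the threshold is exactly the positional index of the first value at or below it, so it can be looked up positionally (a sketch, assuming every row eventually drops below the threshold):
probs = df.set_index('subject id')
counts = probs.gt(.98).sum(axis=1)  # position of the first value <= .98
first_below = pd.Series(probs.columns[counts.to_numpy()], index=counts.index)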
Data used
{'subject id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
'0': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
'1': {0: 0.997913,
1: 0.990335,
2: 0.996231,
3: 0.997129,
4: 0.997728,
5: 0.998134,
6: 0.995681,
7: 0.997541,
8: 0.992253},
'2': {0: 0.993116,
1: 0.988685,
2: 0.990571,
3: 0.994417,
4: 0.993598,
5: 0.995564,
6: 0.994131,
7: 0.994904,
8: 0.989064},
'3': {0: 0.989017,
1: 0.983145,
2: 0.985775,
3: 0.991054,
4: 0.986641,
5: 0.989901,
6: 0.990401,
7: 0.991941,
8: 0.979258},
'4': {0: 0.976157,
1: 0.964912,
2: 0.976809,
3: 0.978795,
4: 0.98246,
5: 0.986941,
6: 0.974494,
7: 0.983389,
8: 0.955747},
'5': {0: 0.973078,
1: 0.958,
2: 0.972736,
3: 0.974216,
4: 0.977371,
5: 0.982313,
6: 0.967941,
7: 0.979375,
8: 0.948842},
'6': {0: 0.968056,
1: 0.952,
2: 0.969633,
3: 0.96806,
4: 0.972874,
5: 0.972951,
6: 0.961859,
7: 0.973158,
8: 0.942899},
'7': {0: 0.963685,
1: 0.946995,
2: 0.966116,
3: 0.963039,
4: 0.96816,
5: 0.969645,
6: 0.956636,
7: 0.966358,
8: 0.935784}}
If I understand correctly, I think this is what you are looking for:
df.where(df.lt(.5)).idxmax(axis=1)
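This also relies on the values descending: among the values below the cutoff, the largest is the first one, so idxmax(axis=1) returns its column label. On the truncated sample above, where nothing drops below .5, the same idea can be exercised with the .98 threshold:
probs = df.set_index('subject id')
print(probs.where(probs.lt(.98)).idxmax(axis=1))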

How to multiply columns in the same position from two different pandas dataframes?

I am trying to multiply columns from two dataframes in which the same kinds of columns appear in the same positions. For example, in the dataframes below, df1's columns and df2's columns are essentially the same and come in the same order. The only difference is that the df2 columns have a suffix and their data type is float. The column positions matter in that the first column of df1 is a dichotomization of the first column of df2. For a certain purpose, I need to multiply each df2 value by the dichotomized df1 value, column by column, and then sum the products row-wise. This should produce a single column of sums that I need for something else.
First dataframe:
df1 = {'a': {0: 0,
1: 0,
2: 0,
3: 0,
4: 1},
'b': {0: 1, 1: 0, 2: 1, 3: 0, 4: 0},
'c': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
'd': {0: 0, 1: 1, 2: 1, 3: 0, 4: 0},
'e': {0: 0, 1: 1, 2: 0, 3: 1, 4: 0},
'f': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'g': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0},
'h': {0: 1,
1: 0,
2: 1,
3: 1,
4: 0},
'i': {0: 0,
1: 1,
2: 0,
3: 1,
4: 0},
'j': {0: 1, 1: 0, 2: 0, 3: 0, 4: 1}}
Second dataframe
df2 = {'a_top3': {0: 0.084973365,
1: 0.057013709,
2: 0.072325557,
3: 0.098824218,
4: 0.252425998},
'b_top3': {0: 0.168823063,
1: 0.044829924,
2: 0.178180799,
3: 0.032501712,
4: 0.054869764},
'c_top3': {0: 0.040331405,
1: 0.042758454,
2: 0.077851109,
3: 0.111247674,
4: 0.160724968},
'd_top3': {0: 0.11076121,
1: 0.156901404,
2: 0.111759722,
3: 0.031440482,
4: 0.046660293},
'e_top3': {0: 0.059534989,
1: 0.090733215,
2: 0.087737411,
3: 0.141953781,
4: 0.011520214},
'f_top3': {0: 0.067696713,
1: 0.081674345,
2: 0.034215827,
3: 0.075849444,
4: 0.011245198},
'g_top3': {0: 0.041895844,
1: 0.048191357,
2: 0.102012217,
3: 0.100579783,
4: 0.034403443},
'h_top3': {0: 0.124932915,
1: 0.085968919,
2: 0.220041335,
3: 0.155145347,
4: 0.032171372},
'i_top3': {0: 0.103714436,
1: 0.349804282,
2: 0.077229746,
3: 0.150859997,
4: 0.081321001},
'j_top3': {0: 0.197336018,
1: 0.042124409,
2: 0.038646296,
3: 0.101597518,
4: 0.314657748}}
I need a column such that it is a sum of product of each column in the same position. For example,
prod_sum = df1[['a','b','c']].mul(df2[['a_top3', 'b_top3', 'c_top3']], axis=0).sum(axis=1)
should produce the following:
The method I tried is shown above, but all I get is NaN. I can do this with a loop, but I'm curious to find out if there's a more pythonic way of doing this.
Let's take a subset of your data ( the first three columns of df1 and df2):
In [362]: temp1 = df1.loc[:, ['a','b','c']]
...: temp2 = df2.iloc[:, :3]
In [363]: temp1
Out[363]:
a b c
0 0 1 0
1 0 0 0
2 0 1 0
3 0 0 0
4 1 0 1
In [364]: temp2
Out[364]:
a_top3 b_top3 c_top3
0 0.084973 0.168823 0.040331
1 0.057014 0.044830 0.042758
2 0.072326 0.178181 0.077851
3 0.098824 0.032502 0.111248
4 0.252426 0.054870 0.160725
When multiplying (or doing any similar operation), Pandas will try to align the index and columns. In this scenario, we need a way to align the column names of temp1 (a, b, c) with those of temp2 (a_top3, ...). The simplest solution in this case is to drop the _top3 suffixes from temp2; Pandas will then successfully multiply the matching columns and return what you need:
In [367]: temp1.mul(temp2.rename(columns = lambda x: x[0])).sum(1)
Out[367]:
0 0.168823
1 0.000000
2 0.178181
3 0.000000
4 0.413151
dtype: float64
Extending the same idea to df1 and df2 :
In [368]: df1.mul(df2.rename(columns = lambda x: x[0])).sum(1)
Out[368]:
0 0.491092
1 0.597439
2 0.509982
3 0.447959
4 0.727809
dtype: float64
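Since the question guarantees the columns line up positionally, another sketch that sidesteps label alignment entirely by multiplying the underlying numpy arrays:
# assumes df1 and df2 are the DataFrames built from the dicts above
prod_sum = pd.Series((df1.to_numpy() * df2.to_numpy()).sum(axis=1), index=df1.index)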
First, make use of the merge() method:
result = df1[['a','b','c']].merge(df2[['a_top3', 'b_top3', 'c_top3']], left_index=True, right_index=True)
Finally, make use of the apply() method with an anonymous function:
result = result.apply(lambda x: x['a']*x['a_top3'] + x['b']*x['b_top3'] + x['c']*x['c_top3'], axis=1)
Now if you print result you will get:
0 0.168823
1 0.000000
2 0.178181
3 0.000000
4 0.413151
dtype: float64
Since the series contains float data, you get 0.000000 rather than 0.
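The suffix-stripping idea from the first answer can also be written as a one-liner with set_axis (a sketch): relabel df2's columns to df1's so the labels align, then multiply and sum row-wise:
prod_sum = df1.mul(df2.set_axis(df1.columns, axis=1)).sum(axis=1)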

Iterate over a pandas data frame whose rows contain arrays and compute a moving average based on a condition

I can't figure out a problem I am trying to solve.
I have a pandas data frame coming from this:
date, id, measure, result
2016-07-11, 31, "[2, 5, 3, 3]", 1
2016-07-12, 32, "[3, 5, 3, 3]", 1
2016-07-13, 33, "[2, 1, 2, 2]", 1
2016-07-14, 34, "[2, 6, 3, 3]", 1
2016-07-15, 35, "[39, 31, 73, 34]", 0
2016-07-16, 36, "[3, 2, 3, 3]", 1
2016-07-17, 37, "[3, 8, 3, 3]", 1
The measure column consists of arrays in string format.
I want a new moving-average-array column computed from the past 3 measurement records, excluding records where the result is 0. "Past 3 records" means that for id 34, the arrays of ids 31, 32 and 33 are used.
It is about averaging every 1st point, every 2nd point, every 3rd and every 4th point across those arrays to build the moving-average array.
It is not about averaging each array and then averaging those averages, no.
For the first 3 rows, because there is not enough history, I just want to use their own measurement. So the solution should look like this:
date, id, measure, result . Solution
2016-07-11, 31, "[2, 5, 3, 3]", 1, "[2, 5, 3, 3]"
2016-07-12, 32, "[3, 5, 3, 3]", 1, "[3, 5, 3, 3]"
2016-07-13, 33, "[2, 1, 2, 2]", 1, "[2, 1, 2, 2]"
2016-07-14, 34, "[2, 6, 3, 3]", 1, "[2.3, 3.6, 2.6, 2.6]"
2016-07-15, 35, "[39, 31, 73, 34]", 0, "[2.3, 4, 2.6, 2.6]"
2016-07-16, 36, "[3, 2, 3, 3]", 1, "[2.3, 4, 2.6, 2.6]"
2016-07-17, 37, "[3, 8, 3, 3]", 1, "[2.3, 3, 2.6, 2.6]"
The real data is bigger; result 0 may also repeat 2 or more times in a row. I think the key is properly keeping track of the previous OK results when computing those averages. I spent time on it but could not work it out.
I am posting the dataframe here:
mydict = {'date': {0: '2016-07-11',
1: '2016-07-12',
2: '2016-07-13',
3: '2016-07-14',
4: '2016-07-15',
5: '2016-07-16',
6: '2016-07-17'},
'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
'measure': {0: '[2, 5, 3, 3]',
1: '[3, 5, 3, 3]',
2: '[2, 1, 2, 2]',
3: '[2, 6, 3, 3]',
4: '[39, 31, 73, 34]',
5: '[3, 2, 3, 3]',
6: '[3, 8, 3, 3]'},
'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)
Thank you for giving directions or pointing out how to approach this.
Solution using only 1 for loop:
Considering the data:
mydict = {'date': {0: '2016-07-11',
1: '2016-07-12',
2: '2016-07-13',
3: '2016-07-14',
4: '2016-07-15',
5: '2016-07-16',
6: '2016-07-17'},
'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
'measure': {0: '[2, 5, 3, 3]',
1: '[3, 5, 3, 3]',
2: '[2, 1, 2, 2]',
3: '[2, 6, 3, 3]',
4: '[39, 31, 73, 34]',
5: '[3, 2, 3, 3]',
6: '[3, 8, 3, 3]'},
'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)
I defined a simple function to calculate the means and return a list. Then, I loop over the dataframe applying the rules:
def calc_mean(in_list):
    p0 = round((in_list[0][0] + in_list[1][0] + in_list[2][0]) / 3, 1)
    p1 = round((in_list[0][1] + in_list[1][1] + in_list[2][1]) / 3, 1)
    p2 = round((in_list[0][2] + in_list[1][2] + in_list[2][2]) / 3, 1)
    p3 = round((in_list[0][3] + in_list[1][3] + in_list[2][3]) / 3, 1)
    return [p0, p1, p2, p3]

Solution = []
aux_list = []
for index, row in df.iterrows():
    if index in [0, 1, 2]:
        Solution.append(row.measure)
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
    else:
        Solution.append('[' + ', '.join(map(str, calc_mean(aux_list))) + ']')
        if row.result > 0:
            aux_list.pop(0)
            aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
df['Solution'] = Solution
Please note that the result is rounded to 1 decimal place, a bit different from your desired output; that made more sense to me.
EDIT:
As suggested in the comments by @Frenchy, to deal with result == 0 in the first 3 rows, we need to change the first if clause a bit:
if index in [0, 1, 2] or len(aux_list) < 3:
    Solution.append(row.measure)
    if row.result > 0:
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
You can use pd.eval to turn the strings in measure into proper lists, restricted to the rows where result is not 0. Use rolling with mean and then shift so each row gets the rolling average of the previous 3 valid rows. Then map to str once the values are converted back to a list of lists with tolist. Finally, replace the first three rows and ffill the missing data:
df.loc[df.result.shift() != 0, 'solution'] = list(map(str,
    pd.DataFrame(pd.eval(df[df.result != 0].measure))
      .rolling(3).mean().shift().values.tolist()))
df.loc[:2, 'solution'] = df.loc[:2, 'measure']
df.solution = df.solution.ffill()
Here's another solution:
# get data to reproduce example
from io import StringIO
data = StringIO("""
date;id;measure;result
2016-07-11;31;"[2,5,3,3]";1
2016-07-12;32;"[3,5,3,3]";1
2016-07-13;33;"[2,1,2,2]";1
2016-07-14;34;"[2,6,3,3]";1
2016-07-15;35;"[39,31,73,34]";0
2016-07-16;36;"[3,2,3,3]";1
2016-07-17;37;"[3,8,3,3]";1
""")
df = pd.read_csv(data, sep=";")
df
# Out:
# date id measure result
# 0 2016-07-11 31 [2,5,3,3] 1
# 1 2016-07-12 32 [3,5,3,3] 1
# 2 2016-07-13 33 [2,1,2,2] 1
# 3 2016-07-14 34 [2,6,3,3] 1
# 4 2016-07-15 35 [39,31,73,34] 0
# 5 2016-07-16 36 [3,2,3,3] 1
# 6 2016-07-17 37 [3,8,3,3] 1
# convert values in measure column to lists
from ast import literal_eval
dm = df['measure'].apply(literal_eval)
# apply rolling mean with period 2 and recollect values into list in column means
df["means"] = dm.apply(pd.Series).rolling(2, min_periods=0).mean().values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.5, 3.0, 2.5, 2.5]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.0, 3.5, 2.5, 2.5]
# 4 2016-07-15 35 [39,31,73,34] 0 [20.5, 18.5, 38.0, 18.5]
# 5 2016-07-16 36 [3,2,3,3] 1 [21.0, 16.5, 38.0, 18.5]
# 6 2016-07-17 37 [3,8,3,3] 1 [3.0, 5.0, 3.0, 3.0]
# moving window of size 3
df["means"] = dm.apply(pd.Series).rolling(3, min_periods=0).mean().round(2).values.tolist()
df
# Out:
# date id measure result means
# 0 2016-07-11 31 [2,5,3,3] 1 [2.0, 5.0, 3.0, 3.0]
# 1 2016-07-12 32 [3,5,3,3] 1 [2.5, 5.0, 3.0, 3.0]
# 2 2016-07-13 33 [2,1,2,2] 1 [2.33, 3.67, 2.67, 2.67]
# 3 2016-07-14 34 [2,6,3,3] 1 [2.33, 4.0, 2.67, 2.67]
# 4 2016-07-15 35 [39,31,73,34] 0 [14.33, 12.67, 26.0, 13.0]
# 5 2016-07-16 36 [3,2,3,3] 1 [14.67, 13.0, 26.33, 13.33]
# 6 2016-07-17 37 [3,8,3,3] 1 [15.0, 13.67, 26.33, 13.33]
