How to find out the cumulative count between numbers? - python

I want to find the cumulative count before there is a change in value, i.e. how many rows since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
6      na    na
5      -1    0
5      0     1
5      0     2
4      -1    0
4      0     1
4      0     2
4      0     3
4      0     4
5      1     0
5      0     1
5      0     2
5      0     3
6      1     0
7      1     0
I tried to use cumsum, but it does not reset after each change.
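For reference, a minimal frame matching the table above could be built like this (a sketch; the column names Value and diff are taken from the table):

import pandas as pd

# Reconstruction of the sample data from the table above
df = pd.DataFrame({'Value': [6, 5, 5, 5, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 7]})
df['diff'] = df['Value'].diff()   # NaN in the first row, matching the "na" above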

IIUC, use a cumcount per group:
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
If you want NaN where diff is NaN, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
               .mask(df['diff'].isna())
            )
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
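To see what the grouper in the answer above is doing, the intermediate key labels each run of equal values, and cumcount restarts at 0 within each label (a short sketch reusing df from above):

# each run of equal values gets its own label
key = df['Value'].ne(df['Value'].shift()).cumsum()
print(key.tolist())    # [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6]
df['new'] = df.groupby(key).cumcount()   # restarts at 0 for every new label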

If performance is important, count consecutive 0 values in the diff column:
m = df['diff'].eq(0)
b = m.cumsum()
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print (df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
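To unpack how the fast version works, here is the same trick on just the diff column (a self-contained sketch; the counter restarts because we subtract the running total as it stood at the last change row):

import pandas as pd

diff = pd.Series([None, -1, 0, 0, -1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1])

m = diff.eq(0)                             # True where the value did not change
b = m.cumsum()                             # running count of all "no change" rows so far
last_reset = b.mask(m).ffill().fillna(0)   # value of b at the most recent change row
out = b.sub(last_reset).astype(int)        # rows since the last change
print(out.tolist())                        # [0, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2, 3, 0, 0]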

Related

Calculate the amount of consecutive missing values in a row

I am trying to find a way to calculate the number of values randomly removed from a data frame and the number of values removed consecutively, one after another.
The code I have so far is:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Sampledata
x=[1,2,3,4,5,6,7,8,9,10]
y=[1,2,3,4,5,6,7,8,9,10]
df = pd.DataFrame({'col_1':y,'col_2':x})
drop_indices = np.random.choice(df.index, 5,replace=False )
df_subset = df.drop(drop_indices)
print(df_subset)
print(df)
Which randomly removes 5 rows from the data frame and gives as output:
col_1 col_2
0 1 1
1 2 2
2 3 3
5 6 6
8 9 9
col_1 col_2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I would like to turn this into the following data frame:
   col_1  col_2  col_2  N_removedvalues  N_consecutive
0      1      1      1                0              0
1      2      2      2                0              0
2      3      3      3                0              0
3      4      4    NaN                1              1
4      5      5    NaN                2              2
5      6      6      6                2              0
6      7      7    NaN                3              1
7      8      8    NaN                4              2
8      9      9      9                4              0
9     10     10    NaN                5              1
res=df.merge(df_subset, on='col_1', suffixes=['_1',''], how='left')
res["N_removedvalues"]=np.where(res['col_2'].isna(), res.groupby(res['col_2'].isna()).cumcount().add(1), np.nan)
res["N_removedvalues"]=res["N_removedvalues"].ffill().fillna(0)
res['N_consecutive']=np.logical_and(res['col_2'].isna(), np.logical_or(~res['col_2'].shift().isna(), res.index==res.index[0]))
res.loc[np.logical_and(res['N_consecutive']==0, res['col_2'].isna()), 'N_consecutive']=np.nan
res['N_consecutive']=res.groupby('N_consecutive')['N_consecutive'].cumsum().ffill()
res.loc[res['N_consecutive'].gt(0), 'N_consecutive']=res.loc[res['N_consecutive'].gt(0)].groupby('N_consecutive').cumcount().add(1)
Outputs:
col_1 col_2_1 col_2 N_removedvalues N_consecutive
0 1 1 1.0 0.0 0.0
1 2 2 2.0 0.0 0.0
2 3 3 NaN 1.0 1.0
3 4 4 4.0 1.0 0.0
4 5 5 NaN 2.0 1.0
5 6 6 NaN 3.0 2.0
6 7 7 7.0 3.0 0.0
7 8 8 8.0 3.0 0.0
8 9 9 NaN 4.0 1.0
9 10 10 NaN 5.0 2.0
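A shorter alternative uses the same consecutive-count idea as the first question above (a self-contained sketch; the merged frame with a NaN col_2 on removed rows is reconstructed hypothetically here):

import numpy as np
import pandas as pd

# Hypothetical merged frame: col_2 is NaN where a row was randomly removed
res = pd.DataFrame({'col_1': range(1, 11),
                    'col_2': [1, 2, np.nan, 4, np.nan, np.nan, 7, 8, np.nan, np.nan]})

missing = res['col_2'].isna()
res['N_removedvalues'] = missing.cumsum()

# Consecutive-missing counter: running count minus its value at the last present row
b = missing.cumsum()
res['N_consecutive'] = b.sub(b.mask(missing).ffill().fillna(0)).astype(int)
print(res)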

How to create an increment var from the first value of a dataframe group?

I have a dataframe:
data=[[0,1,5],
[0,1,6],
[0,0,8],
[0,0,10],
[0,1,12],
[0,0,14],
[0,1,16],
[0,1,18],
[1,0,2],
[1,1,0],
[1,0,1],
[1,0,2]]
df = pd.DataFrame(data,columns=['KEY','COND','VAL'])
For RES1, I want to create a counter variable that increments where COND == 1. The value of RES1 for the
first row of each KEY group stays the same as VAL (can I use cumcount() in some way?).
For RES2, I then just want to fill the missing values with
the previous value (df.fillna(method='ffill'), I am thinking).
    KEY  COND  VAL  RES1  RES2
0     0     1    5     5     5
1     0     1    6     6     6
2     0     0    8           6
3     0     0   10           6
4     0     1   12     7     7
5     0     0   14           7
6     0     1   16     8     8
7     0     1   18     9     9
8     1     0    2     2     2
9     1     1    0     3     3
10    1     0    1           3
11    1     0    2           3
The aim is to find a vectorized solution that is optimal over a million rows.
IIUC
con = (df.COND==1) | (df.index.isin(df.drop_duplicates('KEY').index))
df['res1'] = (df.groupby('KEY').VAL.transform('first')
              + df.groupby('KEY').COND.cumsum()[con]
              - df.groupby('KEY').COND.transform('first'))
df['res2'] = df.res1.ffill()
df
Out[148]:
KEY COND VAL res1 res2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
You want:
s = (df[df.KEY.duplicated()] # Ignore first row in each KEY group
.groupby('KEY').COND.cumsum() # Counter within KEY
.add(df.groupby('KEY').VAL.transform('first')) # Add first value per KEY
.where(df.COND.eq(1)) # Set only where COND == 1
.add(df.loc[~df.KEY.duplicated(), 'VAL'], fill_value=0) # Set 1st row by KEY
)
df['RES1'] = s
df['RES2'] = df['RES1'].ffill()
KEY COND VAL RES1 RES2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
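For readability, essentially the same computation as the answers above can be written with named intermediates (a sketch that should reproduce RES1/RES2 on the sample data in the question):

import pandas as pd

data = [[0,1,5],[0,1,6],[0,0,8],[0,0,10],[0,1,12],[0,0,14],[0,1,16],[0,1,18],
        [1,0,2],[1,1,0],[1,0,1],[1,0,2]]
df = pd.DataFrame(data, columns=['KEY', 'COND', 'VAL'])

g = df.groupby('KEY')
first_val = g['VAL'].transform('first')
first_cond = g['COND'].transform('first')
counter = first_val + g['COND'].cumsum() - first_cond   # increments on every COND == 1

# keep the counter only on the first row of each KEY and where COND == 1
keep = df['COND'].eq(1) | ~df['KEY'].duplicated()
df['RES1'] = counter.where(keep)
df['RES2'] = df['RES1'].ffill()
print(df)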

Re-size the dataframe

I have a dataframe that is 5252 rows x 3 columns.
The data look something like this:
X Y Z
1 1 2
1 2 4
1 3 3.5
2 13 4
1 4 3
2 14 3.5
3 14 2
3 15 1
4 16 .5
4 18 2
. . .
. . .
. . .
1508 751 1
1508 669 1
1508 686 2.5
I want to convert it so that the user id (X) forms the rows, the item id (Y) forms the columns, and Z is the value corresponding to each X and Y. Something like this:
1 2 3 4 5 6 13 14 15 16 17 18 669 686
1 2 4 3.5 3 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 4 4.5 0 0 0 0 0 0
3 0 0 0 0 0 0 0 2 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 .5 0 2 0 0
.
.
.
1508 0 0 0 0 0 0 0 0 0 0 0 0 1 1
I assume you're using the pandas library.
You need the pd.pivot_table function. If the dataframe is called df, then you need:
pd.pivot_table(data=df, index="X", columns="Y", values="Z", aggfunc=sum)
You need to use pd.pivot_table() and use fillna(0). Recreating your sample dataframe:
import pandas as pd
df = pd.DataFrame({'X': [1,1,1,1,2,2,3,3,4], 'Y': [1,2,3,4,13,14,14,15,16], 'Z': [2,4,3.5,3,4,3.5,2,1,.5]})
Gives:
X Y Z
0 1 1 2.0
1 1 2 4.0
2 1 3 3.5
3 1 4 3.0
4 2 13 4.0
5 2 14 3.5
6 3 14 2.0
7 3 15 1.0
8 4 16 0.5
Then using pd.pivot_table():
pd.pivot_table(df, values='Z', index=['X'], columns=['Y']).fillna(0)
Yields:
Y 1 2 3 4 13 14 15 16
X
1 2.0 4.0 3.5 3.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 4.0 3.5 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5
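If every (X, Y) pair occurs only once, plain df.pivot works as well, and it will raise on duplicates, which can serve as a sanity check (a sketch on the same sample frame):

import pandas as pd

df = pd.DataFrame({'X': [1,1,1,1,2,2,3,3,4],
                   'Y': [1,2,3,4,13,14,14,15,16],
                   'Z': [2,4,3.5,3,4,3.5,2,1,.5]})

wide = df.pivot(index='X', columns='Y', values='Z').fillna(0)
print(wide)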

Pandas/Numpy group value changes and derivative value changes above/below 0

I have a series of values (Pandas DF or Numpy Arr):
vals = [0,1,3,4,5,5,4,2,1,0,-1,-2,-3,-2,3,5,8,4,2,0,-1,-3,-8,-20,-10,-5,-2,-1,0,1,2,3,5,6,8,4,3]
df = pd.DataFrame({'val': vals})
I want to classify/group the values into 4 categories:
Increasing above 0
Decreasing above 0
Increasing below 0
Decreasing below 0
My current approach with pandas is to categorize values as above/below 0, and then categorize them as increasing/decreasing by checking whether the diff values are above/below 0.
df['above_zero'] = np.where(df['val'] >= 0, 1, 0)
df['below_zero'] = np.where(df['val'] < 0, 1, 0)
df['diffs'] = df['val'].diff()
df['diff_above_zero'] = np.where(df['diffs'] >= 0, 1, 0)
df['diff_below_zero'] = np.where(df['diffs'] < 0, 1, 0)
This produces the desired output, but now I am trying to find a way to group these columns into an ascending group number that increments as soon as one of the 4 conditions changes.
Desired output would look like this (the group column was typed manually and might contain errors compared to calculated values):
id val above_zero below_zero diffs diff_above_zero diff_below_zero group
0 0 1 0 0.0 1 0 0
1 1 1 0 1.0 1 0 0
2 3 1 0 2.0 1 0 0
3 4 1 0 1.0 1 0 0
4 5 1 0 1.0 1 0 0
5 5 1 0 0.0 1 0 0
6 4 1 0 -1.0 0 1 1
7 2 1 0 -2.0 0 1 1
8 1 1 0 -1.0 0 1 1
9 0 1 0 -1.0 0 1 1
10 -1 0 1 -1.0 0 1 2
11 -2 0 1 -1.0 0 1 2
12 -3 0 1 -1.0 0 1 2
13 -2 0 1 1.0 1 0 3
14 3 1 0 5.0 1 0 4
15 5 1 0 2.0 1 0 4
16 8 1 0 3.0 1 0 4
17 4 1 0 -4.0 0 1 5
18 2 1 0 -2.0 0 1 5
19 0 1 0 -2.0 0 1 5
20 -1 0 1 -1.0 0 1 6
21 -3 0 1 -2.0 0 1 6
22 -8 0 1 -5.0 0 1 6
23 -20 0 1 -12.0 0 1 6
24 -10 0 1 10.0 1 0 7
25 -5 0 1 5.0 1 0 7
26 -2 0 1 3.0 1 0 7
27 -1 0 1 1.0 1 0 7
28 0 1 0 1.0 1 0 8
29 1 1 0 1.0 1 0 8
30 2 1 0 1.0 1 0 8
31 3 1 0 1.0 1 0 8
32 5 1 0 2.0 1 0 8
33 6 1 0 1.0 1 0 8
34 8 1 0 2.0 1 0 8
35 4 1 0 -4.0 0 1 9
36 3 1 0 -1.0 0 1 9
Would appreciate any help on how to solve this efficiently. Thanks!
Setup
g1 = ['above_zero', 'below_zero', 'diff_above_zero', 'diff_below_zero']
You can simply index all of your boolean columns, and use shift:
c = df.loc[:, g1]
(c != c.shift().fillna(c)).any(axis=1).cumsum()
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 2
11 2
12 2
13 3
14 4
15 4
16 4
17 5
18 5
19 5
20 6
21 6
22 6
23 6
24 7
25 7
26 7
27 7
28 8
29 8
30 8
31 8
32 8
33 8
34 8
35 9
36 9
dtype: int32
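To attach this as the group column shown in the desired output, assign the same expression back (a usage sketch continuing from the Setup above):

c = df.loc[:, g1]
# a new group starts whenever any of the four flags changes from the previous row
df['group'] = (c != c.shift().fillna(c)).any(axis=1).cumsum()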
The following code will produce two columns: c1 and c2.
The values of c1 correspond to the following 4 categories:
0 means below zero and increasing
1 means above zero and increasing
2 means below zero and decreasing
3 means above zero and decreasing
And c2 is the ascending group number that increments as soon as the condition (i.e. c1) changes, as you wanted. Credit to @user3483203 for the shift with cumsum idea.
# calculate difference
df["diff"] = df['val'].diff()
# set first value in column 'diff' to 0 (as previous step sets it to NaN)
df.loc[0, 'diff'] = 0
df["c1"] = (df['val'] >= 0).astype(int) + (df["diff"] < 0).astype(int) * 2
df["c2"] = (df["c1"] != df["c1"].shift().fillna(df["c1"])).astype(int).cumsum()
Result:
val diff c1 c2
0 0 0.0 1 0
1 1 1.0 1 0
2 3 2.0 1 0
3 4 1.0 1 0
4 5 1.0 1 0
5 5 0.0 1 0
6 4 -1.0 3 1
7 2 -2.0 3 1
8 1 -1.0 3 1
9 0 -1.0 3 1
10 -1 -1.0 2 2
11 -2 -1.0 2 2
12 -3 -1.0 2 2
13 -2 1.0 0 3
14 3 5.0 1 4
15 5 2.0 1 4
16 8 3.0 1 4
17 4 -4.0 3 5
18 2 -2.0 3 5
19 0 -2.0 3 5
20 -1 -1.0 2 6
21 -3 -2.0 2 6
22 -8 -5.0 2 6
23 -20 -12.0 2 6
24 -10 10.0 0 7
25 -5 5.0 0 7
26 -2 3.0 0 7
27 -1 1.0 0 7
28 0 1.0 1 8
29 1 1.0 1 8
30 2 1.0 1 8
31 3 1.0 1 8
32 5 2.0 1 8
33 6 1.0 1 8
34 8 2.0 1 8
35 4 -4.0 3 9
36 3 -1.0 3 9
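If readable labels are preferred over the numeric codes in c1, they can be mapped afterwards (a small usage sketch continuing from the code above; the label strings are just illustrative):

labels = {0: 'below zero, increasing',
          1: 'above zero, increasing',
          2: 'below zero, decreasing',
          3: 'above zero, decreasing'}
df['c1_label'] = df['c1'].map(labels)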

No results are returned in dataframe

I am trying to take the average of every fifth and sixth row of variable A in a dataframe, and put the result in a new column as variable B. But it shows NaN. Could it be that I did not return the values correctly?
Here is the sample data:
PID A
1 0
1 3
1 2
1 6
1 0
1 2
2 3
2 3
2 1
2 4
2 0
2 4
Expected results:
PID A B
1 0 1
1 3 1
1 2 1
1 6 1
1 0 1
1 2 1
2 3 2
2 3 2
2 1 2
2 4 2
2 0 2
2 4 2
My codes:
lst1 = df.iloc[5::6, :]
lst2 = df.iloc[4::6, :]
df['B'] = (lst1['A'] + lst2['A'])/2
print(df['B'])
The script runs without error, but var B is empty and shows NaN.
Thanks for your help!
The problem is that the data are not aligned because of the different indexes, so you get NaNs.
print(lst1)
PID A
5 1 2
11 2 4
print(lst2)
PID A
4 1 0
10 2 0
print (lst1['A'] + lst2['A'])
4 NaN
5 NaN
10 NaN
11 NaN
Name: A, dtype: float64
The solution is to use .values to add the Series to a numpy array:
print (lst1['A'] + (lst2['A'].values))
5 2
11 4
Name: A, dtype: int64
Or you can sum 2 numpy arrays:
print (lst1['A'].values + (lst2['A'].values))
[2 4]
It seems you need:
df['B'] = (lst1['A'] + lst2['A'].values).div(2)
df['B'] = df['B'].bfill()
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
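Note that the division only produces values at the index positions of lst1 (rows 5 and 11); bfill() then fills each preceding NaN from below. A small sketch of the intermediate state (reusing lst1, lst2 and df from above):

partial = (lst1['A'] + lst2['A'].values).div(2)
print(partial)              # index 5 -> 1.0, index 11 -> 2.0
df['B'] = partial           # aligned by index, so every other row is NaN
df['B'] = df['B'].bfill()   # each NaN takes the next valid value below it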
But if you need the mean of the 5th and 6th values per PID group, use groupby with transform:
df['B'] = df.groupby('PID').transform(lambda x: x.iloc[[4, 5]].mean())
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 1
Straightforward way taking the mean of the 5th and 6th positions within each group defined by 'PID'.
df.assign(B=df.groupby('PID').transform(lambda x: x.values[[4, 5]].mean()))
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 2
Fun way using join assuming there are actually exactly 6 rows for each 'PID'.
df.join(df.set_index('PID').A.pipe(lambda d: (d.iloc[4::6] + d.iloc[5::6]) / 2).rename('B'), on='PID')
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
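Another short variant, assuming each PID group has exactly six rows so that the last two rows are the 5th and 6th (a sketch using the sample df from the question):

# per-PID mean of the last two rows, broadcast back onto every row via map
df['B'] = df['PID'].map(df.groupby('PID')['A'].apply(lambda s: s.iloc[-2:].mean()))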
