I am trying to find a way to calculate the number of values randomly removed from a data frame and the number of values removed consecutively (one after another).
The code I have so far is:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame({'col_1': y, 'col_2': x})
drop_indices = np.random.choice(df.index, 5, replace=False)
df_subset = df.drop(drop_indices)
print(df_subset)
print(df)
Which randomly removes 5 rows from the data frame and gives the following output:
col_1 col_2
0 1 1
1 2 2
2 3 3
5 6 6
8 9 9
col_1 col_2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I would like to turn this into the following data frame:
   col_1  col_2  col_2  N_removedvalues  N_consecutive
0      1      1      1                0              0
1      2      2      2                0              0
2      3      3      3                0              0
3      4      4    NaN                1              1
4      5      5    NaN                2              2
5      6      6      6                2              0
6      7      7    NaN                3              1
7      8      8    NaN                4              2
8      9      9      9                4              0
9     10     10    NaN                5              1
# left-merge so rows that were removed show up with NaN in the right-hand col_2
res = df.merge(df_subset, on='col_1', suffixes=['_1', ''], how='left')
# running count of removed rows
res["N_removedvalues"] = np.where(res['col_2'].isna(), res.groupby(res['col_2'].isna()).cumcount().add(1), np.nan)
res["N_removedvalues"] = res["N_removedvalues"].ffill().fillna(0)
# flag the first row of each removed run, then count within each run
res['N_consecutive'] = np.logical_and(res['col_2'].isna(), np.logical_or(~res['col_2'].shift().isna(), res.index == res.index[0]))
res.loc[np.logical_and(res['N_consecutive'] == 0, res['col_2'].isna()), 'N_consecutive'] = np.nan
res['N_consecutive'] = res.groupby('N_consecutive')['N_consecutive'].cumsum().ffill()
res.loc[res['N_consecutive'].gt(0), 'N_consecutive'] = res.loc[res['N_consecutive'].gt(0)].groupby('N_consecutive').cumcount().add(1)
Outputs:
col_1 col_2_1 col_2 N_removedvalues N_consecutive
0 1 1 1.0 0.0 0.0
1 2 2 2.0 0.0 0.0
2 3 3 NaN 1.0 1.0
3 4 4 4.0 1.0 0.0
4 5 5 NaN 2.0 1.0
5 6 6 NaN 3.0 2.0
6 7 7 7.0 3.0 0.0
7 8 8 8.0 3.0 0.0
8 9 9 NaN 4.0 1.0
9 10 10 NaN 5.0 2.0
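For comparison, here is a sketch of an equivalent way to compute the two counters directly from which indices were dropped, without the merge (same df and df_subset as above):
# sketch: flag the dropped rows, then count totals and run lengths
removed = ~df.index.isin(df_subset.index)                 # True where a row was removed
res = df.copy()
res['N_removedvalues'] = removed.cumsum()                 # running total of removed rows
removed_s = pd.Series(removed, index=df.index).astype(int)
runs = removed_s.ne(removed_s.shift()).cumsum()           # id for each run of equal flags
res['N_consecutive'] = removed_s.groupby(runs).cumsum()   # 1, 2, ... inside a removed run, else 0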
I want to find the cumulative count before there is a change in value, i.e. how many rows it has been since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
    6    na  na
    5    -1   0
    5     0   1
    5     0   2
    4    -1   0
    4     0   1
    4     0   2
    4     0   3
    4     0   4
    5     1   0
    5     0   1
    5     0   2
    5     0   3
    6     1   0
    7     1   0
I tried to use cumsum, but it does not reset after each change.
IIUC, use a cumcount per group:
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
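The group key here is a run id: comparing each value with the previous row and taking a cumulative sum assigns one integer per run of equal values, so cumcount restarts at every change. A small sketch of just that key:
import pandas as pd

values = pd.Series([6, 5, 5, 5, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 7])
run_id = values.ne(values.shift()).cumsum()   # new id whenever the value changes
print(run_id.tolist())                        # [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6]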
If you want NaN where diff is NaN, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
.mask(df['diff'].isna())
)
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count consecutive 0 values from the difference column:
m = df['diff'].eq(0)   # True where diff is 0 (no change)
b = m.cumsum()         # running count of zeros
# subtract the count as of the last non-zero diff -> consecutive zeros since the last change
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print(df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
I have the following pandas dataframe:
SEC POS DATA
1 1 4
2 1 4
3 1 5
4 1 5
5 2 2
6 3 4
7 3 2
8 4 2
9 4 2
10 1 8
11 1 6
12 2 5
13 2 5
14 2 4
15 2 6
16 3 2
17 4 1
Now I want to know the mean value of DATA and the first value of SEC for every block of the POS column.
So like this:
SEC POS DATA
1 1 4.5
5 2 2
6 3 3
8 4 2
10 1 7
12 2 5
16 3 2
17 4 1
Additionally, I want to subtract the DATA value of POS=4 from its 3 prior DATA values, so where POS = [1,2,3].
Obtaining the following:
SEC POS DATA
1 1 2.5
5 2 0
6 3 1
8 4 2
10 1 6
12 2 4
16 3 1
17 4 1
I figured out how to do this by splitting the dataframe into many smaller dataframes with a for loop, taking the mean, and then subtracting for the other dataframes. However, this is very slow, so I'm wondering if there is a faster way to do it. Can anyone help?
Thanks!
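For reference, a sketch that reconstructs the sample frame used by the answers below (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'SEC':  list(range(1, 18)),
    'POS':  [1, 1, 1, 1, 2, 3, 3, 4, 4, 1, 1, 2, 2, 2, 2, 3, 4],
    'DATA': [4, 4, 5, 5, 2, 4, 2, 2, 2, 8, 6, 5, 5, 4, 6, 2, 1],
})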
Another solution:
# new block whenever POS changes from the previous row
diff_to_previous = df.POS != df.POS.shift(1)
df = df.groupby(diff_to_previous.cumsum(), as_index=False).agg({'SEC': 'first', 'POS': 'first', 'DATA': 'mean'})
# 'tmp' groups each 1-2-3-4 cycle together (a new group starts right after a POS == 4 row)
df['tmp'] = (df['POS'] == 4).astype(int).shift(fill_value=0).cumsum()
# within each cycle, subtract the last (POS == 4) DATA value from every earlier row
df['DATA'] = df.groupby('tmp')['DATA'].transform(lambda x: [*(x[x.index[:-1]] - x[x.index[-1]]), x[x.index[-1]]])
df = df.drop(columns='tmp')
print(df)
Prints:
SEC POS DATA
0 1 1 2.5
1 5 2 0.0
2 6 3 1.0
3 8 4 2.0
4 10 1 6.0
5 12 2 4.0
6 16 3 1.0
7 17 4 1.0
For your first problem, we can use:
grps = df['POS'].ne(df['POS'].shift()).cumsum()
dfg = df.groupby(grps).agg(
POS=('POS', 'min'),
SEC=('SEC', 'min'),
DATA=('DATA', 'mean')
).reset_index(drop=True)
POS SEC DATA
0 1 1 4.5
1 2 5 2.0
2 3 6 3.0
3 4 8 2.0
4 1 10 7.0
5 2 12 5.0
6 3 16 2.0
7 4 17 1.0
For your second problem:
grps2 = dfg['POS'].lt(dfg['POS'].shift()).cumsum()
m = (
dfg.groupby(grps2)
.apply(lambda x: x.loc[x['POS'].isin([1,2,3]), 'DATA']
- x.loc[x['POS'].eq(4), 'DATA'].iat[0])
.droplevel(0)
)
dfg['DATA'].update(m)
POS SEC DATA
0 1 1 2.5
1 2 5 0.0
2 3 6 1.0
3 4 8 2.0
4 1 10 6.0
5 2 12 4.0
6 3 16 1.0
7 4 17 1.0
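The apply step above can also be written with a transform, which some may find easier to read. A sketch, using the dfg and grps2 defined above (before the update) and assuming, as the answer does, that each cycle ends with exactly one POS == 4 row:
last = dfg.groupby(grps2)['DATA'].transform('last')                     # DATA at the POS == 4 row of each cycle
dfg['DATA'] = dfg['DATA'].where(dfg['POS'].eq(4), dfg['DATA'] - last)   # subtract it from every other row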
I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column, and I need to fill them with median values according to group and subgroup.
I made a pivot table with a double index and the median of the target column, but I don't understand how to get these values and put them back into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A'],
[13,1,'C'],
[14,2,'B'],
[15,3,'A']],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1
1 2 1 A 3
2 3 3 B 8
3 4 2 C 1
4 5 3 A 3
5 6 2 C 6
6 7 1 B 2
7 8 1 C 3
8 9 2 A 7
9 10 3 C 4
10 11 2 B 6
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
I will be thankful for any help.
Use pandas.DataFrame.groupby.transform, then fillna. For example, given this frame (note the NaN in row 1):
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # < Value with nan
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
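If you prefer to reuse the pivot table you already built (df_struct), a sketch of an equivalent fill via a merge would be:
medians = df_struct.reset_index()   # columns: group, subgroup, value (the medians)
merged = df.merge(medians, on=['group', 'subgroup'], how='left', suffixes=('', '_median'))
df['value'] = df['value'].fillna(merged['value_median'])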
I want to group my dataframe by two columns: one is yearmonth (format: 16-10) and the other is number of cust. Then, if the number of customers is more than six, I want to create a single row that replaces all the rows with number of cust = 6+, with the count being the sum of the counts for number of cust > 6.
This is what the data looks like:
index month num ofcust count
0 10 1.0 1
1 10 2.0 1
2 10 3.0 1
3 10 4.0 1
4 10 5.0 1
5 10 6.0 1
6 10 7.0 1
7 10 8.0 1
8 11 1.0 1
9 11 2.0 1
10 11 3.0 1
11 12 12.0 1
Desired output:
index month no of cust count
0 16-10 1.0 3
1 16-10 2.0 6
2 16-10 3.0 2
3 16-10 4.0 3
4 16-10 5.0 4
5 16-10 6+ 4
6 16-11 1.0 4
7 16-11 2.0 3
8 16-11 3.0 2
9 16-11 4.0 1
10 16-11 5.0 3
11 16-11 6+ 5
I believe you need to replace all values >= 6 first and then groupby + aggregate sum:
s = df['num ofcust'].mask(df['num ofcust'] >=6, '6+')
#alternatively
#s = df['num ofcust'].where(df['num ofcust'] <6, '6+')
df = df.groupby(['month', s])['count'].sum().reset_index()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
Detail:
print (s)
0 1
1 2
2 3
3 4
4 5
5 6+
6 6+
7 6+
8 1
9 2
10 3
11 6+
Name: num ofcust, dtype: object
Another very similar solution is to replace the values in the column first:
df.loc[df['num ofcust'] >= 6, 'num ofcust'] = '6+'
df = df.groupby(['month', 'num ofcust'], as_index=False)['count'].sum()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
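Note that the desired output labels the month as 16-10 (year-month), which cannot be derived from the month column shown in the sample. If the original data has a datetime column (the name date below is hypothetical, not from the question), a sketch for producing that label would be:
# hypothetical: assumes a datetime column named 'date' exists in the original data
df['month'] = pd.to_datetime(df['date']).dt.strftime('%y-%m')   # e.g. '16-10'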
I have a dataframe like this, and I want to add a new column that is the equivalent of applying shift n times and summing the results. For example, let n = 2:
import numpy
import pandas as pd

df = pd.DataFrame(numpy.random.randint(0, 10, (10, 2)), columns=['a', 'b'])
a b
0 0 3
1 7 0
2 6 6
3 6 0
4 5 0
5 0 7
6 8 0
7 8 7
8 4 4
9 2 2
df['c'] = df['b'].shift(1) + df['b'].shift(2)
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
In this manner, column c gets the sum of the previous n values from column b.
Other than a loop, is there a better way to accomplish this for a large n?
You can use the rolling() method with a window of 2:
df['c'] = df.b.rolling(window = 2).sum().shift()
df
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
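For a general n (a sketch; the answer above hard-codes n = 2), the same idea extends directly: sum over a rolling window of length n, then shift by one so only prior rows are included:
n = 2   # any window length
df['c'] = df['b'].rolling(window=n).sum().shift()
# equivalently: df['b'].shift(1).rolling(n).sum()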