I have the following pandas dataframe:
SEC POS DATA
1 1 4
2 1 4
3 1 5
4 1 5
5 2 2
6 3 4
7 3 2
8 4 2
9 4 2
10 1 8
11 1 6
12 2 5
13 2 5
14 2 4
15 2 6
16 3 2
17 4 1
Now I want to know the mean value of DATA and the first value of SEC for every block of consecutive identical values in the POS column.
So like this:
SEC POS DATA
1 1 4.5
5 2 2
6 3 3
8 4 2
10 1 7
12 2 5
16 3 2
17 4 1
Additionally, I want to subtract the DATA value of the POS=4 row from its 3 prior DATA values, i.e. the rows where POS is 1, 2, or 3, obtaining the following:
SEC POS DATA
1 1 2.5
5 2 0
6 3 1
8 4 2
10 1 6
12 2 4
16 3 1
17 4 1
I figured out how to do this by splitting the dataframe into many smaller dataframes with a for loop, taking the mean, and then subtracting across the resulting dataframes. However, this is very slow, so I'm wondering if there's a faster way to do this. Can anyone help?
Thanks!
Another solution:
diff_to_previous = df.POS != df.POS.shift(1)  # True wherever a new POS block starts
df = df.groupby(diff_to_previous.cumsum(), as_index=False).agg({'SEC': 'first', 'POS': 'first', 'DATA': 'mean'})
# cycle id: increments on the row after each POS=4, so every POS 1-4 cycle shares one id
df['tmp'] = (df['POS'] == 4).astype(int).shift(fill_value=0).cumsum()
# within each cycle, subtract the last row's DATA (the POS=4 value) from all earlier rows
df['DATA'] = df.groupby('tmp')['DATA'].transform(lambda x: [*(x[x.index[:-1]] - x[x.index[-1]]), x[x.index[-1]]])
df = df.drop(columns='tmp')
print(df)
Prints:
SEC POS DATA
0 1 1 2.5
1 5 2 0.0
2 6 3 1.0
3 8 4 2.0
4 10 1 6.0
5 12 2 4.0
6 16 3 1.0
7 17 4 1.0
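As a detail, this is what the diff_to_previous.cumsum() grouper evaluates to on the sample data (written out by hand from the POS column above); each block of consecutive identical POS values gets its own integer id:
print(diff_to_previous.cumsum().tolist())
# [1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8]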
For your first problem, we can use:
# block id: increments whenever POS differs from the previous row
grps = df['POS'].ne(df['POS'].shift()).cumsum()
dfg = df.groupby(grps).agg(
    POS=('POS', 'min'),
    SEC=('SEC', 'min'),
    DATA=('DATA', 'mean')
).reset_index(drop=True)
POS SEC DATA
0 1 1 4.5
1 2 5 2.0
2 3 6 3.0
3 4 8 2.0
4 1 10 7.0
5 2 12 5.0
6 3 16 2.0
7 4 17 1.0
For your second problem:
# cycle id: increments wherever POS drops (e.g. 4 -> 1)
grps2 = dfg['POS'].lt(dfg['POS'].shift()).cumsum()
m = (
    dfg.groupby(grps2)
       .apply(lambda x: x.loc[x['POS'].isin([1, 2, 3]), 'DATA']
                        - x.loc[x['POS'].eq(4), 'DATA'].iat[0])
       .droplevel(0)
)
# update aligns on the index, so only the POS 1-3 rows are overwritten
dfg['DATA'].update(m)
POS SEC DATA
0 1 1 2.5
1 2 5 0.0
2 3 6 1.0
3 4 8 2.0
4 1 10 6.0
5 2 12 4.0
6 3 16 1.0
7 4 17 1.0
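If you want to avoid groupby().apply() for the second step, here is a minimal vectorized sketch, assuming each cycle ends with a POS=4 row as in the sample:
# broadcast each cycle's last DATA value (the POS=4 row) across the cycle
last = dfg.groupby(grps2)['DATA'].transform('last')
# overwrite only the POS 1-3 rows; .loc assignment aligns on the index
mask = dfg['POS'].ne(4)
dfg.loc[mask, 'DATA'] = dfg['DATA'].sub(last)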
I have the following dataframe
print(A)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
And I have the following code (from Pandas Dataframe count occurrences that only happen immediately), which counts runs of values that occur immediately one after another.
ser = A["1or0"].ne(A["1or0"].shift().bfill()).cumsum()
B = (
    A.groupby(ser, as_index=False)
     .agg({"Index": ["first", "last", "count"],
           "1or0": "unique"})
     .set_axis(["StartNum", "EndNum", "Size", "Value"], axis=1)
     .assign(Value=lambda d: d["Value"].astype(str).str.strip("[]"))
)
print(B)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
The issue is that when NaN values occur, the code doesn't put them together in one interval; it always counts them as one-sized intervals instead of, e.g., 3:
print(A2)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
10 11 NaN
11 12 NaN
12 13 NaN
print(B2)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 11 1 NaN
5 12 12 1 NaN
6 13 13 1 NaN
But I want B2 to be the following
print(B2Wanted)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 13 3 NaN
What do I need to change so that it works also with NaN?
First fillna with a value that cannot occur in the data (here -1) before creating your grouper:
group = A['1or0'].fillna(-1).diff().ne(0).cumsum()
# or
# s = A['1or0'].fillna(-1)
# group = s.ne(s.shift()).cumsum()
B = (A.groupby(group, as_index=False)
      .agg(**{'StartNum': ('Index', 'first'),
              'EndNum': ('Index', 'last'),
              'Size': ('1or0', 'size'),
              'Value': ('1or0', 'first')
              })
)
Output:
StartNum EndNum Size Value
0 1 3 3 0.0
1 4 7 4 1.0
2 8 8 1 0.0
3 9 10 2 1.0
4 11 13 3 NaN
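The fillna step matters because NaN never compares equal to anything, not even another NaN, so without it every NaN row would open its own group. A minimal demonstration:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan])
print(s.ne(s.shift()).tolist())
# [True, True, True] -> each NaN row starts a new group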
I am trying to find a way to count how many values were randomly removed from a data frame, and how many of them were removed consecutively, one after another.
The code I have so far is:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Sampledata
x=[1,2,3,4,5,6,7,8,9,10]
y=[1,2,3,4,5,6,7,8,9,10]
df = pd.DataFrame({'col_1':y,'col_2':x})
drop_indices = np.random.choice(df.index, 5,replace=False )
df_subset = df.drop(drop_indices)
print(df_subset)
print(df)
Which randomly removes 5 rows from the data frame and gives as output:
col_1 col_2
0 1 1
1 2 2
2 3 3
5 6 6
8 9 9
col_1 col_2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I would like to turn this into the following data frame:
   col_1  col_2_1  col_2  N_removedvalues  N_consecutive
0      1        1      1                0              0
1      2        2      2                0              0
2      3        3      3                0              0
3      4        4    NaN                1              1
4      5        5    NaN                2              2
5      6        6      6                2              0
6      7        7    NaN                3              1
7      8        8    NaN                4              2
8      9        9      9                4              0
9     10       10    NaN                5              1
# left-merge the full frame with the subset: col_2 is NaN wherever a row was removed
res = df.merge(df_subset, on='col_1', suffixes=['_1', ''], how='left')
# running total of removed values: cumulative count over the NaN rows, forward-filled
res["N_removedvalues"] = np.where(res['col_2'].isna(), res.groupby(res['col_2'].isna()).cumcount().add(1), np.nan)
res["N_removedvalues"] = res["N_removedvalues"].ffill().fillna(0)
# flag the first row of each NaN run (previous row present, or very first row)
res['N_consecutive'] = np.logical_and(res['col_2'].isna(), np.logical_or(~res['col_2'].shift().isna(), res.index == res.index[0]))
# continuation rows of a run become NaN so they can inherit the run id below
res.loc[np.logical_and(res['N_consecutive'] == 0, res['col_2'].isna()), 'N_consecutive'] = np.nan
res['N_consecutive'] = res.groupby('N_consecutive')['N_consecutive'].cumsum().ffill()
# number the rows within each run 1, 2, ...; rows that were kept stay 0
res.loc[res['N_consecutive'].gt(0), 'N_consecutive'] = res.loc[res['N_consecutive'].gt(0)].groupby('N_consecutive').cumcount().add(1)
Outputs:
col_1 col_2_1 col_2 N_removedvalues N_consecutive
0 1 1 1.0 0.0 0.0
1 2 2 2.0 0.0 0.0
2 3 3 NaN 1.0 1.0
3 4 4 4.0 1.0 0.0
4 5 5 NaN 2.0 1.0
5 6 6 NaN 3.0 2.0
6 7 7 7.0 3.0 0.0
7 8 8 8.0 3.0 0.0
8 9 9 NaN 4.0 1.0
9 10 10 NaN 5.0 2.0
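As a side note, the same two columns can be built more compactly with the ne/shift/cumsum run-grouping idiom from the answers above; a sketch, reusing the merged frame res from the first line:
mask = res['col_2'].isna()                    # True where a row was removed
res['N_removedvalues'] = mask.cumsum()
runs = mask.ne(mask.shift()).cumsum()         # one id per run of removed/kept rows
res['N_consecutive'] = np.where(mask, res.groupby(runs).cumcount() + 1, 0)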
I want to group my dataframe by two columns: one is yearmonth (format: 16-10) and the other is number of cust. Then, if the number of customers is six or more, I want to replace all of those rows with a single row with number of cust = '6+' whose count is the sum of their counts.
This is how the data looks:
index month num ofcust count
0 10 1.0 1
1 10 2.0 1
2 10 3.0 1
3 10 4.0 1
4 10 5.0 1
5 10 6.0 1
6 10 7.0 1
7 10 8.0 1
8 11 1.0 1
9 11 2.0 1
10 11 3.0 1
11 12 12.0 1
Output:
index month no of cust count
0 16-10 1.0 3
1 16-10 2.0 6
2 16-10 3.0 2
3 16-10 4.0 3
4 16-10 5.0 4
5 16-10 6+ 4
6 16-11 1.0 4
7 16-11 2.0 3
8 16-11 3.0 2
9 16-11 4.0 1
10 16-11 5.0 3
11 16-11 6+ 5
I believe you need to replace all values >= 6 first and then groupby + aggregate sum:
s = df['num ofcust'].mask(df['num ofcust'] >=6, '6+')
#alternatively
#s = df['num ofcust'].where(df['num ofcust'] <6, '6+')
df = df.groupby(['month', s])['count'].sum().reset_index()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
Detail:
print (s)
0 1
1 2
2 3
3 4
4 5
5 6+
6 6+
7 6+
8 1
9 2
10 3
11 6+
Name: num ofcust, dtype: object
Another very similar solution is to replace the values in the column in place first:
df.loc[df['num ofcust'] >= 6, 'num ofcust'] = '6+'
df = df.groupby(['month', 'num ofcust'], as_index=False)['count'].sum()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
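A third option, if you prefer to keep the grouping key numeric until the very end, is to cap the values with clip and relabel afterwards; a sketch starting again from the original df, assuming the counts are positive numbers:
s = df['num ofcust'].clip(upper=6)            # 7, 8, 12, ... all become 6
out = df.groupby(['month', s])['count'].sum().reset_index()
out['num ofcust'] = out['num ofcust'].mask(out['num ofcust'].ge(6), '6+')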
How do I count the number of unique strings in a rolling window of a pandas dataframe?
import numpy as np
import pandas as pd

a = pd.DataFrame(['a', 'b', 'a', 'a', 'b', 'c', 'd', 'e', 'e', 'e', 'e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
Output, same as original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you first need to convert the values to numeric, by factorize or by rank. The min_periods parameter is also necessary to avoid NaN at the start of the column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
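If you would rather not convert the strings at all, a plain-Python sketch over positional windows gives the same counts, starting from the original a before the numeric conversion (window size 3, with shorter windows at the start mirroring min_periods=1):
n = 3
b = pd.Series([a[0].iloc[max(0, i - n + 1): i + 1].nunique() for i in range(len(a))])
print(b)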
I have a dataframe like this, and want to add a new column that is the equivalent of applying shift n times. For example, let n = 2:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=['a', 'b'])
a b
0 0 3
1 7 0
2 6 6
3 6 0
4 5 0
5 0 7
6 8 0
7 8 7
8 4 4
9 2 2
df['c'] = df['b'].shift(1) + df['b'].shift(2)
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
In this manner, column c gets the sum of the previous n values from column b.
Other than a loop, is there a better way to accomplish this for a large n?
You can use the rolling() method with a window of 2:
df['c'] = df.b.rolling(window = 2).sum().shift()
df
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
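Since the question asks about a large n, note that only the window size changes; a sketch with n as a parameter:
n = 2  # number of previous values to sum
df['c'] = df['b'].rolling(window=n).sum().shift()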