How can I produce the 'Counter' column with pandas? If Status is A, count up by one; if Status is B or C, count down by one.
Index Status Counter
    1      A       1
    2      A       2
    3      A       3
    4      B       2
    5      C       1
    6      A       2
    7      B       1
    8      A       2
    9      A       3
   10      B       2
Map the values to 1/-1 with numpy.where, then perform a cumsum:
import numpy as np

df['Counter'] = np.where(df['Status'].eq('A'), 1, -1).cumsum()
Output:
Index Status Counter
0 1 A 1
1 2 A 2
2 3 A 3
3 4 B 2
4 5 C 1
5 6 A 2
6 7 B 1
7 8 A 2
8 9 A 3
9 10 B 2
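An equivalent pure-pandas variant (a sketch of mine, not part of the original answer) maps each status to its step with Series.map before the cumsum:

df['Counter'] = df['Status'].map({'A': 1, 'B': -1, 'C': -1}).cumsum()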
I think what you need is a loop over the dataframe's rows. You can achieve this by using iterrows on the dataframe:
count = 0
CounterList = []

for i, row in df.iterrows():
    if row["Status"] == "A":
        count += 1
    elif row["Status"] == "B" or row["Status"] == "C":
        count -= 1
    CounterList.append(count)

df["Counter"] = CounterList
df
Output
   Index Status  Counter
0      1      A        1
1      2      A        2
2      3      A        3
3      4      B        2
4      5      C        1
5      6      A        2
6      7      B        1
7      8      A        2
8      9      A        3
9     10      B        2
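A performance note (my addition, not from the original answer): itertuples is usually considerably faster than iterrows for a row loop like this. A minimal sketch, assuming only statuses A, B and C appear, as in the sample:

count = 0
counter_list = []
for row in df.itertuples():
    count += 1 if row.Status == "A" else -1  # B and C both decrement
    counter_list.append(count)
df["Counter"] = counter_list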
Use:
df = pd.DataFrame({'status': ['A', 'A', 'B', 'A']})
temp = df['status'] == 'A'
df['counter'] = temp.replace(False, -1).astype(int).cumsum()
Input:
  status
0      A
1      A
2      B
3      A
Output:
  status  counter
0      A        1
1      A        2
2      B        1
3      A        2
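The boolean-to-±1 mapping can also be done arithmetically, which avoids replace (a small sketch of mine, same result):

df['counter'] = (2 * temp.astype(int) - 1).cumsum()  # True -> 1, False -> -1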
Let's say we have the following dataframe. If we want to find the running count of consecutive 1's, we can use the code below.
col
0 0
1 1
2 1
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 1
11 1
12 1
13 0
14 1
15 1
df['col'].groupby(df['col'].diff().ne(0).cumsum()).cumsum()
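For reference (computed here, not shown in the original), on the data above that line yields the running count within each run:

0     0
1     1
2     2
3     3
4     0
5     0
6     1
7     2
8     0
9     1
10    2
11    3
12    4
13    0
14    1
15    2
Name: col, dtype: int64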
But the problem I see is when you need to use groupby with an id field. If we add an id field to the dataframe (below), it becomes more complicated, and we can no longer use the solution above.
id col
0 B 0
1 B 1
2 B 1
3 B 1
4 A 0
5 A 0
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
When presented with this issue, I've seen the case made for building a helper series to use in the groupby, like this:
s = df['col'].eq(0).groupby(df['id']).cumsum()
df['col'].groupby([df['id'],s]).cumsum()
This works, but the problem is that the first group contains the first row, which does not fit the criteria. That usually isn't an issue, but it is if we want to find the count: replacing the cumsum() at the end of the last groupby() with .transform('count') would give us 6 instead of 5 for the count of consecutive 1's in the first B group.
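For concreteness, a quick reproduction of that off-by-one on the data above (my addition):

s = df['col'].eq(0).groupby(df['id']).cumsum()
# the (B, 1) group spans rows 0, 1, 2, 3, 6, 7 -> count 6, even though row 0 holds a 0
print(df['col'].groupby([df['id'], s]).transform('count').iloc[0])  # 6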
The only solution I can come up with for this problem is the following code:
df['col'].groupby(
    [df['id'],
     df.groupby('id')['col'].transform(lambda x: x.diff().ne(0).astype(int).cumsum())]
).transform('count')
Expected output:
0 1
1 5
2 5
3 5
4 2
5 2
6 5
7 5
8 1
9 2
10 2
11 2
12 2
13 1
14 2
15 2
This works, but uses transform() twice, which I heard isn't the fastest. It is the only solution I can think of that uses diff().ne(0) to get the "real" groups.
Index 1, 2, 3, 6 and 7 are all id B with the same value in the 'col' column, so the count should not be reset; they should all be part of the same group.
Can this be done without using multiple .transform()?
The following code uses only one .transform() and relies on sorting by id to get the correct counts.
The original index is kept, so the final result can be reindexed back to the original order.
Use cum_counts['cum_counts'] to get the exact desired output, without the other column.
import pandas as pd

# test data as shown in OP
df = pd.DataFrame({'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B',
                          'A', 'A', 'A', 'A', 'A'],
                   'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})

# reset the index, then set a (index, id) MultiIndex and sort by id
df = df.reset_index().set_index(['index', 'id']).sort_index(level=1)
col
index id
4 A 0
5 A 0
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
0 B 0
1 B 1
2 B 1
3 B 1
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
# consecutive-run ids: a new group starts whenever col changes
g = df.col.ne(df.col.shift()).cumsum()

# use g to group and a single transform to get the counts
cum_counts = (df['col'].groupby(g)
                       .transform('count')
                       .reset_index(level=1, name='cum_counts')
                       .sort_index())
id cum_counts
index
0 B 1
1 B 5
2 B 5
3 B 5
4 A 2
5 A 2
6 B 5
7 B 5
8 B 1
9 B 2
10 B 2
11 A 2
12 A 2
13 A 1
14 A 2
15 A 2
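As noted above, selecting the single column yields the exact desired output as a Series:

print(cum_counts['cum_counts'])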
After looking at @TrentonMcKinney's solution, I came up with:
df = df.sort_values(['id'])
grp =(df[['id','col']] != df[['id','col']].shift()).any(axis=1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df.sort_index()
Output:
id col count
0 B 0 1
1 B 1 5
2 B 1 5
3 B 1 5
4 A 0 2
5 A 0 2
6 B 1 5
7 B 1 5
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
IIUC, do you want this?
grp = (df[['id', 'col']] != df[['id', 'col']].shift()).any(axis = 1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df
Output:
id col count
0 B 0 1
1 B 1 3
2 B 1 3
3 B 1 3
4 A 0 2
5 A 0 2
6 B 1 2
7 B 1 2
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
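Note (my addition): this is the same grouping idea as the previous answer but without the initial sort on id, so the two non-adjacent B runs remain separate groups, which is why it reports 3 and 2 instead of 5. To merge non-adjacent runs of the same id, sort first, as the previous answer does:

df = df.sort_values('id', kind='stable')  # bring same-id rows together, preserving order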
Here is a model of the real data:
C S E D
1 1 3 0 0
2 1 5 0 0
3 1 6 0 0
4 2 1 0 0
5 2 3 0 0
6 2 7 0 0
C - category, S - start, E - end, D - delta
Using pandas, column E should be filled with the value of column S from the next row (id = id + 1) within the same category, and for the last row of each category, E should equal that row's own S value.
The intermediate result:
C S E D
1 1 3 5 0
2 1 5 6 0
3 1 6 6 0
4 2 1 3 0
5 2 3 7 0
6 2 7 7 0
Then subtract S from E and put the result in D. That part, in principle, is easy; the difficulty is filling in column E.
The final result should be:
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
Use DataFrameGroupBy.shift, replace the last (missing) value in each group with the original S via Series.fillna, and then subtract to fill column D:
df['E'] = df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int)
df['D'] = df['E'] - df['S']
Or, if DataFrame.assign is needed, use a lambda function for D so it can reference the freshly computed E column:
df = df.assign(E=df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int),
               D=lambda x: x['E'] - x['S'])
print (df)
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
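A variant sketch (mine, not from the original answer) that computes D directly with a backward diff within each category and then derives E:

# next S minus current S within each category; the last row of a group gets NaN -> 0
df['D'] = df.groupby('C')['S'].diff(-1).mul(-1).fillna(0).astype(int)
df['E'] = df['S'] + df['D']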
I have the following dataframe:
Name B C D E
1 A 1 2 2 7
2 A 7 1 1 7
3 B 1 1 3 4
4 B 2 1 3 4
5 B 3 1 3 4
What I'm trying to do is obtain a new dataframe in which, for rows with the same "Name", the values in column "B" are continuous. In this example, for rows with "Name" = A, the dataframe would have to be padded with B values ranging from 1 to 7, and columns C, D, E of the padded rows should be 0.
Name B C D E
1 A 1 2 2 7
2 A 2 0 0 0
3 A 3 0 0 0
4 A 4 0 0 0
5 A 5 0 0 0
6 A 6 0 0 0
7 A 7 1 1 7
8 B 1 1 3 4
9 B 2 1 3 4
10 B 3 1 3 4
What I've done so far is turn the B column values for each "Name" into a continuous range:
new_idx = (df_.groupby('Name')
              .apply(lambda x: np.arange(x.index.min(), x.index.max() + 1))
              .apply(pd.Series)
              .stack())
and then reindex the original df (having set B as the index) using this new Series, but I'm having trouble reindexing with duplicates. Any help would be appreciated.
You can use:
import numpy as np

def f(x):
    a = np.arange(x.index.min(), x.index.max() + 1)
    return x.reindex(a, fill_value=0)

new_idx = (df.set_index('B')
             .groupby('Name')
             .apply(f)
             .drop(columns='Name')  # drop the zero-filled column; the real Name returns via reset_index
             .reset_index()
             .reindex(columns=df.columns))
print (new_idx)
Name B C D E
0 A 1 2 2 7
1 A 2 0 0 0
2 A 3 0 0 0
3 A 4 0 0 0
4 A 5 0 0 0
5 A 6 0 0 0
6 A 7 1 1 7
7 B 1 1 3 4
8 B 2 1 3 4
9 B 3 1 3 4
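An alternative sketch (my addition, assuming B should run from each group's minimum to maximum) that avoids the zero-filled Name column by restoring the label explicitly:

import pandas as pd

parts = []
for name, g in df.groupby('Name'):
    full = pd.RangeIndex(g['B'].min(), g['B'].max() + 1, name='B')
    padded = g.set_index('B').reindex(full, fill_value=0).reset_index()
    padded['Name'] = name  # restore the group label on the padded rows
    parts.append(padded[df.columns])

result = pd.concat(parts, ignore_index=True)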
I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
For example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
and now I want to create column e, where for each row i the value of df1['e'][i] would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
how can I achieve this?
You can use sub with shift:
df['e'] = df.d.sub(df.d.shift(), fill_value=0)
print (df)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If need convert to int:
df['e'] = df.d.sub(df.d.shift(), fill_value=0).astype(int)
print (df)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
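An equivalent sketch (my addition) using diff; the first row's NaN is filled with d itself, matching fill_value=0:

df['e'] = df['d'].diff().fillna(df['d']).astype(int)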
Say I have the following dataframe. I want to set to 2 the two elements in column c that correspond to the first two elements in column a that are equal to 1.
>>> df = pd.DataFrame({"a" : [1,1,1,1,2,2,2,2], "b" : [2,3,1,4,5,6,7,2], "c" : [1,2,3,4,5,6,7,8]})
>>> df.loc[df["a"] == 1, "c"].iloc[0:2] = 2
>>> df
a b c
0 1 2 1
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
The code in the second line doesn't work because the chained indexing assigns to a copy, so the original dataframe is not modified. How would I do this?
A dirty way would be:
df.loc[df[df['a'] == 1][:2].index, 'c'] = 2
You can use Index.isin:
import pandas as pd
df = pd.DataFrame({"a" : [1,1,1,1,2,2,2,2],
"b" : [2,3,1,4,5,6,7,2],
"c" : [1,2,3,4,5,6,7,8]})
#more general index
df.index = df.index + 10
print (df)
a b c
10 1 2 1
11 1 3 2
12 1 1 3
13 1 4 4
14 2 5 5
15 2 6 6
16 2 7 7
17 2 2 8
print (df.index.isin(df.index[:2]))
[ True True False False False False False False]
df.loc[(df["a"] == 1) & (df.index.isin(df.index[:2])), "c"] = 2
print (df)
a b c
10 1 2 2
11 1 3 2
12 1 1 3
13 1 4 4
14 2 5 5
15 2 6 6
16 2 7 7
17 2 2 8
If index is nice (starts from 0 without duplicates):
df.loc[(df["a"] == 1) & (df.index < 2), "c"] = 2
print (df)
a b c
0 1 2 2
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
Another solution:
mask = df["a"] == 1
mask &= mask.cumsum() < 3  # keep only the first two matches
df.loc[mask, "c"] = 2
print (df)
a b c
0 1 2 2
1 1 3 2
2 1 1 3
3 1 4 4
4 2 5 5
5 2 6 6
6 2 7 7
7 2 2 8
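One more compact variant (a sketch of mine; works with any index): take the first two index labels where the condition holds and assign with loc:

df.loc[df.index[df["a"].eq(1)][:2], "c"] = 2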