pd.get_dummies allows you to convert a categorical variable into dummy variables. Besides the fact that it's trivial to reconstruct the categorical variable, is there a preferred/quick way to do it?
It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We do axis=1 because we want the column name where the 1 occurs.
EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way @Jeff did, by wrapping it with pd.Categorical (and pd.Series, if desired).
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'a', 'c'])
In [3]: s
Out[3]:
0 a
1 b
2 a
3 c
dtype: object
In [4]: dummies = pd.get_dummies(s)
In [5]: dummies
Out[5]:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
In [6]: s2 = dummies.idxmax(axis=1)
In [7]: s2
Out[7]:
0 a
1 b
2 a
3 c
dtype: object
In [8]: (s2 == s).all()
Out[8]: True
EDIT in response to @piRSquared's comment:
This solution does indeed assume there's one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (the default) (any cases I'm missing?). A row of all zeros will be treated as if it were an instance of the variable named in the first column (e.g. a in the example above).
If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving drop_first=False (the default).
Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set it unless you actually have NaNs. A nice approach is to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").
It's also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
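As a minimal sketch of the dummy_na suggestion above (the series here is made up for illustration): only add the NaN indicator column when the data actually contains missing values, and idxmax then restores real NaNs rather than the string "nan".
import numpy as np
import pandas as pd
s = pd.Series(['a', 'b', np.nan, 'c'])
# add a NaN column only if the data actually has missing values
dummies = pd.get_dummies(s, dummy_na=s.isnull().any())
restored = dummies.idxmax(axis=1)
# restored[2] is a real NaN again (the NaN column label), not the string 'nan'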
In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to 'do' this, as it seems to be a natural operation. Maybe get_categories(); see here.
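In the meantime, a hypothetical helper (the name undummify is mine) wrapping the stack approach above could look like this; it assumes exactly one 1 per row:
import pandas as pd
def undummify(dummies):
    # invert a one-hot frame produced by pd.get_dummies (one 1 per row assumed)
    stacked = dummies.stack()
    labels = stacked[stacked != 0].index.get_level_values(1)
    return pd.Series(pd.Categorical(labels), index=dummies.index)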
This is quite a late answer, but since you ask for a quick way to do it, I assume you're looking for the most performant strategy. On a large dataframe (for instance 10,000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and get the same result. The idea is to index the column names where the dummy dataframe is not 0:
Method:
Using the same sample data as @Nathan:
>>> dummies
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])
>>> s2
0 a
1 b
2 a
3 c
dtype: object
Benchmark:
On a small dummy dataframe, you won't see much difference in performance. However, here are timings for different strategies for solving this problem on a large series:
import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(['a','b','c'], 10000))
dummies = pd.get_dummies(s)

def np_method(dummies=dummies):
    return pd.Series(dummies.columns[np.where(dummies!=0)[1]])

def idx_max_method(dummies=dummies):
    return dummies.idxmax(axis=1)

def get_level_values_method(dummies=dummies):
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))

def dot_method(dummies=dummies):
    return dummies.dot(dummies.columns)
import timeit
# Time each method, 1000 iterations each:
>>> timeit.timeit(np_method, number=1000)
1.0491090340074152
>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488
>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992
>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936
The np.where method is about 4 times faster than the get_level_values method and 11.5 times faster than the idxmax method! It also beats (but only by a little) the .dot() method outlined in this answer to a similar question.
They all return the same result:
>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True
Setup
Using @Jeff's setup:
s = pd.Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)
If columns are strings
and there is only one 1 per row
df.dot(df.columns)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object
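To see why the "only one 1 per row" caveat matters for the dot trick: the matrix product simply sums up the column names selected by each row, so two 1s silently concatenate into a new label instead of raising. A made-up two-column example:
import pandas as pd
demo = pd.DataFrame({'a': [1, 1], 'b': [0, 1]})
demo.dot(demo.columns)
# 0     a
# 1    ab   <- the two 1s in the second row concatenate 'a' and 'b'
# dtype: object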
numpy.where
Again! Assuming only one 1 per row
i, j = np.where(df)
pd.Series(df.columns[j], i)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]
numpy.where
Not assuming one 1 per row
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))
0 0 a
1 0 a
2 0 a
3 1 b
4 1 b
5 1 b
6 2 c
7 2 c
8 3 d
9 3 d
10 4 e
11 5 f
12 6 g
13 7 h
dtype: object
numpy.where
Where we don't assume one 1 per row and we drop the index
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object
Another option is to use the function from_dummies, available since pandas version 1.5.0. Here is a reproducible example:
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c'])
df = pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
Using from_dummies:
pd.from_dummies(df)
0 a
1 b
2 a
3 c
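If I remember the API correctly, from_dummies also accepts a default_category argument, which makes the drop_first case invertible as long as you know which category was dropped (hedged sketch; 'a' is assumed to be the dropped one here):
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c'])
dropped = pd.get_dummies(s, drop_first=True)  # columns: b, c
# all-zero rows are mapped back to the supplied default category
restored = pd.from_dummies(dropped, default_category='a')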
Converting dat["classification"] to one hot encodes and back!!
import pandas as pd
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dat["labels"]= le.fit_transform(dat["classification"])
Y= pd.get_dummies(dat["labels"])
tru=[]
for i in range(0, len(Y)):
tru.append(np.argmax(Y.iloc[i]))
tru= le.inverse_transform(tru)
##Identical check!
(tru==dat["classification"]).value_counts()
If you're categorizing the rows in your dataframe based on some row-wise mutually exclusive boolean conditions (these are the "dummy" variables) which don't form a partition (i.e. some rows are all 0 because of, for example, some missing data), it may be better to initialize a pd.Categorical full of np.nan and then explicitly set the category of each subset. An example follows.
0. Data setup:
np.random.seed(42)
student_names = list('abcdefghi')
marks = np.random.randint(0, 100, len(student_names)).astype(float)
passes = marks >= 50
marks[[1, 5]] = np.nan # artificially introduce NAs
students = pd.DataFrame({'mark': marks, 'pass': passes}, index=student_names)
>>> students
mark pass
a 51.0 True
b NaN True
c 14.0 False
d 71.0 True
e 60.0 True
f NaN False
g 82.0 True
h 86.0 True
i 74.0 True
1. Compute the value of the relevant boolean conditions:
failed = ~students['pass']
barely_passed = students['pass'] & (students['mark'] < 60)
well_passed = students['pass'] & (students['mark'] >= 60)
>>> pd.DataFrame({'f': failed, 'b': barely_passed, 'p': well_passed}).astype(int)
b f p
a 1 0 0
b 0 0 0
c 0 1 0
d 0 0 1
e 0 0 1
f 0 1 0
g 0 0 1
h 0 0 1
i 0 0 1
As you can see, row b has False for all three categories (since the mark is NaN and pass is True).
2. Generate the categorical series:
cat = pd.Series(
pd.Categorical([np.nan] * len(students), categories=["failed", "barely passed", "well passed"]),
index=students.index
)
cat[failed] = "failed"
cat[barely_passed] = "barely passed"
cat[well_passed] = "well passed"
>>> cat
a barely passed
b NaN
c failed
d well passed
e well passed
f failed
g well passed
h well passed
i well passed
As you can see, a NaN was kept where none of the categories applied.
This approach is as performant as using np.where but allows for the flexibility of possible NaNs.
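For what it's worth, a roughly equivalent variant of my own using numpy.select on the same failed/barely_passed/well_passed masks: rows matching none of the conditions fall through to None, which pandas treats as missing when converting to a categorical.
import numpy as np
import pandas as pd
labels = ["failed", "barely passed", "well passed"]
cat_alt = pd.Series(
    np.select([failed, barely_passed, well_passed], labels, default=None),
    index=students.index,
).astype(pd.CategoricalDtype(labels))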
Related
I have a dataframe in Python that has a column like the one below:
Type
A
A
B
B
B
I want to add another column to my data frame according to the sequence of Type:
Type Seq
A 1
A 2
B 1
B 2
B 3
I was doing it in R with the following command:
setDT(df)[ , Seq := seq_len(.N), by = rleid(Type) ]
I am not sure how to do it in Python.
Use Series.rank:
df['seq'] = df['Type'].rank(method = 'dense').astype(int)
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
Edit for updated question
df['seq'] = df.groupby('Type').cumcount() + 1
df
Output:
Type seq
0 A 1
1 A 2
2 B 1
3 B 2
4 B 3
Use pd.factorize:
import pandas as pd
df['seq'] = pd.factorize(df['Type'])[0] + 1
df
Output:
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
In pandas
(df.Type!=df.Type.shift()).ne(0).cumsum()
Out[58]:
0 1
1 1
2 2
3 2
4 2
Name: Type, dtype: int32
More info
v=c('A','A','B','B','B','A')
data.table::rleid(v)
[1] 1 1 2 2 2 3
df
Type
0 A
1 A
2 B
3 B
4 B
5 A
# row 5 repeats 'A'; it should get a new number, as R's data.table rleid would assign
(df.Type!=df.Type.shift()).ne(0).cumsum()
Out[60]:
0 1
1 1
2 2
3 2
4 2
5 3
# check: row 5 gets a new id (3)
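Wrapped as a small helper (the name rleid is mine, mirroring the R function), the same cumsum trick reads:
import pandas as pd
def rleid(s):
    # run-length id: increments every time the value changes
    return s.ne(s.shift()).cumsum()
# rleid(df.Type) gives 1 1 2 2 2 3 for A A B B B A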
Might not be the best way, but try this:
df.loc[df['Type'] == 'A', 'Seq'] = 1
Similarly, for B:
df.loc[df['Type'] == 'B', 'Seq'] = 2
A strange (and not recommended) way of doing it is to use the built-in ord() function to get the Unicode code point of the character.
That is:
df['Seq'] = df['Type'].apply(lambda x: ord(x.lower()) - 96)
A much better way of doing it is to change the type of the strings to categories:
df['Seq'] = df['Type'].astype('category').cat.codes
You may have to increment the codes if you want different numbers.
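For instance, on the example column from the question (a sketch; the +1 only shifts the zero-based codes so numbering starts at 1):
import pandas as pd
df = pd.DataFrame({'Type': list('AABBB')})
df['Seq'] = df['Type'].astype('category').cat.codes + 1
# gives Seq = 1 1 2 2 2, i.e. one number per Type, not the per-group running count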
The idea is to transform a data frame in the fastest way according to the values specific to each column.
For simplicity, here is an example where each element of a column is compared to the mean of the column it belongs to and replaced with 1 if greater than mean(column) or 0 otherwise.
In [26]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
In [27]: df
Out[27]:
0 1 2
0 1 2 3
1 4 5 6
In [28]: df.mean().values.tolist()
Out[28]: [2.5, 3.5, 4.5]
The snippet below is not real code, but rather illustrates the desired behavior. I used the apply method, but it can be whatever works fastest.
In [29]: f = lambda x: 0 if x < means else 1
In [30]: df.apply(f)
In [27]: df
Out[27]:
0 1 2
0 0 0 0
1 1 1 1
This is a toy example but the solution has to be applied to a big data frame, therefore, it has to be fast.
Cheers!
You can create a boolean mask of the dataframe by comparing each element with the mean of that column. It can be easily achieved using
df > df.mean()
0 1 2
0 False False False
1 True True True
Since True equates to 1 and False to 0, a boolean dataframe can be easily converted to integer using astype.
(df > df.mean()).astype(int)
0 1 2
0 0 0 0
1 1 1 1
If you need the output to be strings rather than 0 and 1, use np.where, which works as np.where(condition, value_if_true, value_if_false):
pd.DataFrame(np.where(df > df.mean(), 'm', 'n'))
0 1 2
0 n n n
1 m m m
Edit, addressing the question in the comments: what if m and n are column-dependent?
df = pd.DataFrame(np.arange(12).reshape(4,3))
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
pd.DataFrame(np.where(df > df.mean(), df.min(), df.max()))
0 1 2
0 9 10 11
1 9 10 11
2 0 1 2
3 0 1 2
I want to select the rows in a dataframe which have zero in every column in a list of columns, e.g. this df:
In:
df = pd.DataFrame([[1,2,3,6], [2,4,6,8], [0,0,3,4],[1,0,3,4],[0,0,0,0]],columns =['a','b','c','d'])
df
Out:
a b c d
0 1 2 3 6
1 2 4 6 8
2 0 0 3 4
3 1 0 3 4
4 0 0 0 0
Then:
In:
mylist = ['a','b']
selection = df.loc[df['mylist']==0]
selection
I would like to see:
Out:
a b c d
2 0 0 3 4
4 0 0 0 0
Should be simple but I'm having a slow day!
You'll need to determine whether all columns of a row have zeros or not. Given a boolean mask, use DataFrame.all(axis=1) to do that.
df[df[mylist].eq(0).all(1)]
a b c d
2 0 0 3 4
4 0 0 0 0
Note that if you wanted to find rows with zeros in every column, remove the subsetting step:
df[df.eq(0).all(1)]
a b c d
4 0 0 0 0
Using reduce and NumPy's logical_and
The point of this is to eliminate the need to create new pandas objects and simply produce the mask we are looking for using the data where it sits.
import numpy as np
from functools import reduce
df[reduce(np.logical_and, (df[c].values == 0 for c in mylist))]
a b c d
2 0 0 3 4
4 0 0 0 0
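An equivalent without functools: NumPy ufuncs expose .reduce directly, so the same mask can be built as
import numpy as np
df[np.logical_and.reduce([df[c].values == 0 for c in mylist])]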
Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
group
0 a
1 a
2 a
3 b
4 b
5 a
6 b
7 c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically assigning a length n array, zero-indexed, to each group. But that has proven difficult to transpose and join.
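That idea can be rescued with transform, which broadcasts each group's result back onto the original index (a sketch of the asker's own approach; the answers below are more idiomatic):
import numpy as np
df['n'] = df.groupby('group')['group'].transform(lambda g: np.arange(len(g)))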
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Simply use groupby on the column name, in this case group, then apply cumcount, and finally add the result as a new column in the dataframe.
df['n']=df.groupby('group').cumcount()
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
You can use the apply method by passing a lambda expression as a parameter.
The idea is that the count for a group in a given row is the number of appearances of that group in the previous rows.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Note: the cumcount method is built with the help of the apply function.
You can read about this in the pandas documentation.