Enumerate rows by category - python

Enumerate rows by category
I have the following dataframe that I'm ordering by category and values:
d = {"cat":["a","b","a","c","c"],"val" :[1,2,3,1,4] }
df = pd.DataFrame(d)
df = df.sort_values(["cat","val"])
Now from that dataframe I want to enumerate the occurrences of each category, so the result is as follows:
df["cat_count"] = [1,2,1,1,2]
Is there a way to automate this?

You can use GroupBy.cumcount for this:
df['count'] = df.groupby('cat').cumcount() + 1
print(df)
Output
  cat  val  count
0   a    1      1
2   a    3      2
1   b    2      1
3   c    1      1
4   c    4      2
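For completeness, here is the whole thing end to end using the asker's column name; note that cumcount is zero-based, which is why the answer adds 1:
import pandas as pd

d = {"cat": ["a", "b", "a", "c", "c"], "val": [1, 2, 3, 1, 4]}
df = pd.DataFrame(d).sort_values(["cat", "val"])

# cumcount numbers rows within each group starting at 0; +1 makes it 1-based
df["cat_count"] = df.groupby("cat").cumcount() + 1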

Related

How can I groupby a DataFrame at the same time I count the values and put in different columns?

I have a DataFrame that looks like the one below
Index  Category  Class
0             1      A
1             1      A
2             1      B
3             2      A
4             3      B
5             3      B
And I would like to get an output data frame that groups by Category and has one column per class, counting the occurrences of that class in each category, such as the one below:
Index  Category  A  B
0             1  2  1
1             2  1  0
2             3  0  2
So far I've tried various combinations of the groupby and agg methods, but I still can't get what I want. I've also tried df.pivot_table(index='Category', columns='Class', aggfunc='count'), but that returns a DataFrame without columns. Any ideas of what could work in this case?
You can use aggfunc="size" to achieve your desired result:
>>> df.pivot_table(index='Category', columns='Class', aggfunc='size', fill_value=0)
Class     A  B
Category
1         2  1
2         1  0
3         0  2
Alternatively, you can use .groupby(...).size() to get the counts, and then unstack to reshape your data as well:
>>> df.groupby(["Category", "Class"]).size().unstack(fill_value=0)
Class     A  B
Category
1         2  1
2         1  0
3         0  2
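A third, equivalent spelling (not shown above, but standard pandas) counts the Class values within each Category group and unstacks the result:
>>> df.groupby('Category')['Class'].value_counts().unstack(fill_value=0)
Class     A  B
Category
1         2  1
2         1  0
3         0  2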
Assign a dummy value to count:
out = df.assign(val=1).pivot_table('val', 'Category', 'Class',
                                   aggfunc='count', fill_value=0).reset_index()
print(out)
# Output
Class  Category  A  B
0             1  2  1
1             2  1  0
2             3  0  2
import pandas as pd
df = pd.DataFrame({'Index': [0, 1, 2, 3, 4, 5],
                   'Category': [1, 1, 1, 2, 3, 3],
                   'Class': ['A', 'A', 'B', 'A', 'B', 'B'],
                   })
df = df.groupby(['Category', 'Class']).count()
df = df.pivot_table(index='Category', columns='Class')
print(df)
output:
         Index
Class        A    B
Category
1          2.0  1.0
2          1.0  NaN
3          NaN  2.0
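The NaN cells appear because count skips Category/Class pairs that never occur. If integer zeros are preferred, one option (a small addition beyond the original answer) is to fill and cast:
df = df.fillna(0).astype(int)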
Use crosstab:
pd.crosstab(df['Category'], df['Class']).reset_index()
output:
Class  Category  A  B
0             1  2  1
1             2  1  0
2             3  0  2

Create a summary table with count values

My df looks like:
group  value
A          1
B          1
A          1
B          1
B          0
B          0
A          0
I want to create a df
value  0  1
group
A      a  b
B      c  d
where a,b,c,d are the counts of 0s and 1s in groups A and B respectively.
I tried df.groupby('group').size(), but that gave an overall count per group and did not split the 0s and 1s. I also tried a groupby count method, but I have not been able to produce the target data frame.
Use pd.crosstab:
pd.crosstab(df['group'], df['value'])
Output:
value  0  1
group
A      1  2
B      2  2
Use pivot_table for this:
res = pd.pivot_table(df, index='group', columns='value', aggfunc='size')
>>> print(res)
value  0  1
group
A      1  2
B      2  2
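One caveat worth noting: with aggfunc='size', any group/value combination that never occurs comes back as NaN (and the columns become floats). On data where that can happen, passing fill_value=0 keeps the table integer-valued:
res = pd.pivot_table(df, index='group', columns='value', aggfunc='size', fill_value=0)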

Combining a list of tuple dataframes in python

I have a large dataset where every two rows need to be grouped together and combined into one longer row, essentially duplicating the headers and appending the 2nd row to the 1st. Here is a small sample:
df = pd.DataFrame({'ID': [1, 1, 2, 2], 'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
ID  Var1  Var2
 1     A     B
 1     2     5
 2     C     D
 2     7     9
I will have to group the rows by 'ID', therefore I ran:
grouped = df.groupby(['ID'])
grp_lst = list(grouped)
This resulted in a list of tuples grouped by ID, where element 1 of each tuple is the grouped dataframe I would like to combine.
The desired result is a dataframe that looks something like this:
ID  Var1  Var2  ID.1  Var1.1  Var2.1
 1     A     B     1       2       5
 2     C     D     2       7       9
I have to do this over a large data set, where 'ID' is used to group the rows, and I essentially want to append the bottom row of each pair to the end of the top row.
Any help would be appreciated, and I assume there is a much easier way to do this than what I am doing.
Thanks in advance!
Let us try:
i = df.groupby('ID').cumcount().astype(str)
df_out = df.set_index([df['ID'].values, i]).stack().unstack([2, 1])
df_out.columns = df_out.columns.map('.'.join)
Details:
Group the dataframe on ID and use cumcount to create a sequential counter that uniquely identifies the rows per ID:
>>> i
0    0
1    1
2    0
3    1
dtype: object
Create a MultiIndex on the dataframe with the first level set to the ID values and the second level set to the above sequential counter, then use stack followed by unstack to reshape the dataframe into the desired format:
>>> df_out
  ID Var1 Var2 ID Var1 Var2    #---> Level 0 columns
   0    0    0  1    1    1    #---> Level 1 columns
1  1    A    B  1    2    5
2  2    C    D  2    7    9
Finally, flatten the MultiIndex columns using Index.map with '.'.join:
>>> df_out
  ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
1    1      A      B    1      2      5
2    2      C      D    2      7      9
Here is another way: use numpy to reshape the dataframe first, then tile the columns, and create a new dataframe from the reshaped values and tiled columns:
import numpy as np

s = df.shape[1]
c = np.tile(df.columns, 2) + '.' + (np.arange(s * 2) // s).astype(str)
df_out = pd.DataFrame(df.values.reshape(-1, s * 2), columns=c)
>>> df_out
  ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
0    1      A      B    1      2      5
1    2      C      D    2      7      9
Note: This method is only applicable if you have exactly two rows per ID and the ID column is already sorted.
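Under those same two assumptions, a simpler positional sketch (my own variant, not from the answer above) slices out the even and odd rows and concatenates them side by side:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Var1': ['A', 2, 'C', 7],
                   'Var2': ['B', 5, 'D', 9]})

# every other row: first rows of each pair, then second rows of each pair
first = df.iloc[::2].reset_index(drop=True)
second = df.iloc[1::2].reset_index(drop=True).add_suffix('.1')
out = pd.concat([first, second], axis=1)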

How to count the elements in a column and take the result as a new column?

The DataFrame named df is shown as follows.
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 3]})
Input:
   id
0   1
1   1
2   3
I want to count the number of occurrences of each id, and store the result in a new column count.
Expected:
   id  count
0   1      2
1   1      2
2   3      1
pd.factorize and np.bincount
My favorite: factorize does not sort and has O(n) time complexity. For big data sets, factorize should be preferred over np.unique.
import numpy as np

# i: integer codes per row, u: the unique ids
i, u = df.id.factorize()
# np.bincount(i) counts each code; indexing by i broadcasts the counts back per row
df.assign(Count=np.bincount(i)[i])
   id  Count
0   1      2
1   1      2
2   3      1
np.unique and np.bincount
u, i = np.unique(df.id, return_inverse=True)
df.assign(Count=np.bincount(i)[i])
   id  Count
0   1      2
1   1      2
2   3      1
Assign the new count column to the dataframe by grouping on id and then transforming that column with value_counts (or size).
>>> df.assign(count=df.groupby('id')['id'].transform('value_counts'))
   id  count
0   1      2
1   1      2
2   3      1
Use Series.map with Series.value_counts:
df['count'] = df['id'].map(df['id'].value_counts())
#alternative
#from collections import Counter
#df['count'] = df['id'].map(Counter(df['id']))
Detail:
print(df['id'].value_counts())
1    2
3    1
Name: id, dtype: int64
Or use GroupBy.transform with GroupBy.size to return a Series with the same length as the original DataFrame:
df['count'] = df.groupby('id')['id'].transform('size')
print(df)
   id  count
0   1      2
1   1      2
2   3      1

Adding rows to a Dataframe to unify the length of groups

I would like to add elements to specific groups in a Pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
I would like to have the same number of elements per group (where grouping is by the key column). Group 2 has the most elements: three. However, group 1 has only two elements, so a zero should be added as follows:
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
Note that the index does not matter.
You can create a new MultiIndex level with cumcount and then add the missing values via unstack/stack or reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
        .unstack(fill_value=0)
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print(df)
   key  value
0    1      1
1    1      3
2    1      0
3    2      2
4    2      4
5    2      5
If the original order of values matters:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names = df1.index.names)
#get appended values
miss = mux.difference(df1.index).get_level_values(0)
#create helper df and add 0 to all columns of original df
df2 = pd.DataFrame({'key':miss}).reindex(columns=df.columns, fill_value=0)
#append to original df
df = pd.concat([df, df2], ignore_index=True)
print(df)
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
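For reference, a shorter sketch (my own variant, not one of the answers above) computes how many rows each key is missing relative to the largest group and appends that many zero rows:
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 2], [1, 3], [2, 4], [2, 5]],
                  columns=['key', 'value'])

sizes = df['key'].value_counts()     # rows per key
missing = sizes.max() - sizes        # rows each key lacks
pad = pd.DataFrame({'key': sizes.index.repeat(missing), 'value': 0})
df = pd.concat([df, pad], ignore_index=True)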
