create a summary table with count values - python

my df looks like:

group  value
A      1
B      1
A      1
B      1
B      0
B      0
A      0
I want to create a df:

value  0  1
group
A      a  b
B      c  d

where a, b, c, d are the counts of 0s and 1s in groups A and B respectively.
I tried df.groupby('group').size(), but that gave an overall count and did not split the 0s and 1s. I also tried a groupby count method, but could not produce the target data frame.

Use pd.crosstab:
pd.crosstab(df['group'], df['value'])
Output:
value  0  1
group
A      1  2
B      2  2

Use pivot_table for this:
res = pd.pivot_table(df, index='group', columns='value', aggfunc='size')
>>> print(res)
value  0  1
group
A      1  2
B      2  2
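Both answers are equivalent to counting each (group, value) pair and then pivoting the value levels into columns; a self-contained sketch of that route:

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "B", "A", "B", "B", "B", "A"],
                   "value": [1, 1, 1, 1, 0, 0, 0]})

# Count each (group, value) pair, then pivot the value levels into columns.
res = df.groupby(["group", "value"]).size().unstack(fill_value=0)
print(res)
```

unstack(fill_value=0) guarantees a 0 rather than NaN for any group/value combination that never occurs.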

Printing count of a column based on value of another column

I have a data frame:

Dept_Name  Placed
A          1
B          0
C          1
where the 'Placed' column has a boolean value.
I want to print the count of rows that have the value 1 in Placed, grouped by Dept_Name:

Dept_Name  Count(Placed == 1)
A          3
B          4
C          0
If the values are 0/1 or True/False you can aggregate with sum; for the named Count column use Series.reset_index:
df1 = df.groupby('Dept_Name')['Placed'].sum().reset_index(name='Count')
If you need to test against some non-boolean value - e.g. to count occurrences of the value 100:
df2 = df['Placed'].eq(100).groupby(df['Dept_Name']).sum().reset_index(name='Count')
Since you have boolean 0/1 values, a simple sum will work:
out = df.groupby('Dept_Name', as_index=False).sum()
output:
  Dept_Name  Placed
0         A       5
1         B       0
2         C       2
For a named column:
out = df.groupby('Dept_Name', as_index=False).agg(**{'Count': ('Placed', 'sum')})
output:
  Dept_Name  Count
0         A      5
1         B      0
2         C      2
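A runnable sketch of the named-aggregation approach, using hypothetical sample data constructed to match the counts shown above:

```python
import pandas as pd

# Hypothetical data matching the output shown above (A: 5 placed, B: 0, C: 2).
df = pd.DataFrame({
    "Dept_Name": ["A"] * 5 + ["B"] * 2 + ["C"] * 3,
    "Placed":    [1, 1, 1, 1, 1, 0, 0, 1, 1, 0],
})

# Named aggregation: sum the 0/1 flags per department into a 'Count' column.
out = df.groupby("Dept_Name", as_index=False).agg(Count=("Placed", "sum"))
print(out)
```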

Enumerate rows by category

I have the following dataframe that I'm ordering by category and values:
d = {"cat":["a","b","a","c","c"],"val" :[1,2,3,1,4] }
df = pd.DataFrame(d)
df = df.sort_values(["cat","val"])
Now from that dataframe I want to enumerate the occurrences of each category,
so the result is as follows:
df["cat_count"] = [1,2,1,1,2]
Is there a way to automate this?
You can use GroupBy.cumcount like this:
df['count'] = df.groupby('cat').cumcount() + 1
print(df)
Output
  cat  val  count
0   a    1      1
2   a    3      2
1   b    2      1
3   c    1      1
4   c    4      2
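If the frame has not been pre-sorted, rank(method='first') on the value column gives an equivalent per-category numbering ordered by val; an alternative sketch:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "a", "c", "c"],
                   "val": [1, 2, 3, 1, 4]})

# rank(method='first') numbers each row within its category by 'val',
# so the frame does not have to be sorted first.
df["cat_count"] = df.groupby("cat")["val"].rank(method="first").astype(int)
print(df)
```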

Pandas/Numpy shift rows into column based on existence

I have a dataframe like so:
col_a | col b
    0       1
    0       2
    0       3
    1       1
    1       2

I want to convert it to:

col_a | 1 | 2 | 3
    0   1   1   1
    1   1   1   0
Unfortunately, most questions/answers revolving around this topic simply pivot it
Background: For Scikit, I want to use the existence of values in column b as an attribute/feature (like a sort of manual CountVectorizer, but for row values in this case instead of text)
Use get_dummies after setting the first column as the index, then take the max per index level so the output contains only 1/0 values:
df = pd.get_dummies(df.set_index('col_a')['col b'], prefix='', prefix_sep='').max(level=0)
print (df)
       1  2  3
col_a
0      1  1  1
1      1  1  0
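Note that .max(level=0) was removed in recent pandas versions; grouping by the index level explicitly does the same thing. A sketch reproducing the example:

```python
import pandas as pd

df = pd.DataFrame({"col_a": [0, 0, 0, 1, 1],
                   "col b": [1, 2, 3, 1, 2]})

# In recent pandas, .max(level=0) is gone; group by the index level instead.
# astype(int) converts the boolean dummies back to 1/0.
res = (pd.get_dummies(df.set_index("col_a")["col b"], prefix="", prefix_sep="")
         .groupby(level=0).max()
         .astype(int))
print(res)
```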
You can use GroupBy.cumcount and use it as the columns of a pivoted dataframe, which can be obtained with pd.crosstab and by default computes a frequency table of the factors:
cols = df.groupby('col_a').cumcount()
pd.crosstab(index = df.col_a, columns = cols)
col_0  0  1  2
col_a
0      1  1  1
1      1  1  0
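Another option along the same lines: since crosstab returns counts, clipping them at 1 turns the table into a pure existence indicator. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"col_a": [0, 0, 0, 1, 1],
                   "col b": [1, 2, 3, 1, 2]})

# crosstab counts occurrences; clip(upper=1) turns any positive count
# into a 1, giving an existence indicator per (col_a, col b) pair.
res = pd.crosstab(df["col_a"], df["col b"]).clip(upper=1)
print(res)
```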

Reshaping into binary variables using pandas python

I am still new to pandas' pivot_table and I'm trying to reshape the data to get a binary indicator of whether a value is present in an observation. I have followed some previous code and got encouraging results; however, instead of the 1s and 0s of my ideal result, I get a sum. Please see a small sample data set below:
ID  SKILL  NUM
1   A      1
1   A      1
1   B      1
2   C      1
3   C      1
3   C      1
3   E      1
The result I am aiming for is:

ID  A  B  C  E
1   1  1  0  0
2   0  0  1  0
3   0  0  0  1
My code currently gets the following result:

ID  A  B  C  E
1   2  1  0  0
2   0  0  2  0
3   0  0  0  1
Should I remove the duplicates first? The code I'm currently using is below:
df_pivot = df2.pivot_table(index='Job_posting_ID', columns='SKILL', aggfunc=len, fill_value=0)
You can use get_dummies with set_index for the indicator columns, then take the max values per index level:
df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)
For better performance, remove the duplicates with drop_duplicates and reshape with set_index plus unstack:
df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)
A solution with pivot, though it is then necessary to replace the NaNs with 0:
df = df.drop_duplicates(['ID','SKILL']).pivot('ID','SKILL','NUM').fillna(0).astype(int)
If you want to use your solution, just remove the duplicates first. unstack is still preferable, because once the ID/SKILL pairs are deduplicated the data never need aggregation:
df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
                            columns='SKILL',
                            values='NUM',
                            aggfunc=len,
                            fill_value=0))
print(df_pivot)
print (df_pivot)
SKILL  A  B  C  E
ID
1      1  1  0  0
2      0  0  1  0
3      0  0  1  1
Try like this:
df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)
Or this:
df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)
Whichever suits you best.
You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.
df_pivot = df.pivot_table(index='ID', columns='SKILL',
                          aggfunc='any', fill_value=0).astype(int)
print(df_pivot)
      NUM
SKILL   A  B  C  E
ID
1       1  1  0  0
2       0  0  1  0
3       0  0  1  1
The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.
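A self-contained sketch of the aggfunc='any' route; passing values='NUM' also drops the extra NUM column level from the output:

```python
import pandas as pd

df = pd.DataFrame({"ID":    [1, 1, 1, 2, 3, 3, 3],
                   "SKILL": ["A", "A", "B", "C", "C", "C", "E"],
                   "NUM":   [1, 1, 1, 1, 1, 1, 1]})

# 'any' marks presence regardless of duplicates; cast the booleans to int.
df_pivot = (df.pivot_table(index="ID", columns="SKILL",
                           values="NUM", aggfunc="any", fill_value=0)
              .astype(int))
print(df_pivot)
```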

Group by value of sum of columns with Pandas

I got lost in the pandas docs and features trying to figure out a way to group a DataFrame by the values of the sums of its columns.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped, since they all sum to 1. The resulting DataFrame would have column labels equal to the sums of the columns it grouped, like this:

   1  9
0  2  2
1  1  3
2  0  4

Any idea to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4

[3 rows x 2 columns]
df.sum() is your grouper. It sums over axis 0 (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You then group the columns (axis=1) and take the sum of each group.
Because pandas is designed with database concepts in mind, it really expects related information to be stored in rows, not columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4
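As a side note, groupby(..., axis=1) is deprecated in recent pandas; transposing first achieves the same column grouping with an ordinary row-wise groupby. A sketch:

```python
import pandas as pd

dat = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [1, 0, 0], "d": [2, 3, 4]}
df = pd.DataFrame(dat)

# df.sum() (a Series indexed by column name) aligns with df.T's index,
# so it groups the transposed rows; transpose back at the end.
res = df.T.groupby(df.sum()).sum().T
print(res)
```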
