pandas pivot_table returns empty dataframe - python

I get an empty dataframe when I try to count grouped values using pivot_table. Let's first create some dummy data:
import pandas as pd
df = pd.DataFrame({"size":['large','middle','xsmall','large','middle','small'],
"color":['blue','blue','red','black','red','red']})
When I use:
df1 = df.pivot_table(index='size', aggfunc='count')
it returns what I expect. Now I would like to have a complete pivot table with the colors as columns:
df2 = df.pivot_table(index='size', aggfunc='count',columns='color')
But this results in an empty dataframe. Why? How can I get a simple pivot table that counts how often each combination occurs?
Thank you.

You need to use len as the aggfunc, like so
df.pivot_table(index='size', aggfunc=len, columns='color')
If you want to use count, here are the steps:
First add a frequency column, like so:
df['freq'] = df.groupby(['color', 'size'])['color'].transform('count')
Then create the pivot table using the frequency column:
df.pivot_table(values='freq', index='size', aggfunc='count', columns='color')
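As an alternative to the two approaches above (not taken from the answer itself), counting combinations of two columns can also be sketched with pd.crosstab or with groupby(...).size(). This is a minimal sketch using the question's sample data; the variable names counts and by_group are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "size": ["large", "middle", "xsmall", "large", "middle", "small"],
    "color": ["blue", "blue", "red", "black", "red", "red"],
})

# crosstab counts how often each (size, color) pair occurs; missing pairs become 0
counts = pd.crosstab(df["size"], df["color"])

# Same idea with groupby: count rows per pair, then move color into the columns
by_group = df.groupby(["size", "color"]).size().unstack(fill_value=0)

print(counts)
```

Both variants need no helper column, which may be convenient when the frame has no natural values column to aggregate.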

You need another column to be used as the values for the aggregation: index and columns already use up both of your columns, so nothing is left to count.
Add a column:
df['freq'] = 1
Then your code will work.
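A minimal sketch of this idea on the question's sample data follows. Passing values='freq' explicitly and using fill_value=0 are choices made here for readability, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["large", "middle", "xsmall", "large", "middle", "small"],
    "color": ["blue", "blue", "red", "black", "red", "red"],
})

# Helper column so that pivot_table has something to aggregate
df["freq"] = 1

# Count the helper column per (size, color); combinations that never occur are filled with 0
table = df.pivot_table(values="freq", index="size", columns="color",
                       aggfunc="count", fill_value=0)

print(table)
```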

Related

Pandas: Sort Dataframe if Column Value Exists in another Dataframe

I have a database with two columns of unique numbers. This is my reference dataframe (df_reference). From another dataframe (df_data) I want to get the rows whose value in a given column also exists in this reference dataframe. I tried stuff like:
df_new = df_data[df_data['ID'].isin(df_reference)]
However, this doesn't give me any results. What am I doing wrong here?
From what I see, you are passing the whole dataframe to the .isin() method.
Try:
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
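A minimal sketch of the fix above. The sample frames and the extra columns other and value are invented for illustration, since the question does not show its data.

```python
import pandas as pd

# Hypothetical reference and data frames; only the ID column name comes from the question
df_reference = pd.DataFrame({"ID": [1, 2, 3], "other": [10, 20, 30]})
df_data = pd.DataFrame({"ID": [3, 4, 1, 5], "value": ["c", "d", "a", "e"]})

# Pass the ID column (a Series), not the whole DataFrame, to isin
mask = df_data["ID"].isin(df_reference["ID"])
df_new = df_data[mask]

print(df_new)
```

The mask keeps the original row order of df_data, so no sorting or re-indexing is needed afterwards.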
Alternatively, make the ID column the index of the df_data dataframe. Then you could do:
matching_index = df_reference['ID']
df_new = df_data.loc[matching_index, :]
This should solve the issue.
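The index-based variant could be sketched as below, again on invented sample data. One caveat: .loc raises a KeyError when a requested label is not in the index, so this sketch first intersects the reference IDs with the index. That intersection step is an addition here, not part of the answer.

```python
import pandas as pd

# Hypothetical frames; 99 is deliberately absent from df_data
df_reference = pd.DataFrame({"ID": [1, 2, 3, 99]})
df_data = pd.DataFrame({"ID": [3, 4, 1, 5], "value": ["c", "d", "a", "e"]})

# Use ID as the index so rows can be looked up by label
indexed = df_data.set_index("ID")

# Keep only the reference IDs that actually occur in the index
common = indexed.index.intersection(pd.Index(df_reference["ID"]))
df_new = indexed.loc[common]

print(df_new)
```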

How to include grouping variable in the groupby().diff() results

Newbie to pandas here, so go easy on me.
I have a dataframe with lots of columns. I want to do something like
df.groupby('row').diff()
However, the result of the groupby doesn't include the row column.
How do I include the row column in the groupby results?
Alternatively, is it possible to merge the groupby results into the dataframe?
Set the row column as the index first:
df1 = df.set_index('row').groupby('row').diff().reset_index()
Or:
df1 = df.set_index('row').groupby(level=0).diff().reset_index()
You could also use agg with np.diff (this requires import numpy as np):
df.groupby('row').agg(np.diff)
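On the question of merging the results back: groupby(...).diff() keeps the original row index, so the differences can be joined onto the grouping column. A minimal sketch, assuming made-up columns x and y and a _diff suffix chosen here for clarity:

```python
import pandas as pd

# Hypothetical data; only the column name "row" comes from the question
df = pd.DataFrame({
    "row": ["a", "a", "b", "b", "a"],
    "x": [1, 3, 10, 15, 6],
    "y": [2.0, 2.5, 1.0, 4.0, 3.0],
})

# diff() within each group; the result is aligned to df's original index
diffs = df.groupby("row").diff()

# Join the grouping column back on, renaming the diff columns to avoid confusion
result = df[["row"]].join(diffs.add_suffix("_diff"))

print(result)
```

The first row of each group has no predecessor, so its difference is NaN.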

How to pivot dataframe without changing index and returning all columns from previous operation

I have a table:
When I try to pivot it in Python:
df.pivot(columns = 'Type', values = 'Value')
it returns only the columns created from the Type column, with their values; columns Col1-Col5 do not appear in my dataframe.
In Power Query this is very simple: I just choose the column I want to pivot and the values for this column:
And after this operation I get the following result:
How do I achieve the same result using pd.pivot?
Thanks!
Not with pd.pivot. The pd.pivot_table function is made for this case:
table = pd.pivot_table(df, values='Value', index=['Col1','Col2','Col3','Col4','Col5'],
                       columns=['Type'], aggfunc='sum')
You should be able to use pivot_table() to do this, following the same steps as outlined in this answer:
res = df.pivot_table(values='Value', index=['Col1','Col2','Col3','Col4','Col5'], columns='Type')
This creates a multi-index dataframe, which you can then flatten:
res.reset_index(inplace=True)
res.columns.name = None
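The steps above can be sketched on a small invented table. The question's actual data is not shown, so this sketch uses only two key columns (Col1, Col2) and made-up values; aggfunc='sum' is also a choice made here.

```python
import pandas as pd

# Hypothetical long-format table: one row per (Col1, Col2, Type)
df = pd.DataFrame({
    "Col1": ["a", "a", "b", "b"],
    "Col2": [1, 1, 2, 2],
    "Type": ["X", "Y", "X", "Y"],
    "Value": [10, 20, 30, 40],
})

# Keep the key columns as the index, spread Type into columns
res = df.pivot_table(values="Value", index=["Col1", "Col2"],
                     columns="Type", aggfunc="sum")

# Flatten: move the keys back to ordinary columns and drop the "Type" label
res = res.reset_index()
res.columns.name = None

print(res)
```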

Code Optimization for groupby

I have the below code that basically performs a group by operation, followed by a sum.
grouped = df.groupby(by=['Cabin'], as_index=False)['Fare'].sum()
I then rename the columns
grouped.columns = ['Cabin', 'testCol']
And I then merge the "grouped" dataframe with my original dataframe to calculate aggregate.
df2 = df.merge(grouped, on='Cabin')
What this does is to populate my initial dataframe with the 'testCol' from my "grouped" dataframe.
Can this code be optimized to fit in one line or something similar?
It seems you need GroupBy.transform to create the new column of sums:
df['testCol'] = df.groupby('Cabin')['Fare'].transform('sum')
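To illustrate that the one-liner matches the groupby, rename and merge sequence, here is a minimal sketch on invented Cabin and Fare values. The frame out is a copy made only so both versions can be compared side by side.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question
df = pd.DataFrame({
    "Cabin": ["A", "B", "A", "C", "B"],
    "Fare": [10.0, 20.0, 5.0, 7.5, 1.0],
})

# Original three-step version: aggregate, rename, merge back
grouped = df.groupby(by=["Cabin"], as_index=False)["Fare"].sum()
grouped.columns = ["Cabin", "testCol"]
merged = df.merge(grouped, on="Cabin")

# One-line version: transform broadcasts each group's sum to its rows
out = df.copy()
out["testCol"] = out.groupby("Cabin")["Fare"].transform("sum")

print(out)
```

transform returns a result aligned to the original index, which is why it can be assigned directly as a new column without a merge.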

Combining pandas dataframe values based on other column values

I have a pandas dataframe like so:
import pandas as pd
import numpy as np
df = pd.DataFrame([['WY','M',2014,'Seth',5],
['WY','M',2014,'Spencer',5],
['WY','M',2014,'Tyce',5],
['NY','M',2014,'Seth',25],
['MA','M',2014,'Spencer',23]],columns = ['state','sex','year','name','number'])
print(df)
How do I manipulate the data to get a dataframe like:
df1 = pd.DataFrame([['M',2014,'Seth',30],
['M',2014,'Spencer',28],
['M',2014,'Tyce',5]],
columns = ['sex','year','name','number'])
print(df1)
This is just part of a very large dataframe. How would I do this for every name for every year?
df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
For a brief description of what this does, from left to right:
Select only the columns we care about. We could replace this part with df.drop('state',axis=1)
Perform a groupby on the columns we care about.
Sum the remaining columns (in this case, just number).
Reset the index so that the columns ['sex','year','name'] are no longer a part of the index.
You can use a pivot table:
df.pivot_table(values = 'number',aggfunc = 'sum',columns = ['sex','year','name']).reset_index().rename(columns={0:'number'})
Group by the columns you want, sum number, and flatten the multi-index:
df.groupby(['sex','year','name'])['number'].sum().reset_index()
In your case the state column is not numeric, so you can shorten this by summing only the numeric columns:
df.groupby(['sex','year','name']).sum(numeric_only=True).reset_index()
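A minimal runnable sketch of the groupby approach on the question's sample data. Using as_index=False in place of reset_index() is a small variation chosen here; both leave the keys as ordinary columns.

```python
import pandas as pd

df = pd.DataFrame(
    [["WY", "M", 2014, "Seth", 5],
     ["WY", "M", 2014, "Spencer", 5],
     ["WY", "M", 2014, "Tyce", 5],
     ["NY", "M", 2014, "Seth", 25],
     ["MA", "M", 2014, "Spencer", 23]],
    columns=["state", "sex", "year", "name", "number"],
)

# Sum number per (sex, year, name); as_index=False keeps the keys as columns
result = df.groupby(["sex", "year", "name"], as_index=False)["number"].sum()

print(result)
```

Because the grouping covers every distinct (sex, year, name) combination, the same line applies unchanged to the full dataframe.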
