Column headers like pivot table - python

I am trying to find out the mix of member grades that visit my stores.
import pandas as pd
df = pd.DataFrame({'MbrID': ['M1','M2','M3','M4','M5','M6','M7'],
                   'Store': ['PAR','TPM','AMK','TPM','PAR','PAR','AMK'],
                   'Grade': ['A','A','B','A','C','A','C']})
df = df[['MbrID','Store','Grade']]
print(df)
df.groupby('Store').agg({'Grade':pd.Series.nunique})
Below are the dataframe and the result of the groupby.
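For reference, the printed dataframe and the groupby result (the number of distinct grades per store) are:
  MbrID Store Grade
0    M1   PAR     A
1    M2   TPM     A
2    M3   AMK     B
3    M4   TPM     A
4    M5   PAR     C
5    M6   PAR     A
6    M7   AMK     C

       Grade
Store
AMK        2
PAR        2
TPM        1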
How do I produce a result like an Excel pivot table, such that the categories of Grade (A, B, C) are the column headers? This is assuming that I have quite a wide range of member grades.

I think you can use groupby with size and reshape with unstack:
df1 = df.groupby(['Store','Grade'])['Grade'].size().unstack(fill_value=0)
print (df1)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
Solution with crosstab:
df2 = pd.crosstab(df.Store, df.Grade)
print (df2)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
and with pivot_table:
df3 = df.pivot_table(index='Store',
                     columns='Grade',
                     values='MbrID',
                     aggfunc=len,
                     fill_value=0)
print (df3)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
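On pandas 1.1 or later (an assumption about your version), DataFrame.value_counts gets there in one step as well:
df4 = df[['Store','Grade']].value_counts().unstack(fill_value=0).sort_index()
print (df4)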

Related

Python DataFrame: count of occurrences based on another column

I have a Python DataFrame of teams and the place each has achieved (1, 2 or 3):
Team  place
A     1
A     1
A     1
A     2
A     3
A     1
A     1
B     2
B     2
I want to manipulate the df to look like the table below, i.e. a count of how often each team has achieved each place.
Team  1  2  3
A     5  1  1
B     0  2  0
You could use pandas.crosstab:
pd.crosstab(df['Team'], df['place'])
or a simple groupby+size and unstack:
(df.groupby(['Team', 'place']).size()
   .unstack('place', fill_value=0)
)
output:
place  1  2  3
Team
A      5  1  1
B      0  2  0
And to get everything as regular columns:
(pd.crosstab(df['Team'], df['place'])
   .rename_axis(columns=None)
   .reset_index()
)
output:
  Team  1  2  3
0    A  5  1  1
1    B  0  2  0
You can get the value counts for each group and then unstack the index. The rest is twiddling to get your exact output.
(df.groupby('Team')['place']
   .value_counts()
   .unstack(fill_value=0)
   .reset_index()
   .rename_axis(None, axis=1)
)
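pivot_table also works here without picking a values column, assuming your pandas version supports aggfunc='size':
df.pivot_table(index='Team', columns='place', aggfunc='size', fill_value=0)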

How to group phone numbers with and without country code

I am trying to detect phone numbers. My country code is +62, but some phone manufacturers or operators use 0 instead of +62. After querying and pivoting I get the data below, but numbers that differ only in the prefix end up in separate columns.
Here's the pivoted data
Id  +623684682  03684682  +623684684  03684684
1            1         0           1         1
2            1         1           2         1
Here's what I need: group the prefixed and unprefixed columns together, but without doing it manually (+623684682 and 03684682 are the same, etc.):
Id  03684682  03684684
1          1         2
2          2         3
I think you need replace with an aggregate sum:
df = df.groupby(lambda x: x.replace('+62','0'), axis=1).sum()
Or replace the column names and sum:
df.columns = df.columns.str.replace(r'\+62', '0')
df = df.sum(level=0, axis=1)
print (df)
    03684682  03684684
Id
1          1         2
2          2         3
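A note if you are on a recent pandas version: groupby(..., axis=1) and sum(level=..., axis=1) are deprecated there, so the equivalent would be grouping the transposed frame (a sketch, assuming pandas 2.x):
df.columns = df.columns.str.replace(r'\+62', '0', regex=True)
df = df.T.groupby(level=0).sum().T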

Reshaping into binary variables using pandas python

I am still new to pandas' pivot_table and I'm trying to reshape the data to get a binary indicator of whether a value appears in a given observation. I have followed some previous code and got some encouraging results; however, instead of the ones and zeros of my ideal result, I get a sum. Please see the small sample data set below:
ID SKILL NUM
1 A 1
1 A 1
1 B 1
2 C 1
3 C 1
3 C 1
3 E 1
The result I am aiming for is:
ID A B C E
1 1 1 0 0
2 0 0 1 0
3 0 0 0 1
My code at the moment gets the following result:
ID A B C E
1 2 1 0 0
2 0 0 2 0
3 0 0 0 1
Should I remove the duplicates first?
The code I'm using at the moment is below:
df_pivot = df.pivot_table(index='ID', columns='SKILL', aggfunc=len, fill_value=0)
You can use get_dummies with set_index for indicator columns and then get max values per index:
df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)
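In newer pandas versions max(level=0) is deprecated; the equivalent spelling (an assumption about your version) groups on the index level instead:
df = pd.get_dummies(df.set_index('ID')['SKILL']).groupby(level=0).max()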
For better performance, remove duplicates with drop_duplicates and reshape with set_index and unstack:
df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)
A solution with pivot, but then it is necessary to replace NaNs with 0:
df = df.drop_duplicates(['ID','SKILL']).pivot(index='ID', columns='SKILL', values='NUM').fillna(0).astype(int)
If you want to use your solution, just remove the duplicates first. But unstack is better here, because the data never need aggregating: after drop_duplicates there are no repeated ID and SKILL pairs:
df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
                            columns='SKILL',
                            values='NUM',
                            aggfunc=len,
                            fill_value=0))
print (df_pivot)
SKILL  A  B  C  E
ID
1      1  1  0  0
2      0  0  1  0
3      0  0  1  1
Try like this:
df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)
Or this:
df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)
Whichever suits you best.
You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.
df_pivot = df.pivot_table(index='ID', columns='SKILL',
                          aggfunc='any', fill_value=0).astype(int)
print(df_pivot)
      NUM
SKILL   A  B  C  E
ID
1       1  1  0  0
2       0  0  1  0
3       0  0  1  1
The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.
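A sketch of that variant, with a bool cast added so counts above 1 collapse to 1:
df_pivot = df.pivot_table(index='ID', columns='SKILL',
                          aggfunc=len, fill_value=0).astype(bool).astype(int)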

reset a recurring multiindex in pandas

I have a pandas DataFrame coming from a pd.concat with a recurring MultiIndex:
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
0 0      8880873
  1      1000521
1 0      1135488
  1      5388773
Now I want to reset only the first level of the MultiIndex, so that I get a consecutive numbering on the outer index. Something like this:
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
2 0      8880873
  1      1000521
3 0      1135488
  1      5388773
I have around 5 million records and not the biggest machine, so I'm looking for a memory-efficient solution.
ignore_index=True in pd.concat does not work, because then I lose the MultiIndex.
Many thanks.
You can convert the first level to a Series with get_level_values and to_series, compare it with its shifted values, take the cumsum to number the groups, and finally use MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
2 0      8880873
  1      1000521
3 0      1135488
  1      5388773
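For completeness, a minimal runnable reproduction of the whole flow; the two concatenated parts here are made up to match the example data:
import pandas as pd

part1 = pd.DataFrame({'customer_id': [46841769, 4683936, 8880872, 8880812]},
                     index=pd.MultiIndex.from_product([[0, 1], [0, 1]]))
part2 = pd.DataFrame({'customer_id': [8880873, 1000521, 1135488, 5388773]},
                     index=pd.MultiIndex.from_product([[0, 1], [0, 1]]))
df = pd.concat([part1, part2])

# a new outer group starts whenever the first level value changes
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
df.index = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)],
                                     names=df.index.names)
print (df)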

Group a dataframe and count amount of items of a column that is not shown

OK, I admit I had trouble formulating a good title for this, so I will give an example.
This is my sample dataframe:
df = pd.DataFrame([
    (1, "a", "good"),
    (1, "a", "good"),
    (1, "b", "good"),
    (1, "c", "bad"),
    (2, "a", "good"),
    (2, "b", "bad"),
    (3, "a", "none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
         id
id type
1  a      2
   b      1
   c      1
2  a      1
   b      1
3  a      1
This is fine, although what I will need later on is for the id to be repeated in every row. But this is not the most important part.
What I would need now is something like this:
         id  good  bad  none
id type
1  a      2     2    0     0
   b      1     1    0     0
   c      1     0    1     0
2  a      1     1    0     0
   b      1     0    1     0
3  a      1     0    0     1
And even better would be a result like this, because I will need it back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality there will be many more columns I am grouping by, and they would have to be completely populated as well.
         id  good  bad  none
id type
1  a      2     2    0     0
1  b      1     1    0     0
1  c      1     0    1     0
2  a      1     1    0     0
2  b      1     0    1     0
3  a      1     0    0     1
Thank you for helping me out.
You can use groupby + size (with the eval column added to the grouping keys) or value_counts with unstack:
df1 = (df.groupby(["id", "type", "eval"])
         .size()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
         bad  good  none
id type
1  a       0     2     0
   b       0     1     0
   c       1     0     0
2  a       0     1     0
   b       1     0     0
3  a       0     0     1
df1 = (df.groupby(["id", "type"])["eval"]
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
         bad  good  none
id type
1  a       0     2     0
   b       0     1     0
   c       1     0     0
2  a       0     1     0
   b       1     0     0
3  a       0     0     1
But for writing to Excel:
df1.to_excel('file.xlsx')
the grouping keys stay in the index, so you need reset_index first:
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot the id count column. id would be a duplicate column name, so it needs to be id1:
df1.insert(0, 'id1', df1.sum(axis=1))
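Putting the pieces together, a sketch of the full path from the sample df to the Excel file:
df1 = (df.groupby(["id", "type"])["eval"]
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
df1.insert(0, 'id1', df1.sum(axis=1))  # total count per (id, type) pair
df1.reset_index().to_excel('file.xlsx', index=False)  # needs openpyxl installed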
