I am still new to pandas' pivot_table and I'm trying to reshape my data to get a binary indicator of whether a value occurs in a given observation. I followed some previous code and got encouraging results; however, instead of the ones and zeros of my ideal result, I get a sum. Please see a small sample data set below:
ID SKILL NUM
1 A 1
1 A 1
1 B 1
2 C 1
3 C 1
3 C 1
3 E 1
The result I am aiming for is:
ID A B C E
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
My code at the moment gets the following result:
ID A B C E
1 2 1 0 0
2 0 0 1 0
3 0 0 2 1
Should I remove the duplicates first?
The code I'm using at the moment is below:
df_pivot = df2.pivot_table(index='ID', columns='SKILL', aggfunc=len, fill_value=0)
You can use get_dummies with set_index to build the indicator columns, and then take the max per index value:
df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)
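Note that Series.max(level=0) was removed in pandas 2.0. On newer versions the same idea can be written with a groupby on the index level (a sketch, assuming the same df):
df = pd.get_dummies(df.set_index('ID')['SKILL']).groupby(level=0).max()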
For better performance, remove the duplicates with drop_duplicates and reshape with set_index plus unstack:
df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)
A solution with pivot also works, but then it is necessary to replace the NaNs with 0:
df = df.drop_duplicates(['ID','SKILL']).pivot(index='ID', columns='SKILL', values='NUM').fillna(0).astype(int)
If you want to use your solution, just remove the duplicates first. unstack is still better, though, because with no duplicated ID/SKILL pairs the data never need to be aggregated:
df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
                            columns='SKILL',
                            values='NUM',
                            aggfunc=len,
                            fill_value=0))
print (df_pivot)
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
Try like this:
df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)
Or this:
df.pivot_table(index='ID', columns='SKILL', aggfunc=lambda x: int(x.any()), fill_value=0)
Whichever suits you best.
You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.
df_pivot = df.pivot_table(index='ID', columns='SKILL',
                          aggfunc='any', fill_value=0).astype(int)
print(df_pivot)
NUM
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.
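Note the extra NUM level in the columns above, which appears because no values column was passed. If flat SKILL columns are preferred, passing values='NUM' should drop that level (a sketch, assuming the same df):
df_pivot = df.pivot_table(index='ID', columns='SKILL', values='NUM',
                          aggfunc='any', fill_value=0).astype(int)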
Related
I have a Python DataFrame of teams and the place each team has achieved (1, 2 or 3):
Team place
A    1
A    1
A    1
A    2
A    3
A    1
A    1
B    2
B    2
I want to manipulate the df to look like the table below, i.e. a count of how often each team has achieved each place.
Team 1 2 3
A    5 1 1
B    0 2 0
You could use pandas.crosstab:
pd.crosstab(df['Team'], df['place'])
or a simple groupby+size and unstack:
(df.groupby(['Team', 'place']).size()
   .unstack('place', fill_value=0)
)
output:
place 1 2 3
Team
A 5 1 1
B 0 2 0
To get everything as columns:
(pd.crosstab(df['Team'], df['place'])
   .rename_axis(columns=None)
   .reset_index()
)
output:
Team 1 2 3
0 A 5 1 1
1 B 0 2 0
You can get the value counts for each group and then unstack the index. The rest is twiddling to get your exact output.
(df.groupby('Team')['place']
   .value_counts()
   .unstack(fill_value=0)
   .reset_index()
   .rename_axis(None, axis=1)
)
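Assuming the same data, this should produce:
  Team  1  2  3
0    A  5  1  1
1    B  0  2  0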
I have a pandas dataframe df like this, say
ID activity date
1 A 4
1 B 8
1 A 12
1 C 12
2 B 9
2 A 10
3 A 3
3 D 4
and I would like to return a table that counts the number of occurrences of each activity in a given list, say l = ['A', 'B'] in this case; then
ID activity(count)_A activity(count)_B
1 2 1
2 1 1
3 1 0
is what I need.
What is the quickest way to do that, ideally without a for loop?
Thanks!
Edit: I know there is the pivot function for this kind of job. But in my case I have many more activity types than the ones I actually need to count in the list l. Is it still optimal to use pivot?
You can use isin with boolean indexing as a first step and then pivot. The fastest should be groupby + size + unstack, then pivot_table, and last crosstab, but it is best to test each solution on real data:
df2 = (df[df['activity'].isin(['A','B'])]
         .groupby(['ID','activity'])
         .size()
         .unstack(fill_value=0)
         .add_prefix('activity(count)_')
         .reset_index()
         .rename_axis(None, axis=1))
print (df2)
ID activity(count)_A activity(count)_B
0 1 2 1
1 2 1 1
2 3 1 0
Or:
df1 = df[df['activity'].isin(['A','B'])]
df2 = (pd.crosstab(df1['ID'], df1['activity'])
         .add_prefix('activity(count)_')
         .reset_index()
         .rename_axis(None, axis=1))
Or:
df2 = (df[df['activity'].isin(['A','B'])]
         .pivot_table(index='ID', columns='activity', aggfunc='size', fill_value=0)
         .add_prefix('activity(count)_')
         .reset_index()
         .rename_axis(None, axis=1))
I believe df.groupby('activity').size().reset_index(name='count') should do what you expect.
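Note that as written this counts each activity over the whole frame rather than per ID; a sketch extending the same idea to the wide per-ID layout the question asks for (assuming l = ['A', 'B']):
counts = (df[df['activity'].isin(['A', 'B'])]
          .groupby(['ID', 'activity']).size()
          .reset_index(name='count'))
df2 = (counts.pivot(index='ID', columns='activity', values='count')
             .fillna(0).astype(int)
             .add_prefix('activity(count)_')
             .reset_index()
             .rename_axis(None, axis=1))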
Just aggregate with collections.Counter and use the pd.DataFrame default constructor:
from collections import Counter
agg_ = df.groupby('ID')['activity'].agg(Counter).tolist()
ndf = pd.DataFrame(agg_)
A B C D
0 2 1.0 1.0 NaN
1 1 1.0 NaN NaN
2 1 NaN NaN 1.0
If you have l = ['A', 'B'], just filter
ndf[l]
A B
0 2 1.0
1 1 1.0
2 1 NaN
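If integer counts are preferred over floats with NaN, a final cleanup (a sketch):
ndf[l].fillna(0).astype(int)
   A  B
0  2  1
1  1  1
2  1  0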
I have a large dataframe ('data') made up of one column. Each row in the column is a string, and each string is made up of comma-separated categories. I wish to one-hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index A B C
0 1 1 1
1 0 1 1
2 0 0 0
How can I do this?
Note that you're not dealing with one-hot encodings here, since a row can contain more than one 1.
str.split + stack + get_dummies + sum
df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .sum(level=0))  # on pandas >= 2.0, use .groupby(level=0).sum()
A B C
0 1 1 1
1 0 1 1
2 0 0 0
apply + value_counts
(df.mesh.str.split(r'\s*,\s*', expand=True)
.apply(pd.Series.value_counts, 1)
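# .iloc[:, 1:] below drops the '' column produced by the empty-string row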
.iloc[:, 1:]
.fillna(0, downcast='infer'))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
I figured there is a simpler answer, or at least one that feels simpler compared to the multiple operations above.
Make sure the column has unique values separated by commas.
Use get_dummies' built-in sep parameter to specify the separator as a comma (the default is the pipe character).
data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df=pd.DataFrame(data)
sof_df.mesh=sof_df.mesh.str.replace(' ','')
sof_df.mesh.str.get_dummies(sep=',')
OUTPUT:
A B C
0 1 1 1
1 0 1 1
2 0 0 0
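The same result should also be reachable in a single chain, without mutating the original column (a sketch):
sof_df['mesh'].str.replace(' ', '', regex=False).str.get_dummies(sep=',')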
If the categories are controlled (you know how many there are and what they are), the best answer is the one by @Tejeshar Gurram. But what if you have lots of potential categories and you are not interested in all of them? Say:
s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I've found this workaround does the job:
cat_list = ['B', 'C']
list_of_lists = [ (s.str.contains(cat_, regex=False)==True).astype(bool).astype(int).to_list() for cat_ in cat_list]
data = {k:v for k,v in zip(cat_list,list_of_lists)}
pd.DataFrame(data)
   B  C
0  1  1
1  1  1
2  0  0
3  0  0
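One caveat: str.contains(cat_, regex=False) matches substrings, so a category 'B' would also match an entry like 'AB' or 'BB'. If the labels can overlap, matching whole tokens with a regex is safer (a sketch; re.escape guards special characters):
import re
cat_list = ['B', 'C']
data = {c: s.str.contains(rf'\b{re.escape(c)}\b', na=False).astype(int)
        for c in cat_list}
pd.DataFrame(data)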
I have a pandas data frame in python coming from a pd.concat with a recurring multiindex:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
0 0 8880873
1 1000521
1 0 1135488
1 5388773
Now, I want to reset only the first level of the MultiIndex, so that I get a consecutive numbering in that level. Something like this:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
In general, I have around 5 million records and not the biggest machine, so I'm looking for a memory-efficient solution for this.
ignore_index=True in pd.concat does not work, because then I lose the MultiIndex.
Many thanks.
You can convert the first level to a Series with get_level_values and to_series, compare it with its shifted values, take the cumsum to get the counter, and finally use MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
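A self-contained sketch reproducing the example end to end (the index tuples are rebuilt by hand here):
import pandas as pd
idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)])
df = pd.DataFrame({'customer_id': [46841769, 4683936, 8880872, 8880812,
                                   8880873, 1000521, 1135488, 5388773]},
                  index=idx)
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1  # new number for each run of equal values
df.index = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)])
print(df)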
OK, I admit, I had trouble formulating a good title for this, so I will give an example.
This is my sample dataframe:
df = pd.DataFrame([
    (1, "a", "good"),
    (1, "a", "good"),
    (1, "b", "good"),
    (1, "c", "bad"),
    (2, "a", "good"),
    (2, "b", "bad"),
    (3, "a", "none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although later on I will also need the id repeated in every row, for example. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (with the eval column added as the last grouping key) or value_counts, together with unstack:
df1 = df.groupby(["id", "type", 'eval'])
.size()
.unstack(fill_value=0)
.rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = df.groupby(["id", "type"])[ 'eval']
.value_counts()
.unstack(fill_value=0)
.rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But writing this to Excel with df1.to_excel('file.xlsx') keeps the MultiIndex, so reset_index is needed last:
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot the id count column, but id would be a duplicate column name, so id1 is needed:
df1.insert(0, 'id1', df1.sum(axis=1))
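Putting the pieces together, a sketch of the full pipeline down to the Excel file:
df1 = (df.groupby(['id', 'type'])['eval']
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
df1.insert(0, 'id1', df1.sum(axis=1))  # total count per (id, type)
out = df1.reset_index()
print(out)
   id type  id1  bad  good  none
0   1    a    2    0     2     0
1   1    b    1    0     1     0
2   1    c    1    1     0     0
3   2    a    1    0     1     0
4   2    b    1    1     0     0
5   3    a    1    0     0     1
out.to_excel('file.xlsx', index=False)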