pivot dataframe with duplicate values

Consider the pd.DataFrame below:
temp = pd.DataFrame({'label_0':[1,1,1,2,2,2],'label_1':['a','b','c',np.nan,'c','b'], 'values':[0,2,4,np.nan,8,5]})
print(temp)
label_0 label_1 values
0 1 a 0.0
1 1 b 2.0
2 1 c 4.0
3 2 NaN NaN
4 2 c 8.0
5 2 b 5.0
My desired output is:
label_1 1 2
0 a 0.0 NaN
1 b 2.0 5.0
2 c 4.0 8.0
3 NaN NaN NaN
I have tried pd.pivot and wrangling with pd.groupby, but I cannot get to the desired output because of duplicate entries. Any help is most appreciated.

d = {}
for _0, _1, v in zip(*map(temp.get, temp)):
    d.setdefault(_1, {})[_0] = v
pd.DataFrame.from_dict(d, orient='index')
1 2
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0
NaN NaN NaN
OR
pd.DataFrame.from_dict(d, orient='index').rename_axis('label_1').reset_index()
label_1 1 2
0 a 0.0 NaN
1 b 2.0 5.0
2 c 4.0 8.0
3 NaN NaN NaN
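For reference, the dictionary-based approach above can be run end to end as the sketch below. It rebuilds `temp` from the question and uses longer variable names (`nested`, `wide`) that are my own choice, not the answerer's. Note that if the same (label_1, label_0) pair appeared twice, the later value would silently overwrite the earlier one.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    "label_0": [1, 1, 1, 2, 2, 2],
    "label_1": ["a", "b", "c", np.nan, "c", "b"],
    "values": [0, 2, 4, np.nan, 8, 5],
})

# Build {label_1: {label_0: value}}; a repeated pair would overwrite the earlier value.
nested = {}
for label_0, label_1, value in zip(temp["label_0"], temp["label_1"], temp["values"]):
    nested.setdefault(label_1, {})[label_0] = value

# Outer keys become the row index, inner keys become the columns.
wide = pd.DataFrame.from_dict(nested, orient="index")
print(wide)
```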

Another way is to use set_index and unstack:
temp.set_index(['label_0','label_1'])['values'].unstack(0)
Output:
label_0 1 2
label_1
NaN NaN NaN
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0

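You can do fillna, then pivot (keyword arguments are used here because pandas 2.0 and later no longer accept the pivot arguments positionally):
temp.fillna('NaN').pivot(index='label_0', columns='label_1', values='values').T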
Out[251]:
label_0 1 2
label_1
NaN NaN NaN
a 0 NaN
b 2 5
c 4 8

Seems like a straightforward pivot works:
temp.pivot(columns='label_0', index='label_1', values='values')
Output:
label_0 1 2
label_1
NaN NaN NaN
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0
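The example data in the question has no repeated (label_1, label_0) pair, which is why a plain pivot works. As a hedged sketch of what happens when such a pair really is repeated: `DataFrame.pivot` raises a `ValueError`, while `DataFrame.pivot_table` aggregates the duplicates. The `dup` frame and the choice of `aggfunc="mean"` below are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical data in which the pair (label_1='b', label_0=1) occurs twice.
dup = pd.DataFrame({
    "label_0": [1, 1, 1, 2],
    "label_1": ["a", "b", "b", "b"],
    "values": [0.0, 2.0, 3.0, 5.0],
})

try:
    dup.pivot(index="label_1", columns="label_0", values="values")
    pivot_failed = False
except ValueError:
    # pivot cannot decide which of the duplicate values to keep
    pivot_failed = True

# pivot_table resolves duplicates with an aggregation function instead.
wide = dup.pivot_table(index="label_1", columns="label_0",
                       values="values", aggfunc="mean")
print(pivot_failed)
print(wide)
```

Other aggregations such as "first", "last" or "sum" can be passed to aggfunc, depending on which duplicate should win.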

Related

Customise the ordering of columns in pivot table after .sort_index(level=1, axis=1)

Dataframe df1
TYPE WEEK A B C D
0 Type1 1 1 1 1 1
1 Type2 2 2 2 2 2
2 Type3 3 3 3 3 3
3 Type4 4 4 4 4 4
Expected output
A C B D A C B D A C B D A C B D
WEEK 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
TYPE
Type1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Type2 NaN NaN NaN NaN 2.0 2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
Type3 NaN NaN NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 NaN NaN NaN NaN
Type4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0
My approach:
df1 = pd.DataFrame(df1)
colname = list(df1.head())
tuples = []
for i in colname:
    tuples.append((i, colname.index(i) + 1))
index = pd.MultiIndex.from_tuples(tuples, names=["COLUMN", "ORDER"])
df2 = pd.DataFrame(df1.values, columns=index)
df3 = pd.pivot_table(df1,index="TYPE",columns="WEEK", values=['A','B','C','D']).sort_index(level=1, axis=1)
# df3 does not give the expected result, because .sort_index(level=1, axis=1) also sorts the column names alphabetically to ['A','B','C','D']
.sort_index(level=1, axis=1) is required so that the pivot table columns are ordered by WEEK.
I generated another dataframe, df2, to fix the column order as ['A','C','B','D'] for use in the pivot table:
COLUMN TYPE WEEK A B C D
ORDER 1 2 3 4 5 6
0 Type1 1 1 1 1 1
1 Type2 2 2 2 2 2
2 Type3 3 3 3 3 3
3 Type4 4 4 4 4 4

Create a CategoricalDtype before pivoting:
cat = pd.CategoricalDtype(['A', 'C', 'B', 'D'], ordered=True)
df3 = df1.melt(['TYPE', 'WEEK'], var_name='COLUMN').astype({'COLUMN': cat}) \
.pivot_table('value', 'TYPE', ['COLUMN', 'WEEK']).sort_index(level=1, axis=1)
Output
>>> df3
COLUMN A C B D A C B D A C B D A C B D
WEEK 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
TYPE
Type1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Type2 NaN NaN NaN NaN 2.0 2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
Type3 NaN NaN NaN NaN NaN NaN NaN NaN 3.0 3.0 3.0 3.0 NaN NaN NaN NaN
Type4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0
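An alternative that does not rely on a categorical dtype is to pivot first and then reindex the columns against an explicitly built MultiIndex. This is only a sketch, not part of the answer above: the names `order`, `weeks` and `target` are mine, and `df1` is rebuilt from the question's table.

```python
import pandas as pd

df1 = pd.DataFrame({
    "TYPE": ["Type1", "Type2", "Type3", "Type4"],
    "WEEK": [1, 2, 3, 4],
    "A": [1, 2, 3, 4],
    "B": [1, 2, 3, 4],
    "C": [1, 2, 3, 4],
    "D": [1, 2, 3, 4],
})

order = ["A", "C", "B", "D"]  # desired column order within each week
wide = df1.pivot_table(index="TYPE", columns="WEEK", values=order)

# Spell out the target column order: week by week, then the custom order.
weeks = sorted(df1["WEEK"].unique())
target = pd.MultiIndex.from_tuples(
    [(col, week) for week in weeks for col in order],
    names=wide.columns.names,
)
wide = wide.reindex(columns=target)
print(wide)
```

The reindex step makes the ordering explicit, at the cost of listing the order by hand.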

Creating non-existent columns in a MultiIndex dataframe

Let's say we have a dataframe like this:
df = pd.DataFrame({
"metric": ["1","2","1" ,"1","2"],
"group1":["o", "x", "x" , "o", "x"],
"group2":['a', 'b', 'a', 'a', 'b'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
metric group1 group2 value value2
0 1 o a 0 0
1 2 x b 1 2
2 1 x a 2 4
3 1 o a 3 6
4 2 x b 4 8
Then I want to pivot it:
df['g'] = df.groupby(['group1','group2'])['group2'].cumcount()
df1 = df.pivot(index=['g','metric'], columns=['group1','group2'], values=['value','value2']).sort_index(axis=1).rename_axis(columns={'g':None})
value value2
group1 o x o x
group2 a a b a a b
g metric
0 1 0.0 2.0 NaN 0.0 4.0 NaN
2 NaN NaN 1.0 NaN NaN 2.0
1 1 3.0 NaN NaN 6.0 NaN NaN
2 NaN NaN 4.0 NaN NaN 8.0
From here we can see that ("value","o","b") and ("value2","o","b") do not exist after pivoting,
but I need those columns to be present, filled with NA.
So I tried:
cols = [('value','x','a'), ('value','o','a'),('value','o','b')]
df1.assign(**{col : "NA" for col in np.setdiff1d(cols, df1.columns.values)})
but this does not give the expected output:
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
One corner case: if b does not exist at all, how can that column be created?
value value2
group1 o x o x
group2 a a a a
g metric
0 1 0.0 2.0 0.0 4.0
2 NaN NaN NaN NaN
1 1 3.0 NaN 6.0 NaN
2 NaN NaN NaN NaN
Multiple insert columns if not exist pandas
Pandas: Check if column exists in df from a list of columns
Pandas - How to check if multi index column exists

Use DataFrame.stack with DataFrame.unstack:
df1 = df1.stack([1,2],dropna=False).unstack([2,3])
print (df1)
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
Or by selecting the last two levels:
df1 = df1.stack([-2,-1],dropna=False).unstack([-2,-1])
Another idea:
df1 = df1.reindex(pd.MultiIndex.from_product(df1.columns.levels), axis=1)
print (df1)
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
EDIT:
If you need to add new columns from a list of tuples:
cols = [('value','x','a'), ('value','o','a'),('value','o','b')]
df = df1.reindex(pd.MultiIndex.from_tuples(cols).union(df1.columns), axis=1)
print (df)
value value2
o x o x
a b a b a a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN
2 NaN NaN NaN 4.0 NaN NaN 8.0
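For the corner case where a label such as b never occurs in the data, `df1.columns.levels` cannot supply it, so the full set of labels has to be listed by hand. A minimal sketch under that assumption follows. It uses `set_index` plus `unstack` rather than `pivot` only to stay independent of the pandas version, and the names `wide` and `full` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "metric": ["1", "2", "1", "1", "2"],
    "group1": ["o", "x", "x", "o", "x"],
    "group2": ["a", "b", "a", "a", "b"],
    "value": range(5),
    "value2": np.array(range(5)) * 2,
})
df["g"] = df.groupby(["group1", "group2"]).cumcount()

wide = (df.set_index(["g", "metric", "group1", "group2"])[["value", "value2"]]
          .unstack(["group1", "group2"]))

# List every label that must appear, whether or not it occurs in the data.
full = pd.MultiIndex.from_product(
    [["value", "value2"], ["o", "x"], ["a", "b"]],
    names=wide.columns.names,
)
wide = wide.reindex(columns=full)  # missing combinations become all-NaN columns
print(wide)
```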

Manipulating value in a column based on a rule

I have 3 columns (A, B and C) in a pandas dataframe. Wherever A is not null AND (B or C) is not null, I want A in that row to be set to null.
if (dffinal['A'].loc[dffinal['A'].notnull()] &
        (dffinal['B'].loc[dffinal['B'].notnull()] |
         dffinal['C'].loc[dffinal['C'].notnull()])):
    dffinal['A'] = np.nan
This is the error I'm getting: cannot do a non-empty take from an empty axes.

Use df.loc[]:
df.loc[df.A.notna() & (df.B.notna()|df.C.notna()),'A']=np.nan

The first condition is not necessary here (rows where A is already null stay null), so the solution can be simplified:
dffinal = pd.DataFrame({
'A':[np.nan,np.nan,4,5,5,np.nan],
'B':[7,np.nan,np.nan,4,np.nan,np.nan],
'C':[1,3,5,7,np.nan,np.nan],
})
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 4.0 NaN 5.0
3 5.0 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
mask = (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
The output is the same when the first condition is included:
mask = dffinal['A'].notnull() & (dffinal['B'].notnull() | dffinal['C'].notnull())
dffinal.loc[mask, 'A'] = np.nan
print (dffinal)
A B C
0 NaN 7.0 1.0
1 NaN NaN 3.0
2 NaN NaN 5.0
3 NaN 4.0 7.0
4 5.0 NaN NaN
5 NaN NaN NaN
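The same rule can also be written with `Series.mask`, which sets entries to NaN wherever a boolean condition is True. This is a sketch of an equivalent formulation, not a replacement for the `.loc` assignment above. The name `has_b_or_c` is mine.

```python
import numpy as np
import pandas as pd

dffinal = pd.DataFrame({
    "A": [np.nan, np.nan, 4, 5, 5, np.nan],
    "B": [7, np.nan, np.nan, 4, np.nan, np.nan],
    "C": [1, 3, 5, 7, np.nan, np.nan],
})

# True for rows where at least one of B or C holds a value.
has_b_or_c = dffinal[["B", "C"]].notna().any(axis=1)

# mask() replaces A with NaN wherever the condition is True.
dffinal["A"] = dffinal["A"].mask(has_b_or_c)
print(dffinal)
```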

Generate New DataFrame without NaN Values

I have the following DataFrame:
a b c d e
0 NaN 2.0 NaN 4.0 5.0
1 NaN 2.0 3.0 NaN 5.0
2 1.0 NaN 3.0 4.0 NaN
3 1.0 2.0 NaN 4.0 NaN
4 NaN 2.0 NaN 4.0 5.0
What I am trying to do is generate a new DataFrame without the NaN values.
Every row always has the same number of NaN values.
The final DataFrame should look like this:
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
Does someone know an easy way to do this?
Any help is appreciated.

Using array indexing:
pd.DataFrame(df.values[df.notnull().values].reshape(df.shape[0],3),
columns=list('xyz'),dtype=int)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If the number of values is inconsistent across rows, e.g. the first row has 4 values and the remaining rows have 3, then this will do:
a b c d e g
0 NaN 2.0 NaN 4.0 5.0 6.0
1 NaN 2.0 3.0 NaN 5.0 NaN
2 1.0 NaN 3.0 4.0 NaN NaN
3 1.0 2.0 NaN 4.0 NaN NaN
4 NaN 2.0 NaN 4.0 5.0 NaN
pd.DataFrame(df.apply(lambda x: x.values[x.notnull()],axis=1).tolist())
0 1 2 3
0 2.0 4.0 5.0 6.0
1 2.0 3.0 5.0 NaN
2 1.0 3.0 4.0 NaN
3 1.0 2.0 4.0 NaN
4 2.0 4.0 5.0 NaN
Here the NaNs in the last column cannot be removed.

Use a justify function (a user-defined helper, not part of pandas or NumPy) and select the first 3 columns:
df = pd.DataFrame(justify(df.values,invalid_val=np.nan)[:, :3].astype(int),
columns=list('xyz'),
index=df.index)
print (df)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
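The `justify` helper used above is not defined in this thread. As a stand-in, here is a minimal sketch of a left-justifying function for float arrays. The name `left_justify` and its implementation are my assumptions, not the answerer's original code. It relies on a stable argsort of the NaN mask, so the non-NaN values keep their original order within each row.

```python
import numpy as np
import pandas as pd

def left_justify(values):
    """Move the non-NaN entries of each row to the left, keeping their order."""
    is_nan = np.isnan(values)
    # A stable argsort on the boolean mask puts False (non-NaN) before True (NaN).
    order = np.argsort(is_nan, axis=1, kind="stable")
    return np.take_along_axis(values, order, axis=1)

df = pd.DataFrame({
    "a": [np.nan, np.nan, 1, 1, np.nan],
    "b": [2, 2, np.nan, 2, 2],
    "c": [np.nan, 3, 3, np.nan, np.nan],
    "d": [4, np.nan, 4, 4, 4],
    "e": [5, 5, np.nan, np.nan, 5],
})

# Each row has exactly 3 non-NaN values, so the first 3 columns hold them all.
packed = left_justify(df.to_numpy(dtype=float))[:, :3].astype(int)
out = pd.DataFrame(packed, columns=list("xyz"), index=df.index)
print(out)
```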

If, as in your example, values increase across columns, you can sort over axis=1:
res = pd.DataFrame(np.sort(df.values, 1)[:, :3],
columns=list('xyz'), dtype=int)
print(res)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5

You can use the pandas DataFrame method df.fillna().
This method replaces NaN or NA values with the value you pass in.
df.fillna(value)  # value: what to put in place of NaN
import numpy as np
import pandas as pd
data = {
'A':[np.nan, 2.0, np.nan, 4.0, 5.0],
'B':[np.nan, 2.0, 3.0, np.nan, 5.0],
'C':[1.0 , np.nan, 3.0, 4.0, np.nan],
'D':[1.0 , 2.0, np.nan, 4.0, np.nan,],
'E':[np.nan, 2.0, np.nan, 4.0, 5.0]
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 NaN NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 NaN 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
df = df.fillna(0) # Applying the method with parameter 0
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
If you want to apply this method to one particular column only (starting again from the original dataframe), the syntax is:
df[column_name] = df[column_name].fillna(param)
df['A'] = df['A'].fillna(0)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
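`fillna` also accepts a dictionary that maps column names to fill values, which covers several columns in one call. A small sketch follows. The three-column frame and the fill value -1 are chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2.0, np.nan, 4.0, 5.0],
    "B": [np.nan, 2.0, 3.0, np.nan, 5.0],
    "C": [1.0, np.nan, 3.0, 4.0, np.nan],
})

# Columns not named in the dict (here B) keep their NaN values.
filled = df.fillna({"A": 0, "C": -1})
print(filled)
```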
You can also use the pandas replace() method to replace np.nan:
df = df.replace(np.nan,0)
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
df['A'] = df['A'].replace(np.nan, 0) # Replacing only column A
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0

Python Pandas Dataframe replace values below threshold

How can I apply a function element-wise to a pandas DataFrame and pass a column-wise calculated value (e.g. quantile of column)? For example, what if I want to replace all elements in a DataFrame (with NaN) where the value is lower than the 80th percentile of the column?
def _deletevalues(x, quantile):
    if x < quantile:
        return np.nan
    else:
        return x
df.applymap(lambda x: _deletevalues(x, x.quantile(0.8)))
applymap only gives access to each value individually, so this (of course) throws AttributeError: 'float' object has no attribute 'quantile'.
Thank you in advance.

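Use DataFrame.mask (quantile() defaults to q=0.5, the median, which is what the output below shows; pass 0.8 for the 80th percentile):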
df = df.mask(df < df.quantile())
print (df)
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0

In [139]: df
Out[139]:
a b c
0 1 7 3
1 1 2 6
2 3 0 5
3 8 2 1
4 7 3 5
5 6 7 2
6 0 2 1
7 8 4 1
8 5 0 6
9 7 7 6
for all columns:
In [145]: df.apply(lambda x: np.where(x < x.quantile(),np.nan,x))
Out[145]:
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0
or
In [149]: df[df < df.quantile()] = np.nan
In [150]: df
Out[150]:
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0
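The outputs above use the default `quantile()`, i.e. the median. For the 80th percentile the question asks about, the same masking idea applies with `quantile(0.8)`. A sketch using the sample data from the answer above, where `out` is my own variable name:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 3, 8, 7, 6, 0, 8, 5, 7],
    "b": [7, 2, 0, 2, 3, 7, 2, 4, 0, 7],
    "c": [3, 6, 5, 1, 5, 2, 1, 1, 6, 6],
})

# df.quantile(0.8) is a Series indexed by column name, so the comparison
# lines each column up with its own threshold.
out = df.mask(df < df.quantile(0.8))
print(out)
```

Values equal to the threshold are kept, because the comparison is strict.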
