How to merge 3 columns into 1 whilst keeping values column separate - python

I have the following Pivot table:
Subclass  Subclass2  Layer  Amount
A         B          C           5
E         F          G         100
I want to merge the 3 columns together and have Amount stay separate to form this:
Col1  Amount
A        NaN
B        NaN
C          5
E        NaN
F        NaN
G        100
So far I've turned it into a regular DataFrame and tried this:
df.melt(id_vars=['SubClass', 'SubClass2'], value_name='CQ')
But that didn't arrange it right at all; it messed up all the columns.
I figured that once I get the melt right, I could just change the NaN values to 0 or blanks.
EDIT
I need to keep Subclass & Subclass2 in the final column as they're the higher-level mapping of Layer, which is why I want the output Col1 to include them before listing Layer with Amount next to it.
Thanks!
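For reference, here is a minimal reconstruction of the frame after resetting the pivot to a regular DataFrame (column names assumed from the table above), which the answers below operate on:
import pandas as pd

df = pd.DataFrame({'Subclass':  ['A', 'E'],
                   'Subclass2': ['B', 'F'],
                   'Layer':     ['C', 'G'],
                   'Amount':    [5, 100]})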

Here is one way to do it:
pd.concat([df,
           df[['Subclass', 'Subclass2']].stack().reset_index()[0]
             .to_frame().rename(columns={0: 'Layer'})
           ])[['Layer', 'Amount']].sort_values('Layer')
Layer Amount
0 A NaN
1 B NaN
0 C 5.0
2 E NaN
3 F NaN
1 G 100.0
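As a usage note (not part of the original answer): the duplicated index labels come from concatenating two frames that both start at 0; a trailing reset_index(drop=True) gives a clean 0..n index:
(pd.concat([df,
            df[['Subclass', 'Subclass2']].stack().reset_index()[0]
              .to_frame().rename(columns={0: 'Layer'})
            ])[['Layer', 'Amount']]
   .sort_values('Layer')
   .reset_index(drop=True))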

Here is my interpretation, using stack instead of melt to preserve the row order.
out = (df
       .set_index('Amount')
       .stack().reset_index(name='Col1')
       .assign(Amount=lambda d: d['Amount'].where(d['level_1'].eq('Layer'), 0))
       .drop(columns='level_1')
       )
NB: with melt the syntax would be df.melt(id_vars='Amount', value_name='Col1'), using 'variable' in place of 'level_1'; see the sketch after the output.
Output:
Amount Col1
0 0 A
1 0 B
2 5 C
3 0 E
4 0 F
5 100 G
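Spelling out that NB as a sketch (note that melt enumerates column by column, so the rows come out grouped by source column rather than interleaved per original row the way stack produces them):
out = (df.melt(id_vars='Amount', value_name='Col1')
         .assign(Amount=lambda d: d['Amount'].where(d['variable'].eq('Layer'), 0))
         .drop(columns='variable')
       )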

Related

Merge dataframes of different sizes and simultaneously overwrite NaN values

I would like to combine two dataframes in Python of different sizes. These dataframes are loaded from Excel files. The first dataframe has many empty values containing NaN, and the second dataframe has the data to replace the NaN values in the first dataframe. The two dataframes are linked by the data in the first column, but are not in the same order.
I can successfully merge and organize the dataframes using merge(), but the resulting dataframe has extra columns because the NaN values were not overwritten. I can overwrite the NaN values with fillna(), but the resulting dataframe is out of order. Is there any way to perform this kind of merge that replaces NaN without separate operations that delete and reorder columns?
import pandas as pd
import numpy as np
df1=pd.DataFrame({'A':[1,2,3],'B':[np.nan,np.nan,np.nan],'C':['X','Y','Z']})
df1
A B C
0 1 NaN X
1 2 NaN Y
2 3 NaN Z
df2=pd.DataFrame({'A':[3,1,2],'B':['U','V','W'],'D':[7,8,9]})
df2
A B D
0 3 U 7
1 1 V 8
2 2 W 9
If I do:
df1.merge(df2,how='left',on='A',sort=True)
A B_x C B_y D
0 1 NaN X V 8
1 2 NaN Y W 9
2 3 NaN Z U 7
The data is in order but B has multiple instances.
If I do:
df1.fillna(df2)
A B C
0 1 U X
1 2 V Y
2 3 W Z
The data is out of order, but the NaN are replaced.
I want the output to be a dataframe which looks like this:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7
You can use:
df3 = (pd.concat([df1['C'],
                  df2[['A', 'B', 'D']].sort_values('A').reset_index(drop=True)],
                 axis=1)
         .reindex(columns=['A', 'B', 'C', 'D']))
Output:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7
Explanation:
sort_values orders df2 by column A.
reset_index(drop=True) is necessary so the rows concatenate in the correct order.
I use concat to join df1's column 'C' with df2, whose columns are now in the correct order. Finally I use reindex to reposition the columns of the DataFrame df3.
You can see that the order of the DataFrame df2 has not changed, since we have not used inplace=True.
# build a lookup from key A to df2's B values and fill df1's B via map
d = dict(zip(df2.A, df2.B))
df1["B"] = df1["A"].map(d)
# drop df2's B so the merge doesn't duplicate it
del df2["B"]
df1.merge(df2, how='left', on='A', sort=True)
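A shorter alternative worth mentioning, assuming 'A' is a unique key in both frames: align both on 'A' and let combine_first fill df1's NaNs from df2 in one pass:
# align on the key column, fill df1's missing values from df2, restore 'A' as a column
df3 = df1.set_index('A').combine_first(df2.set_index('A')).reset_index()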

Select multiple columns and slice columns at the same time with .loc method

Taking into account this Pandas DataFrame df:
   A  B  C  D  E  F
0
1
2
With .loc method I can select specific columns like this:
df.loc[:, ['A','B','E']]
Or I can slice some columns like:
df.loc[:,'B':'E']
My question is: can this method combine these two options, for example selecting the first column and slicing the other columns?
I have tried:
df.loc[:,['A','D':'F']]
for selecting columns A, D, E, F.
Which is the correct syntax?
You cannot natively do this using labels with loc, but you can do so using positions and np.r_ + iloc (it's the closest workaround).
f = df.columns.get_loc
df.iloc[:, np.r_[f('A'), f('D'):f('F')]]
A D E
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
This is under the assumption that your column names are unique.
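Note that the slice end is exclusive here (np.r_ slices behave like iloc), which is why 'F' is missing from the output above; to include it, add one to the stop position. A small sketch under the same uniqueness assumption:
import numpy as np

f = df.columns.get_loc
# + 1 because np.r_ slices, like iloc, exclude the stop position
df.iloc[:, np.r_[f('A'), f('D'):f('F') + 1]]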
You can simply do it using join
df[['A']].join(df.loc[:, 'D':'F'])
Output:
A D E F
pd.concat and map slices
This is a generalized approach that should work as expected.
sublocs = [slice('A'), slice('D', 'F')]
loc = lambda s: df.loc[:, s]
pd.concat(map(loc, sublocs), axis=1)
A D E F
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
Completely obnoxious variant
sublocs = [slice('A'), slice('D', 'F')]
pd.concat(map(df.T.loc.__getitem__, sublocs)).T
(This transposes so the columns become the row index, slices those rows with plain loc, concatenates, and transposes back.)

Add pandas Series as new columns to a specific Dataframe row

Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series({'C': 4, 'D': 6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column, or at adding the Series as a new row at the end of the DataFrame, not at updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0, ['C', 'D']] = [4, 6], which was suggested in another answer, but that only works if 'C' and 'D' are already existing columns in the DataFrame. df.assign(**s) works but assigns the Series values to all rows.
join with transpose:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
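If mutating df in place is acceptable, a sketch using .loc setting-with-enlargement also works: assigning a scalar to a not-yet-existing column label creates that column, filling the other rows with NaN:
# enlarge the frame one new column at a time; rows other than 0 get NaN
for col, val in s.items():
    df.loc[0, col] = val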

Create a new dataframe by aggregating repeated origin and destination values by a separate count column in a pandas dataframe

I am having trouble analysing origin-destination values in a pandas dataframe which contains origin/destination columns and a count column of the frequency of these. I want to transform this into a dataframe with the count of how many are leaving and entering:
Initial:
Origin Destination Count
A B 7
A C 1
B A 1
B C 4
C A 3
C B 10
For example, this simplified dataframe has 7 leaving from A to B and 1 from A to C, so overall leaving place A would be 8, and entering place A would be 4 (B to A is 1, C to A is 3), etc. The new dataframe would look something like this.
Goal:
Place Entering Leaving
A 4 8
B 17 5
C 5 13
I have tried several techniques such as .groupby() but have not yet created my intended dataframe. How can I handle the repeated values in the origin/destination columns and assign them to a new dataframe with aggregated values of just the count of leaving and entering?
Thank you!
Use double groupby + concat:
a = df.groupby('Destination')['Count'].sum()
b = df.groupby('Origin')['Count'].sum()
df = pd.concat([a,b], axis=1, keys=('Entering','Leaving')).rename_axis('Place').reset_index()
print (df)
Place Entering Leaving
0 A 4 8
1 B 17 5
2 C 5 13
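One caveat to hedge: if some place only ever appears as an origin or only as a destination, the concat above leaves NaN in the missing column; a fillna(0) before reset_index covers that case:
df = (pd.concat([a, b], axis=1, keys=('Entering', 'Leaving'))
        .fillna(0)
        .rename_axis('Place')
        .reset_index())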
pivot_table, then sum along each axis:
df = pd.pivot_table(df, index='Origin', columns='Destination', values='Count', aggfunc='sum')
pd.concat([df.sum(0), df.sum(1)], axis=1)
Output:
0 1
A 4.0 8.0
B 17.0 5.0
C 5.0 13.0

Merge a lot of DataFrames together, without loop and not using concat

I have >1000 DataFrames, each with >20K rows and several columns, which need to be merged on a certain common column. The idea can be illustrated by this:
data1=pd.DataFrame({'name':['a','c','e'], 'value':[1,3,4]})
data2=pd.DataFrame({'name':['a','d','e'], 'value':[3,3,4]})
data3=pd.DataFrame({'name':['d','e','f'], 'value':[1,3,5]})
data4=pd.DataFrame({'name':['d','f','g'], 'value':[0,3,4]})
# some of them may have more or fewer columns than the others:
# data5 = pd.DataFrame({'name': ['d','f','g'], 'value': [0,3,4], 'score': [1,3,4]})
final_data = data1
for i, v in enumerate([data2, data3, data4]):
    if i == 0:
        final_data = pd.merge(final_data, v, how='outer', left_on='name',
                              right_on='name', suffixes=('_0', '_%s' % (i + 1)))
        # in the real case right_on may be columns other than 'name',
        # depending on the dataframe, but this requirement can be
        # ignored in this minimal example.
    else:
        final_data = pd.merge(final_data, v, how='outer', left_on='name',
                              right_on='name', suffixes=('', '_%s' % (i + 1)))
Result:
name value_0 value_1 value value_3
0 a 1 3 NaN NaN
1 c 3 NaN NaN NaN
2 e 4 4 3 NaN
3 d NaN 3 1 0
4 f NaN NaN 5 3
5 g NaN NaN NaN 4
[6 rows x 5 columns]
It works, but can this be done without a loop?
Also, why is the name of the second-to-last column not value_2?
P.S.
I know that in this minimal example, the result can also be achieved by:
pd.concat([item.set_index('name') for item in [data1, data2, data3, data4]], axis=1)
But in the real case, due to the way the dataframes were constructed and the information stored in the index columns, this is not an ideal solution without additional tricks, so let's not consider this route.
Does it even make sense to merge it, then? What's wrong with a panel?
> data = [data1, data2, data3, data4]
> p = pd.Panel(dict(zip(map(str, range(len(data))), data)))
> p.to_frame().T
major 0 1 2
minor name value name value name value
0 a 1 c 3 e 4
1 a 3 d 3 e 4
2 d 1 e 3 f 5
3 d 0 f 3 g 4
# and just for kicks
> p.transpose(2, 0, 1).to_frame().reset_index().pivot_table(values='value', rows='name', cols='major')
major 0 1 2 3
name
a 1 3 NaN NaN
c 3 NaN NaN NaN
d NaN 3 1 0
e 4 4 3 NaN
f NaN NaN 5 3
g NaN NaN NaN 4
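A note for modern pandas: pd.Panel was removed in pandas 1.0, so the answer above no longer runs. On the value_2 mystery: suffixes only renames columns that actually collide, so on the i=1 pass the incoming 'value' does not clash with value_0/value_1 and keeps its bare name, and only the i=2 pass triggers '_3'. A loop-free sketch (reduce still iterates under the hood, of course) that sidesteps the collisions by renaming up front:
from functools import reduce

import pandas as pd

dfs = [data1, data2, data3, data4]
# give every non-key column a positional suffix before merging,
# so merge's collision-driven suffixes never kick in
renamed = [d.rename(columns={c: '%s_%d' % (c, i) for c in d.columns if c != 'name'})
           for i, d in enumerate(dfs)]
final_data = reduce(lambda l, r: pd.merge(l, r, on='name', how='outer'), renamed)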
