Change dataframe from MultiIndex to MultiColumn - python

I have a dataframe like
import numpy as np
import pandas as pd

multiindex1 = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df1 = pd.DataFrame(np.random.randn(5, 3), index=multiindex1)
multiindex2 = pd.MultiIndex.from_product([['s'], np.arange(1, 6)])
df2 = pd.DataFrame(np.random.randn(5, 3), index=multiindex2)
multiindex3 = pd.MultiIndex.from_product([['d'], np.arange(2, 7)])
df3 = pd.DataFrame(np.random.randn(5, 3), index=multiindex3)
df = pd.concat([df1, df2, df3])
>>>
0 1 2
a 3 0.872208 -0.145098 -0.519966
4 -0.976089 -0.730808 -1.463151
5 -0.026869 0.227912 1.525924
6 -0.161126 -0.415913 -1.211737
7 -0.893881 -0.769385 0.346436
s 1 -0.972483 0.202820 0.265304
2 0.007303 0.802974 -0.254106
3 1.619309 -1.545089 0.161493
4 2.847376 0.951796 -0.877882
5 1.749237 -0.327026 0.467784
d 2 1.440793 -0.697371 0.902004
3 0.390181 -0.449725 -0.462104
4 0.056916 0.140066 0.918281
5 0.164234 -2.491176 2.035113
6 -1.648948 0.372179 0.600297
Now I want to change it into multicolumns like
a s d
0 1 2 0 1 2 0 1 2
1 Nan Nan Nan ...
2 Nan Nan Nan ...
3 0.872208 -0.145098 -0.519966 ...
4 -0.976089 -0.730808 -1.463151 ...
5 -0.026869 0.227912 1.525924 ...
6 -0.161126 -0.415913 -1.211737 ...
7 -0.893881 -0.769385 0.346436 ...
That's to say, I want two things done:
turn the level-0 index (making the MultiIndex a single index) into level-0 columns (making the single-level columns a MultiIndex)
merge the level-1 indices together and reindex by their union
I've tried stack, unstack and pivot, but couldn't get the target form.
So is there an elegant way to achieve it?

This seemed to work for me. I used to work with someone who was amazing with multi-indices and he taught me a lot of this.
df.unstack(level=0).swaplevel(axis=1)[["a", "s", "d"]]
Output:
a s d
0 1 2 0 1 2 0 1 2
1 NaN NaN NaN 0.206957 1.329804 -0.037481 NaN NaN NaN
2 NaN NaN NaN 0.244912 1.880180 1.447138 -1.009454 0.215560 0.126756
3 0.871496 -1.247274 -0.458660 0.514475 0.989567 -1.653646 -0.623382 -0.799157 0.119259
4 -0.756771 0.523621 0.067967 1.066499 -1.436044 -1.045745 0.440954 -1.997325 -1.223662
5 0.707063 1.019831 0.422577 0.964631 -0.034742 -0.891528 0.891096 -0.724778 -0.043314
6 -0.140548 -0.093853 -1.060963 NaN NaN NaN 0.643902 -0.062267 -0.505484
7 -0.449435 0.360956 -0.769459 NaN NaN NaN NaN NaN NaN
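If the three source frames are still at hand, a more direct route (a sketch, assuming pandas new enough for DataFrame.droplevel) is to drop the outer index level from each frame and concat along the columns with keys, which builds the MultiIndex columns and takes the union of the row indices in one step:

```python
import numpy as np
import pandas as pd

multiindex1 = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df1 = pd.DataFrame(np.random.randn(5, 3), index=multiindex1)
multiindex2 = pd.MultiIndex.from_product([['s'], np.arange(1, 6)])
df2 = pd.DataFrame(np.random.randn(5, 3), index=multiindex2)
multiindex3 = pd.MultiIndex.from_product([['d'], np.arange(2, 7)])
df3 = pd.DataFrame(np.random.randn(5, 3), index=multiindex3)

# drop the letter level from each index, then concat side by side,
# reintroducing the letters as the top column level
out = pd.concat([d.droplevel(0) for d in (df1, df2, df3)],
                axis=1, keys=['a', 's', 'd'])
```

The outer join of concat produces the 1..7 row index automatically, with NaN where a frame has no value for that row.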

Related

aggregating across a dictionary containing dataframes

I have the following dictionary, which contains dataframes as values, each always having the same single column with the same title:
test = {'A': pd.DataFrame(np.random.randn(10), index=range(10),columns=['values']),
'B': pd.DataFrame(np.random.randn(6), index=range(6),columns=['values']),
'C': pd.DataFrame(np.random.randn(11), index=range(11),columns=['values'])}
From this, I would like to create a single dataframe whose index values are the keys of the dictionary (so A, B, C) and whose columns are the union of the current index values across all the dataframes (so in this case 0, 1, 2, ..., 10). The values of this dataframe would be the corresponding 'values' entries from the dataframe for each row, and NaN where blank.
Is there a handy way to do this?
IIUC, use pd.concat, keys, and unstack:
pd.concat([test[i] for i in test], keys=test.keys()).unstack(1)['values']
Better yet,
pd.concat(test).unstack(1)['values']
Output:
0 1 2 3 4 5 6 \
A -0.029027 -0.530398 -0.866021 1.331116 0.090178 1.044801 -1.586620
C 1.320105 1.244250 -0.162734 0.942929 -0.309025 -0.853728 1.606805
B -1.683822 1.015894 -0.178339 -0.958557 -0.910549 -1.612449 NaN
7 8 9 10
A -1.072210 1.654565 -1.188060 NaN
C 1.642461 -0.137037 -1.416697 -0.349107
B NaN NaN NaN NaN
Don't overcomplicate things:
just use concat and transpose.
pd.concat(test, axis=1).T
0 1 2 3 4 5 \
A values -0.592711 0.266518 -0.774702 0.826701 -2.642054 -0.366401
B values -0.709410 -0.463603 0.058129 -0.054475 -1.060643 0.081655
C values 1.384366 0.662186 -1.467564 0.449142 -1.368751 1.629717
6 7 8 9 10
A values 0.431069 0.761245 -1.125767 0.614622 NaN
B values NaN NaN NaN NaN NaN
C values 0.988287 -1.508384 0.214971 -0.062339 -0.011547
If you were dealing with Series instead of one-column DataFrames, it would make more sense to begin with:
test = {'A': pd.Series(np.random.randn(10), index=range(10)),
'B': pd.Series(np.random.randn(6), index=range(6)),
'C': pd.Series(np.random.randn(11), index=range(11))}
pd.concat(test,axis=1).T
0 1 2 3 4 5 6 \
A -0.174565 -2.015950 0.051496 -0.433199 0.073010 -0.287708 -1.236115
B 0.935434 0.228623 0.205645 -0.602561 1.860035 -0.921963 NaN
C 0.944508 -1.296606 -0.079339 0.629038 0.314611 -0.429055 -0.911775
7 8 9 10
A -0.704886 -0.369263 -0.390684 NaN
B NaN NaN NaN NaN
C 0.815078 0.061458 1.726053 -0.503471
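With the Series version, even the concat can be skipped (a sketch): the DataFrame constructor already aligns a dict of Series on the union of their indices, so transposing the constructed frame gives the same table.

```python
import numpy as np
import pandas as pd

test = {'A': pd.Series(np.random.randn(10), index=range(10)),
        'B': pd.Series(np.random.randn(6), index=range(6)),
        'C': pd.Series(np.random.randn(11), index=range(11))}

# dict of Series -> columns A, B, C aligned on index 0..10;
# transpose so the keys become rows and the union index becomes columns
out = pd.DataFrame(test).T
```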

Unmelting a pandas dataframe with two columns

Suppose I have a dataframe
df = pd.DataFrame(np.random.normal(size = (10,3)), columns = list('abc'))
I melt the dataframe using pd.melt so that it looks like
variable value
a 0.2
a 0.03
a -0.99
a 0.86
a 1.74
Now, I would like to undo the action. Using pivot(columns='variable') almost works, but returns a lot of NaN values:
      a    b    c
0   0.2  NaN  NaN
1  0.03  NaN  NaN
2 -0.99  NaN  NaN
3  0.86  NaN  NaN
4  1.74  NaN  NaN
How can I unmelt the dataframe so that it is as before?
A few ideas:
Assuming d1 is df.melt()
groupby + comprehension
pd.DataFrame({n: list(s) for n, s in d1.groupby('variable').value})
a b c
0 -1.087129 -1.264522 1.147618
1 0.403731 0.416867 -0.367249
2 -0.920536 0.442650 -0.351229
3 -1.193876 -0.342237 -2.001431
4 -1.596659 -1.223354 1.323841
5 0.753658 -0.891211 0.541265
6 0.455577 -1.059572 1.017490
7 -0.153736 0.050007 -0.280192
8 1.189587 0.405647 -0.102023
9 -0.103273 0.200320 -0.630194
Option 2
pd.DataFrame.set_index
d1.set_index([d1.groupby('variable').cumcount(), 'variable']).value.unstack()
variable a b c
0 -1.087129 -1.264522 1.147618
1 0.403731 0.416867 -0.367249
2 -0.920536 0.442650 -0.351229
3 -1.193876 -0.342237 -2.001431
4 -1.596659 -1.223354 1.323841
5 0.753658 -0.891211 0.541265
6 0.455577 -1.059572 1.017490
7 -0.153736 0.050007 -0.280192
8 1.189587 0.405647 -0.102023
9 -0.103273 0.200320 -0.630194
Use groupby, apply and unstack.
df.groupby('variable')['value']\
.apply(lambda x: pd.Series(x.values)).unstack().T
variable a b c
0 0.617037 -0.321493 0.747025
1 0.576410 -0.498173 0.185723
2 -1.563912 0.741198 1.439692
3 -1.305317 1.203608 -1.112820
4 1.287638 1.649580 0.404494
5 0.923544 0.988020 -1.918680
6 0.497406 -1.373345 0.074963
7 0.528444 -0.019914 -1.666261
8 0.260955 0.103575 0.190424
9 0.614411 -0.165363 -0.149514
Another method, using pivot and transform, works if you don't have NaN values in the column, i.e.:
df1 = df.melt()
df1.pivot(columns='variable', values='value') \
   .transform(lambda x: sorted(x, key=pd.isnull)).dropna()
Output:
variable a b c
0 1.596937 0.431029 0.345441
1 -0.493352 0.135649 -1.559669
2 0.548048 0.667752 0.258160
3 -0.251368 -0.265106 -2.339768
4 -0.397010 -0.381193 -0.359447
5 -0.945300 0.520029 0.362570
6 -0.883771 -0.612628 -0.478003
7 0.833100 -0.387262 -1.195496
8 -1.310178 -0.748359 0.073014
9 0.753457 1.105500 -0.895841
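One more variant along the same lines as Option 2 (a sketch): make the per-group row number explicit with cumcount and hand it to pivot as the index, which avoids the set_index/unstack pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(10, 3)), columns=list('abc'))
d1 = df.melt()

# number the rows within each variable, then pivot on that counter
out = (d1.assign(row=d1.groupby('variable').cumcount())
         .pivot(index='row', columns='variable', values='value'))
```

Up to the index name, `out` reproduces the original frame exactly.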

Merging DataFrames on dates in chronological order in Pandas

I have about 50 DataFrames in a list that have a form like this, where the particular dates included in each DataFrame are not necessarily the same.
>>> print(df1)
Unnamed: 0 df1_name
0 2004/04/27 2.2700
1 2004/04/28 2.2800
2 2004/04/29 2.2800
3 2004/04/30 2.2800
4 2004/05/04 2.2900
5 2004/05/05 2.3000
6 2004/05/06 2.3200
7 2004/05/07 2.3500
8 2004/05/10 2.3200
9 2004/05/11 2.3400
10 2004/05/12 2.3700
Now, I want to merge these 50 DataFrames together on the date column (unnamed first column in each DataFrame), and include all dates that are present in any of the DataFrames. Should a DataFrame not have a value for that date, it can just be NaN.
So a minimal example:
>>> print(sample1)
Unnamed: 0 sample_1
0 2004/04/27 1
1 2004/04/28 2
2 2004/04/29 3
3 2004/04/30 4
>>> print(sample2)
Unnamed: 0 sample_2
0 2004/04/28 5
1 2004/04/29 6
2 2004/05/01 7
3 2004/05/03 8
Then after the merge
>>> print(merged_df)
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1 NaN
1 2004/04/28 2 5
2 2004/04/29 3 6
3 2004/04/30 4 NaN
....
Is there an easy way to make use of the merge or join functions of Pandas to accomplish this? I have gotten awfully stuck trying to determine how to combine the dates like this.
All you need to do is pd.concat on all your sample dataframes, but you have to set up a couple of things first. One, set the index of each one to the column you want to merge on, and ensure that column is a date column. Below is an example of how to do it.
One liner
pd.concat([s.set_index('Unnamed: 0') for s in [sample1, sample2]], axis=1).rename_axis('Unnamed: 0').reset_index()
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1.0 NaN
1 2004/04/28 2.0 5.0
2 2004/04/29 3.0 6.0
3 2004/04/30 4.0 NaN
4 2004/05/01 NaN 7.0
5 2004/05/03 NaN 8.0
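Since the question asks about merge/join specifically, reducing an outer merge over the list works too (a sketch; it assumes the date column keeps its 'Unnamed: 0' name in every frame, and relies on the YYYY/MM/DD strings sorting chronologically):

```python
from functools import reduce

import numpy as np
import pandas as pd

sample1 = pd.DataFrame({'Unnamed: 0': ['2004/04/27', '2004/04/28',
                                       '2004/04/29', '2004/04/30'],
                        'sample_1': [1, 2, 3, 4]})
sample2 = pd.DataFrame({'Unnamed: 0': ['2004/04/28', '2004/04/29',
                                       '2004/05/01', '2004/05/03'],
                        'sample_2': [5, 6, 7, 8]})

# an outer merge keeps every date that appears in any frame
merged = reduce(lambda l, r: pd.merge(l, r, on='Unnamed: 0', how='outer'),
                [sample1, sample2]).sort_values('Unnamed: 0',
                                                ignore_index=True)
```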
I think this is more understandable
sample1 = pd.DataFrame([
['2004/04/27', 1],
['2004/04/28', 2],
['2004/04/29', 3],
['2004/04/30', 4],
], columns=['Unnamed: 0', 'sample_1'])
sample2 = pd.DataFrame([
['2004/04/28', 5],
['2004/04/29', 6],
['2004/05/01', 7],
['2004/05/03', 8],
], columns=['Unnamed: 0', 'sample_2'])
list_of_samples = [sample1, sample2]
for i, sample in enumerate(list_of_samples):
    s = list_of_samples[i].copy()
    cols = s.columns.tolist()
    cols[0] = 'Date'
    s.columns = cols
    s.Date = pd.to_datetime(s.Date)
    s.set_index('Date', inplace=True)
    list_of_samples[i] = s
pd.concat(list_of_samples, axis=1)
sample_1 sample_2
Date
2004-04-27 1.0 NaN
2004-04-28 2.0 5.0
2004-04-29 3.0 6.0
2004-04-30 4.0 NaN
2004-05-01 NaN 7.0
2004-05-03 NaN 8.0

Merge a lot of DataFrames together, without loop and not using concat

I have >1000 DataFrames, each with >20K rows and several columns, which need to be merged on a certain common column; the idea can be illustrated by this:
data1=pd.DataFrame({'name':['a','c','e'], 'value':[1,3,4]})
data2=pd.DataFrame({'name':['a','d','e'], 'value':[3,3,4]})
data3=pd.DataFrame({'name':['d','e','f'], 'value':[1,3,5]})
data4=pd.DataFrame({'name':['d','f','g'], 'value':[0,3,4]})
#some or them may have more or less columns that the others:
#data5=pd.DataFrame({'name':['d','f','g'], 'value':[0,3,4], 'score':[1,3,4]})
final_data = data1
for i, v in enumerate([data2, data3, data4]):
    if i == 0:
        final_data = pd.merge(final_data, v, how='outer', left_on='name',
                              right_on='name', suffixes=('_0', '_%s' % (i+1)))
        # in the real case right_on may be columns other than 'name',
        # depending on the dataframe, but this requirement can be
        # ignored in this minimal example
    else:
        final_data = pd.merge(final_data, v, how='outer', left_on='name',
                              right_on='name', suffixes=('', '_%s' % (i+1)))
Result:
name value_0 value_1 value value_3
0 a 1 3 NaN NaN
1 c 3 NaN NaN NaN
2 e 4 4 3 NaN
3 d NaN 3 1 0
4 f NaN NaN 5 3
5 g NaN NaN NaN 4
[6 rows x 5 columns]
It works, but can this be done without a loop?
Also, why is the name of the second-to-last column not value_2?
P.S.
I know that in this minimal example, the result can also be achieved by:
pd.concat([item.set_index('name') for item in [data1, data2, data3, data4]], axis=1)
But in the real case, due to the way the dataframes were constructed and the information stored in the index columns, this is not an ideal solution without additional tricks. So let's not consider this route.
Does it even make sense to merge it, then? What's wrong with a panel?
> data = [data1, data2, data3, data4]
> p = pd.Panel(dict(zip(map(str, range(len(data))), data)))
> p.to_frame().T
major 0 1 2
minor name value name value name value
0 a 1 c 3 e 4
1 a 3 d 3 e 4
2 d 1 e 3 f 5
3 d 0 f 3 g 4
# and just for kicks
> p.transpose(2, 0, 1).to_frame().reset_index().pivot_table(values='value', rows='name', cols='major')
major 0 1 2 3
name
a 1 3 NaN NaN
c 3 NaN NaN NaN
d NaN 3 1 0
e 4 4 3 NaN
f NaN NaN 5 3
g NaN NaN NaN 4
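On the value_2 question: merge applies suffixes only when a column name collides. In the i == 1 round, the left frame already holds value_0 and value_1, so the incoming value column collides with nothing and keeps its bare name; only in the i == 2 round does it collide again and become value_3. Renaming the value columns up front side-steps this, and functools.reduce removes the explicit loop (a sketch, which also avoids pd.Panel, removed in pandas 1.0):

```python
from functools import reduce

import pandas as pd

data1 = pd.DataFrame({'name': ['a', 'c', 'e'], 'value': [1, 3, 4]})
data2 = pd.DataFrame({'name': ['a', 'd', 'e'], 'value': [3, 3, 4]})
data3 = pd.DataFrame({'name': ['d', 'e', 'f'], 'value': [1, 3, 5]})
data4 = pd.DataFrame({'name': ['d', 'f', 'g'], 'value': [0, 3, 4]})

# give every frame a unique value column, so suffixes are never needed
frames = [d.rename(columns={'value': 'value_%d' % i})
          for i, d in enumerate([data1, data2, data3, data4])]
final_data = reduce(lambda l, r: pd.merge(l, r, on='name', how='outer'),
                    frames)
```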

Joining 2 data frames with overlapping data

I have 2 data frames created by pivot tables
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df=pd.DataFrame({'axis1': ['Unix','Window','Apple','Linux'],
'A': [1,np.nan,1,1],
'B': [1,np.nan,np.nan,1],
'C': [np.nan,1,np.nan,1],
'D': [1,np.nan,1,np.nan],
}).set_index(['axis1'])
print (df)
df2=pd.DataFrame({'axis1': ['Unix','Window','Apple','Linux','A'],
'A': [1,1,np.nan,np.nan,np.nan],
'E': [1,np.nan,1,1,1],
}).set_index(['axis1'])
print (df2)
Output looks like this
A B C D
axis1
Unix 1 1 NaN 1
Window NaN NaN 1 NaN
Apple 1 NaN NaN 1
Linux 1 1 1 NaN
[4 rows x 4 columns]
A E
axis1
Unix 1 1
Window 1 NaN
Apple NaN 1
Linux NaN 1
A NaN 1
Let's say I want to combine them, but I only want values of 1.
So far I've got this, but it does not have column E or row A:
>>> df.update(df2)
>>> df
A B C D
axis1
Unix 1 1 NaN 1
Window 1 NaN 1 NaN
Apple 1 NaN NaN 1
Linux 1 1 1 NaN
[4 rows x 4 columns]
How would I update it to get the additional axis values? (include row A and Column E)
You want to reindex your first DataFrame before you call update.
One robust way would be to calculate the union of both the columns and the rows of the two frames; maybe there is a smarter way, but I can't think of one at the moment:
df = df.reindex(columns=df2.columns.union(df.columns),
index=df2.index.union(df.index))
Then you call update on that, and it should work.
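For this particular shape of problem, combine_first may be a simpler alternative (a sketch): it aligns both frames on the union of their rows and columns and fills the NaNs in df from df2, so row A and column E appear without a manual reindex.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux'],
                   'A': [1, np.nan, 1, 1],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [np.nan, 1, np.nan, 1],
                   'D': [1, np.nan, 1, np.nan]}).set_index('axis1')
df2 = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux', 'A'],
                    'A': [1, 1, np.nan, np.nan, np.nan],
                    'E': [1, np.nan, 1, 1, 1]}).set_index('axis1')

# union of rows and columns; df's values win, df2 fills the gaps
combined = df.combine_first(df2)
```

Note that, unlike update, combine_first returns a new frame rather than modifying df in place.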
