Merge a lot of DataFrames together, without a loop and without using concat - python

I have >1000 DataFrames, each with >20K rows and several columns, that need to be merged on a certain common column. The idea can be illustrated by this:
import pandas as pd

data1 = pd.DataFrame({'name': ['a', 'c', 'e'], 'value': [1, 3, 4]})
data2 = pd.DataFrame({'name': ['a', 'd', 'e'], 'value': [3, 3, 4]})
data3 = pd.DataFrame({'name': ['d', 'e', 'f'], 'value': [1, 3, 5]})
data4 = pd.DataFrame({'name': ['d', 'f', 'g'], 'value': [0, 3, 4]})
# some of them may have more or fewer columns than the others:
# data5 = pd.DataFrame({'name': ['d', 'f', 'g'], 'value': [0, 3, 4], 'score': [1, 3, 4]})

final_data = data1
for i, v in enumerate([data2, data3, data4]):
    if i == 0:
        final_data = pd.merge(final_data, v, how='outer', left_on='name',
                              right_on='name', suffixes=('_0', '_%s' % (i + 1)))
        # in the real case right_on may be a column other than 'name',
        # depending on the dataframe, but this requirement can be
        # ignored in this minimal example.
    else:
        final_data = pd.merge(final_data, v, how='outer', left_on='name',
                              right_on='name', suffixes=('', '_%s' % (i + 1)))
Result:
  name  value_0  value_1  value  value_3
0    a        1        3    NaN      NaN
1    c        3      NaN    NaN      NaN
2    e        4        4      3      NaN
3    d      NaN        3      1        0
4    f      NaN      NaN      5        3
5    g      NaN      NaN    NaN        4

[6 rows x 5 columns]
It works, but can this be done without a loop?
Also, why is the second-to-last column named value rather than value_2?
P.S.
I know that in this minimal example, the result can also be achieved by:
pd.concat([item.set_index('name') for item in [data1, data2, data3, data4]], axis=1)
But in the real case, due to the way the dataframes were constructed and the information stored in the index columns, this is not an ideal solution without additional tricks. So let's not consider this route.
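A loop-free way to express the same chained merge, not taken from the answers below, is functools.reduce; a minimal sketch, assuming every frame shares the 'name' column and that the overlapping value columns are renamed up front so the suffixes question never arises:
from functools import reduce

import pandas as pd

frames = [data1, data2, data3, data4]

# merge suffixes only kick in when two column names collide in a single pairwise
# merge, so rename the non-key columns first to keep every frame's columns distinct
renamed = [f.rename(columns={c: '%s_%d' % (c, i) for c in f.columns if c != 'name'})
           for i, f in enumerate(frames)]

merged = reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), renamed)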

Does it even make sense to merge them, then? What's wrong with a Panel?
> data = [data1, data2, data3, data4]
> p = pd.Panel(dict(zip(map(str, range(len(data))), data)))
> p.to_frame().T
major       0             1             2
minor    name value    name value    name value
0           a     1       c     3       e     4
1           a     3       d     3       e     4
2           d     1       e     3       f     5
3           d     0       f     3       g     4
# and just for kicks
> p.transpose(2, 0, 1).to_frame().reset_index().pivot_table(values='value', rows='name', cols='major')
major 0 1 2 3
name
a 1 3 NaN NaN
c 3 NaN NaN NaN
d NaN 3 1 0
e 4 4 3 NaN
f NaN NaN 5 3
g NaN NaN NaN 4
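Note that pd.Panel was removed in pandas 1.0; on current pandas, a rough equivalent of the same name-by-frame table can be sketched with concat plus pivot_table (an illustration under that assumption, not the original answer):
import pandas as pd

frames = [data1, data2, data3, data4]

# stack the frames into one long frame, tagging each row with the frame it came from,
# then pivot to the same name-by-frame layout as the Panel pivot above
long_form = pd.concat(frames, keys=range(len(frames)), names=['major', 'row'])
wide = (long_form.reset_index('major')
                 .pivot_table(values='value', index='name', columns='major'))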

Related

How to merge 3 columns into 1 whilst keeping values column separate

I have the following Pivot table:
Subclass  Subclass2  Layer  Amount
A         B          C      5
E         F          G      100
I want to merge the 3 columns together and have Amount stay separate to form this:
Col1  Amount
A     NaN
B     NaN
C     5
E     NaN
F     NaN
G     100
So far I've turned it into a regular DataFrame and did this:
df.melt(id_vars = ['SubClass', 'SubClass2'], value_name = 'CQ')
But that didn't arrange it right at all; it messed up all the columns.
I figured that once I get the melt right, I could just change the NaN values to 0 or blanks.
EDIT
I need to keep Subclass & Subclass2 in the final column as they're the higher-level mapping of Layer, which is why I want the output Col1 to include them before listing Layer, with Amount next to it.
Thanks!
Here is one way to do it:
pd.concat([
    df,
    df[['Subclass', 'Subclass2']].stack().reset_index()[0].to_frame().rename(columns={0: 'Layer'})
])[['Layer', 'Amount']].sort_values('Layer')
Layer Amount
0 A NaN
1 B NaN
0 C 5.0
2 E NaN
3 F NaN
1 G 100.0
Here is my interpretation, using stack instead of melt to preserve the order.
out = (df
   .set_index('Amount')
   .stack().reset_index(name='Col1')
   .assign(Amount=lambda d: d['Amount'].where(d['level_1'].eq('Layer'), 0))
   .drop(columns='level_1')
)
NB: with melt the syntax would be df.melt(id_vars='Amount', value_name='Col1'), using 'variable' in place of 'level_1'.
Output:
Amount Col1
0 0 A
1 0 B
2 5 C
3 0 E
4 0 F
5 100 G
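If you want NaN (as in the requested output) rather than 0, and Col1 first, a small variation on the stack approach above (a sketch, not part of the original answer):
out = (df
   .set_index('Amount')
   .stack().reset_index(name='Col1')
   .assign(Amount=lambda d: d['Amount'].where(d['level_1'].eq('Layer')))  # leave NaN where the row is not a Layer
   .drop(columns='level_1')
   [['Col1', 'Amount']]
)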

Change dataframe from MultiIndex to MultiColumn

I have a dataframe like
import numpy as np
import pandas as pd

multiindex1 = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df1 = pd.DataFrame(np.random.randn(5, 3), index=multiindex1)
multiindex2 = pd.MultiIndex.from_product([['s'], np.arange(1, 6)])
df2 = pd.DataFrame(np.random.randn(5, 3), index=multiindex2)
multiindex3 = pd.MultiIndex.from_product([['d'], np.arange(2, 7)])
df3 = pd.DataFrame(np.random.randn(5, 3), index=multiindex3)
df = pd.concat([df1, df2, df3])
>>>
0 1 2
a 3 0.872208 -0.145098 -0.519966
4 -0.976089 -0.730808 -1.463151
5 -0.026869 0.227912 1.525924
6 -0.161126 -0.415913 -1.211737
7 -0.893881 -0.769385 0.346436
s 1 -0.972483 0.202820 0.265304
2 0.007303 0.802974 -0.254106
3 1.619309 -1.545089 0.161493
4 2.847376 0.951796 -0.877882
5 1.749237 -0.327026 0.467784
d 2 1.440793 -0.697371 0.902004
3 0.390181 -0.449725 -0.462104
4 0.056916 0.140066 0.918281
5 0.164234 -2.491176 2.035113
6 -1.648948 0.372179 0.600297
Now I want to change it into multicolumns like
          a                               s              d
          0         1         2           0   1   2      0   1   2
1       NaN       NaN       NaN           ...
2       NaN       NaN       NaN           ...
3  0.872208 -0.145098 -0.519966           ...
4 -0.976089 -0.730808 -1.463151           ...
5 -0.026869  0.227912  1.525924           ...
6 -0.161126 -0.415913 -1.211737           ...
7 -0.893881 -0.769385  0.346436           ...
That's to say, I want two goals to be achieved:
change the level-0 index (multi index into single index) into level-0 columns (single column into multi column)
merge the level-1 indexes together and reindex by them
I've tried stack, unstack and pivot, but couldn't get the target form.
So is there any elegant way to achieve it?
This seemed to work for me. I used to work with someone who was amazing with multi-indices and he taught me a lot of this.
df.unstack(level=0).reorder_levels([0,1], axis=1).swaplevel(axis=1)[["a", "s", "d"]]
Output:
a s d
0 1 2 0 1 2 0 1 2
1 NaN NaN NaN 0.206957 1.329804 -0.037481 NaN NaN NaN
2 NaN NaN NaN 0.244912 1.880180 1.447138 -1.009454 0.215560 0.126756
3 0.871496 -1.247274 -0.458660 0.514475 0.989567 -1.653646 -0.623382 -0.799157 0.119259
4 -0.756771 0.523621 0.067967 1.066499 -1.436044 -1.045745 0.440954 -1.997325 -1.223662
5 0.707063 1.019831 0.422577 0.964631 -0.034742 -0.891528 0.891096 -0.724778 -0.043314
6 -0.140548 -0.093853 -1.060963 NaN NaN NaN 0.643902 -0.062267 -0.505484
7 -0.449435 0.360956 -0.769459 NaN NaN NaN NaN NaN NaN
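Since reorder_levels([0, 1], axis=1) leaves the column levels in the order they already have, a slightly shorter equivalent (a sketch on the same df, not part of the original answer) is:
# move the outer row level ('a'/'s'/'d') into the columns, put it on top,
# then select the groups in the original a, s, d order
df.unstack(level=0).swaplevel(axis=1)[['a', 's', 'd']]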

aggregating across a dictionary containing dataframes

I have the following dictionary, which contains dataframes as values, each always having the same number of columns (1) with the same title:
import numpy as np
import pandas as pd

test = {'A': pd.DataFrame(np.random.randn(10), index=range(10), columns=['values']),
        'B': pd.DataFrame(np.random.randn(6), index=range(6), columns=['values']),
        'C': pd.DataFrame(np.random.randn(11), index=range(11), columns=['values'])}
From this, I would like to create a single dataframe where the index values are the keys of the dictionary (so A, B, C) and the columns are the union of the current index values across all dataframes (so in this case 0, 1, 2, ..., 10). The values of this dataframe would be the corresponding 'values' from the dataframe for each row, and NaN where blank.
Is there a handy way to do this?
IIUC, use pd.concat, keys, and unstack:
pd.concat([test[i] for i in test], keys=test.keys()).unstack(1)['values']
Better yet,
pd.concat(test).unstack(1)['values']
Output:
0 1 2 3 4 5 6 \
A -0.029027 -0.530398 -0.866021 1.331116 0.090178 1.044801 -1.586620
C 1.320105 1.244250 -0.162734 0.942929 -0.309025 -0.853728 1.606805
B -1.683822 1.015894 -0.178339 -0.958557 -0.910549 -1.612449 NaN
7 8 9 10
A -1.072210 1.654565 -1.188060 NaN
C 1.642461 -0.137037 -1.416697 -0.349107
B NaN NaN NaN NaN
Don't overcomplicate things: just use concat and transpose.
pd.concat(test, axis=1).T
0 1 2 3 4 5 \
A values -0.592711 0.266518 -0.774702 0.826701 -2.642054 -0.366401
B values -0.709410 -0.463603 0.058129 -0.054475 -1.060643 0.081655
C values 1.384366 0.662186 -1.467564 0.449142 -1.368751 1.629717
6 7 8 9 10
A values 0.431069 0.761245 -1.125767 0.614622 NaN
B values NaN NaN NaN NaN NaN
C values 0.988287 -1.508384 0.214971 -0.062339 -0.011547
If you were dealing with Series instead of 1-column DataFrames, it would make more sense to begin with...
test = {'A': pd.Series(np.random.randn(10), index=range(10)),
        'B': pd.Series(np.random.randn(6), index=range(6)),
        'C': pd.Series(np.random.randn(11), index=range(11))}
pd.concat(test, axis=1).T
0 1 2 3 4 5 6 \
A -0.174565 -2.015950 0.051496 -0.433199 0.073010 -0.287708 -1.236115
B 0.935434 0.228623 0.205645 -0.602561 1.860035 -0.921963 NaN
C 0.944508 -1.296606 -0.079339 0.629038 0.314611 -0.429055 -0.911775
7 8 9 10
A -0.704886 -0.369263 -0.390684 NaN
B NaN NaN NaN NaN
C 0.815078 0.061458 1.726053 -0.503471
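If you do keep the one-column DataFrames, the leftover 'values' level can be dropped from the row index afterwards (a sketch on the same test dict, not part of the original answer; DataFrame.droplevel is available in pandas 0.24+):
# same concat-and-transpose idea, then drop the redundant inner index level
pd.concat(test, axis=1).T.droplevel(1)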

Unmelting a pandas dataframe with two columns

Suppose I have a dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(10, 3)), columns=list('abc'))
I melt the dataframe using pd.melt so that it looks like
variable value
a 0.2
a 0.03
a -0.99
a 0.86
a 1.74
Now, I would like to undo the action. Using pivot(columns='variable') almost works, but returns a lot of NaN values:
      a    b    c
0   0.2  NaN  NaN
1  0.03  NaN  NaN
2 -0.99  NaN  NaN
3  0.86  NaN  NaN
4  1.74  NaN  NaN
How can I unmelt the dataframe so that it is as before?
A few ideas:
Assuming d1 is df.melt()
groupby + comprehension
pd.DataFrame({n: list(s) for n, s in d1.groupby('variable').value})
a b c
0 -1.087129 -1.264522 1.147618
1 0.403731 0.416867 -0.367249
2 -0.920536 0.442650 -0.351229
3 -1.193876 -0.342237 -2.001431
4 -1.596659 -1.223354 1.323841
5 0.753658 -0.891211 0.541265
6 0.455577 -1.059572 1.017490
7 -0.153736 0.050007 -0.280192
8 1.189587 0.405647 -0.102023
9 -0.103273 0.200320 -0.630194
Option 2
pd.DataFrame.set_index
d1.set_index([d1.groupby('variable').cumcount(), 'variable']).value.unstack()
variable a b c
0 -1.087129 -1.264522 1.147618
1 0.403731 0.416867 -0.367249
2 -0.920536 0.442650 -0.351229
3 -1.193876 -0.342237 -2.001431
4 -1.596659 -1.223354 1.323841
5 0.753658 -0.891211 0.541265
6 0.455577 -1.059572 1.017490
7 -0.153736 0.050007 -0.280192
8 1.189587 0.405647 -0.102023
9 -0.103273 0.200320 -0.630194
Use groupby, apply and unstack.
df.groupby('variable')['value']\
  .apply(lambda x: pd.Series(x.values)).unstack().T
variable a b c
0 0.617037 -0.321493 0.747025
1 0.576410 -0.498173 0.185723
2 -1.563912 0.741198 1.439692
3 -1.305317 1.203608 -1.112820
4 1.287638 1.649580 0.404494
5 0.923544 0.988020 -1.918680
6 0.497406 -1.373345 0.074963
7 0.528444 -0.019914 -1.666261
8 0.260955 0.103575 0.190424
9 0.614411 -0.165363 -0.149514
Another method uses pivot and transform, provided you don't have NaN values in the value column, i.e.
df1 = df.melt()
(df1.pivot(columns='variable', values='value')
    .transform(lambda x: sorted(x, key=pd.isnull))
    .dropna())
Output:
variable a b c
0 1.596937 0.431029 0.345441
1 -0.493352 0.135649 -1.559669
2 0.548048 0.667752 0.258160
3 -0.251368 -0.265106 -2.339768
4 -0.397010 -0.381193 -0.359447
5 -0.945300 0.520029 0.362570
6 -0.883771 -0.612628 -0.478003
7 0.833100 -0.387262 -1.195496
8 -1.310178 -0.748359 0.073014
9 0.753457 1.105500 -0.895841
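Another equivalent formulation (a sketch on the same melted frame, not from the original answers) is Option 2 written with pivot instead of unstack, using a per-variable counter as the new index:
d1 = df.melt()
d1['row'] = d1.groupby('variable').cumcount()   # position of each value within its variable
d1.pivot(index='row', columns='variable', values='value')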

How do I combine two columns within a dataframe in Pandas?

Say I have two columns, A and B, in my dataframe:
A B
1 NaN
2 5
3 NaN
4 6
I want to get a new column, C, which fills in NaN cells in column B using values from column A:
A B C
1 NaN 1
2 5 5
3 NaN 3
4 6 6
How do I do this?
I'm sure this is a very basic question, but as I am new to Pandas, any help will be appreciated!
You can use combine_first:
df['C'] = df['B'].combine_first(df['A'])
Docs: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.combine_first.html
You can use where, which is a vectorized if/else:
df['C'] = df['A'].where(df['B'].isnull(), df['B'])
A B C
0 1 NaN 1
1 2 5 5
2 3 NaN 3
3 4 6 6
df['C'] = df['B'].fillna(df['A'])
What .fillna does is fill all the NaN values in the column; you can pass any scalar (or an aligned Series) to it.
Here we pass df['A'], so this method puts the corresponding values of 'A' into the NaN slots of 'B'.
The final answer ends up in 'C'.
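For completeness, a NumPy version of the same if/else (a sketch assuming the A/B columns above; not from the original answers):
import numpy as np

# take A where B is missing, otherwise keep B
df['C'] = np.where(df['B'].isna(), df['A'], df['B'])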
