How to do a math operation on MultiIndex columns in pandas? - python

I would like to perform a simple math operation on MultiIndex columns in pandas.
Take, for example, the MultiIndex built in the code below. In the first iteration,
the variables should have the following values:
par_1=4
par_2=6
par_3=8
and BB, which is calculated as (par_2+par_3+6) / par_1, should equal 5. However, in the code below, it comes out as NaN.
Following the calculation, I would like to append the result to the existing df.
May I know the proper way of tackling this problem?
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([['all_cal'], ['a0_b0', 'a0_b1', 'a0_b3', 'a1_b0', 'a1_b1', 'a1_b3']],
                                     names=['subject', 'type'])
data = np.array([[4, 6, 8, 4, 5, 6]])
df = pd.DataFrame(data, columns=columns)
for idx in [0, 1]:
    par_1 = df.iloc[0, df.columns.get_level_values(1) == f'a{str(idx)}_b0']
    par_2 = df.iloc[0, df.columns.get_level_values(1) == f'a{str(idx)}_b1']
    par_3 = df.iloc[0, df.columns.get_level_values(1) == f'a{str(idx)}_b3']
    BB = (par_2 + par_3 + 6) / par_1
    df.loc[0, ('all_cal', f'{str(idx)}_new_info')] = (par_2 + par_3 + 6) / par_1
    df.loc[0, ('all_cal', f'{str(idx)}_new_other')] = (par_2 * 2) / par_1
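Why the NaN: each par_* above is a one-element Series, and their single labels differ between operands, so pandas aligns them on labels when adding and fills the mismatches with NaN. A minimal sketch demonstrating the alignment:
import pandas as pd

# two one-element Series with different labels
s1 = pd.Series([6], index=['a0_b1'])
s2 = pd.Series([8], index=['a0_b3'])
print(s1 + s2)  # union of the labels, every value NaN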

Try loc with index and column names; in particular, you must access using both column levels:
import numpy as np
import pandas as pd
columns = pd.MultiIndex.from_product([['all_cal'], ['a0_b0', 'a0_b1','a0_b3','a1_b0', 'a1_b1','a1_b3']],
names=['subject', 'type'])
data=np.array([[4,6,8,4,5,6]])
df = pd.DataFrame(data, columns=columns)
idx = 0
par_1 = df.loc[idx, ('all_cal', 'a0_b0')]
par_2 = df.loc[idx, ('all_cal', 'a0_b1')]
par_3 = df.loc[idx, ('all_cal', 'a0_b3')]
BB = (par_2 + par_3 + 6) / par_1
print(f"BB = {BB}")
df.loc[idx, ("all_cal", "new_info")] = (par_2 + par_3 + 6) / par_1
df.loc[idx, ("all_cal", "new_other")] = (par_2 * 2) / par_1
More detailed info is in the pandas user guide on MultiIndex / advanced indexing.
PS: f-strings format numeric values directly, so you can avoid str(idx). For example:
print(f"f-strings support numbers like this: {idx}")
is valid.
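For completeness, here is a sketch of the original loop rewritten with tuple-based .loc access so every operand is a scalar (it keeps the question's column names and formulas):
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([['all_cal'], ['a0_b0', 'a0_b1', 'a0_b3', 'a1_b0', 'a1_b1', 'a1_b3']],
                                     names=['subject', 'type'])
df = pd.DataFrame(np.array([[4, 6, 8, 4, 5, 6]]), columns=columns)
for idx in [0, 1]:
    # a full (level0, level1) tuple returns a single scalar, not a Series
    par_1 = df.loc[0, ('all_cal', f'a{idx}_b0')]
    par_2 = df.loc[0, ('all_cal', f'a{idx}_b1')]
    par_3 = df.loc[0, ('all_cal', f'a{idx}_b3')]
    df.loc[0, ('all_cal', f'{idx}_new_info')] = (par_2 + par_3 + 6) / par_1
    df.loc[0, ('all_cal', f'{idx}_new_other')] = (par_2 * 2) / par_1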

As an alternative, you can set up the MultiIndex differently:
columns = pd.MultiIndex.from_product([['a0', 'a1'], ['b0', 'b1','b3']],
names=['subject', 'type'])
data=np.array([[4,6,8,4,5,6]])
df = pd.DataFrame(data, columns=columns)
print(df)
subject a0 a1
type b0 b1 b3 b0 b1 b3
0 4 6 8 4 5 6
Then you can stack the subject level to do your calculations:
df = df.stack('subject')
df['new_info'] = (df['b1'] + df['b3'] + 6 ) / df['b0']
df['new_other'] = (2 * df['b1']**2) / df['b0']
print(df)
type b0 b1 b3 new_info new_other
subject
0 a0 4 6 8 5.00 18.0
a1 4 5 6 4.25 12.5
...and then unstack them (and reorder things) if you want it to be "wide" again:
df = (df.unstack('subject')
        .sort_index(axis=1, level='subject')
        .reorder_levels([1, 0], axis=1)
     )
print(df)
subject a0 a1
type b0 b1 b3 new_info new_other b0 b1 b3 new_info new_other
0 4 6 8 5.0 18.0 4 5 6 4.25 12.5

Pandas Memory Error when creating new columns with apply() custom function

Function to compute the mean log(1+TPM) of 2 replicates:
import numpy as np

def average_TPM(a, b):
    log_a = np.log(1 + a)
    log_b = np.log(1 + b)
    if log_a > 0.1 and log_b > 0.1:
        avg = np.mean([log_a, log_b])
    else:
        avg = np.nan
    return avg
Applying the function to df to create new columns:
df.loc[:,'leaf'] = df.apply(lambda row: average_TPM(row['leaf1'],row['leaf2']),axis=1)
df.loc[:,'flag_leaf'] = df.apply(lambda row: average_TPM(row['flag_leaf1'],row['flag_leaf2']),axis=1)
df.loc[:,'anther'] = df.apply(lambda row: average_TPM(row['anther1'],row['anther2']),axis=1)
df.loc[:,'premeiotic'] = df.apply(lambda row: average_TPM(row['premeiotic1'],row['premeiotic2']),axis=1)
df.loc[:,'leptotene'] = df.apply(lambda row: average_TPM(row['leptotene1'],row['leptotene2']),axis=1)
df.loc[:,'zygotene'] = df.apply(lambda row: average_TPM(row['zygotene1'],row['zygotene2']),axis=1)
df.loc[:,'pachytene'] = df.apply(lambda row: average_TPM(row['pachytene1'],row['pachytene2']),axis=1)
df.loc[:,'diplotene'] = df.apply(lambda row: average_TPM(row['diplotene1'],row['diplotene2']),axis=1)
df.loc[:,'metaphase_I'] = df.apply(lambda row: average_TPM(row['metaphaseI_1'],row['metaphaseI_2']),axis=1)
df.loc[:,'metaphase_II'] = df.apply(lambda row: average_TPM(row['metaphaseII_1'],row['metaphaseII_2']),axis=1)
df.loc[:,'pollen'] = df.apply(lambda row: average_TPM(row['pollen1'],row['pollen2']),axis=1)
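As an aside, the regularly named pairs above can be generated in a loop instead of repeating .apply; a sketch (the metaphase columns have irregular names, so they are listed separately):
# column names taken from the question
regular = ['leaf', 'flag_leaf', 'anther', 'premeiotic', 'leptotene',
           'zygotene', 'pachytene', 'diplotene', 'pollen']
for name in regular:
    df[name] = df.apply(lambda row: average_TPM(row[f'{name}1'], row[f'{name}2']), axis=1)
# irregular suffixes: metaphaseI_1/metaphaseI_2, metaphaseII_1/metaphaseII_2
df['metaphase_I'] = df.apply(lambda row: average_TPM(row['metaphaseI_1'], row['metaphaseI_2']), axis=1)
df['metaphase_II'] = df.apply(lambda row: average_TPM(row['metaphaseII_1'], row['metaphaseII_2']), axis=1)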
Not sure why you get a memory error, but you can vectorize your problem:
# dummy data
np.random.seed(2)
df = pd.DataFrame(np.random.random(8*4).reshape(8,-1), columns=['a1','a2','b1','b2'])
print (df)
a1 a2 b1 b2
0 0.416493 0.964483 0.089547 0.218952
1 0.655331 0.468490 0.272494 0.652915
2 0.680433 0.461191 0.919223 0.552074
3 0.077158 0.138839 0.385818 0.462848
4 0.149198 0.912372 0.893708 0.081125
5 0.255422 0.143502 0.466123 0.524544
6 0.842095 0.486603 0.628405 0.686393
7 0.329461 0.714052 0.176126 0.566491
Define the list of columns to create, then use np.log1p on the whole data at once:
col_create = ['a', 'b']  # what you need to redefine for your problem
col_get = [f'{col}{i}' for col in col_create for i in range(1, 3)]  # to ensure the order of columns
arr_log = np.log1p(df[col_get].to_numpy())
Now you can use np.where and vectorize comparison to assign the new columns:
df = df.assign(**pd.DataFrame(np.where((arr_log[:, ::2] > 0.1) & (arr_log[:, 1::2] > 0.1),
                                        (arr_log[:, ::2] + arr_log[:, 1::2]) / 2., np.nan),
                               columns=col_create, index=df.index))
print (df)
a1 a2 b1 b2 a b
0 0.533141 0.695231 0.909976 0.441877 0.477569 0.506518
1 0.961887 0.872382 0.064593 0.030619 0.650559 NaN
2 0.646332 0.912140 0.615057 0.354700 0.573386 0.391475
3 0.019646 0.926524 0.160417 0.676512 NaN 0.332748
4 0.249448 0.474937 0.349048 0.390213 0.305659 0.314428
5 0.046568 0.985072 0.147037 0.161261 NaN 0.143344
6 0.812421 0.750128 0.861377 0.765981 0.577176 0.595012
7 0.950178 0.397550 0.803165 0.156186 0.501321 0.367335
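A possibly simpler pandas-level variant of the same idea (a sketch, reusing col_get and col_create from above):
log_df = np.log1p(df[col_get])
for col in col_create:
    a, b = log_df[f'{col}1'], log_df[f'{col}2']
    # keep the mean only where both replicates pass the threshold
    df[col] = ((a + b) / 2).where((a > 0.1) & (b > 0.1))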

add a different random number to every cell in a pandas dataframe

I need to add some 'noise' to my data, so I would like to add a different random number to every cell in my pandas dataframe. This code works, but seems unpythonic. Is there a better way?
import pandas as pd
import numpy as np
df = pd.DataFrame(0.0, index=[1, 2, 3, 4, 5], columns=list('ABC'))
print(df)
for x, line in df.iterrows():
    for col in df:
        line[col] = line[col] + (np.random.rand() - 0.5) / 1000.0
print(df)
df + np.random.rand(*df.shape) / 10000.0
OR
Let's use applymap:
df = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABC') )
df.applymap(lambda x: x + np.random.rand()/10000.0)
output: each cell becomes 1.0 plus a small random offset in [0, 1e-4); the exact values vary per run.
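Note that in recent pandas (2.1+) applymap is deprecated in favor of the elementwise DataFrame.map; a one-line sketch of the same call:
# pandas >= 2.1: DataFrame.map is the elementwise replacement for applymap
df.map(lambda x: x + np.random.rand() / 10000.0)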
This is the more succinct and equivalent method:
In [147]:
df = pd.DataFrame((np.random.rand(5,3) - 0.5)/1000.0, columns=list('ABC'))
df
Out[147]:
A B C
0 0.000381 -0.000167 0.000020
1 0.000482 0.000007 -0.000281
2 -0.000032 -0.000402 -0.000251
3 -0.000037 -0.000319 0.000260
4 -0.000035 0.000178 0.000166
If you're doing this to an existing df with non-zero values then add:
In [149]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df
Out[149]:
A B C
0 -1.705644 0.149067 0.835378
1 -0.956335 -0.586120 0.212981
2 0.550727 -0.401768 1.421064
3 0.348885 0.879210 0.136858
4 0.271063 0.132579 1.233789
In [154]:
df.add((np.random.rand(df.shape[0], df.shape[1]) - 0.5)/1000.0)
Out[154]:
A B C
0 -1.705459 0.148671 0.835761
1 -0.956745 -0.586382 0.213339
2 0.550368 -0.401651 1.421515
3 0.348938 0.878923 0.136914
4 0.270864 0.132864 1.233622
For nonzero data:
df + (np.random.rand(*df.shape) - 0.5) * 0.001
OR
df + np.random.uniform(-0.01, 0.01, df.shape)
For cases where your data frame contains zeros that you wish to keep as zero:
df * (1 + (np.random.rand(*df.shape) - 0.5) * 0.001)
OR
df * (1 + np.random.uniform(-0.01, 0.01, df.shape))
Either of these should work: generate an array the same shape as your existing df and add it (or multiply by 1 + noise where you want zeros to remain zero). With the uniform function you can set the scale of the noise by changing the 0.01 bound.
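For reproducible noise, a sketch using NumPy's newer Generator API (np.random.default_rng; the seed value here is arbitrary):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, so the noise is reproducible
df = pd.DataFrame(0.0, index=[1, 2, 3, 4, 5], columns=list('ABC'))
noisy = df + rng.uniform(-0.0005, 0.0005, size=df.shape)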

Pandas Split Dataframe into two Dataframes at a specific row

I have a pandas DataFrame that I composed with concat. One row consists of 96 values; I would like to split the DataFrame at column 72, so that the first 72 values of a row are stored in Dataframe1 and the next 24 values of a row in Dataframe2.
I create my DF as follows:
temps = DataFrame(myData)
datasX = concat(
[temps.shift(72), temps.shift(71), temps.shift(70), temps.shift(69), temps.shift(68), temps.shift(67),
temps.shift(66), temps.shift(65), temps.shift(64), temps.shift(63), temps.shift(62), temps.shift(61),
temps.shift(60), temps.shift(59), temps.shift(58), temps.shift(57), temps.shift(56), temps.shift(55),
temps.shift(54), temps.shift(53), temps.shift(52), temps.shift(51), temps.shift(50), temps.shift(49),
temps.shift(48), temps.shift(47), temps.shift(46), temps.shift(45), temps.shift(44), temps.shift(43),
temps.shift(42), temps.shift(41), temps.shift(40), temps.shift(39), temps.shift(38), temps.shift(37),
temps.shift(36), temps.shift(35), temps.shift(34), temps.shift(33), temps.shift(32), temps.shift(31),
temps.shift(30), temps.shift(29), temps.shift(28), temps.shift(27), temps.shift(26), temps.shift(25),
temps.shift(24), temps.shift(23), temps.shift(22), temps.shift(21), temps.shift(20), temps.shift(19),
temps.shift(18), temps.shift(17), temps.shift(16), temps.shift(15), temps.shift(14), temps.shift(13),
temps.shift(12), temps.shift(11), temps.shift(10), temps.shift(9), temps.shift(8), temps.shift(7),
temps.shift(6), temps.shift(5), temps.shift(4), temps.shift(3), temps.shift(2), temps.shift(1), temps,
temps.shift(-1), temps.shift(-2), temps.shift(-3), temps.shift(-4), temps.shift(-5), temps.shift(-6),
temps.shift(-7), temps.shift(-8), temps.shift(-9), temps.shift(-10), temps.shift(-11), temps.shift(-12),
temps.shift(-13), temps.shift(-14), temps.shift(-15), temps.shift(-16), temps.shift(-17), temps.shift(-18),
temps.shift(-19), temps.shift(-20), temps.shift(-21), temps.shift(-22), temps.shift(-23)], axis=1)
Question is: how can I split them? :)
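An aside before the answers: the long concat above can be written as a comprehension over the shift offsets, as sketched here:
datasX = concat([temps.shift(i) for i in range(72, -24, -1)], axis=1)
# range(72, -24, -1) yields 72, 71, ..., 1, 0, -1, ..., -23, matching the shifts above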
Use iloc:
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
(iloc docs)
use np.split(..., axis=1):
Demo:
In [255]: df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
In [256]: df
Out[256]:
a b c d e f
0 0.823638 0.767999 0.460358 0.034578 0.592420 0.776803
1 0.344320 0.754412 0.274944 0.545039 0.031752 0.784564
2 0.238826 0.610893 0.861127 0.189441 0.294646 0.557034
3 0.478562 0.571750 0.116209 0.534039 0.869545 0.855520
4 0.130601 0.678583 0.157052 0.899672 0.093976 0.268974
In [257]: dfs = np.split(df, [4], axis=1)
In [258]: dfs[0]
Out[258]:
a b c d
0 0.823638 0.767999 0.460358 0.034578
1 0.344320 0.754412 0.274944 0.545039
2 0.238826 0.610893 0.861127 0.189441
3 0.478562 0.571750 0.116209 0.534039
4 0.130601 0.678583 0.157052 0.899672
In [259]: dfs[1]
Out[259]:
e f
0 0.592420 0.776803
1 0.031752 0.784564
2 0.294646 0.557034
3 0.869545 0.855520
4 0.093976 0.268974
np.split() is pretty flexible - let's split an original DF into 3 DFs at columns with indexes [2,3]:
In [260]: dfs = np.split(df, [2,3], axis=1)
In [261]: dfs[0]
Out[261]:
a b
0 0.823638 0.767999
1 0.344320 0.754412
2 0.238826 0.610893
3 0.478562 0.571750
4 0.130601 0.678583
In [262]: dfs[1]
Out[262]:
c
0 0.460358
1 0.274944
2 0.861127
3 0.116209
4 0.157052
In [263]: dfs[2]
Out[263]:
d e f
0 0.034578 0.592420 0.776803
1 0.545039 0.031752 0.784564
2 0.189441 0.294646 0.557034
3 0.534039 0.869545 0.855520
4 0.899672 0.093976 0.268974
I generally use np.array_split because of its simpler syntax, and it scales better to more than 2 partitions.
import numpy as np
partitions = 2
dfs = np.array_split(df, partitions)
np.split(df, [100, 200, 300], axis=0) wants explicit index positions, which may or may not be desirable.
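A quick sketch of the practical difference: np.array_split tolerates partitions that do not divide evenly, while np.split raises in that case.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])
dfs = np.array_split(df, 3)  # row counts 2, 2, 1 -- no error
# np.split(df, 3) would raise ValueError: 5 rows do not divide evenly into 3 parts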

pandas merge by coordinates

I am trying to merge two pandas tables where I find all rows in df2 which have coordinates close to each row in df1. Example follows.
df1:
x y val
0 0 1 A
1 1 3 B
2 2 9 C
df2:
x y val
0 1.2 2.8 a
1 0.9 3.1 b
2 2.0 9.5 c
desired result:
x y val_x val_y
0 0 1 A NaN
1 1 3 B a
2 1 3 B b
3 2 9 C c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and a match means the squared Euclidean distance is less than 1:
(x1 - x2)^2 + (y1 - y2)^2 < 1
The input dataframes have different sizes, even though they don't in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but I am not sure what to do from there:
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    # ?? What now?
Any help would be very much appreciated. I made this example in an IPython notebook, which you can view here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not really happy about having to loop over the rows in df1. In this case there are only a few hundred rows, so I can deal with it, but it won't scale like a vectorized approach would. Solution:
df2_list = []
df1['merge_row'] = df1.index.values  # make a column to merge on, using the index values
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0].copy()
    df2_subset['merge_row'] = i  # add the merge key
    df2_list.append(df2_subset)
df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
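To avoid the Python-level loop entirely, a sketch using scipy's KDTree for the radius search (assumes scipy is available; note query_ball_point matches distance <= r, marginally looser than the strict < 1 above):
import pandas as pd
from scipy.spatial import cKDTree

tree = cKDTree(df2[['x', 'y']].to_numpy())
# for each df1 point, the indices of df2 points within radius 1
hits = tree.query_ball_point(df1[['x', 'y']].to_numpy(), r=1.0)
pairs = [(i, j) for i, js in enumerate(hits) for j in js]
matched = pd.DataFrame(pairs, columns=['merge_row', 'df2_idx'])
result = (df1.assign(merge_row=df1.index)
             .merge(matched, on='merge_row', how='left')
             .merge(df2, left_on='df2_idx', right_index=True, how='left',
                    suffixes=('_1', '_2')))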

Python: Scaling numbers column by column with pandas

I have a pandas DataFrame df in which I'd like to perform some scaling column by column.
In column 'a', I need the maximum number to be 1, the minimum to be 0, and all the others spread accordingly.
In column 'b', however, I need the minimum number to be 1, the maximum to be 0, and all the others spread accordingly.
Is there a pandas function to perform these two operations? If not, numpy would certainly do.
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114
This is how you can do it using scikit-learn's preprocessing module, which has many functions for scaling and centering data.
In [0]: from sklearn.preprocessing import MinMaxScaler
In [1]: df = pd.DataFrame({'A':[14,90,90,96,91],
'B':[103,107,110,114,114]}).astype(float)
In [2]: df
Out[2]:
A B
0 14 103
1 90 107
2 90 110
3 96 114
4 91 114
In [3]: scaler = MinMaxScaler()
In [4]: df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
In [5]: df_scaled
Out[5]:
A B
0 0.000000 0.000000
1 0.926829 0.363636
2 0.926829 0.636364
3 1.000000 1.000000
4 0.939024 1.000000
You could subtract the min, then divide by the max (beware 0/0). Note that after subtracting the min, the new max is the original max - min.
In [11]: df
Out[11]:
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114
In [12]: df -= df.min() # equivalent to df = df - df.min()
In [13]: df /= df.max() # equivalent to df = df / df.max()
In [14]: df
Out[14]:
a b
A 0.000000 0.000000
B 0.926829 0.363636
C 0.926829 0.636364
D 1.000000 1.000000
E 0.939024 1.000000
To switch the order of a column (from 1 to 0 rather than 0 to 1):
In [15]: df['b'] = 1 - df['b']
An alternative method is to negate the b column first (df['b'] = -df['b']).
In case you want to scale only one column in the dataframe, you can do the following:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Col1_scaled'] = scaler.fit_transform(df['Col1'].values.reshape(-1,1))
This is not very elegant, but the following works for this two-column case:
#Create dataframe
df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
#Apply operates on each row or column with the lambda function
#axis = 0 -> act on columns, axis = 1 act on rows
#x is a variable for the whole row or column
#This line will scale minimum = 0 and maximum = 1 for each column
df2 = df.apply(lambda x:(x.astype(float) - min(x))/(max(x)-min(x)), axis = 0)
#Want to now invert the order on column 'B'
#Use apply function again, reverse numbers in column, select column 'B' only and
#reassign to column 'B' of original dataframe
df2['B'] = df2.apply(lambda x: 1-x, axis = 1)['B']
If I find a more elegant way (for example, using the column index, (0 or 1) mod 2 - 1, to select the sign in the apply operation so it can be done with a single apply command), I'll let you know.
I think Acumenus' comment in this answer should be mentioned explicitly as an answer, as it is a one-liner.
>>> import pandas as pd
>>> from sklearn.preprocessing import minmax_scale
>>> df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
>>> minmax_scale(df)
array([[0. , 0. ],
[0.92682927, 0.36363636],
[0.92682927, 0.63636364],
[1. , 1. ],
[0.93902439, 1. ]])
Given a data frame
df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
scale to mean 0 and variance 1:
df.apply(lambda x: (x - np.mean(x)) / np.std(x), axis=0)
scale to the range between 0 and 1:
df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)), axis=0)
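Putting it together for the original ask (column 'a' scaled 0 to 1, column 'b' reversed to run 1 to 0), a minimal sketch in plain pandas:
import pandas as pd

df = pd.DataFrame({'a': [14, 90, 90, 96, 91], 'b': [103, 107, 110, 114, 114]},
                  index=list('ABCDE'), dtype=float)
scaled = (df - df.min()) / (df.max() - df.min())  # every column to [0, 1]
scaled['b'] = 1 - scaled['b']                     # flip 'b' so min -> 1, max -> 0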
