I have a dataframe with a multiindex, as per the following example:
import datetime
import numpy
import pandas

dates = pandas.date_range(datetime.date(2020,1,1), datetime.date(2020,1,4))
columns = ['a', 'b', 'c']
index = pandas.MultiIndex.from_product([dates,columns])
panel = pandas.DataFrame(index=index, columns=columns)
This gives me a dataframe like this:
a b c
2020-01-01 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
2020-01-02 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
2020-01-03 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
2020-01-04 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
I have another 2-dimensional dataframe, as follows:
df = pandas.DataFrame(index=dates, columns=columns, data=numpy.random.rand(len(dates), len(columns)))
Resulting in the following:
a b c
2020-01-01 0.540867 0.426181 0.220182
2020-01-02 0.864340 0.432873 0.487878
2020-01-03 0.017099 0.181050 0.373139
2020-01-04 0.764557 0.097839 0.499788
I would like to assign to the [a, a] cell across all dates, the [a, b] cell across all dates, and so on.
Something akin to the following:
for i in df.columns:
    for j in df.columns:
        panel.xs(i, level=1).loc[j] = df[i] * df[j]
Of course this doesn't work, because I am attempting to set a value on a copy of a slice
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I tried several variations:
panel.loc[:,'a'] # selects all rows, and column a
panel.loc[(:, 'a'), 'a'] # invalid syntax
etc...
How can I select index level 1 (eg: row 'a'), column 'a', across all index level 0 - and be able to set the values?
Try broadcasting on the values:
a = df.to_numpy()
panel = pd.DataFrame((a[..., None] * a[:, None, :]).reshape(-1, df.shape[1]),
                     index=panel.index, columns=panel.columns)
Output:
a b c
2020-01-01 a 0.292537 0.230507 0.119089
b 0.230507 0.181630 0.093837
c 0.119089 0.093837 0.048480
2020-01-02 a 0.747084 0.374149 0.421692
b 0.374149 0.187379 0.211189
c 0.421692 0.211189 0.238025
2020-01-03 a 0.000292 0.003096 0.006380
b 0.003096 0.032779 0.067557
c 0.006380 0.067557 0.139233
2020-01-04 a 0.584547 0.074803 0.382116
b 0.074803 0.009572 0.048899
c 0.382116 0.048899 0.249788
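If you prefer keeping the explicit double loop from the question, the assignment can also be made to work with a single `.loc` call per (row label, column) pair. A minimal sketch, repeating the setup from the question; the `.to_numpy()` on the right-hand side is there to sidestep index alignment between the flat date index and the MultiIndex:

```python
import datetime

import numpy as np
import pandas as pd

# Rebuild the frames from the question.
dates = pd.date_range(datetime.date(2020, 1, 1), datetime.date(2020, 1, 4))
columns = ['a', 'b', 'c']
panel = pd.DataFrame(index=pd.MultiIndex.from_product([dates, columns]),
                     columns=columns, dtype=float)
df = pd.DataFrame(np.random.rand(len(dates), len(columns)),
                  index=dates, columns=columns)

for i in df.columns:
    for j in df.columns:
        # (slice(None), j) selects every date at inner level j; writing
        # through a single .loc call avoids the chained-assignment copy
        # that triggers SettingWithCopyWarning.
        panel.loc[(slice(None), j), i] = (df[i] * df[j]).to_numpy()
```

The loop is much slower than the broadcasting approach for large frames, but it makes the per-cell intent explicit.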
I have an empty pandas dataframe (df), a list of (index, column) pairs (pair_list), and a list of corresponding values (value_list). I want to assign the value in value_list to the corresponding position in df according to pair_list. The following code is what I am using currently, but it is slow. Is there any faster way to do it?
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[0,1,2,3], columns=['a', 'b','c','d'])
pair_list = [(0,'a'),(1,'c'),(0,'d')]
value_list = np.array([3,2,4])
for pos, item in enumerate(pair_list):
    df.at[item] = value_list[pos]
The output of the code should be:
a b c d
0 3 NaN NaN 4
1 NaN NaN 2 NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
One idea is to create a MultiIndex with MultiIndex.from_tuples, build a Series from the values, reshape it with Series.unstack, and then add the missing columns and index values with DataFrame.reindex:
pair_list = [(0,'a'),(1,'c'),(0,'d')]
value_list = np.array([3,2,4])
mux = pd.MultiIndex.from_tuples(pair_list)
cols = ['a', 'b','c','d']
idx = [0,1,2,3]
df = pd.Series(value_list, index=mux).unstack().reindex(index=idx, columns=cols)
print (df)
a b c d
0 3.0 NaN NaN 4.0
1 NaN NaN 2.0 NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
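Another option, if the frame starts empty anyway, is to skip pandas indexing entirely: look up the integer position of each label pair once with `Index.get_indexer`, write the values into a plain NumPy array, and wrap the result at the end. A sketch under the same inputs as the question:

```python
import numpy as np
import pandas as pd

index = [0, 1, 2, 3]
cols = ['a', 'b', 'c', 'd']
pair_list = [(0, 'a'), (1, 'c'), (0, 'd')]
value_list = np.array([3, 2, 4])

# Translate each (row, column) label pair to integer positions once,
# then assign all values in a single fancy-indexing step.
idx = pd.Index(index)
col_idx = pd.Index(cols)
arr = np.full((len(idx), len(col_idx)), np.nan)
arr[idx.get_indexer([r for r, _ in pair_list]),
    col_idx.get_indexer([c for _, c in pair_list])] = value_list
df = pd.DataFrame(arr, index=idx, columns=col_idx)
```

This avoids the per-item Python loop, which is where the original code spends its time.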
I have the following dataframe:
In [11]: import numpy as np
...: import pandas as pd
...: df = pd.DataFrame(np.random.random(size=(10,10)), index=range(10), columns=range(10))
...: cols = pd.MultiIndex.from_product([['a', 'b', 'c', 'd', 'e'], ['m', 'n']], names=['l1', 'l2'])
...: df.columns = cols
In [12]: df
Out[12]:
l1 a b c d e
l2 m n m n m n m n m n
0 0.257448 0.207198 0.443456 0.553674 0.765539 0.428972 0.587296 0.942761 0.115083 0.073907
1 0.099647 0.702320 0.792053 0.409488 0.112574 0.435044 0.767640 0.946108 0.257002 0.286178
2 0.110061 0.058266 0.350634 0.657057 0.900674 0.882870 0.250355 0.861289 0.041383 0.981890
3 0.408866 0.042692 0.726473 0.482945 0.030925 0.337217 0.377866 0.095778 0.033939 0.550848
4 0.255034 0.455349 0.193223 0.377962 0.445834 0.400846 0.725098 0.567926 0.052293 0.471593
5 0.133966 0.239252 0.479669 0.678660 0.146475 0.042264 0.929615 0.873308 0.603774 0.788071
6 0.068064 0.849320 0.786785 0.767797 0.534253 0.348995 0.267851 0.838200 0.351832 0.566974
7 0.240924 0.089154 0.161263 0.179304 0.077933 0.846366 0.916394 0.771528 0.798970 0.942207
8 0.808719 0.737900 0.300483 0.205682 0.073342 0.081998 0.002116 0.550923 0.460010 0.650109
9 0.413887 0.671698 0.294521 0.833841 0.002094 0.363820 0.148294 0.632994 0.278557 0.340835
And then I want to do the following groupby-apply operation.
In [17]: def func(df):
...: return df.loc[:, df.columns.get_level_values('l2') == 'm']
...:
In [19]: df.groupby(level='l1', axis=1).apply(func)
Out[19]:
l1 a b c d e
l2 m n m n m n m n m n
0 0.257448 NaN 0.443456 NaN 0.765539 NaN 0.587296 NaN 0.115083 NaN
1 0.099647 NaN 0.792053 NaN 0.112574 NaN 0.767640 NaN 0.257002 NaN
2 0.110061 NaN 0.350634 NaN 0.900674 NaN 0.250355 NaN 0.041383 NaN
3 0.408866 NaN 0.726473 NaN 0.030925 NaN 0.377866 NaN 0.033939 NaN
4 0.255034 NaN 0.193223 NaN 0.445834 NaN 0.725098 NaN 0.052293 NaN
5 0.133966 NaN 0.479669 NaN 0.146475 NaN 0.929615 NaN 0.603774 NaN
6 0.068064 NaN 0.786785 NaN 0.534253 NaN 0.267851 NaN 0.351832 NaN
7 0.240924 NaN 0.161263 NaN 0.077933 NaN 0.916394 NaN 0.798970 NaN
8 0.808719 NaN 0.300483 NaN 0.073342 NaN 0.002116 NaN 0.460010 NaN
9 0.413887 NaN 0.294521 NaN 0.002094 NaN 0.148294 NaN 0.278557 NaN
Notice that even if I do not return any data for columns with l2 == 'n', the structure of the original dataframe is still preserved, and pandas automatically fills in the values with NaN.
This is a simplified example; my intention here is not to select the 'm' columns. This is just an illustration of the problem I am facing: I want to apply some function to a subset of the columns in the dataframe, and the resulting dataframe should only have the columns I care about.
Also, I noticed that you cannot rename the columns in the applied function. For example, if you do:
In [25]: def func(df):
...: df = df.loc[:, df.columns.get_level_values('l2') == 'm']
...: df = df.rename(columns={'m':'p'}, level=1)
...: return df
...:
In [26]: df.groupby(level='l1', axis=1).apply(func)
Out[26]:
l1 a b c d e
l2 m n m n m n m n m n
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Notice the result is full of NaN but the original format of the DF is preserved.
My question is, what should I do so that in the applied function I can manipulate the df so the output of the apply can be different in shape compared to the original df?
Read "What is the difference between pandas agg and apply function?". Depending on your actual use case, you may not need to change the function being passed into .agg or .apply.
I want to apply some function on some subset of the columns in the dataframe
You can shape the DataFrame before grouping, or perform the aggregation or function application on only a subset of the columns.
# pass an indexed view
grouped0 = df.loc[:, ['a', 'b', 'c']].groupby(level='l1', axis=1)
# perform the .agg or .apply on a subset of the columns
result1 = df.groupby(level='l1', axis=1)[['a', 'b', 'c']].agg(np.sum)
Using .agg on your example code:
In [2]: df
Out[2]:
l1 a b ... d e
l2 m n m n ... m n m n
0 0.007932 0.697320 0.181242 0.380013 ... 0.075391 0.820732 0.335901 0.808365
1 0.736584 0.621418 0.736926 0.962414 ... 0.331465 0.711948 0.426704 0.849730
2 0.099217 0.802882 0.082109 0.489288 ... 0.758056 0.627021 0.539329 0.808187
3 0.152319 0.378918 0.205193 0.489060 ... 0.337615 0.475191 0.025432 0.616413
4 0.582070 0.709464 0.739957 0.472041 ... 0.299662 0.151314 0.113506 0.504926
5 0.351747 0.480518 0.424127 0.364428 ... 0.267780 0.092946 0.134434 0.443320
6 0.572375 0.157129 0.582345 0.124572 ... 0.074523 0.421519 0.733218 0.079004
7 0.026940 0.762937 0.108213 0.073087 ... 0.758596 0.559506 0.601568 0.603528
8 0.991940 0.864772 0.759207 0.523460 ... 0.981770 0.332174 0.012079 0.034952
In [4]: df.groupby(level='l1', axis=1).sum()
Out[4]:
l1 a b c d e
0 0.705252 0.561255 0.804299 0.896123 1.144266
1 1.358002 1.699341 1.422559 1.043413 1.276435
2 0.902099 0.571397 0.273161 1.385077 1.347516
3 0.531237 0.694253 0.914989 0.812806 0.641845
4 1.291534 1.211998 1.138044 0.450976 0.618433
5 0.832265 0.788555 1.063437 0.360726 0.577754
6 0.729504 0.706917 1.018795 0.496042 0.812222
7 0.789877 0.181300 0.406009 1.318102 1.205095
8 1.856713 1.282666 1.183835 1.313944 0.047031
9 0.273369 0.391189 0.867865 0.978350 0.654145
In [10]: df.groupby(level='l1', axis=1).agg(lambda x: x[0])
Out[10]:
l1 a b c d e
0 0.007932 0.181242 0.708712 0.075391 0.335901
1 0.736584 0.736926 0.476286 0.331465 0.426704
2 0.099217 0.082109 0.037351 0.758056 0.539329
3 0.152319 0.205193 0.419761 0.337615 0.025432
4 0.582070 0.739957 0.279153 0.299662 0.113506
5 0.351747 0.424127 0.845485 0.267780 0.134434
6 0.572375 0.582345 0.309942 0.074523 0.733218
7 0.026940 0.108213 0.084424 0.758596 0.601568
8 0.991940 0.759207 0.412974 0.981770 0.012079
9 0.045315 0.282569 0.019320 0.638741 0.292028
In [11]: df.groupby(level='l1', axis=1).agg(lambda x: x[1])
Out[11]:
l1 a b c d e
0 0.697320 0.380013 0.095587 0.820732 0.808365
1 0.621418 0.962414 0.946274 0.711948 0.849730
2 0.802882 0.489288 0.235810 0.627021 0.808187
3 0.378918 0.489060 0.495227 0.475191 0.616413
4 0.709464 0.472041 0.858891 0.151314 0.504926
5 0.480518 0.364428 0.217953 0.092946 0.443320
6 0.157129 0.124572 0.708853 0.421519 0.079004
7 0.762937 0.073087 0.321585 0.559506 0.603528
8 0.864772 0.523460 0.770861 0.332174 0.034952
9 0.228054 0.108620 0.848545 0.339609 0.362117
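If the goal is simply a result with fewer columns, you may not need groupby/apply at all: DataFrame.xs can select one value of a column level and drop that level, so the output really is a smaller frame. A sketch with random data shaped like the example:

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['a', 'b', 'c', 'd', 'e'], ['m', 'n']],
                                  names=['l1', 'l2'])
df = pd.DataFrame(np.random.random((10, 10)), columns=cols)

# xs selects the 'm' entries of level 'l2' and drops that level entirely,
# leaving one flat column per 'l1' value instead of NaN-padded pairs.
m_only = df.xs('m', axis=1, level='l2')
```

The result has shape (10, 5) with the single-level columns a through e, unlike the apply version that pads the dropped columns with NaN.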
Since you say that your example func is not your use case, please provide an example of your specific use case if the general cases don't fit.
I am currently working on a dataframe from a cross-tab operation.
pd.crosstab(data['One'],data['two'], margins=True).apply(lambda r: r/len(data)*100,axis = 1)
Columns come out in the following order
A B C D E All
B
C
D
E
All 100
But I want the columns ordered as shown below:
A C D B E All
B
C
D
E
All 100
Is there an easy way to reorder the columns?
When I use colnames=['C', 'D','B','E'], it returns an error:
AssertionError: arrays and names must have the same length
You can use reindex or change the order by subsetting (reindex_axis also worked historically, but it is deprecated and has been removed in recent pandas versions):
colnames = ['C', 'D','B','E']
new_cols = colnames + ['All']
#solution 1 change ordering by reindexing
df1 = df.reindex(columns=new_cols)
#solution 2 change order by subset
df1 = df[new_cols]
print (df1)
C D B E All
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN 100.0
To specify the columns of any dataframe in pandas, just index with a list of the columns in the order you want:
columns = ['A', 'C', 'D', 'B', 'E', 'All']
df2 = df.loc[:, columns]
print(df2)
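A minimal end-to-end sketch of the crosstab-plus-reorder workflow, using made-up data since the original data isn't shown; labels absent from the crosstab are filtered out before subsetting, which avoids the reindex KeyError/NaN pitfalls:

```python
import pandas as pd

# Hypothetical data standing in for the question's 'One' and 'two' columns.
data = pd.DataFrame({'One': ['B', 'C', 'D', 'E'],
                     'two': ['A', 'C', 'D', 'B']})
ct = pd.crosstab(data['One'], data['two'], margins=True)
ct = ct.apply(lambda r: r / len(data) * 100, axis=1)

# Subset in the desired order, skipping labels the crosstab doesn't have
# and keeping the 'All' margin last.
order = [c for c in ['A', 'C', 'D', 'B', 'E'] if c in ct.columns] + ['All']
ct = ct[order]
```

Note that colnames in pd.crosstab names the column levels; it does not reorder them, which is why passing it a reordered list raises the AssertionError above.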
I have a simple function:
def f(returns):
    base = (1 + returns.sum()) / (1 + returns).prod()
    base = pd.Series([base] * len(returns))
    exp = returns.abs() / returns.abs().sum()
    return (1 + returns) * base.pow(exp) - 1.0
and a DataFrame:
df = pd.DataFrame([[.1,.2,.3],[.4,.5,.6],[.7,.8,.9]], columns=['A', 'B', 'C'])
I can do this:
df.apply(f)
A B C
0 0.084169 0.159224 0.227440
1 0.321130 0.375803 0.426375
2 0.535960 0.567532 0.599279
However, the transposition:
df.transpose().apply(f)
produces an unexpected result:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
A NaN NaN NaN
B NaN NaN NaN
C NaN NaN NaN
Now, I can manually transpose the DataFrame:
df2 = pd.DataFrame([[1., 4., 7.],[2., 5., 8.], [3., 6., 9.]], columns=['A', 'B', 'C'])
df2.apply(f)
A B C
0 0.628713 1.516577 2.002160
1 0.989529 1.543616 1.936151
2 1.160247 1.499530 1.836141
I don't understand why I can't simply transpose and then apply the function to each row of the DataFrame. In fact, I don't know why I can't do this either:
df.apply(f, axis=1)
0 1 2 A B C
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
As EdChum says, the problem is that pandas tries to align the index of the Series you create inside f with the index of the DataFrame. This coincidentally works in your first example because you don't specify an index in the Series call, so it uses the default 0, 1, 2, which happens to match your original DataFrame's index. If your original DataFrame has any other index, it will fail right away:
>>> df = pd.DataFrame([[.1,.2,.3],[.4,.5,.6],[.7,.8,.9]], columns=['A', 'B', 'C'], index=[8, 9, 10])
>>> df.apply(f)
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
To fix it, explicitly create the new Series with the same index as your DataFrame. Change the line inside f to:
base = pd.Series([base] * len(returns), index=returns.index)
Then:
>>> df.apply(f)
A B C
8 0.084169 0.159224 0.227440
9 0.321130 0.375803 0.426375
10 0.535960 0.567532 0.599279
>>> df.T.apply(f)
8 9 10
A 0.087243 0.293863 0.453757
B 0.172327 0.359225 0.505245
C 0.255292 0.421544 0.553746
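Putting the fix together, here is a self-contained version of f with the index passed through, run on a frame with a non-default index to show that both apply directions now work:

```python
import pandas as pd

def f(returns):
    base = (1 + returns.sum()) / (1 + returns).prod()
    # The fix: build the Series on the caller's index so that apply
    # can align the result with the original frame.
    base = pd.Series([base] * len(returns), index=returns.index)
    exp = returns.abs() / returns.abs().sum()
    return (1 + returns) * base.pow(exp) - 1.0

df = pd.DataFrame([[.1, .2, .3], [.4, .5, .6], [.7, .8, .9]],
                  columns=['A', 'B', 'C'], index=[8, 9, 10])

col_wise = df.apply(f)    # no NaNs; index 8, 9, 10 preserved
row_wise = df.T.apply(f)  # transposing then applying also works
```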
Say I have a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
I would like to subtract the entries in column df.a from all other columns. In other words, I would like a dataframe with the following columns:
| col_b - col_a | col_c - col_a | col_d - col_a |
I have tried df - df.a but this yields something odd:
0 1 2 3 a b c d e
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
How can I do this type of column-wise operation in Pandas? Also, just wondering, what does df - df.a do?
You probably want
>>> df.sub(df.a, axis=0)
a b c d e
0 0 0.112285 0.267105 0.365407 -0.159907
1 0 0.380421 0.119536 0.356203 0.096637
2 0 -0.100310 -0.180927 0.112677 0.260202
3 0 0.653642 0.566408 0.086720 0.256536
df - df.a is basically trying to do the subtraction along the other axis, so the indices don't match, and when using binary operators like subtraction, "mismatched indices will be unioned together" (as the docs say). Since the indices don't match, you wind up with the columns
0 1 2 3 a b c d e
For example, you could get to the same destination more indirectly by transposing: (df.T - df.a).T flips df so that the default alignment axis is the right one.
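A quick sketch verifying the axis=0 behaviour against the explicit per-column arithmetic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))

# axis=0 aligns df.a with the rows (the index), broadcasting it across
# every column; column 'a' therefore becomes all zeros.
result = df.sub(df['a'], axis=0)
```

Each column of result equals the corresponding original column minus df.a, which is exactly the |col_x - col_a| layout asked for (plus the zeroed 'a' column, which you can drop afterwards).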