Pandas groupby.apply tries to preserve the original dataframe structure - python

I have the following dataframe:
In [11]: import numpy as np
...: import pandas as pd
...: df = pd.DataFrame(np.random.random(size=(10,10)), index=range(10), columns=range(10))
...: cols = pd.MultiIndex.from_product([['a', 'b', 'c', 'd', 'e'], ['m', 'n']], names=['l1', 'l2'])
...: df.columns = cols
In [12]: df
Out[12]:
l1 a b c d e
l2 m n m n m n m n m n
0 0.257448 0.207198 0.443456 0.553674 0.765539 0.428972 0.587296 0.942761 0.115083 0.073907
1 0.099647 0.702320 0.792053 0.409488 0.112574 0.435044 0.767640 0.946108 0.257002 0.286178
2 0.110061 0.058266 0.350634 0.657057 0.900674 0.882870 0.250355 0.861289 0.041383 0.981890
3 0.408866 0.042692 0.726473 0.482945 0.030925 0.337217 0.377866 0.095778 0.033939 0.550848
4 0.255034 0.455349 0.193223 0.377962 0.445834 0.400846 0.725098 0.567926 0.052293 0.471593
5 0.133966 0.239252 0.479669 0.678660 0.146475 0.042264 0.929615 0.873308 0.603774 0.788071
6 0.068064 0.849320 0.786785 0.767797 0.534253 0.348995 0.267851 0.838200 0.351832 0.566974
7 0.240924 0.089154 0.161263 0.179304 0.077933 0.846366 0.916394 0.771528 0.798970 0.942207
8 0.808719 0.737900 0.300483 0.205682 0.073342 0.081998 0.002116 0.550923 0.460010 0.650109
9 0.413887 0.671698 0.294521 0.833841 0.002094 0.363820 0.148294 0.632994 0.278557 0.340835
And then I want to do the following groupby-apply operation.
In [17]: def func(df):
    ...:     return df.loc[:, df.columns.get_level_values('l2') == 'm']
    ...:
In [19]: df.groupby(level='l1', axis=1).apply(func)
Out[19]:
l1 a b c d e
l2 m n m n m n m n m n
0 0.257448 NaN 0.443456 NaN 0.765539 NaN 0.587296 NaN 0.115083 NaN
1 0.099647 NaN 0.792053 NaN 0.112574 NaN 0.767640 NaN 0.257002 NaN
2 0.110061 NaN 0.350634 NaN 0.900674 NaN 0.250355 NaN 0.041383 NaN
3 0.408866 NaN 0.726473 NaN 0.030925 NaN 0.377866 NaN 0.033939 NaN
4 0.255034 NaN 0.193223 NaN 0.445834 NaN 0.725098 NaN 0.052293 NaN
5 0.133966 NaN 0.479669 NaN 0.146475 NaN 0.929615 NaN 0.603774 NaN
6 0.068064 NaN 0.786785 NaN 0.534253 NaN 0.267851 NaN 0.351832 NaN
7 0.240924 NaN 0.161263 NaN 0.077933 NaN 0.916394 NaN 0.798970 NaN
8 0.808719 NaN 0.300483 NaN 0.073342 NaN 0.002116 NaN 0.460010 NaN
9 0.413887 NaN 0.294521 NaN 0.002094 NaN 0.148294 NaN 0.278557 NaN
Notice that even though I do not return any data for columns with l2=='n', the structure of the original dataframe is still preserved and pandas automatically fills in the missing values with NaN.
This is a simplified example; my intention here is not to select out the 'm' columns. It is just an illustration of the problem I am facing: I want to apply some function to a subset of the columns in the dataframe, and the resulting dataframe should only contain the columns I care about.
Also, I noticed that you cannot rename columns inside the applied function. For example, if you do:
In [25]: def func(df):
    ...:     df = df.loc[:, df.columns.get_level_values('l2') == 'm']
    ...:     df = df.rename(columns={'m':'p'}, level=1)
    ...:     return df
    ...:
In [26]: df.groupby(level='l1', axis=1).apply(func)
Out[26]:
l1 a b c d e
l2 m n m n m n m n m n
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Notice that the result is full of NaN, but the original format of the DF is preserved.
My question is: what should I do so that, inside the applied function, I can manipulate the df such that the output of the apply can differ in shape from the original df?

Read "What is the difference between pandas agg and apply function?". Depending on your actual use case, you may not need to change the function being passed into .agg or .apply.
I want to apply some function on some subset of the columns in the dataframe
You can shape the DataFrame before grouping, or return only a subset of e.g. columns with the desired aggregation or function application.
# pass an indexed view
grouped0 = df.loc[:, ['a', 'b', 'c']].groupby(level='l1', axis=1)
# or perform the .agg or .apply on a subset of e.g. columns
result1 = df.groupby(level='l1', axis=1)[['a', 'b', 'c']].agg(np.sum)
Using .agg on your example code:
In [2]: df
Out[2]:
l1 a b ... d e
l2 m n m n ... m n m n
0 0.007932 0.697320 0.181242 0.380013 ... 0.075391 0.820732 0.335901 0.808365
1 0.736584 0.621418 0.736926 0.962414 ... 0.331465 0.711948 0.426704 0.849730
2 0.099217 0.802882 0.082109 0.489288 ... 0.758056 0.627021 0.539329 0.808187
3 0.152319 0.378918 0.205193 0.489060 ... 0.337615 0.475191 0.025432 0.616413
4 0.582070 0.709464 0.739957 0.472041 ... 0.299662 0.151314 0.113506 0.504926
5 0.351747 0.480518 0.424127 0.364428 ... 0.267780 0.092946 0.134434 0.443320
6 0.572375 0.157129 0.582345 0.124572 ... 0.074523 0.421519 0.733218 0.079004
7 0.026940 0.762937 0.108213 0.073087 ... 0.758596 0.559506 0.601568 0.603528
8 0.991940 0.864772 0.759207 0.523460 ... 0.981770 0.332174 0.012079 0.034952
In [4]: df.groupby(level='l1', axis=1).sum()
Out[4]:
l1 a b c d e
0 0.705252 0.561255 0.804299 0.896123 1.144266
1 1.358002 1.699341 1.422559 1.043413 1.276435
2 0.902099 0.571397 0.273161 1.385077 1.347516
3 0.531237 0.694253 0.914989 0.812806 0.641845
4 1.291534 1.211998 1.138044 0.450976 0.618433
5 0.832265 0.788555 1.063437 0.360726 0.577754
6 0.729504 0.706917 1.018795 0.496042 0.812222
7 0.789877 0.181300 0.406009 1.318102 1.205095
8 1.856713 1.282666 1.183835 1.313944 0.047031
9 0.273369 0.391189 0.867865 0.978350 0.654145
In [10]: df.groupby(level='l1', axis=1).agg(lambda x: x[0])
Out[10]:
l1 a b c d e
0 0.007932 0.181242 0.708712 0.075391 0.335901
1 0.736584 0.736926 0.476286 0.331465 0.426704
2 0.099217 0.082109 0.037351 0.758056 0.539329
3 0.152319 0.205193 0.419761 0.337615 0.025432
4 0.582070 0.739957 0.279153 0.299662 0.113506
5 0.351747 0.424127 0.845485 0.267780 0.134434
6 0.572375 0.582345 0.309942 0.074523 0.733218
7 0.026940 0.108213 0.084424 0.758596 0.601568
8 0.991940 0.759207 0.412974 0.981770 0.012079
9 0.045315 0.282569 0.019320 0.638741 0.292028
In [11]: df.groupby(level='l1', axis=1).agg(lambda x: x[1])
Out[11]:
l1 a b c d e
0 0.697320 0.380013 0.095587 0.820732 0.808365
1 0.621418 0.962414 0.946274 0.711948 0.849730
2 0.802882 0.489288 0.235810 0.627021 0.808187
3 0.378918 0.489060 0.495227 0.475191 0.616413
4 0.709464 0.472041 0.858891 0.151314 0.504926
5 0.480518 0.364428 0.217953 0.092946 0.443320
6 0.157129 0.124572 0.708853 0.421519 0.079004
7 0.762937 0.073087 0.321585 0.559506 0.603528
8 0.864772 0.523460 0.770861 0.332174 0.034952
9 0.228054 0.108620 0.848545 0.339609 0.362117
Since you say that your example func is not your use case, please provide an example of your specific use case if the general cases don't fit.
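If your real func genuinely needs to return a different shape, one workaround (a sketch, reusing the df and the rename example from the question) is to bypass .apply and assemble the result from the groups yourself with pd.concat, so pandas never re-aligns the output to the original column structure. Note that axis=1 in groupby is deprecated in recent pandas; the same idea works by grouping the transpose.
import numpy as np
import pandas as pd

def func(g):
    # keep only the 'm' columns of this group, then rename that level to 'p'
    out = g.loc[:, g.columns.get_level_values('l2') == 'm']
    return out.rename(columns={'m': 'p'}, level=1)

# build the result from the per-group outputs instead of letting
# .apply re-align them to the original columns
pieces = [func(g) for _, g in df.groupby(level='l1', axis=1)]
result = pd.concat(pieces, axis=1)  # columns: (a, p), (b, p), ..., (e, p)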

Related

Select rows with specific values in columns and include rows with NaN in pandas dataframe

I have a DataFrame df that looks something like this:
df
a b c
0 0.557894 -0.196294 -0.020490
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
5 -0.337374 NaN -0.771888
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
9 -2.345448 2.443669 -1.409422
I want to select the rows that have a value over some value, which I would normally do using:
new_df = df[df['c'] >= .5]
but that will return:
a b c
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
8 0.737413 NaN 0.679575
I want to get those rows, but also keep the rows that have NaN values in column 'c'. I haven't been able to find a question asking the same thing; they usually ask for one or the other, but not both. I could hard-code the rows I want to drop, since I know their specific values, but I was wondering if there is a better solution. The end result should look something like this:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
Only rows 0, 5, and 9 are dropped, since their values in column 'c' are less than .5.
You should use the | (or) operator.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.557894,1.138774,np.nan,-0.069319,1.040089,-0.337374,-1.813278,np.nan,0.737413,-2.345448],
'b': [-0.196294,-0.699224,2.384483,np.nan,-0.271777,np.nan,-1.564666,np.nan,np.nan,2.443669],
'c': [-0.020490,np.nan,0.554292,1.162941,np.nan,-0.771888,np.nan,np.nan,0.679575,-1.409422]})
df = df[(df['c'] >= .5) | (df['c'].isnull())]
print(df)
Output:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
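An alternative one-liner (a sketch on the same df) exploits the fact that any comparison with NaN evaluates to False, so negating the opposite condition keeps the NaN rows automatically:
# c < .5 is False both when c >= .5 and when c is NaN, so ~ keeps exactly those rows
new_df = df[~(df['c'] < .5)]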
You should be able to do this with a boolean mask, keeping in mind that or must be the element-wise |, and that NaN cannot be matched with ==; use isna() instead:
new_df = df[(df['c'] >= .5) | (df['c'].isna())]

How to fill and merge df with 10 empty rows?

How do I fill a df with empty rows, or create a df with empty rows?
I have this df:
df = pd.DataFrame(columns=["naming","type"])
How can I fill this df with 10 empty rows?
Specify index values:
df = pd.DataFrame(columns=["naming","type"], index=range(10))
print (df)
naming type
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
If need empty strings:
df = pd.DataFrame('',columns=["naming","type"], index=range(10))
print (df)
naming type
0
1
2
3
4
5
6
7
8
9
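If the goal (per the title) is instead to append 10 empty rows to a frame that already holds data, reindexing over an extended range is one sketch (this assumes a default RangeIndex):
import pandas as pd

df = pd.DataFrame({"naming": ["x"], "type": ["y"]})
# extend the index by 10 positions; the new rows are filled with NaN
df = df.reindex(range(len(df) + 10))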

MultiIndex slicing doesn't work as expected (error involving lexsorted tuples)

I've got a problem, and it just doesn't make sense. I've got a large pd.DataFrame that I reduced in size so that I could easily show it in an example (called test1):
>>> print(test1)
value TIME \
star 0 1 2 3 4
0 1952.205873 1952.205873 1952.205873 1952.205873 1952.205873
1 1952.226307 1952.226307 1952.226307 1952.226307 1952.226307
2 1952.246740 1952.246740 1952.246740 1952.246740 1952.246740
3 1952.267174 1952.267174 1952.267174 1952.267174 1952.267174
value CNTS \
star 5 0 1 2
0 1952.205873 575311.432228 534103.079080 179471.239561
1 1952.226307 571480.854183 533138.021051 187456.451900
2 1952.246740 555631.798095 530263.846685 203247.734806
3 1952.267174 553639.056784 527058.335157 210088.229427
value
star 3 4 5
0 121884.201457 39003.397835 2089.321993
1 122796.312201 39552.401359 2810.010142
2 123500.068304 39158.050385 2652.409086
3 124357.387418 38881.565235 2721.908129
and I want to perform slice indexing on it. However it just doesn't seem to work. Here is what I try:
test1.loc[:,(slice(None),0)]
and I get this error:
*** KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
This isn't the first time I've had this error or asked the question, but I still don't understand how to fix it and what's wrong.
Even more confusing is that the following code seems to work without a hitch:
import pandas as pd
import numpy as np
column_values = ['TIME', 'XPOS']
target = range(0,2)
mindex = pd.MultiIndex.from_product([column_values, target], names=['value', 'target'])
df = pd.DataFrame(columns=mindex, index=range(10), dtype=float)
print(df.loc[:,(slice(None),0)])
I just don't understand what's happening and what's wrong here.
You only need to sort the MultiIndex columns with sort_index:
df = df.sort_index(axis=1)
You can also check the docs on sorting a MultiIndex.
Sample (columns are not lexsorted):
#your sample, only swap values in column_values
column_values = ['XPOS', 'TIME']
target = range(0,2)
mindex = pd.MultiIndex.from_product([column_values, target], names=['value', 'target'])
df = pd.DataFrame(columns=mindex, index=range(10), dtype=float)
print (df)
value XPOS TIME
target 0 1 0 1
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
print (df.columns.is_lexsorted())
False
df = df.sort_index(axis=1)
print (df.columns.is_lexsorted())
True
print(df.loc[:,(slice(None),0)])
value TIME XPOS
target 0 0
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
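Once the columns are sorted, pd.IndexSlice is a more readable spelling of the same selection (a sketch using the sample df above):
idx = pd.IndexSlice
# select target == 0 across every value of the first level
print(df.loc[:, idx[:, 0]])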

Pandas Can't use Apply on Transposed DataFrame

I have a simple function:
def f(returns):
    base = (1 + returns.sum()) / (1 + returns).prod()
    base = pd.Series([base] * len(returns))
    exp = returns.abs() / returns.abs().sum()
    return (1 + returns) * base.pow(exp) - 1.0
and a DataFrame:
df = pd.DataFrame([[.1,.2,.3],[.4,.5,.6],[.7,.8,.9]], columns=['A', 'B', 'C'])
I can do this:
df.apply(f)
A B C
0 0.084169 0.159224 0.227440
1 0.321130 0.375803 0.426375
2 0.535960 0.567532 0.599279
However, the transposition:
df.transpose().apply(f)
produces an unexpected result:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
A NaN NaN NaN
B NaN NaN NaN
C NaN NaN NaN
Now, I can manually transpose the DataFrame:
df2 = pd.DataFrame([[1., 4., 7.],[2., 5., 8.], [3., 6., 9.]], columns=['A', 'B', 'C'])
df2.apply(f)
A B C
0 0.628713 1.516577 2.002160
1 0.989529 1.543616 1.936151
2 1.160247 1.499530 1.836141
I don't understand why I can't simply transpose and then apply the function to each row of the DataFrame. In fact, I don't know why I can't do this either:
df.apply(f, axis=1)
0 1 2 A B C
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
As EdChum says, the problem is pandas is trying to align the index of the Series you create inside f with the index of the DataFrame. This coincidentally works in your first example because you don't specify an index in the Series call, so it uses the default 0, 1, 2, which happens to be the same as your original DF. If your original DF has some other index, it will fail right away:
>>> df = pd.DataFrame([[.1,.2,.3],[.4,.5,.6],[.7,.8,.9]], columns=['A', 'B', 'C'], index=[8, 9, 10])
>>> df.apply(f)
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
To fix it, explicitly create the new Series with the same index as your DF. Change the line inside f to:
base = pd.Series([base] * len(returns), index=returns.index)
Then:
>>> df.apply(f)
A B C
8 0.084169 0.159224 0.227440
9 0.321130 0.375803 0.426375
10 0.535960 0.567532 0.599279
>>> df.T.apply(f)
8 9 10
A 0.087243 0.293863 0.453757
B 0.172327 0.359225 0.505245
C 0.255292 0.421544 0.553746
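As a side note, base is a scalar before it is wrapped in a Series, so a sketch of an alternative fix is to skip the Series entirely and let broadcasting handle it, which removes the alignment pitfall altogether:
def f(returns):
    # base stays a plain scalar; there is no index to misalign
    base = (1 + returns.sum()) / (1 + returns).prod()
    exp = returns.abs() / returns.abs().sum()
    return (1 + returns) * base ** exp - 1.0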

Unmelt Pandas DataFrame

I have a pandas dataframe with two id variables:
df = pd.DataFrame({'id': [1,1,1,2,2,3],
'num': [10,10,12,13,14,15],
'q': ['a', 'b', 'd', 'a', 'b', 'z'],
'v': [2,4,6,8,10,12]})
id num q v
0 1 10 a 2
1 1 10 b 4
2 1 12 d 6
3 2 13 a 8
4 2 14 b 10
5 3 15 z 12
I can pivot the table with:
df.pivot('id','q','v')
And end up with something close:
q a b d z
id
1 2 4 6 NaN
2 8 10 NaN NaN
3 NaN NaN NaN 12
However, what I really want is (the original unmelted form):
id num a b d z
1 10 2 4 NaN NaN
1 12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
2 14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
In other words:
'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' as the index, but I need both since I'm trying to recover the original unmelted form),
'q' provides my columns, and
'v' provides the values in the table.
Update
I found a close solution from Wes McKinney's blog:
df.pivot_table(index=['id','num'], columns='q')
v
q a b d z
id num
1 10 2 4 NaN NaN
12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
However, the format is not quite the same as what I want above.
You could use set_index and unstack
In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
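Note that the values come back as floats: unstack introduces NaN, which forces an upcast from int. If you want nullable integers back, one sketch (assumes pandas >= 0.24 for the nullable Int64 dtype):
out = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
out[['a', 'b', 'd', 'z']] = out[['a', 'b', 'd', 'z']].astype('Int64')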
You're really close, slaw. Just rename your column index to None and you've got what you want.
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, Pandas will error out with:
DataError: No numeric types to aggregate
To resolve this, you can specify your own aggregation function by using a custom lambda function:
df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc= lambda x: x)
You can remove the columns name q:
df1.columns = df1.columns.tolist()
Zero's answer plus removing q:
df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns = df1.columns.tolist()
id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
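An equivalent, arguably cleaner sketch (assumes pandas >= 0.24, where rename_axis accepts columns=None) drops the columns name without rebuilding the columns from a list:
df1 = (df.set_index(['id', 'num', 'q'])['v']
         .unstack()
         .rename_axis(columns=None)
         .reset_index())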
This might work just fine:
Pivot:
df2 = df.pivot_table(index=['id', 'num'], columns='q').reset_index()
Concatenate the 1st-level column names with the 2nd:
df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]
I came up with a close solution:
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
I still can't figure out how to drop 'q' from the dataframe.
It can be done in three steps:
#1: Prepare an auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])
#2: 'pivot' is almost an inverse of melt:
df, df.columns.name = df.pivot(index='id_num', columns='q', values='v').reset_index(), ''
#3: Bring back 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])
This is the result, but with a different column order:
a b d z id num
0 2.0 4.0 NaN NaN 1 10
1 NaN NaN 6.0 NaN 1 12
2 8.0 NaN NaN NaN 2 13
3 NaN 10.0 NaN NaN 2 14
4 NaN NaN NaN 12.0 3 15
Alternatively with proper order:
def multiindex_pivot(df, columns=None, values=None):
    # inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()  #me
    df.columns.name = ''  #me
    return df
df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')
