I was trying to generate a pivot table with multiple "values" columns. I know I can use aggfunc to aggregate values the way I want, but what if I don't want to sum or average both columns, but instead want the sum of one column and the mean of the other? Is it possible to do this with pandas?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 6,
    'B': ['A', 'B', 'C'] * 8,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    'D': np.random.randn(24),
    'E': np.random.randn(24)
})
Now this gets a pivot table with the sum (note: the old rows= keyword is index= in modern pandas):
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=np.sum)
And this the mean:
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=np.mean)
How can I get sum for D and mean for E?
Hope my question is clear enough.
You can apply a specific function to a specific column by passing in a dict.
pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc={'D': np.sum, 'E': np.mean})
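A self-contained sketch of the dict form, with a fixed seed so the numbers are reproducible (the string aggfunc names and the modern index= keyword are assumptions for current pandas versions):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the output is reproducible
df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 6,
    'B': ['A', 'B', 'C'] * 8,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    'D': np.random.randn(24),
    'E': np.random.randn(24),
})

# sum of D and mean of E, per group in B
result = pd.pivot_table(df, values=['D', 'E'], index=['B'],
                        aggfunc={'D': 'sum', 'E': 'mean'})
print(result)
```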
You can concat two DataFrames:
>>> df1 = pd.pivot_table(df, values=['D'], index=['B'], aggfunc=np.sum)
>>> df2 = pd.pivot_table(df, values=['E'], index=['B'], aggfunc=np.mean)
>>> pd.concat((df1, df2), axis=1)
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
Or you can pass a list of functions as the aggfunc parameter and then select the columns you need:
>>> df3 = pd.pivot_table(df, values=['D', 'E'], index=['B'], aggfunc=[np.sum, np.mean])
>>> df3
sum mean
D E D E
B
A 1.810847 -4.193425 0.226356 -0.524178
B 2.762190 -3.544245 0.345274 -0.443031
C 0.867519 0.627677 0.108440 0.078460
>>> df3 = df3.loc[:, [('sum', 'D'), ('mean', 'E')]]
>>> df3.columns = ['D', 'E']
>>> df3
D E
B
A 1.810847 -0.524178
B 2.762190 -0.443031
C 0.867519 0.078460
It would be nice, though, to have an option to define aggfunc for each column individually. I don't know how it could be done; maybe by passing a dict-like parameter to aggfunc, like {'D': np.mean, 'E': np.sum}.
Update: actually, in your case you can pivot by hand:
>>> df.groupby('B').aggregate({'D':np.sum, 'E':np.mean})
E D
B
A -0.524178 1.810847
B -0.443031 2.762190
C 0.078460 0.867519
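In pandas 0.25+ the same hand-pivot can be written with named aggregation, which also gives you control over the output column order; a minimal sketch with reproducible data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible random data
df = pd.DataFrame({
    'B': ['A', 'B', 'C'] * 8,
    'D': np.random.randn(24),
    'E': np.random.randn(24),
})

# named aggregation: output_column=(input_column, aggregation)
out = df.groupby('B').agg(D=('D', 'sum'), E=('E', 'mean'))
print(out)
```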
In recent pandas versions, pivot_table accepts the dict directly (this snippet from the pandas docs uses its own example df, whose 'C' column holds 'small'/'large'):
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                       aggfunc={'D': np.mean, 'E': np.sum})
table
D E
mean sum
A C
bar large 5.500000 7.500000
small 5.500000 8.500000
foo large 2.000000 4.500000
small 2.333333 4.333333
I have two dataframes.
The first one, df1:
df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a']
})
df1
looks as:
Sample Value
0 Sam1 ak,b,c,k
1 Sam2 d,k,e,b,f,a
2 Sam3 am,x,y,z,a
The second one, df2:
df2 = pd.DataFrame({
    'Remove': ['ak', 'b', 'k', 'a', 'am']
})
df2
Looks as:
Remove
0 ak
1 b
2 k
3 a
4 am
I want to remove the strings from df1['Value'] that match entries in df2['Remove'].
Expected output is:
Sample Value
Sam1 c
Sam2 d,e,f
Sam3 x,y,z
The code I tried did not help me.
Any help is appreciated, thanks.
Using apply as a one-liner:
df1['Value'] = df1['Value'].str.split(',').apply(lambda x:','.join([i for i in x if i not in df2['Remove'].values]))
Output:
>>> df1
Sample Value
0 Sam1 c
1 Sam2 d,e,f
2 Sam3 x,y,z
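A variant of the same idea that builds a set first, so each membership test is O(1) rather than a scan over df2['Remove'].values (rebuilding the question's df1/df2 so the sketch is self-contained):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a'],
})
df2 = pd.DataFrame({'Remove': ['ak', 'b', 'k', 'a', 'am']})

remove = set(df2['Remove'])  # set gives O(1) membership tests
df1['Value'] = df1['Value'].str.split(',').apply(
    lambda toks: ','.join(t for t in toks if t not in remove))
print(df1)
```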
You can use apply() to remove items in df1's Value column if they are in df2's Remove column.
import pandas as pd

df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a']
})
df2 = pd.DataFrame({'Remove': ['ak', 'b', 'k', 'a', 'am']})

remove_list = df2['Remove'].values.tolist()

def remove_value(row, remove_list):
    # keep only the tokens that are not in the remove list
    keep_list = [val for val in row['Value'].split(',') if val not in remove_list]
    return ','.join(keep_list)

df1['Value'] = df1.apply(remove_value, axis=1, args=(remove_list,))
print(df1)
Sample Value
0 Sam1 c
1 Sam2 d,e,f
2 Sam3 x,y,z
This script will help you:
for index, elements in enumerate(df1['Value']):
    elements = elements.split(',')
    df1.at[index, 'Value'] = list(set(elements) - set(df2['Remove']))  # .at avoids chained assignment
Just iterate over the DataFrame and take the set difference with the remove list.
The complete code will be something like this:
import pandas as pd

df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a']
})
df2 = pd.DataFrame({
    'Remove': ['ak', 'b', 'k', 'a', 'am']
})

for index, elements in enumerate(df1['Value']):
    elements = elements.split(',')
    df1.at[index, 'Value'] = list(set(elements) - set(df2['Remove']))  # .at avoids chained assignment

print(df1)
Output:
Sample Value
0 Sam1 [c]
1 Sam2 [e, d, f]
2 Sam3 [y, x, z]
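Note that the set difference loses the original token order and leaves lists rather than comma-joined strings. An order-preserving variant of the same loop (rebuilding df1/df2 here so the sketch is self-contained) could look like this:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Sample': ['Sam1', 'Sam2', 'Sam3'],
    'Value': ['ak,b,c,k', 'd,k,e,b,f,a', 'am,x,y,z,a'],
})
df2 = pd.DataFrame({'Remove': ['ak', 'b', 'k', 'a', 'am']})

remove = set(df2['Remove'])
for index, elements in enumerate(df1['Value']):
    kept = [t for t in elements.split(',') if t not in remove]  # list comprehension keeps order
    df1.at[index, 'Value'] = ','.join(kept)  # .at avoids chained assignment
print(df1)
```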
On a DataFrame with a single column level, I can group the data on columns using a dictionary:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'],
                   columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
dict_col = {'a': 'ab', 'b': 'ab', 'c': 'c', 'd': 'd',
            'e': 'efgh', 'f': 'efgh', 'g': 'efgh', 'h': 'efgh'}
df1.groupby(dict_col, axis=1).sum()
ab c d efgh
A 1.014831 1.274621 -1.490353 -0.954438
B 1.484857 -0.968642 0.700881 -3.281607
C 0.898556 1.444362 0.680974 -2.985182
On a MultiIndexed DataFrame:
midx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],
                                   ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']])
df2 = pd.DataFrame(np.random.randn(3, 32), index=['A', 'B', 'C'], columns=midx)
df2.groupby(dict_col, axis=1, level=1).sum()
ab c d efgh
A 6.583721 -1.554734 1.922187 1.100208
B 6.138441 0.653721 -0.204472 1.890755
C 0.951489 2.695940 -1.494028 0.907464
How can I get something like this (all elements on level 0)?
bar baz foo
ab c d efgh ab c d efgh ......
A 6.583721 -1.554734 1.922187 1.100208 4.944954 -1.343831 0.939265 -3.614612 ......
B 6.138441 0.653721 -0.204472 1.890755 -0.347505 1.633708 0.392096 0.414880 ......
C 0.951489 2.695940 -1.494028 0.907464 1.905409 -1.021097 -2.399670 0.799798 ......
One way is to pass a function to groupby and then convert the tuples back to a MultiIndex:
out = df2.groupby(lambda x: (x[0], dict_col[x[1]]), axis=1).sum()
out.columns = pd.MultiIndex.from_tuples(out.columns)
Another way is to flatten the column levels with stack, then unstack back after the groupby:
df2.stack(level=0).groupby(dict_col, axis=1).sum().unstack(level=-1)
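Since groupby(..., axis=1) is deprecated in recent pandas, the first approach can also be phrased by transposing, grouping the rows, and transposing back; a self-contained sketch (midx and dict_col mirror the question's setup, with a fixed seed):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible random data
dict_col = {'a': 'ab', 'b': 'ab', 'c': 'c', 'd': 'd',
            'e': 'efgh', 'f': 'efgh', 'g': 'efgh', 'h': 'efgh'}
midx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],
                                   list('abcdefgh')])
df2 = pd.DataFrame(np.random.randn(3, 32), index=['A', 'B', 'C'], columns=midx)

# transpose, group rows by (level-0 label, mapped level-1 label), transpose back
out = df2.T.groupby(lambda x: (x[0], dict_col[x[1]])).sum().T
out.columns = pd.MultiIndex.from_tuples(out.columns)
print(out)
```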
I read this excellent guide to pivoting but I can't work out how to apply it to my case. I have tidy data like this:
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'case': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', ],
... 'perf_var': ['num', 'time', 'num', 'time', 'num', 'time', 'num', 'time'],
... 'perf_value': [1, 10, 2, 20, 1, 30, 2, 40]
... }
... )
>>>
>>> df
case perf_var perf_value
0 a num 1
1 a time 10
2 a num 2
3 a time 20
4 b num 1
5 b time 30
6 b num 2
7 b time 40
What I want is:
To use "case" as the columns
To use the "num" values as the index
To use the "time" values as the value.
to give:
case a b
1.0 10 30
2.0 20 40
All the pivot examples I can see have the index and values in separate columns, but the above seems like a valid/common "tidy" data case to me (I think?). Is it possible to pivot from this?
You need a bit of preprocessing to get your final result:
(df.assign(num=np.where(df.perf_var == "num", df.perf_value, np.nan),
           time=np.where(df.perf_var == "time", df.perf_value, np.nan))
   .assign(num=lambda x: x.num.ffill(),
           time=lambda x: x.time.bfill())
   .loc[:, ["case", "num", "time"]]
   .drop_duplicates()
   .pivot(index="num", columns="case", values="time"))
case a b
num
1.0 10.0 30.0
2.0 20.0 40.0
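Another possible route, assuming pandas >= 1.1 (needed for the list-valued pivot index): number the repeated (case, perf_var) observations with cumcount so each num/time pair lands on one row, then pivot once more:

```python
import pandas as pd

df = pd.DataFrame({
    'case': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'perf_var': ['num', 'time', 'num', 'time', 'num', 'time', 'num', 'time'],
    'perf_value': [1, 10, 2, 20, 1, 30, 2, 40],
})

# number repeated (case, perf_var) pairs so each (case, obs) row pairs one num with one time
wide = (df.assign(obs=df.groupby(['case', 'perf_var']).cumcount())
          .pivot(index=['case', 'obs'], columns='perf_var', values='perf_value')
          .reset_index())
result = wide.pivot(index='num', columns='case', values='time')
print(result)
```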
An alternative route to the same end point:
(
    df.set_index(["case", "perf_var"], append=True)
      .unstack()
      .droplevel(0, 1)
      .assign(num=lambda x: x.num.ffill(),
              time=lambda x: x.time.bfill())
      .drop_duplicates()
      .droplevel(0)
      .set_index("num", append=True)
      .unstack(0)
      .rename_axis(index=None)
)
I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1            df2
id  value      id  value
a   5          a   3
c   9          b   7
d   4          c   6
f   2          d   8
               e   2
               f   1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the ids present for the whole universe) and then overwriting the value for ids that are present in df1, but I cannot find the merge syntax to do this.
You could use combine_first, provided that you first make id the index of each DataFrame (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
Since you mentioned merging, you might be interested in seeing that
you could merge df1 and df2 on id, and then use fillna to replace the NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
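One more option along the same lines is DataFrame.update, which overwrites df2's values in place with df1's wherever the ids overlap (a sketch using the same df1/df2):

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'],
                    'value': [3, 7, 6, 8, 2, 1]})

df3 = df2.set_index('id')
df3.update(df1.set_index('id'))  # df1's values win where ids overlap
df3 = df3.reset_index()
print(df3)
```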
Consider the following DataFrame:
arrays = [['foo', 'bar', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(4, 4), columns=columnValues)
print(df)
foo bar
A B C D
0 0.859664 0.671857 0.685368 0.939156
1 0.155301 0.495899 0.733943 0.585682
2 0.124663 0.467614 0.622972 0.567858
3 0.789442 0.048050 0.630039 0.722298
Say I want to remove the first column, like so:
df.drop(df.columns[[0]], axis=1, inplace=True)
print(df)
bar
B C D
0 0.671857 0.685368 0.939156
1 0.495899 0.733943 0.585682
2 0.467614 0.622972 0.567858
3 0.048050 0.630039 0.722298
This produces the expected result; however, the column labels foo and A are retained:
print(df.columns.levels)
[['bar', 'foo'], ['A', 'B', 'C', 'D']]
Is there a way to completely drop a column, including its labels, from a MultiIndex DataFrame?
EDIT: As suggested by John, I had a look at https://github.com/pydata/pandas/issues/12822. What I got from it is that it's not a bug; however, I believe the suggested solution (https://github.com/pydata/pandas/issues/2770#issuecomment-76500001) does not work for me. Am I missing something here?
df2 = df.drop(df.columns[[0]], axis=1)
print(df2)
bar
B C D
0 0.969674 0.068575 0.688838
1 0.650791 0.122194 0.289639
2 0.373423 0.470032 0.749777
3 0.707488 0.734461 0.252820
print(df2.columns[[0]])
MultiIndex(levels=[['bar', 'foo'], ['A', 'B', 'C', 'D']],
labels=[[0], [1]])
df2.set_index(pd.MultiIndex.from_tuples(df2.columns.values))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 3 elements
New Answer
As of pandas 0.20, pd.MultiIndex has a method pd.MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
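A quick round trip showing the effect; this rebuilds the question's frame and drops the first column by its ('foo', 'A') label (the label-based drop is just for reproducibility):

```python
import numpy as np
import pandas as pd

arrays = [['foo', 'bar', 'bar', 'bar'], ['A', 'B', 'C', 'D']]
columns = pd.MultiIndex.from_tuples(list(zip(*arrays)))
df = pd.DataFrame(np.random.rand(4, 4), columns=columns)

df = df.drop(columns=[('foo', 'A')])  # drop the first column
print(df.columns.levels)              # 'foo' and 'A' still linger here

df.columns = df.columns.remove_unused_levels()  # prune the stale labels
print(df.columns.levels)
```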
Old Answer
Our savior is pd.MultiIndex.to_series(); it returns a series of tuples restricted to what is in the DataFrame:
df.columns = pd.MultiIndex.from_tuples(df.columns.to_series())