Pandas: Dynamic centering of multiple columns - python

I need to transform all of the grouped columns in a DataFrame except the one column that holds the output variable.
df = pd.DataFrame({
    'Branch': ['A', 'A', 'A', 'B', 'B', 'B'],
    'M1': [1, 3, 5, 8, 9, 3],
    'M2': [2, 4, 5, 9, 2, 1],
    'Output': [1, 5, 5, 8, 1, 3]
})
Right now I am centering every column except the output column manually, by listing each one explicitly inside the grouping function.
def group_center(df):
    df['M1'] = df['M1'] - df['M1'].mean()
    df['M2'] = df['M2'] - df['M2'].mean()
    return df

centered = df.groupby('Branch').apply(group_center)
Is there a way to do this in a more dynamic fashion, as the number of variables I am analyzing keeps increasing?

You can define a list of the columns of interest and pass it to the groupby; apply with a lambda will then operate on each of those columns:
In [53]:
cols = ['M1','M2']
df[cols] = df.groupby('Branch')[cols].apply(lambda x: x - x.mean())
df
Out[53]:
Branch M1 M2 Output
0 A -2.000000 -1.666667 1
1 A 0.000000 0.333333 5
2 A 2.000000 1.333333 5
3 B 1.333333 5.000000 8
4 B 2.333333 -2.000000 1
5 B -3.666667 -3.000000 3
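On recent pandas versions, GroupBy.transform is the more idiomatic tool for this kind of broadcast-back operation, and the column list can be derived rather than typed out. A minimal sketch against the original frame, assuming Branch and Output are the only columns to exclude:
# every column except the group key and the output variable
cols = df.columns.difference(['Branch', 'Output'])
# transform returns a result aligned with the original rows, so it assigns straight back
df[cols] = df.groupby('Branch')[cols].transform(lambda x: x - x.mean())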

Here is a more general way where you do not need to name your columns anywhere: compute the per-group means, set Branch as the index so that the subtraction aligns on it, and restore the output column afterwards.
means = df.groupby('Branch').mean()
df.set_index('Branch', inplace=True)
output = df['Output']
df = df - means
df['Output'] = output
M1 M2 Output
Branch
A -2.000000 -1.666667 1
A 0.000000 0.333333 5
A 2.000000 1.333333 5
B 1.333333 5.000000 8
B 2.333333 -2.000000 1
B -3.666667 -3.000000 3
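If you want Branch back as a regular column after the subtraction, a one-line follow-up does it:
df.reset_index(inplace=True)  # moves Branch from the index back into a column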

Related

Update a DataFrame with duplicate destination

I would like to update a dataframe with another one, but with multiple "destinations". Here is an example:
df1 = pd.DataFrame({'name': ['A', 'B', 'C', 'A'],
                    'category': ['X', 'X', 'Y', 'Y'],
                    'value1': [None, 1, None, None],
                    'value2': [None, 10, None, None]})
name category value1 value2
0 A X NaN NaN
1 B X 1.0 10.0
2 C Y NaN NaN
3 A Y NaN NaN
df2 = pd.DataFrame({'name':['A', 'C'], 'value1':[2, 3], 'value2':[11, 12]})
name value1 value2
0 A 2 11
1 C 3 12
And the desired result would be
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
I don't think pandas' update works, since 'A' appears twice in my first DataFrame.
pd.merge creates extra columns, and there is probably a more elegant way than reconciling those duplicate columns manually after the merge.
Thanks in advance for your help!
You can use fillna after mapping the name column in df1 to the corresponding values from df2. For a single column:
mapping = df2.set_index('name')['value1']
df1['value1'] = df1['value1'].fillna(df1['name'].map(mapping))
If you want to map multiple columns:
mapping = df2.set_index('name')
for col in mapping:
    df1[col] = df1[col].fillna(df1['name'].map(mapping[col]))
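Assuming name is unique in df2, the loop can also be avoided by aligning both frames on name directly; a sketch:
# set 'name' as the index on both sides so fillna aligns rows
# (including the duplicated 'A') and columns automatically
df1 = df1.set_index('name').fillna(df2.set_index('name')).reset_index()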
Alternatively you can try merge:
df = df1.merge(df2, on='name', how='left', suffixes=['', '_r'])
df.groupby(df.columns.str.rstrip('_r'), axis=1, sort=False).first()
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
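One caveat with the merge approach: str.rstrip('_r') strips any run of trailing '_' and 'r' characters, so a column whose real name ends in 'r' would be mangled. A regex replace that removes only a literal '_r' suffix is safer; a sketch (note that axis=1 groupby is deprecated in recent pandas):
df.groupby(df.columns.str.replace(r'_r$', '', regex=True), axis=1, sort=False).first()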

How to apply function to ALL columns of dataframe GROUPWISELY? (In python pandas)

How do I apply a function to each column of a dataframe "group-wise"?
I.e., group by the values of one column and calculate, e.g., the mean of every other column for each group. The expected output is a dataframe whose index holds the group names and whose values are the per-group, per-column means.
E.g. consider:
df = pd.DataFrame(np.arange(16).reshape(4,4), columns=['A', 'B', 'C', 'D'])
df['group'] = ['a', 'a', 'b','b']
A B C D group
0 0 1 2 3 a
1 4 5 6 7 a
2 8 9 10 11 b
3 12 13 14 15 b
I want to calculate, e.g., np.mean for each column, but group-wise; in this particular example it can be done by:
t = df.groupby('group').agg({'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean })
A B C D
group
a 2 3 4 5
b 10 11 12 13
However, it requires spelling out the column names ('A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean),
which is unacceptable for my task, since they can change.
As MaxU commented, simpler is groupby + GroupBy.mean:
df1 = df.groupby('group').mean()
print (df1)
A B C D
group
a 2 3 4 5
b 10 11 12 13
If you need the group names as a column instead of the index:
df1 = df.groupby('group', as_index=False).mean()
print (df1)
group A B C D
0 a 2 3 4 5
1 b 10 11 12 13
You don't need to name the columns explicitly:
df.groupby('group').agg('mean')
will produce the mean of each column for each group, as requested:
A B C D
group
a 2 3 4 5
b 10 11 12 13
The below does the job:
df.groupby('group').apply(np.mean, axis=0)
giving back
A B C D
group
a 2.0 3.0 4.0 5.0
b 10.0 11.0 12.0 13.0
apply takes axis={0, 1} as an additional argument and passes it through to the applied function; here it specifies whether np.mean is computed down each column (axis=0) or across each row (axis=1).
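If you do want the agg-with-a-dict form (for example, to mix in other functions later), the dict itself can be built dynamically instead of typed out; a sketch:
# map every non-group column to the same aggregation
t = df.groupby('group').agg({c: 'mean' for c in df.columns.drop('group')})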

Pandas: Merge two dataframe columns

Consider two dataframes:
import numpy as np
import pandas as pd

df_a = pd.DataFrame([
    ['a', 1],
    ['b', 2],
    ['c', np.nan],
], columns=['name', 'value'])
df_b = pd.DataFrame([
    ['a', 1],
    ['b', np.nan],
    ['c', 3],
    ['d', 4]
], columns=['name', 'value'])
So looking like
# df_a
name value
0 a 1
1 b 2
2 c NaN
# df_b
name value
0 a 1
1 b NaN
2 c 3
3 d 4
I want to merge these two dataframes and fill in the NaN values of the value column with the existing values from the other dataframe. In other words, this is the output I want:
# DESIRED RESULT
name value
0 a 1
1 b 2
2 c 3
3 d 4
Sure, I can do this with a custom .map or .apply, but I want a solution that uses merge or the like, not writing a custom merge function. How can this be done?
I think you can use combine_first:
print (df_b.combine_first(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Or fillna:
print (df_b.fillna(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
A solution with update is less common than combine_first:
df_b.update(df_a)
print (df_b)
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
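A side effect visible in all three outputs: the NaNs force the value column to float. If integers are needed, one option is to cast after filling, for example to pandas' nullable Int64 dtype; a sketch:
out = df_b.combine_first(df_a)
out['value'] = out['value'].astype('Int64')  # nullable integer dtype, tolerates any remaining NaN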

Pandas dataframes: Combining location and integer indexing

I'd like to change a value in a dataframe by addressing the rows with integer indexing (using iloc) and addressing the columns with label indexing (using loc).
Is there any way to combine these two methods? It would be the same as saying: I want the 320th row of this dataframe and the column titled "columnTitle". Is this possible?
IIUC you can call iloc directly on the column:
In [193]:
df = pd.DataFrame(columns=list('abc'), data = np.random.randn(5,3))
df
Out[193]:
a b c
0 -0.810747 0.898848 -0.374113
1 0.550121 0.934072 -1.117936
2 -2.113217 0.131204 -0.048545
3 1.674282 -0.611887 0.696550
4 -0.076561 0.331289 -0.238261
In [194]:
df['b'].iloc[3] = 0
df
Out[194]:
a b c
0 -0.810747 0.898848 -0.374113
1 0.550121 0.934072 -1.117936
2 -2.113217 0.131204 -0.048545
3 1.674282 0.000000 0.696550
4 -0.076561 0.331289 -0.238261
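One caveat: df['b'].iloc[3] = 0 is chained assignment, which can raise SettingWithCopyWarning and, under copy-on-write in pandas 2.x, may not modify df at all. A single-step indexer avoids this; a sketch:
# resolve the row label positionally, then assign through a single .loc call
df.loc[df.index[3], 'b'] = 0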
Mixed integer- and label-based access is supported by ix.
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
>>> df
A B C
0 -0.473002 0.400249 0.332440
1 -1.291438 0.042443 0.001893
2 0.294902 0.927790 0.999090
3 1.415020 0.428405 -0.291283
4 -0.195136 -0.400629 0.079696
>>> df.ix[[0, 3, 4], ['B', 'C']]
B C
0 0.400249 0.332440
3 0.428405 -0.291283
4 -0.400629 0.079696
df.ix[[0, 3, 4], ['B', 'C']] = 0
>>> df
A B C
0 -0.473002 0.000000 0.000000
1 -1.291438 0.042443 0.001893
2 0.294902 0.927790 0.999090
3 1.415020 0.000000 0.000000
4 -0.195136 0.000000 0.000000
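Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so on modern versions the same mixed access can be written with iloc plus a positional lookup of the column labels; a sketch:
# translate the labels to positions, then index purely positionally
col_pos = df.columns.get_indexer(['B', 'C'])
df.iloc[[0, 3, 4], col_pos] = 0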

Pandas: fillna() method with multiindex - NaNs are filled with wrong columns

Here is a working example that reproduces my problem. First, some random data is generated, along with the data that we will use to fill the NaNs:
#Generate some random data and data that will be used to fill the nans
data = np.random.random((100,6))
fill_data = np.vstack((np.ones(200), np.ones(200)*2, np.ones(200)*3,np.ones(200), np.ones(200)*2, np.ones(200)*3)).T
#Generate indices of nans that we will put in
nan_rows = np.random.randint(0,100,50)
nan_cols = np.random.randint(0,6,50)
nan_idx = np.vstack((nan_rows,nan_cols)).T
#Put in nan values
for r, c in nan_idx:
    data[r, c] = np.nan
#Generate multiindex and datetimeindex for both the data and fill_data
multi = pd.MultiIndex.from_product([['A', 'B'], ['one', 'two', 'three']])
idx1 = pd.date_range(start='1990-01-01', periods=100, freq='D')
idx2 = pd.date_range(start='1989-12-01', periods=200, freq='D')
#Construct dataframes
df1 = pd.DataFrame(data, idx1, multi)
df2 = pd.DataFrame(fill_data, idx2, multi)
#fill nans in df1 with values from df2
df1 = df1.fillna(df2, axis=1)
Here is what the resulting frames look like:
In [167]:
df1.head()
Out[167]:
A B
one two three one two three
1990-01-01 1.000000 0.341803 0.694128 0.382164 0.326956 0.506616
1990-01-02 0.439024 0.552746 0.538489 0.003906 0.968498 0.816289
1990-01-03 0.200252 0.838014 0.805633 0.008980 0.269189 0.016243
1990-01-04 0.735120 0.384871 0.579268 0.561657 0.630314 0.361932
1990-01-05 0.938185 0.335212 0.678310 2.000000 0.819046 0.482535
In [168]:
df2.head()
Out[168]:
A B
one two three one two three
1989-12-01 1 2 3 1 2 3
1989-12-02 1 2 3 1 2 3
1989-12-03 1 2 3 1 2 3
1989-12-04 1 2 3 1 2 3
1989-12-05 1 2 3 1 2 3
So the key here is that the dataframes have different lengths but share common labels: the multiindexed columns are identical, and every timestamp label in df1 also appears in df2.
Here is the result:
In [165]:
df1
Out[165]:
A B
one two three one two three
1990-01-01 1.000000 0.341803 0.694128 0.382164 0.326956 0.506616
1990-01-02 0.439024 0.552746 0.538489 0.003906 0.968498 0.816289
1990-01-03 0.200252 0.838014 0.805633 0.008980 0.269189 0.016243
1990-01-04 0.735120 0.384871 0.579268 0.561657 0.630314 0.361932
1990-01-05 0.938185 0.335212 0.678310 2.000000 0.819046 0.482535
1990-01-06 0.609736 0.164815 0.295003 0.784388 3.000000 3.000000
1990-01-07 1.000000 0.394105 0.430608 0.782029 0.327485 0.855130
1990-01-08 0.573780 0.525845 0.147302 0.091022 3.000000 3.000000
1990-01-09 0.591646 0.651251 0.649255 0.205926 3.000000 0.606428
1990-01-10 0.988085 0.524769 0.481834 0.486241 0.629223 0.575843
1990-01-11 1.000000 0.586813 0.592252 0.309429 0.877121 0.547193
1990-01-12 0.853000 0.097981 0.970053 0.519838 0.828266 0.618965
1990-01-13 0.579778 0.805140 0.050559 0.432795 0.036241 0.081218
1990-01-14 0.055462 1.000000 0.159151 0.538137 3.000000 0.296754
1990-01-15 0.848238 0.697454 0.519403 0.232734 0.612487 0.891230
1990-01-16 0.808238 0.182904 0.480846 0.052806 0.900373 0.860274
1990-01-17 0.890997 0.346767 0.265168 0.486746 0.983999 0.104035
1990-01-18 0.673155 0.248853 0.245246 2.000000 0.965884 0.295021
1990-01-19 0.074864 0.714846 2.000000 0.046031 0.105930 0.641538
1990-01-20 1.000000 0.486893 0.464024 0.499484 0.794107 0.868002
If you look closely you can see that there are values equal to 1 in columns ('A','one') and ('A','two'), values equal to 2 in ('A','three') and ('B','one') and values equal to 3 in ('B','two') and ('B','three').
The expected output would be values of 1 in the 'one' columns, 2 in the 'two' columns, etc.
Am I doing something wrong here? To me this seems like some kind of bug.
This issue has been fixed in newer versions of pandas. Using version 0.15.0 or later, you will be able to do this:
import pandas as pd
import numpy as np
from numpy import nan
df = pd.DataFrame({'a': [nan, 1, 2, nan, nan],
                   'b': [1, 2, 3, nan, nan],
                   'c': [nan, 1, 2, 3, 4]},
                  index=list('VWXYZ'))
# a b c
# V NaN 1 NaN
# W 1 2 1
# X 2 3 2
# Y NaN NaN 3
# Z NaN NaN 4
# df2 may have different index and columns
df2 = pd.DataFrame({'a': [10, 20, 30, 40, 50],
                    'b': [50, 60, 70, 80, 90],
                    'c': list('ABCDE')},
                   index=list('VWXYZ'))
# a b c
# V 10 50 A
# W 20 60 B
# X 30 70 C
# Y 40 80 D
# Z 50 90 E
Now, passing a DataFrame to fillna
result = df.fillna(df2)
yields
print(result)
# a b c
# V 10 1 A
# W 1 2 1
# X 2 3 2
# Y 40 80 3
# Z 50 90 4
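For the original multiindexed example, where df2 covers more dates than df1, it can also help to make the row alignment explicit rather than relying on fillna's broadcasting; a sketch:
# restrict df2 to df1's dates first, then fill element-wise (no axis argument needed)
df1 = df1.fillna(df2.reindex(df1.index))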
