Idiomatic multiindex column assignment in pandas - python

I have a dataframe with a 2-level Multiindex:
ix = pd.MultiIndex.from_tuples(list(enumerate(np.random.choice(['A', 'B'], 5))))
df = pd.DataFrame({'Val': np.random.randint(0, 30, 5)}, index=ix).unstack().fillna(0)
df
   Val
     A   B
0   27   0
1    0   3
2    0   7
3    9   0
4    0  19
I would like to add a column for each existing sublevel ('A' and 'B') that is equal to half of the Val column. My intuition was to do
df['Half_val'] = df.Val / 2
which gives a ValueError: Wrong number of items passed 2, placement implies 1 exception.
I can manually do
res = df.Val / 2
df.loc[:, ('Half_val', 'A')] = res.A
df.loc[:, ('Half_val', 'B')] = res.B
which gives what I'm after:
>>> df
   Val      Half_val
     A   B         A    B
0   27   0      13.5  0.0
1    0   3       0.0  1.5
2    0   7       0.0  3.5
3    9   0       4.5  0.0
4    0  19       0.0  9.5
Is there a less verbose, more idiomatic way to make a multiindex column assignment like this (particularly one where I don't have to explicitly specify each sublevel on the left side)?
Edit:
I forgot to mention that trying
res = df.Val / 2
df.loc[:, res.columns] = res
gives a KeyError: "['A' 'B'] not in index" exception.
Edit 2
It would be nice if the solution allowed pseudo-mixed level columns in the dataframe. In my example, I can do
In [5]: df['C'] = 'a'
In [6]: df
Out[6]:
   Val      C
     A   B
0    4   0  a
1    0  10  a
2    0   4  a
3   21   0  a
4    0  14  a
which adds a column with a single level. But since the existing columns have 2 levels, the new column appears to get an implicit second level of an empty string:
In [9]: list(df)
Out[9]: [('Val', 'A'), ('Val', 'B'), ('C', '')]
When I try a solution offered below, the single-level C column seems to break it:
In [7]: pd.concat([df,df['Val']/2],axis=1,keys=['Val', 'C', 'Half'])
==> AssertionError: Cannot concat indices that do not have the same number of levels
Is there some trick for the keys parameter to pass, or do I need to give C a different dummy value for the second level (since it looks like "" doesn't count) and then remove it after the concatenation?
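One workaround I can sketch myself, though I'm not sure it's the idiomatic answer: build the two-level header on the new columns up front and concat, so the ('C', '') column passes through untouched:
# hedged sketch: promote the new columns to two levels before concatenating,
# leaving the single-level ('C', '') column as-is
half = df['Val'] / 2
half.columns = pd.MultiIndex.from_product([['Half'], half.columns])
df = pd.concat([df, half], axis=1)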

You can iterate over the level values and do a direct assignment (one value at a time)
In [55]: df.columns.get_level_values(1)
Out[55]: Index([u'A', u'B'], dtype='object')
In [51]: df[('Half','A')] = df[('Val','A')]/2
In [52]: df[('Half','B')] = df[('Val','B')]/2
In [53]: df
Out[53]:
   Val      Half
     A   B     A     B
0    0  12   0.0   6.0
1    0   5   0.0   2.5
2    0  26   0.0  13.0
3    3   0   1.5   0.0
4   25   0  12.5   0.0
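For more than a couple of sublevels, the same idea as a loop (a sketch; it assumes every existing column sits under 'Val', as in the frame above):
# assign one ('Half', sub) column per sublevel of 'Val'
for sub in df['Val'].columns:
    df[('Half', sub)] = df[('Val', sub)] / 2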
You can do this as well:
In [59]: pd.concat([df['Val'], df['Val']/2], axis=1, keys=['Val', 'Half'])
Out[59]:
   Val      Half
     A   B     A    B
0    0  10   0.0  5.0
1    0  10   0.0  5.0
2    0  13   0.0  6.5
3   27   0  13.5  0.0
4    2   0   1.0  0.0
Here's an issue to track this bug/enhancement: https://github.com/pydata/pandas/issues/7475

I think this option is preferable to the concat option because you don't have to risk incorrectly re-labeling the 'Val' column. Please correct me if you disagree!
Given your input dataframe:
In [3]: df
Out[3]:
   Val
     A   B
0   26   0
1   10   0
2   18   0
3    0  18
4    2   0
A third option worth considering is:
In [4]: df[pd.MultiIndex.from_product([['Half']] + df.columns.levels[1:])] = df['Val'] / 2
In [5]: df
Out[5]:
   Val      Half
     A   B     A  B
0   26   0    13  0
1   10   0     5  0
2   18   0     9  0
3    0  18     0  9
4    2   0     1  0
This approach also just works with an arbitrarily nested MultiIndex. (I don't know if it's possible to do this assignment with sub-columns of a MultiIndex.)
In [1]: df = pd.DataFrame({'Val': np.random.randint(5, 30, 12)}, index=pd.MultiIndex.from_product([['A', 'B','C'], ['a', 'b'], [0, 1]])).unstack().unstack()
In [2]: df
Out[2]:
  Val
    0       1
    a   b   a   b
A   6  10  11   7
B  16   8  23  15
C  29  17  11  18
In [3]: df[pd.MultiIndex.from_product([['Half']] + df.columns.levels[1:])] = df['Val'] / 2
In [4]: df
Out[4]:
  Val              Half
    0       1         0          1
    a   b   a   b     a    b     a    b
A   6  10  11   7   3.0  5.0   5.5  3.5
B  16   8  23  15   8.0  4.0  11.5  7.5
C  29  17  11  18  14.5  8.5   5.5  9.0
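One hedged caveat on the from_product line above: MultiIndex.levels can retain labels that no longer occur in the columns (for example after slicing), so on a filtered frame it may be safer to build the header from the realized column values. A sketch for the two-level case:
# source the sublevels from the actual column tuples instead of .levels
sub = df['Val'].columns  # Index(['A', 'B'], ...) in the first example
df[pd.MultiIndex.from_product([['Half'], sub])] = df['Val'] / 2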

Related

How to create a new column based on multiple conditions in another column

In pandas, how can I create a new column B based on a column A in df, such that:
B(i) = 1 if A(i-1) - A(i) >= 0 when A(i) <= 10
B(i) = 1 if A(i-1) - A(i) >= 2 when 10 < A(i) <= 20
B(i) = 1 if A(i-1) - A(i) >= 5 when 20 < A(i)
B(i) = 0 in any other case
However, the first B(i) value is always two.
Example:
 A   B
 5   2   (the first B_i)
12   0
14   0
22   0
20   1
33   0
11   1
 8   1
15   0
11   1
You can use pandas' Series.shift to create A(i-1) and numpy.select to check multiple conditions, like below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33, 11, 8, 15, 11]})
df['A_prv'] = df['A'].shift(1)
conditions = [
    (df.index == 0),
    ((df['A_prv'] - df['A'] >= 0) & (df['A'].le(10))),
    ((df['A_prv'] - df['A'] >= 2) & (df['A'].between(10, 20, inclusive='right'))),
    # ^^^ 10 < df['A'] <= 20 ^^^
    ((df['A_prv'] - df['A'] >= 5) & (df['A'].gt(20)))  # 20 < df['A'], strictly
]
choices = [2, 1, 1, 1]
df['B'] = np.select(conditions, choices, default=0)
print(df)
Output:
    A  A_prv  B
0   5    NaN  2
1  12    5.0  0
2  14   12.0  0
3  22   14.0  0
4  20   22.0  1
5  33   20.0  0
6  11   33.0  1
7   8   11.0  1
8  15    8.0  0
9  11   15.0  1
The most intuitive way is to iterate through the rows, testing the three conditions in a single-line if-else (as B(i) is 1 whenever any condition is true).
import pandas as pd

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33, 11, 8, 15, 11]})
B = [2]
for i in range(1, len(df['A'])):
    newvalue = 1 if ((df['A'][i-1] - df['A'][i] >= 0 and df['A'][i] <= 10)
                     or (df['A'][i-1] - df['A'][i] >= 2 and 10 < df['A'][i] <= 20)
                     or (df['A'][i-1] - df['A'][i] >= 5 and df['A'][i] > 20)) else 0
    B.append(newvalue)
df['B'] = B
print(df)
Output:
    A  B
0   5  2
1  12  0
2  14  0
3  22  0
4  20  1
5  33  0
6  11  1
7   8  1
8  15  0
9  11  1

Slicing each dataframe row into 3 windows with different slicing ranges

I want to slice each row of my dataframe into 3 windows, with slice indices that are stored in another dataframe and change for each row. Afterwards I want to return a single dataframe containing the windows in the form of a MultiIndex. Rows in each window that are shorter than the longest row in that window should be padded with NaN values.
Since my actual dataframe has around 100,000 rows and 600 columns, I am concerned about finding an efficient solution.
Consider the following example:
This is my dataframe which i want to slice into 3 windows
>>> df
    0   1   2   3   4   5   6   7
0   0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
2  16  17  18  19  20  21  22  23
And the second dataframe containing my slicing indices having the same count of rows as df:
>>> df_slice
   0  1
0  3  5
1  2  6
2  4  7
I've tried slicing the windows, like so:
first_window = df.iloc[:, :df_slice.iloc[:, 0]]
first_window.columns = pd.MultiIndex.from_tuples([("A", c) for c in first_window.columns])
second_window = df.iloc[:, df_slice.iloc[:, 0] : df_slice.iloc[:, 1]]
second_window.columns = pd.MultiIndex.from_tuples([("B", c) for c in second_window.columns])
third_window = df.iloc[:, df_slice.iloc[:, 1]:]
third_window.columns = pd.MultiIndex.from_tuples([("C", c) for c in third_window.columns])
result = pd.concat([first_window,
                    second_window,
                    third_window], axis=1)
Which gives me the following error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [0 3
1 2
2 4
Name: 0, dtype: int64] of <class 'pandas.core.series.Series'>
My expected output is something like this:
>>> result
    A                  B                  C
    0   1    2    3    4   5    6    7    8    9   10
0   0   1    2  NaN    3   4  NaN  NaN    5    6    7
1   8   9  NaN  NaN   10  11   12   13   14   15  NaN
2  16  17   18   19   20  21   22  NaN   23  NaN  NaN
Is there an efficient solution for my problem without iterating over each row of my dataframe?
Here's a solution using melt and then pivot_table, plus some logic to:
Identify the three groups 'A', 'B', and 'C'.
Shift the columns to the left, so that NaN would only appear at the right side of each window.
Rename columns to get the expected output.
t = df.reset_index().melt(id_vars="index")
t = pd.merge(t, df_slice, left_on="index", right_index=True)
# name the merged slice-bound columns (df_slice's integer labels 0 and 1)
# so they can be referenced as t.c_0 and t.c_1 below
t = t.rename(columns={0: "c_0", 1: "c_1"})
t.variable = pd.to_numeric(t.variable)
t.loc[t.variable < t.c_0, "group"] = "A"
t.loc[(t.variable >= t.c_0) & (t.variable < t.c_1), "group"] = "B"
t.loc[t.variable >= t.c_1, "group"] = "C"
# shift relevant values to the left
shift_val = t.groupby(["group", "index"]).variable.transform("min") - t.groupby(["group"]).variable.transform("min")
t.variable = t.variable - shift_val
# extract a, b, and c groups, and create a multi-level index for their
# columns
df_a = pd.pivot_table(t[t.group == "A"], index= "index", columns="variable", values="value")
df_a.columns = pd.MultiIndex.from_product([["a"], df_a.columns])
df_b = pd.pivot_table(t[t.group == "B"], index= "index", columns="variable", values="value")
df_b.columns = pd.MultiIndex.from_product([["b"], df_b.columns])
df_c = pd.pivot_table(t[t.group == "C"], index= "index", columns="variable", values="value")
df_c.columns = pd.MultiIndex.from_product([["c"], df_c.columns])
res = pd.concat([df_a, df_b, df_c], axis=1)
res.columns = pd.MultiIndex.from_tuples([(c[0], i) for i, c in enumerate(res.columns)])
print(res)
The output is:
          a                        b                        c
          0     1     2     3     4     5     6     7     8     9    10
index
0       0.0   1.0   2.0   NaN   3.0   4.0   NaN   NaN   5.0   6.0   7.0
1       8.0   9.0   NaN   NaN  10.0  11.0  12.0  13.0  14.0  15.0   NaN
2      16.0  17.0  18.0  19.0  20.0  21.0  22.0   NaN  23.0   NaN   NaN
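Since the question worries about scale (melting 100,000 x 600 values materializes a 60-million-row frame), here is a hedged NumPy sketch of the same left-packing idea that avoids melt entirely. It assumes the toy inputs from the question (integer positions in df_slice, a RangeIndex on df); the helper name pack is mine, not from the question:
import numpy as np
import pandas as pd

# toy inputs from the question
df = pd.DataFrame(np.arange(24).reshape(3, 8))
df_slice = pd.DataFrame([[3, 5], [2, 6], [4, 7]])

vals = df.to_numpy(dtype=float)
cols = np.arange(vals.shape[1])
lo = df_slice[0].to_numpy()
hi = df_slice[1].to_numpy()

def pack(start, stop):
    # left-pack each row's [start, stop) slice into a NaN-padded block
    mask = (cols >= start[:, None]) & (cols < stop[:, None])
    out = np.full((vals.shape[0], mask.sum(axis=1).max()), np.nan)
    r, c = np.nonzero(mask)
    out[r, mask.cumsum(axis=1)[r, c] - 1] = vals[r, c]
    return out

blocks = [pack(np.zeros(len(df), dtype=int), lo),
          pack(lo, hi),
          pack(hi, np.full(len(df), vals.shape[1]))]
result = pd.DataFrame(np.hstack(blocks))
result.columns = pd.MultiIndex.from_arrays(
    [np.repeat(list("ABC"), [b.shape[1] for b in blocks]), result.columns])
print(result)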

How to sum columns to create a third one on specific rows?

I have a dataframe:
A B C V
1 4 7 T
2 6 8 T
3 9 9 F
and I want to create a new column, summing the rows where V is 'T'
So I want
A B C V D
1 4 7 T 12
2 6 8 T 16
3 9 9 F
Is there any way to do this without iteration?
Mask the values before summing:
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
# Or,
df.select_dtypes(np.number).mask(df['V'] != 'T').sum(axis=1, skipna=False)
0    12.0
1    16.0
2     NaN
dtype: float64
df['D'] = df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
df
   A  B  C  V     D
0  1  4  7  T  12.0
1  2  6  8  T  16.0
2  3  9  9  F   NaN
If you actually wanted blanks, use
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T', '')
0    24
1    32
2
dtype: object
which returns an object column (not recommended). (The totals here are 24 and 32 because the numeric 'D' column assigned above is now picked up by select_dtypes as well; drop it first to get the original sums.)
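A way to sidestep that pitfall is to name the summed columns explicitly instead of relying on select_dtypes; a small sketch on the same frame:
# sum only A, B, C, then blank out rows where V != 'T' (NaN fallback)
df['D'] = df[['A', 'B', 'C']].sum(axis=1).mask(df['V'] != 'T')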
Alternatively, using np.where:
np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
# array([12., 16., nan])
df['D'] = np.where(
    df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
df
   A  B  C  V     D
0  1  4  7  T  12.0
1  2  6  8  T  16.0
2  3  9  9  F   NaN
Use numpy.where:
import numpy as np
df['D'] = np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), None)
(Passing None as the fallback makes 'D' an object column; np.nan would keep it float.)
df['D'] = df[['A', 'B', 'C']][df['V'] == 'T'].sum(axis=1)
In [51]: df
Out[51]:
   A  B  C  V       D
0  1  4  7  T  12.000
1  2  6  8  T  16.000
2  3  9  9  F     nan

Pandas Create New Column Based on Value in Another Column, If False Return Previous Value of New Column

This is a Python pandas problem I've been struggling with for a while now. Let's say I have a simple dataframe df where df['a'] = [1,2,3,1,4,6] and df['b'] = [10,20,30,40,50,60]. I would like to create a third column 'c', where if the value of df['a'] == 1, then df['c'] = df['b']; if this is false, df['c'] = the previous value of df['c']. I have tried using np.where to make this happen, but the result is not what I was expecting. Any advice?
df = pd.DataFrame()
df['a'] = [1,2,3,1,4,6]
df['b'] = [10,20,30,40,50,60]
df['c'] = np.nan
df['c'] = np.where(df['a'] == 1, df['b'], df['c'].shift(1))
The result is:
   a   b     c
0  1  10  10.0
1  2  20   NaN
2  3  30   NaN
3  1  40  40.0
4  4  50   NaN
5  6  60   NaN
Whereas I would have expected:
   a   b     c
0  1  10  10.0
1  2  20  10.0
2  3  30  10.0
3  1  40  40.0
4  4  50  40.0
5  6  60  40.0
Try this (your np.where call evaluated df['c'].shift(1) on the still all-NaN column, so it left df['b'] in the matching rows and NaN everywhere else; forward-filling then carries each matched value down):
df.c.ffill(inplace=True)
Output:
   a   b     c
0  1  10  10.0
1  2  20  10.0
2  3  30  10.0
3  1  40  40.0
4  4  50  40.0
5  6  60  40.0
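The same result in one pass, as a hedged sketch using Series.where instead of np.where (no pre-created 'c' column needed):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 1, 4, 6],
                   'b': [10, 20, 30, 40, 50, 60]})
# keep b where a == 1, NaN elsewhere, then carry the last seen value forward
df['c'] = df['b'].where(df['a'] == 1).ffill()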

Use Pandas dataframe to add lag feature from MultiIindex Series

I have a MultiIndex Series (3 indices) that looks like this:
Week  ID_1  ID_2
3     26    1182     39.0
            4767     42.0
            31393    20.0
            31690    42.0
            32962     3.0
...
I also have a dataframe df which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe df that contains the value matching the ID_1 and ID_2 and the Week - 2 from the Series.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This however is very slow and memory hungry and I was wondering what are some better ways to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2 and rename the median column (so the merge below doesn't collide with df's existing Target column):
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target': 'Median'})
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
To make the merged Median column 0 when there is no match, use fillna to change NaNs to 0:
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20, 5)),
                  columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
    Week  ID_1  ID_2  Target  Foo  Median
0      3     2     3       4    2     0.0
1      3     3     0       3    4     0.0
2      4     3     0       1    2     0.0
3      3     4     1       1    1     0.0
4      2     4     2       0    3     2.0
5      1     0     1       4    4     0.0
6      2     3     4       0    0     0.0
7      4     0     0       2    3     0.0
8      3     4     3       2    2     0.0
9      2     2     4       0    1     0.0
10     2     0     4       4    2     0.0
11     1     1     3       0    0     0.0
12     0     1     0       2    0     0.0
13     4     0     4       0    3     4.0
14     1     2     1       3    1     0.0
15     3     0     1       3    4     2.0
16     0     4     2       2    4     0.0
17     1     1     4       4    2     0.0
18     4     1     0       3    0     0.0
19     1     0     1       0    0     0.0
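An alternative sketch that keeps the original MultiIndex Series and does the lookup with reindex instead of a merge; the 'Lag' column name is mine, and it assumes the index order (Week, ID_1, ID_2) from the question:
# one (Week-2, ID_1, ID_2) key per dataframe row, looked up in a single pass;
# keys absent from the Series come back as NaN and are filled with 0
keys = pd.MultiIndex.from_arrays([df['Week'] - 2, df['ID_1'], df['ID_2']])
medians = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
df['Lag'] = medians.reindex(keys).fillna(0).to_numpy()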
