Mapping a MultiIndex to existing pandas DataFrame columns using a separate DataFrame - Python

I have an existing data frame in the following format (let's call it df):
   A  B  C  D
0  1  2  1  4
1  3  0  2  2
2  1  5  3  1
The column names were extracted from a spreadsheet that has the following form (let's call it cat_df):
                 current category
broader category
X                A
Y                B
Y                C
Z                D
First I'd like to prepend a higher level index to make df look like so:
   X  Y     Z
   A  B  C  D
0  1  2  1  4
1  3  0  2  2
2  1  5  3  1
Lastly i'd like to 'roll-up' the data into the meta-index by summing over subindices, to generate a new dataframe like so:
   X  Y  Z
0  1  3  4
1  3  2  2
2  1  8  1
Using concat from this answer has gotten me close, but it seems like it'd be a very manual process, picking out each subset by hand. My true dataset has a more complex mapping, so I'd like to refer to cat_df directly as I build my meta-index. I think once I get the meta-index settled, a simple groupby should get me to the summation, but I'm still stuck on the first step.

# map each current category to its broader one, then build the MultiIndex
d = dict(zip(cat_df['current category'], cat_df.index))
cols = pd.MultiIndex.from_arrays([df.columns.map(d.get), df.columns])
df_new = df.set_axis(cols, axis=1, inplace=False)
print(df_new)
   X  Y     Z
   A  B  C  D
0  1  2  1  4
1  3  0  2  2
2  1  5  3  1
df_new.groupby(axis=1, level=0).sum()
   X  Y  Z
0  1  3  4
1  3  2  2
2  1  8  1
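A version note: the inplace argument of set_axis was removed in pandas 2.0, and groupby(axis=1) is deprecated since 2.1. A sketch of the same idea for newer pandas (assuming pandas >= 2.0):
# set_axis now always returns a new DataFrame
d = dict(zip(cat_df['current category'], cat_df.index))
cols = pd.MultiIndex.from_arrays([df.columns.map(d.get), df.columns])
df_new = df.set_axis(cols, axis=1)
# transpose, group on the outer level, transpose back
rolled = df_new.T.groupby(level=0).sum().T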

IIUC, you can do it like this.
df.columns = pd.MultiIndex.from_tuples(
    cat_df.reset_index()[['broader category', 'current category']]
          .apply(tuple, axis=1).tolist())
print(df)
Output:
   X  Y     Z
   A  B  C  D
0  1  2  1  4
1  3  0  2  2
2  1  5  3  1
Sum over the outer level:
df.sum(level=0, axis=1)
Output:
   X  Y  Z
0  1  3  4
1  3  2  2
2  1  8  1
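One caveat: this assumes cat_df's rows appear in exactly the same order as df's columns. If that ordering isn't guaranteed, a sketch that aligns the mapping to df.columns first:
# index the mapping by current category, then look up df's columns in order
lookup = cat_df.reset_index().set_index('current category')
df.columns = pd.MultiIndex.from_arrays(
    [lookup.loc[df.columns, 'broader category'], df.columns])
Also note that df.sum(level=..., axis=1) was removed in pandas 2.0; the groupby roll-up shown earlier is the forward-compatible spelling.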

You can use set_index to create the idx, then assign it to your df. Since cat_df's index is already the broader category, appending the 'current category' column gives the MultiIndex directly:
idx = cat_df.set_index('current category', append=True).index
df.columns = idx
df
Out[1170]:
broader category  X  Y     Z
current category  A  B  C  D
0                 1  2  1  4
1                 3  0  2  2
2                 1  5  3  1
df.sum(axis=1, level=0)
Out[1171]:
broader category  X  Y  Z
0                 1  3  4
1                 3  2  2
2                 1  8  1
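The level names from cat_df carry over into the column index, which is why they appear in the output above. If you'd rather drop them, a small sketch:
# strip the level names from the column MultiIndex
df.columns = df.columns.rename([None, None])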

Related

Set value when row is maximum in group by - Python Pandas

I am trying to create a column (is_max) that is 1 if column B is the maximum within its group of column A values, and 0 if it is not.
Example:
[Input]
   A  B
0  1  2
1  2  3
2  1  4
3  2  5
[Output]
   A  B  is_max
0  1  2       0
1  2  3       0
2  1  4       1
3  2  5       1
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
   A  B  is_max
0  1  2       0
1  2  3       0
2  1  4       1
3  2  5       1
I'm assuming A is your grouping column, since you did not state it:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A')['B'].apply(lambda x: x==x.max()).astype(int)
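These two approaches differ on ties: idxmax flags only the first row that attains the group maximum, while the transform comparison flags every tied row. A quick sketch with a hypothetical tied group:
tied = pd.DataFrame({'A': [1, 1], 'B': [4, 4]})
tied['first_only'] = 0
tied.loc[tied.groupby('A')['B'].idxmax(), 'first_only'] = 1   # flags row 0 only
# flags both tied rows
tied['all_ties'] = (tied['B'] == tied.groupby('A')['B'].transform('max')).astype(int)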

Use groupby and merge to create new column in pandas

So I have a pandas dataframe that looks something like this.
  name  is_something
0    a             0
1    b             1
2    c             0
3    c             1
4    a             1
5    b             0
6    a             1
7    c             0
8    a             1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
  name  is_something  no_of_times_is_something_is_1
0    a             0                              3
1    b             1                              1
2    c             0                              1
3    c             1                              1
4    a             1                              3
5    b             0                              1
6    a             1                              3
7    c             0                              1
8    a             1                              3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If the is_something column contains only 0 and 1 values, just use sum with GroupBy.transform to fill the new column with aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
  name  is_something  new
0    a             0    3
1    b             1    1
2    c             0    1
3    c             1    1
4    a             1    3
5    b             0    1
6    a             1    3
7    c             0    1
8    a             1    3
If other values are possible, first compare with 1, convert to integer and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype('i1').groupby(df['name']).transform('sum')
Or just map it:
df['New'] = df.name.map(df.query('is_something == 1').groupby('name')['is_something'].sum())
df
  name  is_something  New
0    a             0    3
1    b             1    1
2    c             0    1
3    c             1    1
4    a             1    3
5    b             0    1
6    a             1    3
7    c             0    1
8    a             1    3
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
  name  is_something  new
0    a             0    3
1    b             1    1
2    c             0    1
3    c             1    1
4    a             1    3
5    b             0    1
6    a             1    3
7    c             0    1
8    a             1    3
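Since the question asks specifically about groupby plus merge, here is a sketch of that route (using only the columns shown above):
# count rows with is_something == 1 per name, then merge the counts back
counts = (df[df['is_something'].eq(1)]
          .groupby('name').size()
          .rename('no_of_times_is_something_is_1')
          .reset_index())
df = df.merge(counts, on='name', how='left')
# names that never have a 1 come back as NaN after the left merge
df['no_of_times_is_something_is_1'] = (
    df['no_of_times_is_something_is_1'].fillna(0).astype(int))
The transform answers above avoid the merge entirely and are usually faster for this shape of problem.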

Python pandas cumsum with reset every time there is a 0

I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
   a  b
0  0  1
1  1  1
2  0  1
3  1  0
4  1  1
5  0  1
The result I desire is:
print(df)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
   a  b
0  0  1
1  1  2
2  0  3
3  2  0
4  3  4
5  0  5
You can use:
a = df != 0
df1 = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
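To see why this works, it helps to name the intermediates (a sketch of the same computation):
a = df != 0                                     # True where counting continues
running = a.cumsum()                            # running count, ignoring resets
baseline = running.where(~a).ffill().fillna(0)  # running count frozen at each zero
df1 = (running - baseline).astype(int)          # distance since the last zero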
Try this
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
df['groupId1']=df.a.eq(0).cumsum()
df['groupId2']=df.b.eq(0).cumsum()
New=pd.DataFrame()
New['a']=df.groupby('groupId1').a.transform('cumsum')
New['b']=df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
You may also try the following naive but reliable approach.
For each column, create groups to count within. A new group starts whenever the value changes from one row to the next, and lasts while the value stays constant: (x != x.shift()).cumsum().
Example of the resulting group ids:
   a  b
0  1  1
1  2  1
2  3  1
3  4  2
4  4  3
5  5  3
Calculate cumulative sums within these groups per column using pd.DataFrame's apply and groupby methods, and you get the cumsum with zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
z = np.where(df['b'] == 0)[0]
df.loc[z, 'b'] = -z  # .loc avoids chained-assignment warnings
df['b'] = np.cumsum(df['b'])
df
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  1  1
5  0  2
Beware, though: this cancels the running total only when the cumsum just before each zero happens to equal that zero's index. That holds for column b here, but not in general (column a breaks it).
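A corrected sketch of the same index-arithmetic idea that works for any pattern of zeros: subtract the running total as it stood at the most recent zero, carried forward with numpy's accumulated maximum:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
for col in df.columns:
    v = df[col].to_numpy()
    cs = v.cumsum()
    # running total as it stood at the most recent zero, carried forward
    df[col] = cs - np.maximum.accumulate(np.where(v == 0, cs, 0))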

Pandas: replace column values with the group mean

I have a pandas dataframe and want to replace each Y value with the mean for its group.
ID  X  Y
 1  a  1
 2  a  2
 3  a  3
 4  b  2
 5  b  4
How do I replace Y values with mean Y for every unique X?
ID  X  Y
 1  a  2
 2  a  2
 3  a  2
 4  b  3
 5  b  3
Use transform:
df['Y'] = df.groupby('X')['Y'].transform('mean')
print (df)
   ID  X  Y
0   1  a  2
1   2  a  2
2   3  a  2
3   4  b  3
4   5  b  3
For a new column in another DataFrame, use map with drop_duplicates (note this relies on df['Y'] already holding the group means from the step above):
df1 = pd.DataFrame({'X':['a','a','b']})
print (df1)
   X
0  a
1  a
2  b
df1['Y'] = df1['X'].map(df.drop_duplicates('X').set_index('X')['Y'])
print (df1)
   X  Y
0  a  2
1  a  2
2  b  3
Another solution:
df1['Y'] = df1['X'].map(df.groupby('X')['Y'].mean())
print (df1)
   X  Y
0  a  2
1  a  2
2  b  3
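One edge case worth flagging: groupby drops NaN keys by default, so rows with a missing X would come back NaN. If such rows should form their own group, pandas >= 1.1 supports dropna=False (a sketch):
# keep rows with missing X together as their own group
df['Y'] = df.groupby('X', dropna=False)['Y'].transform('mean')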

Pandas: Sort the column on frequency by another column having same value grouped

I have a dataframe that is grouped by the y column and sorted on the count column (the frequency of each y value).
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort('count', ascending=False)
Output:
x  y  count
1  a      4
3  a      4
2  a      4
1  a      4
2  c      3
1  c      3
2  c      3
2  b      2
1  b      2
Now, I want to sort the x column by its frequency within each group of the same y value, like below:
Expected Output:
x  y  count
1  a      4
1  a      4
2  a      4
3  a      4
2  c      3
2  c      3
1  c      3
2  b      2
1  b      2
It seems you need groupby and value_counts, and then numpy.repeat to expand the index values by their counts into a DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y  x
a  1    2
   2    1
   3    1
c  2    2
   1    1
b  1    1
   2    1
Name: x, dtype: int64
df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
# change order of columns (reindex_axis was removed in newer pandas)
df1 = df1[['x', 'y']]
print (df1)
   x  y
0  1  a
1  1  a
2  2  a
3  3  a
4  2  c
5  2  c
6  1  c
7  1  b
8  2  b
If you are using an older pandas version where df.sort_values is not supported, you can use:
df.sort(columns=['count','x'], ascending=[False,True])
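On current pandas, a sketch of the same idea in one sort: precompute each x's frequency within its y group and sort on it (the order of equally frequent x values may differ from the example):
# frequency of each (y, x) pair, aligned to the original rows
df['x_count'] = df.groupby(['y', 'x'])['x'].transform('size')
out = (df.sort_values(['count', 'y', 'x_count', 'x'],
                      ascending=[False, True, False, True])
         .drop(columns='x_count'))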
