pandas replace column with mean for values - python

I have a pandas dataframe and want to replace each value with the mean of its group:
ID X Y
1 a 1
2 a 2
3 a 3
4 b 2
5 b 4
How do I replace Y values with mean Y for every unique X?
ID X Y
1 a 2
2 a 2
3 a 2
4 b 3
5 b 3

Use transform:
df['Y'] = df.groupby('X')['Y'].transform('mean')
print (df)
ID X Y
0 1 a 2
1 2 a 2
2 3 a 2
3 4 b 3
4 5 b 3
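For reference, a self-contained version of the transform approach (a minimal sketch reconstructing the sample data from the question):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'X': ['a', 'a', 'a', 'b', 'b'],
                   'Y': [1, 2, 3, 2, 4]})

# transform('mean') returns a Series aligned to df's original index,
# so it can be assigned straight back as a column
df['Y'] = df.groupby('X')['Y'].transform('mean')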
For a new column in another DataFrame, use map with drop_duplicates (this uses df from above, where Y already holds the group means):
df1 = pd.DataFrame({'X':['a','a','b']})
print (df1)
X
0 a
1 a
2 b
df1['Y'] = df1['X'].map(df.drop_duplicates('X').set_index('X')['Y'])
print (df1)
X Y
0 a 2
1 a 2
2 b 3
Another solution:
df1['Y'] = df1['X'].map(df.groupby('X')['Y'].mean())
print (df1)
X Y
0 a 2
1 a 2
2 b 3
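A left merge gives the same result if you prefer to keep everything as DataFrames (a sketch; means is a helper name introduced here, not from the original answer):
# one row per unique X, with its mean Y
means = df.groupby('X', as_index=False)['Y'].mean()
df1 = df1.merge(means, on='X', how='left')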

Related

Pandas count number of equal values per row in dataframe

I work with this df:
A B C D
1 1 2 3
2 1 3 4
3 3 3 3
I want to add a column E that holds the number of duplicated values across columns A-D in each row.
Expected output:
A B C D E
1 1 2 3 2
2 1 3 4 0
3 3 3 3 4
Can anyone point me in the right direction?
Thanks!
Use a custom lambda function with Series.duplicated with keep=False to mark all duplicates, then count the True values with sum:
df['E'] = df.apply(lambda x: x.duplicated(keep=False).sum(), axis=1)
print (df)
A B C D E
0 1 1 2 3 2
1 2 1 3 4 0
2 3 3 3 3 4
If you need to specify the column names:
df['E'] = df.loc[:, 'A':'D'].apply(lambda x: x.duplicated(keep=False).sum(), axis=1)
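Row-wise apply can be slow on large frames; assuming the compared columns share a single dtype, a vectorized sketch with numpy broadcasting gives the same counts:
import numpy as np

vals = df.loc[:, 'A':'D'].to_numpy()
# compare every value in a row against every other value in that row
eq = vals[:, :, None] == vals[:, None, :]
# a value counts as duplicated if it matches at least one value besides itself
df['E'] = (eq.sum(axis=2) > 1).sum(axis=1)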

ApplyMap function on Multiple columns pandas

I have this dataframe
dd = pd.DataFrame({'a':[1,5,3],'b':[3,2,3],'c':[2,4,5]})
a b c
0 1 3 2
1 5 2 4
2 3 3 5
I want to replace the numbers in columns a and b that are smaller than the number in column c, row-wise.
I did this
dd.applymap(lambda x: 0 if x < x['c'] else x )
I get this error:
TypeError: 'int' object is not subscriptable
I understand x is an int, but how do I get the value of column c for that row?
I want this output
a b c
0 0 3 2
1 5 0 4
2 0 0 5
Use DataFrame.mask with DataFrame.lt:
df = dd.mask(dd.lt(dd['c'], axis=0), 0)
print (df)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
Or you can set the values by comparing against column c, broadcast across rows:
dd[dd < dd['c'].to_numpy()[:, None]] = 0
print (dd)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
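If you want to make explicit that only columns a and b are candidates for replacement (c can never be strictly less than itself, but this documents the intent), a small variation of the mask approach:
cols = ['a', 'b']
dd[cols] = dd[cols].mask(dd[cols].lt(dd['c'], axis=0), 0)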

mapping a multi-index to existing pandas dataframe columns using separate dataframe

I have an existing data frame in the following format (let's call it df):
A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
The column names were extracted from a spreadsheet that has the following form (let's call it cat_df):
                 current category
broader category
X                A
Y                B
Y                C
Z                D
First I'd like to prepend a higher level index to make df look like so:
X Y Z
A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
Lastly I'd like to 'roll up' the data into the meta-index by summing over sub-indices, to generate a new dataframe like so:
X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
Using concat from this answer has gotten me close, but it seems like it'd be a very manual process picking out each subset. My true dataset has a more complex mapping, so I'd like to refer to it directly as I build my meta-index. I think once I get the meta-index settled, a simple groupby should get me to the summation, but I'm still stuck on the first step.
d = dict(zip(cat_df['current category'], cat_df.index))
cols = pd.MultiIndex.from_arrays([df.columns.map(d.get), df.columns])
df.set_axis(cols, axis=1, inplace=False)
X Y Z
A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
df_new = df.set_axis(cols, axis=1, inplace=False)
df_new.groupby(axis=1, level=0).sum()
X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
IIUC, you can do it like this.
df.columns = pd.MultiIndex.from_tuples(cat_df.reset_index()[['broader category','current category']].apply(tuple, axis=1).tolist())
print(df)
Output:
X Y Z
A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
Sum over the outer level:
df.sum(level=0, axis=1)
Output:
X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
You can use set_index to create the idx, then assign it to your df:
idx=df1.set_index('category',append=True).index
df.columns=idx
df
Out[1170]:
current X Y Z
category A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
df.sum(axis=1,level=0)
Out[1171]:
current X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
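Note that several calls in these answers (set_axis with inplace, sum(level=...), groupby(..., axis=1)) have since been deprecated or removed. On current pandas, a sketch of the same roll-up might look like:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 1], 'B': [2, 0, 5],
                   'C': [1, 2, 3], 'D': [4, 2, 1]})
cat_df = pd.DataFrame({'current category': list('ABCD')},
                      index=pd.Index(list('XYYZ'), name='broader category'))

# map each current category to its broader category
mapping = dict(zip(cat_df['current category'], cat_df.index))
df.columns = pd.MultiIndex.from_arrays([df.columns.map(mapping), df.columns])

# sum over the outer level (replaces df.sum(level=0, axis=1))
rolled = df.T.groupby(level=0).sum().T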

Cumulative count of unique strings for each id in a different column

I have a dataframe (df_temp) which is like the following:
ID1 ID2
0 A X
1 A X
2 A Y
3 A Y
4 A Z
5 B L
6 B L
What I need is to add a column which shows the cumulative number of unique values of ID2 for each ID1, something like:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
I've tried:
df_temp['CumUniqueIDs'] = df_temp.groupby(by=['ID1'])['ID2'].nunique().cumsum()+1
But this simply fills CumUniqueIDs with NaN.
Not sure what I'm doing wrong here! Any help much appreciated!
you can use groupby() + transform() + factorize():
In [12]: df['CumUniqueIDs'] = df.groupby('ID1')['ID2'].transform(lambda x: pd.factorize(x)[0]+1)
In [13]: df
Out[13]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
By using the category dtype:
df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
Out[551]:
0 1
1 1
2 2
3 2
4 3
5 1
6 1
Name: ID2, dtype: int8
Then assign it back:
df['CumUniqueIDs']=df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
df
Out[553]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
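The factorize trick numbers each value by order of first appearance within the group; an equivalent spelling that makes the cumulative reading explicit (a sketch on the same frame):
df['CumUniqueIDs'] = (df.groupby('ID1')['ID2']
                        .transform(lambda s: (~s.duplicated()).cumsum()))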

Pandas: Sort the column on frequency by another column having same value grouped

I have a dataframe which is grouped by the y column and sorted on the count of each y value.
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort('count', ascending=False)
Output:
x y count
1 a 4
3 a 4
2 a 4
1 a 4
2 c 3
1 c 3
2 c 3
2 b 2
1 b 2
Now, within each group of y, I want to sort the x column by its frequency, like below:
Expected Output:
x y count
1 a 4
1 a 4
2 a 4
3 a 4
2 c 3
2 c 3
1 c 3
2 b 2
1 b 2
It seems you need groupby and value_counts, and then numpy.repeat to expand the index values by their counts into a DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y x
a 1 2
2 1
3 1
c 2 2
1 1
b 1 1
2 1
Name: x, dtype: int64
import numpy as np

df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
#change order of columns
df1 = df1.reindex_axis(['x','y'], axis=1)
print (df1)
x y
0 1 a
1 1 a
2 2 a
3 3 a
4 2 c
5 2 c
6 1 c
7 1 b
8 2 b
If you are using an older version where df.sort_values is not supported, you can use:
df.sort(columns=['count','x'], ascending=[False,True])
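On current pandas (where df.sort and reindex_axis are gone), a sketch of the whole pipeline using sort_values with two helper frequency columns; ties between equally frequent x values may come out in a different order than the expected output shows:
import pandas as pd

df = pd.DataFrame({'x': [1, 3, 2, 1, 2, 1, 2, 2, 1],
                   'y': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'b', 'b']})

df['count'] = df.groupby('y')['y'].transform('size')           # frequency of each y
df['x_count'] = df.groupby(['y', 'x'])['x'].transform('size')  # frequency of x within its y group
df = (df.sort_values(['count', 'y', 'x_count'],
                     ascending=[False, True, False], kind='stable')
        .drop(columns='x_count'))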
