How to add new column in pandas dataframe with two conditions?

I need to add a new column to a pandas DataFrame based on a condition.
Input file:
Name C2Mean C1Mean
a 2 0
b 4 2
c 6 2.5
These are the conditions:
if C1Mean = 0; log2FC = log2([C2Mean=2])
if C1Mean > 0; log2FC = log2([C2Mean=4]/[C1Mean=2])
Based on these conditions I want to add a new column 'log2FC' like this:
Name C2Mean C1Mean log2FC
a 2 0 1
b 4 2 1
c 6 2.5 1.2630344058
The code I tried:
import pandas as pd
import numpy as np
import os
def induced_genes(rsem_exp_data):
    pwd = os.getcwd()
    data = pd.read_csv(rsem_exp_data,header=0,sep="\t")
    data['log2FC'] = [np.log2(data['C2Mean']/data['C1Mean'])\
    if data['C2Mean'] > 0] else np.log2(data['C2Mean'])]
    print(data.head(5))
induced_genes('induced.genes')

This should work, and it is faster than apply:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name":["a", "b", "c"], "C2Mean":[2,4,6], "C1Mean":[0, 2, 2.5]})
df["log2FC"] = np.where(df["C1Mean"]==0,
np.log2(df["C2Mean"]),
np.log2(df["C2Mean"]/df["C1Mean"]))
UPDATE: Timing
N = 10000
df = pd.DataFrame({"C2Mean":np.random.randint(0,10,N),
"C1Mean":np.random.randint(0,10,N)})
%%timeit -n10
a = np.where(df["C1Mean"]==0,
np.log2(df["C2Mean"]),
np.log2(df["C2Mean"]/df["C1Mean"]))
1.06 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n10
b = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0
else np.log2(x["C2Mean"]), axis=1)
248 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The speed-up is ~234x.
UPDATE 2: Remove RuntimeWarning
Just add this at the beginning of the script:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
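As a narrower alternative to filtering every RuntimeWarning globally, here is a minimal sketch that silences NumPy's floating-point warnings only around the np.where call. It assumes the warning comes from np.log2 being evaluated on rows that np.where later discards; np.errstate is a standard NumPy context manager, and the data is the sample frame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c"],
                   "C2Mean": [2, 4, 6],
                   "C1Mean": [0, 2, 2.5]})

# np.where evaluates both candidate arrays in full before selecting,
# so log2 may be applied to zeros or infinities in rows that are later
# discarded; errstate silences those warnings inside this block only.
with np.errstate(divide="ignore", invalid="ignore"):
    df["log2FC"] = np.where(df["C1Mean"] == 0,
                            np.log2(df["C2Mean"]),
                            np.log2(df["C2Mean"] / df["C1Mean"]))

print(df)
```

Outside the with block, warnings behave as usual, so unrelated numerical problems elsewhere in the script are still reported.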

You can use the below code:
df = pd.DataFrame({"Name":["a", "b", "c"], "C2Mean":[2,4,6], "C1Mean":[0, 2, 2.5]})
df.head()
Name C2Mean C1Mean
a 2 0.0
b 4 2.0
c 6 2.5
df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)
df.head()
Name C2Mean C1Mean log2FC
a 2 0.0 1.000000
b 4 2.0 1.000000
c 6 2.5 1.263034
Here axis=1 means that the function is applied to each row, so every row is passed to the lambda as a Series.
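To make the axis argument concrete, here is a small sketch on a made-up two-column frame (the names x and y are only for illustration). With axis=0 the function receives one column at a time; with axis=1 it receives one row at a time.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [10, 20]})

# axis=0: the lambda is called once per column (Series x, then Series y)
per_column = df.apply(lambda s: s.sum(), axis=0)

# axis=1: the lambda is called once per row (row 0, then row 1)
per_row = df.apply(lambda s: s.sum(), axis=1)

print(per_column)
print(per_row)
```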

Related

Multi-index Dataframe from dictionary of Dataframes

I'd like to create a multi-index dataframe from a dictionary of dataframes where the top-level index is index of the dataframes within the dictionaries and the second level index is the keys of the dictionary.
Example
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A':pd.DataFrame([[1,3],[7,4],[5,8]], index = dt_index, columns = column_names),
'B':pd.DataFrame([[12,3],[9,8],[75,0]], index = dt_index, columns = column_names),
'C':pd.DataFrame([[3,12],[5,1],[22,5]], index = dt_index, columns = column_names)}
Expected output:
Y X
2003-05-01 A 1 3
2003-05-01 B 12 3
2003-05-01 C 3 12
2003-05-02 A 7 4
2003-05-02 B 9 8
2003-05-02 C 5 1
2003-05-03 A 5 8
2003-05-03 B 75 0
2003-05-03 C 22 5
I've tried
pd.concat(df_dict, axis=0)
but this gives me the levels of the multi-index in the incorrect order.
Edit: Timings
Based on the answers so far, this seems like a slow operation to perform as the Dataframe scales.
Larger dummy data:
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
Converting the dictionary to a dataframe, albeit with swapped indices, takes:
%timeit pd.concat(df_dict, axis=0)
63.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Even in the best case, creating a dataframe with the indices in the other order takes 8 times longer than the above!
%timeit pd.concat(df_dict, axis=0).swaplevel().sort_index()
528 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat(df_dict, axis=1).stack(0)
1.72 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use DataFrame.swaplevel with DataFrame.sort_index:
df = pd.concat(df_dict, axis=0).swaplevel(0,1).sort_index()
print (df)
Y X
2003-05-01 A 1 3
B 12 3
C 3 12
2003-05-02 A 7 4
B 9 8
C 5 1
2003-05-03 A 5 8
B 75 0
C 22 5
You can reach down into numpy for a speed-up if you can guarantee two things:
Each DataFrame in df_dict has exactly the same index.
Each DataFrame is already sorted.
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
out = pd.DataFrame(
data=np.column_stack([*df_dict.values()]).reshape(-1, C),
index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# check if this result is consistent with other answers
assert (pd.concat(df_dict, axis=0).swaplevel().sort_index() == out).all().all()
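The two preconditions listed above can be checked before taking the NumPy path. The following is only a sketch: fast_path_is_safe is a hypothetical helper name, and it assumes "sorted" means that the shared index is monotonically increasing.

```python
import numpy as np
import pandas as pd

def fast_path_is_safe(df_dict):
    """Hypothetical helper: True if every frame shares one sorted index."""
    frames = list(df_dict.values())
    reference = frames[0].index
    # Index.equals compares both the labels and their order
    same_index = all(frame.index.equals(reference) for frame in frames[1:])
    return bool(same_index and reference.is_monotonic_increasing)

dt_index = pd.date_range("2000-1-1", periods=4)
good = {k: pd.DataFrame(np.random.rand(4, 2), index=dt_index) for k in "ab"}
# same labels as the others, but in reverse order
bad = dict(good, c=pd.DataFrame(np.random.rand(4, 2), index=dt_index[::-1]))

print(fast_path_is_safe(good))
print(fast_path_is_safe(bad))
```

If the check fails, falling back to concat followed by swaplevel and sort_index keeps the result correct at the cost of speed.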
Timing:
%%timeit
pd.concat(df_dict, axis=0)
# 26.2 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.DataFrame(
data=np.column_stack([*df_dict.values()]).reshape(-1, 500),
index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# 31.2 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat(df_dict, axis=0).swaplevel().sort_index()
# 123 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use concat on axis=1 and stack:
out = pd.concat(df_dict, axis=1).stack(0)
Output:
X Y
2003-05-01 A 3 1
B 3 12
C 12 3
2003-05-02 A 4 7
B 8 9
C 1 5
2003-05-03 A 8 5
B 0 75
C 5 22

Advanced Pandas chaining: chain index.droplevel after groupby

I was trying to find the top2 values in column2 grouped by column1.
Here is the dataframe:
# groupby id and take only top 2 values.
df = pd.DataFrame({'id':[1,1,1,1,1,1,1,1,1,2,2,2,2,2],
'value':[20,20,20,30,30,30,30,40, 40,10, 10, 40,40,40]})
I have done it without chaining:
x = df.groupby('id')['value'].value_counts().groupby(level=0).nlargest(2).to_frame()
x.columns = ['count']
x.index = x.index.droplevel(0)
x = x.reset_index()
x
Result:
id value count
0 1 30 4
1 1 20 3
2 2 40 3
3 2 10 2
Can we do this in ONE SINGLE chained operation?
So far, I have done this:
(df.groupby('id')['value']
.value_counts()
.groupby(level=0)
.nlargest(2)
.to_frame()
.rename({'value':'count'}))
Now I am stuck on how to drop the index level.
How can I do all these operations in one single chain?
You could use apply and head without the second groupby:
df.groupby('id')['value']\
.apply(lambda x: x.value_counts().head(2))\
.reset_index(name='count')\
.rename(columns={'level_1':'value'})
Output:
id value count
0 1 30 4
1 1 20 3
2 2 40 3
3 2 10 2
Timings:
#This method
7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Groupby and groupby(level=0) with nlargest
12.9 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Try the below:
(df.groupby('id')['value']
.value_counts()
.groupby(level=0)
.nlargest(2)
.to_frame()).rename(columns={'value':'count'}).reset_index([1,2]).reset_index(drop=True)
Yet another solution:
df.groupby('id')['value'].value_counts().rename('count')\
.groupby(level=0).nlargest(2).reset_index(level=[1, 2])\
.reset_index(drop=True)
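For comparison, here is a sketch of one more single-chain variant that is not taken from the answers above. It counts with size() on both keys, which avoids apply and also avoids having to drop an index level afterwards. How ties in count are ordered is not specified by the question, so that part is left to the sort.

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                   'value': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})

top2 = (df.groupby(['id', 'value'])
          .size()                                   # occurrences of each (id, value) pair
          .reset_index(name='count')                # back to flat columns id, value, count
          .sort_values(['id', 'count'], ascending=[True, False])
          .groupby('id')
          .head(2)                                  # two most frequent values per id
          .reset_index(drop=True))

print(top2)
```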
Using the solution from @Scott Boston, I did some testing and also
tried to avoid apply altogether, but apply still seems to perform
about as well as the NumPy version:
import numpy as np
import pandas as pd
from collections import Counter
np.random.seed(100)
df = pd.DataFrame({'id':np.random.randint(0,5,10000000),
'value':np.random.randint(0,5,10000000)})
# df = pd.DataFrame({'id':[1,1,1,1,1,1,1,1,1,2,2,2,2,2],
# 'value':[20,20,20,30,30,30,30,40, 40,10, 10, 40,40,40]})
print(df.shape)
df.head()
Using apply
%time
df.groupby('id')['value']\
.apply(lambda x: x.value_counts().head(2))\
.reset_index(name='count')\
.rename(columns={'level_1':'value'})
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 6.2 µs
Without using apply at all
%time
grouped = df.groupby('id')['value']
res = np.zeros([2,3],dtype=int)
for name, group in grouped:
    data = np.array(Counter(group.values).most_common(2))
    ids = np.ones([2,1],dtype=int) * name
    data = np.append(ids,data,axis=1)
    res = np.append(res,data,axis=0)
pd.DataFrame(res[2:], columns=['id','value','count'])
# CPU times: user 3 µs, sys: 0 ns, total: 3 µs
# Wall time: 5.96 µs
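One caveat about the figures above: %time on a line by itself times an empty statement, which is why both runs report only a few microseconds. In a notebook, %%time at the top of the cell times the whole cell. Outside a notebook, a plain-Python sketch with time.perf_counter could look like the following; the row count is reduced here so that it runs quickly, and top2_with_apply is just an illustrative wrapper name.

```python
import time
import numpy as np
import pandas as pd

np.random.seed(100)
n_rows = 100_000  # far fewer than the 10,000,000 rows above, to keep the sketch fast
df = pd.DataFrame({'id': np.random.randint(0, 5, n_rows),
                   'value': np.random.randint(0, 5, n_rows)})

def top2_with_apply(frame):
    # same apply-based chain as above, without the final column rename
    return (frame.groupby('id')['value']
                 .apply(lambda x: x.value_counts().head(2))
                 .reset_index(name='count'))

start = time.perf_counter()
result = top2_with_apply(df)
elapsed = time.perf_counter() - start

print(f"apply version: {elapsed:.4f} s for {n_rows} rows")
```

Wrapping the loop-based version in a function and timing it the same way gives a like-for-like comparison.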

Concatenate column values in a pandas DataFrame while ignoring NaNs

I have the following pandas table
df:
EVNT_ID col1 col2 col3 col4
123454 1 Nan 4 5
628392 Nan 3 Nan 7
293899 2 Nan Nan 6
127820 9 11 12 19
Now I am trying to concat all the columns except the first column and I want my data frame to look in the following way
new_df:
EVNT_ID col1 col2 col3 col4 new_col
123454 1 Nan 4 5 1|4|5
628392 Nan 3 Nan 7 3|7
293899 2 Nan Nan 6 2|6
127820 9 11 12 19 9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate it if anyone could tell me where I am going wrong.
Try the following code:
df['new_col'] = df.iloc[:, 1:].apply(lambda x:
'|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
Initially I thought about using x.dropna() instead of x if str(el) != 'nan',
but %timeit showed that dropna() is much slower.
You can do this with filter and agg:
df.filter(like='col').agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop(columns='EVNT_ID').agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here are timings for the two fastest solutions here.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
'|'.join([k for k in a if k])
for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note, though, that the answers aren't identical: @RafaelC's code produces floats and, because of the zip(*...) transpose, joins down each column rather than across each row: ['1.0|2.0|9.0', '3.0|11.0', ...]. If float output is acceptable, great. Otherwise you'll need to convert to int, which adds more overhead.
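If integer-looking output is needed, one option is to convert once up front to pandas' nullable Int64 dtype and then join row by row. This is only a sketch and makes no claim about speed; it assumes every non-missing value is a whole number, as in the question's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'EVNT_ID': [123454, 628392, 293899, 127820],
                   'col1': [1, np.nan, 2, 9],
                   'col2': [np.nan, 3, np.nan, 11],
                   'col3': [4, np.nan, np.nan, 12],
                   'col4': [5, 7, 6, 19]})

# Int64 (capital I) is the nullable integer dtype: NaN becomes pd.NA and
# whole-number floats become integers, so str() yields '1' rather than '1.0'
as_int = df.iloc[:, 1:].astype('Int64')

joined = ['|'.join(str(x) for x in row if pd.notna(x))
          for row in as_int.to_numpy(dtype=object).tolist()]

print(joined)
```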
Using list comprehension and zip
>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
Timing seems alright
df = pd.concat([df]*1000)
%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
'A' : [1, np.nan,4, 7],
'B' : [2, np.nan, 5, 8],
'C' : [3, 6, 9, np.nan]
}).set_index('date')
print(df)
print('.........')
start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
)
print(df['ColumnA'])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
"""
A B C
date
05/9/2023 1.0 2.0 3.0
07/10/2023 NaN NaN 6.0
08/11/2023 4.0 5.0 9.0
06/12/2023 7.0 8.0 NaN
...........................
OUTPUT:
date
05/9/2023 1.0,2.0,3.0
07/10/2023 6.0
08/11/2023 4.0,5.0,9.0
06/12/2023 7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998
"""

Replacing values in a dataframe from another dataframe

I am working with a dataset that has two data frames.
The Data Frames look like this:
df1:
Item_ID Item_Name
0 A
1 B
2 C
df2:
Item_slot_1 Item_slot_2 Item_Slot_3
2 2 1
1 2 0
0 1 1
The values in df2 represent the Item_ID from df1. How can I replace the values in df2 with the actual item names, so that df2 looks like this:
Item_slot_1 Item_slot_2 Item_Slot_3
C C B
B C A
A B B
In reality the data set is much larger and has many more IDs and names than just A, B and C.
Create a dictionary with zip and pass it to applymap, or to replace, or to apply with map (applymap is deprecated since pandas 2.1 in favor of DataFrame.map):
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
#values missing from df1['Item_ID'] become None in df2
df2 = df2.applymap(s.get)
Or:
#values missing from df1['Item_ID'] keep their original value in df2
df2 = df2.replace(s)
Or:
#values missing from df1['Item_ID'] become NaN in df2
df2 = df2.apply(lambda x: x.map(s))
print (df2)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
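The three variants differ only in how they treat an ID that has no entry in df1. Here is a small sketch with a made-up ID (5) that is absent from df1; the hasattr check is an assumption-friendly way to use DataFrame.map on pandas 2.1+ and fall back to applymap on older versions.

```python
import pandas as pd

df1 = pd.DataFrame({'Item_ID': [0, 1, 2], 'Item_Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'Item_slot_1': [2, 1, 5]})  # 5 is a made-up ID with no entry in df1
s = dict(zip(df1['Item_ID'], df1['Item_Name']))

# DataFrame.map replaced applymap in pandas 2.1; fall back for older versions
elementwise = df2.map if hasattr(df2, 'map') else df2.applymap

via_get = elementwise(s.get)                  # unknown ID -> None
via_replace = df2.replace(s)                  # unknown ID -> left as 5
via_map = df2.apply(lambda col: col.map(s))   # unknown ID -> NaN

print(via_get, via_replace, via_map, sep='\n\n')
```

Which behavior is preferable depends on whether unknown IDs should stay visible (replace) or be flagged as missing (the other two).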
EDIT:
You can restrict the processing to specific columns by name, using any one of the following:
cols = ['Item_slot_1','Item_slot_2','Item_Slot_3']
df2[cols] = df2[cols].applymap(s.get)
#or
df2[cols] = df2[cols].replace(s)
#or
df2[cols] = df2[cols].apply(lambda x: x.map(s))
You can improve the speed of dictionary mapping with numpy. If your items are numbered 0-N this is trivial; if they are not, it gets a bit trickier but is still easily doable.
If the items in df1 are numbered 0-N, use basic indexing:
a = df1['Item_Name'].values
b = df2.values
pd.DataFrame(a[b], columns=df2.columns)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
If they are not numbered 0-N, here is a more general approach:
x = df1['Item_ID'].values
y = df1['Item_Name'].values
z = df2.values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
pd.DataFrame(m[z], columns=df2.columns)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
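One property of the lookup array above is that, because it is seeded with np.arange, any ID missing from df1 maps to itself. The sketch below is a variant, not the answer's exact code: it seeds the array with None so that unknown IDs are easy to spot. It assumes all IDs are non-negative integers no larger than the largest Item_ID, and the IDs 10, 3, 7 and 5 are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Item_ID': [10, 3, 7], 'Item_Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'Item_slot_1': [7, 3], 'Item_slot_2': [10, 5]})  # 5 is not in df1

x = df1['Item_ID'].values
y = df1['Item_Name'].values
z = df2.values

# lookup table indexed by Item_ID; positions without a name stay None
m = np.full(x.max() + 1, None, dtype=object)
m[x] = y

# fancy indexing with the 2-D ID array returns an array of the same shape
result = pd.DataFrame(m[z], columns=df2.columns)
print(result)
```

The table has max(Item_ID) + 1 entries, so this only makes sense when the IDs are reasonably dense.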
Replacing only a subset of the columns of df2 is also simple. For example, to replace only the first two columns:
x = df1['Item_ID'].values
y = df1['Item_Name'].values
cols = ['Item_slot_1', 'Item_slot_2']
z = df2[cols].values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
df2[cols] = m[z]
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C 1
1 B C 0
2 A B 1
This type of indexing nets a hefty performance gain over apply and replace:
import string
df1 = pd.DataFrame({'Item_ID': np.arange(26), 'Item_Name': list(string.ascii_uppercase)})
df2 = pd.DataFrame(np.random.randint(1, 26, (10000, 100)))
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.applymap(s.get)
158 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.replace(s)
750 ms ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.apply(lambda x: x.map(s))
93.1 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
x = df1['Item_ID'].values
y = df1['Item_Name'].values
z = df2.values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
pd.DataFrame(m[z], columns=df2.columns)
30.4 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas Dataframe: Replacing NaN with row average

I am trying to learn pandas but I have been puzzled by the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work, but for some reason it fails for me. Am I missing anything, or is there something wrong with what I'm doing? Is it because it's not implemented?
import pandas as pd
import numpy as np
pd.__version__
Out[44]:
'0.15.2'
In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df
Out[45]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
In [46]:
df.fillna(df.mean(axis=1))
Out[46]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
However, something like this seems to work fine:
df.fillna(df.mean(axis=0))
Out[47]:
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
As commented, the axis argument to fillna is NotImplemented:
df.fillna(df.mean(axis=1), axis=1)
Note: the axis would be critical here, because you don't want to fill the nth column with the nth row average.
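To see why the plain fillna call does not do what the question expects, here is a small sketch. When fillna is given a Series, the Series index is matched against column labels. With integer column labels 0, 1, 2 (an assumption made here purely for illustration), column n is therefore filled with row n's average; with the labels c1, c2, c3 from the question nothing matches, so nothing is filled.

```python
import numpy as np
import pandas as pd

# same numbers as the question, but with default integer column labels 0, 1, 2
df = pd.DataFrame([[1, 4, 7], [2, 5, np.nan], [3, 6, 9]])
row_means = df.mean(axis=1)      # row 0 -> 4.0, row 1 -> 3.5, row 2 -> 6.0

# the Series index (0, 1, 2) is matched against the column labels,
# so the NaN in column 2 receives row_means[2], not row 1's mean
filled = df.fillna(row_means)
print(filled)

# with the question's labels no column matches the Series index,
# so the NaN is left untouched
df_named = df.set_axis(['c1', 'c2', 'c3'], axis=1)
still_missing = df_named.fillna(df_named.mean(axis=1))
print(still_missing)
```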
For now you'll need to iterate through:
m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)
print(df)
c1 c2 c3
0 1 4 7.0
1 2 5 3.5
2 3 6 9.0
An alternative is to fillna the transpose and then transpose, which may be more efficient...
df.T.fillna(df.mean(axis=1)).T
As an alternative, you could also use an apply with a lambda expression like this:
df.apply(lambda row: row.fillna(row.mean()), axis=1)
yielding also
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
For an efficient solution, use DataFrame.where:
We could use where on axis=0:
df.where(df.notna(), df.mean(axis=1), axis=0)
or mask on axis=0:
df.mask(df.isna(), df.mean(axis=1), axis=0)
By using axis=0, we can fill in the missing values in each column with the row averages.
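As a quick check on the question's small frame, the sketch below shows what axis=0 does: the Series of row averages is aligned with the row index, so each missing cell receives the average of its own row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})
row_means = df.mean(axis=1)   # NaN is skipped, so row 1 -> (2 + 5) / 2 = 3.5

# keep the original value where it is present; otherwise take the value
# from row_means that has the same row label (axis=0 aligns on the index)
filled = df.where(df.notna(), row_means, axis=0)
print(filled)
```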
These methods perform very similarly (where does slightly better on large DataFrames of shape (300_000, 20)); they are ~35-50% faster than the numpy methods posted here and about 110x faster than the double-transpose method.
Some benchmarks:
df = creator()
>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)
I'll propose an alternative that involves casting into numpy arrays. Performance-wise, I think this is more efficient and probably scales better than the other solutions proposed so far.
The idea being to use an indicator matrix (df.isna().values which is 1 if the element is N/A, 0 otherwise) and broadcast-multiplying that to the row averages.
Thus, we end up with a matrix (exactly the same shape as the original df), which contains the row-average value if the original element was N/A, and 0 otherwise.
We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A's with the respective row averages.
# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df
giving
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
You can broadcast the mean to a DataFrame with the same index as the original and then use update with overwrite=False to get the behavior of .fillna. Unlike .fillna, update allows for filling when the indices have duplicated labels. It should be faster than the looping .fillna for DataFrames smaller than 50,000 rows or so.
fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape),
columns=df.columns,
index=df.index)
df.update(fill, overwrite=False)
print(df)
1 1 1
0 1.0 4.0 7.0
0 2.0 5.0 3.5
0 3.0 6.0 9.0
Just had the same problem. I found this workaround to be working:
df.transpose().fillna(df.mean(axis=1)).transpose()
I'm not sure though about the efficiency of this solution.
