I tried to solve this problem on my own, but I unfortunately haven't made much progress and would really appreciate anyone who can help me out.
My current dataframe contains 3 columns: 2 healthy columns and one column with some missing values, denoted as NaN.
df
Out[18]:
x1 x2 x3
0 A 1 2.0
1 B 0 NaN
2 A 0 1.0
3 A 1 2.0
4 A 0 NaN
5 B 1 1.0
6 A 1 1.0
7 B 0 2.0
8 B 0 2.0
I would like to fill the missing values in 'x3' by taking the median value of groupby of 'x1' and 'x2'.
groupby_df = df.groupby(['x1', 'x2'])['x3'].median()
groupby_df
Out[22]:
x1 x2
A 0 1.0
1 2.0
B 0 2.0
1 1.0
So, for instance, the NaN value corresponding to (B, 0) would be replaced by 2 and (A,0) by 1. I unfortunately can't figure out this part. Is there an elegant "DataFrame way" of filling in the NaN values with the computed median using groupby?
Thank You
using fillna inside groupby
df['x3']=df.groupby(['x1','x2'])['x3'].apply(lambda x : x.fillna(x.median()))
df
Out[928]:
x1 x2 x3
0 A 1 2.0
1 B 0 2.0
2 A 0 1.0
3 A 1 2.0
4 A 0 1.0
5 B 1 1.0
6 A 1 1.0
7 B 0 2.0
8 B 0 2.0
Related
I have a DataFrame in which I have a duplicate column namely weather.
As Seen in this picture of dataframe. One of them contains NaN values that is the one I want to remove from the DataFrame.
I tried this method
data_cleaned4.drop('Weather', axis=1)
It dropped both columns as it should. I tried to pass a condition to drop method but I couldn't. It shows me an error.
data_cleaned4.drop(data_cleaned4['Weather'].isnull().sum() > 0, axis=1)
Can anyone tell me how do I remove this column. Remember that the second last contains the NaN values not the last one.
A general solution. (df.isnull().any(axis=0).values) gets which columns have any NaN values and df.columns.duplicated(keep=False) marks all duplicates as True, both combined will give the columns which you want to retain
General Solution:
df.loc[:, ~((df.isnull().any(axis=0).values) & df.columns.duplicated(keep=False))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C
0 1 1 1
1 1 1 1
2 2 3 4
3 1 1 1
Just for column C:
df.loc[:, ~(df.columns.duplicated(keep=False) & (df.isnull().any(axis=0).values)
& (df.columns == 'C'))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C A
0 1 1 1 NaN
1 1 1 1 1.0
2 2 3 4 2.0
3 1 1 1 1.0
Due to the duplicate names you can rename a little bit, that's what the first lien of the code belwo does, then it should work...
data_cleaned4 = data_cleaned4.iloc[:, [j for j, c in enumerate(data_cleaned4.columns) if j != i]]
checkone = data_cleaned4.iloc[:,-1].isna().any()
checktwo = data_cleaned4.iloc[:,-2].isna().any()
if checkone:
data_cleaned4.drop(data_cleaned4.columns[-1], axis=1)
elif checktwo:
data_cleaned4.drop(data_cleaned4.columns[-2], axis=1)
else:
data_cleaned4.drop(data_cleaned4.columns[-2], axis=1)
Without a testable sample and assuming you don't have NaNs anywhere else in your dataframe
df = df.dropna(axis=1)
should work
When i am trying to do arithmetic operation including two or more columns facing problem with null values.
One more thing which i want to mention here that i don't want to fill missed/null values.
Actually i want something like 1 + np.nan = 1 but it is giving np.nan. I tried to solve it by np.nansum but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But i want actually like,
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated the sum, of the entire column. You do not want that, you probably want to call the np.nansum on the two columns, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yield the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
I want to fill in data between each row in a dataframe with an average of current and next row (where columns are numeric)
starting data:
time value value_1 value-2
0 0 0 4 3
1 2 1 6 6
intermediate df:
time value value_1 value-2
0 0 0 4 3
1 1 0 4 3 #duplicate of row 0
2 2 1 6 6
3 3 1 6 6 #duplicate of row 2
I would like to create df_1:
time value value_1 value-2
0 0 0 4 3
1 1 0.5 5 4.5 #average of row 0 and 2
2 2 1 6 6
3 3 2 8 8 #average of row 2 and 4
To to this I appended a copy of the starting dataframe to create the intermediate dataframe shown above:
df = df_0.append(df_0)
df.sort_values(['time'], ascending=[True], inplace=True)
df = df.reset_index()
df['value_shift'] = df['value'].shift(-1)
df['value_shift_1'] = df['value_1'].shift(-1)
df['value_shift_2'] = df['value_2'].shift(-1)
then I was thinking of applying a function to each column:
def average_vals(numeric_val):
#average every odd row
if int(row.name) % 2 != 0:
#take average of value and value_shift for each value
#but this way I need to create 3 separate functions
Is there a way to do this without writing a separate function for each column and applying to each column one by one (in real data I have tens of columns)?
How about this method using DataFrame.reindex and DataFrame.interpolate
df.reindex(np.arange(len(df.index) * 2) / 2).interpolate().reset_index(drop=True)
Explanation
Reindex, in half steps reindex(np.arange(len(df.index) * 2) / 2)
This gives a DataFrame like this:
time value value_1 value-2
0.0 0.0 0.0 4.0 3.0
0.5 NaN NaN NaN NaN
1.0 2.0 1.0 6.0 6.0
1.5 NaN NaN NaN NaN
Then use DataFrame.interpolate to fill in the NaN values .... the default will be linear interpolation, so mean in this case.
Finaly, use .reset_index(drop=True) to fix your index.
Should give
time value value_1 value-2
0 0.0 0.0 4.0 3.0
1 1.0 0.5 5.0 4.5
2 2.0 1.0 6.0 6.0
3 2.0 1.0 6.0 6.0
My problem is a large data frame which I would like to clear out. The two main problems for me are:
The whole data frame is time-based. That means I can not shift rows around, otherwise, the timestamp wouldn't fit anymore.
The data is not always in the same order.
Here is an example to clarify
index a b c d x1 x2 y1 y2 t
0 1 2 0.2
1 1 2 0.4
2 2 4 0.6
3 1 2 1.8
4 2 3 2.0
5 1 2 3.8
6 2 3 4.0
7 2 5 4.2
The result should be looking like this
index a b c d x1 x2 y1 y2 t
0 1 2 2 4 0.2
1 1 2 0.4
3 1 2 2 3 1.8
5 1 2 2 3 3.8
7 2 5 4.2
This means I would like, to sum up, the right half of the df and keep the timestamp of the first entry. The second problem is, there might be different data from the left half of the df in between.
This may not be the most general solution, but it solves your problem:
First, isolate the right half:
r = df[['x1', 'x2', 'y1', 'y2']].dropna(how='all')
Second, use dropna applied column by column to compress the data:
r_compressed = r.apply(
lambda g: g.dropna().reset_index(drop=True),
axis=0
).set_index(r.index[::2])
You need to drop the index otherwise pandas will attempt to realign the data. The original index is reapplied at the end (but only with every second index label) to facilitate reinsertion of the left half and the t column.
Output (note the index values):
x1 x2 y1 y2
0 1.0 2.0 2.0 4.0
3 1.0 2.0 2.0 3.0
5 1.0 2.0 2.0 3.0
Third, isolate left half:
l = df[['a', 'b', 'c', 'd']].dropna(how='all')
Fourth, incorporate the left half and t column to compressed right half:
out = r_compressed.combine_first(l)
out['t'] = df['t']
Output:
a b c d x1 x2 y1 y2 t
0 NaN NaN NaN NaN 1.0 2.0 2.0 4.0 0.2
1 1.0 2.0 NaN NaN NaN NaN NaN NaN 0.4
3 NaN NaN NaN NaN 1.0 2.0 2.0 3.0 1.8
5 NaN NaN NaN NaN 1.0 2.0 2.0 3.0 3.8
7 NaN NaN 2.0 5.0 NaN NaN NaN NaN 4.2
This question already has answers here:
Pandas: filling missing values by mean in each group
(12 answers)
Closed last year.
I Know that the fillna() method can be used to fill NaN in whole dataframe.
df.fillna(df.mean()) # fill with mean of column.
How to limit mean calculation to the group (and the column) where the NaN is.
Exemple:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a': pd.Series([1,1,1,2,2,2]),
'b': pd.Series([1,2,np.NaN,1,np.NaN,4])
})
print df
Input
a b
0 1 1
1 1 2
2 1 NaN
3 2 1
4 2 NaN
5 2 4
Output (after groupby('a') & replace NaN by mean of group)
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
IIUC then you can call fillna with the result of groupby on 'a' and transform on 'b':
In [44]:
df['b'] = df['b'].fillna(df.groupby('a')['b'].transform('mean'))
df
Out[44]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
If you have multiple NaN values then I think the following should work:
In [47]:
df.fillna(df.groupby('a').transform('mean'))
Out[47]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
EDIT
In [49]:
df = pd.DataFrame({
'a': pd.Series([1,1,1,2,2,2]),
'b': pd.Series([1,2,np.NaN,1,np.NaN,4]),
'c': pd.Series([1,np.NaN,np.NaN,1,np.NaN,4]),
'd': pd.Series([np.NaN,np.NaN,np.NaN,1,np.NaN,4])
})
df
Out[49]:
a b c d
0 1 1 1 NaN
1 1 2 NaN NaN
2 1 NaN NaN NaN
3 2 1 1 1
4 2 NaN NaN NaN
5 2 4 4 4
In [50]:
df.fillna(df.groupby('a').transform('mean'))
Out[50]:
a b c d
0 1 1.0 1.0 NaN
1 1 2.0 1.0 NaN
2 1 1.5 1.0 NaN
3 2 1.0 1.0 1.0
4 2 2.5 2.5 2.5
5 2 4.0 4.0 4.0
You get all NaN for 'd' as all values are NaN for group 1 for d
We first compute the group means, ignoring the missing values:
group_means = df.groupby('a')['b'].agg(lambda v: np.nanmean(v))
Next, we use groupby again, this time fetching the corresponding values:
df_new = df.groupby('a').apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))