I am trying to fillna in a specific column of the dataframe with the mean of not-null values of the same type (based on the value from another column in the dataframe).
Here is the code to reproduce my issue:
import numpy as np
import pandas as pd
df = pd.DataFrame()
#Create the DataFrame with a column of floats
#And a column of labels (str)
np.random.seed(seed=6)
df['col0']=np.random.randn(100)
lett=['a','b','c','d']
df['col1']=np.random.choice(lett,100)
#Set some of the floats to NaN for the test.
toz = np.random.randint(0,100,25)
df.loc[toz,'col0']=np.nan
df[df['col0'].isnull()==False].count()
#Create a DF with mean for each label.
w_series = df.loc[(~df['col0'].isnull())].groupby('col1').mean()
col0
col1
a 0.057199
b 0.363899
c -0.068074
d 0.251979
#This dataframe has our label (a,b,c,d) as the index. Doesn't seem
#to work when I try to df.fillna(w_series). So I try to reindex such
#that the labels (a,b,c,d) become a column again.
#
#For some reason I cannot just do a set_index and expect the
#old index to become column. So I append the new index and
#then reset it.
w_series['col2'] = list(range(w_series.size))
w_frame = w_series.set_index('col2',append=True)
w_frame.reset_index('col1',inplace=True)
#I try fillna() with the new dataframe.
df.fillna(w_frame)
Still no luck:
col0 col1
0 0.057199 b
1 0.729004 a
2 0.217821 d
3 0.251979 c
4 -2.486781 a
5 0.913252 b
6 NaN a
7 NaN b
What am I doing wrong?
How do I fillna the dataframe with the averages of specific rows that match the missing information?
Does the size of the dataframe being filled (df) and the filler dataframe (w_frame) have to match?
Thank you
fillna aligns on the index, so the target dataframe and the dataframe supplying the fill values need the same index:
df.set_index('col1')['col0'].fillna(w_frame.set_index('col1').col0).reset_index()
# I only show the first 11 rows
Out[74]:
col1 col0
0 b 0.363899
1 a 0.729004
2 d 0.217821
3 c -0.068074
4 a -2.486781
5 b 0.913252
6 a 0.057199
7 b 0.363899
8 c -0.068074
9 b -0.429894
10 a 2.631281
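To make the index alignment explicit, here is a minimal sketch on made-up data (the small frame and the names `means`, `row_fill` and `filled` are illustrative, not from the question). It maps each row's label to its group mean, so the fill values share the row index of `df` and `fillna` can line them up:

```python
import numpy as np
import pandas as pd

# Illustrative data: floats with gaps, plus a label column
df = pd.DataFrame({
    "col1": ["a", "b", "a", "b", "a"],
    "col0": [1.0, 4.0, np.nan, np.nan, 3.0],
})

# Per-label means, indexed by label (NaN is skipped by default)
means = df.groupby("col1")["col0"].mean()

# Look up the mean for each row's label; the result is indexed like df
row_fill = df["col1"].map(means)

# fillna pairs values by index label, which now matches row for row
filled = df["col0"].fillna(row_fill)
print(filled.tolist())
```

The two frames do not need the same size; what matters is that every index label with a missing value in the target also exists in the filler.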
My way to fillna
# assign back to col0 (the float column), so the labels in col1 are kept
df['col0']=df.groupby("col1")['col0'].transform(lambda x: x.fillna(x.mean()))
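A variant of the per-group fill that avoids the lambda, sketched on made-up data (the frame `toy` and its values are assumptions for illustration). It also shows one edge case: a label whose values are all missing has no mean, so its rows stay NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data; label "c" has no observed value at all
toy = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "c"],
    "col0": [1.0, np.nan, np.nan, 5.0, np.nan],
})

# transform("mean") returns one value per row, indexed like toy
group_mean = toy.groupby("col1")["col0"].transform("mean")

# Fill the float column only; the label column is left untouched
toy["col0"] = toy["col0"].fillna(group_mean)
print(toy)
```

If such all-missing labels can occur, a second `fillna` with a global fallback (for example the overall mean) is one way to handle them.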
Related
Hoping someone can help me here. I believe I am close to the solution.
I have a dataframe on which I am using .count() to return a series of all the column names of my dataframe and their respective non-NaN value counts.
Example dataframe:
feature_1 feature_2
1 1
2 NaN
3 2
4 NaN
5 3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe, with the column names "Feature" and "Count". To have the expected output look like this:
Feature Count
feature_1 5
feature_2 3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However, I am receiving this error message: "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements", as though it is not recognising the actual column names (Feature) as a column with values.
How can I get it to recognise both the Feature and Count columns so that I can add column names to them?
Use Series.reset_index instead of Series.to_frame to get a two-column DataFrame: the first column comes from the index and the second from the values of the Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution with name parameter and Series.rename_axis or with DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column: the column names of data become the index of the Series returned by count(), and to_frame() keeps them as the DataFrame index. To assign a two-element list to df.columns, you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']
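The difference is easy to see by checking shapes. This is a small sketch that rebuilds the example frame from the question (the variable names `counts`, `one_col` and `two_col` are illustrative):

```python
import numpy as np
import pandas as pd

# The example frame from the question
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [1, np.nan, 2, np.nan, 3],
})

counts = data.count()        # Series: index = column names, values = counts

one_col = counts.to_frame()  # names stay in the index, so only one column
print(one_col.shape)

# Move the index into a column and name both columns in one step
two_col = counts.rename_axis("Feature").reset_index(name="Count")
print(two_col)
```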
I have a dataframe like this:
id name emails
1 a a#e.com,b#e.com,c#e.com,d#e.com
2 f f#gmail.com
I need to iterate over emails and, where there is more than one, create additional rows in the dataframe for the extra emails. The extra rows should not carry the name, like this:
id name emails
1 a a#e.com
2 f f#gmail.com
3 NaN b#e.com
4 NaN c#e.com
5 NaN d#e.com
What is the best way to do this, apart from iterrows with append or concat? Is it OK to modify a dataframe while iterating over it?
Thanks.
Split the values with Series.str.split, then use DataFrame.explode. Next, compare each name with the part of the email before #, and set name to a missing value where they do not match. Finally, sort so that rows with missing names come last, and assign a new range to the id column:
df = df.assign(emails = df['emails'].str.split(',')).explode('emails')
mask = df['name'].eq(df['emails'].str.split('#').str[0])
df['name'] = np.where(mask, df['name'], np.nan)
df = df.sort_values('name', key=lambda x: x.isna(), ignore_index=True)
df['id'] = range(1, len(df) + 1)
print (df)
id name emails
0 1 a a#e.com
1 2 f f#gmail.com
2 3 NaN b#e.com
3 4 NaN c#e.com
4 5 NaN d#e.com
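For reference, a self-contained sketch of the same steps on the question's data. Two details are my own choices rather than part of the answer: `ignore_index=True` in `explode` (to avoid duplicate index labels) and `kind="stable"` in `sort_values` (so rows with a missing name keep their original relative order). Both assume pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "name": ["a", "f"],
    "emails": ["a#e.com,b#e.com,c#e.com,d#e.com", "f#gmail.com"],
})

# One row per email, with a fresh 0..n-1 index
out = df.assign(emails=df["emails"].str.split(",")).explode(
    "emails", ignore_index=True
)

# Keep the name only where it equals the part of the email before '#'
owner = out["emails"].str.split("#").str[0]
out["name"] = out["name"].where(out["name"].eq(owner))

# Rows with a missing name go last; the stable sort keeps their order
out = out.sort_values(
    "name", key=lambda s: s.isna(), kind="stable", ignore_index=True
)
out["id"] = range(1, len(out) + 1)
print(out)
```

Nothing is modified during iteration here; each step builds a new frame or column, which sidesteps the question of mutating a dataframe while looping over it.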
Is there a way to sort each row of a pandas data frame?
I don't care about columns names or row indexes, I just want a table with the values of each row sorted from highest to lowest.
You can use np.sort with axis=1 on the numpy data:
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,10, (2,4)))
# output
pd.DataFrame(np.sort(df.values, axis=1)[:,::-1],
index=df.index,
columns=df.columns)
Output:
0 1 2 3
0 9 6 6 1
1 8 7 2 1
If you want to override your original dataframe:
df[:] = np.sort(df.values, axis=1)[:,::-1]
Update
np.sort(df)[:,::-1] works as well: df is converted to a numpy array, and axis=-1 is the default.
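A small sketch with fixed values instead of random ones, so the result can be checked by eye (the data and the names `desc`, `with_nan` and `desc_nan` are illustrative). It also shows what happens with missing values: np.sort places NaN at the end of each row, so after reversing, NaN comes first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[3, 9, 1, 6], [7, 2, 8, 1]])

# Sort each row ascending, then reverse the column order for descending
desc = pd.DataFrame(
    np.sort(df.to_numpy(), axis=1)[:, ::-1],
    index=df.index,
    columns=df.columns,
)
print(desc)

# With NaN present: ascending sort puts NaN last, so reversing puts it first
with_nan = np.array([[1.0, np.nan, 3.0]])
desc_nan = np.sort(with_nan, axis=1)[:, ::-1]
print(desc_nan)
```

Note that after sorting, the column labels no longer say where a value came from; they are just positions.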
I have a dataframe like the following
df = [[1,'NaN',3],[4,5,'Nan'],[7,8,9]]
df = pd.DataFrame(df)
and I would like to remove all columns that have in their first row a NaN value.
So the output should be:
df = [[1,3],[4,'Nan'],[7,9]]
df = pd.DataFrame(df)
So in this case, only the second column is removed since the first element was a NaN value.
Hence, dropna() is based on a condition.. any idea how to handle this? Thx!
If the values are np.nan rather than the string 'NaN' (otherwise replace them first), you can do:
Input:
df = [[1,np.nan,3],[4,5,np.nan],[7,8,9]]
df = pd.DataFrame(df)
Solution:
df.loc[:,df.iloc[0].notna()] #assign back to your desired variable
0 2
0 1 3.0
1 4 NaN
2 7 9.0
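If the frame really contains the strings 'NaN' and 'Nan' as in the question, they have to become real missing values first. One possible sketch uses pd.to_numeric with errors="coerce"; this is a swapped-in technique, not the replace mentioned above, and it assumes every column is meant to be numeric, because any other non-numeric text would also become NaN.

```python
import pandas as pd

# The question's data, with string placeholders instead of real NaN
raw = pd.DataFrame([[1, 'NaN', 3], [4, 5, 'Nan'], [7, 8, 9]])

# Convert column by column; unparseable entries are coerced to NaN
clean = raw.apply(pd.to_numeric, errors="coerce")

# Keep only the columns whose first row is not missing
result = clean.loc[:, clean.iloc[0].notna()]
print(result)
```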
This is my original dataframe.
This is my second dataframe containing one column.
I want to add the column of the second dataframe to the end of the original dataframe. The indices of the two dataframes are different. I tried this:
df1['RESULT'] = df2['RESULT']
It doesn't return an error and the column is added but all values are NaNs. How do I add these columns with their values?
Assuming both dataframes have the same number of rows, you can assign RESULT_df['RESULT'].values to your original dataframe. This way, you don't have to worry about index alignment.
# pre 0.24
feature_file_df['RESULT'] = RESULT_df['RESULT'].values
# >= 0.24
feature_file_df['RESULT'] = RESULT_df['RESULT'].to_numpy()
Minimal Code Sample
df
A B
0 -1.202564 2.786483
1 0.180380 0.259736
2 -0.295206 1.175316
3 1.683482 0.927719
4 -0.199904 1.077655
df2
C
11 -0.140670
12 1.496007
13 0.263425
14 -0.557958
15 -0.018375
Let's try direct assignment first.
df['C'] = df2['C']
df
A B C
0 -1.202564 2.786483 NaN
1 0.180380 0.259736 NaN
2 -0.295206 1.175316 NaN
3 1.683482 0.927719 NaN
4 -0.199904 1.077655 NaN
Now, assign the array returned by .values (or .to_numpy() for pandas versions >=0.24). .values returns a numpy array which does not have an index.
df2['C'].values
array([-0.141, 1.496, 0.263, -0.558, -0.018])
df['C'] = df2['C'].values
df
A B C
0 -1.202564 2.786483 -0.140670
1 0.180380 0.259736 1.496007
2 -0.295206 1.175316 0.263425
3 1.683482 0.927719 -0.557958
4 -0.199904 1.077655 -0.018375
You can also call set_axis() to change the index of a dataframe/column. So if the lengths are the same, then with set_axis(), you can coerce the index of one dataframe to be the same as the other dataframe.
df1['A'] = df2['A'].set_axis(df1.index)
If you get SettingWithCopyWarning, then to silence it, you can create a copy by either calling join() or assign().
df1 = df1.join(df2['A'].set_axis(df1.index))
# or
df1 = df1.assign(new_col = df2['A'].set_axis(df1.index))
set_axis() is especially useful if you want to add multiple columns from another dataframe. You can just call join() after calling it on the new dataframe.
df1 = df1.join(df2[['A', 'B', 'C']].set_axis(df1.index))
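To compare the options side by side, here is a minimal sketch with two tiny frames whose indices do not overlap (the data and the names `misaligned`, `by_position` and `relabelled` are illustrative). It assumes both frames have the same number of rows; with different lengths, the array assignment and set_axis both raise a length-mismatch error instead of filling silently.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 1, 2])
df2 = pd.DataFrame({"RESULT": [0.1, 0.2, 0.3]}, index=[11, 12, 13])

# Direct assignment aligns on index labels; none match, so all NaN
misaligned = df1.assign(RESULT=df2["RESULT"])

# A plain array has no index, so values are placed by position
by_position = df1.assign(RESULT=df2["RESULT"].to_numpy())

# Relabel df2's rows with df1's index, then join on that shared index
relabelled = df1.join(df2[["RESULT"]].set_axis(df1.index))

print(misaligned, by_position, relabelled, sep="\n\n")
```

All three calls return new frames, so df1 itself is left unchanged.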