Use fillna() and lambda function in Pandas to replace NaN values - python

I'm trying to write a fillna() call or a lambda function in Pandas that checks whether the 'user_score' column is NaN and, if so, uses the corresponding data from another DataFrame. I tried two options:
games_data['user_score'].fillna(
    genre_score[games_data['genre']]['user_score']
    if np.isnan(games_data['user_score'])
    else games_data['user_score'],
    inplace=True
)
# but this raises 'ValueError: The truth value of a Series is ambiguous'
and
games_data['user_score'] = games_data.apply(
    lambda row: genre_score[row['genre']]['user_score']
                if np.isnan(row['user_score'])
                else row['user_score'],
    axis=1
)
# but this raises a 'KeyError' for another column of games_data
My dataframes: games_data and genre_score (shown as tables in the original post; genre_score holds one user_score per genre, indexed by genre).
I will be glad for any help!

You can fillna() directly with a user_score_by_genre mapping built from genre_score:
user_score_by_genre = games_data.genre.map(genre_score.user_score)
games_data.user_score = games_data.user_score.fillna(user_score_by_genre)
BTW, if games_data.user_score never deviates from the genre_score values, you can skip the fillna() and just assign the mapping directly to games_data.user_score:
games_data.user_score = games_data.genre.map(genre_score.user_score)
Pandas' built-in Series.where also works and is a bit more concise. Note that where() keeps the values where the condition is True and replaces the rest, so the condition must be notna() to preserve the existing scores:
df1['user_score'] = df1.user_score.where(df1.user_score.notna(), df2.user_score)
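To make this concrete, here is a minimal runnable sketch with made-up frames (the original post's tables were images, so the contents below are invented):
import pandas as pd
import numpy as np

games_data = pd.DataFrame({'genre': ['rpg', 'fps', 'rpg'],
                           'user_score': [8.5, np.nan, np.nan]})
# genre_score is indexed by genre so that .map() can look scores up by label
genre_score = pd.DataFrame({'user_score': [7.0, 6.5]}, index=['rpg', 'fps'])

# map each row's genre to its genre-level score, then fill only the NaNs
user_score_by_genre = games_data['genre'].map(genre_score['user_score'])
games_data['user_score'] = games_data['user_score'].fillna(user_score_by_genre)
print(games_data['user_score'].tolist())  # [8.5, 6.5, 7.0]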

Use numpy.where:
import numpy as np
df1['user_score'] = np.where(df1['user_score'].isna(), df2['user_score'], df1['user_score'])
Note that np.where works positionally and ignores the index, so df1 and df2 must have the same length and row order.

I found part of the solution here.
I use Series.map:
user_score_by_genre = games_data['genre'].map(genre_score['user_score'])
And after that I use @MayankPorwal's answer:
games_data['user_score'] = np.where(games_data['user_score'].isna(), user_score_by_genre, games_data['user_score'])
I'm not sure that it is the best way but it works for me.
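Put together as a runnable sketch (same invented frames as in the sketch above):
import pandas as pd
import numpy as np

games_data = pd.DataFrame({'genre': ['rpg', 'fps', 'rpg'],
                           'user_score': [8.5, np.nan, np.nan]})
genre_score = pd.DataFrame({'user_score': [7.0, 6.5]}, index=['rpg', 'fps'])

user_score_by_genre = games_data['genre'].map(genre_score['user_score'])
games_data['user_score'] = np.where(games_data['user_score'].isna(),
                                    user_score_by_genre,
                                    games_data['user_score'])
print(games_data['user_score'].tolist())  # [8.5, 6.5, 7.0]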

Replace nan-values with the mean of their column/attribute

I have tried everything I can come up with and would appreciate some help! :)
This is a method that is supposed to return an imputed part of a data frame:
import pandas as pd
import numpy as np

def imputation(df, columns_to_imputed):
    # Step 1: Get a part of the dataframe using the columns received as a parameter.
    df.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']  # sets the column headers
    part_of_df = pd.DataFrame(df.filter(columns_to_imputed, axis=1))
    part_of_df = part_of_df.drop([0], axis=0)
    # Step 2: Change the zero values in the columns to np.nan.
    part_of_df = part_of_df.replace('0', np.nan)
    # Step 3: Change the NaN values to the mean of each attribute (column).
    # You can use the apply() and fillna() functions.
    part_of_df = part_of_df.fillna(part_of_df.mean(axis=0))  # I've tried everything on this row and can't get it to work. I want to fill each NaN with the mean of its column.
    return part_of_df  # I'm returning this part to see if the NaNs are replaced, but nothing has happened...
You were on the right track; you just need a small change. Here I created a sample DataFrame and introduced some NaNs, using proper .loc indexing for the assignments:
dummy_df = pd.DataFrame({"col1": range(5), "col2": range(5)}, dtype=float)
dummy_df.loc[1, "col1"] = np.nan
dummy_df.loc[3, "col1"] = np.nan
dummy_df.loc[4, "col2"] = np.nan
and got this:
   col1  col2
0   0.0   0.0
1   NaN   1.0
2   2.0   2.0
3   NaN   3.0
4   4.0   NaN
Now I use apply() and a lambda to iterate over the columns and fill each column's NaNs with its mean:
dummy_df = dummy_df.apply(lambda x: x.fillna(x.mean()), axis=0)
This gives me:
   col1  col2
0   0.0   0.0
1   2.0   1.0
2   2.0   2.0
3   2.0   3.0
4   4.0   1.5
Hope this helps!
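As a side note (not from the original answer, but standard pandas behavior): DataFrame.fillna() also accepts a Series of per-column values, so the same imputation works without apply():
import pandas as pd
import numpy as np

dummy_df = pd.DataFrame({"col1": [0.0, np.nan, 2.0, np.nan, 4.0],
                         "col2": [0.0, 1.0, 2.0, 3.0, np.nan]})
# df.mean() is a Series indexed by column name; fillna matches it column by column
dummy_df = dummy_df.fillna(dummy_df.mean())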

pandas df[df["A"]==None] is not the same as df["A"].values==None

Assume I have a dataframe df where column A consists of 10 None values and the rest is something else.
If I do the slicing df = df[df["A"] == None], I get the wrong result. I figured out that df["A"] == None returns False everywhere (even where the elements are None), but df["A"].values == None returns the correct answer.
How come? Shouldn't we be able to slice the first way?
You should use the isna() method on the series.
For your case:
df = df.loc[df['A'].isna()]
You can also use isnull(), which is an alias of isna():
df = df[df['A'].isnull()]
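For a small demonstration of why the == None comparison misbehaves (invented data):
import pandas as pd

s = pd.Series(['x', None, 'y'])      # object dtype holding a real None
print((s == None).tolist())          # [False, False, False]: pandas treats missing values as never equal
print((s.values == None).tolist())   # [False, True, False]: plain NumPy compares each element to None
print(s.isna().tolist())             # [False, True, False]: the idiomatic missing-value check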

Dataframe sorting does not apply when using .loc

I need to sort a pandas dataframe df by a datetime column my_date. Whenever I use .loc, the sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(df)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference between these two uses? What am I missing about dataframes?
In the first case, you didn't perform an operation in-place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was saved to df, hence printing df shows sorted dataframe.
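A minimal sketch of the difference, with invented data:
import pandas as pd

df = pd.DataFrame({'some_column': ['filter', 'filter'],
                   'my_date': pd.to_datetime(['2021-03-01', '2021-01-01'])})

df.sort_values(by=['my_date'])        # returns a sorted copy; df itself is unchanged
df = df.sort_values(by=['my_date'])   # assign the result back to keep it
# or sort in place:
# df.sort_values(by=['my_date'], inplace=True)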
In the first snippet, df = df.loc[(df.some_column == 'filter'),] and df.sort_values(by=['my_date']) are two separate statements; the sorted frame returned by sort_values() is never assigned to anything, so it is thrown away before print(df).
In the second snippet you chain the calls as df.loc[...].sort_values(...) and assign the result back to df, which is the correct way: you don't repeat the df. prefix, and the sorted result is kept.

Vectorized Flag Assignment in Dataframe

I have a dataframe with observations possessing a number of codes. I want to compare the codes present in a row with a list. If any codes are in that list, I wish to flag the row. I can accomplish this using the itertuples method as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1', 'abc6']

# initialize flag column
df['flag'] = 0

# itertuples method
for row in df.itertuples():
    if any(df.iloc[row.Index, 1:4].isin(code_flags)):
        df.at[row.Index, 'flag'] = 1
The output correctly adds a flag column with the appropriate flags, where 1 indicates a flagged entry.
However, in my actual use case this takes hours to complete. I have attempted to vectorize this approach using numpy.where:
df['flag'] = 0 # reset
df['flag'] = np.where(any(df.iloc[:,1:4].isin(code_flags)),1,0)
This appears to evaluate everything to the same value. I think I'm confused about how the vectorization treats the index: I can even remove the colon and comma and write df.iloc[1:4] and obtain the same result.
Am I misunderstanding the where function? Is my indexing incorrect and causing a True evaluation for all cases? Is there a better way to do this?
Use np.where with DataFrame.any(axis=1), not Python's built-in any(). Iterating over a DataFrame (which the built-in any() does) yields its column labels, which are non-empty strings and therefore all truthy, so the condition collapses to a single True; .any(axis=1) instead evaluates the isin() mask row by row:
df['flag'] = np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
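With the sample frame above, this yields df['flag'] equal to [1, 0, 0, 1, 0], matching the itertuples result.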

Pandas df.apply does not modify DataFrame

I am just starting with pandas, so please forgive me if this is something stupid.
I am trying to apply a function to a column, but it's not working and I don't see any errors either.
capitalizer = lambda x: x.upper()

for df in pd.read_csv(downloaded_file, chunksize=2, compression='gzip', low_memory=False):
    df['level1'].apply(capitalizer)
    print(df)
    exit(1)
This print shows the level1 column values exactly as in the original csv, with no uppercasing applied. Am I missing something here?
Thanks
apply is not an inplace function - it does not modify values in the original object, so you need to assign it back:
df['level1'] = df['level1'].apply(capitalizer)
Alternatively, you can use str.upper; it should be much faster:
df['level1'] = df['level1'].str.upper()
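Applied to the chunked loop from the question, the fix might look like this (a sketch; the file name is a stand-in):
import pandas as pd

for df in pd.read_csv('downloaded_file.csv.gz', chunksize=2,
                      compression='gzip', low_memory=False):
    df['level1'] = df['level1'].str.upper()  # assign the result back
    print(df)
    break  # inspect only the first chunk, as in the question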
df['level1'] = list(map(lambda x: x.upper(), df['level1']))
You can use the above code to make your column uppercase. (In Python 3, map() returns an iterator, so wrap it in list() before assigning it to a column.)
