I am working with a large pandas dataframe and a few columns have lots of missing data. I am not totally confident with my imputation and I believe the presence or absence of data for these variables could be useful information, so I would like to add another column of the dataframe with 0 where the entry is missing and 1 otherwise. Is there a quick/efficient way to do this in pandas?
Try out the following:
df['New_Col'] = df['Col'].notna().astype('uint8')
where Col is your column containing np.nan values and New_Col is your binary indicator column, holding 1 where Col has a value and 0 where it is missing.
The relevant method here is .notna(), which yields a boolean for each value depending on whether it is missing. To apply it to multiple columns of interest, use:
for c in cols_of_interest:
    df[f'{c}_not_missing'] = 1 * df[c].notna()
Note that 1 * bool will give integer 0/1.
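A minimal, self-contained sketch of the indicator column (the column names here are hypothetical, matching the snippet above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': [1.0, np.nan, 3.0, np.nan]})

# 1 where a value is present, 0 where it is missing
df['New_Col'] = df['Col'].notna().astype('uint8')

print(df['New_Col'].tolist())  # [1, 0, 1, 0]
```

The uint8 cast just keeps the indicator compact; a plain int works the same way.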
So I'm new to pandas and this is my first notebook. I needed to join some columns of my dataframe and after that, I wanted to separate the values so it would be better to visualize them.
To join the columns I used df['Q7'] = df[['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1) and it worked well, but I still needed to separate the values. For that I used explode() like df.Q7 = df.Q7.str.split('_').explode('Q7'), which left some empty cells in the dataframe:
Dataframe
and when i try to visualize the values they just come in empty like:
sum of empty cells
What could I do to not show these empty cells on the viz?
Edit 1: By the way, they do not appear as null or NaN cells when I do df.isnull().sum() or df.isna().sum().
c = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4',
     'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']
df['Q7'] = df[c].apply(lambda x: '_'.join(x.astype(str)), axis=1)
I am not able to replicate your issue, but my best guess is that if you do the above (i.e. drop the dropna() call), the number of joined values per row will remain intact and you will get string 'nan' values instead of empty strings.
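One way to keep the empty cells out of a count or plot is to turn the empty strings into real NaN before counting; a sketch with hypothetical data, where '' comes from a row whose parts were all NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Python_SQL', '', 'R'])  # '' = a row where every Q7 part was NaN

# explode the joined answers, then make empty strings droppable
exploded = s.str.split('_').explode()
cleaned = exploded.replace('', np.nan).dropna()

print(cleaned.tolist())  # ['Python', 'SQL', 'R']
```

After the replace/dropna, value_counts() or a bar chart no longer shows an empty category.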
I have a very large dataframe where I only want to change the values in a small continuous subset of columns. Basically, in those columns the values are either integers or null. All I want is to replace the 0's and nulls with 'No' and everything else with 'Yes' only in those columns
In R, this can be done basically with a one liner:
df <- df %>%
mutate_at(vars(MCI:BNP), ~factor(case_when(. > 0 ~ 'Yes',
TRUE ~ 'No')))
But we're working in Python and I can't quite figure out the equivalent using pandas. I've been messing around with loc and iloc, which work fine when changing a single column, but I must be missing something when it comes to modifying multiple columns. The answers I've found on other Stack Overflow questions mostly change the value in a single column based on some set of conditions.
col1 = df.columns.get_loc("MCI")
col2 = df.columns.get_loc("BNP")
df.iloc[:,col1:col2]
Will get me the columns I want, but trying to call loc doesn't work with multidimensional keys. I even tried it with the columns as a list instead of by integer index, by creating an extra variable:
binary_var = ['MCI','PVD','CVA','DEMENTIA','CPD','RD','PUD','MLD','DWOC','DWC','HoP','RND','MALIGNANCY','SLD','MST','HIV','AKF',
'ARMD','ASPHY','DEP','DWLK','DRUGA','DUOULC','FALL','FECAL','FLDELEX','FRAIL','GASTRICULC','GASTROULC','GLAU','HYPERKAL',
'HYPTEN','HYPOKAL','HYPOTHYR','HYPOXE','IMMUNOS','ISCHRT','LIPIDMETA','LOSWIGT','LOWBAK','MALNUT','OSTEO','PARKIN',
'PNEUM','RF','SEIZ','SD','TUML','UI','VI','MENTAL','FUROSEMIDE','METOPROLOL','ASPIRIN','OMEPRAZOLE','LISINOPRIL','DIGOXIN',
'ALDOSTERONE_ANTAGONIST','ACE_INHIBITOR','ANGIOTENSIN_RECEPTOR_BLOCKERS','BETA_BLOCKERSDIURETICHoP','BUN','CREATININE',
'SODIUM','POTASSIUM','HEMOGLOBIN','WBC_COUNT','CHLORIDE','ALBUMIN','TROPONIN','BNP']
df.loc[df[binary_var] == 0, binary_var]
But then it just can't find the index for those column names at all. I think pandas also has problems converting columns that were originally integers into No/Yes. I don't need to do this in place; I'm probably just missing something simple that pandas has built in.
In very pseudo-code terms, all I really want is this:
if (df.iloc[:, col1:col2] == 0 || df.iloc[:, col1:col2].isnull())
    df ONLY in that subset of columns = 'No'
else
    df ONLY in that subset of columns = 'Yes'
Use:
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')
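A small self-contained sketch of why this one-liner covers both cases at once (the frame below is hypothetical, with just three of the question's columns): NaN > 0 evaluates to False, so nulls and zeros both land in the 'No' branch, and the label-based slice 'MCI':'BNP' is inclusive of every column in between.

```python
import numpy as np
import pandas as pd

# hypothetical frame with the same column layout as the question
df = pd.DataFrame({'MCI': [1, 0, np.nan],
                   'PVD': [0, 2, 1],
                   'BNP': [np.nan, 0, 3]})

# NaN > 0 is False, so both 0 and NaN become 'No'; everything else becomes 'Yes'
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')

print(df['MCI'].tolist())  # ['Yes', 'No', 'No']
print(df['BNP'].tolist())  # ['No', 'No', 'Yes']
```

This mirrors the R mutate_at/case_when pattern: one vectorized condition applied across the whole column range, no per-column loop.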
I have a bunch of dataframes where I want to pull out a single column from each and merge them into another dataframe with a timestamp column that is not indexed.
So e.g. all the dataframes look like:
[Index] [time] [col1] [col2] [etc]
0 2020-04-21T18:00:00Z 1 2 ...
All of the dataframes have a 'time' column and a 'col1' column. Because the 'time' column does not necessarily overlap, I made a new dataframe with a join of all the dataframes (that I added to a dictionary)
di = ...  # dictionary of all the dataframes of interest

fulltimeslist = []
for key in di:
    temptimeslist = di[key]['time'].tolist()
    fulltimeslist.extend(x for x in temptimeslist if x not in fulltimeslist)

datadf = pd.DataFrame({'time': fulltimeslist})  # make a new df with this as a column
(I'm sure there's an easier way to do the above; any suggestions are welcome.) Note that for a number of reasons, translating the ISO datetime format into a datetime and setting that as an index is not ideal.
The dumb way to do what I want is obvious enough:
for key in di:
    datadf[key] = float("NaN")
    tempdf = di[key]  # could skip this probably
    for i in range(len(datadf)):
        matches = tempdf.time[tempdf.time == datadf.time[i]].index.tolist()
        if matches:
            if len(matches) == 1:  # make sure the value only shows up once; could skip this and put protection in elsewhere
                datadf.loc[i, key] = float(tempdf[colofinterest][matches])
# I guess I could do the above backwards so I loop over only the shorter dataframe to save some time.
but this seems needlessly long for Python... I originally tried the pandas merge and join methods but got various KeyErrors when trying them; the same goes for 'in' statements inside the if statements.
E.g., I've tried things like
datadf.join(Nodes_dict[key],datadf['time']==Nodes_dict[key]['time'],how="left").select()
but this fails.
I guess the question boils down to the following steps:
1) given 2 dataframes, each with a column of strings (times in ISO format), find the indexes in the larger one where they match the shorter one (or vice versa)
2) given that list of indexes, populate a separate column in the larger df using values from the smaller df, but only in the correct spots, and NaN otherwise
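Both steps above are essentially what a left merge on the string 'time' column does in one call; a sketch with hypothetical frames (the string timestamps never need to be parsed):

```python
import pandas as pd

# datadf holds the union of all timestamps; 'small' stands in for one di[key]
datadf = pd.DataFrame({'time': ['2020-04-21T18:00:00Z',
                                '2020-04-21T19:00:00Z',
                                '2020-04-21T20:00:00Z']})
small = pd.DataFrame({'time': ['2020-04-21T19:00:00Z'], 'col1': [42.0]})

# left merge keeps every row of datadf; unmatched times get NaN in 'col1'
merged = datadf.merge(small[['time', 'col1']], on='time', how='left')

print(merged['col1'].tolist())  # [nan, 42.0, nan]
```

Looping this merge over the dictionary (renaming 'col1' to each key first) replaces the nested index-matching loops; merge aligns on equal string values, so no datetime index is required.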
I started with a Pandas DataFrame which has a column with many NaN values.
I split this Pandas DataFrame into two DataFrames: non-NaN and NaN.
I estimated a linear regression model to try to fill in the NaN values (as a function of the other columns).
So I now have a separate Pandas Series that has the estimated values. Its length is the same length as the NaN DataFrame.
I now want to put these estimated values back into the NaN DataFrame, so that I can then ultimately pd.concat() these two DataFrames into one DataFrame that I can then use for my analysis.
I cannot figure out a way to put these values back into the NaN DataFrame into the correct rows. Every time I tried, only some of the NaNs get filled (and probably in the wrong order). It seems to be something to do with the way they're indexed.
df_nan["Column"] = y_predicted
This is the way I've tried to do it, but it only fills in some of the rows, and incorrectly. Something to do with indices maybe?
I think a way of doing this could be the following: keep your raw dataframe and use apply on the column you want to impute.
df['imputed_column'] = df.apply(lambda x: x.Column if(pd.notnull(x.Column)) else y_predicted[x.name],axis=1)
This line gets the estimated value when the column has a null value (with x.name being the index of the row); otherwise, it keeps the original value.
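Note that this relies on y_predicted being indexed by the same row labels as df. If y_predicted is a plain array ordered like the NaN rows, a boolean mask avoids the index alignment problem the question describes; a sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column': [1.0, np.nan, 3.0, np.nan]})
y_predicted = np.array([2.0, 4.0])  # one prediction per NaN row, in row order

# a bare array has no index, so it fills the masked rows purely by position
mask = df['Column'].isna()
df.loc[mask, 'Column'] = y_predicted

print(df['Column'].tolist())  # [1.0, 2.0, 3.0, 4.0]
```

This also removes the need to split, fill, and pd.concat the two frames back together: the fill happens in place on the original dataframe.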