I am working with a large pandas dataframe and a few columns have lots of missing data. I am not totally confident with my imputation and I believe the presence or absence of data for these variables could be useful information, so I would like to add another column of the dataframe with 0 where the entry is missing and 1 otherwise. Is there a quick/efficient way to do this in pandas?
Try out the following:
df['New_Col'] = df['Col'].notna().astype('uint8')
where Col is your column containing np.nan values and New_Col is your binary indicator column, holding 1 where Col has a value and 0 where it is missing.
The relevant method here is .notna(), which yields a boolean for each value depending on whether it is missing. To apply it to multiple columns of interest, use:
for c in cols_of_interest:
    df[f'{c}_not_missing'] = 1 * df[c].notna()
Note that 1 * bool will give integer 0/1.
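A minimal, self-contained sketch of the indicator column (the column names here are hypothetical, matching the snippet above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': [1.0, np.nan, 3.0, np.nan]})

# 1 where a value is present, 0 where it is missing
df['New_Col'] = df['Col'].notna().astype('uint8')

print(df['New_Col'].tolist())  # [1, 0, 1, 0]
```

The uint8 cast just keeps the indicator compact; a plain int works the same way.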
So I'm new to pandas and this is my first notebook. I needed to join some columns of my dataframe and after that, I wanted to separate the values so it would be better to visualize them.
To join the columns I used df['Q7'] = df[['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1) and it worked well, but I still needed to separate the values. For that I used explode() like df.Q7 = df.Q7.str.split('_').explode('Q7'), which left some empty cells in the dataframe:
Dataframe
and when i try to visualize the values they just come in empty like:
sum of empty cells
What could I do to not show these empty cells on the viz?
Edit 1: By the way, they do not appear as null or NaN cells when I do df.isnull().sum() or df.isna().sum().
c = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4',
     'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']
df['Q7'] = df[c].apply(lambda x: '_'.join(x.astype(str)), axis=1)
I am not able to replicate your issue, but my best guess is that if you do the above (i.e. drop the dropna() call), the number of joined values per row will remain intact and you will get string 'nan' values instead of empty strings.
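One way to keep the empty cells out of a count or plot is to turn the empty strings into real NaN before counting; a sketch with hypothetical data, where '' comes from a row whose parts were all NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Python_SQL', '', 'R'])  # '' = a row where every Q7 part was NaN

# explode the joined answers, then make empty strings droppable
exploded = s.str.split('_').explode()
cleaned = exploded.replace('', np.nan).dropna()

print(cleaned.tolist())  # ['Python', 'SQL', 'R']
```

After the replace/dropna, value_counts() or a bar chart no longer shows an empty category.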
I have a very large dataframe where I only want to change the values in a small continuous subset of columns. Basically, in those columns the values are either integers or null. All I want is to replace the 0's and nulls with 'No' and everything else with 'Yes' only in those columns
In R, this can be done basically with a one liner:
df <- df %>%
mutate_at(vars(MCI:BNP), ~factor(case_when(. > 0 ~ 'Yes',
TRUE ~ 'No')))
But we're working in Python and I can't quite figure out the equivalent using pandas. I've been messing around with loc and iloc, which work fine when changing a single column, but I must be missing something when it comes to modifying multiple columns. The answers I've found on other Stack Overflow questions mostly change the value in a single column based on some set of conditions.
col1 = df.columns.get_loc("MCI")
col2 = df.columns.get_loc("BNP")
df.iloc[:,col1:col2]
Will get me the columns I want, but trying to call loc doesn't work with multidimensional keys. I even tried it with the columns as a list instead of by integer index, by creating an extra variable:
binary_var = ['MCI','PVD','CVA','DEMENTIA','CPD','RD','PUD','MLD','DWOC','DWC','HoP','RND','MALIGNANCY','SLD','MST','HIV','AKF',
'ARMD','ASPHY','DEP','DWLK','DRUGA','DUOULC','FALL','FECAL','FLDELEX','FRAIL','GASTRICULC','GASTROULC','GLAU','HYPERKAL',
'HYPTEN','HYPOKAL','HYPOTHYR','HYPOXE','IMMUNOS','ISCHRT','LIPIDMETA','LOSWIGT','LOWBAK','MALNUT','OSTEO','PARKIN',
'PNEUM','RF','SEIZ','SD','TUML','UI','VI','MENTAL','FUROSEMIDE','METOPROLOL','ASPIRIN','OMEPRAZOLE','LISINOPRIL','DIGOXIN',
'ALDOSTERONE_ANTAGONIST','ACE_INHIBITOR','ANGIOTENSIN_RECEPTOR_BLOCKERS','BETA_BLOCKERSDIURETICHoP','BUN','CREATININE',
'SODIUM','POTASSIUM','HEMOGLOBIN','WBC_COUNT','CHLORIDE','ALBUMIN','TROPONIN','BNP']
df.loc[df[binary_var] == 0, binary_var]
But then it just can't find the index for those column names at all. I think pandas also has problems converting columns that were originally integers into No/Yes. I don't need to do this in place; I'm probably just missing something simple that pandas has built in.
In very pseudo-code terms, all I really want is this:
if (df.iloc[:, col1:col2] == 0 || df.iloc[:, col1:col2].isnull())
    df ONLY in that subset of columns = 'No'
else
    df ONLY in that subset of columns = 'Yes'
Use:
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')
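A small self-contained sketch of why this one-liner covers both cases at once (the frame below is hypothetical, with just three of the question's columns): NaN > 0 evaluates to False, so nulls and zeros both land in the 'No' branch, and the label-based slice 'MCI':'BNP' is inclusive of every column in between.

```python
import numpy as np
import pandas as pd

# hypothetical frame with the same column layout as the question
df = pd.DataFrame({'MCI': [1, 0, np.nan],
                   'PVD': [0, 2, 1],
                   'BNP': [np.nan, 0, 3]})

# NaN > 0 is False, so both 0 and NaN become 'No'; everything else becomes 'Yes'
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')

print(df['MCI'].tolist())  # ['Yes', 'No', 'No']
print(df['BNP'].tolist())  # ['No', 'No', 'Yes']
```

This mirrors the R mutate_at/case_when pattern: one vectorized condition applied across the whole column range, no per-column loop.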
I have a bunch of dataframes where I want to pull out a single column from each and merge them into another dataframe with a timestamp column that is not indexed.
So e.g. all the dataframes look like:
[Index] [time] [col1] [col2] [etc]
0 2020-04-21T18:00:00Z 1 2 ...
All of the dataframes have a 'time' column and a 'col1' column. Because the 'time' column does not necessarily overlap, I made a new dataframe with a join of all the dataframes (that I added to a dictionary)
di = ...  # dictionary of all the dataframes of interest

fulltimeslist = []
for key in di:
    temptimeslist = di[key]['time'].tolist()
    fulltimeslist.extend(x for x in temptimeslist if x not in fulltimeslist)

datadf = pd.DataFrame({'time': fulltimeslist})  # make a new df with this as a column
(I'm sure there's an easier way to do the above; any suggestions are welcome.) Note that for a number of reasons, translating the ISO datetime format into a datetime and setting that as an index is not ideal.
The dumb way to do what I want is obvious enough:
for key in di:
    datadf[key] = float("NaN")
    tempdf = di[key]  # could skip this probably
    for i in range(len(datadf)):
        matches = tempdf.time[tempdf.time == datadf.time[i]].index.tolist()
        if matches:
            if len(matches) == 1:  # make sure the value only shows up once; could skip this and put protection in elsewhere
                datadf.loc[i, key] = float(tempdf[colofinterest][matches])
# I guess I could do the above backwards so I loop over only the shorter dataframe to save some time.
but this seems needlessly long for Python... I originally tried the pandas merge and join methods but got various KeyErrors when trying them; the same goes for 'in' statements inside the if statements.
E.g., I've tried things like
datadf.join(Nodes_dict[key],datadf['time']==Nodes_dict[key]['time'],how="left").select()
but this fails.
I guess the question boils down to the following steps:
1) given 2 dataframes, each with a column of strings (times in ISO format), find the indexes in the larger one where they match the shorter one (or vice versa)
2) given that list of indexes, populate a separate column in the larger df using values from the smaller df, but only in the correct spots, and NaN otherwise
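Both steps above are essentially what a left merge on the string 'time' column does in one call; a sketch with hypothetical frames (the string timestamps never need to be parsed):

```python
import pandas as pd

# datadf holds the union of all timestamps; 'small' stands in for one di[key]
datadf = pd.DataFrame({'time': ['2020-04-21T18:00:00Z',
                                '2020-04-21T19:00:00Z',
                                '2020-04-21T20:00:00Z']})
small = pd.DataFrame({'time': ['2020-04-21T19:00:00Z'], 'col1': [42.0]})

# left merge keeps every row of datadf; unmatched times get NaN in 'col1'
merged = datadf.merge(small[['time', 'col1']], on='time', how='left')

print(merged['col1'].tolist())  # [nan, 42.0, nan]
```

Looping this merge over the dictionary (renaming 'col1' to each key first) replaces the nested index-matching loops; merge aligns on equal string values, so no datetime index is required.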
I started with a Pandas DataFrame which has a column with many NaN values.
I split this Pandas DataFrame into two DataFrames: non-NaN and NaN.
I estimated a linear regression model to try to fill in the NaN values (as a function of the other columns).
So I now have a separate Pandas Series that has the estimated values. Its length is the same length as the NaN DataFrame.
I now want to put these estimated values back into the NaN DataFrame, so that I can then ultimately pd.concat() these two DataFrames into one DataFrame that I can then use for my analysis.
I cannot figure out a way to put these values back into the NaN DataFrame into the correct rows. Every time I tried, only some of the NaNs get filled (and probably in the wrong order). It seems to be something to do with the way they're indexed.
df_nan["Column"] = y_predicted
This is the way I've tried to do it, but it only fills in some of the rows, and incorrectly. Something to do with indices maybe?
I think a way of doing this could be the following: keep your raw dataframe and use apply on the column you want to impute.
df['imputed_column'] = df.apply(lambda x: x.Column if(pd.notnull(x.Column)) else y_predicted[x.name],axis=1)
This line gets the estimated value when the column has a null value (with x.name being the index of the row); otherwise, it keeps the original value.
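Note that this relies on y_predicted being indexed by the same row labels as df. If y_predicted is a plain array ordered like the NaN rows, a boolean mask avoids the index alignment problem the question describes; a sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column': [1.0, np.nan, 3.0, np.nan]})
y_predicted = np.array([2.0, 4.0])  # one prediction per NaN row, in row order

# a bare array has no index, so it fills the masked rows purely by position
mask = df['Column'].isna()
df.loc[mask, 'Column'] = y_predicted

print(df['Column'].tolist())  # [1.0, 2.0, 3.0, 4.0]
```

This also removes the need to split, fill, and pd.concat the two frames back together: the fill happens in place on the original dataframe.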