I am working with a large pandas dataframe and a few columns have lots of missing data. I am not totally confident with my imputation and I believe the presence or absence of data for these variables could be useful information, so I would like to add another column of the dataframe with 0 where the entry is missing and 1 otherwise. Is there a quick/efficient way to do this in pandas?
Try out the following:
df['New_Col'] = df['Col'].notna().astype('uint8')
Here Col is your column containing np.nan values and New_Col is the binary indicator column: 1 where Col has a value, 0 where it is missing.
The relevant method here is .notna(), which yields a boolean Series: False where the value is missing and True otherwise. To apply it to multiple columns of interest, use:
for c in cols_of_interest:
    df[f'{c}_not_missing'] = 1 * df[c].notna()
Note that 1 * bool will give integer 0/1.
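If you'd rather avoid the Python-level loop, here is a vectorized sketch (cols_of_interest is assumed to be your own list of column names, not something from the question):

# Build all indicator columns at once: 1 where a value is present, 0 where missing.
indicators = df[cols_of_interest].notna().astype('uint8').add_suffix('_not_missing')
df = df.join(indicators)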
Using Python 3.7 and the latest version of pandas.
I have a dataframe with the following datatypes: [category, float, object(text)]
All I want to do is fill NaN values for the entire dataframe at once.
What I've been doing on my own is going one by one through every single column (hundreds at a time), grouping column names into lists organized by datatype, and then setting each list of columns with pd.astype(datatype). This was extremely tedious and inefficient, and I still keep getting lots of errors. I've been doing it this way for months, but now I have Excel sheets with arbitrary data to read in, and considering the size of the dataframes I'm beginning to work with (+/- 400k rows) it's unrealistic to continue that way.
For the dtypes "category" and "object(text)", I want to fillna with the string 'empty'. For float dtypes, I want to fillna with 0.0. At this point in my project, I am not yet interested in filling with mean/median values.
Ideally I would like to achieve this with something simple like:
df.fillna_all({'float':0, 'category':'empty', 'object':'empty'})
Please help!
I think this is exactly what you need:
1) To fill in the categorical variables with 'empty', you can do:
# Identify the columns in your df that are of type object (i.e. categorical/text)
cat_vars = [col for col in df.columns if df[col].dtype == 'O']

# Loop over them and fill them with 'empty'
for col in cat_vars:
    df[col] = df[col].fillna('empty')
2) To fill in the numerical variables with 0.0, you can do:
# Identify the columns that are numeric AND have at least 1 NaN to be filled
num_vars = [col for col in df.columns if df[col].dtype != 'O' and df[col].isnull().sum() > 0]

# Loop over them and fill them with 0.0
for col in num_vars:
    df[col] = df[col].fillna(0.0)
For the future, if you are interested in filling the numeric variables with mean or median:
for col in num_vars:
    df[col] = df[col].fillna(df[col].median())  # or replace median() with mean() for the mean
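If you want something closer to the single call in the question, here is a minimal sketch (my own addition, assuming the frame only holds float, category, and object columns) that builds a per-column fill value from the dtypes and passes it to one fillna call:

import pandas as pd

# Per-column fill value: 0.0 for float columns, 'empty' for everything else.
fill_values = {
    col: 0.0 if pd.api.types.is_float_dtype(df[col]) else 'empty'
    for col in df.columns
}

# Categorical columns only accept fill values that are existing categories,
# so 'empty' may need to be added to each one first.
for col in df.select_dtypes(include='category').columns:
    if 'empty' not in df[col].cat.categories:
        df[col] = df[col].cat.add_categories('empty')

df = df.fillna(fill_values)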
I have a very large dataframe where I only want to change the values in a small, contiguous subset of columns. In those columns the values are either integers or null. All I want is to replace the 0s and nulls with 'No' and everything else with 'Yes', only in those columns.
In R, this can be done basically with a one liner:
df <- df %>%
  mutate_at(vars(MCI:BNP), ~factor(case_when(. > 0 ~ 'Yes',
                                             TRUE ~ 'No')))
But we're working in Python and I can't quite figure out the equivalent using pandas. I've been messing around with loc and iloc, which work fine when only changing a single column, but I must be missing something when it comes to modifying multiple columns. The answers I've found on other Stack Overflow questions have mostly been about changing the value in a single column based on some set of conditions.
col1 = df.columns.get_loc("MCI")
col2 = df.columns.get_loc("BNP")
df.iloc[:,col1:col2]
Will get me the columns I want, but trying to call loc doesn't work with multidimensional keys. I even tried it with the columns as a list instead of by integer index, by creating an extra variable:
binary_var = ['MCI','PVD','CVA','DEMENTIA','CPD','RD','PUD','MLD','DWOC','DWC','HoP','RND','MALIGNANCY','SLD','MST','HIV','AKF',
'ARMD','ASPHY','DEP','DWLK','DRUGA','DUOULC','FALL','FECAL','FLDELEX','FRAIL','GASTRICULC','GASTROULC','GLAU','HYPERKAL',
'HYPTEN','HYPOKAL','HYPOTHYR','HYPOXE','IMMUNOS','ISCHRT','LIPIDMETA','LOSWIGT','LOWBAK','MALNUT','OSTEO','PARKIN',
'PNEUM','RF','SEIZ','SD','TUML','UI','VI','MENTAL','FUROSEMIDE','METOPROLOL','ASPIRIN','OMEPRAZOLE','LISINOPRIL','DIGOXIN',
'ALDOSTERONE_ANTAGONIST','ACE_INHIBITOR','ANGIOTENSIN_RECEPTOR_BLOCKERS','BETA_BLOCKERSDIURETICHoP','BUN','CREATININE',
'SODIUM','POTASSIUM','HEMOGLOBIN','WBC_COUNT','CHLORIDE','ALBUMIN','TROPONIN','BNP']
df.loc[df[binary_var] == 0, binary_var]
But then it just can't find the index for those column names at all. I think pandas also has problems converting columns that were originally integers into No/Yes. I don't need to do this in place; I'm probably just missing something simple that pandas has built in.
In a very pseudo-code description, all I really want is this:
if df.iloc[:, col1:col2] == 0 or df.iloc[:, col1:col2].isnull():
    df ONLY in that subset of columns = 'No'
else:
    df ONLY in that subset of columns = 'Yes'
Use:
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')
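For context, here's a minimal, self-contained sketch of that approach on a made-up frame (the data is invented; MCI, PVD, and BNP stand in for the full column range):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'MCI': [0, 2, np.nan],
    'PVD': [1, np.nan, 0],
    'BNP': [np.nan, 3, 5],
})

# NaN > 0 evaluates to False, so both 0 and NaN become 'No'.
# On newer pandas, writing strings into numeric columns in place may warn;
# reassigning by column list (df[cols] = ...) avoids that.
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')
print(df)
#    MCI  PVD  BNP
# 0   No  Yes   No
# 1  Yes   No  Yes
# 2   No   No  Yes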
I have a dataframe with a series of columns that contain boolean values, one column for each month of the year (the snippet of the df isn't reproduced here).
I'm trying to update the 2019.04_flag, 2019.05_flag, etc. columns with the last valid value. I know that I can use df['2019.04_flag'].fillna(df['2019.03_flag']), but I don't want to write 11 fillna lines. Is there a means of updating the value dynamically? I've tried the fillna method with the ffill parameter, but it doesn't propagate across the row.
I would look into the pandas fillna method (see its documentation). It has different methods for filling NaN; I think "ffill" would suit your needs, since it fills the NaN with the last valid entry. Try the following:
df = df.fillna(method = "ffill", axis = 1)
Setting axis = 1 will perform the imputation across the columns, the axis I believe you want (a single row across columns).
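As a quick illustration of what axis = 1 does (a made-up frame, not the asker's data):

import numpy as np
import pandas as pd

# Toy frame of month flags (object dtype so booleans and NaN coexist cleanly).
df = pd.DataFrame({
    '2019.03_flag': [True, False],
    '2019.04_flag': [np.nan, True],
    '2019.05_flag': [np.nan, np.nan],
}, dtype=object)

# Propagate the last valid value to the right, within each row.
# In newer pandas you can write df.ffill(axis=1) instead.
df = df.fillna(method='ffill', axis=1)
print(df)
#   2019.03_flag 2019.04_flag 2019.05_flag
# 0         True         True         True
# 1        False         True         True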
I want to compare two different columns in a dataframe (called station_programming_df). I have one column that contains integers (called 'facility_id'). I have a second column that contains a comma-separated string of integers (called 'parsed_final_participant_val'). I want to see if the integer in the first column appears in the second column. If it does, I want to return a "1" in a new column (i.e., 'master_color').
I have tried various approaches including using python's "isin" function, which is not returning errors but is also not returning the correct amount. I have also attempted to convert the datatypes as well but with no luck.
station_programming_df['master_color'] = np.where(station_programming_df['facility_id'].isin(station_programming_df['final_participants_val']), 1, 0)
Here is what the dataframe I am using looks like:
DATA:
facility_id,final_participants_val,master_color
35862,"62469,33894,33749,34847,21656,35396,4624,69571",0
35396,"62469,33894,33749,34847,21656,35396,4624,69571",0
While no error message is returned, I am not finding any matches. The second row should have returned a "1" in the master_color column.
I am wondering if it has to do with how it is interpreting the series (final_participants_val)
Any help would be really appreciated.
Use DataFrame.apply:
station_programming_df['master_color'] = station_programming_df.apply(
    lambda x: 1 if str(x['facility_id']) in x['final_participants_val'] else 0, axis=1
)
print(station_programming_df)
facility_id final_participants_val master_color
0 35862 62469,33894,33749,34847,21656,35396,4624,69571 0
1 35396 62469,33894,33749,34847,21656,35396,4624,69571 1
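One caveat (my addition, not part of the original answer): a plain substring check can give false positives, e.g. a facility_id of 3389 would match the participant 33894. Splitting the string into exact tokens avoids that:

station_programming_df['master_color'] = station_programming_df.apply(
    lambda x: 1 if str(x['facility_id']) in x['final_participants_val'].split(',') else 0,
    axis=1,
)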
You can use df.apply and in.
station_programming_df['master_color'] = station_programming_df.apply(lambda x: str(x.facility_id) in x.final_participants_val, axis=1)
facility_id final_participants_val master_color
0 35862 62469,33894,33749,34847,21656,35396,4624,69571 False
1 35396 62469,33894,33749,34847,21656,35396,4624,69571 True
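This version stores booleans. If you want the 0/1 integers from the question, you can cast the result afterwards:

station_programming_df['master_color'] = station_programming_df['master_color'].astype(int)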
I have a fairly large pandas dataframe (11k rows and 20 columns). One column has a mixed data type, mostly numeric (float) with a handful of strings scattered throughout.
I subset this dataframe by querying other columns before performing some statistical analysis using the data in the mixed column (but can't do this if there's a string present). 99% of the time once subsetted this column is purely numeric, but rarely a string value will end up in the subset, which I need to trap.
What's the most efficient/pythonic way of looping through a Pandas mixed type column to check for strings (or conversely check whether the whole column is full of numeric values or not)?
If there is even a single string present in the column I want to raise an error, otherwise proceed.
This is one way. I'm not sure it can be vectorised.
import pandas as pd
df = pd.DataFrame({'A': [1, None, 'hello', True, 'world', 'mystr', 34.11]})
df['stringy'] = [isinstance(x, str) for x in df.A]
# A stringy
# 0 1 False
# 1 None False
# 2 hello True
# 3 True False
# 4 world True
# 5 mystr True
# 6 34.11 False
Here's a different way. It converts the values of column A to numeric, but does not fail on errors: strings are replaced by NA. The notnull() is there to remove these NA.
df = df[pd.to_numeric(df.A, errors='coerce').notnull()]
However, if there were NAs in the column already, they too will be removed.
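Since the question asks to raise an error whenever a string is present (rather than just dropping those rows), here is a minimal sketch along the same lines (column A stands in for your mixed-type column):

# Raise if any value in the mixed column is a string; otherwise proceed.
if df['A'].apply(lambda v: isinstance(v, str)).any():
    raise TypeError("Column 'A' contains string values; expected numeric only.")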
See also:
Select row from a DataFrame based on the type of the object(i.e. str)