I have a very large dataframe where I only want to change the values in a small contiguous subset of columns. In those columns the values are either integers or null. All I want is to replace the 0's and nulls with 'No' and everything else with 'Yes', only in those columns.
In R, this can be done with a one-liner:
df <- df %>%
  mutate_at(vars(MCI:BNP), ~factor(case_when(. > 0 ~ 'Yes',
                                             TRUE ~ 'No')))
But we're working in Python and I can't quite figure out the equivalent using pandas. I've been messing around with loc and iloc, which work fine when changing a single column, but I must be missing something when it comes to modifying multiple columns. The answers I've found in other Stack Overflow questions mostly just change the value in a single column based on some set of conditions.
col1 = df.columns.get_loc("MCI")
col2 = df.columns.get_loc("BNP")
df.iloc[:, col1:col2 + 1]  # iloc slices exclude the end position, so add 1 to include BNP
This gets me the columns I want, but trying to call loc doesn't work with multidimensional keys. I even tried it with the columns as a list instead of by integer index, by creating an extra variable:
binary_var = ['MCI','PVD','CVA','DEMENTIA','CPD','RD','PUD','MLD','DWOC','DWC','HoP','RND','MALIGNANCY','SLD','MST','HIV','AKF',
'ARMD','ASPHY','DEP','DWLK','DRUGA','DUOULC','FALL','FECAL','FLDELEX','FRAIL','GASTRICULC','GASTROULC','GLAU','HYPERKAL',
'HYPTEN','HYPOKAL','HYPOTHYR','HYPOXE','IMMUNOS','ISCHRT','LIPIDMETA','LOSWIGT','LOWBAK','MALNUT','OSTEO','PARKIN',
'PNEUM','RF','SEIZ','SD','TUML','UI','VI','MENTAL','FUROSEMIDE','METOPROLOL','ASPIRIN','OMEPRAZOLE','LISINOPRIL','DIGOXIN',
'ALDOSTERONE_ANTAGONIST','ACE_INHIBITOR','ANGIOTENSIN_RECEPTOR_BLOCKERS','BETA_BLOCKERSDIURETICHoP','BUN','CREATININE',
'SODIUM','POTASSIUM','HEMOGLOBIN','WBC_COUNT','CHLORIDE','ALBUMIN','TROPONIN','BNP']
df.loc[df[binary_var] == 0, binary_var]
But then it just can't find the index for those column names at all. I think pandas also has problems converting columns that were originally integers into No/Yes. I don't need to do this in place. I'm probably just missing something simple that pandas has built in.
In a very pseudo-code description, all I really want is this:
if(df.iloc[:,col1:col2] == 0 || df.iloc[:,col1:col2].isnull())
df ONLY in that subset of column = 'No'
else
df ONLY in that subset of column = 'Yes'
Use:
import numpy as np
df.loc[:, 'MCI':'BNP'] = np.where(df.loc[:, 'MCI':'BNP'] > 0, 'Yes', 'No')
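To see why the np.where approach also covers the nulls, here is a minimal sketch on a made-up three-row frame (the ID and AGE columns and all values are assumptions for illustration). NaN > 0 evaluates to False, so missing values land in the 'No' branch. The sketch assigns by column list rather than through loc, which replaces the columns outright instead of writing strings into numeric columns in place; recent pandas versions warn about the latter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the MCI..BNP block should change.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "MCI": [0, 2, np.nan],
    "PVD": [1, np.nan, 0],
    "BNP": [np.nan, 0, 5],
    "AGE": [70, 81, 65],
})

# Label-based slices include both endpoints, so BNP is part of the block.
block = df.loc[:, "MCI":"BNP"]

# NaN > 0 is False, so nulls fall into the 'No' branch along with zeros.
labels = np.where(block > 0, "Yes", "No")

# Assigning by a list of column names replaces those columns outright.
df[list(block.columns)] = labels

print(df)
```

If a categorical result is wanted, as with factor() in the R version, each of those columns can afterwards be cast with .astype('category').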
Related
I am working with a large pandas dataframe and a few columns have lots of missing data. I am not totally confident with my imputation and I believe the presence or absence of data for these variables could be useful information, so I would like to add another column of the dataframe with 0 where the entry is missing and 1 otherwise. Is there a quick/efficient way to do this in pandas?
Try out the following:
df['New_Col'] = df['Col'].notna().astype('uint8')
Where Col is your column containing np.nan values and New_Col is your binary target column indicating whether each value of Col is present (1) or missing (0).
The relevant function here is .notna, which will yield bool depending on whether the value is missing or not. To apply it to multiple columns of interest, use:
for c in cols_of_interest:
    df[f'{c}_not_missing'] = 1 * df[c].notna()
Note that 1 * bool will give integer 0/1.
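The loop can also be written without iterating, since notna works on a whole DataFrame. A small sketch, assuming a toy frame with columns a, b, c (made up for illustration) where only a and b are of interest:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 2.0],
    "c": [1, 2, 3],
})
cols_of_interest = ["a", "b"]

# notna() on the sub-frame gives a boolean frame; cast it to 0/1 and
# rename the columns so they can be joined back onto the original.
indicators = (
    df[cols_of_interest]
    .notna()
    .astype("uint8")
    .add_suffix("_not_missing")
)
df = df.join(indicators)

print(df)
```

This builds all indicator columns in one step, which tends to matter more as the number of columns grows.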
So I'm new to pandas and this is my first notebook. I needed to join some columns of my dataframe and after that, I wanted to separate the values so it would be better to visualize them.
To join the columns I used df['Q7'] = df[['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5','Q7_Part_6','Q7_OTHER']].apply(lambda x : '_'.join(x.dropna().astype(str)), axis=1) and it worked, but I still needed to separate the values, and for that I used explode() like: df.Q7 = df.Q7.str.split('_').explode('Q7'), which gave me some empty cells in the dataframe like:
[image: Dataframe]
And when I try to visualize the values, they just come out empty like:
[image: sum of empty cells]
What could I do to not show these empty cells on the viz?
Edit 1: By the way, they do not appear as null or NaN cells when I do: df.isnull().sum() or df.isna().sum()
c = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', \
'Q7_Part_5','Q7_Part_6','Q7_OTHER']
df['Q7'] = df[c].apply(lambda x : '_'.join(x.astype(str)), axis=1)
I am not able to replicate your issue, but my best guess is that if you do the above, every row keeps the same number of parts and you get the string 'nan' in place of missing values instead of empty strings.
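Another possible explanation, offered as a guess since the original data is not shown: if a respondent left every Q7 part blank, joining the non-null parts produces the empty string '', and splitting '' yields [''], so explode emits an empty string rather than NaN. That would match the observation that isna() reports nothing. A sketch on made-up data (column subset and values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; the last row has no Q7 answer at all.
df = pd.DataFrame({
    "Q7_Part_1": ["Python", np.nan, "Python", np.nan],
    "Q7_Part_2": ["R", "R", np.nan, np.nan],
    "Q7_OTHER": [np.nan, "Other", np.nan, np.nan],
})
parts = ["Q7_Part_1", "Q7_Part_2", "Q7_OTHER"]

# Same join as in the question: an all-NaN row becomes ''.
joined = df[parts].apply(lambda x: "_".join(x.dropna().astype(str)), axis=1)

# Keep the exploded values in their own Series instead of assigning them
# back to df, because explode repeats index labels.
exploded = joined.str.split("_").explode()

# Empty strings are not NaN, so filter them explicitly before plotting.
exploded = exploded[exploded != ""]

print(exploded.value_counts())
```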
Using Python 3.7 and the latest version of pandas.
I have a dataframe with the following datatypes: [category, float, object(text)]
All I want to do is fill NaN values for the entire dataframe at once.
What I've been doing on my own is going one by one through every single column (hundreds at a time) and grouping column names into lists organized by datatype, then casting each list of columns with .astype(datatype). This was extremely tedious and inefficient, and I still get back lots of errors. I've been doing it this way for months, but now I have Excel sheets with arbitrary data to read in, and considering the size of the dataframes I'm beginning to work with (+/-400k), it's unrealistic to continue that way.
For the dtypes "category" and "object(text)", I want to fillna with the string 'empty'. And for float dtypes, I want to fillna with 0.0. At this point in my project, I am not yet interested in filling with mean/median values.
Ideally I would like to achieve this with something simple like:
df.fillna_all({'float':0, 'category':'empty', 'object':'empty'})
please help!
I think this is exactly what you need:
1) To fill in the text variables with 'empty', you can do:
# Identify the columns in your df that are of type object (i.e. text)
cat_vars = [col for col in df.columns if df[col].dtype == 'O']
# Loop over them, and fill them with 'empty'
for col in cat_vars:
    df[col] = df[col].fillna('empty')
Note that columns of dtype category are not object dtype, so they are not picked up here. For those, 'empty' has to be added as a category first (with .cat.add_categories('empty')) before calling fillna.
2) To fill in the numerical variables with 0.0, you can do:
# Identify the columns that are numeric, AND have at least 1 nan to be filled
num_vars = [x for x in df.columns if df[x].dtype != 'O' and df[x].isnull().any()]
# Loop over them, and fill them with 0.0
for col in num_vars:
    df[col] = df[col].fillna(0)
For the future, if you are interested in filling the numeric variables with mean or median:
for col in df[num_vars]:
    df[col] = df[col].fillna(df[col].median()) # or replace with mean() for mean
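Putting the pieces together, something close to the fillna_all call from the question could be sketched as a small helper. The name fill_missing_by_dtype and its parameters are made up for illustration, not a pandas API. It treats category columns separately because fillna on a categorical only accepts values that are already categories.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def fill_missing_by_dtype(df, text_fill="empty", float_fill=0.0):
    """Return a copy of df with NaNs filled according to each column's dtype.

    Hypothetical helper: category and text columns get `text_fill`,
    float columns get `float_fill`, everything else is left alone.
    """
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if isinstance(s.dtype, pd.CategoricalDtype):
            # The fill value must exist as a category before fillna.
            if text_fill not in s.cat.categories:
                s = s.cat.add_categories(text_fill)
            out[col] = s.fillna(text_fill)
        elif ptypes.is_float_dtype(s):
            out[col] = s.fillna(float_fill)
        elif ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s):
            out[col] = s.fillna(text_fill)
    return out


# Toy data with one column of each dtype from the question
df = pd.DataFrame({
    "price": [1.5, np.nan, 3.0],
    "note": ["a", None, "c"],
    "grade": pd.Categorical(["x", None, "y"]),
})
filled = fill_missing_by_dtype(df)
print(filled)
```

Returning a copy keeps the original frame untouched, which makes it easier to compare before and after while checking the result.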
I have a DataFrame which has a few columns. There is a column with a value that only appears once in the entire dataframe. I want to write a function that returns the column name of the column with that specific value. I can manually find which column it is with the usual data exploration, but since I have multiple dataframes with the same properties, I need to be able to find that column for multiple dataframes. So a somewhat generalized function would be of better use.
The problem is that I don't know beforehand which column is the one I am looking for since in every dataframe the position of that particular column with that particular value is different. Also the desired columns in different dataframes have different names, so I cannot use something like df['my_column'] to extract the column.
Thanks
You'll need to iterate columns and look for the value:
def find_col_with_value(df, value):
    for col in df:
        if (df[col] == value).any():
            return col
This will return the name of the first column that contains value. If value does not exist, it will return None.
Check the entire DataFrame for the specific value, checking any to see if it ever appears in a column, then slice the columns (or the DataFrame if you want the Series)
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(0, 5, (100, 200)),
columns=[chr(i+40) for i in range(200)])
df['Y'] = df['Y'].astype(object) # Allow a string in this otherwise float column
df.loc[5, 'Y'] = 'secret_value' # Secret value in column 'Y'
df.eq('secret_value').any().loc[lambda x: x].index
# or
df.columns[df.eq('secret_value').any()]
Index(['Y'], dtype='object')
I have another solution:
names = ds.columns
for i in names:
    for j in ds[i]:
        if j == 'your_value':
            print(i)
            break
This collects the column names, then scans each column's values until the value is found and prints the name of that column.
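Since the question asks for something reusable across several dataframes, the vectorized check can be wrapped in a function. This is only a sketch: columns_containing is a hypothetical name, and the two frames are made-up examples in which the target value sits in differently named columns.

```python
import pandas as pd


def columns_containing(df, value):
    """Return the names of all columns in which `value` appears at least once."""
    # eq compares every cell to value; any() collapses each column to one bool.
    mask = df.eq(value).any()
    return df.columns[mask].tolist()


# Two hypothetical frames with the value in different places
sales = pd.DataFrame({"a": [1, 2], "b": ["x", "needle"], "c": [3.0, 4.0]})
logs = pd.DataFrame({"p": ["needle", "y"], "q": [0, 1]})

for name, frame in {"sales": sales, "logs": logs}.items():
    print(name, columns_containing(frame, "needle"))
```

Returning a list rather than a single name makes the unexpected cases visible: an empty list means the value is absent, and more than one entry means it is not unique to a single column.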
I have a very large dataframe with many columns. I want to check all the columns and remove any row containing any instance of the string 'MU'. Some columns have 'MU#1' or 'MU#2', and they will sometimes switch places (like 'MU#1' would be in column 1 at index 0 and 'MU#2' would be in column 1 at index 1). Initially, I tried removing them with this, but it becomes far too cumbersome if I try to do this for both strings above:
df_slice = df[(df.phase_2 != 'MU#1') & (df.phase_3 != 'MU#1') & (df.phase_1 != 'MU#1') & (df.phase_4 != 'MU#1') ]
This may work, but I have to repeat this slice a few times with other dataframes and I imagine there is a much simpler route. I also have more columns than what is shown above, but that is just a snippet.
Simply put, all columns need to be checked for 'MU' and the rows with 'MU' need to be removed. Thanks!
You could also try .str.contains() and apply it to the whole dataframe. This avoids hardcoding the column names.
df[df.apply(lambda x: (~x.str.contains('MU', case=True, regex=True)))].dropna()
or
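df[~df.stack().str.contains('MU').groupby(level=0).any()]
(Older answers write .any(level=0) here; the level argument was removed in pandas 2.0, and .groupby(level=0).any() is the equivalent.)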
How it works
Option 1
x.str.contains('MU', case=True, regex=True), when used in df.apply(), is applied to every column in the dataframe and flags the cells that contain 'MU' (case sensitive, pattern treated as a regular expression)
~ reverses the result, so the mask is True for cells that do not contain 'MU'
Indexing the dataframe with that mask returns NaN wherever the condition is not met, so .dropna() then eliminates the rows with NaN
Option 2
df.stack() # Stacks the dataframe into one long Series indexed by (row, column)
df.stack().str.contains('MU') # Boolean: True for cells containing the string 'MU'
df.stack().str.contains('MU').groupby(level=0).any() # True for each original row with at least one match
~df.stack().str.contains('MU').groupby(level=0).any() # Reverses the selection, keeping only rows without the string 'MU'
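Because the question mentions repeating this on several dataframes, the same idea can be wrapped in a helper. The following is a sketch under a few assumptions: drop_rows_containing is a made-up name, the sample frame is invented, and every cell is cast to a string first so that numeric columns and missing values do not break .str.contains.

```python
import numpy as np
import pandas as pd


def drop_rows_containing(df, text):
    """Return df without rows where any cell, read as a string, contains `text`."""
    # regex=False treats `text` as a literal substring; na=False counts
    # anything still missing after the cast as "no match".
    hit = df.astype(str).apply(
        lambda col: col.str.contains(text, regex=False, na=False)
    )
    # Keep only rows where no column matched.
    return df[~hit.any(axis=1)]


# Hypothetical data with 'MU#1' / 'MU#2' in varying columns
df = pd.DataFrame({
    "phase_1": ["A", "MU#1", "B", "C"],
    "phase_2": ["MU#2", "D", "E", np.nan],
    "reading": [1.0, 2.0, 3.0, 4.0],
})
clean = drop_rows_containing(df, "MU")
print(clean)
```

One caveat of the string cast: missing values may be rendered as the text 'nan', so searching for a substring such as 'na' could match them.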
We can do this with all:
df = df[df[['phase_1','phase_2','phase_3','phase_4']].ne('MU#1').all(axis=1)]
Update
df = df[(~df[['phase_1','phase_2','phase_3','phase_4']].isin(['MU#1','MU#2'])).all(axis=1)]
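This works fine for me (on pandas 2.0 and later, .any(level=0) has to be written as .groupby(level=0).any()):
df[~df.stack().str.contains('Any String').groupby(level=0).any()]
It also works the other way round, to keep only the rows that contain a specific string:
df[df.stack().str.contains('Any String').groupby(level=0).any()]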
Thanks.