Creating a flag for rows with missing data - python

I have a pandas dataframe and want to create a new column.
This new column would return 1 if all columns in the row have a value (are not NaN).
If there is a NaN in any one of the columns in the row, it would return 0.
Does anyone have guidance on how to go about this?
I have used the code below to count the non-NaN values in each row, which could possibly be used in an if statement. Or is there a simpler way?
code_count['count_languages'] = code_count.apply(lambda x: x.count(), axis=1)

Use DataFrame.notna to test for non-missing values and DataFrame.all to test whether all values per row are True, then convert the mask to 1/0 with Series.view:
code_count['count_languages'] = code_count.notna().all(axis=1).view('i1')  # note: Series.view is deprecated in newer pandas versions
Or Series.astype:
code_count['count_languages'] = code_count.notna().all(axis=1).astype('int')
Or numpy.where:
code_count['count_languages'] = np.where(code_count.notna().all(axis=1), 1, 0)
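A quick demonstration on a toy frame (the data and column names here are invented for illustration):

import numpy as np
import pandas as pd

code_count = pd.DataFrame({'python': [1.0, np.nan, 3.0],
                           'r': [2.0, 5.0, np.nan]})

code_count['count_languages'] = code_count.notna().all(axis=1).astype('int')
# row 0 has no NaN -> 1; rows 1 and 2 each contain a NaN -> 0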

Related

Is there any way to shift row values in the dataframe?

I want to shift the values of row 10, Fintech, into the next column and fill the city column in the same row with Bahamas. Is there any way to do that?
I found the dataframe.shift() function in pandas, but it is limited to shifting whole columns and it moves all the values.
Use DataFrame.shift with filtered rows and axis=1:
# test missing values (NoneType / NaN)
m = df['Select Investors'].isna()
# or, if 'None' is stored as a string:
# m = df['Select Investors'].eq('None')
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)
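A minimal sketch on invented data ('Select Investors', 'Country' and 'City' are column names assumed from the question):

import pandas as pd

# hypothetical frame: in row 1 the city value sits in the 'Country' column
df = pd.DataFrame({'Select Investors': ['Acme', None],
                   'Country': ['USA', 'Nassau'],
                   'City': ['Boston', None]})

m = df['Select Investors'].isna()
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)
df.loc[m, 'Country'] = 'Bahamas'   # fill the vacated column, as in the question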

Creating a new column based on multiple columns

I'm trying to create a new column based on other columns existing in my df.
My new column, col, should be 1 if there is at least one 1 in columns A to E.
If all values in columns A to E are 0, then the value of col should be 0.
I've attached an image for a better understanding.
What is the most efficient way to do this in Python, without using a loop? Thanks.
If you need to test all columns, use DataFrame.max or DataFrame.any with a cast to integers to map True/False to 1/0:
df['col'] = df.max(axis=1)
df['col'] = df.any(axis=1).astype(int)
Or, if you need to test only the columns between A and E, add DataFrame.loc:
df['col'] = df.loc[:, 'A':'E'].max(axis=1)
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
If you need to specify the columns by a list, use a subset:
cols = ['A','B','C','D','E']
df['col'] = df[cols].max(axis=1)
df['col'] = df[cols].any(axis=1).astype(int)
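For example, on a small invented frame:

import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [0, 0], 'C': [0, 0],
                   'D': [0, 1], 'E': [0, 0]})

df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
# row 0 -> 0 (all zeros), row 1 -> 1 (contains at least one 1)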

Filtering df on entire row

I have a dataframe with lots of coded columns. I would like to filter this df where a certain code occurs in any column. I know how to filter on multiple columns, but due to the sheer number of columns it is impractical to write out each one.
E.g. if any column contains x, keep that row.
Thanks in advance
Why don't you try using a boolean mask?
value = ...  # the code you are looking for
df = ...     # your dataframe

# build the mask from the first column, then OR in each remaining column
mask = df[df.columns[0]] == value
for col in df.columns[1:]:
    mask |= df[col] == value
df2 = df[mask]
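A vectorized alternative (not from the original answer) compares the whole frame at once and avoids the Python loop:

mask = df.eq(value).any(axis=1)   # True where any cell in the row equals value
df2 = df[mask]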

Dropping columns with duplicate values

I have a dataframe with 2415 columns and I want to drop consecutive duplicate columns. That is, if column 1 and column 2 have the same values, I want to drop column 2.
I wrote the below code but it doesn't seem to work:
for i in (0,len(df.columns)-1):
    if (df[i].tolist() == df[i+1].tolist()):
        df=df.drop([i+1], axis=1)
    else:
        df=df
You need to select the column name from the index, and iterate with range rather than a tuple. Try this:
columns = df.columns
drops = []
for i in range(len(columns) - 1):
    # mark the later column of each identical consecutive pair, as the question asks
    if df[columns[i]].tolist() == df[columns[i + 1]].tolist():
        drops.append(columns[i + 1])
df = df.drop(drops, axis=1)
Let us try shift:
df.loc[:, ~df.eq(df.astype(object).shift(axis=1)).all()]   # keep columns not identical to the column to their left
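A short demonstration on invented data: column B duplicates A, so it is dropped, while C survives:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [1, 2], 'C': [3, 4]})
out = df.loc[:, ~df.eq(df.astype(object).shift(axis=1)).all()]
print(out.columns.tolist())   # ['A', 'C']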

Is there an equivalent Python function similar to complete.cases in R

I am removing a number of records in a pandas data frame which contains diverse combinations of NaN in the 4-column frame. I have created a function called complete_cases to provide the indexes of rows which meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal or if there is a better way to do it.
Absolutely. All you need to do is:
df.dropna(axis=0, how='any', inplace=True)
This will remove all rows that have at least one missing value, and it updates the data frame in place.
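Note that the complete_cases function in the question flags rows where every column is NaN; to drop only those rows, how='all' is the matching option:

df.dropna(axis=0, how='all', inplace=True)   # drop only rows where all values are NaN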
I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[df.isna().any(axis='columns')]
This way you'll filter the rows that contain missing values (the complement of R's complete.cases).
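If the goal is the literal complete.cases behavior, i.e. keeping only rows with no missing values, one way (an addition, not part of the original answer) is to invert the test:

complete = df.loc[df.notna().all(axis='columns')]   # rows with no NaN at all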
A possible solution:
Count the number of columns with NA, saving the result in a new column.
Based on this new column, filter the rows of the data frame as you wish.
Remove the (now) unnecessary column.
It is possible to do this with a lambda function. For example, if you want to remove rows that have 10 NA values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]   # use bracket access: df.count is the DataFrame method, not the column
del df['count']
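The same filter can also be written without apply or the helper column; a sketch keeping the 10-NA threshold from the answer:

df = df[df.isna().sum(axis=1) != 10]   # keep rows that do not have exactly 10 missing values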
