Python dataframe exclude rows based on condition not working - python

I have a dataframe that I am concatenating from dataframes and arrays.
Somehow it's inherited the index of the original dataframe, hence I am trying to exclude rows based on one of the columns that should not have missing values.
If I view my dataframe, it shows as this:
print(model_data2['is_62p_days_overdue'][0:11])
Now, when I run:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] != np.nan)[0:11])
I get the exact same output.
And when I run:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] == np.nan)[0:11])
What am I missing? This is driving me nuts!
I've tried resetting the index - but this also does nothing.

IIUC:
Instead of this:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] != np.nan)[0:11])
try with loc accessor and notna() method:
print(model_data2.loc[model_data2['is_62p_days_overdue'].notna(),'is_62p_days_overdue'][0:11])
Answer to the comment:
There are 2 reasons for it:
1. You can't compare NaN values the way you do in your method:
model_data2['is_62p_days_overdue'] != np.nan
# this is wrong; use the notna() method instead
2. You are using the where method, so even after you correct the comparison it will turn the filtered-out rows back into NaN:
model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'].notna())
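A minimal sketch with a made-up Series, illustrating both points:
import numpy as np
import pandas as pd
s = pd.Series([0.0, np.nan, 1.0])
# NaN never compares as equal to anything, so != np.nan is True everywhere:
print(np.nan == np.nan)             # False
print((s != np.nan).tolist())       # [True, True, True] -- the mask keeps every row
# where() keeps the original shape and writes NaN back into the masked-out rows:
print(s.where(s.notna()).tolist())  # [0.0, nan, 1.0]
# Boolean indexing (or .loc) actually drops those rows:
print(s[s.notna()].tolist())        # [0.0, 1.0]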

See the "# rows you may want to see" comment at the bottom of my code:
import pandas as pd
import numpy as np
# make a dataset
d = {'is_62p_days_overdue': [0, 0, 0, 0, 0, None, None, 0, None, 0, None]}
data = pd.DataFrame(d)
print(data)
# append the numbers 1~10 (DataFrame.append is removed in pandas 2+, so use pd.concat)
data = pd.concat([data, pd.DataFrame({'is_62p_days_overdue': list(range(1, 10 + 1))})], ignore_index=True)
data
# rows you may want to see
data.loc[~data.is_62p_days_overdue.isna()]

You can use .dropna() to drop the rows with NaN values.
Use this:
model_data2.dropna(subset = ['is_62p_days_overdue'], inplace = True)
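For example, on a small made-up frame (reusing the column name from the question):
import numpy as np
import pandas as pd
model_data2 = pd.DataFrame({'is_62p_days_overdue': [0, np.nan, 1, np.nan],
                            'other_col': ['a', 'b', 'c', 'd']})
model_data2.dropna(subset=['is_62p_days_overdue'], inplace=True)
print(model_data2)
#    is_62p_days_overdue other_col
# 0                  0.0         a
# 2                  1.0         c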

Related

Figuring out if an entire column in a Pandas dataframe is the same value or not

I have a pandas dataframe that works just fine. I am trying to figure out how to tell if a column with a label that I know is correct does not contain all the same values.
The code below errors out for some reason when I want to see if the column contains -1 in each cell:
# column = "TheColumnLabelThatIsCorrect"
# df = "my correct dataframe"
# I get an () takes 1 or 2 arguments but 3 is passed in error
if (not df.loc(column, estimate.eq(-1).all())):
I just learned about .eq() and .all() and hopefully I am using them correctly.
It's a syntax issue - see docs for .loc/indexing. Specifically, you want to be using [] instead of ()
You can do something like
if not df[column].eq(-1).all():
...
If you want to use .loc specifically, you'd do something similar:
if not df.loc[:, column].eq(-1).all():
...
Also, note you don't need to use .eq(); you can just do (df[column] == -1).all() if you prefer.
You could drop duplicates and if you get only one record it means all records are the same.
import pandas as pd
df = pd.DataFrame({'col': [1, 1, 1, 1]})
len(df['col'].drop_duplicates()) == 1
> True
The question is not entirely clear. Let's try the following though.
Contains only -1 in each cell
df['estimate'].eq(-1).all()
Contains -1 in any cell
df['estimate'].eq(-1).any()
Filter to the rows where the value is -1, keeping all columns
df.loc[df['estimate'].eq(-1),:]
df['column'].value_counts() gives you a list of all unique values and their counts in a column. As for checking whether all the values are a specific number, you can do that by reducing the column to its unique values and checking that there is only one.
len(set(df['column'])) == 1
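A couple of equivalent checks on a toy column (a sketch, assuming the usual pandas import):
import pandas as pd
df = pd.DataFrame({'column': [-1, -1, -1]})
print(df['column'].nunique() == 1)   # True: only one distinct value in the column
print((df['column'] == -1).all())    # True: every cell is exactly -1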

Remove row based on sum of numpy array within each entry in df column

I feel I'm making this harder than it should be: what I have is a dataframe with some columns whose entries each contain numpy arrays (the names of the columns containing these arrays are in an array called names_of_cols_that_contain_arrays). What I want to do is filter out rows for which these numpy arrays have a sum value of zero. This is a similar question on which my code is based, but it doesn't seem to work with the iterator over rows in each column.
What I have currently in my code is
for col_name in names_of_cols_that_contain_arrays:
    for i in range(len(df[col_name])):
        df = df[df[col_name][i].sum() > 0.0]
which doesn't seem that efficient but is a first attempt that explicitly goes through what I thought would be the correct method. But this appears to return a boolean, i.e.
Traceback
...
KeyError: True
In fact in most cases to the code above I get some error associated with a boolean being returned. Any pointers would be appreciated, thanks in advance!
IIUC:
You can try:
df = df.loc[df['names_of_cols_that_contain_arrays'].map(sum) > 0]
# OR
df = df.loc[df['names_of_cols_that_contain_arrays'].map(np.sum).gt(0)]
Sample dataframe used:
import pandas as pd
from numpy import array
d = {'names_of_cols_that_contain_arrays': {0: array([-1, 0, -8]),
                                           1: array([-1, -2, 5])}}
df = pd.DataFrame(d)
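If there are several array-valued columns, as the list names_of_cols_that_contain_arrays in the question suggests, one possible sketch (with made-up column names arrays_a and arrays_b) is to build one mask per column and combine them:
import numpy as np
import pandas as pd
df = pd.DataFrame({'arrays_a': [np.array([-1, 0, -8]), np.array([1, 2, 3])],
                   'arrays_b': [np.array([5, 5, 5]), np.array([4, 0, 0])]})
names_of_cols_that_contain_arrays = ['arrays_a', 'arrays_b']
# keep only rows whose array sums to > 0 in every listed column
mask = np.logical_and.reduce([df[col].map(np.sum) > 0
                              for col in names_of_cols_that_contain_arrays])
df = df.loc[mask]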

Python.pandas: how to select rows where objects start with letters 'PL'

I have specific problem with pandas: I need to select rows in dataframe which start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select rows in another column where objects START with the letters 'pl'.
Is there any solution to select row only based on first two characters in it?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advice appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter that you then apply to the dataframe. As the filter you may use the Series.str.startswith method and do:
df_pl = df[df['Code'].str.startswith('pl')]
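One caveat worth noting: if the column contains missing values, str.startswith returns NaN for them, which cannot be used directly as a boolean filter. The na parameter handles that, and lower-casing first gives a case-insensitive match (a sketch with made-up data):
import pandas as pd
df = pd.DataFrame({'Code': ['plusieurs', None, 'PLUTOT', 'toi']})
# na=False treats missing values as "does not match" instead of NaN
print(df[df['Code'].str.startswith('pl', na=False)])
# case-insensitive variant, if 'PL' and 'pl' should both match
print(df[df['Code'].str.lower().str.startswith('pl', na=False)])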

Rename column values using pandas DataFrame

In one of the columns in my dataframe I have five values:
1, G, 2, 3, 4
How can I change all of the "G" values to 1?
I tried:
df = df['col_name'].replace({'G': 1})
I also tried:
df = df['col_name'].replace('G',1)
"G" is in fact 1 (I do not know why there is a mixed naming)
Edit:
works correctly with:
df['col_name'] = df['col_name'].replace({'G': 1})
If I am understanding your question correctly, you are trying to change the values in a column and not the column name itself.
Given that you have mixed data types there, I assume that column is of type object and thus the number is read as a string.
df['col_name'] = df['col_name'].str.replace('G', '1')
You could try the following line
df.replace('G', 1, inplace=True)
Use numpy:
import numpy as np
df['a'] = np.where(df.a == 'G', 1, df.a)
You can try this. Let's say your data is like:
ab=pd.DataFrame({'a':[1,2,3,'G',5]})
And you will replace it as:
ab1=ab.replace('G',4)
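One caveat: if the column held mixed types it is likely of object dtype, so even after the replacement it may not become numeric on its own (this depends on the pandas version). A small sketch of converting it explicitly afterwards:
import pandas as pd
ab = pd.DataFrame({'col_name': [1, 'G', 2, 3, 4]})
ab['col_name'] = ab['col_name'].replace({'G': 1})
# depending on the pandas version the column may still be object dtype,
# so convert it explicitly to be safe
ab['col_name'] = pd.to_numeric(ab['col_name'])
print(ab['col_name'].dtype)   # int64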

Multiplying columns by another column in a dataframe

(Full disclosure that this is related to another question I asked, so bear with me if I should have appended it to what I wrote previously, even though the problem is different.)
I have a dataframe consisting of a column of weights and columns containing binary values of 0 and 1. I'd like to multiply every column within the dataframe by the weights column. However, I seem to be replacing every column within the dataframe with the weight column. I'm sure I'm missing something incredibly stupid/basic here--I'm rather new to pandas and python as a whole. What am I doing wrong?
celebfile = pd.read_csv(celebcsv)
celebframe = pd.DataFrame(celebfile)
behaviorfile = pd.read_csv(behaviorcsv)
behaviorframe = pd.DataFrame(behaviorfile)
celebbehavior = pd.merge(celebframe, behaviorframe, how ='inner', on = 'RespID')
celebbehavior2 = celebbehavior.copy()
def multiplycolumns(column):
    for column in celebbehavior:
        return celebbehavior[column]*celebbehavior['WEIGHT']
celebbehavior2 = celebbehavior2.apply(lambda column: multiplycolumns(column), axis=0)
print(celebbehavior2.head())
You have a return statement in a for loop, which means the for loop is executed only once. To multiply a data frame by a column, you can use the mul method with the correct axis parameter:
celebbehavior.mul(celebbehavior['WEIGHT'], axis=0)
read_csv returns a pd.DataFrame, so it's not necessary to wrap the result in pd.DataFrame again.
Use mul with axis=0. You can use apply, but that is awkward; mul(axis=0) should be all you need.
df = pd.read_csv(celebcsv).merge(pd.read_csv(behaviorcsv), on='RespID')
df = df.mul(df.WEIGHT, 0)
You said that it looks like you are just replacing with the weights column? Are your other columns all ones?
You can use the mul method to multiply the columns. However, just FYI, if you do want to use apply, bear in mind the following:
The apply function passes each series in the dataframe to the function; this looping is inherent to apply. So the first thing to say is that the loop within your function is redundant. Also, you have a return statement inside it, which is causing the behavior you do not want.
If each column is passed as the argument automatically all you need to do is tell the function what to multiply it by. In this case your weights series.
Here is an implementation using apply. Of course the undesirable part here is that the weights are also multiplied by themselves:
import pandas as pd

df = pd.DataFrame({'1': [1, 1, 0, 1],
                   '2': [0, 0, 1, 0],
                   'weights': [0.5, 0.25, 0.1, 0.05]})

def multiply_columns(column, weights):
    return column * weights

df.apply(lambda x: multiply_columns(x, df['weights']))
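To avoid the weights also being multiplied by themselves, one option (a sketch continuing from the toy frame defined just above) is to restrict the multiplication to every column except the weights column:
cols = df.columns.drop('weights')              # every column except 'weights'
df[cols] = df[cols].mul(df['weights'], axis=0)
print(df)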
