Updating dataframe fills columns with nan - python

In my DataFrame, I first replace values larger than a threshold with NaN, then create another DataFrame with the same column name and fill it with random numbers. Then I update the original DataFrame with the newly created one, but in the rows where I set the column to NaN myself, every other column becomes NaN as well. Rows that originally had NaN in that column do not have this problem. Here is what I mean in pandas syntax:
df[df['column_name'] > 40] = np.nan
column_series = df['column_name']
null_indices = column_series[column_series.isnull()].index
random_df = pd.DataFrame(np.random.normal(mu, sigma, size=len(null_indices)), index=null_indices, columns=['column_name'])
df.update(random_df)
Here are some numbers to explain the situation better:
Number of nans in the column before replacing values > 40 with nan: 6685022
Number of rows with column value > 40: 329066
Number of rows with nan in every column except column_name after replacing: 329066

df[df['column_name'] > 40] = np.nan fills every column with NaN in the rows where column_name is > 40, not just column_name itself.
Nihal is right, but I prefer this form (cleaner IMO):
df.column_name.loc[df.column_name > 40] = np.nan
PS: it's a good idea to use a Jupyter Notebook to see what the DataFrame looks like at each step.
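A minimal sketch of the difference, on a made-up two-column DataFrame (the values are invented for illustration):
import numpy as np
import pandas as pd
df = pd.DataFrame({'column_name': [10, 50, 20], 'other': [1, 2, 3]})
# whole-row assignment: every column of the matching rows becomes NaN
bad = df.copy()
bad[bad['column_name'] > 40] = np.nan
# column-only assignment: only column_name is touched, 'other' keeps its value
good = df.copy()
good.loc[good['column_name'] > 40, 'column_name'] = np.nan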

Maybe this works (.ix is removed in current pandas, so use .loc):
df.loc[df['column_name'] > 40, 'column_name'] = np.nan  # or select the column by position with .iloc
column_series = df['column_name']
null_indices = column_series[column_series.isnull()].index
random_df = pd.DataFrame(np.random.normal(mu, sigma, size=len(null_indices)),
index=null_indices, columns=['column_name'])
df.update(random_df)

Use the recommended way:
df.loc[df['column_name'] > 40, 'column_name'] = np.nan

The problem arises with your first statement:
df[df['column_name'] > 40] = np.nan
which means "replace ALL values in the selected rows with NaN". So by the time you call
df.update(random_df)
those rows are already NaN in every column, and update only fills column_name back in.
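For reference, a tiny reproduction with invented values for mu, sigma and the data:
import numpy as np
import pandas as pd
mu, sigma = 20, 5                               # invented parameters
df = pd.DataFrame({'column_name': [10.0, 50.0, 20.0], 'other': [1.0, 2.0, 3.0]})
df[df['column_name'] > 40] = np.nan             # row 1 is now NaN in BOTH columns
null_indices = df['column_name'][df['column_name'].isnull()].index
random_df = pd.DataFrame(np.random.normal(mu, sigma, size=len(null_indices)),
                         index=null_indices, columns=['column_name'])
df.update(random_df)                            # refills column_name only; 'other' stays NaN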

Related

Fill missing values based on index

I compute the percentage of missing values like this:
null = data.isnull().sum()*100/len(data)
and filter it like this:
null_column = null[(null >= 10.0) & (null <= 40.0)].index
The output type is an Index.
How can I use fillna to replace the NaNs in every column selected by that index with that column's median?
My code so far looks like this:
null_column = null[(null >= 10.0) & (null <= 40.0)].index
data.fillna(percent_column2.median(), inplace=True)
The result is always an error saying the index doesn't have a median. When I remove the index it works, but the value filled in is not the median of each column; it is the median of the missing-value percentages, which is not a value from the original DataFrame. How can I fill the NaN values based on that index so the result ends up in the original DataFrame?
I guess something like this:
data = pd.DataFrame([[0, 1, np.nan], [np.nan, 1, np.nan], [1, np.nan, 2], [23, 12, 3], [1, 3, 1]])
null = data.isnull().sum()*100/len(data)
cols = list(null[(null >= 10) & (null <= 40)].index)
data.loc[:, cols] = data.loc[:, cols].fillna(data.loc[:, cols].median())
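For what it's worth, fillna here receives a Series of per-column medians (data.loc[:, cols].median()), and pandas aligns a Series value on column labels, so each selected column is filled with its own median rather than one global value.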

Is there any way to shift row values in the dataframe?

I want to shift the values of row 10, from Fintech onward, into the next column and fill the City column in the same row with Bahamas. Is there any way to do that?
I found the DataFrame.shift() function in pandas, but it works on whole columns and shifts all the values.
Use DataFrame.shift with filtered rows and axis=1:
# test for real missing values (NaN/None)
m = df['Select Investors'].isna()
# or, if the missing values are the string 'None':
#m = df['Select Investors'].eq('None')
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)
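A minimal sketch with an invented column layout and values; which cell ends up empty depends on where the shift starts (here it is 'Country', which is then filled by hand):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Select Investors': ['Sequoia', None],
    'Country': ['USA', 'Fintech'],     # row 1 slipped one column to the left
    'City': ['NYC', None],
    'Industry': ['AI', None],
})
m = df['Select Investors'].isna()                              # rows to repair
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)    # move values one column right
df.loc[m, 'Country'] = 'Bahamas'                               # fill the emptied cell (made-up value)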

xarray.dataset.where() does not filter all values

ds = xr.open_dataset('./input_file.nc')
ds2 = ds.where(ds.total_precip > 0, drop=True)
print(ds2)
This code replaces some zero values with NaN, but I still see many 0 values not getting dropped. If I change the condition to ds.total_precip == 0, I get a smaller dataset in which all total_precip values are 0. Am I missing something? Is there another way to filter a dataset based on a condition?
You are applying the wrong logic.
Inside ds.where(condition, other), the condition selects the values whose original values you want to MAINTAIN; everything else is set to NaN (or to other). In this case, ds.total_precip > 0 maintains all values > 0 and sets all the others to NaN, so your zeros are being removed.
If you want to remove only zeros, use:
ds2 = ds.where(ds.total_precip != 0, drop=True)
If you want to remove values less than or equal to zero:
ds2 = ds.where(ds.total_precip > 0, drop=True)
If you want to remove values greater than or equal to zero:
ds2 = ds.where(ds.total_precip < 0, drop=True)
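A 1-D sketch with made-up values (the variable and dimension names only mirror the question):
import numpy as np
import xarray as xr
ds = xr.Dataset(
    {'total_precip': ('time', np.array([0.0, 1.5, 0.0, 2.0, -0.5]))},
    coords={'time': np.arange(5)},
)
ds_nonzero = ds.where(ds.total_precip != 0, drop=True)   # keeps 1.5, 2.0, -0.5
ds_positive = ds.where(ds.total_precip > 0, drop=True)   # keeps 1.5 and 2.0 only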

Replace values in a slice of columns in a pandas dataframe with a value based on a condition

I have a large Pandas dataframe, and want to replace some values in a subset of the columns based on a condition.
Specifically, I want to replace the values that are greater than one with 1 in every column to the right of the 9th column.
Because the dataframe is so large and growing in both the number of rows and columns over time, I cannot manually specify the names of the columns to change values in. Rather, I just need to specify that column 10 and greater should be inspected for values > 1.
After looking at many different Stack Overflow posts and Pandas documentation, I tried:
df.iloc[df[:,10: ] > 1] = 1
However, this gives me the error “unhashable type: ‘slice’”.
I then tried:
df[df.iloc[:, 10:] > 1] = 1
and
df[df.loc[:, df.columns[10:]] > 1] = 1
as per 2 suggestions in the comments, but both of those give me the error “Cannot do inplace boolean setting on mixed-types with a non np.nan value”.
Does anyone know why I’m getting these errors and/or what I should change about my code to avoid them?
Thank you!
1. DataFrame.where
We can use iloc to select all the columns to the right of the 9th column; then, using where, we replace the values in that slice of the DataFrame wherever the condition x.le(1) is False.
df.iloc[:, 10:] = df.iloc[:, 10:].where(lambda x: x.le(1), 1)
2. DataFrame.clip
Alternatively, we can use clip and set the upper limit to 1, which caps every value greater than 1 in that slice of the DataFrame at 1.
df.iloc[:, 10:] = df.iloc[:, 10:].clip(upper=1)
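A quick sketch on invented data; the slice starts at column 2 here instead of column 10:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('abcd'))
# clip: cap every value greater than 1 in columns c and d at 1
df.iloc[:, 2:] = df.iloc[:, 2:].clip(upper=1)
# equivalently with where: keep values <= 1, replace the rest with 1
# df.iloc[:, 2:] = df.iloc[:, 2:].where(lambda x: x.le(1), 1)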

I need to drop all rows based on a condition, but if there are null entries in the column, I want to keep those rows

The dataframe I'm using has a column for ages, called age. There are entries in the age column that are meaningless, such as values over 101 or below 1. The age column also has null entries.
I want to delete the rows for the invalid ages.
Then, I want to fill the null entries with the mean age of what's left.
df = df[(df.age <102) & (df.age > 0)]
When I do this, it drops not only the meaningless ages but the null entries, too. I thought about filling with the mean first, but I don't want the meaningless ages to be included and misrepresent the mean.
This can be done in at least two ways:
Method one:
Keep the NaN values in your mask as well:
df = df[((df.age <102) & (df.age > 0))|(df.age.isnull())]
and then fill the NaN values:
df = df.fillna(df.age.mean())
Method two:
Fill the NaN values by computing the mean on the masked DataFrame only:
df = df.fillna(df[((df.age <102) & (df.age > 0))]["age"].mean())
and then apply the mask:
df = df[((df.age <102) & (df.age > 0))]
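A small sketch with invented ages, showing that both orderings fill the missing age with the mean of the valid ages (32.5 here):
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [25, 150, np.nan, 40, -3]})
# Method one: keep NaN rows in the mask, then fill with the mean of what is left
m1 = df[((df.age < 102) & (df.age > 0)) | (df.age.isnull())]
m1 = m1.fillna(m1.age.mean())
# Method two: compute the mean on the masked ages first, then apply the mask
m2 = df.fillna(df[((df.age < 102) & (df.age > 0))]['age'].mean())
m2 = m2[((m2.age < 102) & (m2.age > 0))]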
