Fill missing values based on index - Python

I compute the percentage of missing values per column like this:
null = data.isnull().sum()*100/len(data)
and filter the result like this:
null_column = null[(null>= 10.0) & (null <= 40.0)].index
The output type is Index.
How can I use fillna to replace the missing values in each of these columns with that column's median, based on this index?
My code so far looks like this:
null_column = null[(null>= 10.0) & (null <= 40.0)].index
data.fillna(percent_column2.median(), inplace=True)
The result is always an error saying that
Index doesn't have median
When I drop the .index it runs, but the value filled in is not the median of each column. Instead, it is the median of the two missing-value percentages, which is not a value from the original DataFrame. How can I use the index to fill the NaN values in the original DataFrame?

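Something like this should work. Here cols holds column labels, so the columns are selected by label and each one is filled with its own median:
import numpy as np
import pandas as pd
data = pd.DataFrame([[0,1,np.nan],[np.nan,1,np.nan],[1,np.nan,2],[23,12,3],[1,3,1]])
null = data.isnull().sum()*100/len(data)
cols = list(null[(null>=10) & (null<=40)].index)
data[cols] = data[cols].fillna(data[cols].median())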
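A minimal sketch of the same idea with named columns, to show that only the columns inside the 10% to 40% range are touched. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a has 20% missing, b 0%, c 60%, d 40%
data = pd.DataFrame({
    "a": [1, np.nan, 3, 4, 5],
    "b": [1, 2, 3, 4, 5],
    "c": [np.nan, np.nan, np.nan, 1, 2],
    "d": [np.nan, 2, np.nan, 4, 6],
})

# Percentage of missing values per column
null = data.isnull().sum() * 100 / len(data)

# Labels of the columns whose missing percentage is between 10 and 40
cols = list(null[(null >= 10.0) & (null <= 40.0)].index)

# fillna with a Series aligns on column labels, so each column gets its own median
data[cols] = data[cols].fillna(data[cols].median())
```

Columns b and c are left as they were, because their missing percentages fall outside the range.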

Related

I need to drop all rows based on condition but if there are null entries in the column, I want to keep those rows

The DataFrame I'm using has a column for ages, called age. Some entries in the age column are meaningless: they have values over 101 or below 1. The age column also has null entries.
I want to delete the rows for the invalid ages.
Then, I want to fill the null entries with the mean age of what's left.
df = df[(df.age <102) & (df.age > 0)]
When I do this, it drops not only the meaningless ages but the null entries, too. I thought about filling with the mean first, but I don't want the meaningless ages to be included and misrepresent the mean.
This can be done in at least two ways.
Method one:
Keep the NaN values in your mask as well:
df = df[((df.age <102) & (df.age > 0))|(df.age.isnull())]
and then fill the NaN values in the age column only:
df = df.fillna({"age": df.age.mean()})
Method two:
Fill the NaN values with the mean computed on the masked DataFrame only:
df = df.fillna({"age": df[((df.age <102) & (df.age > 0))]["age"].mean()})
and then apply the mask:
df = df[((df.age <102) & (df.age > 0))]
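As a rough illustration of method one, here is a small self-contained sketch. The sample ages are invented, and the names valid and cleaned are only used for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: 150 and 0 are invalid, two entries are missing
df = pd.DataFrame({"age": [25, 150, np.nan, 40, 0, np.nan, 55]})

# Rows with a plausible age, using the same bounds as the question
valid = (df["age"] > 0) & (df["age"] < 102)

# Keep valid rows and rows where age is missing; drop only the invalid ages
cleaned = df[valid | df["age"].isnull()].copy()

# The mean is now computed from valid ages only (NaN is skipped by default)
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())
```

The .copy() call is there so that the later column assignment works on an independent DataFrame rather than on a filtered slice of df.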

Pandas - Fill in missing values choosing values from a normal distribution

The code below will generate only one value of a normal distribution, and fill in all the missing values with this same value:
helper_df = df.dropna()
df = df.fillna(numpy.random.normal(loc=helper_df.mean(), scale=numpy.std(helper_df)))
What can we do to generate a value for each missing value?
You can create a Series of normal draws, one per missing value. First extract the index of the NaN values in the column you are working on.
df: your DataFrame
col: the name of the column containing NaN values
index = df[df[col].isna()].index
value = np.random.normal(loc=df[col].mean(), scale=df[col].std(), size=df[col].isna().sum())
df[col] = df[col].fillna(pd.Series(value, index=index))
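Alternatively, you can create a Series of random draws with the same length as your DataFrame and pass it to fillna on the column (col is the column name):
df[col] = df[col].fillna(pd.Series(np.random.normal(size=len(df)), index=df.index))
fillna aligns the Series on the row index and ignores it wherever the value in a row is not missing. Note that np.random.normal() without loc and scale draws from the standard normal distribution, so pass both if the draws should match the column.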
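A minimal sketch of drawing one value per missing entry. It uses numpy.random.default_rng with a fixed seed, which is an assumption made here for reproducibility and is not part of the answers above. The column name col and the data are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)

# Hypothetical column with two missing values
df = pd.DataFrame({"col": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Boolean mask of the rows that need a value
missing = df["col"].isna()

# One draw per missing row, using the mean and std of the observed values
draws = rng.normal(loc=df["col"].mean(), scale=df["col"].std(), size=missing.sum())

# Assign the draws only to the missing rows
df.loc[missing, "col"] = draws
```

Because size equals the number of missing rows, each NaN receives its own draw instead of one shared value.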

Updating dataframe fills columns with nan

In my DataFrame, I first replace values larger than a value with nan, then create another DataFrame with the same column name and fill it with random numbers. Then I update the original DataFrame with the newly created one, but in rows where I first set the value of the column nan, all other columns become nan. Original rows with nan in that column do not have the same problem. Here is what I mean in pandas syntax:
df[df['column_name'] > 40] = np.nan
column_series = df['column_name']
null_indices = column_series[column_series.isnull()].index
random_df = pd.DataFrame(np.random.normal(mu, sigma, size=len(null_indices)), index=null_indices, columns=['column_name'])
df.update(random_df)
Here are some numbers to explain the situation better:
Number of nans in the column before replacing values > 40 with nan: 6685022
Number of rows with column value > 40: 329066
Number of rows with nan in every column except column_name after replacing: 329066
df[df['column_name'] > 40] = np.nan sets every column to NaN in the rows where column_name is > 40, not just column_name.
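Nihal is right, but I prefer this form (cleaner imo):
df.column_name.loc[df.column_name > 40] = np.nan
Note that this is chained assignment, which pandas may flag with a SettingWithCopyWarning and which does not modify df when Copy-on-Write is enabled; df.loc[df.column_name > 40, 'column_name'] = np.nan avoids that.
PS: it's a good idea to use Jupyter Notebook to see what the DataFrame looks like at each step.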
Maybe this works (it uses .loc, since .ix has been removed from recent pandas versions):
df.loc[df['column_name'] > 40, 'column_name'] = np.nan  # only touch column_name
column_series = df['column_name']
null_indices = column_series[column_series.isnull()].index
random_df = pd.DataFrame(np.random.normal(mu, sigma, size=len(null_indices)),
                         index=null_indices, columns=['column_name'])
df.update(random_df)
Use the recommended way:
df.loc[df['column_name'] > 40, 'column_name'] = np.nan
The problem arises from your first statement
df[df['column_name'] > 40] = np.nan
which means "replace ALL values in the selected rows with NaN". The later call
df.update(random_df)
only fills column_name, so the other columns in those rows stay NaN.
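The difference between the two assignments can be checked on a tiny frame. This is only a sketch: the data, the second column other, and the values of mu and sigma are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)
mu, sigma = 20.0, 5.0           # hypothetical distribution parameters

df = pd.DataFrame({"column_name": [10.0, 50.0, np.nan, 45.0],
                   "other": [1.0, 2.0, 3.0, 4.0]})

# Row-wise assignment: every column in the matching rows becomes NaN
bad = df.copy()
bad[bad["column_name"] > 40] = np.nan

# Label-based assignment: only column_name is touched
good = df.copy()
good.loc[good["column_name"] > 40, "column_name"] = np.nan

# Fill the NaNs in column_name with random draws, as in the question
null_indices = good.index[good["column_name"].isnull()]
random_df = pd.DataFrame(rng.normal(mu, sigma, size=len(null_indices)),
                         index=null_indices, columns=["column_name"])
good.update(random_df)
```

In bad, the column other loses two values; in good, it keeps all four and column_name ends up with no NaN.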

Pandas replace column values with a list

I have a dataframe df where some of the columns are strings and some are numeric. I am trying to convert all of them to numeric. So what I would like to do is something like this:
col = df.ix[:,i]
le = preprocessing.LabelEncoder()
le.fit(col)
newCol = le.transform(col)
df.ix[:,i] = newCol
but this does not work. Basically my question is how do I delete a column from a data frame then create a new column with the same name as the column I deleted when I do not know the column name, only the column index?
This should do it for you:
# Find the name of the column by index
n = df.columns[1]
# Drop that column
df.drop(n, axis = 1, inplace = True)
# Put whatever series you want in its place
df[n] = newCol
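...where the 1 in df.columns[1] can be whatever the column's positional index is. axis = 1 should not change, since it tells drop to remove a column rather than a row. Note that the re-added column ends up as the last column of the DataFrame.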
This answers your question very literally where you asked to drop a column and then add one back in. But the reality is that there is no need to drop the column if you just replace it with newCol.
newcol = [..,..,.....]
df['colname'] = newcol
This will keep the colname intact while replacing its contents with newcol.
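A small sketch of replacing a column in place when only its position is known. To keep the example free of scikit-learn, it swaps in pandas.factorize with sort=True instead of LabelEncoder; that substitution, the sample data, and the position i are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame with one string column at position 1
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})

i = 1                # positional index of the column to convert
n = df.columns[i]    # look up the column name from its position

# Integer codes for the string values; sort=True assigns codes in sorted order
codes, uniques = pd.factorize(df.iloc[:, i], sort=True)

# Assigning by name replaces the contents and keeps the column's position
df[n] = codes
```

Assigning to an existing name overwrites the column where it is, so no drop is needed and the column order is preserved.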

How do I overwrite the value of a specific index/column in a DataFrame?

I have a dataframe Exposure with zeros constructed as follows:
Exposure = pd.DataFrame(0, index=dates, columns=tickers)
and a DataFrame df with data.
I want to fill some of the data from df to Exposure:
for index, column in df.iterrows():
    # value of df(index,column) to be filled at Exposure(index,column)
How do I overwrite the value at (index,column) of Exposure with the value of df(index,column)?
The best way is:
df.loc[index, column] = value
You can try this:
for index, column in df.iterrows():
    Exposure.loc[index, column.index] = column.values
This will create new index labels and columns in Exposure if they don't exist. If you want to avoid this, first construct the common index and columns, then do the assignment in a vectorized way (avoiding the for loop):
common_index = Exposure.index.intersection(df.index)
common_columns = Exposure.columns.intersection(df.columns)
Exposure.loc[common_index, common_columns] = df.loc[common_index, common_columns]
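The vectorized assignment can be tried on a small example. This is a sketch with made-up dates, tickers, and values; Exposure is created with 0.0 rather than 0 so that the float data fits without a dtype change.

```python
import pandas as pd

# Hypothetical index and columns
dates = pd.date_range("2020-01-01", periods=3)
tickers = ["A", "B", "C"]

Exposure = pd.DataFrame(0.0, index=dates, columns=tickers)

# df covers only the first two dates and has a column D that Exposure lacks
df = pd.DataFrame({"A": [1.0, 2.0], "D": [9.0, 9.0]}, index=dates[:2])

# Restrict the assignment to labels present in both frames
common_index = Exposure.index.intersection(df.index)
common_columns = Exposure.columns.intersection(df.columns)
Exposure.loc[common_index, common_columns] = df.loc[common_index, common_columns]
```

Only column A on the first two dates is overwritten; column D is not added to Exposure, and the remaining cells stay at zero.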
