Pandas - Fill in missing values choosing values from a normal distribution - python

The code below generates only one value from a normal distribution and fills in all the missing values with that same value:
helper_df = df.dropna()
df = df.fillna(numpy.random.normal(loc=helper_df.mean(), scale=numpy.std(helper_df)))
What can we do to generate a value for each missing value?

You can create a Series of normally distributed values. First extract the index of the NaN values in the column you are working on.
import numpy as np
import pandas as pd
# df: your dataframe
# col: the column containing NaN values
index = df[df['col'].isna()].index
values = np.random.normal(loc=df['col'].mean(), scale=df['col'].std(), size=df['col'].isna().sum())
df['col'] = df['col'].fillna(pd.Series(values, index=index))

You can also create a Series of random values with the same length as your DataFrame, aligned on its index, and then apply fillna to the column:
df['col'] = df['col'].fillna(pd.Series(np.random.normal(size=len(df)), index=df.index))
Where a value is not missing, fillna simply leaves it untouched.
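Putting this together, here is a minimal end-to-end sketch (the column name 'value' and the sample numbers are made up for illustration) that draws one value per missing entry from a normal distribution fitted to the observed values:
import numpy as np
import pandas as pd
# hypothetical single-column frame with a few missing entries
df = pd.DataFrame({'value': [1.0, 2.0, np.nan, 4.0, np.nan, 6.0]})
observed = df['value'].dropna()
mask = df['value'].isna()
# one independent draw per missing entry, using the mean/std of the observed values
draws = np.random.normal(loc=observed.mean(), scale=observed.std(), size=mask.sum())
df.loc[mask, 'value'] = draws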

Related

fill missing value based on index

I compute the percentage of missing values like this:
null = data.isnull().sum()*100/len(data)
and filter it like this:
null_column = null[(null >= 10.0) & (null <= 40.0)].index
The output type is an Index. How can I use fillna to replace the missing values with the median of each column selected by that index? My code before was:
null_column = null[(null >= 10.0) & (null <= 40.0)].index
data.fillna(percent_column2.median(), inplace=True)
The result is always that the index doesn't have a median. When I remove the index step it runs, but the value used for the replacement is not the median of each column; it is computed from the missing-value percentages rather than from the original DataFrame. How can I fill the NaN values based on that index so each column in the original DataFrame gets its own median?
I guess something like this (the filtered entries are column labels, so select them with .loc):
import numpy as np
import pandas as pd
data = pd.DataFrame([[0, 1, np.nan], [np.nan, 1, np.nan], [1, np.nan, 2], [23, 12, 3], [1, 3, 1]])
null = data.isnull().sum() * 100 / len(data)
cols = list(null[(null >= 10) & (null <= 40)].index)
data.loc[:, cols] = data.loc[:, cols].fillna(data.loc[:, cols].median())
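As a quick check on the sample frame above (continuing from the snippet), each selected column is filled with its own median (1.0, 2.0 and 2.0 here), so no NaN remains:
print(data.isnull().sum())  # all zeros after the per-column fill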

How to count the rows with the conditions?

I have a data set something like this:
import pandas as pd
# initialize data of lists.
data = {'name': ['x', 'y', 'z'],
        'value': ['fb', 'nan', 'ti']}
# Create DataFrame
df = pd.DataFrame(data)
I now want to check the value column and count the number of rows where value is neither 'fb' nor 'nan' (null values).
How can I do this?
df[~df.value.isin(['fb','nan'])].shape[0]
Here we select only the rows where value is not in that list, then read the row count from the shape of the filtered DataFrame.
Output
1
This would be the result dataframe
name value
2 z ti
If in the future you also want to ignore rows where the value column is actually NA (missing values such as None or numpy.NaN), you can use this:
df[(~df.value.isin(['fb','nan'])) & (~df.value.isnull())].shape[0]
To count the values that are neither fb nor nan:
(~df.value.str.contains('fb|nan')).sum()
Omit the tilde if you want to count the fb and nan values.
Just build a boolean condition checking for "fb" or "nan", negate it, and use sum to get the count of True's (in this DataFrame 'nan' is a literal string, so eq('nan') is used rather than isna()):
>>> (~(df['value'].eq('fb') | df['value'].eq('nan'))).sum()
1
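A side note: 'nan' here is stored as a literal string, so isnull()/isna() will not flag it. A small sketch, under that assumption, of converting the placeholder string into a real missing value first so the NA-aware methods work:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['x', 'y', 'z'], 'value': ['fb', 'nan', 'ti']})
df['value'] = df['value'].replace('nan', np.nan)  # turn the placeholder string into a real NaN
print(df[(df['value'] != 'fb') & df['value'].notna()].shape[0])  # 1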

Is there any way to shift row values in the dataframe?

I want to shift the values of row 10, Fintech, into the next column and fill the city column in the same row with Bahamas. Is there any way to do that?
I found pandas' DataFrame.shift() function, but it shifts all the values rather than just the one row.
Use DataFrame.shift with filtered rows and axis=1:
# test for missing values (None/NaN)
m = df['Select Investors'].isna()
# if the missing values are the string 'None', test for them instead:
#m = df['Select Investors'].eq('None')
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)
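The original table is not shown, so here is a minimal sketch with made-up data (the column names Company, Country, Industry and Select Investors are assumptions) illustrating the row-wise shift and the fill:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Company': ['Alpha', 'Beta'],
    'Country': ['USA', 'Fintech'],             # second row is misaligned: shifted one column left
    'Industry': ['Fintech', 'Investor A'],
    'Select Investors': ['Investor B', np.nan],
})
m = df['Select Investors'].isna()              # misaligned rows end with an empty last column
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)
df.loc[m, 'Country'] = 'Bahamas'               # fill the now-empty column for those rows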

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame( columns = ['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst=['adam','beth']
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#note: when a name matches neither reference word, ffill() just copies the previous row's value, which can be wrong
Explanation:
lst=['adam','beth']
#created a list of words
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
#check whether the 'Name' column contains each word from the list (case-insensitively); this gives a boolean Series, and .map turns True into the word itself and False into NaN; concatenating the resulting Series along axis=1 produces a DataFrame
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#backfill along axis=1 and take the first column
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#Forward filling the missing values
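Since the question mentions fuzzy matching, here is an alternative sketch using the standard library's difflib.get_close_matches as a lightweight fuzzy matcher (the 0.5 cutoff is an assumption to tune for your data); on this sample it maps every variant to the expected reference name:
import difflib
import pandas as pd

df = pd.DataFrame({'Name': ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.',
                            'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']})
ref_names = ['adam', 'beth']

def closest(name):
    # compare in lowercase and return the best reference match,
    # or the original name if nothing is close enough
    matches = difflib.get_close_matches(name.lower(), ref_names, n=1, cutoff=0.5)
    return matches[0] if matches else name

df['Name Corrected'] = df['Name'].apply(closest)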

How do I overwrite the value of a specific index/column in a DataFrame?

I have a dataframe Exposure with zeros constructed as follows:
Exposure = pd.DataFrame(0, index=dates, columns=tickers)
and a DataFrame df with data.
I want to fill some of the data from df to Exposure:
for index, column in df.iterrows():
    # value of df(index, column) to be filled at Exposure(index, column)
How do I overwrite the value at (index, column) of Exposure with the value of df at (index, column)?
The best way is:
df.loc[index, column] = value
You can try this:
for index, column in df.iterrows():
    # 'column' here is the row Series yielded by iterrows; its index holds the column labels
    Exposure.loc[index, column.index] = column.values
This will create new index entries and columns in Exposure if they don't already exist. If you want to avoid that, build the common index and columns first, then do the assignment in a vectorized way (avoiding the for loop):
common_index = Exposure.index.intersection(df.index)
common_columns = Exposure.columns.intersection(df.columns)
Exposure.loc[common_index, common_columns] = df.loc[common_index, common_columns]
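A small self-contained usage sketch (the dates, tickers and fill value are made up) showing that only the overlapping cells of Exposure are overwritten:
import pandas as pd

dates = pd.date_range('2024-01-01', periods=4)
tickers = ['AAA', 'BBB', 'CCC']
Exposure = pd.DataFrame(0.0, index=dates, columns=tickers)

# df covers only part of Exposure's index and columns
df = pd.DataFrame(1.5, index=dates[:2], columns=['AAA', 'CCC'])

common_index = Exposure.index.intersection(df.index)
common_columns = Exposure.columns.intersection(df.columns)
Exposure.loc[common_index, common_columns] = df.loc[common_index, common_columns]
# cells outside the overlap remain 0.0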
