Replace NaN with a random value every row - python

I have a dataset with a column 'Self_Employed'. This column contains the values 'Yes', 'No' and NaN. I want to replace the NaN values with a value that is calculated in calc(). I've tried some methods I found on here, but I couldn't find one that was applicable to my case.
Here is my code; I put the things I've tried in comments:
# Handling missing data - Self_Employed
SEyes = (df['Self_Employed'] == 'Yes').sum()
SEno = (df['Self_Employed'] == 'No').sum()

def calc():
    rand_SE = randint(0, (SEno + SEyes))
    if rand_SE > 81:
        return 'No'
    else:
        return 'Yes'
# df['Self_Employed'] = df['Self_Employed'].fillna(randint(0,100))
# df['Self_Employed'].isnull().apply(lambda v: calc())

# df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
# df[df['Self_Employed']]

# df_nan['Self_Employed'] = df_nan['Self_Employed'].isnull().apply(lambda v: calc())
# df_nan

# for i in range(df['Self_Employed'].isnull().sum()):
#     print(df.Self_Employed[i])
df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
df
Now the line where I tried it with df_nan seems to work, but then I have a separate set with only the former missing values, while I want to fill the missing values in the whole dataset. For the last row I'm getting an error; I linked a screenshot of it.
Do you understand my problem, and if so, can you help?
This is the set with only the rows where Self_Employed is NaN
This is the original dataset
This is the error

Make sure that SEno + SEyes is not 0, and use the .loc method to set the value for Self_Employed where it is empty:
SEyes = (df['Self_Employed'] == 'Yes').sum() + 1
SEno = (df['Self_Employed'] == 'No').sum()

def calc():
    rand_SE = np.random.randint(0, (SEno + SEyes))
    if rand_SE >= 81:
        return 'No'
    else:
        return 'Yes'
df.loc[df['Self_Employed'].isna(), 'Self_Employed'] = df.loc[df['Self_Employed'].isna(), 'Self_Employed'].apply(lambda x: calc())
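If you'd rather draw 'Yes'/'No' in proportion to the counts observed in the column instead of the hard-coded 81 cutoff, a vectorized sketch with np.random.choice (my variation, not part of the answer above):

import numpy as np

mask = df['Self_Employed'].isna()
p_yes = SEyes / (SEyes + SEno)  # observed share of 'Yes'
df.loc[mask, 'Self_Employed'] = np.random.choice(['Yes', 'No'], size=mask.sum(), p=[p_yes, 1 - p_yes])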

What about df['Self_Employed'] = df['Self_Employed'].fillna(calc())?
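Note that this calls calc() once, before fillna runs, so every NaN receives the same value. If each missing row should get its own draw, applying the function per value is one option, e.g.:

df['Self_Employed'] = df['Self_Employed'].apply(lambda v: calc() if pd.isna(v) else v)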

You could first identify the locations of your NaNs like
na_loc = df.index[df['Self_Employed'].isnull()]
Count the number of NaNs in your column like
num_nas = len(na_loc)
Then generate a matching number of random values, already indexed and set up (assuming import random):
fill_values = pd.DataFrame({'Self_Employed': [random.randint(0, 100) for i in range(num_nas)]}, index=na_loc)
And finally substitute those values in your dataframe
df.loc[na_loc, 'Self_Employed'] = fill_values['Self_Employed']
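Since Self_Employed holds 'Yes'/'No' strings rather than numbers here, the same pattern can reuse the question's calc() function, e.g. (a minimal sketch):

na_loc = df.index[df['Self_Employed'].isnull()]
df.loc[na_loc, 'Self_Employed'] = [calc() for _ in na_loc]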

Apply custom function to entire dataframe

I have a function which calls another one. The objective is, by calling the function get_substr, to extract a substring based on the position of the nth occurrence of a character.
def find_nth(string, char, n):
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start + len(char))
        n -= 1
    return start

def get_substr(string, char, n):
    if n == 1:
        return string[0:find_nth(string, char, n)]
    else:
        return string[find_nth(string, char, n - 1) + len(char):find_nth(string, char, n)]
The function works.
Now I want to apply it on a dataframe by doing this.
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'],'-',1))
I get an error:
KeyError: 'EQ'
I don't understand it as df_g['EQ'] exists.
Can you help me?
Thanks
You forgot about axis=1; without it the function is applied to each column rather than to each row. Consider this simple example:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df['Z'] = df.apply(lambda x: x['A'] * 100, axis=1)
print(df)
Output:
   A  B    Z
0  1  3  100
1  2  4  200
As a side note, if you are working with values from a single column you might use pandas.Series.apply rather than pandas.DataFrame.apply; in the above example that would mean
df['Z'] = df['A'].apply(lambda x:x*100)
in place of
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
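Applied to the question's code, either of these should work:

df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'], '-', 1), axis=1)
# or, since only one column is involved:
df_g['F'] = df_g['EQ'].apply(lambda x: get_substr(x, '-', 1))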

apply command only when value exists once

I have the following code that masks values equal to ten, and then the next closest value. But actually I need to apply it only if 10 occurs exactly once in the column ending in '_ans'. So the mask should only be applied for the column 'a_ans', because there are two 10s in 'b_ans'.
Any comments welcome, thanks.
df = pd.DataFrame(data={'a_ans': [0, 1, 1, 10, 11],
                        'a_num': [1, 8, 90, 2, 8],
                        'b_ans': [0, 10, 139, 10, 18],
                        'b_num': [15, 43, 90, 14, 87]}).astype(float)

out = []
for i in ['a_', 'b_']:
    pairs = df.loc[:, df.columns.str.startswith(i)]  # pair columns
    mask1 = pairs[i + 'ans'] == 10  # mask values equal to 10
    mask2 = pairs[i + 'ans'].eq(pairs[i + 'ans'].mask(mask1).max())  # get the next highest value
    pairs = pairs.mask(mask1, 1001).mask(mask2, 1002)  # replace values
    out.append(pairs)
You can use value_counts() to get the number of occurrences of each value within the column:

if pairs[i + 'ans'].value_counts()[10] == 1:
    # apply mask logic
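One caveat (my addition, not from the answer above): value_counts()[10] raises a KeyError when the column contains no 10 at all, so a guard like this is safer:

if pairs[i + 'ans'].eq(10).sum() == 1:
    # apply mask logic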
The following modification could be useful, though it is not clear whether the next value should be the closest or the highest:
df = pd.DataFrame(data={'a_ans': [0, 1, 1, 10, 11],
                        'a_num': [1, 8, 90, 2, 8],
                        'b_ans': [0, 10, 139, 10, 18],
                        'b_num': [15, 43, 90, 14, 87]}).astype(float)

out = []
for i in ['a_', 'b_']:
    pairs = df.loc[:, df.columns.str.startswith(i + "ans")]  # for only _ans columns
    if len(pairs[pairs[i + 'ans'] == 10]) == 1:  # for only one ten
        mask1 = pairs[i + 'ans'] == 10  # mask values equal to 10
        mask2 = pairs[i + 'ans'].eq(pairs[i + 'ans'].mask(mask1).max())
        pairs = pairs.mask(mask1, 1001).mask(mask2, 1002)
    out.append(pairs)
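Note that selecting with startswith(i + "ans") keeps only the _ans column, so the _num columns drop out of out. To keep the column pairs while still guarding on a single 10, one option (my sketch, merging the guard with the question's original loop) is:

out = []
for i in ['a_', 'b_']:
    pairs = df.loc[:, df.columns.str.startswith(i)]  # both columns of the pair
    if pairs[i + 'ans'].eq(10).sum() == 1:  # only mask when 10 occurs exactly once
        mask1 = pairs[i + 'ans'] == 10
        mask2 = pairs[i + 'ans'].eq(pairs[i + 'ans'].mask(mask1).max())
        pairs = pairs.mask(mask1, 1001).mask(mask2, 1002)
    out.append(pairs)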

Dataframes from arrays with different lengths - fill missing values by mean of row

I want to create a dataframe out of arrays with different sizes, and fill the missing values depending on similar values.
I've tried sticking the arrays together and doing a sort and a split with numpy. I then calculate the mean of each split and decide whether a value is close enough to the mean, or whether it's better to fill with NaN.
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx

# generate sample data
loa = [((np.arange(np.random.randint(1, 3), np.random.randint(3, 6))) * val).tolist()
       for val in np.random.uniform(0.9, 1.1, 5)]

# reshape
flat_list = sum(loa, [])

# add some attributes
attributes = [np.random.randint(-3, -1) for x in range(len(flat_list))]

# sort and split on percentage change
flat_list.sort()
arr = np.array(flat_list)
arr_splits = np.split(arr, np.argwhere(np.diff(arr) / arr[1:] * 100 > 12)[:, 0])

# means of the splits
means = [np.mean(arr) for arr in arr_splits]

# create dataframe
i = 0
res = np.zeros((len(loa), len(means) * 2)) * np.nan
for row, l in enumerate(loa):
    for val in l:
        col = find_nearest(means, val)
        res[row, col] = val
        res[row, col + len(means)] = attributes[i]
        i = i + 1
df = pd.DataFrame(res)
Is there another way to do this more directly with pandas, or something more elegant?

Pandas - how to change cell according to next 10 cell's average

I have a dataset that I am trying to clean up. The data is all numeric. Basically, if there is a cell that is below 0 or above 100, I want to set it to NaN. I solved this with this code:
for col in df:
    df.loc[df[col] < 0, col] = numpy.NaN
    df.loc[df[col] > 100, col] = numpy.NaN
For values above 0 but below 20, I need to check the 10 cells above and below. If the value differs by more than 20 from the average of either the 10 cells above or the 10 cells below it in the same column, it should also be set to numpy.NaN.
I am not sure how to go about this one quite yet. After reading the documentation I know that I can pass a function into df.loc[] that returns a boolean list. However, I am not sure how to access the passed-in value's index to check the 10 values above and below. I think it could look something like this, but I am not even sure whether this would properly produce a boolean list the way pd.DataFrame.loc[] wants it.
def myFunc(value):
    # access index and create avgs for both tenBefore and tenAfter
    if abs(tenBeforeAvg - value) > 20 or abs(tenAfterAvg - value) > 20:
        return False
    else:
        return True

for col in df:
    df.loc[df[col] < 0, col] = numpy.NaN
    df.loc[df[col] > 100, col] = numpy.NaN
    df.loc[myFunc(df[col]), col] = numpy.NaN
Thanks ahead.
Perhaps this can help you on the way.
You can compare your DataFrame with a rolling-mean DataFrame, and with a reversed one for the averages below.
However, due to the NaNs in your dataframe the average will not always be calculated, so you can ensure that it is calculated regardless by using min_periods.
Do check whether it's accurate, as I haven't.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-10, 110, (100, 3)))

# remove those higher than 100, lower than 0
df[(df < 0) | (df > 100)] = np.nan

mean_desc = df.rolling(10, min_periods=1).mean()
mean_asc = df[::-1].rolling(10, min_periods=1).mean()[::-1]  # reversed rolling avg., flipped back into the original row order

# NaN out values in (0, 20) that are more than 20 away from either rolling average
df[(df > 0) & (df < 20) & (((df - mean_desc).abs() > 20) | ((df - mean_asc).abs() > 20))] = np.nan
print(df)
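Note that rolling(10) includes the current cell in its window. If the 10 cells strictly above and strictly below are wanted instead, a shifted variant (my sketch, in the same spirit as the code above) excludes it:

mean_above = df.shift(1).rolling(10, min_periods=1).mean()  # mean of the 10 cells above each cell
mean_below = df[::-1].shift(1).rolling(10, min_periods=1).mean()[::-1]  # mean of the 10 cells below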

Fill pandas data frame using .append()

I have a dataframe with a column containing comma-separated strings. What I want to do is separate them by comma, count them and append the counted number to a new data frame. If the column contains a list with only one element, I want to differentiate whether it is a string or an integer. If it is an integer, I want to append the value 0 in that row to the new df.
My code looks as follows:
def decide(dataframe):
    df = pd.DataFrame()
    for liste in DataFrameX['Column']:
        x = liste.split(',')
        if len(x) > 1:
            df.append(pd.Series([len(x)]), ignore_index=True)
        else:
            # check if element in list is int
            for i in x:
                try:
                    int(i)
                    print i
                    x = []
                    df.append(pd.Series([int(len(x))]), ignore_index=True)
                except:
                    print i
                    x = [1]
                    df.append(pd.Series([len(x)]), ignore_index=True)
    return df
The input data looks like this:
     C1
0    a,b,c
1    0
2    a
3    ab,x,j
If I now run the function with my original dataframe as input, it returns an empty dataframe. Through the print statements in the try/except blocks I could see that everything works. The problem is appending the resulting values to the new dataframe. What do I have to change in my code? If possible, please do not give an entirely different solution, but tell me what I am doing wrong in my code so I can learn.
UPDATE:
I edited the code so that it can be called as a lambda function. It looks like this now:
def decide(x):
    for liste in DataFrameX['Column']:
        x = liste.split(',')
        if len(x) > 1:
            x = len(x)
            print x
        else:
            # check if element in list is int
            for i in x:
                try:
                    int(i)
                    x = []
                    x = len(x)
                    print x
                except:
                    x = [1]
                    x = len(x)
                    print x
And I call it like this:
df['Count']=df['C1'].apply(lambda x: decide(x))
It prints the right values, but the new column only contains None.
Any ideas why?
This is a good start; it could be simplified, but I think it works as expected.
# I have a dataframe with a column containing comma separated strings.
df = pd.DataFrame({'data': ['apple, peach', 'banana, peach, peach, cherry', 'peach', '0']})

# What I want to do is separate them by comma and count them.
df['data'] = df['data'].str.split(',')
df['count'] = df['data'].apply(lambda row: len(row))

# If the column contains a list with only one element,
# I want to differentiate whether it is a string or an integer.
df['first'] = df['data'].apply(lambda row: row[0])
df['first'] = pd.to_numeric(df['first'], errors='coerce')

# If the element is an integer, the count should be set to zero.
df.loc[pd.notnull(df['first']), 'count'] = 0

# Drop the temporary column.
df.drop(columns='first', inplace=True)
df
                             data  count
0                  [apple, peach]      2
1  [banana, peach, peach, cherry]      4
2                         [peach]      1
3                             [0]      0
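As to what the original code is doing wrong: DataFrame.append (since deprecated in favour of pd.concat) does not modify the frame in place; it returns a new DataFrame, so every appended row was being discarded. Reassigning the result fixes the empty dataframe:

df = df.append(pd.Series([len(x)]), ignore_index=True)

Likewise, the updated decide() prints the right values but has no return statement, so it returns None for every row; returning the computed value at the end (and dropping the loop over DataFrameX, since apply already passes each cell value in) would fix the None column.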
