The values are getting replaced, but the moment I print the data it still shows the NaN values.
for col in data.columns:
    for each in range(len(data[col])):
        if math.isnan(data[col][each]) == True:
            data.replace(data[col][each], statistics.mean(data[col]))
data
dataset: https://docs.google.com/spreadsheets/d/1AVTVmUVs9lSe7I9EXoPs0gNaSIo9KM2PrXxwVWeqtME/edit?usp=sharing
It looks like what you are trying to do is replace the NaN values with the mean of each column, which has been treated here.
Regarding your problem, replace(a, b) returns a copy of your dataframe with every value equal to a replaced by b; since you never assign the result back (and don't pass inplace=True), the original data stays unchanged, which is why you still see the NaN values when you print it.
Moreover, statistics.mean will return NaN if there is a NaN in your list, so you should use numpy.nanmean() instead.
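Putting those two points together, here is a minimal sketch of the usual fix (the frame and column names below are illustrative, not taken from your dataset): DataFrame.mean() skips NaN by default, and passing the resulting Series of column means to fillna fills each column with its own mean.
import numpy as np
import pandas as pd

# hypothetical frame standing in for `data`
data = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, 2.0, 4.0]})

# mean() skips NaN by default (like numpy.nanmean);
# fillna() returns a new frame, so reassign it
data = data.fillna(data.mean(numeric_only=True))
print(data)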
I want to replace the missing values in a column containing people's ages (the column also has numerical values, not only NaN), but everything I've tried so far either doesn't work the way I want it to or doesn't work at all.
I want to fill them with draws from a normal distribution using the mean and standard deviation computed from that column.
I have tried the following:
Replacing with numpy: replaces NaN values, but with the same number for all of them
df_travel['Age'] = df_travel['Age'].replace(np.nan, round(rd.normalvariate(age_mean, age_std),0))
Fillna with pandas: also replaces NaN values, but with the same number for all of them
df_travel['Age'] = df_travel['Age'].fillna(round(rd.normalvariate(age_mean, age_std),0))
Applying a function on the dataframe with pandas: replaces NaN values, but also changes all existing numerical values (I only wish to fill the NaN values)
df_travel['Age'] = df_travel['Age'].where(df_travel['Age'].isnull() == True).apply(lambda v: round(rd.normalvariate(age_mean, age_std),0))
Any ideas would be appreciated. Thanks in advance.
Series.fillna can accept a Series, so generate a random array of size len(df_travel):
rng = np.random.default_rng(0)   # seeded generator for reproducibility
mu = df_travel['Age'].mean()     # mean of the existing ages (NaN skipped)
sd = df_travel['Age'].std()      # standard deviation of the existing ages
# one candidate value per row; fillna only uses them where 'Age' is NaN
filler = pd.Series(rng.normal(loc=mu, scale=sd, size=len(df_travel)))
df_travel['Age'] = df_travel['Age'].fillna(filler)
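One caveat worth noting (an assumption about df_travel): Series.fillna aligns on the index, and the filler above gets a default RangeIndex, so this works as written when df_travel also uses the default index. Otherwise, align the filler first:
# only needed if df_travel does not use the default RangeIndex
filler.index = df_travel.index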
I would go about it the following way:
# compute mean and std of `Age`
age_mean = df['Age'].mean()
age_std = df['Age'].std()
# number of NaN in `Age` column
num_na = df['Age'].isna().sum()
# generate `num_na` samples from N(age_mean, age_std**2) distribution
rand_vals = age_mean + age_std * np.random.randn(num_na)
# replace missing values with `rand_vals`
df.loc[df['Age'].isna(), 'Age'] = rand_vals
I am looking to filter the rows so that only those with values remain. I have tried several methods, but none of them work. To process the values further, I need the dataframe without any null or blank rows.
Input Data:
Col 1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88
NaN
Q_d15634_Apr90Apr91
S_e15336_may91Apr93
NaN
Expected Output:
Col 1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88
Q_d15634_Apr90Apr91
S_e15336_may91Apr93
Code I have been using:
df['Col 1'] = df[~df['Col 1'].isnull()]
ValueError: Wrong number of items passed 60, placement implies 1
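For what it's worth, the error happens because df[~df['Col 1'].isnull()] returns a whole DataFrame (here with 60 columns), which cannot be assigned into the single column df['Col 1']. A minimal sketch of one way to keep only the non-null rows (assuming the column is literally named 'Col 1'):
import numpy as np
import pandas as pd

# hypothetical frame reproducing the input above
df = pd.DataFrame({'Col 1': ['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88', np.nan,
                             'Q_d15634_Apr90Apr91', 'S_e15336_may91Apr93', np.nan]})

# drop rows where 'Col 1' is null; reset_index gives a clean 0..n-1 index
df = df.dropna(subset=['Col 1']).reset_index(drop=True)

# equivalently, with a boolean mask: df = df[df['Col 1'].notna()]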
I have a DataFrame with a lot of "bad" cells. Let's say they all have -99.99 as their value, and I want to remove them (set them to NaN).
This works fine:
df[df == -99.99] = None
But actually I want to delete all these cells ONLY if another cell in the same row is marked as 1 (e.g. in the column "Error").
I want to delete all -99.99 cells, but only if df["Error"] == 1.
The most straightforward solution, I think, is something like
df[(df == -99.99) & (df["Error"] == 1)] = None
but it gives me the error:
ValueError: cannot reindex from a duplicate axis
I have tried every solution I could find on the internet, but I can't get it to work! :(
Since my DataFrame is big, I don't want to iterate over it (which, of course, would work, but would take a lot of time).
Any hint?
Try using broadcasting while passing numpy values:
# sample data, special value is -99
df = pd.DataFrame([[-99, -99, 1], [2, -99, 2],
                   [1, 1, 1], [-99, 0, 1]],
                  columns=['a', 'b', 'Errors'])

# note the double square brackets
df[(df == -99) & (df[['Errors']] == 1).values] = np.nan
Output:
     a      b  Errors
0  NaN    NaN       1
1  2.0  -99.0       2
2  1.0    1.0       1
3  NaN    0.0       1
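An equivalent way to express the same broadcasting idea, shown here only as an alternative sketch on the sample frame above, is DataFrame.mask, which sets cells to NaN wherever the condition is True:
# same condition as above: cell equals -99 AND the row's Errors value is 1
df = df.mask((df == -99) & (df[['Errors']] == 1).values)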
At least, this is working (but with column iteration):
for i in df.columns:
    df.loc[df[i].isin([-99.99]) & df["Error"].isin([1]), i] = None
I have 2 columns in a dataframe, and I am trying to build a condition based on whether the second one is NaN while the first one has a value, unsuccessfully using:
if np.isfinite(train_bk['Product_Category_1']) and np.isnan(train_bk['Product_Category_2'])
and
if not (train_bk['Product_Category_2']).isnull() and (train_bk['Product_Category_3']).isnull()
I would use eval:
df = df.eval('ind = ((pc1 == pc1) & (pc2 != pc2)) * 2 + ((pc1 == pc1) & (pc2 == pc2)) * 3')
df = df.replace({'ind': {0: 1}})
(The x == x comparison is True only where x is not NaN, since NaN != NaN.)
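If the goal is simply "first column has a value and the second is NaN", a more explicit sketch (assuming the columns are named Product_Category_1 and Product_Category_2, as in the question) would be:
import numpy as np
import pandas as pd

# hypothetical frame standing in for train_bk
train_bk = pd.DataFrame({'Product_Category_1': [1.0, np.nan, 3.0],
                         'Product_Category_2': [np.nan, 2.0, 2.0]})

# element-wise mask: first column present, second column missing
mask = train_bk['Product_Category_1'].notna() & train_bk['Product_Category_2'].isna()

# rows matching the condition
print(train_bk[mask])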
I have a DataFrame which looks like this:
            1125400  5430095  1095751
2013-05-22   105.24      NaN  6507.58
2013-05-23   104.63      NaN  6393.86
2013-05-26   104.62      NaN  6521.54
2013-05-27   104.62      NaN  6609.31
2013-05-28   104.54    87.79  6640.24
2013-05-29   103.91    86.88  6577.39
2013-05-30   103.43    87.66  6516.55
2013-06-02   103.56    87.55  6559.43
I would like to compute the first non-NaN value in each column.
As "Locate first and last non NaN values in a Pandas DataFrame" points out, first_valid_index can be used. Unfortunately, it returns the first row in which at least one element is not NaN and does not work per-column.
You should use the apply function, which applies a function to each column (default) or each row efficiently:
>>> first_valid_indices = df.apply(lambda series: series.first_valid_index())
>>> first_valid_indices
1125400 2013-05-22 00:00:00
5430095 2013-05-28 00:00:00
1095751 2013-05-22 00:00:00
first_valid_indices will then be a Series containing the first valid index for each column.
You could also define the lambda function as a normal function outside:
def first_valid_index(series):
    return series.first_valid_index()
and then call apply like this:
df.apply(first_valid_index)
The built-in function DataFrame.groupby().column.first() returns the first non-null value in the column, while last() returns the last.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html
If you don't have a natural column to group by (i.e. you want the first value per column rather than per group), you can add a dummy column of 1s. Then get the first non-null value using the groupby & first functions.
from pandas import DataFrame
df = DataFrame({'a':[None,1,None],'b':[None,2,None]})
df['dummy'] = 1
df.groupby('dummy').first()
df.groupby('dummy').last()
By compute I assume you mean access?
The simplest way to do this is with the pd.Series.first_valid_index() method probably inside a dict comprehension:
values = {col : DF.loc[DF[col].first_valid_index(), col] for col in DF.columns}
values
Just to be clear, each column in a pandas DataFrame is a Series. So the above is the same as doing:
values = {}
for column in DF.columns:
    First_Non_Null_Index = DF[column].first_valid_index()
    values[column] = DF.loc[First_Non_Null_Index, column]
So the operation in my one-line solution is on a per-column basis, i.e. it is not going to create the type of error you seem to be suggesting in the edit you made to the question. Let me know if it does not work as expected.