Hi there, I'm new to Python and learning through notebooks. I was given the iris dataset as a .csv file and asked to replace the values of one column in some particular rows with NaN. I have tried the "fillna" and "replace" functions, but without success. Here is my code:
import pandas as pd
import numpy as np
from numpy import nan as NaN
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(url)
iris.columns = ['sepal_length','sepal_width','petal_length','petal_width','class']
iris.columns
#iris
iris.petal_length.fillna(np.nan)
iris1=iris.iloc[10:30]
print (iris1)
#bool_series = pd.isnull(iris['petal_length'])
#print (df)
It looks like the problem is that you are not saving the result of .fillna() or .replace(). By default, those methods return a new object and leave the original unchanged. To fix this, either assign the result back to a variable or pass inplace=True to your replace() or fillna() calls. Note also that fillna(np.nan) only touches values that are already missing, so it cannot introduce new NaNs.
I think you can use:
This replaces <some_value> with np.nan in the petal_length column (assign the result back, since replace() returns a new Series):
iris['petal_length'] = iris.petal_length.replace(<some_value>, np.nan)
This sets every column to NaN in the rows where petal_length is equal to <some_value>:
iris[iris.petal_length == <some_value>] = np.nan
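For the original question (setting one column to NaN in particular rows only), a minimal sketch using .loc might look like the following. The small frame and the rows chosen here are made up for illustration; they are not taken from the iris file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris data (values are illustrative only).
iris = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4],
})

# Set petal_length to NaN in rows 1 through 3 (.loc slices include both ends).
iris.loc[1:3, "petal_length"] = np.nan

# Set petal_length to NaN wherever a condition holds, leaving other columns alone.
iris.loc[iris["petal_length"] == 1.7, "petal_length"] = np.nan

print(iris)
```

Because .loc assigns in place, nothing needs to be saved back to a variable here.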
I have tried with everything I can come up with and would appreciate some help! :)
This is a function that is supposed to return an imputed part of a data frame:
from statistics import mean
from unicodedata import numeric
def imputation(df, columns_to_imputed):
    # Step 1: Get a part of dataframe using columns received as a parameter.
    import pandas as pd
    import numpy as np
    df.set_axis(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], axis=1, inplace=True)  # Sets the column headers
    part_of_df = pd.DataFrame(df.filter(columns_to_imputed, axis=1))
    part_of_df = part_of_df.drop([0], axis=0)
    # Step 2: Change the zero values in the columns to np.nan
    part_of_df = part_of_df.replace('0', np.nan)
    # Step 3: Change the nan values to the mean of each attribute (column).
    # You can use the apply(), fillna() functions.
    part_of_df = part_of_df.fillna(part_of_df.mean(axis=0))  ##### I've tried everything on this row, can't get it to work. I want to fill each nan-value with the mean of the column it's in.
    return part_of_df  #### I'm returning this part to see if the nans are replaced, but nothing's happened...
You were on the right track; you just need to make a small change. Here I created a sample DataFrame and introduced some NaNs:
dummy_df = pd.DataFrame({"col1":range(5), "col2":range(5)})
dummy_df['col1'][1] = None
dummy_df['col1'][3] = None
dummy_df['col2'][4] = None
and got this:
Disclaimer: Don't use my method of value assignment. Use proper indexing through loc.
Now, I use apply() and lambda to iterate over each column and fill NaNs with the mean value:
dummy_df = dummy_df.apply(lambda x: x.fillna(x.mean()), axis=0)
This gives me:
Hope this helps!
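As a sketch of the same idea with loc-based assignment (as the disclaimer above recommends), the example below builds a similar frame and fills each NaN with its column mean. It uses fillna() with the per-column means directly, which is a swapped-in alternative to the apply() and lambda approach; the variable name filled is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame with float columns so NaN can be stored without a dtype change.
dummy_df = pd.DataFrame({"col1": np.arange(5.0), "col2": np.arange(5.0)})

# Introduce NaNs with .loc instead of chained indexing.
dummy_df.loc[[1, 3], "col1"] = np.nan
dummy_df.loc[4, "col2"] = np.nan

# dummy_df.mean() is a Series of per-column means (NaNs are skipped);
# fillna() matches it to the columns by name.
filled = dummy_df.fillna(dummy_df.mean())
print(filled)
```

If the zeros in the original data were read in as numbers rather than strings, replace(0, np.nan) rather than replace('0', np.nan) would be needed before this step; which one applies depends on how the file was loaded.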
I am currently learning pandas and I am using an IMDb movies database, in which one of the columns is the duration of the movies. However, one of the values is "None", so I can't calculate the mean because of this string in the middle. I thought of changing the "None" to 0, but that would skew the results, as can be seen with the code below.
dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()
Any ideas on what I should do in order to not skew the data? I also graphed it and it becomes more clear how it skews it.
You can replace "None" with numpy.nan instead of using 0.
Something like this should do the trick:
import numpy as np
dur_temp = duration.replace("None", np.nan).astype(float)
descricao_duration = dur_temp.mean()
If you want it to work for any string in your pandas Series, you could use pd.to_numeric:
pd.to_numeric(dur_temp, errors='coerce').mean()
This way, every value that cannot be converted to a number is replaced by NaN, whatever it is.
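A small self-contained sketch of the pd.to_numeric approach follows. The duration values are invented for illustration and are not from the IMDb data.

```python
import pandas as pd

# Hypothetical durations stored as strings, with one "None" placeholder.
duration = pd.Series(["120", "95", "None", "150"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(duration, errors="coerce")

# mean() skips NaN by default, so the placeholder does not skew the result.
mean_skipping_nan = numeric.mean()

# For comparison: treating the missing value as 0 pulls the mean down.
mean_with_zero = numeric.fillna(0).mean()

print(mean_skipping_nan, mean_with_zero)
```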
Just filter by condition like this:
df[df['a']!='None'] # assuming the values you want to average are in column a
Make them np.nan values.
I am writing this as an answer because I can't comment:
df = df.replace('None', np.nan)
or
df.replace('None', np.nan, inplace=True)
You can use fillna(value=np.nan) as shown below:
descricao_duration = dur_temp.fillna(value=np.nan).mean()
Demo:
import pandas as pd
import numpy as np
dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)
Output:
duration 15.0
dtype: float64
I want to change existing dataframe values to NaN. How can I do this for several columns at once?
dataframe['A', 'B'....] = np.nan
I tried this but nothing changed
Double brackets are required in this case, to pass the list of columns to the dataframe index operator []. For the OP's case, that would be:
dataframe[['A', 'B'....]] = np.nan
Reproducible example:
import numpy as np
import pandas as pd
data = {'ColA':[1,2,3], 'ColB':[4,5,6], 'ColC':[7,8,9], 'ColD':[-1,-2,-3]}  # avoid shadowing the built-in dict
df = pd.DataFrame(data, index=['I1','I2','I3'])
print(df)
df[['ColA','ColD']]=np.nan
print(df)
Note:
This solution was originally suggested via comment, now included as an answer with a reproducible example for future reference.
I want to replace "?" with NaN in Python.
The following code does not work, and I am not sure what the reason is.
import pandas as pd;
import numpy as np;
col_names = ['BI_RADS', 'age','shape','margin','density','severity']
dataset = pd.read_csv('mammographic_masses.data.txt', names = col_names)
dataset.replace("?", np.NaN)
After executing the above code, I still get those question marks in the dataset.
The format of the dataset looks like the following:
5,67,3,5,3,1
4,43,1,1,?,1
5,58,?,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
replace() returns a new DataFrame by default, so either assign the result back or use inplace=True.
Ex:
dataset.replace("?", np.nan, inplace=True)
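As an alternative to replacing afterwards, the "?" markers can be turned into NaN while the file is read, using the na_values parameter of read_csv. The sketch below feeds the five sample rows from the question through io.StringIO instead of the real file, so it is only an illustration of the idea.

```python
import io

import pandas as pd

# The five sample rows from the question, standing in for the real file.
raw = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,?,5,3,1\n"
    "4,28,1,1,3,0\n"
    "5,74,1,5,?,1\n"
)

col_names = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']

# na_values tells the parser to treat "?" as missing, so the affected
# columns come out numeric (float) instead of as strings.
dataset = pd.read_csv(raw, names=col_names, na_values="?")

print(dataset.isna().sum())
```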
I have a large dataset where multiple columns had NaN values. I used Python pandas to replace the missing values in a few columns with the mean and the rest with the median. I got rid of all the NaN values and wrote the resulting DataFrame to a new file.
Now when I read the new file again, it contains NaNs instead of values. I am unable to figure out why this is happening. Below is my code for reference:
df = pd.DataFrame.from_csv('temp_train.csv',header=0)
df.prop_review_score=df.prop_review_score.fillna(0)
mean_score_2 = np.mean(df.prop_location_score2)
df.prop_location_score2 = df.prop_location_score2.fillna(mean_score_2)
median_search_query = np.median(df.srch_query_affinity_score)
df.srch_query_affinity_score = df.srch_query_affinity_score.fillna(median_search_query)
median_orig_distance = np.median(df.orig_destination_distance)
df.orig_destination_distance = df.orig_destination_distance.fillna(median_orig_distance)
df.to_csv('final_train_data.csv')
Now in another script when I type the following I get NaNs in srch_query_affinity_score
df = pd.DataFrame.from_csv('final_train_data.csv',header=0)
print df
I would recommend using pandas.DataFrame.median instead of numpy.median on the dataframe, because the pandas version skips NaNs.
A quick test for me shows (when there are NaNs in the data as Woody suggests):
df = pd.DataFrame({'x':[10,np.nan,np.nan,20]})
df.x.median() # returns 15.0
np.median(df.x) # returns nan
So consider replacing:
median_search_query = np.median(df.srch_query_affinity_score)
with
median_search_query = df.srch_query_affinity_score.median()
To make sure no NaNs remain before you write to CSV, do something like:
assert df.srch_query_affinity_score.isnull().sum() == 0
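The difference can be checked end to end with a small sketch. The column name x and its values are illustrative; np.nanmedian is shown as a NumPy-side alternative that also ignores NaNs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [10, np.nan, np.nan, 20]})

# pandas skips NaN by default when computing the median.
pandas_median = df["x"].median()

# np.median propagates NaN, so filling with it leaves the NaNs in place.
numpy_median = np.median(df["x"].to_numpy())

# np.nanmedian ignores NaN, matching the pandas result.
nan_median = np.nanmedian(df["x"].to_numpy())

# Filling with the pandas median removes every NaN from the column.
filled = df["x"].fillna(pandas_median)
assert filled.isnull().sum() == 0

print(pandas_median, numpy_median, nan_median)
```

This would explain the symptom in the question: filling with a median that is itself NaN changes nothing, so the NaNs are written to the CSV and read back unchanged.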