ValueError: Input contains NaN
i have run
from sklearn.preprocessing import OrdinalEncoder
data_.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:,1:-1])
here is data_
Age Sex Embarked Survived
0 22.0 male S 0
1 38.0 female C 2
2 26.0 female S 2
3 35.0 female S 2
4 35.0 male S 0
Before doing any processing, you always have to preprocess the data and get a summary of what it looks like. Concretely, the error you obtained is telling you that you have NaN values. To check it, try this command:
df.isnull().any().any()
If the output is True, you have NaN values. You can run the next command if you want to know where these NaN values are:
df.isnull().any()
Then you will know which columns contain NaN values.
Once you know you have NaN values, you have to handle them (drop, fill, ... whatever you believe is the best option). The link gtomer commented is a nice resource.
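For instance, a minimal sketch of both options before encoding, assuming data_ is the frame shown in the question (whether to drop or fill is a judgment call, not something the error dictates):
from sklearn.preprocessing import OrdinalEncoder

# Option 1: drop rows that contain NaN in the columns to be encoded
data_ = data_.dropna(subset=data_.columns[1:-1])

# Option 2 (alternative): fill NaNs with a placeholder category instead of dropping rows
# data_.iloc[:, 1:-1] = data_.iloc[:, 1:-1].fillna("missing")

data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])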
I am trying to use the KNNImputer Package to impute missing values into my dataframe.
Here is my dataframe
pd.DataFrame(numeric_data)
age bmi children charges
0 19 NaN 0.0 16884.9240
1 18 33.770 1.0 NaN
2 28 33.000 3.0 4449.4620
3 33 22.705 0.0 NaN
Here is what happens when I pass the data to the imputer and convert the output back to a dataframe.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data))
This gives:
0 1 2 3
0 19.0 34.0850 0.0 16884.924000
1 18.0 33.7700 1.0 6309.517125
2 28.0 33.0000 3.0 4449.462000
3 33.0 22.7050 0.0 4610.464925
How do I do the same without losing my column names? Can I store the column names somewhere else and append them later, or can I impute in a way that keeps the column names intact?
I have tried to exclude the column but I get the following error:
ValueError: could not convert string to float: 'age'
This should give you the desired result:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data), columns=numeric_data.columns)
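A self-contained version of the same idea, rebuilding the sample rows from the question (the index is also passed through so downstream joins still line up):
import pandas as pd
from sklearn.impute import KNNImputer

numeric_data = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "bmi": [None, 33.770, 33.000, 22.705],
    "children": [0.0, 1.0, 3.0, 0.0],
    "charges": [16884.9240, None, 4449.4620, None],
})

imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(
    imputer.fit_transform(numeric_data),
    columns=numeric_data.columns,   # keep the original column names
    index=numeric_data.index,       # keep the original row index
)
print(impute_data)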
When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
As per OP's comment, the NA is a string rather than NaN, so dropna() is no good here. (Also note that df = df.dropna(how="any", inplace=True) assigns None to df, because dropna returns None when inplace=True; either drop the inplace=True or drop the assignment, which is what produces the NoneType error.) One of many possible options for filtering out the string value 'NA' is:
df = df[df["_c2"] != "NA"]
A better option, to catch inexact matches (e.g. with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one should remove any strings rather than only 'NA':
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
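Whichever filter you use, note that the column is still stored as strings, so cast it before taking the mean from the original snippet, e.g.:
df = df[df["_c2"] != "NA"]                    # or one of the other filters above
average_age = df["_c2"].astype(float).mean()
print(average_age)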
This will also work. If the NA in your df were NaN (np.nan), it would not affect getting the mean of the column; the problem only arises because your NA is the string 'NA'.
(df.apply(pd.to_numeric,errors ='coerce',axis=1)).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info
df.apply(pd.to_numeric, errors='coerce', axis=1)  # all non-numeric objects become NaN and will not affect getting the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
I have a df structured as so:
UnitNo Time Sensor
0 1.0 2016-07-20 18:34:44 19.0
1 1.0 2016-07-20 19:27:39 19.0
2 3.0 2016-07-20 20:45:39 17.0
3 3.0 2016-07-20 23:05:29 17.0
4 3.0 2016-07-21 01:23:30 11.0
5 2.0 2016-07-21 04:23:59 11.0
6 2.0 2016-07-21 17:33:29 2.0
7 2.0 2016-07-21 18:55:04 2.0
I want to create a time-series plot where each UnitNo has its own line (color) and the y-axis values correspond to Sensor and the x-axis is Time. I want to do this in ggplot, but I am having trouble figuring out how to do this efficiently. I have looked at previous examples but they all have regular time series, i.e., observations for each variable occur at the same times which makes it easy to create a time index. I imagine I can loop through and add data to plot(?), but I was wondering if there was a more efficient/elegant way forward.
df.set_index('Time').groupby('UnitNo').Sensor.plot();
I think you need pivot or set_index and unstack with DataFrame.plot:
df.pivot(index='Time', columns='UnitNo', values='Sensor').plot()
Or:
df.set_index(['Time', 'UnitNo'])['Sensor'].unstack().plot()
If there are duplicate Time/UnitNo pairs, aggregate first:
df.groupby(['Time', 'UnitNo'])['Sensor'].mean().unstack().plot()
df.pivot_table(index='Time', columns='UnitNo', values='Sensor', aggfunc='mean').plot()
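For completeness, a self-contained sketch of the pivot-and-plot approach built from the rows in the question (this uses pandas/matplotlib rather than ggplot, as in the answers above):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "UnitNo": [1.0, 1.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0],
    "Time": pd.to_datetime([
        "2016-07-20 18:34:44", "2016-07-20 19:27:39", "2016-07-20 20:45:39",
        "2016-07-20 23:05:29", "2016-07-21 01:23:30", "2016-07-21 04:23:59",
        "2016-07-21 17:33:29", "2016-07-21 18:55:04",
    ]),
    "Sensor": [19.0, 19.0, 17.0, 17.0, 11.0, 11.0, 2.0, 2.0],
})

# One line per UnitNo, Time on the x-axis, Sensor on the y-axis
df.pivot(index="Time", columns="UnitNo", values="Sensor").plot()
plt.show()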
I am trying to learn Python, coming from a SAS background.
I have imported a SAS dataset, and one thing I noticed was that I have multiple date columns that are coming through as SAS dates (I believe).
In looking around, I found a link which explained how to perform this (here):
The code is as follows:
alldata['DateFirstOnsite'] = pd.to_timedelta(alldata.DateFirstOnsite, unit='s') + pd.datetime(1960, 1, 1)
However, I'm wondering how to do this for multiple columns. If I have multiple date fields, rather than repeating this line of code multiple times, can I create a list of fields I have, and then run this code on that list of fields? How is that done?
Thanks in advance
Yes, it's possible to create a list and iterate through it to convert the SAS date fields to pandas datetimes. However, I'm not sure why you're using the to_timedelta method, unless the SAS date fields are represented as seconds after 1960/01/01. If you plan on using the to_timedelta method, then it's simply a case of creating a function that takes your df and your field, and passing those two into your function:
def convert_SAS_to_datetime(df, field):
df[field] = pd.to_timedelta(df[field], unit='s') + pd.datetime(1960, 1, 1)
return df
Now, let's suppose you have your list of fields that you know should be converted to a datetime field (along with your df):
my_list = ['field1','field2','field3','field4','field5']
my_df = pd.read_sas('mySASfile.sas7bdat') # your SAS data that's converted to a pandas DF
You can now iterate through your list with a for loop while passing those fields and your df to the function:
for field in my_list:
my_df = convert_SAS_to_datetime(my_df, field)
Now, the other method I would recommend is the to_datetime method, but this assumes that you know what the SAS format of your date fields is.
e.g. 01Jan2016 # date9 format
This is when you might have to look through the documentation here to determine the directive for converting the date. In the case of a date9 format, you can use:
df[field] = pd.to_datetime(df[date9field], format="%d%b%Y")
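If you go the to_datetime route, it can also be applied to several columns in one pass, e.g. reusing my_list from above (this assumes every column in the list really holds date9-style strings):
my_df[my_list] = my_df[my_list].apply(pd.to_datetime, format="%d%b%Y")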
If I read your question correctly, you want to apply your code to multiple columns? To do that, simply do this:
alldata[['col1','col2','col3']] = 'your_code_here'
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : [np.NaN,np.NaN,3,4,5,5,3,1,5,np.NaN],
'B' : [1,0,3,5,0,0,np.NaN,9,0,0],
'C' : ['Pharmacy of IDAHO','Access medicare arkansas','NJ Pharmacy','Idaho Rx','CA Herbals','Florida Pharma','AK RX','Ohio Drugs','PA Rx','USA Pharma'],
'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.NaN],
'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn',]})
df[['E', 'D']] = 1 # <---- notice double brackets
print(df)
A B C D E
0 NaN 1.0 Pharmacy of IDAHO 1 1
1 NaN 0.0 Access medicare arkansas 1 1
2 3.0 3.0 NJ Pharmacy 1 1
3 4.0 5.0 Idaho Rx 1 1
4 5.0 0.0 CA Herbals 1 1
5 5.0 0.0 Florida Pharma 1 1
6 3.0 NaN AK RX 1 1
7 1.0 9.0 Ohio Drugs 1 1
8 5.0 0.0 PA Rx 1 1
9 NaN 0.0 USA Pharma 1 1
Notice the double brackets in the beginning. Hope this helps!
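Tying the double-bracket idea back to the original SAS date conversion, a minimal sketch assuming those columns hold numeric seconds since 1960-01-01 (the second column name is made up for illustration):
date_cols = ['DateFirstOnsite', 'DateSecondOnsite']  # second name is hypothetical
alldata[date_cols] = alldata[date_cols].apply(
    lambda s: pd.to_timedelta(s, unit='s') + pd.Timestamp(1960, 1, 1)
)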