ValueError: Input contains NaN
i have run
from sklearn.preprocessing import OrdinalEncoder
data_.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:,1:-1])
here is data_
Age Sex Embarked Survived
0 22.0 male S 0
1 38.0 female C 2
2 26.0 female S 2
3 35.0 female S 2
4 35.0 male S 0
Before doing any processing, you should always inspect the data and get a summary of what it looks like. Concretely, the error you obtained is telling you that you have NaN values. To check, try this command:
df.isnull().any().any()
If the output is True, you have NaN values. You can run the next command if you want to know where these NaN values are:
df.isnull().any()
Then you will know which columns contain your NaN values.
Once you know you have NaN values, you have to handle them (drop them, fill them in, ... whatever you believe is the best option). The link gtomer commented is a nice resource.
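If you prefer filling the gaps over dropping rows, here is a minimal sketch, assuming data_ is the frame shown above; the "missing" placeholder and the median fill are only illustrative choices:
from sklearn.preprocessing import OrdinalEncoder

# Illustrative fills: median for the numeric Age column, a placeholder for the
# categorical columns that OrdinalEncoder will see
data_["Age"] = data_["Age"].fillna(data_["Age"].median())
cat_cols = data_.iloc[:, 1:-1].columns   # here: Sex, Embarked
data_[cat_cols] = data_[cat_cols].fillna("missing")

# The encoder no longer sees NaN values
data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])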
I am trying to use the KNNImputer Package to impute missing values into my dataframe.
Here is my dataframe
pd.DataFrame(numeric_data)
age bmi children charges
0 19 NaN 0.0 16884.9240
1 18 33.770 1.0 NaN
2 28 33.000 3.0 4449.4620
3 33 22.705 0.0 NaN
Here is what happens when I apply the imputer and output the dataframe.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data))
This gives:
0 1 2 3
0 19.0 34.0850 0.0 16884.924000
1 18.0 33.7700 1.0 6309.517125
2 28.0 33.0000 3.0 4449.462000
3 33.0 22.7050 0.0 4610.464925
How do I do the same without losing my column names? Can I store the column names somewhere else and append them later, or can I impute without the column names being affected at all?
I have tried to exclude the column but I get the following error:
ValueError: could not convert string to float: 'age'
This should give you the desired result:
from sklearn.impute import KNNImputer
import pandas as pd

imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data),
                           columns=numeric_data.columns)
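If you also want to keep the original row labels (not just the column names), the index can be passed back the same way:
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data),
                           columns=numeric_data.columns,
                           index=numeric_data.index)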
Each row in this dataframe represents an order and executionStatus.x has some info about the order status.
Those executionStatus.x columns are automatically created by amirziai's flatten_json, depending on how many statuses there are. So if there are 3 statuses for one order, as in row 0, there will be columns up to executionStatus.2. Since rows 1 and 2 only have one status each, they only have values in executionStatus.0.
My problem is that I cannot match "ORDER_FULFILLED" because I don't know how many executionStatus columns there will be, and I would otherwise need to write the exact column name, like df[df['executionStatus.0'].str.match('ORDER_FULFILLED')].
executionStatus.0 executionStatus.1 executionStatus.2 \
0 REQUESTED_AMOUNT_ROUNDED MEOW ORDER_FULFILLED
1 ORDER_FULFILLED NaN NaN
2 NOT_AN_INFUNDING_LOAN NaN NaN
investedAmount loanId requestedAmount OrderInstructId
0 50.0 22222 55.0 55555
1 25.0 33333 25.0 55555
2 0.0 44444 25.0 55555
Is there a way to get the entire row or index that matches the "ORDER_FULFILLED" element anywhere in the dataframe?
Ideally, the matched dataframe should look like this, because rows 0 and 1 have ORDER_FULFILLED in their executionStatuses and row 2 does not, so it should be excluded. Thanks!
investedAmount loanId requestedAmount OrderInstructId
0 50.0 22222 55.0 55555
1 25.0 33333 25.0 55555
Use df.filter() to select all the columns whose names contain executionStatus, then build a boolean mask from them:
df[df.filter(like='executionStatus').eq('ORDER_FULFILLED').any(axis=1)]
executionStatus.0 executionStatus.1 executionStatus.2 \
0 REQUESTED_AMOUNT_ROUNDED MEOW ORDER_FULFILLED
1 ORDER_FULFILLED NaN NaN
investedAmount loanId requestedAmount OrderInstructId
0 50 22222 55 55555
1 25 33333 25 55555
If you want to drop the executionStatus columns from the output, use:
df.loc[df.filter(like='executionStatus').eq('ORDER_FULFILLED').any(axis=1),
       df.columns.difference(df.filter(like='executionStatus').columns)]
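The same logic with the intermediate steps split out, in case that reads more clearly (the variable names here are mine):
status_cols = df.filter(like='executionStatus').columns
fulfilled = df[status_cols].eq('ORDER_FULFILLED').any(axis=1)
result = df.loc[fulfilled, df.columns.difference(status_cols)]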
When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
As per OP’s comment, the NA is a string rather than NaN. So dropna() is no good here. One of many possible options for filtering out the string value ‘NA’ is:
df = df[df["_c2"] != "NA"]
A better option to catch inexact matches (e.g. with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one should remove any non-numeric strings rather than only 'NA':
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
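Whichever of these filters you use, a short follow-up sketch (assuming df now holds the filtered data and _c2 contains numeric text) to get the average the question was after:
import pandas as pd

# Convert the remaining string values to numbers, then take the mean
df["_c2"] = pd.to_numeric(df["_c2"])
average_age = df["_c2"].mean()
print(average_age)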
This will work. Also, if the NA in your df were NaN (np.nan), it would not affect getting the mean of the column; it only matters when your NA is the string 'NA'.
(df.apply(pd.to_numeric,errors ='coerce',axis=1)).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info
df.apply(pd.to_numeric, errors='coerce', axis=1)  # every non-numeric value becomes NaN and will not affect the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
I've been banging my head against a wall on this for a couple of hours, and would appreciate any help I could get.
I'm working with a large data set (over 270,000 rows), and am trying to find an anomaly within two columns that should have paired values.
From the snippet of output below - I'm looking at the Alcohol_Category_ID and Alcohol_Category_Name columns. The ID column has a numeric string value that should pair up 1:1 with a string descriptor in the Name column (e.g., "1031100.0" == "100 PROOF VODKA").
As you can see, both columns have the same count of non-null values. However, there are 72 unique IDs and only 71 unique Names. I take this to mean that one Name is incorrectly associated with two different IDs.
County Alcohol_Category_ID Alcohol_Category_Name Vendor_Number \
count 269843 270288 270288 270920
unique 99 72 71 116
top Polk 1031080.0 VODKA 80 PROOF 260
freq 49092 35366 35366 46825
first NaN NaN NaN NaN
last NaN NaN NaN NaN
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
My trouble is in actually isolating out where this duplication is occurring so that I can hopefully replace the erroneous ID with its correct value. I am having a dog of a time with this.
My dataframe is named i_a.
I've been trying to examine the pairings of values between these two columns with groupby and count statements like this:
i_a.groupby(["Alcohol_Category_Name", "Alcohol_Category_ID"]).Alcohol_Category_ID.count()
However, I'm not sure how to whittle it down from there. And there are too many pairings to make this easy to do visually.
Can someone recommend a way to isolate out the Alcohol_Category_Name associated with more than one Alcohol_Category_ID?
Thank you so much for your consideration!
EDIT: After considering the advice of Dmitry, I found the solution by continually paring down duplicates until I homed in on the value of interest, like so:
#Finding all unique pairings of Category IDs and Names
subset = i_a.drop_duplicates(["Alcohol_Category_Name", "Alcohol_Category_ID"])
#Now, determine which of the category names appears more than once (thus paired with more than one ID)
subset[subset["Alcohol_Category_Name"].duplicated()]
Thank you so much for your help. It seems really obvious in retrospect, but I could not figure it out for the life of me.
I think this snippet meets your needs:
> df = pd.DataFrame({'a':[1,2,3,1,2,3], 'b':[1,2,1,1,2,1]})
So df.a has 3 unique values mapping to 2 uniques in df.b.
> df.groupby('b')['a'].nunique()
b
1 2
2 1
That shows that df.b=1 maps to 2 uniques in a (and that df.b=2 maps to only 1).
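Applied to the question's frame, the same idea (a sketch using the OP's column names) flags any category name that maps to more than one ID:
# For each name, count how many distinct IDs it is paired with
counts = i_a.groupby("Alcohol_Category_Name")["Alcohol_Category_ID"].nunique()
counts[counts > 1]   # the name(s) associated with more than one ID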
I have a Pandas Series with 76 elements; when I try to print the Series (for debugging) it is abbreviated with "..." in the output. Is there a way to pretty print all of the elements of the Series?
In this example, the Series is called "data"
print str(data)
gives me this
Open 40.4568
High 40.4568
Low 39.806
Close 40.114
Volume 796146.2
Active 1
TP1_ema 700
stop_ema_width 0.5
LS_ema 10
stop_window 210
target_width 3
LS_width 0
TP1_pct 1
TP1_width 4
stop_ema 1400
...
ValueSharesHeld NaN
AccountIsWorth NaN
Profit NaN
BuyPrice NaN
SellPrice NaN
ShortPrice NaN
BtcPrice NaN
LongStopPrice NaN
ShortStopPrice NaN
LongTargetPrice NaN
ShortTargetPrice NaN
LTP1_Price NaN
STP1_Price NaN
TradeOpenPrice NaN
TheEnd False
Name: 2000-11-03 14:00, Length: 76, dtype: object
Note the "..." inserted in the middle. I'm debugging using PTVS on Visual Studio 2013 (Python Tools for Visual Studio". I get the same behaviour with enthought canopy.
pd.options.display.max_rows = 100
The default is set at 60 (so dataframes or series with more elements will be truncated when printed).
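If you would rather not change the setting globally, pandas also provides a context manager that lifts the limit only for a single print:
import pandas as pd

# Temporarily remove the row limit just for this block
with pd.option_context("display.max_rows", None):
    print(data)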