Error in my dataset - python

I am currently getting this error in my code:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
when I run this code:
import pandas as pd
train=pd.read_csv(r'C:\Users\ABDILLAH\Desktop\datasets\Rails\RailsDataset.csv')
features_col=['Num_comments', 'Num_Commits','Changed_files']
X=train.loc[:,features_col]
y=train.classes
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X,y)
If you need a sample of my dataset to check what is really happening, please let me know.

I've loaded the sample set and the code below ran on my computer:
import pandas as pd
from sklearn.linear_model import LogisticRegression
train = pd.read_csv('RailsDataset_bis.csv')
features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']
X = train[features_col].dropna()
y = train['class'].dropna()
logreg = LogisticRegression()
logreg.fit(X, y)
I corrected the following issues:
There is no Num_comments column, only a Num_Comments column; pandas column names are case-sensitive. The line X=train.loc[:,features_col] didn't give you an error, but generated a column full of NaN (recent pandas versions raise a KeyError here instead). Selecting columns with X = train[features_col] throws an error if a column name doesn't exist.
There is no train.classes, as the column name is class and not classes.
There was a line full of NaN at the bottom of the set that needed to be removed with dropna().
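One caveat about the code above: calling dropna() separately on X and y only keeps them aligned when the same rows are missing in both. A minimal sketch of a safer variant, assuming a small made-up frame in place of RailsDataset_bis.csv (the values are hypothetical, only the column names come from the question):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for RailsDataset_bis.csv: the last row is entirely NaN.
train = pd.DataFrame({
    "Num_Comments": [1, 4, 0, 7, 2, 9, np.nan],
    "Num_Commits": [2, 3, 1, 8, 1, 6, np.nan],
    "Changed_files": [1, 5, 1, 9, 2, 7, np.nan],
    "class": [0, 1, 0, 1, 0, 1, np.nan],
})
features_col = ["Num_Comments", "Num_Commits", "Changed_files"]

# Bracket selection fails loudly when the case of a column name is wrong.
try:
    train[["Num_comments"]]
    wrong_case_raised = False
except KeyError:
    wrong_case_raised = True

# Drop incomplete rows once, on the whole frame, so X and y stay aligned.
clean = train.dropna(subset=features_col + ["class"])
X = clean[features_col]
y = clean["class"].astype(int)

logreg = LogisticRegression()
logreg.fit(X, y)
```

Dropping rows on the combined frame guarantees that X and y end up with the same index, whatever the pattern of missing values is.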

Related

Pandas preprocessing data and labelling

I want to split my data into features and a label: the first 6 columns determine the 7th column. Selecting the first 6 columns works perfectly:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Assign column names to the dataset
names=['buying', 'maint', 'doors', 'persons', 'lug_boot','safety', 'class']
# load the dataset in csv format into the pandas dataframe
cardata= pd.read_csv(r'C:\Users\user\Downloads\car.data', names=names)
X = cardata.iloc[:, 0:6]
The above code works perfectly, and when I run
print(X.head())
it prints the first 6 columns, without the last column, which is the one to be predicted.
But the code below does not seem to work; its output is similar to the one above:
y = cardata.select_dtypes(include=[object])
print(y.head())
Please help: I need to assign to y only the last column, that is, the seventh column.
At the moment the output is the same as before, but when I run print(y.head()) it should print only the last column.
Try this
X,y = cardata.iloc[:,:-1],cardata.iloc[:,-1]
This selects all rows and separates X and y based on the last column (index = -1). This should get you the result you are looking for
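A likely reason the select_dtypes attempt returned everything is that all seven columns hold strings, so filtering on the object dtype keeps all of them. Here is a small sketch of both behaviours, assuming three made-up rows in the style of car.data (the rows are hypothetical, and dtype=object is set explicitly to mimic how string columns were loaded at the time):

```python
import pandas as pd

names = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Hypothetical rows in the style of car.data; every column holds strings.
cardata = pd.DataFrame(
    [
        ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
        ["low", "med", "4", "more", "big", "high", "vgood"],
        ["med", "low", "5more", "4", "med", "med", "acc"],
    ],
    columns=names,
    dtype=object,
)

# Every column has dtype object, so this keeps all seven columns.
by_dtype = cardata.select_dtypes(include=[object])

# Positional slicing: everything but the last column, then only the last one.
X = cardata.iloc[:, :-1]
y = cardata.iloc[:, -1]
```

Here X is a DataFrame with the six feature columns and y is a Series named class.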

Python 3: Error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

I'm using scikit-learn for basic machine learning
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = df[['floors', 'waterfront','lat' ,'bedrooms' ,'sqft_basement' ,'view' ,'bathrooms','sqft_living15','sqft_above','grade','sqft_living']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
However, whenever I try to train the model with more than one data type, I get
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Anyone know why?
Data: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv
Edit: When checking for infinite values manually I found none; however, when checking with Python, every value type had infinite values in it.
I think there is a misunderstanding about how machine learning works. Each of the properties floors, waterfront, etc. should have values associated with it, and Y (the ground truth) should be a float as well for each property. It is like predicting house prices from a number of features (the X values) such as the number of bedrooms, the area of the house in square meters, the size of the kitchen, and so on.
All these features should have values that describe them, and the Y values should be the house selling prices. Processing these features together with the actual selling prices trains your model to make predictions on new data that it hasn't seen before.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
#I have given sample float values to help you understand how you should implement it
x_data = {'Properties':['floors', 'waterfront','lat' ,'bedrooms' ,'sqft_basement' ,'view' ,'bathrooms','sqft_living15','sqft_above','grade','sqft_living'],
'values':[1.2,2.2,0.4,5.3,0.2,2.3,1.2,4,1.3,3.2,0.8]}
y_data = {'Price':[100, 200, 500, 400, 220, 140, 150, 190, 300, 240, 59]}
#this is how you initialize pandas dataframe
X_df = pd.DataFrame(x_data)
Y_df = pd.DataFrame(y_data)
#I will need only the values from X dataframe and convert it to numpy array
X = X_df['values'].to_numpy().reshape(-1,1)
Y = Y_df.to_numpy()
lm = LinearRegression()
lm.fit(X,Y)
It will be better if you start from the basics and then move to more advanced topics.
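Separately from the above, the file name in the question (kc_house_data_NaN.csv) suggests that the data contains missing values, which on its own would be enough to trigger this ValueError. A quick sketch of how one might count NaN and infinite values per column before fitting, assuming a tiny made-up frame (the numbers are hypothetical, only the column names come from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the column names follow the question.
df = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 4.0, 2.0],
    "bathrooms": [1.0, 2.25, np.nan, 1.5],
    "floors": [1.0, 2.0, 1.5, 1.0],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
})
features = ["floors", "bedrooms", "bathrooms"]

# Count missing and infinite entries per feature column.
nan_counts = df[features].isna().sum()
inf_counts = np.isinf(df[features]).sum()

# Keep only rows where every feature and the target are present.
clean = df.dropna(subset=features + ["price"])
X = clean[features]
Y = clean["price"]
```

If nan_counts is non-zero for any column, the rows have to be dropped or filled before calling fit.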

How do I fix this "TypeError: float() argument must be a string or a number, not 'method'" Error?

I've tried to use the imputer to replace all of the NaN entries in my dataset with the averages of their respective columns. For example, I want a blank entry under the salary column to be filled with the average of the salary values in that column. I tried doing this by following along with a tutorial, but I think the video was outdated, resulting in this error.
Code:
#Data Processing
#Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
#Taking care of Missing Data
from sklearn.preprocessing import Imputer
#The source of all the problems
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform
Initially, X looked like this when run prior to using Imputer:
However, once I ran lines 16-18, I got this error and I'm not sure what to do.
The line
X[:, 1:3] = imputer.transform
should be
X[:, 1:3] = imputer.transform(X[:, 1:3])
...with parentheses, and the data to transform as the argument, to actually call the method rather than assign the method object itself.
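Note that the tutorial is indeed outdated: the Imputer class was removed from sklearn.preprocessing in later scikit-learn releases, and SimpleImputer in sklearn.impute is its replacement (it has no axis parameter and always works column-wise). A minimal sketch with the newer class, assuming a small made-up array with age and salary columns (the values are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one gap in each.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 61000.0],
])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_num)
X_filled = imputer.transform(X_num)  # note the call, with the data as argument
```

In the question's code, the same pattern would apply to the slice X[:, 1:3].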

I have a ValueError while using Jupyter Notebook and need help to find out why I get this error and how to fix it

When I run this code:
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)
I get this output:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error points to the line where it says melbourne_model.fit(X, y).
I want the code to fit the model with X and y so I can make future predictions of houses in Melbourne based on a few variables I input such as year built, land area, rooms/bedrooms, location, etc. Right now I can't do that because of this error.
I think it is because the X and y aren't NumPy Arrays and when I use np.asarray() and put what I want to turn into a NumPy Array, it doesn't work. I know this because when I write type(X) or type(y), I get pandas.core.series.Series.
The whole code to my file:
import pandas as pd
import numpy as np
melbourne_file_path = 'melb_data.csv\\melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
np.asarray(melbourne_data.Price)
y = melbourne_data.Price
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
np.asarray(melbourne_data[melbourne_predictors])
X = melbourne_data[melbourne_predictors]
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)
I am using Jupyter Notebook as part of Anaconda.
The CSV file I am using can be downloaded here.
Once you have downloaded the folder, you need to extract the files; the CSV is inside the folder. You can set your own melbourne_file_path based on where the file is for you.
The error you're getting is fairly clear: Input contains NaN, infinity or a value too large. The problem is not that your inputs are pandas Series, but that your data is missing values! A quick glance at your CSV on Kaggle shows that rows 15 and 16 are missing quite a few fields, for example.
It's up to you to decide how to handle these missing values. One way is just to drop any row that's missing one or more values: df.dropna(inplace=True). This should get the DecisionTreeRegressor to fit without errors, but might bias your results if too many rows are dropped. A possibly better approach is to fill missing values with the column mean: df = df.fillna(df.mean(numeric_only=True)). Either way, do this before selecting X and y.
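To make the two options concrete, here is a rough sketch on a made-up frame, assuming only a few of the question's columns (the values are hypothetical, and random_state is added just to make the run repeatable):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical sample using some of the question's column names.
melbourne_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2],
    "Landsize": [202.0, 156.0, np.nan, 120.0, 245.0],
    "YearBuilt": [1900.0, np.nan, 2014.0, 1970.0, np.nan],
    "Price": [1480000.0, 1035000.0, 1600000.0, 941000.0, 1636000.0],
})
predictors = ["Rooms", "Landsize", "YearBuilt"]

# Option 1: drop every row with a missing predictor or target.
dropped = melbourne_data.dropna(subset=predictors + ["Price"])

# Option 2: keep all rows and fill gaps with the column mean.
filled = melbourne_data.copy()
filled[predictors] = filled[predictors].fillna(filled[predictors].mean())

model = DecisionTreeRegressor(random_state=0)
model.fit(filled[predictors], filled["Price"])
predictions = model.predict(filled[predictors])
```

With this toy data, option 1 keeps only two of the five rows, which illustrates how quickly dropping rows can shrink a dataset.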

could not convert categorical data to number OneHotEncoder

I have some simple categorical data that I want to convert to a one-hot encoding in Python:
a,1,p
b,3,r
a,5,t
I tried to convert them with python OneHotEncoder:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
one_hot_encoder.fit(data.values)
This piece of code does not work and throws an error
ValueError: could not convert string to float: 't'
Can you please help me?
Try this:
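from sklearn import preprocessing
# df is the DataFrame loaded from the file (called data in the question)
for c in df.columns:
    df[c] = df[c].apply(str)  # treat every value as a string label
    le = preprocessing.LabelEncoder().fit(df[c])
    df[c] = le.transform(df[c])  # replace each label with an integer code
    df[c] = pd.to_numeric(df[c]).astype(float)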
@user3104352,
I encountered the same behavior and found it frustrating.
Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
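As far as I know, more recent scikit-learn releases let OneHotEncoder accept string columns directly and no longer have the categorical_features parameter. If the goal is simply a one-hot encoded table, a different technique that avoids the encoder entirely is pandas.get_dummies. A minimal sketch, assuming the three rows from the question and hypothetical column names c0, c1, c2:

```python
import pandas as pd

# The three sample rows from the question; the column names are hypothetical.
data = pd.DataFrame(
    [["a", 1, "p"], ["b", 3, "r"], ["a", 5, "t"]],
    columns=["c0", "c1", "c2"],
)

# One-hot encode the two categorical columns and leave c1 untouched.
encoded = pd.get_dummies(data, columns=["c0", "c2"], dtype=int)
```

The result keeps c1 as it is and adds one 0/1 column per category, named c0_a, c0_b, c2_p, c2_r and c2_t.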
