Numpy Array for SVM model rather than a DataFrame - python

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
Two questions:
The data goes into a NumPy array from a pandas DataFrame (via pd.read_csv). Is that better? Is there a good reason for that? Why not stay with the DataFrame?
I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It has three columns: the first two contain the coordinates of the points, and the third contains the label.
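For illustration, here is what that slicing does on data in the described format (a minimal sketch; the second row is made up just to show more than one sample):
import numpy as np
data = np.asarray([[0.28917, 0.65643, 0],
                   [0.31000, 0.40000, 1]])  # rows in the same format as the CSV
X = data[:, 0:2]  # all rows, columns 0 and 1 -> the two coordinates
y = data[:, 2]    # all rows, column 2 -> the label
print(X.shape, y.shape)  # (2, 2) (2,)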

Related

Python 3: Error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

I'm using scikit-learn for basic machine learning
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Load the data (the CSV linked below, downloaded locally).
df = pd.read_csv('kc_house_data_NaN.csv')
X = df[['floors', 'waterfront', 'lat', 'bedrooms', 'sqft_basement', 'view', 'bathrooms', 'sqft_living15', 'sqft_above', 'grade', 'sqft_living']]
Y = df['price']
lm = LinearRegression()
lm.fit(X, Y)
However, whenever I try to train the model with more than one data type, I get
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Anyone know why?
Data: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv
Edit: When checking for infinite values manually I found none; however, when checking with Python, every value type appeared to contain infinities.
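For reference, a quick way to run that check explicitly (a sketch; the column names come from the code above and the linked CSV is assumed to be downloaded locally with numeric columns):
import numpy as np
import pandas as pd
df = pd.read_csv('kc_house_data_NaN.csv')
cols = ['floors', 'waterfront', 'lat', 'bedrooms', 'sqft_basement', 'view',
        'bathrooms', 'sqft_living15', 'sqft_above', 'grade', 'sqft_living', 'price']
print(df[cols].isna().sum())     # missing (NaN) values per column
print(np.isinf(df[cols]).sum())  # infinite values per column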
I think you don't understand how Machine Learning works: you should have values associated with the properties (floors, waterfront, etc.), and Y should be a float as well (the ground truth) for each property. It's like predicting house prices based on a number of features (X values) such as the number of bedrooms, the square meters of the house, how big the kitchen is, and so on.
All these features should have values that describe them, and the Y values should be the house selling price. Processing all these features together with the actual selling price trains your model to make predictions on new data that it hasn't seen before (a short prediction example follows the code below).
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
#I have given sample float values to help you understand how you should implement it
x_data = {'Properties': ['floors', 'waterfront', 'lat', 'bedrooms', 'sqft_basement', 'view', 'bathrooms', 'sqft_living15', 'sqft_above', 'grade', 'sqft_living'],
          'values': [1.2, 2.2, 0.4, 5.3, 0.2, 2.3, 1.2, 4, 1.3, 3.2, 0.8]}
y_data = {'Price':[100, 200, 500, 400, 220, 140, 150, 190, 300, 240, 59]}
#this is how you initialize pandas dataframe
X_df = pd.DataFrame(x_data)
Y_df = pd.DataFrame(y_data)
#I will need only the values from X dataframe and convert it to numpy array
X = X_df['values'].to_numpy().reshape(-1,1)
Y = Y_df.to_numpy()
lm = LinearRegression()
lm.fit(X,Y)
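Once fitted, the model can be used on new feature values (a sketch continuing the toy example above):
new_x = np.array([[2.0]])  # a new, unseen feature value
print(lm.predict(new_x))   # predicted price for that value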
It will be better if you start from the basics and then move to more advanced topics.

Inverse Transform with FunctionTransformer from sklearn

I wanted to create my own Transformer using scikit-learn's FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried inverse_transform, it returned the same thing as the transformation. How do I get the original values? I ask this because I plan on using this transformation to transform a target variable, then make predictions. Those predictions will need to be inverse-transformed after I predict.
As a side bar, should I fit on y_train and transform on my y_test? Or can I transform y all at once?
My transformer:
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import random
randomlist = []
for i in range(0, 100):
    n = random.randint(1, 100)
    randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]
target_trans = FunctionTransformer(np.log, validate=True, check_inverse = True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1,1))
logy_test = target_trans.transform(y_test.values.reshape(-1,1))
target_trans.inverse_transform(y_train.values.reshape(-1,1))
Within FunctionTransformer() you not only need to set check_inverse=True, you also need to define the actual inverse function itself (inverse_func).
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
which yields the desired result.
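With inverse_func set, the round trip recovers the original values, e.g. (a sketch reusing logy_train from the question):
logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
original = target_trans.inverse_transform(logy_train)  # exp(log(y)) -> y
# `original` matches y_train up to floating-point error; the same call is what you
# would apply to predictions made on the transformed target.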

Issues with Pymc3 Summary

I am currently struggling to obtain a summary of the statistics for a model I fit with Bayesian regression. I first used Lasso and model selection to filter the best variables, then used pm.Model to run the regression proper.
Of course, having 'filtered' out the explanatory variables that weren't relevant, the shape of the X matrix had changed. The data I worked on is the load_boston dataset from sklearn.datasets. I used the data as the independent variables and the target as the dependent variable.
Having performed model selection with SelectFromModel, I used the get_support method to obtain the indices of the retained variables. I then looped over both the indices of all variables and the values contained in the support, storing the names of the retained variables in an empty list I had created for this purpose. The code looks something like this:
import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
np.random.seed(9)
# Load the boston dataset.
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston['data'], boston['target']
# Here is the code for the estimator LassoCV
# Here is the code for Model Selection
supp = sfm.get_support(indices=True)  # indices of the retained variables
X_transform = sfm.transform(X)  # remove the unnecessary variables
# Here is the line for linear modeling
# Initialize some useful variables
m = y.shape[0]
n = X.shape[1]
c = supp.shape[0]
L = boston['feature_names']
varnames = []
for i in range(0, n):
    for j in range(0, c):
        if i == supp[j]:
            varnames.append(L[i])
pm.summary(trace, varnames=varnames)
The console then displays 'KeyError: RM', which is one of the names of the variables used. One issue I noticed is that every element of varnames is a numpy str_ object, meaning that I can't read the names of the retained variables in the list unless I double-click on them.
How could I fix this? I have no clue what I am doing wrong.
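As a side note (a sketch based on the code above, not an answer from the thread): the double loop that builds varnames is equivalent to indexing the feature names with the support indices, and converting each entry to a plain Python str avoids the numpy str_ display issue mentioned above:
varnames = [str(name) for name in boston['feature_names'][supp]]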

Python SciKitLearn and Pandas categoric data

I'm working on multivariable regression from a csv, predicting crop performance based on multiple factors. Some of my columns are numerical and meaningful. Others are numerical and categorical, or strings and categorical (for instance, crop variety, or plot code or whatever.) How do I teach Python to use them? I've found One Hot Encoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but don't really understand how to apply it here.
My code so far:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('filepath.csv')
df.drop(df[df['LabeledDataColumn'].isnull()].index.tolist(),inplace=True)
scale = StandardScaler()
pd.options.mode.chained_assignment = None # default='warn'
X = df[['inputColumn1', 'inputColumn2', ...,'inputColumn20']]
y = df['LabeledDataColumn']
X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']] = scale.fit_transform(X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].as_matrix())
#print (X)
est = sm.OLS(y, X).fit()
est.summary()
You could use the get_dummies function pandas provides and convert the categorical values.
Something like this..
predictor = pd.concat([data.get(['numerical_column_1', 'numerical_column_2', 'label']),
                       pd.get_dummies(data['categorical_column1'], prefix='Categorical_col1'),
                       pd.get_dummies(data['categorical_column2'], prefix='categorical_col2')],
                      axis=1)
then you could get the outcome/label column by doing
outcome = predictor['label']
del predictor['label']
Then call the model on the data doing
est = sm.OLS(outcome, predictor).fit()
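As a quick illustration of what pd.get_dummies produces (toy data with a hypothetical crop-variety column):
import pandas as pd
toy = pd.DataFrame({'variety': ['A', 'B', 'A', 'C']})
dummies = pd.get_dummies(toy['variety'], prefix='variety')
print(dummies.columns.tolist())  # ['variety_A', 'variety_B', 'variety_C']
# Each category becomes its own indicator column that OLS can use directly.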

Getting feature names in addition to values - SciKitLearn+Pandas

I generate a set of features for input, that I store as a table using pandas and the CSV format.
(Each column header represents a feature name, except for the first, blank column, which is where the class labels are stored for each row.)
My next step is reading the table from the CSV file into scikit-learn (I'm currently doing this with pandas again). However, after training and experimenting with my models using different feature selection methods (and different initially generated features), I want the NAMES of the selected features.
I assume this should be trivial, but I just haven't found how to do it.
(Note: I am NOT working on standard text documents, so "CountVectorizer" and "NaiveBayes"/nltk and the like do not help me).
I need a method to get the selected features, (and preferably something to drop the unselected ones, for when I apply the models and selected features on new "test" data).
Thank you very much!
My data is currently loaded like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.linear_model import LogisticRegression
def load_data(filename="Feat_normalized.csv"):
    df = pd.read_csv(filename, index_col=0)
    lb = LabelEncoder()
    labels = lb.fit_transform(df.index.values)
    features = df.values
    feature_names = list(df.columns)
    feature_names.pop(0)  # Remove index.
    return (features, labels, lb)
features, labels, lb_encoder = load_data(filename)
X, y = features, labels
clf_logit = LogisticRegression(penalty="l1", dual=False, class_weight='auto')
X_reduced = clf_logit.fit_transform(X, y)
print('New sparse (filtered) features matrix size:')
print(X_reduced.shape)
# Then fit to various models, Random forests, SVM, etc.
Truncated Example of the first 2 rows in the input data/csv:
AA_C AA__D AA__E AA_F AA__G AA_H AA_I AA_K AA_L AA_M
Mammal_sequence_1.0.fasta 3.838099345 0.456591162 3.764884604 3.620232638 3.460992571 3.858487012 2.69247235 3.18710619 3.671029774 4.625996297 1.542632799
(AA_"" = Feature name. Mammal_sequence_1.0.fasta = Class name/label; (1 per row, empty header).
Thank you very much!
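One possible way to get the selected feature names (a sketch, not from the original post; it wraps the L1 logistic regression above in SelectFromModel and assumes the feature_names list built inside load_data is also made available):
from sklearn.feature_selection import SelectFromModel
# Note: depending on the sklearn version, an L1-capable solver (e.g. solver='liblinear')
# may be required for the wrapped logistic regression.
selector = SelectFromModel(clf_logit)     # uses the L1 coefficients as importances
X_reduced = selector.fit_transform(X, y)  # drops the unselected features
mask = selector.get_support()             # boolean mask over the original columns
selected_names = [name for name, keep in zip(feature_names, mask) if keep]
print(selected_names)
# For new "test" data with the same columns: X_test_reduced = selector.transform(X_test)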
