I tried to use sklearn for a simple decision tree classifier, and it complained that passing a 1D array is now deprecated and that I must use X.reshape(1,-1). I did, but that turned my labels list into a list of lists with a single element, so the number of labels and samples no longer match. In other words, my list of labels=[0,0,1,1] turns into [[0 0 1 1]]. Thanks
This is the simple code that I used:
from sklearn import tree
import numpy as np
features =[[140,1],[130,1],[150,0],[170,0]]
labels=[0,0,1,1]
labels = np.array(labels).reshape(1,-1)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features,labels)
print(clf.predict([150,0]))
You are reshaping the wrong thing. Reshape the data you are predicting on, not your labels.
>>> clf.predict(np.array([150,0]).reshape(1,-1))
array([1])
Your labels have to align with your training data (features), so the length of both arrays should be the same. If labels is reshaped then, as you noted, it becomes a list of lists of length 1, which no longer matches the length of your features.
You have to reshape your test data because prediction needs an array that looks like your training data, i.e. each row needs to be an example with the same number of features as in training. You'll see that the following two commands return a list of lists and just a list, respectively.
>>> np.array([150,0]).reshape(1,-1)
array([[150, 0]])
>>> np.array([150,0])
array([150, 0])
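Putting it together, a minimal corrected version of the snippet from the question (keep the labels as a flat list and reshape only the sample being predicted):
from sklearn import tree
import numpy as np

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]  # leave the labels as a flat list

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# Reshape only the sample being predicted: one row with two features
print(clf.predict(np.array([150, 0]).reshape(1, -1)))  # -> [1]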
My dataset's features shape is (80102, 2592) and labels shape is (80102, 2). I want to use only a few rows for training because training the CNN model takes a lot of time. How can I divide the dataset in Python and use only a few rows for both training and testing?
If your data is in the form of arrays, let X be the array containing the data and y the array containing the labels. You can use sklearn's train_test_split function to create smaller samples of the data, as in the code below:
from sklearn.model_selection import train_test_split

percent = .1  # the fraction of the data you want to use, in this case 10%
X_data, X_dummy, y_labels, y_dummy = train_test_split(X, y, train_size=percent, random_state=123, shuffle=True)
X_data will contain 10% of the original data and will be shuffled
y_labels will contain 10% of the corresponding labels.
If you want to set the number of samples explicitly, set train_size to an integer value. If you need further information, the documentation is located here. If your data is a pandas dataframe, you can use the pandas function pandas.DataFrame.sample; documentation for that is here. Assume your data frame is called data. The code below will produce a new data frame with a specified percent of the original rows:
percent=.1
new_data = data.sample(frac=percent, replace=False, random_state=123, axis=0)
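For reference, a minimal self-contained sketch of both approaches, using small random arrays in place of the real dataset (the shapes here are made up for illustration):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy data standing in for the real features and labels
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000, 2))

# Keep 10% of the rows (shuffled); the remainder goes to the "dummy" arrays
X_data, X_dummy, y_labels, y_dummy = train_test_split(X, y, train_size=0.1, random_state=123, shuffle=True)
print(X_data.shape, y_labels.shape)  # (100, 20) (100, 2)

# The equivalent idea with a DataFrame
data = pd.DataFrame(X)
new_data = data.sample(frac=0.1, random_state=123)
print(new_data.shape)  # (100, 20)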
I am facing some problems when I try to fit the model. This happens when I try to use LogisticRegression, Naive Bayes, or SVM models, but I get results when I use random forest regression or a decision tree.
The error says:
ValueError: y should be a 1d array, got an array of shape (20799, 100) instead.
The suggested solution is to use y_train.ravel() when fitting the model, but then the error below appears:
Found input variables with inconsistent numbers of samples: [14559, 1455900]
Here's my code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df=pd.read_csv('../input/filteredbymany.csv',low_memory=False,usecols=['county','crashalcoh','drvrsex','developmen','lightcond','drvrvehtyp','drvrage','pedage','city','crashloc','crashtype','pedpos'])
df.dropna(inplace=True)
dummies= pd.get_dummies(df)
merged=pd.concat([df,dummies],axis='columns')
X = merged
X = X.drop(['county','crashalcoh','city','developmen','drvrage','drvrsex','drvrvehtyp','lightcond','pedage','crashloc','crashtype','pedpos'],axis='columns')
y = X.loc[:, X.columns.str.startswith('county')]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
model = LogisticRegression()
model.fit(X_train,y_train.values.ravel())
model.predict(X_test)
I have been struggling with this for around 80 hours or so. Please help.
The problem
You want to have an array X with N rows. Each row is a sample of something and each column is a feature of these samples. And then you want to have an array y with N values. The i'th value of y is the value ("label") you want to predict for the i'th row of X.
The first error
Your y is two-dimensional (shape is (N, 100)), but it should be one-dimensional (shape (N,)). So you have 100 labels for each instance in X, but the model you chose can only predict one label per instance.
The second error
Then you ravel it to a one-dimensional array with shape (100*N,). Now you have one dimension, but still too many values.
Solution
Look at your tables X and y and see which column of y you actually want.
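As a sketch of what that could look like with the code from the question (this assumes the goal is to predict the original county column rather than its one-hot dummies; the column names come from the question and df is the DataFrame read from the CSV there):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# df is the DataFrame loaded from the CSV in the question
X = pd.get_dummies(df.drop(columns=['county']))  # dummy-encode the features only
y = df['county']                                 # a single 1-d target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)  # max_iter raised to help convergence
model.fit(X_train, y_train)                # y_train is already 1-d, no ravel needed
print(model.predict(X_test)[:5])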
I have a numpy array "my_data". I am trying to split this dataset randomly. However, when I do this using the following code, I get a "train" array and a "test" array that have some rows in common.
training_idx = np.random.randint(my_data.shape[0], size=split_size)
test_idx = np.random.randint(my_data.shape[0], size=len(my_data)-split_size)
train, test = my_data[training_idx,:], my_data[test_idx,:]
My intention is to pick the train array randomly first, and then have whatever rows of my_data are not in the train array become the test array.
Is there a way in numpy to do so? (I am refraining from using sklearn to split my data.)
I referred to this post here to get here with my dataset.
How to split/partition a dataset into training and test datasets for, e.g., cross validation?
If I code per this post's logic, I end up with train and test datasets that share some redundant rows. I want train and test datasets where no rows are common.
Following this answer you can do:
train_idx = np.random.randint(my_data.shape[0], size=split_size)
mask = np.ones(my_data.shape[0], dtype=bool)  # one boolean flag per row, not per element
mask[train_idx] = False                       # mark the randomly chosen training rows
train, test = my_data[~mask], my_data[mask]
A more natural way, though, would be to slice a permutation of your data, as Poojan suggested:
permuted = np.random.permutation(my_data)
train, test = permuted[:split_size], permuted[split_size:]
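A minimal self-contained sketch of the permutation approach (the array and split size are made up for illustration):
import numpy as np

my_data = np.arange(20).reshape(10, 2)  # dummy 10x2 array
split_size = 7

permuted = np.random.permutation(my_data)
train, test = permuted[:split_size], permuted[split_size:]

print(train.shape, test.shape)  # (7, 2) (3, 2)
# No row appears in both splits, since each row of the permutation is used exactly once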
Is there any way to compute f1_score for a list of labels given as strings, regardless of their order?
f1_score(['a','b','c'],['a','c','b'],average='macro')
I wish this to return 1 instead of 0.33333333333.
I know I could vectorize the labels, but this syntax would be far easier in my case, since I am dealing with many labels.
What you need is the f1_score for a multilabel classification task, and for that you need a 2-d matrix for y_true and y_pred of shape [n_samples, n_labels].
You are currently supplying a 1-d array only. Hence it will be considered as a multi-class problem, not multilabel.
The official documentation provides the necessary details.
And for that to be scored correctly you need to convert the y_true, y_pred to label-indicator matrix as documented here:
y_true : 1d array-like, or label indicator array / sparse matrix
y_pred : 1d array-like, or label indicator array / sparse matrix
So you need to change the code like this:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
y_true = [['a','b','c']]
y_pred = [['a','c','b']]
binarizer = MultiLabelBinarizer()
# This should be your original approach
#binarizer.fit(your actual true output consisting of all labels)
# In this case, I am considering only the given labels.
binarizer.fit(y_true)
f1_score(binarizer.transform(y_true),
binarizer.transform(y_pred),
average='macro')
Output: 1.0
You can have a look at examples of MultilabelBinarizer here:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
https://stackoverflow.com/a/42392689/3374996
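As the comment in the snippet above suggests, in practice you would fit the binarizer on the full set of labels so that labels missing from a particular sample are still encoded; a small sketch with made-up labels:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

all_labels = [['a', 'b', 'c', 'd']]  # hypothetical full label universe
y_true = [['a', 'b', 'c'], ['d']]
y_pred = [['a', 'c', 'b'], ['a', 'd']]

binarizer = MultiLabelBinarizer()
binarizer.fit(all_labels)  # fit on every label you expect to see

print(f1_score(binarizer.transform(y_true),
               binarizer.transform(y_pred),
               average='macro'))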
I just started using the Python scikit-learn package to do linear regression. I am confused by the dimensions of the data set it requires. For example, I want to regress X on Y using the following code:
from sklearn import linear_model
x=[0,1,2]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit (x,y)
print('Coefficients: \n', regr.coef_)
The system returned the error: tuple index out of range.
According to the scikit-learn website, the arrays should look like:
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
(http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
from sklearn import linear_model
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit (x,y)
print('Coefficients: \n', regr.coef_)
So does that mean the package cannot regress X[i] on Y[i] for two single numbers? Must it map an array to a number, like [0,0] in X to 0 in Y?
Thanks in advance.
You can.
Simply reshape your data to be x = [[0], [1], [2]].
In this case, every point in your data will have a single feature, a single number.
scikit-learn requires your x to be a 2-dimensional array. It need not be a numpy array; you can always use a simple Python list.
If you have your x as a 1-dimensional array, as in your question, you can simply do the following:
x = [[value] for value in [0,1,2]]
This stores a 2D version of your 1D array in x, i.e. every individual value of your list is wrapped in its own list.
x can also be converted into a numpy array, and then reshaped as follows:
import numpy as np
x = np.array(x).reshape(-1, 1)
This converts your data into a 2D array so that you can use it for fitting the linear regression model from sklearn.
array([[0],
[1],
[2]])
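Putting it together, a minimal sketch of the fit with the reshaped data (using the x and y from the question):
import numpy as np
from sklearn import linear_model

x = np.array([0, 1, 2]).reshape(-1, 1)  # one feature per sample -> shape (3, 1)
y = [0, 1, 2]

regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients:', regr.coef_)  # [1.] for this perfectly linear data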