scikit-learn: Reshape your data either using X.reshape(-1, 1) - python

I'm training a Python (2.7.11) classifier for text classification, and while running it I get a deprecation warning, but I can't tell which line in my code is causing it. The code still works fine and gives me results:
\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
My code:
import sys
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def main():
    data = []
    folds = 10
    ex = [[] for x in range(0, 10)]
    results = []
    for i, f in enumerate(sys.argv[1:]):
        data.append(csv.DictReader(open(f, 'r'), delimiter='\t'))
    for f in data:
        for i, datum in enumerate(f):
            ex[i % folds].append(datum)
    # print ex
    for held_out in range(0, folds):
        l = []
        cor = []
        l_test = []
        cor_test = []
        vec = []
        vec_test = []
        for i, fold in enumerate(ex):
            for line in fold:
                if i == held_out:
                    l_test.append(line['label'].rstrip("\n"))
                    cor_test.append(line['text'].rstrip("\n"))
                else:
                    l.append(line['label'].rstrip("\n"))
                    cor.append(line['text'].rstrip("\n"))
        vectorizer = CountVectorizer(ngram_range=(1, 1), min_df=1)
        X = vectorizer.fit_transform(cor)
        for c in cor:
            tmp = vectorizer.transform([c]).toarray()
            vec.append(tmp[0])
        for c in cor_test:
            tmp = vectorizer.transform([c]).toarray()
            vec_test.append(tmp[0])
        clf = MultinomialNB()
        clf.fit(vec, l)
        result = accuracy(l_test, vec_test, clf)
        print result

if __name__ == "__main__":
    main()
Any idea which line raises this warning?
Another issue is that running this code with different data sets gives me the exact same accuracy, and I can't figure out why.
Also, I want to use this model in another Python process. I looked at the documentation and found an example using the pickle library, but not one for joblib. So I tried following the same pattern with joblib, but it gave me errors:
clf = joblib.load('model.pkl')
pred = clf.predict(vec);
Also, my data is a CSV file with the format "label \t text \n"; what should be in the label column of the test data?
Thanks in advance

Your vec input to clf.fit(vec, l) needs to be of type [[]] (a list of lists, i.e. 2-d), not just [] (1-d). This is a quirk that I always forget when I fit models.
Just adding an extra set of square brackets should do the trick!
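For illustration, a minimal sketch (the counts and labels below are made up) of fitting on a 2-d list and wrapping a single sample before predicting:
from sklearn.naive_bayes import MultinomialNB

# toy data: 3 training samples, 4 features each (values are made up)
X_train = [[1, 0, 2, 0], [0, 1, 0, 3], [2, 2, 0, 1]]
y_train = ["pos", "neg", "pos"]

clf = MultinomialNB()
clf.fit(X_train, y_train)      # X must be 2-d: a list of rows

sample = [1, 0, 1, 0]          # a single sample is 1-d ...
print(clf.predict([sample]))   # ... so wrap it in one more list before predicting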

The offending call is:
pred = clf.predict(vec);
I used this in my code and it worked:
import numpy as np

temp = [2, 70, 90, 1]                    # a single instance
temp = np.array(temp).reshape((1, -1))   # this makes it a 2-d array
print(model.predict(temp))

Two solutions, same philosophy: turn your data from 1-d into 2-d.
Just add an extra pair of brackets:
vec = [vec]
Or reshape your data:
import numpy as np
vec = np.array(vec).reshape(1, -1)

If you want to find out where the warning is coming from, you can temporarily promote warnings to exceptions. This will give you a full traceback and thus the line where your program triggered the warning.
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("error")
    main()
If you run the program from the command line, you can also use the -W flag. More information on warning handling can be found in the Python documentation.
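For example, from the command line (the script name and arguments here are just placeholders):
python -W error::DeprecationWarning my_classifier.py train1.tsv train2.tsv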
I know this only answers one part of your question, but did you try debugging your code?

Since passing a 1-d array is deprecated, try passing a 2-d array as the parameter. This might help:
clf = joblib.load('model.pkl')
pred = clf.predict([vec]);
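For the joblib part of the question, a minimal save/load round trip might look like this (the file name is just an example; in older scikit-learn versions joblib lives under sklearn.externals):
from sklearn.externals import joblib   # on newer versions: import joblib

joblib.dump(clf, 'model.pkl')    # in the training process, after clf.fit(...)
clf = joblib.load('model.pkl')   # in the other process
pred = clf.predict([vec])        # vec is a single sample, so keep it wrapped in a list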

The predict method expects a 2-d array. You can watch this video, where I have also linked the exact time: https://youtu.be/KjJ7WzEL-es?t=2602. You have to change from [] to [[]].

Related

I cannot get accuracy score to match when measured before and after assigning to pandas dataframe

I am working on a classifier which will use non-nan values from Target_Column to predict what should be in the place of all nan values. But after training my model, I test it before and after assigning the predictions to a new column in my original dataframe. These two tests do not match. To me, the issue appears to be that one of the transformations after prediction is somehow shuffling the predictions so that they no longer match. I have removed as much extraneous code as I can.
##############################################################
### Here is the initial data transformation for background ###
### You can skip to next section for now... ###
##############################################################
df = pd.read_excel('./Data/[Excel File That Holds Data].xlsx')
df['Target_Column'] = df['Target_Column'].astype('str').str.strip()
le = LabelEncoder()
y_full = le.fit_transform(df['Target_Column'].astype('str').str.strip())
desc_vec = TfidfVectorizer(ngram_range=(1,3), max_features=1000)
tag_vec = TfidfVectorizer(ngram_range=(1,4), analyzer='char', max_features=1000)
desc = desc_vec.fit_transform(df.descriptor.astype('str').fillna(''))
tag = tag_vec.fit_transform(df.tag.astype('str').fillna(''))
X = scipy.sparse.hstack([desc, tag])
### The indexing here matches the indexing done in the line marked below ###
nan_encoding = le.transform(['nan'])[0]
X_train = X.todense()[y_full != nan_encoding]
y_train = y_full[y_full != nan_encoding]
X_train.shape, y_train.shape #---> ((94669, 2000), (94669,))
########################################
### Here is where the problem starts ###
########################################
clf = xgb.XGBClassifier(n_estimators=10, max_depth=11, tree_method='gpu_hist', n_jobs=94)
clf.fit(X_train, y_train)
out_full = clf.predict(X)
out_training_set = clf.predict(X_train)
df['Target_Predicted'] = le.inverse_transform(out_full)
>>> accuracy_score(out_training_set, y_train)
0.9832152024421933
### The indexing here matches the indexing done in the lines marked above ###
>>> print(accuracy_score(df.Target_Column[y_full != nan_encoding], df.Target_Predicted[y_full != nan_encoding]))
0.0846422799438042
>>> print(accuracy_score(df.Target_Column[(df.Target_Column != 'nan')], df.Target_Predicted[(df.Target_Column!= 'nan')]))
0.0846422799438042
>>> (df.Target_Column[(df.Target_Column!= 'nan')].values == le.inverse_transform(y_train)).all()
True
>>> (df.Target_Column[y_full != nan_encoding] == le.inverse_transform(y_train)).all()
True
>>> (le.transform(df.Target_Predicted[y_full != nan_encoding]) == out_full[y_full != nan_encoding]).all()
True
As you can see, both ways of indexing the newly created columns in the dataframe return the same results, they are indexed exactly the same way as when I created the training set initially, and (for the actual target values) they return exactly the same values. So how can the accuracy have changed?
I think it might be related to the fact that you are using predict on a sparse matrix (or maybe there is some shuffling like you suggested). Anyhow, try filling the prediction column with NaNs (or whatever value you need to represent the missing values) first, and then fill the indexes for which you have the target variable with the predictions made on the dense training matrix:
tmp = pd.Series(y_full)
valid_indexes = tmp[tmp != nan_encoding].index.values
df['Target_Predicted'] = le.inverse_transform([nan_encoding])[0]
df.Target_Predicted.iloc[valid_indexes] = le.inverse_transform(out_training_set)
Hope it helps!
The problem lies with predicting on a sparse matrix, as Davide DN suggested in his answer. Changing the line
out_full = clf.predict(X)
to
out_full = clf.predict(X.todense())
solved the problem immediately. Hopefully if someone else has the same problem, then their data fits in memory in dense format.
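As a self-contained illustration of the densify-before-predict pattern (using LogisticRegression as a stand-in so the snippet runs without XGBoost; the toy data below is made up):
import numpy as np
import scipy.sparse
from sklearn.linear_model import LogisticRegression

# toy sparse feature matrix (4 samples, 3 features) and labels
X = scipy.sparse.csr_matrix(np.array([[1., 0., 0.],
                                      [0., 2., 0.],
                                      [0., 0., 3.],
                                      [1., 1., 0.]]))
y = np.array([0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)

# densify once before predicting so the rows line up with the dataframe
pred = clf.predict(np.asarray(X.todense()))
print(pred)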

K-Prototypes in python "IndexError: too many indices for array"

I am trying to perform k-prototypes clustering on mixed data (categorical and numeric). My input file is a CSV that looks like this (it contains 300000 rows):
Unnamed: 0.1,market,vendor_name,price,ship_from,category_cl
0,mark,03welle,1.79367196,DE,Drugs
1,aruna,03welle,0.05880975,DE,Drugs
2,ny,03welle,0.11344859,DE,Drugs
3,mi,03welle,0.18655316,DE,Drugs
I am trying to use k-prototypes because it can cluster mixed data. The problem is that I am getting an error which I cannot understand (and therefore cannot fix). I am using code I found in the relevant repo:
import numpy as np
print("initialising")
syms = np.genfromtxt('pameteliko.csv', dtype=str, delimiter='\t')[:, 0]
print("******")
print(syms)
X = np.genfromtxt('pameteliko.csv', dtype=object, delimiter='\t')[:, 1:]
print("################")
X[:, 0] = X[:, 0].astype(float)
from kmodes.kprototypes import KPrototypes
kproto = KPrototypes(n_clusters=6, init='Cao', verbose=2)
clusters = kproto.fit_predict(X, categorical=[1, 2])
#Print cluster centroids of the trained model.
print(kproto.cluster_centroids_)
#Print training statistics
print(kproto.cost_)
print(kproto.n_iter_)
(The prints are there for debugging purposes). I am getting the following error:
IndexError: too many indices for array
I also have some doubts regarding syms and X. Any help would be really appreciated.
Change the delimiter from '\t' to ',':
syms = np.genfromtxt('pameteliko.csv', dtype=str, delimiter=',')[:, 0]
print("******")
print(syms)
X = np.genfromtxt('pameteliko.csv', dtype=object, delimiter=',')[:, 1:]
because you are reading a comma-separated values file. I hope it works!
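If the file really is comma-separated, a sketch using pandas may also be easier to debug (the column selection below is an assumption based on the sample rows above):
import pandas as pd
from kmodes.kprototypes import KPrototypes

df = pd.read_csv('pameteliko.csv')                    # comma-separated by default
X = df[['vendor_name', 'price', 'ship_from']].values  # assumed feature columns
X[:, 1] = X[:, 1].astype(float)                       # price is the numeric column

kproto = KPrototypes(n_clusters=6, init='Cao', verbose=2)
clusters = kproto.fit_predict(X, categorical=[0, 2])  # vendor_name and ship_from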

Reshaping data for Linear regression [duplicate]

On a fresh installation of Anaconda under Ubuntu... I am preprocessing my data in various ways prior to a classification task using Scikit-Learn.
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler().fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
This all works fine, but if I have a new sample (temp below) that I want to classify (and thus want to preprocess in the same way):
temp = [1,2,3,4,5,5,6,....................,7]
temp = scaler.transform(temp)
Then I get a deprecation warning...
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
So the question is how should I be rescaling a single sample like this?
I suppose an alternative (not very good one) would be...
temp = [temp, temp]
temp = scaler.transform(temp)
temp = temp[0]
But I'm sure there are better ways.
Just listen to what the warning is telling you:
Reshape your data either using X.reshape(-1, 1) if your data has a single feature/column,
or X.reshape(1, -1) if it contains a single sample.
For your example (if you have more than one feature/column):
temp = np.array(temp).reshape(1, -1)
For one feature/column:
temp = np.array(temp).reshape(-1, 1)
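A quick sketch of the resulting shapes (the values are arbitrary):
import numpy as np

temp = np.array([1, 2, 3, 4])
print(temp.shape)                  # (4,)   -> 1-d, this is what triggers the warning
print(temp.reshape(1, -1).shape)   # (1, 4) -> one sample with four features
print(temp.reshape(-1, 1).shape)   # (4, 1) -> four samples with one feature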
Well, it actually looks like the warning is telling you what to do.
As part of sklearn.pipeline stages' uniform interfaces, as a rule of thumb:
when you see X, it should be an np.array with two dimensions
when you see y, it should be an np.array with a single dimension.
Here, therefore, you should consider the following:
temp = [1,2,3,4,5,5,6,....................,7]
# This makes it into a 2d array
temp = np.array(temp).reshape((len(temp), 1))
temp = scaler.transform(temp)
This might help:
temp = [[1,2,3,4,5,6,.....,7]]
.values.reshape(-1,1) will be accepted without alerts/warnings
.reshape(-1,1) will be accepted, but with a deprecation warning
I faced the same issue and got the same deprecation warning. I was using a numpy array of shape [23, 276] when I got the message. I tried reshaping it as per the warning and ended up nowhere. Then I selected each row from the numpy array (as I was iterating over it anyway) and assigned it to a list variable. It then worked without any warning.
array = []
array.append(temp[0])
Then you can use the Python list object (here array) as input to scikit-learn functions. Not the most efficient solution, but it worked for me.
You can always reshape, like:
temp = np.array([1, 2, 3, 4, 5, 5, 6, 7])
temp = temp.reshape(len(temp), 1)
because the major issue is when your temp.shape is:
(8,)
and you need
(8, 1)
-1 is the unknown dimension of the array. Read more about the "newshape" parameter in the numpy.reshape documentation.
# X is a 1-d ndarray
# If we want a COLUMN vector (many/one/unknown samples, 1 feature):
X = X.reshape(-1, 1)
# If we want a ROW vector (one sample, many/one/unknown features):
X = X.reshape(1, -1)
from sklearn.linear_model import LinearRegression
X = df[['x_1']]
X_n = X.values.reshape(-1, 1)
y = df['target']
y_n = y.values
model = LinearRegression()
model.fit(X_n, y_n)
y_pred = pd.Series(model.predict(X_n), index=X.index)

How to train SVM using python?

I'm trying to train an SVM on a dataset with only two columns, like:
1 4.5456436
0 2.4353453
1 3.5435636
1 5.4235354
0 1.4235345
I have tried:
x = np.array([[1],[0],[1],[1]])
y = np.array([[4.5456436],[2.4353453],[3.5435636],[5.4235354]])
clf = svm.SVC()
clf.fit(y,x)
For these lines it works correctly, but the problem occurs when I build the arrays from the dataset file; then I get an error:
ValueError: The number of classes has to be greater than one; got 1
although the output and the type in the two cases are the same.
The code that imports the data from the dataset file is:
def read(dir):
    x = []
    y = []
    with open(dir) as f:
        lines = f.readlines()
        for i in range(len(lines)):
            x.append(lines[i][0]); y.append(lines[i][1:])
    x = np.array([[int(i)] for i in x])
    y = np.array([[float(i)] for i in y])
Any suggestions? Thanks in advance.
Posting the comment as an answer, just to close the question.
The error means that there is only one type of class (label) in the target. See, in the example you posted above (x = np.array([[1],[0],[1],[1]])), there are two categories to classify (0 and 1).
But when you import the dataset from the file, the target has only one type of category for all the available samples. Please check the arrays loaded from your file.
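A quick way to confirm this is to check the unique labels right after loading, e.g.:
import numpy as np

x = np.array([[1], [0], [1], [1]])
print(np.unique(x))       # [0 1] -> two classes, SVC is fine

x_bad = np.array([[1], [1], [1], [1]])
print(np.unique(x_bad))   # [1]   -> only one class, fit() raises the ValueError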

numpy and pytables issue (error: tuple index out of range)

I am new to Python and PyTables. Currently I am working on a project about clustering and the KNN algorithm. This is what I have got:
********** code *****************
import numpy.random as npr
import numpy as np
import tables
import ctypes

# step 0: obtain the cluster
dtype = np.dtype('f4')
pnts_inds = np.arange(100)
npr.shuffle(pnts_inds)
pnts_inds = pnts_inds[:10]
pnts_inds = np.sort(pnts_inds)
for i, ind in enumerate(pnts_inds):
    clusters[i] = pnts_obj[ind]

# step 1: save the result to an HDF5 file called clst_fn.h5
filters = tables.Filters(complevel=1, complib='zlib')
clst_fobj = tables.openFile('clst_fn.h5', 'w')
clst_obj = clst_fobj.createCArray(clst_fobj.root, 'clusters',
                                  tables.Atom.from_dtype(dtype), clusters.shape,
                                  filters=filters)
clst_obj[:] = clusters
clst_fobj.close()

# step 2: other functions
# blabla

# step 3: load the cluster from clst_fn.h5
pnts_fobj = tables.openFile('clst_fn.h5', 'r')
for pnts in pnts_fobj.walkNodes('/', classname='Array'):
    break

# step 4: evoke another function (called knn). Its input argument is the data
# from pnts. I have checked the knn function individually; it works well if
# the input is pnts = npr.rand(100, 128)
def knn(pnts):
    pnts = np.ascontiguousarray(pnts)
    N = ctypes.c_uint(pnts.shape[0])
    D = ctypes.c_uint(pnts.shape[1])
    # ...

# evoke knn using the cluster from clst_fn.h5 (see step 3)
knn(pnts)
********** end of code *****************
My problem now is that Python is giving me a hard time with:
IndexError: tuple index out of range
This error comes from this line:
D = ctypes.c_uint(pnts.shape[1])
Obviously, there must be something wrong with the input argument. Any thoughts on fixing the problem? Thank you in advance.
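To illustrate what the traceback means (a toy sketch, not the original HDF5 data): shape[1] only exists when the array is 2-d, so checking and reshaping the loaded node before calling knn should avoid the IndexError:
import numpy as np

pnts = np.arange(10, dtype='f4')   # stand-in for the array read back in step 3
print(pnts.shape)                  # (10,): 1-d, so pnts.shape[1] raises IndexError
if pnts.ndim == 1:
    pnts = pnts.reshape(1, -1)     # or (-1, 128) if each point has 128 features
print(pnts.shape)                  # (1, 10): both shape[0] and shape[1] exist now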
