I'm trying to train an SVM on a dataset with only two columns, like:
1 4.5456436
0 2.4353453
1 3.5435636
1 5.4235354
0 1.4235345
I have tried:
import numpy as np
from sklearn import svm

x = np.array([[1], [0], [1], [1]])
y = np.array([[4.5456436], [2.4353453], [3.5435636], [5.4235354]])
clf = svm.SVC()
clf.fit(y, x)
These lines work correctly, but when I import the arrays from the dataset file, I get an error:
ValueError: The number of classes has to be greater than one; got 1
although the output and the type in the two cases are the same.
The code that imports the data from the dataset file is:
import numpy as np

def read(dir):
    x = []
    y = []
    with open(dir) as f:
        lines = f.readlines()
        for i in range(len(lines)):
            x.append(lines[i][0])    # first character of the line as the label
            y.append(lines[i][1:])   # the rest of the line as the value
    x = np.array([[int(i)] for i in x])
    y = np.array([[float(i)] for i in y])
    return x, y
Any suggestions? Thanks in advance.
Posting the comment as an answer just to close the question.
The error means there is only one class (label) in the target. In the example you posted above (x = np.array([[1],[0],[1],[1]])), there are two categories to classify (0 and 1).
But when you import the dataset from the file, the target has only one category across all available samples. Please check the arrays loaded from your file.
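A quick way to confirm this is to inspect the unique labels after loading (a small sketch; 'dataset.txt' is a placeholder path, and it assumes read() returns the two arrays):

import numpy as np

x, y = read('dataset.txt')  # placeholder file name
print(np.unique(x))  # two classes should print [0 1]; a single value explains the ValueError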
Good Day,
I am trying to train an LSTM using multiple Excel files (motion-capture data) as input. Each Excel file represents a body motion; I would like to train the network using multiple motions in the training set and in the test set. Below is an example of a single Excel file: (screenshot omitted)
As for the input shape, it's (1, 2751, 93). The input dimension breakdown:
samples: 1
time steps: 2751
features: 93
The independent input variable (x) is the human joints along with their positions, and the dependent variable (y) is the label of each movement.
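For reference, here is a minimal sketch of a network input matching that shape; the framework (Keras), the layer size, and the number of classes are my assumptions, not from the post:

from tensorflow.keras import layers, models

num_classes = 8  # assumption: one class per motion

# An LSTM consuming sequences of 2751 time steps with 93 features each;
# the samples dimension is left implicit (the batch axis).
model = models.Sequential([
    layers.LSTM(64, input_shape=(2751, 93)),
    layers.Dense(num_classes, activation='softmax'),
])
model.summary()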
Thanks in Advance!
EDIT: Added Elaborate Code
# Multiple sheets
import os
import glob

import pandas as pd

motionName = []
for ds in glob.glob("*.csv"):
    head, tail = os.path.split(str(ds))
    motionName.append(tail)
    print('Motion Name: ', tail)

num_rows = 300
samples = 0
datasets = []
activityIndex = []
list_num_features = []
for i, activity in enumerate(motionName):
    data = pd.read_csv('{}'.format(motionName[i]), nrows=num_rows, header=None, skiprows=1)
    list_num_features.append([])
    datasets.append(data)
    for j in range(len(data.columns)):
        list_num_features[i].append(data.columns[j])
    activityIndex.append('{}'.format(motionName[i]))
    samples += 1
print('activityIndex : {} '.format(activityIndex))

for i in range(len(datasets)):
    print('{}'.format(motionName[i]))
    print(datasets[i].head())
The output: (screenshot omitted)
The expected output when invoking df.head() would be something similar to the usual tabular preview of the first rows: (screenshot omitted)
What I am trying to do is print every record (row) separately when desired. I was able to do that when loading a single dataframe using the sample code below, but it failed when I loaded multiple dataframes into a list and tried to apply the same step to each dataframe in a loop.
# Single Sheet
import pandas as pd
dataset = pd.read_csv('motion.csv')
index = dataset.index
print(len(index))
num_rows = len(index)
dataset.head()
EDIT: Question Clarified!
Simply, what I have now is the following:
8 dataframes stored in a list (list shape: (8,))
each dataframe has shape (300, 93)
What I want is to reshape this list to (8, 300, 93), for instance, so that it matches the input layer of the neural network.
I keep getting the error below:
ValueError: cannot reshape array of size 8 into shape (8,300,93)
Could someone clarify why I am getting this error? Things are rather vague at my end.
Thanks in advance!
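For what it's worth, the error occurs because np.array over a list of 8 dataframes produces a 1-D object array of size 8, which holds 8 references rather than 8*300*93 numbers, so it cannot be reshaped. A sketch of one way to build the 3-D array instead (assuming every dataframe really is 300x93):

import numpy as np

# Stack the numeric values of each dataframe along a new first axis.
data_3d = np.stack([df.values for df in datasets])
print(data_3d.shape)  # (8, 300, 93)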
I wrote this function to handle the preprocessing and overcome the reshaping issue. The function also encodes the labels (y) using scikit-learn's LabelEncoder().
## Data preprocessing
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess_df(df, start, quantity, numRows, df_name):
    x = []
    features = []
    y = []
    label_encoder = LabelEncoder()
    for i in range(start, quantity):
        data = pd.read_csv('{}'.format(df[i]), nrows=numRows, skiprows=1)
        y.append(df[i])   # the file name doubles as the label
        x.append(data)
        if i == start:
            for j in range(0, len(data.columns)):
                features.append(data.columns[j])
        if df_name == 'test':
            i = i - start
            print('({}/{}) x[{}]: {}'.format(i + 1, (quantity - start), i, x[i].shape))
        else:
            print('({}/{}) x[{}]: {}'.format(i + 1, quantity, i, x[i].shape))
    print('{} set (x) shape: {}, {} set (y) shape: {}'.format(df_name, np.array(x).shape, df_name, np.array(y).shape))
    y = np.array(label_encoder.fit_transform(y))
    return np.array(x), y, np.array(features)
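A hedged usage sketch (the 6/2 train/test split below is an assumption, not from the post):

# Hypothetical split: first 6 motion files for training, last 2 for testing.
x_train, y_train, features = preprocess_df(motionName, 0, 6, 300, 'train')
x_test, y_test, _ = preprocess_df(motionName, 6, 8, 300, 'test')
print(x_train.shape, x_test.shape)  # expected: (6, 300, 93) (2, 300, 93)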
I am working on a classifier that uses the non-NaN values of Target_Column to predict what should replace the NaN values. After training my model, I test it before and after assigning the predictions to a new column in my original dataframe, and the two tests do not match. The issue appears to be that one of the transformations after prediction somehow shuffles the predictions so that they no longer line up. I have removed as much extraneous code as I can.
##############################################################
### Here is the initial data transformation for background ###
### You can skip to next section for now... ###
##############################################################
import pandas as pd
import scipy.sparse
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_excel('./Data/[Excel File That Holds Data].xlsx')
df['Target_Column'] = df['Target_Column'].astype('str').str.strip()
le = LabelEncoder()
y_full = le.fit_transform(df['Target_Column'].astype('str').str.strip())
desc_vec = TfidfVectorizer(ngram_range=(1,3), max_features=1000)
tag_vec = TfidfVectorizer(ngram_range=(1,4), analyzer='char', max_features=1000)
desc = desc_vec.fit_transform(df.descriptor.astype('str').fillna(''))
tag = tag_vec.fit_transform(df.tag.astype('str').fillna(''))
X = scipy.sparse.hstack([desc, tag])
### The indexing here matches the indexing done in the line marked below ###
nan_encoding = le.transform(['nan'])[0]
X_train = X.todense()[y_full != nan_encoding]
y_train = y_full[y_full != nan_encoding]
X_train.shape, y_train.shape  # ---> ((94669, 2000), (94669,))
########################################
### Here is where the problem starts ###
########################################
import xgboost as xgb
from sklearn.metrics import accuracy_score

clf = xgb.XGBClassifier(n_estimators=10, max_depth=11, tree_method='gpu_hist', n_jobs=94)
clf.fit(X_train, y_train)
out_full = clf.predict(X)
out_training_set = clf.predict(X_train)
df['Target_Predicted'] = le.inverse_transform(out_full)
>>> accuracy_score(out_training_set, y_train)
0.9832152024421933
### The indexing here matches the indexing done in the lines marked above ###
>>> print(accuracy_score(df.Target_Column[y_full != nan_encoding], df.Target_Predicted[y_full != nan_encoding]))
0.0846422799438042
>>> print(accuracy_score(df.Target_Column[(df.Target_Column != 'nan')], df.Target_Predicted[(df.Target_Column!= 'nan')]))
0.0846422799438042
>>> (df.Target_Column[(df.Target_Column!= 'nan')].values == le.inverse_transform(y_train)).all()
True
>>> (df.Target_Column[y_full != nan_encoding] == le.inverse_transform(y_train)).all()
True
>>> (le.transform(df.Target_Predicted[y_full != nan_encoding]) == out_full[y_full != nan_encoding]).all()
True
As you can see, both ways of indexing the newly created column return the same results; they are indexed exactly the same way as when I created the training set, and (for the actual target values) they return exactly the same values. So how can the accuracy have changed?
I think it might be related to the fact that you are using predict on a sparse matrix (or maybe there is some shuffling, as you suggested). Anyhow, try filling the prediction column with NaNs (or any value you need to represent the missing values) and then filling the indexes for which you have the target variable with the predictions from the dense training data:
tmp = pd.Series(y_full)
valid_indexes = tmp[tmp != nan_encoding].index.values
df['Target_Predicted'] = le.inverse_transform([nan_encoding])[0]
df.loc[valid_indexes, 'Target_Predicted'] = le.inverse_transform(out_training_set)
Hope it helps!
The problem lies with predicting on a sparse matrix, as Davide DN suggested in his answer. Changing the line
out_full = clf.predict(X)
to
out_full = clf.predict(X.todense())
solved the problem immediately. Hopefully, anyone else with the same problem has data that fits in memory in dense format.
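A quick consistency check (a sketch reusing the variables above): dense predictions restricted to the training rows should now agree with the earlier training-set predictions.

out_full = clf.predict(X.todense())
assert (out_full[y_full != nan_encoding] == out_training_set).all()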
On a fresh installation of Anaconda under Ubuntu... I am preprocessing my data in various ways prior to a classification task using Scikit-Learn.
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler().fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
This all works fine, but if I have a new sample (temp below) that I want to classify (and thus want to preprocess in the same way):
temp = [1,2,3,4,5,5,6,....................,7]
temp = scaler.transform(temp)
Then I get a deprecation warning...
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
So the question is how should I be rescaling a single sample like this?
I suppose an alternative (not a very good one) would be...
temp = [temp, temp]
temp = scaler.transform(temp)
temp = temp[0]
But I'm sure there are better ways.
Just listen to what the warning is telling you:
Reshape your data using X.reshape(-1, 1) if your data has a single feature/column,
and X.reshape(1, -1) if it contains a single sample.
For your example, if you have more than one feature/column (converting the list to an array first):
import numpy as np
temp = np.asarray(temp).reshape(1, -1)
For one feature/column:
temp = np.asarray(temp).reshape(-1, 1)
Well, it actually looks like the warning is telling you what to do.
As part of sklearn.pipeline stages' uniform interfaces, as a rule of thumb:
when you see X, it should be an np.array with two dimensions
when you see y, it should be an np.array with a single dimension.
Here, therefore, you should consider the following:
import numpy as np

temp = [1,2,3,4,5,5,6,....................,7]
# This makes it into a 2d array
temp = np.array(temp).reshape((len(temp), 1))
temp = scaler.transform(temp)
This might help:
temp = [[1,2,3,4,5,6,.....,7]]
On a pandas object, .values.reshape(-1,1) will be accepted without alerts/warnings;
a plain .reshape(-1,1) will be accepted, but with a deprecation warning.
I faced the same issue and got the same deprecation warning. I was using a numpy array of shape [23, 276] when I got the message. I tried reshaping it as per the warning and got nowhere. Then I selected each row from the numpy array (as I was iterating over it anyway) and assigned it to a list variable. It then worked without any warning.
array = []
array.append(temp[0])  # temp[0] is one row of the [23, 276] array
Then you can use the Python list object (here 'array') as an input to scikit-learn functions. Not the most efficient solution, but it worked for me.
You can always reshape like:
import numpy as np
temp = np.array([1,2,3,4,5,5,6,7])
temp = temp.reshape(len(temp), 1)
because the major issue is that your temp.shape is
(8,)
and you need
(8, 1)
-1 is the unknown dimension of the array. Read more about the "newshape" parameter in the numpy.reshape documentation.
# X is a 1-D ndarray
# If we want a COLUMN vector (many/one/unknown samples, 1 feature):
X = X.reshape(-1, 1)
# If we want a ROW vector (1 sample, many/one/unknown features):
X = X.reshape(1, -1)
# Pandas example: df is assumed to be a dataframe with an 'x_1'
# feature column and a 'target' column.
from sklearn.linear_model import LinearRegression
import pandas as pd

X = df[['x_1']]
X_n = X.values.reshape(-1, 1)  # ensure a 2-D column vector
y = df['target']
y_n = y.values                 # 1-D target array
model = LinearRegression()
model.fit(X_n, y_n)
y_pred = pd.Series(model.predict(X_n), index=X.index)
I'm training a Python (2.7.11) classifier for text classification and, while running, I get a deprecation warning that I can't trace to a specific line in my code. The code works fine and gives me the results, though. The warning:
\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
My code:
import sys
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def main():
    data = []
    folds = 10
    ex = [[] for x in range(0, 10)]
    results = []
    for i, f in enumerate(sys.argv[1:]):
        data.append(csv.DictReader(open(f, 'r'), delimiter='\t'))
    for f in data:
        for i, datum in enumerate(f):
            ex[i % folds].append(datum)
    #print ex
    for held_out in range(0, folds):
        l = []
        cor = []
        l_test = []
        cor_test = []
        vec = []
        vec_test = []
        for i, fold in enumerate(ex):
            for line in fold:
                if i == held_out:
                    l_test.append(line['label'].rstrip("\n"))
                    cor_test.append(line['text'].rstrip("\n"))
                else:
                    l.append(line['label'].rstrip("\n"))
                    cor.append(line['text'].rstrip("\n"))
        vectorizer = CountVectorizer(ngram_range=(1,1), min_df=1)
        X = vectorizer.fit_transform(cor)
        for c in cor:
            tmp = vectorizer.transform([c]).toarray()
            vec.append(tmp[0])
        for c in cor_test:
            tmp = vectorizer.transform([c]).toarray()
            vec_test.append(tmp[0])
        clf = MultinomialNB()
        clf.fit(vec, l)
        result = accuracy(l_test, vec_test, clf)  # accuracy() is a user-defined helper
        print result

if __name__ == "__main__":
    main()
Any idea which line raises this warning?
Another issue is that running this code with different data sets gives me exactly the same accuracy, and I can't figure out what causes this.
I also want to use this model in another Python process: I looked at the documentation and found an example using the pickle library, but not joblib. So I tried following the same code, but it gave me errors:
clf = joblib.load('model.pkl')
pred = clf.predict(vec);
Also, my data is a CSV file with the format "label \t text \n"; what should be in the label column of the test data?
Thanks in advance
Your 'vec' input to clf.fit(vec, l) needs to be of type [[]], not just []. This is a quirk that I always forget when I fit models.
Just adding an extra set of square brackets should do the trick!
The line that raises the warning is:
pred = clf.predict(vec)
I used this in my code and it worked:
import numpy as np

# This makes it into a 2-D array
temp = [2, 70, 90, 1]  # an instance
temp = np.array(temp).reshape((1, -1))
print(model.predict(temp))
Two solutions, with the same underlying idea: turn your data from 1-D into 2-D.
Just add brackets:
vec = [vec]
Or reshape your data:
import numpy as np
vec = np.array(vec).reshape(1, -1)
If you want to find out where the warning is coming from, you can temporarily promote warnings to exceptions. This gives you a full traceback, and thus the lines where your program encountered the warning.
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("error")
    main()
If you run the program from the command line, you can also use the -W flag. More information on warning handling can be found in the Python documentation.
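For example (the script name and argument are placeholders):

python -W error::DeprecationWarning my_classifier.py data.tsv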
I know it is only one part of your question that I answered, but did you debug your code?
Since 1-D arrays are deprecated, try passing a 2-D array as the parameter. This might help.
clf = joblib.load('model.pkl')
pred = clf.predict([vec])
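For completeness, a minimal sketch of the full joblib round trip (the 'model.pkl' name is just the one used above; on older scikit-learn versions joblib was bundled as sklearn.externals.joblib):

import joblib  # on older scikit-learn: from sklearn.externals import joblib

# In the training process, persist the fitted classifier...
joblib.dump(clf, 'model.pkl')

# ...and in another process, load it back and predict on a 2-D input.
clf = joblib.load('model.pkl')
pred = clf.predict([vec])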
The predict method expects a 2-D array; you have to change [] to [[]]. You can watch this video, where I have located the exact time: https://youtu.be/KjJ7WzEL-es?t=2602
I am trying to get the scores of all the features of my data set.
import numpy
from sklearn.feature_selection import SelectKBest, chi2

file_data = numpy.genfromtxt(input_file)
y = file_data[:,-1]
X = file_data[:,0:-1]
x_new = SelectKBest(chi2, k='all').fit_transform(X, y)
Previously, the first row of X held the feature names as strings, but that gave me an "Input contains NaN, infinity or a value too large for dtype('float64')" error. So now X contains only the data, and y contains the target values (1, -1).
How can I get the score of each feature from SelectKBest (I am trying to use univariate feature selection)?
Thanks.
Solution
You just have to do something like this.
file_data = numpy.genfromtxt(input_file)
y = file_data[:,-1]
X = file_data[:,0:-1]
selector = SelectKBest(chi2, k='all').fit(X,y)
x_new = selector.transform(X) # not needed to get the score
scores = selector.scores_
Your problem
When you call .fit_transform(features, target) directly, the selector is not stored; you only get back the selected features. The scores, however, are an attribute of the selector. To get them, you have to use .fit(features, target). Once your selector is fitted, you can get the selected features by calling selector.transform(features), as you can see in the code above.
As noted in the code comment, you don't need to transform the features to get the scores; fitting alone is enough.
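To inspect every feature's score ranked from best to worst, here is a small sketch reusing the fitted selector from the code above:

import numpy as np

# Rank feature indices by chi2 score, highest first.
ranking = np.argsort(selector.scores_)[::-1]
for idx in ranking:
    print('feature {}: score {:.4f}'.format(idx, selector.scores_[idx]))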
Links
Documentation about SelectKBest in sklearn
Example in the docs