I am new to ML and learning the concepts. I am importing a CSV file that contains columns with a customer code and product details. I am trying to predict which product a customer will buy in the future, but the prediction just comes back as array([10]).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import numpy as np
sales_data = pd.read_csv("salesdata.csv")
sales_data = sales_data.astype('int32')
X = sales_data.drop(columns=['Product'])  # features: every column except the target
y = sales_data['Product']                 # target: the product that was bought
model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict([[7301, 52199000]])
predictions
So the good news is that your code is fine; there is just a misunderstanding about what the model returns. It is saying that product '10' is its prediction for the values you plugged into model.predict(). The output is the predicted class label, and sometimes we need an extra step to map that label back to something meaningful in our language. Try model.classes_; it lists all of the class labels the model was trained on.
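For example, a minimal sketch of that extra step (assuming the fitted model from the code above):
print(model.classes_)  # every Product label seen during training
predictions = model.predict([[7301, 52199000]])
print(predictions[0])  # the predicted Product label, here 10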
Related
I have this code sample, which works well when I run it in a Jupyter notebook. It displays a table (as an image) with two columns for the code below:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance
X = inputsdf
y = targetdf
X_traindf, X_testdf, y_traindf, y_testdf = train_test_split(X, y, random_state=0)
estimator = RandomForestClassifier(max_depth=2, random_state=0)
estimator.fit(X_traindf, y_traindf)
perm = PermutationImportance(estimator, random_state=1).fit(X_testdf, y_testdf)
eli5.show_weights(perm, feature_names = X_testdf.columns.tolist())
But I need these values converted into an array, a dictionary, or anything I can assign to variables and reuse, so it would look something like this:
{
"PercentageSalaryHike": "0.0960 +- 0.0222",
.
.
.
}
Can someone please help me? Or is there a better way to find the permutation importance of each column?
Note that eli5.show_weights returns an HTML display object, so wrapping it in np.array will not give you the numeric values. The numbers you see in the rendered table live on the fitted PermutationImportance object itself, in perm.feature_importances_ and perm.feature_importances_std_, and those can be assigned to variables directly.
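For example, a sketch that builds the dictionary shape you showed (assuming the fitted perm and X_testdf from your code; note that the ± value rendered by show_weights may scale the std differently):
importances = {
    name: "%.4f +- %.4f" % (mean, std)
    for name, mean, std in zip(X_testdf.columns,
                               perm.feature_importances_,
                               perm.feature_importances_std_)
}
print(importances)  # e.g. {"PercentageSalaryHike": "0.0960 +- 0.0222", ...}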
I have the dataset below.
I've created a Logistic Regression model from it and checked its accuracy, which is fine. Now the requirement is: I have new data with Age 30 and EstimatedSalary 50000, and I would like to predict whether Purchased will be 0 or 1. How do I pass the new values 30 and 50000 into my Python code?
Below is the Python code I've used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline
dataset = pd.read_csv(r"suv_data.csv")
X=dataset.iloc[:,[0,1]].values
y=dataset.iloc[:,2].values
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2, random_state=1)
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
classifier=LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
accuracy_score(y_test,y_pred)*100
In general, to evaluate a trained model (i.e. call .predict in sklearn), you need to input samples with the same shape as the samples the model was trained on.
In your case I suppose (see my comment on your question) you want samples with Age and EstimatedSalary as features, using Purchased as the label.
Then, to test on a single sample just try this:
single_test_sample = pd.DataFrame({'Age':[30], 'EstimatedSalary':[50000]}).iloc[:,[0,1]].values
single_test_sample = sc.transform(single_test_sample)
single_test_prediction = classifier.predict(single_test_sample)
Note that you can also add more rows to the Age and EstimatedSalary columns of the test dataframe; here I only added the sample you were interested in. If you add more, the model will output a prediction for each row in the test dataframe.
Also note that both your code and mine will work without the .values at the end of the train/test sets, since sklearn can work with pandas DataFrames directly.
Your question is not entirely clear, but I understand that you want to use the fitted model to predict a new sample.
After having fitted your model, just use this:
new_sample = np.array([[30,50000]]) # 2D numpy array
new_sample_sc = sc.transform(new_sample)
y_pred_new = classifier.predict(new_sample_sc)
print(y_pred_new)
I am trying to create a machine learning model to suggest treatment for stroke patients based on their responses to various questionnaires and assessments. For instance, the patient will be asked to rate the stiffness of the fingers, elbow, shoulder, and pectoral muscles (each on a scale of 0 to 100) or answer 14 questions related to mental health (each on a scale of 0 to 3).
I would like to create an sklearn pipeline roughly as follows:
1. The patient responses are aggregated. For example, the four stiffness responses should be averaged to create a single “stiffness” value, while the fourteen mental health questions should be summed up to create a single “mental health” value. The “stiffness” and “mental health” values would then be features in the model.
2. Once the features have been aggregated in this way, a decision tree classifier is trained on labeled data to assign each patient to the appropriate therapy.
3. The trained pipeline is exported as a PMML file for production.
I assume this must be doable with some code like this:
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn.tree import DecisionTreeClassifier
from somewhere import Something
pipeline = PMMLPipeline([
    ("input_aggregation", Something()),
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(patient_input, therapy_labels)
sklearn2pmml(pipeline, "ClassificationPipeline.pmml", with_repr = True)
I’ve been poking around the documentation, and I can figure out how to apply PCA to a group of columns, but not how to do something as straightforward as collapsing a group of columns by summing or averaging them. Does anyone have any hints about how I could do this?
Thanks for your help.
Sample code:
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import Aggregator
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (["stiffness_1", "stiffness_2", "stiffness_3", "stiffness_4"], Aggregator(function = "mean")),
        (["mental_health_1", "mental_health_2", .., "mental_health_14"], Aggregator(function = "sum"))
    ])),
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(X, y)
Explanation: you can use sklearn_pandas.DataFrameMapper to define a column group and apply a transformation to it. For the conversion to PMML to work, you need to provide a transformer class, not a plain function. Perhaps all your transformation needs are covered by the sklearn2pmml.preprocessing.Aggregator transformer class. If not, you can always define your own.
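If Aggregator doesn't cover a transformation, a minimal sketch of what "define your own" could look like as a standard sklearn transformer (the ColumnMean class here is hypothetical, and the PMML converter would still need explicit support for it):
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class ColumnMean(BaseEstimator, TransformerMixin):
    # collapses the mapped column group into its row-wise mean
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn
    def transform(self, X):
        return np.mean(np.asarray(X), axis=1).reshape(-1, 1)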
While @makis has provided a 100% valid Python example below, it wouldn't work in the Python-to-PMML case, because the converter cannot parse/handle custom Python functions.
You just need to define a custom function and use it in the Pipeline.
Here is the full code:
from sklearn.preprocessing import FunctionTransformer
import numpy as np
from sklearn2pmml import make_pmml_pipeline
# fake data with 7 columns
X = np.random.rand(10, 7)
def custom_function(X):
    # averages the first 4 columns and sums the remaining ones, column-wise
    return np.concatenate([np.mean(X[:, :4], axis=1).reshape(-1, 1),
                           np.sum(X[:, 4:], axis=1).reshape(-1, 1)], axis=1)
# Now, if you run `custom_function(X)`, it should return an array of shape (10, 2).
pipeline = make_pmml_pipeline(
    FunctionTransformer(custom_function),
)
Hello dear forum members,
I have a data set of 20 million randomly collected individual tweets (no two tweets come from the same account). Let me refer to this data set as the "general" data set. Also, I have another "specific" data set that includes 100,000 tweets collected from drug (opioid) abusers. Each tweet has at least one tag associated with it, e.g., opioids, addiction, overdose, hydrocodone, etc. (max 25 tags).
My goal is to use the "specific" data set to train the model using Keras and then use it to tag tweets in the "general" data set to identify tweets that might have been written by drug abusers.
Following the examples in source1 and source2, I managed to build a simple working version of such a model:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.preprocessing import text, sequence
from keras import utils
# load opioid-specific data set, where post is a tweet and tags is a single tag associated with a tweet
# how would I include multiple tags to be used in training?
data = pd.read_csv("filename.csv")
# make sure the text columns are strings before splitting
data['post'] = data['post'].astype(str)
data['tags'] = data['tags'].astype(str)
train_size = int(len(data) * .8)
train_posts = data['post'][:train_size]
train_tags = data['tags'][:train_size]
test_posts = data['post'][train_size:]
test_tags = data['tags'][train_size:]
# tokenize tweets
vocab_size = 100000 # what does vocabulary size really mean?
tokenize = text.Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)
# labeling
# is this where I add more columns with tags for training?
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
# model building
batch_size = 32
model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
num_labels = y_train.shape[1] # one output unit per distinct tag after binarizing
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size = batch_size, epochs = 5, verbose = 1, validation_split = 0.1)
# test prediction accuracy
score = model.evaluate(x_test, y_test,
batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
# make predictions using a test set
for i in range(1000):
    prediction = model.predict(np.array([x_test[i]]))
    text_labels = encoder.classes_
    predicted_label = text_labels[np.argmax(prediction[0])]
    print(test_posts.iloc[i][:50], "...")
    print('Actual label:' + test_tags.iloc[i])
    print("Predicted label: " + predicted_label)
In order to move forward, I would like to clarify a few things:
Let's say all my training tweets have a single tag -- opioids. If I then pass non-tagged tweets through the model, isn't it likely that it simply tags all of them as opioids, since it doesn't know anything else? Should I be using a variety of different tweets/tags for training? Are there any general guidelines for selecting tweets/tags for training purposes?
How can I add more columns with tags for training (rather than the single one used in the code)?
Once I train the model and achieve appropriate accuracy, how do I pass non-tagged tweets through it to make predictions?
How do I add a confusion matrix?
Any other relevant feedback is also greatly appreciated.
Thanks!
Examples of "general" tweets:
everybody messages me when im in class but never communicates on the weekends like this when im free. feels like that anyway lol.
i woke up late, and now i look like shit. im the type of person who will still be early to whatever, ill just look like i just woke up.
Examples of "specific" tweets:
$2 million grant to educate clinicians who prescribe opioids
early and regular marijuana use is associated with use of other illicit drugs, including opioids
My shot at this is:
1. Create a new dataset with tweets from the general + specific data, say 200K-250K tweets, where 100K come from your specific data set and the rest from the general one.
2. Take your 25 keywords/tags and write a rule: if one or more of them appears in a tweet, it is DA (Drug Abuser), otherwise NDA (Non Drug Abuser). This will be your dependent variable.
3. Your new dataset will be one column with all the tweets and another column with the dependent variable telling whether it is DA or NDA.
4. Now divide it into train/test sets and use Keras or any other algorithm so that it can learn.
5. Then test the model by plotting a confusion matrix (see the sketch after this list).
6. Pass your other remaining data from the general set through this model and check the results.
If there are new words beyond the 25 that do not occur in the specific dataset, the model you built will still try to intelligently guess the right category from the groups of words that occur together, tone, etc.
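A minimal sketch of the confusion-matrix step above (y_test and y_pred are placeholders for your true DA/NDA labels and the model's predictions):
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # rows = true classes, columns = predicted classes
print(cm)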
This is an example of my data:
filename,2,3,4,5,6,7,class
a.txt,0,0,0,0,0,0,0
b.txt,0,0,0,0,0,1,0
c.txt,0,0,0,0,1,0,0
d.txt,1,0,1,0,0,1,1
When I train, I just use columns 2 -> 7 as input and class as output. But when I test the model after it has been trained and saved, I need to know which files belong to which class, i.e., how to know that d.txt is class 1.
I use pandas to import the data from the .csv file, with separate CSV files for the train and test sets. In the training phase, I use columns 2-7 as input and the class column as target; these columns are numerical. The filename column is just text. In the test phase, I need to know the filename along with the predicted class, but I don't know how to do that.
Thanks
P.S.: I used MLP, SVM, and NB as classifiers.
Assuming your data is in .csv format:
filename,2,3,4,5,6,7,class
a.txt,0,0,0,0,0,0,0
b.txt,0,0,0,0,0,1,0
c.txt,0,0,0,0,1,0,0
d.txt,1,0,1,0,0,1,1
You can output the corresponding filename to a predicted class using:
features=[1,0,1,0,0,1] #input
output=clf.predict([features])[0] #predicted class
print(df[df["class"]==output]["filename"]) #corresponding filename
Note that in your example you're facing the problem where the number of features is greater than the number of examples, so the classifier may deteriorate.
Hopefully you just gave a sample of your data; in that case you're likely to be fine. Just watch out for which classifier you use.
Full code:
import numpy as np
import pandas as pd
from sklearn import svm
df=pd.read_csv('file.csv')
X = df.iloc[:,1:7].values
y = df.iloc[:,7].values # 1-D target, avoids a shape warning in fit
clf = svm.SVC() #using SVM as classifier
clf.fit(X, y)
features=[1,0,1,0,0,1]
output=clf.predict([features])[0]
print(df[df["class"]==output]["filename"])
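If instead you want each test file paired with its own predicted class, which seems to be what the question asks for, here is a sketch under the assumption that the test set sits in a separate CSV with the same layout (the name 'test.csv' is hypothetical):
test_df = pd.read_csv('test.csv')   # hypothetical test file with the same columns as file.csv
X_new = test_df.iloc[:,1:7].values  # columns 2-7 as features
predictions = clf.predict(X_new)
for fname, pred in zip(test_df['filename'], predictions):
    print(fname, '->', pred)        # e.g. d.txt -> 1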