Predict existing data using scikit-learn - Python

My dataset looks like this:
age  address  freetime  goout  Dalc  Walc  G1  G2  G3  AverageG
17   U        1         1      3     5     7   7   7   7
15   X        3         2      6     3     5   4   2   3.6666
20   T        1         5      4     1     3   2   1   2
What I'm trying to do in Python is to predict the value of AverageG, which is the average of G1, G2 and G3.
I know that AverageG can be computed directly as the mean of G1, G2 and G3, but in my case it has to be predicted using the scikit-learn library.

For this toy example you can use linear regression.
I'll give the general idea; you can then translate it to your specific dataframe:
from sklearn.linear_model import LinearRegression
import numpy as np

# Synthetic training data: 1000 rows of three integer grades in [0, 10)
X = np.random.randint(0, 10, (1000, 3))
# The target is the row-wise mean, mimicking AverageG
y = X.mean(axis=1)

model = LinearRegression()
model.fit(X, y)

# Predict the average for a new set of three grades
new_data = np.array([1, 2, 3]).reshape(1, -1)
model.predict(new_data)
and the model correctly predicts:
array([2.])
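To train on your actual dataframe rather than random numbers, a minimal sketch could look like this (assuming the table above has been loaded into a pandas DataFrame named df; the name is hypothetical):
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical: df holds the table above, loaded e.g. via pd.read_csv
X = df[['G1', 'G2', 'G3']]  # the three grades used as features
y = df['AverageG']          # the value to predict

model = LinearRegression()
model.fit(X, y)
model.predict(X)  # predicted AverageG for each existing row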


Visualizing clusters result using PCA (Python)

I have a dataset with 61 rows (users) and 26 columns. After normalizing the data I ran k-means on it and identified 10 clusters.
In parallel I want to visualize these clusters, which is why I used PCA to reduce the number of features.
A sample of the dataframe (the "..." hides further duration columns):
UserID Communication_dur Lifestyle_dur Music & Audio_dur Others_dur Personnalisation_dur Phone_and_SMS_dur Photography_dur Productivity_dur Social_Media_dur System_tools_dur ... Music & Audio_Freq Others_Freq Personnalisation_Freq Phone_and_SMS_Freq Photography_Freq Productivity_Freq Social_Media_Freq System_tools_Freq Video players & Editors_Freq Weather_Freq
1 63 219 9 10 99 42 36 30 76 20 ... 2 1 11 5 3 3 9 1 4 8
2 9 0 0 6 78 0 32 4 15 3 ... 0 2 4 0 2 1 2 1 0 0
I have written the following code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features before PCA
Sc = StandardScaler()
X = Sc.fit_transform(df)

# Project onto the first three principal components
pca = PCA(3)
pca.fit(X)
pca_data = pd.DataFrame(pca.transform(X))
print(pca_data.head())
which gives the following results:
   0  1  2
0  8 -4  5
1 -2 -2  1
2  1  1 -0
3  2 -1  1
4  3 -1 -3
I want to show a plot of the clusters in my dataset using PCA and interpret the results.
I am really new to this space, so any advice would be greatly appreciated.
Thanks in advance.
Using an example dataset:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
df, y = make_blobs(n_samples=70, centers=10, n_features=26, random_state=999, cluster_std=1)
Perform scaling, PCA and put the PC scores into a dataframe:
Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(2)
pca_data = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2'])
Perform k-means, place the labels into the dataframe, and you can already plot it using seaborn:
kmeans = KMeans(n_clusters=10).fit(X)
pca_data['cluster'] = pd.Categorical(kmeans.labels_)
sns.scatterplot(x="PC1",y="PC2",hue="cluster",data=pca_data)
Or with matplotlib:
fig, ax = plt.subplots()
scatter = ax.scatter(pca_data['PC1'], pca_data['PC2'], c=pca_data['cluster'], cmap='Set3', alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper left", title="")
ax.add_artist(legend1)
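To judge how much of the structure the 2-D plot actually shows, you can also inspect the variance retained by the two components (a short sketch using the pca object fitted above):
# Fraction of the dataset's variance captured by each principal component
print(pca.explained_variance_ratio_)
# If the two values sum to, say, 0.6, the scatterplot reflects about 60% of the variance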

Getting "Perfect separation detected, results not available" while building the Logistic Regression model

As part of my assignment I am building a logistic regression model, but I get the error "Perfect separation detected, results not available" while fitting it.
**X_train:**
       year  amt_spnt      rank
1 -1.723034 -0.418500  0.272727
2  0.716660  2.088507 -0.636364
3  1.174102 -0.558333 -1.545455
4 -0.503187 -1.297451  1.181818
5  1.326583 -0.628250 -1.545455
**y_train:**
1    0
2    1
3    1
4    0
5    1
Name: result, dtype: int64
**Logistic model code:**
import statsmodels.api as sm

logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1.fit().summary()
**Dataset before and after scaling (screenshot):** https://i.stack.imgur.com/cTncA.png
This is a model specification issue: because of the perfect separation, your model cannot converge. Perfect separation means there is one (or more) variable among your independent variables that can perfectly distinguish dependent variable = 0 from dependent variable = 1. See the following example:
Y  0  0  0  0  0  0  1  1  1  1
X  1  2  3  4  4  4  5  6  7  8
If X <= 4, then Y = 0; if X > 4, then Y = 1.
The short answer to your question is to find such a variable among your independent variables and remove it from your model.
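A minimal sketch of how such a variable could be found (assuming X_train is a pandas DataFrame and y_train the Series shown above):
# For each feature, compare its value range within each class;
# non-overlapping ranges indicate (quasi-)perfect separation
for col in X_train.columns:
    lo0, hi0 = X_train.loc[y_train == 0, col].agg(['min', 'max'])
    lo1, hi1 = X_train.loc[y_train == 1, col].agg(['min', 'max'])
    if hi0 < lo1 or hi1 < lo0:
        print(f"{col} perfectly separates the classes")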

GridSearchCV for number of neurons

I am trying to teach myself how to grid-search the number of neurons in a basic multi-layer neural network, using Python's GridSearchCV and KerasClassifier, as well as Keras. The code below works very well for other datasets, but I could not make it work for the Iris dataset for some reason, and I cannot find why; I am missing something here. The result I get is:
Best: 0.000000 using {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 5}
from pandas import read_csv
import numpy
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn.model_selection import GridSearchCV

dataframe = read_csv("iris.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:4].astype(float)
Y = dataset[:, 4]

seed = 7
numpy.random.seed(seed)

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# one-hot encoding
dummy_y = np_utils.to_categorical(encoded_Y)

# scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

def create_model(n_neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(n_neurons, input_dim=X.shape[1], activation='relu'))  # hidden layer
    model.add(Dense(3, activation='softmax'))  # output layer
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, initial_epoch=0, verbose=0)

# define the grid search parameters
neurons = [3, 5]
param_grid = dict(n_neurons=neurons)
# this does 3-fold cross-validation by default; one can change k via the cv argument
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, dummy_y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
For the purpose of illustration and computational efficiency I search over only two values. I sincerely apologize for asking such a simple question; I am new to Python, having switched from R because I realized the deep learning community uses Python.
Haha, this is probably the funniest thing I ever experienced on Stack Overflow :) Check:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
and you should see different behavior. The reason your model gets a perfect score (in terms of cross-entropy, where 0 is the best model possible) is that you haven't shuffled your data, and because Iris consists of three balanced classes, each fold had a single class as its target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 (first fold ends here) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (second fold ends here) 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Such folds are trivially "solved" by every model, which is why you got a perfect match.
Try shuffling your data beforehand; this should give the expected behavior.
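A minimal sketch of the fix, using standard sklearn utilities (the choice between the two options is an assumption about what fits your setup):
from sklearn.model_selection import KFold
from sklearn.utils import shuffle

# Option 1: shuffle the rows once before fitting
X, dummy_y = shuffle(X, dummy_y, random_state=seed)
grid_result = grid.fit(X, dummy_y)

# Option 2: let the cross-validation splitter do the shuffling
cv = KFold(n_splits=5, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv)
grid_result = grid.fit(X, dummy_y)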

Get most informative features from very simple scikit-learn SVM classifier

I tried to build a very simple SVM predictor that I can understand with my basic Python knowledge. As my code looks so different from this question and also this question, I don't know how to find the most important features for the SVM prediction in my example.
I have the following 'sample' containing features and class (status):
A  B  C  D  E  F  status
1  5  2  5  1  3  1
1  2  3  2  2  1  0
3  4  2  3  5  1  1
1  2  2  1  1  4  0
I saved the feature names as 'features':
A B C D E F
The features 'X':
1  5  2  5  1  3
1  2  3  2  2  1
3  4  2  3  5  1
1  2  2  1  1  4
And the status 'y':
1
0
1
0
Then I build X and y arrays from the sample, train and test on half of the sample, and count the correct predictions.
import pandas as pd
import numpy as np
from sklearn import preprocessing, svm

# Build the feature matrix and scale it
X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)

# Build the target vector
y = sample['status'].values.tolist()
y = np.array(y)

# Train on the first half, test on the second half
test_size = int(X.shape[0] / 2)
clf = svm.SVC(kernel="linear", C=1)
clf.fit(X[:-test_size], y[:-test_size])

# Count correct predictions on the held-out half
correct_count = 0
for x in range(1, test_size + 1):
    if clf.predict(X[-x].reshape(-1, len(features)))[0] == y[-x]:
        correct_count += 1
accuracy = (float(correct_count) / test_size) * 100.00
My problem is that I have no idea how to adapt the code from the questions above so that I can also see which features are the most important.
I would be grateful if you could tell me whether that's even possible with my simple version, and if yes, any tips on how to do it would be great.
For a linear kernel, the fitted classifier stores one weight per feature in clf.coef_. The larger the absolute value of a feature's weight, the more that feature contributes to the decision function, so you can rank features by absolute weight to find the most informative ones.
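A minimal sketch building on the code above (assuming clf has been fitted with kernel="linear" and features is the list of column names):
import numpy as np

# For a binary linear SVM, clf.coef_ has shape (1, n_features)
importance = np.abs(clf.coef_[0])

# Sort feature names by absolute weight, most informative first
ranked = sorted(zip(features, importance), key=lambda t: t[1], reverse=True)
for name, weight in ranked:
    print(f"{name}: {weight:.3f}")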

Python: How to use multinomial logistic regression with sklearn

I have a test dataset and a train dataset as below. I have provided a sample with minimal records, but my data has more than 1000 records. Here E is my target variable, which I need to predict using an algorithm. It has only four categories (1, 2, 3, 4) and can take only one of these values.
Training dataset:
 A   B   C   D  E
 1  20  30   1  1
 2  22  12  33  2
 3  45  65  77  3
12  43  55  65  4
11  25  30   1  1
22  23  19  31  2
31  41  11  70  3
 1  48  23  60  4
Test dataset (E is missing and must be predicted):
 A   B   C   D  E
11  21  12  11
 1   2   3   4
 5   6   7   8
99  87  65  34
11  21  24  12
Since E has only 4 categories, I thought of predicting it using multinomial logistic regression (one-vs-rest logic). I am trying to implement it in Python.
I know the logic, that we need to set these targets in a variable and use an algorithm to predict one of these values:
output = [1, 2, 3, 4]
But I am stuck on how to use Python (sklearn) to loop through these values, and on which algorithm to use to predict the output values. Any help would be greatly appreciated.
You could try
LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(X_train, y_train)
LogisticRegression can handle multiple classes out of the box, so there is no need to loop over the categories yourself:
from sklearn.linear_model import LogisticRegression

X = df[['A', 'B', 'C', 'D']]  # features
y = df['E']                   # target with categories 1-4

lr = LogisticRegression()
lr.fit(X, y)
preds = lr.predict(X)  # outputs an array of integer class labels
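To score the test dataset from the question, a short sketch (assuming it is loaded as a dataframe named df_test; the name is hypothetical, the column names are taken from the sample above):
X_test = df_test[['A', 'B', 'C', 'D']]

# Predicted category (1-4) for each test row
test_preds = lr.predict(X_test)

# Per-class probabilities, one column per category in lr.classes_
test_probs = lr.predict_proba(X_test)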
