OneHotEncoder stripping headers - python

I am trying to build an ML model on the Titanic dataset. While preparing the data I used OneHotEncoder to create Embarked dummies, and in doing so I lost my column headers.
Here is how the dataset looked before.
Pclass Sex Age SibSp Parch Fare Cabin Embarked
0 3 1 22.000000 1 0 7.2500 146 2
1 1 0 38.000000 1 0 71.2833 81 0
2 3 0 26.000000 0 0 7.9250 146 2
3 1 0 35.000000 1 0 53.1000 55 2
4 3 1 35.000000 0 0 8.0500 146 2
... ... ... ... ... ... ... ... ...
886 2 1 27.000000 0 0 13.0000 146 2
887 1 0 19.000000 0 0 30.0000 30 2
888 3 0 29.699118 1 2 23.4500 146 2
889 1 1 26.000000 0 0 30.0000 60 0
890 3 1 32.000000 0 0 7.7500 146 1
Here is the code.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X))
X
Here is how the dataset looks now.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 22.000000 1.0 7.2500 146.0
1 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 38.000000 1.0 71.2833 81.0
2 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 26.000000 0.0 7.9250 146.0
3 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 35.000000 1.0 53.1000 55.0
4 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 35.000000 0.0 8.0500 146.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 1.0 27.000000 0.0 13.0000 146.0
887 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 19.000000 0.0 30.0000 30.0
888 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 29.699118 1.0 23.4500 146.0
889 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 26.000000 0.0 30.0000 60.0
890 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 3.0 1.0 32.000000 0.0 7.7500 146.0

You can use the get_feature_names method of ColumnTransformer (renamed get_feature_names_out in scikit-learn 1.0), provided all your transformers support that method and you've trained on a dataframe.
ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X), columns=ct.get_feature_names())
X

The output of fit_transform is array-like:
X_t : {array-like, sparse matrix} of shape (n_samples, sum_n_components)
(not DataFrame-like)
Thus no headers. If you want headers, you'll have to supply them when rebuilding the DataFrame.
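On scikit-learn 1.0 or later the method is named get_feature_names_out instead; a minimal sketch of the same idea under that assumption:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
# get_feature_names_out() replaced get_feature_names() in scikit-learn 1.0
X = pd.DataFrame(ct.fit_transform(X), columns=ct.get_feature_names_out())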

I am trying to write a For Loop in Python to identify types of sales for use with a 'sales report'

UPDATED - 4.13.22
I am new to Python programming and am trying to create a program using for loops that goes through a data frame row by row, identifies different types of 'group sales' made up of different combinations of product sales, and posts the results in a 'Result' column.
I was told in previous comments to print the df and paste it:
Date LFMIX SALE LCSIX SALE LOTIX SALE LSPIX SALE LEQIX SALE \
0 0.0 0.0 30000.0 0.0 0.0 0.0
1 0.0 0.0 30000.0 0.0 0.0 0.0
2 0.0 30000.0 0.0 0.0 0.0 0.0
3 0.0 25000.0 25000.0 0.0 0.0 0.0
4 0.0 30000.0 30000.0 0.0 0.0 0.0
5 0.0 30000.0 0.0 0.0 0.0 30000.0
6 0.0 0.0 30000.0 0.0 0.0 30000.0
7 0.0 25000.0 25000.0 0.0 0.0 25000.0
AUM LFMIX AUM LCSIX AUM LOTIX AUM LSPIX AUM LEQIX \
0 200000.0 0.0 0.0 0.0 0.0
1 500000.0 0.0 0.0 0.0 0.0
2 0.0 200000.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 200000.0
5 0.0 200000.0 0.0 0.0 0.0
6 200000.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
is the sale = 10% of pairing fund AUM LFMIX LCSIX LOTIX LSPIX LEQIX \
0 0.0 1 1 0.0 0.0 0.0
1 0.0 1 1 0.0 0.0 0.0
2 0.0 1 1 0.0 0.0 0.0
3 0.0 1 1 0.0 0.0 0.0
4 0.0 1 1 0.0 0.0 1.0
5 0.0 1 1 0.0 0.0 1.0
6 0.0 1 1 0.0 0.0 1.0
7 0.0 1 1 0.0 0.0 1.0
Expected_Result Result
0 DP1
1 0
2 DP2
3 DP3
4 TT1
5 TT2
6 TT3
7 TT4
my Python code to sort just the 1st row:
for row in range(len(df)):
    if df["LCSIX"][row] >= (df["AUM LFMIX"][row] * .1):
        df["Result"][row] = "DP1"
and the results:
Date LFMIX SALE LCSIX SALE LOTIX SALE LSPIX SALE LEQIX SALE \
0 0.0 0.0 30000.0 0.0 0.0 0.0
1 0.0 0.0 30000.0 0.0 0.0 0.0
2 0.0 30000.0 0.0 0.0 0.0 0.0
3 0.0 25000.0 25000.0 0.0 0.0 0.0
4 0.0 30000.0 30000.0 0.0 0.0 0.0
5 0.0 30000.0 0.0 0.0 0.0 30000.0
6 0.0 0.0 30000.0 0.0 0.0 30000.0
7 0.0 25000.0 25000.0 0.0 0.0 25000.0
AUM LFMIX AUM LCSIX AUM LOTIX AUM LSPIX AUM LEQIX \
0 200000.0 0.0 0.0 0.0 0.0
1 500000.0 0.0 0.0 0.0 0.0
2 0.0 200000.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 200000.0
5 0.0 200000.0 0.0 0.0 0.0
6 200000.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
is the sale = 10% of pairing fund AUM LFMIX LCSIX LOTIX LSPIX LEQIX \
0 0.0 1 1 0.0 0.0 0.0
1 0.0 1 1 0.0 0.0 0.0
2 0.0 1 1 0.0 0.0 0.0
3 0.0 1 1 0.0 0.0 0.0
4 0.0 1 1 0.0 0.0 1.0
5 0.0 1 1 0.0 0.0 1.0
6 0.0 1 1 0.0 0.0 1.0
7 0.0 1 1 0.0 0.0 1.0
Expected_Result Result
0 DP1
1 0
2 DP2 DP1
3 DP3 DP1
4 TT1 DP1
5 TT2 DP1
6 TT3
7 TT4 DP1
As you can see, the code fails to identify row[0] as DP1 and misidentifies other rows.
I am planning to code for loops that will identify 17 different types of group sales; this is simply the 1st group I am trying to identify...
Thanks for the help.
When you're working with pandas, you need to think in terms of operating on whole columns, NOT row by row, which is hopelessly slow in pandas. If you really need to go row by row, do that before you convert the data to pandas.
In this case, you need to set the "Result" column for all rows where your condition is met. This does that in one line:
df.loc[df["LCSIX"] >= df["AUM LFMIX"] * 0.1, "Result"] = "DP1"
So we use .loc to select the rows where the relation is true and assign "DP1" to the "Result" column for just those rows. Simple. ;)
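Since you eventually want 17 different group-sale types, stacking many such lines gets unwieldy; here is a minimal sketch with numpy.select (the second rule and its label are hypothetical, purely to show the shape):
import numpy as np

# Order matters: the first matching condition wins.
conditions = [
    df["LCSIX"] >= df["AUM LFMIX"] * 0.1,       # DP1 rule from above
    df["LFMIX SALE"] >= df["AUM LCSIX"] * 0.1,  # hypothetical second rule
]
labels = ["DP1", "DP2"]
df["Result"] = np.select(conditions, labels, default="0")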

.rolling() on groupby dataframe

I grouped a data frame by week number and got a column of numbers that looks like this:
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 235.0
17 849.0
18 1013.0
19 1155.0
20 1170.0
21 1247.0
22 1037.0
23 1197.0
24 1125.0
25 1106.0
26 1229.0
I used the following line of code on the column:
df_group['rolling_total'] = df_group['totals'].rolling(2).sum()
This is my desired result:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
235.0
1084.0
2097.0
3252.0
4422.0
5669.0
6706.0
7903.0
9028.0
10134.0
11363.0
I get this instead:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
235.0
1084.0
1862.0
2168.0
2325.0
2417.0
2284.0
2234.0
2322.0
2231.0
2335.0
All I want is the rolling sum of the column. Is .rolling() not the way to accomplish this? Is there something I am doing wrong?
Use .cumsum(), not .rolling(): .rolling(2).sum() sums each value with just the one before it (a window of two), while .cumsum() accumulates everything seen so far.
print(df.cumsum())
Prints:
column1
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 235.0
17 1084.0
18 2097.0
19 3252.0
20 4422.0
21 5669.0
22 6706.0
23 7903.0
24 9028.0
25 10134.0
26 11363.0
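Applied to the names in your own snippet (assuming df_group and the 'totals' column from your code), the fix is one line:
df_group['rolling_total'] = df_group['totals'].cumsum()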

Add and fill missing columns with values of 0s in pandas matrix [python]

I have a matrix of the form:
movie_id 1 2 3 ... 1494 1497 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 1.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0
. ...
.
.
As you can see, even though there are 1500 movies in my dataset, some movie columns are missing because of the preprocessing my data has gone through.
What I want is to add all the columns (movie_ids) that weren't recorded and fill them with 0s (I don't know exactly which movie_ids are missing). So, for example, I want a new matrix of the form:
movie_id 1 2 3 ... 1494 1495 1496 1497 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
. ...
.
.
Use DataFrame.reindex along axis=1 with fill_value=0 to conform the dataframe columns to a new index range:
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1, fill_value=0)
Result:
movie_id    1    2    3  ...  1498  1499  1500
user_id
1600      1.0  0.0  1.0  ...     0     0   1.0
1601      1.0  0.0  0.0  ...     0     0   0.0
1602      0.0  0.0  0.0  ...     0     0   1.0
1603      0.0  0.0  1.0  ...     0     0   0.0
1604      1.0  0.0  0.0  ...     0     0   0.0
I assume the variable name of the matrix is matrix:
n_movies = 1500
movie_ids = matrix.columns
# iterate over all possible ids
for movie_id in range(1, n_movies + 1):
    # if there's no such movie, create a column filled with zeros
    if movie_id not in movie_ids:
        matrix[movie_id] = 0
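The same fill can also be done without an explicit loop; assuming the column labels are the integers 1 through 1500 as above, reindex creates any missing columns and fills them with 0:
matrix = matrix.reindex(columns=range(1, n_movies + 1), fill_value=0)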

Separating strings from numerical data in a .txt file by python [duplicate]

This question already has answers here:
How read Common Data Format (CDF) in Python
(4 answers)
Closed 4 years ago.
I have a .txt file that looks like this:
08/19/93 UW ARCHIVE 100.0 1962 W IEEE 14 Bus Test Case
BUS DATA FOLLOWS 14 ITEMS
1 Bus 1 HV 1 1 3 1.060 0.0 0.0 0.0 232.4 -16.9 0.0 1.060 0.0 0.0 0.0 0.0 0
2 Bus 2 HV 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045 50.0 -40.0 0.0 0.0 0
3 Bus 3 HV 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010 40.0 0.0 0.0 0.0 0
4 Bus 4 HV 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
5 Bus 5 HV 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
6 Bus 6 LV 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070 24.0 -6.0 0.0 0.0 0
7 Bus 7 ZV 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
8 Bus 8 TV 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090 24.0 -6.0 0.0 0.0 0
9 Bus 9 LV 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.19 0
10 Bus 10 LV 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
11 Bus 11 LV 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
12 Bus 12 LV 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
13 Bus 13 LV 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
14 Bus 14 LV 1 1 0 1.036 -16.04 14.9 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I need to remove the characters from this file and keep only the numerical data in matrix form. I am relatively new to Python, so any kind of help will be really appreciated. Thank you.
I would suggest reading the data into a pandas DataFrame and then deleting the text columns, or creating a second frame without them.
Try:
import pandas as pd

data = pd.read_csv('output_list.txt', sep=" ", header=None)  # sep='\s+' handles runs of spaces more robustly
data.columns = ["a", "b", "c", "etc."]
Since this is simple to do in pandas when the data is well-formed, here is my take:
import pandas as pd
data = '''\
08/19/93 UW ARCHIVE 100.0 1962 W IEEE 14 Bus Test Case
BUS DATA FOLLOWS 14 ITEMS
1 Bus 1 HV 1 1 3 1.060 0.0 0.0 0.0 232.4 -16.9 0.0 1.060 0.0 0.0 0.0 0.0 0
2 Bus 2 HV 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045 50.0 -40.0 0.0 0.0 0
3 Bus 3 HV 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010 40.0 0.0 0.0 0.0 0
4 Bus 4 HV 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
5 Bus 5 HV 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
6 Bus 6 LV 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070 24.0 -6.0 0.0 0.0 0
7 Bus 7 ZV 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
8 Bus 8 TV 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090 24.0 -6.0 0.0 0.0 0
9 Bus 9 LV 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.19 0
10 Bus 10 LV 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
11 Bus 11 LV 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
12 Bus 12 LV 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
13 Bus 13 LV 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
'''
from io import StringIO  # pd.compat.StringIO was removed in newer pandas versions
fileobj = StringIO(data)
# change fileobj to filepath and sep to `\t`
df = pd.read_csv(fileobj, sep='\s+', header=None, skiprows=2)
df = df.loc[:,df.dtypes != 'object']
print(df)
Returns:
0 2 4 5 6 7 8 9 10 11 12 13 14 \
0 1 1 1 1 3 1.060 0.00 0.0 0.0 232.4 -16.9 0.0 1.060
1 2 2 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045
2 3 3 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010
3 4 4 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.000
4 5 5 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.000
5 6 6 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070
6 7 7 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.000
7 8 8 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090
8 9 9 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.000
9 10 10 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.000
10 11 11 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.000
11 12 12 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.000
12 13 13 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.000
15 16 17 18 19
0 0.0 0.0 0.0 0.00 0
1 50.0 -40.0 0.0 0.00 0
2 40.0 0.0 0.0 0.00 0
3 0.0 0.0 0.0 0.00 0
4 0.0 0.0 0.0 0.00 0
5 24.0 -6.0 0.0 0.00 0
6 0.0 0.0 0.0 0.00 0
7 24.0 -6.0 0.0 0.00 0
8 0.0 0.0 0.0 0.19 0
9 0.0 0.0 0.0 0.00 0
10 0.0 0.0 0.0 0.00 0
11 0.0 0.0 0.0 0.00 0
12 0.0 0.0 0.0 0.00 0
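If the goal is literally a numerical matrix rather than a DataFrame, one extra step converts it (to_numpy assumes pandas 0.24+; on older versions df.values does the same):
matrix = df.to_numpy()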

Grid search with f1 as scoring function, implementation error?

I'm training an MLP with sklearn version 0.18dev. I don't know what's wrong with my code. Could you please help?
# TODO: Import 'GridSearchCV' and 'make_scorer'
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer, f1_score
# TODO: Create the parameters list you wish to tune
parameters = {'max_iter' : [100,200]}
# TODO: Initialize the classifier
clf = clf_B
# TODO: Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_score, pos_label = 'Yes')
# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring = f1_scorer)
# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)
# Get the estimator
clf = grid_obj.best_estimator_
# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))
And the error message
---
IndexError Traceback (most recent call last)
<ipython-input-216-4a3fb1d65cb7> in <module>()
24
25 # TODO: Fit the grid search object to the training data and find the optimal parameters
---> 26 grid_obj = grid_obj.fit(X_train, y_train)
27
28 # Get the estimator
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
810
811 """
--> 812 return self._fit(X, y, ParameterGrid(self.param_grid))
813
814
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
537 'of samples (%i) than data (X: %i samples)'
538 % (len(y), n_samples))
--> 539 cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
540
541 if self.verbose > 0:
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in check_cv(cv, X, y, classifier)
1726 if classifier:
1727 if type_of_target(y) in ['binary', 'multiclass']:
-> 1728 cv = StratifiedKFold(y, cv)
1729 else:
1730 cv = KFold(_num_samples(y), cv)
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, shuffle, random_state)
546 for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
547 for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 548 label_test_folds = test_folds[y == label]
549 # the test split can be too big because we used
550 # KFold(max(c, self.n_folds), self.n_folds) instead of
IndexError: too many indices for array
I'm using MLPClassifier, and this is what my input looks like:
clf_B = MLPClassifier(random_state=4)
print X_train
school_GP school_MS sex_F sex_M age address_R address_U \
171 1.0 0.0 0.0 1.0 16 0.0 1.0
12 1.0 0.0 0.0 1.0 15 0.0 1.0
13 1.0 0.0 0.0 1.0 15 0.0 1.0
151 1.0 0.0 0.0 1.0 16 0.0 1.0
310 1.0 0.0 1.0 0.0 19 0.0 1.0
274 1.0 0.0 1.0 0.0 17 0.0 1.0
371 0.0 1.0 0.0 1.0 18 1.0 0.0
29 1.0 0.0 0.0 1.0 16 0.0 1.0
109 1.0 0.0 1.0 0.0 16 0.0 1.0
327 1.0 0.0 0.0 1.0 17 1.0 0.0
131 1.0 0.0 1.0 0.0 15 0.0 1.0
128 1.0 0.0 0.0 1.0 18 1.0 0.0
174 1.0 0.0 1.0 0.0 16 0.0 1.0
108 1.0 0.0 0.0 1.0 15 1.0 0.0
280 1.0 0.0 0.0 1.0 17 0.0 1.0
163 1.0 0.0 0.0 1.0 17 0.0 1.0
178 1.0 0.0 0.0 1.0 16 1.0 0.0
275 1.0 0.0 1.0 0.0 17 0.0 1.0
35 1.0 0.0 1.0 0.0 15 0.0 1.0
276 1.0 0.0 1.0 0.0 18 1.0 0.0
282 1.0 0.0 1.0 0.0 18 1.0 0.0
99 1.0 0.0 1.0 0.0 16 0.0 1.0
194 1.0 0.0 0.0 1.0 16 0.0 1.0
357 0.0 1.0 1.0 0.0 17 0.0 1.0
10 1.0 0.0 1.0 0.0 15 0.0 1.0
112 1.0 0.0 1.0 0.0 16 0.0 1.0
338 1.0 0.0 1.0 0.0 18 0.0 1.0
292 1.0 0.0 1.0 0.0 18 0.0 1.0
305 1.0 0.0 1.0 0.0 18 0.0 1.0
340 1.0 0.0 1.0 0.0 19 0.0 1.0
.. ... ... ... ... ... ... ...
255 1.0 0.0 0.0 1.0 17 0.0 1.0
58 1.0 0.0 0.0 1.0 15 0.0 1.0
33 1.0 0.0 0.0 1.0 15 0.0 1.0
38 1.0 0.0 1.0 0.0 15 1.0 0.0
359 0.0 1.0 1.0 0.0 18 0.0 1.0
51 1.0 0.0 1.0 0.0 15 0.0 1.0
363 0.0 1.0 1.0 0.0 17 0.0 1.0
260 1.0 0.0 1.0 0.0 18 0.0 1.0
102 1.0 0.0 0.0 1.0 15 0.0 1.0
195 1.0 0.0 1.0 0.0 17 0.0 1.0
167 1.0 0.0 1.0 0.0 16 0.0 1.0
293 1.0 0.0 1.0 0.0 17 1.0 0.0
116 1.0 0.0 0.0 1.0 15 0.0 1.0
124 1.0 0.0 1.0 0.0 16 0.0 1.0
218 1.0 0.0 1.0 0.0 17 0.0 1.0
287 1.0 0.0 1.0 0.0 17 0.0 1.0
319 1.0 0.0 1.0 0.0 18 0.0 1.0
47 1.0 0.0 0.0 1.0 16 0.0 1.0
213 1.0 0.0 0.0 1.0 18 0.0 1.0
389 0.0 1.0 1.0 0.0 18 0.0 1.0
95 1.0 0.0 1.0 0.0 15 1.0 0.0
162 1.0 0.0 0.0 1.0 16 0.0 1.0
263 1.0 0.0 1.0 0.0 17 0.0 1.0
360 0.0 1.0 1.0 0.0 18 1.0 0.0
75 1.0 0.0 0.0 1.0 15 0.0 1.0
299 1.0 0.0 0.0 1.0 18 0.0 1.0
22 1.0 0.0 0.0 1.0 16 0.0 1.0
72 1.0 0.0 1.0 0.0 15 1.0 0.0
15 1.0 0.0 1.0 0.0 16 0.0 1.0
168 1.0 0.0 1.0 0.0 16 0.0 1.0
famsize_GT3 famsize_LE3 Pstatus_A ... higher internet \
171 1.0 0.0 0.0 ... 1 1
12 0.0 1.0 0.0 ... 1 1
13 1.0 0.0 0.0 ... 1 1
151 0.0 1.0 0.0 ... 1 0
310 0.0 1.0 0.0 ... 1 0
274 1.0 0.0 0.0 ... 1 1
371 0.0 1.0 0.0 ... 0 1
29 1.0 0.0 0.0 ... 1 1
109 0.0 1.0 0.0 ... 1 1
327 1.0 0.0 0.0 ... 1 1
131 1.0 0.0 0.0 ... 1 1
128 1.0 0.0 0.0 ... 1 1
174 0.0 1.0 0.0 ... 1 1
108 1.0 0.0 0.0 ... 1 1
280 0.0 1.0 1.0 ... 1 1
163 1.0 0.0 0.0 ... 0 1
178 1.0 0.0 0.0 ... 1 1
275 0.0 1.0 0.0 ... 1 1
35 1.0 0.0 0.0 ... 1 0
276 1.0 0.0 1.0 ... 0 1
282 0.0 1.0 0.0 ... 1 0
99 1.0 0.0 0.0 ... 1 1
194 1.0 0.0 0.0 ... 1 1
357 0.0 1.0 1.0 ... 1 0
10 1.0 0.0 0.0 ... 1 1
112 1.0 0.0 0.0 ... 1 1
338 0.0 1.0 0.0 ... 1 1
292 0.0 1.0 0.0 ... 1 1
305 1.0 0.0 0.0 ... 1 1
340 1.0 0.0 0.0 ... 1 1
.. ... ... ... ... ... ...
255 0.0 1.0 0.0 ... 1 1
58 0.0 1.0 0.0 ... 1 1
33 0.0 1.0 0.0 ... 1 1
38 1.0 0.0 0.0 ... 1 1
359 0.0 1.0 0.0 ... 1 1
51 0.0 1.0 0.0 ... 1 1
363 0.0 1.0 0.0 ... 1 1
260 1.0 0.0 0.0 ... 1 1
102 1.0 0.0 0.0 ... 1 1
195 0.0 1.0 0.0 ... 1 1
167 1.0 0.0 0.0 ... 1 1
293 0.0 1.0 0.0 ... 1 0
116 1.0 0.0 0.0 ... 1 0
124 1.0 0.0 0.0 ... 1 1
218 1.0 0.0 0.0 ... 1 0
287 1.0 0.0 0.0 ... 1 1
319 1.0 0.0 0.0 ... 1 1
47 1.0 0.0 0.0 ... 1 1
213 1.0 0.0 0.0 ... 1 1
389 1.0 0.0 0.0 ... 1 0
95 1.0 0.0 0.0 ... 1 1
162 0.0 1.0 0.0 ... 1 0
263 1.0 0.0 0.0 ... 1 0
360 0.0 1.0 1.0 ... 1 0
75 1.0 0.0 0.0 ... 1 1
299 0.0 1.0 0.0 ... 1 1
22 0.0 1.0 0.0 ... 1 1
72 1.0 0.0 0.0 ... 1 1
15 1.0 0.0 0.0 ... 1 1
168 1.0 0.0 0.0 ... 1 1
romantic famrel freetime goout Dalc Walc health absences
171 1 4 3 2 1 1 3 2
12 0 4 3 3 1 3 5 2
13 0 5 4 3 1 2 3 2
151 1 4 4 4 3 5 5 6
310 1 4 2 4 2 2 3 0
274 1 4 3 3 1 1 1 2
371 1 4 3 3 2 3 3 3
29 1 4 4 5 5 5 5 16
109 1 5 4 5 1 1 4 4
327 0 4 4 5 5 5 4 8
131 1 4 3 3 1 2 4 0
128 0 3 3 3 1 2 4 0
174 0 4 4 5 1 1 4 4
108 1 1 3 5 3 5 1 6
280 1 4 5 4 2 4 5 30
163 0 5 3 3 1 4 2 2
178 1 4 3 3 3 4 3 10
275 1 4 4 4 2 3 5 6
35 0 3 5 1 1 1 5 0
276 1 4 1 1 1 1 5 75
282 0 5 2 2 1 1 3 1
99 0 5 3 5 1 1 3 0
194 0 5 3 3 1 1 3 0
357 1 1 2 3 1 2 5 2
10 0 3 3 3 1 2 2 0
112 0 3 1 2 1 1 5 6
338 0 5 3 3 1 1 1 7
292 1 5 4 3 1 1 5 12
305 0 4 4 3 1 1 3 8
340 1 4 3 4 1 3 3 4
.. ... ... ... ... ... ... ... ...
255 0 4 4 4 1 2 5 2
58 0 4 3 2 1 1 5 2
33 0 5 3 2 1 1 2 0
38 0 4 3 2 1 1 5 2
359 0 5 3 2 1 1 4 0
51 0 4 3 3 1 1 5 2
363 1 2 3 4 1 1 1 0
260 1 3 1 2 1 3 2 21
102 0 5 3 3 1 1 5 4
195 1 4 3 2 1 1 5 0
167 1 4 2 3 1 1 3 0
293 0 3 1 2 1 1 3 6
116 0 4 4 3 1 1 2 2
124 1 5 4 4 1 1 5 0
218 0 3 3 3 1 4 3 3
287 0 4 3 3 1 1 3 6
319 0 4 4 4 3 3 5 2
47 0 4 2 2 1 1 2 4
213 0 4 4 4 2 4 5 15
389 0 1 1 1 1 1 5 0
95 0 3 1 2 1 1 1 2
162 0 4 4 4 2 4 5 0
263 0 3 2 3 1 1 4 4
360 1 4 3 4 1 4 5 0
75 0 4 3 3 2 3 5 6
299 1 1 4 2 2 2 1 5
22 0 4 5 1 1 3 5 2
72 1 3 3 4 2 4 5 2
15 0 4 4 4 1 2 2 4
168 0 5 1 5 1 1 4 0
[300 rows x 48 columns]
This is what my output looks like:
print y_train
passed
171 yes
12 yes
13 yes
151 yes
310 no
274 yes
371 yes
29 yes
109 yes
327 yes
131 no
128 no
174 no
108 yes
280 no
163 yes
178 no
275 yes
35 no
276 no
282 yes
99 no
194 yes
357 yes
10 no
112 yes
338 yes
292 yes
305 yes
340 yes
.. ...
255 no
58 no
33 yes
38 yes
359 yes
51 yes
363 yes
260 yes
102 yes
195 yes
167 yes
293 yes
116 yes
124 no
218 no
287 yes
319 yes
47 yes
213 no
389 no
95 yes
162 no
263 no
360 yes
75 yes
299 yes
22 yes
72 no
15 yes
168 no
[300 rows x 1 columns]
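A likely cause (an assumption on my part, not confirmed here): in this sklearn version, StratifiedKFold indexes y as a 1-d array, but y_train is a one-column DataFrame of shape (300, 1), which is what trips the "too many indices for array" IndexError. A minimal sketch of the workaround:
# Pass the labels as a flat 1-d array instead of a (300, 1) DataFrame
grid_obj = grid_obj.fit(X_train, y_train.values.ravel())
Separately, pos_label='Yes' will not match the lowercase 'yes'/'no' labels shown above; the scorer would need pos_label='yes'.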
