I have a pandas DataFrame that looks like this:
0 1 2 3 4 5
0 1 1 10 0.0 6 0.0
1 1 1 20 0.0 3 0.0
2 1 1 30 0.0 6 0.0
3 1 1 40 0.0 2 0.0
4 1 1 50 0.0 5 0.0
5 1 1 60 0.0 6 0.0
6 1 1 70 0.0 3 0.0
7 1 1 80 0.0 6 0.0
8 1 1 90 0.0 2 0.0
9 1 1 100 0.0 4 0.0
10 1 1 110 0.0 4 0.0
11 1 1 120 0.0 3 0.0
12 1 1 130 0.0 6 0.0
13 1 1 140 0.0 5 0.0
14 1 1 150 0.0 5 0.0
15 1 1 160 0.0 2 0.0
16 1 1 170 0.0 2 0.0
17 1 1 180 0.0 1 0.0
18 1 1 190 0.0 1 0.0
19 1 1 200 0.0 6 0.0
.. .. .. .. ... .. ..
n-10 99 99 110 0.0 4 0.0
n-8 99 99 120 0.0 2 0.0
n-7 99 99 130 0.0 9 0.0
n-6 99 99 140 0.0 8 0.0
n-5 99 99 150 0.0 5 0.0
n-4 99 99 160 0.0 1 0.0
n-3 99 99 170 0.0 0 0.0
n-2 99 99 180 0.0 7 0.0
n-1 99 99 190 0.0 6 0.0
n 99 99 200 0.0 4 0.0
I need to sum column 4 over every x consecutive rows (x = 10 here), so that each block of 10 rows collapses to a single row, with column 2 keeping the last value of the block.
The output would look like this:
0 1 2 3 4 5
0 1 1 100 0.0 43 0.0
1 1 1 200 0.0 35 0.0
.. .. .. .. ... .. ..
m 99 99 200 0.0 46 0.0
How would I do this?
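One possible approach, sketched under assumptions: label every block of x consecutive rows by integer-dividing the row position by x, then aggregate with `sum` for column 4 and `last` for the other columns. The toy frame below rebuilds only the first 20 rows shown above, and the names `block`, `agg` and `out` are illustrative. It also assumes the blocks never straddle two different (column 0, column 1) groups.

```python
import numpy as np
import pandas as pd

# Toy frame: the first 20 rows from the question (integer column labels 0..5)
df = pd.DataFrame({
    0: 1,
    1: 1,
    2: list(range(10, 210, 10)),
    3: 0.0,
    4: [6, 3, 6, 2, 5, 6, 3, 6, 2, 4, 4, 3, 6, 5, 5, 2, 2, 1, 1, 6],
    5: 0.0,
})

x = 10  # block size from the question

# Rows 0..9 get label 0, rows 10..19 get label 1, and so on
block = np.arange(len(df)) // x

# Keep the last value of every column, except column 4, which is summed
agg = {col: "last" for col in df.columns}
agg[4] = "sum"

out = df.groupby(block).agg(agg)
print(out)
```

On these 20 rows, column 4 comes out as 43 and 35, matching the desired output in the question.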
I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column signals that the sign has changed in the calculation flow.
The second column is the first one with the minus sign removed.
The third column is a running count within each consecutive run of identical values in the second column (how many ones or zeros in a row so far).
I want to add a fourth column that keeps only the ones that occur in a run of, for example, 5 consecutive rows, preserving the sign from the first column, and is 0 everywhere else.
The result should look something like this:
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with pandas?
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
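To make the grouping step easier to follow, here is the same idea as a self-contained sketch. It rebuilds the sample data from the question (the construction and the names `run_id`, `run_len` and `threshold` are mine, not from the answer) and exposes the intermediate run labels and run lengths.

```python
import pandas as pd

# Sample "Answers" column from the question
answers = [0, 0, 0, 1, -1, 1, -1, 1, 0, 0, -1, 0, 0, 1, 0, 0, 1]
df = pd.DataFrame({"Answers": [float(a) for a in answers]})
df["all_answers"] = df["Answers"].abs().astype(int)

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change flags gives one label per run
run_id = df["all_answers"].ne(df["all_answers"].shift()).cumsum()

# Length of the run each row belongs to
run_len = df.groupby(run_id)["all_answers"].transform("size")

threshold = 5  # minimum run length, as in the question
df["New"] = df["Answers"].where(run_len.ge(threshold), 0)
print(df.assign(run_id=run_id, run_len=run_len))
```

A run of five or more zeros would also pass the mask, but `Answers` is 0 on those rows, so the result is unaffected.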
A faster way (with regex):
import pandas as pd
import re
def repl5(m):
    return '5' * len(m.group())
s = df['all_answers'].astype(str).str.cat()
d = re.sub('(?:1{5,})', repl5, s)
d = [x=='5' for x in list(d)]
df['New'] = df['Answers'].where(d, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
I am trying to calculate RSI on a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({"Close": [100,101,102,103,104,105,106,105,103,102,103,104,103,105,106,107,108,106,105,107,109]})
df["Change"] = df["Close"].diff()
df["Gain"] = np.where(df["Change"]>0,df["Change"],0)
df["Loss"] = np.where(df["Change"]<0,abs(df["Change"]),0 )
df["Index"] = [x for x in range(len(df))]
print(df)
Close Change Gain Loss Index
0 100 NaN 0.0 0.0 0
1 101 1.0 1.0 0.0 1
2 102 1.0 1.0 0.0 2
3 103 1.0 1.0 0.0 3
4 104 1.0 1.0 0.0 4
5 105 1.0 1.0 0.0 5
6 106 1.0 1.0 0.0 6
7 105 -1.0 0.0 1.0 7
8 103 -2.0 0.0 2.0 8
9 102 -1.0 0.0 1.0 9
10 103 1.0 1.0 0.0 10
11 104 1.0 1.0 0.0 11
12 103 -1.0 0.0 1.0 12
13 105 2.0 2.0 0.0 13
14 106 1.0 1.0 0.0 14
15 107 1.0 1.0 0.0 15
16 108 1.0 1.0 0.0 16
17 106 -2.0 0.0 2.0 17
18 105 -1.0 0.0 1.0 18
19 107 2.0 2.0 0.0 19
20 109 2.0 2.0 0.0 20
RSI_length = 7
Now I am stuck calculating "Avg Gain". The first average gain, at index 6 (RSI_length - 1), is the mean of "Gain" over the first RSI_length periods. Each subsequent "Avg Gain" should be
(Previous Avg Gain * (RSI_length - 1) + "Gain") / RSI_length
I tried the following, but it does not work as expected:
df["Avg Gain"] = np.nan
df["Avg Gain"] = np.where(df["Index"]==(RSI_length-1),df["Gain"].rolling(window=RSI_length).mean(),\
np.where(df["Index"]>(RSI_length-1),(df["Avg Gain"].iloc[df["Index"]-1]*(RSI_length-1)+df["Gain"]) / RSI_length,np.nan))
The output of this code is:
print(df)
Close Change Gain Loss Index Avg Gain
0 100 NaN 0.0 0.0 0 NaN
1 101 1.0 1.0 0.0 1 NaN
2 102 1.0 1.0 0.0 2 NaN
3 103 1.0 1.0 0.0 3 NaN
4 104 1.0 1.0 0.0 4 NaN
5 105 1.0 1.0 0.0 5 NaN
6 106 1.0 1.0 0.0 6 0.857143
7 105 -1.0 0.0 1.0 7 NaN
8 103 -2.0 0.0 2.0 8 NaN
9 102 -1.0 0.0 1.0 9 NaN
10 103 1.0 1.0 0.0 10 NaN
11 104 1.0 1.0 0.0 11 NaN
12 103 -1.0 0.0 1.0 12 NaN
13 105 2.0 2.0 0.0 13 NaN
14 106 1.0 1.0 0.0 14 NaN
15 107 1.0 1.0 0.0 15 NaN
16 108 1.0 1.0 0.0 16 NaN
17 106 -2.0 0.0 2.0 17 NaN
18 105 -1.0 0.0 1.0 18 NaN
19 107 2.0 2.0 0.0 19 NaN
20 109 2.0 2.0 0.0 20 NaN
Desired output is:
Close Change Gain Loss Index Avg Gain
0 100 NaN 0 0 0 NaN
1 101 1.0 1 0 1 NaN
2 102 1.0 1 0 2 NaN
3 103 1.0 1 0 3 NaN
4 104 1.0 1 0 4 NaN
5 105 1.0 1 0 5 NaN
6 106 1.0 1 0 6 0.857143
7 105 -1.0 0 1 7 0.734694
8 103 -2.0 0 2 8 0.629738
9 102 -1.0 0 1 9 0.539775
10 103 1.0 1 0 10 0.605522
11 104 1.0 1 0 11 0.661876
12 103 -1.0 0 1 12 0.567322
13 105 2.0 2 0 13 0.771990
14 106 1.0 1 0 14 0.804563
15 107 1.0 1 0 15 0.832483
16 108 1.0 1 0 16 0.856414
17 106 -2.0 0 2 17 0.734069
18 105 -1.0 0 1 18 0.629202
19 107 2.0 2 0 19 0.825030
20 109 2.0 2 0 20 0.992883
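Here's an implementation of your formula. It uses an explicit loop because each value depends on the previous one, and it stores the average gain in a column named "RSI".
RSI_LENGTH = 7
rolling_gain = df["Gain"].rolling(RSI_LENGTH).mean()
# Seed: simple mean of the first RSI_LENGTH gains
df.loc[RSI_LENGTH-1, "RSI"] = rolling_gain[RSI_LENGTH-1]
# Recurrence for all later rows
for inx in range(RSI_LENGTH, len(df)):
    df.loc[inx, "RSI"] = (df.loc[inx-1, "RSI"] * (RSI_LENGTH -1) + df.loc[inx, "Gain"]) / RSI_LENGTH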
The result is:
Close Change Gain Loss Index RSI
0 100 NaN 0.0 0.0 0 NaN
1 101 1.0 1.0 0.0 1 NaN
2 102 1.0 1.0 0.0 2 NaN
3 103 1.0 1.0 0.0 3 NaN
4 104 1.0 1.0 0.0 4 NaN
5 105 1.0 1.0 0.0 5 NaN
6 106 1.0 1.0 0.0 6 0.857143
7 105 -1.0 0.0 1.0 7 0.734694
8 103 -2.0 0.0 2.0 8 0.629738
9 102 -1.0 0.0 1.0 9 0.539775
10 103 1.0 1.0 0.0 10 0.605522
11 104 1.0 1.0 0.0 11 0.661876
12 103 -1.0 0.0 1.0 12 0.567322
13 105 2.0 2.0 0.0 13 0.771990
14 106 1.0 1.0 0.0 14 0.804563
15 107 1.0 1.0 0.0 15 0.832483
16 108 1.0 1.0 0.0 16 0.856414
17 106 -2.0 0.0 2.0 17 0.734069
18 105 -1.0 0.0 1.0 18 0.629202
19 107 2.0 2.0 0.0 19 0.825030
20 109 2.0 2.0 0.0 20 0.992883
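The recurrence can also be written as (1 - 1/N) * previous + (1/N) * gain, which is an exponentially weighted mean with alpha = 1/N and adjust=False. A loop-free sketch, assuming the pandas `ewm` behaviour of starting at the first non-NaN value, seeds the series with the simple mean at index N - 1 and blanks the rows before it. The helper name `seeded` and the column name "Avg Gain" are my own choices here.

```python
import numpy as np
import pandas as pd

close = [100, 101, 102, 103, 104, 105, 106, 105, 103, 102, 103,
         104, 103, 105, 106, 107, 108, 106, 105, 107, 109]
df = pd.DataFrame({"Close": close})

# Gain: positive part of the change, 0 otherwise (first row has no change)
gain = df["Close"].diff().clip(lower=0).fillna(0)

RSI_LENGTH = 7

# Seed series: NaN before the first window is complete, the simple mean of
# the first RSI_LENGTH gains at index RSI_LENGTH - 1, raw gains afterwards
seeded = gain.copy()
seeded.iloc[:RSI_LENGTH - 1] = np.nan
seeded.iloc[RSI_LENGTH - 1] = gain.iloc[:RSI_LENGTH].mean()

# adjust=False gives y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
df["Avg Gain"] = seeded.ewm(alpha=1 / RSI_LENGTH, adjust=False).mean()
print(df)
```

The same pattern should apply to "Loss" for the average loss, if the same smoothing is wanted there.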
I want to sum the rows and columns of the dataframes (pdf and wdf) and save the results in the columns of another dataframe (to_hex).
I tried it for one dataframe and it worked. It doesn't work for the other (it gives NaN). I cannot understand what the difference is.
to_hex = pd.DataFrame(0, index=np.arange(len(sasiedztwo)), columns=['ID','podroze','p_rozmyte'])
to_hex.loc[:,'ID']= wdf.index+1
to_hex.index=pdf.index
to_hex.loc[:,'podroze']= pd.DataFrame(pdf.sum(axis=0))[:]
to_hex.index=wdf.index
to_hex.loc[:,'p_rozmyte']= pd.DataFrame(wdf.sum(axis=0))[:]
This is what the pdf dataframe looks like:
0 1 2 3 4 5 6 7 8
0 0 0 10 0 0 0 0 0 100
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 1000
8 0 0 0 0 0 0 0 0 0
This is wdf:
0 1 2 3 4 5 6 7 8
0 2.5 5.0 35.0 0.0 27.5 55.0 25.0 50.0 102.5
1 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
2 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 25.0
3 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
4 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 250.0
6 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
7 0.0 0.0 250.0 0.0 250.0 500.0 250.0 500.0 1000.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 500.0
And this is the result in to_hex:
ID podroze p_rozmyte
0 1 0 NaN
1 2 0 NaN
2 3 10 NaN
3 4 0 NaN
4 5 0 NaN
5 6 0 NaN
6 7 0 NaN
7 8 0 NaN
8 9 1100 NaN
SOLUTION:
One option to solve it is to modify your code as follows:
to_hex.loc[:,'ID']= wdf.index+1
# to_hex.index=pdf.index # no need
to_hex.loc[:,'podroze']= pdf.sum(axis=0) # modified; directly use the series output from SUM()
# to_hex.index=wdf.index # no need
to_hex.loc[:,'p_rozmyte']= wdf.sum(axis=0) # modified
Then you get:
ID podroze p_rozmyte
0 1 0 2.5
1 2 0 5.0
2 3 10 302.5
3 4 0 0.0
4 5 0 277.5
5 6 0 555.0
6 7 0 275.0
7 8 0 550.0
8 9 1100 3527.5
I think the reason that you get NaN for one case and correct values for the other case lies in to_hex.dtypes:
ID int64
podroze int64
p_rozmyte int64
dtype: object
As you can see, all columns of to_hex have dtype int64. This is fine when you add the pdf dataframe (since it has the same dtype):
pd.DataFrame(pdf.sum(axis=0))[:].dtypes
0 int64
dtype: object
but does not work when you add wdf:
pd.DataFrame(wdf.sum(axis=0))[:].dtypes
0 float64
dtype: object
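Another thing worth checking is label alignment, since pandas aligns on index labels when a Series or DataFrame is assigned to a column. The question does not show the types of wdf's column labels, so the sketch below is only an illustration of one way to get an all-NaN column, not a diagnosis. The toy frames and the string column labels are hypothetical.

```python
import numpy as np
import pandas as pd

to_hex = pd.DataFrame(0, index=np.arange(3), columns=["podroze", "p_rozmyte"])

# Hypothetical inputs: pdf has integer column labels, wdf has string labels
pdf = pd.DataFrame(np.ones((3, 3), dtype=int))
wdf = pd.DataFrame(np.full((3, 3), 2.5), columns=["0", "1", "2"])

# Labels 0, 1, 2 match to_hex.index, so the sums land in the right rows
to_hex["podroze"] = pdf.sum(axis=0)

# Labels "0", "1", "2" match nothing in to_hex.index, so every row is NaN
to_hex["p_rozmyte"] = wdf.sum(axis=0)

# Assigning a plain array skips alignment and works by position
to_hex["p_fixed"] = wdf.sum(axis=0).to_numpy()
print(to_hex)
```

Comparing `wdf.columns` with `to_hex.index` would show whether this applies to the real data.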
This is a bit complicated to explain, so I'll do my best. I have a pandas dataframe with two columns: hour (from 1 to 24) and value (the value for that hour). The dataset is long, and the hour column repeats on a 24-hour basis (1 to 24, then 1 again). I am trying to create 24 new columns, value -1, value -2, value -3, ..., value -24, so that each row also holds the value from 1 hour earlier, 2 hours earlier, and so on (taken from the rows above).
hour | value | value -1 | value -2 | value -3| ... | value - 24
1 10 0 0 0 0
2 11 10 0 0 0
3 12 11 10 0 0
4 13 12 11 10 0
...
24 32 31 30 29 0
1 33 32 31 30 10
2 34 33 32 31 11
and so on...
All values are only for the example. As I said, there are many rows: not just 24 for the hours of one day, but a continuing time series in which the hours run from 1 to 24 again and again.
Thanks in advance and may the force be with you!
Is this what you need?
import pandas as pd
df = pd.DataFrame([[1,10],[2,11],
                   [3,12],[4,13]], columns=['hour','value'])
# range(1, 25) creates the 24 columns 'value -1' ... 'value -24'
for i in range(1, 25):
    df['value -' + str(i)] = df['value'].shift(i).fillna(0)
Is this what you are looking for?
import pandas as pd
df = pd.DataFrame({'hour': list(range(24))*2,
                   'value': list(range(48))})
shift_cols_n = 10
for shift in range(1, shift_cols_n):
    new_columns_name = 'value - ' + str(shift)
    # Assuming that you don't have any NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift).fillna(0)
    # A safer (and less simple) way, in case you have NAs in your dataframe;
    # it overwrites the column created above and zeroes only the first `shift` rows
    # (.loc slices include the end label, hence shift - 1)
    df[new_columns_name] = df['value'].shift(shift)
    df.loc[:shift - 1, new_columns_name] = 0
print(df.head(9))
hour value value - 1 value - 2 value - 3 value - 4 value - 5 \
0 0 0 0.0 0.0 0.0 0.0 0.0
1 1 1 0.0 0.0 0.0 0.0 0.0
2 2 2 1.0 0.0 0.0 0.0 0.0
3 3 3 2.0 1.0 0.0 0.0 0.0
4 4 4 3.0 2.0 1.0 0.0 0.0
5 5 5 4.0 3.0 2.0 1.0 0.0
6 6 6 5.0 4.0 3.0 2.0 1.0
7 7 7 6.0 5.0 4.0 3.0 2.0
8 8 8 7.0 6.0 5.0 4.0 3.0
value - 6 value - 7 value - 8 value - 9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
7 1.0 0.0 0.0 0.0
8 2.0 1.0 0.0 0.0
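For all 24 lags, a variant that builds the lag columns in one go may be convenient. This is a sketch with made-up data (two days of hours 1 to 24, values starting at 10). It uses `shift(..., fill_value=0)`, which fills only the rows introduced by the shift and keeps the integer dtype, and `pd.concat` with a dict so the keys become column names. The names `n_lags`, `lags` and `out` are illustrative.

```python
import pandas as pd

# Made-up data: two days of hourly values
df = pd.DataFrame({"hour": list(range(1, 25)) * 2,
                   "value": list(range(10, 58))})

n_lags = 24

# One shifted copy of 'value' per lag; rows with no earlier value get 0
lags = pd.concat(
    {f"value -{k}": df["value"].shift(k, fill_value=0)
     for k in range(1, n_lags + 1)},
    axis=1,
)

out = pd.concat([df, lags], axis=1)
print(out.iloc[[0, 1, 2, 24], :5])
```

Row 24 (hour 1 of the second day) then carries the first day's hour-1 value in the "value -24" column, as in the layout sketched in the question.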
I'm training MLP and using version 0.18dev of sklearn. I don't know what's wrong with my code. Could you guys please help?
# TODO: Import 'GridSearchCV' and 'make_scorer'
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer
# TODO: Create the parameters list you wish to tune
parameters = {'max_iter' : [100,200]}
# TODO: Initialize the classifier
clf = clf_B
# TODO: Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_score, pos_label = 'Yes')
# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring = f1_scorer)
# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)
# Get the estimator
clf = grid_obj.best_estimator_
# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))
And the error message
---
IndexError Traceback (most recent call last)
<ipython-input-216-4a3fb1d65cb7> in <module>()
24
25 # TODO: Fit the grid search object to the training data and find the optimal parameters
---> 26 grid_obj = grid_obj.fit(X_train, y_train)
27
28 # Get the estimator
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
810
811 """
--> 812 return self._fit(X, y, ParameterGrid(self.param_grid))
813
814
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
537 'of samples (%i) than data (X: %i samples)'
538 % (len(y), n_samples))
--> 539 cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
540
541 if self.verbose > 0:
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in check_cv(cv, X, y, classifier)
1726 if classifier:
1727 if type_of_target(y) in ['binary', 'multiclass']:
-> 1728 cv = StratifiedKFold(y, cv)
1729 else:
1730 cv = KFold(_num_samples(y), cv)
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, shuffle, random_state)
546 for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
547 for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 548 label_test_folds = test_folds[y == label]
549 # the test split can be too big because we used
550 # KFold(max(c, self.n_folds), self.n_folds) instead of
IndexError: too many indices for array
This is the MLPClassifier, and this is what my input looks like:
clf_B = MLPClassifier(random_state=4)
print X_train
school_GP school_MS sex_F sex_M age address_R address_U \
171 1.0 0.0 0.0 1.0 16 0.0 1.0
12 1.0 0.0 0.0 1.0 15 0.0 1.0
13 1.0 0.0 0.0 1.0 15 0.0 1.0
151 1.0 0.0 0.0 1.0 16 0.0 1.0
310 1.0 0.0 1.0 0.0 19 0.0 1.0
274 1.0 0.0 1.0 0.0 17 0.0 1.0
371 0.0 1.0 0.0 1.0 18 1.0 0.0
29 1.0 0.0 0.0 1.0 16 0.0 1.0
109 1.0 0.0 1.0 0.0 16 0.0 1.0
327 1.0 0.0 0.0 1.0 17 1.0 0.0
131 1.0 0.0 1.0 0.0 15 0.0 1.0
128 1.0 0.0 0.0 1.0 18 1.0 0.0
174 1.0 0.0 1.0 0.0 16 0.0 1.0
108 1.0 0.0 0.0 1.0 15 1.0 0.0
280 1.0 0.0 0.0 1.0 17 0.0 1.0
163 1.0 0.0 0.0 1.0 17 0.0 1.0
178 1.0 0.0 0.0 1.0 16 1.0 0.0
275 1.0 0.0 1.0 0.0 17 0.0 1.0
35 1.0 0.0 1.0 0.0 15 0.0 1.0
276 1.0 0.0 1.0 0.0 18 1.0 0.0
282 1.0 0.0 1.0 0.0 18 1.0 0.0
99 1.0 0.0 1.0 0.0 16 0.0 1.0
194 1.0 0.0 0.0 1.0 16 0.0 1.0
357 0.0 1.0 1.0 0.0 17 0.0 1.0
10 1.0 0.0 1.0 0.0 15 0.0 1.0
112 1.0 0.0 1.0 0.0 16 0.0 1.0
338 1.0 0.0 1.0 0.0 18 0.0 1.0
292 1.0 0.0 1.0 0.0 18 0.0 1.0
305 1.0 0.0 1.0 0.0 18 0.0 1.0
340 1.0 0.0 1.0 0.0 19 0.0 1.0
.. ... ... ... ... ... ... ...
255 1.0 0.0 0.0 1.0 17 0.0 1.0
58 1.0 0.0 0.0 1.0 15 0.0 1.0
33 1.0 0.0 0.0 1.0 15 0.0 1.0
38 1.0 0.0 1.0 0.0 15 1.0 0.0
359 0.0 1.0 1.0 0.0 18 0.0 1.0
51 1.0 0.0 1.0 0.0 15 0.0 1.0
363 0.0 1.0 1.0 0.0 17 0.0 1.0
260 1.0 0.0 1.0 0.0 18 0.0 1.0
102 1.0 0.0 0.0 1.0 15 0.0 1.0
195 1.0 0.0 1.0 0.0 17 0.0 1.0
167 1.0 0.0 1.0 0.0 16 0.0 1.0
293 1.0 0.0 1.0 0.0 17 1.0 0.0
116 1.0 0.0 0.0 1.0 15 0.0 1.0
124 1.0 0.0 1.0 0.0 16 0.0 1.0
218 1.0 0.0 1.0 0.0 17 0.0 1.0
287 1.0 0.0 1.0 0.0 17 0.0 1.0
319 1.0 0.0 1.0 0.0 18 0.0 1.0
47 1.0 0.0 0.0 1.0 16 0.0 1.0
213 1.0 0.0 0.0 1.0 18 0.0 1.0
389 0.0 1.0 1.0 0.0 18 0.0 1.0
95 1.0 0.0 1.0 0.0 15 1.0 0.0
162 1.0 0.0 0.0 1.0 16 0.0 1.0
263 1.0 0.0 1.0 0.0 17 0.0 1.0
360 0.0 1.0 1.0 0.0 18 1.0 0.0
75 1.0 0.0 0.0 1.0 15 0.0 1.0
299 1.0 0.0 0.0 1.0 18 0.0 1.0
22 1.0 0.0 0.0 1.0 16 0.0 1.0
72 1.0 0.0 1.0 0.0 15 1.0 0.0
15 1.0 0.0 1.0 0.0 16 0.0 1.0
168 1.0 0.0 1.0 0.0 16 0.0 1.0
famsize_GT3 famsize_LE3 Pstatus_A ... higher internet \
171 1.0 0.0 0.0 ... 1 1
12 0.0 1.0 0.0 ... 1 1
13 1.0 0.0 0.0 ... 1 1
151 0.0 1.0 0.0 ... 1 0
310 0.0 1.0 0.0 ... 1 0
274 1.0 0.0 0.0 ... 1 1
371 0.0 1.0 0.0 ... 0 1
29 1.0 0.0 0.0 ... 1 1
109 0.0 1.0 0.0 ... 1 1
327 1.0 0.0 0.0 ... 1 1
131 1.0 0.0 0.0 ... 1 1
128 1.0 0.0 0.0 ... 1 1
174 0.0 1.0 0.0 ... 1 1
108 1.0 0.0 0.0 ... 1 1
280 0.0 1.0 1.0 ... 1 1
163 1.0 0.0 0.0 ... 0 1
178 1.0 0.0 0.0 ... 1 1
275 0.0 1.0 0.0 ... 1 1
35 1.0 0.0 0.0 ... 1 0
276 1.0 0.0 1.0 ... 0 1
282 0.0 1.0 0.0 ... 1 0
99 1.0 0.0 0.0 ... 1 1
194 1.0 0.0 0.0 ... 1 1
357 0.0 1.0 1.0 ... 1 0
10 1.0 0.0 0.0 ... 1 1
112 1.0 0.0 0.0 ... 1 1
338 0.0 1.0 0.0 ... 1 1
292 0.0 1.0 0.0 ... 1 1
305 1.0 0.0 0.0 ... 1 1
340 1.0 0.0 0.0 ... 1 1
.. ... ... ... ... ... ...
255 0.0 1.0 0.0 ... 1 1
58 0.0 1.0 0.0 ... 1 1
33 0.0 1.0 0.0 ... 1 1
38 1.0 0.0 0.0 ... 1 1
359 0.0 1.0 0.0 ... 1 1
51 0.0 1.0 0.0 ... 1 1
363 0.0 1.0 0.0 ... 1 1
260 1.0 0.0 0.0 ... 1 1
102 1.0 0.0 0.0 ... 1 1
195 0.0 1.0 0.0 ... 1 1
167 1.0 0.0 0.0 ... 1 1
293 0.0 1.0 0.0 ... 1 0
116 1.0 0.0 0.0 ... 1 0
124 1.0 0.0 0.0 ... 1 1
218 1.0 0.0 0.0 ... 1 0
287 1.0 0.0 0.0 ... 1 1
319 1.0 0.0 0.0 ... 1 1
47 1.0 0.0 0.0 ... 1 1
213 1.0 0.0 0.0 ... 1 1
389 1.0 0.0 0.0 ... 1 0
95 1.0 0.0 0.0 ... 1 1
162 0.0 1.0 0.0 ... 1 0
263 1.0 0.0 0.0 ... 1 0
360 0.0 1.0 1.0 ... 1 0
75 1.0 0.0 0.0 ... 1 1
299 0.0 1.0 0.0 ... 1 1
22 0.0 1.0 0.0 ... 1 1
72 1.0 0.0 0.0 ... 1 1
15 1.0 0.0 0.0 ... 1 1
168 1.0 0.0 0.0 ... 1 1
romantic famrel freetime goout Dalc Walc health absences
171 1 4 3 2 1 1 3 2
12 0 4 3 3 1 3 5 2
13 0 5 4 3 1 2 3 2
151 1 4 4 4 3 5 5 6
310 1 4 2 4 2 2 3 0
274 1 4 3 3 1 1 1 2
371 1 4 3 3 2 3 3 3
29 1 4 4 5 5 5 5 16
109 1 5 4 5 1 1 4 4
327 0 4 4 5 5 5 4 8
131 1 4 3 3 1 2 4 0
128 0 3 3 3 1 2 4 0
174 0 4 4 5 1 1 4 4
108 1 1 3 5 3 5 1 6
280 1 4 5 4 2 4 5 30
163 0 5 3 3 1 4 2 2
178 1 4 3 3 3 4 3 10
275 1 4 4 4 2 3 5 6
35 0 3 5 1 1 1 5 0
276 1 4 1 1 1 1 5 75
282 0 5 2 2 1 1 3 1
99 0 5 3 5 1 1 3 0
194 0 5 3 3 1 1 3 0
357 1 1 2 3 1 2 5 2
10 0 3 3 3 1 2 2 0
112 0 3 1 2 1 1 5 6
338 0 5 3 3 1 1 1 7
292 1 5 4 3 1 1 5 12
305 0 4 4 3 1 1 3 8
340 1 4 3 4 1 3 3 4
.. ... ... ... ... ... ... ... ...
255 0 4 4 4 1 2 5 2
58 0 4 3 2 1 1 5 2
33 0 5 3 2 1 1 2 0
38 0 4 3 2 1 1 5 2
359 0 5 3 2 1 1 4 0
51 0 4 3 3 1 1 5 2
363 1 2 3 4 1 1 1 0
260 1 3 1 2 1 3 2 21
102 0 5 3 3 1 1 5 4
195 1 4 3 2 1 1 5 0
167 1 4 2 3 1 1 3 0
293 0 3 1 2 1 1 3 6
116 0 4 4 3 1 1 2 2
124 1 5 4 4 1 1 5 0
218 0 3 3 3 1 4 3 3
287 0 4 3 3 1 1 3 6
319 0 4 4 4 3 3 5 2
47 0 4 2 2 1 1 2 4
213 0 4 4 4 2 4 5 15
389 0 1 1 1 1 1 5 0
95 0 3 1 2 1 1 1 2
162 0 4 4 4 2 4 5 0
263 0 3 2 3 1 1 4 4
360 1 4 3 4 1 4 5 0
75 0 4 3 3 2 3 5 6
299 1 1 4 2 2 2 1 5
22 0 4 5 1 1 3 5 2
72 1 3 3 4 2 4 5 2
15 0 4 4 4 1 2 2 4
168 0 5 1 5 1 1 4 0
[300 rows x 48 columns]
This is what my output looks like:
print y_train
passed
171 yes
12 yes
13 yes
151 yes
310 no
274 yes
371 yes
29 yes
109 yes
327 yes
131 no
128 no
174 no
108 yes
280 no
163 yes
178 no
275 yes
35 no
276 no
282 yes
99 no
194 yes
357 yes
10 no
112 yes
338 yes
292 yes
305 yes
340 yes
.. ...
255 no
58 no
33 yes
38 yes
359 yes
51 yes
363 yes
260 yes
102 yes
195 yes
167 yes
293 yes
116 yes
124 no
218 no
287 yes
319 yes
47 yes
213 no
389 no
95 yes
162 no
263 no
360 yes
75 yes
299 yes
22 yes
72 no
15 yes
168 no
[300 rows x 1 columns]
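The traceback ends at `test_folds[y == label]`, and `y_train` prints as `[300 rows x 1 columns]`, i.e. a two-dimensional DataFrame rather than a one-dimensional array. A plausible reading, not confirmed anywhere in this thread, is that the comparison yields a 2-D boolean mask, which cannot index the 1-D `test_folds` array. The sketch below reproduces that mechanism with NumPy alone; the small `y_frame` is made up, and only the column name `passed` comes from the printout above.

```python
import numpy as np
import pandas as pd

# Made-up target shaped like y_train: one column, so it is 2-D
y_frame = pd.DataFrame({"passed": ["yes", "no", "yes", "yes"]})
test_folds = np.zeros(len(y_frame), dtype=int)

# Comparing the 2-D values gives a mask of shape (4, 1)
mask_2d = np.asarray(y_frame) == "yes"

try:
    test_folds[mask_2d]      # 2-D mask on a 1-D array
    failed = False
except IndexError as exc:
    failed = True
    print("IndexError:", exc)

# Selecting the column first gives a 1-D target, and the indexing works
y_flat = y_frame["passed"].to_numpy()
selected = test_folds[y_flat == "yes"]
print(selected.shape)
```

If that reading is right, passing a 1-D target such as `y_train['passed']` or `y_train.values.ravel()` to `grid_obj.fit` may avoid the error. Separately, `pos_label = 'Yes'` does not match the lowercase `yes` labels printed above, which may also need attention.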