I am trying to calculate RSI on a dataframe
df = pd.DataFrame({"Close": [100,101,102,103,104,105,106,105,103,102,103,104,103,105,106,107,108,106,105,107,109]})
df["Change"] = df["Close"].diff()
df["Gain"] = np.where(df["Change"]>0,df["Change"],0)
df["Loss"] = np.where(df["Change"]<0,abs(df["Change"]),0 )
df["Index"] = [x for x in range(len(df))]
print(df)
Close Change Gain Loss Index
0 100 NaN 0.0 0.0 0
1 101 1.0 1.0 0.0 1
2 102 1.0 1.0 0.0 2
3 103 1.0 1.0 0.0 3
4 104 1.0 1.0 0.0 4
5 105 1.0 1.0 0.0 5
6 106 1.0 1.0 0.0 6
7 105 -1.0 0.0 1.0 7
8 103 -2.0 0.0 2.0 8
9 102 -1.0 0.0 1.0 9
10 103 1.0 1.0 0.0 10
11 104 1.0 1.0 0.0 11
12 103 -1.0 0.0 1.0 12
13 105 2.0 2.0 0.0 13
14 106 1.0 1.0 0.0 14
15 107 1.0 1.0 0.0 15
16 108 1.0 1.0 0.0 16
17 106 -2.0 0.0 2.0 17
18 105 -1.0 0.0 1.0 18
19 107 2.0 2.0 0.0 19
20 109 2.0 2.0 0.0 20
RSI_length = 7
Now I am stuck calculating "Avg Gain". The first average gain, at index RSI_length - 1 (index 6 here), is the mean of "Gain" over the first RSI_length periods. Each subsequent "Avg Gain" should be:
(Previous Avg Gain * (RSI_length - 1) + "Gain") / RSI_length
I tried the following, but it does not work as expected:
df["Avg Gain"] = np.nan
df["Avg Gain"] = np.where(df["Index"]==(RSI_length-1),df["Gain"].rolling(window=RSI_length).mean(),\
np.where(df["Index"]>(RSI_length-1),(df["Avg Gain"].iloc[df["Index"]-1]*(RSI_length-1)+df["Gain"]) / RSI_length,np.nan))
The output of this code is:
print(df)
Close Change Gain Loss Index Avg Gain
0 100 NaN 0.0 0.0 0 NaN
1 101 1.0 1.0 0.0 1 NaN
2 102 1.0 1.0 0.0 2 NaN
3 103 1.0 1.0 0.0 3 NaN
4 104 1.0 1.0 0.0 4 NaN
5 105 1.0 1.0 0.0 5 NaN
6 106 1.0 1.0 0.0 6 0.857143
7 105 -1.0 0.0 1.0 7 NaN
8 103 -2.0 0.0 2.0 8 NaN
9 102 -1.0 0.0 1.0 9 NaN
10 103 1.0 1.0 0.0 10 NaN
11 104 1.0 1.0 0.0 11 NaN
12 103 -1.0 0.0 1.0 12 NaN
13 105 2.0 2.0 0.0 13 NaN
14 106 1.0 1.0 0.0 14 NaN
15 107 1.0 1.0 0.0 15 NaN
16 108 1.0 1.0 0.0 16 NaN
17 106 -2.0 0.0 2.0 17 NaN
18 105 -1.0 0.0 1.0 18 NaN
19 107 2.0 2.0 0.0 19 NaN
20 109 2.0 2.0 0.0 20 NaN
Desired output is:
Close Change Gain Loss Index Avg Gain
0 100 NaN 0 0 0 NaN
1 101 1.0 1 0 1 NaN
2 102 1.0 1 0 2 NaN
3 103 1.0 1 0 3 NaN
4 104 1.0 1 0 4 NaN
5 105 1.0 1 0 5 NaN
6 106 1.0 1 0 6 0.857143
7 105 -1.0 0 1 7 0.734694
8 103 -2.0 0 2 8 0.629738
9 102 -1.0 0 1 9 0.539775
10 103 1.0 1 0 10 0.605522
11 104 1.0 1 0 11 0.661876
12 103 -1.0 0 1 12 0.567322
13 105 2.0 2 0 13 0.771990
14 106 1.0 1 0 14 0.804563
15 107 1.0 1 0 15 0.832483
16 108 1.0 1 0 16 0.856414
17 106 -2.0 0 2 17 0.734069
18 105 -1.0 0 1 18 0.629202
19 107 2.0 2 0 19 0.825030
20 109 2.0 2 0 20 0.992883
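Here's an implementation of your formula. Note that the column is named "RSI" here, but it holds the average gain.
RSI_LENGTH = 7
rolling_gain = df["Gain"].rolling(RSI_LENGTH).mean()
# Seed the first value with the simple rolling mean
df.loc[RSI_LENGTH-1, "RSI"] = rolling_gain[RSI_LENGTH-1]
# Every later value depends on the previous one, so fill row by row
for inx in range(RSI_LENGTH, len(df)):
    df.loc[inx, "RSI"] = (df.loc[inx-1, "RSI"] * (RSI_LENGTH -1) + df.loc[inx, "Gain"]) / RSI_LENGTH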
The result is:
Close Change Gain Loss Index RSI
0 100 NaN 0.0 0.0 0 NaN
1 101 1.0 1.0 0.0 1 NaN
2 102 1.0 1.0 0.0 2 NaN
3 103 1.0 1.0 0.0 3 NaN
4 104 1.0 1.0 0.0 4 NaN
5 105 1.0 1.0 0.0 5 NaN
6 106 1.0 1.0 0.0 6 0.857143
7 105 -1.0 0.0 1.0 7 0.734694
8 103 -2.0 0.0 2.0 8 0.629738
9 102 -1.0 0.0 1.0 9 0.539775
10 103 1.0 1.0 0.0 10 0.605522
11 104 1.0 1.0 0.0 11 0.661876
12 103 -1.0 0.0 1.0 12 0.567322
13 105 2.0 2.0 0.0 13 0.771990
14 106 1.0 1.0 0.0 14 0.804563
15 107 1.0 1.0 0.0 15 0.832483
16 108 1.0 1.0 0.0 16 0.856414
17 106 -2.0 0.0 2.0 17 0.734069
18 105 -1.0 0.0 1.0 18 0.629202
19 107 2.0 2.0 0.0 19 0.825030
20 109 2.0 2.0 0.0 20 0.992883
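The same recursion can probably be written without an explicit loop. The sketch below is an alternative, not the method from the answer above: it assumes that pandas' ewm with alpha = 1/RSI_LENGTH and adjust=False is an acceptable stand-in for the formula, seeds it with the plain mean of the first RSI_LENGTH gains, and writes the result to a column named "Avg Gain" as in the question.

```python
import numpy as np
import pandas as pd

close = [100, 101, 102, 103, 104, 105, 106, 105, 103, 102, 103,
         104, 103, 105, 106, 107, 108, 106, 105, 107, 109]
df = pd.DataFrame({"Close": close})
change = df["Close"].diff()
df["Gain"] = np.where(change > 0, change, 0.0)

RSI_LENGTH = 7

# Seed: plain mean of the first RSI_LENGTH gains, placed at index RSI_LENGTH - 1.
seed = df["Gain"].iloc[:RSI_LENGTH].mean()
seeded = pd.concat([
    pd.Series([seed], index=[df.index[RSI_LENGTH - 1]]),
    df["Gain"].iloc[RSI_LENGTH:],
])

# With adjust=False, ewm computes y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
# For alpha = 1 / RSI_LENGTH this equals
# (prev * (RSI_LENGTH - 1) + gain) / RSI_LENGTH.
avg_gain = seeded.ewm(alpha=1 / RSI_LENGTH, adjust=False).mean()

# Rows before the seed have no value and become NaN after reindexing.
df["Avg Gain"] = avg_gain.reindex(df.index)
print(df[["Close", "Gain", "Avg Gain"]])
```

The seed has to be spliced in by hand because ewm on its own would start from the first gain rather than from the 7-period mean.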
I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column signals that the sign has changed in the calculation flow.
The second is the first column with the minus sign removed.
The third is a running count over the second column: how many ones, or how many zeros, have occurred in a row.
I want to add a fourth column that keeps only the ones that occur in a row, for example 5 times, while preserving the sign from the first column.
I want to get something like this:
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with pandas?
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
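To see how the three steps fit together, here is a small self-contained sketch of the same idea. The function name keep_long_runs and the min_run parameter are made up for illustration, and the sample data is retyped from the question.

```python
import pandas as pd


def keep_long_runs(values, flags, min_run=5):
    """Keep `values` only where `flags` repeats at least `min_run` times in a row."""
    # A new run starts wherever a flag differs from the one before it;
    # the cumulative sum turns those starts into run ids 1, 2, 3, ...
    run_id = flags.ne(flags.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = flags.groupby(run_id).transform("size")
    # Rows in short runs are replaced by 0.
    return values.where(run_len.ge(min_run), 0)


df = pd.DataFrame({
    "Answers": [0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0, 1.0, 0.0,
                0.0, -1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})
df["all_answers"] = df["Answers"].abs().astype(int)
df["New"] = keep_long_runs(df["Answers"], df["all_answers"])
print(df)
```

Long runs of zeros also pass the length check, but "Answers" is 0 there anyway, so the result is unchanged.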
A faster way (with regex):
import pandas as pd
import re
def repl5(m):
    return '5' * len(m.group())
s = df['all_answers'].astype(str).str.cat()
d = re.sub('(?:1{5,})', repl5, s)
d = [x=='5' for x in list(d)]
df['New'] = df['Answers'].where(d, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
I have a Python dataframe that looks like this:
0 1 2 3 4 5
0 1 1 10 0.0 6 0.0
1 1 1 20 0.0 3 0.0
2 1 1 30 0.0 6 0.0
3 1 1 40 0.0 2 0.0
4 1 1 50 0.0 5 0.0
5 1 1 60 0.0 6 0.0
6 1 1 70 0.0 3 0.0
7 1 1 80 0.0 6 0.0
8 1 1 90 0.0 2 0.0
9 1 1 100 0.0 4 0.0
10 1 1 110 0.0 4 0.0
11 1 1 120 0.0 3 0.0
12 1 1 130 0.0 6 0.0
13 1 1 140 0.0 5 0.0
14 1 1 150 0.0 5 0.0
15 1 1 160 0.0 2 0.0
16 1 1 170 0.0 2 0.0
17 1 1 180 0.0 1 0.0
18 1 1 190 0.0 1 0.0
19 1 1 200 0.0 6 0.0
.. .. .. .. ... .. ..
n-10 99 99 110 0.0 4 0.0
n-8 99 99 120 0.0 2 0.0
n-7 99 99 130 0.0 9 0.0
n-6 99 99 140 0.0 8 0.0
n-5 99 99 150 0.0 5 0.0
n-4 99 99 160 0.0 1 0.0
n-3 99 99 170 0.0 0 0.0
n-2 99 99 180 0.0 7 0.0
n-1 99 99 190 0.0 6 0.0
n 99 99 200 0.0 4 0.0
I need to sum column 4 over every x rows of column 2, where x = 10 here.
The output would look like this:
0 1 2 3 4 5
0 1 1 100 0.0 43 0.0
1 1 1 200 0.0 35 0.0
.. .. .. .. ... .. ..
m 99 99 200 0.0 46 0.0
How would I do this?
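One possible approach, sketched on the first 20 rows shown above: number the rows, integer-divide by x to get a block id, and aggregate each block. This assumes that every block is exactly x consecutive rows and that the other columns should keep the last value of each block, which is what the desired output suggests; the names X, block and agg are only illustrative.

```python
import numpy as np
import pandas as pd

col4 = [6, 3, 6, 2, 5, 6, 3, 6, 2, 4, 4, 3, 6, 5, 5, 2, 2, 1, 1, 6]
df = pd.DataFrame({0: 1, 1: 1, 2: range(10, 210, 10), 3: 0.0, 4: col4, 5: 0.0})

X = 10  # number of consecutive rows per block

# Rows 0-9 get block id 0, rows 10-19 get block id 1, and so on.
block = np.arange(len(df)) // X

# Sum column 4; every other column keeps the last value in its block.
agg = {col: "last" for col in df.columns}
agg[4] = "sum"

out = df.groupby(block).agg(agg)
print(out)
```

If the groups in columns 0 and 1 are not a multiple of x rows long, the block id would need to be computed within each group instead.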
Right now, when I run this, the final output includes two sets of column headers, so I can't write it to a .csv either. How can I fix this so that it keeps only the headers from the first table? (The rest of the column names are the same throughout.)
import pandas as pd
import urllib.request
import bs4 as bs
urls = ['https://fantasysportsdaily.net/bsl/boxes/1-1.html',
'https://fantasysportsdaily.net/bsl/boxes/1-2.html'
]
final = []
for url in urls:
    df = pd.read_html(url, header=0)
    format1 = df[1].iloc[:, : 16]
    colname1 = format1.columns[0]
    format1.insert(1, 'Team', colname1)
    format1.rename(columns = {list(format1)[0]: 'Player'}, inplace = True)
    format2 = format1.drop(format1[format1.Player == 'TEAM TOTALS'].index)
    team1 = format2.drop(format2[format2.Player == 'PERCENTAGES'].index)
    format3 = df[2].iloc[:, : 16]
    colname2 = format3.columns[0]
    format3.insert(1, 'Team', colname2)
    format3.rename(columns = {list(format3)[0]: 'Player'}, inplace = True)
    format4 = format3.drop(format3[format3.Player == 'TEAM TOTALS'].index)
    team2 = format4.drop(format4[format4.Player == 'PERCENTAGES'].index)
    both_teams = [team1, team2]
    combined = pd.concat(both_teams)
    final.append(combined, ignore_index=True)
print(final)
##final.to_csv ('boxes.csv', index = True, header=True)
Please pay attention to the following points.
Since you are calling the same host repeatedly, it is better to reuse a single session; otherwise you risk being blocked or having your requests treated as a DDoS attack, because pd.read_html opens a new connection for each URL it fetches. That's why I've used requests.Session().
Please try to follow the DRY principle: you don't need to repeat your code. Use a function or class, as I've done in the code below.
Finally, iloc[] can drop rows and columns as well, so you don't need the extra drop calls.
import requests
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'
}
def myformat(content, key):
    df = pd.read_html(content)[key].iloc[:-2, :-1]
    df.insert(1, 'Team', df.columns[0])
    df.rename(columns={df.columns[0]: "Player"}, inplace=True)
    return df
def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        allin = []
        for num in range(1, 3):
            r = req.get(url.format(num))
            df1 = myformat(r.content, 1)
            df2 = myformat(r.content, 2)
            final = pd.concat([df1, df2], ignore_index=True)
            allin.append(final)
        target = pd.concat(allin, ignore_index=True)
        print(target)
main('https://fantasysportsdaily.net/bsl/boxes/1-{}.html')
Output:
Player Team POS MIN FG FGA 3P ... REB A PF ST TO BL PTS
0 David Robinson Spurs C 32 6 14 0 ... 15 1.0 4.0 2.0 3.0 2.0 21.0
1 Reggie Miller Spurs SG 42 5 12 3 ... 6 3.0 0.0 2.0 3.0 0.0 17.0
2 Tom Gugliotta Spurs PF 25 6 7 0 ... 11 1.0 4.0 2.0 3.0 1.0 17.0
3 Allan Houston Spurs PG 27 5 19 2 ... 1 1.0 1.0 2.0 3.0 0.0 12.0
4 Sean Elliott Spurs SF 34 3 6 0 ... 4 0.0 3.0 2.0 2.0 0.0 7.0
5 Rik Smits Spurs PF 32 1 10 0 ... 9 1.0 4.0 0.0 1.0 0.0 6.0
6 Mark Jackson Spurs PG 21 3 9 0 ... 3 7.0 1.0 2.0 6.0 0.0 6.0
7 Will Perdue Spurs C 16 1 3 0 ... 1 1.0 1.0 1.0 0.0 1.0 4.0
8 Robert Pack Spurs SG 12 0 2 0 ... 0 1.0 1.0 0.0 0.0 0.0 0.0
9 John Starks Lakers SG 39 10 20 2 ... 7 1.0 2.0 2.0 4.0 0.0 27.0
10 Magic Johnson Lakers PG 36 7 10 1 ... 7 7.0 1.0 1.0 2.0 0.0 20.0
11 Eddie Jones Lakers SF 31 4 7 1 ... 5 3.0 0.0 2.0 2.0 0.0 12.0
12 Elden Campbell Lakers PF 24 5 10 0 ... 5 0.0 4.0 0.0 0.0 1.0 12.0
13 Cedric Ceballos Lakers PF 32 3 11 0 ... 11 3.0 6.0 4.0 7.0 0.0 10.0
14 Vlade Divac Lakers C 24 3 6 0 ... 9 1.0 5.0 1.0 1.0 1.0 6.0
15 Pervis Ellison Lakers C 18 3 4 0 ... 4 0.0 6.0 1.0 0.0 1.0 6.0
16 Nick Van Exel Lakers PG 17 3 7 0 ... 1 3.0 0.0 0.0 1.0 0.0 6.0
17 Corie Blount Lakers C 6 0 0 0 ... 4 0.0 1.0 1.0 1.0 1.0 4.0
18 Anthony Peeler Lakers SF 13 0 4 0 ... 1 1.0 0.0 0.0 2.0 0.0 0.0
19 Terry Porter Timberwolves PG 31 6 15 2 ... 4 1.0 2.0 1.0 4.0 0.0 16.0
20 Kendall Gill Timberwolves PG 26 6 10 1 ... 5 5.0 0.0 0.0 3.0 0.0 15.0
21 J.R. Rider Timberwolves SG 34 7 14 0 ... 5 4.0 4.0 0.0 6.0 1.0 14.0
22 Larry Johnson Timberwolves SF 31 3 13 0 ... 10 3.0 1.0 0.0 1.0 0.0 8.0
23 LaPhonso Ellis Timberwolves PF 30 1 13 0 ... 15 2.0 3.0 0.0 1.0 1.0 6.0
24 J.R. Reid Timberwolves PF 18 1 4 0 ... 3 0.0 1.0 2.0 3.0 0.0 4.0
25 Mark Davis Timberwolves SF 17 1 3 0 ... 3 0.0 0.0 0.0 1.0 0.0 2.0
26 Eric Riley Timberwolves C 13 1 2 0 ... 5 0.0 0.0 1.0 1.0 0.0 2.0
27 Kevin Garnett Timberwolves C 35 0 8 0 ... 9 2.0 2.0 2.0 0.0 2.0 1.0
28 Micheal Williams Timberwolves PG 5 0 2 0 ... 2 1.0 1.0 0.0 0.0 0.0 0.0
29 Jim McIlvaine Bullets C 30 5 8 0 ... 6 1.0 2.0 0.0 1.0 6.0 16.0
30 Ledell Eackles Bullets SG 30 6 9 2 ... 9 1.0 2.0 3.0 2.0 0.0 15.0
31 Juwan Howard Bullets PF 29 4 10 0 ... 6 1.0 2.0 1.0 1.0 0.0 15.0
32 Avery Johnson Bullets PG 35 6 16 0 ... 2 6.0 2.0 2.0 2.0 0.0 14.0
33 Tim Legler Bullets SF 28 5 13 0 ... 2 0.0 0.0 1.0 1.0 0.0 10.0
34 David Benoit Bullets C 18 2 8 1 ... 10 1.0 1.0 1.0 1.0 0.0 7.0
35 Brent Price Bullets SG 18 2 6 1 ... 2 2.0 1.0 1.0 0.0 0.0 5.0
36 Rasheed Wallace Bullets SF 22 2 6 0 ... 1 2.0 1.0 1.0 0.0 0.0 4.0
37 Cory Alexander Bullets PG 9 0 2 0 ... 2 2.0 0.0 0.0 3.0 0.0 1.0
38 Mitchell Butler Bullets PF 19 0 1 0 ... 6 0.0 1.0 0.0 0.0 0.0 0.0
[39 rows x 17 columns]
pandas.concat() can concatenate a list of pandas objects with the same structure into one. Collect the per-URL frames in a plain list, then concatenate once after the loop:
final = []
for url in urls:
    ...
    combined = pd.concat(both_teams)
    final.append(combined)
final_df = pd.concat(final, ignore_index=True)
print(final_df)
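A minimal offline sketch of that pattern, in case it helps to try it without hitting the site. The helper fake_box_score and its toy data are made up for illustration and stand in for one parsed table.

```python
import pandas as pd


def fake_box_score(team, players):
    # Hypothetical stand-in for one parsed box-score table (no network access).
    return pd.DataFrame({"Player": players, "Team": team, "PTS": range(len(players))})


final = []
for game in range(2):
    team1 = fake_box_score(f"Home{game}", ["A", "B"])
    team2 = fake_box_score(f"Away{game}", ["C", "D"])
    # list.append takes exactly one argument; ignore_index belongs to pd.concat.
    final.append(pd.concat([team1, team2]))

# One concat at the end gives a single frame with a single header row.
final_df = pd.concat(final, ignore_index=True)
csv_text = final_df.to_csv(index=False)
print(csv_text)
```

Because final_df is one DataFrame rather than a list of frames, to_csv works on it directly and the header is written only once.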
This is a bit complicated to explain, so I'll do my best. I have a pandas dataframe with two columns: hour (from 1 to 24) and value (the value for each hour). The dataset is long, and the hour column repeats in 24-hour cycles (1 to 24). I am trying to create 24 new columns: value -1, value -2, value -3, ..., value -24, which hold, for each row, the value from 1 hour earlier, 2 hours earlier, and so on (taken from the rows above).
hour | value | value -1 | value -2 | value -3| ... | value - 24
1 10 0 0 0 0
2 11 10 0 0 0
3 12 11 10 0 0
4 13 12 11 10 0
...
24 32 31 30 29 0
1 33 32 31 30 10
2 34 33 32 31 11
and so on...
All numbers are just for the example. As I said, there are many rows: not only the 24 hours of one day, but the whole time series that follows, cycling from 1 to 24.
Thanks in advance and may the force be with you!
Is this what you need?
df = pd.DataFrame([[1,10],[2,11],
[3,12],[4,13]], columns=['hour','value'])
# range(1, 25) yields the 24 lag columns "value -1" ... "value -24"
for i in range(1, 25):
    df['value -' + str(i)] = df['value'].shift(i).fillna(0)
Is this what you are looking for?
import pandas as pd
df = pd.DataFrame({'hour': list(range(24))*2,
'value': list(range(48))})
shift_cols_n = 10
for shift in range(1, shift_cols_n):
    new_columns_name = 'value - ' + str(shift)
    # Assuming that you don't have any NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift).fillna(0)
    # A safer (but less simple) way, in case you have NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift)
    # .loc slices are inclusive, so stop at shift - 1 to zero only the first `shift` rows
    df.loc[:shift - 1, new_columns_name] = 0
print(df.head(9))
hour value value - 1 value - 2 value - 3 value - 4 value - 5 \
0 0 0 0.0 0.0 0.0 0.0 0.0
1 1 1 0.0 0.0 0.0 0.0 0.0
2 2 2 1.0 0.0 0.0 0.0 0.0
3 3 3 2.0 1.0 0.0 0.0 0.0
4 4 4 3.0 2.0 1.0 0.0 0.0
5 5 5 4.0 3.0 2.0 1.0 0.0
6 6 6 5.0 4.0 3.0 2.0 1.0
7 7 7 6.0 5.0 4.0 3.0 2.0
8 8 8 7.0 6.0 5.0 4.0 3.0
value - 6 value - 7 value - 8 value - 9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
7 1.0 0.0 0.0 0.0
8 2.0 1.0 0.0 0.0
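For all 24 lags at once, a variant that builds the lag columns in a dict and attaches them with a single concat might look like this. It is only a sketch: the sample data and the names N_LAGS, lags and out are assumptions, and it relies on the fill_value argument of Series.shift to avoid the separate fillna step.

```python
import pandas as pd

# Two days of hourly data, hours 1..24, with made-up values 10..57.
df = pd.DataFrame({"hour": list(range(1, 25)) * 2,
                   "value": list(range(10, 58))})

N_LAGS = 24

# One shifted copy of "value" per lag; rows with no history are filled with 0.
lags = pd.concat(
    {f"value -{k}": df["value"].shift(k, fill_value=0) for k in range(1, N_LAGS + 1)},
    axis=1,
)

# Attach all lag columns in one step instead of inserting them one by one.
out = pd.concat([df, lags], axis=1)
print(out.iloc[[0, 1, 24], :5])
```

Using fill_value keeps the integer dtype, whereas shift followed by fillna(0) converts the column to float.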
I'm training MLP and using version 0.18dev of sklearn. I don't know what's wrong with my code. Could you guys please help?
# TODO: Import 'GridSearchCV' and 'make_scorer'
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer
# TODO: Create the parameters list you wish to tune
parameters = {'max_iter' : [100,200]}
# TODO: Initialize the classifier
clf = clf_B
# TODO: Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_score, pos_label = 'Yes')
# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring = f1_scorer)
# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)
# Get the estimator
clf = grid_obj.best_estimator_
# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))
And the error message
---
IndexError Traceback (most recent call last)
<ipython-input-216-4a3fb1d65cb7> in <module>()
24
25 # TODO: Fit the grid search object to the training data and find the optimal parameters
---> 26 grid_obj = grid_obj.fit(X_train, y_train)
27
28 # Get the estimator
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
810
811 """
--> 812 return self._fit(X, y, ParameterGrid(self.param_grid))
813
814
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
537 'of samples (%i) than data (X: %i samples)'
538 % (len(y), n_samples))
--> 539 cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
540
541 if self.verbose > 0:
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in check_cv(cv, X, y, classifier)
1726 if classifier:
1727 if type_of_target(y) in ['binary', 'multiclass']:
-> 1728 cv = StratifiedKFold(y, cv)
1729 else:
1730 cv = KFold(_num_samples(y), cv)
/home/indy/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, shuffle, random_state)
546 for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
547 for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 548 label_test_folds = test_folds[y == label]
549 # the test split can be too big because we used
550 # KFold(max(c, self.n_folds), self.n_folds) instead of
IndexError: too many indices for array
clf_B is an MLPClassifier, and this is what my input looks like:
clf_B = MLPClassifier(random_state=4)
print X_train
school_GP school_MS sex_F sex_M age address_R address_U \
171 1.0 0.0 0.0 1.0 16 0.0 1.0
12 1.0 0.0 0.0 1.0 15 0.0 1.0
13 1.0 0.0 0.0 1.0 15 0.0 1.0
151 1.0 0.0 0.0 1.0 16 0.0 1.0
310 1.0 0.0 1.0 0.0 19 0.0 1.0
274 1.0 0.0 1.0 0.0 17 0.0 1.0
371 0.0 1.0 0.0 1.0 18 1.0 0.0
29 1.0 0.0 0.0 1.0 16 0.0 1.0
109 1.0 0.0 1.0 0.0 16 0.0 1.0
327 1.0 0.0 0.0 1.0 17 1.0 0.0
131 1.0 0.0 1.0 0.0 15 0.0 1.0
128 1.0 0.0 0.0 1.0 18 1.0 0.0
174 1.0 0.0 1.0 0.0 16 0.0 1.0
108 1.0 0.0 0.0 1.0 15 1.0 0.0
280 1.0 0.0 0.0 1.0 17 0.0 1.0
163 1.0 0.0 0.0 1.0 17 0.0 1.0
178 1.0 0.0 0.0 1.0 16 1.0 0.0
275 1.0 0.0 1.0 0.0 17 0.0 1.0
35 1.0 0.0 1.0 0.0 15 0.0 1.0
276 1.0 0.0 1.0 0.0 18 1.0 0.0
282 1.0 0.0 1.0 0.0 18 1.0 0.0
99 1.0 0.0 1.0 0.0 16 0.0 1.0
194 1.0 0.0 0.0 1.0 16 0.0 1.0
357 0.0 1.0 1.0 0.0 17 0.0 1.0
10 1.0 0.0 1.0 0.0 15 0.0 1.0
112 1.0 0.0 1.0 0.0 16 0.0 1.0
338 1.0 0.0 1.0 0.0 18 0.0 1.0
292 1.0 0.0 1.0 0.0 18 0.0 1.0
305 1.0 0.0 1.0 0.0 18 0.0 1.0
340 1.0 0.0 1.0 0.0 19 0.0 1.0
.. ... ... ... ... ... ... ...
255 1.0 0.0 0.0 1.0 17 0.0 1.0
58 1.0 0.0 0.0 1.0 15 0.0 1.0
33 1.0 0.0 0.0 1.0 15 0.0 1.0
38 1.0 0.0 1.0 0.0 15 1.0 0.0
359 0.0 1.0 1.0 0.0 18 0.0 1.0
51 1.0 0.0 1.0 0.0 15 0.0 1.0
363 0.0 1.0 1.0 0.0 17 0.0 1.0
260 1.0 0.0 1.0 0.0 18 0.0 1.0
102 1.0 0.0 0.0 1.0 15 0.0 1.0
195 1.0 0.0 1.0 0.0 17 0.0 1.0
167 1.0 0.0 1.0 0.0 16 0.0 1.0
293 1.0 0.0 1.0 0.0 17 1.0 0.0
116 1.0 0.0 0.0 1.0 15 0.0 1.0
124 1.0 0.0 1.0 0.0 16 0.0 1.0
218 1.0 0.0 1.0 0.0 17 0.0 1.0
287 1.0 0.0 1.0 0.0 17 0.0 1.0
319 1.0 0.0 1.0 0.0 18 0.0 1.0
47 1.0 0.0 0.0 1.0 16 0.0 1.0
213 1.0 0.0 0.0 1.0 18 0.0 1.0
389 0.0 1.0 1.0 0.0 18 0.0 1.0
95 1.0 0.0 1.0 0.0 15 1.0 0.0
162 1.0 0.0 0.0 1.0 16 0.0 1.0
263 1.0 0.0 1.0 0.0 17 0.0 1.0
360 0.0 1.0 1.0 0.0 18 1.0 0.0
75 1.0 0.0 0.0 1.0 15 0.0 1.0
299 1.0 0.0 0.0 1.0 18 0.0 1.0
22 1.0 0.0 0.0 1.0 16 0.0 1.0
72 1.0 0.0 1.0 0.0 15 1.0 0.0
15 1.0 0.0 1.0 0.0 16 0.0 1.0
168 1.0 0.0 1.0 0.0 16 0.0 1.0
famsize_GT3 famsize_LE3 Pstatus_A ... higher internet \
171 1.0 0.0 0.0 ... 1 1
12 0.0 1.0 0.0 ... 1 1
13 1.0 0.0 0.0 ... 1 1
151 0.0 1.0 0.0 ... 1 0
310 0.0 1.0 0.0 ... 1 0
274 1.0 0.0 0.0 ... 1 1
371 0.0 1.0 0.0 ... 0 1
29 1.0 0.0 0.0 ... 1 1
109 0.0 1.0 0.0 ... 1 1
327 1.0 0.0 0.0 ... 1 1
131 1.0 0.0 0.0 ... 1 1
128 1.0 0.0 0.0 ... 1 1
174 0.0 1.0 0.0 ... 1 1
108 1.0 0.0 0.0 ... 1 1
280 0.0 1.0 1.0 ... 1 1
163 1.0 0.0 0.0 ... 0 1
178 1.0 0.0 0.0 ... 1 1
275 0.0 1.0 0.0 ... 1 1
35 1.0 0.0 0.0 ... 1 0
276 1.0 0.0 1.0 ... 0 1
282 0.0 1.0 0.0 ... 1 0
99 1.0 0.0 0.0 ... 1 1
194 1.0 0.0 0.0 ... 1 1
357 0.0 1.0 1.0 ... 1 0
10 1.0 0.0 0.0 ... 1 1
112 1.0 0.0 0.0 ... 1 1
338 0.0 1.0 0.0 ... 1 1
292 0.0 1.0 0.0 ... 1 1
305 1.0 0.0 0.0 ... 1 1
340 1.0 0.0 0.0 ... 1 1
.. ... ... ... ... ... ...
255 0.0 1.0 0.0 ... 1 1
58 0.0 1.0 0.0 ... 1 1
33 0.0 1.0 0.0 ... 1 1
38 1.0 0.0 0.0 ... 1 1
359 0.0 1.0 0.0 ... 1 1
51 0.0 1.0 0.0 ... 1 1
363 0.0 1.0 0.0 ... 1 1
260 1.0 0.0 0.0 ... 1 1
102 1.0 0.0 0.0 ... 1 1
195 0.0 1.0 0.0 ... 1 1
167 1.0 0.0 0.0 ... 1 1
293 0.0 1.0 0.0 ... 1 0
116 1.0 0.0 0.0 ... 1 0
124 1.0 0.0 0.0 ... 1 1
218 1.0 0.0 0.0 ... 1 0
287 1.0 0.0 0.0 ... 1 1
319 1.0 0.0 0.0 ... 1 1
47 1.0 0.0 0.0 ... 1 1
213 1.0 0.0 0.0 ... 1 1
389 1.0 0.0 0.0 ... 1 0
95 1.0 0.0 0.0 ... 1 1
162 0.0 1.0 0.0 ... 1 0
263 1.0 0.0 0.0 ... 1 0
360 0.0 1.0 1.0 ... 1 0
75 1.0 0.0 0.0 ... 1 1
299 0.0 1.0 0.0 ... 1 1
22 0.0 1.0 0.0 ... 1 1
72 1.0 0.0 0.0 ... 1 1
15 1.0 0.0 0.0 ... 1 1
168 1.0 0.0 0.0 ... 1 1
romantic famrel freetime goout Dalc Walc health absences
171 1 4 3 2 1 1 3 2
12 0 4 3 3 1 3 5 2
13 0 5 4 3 1 2 3 2
151 1 4 4 4 3 5 5 6
310 1 4 2 4 2 2 3 0
274 1 4 3 3 1 1 1 2
371 1 4 3 3 2 3 3 3
29 1 4 4 5 5 5 5 16
109 1 5 4 5 1 1 4 4
327 0 4 4 5 5 5 4 8
131 1 4 3 3 1 2 4 0
128 0 3 3 3 1 2 4 0
174 0 4 4 5 1 1 4 4
108 1 1 3 5 3 5 1 6
280 1 4 5 4 2 4 5 30
163 0 5 3 3 1 4 2 2
178 1 4 3 3 3 4 3 10
275 1 4 4 4 2 3 5 6
35 0 3 5 1 1 1 5 0
276 1 4 1 1 1 1 5 75
282 0 5 2 2 1 1 3 1
99 0 5 3 5 1 1 3 0
194 0 5 3 3 1 1 3 0
357 1 1 2 3 1 2 5 2
10 0 3 3 3 1 2 2 0
112 0 3 1 2 1 1 5 6
338 0 5 3 3 1 1 1 7
292 1 5 4 3 1 1 5 12
305 0 4 4 3 1 1 3 8
340 1 4 3 4 1 3 3 4
.. ... ... ... ... ... ... ... ...
255 0 4 4 4 1 2 5 2
58 0 4 3 2 1 1 5 2
33 0 5 3 2 1 1 2 0
38 0 4 3 2 1 1 5 2
359 0 5 3 2 1 1 4 0
51 0 4 3 3 1 1 5 2
363 1 2 3 4 1 1 1 0
260 1 3 1 2 1 3 2 21
102 0 5 3 3 1 1 5 4
195 1 4 3 2 1 1 5 0
167 1 4 2 3 1 1 3 0
293 0 3 1 2 1 1 3 6
116 0 4 4 3 1 1 2 2
124 1 5 4 4 1 1 5 0
218 0 3 3 3 1 4 3 3
287 0 4 3 3 1 1 3 6
319 0 4 4 4 3 3 5 2
47 0 4 2 2 1 1 2 4
213 0 4 4 4 2 4 5 15
389 0 1 1 1 1 1 5 0
95 0 3 1 2 1 1 1 2
162 0 4 4 4 2 4 5 0
263 0 3 2 3 1 1 4 4
360 1 4 3 4 1 4 5 0
75 0 4 3 3 2 3 5 6
299 1 1 4 2 2 2 1 5
22 0 4 5 1 1 3 5 2
72 1 3 3 4 2 4 5 2
15 0 4 4 4 1 2 2 4
168 0 5 1 5 1 1 4 0
[300 rows x 48 columns]
This is what my output looks like:
print y_train
passed
171 yes
12 yes
13 yes
151 yes
310 no
274 yes
371 yes
29 yes
109 yes
327 yes
131 no
128 no
174 no
108 yes
280 no
163 yes
178 no
275 yes
35 no
276 no
282 yes
99 no
194 yes
357 yes
10 no
112 yes
338 yes
292 yes
305 yes
340 yes
.. ...
255 no
58 no
33 yes
38 yes
359 yes
51 yes
363 yes
260 yes
102 yes
195 yes
167 yes
293 yes
116 yes
124 no
218 no
287 yes
319 yes
47 yes
213 no
389 no
95 yes
162 no
263 no
360 yes
75 yes
299 yes
22 yes
72 no
15 yes
168 no
[300 rows x 1 columns]
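One possible reading of the traceback, offered as a guess rather than a confirmed diagnosis: the last line above says y_train has "[300 rows x 1 columns]", so it is a one-column DataFrame (2-D) rather than a 1-D Series. The failing line, test_folds[y == label], would then index a 1-D array with a 2-D boolean mask. The sketch below reproduces that situation with NumPy alone; the four sample rows are made up.

```python
import numpy as np
import pandas as pd

# A one-column DataFrame, like the printed y_train (sample rows are made up).
y_train = pd.DataFrame({"passed": ["yes", "no", "yes", "yes"]},
                       index=[171, 310, 12, 13])

y_2d = np.asarray(y_train)                 # shape (4, 1)
test_folds = np.zeros(len(y_2d), dtype=int)

# Indexing a 1-D array with a 2-D boolean mask fails.
try:
    test_folds[y_2d == "yes"]
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)

# Selecting the column gives a 1-D array, and the same indexing works.
y_1d = y_train["passed"].values            # shape (4,)
selected = test_folds[y_1d == "yes"]
print(selected)
```

If this is indeed the cause, passing y_train["passed"] (or a flattened array) to fit would be the thing to try.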