I am reading the book "Building Machine Learning Systems with Python".
In the classification of the Iris dataset, I am having trouble understanding the syntax of:
plt.scatter(features[target == t, 0],
            features[target == t, 1],
            marker=marker,
            c=c)
Specifically, what does features[target == t,0] actually mean?
Looking at this code, it seems that features and target are both arrays and t is a number. Moreover, both features and target have the same number of rows.
In that case, features[target == t, 0] does the following:
target == t creates a Boolean array of the same shape as target (True if the value is t, otherwise False).
features[target == t, 0] selects those rows from features which correspond to True in the target == t array. The 0 specifies that the first column of features should be selected.
In other words, the code selects the rows of features for which target is equal to t and from those rows, the 0 selects the first column.
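The same selection can be reproduced on a tiny hand-made array (a minimal sketch, not the book's actual data):

```python
import numpy as np

# four rows of (sepal length, sepal width) and a class label per row
features = np.array([[5.1, 3.5],
                     [7.0, 3.2],
                     [4.9, 3.0],
                     [6.4, 3.2]])
target = np.array([0, 1, 0, 1])

t = 0
mask = target == t        # boolean array: [ True, False,  True, False]
col0 = features[mask, 0]  # first column of the rows where target == 0
print(col0)               # [5.1 4.9]
```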
A better way to see this is that the for loop splits the features array into three different arrays, each corresponding to a particular species of Iris.
Each of these arrays contains the first feature of the plants (instances) of that species.
This is the output if you print features[target == t, 0] for each value of t:
[ 5.1 4.9 4.7 4.6 5. 5.4 4.6 5. 4.4 4.9 5.4 4.8 4.8 4.3 5.8
5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5. 5. 5.2 5.2 4.7
4.8 5.4 5.2 5.5 4.9 5. 5.5 4.9 4.4 5.1 5. 4.5 4.4 5. 5.1
4.8 5.1 4.6 5.3 5. ]
[ 7. 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5. 5.9 6. 6.1 5.6
6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6. 5.7
5.5 5.5 5.8 6. 5.4 6. 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5. 5.6
5.7 5.7 6.2 5.1 5.7]
[ 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8
6.4 6.5 7.7 7.7 6. 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2
7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6. 6.9 6.7 6.9 5.8 6.8 6.7
6.7 6.3 6.5 6.2 5.9]
I have two dataframes df1, df2 described below
df1
prod age
0 Winalto_eu 28
1 Winalto_uc 25
2 CEM_eu 30
df2
age qx
0 25 2.7
1 26 2.8
2 27 2.8
3 28 2.9
4 29 3.0
5 30 3.2
6 31 3.4
7 32 3.7
8 33 4.1
9 34 4.6
10 35 5.1
11 36 5.6
12 37 6.1
13 38 6.7
14 39 7.5
15 40 8.2
I would like to add new columns to df1 with a for loop.
The names of the new columns should be qx1, qx2, ..., qx10:
for i in range(1, 11):
    df1['qx' + str(i)]
The values of qx1 should be filled by the loop, doing a kind of VLOOKUP on the age:
for instance, on the first row, for the prod 'Winalto_eu', the value of qx1 should be the value of
df2['qx'] at age 28 + 1, qx2 the same at 28 + 2, and so on.
The target dataframe should look like this:
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Do you have any idea?
Thanks
I think this will give you what you want. I used the shift function to first generate the additional columns in df2 and then merged with df1.
import pandas as pd
df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'], 'age' : [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25,41)), 'qx': [2.7, 2.8, 2.8, 2.9, 3, 3.2, 3.4, 3.7, 4.1, 4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})
for i in range(1, 11):
    df2['qx' + str(i)] = df2.qx.shift(-i)
df3 = pd.merge(df1,df2,how = 'left',on = ['age'])
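Dropping the leftover qx column after the merge reproduces the target frame exactly; here is a self-contained check of the shift-and-merge approach:

```python
import pandas as pd

df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'],
                    'age': [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25, 41)),
                    'qx': [2.7, 2.8, 2.8, 2.9, 3, 3.2, 3.4, 3.7,
                           4.1, 4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})

# shift(-i) pulls the qx value from i rows (i.e. i years) further down
for i in range(1, 11):
    df2['qx' + str(i)] = df2['qx'].shift(-i)

# merge on age, then drop the qx column for the age itself
df3 = pd.merge(df1, df2, how='left', on='age').drop(columns='qx')
print(df3.loc[0, 'qx1'])  # 3.0 -> qx at age 29, for Winalto_eu at age 28
```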
At the beginning you could try df1.set_index('prod', inplace=True), and after that transpose df2 with the qx values.
Here's a way using .loc filtering the data:
top_n = 10
values = [df2.loc[df2['age'].gt(x),'qx'].iloc[:top_n].tolist() for x in df1['age']]
coln = ['qx'+str(x) for x in range(1,11)]
df1[coln] = pd.DataFrame(values)
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Ridiculously overengineered solution (assuming ser1 = df2.set_index('age')):
pd.concat([df1,
           pd.DataFrame(columns=['qx' + str(i) for i in range(11)],
                        data=[ser1.T.loc[:, i:i + 10].values.flatten().tolist()
                              for i in df1['age']])],
          axis=1)
prod age qx0 qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.7 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Try:
df = (df1.assign(key=0)
         .merge(df2.assign(key=0), on="key", suffixes=["", "_y"])
         .query("age < age_y")
         .drop(["key"], axis=1))
df["q"]=df.groupby("prod")["age_y"].rank()
#to keep only 10 positions for each
df=df.loc[df["q"]<=10]
df=df.pivot_table(index=["prod", "age"], columns="q", values="qx")
df.columns=[f"qx{col:0.0f}" for col in df.columns]
df=df.reset_index()
Output:
prod age qx1 qx2 qx3 ... qx6 qx7 qx8 qx9 qx10
0 CEM_eu 30 3.4 3.7 4.1 ... 5.6 6.1 6.7 7.5 8.2
1 Winalto_eu 28 3.0 3.2 3.4 ... 4.6 5.1 5.6 6.1 6.7
2 Winalto_uc 25 2.8 2.8 2.9 ... 3.4 3.7 4.1 4.6 5.1
Here is my code; everything works pretty well until I try to send it to Excel. I have a script that works fine for one web page but not for multiple pages.
Working code and what I want:
import pandas as pd
from pandas import ExcelWriter
dfs = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play/',header=0)
for df in dfs:
    print(df)
writer = pd.ExcelWriter('nfl.xlsx')
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
writer.save()
Non-working code:
import pandas as pd
from pandas import ExcelWriter
oyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html',header=0)
dyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/opponent-yards-per-play',header=0)
for df in (oyyp_df, dyyp_df):
    print(df)
writer = pd.ExcelWriter('nfl.xlsx')
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
df.to_excel('nflypp.xlsx', sheet_name='yppd', index=False, engine='xlsxwriter')
writer.save()
It works until it gets to the df.to_excel call, which raises:
AttributeError: 'list' object has no attribute 'to_excel'
Here is the output:
C:\Cabs\projects>nflstatsypp.py
[ Rank Team 2018 Last 3 Last 1 Home Away 2017
0 1 Kansas City 7.0 7.0 6.9 6.4 7.5 6.1
1 2 LA Chargers 6.8 6.4 6.2 6.6 6.9 5.9
2 3 LA Rams 6.7 6.2 5.4 7.0 6.4 5.8
3 4 Tampa Bay 6.5 6.3 5.3 6.3 6.8 5.6
4 5 New Orleans 6.2 6.0 3.6 6.7 5.7 6.3
5 6 Pittsburgh 6.2 6.0 5.3 6.2 6.2 5.8
6 7 Carolina 6.2 7.3 6.8 6.1 6.2 5.1
7 8 Atlanta 6.0 5.0 2.9 6.5 5.5 5.8
8 9 Green Bay 6.0 5.4 4.4 5.9 6.1 4.9
9 10 Denver 5.9 6.1 6.3 6.1 5.8 4.8
10 11 New England 5.9 6.2 6.6 6.2 5.5 6.0
11 12 NY Giants 5.8 6.2 5.0 5.4 6.1 4.9
12 13 Houston 5.7 6.0 5.2 6.2 5.3 5.0
13 14 Seattle 5.7 6.2 6.8 5.5 5.9 5.2
14 15 San Francisco 5.7 5.8 6.1 5.4 5.9 5.3
15 16 Indianapolis 5.7 5.7 3.7 6.2 5.1 4.6
16 17 Cincinnati 5.6 5.1 4.8 5.5 5.7 4.8
17 18 Minnesota 5.6 5.1 4.7 5.6 5.6 5.4
18 19 Oakland 5.5 5.3 6.4 6.2 5.0 5.4
19 20 Philadelphia 5.5 5.4 6.1 5.5 5.5 5.6
20 21 Chicago 5.5 4.6 4.9 6.0 5.0 4.9
21 22 Cleveland 5.4 7.3 8.2 5.1 5.8 4.9
22 23 Tennessee 5.4 7.1 7.5 5.8 5.0 5.2
23 24 Miami 5.4 4.7 3.5 5.8 4.9 4.9
24 25 Dallas 5.3 5.2 4.7 5.6 5.1 5.3
25 26 Detroit 5.3 5.0 4.8 5.2 5.5 5.5
26 27 Baltimore 5.2 5.4 4.8 5.3 5.2 4.6
27 28 Washington 5.2 4.8 5.6 5.0 5.4 5.3
28 29 Jacksonville 5.0 4.3 3.8 5.0 5.1 5.4
29 30 NY Jets 4.9 4.5 4.3 5.4 4.4 5.0
30 31 Buffalo 4.5 6.2 6.3 4.5 4.6 4.7
31 32 Arizona 4.4 4.8 5.5 4.5 4.2 4.7]
[ Rank Team 2018 Last 3 Last 1 Home Away 2017
0 1 Baltimore 4.6 4.1 2.9 4.5 4.8 5.0
1 2 Buffalo 4.9 4.2 3.5 5.1 4.7 5.3
2 3 Chicago 4.9 4.8 5.0 4.6 5.2 5.1
3 4 Pittsburgh 5.2 5.1 6.2 5.6 4.8 5.3
4 5 Dallas 5.3 5.2 3.6 4.9 5.6 5.1
5 6 Minnesota 5.3 5.4 6.6 4.6 5.9 4.8
6 7 Arizona 5.3 5.1 4.4 5.0 5.6 4.9
7 8 Jacksonville 5.3 5.6 7.5 4.3 6.2 4.8
8 9 Houston 5.4 6.1 8.2 5.9 4.9 5.7
9 10 Tennessee 5.4 5.1 3.8 5.0 5.7 5.1
10 11 LA Chargers 5.5 5.1 5.3 5.7 5.4 5.3
11 12 Indianapolis 5.5 4.8 3.9 5.6 5.4 5.7
12 13 Green Bay 5.5 5.7 5.5 5.2 5.8 5.5
13 14 San Francisco 5.6 5.9 6.8 5.1 5.8 5.3
14 15 New England 5.7 5.4 4.7 5.4 5.9 5.7
15 16 NY Jets 5.7 6.8 6.7 6.0 5.4 5.4
16 17 Cleveland 5.7 5.3 5.2 6.0 5.5 5.1
17 18 Carolina 5.8 5.5 5.3 5.8 5.8 5.4
18 19 Washington 5.8 5.8 6.1 5.7 5.9 5.3
19 20 NY Giants 5.8 6.0 4.9 5.7 6.0 5.7
20 21 Denver 5.9 6.2 4.8 6.0 5.7 4.9
21 22 New Orleans 5.9 4.8 4.7 6.1 5.8 5.4
22 23 Kansas City 6.0 5.4 6.4 5.4 6.4 5.6
23 24 Philadelphia 6.1 7.0 5.6 5.7 6.6 5.2
24 25 Detroit 6.1 5.6 5.4 5.9 6.4 5.5
25 26 LA Rams 6.1 6.4 4.8 6.4 5.8 5.3
26 27 Seattle 6.1 7.2 6.1 6.7 5.8 4.9
27 28 Atlanta 6.2 5.1 4.8 6.4 5.9 5.2
28 29 Cincinnati 6.2 5.7 6.3 6.2 6.2 5.0
29 30 Miami 6.3 6.7 6.3 6.1 6.5 5.4
30 31 Tampa Bay 6.4 6.4 6.8 5.8 7.1 6.0
31 32 Oakland 6.6 6.2 6.9 6.5 6.6 5.6]
Traceback (most recent call last):
File "C:\Cabs\projects\nflstatsypp.py", line 14, in
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
AttributeError: 'list' object has no attribute 'to_excel'
One last question: how do you clean up the second table above so the headers are lined up like the first table? If that has been answered elsewhere, please add a link. Note that when printed out in Python, the first table's headers are correct, just for clarification. Thanks again; no more edits. Hope all this helps.
I'm brand new, just having fun. I have been researching for months with all different code and have about 15 .py files trying to get this to work.
Thanks for any help. If the answer is out there, I can't find or understand it. :-) Sorry for being such a newbie. LOL
There are a few ways you can do this. I would probably loop it to condense the code a bit, saving each dataframe as you iterate in the for loop. But it also looks like you want different names for your sheets, which would involve creating a variable to associate with each of your pd.read_html calls. Since it appears you're a beginner, we'll keep this as simple as possible and just save the data straight away.
First off, when you do oyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html', header=0), it stores the result as a dataframe, BUT packaged inside a list (see here).
Also, it would be beneficial to go back and read up about lists in Python. Your for loop iterates through the items in each of your lists (oyyp_df, dyyp_df).
If you want to call a specific item in a list, you call it by its index/position. The key thing to note is that the index starts at 0. So the first item in a list is at position 0, the 2nd item is at position 1, etc.
a_list = ['first item', 'second item', 'third item']
To call that first item, you'd type a_list[0] and you would see the output 'first item'.
Now a list can hold many data types. It could be strings, like above; it can be integers; it can be dictionaries; or, in your case here, dataframes.
So oyyp_df is really [<your DATAFRAME>, <maybe a 2nd dataframe>, etc.]. Yours only contains one item, in the first position. So you get that error: lists can't do .to_excel, but dataframes can.
What we can do is store that first-item dataframe by assigning it to another name (you could actually reuse the same name, but be careful: if your list had other items in it, you'd lose them): oyyp_df = oyyp_df[0]
I changed a couple things to hopefully make it clearer in your code below.
import pandas as pd
html_data1 = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html',header=0)
html_data2 = pd.read_html('https://www.teamrankings.com/nfl/stat/opponent-yards-per-play',header=0)
for df in (html_data1, html_data2):
    print(df)
oyyp_df = html_data1[0]
dyyp_df = html_data2[0]
writer = pd.ExcelWriter('nflypp.xlsx')
oyyp_df.to_excel(writer, sheet_name='yppo', index=False)
dyyp_df.to_excel(writer, sheet_name='yppd', index=False)
writer.save()  # save() finalizes the workbook; a separate close() call is redundant
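If you later add more pages, the same pattern scales by pairing sheet names with the unwrapped frames in a dict. A sketch with stand-in data (with the real pages, you would put the pd.read_html(url)[0] results in the dict instead of these hypothetical frames):

```python
import pandas as pd

# stand-ins for pd.read_html(url)[0] -- each read_html call returns a LIST
# of DataFrames, one per <table> on the page, so [0] unwraps the first table
oyyp_df = pd.DataFrame({'Team': ['Kansas City', 'LA Chargers'], '2018': [7.0, 6.8]})
dyyp_df = pd.DataFrame({'Team': ['Baltimore', 'Buffalo'], '2018': [4.6, 4.9]})

sheets = {'yppo': oyyp_df, 'yppd': dyyp_df}

with pd.ExcelWriter('nflypp.xlsx') as writer:  # context manager saves on exit
    for name, frame in sheets.items():
        frame.to_excel(writer, sheet_name=name, index=False)
```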
How can I change the values of column 4 to 1 and -1, so that Iris-setosa is replaced with 1 and Iris-virginica is replaced with -1?
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
I would appreciate the help.
You can use replace:
d = {'Iris-setosa': 1, 'Iris-virginica': -1}
df[4].replace(d, inplace=True)  # use df['4'] instead if your column labels are strings
0 1 2 3 4
0 5.1 3.5 1.4 0.2 1
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 5.4 3.9 1.7 0.4 1
6 4.6 3.4 1.4 0.3 1
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 -1
121 5.6 2.8 4.9 2.0 -1
122 7.7 2.8 6.7 2.0 -1
123 6.3 2.7 4.9 1.8 -1
124 6.7 3.3 5.7 2.1 -1
125 7.2 3.2 6.0 1.8 -1
126 6.2 2.8 4.8 1.8 -1
# use .loc (label/boolean based), not .iloc (position based), with a boolean mask
df.loc[df[4] == "Iris-setosa", 4] = 1
df.loc[df[4] == "Iris-virginica", 4] = -1
I would do something like this:
def encode_row(row):
    if row[4] == "Iris-setosa":
        return 1
    return -1

df_test[4] = df_test.apply(encode_row, axis=1)
assuming that df_test is your data frame.
Sounds like
import numpy as np
df[4] = np.where(df[4] == 'Iris-setosa', 1, -1)
should do the job.
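As a quick self-contained check of the np.where approach (a toy frame with an integer column label, as in the printed data):

```python
import numpy as np
import pandas as pd

# tiny stand-in for the real frame; column label is the integer 4
df = pd.DataFrame({4: ['Iris-setosa', 'Iris-virginica', 'Iris-setosa']})

# vectorized: 1 where the condition holds, -1 everywhere else
df[4] = np.where(df[4] == 'Iris-setosa', 1, -1)
print(df[4].tolist())  # [1, -1, 1]
```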
What does the function load_iris() do?
Also, I don't understand what type of data it contains and where to find it.
iris = datasets.load_iris()
X = iris.data
target = iris.target
names = iris.target_names
Can somebody please explain in detail what this piece of code does?
Thanks in advance.
load_iris is a function from sklearn. The link provides documentation: iris in your code will be a dictionary-like object. X and target will be numpy arrays, and names holds the possible target classes as text (rather than the numeric values in target).
You can get some documentation with:
# import some data to play with
iris = datasets.load_iris()
print('The data matrix:\n',iris['data'])
print('The classification target:\n',iris['target'])
print('The names of the dataset columns:\n',iris['feature_names'])
print('The names of target classes:\n',iris['target_names'])
print('The full description of the dataset:\n',iris['DESCR'])
print('The path to the location of the data:\n',iris['filename'])
This gives you:
The data matrix:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
The classification target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
The names of the dataset columns:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
The names of target classes:
['setosa' 'versicolor' 'virginica']
The full description of the dataset:
.. _iris_dataset:
Iris plants dataset
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU#io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
Many, many more ...
The path to the location of the data:
/Applications/anaconda3/lib/python3.7/site-packages/sklearn/datasets/data/iris.csv
To thread off the previous comments and posts from above, I wanted to add another way to load the iris dataset besides iris = datasets.load_iris():
from sklearn.datasets import load_iris
iris = load_iris()
Then, you can do:
X = iris.data
target = iris.target
names = iris.target_names
And see posts and comments from other people here.
And you can make a dataframe with :
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target
df['species'] = df['species'].replace(to_replace= [0, 1, 2], value = ['setosa', 'versicolor', 'virginica'])
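A quick sanity check on the frame built this way (assuming scikit-learn and pandas are installed):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species'] = df['species'].replace([0, 1, 2], ['setosa', 'versicolor', 'virginica'])

print(df.shape)                      # (150, 5) -> 4 measurements + species
print(df['species'].value_counts())  # 50 rows of each class
```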
From the Naive Bayes tutorial in sklearn there's an example on the iris dataset, but it looks too cryptic; can someone help enlighten me?
What does iris.data mean? Why are there 4 columns?
What does iris.target mean? Why is it a flat array of 0s, 1s and 2s?
from sklearn import datasets
iris = datasets.load_iris()
print(iris.data)
[out]:
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 ...
 [ 6.2  3.4  5.4  2.3]
 [ 5.9  3.   5.1  1.8]]
(150 rows; the same data matrix is printed in full earlier on this page.)
iris.target returns another array, of 0s, 1s and 2s. What do they mean?
from sklearn import datasets
iris = datasets.load_iris()
print(iris.target)
[out]:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Iris is the well-known Fisher's Iris data set. He measured the length and width of the sepal and petal (two parts of the flower) of three species of Iris. Each row contains the measurements from one flower, and there are measurements for 50 flowers of each type, hence the dimensions of iris.data. The actual type of the flower is coded as 0, 1, or 2 in iris.target; you can recover the actual species names (as strings) from iris.target_names.
Fisher showed that his then-new discriminant method could separate the three species based on their sepal and petal measurements and it's been a standard classification data set ever since.
tl;dr: sample data. One example per row, with four attributes; 150 examples total. Class labels are stored separately and are coded as integers.
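In code, the shapes make this concrete (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4) -> 150 flowers, 4 measurements each
print(iris.target.shape)  # (150,)   -> one integer class label per flower
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```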
Docs here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris