Covariance matrix of an array of vectors - python
I have an array of size-4 vectors (which we could consider 4-tuples). I want to find the covariance matrix, but if I call np.cov I get a huge matrix, while I'm expecting a 4x4.
The code is simply print(np.cov(iris_separated[0])), where iris_separated[0] is the setosas from the iris dataset.
print(iris_separated[0]) looks like this
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]]
And I'm expecting a 4x4 covariance matrix, but instead I'm getting a huge matrix with many dimensions.
[[4.75 4.42166667 4.35333333 ... 4.23 4.945 4.60166667]
[4.42166667 4.14916667 4.055 ... 3.93833333 4.59916667 4.29583333]
[4.35333333 4.055 3.99 ... 3.87666667 4.53166667 4.21833333]
...
[4.23 3.93833333 3.87666667 ... 3.77 4.405 4.09833333]
[4.945 4.59916667 4.53166667 ... 4.405 5.14916667 4.78916667]
[4.60166667 4.29583333 4.21833333 ... 4.09833333 4.78916667 4.4625 ]]
You need to transpose the matrix. By default, np.cov treats each row as a variable and each column as an observation, so your 50x4 array is interpreted as 50 variables, giving a 50x50 result. Therefore, it should be np.cov(iris_separated[0].T).
Please refer to the docs:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
print(np.cov(iris_separated[0], rowvar=False)) fixes the problem, as does using .T on the data.
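To illustrate the difference on a small scale (a sketch; the first five setosa rows stand in for the full 50-row array):

```python
import numpy as np

# A small stand-in for iris_separated[0]: rows are observations, columns are the 4 features.
data = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

# Default: np.cov treats each ROW as a variable -> 5x5 here (50x50 for the full slice).
print(np.cov(data).shape)                # (5, 5)

# Either of these treats each COLUMN as a variable -> the expected 4x4.
print(np.cov(data, rowvar=False).shape)  # (4, 4)
print(np.cov(data.T).shape)              # (4, 4)
```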
Related
I have a text file containing 10 lists of data, and I am trying to convert it into a DataFrame where every list is a column.
Here is the data that I need to display in the Dataframe. Each list needs a column name, my current way is producing a Dataframe that is clearly wrong, any advice would be great [[ 7.3 10.3 7.3 3.4 1.2 0.3 0.1 8.8 12.4 8.8 4.1 1.5 0.4 0.1 5.3 7.5 5.3 2.5 0.9 0.2 0.1 2.1 3. 2.1 1. 0.4 0.1 0. 0.6 0.9 0.6 0.3 0.1 0. 0. 0.2 0.2 0.2 0.1 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ], [[ 4.6 6.6 4.7 2.3 0.8 0.2 0.1 7.6 10.9 7.8 3.7 1.3 0.4 0.1 6.3 8.9 6.4 3. 1.1 0.3 0.1 3.4 4.9 3.5 1.7 0.6 0.2 0. 1.4 2. 1.4 0.7 0.2 0.1 0. 0.5 0.7 0.5 0.2 0.1 0. 0. 0.1 0.2 0.1 0.1 0. 0. 0. ], [[ 6.4 9.1 6.4 3. 1.1 0.3 0.1 8.5 12.1 8.6 4. 1.4 0.4 0.1 5.7 8.1 5.7 2.7 1. 0.3 0.1 2.5 3.6 2.5 1.2 0.4 0.1 0. 0.8 1.2 0.8 0.4 0.1 0. 0. 0.2 0.3 0.2 0.1 0. 0. 0. 0. 0.1 0.1 0. 0. 0. 0. ], [[ 3.9 5.8 4.4 2.2 0.8 0.2 0.1 6.8 10.2 7.6 3.8 1.4 0.4 0.1 5.9 8.9 6.7 3.3 1.3 0.4 0.1 3.5 5.2 3.9 1.9 0.7 0.2 0.1 1.5 2.3 1.7 0.9 0.3 0.1 0. 0.5 0.8 0.6 0.3 0.1 0. 0. 0.2 0.2 0.2 0.1 0. 0. 0. ], [[ 7.2 10. 6.8 3.1 1.1 0.3 0.1 9.1 12.5 8.6 3.9 1.3 0.4 0.1 5.7 7.8 5.3 2.5 0.8 0.2 0.1 2.4 3.2 2.2 1. 0.4 0.1 0. 0.7 1. 0.7 0.3 0.1 0. 0. 0.2 0.3 0.2 0.1 0. 0. 0. 0. 0.1 0. 0. 0. 0. 0. ], [[3. 4.9 4. 2.2 0.9 0.3 0.1 5.7 9.2 7.5 4. 1.6 0.5 0.1 5.3 8.6 7. 3.8 1.5 0.5 0.1 3.3 5.4 4.4 2.4 1. 0.3 0.1 1.6 2.5 2.1 1.1 0.5 0.1 0. 0.6 0.9 0.8 0.4 0.2 0.1 0. 0.2 0.3 0.2 0.1 0.1 0. 0. ], [[ 6.9 8.1 4.7 1.8 0.5 0.1 0. 10.4 12.2 7.1 2.8 0.8 0.2 0. 7.8 9.1 5.3 2.1 0.6 0.1 0. 3.9 4.6 2.7 1. 0.3 0.1 0. 1.5 1.7 1. 0.4 0.1 0. 0. 0.4 0.5 0.3 0.1 0. 0. 0. 0.1 0.1 0.1 0. 0. 0. 0. ], [[3. 4.5 3.4 1.7 0.6 0.2 0. 6. 9.1 6.8 3.4 1.3 0.4 0.1 6. 9.1 6.8 3.4 1.3 0.4 0.1 4. 6. 4.5 2.3 0.8 0.3 0.1 2. 3. 2.3 1.1 0.4 0.1 0. 0.8 1.2 0.9 0.5 0.2 0.1 0. 0.3 0.4 0.3 0.2 0.1 0. 0. ], [[ 6.4 9.6 7.2 3.6 1.3 0.4 0.1 8. 12. 9. 4.5 1.7 0.5 0.1 5. 7.5 5.6 2.8 1.1 0.3 0.1 2.1 3.1 2.3 1.2 0.4 0.1 0. 0.7 1. 0.7 0.4 0.1 0. 0. 0.2 0.2 0.2 0.1 0. 0. 0. 0. 0.1 0. 0. 0. 0. 0. ], [[1.1 1.9 1.7 1. 0.4 0.2 0. 
3.1 5.3 4.7 2.7 1.2 0.4 0.1 4.2 7.4 6.4 3.8 1.6 0.6 0.2 3.9 6.7 5.9 3.4 1.5 0.5 0.2 2.6 4.6 4.1 2.4 1. 0.4 0.1 1.5 2.5 2.2 1.3 0.6 0.2 0.1 0.7 1.2 1. 0.6 0.3 0.1 0. ] ] my current code: matrix_df = pd.DataFrame(pd.read_csv('filename.txt', names = [names 1 -10]))
You will have to pre-process the collection of lists to convert it to something like this: master_list = [[7.3,10.3,7.3,3.4,1.2,0.3,0.1,8.8,12.4,8.8,4.1,1.5,0.4,0.1,5.3,7.5,5.3,2.5,0.9,0.2,0.1,2.1,3.0,2.1,1.0,0.4,0.1,0.0,0.6,0.9,0.6,0.3,0.1,0.0,0.0,0.2,0.2,0.2,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0], [4.6,6.6,4.7,2.3,0.8,0.2,0.1,7.6,10.9,7.8,3.7,1.3,0.4,0.1,6.3,8.9,6.4,3.0,1.1,0.3,0.1,3.4,4.9,3.5,1.7,0.6,0.2,0.0,1.4,2.0,1.4,0.7,0.2,0.1,0.0,0.5,0.7,0.5,0.2,0.1,0.0,0.0,0.1,0.2,0.1,0.1,0.0,0.0,0.0], [6.4,9.1,6.4,3.0,1.1,0.3,0.1,8.5,12.1,8.6,4.0,1.4,0.4,0.1,5.7,8.1,5.7,2.7,1.0,0.3,0.1,2.5,3.6,2.5,1.2,0.4,0.1,0.0,0.8,1.2,0.8,0.4,0.1,0.0,0.0,0.2,0.3,0.2,0.1,0.0,0.0,0.0,0.0,0.1,0.1,0.0,0.0,0.0,0.0], [3.9,5.8,4.4,2.2,0.8,0.2,0.1,6.8,10.2,7.6,3.8,1.4,0.4,0.1,5.9,8.9,6.7,3.3,1.3,0.4,0.1,3.5,5.2,3.9,1.9,0.7,0.2,0.1,1.5,2.3,1.7,0.9,0.3,0.1,0.0,0.5,0.8,0.6,0.3,0.1,0.0,0.0,0.2,0.2,0.2,0.1,0.0,0.0,0.0], [7.2,10.0,6.8,3.1,1.1,0.3,0.1,9.1,12.5,8.6,3.9,1.3,0.4,0.1,5.7,7.8,5.3,2.5,0.8,0.2,0.1,2.4,3.2,2.2,1.0,0.4,0.1,0.0,0.7,1.0,0.7,0.3,0.1,0.0,0.0,0.2,0.3,0.2,0.1,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0], [3.0,4.9,4.0,2.2,0.9,0.3,0.1,5.7,9.2,7.5,4.0,1.6,0.5,0.1,5.3,8.6,7.0,3.8,1.5,0.5,0.1,3.3,5.4,4.4,2.4,1.0,0.3,0.1,1.6,2.5,2.1,1.1,0.5,0.1,0.0,0.6,0.9,0.8,0.4,0.2,0.1,0.0,0.2,0.3,0.2,0.1,0.1,0.0,0.0], [6.9,8.1,4.7,1.8,0.5,0.1,0.0,10.4,12.2,7.1,2.8,0.8,0.2,0.0,7.8,9.1,5.3,2.1,0.6,0.1,0.0,3.9,4.6,2.7,1.0,0.3,0.1,0.0,1.5,1.7,1.0,0.4,0.1,0.0,0.0,0.4,0.5,0.3,0.1,0.0,0.0,0.0,0.1,0.1,0.1,0.0,0.0,0.0,0.0], [3.0,4.5,3.4,1.7,0.6,0.2,0.0,6.0,9.1,6.8,3.4,1.3,0.4,0.1,6.0,9.1,6.8,3.4,1.3,0.4,0.1,4.0,6.0,4.5,2.3,0.8,0.3,0.1,2.0,3.0,2.3,1.1,0.4,0.1,0.0,0.8,1.2,0.9,0.5,0.2,0.1,0.0,0.3,0.4,0.3,0.2,0.1,0.0,0.0], [6.4,9.6,7.2,3.6,1.3,0.4,0.1,8.0,12.0,9.0,4.5,1.7,0.5,0.1,5.0,7.5,5.6,2.8,1.1,0.3,0.1,2.1,3.1,2.3,1.2,0.4,0.1,0.0,0.7,1.0,0.7,0.4,0.1,0.0,0.0,0.2,0.2,0.2,0.1,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0], 
[1.1,1.9,1.7,1.0,0.4,0.2,0.0,3.1,5.3,4.7,2.7,1.2,0.4,0.1,4.2,7.4,6.4,3.8,1.6,0.6,0.2,3.9,6.7,5.9,3.4,1.5,0.5,0.2,2.6,4.6,4.1,2.4,1.0,0.4,0.1,1.5,2.5,2.2,1.3,0.6,0.2,0.1,0.7,1.2,1.0,0.6,0.3,0.1,0.0]] Once you have the source data in this format, then you can easily convert it into a dataframe: lst = list(map(list, zip(*master_list))) df = pd.DataFrame(lst) print(df) Output: 0 1 2 3 4 5 6 7 8 9 0 7.3 4.6 6.4 3.9 7.2 3.0 6.9 3.0 6.4 1.1 1 10.3 6.6 9.1 5.8 10.0 4.9 8.1 4.5 9.6 1.9 2 7.3 4.7 6.4 4.4 6.8 4.0 4.7 3.4 7.2 1.7 3 3.4 2.3 3.0 2.2 3.1 2.2 1.8 1.7 3.6 1.0 4 1.2 0.8 1.1 0.8 1.1 0.9 0.5 0.6 1.3 0.4 5 0.3 0.2 0.3 0.2 0.3 0.3 0.1 0.2 0.4 0.2 6 0.1 0.1 0.1 0.1 0.1 0.1 0.0 0.0 0.1 0.0 7 8.8 7.6 8.5 6.8 9.1 5.7 10.4 6.0 8.0 3.1 8 12.4 10.9 12.1 10.2 12.5 9.2 12.2 9.1 12.0 5.3 9 8.8 7.8 8.6 7.6 8.6 7.5 7.1 6.8 9.0 4.7 10 4.1 3.7 4.0 3.8 3.9 4.0 2.8 3.4 4.5 2.7 11 1.5 1.3 1.4 1.4 1.3 1.6 0.8 1.3 1.7 1.2 12 0.4 0.4 0.4 0.4 0.4 0.5 0.2 0.4 0.5 0.4 13 0.1 0.1 0.1 0.1 0.1 0.1 0.0 0.1 0.1 0.1 14 5.3 6.3 5.7 5.9 5.7 5.3 7.8 6.0 5.0 4.2 15 7.5 8.9 8.1 8.9 7.8 8.6 9.1 9.1 7.5 7.4 16 5.3 6.4 5.7 6.7 5.3 7.0 5.3 6.8 5.6 6.4 17 2.5 3.0 2.7 3.3 2.5 3.8 2.1 3.4 2.8 3.8 18 0.9 1.1 1.0 1.3 0.8 1.5 0.6 1.3 1.1 1.6 19 0.2 0.3 0.3 0.4 0.2 0.5 0.1 0.4 0.3 0.6 20 0.1 0.1 0.1 0.1 0.1 0.1 0.0 0.1 0.1 0.2 21 2.1 3.4 2.5 3.5 2.4 3.3 3.9 4.0 2.1 3.9 22 3.0 4.9 3.6 5.2 3.2 5.4 4.6 6.0 3.1 6.7 23 2.1 3.5 2.5 3.9 2.2 4.4 2.7 4.5 2.3 5.9 24 1.0 1.7 1.2 1.9 1.0 2.4 1.0 2.3 1.2 3.4 25 0.4 0.6 0.4 0.7 0.4 1.0 0.3 0.8 0.4 1.5 26 0.1 0.2 0.1 0.2 0.1 0.3 0.1 0.3 0.1 0.5 27 0.0 0.0 0.0 0.1 0.0 0.1 0.0 0.1 0.0 0.2 28 0.6 1.4 0.8 1.5 0.7 1.6 1.5 2.0 0.7 2.6 29 0.9 2.0 1.2 2.3 1.0 2.5 1.7 3.0 1.0 4.6 30 0.6 1.4 0.8 1.7 0.7 2.1 1.0 2.3 0.7 4.1 31 0.3 0.7 0.4 0.9 0.3 1.1 0.4 1.1 0.4 2.4 32 0.1 0.2 0.1 0.3 0.1 0.5 0.1 0.4 0.1 1.0 33 0.0 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0 0.4 34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 35 0.2 0.5 0.2 0.5 0.2 0.6 0.4 0.8 0.2 1.5 36 0.2 0.7 0.3 0.8 0.3 0.9 0.5 1.2 0.2 2.5 37 0.2 0.5 
0.2 0.6 0.2 0.8 0.3 0.9 0.2 2.2 38 0.1 0.2 0.1 0.3 0.1 0.4 0.1 0.5 0.1 1.3 39 0.0 0.1 0.0 0.1 0.0 0.2 0.0 0.2 0.0 0.6 40 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.1 0.0 0.2 41 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 42 0.0 0.1 0.0 0.2 0.0 0.2 0.1 0.3 0.0 0.7 43 0.0 0.2 0.1 0.2 0.1 0.3 0.1 0.4 0.1 1.2 44 0.0 0.1 0.1 0.2 0.0 0.2 0.1 0.3 0.0 1.0 45 0.0 0.1 0.0 0.1 0.0 0.1 0.0 0.2 0.0 0.6 46 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.1 0.0 0.3 47 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 48 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
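Equivalently, once master_list is in that shape, pandas can do the transpose itself; pd.DataFrame(master_list).T is the same zip-based transpose in one step (a sketch with two short stand-in lists instead of the real ten):

```python
import pandas as pd

# Two short stand-in lists (the real master_list has ten lists of 49 values each).
master_list = [
    [7.3, 10.3, 7.3],
    [4.6, 6.6, 4.7],
]

# Build the frame with one list per row, then transpose so each list becomes a column.
df = pd.DataFrame(master_list, index=['col_a', 'col_b']).T

print(df)
#    col_a  col_b
# 0    7.3    4.6
# 1   10.3    6.6
# 2    7.3    4.7
```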
python, h2o.import_file() returns an empty frame
I have a problem reading files in h2o. My code:

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"")
splits = train.split_frame(ratios=[0.75], seed=1234)
dl = H2ODeepLearningEstimator(distribution="quantile", quantile_alpha=0.8)
dl.train(x=range(0,2), y="petal_len", training_frame=splits[0])
print(dl.predict(splits[1]))

UPDATE 1: The fourth line actually has this form (sorry, I copied it wrong from the IDE):

train = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")

I got:

H2OTypeError: Argument x should be a None | integer | string | list(string | integer) | set(integer | string), got range range(0, 2).

This is due to the fact that "train" is empty:

In [23]: train
Out[23]:

I thought there was a problem with reading from the link, so I downloaded the file manually:

train = h2o.import_file("iris_wheader.csv")

But I got the same result:

In [26]: train
Out[26]:

I also opened the .csv with pandas. It opened, I got a pandas dataframe, and I used train = h2o.H2OFrame(train), but again got an empty train:

In [29]: train
Out[29]:

How can I solve this problem?

UPDATE 2: When I go to 127.0.0.1:54321/flow/index.html, it shows me that the dataframe has been loaded into the cluster, but in Python I get an empty train. I use the Spyder IDE with an IPython console; can that somehow influence the result?
There is a problem with this line:

train = h2o.import_file("("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"")

You have an extra " and (; it should be:

train = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")

Then you'll see that train (and also print(train)) gives output:

In [6]: train
Out[6]:
sepal_len    sepal_wid    petal_len    petal_wid    class
-----------  -----------  -----------  -----------  -----------
5.1          3.5          1.4          0.2          Iris-setosa
4.9          3            1.4          0.2          Iris-setosa
4.7          3.2          1.3          0.2          Iris-setosa
4.6          3.1          1.5          0.2          Iris-setosa
5            3.6          1.4          0.2          Iris-setosa
5.4          3.9          1.7          0.4          Iris-setosa
4.6          3.4          1.4          0.3          Iris-setosa
5            3.4          1.5          0.2          Iris-setosa
4.4          2.9          1.4          0.2          Iris-setosa
4.9          3.1          1.5          0.1          Iris-setosa

[150 rows x 5 columns]

In [7]: train.nrow
Out[7]: 150

In [8]: print(train)
sepal_len    sepal_wid    petal_len    petal_wid    class
-----------  -----------  -----------  -----------  -----------
5.1          3.5          1.4          0.2          Iris-setosa
4.9          3            1.4          0.2          Iris-setosa
4.7          3.2          1.3          0.2          Iris-setosa
4.6          3.1          1.5          0.2          Iris-setosa
5            3.6          1.4          0.2          Iris-setosa
5.4          3.9          1.7          0.4          Iris-setosa
4.6          3.4          1.4          0.3          Iris-setosa
5            3.4          1.5          0.2          Iris-setosa
4.4          2.9          1.4          0.2          Iris-setosa
4.9          3.1          1.5          0.1          Iris-setosa

[150 rows x 5 columns]
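Separately, the H2OTypeError in the question concerns the type of x, not the frame: per the error message, x accepts list(integer) but not a range object, so converting the range to a list should satisfy the check (a sketch of the conversion only; the dl.train call itself needs a running H2O cluster):

```python
# h2o's type check (per the error message in the question) accepts
# list(integer) but rejects a bare range object.
x = range(0, 2)
print(type(x).__name__)   # 'range'

x = list(range(0, 2))     # convert before passing, e.g. dl.train(x=x, ...)
print(x)                  # [0, 1]
```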
datasets.load_iris() in Python
What does the function load_iris() do? Also, I don't understand what type of data it contains and where to find it.

iris = datasets.load_iris()
X = iris.data
target = iris.target
names = iris.target_names

Can somebody please explain in detail what this piece of code does? Thanks in advance.
load_iris is a function from sklearn. The link provides documentation: iris in your code will be a dictionary-like object. X and target will be numpy arrays, and names holds the possible target classes as text (rather than the numeric values used in target).
You can get some documentation with: # import some data to play with iris = datasets.load_iris() print('The data matrix:\n',iris['data']) print('The classification target:\n',iris['target']) print('The names of the dataset columns:\n',iris['feature_names']) print('The names of target classes:\n',iris['target_names']) print('The full description of the dataset:\n',iris['DESCR']) print('The path to the location of the data:\n',iris['filename']) This gives you: The data matrix: [[5.1 3.5 1.4 0.2] [4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2] [5.4 3.9 1.7 0.4] [4.6 3.4 1.4 0.3] [5. 3.4 1.5 0.2] [4.4 2.9 1.4 0.2] [4.9 3.1 1.5 0.1] [5.4 3.7 1.5 0.2] [4.8 3.4 1.6 0.2] [4.8 3. 1.4 0.1] [4.3 3. 1.1 0.1] [5.8 4. 1.2 0.2] [5.7 4.4 1.5 0.4] [5.4 3.9 1.3 0.4] [5.1 3.5 1.4 0.3] [5.7 3.8 1.7 0.3] [5.1 3.8 1.5 0.3] [5.4 3.4 1.7 0.2] [5.1 3.7 1.5 0.4] [4.6 3.6 1. 0.2] [5.1 3.3 1.7 0.5] [4.8 3.4 1.9 0.2] [5. 3. 1.6 0.2] [5. 3.4 1.6 0.4] [5.2 3.5 1.5 0.2] [5.2 3.4 1.4 0.2] [4.7 3.2 1.6 0.2] [4.8 3.1 1.6 0.2] [5.4 3.4 1.5 0.4] [5.2 4.1 1.5 0.1] [5.5 4.2 1.4 0.2] [4.9 3.1 1.5 0.2] [5. 3.2 1.2 0.2] [5.5 3.5 1.3 0.2] [4.9 3.6 1.4 0.1] [4.4 3. 1.3 0.2] [5.1 3.4 1.5 0.2] [5. 3.5 1.3 0.3] [4.5 2.3 1.3 0.3] [4.4 3.2 1.3 0.2] [5. 3.5 1.6 0.6] [5.1 3.8 1.9 0.4] [4.8 3. 1.4 0.3] [5.1 3.8 1.6 0.2] [4.6 3.2 1.4 0.2] [5.3 3.7 1.5 0.2] [5. 3.3 1.4 0.2] [7. 3.2 4.7 1.4] [6.4 3.2 4.5 1.5] [6.9 3.1 4.9 1.5] [5.5 2.3 4. 1.3] [6.5 2.8 4.6 1.5] [5.7 2.8 4.5 1.3] [6.3 3.3 4.7 1.6] [4.9 2.4 3.3 1. ] [6.6 2.9 4.6 1.3] [5.2 2.7 3.9 1.4] [5. 2. 3.5 1. ] [5.9 3. 4.2 1.5] [6. 2.2 4. 1. ] [6.1 2.9 4.7 1.4] [5.6 2.9 3.6 1.3] [6.7 3.1 4.4 1.4] [5.6 3. 4.5 1.5] [5.8 2.7 4.1 1. ] [6.2 2.2 4.5 1.5] [5.6 2.5 3.9 1.1] [5.9 3.2 4.8 1.8] [6.1 2.8 4. 1.3] [6.3 2.5 4.9 1.5] [6.1 2.8 4.7 1.2] [6.4 2.9 4.3 1.3] [6.6 3. 4.4 1.4] [6.8 2.8 4.8 1.4] [6.7 3. 5. 1.7] [6. 2.9 4.5 1.5] [5.7 2.6 3.5 1. ] [5.5 2.4 3.8 1.1] [5.5 2.4 3.7 1. ] [5.8 2.7 3.9 1.2] [6. 2.7 5.1 1.6] [5.4 3. 4.5 1.5] [6. 
3.4 4.5 1.6] [6.7 3.1 4.7 1.5] [6.3 2.3 4.4 1.3] [5.6 3. 4.1 1.3] [5.5 2.5 4. 1.3] [5.5 2.6 4.4 1.2] [6.1 3. 4.6 1.4] [5.8 2.6 4. 1.2] [5. 2.3 3.3 1. ] [5.6 2.7 4.2 1.3] [5.7 3. 4.2 1.2] [5.7 2.9 4.2 1.3] [6.2 2.9 4.3 1.3] [5.1 2.5 3. 1.1] [5.7 2.8 4.1 1.3] [6.3 3.3 6. 2.5] [5.8 2.7 5.1 1.9] [7.1 3. 5.9 2.1] [6.3 2.9 5.6 1.8] [6.5 3. 5.8 2.2] [7.6 3. 6.6 2.1] [4.9 2.5 4.5 1.7] [7.3 2.9 6.3 1.8] [6.7 2.5 5.8 1.8] [7.2 3.6 6.1 2.5] [6.5 3.2 5.1 2. ] [6.4 2.7 5.3 1.9] [6.8 3. 5.5 2.1] [5.7 2.5 5. 2. ] [5.8 2.8 5.1 2.4] [6.4 3.2 5.3 2.3] [6.5 3. 5.5 1.8] [7.7 3.8 6.7 2.2] [7.7 2.6 6.9 2.3] [6. 2.2 5. 1.5] [6.9 3.2 5.7 2.3] [5.6 2.8 4.9 2. ] [7.7 2.8 6.7 2. ] [6.3 2.7 4.9 1.8] [6.7 3.3 5.7 2.1] [7.2 3.2 6. 1.8] [6.2 2.8 4.8 1.8] [6.1 3. 4.9 1.8] [6.4 2.8 5.6 2.1] [7.2 3. 5.8 1.6] [7.4 2.8 6.1 1.9] [7.9 3.8 6.4 2. ] [6.4 2.8 5.6 2.2] [6.3 2.8 5.1 1.5] [6.1 2.6 5.6 1.4] [7.7 3. 6.1 2.3] [6.3 3.4 5.6 2.4] [6.4 3.1 5.5 1.8] [6. 3. 4.8 1.8] [6.9 3.1 5.4 2.1] [6.7 3.1 5.6 2.4] [6.9 3.1 5.1 2.3] [5.8 2.7 5.1 1.9] [6.8 3.2 5.9 2.3] [6.7 3.3 5.7 2.5] [6.7 3. 5.2 2.3] [6.3 2.5 5. 1.9] [6.5 3. 5.2 2. ] [6.2 3.4 5.4 2.3] [5.9 3. 5.1 1.8]] The classification target: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] The names of the dataset columns: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] The names of target classes: ['setosa' 'versicolor' 'virginica'] The full description of the dataset: .. 
_iris_dataset: Iris plants dataset Data Set Characteristics: :Number of Instances: 150 (50 in each of three classes) :Number of Attributes: 4 numeric, predictive attributes and the class :Attribute Information: - sepal length in cm - sepal width in cm - petal length in cm - petal width in cm - class: - Iris-Setosa - Iris-Versicolour - Iris-Virginica :Summary Statistics: ============== ==== ==== ======= ===== ==================== Min Max Mean SD Class Correlation ============== ==== ==== ======= ===== ==================== sepal length: 4.3 7.9 5.84 0.83 0.7826 sepal width: 2.0 4.4 3.05 0.43 -0.4194 petal length: 1.0 6.9 3.76 1.76 0.9490 (high!) petal width: 0.1 2.5 1.20 0.76 0.9565 (high!) ============== ==== ==== ======= ===== ==================== :Missing Attribute Values: None :Class Distribution: 33.3% for each of 3 classes. :Creator: R.A. Fisher :Donor: Michael Marshall (MARSHALL%PLU#io.arc.nasa.gov) :Date: July, 1988 The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken from Fisher's paper. Note that it's the same as in R, but not as in the UCI Machine Learning Repository, which has two wrong data points. This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. .. topic:: References Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950). Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218. Dasarathy, B.V. 
(1980) "Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433. See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II conceptual clustering system finds 3 classes in the data. Many, many more ... The path to the location of the data: /Applications/anaconda3/lib/python3.7/site-packages/sklearn/datasets/data/iris.csv
To follow on from the previous comments and posts above, I wanted to add another way to load iris(), besides iris = datasets.load_iris():

from sklearn.datasets import load_iris
iris = load_iris()

Then, you can do:

X = iris.data
target = iris.target
names = iris.target_names

(See the posts and comments from other people here as well.) And you can make a dataframe with:

df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target
df['species'] = df['species'].replace(to_replace=[0, 1, 2], value=['setosa', 'versicolor', 'virginica'])
Syntax explanation of [target == t,1]
I am reading the book "Building Machine Learning Systems with Python". In the classification of the Iris data, I am having trouble understanding the syntax of:

plt.scatter(features[target == t,0], features[target == t,1], marker=marker, c=c)

Specifically, what does features[target == t,0] actually mean?
Looking at this code, it seems that features and target are both arrays and t is a number. Moreover, both features and target have the same number of rows. In that case, features[target == t, 0] does the following: target == t creates a Boolean array of the same shape as target (True if the value is t, otherwise False). features[target == t, 0] selects those rows from features which correspond to True in the target == t array. The 0 specifies that the first column of features should be selected. In other words, the code selects the rows of features for which target is equal to t and from those rows, the 0 selects the first column.
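A minimal sketch of the same boolean indexing with made-up arrays (the values here are illustrative, not from the book):

```python
import numpy as np

# Toy stand-ins: 5 samples with 2 features each, and their class labels.
features = np.array([[5.1, 3.5],
                     [7.0, 3.2],
                     [4.9, 3.0],
                     [6.4, 3.2],
                     [4.7, 3.2]])
target = np.array([0, 1, 0, 1, 0])
t = 0

mask = (target == t)      # boolean array: [True, False, True, False, True]
print(features[mask, 0])  # first column of the rows where target == 0
# [5.1 4.9 4.7]
```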
Another way to see it: this selection splits the features array into 3 different arrays, each corresponding to a particular species of Iris. Each array contains the 1st feature of that species' instances. This is the output if you print features[target == t, 0] for each value of t:

[ 5.1 4.9 4.7 4.6 5. 5.4 4.6 5. 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5. 5. 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5. 5.5 4.9 4.4 5.1 5. 4.5 4.4 5. 5.1 4.8 5.1 4.6 5.3 5. ]
[ 7. 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5. 5.9 6. 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6. 5.7 5.5 5.5 5.8 6. 5.4 6. 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5. 5.6 5.7 5.7 6.2 5.1 5.7]
[ 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6. 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6. 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9]
What does the array mean in numpy/sklearn datasets? python
From the Naive Bayes tutorial in sklearn there's example on iris dataset but it looks too cryptic, can someone help to enlighten me? What does the iris.data mean? why is there 4 columns? What does the iris.target mean? why are they a flat array of 0s, 1s and 2s? from sklearn import datasets iris = datasets.load_iris() print iris.data [out]: [[ 5.1 3.5 1.4 0.2] [ 4.9 3. 1.4 0.2] [ 4.7 3.2 1.3 0.2] [ 4.6 3.1 1.5 0.2] [ 5. 3.6 1.4 0.2] [ 5.4 3.9 1.7 0.4] [ 4.6 3.4 1.4 0.3] [ 5. 3.4 1.5 0.2] [ 4.4 2.9 1.4 0.2] [ 4.9 3.1 1.5 0.1] [ 5.4 3.7 1.5 0.2] [ 4.8 3.4 1.6 0.2] [ 4.8 3. 1.4 0.1] [ 4.3 3. 1.1 0.1] [ 5.8 4. 1.2 0.2] [ 5.7 4.4 1.5 0.4] [ 5.4 3.9 1.3 0.4] [ 5.1 3.5 1.4 0.3] [ 5.7 3.8 1.7 0.3] [ 5.1 3.8 1.5 0.3] [ 5.4 3.4 1.7 0.2] [ 5.1 3.7 1.5 0.4] [ 4.6 3.6 1. 0.2] [ 5.1 3.3 1.7 0.5] [ 4.8 3.4 1.9 0.2] [ 5. 3. 1.6 0.2] [ 5. 3.4 1.6 0.4] [ 5.2 3.5 1.5 0.2] [ 5.2 3.4 1.4 0.2] [ 4.7 3.2 1.6 0.2] [ 4.8 3.1 1.6 0.2] [ 5.4 3.4 1.5 0.4] [ 5.2 4.1 1.5 0.1] [ 5.5 4.2 1.4 0.2] [ 4.9 3.1 1.5 0.1] [ 5. 3.2 1.2 0.2] [ 5.5 3.5 1.3 0.2] [ 4.9 3.1 1.5 0.1] [ 4.4 3. 1.3 0.2] [ 5.1 3.4 1.5 0.2] [ 5. 3.5 1.3 0.3] [ 4.5 2.3 1.3 0.3] [ 4.4 3.2 1.3 0.2] [ 5. 3.5 1.6 0.6] [ 5.1 3.8 1.9 0.4] [ 4.8 3. 1.4 0.3] [ 5.1 3.8 1.6 0.2] [ 4.6 3.2 1.4 0.2] [ 5.3 3.7 1.5 0.2] [ 5. 3.3 1.4 0.2] [ 7. 3.2 4.7 1.4] [ 6.4 3.2 4.5 1.5] [ 6.9 3.1 4.9 1.5] [ 5.5 2.3 4. 1.3] [ 6.5 2.8 4.6 1.5] [ 5.7 2.8 4.5 1.3] [ 6.3 3.3 4.7 1.6] [ 4.9 2.4 3.3 1. ] [ 6.6 2.9 4.6 1.3] [ 5.2 2.7 3.9 1.4] [ 5. 2. 3.5 1. ] [ 5.9 3. 4.2 1.5] [ 6. 2.2 4. 1. ] [ 6.1 2.9 4.7 1.4] [ 5.6 2.9 3.6 1.3] [ 6.7 3.1 4.4 1.4] [ 5.6 3. 4.5 1.5] [ 5.8 2.7 4.1 1. ] [ 6.2 2.2 4.5 1.5] [ 5.6 2.5 3.9 1.1] [ 5.9 3.2 4.8 1.8] [ 6.1 2.8 4. 1.3] [ 6.3 2.5 4.9 1.5] [ 6.1 2.8 4.7 1.2] [ 6.4 2.9 4.3 1.3] [ 6.6 3. 4.4 1.4] [ 6.8 2.8 4.8 1.4] [ 6.7 3. 5. 1.7] [ 6. 2.9 4.5 1.5] [ 5.7 2.6 3.5 1. ] [ 5.5 2.4 3.8 1.1] [ 5.5 2.4 3.7 1. ] [ 5.8 2.7 3.9 1.2] [ 6. 2.7 5.1 1.6] [ 5.4 3. 4.5 1.5] [ 6. 3.4 4.5 1.6] [ 6.7 3.1 4.7 1.5] [ 6.3 2.3 4.4 1.3] [ 5.6 3. 
4.1 1.3] [ 5.5 2.5 4. 1.3] [ 5.5 2.6 4.4 1.2] [ 6.1 3. 4.6 1.4] [ 5.8 2.6 4. 1.2] [ 5. 2.3 3.3 1. ] [ 5.6 2.7 4.2 1.3] [ 5.7 3. 4.2 1.2] [ 5.7 2.9 4.2 1.3] [ 6.2 2.9 4.3 1.3] [ 5.1 2.5 3. 1.1] [ 5.7 2.8 4.1 1.3] [ 6.3 3.3 6. 2.5] [ 5.8 2.7 5.1 1.9] [ 7.1 3. 5.9 2.1] [ 6.3 2.9 5.6 1.8] [ 6.5 3. 5.8 2.2] [ 7.6 3. 6.6 2.1] [ 4.9 2.5 4.5 1.7] [ 7.3 2.9 6.3 1.8] [ 6.7 2.5 5.8 1.8] [ 7.2 3.6 6.1 2.5] [ 6.5 3.2 5.1 2. ] [ 6.4 2.7 5.3 1.9] [ 6.8 3. 5.5 2.1] [ 5.7 2.5 5. 2. ] [ 5.8 2.8 5.1 2.4] [ 6.4 3.2 5.3 2.3] [ 6.5 3. 5.5 1.8] [ 7.7 3.8 6.7 2.2] [ 7.7 2.6 6.9 2.3] [ 6. 2.2 5. 1.5] [ 6.9 3.2 5.7 2.3] [ 5.6 2.8 4.9 2. ] [ 7.7 2.8 6.7 2. ] [ 6.3 2.7 4.9 1.8] [ 6.7 3.3 5.7 2.1] [ 7.2 3.2 6. 1.8] [ 6.2 2.8 4.8 1.8] [ 6.1 3. 4.9 1.8] [ 6.4 2.8 5.6 2.1] [ 7.2 3. 5.8 1.6] [ 7.4 2.8 6.1 1.9] [ 7.9 3.8 6.4 2. ] [ 6.4 2.8 5.6 2.2] [ 6.3 2.8 5.1 1.5] [ 6.1 2.6 5.6 1.4] [ 7.7 3. 6.1 2.3] [ 6.3 3.4 5.6 2.4] [ 6.4 3.1 5.5 1.8] [ 6. 3. 4.8 1.8] [ 6.9 3.1 5.4 2.1] [ 6.7 3.1 5.6 2.4] [ 6.9 3.1 5.1 2.3] [ 5.8 2.7 5.1 1.9] [ 6.8 3.2 5.9 2.3] [ 6.7 3.3 5.7 2.5] [ 6.7 3. 5.2 2.3] [ 6.3 2.5 5. 1.9] [ 6.5 3. 5.2 2. ] [ 6.2 3.4 5.4 2.3] [ 5.9 3. 5.1 1.8]] From the iris.target, it returns another array of 0s, 1s and 2s. What do they mean? from sklearn import datasets iris = datasets.load_iris() print iris.target [out]: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Iris is the well-known Fisher's Iris data set. He measured the length and width of the sepal and petal (two parts of the flower) of three species of Iris. Each row contains the measurements from one flower, and there are measurements for 50 flowers of each type, hence the dimensions of iris.data (150 rows by 4 columns). The actual type of the flower is coded as 0, 1, or 2 in iris.target; you can recover the actual species names (as strings) from iris.target_names. Fisher showed that his then-new discriminant method could separate the three species based on their sepal and petal measurements, and it's been a standard classification data set ever since.

tl;dr: sample data. One example per row with four attributes; 150 examples total. Class labels are stored separately and are coded as integers.

Docs here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
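To see the coding concretely (a sketch with a short made-up label array instead of the full 150-element iris.target; indexing iris.target_names with iris.target works the same way):

```python
import numpy as np

# Stand-ins for iris.target_names and a few entries of iris.target.
target_names = np.array(['setosa', 'versicolor', 'virginica'])
target = np.array([0, 0, 1, 2, 1])

# Fancy indexing decodes the integer labels back into species names.
print(target_names[target])
# ['setosa' 'setosa' 'versicolor' 'virginica' 'versicolor']
```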