Error while reading Boston data from UCI website using pandas

Can anyone help me read this file from a URL?
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
data = pandas.read_csv(url, sep=',', header=None)
I tried sep=',', sep=';' and sep='\t', but the data was not read correctly.
With
data = pandas.read_csv(url, sep=' ', header=None)
I received an error:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 30 fields in line 2, saw 31
A similar question may have been asked before, but the accepted answer there does not help me.
Any help reading this file from the URL provided would be appreciated.
BTW, I know there is boston = load_boston() to read this data, but when I load it with this function, the 'MEDV' attribute is not included in the dataset.

Multiple spaces are used as the delimiter, which is why it does not work when you use a single space as the delimiter (sep=' ').
You can do it using sep=r'\s+', a regular expression that matches one or more whitespace characters:
In [171]: data = pd.read_csv(url, sep=r'\s+', header=None)
In [172]: data.shape
Out[172]: (506, 14)
In [173]: data.head()
Out[173]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
or using delim_whitespace=True (deprecated since pandas 2.2 in favor of sep=r'\s+'):
In [174]: data = pd.read_csv(url, delim_whitespace=True, header = None)
In [175]: data.shape
Out[175]: (506, 14)
In [176]: data.head()
Out[176]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
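As a rough, offline sketch of the same idea, the snippet below parses a few whitespace-aligned rows from an in-memory string instead of the URL and attaches the usual Boston housing column names. The sample text, the COLUMNS list and the helper name read_whitespace_table are illustrative assumptions, not part of the original answer.

```python
import io

import pandas as pd

# Assumed sample: three rows laid out with a variable number of spaces,
# similar to how housing.data is formatted.
SAMPLE = """\
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60
 0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80 392.83   4.03  34.70
"""

# Conventional names of the 14 Boston housing attributes; MEDV is the target.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def read_whitespace_table(source, names):
    """Read a headerless table whose fields are separated by runs of whitespace."""
    return pd.read_csv(source, sep=r"\s+", header=None, names=names)


df = read_whitespace_table(io.StringIO(SAMPLE), COLUMNS)
print(df.shape)
print(df[["CRIM", "RM", "MEDV"]])
```

Passing the URL from the question in place of io.StringIO(SAMPLE) should behave the same way, provided the file is still hosted there.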

Related

Dropping NaNs from selected data in pandas

Continuing from my previous question (details are explained there), I have now obtained an array. I don't yet know how to use this array, but that is a separate question. The point of this question is that there are NaN values in the 63 x 2 selection that I created, and I want the rows with NaN values deleted so that I can use the data (once I ask another question on how to graph and export it as x, y arrays).
Here's what I have. This code works.
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = [df.iloc[:, [0, 1]]]
The sample of the .csv file is located in the link.
I tried inputting
data1.dropna()
but it didn't work.
I want the NaN values/rows to drop so that I'm left with a 28 x 2 array. (I am using the first column with actual values as an example).
Thank you.

Try
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = df.iloc[:, [0, 1]]
cleaned_data = data1.dropna()
You were probably getting an exception like AttributeError: 'list' object has no attribute 'dropna'. That's because your data1 was not a pandas DataFrame but a list, and inside that list was a DataFrame.
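The difference can be seen in a small sketch. The DataFrame below is a made-up stand-in for the CSV file (its column names and values are assumptions), so that the snippet runs without the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV contents.
df = pd.DataFrame({
    "time": [0.0, 0.5, 1.0, 1.5],
    "trial 1": [23.2, 23.2, np.nan, np.nan],
    "trial 2": [23.1, np.nan, 23.1, np.nan],
})

# Extra square brackets wrap the selection in a plain Python list.
wrapped = [df.iloc[:, [0, 1]]]
print(type(wrapped).__name__, hasattr(wrapped, "dropna"))

# Without the extra brackets the result is a DataFrame, which has dropna().
selected = df.iloc[:, [0, 1]]
cleaned = selected.dropna()
print(cleaned)
```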

Although the question has already been answered, I would like to add a few thoughts.
Importing your DataFrame, using the example dataset from your earlier post:
>>> import pandas as pd
>>> df = pd.read_csv("so.csv")
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
22 11.0 30.0 29.7 29.6 ... 39.3 NaN 43.8 44.3
23 11.5 30.0 29.8 29.7 ... 40.2 NaN 43.8 44.3
24 12.0 30.0 29.8 29.7 ... 40.9 NaN 43.9 44.3
25 12.5 30.1 29.8 29.7 ... 41.4 NaN 43.9 44.3
26 13.0 30.1 29.8 29.8 ... 41.8 NaN 43.9 44.4
27 13.5 30.1 29.9 29.8 ... 42.0 NaN 43.9 44.4
28 14.0 30.1 29.9 29.8 ... 42.1 NaN NaN 44.4
29 14.5 NaN 29.9 29.8 ... 42.3 NaN NaN 44.4
30 15.0 NaN 29.9 NaN ... 42.4 NaN NaN NaN
31 15.5 NaN NaN NaN ... 42.4 NaN NaN NaN
However, it is good practice to clean the data first and then process it as desired, so dropping the NA values during the import itself is useful.
>>> df = pd.read_csv("so.csv").dropna()  # drop rows containing NA right here
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
and lastly select the columns you want (this wraps the selection in a list, as in your original code):
>>> df = [df.iloc[:, [0, 1]]]
# new_df = [df.iloc[:, [0, 1]]]  # if you don't want to overwrite the original DataFrame
>>> df
[ time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0]
Better Solution:
Looking at the end result, I see you are only interested in two columns, 'time' and '1mnaoh trial 1'. The ideal approach is therefore the usecols option, which reduces the memory footprint because only the columns you need are loaded, followed by dropna(). Because NaN values in the other columns no longer affect the result, more rows are kept (up to index 28 instead of 21), which I believe is what you wanted.
>>> df = pd.read_csv("so.csv", usecols=['time', '1mnaoh trial 1']).dropna()
>>> df
time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0
22 11.0 30.0
23 11.5 30.0
24 12.0 30.0
25 12.5 30.1
26 13.0 30.1
27 13.5 30.1
28 14.0 30.1
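A minimal sketch of why the row counts differ, using a tiny in-memory CSV as an assumed stand-in for so.csv. It also shows dropna(subset=...), an alternative not mentioned above that keeps all columns while only checking the chosen ones for NaN.

```python
import io

import pandas as pd

# Assumed miniature version of the file: NaNs appear in different columns.
CSV_TEXT = """time,trial 1,trial 2
0.0,23.2,23.1
0.5,23.2,
1.0,23.3,23.2
1.5,,23.5
"""

# Dropping on the full frame removes a row if ANY column is NaN.
all_cols = pd.read_csv(io.StringIO(CSV_TEXT)).dropna()

# Loading only the needed columns means NaNs elsewhere cannot remove rows.
two_cols = pd.read_csv(io.StringIO(CSV_TEXT), usecols=["time", "trial 1"]).dropna()

# Alternative: keep every column but only test 'trial 1' for NaN.
subset = pd.read_csv(io.StringIO(CSV_TEXT)).dropna(subset=["trial 1"])

print(len(all_cols), len(two_cols), subset.shape)
```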

invalid literal for float():

I'm new to Python, so maybe there is something really basic here that I'm missing, but I can't figure it out. For my work I'm trying to read a txt file and apply KNN to it.
The file content is as follows. It has three columns, with the third one as the class; the separator is a space.
0.85 17.45 2
0.75 15.6 2
3.3 15.45 2
5.25 14.2 2
4.9 15.65 2
5.35 15.85 2
5.1 17.9 2
4.6 18.25 2
4.05 18.75 2
3.4 19.7 2
2.9 21.15 2
3.1 21.85 2
3.9 21.85 2
4.4 20.05 2
7.2 14.5 2
7.65 16.5 2
7.1 18.65 2
7.05 19.9 2
5.85 20.55 2
5.5 21.8 2
6.55 21.8 2
6.05 22.3 2
5.2 23.4 2
4.55 23.9 2
5.1 24.4 2
8.1 26.35 2
10.15 27.7 2
9.75 25.5 2
9.2 21.1 2
11.2 22.8 2
12.6 23.1 2
13.25 23.5 2
11.65 26.85 2
12.45 27.55 2
13.3 27.85 2
13.7 27.75 2
14.15 26.9 2
14.05 26.55 2
15.15 24.2 2
15.2 24.75 2
12.2 20.9 2
12.15 21.45 2
12.75 22.05 2
13.15 21.85 2
13.75 22 2
13.95 22.7 2
14.4 22.65 2
14.2 22.15 2
14.1 21.75 2
14.05 21.4 2
17.2 24.8 2
17.7 24.85 2
17.55 25.2 2
17 26.85 2
16.55 27.1 2
19.15 25.35 2
18.8 24.7 2
21.4 25.85 2
15.8 21.35 2
16.6 21.15 2
17.45 20.75 2
18 20.95 2
18.25 20.2 2
18 22.3 2
18.6 22.25 2
19.2 21.95 2
19.45 22.1 2
20.1 21.6 2
20.1 20.9 2
19.9 20.35 2
19.45 19.05 2
19.25 18.7 2
21.3 22.3 2
22.9 23.65 2
23.15 24.1 2
24.25 22.85 2
22.05 20.25 2
20.95 18.25 2
21.65 17.25 2
21.55 16.7 2
21.6 16.3 2
21.5 15.5 2
22.4 16.5 2
22.25 18.1 2
23.15 19.05 2
23.5 19.8 2
23.75 20.2 2
25.15 19.8 2
25.5 19.45 2
23 18 2
23.95 17.75 2
25.9 17.55 2
27.65 15.65 2
23.1 14.6 2
23.5 15.2 2
24.05 14.9 2
24.5 14.7 2
14.15 17.35 1
14.3 16.8 1
14.3 15.75 1
14.75 15.1 1
15.35 15.5 1
15.95 16.45 1
16.5 17.05 1
17.35 17.05 1
17.15 16.3 1
16.65 16.1 1
16.5 15.15 1
16.25 14.95 1
16 14.25 1
15.9 13.2 1
15.15 12.05 1
15.2 11.7 1
17 15.65 1
16.9 15.35 1
17.35 15.45 1
17.15 15.1 1
17.3 14.9 1
17.7 15 1
17 14.6 1
16.85 14.3 1
16.6 14.05 1
17.1 14 1
17.45 14.15 1
17.8 14.2 1
17.6 13.85 1
17.2 13.5 1
17.25 13.15 1
17.1 12.75 1
16.95 12.35 1
16.5 12.2 1
16.25 12.5 1
16.05 11.9 1
16.65 10.9 1
16.7 11.4 1
16.95 11.25 1
17.3 11.2 1
18.05 11.9 1
18.6 12.5 1
18.9 12.05 1
18.7 11.25 1
17.95 10.9 1
18.4 10.05 1
17.45 10.4 1
17.6 10.15 1
17.7 9.85 1
17.3 9.7 1
16.95 9.7 1
16.75 9.65 1
19.8 9.95 1
19.1 9.55 1
17.5 8.3 1
17.55 8.1 1
17.85 7.55 1
18.2 8.35 1
19.3 9.1 1
19.4 8.85 1
19.05 8.85 1
18.9 8.5 1
18.6 7.85 1
18.7 7.65 1
19.35 8.2 1
19.95 8.3 1
20 8.9 1
20.3 8.9 1
20.55 8.8 1
18.35 6.95 1
18.65 6.9 1
19.3 7 1
19.1 6.85 1
19.15 6.65 1
21.2 8.8 1
21.4 8.8 1
21.1 8 1
20.4 7 1
20.5 6.35 1
20.1 6.05 1
20.45 5.15 1
20.95 5.55 1
20.95 6.2 1
20.9 6.6 1
21.05 7 1
21.85 8.5 1
21.9 8.2 1
22.3 7.7 1
21.85 6.65 1
21.3 5.05 1
22.6 6.7 1
22.5 6.15 1
23.65 7.2 1
24.1 7 1
21.95 4.8 1
22.15 5.05 1
22.45 5.3 1
22.45 4.9 1
22.7 5.5 1
23 5.6 1
23.2 5.3 1
23.45 5.95 1
23.75 5.95 1
24.45 6.15 1
24.6 6.45 1
25.2 6.55 1
26.05 6.4 1
25.3 5.75 1
24.35 5.35 1
23.3 4.9 1
22.95 4.75 1
22.4 4.55 1
22.8 4.1 1
22.9 4 1
23.25 3.85 1
23.45 3.6 1
23.55 4.2 1
23.8 3.65 1
23.8 4.75 1
24.2 4 1
24.55 4 1
24.7 3.85 1
24.7 4.3 1
24.9 4.75 1
26.4 5.7 1
27.15 5.95 1
27.3 5.45 1
27.5 5.45 1
27.55 5.1 1
26.85 4.95 1
26.6 4.9 1
26.85 4.4 1
26.2 4.4 1
26 4.25 1
25.15 4.1 1
25.6 3.9 1
25.85 3.6 1
24.95 3.35 1
25.1 3.25 1
25.45 3.15 1
26.85 2.95 1
27.15 3.15 1
27.2 3 1
27.95 3.25 1
27.95 3.5 1
28.8 4.05 1
28.8 4.7 1
28.75 5.45 1
28.6 5.75 1
29.25 6.3 1
30 6.55 1
30.6 3.4 1
30.05 3.45 1
29.75 3.45 1
29.2 4 1
29.45 4.05 1
29.05 4.55 1
29.4 4.85 1
29.5 4.7 1
29.9 4.45 1
30.75 4.45 1
30.4 4.05 1
30.8 3.95 1
31.05 3.95 1
30.9 5.2 1
30.65 5.85 1
30.7 6.15 1
31.5 6.25 1
31.65 6.55 1
32 7 1
32.5 7.95 1
33.35 7.45 1
32.6 6.95 1
32.65 6.6 1
32.55 6.35 1
32.35 6.1 1
32.55 5.8 1
32.2 5.05 1
32.35 4.25 1
32.9 4.15 1
32.7 4.6 1
32.75 4.85 1
34.1 4.6 1
34.1 5 1
33.6 5.25 1
33.35 5.65 1
33.75 5.95 1
33.4 6.2 1
34.45 5.8 1
34.65 5.65 1
34.65 6.25 1
35.25 6.25 1
34.35 6.8 1
34.1 7.15 1
34.45 7.3 1
34.7 7.2 1
34.85 7 1
34.35 7.75 1
34.55 7.85 1
35.05 8 1
35.5 8.05 1
35.8 7.1 1
36.6 6.7 1
36.75 7.25 1
36.5 7.4 1
35.95 7.9 1
36.1 8.1 1
36.15 8.4 1
37.6 7.35 1
37.9 7.65 1
29.15 4.4 1
34.9 9 1
35.3 9.4 1
35.9 9.35 1
36 9.65 1
35.75 10 1
36.7 9.15 1
36.6 9.8 1
36.9 9.75 1
37.25 10.15 1
36.4 10.15 1
36.3 10.7 1
36.75 10.85 1
38.15 9.7 1
38.4 9.45 1
38.35 10.5 1
37.7 10.8 1
37.45 11.15 1
37.35 11.4 1
37 11.75 1
36.8 12.2 1
37.15 12.55 1
37.25 12.15 1
37.65 11.95 1
37.95 11.85 1
38.6 11.75 1
38.5 12.2 1
38 12.95 1
37.3 13 1
37.5 13.4 1
37.85 14.5 1
38.3 14.6 1
38.05 14.45 1
38.35 14.35 1
38.5 14.25 1
39.3 14.2 1
39 13.2 1
38.95 12.9 1
39.2 12.35 1
39.5 11.8 1
39.55 12.3 1
39.75 12.75 1
40.2 12.8 1
40.4 12.05 1
40.45 12.5 1
40.55 13.15 1
40.45 14.5 1
40.2 14.8 1
40.65 14.9 1
40.6 15.25 1
41.3 15.3 1
40.95 15.7 1
41.25 16.8 1
40.95 17.05 1
40.7 16.45 1
40.45 16.3 1
39.9 16.2 1
39.65 16.2 1
39.25 15.5 1
38.85 15.5 1
38.3 16.5 1
38.75 16.85 1
39 16.6 1
38.25 17.35 1
39.5 16.95 1
39.9 17.05 1
My Code:
import csv
import random
import math
import operator
def loadDataset(filename, split, trainingSet=[] , testSet=[]):
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(3):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
def main():
    # prepare data
    trainingSet=[]
    testSet=[]
    split = 0.67
    loadDataset('Jain.txt', split, trainingSet, testSet)
    print 'Train set: ' + repr(len(trainingSet))
    print 'Test set: ' + repr(len(testSet))
    # generate predictions
    predictions=[]
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
main()

Here:
lines = csv.reader(csvfile)
You have to tell csv.reader which separator to use, otherwise it falls back to the default Excel-style ',' separator. Note that in the example you posted, the separator might actually NOT be "a space" but either a tab ("\t" in Python) or a variable number of spaces, in which case it is not a CSV-like format and you will have to parse the lines yourself.
Also, your code is far from pythonic. First things first: Python's 'for' loops are really "for each" loops, i.e. they directly yield values from the object you iterate over. The proper way to iterate over a list is:
lst = ["a", "b", "c"]
for item in lst:
    print(item)
so no need for range() and indexed access here. Note that if you want to have the index too, you can use enumerate(sequence), which will yield (index, item) pairs, i.e.:
lst = ["a", "b", "c"]
for index, item in enumerate(lst):
    print("item at {} is {}".format(index, item))
So your loadDataset() function could be rewritten as:
def loadDataset(filename, split, trainingSet=None , testSet=None):
    # fix the mutable default argument gotcha
    # cf https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
    if trainingSet is None:
        trainingSet = []
    if testSet is None:
        testSet = []
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter="\t")
        for row in reader:
            row = tuple(float(x) for x in row)
            if random.random() < split:
                trainingSet.append(row)
            else:
                testSet.append(row)
    # so the caller can get the values back
    return trainingSet, testSet
Note that if any value in your file is not a proper representation of a float, you'll still get a ValueError in row = tuple(float(x) for x in row). The solution here is to catch the error and handle it one way or another: either by re-raising it with additional debugging info (which value is wrong and which line of the file it belongs to), or by logging the error and ignoring the row, or however else makes sense in the context of your app or library:
for row in reader:
    try:
        row = tuple(float(x) for x in row)
    except ValueError as e:
        # here we choose to just log the error
        # and ignore the row, but you may want
        # to do otherwise, your choice...
        print("wrong value in line {}: {}".format(reader.line_num, row))
        continue
    if random.random() < split:
        trainingSet.append(row)
    else:
        testSet.append(row)
Also, if you want to iterate over two lists in parallel (get 'list1[x], list2[x]' pairs), you can use zip():
lst1 = ["a", "b", "c"]
lst2 = ["x", "y", "z"]
for pair in zip(lst1, lst2):
    print(pair)
and there are functions to sum() values from an iterable, i.e.:
lst = [1, 2, 3]
print(sum(lst))
so your euclideanDistance function can be rewritten as:
def euclideanDistance(instance1, instance2, length):
    pairs = zip(instance1[:length], instance2[:length])
    # pow() needs the exponent as its second argument
    return math.sqrt(sum(pow(x - y, 2) for x, y in pairs))
etc etc...
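For the "parse the lines yourself" case, here is one possible Python 3 sketch that puts the pieces together. It assumes fields are separated by arbitrary runs of spaces or tabs; load_dataset, euclidean_distance and the SAMPLE text are illustrative names and data, not the asker's file or the code above.

```python
import io
import math
import random


def load_dataset(lines, split, rng=None):
    """Split whitespace-separated numeric rows into training and test lists."""
    if rng is None:
        rng = random.Random()
    training, test = [], []
    for line_num, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # ignore blank lines
        try:
            row = tuple(float(value) for value in fields)
        except ValueError:
            # Log and skip rows that contain a non-numeric value.
            print("skipping line {}: {!r}".format(line_num, line))
            continue
        if rng.random() < split:
            training.append(row)
        else:
            test.append(row)
    return training, test


def euclidean_distance(a, b, length):
    """Distance over the first `length` coordinates of two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:length], b[:length])))


# Assumed sample mixing single spaces, tabs, repeated spaces and one bad line.
SAMPLE = "0.85 17.45 2\n0.75\t15.6 2\n3.3   15.45 2\nnot a number\n14.15 17.35 1\n"

training, test = load_dataset(io.StringIO(SAMPLE), split=0.67, rng=random.Random(0))
print(len(training) + len(test))
print(euclidean_distance((0.0, 0.0, 2.0), (3.0, 4.0, 1.0), 2))
```

With a real file, an open file object (opened in text mode) can be passed in place of the io.StringIO object.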

pandas apply list of function to data frame

Let's take the Boston data set available in sklearn:
from sklearn.datasets import load_boston
boston = load_boston()
X = pd.DataFrame(boston["data"])
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395.60 12.43
I have built a machine learning model (RF) and have obtained all estimators in the model.
estimators = model.estimators_
You can think of this as a list of functions that take row-level data and return a value.
>> estimators = model.estimators_
>> estimators
[DecisionTreeRegressor(criterion='mse', max_depth=60, max_features=8,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=5,
min_samples_split=12, min_weight_fraction_leaf=0.0,
presort=False, random_state=1838148368, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=60, max_features=8,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=5,
min_samples_split=12, min_weight_fraction_leaf=0.0,
presort=False, random_state=1754873550, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=60, max_features=8,
max_leaf_nodes=None, min_impurity_decrease=0.0,....]
I want each estimator/function in the list to be applied to every row of the data frame.
If I don't convert the data to a data frame, boston['data'] returns a 2D array, and I can use two for loops to accomplish the above. Assuming X is a 2D array, I can do the following:
for x in range(len(X)):
    vals = []
    for estimator in model.estimators_:
        vals.append(estimator.predict(X[x])[0])
I don't want to use the 2D array option because I want to keep the index information of the DataFrame for future operations.

In the latest version of pandas, df.agg should be able to do exactly this.
Unfortunately it appears to be broken for the current version when axis=1: https://github.com/pandas-dev/pandas/issues/16679
Here's a hacky way around it:
X.T.agg(estimators).T
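Another way to keep the index, sketched here under assumptions rather than taken from the answer above: call each predictor once on the whole array and rebuild a DataFrame with the original index. The two lambdas are hypothetical stand-ins for fitted estimators, and predict_per_estimator is an illustrative helper name.

```python
import pandas as pd

# Small frame with a non-default index so index preservation is visible.
X = pd.DataFrame(
    [[0.00632, 18.0, 2.31],
     [0.02731, 0.0, 7.07],
     [0.02729, 0.0, 7.07]],
    index=["a", "b", "c"],
)

# Hypothetical stand-ins: each takes a 2D array and returns one value per row,
# which is also the shape of a fitted estimator's predict method.
predictors = [
    lambda values: values.sum(axis=1),
    lambda values: values.max(axis=1),
]


def predict_per_estimator(frame, predict_fns):
    """Return one column per predictor, aligned to the frame's index."""
    values = frame.to_numpy()
    columns = {i: fn(values) for i, fn in enumerate(predict_fns)}
    return pd.DataFrame(columns, index=frame.index)


per_estimator = predict_per_estimator(X, predictors)
print(per_estimator)
```

With a fitted random forest, the list could presumably be built as [est.predict for est in model.estimators_], which avoids the row-by-row loop entirely.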

Reindexing data frame Pandas

I am trying to split a data set for training and testing using Pandas.
testing = pd.read_csv("housingdata.csv", header=None)
train = testing.sample(frac=0.6)
train.reindex()
test = testing.loc[~testing.index.isin(train.index)]
print train
print test
when I print the data, I get
0 1 2 3 4
9 0.17004 12.5 7.87 0 0.524
1 0.02731 0.0 7.07 0 0.469
5 0.02985 0.0 2.18 0 0.458
3 0.03237 0.0 2.18 0 0.458
7 0.14455 12.5 7.87 0 0.524
6 0.08829 12.5 7.87 0 0.524
0 1 2 3 4
0 0.00632 18.0 2.31 0 0.538
2 0.02729 0.0 7.07 0 0.469
4 0.06905 0.0 2.18 0 0.458
8 0.21124 12.5 7.87 0 0.524
As you can see, the row indices are shuffled. How can I re-index the rows in both data sets?
This however does not change global settings. Eg.,
train.iloc[0,4]
gives 0.524

As @EdChum's comments point out, it's not exactly clear what behavior you're looking for. But if all you want is to give both new dataframes indices going 0, 1, 2, ... then you can use reset_index():
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
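A self-contained sketch of the whole split, assuming a small made-up frame in place of housingdata.csv and a fixed random_state so the result is repeatable. It uses the non-inplace form of reset_index, which returns new frames.

```python
import pandas as pd

# Assumed stand-in for the first two columns of the housing data.
data = pd.DataFrame({
    "crim": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905,
             0.02985, 0.08829, 0.14455, 0.21124, 0.17004],
    "zn": [18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5],
})

# Sample 60% of the rows for training; the rest become the test set.
train = data.sample(frac=0.6, random_state=0)
test = data.loc[~data.index.isin(train.index)]

# drop=True discards the old shuffled labels instead of keeping them as a column.
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

print(train)
print(test)
```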

pandas dataframe plotting 1 column over 2

This is driving me nuts: I can't plot column 'b'; only column 'A' is plotted.
This is my code. I have no idea what I'm doing wrong, probably something silly.
The dataframe seems OK. The weird thing is that I can access both df['A'] and df['b'], but only df['A'].plot() works. If I call df['b'].plot() I get this error:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\IPython\core\interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    df['b'].plot()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2511, in plot_series
    **kwds)
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2317, in _plot
    plot_obj.generate()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 921, in generate
    self._compute_plot_data()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 997, in _compute_plot_data
    'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'Series': no numeric data to plot
import sqlalchemy
import pandas as pd
import matplotlib.pyplot as plt
engine = sqlalchemy.create_engine(
    'sqlite:///C:/Users/toto/PycharmProjects/my_db.sqlite')
tables = engine.table_names()
dic = {}
for t in tables:
    sql = 'SELECT t."weight" FROM "' + t + '" t WHERE t."udl"="IBE SM"'
    dic[t] = (pd.read_sql(sql, engine)['weight'][0], pd.read_sql(sql, engine)['weight'][1])
df = pd.DataFrame.from_dict(dic, orient='index').sort_index()
df = df.set_index(pd.DatetimeIndex(df.index))
df.columns = ['A', 'b']
print(df)
print(df.info())
df.plot()
plt.show()
These are the two printed outputs:
A b
2014-08-05 1.81 3.39
2014-08-06 1.81 3.39
2014-08-07 1.81 3.39
2014-08-08 1.80 3.37
2014-08-11 1.79 3.35
2014-08-13 1.80 3.36
2014-08-14 1.80 3.35
2014-08-18 1.80 3.35
2014-08-19 1.79 3.34
2014-08-20 1.80 3.35
2014-08-27 1.79 3.35
2014-08-28 1.80 3.35
2014-08-29 1.79 3.35
2014-09-01 1.79 3.35
2014-09-02 1.79 3.35
2014-09-03 1.79 3.36
2014-09-04 1.79 3.37
2014-09-05 1.80 3.38
2014-09-08 1.79 3.36
2014-09-09 1.79 3.35
2014-09-10 1.78 3.35
2014-09-11 1.78 3.34
2014-09-12 1.78 3.34
2014-09-15 1.78 3.35
2014-09-16 1.78 3.35
2014-09-17 1.78 3.35
2014-09-18 1.78 3.34
2014-09-19 1.79 3.35
2014-09-22 1.79 3.36
2014-09-23 1.80 3.37
... ... ...
2014-12-10 1.73 3.29
2014-12-11 1.74 3.27
2014-12-12 1.74 3.25
2014-12-15 1.74 3.24
2014-12-16 1.74 3.27
2014-12-17 1.75 3.28
2014-12-18 1.76 3.29
2014-12-19 1.04 1.39
2014-12-22 1.04 1.39
2014-12-23 1.04 1.4
2014-12-24 1.04 1.39
2014-12-29 1.04 1.39
2014-12-30 1.04 1.4
2015-01-02 1.04 1.4
2015-01-05 1.04 1.4
2015-01-06 1.04 1.4
2015-01-07 NaN 1.39
2015-01-08 NaN 1.39
2015-01-09 NaN 1.39
2015-01-12 NaN 1.38
2015-01-13 NaN 1.38
2015-01-14 NaN 1.38
2015-01-15 NaN 1.38
2015-01-16 NaN 1.38
2015-01-19 NaN 1.39
2015-01-20 NaN 1.38
2015-01-21 NaN 1.39
2015-01-22 NaN 1.4
2015-01-23 NaN 1,4
2015-01-26 NaN 1.41
[107 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 107 entries, 2014-08-05 00:00:00 to 2015-01-26 00:00:00
Data columns (total 2 columns):
A 93 non-null float64
b 107 non-null object
dtypes: float64(1), object(1)
memory usage: 2.1+ KB
None
Process finished with exit code 0

Just got it: 'b' is of object type and not float64 because of this line, where the value uses a comma instead of a decimal point:
2015-01-23 NaN 1,4
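One way to repair such a column, sketched on a small made-up frame (the values mimic the last rows of the output above and are stored as strings here, which is an assumption about how the data was loaded): replace the comma with a dot and convert to numbers.

```python
import pandas as pd

# Assumed sample: one value uses a comma as decimal separator,
# so the whole column ends up with dtype object.
df = pd.DataFrame(
    {"A": [None, None, None], "b": ["1.4", "1,4", "1.41"]},
    index=pd.to_datetime(["2015-01-22", "2015-01-23", "2015-01-26"]),
)
print(df["b"].dtype)

# Normalise the separator, then convert the strings to floats.
fixed = pd.to_numeric(df["b"].astype(str).str.replace(",", ".", regex=False))
df["b"] = fixed
print(df["b"].dtype)
```

After the conversion the column is numeric, so df['b'].plot() should no longer complain about missing numeric data.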
