I am getting errors doing an NPV calculation with numpy_financial.npv(rate, values) on a DataFrame.
Am I able to use DataFrames for NPV calculations, or do I have to manually loop through each row? I am not sure how to fix this.
npvValues = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5', 'RFR']
round(df[npvValues].sample(5),1).sort_index(ascending=False)
DATE value_1 value_2 value_3 value_4 value_5 RFR
2017-04-03 38.5 92.8 168.7 257.0 354.0 2.1
2016-01-11 35.7 86.1 156.6 238.7 328.6 2.3
2013-07-29 28.1 67.8 123.3 187.8 258.6 2.3
2011-05-02 24.2 58.3 106.1 161.6 222.5 3.4
2010-01-18 24.4 58.8 107.0 163.0 224.5 3.8
NPV Calculation
value = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5']
df['NPV_IV'] = npf.npv(rate=df['RFR']/100, values=df[value])
df['NPV_IV']
Here is the full error trace: [screenshot]
The values parameter of npv only accepts a one-dimensional array, so you need to loop through the rows:
import pandas as pd
import numpy_financial as npf

df = pd.read_csv('t.csv')
value = ['value_1', 'value_2', 'value_3', 'value_4', 'value_5']
num = df[value].to_numpy()
rate_lists = df['RFR'].tolist()
new_col = []
for index, i in enumerate(num):
    # one npv call per row: scalar rate, 1-D array of cash flows
    n = npf.npv(rate=rate_lists[index] / 100, values=i)
    new_col.append(n)
df['NPV_IV'] = new_col
print(df)
DATE value_1 value_2 value_3 value_4 value_5 RFR NPV_IV
0 2017-04-03 38.5 92.8 168.7 257.0 354.0 2.1 858.450837
1 2016-01-11 35.7 86.1 156.6 238.7 328.6 2.3 792.491237
2 2013-07-29 28.1 67.8 123.3 187.8 258.6 2.3 623.725804
3 2011-05-02 24.2 58.3 106.1 161.6 222.5 3.4 520.644433
4 2010-01-18 24.4 58.8 107.0 163.0 224.5 3.8 519.488982
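If you'd rather not manage the loop bookkeeping yourself, the same per-row call can be written as a list comprehension over the rate/values pairs; a sketch equivalent to the loop above:

df['NPV_IV'] = [npf.npv(rate=r / 100, values=v)
                for r, v in zip(df['RFR'], df[value].to_numpy())]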
I am working on a DataFrame:
0          1           2          3              4               5              6    7
new_width  new_height  new_depth  audited_Width  audited_Height  audited_Depth  inf  val
---------  ----------  ---------  -------------  --------------  -------------  ---  ---
35.00      2.00        21.00      21.00          2.50            35.00          T
12.00      4.40        10.60      11.60          4.40            12.00          T
20.50      17.00       5.50       21.50          17.05           20.50          F
24.33      22.00       18.11      24.00          22.05           24.33          T
23.00      23.00       19.00      19.00          23.00           23.00          F
Here I want to find the difference between columns (0, 3), (1, 4), and (2, 5), and verify whether the difference (any one, or all three) falls in the range (0, 1). If yes, the code should check the corresponding cell in column 6, and if it is 'T', print 'YES' in the corresponding cell of column 7.
I have the following code:
a = df['new_width'] - df['audited_Width']
for i in a:
    if (i in range(0, 1)) == True:
        df['Value'] = 'Yes'
print(df['Value'])
I know that 4th line is incorrect. What alternatives can I use to get the desired output?
You shouldn't iterate over the rows of a DataFrame. Instead, you can create a mask to select the rows that meet your condition, and then use that to fill the "val" column:
mask = ((df["new_width"] - df["audited_Width"]).between(0, 1)
        | (df["new_height"] - df["audited_Height"]).between(0, 1)
        | (df["new_depth"] - df["audited_Depth"]).between(0, 1)) \
       & (df["inf"] == "T")
df["val"] = df["val"].where(~mask, "YES")
This outputs:
new_width new_height new_depth audited_Width audited_Height audited_Depth inf val
0 35.00 2.0 21.00 21.0 2.50 35.00 T NaN
1 12.00 4.4 10.60 11.6 4.40 12.00 T YES
2 20.50 17.0 5.50 21.5 17.05 20.50 F NaN
3 24.33 22.0 18.11 24.0 22.05 24.33 T YES
4 23.00 23.0 19.00 19.0 23.00 23.00 F NaN
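As a side note, the same assignment can be spelled with .loc; with this mask it gives the same result, since rows outside the mask keep their current value (or NaN if "val" did not exist yet):

df.loc[mask, "val"] = "YES"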
Custom for loops are rarely the best option when it comes to pandas.
Here is a method that reshapes your DataFrame into an arguably better shape, performs a simple check on the new shape, and then extracts the indices that should be modified in the original DataFrame.
# normalise column case so 'audited_Width' and 'new_width' share suffixes
df.columns = df.columns.str.lower()
# reshape to long format: one row per (original row, dimension) pair
df2 = pd.wide_to_long(df.reset_index(), ['new', 'audited'], ['index'], 'values', '_', r'\w+')
mask = df2[df2.new.sub(df2.audited).between(0, 1) & df2.inf.eq('T')]
idx = mask.reset_index('index')['index'].unique()
df.loc[idx, 'val'] = 'YES'
print(df)
Output:
new_width new_height new_depth audited_width audited_height audited_depth inf val
0 35.00 2.0 21.00 21.0 2.50 35.00 T NaN
1 12.00 4.4 10.60 11.6 4.40 12.00 T YES
2 20.50 17.0 5.50 21.5 17.05 20.50 F NaN
3 24.33 22.0 18.11 24.0 22.05 24.33 T YES
4 23.00 23.0 19.00 19.0 23.00 23.00 F NaN
I want to calculate a positive streak for the numbers in each row, in reverse fashion.
I tried using cumsum(), but that isn't helping me.
The DataFrame looks as follows, with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So, basically, score_5 should be greater than score_4, score_4 greater than score_3, and so on, to get the streak count. The streak is counted backwards from score_5 and ends at the first column that breaks the increasing pattern.
One way using diff with cummin:
# take the score columns and reverse them so score_5 comes first
df2 = df.filter(like="score_").loc[:, ::-1]
# diff(-1) compares each score with the previous one, gt(0) marks increases,
# cummin(1) cuts the streak at the first failure, sum(1) counts the run
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
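To see why this works, here is the intermediate result for the Malaysia row (a sketch, using the values from the table above):

# reversed scores (score_5 .. score_1):  31.7, 28.6, -6.2,  NaN, 12.8
# .diff(-1, axis=1):                      3.1, 34.8,  NaN,  NaN,  NaN
# .gt(0):                                True, True, False, False, False
# .cummin(1):                            True, True, False, False, False
# .sum(1):                                  2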
I'm new to Python, so maybe there is something really basic here that I'm missing, but I can't figure it out. For my work I'm trying to read a txt file and apply KNN to it.
The file content is as follows: it has three columns, the third one being the class, and the separator is a space.
0.85 17.45 2
0.75 15.6 2
3.3 15.45 2
5.25 14.2 2
4.9 15.65 2
5.35 15.85 2
5.1 17.9 2
4.6 18.25 2
4.05 18.75 2
3.4 19.7 2
2.9 21.15 2
3.1 21.85 2
3.9 21.85 2
4.4 20.05 2
7.2 14.5 2
7.65 16.5 2
7.1 18.65 2
7.05 19.9 2
5.85 20.55 2
5.5 21.8 2
6.55 21.8 2
6.05 22.3 2
5.2 23.4 2
4.55 23.9 2
5.1 24.4 2
8.1 26.35 2
10.15 27.7 2
9.75 25.5 2
9.2 21.1 2
11.2 22.8 2
12.6 23.1 2
13.25 23.5 2
11.65 26.85 2
12.45 27.55 2
13.3 27.85 2
13.7 27.75 2
14.15 26.9 2
14.05 26.55 2
15.15 24.2 2
15.2 24.75 2
12.2 20.9 2
12.15 21.45 2
12.75 22.05 2
13.15 21.85 2
13.75 22 2
13.95 22.7 2
14.4 22.65 2
14.2 22.15 2
14.1 21.75 2
14.05 21.4 2
17.2 24.8 2
17.7 24.85 2
17.55 25.2 2
17 26.85 2
16.55 27.1 2
19.15 25.35 2
18.8 24.7 2
21.4 25.85 2
15.8 21.35 2
16.6 21.15 2
17.45 20.75 2
18 20.95 2
18.25 20.2 2
18 22.3 2
18.6 22.25 2
19.2 21.95 2
19.45 22.1 2
20.1 21.6 2
20.1 20.9 2
19.9 20.35 2
19.45 19.05 2
19.25 18.7 2
21.3 22.3 2
22.9 23.65 2
23.15 24.1 2
24.25 22.85 2
22.05 20.25 2
20.95 18.25 2
21.65 17.25 2
21.55 16.7 2
21.6 16.3 2
21.5 15.5 2
22.4 16.5 2
22.25 18.1 2
23.15 19.05 2
23.5 19.8 2
23.75 20.2 2
25.15 19.8 2
25.5 19.45 2
23 18 2
23.95 17.75 2
25.9 17.55 2
27.65 15.65 2
23.1 14.6 2
23.5 15.2 2
24.05 14.9 2
24.5 14.7 2
14.15 17.35 1
14.3 16.8 1
14.3 15.75 1
14.75 15.1 1
15.35 15.5 1
15.95 16.45 1
16.5 17.05 1
17.35 17.05 1
17.15 16.3 1
16.65 16.1 1
16.5 15.15 1
16.25 14.95 1
16 14.25 1
15.9 13.2 1
15.15 12.05 1
15.2 11.7 1
17 15.65 1
16.9 15.35 1
17.35 15.45 1
17.15 15.1 1
17.3 14.9 1
17.7 15 1
17 14.6 1
16.85 14.3 1
16.6 14.05 1
17.1 14 1
17.45 14.15 1
17.8 14.2 1
17.6 13.85 1
17.2 13.5 1
17.25 13.15 1
17.1 12.75 1
16.95 12.35 1
16.5 12.2 1
16.25 12.5 1
16.05 11.9 1
16.65 10.9 1
16.7 11.4 1
16.95 11.25 1
17.3 11.2 1
18.05 11.9 1
18.6 12.5 1
18.9 12.05 1
18.7 11.25 1
17.95 10.9 1
18.4 10.05 1
17.45 10.4 1
17.6 10.15 1
17.7 9.85 1
17.3 9.7 1
16.95 9.7 1
16.75 9.65 1
19.8 9.95 1
19.1 9.55 1
17.5 8.3 1
17.55 8.1 1
17.85 7.55 1
18.2 8.35 1
19.3 9.1 1
19.4 8.85 1
19.05 8.85 1
18.9 8.5 1
18.6 7.85 1
18.7 7.65 1
19.35 8.2 1
19.95 8.3 1
20 8.9 1
20.3 8.9 1
20.55 8.8 1
18.35 6.95 1
18.65 6.9 1
19.3 7 1
19.1 6.85 1
19.15 6.65 1
21.2 8.8 1
21.4 8.8 1
21.1 8 1
20.4 7 1
20.5 6.35 1
20.1 6.05 1
20.45 5.15 1
20.95 5.55 1
20.95 6.2 1
20.9 6.6 1
21.05 7 1
21.85 8.5 1
21.9 8.2 1
22.3 7.7 1
21.85 6.65 1
21.3 5.05 1
22.6 6.7 1
22.5 6.15 1
23.65 7.2 1
24.1 7 1
21.95 4.8 1
22.15 5.05 1
22.45 5.3 1
22.45 4.9 1
22.7 5.5 1
23 5.6 1
23.2 5.3 1
23.45 5.95 1
23.75 5.95 1
24.45 6.15 1
24.6 6.45 1
25.2 6.55 1
26.05 6.4 1
25.3 5.75 1
24.35 5.35 1
23.3 4.9 1
22.95 4.75 1
22.4 4.55 1
22.8 4.1 1
22.9 4 1
23.25 3.85 1
23.45 3.6 1
23.55 4.2 1
23.8 3.65 1
23.8 4.75 1
24.2 4 1
24.55 4 1
24.7 3.85 1
24.7 4.3 1
24.9 4.75 1
26.4 5.7 1
27.15 5.95 1
27.3 5.45 1
27.5 5.45 1
27.55 5.1 1
26.85 4.95 1
26.6 4.9 1
26.85 4.4 1
26.2 4.4 1
26 4.25 1
25.15 4.1 1
25.6 3.9 1
25.85 3.6 1
24.95 3.35 1
25.1 3.25 1
25.45 3.15 1
26.85 2.95 1
27.15 3.15 1
27.2 3 1
27.95 3.25 1
27.95 3.5 1
28.8 4.05 1
28.8 4.7 1
28.75 5.45 1
28.6 5.75 1
29.25 6.3 1
30 6.55 1
30.6 3.4 1
30.05 3.45 1
29.75 3.45 1
29.2 4 1
29.45 4.05 1
29.05 4.55 1
29.4 4.85 1
29.5 4.7 1
29.9 4.45 1
30.75 4.45 1
30.4 4.05 1
30.8 3.95 1
31.05 3.95 1
30.9 5.2 1
30.65 5.85 1
30.7 6.15 1
31.5 6.25 1
31.65 6.55 1
32 7 1
32.5 7.95 1
33.35 7.45 1
32.6 6.95 1
32.65 6.6 1
32.55 6.35 1
32.35 6.1 1
32.55 5.8 1
32.2 5.05 1
32.35 4.25 1
32.9 4.15 1
32.7 4.6 1
32.75 4.85 1
34.1 4.6 1
34.1 5 1
33.6 5.25 1
33.35 5.65 1
33.75 5.95 1
33.4 6.2 1
34.45 5.8 1
34.65 5.65 1
34.65 6.25 1
35.25 6.25 1
34.35 6.8 1
34.1 7.15 1
34.45 7.3 1
34.7 7.2 1
34.85 7 1
34.35 7.75 1
34.55 7.85 1
35.05 8 1
35.5 8.05 1
35.8 7.1 1
36.6 6.7 1
36.75 7.25 1
36.5 7.4 1
35.95 7.9 1
36.1 8.1 1
36.15 8.4 1
37.6 7.35 1
37.9 7.65 1
29.15 4.4 1
34.9 9 1
35.3 9.4 1
35.9 9.35 1
36 9.65 1
35.75 10 1
36.7 9.15 1
36.6 9.8 1
36.9 9.75 1
37.25 10.15 1
36.4 10.15 1
36.3 10.7 1
36.75 10.85 1
38.15 9.7 1
38.4 9.45 1
38.35 10.5 1
37.7 10.8 1
37.45 11.15 1
37.35 11.4 1
37 11.75 1
36.8 12.2 1
37.15 12.55 1
37.25 12.15 1
37.65 11.95 1
37.95 11.85 1
38.6 11.75 1
38.5 12.2 1
38 12.95 1
37.3 13 1
37.5 13.4 1
37.85 14.5 1
38.3 14.6 1
38.05 14.45 1
38.35 14.35 1
38.5 14.25 1
39.3 14.2 1
39 13.2 1
38.95 12.9 1
39.2 12.35 1
39.5 11.8 1
39.55 12.3 1
39.75 12.75 1
40.2 12.8 1
40.4 12.05 1
40.45 12.5 1
40.55 13.15 1
40.45 14.5 1
40.2 14.8 1
40.65 14.9 1
40.6 15.25 1
41.3 15.3 1
40.95 15.7 1
41.25 16.8 1
40.95 17.05 1
40.7 16.45 1
40.45 16.3 1
39.9 16.2 1
39.65 16.2 1
39.25 15.5 1
38.85 15.5 1
38.3 16.5 1
38.75 16.85 1
39 16.6 1
38.25 17.35 1
39.5 16.95 1
39.9 17.05 1
My Code:
import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(3):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('Jain.txt', split, trainingSet, testSet)
    print 'Train set: ' + repr(len(trainingSet))
    print 'Test set: ' + repr(len(testSet))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
Here:
lines = csv.reader(csvfile)
You have to tell csv.reader what separator to use, otherwise it will use the default Excel ',' separator. Note that in the example you posted the separator might actually NOT be "a space", but either a tab ("\t" in Python) or a variable number of spaces, in which case it's not a csv-like format and you'll have to parse the lines yourself.
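If it does turn out to be runs of plain spaces, hand-parsing is short, because str.split() with no argument splits on any run of whitespace. A minimal sketch (reusing the 'Jain.txt' filename from your code):

def load_rows(filename):
    # str.split() with no argument splits on any run of whitespace
    with open(filename) as f:
        return [tuple(float(x) for x in line.split()) for line in f if line.strip()]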
Also, your code is far from Pythonic. First things first: Python's for loops are really "for each" loops, i.e. they directly yield values from the object you iterate over. The proper way to iterate over a list is:
lst = ["a", "b", "c"]
for item in lst:
    print(item)
so there is no need for range() and indexed access here. Note that if you want the index too, you can use enumerate(sequence), which yields (index, item) pairs:
lst = ["a", "b", "c"]
for index, item in enumerate(lst):
    print("item at {} is {}".format(index, item))
So your loadDataset() function could be rewritten as:
def loadDataset(filename, split, trainingSet=None, testSet=None):
    # fix the mutable default argument gotcha
    # cf https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
    if trainingSet is None:
        trainingSet = []
    if testSet is None:
        testSet = []
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter="\t")
        for row in reader:
            row = tuple(float(x) for x in row)
            if random.random() < split:
                trainingSet.append(row)
            else:
                testSet.append(row)
    # so the caller can get the values back
    return trainingSet, testSet
Note that if any value in your file is not a proper representation of a float, you'll still get a ValueError in row = tuple(float(x) for x in row). The solution here is to catch the error and handle it one way or another: either by reraising it with additional debugging info (which value is wrong and which line of the file it belongs to), or by logging the error and ignoring the row, or however it makes sense in the context of your app / lib:
for row in reader:
    try:
        row = tuple(float(x) for x in row)
    except ValueError as e:
        # here we choose to just log the error
        # and ignore the row, but you may want
        # to do otherwise, your choice...
        print("wrong value in line {}: {}".format(reader.line_num, row))
        continue
    if random.random() < split:
        trainingSet.append(row)
    else:
        testSet.append(row)
Also, if you want to iterate over two lists in parallel (get 'list1[x], list2[x]' pairs), you can use zip():
lst1 = ["a", "b", "c"]
lst2 = ["x", "y", "z"]
for pair in zip(lst1, lst2):
    print(pair)
and there is a builtin sum() to add up the values of an iterable:
lst = [1, 2, 3]
print(sum(lst))
so your euclideanDistance function can be rewritten as:
def euclideanDistance(instance1, instance2, length):
    pairs = zip(instance1[:length], instance2[:length])
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))
etc etc...
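In the same spirit, your getAccuracy() can be rewritten with zip() and sum(); a sketch of the same logic:

def getAccuracy(testSet, predictions):
    # count the rows whose actual class (last field) matches the prediction
    correct = sum(1 for row, predicted in zip(testSet, predictions)
                  if row[-1] == predicted)
    return correct / float(len(testSet)) * 100.0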
I have several .csv files which I am importing via pandas, and then I work out a summary of the data (min, max, mean), ideally as weekly and monthly reports. I have the following code, but I just cannot seem to get the monthly summary to work; I am sure the problem is with the timestamp conversion.
What am I doing wrong?
import pandas as pd
import numpy as np
#Format of the data that is been imported
#2017-05-11 18:29:14+00:00,264.0,987.99,26.5,23.70,512.0,11.763,52.31
df = pd.read_csv('data.csv')
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
print 'month info'
print [g for n, g in df.groupby(pd.Grouper(key='timestamp',freq='M'))]
print(data.groupby('timestamp')['light'].mean())
IIUC, you almost have it, and your datetime conversion is fine. Here is an example:
Starting from a dataframe like this (which is your example row, duplicated with slight modifications):
>>> df
time x y z a b c d
0 2017-05-11 18:29:14+00:00 264.0 947.99 24.5 53.7 511.0 11.463 12.31
1 2017-05-15 18:29:14+00:00 265.0 957.99 25.5 43.7 512.0 11.563 22.31
2 2017-05-21 18:29:14+00:00 266.0 967.99 26.5 33.7 513.0 11.663 32.31
3 2017-06-11 18:29:14+00:00 267.0 977.99 26.5 23.7 514.0 11.763 42.31
4 2017-06-22 18:29:14+00:00 268.0 997.99 27.5 13.7 515.0 11.800 52.31
You can do what you did before with your datetime:
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
And then get your summaries either separately:
monthly_mean = df.groupby(pd.Grouper(key='timestamp',freq='M')).mean()
monthly_max = df.groupby(pd.Grouper(key='timestamp',freq='M')).max()
monthly_min = df.groupby(pd.Grouper(key='timestamp',freq='M')).min()
weekly_mean = df.groupby(pd.Grouper(key='timestamp',freq='W')).mean()
weekly_min = df.groupby(pd.Grouper(key='timestamp',freq='W')).min()
weekly_max = df.groupby(pd.Grouper(key='timestamp',freq='W')).max()
# Examples:
>>> monthly_mean
x y z a b c d
timestamp
2017-05-31 265.0 957.99 25.5 43.7 512.0 11.5630 22.31
2017-06-30 267.5 987.99 27.0 18.7 514.5 11.7815 47.31
>>> weekly_mean
x y z a b c d
timestamp
2017-05-14 264.0 947.99 24.5 53.7 511.0 11.463 12.31
2017-05-21 265.5 962.99 26.0 38.7 512.5 11.613 27.31
2017-05-28 NaN NaN NaN NaN NaN NaN NaN
2017-06-04 NaN NaN NaN NaN NaN NaN NaN
2017-06-11 267.0 977.99 26.5 23.7 514.0 11.763 42.31
2017-06-18 NaN NaN NaN NaN NaN NaN NaN
2017-06-25 268.0 997.99 27.5 13.7 515.0 11.800 52.31
Or aggregate them all together to get a multi-indexed dataframe with your summaries:
monthly_summary = df.groupby(pd.Grouper(key='timestamp',freq='M')).agg(['mean', 'min', 'max'])
weekly_summary = df.groupby(pd.Grouper(key='timestamp',freq='W')).agg(['mean', 'min', 'max'])
# Example of summary of row 'x':
>>> monthly_summary['x']
mean min max
timestamp
2017-05-31 265.0 264.0 266.0
2017-06-30 267.5 267.0 268.0
>>> weekly_summary['x']
mean min max
timestamp
2017-05-14 264.0 264.0 264.0
2017-05-21 265.5 265.0 266.0
2017-05-28 NaN NaN NaN
2017-06-04 NaN NaN NaN
2017-06-11 267.0 267.0 267.0
2017-06-18 NaN NaN NaN
2017-06-25 268.0 268.0 268.0
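As an aside, if you set the timestamp as the index, resample gives an equivalent (and arguably more readable) spelling of the same grouping; a sketch using the frame above:

monthly_summary = df.set_index('timestamp').resample('M').agg(['mean', 'min', 'max'])
weekly_summary = df.set_index('timestamp').resample('W').agg(['mean', 'min', 'max'])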
Any help, please, with reading this file from the URL below.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
data = pandas.read_csv(url, sep=',', header = None)
I tried sep=',', sep=';' and sep='\t', but the data does not parse into separate columns. And with
data = pandas.read_csv(url, sep=' ', header = None)
I received an error,
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7988)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 30 fields in line 2, saw 31
Maybe the same question was asked before, but the accepted answer does not help me.
Any help reading this file from the provided URL would be appreciated.
BTW, I know there is Boston = load_boston() to read this data, but when I read it with that function, the attribute 'MEDV' does not come with the dataset.
Multiple consecutive spaces are used as the delimiter; that's why it doesn't work when you use a single space as the delimiter (sep=' ').
You can do it with the regex separator sep='\s+':
In [171]: data = pd.read_csv(url, sep='\s+', header = None)
In [172]: data.shape
Out[172]: (506, 14)
In [173]: data.head()
Out[173]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
or using delim_whitespace=True:
In [174]: data = pd.read_csv(url, delim_whitespace=True, header = None)
In [175]: data.shape
Out[175]: (506, 14)
In [176]: data.head()
Out[176]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
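Since header=None leaves you with integer column labels, you may also want to attach the attribute names afterwards. These are the standard Boston housing attribute names as documented in the accompanying housing.names file (the last one, MEDV, being the one you were missing):

data.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']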