Why is numpy.polyfit off by a large margin?

I'm trying to use np.polyfit to fit a fairly simple dataset, but the plotted fit is off by a fairly large margin.
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
fit = np.polyfit(xvals, yvals, 1)
f = np.poly1d(fit)
plt.scatter(xvals, yvals, color="blue", label="input")
plt.scatter(xvals, f(yvals), color="red", label="fit")
plt.legend()
What am I doing wrong? How can I improve the fit?
The original data:
xvals = array([ 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14,
15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 27, 28, 29,
30, 31, 32, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44,
45, 47, 48, 49, 50, 51, 52, 54, 55, 56, 57, 58, 60,
61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 74, 75,
76, 77, 78, 80, 81, 82, 83, 84, 85, 87, 88, 89, 90,
91, 92, 94, 95, 96, 97, 98, 100])
yvals = array([ 0, 3, 5, 8, 10, 12, 15, 17, 19, 21, 23, 25, 27,
28, 30, 32, 33, 35, 36, 37, 39, 40, 41, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 54, 55, 56, 57,
58, 58, 59, 60, 61, 61, 62, 63, 63, 64, 65, 66, 66,
67, 67, 68, 69, 70, 70, 71, 72, 73, 73, 74, 75, 76,
77, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89,
90, 91, 92, 94, 95, 97, 98, 100])

You need f(xvals), not f(yvals): the fitted polynomial has to be evaluated at the x values. You can also do much better for this data with a higher-order polynomial, e.g.:
import numpy as np
import matplotlib.pyplot as plt
xvals = np.array([ 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14,
15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 27, 28, 29,
30, 31, 32, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44,
45, 47, 48, 49, 50, 51, 52, 54, 55, 56, 57, 58, 60,
61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 74, 75,
76, 77, 78, 80, 81, 82, 83, 84, 85, 87, 88, 89, 90,
91, 92, 94, 95, 96, 97, 98, 100])
yvals = np.array([ 0, 3, 5, 8, 10, 12, 15, 17, 19, 21, 23, 25, 27,
28, 30, 32, 33, 35, 36, 37, 39, 40, 41, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 54, 55, 56, 57,
58, 58, 59, 60, 61, 61, 62, 63, 63, 64, 65, 66, 66,
67, 67, 68, 69, 70, 70, 71, 72, 73, 73, 74, 75, 76,
77, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89,
90, 91, 92, 94, 95, 97, 98, 100])
fit = np.polyfit(xvals, yvals, 3)
f = np.poly1d(fit)
#print f
fig, ax = plt.subplots(1,1,figsize=(6,4),dpi=400)
ax.scatter(xvals, yvals, color="blue", label="input")
ax.scatter(xvals, f(xvals), color="red", label="fit")
ax.legend()
plt.show()
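To put a number on how much the higher degree helps, one option is to compare the root-mean-square residual of the two fits. The sketch below uses synthetic concave data (an assumption standing in for the question's arrays), and the helper name rms_residual is made up for this illustration.

```python
import numpy as np

# Synthetic, concave data standing in for the question's curve (an assumption).
x = np.linspace(0, 100, 50)
y = 10 * np.sqrt(x)

def rms_residual(degree):
    """Fit a polynomial of the given degree and return its RMS error at x."""
    model = np.poly1d(np.polyfit(x, y, degree))
    # Evaluate the model at x (not y), then compare against the data.
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

rms_linear = rms_residual(1)
rms_cubic = rms_residual(3)
print("degree 1: %.3f, degree 3: %.3f" % (rms_linear, rms_cubic))
```

A least-squares cubic can never have a larger residual than the least-squares line on the same points, so the second number is the smaller one; how much smaller depends on how curved the data is.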

Related

Adding each value in an RDD to its partition number

Quite new to PySpark so this might be simple. I have an RDD that ranges from 0 to 99 and has 4 partitions.
A = sc.parallelize(range(100), 4)
And I have to find a way to return another RDD where each value in the RDD is added to its partition number. The ideal example would be:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102]
Would like to know how I could amend the following code to get the desired results.
A = sc.parallelize(range(100), 4)
B =
print(B.collect())
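One possible approach is RDD.mapPartitionsWithIndex, which passes each partition's index together with an iterator over its elements. The sketch below defines such a function and exercises it on a local stand-in for the four partitions, so it runs without a SparkContext; the name add_partition_index and the local chunking are assumptions made for illustration.

```python
def add_partition_index(index, iterator):
    """Yield each element plus the index of the partition it lives in."""
    for value in iterator:
        yield value + index

# With a live SparkContext `sc` (assumed), the call would be:
#   A = sc.parallelize(range(100), 4)
#   B = A.mapPartitionsWithIndex(add_partition_index)

# Local stand-in: split range(100) into 4 equal, contiguous chunks,
# which matches the layout implied by the expected output in the question.
data = list(range(100))
num_partitions = 4
size = len(data) // num_partitions
partitions = [data[i * size:(i + 1) * size] for i in range(num_partitions)]

result = [v for i, part in enumerate(partitions)
          for v in add_partition_index(i, part)]
print(result[:3], result[-3:])
```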

Keep remaining numbers in range 100 except numbers in the array

Array
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
How to print the remaining numbers in the range 0-100, except those numbers in a?

You can use sets and subtract a from the range of numbers 0 - 100:
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
print(set(range(101)) - set(a))
Prints:
{1, 2, 4, 6, 7, 9, 10, 13, 16, 17, 19, 23, 24, 25, 27, 29, 30, 31, 32, 34, 35, 36, 37, 39, 40, 43, 44, 45, 46, 47, 48, 49, 50, 53, 54, 55, 56, 57, 58, 59, 60, 63, 66, 68, 70, 71, 72, 74, 75, 77, 78, 80, 81, 87, 88, 89, 90, 91, 92, 95, 98, 99, 100}
If order is crucial, you can filter the range by removing items in a -- still using set(a) to make it efficient.
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
s_a = set(a)
filtered = [n for n in range(101) if n not in s_a]
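Sets carry no ordering guarantee, so if a sorted result is wanted from the set-difference version, wrapping it in sorted() is one way to get it. A small sketch with a shortened sample tuple (an assumption, to keep the output readable):

```python
a = (0, 3, 5, 8, 11, 12)  # shortened sample of the tuple above
s_a = set(a)

# Order-preserving filter over the range.
filtered = [n for n in range(15) if n not in s_a]

# Set difference, then an explicit sort to guarantee ascending order.
from_set = sorted(set(range(15)) - s_a)

print(filtered)
```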

formatting dictionary printing output

I have a dictionary called d that stores several lists. If I print the dictionary I get this difficult-to-read output:
{'Patch(0,8)': [28, 56, 75], 'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
'Patch(4,0)': [2, 5, 6, 8, 19, 22, 23, 27, 31, 34, 35, 36, 41, 45, 51, 52, 53, 55, 56, 59, 60, 61, 62, 64, 66, 67, 68, 70, 73, 75, 76, 77, 79, 85, 87, 91, 94, 96],
'Patch(4,6)': [19, 23, 45, 56, 60, 75, 91], 'Patch(0,0)': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], 'Patch(8,0)': [2, 22, 23, 27, 34, 52
, 55, 60, 85], 'Patch(0,2)': [0, 1, 2, 3, 4, 6, 7, 10, 11, 13, 15, 16, 17, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 34, 36, 37, 38, 40, 43, 44, 45, 46, 47,
49, 50, 51, 52, 53, 54, 56, 58, 59, 60, 61, 62, 63, 64, 66, 70, 71, 74, 75, 76, 77, 78, 80, 81, 83, 85, 90, 91, 92, 93, 94, 96, 98, 99], 'Patch(2,8)': [28, 56, 75], 'Patch(4,8)': [56, 75]}
I just want to print each Patch and its corresponding data on a new line:
{'Patch(0,8)': [28, 56, 75],
'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
I tried pprint, as suggested in another answer:
pprint.pprint(d, width=1)
I get this :
{'Patch(0,8)': [28,
56,
75], and so on
What am I missing here ?

Just pass a width that is big enough to hold each key and its list on one line:
>>> pprint.pprint(d, width=1000)
{'Patch(0,0)': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
'Patch(0,2)': [0, 1, 2, 3, 4, 6, 7, 10, 11, 13, 15, 16, 17, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 34, 36, 37, 38, 40, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 56, 58, 59, 60, 61, 62, 63, 64, 66, 70, 71, 74, 75, 76, 77, 78, 80, 81, 83, 85, 90, 91, 92, 93, 94, 96, 98, 99],
'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
'Patch(0,8)': [28, 56, 75],
'Patch(2,8)': [28, 56, 75],
'Patch(4,0)': [2, 5, 6, 8, 19, 22, 23, 27, 31, 34, 35, 36, 41, 45, 51, 52, 53, 55, 56, 59, 60, 61, 62, 64, 66, 67, 68, 70, 73, 75, 76, 77, 79, 85, 87, 91, 94, 96],
'Patch(4,6)': [19, 23, 45, 56, 60, 75, 91],
'Patch(4,8)': [56, 75],
'Patch(8,0)': [2, 22, 23, 27, 34, 52, 55, 60, 85]}

I usually print dicts as JSON to give them structure and formatting I can easily read.
import json
print(json.dumps(dict(a=1, b=2), indent=2))

You can turn this into a simple loop that prints one entry per line. Have a look at dict.items in the official docs (dict.iteritems on Python 2).
for key, value in d.items():
    print(key + " - " + str(value))
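One caveat with the width approach: pprint only breaks a dict across lines when its whole representation does not fit in the given width, so a small dict still comes out on a single line. A plain loop does not depend on size. The sketch below uses a made-up three-entry dict to show both behaviours.

```python
import pprint

# Small sample dict (an assumption, much shorter than the one in the question).
d = {"Patch(0,8)": [28, 56, 75], "Patch(4,8)": [56, 75], "Patch(0,6)": [1, 11, 17]}

# The whole dict fits in width=1000, so pprint keeps it on one line.
one_line = pprint.pformat(d, width=1000)

# A loop over the sorted keys gives one entry per line regardless of size.
lines = ["{!r}: {!r}".format(key, d[key]) for key in sorted(d)]
print("\n".join(lines))
```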

PySpark RDD filtered-out elements coming back

I am trying to implement a Sieve of Eratosthenes using PySpark.
For this, I am trying to apply many filters to my RDD, but on each iteration, whatever was filtered out during the previous iterations keeps coming back, and I wonder why.
Here's the code:
from math import ceil
from math import sqrt
min_number = 2
max_number = 101
rdd = sc.parallelize(range(min_number, max_number), 4)
pivot = min_number
max_pivot = ceil(sqrt(max_number))
while pivot <= max_pivot:
    print("RDD for pivot = " + str(pivot) + ":")
    rdd = rdd.filter(lambda x: x <= pivot or x % pivot != 0)
    pivot = rdd.filter(lambda x: x > pivot).reduce(min)
    rdd.collect()
And the output:
Pivot = 2
[2, 3, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, 29, 31, 32, 34, 35, 37, 38, 40, 41, 43, 44, 46, 47, 49, 50, 52, 53, 55, 56, 58, 59, 61, 62, 64, 65, 67, 68, 70, 71, 73, 74, 76, 77, 79, 80, 82, 83, 85, 86, 88, 89, 91, 92, 94, 95, 97, 98, 100]
Pivot = 3
[2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 15, 17, 18, 19, 21, 22, 23, 25, 26, 27, 29, 30, 31, 33, 34, 35, 37, 38, 39, 41, 42, 43, 45, 46, 47, 49, 50, 51, 53, 54, 55, 57, 58, 59, 61, 62, 63, 65, 66, 67, 69, 70, 71, 73, 74, 75, 77, 78, 79, 81, 82, 83, 85, 86, 87, 89, 90, 91, 93, 94, 95, 97, 98, 99]
Pivot = 4
[2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 21, 22, 23, 24, 26, 27, 28, 29, 31, 32, 33, 34, 36, 37, 38, 39, 41, 42, 43, 44, 46, 47, 48, 49, 51, 52, 53, 54, 56, 57, 58, 59, 61, 62, 63, 64, 66, 67, 68, 69, 71, 72, 73, 74, 76, 77, 78, 79, 81, 82, 83, 84, 86, 87, 88, 89, 91, 92, 93, 94, 96, 97, 98, 99]
Pivot = 5
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 97, 98, 99, 100]
Pivot = 6
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 48, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95, 96, 97, 99, 100]
Pivot = 7
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 97, 98, 99, 100]
Pivot = 8
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 100]
Pivot = 9
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99]
Pivot = 10
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 100]
Pivot = 11
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 97, 98, 99, 100]
As you can see, on each iteration, only multiples of the current pivot are filtered out, but numbers that had already been filtered out keep coming back, even though I replace the rdd reference on each iteration.
In case it is of any help, I am running PySpark 2.0.1 on Python 2.7.10 for Mac.
Thanks!

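Python closures look up their free variables when the function is called, not when it is created (late binding). Because Spark filters are lazy, every lambda in the chain reads pivot only when an action runs.
As a result, in the first iteration rdd is evaluated as: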
(sc.parallelize(range(min_number, max_number), 4)
    .filter(lambda x: x <= 2 or x % 2 != 0))
in the second one:
(sc.parallelize(range(min_number, max_number), 4)
    .filter(lambda x: x <= 3 or x % 3 != 0)
    .filter(lambda x: x <= 3 or x % 3 != 0))
in the third one:
(sc.parallelize(range(min_number, max_number), 4)
    .filter(lambda x: x <= 4 or x % 4 != 0)
    .filter(lambda x: x <= 4 or x % 4 != 0)
    .filter(lambda x: x <= 4 or x % 4 != 0))
and each time pivot is resolved in the current scope.
Correct implementation:
while pivot <= max_pivot:
    def f(x, pivot=pivot):
        return x <= pivot or x % pivot != 0
    rdd = rdd.filter(f)
    pivot = rdd.filter(lambda x: x > pivot).min()
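The same late-binding effect can be reproduced in plain Python without Spark. In the sketch below (the names late and bound are chosen for illustration), every lambda in the first list sees the final value of p, while the default-argument trick used above freezes the value each lambda was created with.

```python
# Each lambda looks up p when it is called, after the loop has finished,
# so all three test divisibility by the last value, 5.
late = [lambda x: x % p != 0 for p in (2, 3, 5)]

# A default argument is evaluated when the lambda is created,
# so each one keeps its own p.
bound = [lambda x, p=p: x % p != 0 for p in (2, 3, 5)]

print(late[0](4))   # 4 % 5 != 0
print(bound[0](4))  # 4 % 2 != 0
```

functools.partial is another common way to bind the current value, if default arguments feel too implicit.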

convert list of strings from file to list of integers

I have a large file filled with integers separated by white space and comma. I am trying to read in 1KB at a time and convert it into a list of integers.
This code works fine:
import tempfile
with open('test_age.txt', 'r+') as inf:
    with open('test_age_out.txt', 'r+') as outf:
        sorted_list = []
        a = [x.strip() for x in inf.read(1000).split(',')]
        int_a = map(int, a)
        f = tempfile.TemporaryFile()
        outf_array = sorted(int_a)
        f.write(str(outf_array))
        f.seek(0)
        #etc...
output:
[1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, etc...
But once I add in a while loop to read the next 1KB:
with open('test_age.txt', 'r+') as inf:
    with open('test_age_out.txt', 'r+') as outf:
        sorted_list = []
        while True:
            a = [x.strip() for x in inf.read(1000).split(',')]
            int_a = map(int, a)
            if not a:
                break
            f = tempfile.TemporaryFile()
            outf_array = sorted(int_a)
            print(outf_array)
            f.write(str(outf_array))
            f.seek(0)
I get the output and a ValueError:
[1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
8, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12,
12, 12, 12, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 16, 17, 18,
19, 19, 20, 20, 20, 20, 21, 21, 22, 22, 22, 23, 23, 24, 24, 24, 24, 25,
25, 25, 25, 25, 26, 26, 26, 26, 27, 27, 27, 28, 28, 29, 30, 30, 30, 30,
31, 31, 31, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 35, 35,
35, 35, 35, 36, 36, 37, 37, 37, 37, 38, 38, 39, 39, 39, 39, 39, 39, 40,
40, 40, 40, 41, 41, 42, 43, 43, 43, 44, 44, 44, 44, 44, 45, 46, 46, 46,
46, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 49, 49, 49, 50, 50, 50,
50, 50, 50, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 53, 53, 54,
54, 54, 55, 55, 55, 55, 56, 56, 56, 56, 56, 57, 57, 57, 57, 58, 58, 58,
59, 59, 60, 60, 60, 61, 62, 62, 62, 62, 63, 63, 63, 63, 63, 63, 63, 64,
64, 64, 65, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 69,
69, 69, 69, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 74, 75, 76, 76,
76, 76, 77, 77, 77, 77, 78, 78, 79, 79, 79, 79, 81, 81, 81, 81, 82, 82,
82, 82, 82, 83, 83, 83, 83, 84, 85, 85, 85, 85, 86, 86, 86, 87, 87, 87,
87, 87, 87, 88, 88, 88, 88, 88, 88, 88, 89, 89, 89, 89, 90, 90, 90, 91,
91, 91, 91, 91, 91, 91, 92, 92, 93, 93, 93, 94, 94, 94, 94, 95, 95,
96, 96, 96, 97, 97, 98, 99, 100, 100, 100, 100, 100]
[2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 8, 9, 10, 10, 11, 11, 11, 11, 12, 12,12,
13, 14, 15, 17, 17, 17, 17, 17, 17, 18, 18, 18, 20, 21, 22, 22, 22, 22,
23, 23, 24, 24, 24, 26, 27, 27, 27, 27, 28, 28, 29, 29, 29, 29, 30, 32,
32, 32, 32, 33, 33, 34, 34, 36, 37, 37, 37, 37, 38, 39, 41, 41, 42, 43,
44, 44, 46, 46, 47, 48, 49, 49, 49, 49, 51, 51, 52, 52, 52, 52, 53, 54,
54, 54, 55, 55, 56, 60, 60, 61, 61, 61, 62, 63, 63, 64, 65, 65, 65, 65,
66, 66, 67, 68, 68, 68, 70, 70, 73, 73, 73, 74, 74, 75, 75, 75, 77, 77,
77, 77, 78, 78, 78, 78, 79, 80, 81, 81, 82, 82, 83, 83, 83, 83, 84, 84,
85, 85, 85, 85, 86, 87, 88, 90, 91, 91, 91, 92, 93, 93, 93, 94, 95, 97,
98, 98, 99, 100]
int_a = map(int, a)
ValueError: invalid literal for int() with base 10: ''
I am not sure why this is happening. If I call print, it seems as if the lists ARE being created and sorted. However the ValueError exists. What gives?

Look at the output of str.split with a passed delimiter appearing at the head or tail of a string:
>>> ', 3, 5'.split(', ')
['', '3', '5']
That empty string is what your program is trying (and failing) to parse as an integer. ''.strip() doesn't help (and isn't necessary for int(), by the way: it automatically ignores leading and trailing whitespace). I recommend reading blocks that are guaranteed to be full and valid, such as lines. If the file is just one big line, you'll have to do some extra work: keep the trailing, possibly incomplete, characters of each chunk and prepend them to the next chunk before processing it. Don't forget to process the remaining characters after the loop.
new = ''
while True:
    chunk = inf.read(1000)
    if not chunk:
        break
    new += chunk
    current, delimiter, new = new.rpartition(', ')
    # process current
# after the loop, process whatever is left in new
If the file can comfortably fit in your system's memory, you could just read the entire file and split it in one go:
numbers = list(map(int, inf.read().split(', ')))
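As a fuller sketch of the chunked approach, the generator below carries the partial tail of each chunk over to the next read, so a number is never split in two. It reads from an in-memory io.StringIO rather than the real file, and the function name read_ints_in_chunks and the tiny chunk size are assumptions chosen to make the boundary case visible.

```python
import io

def read_ints_in_chunks(stream, chunk_size=1000):
    """Yield one sorted list of ints per chunk, never splitting a number."""
    leftover = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        leftover += chunk
        # Everything before the last comma is complete; the tail may be partial.
        complete, _, leftover = leftover.rpartition(",")
        if complete.strip():
            yield sorted(int(tok) for tok in complete.split(",") if tok.strip())
    # Whatever remains after the last read is the final number, if any.
    if leftover.strip():
        yield sorted([int(leftover)])

# A chunk size of 7 deliberately cuts "12" and "100" in the middle.
stream = io.StringIO("5, 3, 12, 7, 100, 42")
chunks = list(read_ints_in_chunks(stream, chunk_size=7))
print(chunks)
```

Skipping blank tokens also sidesteps the empty string that ''.split(',') produces at end of file, which is why the `if not a` check in the question never triggers.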
