Related
Quite new to PySpark so this might be simple. I have an RDD that ranges from 1 to 100 and has 4 partitions.
A = sc.parallelize(range(100), 4)
And I have to find a way to return another RDD where each value in the RDD is added to its partition number. The ideal example would be:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102]
Would like to know how I could amend the following code to get the desired results.
A = sc.parallelize(range(100), 4)
B =
print(B.collect())
Array
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
How to print the remaining numbers in the range 0-100, except those numbers in a?
You can use sets and subtract a from the range of numbers 0 - 100:
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
print(set(range(101)) - set(a))
Prints:
{1, 2, 4, 6, 7, 9, 10, 13, 16, 17, 19, 23, 24, 25, 27, 29, 30, 31, 32, 34, 35, 36, 37, 39, 40, 43, 44, 45, 46, 47, 48, 49, 50, 53, 54, 55, 56, 57, 58, 59, 60, 63, 66, 68, 70, 71, 72, 74, 75, 77, 78, 80, 81, 87, 88, 89, 90, 91, 92, 95, 98, 99, 100}
If order is crucial, you can filter the range by removing items in a -- still using set(a) to make it efficient.
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
s_a = set(a)
filtered = [n for n in range(101) if n not in s_a]
I have dictionary called d which has several lists stored into it. If I print the dictionary I get this difficult to read output :
{'Patch(0,8)': [28, 56, 75], 'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
'Patch(4,0)': [2, 5, 6, 8, 19, 22, 23, 27, 31, 34, 35, 36, 41, 45, 51, 52, 53, 55, 56, 59, 60, 61, 62, 64, 66, 67, 68, 70, 73, 75, 76, 77, 79, 85, 87, 91, 94, 96],
'Patch(4,6)': [19, 23, 45, 56, 60, 75, 91], 'Patch(0,0)': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], 'Patch(8,0)': [2, 22, 23, 27, 34, 52
, 55, 60, 85], 'Patch(0,2)': [0, 1, 2, 3, 4, 6, 7, 10, 11, 13, 15, 16, 17, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 34, 36, 37, 38, 40, 43, 44, 45, 46, 47,
49, 50, 51, 52, 53, 54, 56, 58, 59, 60, 61, 62, 63, 64, 66, 70, 71, 74, 75, 76, 77, 78, 80, 81, 83, 85, 90, 91, 92, 93, 94, 96, 98, 99], 'Patch(2,8)': [28, 56, 75], 'Patch(4,8)': [56, 75]}
I just want to print each Patch and corresponding data in a new line :
{'Patch(0,8)': [28, 56, 75],
'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
I tried pprint after seeing the suggestions in this answer :
pprint.pprint(d, width=1)
I get this :
{'Patch(0,8)': [28,
56,
75], and so on
What am I missing here ?
Just pass in width that is big enough to hold every value in the dict:
>>> pprint.pprint(d, width=1000)
{'Patch(0,0)': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
'Patch(0,2)': [0, 1, 2, 3, 4, 6, 7, 10, 11, 13, 15, 16, 17, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 34, 36, 37, 38, 40, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 56, 58, 59, 60, 61, 62, 63, 64, 66, 70, 71, 74, 75, 76, 77, 78, 80, 81, 83, 85, 90, 91, 92, 93, 94, 96, 98, 99],
'Patch(0,6)': [1, 11, 17, 19, 20, 23, 28, 30, 44, 45, 49, 56, 60, 63, 75, 81, 91, 99],
'Patch(0,8)': [28, 56, 75],
'Patch(2,8)': [28, 56, 75],
'Patch(4,0)': [2, 5, 6, 8, 19, 22, 23, 27, 31, 34, 35, 36, 41, 45, 51, 52, 53, 55, 56, 59, 60, 61, 62, 64, 66, 67, 68, 70, 73, 75, 76, 77, 79, 85, 87, 91, 94, 96],
'Patch(4,6)': [19, 23, 45, 56, 60, 75, 91],
'Patch(4,8)': [56, 75],
'Patch(8,0)': [2, 22, 23, 27, 34, 52, 55, 60, 85]}
I usually print dicts as JSON to give it structure and formatting I can easily read.
import json
json.dumps( dict( a=1, b=2), indent=2)
You can make this into a simple loop to print it. have a look at dict.iteritems for the official docs.
for key, value in d.iteritems():
print key + " - " + str(value)
I am trying to implement a Sieve of Eratosthenes using PySpark.
For this, I am trying to apply many filter s to my RDD, but on each iteration, whatever was filtered out during the previous iterations keeps coming back, and I wonder why.
Here's the code:
from math import ceil
from math import sqrt
min_number = 2
max_number = 101
rdd = sc.parallelize(range(min_number, max_number), 4)
pivot = min_number
max_pivot = ceil(sqrt(max_number))
while pivot <= max_pivot:
print "RDD for pivot = " + str(pivot) + ":"
rdd = rdd.filter(lambda x: x <= pivot or x % pivot != 0)
pivot = rdd.filter(lambda x: x > pivot).reduce(min)
rdd.collect()
And the output:
Pivot = 2
[2, 3, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, 29, 31, 32, 34, 35, 37, 38, 40, 41, 43, 44, 46, 47, 49, 50, 52, 53, 55, 56, 58, 59, 61, 62, 64, 65, 67, 68, 70, 71, 73, 74, 76, 77, 79, 80, 82, 83, 85, 86, 88, 89, 91, 92, 94, 95, 97, 98, 100]
Pivot = 3
[2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 15, 17, 18, 19, 21, 22, 23, 25, 26, 27, 29, 30, 31, 33, 34, 35, 37, 38, 39, 41, 42, 43, 45, 46, 47, 49, 50, 51, 53, 54, 55, 57, 58, 59, 61, 62, 63, 65, 66, 67, 69, 70, 71, 73, 74, 75, 77, 78, 79, 81, 82, 83, 85, 86, 87, 89, 90, 91, 93, 94, 95, 97, 98, 99]
Pivot = 4
[2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 21, 22, 23, 24, 26, 27, 28, 29, 31, 32, 33, 34, 36, 37, 38, 39, 41, 42, 43, 44, 46, 47, 48, 49, 51, 52, 53, 54, 56, 57, 58, 59, 61, 62, 63, 64, 66, 67, 68, 69, 71, 72, 73, 74, 76, 77, 78, 79, 81, 82, 83, 84, 86, 87, 88, 89, 91, 92, 93, 94, 96, 97, 98, 99]
Pivot = 5
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 97, 98, 99, 100]
Pivot = 6
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 48, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95, 96, 97, 99, 100]
Pivot = 7
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 97, 98, 99, 100]
Pivot = 8
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 100]
Pivot = 9
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99]
Pivot = 10
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 100]
Pivot = 11
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 97, 98, 99, 100]
As you can see, on each iteration, only multiples of the current pivot are being filtered out, but numbers that had already being filtered out keep coming back, even when I replace the rdd reference on each iteration.
In case it is of any help, I am running PySpark 2.0.1 on Python 2.7.10 for Mac.
Thanks!
Python closures are evaluated when function is called, not when it is created (late binding).
As a result in the first iteration rdd is evaluated as:
(sc.parallelize(range(min_number, max_number), 4)
.filter(lambda x: x <= 2 or x % 2 != 0))
in the second one:
(sc.parallelize(range(min_number, max_number), 4)
.filter(lambda x: x <= 3 or x % 3 != 0)
.filter(lambda x: x <= 3 or x % 3 != 0))
in the third one:
(sc.parallelize(range(min_number, max_number), 4)
.filter(lambda x: x <= 4 or x % 4 != 0)
.filter(lambda x: x <= 4 or x % 4 != 0)
.filter(lambda x: x <= 4 or x % 4 != 0))
and each time pivot is resolved in the current scope.
Correct implementation:
while pivot <= max_pivot:
def f(x, pivot=pivot):
return x <= pivot or x % pivot != 0
rdd = rdd.filter(f)
pivot = rdd.filter(lambda x: x > pivot).min()
I have a large file filled with integers separated by white space and comma. I am trying to read in 1KB at a time and convert it into a list of integers.
This code works fine:
with open('test_age.txt', 'r+') as inf:
with open('test_age_out.txt', 'r+') as outf:
sorted_list =[]
a = [x.strip() for x in inf.read(1000).split(',')]
int_a = map(int, a)
f = tempfile.TemporaryFile()
outf_array = sorted(int_a)
f.write(str(outf_array))
f.seek(0)
#etc...
output:
[1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, etc...
But once I add in a while loop to read the next 1KB:
with open('test_age.txt', 'r+') as inf:
with open('test_age_out.txt', 'r+') as outf:
sorted_list =[]
while True:
a = [x.strip() for x in inf.read(1000).split(',')]
int_a = map(int, a)
if not a:
break
f = tempfile.TemporaryFile()
outf_array = sorted(int_a)
print outf_array
f.write(str(outf_array))
f.seek(0)
I get the output and a ValueError:
[1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
8, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12,
12, 12, 12, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 16, 17, 18,
19, 19, 20, 20, 20, 20, 21, 21, 22, 22, 22, 23, 23, 24, 24, 24, 24, 25,
25, 25, 25, 25, 26, 26, 26, 26, 27, 27, 27, 28, 28, 29, 30, 30, 30, 30,
31, 31, 31, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 35, 35,
35, 35, 35, 36, 36, 37, 37, 37, 37, 38, 38, 39, 39, 39, 39, 39, 39, 40,
40, 40, 40, 41, 41, 42, 43, 43, 43, 44, 44, 44, 44, 44, 45, 46, 46, 46,
46, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 49, 49, 49, 50, 50, 50,
50, 50, 50, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 53, 53, 54,
54, 54, 55, 55, 55, 55, 56, 56, 56, 56, 56, 57, 57, 57, 57, 58, 58, 58,
59, 59, 60, 60, 60, 61, 62, 62, 62, 62, 63, 63, 63, 63, 63, 63, 63, 64,
64, 64, 65, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 69,
69, 69, 69, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 74, 75, 76, 76,
76, 76, 77, 77, 77, 77, 78, 78, 79, 79, 79, 79, 81, 81, 81, 81, 82, 82,
82, 82, 82, 83, 83, 83, 83, 84, 85, 85, 85, 85, 86, 86, 86, 87, 87, 87,
87, 87, 87, 88, 88, 88, 88, 88, 88, 88, 89, 89, 89, 89, 90, 90, 90, 91,
91, 91, 91, 91, 91, 91, 92, 92, 93, 93, 93, 94, 94, 94, 94, 95, 95,
96, 96, 96, 97, 97, 98, 99, 100, 100, 100, 100, 100]
[2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 8, 9, 10, 10, 11, 11, 11, 11, 12, 12,12,
13, 14, 15, 17, 17, 17, 17, 17, 17, 18, 18, 18, 20, 21, 22, 22, 22, 22,
23, 23, 24, 24, 24, 26, 27, 27, 27, 27, 28, 28, 29, 29, 29, 29, 30, 32,
32, 32, 32, 33, 33, 34, 34, 36, 37, 37, 37, 37, 38, 39, 41, 41, 42, 43,
44, 44, 46, 46, 47, 48, 49, 49, 49, 49, 51, 51, 52, 52, 52, 52, 53, 54,
54, 54, 55, 55, 56, 60, 60, 61, 61, 61, 62, 63, 63, 64, 65, 65, 65, 65,
66, 66, 67, 68, 68, 68, 70, 70, 73, 73, 73, 74, 74, 75, 75, 75, 77, 77,
77, 77, 78, 78, 78, 78, 79, 80, 81, 81, 82, 82, 83, 83, 83, 83, 84, 84,
85, 85, 85, 85, 86, 87, 88, 90, 91, 91, 91, 92, 93, 93, 93, 94, 95, 97,
98, 98, 99, 100]
int_a = map(int, a)
ValueError: invalid literal for int() with base 10: ''
I am not sure why this is happening. If I call print, it seems as if the lists ARE being created and sorted. However the ValueError exists. What gives?
Look at the output of str.split with a passed delimiter appearing at the head or tail of a string:
>>> ', 3, 5'.split(', ')
['', '3', '5']
That empty string is what your program is trying (and failing) to parse as an integer. ''.strip() doesn't help (and isn't necessary for int(), by the way - it automatically ignores leading and trailing whitespace). I recommend reading blocks that are guaranteed to be full and valid, such as lines. If the file is just one big line, you'll have to do some extra work to save the last characters from a line and move them into the next line's processing. Don't forget to process the remaining characters after the loop.
line = inf.read(1000)
new += line
current, delimiter, new = line.rpartition(', ')
# process current
# continue loop to add more content
If the file can comfortably fit in your system's memory, you could just read the entire file and split it in one go:
numbers = map(int, inf.read().split(', '))