Quite new to PySpark so this might be simple. I have an RDD that ranges from 0 to 99 and has 4 partitions.
A = sc.parallelize(range(100), 4)
I have to find a way to return another RDD in which each value has its partition number added to it. The expected result would be:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102]
Would like to know how I could amend the following code to get the desired results.
A = sc.parallelize(range(100), 4)
B =
print(B.collect())
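One possible way to fill in B is sketched here. It assumes RDD.mapPartitionsWithIndex is available in your PySpark version; it passes each partition's index and an iterator over that partition's elements to your function. The names add_partition_index and simulate_partitions are made up for illustration, and simulate_partitions is only a local stand-in that mimics an even 4-way split so the logic can be checked without a running SparkContext.

```python
def add_partition_index(index, iterator):
    """Yield each element of one partition plus that partition's index."""
    for value in iterator:
        yield value + index


def simulate_partitions(data, num_partitions):
    """Hypothetical local stand-in: split data into contiguous, equal slices."""
    data = list(data)
    size = len(data) // num_partitions
    for index in range(num_partitions):
        start = index * size
        # The last slice takes any remainder.
        end = len(data) if index == num_partitions - 1 else start + size
        yield index, iter(data[start:end])


# With a live SparkContext `sc`, the equivalent call would be:
#     A = sc.parallelize(range(100), 4)
#     B = A.mapPartitionsWithIndex(add_partition_index)
#     print(B.collect())
result = []
for index, part in simulate_partitions(range(100), 4):
    result.extend(add_partition_index(index, part))
print(result)
```

How Spark actually distributes elements across partitions depends on the data source and partitioner, so the real output may differ if the split is not the even 25/25/25/25 assumed here.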
Array
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
How to print the remaining numbers in the range 0-100, except those numbers in a?
You can use sets and subtract a from the range of numbers 0 - 100:
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
print(set(range(101)) - set(a))
Prints:
{1, 2, 4, 6, 7, 9, 10, 13, 16, 17, 19, 23, 24, 25, 27, 29, 30, 31, 32, 34, 35, 36, 37, 39, 40, 43, 44, 45, 46, 47, 48, 49, 50, 53, 54, 55, 56, 57, 58, 59, 60, 63, 66, 68, 70, 71, 72, 74, 75, 77, 78, 80, 81, 87, 88, 89, 90, 91, 92, 95, 98, 99, 100}
If order is crucial, you can filter the range by removing items in a -- still using set(a) to make it efficient.
a = (0, 3, 5, 8, 11, 12, 14, 15, 18, 20, 21, 22, 26, 26, 28, 33, 38, 41, 42, 42, 51, 52, 61, 62, 64, 65, 67, 69, 73, 76, 79, 82, 83, 84, 85, 86, 93, 94, 96, 97)
s_a = set(a)
filtered = [n for n in range(101) if n not in s_a]
I am trying to implement a Sieve of Eratosthenes using PySpark.
For this, I am trying to apply many filters to my RDD, but on each iteration, whatever was filtered out during the previous iterations keeps coming back, and I wonder why.
Here's the code:
from math import ceil
from math import sqrt
min_number = 2
max_number = 101
rdd = sc.parallelize(range(min_number, max_number), 4)
pivot = min_number
max_pivot = ceil(sqrt(max_number))
while pivot <= max_pivot:
    print "RDD for pivot = " + str(pivot) + ":"
    rdd = rdd.filter(lambda x: x <= pivot or x % pivot != 0)
    pivot = rdd.filter(lambda x: x > pivot).reduce(min)
    rdd.collect()
And the output:
Pivot = 2
[2, 3, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, 29, 31, 32, 34, 35, 37, 38, 40, 41, 43, 44, 46, 47, 49, 50, 52, 53, 55, 56, 58, 59, 61, 62, 64, 65, 67, 68, 70, 71, 73, 74, 76, 77, 79, 80, 82, 83, 85, 86, 88, 89, 91, 92, 94, 95, 97, 98, 100]
Pivot = 3
[2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 15, 17, 18, 19, 21, 22, 23, 25, 26, 27, 29, 30, 31, 33, 34, 35, 37, 38, 39, 41, 42, 43, 45, 46, 47, 49, 50, 51, 53, 54, 55, 57, 58, 59, 61, 62, 63, 65, 66, 67, 69, 70, 71, 73, 74, 75, 77, 78, 79, 81, 82, 83, 85, 86, 87, 89, 90, 91, 93, 94, 95, 97, 98, 99]
Pivot = 4
[2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 21, 22, 23, 24, 26, 27, 28, 29, 31, 32, 33, 34, 36, 37, 38, 39, 41, 42, 43, 44, 46, 47, 48, 49, 51, 52, 53, 54, 56, 57, 58, 59, 61, 62, 63, 64, 66, 67, 68, 69, 71, 72, 73, 74, 76, 77, 78, 79, 81, 82, 83, 84, 86, 87, 88, 89, 91, 92, 93, 94, 96, 97, 98, 99]
Pivot = 5
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 97, 98, 99, 100]
Pivot = 6
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 48, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95, 96, 97, 99, 100]
Pivot = 7
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 97, 98, 99, 100]
Pivot = 8
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 60, 61, 62, 64, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 100]
Pivot = 9
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99]
Pivot = 10
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 100]
Pivot = 11
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 97, 98, 99, 100]
As you can see, on each iteration only multiples of the current pivot are filtered out, but numbers that had already been filtered out keep coming back, even though I replace the rdd reference on each iteration.
In case it is of any help, I am running PySpark 2.0.1 on Python 2.7.10 for Mac.
Thanks!
Python closures look up their free variables when the function is called, not when it is created (late binding).
As a result, in the first iteration rdd is evaluated as:
(sc.parallelize(range(min_number, max_number), 4)
.filter(lambda x: x <= 2 or x % 2 != 0))
in the second one:
(sc.parallelize(range(min_number, max_number), 4)
.filter(lambda x: x <= 3 or x % 3 != 0)
.filter(lambda x: x <= 3 or x % 3 != 0))
in the third one:
(sc.parallelize(range(min_number, max_number), 4)
.filter(lambda x: x <= 4 or x % 4 != 0)
.filter(lambda x: x <= 4 or x % 4 != 0)
.filter(lambda x: x <= 4 or x % 4 != 0))
and each time pivot is resolved in the current scope.
Correct implementation:
while pivot <= max_pivot:
    # Bind the current pivot as a default argument so each filter keeps its own value.
    def f(x, pivot=pivot):
        return x <= pivot or x % pivot != 0
    rdd = rdd.filter(f)
    pivot = rdd.filter(lambda x: x > pivot).min()
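The same effect can be reproduced without Spark. The sketch below uses plain lists; the helper names make_filters_late, make_filters_bound and apply_all are invented for this illustration. It contrasts lambdas that read pivot at call time with lambdas that capture it through a default argument.

```python
def make_filters_late(pivots):
    # Each lambda looks up `pivot` when it is called, so every filter
    # ends up seeing the last value the loop assigned.
    filters = []
    for pivot in pivots:
        filters.append(lambda x: x <= pivot or x % pivot != 0)
    return filters


def make_filters_bound(pivots):
    # The default argument is evaluated when the lambda is created,
    # so each filter keeps its own pivot.
    filters = []
    for pivot in pivots:
        filters.append(lambda x, pivot=pivot: x <= pivot or x % pivot != 0)
    return filters


def apply_all(filters, numbers):
    # Apply the filters one after another, like chained rdd.filter calls.
    for keep in filters:
        numbers = [n for n in numbers if keep(n)]
    return numbers


numbers = list(range(2, 31))
late = apply_all(make_filters_late([2, 3, 5]), numbers)
bound = apply_all(make_filters_bound([2, 3, 5]), numbers)
print(late)   # only multiples of 5 (the last pivot) are removed
print(bound)  # multiples of 2, 3 and 5 are all removed
```

functools.partial would be another way to bind the value early; the default-argument form is used here because it matches the fix above.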
I have a large file filled with integers separated by white space and comma. I am trying to read in 1KB at a time and convert it into a list of integers.
This code works fine:
import tempfile

with open('test_age.txt', 'r+') as inf:
    with open('test_age_out.txt', 'r+') as outf:
        sorted_list = []
        a = [x.strip() for x in inf.read(1000).split(',')]
        int_a = map(int, a)
        f = tempfile.TemporaryFile()
        outf_array = sorted(int_a)
        f.write(str(outf_array))
        f.seek(0)
        #etc...
output:
[1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, etc...
But once I add in a while loop to read the next 1KB:
with open('test_age.txt', 'r+') as inf:
    with open('test_age_out.txt', 'r+') as outf:
        sorted_list = []
        while True:
            a = [x.strip() for x in inf.read(1000).split(',')]
            int_a = map(int, a)
            if not a:
                break
            f = tempfile.TemporaryFile()
            outf_array = sorted(int_a)
            print outf_array
            f.write(str(outf_array))
            f.seek(0)
I get the output and a ValueError:
[1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
8, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12,
12, 12, 12, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 16, 17, 18,
19, 19, 20, 20, 20, 20, 21, 21, 22, 22, 22, 23, 23, 24, 24, 24, 24, 25,
25, 25, 25, 25, 26, 26, 26, 26, 27, 27, 27, 28, 28, 29, 30, 30, 30, 30,
31, 31, 31, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 35, 35,
35, 35, 35, 36, 36, 37, 37, 37, 37, 38, 38, 39, 39, 39, 39, 39, 39, 40,
40, 40, 40, 41, 41, 42, 43, 43, 43, 44, 44, 44, 44, 44, 45, 46, 46, 46,
46, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 49, 49, 49, 50, 50, 50,
50, 50, 50, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 53, 53, 54,
54, 54, 55, 55, 55, 55, 56, 56, 56, 56, 56, 57, 57, 57, 57, 58, 58, 58,
59, 59, 60, 60, 60, 61, 62, 62, 62, 62, 63, 63, 63, 63, 63, 63, 63, 64,
64, 64, 65, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 69,
69, 69, 69, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 74, 75, 76, 76,
76, 76, 77, 77, 77, 77, 78, 78, 79, 79, 79, 79, 81, 81, 81, 81, 82, 82,
82, 82, 82, 83, 83, 83, 83, 84, 85, 85, 85, 85, 86, 86, 86, 87, 87, 87,
87, 87, 87, 88, 88, 88, 88, 88, 88, 88, 89, 89, 89, 89, 90, 90, 90, 91,
91, 91, 91, 91, 91, 91, 92, 92, 93, 93, 93, 94, 94, 94, 94, 95, 95,
96, 96, 96, 97, 97, 98, 99, 100, 100, 100, 100, 100]
[2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 8, 9, 10, 10, 11, 11, 11, 11, 12, 12,12,
13, 14, 15, 17, 17, 17, 17, 17, 17, 18, 18, 18, 20, 21, 22, 22, 22, 22,
23, 23, 24, 24, 24, 26, 27, 27, 27, 27, 28, 28, 29, 29, 29, 29, 30, 32,
32, 32, 32, 33, 33, 34, 34, 36, 37, 37, 37, 37, 38, 39, 41, 41, 42, 43,
44, 44, 46, 46, 47, 48, 49, 49, 49, 49, 51, 51, 52, 52, 52, 52, 53, 54,
54, 54, 55, 55, 56, 60, 60, 61, 61, 61, 62, 63, 63, 64, 65, 65, 65, 65,
66, 66, 67, 68, 68, 68, 70, 70, 73, 73, 73, 74, 74, 75, 75, 75, 77, 77,
77, 77, 78, 78, 78, 78, 79, 80, 81, 81, 82, 82, 83, 83, 83, 83, 84, 84,
85, 85, 85, 85, 86, 87, 88, 90, 91, 91, 91, 92, 93, 93, 93, 94, 95, 97,
98, 98, 99, 100]
int_a = map(int, a)
ValueError: invalid literal for int() with base 10: ''
I am not sure why this is happening. If I call print, it seems as if the lists ARE being created and sorted. However the ValueError exists. What gives?
Look at the output of str.split with a passed delimiter appearing at the head or tail of a string:
>>> ', 3, 5'.split(', ')
['', '3', '5']
That empty string is what your program is trying (and failing) to parse as an integer. Calling ''.strip() doesn't help, and strip() isn't necessary for int() anyway, since int() ignores leading and trailing whitespace. I recommend reading blocks that are guaranteed to be full and valid, such as lines. If the file is just one big line, you'll have to do some extra work: save the trailing characters of each chunk (everything after its last delimiter) and prepend them to the next chunk. Don't forget to process the remaining characters after the loop.
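new = ''
while True:
    line = inf.read(1000)
    if not line:
        break
    new += line
    # split off everything up to the last delimiter; the rest is carried over
    current, delimiter, new = new.rpartition(', ')
    # process current (it is empty if no delimiter was found yet)
# after the loop, process whatever is left in new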
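A fuller, runnable variant of the carry-over idea might look like the sketch below. The function name read_ints_in_chunks and its parameters are assumptions made for this example, and io.StringIO stands in for the real file. It splits on ',' alone rather than ', ', which sidesteps the case where a chunk boundary falls between the comma and the space; int() tolerates the leftover whitespace.

```python
import io


def read_ints_in_chunks(stream, chunk_size=1000, delimiter=','):
    """Yield one list of ints per chunk, carrying partial numbers over."""
    leftover = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        leftover += chunk
        # Everything before the last delimiter is complete; the tail may be
        # a number that was cut in half, so keep it for the next round.
        complete, _, leftover = leftover.rpartition(delimiter)
        # int() ignores surrounding whitespace; skip empty pieces.
        numbers = [int(tok) for tok in complete.split(delimiter) if tok.strip()]
        if numbers:
            yield numbers
    # Whatever follows the last delimiter is the final number, if any.
    if leftover.strip():
        yield [int(leftover)]


sample = io.StringIO(u'1, 22, 333, 4, 55, 6')
batches = list(read_ints_in_chunks(sample, chunk_size=5))
flat = [n for batch in batches for n in batch]
print(flat)
```

With a real file you would pass the open file object instead of the StringIO sample and sort or write each batch as it arrives.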
If the file can comfortably fit in your system's memory, you could just read the entire file and split it in one go:
numbers = map(int, inf.read().split(', '))
I'm using python 3.2.3 IDLE and this is my code:
originalList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
newList = originalList[0.05:0.95] #<<<<I have no idea what I'm doing here
print (newList)
I have an original list of the numbers 1 to 100, and I want to make a new list from it that contains only the data in the 5% to 95% sub-range of the original list,
so the new list should look like [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18....95]. How do I do that? I know my newList code is wrong.
originalList.sort()
newList = originalList[int(len(originalList) * .05) : int(len(originalList) * .95)]
sl = slice(4, 95)
print(originalList[sl])
Also see http://docs.python.org/2/library/functions.html#slice
size = len(originalList)
# slice indices must be integers, so convert the fractional positions first
newList = originalList[int(round(0.05*size)) - 1:int(round(0.95*size))]
If you want to get part of a list, the syntax is
List = [1,2,3,4,5,6,7,8,9,10]
newList = List[start_index:end_index]
so, the first number is the index where the sub-list starts, while the second number is the index at which the sublist stops (that index is not included).
hope this helps!
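To make the half-open slice rule concrete for the percentage case, here is a small sketch. The helper percent_slice is a made-up name, and rounding the fractional positions to the nearest integer is one assumed convention among several; which element counts as "the 5% mark" is a choice you have to make yourself.

```python
def percent_slice(seq, lower, upper):
    """Return seq[start:stop] where the bounds are fractions of len(seq)."""
    size = len(seq)
    # Slice indices must be integers, so convert the fractions first.
    start = int(round(lower * size))
    stop = int(round(upper * size))
    return seq[start:stop]


original = list(range(1, 101))

# Index 4 holds the value 5, and the stop index 95 is excluded,
# so this slice runs from the value 5 through the value 95.
by_index = original[4:95]

# With the rounding convention above, 0.05 maps to index 5 (the value 6).
trimmed = percent_slice(original, 0.05, 0.95)
print(by_index[0], by_index[-1], trimmed[0], trimmed[-1])
```

If you want the value 5 included, subtract one from the start index, as some of the other answers do.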
I'd also use range for creating the original list... less mistake prone.
originalList = list(range(1, 101))
# slice indices must be integers, so wrap the computed positions in int()
newList = originalList[int(len(originalList)*.05)-1:int(len(originalList)*.95)]
print(newList)
Gives the desired result...
Edit: Changed range to be more concise per comment below.
For lists of arbitrary length, you could do:
>>> l = range(200)
>>> percentage = 5
>>> skip = int(len(l) * (float(percentage) / 100) / 2)
>>> len(l[skip:-skip])
190
You could use the fidx module, which allows percentages as indexes:
import fidx
originalList = fidx([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100])
# or better: originalList = fidx.list(range(1,101))
newList = originalList[0.05:0.95]
print (newList)
which returns
[6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]