I'm writing a large PySpark program and I've recently run into trouble when using reduceByKey on an RDD. I've been able to recreate the problem with a simple test program. The code is:
from pyspark import SparkConf, SparkContext
APP_NAME = 'Test App'
def main(sc):
    test = [(0, [i]) for i in xrange(100)]
    test = sc.parallelize(test)
    test = test.reduceByKey(method)
    print test.collect()

def method(x, y):
    x.append(y[0])
    return x

if __name__ == '__main__':
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster('local[*]')
    sc = SparkContext(conf=conf)
    main(sc)
I would expect the output to be (0, [0,1,2,3,4,...,98,99]) based on the Spark documentation. Instead, I get the following output:
[(0, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 24, 36, 48, 60, 72, 84])]
Could someone please help me understand why this output is being generated?
As a side note, when I use
def method(x, y):
    x = x + y
    return x
I get the expected output.
First of all, it looks like you actually want groupByKey, not reduceByKey:
rdd = sc.parallelize([(0, i) for i in xrange(100)])
grouped = rdd.groupByKey()
k, vs = grouped.first()
assert len(list(vs)) == 100
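If you want the exact (0, [0, 1, ..., 99]) pair from the question, one way (just a sketch, using the same sc) is to materialize the grouped values with mapValues:
rdd = sc.parallelize([(0, i) for i in xrange(100)])
# mapValues(list) turns the lazy ResultIterable produced by groupByKey
# into a plain Python list for each key.
print rdd.groupByKey().mapValues(list).first()
# (0, [0, 1, 2, ..., 99]) -- note the order of the values is not formally guaranteed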
Could someone please help me understand why this output is being generated?
reduceByKey assumes that the reducing function is associative, and your method clearly is not. Depending on the order of operations, the output is different. Let's say you start with the following data for a certain key:
[1], [2], [3], [4]
Now let's add some parentheses:
((([1], [2]), [3]), [4])
(([1, 2], [3]), [4])
([1, 2, 3], [4])
[1, 2, 3, 4]
and with another set of parentheses:
(([1], ([2], [3])), [4])
(([1], [2, 3]), [4])
([1, 2], [4])
[1, 2, 4]
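You can reproduce this without Spark at all; here is a small sketch applying the method from the question by hand in the two groupings above:
def method(x, y):
    x.append(y[0])
    return x

print method(method(method([1], [2]), [3]), [4])   # [1, 2, 3, 4]
print method(method([1], method([2], [3])), [4])   # [1, 2, 4]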
When you rewrite it as follows:
method = lambda x, y: x + y
or simply
from operator import add
method = add
you get an associative function and it works as expected.
Generally speaking, for reduce* operations you want functions that are both associative and commutative.
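For completeness, a minimal sketch of the original test program with that fix applied (assuming the same sc as in the question); this reproduces the output the asker reports getting with x + y:
from operator import add

rdd = sc.parallelize([(0, [i]) for i in xrange(100)])
# add on lists is concatenation, which is associative, so the result no
# longer depends on how Spark groups the partial reductions.
print rdd.reduceByKey(add).first()   # (0, [0, 1, 2, ..., 99])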
Suppose I have a function called "factorial" and I want to test this function. I often find myself rewriting unit tests like the one shown below, where I define some test cases, possibly including some edge cases, and run the test for all of them. This common pattern, defining the test values and expected outputs and running the test on them, leaves me with the boilerplate code below. Essentially, I would like to have one function to which I pass the list of test values, the list of expected values, and the function to test, and let the framework handle the rest for me. Does something like that exist, and what would speak against such a simplified approach?
import unittest

class TestRecursionAlgorithms(unittest.TestCase):

    def test_factorial(self):
        input_values = [1, 2, 3, 4, 5]
        solutions = [1, 2, 6, 24, 120]
        for idx, (input_value, expected_solution) in enumerate(zip(input_values, solutions)):
            with self.subTest(test_case=idx):
                self.assertEqual(expected_solution, factorial(input_value))
Cheers
You could use a variation of this:
input_values = [1, 2, 3, 4, 5]
solutions = [1, 2, 6, 24, 120]
result = dict(zip(input_values, solutions))  # map each input value to its expected solution
print(result)
match = {i: k for i, k in result.items() if i == k}  # keep only the pairs where the input equals its expected solution
print(match)
result:
{1: 1, 2: 2, 3: 6, 4: 24, 5: 120}
{1: 1, 2: 2}
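If the goal is really "pass the inputs, the expected outputs, and the function under test, and let the framework do the rest", one option is to wrap the subTest loop from the question in a small reusable helper. This is just a sketch (check_cases is a made-up name, not a unittest feature):
import unittest

def check_cases(test_case, func, input_values, solutions):
    # Run func on every input and compare against the expected solution,
    # reporting each case separately via subTest.
    for idx, (input_value, expected) in enumerate(zip(input_values, solutions)):
        with test_case.subTest(test_case=idx):
            test_case.assertEqual(expected, func(input_value))

class TestRecursionAlgorithms(unittest.TestCase):

    def test_factorial(self):
        check_cases(self, factorial, [1, 2, 3, 4, 5], [1, 2, 6, 24, 120])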
When I attempt:
data_f = hstack([data,Ki])
I get:
TypeError: 'list' object is not callable.
I have 'googled' in vain without result. What have I missed?
I have successfully created the two arrays I want to combine:
data = []
data = np.vstack([data1,data2,data3,data4,data5,data6,data7,data8,data9,data10])
A = []
A = data[:,1]
Ki = []
Ki = np.exp((1000*A)/(Rcal*Tk))
name_s = name+'_Ki'
np.savetxt(name_s,[A],newline='\n',delimiter = ' ')
data_f = []
hstack = []
data_f = hstack([data,Ki])
Please Read The Fine Manual, where they clearly explain that hstack() wants a tuple of ndarrays with compatible shapes, and you're not supplying that. Carefully examine data and Ki to make sure they have compatible .shape values. Note also that the TypeError itself comes from the line hstack = [], which rebinds the name hstack to an empty list, so hstack([data, Ki]) ends up trying to call a list; drop that assignment and call np.hstack(...) instead.
EDIT
Here is an example of calling hstack():
>>> a = np.array(range(3)).reshape(3, 1)
>>> b = np.array(range(12)).reshape(3, 4)
>>> a.shape, b.shape
((3, 1), (3, 4))
>>> np.hstack((a, b))
array([[ 0, 0, 1, 2, 3],
[ 1, 4, 5, 6, 7],
[ 2, 8, 9, 10, 11]])
Notice that making a just np.array(range(3)) (i.e. without the reshape) would not work. To understand why, look at the difference between the .shape of those two expressions.
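Putting that together with the code in the question, a sketch of a working call (assuming data is the 2-D array from vstack and Ki is a 1-D array with one value per row of data) could look like this:
import numpy as np

# Do not rebind the name hstack to a list; that is what makes
# hstack([data, Ki]) raise "TypeError: 'list' object is not callable".
Ki_column = Ki.reshape(-1, 1)          # make Ki 2-D so the shapes line up
data_f = np.hstack([data, Ki_column])  # or: np.column_stack([data, Ki])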
What's happening here is that the first and second elements of every tuple are multiplied together, and all of the products are added up at the end. I know how to enter it in the Python shell, but how do I write it out as a function? Thanks for the help.
>>> x = [(70.9, 1, 24.8),
...      (15.4, 2, 70.5),
...      (30.0, 3, 34.6),
...      (25.0, 4, 68.4),
...      (45.00, 5, 99.0)]
>>> result = (a[0]*a[1] for a in x)
>>> sum(result)
516.7
Create the function:
def my_func(x):
    result = (a[0]*a[1] for a in x)
    return sum(result)
Call the function:
x = [(70.9, 1, 24.8),
     (15.4, 2, 70.5),
     (30.0, 3, 34.6),
     (25.0, 4, 68.4),
     (45.00, 5, 99.0)]
my_func(x)
The result will be 516.7.
Using the numpy package's dot product, we can also achieve this easily:
import numpy as np

x = [(70.9, 1, 24.8), (15.4, 2, 70.5), (30.0, 3, 34.6), (25.0, 4, 68.4), (45.00, 5, 99.0)]

def func(values):
    arr = np.array(values)              # convert the list of tuples to a 2-D array
    mul = np.dot(arr[:, 0], arr[:, 1])  # dot product of the first and second columns
    print(mul)
    return mul

func(x)
I have searched a lot, but haven't found an answer to this.
I am trying to pipe in a flat file with data and get it into something Python can read, so that I can do analysis with it (for instance, perform a t-test).
First, I created a simple pipe delimited flat file:
1|2
3|4
4|5
1|6
2|7
3|8
8|9
and saved it as "simpledata".
Then I created the following script in nano:
#!/usr/bin/env python
import sys
from scipy import stats
A = sys.stdin.read()
print A
paired_sample = stats.ttest_rel(A[:,0],A[:,1])
print "The t-statistic is %.3f and the p-value is %.3f." % paired_sample
Then I saved the script as pairedttest.sh and ran it as
cat simpledata | pairedttest.sh
The error I get is
TypeError: string indices must be integers, not tuple
Thanks for your help in advance
Are you trying to call this?
paired_sample = stats.ttest_rel([1,3,4,1,2,3,8], [2,4,5,6,7,8,9])
If so, you can't do it the way you're trying. A is just a string when you read it from stdin, so you can't index it the way you're trying. You need to build the two lists from the string. The most obvious way is like this:
left = []
right = []
for line in A.splitlines():
    l, r = line.split("|")
    left.append(int(l))
    right.append(int(r))
print left
print right
This will output:
[1, 3, 4, 1, 2, 3, 8]
[2, 4, 5, 6, 7, 8, 9]
So you can call stats.ttest_rel(left, right)
Or to be really clever and make a (nearly impossible to read) one-liner out of it:
z = zip(*[map(int, line.split("|")) for line in A.splitlines()])
This will output:
[(1, 3, 4, 1, 2, 3, 8), (2, 4, 5, 6, 7, 8, 9)]
So you can call stats.ttest_rel(*z)
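Putting it together, a minimal sketch of the whole script (Python 2, to match the question; the parsing and the ttest_rel call are taken straight from the pieces above):
#!/usr/bin/env python
import sys
from scipy import stats

# Read everything piped in on stdin as one string.
A = sys.stdin.read()

left = []
right = []
for line in A.splitlines():
    l, r = line.split("|")
    left.append(int(l))
    right.append(int(r))

paired_sample = stats.ttest_rel(left, right)
print "The t-statistic is %.3f and the p-value is %.3f." % paired_sample
Run it with, for example, cat simpledata | python pairedttest.sh (or make the file executable first).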
I am fairly new to python and am trying to figure out how to duplicate items within a list. I have tried several different things and searched for the answer extensively but always come up with an answer of how to remove duplicate items, and I feel like I am missing something that should be fairly apparent.
I want to duplicate the items within a list, so that a list such as [1, 4, 7, 10] becomes [1, 1, 4, 4, 7, 7, 10, 10].
I know that
list = range(5)
for i in range(len(list)):
    list.insert(i+i, i)
print list
will return [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] but this does not work if the items are not in order.
To provide more context I am working with audio as a list, attempting to make the audio slower.
I am working with:
def slower():
    left = Audio.getLeft()
    right = Audio.getRight()
    for i in range(len(left)):
        left.insert(????)
        right.insert(????)
Where "left" returns a list of items that are the "sounds" in the left headphone and "right" is a list of items that are sounds in the right headphone. Any help would be appreciated. Thanks.
Here is a simple way:
def slower(audio):
    return [audio[i//2] for i in range(0, len(audio)*2)]
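For example, with the list from the question:
>>> slower([1, 4, 7, 10])
[1, 1, 4, 4, 7, 7, 10, 10]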
Something like this works:
>>> list = [1, 32, -45, 12]
>>> for i in range(len(list)):
...     list.insert(2*i+1, list[2*i])
...
>>> list
[1, 1, 32, 32, -45, -45, 12, 12]
A few notes:
Don't use list as a variable name.
It's probably cleaner to flatten the list zipped with itself.
e.g.
>>> zip(list,list)
[(1, 1), (-1, -1), (32, 32), (42, 42)]
>>> [x for y in zip(list, list) for x in y]
[1, 1, -1, -1, 32, 32, 42, 42]
Or, you can do this whole thing lazily with itertools:
from itertools import izip, chain
for item in chain.from_iterable(izip(list, list)):
    print item
I actually like this method best of all. When I look at the code, it is the one where I immediately know what it is doing (although others may have different opinions on that).
I suppose while I'm at it, I'll just point out that we can do the same thing as above with a generator function:
def multiply_elements(iterable, ntimes=2):
    for item in iterable:
        for _ in xrange(ntimes):
            yield item
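For example, applied to the list from the question:
>>> list(multiply_elements([1, 4, 7, 10]))
[1, 1, 4, 4, 7, 7, 10, 10]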
And let's face it -- generators are just a lot of fun. :-)
listOld = [1, 4, 7, 10]
listNew = []
for element in listOld:
    listNew.extend([element, element])
This might not be the fastest way, but it is pretty compact:
import operator

a = range(5)
list(reduce(operator.add, zip(a, a)))
The result is
[0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
a = [0,1,2,3]
list(reduce(lambda x,y: x + y, zip(a,a))) #=> [0,0,1,1,2,2,3,3]