PySpark, Error calling saveAsTextFile

This is my first attempt with Python. I'm trying to use Python with Apache Spark.
This is what I want to do:
l = sc.textFile("/user/cloudera/dataset.txt")
l = l.map(lambda x: map(int, x))
Then I use the cartesian function to obtain all possible pairs of elements:
lc = l.cartesian(l)
Now I apply a function to every pair:
output = lc.map(lambda x: str(x[0]) + ";" + str(x[1]) + ";" + str(cosineSim(x[0], x[1])))
My objective is to obtain strings like:
element1; element1; similarity
element1; element2; similarity
...
and so on.
When I call output.first(), this is my output:
[45, 12, 7, 2, 2, 2, 2, 4, 7];[45, 12, 7, 2, 2, 2, 2, 4, 7];1.0
This is a string; indeed, if I do:
s = output.first()
type(s)
<type 'str'>
But if I execute output.collect() or output.saveAsTextFile(path), I get this error:
15/02/13 06:06:18 WARN TaskSetManager: Lost task 1.0 in stage 61.0 (TID 183, 10.39.127.148): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 107, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 2, in <lambda>
ValueError: invalid literal for int() with base 10: ''
What's wrong?

I think the error is in this expression:
l = l.map(lambda x: map(int, x))
Can you check that every record in the l RDD always has values (no '')? If it doesn't, you get a typical Python error:
In [32]: int('')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
----> 1 int('')
ValueError: invalid literal for int() with base 10: ''
Moving forward, keep in mind that map is lazily evaluated: nothing is computed until the next action (collect and saveAsTextFile are actions) is called. That is likely why output.first() succeeds, since it only has to process enough records to return one element, while collect() and saveAsTextFile() process every record and hit the bad one.
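Both points can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch, not the asker's actual job: the helper name parse_ints and the comma delimiter are assumptions, since the question does not show how the lines are split. Python 3's built-in map is lazy in much the same way as an RDD transformation, so the ValueError only appears once the result is consumed.

```python
def parse_ints(line, sep=","):
    """Convert a delimited line to a list of ints, skipping empty tokens.

    Hypothetical helper; the delimiter is an assumption.
    """
    return [int(tok) for tok in line.split(sep) if tok.strip()]


# Building the map object does not call int() yet (Python 3).
lazy = map(int, ["45", "12", ""])

try:
    list(lazy)  # consuming the iterator triggers the conversions
    failed = False
except ValueError:
    failed = True

clean = parse_ints("45,12,7,,2")  # the empty token is skipped
empty = parse_ints("")            # an empty line yields an empty list
```

In the Spark job, a function like this could replace the lambda passed to l.map, assuming the lines really are comma-separated.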

Related

What to do if we get a ValueError in Python

I have tried this statement:
N,T,M = list(map(int,input().split()))
and Python shows me:
N,T,M = list(map(int,input().split()))
ValueError: not enough values to unpack (expected 3, got 1)
What to do?

It seems you're receiving this error because you are trying to assign three distinct values (N, T, and M) even though your list only has one value.
Running this in the Python 3 interpreter reproduces the error you reported:
>>> N,T,M=list([1])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 3, got 1)
With 2 items in the list, the error message changes:
>>> N,T,M=list([1, 2])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 3, got 2)
If you have 3 items in your list, Python will assign one item to each of your variables:
>>> N,T,M=list([1, 2, 3])
# This works and no error is received
Then, with 4 items in the list, Python complains:
>>> N,T,M=list([1, 2, 3, 4])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 3)
Essentially, you need to ensure that the right side of the assignment yields exactly three items, so that each of the three variables receives a value. With input().split(), that means entering all three numbers on one line, separated by whitespace.
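If the input cannot be trusted to contain three numbers, the count can be checked before unpacking. This is a minimal sketch; the function name read_three_ints and its error message are made up for illustration, and it takes the text as an argument instead of calling input() so it can be tried without a terminal.

```python
def read_three_ints(text):
    """Parse exactly three whitespace-separated integers from text."""
    parts = text.split()
    if len(parts) != 3:
        # Fail with a clearer message than the bare unpacking error.
        raise ValueError("expected 3 integers, got %d" % len(parts))
    n, t, m = map(int, parts)
    return n, t, m


n, t, m = read_three_ints("5 10 2")

try:
    read_three_ints("5")  # only one value on the line
    message = None
except ValueError as exc:
    message = str(exc)
```

In the original program the call would look like read_three_ints(input()).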

Editing every element in the array to return an array of edited(sliced) elements

Hi, I am trying to edit every element in the array and return an array of the edited (sliced) elements. However, I am getting the error below. Any help is appreciated.
Traceback
>>> p=Playlist.objects.get(id=3)
>>> l=p.song.values_list('link', flat=True)
>>> print(l)
<QuerySet ['https://www.youtube.com/watch?v=_DqmVMlJzqA', 'https://www.youtube.com/watch?v=_DqmVMlJzqA', 'https://www.youtube.com/watch?v=_DqmVMlJzqA', 'https://www.youtube.com/watch?v=k6PiQr-lQY4', 'https://www.youtube.com/watch?v=gqOEoUR5RHg']>
>>> print([l[i][17:] if l[i][0:17] == 'https://youtu.be/' else l[i][32:] for i in l])
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "<console>", line 1, in <listcomp>
File "C:\Users\hanya\AppData\Local\Programs\Python\Python37\lib\site-packages\django\db\models\query.py", line 278, in __getitem__
raise TypeError
TypeError

execute string with function into a function

How can I make the function test() work correctly?
Python 3.4.1
A function referenced inside a string does not work when that string is executed inside another function.
How do I make the function available to the code in the string?
def func(x):
    return x+1
def test(a,b):
    loc = {'a':a,'b':b}
    glb = {}
    exec('c = [func(a+i)for i in range(b)]', glb,loc)
    c = loc['c']
    print(c)
print('out the function test()')
a= 1
b= 4
c = [func(a+i)for i in range(b)]
print(c)
'''results:
out the function test()
[2, 3, 4, 5]
>>> test(1,4)
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
test(1,4)
File "C:\Users\Rosania\Documents\Edilon\Python examples\apagar.py", line 6, in test
exec('c = [func(a+i)for i in range(b)]', glb,loc)
File "<string>", line 1, in <module>
File "<string>", line 1, in <listcomp>
NameError: name 'func' is not defined
'''

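Executing a string as code is kind of evil.
Assuming you know what you're doing...
The environment you pass to exec must contain every name the string references, including func. Be aware that on Python 3.4 a list comprehension runs in its own scope and looks up func and a in the globals dict (glb), not in loc, so adding func to loc alone is not enough. The simplest fix is to pass a single dict as the namespace:
ns = {'a':a, 'b':b, 'func':func}
exec('c = [func(a+i)for i in range(b)]', ns)
c = ns['c']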

Cant understand some python tuple syntax

I started learning Python today and found this very nice code visualization tool, pythontutor.com. The problem is that I still don't quite get some of the syntax in the example code.
def listSum(numbers):
    if not numbers:
        return 0
    else:
        (f, rest) = numbers
        return f + listSum(rest)
myList = (1, (2, (3, None)))
total = listSum(myList)
What does (f, rest) = numbers mean?

It's tuple unpacking.
There need to be exactly 2 items in the tuple when it is used this way. More or fewer will result in an exception, as shown below.
>>> numbers = (1, 2, 3, 4, 5)
>>> (f, rest) = numbers
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
>>> numbers = (1, 2)
>>> (f, rest) = numbers
>>> print f
1
>>> print rest
2
>>> numbers = (1)
>>> (f, rest) = numbers
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable
>>> numbers = (1,)
>>> (f, rest) = numbers
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: need more than 1 value to unpack
Note that (1) and (1,) are different: only the latter is a tuple.
See the Python documentation on Tuples and Sequences for more details.

(f, rest) = numbers
unpacks the tuple. That is, it takes the two values stored in numbers and stores them in f and rest, respectively. Note that the number of variables you unpack into must be the same as the number of values in the tuple, or else an exception will be thrown.
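As a runnable illustration, here is a sketch of the same recursion written for Python 3. The name list_sum is just a restyled listSum, and the star-unpacking lines at the end are an addition that is not part of the original example.

```python
def list_sum(numbers):
    """Sum a linked list encoded as nested (value, rest) tuples."""
    if not numbers:          # None marks the end of the list
        return 0
    first, rest = numbers    # unpack the 2-tuple into two names
    return first + list_sum(rest)


total = list_sum((1, (2, (3, None))))

# Star unpacking (Python 3) collects any remaining items into a list,
# so the left side no longer has to match the tuple length exactly.
head, *tail = (1, 2, 3, 4, 5)
```

Each recursive call peels off one value and passes the rest of the nested tuple along until it reaches None.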

A tuple is a data structure in which you can store multiple items under one name.
Let's say that we have a tuple t with two items.
Then t[0] is the first item and t[1] is the second item.
Another way of accessing the tuple items is:
(f, rest) = numbers
With this syntax, numbers (a tuple) must have exactly 2 items; otherwise an exception is raised. It is equivalent to:
f = numbers[0]
rest = numbers[1]

TypeError while representing arbitrary element type in multiprocessing.Array

>>> from multiprocessing import Array, Value
>>> import numpy as np
>>> a = [(i,[]) for i in range(3)]
>>> a
[(0, []), (1, []), (2, [])]
>>> a[0][1].extend(np.array([1,2,3]))
>>> a[1][1].extend(np.array([4,5]))
>>> a[2][1].extend(np.array([6,7,8]))
>>> a
[(0, [1, 2, 3]), (1, [4, 5]), (2, [6, 7, 8])]
Following the python multiprocessing example: def test_sharedvalues(): I am trying to create a shared Proxy object using the below code:
shared_a = [multiprocessing.Array(id, e) for id, e in a]
but it is giving me an error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/multiprocessing/__init__.py", line 255, in Array
return Array(typecode_or_type, size_or_initializer, **kwds)
File "/usr/lib64/python2.6/multiprocessing/sharedctypes.py", line 87, in Array
obj = RawArray(typecode_or_type, size_or_initializer)
File "/usr/lib64/python2.6/multiprocessing/sharedctypes.py", line 60, in RawArray
result = _new_value(type_)
File "/usr/lib64/python2.6/multiprocessing/sharedctypes.py", line 36, in _new_value
size = ctypes.sizeof(type_)
TypeError: this type has no size

Ok. The problem is solved
I changed
>>> a = [(i,[]) for i in range(3)]
to
>>> a = [('i',[]) for i in range(3)]
and this solved the TypeError.
Actually, I also found out that I did not need to use i as a counter within range(3), since Array already allows indexing. The 'i' is the typecode for c_int in multiprocessing.sharedctypes.
Hope this helps.
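The traceback in the question ends in ctypes.sizeof, which suggests the error came from passing the integers 0, 1 and 2 where Array expects a typecode or ctypes type. The following is a minimal single-process sketch of the corrected call. The variable names are illustrative, and lock=False is an assumption made to keep the demo simple; code that shares the arrays between processes would normally keep the default lock.

```python
import ctypes
from multiprocessing import Array

data = [[1, 2, 3], [4, 5], [6, 7, 8]]

# 'i' is the typecode for a signed C int; lock=False returns a plain
# shared ctypes array, which is enough for this single-process demo.
shared = [Array("i", values, lock=False) for values in data]

# An integer such as 0 is not a ctypes type, so sizeof() rejects it.
try:
    ctypes.sizeof(0)
    error = None
except TypeError as exc:
    error = str(exc)

# Copy the shared arrays back into ordinary lists for inspection.
contents = [list(arr) for arr in shared]
```

Each element of shared supports indexing and slicing like a list, which is why the separate counter from the question is not needed.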
