I want to start by saying that I know this error message has been posted multiple times, but I cannot seem to understand how those posts apply to me. So I want to try my luck:
I have a DataFrame "df" and I am trying to process subsets of that DataFrame in parallel:
for i in range(1, 2):
    pool = ThreadPool(processes=4)
    async_result = pool.apply_async(helper.Helper.transform(df.copy(), i))
    lst.append(async_result)

results = []
for item in lst:
    currentitem = item.get()
    results.append(currentitem)
Helper Method:
@staticmethod
def transform(df, i):
    return df
So I usually code in Java, and for a class I need to do some stuff in Python. I just don't understand why in this case I get the error:
Traceback (most recent call last):
File "C:/Users/Barry/file.py", line 28, in <module>
currentitem = item.get()
File "C:\Users\Barry\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\pool.py", line 768, in get
raise self._value
File "C:\Users\Barry\AppData\Local\Programs\Python\Python38-32\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
TypeError: 'DataFrame' object is not callable
A print in the thread function or before creating the thread results in proper output.
The issue is with the line:
async_result = pool.apply_async(helper.Helper.transform(df.copy(), i))
The catch: you're calling the function transform before passing it to apply_async. As a result, apply_async receives a DataFrame, "thinks" it's a function, and tries to call it asynchronously. The result is the exception you're seeing, and this result is saved as part of the AsyncResult object.
To fix it just change this line to:
async_result = pool.apply_async(helper.Helper.transform, (df.copy(), i))
Note that apply_async takes two arguments: the function and a tuple of arguments to pass to it.
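For reference, here is a minimal self-contained sketch of the corrected pattern. DummyHelper is a stand-in for the helper.Helper class from the question, which isn't shown:

from multiprocessing.pool import ThreadPool

import pandas as pd


class DummyHelper:
    # stand-in for helper.Helper from the question
    @staticmethod
    def transform(df, i):
        return df


df = pd.DataFrame({'a': [1, 2, 3]})
lst = []
pool = ThreadPool(processes=4)
for i in range(1, 2):
    # pass the function and its arguments separately; do not call it here
    lst.append(pool.apply_async(DummyHelper.transform, (df.copy(), i)))

results = [item.get() for item in lst]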
I started from this question: How to chain a Celery task that returns a list into a group?
But I want to expand twice. So in my use case I have:
task A: determines total number of items for a given date
task B: downloads 1000 metadata entries for that date
task C: download the content for one item
So at each step I'm expanding the number of items for the next step. I could do it by looping through the results in my task and calling .delay() on the next task function, but I thought I'd try not to make my main tasks do that. Instead, they'd return a list of tuples, and each tuple would then be expanded into the arguments for a call to the next function.
The above question has an answer that appears to meet my need, but I can't work out the correct way of chaining it for a two level expansion.
Here is a very cut down example of my code:
from celery import group
from celery.task import subtask
from celery.utils.log import get_task_logger

from .celery import app

logger = get_task_logger(__name__)


@app.task
def task_range(upper=10):
    # wrap in list to make JSON serializer work
    return list(zip(range(upper), range(upper)))


@app.task
def add(x, y):
    logger.info(f'x is {x} and y is {y}')
    char = chr(ord('a') + x)
    char2 = chr(ord('a') + x*2)
    result = x + y
    logger.info(f'result is {result}')
    return list(zip(char * result, char2 * result))


@app.task
def combine_log(c1, c2):
    logger.info(f'combine log is {c1}{c2}')


@app.task
def dmap(args_iter, celery_task):
    """
    Takes an iterator of argument tuples and queues them up for celery to run with the function.
    """
    logger.info(f'in dmap, len iter: {len(args_iter)}')
    callback = subtask(celery_task)
    run_in_parallel = group(callback.clone(args) for args in args_iter)
    return run_in_parallel.delay()
I've then tried various ways to make my nested mapping work. First, a one level mapping works fine, so:
pp = (task_range.s() | dmap.s(add.s()))
pp(2)
Produces the kind of results I'd expect, so I'm not totally off.
But when I try to add another level:
ppp = (task_range.s() | dmap.s(add.s() | dmap.s(combine_log.s())))
Then in the worker I see the error:
[2019-11-23 22:34:12,024: ERROR/ForkPoolWorker-2] Task proj.tasks.dmap[e92877a9-85ce-4f16-88e3-d6889bc27867] raised unexpected: TypeError("add() missing 2 required positional arguments: 'x' and 'y'",)
Traceback (most recent call last):
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/app/trace.py", line 648, in __protected_call__
return self.run(*args, **kwargs)
File "/home/hdowner/dev/playground/celery/proj/tasks.py", line 44, in dmap
return run_in_parallel.delay()
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/canvas.py", line 186, in delay
return self.apply_async(partial_args, partial_kwargs)
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/canvas.py", line 1008, in apply_async
args=args, kwargs=kwargs, **options))
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/canvas.py", line 1092, in _apply_tasks
**options)
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/canvas.py", line 578, in apply_async
dict(self.options, **options) if options else self.options))
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/canvas.py", line 607, in run
first_task.apply_async(**options)
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/canvas.py", line 229, in apply_async
return _apply(args, kwargs, **options)
File "/home/hdowner/.venv/play_celery/lib/python3.6/site-packages/celery/app/task.py", line 532, in apply_async
check_arguments(*(args or ()), **(kwargs or {}))
TypeError: add() missing 2 required positional arguments: 'x' and 'y'
And I'm not sure why changing the argument to dmap() from a plain task signature to a chain changes how the arguments get passed into add(). My impression was that it shouldn't; it should just mean that the return value of add() gets passed on. But apparently that is not the case ...
Turns out the problem is that the clone() method on a chain instance does not pass the arguments through at some point - see https://stackoverflow.com/a/53442344/3189 for the full details. If I use the method in that answer, my dmap() code becomes:
from copy import deepcopy  # needed by clone_signature below


@app.task
def dmap(args_iter, celery_task):
    """
    Takes an iterator of argument tuples and queues them up for celery to run with the function.
    """
    callback = subtask(celery_task)
    run_in_parallel = group(clone_signature(callback, args) for args in args_iter)
    return run_in_parallel.delay()


def clone_signature(sig, args=(), kwargs=(), **opts):
    """
    Turns out that a chain clone() does not copy the arguments properly - this
    clone does.
    From: https://stackoverflow.com/a/53442344/3189
    """
    if sig.subtask_type and sig.subtask_type != "chain":
        raise NotImplementedError(
            "Cloning only supported for Tasks and chains, not {}".format(sig.subtask_type)
        )
    clone = sig.clone()
    if hasattr(clone, "tasks"):
        # for a chain, apply the arguments to the first task in the chain
        task_to_apply_args_to = clone.tasks[0]
    else:
        task_to_apply_args_to = clone
    args, kwargs, opts = task_to_apply_args_to._merge(args=args, kwargs=kwargs, options=opts)
    task_to_apply_args_to.update(args=args, kwargs=kwargs, options=deepcopy(opts))
    return clone
And then when I do:
ppp = (task_range.s() | dmap.s(add.s() | dmap.s(combine_log.s())))
everything works as expected.
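For completeness, the two-level version is invoked the same way as the one-level pp above; the argument is the upper bound passed to task_range:

ppp = (task_range.s() | dmap.s(add.s() | dmap.s(combine_log.s())))
ppp(2)  # or ppp.delay(2) to queue it explicitly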
Thanks for the great answer. I had to tweak the code to make sure it could handle tasks with single arguments. I am sure this is awful, but it works! Any improvements appreciated.
@celery_app.task(name='app.worker.dmap')
def dmap(args_iter, celery_task):
    """
    Takes an iterator of argument tuples and queues them up for celery to run with the function.
    """
    callback = subtask(celery_task)
    print(f"ARGS: {args_iter}")
    args_list = []
    run_in_parallel = group(clone_signature(callback, args if type(args) is list else [args]) for args in args_iter)
    print(f"Finished Loops: {run_in_parallel}")
    return run_in_parallel.delay()
Specifically, I added:
if type(args) is list else [args]
to this line:
run_in_parallel = group(clone_signature(callback, args if type(args) is list else [args]) for args in args_iter)
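In other words, the tweak just normalizes each element of args_iter to a list of positional arguments before cloning. A tiny illustration of the intent (hypothetical values):

def normalize(args):
    # same expression as used in dmap above
    return args if type(args) is list else [args]

print(normalize(5))       # [5]     - a single argument is wrapped in a one-element list
print(normalize([1, 2]))  # [1, 2]  - an argument list is passed through unchanged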
This is my code.
from multiprocessing.managers import BaseManager
from threading import Thread


def manager1():
    my_dict = {}
    my_dict['key'] = "value"
    print(my_dict['key'])  # this works

    class SyncManager(BaseManager): pass
    SyncManager.register('get_my_dict', callable=lambda: my_dict)
    n = SyncManager(address=('localhost', 50001), authkey=b'secret')
    t = n.get_server()
    t.serve_forever()


def get_my_dict_from_the_manager():
    class SyncManager(BaseManager): pass
    SyncManager.register('get_my_dict')
    n = SyncManager(address=('localhost', 50001), authkey=b'secret')
    n.connect()
    my_dict = n.get_my_dict()
    return my_dict


thread1 = Thread(target=manager1)
thread1.daemon = True
thread1.start()

my_dict = get_my_dict_from_the_manager()

print(my_dict.keys())  # this works
print(my_dict['key'])  # DOES NOT WORK
On the last line of the script, I try to access a value in the dictionary my_dict by subscripting with a key. This throws an error. This is my terminal output:
value
['key']
Traceback (most recent call last):
File "/home/magnus/PycharmProjects/docker-falcon/app/so_test.py", line 31, in <module>
print(my_dict['key'])
TypeError: 'AutoProxy[get_my_dict]' object is not subscriptable
Process finished with exit code 1
It seems the AutoProxy object sort of behaves like the dict it is supposed to proxy, but not quite. Is there a way to make it subscriptable?
The problem is that the AutoProxy object does not expose the __getitem__ method that a dict normally has. An answer to my similar question allows you to access items by their key: simply replace print(my_dict['key']) with print(my_dict.get('key'))
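For reference, a minimal sketch of the adjusted client-side lines (the same script as above, with only the last print changed):

my_dict = get_my_dict_from_the_manager()
print(my_dict.keys())      # works: keys() is forwarded by the proxy
print(my_dict.get('key'))  # works: get() is forwarded, unlike __getitem__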
Error
Traceback (most recent call last):
File "C:/Users/RCS/Desktop/Project/SHM.py", line 435, in <module>
app = SHM()
File "C:/Users/RCS/Desktop/Project/SHM.py", line 34, in __init__
frame = F(container, self)
File "C:/Users/RCS/Desktop/Project/SHM.py", line 384, in __init__
if "3202" in q:
TypeError: argument of type 'method' is not iterable
Code
# some part of code, initialisation and all

while 1:
    q = variable1.get
    if "3202" in q:
        variable2.set("NI NODE3202")
        try:
            switch(labelframe2, labelframe1)
        except:
            switch(labelframe3, labelframe1)
    elif "3212" in q:
        variable2.set("NI NODE3212")
        try:
            switch(labelframe1, labelframe2)
        except:
            switch(labelframe3, labelframe2)
    elif "3214" in q:
        variable2.set("NI NODE3214")
        try:
            switch(labelframe1, labelframe3)
        except:
            switch(labelframe2, labelframe3)
    else:
        None

# some other part of code

def switch(x, y):
    if x.isGridded:
        x.isGridded = False
        x.grid_forget()
        y.isGridded = True
        y.grid(row=0, column=0)
    else:
        return False
I am trying to create a switch between three labelframes which are inside another labelframe, and outside this labelframe are other labelframes that are not changing.
I have read some similar answers but I don't want to use __iter__() in my code. Can anybody provide any other suggestions?
You forgot to call the Entry.get() method:
q = variable1.get()
# ^^ call the method
Because the method object itself doesn't support containment testing directly, Python is instead trying to iterate over the object to see if there are any elements contained in it that match your string.
If you call the method, you get a string value instead. Strings do support containment testing.
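As a quick illustration of the difference, here is a minimal, hypothetical sketch; it assumes variable1 is a tkinter StringVar, which is what the .get()/.set() calls in the question suggest:

from tkinter import Tk, StringVar

root = Tk()
variable1 = StringVar(value="NI NODE3202")

q = variable1.get     # a bound method object, not a string
# "3202" in q         # would raise: argument of type 'method' is not iterable

q = variable1.get()   # calling the method returns the current string value
print("3202" in q)    # True: strings support containment testing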
The reason you got that error is that you did not add "()" after .get. To fix this, change q = variable1.get to q = variable1.get().
short short version:
I am having trouble parallelizing code which uses instance methods.
Longer version:
This Python code produces the error:
Error
Traceback (most recent call last):
File "/Users/gilzellner/dev/git/3.2.1-build/cloudify-system-tests/cosmo_tester/test_suites/stress_test_openstack/test_file.py", line 24, in test
self.pool.map(self.f, [self, url])
File "/Users/gilzellner/.virtualenvs/3.2.1-build/lib/python2.7/site-packages/pathos/multiprocessing.py", line 131, in map
return _pool.map(star(f), zip(*args)) # chunksize
File "/Users/gilzellner/.virtualenvs/3.2.1-build/lib/python2.7/site-packages/multiprocess/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/Users/gilzellner/.virtualenvs/3.2.1-build/lib/python2.7/site-packages/multiprocess/pool.py", line 567, in get
raise self._value
AttributeError: 'Test' object has no attribute 'get_type'
This is a simplified version of a real problem I have.
import urllib2
from time import sleep
from os import getpid
import unittest

from pathos.multiprocessing import ProcessingPool as Pool


class Test(unittest.TestCase):

    def f(self, x):
        print urllib2.urlopen(x).read()
        print getpid()
        return

    def g(self, y, z):
        print y
        print z
        return

    def test(self):
        url = "http://nba.com"
        self.pool = Pool(processes=1)
        for x in range(0, 3):
            self.pool.map(self.f, [self, url])
            self.pool.map(self.g, [self, url, 1])
        sleep(10)
I am using pathos.multiprocessing due to the recommendation here:
Multiprocessing: Pool and pickle Error -- Pickling Error: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
Before using pathos.multiprocessing, the error was:
"PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed"
You're using the multiprocessing map method incorrectly.
According to the Python docs:
A parallel equivalent of the map() built-in function (it supports only one iterable argument though).
Where the standard map:
Apply function to every item of iterable and return a list of the results.
Example usage:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
What you're looking for is the apply_async method:
def test(self):
    url = "http://nba.com"
    self.pool = Pool(processes=1)
    for x in range(0, 3):
        self.pool.apply_async(self.f, args=(self, url))
        self.pool.apply_async(self.g, args=(self, url, 1))
    sleep(10)
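If you also need the workers' return values: with the standard library's multiprocessing.Pool (as in the docs example above), apply_async returns an AsyncResult whose .get() yields the result or re-raises the worker's exception. A minimal sketch of that pattern (this assumes the stdlib Pool, not pathos):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=2)
    async_results = [pool.apply_async(square, args=(i,)) for i in range(3)]
    print([r.get() for r in async_results])  # [0, 1, 4]; get() re-raises worker errors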
The error indicates you are trying to read an attribute which is not defined on the Test object.
AttributeError: 'Test' object has no attribute 'get_type'
In your class Test, you haven't defined a get_type method or any such attribute, hence the error.
Apologies in advance, but I am unable to post a fully working example (too much overhead in this code to distill to a runnable snippet). I will post as much explanatory detail as I can, and please do let me know if anything critical seems missing.
Running Python 2.7.5 through IDLE
I am writing a program to compare two text files. Since the files can be large (~500MB) and each row comparison is independent, I would like to implement multiprocessing to speed up the comparison. This is working pretty well, but I am getting stuck on a pseudo-random Bad file descriptor error. I am new to multiprocessing, so I guess there is a technical problem with my implementation. Can anyone point me in the right direction?
Here is the code causing the trouble (specifically the pool.map):
# open files
csvReaderTest = csv.reader(open(testpath, 'r'))
csvReaderProd = csv.reader(open(prodpath, 'r'))
compwriter = csv.writer(open(outpath, 'wb'))

pool = Pool()
num_chunks = 3
chunksTest = itertools.groupby(csvReaderTest, keyfunc)
chunksProd = itertools.groupby(csvReaderProd, keyfunc)

while True:
    # make a list of num_chunks chunks
    groupsTest = [list(chunk) for key, chunk in itertools.islice(chunksTest, num_chunks)]
    groupsProd = [list(chunk) for key, chunk in itertools.islice(chunksProd, num_chunks)]
    # merge the two lists (pair off comparison rows)
    groups_combined = zip(groupsTest, groupsProd)
    if groups_combined:
        # http://stackoverflow.com/questions/5442910/python-multiprocessing-pool-map-for-multiple-arguments
        a_args = groups_combined  # a list - set of combinations to be tested
        second_arg = True
        worker_result = pool.map(worker_mini_star, itertools.izip(itertools.repeat(second_arg), a_args))
Here is the full error output. (The error sometimes occurs; other times the comparison runs to completion without problems):
Traceback (most recent call last):
File "H:/<PATH_SNIP>/python_csv_compare_multiprocessing_rev02_test2.py", line 407, in <module>
main(fileTest, fileProd, fileout, stringFields, checkFileLengths)
File "H:/<PATH_SNIP>/python_csv_compare_multiprocessing_rev02_test2.py", line 306, in main
worker_result = pool.map(worker_mini_star, itertools.izip(itertools.repeat(second_arg),a_args))
File "C:\Python27\lib\multiprocessing\pool.py", line 250, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Python27\lib\multiprocessing\pool.py", line 554, in get
raise self._value
IOError: [Errno 9] Bad file descriptor
If it helps, here are the functions called by pool.map:
def worker_mini(flag, chunk):
    row_comp = []
    for entry, entry2 in zip(chunk[0][0], chunk[1][0]):
        if entry == entry2:
            temp_comp = entry
        else:
            temp_comp = '%s|%s' % (entry, entry2)
        row_comp.append(temp_comp)
    return True, row_comp


# takes a single tuple argument and unpacks the tuple to multiple arguments
def worker_mini_star(flag_chunk):
    """Convert `f([1,2])` to `f(1,2)` call."""
    return worker_mini(*flag_chunk)


def main():