I've actually asked a question about multiprocessing before, but now I'm running into a weird shortcoming with the type of data that gets returned.
I'm using Gspread to interface with Google's Sheets API and get a "worksheet" object back.
This object, or an aspect of this object, is apparently incompatible with multiprocessing due to being "unpickle-able". Please see output:
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<Worksheet 'Activation Log' id:o12345wm>]'.
Reason: 'UnpickleableError(<ssl.SSLContext object at 0x1e4be30>,)'
The code I'm using is essentially:
import multiprocessing.pool
from oauth2client.client import SignedJwtAssertionCredentials
import gspread

sheet = 1
pool = multiprocessing.pool.Pool(1)
# get_a_worksheet (defined elsewhere) logs in and returns a gspread Worksheet object
p = pool.apply_async(get_a_worksheet, args=(sheet,))
worksheet = p.get()  # <- fails here with the MaybeEncodingError shown above
And the script fails while attempting to "get" the results. The get_a_worksheet function returns a Gspread worksheet object that allows me to manipulate the remote sheet. Being able to upload changes to the document is important here - I'm not just trying to reference data, I need to alter it as well.
Does anyone know how I can run a subprocess in a separate and monitorable thread, and get an arbitrary (or custom) object type safely out of it at the end? Does anyone know what makes the ssl.SSLContext object special and "unpickleable"?
Thanks all in advance.
Multiprocessing uses pickling to pass objects between processes, so I do not believe you can use multiprocessing to return an object that cannot be pickled.
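For what it's worth, you can reproduce the underlying failure directly: ssl.SSLContext wraps native OpenSSL state, which the pickle machinery refuses. A quick check (the protocol constant is just one that exists on both Python 2.7 and 3):

import pickle
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)  # the kind of object embedded in the Worksheet's HTTP session
pickle.dumps(ctx)                          # raises a pickling error, the same failure multiprocessing hits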
I ended up writing a solution around this shortcoming by having the sub-process simply perform the necessary work inside itself rather than return a Worksheet object.
What I ended up with was about half a dozen pairs of functions: a managing function and a multiprocessing "worker" function, each pair written to do one thing I needed done, but inside a sub-process so that it could be monitored and timed.
A hierarchical map would look something like:
Main()
    check_spreadsheet_for_a_string()
        check_spreadsheet_for_a_string_worker()
    get_hash_of_spreadsheet()
        get_hash_of_spreadsheet_worker()
    ... etc
Where the "worker" functions are the functions called in the multiprocessing setup, and the regular functions above them manage the sub-process and time it to make sure the overall program doesn't halt if the call to gspread internals hangs or takes too long.
I'm trying to get a pickled file from an S3 resource using the boto3 library's Object.get() method, from several processes simultaneously. This causes my program to get stuck in one of the processes (no exception is raised, and the program does not continue to the next line).
I tried to add a "Config" variable to the S3 connection. That didn't help.
import os
import pickle

import boto3
from botocore.client import Config

s3_item = _get_s3_name(descriptor_key)  # Returns a path string of the desired file
config = Config(connect_timeout=5, retries={'max_attempts': 0})
s3 = boto3.resource('s3', config=config)
bucket_uri = os.environ.get(*ct.S3_MICRO_SERVICE_BUCKET_URI)  # Returns a string of the bucket URI

estimator_factory_logger.debug(f"Calling s3 with item {s3_item} from URI {bucket_uri}")
model_file_from_s3 = s3.Bucket(bucket_uri).Object(s3_item)

estimator_factory_logger.debug("Loading bytes...")
model_content = model_file_from_s3.get()['Body'].read()  # <- Program gets stuck here

estimator_factory_logger.debug("Loading from pickle...")
est = pickle.loads(model_content)
No error message is raised. It seems that the get call is stuck in a deadlock.
Your help will be much appreciated.
Is there a possibility that one of the files in the bucket is just huge and the program simply takes a long time to read it?
If that's the case, as a debugging step I'd look at the model_file_from_s3.get()['Body'] object, which is a botocore.response.StreamingBody, and use set_socket_timeout() on it to try to force a timeout.
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html
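In code, that debugging step might look something like this (the timeout value is arbitrary):

body = model_file_from_s3.get()['Body']  # a botocore.response.StreamingBody
body.set_socket_timeout(10)              # seconds; fail with an error instead of hanging forever
model_content = body.read()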
The problem was that we created a subprocess after our main process had already opened several threads. Apparently this is a big no-no on Linux.
We fixed it by using "spawn" instead of "fork".
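Concretely, the fix amounts to selecting the start method once, before any pool or process is created (a sketch):

import multiprocessing

if __name__ == '__main__':
    # "fork" copies the parent's threads and lock state, which can deadlock;
    # "spawn" starts each worker from a fresh interpreter instead.
    multiprocessing.set_start_method('spawn')
    with multiprocessing.Pool(4) as pool:
        ...  # the S3 fetching work goes here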
I would like to use the multiprocess package, for example in this code.
I've tried to call the function create_new_population and distribute the data to 8 processes, but when I do, I get a pickle error.
Normally the function would run like this: self.create_new_population(self.pop_size)
I try to distribute the work like this:
f = self.create_new_population
pop = self.pop_size / 8
self.current_generation = [pool.apply_async(f, (pop,)) for _ in range(8)]  # args must be a tuple
I get Can't pickle local object 'exhaust.__init__.<locals>.tour_select'
or PermissionError: [WinError 5] Access is denied
I've read this thread carefully and also tried to bypass the error using an approach from Steven Bethard to allow method pickling/unpickling via copyreg:
def _pickle_method(method)
def _unpickle_method(func_name, obj, cls)
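For context, the copyreg-based recipe referred to above looks roughly like this (reconstructed from the commonly cited version, so it may differ in detail from what I actually ran):

import copyreg
import types

def _pickle_method(method):
    # pickle a bound method as (function name, instance, class)
    func_name = method.__func__.__name__
    obj = method.__self__
    cls = obj.__class__
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    # walk the MRO to find the function and re-bind it to the instance
    for klass in cls.__mro__:
        if func_name in klass.__dict__:
            return klass.__dict__[func_name].__get__(obj, cls)
    raise AttributeError(func_name)

copyreg.pickle(types.MethodType, _pickle_method, _unpickle_method)

Note that this recipe only covers bound methods; it does not cover local (nested) functions such as tour_select from the first error, which standard pickle refuses regardless.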
I also tried to use the pathos package, without any luck.
I know that the code should be called under an if __name__ == '__main__': block, but I would like to know whether this can be done with minimal changes to the code.
I have a program that interacts with and changes block devices (/dev/sda and such) on linux. I'm using various external commands (mostly commands from the fdisk and GNU fdisk packages) to control the devices. I have made a class that serves as the interface for most of the basic actions with block devices (for information like: What size is it? Where is it mounted? etc.)
Here is one such method querying the size of a partition:
def get_drive_size(device):
    """Returns the maximum size of the drive, in sectors.
    :device the device identifier (/dev/sda and such)"""
    query_proc = subprocess.Popen(["blockdev", "--getsz", device], stdout=subprocess.PIPE)
    # blockdev returns the number of 512B blocks in a drive
    output, error = query_proc.communicate()
    exit_code = query_proc.returncode
    if exit_code != 0:
        raise Exception("Non-zero exit code", str(error, "utf-8"))  # I have custom exceptions, this is slight pseudo-code
    return int(output)  # should always be valid
So this method accepts a block device path, and returns an integer. The tests will run as root, since this entire program will end up having to run as root anyway.
Should I try to test code such as these methods? If so, how? I could create and mount image files for each test, but this seems like a lot of overhead and is probably error-prone itself. The code expects block devices, so I cannot operate directly on image files in the file system.
I could try mocking, as some answers suggest, but this feels inadequate. It seems like I start to test the implementation of the method, if I mock the Popen object, rather than the output. Is this a correct assessment of proper unit-testing methodology in this case?
I am using python3 for this project, and I have not yet chosen a unit-testing framework. In the absence of other reasons, I will probably just use the default unittest framework included in Python.
You should look into the mock module (I think it's part of the unittest module now in Python 3).
It enables you to run tests without needing to depend on any external resources, while giving you control over how the mocks interact with your code.
I would start from the docs in Voidspace
Here's an example:
import unittest2 as unittest
import mock

class GetDriveSizeTestSuite(unittest.TestCase):
    @mock.patch('path/to/original/file.subprocess.Popen')
    def test_a_scenario_with_mock_subprocess(self, mock_popen):
        # blockdev would print the sector count to stdout; stub that out
        mock_popen.return_value.communicate.return_value = (b'1024', b'')
        mock_popen.return_value.returncode = 0
        self.assertEqual(1024, get_drive_size('some device'))
I created incl.py and ml.py. incl.py is meant to be loaded from one of several directories, each containing its own incl.py. ml.py is the "main" script that loads incl.py via read() and exec(). Each incl.py is expected to contain a set of functions with the same names and interfaces but possibly different behaviour.
ml.py starts one or more threads. Each thread should load incl.py from its individual directory. The loading works fine; however, the loaded functions seem to be unknown to the thread.
content of incl.py:
def printIncluded (parameter):
    print (parameter)
content of ml.py:
import threading
def threadContent (parameter):
    exec (open ("incl.py").read ())
    printIncluded (parameter)

thread = threading.Thread (target = threadContent, args = (("loaded from thread"),))
thread.start ()
As soon as I don't use threading it works, for example with the following content of ml.py:
exec (open ("incl.py").read ())
printIncluded ("directly loaded")
What has to be considered regarding exec() when working in threads?
I found a helpful hint at python globals: import vs. execfile.
Extending the statement
exec (open ("incl.py").read ())
to
exec (open ("incl.py").read (), globals ())
makes it work. However, it seems that I don't yet have a precise picture of local and global scopes and why that works. So the question is still open regarding the 'why'.
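An alternative that makes the namespace explicit, instead of reusing the module's globals, is to exec into a dedicated dictionary and look the loaded function up there (a sketch, not my original code):

def threadContent (parameter):
    namespace = {}
    exec (open ("incl.py").read (), namespace)  # definitions land in this dict
    namespace["printIncluded"] (parameter)      # fetch the loaded function explicitly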
Apart from that, my impression from reading a few other answers is that using import is preferred by a couple of true Pythonistas, but I didn't quite understand why (the 2nd one). At least the coding seems simpler using read() and exec() than constructing long sys.path extensions.
I'm trying to generate a uuid for a filename, and I'm also using the multiprocessing module. Unpleasantly, all of my uuids end up exactly the same. Here is a small example:
import multiprocessing
import uuid
def get_uuid( a ):
    ## Doesn't help to cycle through a bunch.
    #for i in xrange(10): uuid.uuid4()
    ## Doesn't help to reload the module.
    #reload( uuid )
    ## Doesn't help to load it at the last minute.
    ## (I simultaneously comment out the module-level import).
    #import uuid
    ## uuid1() does work, but it differs only in the first 8 characters and includes identifying information about the computer.
    #return uuid.uuid1()
    return uuid.uuid4()

def main():
    pool = multiprocessing.Pool( 20 )
    uuids = pool.map( get_uuid, range( 20 ) )
    for id in uuids: print id

if __name__ == '__main__': main()
I peeked into uuid.py's code, and it seems to use some OS-level routines for randomness depending on the platform, so I'm stumped as to a Python-level solution (something like reloading the uuid module or choosing a new random seed). I could use uuid.uuid1(), but only 8 digits differ, and I think those are derived exclusively from the time, which seems dangerous especially given that I'm multiprocessing (so the code could be executing at exactly the same moment). Is there some Wisdom out there about this issue?
This is the correct way to generate your own uuid4, if you need to do that:
import os, uuid
return uuid.UUID(bytes=os.urandom(16), version=4)
Python should be doing this automatically; this code is right out of uuid.uuid4, for the case when the native _uuid_generate_random doesn't exist. There must be something wrong with your platform's _uuid_generate_random.
If you have to do this, don't just work around it yourself and let everyone else on your platform suffer; report the bug.
I don't see a way to make this work either, but you could just generate all the uuids in the main process and pass them to the workers.
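Something along these lines, for example (a sketch modeled on the code in the question):

import multiprocessing
import uuid

def get_filename(job_uuid):
    # the uuid was generated in the parent process and is simply passed in
    return "file_%s.dat" % job_uuid.hex

def main():
    pool = multiprocessing.Pool(20)
    filenames = pool.map(get_filename, [uuid.uuid4() for _ in range(20)])
    for name in filenames:
        print(name)

if __name__ == '__main__':
    main()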
This works fine for me. Does your Python installation have os.urandom? If not, random number seeding will be very poor and would lead to this problem (assuming there's also no native UUID module, uuid._uuid_generate_random).
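A quick way to check (os.urandom raises NotImplementedError when no OS-level randomness source is available):

import os
os.urandom(16)  # should return 16 random bytes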
Currently I am working on a script which fetches files either from a zip archive or from disk. After fetching, the payload gets pushed to an external tool via a web API.
For performance reasons I used the multiprocessing.Pool.map method, and for the tmp file names uuid looked quite handy. But I ran into the same issue you ask about here.
First, please check out the official docs for uuid. There is an attribute on UUID objects called is_safe which tells you whether the uuid was generated in a multiprocessing-safe way. In my case it was not.
After some research, I finally changed my strategy and moved from uuid to process pid and name.
Because I just need the uuid for tmp file naming, the pid and name also work fine. We can access the current worker Process instance via multiprocessing.current_process(). If you really need a uuid, you could potentially integrate the worker pid into it somehow.
In addition, uuid uses system entropy for the generation (uuid source). Because it does not matter to me how the file is named, this solution also avoids draining entropy.
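A sketch of that kind of naming scheme (the prefix and the temp directory handling here are made up for illustration):

import multiprocessing
import os
import tempfile

def make_tmp_name(prefix="payload"):
    worker = multiprocessing.current_process()
    # worker.name is e.g. "ForkPoolWorker-3"; worker.pid is unique among live processes
    return os.path.join(tempfile.gettempdir(),
                        "{0}_{1}_{2}.tmp".format(prefix, worker.name, worker.pid))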