ERROR:root:can't pickle fasttext_pybind.fasttext objects - python

I am using gunicorn with multiple workers for my machine learning project. The problem is that when I send a train request, only the worker that receives the request gets updated with the latest model after training is done. It is worth mentioning that, to make inference faster, I load the model once after each training. As a result, only the worker used for the current training operation loads the latest model, while the other workers keep the previously loaded one. Right now the model file (binary format) is loaded once after each training into a global dictionary, where the key is the model name and the value is the model object. Obviously, this problem would not occur if I reloaded the model from disk for every prediction, but I cannot do that, as it would make prediction slower.
I studied global variables further, and my investigation shows that in a multi-processing environment each worker (process) creates its own copy of the global variables. Apart from the binary model file, I have some other global variables (dictionaries) that need to be synced across all processes. So, how do I handle this situation?
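This copy-on-fork behavior is easy to reproduce outside gunicorn. Below is a minimal sketch (stdlib only; the `MODELS` registry and `_retrain` function are hypothetical stand-ins, not the original project code): a child process mutates a module-level dictionary, and the parent never sees the change.

```python
import multiprocessing

MODELS = {"fasttext": "old-model"}  # hypothetical per-worker model registry

def _retrain():
    # Runs in the child process: this mutates only the child's copy of MODELS.
    MODELS["fasttext"] = "new-model"

def parent_view_after_child_update():
    ctx = multiprocessing.get_context("fork")  # POSIX fork, as gunicorn uses
    p = ctx.Process(target=_retrain)
    p.start()
    p.join()
    return MODELS["fasttext"]  # the parent's copy is untouched

if __name__ == "__main__":
    print(parent_view_after_child_update())  # -> old-model
```

The same thing happens in reverse under gunicorn: the worker that retrains updates its own copy of the dictionary, and its siblings keep theirs.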
TL;DR: I need some approach to store a variable that is common across all the processes (workers). Is there any way to do this, with multiprocessing.Manager, dill, etc.?
Update 1: I have multiple machine learning algorithms in my project, each with its own model file; these are loaded into memory in a dictionary where the key is the model name and the value is the corresponding model object. I need to share all of them (in other words, I need to share the dictionary). But some of the models are not pickle-serializable, like FastText. So when I try to use a proxy variable (in my case a dictionary holding the models) with multiprocessing.Manager, I get an error for those non-pickle-serializable objects while assigning the loaded model to the dictionary, like: can't pickle fasttext_pybind.fasttext objects. More information on multiprocessing.Manager can be found here: Proxy Objects
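The limitation is not specific to FastText: a Manager dict pickles every value assigned to it, and pickle rejects any object whose state lives in native code. A minimal sketch of the same failure class, using a stdlib thread lock as a stand-in for the pybind-backed model (the `is_picklable` helper is hypothetical):

```python
import pickle
import threading

def is_picklable(obj):
    # A Manager dict calls pickle on every assigned value; this mimics that check.
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

if __name__ == "__main__":
    print(is_picklable({"name": "fasttext"}))  # True: plain data is fine
    # A lock wraps OS-level state implemented in C, so pickle raises TypeError,
    # just as it does for the fasttext_pybind.fasttext object.
    print(is_picklable(threading.Lock()))      # False
```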
Here is a summary of what I have done:
import multiprocessing
import fasttext
mgr = multiprocessing.Manager()
model_dict = mgr.dict()
model_file = fasttext.load_model("path/to/model/file/which/is/in/.bin/format")
model_dict["fasttext"] = model_file # This line throws this error
Error:
can't pickle fasttext_pybind.fasttext objects
I printed the model_file which I am trying to assign, it is:
<fasttext.FastText._FastText object at 0x7f86e2b682e8>
Update 2:
According to this answer I modified my code a little bit:
import fasttext
from multiprocessing.managers import SyncManager
def Manager():
    m = SyncManager()
    m.start()
    return m
# As the model file has type "<fasttext.FastText._FastText object at 0x7f86e2b682e8>",
# register fasttext.FastText._FastText as its class
SyncManager.register("fast", fasttext.FastText._FastText)
# Now this is the Manager as a replacement of the old one.
mgr = Manager()
ft = mgr.fast() # This line gives error.
This gives me EOFError.
Update 3: I tried using dill, both with multiprocessing and multiprocess. The changes are summarized as follows:
import multiprocessing
import multiprocess
import dill
# Any one of the following two lines
mgr = multiprocessing.Manager() # Or,
mgr = multiprocess.Manager()
model_dict = mgr.dict()
... ... ...
... ... ...
model_file = dill.dumps(model_file) # This line throws the error
model_dict["fasttext"] = model_file
... ... ...
... ... ...
# During loading
model_file = dill.loads(model_dict["fasttext"])
But still getting the error: can't pickle fasttext_pybind.fasttext objects.
Update 4:
This time I am using another library called jsonpickle. Serialization and de-serialization seem to occur properly (no issue is reported while running). But surprisingly enough, after de-serialization, whenever I make a prediction it hits a segmentation fault. More details and the steps to reproduce it can be found here: Segmentation fault (core dumped)
Update 5: I tried cloudpickle and srsly, but couldn't make the program work.

For the sake of completeness I am providing the solution that worked for me. All the approaches I tried for serializing FastText went in vain. Finally, as @MedetTleukabiluly mentioned in a comment, I managed to use redis-pubsub to tell the other workers to load the model from disk. Obviously, this does not actually share the model from the same memory space; rather, it just sends a message to the other workers informing them that they should reload the model from disk (because a new training just happened). Here is the general solution:
# redis_pubsub.py
import logging
import os
import socket
import threading
import time

import fasttext

"""The whole purpose of GLOBAL_NAMESPACE is to keep the whole pubsub mechanism separate,
as another service might also be publishing on the same channel.
"""
GLOBAL_NAMESPACE = "SERVICE_0"

def get_ip():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # doesn't even have to be reachable
        s.connect(('10.255.255.255', 1))
        IP = s.getsockname()[0]
    except Exception:
        IP = '127.0.0.1'
    finally:
        s.close()
    return IP

class RedisPubSub:
    def __init__(self):
        self.redis_client = get_redis_client()  # TODO: A SAMPLE METHOD WHICH CAN RETURN YOUR REDIS CLIENT (you have to implement)
        # The unique ID identifies which worker on which server is the publisher,
        # so a worker can ignore messages that it sent itself.
        self.unique_id = "IP_" + get_ip() + "__" + str(GLOBAL_NAMESPACE) + "__" + "PID_" + str(os.getpid())

    def listen_to_channel_and_update_models(self, channel):
        try:
            pubsub = self.redis_client.pubsub()
            pubsub.subscribe(channel)
        except Exception as exception:
            logging.error(f"REDIS_ERROR: Model Update Listening: {exception}")
        while True:
            try:
                message = pubsub.get_message()
                # A successful operation gives 1 and an unsuccessful one gives 0;
                # we are not interested in receiving these flags
                if message and message["data"] != 1 and message["data"] != 0:
                    message = str(message["data"].decode("utf-8"))
                    splitted_msg = message.split("__SEPERATOR__")
                    # Not only make sure the message comes from another worker,
                    # but also that sender and receiver (i.e. both workers) are under the same namespace
                    if (splitted_msg[0] != self.unique_id) and (splitted_msg[0].split('__')[1] == GLOBAL_NAMESPACE):
                        algo_name = splitted_msg[1]
                        model_path = splitted_msg[2]
                        # Fasttext
                        if "fasttext" in algo_name:
                            try:
                                # TODO: YOU WILL GET THE NEWLY LOADED FILE IN model_file. USE IT TO UPDATE THE OLD ONE.
                                model_file = fasttext.load_model(model_path + '.bin')
                            except Exception as exception:
                                logging.error(exception)
                            else:
                                logging.info(f"{algo_name} model is updated for process with unique_id: {self.unique_id} by process with unique_id: {splitted_msg[0]}")
                time.sleep(1)  # sleep for 1 second to avoid hammering the CPU too much
            except Exception as exception:
                time.sleep(1)
                logging.error(f"PUBSUB_ERROR: Model or component update: {exception}")

    def publish_to_channel(self, channel, algo_name, model_path):
        def _publish_to_channel():
            try:
                message = self.unique_id + '__SEPERATOR__' + str(algo_name) + '__SEPERATOR__' + str(model_path)
                time.sleep(3)
                self.redis_client.publish(channel, message)
            except Exception as exception:
                logging.error(f"PUBSUB_ERROR: Model or component publishing: {exception}")
        # The delay before publishing would pause subsequent independent activities,
        # hence the publishing is done in another thread.
        thread = threading.Thread(target=_publish_to_channel)
        thread.start()
You also have to start the listener:
import threading

from redis_pubsub import RedisPubSub

pubsub = RedisPubSub()

# start the listener:
thread = threading.Thread(target=pubsub.listen_to_channel_and_update_models,
                          args=("sync-ml-models", ))
thread.start()
From the fasttext training module, when you finish training, publish this message to the other workers so that they get a chance to re-load the model from disk:
# fasttext_api.py
from redis_pubsub import RedisPubSub

pubsub = RedisPubSub()
pubsub.publish_to_channel(channel="sync-ml-models",  # a sample name for the channel
                          algo_name="fasttext",
                          model_path="path/to/fasttext/model")

Related

pythoncom.CoInitialize() is not called and the program terminates

I'm working on a Python program that is supposed to read incoming MS-Word documents in a client/server fashion, i.e. the client sends a request (one or multiple MS-Word documents) and the server reads specific content from those requests using pythoncom and win32com.
Because I want to minimize waiting time for the client (the client needs a status message from the server), I do not want to open an MS-Word instance for every request. Hence, I intend to have a pool of running MS-Word instances from which the server can pick and choose. This, in turn, means I have to reuse those instances from the pool in different threads, and this is what causes trouble right now.
After I fixed the following error I asked about previously on Stack Overflow, my code now looks like this:
import datetime
import os
import queue
import threading
import time

import psutil
import pythoncom
import win32com.client

class WordInstance:
    def __init__(self, app):
        self.app = app
        self.flag = True

appPool = {'WINWORD.EXE': queue.Queue()}

def initAppPool():
    global appPool
    wordApp = win32com.client.DispatchEx('Word.Application')
    appPool["WINWORD.EXE"].put(wordApp)  # For testing purposes I currently use only one MS-Word instance

def run_in_thread(instance, appid, path):
    print(f"[{datetime.datetime.now()}] open doc ... {threading.current_thread().name}")
    pythoncom.CoInitialize()
    wordApp = win32com.client.Dispatch(pythoncom.CoGetInterfaceAndReleaseStream(appid, pythoncom.IID_IDispatch))
    doc = wordApp.Documents.Open(path)
    doc.SaveAs(rf'{path}.FB.pdf', FileFormat=17)
    doc.Close()
    print(f"[{datetime.datetime.now()}] close doc ... {threading.current_thread().name}")
    instance.flag = True

if __name__ == '__main__':
    initAppPool()
    pathOfFile2BeRead1 = r'C:\Temp\file4.docx'
    pathOfFile2BeRead2 = r'C:\Temp\file5.docx'

    # treat first request
    wordApp = appPool["WINWORD.EXE"].get(True, 10)
    wordApp.flag = False
    pythoncom.CoInitialize()
    wordApp_id = pythoncom.CoMarshalInterThreadInterfaceInStream(pythoncom.IID_IDispatch, wordApp.app)
    readDocjob1 = threading.Thread(target=run_in_thread, args=(wordApp, wordApp_id, pathOfFile2BeRead1), daemon=True)
    readDocjob1.start()
    appPool["WINWORD.EXE"].put(wordApp)

    # wait here until readDocjob1 is done
    wait = True
    while wait:
        try:
            wordApp = appPool["WINWORD.EXE"].get(True, 1)
            if wordApp.flag:
                print(f"[{datetime.datetime.now()}] ok appPool extracted")
                wait = False
            else:
                appPool["WINWORD.EXE"].put(wordApp)
        except queue.Empty:
            print(f"[{datetime.datetime.now()}] error: appPool empty")
        except BaseException as err:
            print(f"[{datetime.datetime.now()}] error: {err}")

    wordApp.flag = False
    openDocjob2 = threading.Thread(target=run_in_thread, args=(wordApp, wordApp_id, pathOfFile2BeRead2), daemon=True)
    openDocjob2.start()
When I run the script I receive the following output printed on the terminal:
[2022-03-29 11:41:08.217678] open doc ... Thread-1
[2022-03-29 11:41:10.085999] close doc ... Thread-1
[2022-03-29 11:41:10.085999] ok appPool extracted
[2022-03-29 11:41:10.085999] open doc ... Thread-2
Process finished with exit code 0
And only the first Word file is converted to a PDF. It seems like run_in_thread terminates after the print statement and before/during pythoncom.CoInitialize(). Sadly, I do not receive any error message, which makes it quite hard to understand the cause of this behavior.
After reading Microsoft's documentation I tried using pythoncom.CoInitializeEx(pythoncom.APARTMENTTHREADED) instead of pythoncom.CoInitialize(), since my COM object needs to be called by multiple threads. However, this changed nothing.

LIGHTGBM pickle output not working with multiple SANIC workers (>1) but working with single worker

I am trying to load machine learning model output with Sanic. I load the output in the main method (defined globally). It works fine when I set sanic workers to 1, but it does not work with multiple sanic workers when the model is defined globally; my code waits indefinitely for the model to generate the desired result.
It works when I load the model output inside a function (e.g. here in the method modelrun), even if sanic workers >= 1.
It works when I load the model output globally (outside the function), but only if sanic workers = 1.
It does not work when I load the model output globally (outside the function) if sanic workers > 1.
import pickle

import pandas as pd
import sanic

def modelrun(df_f, lbg_model_smote_sel_vars):
    training_pred_smote = lbg_model_smote_sel_vars.predict_proba(df_f)
    return training_pred_smote

if __name__ == '__main__':
    df = pd.DataFrame()
    p_file_path = "/Users/pratiksha/FModel_06Jan_Smote_Sel_Vars_48.dat"
    pickle_file = open(p_file_path, 'rb')
    lbg_model_smote_sel_vars = pickle.load(pickle_file)
    modelrun(df, lbg_model_smote_sel_vars)
    app.run(host=app_host, port=int(app_port), debug=True,
            auto_reload=True, workers=int(10))
Versions Used
sanic==20.12.1
lightgbm==3.3.1
numpy==1.20.1
pandas==1.2.4
scikit-learn==1.0.2
scipy==1.6.2
Upgrade Sanic Version to 21.12.1
When you have multiple workers, Sanic will start a main process that manages several subprocesses. These subprocesses will be the application server workers.
Check out the lifecycle here:
https://sanicframework.org/en/guide/basics/listeners.html
After loading the pickle, the problem seems to be solved by adding the object to the app's context property. For example:
@app.listener("before_server_start")  # or main_process_start
async def startup(app, loop):
    loaded = pickle.load(your_file)
    app.ctx.pickle = loaded

@app.get("/")
async def handler(request):
    # use the unpickled object
    request.app.ctx.pickle

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using
...
import multiprocessing as mp
import urllib.request

def getImages(val):
    # Download images
    try:
        url =    # preprocess the url from the input val
        local =  # filename generation from global variables and rand stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance
Ok, I have found an answer.
A possible culprit was that the script got stuck connecting to or downloading from a URL, so I added a socket timeout to limit the time spent connecting and downloading an image.
Now the issue no longer bothers me.
Here is my complete code
...
import multiprocessing as mp
import socket
import urllib.request

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url =    # preprocess the url from the input val
        local =  # filename generation from global variables and rand stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue.
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really does launch separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all these instances: each of them tries to lock the IO process; the one that succeeds (e.g. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here : Multiprocessing useless with urllib2?
And more info about the GIL here : What is a global interpreter lock (GIL)?
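As a side note, urlretrieve is I/O-bound, and blocking network I/O releases the GIL, so a thread pool is often a simpler way to parallelize such downloads than a process pool. A minimal sketch (the `fetch_all` helper and the `(url, filename)` job format are hypothetical, not from the question):

```python
import concurrent.futures
import urllib.request

def download(job):
    # job is a (url, filename) pair; the GIL is released while waiting on the network.
    url, filename = job
    urllib.request.urlretrieve(url, filename)
    return url

def fetch_all(jobs, worker=download, max_workers=4):
    # Threads suffice for I/O-bound work; map preserves the input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, jobs))
```

Combined with the socket timeout from the accepted answer, a hung connection then blocks only one thread instead of a whole worker process.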

How can I leverage luigi for Openstack tasks

I want to use Luigi to manage workflows in OpenStack. I am new to Luigi. For starters, I just want to authenticate myself to OpenStack and then fetch the image list, flavor list, etc. using Luigi. Any help will be appreciated.
I am not good with Python, but I tried the code below. I am also not able to list images. Error: glanceclient.exc.HTTPNotFound: The resource could not be found. (HTTP 404)
import os
import sys

import glanceclient.v2.client as glclient
import luigi
import os_client_config
from luigi.mock import MockFile

def get_credentials():
    d = {}
    d['username'] = 'X'
    d['password'] = 'X'
    d['auth_url'] = 'X'
    d['tenant_name'] = 'X'
    d['endpoint'] = 'X'
    return d

class LookupOpenstack(luigi.Task):
    d = []

    def requires(self):
        pass

    def output(self):
        gc = glclient.Client(**get_credentials())
        images = gc.images.list()
        print("images", images)
        for i in images:
            print(i)
        return MockFile("images", mirror_on_stderr=True)

    def run(self):
        pass

if __name__ == '__main__':
    luigi.run(["--local-scheduler"], LookupOpenstack())
The general approach to this is just to write Python code that performs the tasks you want using the OpenStack API. https://docs.openstack.org/user-guide/sdk.html It looks like the error you are getting is addressed on the OpenStack site. https://ask.openstack.org/en/question/90071/glanceclientexchttpnotfound-the-resource-could-not-be-found-http-404/
You would then wrap this code in luigi Tasks as appropriate; there is nothing special about doing this with OpenStack, except that you must define the output() of your luigi tasks to match up with an output that indicates the task is done. Right now the work is being done in the output() method, which should instead be in the run() method; the output() method should just return what to look for to indicate that run() is complete, so that the task does not run() again when required by another task if it is already done.
It's really impossible to say more without understanding more details of your workflow.

How to access same multiprocessing namespace from different modules

I need to be able to create a shared object with a pySerial object in it. This object will be created by a child process only once, after finding the device in a list of locations. Other processes will use it later.
The Python multiprocessing manager can't detect changes to objects embedded in other objects.
So if I create manager:
import multiprocessing as mp
manager=mp.Manager()
ns=manager.Namespace()
I can share object between processes.
ns.obj = SerialReader()
where
class SerialReader(object):
    port = None

    def connect(self):
        # some code to test the connected device
        ...
        # end of that code
        ser = serial.Serial(device, etc)
        self.ser = ser
        # or
        self.saveport()  # for future use

    def saveport(self):
        self.port = self.ser._port
        ns.port = self.ser._port
Now I will run it in a child process:
p = mp.Process(target=ns.obj.connect)
p.start()
and print results:
print ns.obj.port
print ns.port
Output:
None
/dev/ttyACM0
I want to be able to use simple code like:
ns.obj.ser.write()
ns.obj.somemethod(arg)
where, inside the SerialReader class:
def somemethod(self, arg):
    if arg == condition:
        self.ser.write('some text %s' % arg)
but I can't reference ns.obj.ser because it will be considered undefined when run from a new process. We get the same situation if we reference self.ser in other methods inside conntrollers.py and try to run them in a new process.
EDIT: I found a way to import the namespace into a module:
from __main__ import ns
or by passing ns to __init__ when creating the object.
But the problem still exists: ns.obj is a NoneType object because the object is still in the process of being created, so I can't write ns.obj.ser = self.ser.
If I try to pass ns to SerialReader():
ns.obj = SerialReader(ns)
and print it inside SerialReader.__init__, I get:
<NamespaceProxy object, typeid 'Namespace' at 0xa21b70c; '__str__()' failed>
I can't add ser to it either.
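For the picklable part of the state, the usual workaround is to mutate a local copy and then re-assign the namespace attribute, since the proxy only notices top-level assignments. A minimal sketch with a hypothetical stand-in Reader class (a live serial.Serial handle would still fail to pickle, which is the core of the problem above):

```python
import multiprocessing

class Reader:
    # Stand-in for SerialReader, holding only picklable state.
    def __init__(self):
        self.port = None

def connect(ns):
    obj = ns.obj              # the proxy hands back a pickled *copy*
    obj.port = "/dev/ttyACM0"
    ns.obj = obj              # reassign so the manager stores the new state

def demo():
    ctx = multiprocessing.get_context("fork")  # POSIX fork for this sketch
    with ctx.Manager() as mgr:
        ns = mgr.Namespace()
        ns.obj = Reader()
        p = ctx.Process(target=connect, args=(ns,))
        p.start()
        p.join()
        return ns.obj.port

if __name__ == "__main__":
    print(demo())  # the parent now sees "/dev/ttyACM0"
```

Without the `ns.obj = obj` reassignment, the child's mutation stays in its local copy, which is exactly the None/ttyACM0 discrepancy shown in the question's output.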
