Using win32com with multithreading - python

I am working on a web app with CherryPy that needs to access a few applications via COM.
Right now I create a new instance of the application with each request, which means each request waits 3 seconds for the application to start and 0.01 seconds for the actual job.
I would like to start each COM application once, keep it alive, and reuse it on the following requests for a few seconds, because most of the time it is hit by a burst of 5-10 ajax requests and then nothing for hours.
Is it possible to share a COM object across all the threads of a CherryPy application?
Here is a summary of a few experiments that show how it works now on each request, and how it does not work across threads.
The following code successfully starts and stops Excel:
>>> import pythoncom, win32com.client
>>> def start():
        global xl
        xl = win32com.client.Dispatch('Excel.Application')

>>> def stop():
        global xl
        xl.quit()
        xl = None

>>> start()
>>> stop()
But the following code starts Excel and closes it after 3 seconds.
>>> import pythoncom, win32com.client, threading, time
>>> def start():
        global xl
        pythoncom.CoInitialize()
        xl = win32com.client.Dispatch('Excel.Application')
        time.sleep(3)

>>> threading.Thread(target=start).start()
I added the call to CoInitialize(), since otherwise the xl object would not work (see this post).
And I added the 3-second pause so I could see in Task Manager that the EXCEL.EXE process starts and stays alive for 3 seconds.
Why does it die after the thread that started it ends?
I checked the documentation of CoInitialize(), but I couldn't work out whether it is possible to get this to work in a multithreaded environment.

If you want to use win32com in multiple threads you need to do a little more work, as a COM object cannot be passed to a thread directly. You need to use CoMarshalInterThreadInterfaceInStream() and CoGetInterfaceAndReleaseStream() to pass the instance between threads:
import pythoncom, win32com.client, threading, time

def start():
    # Initialize
    pythoncom.CoInitialize()
    # Get instance
    xl = win32com.client.Dispatch('Excel.Application')
    # Create id
    xl_id = pythoncom.CoMarshalInterThreadInterfaceInStream(pythoncom.IID_IDispatch, xl)
    # Pass the id to the new thread
    thread = threading.Thread(target=run_in_thread, kwargs={'xl_id': xl_id})
    thread.start()
    # Wait for child to finish
    thread.join()

def run_in_thread(xl_id):
    # Initialize
    pythoncom.CoInitialize()
    # Get instance from the id
    xl = win32com.client.Dispatch(
        pythoncom.CoGetInterfaceAndReleaseStream(xl_id, pythoncom.IID_IDispatch)
    )
    time.sleep(5)

if __name__ == '__main__':
    start()
For more info see: https://mail.python.org/pipermail/python-win32/2008-June/007788.html

The answer from @Mauriusz Jamro (https://stackoverflow.com/a/27966218/7733418) was really helpful. Just to add to it: also make sure that you call
pythoncom.CoUninitialize()
at the end, so that there is no memory leak. You can call it somewhere after using CoInitialize(), before the thread that called CoInitialize() exits.
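For example, a worker thread might pair the two calls like this (a minimal sketch; the Dispatch target and the work done with the object are placeholders):
import pythoncom, win32com.client, threading

def worker():
    pythoncom.CoInitialize()
    try:
        xl = win32com.client.Dispatch('Excel.Application')
        # ... do the COM work with xl here ...
        xl = None  # drop the COM reference before uninitializing
    finally:
        pythoncom.CoUninitialize()

threading.Thread(target=worker).start()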

Try using multiprocessing. It worked for me, after a long search.
from multiprocessing import Process

def test():
    pass  # the work to run in the separate process goes here

if __name__ == '__main__':
    p = Process(target=test, args=())
    p.start()
    p.join()

Related

Apscheduler keeps spawned processes alive for no reason

EDIT: I found the issue. It was a problem with PyCharm. I ran the .py outside of PyCharm and it worked as expected. In PyCharm I enabled "Emulate terminal in output console" and it now also works there...
Expectations:
Apscheduler spawns a thread that checks a website for something.
If the something is found (or multiple of them), the thread spawns (multiple) processes to download it/them.
After five seconds the next check thread spawns, while the earlier downloads may continue in the background.
Problem:
The spawned processes never cease to exist, which makes other parts of the code (not included) not work, because I need to check whether the processes are done, etc.
If I use a simple time.sleep(5) instead (see code), it works as expected.
No, I cannot set max_instances to 1, because that would stop the scheduled job from running while there is an active download process.
Code:
import datetime
import multiprocessing
from apscheduler.schedulers.background import BackgroundScheduler

class DownloadThread(multiprocessing.Process):
    def __init__(self):
        super().__init__()
        print("Process started")

def main():
    print(multiprocessing.active_children())
    # prints: [<DownloadThread name='DownloadThread-1' pid=3188 parent=7088 started daemon>,
    #          <DownloadThread name='DownloadThread-3' pid=12228 parent=7088 started daemon>,
    #          <DownloadThread name='DownloadThread-2' pid=13544 parent=7088 started daemon>
    #          ...
    #         ]
    new_process = DownloadThread()
    new_process.daemon = True
    new_process.start()
    new_process.join()

if __name__ == '__main__':
    sched = BackgroundScheduler()
    sched.add_job(main, 'interval', args=(), seconds=5, max_instances=999, next_run_time=datetime.datetime.now())
    sched.start()

    while True:
        # main()  # works. Processes despawn.
        # time.sleep(5)
        input()

multiprocessing.Process target executing only 2 out of 3 times

I'm using the multiprocessing library to launch a Process in parallel with the main one. I use the target argument at initialisation to specify a function to execute. But roughly 1 out of 3 times the function is not executed.
After digging into the multiprocessing library and using monkey patches to debug, I found out that the _bootstrap method of BaseProcess (the Process class inherits from BaseProcess), which is supposed to call the function specified in the target parameter, was not called when the start() method of the Process was called.
As my OS is Ubuntu 18.04, the default method to start a process is fork. So the Popen used to launch the process is in the file popen_fork.py of the multiprocessing library. In this Popen class, the _launch method calls os.fork() and then calls the Process's _bootstrap method.
With a monkey patch, I found out that the code supposed to be executed in the child process is not executed at all, and this is why the function specified in the target parameter was not executed when start() was called.
It is not possible to reproduce the problem in a simpler environment than the one I am working in. But here is some code that represents what I am doing and illustrates my problem:
import time
from multiprocessing import Process
from multiprocessing.managers import BaseManager

class A:
    def __init__(self, manager):
        # manager is an object created by registering it in
        # multiprocessing.managers.BaseManager, so it is made for interprocess
        # communication
        self.manager = manager
        self.p = Process(target=self.process_method, args=(self.manager, ))

    def start(self):
        self.p.start()

    def process_method(self, manager):
        # This is the method that is not executed 2 out of 3 times
        print("(A.process_method) Entering method")
        c = 0
        while True:
            print(f"(A.process_method) Sending message : c = {c}")
            manager.on_reception(f"c = {c}")
            time.sleep(5)

class Manager:
    def __init__(self):
        self.msg = None
        self.unread_msg = False

    def on_reception(self, msg):
        self.msg = msg
        self.unread_msg = True

    def get_last_msg(self):
        if self.unread_msg:
            self.unread_msg = False
            return self.msg
        else:
            return None

if __name__ == "__main__":
    BaseManager.register("Manager", Manager)
    bm = BaseManager()
    bm.start()
    manager = bm.Manager()

    a = A(manager)
    a.start()

    while True:
        msg = manager.get_last_msg()
        if msg is not None:
            print(msg)
The method that should be executed every time is A.process_method. In this example, it is executed every time, but in my environment, it is not.
Has anyone ever had this problem, and does anyone know how to fix it?
After digging more, I found out that a Flask server was being launched in a Thread and not in a Process. I changed it to run in a Process instead of a Thread, and now everything runs as it is supposed to.
Both Flask and my Process were using the logging package, and that can cause a deadlock when launching a new Process.
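For reference, a minimal sketch of that change (assuming a plain Flask app created directly in the module; the host and port are placeholders):
from multiprocessing import Process
from flask import Flask

app = Flask(__name__)

def run_flask():
    # the Flask development server gets its own process instead of a thread
    app.run(host='127.0.0.1', port=5000)

if __name__ == '__main__':
    flask_proc = Process(target=run_flask, daemon=True)
    flask_proc.start()
    # ... start the other worker Processes here, then ...
    flask_proc.join()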

Issue with MultiProcessing in Python with BeautifulSoup 4

I'm having issues using most or all of the cores to process the files faster; it could be reading multiple files at a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving on to the next.
I tried the code below but can't seem to get all the cores used.
The following code basically retrieves the *.txt files in the directory, each of which contains HTML documents in JSON format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlTheHtml(htmlsource):
    htmlArray = json.loads(htmlsource)
    for eachHtml in htmlArray:
        soup = BeautifulSoup(eachHtml['result'], 'html.parser')
        if all(['another text to search' not in str(soup),
                'text to search' not in str(soup)]):
            try:
                gd_no = ''
                try:
                    gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                except:
                    pass
                r = requests.post('domain api address', data={
                    'gd_no': gd_no,
                })
            except:
                pass

if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)
    print(cpu_count())
    fileArray = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            fileArray.append(filename)
    for file in fileArray:
        with open(file, 'r') as myfile:
            htmlsource = myfile.read()
            results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what the ", f" represents.
Question 1:
What did I not do properly to fully utilize all the cores/threads?
Question 2:
Is there a better way to use try/except? Sometimes the value is not present on the page, and that would cause the script to stop. When dealing with multiple variables, I end up with a lot of try & except statements.
Answer to question 1: your problem is this line:
from multiprocessing.dummy import Pool # This is a thread-based Pool
Answer taken from: multiprocessing.dummy in Python is not utilising 100% cpu
When you use multiprocessing.dummy, you're using threads, not processes:
multiprocessing.dummy replicates the API of multiprocessing but is no
more than a wrapper around the threading module.
That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs. If you want to get full parallelism across all available cores, you're going to need to address the pickling issue you're hitting with multiprocessing.Pool.
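A rough sketch of the process-based version follows (assuming crawlTheHtml from the question is defined at module level in the same file; the worker must be a top-level function so its argument, a file name, can be pickled and sent to the worker processes):
import os
from multiprocessing import Pool, cpu_count

def crawl_file(filename):
    # top-level function: picklable, so the Pool can send it work
    with open(filename, 'r') as f:
        return crawlTheHtml(f.read())

if __name__ == '__main__':
    txt_files = [name for name in os.listdir(os.getcwd()) if name.endswith('.txt')]
    pool = Pool(cpu_count())
    results = pool.map(crawl_file, txt_files)  # one file per task, spread over all cores
    pool.close()
    pool.join()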
I had this problem. You need to do:
from multiprocessing import Pool
from multiprocessing import freeze_support
and at the end you need to do:
if __name__ == '__main__':
    freeze_support()
and then you can continue your script.
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

MAX_WORKERS = 10

class Testing_mp(object):

    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the queue is not full,
        put messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written in the queue.
        Finishes when the parent process terminates.
        """
        print "Process {0} started".format(getpid())
        while True:
            # If queue is not empty, pop the next element and do the work.
            # If queue is empty, wait indefinitely until an element gets in the queue.
            item = self.q.get(block=True, timeout=None)
            print "{0} retrieved: {1}".format(getpid(), item)
            # simulate some random-length operations
            sleep(random())

# Warning from the Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples, will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Wait a bit for the child processes to do some work,
    # because when the parent exits, the children are terminated.
    sleep(5)

Returning value from thread in python without blocking main thread

I have got an XMLRPC server, and a client runs some functions on the server and gets the returned values. If a function executes quickly then everything is fine, but I have got a function that reads from a file and returns some value to the user. The read takes about a minute (there is some complicated stuff), and while one client runs this function the server is not able to respond to other users until the function is done.
I would like to create a new thread that will read this file and return the value to the user. Is it possible somehow?
Are there any good solutions/patterns for not blocking the server when one client runs a long function?
Yes, it is possible, this way (presumably these are methods of a class, given the self parameters):
import threading

# starting the thread
def start_thread(self):
    threading.Thread(target=self.new_thread, args=()).start()

# the thread in which you run your logic
def new_thread(self, *args):
    # call the function you want to retrieve data from
    value_returned = self.retrieved_data_func()

# the function that returns the data
def retrieved_data_func(self):
    arg0 = 0
    return arg0
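If the goal is to hand the returned value back to the main thread without blocking it, a queue is a common pattern; here is a minimal sketch (the slow file read is simulated by a constant):
import threading, queue

def long_running_read(results):
    value = 42  # stand-in for the slow file-reading work
    results.put(value)

results = queue.Queue()
threading.Thread(target=long_running_read, args=(results,)).start()

# the main thread keeps serving; poll for the result without blocking
try:
    value = results.get_nowait()
except queue.Empty:
    value = None  # not ready yet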
Yes, using the threading module you can spawn new threads. See the documentation. An example would be this:
import threading
import time

def main():
    print("main: 1")
    thread = threading.Thread(target=threaded_function)
    thread.start()
    time.sleep(1)
    print("main: 3")
    time.sleep(6)
    print("main: 5")

def threaded_function():
    print("thread: 2")
    time.sleep(4)
    print("thread: 4")

main()
This code uses time.sleep to simulate that an action takes a certain amount of time. The output should look like this:
main: 1
thread: 2
main: 3
thread: 4
main: 5

How to close not responsive Win32 Internet Explorer COM interface?

Actually this is not a hang state; I mean, it responds slowly.
So in that case, I would like to close IE and restart from the beginning.
Closing is not a problem; the problem is how to set a timeout. For example, if I set 15 seconds and the webpage does not open within 15 seconds, I want to close it and restart from the beginning.
Is this possible with the IE COM interface?
It is really hard to find a solution.
Paul,
I usually use the following code to check whether a webpage has finished opening or not. But as I mentioned, it is not working well, because IE.Navigate looks like it hangs or does not respond.
while ie.ReadyState != 4:
    time.sleep(0.5)
To avoid the blocking problem, use the IE COM object in a thread.
Here is a simple but powerful example demonstrating how you can use a thread and the IE COM object together. You can improve it for your purpose.
This example starts a thread and uses a queue to communicate with the main thread. In the main thread the user can add URLs to the queue, and the IE thread visits them one by one; after it finishes one URL, IE visits the next. As the IE COM object is being used in a thread, you need to call CoInitialize().
from threading import Thread
from Queue import Queue
from win32com.client import Dispatch
import pythoncom
import time

class IEThread(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.queue = Queue()

    def run(self):
        ie = None
        # as the IE COM object will be used in a thread, do CoInitialize
        pythoncom.CoInitialize()
        try:
            ie = Dispatch("InternetExplorer.Application")
            ie.Visible = 1
            while 1:
                url = self.queue.get()
                print "Visiting...", url
                ie.Navigate(url)
                while ie.Busy:
                    time.sleep(0.1)
        except Exception, e:
            print "Error in IEThread:", e
        if ie is not None:
            ie.Quit()

ieThread = IEThread()
ieThread.start()

while 1:
    url = raw_input("enter url to visit:")
    if url == 'q':
        break
    ieThread.queue.put(url)
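To add the 15-second limit asked about in the question, the busy-wait loop could check a deadline and give up on the instance. Below is a rough sketch of just that part as a hypothetical helper, navigate_with_timeout, which is not part of the answer above; it is best effort only, since a truly hung COM server may not even answer the Busy property read:
import time
from win32com.client import Dispatch

def navigate_with_timeout(ie, url, timeout=15):
    # returns the IE instance once the page has loaded, or a fresh instance
    # if the old one did not finish within the timeout
    ie.Navigate(url)
    deadline = time.time() + timeout
    while ie.Busy or ie.ReadyState != 4:
        if time.time() > deadline:
            ie.Quit()                                      # close the unresponsive instance
            ie = Dispatch("InternetExplorer.Application")  # start a fresh one
            ie.Visible = 1
            return ie
        time.sleep(0.5)
    return ie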

Categories

Resources