I have a piece of Python code to practice coroutines, as explained by A. Jesse Jiryu Davis.
First, I define a coroutine named 'get' to fetch the content of a URL.
Then I define a Task class to drive the coroutine to completion.
Then I create one Task object, which opens one URL.
If I put two successive socket.recv() calls in the coroutine, I get the error message
'A non-blocking socket operation could not be completed immediately' on the second chunk = s.recv(1000) line.
But if I change all the yield statements to time.sleep(1) and call
get() directly in the global context, the two successive s.recv(1000) calls
cause no errors. Even more successive s.recv(1000) calls are fine.
After several days of searching and reading the Python documentation, I still have no idea why this is happening. I must have missed some Python gotcha, right?
I'm using Python 3.6 to test. The code is as follows; I have deleted all the irrelevant code to keep it precise and relevant to the topic:
#! /usr/bin/python
import socket
import select
import time

selectors_read = []
selectors_write = []

class Task:
    def __init__(self, gen):
        self.gen = gen
        self.step()

    def step(self):
        try:
            next(self.gen)
        except StopIteration:
            return

def get():
    s = socket.socket()
    selectors_write.append(s.fileno())
    s.setblocking(False)
    try:
        s.connect(('www.baidu.com', 80))
    except:
        pass
    yield
    selectors_write.remove(s.fileno())
    print('[CO-ROUTINE] ', 'Send')
    selectors_read.append(s.fileno())
    s.send('GET /index.html HTTP/1.0\r\n\r\n'.encode())
    yield
    while True:
        chunk = s.recv(1000)
        chunk = s.recv(1000)
        if chunk:
            print('[CO-ROUTINE] received')
        else:
            selectors_read.remove(s.fileno())
            break
        # yield

task_temp = Task(get())

while True:
    for filenums in select.select(selectors_read, selectors_write, []):
        for fd in filenums:
            task_temp.step()
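For reference, here is a minimal sketch (my illustration, not part of the original program) of a receive loop that yields before every recv, so the coroutine is only resumed once select() reports the socket readable again; the names s and selectors_read are the ones from the code above, and BlockingIOError is caught defensively in case the coroutine is resumed before data has actually arrived:

# Sketch only: a drop-in replacement for the receive loop above.
while True:
    yield                        # wait until select() reports the socket readable
    try:
        chunk = s.recv(1000)     # at most one recv per wakeup
    except BlockingIOError:
        continue                 # nothing to read yet; wait for the next wakeup
    if chunk:
        print('[CO-ROUTINE] received')
    else:
        selectors_read.remove(s.fileno())
        break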
I'm working on a Python program that is supposed to read incoming MS-Word documents in a client/server fashion, i.e. the client sends a request (one or multiple MS-Word documents) and the server reads specific content from those requests using pythoncom and win32com.
Because I want to minimize waiting time for the client (the client needs a status message from the server), I do not want to open an MS-Word instance for every request. Hence, I intend to have a pool of running MS-Word instances from which the server can pick and choose. This, in turn, means I have to reuse those instances from the pool in different threads, and this is what causes trouble right now.
After I fixed the error I asked about previously on Stack Overflow, my code now looks like this:
import pythoncom, win32com.client, threading, psutil, os, queue, time, datetime

class WordInstance:
    def __init__(self, app):
        self.app = app
        self.flag = True

appPool = {'WINWORD.EXE': queue.Queue()}

def initAppPool():
    global appPool
    wordApp = win32com.client.DispatchEx('Word.Application')
    appPool["WINWORD.EXE"].put(wordApp)  # For testing purpose I only use one MS-Word instance currently

def run_in_thread(instance, appid, path):
    print(f"[{datetime.now()}] open doc ... {threading.current_thread().name}")
    pythoncom.CoInitialize()
    wordApp = win32com.client.Dispatch(pythoncom.CoGetInterfaceAndReleaseStream(appid, pythoncom.IID_IDispatch))
    doc = wordApp.Documents.Open(path)
    doc.SaveAs(rf'{path}.FB.pdf', FileFormat=17)
    doc.Close()
    print(f"[{datetime.now()}] close doc ... {threading.current_thread().name}")
    instance.flag = True

if __name__ == '__main__':
    initAppPool()
    pathOfFile2BeRead1 = r'C:\Temp\file4.docx'
    pathOfFile2BeRead2 = r'C:\Temp\file5.docx'

    # treat first request
    wordApp = appPool["WINWORD.EXE"].get(True, 10)
    wordApp.flag = False
    pythoncom.CoInitialize()
    wordApp_id = pythoncom.CoMarshalInterThreadInterfaceInStream(pythoncom.IID_IDispatch, wordApp.app)
    readDocjob1 = threading.Thread(target=run_in_thread, args=(wordApp, wordApp_id, pathOfFile2BeRead1), daemon=True)
    readDocjob1.start()
    appPool["WINWORD.EXE"].put(wordApp)

    # wait here until readDocjob1 is done
    wait = True
    while wait:
        try:
            wordApp = appPool["WINWORD.EXE"].get(True, 1)
            if wordApp.flag:
                print(f"[{datetime.now()}] ok appPool extracted")
                wait = False
            else:
                appPool["WINWORD.EXE"].put(wordApp)
        except queue.Empty:
            print(f"[{datetime.datetime.now()}] error: appPool empty")
        except BaseException as err:
            print(f"[{datetime.datetime.now()}] error: {err}")

    wordApp.flag = False
    openDocjob2 = threading.Thread(target=run_in_thread, args=(wordApp, wordApp_id, pathOfFile2BeRead2), daemon=True)
    openDocjob2.start()
When I run the script I receive the following output printed on the terminal:
[2022-03-29 11:41:08.217678] open doc ... Thread-1
[2022-03-29 11:41:10.085999] close doc ... Thread-1
[2022-03-29 11:41:10.085999] ok appPool extracted
[2022-03-29 11:41:10.085999] open doc ... Thread-2
Process finished with exit code 0
And only the first Word file is converted to a PDF. It seems like run_in_thread terminates after the print statement and before/during pythoncom.CoInitialize(). Sadly, I do not receive any error message, which makes it quite hard to understand the cause of this behavior.
After reading Microsoft's documentation, I tried using pythoncom.CoInitializeEx(pythoncom.APARTMENTTHREADED) instead of pythoncom.CoInitialize(), since my COM object needs to be called by multiple threads. However, this changed nothing.
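For illustration only, here is a rough sketch (not my working code; start_conversion and convert are made-up names) built on the assumption that a stream produced by CoMarshalInterThreadInterfaceInStream is consumed by the first CoGetInterfaceAndReleaseStream call, so every worker thread needs its own freshly marshaled stream and its own CoInitialize/CoUninitialize pair:

# Sketch only: marshal a fresh stream per thread; word_app_dispatch would be
# the raw Dispatch object (wordApp.app in the code above).
def start_conversion(word_app_dispatch, path):
    stream = pythoncom.CoMarshalInterThreadInterfaceInStream(
        pythoncom.IID_IDispatch, word_app_dispatch)   # one stream per thread
    t = threading.Thread(target=convert, args=(stream, path))
    t.start()
    return t

def convert(stream, path):
    pythoncom.CoInitialize()                          # per-thread COM init
    try:
        word = win32com.client.Dispatch(
            pythoncom.CoGetInterfaceAndReleaseStream(stream, pythoncom.IID_IDispatch))
        doc = word.Documents.Open(path)
        doc.SaveAs(rf'{path}.FB.pdf', FileFormat=17)
        doc.Close()
    finally:
        pythoncom.CoUninitialize()

# jobs = [start_conversion(wordApp.app, p) for p in (pathOfFile2BeRead1, pathOfFile2BeRead2)]
# for job in jobs:
#     job.join()   # joining (or using non-daemon threads) keeps the process alive until both conversions finish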
I want to concurrently execute the same instance method from each object in a list in python.
I created a DataPipe class that downloads pages and stores the results in an array. Then, when I'm done downloading the links of interest for a specific domain, I yield these pages and then yield the corresponding items from them.
The code works pretty much as expected, and now I want to download from multiple domains at the same time.
class DownloadCommand(Command):
    def __init__(self, domain):
        self.domain = domain
        self.request_config = {'headers': self.domain.get_header(), 'proxy': self.domain.get_proxy()}
        self.data_pipe = DataPipe(command=self)

    def execute(self):
        # try:
        for brand, start_urls in self.domain.start_url.items():
            for start_url in start_urls:
                # yield from self.data_pipe.get_item_divs(brand, start_url)
                yield from self.data_pipe.get_item_divs(brand, start_url)
Currently, I'm doing this sequentially.
def scrape(self):
    for domain in self.get_initial_domain_list():
        yield from self.fetch_from_dom(domain)

def fetch_from_dom(self, domain):
    self.set_current_domain(domain)
    for start_url_values, brand, start_url in domain.command.execute():
        for items in start_url_values:
            yield [self.get_item_value(item_div) for item_div in items]
I tried to parallelize this application using multiprocessing.pool.Pool, but it does not work for instance methods. Then, when I used ProcessingPool from pathos.multiprocessing, it returned an error:
multiprocess.pool.MaybeEncodingError: Error sending result: '[<generator object fetch_from_dom at 0x7fa984814af0>]'. Reason: 'TypeError("can't pickle generator objects",)'
I want to switch to either asyncio or concurrent.futures, but I'm not sure which one would be better for what I want, or whether it's actually possible in Python to concurrently execute instance methods from objects in a list. Can anyone help?
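A hedged sketch of the concurrent.futures route (scrape_concurrently and fetch_all are my own illustrative names; fetch_from_dom and get_initial_domain_list are the methods above, and this would live on the same class as scrape): a ThreadPoolExecutor keeps everything in one process, so nothing has to be pickled, and each worker drains its generator into a list before returning. Note that set_current_domain() mutates shared state, so it would need to be made per-worker or thread-safe for this to be correct.

from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_concurrently(self, max_workers=4):
    def fetch_all(domain):
        # materialize the generator inside the worker thread
        return list(self.fetch_from_dom(domain))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_all, domain)
                   for domain in self.get_initial_domain_list()]
        for future in as_completed(futures):
            yield from future.result()   # lists of item values, one batch per page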
You can't use Selenium with Python multiprocessing directly, because multiprocessing clones the parent's memory. A simpler way to avoid that is to use threads, but here is my solution for multiprocessing.
NOTE: self is my driver, because I have a custom class implemented on top of Selenium.
# Exit function
def cleanup(self):
    print("++cleanup()++")
    try:
        try:
            self.close()
        except Exception as e:
            # print("except cleanup - 2 - self.close() -> %s" % e)
            pass
        try:
            self.quit()
        except Exception as e:
            # print("except cleanup - 3 - self.quit() -> %s" % e)
            pass
        try:
            self.dispose()
            # print("Fake disabled dispose()")
        except Exception as e:
            # print("except cleanup - 4 - self.dispose() -> %s" % e)
            pass
        try:
            self.service.process.send_signal(signal.SIGTERM)
        except Exception as e:
            # print("except cleanup - 1 - self.service.process.send_signal(signal.SIGTERM) -> %s" % e)
            pass
    except Exception as e:
        print("Except - CLEANUP -> %s" % e)
        # print(str(e))
        pass
In the script code:

# Before starting the threads/processes
browser.cleanup()
del browser
# Now start multiprocessing and instantiate a browser in each subprocess
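A minimal sketch of that last step (my own illustration; webdriver.Chrome() just stands in for whatever custom driver class you use, and the URLs are example data): each subprocess builds and tears down its own browser, so nothing Selenium-related is inherited from the parent process.

import multiprocessing
from selenium import webdriver

def worker(url):
    browser = webdriver.Chrome()          # fresh driver created inside the subprocess
    try:
        browser.get(url)
        # ... do the actual scraping here ...
    finally:
        browser.quit()                    # each subprocess cleans up its own driver

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']   # example data
    processes = [multiprocessing.Process(target=worker, args=(url,)) for url in urls]
    for p in processes:
        p.start()
    for p in processes:
        p.join()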
I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url =    # preprocess the url from the input val
        local =  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for about half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance.
OK, I have found an answer.
A possible culprit was that the script got stuck connecting to or downloading from a URL, so I added a socket timeout to limit the time to connect and download the image.
And now the issue no longer bothers me.
Here is my complete code
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url =    # preprocess the url from the input val
        local =  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at a time.
The multiprocessing module really does launch separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all of these instances: each of them tries to lock the I/O process; the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here : Multiprocessing useless with urllib2?
And more info about the GIL here : What is a global interpreter lock (GIL)?
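As a quick illustration of the "separate instances of Python" point (a sketch of mine, not taken from the question): each task reports the PID of the worker process that ran it, and you will typically see several different PIDs in the output.

import os
from multiprocessing import Pool

def which_process(i):
    return i, os.getpid()          # PID of the worker process that handled this task

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        for i, pid in pool.map(which_process, range(8)):
            print("task %d ran in process %d" % (i, pid))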
I'm running a proxy as suggested in the mitmproxy GitHub examples:
import os
from libmproxy import proxy, flow

class MitmProxy(flow.FlowMaster):
    def run(self):
        try:
            flow.FlowMaster.run(self)
        except KeyboardInterrupt:
            self.shutdown()

    def handle_request(self, r):
        f = flow.FlowMaster.handle_request(self, r)
        if f:
            r.reply()
        return f

    def handle_response(self, r):
        f = flow.FlowMaster.handle_response(self, r)
        if f:
            r.reply()
        return f

config = proxy.ProxyConfig(
    cacert = os.path.expanduser("~/.ssl/mitmproxy.pem")
)
state = flow.State()
server = proxy.ProxyServer(config, 8083)
m = MitmProxy(server, state)

try:
    m.run()
except Exception, e:
    print e.message
    m.shutdown()
I want to handle each request/response without blocking the others.
For that I need to use the concurrent decorator and scripts.
My question is: how do I load and unload scripts into the proxy running in this configuration?
You can use concurrent mode together with script loading; here is an example of this kind of usage.
I preferred to implement the mitmproxy logic at the flow level. You can use this code:
def handle_response(self, r):
    reply = f.response.reply
    f.response.reply = controller.DummyReply()
    if hasattr(reply, "q"):
        f.response.reply.q = reply.q

    def run():
        pass

    threading.Thread(target=run)
You basically have to copy how handle_concurrent_reply works in libmproxy.script
f = flow.FlowMaster.handle_request(self, r)
if f:
    def run():
        request.reply()  # if you forget this you'll end up in a loop and never reply
    threading.Thread(target=run).start()  # this will start run
Excuse the unhelpful variable names and the unnecessarily bloated code; I just quickly whipped this together and haven't had time to optimise or tidy it up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo-sharing service (321cheese.com) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct result (121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115, etc., but never gives the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt', 'w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve(inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')

for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c + "\n")
        thread.start_new_thread(downloadImage, (c,))
        ##downloadImage(c)

g.close()
There are multiple issues in your code.
The main issue is the use of the global name d in multiple threads. To fix it, pass the name explicitly as an argument to downloadImage().
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as the futures backport) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib

from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass  # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
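If you prefer the concurrent.futures spelling mentioned above, a roughly equivalent sketch (download_image_cf and main_with_futures are just illustrative names; url2filename(), download_dir, and a get_urls() helper are assumed from the snippet above) looks like this, with the filename still computed per call rather than read from a shared global:

from concurrent import futures   # on Python 2: pip install futures

def download_image_cf(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main_with_futures():
    with futures.ThreadPoolExecutor(max_workers=10) as executor:
        for (filename, headers), error in executor.map(download_image_cf, get_urls()):
            pass  # do something with the downloaded file or handle an error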
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are FAKE! Use the multiprocessing module if you want real parallelism. But since the images are probably all from the same server, if you open a hundred connections to it at the same time, its firewall will probably start dropping your connections.
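If you do stay with plain threads, a minimal sketch of capping the number of simultaneous downloads with a semaphore could look like this (download_image_limited and the URL/name pair are only illustrative; urllib.urlretrieve is used as in the question):

import threading
import urllib

MAX_PARALLEL = 10
slots = threading.BoundedSemaphore(MAX_PARALLEL)

def download_image_limited(url, name):
    with slots:                                   # at most MAX_PARALLEL downloads at once
        urllib.urlretrieve(url, "./grabNew/" + name)

threads = []
for url, name in [("http://321cheese.com/example.png", "example.png")]:   # example data
    t = threading.Thread(target=download_image_limited, args=(url, name))
    t.start()
    threads.append(t)
for t in threads:
    t.join()                                      # wait so the script does not exit early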