How can I fix this multithreaded Python script?

I'm writing a Python script to read through a list of domains, find out what rating McAfee's SiteAdvisor service gives each one, and then output the domain and result to a CSV.
I've based my script on this previous answer. It uses urllib to scrape SiteAdvisor's page for the domain in question (not the best method, I know, but SiteAdvisor provides no alternative). Unfortunately, it fails to produce anything; I consistently get this error:
Traceback (most recent call last):
  File "multi.py", line 55, in <module>
    main()
  File "multi.py", line 44, in main
    resolver_thread.start()
  File "/usr/lib/python2.6/threading.py", line 474, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
Here is my script:
import threading
import urllib

class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            content = urllib.urlopen("http://www.siteadvisor.com/sites/" + self.address).read(12000)
            search1 = content.find("didn't find any significant problems.")
            search2 = content.find('yellow')
            search3 = content.find('web reputation analysis found potential security')
            search4 = content.find("don't have the results yet.")

            if search1 != -1:
                result = "safe"
            elif search2 != -1:
                result = "caution"
            elif search3 != -1:
                result = "warning"
            elif search4 != -1:
                result = "unknown"
            else:
                result = ""
            self.result_dict[self.address] = result
        except:
            pass

def main():
    infile = open("domainslist", "r")
    intext = infile.readlines()
    threads = []
    results = {}

    for address in [address.strip() for address in intext if address.strip()]:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()

    for thread in threads:
        thread.join()

    outfile = open('final.csv', 'w')
    outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.iteritems()))
    outfile.close()

if __name__ == '__main__':
    main()
Any help would be greatly appreciated.

It looks like you are trying to start too many threads.
You can check how many items are in the [address.strip() for address in intext if address.strip()] list. I guess this is the problem here: there is a limit on available resources, which caps how many new threads you can start.
The solution is to split your list into chunks of, say, 20 elements, do the work (in 20 threads), wait for those threads to finish, and then pick up the next chunk. Repeat until all elements of your list are processed.
You can also use a thread pool for better thread management. (I recently used this implementation.)
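As a minimal sketch of that chunked approach (reusing the Resolver class from the question; addresses stands for the stripped domain list built in main() and results for the shared dict):

CHUNK_SIZE = 20  # batch size suggested above

for start in range(0, len(addresses), CHUNK_SIZE):
    batch = [Resolver(address, results) for address in addresses[start:start + CHUNK_SIZE]]
    for thread in batch:
        thread.start()
    for thread in batch:
        thread.join()  # wait for this batch before launching the next one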

There's probably an upper limit on the number of threads you can create, and you're probably exceeding it.
Suggestion: create a small, fixed number of Resolvers - fewer than 10 will probably get you 90% of the possible parallelism benefit - and a (thread-safe) Queue from Python's Queue module (queue in Python 3). Have the main thread dump all the domains into the queue, and have each Resolver take one domain at a time from the queue and work on it.
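A rough sketch of that fixed-worker pattern, assuming Python 2 as in the question; check_siteadvisor is a hypothetical helper that would hold the scraping logic from the Resolver class:

import threading
import Queue  # "queue" on Python 3

NUM_WORKERS = 8

def worker(domain_queue, results):
    # Each worker pulls domains until the queue is empty, then exits.
    while True:
        try:
            address = domain_queue.get_nowait()
        except Queue.Empty:
            return
        results[address] = check_siteadvisor(address)  # hypothetical helper

def resolve_all(addresses):
    results = {}
    domain_queue = Queue.Queue()
    for address in addresses:
        domain_queue.put(address)
    workers = [threading.Thread(target=worker, args=(domain_queue, results))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results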

Related

Shutting down manager error "AttributeError: 'ForkAwareLocal' object has no attribute 'connection'" when using namespace and shared memory dict

I am trying to:
share a dataframe between processes
update a shared dict based on calculations performed on (but not changing) that dataframe
I am using a multiprocessing.Manager() to create a dict in shared memory (to store results) and a Namespace to store/share my dataframe that I want to read from.
import multiprocessing
import pandas as pd
import numpy as np

def add_empty_dfs_to_shared_dict(shared_dict, key):
    shared_dict[key] = pd.DataFrame()

def edit_df_in_shared_dict(shared_dict, namespace, ind):
    row_to_insert = namespace.df.loc[ind]
    df = shared_dict[ind]
    df[ind] = row_to_insert
    shared_dict[ind] = df

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    namespace = manager.Namespace()

    n = 100
    dataframe_to_be_shared = pd.DataFrame({
        'player_id': list(range(n)),
        'data': np.random.random(n),
    }).set_index('player_id')

    namespace.df = dataframe_to_be_shared

    for i in range(n):
        add_empty_dfs_to_shared_dict(shared_dict, i)

    jobs = []
    for i in range(n):
        p = multiprocessing.Process(
            target=edit_df_in_shared_dict,
            args=(shared_dict, namespace, i)
        )
        jobs.append(p)
        p.start()

    for p in jobs:
        p.join()

    print(shared_dict[1])
When running the above, it writes to shared_dict correctly (my print statement executes with some data), but I also get an error regarding the manager:
Process Process-88:
Traceback (most recent call last):
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 788, in _callmethod
    conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/henrysorsky/Library/Preferences/PyCharm2019.2/scratches/scratch_13.py", line 34, in edit_df_in_shared_dict
    row_to_insert = namespace.df.loc[ind]
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 1099, in __getattr__
    return callmethod('__getattribute__', (key,))
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 792, in _callmethod
    self._connect()
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 779, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
I understand this is coming from the manager and seems to be due to it not shutting down properly. The only similar issue I can find online (Share list between process in python server) suggests joining all the child processes, which I am already doing.
So after a full night's sleep I realised it was actually the reading of the dataframe in shared memory that was causing issues: at around the 20th child process, some of them were failing this read. I added a maximum number of processes to run at once and this solved it.
For anyone wondering, the code I used is:
import multiprocessing
import pandas as pd
import numpy as np

def add_empty_dfs_to_shared_dict(shared_dict, key):
    shared_dict[key] = pd.DataFrame()

def edit_df_in_shared_dict(shared_dict, namespace, ind):
    row_to_insert = namespace.df.loc[ind]
    df = shared_dict[ind]
    df[ind] = row_to_insert
    shared_dict[ind] = df

if __name__ == '__main__':
    # region define inputs
    max_jobs_running = 4
    n = 100
    # endregion

    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    namespace = manager.Namespace()

    dataframe_to_be_shared = pd.DataFrame({
        'player_id': list(range(n)),
        'data': np.random.random(n),
    }).set_index('player_id')

    namespace.df = dataframe_to_be_shared

    for i in range(n):
        add_empty_dfs_to_shared_dict(shared_dict, i)

    jobs = []
    jobs_running = 0

    for i in range(n):
        p = multiprocessing.Process(
            target=edit_df_in_shared_dict,
            args=(shared_dict, namespace, i)
        )
        jobs.append(p)
        p.start()
        jobs_running += 1

        if jobs_running >= max_jobs_running:
            while jobs_running >= max_jobs_running:
                jobs_running = 0
                for p in jobs:
                    jobs_running += p.is_alive()

    for p in jobs:
        p.join()

    for key, value in shared_dict.items():
        print(f"key: {key}")
        print(f"value: {value}")
        print("-" * 50)
This would probably be better handled by a Queue and Pool setup rather than my hacky fix.
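For reference, a rough sketch of what that Pool-based version could look like; the pool size of 4 and the use of starmap are assumptions, while the helper mirrors the code above:

import multiprocessing
import pandas as pd
import numpy as np

def edit_df_in_shared_dict(shared_dict, namespace, ind):
    row_to_insert = namespace.df.loc[ind]
    df = shared_dict[ind]
    df[ind] = row_to_insert
    shared_dict[ind] = df

if __name__ == '__main__':
    n = 100
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    namespace = manager.Namespace()
    namespace.df = pd.DataFrame({
        'player_id': list(range(n)),
        'data': np.random.random(n),
    }).set_index('player_id')
    for i in range(n):
        shared_dict[i] = pd.DataFrame()

    # The pool keeps at most `processes` workers alive, which also caps how
    # many clients talk to the Manager at the same time.
    with multiprocessing.Pool(processes=4) as pool:
        pool.starmap(edit_df_in_shared_dict,
                     [(shared_dict, namespace, i) for i in range(n)])

    print(shared_dict[1])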
The problem is probably in your main process, which created the shared dict. If you forget to call process.join() (or otherwise keep the main process alive, e.g. with a loop), the main process may finish before the other processes that use the dict. The dict then gets destroyed, and the processes cannot connect to it.
The number of processes should not be a problem; you should be able to use the dict with as many as you wish.
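A tiny illustration of that failure mode, as a hedged sketch (the sleep only makes the race visible): if the final join() is removed, the manager can shut down while the child still holds a proxy, giving the same ConnectionRefusedError.

import multiprocessing
import time

def slow_writer(shared_dict):
    time.sleep(1.0)
    shared_dict['done'] = True  # fails if the manager process is already gone

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    p = multiprocessing.Process(target=slow_writer, args=(shared_dict,))
    p.start()
    p.join()  # keeps the main process (and the manager) alive until the child finishes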
TL;DR: This error can happen if you initiate too many new connections to multiprocessing.Manager() objects in parallel, due to a hard-coded backlog limit (16 at the time of writing) in multiprocessing/managers.py:
# do authentication later
self.listener = Listener(address=address, backlog=16)
self.address = self.listener.address
Details: I was starting a few hundred subprocesses that tried to get a value from a multiprocessing.Manager().dict object at the very start of my program (essentially all in parallel). The first few worked fine, but then they started to fail sporadically.
Interestingly, in my case this only happened under the VSCode debugger. I found a mailing list discussion mentioning this issue more than 10 years ago. Looking at the source code of multiprocessing, I found that the backlog limit is still hard-coded (it seems to have been raised from 5 to 16 in modern versions). I increased it to 64 and all the errors were gone.
So if the pending-connections queue reaches the limit, all new connections are refused. This is especially likely when you run your code under a debugger: connections are served a tick slower, and the backlog buffer can fill up when hundreds of them arrive in parallel.
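If editing the standard library is not an option, a workaround in the same spirit is simply not to open hundreds of first connections to the manager at the same instant. A hedged sketch (the batch size, sleep interval and worker body are illustrative assumptions, not part of the original answer):

import multiprocessing
import time

def worker(shared_dict, i):
    shared_dict[i] = i * i  # the first proxy access opens a connection to the manager

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    procs = []
    for i in range(300):
        p = multiprocessing.Process(target=worker, args=(shared_dict, i))
        p.start()
        procs.append(p)
        if (i + 1) % 10 == 0:
            time.sleep(0.05)  # let the manager's listener drain its backlog
    for p in procs:
        p.join()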

create very large queue for python multiprocessing

I would like to create a queue of about 256K paths to files and have the paths dequeued and processed by parallel worker processes. This is multiprocessing rather than threads.
However, when I create a multiprocessing.queue there seems to be a hard limit at 32K objects in the queue. This might be even smaller if the objects were full paths to files, as intended.
What would be an alternate way to create a multiserver queue for multiprocessing?
import multiprocessing
import sys

q = multiprocessing.Queue()
for i in range(32768 * 2):
    print i
    try:
        q.put('abcdef')
    except:
        print "Unexpected error on ()".format(i), sys.exc_info()[0]
        raise
yields:
...
32766
32767
Traceback (most recent call last):
Unexpected error on () <type 'exceptions.KeyboardInterrupt'>
  File "/Users/Wes/Dropbox/Programming/ElectionTransparency/vops_addons/dead/tryq.py", line 13, in <module>
    q.put('abc')
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 101, in put
    if not self._sem.acquire(block, timeout):
KeyboardInterrupt
You could try using Celery - http://www.celeryproject.org/ - the queue limit would then be up to the broker configuration.
Moreover, you would not be limited to workers on the same machine: any computer that can mount the same filesystem could run Celery workers to process your tasks. (And even if remote processing is not an option, using Celery workers can still have advantages over raw multiprocessing, since there are niceties such as automatic retries.)
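A minimal Celery sketch of that idea; the Redis broker URL, the module name and the process_path body are illustrative assumptions:

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def process_path(path):
    # per-file work goes here
    print('processing', path)

# producer side: the broker, not a Python queue, holds the 256K-item backlog
if __name__ == '__main__':
    with open('paths.txt') as fh:
        for line in fh:
            process_path.delay(line.strip())

Workers are then started separately (for example with celery -A tasks worker), on as many machines as needed.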
Here is what I finally found that worked. I made the array of paths available to all the worker processes and used a multiprocessing.Value() object to create a shared index into the array protected with a lock.
from multiprocessing import Process, Lock, Value
import os
import sys
import time

def info(title, lock, item=None):
    pid = os.getpid()
    lock.acquire()
    print '<', title, item, ' ', __name__, pid, '>'
    sys.stdout.flush()
    lock.release()

def f(stdout_lock, next_item, worklist):
    while True:
        with next_item.get_lock():
            if len(worklist) <= next_item.value:
                return
            item = worklist[next_item.value]
            next_item.value += 1
        info('queue item: ', stdout_lock, item)
        time.sleep(0.0001)

if __name__ == '__main__':
    next_item = Value('l')
    worklist = [str(i) for i in range(250000)]
    next_item.value = 0
    stdout_lock = Lock()
    plist = []
    for i in range(3):
        plist.append(Process(target=f, args=(stdout_lock, next_item, worklist)))
        plist[-1].start()
    for i in range(3):
        plist[i].join()

Multiprocessing Robust to Occasional Failures

I have 100-1000 timeseries paths and a fairly expensive simulation that I'd like to parallelize. However, the library I'm using hangs on rare occasions, and I'd like to make it robust to those issues. This is the current setup:
with Pool() as pool:
    res = pool.map_async(simulation_that_occasionally_hangs, (p for p in paths))
    all_costs = res.get()
I know get() has a timeout parameter but if I understand correctly that works on the whole process of the 1000 paths. What I'd like to do is check if any single simulation is taking longer than 5 minutes (a normal path takes 4 seconds) and if so just stop that path and continue to get() the rest.
EDIT:
Testing timeout in pebble:
from pebble import ProcessPool
from concurrent.futures import TimeoutError

def fibonacci(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return fibonacci(n - 1) + fibonacci(n - 2)

def main():
    with ProcessPool() as pool:
        future = pool.map(fibonacci, range(40), timeout=10)
        iterator = future.result()
        all = []
        while True:
            try:
                all.append(next(iterator))
            except StopIteration:
                break
            except TimeoutError as e:
                print(f'function took longer than {e.args[1]} seconds')
        print(all)

if __name__ == '__main__':
    main()
Errors:
RuntimeError: I/O operations still in flight while destroying Overlapped object, the process may crash
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\anaconda3\lib\multiprocessing\spawn.py", line 99, in spawn_main
    new_handle = reduction.steal_handle(parent_pid, pipe_handle)
  File "C:\anaconda3\lib\multiprocessing\reduction.py", line 87, in steal_handle
    _winapi.DUPLICATE_SAME_ACCESS | _winapi.DUPLICATE_CLOSE_SOURCE)
PermissionError: [WinError 5] Access is denied
The pebble library has been designed to address these kinds of issues: it transparently handles job timeouts and failures such as C library crashes.
You can check the documentation examples to see how to use it. It has an interface similar to concurrent.futures.
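A hedged sketch of that approach adapted to the simulation in the question; the 300-second timeout matches the 5-minute limit mentioned, and storing None for timed-out paths is an illustrative choice:

from pebble import ProcessPool
from concurrent.futures import TimeoutError

with ProcessPool() as pool:
    future = pool.map(simulation_that_occasionally_hangs, paths, timeout=300)
    results = future.result()
    all_costs = []
    while True:
        try:
            all_costs.append(next(results))
        except StopIteration:
            break
        except TimeoutError:
            all_costs.append(None)  # this path exceeded 5 minutes; skip its result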
Probably the easiest way is to run each heavy simulation in a separate subprocess, with the parent process watching it. Specifically:
def risky_simulation(path):
    ...

def safe_simulation(path):
    p = multiprocessing.Process(target=risky_simulation, args=(path,))
    p.start()
    p.join(timeout)  # Your timeout here
    p.kill()  # or p.terminate()
    # Here read and return the output of the simulation.
    # Can be from a file, or using some communication object
    # between processes, from the `multiprocessing` module.

with Pool() as pool:
    res = pool.map_async(safe_simulation, paths)
    all_costs = res.get()
Notes:
If the simulation may hang, you may want to run it in a separate process (i.e. the Process object should not be a thread), since depending on how the hang happens, it may hold the GIL.
This solution only uses the pool for the immediate sub-processes, while the computations are off-loaded to new processes. We could also make the computations share a pool, but that would result in uglier code, so I skipped it.

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url = ...    # preprocess the url from the input val
        local = ...  # filename generation from global variables and rand stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for about half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance
OK, I have found an answer.
A possible culprit was that the script was getting stuck connecting to or downloading from a URL, so I added a socket timeout to limit the time spent connecting and downloading each image.
And now, the issue no longer bothers me.
Here is my complete code:
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url = ...    # preprocess the url from the input val
        local = ...  # filename generation from global variables and rand stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print("tempw")
Hope this solution helps others who are facing the same issue
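As a hedged alternative to the global socket default, urlopen accepts a per-call timeout, so the limit can be applied only to the downloads; the download helper name and the 20-second value below are illustrative:

import shutil
import urllib.request

def download(url, local, timeout=20):
    # The timeout covers the connection and any blocking reads on it.
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(local, 'wb') as out:
        shutil.copyfileobj(response, out)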
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really does launch separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all these instances: each of them tries to lock the IO process; the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already-locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?

python: Why do different processes return the same epoll object when calling select.epoll()

My goal: start N sub-processes, each dealing with a different set of sockets.
-- This means different epoll objects are needed.
Problem: when I call select.epoll() in the sub-processes, it returns the same object.
Here's a simple example:
from multiprocessing import Process, Lock
import time, select, os

class A(Process):
    def run(self):
        fd = select.epoll()
        print 'A.pid=', os.getpid(), 'poll_fd:', fd, fd.fileno()
        while 1:
            poll_list = fd.poll(timeout=3600)
            for fd, events in poll_list:
                pass

class B(Process):
    def run(self):
        fd = select.epoll()
        print 'B.pid=', os.getpid(), 'poll_fd:', fd, fd.fileno()
        while 1:
            poll_list = fd.poll(timeout=3600)
            for fd, events in poll_list:
                pass

A().start()
B().start()
Why does this happen?
What should I do to fix it?
Any help will be appreciated.
Since these are different processes, the epoll resources are also different. Each process has its own set of file descriptor numbers, and the lowest free number is chosen for a new resource. That's why both processes report the same file number. No fix is needed.
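To see this, here is a small sketch (the temp-file names are illustrative): two processes each open their own file, and both typically report the same descriptor number even though the files are independent.

import os
from multiprocessing import Process

def show_fd(path):
    f = open(path, 'w')  # each child opens its own, independent file
    print("pid %d opened %s with fd %d" % (os.getpid(), path, f.fileno()))
    f.close()

if __name__ == '__main__':
    procs = [Process(target=show_fd, args=('/tmp/a.txt',)),
             Process(target=show_fd, args=('/tmp/b.txt',))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()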
