Python - For loop finishing before it is supposed to

I am submitting tasks to a process pool from inside a for loop, and execution ends before the loop is supposed to finish. Any ideas why? Here is the relevant code:
from classes.scraper import size
from multiprocessing import Pool
import threading

if __name__ == '__main__':
    print("Do something")
    size = size()
    pool = Pool(processes=50)

    with open('size.txt', 'r') as file:
        asf = file.read()

    for x in range(0, 1000000):
        if '{num:06d}'.format(num=x) in asf:
            continue
        else:
            res = pool.apply_async(size.scrape, ('{num:06d}'.format(num=x),))
Here is the console output (I am printing out the values inside size.scrape()):
...
...
...
013439
013440
013441
013442
013443
Process finished with exit code 0
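A likely cause (it cannot be confirmed from the snippet alone) is that the script reaches the end of the if __name__ == '__main__': block without waiting for the pool: the pool workers are daemon processes, so any tasks still queued by apply_async are discarded when the parent exits. A minimal sketch of submitting async tasks and then waiting for them; the scrape function here is a stand-in for size.scrape:

from multiprocessing import Pool

def scrape(num):
    # stand-in for size.scrape from the question
    return num

if __name__ == '__main__':
    pool = Pool(processes=50)
    results = [pool.apply_async(scrape, ('{num:06d}'.format(num=x),))
               for x in range(1000)]
    pool.close()   # no more tasks will be submitted
    pool.join()    # block until every queued task has finished
    # or: values = [r.get() for r in results] to collect the return values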

Related

Why doesn't this for-loop parallelization work in Python?

I need to navigate across 10,000 folders, collect some data from each folder, and add it to 3 containers (c18, c17, c16: 3 initially empty lists, each of which will be populated with 10,000 numbers), and it would take forever without parallelization.
My aim is to iterate through all folders with a for-loop (for i in range(10000)) and append 3 values extracted from each folder to c18, c17, c16 respectively, at each iteration of the for-loop.
I also want to display a progress bar - to know roughly how long would it take.
I have never parallelized a loop before or included a progress bar. I have tried to use SO. After reading some answers, I got to the point at which I wrote:
pool = multiprocessing.Pool(4)
pool.imap(funct, tqdm.tqdm(range(len(a0s))))  # or pool.map(funct, tqdm.tqdm(range(len(a0s))))
len(a0s) yields 10,000.
The function is defined as def funct(i): and does what I wrote above: for the folder corresponding to the loop variable i (the current iteration number), it extracts 3 values and appends them to c18, c17, and c16.
I am calling pool.imap(funct, tqdm.tqdm(range(len(a0s)))) inside a main() function, and at the end of the .py script I wrote:
if __name__ == '__main__':
    main()
I am importing:
import multiprocessing
import tqdm
However, none of the above works. How should I proceed? Any help is welcome.
Thanks!
For reference, here is the serial version of the loop:
a0s = np.loadtxt("Intensity_Wcm2_versus_a0_10_21_10_23_range.txt", usecols=(1,))  # has 10,000 entries
pool = multiprocessing.Pool(4)
top_folder_path = os.getcwd()
base_path = top_folder_path + "/a0_"

for i in range(len(a0s)):
    results_folder = base_path + "{:.4f}".format(a0s[i])
    if os.path.isdir(results_folder):
        os.chdir(results_folder)
        S = happi.Open(".")
        pbb = S.ParticleBinning(0).get()  # charge states diagnostic
        c18.append(pbb['data'][-1][-1])   # first -1 is for last timestep recorded by diagnostic, second -1 is for last charge state (bare ions, Ar18+)
        c17.append(pbb['data'][-1][-2])
        c16.append(pbb['data'][-1][-2])
        print("####################################################################")
        print("We have done the folder number: " + str(i) + " out of: " + str(len(a0s)))
        os.chdir(top_folder_path)
    else:
        continue
And here is my attempt at parallelizing it:
def funct(i):
    results_folder = base_path + "{:.4f}".format(a0s[i])
    if os.path.isdir(results_folder):
        os.chdir(results_folder)
        S = happi.Open(".")
        pbb = S.ParticleBinning(0).get()  # charge states diagnostic
        c18_val = pbb['data'][-1][-1]
        c17_val = pbb['data'][-1][-2]
        c16_val = pbb['data'][-1][-3]
        c18.append(c18_val)
        c17.append(c17_val)
        c16.append(c16_val)
    else:
        return

def main():
    pool.imap(funct, tqdm(range(len(a0s))))

if __name__ == '__main__':
    main()
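As an aside (not part of the original post): lists appended to inside pool workers live in the worker processes, so c18/c17/c16 in the parent never change, and pool.imap is lazy, so nothing runs until its iterator is consumed. A sketch of the usual pattern, returning the three values and collecting them in the parent; it assumes the imports and globals from the question (multiprocessing, os, tqdm, happi, a0s, base_path, top_folder_path, c18/c17/c16):

def funct(i):
    results_folder = base_path + "{:.4f}".format(a0s[i])
    if not os.path.isdir(results_folder):
        return None
    os.chdir(results_folder)              # each worker process has its own working directory
    S = happi.Open(".")
    pbb = S.ParticleBinning(0).get()
    os.chdir(top_folder_path)
    return pbb['data'][-1][-1], pbb['data'][-1][-2], pbb['data'][-1][-3]

def main():
    with multiprocessing.Pool(4) as pool:
        # consuming imap drives the work; wrapping it in tqdm gives the progress bar
        for res in tqdm.tqdm(pool.imap(funct, range(len(a0s))), total=len(a0s)):
            if res is not None:
                c18.append(res[0])
                c17.append(res[1])
                c16.append(res[2])

if __name__ == '__main__':
    main()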
Here's a template for multiple progress bars with multiprocessing. I hope it helps. I set it up so that each process expects 10 updates, and I added a sleep to stand in for the parallelized "work".
import multiprocessing as mp
import tqdm
import time
from itertools import repeat

def funct(lock, i):
    with lock:
        bar = tqdm.tqdm(position=i, total=10, leave=False, ncols=100)
        bar.set_lock(lock)
    for _ in range(10):
        time.sleep(.2)
        bar.update(1)
    bar.close()
    return i * 2

def main():
    lock = mp.Manager().Lock()
    with mp.Pool() as pool:
        result = pool.starmap(funct, zip(repeat(lock), range(8)))
    print()
    print(result)

if __name__ == '__main__':
    main()

python multiprocessing to create an excel file with multiple sheets [duplicate]

I am new to Python and I am trying to save the results of five different processes to one Excel file (each process writes to a different sheet). I have read different posts here, but still can't get it done, as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required here to accomplish this task.
This is my code so far:
list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()

if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()

    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()

def init(l):
    global lock
    lock = l

def f(k):
    global results
    # *** DO SOME STUFF HERE***
    results = results[ *** finished pandas dataframe *** ]
    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()
The result is that only one sheet gets created in Excel (I assume it is from the process finishing last). Some questions about this code:
How do I avoid defining global variables?
Is it even possible to pass DataFrames around?
Should I move the locking to main instead?
I'd really appreciate some input here, as I consider mastering multiprocessing instrumental. Thanks!
1) Why did you implement time.sleep in several places in your 2nd method?
In __main__, time.sleep(0.1) gives the started process a time slice to start up.
In f2(fq, q), it gives the queue a time slice to flush all buffered data to the pipe, since q.get_nowait() is used.
In w(q), it was only there to simulate a long-running writer.to_excel(...); I removed it.
2) What is the difference between pool.map and pool = [mp.Process( . )]?
Using pool.map needs no Queue and no parameters passed around, so the code is shorter. Each worker process has to return its result immediately and then terminates. pool.map keeps starting new processes until all iterations are done; the results have to be processed afterwards.
Using pool = [mp.Process( . )] starts n processes, and each process terminates on queue.Empty.
Can you think of a situation where you would prefer one method over the other?
Method 1: quick setup, serialized, when you are only interested in the results in order to continue.
Method 2: if you want to run the whole workload in parallel.
You can't use a global writer in the worker processes; the writer instance has to belong to one process.
Usage of mp.Pool, for instance:
def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))

    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])

    writer.save()
    pool.close()
This means .to_excel(...) is called sequentially in the __main__ process.
If you want .to_excel(...) to run in parallel, you have to use mp.Queue().
For instance:
The worker process:
# mp.Queue's Empty exception has to be imported from the queue module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue

def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)

        # *** DO SOME STUFF HERE***
        results = pd.DataFrame(df_)

        q.put((list_of_days[k], results))
        time.sleep(0.1)
The writer process:
def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # unpacking the 'STOP' sentinel raises ValueError and ends the loop
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)

        result.to_excel(writer, sheet_name=titel)
The __main__ process:
if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)

    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)

    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
        time.sleep(0.1)

    for p in pool:
        p.join()

    w_q.put('STOP')
    w_p.join()
Tested with Python 3.4.2, pandas 0.19.2, and xlsxwriter 0.9.6.
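A version note, not from the original answer: in newer pandas releases (1.5+), ExcelWriter.save() is deprecated and later removed in favor of close(), or of using the writer as a context manager. A minimal sketch of the context-manager form, with hypothetical frames:

import pandas as pd

list_of_days = ["2017.03.20", "2017.03.21"]
frames = [pd.DataFrame({"value": [1, 2]}), pd.DataFrame({"value": [3, 4]})]

with pd.ExcelWriter("myfile.xlsx", engine="xlsxwriter") as writer:
    for day, frame in zip(list_of_days, frames):
        frame.to_excel(writer, sheet_name=day)
# the workbook is written when the with-block exits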

How to get output from _thread?

I created a function in Python for polling some devices. Because I need fast response times, I had the idea of using threads. The Python code I wrote works and it is very fast; the peripherals respond (verified with Wireshark). But now I need each thread to give me the output of the function I launch, so that I can collect all the outputs in a vector. How can I save the output of each thread I launch with this "_thread" library?
Below is the code I used:
import _thread
import time
import atenapy

try:
    tic = time.process_time()
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5A0000005A'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2600000026'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5100000051'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2700000027'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5000000050'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'6000000060'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5200000052'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2D0000002D'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5700000057'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'5F0000005F'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5300000053'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2200000022'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5600000056'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2300000023'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5500000055'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2B0000002B'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5400000054'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2C0000002C'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0C0000000C'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2800000028'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0D0000000D'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2900000029'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0E0000000E'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2A0000002A'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0F0000000F'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1400000014'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1800000018'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1900000019'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1A0000001A'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1B0000001B'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1C0000001C'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1D0000001D'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1E0000001E'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1F0000001F'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'2000000020'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'2100000021'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.162',9761,'0200000002'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.162',9761,'0300000003'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.162',9761,'0800000008'))
    toc = time.process_time()
    print("all PE time pooling = " + str(toc - tic))
except:
    print("Error: unable to start thread")
Wrap your function in a worker function that collects the result and appends to a list. The lock is optional when appending to a list (Ref: What kinds of global value mutation are thread safe).
import threading

lock = threading.Lock()
results = []

def func(a, b):
    with lock:
        results.append(a + b)

threads = [threading.Thread(target=func, args=(a, b))
           for a in range(3) for b in range(3)]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results)
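An alternative not covered in the answer above: concurrent.futures.ThreadPoolExecutor hands back each call's return value directly, which avoids the shared results list. A sketch under the assumption that atenapy.connect_PE returns the value you want to collect; the target tuples are placeholders for the real device list:

from concurrent.futures import ThreadPoolExecutor
import atenapy  # assumed to provide connect_PE as in the question

targets = [
    ('192.168.2.172', 9761, '5A0000005A'),
    ('192.168.2.170', 9761, '2600000026'),
    # ... remaining devices ...
]

with ThreadPoolExecutor(max_workers=len(targets)) as executor:
    # map() runs connect_PE concurrently and yields the results in input order
    results = list(executor.map(lambda t: atenapy.connect_PE(*t), targets))

print(results)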

Python Multiprocessing Process causes Parent to idle

My question is very similar to this question here, except that the solution of catching exceptions didn't quite work for me.
Problem: I'm using multiprocessing to handle a file in parallel. It works about 97% of the time. Sometimes, however, the parent process idles forever and CPU usage shows 0.
Here is a simplified version of my code:
from PIL import Image
import imageio
from multiprocessing import Process, Manager

def split_ranges(min_n, max_n, chunks=4):
    chunksize = ((max_n - min_n) / chunks) + 1
    return [range(x, min(max_n-1, x+chunksize)) for x in range(min_n, max_n, chunksize)]

def handle_file(file_list, vid, main_array):
    for index in file_list:
        try:
            # Do Stuff
            valid_frame = Image.fromarray(vid.get_data(index))
            main_array[index] = 1
        except:
            main_array[index] = 0

def main(file_path):
    mp_manager = Manager()
    vid = imageio.get_reader(file_path, 'ffmpeg')
    num_frames = vid._meta['nframes'] - 1
    list_collector = mp_manager.list(range(num_frames))  # initialize a list the size of the number of frames in the video
    total_list = split_ranges(10, min(200, num_frames), 4)  # some arbitrary numbers between 0 and num_frames of the video

    processes = []
    file_readers = []
    for split_list in total_list:
        video = imageio.get_reader(file_path, 'ffmpeg')
        proc = Process(target=handle_file, args=(split_list, video, list_collector))
        print "Started Process"  # Always gets printed
        proc.Daemon = False
        proc.start()
        processes.append(proc)
        file_readers.append(video)

    for i, proc in enumerate(processes):
        proc.join()
        print "Join Process " + str(i)  # Doesn't get printed
        fd = file_readers[i]
        fd.close()

    return list_collector
The issue is that I can see the processes starting and I can see that all of the items are being handled. Sometimes, however, the processes don't rejoin. When I check back, only the parent process is there, idling as if it's waiting for something. None of the child processes are there, but I don't think join ever returns, because my print statement doesn't show up.
My hypothesis is that this happens with videos that have a lot of broken frames. However, it's hard to reproduce this error because it rarely occurs.
EDIT: The code should be valid now. I'm trying to find a file that can reproduce this error.
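One generic way to keep the parent from idling forever, sketched here as a defensive pattern rather than a confirmed fix for this particular hang, is to join each child with a timeout and terminate any process that is still alive; this reuses the processes and file_readers lists from the code above:

TIMEOUT_SECONDS = 300  # hypothetical per-process budget

for i, proc in enumerate(processes):
    proc.join(TIMEOUT_SECONDS)   # returns after the timeout even if the child is stuck
    if proc.is_alive():          # the child did not finish in time
        proc.terminate()         # force-kill the straggler
        proc.join()
    file_readers[i].close()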

Python MemoryError with Queue and threading

I'm currently writing a script that reads Reddit comments from a large file (5 GB compressed, ~30 GB of data being read). My script reads the comments, checks for some text, parses them, and sends them off to a Queue function (running in a separate thread). No matter what I do, I always get a MemoryError on a specific iteration (number 8162735, if it matters in the slightest). And I can't seem to handle the error; Windows just keeps shutting down Python when it hits. Here's my script:
import ujson
from tqdm import tqdm
import bz2
import json
import threading
import spacy
import Queue
import time

nlp = spacy.load('en')

def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield ujson.loads(line)['body']

objects = iter_comments('RC_2015-01.bz2')
q = Queue.Queue()
f = open("reddit_dump.bin", 'wb')

def worker():
    while True:
        item = q.get()
        f.write(item)
        q.task_done()

for i in range(0, 2):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

def finish_parse(comment):
    global q
    try:
        comment_parse = nlp(unicode(comment))
        comment_bytes = comment_parse.to_bytes()
        q.put(comment_bytes)
    except MemoryError:
        print "MemoryError with comment {0}, waiting for Queue to empty".format(comment)
        time.sleep(2)
    except AssertionError:
        print "AssertionError with comment {0}, skipping".format(comment)

for comment in tqdm(objects):
    comment = str(comment.encode('ascii', 'ignore'))
    if ">" in comment:
        c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
        c_parse_thread.start()

q.join()
f.close()
Does anybody know what I'm doing wrong?
Looks like it's not in your code but may be in the data. Have you tried skipping that iteration?
x = 0
for comment in tqdm(objects):
    x += 1
    if x != 8162735:
        comment = str(comment.encode('ascii', 'ignore'))
        if ">" in comment:
            c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
            c_parse_thread.start()
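More generally, an aside that goes beyond the answer above: the posted script starts one new thread per matching comment and the queue has no size limit, so parsed documents can pile up in memory faster than the writer threads drain them. Bounding the queues and using a fixed pool of parser threads caps memory use; a minimal sketch under those assumptions, with a trivial stand-in for the spaCy parse:

import threading
import Queue  # "queue" on Python 3

work_q = Queue.Queue(maxsize=100)  # put() blocks once 100 unparsed comments are waiting
out_q = Queue.Queue(maxsize=100)   # bounds the parsed results waiting to be written

def parser():
    # stand-in for finish_parse(); a real version would call nlp(comment).to_bytes()
    while True:
        comment = work_q.get()
        out_q.put(comment.encode('ascii', 'ignore'))
        work_q.task_done()

def writer():
    with open("reddit_dump.bin", "wb") as f:
        while True:
            f.write(out_q.get())
            out_q.task_done()

for target in [parser] * 4 + [writer]:   # fixed pool: 4 parsers, 1 writer
    t = threading.Thread(target=target)
    t.daemon = True
    t.start()

for comment in [u"> quoted comment", u"plain comment"]:  # stands in for iterating the dump
    if u">" in comment:
        work_q.put(comment)              # blocks instead of spawning a new thread

work_q.join()
out_q.join()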
