How to troubleshoot an "AttributeError: __exit__" in multiprocessing in Python?

I tried to rewrite some CSV-reading code so it could run on multiple cores in Python 3.2.2. I tried to use the Pool object from multiprocessing, which I adapted from working examples (and which already worked for me in another part of my project). I ran into an error message I found hard to decipher and troubleshoot.
The error:
Traceback (most recent call last):
  File "parser5_nodots_parallel.py", line 256, in <module>
    MG,ppl = csv2graph(r)
  File "parser5_nodots_parallel.py", line 245, in csv2graph
    node_chunks)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/multiprocessing/pool.py", line 552, in get
    raise self._value
AttributeError: __exit__
The relevant code:
import csv
import time
import datetime
import re
from operator import itemgetter
from multiprocessing import Pool
import itertools
import networkx as nx

def chunks(l,n):
    """Divide a list of nodes `l` in `n` chunks"""
    l_c = iter(l)
    while 1:
        x = tuple(itertools.islice(l_c,n))
        if not x:
            return
        yield x

def csv2nodes(r):
    strptime = time.strptime
    mktime = time.mktime
    l = []
    ppl = set()
    pattern = re.compile(r"""[A-Za-z0-9"/]+?(?=[,\n])""")
    for row in r:
        with pattern.findall(row) as f:
            cell = int(f[3])
            id = int(f[2])
            st = mktime(strptime(f[0],'%d/%m/%Y'))
            ed = mktime(strptime(f[1],'%d/%m/%Y'))
            # collect list
            l.append([(id,cell,{1:st,2: ed})])
            # collect separate sets
            ppl.add(id)
    return (l,ppl)

def csv2graph(source):
    MG=nx.MultiGraph()
    # Remember that I use integers for edge attributes, to save space! Dic above.
    # start: 1
    # end: 2
    p = Pool()
    node_divisor = len(p._pool)
    node_chunks = list(chunks(source,int(len(source)/int(node_divisor))))
    num_chunks = len(node_chunks)
    pedgelists = p.map(csv2nodes,
                       node_chunks)
    ll = []
    ppl = set()
    for l in pedgelists:
        ll.append(l[0])
        ppl.update(l[1])
    MG.add_edges_from(ll)
    return (MG,ppl)

with open('/Users/laszlosandor/Dropbox/peers_prisons/python/codetenus_test.txt','r') as source:
    r = source.readlines()
MG,ppl = csv2graph(r)
What's a good way to troubleshoot this?

The problem is in this line:
with pattern.findall(row) as f:
You are using the with statement on it. with requires an object with __enter__ and __exit__ methods, but pattern.findall returns a plain list; when with tries to look up the list's __exit__ method, it can't find one and raises the AttributeError. Just use
f = pattern.findall(row)
instead.
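Applied to the question's csv2nodes, the corrected loop would look roughly like this (same logic as the code above, just without the with):

for row in r:
    f = pattern.findall(row)
    cell = int(f[3])
    id = int(f[2])
    st = mktime(strptime(f[0],'%d/%m/%Y'))
    ed = mktime(strptime(f[1],'%d/%m/%Y'))
    # collect list
    l.append([(id,cell,{1:st,2: ed})])
    # collect separate sets
    ppl.add(id)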

It is not the asker's problem in this instance, but the first troubleshooting step for a generic "AttributeError: __exit__" should be making sure the parentheses are there, e.g.
with SomeContextManager() as foo:
    # works because a new object is referenced...
not
with SomeContextManager as foo:
    # AttributeError because the class is referenced
Catches me out from time to time and I end up here -__-
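A self-contained illustration of the difference (SomeContextManager here is a made-up stand-in, not from the question):

class SomeContextManager(object):
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        return False  # don't suppress exceptions

with SomeContextManager() as foo:  # works: the instance has __enter__/__exit__
    pass

with SomeContextManager as foo:    # raises AttributeError: the class object itself is not a context manager
    pass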

The error also happens when trying to use
with multiprocessing.Pool() as pool:
    # ...
with a Python version that is too old (like Python 2.x) and does not support using with together with multiprocessing pools.
(See this answer https://stackoverflow.com/a/25968716/1426569 to another question for more details.)
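On Python 2, where Pool has no __enter__/__exit__ (context-manager support was added in Python 3.3), a rough equivalent is to manage the pool by hand; some_function and some_iterable below are placeholders for your own work:

import multiprocessing

pool = multiprocessing.Pool()
try:
    results = pool.map(some_function, some_iterable)  # placeholder function and iterable
finally:
    pool.close()
    pool.join()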

This error can also appear with Flask: a Flask app is already running, hasn't shut down, and in the middle of that we try to start another application context with:
with app.app_context():
    # code
Before we use this with statement, we need to make sure that the scope of the previously running app is closed.
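For reference, a minimal sketch of pushing an application context cleanly (a generic Flask setup, not the poster's app):

from flask import Flask

app = Flask(__name__)

with app.app_context():
    # code that needs the application context goes here
    pass
# the context is popped automatically when the block exits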

Related

Calling a func from a Python file

I have two Python files (using PyCharm). In Python file #2, I want to call a function from Python file #1.
from main import load_data_from_file
delay, wavelength, measured_trace = load_data_from_file("Sweep_0.txt")
print(delay.shape)
where main is the name of Python file #1. However, when I run Python file #2 (the code posted at the top), I can see that the whole of Python file #1 also runs.
Any suggestion on how I can just run print(delay.shape) without running the entire Python file #1?
Here is my code:
import logging
import numpy as np

class import_trace:
    def __init__(self,delays,wavelengths,spectra):
        self.delays = delays
        self.wavelengths = wavelengths
        self.spectra = spectra

    def load_data_from_file(self, filename):
        # logging.info("Entered load_data_from_file")
        with open(filename, 'r') as data_file:
            wavelengths = []
            spectra = []
            delays = []
            for num, line in enumerate(data_file):
                if num == 0:
                    # Get the 1st line, drop the 1st element, and convert it
                    # to a float array.
                    delays = np.array([float(stri) for stri in line.split()[1:]])
                else:
                    data = [float(stri) for stri in line.split()]
                    # The first column contains wavelengths.
                    wavelengths.append(data[0])
                    # All other columns contain intensity at that wavelength
                    # vs time.
                    spectra.append(np.array(data[1:]))
        logging.info("Data loaded from file has sizes (%dx%d)" %
                     (delays.size, len(wavelengths)))
        return delays, np.array(wavelengths), np.vstack(spectra)
and below I use this to get the values; however, it does not work:
frog = import_trace(delays, wavelengths, spectra)
delay, wavelength, measured_trace = frog.load_data_from_file("Sweep_0.txt")
I got this error:
frog = import_trace(delays,wavelengths,spectra)
NameError: name 'delays' is not defined
You can wrap the functions of file 1 in a class, and in file 2 you can create an object (or reference the class) and call the specific function. However, if you can share file 1's code it would be clearer. For your reference, see below.
File-1
class A:
    def func1():
        ...
File-2
import file1 as f
# function call
f.A.func1()
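Fleshed out a little, that could look like this (file and function names are illustrative; @staticmethod is added so the class-level call works as written):

# file1.py
class A:
    @staticmethod
    def func1():
        return "result from func1"

# file2.py
import file1 as f

print(f.A.func1())  # call the specific function through the imported class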

How to do multiprocessing of image augmentations of large quantity?

I'm trying to do image augmentations, for which I'm getting images from a large number of folders. Doing this sequentially takes a lot of time, so I'm running the same script in different terminals in order to complete the augmentations quickly, providing the start and end values of the list index as shown in the code below.
def do_augmentations(args):
    total_count = len(all_folders)
    split_no = total_count // 2
    start = split_no
    if split_no == 0:
        split_no = 1
    end = total_count
    for folder in all_folders[start:end]:
        allImgs = list(paths.list_images(folder))
        count = len(allImgs)
        for img in allImgs:
            augmentations(img)
            cv2.imwrite(img)

def main():
    all_folders = os.walk(folderpath)
    do_augmentations(all_folders)
I was wondering if we could use multiple CPU cores in parallel with the multithreading and multiprocessing packages in Python, rather than running sequentially, because it takes a long time. Here I'm specifying the start and end values of the folder index so I can run the script separately in multiple terminals to finish faster. I tried using the multiprocessing library to implement this in parallel, but it runs in the same sequential manner as before. Below is the code of my approach to solving this.
from multiprocessing import pool
from multiprocessing.dummy import Pool as ThreadPool

def do_augmentations(args):
    all_folders = args[0]
    bpath = args[1]
    for folder in all_folders:
        allImgs = list(paths.list_images(folder))
        count = len(allImgs)
        for img in allImgs:
            augmentations(img)
            cv2.imwrite(img)

def main():
    bpath = 'img_foldr'
    all_folders = os.walk(folderpath)
    pool = ThreadPool(4)
    pool.map(do_augmentations, [[all_folders, bpath]])
When running this, it does image processing on one folder at a time in a loop instead of in parallel for many folders simultaneously. I don't understand what I'm doing wrong. Any help or suggestions to solve this will be very helpful.
Update:
I tried the answer given by Jan Wilamowski, as below:
def augment(image):
    img = do_augmentation(image)
    cv2.imwrite(img)

def main():
    all_folders = os.walk(imagefolder)
    all_images = chain(paths.list_images(folder) for folder in all_folders)
    pool = Pool(4)
    pool.map(augment, all_images)
I get the error below:
Traceback (most recent call last):
  File "aug_img.py", line 424, in <module>
    main()
  File "aug_img.py", line 346, in main
    pool.map(augment,all_images)
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\pool.py", line 537, in _handle_tasks
    put(task)
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
Have your function work on a single folder and pass the folder list to pool.map(). Also, use a process pool to avoid problems with the GIL (as pointed out by several commenters):
from multiprocessing import Pool

def do_augmentations(folder):
    allImgs = list(paths.list_images(folder))
    count = len(allImgs)
    for img in allImgs:
        augmentations(img)
        cv2.imwrite(img)

def main():
    bpath = 'img_foldr'
    all_folders = os.walk(folderpath)
    pool = Pool(4)
    pool.map(do_augmentations, all_folders)
You could also break it down further and have your function work on a single image, giving more consistent performance:
from itertools import chain
from imutils import paths

def augment(image):
    augmentations(image)
    cv2.imwrite(image)

def main():
    all_images = paths.list_images(folderpath)
    pool = Pool(4)
    pool.map(augment, all_images)
Note however that disk I/O can be a bottleneck so don't expect a linear performance improvement.
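As for the TypeError: cannot pickle 'generator' object in the update: chain(paths.list_images(folder) for folder in all_folders) yields the inner generator objects themselves as items, and those are what pool.map tries to pickle and send to the workers. Flattening with chain.from_iterable so that plain path strings are mapped avoids this. A sketch along the same lines (augment as defined in the update; folderpath assumed):

import os
from itertools import chain
from multiprocessing import Pool
from imutils import paths

def main():
    # os.walk yields (dirpath, dirnames, filenames) tuples; keep just the directory paths
    all_folders = (root for root, _, _ in os.walk(folderpath))
    # from_iterable flattens to individual image paths (plain strings), which pickle fine
    all_images = chain.from_iterable(paths.list_images(folder) for folder in all_folders)
    pool = Pool(4)
    pool.map(augment, all_images)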
Let me give you my simple multiprocessing recipe for any task, not just augmentation.
This is a top-down view.
import os
import multiprocessing
from workers import batch_function

no_of_cpus = os.cpu_count()

all_folders = os.walk(folderpath)
input_list = get_all_the_files_from_folder(all_folders)  # This is specific to your file structures
mp_dict = split_input_list(process_number=int(no_of_cpus), d=input_list)
pool = multiprocessing.Pool()
results = pool.map(batch_function, mp_dict)  # Call Image Augmentations
First I would make a list of all the data that needs to be preprocessed and then split it into the number of processes I want by means of a function split_input_list.
If you don't need to return anything from the batch function you don't need the results variable, but in essence you will have a list of results from each process that you can iterate over with for res in results:, no matter what you return in batch_function.
def split_input_list(process_number, d):
    dict_list = []
    pn = process_number
    for i in range(pn - 1):
        start = len(d) // pn * i
        finish = len(d) // pn * (i + 1)
        split_dict = dict(list(d.items())[start:finish])
        print(len(split_dict))
        dict_list.append(split_dict)
    last_dict = dict(list(d.items())[finish:])
    dict_list.append(last_dict)
    print(len(last_dict))
    return dict_list
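The same idea for a plain list input, in case input_list is a list rather than a dict (a sketch, not the author's code):

def split_input_list_simple(items, process_number):
    """Split a list into process_number chunks; the last chunk absorbs the remainder."""
    step = max(1, len(items) // process_number)
    chunks = [items[i * step:(i + 1) * step] for i in range(process_number - 1)]
    chunks.append(items[(process_number - 1) * step:])
    return chunks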
Then in a separate workers.py file I usually have multiple batch_function variants that can accomplish certain tasks. In this case, for augmentations I would do something similar to:
def batch_function(mp_list):
    local_split = mp_list[0]
    for k, _ in local_split.items():
        augmentations(k)
    ...
    # return what_you_need
Also, if you don't have an impressive amount of RAM and a CPU with 32 cores, expect some crashes from lack of memory.

How to efficiently run parallel processes in python, while each should read from a big file

I have a large delimited protobuf file (it can be between 1 GB and 30 GB).
Each message in the file has a certain format, where the first attribute is a string and the second is a repeated (list-like) object that contains 2 attributes.
It's similar to this text representation:
BIG FILE:
first 10:32, 12:1, ... ,100002:3
second 1:3, 15:5, ... ,548756:57
...
...
ten_million 4:7, 48:4, ... ,12357458:8
Currently my code looks something like this:
import itertools
from multiprocessing import Pool
from google.protobuf.internal.decoder import _DecodeVarint32
import proto_pb2

def read_message(buffer, n, message_type):
    message = message_type()
    msg_len, new_pos = _DecodeVarint32(buffer, n)
    n = new_pos
    msg_buf = buffer[n:n + msg_len]
    n += msg_len
    message.ParseFromString(msg_buf)
    return message

class A:
    def __init__(self, big_file):
        with open(big_file, 'rb') as fp:
            self.buf = fp.read()

    def get_line(self, n):
        return read_message(self.buf, n, proto_pb2.line_type)

def func(obj_a, lines):
    res = []
    for line in lines:
        res.append(obj_a.get_line(line))
    return res

if __name__ == '__main__':
    all_lines = [[54487, 78, 10025, 548967], [12, 3218], [45786, 5744, 567, 45648], [45156, 456, 75]]
    a = A(big_file)
    with Pool() as pool:
        result = pool.starmap(func, itertools.product([a], all_lines))
    print(result)
I open and read the file inside the class and keep the contents around after creating the class object, to avoid opening and closing the file multiple times.
It fits into memory, but I'd like to avoid that.
I then expect each of the processes created by the Pool to read the lines it needs from the file and return the wanted result.
All the sub-processes only read from the file (no writing) and print/return the results.
Currently it doesn't really run in parallel; it seems to wait on acquiring a lock for a long time before each process can run.
I assume that this happens due to the large size of the file, which is being copied to each process.
What would be a proper implementation?
This code is only used as an example; the actual file is a protobuf file (I hope I haven't made many mistakes there), and I keep a mapping (dict) of the line numbers and their locations in the file.
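One common way to avoid both reading the whole file in the parent and copying the buffer to every worker is to have each worker open the file itself (e.g. via a Pool initializer) and seek to the offsets from the line-number mapping. A rough sketch of that direction; the names and the parsing placeholder are illustrative, not from the question:

from multiprocessing import Pool

_fp = None  # per-worker file handle, set by the initializer

def init_worker(path):
    global _fp
    _fp = open(path, 'rb')  # each worker opens its own handle; no big buffer is pickled

def read_at(offset):
    _fp.seek(offset)
    # placeholder: real code would decode the varint length and ParseFromString
    # the message payload here, as in read_message above
    return _fp.read(64)

def func(offsets):
    return [read_at(off) for off in offsets]

if __name__ == '__main__':
    # offsets taken from the line-number -> file-position mapping (illustrative values)
    offset_groups = [[0, 128], [256, 512]]
    with Pool(initializer=init_worker, initargs=('big_file',)) as pool:
        results = pool.map(func, offset_groups)
    print(results)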

Handle multiple results in Python multiprocessing

I'm writing a piece of Python code to parse a lot of ASCII files using multiprocessing.
For each file I have to perform the operations of this function:
def parse_file(file_name):
    record = False
    path_include = []
    buffer_include = []
    include_file_filters = {}
    include_keylines = {}
    grids_lines = []
    mat_name_lines = []
    pids_name_lines = []
    pids_shell_lines = []
    pids_weld_lines = []
    shells_lines = []
    welds_lines = []
    with open(file_name, 'rb') as in_file:
        for lineID, line in enumerate(in_file):
            if record:
                path_include += line
            if record and re.search(r'[\'|\"]$', line.strip()):
                buffer_include.append(re_path_include.search(
                    path_include).group(1).replace('\n', ''))
                record = False
            if 'INCLUDE' in line and '$' not in line:
                if re_path_include.search(line):
                    buffer_include.append(
                        re_path_include.search(line).group(1))
                else:
                    path_include = line
                    record = True
            if line.startswith('GRID'):
                grids_lines += [lineID]
            if line.startswith('$HMNAME MAT'):
                mat_name_lines += [lineID]
            if line.startswith('$HMNAME PROP'):
                pids_name_lines += [lineID]
            if line.startswith('PSHELL'):
                pids_shell_lines += [lineID]
            if line.startswith('PWELD'):
                pids_weld_lines += [lineID]
            if line.startswith(('CTRIA3', 'CQUAD4')):
                shells_lines += [lineID]
            if line.startswith('CWELD'):
                welds_lines += [lineID]
    include_keylines = {'grid': grids_lines, 'mat_name': mat_name_lines, 'pid_name': pids_name_lines,
                        'pid_shell': pids_shell_lines, 'pid_weld': pids_weld_lines, 'shell': shells_lines, 'weld': welds_lines}
    include_file_filters = {file_name: include_keylines}
    return buffer_include, include_file_filters
This function is used in a loop over the list of files, in this way (each process parses one entire file):
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
for include in grouper([list_of_file_path]):
    current = mp.current_process()
    print 'Running: ', current.name, current._identity
    results = p.map(parse_file, include)
    buffer_include += results[0]
    include_file_filters.update(results[1])
p.close()
The grouper function used above is defined as
def grouper(iterable, padvalue=None):
    return itertools.izip_longest(*[iter(iterable)]*mp.cpu_count(), fillvalue=padvalue)
I'm using Python 2.7.15 on a CPU with 4 cores (Intel Core i3-6006U).
When I run my code, I see all the CPUs engaged at 100% and the output Running: MainProcess () in the Python console, but nothing else happens. It seems that my code is blocked at the instruction results = p.map(parse_file, include) and can't go ahead (the code works well when I parse the files one at a time without parallelization).
What is wrong?
How can I deal with the results given by the parse_file function during parallel execution? Is my approach correct or not?
Thanks in advance for your support.
EDIT
Thanks darc for your reply. I've tried your suggestion but the issue is the same. The problem seems to be overcome if I put the code under an if statement like so:
if __name__ == '__main__':
Maybe this is due to the way Python IDLE handles the processes. I'm using the IDLE environment for development and debugging reasons.
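For reference, the guard mentioned above looks like this; it is required when worker processes are started with the spawn method (e.g. on Windows) and is good practice with multiprocessing in general. A minimal sketch, with parse_file and list_of_file_path as in the question:

import multiprocessing as mp

def parse_file(file_name):
    # body as defined above
    pass

if __name__ == '__main__':
    p = mp.Pool(mp.cpu_count())
    results = p.map(parse_file, list_of_file_path)  # list_of_file_path as in the question
    p.close()
    p.join()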
According to the Python docs:
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
Since it is blocking, your process waits until parse_file is done.
Since map already chunks the iterable, you can try to send all of the includes together as one large iterable:
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
results = p.map(parse_file, list_of_file_path, 1)
# each element of results is a (buffer_include, include_file_filters) tuple
for buf, filters in results:
    buffer_include += buf
    include_file_filters.update(filters)
p.close()
If you want to keep the original loop, use apply_async; or, if you are using Python 3, you can use ProcessPoolExecutor's submit() function and read the results.
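A minimal sketch of the ProcessPoolExecutor route (Python 3; parse_file and list_of_file_path as defined in the question):

from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    buffer_include = []
    include_file_filters = {}
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(parse_file, path) for path in list_of_file_path]
        for future in futures:
            buf, filters = future.result()  # blocks until that file is parsed
            buffer_include += buf
            include_file_filters.update(filters)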

Why aren't my variables being defined in my python for loop?

Here is the code:
import math
with open("test.stl") as file:
    vertices = [map(float, line.split()[1:4])
                for line in file
                if line.lstrip().startswith('vertex')]
    normals = [map(float, line.split()[2:5])
               for line in file
               if line.lstrip().startswith('facet')]
V=len(vertices)
ordering=[]
N=len(normals)
for i in range(0,N):
    p1=vertices[3*i]
    p2=vertices[3*i+1]
    p3=verticies[3*i+2]
    print p1
    x1=p1[0]
    y1=p1[1]
    z1=p1[2]
    x2=p2[0]
    y2=p2[1]
    z2=p2[2]
    x3=p3[0]
    y3=p3[1]
    z3=p3[2]
    a=[x2-x1,y2-y1,z2-z1]
    b=[x3-x1,y3-y1,z3-z1]
    a1=x2-x1
    a2=y2-y1
    a3=z2-z1
    b1=x3-x1
    b2=y3-y1
    b3=z3-z1
    normal=normals[i]
    cross_vector=[a2*b3-a3*b2,a3*b1-a1*b3,a1*b2-a2*b1]
    if cross_vector==normal:
        ordering.append([i,i+1,i+2])
    else:
        ordering.append([i,i+2,i+1])
print ordering
print cross_vector
If I add print p1 (or any of the other variables, such as cross_vector) inside the for loop, there are no errors but also no output, and if I try to print them outside of the for loop it says NameError: name '(variable name)' is not defined. So since none of these variables are being defined, my ordering array obviously prints as [] (blank). How can I change this? Do variables have to be declared before they are defined?
Edit: Here is the error output when the code above is run:
Traceback (most recent call last):
  File "convert.py", line 52, in <module>
    print cross_vector
NameError: name 'cross_vector' is not defined
As explained above, this happens with any variable defined in the for loop; I am just using cross_vector as an example.
This line:
vertices = [map(float, line.split()[1:4])
            for line in file
            if line.lstrip().startswith('vertex')]
reads through all the lines in the file. After that, you're at the end of the file, and there's nothing left to read. So
normals = [map(float, line.split()[2:5])
           for line in file
           if line.lstrip().startswith('facet')]
is empty (normals == []). Thus
N=len(normals)
sets N to 0, meaning that this loop:
for i in range(0,N):
is never executed. That's why printing from inside it does nothing -- the loop isn't being run.
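A quick way to work around the exhaustion without restructuring anything is to rewind the file handle before the second pass (a sketch, not part of the original answer):

with open("test.stl") as file:
    vertices = [map(float, line.split()[1:4])
                for line in file
                if line.lstrip().startswith('vertex')]
    file.seek(0)  # rewind so the second comprehension sees the lines again
    normals = [map(float, line.split()[2:5])
               for line in file
               if line.lstrip().startswith('facet')]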
To solve the problem diagnosed by DSM, use:
import math
import itertools
with open("test.stl") as file:
    i1, i2 = itertools.tee(file)
    vertices = [map(float, line.split()[1:4])
                for line in i1
                if line.lstrip().startswith('vertex')]
    normals = [map(float, line.split()[2:5])
               for line in i2
               if line.lstrip().startswith('facet')]
You might also want to try and drop the list comprehension, and work with iterators throughout, to save on memory for large files.
Edit:
At present, you load the entire file into memory, and then create two more full size lists in memory. Instead, you can write it in a way that only reads from the file in memory as required. As an example, we can replace the list comprehensions with generator comprehensions:
import math
import itertools
with open("test.stl") as file:
    i1, i2 = itertools.tee(file)
    vertexIter = (map(float, line.split()[1:4])
                  for line in i1
                  if line.lstrip().startswith('vertex'))
    normalIter = (map(float, line.split()[2:5])
                  for line in i2
                  if line.lstrip().startswith('facet'))
Here, we've avoided reading the whole file into memory up front.
For this to be useful, you need to be able to replace your loop, from:
for i in range(0,N):
    p1=vertices[3*i]
    p2=vertices[3*i+1]
    p3=verticies[3*i+2]
    normal = normals[i]
    # processing
To a single iterator:
for normal, p1, p2, p3 in myMagicIterator:
    # processing
One way I can think of doing this is:
myMagicIterator = itertools.izip(
    normalIter,
    vertexIter,  # pulling from the same iterator three times per
    vertexIter,  # izip step yields consecutive vertex triples
    vertexIter
)
Which is the iterator equivalent of:
myNormalList = zip(normals, vertices[0::3], vertices[1::3], vertices[2::3])
Declare them outside of the for loop (before it) and see what happens. Even if it were OK to declare them only in the for loop, you would probably like them to have a "default" value for when the loop doesn't run.
And please try to post a much smaller example if necessary.
