Read multiple HDF5 files in Python using multiprocessing - python

I'm trying to read a bunch of HDF5 files ("a bunch" meaning N > 1000 files) using PyTables and multiprocessing. Basically, I create a class to read and store my data in RAM; it works perfectly fine in sequential mode and I'd like to parallelize it to gain some performance.
For now I tried a naive approach, adding a new method flatten() to my class to parallelize file reading. The following is a simplified example of what I'm trying to do. listf is a list of strings containing the names of the files to read, and nx and ny are the dimensions of the array I want to read from each file:
import numpy as np
import multiprocessing as mp
import tables

class data:
    def __init__(self, listf, nx, ny, nproc=0):
        self.listinc = []
        for i in range(len(listf)):
            self.listinc.append((listf[i], nx, ny))

    def __del__(self):
        del self.listinc

    def get_dsets(self, tuple_inc):
        listf, nx, ny = tuple_inc
        x = np.zeros((nx, ny))
        f = tables.openFile(listf)
        x = np.transpose(f.root.x[:ny, :nx])
        f.close()
        return(x)

    def flatten(self):
        nproc = mp.cpu_count()*2

        def worker(tasks, results):
            for i, x in iter(tasks.get, 'STOP'):
                print i, x
                results.put(i, self.get_dsets(x))

        tasks = mp.Queue()
        results = mp.Queue()
        manager = mp.Manager()
        lx = manager.list()

        for i, out in enumerate(self.listinc):
            tasks.put((i, out))
        for i in range(nproc):
            mp.Process(target=worker, args=(tasks, results)).start()
        for i in range(len(self.listinc)):
            j, res = results.get()
            lx.append(res)
        for i in range(nproc):
            tasks.put('STOP')
I tried different things (including, as in this simplified example, using a manager to retrieve the data) but I always get TypeError: an integer is required.
I do not use ctypes arrays because I don't really need shared arrays (I just want to retrieve my data), and after retrieving the data I want to work with it in NumPy.
Any thought, hint or help would be highly appreciated!
Edit: The complete error I get is the following:
Process Process-341:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/toto/test/rd_para.py", line 81, in worker
    results.put(i, self.get_dsets(x))
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 101, in put
    if not self._sem.acquire(block, timeout):
TypeError: an integer is required

The answer was actually very simple...
In the worker, since what I want to return is a tuple, I can't do:
results.put(i, self.get_dsets(x))
but instead have to do:
results.put((i, self.get_dsets(x)))
which then works perfectly well: Queue.put() takes a single object, so with two positional arguments the array gets interpreted as the block flag, which is what produces TypeError: an integer is required.
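For reference, here is a minimal, self-contained sketch of the corrected task-queue/result-queue pattern; the file reading is replaced by a dummy fake_load(), so the file names and sizes are placeholders rather than the original class:

import multiprocessing as mp

def fake_load(task):
    # stand-in for get_dsets(): pretend to read a file and return a small array-like result
    fname, nx, ny = task
    return [[0.0] * ny for _ in range(nx)]

def worker(tasks, results):
    for i, task in iter(tasks.get, 'STOP'):
        # put a single tuple, not two positional arguments
        results.put((i, fake_load(task)))

if __name__ == '__main__':
    jobs = [('file_%d.h5' % i, 2, 3) for i in range(5)]
    tasks, results = mp.Queue(), mp.Queue()
    for item in enumerate(jobs):
        tasks.put(item)
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(2)]
    for p in procs:
        p.start()
    collected = [results.get() for _ in jobs]
    for _ in procs:
        tasks.put('STOP')
    for p in procs:
        p.join()
    print(sorted(i for i, _ in collected))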

Related

Is it possible to get dask to work with python multiprocessing shared_memory (BrokenProcessPool error)?

To speed up a data-intensive computation, I would like to access a shared_memory array from within different processes created with dask delayed/compute.
The code looks as follows (input_data is the array to be shared: it contains columns of ints, floats, and datetime objects, and has overall dtype 'O'):
import numpy as np
import dask
from dask import delayed
import dask.multiprocessing
from multiprocessing import shared_memory

def main():
    # input_data, data_ids and data_operations come from the rest of the script
    shm = shared_memory.SharedMemory(create=True, size=input_data.nbytes)
    shared_array = np.ndarray(input_data.shape, dtype=input_data.dtype, buffer=shm.buf)
    shared_array[:] = input_data[:]

    dask_collect = []
    for i in data_ids:
        dask_collect.append(delayed(data_processing)(i, shm.name, input_data.shape, input_data.dtype))
    result, = dask.compute(dask_collect, scheduler='processes')

def data_processing(i, shm_name, shm_dim, shm_dtype):
    shm = shared_memory.SharedMemory(name=shm_name)
    shared_array = np.ndarray(shm_dim, dtype=shm_dtype, buffer=shm.buf)
    shared_array_subset = shared_array[shared_array[:, 0] == i]
    data_operations(shared_array_subset)

if __name__ == '__main__':
    main()
All of this works fine if I use scheduler='single-threaded' as a kwarg to dask.compute, but I get the following error with scheduler='processes':
Traceback (most recent call last):
  File "C:/my_path/my_script.py", line 274, in <module>
    main()
  File "C:/my_path/my_script.py", line 207, in main
    result, = dask.compute(dask_collect, scheduler='processes')
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\base.py", line 568, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\multiprocessing.py", line 219, in get
    result = get_async(
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\local.py", line 506, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Process finished with exit code 1
The error occurs before reaching the "data_operations(shared_array_subset)" part.
Am I using shared_memory or dask incorrectly?
Thanks!
NumPy releases the Python GIL, so multithreading can give you better performance than multiprocessing:
result, = dask.compute(dask_collect, scheduler='threads')
You can also use Dask Array (Dask's parallel and distributed implementation of NumPy) here instead of the Delayed API. It has better optimizations for NumPy.
Relevant docs:
https://docs.dask.org/en/latest/scheduling.html#local-threads
https://docs.dask.org/en/latest/array.html
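To make the second suggestion concrete, here is a rough sketch of what a Dask Array version could look like. It assumes the data is purely numeric (an object-dtype array would not chunk well), and data_processing here is only a shape-preserving placeholder for the real data_operations:

import numpy as np
import dask.array as da

# hypothetical stand-in for data_operations(); must return a block of the same shape here
def data_processing(block):
    return block - block.mean(axis=0)

# assume input_data is purely numeric for this sketch
input_data = np.random.rand(1_000_000, 8)

darr = da.from_array(input_data, chunks=(100_000, 8))            # lazy, chunked view of the array
result = darr.map_blocks(data_processing, dtype=input_data.dtype)
out = result.compute(scheduler='threads')                         # threads: no pickling, NumPy releases the GIL
print(out.shape)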

Iterating through DataLoader (PyTorch): RuntimeError: Expected object of scalar type unsigned char but got scalar type float for sequence element 9

I am new to PyTorch and am running into an unexpected error. The overall context is trying to build a building segmentation model from SpaceNet imagery. I forked this repo from someone at Microsoft AI who built a segmentation model, and I am just trying to re-run her training scripts.
I've been able to download the data and do the pre-processing. My issue comes when I actually try to train the model: while iterating through my DataLoader, I get the following error message:
RuntimeError: Expected object of scalar type unsigned char but got scalar type float for sequence element 9.
Snippets of code that are useful:
I have a dataset.py that creates the SpaceNetDataset class and looks like:
import os
# Ignore warnings
import warnings

import numpy as np
from PIL import Image

import torch
from torch.utils.data import Dataset

warnings.filterwarnings('ignore')

class SpaceNetDataset(Dataset):
    """Class representing a SpaceNet dataset, such as a training set."""

    def __init__(self, root_dir, splits=['trainval', 'test'], transform=None):
        """
        Args:
            root_dir (string): Directory containing folder annotations and .txt files with the
                train/val/test splits
            splits: ['trainval', 'test'] - the SpaceNet utilities code would create these two
                splits while converting the labels from polygons to mask annotations. The two
                splits are created after chipping larger images into the required input size with
                some overlaps. Thus to have splits that do not have overlapping areas, we manually
                split the images (not chips) into train/val/test using utils/split_train_val_test.py,
                followed by using the SpaceNet utilities to annotate each folder, and combine the
                trainval and test splits it creates inside each folder.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.root_dir = root_dir
        self.transform = transform
        self.image_list = []
        self.xml_list = []

        data_files = []
        for split in splits:
            with open(os.path.join(root_dir, split + '.txt')) as f:
                data_files.extend(f.read().splitlines())

        for line in data_files:
            line = line.split(' ')
            image_name = line[0].split('/')[-1]
            xml_name = line[1].split('/')[-1]
            self.image_list.append(image_name)
            self.xml_list.append(xml_name)

    def __len__(self):
        return len(self.image_list)

    def __getitem__(self, idx):
        img_path = os.path.join(self.root_dir, 'RGB-PanSharpen', self.image_list[idx])
        target_path = os.path.join(self.root_dir, 'annotations', self.image_list[idx].replace('.tif', 'segcls.tif'))

        image = np.array(Image.open(img_path))
        target = np.array(Image.open(target_path))
        target[target == 100] = 1  # building interior
        target[target == 255] = 2  # border

        sample = {'image': image, 'target': target, 'image_name': self.image_list[idx]}
        if self.transform:
            sample = self.transform(sample)
        return sample
To create the DataLoader, I have something like:
dset_train = SpaceNetDataset(data_path_train, split_tags, transform=T.Compose([ToTensor()]))
loader_train = DataLoader(dset_train, batch_size=train_batch_size, shuffle=True,
                          num_workers=num_workers)
I then iterate over the data loader by doing something like:
for batch in loader_train:
    image_tensors = batch['image']
    images = batch['image'].cpu().numpy()
    break  # take the first shuffled batch
but then I get the error:
Traceback (most recent call last):
  File "training/train_aml.py", line 137, in <module>
    sample_images_train, sample_images_train_tensors = get_sample_images(which_set='train')
  File "training/train_aml.py", line 123, in get_sample_images
    for i, batch in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: Expected object of scalar type unsigned char but got scalar type float for sequence element 9.
The error seems quite similar to this one, although I did try a similar solution by casting:
dtype = torch.cuda.CharTensor if torch.cuda.is_available() else torch.CharTensor
for batch in loader:
    batch['image'] = batch['image'].type(dtype)
    batch['target'] = batch['target'].type(dtype)
but I end up with the same error.
A couple of other things that are weird:
This seems to be non-deterministic. Most of the time I get this error, but sometimes the code keeps running (not sure why).
The "sequence element" number at the end of the error message keeps changing. In this case it was "sequence element 9"; sometimes it's "sequence element 2", etc. Not sure why.
Ah, never mind.
Turns out unsigned char comes from C/C++, where it gives you 0 to 255, so it makes sense that that's what it expects from image data.
So I actually fixed this by doing:
image = np.array(Image.open(img_path)).astype(np.int)
target = np.array(Image.open(target_path)).astype(np.int)
inside the SpaceNetDataset class and it seemed to work!
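The root cause is that default_collate can only torch.stack samples whose tensors share a dtype, so any per-sample cast that makes the dtypes consistent fixes it. Below is a minimal, self-contained sketch of that idea with a toy dataset; the float32/int64 choices and all names are illustrative assumptions, not the repo's actual code:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Toy dataset whose raw samples come back with mixed dtypes, like images from mixed sources."""
    def __init__(self, n=16):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # simulate images that arrive with different dtypes (uint8 vs float32)
        raw = np.zeros((3, 8, 8), dtype=np.uint8 if idx % 2 else np.float32)
        target = np.zeros((8, 8), dtype=np.uint8)
        # cast every sample to one dtype so default_collate can stack the batch
        image = torch.from_numpy(raw.astype(np.float32))
        label = torch.from_numpy(target.astype(np.int64))
        return {'image': image, 'target': label}

loader = DataLoader(ToyDataset(), batch_size=4)
batch = next(iter(loader))
print(batch['image'].dtype, batch['target'].dtype)  # torch.float32 torch.int64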

python multiprocessing struct.error

I am looping through a set of large files and using multiprocessing for manipulation/writing. I create an iterable out of my dataframe and pass it to multiprocessing's map function. The processing is fine for the smaller files, but when I hit the larger ones (~10 GB) I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
The code:
data = np.array_split(data, 10)
with mp.Pool(processes=5, maxtasksperchild=1) as pool1:
    pool1.map(write_in_parallel, data)
    pool1.close()
    pool1.join()
Based on this answer, I thought the problem was that the object I am passing to map is too large. So I tried first splitting the dataframe into 1.5 GB chunks and passing each one independently to map, but I am still receiving the same error.
Full traceback:
Traceback (most recent call last):
  File "_FNMA_LLP_dataprep_final.py", line 51, in <module>
    write_files()
  File "_FNMA_LLP_dataprep_final.py", line 29, in write_files
    '.txt')
  File "/DATAPREP/appl/FNMA_LLP/code/FNMA_LLP_functions.py", line 116, in write_dynamic_columns_fannie
    pool1.map(write_in_parallel, first)
  File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/opt/Python364/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/opt/Python364/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
The answer you mentioned contains another key point: the data should be loaded by the child function. In your case, that is the function write_in_parallel. What I recommend is to alter your child function in the following way:
def write_in_parallel(path):
    """We'll make an assumption that your data is stored in a csv file."""
    data = pd.read_csv(path)
    ...
Then your "Pool code" should look like this:
with mp.Pool(processes=(mp.cpu_count() - 1)) as pool:
    chunks = pool.map(write_in_parallel, ('/path/to/your/data',))
    df = pd.concat(chunks)
I hope that will help you.
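To make the idea concrete, here is a minimal sketch of the pattern described above: each worker receives only a short file path and loads its own chunk from disk, so nothing close to the 2 GB pickling limit ever crosses the process boundary. The chunk file names and the per-chunk processing are placeholders:

import multiprocessing as mp
import pandas as pd

def write_in_parallel(path):
    # each worker loads its own chunk instead of receiving a huge pickled dataframe
    data = pd.read_csv(path)
    out_path = path.replace('.csv', '_processed.txt')
    data.to_csv(out_path, sep='|', index=False)   # placeholder for the real manipulation/writing
    return out_path

if __name__ == '__main__':
    chunk_paths = ['chunk_0.csv', 'chunk_1.csv', 'chunk_2.csv']   # hypothetical pre-split files
    with mp.Pool(processes=max(mp.cpu_count() - 1, 1)) as pool:
        written = pool.map(write_in_parallel, chunk_paths)
    print(written)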

Python3.4 : OSError: [Errno 12] Cannot allocate memory

I am initializing a bunch of multiprocessing arrays that are 1048576 by 16 long in the file dijk_inner_mp.py:
N1=1048576
DEG1=16
P1=1
W = [[0 for x in range(DEG1)] for x in range(N1)]
W_index = [[0 for x in range(DEG1)] for x in range(N1)]
u = multiprocessing.Array('i',range(P1))
D = multiprocessing.Array('i',range(N1))
Q = multiprocessing.Array('i',range(N1))
l = [multiprocessing.Lock() for i in range(0,N1)]
After the initialization I create P1 number of processes that work on the allocated arrays. However, I keep running into this error on execution:
File "dijk_inner_mp.py", line 20, in <module>
l = [multiprocessing.Lock() for i in range(0,N1)]
File "dijk_inner_mp.py", line 20, in <listcomp>
l = [multiprocessing.Lock() for i in range(0,N1)]
File "/usr/lib/python3.4/multiprocessing/context.py", line 66, in Lock
return Lock(ctx=self.get_context())
File "/usr/lib/python3.4/multiprocessing/synchronize.py", line 163, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/usr/lib/python3.4/multiprocessing/synchronize.py", line 60, in __init__
unlink_now)
OSError: [Errno 12] Cannot allocate memory
I have tried increasing the swap file size to a few GB after seeing some other questions about the issue, but that didn't seem to help. I also reduced the size from 1M to 131K and ended up with the same error. Any ideas on how to circumvent this issue?
Every instance of multiprocessing.Lock() maps a new semaphore file in /dev/shm/ into memory.
man mmap:
ENOMEM The process's maximum number of mappings would have been exceeded.
(Errno 12 is defined as ENOMEM.)
The system's maximum number of mappings is controlled by the kernel parameter vm.max_map_count; you can read it with /sbin/sysctl vm.max_map_count. You will almost certainly find that this value on your system is lower than the number of locks you want to create.
For ways to alter vm.max_map_count, see e.g. this Linux Forums thread.
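As a quick sanity check before allocating the locks, you can compare the kernel limit with N1; a small sketch (Linux only, reading /proc/sys/vm/max_map_count, which mirrors the sysctl call above):

N1 = 1048576  # number of locks the script tries to create

# /proc/sys/vm/max_map_count mirrors `sysctl vm.max_map_count` on Linux
with open('/proc/sys/vm/max_map_count') as f:
    max_maps = int(f.read().strip())

print('vm.max_map_count =', max_maps)
if N1 >= max_maps:
    print('Creating %d locks would exceed the mapping limit; '
          'raise vm.max_map_count or use fewer locks.' % N1)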

Python loadtxt function is not working

I am attempting the following program.
from numpy import zeros, loadtxt
from pylab import plot, xlim, show
from cmath import exp, pi

def dft(y):
    N = len(y)
    c = zeros(N//2+1, complex)
    for k in range(N//2+1):
        for n in range(N):
            c[k] += y[n]*exp(-2j*pi*k*n/N)
    return c

y = loadtxt("pitch.txt", float)
c = dft(y)
plot(abs(c))
xlim(0, 500)
show()
However, when I attempt to run the program, I receive an error for line 13:
  y = loadtxt("pitch.txt",float)
  File "C:\Python32\lib\site-packages\numpy\lib\npyio.py", line 689, in loadtxt
    fh = iter(open(fname, 'U'))
IOError: [Errno 2] No such file or directory: 'pitch.txt'
I was given a file that has all the resources needed to run the program, and I uploaded them into the same folder my Python program is saved in. The pitch.txt file is a text file with a single column of numbers. I'm wondering if there is something wrong with the program, or did I upload the files to the wrong place?
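One thing worth checking: loadtxt resolves a relative filename against the current working directory, not the folder the script lives in, and the IOError says Python cannot find pitch.txt there. A minimal sketch of building the path from the script's own location instead (this assumes pitch.txt really does sit next to the script):

import os
from numpy import loadtxt

# resolve pitch.txt relative to this script's folder rather than the shell's working directory
script_dir = os.path.dirname(os.path.abspath(__file__))
pitch_path = os.path.join(script_dir, "pitch.txt")

print("Looking for:", pitch_path, "exists:", os.path.exists(pitch_path))
y = loadtxt(pitch_path, float)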
