optimal data structure to store million of pixels in python?

optimal data structure to store million of pixels in python? - python

I have several images and after some basic processing and contour detection I want to store the detected pixels locations and their adjacent neighbours values into a Python Data Structure. I settled for numpy.array
The pixel locations from each Image are retrieved using:
locationsPx = cv2.findNonZero(SomeBWImage)
which will return an array of the shape (NumberOfPixels,1L,2L) with :
print(locationsPx[0]) : array([[1649, 4]])
for example.
My question is: is it possible to store this double array on a single column in another array? Or should I use a list and drop the array all together?
note: the dataset of images might increase so the dimensions of my chose data structure will not be only huge, but also variable
EDIT: or maybe numpy.array is not good idea and Pandas Dataframe is better suited? I am open to suggestion from those who have more experience in this.

Numpy arrays are great for computation. They are not great for storing data if the size of the data keeps changing. As ali_m pointed out, all forms of array concatenation in numpy are inherently slow. Better to store the arrays in a plain-old python list:
coordlist = []
coordlist.append(locationsPx[0])
Alternatively, if your images have names, it might be attractive to use a dict with the image names as keys:
coorddict = {}
coorddict[image_name] = locationsPx[0]
Either way, you can readily iterate over the contents of the list:
for coords in coordlist:
or
for image_name, coords in coorddict.items():
And pickle is a convenient way to store your results in a file:
import pickle
with open("filename.pkl", "wb") as f:
pickle.dump(coordlist, f, pickle.HIGHEST_PROTOCOL)
(or same with coorddict instead of coordlist).
Reloading is trivially easy as well:
with open("filename.pkl", "rb") as f:
coordlist = pickle.load(f)
There are some security concerns with pickle, but if you only load files you have created yourself, those don't apply.
If you find yourself frequently adding to a previously pickled file, you might be better off with an alternative back end, such as sqlite.

Related

Converting CSV to numpy NPY efficiently

How to convert a .csv file to .npy efficently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec =np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish file, the actual .csv file I'm working on has ~12 million lines with 1024 columns, it takes quite a lot to load everything into RAM before converting into an .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as numpy array iteratively. But seems like the np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5 but the format is not the main objective of this question and the hdf5 format isn't desired in my use-case since I've to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part2 are not possible, are there other efficient storage (e.g. tensorstore) that can store and efficiently convert to numpy array when loading the saved storage format?
There is another library tensorstore that seems to efficiently handles arrays which support conversion to numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without the exact dimensions, all of the examples seem to include configurations like 'dimensions': [1000, 20000],.
Unlike the HDF5, the tensorstore doesn't seem to have reading overhead issues when converting to numpy, from docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)

Nice question; Informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such array -- 12M x 1K.
I don't specifically know about how np.loadtxt (genfromtxt) is operating behind the scenes, so I will tell you how I would do (after trying like you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
np.array([True]).itemsize * 12E6 * 1024
))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out a possible swapping of the working memory. You may have enough physical (RAM) memory in your machine, but if not enough of free memory, your system will use the swap memory (i.e, disk) to keep your system stable & have the work done. The cost you pay is clear: read/writing from/to the disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.
More precisely, I would have the (big) array already instantiated before start reading the file. Only then, I would read each line, split the columns, and give it to np.asarray and assign those (1024) values to each respective row of the output array.
The looping over the file is slow, yes. The thing here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array, and the "line" (1024) array. Sure, there are quite a considerable amount of memory being consumed in each loop in the temporary objects during reading (text!) values, splitting into list elements and casting to an array. Still, it's something that will remain largely constant during the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
Sure enough, you can even make it parallel: If on one hand text files cannot be randomly (r/w) accessed, on the other hand you can easily split them (see How can I split one text file into multiple *.txt files?) to have -- if fun is at the table -- them read in parallel, if that time if critical.
Hope that helps.

TL;DR
Export to a different function other than .npy seems inevitable unless your machine is able to handle the size of the data in-memory as per described in #Brandt answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data size larger than what the RAM can handle, one would often resort to libraries that performs "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask . These libraries would be able to lazily load the .csv files into dataframes and process them by chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
convert=True,
chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation, the machine needs to have enough RAM to fit all data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask best practices actually reimplemented their own array objects that are simpler than numpy array, see https://docs.dask.org/en/stable/array-best-practices.html. But when going through the docs, it seems like the format they have saved the dask array in are not .npy but various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given the numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seems to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
sframe saves the data into their own binary format
dask export functions save to_hdf() and to_parquet() format

It it's latest version (4.14) vaex support "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood so it is supper fast. Try something like
df = vaex.open(my_file.csv)
# or
df = vaex.from_csv_arrow(my_file.csv, lazy=True)
Then you can export to bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format..

import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
# Convert the chunk to a numpy array
chunk_array = chunk.to_numpy()
# Append the chunk to the data array
data = np.append(data, chunk_array, axis=0)
np.save(npy_file, data)
# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)```

I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. With efficient I guess primarily meaning with low memory requirements.
Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
for line in csv:
row = np.fromstring(line, sep=',')
npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
done = False
while not done:
batch = []
for count, line in enumerate(csv):
row = np.fromstring(line, sep=',')
batch.append(row)
if count + 1 >= batch_lines:
break
else:
done = True
npy.append(np.array(batch))
(code is not tested)

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to some basic min/max scaling of the numerical data within latlong, so i want to put it in its own h5py file for easier use and lower memory usage.
However, when i try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong)
out.close()
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?

Could be h5f['data/lat_long'] is using compression filters (and you aren't). To check the original dataset's compression settings, use this line:
print (h5f['data/latlong'].compression, h5f['data/latlong'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. A h5py dataset object behaves similar to a NumPy array. Instead, use this line: ds = h5f1['data/latlong'] to create a dataset object (instead of an array) and use it "like" it's a NumPy array. FYI, .value is a deprecated method to return the dataset as an array. Use this syntax instead arr = h5f1['data/latlong'][()]. Loading the dataset into an array also requires more memory than using an h5py object (which could be an issue with large datasets).
There are other ways to access the data. My suggestion to use dataset objects is 1 way. Your method (extracting data to a new file) is another way. I am not found of that approach because you now have 2 copies of the data; a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: [How can I combine multiple .h5 file?][1] Method 1: Create External Links.
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method, it will retain compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
h5py.File('latlong_only.h5', 'w') as h5f2:
h5f1.copy(h5f1['data/latlong'], h5f2,'latlong')

How to split large data and rejoin later

My code generates a list of numpy arrays of size (1, 1, n, n, m, m) where n may vary from 50-100 and m from 5-10 depending on the case at hand. The length of the list itself may go up to 10,000 and is being written/dumped using pickle at the end of the code. For cases at the higher end of these numbers or when file sizes go beyond 5-6 GB, I get Out of Memory error. Below is a made up example of the situation,
import numpy as np
list, list_length = [], 1000
n = 100
m = 3
for i in range(0, list_length):
list.append(np.random.random((1, 1, n, n, m, m)))
file_path = 'C:/Users/Desktop/Temp/'
with open(file_path, 'wb') as file:
pickle.dump(list, file)
I am looking for a way that helps me to
split the data so that I can get rid of memory error, and
rejoin the data in the original form when needed later
All I could think is:
for i in range(0, list_length):
data = np.random.random((1, 1, n, n, m, m))
file_path = 'C:/Users/Desktop/Temp/'+str(i)
with open(file_path, 'wb') as file:
pickle.dump(data, file)
and then combine using:
combined_list = []
for i in range(0, list_length):
file_path = 'C:/Users/Desktop/Temp/single' + str(i)
with open(file_path, 'rb') as file:
data = pickle.load(file)
combined_list.append(data)
Using this way, the file size certainly reduces due to multiple files, but that also increases processing time due to multiple file I/O operations.
Is there a more elegant and better way to do this?

Using savez, savez_compressed, or even things like h5py can be useful as #tel mentioned, but that takes extra effort trying to do "reinvent" caching mechanism. There are two easier ways to process larger-than-memory ndarray if applicable:
The easiest way is of course enable pagefile (or some other name) on Windows or swap on Linux (not sure about OS X counter part). This creates a virtually large enough memory so that you don't need to worry about memory at all. It will save to disk/load from disk accordingly
If the first way is not applicable due to not have admin rights or etc, numpy provides another way: np.memmap. This function maps an ndarray to disk such that you can index it just like it is in memory. Technically IO is done directly to the hard disk but OS will cache accordingly
For the second way, you can create a hard-disk side ndarray using:
np.memmap('yourFileName', 'float32', 'w+', 0, 2**32)
This creates a 16GB float32 array within no time (containing 4G numbers). You can then do IO to it. A lot of functions have an out parameter. You can set the out parameter accordingly so that the output is not "copied" to the disk from memory
If you want to save a list of ndarrays using the second method, either create a lot of memmaps, or concat them to a single array

Don't use pickle to store large data, it's not an efficient way to serialize anything. Instead, use the built-in numpy serialization formats/functions via the numpy.savez_compressed and numpy.load functions.
System memory isn't infinite, so at some point you'll still need to split your files (or use a heavier duty solution such as the one provided by the h5py package). However, if you were able to fit the original list into memory then savez_compressed and load should do what you need.

Can I delete an element from an HDF5 dataset?

I would like to delete an element from an HDF5 dataset in Python. Below I have my example code
DeleteHDF5Dataset.py
# This code works, which deletes an HDF5 dataset from an HDF5 file
file_name = os.path.join('myfilepath', 'myfilename.hdf5')
f = h5py.File(file_name, 'r+')
f.__delitem__('Log list')
However, this is not what I want to do. 'mydatatset' is an HDF5 dataset that has several elements, and I would like to delete one or more of the elements individually, for example
DeleteHDF5DatasetElement.py
# This code does not work, but I would like to achieve what it's trying to do
file_name = os.path.join('myfilepath', 'myfilename.hdf5')
f = h5py.File(file_name, 'r+')
print(f['Log list'][3]) # prints the correct dataset element
f.__delitem__('Log list')[3] # I want to delete element 3 of this HDF5 dataset
The best solution I can come up with is to create a temporary dataset, loop through the original dataset, and only add the entries I want to keep to the temp dataset, and then replace the old dataset with the new one. But this seems pretty clunky. Does anybody have a clean solution to do this? It seems like there should be a simple way to just delete an element.
Thanks, and sorry if any of my terminology is incorrect.

It looks like you have an array of strings. It's not the recommended way of storing strings in HDF5, but let's assume you have no choice on how data is stored.
HDF5 prefers you to keep your array size fixed. Operations such as deleting arbitrary elements are expensive. In addition, with HDF5, space is not automatically freed when you delete data.
After all this, if you still want to remove data in your specified format, you can try simply extracting an array, deleting an element, then reassigning to your dataset:
arr = f['Log list'][:] # extract to numpy array
res = np.delete(arr, 1) # delete element with index 1, i.e. second element
f.__delitem__('Log list') # delete existing dataset
f['Log list'] = res # reassign to dataset

get specific content from file python

I have a file test.txt which has an array:
array = [3,5,6,7,9,6,4,3,2,1,3,4,5,6,7,8,5,3,3,44,5,6,6,7]
Now what I want to do is get the content of array and perform some calculations with the array. But the problem is when I do open("test.txt") it outputs the content as the string. Actually the array is very big, and if I do a loop it might not be efficient. Is there any way to get the content without splitting , ? Any new ideas?

I recommend that you save the file as json instead, and read it in with the json module. Either that, or make it a .py file, and import it as python. A .txt file that looks like a python assignment is kind of odd.

Does your text file need to look like python syntax? A list of comma separated values would be the usual way to provide data:
1,2,3,4,5
Then you could read/write with the csv module or the numpy functions mentioned above. There's a lot of documentation about how to read csv data in efficiently. Once you had your csv reader data object set up, data could be stored with something like:
data = [ map( float, row) for row in csvreader]

If you want to store a python-like expression in a file, store only the expression (i.e. without array =) and parse it using ast.literal_eval().
However, consider using a different format such as JSON. Depending on the calculations you might also want to consider using a format where you do not need to load all data into memory at once.

Must the array be saved as a string? Could you use a pickle file and save it as a Python list?
If not, could you try lazy evaluation? Maybe only process sections of the array as needed.
Possibly, if there are calculations on the entire array that you must always do, it might be a good idea to pre-compute those results and store them in the txt file either in addition to the list or instead of the list.

You could also use numpy to load the data from the file using numpy.genfromtxt or numpy.loadtxt. Both are pretty fast and both have the ability to do the recasting on load. If the array is already loaded though, you can use numpy to convert it to an array of floats, and that is really fast.
import numpy as np
a = np.array(["1", "2", "3", "4"])
a = a.astype(np.float)

You could write a parser. They are very straightforward. And much much faster than regular expressions, please don't do that. Not that anyone suggested it.
# open up the file (r = read-only, b = binary)
stream = open("file_full_of_numbers.txt", "rb")
prefix = '' # end of the last chunk
full_number_list = []
# get a chunk of the file at a time
while True:
# just a small 1k chunk
buffer = stream.read(1024)
# no more data is left in the file
if '' == buffer:
break
# delemit this chunk of data by a comma
split_result = buffer.split(",")
# append the end of the last chunk to the first number
split_result[0] = prefix + split_result[0]
# save the end of the buffer (a partial number perhaps) for the next loop
prefix = split_result[-1]
# only work with full results, so skip the last one
numbers = split_result[0:-1]
# do something with the numbers we got (like save it into a full list)
full_number_list += numbers
# now full_number_list contains all the numbers in text format
You'll also have to add some logic to use the prefix when the buffer is blank. But I'll leave that code up to you.

OK, so the following methods ARE dangerous. Since they are used to attack systems by injecting code into them, used them at your own risk.
array = eval(open("test.txt", 'r').read().strip('array = '))
execfile('test.txt') # this is the fastest but most dangerous.
Safer methods.
import ast
array = ast.literal_eval(open("test.txt", 'r').read().strip('array = ')).
...
array = [float(value) for value in open('test.txt', 'r').read().strip('array = [').strip('\n]').split(',')]
The eassiest way to serialize python objects so you can load them later is to use pickle. Assuming you dont want a human readable format since this adds major head, either-wise, csv is fast and json is flexible.
import pickle
import random
array = random.sample(range(10**3), 20)
pickle.dump(array, open('test.obj', 'wb'))
loaded_array = pickle.load(open('test.obj', 'rb'))
assert array == loaded_array
pickle does have some overhead and if you need to serialize large objects you can specify the compression ratio, the default is 0 no compression, you can set it to pickle.HIGHEST_PROTOCOL pickle.dump(array, open('test.obj', 'wb'), pickle.HIGHEST_PROTOCOL)
If you are working with large numerical or scientific data sets then use numpy.tofile/numpy.fromfile or scipy.io.savemat/scipy.io.loadmat they have little overhead, but again only if you are already using numpy/scipy.
good luck.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

optimal data structure to store million of pixels in python? - python

Related

Converting CSV to numpy NPY efficiently

h5py file subset taking more space than parent file?

How to split large data and rejoin later

Can I delete an element from an HDF5 dataset?

get specific content from file python

Categories

Resources