Getting an error when reading a .npy file - python

I am trying to read a large .npy file, but I am unable to read it. Below is my Python code for reading the file.
import numpy as np

pre_train = np.load('weights.npy', allow_pickle=True, encoding="latin1")
data_pic = pre_train.item()
# print(type(data_pic))
for item in data_pic:
    print(item)
The error occurs at data_pic = pre_train.item():
Can only convert an array of size 1 to a Python scalar

Your code does not crash when loading the file. It crashes when calling numpy.ndarray.item. In your case, you do not need to use item().
A good old for-loop will do:
data = np.load('...')
for i in data:
    for j in i:
        print(j)
# 2, 2, 6, 1, ...
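(Not part of the original answer.) For contrast, a minimal sketch of the case where item() is the right call: when the .npy file holds a pickled Python dict rather than a plain array, np.load returns a 0-d object array and item() unwraps it. The file name and contents here are made up for illustration:
import numpy as np

# A dict saved with np.save gets wrapped in a 0-d object array.
np.save('weights_dict.npy', {'layer1': np.zeros((2, 2)), 'layer2': np.ones(3)})

loaded = np.load('weights_dict.npy', allow_pickle=True)
print(loaded.shape)      # () -> a 0-d object array
weights = loaded.item()  # unwrap the dict
for name, value in weights.items():
    print(name, value.shape)
If the file holds an ordinary N-d array instead, item() fails with exactly the "can only convert an array of size 1 to a Python scalar" error above.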

Related

Saving Python Results as .txt File

I have this code (unique_set=np.random.choice([0, 1], (10000, 10, 10, 10))) that generates 10000 3D binary matrices and I'm attempting to save the result as a .txt file. The other similar questions I checked were either trying to write a print statement to a file or were noticeably different. I tried so many of the solutions like the one below, but none of them worked.
sys.stdout = open("test.txt", "w")
print(unique_set)
sys.stdout.close()
Try this one:
import numpy as np

unique_set = np.random.choice([0, 1], (10000, 10, 10, 10))
with open('D:\\yourpath\\filename.txt', 'w') as file:
    file.write('%s\n' % unique_set)
Note that '%s' writes the array's printed representation, which NumPy abbreviates with ellipses for arrays this large, so this is only suitable for small arrays or quick inspection.
Not knowing how the format of your output file should look, this is one possibility:
np.savetxt("test.txt", unique_set.flatten(), delimiter=",")
You can also store it as a JSON text file, which preserves it being a 4D array (see "storing Numpy N dimensional arrays"):
import json

with open('test.txt', 'w') as f:
    json.dump(unique_set.tolist(), f)
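To read the JSON back into a NumPy array, a minimal sketch (assuming the test.txt written above):
import json
import numpy as np

with open('test.txt') as f:
    restored = np.array(json.load(f))

print(restored.shape)  # (10000, 10, 10, 10), the nested lists carry the shape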

H5Py and storage

I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = # ... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
    myArrayChunk = # ... do some calculation to obtain chunk
    saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py
# Make the file
h5py_file = h5py.File(filename, "a")
# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")
for i in np.arange(numberOfChunks):
    myArrayChunk = # ... do some calculation to obtain chunk
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index an h5py dataset the way I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, do I not still have the whole of myArray in memory? How has this saved memory?
Similarly, later on, I would like to read in my file back in one chunk at a time, doing further calculation. i.e. I would like to do something like:
import h5py
# Read in the file
h5py_file = h5py.File(filename, "a")
# Read in myArray
myArray = h5py_file['myArray']
for i in np.arange(numberOfChunks):
    # Read in chunk
    myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
    # ... Do some calculation on myArrayChunk
But by the end of this loop, is the whole of myArray now in memory? I am a little confused about when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Could someone please explain this?
You have the basic idea. Take care when saying "save to memory". NumPy arrays are saved in memory (RAM). HDF5 data is saved on disk (not to memory/RAM!), then accessed (memory used depends on how you access). In the first step you are creating and writing data in chunks to the disk. In the second step you are accessing data from disk in chunks. Working example provided at the end.
When reading data with h5py, there are 2 ways to read the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns an h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
My example also corrects 1 small error in your chunksize increment equation.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
Working Example (writes and reads):
import h5py
import numpy as np

# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:
    numberOfChunks = 3
    chunkSize = 4
    print('WRITING %d chunks w/ chunkSize=%d' % (numberOfChunks, chunkSize))
    # Write dataset to disk
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize, 2, 2), compression="gzip")
    for i in range(numberOfChunks):
        h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize, 2, 2)
        print(h5ArrayChunk)
        h5Array[(i*chunkSize):((i+1)*chunkSize), :, :] = h5ArrayChunk

with h5py.File("SO_61173314.h5", "r") as h5r:
    print('\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks, chunkSize))
    # Access myArray dataset - Note: this is NOT a NumPy array
    myArray = h5r['myArray']
    for i in range(numberOfChunks):
        # Read a chunk into memory (as a NumPy array)
        myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize), :, :]
        # ... Do some calculation on myArrayChunk
        print(myArrayChunk)
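As a side note (not in the original answer): create_dataset also accepts a chunks= keyword that sets the HDF5 on-disk chunk shape. Aligning it with the slabs you write can help I/O performance; a minimal sketch using the same dimensions as the example above (the file name here is made up):
import h5py
import numpy as np

numberOfChunks = 3
chunkSize = 4

with h5py.File("SO_61173314_chunked.h5", "w") as h5w:
    # chunks= matches one written slab, so each write touches exactly one HDF5 chunk
    h5Array = h5w.create_dataset("myArray",
                                 (numberOfChunks*chunkSize, 2, 2),
                                 chunks=(chunkSize, 2, 2),
                                 compression="gzip")
    for i in range(numberOfChunks):
        h5Array[(i*chunkSize):((i+1)*chunkSize), :, :] = np.random.random((chunkSize, 2, 2))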

Streaming multiple numpy arrays to a file

This differs from Write multiple numpy arrays to file in that I need to be able to stream content, rather than writing it all at once.
I need to write multiple compressed numpy arrays in binary to a file. I can not store all the arrays in memory before writing so it is more like streaming numpy arrays to a file.
This currently works fine as text:
file = open("some file", "w")
while doing_stuff:
    file.writelines(somearray + "\n")
where somearray is a new instance every loop.
However, this does not work if I try to write the arrays as binary.
The arrays are created at 30 Hz and grow too big to keep in memory. They also cannot each be stored in a bunch of single-array files, because that would just be wasteful and cause a huge mess.
So I would like only one file per session instead of 10k files per session.
One option might be to use pickle to save the arrays to a file opened as an append binary file:
import numpy as np
import pickle
arrays = [np.arange(n**2).reshape((n,n)) for n in range(1,11)]
with open('test.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

new_arrays = []
with open('test.file', 'rb') as f:
    while True:
        try:
            new_arrays.append(pickle.load(f))
        except EOFError:
            break

assert all((new_array == array).all() for new_array, array in zip(new_arrays, arrays))
This might not be the fastest, but it should be fast enough. It might seem like this would take up more data, but comparing these:
x = 300
y = 300
arrays = [np.random.randn(x, y) for x in range(30)]
with open('test2.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

with open('test3.file', 'ab') as f:
    for array in arrays:
        f.write(array.tobytes())

with open('test4.file', 'ab') as f:
    for array in arrays:
        np.save(f, array)
You'll find the file sizes as 1,025 KB, 1,020 KB, and 1,022 KB respectively.
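If you go the np.save route, the arrays can also be read back one at a time from a single handle. A minimal sketch, reusing the test4.file written above (the exception handling is deliberately broad, since the error raised at end-of-file varies between NumPy versions):
import numpy as np

loaded = []
with open('test4.file', 'rb') as f:
    while True:
        try:
            # np.load on an open file object reads one array from the current position
            loaded.append(np.load(f))
        except (EOFError, ValueError):
            # no data left in the file
            break

print(len(loaded))  # 30, matching the writer loop above (assuming a fresh file)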
An NPZ file is just a zip archive, so you could save each array to a temporary NPY file, add that NPY file to the zip archive, and then delete the temporary file.
For example,
import os
import zipfile
import numpy as np
# File that will hold all the arrays.
filename = 'foo.npz'
with zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        # `a` is the array to be written to the file in this iteration.
        a = np.random.randint(0, 10, size=20)
        # Name for the temporary file to which `a` is written. The root of this
        # filename is the name that will be assigned to the array in the npz file.
        # I've used 'arr_{}' (e.g. 'arr_0', 'arr_1', ...), similar to how `np.savez`
        # treats positional arguments.
        tmpfilename = "arr_{}.npy".format(i)
        # Save `a` to a npy file.
        np.save(tmpfilename, a)
        # Add the file to the zip archive.
        zf.write(tmpfilename)
        # Delete the npy file.
        os.remove(tmpfilename)
Here's an example where that script is run, and then the data is read back using np.load:
In [1]: !ls
add_array_to_zip.py
In [2]: run add_array_to_zip.py
In [3]: !ls
add_array_to_zip.py foo.npz
In [4]: foo = np.load('foo.npz')
In [5]: foo.files
Out[5]:
['arr_0',
'arr_1',
'arr_2',
'arr_3',
'arr_4',
'arr_5',
'arr_6',
'arr_7',
'arr_8',
'arr_9']
In [6]: foo['arr_0']
Out[6]: array([0, 9, 3, 7, 2, 2, 7, 2, 0, 5, 8, 1, 1, 0, 4, 2, 5, 1, 8, 2])
You'll have to test this on your system to see if it can keep up with your array generation process.
Another alternative is to use something like HDF5, with either h5py or pytables.
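A rough illustration of that HDF5 alternative (my own sketch, not from the answer above): h5py datasets created with maxshape=(None, ...) can be resized on the fly, so one .h5 file can absorb a stream of equally-shaped arrays as they are produced. The file name, dataset name and 300x300 shape below are assumptions:
import h5py
import numpy as np

x, y = 300, 300  # assumed per-array shape

with h5py.File('stream.h5', 'w') as f:
    # Unlimited first axis so the dataset can grow as arrays arrive
    dset = f.create_dataset('frames', shape=(0, x, y), maxshape=(None, x, y),
                            chunks=(1, x, y), compression='gzip')
    for _ in range(10):  # stand-in for the 30 Hz producer loop
        frame = np.random.randn(x, y)
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = frame  # append the new array at the end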

Batch processing: read image files, then write multidimensional numpy array to HDF5

I am trying to iteratively load a batch of images from a folder, process, then store the results of the batch to an hdf file. What's the best practice for batch reading images/files, and batch storing a resulting multi-dimensional array?
First Part
I start with a csv list of file names:
file_list = [''.join(x) + '.png' for x in permutations('abcde')][:100]
Say for example I want to process 5 images at a time.
I currently grab 5 file names from the list, create an empty array to hold 5 images, then read each image one at a time to yield a batch.
def load_images(file_list):
    for i in range(0, 100, 5):
        files_list = file_list[i:i + 5]
        image_list = np.zeros(shape=(5, 50, 50, 3))
        for idx, file in enumerate(files_list):
            loaded_img = np.random.random((50, 50, 3))  # misc.imread(file)
            image_list[idx] = loaded_img
        yield image_list, files_list
Question 1: Is there a way to eliminate the second for loop? Can I batch read in the images, or is the method above (one at a time) best practice?
Second Part:
After loading the images, I do some processing on them. This results in an array of a different size.
def process_images(image_batch):
    result = image_batch[:, 5, 4, 3]  # a novel down-sampling algorithm
    return result
Now, I want to store the batch of images with their original file names.
def store_images(data, file_names):
    with pd.HDFStore('output.h5') as hdf:
        pass
Question 2: What is the best way to store a batch of multidimensional numpy arrays, while still referencing them with a key (such as the original file name)?
I would like to explore using .h5 files, so if anyone knows how to batch process data to an .h5 file and has advice on this, it would be most appreciated. Alternatively, I think there is a way to save the numpy arrays as just .npy files to a folder, but I was having trouble with this and still wouldn't know how to do it other than one sample at a time (versus one batch at a time).
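Since no answer is included here, a rough sketch of one possible direction (my own illustration, reusing file_list, load_images and process_images from above): append each processed batch to a resizable h5py dataset and keep the file names in a parallel variable-length string dataset, so row i of one lines up with row i of the other. The dataset names and shapes follow the placeholder process_images above, which yields one value per image; widen them to match your real down-sampled output.
import h5py

with h5py.File('output.h5', 'w') as hdf:
    data_dset = hdf.create_dataset('processed', shape=(0,), maxshape=(None,))
    name_dset = hdf.create_dataset('file_names', shape=(0,), maxshape=(None,),
                                   dtype=h5py.string_dtype())
    for image_batch, files_list in load_images(file_list):
        result = process_images(image_batch)
        n = data_dset.shape[0]
        data_dset.resize(n + len(result), axis=0)
        data_dset[n:] = result
        name_dset.resize(n + len(files_list), axis=0)
        name_dset[n:] = files_list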

Fastest way to write a file with h5py

First of all, I read the topic "Fastest way to write hdf5 file with Python?", but it was not very helpful.
I am trying to load a file of about 1 GB (a matrix of size (70133351, 1)) into an HDF5 structure.
Pretty simple code, but slow.
import h5py
f = h5py.File("8.hdf5", "w")
dset = f.create_dataset("8", (70133351,1))
myfile=open("8.txt")
for line in myfile:
    line = line.split("\t")
    dset[line[1]] = line[0]
myfile.close()
f.close()
I have a smaller version of the matrix with 50MB, and I tried the same code, and it was not finished after 24 hours.
I know the way to make it faster is to avoid the for loop. If I were using regular Python, I would use a dict comprehension. However, it looks like that does not fit here.
I can query the file later by:
f = h5py.File("8.hdf5")
h = f['8']
print 'GFXVG' in h.attrs
which would answer "True", considering that GFXVG is one of the keys in h.
Does anyone have an idea?
Example of part of the file:
508 LREGASKW
592 SVFKINKS
1151 LGHWTVSP
131 EAGQIISE
198 ELDDSARE
344 SQAVAVAN
336 ELDDSARF
592 SVFKINKL
638 SVFKINKI
107 PRTGAGQH
107 PRTGAAAA
Thanks
You can load all the data into a NumPy array with loadtxt and use it to instantiate your HDF5 dataset.
import h5py
import numpy as np
d = np.loadtxt('data.txt', dtype='|S18')
which returns
array([['508.fna', 'LREGASKW'],
['592.fna', 'SVFKINKS'],
['1151.fna', 'LGHWTVSP'],
['131.fna', 'EAGQIISE'],
['198.fna', 'ELDDSARE'],
['344.fna', 'SQAVAVAN'],
['336.fna', 'ELDDSARF'],
['592.fna', 'SVFKINKL'],
['638.fna', 'SVFKINKI'],
['107.fna', 'PRTGAGQH'],
['1197.fna', 'ELDDSARR'],
['1309.fna', 'SQTIYVWF'],
['974.fna', 'PNNLRFIA'],
['230.fna', 'IGKVYHIE'],
['76.fna', 'PGVHSVWV'],
['928.fna', 'HERGGAND'],
['520.fna', 'VLKTDTTG'],
['1290.fna', 'EAALDLHR'],
['25.fna', 'FCSILGVV'],
['284.fna', 'YHKLTFED'],
['1110.fna', 'KITSSSDF']],
dtype='|S18')
and then
h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)
that gives:
<HDF5 dataset "data": shape (21, 2), type "|S18">
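To query it afterwards, a rough sketch (assuming the two-column layout above; note that h5py hands the fixed-width strings back as bytes in Python 3):
import h5py

with h5py.File('data.hdf5', 'r') as h:
    d = h['data'][:]            # read the whole (21, 2) string dataset into memory
    keys = d[:, 1]              # the second column holds the 8-character sequences
    print(b'LREGASKW' in keys)  # True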
Since it's only a GB, why not load it completely into memory first? Note that it looks like you're also indexing into the dset with a str, which is likely the issue.
I just realized I misread the initial question, sorry about that. It looks like your code is attempting to use index 1 of the line, which appears to be a string, as an index? Perhaps there is a typo?
import h5py
from numpy import zeros

# assuming your strings are all 8 characters; use a variable-length dtype otherwise
data = zeros((70133351, 1), dtype='|S8')
with open('8.txt') as myfile:
    for line in myfile:
        idx, item = line.strip().split("\t")
        data[int(idx)] = item
with h5py.File('8.hdf5', 'w') as f:
    dset = f.create_dataset("8", (70133351, 1), data=data)
I ended up using the shelve library (Pickle versus shelve storing large dictionaries in Python) to store a large dictionary in a file. It took me 2 days just to write the hash into the file, but once it was done, I am able to load and access any element very fast. At the end of the day, I don't have to re-read my big file and write all the information into the hash again to do whatever I was trying to do with it.
Problem solved!
