Saving a dictionary of numpy arrays in human-readable format - python

This is not a duplicate question. I looked around a lot and found this question, but the savez and pickle utilities render the file unreadable by a human. I want to save it in a .txt file which can be loaded back into a Python script. So I wanted to know whether there are some utilities in Python which can facilitate this task and keep the written file readable by a human.
The dictionary of numpy arrays contains 2D arrays.
EDIT:
Based on Craig's answer, I tried the following:
import numpy as np
W = np.arange(10).reshape(2,5)
b = np.arange(12).reshape(3,4)
d = {'W':W, 'b':b}
with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))
f = open('out.txt', 'r')
d = eval(f.readline())
print(d)
This gave the following error: SyntaxError: unexpected EOF while parsing.
But out.txt did contain the dictionary as expected. How can I load it correctly?
EDIT 2:
Ran into a problem: Craig's answer truncates the array if it is large. out.txt shows the first few elements, replaces the middle elements with ..., and shows the last few elements.

Convert the dict to a string using repr() and write that to the text file.
import numpy as np
d = {'a':np.zeros(10), 'b':np.ones(10)}
with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))
You can read it back in and convert to a dictionary with eval(). Note that repr(d) spans several lines, so read the whole file with read() rather than readline(), and the name array has to be resolvable when eval() runs:
import numpy as np
f = open('out.txt', 'r')
data = f.read()
data = data.replace('array', 'np.array')
d = eval(data)
Or, you can directly import array from numpy:
from numpy import array
f = open('out.txt', 'r')
data = f.read()
d = eval(data)
H/T: How can a string representation of a NumPy array be converted to a NumPy array?
Handling large arrays
By default, numpy summarizes arrays longer than 1000 elements. You can change this behavior by calling numpy.set_printoptions(threshold=S) where S is larger than the size of the arrays. For example:
import numpy as np
W = np.arange(10).reshape(2,5)
b = np.arange(12).reshape(3,4)
d = {'W':W, 'b':b}
largest = max(np.prod(a.shape) for a in d.values()) #get the size of the largest array
np.set_printoptions(threshold=largest) #set threshold to largest to avoid summarizing
with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))
np.set_printoptions(threshold=1000) #recommended, but not necessary
H/T: Ellipses when converting list of numpy arrays to string in python 3
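If you would rather not change the global print options at all, newer numpy releases (1.15 and up) also provide np.printoptions as a context manager, so the threshold is raised only while writing. A minimal sketch of that variant:
import numpy as np
W = np.arange(10).reshape(2,5)
b = np.arange(12).reshape(3,4)
d = {'W':W, 'b':b}
largest = max(a.size for a in d.values()) #size of the largest array
#the threshold is raised only inside the with-block and restored automatically afterwards
with np.printoptions(threshold=largest), open('out.txt', 'w') as outfile:
    outfile.write(repr(d))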

Related

How do I read a text file of numbers into an array of arrays

In python, using the OpenCV library, I need to create some polylines. The example code for the polylines method shows:
cv2.polylines(img,[pts],True,(0,255,255))
I have all the 'pts' laid out in a text file in the format:
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
How can I read this file and provide the data to the [pts] variable in the method call?
I've tried the np.array(csv.reader(...)) method as well as a few others I've found examples of. I can successfully read the file, but it's not in the format the polylines method wants. (I am a newbie when it comes to python, if this was C++ or Java, it wouldn't be a problem).
I would try to use numpy to read the csv as an array.
from numpy import genfromtxt
p = genfromtxt('myfile.csv', delimiter=',')
cv2.polylines(img,p,True,(0,255,255))
You may have to pass a dtype argument to genfromtxt if you need to coerce the data to a specific format.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
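For example, a minimal sketch that coerces the points to integers while reading (the dtype choice here is just an assumption; OpenCV drawing functions generally want int32 coordinates):
from numpy import genfromtxt
#read the comma-separated values as 32-bit integers instead of the default floats
p = genfromtxt('myfile.csv', delimiter=',', dtype='int32')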
In case you know it is a fixed number of items in each row:
import csv
with open('myfile.csv') as csvfile:
    rows = csv.reader(csvfile)
    res = list(zip(*rows))
print(res)
I know it's not pretty and there is probably a MUCH BETTER way to do this, but it works. That being said, if someone could show me a better way, it would be much appreciated.
import numpy as np #needed for np.asarray below

pointlist = []
f = open(args["slots"])
data = f.read().split()
for row in data:
    tmp = []
    col = row.split(";")
    for points in col:
        xy = points.split(",")
        tmp += [[int(pt) for pt in xy]]
    pointlist += [tmp]
slots = np.asarray(pointlist)
You might need to draw each polyline individually (to expand on #Chris's answer):
from numpy import genfromtxt
lines = genfromtxt('myfile.csv', delimiter=',')
for line in lines:
    #polylines expects a list of int32 point arrays
    cv2.polylines(img, [line.reshape((-1, 2)).astype('int32')], True, (0, 255, 255))

python sparse matrix creation: parallelize to speed up

I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and that feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; in each pair, the value to the left of the colon is the feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json #needed for json.dump below
import numpy as np
import tables as tb #needed for tb.open_file, tb.Filters, tb.Float32Atom below
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? Maybe using MapReduce? What is wrong with this function that makes it so slow?
IO + conversions (from str, to str, even twice to str of the same variable, etc.) + splits + explicit loops. By the way, there is a CSV module in Python which may be used to parse your input file; you can experiment with it (I suppose you use space as the delimiter). Also, I see you convert element[0] to int/str, which is bad - you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (arrays?). Also, you can try to implement it in another style: with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for everything). And certainly try to reduce the number of conversions. Also, if the input file is yours, you can format it with fixed-length fields - this lets you avoid split/parse entirely (only string indexing).
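As a rough illustration of that advice, here is a minimal sketch (not your original function) that parses the same line format with plain string splits and builds the whole sparse matrix in one coo_matrix call, instead of converting through numpy char arrays and writing cell by cell; the file name and the absence of feature filtering are simplifications:
from scipy.sparse import coo_matrix

rows, cols, vals = [], [], []
with open('input_file.txt') as input_data:
    for index_film, line in enumerate(input_data):
        tokens = line.split()
        id_film = tokens[0] #keep this if you still need the film-id mappings
        for pair in tokens[1:]:
            feature, score = pair.split(':')
            rows.append(index_film)
            cols.append(int(feature))
            vals.append(float(score))
#one construction call instead of per-cell writes; pass shape=... if you need fixed dimensions
matrix = coo_matrix((vals, (rows, cols))).tocsr()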

Reading several arrays in a binary file with numpy

I'm trying to read a binary file which is composed of several matrices of floats, each separated by a single int. The code in Matlab to achieve this is the following:
fid1 = fopen(fname1,'r');
for i = 1:xx
    Rstart = fread(fid1,1,'int32');      % read blank at the beginning
    ZZ1 = fread(fid1,[Nx Ny],'real*4');  % read z
    Rend = fread(fid1,1,'int32');        % read blank at the end
end
As you can see, each matrix size is Nx by Ny. Rstart and Rend are just dummy values. ZZ1 is the matrix I'm interested in.
I am trying to do the same in python, doing the following:
Rstart = np.fromfile(fname1,dtype='int32',count=1)
ZZ1 = np.fromfile(fname1,dtype='float32',count=Ny1*Nx1).reshape(Ny1,Nx1)
Rend = np.fromfile(fname1,dtype='int32',count=1)
Then, I have to iterate to read the subsequent matrices, but np.fromfile, when given a filename, does not keep the file position between calls.
Another option:
with open(fname1,'r') as f:
    ZZ1=np.memmap(f, dtype='float32', mode='r', offset = 4,shape=(Ny1,Nx1))
plt.pcolor(ZZ1)
This works fine for the first array, but doesn't read the next matrices. Any idea how I can do this?
I searched for similar questions but didn't find a suitable answer.
Thanks
The cleanest way to read all your matrices in a single vectorized statement is to use a struct array:
dtype = [('start', np.int32), ('ZZ', np.float32, (Ny1, Nx1)), ('end', np.int32)]
with open(fname1, 'rb') as fh:
    data = np.fromfile(fh, dtype)
print(data['ZZ'])
There are 2 solutions for this problem.
The first one:
for i in range(x):
    ZZ1 = np.memmap(fname1, dtype='float32', mode='r', offset=4+8*i+(Nx1*Ny1)*4*i, shape=(Ny1,Nx1))
Where i is the index of the matrix you want to get.
The second one:
fid=open('fname','rb')
for i in range(x):
    Rstart = np.fromfile(fid,dtype='int32',count=1)
    ZZ1 = np.fromfile(fid,dtype='float32',count=Ny1*Nx1).reshape(Ny1,Nx1)
    Rend = np.fromfile(fid,dtype='int32',count=1)
So as morningsun points out, np.fromfile can receive a file object as an argument and keep track of the pointer. Notice that you must open the file in binary mode 'rb'.
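Putting that together, a minimal sketch that keeps every matrix instead of overwriting ZZ1 on each pass (it assumes x, Nx1 and Ny1 are already known, as in your Matlab loop):
import numpy as np

matrices = []
with open(fname1, 'rb') as fid:
    for i in range(x):
        np.fromfile(fid, dtype='int32', count=1)  #skip Rstart
        ZZ1 = np.fromfile(fid, dtype='float32', count=Ny1*Nx1).reshape(Ny1, Nx1)
        matrices.append(ZZ1)
        np.fromfile(fid, dtype='int32', count=1)  #skip Rend
stack = np.stack(matrices)  #shape (x, Ny1, Nx1)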

Fastest way to write a file with h5py

First of all, I read the topic "Fastest way to write hdf5 file with Python?", but it was not very helpful.
I am trying to load a file of about 1 GB (a matrix of size (70133351,1)) into an hdf5 structure.
Pretty simple code, but slow.
import h5py
f = h5py.File("8.hdf5", "w")
dset = f.create_dataset("8", (70133351,1))
myfile=open("8.txt")
for line in myfile:
line=line.split("\t")
dset[line[1]]=line[0]
myfile.close()
f.close()
I have a smaller version of the matrix with 50MB, and I tried the same code, and it was not finished after 24 hours.
I know the way to make it faster is to avoid the for loop. If I were using regular Python, I would use a dict comprehension. However, it looks like that does not fit here.
I can query the file later by:
f = h5py.File("8.hdf5")
h=f['8']
print 'GFXVG' in h.attrs
Which would print True, considering that GFXVG is one of the keys in h.
Does someone have any idea?
Example of part of the file:
508 LREGASKW
592 SVFKINKS
1151 LGHWTVSP
131 EAGQIISE
198 ELDDSARE
344 SQAVAVAN
336 ELDDSARF
592 SVFKINKL
638 SVFKINKI
107 PRTGAGQH
107 PRTGAAAA
Thanks
You can load all the data into a numpy array with loadtxt and use it to instantiate your hdf5 dataset.
import h5py
import numpy as np
d = np.loadtxt('data.txt', dtype='|S18')
which returns
array([['508.fna', 'LREGASKW'],
       ['592.fna', 'SVFKINKS'],
       ['1151.fna', 'LGHWTVSP'],
       ['131.fna', 'EAGQIISE'],
       ['198.fna', 'ELDDSARE'],
       ['344.fna', 'SQAVAVAN'],
       ['336.fna', 'ELDDSARF'],
       ['592.fna', 'SVFKINKL'],
       ['638.fna', 'SVFKINKI'],
       ['107.fna', 'PRTGAGQH'],
       ['1197.fna', 'ELDDSARR'],
       ['1309.fna', 'SQTIYVWF'],
       ['974.fna', 'PNNLRFIA'],
       ['230.fna', 'IGKVYHIE'],
       ['76.fna', 'PGVHSVWV'],
       ['928.fna', 'HERGGAND'],
       ['520.fna', 'VLKTDTTG'],
       ['1290.fna', 'EAALDLHR'],
       ['25.fna', 'FCSILGVV'],
       ['284.fna', 'YHKLTFED'],
       ['1110.fna', 'KITSSSDF']],
      dtype='|S18')
and then
h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)
that gives:
<HDF5 dataset "init": shape (21, 2), type "|S18">
Since it's only a GB, why not load it completely into memory first? Note, it looks like you're also indexing into the dset with a str, which is likely the issue.
I just realized I misread the initial question, sorry about that. It looks like your code is attempting to use the index 1, which appears to be a string, as an index? Perhaps there is a typo?
import h5py
from numpy import zeros

data = zeros((70133351,1), dtype='|S8') # assuming your strings are all 8 characters, use object if vlen
with open('8.txt') as myfile:
    for line in myfile:
        idx, item = line.strip().split("\t")
        data[int(idx)] = item  # index by the first column, store the string from the second
with h5py.File('8.hdf5', 'w') as f:
    dset = f.create_dataset("8", (70133351, 1), data=data)
I ended up using the shelve library (Pickle versus shelve storing large dictionaries in Python) to store a large dictionary in a file. It took me 2 days just to write the hash into the file, but once it was done, I am able to load and access any element very fast. At the end of the day, I don't have to read my big file and write all the information into the hash to do whatever I was trying to do with the hash.
Problem solved!
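For reference, a minimal sketch of that shelve approach (the .shelf file name is just a placeholder, and it assumes Python 3, where shelve.open works as a context manager):
import shelve

#write once: each key/value pair is persisted to disk as it is assigned
with shelve.open('8.shelf') as db:
    with open('8.txt') as myfile:
        for line in myfile:
            idx, item = line.strip().split('\t')
            db[item] = idx
#later: reopen and look up single entries without re-reading the big text file
with shelve.open('8.shelf') as db:
    print('LREGASKW' in db)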

Adding Header to Numpy array

I have an array I would like to add a header for.
This is what I have now:
0.0,1.630000e+01,1.990000e+01,1.840000e+01
1.0,1.630000e+01,1.990000e+01,1.840000e+01
2.0,1.630000e+01,1.990000e+01,1.840000e+01
This is what I want:
SP,1,2,3
0.0,1.630000e+01,1.990000e+01,1.840000e+01
1.0,1.630000e+01,1.990000e+01,1.840000e+01
2.0,1.630000e+01,1.990000e+01,1.840000e+01
Notes:
"SP" will always be 1st followed by the numbering of the columns which may vary
here is my existing code:
fmt = ",".join(["%s"] + ["%10.6e"] * (my_array.shape[1]-1))
np.savetxt('final.csv', my_array, fmt=fmt,delimiter=",")
As of numpy 1.7.0, numpy.savetxt accepts three parameters for exactly this purpose: header, footer and comments. So the code to do what you wanted can easily be written as:
import numpy
a = numpy.array([[0.0,1.630000e+01,1.990000e+01,1.840000e+01],
                 [1.0,1.630000e+01,1.990000e+01,1.840000e+01],
                 [2.0,1.630000e+01,1.990000e+01,1.840000e+01]])
fmt = ",".join(["%s"] + ["%10.6e"] * (a.shape[1]-1))
numpy.savetxt("temp", a, fmt=fmt, header="SP,1,2,3", comments='')
Note: this answer was written for an older version of numpy, relevant when the question was written. With modern numpy, makhlaghi's answer provides a more elegant solution.
Since numpy.savetxt can also write to file objects, you can open the file yourself and write your header before the data:
import numpy
a = numpy.array([[0.0,1.630000e+01,1.990000e+01,1.840000e+01],
                 [1.0,1.630000e+01,1.990000e+01,1.840000e+01],
                 [2.0,1.630000e+01,1.990000e+01,1.840000e+01]])
fmt = ",".join(["%s"] + ["%10.6e"] * (a.shape[1]-1))
# numpy.savetxt, at least as of numpy 1.6.2, writes bytes
# to file, which doesn't work with a file open in text mode. To
# work around this deficiency, open the file in binary mode, and
# write out the header as bytes.
with open('final.csv', 'wb') as f:
    f.write(b'SP,1,2,3\n')
    #f.write(bytes("SP,"+lists+"\n","UTF-8"))
    #Used this line for a variable list of numbers
    numpy.savetxt(f, a, fmt=fmt, delimiter=",")
It is also possible to save things other than numpy arrays to file using the savez or savez_compressed functions. Using the load function you can then retrieve all the data, accessing it like a dict.
import numpy as np
np.savez("filename.npz", array_to_save=np.array([0.0, 0.0]), header="Some header")
data = np.load("filename.npz")
array = data["array_to_save"]
header = str(data["header"])
