First of all, I read the topic "Fastest way to write hdf5 file with Python?", but it was not very helpful.
I am trying to load a file of about 1 GB (a matrix of size (70133351, 1)) into an HDF5 structure.
Pretty simple code, but slow.
import h5py
f = h5py.File("8.hdf5", "w")
dset = f.create_dataset("8", (70133351,1))
myfile=open("8.txt")
for line in myfile:
    line=line.split("\t")
    dset[line[1]]=line[0]
myfile.close()
f.close()
I tried the same code on a smaller, 50 MB version of the matrix, and it had not finished after 24 hours.
I know the way to make it faster is to avoid the "for loop". If I were using regular Python, I would use a dict (hash) comprehension. However, it looks like that does not fit here.
I can query the file later by:
f = h5py.File("8.hdf5")
h=f['8']
print 'GFXVG' in h.attrs
which would return "True", given that GFXVG is one of the keys in h.
Does someone have any idea?
Example of part of the file:
508 LREGASKW
592 SVFKINKS
1151 LGHWTVSP
131 EAGQIISE
198 ELDDSARE
344 SQAVAVAN
336 ELDDSARF
592 SVFKINKL
638 SVFKINKI
107 PRTGAGQH
107 PRTGAAAA
Thanks
You can load all the data into a NumPy array with loadtxt and use it to create your HDF5 dataset.
import h5py
import numpy as np
d = np.loadtxt('data.txt', dtype='|S18')
which returns
array([['508.fna', 'LREGASKW'],
['592.fna', 'SVFKINKS'],
['1151.fna', 'LGHWTVSP'],
['131.fna', 'EAGQIISE'],
['198.fna', 'ELDDSARE'],
['344.fna', 'SQAVAVAN'],
['336.fna', 'ELDDSARF'],
['592.fna', 'SVFKINKL'],
['638.fna', 'SVFKINKI'],
['107.fna', 'PRTGAGQH'],
['1197.fna', 'ELDDSARR'],
['1309.fna', 'SQTIYVWF'],
['974.fna', 'PNNLRFIA'],
['230.fna', 'IGKVYHIE'],
['76.fna', 'PGVHSVWV'],
['928.fna', 'HERGGAND'],
['520.fna', 'VLKTDTTG'],
['1290.fna', 'EAALDLHR'],
['25.fna', 'FCSILGVV'],
['284.fna', 'YHKLTFED'],
['1110.fna', 'KITSSSDF']],
dtype='|S18')
and then
h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)
that gives:
<HDF5 dataset "data": shape (21, 2), type "|S18">
Since it's only a GB, why not load it completely into memory first? Note that it looks like you're also indexing into the dset with a str, which is likely the issue.
I just realized I misread the initial question, sorry about that. It looks like your code is attempting to use line[1], which appears to be a string, as an index. Perhaps there is a typo?
import h5py
from numpy import zeros
data = zeros((70133351,1), dtype='|S8') # assuming your strings are all 8 characters, use object if vlen
with open('8.txt') as myfile:
    for line in myfile:
        # first column is the integer index, second column is the string
        idx, item = line.strip().split("\t")
        data[int(idx)] = item
with h5py.File('8.hdf5', 'w') as f:
    dset = f.create_dataset("8", (70133351, 1), data=data)
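If the explicit Python loop turns out to be the bottleneck, the parsing and the assignment can be done in one vectorized step. The following is only a sketch under assumptions: the sample text stands in for 8.txt, the first column is taken to be an integer index, and the strings are assumed to be 8 ASCII characters. np.loadtxt may itself be slow on very large files, so this is worth timing on a subset first.

```python
import io
import numpy as np

# Hypothetical three-line stand-in for the tab-separated file 8.txt
sample = io.StringIO("2\tLREGASKW\n0\tSVFKINKS\n1\tLGHWTVSP\n")

# Read both columns as strings, then convert the index column to integers
raw = np.loadtxt(sample, dtype=str, delimiter="\t")
idx = raw[:, 0].astype(np.int64)

# Allocate the target array and fill every row with one fancy-indexed assignment
data = np.zeros(idx.max() + 1, dtype="S8")
data[idx] = raw[:, 1].astype("S8")
```

The resulting data array could then be handed to create_dataset(..., data=data) as in the code above.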
I ended up using the shelve library (see "Pickle versus shelve storing large dictionaries in Python") to store a large dictionary in a file. It took 2 days just to write the hash to a file, but once that was done, I could load and access any element very quickly. In the end, I no longer have to read my big file and rebuild the hash every time I want to work with it.
Problem solved!
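For readers who want to try the same approach, here is a minimal sketch of a shelve-backed lookup. The database path and the three sample lines are made up for illustration, and the choice of the string column as the key is an assumption based on the membership query in the question.

```python
import os
import shelve
import tempfile

# Hypothetical location for the on-disk dictionary
db_path = os.path.join(tempfile.mkdtemp(), "lookup_db")

# Hypothetical stand-in for the lines of the big tab-separated file
lines = ["508\tLREGASKW", "592\tSVFKINKS", "1151\tLGHWTVSP"]

# Build the persistent dictionary once: string -> number
with shelve.open(db_path) as db:
    for line in lines:
        number, key = line.split("\t")
        db[key] = number

# Later runs reopen the shelf and query it without re-reading the text file
with shelve.open(db_path) as db:
    found = "SVFKINKS" in db
    missing = "GFXVG" in db
    value = db["LREGASKW"]
```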
I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
    myArrayChunk = #... do some calculation to obtain chunk
    saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py
# Make the file
h5py_file = h5py.File(filename, "a")
# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")
for i in np.arange(numberOfChunks):
    myArrayChunk = #... do some calculation to obtain chunk
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index a h5py datatype like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory now? How has this saved my memory?
Similarly, later on, I would like to read my file back in one chunk at a time, doing further calculation, i.e. I would like to do something like:
import h5py
# Read in the file
h5py_file = h5py.File(filename, "a")
# Read in myArray
myArray = h5py_file['myArray']
for i in np.arange(numberOfChunks):
    # Read in chunk
    myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
    # ... Do some calculation on myArrayChunk
But by the end of this loop is the whole of myArray now in memory? I am a little confused by when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Please could someone explain this.
You have the basic idea. Take care when saying "save to memory". NumPy arrays are saved in memory (RAM). HDF5 data is saved on disk (not to memory/RAM!), then accessed (memory used depends on how you access). In the first step you are creating and writing data in chunks to the disk. In the second step you are accessing data from disk in chunks. Working example provided at the end.
With h5py there are 2 ways to access the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns a h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
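The distinction between the dataset object and a NumPy array can be checked directly. This is a small sketch, with a made-up file name and made-up contents, that only inspects the types involved.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "lazy_demo.h5")
with h5py.File(path, "w") as h5w:
    h5w.create_dataset("myArray", data=np.arange(24, dtype="f8").reshape(6, 2, 2))

with h5py.File(path, "r") as h5r:
    ds = h5r["myArray"]      # dataset object: a handle to the data on disk
    chunk = ds[0:2, :, :]    # slicing reads just this part into a NumPy array
    is_dataset = isinstance(ds, h5py.Dataset)
    is_ndarray = isinstance(chunk, np.ndarray)
    chunk_shape = chunk.shape
```

Each slice is an independent array, so once the name chunk is rebound in the next loop iteration the previous chunk can be garbage-collected.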
My example also corrects 1 small error in your chunksize increment equation.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
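To see why the upper bound matters, it helps to print the slice bounds both formulas produce. The values chunk_size = 4 and number_of_chunks = 3 are arbitrary choices for illustration.

```python
chunk_size = 4
number_of_chunks = 3

# Bounds from the original formula: the upper bound grows by only 1 per step
wrong = [(i * chunk_size, i * (chunk_size + 1)) for i in range(number_of_chunks)]

# Bounds from the corrected formula: consecutive, non-overlapping chunks
right = [(i * chunk_size, (i + 1) * chunk_size) for i in range(number_of_chunks)]

print(wrong)  # [(0, 0), (4, 5), (8, 10)]
print(right)  # [(0, 4), (4, 8), (8, 12)]
```

With the original formula the first slice is empty and the later ones are too short, so the chunk being assigned would not match the shape of the target slice.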
Working Example (writes and reads):
import h5py
import numpy as np
# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:
    numberOfChunks = 3
    chunkSize = 4
    print( 'WRITING %d chunks w/ chunkSize=%d ' % (numberOfChunks,chunkSize) )
    # Write dataset to disk
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize,2,2), compression="gzip")
    for i in range(numberOfChunks):
        h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize,2,2)
        print (h5ArrayChunk)
        h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk
with h5py.File("SO_61173314.h5", "r") as h5r:
    print( '\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks,chunkSize) )
    # Access myArray dataset - Note: This is NOT a NumPy array
    myArray = h5r['myArray']
    for i in range(numberOfChunks):
        # Read a chunk into memory (as a NumPy array)
        myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]
        # ... Do some calculation on myArrayChunk
        print (myArrayChunk)
In python, using the OpenCV library, I need to create some polylines. The example code for the polylines method shows:
cv2.polylines(img,[pts],True,(0,255,255))
I have all the 'pts' laid out in a text file in the format:
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
How can I read this file and provide the data to the [pts] variable in the method call?
I've tried the np.array(csv.reader(...)) method as well as a few others I've found examples of. I can successfully read the file, but the result is not in the format the polylines method wants. (I am a newbie when it comes to Python; if this were C++ or Java, it wouldn't be a problem.)
I would try to use numpy to read the csv as an array.
from numpy import genfromtxt
p = genfromtxt('myfile.csv', delimiter=',')
cv2.polylines(img,p,True,(0,255,255))
You may have to pass a dtype argument to genfromtxt if you need to coerce the data to a specific format.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
In case you know it is a fixed number of items in each row:
import csv
with open('myfile.csv') as csvfile:
    rows = csv.reader(csvfile)
    res = list(zip(*rows))
    print(res)
I know it's not pretty and there is probably a MUCH BETTER way to do this, but it works. That being said, if someone could show me a better way, it would be much appreciated.
pointlist = []
f = open(args["slots"])
data = f.read().split()
for row in data:
    tmp = []
    col = row.split(";")
    for points in col:
        xy = points.split(",")
        tmp += [[int(pt) for pt in xy]]
    pointlist += [tmp]
slots = np.asarray(pointlist)
You might need to draw each polyline individually (to expand on @Chris's answer):
import numpy as np
from numpy import genfromtxt
lines = genfromtxt('myfile.csv', delimiter=',')
for line in lines:
    # polylines expects a list of int32 point arrays, one (x, y) pair per row
    cv2.polylines(img, [line.reshape((-1, 2)).astype(np.int32)], True, (0,255,255))
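The reshaping step can be tried without OpenCV installed. The sketch below uses two made-up rows in the x1,y1,...,x4,y4 format and builds the list of int32 point arrays; the (-1, 1, 2) shape follows the convention used in the OpenCV drawing tutorial, and the actual drawing call is left as a comment because img is not defined here.

```python
import io
import numpy as np

# Hypothetical two-line stand-in for the text file of coordinates
sample = io.StringIO("10,10,50,10,50,40,10,40\n60,60,90,60,90,90,60,90\n")

# One row per polyline; ndmin=2 keeps a 2-D result even for a single-line file
rows = np.loadtxt(sample, delimiter=",", dtype=np.int32, ndmin=2)

# Turn each flat row into an array of (x, y) points
pts = [row.reshape(-1, 1, 2) for row in rows]

# With an image available, the call would then look like:
# cv2.polylines(img, pts, True, (0, 255, 255))
```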
I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and the score of each feature.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; in each following pair, the value to the left of the colon is the feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json
import numpy as np
import tables as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand
def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'w'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? Maybe using MapReduce? What in this function makes it so slow?
The cost comes from I/O, conversions (from str, to str, even converting the same variable to str twice, etc.), splits, and explicit loops. By the way, there is the csv Python module, which can be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). I also see that you convert element[0] to both int and str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (an array?). You can also try to implement it in another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python bytecode execution and to prefer native/C Python functions (for everything). And certainly try to reduce the number of conversions. Also, if the input file is yours, you can format it with fixed-length fields; this lets you avoid split/parse entirely (only string indexing).
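One way to apply this advice is to parse each line once with plain str.split, keep the feature ID as a string so no int()/str() round trip is needed, and collect (row, column, value) triplets instead of writing cells one at a time. This is only a sketch: the two input lines and the column_of mapping are hypothetical stand-ins for the input file and for self.index_selected_feature.

```python
import numpy as np

# Hypothetical input lines in the "film_id feature:score ..." format
lines = [
    "6729792 4:0.15568 8:0.198796 9:0.279261",
    "6729793 8:0.5 13:0.17829",
]
# Hypothetical mapping from selected feature ID (as str) to matrix column
column_of = {"4": 0, "8": 1, "13": 2}

rows, cols, vals, film_ids = [], [], [], []
for row_index, line in enumerate(lines):
    film_id, *pairs = line.split()
    film_ids.append(film_id)
    for pair in pairs:
        feature, score = pair.split(":")
        col = column_of.get(feature)  # one dict lookup, no type conversion
        if col is not None:           # skip features that were not selected
            rows.append(row_index)
            cols.append(col)
            vals.append(float(score))

rows = np.array(rows)
cols = np.array(cols)
vals = np.array(vals, dtype=np.float32)
```

The three arrays are in the form that scipy.sparse.coo_matrix((vals, (rows, cols)), shape=...) accepts, so the matrix can be built in a single call after the loop. Whether this is faster for a given file would need to be measured.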
I have a large dataset: 20,000 x 40,000 as a numpy array. I have saved it as a pickle file.
Instead of reading this huge dataset into memory, I'd like to only read a few (say 100) rows of it at a time, for use as a minibatch.
How can I read only a few randomly-chosen (without replacement) lines from a pickle file?
You can write pickles incrementally to a file, which allows you to load them
incrementally as well.
Take the following example. Here, we iterate over the items of a list, and
pickle each one in turn.
>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
...     pickler.dump(e)
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()
Now we can do the same process in reverse and load each object as needed. For
the purpose of example, let's say that we just want the first item and don't
want to iterate over the entire file.
>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1
At this point, the file stream has only advanced as far as the first
object. The remaining objects weren't loaded, which is exactly the behavior you
want. For proof, you can try reading the rest of the file and see the rest is
still sitting there.
>>> f.read()
'I2\n.I3\n.'
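The session above is Python 2 (cPickle). A rough Python 3 equivalent, assuming the same three-item list and a temporary file path chosen for illustration, might look like this:

```python
import os
import pickle
import tempfile

# Hypothetical path for the pickle stream
path = os.path.join(tempfile.mkdtemp(), "mydata.pkl")
my_data = [1, 2, 3]

# Write each item as its own pickle, one after another in the same file
with open(path, "wb") as f:
    for item in my_data:
        pickle.dump(item, f)

# Each load() call reads exactly one object and leaves the rest unread
with open(path, "rb") as f:
    first = pickle.load(f)
    second = pickle.load(f)
```

Note that this gives sequential access only: reaching item k still means loading the items before it, so it does not by itself provide random row access.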
Since you do not know the internal workings of pickle, you need to use another storage method. The script below uses the tobytes() function to save the data line-wise in a raw file.
Since the length of each line is known, its offset in the file can be computed and accessed via seek() and read(). After that, it is converted back to an array with the frombuffer() function.
The big disclaimer, however, is that the size of the array is not saved (this could be added as well but requires some more complications) and that this method might not be as portable as a pickled array.
As @PadraicCunningham pointed out in his comment, a memmap is likely to be an alternative and elegant solution.
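A minimal sketch of the memmap route, assuming the data can be saved once in NumPy's .npy format rather than as a pickle: np.load with mmap_mode="r" maps the file instead of reading it, and indexing with a set of row numbers copies only those rows into memory. The path, the small 200 x 5 array, and the batch size of 10 are placeholders.

```python
import os
import tempfile

import numpy as np

# Hypothetical path; a small array stands in for the 20,000 x 40,000 dataset
path = os.path.join(tempfile.mkdtemp(), "big.npy")
full = np.arange(200 * 5, dtype=np.float64).reshape(200, 5)
np.save(path, full)

# Map the file instead of loading it; rows are read from disk on demand
arr = np.load(path, mmap_mode="r")

# Pick row indices without replacement; sorting keeps disk access ordered
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(arr.shape[0], size=10, replace=False))

# Copy only the selected rows into a regular in-memory array
batch = np.array(arr[idx])
```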
Remark on performance: After reading the comments I did a short benchmark. On my machine (16GB RAM, encrypted SSD) I was able to do 40000 random line reads in 24 seconds (with a 20000x40000 matrix of course, not the 10x10 from the example).
from __future__ import print_function
import numpy
import random
def dumparray(a, path):
    lines, _ = a.shape
    with open(path, 'wb') as fd:
        for i in range(lines):
            fd.write(a[i,...].tobytes())
class RandomLineAccess(object):
    def __init__(self, path, cols, dtype):
        self.dtype = dtype
        self.fd = open(path, 'rb')
        self.line_length = cols*dtype.itemsize
    def read_line(self, line):
        offset = line*self.line_length
        self.fd.seek(offset)
        data = self.fd.read(self.line_length)
        return numpy.frombuffer(data, self.dtype)
    def close(self):
        self.fd.close()
def main():
    lines = 10
    cols = 10
    path = '/tmp/array'
    a = numpy.zeros((lines, cols))
    dtype = a.dtype
    for i in range(lines):
        # add some data to distinguish lines
        numpy.ndarray.fill(a[i,...], i)
    dumparray(a, path)
    rla = RandomLineAccess(path, cols, dtype)
    line_indices = list(range(lines))
    for _ in range(20):
        line_index = random.choice(line_indices)
        print(line_index, rla.read_line(line_index))
if __name__ == '__main__':
    main()
Thanks everyone. I ended up finding a workaround (a machine with more RAM so I could actually load the dataset into memory).
I just started to learn python, so I need some help.
I have a closeparams.txt file with a CSV-like structure:
3;700;3;10;1
6;300;3;20;1
9;500;2;10;5
I need to read this file into a 2-dimensional array
a[i,j], where i is the row and j is the column.
I searched but could not find a matching example.
I will use this array like this:
i=0
j=3
print a(i,j)
I expect this to display:
10
Or
i=2
j=1
print a(i,j)
I expect this to display:
500
I suggest using NumPy if you want to deal with arrays. In your case:
import numpy
a = numpy.loadtxt('closeparams.txt', delimiter=';')
print a[0,3]
You didn't specify how important the array construct will be for you, but NumPy is very powerful for complex tasks, and it can also handle smaller, quick-and-dirty tasks in a compact, fast, and readable way.
display_list = []
with open('closeparams.txt') as data_file:
    for line in data_file:
        display_list.append(line.strip().split(';'))
print(display_list[0][3]) # [i][j]
edit - python3 print()
How about:
import csv
sheet = list(csv.reader(open(source_path)))
print sheet[0][0]
Just typecast the opened csv to a list!
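Since the file in the question is separated by semicolons, csv.reader needs delimiter=';', and the cells come back as strings unless they are converted. A small sketch, using the three sample rows from the question in place of the real file:

```python
import csv
import io

# Stand-in for open('closeparams.txt'); the rows are the ones from the question
sample = io.StringIO("3;700;3;10;1\n6;300;3;20;1\n9;500;2;10;5\n")

# Split on ';' and convert every cell to int while building the nested list
a = [[int(value) for value in row] for row in csv.reader(sample, delimiter=";")]

value_0_3 = a[0][3]  # row 0, column 3
value_2_1 = a[2][1]  # row 2, column 1
print(value_0_3, value_2_1)  # 10 500
```

Note that a nested list is indexed as a[i][j], not a[i,j]; the latter form needs a NumPy array.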