I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
myArrayChunk = #... do some calculation to obtain chunk
saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py
# Make the file
h5py_file = h5py.File(filename, "a")
# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")
for i in np.arange(numberOfChunks):
myArrayChunk = #... do some calculation to obtain chunk
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index a h5py datatype like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory now? How has this saved my memory?
Similarly, later on, I would like to read in my file back in one chunk at a time, doing further calculation. i.e. I would like to do something like:
import h5py
# Read in the file
h5py_file = h5py.File(filename, "a")
# Read in myArray
myArray = h5py_file['myArray']
for i in np.arange(numberOfChunks):
# Read in chunk
myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
# ... Do some calculation on myArrayChunk
But by the end of this loop is the whole of myArray now in memory? I am a little confused by when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Please could someone explain this.
You have the basic idea. Take care when saying "save to memory". NumPy arrays are saved in memory (RAM). HDF5 data is saved on disk (not to memory/RAM!), then accessed (memory used depends on how you access). In the first step you are creating and writing data in chunks to the disk. In the second step you are accessing data from disk in chunks. Working example provided at the end.
When reading data with h5py there 2 ways to read the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns a h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[i*chunkSize):(i+1)*chunkSize),:,:]
My example also corrects 1 small error in your chunksize increment equation.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):(i+1)*chunkSize),:,:] = myArrayChunk
Working Example (writes and reads):
import h5py
import numpy as np
# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:
numberOfChunks = 3
chunkSize = 4
print( 'WRITING %d chunks with w/ chunkSize=%d ' % (numberOfChunks,chunkSize) )
# Write dataset to disk
h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize,2,2), compression="gzip")
for i in range(numberOfChunks):
h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize,2,2)
print (h5ArrayChunk)
h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk
with h5py.File("SO_61173314.h5", "r") as h5r:
print( '/nREADING %d chunks with w/ chunkSize=%d/n' % (numberOfChunks,chunkSize) )
# Access myArray dataset - Note: This is NOT a NumpPy array
myArray = h5r['myArray']
for i in range(numberOfChunks):
# Read a chunk into memory (as a NumPy array)
myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]
# ... Do some calculation on myArrayChunk
print (myArrayChunk)
Related
I'm trying to understand how can I write a DAQ in Python where I manage two signals (I and Q from an IQ mixer) from a NI device. My doubt concern two problems:
What are the main differences to use h5py instead of pandas? My data are not complex, I need only two matrices datasets, one for the I signal and one for the Q signal.
Is it more efficient to create the whole dataset and then occupy a lot of memory before storing it in an HDF5 file, or to open the HDF5 file each time to add a new row (new data) to the matrix?
#Frostman, this is primarily an opinion question. Remember, "Beauty is in the eye of the beholder."
The question about memory use is the more important consideration. The answer depends on memory required to hold your data (and if you have enough before writing to disk). Creating the whole dataset in memory is faster and easier. But, that's not an option if it doesn't fit. :-) Note: you don't want to write data 1 row at a time. That is the slowest way to work with HDF5 data. If you need to save incrementally, write "a lot" of rows at 1 time (say 1,000).
There is a related consideration: which of these packages is fast enough (& easy enough) to keep up with I/O requirement to data acquisition? (I have no expertise in this areas, but know high sample rates will quickly create a lot of data.)
From a technical perspective, the difference in the packages is "cosmetic" (IMHO). (FYI, you can also use PyTables to create HDF5 data.) In other words, all 3 can easily create a HDF5 file with the data described in your question. The question (for you), is which package do you want to learn? And, which package do you plan use for later post-processing? (I assume you want to open the file and "do something" with the data later.)
All else being equal, I would to create the data with the same package I plan to use for downstream processing. Why? h5py and pandas use different schema to store the data. So, reading the data will be easier if you write and read with the same package. (That said, you can manipulate HDF5 data between the packages.)
If the downstream processing requirement has not been decided, I would select h5py if either of these are true: a) you are comfortable with NumPy, or b) you need to use NumPy for other operations. h5py is pretty easy to learn if you know NumPy.
Otherwise, you might prefer pandas. Many claim it is easier to use (vs h5py and PyTables). I am not a pandas expert, so can't comment. I prefer h5py and PyTables primarly because the 2d data schema is saved in a table format that is easy to review with HDFView. (Also, I use NumPy 90+% of the time, so h5py/PyTables are natural extensions.)
If you want to compare code for each, look at the answers to this question: How to write large multiple arrays to a h5 file in layers? They show code required to store data similar to yours for all 3 packages: h5py, pytables and pandas.
Especially when acquiring long measurements, it is handy to write directly into an HDF5 file. This is my preferred way, because any interrupt (power failure etc.) wont result in data loss.
This is my solution using a while-loop that collects in each cycle all available samples from the DAQ and stores them immediately into the HDF5 file. You could imaging some real-time display during each loop cycle, but be aware of the loop duration (set parameter[debug_output] = 3 to see some more statistics like buffer size of each cycle)
changing the boolean hdf5_write to False causes the code to store into Numpy array data which sooner or later will fills the memory. If True, all the samples are written directly into a growing HDF5 file.
import nidaqmx
import datetime
import time
import numpy as np
import h5py
def hdf5_write_parameter(h5_file, parameter, group_name='parameter'):
# add parameter group
param_grp = h5_file.create_group(group_name)
# write single item
for key, item in parameter.items():
try:
if item is None:
item = 'None'
if isinstance(item, dict):
# recursive write each dictionary
hdf5_write_parameter(h5_file, item, group_name+'/'+key)
else:
h5_file.create_dataset("/"+group_name+"/{}".format(key), data=item)
except:
print("[hdf5_write_parameter]: failed to write:", key, "=", item)
return
run_bool = True # should be controlled by GUI or caller thread
measurement_duration = 1 # in seconds
filename = 'test_acquisition'
hdf5_write = True # a hdf5 file with ending '.h5' is created, False = numpy array
# check if device is available
system = nidaqmx.system.System.local()
system.driver_version
for device in system.devices:
print(device) # plot devices
ADC_DEVICE_NAME = device.name # 'PCI6024e'
print('ADC: init measure for', measurement_duration, 'seconds')
# Setup ADC
parameter = {
"channels": 8, # number of AI channels
"channel_name": ADC_DEVICE_NAME + '/ai0:7',
"log_rate": int(20000), # Samples per second
"adc_min_value": -5.0, # minimum ADC value in Volts
"adc_max_value": 5.0, # maximum ADC value in Volts
"timeout": measurement_duration + 2.0, # timeout to detect external clock on read
"debug_output": 1,
"measurement_duration": measurement_duration,
}
parameter["buffer_size"] = int(parameter["log_rate"]) # buffer size in samples
# must be bigger than loop duration!
parameter["requested_samples"] = parameter["log_rate"] * measurement_duration
parameter["hdf5_write"] = hdf5_write # write in array
if parameter['hdf5_write']:
filename += '.h5'
f = h5py.File(filename, 'w') # create a h5-file object if True
data = f.create_dataset('data', (0, parameter["channels"]),
maxshape=(None, parameter["channels"]), chunks=True)
else:
filename += '.csv'
# pre-allocate array, we might get up to 1 buffer more than requested...
data = np.empty((parameter["requested_samples"]+parameter["buffer_size"], parameter["channels"]), dtype=np.float64)
data[:] = np.nan
with nidaqmx.Task() as task:
task.ai_channels.add_ai_voltage_chan(parameter["channel_name"],
terminal_config=nidaqmx.constants.TerminalConfiguration.RSE,
min_val=parameter["adc_min_value"],
max_val=parameter["adc_max_value"],
units=nidaqmx.constants.VoltageUnits.VOLTS
)
task.timing.cfg_samp_clk_timing(rate=parameter["log_rate"],
sample_mode=nidaqmx.constants.AcquisitionType.CONTINUOUS)
# helper variables
total_samples = 0
i = 0
last_display = -1
parameter["acquisition_start"] = str(datetime.datetime.now())
if 1:
print("ADC: --- acquisition started:", parameter["acquisition_start"])
print("ADC: Requested samples:", parameter["requested_samples"], "Acquisition duration:",
measurement_duration)
task.control(nidaqmx.constants.TaskMode.TASK_COMMIT)
time_adc_start = time.perf_counter()
# ############################# READING LOOP ##########################
while run_bool and total_samples < parameter["requested_samples"] and time.perf_counter() - time_adc_start < parameter[
"timeout"]:
i = i + 1
if parameter["debug_output"] >= 1:
elapsed_time = np.floor(time.perf_counter() - time_adc_start) # in sec
if elapsed_time != last_display:
print("ADC: ...", round(elapsed_time), "of", measurement_duration, "sec:",
total_samples, "acquired ...")
last_display = elapsed_time
# high-lvl read function: always create a new array
data_buff = np.asarray(
task.read(number_of_samples_per_channel=nidaqmx.constants.READ_ALL_AVAILABLE)).T
time_adc_end = time.perf_counter()
samples_from_buffer = data_buff.shape[0]
# get nr of samples and acumulate to total_samples
total_samples = int(total_samples + samples_from_buffer)
if parameter["debug_output"] >= 2:
print("ADC: iter", i, "total:", total_samples, "smp from buffer", samples_from_buffer,
"time elapsed", time.perf_counter() - time_adc_start)
if samples_from_buffer > 0:
# prepair buffer and hdf5 dataset
if parameter["hdf5_write"]: # sequential write to hdf5 file
chunk_start = data.shape[0]
# resize dataset in file
data.resize(data.shape[0] + samples_from_buffer, axis=0)
else:
# prepair buffer to fit in pre-allocated array 'data'
chunk_start = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
if parameter['channels'] == 1:
data_buff = data_buff[:, np.newaxis]
if parameter["debug_output"] >= 3:
print("Non-empty data shape: (", data.shape,
"), buffer shape:", data_buff.shape,
"chunk start:", chunk_start)
# write buffer to HDF5 file or into numpy array
data[chunk_start:chunk_start + samples_from_buffer, :] = data_buff
# ############################# READING LOOP #########################
parameter["acquisition_stop"] = str(datetime.datetime.now())
if parameter["debug_output"] >= 1:
print("ADC: requested points: ", parameter["requested_samples"])
print("ADC: total aqcuired points", total_samples, "in", time_adc_end - time_adc_start)
print("ADC: data array shape:", data.shape)
print("ADC: --- aqcuisition finished:", parameter["acquisition_stop"])
print("ADC: sample rate:", round(1/((time_adc_end-time_adc_start)/parameter["requested_samples"])))
# prepare data nparray for return
if not parameter["hdf5_write"]:
# shrink numpy array by all nan's (from oversize with buffer size)
total_written = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
if parameter["debug_output"] >= 2:
print("resize data array by cutting", data.shape[0] - total_written, "tailing NaN's")
data = np.resize(data, (total_written, parameter["channels"]))
# add more parameter to wrtie into the hdf5 file
parameter["total_samples"] = total_samples
parameter["total_acquisition_time"] = time_adc_end - time_adc_start
parameter["data_shape"] = data.shape
if parameter['hdf5_write']:
hdf5_write_parameter(f, parameter) # write parameter
f.close()
I have a large dataset: 20,000 x 40,000 as a numpy array. I have saved it as a pickle file.
Instead of reading this huge dataset into memory, I'd like to only read a few (say 100) rows of it at a time, for use as a minibatch.
How can I read only a few randomly-chosen (without replacement) lines from a pickle file?
You can write pickles incrementally to a file, which allows you to load them
incrementally as well.
Take the following example. Here, we iterate over the items of a list, and
pickle each one in turn.
>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
... pickler.dump(e)
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()
Now we can do the same process in reverse and load each object as needed. For
the purpose of example, let's say that we just want the first item and don't
want to iterate over the entire file.
>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1
At this point, the file stream has only advanced as far as the first
object. The remaining objects weren't loaded, which is exactly the behavior you
want. For proof, you can try reading the rest of the file and see the rest is
still sitting there.
>>> f.read()
'I2\n.I3\n.'
Since you do not know the internal workings of pickle, you need to use another storing method. The script below uses the tobytes() functions to save the data line-wise in a raw file.
Since the length of each line is known, it's offset in the file can be computed and accessed via seek() and read(). After that, it is converted back to an array with the frombuffer() function.
The big disclaimer however is that the size of the array in not saved (this could be added as well but requires some more complications) and that this method might not be as portable as a pickled array.
As #PadraicCunningham pointed out in his comment, a memmap is likely to be an alternative and elegant solution.
Remark on performance: After reading the comments I did a short benchmark. On my machine (16GB RAM, encrypted SSD) I was able to do 40000 random line reads in 24 seconds (with a 20000x40000 matrix of course, not the 10x10 from the example).
from __future__ import print_function
import numpy
import random
def dumparray(a, path):
lines, _ = a.shape
with open(path, 'wb') as fd:
for i in range(lines):
fd.write(a[i,...].tobytes())
class RandomLineAccess(object):
def __init__(self, path, cols, dtype):
self.dtype = dtype
self.fd = open(path, 'rb')
self.line_length = cols*dtype.itemsize
def read_line(self, line):
offset = line*self.line_length
self.fd.seek(offset)
data = self.fd.read(self.line_length)
return numpy.frombuffer(data, self.dtype)
def close(self):
self.fd.close()
def main():
lines = 10
cols = 10
path = '/tmp/array'
a = numpy.zeros((lines, cols))
dtype = a.dtype
for i in range(lines):
# add some data to distinguish lines
numpy.ndarray.fill(a[i,...], i)
dumparray(a, path)
rla = RandomLineAccess(path, cols, dtype)
line_indices = list(range(lines))
for _ in range(20):
line_index = random.choice(line_indices)
print(line_index, rla.read_line(line_index))
if __name__ == '__main__':
main()
Thanks everyone. I ended up finding a workaround (a machine with more RAM so I could actually load the dataset into memory).
I made a pickle file, storing a grayscale value of each pixel in 100,000 80x80 sized images.
(Plus an array of 100,000 integers whose values are one-digit).
My approximation for the total size of the pickle is,
4 byte x 80 x 80 x 100000 = 2.88 GB
plus the array of integers, which shouldn't be that large.
The generated pickle file however is over 16GB, so it's taking hours just to unpickle it and load it, and it eventually freezes, after it takes full memory resources.
Is there something wrong with my calculation or is it the way I pickled it?
I pickled the file in the following way.
from PIL import Image
import pickle
import os
import numpy
import time
trainpixels = numpy.empty([80000,6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000,6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408,6400])
testlabels = numpy.empty(10408)
i=0
tr=0
va=0
te=0
for (root, dirs, filenames) in os.walk(indir1):
print 'hello'
for f in filenames:
try:
im = Image.open(os.path.join(root,f))
Imv=im.load()
x,y=im.size
pixelv = numpy.empty(6400)
ind=0
for ii in range(x):
for j in range(y):
temp=float(Imv[j,ii])
temp=float(temp/255.0)
pixelv[ind]=temp
ind+=1
if i<40000:
trainpixels[tr]=pixelv
tr+=1
elif i<45000:
validpixels[va]=pixelv
va+=1
else:
testpixels[te]=pixelv
te+=1
print str(i)+'\t'+str(f)
i+=1
except IOError:
continue
trainimage=(trainpixels,trainlabels)
validimage=(validpixels,validlabels)
testimage=(testpixels,testlabels)
output=open('data.pkl','wb')
pickle.dump(trainimage,output)
pickle.dump(validimage,output)
pickle.dump(testimage,output)
Please let me know if you see something wrong with either my calculation or my code!
Python Pickles are not a thrifty mechanism for storing data as you're storing objects instead of "just the data."
The following test case takes 24kb on my system and this is for a small, sparsely populated numpy array stored in a pickle:
import os
import sys
import numpy
import pickle
testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0
test_labels_size = sys.getsizeof(testlabels) #80
output = open('/tmp/pickle', 'wb')
test_labels_pickle = pickle.dump(testlabels, output)
print os.path.getsize('/tmp/pickle')
Further, I'm not sure why you believe 4kb to be the size of a number in Python -- non-numpy ints are 24 bytes (sys.getsizeof(1)) and numpy arrays are a minimum of 80 bytes (sys.getsizeof(numpy.array([0], float))).
As you stated as a response to my comment, you have reasons for staying with Pickle, so I won't try to convince you further to not store objects, but be aware of the overhead of storing objects.
As an option: reduce the size of your training data/Pickle fewer objects.
I was trying to calculate the correlation between a large set of data read from a text. For extremely large data set the program give a memory error. Can anyone please tell me how to correct this problem. Thanks
The following is my code:
enter code here
import numpy
from numpy import *
from array import *
from decimal import *
import sys
Threshold = 0.8;
TopMostData = 10;
FileName = sys.argv[1]
File = open(FileName,'r')
SignalData = numpy.empty((1, 128));
SignalData[:][:] = 0;
for line in File:
TempLine = line.split();
TempInt = [float(i) for i in TempLine]
SignalData = vstack((SignalData,TempInt))
del TempLine;
del TempInt;
File.close();
TempData = SignalData;
SignalData = SignalData[1:,:]
SignalData = SignalData[:,65:128]
print "File Read | Data Stored" + " | Total Lines: " + str(len(SignalData))
CorrelationData = numpy.corrcoef(SignalData)
The following is the error:
Traceback (most recent call last):
File "Corelation.py", line 36, in <module>
CorrelationData = numpy.corrcoef(SignalData)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1824, in corrcoef
return c/sqrt(multiply.outer(d, d))
MemoryError
You run out of memory as the comments show. If that happens because you are using 32-bit Python, even the method below will fail. But for the 64-bit Python and not-so-much-RAM situation we can do a lot as calculating the correlations is easily done piecewise, as you only need two lines in the memory simultaneously.
So, you may split your input into, say, 1000 row chunks, and then the resulting 1000 x 1000 matrices are easy to keep in memory. Then you can assemble your result into the big output matrix which is not necessarily in the RAM. I recommend this approach even if you have a lot of RAM, because this is much more memory-friendly. Correlation coefficient calculation is not an operation where fast random accesses would help a lot if the input can be kept in RAM.
Unfortunately, the numpy.corrcoef does not do this automatically, and we'll have to roll our own correlation coefficient calculation. Fortunately, that is not as hard as it sounds.
Something along these lines:
import numpy as np
# number of rows in one chunk
SPLITROWS = 1000
# the big table, which is usually bigger
bigdata = numpy.random.random((27000, 128))
numrows = bigdata.shape[0]
# subtract means form the input data
bigdata -= np.mean(bigdata, axis=1)[:,None]
# normalize the data
bigdata /= np.sqrt(np.sum(bigdata*bigdata, axis=1))[:,None]
# reserve the resulting table onto HDD
res = np.memmap("/tmp/mydata.dat", 'float64', mode='w+', shape=(numrows, numrows))
for r in range(0, numrows, SPLITROWS):
for c in range(0, numrows, SPLITROWS):
r1 = r + SPLITROWS
c1 = c + SPLITROWS
chunk1 = bigdata[r:r1]
chunk2 = bigdata[c:c1]
res[r:r1, c:c1] = np.dot(chunk1, chunk2.T)
Some notes:
the code above is tested above np.corrcoef(bigdata)
if you have complex values, you'll need to create a complex output array res and take the complex conjugate of chunk2.T
the code garbles bigdata to maintain performance and minimize memory use; if you need to preserve it, make a copy
The above code takes about 85 s to run on my machine, but the data will mostly fit in RAM, and I have a SSD disk. The algorithm is coded in such order to avoid too random access into the HDD, i.e. the access is reasonably sequential. In comparison, the non-memmapped standard version is not significantly faster even if you have a lot of memory. (Actually, it took a lot more time in my case, but I suspect I ran out of my 16 GiB and then there was a lot of swapping going on.)
You can make the actual calculations faster by omitting half of the matrix, because res.T == res. In practice, you can omit all blocks where c > r and then mirror them later on. On the other hand, the performance is most likely limited by the HDD preformance, so other optimizations do not necessarily bring much more speed.
Of course, this approach is easy to make parallel, as the chunk calculations are completely independent. Also the memmapped array can be shared between threads rather easily.
First of all, I read the topic "Fastest way to write hdf5 file with Python?", but it was not very helpful.
I am trying to load a file which has about 1GB (a matrix of size (70133351,1)) in a h5f5 structure.
Pretty simple code, but slow.
import h5py
f = h5py.File("8.hdf5", "w")
dset = f.create_dataset("8", (70133351,1))
myfile=open("8.txt")
for line in myfile:
line=line.split("\t")
dset[line[1]]=line[0]
myfile.close()
f.close()
I have a smaller version of the matrix with 50MB, and I tried the same code, and it was not finished after 24 hours.
I know the way to make it faster is to avoid the "for loop". If I were using regular python, I would use hash comprehension. However, looks like it does not fit here.
I can query the file later by:
f = h5py.File("8.hdf5")
h=f['8']
print 'GFXVG' in h.attrs
Which would answer me "True" conseidering that GFXVG is on of the keys in h
Does someone have any idea?
Example of part of the file:
508 LREGASKW
592 SVFKINKS
1151 LGHWTVSP
131 EAGQIISE
198 ELDDSARE
344 SQAVAVAN
336 ELDDSARF
592 SVFKINKL
638 SVFKINKI
107 PRTGAGQH
107 PRTGAAAA
Thanks
You can load all the data to an numpy array with loadtext and use it to instantiate your hdf5 dataset.
import h5py
import numpy as np
d = np.loadtxt('data.txt', dtype='|S18')
which return
array([['508.fna', 'LREGASKW'],
['592.fna', 'SVFKINKS'],
['1151.fna', 'LGHWTVSP'],
['131.fna', 'EAGQIISE'],
['198.fna', 'ELDDSARE'],
['344.fna', 'SQAVAVAN'],
['336.fna', 'ELDDSARF'],
['592.fna', 'SVFKINKL'],
['638.fna', 'SVFKINKI'],
['107.fna', 'PRTGAGQH'],
['1197.fna', 'ELDDSARR'],
['1309.fna', 'SQTIYVWF'],
['974.fna', 'PNNLRFIA'],
['230.fna', 'IGKVYHIE'],
['76.fna', 'PGVHSVWV'],
['928.fna', 'HERGGAND'],
['520.fna', 'VLKTDTTG'],
['1290.fna', 'EAALDLHR'],
['25.fna', 'FCSILGVV'],
['284.fna', 'YHKLTFED'],
['1110.fna', 'KITSSSDF']],
dtype='|S18')
and then
h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)
that gives:
<HDF5 dataset "init": shape (21, 2), type "|S18">
Since its only a gb, why not load it completely in memory first? Note, it looks like you're also indexing into the dset with a str, which is likely the issue.
I just realized I misread the initial question, sorry about that. It looks like your code is attempting to use the index 1, which appears to be a string, as an index? Perhaps there is a typo?
import h5py
from numpy import zeros
data = zeros((70133351,1), dtype='|S8') # assuming your strings are all 8 characters, use object if vlen
with open('8.txt') as myfile:
for line in myfile:
idx, item = line.strip().split("\t")
data[int(line[0])] = line[1]
with h5py.File('8.hdf5', 'w') as f:
dset = f.create_dataset("8", (70133351, 1), data=data)
I ended up using the library shelve (Pickle versus shelve storing large dictionaries in Python) to store a large dictionary into a file. It took me 2 days only to write the hash into a file, but once it was done, I am able to load and access any element very fast. In the end of the day, I dont have to read my big file and write all the information in the has and do whatever I was trying to do with the hash.
Problem solved!