Hierarchical dictionary (reducing memory footprint or using a database) - python

I am working with extremely high dimensional biological count data (single cell RNA sequencing where rows are cell ID and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (eg: muscle cell, heart cell), subtypes (eg: a lung dataset can be split into normal lung and cancerous lung), cancer stage (eg: stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, gene combination and keep that readily accessible such that when a person queries my web app for a plot, I can quickly retrieve results (refer to Figure below to understand what I want to create). I have generated Python code to assemble the dictionary below and it has sped up how quickly I can create visualizations.
The only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or should I consider another storage framework (I briefly saw something called Redis Hashes)?

One option to reduce your memory footprint while keeping fast lookups is to use an HDF5 file as a database. This is a single large file that lives on disk instead of in memory, but it is structured the same way as your nested dictionaries and allows rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload it to your web app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has ~1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to HDF5 took ~2 minutes and produced a 140 MB file, while the larger dict took ~14 minutes to write and produced a 1.4 GB file.
Querying the small and large HDF5 files took similar amounts of time, which suggests that the queries scale well to more data.
Here's the code I used to create the test dicts, write them to HDF5, and query them:
import h5py
import numpy as np
import time
def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}':np.random.random() for i in range(num_agg_metrics)}
    level,num_groups = level_counts.popitem()
    return {f'{level}_{i+1}':create_data_dict(level_counts.copy()) for i in range(num_groups)}
def write_dict_to_hdf5(hdf5_path,d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f,d):
        for k,v in d.items():
            #peek at one value to check if the next level is also a dict
            first_value = next(iter(v.values()))
            if isinstance(first_value, dict):
                #this is a 'node', move on to next level
                _recur_write(f.create_group(k),v)
            else:
                #this is a 'leaf', store the metrics as attributes and stop here
                leaf = f.create_group(k)
                for sk,sv in v.items():
                    leaf.attrs[sk] = sv
    with h5py.File(hdf5_path,'w') as f:
        _recur_write(f,d)
def query_hdf5(hdf5_path,search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path,'r') as f:
        k = '/'.join(search_terms)
        try:
            node = f[k]
        except KeyError:
            print("oh no! at least one of the search terms wasn't matched")
            return {}
        return dict(node.attrs)
################
#    start     #
################
#this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
#all possible nested dictionaries are made,
#so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene':40,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}
#"large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
#has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
large_level_counts = {
    'Gene':400,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}
#Determine which test dataset to use
small_test = True
if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'
np.random.seed(1)
start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time()-start))
start = time.time()
write_dict_to_hdf5(hdf5_path,data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time()-start))
#Search terms in order of most broad to least
search_terms = ['Metadata_1','Unique_Group_3','Dataset_8','Cell_Type_15','Gene_17']
start = time.time()
query_result = query_hdf5(hdf5_path,search_terms)
print('queried in {:.2f} seconds'.format(time.time()-start))
direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)

Although Python dictionaries themselves are fairly memory-efficient, you are likely storing multiple copies of the strings you use as dictionary keys. From your description of the data structure, you probably have 10,000 copies of "Agg metric 1", "Agg metric 2", etc., one for every gene in your dataset, and these duplicate strings are likely taking up a significant amount of memory. They can be deduplicated with sys.intern, so that although you still have as many references to the string in your dictionary, they all point to a single copy in memory. You would only need a minimal adjustment to your code: change the assignment to data[sys.intern('Agg metric 1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
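As a rough illustration of the interning idea, here is a minimal sketch. The build_key helper and the tiny data dict are hypothetical stand-ins for however the real keys get created; the only call taken from the answer is sys.intern. The object-identity behavior shown is how CPython treats strings built at runtime.

```python
import sys

def build_key(prefix, index):
    # Build the key at runtime, mimicking keys that come from parsing or
    # string formatting (such strings are not shared automatically).
    return f"{prefix} {index}"

# Two equal strings built at runtime, then mapped to one canonical copy.
a = build_key("Agg metric", 1)
b = build_key("Agg metric", 1)
a_interned = sys.intern(a)
b_interned = sys.intern(b)
shared_object = a_interned is b_interned

# Hypothetical nested dict keyed by gene, then by metric name.
data = {}
for gene_index in range(3):
    gene_key = sys.intern(build_key("Gene", gene_index))
    data[gene_key] = {}
    for metric_index in range(1, 3):
        metric_key = sys.intern(build_key("Agg metric", metric_index))
        data[gene_key][metric_key] = 0.0
```

After this runs, every inner dict holds references to the same two metric-name string objects rather than a fresh copy per gene.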

Related

What is a good way to save high dimensional data so it doesn't run every time?

I have the following code, which computes cosine similarity of the descriptions of tv shows and movies.
for i, row in df.iterrows():
    doc = nlp(row['description'])
    similarities[i] = {}
    # print(row['title'])
    for j, row2 in df.iterrows():
        doc2 = nlp(row2['description'])
        #print(f"{row['title']} x {row2['title']}: {doc.similarity(doc2):.10f}")
        similarities[i][j] = doc.similarity(doc2)
I've also written this function, which takes two titles as arguments and returns their similarity:
def lookup(title1, title2):
    return similarities[lookup_by_title(title1)][lookup_by_title(title2)]
My issue is that the dataframe I loop through has 4884 rows, so I end up with roughly 23.9 million computations (4884 × 4884). So I'm wondering what the best way is to run the computations once and save that information somewhere efficiently.
After you calculate similarities the first time, you can dump it to a local file; on later runs, instead of redoing the computations, just load similarities from the file.
You can use pickle for this. See a nice tutorial here.
I'm copying the samples in case the webpage becomes unavailable in the future. In your case, of course, you need to replace config_dictionary with similarities:
Dump:
# Step 1
import pickle
config_dictionary = {'remote_hostname': 'google.com', 'remote_port': 80}
# Step 2
with open('config.dictionary', 'wb') as config_dictionary_file:
    # Step 3
    pickle.dump(config_dictionary, config_dictionary_file)
Load:
# Step 1
import pickle
# Step 2
with open('config.dictionary', 'rb') as config_dictionary_file:
    # Step 3
    config_dictionary = pickle.load(config_dictionary_file)
# After config_dictionary is read from file
print(config_dictionary)
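Applied to the similarities dict, the dump/load pattern above might look like the following sketch. The load_or_compute helper, the temporary cache path, and the tiny example values are all assumptions made for illustration; only pickle.dump and pickle.load come from the answer.

```python
import os
import pickle
import tempfile

# Stand-in for the real nested dict of pairwise scores (values are made up).
similarities = {0: {0: 1.0, 1: 0.42}, 1: {0: 0.42, 1: 1.0}}

# Hypothetical cache location; a real app would use a stable path.
cache_path = os.path.join(tempfile.mkdtemp(), "similarities.pkl")

def load_or_compute(path, compute):
    """Return the cached object if the file exists, else compute and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return result

calls = []

def expensive():
    # Placeholder for the slow double loop; records how often it runs.
    calls.append(1)
    return similarities

first = load_or_compute(cache_path, expensive)   # computes and writes the file
second = load_or_compute(cache_path, expensive)  # reads the file instead
```

Note that pickle should only be used to load files you created yourself, since unpickling untrusted data can execute arbitrary code.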

Optimize data acquisition with HDF5 files in Python

I'm trying to understand how to write a DAQ in Python that manages two signals (I and Q from an IQ mixer) from an NI device. I have two questions:
What are the main differences between using h5py and pandas? My data are not complex; I only need two matrix datasets, one for the I signal and one for the Q signal.
Is it more efficient to build the whole dataset in memory (occupying a lot of it) before storing it in an HDF5 file, or to open the HDF5 file each time to add a new row (new data) to the matrix?
@Frostman, this is primarily an opinion question. Remember, "Beauty is in the eye of the beholder."
The question about memory use is the more important consideration. The answer depends on the memory required to hold your data (and whether you have enough before writing to disk). Creating the whole dataset in memory is faster and easier. But that's not an option if it doesn't fit. :-) Note: you don't want to write data 1 row at a time. That is the slowest way to work with HDF5 data. If you need to save incrementally, write "a lot" of rows at a time (say 1,000).
There is a related consideration: which of these packages is fast enough (and easy enough) to keep up with the I/O requirements of data acquisition? (I have no expertise in this area, but I know high sample rates will quickly create a lot of data.)
From a technical perspective, the difference between the packages is "cosmetic" (IMHO). (FYI, you can also use PyTables to create HDF5 data.) In other words, all 3 can easily create an HDF5 file with the data described in your question. The question (for you) is: which package do you want to learn? And which package do you plan to use for later post-processing? (I assume you want to open the file and "do something" with the data later.)
All else being equal, I would create the data with the same package I plan to use for downstream processing. Why? h5py and pandas use different schemas to store the data. So, reading the data will be easier if you write and read with the same package. (That said, you can manipulate HDF5 data between the packages.)
If the downstream processing requirement has not been decided, I would select h5py if either of these is true: a) you are comfortable with NumPy, or b) you need to use NumPy for other operations. h5py is pretty easy to learn if you know NumPy.
Otherwise, you might prefer pandas. Many claim it is easier to use (vs h5py and PyTables). I am not a pandas expert, so I can't comment. I prefer h5py and PyTables primarily because the 2D data schema is saved in a table format that is easy to review with HDFView. (Also, I use NumPy 90+% of the time, so h5py/PyTables are natural extensions.)
If you want to compare code for each, look at the answers to this question: How to write large multiple arrays to a h5 file in layers? They show code required to store data similar to yours for all 3 packages: h5py, pytables and pandas.
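To make the "write a lot of rows at a time" advice concrete, here is a minimal h5py sketch of batched appends to a resizable dataset. The file name, the batch size of 1,000, the single two-column "iq" dataset, and the random numbers standing in for DAQ samples are all assumptions for illustration, not part of the original question.

```python
import os
import tempfile

import h5py
import numpy as np

def append_rows(dataset, rows):
    """Grow a resizable dataset along axis 0 and write one batch of rows."""
    start = dataset.shape[0]
    dataset.resize(start + rows.shape[0], axis=0)
    dataset[start:] = rows

# Hypothetical file name and sizes, chosen only for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "iq_demo.h5")
batch_size = 1000  # rows written per call
n_batches = 5

with h5py.File(path, "w") as f:
    # maxshape=(None, 2) lets axis 0 grow; the two columns hold I and Q.
    iq = f.create_dataset("iq", shape=(0, 2), maxshape=(None, 2),
                          dtype="f8", chunks=(batch_size, 2))
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        batch = rng.standard_normal((batch_size, 2))  # stand-in for DAQ samples
        append_rows(iq, batch)

with h5py.File(path, "r") as f:
    final_shape = f["iq"].shape
```

Each call writes a full batch in one operation, so the number of HDF5 writes is the number of batches rather than the number of rows.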
Especially when acquiring long measurements, it is handy to write directly into an HDF5 file. This is my preferred way, because an interruption (power failure etc.) won't result in data loss.
This is my solution using a while loop that, in each cycle, collects all available samples from the DAQ and stores them immediately in the HDF5 file. You could imagine some real-time display during each loop cycle, but be aware of the loop duration (set parameter["debug_output"] = 3 to see more statistics, such as the buffer size in each cycle).
Changing the boolean hdf5_write to False causes the code to store the samples in the NumPy array data, which will sooner or later fill the memory. If True, all the samples are written directly into a growing HDF5 file.
import nidaqmx
import datetime
import time
import numpy as np
import h5py
def hdf5_write_parameter(h5_file, parameter, group_name='parameter'):
    # add parameter group
    param_grp = h5_file.create_group(group_name)
    # write single item
    for key, item in parameter.items():
        try:
            if item is None:
                item = 'None'
            if isinstance(item, dict):
                # recursively write each nested dictionary
                hdf5_write_parameter(h5_file, item, group_name+'/'+key)
            else:
                h5_file.create_dataset("/"+group_name+"/{}".format(key), data=item)
        except Exception:
            print("[hdf5_write_parameter]: failed to write:", key, "=", item)
    return
run_bool = True # should be controlled by GUI or caller thread
measurement_duration = 1 # in seconds
filename = 'test_acquisition'
hdf5_write = True # a hdf5 file with ending '.h5' is created, False = numpy array
# check if device is available
system = nidaqmx.system.System.local()
system.driver_version
for device in system.devices:
    print(device) # list devices
    ADC_DEVICE_NAME = device.name # 'PCI6024e'
print('ADC: init measure for', measurement_duration, 'seconds')
# Setup ADC
parameter = {
    "channels": 8, # number of AI channels
    "channel_name": ADC_DEVICE_NAME + '/ai0:7',
    "log_rate": int(20000), # Samples per second
    "adc_min_value": -5.0, # minimum ADC value in Volts
    "adc_max_value": 5.0, # maximum ADC value in Volts
    "timeout": measurement_duration + 2.0, # timeout to detect external clock on read
    "debug_output": 1,
    "measurement_duration": measurement_duration,
}
parameter["buffer_size"] = int(parameter["log_rate"]) # buffer size in samples
# must be bigger than loop duration!
parameter["requested_samples"] = parameter["log_rate"] * measurement_duration
parameter["hdf5_write"] = hdf5_write # write in array
if parameter['hdf5_write']:
    filename += '.h5'
    f = h5py.File(filename, 'w') # create a h5-file object if True
    data = f.create_dataset('data', (0, parameter["channels"]),
                            maxshape=(None, parameter["channels"]), chunks=True)
else:
    filename += '.csv'
    # pre-allocate array, we might get up to 1 buffer more than requested...
    data = np.empty((parameter["requested_samples"]+parameter["buffer_size"], parameter["channels"]), dtype=np.float64)
    data[:] = np.nan
with nidaqmx.Task() as task:
    task.ai_channels.add_ai_voltage_chan(parameter["channel_name"],
                                         terminal_config=nidaqmx.constants.TerminalConfiguration.RSE,
                                         min_val=parameter["adc_min_value"],
                                         max_val=parameter["adc_max_value"],
                                         units=nidaqmx.constants.VoltageUnits.VOLTS
                                         )
    task.timing.cfg_samp_clk_timing(rate=parameter["log_rate"],
                                    sample_mode=nidaqmx.constants.AcquisitionType.CONTINUOUS)
    # helper variables
    total_samples = 0
    i = 0
    last_display = -1
    parameter["acquisition_start"] = str(datetime.datetime.now())
    if 1:
        print("ADC: --- acquisition started:", parameter["acquisition_start"])
        print("ADC: Requested samples:", parameter["requested_samples"], "Acquisition duration:",
              measurement_duration)
    task.control(nidaqmx.constants.TaskMode.TASK_COMMIT)
    time_adc_start = time.perf_counter()
    # ############################# READING LOOP ##########################
    while (run_bool and total_samples < parameter["requested_samples"]
           and time.perf_counter() - time_adc_start < parameter["timeout"]):
        i = i + 1
        if parameter["debug_output"] >= 1:
            elapsed_time = np.floor(time.perf_counter() - time_adc_start) # in sec
            if elapsed_time != last_display:
                print("ADC: ...", round(elapsed_time), "of", measurement_duration, "sec:",
                      total_samples, "acquired ...")
                last_display = elapsed_time
        # high-lvl read function: always create a new array
        data_buff = np.asarray(
            task.read(number_of_samples_per_channel=nidaqmx.constants.READ_ALL_AVAILABLE)).T
        time_adc_end = time.perf_counter()
        samples_from_buffer = data_buff.shape[0]
        # get nr of samples and accumulate to total_samples
        total_samples = int(total_samples + samples_from_buffer)
        if parameter["debug_output"] >= 2:
            print("ADC: iter", i, "total:", total_samples, "smp from buffer", samples_from_buffer,
                  "time elapsed", time.perf_counter() - time_adc_start)
        if samples_from_buffer > 0:
            # prepare buffer and hdf5 dataset
            if parameter["hdf5_write"]: # sequential write to hdf5 file
                chunk_start = data.shape[0]
                # resize dataset in file
                data.resize(data.shape[0] + samples_from_buffer, axis=0)
            else:
                # prepare buffer to fit in pre-allocated array 'data'
                chunk_start = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
            if parameter['channels'] == 1:
                data_buff = data_buff[:, np.newaxis]
            if parameter["debug_output"] >= 3:
                print("Non-empty data shape: (", data.shape,
                      "), buffer shape:", data_buff.shape,
                      "chunk start:", chunk_start)
            # write buffer to HDF5 file or into numpy array
            data[chunk_start:chunk_start + samples_from_buffer, :] = data_buff
    # ############################# READING LOOP #########################
parameter["acquisition_stop"] = str(datetime.datetime.now())
if parameter["debug_output"] >= 1:
    print("ADC: requested points: ", parameter["requested_samples"])
    print("ADC: total acquired points", total_samples, "in", time_adc_end - time_adc_start)
    print("ADC: data array shape:", data.shape)
    print("ADC: --- acquisition finished:", parameter["acquisition_stop"])
    print("ADC: sample rate:", round(1/((time_adc_end-time_adc_start)/parameter["requested_samples"])))
# prepare data nparray for return
if not parameter["hdf5_write"]:
    # shrink numpy array by all nan's (from oversize with buffer size)
    total_written = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
    if parameter["debug_output"] >= 2:
        print("resize data array by cutting", data.shape[0] - total_written, "trailing NaN's")
    data = np.resize(data, (total_written, parameter["channels"]))
# add more parameters to write into the hdf5 file
parameter["total_samples"] = total_samples
parameter["total_acquisition_time"] = time_adc_end - time_adc_start
parameter["data_shape"] = data.shape
if parameter['hdf5_write']:
    hdf5_write_parameter(f, parameter) # write parameter
    f.close()

What would be a recommended data structure for a set of baking conversions?

I am working on a Python package for converting baking recipes. Ideally, the recipe is simply stored as a CSV file read in by the package. Given the recipe can be imperial or metric units of measurement, I am trying to internally convert any set of measurement units to metric for simplicity.
The main question I am trying to solve is a light-weight way to store a lot of conversions and ratios given a variety of names that a measurement unit can be.
For example, if a recipe has "tsp", I would want to classify it in the teaspoon family which would consist of ['tsp', 'tsps', 'teaspoon', 'teaspoons'] and have them all use the TSP_TO_METRIC conversion ratio.
Initially, I started as a list of lists but I feel like there may be a more elegant way to store and access these items. I was thinking a dictionary or some sort of JSON file to read in but unsure where the line is between needing an external file versus a long file of constants? I will continue to expand conversions as different ingredients are added so I am also looking for an easy way to scale.
Here is an example of the data conversions I am attempting to store. Then I use a series of if-else coupled with any(unit in sublist for sublist in VOLUME_NAMES): to check the lists of lists.
TSP_TO_METRIC = 5
TBSP_TO_METRIC = 15
OZ_TO_METRIC = 28.35
CUP_TO_METRIC = 8 * OZ_TO_METRIC
PINT_TO_METRIC = 2 * CUP_TO_METRIC
QUART_TO_METRIC = 4 * CUP_TO_METRIC
GALLON_TO_METRIC = 16 * CUP_TO_METRIC
LB_TO_METRIC = 16 * OZ_TO_METRIC
STICK_TO_METRIC = 8 * TBSP_TO_METRIC
TSP_NAMES = ['TSP', 'TSPS', 'TEASPOON', 'TEASPOONS']
TBSP_NAMES = ['TBSP', 'TBSPS', 'TABLESPOON', 'TABLESPOONS']
CUP_NAMES = ['CUP', 'CUPS']
LB_NAMES = ['LB', 'LBS', 'POUND', 'POUNDS']
OZ_NAMES = ['OZ', 'OUNCE', 'OUNCES']
BUTTER_NAMES = ['STICK', 'STICKS']
EGG_NAMES = ['CT', 'COUNT']
GALLON_NAMES = ['GAL', 'GALLON', 'GALLONS']
VOLUME_NAMES = [TSP_NAMES, TBSP_NAMES, CUP_NAMES, GALLON_NAMES]
WEIGHT_NAMES = [LB_NAMES, OZ_NAMES]
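One way the dictionary idea mentioned in the question could look is sketched below. This is only an illustration under assumptions: the UNITS table layout, the ALIAS_TO_UNIT lookup, and the to_metric helper are hypothetical names, and only four of the units are included. The conversion factors are copied from the constants above.

```python
# Conversion factors copied from the constants above (values are the asker's).
TSP_TO_METRIC = 5
TBSP_TO_METRIC = 15
OZ_TO_METRIC = 28.35
LB_TO_METRIC = 16 * OZ_TO_METRIC

# Hypothetical table: canonical unit -> (factor, accepted spellings).
UNITS = {
    "teaspoon": (TSP_TO_METRIC, ["TSP", "TSPS", "TEASPOON", "TEASPOONS"]),
    "tablespoon": (TBSP_TO_METRIC, ["TBSP", "TBSPS", "TABLESPOON", "TABLESPOONS"]),
    "ounce": (OZ_TO_METRIC, ["OZ", "OUNCE", "OUNCES"]),
    "pound": (LB_TO_METRIC, ["LB", "LBS", "POUND", "POUNDS"]),
}

# Flatten once into alias -> canonical name, so each lookup is a single dict access.
ALIAS_TO_UNIT = {
    alias: unit for unit, (_, aliases) in UNITS.items() for alias in aliases
}

def to_metric(amount, unit_name):
    """Convert an amount given in any known spelling of a unit to its metric value."""
    unit = ALIAS_TO_UNIT.get(unit_name.strip().upper())
    if unit is None:
        raise ValueError(f"unknown unit: {unit_name!r}")
    factor, _ = UNITS[unit]
    return amount * factor
```

With this layout, adding a unit means adding one entry to UNITS, and the if-else chain over lists of lists is replaced by a dictionary lookup. The same table could be loaded from a JSON file if it grows large.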

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. Unfortunately, there is a lot of overhead / duplicate data, so I want to parse out the relevant parts to visualize them.
However, when I try to parse and filter out the relevant delay data at the trip level using the script below, it runs really slowly. It has been running for over 1.5 hours now (on my 2019 MacBook Pro 15") and still isn't finished.
How can I optimize / improve this Python parser?
Or should I reduce the number of files, i.e. the frequency of data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
    try:
        # Uncompress and parse protobuff-file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
            feed = gtfs_realtime_pb2.FeedMessage()
            feed.ParseFromString(response)
            print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
            for trip in feed.entity:
                if trip.trip_update.trip.trip_id not in read_trips:
                    try:
                        if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                            print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
                            for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
                                # Store the delay data point (arrival difference of two ascending nodes)
                                delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
                                # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                                ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                                key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
                                # Append data to numpy array
                                datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
                            read_trips.add(trip.trip_update.trip.trip_id)
                    except KeyError:
                        continue
                else:
                    continue
    except OSError:
        continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
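A small, self-contained comparison of the two patterns is sketched below. The rows are made-up (key, ts, delay) tuples, not real feed data; the point is only that both approaches produce the same array, while the list version converts to NumPy once instead of copying on every iteration.

```python
import numpy as np

# Made-up (key, ts, delay) rows standing in for parsed delay records.
rows = [(i, 1_600_000_000 + i, i % 7) for i in range(1000)]

# Pattern from the question: np.append copies the whole array on every call.
appended = np.zeros((0, 3), int)
for row in rows:
    appended = np.append(appended, np.array([row]), axis=0)

# Pattern from this answer: grow a list, convert once at the end.
collected = []
for row in rows:
    collected.append(row)
npdata = np.array(collected, dtype=int)
```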
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And the repeated attribute lookups do slow things down somewhat.) I also converted your i, i+1 indexing into two iterators zipped over the list of updates, which is a slightly more advanced way of walking through a list in pairs. The cur_update/next_update names also helped me keep straight which one you wanted to reference. Finally, I removed the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
    this_trip_update = trip.trip_update
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:
        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)
                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates)
                for cur_update, next_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(next_update.arrival.delay - cur_update.arrival.delay)
                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(next_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                read_trips.add(this_trip_id)
        except KeyError:
            continue
This code should be equivalent to what you posted (apart from key, which is now a "stop/stop" string rather than an int), and I don't really expect major performance gains either, but perhaps it will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)
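The two-iterator technique used in the rewrite can be tried in isolation with plain numbers. In this sketch the pairwise helper and the arrival_delays list are made up for illustration; note that it relies on the input being a sequence (such as a list), so that each call to iter() yields an independent iterator.

```python
def pairwise(items):
    """Return (current, next) pairs from a sequence, like the two-iterator loop."""
    cur_iter = iter(items)
    nxt_iter = iter(items)
    # Advance the second iterator by one; the default avoids StopIteration on empty input.
    next(nxt_iter, None)
    return zip(cur_iter, nxt_iter)

# Made-up cumulative arrival delays (seconds) at consecutive stops.
arrival_delays = [0, 30, 45, 40]

# Delay picked up on each edge between two consecutive stops.
edge_delays = [nxt - cur for cur, nxt in pairwise(arrival_delays)]
```

On Python 3.10 and later, itertools.pairwise provides the same pairing directly.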

H5PY key reads slow

I've created a dataset with 1000 groups, each with 1300 uint8 arrays of varying lengths (though each one has a fixed size). Keys are strings of ~10 characters. I'm not trying to do anything tricky while saving (no chunking, compression etc - the data is already compressed).
Iterating over all keys is extremely slow the first time I run a script, though speeds up significantly the second time (same script, different process called later), so I suspect there is some caching involved somehow. After a while performance resets to the terrible level until I've waited it out again.
Is there a way to store the data to alleviate this problem? Or can I read it differently somehow?
Simplified code to save:
with h5py.File('my_dataset.hdf5', 'w') as fp:
    for k0 in keys0:
        group = fp.create_group(k0)
        for k1, v1 in get_items(k0):
            group.create_dataset(k1, data=np.array(v1, dtype=np.uint8))
Simplified key accessing code:
n = 0
with h5py.File('my_dataset.hdf5', 'r') as fp:
    keys0 = fp.keys()
    for k0 in keys0:
        group = fp[k0]
        n += len(tuple(group.keys()))
If I track the progress of this script during a 'slow phase', it takes almost a second for each iteration. However, if I kill it after, say, 100 steps, then the next time I run the script the first 100 steps take < 1sec to run total, then performance drops back to a crawl.
While I'm still unsure why this is slow, I've found a workaround: merge each sub-group into a single dataset.
with h5py.File('my_dataset.hdf5', 'w') as fp:
    for k0 in keys0:
        subkeys = get_subkeys(k0)
        nk = len(subkeys)
        # one group per top-level key, so the 'data' and 'keys' names don't collide
        group = fp.create_group(k0)
        data = group.create_dataset(
            'data', shape=(nk,),
            dtype=h5py.special_dtype(vlen=np.dtype(np.uint8)))
        keys = group.create_dataset('keys', shape=(nk,), dtype='S32')
        for i, (k1, v1) in enumerate(get_items(k0)):
            keys[i] = k1
            data[i] = v1
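For completeness, here is a small sketch of how a merged layout like this might be written and then read back by key. The group name "group0", the three example payloads, and the temporary file path are made up; it assumes a reasonably recent h5py where h5py.vlen_dtype is available (the counterpart of special_dtype(vlen=...)).

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up payloads standing in for the compressed uint8 arrays.
items = {"alpha": [1, 2, 3], "beta": [4, 5], "gamma": [6]}
path = os.path.join(tempfile.mkdtemp(), "merged_demo.hdf5")

with h5py.File(path, "w") as fp:
    group = fp.create_group("group0")
    data = group.create_dataset("data", shape=(len(items),),
                                dtype=h5py.vlen_dtype(np.dtype("uint8")))
    keys = group.create_dataset("keys", shape=(len(items),), dtype="S32")
    for i, (k, v) in enumerate(items.items()):
        keys[i] = k.encode("ascii")
        data[i] = np.array(v, dtype=np.uint8)

with h5py.File(path, "r") as fp:
    group = fp["group0"]
    # One read fetches every key in the group, instead of one lookup per dataset.
    names = [k.decode("ascii") for k in group["keys"][:]]
    index = {name: i for i, name in enumerate(names)}
    beta = group["data"][index["beta"]].tolist()
```

The trade-off is that lookups by name now go through the index built from the keys dataset rather than through HDF5's own name lookup.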
