My goal is to feed an object that supports the buffer protocol into hashlib's sha2 generator such that sha2 hashes generated from the same underlying data in different execution environments are consistent, and so can be used for equality tests.
I would like this to work for arbitrary data types without having to write a bunch of boilerplate wrappers around bytes() or bytearray(), i.e., one function I can pass strings (with encoding), numerics, and bools. Extra points if I can get the memory layout for a complex type like a dict or list.
I am looking at struct, as well as doing something like loading the data into a pandas DataFrame and then using Apache Arrow to access the memory layout directly.
Looking for guidance as to the most "pythonic" way to accomplish this.
hashlib.sha256(struct.pack('!f', 12.3)).hexdigest()
Repeat for all native types.
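For completeness, here is a minimal sketch of the kind of one-function wrapper the question asks for, using struct with explicit network byte order so the bytes (and therefore the hashes) are the same across platforms. The helper names (canonical_bytes, stable_sha256) and the choice of 64-bit formats are mine, not from the answer above:

import hashlib
import struct

def canonical_bytes(value, encoding='utf-8'):
    # Hypothetical helper: map a few native types to a fixed,
    # platform-independent byte representation.
    if isinstance(value, bool):              # bool is a subclass of int, so check it first
        return struct.pack('!?', value)
    if isinstance(value, int):
        return struct.pack('!q', value)      # assumes the value fits in 64 bits
    if isinstance(value, float):
        return struct.pack('!d', value)
    if isinstance(value, str):
        return value.encode(encoding)
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value)
    raise TypeError('unsupported type: %r' % type(value))

def stable_sha256(value):
    return hashlib.sha256(canonical_bytes(value)).hexdigest()

print(stable_sha256(12.3))
print(stable_sha256('hello'))
print(stable_sha256(True))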
Related
What is a data type in Python?
What is a data structure in Python?
Are lists, sets, dictionaries and tuples data structures or data types in Python?
In almost every programming language, a data type is simply the type of data you are working with. In Python there are six basic data types: int, float, complex, bool, str and bytes. Data structures, on the other hand, are collections of data on which tasks can be performed efficiently; they enable easier access and efficient modification. Data structures let you organize your data so that you can store collections of values, relate them, and perform operations on them accordingly.
In Python, data structures can be broadly classified into two categories (although not everyone accepts this division):
1) Built-in: list, tuple, set, dict.
2) User-defined: stacks, queues, trees, graphs, etc.
From my understanding...
Data Structures in Python are also data types.
Why do I think this?
The official Python docs place lists under the data structures section AND immediately call list a data type.
Therefore, I believe list, tuple, set, and dict data types are also data structures.
From my understanding, data structures are just data types that can contain multiple data types within them (including the data type of the containing data structure itself).
I am building a flexible, lightweight, in-memory database in Python, and discovered a performance problem with the way I was looking up values and using indexes. In an effort to improve this I've tried a few options, trying to balance speed with memory usage. My current implementation uses a dict of dicts to store data by record (object reference) and field (also an object reference). So for example, if I have three records with three fields, where some of the data is missing (i.e. NULL values):
{<Record1>: {<Field1>: 4, <Field2>: 'value', <Field3>: <Other Record>},
 <Record2>: {<Field1>: 4, <Field2>: 'value'},
 <Record3>: {<Field1>: 5}}
I considered a numpy array, but I would still need two dictionaries to map object instances to array indexes, so I can't see that it would perform any better.
Indexes are implemented using a pair of bisected lists, essentially acting as a map from value to record instance. For example, an index on Field1 above:
[[4, 4, 5], [<Record1>, <Record2>, <Record3>]]
I was previously using a simple dict of bins, but this didn't allow range lookups (e.g. all values > 5) (see Python hash table for fuzzy matching).
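For illustration, a minimal sketch of the kind of bisect-based index described above (the helper names are hypothetical, not the project's actual API):

import bisect

keys = [4, 4, 5]                                   # kept sorted
records = ['<Record1>', '<Record2>', '<Record3>']  # parallel list of record references

def index_insert(key, record):
    i = bisect.bisect_right(keys, key)
    keys.insert(i, key)
    records.insert(i, record)

def lookup_greater_than(value):
    # Range lookup: all records whose key is strictly greater than value.
    i = bisect.bisect_right(keys, value)
    return records[i:]

print(lookup_greater_than(4))   # ['<Record3>']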
My question is this. I am concerned that I have several object references, and multiple copies of the same values in the indexes. Do all these duplicate references actually use more memory, or are references cheap in Python? My alternative is to try to associate a numerical key with each object, which might improve things at least up to 256, but I don't know enough about how Python handles references to know if this would really be any better.
Does anyone have any suggestions of a better way to manage this?
Reimplementing the critical parts in C is an option I want to keep as a last resort.
For anyone interested, my code is here.
Edit 1:
The question, simply put, is which of the following is more efficient in terms of memory usage, where a is an object instance and i is an integer:
[a] * 1000
Or
[i] * 1000, {a: i}
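For reference, a minimal sketch of how the two options might be compared with sys.getsizeof; note that getsizeof reports only the container itself, not the objects it references, and in CPython every list slot is one pointer regardless of what it points at:

import sys

class Record:
    pass

a = Record()
i = 7

option_1 = [a] * 1000                # 1000 references to the same object
option_2 = ([i] * 1000, {a: i})      # 1000 references to an int, plus a small dict

print(sys.getsizeof(option_1))
print(sys.getsizeof(option_2[0]) + sys.getsizeof(option_2[1]))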
Edit 2:
Because of the large number of comments suggesting I use an existing system, here are my requirements. If anyone can suggest a system which fulfills all of these, that would be great, but so far I have not found anything that does. Otherwise, my original question still relates to memory usage of references in Python:
Must be light-weight and in-memory. Definitely not a client/server model.
Need to be able to easily alter tables, change fields, change rules, etc, on the fly.
Need to easily apply very complex validation rules. SQL doesn't meet this requirement. Although it is sometimes possible to build up very complicated statements, it is far from easy.
Need to support joins and associations between tables. Many NoSQL databases don't support joins at all, or at most only simple joins.
Need to support a method of loading and storing data to any file format. I am currently implementing this by providing a framework which makes it easy to add new formats as needed.
It does not need persistence (beyond storing data as in the previous point), and does not need to handle massive amounts of data, i.e. not more than a couple of million records. Typically, I am dealing with a few thousand.
Each reference is in effect a pointer, and each pointer requires a small amount of memory.
You can use the memory_profiler package to view memory use on a line-by-line basis. That way you can see what happens when you make a reference.
Python does not specify a particular implementation for dynamic memory management, but from the semantics of the language one can assume that a reference uses memory similar to a C pointer.
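For example, a minimal sketch of line-by-line profiling with the memory_profiler package (pip install memory-profiler); the function body and sizes here are arbitrary, and the script is run with python -m memory_profiler script.py:

from memory_profiler import profile

@profile
def build_structures():
    a = object()
    refs = [a] * 1_000_000    # many references to a single object
    ints = [7] * 1_000_000    # many references to a single small int
    return refs, ints

if __name__ == '__main__':
    build_structures()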
FWIW, I ran some tests on a 100x100 structure, testing a sparsely populated dictionary structure, a fully populated dictionary structure, a list, and a numpy array. The latter two had a dictionary mapping object references to indexes. I timed getting every item in the structure by index (returning a sentinel for missing data in the sparse dict), and also reported the total size. My results were somewhat surprising:
Structure      Time      Size
=============  ========  =====
full dict      0.0236s    6284
list           0.0426s   13028
sparse dict    0.1079s    1676
array          0.2262s   12608
So the fastest and second-smallest was the full dict, presumably because there was no need to run a key-in-dict check on it.
I am working with information from big models, which means I have a lot of big ASCII files with two float columns (let's say X and Y). However, whenever I have to read these files it takes a long time, so I thought maybe converting them to binary files would make the reading process much faster.
I converted my ASCII files into binary files using the uu.encode(ascii_file, binary_file) command, and it worked quite well (I actually tested the decode part and recovered the same files).
My question is: is there any way to read the binary files directly into Python and get the data into two variables (x and y)?
Thanks!
You didn't specify how your float columns are represented in Python. The cPickle module is a fast general solution, with the drawback that it creates files readable only from Python, and that it should never be allowed to read untrusted data (received from the network). It is likely to just work with all regular datatypes, including numpy arrays.
If you can use numpy and store your data in numpy arrays, look into numpy.save and numpy.savetxt and the corresponding loading functions, which should offer performance superior to manually extracting the data.
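For example, a minimal sketch using numpy's binary format (the file name and array sizes are arbitrary):

import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# Save both columns in one binary .npy file...
np.save('model_data.npy', np.column_stack((x, y)))

# ...and load them back into two variables.
data = np.load('model_data.npy')
x2, y2 = data[:, 0], data[:, 1]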
array.array also has methods for writing array data to file, with the drawback that the array data is written in the native format and cannot be read from a different architecture.
Check out python's struct module. It's probably what you'd want to be using for reading and writing your data.
I suggest that, instead of the struct module suggested elsewhere, you look at the array module if your model is just floats/doubles (coordinates); it should be much faster than anything in the struct module. The downside is that the collection is homogeneous, so you need to interleave the two columns' values, or store one column after the other.
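A minimal sketch of that approach with array.array, interleaving x and y and de-interleaving with slices on the way back in (the file name is arbitrary, and as noted above the native 'd' format is not portable across architectures):

from array import array

x = [1.0, 2.0, 3.0]
y = [0.5, 1.5, 2.5]

# Write: interleave the two columns as native doubles.
out = array('d')
for xi, yi in zip(x, y):
    out.extend((xi, yi))
with open('model_data.bin', 'wb') as f:
    out.tofile(f)

# Read back and de-interleave with slicing.
data = array('d')
with open('model_data.bin', 'rb') as f:
    data.fromfile(f, len(x) * 2)
x2, y2 = data[0::2], data[1::2]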
PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
from tables import IsDescription, StringCol, Int32Col, Float32Col, Float64Col

class Particle(IsDescription):
    name        = StringCol(itemsize=16)    # 16-character string
    lati        = Int32Col()                # integer
    longi       = Int32Col()                # integer
    pressure    = Float32Col(shape=(2, 3))  # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))  # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)) where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support the creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the HDF5 User's Guide, Datatypes chapter; see section 6.4.3.2.3, Variable-length Datatypes. (I'd link it, but they apparently chose not to put anchors that deep.)
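As a taste of what that looks like, here is a minimal h5py sketch of a ragged dataset where each cell holds a 1-D float32 array of arbitrary length; a genuinely N-dimensional ragged cell needs the nested variable-length types described in the User's Guide, and the file and dataset names here are just examples:

import h5py
import numpy as np

# Variable-length float32 element type (legacy h5py API).
vlen_float = h5py.special_dtype(vlen=np.dtype('float32'))

with h5py.File('ragged.h5', 'w') as f:
    dset = f.create_dataset('pressure', shape=(3,), dtype=vlen_float)
    dset[0] = np.arange(6, dtype='float32')    # e.g. a flattened 2x3 array
    dset[1] = np.arange(12, dtype='float32')   # a flattened 3x4 array
    dset[2] = np.arange(2, dtype='float32')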
Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:
/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature
and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
If you want to be able to do searches based on the lat and long, then having a dataset with the lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
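A minimal h5py sketch of that layout, with the group and attribute names taken from the example above (the values are placeholders):

import h5py
import numpy as np

with h5py.File('particles.h5', 'w') as f:
    particles = f.create_group('particles')
    p1 = particles.create_group('particlename1')
    # lat/long as attributes on the group, datasets for the array data.
    p1.attrs['lati'] = 42
    p1.attrs['longi'] = -71
    p1.create_dataset('pressure', data=np.zeros((2, 3), dtype='float32'))
    p1.create_dataset('temperature', data=np.zeros((2, 3), dtype='float64'))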
The short answer is "no", and I think it's a "limitation" of HDF5 rather than PyTables.
I think the reason is that each unit of storage (the compound dataset) must have a well-defined size; if one or more components can change size, then it obviously won't. Note that it is entirely possible to resize and extend a dataset in HDF5 (PyTables makes heavy use of this), but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note that you might be able to get rid of the unused disk space with HDF5 compression.
b) do as you suggest and create a new CArray in the same file, then just read it in when required. (To keep things tidy, you might want to put these all under their own group.)
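A minimal PyTables sketch of option (b), storing each arbitrarily-shaped array as its own CArray under one group (the file, group and array names are placeholders):

import numpy as np
import tables

with tables.open_file('images.h5', mode='w') as f:
    grp = f.create_group('/', 'images')
    for name, shape in [('img_0', (480, 640)), ('img_1', (32, 32))]:
        data = np.zeros(shape, dtype='float32')
        carr = f.create_carray(grp, name,
                               atom=tables.Float32Atom(),
                               shape=shape)
        carr[:] = data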
HDF5 actually has an API which is designed (and optimized) for storing images in an HDF5 file. I don't think it's exposed in PyTables.
I need to handle tens of Gigabytes data in one binary file. Each record in the data file is variable length.
So the file is like:
<len1><data1><len2><data2>..........<lenN><dataN>
The data contains integers, pointers, double values and so on.
I found that Python cannot handle this situation well. There is no problem if I read the whole file into memory; that is fast. But the struct package seems to perform poorly; it almost gets stuck unpacking the bytes.
Any help is appreciated.
Thanks.
struct and array, which other answers recommend, are fine for the details of the implementation, and might be all you need if your needs are always to sequentially read all of the file or a prefix of it. Other options include buffer, mmap, even ctypes, depending on many details you don't mention regarding your exact needs. Maybe a little specialized Cython-coded helper can offer all the extra performance you need, if no suitable and accessible library (in C, C++, Fortran, ...) already exists that can be interfaced for the purpose of handling this humongous file as you need to.
But clearly there are peculiar issues here -- how can a data file contain pointers, for example, which are intrinsically a concept related to addressing memory? Are they maybe "offsets" instead, and, if so, how exactly are they based and coded? Are your needs at all more advanced than simply sequential reading (e.g., random access), and if so, can you do a first "indexing" pass to get all the offsets from start of file to start of record into a more usable, compact, handily-formatted auxiliary file? (That binary file of offsets would be a natural for array -- unless the offsets need to be longer than array supports on your machine!). What is the distribution of record lengths and compositions and number of records to make up the "tens of gigabytes"? Etc, etc.
You have a very large scale problem (and no doubt very large scale hardware to support it, since you mention that you can easily read all of the file into memory that means a 64bit box with many tens of GB of RAM -- wow!), so it's well worth the detailed care to optimize the handling thereof -- but we can't help much with such detailed care unless we know enough detail to do so!-).
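To make the "indexing pass" idea concrete, here is a minimal sketch that records the byte offset of every record into an array('q'); it assumes each record starts with a 4-byte little-endian unsigned length prefix, which you would adjust to your real format:

import struct
from array import array

def build_offset_index(path):
    offsets = array('q')          # 64-bit offsets, enough for tens of GB
    with open(path, 'rb') as f:
        while True:
            pos = f.tell()
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack('<I', header)
            offsets.append(pos)
            f.seek(length, 1)     # skip over the record body
    return offsets

# offsets.tofile(...) can then persist the index for later random access.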
Have a look at the array module, specifically the array.fromfile method. This bit:
Each record in the data file is variable length.
is rather unfortunate, but you could handle it with a try/except clause.
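A minimal sketch of that, assuming the payload is plain doubles and relying on the fact that fromfile raises EOFError when fewer items than requested remain (it still appends whatever it did read); the file name and chunk size are placeholders:

from array import array

values = array('d')
with open('data.bin', 'rb') as f:
    while True:
        try:
            values.fromfile(f, 4096)   # read up to 4096 doubles per call
        except EOFError:
            break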
For a similar task, I defined a class like this:
from ctypes import Structure, c_uint32, sizeof, memmove, addressof

class foo(Structure):
    _fields_ = [("myint", c_uint32)]
created an instance
bar = foo()
and did,
block = file.read(sizeof(bar))
memmove(addressof(bar), block, sizeof(bar))
In the event of variable-size records, you can use a similar method to retrieve lenN and then read the corresponding data entry. It seems trivial to implement. However, I have no idea how fast this method is compared to using pack() and unpack(); perhaps someone else has profiled both methods.
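A minimal sketch of that variable-size variant, assuming the length prefix is a single c_uint32 (the structure and function names are placeholders):

from ctypes import Structure, c_uint32, sizeof, memmove, addressof

class RecordHeader(Structure):
    _fields_ = [("length", c_uint32)]   # assumed 4-byte length prefix

def read_records(path):
    header = RecordHeader()
    with open(path, 'rb') as f:
        while True:
            block = f.read(sizeof(header))
            if len(block) < sizeof(header):
                break
            memmove(addressof(header), block, sizeof(header))
            yield f.read(header.length)   # the variable-length record body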
For help with parsing the file without reading it into memory you can use the bitstring module.
Internally this uses the struct module and a bytearray, but an immutable Bits object can be initialised with a filename, so it won't read it all into memory.
For example:
from bitstring import Bits

s = Bits(filename='your_file')
while s.pos < s.length:
    # Read a byte and interpret it as an unsigned integer
    length = s.read('uint:8')
    # Read 'length' bytes and convert them to a Python bytes object
    data = s.read(length * 8).bytes
    # Now do whatever you want with the data
Of course you can parse the data however you want.
You can also use slice notation to read the file contents, although note that the indices are in bits rather than bytes, so for example s[-800:] would be the final 100 bytes.
What if you dump the data file into an in-memory sqlite3 database?
import sqlite3
con = sqlite3.connect(":memory:")
You can then use SQL to process the data.
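A minimal sketch of that approach, assuming the records have already been decoded into (length, payload) tuples; the table schema here is just an example:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, length INTEGER, payload BLOB)")

def load(records):
    # records: an iterable of (length, bytes) pairs
    con.executemany("INSERT INTO records (length, payload) VALUES (?, ?)", records)
    con.commit()

load([(3, b'abc'), (5, b'hello')])
for row in con.execute("SELECT id, length FROM records WHERE length > 3"):
    print(row)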
Besides, you might want to look at generators and iterators.
PyTables is a very good library for handling HDF5, a binary format used in astronomy and meteorology to handle very big datasets. It works more or less like a hierarchical database, where you can store many tables inside columns. Have a look at it.