Python str id hash

I'm trying to convert a user access log into a pure binary format, which would require converting each string into an int using some hash method; the mapping of "id -> string value" would then be stored somewhere for later reverse lookup.
Since I'm using Python, in order to save some processing time, instead of introducing hashlib to calculate a hash, can I simply use
string_hash = id(intern(some_string))
as the hash method? Is there any basic difference to be aware of compared to MD5 / SHA1? Is the probability of collision obviously higher than with MD5 / SHA1?

That won't work: id is not guaranteed to be consistent across interpreter executions; in CPython, it's the memory address of the object. Even if it were consistent, it doesn't have enough bytes for collision resistance. Why not just keep using the strings themselves? Whether ASCII or Unicode, strings can be serialized easily.
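To see the difference concretely, here is a minimal sketch (assuming Python 3, where intern lives in the sys module): id() changes from run to run, while a real digest depends only on the content, and truncating a digest trades collision resistance for size:

import hashlib
from sys import intern

s = "some_string"

# id() is just where the interpreter happened to allocate the object,
# so this prints a different value on each run.
print(id(intern(s)))

# A cryptographic digest depends only on the content, so it is stable
# across runs and across machines.
digest = hashlib.sha1(s.encode("utf-8")).digest()

# If a fixed-width integer key is needed, one option is to truncate the
# digest, e.g. to 8 bytes; this raises the collision probability
# relative to the full 20-byte SHA-1 value.
string_hash = int.from_bytes(digest[:8], "big")
print(string_hash)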

Related

PyBytes_FromString different endianness

I have a python-wrapped C++ object whose underlying data is a container std::vector<T> that represents bits. I have a function that writes these bits to a PyBytes object. If the endianness is the same, then there is no issue. However if I wish to write the bytes in a different endianness, then I need to bitswap (or byteswap) each word.
Ideally, I could pass an output iterator to the PyBytes_FromString constructor, where the output iterator just transforms the endianness of each word. This would be O(1) extra memory, which is the target.
Less ideally, I could somehow construct an empty PyBytes object, create the different-endianness char array manually and somehow assign that to the PyBytes object (basically reimplementing the PyBytes constructors). This would also be O(1) extra memory. Unfortunately, the way to do this would be to use _PyBytes_FromSize, but that's not available in the API.
The current way of doing this is to create an entire copy of the reversed words, just to then copy that representation over to the PyBytes object's representation.
I think the second option is the most practical way of doing this, but the only way I can see that working is by basically copying the _PyBytes_FromSize function into my source code, which seems hacky. I'm new to the Python C API and am wondering if there's a cleaner way to do this.
PyBytes_FromStringAndSize lets you pass NULL as the first argument, in which case it returns an uninitialized bytes object (which you can edit). It's really just equivalent to _PyBytes_FromSize and would let you do your second option.
If you wanted to try your "output iterator" option instead, then the solution would be to call PyBytes_Type:
PyObject *result = PyObject_CallFunctionObjArgs((PyObject*)&PyBytes_Type, your_iterable, NULL);
Any iterable that returns values between 0 and 255 should work. You can pick the PyObject_Call* that you find easiest to use.
I suspect writing the iterable in C/C++ will be more trouble than writing the loop though.
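For reference, the Python-level counterpart of that call is simply bytes(iterable), since bytes() accepts any iterable of ints in range(256). A sketch of the byteswapping-generator idea (a word size of 4 is an assumption; note that CPython may materialize the iterable internally, so this illustrates the interface rather than the O(1)-memory goal):

def byteswapped(data, word_size=4):
    # Yield the bytes of each word in reverse order, i.e. swap the
    # endianness word by word.
    for i in range(0, len(data), word_size):
        yield from reversed(data[i:i + word_size])

swapped = bytes(byteswapped(b"\x01\x02\x03\x04\x05\x06\x07\x08"))
assert swapped == b"\x04\x03\x02\x01\x08\x07\x06\x05"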

Use of .digest() in hashing?

What is the use of .digest() in this statement? Why do we use it? I searched on Google (and in the documentation) but still can't figure it out.
train_hashes = [hashlib.sha1(x).digest() for x in train_dataset]
What I found is that it converts to a string. Am I right or wrong?
The .digest() method returns the actual digest the hash is designed to produce.
It is a separate method because the hashing API is designed to accept data in multiple pieces:
hash = hashlib.sha1()
for chunk in large_amount_of_data:
    hash.update(chunk)
final_digest = hash.digest()
The above code creates a hashing object without passing any initial data in, then uses the hash.update() method to feed chunks of data in, one at a time, in a loop. This avoids having to load all of the data into memory at once, so you could hash anything between 1 byte and the entire Google index, if you ever had access to something that large.
If hashlib.sha1(x) produced the digest directly, you could never add more data to the hash first. There is also an alternative way of accessing the digest, as a hexadecimal string, via the hash.hexdigest() method (equivalent to hash.digest().hex(), but more convenient).
The code you found uses the fact that the constructor of the hash object also accepts data; since that is all of the data you want to hash, you can call .digest() immediately.
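For instance, a minimal sketch contrasting the one-shot form and the two accessors:

import hashlib

h = hashlib.sha1(b"hello")   # all of the data passed to the constructor
print(h.digest())            # 20 raw bytes, e.g. b'\xaa\xf4\xc6\x1d...'
print(h.hexdigest())         # 'aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d'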
The module documentation covers it this way:
There is one constructor method named for each type of hash. All return a hash object with the same simple interface. For example: use sha256() to create a SHA-256 hash object. You can now feed this object with bytes-like objects (normally bytes) using the update() method. At any point you can ask it for the digest of the concatenation of the data fed to it so far using the digest() or hexdigest() methods.
(bold emphasis mine).

Use integer keys in Berkeley DB with python (using bsddb3)

I want to use BDB as a time-series data store, and am planning to use microseconds since the epoch as the key values. I am using BTREE as the data store type.
However, when I try to store integer keys, bsddb3 gives an error saying TypeError: Integer keys only allowed for Recno and Queue DB's.
What is the best workaround? I can store them as strings, but that will probably make it unnecessarily slow.
Given that BDB itself can handle any kind of data, why is there this restriction? Can I somehow hack the bsddb3 implementation? Has anyone used other methods?
You can't store integers directly, since bsddb doesn't know how to represent integers or which kind of integer it is.
If you convert your integer to a string naively, you will break the lexicographic ordering of keys in bsddb: 10 > 2, but as strings "10" < "2".
You have to use Python's struct module to convert your integers into strings (or, in Python 3, bytes) and then store them in bsddb. You have to use big-endian packing or the ordering will not be correct.
Then you can use bsddb's Cursor.set_range(key) to query for information in a given slice of time.
For instance, Cursor.set_range(struct.pack('>Q', 123456789)) will position the cursor at the key of the event happening at 123456789, or at the first event that happens after it.
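A minimal sketch of the packing scheme (the opened BTREE database is assumed, and the helper names are illustrative):

import struct

def key_from_timestamp(us_since_epoch):
    # Big-endian, fixed-width packing preserves numeric order under the
    # bytewise comparison that BTREE keys use.
    return struct.pack(">Q", us_since_epoch)

def timestamp_from_key(key):
    return struct.unpack(">Q", key)[0]

# With db as an already-opened bsddb3 BTREE database:
# cursor = db.cursor()
# record = cursor.set_range(key_from_timestamp(123456789))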
Well, there's no real workaround, but there are two approaches you can use:
Store the integers as strings using str or repr. If the ints are big, you can even use string formatting (see the sketch after this list).
Use the cPickle/pickle module to store and retrieve data. This is a good way if you have data types other than the basic types. For basic ints and floats it is actually slower and takes more space than just storing strings.
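A note on the first approach: plain str breaks the key ordering, as described above, but zero-padding to a fixed width restores it for non-negative integers. A hypothetical sketch (key_from_int is an illustrative name):

def key_from_int(n, width=20):
    # Zero-padding to a fixed width makes lexicographic order match
    # numeric order, but only for non-negative integers.
    return ("%0*d" % (width, n)).encode("ascii")

assert key_from_int(2) < key_from_int(10)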

Hash function that protects against collisions, not attacks. (Produces a random UUID-size result space)

Using SHA1 to hash down larger strings so that they can be used as keys in a database.
Trying to produce a UUID-size string from the original string that is random enough and big enough to protect against collisions, but much smaller than the original string.
Not using this for anything security related.
Example:
# Take a very long string, hash it down to a smaller string behind the
# scenes, and use the hashed key as the database primary key instead
def _get_database_key(very_long_key):
    return hashlib.sha1(very_long_key).digest()
Is SHA1 a good algorithm to be using for this purpose? Or is there something else that is more appropriate?
Python has a uuid library, based on RFC 4122.
The version that uses SHA1 is UUIDv5, so the code would be something like this:
import uuid
uuid.uuid5(uuid.NAMESPACE_OID, 'your string here')
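The UUID object this returns offers several storable forms, for example:

import uuid

key = uuid.uuid5(uuid.NAMESPACE_OID, "some very long original string")
key.bytes   # 16 raw bytes, a fixed-size value suitable for a primary key
key.hex     # 32-character hexadecimal string
str(key)    # canonical 36-character dashed form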

C data structures

Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns a uniformly random result and scatters your data unpredictably all over the map (in a perfect world).
Also note that if you're doing a quick one-off hash, like a static lookup over two or three strings: look at gperf, which generates a perfect hash function and simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hashmap is the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably just to have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient sort wasn't readily available, because it's massively quick to code up and the performance is irrelevant at that scale.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc.) into month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.
