Use of .digest() in hashing? - python

What is the use of .digest() in this statement? Why do we use it? I searched on Google (and in the documentation) but I still can't figure it out.
train_hashes = [hashlib.sha1(x).digest() for x in train_dataset]
What I found is that it converts to a string. Am I right or wrong?

The .digest() method returns the actual digest the hash is designed to produce.
It is a separate method because the hashing API is designed to accept data in multiple pieces:
import hashlib

hash = hashlib.sha1()              # create a hash object with no data yet
for chunk in large_amount_of_data:
    hash.update(chunk)             # feed the data in piece by piece
final_digest = hash.digest()       # the 20-byte SHA-1 digest of everything fed in
The above code creates a hashing object without passing any initial data in, then uses the hash.update() method to feed chunks of data in a loop. This avoids having to load all of the data into memory at once, so you can hash anything from 1 byte to the entire Google index, if you ever had access to something that large.
If hashlib.sha1(x) produced the digest directly, you could never add additional data to the hash first. There is also an alternative way of accessing the digest, as a hexadecimal string, via the hash.hexdigest() method (equivalent to hash.digest().hex(), but more convenient).
The code you found uses the fact that the constructor of the hash object also accepts data; since that's all of the data that you wanted to hash, you can call .digest() immediately.
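A minimal sketch of that equivalence (the byte string here is just example data): hashing in one go, hashing in pieces, and the two digest forms all describe the same value:
import hashlib

one_shot = hashlib.sha1(b"some data").digest()   # constructor takes the data directly

h = hashlib.sha1()
h.update(b"some ")
h.update(b"data")
incremental = h.digest()                          # same 20 raw bytes

print(one_shot == incremental)                    # True
print(h.hexdigest() == h.digest().hex())          # True - hex string form of the same digest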
The module documentation covers it this way:
There is one constructor method named for each type of hash. All return a hash object with the same simple interface. For example: use sha256() to create a SHA-256 hash object. You can now feed this object with bytes-like objects (normally bytes) using the update() method. At any point you can ask it for the digest of the concatenation of the data fed to it so far using the digest() or hexdigest() methods.
(bold emphasis mine).

Related

PyBytes_FromString different endianness

I have a python-wrapped C++ object whose underlying data is a container std::vector<T> that represents bits. I have a function that writes these bits to a PyBytes object. If the endianness is the same, then there is no issue. However if I wish to write the bytes in a different endianness, then I need to bitswap (or byteswap) each word.
Ideally, I could pass an output iterator to the PyBytes_FromString constructor, where the output iterator just transforms the endianness of each word. This would be O(1) extra memory, which is the target.
Less ideally, I could somehow construct an empty PyBytes object, create the different-endianness char array manually and somehow assign that to the PyBytes object (basically reimplementing the PyBytes constructors). This would also be O(1) extra memory. Unfortunately, the way to do this would be to use _PyBytes_FromSize, but that's not available in the API.
The current way of doing this is to create an entire copy of the reversed words, just to then copy that representation over to the PyBytes object's representation.
I think the second option is the most practical way of doing this, but the only way I can see that working is by basically copying the _PyBytes_FromSize function into my source code which seems hacky. I'm new to the python-C api and am wondering if there's a cleaner way to do this.
PyBytes_FromStringAndSize lets you pass NULL as the first argument, in which case it returns an uninitialized bytes object (which you can edit). It's really just equivalent to _PyBytes_FromSize and would let you do your second option.
If you wanted to try your "output iterator" option instead, then the solution would be to call PyBytes_Type:
PyObject *result = PyObject_CallFunctionObjArgs((PyObject*)&PyBytes_Type, your_iterable, NULL);
Any iterable that returns values between 0 and 255 should work. You can pick the PyObject_Call* that you find easiest to use.
I suspect writing the iterable in C/C++ will be more trouble than writing the loop though.
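For intuition only: at the Python level, that PyObject_CallFunctionObjArgs call amounts to bytes(iterable), where the iterable yields integers in range(256). A rough Python sketch of the byte-swapping idea (swapped_words is a hypothetical helper; this illustrates the semantics, not the memory behaviour of the C call):
def swapped_words(words, word_size=4):
    # yield each word's bytes in the opposite byte order
    for w in words:
        for b in reversed(w.to_bytes(word_size, "little")):
            yield b

data = bytes(swapped_words([0x01020304, 0x0A0B0C0D]))
print(data.hex())   # '010203040a0b0c0d'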

Python str id hash

I'm trying to convert a user access log into a pure binary format, which would require converting each string into an int using some hash method; the "id -> string value" mapping would then be stored somewhere for later reverse lookups.
Since I'm using Python, in order to save some processing time, instead of introducing hashlib to calculate a hash, can I simply use
string_hash = id(intern(some_string))
as the hash method? Are there any basic differences to be aware of compared to MD5 / SHA1? Is the probability of collision obviously higher than with MD5 / SHA1?
Doesn't work. id is not guaranteed to be consistent across interpreter executions; in CPython, it's the memory location of the object. Even if it were consistent, it doesn't have enough bytes for collision resistance. Why not just keep using the strings? ASCII or Unicode, strings can be serialized easily.
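If a stable integer key is still needed, a minimal sketch (assuming the strings are UTF-8 encodable) is to derive it from a real digest. Unlike id(), this is reproducible across runs, though truncating to 64 bits raises the collision risk compared with a full MD5 / SHA1 digest:
import hashlib

def stable_id(s):
    # take the first 64 bits of the SHA-1 digest as an integer
    return int(hashlib.sha1(s.encode("utf-8")).hexdigest()[:16], 16)

print(stable_id("some_string"))   # identical on every run and every machine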

Can I use binding arguments in a copy_to query?

Currently I am using the copy_to(..) function to get the following output:
>>> cur.copy_to(sys.stdout, 'test', sep="|")
1|100|abc'def
2|\N|dada
...
What I would like to achieve is to use the copy_to(..) function for selecting large amounts of data. I reviewed the psycopg2 documentation; however, I could not find a way to use binding arguments with this function. Any suggestions?
From the psycopg docs, copy_to:
Writes the content of the table named table to the file-like object file
"What's a file-like object?" you might ask. File objects are described in the Python documentation, and the distinction between them and file-like objects is indicated there. In general, it's an object that supports methods like read()/write()/close() with signatures matching those of the file object.
So the file you pass can be either a real file (consider the tempfile module) or an in-memory "file" such as a StringIO instance.
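A minimal sketch of the in-memory option, assuming the same cur cursor and test table as in the question:
from io import StringIO   # on Python 2: from StringIO import StringIO

buf = StringIO()
cur.copy_to(buf, 'test', sep='|')   # COPY the whole table into the buffer
print(buf.getvalue())               # the same rows that went to sys.stdout earlier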

Is there a Python module for transparently working with a file-like object as a buffer?

I'm working on a pure Python parser, where the input data may range in size from kilobytes to gigabytes. Is there a module that wraps a file-like object and abstracts explicit .open()/.seek()/.read()/.close() calls into a simple buffer-like object? You might think of this as the inverse of StringIO. I expect it might look something like:
with FileLikeObjectBackedBuffer(urllib.urlopen("http://www.google.com")) as buf:
    header = buf[0:0x10]
    footer = buf[-0x10:]
Note, I asked a similar question yesterday, and accepted an answer that mmaps a file. Here, I am specifically looking for a module that wraps a file-like object (for argument's sake, say like what is returned by urllib).
Update
I've repeatedly come back to this question since I first asked it, and it turns out urllib may not have been the best example. It's a bit of a special case, since it's a streaming interface. StringIO and bz2 expose a more traditional seek/read/close interface, and personally I use these more often. Therefore, I wrote a module that wraps file-like objects as buffers. You can check it out here.
Although urllib.urlopen returns a file-like object, I don't believe it's possible to do what you want without writing your own wrapper - it doesn't support seek, for instance, though it does support next, read, etc. And since you're dealing with a forward-only stream, you'd have to handle jump-aheads by reading until you reach a certain point and caching for any backtracking.
IMHO you can't efficiently skip parts of a network IO stream (if you want the last byte, you still have to fetch all the previous bytes to get there - how you manage that storage is up to you).
I would be tempted to urlretrieve (or similar) the file, and mmap as per your previous answer.
If your server accepts range requests (and the response size is known, so the blocks in your example can be derived from it), then a possible workaround is byte serving: http://en.wikipedia.org/wiki/Byte_serving (but I can't say I've ever tried that).
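A rough sketch of that idea (Python 2; the URL is a placeholder, and it assumes the server actually honours the Range header, otherwise you get the full body back):
import urllib2

url = "http://example.com/big-file"

req = urllib2.Request(url, headers={"Range": "bytes=0-15"})   # first 16 bytes
header = urllib2.urlopen(req).read()

req = urllib2.Request(url, headers={"Range": "bytes=-16"})    # last 16 bytes
footer = urllib2.urlopen(req).read()

print repr(header), repr(footer)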
Given the example, if you only want the first 16 and last 16 bytes and don't want to do something "too fancy":
from string import ascii_lowercase
from random import choice
from StringIO import StringIO
buf = ''.join(choice(ascii_lowercase) for _ in range(50))
print buf
sio_buf = StringIO(buf) # make it a bit more like a stream object
first16 = sio_buf.read(16)
print first16
from collections import deque
last16 = deque(iter(lambda: sio_buf.read(1), ''), 16) # read(1) may look bad but it's buffered anyway - so...
print ''.join(last16)
Output:
gpsgvqsbixtwyakpgefrhntldsjqlmfvyzwjoykhsapcmvjmar
gpsgvqsbixtwyakp
wjoykhsapcmvjmar

what's the difference between the HMAC signature and hashing directly?

Just out of curiosity, really... for example, in python,
hashlib.sha1("key" + "data").hexdigest() != hmac.new("key", "data", hashlib.sha1)
is there some logical distinction I'm missing between the two actions?
hashlib.sha1 simply gives you the SHA-1 hash of the content "keydata" that you pass in (note that you are just concatenating the two strings). The hmac call gives you a keyed hash of the string "data", using the string "key" as the key and SHA-1 as the hash function. The fundamental difference between the two calls is that an HMAC can only be reproduced by someone who knows the key, so it also tells you something about who generated it. SHA-1 alone can only be used to detect that the content has not changed.
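A quick way to see the difference (byte strings so it also runs on Python 3; the key and data are just the values from the question):
import hashlib
import hmac

key, data = b"key", b"data"

plain = hashlib.sha1(key + data).hexdigest()            # plain hash of the concatenation
keyed = hmac.new(key, data, hashlib.sha1).hexdigest()   # keyed HMAC-SHA1

print(plain)
print(keyed)
print(plain == keyed)   # False - the two constructions are not interchangeable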
I found the answer in the manual.
https://en.wikipedia.org/wiki/Hmac#Design_principles
