Generate ID from string in Python - python

I'm struggling a bit to generate ID of type integer for given string in Python.
I thought the built-it hash function is perfect but it appears that the IDs are too long sometimes. It's a problem since I'm limited to 64bits as maximum length.
My code so far: hash(s) % 10000000000.
The input string(s) which I can expect will be in range of 12-512 chars long.
Requirements are:
integers only
generated from provided string
ideally up to 10-12 chars long (I'll have ~5 million items only)
low probability of collision..?
I would be glad if someone can provide any tips / solutions.

I would do something like this:
>>> import hashlib
>>> m = hashlib.md5()
>>> m.update("some string")
>>> str(int(m.hexdigest(), 16))[0:12]
'120665287271'
The idea:
Calculate the hash of a string with MD5 (or SHA-1 or ...) in hexadecimal form (see module hashlib)
Convert the string into an integer and reconvert it to a String with base 10 (there are just digits in the result)
Use the first 12 characters of the string.
If characters a-f are also okay, I would do m.hexdigest()[0:12].

If you're not allowed to add extra dependency, you can continue using hash function in the following way:
>>> my_string = "whatever"
>>> str(hash(my_string))[1:13]
'460440266319'
NB:
I am ignoring 1st character as it may be the negative sign.
hash may return different values for same string, as PYTHONHASHSEED Value will change everytime you run your program. You may want to set it to some fixed value. Read here

encode utf-8 was needed for mine to work:
def unique_name_from_str(string: str, last_idx: int = 12) -> str:
"""
Generates a unique id name
refs:
- md5: https://stackoverflow.com/questions/22974499/generate-id-from-string-in-python
- sha3: https://stackoverflow.com/questions/47601592/safest-way-to-generate-a-unique-hash
(- guid/uiid: https://stackoverflow.com/questions/534839/how-to-create-a-guid-uuid-in-python?noredirect=1&lq=1)
"""
import hashlib
m = hashlib.md5()
string = string.encode('utf-8')
m.update(string)
unqiue_name: str = str(int(m.hexdigest(), 16))[0:last_idx]
return unqiue_name
see my ultimate-utils python library.

Related

Generate a variable length hash

Is there a way in python to generate a short version of hashcode, SHA128 or SHA256
I tried this but its not correct way to get[extracting substring to shorten]
hashlib.sha256(str.encode('utf-8')).hexdigest()[:8]
for example if i have a string, i want 8 character hash
str = "iamtestingastringtogethash"
print(hashlib.sha256(str.encode('utf-8').hexdigest(),8))
Output: some_8char_hash
Ok i found my own fix,
hashlib.shake_256(str.encode("utf-8")).hexdigest(length=8)
As per the hashlib source code hexdigest method looks for integer value as input without a keyword
def hexdigest(self, __length: int) -> str: ...
the following will do the work without the keyword length and double of input number is the total length
hashlib.shake_256(str.encode("utf-8")).hexdigest(4)

Hex string convert to ASCII after XOR

I am new to Python & I am trying to learn how to XOR hex encoded ciphertexts against one another & then derive the ASCII value of this.
I have tried some of the functions as outlined in previous posts on this subject - such as bytearray.fromhex, binascii.unhexlify, decode("hex") and they have all generated different errors (obviously due to my lack of understanding). Some of these errors were due to my python version (python 3).
Let me give a simple example, say I have a hex encoded string ciphertext_1 ("4A17") and a hex endoded string ciphertext_2. I want to XOR these two strings and derive their ASCII value. The closest that I have come to a solution is with the following code:
result=hex(int(ciphertext_1, 16) ^ int(ciphertext_2, 16))
print(result)
This prints me a result of: 0xd07
(This is a hex string is my understanding??)
I then try to convert this to its ASCII value. At the moment, I am trying:
binascii.unhexliy(result)
However this gives me an error: "binascii.Error: Odd-length string"
I have tried the different functions as outlined above, as well as trying to solve this specific error (strip function gives another error) - however I have been unsuccessful. I realise my knowledge and understanding of the subject are lacking, so i am hoping someone might be able to advise me?
Full example:
#!/usr/bin/env python
import binascii
ciphertext_1="4A17"
ciphertext_2="4710"
result=hex(int(ciphertext_1, 16) ^ int(ciphertext_2, 16))
print(result)
print(binascii.unhexliy(result))
from binascii import unhexlify
ciphertext_1 = "4A17"
ciphertext_2 = "4710"
xored = (int(ciphertext_1, 16) ^ int(ciphertext_2, 16))
# We format this integer: hex, no leading 0x, uppercase
string = format(xored, 'X')
# We pad it with an initial 0 if the length of the string is odd
if len(string) % 2:
string = '0' + string
# unexlify returns a bytes object, we decode it to obtain a string
print(unhexlify(string).decode())
#
# Not much appears, just a CR followed by a BELL
Or, if you prefer the repr of the string:
print(repr(unhexlify(string).decode()))
# '\r\x07'
When doing byte-wise operations like XOR, it's often easier to work with bytes objects (since the individual bytes are treated as integers). From this question, then, we get:
ciphertext_1 = bytes.fromhex("4A17")
ciphertext_2 = bytes.fromhex("4710")
XORing the bytes can then be accomplished as in this question, with a comprehension. Then you can convert that to a string:
result = [c1 ^ c2 for (c1, c2) in zip(ciphertext_1, ciphertext_2)]
result = ''.join(chr(c) for c in result)
I would probably take a slightly different angle and create a bytes object instead of a list, which can be decoded into your string:
result = bytes(b1 ^ b2 for (b1, b2) in zip(ciphertext_1, ciphertext_2)).decode()

Strings, ints and leading zeros

I need to record SerialNumber(s) on an object. We enter many objects. Most serial numbers are strings - the numbers aren't used numerically, just as unique identifiers - but they are often sequential. Further, leading zeros are important due to unique id status of serial number.
When doing data entry, it's nice to just enter the first "sequential" serial number (eg 000123) and then the number of items (eg 5) to get the desired output - that way we can enter data in bulk see below:
Obj1.serial = 000123
Obj2.serial = 000124
Obj3.serial = 000125
Obj4.serial = 000126
Obj5.serial = 000127
The problem is that when you take the first number-as-string, turn to integer and increment, you loose the leading zeros.
Not all serials are sequential - not all are even numbers (eg FDM-434\RRTASDVI908)
But those that are, I would like to automate entry.
In python, what is the most elegant way to check for leading zeros (*and, I guess, edge cases like 0009999) in a string before iterating, and then re-application of those zeros after increment?
I have a solution to this problem but it isn't elegant. In fact, it's the most boring and blunt alg possible.
Is there an elegant solution to this problem?
EDIT
To clarify the question, I want the serial to have the same number of digits after the increment.
So, in most cases, this will mean reapplying the same number of leading zeros. BUT in some edge cases the number of leading zeros will be decremented. eg: 009 -> 010; 0099 -> 0100
Try str.zfill():
>>> s = "000123"
>>> i = int(s)
>>> i
123
>>> n = 6
>>> str(i).zfill(n)
'000123'
I develop my comment here, Obj1.serial being a string:
Obj1.serial = "000123"
('%0'+str(len(Obj1.serial))+'d') % (1+int(Obj1.serial))
It's like #owen-s answer '%06d' % n: print the number and pad with leading 0.
Regarding '%d' % n, it's just one way of printing. From PEP3101:
In Python 3.0, the % operator is supplemented by a more powerful
string formatting method, format(). Support for the str.format()
method has been backported to Python 2.6.
So you may want to use format instead… Anyway, you have an integer at the right of the % sign, and it will replace the %d inside the left string.
'%06d' means print a minimum of 6 (6) digits (d) long, fill with 0 (0) if necessary.
As Obj1.serial is a string, you have to convert it to an integer before the increment: 1+int(Obj1.serial). And because the right side takes an integer, we can leave it like that.
Now, for the left part, as we can't hard code 6, we have to take the length of Obj1.serial. But this is an integer, so we have to convert it back to a string, and concatenate to the rest of the expression %0 6 d : '%0'+str(len(Obj1.serial))+'d'. Thus
('%0'+str(len(Obj1.serial))+'d') % (1+int(Obj1.serial))
Now, with format (format-specification):
'{0:06}'.format(n)
is replaced in the same way by
('{0:0'+str(len(Obj1.serial))+'}').format(1+int(Obj1.serial))
You could check the length of the string ahead of time, then use rjust to pad to the same length afterwards:
>>> s = "000123"
>>> len_s = len(s)
>>> i = int(s)
>>> i
123
>>> str(i).rjust(len_s, "0")
'000123'
You can check a serial number for all digits using:
if serial.isdigit():

Length of hexadecimal number

How can we get the length of a hexadecimal number in the Python language?
I tried using this code but even this is showing some error.
i = 0
def hex_len(a):
if a > 0x0:
# i = 0
i = i + 1
a = a/16
return i
b = 0x346
print(hex_len(b))
Here I just used 346 as the hexadecimal number, but my actual numbers are very big to be counted manually.
Use the function hex:
>>> b = 0x346
>>> hex(b)
'0x346'
>>> len(hex(b))-2
3
or using string formatting:
>>> len("{:x}".format(b))
3
While using the string representation as intermediate result has some merits in simplicity it's somewhat wasted time and memory. I'd prefer a mathematical solution (returning the pure number of digits without any 0x-prefix):
from math import ceil, log
def numberLength(n, base=16):
return ceil(log(n+1)/log(base))
The +1 adjustment takes care of the fact, that for an exact power of your number base you need a leading "1".
As Ashwini wrote, the hex function does the hard work for you:
hex(x)
Convert an integer number (of any size) to a hexadecimal string. The result is a valid Python expression.

Random strings in Python 2.6 (Is this OK?)

I've been trying to find a more pythonic way of generating random string in python that can scale as well. Typically, I see something similar to
''.join(random.choice(string.letters) for i in xrange(len))
It sucks if you want to generate long string.
I've been thinking about random.getrandombits for a while, and figuring out how to convert that to an array of bits, then hex encode that. Using python 2.6 I came across the bitarray object, which isn't documented. Somehow I got it to work, and it seems really fast.
It generates a 50mil random string on my notebook in just about 3 seconds.
def rand1(leng):
nbits = leng * 6 + 1
bits = random.getrandbits(nbits)
uc = u"%0x" % bits
newlen = int(len(uc) / 2) * 2 # we have to make the string an even length
ba = bytearray.fromhex(uc[:newlen])
return base64.urlsafe_b64encode(str(ba))[:leng]
edit
heikogerlach pointed out that it was an odd number of characters causing the issue. New code added to make sure it always sent fromhex an even number of hex digits.
Still curious if there's a better way of doing this that's just as fast.
import os
random_string = os.urandom(string_length)
and if you need url safe string :
import os
random_string = os.urandom(string_length).hex()
(note random_string length is greatest than string_length in that case)
Sometimes a uuid is short enough and if you don't like the dashes you can always.replace('-', '') them
from uuid import uuid4
random_string = str(uuid4())
If you want it a specific length without dashes
random_string_length = 16
str(uuid4()).replace('-', '')[:random_string_length]
Taken from the 1023290 bug report at Python.org:
junk_len = 1024
junk = (("%%0%dX" % junk_len) % random.getrandbits(junk_len *
8)).decode("hex")
Also, see the issues 923643 and 1023290
It seems the fromhex() method expects an even number of hex digits. Your string is 75 characters long.
Be aware that something[:-1] excludes the last element! Just use something[:].
Regarding the last example, the following fix to make sure the line is even length, whatever the junk_len value:
junk_len = 1024
junk = (("%%0%dX" % (junk_len * 2)) % random.getrandbits(junk_len * 8)).decode("hex")

Categories

Resources