Shortest possible generated unique ID - python

So we can generate a unique id with str(uuid.uuid4()), which is 36 characters long.
Is there another method to generate a unique ID which is shorter in terms of characters?
EDIT:
If the ID is usable as a primary key, even better.
Granularity should be better than 1ms
This code could be distributed, so we can't assume time independence.

If this is for use as a primary key field in a DB, consider just using an auto-incrementing integer instead.
str(uuid.uuid4()) is 36 chars but it has four useless dashes (-) in it, and it's limited to 0-9 a-f.
Better uuid4 in 32 chars:
>>> uuid.uuid4().hex
'b327fc1b6a2343e48af311343fc3f5a8'
Or just b64 encode and slice some urandom bytes (up to you to guarantee uniqueness):
>>> base64.b64encode(os.urandom(32))[:8]
b'iR4hZqs9'

TLDR
Most of the time it's better to work with numbers internally and encode them to short IDs externally. So here are functions for Python 3, PowerShell & VBA that convert an int32 to an alphanumeric ID. Use it like this:
int32_to_id(225204568)
'F2AXP8'
For distributed code use ULIDs: https://github.com/mdipierro/ulid
They are much longer but unique across different machines.
How short are the IDs?
It will encode about half a billion IDs in 6 characters so it's as compact as possible while still using only non-ambiguous digits and letters.
How can I get even shorter IDs?
If you want even more compact IDs/codes/Serial Numbers, you can easily expand the character set by just changing the chars="..." definition. For example if you allow all lower and upper case letters you can have 56 billion IDs within the same 6 characters. Adding a few symbols (like ~!##$%^&*()_+-=) gives you 208 billion IDs.
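Quick arithmetic behind those counts (a sketch; the 28-character set is the one used in the code below, and 15 symbols are assumed for the last case):
>>> len("0123456789ACEFHJKLMNPRTUVWXY") ** 6
481890304
>>> 62 ** 6          # digits plus upper- and lower-case letters
56800235584
>>> 77 ** 6          # plus 15 symbols
208422380089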
So why didn't you go for the shortest possible IDs?
The character set I'm using in my code has an advantage: It generates IDs that are easy to copy-paste (no symbols so double clicking selects the whole ID), easy to read without mistakes (no look-alike characters like 2 and Z) and rather easy to communicate verbally (only upper case letters). Sticking to numeric digits only is your best option for verbal communication but they are not compact.
I'm convinced: show me the code
Python 3
def int32_to_id(n):
    if n == 0: return "0"
    chars = "0123456789ACEFHJKLMNPRTUVWXY"
    length = len(chars)
    result = ""
    remain = n
    while remain > 0:
        pos = remain % length
        remain = remain // length
        result = chars[pos] + result
    return result
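The decoder is only shown in VBA further down; a minimal Python equivalent (a sketch using the same character set, helper name chosen here for illustration) could be:
def id_to_int32(s):
    """Inverse of int32_to_id: decode an ID back to the original integer."""
    chars = "0123456789ACEFHJKLMNPRTUVWXY"
    result = 0
    for ch in s:
        result = result * len(chars) + chars.index(ch)
    return result

# round-trip check
assert id_to_int32(int32_to_id(225204568)) == 225204568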
PowerShell
function int32_to_id($n){
    $chars = "0123456789ACEFHJKLMNPRTUVWXY"
    $length = $chars.length
    $result = ""; $remain = [int]$n
    do {
        $pos = $remain % $length
        $remain = [int][Math]::Floor($remain / $length)
        $result = $chars[$pos] + $result
    } while ($remain -gt 0)
    $result
}
VBA
Function int32_to_id(n)
    Dim chars$, length, result$, remain, pos
    If n = 0 Then int32_to_id = "0": Exit Function
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result$ = ""
    remain = n
    Do While (remain > 0)
        pos = remain Mod length
        remain = Int(remain / length)
        result$ = Mid(chars$, pos + 1, 1) + result$
    Loop
    int32_to_id = result
End Function

Function id_to_int32(id$)
    Dim chars$, length, result, remain, pos, value, power
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result = 0
    power = 1
    For pos = Len(id$) To 1 Step -1
        result = result + (InStr(chars$, Mid(id$, pos, 1)) - 1) * power
        power = power * length
    Next
    id_to_int32 = result
End Function

Public Sub test_id_to_int32()
    Dim i
    For i = 0 To 28 ^ 3
        If id_to_int32(int32_to_id(i)) <> i Then Debug.Print "Error, i=", i, "int32_to_id(i)", int32_to_id(i), "id_to_int32('" & int32_to_id(i) & "')", id_to_int32(int32_to_id(i))
    Next
    Debug.Print "Done testing"
End Sub

Yes. Just use the current UTC millis. This number never repeats.
const uniqueID = new Date().getTime();
EDIT
If you have the rather seldom requirement of producing more than one ID within the same millisecond, this method is of no use, as this number's granularity is 1 ms.
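If sub-millisecond granularity matters (as in the question's edit), a Python sketch using nanosecond time would be the closest equivalent; note it is still not collision-safe across machines or processes:
import time

unique_id = time.time_ns()  # nanoseconds since the epoch: finer than 1 ms,
                            # but uniqueness is still not guaranteed
print(unique_id)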

win32com LineStyle Excel

Luckily I found this site:
https://www.linuxtut.com/en/150745ae0cc17cb5c866/
(It lists many line types defined in the Excel enum XlLineStyle:
xlContinuous = 1
xlDashDot = 4
xlDashDotDot = 5
xlSlantDashDot = 13
xlDash = -4115
xlDot = -4118
xlDouble = -4119
xlLineStyleNone = -4142)
I looped over roughly +/- 100,000 values with try/except, setting lines each time, because I thought there had to be an [index] number somewhere that would put the line I want into my picture too, but there wasn't. Why not?
How can I set this line?
Why are some line style indexes in such a huge negative range and not just 1, 2, 3...?
How can I discover things like the "number" needed to do things like that?
And why is it even possible to send data to particular positions in other apps? I want to dig a little deeper into that; where can I learn more about this?
(1) You can't find the medium dashed in the linestyle enum because there is none. The line that is drawn as border is a combination of lineStyle and Weight. The lineStyle is xlDash, the weight is xlThin for value 03 in your table and xlMedium for value 08.
(2) To figure out how to set something like this in VBA, use the Macro recorder, it will reveal that lineStyle, Weight (and color) are set when setting a border.
(3) There are a lot of pages describing all the constants, e.g. have a look at the one @FaneDuru linked to in the comments. They can also be found at Microsoft itself: https://learn.microsoft.com/en-us/office/vba/api/excel.xllinestyle and https://learn.microsoft.com/en-us/office/vba/api/excel.xlborderweight. It seems that someone translated them to Python constants on the linuxTut page.
(4) Don't ask why the enums are not continuous values. I assume especially the constants with negative numbers serve more than one purpose. Just never use the values directly, always use the defined constants.
(5) You can assume that numeric values that have no defined constant can work, but the results are kind of unpredictable. It's unlikely that there are values without constant that result in something "new" (eg a different border style).
As you can see in the following table, not all combinations give different borders. Setting the weight to xlHairline will ignore the lineStyle. Setting it to xlThick will also ignore the lineStyle, except for xlDouble. On the other hand, xlDouble will be ignored when the weight is not xlThick.
Sub border()
    With ThisWorkbook.Sheets(1)
        With .Range("A1:J18")
            .Clear
            .Interior.Color = vbWhite
        End With

        Dim lStyles(), lWeights(), lStyleNames(), lWeightNames
        lStyles() = Array(xlContinuous, xlDash, xlDashDot, xlDashDotDot, xlDot, xlDouble, xlLineStyleNone, xlSlantDashDot)
        lStyleNames() = Array("xlContinuous", "xlDash", "xlDashDot", "xlDashDotDot", "xlDot", "xlDouble", "xlLineStyleNone", "xlSlantDashDot")
        lWeights = Array(xlHairline, xlThin, xlMedium, xlThick)
        lWeightNames = Array("xlHairline", "xlThin", "xlMedium", "xlThick")

        Dim x As Long, y As Long
        For x = LBound(lStyles) To UBound(lStyles)
            Dim row As Long
            row = x * 2 + 3
            .Cells(row, 1) = lStyleNames(x) & vbLf & "(" & lStyles(x) & ")"
            For y = LBound(lWeights) To UBound(lWeights)
                Dim col As Long
                col = y * 2 + 3
                If x = 1 Then .Cells(1, col) = lWeightNames(y) & vbLf & "(" & lWeights(y) & ")"
                With .Cells(row, col).Borders
                    .LineStyle = lStyles(x)
                    .Weight = lWeights(y)
                End With
            Next
        Next
    End With
End Sub
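Since the question is tagged win32com (Python), here is a hedged sketch of setting the same lineStyle/Weight combination from Python; it assumes Excel is installed, a workbook is open, and it uses the enum constants from the generated type library rather than hard-coded numbers:
import win32com.client

excel = win32com.client.gencache.EnsureDispatch("Excel.Application")
c = win32com.client.constants          # enum values from the Excel type library
ws = excel.ActiveWorkbook.Sheets(1)

rng = ws.Range("B2:D4")
border = rng.Borders(c.xlEdgeBottom)   # pick one edge; loop over edges for all sides
border.LineStyle = c.xlDash            # "medium dashed" = xlDash line style...
border.Weight = c.xlMedium             # ...combined with xlMedium weight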

A variable is spontaneously growing when I attempt to assign it to a list index

Essentially, I have a list of hashes. I need to modify each one slightly by inserting a section of the hash in front of it. A bit random, I admit, but all of the main code works fine.
import hashlib

bytes_to_hash = b"this is an interesting bytes object"
cycles = 65
digest_length = 10000

hash_list = [
    hashlib.shake_256(bytes_to_hash).digest(digest_length),
    hashlib.shake_256(bytes_to_hash + b"123").digest(digest_length)
]

for _ in range(cycles):
    number = 0  # will be used to generate the next hash later on
    for x in range(4):
        selection = int(hashlib.sha256(hash_list[-1] + str(x).encode()).hexdigest(), 16) % digest_length
        # ^ used to pseudorandomly select a number
        for count, hash in enumerate(hash_list):
            number += hash[selection]
            prev_hash = hash_list[count - 1]
            prev_hash = prev_hash[:selection - 512] + hash[selection - 512 : selection] + prev_hash[selection:]
            #hash_list[count - 1] = prev_hash
            # ^ this is the issue
    number = number % digest_length
    selection_range = hash_list[-1][number - 128 : number]
    hash_list.append(hashlib.shake_256(selection_range).digest(digest_length))
The problem comes when I attempt to actually modify the list itself. This simple line screws everything up -
hash_list[count - 1] = prev_hash
After adding this line, the script suddenly takes a very long time to execute, up from about 75 ms to over 100 seconds. A sanity check of removing that line and inserting print(len(prev_hash)) proved that the length of prev_hash is always 10,000, exactly what I expected. It is only when I attempt to assign prev_hash to a list index that the length reaches as high as 19,000,000 for no apparent reason. The weirdest part is that after reintroducing that line, it seems that the length of prev_hash itself is what's actually expanding, not the value in the list.
I have no idea what I'm doing wrong, any help is appreciated.
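A minimal sketch of how overlapping slices can balloon the length (illustrative values, not the asker's exact data): when selection is below 512, selection - 512 is negative, so the three slices overlap instead of partitioning the original bytes, and the rebuilt value is longer than the original; once that value is written back into the list, the growth compounds.
data = bytes(10000)
selection = 100                       # any value below 512 makes selection - 512 negative
rebuilt = data[:selection - 512] + data[selection - 512:selection] + data[selection:]
print(len(data), len(rebuilt))        # 10000 19488 -- the pieces overlap rather than partition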

Parse list of strings for speed

Background
I have a function called get_player_path that takes in a list of strings player_file_list and an int value total_players. For the sake of example I have reduced the list of strings and also set the int value to a very small number.
Each string in player_file_list has either the form year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What I am essentially trying to achieve is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.
My current approach does not seem the most efficient to me and I am wondering if I can speed up my function get_player_path in any way?
Code
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
player_file_list = [
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]
print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed performance with huge lists, you would probably be using some other language. Nonetheless, if you really care about performance (or just want to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice.
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)  # <--- len(date) + len('/') + 1
        file_path = player_file[:end]    # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
    ...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1
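To make the fixed-length idea concrete with the sample data (the date is 10 characters and the IDs shown are 14):
DATE_LENGTH, ID_LENGTH = 10, 14
LENGTH = DATE_LENGTH + ID_LENGTH + 1
print("2020-10-27/31001804320549/31001804320549.json"[:LENGTH])
# -> 2020-10-27/31001804320549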
I'll leave this solution here, which can be further improved; hope it helps.
player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)

def get_player_path(l, n):
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        if i not in pfl:
            pfl.add(i)
            if len(pfl) == n:
                return pfl
    if n > len(pfl):
        print("not enough matches")
        return

print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Use a dict so that you don't have to sort, since your list is already sorted. If you still need to sort, you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
    dct = {re.search(r'^\w+-\w+-\w+/\w+', pf).group(): 1 for pf in player_file_list}
    return [k for i, k in enumerate(dct.keys()) if i < total_players]

Python:16 bit one's complement addition implementation

I implemented one's complement addition of 16-bit integers in Python; however, I am trying to see if there is a better way to do it.
# This function returns a string of the bits (exactly 16 bits)
# for the number (in base 10 passed to it)
def get_bits(some_num):
    binar = bin(some_num)[2::]
    zeroes = 16 - len(binar)
    padding = zeroes * "0"
    binar = padding + binar
    return binar

# This function adds the numbers, and handles the carry over
# from the most significant bit
def add_bits(num1, num2):
    result = bin(int(num1, 2) + int(num2, 2))[2::]
    # There is no carryover
    if len(result) <= 16:
        result = get_bits(int(result, 2))
    # There is carryover
    else:
        result = result[1::]
        one = '0000000000000001'
        result = bin(int(result, 2) + int(one, 2))[2::]
        result = get_bits(int(result, 2))
    return result
And now an example of running it would be:
print add_bits("1010001111101001", "1000000110110101")
returns :
0010010110011111
Is what I wrote safe as far as results go (note I didn't do any negation here, since that part is trivial; I am more interested in the intermediate steps)? Is there a better, more Pythonic way to do it?
Thanks for any help.
Converting back and forth between string and ints to do math is inefficient. Do the math in integers and use formatting to display binary:
MOD = 1 << 16

def ones_comp_add16(num1, num2):
    result = num1 + num2
    return result if result < MOD else (result + 1) % MOD
n1 = 0b1010001111101001
n2 = 0b1000000110110101
result = ones_comp_add16(n1,n2)
print('''\
{:016b}
+ {:016b}
------------------
{:016b}'''.format(n1,n2,result))
Output:
1010001111101001
+ 1000000110110101
------------------
0010010110011111
Converting back and forth between numbers, lists of one-bit strings, and strings probably doesn't feel like a very Pythonic way to get started.
More specifically, converting an int to a sequence of bits by using bin(i)[2:] is pretty hacky. It may be worth doing anyway (e.g., because it's more concise or more efficient than doing it numerically), but even if it is, it would be better to wrap it in a function named for what it does (and maybe even add a comment explaining why you did it that way).
You've also got unnecessarily complexifying code in there. For example, to do the carry, you do this:
one = '0000000000000001'
result = bin(int(result,2) + int(one,2))[2::]
But you know that int(one,2) is just the number 1, unless you've screwed up, so why not just use 1, which is shorter, more readable and obvious, and removes any chance of screwing up?
And you're not following PEP 8 style.
So, sticking with your basic design of "use a string for the bits, use only the basic string operations that are unchanged from Python 1.5 through 3.5 instead of format, and do the basic addition on integers instead of on the bits", I'd write it something like this:
def to_bits(n):
    return bin(n)[2:]

def from_bits(n):
    return int(n, 2)

def pad_bits(b, length=16):
    return ("0" * length + b)[-length:]

def add_bits(num1, num2):
    result = to_bits(from_bits(num1) + from_bits(num2))
    if len(result) <= 16:  # no carry
        return pad_bits(result)
    return pad_bits(to_bits(from_bits(result[1:]) + 1))
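For reference (not part of the original answer), these refactored helpers reproduce the example from the question:
print(add_bits("1010001111101001", "1000000110110101"))  # -> 0010010110011111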
But an even better solution would be to abstract out the string representation completely. Build a class that knows how to act like an integer, but also knows how to act like a sequence of bits. Or just find one on PyPI. Then your code becomes trivial. For example:
from bitstring import BitArray
def add_bits(n1, n2):
    """
    Given two BitArray values of the same length, return a BitArray
    of the same length that's the one's complement addition.
    """
    result = n1.uint + n2.uint
    if result >= (1 << n1.length):
        result = (result + 1) % (1 << n1.length)
    return BitArray(uint=result, length=n1.length)
I'm not sure that bitstring is actually the best module for what you're doing. There are a half-dozen different bit-manipulating libraries on PyPI, all of which have different interfaces and different strengths and weaknesses; I just picked the first one that came up in a search and slapped together an implementation using it.

Importing big tecplot block files in python as fast as possible

I want to import some ASCII files in Python (they come from Tecplot, software for CFD post-processing).
Rules for those files are (at least, for those that I need to import):
The file is divided into several sections.
Each section has two lines as a header, like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variables given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continuous block of values.
There is a fixed number of values per row (6).
When a variable ends, the next one starts on a new line.
Variables are "IJK-ordered data": the I-index varies the fastest, the J-index the next fastest, and the K-index the slowest. The I-index should be the inner loop, the K-index should be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #end of first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new to Python and I have written code to import the data into a dictionary, storing the variables as 3D numpy.arrays. Those files can be very big (up to GBs). How can I make this code faster? (Or, more generally, how can I import such files as fast as possible?)
import re
from numpy import zeros, array, prod

def vectorr(I, J, K):
    """function"""
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect

a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
    strVARIABLES = re.findall('VARIABLES', filelist[count])
    variables = re.findall(r'"(.*?)"', filelist[count])
    countzone = countzone + 1
    data[countzone] = {key: [] for key in variables}
    count = count + 1
    strI = re.findall('I=....', filelist[count])
    strI = re.findall('\d+', strI[0])
    I = int(strI[0])
    ##
    strJ = re.findall('J=....', filelist[count])
    strJ = re.findall('\d+', strJ[0])
    J = int(strJ[0])
    ##
    strK = re.findall('K=....', filelist[count])
    strK = re.findall('\d+', strK[0])
    K = int(strK[0])
    data[countzone]['indmax'] = array([I, J, K])
    pr = prod(data[countzone]['indmax'])
    lin = pr // NumberCol
    if pr % NumberCol != 0:
        lin = lin + 1
    vect = vectorr(I, J, K)
    for key in variables:
        init = zeros((I, J, K))
        for ii in range(0, lin):
            count = count + 1
            temp = map(float, filelist[count].split())
            for iii in range(0, len(temp)):
                init.itemset(tuple(vect[ii*6+iii]), temp[iii])
        data[countzone][key] = init
    count = count + 1
P.S. In Python only, no Cython or other languages.
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring

# replace the nested for-loop with:
count += 1
for key in variables:
    str_vector = ' '.join(filelist[count:count + lin])
    ar = fromstring(str_vector, sep=' ')
    ar = ar.reshape((I, J, K), order='F')
    data[countzone][key] = ar
    count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset and probably loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail.
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case this is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the fromstring method that I used scales rather badly with the length of the string. E.g. from a certain string length it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).
If you use regular expressions here, there's two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)" and grabbing the part matched inside the parentheses using regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step (see the sketch below).
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...
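A hedged sketch of the single-regex idea (pattern and variable names here are illustrative, not from the original answer):
import re

zone_re = re.compile(r'I=\s*(\d+),\s*J=\s*(\d+),\s*K=\s*(\d+)')

line = 'ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK'
m = zone_re.search(line)
if m:
    I, J, K = (int(g) for g in m.groups())
    print(I, J, K)  # 29 17 25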
