Constructing an object that's greater than any string - python

In Python 3, I have a list of strings and would find it useful to be able to append a sentinel that would compare greater than all elements in the list.
Is there a straightforward way to construct such an object?
I can define a class, possibly subclassing str, but it feels like there ought to be a simpler way.
For this to be useful in simplifying my algorithm, I need to do this ahead of time, before I know what the strings contained in the list are going to be (and so it can't be a function of those strings).

This is kind of a naïve answer, but when you're dealing with numbers and need a sentinel value for comparison purposes, it's not uncommon to use the largest (or smallest) number that a specific number type can hold.
Python strings are compared lexicographically, so to create a "max string", you'd simply need to create a long string of the "max char":
# 1114111 (0x10FFFF) is the highest code point that chr() accepts
MAX_CHAR = chr(1114111)
# One million is entirely arbitrary here.
# It should ideally be 1 + the length of the longest string you'll compare against.
MAX_STRING = MAX_CHAR * int(1e6)
Unless there are weird corner cases that I'm not aware of, MAX_STRING should now compare greater than any other string (other than itself), provided that it's long enough.
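As a quick hedged illustration (the list contents here are made up), the sentinel can simply be appended to a list of ordinary strings and it will sort to the end:
words = ["pear", "apple", "zebra"]
words.append(MAX_STRING)
words.sort()
assert words[-1] is MAX_STRING  # the sentinel sorts after every ordinary string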

Related

Assign a unique value to a string s based on its lexicographic order

I want to come up with a function that assigns a unique value to a string based on its lexicographic order. For instance, if my function is called get_key(s), it should take a string s as input and return a unique integer that allows me to compare two strings, in O(1) time, by comparing the unique integers I get for them.
Some code for clarity:
get_key('aaa')
#Returns some integer
get_key('b')
#Returns another integer > output of get_key('aaa') since 'b' > 'aaa'
Any help would be highly appreciated.
Note: Cannot use python built in function id()
It's impossible.
Why? Between any two distinct strings there are infinitely many other strings (between 'a' and 'b' there are 'aa', 'ab', 'ac', and so on), but between any two integers there are only finitely many integers. So no matter what numbers you return, I can always find a new string whose key would have to fall between two keys that have no room left between them.
You would need an unlimited number of values between any two keys, because there is an infinite number of strings between any two strings.
If I understand your problem correctly, one idea that comes to mind is to convert the input to hex and then from hex to int. I believe this would address the problem; however, I suspect it is impossible to solve in O(1). The solution I provided (and every possible solution I can think of) needs O(n), since there is no bound on the input length and the function has to look at every character of the input.
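For what it's worth, a minimal sketch of the conversion described above (it walks the whole string, so it is O(n), not O(1)):
def get_key(s):
    # hex-encode the string's bytes, then read that hex back as an integer
    return int(s.encode("utf-8").hex(), 16)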

Why is this certain function a bad hashing function?

I would like to know why the following code snippet is a bad hashing function.
def computeHash(self, s):
    h = 0
    for ch in s:
        h = h + ord(ch)  # ord gives the ASCII value of character ch
    return h % HASH_TABLE_SIZE
If I dramatically increase the size of the hash table will this make up for the inadequacies of the hash function?
It's a bad hashing function because strings are order-sensitive, but the hash is not; "ab" and "ba" would hash identically, and for longer strings the collisions just get worse; all of "abc", "acb", "bac", "bca", "cab", "cba" would share the same hash.
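As a quick demonstration of that order-insensitivity, here is a standalone version of the hash above; the table size is a made-up value just for the demo:
HASH_TABLE_SIZE = 1024  # hypothetical size, only for this demo

def compute_hash(s):
    h = 0
    for ch in s:
        h = h + ord(ch)
    return h % HASH_TABLE_SIZE

# every permutation of 'abc' lands in the same bucket
print({compute_hash(p) for p in ("abc", "acb", "bac", "bca", "cab", "cba")})  # {294}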
For an order-insensitive data structure (e.g. frozenset) a strategy like this isn't as bad, but it's still too easy to produce collisions, either by reducing one ordinal by one and increasing another by one, or by simply putting a NUL character in there; frozenset({'\0', 'a'}) would hash identically to just frozenset({'a'}). Typically this is countered by incorporating the length of the collection into the hash in some manner.
Secure hashes (e.g. Python uses SipHash) are the best solution: a randomized seed, combined with an algorithm that conceals the seed while incorporating it into the resulting hash, makes it not only harder to accidentally create collisions (simple improvements, like hashing the index as well as the ordinal, would also help there by making the hash order- and length-sensitive), but also makes it nigh impossible for malicious data sources to intentionally cause collisions (which the simple improvements don't handle at all).
The other problem with this hash is that it doesn't distribute the bits evenly; short strings mean only low bits are set in the hash. This means that increasing the table size is completely pointless when the strings are all relatively short: if all the strings are ASCII and 100 characters or less, the largest possible raw hash value is 12700, so if you need to store a million such strings, you'll average nearly 79 collisions per bucket in the first 12,700 buckets (in practice, much more than that for common buckets; there will be humps with many more collisions in the middling values, fewer collisions near the beginning, and almost none at the end, since something like '\x7f' * 100 is the only way to reach that maximum value), and no matter how many more buckets you have, they'll never be used. Technically, an open-addressing based hash table might use them, but it would be largely equivalent to separate chaining per bucket, since all indices past 12700 would only be found by the open-addressing "bounce around" algorithm; if that's badly designed, e.g. linear scanning, you might end up linearly scanning the whole table even if no entries actually collide for your particular hash (your bucket was filled by chaining, and it has to scan linearly until it finds an empty slot or the matching element).
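As a hedged sketch of the "hash the index as well as the ordinal" idea mentioned above, here is a generic polynomial hash that is order- and length-sensitive; it is only an illustration, not the algorithm CPython actually uses, and it still offers no protection against deliberately crafted collisions:
HASH_TABLE_SIZE = 1024  # hypothetical size, only for this sketch

def compute_hash(s):
    h = 0
    for ch in s:
        # multiplying by a prime before adding each character makes the result
        # depend on position and length, so 'ab' != 'ba' and 'A ' != 'a'
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF  # keep it to 32 bits
    return h % HASH_TABLE_SIZE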
Bad hashing function:
1. 'AC' and 'BB' give the same result, and for a big string there can be many permutations whose ASCII values sum to the same total.
2. Even strings of different lengths can give the same hash: 'A ' (A + space) hashes the same as 'a'.
3. Even rearranging the characters in a string gives the same hash.
This is a bad hashing function. One big problem: re-arranging any or all characters returns the exact same hash.
Increasing HASH_TABLE_SIZE does nothing to adjust for this.

Why can't a long integer be converted to an integer when inside a Python list?

I have seen many posts here, which gives ways of removing the trailing L from a list of python long integers.
The most commonly proposed way is
print map(int,list)
However this seems not to work always.
Example---
A=[4198400644L, 3764083286L, 2895448686L, 1158179486, 2316359001L]
print map(int,A)
The above code gives the same result as the input.
I have noticed that the map method doesn't work whenever the number preceding the L is large, and only when the numbers are in a list; e.g. applying int() to 4198400644L does give the number without the L when it's outside a list.
Why is this occurring and more importantly, how to overcome this?
I think I really need to remove this L, because this is a small part of a program where I need to multiply some integer from this list A with some integer from a list of non-long integers, and this L is getting in the way. I could of course convert the long integers to strings, remove the L, and convert them back to integers. But is there another way?
I am still using the now outdated Python 2.7.
Python 2 has two different kinds of integers. The int type is used for values that fit into a C long, typically 32 bits, i.e. -0x80000000 to 0x7fffffff. The long type is for anything outside that range, as all your large examples are; calling int() on such a value simply gives you back a long, which is why map(int, A) appears to do nothing. The difference is marked with the L appended to the number, but only when you use repr(n), as is done automatically for each element when the number is part of a list; str(n) and arithmetic are unaffected.
In Python 3 they realized that this difference was arbitrary and unnecessary. Any int can be as large as you want, and long is no longer a type. You won't see repr put the trailing L on any numbers no matter how large, and adding it yourself on a constant is a syntax error.
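A short Python 2.7 session, with hedged comments, showing that the L is purely a display artifact and does not affect arithmetic (on a build where sys.maxint is 2**31 - 1, the large values stay long no matter what you do):
# Python 2.7
A = [4198400644L, 3764083286L, 2895448686L, 1158179486, 2316359001L]
print map(int, A)  # values above sys.maxint stay long, so the list repr keeps the L
print A[0]         # str() of a long drops the L: 4198400644
print A[0] * 3     # arithmetic works regardless of the L: 12595201932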

How to change numbers in a number

I'm currently trying to learn python.
Suppose there was a number n = 12345.
How would one go about changing each digit in turn, iterating the first spot over (1-9) and every other spot over (0-9)?
I'm still learning Python, so I apologize for any syntax errors that might follow.
Here's my last few attempts/idea for skeleton of the code.
define the function
turn n into string
start with a for loop over the digits: for the first spot use range(1, 10)
else range(10) for the other spots
Basically, how does one fix one digit while changing the others?
Please don't give solution just hints I enjoy the thinking process.
For example, if n = 29 the program could check
19,39,49,59,69,79,89,99
and
21,22,23,24,25,26,27,28
Although you are new, the process is far easier than you might think.
You want to make that change to every digit of the number (let's say n = 7382). But you cannot iterate over a number, let alone change specific digits of it in place: you can only iterate over iterables (like lists). A string is an iterable, so if you work out how to do an int-to-str conversion, you can iterate over every digit and then print the new number.
But how do you change only the digit you are currently on? Again, the way is to reuse that conversion (saving it into a variable before the loop keeps things DRY) and to take a substring containing every digit except the one you are on. There are two ways of doing this:
You search for that specific value and get its index (bad).
You enumerate the loop (good).
Why is 2 good? Because it gives you the real position of the digit being changed (think about looking up the index of 7 in 75487 while changing that digit: it would not work well when you reach the last 7). Search for a way to iterate over the items in a loop while also getting each item's index.
The easiest way to get a substring in Python is slicing. You slice twice: once to get all the digits before the current one, and once to get all the digits after it. Then you just join those two strings with the new digit in between, and you're done.
I hope I didn't make it too easy for you, but that's hard for a task as simple as this.
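Since the asker wants hints rather than a full solution, here is just the slicing building block described above, with deliberately incomplete, hypothetical names:
n = 7382
s = str(n)                  # the int-to-str conversion, saved once before the loop
for i, ch in enumerate(s):  # enumerate yields each digit together with its index
    before = s[:i]          # every digit before the one being changed
    after = s[i + 1:]       # every digit after it
    # ... choose a replacement digit here and build before + new_digit + after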

What could affect Python string comparison performance for strings over 64 characters?

I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest that comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the minimum of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
              "from __main__ import COUNT, v1, v2", number=50)
Further note: I've made two extra tests: comparing strings with is instead of == suppresses the problem completely, and the performance is about 210 ms per 1M comparisons.
Since interning has been mentioned, I made sure to add a whitespace character after each string, which should prevent interning; that doesn't change anything. Is it something other than interning, then?
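For anyone trying to reproduce this, a hedged sketch of how v1 and v2 might be built; the alphabet and the length n are arbitrary choices here, not taken from the original experiment:
import random
import string
import timeit

COUNT = 1000000
n = 64  # length of each random string

v1 = [''.join(random.choice(string.ascii_letters) for _ in range(n)) for _ in range(COUNT)]
v2 = [''.join(random.choice(string.ascii_letters) for _ in range(n)) for _ in range(COUNT)]

print(timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
                    "from __main__ import COUNT, v1, v2", number=50))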
Python can 'intern' short strings; it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it first tests whether the two are the same pointer (e.g. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:case Py_LE:case Py_GE:
        result = Py_True;
        goto out;
    // ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
Interning normally only takes place for identifiers (function names, arguments, attributes, etc.) however, not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, then equal first characters, then uses the memcmp() function on the internal C strings. If your strings are not interned and are not otherwise the same object, all the remaining speed characteristics come down to the memcmp() function.
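As a hedged illustration of that identity fast path (sys.intern is the Python 3 spelling; in Python 2 it was the builtin intern, and whether two runtime-built strings share an object is an implementation detail):
import sys

part = "comparison"
a = part + " test"  # built at runtime, so typically not interned automatically
b = part + " test"
print(a is b)       # usually False: two distinct objects, so == falls back to memcmp()

a = sys.intern(a)
b = sys.intern(b)
print(a is b)       # True: both names now refer to the same cached object
print(a == b)       # equality can now short-circuit on the pointer comparison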
I am just making wild guesses, but you asked "what might" rather than "what does", so here are some possibilities:
The CPU cache line size is 64 bytes and longer strings cause a cache miss.
Python might store strings of up to 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and use very fast SSE2 vector instructions to match two strings.
