Shorten long small-alphabet string using larger alphabet - python

I have a set of ~100 long (between 120 and 150 characters) strings encoded using a 20-letter alphabet (the natural amino acid alphabet). I'm using them in database entries, but they're cumbersome. I'd like to shorten them (not compress them, because I don't care about memory size) to make them easier to:
Visually compare
Copy/Paste
Manually enter
I was hoping a feasible way to shorten them would be to convert the string to a larger alphabet: specifically, the set of single digits plus the lower- and upper-case letters.
For example:
# given some long string as input
shorten("ACTRYP...TW")
# returns something shorter like "a3A4n"
Possible approaches
From my elementary understanding of compression, this could be accomplished naively by making a lookup dictionary which maps certain repeating sequences to elements of the larger alphabet.
Related Question
This question seemed to point in a similar direction, but it was working with the DNA alphabet and seemed to be seeking actual compression.

As suggested by #thethiny, a combination of hashing and base32 encoding can accomplish the desired shortening:
import base64
import hashlib
kinda_long = "ELYWPSRVESGTLVGYQYGRAITGQGKTSGGGSGWLGGGLRLSALELSGKTFSCDQAYYQVLSLNRGVICFLKVSTSVWSYESAAGFTMSGSAQYDYNVSGKANRSDMPTAFDVSGA"
shorter = base64.b32encode(hashlib.sha256(kinda_long.encode()).digest()).decode().strip("=")
My original question mentioned using the ASCII letters and digits. That would be a base 62 encoding, and various libraries exist for this.
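If you'd rather not add a dependency, here is a minimal base-62 sketch over the SHA-256 digest (the alphabet order and the 10-character truncation are arbitrary choices, not something any library mandates):

import hashlib
import string

# 62 symbols: digits, then lower-case, then upper-case letters
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def shorten(seq, length=10):
    # hash the sequence, then re-encode the digest's integer value in base 62;
    # truncating trades collision resistance for brevity
    n = int.from_bytes(hashlib.sha256(seq.encode()).digest(), 'big')
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return ''.join(reversed(out))[:length]

print(shorten(kinda_long))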


Why is this function a bad hashing function?

I would like to know why the following code snippet is a bad hashing function.
def computeHash(self, s):
    h = 0
    for ch in s:
        h = h + ord(ch)  # ord gives the ASCII value of character ch
    return h % HASH_TABLE_SIZE
If I dramatically increase the size of the hash table will this make up for the inadequacies of the hash function?
It's a bad hashing function because strings are order-sensitive, but the hash is not; "ab" and "ba" would hash identically, and for longer strings the collisions just get worse; all of "abc", "acb", "bac", "bca", "cab", "cba" would share the same hash.
For an order-insensitive data structure (e.g. frozenset) a strategy like this isn't as bad, but it's still too easy to produce collisions: simply reduce one ordinal by one and increase another by one, or put a NUL character in there; frozenset({'\0', 'a'}) would hash identically to just frozenset({'a'}). Typically this is countered by incorporating the length of the collection into the hash in some manner.
Secure hashes (e.g. Python uses SipHash) are the best solution; a randomized seed combined with an algorithm that conceals the seed while incorporating it into the resulting hash makes it not only harder to accidentally create collisions (which simple improvements like hashing the index as well as the ordinal would help with, to make the hash order and length sensitive), but also makes it nigh impossible for malicious data sources to intentionally cause collisions (which the simple solutions don't handle at all).
The other problem with this hash is that it doesn't distribute the bits evenly; short strings mean only low bits are set in the hash. This means that increasing the table size is completely pointless when the strings are all relatively short. If all the strings are ASCII and 100 characters or less, the largest possible raw hash value is 12700; if you need to store a million such strings, you'll average nearly 79 collisions per bucket in the first 12,700 buckets (in practice, much more than that for common buckets; there will be humps with many more collisions in the middling values, fewer collisions near the beginning, and almost none at the end, since something like '\x7f' * 100 is the only way to reach that maximum value), and no matter how many more buckets you have, they'll never be used.
Technically, an open-addressing hash table might use them, but it would be largely equivalent to separate chaining per bucket, since all indices past 12700 would only be reached by the open-addressing "bounce around" algorithm; if that's badly designed, e.g. linear scanning, you might end up linearly scanning the whole table even if no entries actually collide for your particular hash (your bucket was filled by chaining, and the lookup has to scan linearly until it finds an empty slot or the matching element).
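For illustration, here is a minimal sketch of the "hash the position as well as the ordinal" idea mentioned above (a simple polynomial rolling hash, not a secure hash; the multiplier 31 is an arbitrary conventional choice):

def compute_hash(s, table_size):
    # multiplying by a constant each step makes the result depend on
    # character order and string length, unlike the plain sum of ordinals
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % table_size
    return h

print(compute_hash("ab", 1024) == compute_hash("ba", 1024))  # False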
Bad hashing function:
1. "AC" and "BB" give the same result, and for long strings there are many permutations whose ASCII values sum to the same total.
2. Even strings of different lengths can give the same hash: 'A ' (A plus a space) hashes the same as 'a'.
3. Any rearrangement of the characters in a string gives the same hash.
This is a bad hashing function. One big problem: re-arranging any or all characters returns the exact same hash.
Increasing the TABLE_SIZE does nothing to adjust for this.

Came up with serialization algorithm for strings, but only works for words that are < 10 letters long.

So I have a question about a serialization algorithm I just came up with; I wanted to know if it already exists and whether there's a better version out there.
We know normal algorithms use a delimiter to join the words in a list, but then you have to scan every word for occurrences of the delimiter and escape them, or accept a serialization that isn't robust. I thought a more intuitive approach would be to exploit the fact that in higher-level languages like Python len() is O(1), and prepend each word with its length. For example, the code attached below.
Wouldn't this be faster because instead of going through every letter of every word we just go through every word? And on deserialization we don't have to look through every character to find the delimiter; we can skip directly to the end of each word.
The only problem I see is that double digit sizes would cause problems, but I'm sure there's a way around that I haven't found yet.
It was suggested to me that protocol buffers are similar to this idea, but I haven't understood why yet.
def serialize(list_o_words):
    return ''.join(str(len(word)) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index])
        next_index = index + word_length + 1
        deserialized_list.append(serialized_list_o_words[index+1:next_index])
        index = next_index
    return deserialized_list

serialized_list = "some,comma,separated,text".split(",")
print(serialize(serialized_list))
print(deserialize(serialize(serialized_list)) == serialized_list)
Essentially, I want to know how I can handle double digit lengths.
There are many variations on length-prefixed strings, but the key bits come down to how you store the length.
You're deserializing the lengths as a single-character ASCII number, which means you can only handle lengths from 0 to 9. (You don't actually test that on the serialize side, so you can generate garbage, but let's forget that.)
So, the obvious option is to use 2 characters instead of 1. Let's add in a bit of error handling while we're at it; the code is still pretty easy:
def _len(word):
    s = format(len(word), '02')
    if len(s) != 2:
        raise ValueError(f"Can't serialize {s}; it's too long")
    return s

def serialize(list_o_words):
    return ''.join(_len(word) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index + 1 < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index:index+2])
        next_index = index + word_length + 2
        deserialized_list.append(serialized_list_o_words[index+2:next_index])
        index = next_index
    return deserialized_list
But now you can't handle strings >99 characters.
Of course you can keep adding more digits for longer strings, but if you think "I'm never going to need a 100,000-character string"… you are going to need it, and then you'll have a zillion old files in the 5-digit format that aren't compatible with the new 6-digit format.
Also, this wastes a lot of bytes. If you're using 5-digit lengths, "s" encodes as 00001s, which is 6x as big as the original value.
You can stretch things a lot farther by using binary lengths instead of ASCII. Now, with two bytes, we can handle lengths up to 65535 instead of just 99. And if you go to four or eight bytes, that might actually be big enough for all your strings ever. Of course this only works if you're storing bytes rather than Unicode strings, but that's fine; you probably needed to encode your strings for persistence anyway. So:
import struct

def _len(word):
    # struct.pack already raises an exception for lengths > 65535
    return struct.pack('>H', len(word))

def serialize(list_o_words):
    utfs8 = (word.encode() for word in list_o_words)
    return b''.join(_len(utf8) + utf8 for utf8 in utfs8)
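For symmetry, a deserializer for that 2-byte binary format could look something like this (a sketch; it assumes well-formed input):

def deserialize(data):
    index = 0
    words = []
    while index + 2 <= len(data):
        # read the 2-byte big-endian length, then that many bytes of UTF-8
        (word_length,) = struct.unpack('>H', data[index:index+2])
        next_index = index + 2 + word_length
        words.append(data[index+2:next_index].decode())
        index = next_index
    return words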
Of course this isn't very human-readable or -editable; you need to be comfortable in a hex editor to replace a string in a file this way.
Another option is to delimit the lengths. This may sound like a step backward—but it still gives us all the benefits of knowing the length in advance. Sure, you have to "read until comma", but you don't have to worry about escaped or quoted commas the way you do with CSV files, and if you're worried about performance, it's going to be much faster to read a buffer of 8K at a time and chunk through it with some kind of C loop (whether that's slicing, or str.find, barely matters by comparison) than to actually read either until comma or just two bytes.
This also has the benefit of solving the sync problem. With delimited values, if you come in mid-stream, or get out of sync because of an error, it's no big deal; just read until the next unescaped delimiter and worst-case you missed a few values. With length-prefixed values, if you're out of sync, you're reading arbitrary characters and treating them as a length, which just throws you even more out of sync. The netstring format is a minor variation on this idea, with a tiny bit more redundancy to make sync problems easier to detect/recover from.
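A netstring-flavoured sketch of that idea, using ':' as the delimiter after a decimal length (the delimiter choice here is arbitrary):

def serialize_delimited(words):
    # decimal length, ':' delimiter, then the word itself
    return ''.join(f"{len(word)}:{word}" for word in words)

def deserialize_delimited(data):
    index = 0
    words = []
    while index < len(data):
        colon = data.index(':', index)
        word_length = int(data[index:colon])
        words.append(data[colon + 1:colon + 1 + word_length])
        index = colon + 1 + word_length
    return words

print(deserialize_delimited(serialize_delimited(["some", "comma", "separated", "text"])))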
Going back to binary lengths, there are all kinds of clever tricks for encoding variable-length numbers. Here's one idea, in pseudocode:
if the current byte is < hex 0x80 (128):
    that's the length
else:
    add the low 7 bits of the current byte
    plus 128 times (recursively process the next byte)
Now you can handle short strings with just 1 byte of length, but if a 5-billion-character string comes along, you can handle that too.
Of course this is even less human-readable than fixed binary lengths.
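In Python, that variable-length scheme might be sketched like this (little-endian base-128, with the high bit as the continuation flag):

def encode_length(n):
    # 7 bits per byte; the high bit is set on every byte except the last
    out = bytearray()
    while True:
        low, n = n & 0x7f, n >> 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)

def decode_length(data, index=0):
    # returns (length, index of the first byte after the length)
    shift = result = 0
    while True:
        byte = data[index]
        index += 1
        result |= (byte & 0x7f) << shift
        if byte < 0x80:
            return result, index
        shift += 7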
And finally, if you ever want to be able to store other kinds of values, not just strings, you probably want a format that uses a "type code". For example, use I for 32-bit int, f for 64-bit float, D for datetime.datetime, etc. Then you can use s for strings <256 characters with a 1-byte length, S for strings <65536 characters with a 2-byte length, z for string <4B characters with a 4-byte length, and Z for unlimited strings with a complicated variable-int length (or maybe null-terminated strings, or maybe an 8-byte length is close enough to unlimited—after all, nobody's ever going to want more than 640KB in a computer…).
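A tiny sketch of the type-code idea for strings only (the codes 's' and 'S' here are just the examples from above, not any established format):

import struct

def pack_string(s):
    # 's' = short string with a 1-byte length, 'S' = longer string with a 2-byte length
    data = s.encode()
    if len(data) < 256:
        return b's' + struct.pack('>B', len(data)) + data
    if len(data) < 65536:
        return b'S' + struct.pack('>H', len(data)) + data
    raise ValueError("string too long for this sketch")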

position-independent comparison of Hangul characters

I am writing a python3 program that has to handle text in various writing systems, including Hangul (Korean) and I have problems with the comparison of the same character in different positions.
For those unfamiliar with Hangul (not that I know much about it, either), this script has the almost unique feature of combining the letters of a syllable into square blocks. For example 'ㅎ' is pronounced [h] and 'ㅏ' is pronounced [a], the syllable 'hah' is written '핳' (in case your system can't render Hangul: the first h is displayed in the top-left corner, the a is in the top-right corner and the second h is under them in the middle). Unicode handles this by having two different entries for each consonant, depending on whether it appears in the onset or the coda of a syllable. For example, the previous syllable is encoded as '\u1112\u1161\u11c2'.
My code needs to compare two chars, considering them as equal if they only differ for their positions. This is not the case with simple comparison, even applying Unicode normalizations. Is there a way to do it?
You will need to use a tailored version of the Unicode Collation Algorithm (UCA) that assigns equal weights to identical syllables. The UCA technical report describes the general problem for sorting Hangul.
Luckily, the ICU library has a set of collation rules that does exactly this: ko-u-co-search – Korean (General-Purpose Search), which you can try out on their demo page. To use this in Python, you will either need to use a library like PyICU, or one that implements the UCA and supports the ICU rule file format (or lets you write your own rules).
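With PyICU, setting up such a collator might look roughly like this (the locale spelling and the strength setting are assumptions; check them against your ICU version and the demo page):

import icu  # PyICU

# request the Korean locale with the "search" collation tailoring
collator = icu.Collator.createInstance(icu.Locale('ko@collation=search'))
collator.setStrength(icu.Collator.PRIMARY)

# compare() returns 0 when the collator treats the two strings as equivalent
print(collator.compare('\u1112', '\u11c2'))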
I'm the developer for Python jamo (the Hangul letters are called jamo). An easy way to do this would be to cast all jamo code points to their respective Hangul compatibility jamo (HCJ) code points. HCJ is the display form of jamo characters, so initial and final forms of consonants are the same code point.
For example:
>>> import jamo
>>> initial, vowel, final = jamo.j2hcj('\u1112\u1161\u11c2')
>>> initial == final
True
The way this is done internally is with a lookup table copied from the Unicode specifications.
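The idea of that table, in miniature (the real mapping in the library covers every jamo code point; only the two hieuh forms are shown here):

# sketch: map positional jamo to their Hangul compatibility jamo forms
HCJ_TABLE = {
    '\u1112': '\u314e',  # choseong (initial) hieuh -> compatibility hieuh
    '\u11c2': '\u314e',  # jongseong (final) hieuh -> compatibility hieuh
    # ... the full table covers the rest of the jamo blocks
}

def same_letter(a, b):
    return HCJ_TABLE.get(a, a) == HCJ_TABLE.get(b, b)

print(same_letter('\u1112', '\u11c2'))  # True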

Python: Custom sort a list of lists

I know this has been asked before, but I have not been able to find a solution.
I'm trying to alphabetize a list of lists according to a custom alphabet.
The alphabet is a representation of the Burmese script as used by Sgaw Karen in plain ASCII. The Burmese script is an alphasyllabary—a few dozen onsets, a handful of medial diacritics, and a few dozen rhymes that can be combined in thousands of different ways, each of which is a single "character" representing one syllable. The map.txt file has these syllables, listed in (Karen/Burmese) alphabetical order, but converted in some unknown way into ASCII symbols, so the first character is u>m;.Rf rather than က or [ka̰]. For example:
u>m;.Rf ug>m;.Rf uH>m;.Rf uX>m;.Rf uk>m;.Rf ul>m;.Rf uh>m;.Rf uJ>m;.Rf ud>m;.Rf uD>m;.Rf u->m;.Rf uj>m;.Rf us>m;.Rf uV>m;.Rf uG>m;.Rf uU>m;.Rf uS>m;.Rf u+>m;.Rf uO>m;.Rf uF>m;.Rf
c>m;.Rf cg>m;.Rf cH>m;.Rf cX>m;.Rf ck>m;.Rf cl>m;.Rf ch>m;.Rf cJ>m;.Rf cd>m;.Rf cD>m;.Rf c->m;.Rf cj>m;.Rf cs>m;.Rf cV>m;.Rf cG>m;.Rf cU>m;.Rf cS>m;.Rf c+>m;.Rf cO>m;.Rf cF>m;.Rf
Each list in the list of lists has, as its first element, a word of Sgaw Karen converted into ASCII symbols in the same way. For example:
[['u&X>', 'n', 'yard'], ['vk.', 'n', 'yarn'], ['w>ouDxD.', 'n', 'yawn'], ['w>wuDxD.', 'n', 'yawn']]
This is what I have so far:
def alphabetize(word_list):
    alphabet = ''.join([line.rstrip() for line in open('map.txt', 'rb')])
    word_list = sorted(word_list, key=lambda word: [alphabet.index(c) for c in word[0]])
    return word_list
I would like to alphabetize word_list by the first element of each list (eg. 'u&X>', 'vk.'), according to the pattern in alphabet.
My code's not working yet and I'm struggling to understand the sorted command with lambda and the for loop.
First, if you're trying to look up the entire word[0] in alphabet, rather than each character individually, you shouldn't be looping over the characters of word[0]. Just use alphabet.index(word[0]) directly.
From your comments, it sounds like you're trying to look up each transliterated-Burmese-script character in word[0]. That isn't possible unless you can write an algorithm to split a word up into those characters. Splitting it up into the ASCII bytes of the transliteration doesn't help at all.
Second, you probably shouldn't be using index here. When you think you need to use index or similar functions, 90% of the time, that means you're using the wrong data structure. What you want here is a mapping (presumably why it's called map.txt), like a dict, keyed by words, not a list of words that you have to keep explicitly searching. Then, looking up a word in that dictionary is trivial. (It's also a whole lot more efficient, but the fact that it's easy to read and understand can be even more important.)
Finally, I suspect that your map.txt is supposed to be read as a whitespace-separated list of transliterated characters, and what you want to find is the index into that list for any given word.
So, putting it all together, something like this:
with open('map.txt') as f:  # text mode, so the keys are str like the words being sorted
    mapping = {word: index for index, word in enumerate(f.read().split())}

word_list = sorted(word_list, key=lambda word: mapping[word[0]])
But, again, that's only going to work for one-syllable words, because until you can figure out how to split a word up into the units that should be alphabetized (in this case, the symbols), there is no way to make it work for multi-syllable words.
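If the transliterated syllables can be split off greedily (longest match first, which is an assumption about how this transliteration works), one way to extend the sort to multi-syllable words would be:

def split_into_syllables(word, syllables):
    # greedy longest-match split; raises if the word contains anything
    # that isn't in map.txt
    units = []
    while word:
        match = max((s for s in syllables if word.startswith(s)), key=len, default=None)
        if match is None:
            raise ValueError(f"cannot split {word!r}")
        units.append(match)
        word = word[len(match):]
    return units

word_list = sorted(word_list,
                   key=lambda entry: [mapping[s] for s in split_into_syllables(entry[0], mapping)])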
And once you've written the code that does that, I'll bet it would be pretty easy to just convert everything to proper Unicode representations of the Burmese script. Each syllable still takes 1-4 code points in Unicode, but that's fine: the standard Unicode collation algorithm (available through libraries such as PyICU or pyuca) already knows how to alphabetize things properly for that script, so you don't have to write it yourself.
Or, even better, unless this is some weird transliteration that you or your teacher invented, there's probably already code to translate between this format and Unicode, which means you shouldn't even have to write anything yourself.

Fastest sorted string concatenation

What is the fastest and most efficient way to do this:
word = "dinosaur"
newWord = word[0] + ''.join(sorted(word[1:]))
output:
"dainorsu"
Thoughts:
Would something as converting the word to an array increase performance? I read somewhere that arrays have less overhead due to them being the same data type compared to a string.
Basically I want to sort everything after the first character in the string as fast as possible. If memory is saved, that would also be a plus. The problem I am trying to solve needs to run within a certain time limit, so I am trying to be as fast as possible. I don't know much about Python efficiency under the hood, so if you explain why this method is fast as well that would be AWESOME!
Here's how I'd approach it.
Create an array of size 26 (assuming that only lowercase letters are used). Then iterate through each character in the string. For the 1st letter of the alphabet, increment the 1st index of the array; for the 2nd, increment the 2nd. Once you've scanned the whole string (which is of complexity O(n)) you will be able to reconstruct it afterwards by repeating the 'a' array[0] times, 'b' array[1] times, and so on.
This approach would beat fast sorting algorithms like quicksort or partition sort, which have complexity O(n log n).
EDIT: Finally you'd also want to reassemble the final string efficiently. The string concatenation provided by some languages using the + operator can be inefficient, so consider using an efficient string builder class.
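A minimal Python sketch of that counting approach, assuming only lowercase ASCII letters:

def sort_after_first(word):
    # count occurrences of each letter after the first character
    counts = [0] * 26
    for ch in word[1:]:
        counts[ord(ch) - ord('a')] += 1
    # rebuild the tail in alphabetical order from the counts
    tail = ''.join(chr(ord('a') + i) * counts[i] for i in range(26))
    return word[0] + tail

print(sort_after_first("dinosaur"))  # dainorsu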
