Python: Custom sort a list of lists

I know this has been asked before, but I have not been able to find a solution.
I'm trying to alphabetize a list of lists according to a custom alphabet.
The alphabet is a representation of the Burmese script as used by Sgaw Karen in plain ASCII. The Burmese script is an alphasyllabary—a few dozen onsets, a handful of medial diacritics, and a few dozen rhymes that can be combined in thousands of different ways, each of which is a single "character" representing one syllable. The map.txt file has these syllables, listed in (Karen/Burmese) alphabetical order, but converted in some unknown way into ASCII symbols, so the first character is u>m;.Rf rather than က or [ka̰]. For example:
u>m;.Rf ug>m;.Rf uH>m;.Rf uX>m;.Rf uk>m;.Rf ul>m;.Rf uh>m;.Rf uJ>m;.Rf ud>m;.Rf uD>m;.Rf u->m;.Rf uj>m;.Rf us>m;.Rf uV>m;.Rf uG>m;.Rf uU>m;.Rf uS>m;.Rf u+>m;.Rf uO>m;.Rf uF>m;.Rf
c>m;.Rf cg>m;.Rf cH>m;.Rf cX>m;.Rf ck>m;.Rf cl>m;.Rf ch>m;.Rf cJ>m;.Rf cd>m;.Rf cD>m;.Rf c->m;.Rf cj>m;.Rf cs>m;.Rf cV>m;.Rf cG>m;.Rf cU>m;.Rf cS>m;.Rf c+>m;.Rf cO>m;.Rf cF>m;.Rf
Each list in the list of lists has, as its first element, a word of Sgaw Karen converted into ASCII symbols in the same way. For example:
[['u&X>', 'n', 'yard'], ['vk.', 'n', 'yarn'], ['w>ouDxD.', 'n', 'yawn'], ['w>wuDxD.', 'n', 'yawn']]
This is what I have so far:
def alphabetize(word_list):
    alphabet = ''.join([line.rstrip() for line in open('map.txt', 'rb')])
    word_list = sorted(word_list, key=lambda word: [alphabet.index(c) for c in word[0]])
    return word_list
I would like to alphabetize word_list by the first element of each list (eg. 'u&X>', 'vk.'), according to the pattern in alphabet.
My code's not working yet and I'm struggling to understand the sorted command with lambda and the for loop.

First, if you're trying to look up the entire word[0] in alphabet, rather than each character individually, you shouldn't be looping over the characters of word[0]. Just use alphabet.index(word[0]) directly.
From your comments, it sounds like you're trying to look up each transliterated-Burmese-script character in word[0]. That isn't possible unless you can write an algorithm to split a word up into those characters. Splitting it up into the ASCII bytes of the transliteration doesn't help at all.
Second, you probably shouldn't be using index here. When you think you need to use index or similar functions, 90% of the time, that means you're using the wrong data structure. What you want here is a mapping (presumably why it's called map.txt), like a dict, keyed by words, not a list of words that you have to keep explicitly searching. Then, looking up a word in that dictionary is trivial. (It's also a whole lot more efficient, but the fact that it's easy to read and understand can be even more important.)
Finally, I suspect that your map.txt is supposed to be read as a whitespace-separated list of transliterated characters, and what you want to find is the index into that list for any given word.
So, putting it all together, something like this:
with open('map.txt') as f:   # text mode, so the keys are str like the entries in word_list
    mapping = {word: index for index, word in enumerate(f.read().split())}
word_list = sorted(word_list, key=lambda word: mapping[word[0]])
But, again, that's only going to work for one-syllable words, because until you can figure out how to split a word up into the units that should be alphabetized (in this case, the symbols), there is no way to make it work for multi-syllable words.
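One possible way to do that splitting, assuming the symbols in map.txt can be matched greedily (longest symbol first) without ambiguity, is a sketch like the following. It rebuilds the same mapping as above and is only an illustration of the idea, not a definitive solution:
with open('map.txt') as f:
    symbols = f.read().split()
mapping = {sym: index for index, sym in enumerate(symbols)}
max_len = max(len(sym) for sym in symbols)

def split_into_symbols(word):
    # Greedy longest-match: at each position, take the longest known symbol.
    units = []
    pos = 0
    while pos < len(word):
        for size in range(min(max_len, len(word) - pos), 0, -1):
            piece = word[pos:pos + size]
            if piece in mapping:
                units.append(piece)
                pos += size
                break
        else:
            raise ValueError('cannot match %r at position %d' % (word, pos))
    return units

def alphabetize(word_list):
    # Sort by the sequence of symbol indices, symbol by symbol.
    return sorted(word_list, key=lambda entry: [mapping[sym] for sym in split_into_symbols(entry[0])])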
And once you've written the code that does that, I'll bet it would be pretty easy to just convert everything to proper Unicode representations of the Burmese script. Each syllable still takes 1-4 code points in Unicode, but that's fine, because the standard Unicode collation algorithm (available to Python through locale-aware sorting or a library such as PyICU) already knows how to alphabetize things properly for that script, so you don't have to write it yourself.
Or, even better, unless this is some weird transliteration that you or your teacher invented, there's probably already code to translate between this format and Unicode, which means you shouldn't even have to write anything yourself.

Related

Shorten long small-alphabet string using larger alphabet

I have a set of ~100 long (between 120 and 150 characters) strings encoded using a 20 letter alphabet (the natural amino acid alphabet). I'm using them in database entries, but they're cumbersome. I'd like to shorten (not compressing, because I don't care about the memory size) them to make them easier to:
Visually compare
Copy/Paste
Manually enter
I was hoping a feasible way to shorten them would be convert the string to a larger alphabet. Specifically, the set of single digits, as well as lower and upper case alphabet.
For example:
# given some long string as input
shorten("ACTRYP...TW")
# returns something shorter like "a3A4n"
Possible approaches
From my elementary understanding of compression, this could be accomplished naively by making a lookup dictionary which maps certain repeating sequences to elements of the larger alphabet.
Related Question
This question seemed to be pointing in a similar direction, but was working with the DNA alphabet and seemed to be actually seeking compression.
As suggested by #thethiny, a combination of hashing and base32 encoding can accomplish the desired shortening:
import base64
import hashlib

kinda_long = "ELYWPSRVESGTLVGYQYGRAITGQGKTSGGGSGWLGGGLRLSALELSGKTFSCDQAYYQVLSLNRGVICFLKVSTSVWSYESAAGFTMSGSAQYDYNVSGKANRSDMPTAFDVSGA"
# 52-character base32 form of the SHA-256 digest (fixed length, not reversible)
shorter = base64.b32encode(hashlib.sha256(kinda_long.encode()).digest()).decode().strip("=")
My original question mentioned using ASCII alphabet and digits. This would be a base 62 encoding. Various libraries exist for this.
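A minimal, reversible sketch of that base-62 idea, assuming the standard 20-letter amino acid alphabet (the shorten helper and the constants are made up for illustration, not from any library):
import string

AMINO = "ACDEFGHIKLMNPQRSTVWY"
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def shorten(seq):
    # Pack the sequence into one integer, treating it as a base-20 number.
    # Note: leading 'A's (index 0) are lost unless the length is stored separately.
    n = 0
    for ch in seq:
        n = n * len(AMINO) + AMINO.index(ch)
    # Re-express that integer using the 62-character alphabet.
    digits = []
    while True:
        n, rem = divmod(n, len(BASE62))
        digits.append(BASE62[rem])
        if n == 0:
            break
    return ''.join(reversed(digits))

print(shorten("ACDEFGHIKLMNPQRSTVWY"))   # noticeably shorter than the input
Unlike the hash above, this can be decoded back to the original sequence by reversing both steps.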

How to search through each letter in a dictionary and use it properly

If I were to take a dictionary, such as
living_beings= {"Reptile":"Snake","mammal":"whale", "Other":"bird"}
and wished to search for individual characters (such as "a"), e.g.
for i in living_beings:
    if "a" in living_beings:
        print("a is here")
would there be an efficient (i.e. fastest-running) method of doing this?
The input is simply searching as outlined above (although my approach didn't work).
My (failed) code goes as follows:
animals=[]
for row in reader: #'reader' is simply what was in the dictionary
    animals.append(row) #I tried to turn it into a list to sort it that way
for i in range(1, len(animals)):
    r= animals[i]
    for i in r:
        if i== "a": #My attempt to find "a". This is obviously False as i= one of the strings in
            k=i.replace("'","/") #this is my attempt at the further bit, for a bit of context
            test= animals.append(k)
            print(test)
In case you were wondering,
The next step would be to insert a character ("/") before that letter (in this case "a"), although this is a slightly different problem, so it is not linked with my question and is simply there to give a greater understanding of the problem.
EDIT
I have found another error relating to the dictionary. If the dictionary features an apostrophe (') the output is affected, as that particular word is printed in quotes ("") rather than with the normal apostrophes. EXAMPLE: living_beings= {"Reptile":"Snake's","mammal":"whale", "Other":"bird"} and if you use the following code (which I need to):
new= []
for i in living_beings:
    r=living_beings[i]
    new.append(r)
then the output is "snake's", 'whale', 'bird' (note the difference between the first and the other outputs). So my question is: how do I stop the apostrophes from affecting the output?
My approach would be to use a dict comprehension to map over the dictionary and replace every occurrence of 'a' with '/a'.
I don't think there are significant performance improvements to be made beyond that. Your algorithm will be linear in the total number of characters in the keys and values of the dict, since you need to traverse the whole dictionary whatever the input.
living_beings = {"Reptile": "Snake", "mammal": "whale", "Other": "bird"}
new_dict = {
    kind.replace('a', '/a'): animal.replace('a', '/a')
    for kind, animal in living_beings.items()
}
# new_dict: {"Reptile": "Sn/ake", "m/amm/al": "wh/ale", "Other": "bird"}
You could maybe optimize with a more convoluted solution that loops through the dict to mutate it instead of creating a new one, but in general I recommend not trying to do such things in Python. Just write good code, with good practices, and let Python do the optimization under the hood. After all this is what the Zen of Python tells us: Simple is better than complex.
This can be done quite efficiently using a regular expression match, e.g.:
import re

re_containsA = re.compile(r'.*a.*')
for key, word in worddict.items():
    if re_containsA.match(word):
        print(key)
The re.match object can then be used to find the location of the matched text.
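If the position of the letter is what you are after, a small variation with re.search on the bare character gives it directly; a sketch using the example dictionary from the question:
import re

worddict = {"Reptile": "Snake", "mammal": "whale", "Other": "bird"}
for key, word in worddict.items():
    m = re.search('a', word)
    if m:
        print(key, "-> 'a' at index", m.start())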

Encoding included in .append? Python

I am a complete beginner when it comes to Python, and currently getting closer towards the end of LPTHW. Now, in exercise 41, there is a line of code in a for-loop that I do not quite get.
I have searched online to the best of my abilities, but as I am still learning, I was not completely sure how to even search for this.
To clarify:
WORD_URL is just a series of words.
WORDS is an empty list.
This is the loop:
for word in urlopen(WORD_URL).readlines():
    WORDS.append(str(word.strip(), encoding="utf-8"))
Now what I do not really understand is what this WORDS.append(str(word.strip(), encoding='utf-8')) does. Why is encoding="utf-8" included, and what does it do in this context? I suspect the use of str here is connected to this in some way, but I am not completely sure. Would it not be possible to simply just have it like this:
.append(word.strip())?
Thanks!
The code snippet
WORDS.append(str(word.strip(), encoding='utf-8'))
strips the surrounding whitespace from word and decodes word (a bytes object) into a str, interpreting the bytes as UTF-8.
The str constructor takes care of the decoding.
The code you provided
WORDS.append(word.strip())
would place the variable word, with its whitespace stripped, at the end of the list WORDS, but it would still be a bytes object.
The difference is that no decoding takes place, so word is placed at the end of WORDS as raw bytes, in whatever encoding the server delivered it. In the first example, every item placed in WORDS has been decoded from UTF-8 into a proper str.
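A small illustration of that difference, with a literal bytes value standing in for a line from urlopen(WORD_URL).readlines():
word = b"apple\n"

WORDS = []
WORDS.append(word.strip())                          # b'apple'  (still bytes)
WORDS.append(str(word.strip(), encoding="utf-8"))   # 'apple'   (decoded str)
print(WORDS)                                        # [b'apple', 'apple']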

How to sort an alphabet of a language not presented in ISO?

There is a language (Circassian) which is not represented in ISO; it is based on Cyrillic characters, but it has its own order, which differs from the standard Cyrillic order. So there is a problem I can't solve: I need to sort words in a database according to the Circassian alphabet order. I need your ideas on how to solve this problem, because I'm pretty sure my skills aren't good enough for this task.
You can use the built-in .sort() method and the locale module to sort the words. A simple example would be:
alphabet = ['Ж', 'Жь', 'Гъ'] # obviously the list should contain all the letters
import locale
locale.setlocale(category=locale.LC_ALL, locale="kbd")
alphabet.sort(key=locale.strxfrm)
print(alphabet)
['Гъ', 'Ж', 'Жь']
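If no "kbd" locale is installed on the system, a similar effect can be achieved by building the sort key yourself from the custom alphabet. The sketch below uses only the three letters above as a stand-in for the full alphabet, in the order the locale example produced (an assumption made purely for illustration):
import re

alphabet = ['Гъ', 'Ж', 'Жь']   # in real use: the whole Circassian alphabet, in order
rank = {letter: index for index, letter in enumerate(alphabet)}

# Alternation that tries longer letters first, so 'Жь' is not read as 'Ж' plus 'ь'.
# Characters that are not in the alphabet are silently skipped by findall.
letter_re = re.compile('|'.join(sorted(map(re.escape, alphabet), key=len, reverse=True)))

def circassian_key(word):
    return [rank[letter] for letter in letter_re.findall(word)]

words = ['Жь', 'Гъ', 'Ж']
print(sorted(words, key=circassian_key))   # ['Гъ', 'Ж', 'Жь']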

python write unicode to file and retrieve it as a symbol

I am working on a Python program which contains an Arabic-English database and allows the user to update this database and also to study the vocabulary. I am almost done with implementing all the functions I need, but the most important part is missing: the encoding of the Arabic strings. To append new vocabulary to the database txt file, a dictionary is created and then its content is appended to the file. To study vocabulary, the content of the txt file is converted into a dictionary again, a random word is printed to the console and the user is asked for its translation. Now the idea is that the user has the possibility to write the English word as well as the Arabic word in Latin letters, and the program will internally convert the pseudo-Arabic string to Arabic letters. For example, if the user writes 'b' when asked for the Arabic word, I want to append 'ب‎'.
1. There are about 80 signs I have to consider in the implementation. Is there a way of creating some mapping between the latin-letter input string and the respective Arabic signs? For me, the most intuitive idea would be to write one if statement after the other but that's probably super slow.
2. I have trouble printing the Arabic string to the console. This input
print('bla{}!'.format(chr(0xfe9e)))
print('bla{}!'.format(chr(int('0x'+'0627',16))))
will result in printing the Arabic sign whereas this won't:
print('{}'.format(chr(0xfe9e)))
What can I do in order to avoid this problem, since I want a sequence which consists of unicode symbols only?
Did you try the encode/decode functions? For example, you can write
u = ("سلام".encode('utf-8'))
print(u.decode('utf-8'))
This is not the final answer but can give you a start.
First of all check your encoding:
import sys
sys.getdefaultencoding()
Edit:
sys.setdefaultencoding('UTF8') has been removed from the sys module in Python 3. Still, you can comment with what sys.getdefaultencoding() returns on your machine.
However, for Arabic characters, you can generate the whole range at once:
According to this website, Arabic characters are from 0x620 to 0x64B and Basic Latin characters are from 0x0061 to 0x007B (for lower cases).
So:
arabic_chr = [chr(k) for k in range(0x0620, 0x064B)]   # Arabic letters
latin_chr = [chr(k) for k in range(0x0061, 0x007B)]    # lowercase Basic Latin letters, a-z
Now all you have to do is find a relation between the two lists, or maybe extend the ranges further (I speak Arabic and I know that there are many forms of one character, and that a character can change from one word to another).
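For the mapping in part 1, a plain dict plus a single regular expression substitution avoids writing one if statement after another; the table below is a tiny made-up fragment, not a standard transliteration scheme:
import re

TRANSLIT = {'sh': 'ش', 'b': 'ب', 's': 'س', 'l': 'ل', 'a': 'ا', 'm': 'م'}

# One pattern that tries longer keys first, so 'sh' wins over 's' followed by 'h'.
pattern = re.compile('|'.join(sorted(map(re.escape, TRANSLIT), key=len, reverse=True)))

def to_arabic(latin):
    # Replace every recognised chunk; unknown characters pass through unchanged.
    return pattern.sub(lambda m: TRANSLIT[m.group(0)], latin)

print(to_arabic('bab'))   # باب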
