Total number of alphanum values in python - python

I am making a code in which every character is associated with a number. To make my life easier I decided to use alphanumeric values (a=97, b=98, z=121). The first step in my code would be to get a number out of a character. For example:
char = input"Write a character:"
print(ord(char.lower()))
Afterwards though, I need to know the total number of alphanum characters that exist and nowhere have I found my answer...

Your question is not very clear. Why would you need total number of alphanum characters?
thing is, that number depends on the encoding in question. If ASCII is in question then:
>>> import string
>>> len(string.letters+string.digits)
Which is something you could do by counting manually.
And this is even not really the total count, as there is a few more alpha from other languages within 0-128 ASCII range.
If unicode, well, then you will have to search for the specification to see how many of these are there. I do not even know how many alphabets are crammed into unicode or UTF-8.
If it is a question of recognizing alpha-numeric characters in a string, then Python has a nice method to do so:
>>> "A".isalnum()
>>> "0".isalnum()
>>> "[".isalnum()
So please, express yourself more clearly.

Related

Shorten long small-alphabet string using larger alphabet

I have a set of ~100 long (between 120 and 150 characters) strings encoded using a 20 letter alphabet (the natural amino acid alphabet). I'm using them in database entries, but they're cumbersome. I'd like to shorten (not compressing, because I don't care about the memory size) them to make them easier to:
Visually compare
Copy/Paste
Manually enter
I was hoping a feasible way to shorten them would be convert the string to a larger alphabet. Specifically, the set of single digits, as well as lower and upper case alphabet.
For example:
# given some long string as input
shorten("ACTRYP...TW")
# returns something shorter like "a3A4n"
Possible approaches
From my elementary understanding of compression, this could be accomplished naively by making a lookup dictionary which maps certain repeating sequences elements of the larger alphabet.
Related Question
This question seemed to pointing in a similar direction, but was working with the DNA alphabet and seemed to be actually seeking compression.
As suggested by #thethiny a combination of hashing can accomplish the shortening desired:
import base64
import hashlib
kinda_long = "ELYWPSRVESGTLVGYQYGRAITGQGKTSGGGSGWLGGGLRLSALELSGKTFSCDQAYYQVLSLNRGVICFLKVSTSVWSYESAAGFTMSGSAQYDYNVSGKANRSDMPTAFDVSGA"
shorter = base64.b32encode(hashlib.sha256(af.encode()).digest()).decode().strip("=")
My original question mentioned using ASCII alphabet and digits. This would be a base 62 encoding. Various libraries exist for this.

Came up with serialization algorithm for strings, but only works for words that are < 10 letters long.

So I had a question about a serialization algorithm I just came up with it, wanted to know if it already exists and if there's a better version out there.
So we know normal algorithms use a delimiter and join words in a list, but then you have to look through the whole word for existence of the delimiter, escape, etc, or make the serialization algorithm not robust. I thought a more intuitive approach would be to use higher level languages like Python where len() is O(1) and prepend that to each word. So for example this code I attached.
Wouldn't this be faster because instead of going through every letter of every word we instead just go through every word? And then deserialization we don't have to look through every character to find the delimiter, we can just skip directly to the end of each word.
The only problem I see is that double digit sizes would cause problems, but I'm sure there's a way around that I haven't found yet.
It was suggested to me that protocol buffers are similar to this idea, but I haven't understood why yet.
def serialize(list_o_words):
return ''.join(str(len(word)) + word for word in list_o_words)
def deserialize(serialized_list_o_words):
index = 0
deserialized_list = []
while index < len(serialized_list_o_words):
word_length = int(serialized_list_o_words[index])
next_index = index + word_length + 1
deserialized_list.append(serialized_list_o_words[index+1:next_index])
index = next_index
return deserialized_list
serialized_list = "some,comma,separated,text".split(",")
print(serialize(serialized_list))
print(deserialize(serialize(serialized_list)) == serialized_list)
Essentially, I want to know how I can handle double digit lengths.
There are many variations on length-prefixed strings, but the key bits come down to how you store the length.
You're deserializing the lengths as a single-character ASCII number, which means you can only handle lengths from 0 to 9. (You don't actually test that on the serialize size, so you can generate garbage, but let's forget that.)
So, the obvious option is to use 2 characters instead of 1. Let's add in a bit of error handling while we're at it; the code is still pretty easy:
def _len(word):
s = format(len(word), '02')
if len(s) != 2:
raise ValueError(f"Can't serialize {s}; it's too long")
return s
def serialize(list_o_words):
return ''.join(_len(word) + word for word in list_o_words)
def deserialize(serialized_list_o_words):
index = 0
deserialized_list = []
while index+1 < len(serialized_list_o_words):
word_length = int(serialized_list_o_words[index:index+2])
next_index = index + word_length + 2
deserialized_list.append(serialized_list_o_words[index+2:next_index])
index = next_index
return deserialized_list
But now you can't handle strings >99 characters.
Of course you can keep adding more digits for longer strings, but if you think "I'm never going to need a 100,000-character string"… you are going to need it, and then you'll have a zillion old files in the 5-digit format that aren't compatible with the new 6-digit format.
Also, this wastes a lot of bytes. If you're using 5-digit lengths, s encodes as 00000s, which is 6x as big as the original value.
You can stretch things a lot farther by using binary lengths instead of ASCII. Now, with two bytes, we can handle lengths up to 65535 instead of just 99. And if you go to four or eight bytes, that might actually be big enough for all your strings ever. Of course this only works if you're storing bytes rather than Unicode strings, but that's fine; you probably needed to encode your strings for persistence anyway. So:
def _len(word):
# already raises an exception for lengths > 65535
s = struct.pack('>H', len(word))
def serialize(list_o_words):
utfs8 = (word.encode() for word in list_o_words)
return b''.join(_len(utf8) + utf8 for utf8 in utfs8)
Of course this isn't very human-readable or -editable; you need to be comfortable in a hex editor to replace a string in a file this way.
Another option is to delimit the lengths. This may sound like a step backward—but it still gives us all the benefits of knowing the length in advance. Sure, you have to "read until comma", but you don't have to worry about escaped or quoted commas the way you do with CSV files, and if you're worried about performance, it's going to be much faster to read a buffer of 8K at a time and chunk through it with some kind of C loop (whether that's slicing, or str.find, barely matters by comparison) than to actually read either until comma or just two bytes.
This also has the benefit of solving the sync problem. With delimited values, if you come in mid-stream, or get out of sync because of an error, it's no big deal; just read until the next unescaped delimiter and worst-case you missed a few values. With length-prefixed values, if you're out of sync, you're reading arbitrary characters and treating them as a length, which just throws you even more out of sync. The netstring format is a minor variation on this idea, with a tiny bit more redundancy to make sync problems easier to detect/recover from.
Going back to binary lengths, there are all kinds of clever tricks for encoding variable-length numbers. Here's one idea, in pseudocode:
if the current byte is < hex 0x80 (128):
that's the length
else:
add the low 7 bits of the current byte
plus 128 times (recursively process the next byte)
Now you can handle short strings with just 1 byte of length, but if a 5-billion-character string comes along, you can handle that too.
Of course this is even less human-readable than fixed binary lengths.
And finally, if you ever want to be able to store other kinds of values, not just strings, you probably want a format that uses a "type code". For example, use I for 32-bit int, f for 64-bit float, D for datetime.datetime, etc. Then you can use s for strings <256 characters with a 1-byte length, S for strings <65536 characters with a 2-byte length, z for string <4B characters with a 4-byte length, and Z for unlimited strings with a complicated variable-int length (or maybe null-terminated strings, or maybe an 8-byte length is close enough to unlimited—after all, nobody's ever going to want more than 640KB in a computer…).

Formatting string to certain character limit in Python

I have a code that outputs a whole bunch of numbers after doing some maths on them. At one point in the code they are rounded off with numpy.rint, and in certain cases (I believe when a 9 is rounded to a 10) I end up with a trailing zero that I do not want. I have some code that looks sort-of like this
ra3n = ra3/60 * 10
ra3n = np.rint(ra3n)
ra3n = ra3n.astype(str) ##there is a good reason that this needs to be a string
I need all of the resulting ra3n to be 5 characters long, but occasionally one pops out as 6 characters long. How would I format this properly? Keep in mind I'm a total python noob, so I might need it spelled out for me =)
EDIT:
Here's my output:
00244-2451
00244-2702
00278-0629
00286-1614
00295-1101
002910-0546
00303+0711
00305+2246
00348+2604
003410+0423
00355-0204
00359+1236
00360-0931
00386-1210
The instances where there are six digits instead of 5 in the first half of the string are the erroneous ones; those trailing zeroes should not be there.
ra3n = ra3n[:-1] if ra3n[-1] == '0' else ra3n
There's probably a better solution, but I'm not sure I really understand your issue without seeing some output.
You change the type of ra3n, which is poor programming practice. Try this.
ra3n = format(ra3/60.*10., '5f')[:5]
This gives exactly five characters. Note that if the string would usually be six characters long, this cuts off the last character, for good or for bad. Note also that I included decimal points in the 60 and 10 numbers: this guarantees that floating-point division will be used, rather than integer division if this is done in Python 2.

Make overlines in Python

Hello my fellow coders!
I'm an absolute beginner to Python and coding in general. Right now, I'm writing a code that converts regular arabic numerals to roman. For numbers larger than 3 999, the romans usually wrote a line over a letter to make it thousand times larger. For example, IV with a line over it represented 4 000. How is this possible in Python? I have understood that you can create an "overscore" by writing "\u203E". How can I make this appear over a letter instead of beside it?
Regards
You need to use the combining character U+0304 instead.
>>> print(u'a\u0304')
ā
U+0305 is probably a better choice (as viraptor suggests). You can also use the Unicode Roman numerals (U+2160 through U+217f) instead of regular uppercase Latin letters, although (at least in my terminal) they don't render as well with the overline.
>>> print(u'\u2163\u0305')
Ⅳ̅
>>> print u'I\u0305V\u0305'
I̅V̅
(Or as I see it:
Notice the overline is centered over, but does not completely cover, the single-character Roman numeral 4.)
(Any pure text option will only be as good as the font and renderer used by the person running the code. Case in point, the I+V version does not even display consistently while I type this; sometimes the overbars are over the letters, sometimes they follow the letters.)
A combining overline is \u305 and it works quite well with "IV". What you want is for example: u'I\u0305V\u0305' (gives I̅V̅)
I looked for something online but didn't find it. The best workaround I'd suggest would be the following:
def over(character):
return "_\n"+character
Such as:
>>> print over("M")
_
M
>>>

Python: Custom sort a list of lists

I know this has been asked before, but I have not been able to find a solution.
I'm trying to alphabetize a list of lists according to a custom alphabet.
The alphabet is a representation of the Burmese script as used by Sgaw Karen in plain ASCII. The Burmese script is an alphasyllabary—a few dozen onsets, a handful of medial diacritics, and a few dozen rhymes that can be combined in thousands of different ways, each of which is a single "character" representing one syllable. The map.txt file has these syllables, listed in (Karen/Burmese) alphabetical order, but converted in some unknown way into ASCII symbols, so the first character is u>m;.Rf rather than က or [ka̰]. For example:
u>m;.Rf ug>m;.Rf uH>m;.Rf uX>m;.Rf uk>m;.Rf ul>m;.Rf uh>m;.Rf uJ>m;.Rf ud>m;.Rf uD>m;.Rf u->m;.Rf uj>m;.Rf us>m;.Rf uV>m;.Rf uG>m;.Rf uU>m;.Rf uS>m;.Rf u+>m;.Rf uO>m;.Rf uF>m;.Rf
c>m;.Rf cg>m;.Rf cH>m;.Rf cX>m;.Rf ck>m;.Rf cl>m;.Rf ch>m;.Rf cJ>m;.Rf cd>m;.Rf cD>m;.Rf c->m;.Rf cj>m;.Rf cs>m;.Rf cV>m;.Rf cG>m;.Rf cU>m;.Rf cS>m;.Rf c+>m;.Rf cO>m;.Rf cF>m;.Rf
Each list in the list of lists has, as its first element, a word of Sgaw Karen converted into ASCII symbols in the same way. For example:
[['u&X>', 'n', 'yard'], ['vk.', 'n', 'yarn'], ['w>ouDxD.', 'n', 'yawn'], ['w>wuDxD.', 'n', 'yawn']]
This is what I have so far:
def alphabetize(word_list):
alphabet = ''.join([line.rstrip() for line in open('map.txt', 'rb')])
word_list = sorted(word_list, key=lambda word: [alphabet.index(c) for c in word[0]])
return word_list
I would like to alphabetize word_list by the first element of each list (eg. 'u&X>', 'vk.'), according to the pattern in alphabet.
My code's not working yet and I'm struggling to understand the sorted command with lambda and the for loop.
First, if you're trying to look up the entire word[0] in alphabet, rather than each character individually, you shouldn't be looping over the characters of word[0]. Just use alphabet.index(word[0]) directly.
From your comments, it sounds like you're trying to look up each transliterated-Burmese-script character in word[0]. That isn't possible unless you can write an algorithm to split a word up into those characters. Splitting it up into the ASCII bytes of the transliteration doesn't help at all.
Second, you probably shouldn't be using index here. When you think you need to use index or similar functions, 90% of the time, that means you're using the wrong data structure. What you want here is a mapping (presumably why it's called map.txt), like a dict, keyed by words, not a list of words that you have to keep explicitly searching. Then, looking up a word in that dictionary is trivial. (It's also a whole lot more efficient, but the fact that it's easy to read and understand can be even more important.)
Finally, I suspect that your map.txt is supposed to be read as a whitespace-separated list of transliterated characters, and what you want to find is the index into that list for any given word.
So, putting it all together, something like this:
with open('map.txt', 'rb') as f:
mapping = {word: index for index, word in enumerate(f.read().split())}
word_list = sorted(word_list, key=lambda word: mapping[word[0]])
But, again, that's only going to work for one-syllable words, because until you can figure out how to split a word up into the units that should be alphabetized (in this case, the symbols), there is no way to make it work for multi-syllable words.
And once you've written the code that does that, I'll bet it would be pretty easy to just convert everything to proper Unicode representations of the Burmese script. Each syllable still takes 1-4 code points in Unicode—but that's fine, because the standard Unicode collation algorithm, which comes built-in with Python, already knows how to alphabetize things properly for that script, so you don't have to write it yourself.
Or, even better, unless this is some weird transliteration that you or your teacher invented, there's probably already code to translate between this format and Unicode, which means you shouldn't even have to write anything yourself.

Categories

Resources