position-independent comparison of Hangul characters - python

I am writing a Python 3 program that has to handle text in various writing systems, including Hangul (Korean), and I am having problems comparing the same character in different positions.
For those unfamiliar with Hangul (not that I know much about it, either), this script has the almost unique feature of combining the letters of a syllable into square blocks. For example, 'ㅎ' is pronounced [h] and 'ㅏ' is pronounced [a]; the syllable 'hah' is written '핳' (in case your system can't render Hangul: the first h is displayed in the top-left corner, the a is in the top-right corner, and the second h is underneath them in the middle). Unicode handles this by having two different code points for each consonant, depending on whether it appears in the onset or the coda of a syllable. For example, the syllable above is encoded as '\u1112\u1161\u11c2'.
My code needs to compare two characters, considering them equal if they differ only in their position. Simple comparison does not do this, even after applying Unicode normalization. Is there a way to do it?

You will need to use a tailored version of the Unicode Collation Algorithm (UCA) that assigns equal weights to identical syllables. The UCA technical report describes the general problem for sorting Hangul.
Luckily, the ICU library has a set of collation rules that does exactly this: ko-u-co-search – Korean (General-Purpose Search) – which you can try out on their demo page. To use this from Python, you will either need to use a library like PyICU, or one that implements the UCA and supports the ICU rule file format (or lets you write your own rules).

I'm the developer for Python jamo (the Hangul letters are called jamo). An easy way to do this would be to cast all jamo code points to their respective Hangul compatibility jamo (HCJ) code points. HCJ is the display form of jamo characters, so initial and final forms of consonants are the same code point.
For example:
>>> import jamo
>>> initial, vowel, final = jamo.j2hcj('\u1112\u1161\u11c2')
>>> initial == final
True
The way this is done internally is with a lookup table copied from the Unicode specifications.
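If you would rather avoid a third-party dependency, a similar position-independent comparison can be sketched with the standard library alone, by rewriting the Unicode character names of conjoining jamo so that initial (CHOSEONG) and final (JONGSEONG) consonants map to a common code point. This is a heuristic based on the Unicode naming conventions, not an official mapping, and the function name is my own:

```python
import unicodedata

def position_independent(ch):
    # Conjoining jamo have names like HANGUL CHOSEONG HIEUH (initial)
    # and HANGUL JONGSEONG HIEUH (final).  Rewriting JONGSEONG to
    # CHOSEONG maps both forms to the same code point.
    name = unicodedata.name(ch, '')
    if 'JONGSEONG' in name:
        try:
            return unicodedata.lookup(name.replace('JONGSEONG', 'CHOSEONG'))
        except KeyError:
            return ch  # e.g. a cluster final with no initial counterpart
    return ch
```

With this, position_independent('\u11c2') yields '\u1112', so the initial and final h of '핳' compare equal.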

Related

How do I add a space between characters when a condition is met?

I want to add a space when two adjacent glyphs are not of the same type. In my case, I am trying to get it to work for Japanese Hiragana and Katakana.
So if I input 'これはペンです', I'll get 'これは ペン です' because は and ペ are not of the same type and ン and で are not of the same type. Likewise, if I input '日本人です', I should get '日本人 です' because 人 and で are not of the same type. The glyphs '日本人' are left alone because they're not members of the Hiragana and Katakana set.
Do I need to make a list of glyphs for Hiragana and Katakana? (This is no problem, by the way.) Is there a way of designating an 'elsewhere list' of everything that belongs to neither Hiragana nor Katakana?
Disclaimer: I'm a linguist and I'm fairly new to programming. I know how it works, but I don't have a ton of hands-on experience. Also, I'm not looking for an extant parser or something like that.
The Hiragana and Katakana blocks are contiguous, so it is very easy to calculate whether a code point is within a specific set. Basically, if a code point is between 0x3041 and 0x309F it is Hiragana; if it is between 0x30A1 and 0x30FF it is Katakana. No list required.
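A minimal sketch of the whole spacing rule using those ranges (the function and variable names are my own):

```python
def glyph_type(ch):
    # Classify a character by its Unicode code point range.
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return 'hiragana'
    if 0x30A1 <= cp <= 0x30FF:
        return 'katakana'
    return 'other'

def add_spaces(text):
    # Insert a space wherever two adjacent glyphs differ in type.
    if not text:
        return text
    out = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if glyph_type(prev) != glyph_type(cur):
            out.append(' ')
        out.append(cur)
    return ''.join(out)
```

This reproduces both examples from the question: 'これはペンです' becomes 'これは ペン です', and '日本人です' becomes '日本人 です' (the kanji are all 'other', so no space is inserted between them).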

Shorten long small-alphabet string using larger alphabet

I have a set of ~100 long (between 120 and 150 characters) strings encoded using a 20-letter alphabet (the natural amino acid alphabet). I'm using them in database entries, but they're cumbersome. I'd like to shorten them (not compress them; I don't care about memory size) to make them easier to:
Visually compare
Copy/Paste
Manually enter
I was hoping a feasible way to shorten them would be to convert the string to a larger alphabet. Specifically, to the set of digits plus the lower- and upper-case letters.
For example:
# given some long string as input
shorten("ACTRYP...TW")
# returns something shorter like "a3A4n"
Possible approaches
From my elementary understanding of compression, this could be accomplished naively by making a lookup dictionary which maps certain repeating sequences to elements of the larger alphabet.
Related Question
This question seemed to point in a similar direction, but was working with the DNA alphabet and seemed to be actually seeking compression.
As suggested by @thethiny, hashing can accomplish the desired shortening:
import base64
import hashlib
kinda_long = "ELYWPSRVESGTLVGYQYGRAITGQGKTSGGGSGWLGGGLRLSALELSGKTFSCDQAYYQVLSLNRGVICFLKVSTSVWSYESAAGFTMSGSAQYDYNVSGKANRSDMPTAFDVSGA"
shorter = base64.b32encode(hashlib.sha256(kinda_long.encode()).digest()).decode().strip("=")
My original question mentioned using the ASCII letters and digits. That would be a base-62 encoding; various libraries exist for this.
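Note that a hash is one-way: you cannot recover the sequence from it. If reversibility matters, a base-62 re-encoding is a small exercise: treat the amino-acid string as a base-20 number and rewrite it with 62 symbols. A sketch, where the alphabet ordering and the function names are my own choices:

```python
import string

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid letters
B62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def aa_to_b62(s):
    # Interpret the sequence as a base-20 integer, then re-encode in base 62.
    n = 0
    for ch in s:
        n = n * len(AA) + AA.index(ch)
    out = []
    while True:
        n, r = divmod(n, 62)
        out.append(B62[r])
        if n == 0:
            break
    return ''.join(reversed(out))

def b62_to_aa(code, length):
    # Reverse: decode the base-62 string to an integer, then back to
    # amino acids.  The original length must be passed so leading
    # 'A' (digit zero) characters are restored.
    n = 0
    for ch in code:
        n = n * 62 + B62.index(ch)
    out = []
    for _ in range(length):
        n, r = divmod(n, len(AA))
        out.append(AA[r])
    return ''.join(reversed(out))
```

The gain is modest (log 20 / log 62 ≈ 0.73, so a 150-character sequence shrinks to roughly 110 characters), but unlike the hash it is lossless.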

Python, Convert white spaces to invisible ASCII character

A quick question. I have no idea how to google for an answer.
I have a python program for taking input from users and generating string output.
I then use these string variables to populate text boxes in other software (Illustrator).
The string is made out of: number + '%' + text, e.g.
'50% Cool', '25% Alright', '25% Decent'.
These three elements are entered into one text box (next to one another), and as is usual with text boxes, if the whole text does not fit on one line, it is wrapped to the next line at a white space ' '. Like so:
50% Cool 25% Alright 25%
Decent
I need to keep this feature (where text gets moved down to a lower line if it does not fit), but I need it to move the whole element, not split it.
Like So:
50% Cool 25% Alright
25% Decent
The only way I can think of to stop this from happening is to use some sort of invisible character which joins the parts of an element together (while still displaying as a visible white space).
Does anyone know of such a connector character?
So, understand first of all that what you're asking about is encoding-specific. In Latin-1 (and the Windows ANSI code pages based on it), the non-breaking space is character 160. (See this discussion for details.) E.g.:
non_breaking_space = chr(160)
HOWEVER, that's for encoded ASCII binary strings. Assuming you're using Python 3 (which you should consider if you're not), your strings are all Unicode strings, not encoded binary strings.
And this also raises the question of how you plan on getting these strings into Illustrator. Does the user copy and paste them? Are they encoded into a file? That will affect how you want to transmit a non-breaking space.
Assuming you're using Python 3 and not worrying about encoding, this is what you want:
'Alright\u00a025%'
\u00a0 inserts a Unicode no-break space (NBSP).
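For example, to build the whole line so that each element wraps as a unit (the variable names are illustrative):

```python
NBSP = '\u00a0'  # NO-BREAK SPACE: renders like a space but forbids wrapping

items = [(50, 'Cool'), (25, 'Alright'), (25, 'Decent')]

# No-break space inside each element, ordinary space between elements,
# so a text box can only wrap between "50% Cool" and "25% Alright", etc.
line = ' '.join(f'{pct}%{NBSP}{label}' for pct, label in items)
```

The text box then sees only two ordinary break opportunities, one between each pair of elements.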

Strings in Python not equal due to German Umlaut

I tried to compare two strings, both containing the German umlaut "ü". Both look literally the same, and there is no trailing \n or anything similar.
One of the strings is read from an XML file, the other from the file system. Comparing them letter by letter shows a difference at the umlaut.
The decomposed umlaut (consisting of two code points: a plain u followed by a combining diaeresis, the two dots) comes from the file system. I'm using macOS High Sierra and running Python 3.7. The filename is read using os.listdir().
I'd appreciate suggestions on how to handle this behavior (getting rid of the "ü" is not an option).
Instead of comparing the strings directly, compare their unicodedata.normalize() results, using the same form parameter for both.
From the documentation: Comparing strings
A second tool is the unicodedata module's normalize() function that converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won't falsely report inequality if two strings use combining characters differently.
import unicodedata

def compare_strs(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)
    return NFD(s1) == NFD(s2)
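To see why this works for the reported case (macOS's HFS+ file system stores names in decomposed form, while most other sources use the composed form):

```python
import unicodedata

composed = '\u00fc'      # 'ü' as one code point (typical of XML sources)
decomposed = 'u\u0308'   # 'u' + COMBINING DIAERESIS (typical of macOS paths)

# The raw strings differ, but their normalized forms are equal.
assert composed != decomposed
assert unicodedata.normalize('NFD', composed) == decomposed
assert unicodedata.normalize('NFC', decomposed) == composed
```

Either NFD or NFC works here, as long as both strings are normalized to the same form before comparing.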

finding all float literals in Python code

I am trying to find all occurrences of a literal float value in Python code. Can I do that in Komodo (or in any other way)?
In other words, I want to find every line where something like 0.0 or 1.5 or 1e5 is used, assuming it is interpreted by Python as a float literal (so no comments, for example).
I'm using Komodo 6.0 with Python 3.1.
If possible, a way to find string and integer literals would be nice to have as well.
Our SD Source Code Search Engine (SCSE) can easily do this.
SCSE is a tool for searching large source code bases, much faster than grep, by indexing the elements of the source code languages of interest. Queries can then be posed, which use the index to enable fast location of search hits. Queries and hits are displayed in a GUI, and a click on a hit will show the block of source code containing the hit.
The SCSE knows the lexical structure of each language it has indexed, with the same precision as that language's compiler. (It uses front ends from a family of accurate programming language processors; this family is pretty large and happens to include the OP's target language of Python, as well as Perl, Java, ...). Thus it knows exactly where identifiers, comments, and literals (integer, float, character or string) are, and exactly their content.
SCSE queries are composed of commands representing sequences of language elements of interest. The query
'for' ... I '=' N=103
finds a for keyword near ("...") an arbitrary identifier (I) which is initialized ("=") with the numeric value ("N") 103. Because SCSE understands the language structure, it ignores language whitespace between the tokens, e.g., it can find this regardless of intervening blanks, newlines or comments.
The query tokens I, N, F, S, C represent I(dentifier), Natural (number), F(loat), S(tring) and C(omment) respectively. The OP's original question, of finding all the floats, is thus the nearly trivial query
F
Similarly for finding all String literals ("S") and integral literals ("N"). If you wanted to find just copies of values near Pi, you'd add low and upper bound constraints:
F>3.14<3.16
(It is pretty funny to run this on large Fortran codes; you see all kinds of bad approximations of Pi).
SCSE won't find a float in a comment or a string, because it intimately knows the difference. Writing a grep-style expression that handles all the strange combinations (eliminating whitespace, surrounding quotes and comment delimiters) is obviously a lot more painful. Grep isn't the way to do this.
You could do that by selecting what you need with regular expressions.
This command (run it on a terminal) should do the trick:
sed -r "s/^([^#]*)#.*$/\1/g" YOUR_FILE | grep -P "[^'\"\w]-?[1-9]\d*[.e]\d*[^'\"\w]"
You'll probably need to tweak it to get a better result.
`sed` cuts out comments, while `grep` selects only lines containing float values (a small subset of them; the expression I gave is not perfect)...
Hope it helps.
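A more robust pure-Python alternative to the regex is the standard library's tokenize module, which classifies tokens the same way the interpreter does, so strings and comments can never be mistaken for numbers. A sketch (the float-vs-integer test is my own):

```python
import io
import tokenize

def find_float_literals(source):
    # Tokenize the source and keep NUMBER tokens that look like floats:
    # they contain a '.' or an exponent, and are not hex literals
    # (which may contain a letter 'e', like 0xE5).
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NUMBER:
            text = tok.string.lower()
            if not text.startswith('0x') and ('.' in text or 'e' in text):
                hits.append((tok.start[0], tok.string))
    return hits
```

On a source like "x = 0.0  # 3.14\ny = 1e5\nz = 42\n" it reports the 0.0 and the 1e5 with their line numbers, but not 42, nor the 3.14 inside the comment. Replacing the condition with checks for tokenize.STRING tokens, or for NUMBER tokens without '.' and 'e', covers the string- and integer-literal cases too.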
