I'm looking to validate user input to ensure that all characters in the input string fall within the western-latin character set.
Background
I'm specifically working in Python, but I'm looking more to understand the ISO-8859 character set than receive actual Python code.
For a simpler example of what I'm looking for: if I were looking to ensure that user input was entirely ASCII, I could easily do so by checking that each character's numeric value falls in the range [0-126]:
def is_ascii(s):
    for c in s:
        if not (0 <= ord(c) <= 126):
            return False
    return True
Simple enough! But now I want to validate for ISO-8859 (western latin character set).
Question
Is this a simple case of changing the upper bound for the value of ord(c)?
If so, what value should I replace 126 with?
If not, how do I perform this validation?
Note
I'm expecting to receive characters that are certainly outside of ISO-8859, for example emoji entered from a mobile device's keyboard.
Edit
After some further research, it looks like replacing 126 with 255 could be a valid solution, but I would appreciate it if anyone could confirm this.
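To confirm the idea above: ISO-8859-1 ("Latin-1") maps one-to-one onto Unicode code points 0-255, so raising the bound to 255 does check representability. One caveat: code points 128-159 are C1 control characters that ISO-8859-1 leaves undefined as printable text, so you may want to exclude them as well. A sketch of both the ordinal check and the equivalent codec-based check:

```python
def is_latin1(s):
    # ISO-8859-1 maps byte values directly onto Unicode code points 0-255,
    # so a string is representable iff every character's ordinal is <= 255.
    return all(ord(c) <= 255 for c in s)

def is_latin1_codec(s):
    # Equivalent check: ask Python's codec machinery to encode the string.
    try:
        s.encode('iso-8859-1')
        return True
    except UnicodeEncodeError:
        return False
```

Both reject characters such as emoji, which lie well above U+00FF.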
Related
The following Python code has some weird behavior that I don't understand:
print('\x1bZ')
when I run this code, whether in a file or in the interpreter, I get a weird outcome:
actual values as displayed when you write this value to a file as bytes:
Discoveries at time of posting this question:
whether single quotes or double quotes make a difference (they don't)
0x1b is hex for 27, which in ASCII is ESC, matching what is displayed in the second picture. This led me to theorize that the letter Z in the string literal can be replaced, but as per my test in point 3, the outcome can't be reproduced with other letters
instead of \x1bZ (ESC and then Z), trying ESC and then some other letter (I haven't checked all possibilities) yielded no apparent result, except for replacing Z with c, which seems to clear the terminal
Hypothesis that I came up with:
This page may be relevant to the answer: https://pypi.org/project/py100/ because I found a pattern there that resembles the weird result: Esc[?1;Value0c, where Value would be replaced by something. Also, ^[[?1;<n>0c appears in https://espterm.github.io/docs/VT100%20escape%20codes.html
Is this some encoding problem?
Is this related to ANSI escape sequences? [?1;0c vs [38;2;, which is used when changing the background color of text
Questions:
Why does this particular sequence of characters produce this output?
What is VT100, and how is it related, if at all? (I visited its Wikipedia page.)
Is it possible to print a string that contains this specific sequence without the weird outcome shown in the first picture?
Any help and knowledge about this will be appreciated!
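For context on the behavior described above: ESC Z is an old VT100-era "identify terminal" (DECID) request, and many terminal emulators answer it by writing a device-attributes reply such as ESC[?1;0c back as if it were typed input, which is the strange text that appears. To print such a string without the terminal acting on it, one option is to show its repr(), or to replace the ESC byte with a visible caret notation, as in this sketch:

```python
s = '\x1bZ'

# Show the escape character instead of sending it to the terminal:
print(repr(s))                    # prints '\x1bZ'

# Or substitute ESC with the conventional visible form ^[ :
print(s.replace('\x1b', '^['))    # prints ^[Z
```

Either way, the ESC byte never reaches the terminal uninterpreted, so no identification reply is triggered.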
I'm new to Python and trying to work out how to keep a list of valid characters allowed for a key within my dictionary. This key can be any combination of the characters, all the way down to a single character or empty.
For example:
allowedWalkingDirection = ['N', 'n', 'S', 's', 'E', 'e', 'W', 'w']
def isRotateValid(path):
    if allowedWalkingDirection in path['walk']:
        return path
    return False
And so if I try say: {'rotate':'WeNsE'} my input says it isn't valid.
I'm sorry if this isn't very clear and concise, in short, my goal is to allow the valid walking directions to be input however many times within my key, but it's currently only allowing one character in the string.
OK, upon further brain melting and relentless internet perusing, I've found some help from "Valid characters in a String". I then thought of implementing something like:
def isRotateValid(path):
    for i in range(len(path['walk'])):
        if path['walk'][i] not in allowedWalkingDirection:
            return False
    return True
I am working on a project in which I have to figure out if the name of a person is valid or not. One case of invalidity is a single character name.
In English, it is straightforward to figure out by checking the length.
if len(name) < 2:
    return 0
I am not sure if checking the length will work out for other languages too, like 玺. I am not sure if this is one character or something else.
Can someone help me to solve this issue?
Dataset info:
countries: 125
total names: 11 Million
While I can't vouch for all other languages, checking a python script with the character you provided confirms that according to Python it is still a single character.
print(len("玺"))
if len("玺") < 2:
    print("Single char name")
Another potential solution would be to check ord(char) (29626 for the given char) to check if it is outside the standard latin alphabet and perform additional conditional checks.
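Building on that idea without hard-coding ordinals, the standard library's unicodedata.east_asian_width can flag wide/fullwidth characters, which covers most CJK ideographs and kana. A rough sketch (the min_name_length helper and its per-script thresholds are assumptions for illustration, not a complete solution for a 125-country dataset):

```python
import unicodedata

def looks_cjk(name):
    # 'W' (wide) and 'F' (fullwidth) cover most CJK ideographs and kana.
    return any(unicodedata.east_asian_width(c) in ('W', 'F') for c in name)

def min_name_length(name):
    # Assumption: single-character names are acceptable in CJK scripts.
    return 1 if looks_cjk(name) else 2
```

For 玺 (a CJK Unified Ideograph), east_asian_width returns 'W', so the minimum length would be 1.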
Maybe use a dictionary:
language_dict = {
    'english': 2,
    'chinese': 1
}

if len(name) < language_dict['english']:
    return 0
I've created a program to take in a password and tell the user by how many characters the password is wrong. To do this I've used difflib.Differ().
However, I'm not sure how to create another loop to make it also able to tell me by how many characters the password is wrong if the input is missing characters.
This is the checking function itself.
import difflib

def check_text(correctPass, inputPass):
    count = 0
    difference = 0
    d = difflib.Differ()
    count_difference = list(d.compare(correctPass, inputPass))
    while count < len(count_difference):
        if '+' in count_difference[count]:
            difference += 1
        count += 1
    return difference
At the moment the function can only pick up mistakes if the mistakes are extra characters (correct password in this case is 'rusty').
My understanding of difflib is quite poor. Do I create a new loop, swap the < for a > and the +s for -s? Or do I just change the conditions already in the while loop?
EDIT:
example input/output: rusty33 –> wrong by 2 characters
rsty –> wrong by 0 characters
I'm mainly trying to make the function detect if there are characters missing or extra, not too fussed about order of characters for the moment.
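Given that goal, a second loop isn't needed: Differ marks characters missing from the input with a '-' prefix and extra characters with '+', so counting both prefixes covers both cases in one pass. A sketch of the simplified function:

```python
import difflib

def check_text(correctPass, inputPass):
    d = difflib.Differ()
    # Differ yields one entry per character: '  x' for matches,
    # '- x' for characters missing from the input, '+ x' for extras.
    # Counting both '+' and '-' entries gives the total mismatch count.
    return sum(1 for entry in d.compare(correctPass, inputPass)
               if entry.startswith(('+', '-')))
```

With the correct password 'rusty', the input 'rusty33' now reports 2 and 'rsty' reports 1 (the missing 'u'), instead of 0.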
I am writing a python3 program that has to handle text in various writing systems, including Hangul (Korean) and I have problems with the comparison of the same character in different positions.
For those unfamiliar with Hangul (not that I know much about it, either), this script has the almost unique feature of combining the letters of a syllable into square blocks. For example 'ㅎ' is pronounced [h] and 'ㅏ' is pronounced [a], the syllable 'hah' is written '핳' (in case your system can't render Hangul: the first h is displayed in the top-left corner, the a is in the top-right corner and the second h is under them in the middle). Unicode handles this by having two different entries for each consonant, depending on whether it appears in the onset or the coda of a syllable. For example, the previous syllable is encoded as '\u1112\u1161\u11c2'.
My code needs to compare two chars, considering them as equal if they only differ for their positions. This is not the case with simple comparison, even applying Unicode normalizations. Is there a way to do it?
You will need to use a tailored version of the Unicode Collation Algorithm (UCA) that assigns equal weights to identical syllables. The UCA technical report describes the general problem for sorting Hangul.
Luckily, the ICU library has a set of collation rules that does exactly this: ko-u-co-search – Korean (General-Purpose Search); which you can try out on their demo page. To use this in Python, you will either need use a library like PyICU, or one that implements the UCA and supports the ICU rule file format (or lets you write your own rules).
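If pulling in ICU is too heavy for your use case, a rough standard-library workaround (an assumption that holds for modern jamo, not necessarily archaic ones) is to compare jamo by their Unicode character names with the positional prefix stripped, which effectively maps them onto their compatibility-jamo names:

```python
import unicodedata

def jamo_base(ch):
    # Positional prefixes: CHOSEONG = initial, JUNGSEONG = medial,
    # JONGSEONG = final. Replacing them with LETTER yields the name of
    # the corresponding Hangul compatibility jamo, so initial and final
    # forms of the same consonant compare equal.
    name = unicodedata.name(ch)
    for prefix in ('CHOSEONG', 'JUNGSEONG', 'JONGSEONG'):
        name = name.replace(prefix, 'LETTER')
    return name
```

For example, initial HIEUH (U+1112) and final HIEUH (U+11C2) both reduce to 'HANGUL LETTER HIEUH'.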
I'm the developer for Python jamo (the Hangul letters are called jamo). An easy way to do this would be to cast all jamo code points to their respective Hangul compatibility jamo (HCJ) code points. HCJ is the display form of jamo characters, so initial and final forms of consonants are the same code point.
For example:
>>> import jamo
>>> initial, vowel, final = jamo.j2hcj('\u1112\u1161\u11c2')
>>> initial == final
True
The way this is done internally is with a lookup table copied from the Unicode specifications.