How to remove font from text in Python

I have the letter 'ᴇ' in a text, and when I check if 'ᴇ' == 'e' it returns False. How can I convert 'ᴇ' to 'E'?
I tried to encode it, but when I run:
>>> 'ᴇ'.encode("utf-16")
b'\xff\xfe\x07\x1d'

The problem is that what you're trying to do isn't "remove a font". You're trying to map one or more Unicode characters into some other subset of characters. There doesn't seem to be an out-of-the-box way to do what you want, probably because there are so many different things you might "actually want".
All of the following, and many other distinct characters, are arguably "e"s: ᴇ, ꠄ, 𝔼, E, ⅇ, aͤ *, Ӭ, Ꭱ, e, Ⓔ, ℰ, é
Somehow you'll have to decide which of these you want to transform, into what, and which you'd like to leave alone. Even assuming reasonable answers exist for every question that comes up, there are simply a lot of Unicode characters; don't actually try to cover all cases.
Depending on the scope of the transformation you have in mind, you may be able to use str.translate or something funky with codecs.register_error to perform an OK transformation on many possible inputs (see the sketch after the footnote).
*That's actually two characters, the "a" is just an "a".
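If a hand-picked mapping covers your data, str.translate does the job directly. A minimal sketch (the table below is illustrative, not exhaustive; deciding which characters belong in it is exactly the judgment call described above):
# Fold a hand-picked set of "e"-like characters to plain E/e.
E_LIKE = {
    '\u1d07': 'E',  # ᴇ LATIN LETTER SMALL CAPITAL E
    '\u24ba': 'E',  # Ⓔ CIRCLED LATIN CAPITAL LETTER E
    '\u2130': 'E',  # ℰ SCRIPT CAPITAL E
    '\u2147': 'e',  # ⅇ DOUBLE-STRUCK ITALIC SMALL E
}
table = str.maketrans(E_LIKE)

print('\u1d07' == 'e')                # False, as in the question
print('h\u1d07llo'.translate(table))  # hEllo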

Related

Geopandas encodings for "never seen before" characters

These days I'm struggling with geographical dataframes that I'm managing with geopandas. My problem comes from the weird format of special characters in the names of regions and towns. I have never seen this format before. Fortunately there are not many of them.
I tried all kinds of encodings, from latin-1 to several ISO ones, but the only approach that appears to work properly is manual replacement with a dictionary (which I don't like, as it is built only from the examples I can find in the dataframe itself; if the data changes in the future, new characters will be missed).
Here's an example of how I approached the replacement. Since I couldn't find any encoding that let me read the dataframe properly, I passed 'utf-8' as the encoding parameter to the geopandas opener.
df1 = gpd.read_file('path/to/my/file.shp', encoding='utf-8')
The result obtained is the same as in the example anyway. For the sake of the example I include only 2 rows, but in my original dataframe there is at least one occurrence for each pair in the dictionary.
df = pd.DataFrame(
    [[b"Pr\x8e-Saint-Didier", b"Vall\x8e d'Aoste"],
     [b"Bozen", b"Trentino Alto Adige - S\x9ddtirol"]],
    columns=['town', 'region'])

special_chars = {
    '\x9f': 'ü',
    '\x93': 'ì',
    '\xed': 'ì',
    '\x8e': 'é',
    '\x8f': 'è',
    '\x8d': 'ç',
    '\x90': 'ê',
    '\x98': 'ò',
    '\x9d': 'ù',
    '\x88': 'à',
}

df['town'] = df['town'].str.decode('latin-1').replace(special_chars, regex=True)
df['region'] = df['region'].str.decode('latin-1').replace(special_chars, regex=True)
Does anybody have any idea on how to solve this problem?
How to handle it?
It is probably an existing encoding, so you have several possibilities: check a few of these characters on Wikipedia; some accented characters come with a list of the encodings that contain them. In this case, I found that an old MacOS codepage got some of your characters right, so I checked the other Mac encodings, and I think I found it.
Alternatively (and do this if you have many different files and encodings): you can write a Python script with a short conversion table and iterate over all the encodings, then select the 3 encodings with the best score (and maybe also print the characters as each encoding decodes them). This takes longer the first time, but if you run into this problem often it will help you (especially since it seems you are dealing with old data).
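A minimal sketch of that brute-force scoring (the expected characters come from the question's replacement dictionary; the candidate list is illustrative):
# Bytes from the data paired with the characters we expect them to decode to
# (taken from the question's replacement dictionary).
expected = {b'\x8e': 'é', b'\x9d': 'ù', b'\x88': 'à'}

# An illustrative candidate list; in practice, iterate every codec you can think of.
candidates = ['latin-1', 'cp1252', 'cp850', 'mac_roman', 'mac_latin2']

scores = []
for enc in candidates:
    score = 0
    for raw, wanted in expected.items():
        try:
            if raw.decode(enc) == wanted:
                score += 1
        except UnicodeDecodeError:
            pass
    scores.append((score, enc))

# Print the three best-matching encodings; they are the likeliest guesses.
for score, enc in sorted(scores, reverse=True)[:3]:
    print(enc, score)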
Note: It seems that a few of your guesses may be wrong (wrong case?).
What I found?
I think it is Mac OS Roman, or maybe some related Mac OS encoding. Now it is your task to check carefully whether my guess is correct (I did not check all the characters).
Note: This encoding is known as mac_roman in Python.
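If the guess is right, decoding the question's raw bytes directly should give readable names with no replacement dictionary at all:
>>> b"Pr\x8e-Saint-Didier".decode('mac_roman')
'Pré-Saint-Didier'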

Make overlines in Python

Hello my fellow coders!
I'm an absolute beginner to Python and coding in general. Right now I'm writing code that converts regular Arabic numerals to Roman numerals. For numbers larger than 3,999, the Romans usually wrote a line over a letter to make it a thousand times larger; for example, IV with a line over it represented 4,000. How is this possible in Python? I have understood that you can create an "overscore" by writing "\u203E". How can I make this appear over a letter instead of beside it?
Regards
You need to use the combining character U+0304 instead.
>>> print(u'a\u0304')
ā
U+0305 is probably a better choice (as viraptor suggests). You can also use the Unicode Roman numerals (U+2160 through U+217f) instead of regular uppercase Latin letters, although (at least in my terminal) they don't render as well with the overline.
>>> print(u'\u2163\u0305')
Ⅳ̅
>>> print(u'I\u0305V\u0305')
I̅V̅
(Or, as it renders in my terminal: notice the overline is centered over, but does not completely cover, the single-character Roman numeral four.)
(Any pure text option will only be as good as the font and renderer used by the person running the code. Case in point, the I+V version does not even display consistently while I type this; sometimes the overbars are over the letters, sometimes they follow the letters.)
A combining overline is \u0305 and it works quite well with "IV". What you want is, for example, u'I\u0305V\u0305' (which gives I̅V̅).
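Wrapped in a small helper, the same idea looks like this (a sketch; how well the result renders still depends on the reader's font):
def overline(text):
    """Follow every character with a combining overline (U+0305)."""
    return ''.join(ch + '\u0305' for ch in text)

print(overline('IV'))  # I̅V̅, i.e. 4,000 in the over-barred Roman convention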
I looked for something online but didn't find it. The best workaround I'd suggest would be the following:
def over(character):
    return "_\n" + character
Such as:
>>> print(over("M"))
_
M
>>>

position-independent comparison of Hangul characters

I am writing a python3 program that has to handle text in various writing systems, including Hangul (Korean) and I have problems with the comparison of the same character in different positions.
For those unfamiliar with Hangul (not that I know much about it, either), this script has the almost unique feature of combining the letters of a syllable into square blocks. For example 'ㅎ' is pronounced [h] and 'ㅏ' is pronounced [a], the syllable 'hah' is written '핳' (in case your system can't render Hangul: the first h is displayed in the top-left corner, the a is in the top-right corner and the second h is under them in the middle). Unicode handles this by having two different entries for each consonant, depending on whether it appears in the onset or the coda of a syllable. For example, the previous syllable is encoded as '\u1112\u1161\u11c2'.
My code needs to compare two chars, considering them equal if they differ only in position. Simple comparison does not do this, even after applying Unicode normalizations. Is there a way to do it?
You will need to use a tailored version of the Unicode Collation Algorithm (UCA) that assigns equal weights to identical syllables. The UCA technical report describes the general problem for sorting Hangul.
Luckily, the ICU library has a set of collation rules that does exactly this: ko-u-co-search – Korean (General-Purpose Search), which you can try out on their demo page. To use this in Python, you will either need to use a library like PyICU, or one that implements the UCA and supports the ICU rule file format (or lets you write your own rules).
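A minimal sketch with PyICU (this assumes PyICU is installed and the underlying ICU build ships the Korean search tailoring; the locale string uses ICU's keyword syntax for it):
import icu

# Request the Korean "search" collation tailoring.
collator = icu.Collator.createInstance(icu.Locale('ko@collation=search'))
# Compare at primary strength so positional variants weigh the same.
collator.setStrength(icu.Collator.PRIMARY)

# U+1112 (choseong hieuh) vs U+11C2 (jongseong hieuh): the same letter
# in onset and coda position. Should compare equal under this tailoring.
print(collator.compare('\u1112', '\u11c2') == 0)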
I'm the developer for Python jamo (the Hangul letters are called jamo). An easy way to do this would be to cast all jamo code points to their respective Hangul compatibility jamo (HCJ) code points. HCJ is the display form of jamo characters, so initial and final forms of consonants are the same code point.
For example:
>>> import jamo
>>> initial, vowel, final = jamo.j2hcj('\u1112\u1161\u11c2')
>>> initial == final
True
The way this is done internally is with a lookup table copied from the Unicode specifications.

Python: Custom sort a list of lists

I know this has been asked before, but I have not been able to find a solution.
I'm trying to alphabetize a list of lists according to a custom alphabet.
The alphabet is a representation of the Burmese script as used by Sgaw Karen, in plain ASCII. The Burmese script is an alphasyllabary: a few dozen onsets, a handful of medial diacritics, and a few dozen rhymes that can be combined in thousands of different ways, each combination being a single "character" representing one syllable. The map.txt file has these syllables, listed in (Karen/Burmese) alphabetical order but converted in some unknown way into ASCII symbols, so the first character is u>m;.Rf rather than က or [ka̰]. For example:
u>m;.Rf ug>m;.Rf uH>m;.Rf uX>m;.Rf uk>m;.Rf ul>m;.Rf uh>m;.Rf uJ>m;.Rf ud>m;.Rf uD>m;.Rf u->m;.Rf uj>m;.Rf us>m;.Rf uV>m;.Rf uG>m;.Rf uU>m;.Rf uS>m;.Rf u+>m;.Rf uO>m;.Rf uF>m;.Rf
c>m;.Rf cg>m;.Rf cH>m;.Rf cX>m;.Rf ck>m;.Rf cl>m;.Rf ch>m;.Rf cJ>m;.Rf cd>m;.Rf cD>m;.Rf c->m;.Rf cj>m;.Rf cs>m;.Rf cV>m;.Rf cG>m;.Rf cU>m;.Rf cS>m;.Rf c+>m;.Rf cO>m;.Rf cF>m;.Rf
Each list in the list of lists has, as its first element, a word of Sgaw Karen converted into ASCII symbols in the same way. For example:
[['u&X>', 'n', 'yard'], ['vk.', 'n', 'yarn'], ['w>ouDxD.', 'n', 'yawn'], ['w>wuDxD.', 'n', 'yawn']]
This is what I have so far:
def alphabetize(word_list):
    alphabet = ''.join([line.rstrip() for line in open('map.txt', 'rb')])
    word_list = sorted(word_list, key=lambda word: [alphabet.index(c) for c in word[0]])
    return word_list
I would like to alphabetize word_list by the first element of each list (eg. 'u&X>', 'vk.'), according to the pattern in alphabet.
My code's not working yet and I'm struggling to understand the sorted command with lambda and the for loop.
First, if you're trying to look up the entire word[0] in alphabet, rather than each character individually, you shouldn't be looping over the characters of word[0]. Just use alphabet.index(word[0]) directly.
From your comments, it sounds like you're trying to look up each transliterated-Burmese-script character in word[0]. That isn't possible unless you can write an algorithm to split a word up into those characters. Splitting it up into the ASCII bytes of the transliteration doesn't help at all.
Second, you probably shouldn't be using index here. When you think you need to use index or similar functions, 90% of the time, that means you're using the wrong data structure. What you want here is a mapping (presumably why it's called map.txt), like a dict, keyed by words, not a list of words that you have to keep explicitly searching. Then, looking up a word in that dictionary is trivial. (It's also a whole lot more efficient, but the fact that it's easy to read and understand can be even more important.)
Finally, I suspect that your map.txt is supposed to be read as a whitespace-separated list of transliterated characters, and what you want to find is the index into that list for any given word.
So, putting it all together, something like this:
with open('map.txt') as f:  # text mode, so the keys are str like word[0]
    mapping = {word: index for index, word in enumerate(f.read().split())}

word_list = sorted(word_list, key=lambda word: mapping[word[0]])
But, again, that's only going to work for one-syllable words, because until you can figure out how to split a word up into the units that should be alphabetized (in this case, the symbols), there is no way to make it work for multi-syllable words.
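One hedged way to do that split, assuming the transliterations are unambiguous under longest-match, is a greedy scan against the keys of the mapping (split_syllables is a hypothetical helper, not part of any library):
def split_syllables(word, mapping):
    """Greedily split `word` into the longest matching keys of `mapping`."""
    units = []
    i = 0
    longest = max(map(len, mapping))
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            if word[i:i + size] in mapping:
                units.append(word[i:i + size])
                i += size
                break
        else:
            raise ValueError('cannot segment %r at position %d' % (word, i))
    return units

word_list = sorted(word_list,
                   key=lambda word: [mapping[s] for s in split_syllables(word[0], mapping)])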
And once you've written the code that does that, I'll bet it would be pretty easy to just convert everything to proper Unicode representations of the Burmese script. Each syllable still takes 1-4 code points in Unicode, but that's fine: the Unicode collation algorithm, available to Python through libraries like PyICU or pyuca, already knows how to alphabetize that script properly, so you don't have to write it yourself.
Or, even better, unless this is some weird transliteration that you or your teacher invented, there's probably already code to translate between this format and Unicode, which means you shouldn't even have to write anything yourself.

Problem with list of strings in python

Why on Earth doesn't the interpreter raise SyntaxError everytime I do this:
my_abc = ['a',
          'b',
          'c'
          'd',]
I just wanted to add 'c' to the list of strings and forgot to append the comma. I would expect this to cause some kind of error, as it's clearly incorrect.
Instead, what I got:
>>> my_abc
['a', 'b', 'cd']
And this is never what I want.
Why is it automatically concatenated? I can hardly count how many times I got bitten by this behavior.
Is there anything I can do with it?
Just to clarify: I don't actually mind auto-concatenation; my problem is ONLY with lists of strings, because they often do much more than just carry text: they're used to control flow, to pass field names, and many other things.
It's called "implicit string concatenation", and a PEP proposing its removal was rejected: http://www.python.org/dev/peps/pep-3126/
It's by design. It allows, for example, writing long string literals in several lines without using +.
As others said, it's by design.
Why is it so? Mostly for historical reasons: C also does it.
In some cases it's handy because it reduces syntactic noise and avoids adding unwanted spaces (inline SQL queries, complex regexes, etc.).
What can you do about it? Not much, but if it happens to you often, try one of the following tricks.
Indent your list with the comma at the beginning of each line. It looks weird, but done that way, a missing comma becomes obvious.
Assign the strings to variables and build the list from the variables whenever you can (often a good idea for other reasons too, like avoiding duplicated strings).
Split your list: for a list of words you can put the whole list inside a single string and split it, as below. For more than 5 elements it's also shorter.
'a b c d e'.split(' ')
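For instance, here is what the first and last tricks look like in practice (a sketch; my_abc is just the example list from the question):
# Comma-first layout: every element after the first starts with ", ",
# so a missing comma stands out immediately.
my_abc = [ 'a'
         , 'b'
         , 'c'
         , 'd'
         ]

# The split() trick for simple word lists:
my_abc = 'a b c d'.split()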
Because two string literals side-by-side, delimited by whitespace, are concatenated. Since the strings are within a list, they are 'side-by-side'.
See: http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation
Because often people want to do something like this:
line = ("Here's a very long line, with no line breaks,"
        " which should be displayed to the user (perhaps"
        " as an error message or question box).")
It's easier to write this without having to manually concatenate strings. C and C++ have the same behavior: adjacent string literals are concatenated by the compiler.
