How to convert formatted text to plain text [duplicate] - python

Is there a standard way, in Python, to normalize a unicode string, so that it only comprises the simplest unicode entities that can be used to represent it?
I mean, something which would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE']?
Here is where the problem shows up:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "á"
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the chars and do manual replacements, etc., but it is not efficient, and I'm pretty sure I would miss half of the special cases and make mistakes.

The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A + U+0301 COMBINING ACUTE ACCENT combination and the U+00E1 LATIN SMALL LETTER A WITH ACUTE code point you used:
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'
(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear).
NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you the decomposed form, using combining characters.
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form.
Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:
>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'
Note that there is no guarantee that composition and decomposition are inverses of each other; normalizing a precomposed character to NFD form and then converting the result back to NFC form does not always give back the original character. The Unicode standard maintains a list of exceptions; characters on this list have a canonical decomposition, but are not re-composed back to their precomposed form, for various reasons. Also see the documentation on the Composition Exclusion Table.
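For instance, a minimal sketch using U+212B ANGSTROM SIGN, which has a singleton decomposition and is therefore never re-composed: decomposing and re-composing it yields U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE rather than the original code point.
import unicodedata

angstrom = "\u212b"                              # ANGSTROM SIGN
decomposed = unicodedata.normalize("NFD", angstrom)
recomposed = unicodedata.normalize("NFC", decomposed)
print(ascii(decomposed))         # 'A\u030a' (A + COMBINING RING ABOVE)
print(ascii(recomposed))         # '\xc5'    (LATIN CAPITAL LETTER A WITH RING ABOVE)
print(recomposed == angstrom)    # False: the original code point is not restored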

Yes, there is.
unicodedata.normalize(form, unistr)
You need to select one of the four normalization forms.
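For a quick illustration of how the four forms differ, here is a small sketch applied to U+00E1 (á) and the compatibility character U+FB01 (the 'fi' ligature); the K forms additionally expand the ligature:
import unicodedata

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form,
          ascii(unicodedata.normalize(form, "\u00e1")),    # á
          ascii(unicodedata.normalize(form, "\ufb01")))    # fi ligature
# NFC  '\xe1'     '\ufb01'
# NFD  'a\u0301'  '\ufb01'
# NFKC '\xe1'     'fi'
# NFKD 'a\u0301'  'fi'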

Related

How can I sort a string based on the ASCII value of the characters

I am trying to create a program that can get a string and return the string with the characters sorted by their ASCII Values (Ex: "Hello, World!" should return "!,HWdellloor").
I've tried this and it worked :
text = "Hello, World!"
cop = []
for i in range(len(text)):
cop.append(ord(text[i]))
cop.sort()
ascii_text = " ".join([str(chr(item)).strip() for item in cop])
print(ascii_text)
But I was curious to see if something like this is possible with string manipulation functions alone.
ASCII is a subset of Unicode: the first 128 codepoints (0 through 127) of Unicode are identical to the 128 codepoints of ASCII (though when ASCII was invented in 1963 nobody used the word "codepoint"). So if your data is all ASCII and you just call the list.sort() method or the built-in sorted() function, you will get it in ASCII order.
Like this:
>>> text = "Hello, World!"
>>> ''.join(sorted(text))
' !,HWdellloor'
The result is sorted by the Unicode codepoint of the character, but since that is the same as the ASCII codepoint, there is no difference.
As you can see, the first character in the sorted result is a space (ASCII 32), the printable character with the lowest codepoint (yes, you can argue about whether it is printable). In general, ASCII order puts digits (codepoints 48-57) before the uppercase letters (65-90), which in turn come before the lowercase letters (97-122), with punctuation and symbols scattered in the gaps between, and control codes such as carriage return and linefeed in the range 0-31 plus 127.
You can confirm this by sorting string.ascii_letters:
>>> import string
>>> ''.join(sorted(string.ascii_letters))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
and then checking the sequence against one of the many ascii tables available.
If you are wondering about the placement of the various blocks of characters, uppercase and lowercase were arranged in such a way that there was a one-bit difference between an uppercase character (example A: 0x41) and the corresponding lowercase character (a: 0x61), which simplified case-insensitive matching; and there were similar arguments for placement of the other blocks. For some of the control codes there were strong legacy imperatives, from punched paper tape patterns.
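A quick sketch of that one-bit difference; the 0x20 bit is the only thing distinguishing the two cases, which is why masking it out gives a crude ASCII-only upper-casing:
>>> hex(ord('A')), hex(ord('a'))
('0x41', '0x61')
>>> hex(ord('A') ^ ord('a'))    # the two codepoints differ in exactly one bit
'0x20'
>>> chr(ord('g') & ~0x20)       # clearing that bit upper-cases an ASCII letter
'G'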

How to deal with variations in apostrophe ’ and ' [duplicate]

On my website people can post news and quite a few editors use MS word and similar tools to write the text and then copy&paste into my site's editor (simple textarea, no WYSIWYG etc.).
Those texts usually contain "nice" quotes instead of the plain ascii ones ("). They also sometimes contain those longer dashes like – instead of -.
Now I want to replace all those characters with their ASCII counterparts. However, I do not want to remove umlauts and other non-ASCII characters. I'd also highly prefer to use a proper solution that does not involve creating a mapping dict for all those characters.
All my strings are unicode objects.
What about this?
It creates a translation table first, but honestly I don't think you can do this without one.
transl_table = dict( [ (ord(x), ord(y)) for x, y in zip( u"‘’´“”–-", u"'''\"\"--") ] )
with open( "a.txt", "w", encoding = "utf-8" ) as f_out:
    a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes” "
    print( " a_str = " + a_str, file = f_out )
    fixed_str = a_str.translate( transl_table )
    print( " fixed_str = " + fixed_str, file = f_out )
I wasn't able to get this to print to a console (on Windows), so I had to write to a txt file.
The output in the a.txt file looks as follows:
a_str = ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
fixed_str = 'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"
By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the differences in how the two versions of the language handle Unicode strings.
There is no such "proper" solution, because for any given Unicode character there is no "ASCII counterpart" defined.
For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to the names:
#!/usr/bin/env python3
import unicodedata

def unicode_character_name(char):
    try:
        return unicodedata.name(char)
    except ValueError:
        return None

# Generate all Unicode characters with their names
all_unicode_characters = []
for n in range(0, 0x10ffff):    # Unicode planes 0-16
    char = chr(n)               # Python 3
    # char = unichr(n)          # Python 2
    name = unicode_character_name(char)
    if name:
        all_unicode_characters.append((char, name))

# Find all Unicode quotation marks
print(' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
# " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 " 🙶 🙷 🙸

# Find all Unicode hyphens
print(' '.join([char for char, name in all_unicode_characters if 'HYPHEN' in name]))
# - ­ ֊ ᐀ ᠆ ‐ ‑ ‧ ⁃ ⸗ ⸚ ⹀ ゠ ﹣ - 󠀭

# Find all Unicode dashes
print(' '.join([char for char, name in all_unicode_characters if 'DASH' in name and 'DASHED' not in name]))
# ‒ – — ⁓ ⊝ ⑈ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ╌ ╍ ╎ ╏ ⤌ ⤍ ⤎ ⤏ ⤐ ⥪ ⥫ ⥬ ⥭ ⩜ ⩝ ⫘ ⫦ ⬷ ⸺ ⸻ ⹃ 〜 〰 ︱ ︲ ﹘ 💨
As you can see, as easy as this example is, there are many problems. There are many quotation marks in Unicode that don't look anything like the quotation marks in US-ASCII and there are many hyphens in Unicode that don't look anything like the hyphen-minus sign in US-ASCII.
And there are many questions. For example:
should the "SWUNG DASH" (⁓) symbol be replaced with an ASCII hyphen (-) or a tilde (~)?
should the "CANADIAN SYLLABICS HYPHEN" (᐀) be replaced with an ASCII hyphen (-) or an equals sign (=)?
should the "SINGLE LEFT-POINTING ANGLE QUOTATION MARK" (‹) be replaces with an ASCII quotation mark ("), an apostrophe (') or a less-than sign (<)?
To establish a "correct" ASCII counterpart, somebody needs to answer these questions based on the use context. That's why all the solutions to your problem are based on a mapping dictionary in one way or another. And all these solutions will provide different results.
You can build on top of the unidecode package.
This is pretty slow, since we are normalizing all the Unicode first to the composed form, then trying to see what unidecode turns it into. If we match a Latin letter, then we actually use the original NFC character. If not, then we yield whatever de-garbling unidecode has suggested. This leaves accented letters alone, but will convert everything else.
import unidecode
import unicodedata
import re

def char_filter(string):
    latin = re.compile('[a-zA-Z]+')
    for char in unicodedata.normalize('NFC', string):
        decoded = unidecode.unidecode(char)
        if latin.match(decoded):
            yield char
        else:
            yield decoded

def clean_string(string):
    return "".join(char_filter(string))

print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). However, read the doc related to Unicode -- the translation table has another form: unicode ordinal number --> unicode string (usually char) or None.
Well, but it requires the dict. You have to capture the replacements anyway. How do you want to do that without any table or arrays? You could use str.replace() for the single characters, but this would be inefficient.
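For illustration, a minimal sketch of that table form (ordinal -> replacement string, or None to delete the character); the specific characters are only examples:
table = {
    ord("\u2019"): "'",     # ’ RIGHT SINGLE QUOTATION MARK -> apostrophe
    ord("\u2013"): "-",     # – EN DASH -> hyphen-minus
    ord("\u200b"): None,    # ZERO WIDTH SPACE -> deleted entirely
}
print("don\u2019t \u2013 really\u200bshort".translate(table))
# don't - reallyshort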
This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html
-S, --smart
    Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.” (Note: This option is significant only when the input format is markdown or textile. It is selected automatically when the input format is textile or the output format is latex or context.)
It's Haskell, so you'd have to figure out the interface.

Understanding unistr of unicodedata.normalize()

Wikipedia basically says the following for the four values of the form argument.
- NFC (Normalization Form Canonical Composition)
  - Characters are decomposed
  - then recomposed by canonical equivalence.
- NFKC (Normalization Form Compatibility Composition)
  - Characters are decomposed by compatibility
  - recomposed by canonical equivalence
- NFD (Normalization Form Canonical Decomposition)
  - Characters are decomposed by canonical equivalence
  - multiple combining characters are arranged in a specific order.
- NFKD (Normalization Form Compatibility Decomposition)
  - Characters are decomposed by compatibility
  - multiple combining characters are arranged in a specific order.
So for each of the choices, is it a two-step transform? But normalize() only shows the final result. Is there a way to see the intermediate results?
Wiki also says
For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
But I cannot reproduce it with normalize(). Could anybody provide more examples to show how these four options work? What are their differences?
>>> from unicodedata import normalize
>>> print(normalize('NFC', 'A°'))
A°
>>> print(normalize('NFKC', 'A°'))
A°
>>> print(normalize('NFD', 'Å'))
Å
>>> print(normalize('NFKD', 'Å'))
Å
>>> len(normalize('NFC', 'A°'))
2
>>> len(normalize('NFKC', 'A°'))
2
>>> len(normalize('NFD', 'Å'))
2
>>> len(normalize('NFKD', 'Å'))
2
I am asking for examples to show the difference among the four cases.
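One likely reason the snippets above don't change anything: the '°' in 'A°' appears to be U+00B0 DEGREE SIGN, a spacing character that never composes, rather than U+030A COMBINING RING ABOVE. A small sketch using the actual combining mark reproduces the Wikipedia example:
>>> from unicodedata import name, normalize
>>> name('°')
'DEGREE SIGN'
>>> print(ascii(normalize('NFC', 'A\u030a')))     # A + COMBINING RING ABOVE
'\xc5'
>>> print(ascii(normalize('NFD', '\u212b')))      # ANGSTROM SIGN
'A\u030a'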
Try the following script (Python 3):
# -*- coding: utf-8 -*-
import sys
from unicodedata import normalize

def encodeuni(s):
    '''
    Returns input string encoded to escape sequences as in a string literal.
    Output is similar to
        str(s.encode('unicode_escape')).strip("b'").replace('\\\\','\\');
    but even every ASCII character is encoded as a \\xNN escape sequence
    (except a space character). For instance:
        s = 'A á ř 🌈';
        encodeuni(s);    # '\\x41 \\xe1 \\u0159 \\U0001f308' while
        str(s.encode('unicode_escape')).strip("b'").replace('\\\\','\\');
                         # 'A \\xe1 \\u0159 \\U0001f308'
    '''
    def encodechar(ch):
        ordch = ord(ch)
        return ( ch if ordch == 0x20 else
                 f"\\x{ordch:02x}" if ordch <= 0xFF else
                 f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                 f"\\U{ordch:08x}" )
    return ''.join([encodechar(ch) for ch in s])

if len(sys.argv) >= 2:
    letters = ' '.join([sys.argv[i] for i in range(1, len(sys.argv))])
    # .\SO\59979037.py ÅÅÅ🌈
else:
    letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
    # \u212B     Å  Angstrom Sign
    # \u00C5     Å  Latin Capital Letter A With Ring Above
    # \u0041     A  Latin Capital Letter A
    # \u030A     ̊  Combining Ring Above
    # \U0001f308 🌈 Rainbow

print('\t'.join(['raw', letters.ljust(10), str(len(letters)), encodeuni(letters), '\n']))
for form in ['NFC', 'NFKC', 'NFD', 'NFKD']:
    letnorm = normalize(form, letters)
    print('\t'.join([form, letnorm.ljust(10), str(len(letnorm)), encodeuni(letnorm)]))
The output below (captured in a Windows 10 cmd prompt, font DejaVu Sans Mono) also shows a feature/bug in WebKit browsers (Chrome, Safari): they normalize form data to NFC:
.\SO\59979037.py ÅÅÅ
raw ÅÅÅ 4 \u212b\xc5\x41\u030a
NFC ÅÅÅ 3 \xc5\xc5\xc5
NFKC ÅÅÅ 3 \xc5\xc5\xc5
NFD ÅÅÅ 6 \x41\u030a\x41\u030a\x41\u030a
NFKD ÅÅÅ 6 \x41\u030a\x41\u030a\x41\u030a

Regex - match a character and all its diacritic variations (aka accent-insensitive)

I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do of course is:
re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é")
but that is not a general solution. If I use unicode categories like \pL I can't reduce the match to a specific character, in this case e.
A workaround to achieve the desired goal would be to use unidecode to get rid of all diacritics first, and then just match against the regular e:
re.match(r"^e$", unidecode("é"))
Or in this simplified case
unidecode("é") == "e"
Another solution which doesn't depend on the unidecode-library, preserves unicode and gives more control is manually removing the diacritics as follows:
Use unicodedata.normalize() to turn your input string into normal form D (for decomposed), making sure composite characters like é get turned into the decomposed form e\u0301 (e + COMBINING ACUTE ACCENT)
>>> input = "Héllô"
>>> input
'Héllô'
>>> normalized = unicodedata.normalize("NFKD", input)
>>> normalized
'He\u0301llo\u0302'
Then, remove all codepoints which fall into the category Mark, Nonspacing (short Mn). Those are all characters that have no width themselves and just decorate the previous character.
Use unicodedata.category() to determine the category.
>>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
>>> stripped
'Hello'
The result can be used as a source for regex-matching, just as in the unidecode-example above.
Here's the whole thing as a function:
import unicodedata

def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

How to clean the tweets having a specific but varying length pattern?

I pulled out some tweets for analysis. When I separate the words in the tweets I can see a lot of the following expressions in my output:
\xe3\x81\x86\xe3\x81\xa1
I want to use regular expressions to replace these patterns with nothing. I am not very good with regex. I tried solutions from some similar questions, but nothing worked for me; they ended up removing characters like "xt" from "extra".
I am looking for something that will replace \x?? with nothing, where ?? can be a-f or 0-9, but the token must be 4 characters long and start with \x.
I would also like to replace anything other than alphabetic characters. For example:
"Hi!! my number is (7097868709809)."
after replacement should yield
"Hi my number is."
Input:
\xe3\x81\x86\xe3Extra
Output required:
Extra
What you are seeing is Unicode characters that can't directly be printed, expressed as pairs of hexadecimal digits. So for a more printable example:
>>> ord('a')
97
>>> hex(97)
'0x61'
>>> "\x61"
'a'
Note that what appears to be a sequence of four characters '\x61' evaluates to a single character, 'a'. Therefore:
?? can't "be anything" - they can be '0'-'9' or 'a'-'f'; and
Although e.g. r'\\x[0-9a-f]{2}' would match the sequence you see, that's not what the regex would parse - each "word" is really a single character.
You can remove the characters "other than alphabets" using e.g. string.printable:
>>> s = "foo\xe3\x81"
>>> s
'foo\xe3\x81'
>>> import string
>>> valid_chars = set(string.printable)
>>> "".join([c for c in s if c in valid_chars])
'foo'
Note that e.g. '\xe3' can be directly printed in Python 3 (it's 'ã'), but isn't included in string.printable. For more on Unicode in Python, see the docs.
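If you do want a regex-based version of the same cleanup, here is a small sketch; the printable-ASCII range and the trailing-space fix-up are assumptions about what you want to keep:
>>> import re
>>> re.sub(r'[^\x20-\x7e]+', '', '\xe3\x81\x86\xe3Extra')     # drop anything outside printable ASCII
'Extra'
>>> re.sub(r'[^A-Za-z .]+', '', 'Hi!! my number is (7097868709809).').replace(' .', '.')
'Hi my number is.'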
