Take a Unicode character from within a string and decode it - python

I'm currently working in Python, and I'm pulling a whole bunch of data from the net, including titles of photos. Some of the strings I'm getting have Unicode escape sequences in them, and I'd like to display them as their original characters.
I know that if I type, for example,
print u'\u00a9'
it will output the right character to the terminal.
However, if I get a string such as:
string = 'Copyright \u00a9 David'
I am not sure how to pull it out.
I managed to pull out the character code with RegEx, but I don't know how to insert it back in without getting an error.
I tried:
char = \u00a9
string = 'Copyright' + u'char' + 'David'
which didn't really work.
I need a way to programmatically pull out the code (which I can do with RegEx), and then re-insert it into the original string with the u'' prefix applied.

I think you're misunderstanding what the u is. It marks a Unicode literal in source code, and has nothing to do with converting string variables from one representation to another.
What you actually need is to decode the string using the "unicode-escape" codec:
>>> print string.decode('unicode-escape')
Copyright © David
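Note that str.decode exists only in Python 2. On Python 3, a rough equivalent, as a sketch, is to take a detour through bytes:
# Python 3 sketch: latin-1 maps code points 0-255 one-to-one onto bytes,
# so it round-trips the escape text safely before unicode-escape decodes it.
s = 'Copyright \\u00a9 David'
print(s.encode('latin-1').decode('unicode-escape'))
# Copyright © David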

There is a good reason why
char = \u00a9
string = 'Copyright' + u'char' + 'David'
doesn't work ;-)
char = u'\u00a9'
string = 'Copyright ' + char + ' David'
print string
# Copyright © David

Store char as char = u'\u00a9' rather than char = \u00a9. Then when you are appending your string just do:
string = 'Copyright ' + char + ' David'
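If you do want to pull the codes out with a regex and re-insert them, as the question describes, here is a sketch (Python 2; on Python 3 use chr() instead of unichr()):
import re

s = u'Copyright \\u00a9 David'   # literal backslash-u escape in the data
decoded = re.sub(r'\\u([0-9a-fA-F]{4})',
                 lambda m: unichr(int(m.group(1), 16)),
                 s)
print decoded
# Copyright © David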

Related

How to convert escaped sequences to literal characters

I have to deal with rtf files where cyrillic characters are converted to escaped sequences:
{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}
I want to convert the cyrillic symbols but leave the rtf tags unchanged. Is there a Pythonic way to do it without third-party apps (like OpenOffice)?
We can first make a list of the hex codes using a regex, then create a bytes object with these values, which we can decode. It appears that your data was encoded using "cp1251".
data = r"pg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
hex_codes = re.findall(r"(?<=')[0-9A-F]{2}", data)
encoded = bytes(int(hcode, 16) for hcode in hex_codes)
# or, as rightly suggested by #Henry Tjhia:
# encoded = bytes.fromhex(''.join(hex_codes))
text = encoded.decode('cp1251')
print(text)
# Список документов
Although @Thierry Lathuille's answer didn't solve the initial problem (I needed the rtf tags left unchanged), it solved the most difficult part. So here is the solution to the initial problem:
string = "{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
hex_codes = re.findall("(?<=')[0-9A-F]{2}", string)
d = {"\\\'" + code: bytes.fromhex(code).decode("cp1251") for code in hex_codes}
for byte, char in d.items():
string = string.replace(byte, char)
print(string)
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}
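The dict-building loop can also be collapsed into a single re.sub call with a callback; a variation on the same idea, not part of the original answer:
import re

string = r"{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
# Decode each \'XX escape in place; cp1251 is a single-byte encoding,
# so decoding one byte at a time is safe.
fixed = re.sub(r"\\'([0-9A-F]{2})",
               lambda m: bytes.fromhex(m.group(1)).decode("cp1251"),
               string)
print(fixed)
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}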

How to deal with variations in apostrophe ’ and ' [duplicate]

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy & paste it into my site's editor (simple textarea, no WYSIWYG etc.).
Those texts usually contain "nice" quotes instead of the plain ASCII ones ("). They also sometimes contain those longer dashes like – instead of -.
Now I want to replace all those characters with their ASCII counterparts. However, I do not want to remove umlauts and other non-ASCII characters. I'd also highly prefer a proper solution that does not involve creating a mapping dict for all those characters.
All my strings are unicode objects.
What about this?
It creates a translation table first, but honestly I don't think you can do this without one.
transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--")])

with open("a.txt", "w", encoding="utf-8") as f_out:
    a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes” "
    print(" a_str = " + a_str, file=f_out)
    fixed_str = a_str.translate(transl_table)
    print(" fixed_str = " + fixed_str, file=f_out)
I wasn't able to run this printing to a console (on Windows), so I had to write to a txt file.
The output in the a.txt file looks as follows:
a_str =  ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
fixed_str =  'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"
By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the differences in Unicode string handling between the two versions of the language.
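On Python 3, str.maketrans can also build the same table in one call; a small sketch of that convenience:
# str.maketrans maps each character of the first string to the character
# at the same position in the second.
transl_table = str.maketrans(u"‘’´“”–-", u"'''\"\"--")
print(u"‘nice’ “quotes” – dash".translate(transl_table))
# 'nice' "quotes" - dash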
There is no such "proper" solution, because for any given Unicode character there is no "ASCII counterpart" defined.
For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to those names:
#!/usr/bin/env python3
import unicodedata

def unicode_character_name(char):
    try:
        return unicodedata.name(char)
    except ValueError:
        return None

# Generate all Unicode characters with their names
all_unicode_characters = []
for n in range(0, 0x10ffff):  # Unicode planes 0-16
    char = chr(n)             # Python 3
    # char = unichr(n)        # Python 2
    name = unicode_character_name(char)
    if name:
        all_unicode_characters.append((char, name))

# Find all Unicode quotation marks
print(' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
# " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 " 🙶 🙷 🙸

# Find all Unicode hyphens
print(' '.join([char for char, name in all_unicode_characters if 'HYPHEN' in name]))
# - ­ ֊ ᐀ ᠆ ‐ ‑ ‧ ⁃ ⸗ ⸚ ⹀ ゠ ﹣ - 󠀭

# Find all Unicode dashes
print(' '.join([char for char, name in all_unicode_characters if 'DASH' in name and 'DASHED' not in name]))
# ‒ – — ⁓ ⊝ ⑈ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ╌ ╍ ╎ ╏ ⤌ ⤍ ⤎ ⤏ ⤐ ⥪ ⥫ ⥬ ⥭ ⩜ ⩝ ⫘ ⫦ ⬷ ⸺ ⸻ ⹃ 〜 〰 ︱ ︲ ﹘ 💨
As you can see, as easy as this example is, there are many problems. There are many quotation marks in Unicode that don't look anything like the quotation marks in US-ASCII and there are many hyphens in Unicode that don't look anything like the hyphen-minus sign in US-ASCII.
And there are many questions. For example:
should the "SWUNG DASH" (⁓) symbol be replaced with an ASCII hyphen (-) or a tilde (~)?
should the "CANADIAN SYLLABICS HYPHEN" (᐀) be replaced with an ASCII hyphen (-) or an equals sign (=)?
should the "SINGLE LEFT-POINTING ANGLE QUOTATION MARK" (‹) be replaces with an ASCII quotation mark ("), an apostrophe (') or a less-than sign (<)?
To establish a "correct" ASCII counterpart, somebody needs to answer these questions based on the use context. That's why all the solutions to your problem are based on a mapping dictionary in one way or another. And all these solutions will provide different results.
You can build on top of the unidecode package.
This is pretty slow, since we first normalize all the Unicode to the composed (NFC) form and then check what unidecode turns each character into. If the result matches a Latin letter, we keep the original NFC character; if not, we yield whatever degarbling unidecode has suggested. This leaves accented letters alone but converts everything else.
import unidecode
import unicodedata
import re

def char_filter(string):
    latin = re.compile('[a-zA-Z]+')
    for char in unicodedata.normalize('NFC', string):
        decoded = unidecode.unidecode(char)
        if latin.match(decoded):
            yield char
        else:
            yield decoded

def clean_string(string):
    return "".join(char_filter(string))

print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints: vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). However, read the documentation for Unicode strings -- there the translation table has a different form: Unicode ordinal number --> Unicode string (usually a single character) or None.
Well, it does require the dict, but you have to capture the replacements somewhere anyway. How would you do that without any table or arrays? You could use str.replace() for the single characters, but that would be inefficient.
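For illustration, a table in the Unicode form just described (ordinal --> replacement string or None) might look like this:
# Keys are code points; values are replacement strings, or None to delete.
table = {
    ord(u'\u2018'): u"'",  # left single quotation mark
    ord(u'\u2019'): u"'",  # right single quotation mark
    ord(u'\u2013'): u'-',  # en dash
    ord(u'\u00ad'): None,  # soft hyphen: delete
}
print(u'\u2018caf\u00e9\u2019 \u2013 menu'.translate(table))
# 'café' - menu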
This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html
-S, --smart
    Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as "Mr." (Note: This option is significant only when the input format is markdown or textile. It is selected automatically when the input format is textile or the output format is latex or context.)
It's Haskell, so you'd have to figure out the interface.
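If you would rather drive it from Python, here is a sketch that pipes text through the pandoc executable (this assumes pandoc is installed and on PATH; newer pandoc versions dropped -S in favour of the markdown+smart input extension):
import subprocess

# Pipe a snippet through pandoc and read back the smartened text.
result = subprocess.run(
    ["pandoc", "-S", "-f", "markdown", "-t", "plain"],
    input='"straight quotes" -- and --- dashes, dots...',
    capture_output=True, text=True, check=True)
print(result.stdout)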

How do I use both single and double quotes in a python string

I am going a little crazy trying out all combinations. I need a string variable whose value is set to: r'" a very long string \r"'
This long string is given across multiple lines. My code looks like this:
str = r'" a very
long
string \r"'
This is introducing "\n" into the str variable. I tried using the """ ... """ syntax too, but I get a syntax error. Can someone help me please? I saw the other questions on Stack Overflow; they don't seem to match this requirement.
You can use multiple string literals; as long as they are on the same logical line they'll be concatenated into one long string. You can extend the logical line with parentheses:
yourstr = (
    '" a very'
    'long '
    r'string \r"')
Note that I mixed string literal types here. The first two parts are normal string literals, the latter part is a raw string literal so you don't have to double the \ in \r. If you really wanted to have a CR carriage return, omit the r prefix.
Demo:
>>> yourstr = (
... '" a very'
... 'long '
... r'string \r"')
>>> yourstr
'" a verylong string \\r"'
>>> print yourstr
" a verylong string \r"
The Python compiler concatenates adjacent string literals. The trick is to tell it that it should be considered a single line of code.
S = ('" a very '
     'long '
     r'string \r"')

How to replace unicode characters in string with something else python?

I have a string that I got from reading an HTML webpage, with bullet symbols like "•" from a bulleted list. Note that the text is the HTML source of a webpage, read with Python 2.7's urllib2.read(webaddress).
I know the unicode character for the bullet character as U+2022, but how do I actually replace that unicode character with something else?
I tried doing
str.replace("•", "something")
but it does not appear to work... how do I do this?
Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "*")
Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")
(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, note that naming a variable str shadows the built-in type str.)
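Putting the three steps together in one Python 2 sketch (the sample byte string here is made up for illustration):
html = 'Bullet: \xe2\x80\xa2 item'     # raw UTF-8 bytes, e.g. from urllib2
text = html.decode('utf-8')            # 1. decode to Unicode
text = text.replace(u'\u2022', u'*')   # 2. replace on the Unicode string
html = text.encode('utf-8')            # 3. re-encode only for I/O
print html
# Bullet: * item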
Work with Unicode strings directly:
>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'
import re

regex = re.compile(u'\u2022', re.UNICODE)
newstring = regex.sub(something, yourstring)
Try this one; you will get the output as a normal string:
str.encode().decode('unicode-escape')
and after that, you can perform any replacement.
str.replace('•','something')
str1 = "This is Python\u500cPool"
Encode the string to ASCII and replace all the utf-8 characters with '?'.
str1 = str1.encode("ascii", "replace")
Decode the byte stream to string.
str1 = str1.decode(encoding="utf-8", errors="ignore")
Replace the question mark with the desired character.
str1 = str1.replace("?"," ")
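Chained together (note this approach is lossy: any literal '?' already in the text is replaced as well):
str1 = "This is Python\u500cPool"
print(str1.encode("ascii", "replace").decode("ascii").replace("?", " "))
# This is Python Pool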
Funnily enough, the answer is hidden among the answers.
str.replace("•", "something")
would work if you used the right semantics:
str.replace(u"\u2022", "something")
works wonders ;), thanks to RParadox for the hint.
If you want to remove all \u characters, the code below is for you:
def replace_unicode_character(self, content: str):
    content = content.encode('utf-8')
    if "\\x80" in str(content):
        count_unicode = 0
        i = 0
        while i < len(content):
            if "\\x" in str(content[i:i + 1]):
                if count_unicode % 3 == 0:
                    content = content[:i] + b'\x80\x80\x80' + content[i + 3:]
                    i += 2
                count_unicode += 1
            i += 1
        content = content.replace(b'\x80\x80\x80', b'')
    return content.decode('utf-8')

efficiently replace bad characters

I often work with utf-8 text containing characters like:
\xc2\x99
\xc2\x95
\xc2\x85
etc
These characters confuse other libraries I work with, so they need to be replaced.
What is an efficient way to do this, rather than:
text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')
There are always regular expressions; just list all of the offending characters inside square brackets, like so:
import re
print re.sub(r'[\xc2\x99]'," ","Hello\xc2There\x99")
This prints: 'Hello There ', with the unwanted characters replaced by spaces.
Alternately, if you have a different replacement character for each:
# remove annoying characters
chars = {
'\xc2\x82' : ',', # High code comma
'\xc2\x84' : ',,', # High code double comma
'\xc2\x85' : '...', # Triple dot
'\xc2\x88' : '^', # High caret
'\xc2\x91' : '\x27', # Forward single quote
'\xc2\x92' : '\x27', # Reverse single quote
'\xc2\x93' : '\x22', # Forward double quote
'\xc2\x94' : '\x22', # Reverse double quote
'\xc2\x95' : ' ',
'\xc2\x96' : '-', # High hyphen
'\xc2\x97' : '--', # Double hyphen
'\xc2\x99' : ' ',
'\xc2\xa0' : ' ',
'\xc2\xa6' : '|', # Split vertical bar
'\xc2\xab' : '<<', # Double less than
'\xc2\xbb' : '>>', # Double greater than
'\xc2\xbc' : '1/4', # one quarter
'\xc2\xbd' : '1/2', # one half
'\xc2\xbe' : '3/4', # three quarters
'\xca\xbf' : '\x27', # c-single quote
'\xcc\xa8' : '', # modifier - under curve
'\xcc\xb1' : '' # modifier - under line
}
def replace_chars(match):
    char = match.group(0)
    return chars[char]

text = re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
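For example, applied to a Python 2 byte string containing two of the mapped sequences (a hypothetical input):
text = 'Caf\xc2\x85 \xc2\x96 menu'
print re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
# Caf... - menu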
I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.
\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?
Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)
>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'
If you want to do this correction for all the characters in the range 0x80–0x99 that are valid Windows-1252 characters, you can use this approach:
def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.
    """
    import re
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, s)
For example:
>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'
If you want to remove all non-ASCII characters from a string, you can use
text.encode("ascii", "ignore")
import unicodedata

# Convert to unicode
text_to_unicode = unicode(text, "utf-8")
# Convert back to ascii
text_fixed = unicodedata.normalize('NFKD', text_to_unicode).encode('ascii', 'ignore')
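For comparison with plain "ignore", NFKD keeps the base letters, because accents are first split off into separate combining characters that can then be dropped (continuing the Python 2 snippet above):
print unicodedata.normalize('NFKD', u"résumé").encode('ascii', 'ignore')
# resume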
This is not "Unicode characters" - it feels more like this an UTF-8 encoded string. (Although your prefix should be \xC3, not \xC2 for most chars). You should not just throw them away in 95% of the cases, unless you are comunicating with a COBOL backend. The World is not limited to 26 characters, you know.
There is a concise reading to explain the differences between Unicode strings (what is used as an Unicode object in python 2 and as strings in Python 3 here: http://www.joelonsoftware.com/articles/Unicode.html - please, for your sake do read that. Even if you are never planning to have anything that is not English in all of your applications, you still will stumble on symbols like € or º that won't fit in 7 bit ASCII. That article will help you.
That said, maybe the libraries you are using do accept Unicode python objects, and you can transform your UTF-8 Python 2 strings into unidoce by doing:
var_unicode = var.decode("utf-8")
If you really need 100% pure ASCII, with all non-ASCII characters replaced, then after decoding the string to Unicode, re-encode it to ASCII, telling the codec to replace the characters that don't fit in the charset (each becomes a '?'):
var_ascii = var_unicode.encode("ascii", "replace")
These characters are not in the ASCII character set, and that is the reason why you are getting the errors.
To avoid these errors, you can do the following while reading the file:
import codecs
f = codecs.open('file.txt', 'r',encoding='utf-8')
To learn more about this kind of error, go through this link.
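On Python 3 the built-in open accepts the encoding argument directly, so codecs.open is no longer needed:
# Python 3: encoding is a parameter of the built-in open.
with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()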
