Dealing with Unicode Characters - Python

I know this question has been asked countless times before, but I can't seem to get any of the solutions working. I've tried the codecs module and the io module; nothing seems to work.
I'm scraping some stuff off the web, then logging the details of each item to a text file, yet the script breaks as soon as it encounters a Unicode character, for example:
AHIMSA Centro de Sanación Pránica, Pranic Healing
Further, I'm not sure where or when Unicode characters might pop up, which adds an extra level of complexity, so I need an overarching solution for dealing with potential non-ASCII characters.
I'm not sure if I'll have Python 3.6.5 in the production environment, so the solution has to work with 2.7.
What can I do here? How can I deal with this?
# -*- coding: utf-8 -*-
...
with open('test.txt', 'w') as f:
    f.write(str(len(discoverable_cards)) + '\n\n')
    for cnt in range(0, len(discoverable_cards)):
        t = get_time()
        f.write('[ {} ] {}\n'.format(t, discoverable_cards[cnt]))
        f.write('[ {} ] {}\n'.format(t, cnt + 1))
        f.write('[ {} ] {}\n'.format(t, product_type[cnt].text))
        f.write('[ {} ] {}\n'.format(t, titles[cnt].text))
...
Any help would be appreciated!

Given that you are on Python 2.7, you will probably want to explicitly encode all of your strings with a Unicode-capable encoding like UTF-8 before passing them to write. You can do this with a simple encode helper:
# future py3 compatibility: define unicode, if needed
# (this must live at module level: assigning to unicode inside the
# function would make it a local name and break the lookup)
try:
    unicode
except NameError:
    unicode = str

def safe_encode(str_or_unicode):
    if isinstance(str_or_unicode, unicode):
        return str_or_unicode.encode("utf8")
    return str_or_unicode
You would then use it like this:
f.write('[ {} ] {}\n'.format(safe_encode(t), safe_encode(discoverable_cards[cnt])))
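Alternatively, if you'd rather not wrap every argument, you can open the file with an explicit encoding via io.open (available since Python 2.6) and write unicode objects directly. A minimal sketch, assuming your scraped values are unicode and UTF-8 output is acceptable (the timestamp and title here are placeholders for the values from your loop):
# -*- coding: utf-8 -*-
import io

# io.open behaves like Python 3's open: it encodes unicode for you on write
with io.open('test.txt', 'w', encoding='utf-8') as f:
    f.write(u'[ {} ] {}\n'.format(u'12:00:00', u'AHIMSA Centro de Sanación Pránica'))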

Related

Python TypeError: expected a string or other character buffer object when importing text file

I am pretty new to Python. For this task, I am trying to import a text file, add <s> and </s> tags to each line, and remove punctuation from the text. I tried the method from How to strip punctuation from a text file.
import string

def readFile():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    with open('out_file.txt', 'w') as out_file:
        with open('moviereview.txt') as file:
            for line in file:
                line = ' '.join(line.split(' '))
                line = line.translate(translate_table)
                out_file.write("<s>" + line.rstrip('\n') + "</s>" + '\n')
    return out_file
However, I get an error saying:
TypeError: expected a string or other character buffer object
My thought is that after I split and join the line, I get a list of strings, so I cannot use str.translate() to process it. But it seems like everyone else has the same thing and it works,
e.g. https://appliedmachinelearning.blog/2017/04/30/language-identification-from-texts-using-bi-gram-model-pythonnltk/ in the example code from line 13.
So I am really confused; can anyone help? Thanks!
On Python 2, only unicode types have a translate method that takes a dict. If you intend to work with arbitrary text, the simplest solution here is to just use the Python 3 version of open on Py2; it will seamlessly decode your inputs and produce unicode instead of str.
As of Python 2.6+, replacing the normal built-in open with the Python 3 version is simple. Just add:
from io import open
to the imports at the top of your file. You can also remove line = ' '.join(line.split(' ')); that's definitionally a no-op (it splits on single spaces to make a list, then rejoins on single spaces). You may also want to add:
from __future__ import unicode_literals
to the very top of your file (before all of your code); that will make all of your uses of plain quotes automatically unicode literals, not str literals (prefix actual binary data with b to make it a str literal on Py2, bytes literal on Py3).
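Putting those pieces together, the fixed function might look like this - a sketch, assuming your files are UTF-8 (pass whatever encoding your data actually uses):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from io import open  # Python 3 style open on Python 2: lines decode to unicode
import string

def readFile():
    # unicode.translate accepts this dict directly on both Py2 and Py3
    translate_table = dict((ord(char), None) for char in string.punctuation)
    with open('out_file.txt', 'w', encoding='utf-8') as out_file:
        with open('moviereview.txt', encoding='utf-8') as in_file:
            for line in in_file:
                line = line.translate(translate_table)
                out_file.write("<s>" + line.rstrip('\n') + "</s>" + '\n')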
The above solution is best if you can swing it, because it will make your code work correctly on both Python 2 and Python 3. If you can't do it for whatever reason, then you need to change your translate call to use the API Python 2's str.translate expects, which means removing the definition of translate_table entirely (it's not needed) and just doing:
line = line.translate(None, string.punctuation)
For Python 2's str.translate, the first argument is a 256-character mapping table covering byte values 0 through 255 (or None if no mapping is needed), and the second is a string of characters to delete (which string.punctuation already provides).
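For example, on Python 2:
>>> import string
>>> 'Hello, world!'.translate(None, string.punctuation)
'Hello world'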
Answering here because a comment doesn't let me format code properly:
def r():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    a = []
    with open('out.txt', 'w') as of:
        with open('test.txt', 'r') as f:
            for l in f:
                l = l.translate(translate_table)
                a.append(l)
                of.write(l)
    return a
This code runs fine for me with no errors. Can you try running that, and responding with a screenshot of the code you ran?

Setting Python encoding for printing Chinese characters [duplicate]

This question already has answers here:
Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
(4 answers)
Closed last year.
My code is below. I don't know why it can't print the Chinese characters. Please help.
When trying to print more than one variable at a time, the output looks like ASCII escapes or raw bytes.
How to fix it?
# -*- coding: utf-8 -*-
import pygoldilocks
import sys
reload(sys)
sys.setdefaultencoding('utf8')
rows = ( '已','经激活的区域语言' )
print( rows[0] )
print( rows[1] )
print( rows[0], rows[1] )
print( rows[0].encode('utf8'), rows[1].decode('utf8') )
print( rows[0], 1 )
$ python test.py
已
经激活的区域语言
('\xe5\xb7\xb2', '\xe7\xbb\x8f\xe6\xbf\x80\xe6\xb4\xbb\xe7\x9a\x84\xe5\x8c\xba\xe5\x9f\x9f\xe8\xaf\xad\xe8\xa8\x80')
('\xe5\xb7\xb2', u'\u7ecf\u6fc0\u6d3b\u7684\u533a\u57df\u8bed\u8a00')
('\xe5\xb7\xb2', 1)
All your outputs are normal. By the way, this:
reload(sys)
sys.setdefaultencoding('utf8')
is really a poor man's trick to set the Python default encoding. It is seldom really useful - IMHO it is not in the shown code - and should only be used when no cleaner way is possible. I have been using Python 2 for decades with a non-ASCII charset (Latin-1) and only used that trick in my very first scripts.
And the # -*- coding: utf-8 -*- line (which may also be useful for your text editor) is needed here just so Python 2 will parse the non-ASCII bytes in your source; beyond that it only affects how unicode literal strings are decoded - and you have none.
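For contrast, a short script where the declaration really comes into play - a unicode literal needs it so Python knows how to decode the source bytes:
# -*- coding: utf-8 -*-
s = u'已'   # decoded to the single character U+5DF2 using the declared coding
print s     # the terminal renders it: 已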
Now what really happens:
You define rows as a 2-tuple of (byte) strings containing Chinese characters encoded in UTF-8. Fine.
When you print a string, the bytes are passed directly to the output system (here a terminal or screen). As the terminal correctly processes UTF-8, it converts the UTF-8 byte representation into the proper characters. So print(rows[0]) (which is executed as print rows[0] in Python 2 - (rows[0]) is not a tuple, (rows[0],) is a 1-tuple) correctly displays the Chinese characters.
But when you print a tuple, Python actually prints the representation of the elements of the tuple (it would be the same for a list, set or map). And in Python 2, the representation of a byte or unicode string encodes all non-ASCII characters in \x.. or \u.... form.
In a Python interactive session, you should see:
>>> print rows[0]
已
>>> print repr(rows[0])
'\xe5\xb7\xb2'
TL/DR: when you print containers, you actually print the representation of the elements. If you want to display the string values, use an explicit loop or a join:
print '(' + ', '.join(rows) + ')'
displays as expected:
(已, 经激活的区域语言)
Your problem is that you are using Python 2, I guess. Your code
print( rows[0], rows[1] )
is evaluated as
tmp = ( rows[0], rows[1] ) # a tuple!
print tmp # Python 2 print statement!
Since the default formatting for tuples is done via repr(), you see the ASCII-escaped representation.
Solution: Upgrade to Python 3.
There are two less drastic solutions than upgrading to Python 3.
The first is not to use Python 3 print() syntax:
rows = ( '已','经激活的区域语言' )
print rows[0]
print rows[1]
print rows[0], rows[1]
print rows[0].decode('utf8'), rows[1].decode('utf8')
print rows[0], 1
已
经激活的区域语言
已 经激活的区域语言
已 经激活的区域语言
已 1
The second is to import Python 3 print() syntax into Python 2:
from __future__ import print_function
rows = ( '已','经激活的区域语言' )
print (rows[0])
print (rows[1])
print (rows[0], rows[1])
print (rows[0].decode('utf8'), rows[1].decode('utf8'))
print (rows[0], 1)
Output is the same.
And drop that sys.setdefaultencoding() call. It's not intended to be used like that (only in the site module) and does more harm than good.

Using unicode / umlauts in Python: Dictionary v manual input

I am using a dictionary to store some character pairs in Python (I am replacing umlaut characters). Here is what it looks like:
umlautdict = {
    'ae': 'ä',
    'ue': 'ü',
    'oe': 'ö'
}
Then I run my inputwords through it like so:
for item in umlautdict.keys():
    outputword = inputword.replace(item, umlautdict[item])
But this does not do anything (no replacement happens). When I printed out my umlautdict, I saw that it looks like this:
{'ue': '\xfc', 'oe': '\xf6', 'ae': '\xc3\xa4'}
Of course that is not what I want; however, trying things like unicode() (--> Error) or prefixing u did not improve things.
If I type the 'ä' or 'ö' into the replace() command by hand, everything works just fine. I also added # -*- coding: utf-8 -*- to my script (working in TextWrangler), as it would not even let me execute the script containing umlauts without it.
So I don't get...
Why does this happen? Why and when do the umlauts change from "good to evil" when I store them in the dictionary?
How do I fix it?
Also, if anyone knows: what is a good resource to learn about encoding in Python? I have issues all the time, and so many things don't make sense to me / I can't wrap my head around them.
I'm working on a Mac in Python 2.7.10. Thanks for your help!
Converting to Unicode is done by decoding your string (assuming you're getting bytes):
data = "haer ueber loess"
word = data.decode('utf-8') # actual encoding depends on your data
Define your dict with unicode strings as well:
umlautdict = {
    u'ae': u'ä',
    u'ue': u'ü',
    u'oe': u'ö'
}
And finally, print umlautdict will print out some representation of that dict, usually involving escapes. That's normal; you don't have to worry about it.
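Putting it all together, a minimal sketch (note that the replacements must accumulate in one variable; the loop in the question rebuilds outputword from the original inputword on every iteration, so only the last replacement survives):
# -*- coding: utf-8 -*-
umlautdict = {
    u'ae': u'ä',
    u'ue': u'ü',
    u'oe': u'ö'
}

inputword = 'haer ueber loess'.decode('utf-8')  # bytes -> unicode
outputword = inputword
for item in umlautdict:
    outputword = outputword.replace(item, umlautdict[item])
print outputword  # här über löss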
Declare your coding.
Use raw format for the special characters.
Iterate properly on your string: keep the changes from each loop iteration as you head to the next.
Here's code to get the job done:
# -*- coding: utf-8 -*-
umlautdict = {
    'ae': r'ä',
    'ue': r'ü',
    'oe': r'ö'
}
print umlautdict

inputword = "haer ueber loess"
for item in umlautdict.keys():
    inputword = inputword.replace(item, umlautdict[item])
print inputword
Output:
{'ue': '\xc3\xbc', 'oe': '\xc3\xb6', 'ae': '\xc3\xa4'}
här über löss

How to read a non-English language text from a text file and print it in python?

I have a text file which contains some Persian text. I want to read the file, count the number of occurrences of each word, and then print the counts. This is my code:
f = open('C:/python programs/hafez.txt')
wordDict = {}
for line in f:
    wordList = line.strip().split(' ')
    for word in wordList:
        if word not in wordDict:
            wordDict[word] = 1
        else:
            wordDict[word] = wordDict[word] + 1
print(str(wordDict))
It produces results in the wrong encoding; I tried various ways to fix this but with no good result. Here is part of the output this code produces:
{"\x00'\x063\x06(\x06": 3, "\x00,\x06'\x06E\x06G\x06": 16, "\x00'\x063\x06*\x06E\x06'\x069\x06": 1, '\x00-\x064\x061\x06': 1, .....}
There are several ways to deal with this, but perhaps the easiest is with codecs.open(). (I'm assuming you're using Python 2.7, since some of the other tricks here - Counter and with - depend on it.)
import codecs
from collections import Counter

wordDict = Counter()
with codecs.open('C:/python programs/hafez.txt', 'r', encoding='cp720') as f:
    for line in f:
        wordDict.update(line.strip().split())

for word, count in wordDict.most_common():
    print word, count
In Python 3, you need the parentheses with print (it's a function in Python 3 but a statement in Python 2), and you don't need to import codecs because the builtin open() has support for different encodings.
If your encoding isn't Code Page 720, then you need to replace that option with the abbreviation for the appropriate encoding.
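For reference, the same approach on Python 3 needs neither codecs nor a print tweak; a minimal sketch, again assuming the file is Code Page 720:
from collections import Counter

wordDict = Counter()
with open('C:/python programs/hafez.txt', encoding='cp720') as f:
    for line in f:
        wordDict.update(line.strip().split())

for word, count in wordDict.most_common():
    print(word, count)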
This is a good opportunity to learn a bit about encodings. While I agree with Joel that no programmer should pretend we live in a US English / ASCII world, the issue of encoding becomes especially pertinent when you're dealing with a non-Latin alphabet on a regular basis. (Besides, ASCII isn't even enough for English -- many English words are borrowings that kept their accents, amongst other issues.) Good starting places are Joel's article (The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)), the Pragmatic Unicode talk (including the Unicode sandwich), and, for ease of producing said sandwich in Python 2, the codecs module. There's also a HOWTO in the Python docs, which is easier to understand after you've read the other articles.
If you've decided to go full Python 3, then you can simply select your exact version from the listbox at the top of the documentation pages. The BDFL's summary of the differences between Python 2 and 3 also includes a bit on issues with Unicode and how it's handled differently in Python 2 and 3.
I think in general you could encode your .txt file as UTF-8, and read UTF-8 in Python by putting # -*- coding: UTF-8 -*- at the top of the .py file.
Consider using the Python Counter dict subclass for counting occurrences of words.
As for the text, Python 2.7 strings are not Unicode by default. Read: http://docs.python.org/2/howto/unicode.html
You can use
for i, j in wordDict.iteritems():
    print unicode(i), j
I admit I don't know Python well; I've been learning it for a few days, and for the moment I'm a little confused between Python 2 and 3, since they handle strings differently.
Just a tip on where to look: read your file into a string (I don't know if you should open the file in binary mode), then convert it to unicode using:
unicodestr = str.decode('cp720')
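Spelled out on Python 2, that tip might look like this - a sketch, assuming the file is cp720-encoded as suggested above:
with open('C:/python programs/hafez.txt', 'rb') as f:  # binary mode: raw bytes
    data = f.read()
unicodestr = data.decode('cp720')  # bytes -> unicode
print unicodestr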

How to search and replace utf-8 special characters in Python?

I'm a Python beginner, and I have a utf-8 problem.
I have a utf-8 string and I would like to replace all german umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue').
u-umlaut has unicode code point 252, so I tried this:
>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'
I expected the last string to be u'ueber'.
What I ultimately want to do is replace all u-umlauts in a file with 'ue':
import sys
import codecs
f = codecs.open(sys.argv[1],encoding='utf-8')
for line in f:
print repr(line).replace(unichr(252), 'ue')
Thanks for your help! (I'm using Python 2.3.)
I would define a dictionary of the special characters I want to map and then use the translate method:
line = 'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'
special_char_map = {ord('ä'):'ae', ord('ü'):'ue', ord('ö'):'oe', ord('ß'):'ss'}
print(line.translate(special_char_map))
you will get the following result:
Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.
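Note that this relies on str.translate accepting a dict, which is the Python 3 behaviour. On Python 2 (which the question targets), the same idea works as long as both the line and the mapped values are unicode; a sketch:
# -*- coding: utf-8 -*-
line = u'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'
special_char_map = {ord(u'ä'): u'ae', ord(u'ü'): u'ue', ord(u'ö'): u'oe', ord(u'ß'): u'ss'}
print line.translate(special_char_map)
# Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.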
I think it's easier and clearer to do it in a more straightforward way, using the unicode literal u'ü' directly rather than unichr(252).
>>> s = u'über'
>>> s.replace(u'ü', 'ue')
u'ueber'
There's no need to use repr, as this will print the 'Python representation' of the string; you just need to present the readable string.
You will also need to include the following line at the beginning of the .py file, if it's not already present, to declare the encoding of the file:
#-*- coding: UTF-8 -*-
Added: Of course, the declared coding must match the actual encoding of the file. Please check that, as a mismatch can cause problems (I had problems with Eclipse on Windows, for example, as it writes files as cp1252 by default). It should also match the encoding of the system, which could be utf-8, latin-1 or something else.
Also, don't use str as a variable name, as it shadows the built-in string type. You could run into problems later.
(I tried this on Python 2.6; I think the result is the same in Python 2.3.)
repr(str) returns a quoted version of str that, when printed out, is something you could type back in as Python to get the string back. So it's a string that literally contains \xfcber, instead of a string that contains über.
You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.
If you need to get a quoted version of the result of that, though I don't believe you should need it, you can wrap the entire expression in repr:
repr(str.replace(unichr(252), 'ue'))
You can avoid all that source-file encoding stuff and its problems. Use the Unicode names, then it's screamingly obvious what you are doing, and the code can be read and modified anywhere.
I don't know of any language where the only accented Latin letter is lower-case-u-with-umlaut-aka-diaeresis, so I've added code to loop over a table of translations under the assumption that you'll need it.
# coding: ascii
translations = (
    (u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
    (u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
    # et cetera
)

test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'
out = test
for from_str, to_str in translations:
    out = out.replace(from_str, to_str)
print out
output:
Moeller von Muenchen
