So I am trying to teach myself python and pymarc for a school project I am working on. I have a sample marc file and I am trying to read it using this simple code:
from pymarc import *
reader = MARCReader(open('dump.mrc', 'rb'), to_unicode=True)
for record in reader:
print(record)
The for loop is to just print out each record to make sure I am getting the correct data. The only thing is I am getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I've looked online but could not find an answer to my problem. What does this error mean and how can I go about fixing it? Thanks in advance.
You can set the python environment to support UTF-8 and get record as a dictionary.
Try:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pymarc import *
reader = MARCReader(open('dump.mrc', 'rb'), to_unicode=True, force_utf8=True)
for record in reader:
print record.as_dict()
Note:
If you still get the unicode exception, you can set to_unicode=False and skip force_utf8=True.
Also please check if your dump.mrc file is encoded to UTF-8 or not. Try:
$ chardet dump.mrc
I want to make sure all string are unicode in my code, so I use unicode_literals, then I need to write string to file:
from __future__ import unicode_literals
with open('/tmp/test', 'wb') as f:
f.write("中文") # UnicodeEncodeError
so I need to do this:
from __future__ import unicode_literals
with open('/tmp/test', 'wb') as f:
f.write("中文".encode("utf-8"))
f.write("中文".encode("utf-8"))
f.write("中文".encode("utf-8"))
f.write("中文".encode("utf-8"))
but every time I need to encode in code, I am lazy, so I change to codecs:
from __future__ import unicode_literals
from codecs import open
import locale, codecs
lang, encoding = locale.getdefaultlocale()
with open('/tmp/test', 'wb', encoding) as f:
f.write("中文")
still I think this is too much if I just want to write to file, any easier method?
You don't need to call .encode() and you don't need to call locale.getdefaultlocale() explicitly:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io
with io.open('/tmp/test', 'w') as file:
file.write(u"中文" * 4)
It uses locale.getpreferredencoding(False) character encoding to save Unicode text to the file.
On Python 3:
you don't need to use the explicit encoding declaration (# -*- coding: utf-8 -*-), to use literal non-ascii characters in your Python source code. utf-8 is the default.
you don't need to use import io: builtin open() is io.open() there
you don't need to use u'' (u prefix). '' literals are Unicode by default. If you want to omit u'' then put back from __future__ import unicode_literals as in your code in the question.
i.e., the complete Python 3 code is:
#!/usr/bin/env python3
with open('/tmp/test', 'w') as file:
file.write("中文" * 4)
What about this solution?
Write to UTF-8 file in Python
Only three lines of code.
This code should write some text to file.
When I'm trying to write my text to console, everything works. But when I try to write the text into the file, I get UnicodeEncodeError. I know, that this is a common problem which can be solved using proper decode or encode, but I tried it and still getting the same UnicodeEncodeError. What am I doing wrong?
I've attached an example.
print "(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2])
(StarBuy s.r.o.,Inzertujte s foto, auto-moto, oblečenie, reality, prácu, zvieratá, starožitnosti, dovolenky, nábytok, všetko pre deti, obuv, stroj....
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u010d' in position 50: ordinal not in range(128)
Where could be the problem?
To write Unicode text to a file, you could use io.open() function:
#!/usr/bin/env python
from io import open
with open('utf8.txt', 'w', encoding='utf-8') as file:
file.write(u'\u010d')
It is default on Python 3.
Note: you should not use the binary file mode ('b') if you want to write text.
# coding: utf8 that defines the source code encoding has nothing to do with it.
If you see sys.setdefaultencoding() outside of site.py or Python tests; assume the code is broken.
#ned-batchelder is right. You have to declare that the system default encoding is "utf-8". The coding comment # -*- coding: utf-8 -*- doesn't do this.
To declare the system default encoding, you have to import the module sys, and call sys.setdefaultencoding('utf-8'). However, sys was previously imported by the system and its setdefaultencoding method was removed. So you have to reload it before you call the method.
So, you will need to add the following codes at the beginning:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
You may need to explicitly declare that python use UTF-8 encoding.
The answer to this SO question explains how to do that: Declaring Encoding in Python
For Python 2:
Declare document encoding on top of the file (if not done yet):
# -*- coding: utf-8 -*-
Replace .decode with .encode:
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".encode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
I'm having trouble formating the strings to utf-8
In this script im getting data from excel file
then printing it out in a loop, the problem is that
the string with special characters shows up wrong.
In result I keep getting 'PatrÄ«cija' instead of 'Patrīcija'
Can't seem to find the solution for this problem
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import xlrd
import datetime
def todaysnames():
todaysdate = datetime.datetime.strftime(datetime.date.today(), "%d.%m")
book = xlrd.open_workbook("vardadienas.xls")
sheet = book.sheet_by_name('Calendar')
for rownr in range(sheet.nrows):
if sheet.cell(rownr, 0).value == todaysdate:
string = (sheet.cell(rownr, 1).value)
string = string.encode(encoding="UTF-8",errors="strict")
names = string.split(', ')
return names
names = todaysnames()
for name in names:
print name
Changed encoding to iso8859_13(Baltic languages) and it fixed it.
I think that your problem may be caused by the print. The xlrd returns utf8. Depending of the encoding of your console, the print may have difficulties to print it correctly. I've noticed this sometimes on a french Windows (where encoding is cp1252)
The following question: Python, Unicode, and the Windows console explains how to print unicode char on the console on Windows. I didn't try myself but it looks good.
I hope it helps
When I use open() to open a file, I am not able to write unicode strings. I have learned that I need to use codecs and open the file with Unicode encoding (see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data).
Now I need to create some temporary files. I tried to use the tempfile library, but it doesn't have any encoding option. When I try to write any unicode string in a temporary file with tempfile, it fails:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
fh.write(u"Hello World: ä")
fh.seek(0)
for line in fh:
print line
How can I create a temporary file with Unicode encoding in Python?
Edit:
I am using Linux and the error message that I get for this code is:
Traceback (most recent call last):
File "tmp_file.py", line 5, in <module>
fh.write(u"Hello World: ä")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 13: ordinal not in range(128)
This is just an example. In practice I am trying to write a string that some API returned.
Everyone else's answers are correct, I just want to clarify what's going on:
The difference between the literal 'foo' and the literal u'foo' is that the former is a string of bytes and the latter is the Unicode object.
First, understand that Unicode is the character set. UTF-8 is the encoding. The Unicode object is the about the former—it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding for a string literal will be UTF-8, because you specified it in the first lines of the file.
To get a Unicode string from a byte string, you call the .encode() method:
>>>> u"ひらがな".encode("utf-8") == "ひらがな"
True
Similarly, you could call your string.encode in the write call and achieve the same effect as just removing the u.
If you didn't specify the encoding in the top, say if you were reading the Unicode data from another file, you would specify what encoding it was in before it reached a Python string. This would determine how it would be represented in bytes (i.e., the str type).
The error you're getting, then, is only because the tempfile module is expecting a str object. This doesn't mean it can't handle unicode, just that it expects you to pass in a byte string rather than a Unicode object—because without you specifying an encoding, it wouldn't know how to write it to the temp file.
tempfile.TemporaryFile has encoding option in Python 3:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile(mode='w+', encoding='utf-8') as fh:
fh.write("Hello World: ä")
fh.seek(0)
for line in fh:
print(line)
Note that now you need to specify mode='w+' instead of the default binary mode. Also note that string literals are implicitly Unicode in Python 3, there's no u modifier.
If you're stuck with Python 2.6, temporary files are always binary, and you need to encode the Unicode string before writing it to the file:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
fh.write(u"Hello World: ä".encode('utf-8'))
fh.seek(0)
for line in fh:
print line.decode('utf-8')
Unicode specifies the character set, not the encoding, so in either case you need a way to specify how to encode the Unicode characters!
Since I am working on a Python program with TemporaryFile objects that should run in both Python 2 and Python 3, I don't find it satisfactory to manually encode all strings written as UTF-8 like the other answers suggest.
Instead, I have written the following small polyfill (because I could not find something like it in six) to wrap a binary file-like object into a UTF-8 file-like object:
from __future__ import unicode_literals
import sys
import codecs
if sys.hexversion < 0x03000000:
def uwriter(fp):
return codecs.getwriter('utf-8')(fp)
else:
def uwriter(fp):
return fp
It is used in the following way:
# encoding: utf-8
from tempfile import NamedTemporaryFile
with uwriter(NamedTemporaryFile(suffix='.txt', mode='w')) as fp:
fp.write('Hællo wörld!\n')
I have figured out one solution: create a temporary file that is not automatically deleted with tempfile, close it and open it again using codecs:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile
f = tempfile.NamedTemporaryFile(delete=False)
filename = f.name
f.close()
with codecs.open(filename, 'w+b', encoding='utf-8') as fh:
fh.write(u"Hello World: ä")
fh.seek(0)
for line in fh:
print line
os.unlink(filename)
You are trying to write a unicode object (u"...") to the temporary file where you should use an encoded string ("..."). You don't have to explicitly pass an "encode=" parameter, because you've already stated the encoding in line two ("# -*- coding: utf-8 -*-"). Just use fh.write("ä") instead of fh.write(u"ä") and you should be fine.
Dropping the u made your code work for me:
fh.write("Hello World: ä")
I guess it's because it's already unicode.
Setting the sys as default encoding to UTF-8 will fix the encoding issue
import sys
reload(sys)
sys.setdefaultencoding('utf-8') #set to utf-8 by default this will solve the errors
import tempfile
with tempfile.TemporaryFile() as fh:
fh.write(u"Hello World: ä")
fh.seek(0)
for line in fh:
print line