Unicode Error in Django while loading in data

Unicode Error in Django while loading in data - python

So I'm trying to load this line in as a name for a model:
"Auf der grünen Wiese (1953)"
but I get the error
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 70: invalid start byte
I'm looking at: http://docs.python.org/2/howto/unicode.html#the-unicode-type
but I'm still not exactly sure about the fix to this problem. I can cast it as a unicode with the option to replace/ignore the error but I don't think that is the most ideal solution?
I also see that django provides a few functions to help with this stuff: https://docs.djangoproject.com/en/dev/ref/unicode/ but I'm still not quite sure how to approach it.

The line is encoded using latin1. To properly decode it
you should do (assuming Python 2.x):
line = 'Auf der gr\xfcnen Wiese (1953)'
name = line.decode('latin1')
If you are reading this from a file, you can also do:
f = codecs.open(path, 'r', 'latin1')
name = f.readline().strip()

Related

DjangoUnicodeDecodeError while getting field

I have faced a problem while I was reading from GEOSGeometry object. I have used this code
ds = DataSource(shp file path)
lyr = ds[0]
for feat in lyr:
geom_t = feat.geom.transform(wgs84, clone=True)
name =feat.get('name')
this code works fine for my shape files.but if name field contains a utf8 string such as 'تست' it raises this error
DjangoUnicodeDecodeError at /views/importdata/
'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte. You passed in b'\xc8\xe1\xe6\xc7\xd1 \xc7\xe3\xc7\xe3 \xd1\xd6\xc7' (<class 'bytes'>)
Unicode error hint
The string that could not be encoded/decoded was: �����
well I find out this is an internal error which is related to gdal or geos wrapper in django. the error comes from this line
return force_text(string, encoding=self._feat.encoding, strings_only=True)
in field.py in this directory
D:\Python\Python36\lib\site-packages\django\contrib\gis\gdal\field.py in as_string
is there any way to find a solution for this problem?
thanks

well for the other people who might face this problem . I had to set encoding in datasource definition.
here is the main part
ds = DataSource(shp,encoding='cp1256')

Python encoding issue while reading a file

I am trying to read a file that contains this character in it "ë". The problem is that I can not figure out how to read it no matter what I try to do with the encoding. When I manually look at the file in textedit it is listed as a unknown 8-bit file. If I try changing it to utf-8, utf-16 or anything else it either does not work or messes up the entire file. I tried reading the file just in standard python commands as well as using codecs and can not come up with anything that will read it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.17.10 by the way.
readFile = codecs.open("FileName",encoding='utf-8')
The line I am trying to read is this with nothing else in it.
Aeëtes
Here are some of the errors I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte
UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM -- I know this one is that it is not a utf-16 file.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)
If I don't use a Codec the word comes in as Ae?tes which then crashes later in the program. Just to be clear, none of the suggested questions or any other anywhere on the net have pointed to an answer. One other detail that might help is that I am using OS X, not Windows.

Credit for this answer goes to RadLexus for figuring out the proper encoding and also to Mad Physicist who pointed me in the right track even if I did not consider all possible encodings.
The issue is apparently a Mac will convert the .txt file to mac_roman. If you use that encoding it will work perfectly.
This is the line of code that I used to convert it.
readFile = codecs.open("FileName",encoding='mac_roman')

problems of reading a collection of files including non-ascii characters

I am trying to build word vectors using the following code segment
DIR = "C:\Users\Desktop\data\\rec.sport.hockey"
posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
x_train = vectorizer.fit_transform(posts)
However, the code returns the following error message
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
I think it is related to some non-ascii characters. How to solve this issue?

Is the file automatically generated? If not, one simple solution is to open the file with Notepad++ and convert the encoding to utf-8.

load .json into python; UnicodeDecodeError

I am trying to load a json file into python with no success. I have been googling a solution for the past few hours and just cannot seem to get it to load. I have tried to load it using the same json.load('filename') function that has worked for everyone. I keep getting :
"UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in postion 124: invalid continuation byte"
Here is the code I am using
import json
json_data = open('myfile.json')
for line in json_data:
data = json.loads(line) <--I get an error at this.
Here is a sample line from my file
{"topic":"security","question":"Putting the Biba-LaPadula Mandatory Access Control Methods to Practise?","excerpt":"Text books on database systems always refer to the two Mandatory Access Control models; Biba for the Integrity objective and Bell-LaPadula for the Secrecy or Confidentiality objective.\n\nText books ...\r\n "}
What is my error if this seems to have worked for everyone in every example I have googled?

Have you tried:
json.loads(line.decode("utf-8"))
Similar question asked here: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2
Edit:
If the above does not work,
json.loads(line.decode("utf-8","ignore"))
will.

Python Encoding\Decoding for writing to a text file

I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
kmlDescription = allTheNTMs[a]
print kmlDescription #this prints out fine
outputFile.write(kmlDescription)
The error i'm getting is "unicodedecodeerror: ascii codec can't decode byte 0xc2 in position 213:ordinal not in range (128).
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
kmlDescription = allTheNTMs[a]
kmlDescriptionDecode = kmlDescription.decode("latin-1")
outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!

My guess is that output file you have opened has been opened with latin1 or even utf-8 codec hence you are not able to write utf-8 encoded data to that because it tries to reconvert it, otherwise to a normally opened file you can write any arbitrary data string, here is an example recreating similar error
u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb',encoding='latin1')
f.write(s)
output:
Traceback (most recent call last):
File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
this will work, if you don't set any codec
f = open('del.txt', 'wb')
f.write(s)
other option is to directly write to file without encoding the unicode strings, if outputFile has been opened with correct codec e.g.
f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)

Your error message doesn't seem to appear to relate to any of your Python syntax but actually the fact you're trying to decode a Hex value which has no equivalent in UTF-8.
HEX 0xc2 appears to represent a latin character - an uppercase A with an accent on the top. Therefore, instead of using "allTheNTMs.append(contentRaw[s1:].encode("utf-8"))", try:-
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python so this may not work but it would appear you're trying to encode a latin character. Given the error message you are receiving too, it would appear that when trying to encode in UTF-8, Python only looks through the first 128 entries given that your error appears to indicate that entry "0Xc2" is out of range which indeed it is out of the first 128 entries of UTF-8.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Unicode Error in Django while loading in data - python

The line is encoded using latin1. To properly decode it you should do (assuming Python 2.x): line = 'Auf der gr\xfcnen Wiese (1953)' name = line.decode('latin1') If you are reading this from a file, you can also do: f = codecs.open(path, 'r', 'latin1') name = f.readline().strip()

Related

DjangoUnicodeDecodeError while getting field

Python encoding issue while reading a file

problems of reading a collection of files including non-ascii characters

load .json into python; UnicodeDecodeError

Python Encoding\Decoding for writing to a text file

Categories

Resources