I am using Python to extract data from an MSSQL database, using an ODBC connection. I am then trying to put the extracted data into an Excel file, using xlwt.
However this generates the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 20: ordinal not in range(128)
I have run the script to just print the data and established that the offending character in the database is an O with a slash through it (Ø). When printed in Python it shows as "\xd8".
The worksheet encoding for xlwt is set as UTF-8.
Is there any way to have this come straight through into Excel?
Edit
Full error message below:
C:\>python dbtest1.py
Traceback (most recent call last):
File "dbtest1.py", line 24, in <module>
ws.write(i,j,item)
File "build\bdist.win32\egg\xlwt\Worksheet.py", line 1032, in write
File "build\bdist.win32\egg\xlwt\Row.py", line 240, in write
File "build\bdist.win32\egg\xlwt\Workbook.py", line 309, in add_str
File "build\bdist.win32\egg\xlwt\BIFFRecords.py", line 25, in add_str
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 20: invalid continuation byte
Setting the workbook encoding to 'latin-1' seems to have achieved the same:
wb = xlwt.Workbook(encoding='latin-1')
(It was set at 'UTF-8' before)
The other answer didn't work in my case as there were other fields that were not strings.
The SQL extraction seems to be returning byte strings encoded as Latin-1, not ASCII (0xd8 is outside the ASCII range, which is why xlwt's UTF-8 decode fails). You can convert them to unicode with:
data = unicode(input_string, 'latin-1')
You can then put them into a spreadsheet with xlwt.
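Since some fields are not strings, a per-cell guard sidesteps that problem: decode only the byte strings before handing values to `ws.write`, and pass everything else through untouched. A minimal sketch (the sample row is made up; shown in Python 3 syntax, where byte strings are `bytes` — in Python 2 the check would be `isinstance(value, str)`):

```python
def to_text(value):
    # Only byte strings need decoding; numbers, dates and None pass
    # through unchanged, which handles the non-string fields.
    if isinstance(value, bytes):
        return value.decode('latin-1')
    return value

row = (b'J\xd8RGEN', 42, None)   # b'\xd8' is the Latin-1 byte for 'Ø'
cells = [to_text(v) for v in row]
```

Each element of `cells` can then be written with `ws.write(i, j, item)` as in the original loop.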
Related
I have a file that contains Windows-1254 encoded data and I want to decode it to a readable format with Python, but I have had no success so far.
I tried this code but I am getting an error:
file = open('file_name', 'rb')
content_of_file = file.read()
file.close()
decoded_data = content_of_file.decode('cp1254') # here I am trying to decode that file
print(decoded_data)
This is the error that I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.10/encodings/cp1254.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 658: character maps to <undefined>
The way that I know this file was encoded in Windows-1254 is by using chardet.detect(data)['encoding'], so keep that in mind.
Does anyone know how I can decode that data in a way I can understand it?
This is the file content if someone needs it.
b'c\x00\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00C\x00\x00\x00s3\x03\x00\x00d\x01\x00\x01\x03\x03\x03\t\x02\x02\t\x03\x03\x03\t\x02\x02\td\x02\x00d\x00\x00l\x00\x00}\x01\x00d\x02\x00d\x00\x00l\x01\x00}\x02\x00t\x02\x00\x83\x00\x00}\x03\x00d\x03\x00}\x00\x00t\x03\x00|\x03\x00\x83\x01\x00d\x04\x00k\x03\x00rZ\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\xd0\x02|\x03\x00d\x06\x00\x19d\x07\x00k\x03\x00ry\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x08\x00\x19\x83\x01\x00d\t\x00Ad\n\x00k\x03\x00r\xa2\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x0b\x00d\x0c\x00!d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19j\x06\x00d\r\x00\x83\x01\x00s\xd4\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19j\x08\x00\x83\x00\x00\x0co\x07\x01|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19d\x0f\x00j\t\x00d\x10\x00\x83\x01\x00k\x02\x00s\x19\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00|\x01\x00j\n\x00d\x11\x00d\x04\x00\x83\x02\x00d\x06\x00\x14d\x11\x00\x17\x19j\x0b\x00d\x12\x00\x83\x01\x00d\x13\x00k\x03\x00rU\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x14\x00\x19\x83\x01\x00t\x05\x00|\x03\x00d\x15\x00\x19\x83\x01\x00Ad\x16\x00k\x03\x00s\x89\x01|\x03\x00d\x14\x00\x19d\x17\x00k\x03\x00r\x98\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x18\x00\x19d\x19\x00j\t\x00d\x1a\x00\x83\x01\x00k\x03\x00r\xc0\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x1b\x00\x19d\x1c\x00j\x0c\x00d\x1c\x00d\x1d\x00\x83\x02\x00k\x03\x00r\xeb\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\r\x00d\x1e\x00d\x0b\x00\x83\x02\x00\x19t\x0e\x00d\x11\x00\x83\x01\x00d\x1f\x00\x17j\t\x00d\x10\x00\x83\x01\x00k\x03\x00r&\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d 
\x00d!\x00!j\x0c\x00d"\x00d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19d\x1c\x00\x83\x02\x00d\x1c\x00k\x03\x00ra\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d!\x00\x19t\x0f\x00i\x00\x00\x83\x01\x00j\x10\x00d\x0b\x00\x19k\x03\x00r\x8d\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d#\x00\x19t\x0e\x00d\x06\x00\x83\x01\x00k\x03\x00r\xb2\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d$\x00\x19d%\x00k\x07\x00r\xd1\x02d&\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x02\x00j\x01\x00|\x03\x00d'\x00\x19\x83\x01\x00j\x11\x00\x83\x00\x00d(\x00k\x03\x00r\xff\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d)\x00\x19\x83\x01\x00t\x07\x00d*\x00\x83\x01\x00k\x03\x00r*\x03d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00d+\x00GHd\x00\x00S(,\x00\x00\x00Ns\x1a\x00\x00\x00#Import random,md5 modulesi\xff\xff\xff\xffs\x14\x00\x00\x00#Start checking flagi\x12\x00\x00\x00s\x05\x00\x00\x00NO!!!i\x00\x00\x00\x00t\x01\x00\x00\x00Mi\x01\x00\x00\x00i\xff\x00\x00\x00i\xaa\x00\x00\x00i\x02\x00\x00\x00i\x04\x00\x00\x00t\x02\x00\x00\x00HCs\x10\x00\x00\x00([] > {}) << -~1t\x02\x00\x00\x007bt\x03\x00\x00\x00hexi\x05\x00\x00\x00t\x06\x00\x00\x00base64s\x05\x00\x00\x00ZA==\ni\x06\x00\x00\x00i\x07\x00\x00\x00i\x1a\x00\x00\x00t\x01\x00\x00\x00ii\x08\x00\x00\x00t\x01\x00\x00\x00nt\x05\x00\x00\x00rot13i\t\x00\x00\x00t\x00\x00\x00\x00t\x01\x00\x00\x00st\x04\x00\x00\x001010t\x01\x00\x00\x00fi\x0b\x00\x00\x00i\r\x00\x00\x00t\x02\x00\x00\x00ypi\x0e\x00\x00\x00i\x0f\x00\x00\x00t\x08\x00\x00\x00dddddddds\x06\x00\x00\x00NO!!!!i\x10\x00\x00\x00t 
\x00\x00\x00eccbc87e4b5ce2fe28308fd9f2a7baf3i\x11\x00\x00\x00sM\x00\x00\x00~(~(~((((~({}<[])<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))s\n\x00\x00\x00Bazinga!!!(\x12\x00\x00\x00t\x06\x00\x00\x00randomt\x03\x00\x00\x00md5t\t\x00\x00\x00raw_inputt\x03\x00\x00\x00lent\x04\x00\x00\x00quitt\x03\x00\x00\x00ordt\n\x00\x00\x00startswitht\x04\x00\x00\x00evalt\x07\x00\x00\x00isalphat\x06\x00\x00\x00decodet\x07\x00\x00\x00randintt\x06\x00\x00\x00encodet\x07\x00\x00\x00replacet\x03\x00\x00\x00intt\x03\x00\x00\x00strt\x04\x00\x00\x00typet\x08\x00\x00\x00__name__t\t\x00\x00\x00hexdigest(\x04\x00\x00\x00t\x07\x00\x00\x00commentR\x0f\x00\x00\x00R\x10\x00\x00\x00t\x03\x00\x00\x00ans(\x00\x00\x00\x00(\x00\x00\x00\x00sA\x00\x00\x00XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXt\x0c\x00\x00\x00get_the_flag\x0b\x00\x00\x00sl\x00\x00\x00\x00\x02\x12\x01\x0c\x01\x0c\x01\t\x02\x06\x01\x12\x01\x05\x01\n\x02\x10\x01\x05\x01\n\x01\x1a\x01\x05\x01\n\x01#\x01\x05\x01\n\x016\x01\x05\x01\n\x01-\x01\x05\x01\n\x014\x01\x05\x01\n\x01\x19\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01,\x01\x05\x01\n\x01,\x01\x05\x01\n\x01\x1d\x01\x05\x01\n\x01\x16\x01\x05\x01\n\x01\x10\x01\x05\x01\n\x01\x1f\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01'
I am doing some csv read/write operations on a huge flat file. According to the source, the contents of the file are UTF-8 encoded, but while trying to write to a .csv file I am getting the following error:
Traceback (most recent call last):
File " basic.py", line 12, in <module>
F.write(q)
File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x9a' in position 19: character maps to <undefined>
Quite possibly the file contains some multicultural symbols, given that it holds data representing global information.
But as it's huge, it can't be fixed manually one by one. So is there a way I can fix these errors, make the code ignore them, or ideally standardize these characters? After writing them, I will be uploading the csv file to a SQL Server database.
I had the same error message when trying to write text to a .txt file.
I solved the problem by replacing
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w")
text_file.write(my_text)
with
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w", encoding='utf-8')
text_file.write(my_text)
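The same `encoding='utf-8'` fix applies to the csv question above, which hit the cp1252 default on Windows; the csv docs also recommend `newline=''` when opening the file. A minimal round-trip sketch (file path and rows are illustrative):

```python
import csv
import os
import tempfile

rows = [['id', 'name'], ['1', '\u0160koda'], ['2', '\u0141\u00f3d\u017a']]

# encoding='utf-8' overrides the locale default (cp1252 on Windows),
# which is what raised the UnicodeEncodeError; newline='' prevents
# the csv module's line endings from being mangled.
path = os.path.join(tempfile.mkdtemp(), 'out.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(rows)

with open(path, encoding='utf-8', newline='') as f:
    back = list(csv.reader(f))
```

If the SQL Server column is NVARCHAR, the UTF-8 file can be loaded without further transcoding.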
While trying to decode values from HBase, I am seeing an error. Python apparently thinks the data is not in UTF-8 format, but the Java application that put the data into HBase encoded it as UTF-8.
a = '\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y'
a.decode("UTF-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 9: invalid continuation byte
any thoughts?
That data is not valid UTF-8, so if you really retrieved it as such from the database, you should check who/what put it in there.
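For what it's worth, those twelve bytes look more like packed binary (fixed-width integers) than text in any encoding. Purely as an illustration, not a claim about the actual HBase schema, the struct module can unpack such a value:

```python
import struct

a = b'\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y'

# One guess: three big-endian 32-bit unsigned ints. The '>III' layout
# is an assumption; the real layout depends on how the Java side
# serialized the value (e.g. Bytes.toBytes on longs/ints).
values = struct.unpack('>III', a)
```

Checking the Java writer's serialization code is the only way to know the real format.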
I am using hfcca to calculate cyclomatic complexity for C++ code. hfcca is a simple Python script (https://code.google.com/p/headerfile-free-cyclomatic-complexity-analyzer/). When I try to run the script to generate the output as an XML file, I get the following errors:
Traceback (most recent call last):
File "./hfcca.py", line 802, in <module>
main(sys.argv[1:])
File "./hfcca.py", line 798, in main
print(xml_output([f for f in r], options))
File "./hfcca.py", line 798, in <listcomp>
print(xml_output([f for f in r], options))
File "/x/home06/smanchukonda/PREFIX/lib/python3.3/multiprocessing/pool.py", line 652, in next
raise value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte
Please help me with this.
The problem looks like the file contains bytes that are valid Latin-1 but not valid UTF-8. The file utility can be useful for figuring out what encoding a file should be treated as, e.g.:
monk#monk-VirtualBox:~$ file foo.txt
foo.txt: UTF-8 Unicode text
Here's what the bytes mean in latin1:
>>> b'\xe2'.decode('latin1')
'â'
Probably easiest is to convert the files to utf8.
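A minimal conversion sketch: read the file as Latin-1 (which maps every possible byte, so it never raises) and rewrite it as UTF-8. The sample file and its contents are made up for the demonstration:

```python
import os
import tempfile

# Create a sample Latin-1 file (stand-in for the real source file).
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'wb') as f:
    f.write(b'caf\xe9 \xe2')        # Latin-1 bytes for 'café â'

with open(path, encoding='latin-1') as f:
    text = f.read()                 # Latin-1 decodes any byte sequence

with open(path, 'w', encoding='utf-8') as f:
    f.write(text)                   # rewritten as valid UTF-8

with open(path, encoding='utf-8') as f:
    roundtrip = f.read()
```

After this, tools that insist on UTF-8 (like hfcca here) can read the file without a UnicodeDecodeError.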
I also had the same problem rendering Markup("""yyyyyy""") but I solved it using an online tool which removed the 'bad' characters: https://pteo.paranoiaworks.mobi/diacriticsremover/
It is a nice tool and works even offline.
I am trying to read a UTF-8 encoded XML file in Python and doing some processing on the lines read from the file, something like below:
next_sent_separator_index = doc_content.find(word_value, int(characterOffsetEnd_value) + 1)
Where doc_content is the line read from the file and word_value is one of the strings from the same line. I am getting an encoding-related error on the above line whenever doc_content or word_value contains Unicode characters. So I tried to decode them first with UTF-8 (instead of the default ASCII codec), as below:
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
But I am still getting UnicodeDecodeError as below :
Traceback (most recent call last):
File "snippetRetriver.py", line 402, in <module>
sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
File "snippetRetriver.py", line 201, in getSentenceList
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)
Can anyone suggest a suitable approach to avoid these kinds of encoding errors in Python 2.7?
codecs.utf_8_decode(input.encode('utf8'))
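The reason the traceback ends in a UnicodeEncodeError despite calling .decode() is that in Python 2, calling .decode('utf-8') on a value that is *already* unicode first encodes it with the default ASCII codec, and u'\xe9' is not ASCII. A guard that decodes only byte strings avoids this. A sketch (written so it also runs on Python 3, where `bytes` is the byte-string type; the sample values are illustrative):

```python
def ensure_text(value, encoding='utf-8'):
    # Decode only byte strings; values that are already text are
    # returned unchanged, avoiding Python 2's implicit ASCII encode
    # when .decode() is called on a unicode object.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

doc_content = b'caf\xc3\xa9 menu'   # UTF-8 bytes as read from the file
word_value = 'caf\u00e9'            # already text
idx = ensure_text(doc_content).find(ensure_text(word_value))
```

With both operands normalized to text, `.find()` compares code points and never triggers an implicit codec.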