I have this text, which is entered in a text box:
‘f’fdsfs’`124539763~!##$%^’’;’””::’
I am converting it to JSON, and then it comes out as:
"\\u2018f\\u2019fdsfs\\u2019`124539763~!##$%^\\u2019\\u2019;\\u2019\\u201d\\u201d::\\u2019e"
Now when I write the CSV file, I get this error:
'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
csv.writer(data)
I tried data.encode('utf-8') and data.decode('unicode-escape'), but neither worked.
The Python 2 csv module does not support Unicode; use https://github.com/jdunck/python-unicodecsv instead.
(As an aside, \u2018 is representable in UTF-8, like every code point; the error comes from the csv module implicitly encoding your unicode strings as ASCII.)
import json

x = "\\u2018f\\u2019fdsfs..."
j = json.loads('"' + x + '"')
print j.encode('cp1252')
‘f’fdsfs...
Note that the output is encoded as cp1252, the Windows codepage that actually contains these curly-quote characters.
>>> import json
>>> import unicodecsv as csv  # https://github.com/jdunck/python-unicodecsv
>>> x = "\\u2018f\\u2019fdsfs..."; j = json.loads('"' + x + '"')
>>> with open("some_file.csv", "wb") as f:
...     w = csv.writer(f, encoding="cp1252")
...     w.writerow([j, "normal"])
...
>>>
Here is the resulting csv file: https://www.dropbox.com/s/m4gta1o9vg8tfap/some_file.csv
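For what it's worth, on Python 3 the standard csv module handles unicode natively, so unicodecsv isn't needed there. A minimal sketch of the same thing (the file name is just illustrative):

import csv
import json

x = "\\u2018f\\u2019fdsfs..."
j = json.loads('"' + x + '"')

# In text mode the csv module encodes for us; newline='' lets it manage line endings.
with open("some_file_py3.csv", "w", encoding="cp1252", newline="") as f:
    w = csv.writer(f)
    w.writerow([j, "normal"])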
I need to convert the hex data '767f440128e1a00a' to a packed EBCDIC string. I want all of the results in one string, but Python gives me a Unicode error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
s = '767f440128e1a00a'
output = []
DDF = [1]
distance = 0
for y in range(1, len(s[2:])):
    for x in DDF:
        if s[2:][distance:x*2+distance] != '':
            output.append(s[2:][distance:x*2+distance])
        else:
            continue
        distance += x*2
print(output)
import codecs

final = []
result = ''
bytearrya = []
for x in output:
    # this .decode() defaults to UTF-8, which is what raises the UnicodeDecodeError
    result = str(bytearray.fromhex(x).decode())
    x = codecs.decode(x, "hex")
    final.append(x)
Here is code based on "Python byte representation of a hex string that is EBCDIC", which mentions: "According to this, you need to use 'cp500' for decoding".
Codec    Aliases                               Languages
cp500    EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500    Western Europe
my_string_in_hex = '767f440128e1a00a'
my_bytes = bytearray.fromhex(my_string_in_hex)
print(my_bytes)
my_string = my_bytes.decode('cp500')
print(my_string)
Output:

bytearray(b'v\x7fD\x01(\xe1\xa0\n')

The second print shows the cp500 decoding of those bytes: a mix of accented letters and control characters rather than readable text, since packed data like this is not really a character string.
I am trying to print Chinese text to a file. When I print it on the terminal, it looks correct to me. When I use print >> filename ... I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 24: ordinal not in range(128)
I don't know what else I need to do; I already encoded all textual data to UTF-8 and used string formatting.
This is my code:
# -*- coding: utf-8 -*-
import string
import codecs

exclude = string.punctuation

def get_documents_cleaned(topics, filename):
    c = codecs.open(filename, "w", "utf-8")
    for top in topics:
        print >> c, "num", ",", "text", ",", "id", ",", top
        document_results = proj.get('docs', text=top)['results']
        for doc in document_results:
            print "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))
            print >> c, "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None, exclude), doc[0]['document']['id'], top.encode('utf-8'))

get_documents_cleaned(my_topics, "export_c.csv")
print doc[0]['text'] looks like this:
u' \u3001 \u6010...'
Since your first print statement works, it's clear that it's not the format function raising the UnicodeDecodeError.
Instead, it's a problem with the file writer: c expects a unicode object but gets a UTF-8 encoded str object (let's name it s). So c implicitly calls s.decode() with the default ASCII codec, which raises the UnicodeDecodeError.
You can fix the problem by calling s.decode('utf-8') before printing, or by using the plain built-in open(filename, "w") instead.
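A minimal sketch of the first fix, assuming Python 2 (the sample text stands in for the format() result above):

import codecs

c = codecs.open("export_c.csv", "w", "utf-8")
s = u'\u3001\u6010'.encode('utf-8')  # a UTF-8 encoded str, like the formatted row
print >> c, s.decode('utf-8')        # decode back to unicode before the writer encodes it
c.close()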
I have a huge amount of JSON data that I need to transfer to Excel (10,000 or so rows and 20-ish columns). I'm using csv. My code:
import json
import csv
import codecs
import urllib2

x = json.load(urllib2.urlopen('#####'))
f = csv.writer(codecs.open("fsbmigrate3.csv", "wb+", encoding='utf-8'))
y = # my headers
f.writerow(y)
for row in x:
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
is what comes up. I have tried encoding the JSON data:
dict((k.encode('utf-8'), v.encode('utf-8')) for (k, v) in x)
but there is too much data to handle.
Any ideas on how to pull this off? (Apologies for the lack of SO convention; it's my first post.)
The full traceback is:

Traceback (most recent call last):
  File "C:\Users\bryand\Desktop\bbsports_stuff\gba2.py", line 22, in <module>
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
[Finished in 6.2s]
Since you didn't specify, here's a Python 3 solution. (The Python 2 solution is much more painful.) I've included some short sample data with non-ASCII characters:
#!python3
import json
import csv
json_data = '[{"a": "\\u9a6c\\u514b", "c": "somethingelse", "b": "something"}, {"a": "one", "c": "three", "b": "two"}]'
data = json.loads(json_data)
with open('fsbmigrate3.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=sorted(data[0].keys()))
    w.writeheader()
    w.writerows(data)
The utf-8-sig codec makes sure a byte order mark (BOM) is written at the start of the output file, since Excel will otherwise assume the local ANSI encoding.
Since you have json data with key/value pairs, using DictWriter allows the headers to be specified; otherwise, the header order isn't predictable.
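If you are stuck on Python 2, a rough sketch using the unicodecsv package mentioned in the first answer (same sample data; I haven't timed it on 10,000 rows):

import json
import unicodecsv  # https://github.com/jdunck/python-unicodecsv

json_data = '[{"a": "\\u9a6c\\u514b", "c": "somethingelse", "b": "something"}]'
data = json.loads(json_data)

# unicodecsv encodes each unicode value on the way out; the file is opened in binary mode.
with open('fsbmigrate3_py2.csv', 'wb') as f:
    w = unicodecsv.DictWriter(f, fieldnames=sorted(data[0].keys()), encoding='utf-8')
    w.writeheader()
    w.writerows(data)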
It is giving the following error:
print Text
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2026' in position 0: character maps to <undefined>
The code I am using is:
import win32com.client
import os

MSWord = win32com.client.Dispatch("Word.Application")
MSWord.Visible = True
Document = MSWord.Documents.Open(os.getcwd() + '\\' + 'MARS.doc')
for paragraph in Document.Paragraphs:
    Text = paragraph.Range.Text
    print Text
Your Text has unicode characters that cannot be printed to stdout. Try:

Document = MSWord.Documents.Open(os.getcwd() + '\\' + 'MARS.doc')
counter = 0
for paragraph in Document.Paragraphs:
    counter += 1
    Text = paragraph.Range.Text
    print "paragraph to edit:", counter, ":"
    print Text.encode('ascii', 'replace')
This way, the characters that cannot be encoded will show up as '?'. But if you clarify in your question what you are really trying to do (probably some text processing), you'll get more useful answers.
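If you would rather keep those characters than replace them, a sketch of an alternative (not part of the original answer) is to write the text to a UTF-8 file instead of printing to the cp437 console; the output file name is just illustrative:

import win32com.client
import os
import codecs

MSWord = win32com.client.Dispatch("Word.Application")
Document = MSWord.Documents.Open(os.getcwd() + '\\' + 'MARS.doc')

out = codecs.open('mars_text.txt', 'w', 'utf-8')
for paragraph in Document.Paragraphs:
    out.write(paragraph.Range.Text + u'\n')  # unicode passes straight through to the file
out.close()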
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as utf-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid utf-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, via the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the data as "cp1252" and continues decoding the rest of the string as utf-8.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function, and found a catch: even if you increment the position from which to decode the string by 1, so that it starts on the next character, the error is still raised at the first non-ASCII, non-utf-8 character if the next character is also invalid. In other words, the decoding "walks back" when consecutive non-ASCII, non-utf-8 characters are found.
The workaround is to keep a state variable in the error handler that detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    # new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno, a whack of other Google searches, and other pounding, I solved it this way.

# The following works very well, but it does not allow for any attempt to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85': '...', # u'\u2026' ellipsis character
    '96': '-',   # u'\u2013' en-dash
    '97': '-',   # u'\u2014' em-dash
    '91': "'",   # u'\u2018' left single quote
    '92': "'",   # u'\u2019' right single quote
    '93': '"',   # u'\u201C' left double quote
    '94': '"',   # u'\u201D' right double quote
    '95': "*"    # u'\u2022' bullet
}

# This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition # comment this line out to get a question mark instead
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically, I attempt to turn it into utf-8. For any character that fails, I just convert it to hex so I can display it or look it up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
jsbueno's solution is good, but there is no need for the global variable last_position; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you, called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
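ftfy is on PyPI (pip install ftfy). As a quick sketch tying it back to the maçã example above (assuming the classic UTF-8-bytes-read-as-cp1252 mojibake):

from ftfy import fix_text

broken = "maÃ§Ã£"         # "maçã" encoded as UTF-8 but mis-decoded as cp1252
print(fix_text(broken))   # maçã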
I just ran into this today, so here is my problem and my own solution:
# Note the raw-string literal: the \xNN sequences must stay as literal
# backslash text for the function below to find and repair them.
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s) - 1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                # turn the four characters '\xNN' into the real character
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output + b
                ii += 3
            else:
                output = output + s[ii]
            ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:

Notificação de Emissão de Nota Fiscal Eletrônica.