How to convert unicode characters into their respective symbols in python?

I have a text file which contains unicode characters in the following format:
\u0935\u094d\u0926\u094d\u0928\u094d\u0935\u094d\u0926\
I want to convert it into devnagri characters in the following format:
वर्जनरूपमिति दर्शित्म् । स पूरुषः अमृतत्वाय कल्पते व्द्न्व्द
and then write it to a file.
Presently my code
encoded = x.encode('utf-8')
print (encoded.decode('unicode-escape'))
can print the devnagri characters in the terminal. However when I try to write it to a file using
text = 'target:'+encoded.decode('unicode-escape')+'\n'
fileid.write(text)
I am getting the following error.
'ascii' codec can't encode characters in position 7-18: ordinal not in range(128)
Can anybody please help me?

If you are using Python 2, it's because after using .decode('unicode-escape') you have a unicode object, and fileid.write() only accepts byte string (str) objects. Python then tries to convert the unicode object to a byte string using the ASCII codec, which doesn't cover Devanagari characters. This implicit conversion raises the exception.
You need to manually convert the unicode string back into a byte string before writing it to the file:
fileid.write(text.encode('utf-8'))
Here I assumed you want UTF-8 encoding. If you want to save the characters in another encoding replace 'utf-8' with the name of that encoding.
In Python 3 you can set the used encoding when opening the file:
fileid = open('compare.txt', 'a', encoding='utf-8')
Then the extra .encode('utf-8') isn't necessary.
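A minimal end-to-end sketch of the whole round trip, assuming Python 3 (the file name compare.txt is taken from the answer above; the escape string is shortened for illustration):

```python
# A line of literal \uXXXX escape sequences, as read from the input file
raw = r'\u0935\u094d\u0926'

# Interpret the escapes to get the actual Devanagari characters
text = raw.encode('ascii').decode('unicode-escape')

# Opening with an explicit encoding avoids the 'ascii' codec error on write
with open('compare.txt', 'a', encoding='utf-8') as fileid:
    fileid.write('target:' + text + '\n')

print(text)
```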

Related

Python search through file and un-escape unicode characters

I'm working on an application which uses UTF-8 encoding. For debugging purposes I need to print the text. If I use print() directly with the variable containing my unicode string, e.g. print(pred_str), I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves the string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data, because your console's default code page may not be able to encode the other non-ASCII characters in the data. In that case you will need to adjust your console settings to support UTF-8.
Alternatively, try this code, which strips the BOM from the already-decoded string:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str[1:]

Python 2.7 with pandas to_csv() gives UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 4: ordinal not in range(128)

I am using Python 2.7, and to work around UTF-8 issues I am using the pandas to_csv method. The problem is that I am still getting unicode errors, which I don't get when I run the script on my local laptop with Python 3 (which is not an option for batch processing).
df = pd.DataFrame(stats_results)
df.to_csv('/home/mp9293q/python_scripts/stats_dim_registration_set_column_transpose.csv',
          quoting=csv.QUOTE_ALL, doublequote=True, index=False,
          index_label=False, header=False, line_terminator='\n', encoding='utf-8')
Gives error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc7' in position 4: ordinal not in range(128)
I believe you might be having one of these two problems (or maybe both):
1. As you mentioned in the comments, the file in which you are trying to save the Unicode data already exists. In that case it is quite likely that the destination file's encoding is not UTF-8/16/32: when the file was originally created, its encoding may have been ANSI rather than UTF-8. So check whether the destination file's encoding belongs to the UTF family or not.
2. Encode the Unicode string to UTF-8 before storing it in the file. By this I mean: any content that you are trying to save to your destination file, if it contains Unicode text, should be encoded first.
Ex.
# A character which cannot be encoded in ASCII
Uni_str = u"Ç"
# Encoding the unicode text into UTF-8 bytes
Uni_str = Uni_str.encode("utf-8")
The above code works differently in Python 2.x and 3.x; the reason is that 2.x uses ASCII as the default encoding scheme while 3.x uses UTF-8. Another difference between the two is what they return from encode().
In Python 2.x
type(u"Ç".encode("utf-8"))
Outputs
<type 'str'>
In Python 3.x
type(u"Ç".encode("utf-8"))
Outputs
<class 'bytes'>
As you can see, in Python 2.x the return type of encode() is str, but in 3.x it is bytes.
So for your case, I would recommend encoding each string value in your dataframe that contains unicode data with encode() before trying to store it in the file.
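The same idea can be sketched without pandas in Python 3, where opening the file with an explicit encoding does the encode() step implicitly (the sample values are made up, and io.StringIO stands in for a file opened with encoding='utf-8'):

```python
import csv
import io

# Rows containing non-ASCII text, like the u'\xc7' ('Ç') from the error message
rows = [[u"Çelik", u"42"], [u"naïve", u"7"]]

# io.StringIO stands in for open('out.csv', 'w', encoding='utf-8', newline='')
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

print(buf.getvalue())
```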

Python Conversion from int to Windows-1252

I am currently writing a program that reads data from a serial port, adds some header information, and then writes this data to a .jpg file.
I need to write the file in the Windows-1252 encoding; the data and the header are constructed in hexadecimal format.
I realised my problem when comparing the picture that should have been written with what was actually written: the DOUBLE LOW-9 QUOTATION MARK was not written as a quote but rather as a zero.
The decimal code for that symbol is 132 (0x84). If I use chr(0x84) I get the following error
UnicodeEncodeError: 'charmap' codec can't encode character '\x84' in position 0: character maps to <undefined>
This only makes sense if chr() was mapping to the Latin-1 code set. I have tried to convert the int to unicode, but from my research chr() is the only function that does this.
I have also tried to use the struct package in python.
import struct
a = 123;
b = struct.pack("c",a)
print(b)
I get the error
Traceback (most recent call last):
  File "python", line 3, in <module>
struct.error: char format requires a bytes object of length 1
Reading past questions, answers, and documentation gets quite confusing, as there is a mix of Python 2 and Python 3 answers, mixed in with people converting to ASCII (which obviously wouldn't work).
I am using Python 3.4.3 (the latest version) on a Windows 7 machine.
UnicodeEncodeError: 'charmap' codec can't encode character \x84
\x84 is the encoding of the lower-quotes character in Windows-1252. This suggests your data is already encoded, and you should not try to encode it again. In a text string the quote should show up as '\u201e'; '\u0084' (the result of chr(132)) is actually a control character.
You should have either bytes which you can decode to a string:
>>> b"\x84".decode('windows-1252')
'\u201e'
Or you should have a text string, which you can encode to a byte string
>>> "\u201e".encode('windows-1252')
b'\x84'
If you read data from somewhere, you could use the struct module like this:
# suppose we download some data:
data=b'*\x00\x00\x00abcde'
a, txt = struct.unpack("I5s", data)
print(txt.decode('windows-1252'))

unicode error with codecs when reading a pdf file in python

I am trying to read a pdf file with the following content:
%PDF-1.4\n%âãÏÓ
If I read it with open, it works; but if I try codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Do you know a way to get a unicode string from the content of this file?
PS: I am using python 2.7
PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.
For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:
f = codecs.open(filename, encoding="ISO8859-1", mode="rb")
But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
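A short sketch of the bytes-first approach (the file name sample.pdf is made up, and its header bytes mirror the question's example):

```python
# Typical PDF header: an ASCII version line followed by a binary marker line
data = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'
with open('sample.pdf', 'wb') as f:
    f.write(data)

# Binary mode: the bytes are returned as-is, no decoding is attempted
with open('sample.pdf', 'rb') as f:
    header = f.read()

print(header[:8].decode('ascii'))     # the version marker is plain ASCII
print(header[9:].decode('latin-1'))   # Latin-1 maps every byte to a character
```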

Figuring out unicode: 'ascii' codec can't decode

I currently use Sublime Text 2 and run my Python code there.
When I try to run this code, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the Python documentation on unicode, and as far as I understand this should work; or is it the console that's not working?
Edit: Using s = u'abcdefö' as the string produces almost the same result:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded byte string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. When the script runs, the string literal has been stored as an encoded byte string, so when Python tries to decode it, it uses ASCII by default. As the string is actually UTF-8 encoded, this fails.
You can do s = u'abcdefö' which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing during runtime.
However, that does not necessarily mean that you can print s now. First the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails at figuring out the right character set and defaults to ASCII again, so you get an error when the string contains non-ASCII characters. The Python Wiki lists a few ways to set up stdout properly.
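For comparison, a minimal Python 3 sketch of the same decode step; Python 3 makes the bytes/text distinction explicit, so no ASCII default is involved:

```python
# The bytes as they would be stored in a UTF-8 encoded source file
raw = 'abcdefö'.encode('utf-8')

# Explicit decode; Python 2's unicode(raw) defaulted to the ascii codec here
s = raw.decode('utf-8')
print(s)
```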
You need to mark the string as a unicode string:
s = u'abcdefö'   # not s = 'abcdefö'
Do NOT call unicode() on a string that is already unicode; i.e. unicode(s) is wrong then.
If type(s) == str but it contains unicode characters, first convert it to unicode:
str_val = unicode(s, 'utf-8')
str_val = unicode(s, 'utf-8', 'replace')
then encode it back to a byte string for output:
str_val.encode('utf-8')
Now you can print:
print str_val.encode('utf-8')
