Python: convert file content to unicode-escape form

For example, I have a file a.js whose content is:
Hello, 你好, bye.
It contains two Chinese characters whose Unicode-escape form is \u4f60\u597d.
I want to write a Python program that converts the Chinese characters in a.js to their Unicode-escape form and writes the result to b.js, whose content should be: Hello, \u4f60\u597d, bye.
My code:
fp = open("a.js")
content = fp.read()
fp.close()
fp2 = open("b.js", "w")
result = content.decode("utf-8")
fp2.write(result)
fp2.close()
but it seems that the Chinese characters are still written as characters, not as the ASCII escape sequences I want.

>>> print u'Hello, 你好, bye.'.encode('unicode-escape')
Hello, \u4f60\u597d, bye.
But you should consider using JSON, via the json module.
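For reference, in Python 3 (where str is already Unicode) a roughly equivalent sketch would be:

```python
import json

s = 'Hello, 你好, bye.'

# unicode_escape produces bytes; decode as ASCII to get a str back
print(s.encode('unicode_escape').decode('ascii'))  # Hello, \u4f60\u597d, bye.

# json.dumps escapes non-ASCII by default (ensure_ascii=True)
print(json.dumps(s))                               # "Hello, \u4f60\u597d, bye."
```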

You can try the codecs module:
codecs.open(filename, mode[, encoding[, errors[, buffering]]])
a = codecs.open("a.js", "r", "cp936").read() # a is a unicode object
codecs.open("b.js", "w", "utf16").write(a)
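Putting the pieces together in Python 3, a minimal end-to-end sketch of the a.js → b.js conversion (assuming the input file is UTF-8) could look like this; the first block only creates a sample a.js so the sketch is self-contained:

```python
# create a sample a.js so the sketch is self-contained
with open('a.js', 'w', encoding='utf-8') as f:
    f.write('Hello, 你好, bye.')

# read the UTF-8 input
with open('a.js', encoding='utf-8') as f:
    content = f.read()

# escape all non-ASCII characters as \uXXXX sequences
escaped = content.encode('unicode_escape').decode('ascii')

# write the pure-ASCII result
with open('b.js', 'w', encoding='ascii') as f:
    f.write(escaped)
```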

There are two ways you can do this.
First, use the 'encode' method:
str1 = "Hello, 你好, bye. "
print(str1.encode("raw_unicode_escape"))
print(str1.encode("unicode_escape"))
You can also use the 'codecs' module:
import codecs
print(codecs.raw_unicode_escape_encode(str1))

I found that repr(content.decode("utf-8")) returns "u'Hello, \u4f60\u597d, bye'",
so repr(content.decode("utf-8"))[2:-1] will do the job.

You can use repr:
a = u"Hello, 你好, bye. "
print repr(a)[2:-1]
or you can use the encode method:
print a.encode("raw_unicode_escape")
print a.encode("unicode_escape")

Related

Python writing to file and json returns None/null instead of value

I'm trying to write data to a file with the following code
#!/usr/bin/python37all
print('Content-type: text/html\n\n')
import cgi
from Alarm import *
import json
htmldata = cgi.FieldStorage()
alarm_time = htmldata.getvalue('alarm_time')
alarm_date = htmldata.getvalue('alarm_date')
print(alarm_time,alarm_date)
data = {'time':alarm_time,'date':alarm_date}
# print(data['time'],data['date'])
with open('alarm_data.txt','w') as f:
    json.dump(data,f)
...
but when opening the the file, I get the following output:
{'time':null,'date':null}
The print statement outputs what I expect it to: 14:26 2020-12-12.
I've tried the same method with f.write() but it also returns both values as None. This is being run on a Raspberry Pi. Why aren't the correct values being written?
--EDIT--
The json string I expect to see is the following:{'time':'14:26','date':'2020-12-12'}
Perhaps you meant:
data = {'time':str(alarm_time), 'date':str(alarm_date)}
I would expect to see your file contents like this:
{"time":"14:26","date":"2020-12-12"}
Note the double quotes: ". json is very strict about these things, so don't fool yourself into having single quotes ' in a file and expecting json to parse it.
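The null in the file is simply how JSON serializes Python's None, which is what FieldStorage.getvalue() returns when a form field is missing. A minimal sketch of what is happening:

```python
import json

# getvalue() returns None when the form field is absent,
# and json serializes None as null
data = {'time': None, 'date': None}
print(json.dumps(data))  # {"time": null, "date": null}
```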

How to fix LookupError: 'base64' in Python 3 when running the string decode() method

I am running the program below in Python 3, but getting an error:
Str = "this is string example....wow!!!";
Str = Str.encode('base64','strict');
print ("Encoded String: " + Str)
print ("Decoded String: " + Str.decode('base64','strict'))
The error is:
File "learn.py", line 646, in <module>
Str = Str.encode('base64','strict');
LookupError: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs
Base64 is not a text encoding; it's a method for encoding bytes. Use the base64 module instead.
Encode and decode base64 (using more variables than required so that everyone can keep track of the operations):
import base64
print("base64 Encoded/Decoded string \n")
text = "this is a test string "
s1 = text.encode('ascii')        # b'this is a test string '
s2 = base64.b64encode(s1)        # store the value before printing
print("encoded : ", s2)
print("decoded : ", base64.b64decode(s2))  # direct conversion
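If you want the encoded value as a str rather than bytes (for printing or embedding in JSON, say), decode the result as ASCII; a round-trip sketch:

```python
import base64

original = "this is a test string"

# encode: str -> bytes -> base64 bytes -> str
encoded = base64.b64encode(original.encode('ascii')).decode('ascii')

# decode: str -> base64 bytes -> bytes -> str
decoded = base64.b64decode(encoded).decode('ascii')

print(encoded)
assert decoded == original
```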

"Prefix" before "unicode(entry.title.text, "utf-8")"

I'm editing a script which gets the name of a video url and the line of code that does this is:
title = unicode(entry.title.text, "utf-8")
It can be found here. Is there a simple way to add a predefined prefix before this?
For example if there is a Youtube video named "test", the script should show "Testing Videos: test".
Just prepend a unicode string:
title = u'Testing Videos: ' + unicode(entry.title.text, "utf-8")
or use string formatting for more complex options; like adding both a prefix and a postfix:
title = u'Testing Videos: {} (YouTube)'.format(unicode(entry.title.text, "utf-8"))
All that unicode(inputvalue, codec) does is decode a byte string to a unicode value; you are free to concatenate that with other unicode values, including unicode literals.
An alternative spelling would be to use the str.decode() method on the entry.title.text object:
title = u'Testing Videos: ' + entry.title.text.decode("utf-8")
but the outcome would be the same.

Python/Django: How to convert utf-16 str bytes to unicode?

Fellows,
I am unable to parse a Unicode text file submitted using Django forms. Here are the quick steps I performed:
Uploaded a text file ( encoding: utf-16 ) ( File contents: Hello World 13 )
On server side, received the file using filename = request.FILES['file_field']
Going line by line: for line in filename: yield line
type(filename) gives me <class 'django.core.files.uploadedfile.InMemoryUploadedFile'>
type(line) is <type 'str'>
print line : '\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00 \x001\x003\x00'
codecs.BOM_UTF16_LE == line[:2] returns True
Now, I want to re-construct the unicode or ascii string back like "Hello World 13" so that I can parse the integer from line.
One of the ugliest ways of doing this is to retrieve the tail with line[-5:] (= '\x001\x003\x00') and then construct the number from line[-5:][1] and line[-5:][3].
I am sure there must be better way of doing this. Please help.
Thanks in advance!
Use codecs.iterdecode() to decode the object on the fly:
from codecs import iterdecode
for line in iterdecode(filename, 'utf16'): yield line
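Applied to the byte string from the question, a self-contained sketch (using io.BytesIO to stand in for the uploaded file object):

```python
import codecs
import io

# the UTF-16-LE bytes from the question, including the BOM
raw = (b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d'
       b'\x00 \x001\x003\x00')
uploaded = io.BytesIO(raw)  # stand-in for the uploaded file object

# iterdecode decodes each chunk on the fly; 'utf-16' consumes the BOM
text = ''.join(codecs.iterdecode(uploaded, 'utf-16'))
print(text)                        # Hello World 13

number = int(text.split()[-1])     # parse the trailing integer
print(number)                      # 13
```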

character encoding in python

I have a byte stream that looks like this '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
str_data is written into a text file using the following code:
file = open("test_doc","w")
file.write(str_data)
file.close()
If test_doc is opened in a web browser and character encoding is set to Japanese it just works fine.
I am using ReportLab for generating the PDF, using the following code:
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfgen.canvas import Canvas
from reportlab.pdfbase.cidfonts import CIDFont
pdfmetrics.registerFont(CIDFont('HeiseiMin-W3','90ms-RKSJ-H'))
pdfmetrics.registerFont(CIDFont('HeiseiKakuGo-W5','90ms-RKSJ-H'))
c = Canvas('test1.pdf')
c.setFont('HeiseiMin-W3-90ms-RKSJ-H', 6)
message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88';
c.drawString(100, 675,message1)
c.save()
Here I use the message1 variable, which gives output in Japanese. I need to use message3 instead of message1 to generate the PDF, but message3 generates garbage, probably because of improper encoding.
Here is an answer:
message1 is encoded in shift_jis; message3 and str_data are encoded in UTF-8. All appear to represent Japanese text. See the following IDLE session:
>>> message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
>>> print message1.decode('shift_jis')
これは平成明朝です。
>>> message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
>>> print message3.decode('UTF-8')
テスト
>>> str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
>>> print str_data.decode('UTF-8')
日本語
>>>
Google Translate detects the language as Japanese and translates them to the English "This is the Heisei Mincho.", "Test", and "Japanese" respectively.
What is the question?
I guess you have to learn more about string encodings in general. A string in Python 2 has no encoding information attached, so it's up to you to use it in the right way or convert it appropriately. Have a look at unicode strings, the encode/decode methods, and the codecs module. Also check whether c.drawString allows passing a unicode string, which might make your life much easier.
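To make the bytes-vs-text distinction concrete, a small Python 3 sketch with the data from the question:

```python
data = b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'  # UTF-8 bytes for 日本語

text = data.decode('utf-8')   # decode: bytes -> str
print(text)                   # 日本語

back = text.encode('utf-8')   # encode: str -> bytes
assert back == data
```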
If you need to detect these encodings on the fly, you can take a look at Mark Pilgrim's excellent open source Universal Encoding Detector.
#!/usr/bin/env python
import chardet
message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
print chardet.detect(message1)
message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
print chardet.detect(message3)
str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
print chardet.detect(str_data)
Output:
{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
