Encoding and decoding string in Python - python

I want to write a string to a file using Python. I know how to do that, so that's not a problem. I also wish to encode that string once it has been written. The encoding doesn't really matter, so I'll stick to let's say UTF-32. What I do for that is after I wrote the string, I read from the file again, encode the string into bytes and then re-write to the same file. I can do the encoding part, but my problem arises with the decoding. I want to read it as bytes so that I can convert it back to a str. What I do for this is the same principle: Read from file, decode and write to the same file. What I get from reading the encoded string looks like b'\xff\xfe\x00\x001\x00\x00\x004\x00\x00\x002\x00\x00\x00'
When I read this as bytes, it doubles the b and the backslashes. If I read it like this, as a string, and then try to decode, it keeps saying 'str' object does not have attribute decode or something. I know that I can't decode the string, but if I try with bytes it seems to be "doubling" the bytes.
Here is my code:
def readfile(filename):
    f = open(filename, 'r')
    s = f.read()
    f.close()
    return s

def readfile_b(filename):
    f = open(filename, 'rb')
    s = f.read()
    f.close()
    return s

def writefile(filename, writeobject):
    f = open(filename, 'w')
    f.write(writeobject)
    f.close()

def encode(filename):
    s = readfile(filename)
    s_enc = bytes(s, 'utf-32')
    writefile(filename, str(s_enc))

def decode(filename):
    s_enc = readfile_b(filename)
    print(s_enc)
    s = str(s_enc, 'utf-32')
    writefile(filename, s)

encode("Example.txt")
decode("Example.txt")
Output (for decode(), encode() didn't have any errors):
b"b'\\xff\\xfe\\x00\\x00H\\x00\\x00\\x00e\\x00\\x00\\x00l\\x00\\x00\\x00l\\x00\\x00\\x00o\\x00\\x00\\x00'"
Traceback (most recent call last):
  File "C:/bla/bla/bla/bla/Example.py", line 29, in <module>
    decode("MamaAccount.txt")
  File "C:/bla/bla/bla/bla/Example.py", line 26, in decode
    s = str(s_enc, 'utf-32')
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
Any help is greatly appreciated

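Try writing the file in binary mode. Currently you convert the bytes to a string with str(s_enc), which writes the printable representation of the bytes object (the text b'\xff\xfe...') instead of the raw bytes. When you read that back in binary mode, you get that text wrapped in a second b'...', with the backslashes doubled.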
This works for me:
def readfile(filename):
    f = open(filename, 'r')
    s = f.read()
    f.close()
    return s

def readfile_b(filename):
    f = open(filename, 'rb')
    s = f.read()
    f.close()
    return s

def writefile(filename, writeobject):
    f = open(filename, 'w')
    f.write(writeobject)
    f.close()

def writefile_b(filename, writeobject):
    f = open(filename, 'wb')
    f.write(writeobject)
    f.close()

def encode(filename):
    s = readfile(filename)
    s_enc = bytes(s, 'utf-32')
    writefile_b("bin_"+filename, s_enc)

def decode(filename):
    s_enc = readfile_b(filename)
    #print(s_enc)
    s = str(s_enc, 'utf-32')
    print(s)
    writefile("dec_"+filename, s)

encode("Example.txt")
decode("bin_Example.txt")
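To make the difference concrete, here is a minimal sketch of what str() does to a bytes object compared with writing the bytes in binary mode. The sample text and the temporary file path are made up for illustration.

```python
import os
import tempfile

text = "Hello"
encoded = text.encode("utf-32")  # 4-byte BOM followed by 4 bytes per character

# str() on a bytes object gives its printable form (b'...'), not decoded text.
as_repr = str(encoded)
print(as_repr)

# Writing the raw bytes in binary mode keeps them intact for decoding later.
path = os.path.join(tempfile.mkdtemp(), "bin_Example.txt")
with open(path, "wb") as f:
    f.write(encoded)
with open(path, "rb") as f:
    round_trip = f.read().decode("utf-32")
print(round_trip)
```

Writing as_repr to a text file is what produced the doubled b and backslashes in the question.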

Related

How to fix this error? 'utf-8' codec can't decode byte 0xef in position 32887: invalid continuation byte

Hello. I am trying to open this file which is in .txt format but it gives me an error.
Sometimes, when your files don't all share the same encoding, you have to be specific about the correct one.
You should pass it to the open function, for example:
with open('file.txt', encoding='utf-8') as f:
    ...  # etc.
You can also detect the file's encoding with the third-party chardet package, like this:
from chardet import detect
with open('file.txt', 'rb') as f:
    rawdata = f.read()
enc = detect(rawdata)['encoding']
with open('file.txt', encoding=enc) as f:
    ...  # etc.
Result:
>>> from chardet import detect
>>>
>>> with open('test.c', 'rb') as f:
...     rawdata = f.read()
...     enc = detect(rawdata)['encoding']
...
>>> print(enc)
ascii
Python 3.7.0
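To see where an "invalid continuation byte" error can come from, here is a minimal sketch. It assumes, purely for illustration, that the data was saved in a single-byte encoding such as Latin-1; the sample word and variable names are made up.

```python
# Hypothetical sample: "naive" with a diaeresis, saved as Latin-1,
# where the accented letter is the single byte 0xef.
raw = "na\u00efve".encode("latin-1")

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xef starts a multi-byte sequence and must be followed by
    # continuation bytes (0x80-0xbf); here it is followed by an ASCII "v".
    error = exc

print(error)

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("latin-1")
print(text)
```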

Read a str from a file that contains escaped hex bytes and decode it?

I have a file example.log which contains:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"/>
I want to read the file, convert the str to UTF-8, and write it to a new file. My current code is below:
import re

with open("example_decoded.log", 'w') as f:
    for line in open("example.log", 'r', encoding='utf-8'):
        m = re.search("<POOR_IN200901UV", line)
        if m:
            line = line[m.start():-2]
            line_bytes = bytes(line, encoding='raw_unicode_escape')
            line_decoded = line_bytes.decode('utf-8')
            print(line_decoded)
            f.write(line_decoded)
        else:
            pass
But the example_decoded.log's content:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"
The \xe6\xb6\x88\xe6\x81\xaf part isn't being decoded, so I am wondering how to deal with this mix-type str decode issue?
import codecs
decode_hex = codecs.getdecoder("hex_codec")
string = decode_hex(string)[0]
https://docs.python.org/3/library/codecs.html
Refer to this: Read hex characters and convert them to utf-8 using python 3
The solution is:
import re

with open("example_decoded.log", 'w') as f:
    for line in open("example.log", 'r', encoding='utf-8'):
        m = re.search("<POOR_IN200901UV", line)
        if m:
            line = line[m.start():-2]
            line_decoded = bytes(line, 'utf-8').decode('unicode_escape').encode('latin-1').decode('utf8')
            print(line_decoded)
            f.write(line_decoded)
        else:
            pass
However, I don't understand why encode('latin-1') is needed first.
Can someone explain that?
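On the encode('latin-1') question: as far as I understand it, unicode_escape turns each \xNN escape into the character with code point NN, not into a byte. Latin-1 maps code points 0 to 255 one-to-one onto byte values, so encoding with it recovers the original bytes, which can then be decoded as UTF-8. A small sketch with a made-up sample line:

```python
# Sample text containing literal backslash escapes, as in example.log.
line = r"<!-- \xe6\xb6\x88\xe6\x81\xafID -->"

# Step 1: interpret the escapes. Each \xNN becomes the character U+00NN.
chars = line.encode("utf-8").decode("unicode_escape")

# Step 2: Latin-1 maps U+0000..U+00FF to the bytes 0x00..0xff unchanged.
raw = chars.encode("latin-1")

# Step 3: those bytes are the real UTF-8 data.
decoded = raw.decode("utf-8")
print(decoded)
```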
import struct
decodedVal = struct.unpack(">f", bytes.fromhex(encoded_val))[0]
Refer to the link below to substitute your own byte order and type for ">f":
https://docs.python.org/3/library/struct.html
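As a quick illustration of the struct approach, assuming the hex string holds a 4-byte big-endian IEEE 754 float (the sample value below is made up):

```python
import struct

encoded_val = "40490fdb"  # hypothetical input: the float32 bit pattern of pi

# ">f" reads four bytes as a big-endian single-precision float.
value = struct.unpack(">f", bytes.fromhex(encoded_val))[0]
print(value)

# "<f" reads little-endian, so the same number needs the bytes reversed.
swapped = struct.unpack("<f", bytes.fromhex(encoded_val)[::-1])[0]
print(swapped)
```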

Getting unicode decode error in python?

I am using facebook graph API but getting error when I try to run graph.py
How should I resolve this problem of charmap. I am facing unicode decode error.
In graph.py:
    table = json2html.convert(json = variable)
    htmlfile = table.encode('utf-8')
    f = open('Table.html','wb')
    f.write(htmlfile)
    f.close()
    # replacing '&gt;' with '>' and '&lt;' with '<'
    f = open('Table.html','r')
    s = f.read()
    s = s.replace("&gt;",">")
    s = s.replace("&lt;","<")
    f.close()
    # writing content to html file
    f = open('Table.html','w')
    f.write(s)
    f.close()
    # output
    webbrowser.open("Table.html")
else:
    print("We couldn't find anything for",PageName)
I could not understand why I am facing this issue. Also getting some error with 's=f.read()'
The error message shows that Python had to guess the encoding when reading the file and ended up using cp1250 (probably because that is the Windows system default). That is the wrong encoding, because you saved the file as 'utf-8'.
So you have to use open( ..., encoding='utf-8') and it will not have to guess encoding.
# replacing '&gt;' with '>' and '&lt;' with '<'
f = open('Table.html','r', encoding='utf-8')
s = f.read()
f.close()
s = s.replace("&gt;",">")
s = s.replace("&lt;","<")
# writing content to html file
f = open('Table.html','w', encoding='utf-8')
f.write(s)
f.close()
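A self-contained way to reproduce the mismatch, as a sketch: the sample text is made up, and the file is written to a temporary directory rather than next to graph.py.

```python
import os
import tempfile

text = "<td>\u00c8</td>"  # hypothetical cell containing a non-ASCII letter
path = os.path.join(tempfile.mkdtemp(), "Table.html")

# Save as UTF-8 bytes, as graph.py does.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Reading with a different encoding either fails or yields garbled text.
try:
    with open(path, "r", encoding="cp1250") as f:
        wrong = f.read()
except UnicodeDecodeError as exc:
    wrong = None
    print("cp1250 failed:", exc)

# Reading with the encoding used for writing returns the original text.
with open(path, "r", encoding="utf-8") as f:
    right = f.read()
print(right)
```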
But you could change it before you save it. And then you don't have to open it again.
table = json2html.convert(json=variable)
table = table.replace("&gt;",">").replace("&lt;","<")
f = open('Table.html', 'w', encoding='utf-8')
f.write(table)
f.close()
# output
webbrowser.open("Table.html")
BTW: Python has the function html.unescape(text), which replaces all "chars" like &gt; (so-called entities).
import html
table = json2html.convert(json=variable)
table = html.unescape(table)
f = open('Table.html', 'w', encoding='utf-8')
f.write(table)
f.close()
# output
webbrowser.open("Table.html")
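For a quick check of what html.unescape does on its own, here is a small sketch with a made-up escaped table fragment (not actual json2html output):

```python
import html

# Hypothetical escaped markup, similar to what the question describes.
escaped = "&lt;table&gt;&lt;tr&gt;&lt;td&gt;Fish &amp; chips&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;"

# unescape() converts every entity in one call, not only &gt; and &lt;.
unescaped = html.unescape(escaped)
print(unescaped)
```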

Read xml as a txt in python

I have the following code in Python (which only loads data from a txt file):
def main():
    f = open("text.txt", "r")  # load txt
    a = []  # new array
    for line in f:
        a.append(line.strip())  # append line
main()
How can I do this with an XML file? f = open("myxml.xml", "r") doesn't work. I get the error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 4877: character maps to <undefined>
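This has nothing to do with the XML file format; it depends on which encoding your file is in. Unless you pass an encoding, Python 3 decodes the file with the platform default, which on Windows is a legacy code page (hence the 'charmap' codec in the error). If you are on Windows, your file is probably in windows-1252. You should use: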
f = open("text.txt", "r", encoding="cp1252")
this will sure do your job.
a = []
with open('reboot.xml', 'r') as f:
    a = f.read()
print(a)

Appending hexlifyed content to file

import binascii

file_1 = ('test.png')
with open(file_1, 'rb') as b:
    file_hex = b.read()
binascii.hexlify(file_hex)
file_1_size = len(file_hex)
print (file_1_size)
file_new = open("test.tp", "a")
file_new.write(binascii.hexlify(file_hex))
file_new.close()
I've been trying to get this hexlified content appended to the file. I've even tried assigning the hexlified content to a variable of its own, like this:
file_1 = ('test.png')
with open(file_1, 'rb') as b:
    file_hex = b.read()
x = binascii.hexlify(file_hex)
file_1_size = len(file_hex)
print (file_1_size)
file_new = open("test.tp", "a")
file_new.write(x)
file_new.close()
Both end with the error:
TypeError: must be str, not bytes
Open your file in binary mode to append bytes:
with open("test.tp", "ab") as file_new:
    file_new.write(x)
or decode your bytes to a string first:
with open("test.tp", "a") as file_new:
    file_new.write(x.decode('ascii'))
Hex digits fall within the ASCII code range, so decoding with that codec is safe.
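Both options can be tried side by side in a short sketch. The four sample bytes stand in for the contents of test.png, and the output goes to a temporary directory; both are assumptions for illustration.

```python
import binascii
import os
import tempfile

data = b"\x89PNG"  # stand-in for the bytes read from test.png
x = binascii.hexlify(data)  # hexlify returns bytes, not str

path = os.path.join(tempfile.mkdtemp(), "test.tp")

# Option 1: append the bytes in binary mode.
with open(path, "ab") as file_new:
    file_new.write(x)

# Option 2: decode to str and append in text mode.
with open(path, "a") as file_new:
    file_new.write(x.decode("ascii"))

with open(path, "rb") as f:
    contents = f.read()
print(contents)
```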
