Read a str from a file that contains hex byte escape sequences and decode it? - python

I have a file example.log which contains:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"/>
I want to read the file, decode the escaped byte sequences as UTF-8, and write the result to a new file. My current code is below:
import re
with open("example_decoded.log", 'w') as f:
    for line in open("example.log", 'r', encoding='utf-8'):
        m = re.search("<POOR_IN200901UV", line)
        if m:
            line = line[m.start():-2]
            line_bytes = bytes(line, encoding='raw_unicode_escape')
            line_decoded = line_bytes.decode('utf-8')
            print(line_decoded)
            f.write(line_decoded)
        else:
            pass
But the content of example_decoded.log is:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"
The \xe6\xb6\x88\xe6\x81\xaf part isn't being decoded. How should I decode a str that mixes plain text with escaped byte sequences like this?

import codecs
decode_hex = codecs.getdecoder("hex_codec")
string = decode_hex(string)[0]
https://docs.python.org/3/library/codecs.html

Following the related question "Read hex characters and convert them to utf-8 using python 3", the solution is:
import re
with open("example_decoded.log", 'w') as f:
    for line in open("example.log", 'r', encoding='utf-8'):
        m = re.search("<POOR_IN200901UV", line)
        if m:
            line = line[m.start():-2]
            line_decoded = bytes(line, 'utf-8').decode('unicode_escape').encode('latin-1').decode('utf8')
            print(line_decoded)
            f.write(line_decoded)
        else:
            pass
However, I don't understand why encode('latin-1') is needed first.
Can someone explain that?
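On the encode('latin-1') question, here is a minimal sketch of what each step in that chain produces, using a shortened sample of the log line above. The variable names are only for illustration. The key point is that unicode_escape turns each \xNN escape into the code point U+00NN, which gives a str of stand-in characters; latin-1 maps code points 0-255 one-to-one to byte values, so encoding with it recovers the real UTF-8 bytes.

```python
# Text as read from the file: literal backslash-x escapes, not real bytes.
raw = r'<!-- \xe6\xb6\x88\xe6\x81\xafID -->'

# Step 1: str -> bytes. The escapes are plain ASCII, so nothing changes yet.
as_bytes = bytes(raw, 'utf-8')

# Step 2: unicode_escape interprets \xe6 as the code point U+00E6, and so on.
# The result is a str whose characters stand in for the original byte values.
escaped = as_bytes.decode('unicode_escape')

# Step 3: latin-1 maps code points 0-255 to the same byte values,
# which recovers the real UTF-8 bytes.
utf8_bytes = escaped.encode('latin-1')

# Step 4: decode those bytes as UTF-8 to get the intended text.
decoded = utf8_bytes.decode('utf-8')
print(decoded)
```

Note that unicode_escape also converts the literal \n and \t sequences in the log into real newlines and tabs.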

import struct
decodedVal = struct.unpack(">f", bytes.fromhex(encoded_val))[0]
See the link below to choose the byte order and type for your data instead of ">f":
https://docs.python.org/3/library/struct.html
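For context, a small sketch of what the struct approach does. It applies to a string of plain hex digits that encodes a binary number, which is a different situation from \x escapes embedded in text. The value of encoded_val here is a made-up sample, not data from the question.

```python
import struct

# Hypothetical input: the hex digits of a 4-byte IEEE 754 float.
encoded_val = "40490fdb"

# Turn the hex digits into the 4 raw bytes they describe.
raw = bytes.fromhex(encoded_val)

# ">f" reads the bytes as a big-endian 32-bit float.
big_endian = struct.unpack(">f", raw)[0]

# "<f" reads little-endian, so the reversed bytes give the same number.
little_endian = struct.unpack("<f", raw[::-1])[0]
print(big_endian)
```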

Related

How to print a string containing UTF-8 byte escapes read from a file

I have a file which contains UTF-8 encoded text:
b'\xd8\xa3\xd9\x8a \xd8\xb9\xd9\x84\xd9\x85 \xd9\x87\xd8\xb0\xd8\xa7 \xd8\xa7\xd9\x84\xd8\xb0\xd9\x8a \xd9\x84\xd9\x85 \xd9\x8a\xd8\xb3\xd8\xaa\xd8\xb7\xd8\xb9 \xd8\xad\xd8\xaa\xd9\x89 \xd8\xa7\xd9\x84\xd8\xa2\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xb6\xd8\xb9 \xd8\xa3\xd8\xb5\xd9\x88\xd8\xa7\xd8\xaa \xd9\x85\xd9\x86 \xd9\x86\xd8\xad\xd8\xa8 \xd9\x81\xd9\x8a \xd8\xa3\xd9\x82\xd8\xb1\xd8\xa7\xd8\xb5 \xd8\x8c \xd8\xa3\xd9\x88 \xd8\xb2\xd8\xac\xd8\xa7\xd8\xac\xd8\xa9 \xd8\xaf\xd9\x88\xd8\xa7\xd8\xa1 \xd9\x86\xd8\xaa\xd9\x86\xd8\xa7\xd9\x88\xd9\x84\xd9\x87\xd8\xa7 \xd8\xb3\xd8\xb1\xd9\x91\xd9\x8b\xd8\xa7 \xd8\x8c \xd8\xb9\xd9\x86\xd8\xaf\xd9\x85\xd8\xa7 \xd9\x86\xd8\xb5\xd8\xa7\xd8\xa8 \xd8\xa8\xd9\x88\xd8\xb9\xd9\x83\xd8\xa9 \xd8\xb9\xd8\xa7\xd8\xb7\xd9\x81\xd9\x8a\xd8\xa9 \xd8\xa8\xd8\xaf\xd9\x88\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xaf\xd8\xb1\xd9\x8a \xd8\xb5\xd8\xa7\xd8\xad\xd8\xa8\xd9\x87\xd8\xa7 \xd9\x83\xd9\x85 \xd9\x86\xd8\xad\xd9\x86 \xd9\x86\xd8\xad\xd8\xaa\xd8\xa7\xd8\xac\xd9\x87 - \xd8\xa3\xd8\xad\xd9\x84\xd8\xa7\xd9\x85 \xd9\x85\xd8\xb3\xd8\xaa\xd8\xba\xd8\xa7\xd9\x86\xd9\x85\xd9\x8a, \xd8\xb9\xd8\xa7\xd8\xa8\xd8\xb1 \xd8\xb3\xd8\xb1\xd9\x8a\xd8\xb1'
I tried to print it correctly after decoding, but neither of these approaches worked:
reading the file in text mode ('r') and decoding with bytes(text,'utf8').decode('utf8')
reading the file in binary mode ('rb') and decoding with binary.decode('utf8')
I tried converting the content in many ways (splitting the text into a list, cutting out the b' ... ', ...) but could not get it to print legibly.
What am I missing? Is the file correctly encoded?
Here is my code in Python 3.7.3:
with open('/home/pi/Desktop/unicode_a_decoder.txt', 'r') as f:
    text = f.read()
print(type(text),text)
#seq = text.decode
#seq = bytes(text,"utf8")
#print('seq',seq)
#seq = text
seq = text.split(" ")
#print(seq, seq[0],bytes(seq[0]))
print('seq',seq)
s0 = seq[0]
print(s0,type(s0))
s02byte = bytes(s0, 'utf8')
print(s02byte, type(s02byte))
#print(seq.decode("utf8"))
For me, it worked when I simply used .decode()
This is what I did:
text = b'\xd8\xa3\xd9\x8a \xd8\xb9\xd9\x84\xd9\x85 \xd9\x87\xd8\xb0\xd8\xa7 \xd8\xa7\xd9\x84\xd8\xb0\xd9\x8a \xd9\x84\xd9\x85 \xd9\x8a\xd8\xb3\xd8\xaa\xd8\xb7\xd8\xb9 \xd8\xad\xd8\xaa\xd9\x89 \xd8\xa7\xd9\x84\xd8\xa2\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xb6\xd8\xb9 \xd8\xa3\xd8\xb5\xd9\x88\xd8\xa7\xd8\xaa \xd9\x85\xd9\x86 \xd9\x86\xd8\xad\xd8\xa8 \xd9\x81\xd9\x8a \xd8\xa3\xd9\x82\xd8\xb1\xd8\xa7\xd8\xb5 \xd8\x8c \xd8\xa3\xd9\x88 \xd8\xb2\xd8\xac\xd8\xa7\xd8\xac\xd8\xa9 \xd8\xaf\xd9\x88\xd8\xa7\xd8\xa1 \xd9\x86\xd8\xaa\xd9\x86\xd8\xa7\xd9\x88\xd9\x84\xd9\x87\xd8\xa7 \xd8\xb3\xd8\xb1\xd9\x91\xd9\x8b\xd8\xa7 \xd8\x8c \xd8\xb9\xd9\x86\xd8\xaf\xd9\x85\xd8\xa7 \xd9\x86\xd8\xb5\xd8\xa7\xd8\xa8 \xd8\xa8\xd9\x88\xd8\xb9\xd9\x83\xd8\xa9 \xd8\xb9\xd8\xa7\xd8\xb7\xd9\x81\xd9\x8a\xd8\xa9 \xd8\xa8\xd8\xaf\xd9\x88\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xaf\xd8\xb1\xd9\x8a \xd8\xb5\xd8\xa7\xd8\xad\xd8\xa8\xd9\x87\xd8\xa7 \xd9\x83\xd9\x85 \xd9\x86\xd8\xad\xd9\x86 \xd9\x86\xd8\xad\xd8\xaa\xd8\xa7\xd8\xac\xd9\x87 - \xd8\xa3\xd8\xad\xd9\x84\xd8\xa7\xd9\x85 \xd9\x85\xd8\xb3\xd8\xaa\xd8\xba\xd8\xa7\xd9\x86\xd9\x85\xd9\x8a, \xd8\xb9\xd8\xa7\xd8\xa8\xd8\xb1 \xd8\xb3\xd8\xb1\xd9\x8a\xd8\xb1'
print(text.decode())
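The .decode() call above works on a bytes literal typed into the source code. If the file itself literally contains the characters b'\xd8\xa3...' as text, one possible approach is to parse that text back into a bytes object first. The sketch below assumes the whole file content is a single valid bytes literal; the text variable is a shortened stand-in for what f.read() would return.

```python
import ast

# Hypothetical file content: the printed form of a bytes object, stored as text.
text = r"b'\xd8\xa3\xd9\x8a \xd8\xb9\xd9\x84\xd9\x85'"

# literal_eval safely parses the b'...' literal into a real bytes object.
data = ast.literal_eval(text.strip())

# Now the bytes can be decoded as UTF-8.
decoded = data.decode('utf-8')
print(decoded)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for content read from a file.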

How can I convert a UTF-16-LE txt file to an ANSI txt file and remove the header in Python?

I have a .txt file in UTF-16-LE encoding.
I want to remove the header (the first row) and save the file in ANSI.
I can do it manually, but I need to do this for 150 txt files every day,
so I want to use Python to do it automatically.
But I am stuck.
I tried the code below, but it does not work and produces this error:
"return mbcs_encode(input, self.errors)[0]
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character"
filename = "filetochangecodec.txt"
path = "C:/Users/fallen/Desktop/New folder/"
pathfile = path + filename
coding1 = "utf-16-le"
coding2 = "ANSI"
f= open(pathfile, 'r', encoding=coding1)
content= f.read()
f.close()
f= open(pathfile, 'w', encoding=coding2)
f.write(content)
f.close()
A kind contributor helped me with the solution, and I am posting it here so that everyone can benefit and save time.
Instead of writing all the content at once, we build a list of the lines in the txt file and then write them to a new file one by one with a for loop.
import os
inpath = r"C:/Users/user/Desktop/insert/"
expath = r"C:/Users/user/Desktop/export/"
encoding1 = "utf-16"
encoding2 = "ansi"
input_filename = "text.txt"
input_pathfile = os.path.join(inpath, input_filename)
output_filename = "new_text.txt"
output_pathfile = os.path.join(expath, output_filename)
with open(input_pathfile, 'r', encoding=encoding1) as file_in:
    lines = []
    for line in file_in:
        lines.append(line)
with open(output_pathfile, 'w', encoding='ANSI') as f:
    for line in lines:
        f.write(line)
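A possible variant that also drops the header row, which the code above does not do. This is a sketch under a few assumptions: "ANSI" is taken to mean cp1252 (the usual Western European Windows code page; on Windows the name "ansi" resolves to the system code page instead), and convert_file and the temporary file names are made up for the example. The 'mbcs' error in the question most likely comes from the byte order mark: reading with "utf-16-le" keeps U+FEFF as the first character, while "utf-16" consumes it.

```python
import os
import tempfile

def convert_file(src, dst, src_encoding="utf-16", dst_encoding="cp1252"):
    """Copy src to dst in a new encoding, dropping the first (header) line."""
    with open(src, "r", encoding=src_encoding) as f_in:
        lines = f_in.readlines()
    # errors="replace" writes "?" for characters the target code page lacks.
    with open(dst, "w", encoding=dst_encoding, errors="replace") as f_out:
        f_out.writelines(lines[1:])

# Small demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "text.txt")
    dst = os.path.join(tmp, "new_text.txt")
    with open(src, "w", encoding="utf-16") as f:
        f.write("header\nrow 1\nrow 2\n")
    convert_file(src, dst)
    with open(dst, "r", encoding="cp1252") as f:
        result = f.read()
print(result)
```

To process many files per day, the same function could be called in a loop over os.listdir() of the input folder.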

How to convert a file from ANSI encoding to Unicode

When I use CountVectorizer in sklearn, it needs the input to be Unicode, but my data files are encoded in ANSI.
I tried changing the encoding to Unicode using Notepad++, but then readlines could not read all the lines and returned only the last one. After that, I tried reading the lines from the data files and writing them to a new file as Unicode, but that failed.
# Python 2 code (print statements, unicode())
import codecs
import os
def merge_file():
    root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
    resname='resule_final.txt'
    if os.path.exists(resname):
        os.remove(resname)
    result = codecs.open(resname,'w','utf-8')
    num = 1
    for back_name in os.listdir(r'd:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
        current_dir = root_dir + str(back_name)
        for filename in os.listdir(current_dir):
            print num ,":" ,str(filename)
            num = num+1
            path=current_dir + "\\" +str(filename)
            source=open(path,'r')
            line = source.readline()
            line = line.strip('\n')
            line = line.strip('\r')
            while line !="":
                line = unicode(line,"gbk")
                line = line.replace('\n',' ')
                line = line.replace('\r',' ')
                result.write(line + ' ')
                line = source.readline()
            else:
                print 'End file :'+ str(filename)
                result.write('\n')
                source.close()
    print 'End All.'
    result.close()
The error message is: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence
I found a way.
First, use chardet to detect the encoding of each file's content.
Second, use codecs to read from or write to the file in the specific encoding.
Here is the code:
# Python 2 code (print statements; str objects hold raw bytes)
import chardet
import codecs
import os
root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []
for back_name in os.listdir("d:\\workspace\\minibatchk-means\\data\\20_newsgroups"):
    current_dir = root_dir + str(back_name)
    for filename in os.listdir(current_dir):
        print num,":",str(filename)
        num=num+1
        path=current_dir+"\\"+str(filename)
        content = open(path,'r').read()
        source_encoding=chardet.detect(content)['encoding']
        if source_encoding == None:
            print '??' , filename
            failed.append(filename)
        elif source_encoding != 'utf-8':
            content=content.decode(source_encoding,'ignore')
            codecs.open(path,'w',encoding='utf-8').write(content)
print failed
Thanks for all your help.
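For Python 3 readers, a rough sketch of the same "find the encoding, then re-encode as UTF-8" idea. To stay within the standard library it does not use chardet; it swaps in a simpler technique that tries a fixed list of candidate encodings in order. The function name and the candidate list are assumptions for illustration and should be adapted to the data.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gbk", "cp1252")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns a (text, encoding_used) tuple.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte value, so this last resort never fails,
    # although the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"

# Sample bytes as they might come from open(path, "rb").read().
gbk_bytes = "\u6d88\u606f".encode("gbk")
text, used = decode_with_fallback(gbk_bytes)

# Re-encode as UTF-8, ready to be written back with open(path, "wb").
utf8_bytes = text.encode("utf-8")
print(used)
```

Trying encodings in order can misidentify a file when several candidates decode without error, so a detector such as chardet may still be the better choice for mixed data.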

Encoding and decoding string in Python

I want to write a string to a file using Python. I know how to do that, so that's not a problem. I also want to encode that string once it has been written. The encoding doesn't really matter, so let's say UTF-32. To do that, after writing the string I read the file again, encode the string into bytes, and write the result back to the same file. The encoding part works, but my problem is with decoding. I want to read the file as bytes so that I can convert it back to a str, following the same principle: read from the file, decode, and write to the same file. What I get from reading the encoded string looks like b'\xff\xfe\x00\x001\x00\x00\x004\x00\x00\x002\x00\x00\x00'
When I read this as bytes, the b prefix and the backslashes are doubled. If I instead read it as a string and try to decode it, I get an error like 'str' object has no attribute 'decode'. I know that I can't decode a str, but when I read bytes the content seems to be "doubled".
Here is my code:
def readfile(filename):
    f = open(filename, 'r')
    s = f.read()
    f.close()
    return s
def readfile_b(filename):
    f = open(filename, 'rb')
    s = f.read()
    f.close()
    return s
def writefile(filename, writeobject):
    f = open(filename, 'w')
    f.write(writeobject)
    f.close()
def encode(filename):
    s = readfile(filename)
    s_enc = bytes(s, 'utf-32')
    writefile(filename, str(s_enc))
def decode(filename):
    s_enc = readfile_b(filename)
    print(s_enc)
    s = str(s_enc, 'utf-32')
    writefile(filename, s)
encode("Example.txt")
decode("Example.txt")
Output (for decode(), encode() didn't have any errors):
b"b'\\xff\\xfe\\x00\\x00H\\x00\\x00\\x00e\\x00\\x00\\x00l\\x00\\x00\\x00l\\x00\\x00\\x00o\\x00\\x00\\x00'"
Traceback (most recent call last):
File "C:/bla/bla/bla/bla/Example.py", line 29, in <module>
decode("MamaAccount.txt")
File "C:/bla/bla/bla/bla/Example.py", line 26, in decode
s = str(s_enc, 'utf-32')
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
Any help is greatly appreciated
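Try writing the file in binary mode. Currently you are writing the bytes object converted to a string with str(), which stores its printed form (b'...') as text. When you read that back as bytes, you get a second b prefix and doubled backslashes.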
This works for me:
def readfile(filename):
    f = open(filename, 'r')
    s = f.read()
    f.close()
    return s
def readfile_b(filename):
    f = open(filename, 'rb')
    s = f.read()
    f.close()
    return s
def writefile(filename, writeobject):
    f = open(filename, 'w')
    f.write(writeobject)
    f.close()
def writefile_b(filename, writeobject):
    f = open(filename, 'wb')
    f.write(writeobject)
    f.close()
def encode(filename):
    s = readfile(filename)
    s_enc = bytes(s, 'utf-32')
    writefile_b("bin_"+filename, s_enc)
def decode(filename):
    s_enc = readfile_b(filename)
    #print(s_enc)
    s = str(s_enc, 'utf-32')
    print(s)
    writefile("dec_"+filename, s)
encode("Example.txt")
decode("bin_Example.txt")
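A short illustration of why the original version produced the doubled b prefix: calling str() on a bytes object without an encoding argument returns its printed form rather than decoded text. The variable names below are only for the example.

```python
text = "Hello"

# Encoding produces bytes; the utf-32 codec prepends a 4-byte byte order mark.
encoded = text.encode("utf-32")

# str() without an encoding returns the printable form "b'...'" as text.
# Writing this to a file stores the characters b, ', \, x, ... literally.
as_repr = str(encoded)

# Decoding is the inverse of encoding and gives the original text back.
round_trip = encoded.decode("utf-32")
print(as_repr[:2], round_trip)
```

This is why the bytes need to be written with a file opened in 'wb' mode, as in the answer above.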

Read an XML file as text in Python

I have the following code in Python (which only loads data from a txt file):
def main():
    f = open("text.txt", "r")  # load txt
    a = []  # new list
    for line in f:
        a.append(line.strip())  # append line
main()
How can I do this with an XML file? f = open("myxml.xml", "r") doesn't work. I get the error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 4877: character maps to <undefined>
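This has nothing to do with the XML file format; it depends on the encoding your file is in. When no encoding is given, Python 3's open() uses the locale's preferred encoding, which on Windows is typically a legacy code page (hence the 'charmap' codec in the error), so you need to pass the file's actual encoding explicitly. If your file is in windows-1252, use: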
f = open("text.txt", "r", encoding="cp1252")
This should do the job:
a = []
with open('reboot.xml', 'r') as f:
    a = f.read()
print(a)
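As a sketch of the explicit-encoding approach applied to the original loop: the example below assumes the XML file is UTF-8 (many XML files declare this in their first line, but that is an assumption to check against the actual file). The read_lines helper and the sample content are made up for the demonstration.

```python
import os
import tempfile

def read_lines(path, encoding="utf-8"):
    """Return the stripped lines of a text file, decoded with an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f]

# Demonstration with a temporary XML file containing a non-ASCII character.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<name>Zo\u00eb</name>\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myxml.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml_text)
    lines = read_lines(path, encoding="utf-8")
print(lines)
```

If the chosen encoding is wrong, the open() call either raises UnicodeDecodeError or silently yields garbled characters, so it is worth checking the XML declaration on the first line of the file.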
