I have a file which contains UTF-8 encoded text:
b'\xd8\xa3\xd9\x8a \xd8\xb9\xd9\x84\xd9\x85 \xd9\x87\xd8\xb0\xd8\xa7 \xd8\xa7\xd9\x84\xd8\xb0\xd9\x8a \xd9\x84\xd9\x85 \xd9\x8a\xd8\xb3\xd8\xaa\xd8\xb7\xd8\xb9 \xd8\xad\xd8\xaa\xd9\x89 \xd8\xa7\xd9\x84\xd8\xa2\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xb6\xd8\xb9 \xd8\xa3\xd8\xb5\xd9\x88\xd8\xa7\xd8\xaa \xd9\x85\xd9\x86 \xd9\x86\xd8\xad\xd8\xa8 \xd9\x81\xd9\x8a \xd8\xa3\xd9\x82\xd8\xb1\xd8\xa7\xd8\xb5 \xd8\x8c \xd8\xa3\xd9\x88 \xd8\xb2\xd8\xac\xd8\xa7\xd8\xac\xd8\xa9 \xd8\xaf\xd9\x88\xd8\xa7\xd8\xa1 \xd9\x86\xd8\xaa\xd9\x86\xd8\xa7\xd9\x88\xd9\x84\xd9\x87\xd8\xa7 \xd8\xb3\xd8\xb1\xd9\x91\xd9\x8b\xd8\xa7 \xd8\x8c \xd8\xb9\xd9\x86\xd8\xaf\xd9\x85\xd8\xa7 \xd9\x86\xd8\xb5\xd8\xa7\xd8\xa8 \xd8\xa8\xd9\x88\xd8\xb9\xd9\x83\xd8\xa9 \xd8\xb9\xd8\xa7\xd8\xb7\xd9\x81\xd9\x8a\xd8\xa9 \xd8\xa8\xd8\xaf\xd9\x88\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xaf\xd8\xb1\xd9\x8a \xd8\xb5\xd8\xa7\xd8\xad\xd8\xa8\xd9\x87\xd8\xa7 \xd9\x83\xd9\x85 \xd9\x86\xd8\xad\xd9\x86 \xd9\x86\xd8\xad\xd8\xaa\xd8\xa7\xd8\xac\xd9\x87 - \xd8\xa3\xd8\xad\xd9\x84\xd8\xa7\xd9\x85 \xd9\x85\xd8\xb3\xd8\xaa\xd8\xba\xd8\xa7\xd9\x86\xd9\x85\xd9\x8a, \xd8\xb9\xd8\xa7\xd8\xa8\xd8\xb1 \xd8\xb3\xd8\xb1\xd9\x8a\xd8\xb1'
I've tried to print it correctly once decoded but I did not succeed when:
reading from file as text option 'r', decode by bytes(text,'utf8').decode('utf8')
reading from file as binary option 'rb', decode by binary.decode('utf8')
I tried to convert the content in many ways (split text in list, cut out the b' ... ', ...) but didn't succeed to print it clearly!
What am I missing - is the file correctly 'encoded'?
Here is my code in Python 3.7.3
with open('/home/pi/Desktop/unicode_a_decoder.txt', 'r') as f:
text = f.read()
print(type(text),text)
#seq = text.decode
#seq = bytes(text,"utf8")
#print('seq',seq)
#seq = text
seq = text.split(" ")
#print(seq, seq[0],bytes(seq[0]))
print('seq',seq)
s0 = seq[0]
print(s0,type(s0))
s02byte = bytes(s0, 'utf8')
print(s02byte, type(s02byte))
#print(seq.decode("utf8"))
For me, it worked when I simply used .decode()
This is what I did:
text = b'\xd8\xa3\xd9\x8a \xd8\xb9\xd9\x84\xd9\x85 \xd9\x87\xd8\xb0\xd8\xa7 \xd8\xa7\xd9\x84\xd8\xb0\xd9\x8a \xd9\x84\xd9\x85 \xd9\x8a\xd8\xb3\xd8\xaa\xd8\xb7\xd8\xb9 \xd8\xad\xd8\xaa\xd9\x89 \xd8\xa7\xd9\x84\xd8\xa2\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xb6\xd8\xb9 \xd8\xa3\xd8\xb5\xd9\x88\xd8\xa7\xd8\xaa \xd9\x85\xd9\x86 \xd9\x86\xd8\xad\xd8\xa8 \xd9\x81\xd9\x8a \xd8\xa3\xd9\x82\xd8\xb1\xd8\xa7\xd8\xb5 \xd8\x8c \xd8\xa3\xd9\x88 \xd8\xb2\xd8\xac\xd8\xa7\xd8\xac\xd8\xa9 \xd8\xaf\xd9\x88\xd8\xa7\xd8\xa1 \xd9\x86\xd8\xaa\xd9\x86\xd8\xa7\xd9\x88\xd9\x84\xd9\x87\xd8\xa7 \xd8\xb3\xd8\xb1\xd9\x91\xd9\x8b\xd8\xa7 \xd8\x8c \xd8\xb9\xd9\x86\xd8\xaf\xd9\x85\xd8\xa7 \xd9\x86\xd8\xb5\xd8\xa7\xd8\xa8 \xd8\xa8\xd9\x88\xd8\xb9\xd9\x83\xd8\xa9 \xd8\xb9\xd8\xa7\xd8\xb7\xd9\x81\xd9\x8a\xd8\xa9 \xd8\xa8\xd8\xaf\xd9\x88\xd9\x86 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xaf\xd8\xb1\xd9\x8a \xd8\xb5\xd8\xa7\xd8\xad\xd8\xa8\xd9\x87\xd8\xa7 \xd9\x83\xd9\x85 \xd9\x86\xd8\xad\xd9\x86 \xd9\x86\xd8\xad\xd8\xaa\xd8\xa7\xd8\xac\xd9\x87 - \xd8\xa3\xd8\xad\xd9\x84\xd8\xa7\xd9\x85 \xd9\x85\xd8\xb3\xd8\xaa\xd8\xba\xd8\xa7\xd9\x86\xd9\x85\xd9\x8a, \xd8\xb9\xd8\xa7\xd8\xa8\xd8\xb1 \xd8\xb3\xd8\xb1\xd9\x8a\xd8\xb1'
print(text.decode())
Related
I have a file example.log which contains:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"/>
I want to read the file and convert the str to utf-8 encoding format and write to a new file. Currently my code below:
with open("example_decoded.log", 'w') as f:
for line in open("example.log", 'r', encoding='utf-8'):
m = re.search("<POOR_IN200901UV", line)
if m:
line = line[m.start():-2]
line_bytes = bytes(line, encoding='raw_unicode_escape')
line_decoded = line_bytes.decode('utf-8')
print(line_decoded)
f.write(line_decoded)
else:
pass
But the example_decoded.log's content:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"
The \xe6\xb6\x88\xe6\x81\xaf part isn't being decoded, so I am wondering how to deal with this mix-type str decode issue?
import codecs
decode_hex = codecs.getdecoder("hex_codec")
string = decode_hex(string)[0]
https://docs.python.org/3/library/codecs.html
Refer this: Read hex characters and convert them to utf-8 using python 3
the solution is:
with open("example_decoded.log", 'w') as f:
for line in open("example.log", 'r', encoding='utf-8'):
m = re.search("<POOR_IN200901UV", line)
if m:
line = line[m.start():-2]
line_decoded = bytes(line, 'utf-8').decode('unicode_escape').encode('latin-1').decode('utf8')
print(line_decoded)
f.write(line_decoded)
else:
pass
although I don't understand why encode('latin-1')first,
can someone explain that?
decodedVal = struct.unpack(">f", bytes.fromhex(encdoded_val))[0]
refer below link to add your endian and type instead of ">f"
https://docs.python.org/3/library/struct.html
I have a .txt file in UTF-16-LE encoding .
I want to remove the headers(1st row) and save it in ANSI
I can do it maually but i need to do that for 150 txt files EVERY day
So i wanted to use Python to do it automatically.
But i am stuck ,
i have tried this code but it is not working ,produces an error :
*"return mbcs_encode(input, self.errors)[0]
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character "*
filename = "filetochangecodec.txt"
path = "C:/Users/fallen/Desktop/New folder/"
pathfile = path + filename
coding1 = "utf-16-le"
coding2 = "ANSI"
f= open(pathfile, 'r', encoding=coding1)
content= f.read()
f.close()
f= open(pathfile, 'w', encoding=coding2)
f.write(content)
f.close()
A noble contributer helped me with the solution and i now post it so everyone can benefit and save time.
Instead of trying to write all the content , we make a list with every line of the txt file and then we write them in a new file one by one with the use of " for " .
import os
inpath = r"C:/Users/user/Desktop/insert/"
expath = r"C:/Users/user/Desktop/export/"
encoding1 = "utf-16"
encoding2 = "ansi"
input_filename = "text.txt"
input_pathfile = os.path.join(inpath, input_filename)
output_filename = "new_text.txt"
output_pathfile = os.path.join(expath, output_filename)
with open(input_pathfile, 'r', encoding=encoding1) as file_in:
lines = []
for line in file_in:
lines.append(line)
with open(output_pathfile, 'w', encoding='ANSI') as f:
for line in lines:
f.write(line)
When I use the CountVectorizer in sklearn, it needs the file encoding in unicode, but my data file is encoding in ansi.
I tried to change the encoding to unicode using notepad++, then I use readlines, it cannot read all the lines, instead it can only read the last line. After that, I tried to read the line into data file, and write them into the new file by using unicode, but I failed.
def merge_file():
root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
resname='resule_final.txt'
if os.path.exists(resname):
os.remove(resname)
result = codecs.open(resname,'w','utf-8')
num = 1
for back_name in os.listdir(r'd:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
current_dir = root_dir + str(back_name)
for filename in os.listdir(current_dir):
print num ,":" ,str(filename)
num = num+1
path=current_dir + "\\" +str(filename)
source=open(path,'r')
line = source.readline()
line = line.strip('\n')
line = line.strip('\r')
while line !="":
line = unicode(line,"gbk")
line = line.replace('\n',' ')
line = line.replace('\r',' ')
result.write(line + ' ')
line = source.readline()
else:
print 'End file :'+ str(filename)
result.write('\n')
source.close()
print 'End All.'
result.close()
The error message is :UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence
Oh,I find the way.
First, use chardet to detect string encoding.
Second,use codecs to input or output to the file in the specific encoding.
Here is the code.
import chardet
import codecs
import os
root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []
for back_name in os.listdir("d:\\workspace\\minibatchk-means\\data\\20_newsgroups"):
current_dir = root_dir + str(back_name)
for filename in os.listdir(current_dir):
print num,":",str(filename)
num=num+1
path=current_dir+"\\"+str(filename)
content = open(path,'r').read()
source_encoding=chardet.detect(content)['encoding']
if source_encoding == None:
print '??' , filename
failed.append(filename)
elif source_encoding != 'utf-8':
content=content.decode(source_encoding,'ignore')
codecs.open(path,'w',encoding='utf-8').write(content)
print failed
Thanks for all your help.
I have written code to create frequency table. but it is breaking at the line ext_string = document_text.read().lower(. I even put a try and except to catch the error but it is not helping.
import re
import string
frequency = {}
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
try:
count = frequency.get(word,0)
frequency[word] = count + 1
except UnicodeDecodeError:
pass
frequency_list = frequency.keys()
for words in frequency_list:
print (words, frequency[words])
You are opening your file twice, the second time without specifying the encoding:
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
You should open the file as follows:
frequencies = {}
with open('EVG_text mining.txt', encoding="utf8", mode='r') as f:
text = f.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
...
The second time you were opening your file, you were not defining what encoding to use which is probably why it errored.
The with statement helps perform certain task linked with I/O for a file. You can read more about it here: https://www.pythonforbeginners.com/files/with-statement-in-python
You should probably have a look at error handling as well as you were not enclosing the line that was actually causing the error: https://www.pythonforbeginners.com/error-handling/
The code ignoring all decoding issues:
import re
import string # Do you need this?
with open('EVG_text mining.txt', mode='rb') as f: # The 'b' in mode changes the open() function to read out bytes.
bytes = f.read()
text = bytes.decode('utf-8', 'ignore') # Change 'ignore' to 'replace' to insert a '?' whenever it finds an unknown byte.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
frequencies = {}
for word in match_pattern: # Your error handling wasn't doing anything here as the error didn't occur here but when reading the file.
count = frequencies.setdefault(word, 0)
frequencies[word] = count + 1
for word, freq in frequencies.items():
print (word, freq)
To read a file with some special characters, use encoding as 'latin1' or 'unicode_escape'
I am trying this since morning.
My sample.txt
choice = \u9078\u629e
Code:
with open('sample.txt', encoding='utf-8') as f:
for line in f:
print(line)
print("選択" in line)
print(line.encode('utf-8').decode('utf-8'))
print(line.encode().decode('utf-8'))
print(line.encode('utf-8').decode())
print(line.encode().decode('unicode-escape').encode("latin-1").decode('utf-8')) # as suggested.
out:
choice = \u9078\u629e
False
choice = \u9078\u629e
choice = \u9078\u629e
choice = \u9078\u629e
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-10: ordinal not in range(256)
When I do this in ipython qtconsole:
In [29]: "choice = \u9078\u629e"
Out[29]: 'choice = 選択'
So the question is how can I read the text file containing the unicode escaped string like \u9078\u629e (I don't know exactly what it's called) and convert it to utf-8 like 選択?
If you read it from a file, just give the encoding when opening:
with open('test.txt', encoding='unicode-escape') as f:
a = f.read()
print(a)
# choice = 選択
with test.txt containing:
choice = \u9078\u629e
If you already had your text in a string, you could have converted it like this:
a = "choice = \\u9078\\u629e"
a.encode().decode('unicode-escape')
# 'choice = 選択'