Python: error reading from a Chinese text file

I am trying to read from a file containing Chinese text, but it doesn't run normally.
The code is below:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
#read from file
file=open('temp2','rb',encoding='utf-8')
lines=file.readlines()
for line in lines:
    print(line)
file.close()
The file content is below:
http://www.sina.com.cn/intro/copyright.shtml
新浪新闻
国内、国际。
国内、国际。

Change 'rb' to 'r': in Python 3, open() does not accept an encoding argument in binary mode, so 'rb' together with encoding='utf-8' raises a ValueError. Open the file in text mode 'r' and the encoding will be applied.
(The codec name is not the problem; 'utf-8' and 'utf8' are both accepted spellings.)
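A minimal corrected version of the snippet (a sketch, assuming temp2 really is a UTF-8 text file as described):
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Read the UTF-8 text file in text mode so the encoding argument applies.
with open('temp2', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.rstrip('\n'))  # strip the trailing newline before printing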

Related

Python-generated LaTeX file throws UTF-8 error

I am trying to write a Python program that generates and compiles a LuaLaTeX file with German special characters (ä, ü, ß, etc.).
Unfortunately, it throws this error:
! String contains an invalid utf-8 sequence.
Here is my example code:
import subprocess
import shutil
txtFileRecipe = open(r"C:\Users\canna\OneDrive\Desktop\TestTest.tex", "w")
txtFileRecipe.write(
    ("\\documentclass[a5paper]{article}\n"
     "\\usepackage[ngerman]{babel}\n"
     "\\usepackage{fontspec}\n"
     "\\begin{document}\n"
     "Äpfelmüß\n"
     "\\end{document}\n")
)
txtFileRecipe.close()
subprocess.check_call(["LuaLatex", r"C:\Users\canna\OneDrive\Desktop\TestTest.tex"])
Try opening the file as binary and encoding the content to UTF-8:
with open(r"C:\Users\canna\OneDrive\Desktop\TestTest.tex", "wb") as txtFileRecipe:
txtFileRecipe.write(
("\\documentclass[a5paper]{article}\n"
"\\usepackage[ngerman]{babel}\n"
"\\usepackage{fontspec}\n"
"\\begin{document}\n"
"Äpfelmüß\n"
"\\end{document}\n"
.encode('utf-8')) #explicit encoding as utf-8
)
subprocess.check_call(["LuaLatex", r"C:\Users\canna\OneDrive\Desktop\TestTest.tex"])
(Switched to opening the file with a context manager, to follow good practice.)
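An alternative sketch (Python 3, same file path as above): keep text mode and let open() handle the encoding instead of encoding by hand.
# Text mode with an explicit encoding; open() encodes the string on write.
with open(r"C:\Users\canna\OneDrive\Desktop\TestTest.tex", "w", encoding="utf-8") as txtFileRecipe:
    txtFileRecipe.write(
        "\\documentclass[a5paper]{article}\n"
        "\\usepackage[ngerman]{babel}\n"
        "\\usepackage{fontspec}\n"
        "\\begin{document}\n"
        "Äpfelmüß\n"
        "\\end{document}\n"
    )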

Korean txt file encoding with utf-8

I'm trying to process a Korean text file with Python, but it fails when I try to open the file with utf-8 encoding.
#!/usr/bin/python
#-*- coding: utf-8 -*-
f = open('tag.txt', 'r', encoding='utf=8')
s = f.readlines()
z = open('tagresult.txt', 'w')
y = z.write(s)
z.close
=============================================================
Traceback (most recent call last):
File "C:\Users\******\Desktop\tagging.py", line 5, in <module>
f = open('tag.txt', 'r', encoding='utf=8')
TypeError: 'encoding' is an invalid keyword argument for this function
[Finished in 0.1s]
==================================================================
And when I just open a Korean txt file encoded with utf-8, the characters come out broken like this. What can I do?
\xc1\xc1\xbe\xc6\xc1\xf6\xb4\xc2\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xb0\xc5\xb5\xe7\xbf\xe4\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xbd\xc3\xb4\xc2\n',
'\xc1\xcb\xbc\xdb\xc7\xd1\xb5\xa5\xbf\xe4\n',
'\xc1\xd6\xb1\xb8\xbf\xe4\
I don't know Korean and don't have a sample string to try, but here is some advice:
1.
f = open('tag.txt', 'r', encoding='utf=8')
You have a typo here: it should be utf-8, not utf=8. (The TypeError in your traceback, though, comes from Python 2's built-in open() not accepting an encoding argument at all; see the note below.)
The default mode of open() is 'r', so you don't have to pass it explicitly.
2. Don't just call open(); use a with statement (context manager) to handle opening and closing the file, like this:
with open('tagresult.txt', 'w') as f:
    f.write(s)
In Python 2 the open function does not take an encoding parameter. Instead you read a line as bytes and decode it to unicode yourself. This article on the kitchen (as in kitchen sink) modules provides details and some lightweight utilities for working with unicode in Python 2.x.
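A minimal sketch that works on both Python 2 and Python 3 is io.open (or codecs.open), which does accept an encoding argument; the file names are the ones from the question:
import io

# Read the Korean text decoded as UTF-8...
with io.open('tag.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

# ...and write it back out, re-encoded as UTF-8.
with io.open('tagresult.txt', 'w', encoding='utf-8') as out:
    out.writelines(lines)
If the file is actually in a legacy Korean encoding such as cp949/euc-kr rather than UTF-8 (the \xc1... bytes above suggest it may be), pass that codec name instead.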

How to convert hex string to plain text?

How to read a hex file and convert it to plain text?
For example, this is my file user.dat (a mainland-China user.dat).
And here is what I have tried so far:
# -*- coding:utf-8 -*-
with open('user.dat','rb') as f:
    data = f.read()
    print data
And the result looks like this. Some of it is right, while some is not.
How do I get the content entirely right?
Just add these lines to your code; on Python 2, str.decode('hex') will decode the string into plain text:
output = data.decode('hex')
print output
OK, you have some error, so try this...
import binascii
with open('user.dat', 'rb') as f:
    data = f.read()
print(binascii.hexlify(data))
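Note that str.decode('hex') in the first snippet only exists on Python 2. On Python 3, a rough equivalent (a sketch, assuming user.dat really holds hex digits as text, as that snippet assumes) is binascii.unhexlify or bytes.fromhex:
import binascii

with open('user.dat', 'r') as f:
    hex_text = f.read().strip()

raw = binascii.unhexlify(hex_text)  # or: bytes.fromhex(hex_text)
print(raw.decode('utf-8', errors='replace'))  # show the decoded bytes as readable text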

Changing encoding in csv file through python UTF-8 to UTF-16

How do you change the encoding through a Python script?
I've got some files that I'm looping over to do some other work. But before that I need to change the encoding of each file from UTF-8 to UTF-16, since SQL Server does not support UTF-8.
I tried this, but it's not working.
data = "UTF-8 data"
udata = data.decode("utf-8")
data = udata.encode("utf-16","ignore")
Cheers!
If you want to convert a file from utf-8 encoding to a file with utf-16 encoding, this script works:
#!/usr/bin/python2.7
import codecs
import shutil

with codecs.open("input_file.utf8.txt", encoding="utf-8") as input_file:
    with codecs.open(
            "output_file.utf16.txt", "w", encoding="utf-16") as output_file:
        shutil.copyfileobj(input_file, output_file)
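On Python 3 the same conversion can be done with the built-in open() alone (a sketch; the file names are the same placeholders as above):
import shutil

# Re-encode a UTF-8 text file as UTF-16 using plain open().
with open("input_file.utf8.txt", "r", encoding="utf-8") as src:
    with open("output_file.utf16.txt", "w", encoding="utf-16") as dst:
        shutil.copyfileobj(src, dst)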

Processing a Russian text file fails

I have this code:
# -*- coding: utf-8 -*-
import codecs

prefix = u"а"
rus_file = "rus_names.txt"
output = "rus_surnames.txt"

with codecs.open(rus_file, 'r', 'utf-8') as infile:
    with codecs.open(output, 'a', 'utf-8') as outfile:
        for line in infile.readlines():
            outfile.write(line + prefix)
And it gives me something that looks like Chinese text in the output file. Even when I just outfile.write(line) it gives me the same garbage in the output. I just don't get it.
The purpose: I have a huge file with male surnames. I need to get the same file with female surnames. In Russian it looks like this: Ivanov - Ivanova | Иванов - Иванова
Try
lastname = str(line+prefix, 'utf-8')
outfile.write(lastname)
So #AndreyAtapin was partially right. I was appending lines to a file that already contained my previous mistakes with the Chinese-looking characters; even flushing the file didn't help. But when I deleted it and let the script create it again, it worked! Thanks.
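For reference, a sketch of the presumed intent (the file names are the ones from the question): append the feminine ending before the newline, and write to a fresh file so stale output from earlier runs cannot linger, which is what the asker ran into.
# -*- coding: utf-8 -*-
import codecs

prefix = u"а"

with codecs.open("rus_names.txt", 'r', 'utf-8') as infile:
    with codecs.open("rus_surnames.txt", 'w', 'utf-8') as outfile:
        for line in infile:
            # Ivanov -> Ivanova: strip the newline, add the ending, restore the newline.
            outfile.write(line.rstrip(u'\r\n') + prefix + u'\n')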
