I want to read some Cyrillic text from a txt file in Python 3.
This is what the text file contains.
абцдефгчийклмнопярстувшхыз
I used:
with open('text.txt', 'r') as myfile:
    text = myfile.read()
print(text)
But this is the output in the Python shell:
ÿþ01F45D3G89:;<=>?O#ABC2HEK7
Can someone explain why this is the output?
Python supports utf-8 for this sort of thing.
You should be able to do:
with open('text.txt', encoding='utf-8', mode='r') as my_file:
    ...
Also, be sure that your text file is saved with utf-8 encoding. I tested this in my shell and without proper encoding my output was:
?????????????????????
With proper encoding:
file = open('text.txt', encoding='utf-8', mode='r')
text = file.read()
print(text)
абцдефгчийклмнопярстувшхы
Try opening the file with the codecs module. You need to
import codecs
and then do
with codecs.open('text.txt', 'r', 'utf-8') as f:
    text = f.read()
Basically, you need UTF-8.
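A note on the output in the question: the leading ÿþ is how the two bytes FF FE, the UTF-16 little-endian byte order mark, appear when shown in a single-byte encoding such as Latin-1. That suggests text.txt may have been saved as UTF-16 rather than UTF-8, in which case passing encoding='utf-16' is worth trying. A minimal sketch, assuming that is the case; the sample file and the helper name read_utf16_text are made up for illustration:

```python
import codecs
import os
import tempfile


def read_utf16_text(path):
    """Read a text file, letting the 'utf-16' codec consume the BOM."""
    with open(path, 'r', encoding='utf-16') as f:
        return f.read()


# Build a small UTF-16 LE sample file with a BOM
# (a hypothetical stand-in for the text.txt from the question).
sample = 'абц'
raw = codecs.BOM_UTF16_LE + sample.encode('utf-16-le')
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

# How the BOM bytes look when misread in a single-byte encoding.
bom_as_latin1 = raw[:2].decode('latin-1')

# Reading with the matching encoding gives back the original text.
decoded = read_utf16_text(sample_path)
os.remove(sample_path)
```

Here bom_as_latin1 is 'ÿþ' and decoded equals the original sample string. If the file really is UTF-8, the encoding='utf-8' advice above applies instead.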
I was trying to decode a JSON file that contains escaped Unicode text (\uHHHH). The original text is Arabic.
My research led me to the following code using Python.
s = '\u00d8\u00b5\u00d9\u0088\u00d8\u00b1 \u00d8\u00a7\u00d9\u0084\u00d9\u008a\u00d9\u0088\u00d9\u0085\u00d9\u008a\u00d8\u00a7\u00d8\u00aa'
ouy= s.encode('utf-8').decode('unicode-escape').encode('latin1').decode('utf-8')
print(ouy)
the result text will be: صÙر اÙÙÙÙÙات
which still needs some fix using online tool to become the original text: صور اليوميات
Is there any way to perform that fix using the above code?
Would appreciate your help guys, thanks in advance
You can use this script to update all JSON files:
import json
filename = 'YourFile.json' # file name we want to compress
newname = filename.replace('.json', '.min.json') # Output file name
with open(filename, encoding="utf8") as fp:
    print("Compressing file: " + filename)
    print('Compressing...')
    jload = json.load(fp)
    newfile = json.dumps(jload, indent=None, separators=(',', ':'), ensure_ascii=False)
    newfile = newfile.encode('latin1').decode('utf-8') # remove this
    #print(newfile)
    with open(newname, 'w', encoding="utf8") as f: # add encoding="utf8"
        f.write(newfile)
    print('Compression complete!')
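For the single string in the question, a shorter route seems to work: each character in s is really one byte of UTF-8 that was decoded as Latin-1, so encoding back to Latin-1 and decoding as UTF-8 undoes the damage. A minimal sketch; the helper name fix_mojibake is hypothetical, and it assumes every character in the input fits in Latin-1:

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1."""
    # Latin-1 maps code points 0-255 straight back to bytes 0-255,
    # which recovers the original UTF-8 byte sequence.
    return text.encode('latin-1').decode('utf-8')


# The string from the question.
s = '\u00d8\u00b5\u00d9\u0088\u00d8\u00b1 \u00d8\u00a7\u00d9\u0084\u00d9\u008a\u00d9\u0088\u00d9\u0085\u00d9\u008a\u00d8\u00a7\u00d8\u00aa'
fixed = fix_mojibake(s)
# fixed is now 'صور اليوميات'
```

If the input contains characters outside Latin-1, encode raises UnicodeEncodeError, so this only fits text damaged in exactly this way.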
I want to use either R or Python to convert a .csv.gz file to UTF-8 encoding. How can I do this directly? I have not been able to find a comprehensive guide on how to do this.
My best attempt was to read the .csv.gz file with csv.reader in Python:
import csv
import gzip
csvFile = gzip.open('pracodawcy_20190611_5.csv.gz', 'rt', newline='')
reader = csv.reader(csvFile)
But how do I then save it as a CSV with UTF-8 encoding?
This is straightforward. The following reads the file into a list of lines:
import gzip
### assuming the file is separated as you said
with gzip.open('input_file.csv.gz', 'rt', newline='\n') as f:
    content = f.readlines()
### to print the list contents
for v in content:
    print(v)
### to write to .csv.gz
with gzip.open('output.csv.gz', 'wb') as f:
    for v in content:
        f.write(v.encode('utf-8'))
If the file is too big, you can also read it lazily, line by line, by iterating over the file object with a for loop instead of calling readlines(). There are a lot of examples here and on the web.
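A sketch of the streaming variant, which decodes and re-encodes one line at a time instead of holding the whole file in memory. The source encoding is not stated in the question, so src_encoding is a parameter you have to supply; the function name and the 'cp1250' value in the usage comment are assumptions for illustration only:

```python
import gzip


def reencode_gzip_to_utf8(src_path, dst_path, src_encoding):
    """Stream a gzipped text file into a new gzipped file encoded as UTF-8."""
    # newline='' keeps the original line endings untouched, which matters
    # for CSV data that may contain embedded newlines in quoted fields.
    with gzip.open(src_path, 'rt', encoding=src_encoding, newline='') as src, \
         gzip.open(dst_path, 'wt', encoding='utf-8', newline='') as dst:
        for line in src:  # one line at a time, never the whole file
            dst.write(line)


# Usage (the encoding is a guess, check what your file really uses):
# reencode_gzip_to_utf8('input_file.csv.gz', 'output.csv.gz', 'cp1250')
```

Opening both files in text mode lets gzip handle the decoding and encoding, so no explicit encode call is needed in the loop.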
I read the "Unicode Pain" article days ago, and I keep the "Unicode Sandwich" in mind.
Now I have to handle some Chinese and I've got a list
chinese = [u'中文', u'你好']
Do I need to encode it before writing to a file?
add_line_break = [word + u'\n' for word in chinese]
encoded_chinese = [word.encode('utf-8') for word in add_line_break]
with open('filename', 'wb') as f:
    f.writelines(encoded_chinese)
Somehow I found out that in Python 2 I can do this:
chinese = ['中文', '你好']
with open('filename', 'wb') as f:
    f.writelines(chinese)
No Unicode matters involved. :D
You don't have to do that, you could use io or codecs to open the file with encoding.
import io
with io.open('file.txt', 'w', encoding='utf-8') as f:
    f.write(u'你好')
codecs.open has the same syntax.
In Python 3:
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write('你好')
will do just fine.
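For the list from the question, a Python 3 sketch could look like this. The function name write_lines and the file path in the usage comment are placeholders:

```python
def write_lines(path, words):
    """Write each string on its own line, encoded as UTF-8."""
    # The file object does the encoding, so the strings stay as str
    # until the very edge of the program (the "Unicode sandwich").
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(word + '\n' for word in words)


chinese = ['中文', '你好']
# write_lines('filename', chinese)
```

No manual encode step is needed, because the text-mode file handles it on write.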
How do you change the encoding through a python script?
I've got some files that I'm looping over to do some other stuff. But before that, I need to change the encoding of each file from UTF-8 to UTF-16, since SQL Server does not support UTF-8.
I tried this, but it is not working.
data = "UTF-8 data"
udata = data.decode("utf-8")
data = udata.encode("utf-16","ignore")
Cheers!
If you want to convert a file from utf-8 encoding to a file with utf-16 encoding, this script works:
#!/usr/bin/python2.7
import codecs
import shutil
with codecs.open("input_file.utf8.txt", encoding="utf-8") as input_file:
    with codecs.open(
            "output_file.utf16.txt", "w", encoding="utf-16") as output_file:
        shutil.copyfileobj(input_file, output_file)
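On Python 3 the built-in open accepts an encoding argument, so the same conversion can be sketched without codecs. The function name is hypothetical; newline='' is passed so that line endings are copied unchanged:

```python
import shutil


def convert_utf8_to_utf16(src_path, dst_path):
    """Copy a UTF-8 text file into a new file encoded as UTF-16."""
    with open(src_path, 'r', encoding='utf-8', newline='') as src, \
         open(dst_path, 'w', encoding='utf-16', newline='') as dst:
        # copyfileobj reads and writes in chunks, so large files are fine.
        shutil.copyfileobj(src, dst)


# Usage, with the file names from the script above:
# convert_utf8_to_utf16('input_file.utf8.txt', 'output_file.utf16.txt')
```

The 'utf-16' codec writes a byte order mark at the start of the output and uses the machine's native byte order. If the consumer needs a specific byte order, 'utf-16-le' or 'utf-16-be' can be used instead, but those do not write a BOM.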
f = open("go.txt", "w")
f.write(title)
f.close()
What if "title" is in Japanese (UTF-8)? How do I modify this code to be able to write "title" without getting the ASCII error?
Edit: Then, how do I read this file in UTF-8?
How to use UTF-8:
import codecs
# ...
# title is a unicode string
# ...
f = codecs.open("go.txt", "w", "utf-8")
f.write(title)
# ...
fileObj = codecs.open("go.txt", "r", "utf-8")
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
It depends on whether you want to insert a UTF-8 byte order mark (BOM). The only way I know of to do that is to open the file in binary mode and write it yourself:
import codecs
f = open('go.txt', 'wb')
f.write(codecs.BOM_UTF8)
f.write(title.encode('utf-8'))
f.close()
Generally, though, I don't want to add a UTF-8 BOM, and the following will suffice:
import codecs
f = codecs.open('go.txt', 'w', 'utf-8')
f.write(title)
f.close()
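Python also ships a 'utf-8-sig' codec, which writes the BOM when encoding and skips it, if present, when decoding. A sketch using the built-in open in Python 3; the helper names are made up here:

```python
def write_utf8_with_bom(path, text):
    """Write text as UTF-8 with a leading byte order mark."""
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write(text)


def read_utf8_skip_bom(path):
    """Read UTF-8 text, dropping a leading BOM if the file has one."""
    with open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()


# Usage, with the file name from the question:
# write_utf8_with_bom('go.txt', title)
# title_again = read_utf8_skip_bom('go.txt')
```

Reading with 'utf-8-sig' also works on files that have no BOM, so it is a reasonable choice when you do not know which kind of file you will get.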