Changing the encoding of a CSV file through Python: UTF-8 to UTF-16

How do you change a file's encoding through a Python script?
I've got some files that I'm looping over to do some other processing, but before that I need to change the encoding of each file from UTF-8 to UTF-16, since SQL Server does not support UTF-8.
I tried this, but it's not working:
data = "UTF-8 data"
udata = data.decode("utf-8")
data = udata.encode("utf-16","ignore")
Cheers!

If you want to convert a file from UTF-8 to UTF-16 encoding, this script works:
#!/usr/bin/python2.7
import codecs
import shutil

with codecs.open("input_file.utf8.txt", encoding="utf-8") as input_file:
    with codecs.open(
            "output_file.utf16.txt", "w", encoding="utf-16") as output_file:
        shutil.copyfileobj(input_file, output_file)
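In Python 3 the built-in open() accepts an encoding argument directly, so the codecs module is not needed. A minimal sketch of the same conversion (same hypothetical filenames as above):

import shutil

# open() decodes/encodes on the fly in Python 3; copyfileobj streams
# the text across without loading the whole file into memory.
with open("input_file.utf8.txt", encoding="utf-8") as input_file:
    with open("output_file.utf16.txt", "w", encoding="utf-16") as output_file:
        shutil.copyfileobj(input_file, output_file)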

Related

How to display Cyrillic text from a file in Python?

I want to read some Cyrillic text from a txt file in Python 3.
This is what the text file contains.
абцдефгчийклмнопярстувшхыз
I used:
with open('text.txt', 'r') as myfile:
    text = myfile.read()
print(text)
But this is the output in the Python shell:
ÿþ01F45D3G89:;<=>?O#ABC2HEK7
Can someone explain why this is the output?
Python supports utf-8 for this sort of thing.
You should be able to do:
with open('text.txt', encoding='utf-8', mode='r') as my_file:
    ...
Also, be sure that your text file is saved with utf-8 encoding. I tested this in my shell and without proper encoding my output was:
?????????????????????
With proper encoding:
file = open('text.txt', encoding='utf-8', mode='r')
text = file.read()
print(text)
абцдефгчийклмнопярстувшхы
Try working on the file using codecs. You need to
import codecs
and then do
text = codecs.open('text.txt', 'r', 'utf-8').read()
Basically, you need UTF-8.
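Worth noting: the ÿþ at the start of the garbled output is the UTF-16-LE byte order mark (bytes FF FE) rendered through a single-byte codec, which suggests the file may actually be saved as UTF-16, not UTF-8. If the suggestions above still print garbage, a minimal sketch worth trying:

# 'utf-16' consumes the BOM and picks the right byte order from it.
with open('text.txt', encoding='utf-16') as myfile:
    text = myfile.read()
print(text)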

Can't decode from windows1252 to UTF8

I know questions about encoding and decoding in utf-8 have been asked so many times, but I could not find an answer to my question.
I have a CSV file in windows-1252 and I want to convert it to UTF-8; here is the script:
import os
import sys
import inspect
import codecs
import chardet
from bs4 import UnicodeDammit
#Declare the variables
defaultencoding = 'utf-8'
filename = '19-01-2017+06-00-00.csv'
#open the file and get the content
file_obj = open(filename,"r")
content = file_obj.read()
file_obj.close()
#Check the initial encoding using both unicodeDammit and chardet
dammit = UnicodeDammit(content)
#print it
print(dammit.original_encoding)
print(chardet.detect(content)['encoding'])
#Decode in UTF8
content_decoded = content.decode('windows-1252')
content_encoded = content_decoded.encode(defaultencoding)
#Write the result in a temporary file
file_obj = open('tmp.txt',"w")
try:
    file_obj.write(content_encoded)
finally:
    file_obj.close()
#Read the result decoded file
file_obj = open('tmp.txt', "r")
content = file_obj.read()
file_obj.close()
#Check if it is really in UTF8 using both unicodeDammit and chardet
dammit = UnicodeDammit(content)
print(dammit.original_encoding)
print(chardet.detect(content)['encoding'])
Output:
windows-1252
windows-1252
windows-1252
windows-1252
Expected output:
windows-1252
windows-1252
utf-8
utf-8
I used chardet and UnicodeDammit because I found out that chardet does not give the correct encoding guess all the time.
Why can't I encode the file in UTF-8?
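For reference, a minimal Python 3 sketch of the conversion itself, sidestepping the detection step entirely (assuming the input really is windows-1252; the filename is the question's):

# Decode as windows-1252 on read, encode as UTF-8 on write.
with open('19-01-2017+06-00-00.csv', encoding='windows-1252') as src:
    content = src.read()
with open('tmp.txt', 'w', encoding='utf-8') as dst:
    dst.write(content)

As for the unexpected output: detectors guess. For short or mostly-ASCII data, chardet can report windows-1252 even for valid UTF-8 output, so the detection result is not a reliable check that the conversion worked.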

Writing CSV file with umlauts causing "UnicodeEncodeError: 'ascii' codec can't encode character"

I am trying to write characters with double dots (umlauts) such as ä, ö and Ö. I am able to write them to the file with data.encode("utf-8"), but the result b'\xc3\xa4\xc3\xa4\xc3\x96' is not nice (the UTF-8 bytes as literal characters). I want "ääÖ" stored in the file as written.
How can I write data with umlaut characters to a CSV file in Python 3?
import csv
data = "ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    data = resultFile
    a.writerows(data)
Traceback:
  File "<ipython-input-280-73b1f615929e>", line 5, in <module>
    a.writerows(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 15: ordinal not in range(128)
Add an encoding parameter to open() and set it to 'utf8':
import csv
data = "ääÖ"
with open("test.csv", 'w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Edit: removed the use of the io library, since open() is the same as io.open() in Python 3.
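A side note on the data itself: writerows() expects an iterable of rows, so passing the string "ääÖ" writes one character per row. A minimal sketch that writes the umlauts as a single row instead (Python 3; newline='' is the csv module's documented way to avoid extra blank lines on Windows):

import csv

data = "ääÖ"
with open("test.csv", "w", encoding="utf8", newline="") as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerow([data])  # one row with one field: "ääÖ"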
This solution should work on both Python 2 and 3 (the coding declaration is not needed in Python 3):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv

data = "ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Credits to:
Working with utf-8 encoding in Python source

Python: Special characters encoding

This is the code I am using to replace special characters in text files and concatenate them into a single file.
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = dirpath + "\\" + fname
        with codecs.open(currentfile, encoding='utf8') as infile:
            #print currentfile
            outfile.write(fname)
            outfile.write('\n')
            outfile.write('\n')
            for line in infile:
                line = line.replace(u"´ı", "i")
                line = line.replace(u"ï¬", "fi")
                line = line.replace(u"fl", "fl")
                outfile.write(line)
The first line.replace works fine while the others do not (which makes sense), and since no errors were generated, I thought there might be a problem of "visibility" (if that's the term). And so I made this:
import codecs

currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ï¬" not in line: print "not found!"
which always prints "not found!", suggesting that those characters aren't being read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, I get this error:
Traceback (most recent call last):
  File "C:\path\to\concat.py", line 30, in <module>
    outfile.write(line)
  File "C:\Python27\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
Since I am not really experienced in Python, I can't figure it out from the sources already available: the Python documentation (1, 2) and relevant Stack Overflow questions (1, 2).
I am stuck here. Any suggestions? All answers are welcome!
There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ï¬", u"fi")
                line = line.replace(u"fl", u"fl")
                outfile.write(line)
This tells the interpreter that you used the UTF-8 codec to save your source file, ensuring that the u"´ı" code points are correctly decoded to Unicode values. Specifying an encoding when opening files with codecs.open() makes sure that the lines you read are decoded to Unicode values, and that your Unicode values are written out to the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.
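For comparison, a minimal Python 3 sketch of the same loop, where the built-in open() handles the encoding and neither codecs nor u'' literals are needed (paths are the question's hypothetical ones):

import os

dirpath = r"C:\Users\user\path\to\textfiles"
with open(r"C:\Users\user\path\to\output.txt", "w", encoding="utf8") as outfile:
    for fname in os.listdir(dirpath):
        outfile.write(fname + "\n\n")
        with open(os.path.join(dirpath, fname), encoding="utf8") as infile:
            for line in infile:
                # Same replacements as above; str literals are Unicode in Python 3.
                line = line.replace("´ı", "i").replace("ï¬", "fi").replace("fl", "fl")
                outfile.write(line)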

This is my current way of writing to a file. However, how do I make it handle UTF-8?

f = open("go.txt", "w")
f.write(title)
f.close()
What if "title" is in japanese/utf-8? How do I modify this code to be able to write "title" without having the ascii error?
Edit: Then, how do I read this file in UTF-8?
How to use UTF-8:
import codecs
# ...
# title is a unicode string
# ...
f = codecs.open("go.txt", "w", "utf-8")
f.write(title)
# ...
fileObj = codecs.open("go.txt", "r", "utf-8")
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
It depends on whether you want to insert a UTF-8 byte order mark (BOM). The only way I know of to do that is to open a normal file in binary mode and write it yourself:
import codecs
f = open('go.txt', 'wb')
f.write(codecs.BOM_UTF8)
f.write(title.encode('utf-8'))
f.close()
Generally, though, I don't want to add a UTF-8 BOM, and the following will suffice:
import codecs
f = codecs.open('go.txt', 'w', 'utf-8')
f.write(title)
f.close()
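In Python 3, neither codecs call is needed: the built-in open() takes an encoding directly, and the 'utf-8-sig' codec writes the BOM for you. A minimal sketch, assuming title is a str:

# Plain UTF-8, no BOM:
with open('go.txt', 'w', encoding='utf-8') as f:
    f.write(title)

# UTF-8 with a BOM: 'utf-8-sig' prepends the BOM on write
# and strips it when reading the file back.
with open('go.txt', 'w', encoding='utf-8-sig') as f:
    f.write(title)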
