I'm using Python to read many files and re-encode them to UTF-8. I tried it with the code below:
import os
from os import listdir

def find_csv_filenames(path_to_dir, suffix=".csv"):
    path_to_dir = os.path.normpath(path_to_dir)
    filenames = listdir(path_to_dir)
    # Keep only *.csv entries that are files, not directories
    fp = lambda f: not os.path.isdir(path_to_dir + "/" + f) and f.endswith(suffix)
    return [path_to_dir + "/" + fname for fname in filenames if fp(fname)]

def convert_files(files, ascii, to="utf-8"):
    # note: the source-encoding argument ('ascii') is never used; the read hard-codes latin-1
    lineno = 0
    for name in files:
        lineno = lineno + 1
        with open(name, mode='r', encoding='latin-1') as file_target:
            file_content = file_target.read()
        print(lineno)
        with open("./csv/data{}.csv".format(lineno), mode='w', encoding=to) as file_source:
            file_source.write(file_content)

csv_files = find_csv_filenames('./csv', ".csv")
convert_files(csv_files, "cp866")
The problem is that after I read the data and write it out to new files with UTF-8 encoding, the result is still not correct.
Before you open a file whose encoding is unclear, you could use chardet to detect the file's encoding rather than opening it with a guessed encoding. Usage is like this:
>>> import chardet
>>> with open('PATH/TO/FILE', 'rb') as f:
...     encoding = chardet.detect(f.read())['encoding']
And then open the file with the encoding detected and write the contents into a file opened with 'utf-8' encoding.
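For instance, a minimal sketch of that detect-then-convert step (the file names here are placeholders, not from the question):

import chardet

# Read raw bytes, let chardet guess the encoding, then re-encode as UTF-8.
# 'input.csv' and 'output.csv' are hypothetical names for illustration.
with open('input.csv', 'rb') as f:
    raw = f.read()
encoding = chardet.detect(raw)['encoding']

text = raw.decode(encoding)
with open('output.csv', 'w', encoding='utf-8') as f:
    f.write(text)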
If you're not sure whether the file was converted to 'utf-8' encoding, you could use enca in a Linux shell to check whether the encoding of the file is 'ASCII' or 'utf-8', like this:
$ enca FILENAME
I have multiple gzipped files in subfolders that I want to unzip into one folder. That works fine, but there's a BOM signature at the beginning of each file that I would like removed. I have checked other questions like Removing BOM from gzip'ed CSV in Python or Convert UTF-8 with BOM to UTF-8 with no BOM in Python, but they don't seem to work. I use Python 3.6 in PyCharm on Windows.
Here's my code first, without any BOM-removal attempt:
import gzip
import pickle
import glob

def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

output_path = 'path_out'
i = 1
for filename in glob.iglob('path_in/**/*.gz', recursive=True):
    print(filename)
    with gzip.open(filename, 'rb') as f:
        file_content = f.read()
    new_file = output_path + "z" + str(i) + ".txt"
    save_object(file_content, new_file)
    i += 1
Now, with the logic defined in Removing BOM from gzip'ed CSV in Python (at least what I understand of it), if I replace file_content = f.read() with file_content = csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines()), I get:
TypeError: can't pickle _csv.reader objects
I checked for this error (e.g. "Can't pickle <type '_csv.reader'>" error when using multiprocessing on Windows) but I found no solution I could apply.
A minor adaptation of the very first question you link to trivially works.
tripleee$ cat bomgz.py
import gzip
from subprocess import run
with open('bom.txt', 'w') as handle:
    handle.write('\ufeffmoo!\n')
run(['gzip', 'bom.txt'])
with gzip.open('bom.txt.gz', 'rb') as f:
    file_content = f.read().decode('utf-8-sig')
with open('nobom.txt', 'w') as output:
    output.write(file_content)
tripleee$ python3 bomgz.py
tripleee$ gzip -dc bom.txt.gz | xxd
00000000: efbb bf6d 6f6f 210a ...moo!.
tripleee$ xxd nobom.txt
00000000: 6d6f 6f21 0a moo!.
The pickle parts didn't seem relevant here but might have been obscuring the goal of getting a block of decoded str out of an encoded blob of bytes.
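Applied to the loop in the question, a sketch might look like this (keeping the question's 'path_in'/'path_out' placeholders, and writing the decoded text directly instead of pickling):

import glob
import gzip

output_path = 'path_out'
i = 1
for filename in glob.iglob('path_in/**/*.gz', recursive=True):
    with gzip.open(filename, 'rb') as f:
        # utf-8-sig strips a leading BOM if present, otherwise decodes as plain UTF-8
        file_content = f.read().decode('utf-8-sig')
    with open(output_path + "z" + str(i) + ".txt", 'w', encoding='utf-8') as out:
        out.write(file_content)
    i += 1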
I have an iso-8859-9 encoded csv file and am trying to read it into a dataframe.
Here is the code and the error I got.
iller = pd.read_csv('/Users/me/Documents/Works/map/dist.csv' ,sep=';',encoding='iso-8859-9')
iller.head()
and the error is
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 250: ordinal not in range(128)
and the code below works without error.
import codecs
myfile = codecs.open('/Users/me/Documents/Works/map/dist.csv', "r", encoding='iso-8859-9')
for a in myfile:
    print a
My question is: why is pandas not reading my correctly encoded file? And is there any way to make it read it?
It's not possible to see what could be off with your data, of course, but if you can read the data without issues with codecs, then maybe an idea would be to write the file out in UTF-8 encoding(?)
import codecs

filename = '/Users/me/Documents/Works/map/dist.csv'
target_filename = '/Users/me/Documents/Works/map/dist-utf-8.csv'
myfile = codecs.open(filename, "r", encoding='iso-8859-9')
f_contents = myfile.read()
or
import codecs

with codecs.open(filename, 'r', encoding='iso-8859-9') as fh:
    f_contents = fh.read()

# write out in UTF-8
with codecs.open(target_filename, 'w', encoding='utf-8') as fh:
    fh.write(f_contents)
I hope this helps!
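If the conversion succeeds, the new file should then load without the decode error (a sketch, reusing the target path and separator from above):

import pandas as pd

# Read the re-encoded copy; sep=';' matches the original read_csv call
iller = pd.read_csv('/Users/me/Documents/Works/map/dist-utf-8.csv', sep=';', encoding='utf-8')
print(iller.head())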
I am trying to use the unicodecsv Python library in Python 2.7.x.
import codecs
import unicodecsv

def read(self, path):
    with codecs.open(path, "rb", encoding="utf-8") as f:
        r = unicodecsv.reader(f, encoding='utf-8')
        row = r.next()
        print row

read("unicode.csv")
Error:
'charmap' codec can't encode characters in position xx - xx
I have manually converted my csv file to utf-8 using text editors, so I am sure the input file is fine.
I see a few problems with your code:
def read(self, path):
You are using self, but the function is not inside a class.
Also, after opening the file with codecs.open you can use the standard Python csv reader.
With some modifications:
f = "/home/dzagorulkin/workspace/zont/file.txt"
import codecs
#import unicodecsv
def read(path):
with codecs.open(path, "rb", encoding = "utf-8") as f:
for line in f:
print line
read(f)
I used a non-ASCII file and the output was:
Меня Дима зовут! Меня Дима зовут!
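If you do want to stick with unicodecsv, it expects a byte stream and does the decoding itself, so a sketch would open the file in plain binary mode without codecs:

import unicodecsv

# unicodecsv decodes the raw bytes itself, so no codecs wrapper is needed
with open('unicode.csv', 'rb') as f:
    r = unicodecsv.reader(f, encoding='utf-8')
    for row in r:
        print row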
How do you change the encoding of a file through a Python script?
I've got some files that I'm looping over and doing some other stuff with. But before that, I need to change the encoding of each file from UTF-8 to UTF-16, since SQL Server does not support UTF-8.
Tried this, but it's not working.
data = "UTF-8 data"
udata = data.decode("utf-8")
data = udata.encode("utf-16","ignore")
Cheers!
If you want to convert a file from utf-8 encoding to a file with utf-16 encoding, this script works:
#!/usr/bin/python2.7
import codecs
import shutil

with codecs.open("input_file.utf8.txt", encoding="utf-8") as input_file:
    with codecs.open(
            "output_file.utf16.txt", "w", encoding="utf-16") as output_file:
        shutil.copyfileobj(input_file, output_file)
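The in-memory decode/encode from the question also works; the missing step is writing the resulting bytes out in binary mode. A sketch, with hypothetical 'input.txt'/'output.txt' names:

# Python 2 sketch: read UTF-8 bytes, re-encode, write UTF-16 bytes in binary mode
with open('input.txt', 'rb') as f:
    data = f.read()
udata = data.decode('utf-8')
with open('output.txt', 'wb') as f:
    f.write(udata.encode('utf-16', 'ignore'))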
f = open("go.txt", "w")
f.write(title)
f.close()
What if "title" is in japanese/utf-8? How do I modify this code to be able to write "title" without having the ascii error?
Edit: Then, how do I read this file in UTF-8?
How to use UTF-8:
import codecs
# ...
# title is a unicode string
# ...
f = codecs.open("go.txt", "w", "utf-8")
f.write(title)
# ...
fileObj = codecs.open("go.txt", "r", "utf-8")
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
It depends on whether you want to insert a UTF-8 byte order mark (BOM). The only way I know of to do that is to open a normal file in binary mode and write it yourself:
import codecs
f = open('go.txt', 'wb')
f.write(codecs.BOM_UTF8)
f.write(title.encode('utf-8'))
f.close()
Generally, though, I don't want to add a UTF-8 BOM, and the following will suffice:
import codecs
f = codecs.open('go.txt', 'w', 'utf-8')
f.write(title)
f.close()
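To answer the edit: reading the file back works like the codecs.open example above; if you wrote a BOM, the 'utf-8-sig' codec strips it on read. A small sketch:

import codecs

# 'utf-8-sig' drops a leading BOM if present and otherwise behaves like plain UTF-8
f = codecs.open('go.txt', 'r', 'utf-8-sig')
u = f.read()
f.close()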