Removing BOM from gzip'ed CSV in Python

Removing BOM from gzip'ed CSV in Python - python

I'm using the following code to unzip and save a CSV file:
with gzip.open(filename_gz) as f:
file = open(filename, "w");
output = csv.writer(file, delimiter = ',')
output.writerows(csv.reader(f, dialect='excel', delimiter = ';'))
Everything seems to work, except for the fact that the first characters in the file are unexpected. Googling around seems to indicate that it is due to BOM in the file.
I've read that encoding the content in utf-8-sig should fix the issue. However, adding:
.read().encoding('utf-8-sig')
to f in csv.reader fails with:
File "ckan_gz_datastore.py", line 16, in <module>
output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
How can I remove the BOM and just save the content in correct utf-8?

First, you need to decode the file contents, not encode them.
Second, the csv module doesn't like unicode strings in Python 2.7, so having decoded your data you need to convert back to utf-8.
Finally, csv.reader is passed an iteration over the lines of the file, not a big string with linebreaks in it.
So:
csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
However, you might consider it simpler / more efficent just to remove the BOM manually:
def remove_bom(line):
return line[3:] if line.startswith(codecs.BOM_UTF8) else line
csv.reader((remove_bom(line) for line in f), dialect = 'excel', delimiter = ';')
That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:
def remove_bom_from_first(iterable):
f = iter(iterable)
firstline = next(f, None)
if firstline is not None:
yield remove_bom(firstline)
for line in f:
yield f

Related

Searching for a string in a file is not working in Python

I am using this code to find a string in Python:
buildSucceeded = "Build succeeded."
datafile = r'C:\PowerBuild\logs\Release\BuildAllPart2.log'
with open(datafile, 'r') as f:
for line in f:
if buildSucceeded in line:
print(line)
I am quite sure there is the string in the file although it does not return anything.
If I just print one line by line it returns a lot of 'NUL' characters between each "valid" character.
EDIT 1:
The problem was the encoding of Windows. I changed the encoding following this post and it worked: Why doesn't Python recognize my utf-8 encoded source file?
Anyway the file looks like this:
Line 1.
Line 2.
...
Build succeeded.
0 Warning(s)
0 Error(s)
...
I am currently testing with Sublime for Windows editor - which outputs a 'NUL' character between each "real" character which is very odd.
Using python command line I have this output:
C:\Dev>python readFile.py
Traceback (most recent call last):
File "readFile.py", line 7, in <module>
print(line)
File "C:\Program Files\Python35\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>
Thanks for your help anyway...

If your file is not that big you can do a simple find. Otherwise I would check to file to see if you have the string in the file/ check the location for any spelling mistakes and try to narrow down the problem.
f = open(datafile, 'r')
lines = f.read()
answer = lines.find(buildSucceeded)
Also note that if it does not find the string answer would be -1.

As explained, the problem happening was related to encoding. In the below website there is a very good explanation on how to convert between files with one encoding to some other.
I used the last example (with Python 3 which is my case) it worked as expected:
buildSucceeded = "Build succeeded."
datafile = 'C:\\PowerBuild\\logs\\Release\\BuildAllPart2.log'
# Open both input and output streams.
#input = open(datafile, "rt", encoding="utf-16")
input = open(datafile, "r", encoding="utf-16")
output = open("output.txt", "w", encoding="utf-8")
# Stream chunks of unicode data.
with input, output:
while True:
# Read a chunk of data.
chunk = input.read(4096)
if not chunk:
break
# Remove vertical tabs.
chunk = chunk.replace("\u000B", "")
# Write the chunk of data.
output.write(chunk)
with open('output.txt', 'r') as f:
for line in f:
if buildSucceeded in line:
print(line)
Source: http://blog.etianen.com/blog/2013/10/05/python-unicode-streams/

Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

I'm currently have an issue with my python 3 code.
replace_line('Products.txt', line, tenminus_str)
Is the line I'm trying to turn into utf-8, however when I try to do this like I would with others, I get errors such as no attribute ones and when I try to add, for example...
.decode("utf8")
...to the end of it, I still get errors that it is using ascii. I also tried other methods that worked with other lines such as adding io. infront and adding a comma with
encoding = 'utf8'
The function that I am using for replace_line is:
def replace_line(file_name, line_num, text):
lines = open(file_name, 'r').readlines()
lines[line_num] = text
out = open(file_name, 'w')
out.writelines(lines)
out.close()
How would I fix this issue? Please note that I'm very new to Python and not advanced enough to do debugging well.
EDIT: Different fix to this question than 'duplicate'
EDIT 2:I have another error with the function now.
File "FILELOCATION", line 45, in refill replace_line('Products.txt', str(line), tenminus_str)
File "FILELOCATION", line 6, in replace_line lines[line_num] = text
TypeError: list indices must be integers, not str
What does this mean and how do I fix it?

Change your function to:
def replace_line(file_name, line_num, text):
with open(file_name, 'r', encoding='utf8') as f:
lines = f.readlines()
lines[line_num] = text
with open(file_name, 'w', encoding='utf8') as out:
out.writelines(lines)
encoding='utf8' will decode your UTF-8 file correctly.
with automatically closes the file when its block is exited.
Since your file started with \xef it likely has a UTF-8-encoding byte order mark (BOM) character at the beginning. The above code will maintain that on output, but if you don't want it use utf-8-sig for the input encoding. Then it will be automatically removed.

codecs module is just what you need. detail here
import codecs
def replace_line(file_name, line_num, text):
f = codecs.open(file_name, 'r', encoding='utf-8')
lines = f.readlines()
lines[line_num] = text
f.close()
w = codecs.open(file_name, 'w', encoding='utf-8')
w.writelines(lines)
w.close()

Handling coding problems You can try adding the following settings to your head
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Type = sys.getfilesystemencoding()

Try adding encoding='utf8' if you are reading a file
with open("../file_path", encoding='utf8'):
# your code

Program (twitter bot) works on Windows machine, but not on Linux machine [duplicate]

I was trying to read a file in python2.7, and it was readen perfectly. The problem that I have is when I execute the same program in Python3.4 and then appear the error:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
Also, when I run the program in Windows (with python3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename,colkey,colvalue):
f = open(filename,'r')
D = dict()
for line in f:
if line == '\n': continue
D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]
f.close
return D
Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

In Python2,
f = open(filename,'r')
for line in f:
reads lines from the file as bytes.
In Python3, the same code reads lines from the file as strings. Python3
strings are what Python2 call unicode objects. These are bytes decoded
according to some encoding. The default encoding in Python3 is utf-8.
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
for line in f:
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings
def all_encodings():
modnames = set(
[modname for importer, modname, ispkg in pkgutil.walk_packages(
path=[os.path.dirname(encodings.__file__)], prefix='')])
aliases = set(encodings.aliases.aliases.values())
return modnames.union(aliases)
filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
try:
with open(filename, encoding=enc) as f:
# print the encoding and the first 500 characters
print(enc, f.read(500))
except Exception:
pass

Ok, I did the same as #unutbu tell me. The result was a lot of encodings one of these are cp1250, for that reason I change :
f = open(filename,'r')
to
f = open(filename,'r', encoding='cp1250')
like #triplee suggest me. And now I can read my files.

In my case I can't change encoding because my file is really UTF-8 encoded. But some rows are corrupted and causes the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My decision is to open file in binary mode:
open(filename, 'rb')

Delete every non utf-8 symbols from string

I have a big amount of files and parser. What I Have to do is strip all non utf-8 symbols and put data in mongodb.
Currently I have code like this.
with open(fname, "r") as fp:
for line in fp:
line = line.strip()
line = line.decode('utf-8', 'ignore')
line = line.encode('utf-8', 'ignore')
somehow I still get an error
bson.errors.InvalidStringData: strings in documents must be valid UTF-8:
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin
I don't get it. Is there some simple way to do it?
UPD: seems like Python and Mongo don't agree about definition of Utf-8 Valid string.

Try below code line instead of last two lines. Hope it helps:
line=line.decode('utf-8','ignore').encode("utf-8")

For python 3, as mentioned in a comment in this thread, you can do:
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
The 'ignore' parameter prevents an error from being raised if any characters are unable to be decoded.
If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').

Example to handle no utf-8 characters
import string
test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
print ''.join(x for x in test if x in string.printable)

with open(fname, "r") as fp:
for line in fp:
line = line.strip()
line = line.decode('cp1252').encode('utf-8')

Python: Special characters encoding

This is the code i am using in order to replace special characters in text files and concatenate them to a single file.
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
for fname in filenames:
currentfile = dirpath+"\\"+fname
with codecs.open(currentfile, encoding='utf8') as infile:
#print currentfile
outfile.write(fname)
outfile.write('\n')
outfile.write('\n')
for line in infile:
line = line.replace(u"´ı", "i")
line = line.replace(u"ï¬", "fi")
line = line.replace(u"ï¬‚", "fl")
outfile.write (line)
The first line.replace works fine while the others do not (which makes sense) and since no errors were generated, i though there might be a problem of "visibility" (if that's the term).And so i made this:
import codecs
currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
with open(currentfile) as infile:
for line in infile:
if "ï¬" not in line: print "not found!"
which always returns "not found!" proving that those characters aren't read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, i get this error:
Traceback (most recent call last):
File C:\\path\\to\\concat.py, line 30, in <module>
outfile.write(line)
File C:\\Python27\\codecs.py, line 691, in write
return self.writer.write(data)
File C:\\Python27\\codecs.py, line 351, in write
data, consumed = self.encode(object, self.errors)
Unicode DecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal
not in range (128)
Since i am not really experienced in python i can't figure it out, by the different sources already available: python documentation (1,2) and relevant questions in StackOverflow (1,2)
I am stuck here. Any suggestions?? all answers are welcome!

There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
for fname in filenames:
currentfile = os.path.join(dirpath, fname)
with codecs.open(currentfile, encoding='utf8') as infile:
outfile.write(fname + '\n\n')
for line in infile:
line = line.replace(u"´ı", u"i")
line = line.replace(u"ï¬", u"fi")
line = line.replace(u"ï¬‚", u"fl")
outfile.write (line)
This specifies to the interpreter that you used the UTF-8 codec to save your source files, ensuring that the u"´ı" code points are correctly decoded to Unicode values, and using encoding when opening files with codec.open() makes sure that the lines you read are decoded to Unicode values and ensures that your Unicode values are written out to the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing BOM from gzip'ed CSV in Python - python

Related

Searching for a string in a file is not working in Python

Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Program (twitter bot) works on Windows machine, but not on Linux machine [duplicate]

Delete every non utf-8 symbols from string

Python: Special characters encoding

Categories

Resources