Python: Special characters encoding

This is the code I am using to replace special characters in text files and concatenate them into a single file.
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = dirpath + "\\" + fname
        with codecs.open(currentfile, encoding='utf8') as infile:
            #print currentfile
            outfile.write(fname)
            outfile.write('\n')
            outfile.write('\n')
            for line in infile:
                line = line.replace(u"´ı", "i")
                line = line.replace(u"ﬁ", "fi")
                line = line.replace(u"ﬂ", "fl")
                outfile.write(line)
The first line.replace works fine while the others do not (which makes sense), and since no errors were generated, I thought there might be a problem of "visibility" (if that's the term). And so I made this:
import codecs
currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ﬁ" not in line: print "not found!"
which always prints "not found!", suggesting that those characters aren't being read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, I get this error:
Traceback (most recent call last):
  File "C:\path\to\concat.py", line 30, in <module>
    outfile.write(line)
  File "C:\Python27\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
Since I am not really experienced in Python, I can't figure it out from the different sources already available: the Python documentation (1, 2) and relevant questions on Stack Overflow (1, 2).
I am stuck here. Any suggestions? All answers are welcome!

There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ﬁ", u"fi")
                line = line.replace(u"ﬂ", u"fl")
                outfile.write(line)
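If you would rather not depend on the source file's encoding for the replacement data, the same two ligature replacements can be written with Unicode escape codes; this is a sketch of the escape-code alternative mentioned above (U+FB01 and U+FB02 are the standard fi and fl ligature code points):
# Escape-code alternative: no non-ASCII characters needed in the source file.
line = line.replace(u"\ufb01", u"fi")   # U+FB01 LATIN SMALL LIGATURE FI
line = line.replace(u"\ufb02", u"fl")   # U+FB02 LATIN SMALL LIGATURE FL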
The coding comment tells the interpreter that the source file was saved with the UTF-8 codec, ensuring that code points such as u"´ı" are correctly decoded to Unicode values. Specifying an encoding when opening files with codecs.open() makes sure that the lines you read are decoded to Unicode values, and that your Unicode values are written out to the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.
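A quick way to see the mismatch is to compare the byte representation with the decoded text. This is a minimal illustration, assuming Python 2 and a UTF-8-saved source file; the literals are examples, not taken from the original files:
# -*- coding: utf-8 -*-
# Minimal illustration of the byte/Unicode mismatch (Python 2).
raw = 'ﬁ'                      # byte string: three UTF-8 bytes, '\xef\xac\x81'
print len(raw)                  # -> 3
text = raw.decode('utf-8')      # one Unicode code point, u'\ufb01'
print len(text)                 # -> 1
print 'fi' in raw               # -> False: ASCII 'fi' never matches the raw bytes
print text.replace(u'\ufb01', u'fi')  # -> fi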

Related

Searching for a string in a file is not working in Python

I am using this code to find a string in Python:
buildSucceeded = "Build succeeded."
datafile = r'C:\PowerBuild\logs\Release\BuildAllPart2.log'
with open(datafile, 'r') as f:
    for line in f:
        if buildSucceeded in line:
            print(line)
I am quite sure the string is in the file, although the code does not print anything.
If I just print the file line by line, I get a lot of 'NUL' characters between each "valid" character.
EDIT 1:
The problem was the encoding of Windows. I changed the encoding following this post and it worked: Why doesn't Python recognize my utf-8 encoded source file?
Anyway the file looks like this:
Line 1.
Line 2.
...
Build succeeded.
0 Warning(s)
0 Error(s)
...
I am currently testing with the Sublime editor for Windows, which shows a 'NUL' character between each "real" character, which is very odd.
Using python command line I have this output:
C:\Dev>python readFile.py
Traceback (most recent call last):
  File "readFile.py", line 7, in <module>
    print(line)
  File "C:\Program Files\Python35\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>
Thanks for your help anyway...
If your file is not that big, you can read it whole and do a simple find. Otherwise I would check the file to see if the string is really there, check the path for any spelling mistakes, and try to narrow down the problem.
f = open(datafile, 'r')
lines = f.read()
answer = lines.find(buildSucceeded)
Also note that if it does not find the string, answer will be -1.
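Given the NUL characters between letters described in the question, the log file is very likely UTF-16 encoded; a minimal sketch of the same search with an explicit encoding (assuming UTF-16, as the accepted fix below also does):
buildSucceeded = "Build succeeded."
datafile = r'C:\PowerBuild\logs\Release\BuildAllPart2.log'
# Decode the UTF-16 log while reading so the ASCII pattern can match.
with open(datafile, 'r', encoding='utf-16') as f:
    for line in f:
        if buildSucceeded in line:
            print(line)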
As explained, the problem was related to encoding. The website below gives a very good explanation of how to convert files from one encoding to another.
I used the last example (with Python 3, which is my case) and it worked as expected:
buildSucceeded = "Build succeeded."
datafile = 'C:\\PowerBuild\\logs\\Release\\BuildAllPart2.log'
# Open both input and output streams.
#input = open(datafile, "rt", encoding="utf-16")
input = open(datafile, "r", encoding="utf-16")
output = open("output.txt", "w", encoding="utf-8")
# Stream chunks of unicode data.
with input, output:
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # Remove vertical tabs.
        chunk = chunk.replace("\u000B", "")
        # Write the chunk of data.
        output.write(chunk)

with open('output.txt', 'r') as f:
    for line in f:
        if buildSucceeded in line:
            print(line)
Source: http://blog.etianen.com/blog/2013/10/05/python-unicode-streams/

Writing CSV file with umlauts causing "UnicodeEncodeError: 'ascii' codec can't encode character"

I am trying to write characters with double dots (umlauts) such as ä, ö and Ö. I am able to write them to the file with data.encode("utf-8"), but the result b'\xc3\xa4\xc3\xa4\xc3\x96' is not nice (the UTF-8 bytes shown as literal characters). I want "ääÖ" stored in the file as written.
How can I write data with umlaut characters to a CSV file in Python 3?
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
a = csv.writer(fp, delimiter=";")
data=resultFile
a.writerows(data)
Traceback:
File "<ipython-input-280-73b1f615929e>", line 5, in <module>
a.writerows(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 15: ordinal not in range(128)
Add an encoding parameter to open() and set it to 'utf8'.
import csv
data = "ääÖ"
with open("test.csv", 'w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Edit: Removed the use of the io library, as open is the same as io.open in Python 3.
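For completeness, on Python 2 the built-in open() has no encoding parameter; io.open() (or codecs.open()) fills that gap. A minimal sketch, assuming Python 2 (note that Python 2's csv module writes byte strings and does not mix well with text-mode io.open, so the text is written directly here):
# -*- coding: utf-8 -*-
# Python 2 sketch: io.open accepts an encoding, unlike the built-in open.
import io
data = u"ääÖ"
with io.open("test.csv", 'w', encoding='utf8') as fp:
    fp.write(data + u"\n")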
This solution should work on both Python 2 and 3 (the coding declaration is not needed in Python 3):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
a = csv.writer(fp, delimiter=";")
a.writerows(data)
Credits to:
Working with utf-8 encoding in Python source

Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

I currently have an issue with my Python 3 code.
replace_line('Products.txt', line, tenminus_str)
is the line I'm trying to turn into UTF-8. However, when I try to do this like I would with others, I get errors such as 'no attribute' ones, and when I try to add, for example...
.decode("utf8")
...to the end of it, I still get errors that it is using ASCII. I also tried other methods that worked with other lines, such as adding io. in front, and adding a comma with
encoding = 'utf8'
The function that I am using for replace_line is:
def replace_line(file_name, line_num, text):
    lines = open(file_name, 'r').readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
How would I fix this issue? Please note that I'm very new to Python and not advanced enough to do debugging well.
EDIT: Different fix to this question than 'duplicate'
EDIT 2: I have another error with the function now.
File "FILELOCATION", line 45, in refill replace_line('Products.txt', str(line), tenminus_str)
File "FILELOCATION", line 6, in replace_line lines[line_num] = text
TypeError: list indices must be integers, not str
What does this mean and how do I fix it?
Change your function to:
def replace_line(file_name, line_num, text):
    with open(file_name, 'r', encoding='utf8') as f:
        lines = f.readlines()
    lines[line_num] = text
    with open(file_name, 'w', encoding='utf8') as out:
        out.writelines(lines)
encoding='utf8' will decode your UTF-8 file correctly.
with automatically closes the file when its block is exited.
Since your file started with \xef it likely has a UTF-8-encoding byte order mark (BOM) character at the beginning. The above code will maintain that on output, but if you don't want it use utf-8-sig for the input encoding. Then it will be automatically removed.
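A minimal sketch of that utf-8-sig variant (file name reused from the question):
# utf-8-sig strips a leading BOM on input, if one is present;
# writing back with plain utf8 produces a BOM-free file.
with open('Products.txt', 'r', encoding='utf-8-sig') as f:
    lines = f.readlines()
with open('Products.txt', 'w', encoding='utf8') as out:
    out.writelines(lines)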
The codecs module is just what you need. Detail here:
import codecs
def replace_line(file_name, line_num, text):
    f = codecs.open(file_name, 'r', encoding='utf-8')
    lines = f.readlines()
    lines[line_num] = text
    f.close()
    w = codecs.open(file_name, 'w', encoding='utf-8')
    w.writelines(lines)
    w.close()
Handling coding problems: you can try adding the following settings at the top of your script (note that this reload(sys) trick works on Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Type = sys.getfilesystemencoding()
Try adding encoding='utf8' if you are reading a file
with open("../file_path", encoding='utf8'):
# your code

Removing BOM from gzip'ed CSV in Python

I'm using the following code to unzip and save a CSV file:
with gzip.open(filename_gz) as f:
    file = open(filename, "w")
    output = csv.writer(file, delimiter=',')
    output.writerows(csv.reader(f, dialect='excel', delimiter=';'))
Everything seems to work, except for the fact that the first characters in the file are unexpected. Googling around seems to indicate that it is due to BOM in the file.
I've read that encoding the content in utf-8-sig should fix the issue. However, adding:
.read().encode('utf-8-sig')
to f in csv.reader fails with:
File "ckan_gz_datastore.py", line 16, in <module>
output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
How can I remove the BOM and just save the content in correct utf-8?
First, you need to decode the file contents, not encode them.
Second, the csv module doesn't like unicode strings in Python 2.7, so having decoded your data you need to convert back to utf-8.
Finally, csv.reader must be passed an iterable of lines, not one big string with linebreaks in it.
So:
csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
However, you might consider it simpler / more efficient just to remove the BOM manually:
def remove_bom(line):
    return line[3:] if line.startswith(codecs.BOM_UTF8) else line

csv.reader((remove_bom(line) for line in f), dialect='excel', delimiter=';')
That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:
def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        yield remove_bom(firstline)
    for line in f:
        yield line
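A usage sketch, wiring this generator into the original pipeline (same f and output as in the question):
# Strip the BOM from the first line only, then convert as before.
reader = csv.reader(remove_bom_from_first(f), dialect='excel', delimiter=';')
output.writerows(reader)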

This is my current way of writing to a file. However, I can't do UTF-8?

f = open("go.txt", "w")
f.write(title)
f.close()
What if "title" is in japanese/utf-8? How do I modify this code to be able to write "title" without having the ascii error?
Edit: Then, how do I read this file in UTF-8?
How to use UTF-8:
import codecs
# ...
# title is a unicode string
# ...
f = codecs.open("go.txt", "w", "utf-8")
f.write(title)
# ...
fileObj = codecs.open("go.txt", "r", "utf-8")
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
It depends on whether you want to insert a UTF-8 byte order mark (BOM), for which the only way I know of is to open a normal file in binary mode and write:
import codecs
f = open('go.txt', 'wb')
f.write(codecs.BOM_UTF8)
f.write(title.encode('utf-8'))
f.close()
Generally though, I don't want to add a UTF-8 BOM, and the following will suffice:
import codecs
f = codecs.open('go.txt', 'w', 'utf-8')
f.write(title)
f.close()
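For reference, on Python 3 the built-in open() handles both cases directly; a minimal sketch (reusing the title variable from above; utf-8-sig writes the BOM for you, plain utf-8 omits it):
# Python 3 sketch: pick the codec when opening; no codecs module needed.
with open('go.txt', 'w', encoding='utf-8') as f:       # no BOM
    f.write(title)
with open('go.txt', 'w', encoding='utf-8-sig') as f:   # BOM written automatically
    f.write(title)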
