python-write to file (ignore non-ascii chars)

I am on Linux and I want to write a string (in UTF-8) to a text file. I tried many ways, but I always get an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)
Is there a way to write only the ASCII characters to the file and ignore the non-ASCII ones?
My code:
# -*- coding: utf-8 -*-
import os
import sys
def __init__(self, dirname, speaker, file, exportFile):
    text_file = open(exportFile, "a")
    text_file.write(speaker.encode("utf-8"))
    text_file.write(file.encode("utf-8"))
    text_file.close()
Thank you.

You can use the codecs module:
import codecs
text_file = codecs.open(exportFile,mode='a',encoding='utf-8')
text_file.write(...)

Try using the codecs module. A file opened with codecs.open() encodes the text for you, so pass it unicode strings and do not call .encode() yourself:
# -*- coding: utf-8 -*-
import codecs
def __init__(self, dirname, speaker, file, exportFile):
    with codecs.open(exportFile, "a", 'utf-8') as text_file:
        # speaker and file must be unicode strings here, not byte strings
        text_file.write(speaker)
        text_file.write(file)
Also, beware that your file variable shadows the built-in file type (Python 2).
Finally, I suggest reading http://www.joelonsoftware.com/articles/Unicode.html to better understand what Unicode is, and one of these pages (depending on your Python version) to learn how to use it in Python:
http://docs.python.org/2/howto/unicode
http://docs.python.org/3/howto/unicode.html

You could decode your input string before writing it:
text = speaker.decode("utf8")
with open(exportFile, "a") as text_file:
    text_file.write(text.encode("utf-8"))
    text_file.write(file.encode("utf-8"))
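
None of the snippets so far actually drops non-ASCII characters, which is what the question literally asks for. A minimal Python 3 sketch of that, assuming the text is already a str; to_ascii and append_ascii_only are hypothetical helper names, not part of any library:

```python
def to_ascii(text):
    """Return text with every non-ASCII character removed."""
    # errors="ignore" silently discards characters ASCII cannot represent
    return text.encode("ascii", errors="ignore").decode("ascii")


def append_ascii_only(path, *parts):
    """Append each part to the file at path, stripped down to ASCII."""
    with open(path, "a", encoding="ascii") as text_file:
        for part in parts:
            text_file.write(to_ascii(part))
```

Note that this loses information ("Montréal" becomes "Montral"), so writing the file as UTF-8 is usually the better choice when the reader can handle it.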

Related

python script not encoding to utf-8

I have this Python 3 script to read a json file and save as csv. It works fine except for the special characters like \u00e9. So Montr\u00e9al should be encoded like Montréal, but it is giving me MontrÃ©al instead.
import csv
import json
ifilename = 'business.json'
ofilename = 'business.csv'
json_lines = [json.loads( l.strip() ) for l in open(ifilename).readlines() ]
OUT_FILE = open(ofilename, "w", newline='', encoding='utf-8')
root = csv.writer(OUT_FILE)
root.writerow(["business_id","name","neighborhood","address","city","state"])
json_no = 0
for l in json_lines:
    root.writerow([l["business_id"],l["name"],l["neighborhood"],l["address"],l["city"],l["state"]])
    json_no += 1
print('Finished {0} lines'.format(json_no))
OUT_FILE.close()

It turns out the csv file was displaying correctly when opening it with Notepad++ but not with Excel. So I had to import the csv file with Excel and specify 65001: Unicode (UTF-8).
Thanks for the help.
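
If the CSV is mainly meant for Excel, one option is Python's "utf-8-sig" codec, which writes a UTF-8 byte order mark at the start of the file. That Excel then recognises the file as UTF-8 without the import dialog is an assumption that may depend on the Excel version. A minimal sketch; write_csv_for_excel is a hypothetical helper name:

```python
import csv


def write_csv_for_excel(path, header, rows):
    """Write a CSV file that starts with a UTF-8 byte order mark."""
    # "utf-8-sig" prepends the bytes EF BB BF, then encodes as plain UTF-8
    with open(path, "w", newline="", encoding="utf-8-sig") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(header)
        writer.writerows(rows)
```

Other tools that read the file should open it with encoding="utf-8-sig" as well, so the mark is not treated as part of the first field.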

Try using this at the top of the file
# -*- coding: utf-8 -*-
Consider this example:
# -*- coding: utf-8 -*-
import sys
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
string_demo="Montréal"
print(string_demo)
reload(sys) # just in python2.x
sys.setdefaultencoding('UTF8') # just in python2.x
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
In my case, the output looks like this when I run it in Python 2.x:
my default encoding is : ascii
Montréal
my default encoding is : UTF8
('Montr\xc3\xa9al', <type 'str'>)
but when I comment out the reload and setdefaultencoding lines, my output looks like this:
my default encoding is : ascii
Montréal
my default encoding is : ascii
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
This is most likely a problem with the editor (or viewer) rather than with Python: when Python hits an encoding error, it raises an exception.

Writing CSV file with umlauts causing "UnicodeEncodeError: 'ascii' codec can't encode character"

I am trying to write characters with double dots (umlauts) such as ä, ö and Ö. I am able to write it to the file with data.encode("utf-8") but the result b'\xc3\xa4\xc3\xa4\xc3\x96' is not nice (UTF-8 as literal characters). I want to get "ääÖ" as written stored to a file.
How can I write data with umlaut characters to a CSV file in Python 3?
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    data=resultFile
    a.writerows(data)
Traceback:
  File "<ipython-input-280-73b1f615929e>", line 5, in <module>
    a.writerows(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 15: ordinal not in range(128)

Pass an encoding parameter to open() and set it to 'utf8'.
import csv
data = "ääÖ"
with open("test.csv", 'w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Edit: Removed the use of io library as open is same as io.open in Python 3.
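
As a small illustration of the same idea in Python 3: writerows() expects an iterable of rows, so passing a bare string writes one character per row. The sketch below passes a list of rows instead and reads the file back with the same encoding; write_rows, read_rows and the sample data are made up for this example:

```python
import csv

# Sample data invented for this sketch: each inner list is one CSV row
rows = [["ä", "ö", "Ö"], ["Montréal", "Åse", "Øyvind"]]


def write_rows(path, rows):
    """Write rows to a semicolon-delimited CSV file encoded as UTF-8."""
    with open(path, "w", newline="", encoding="utf-8") as fp:
        csv.writer(fp, delimiter=";").writerows(rows)


def read_rows(path):
    """Read the rows back, decoding the file as UTF-8."""
    with open(path, newline="", encoding="utf-8") as fp:
        return list(csv.reader(fp, delimiter=";"))
```

Reading with the same encoding that was used for writing is what keeps the umlauts intact on the round trip.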

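This version should work on both Python 2 and 3. The coding declaration is only needed on Python 2, because Python 3 already assumes UTF-8 source files. Note that the declaration only affects how the source file is parsed, not the encoding that open() uses for the output file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)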
Credits to:
Working with utf-8 encoding in Python source

How do i read and print a whole .txt file using python?

I am totally new to Python, and I am supposed to write a program that can read a whole .txt file and print it. The file is a long article in my first language (Norwegian). I have three versions that should do the same thing, but all of them fail with errors. I have tried both PyCharm and Eclipse with PyDev installed, and I get the same errors in both.
from sys import argv
import pip._vendor.distlib.compat
script, dev = argv
txt = open(dev)
print("Here's your file %r:" % dev)
print(txt.read())
print("Type the filename again:")
file_again = pip._vendor.distlib.compat.raw_input("> ")
txt_again = open(file_again)
print(txt_again.read())
But this gets the errors:
Traceback (most recent call last):
File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/1A.py", line 5, in <module>
script, dev = argv
ValueError: not enough values to unpack (expected 2, got 1)
Again, I am new to Python; I searched around but didn't find a solution.
My next attempt was this:
# -*- coding: utf-8 -*-
import sys, traceback
fr = open('dev.txt', 'r')
text = fr.read()
print(text)
But this gets these errors:
Traceback (most recent call last):
File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/v2.py", line 6, in <module>
text = fr.read()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
I do not understand why it doesn't work.
My third attempt looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("dev.txt", help="dev.txt")
args = parser.parse_args()
if args.filename:
    with open('dev.txt') as f:
        for line in f:
            name, _ = line.strip().split('\t')
            print(name)
And this gets the errors:
usage: v3.py [-h] dev.txt
v3.py: error: the following arguments are required: dev.txt
Any help as to why these don't work is welcome.
Thank you in advance :D

Since the second approach is the simplest, I'll stick to it.
You stated that dev.txt is in Norwegian, which means it contains non-ASCII characters such as Æ, Ø and Å. The Python interpreter is trying to tell you this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
It cannot interpret the byte 0xC3 (195 decimal) as an ASCII character, because ASCII only defines the 128 values 0 to 127.
I'll assume the file is encoded as UTF-8; if it is not, change the encoding parameter in line 2.
# -*- coding: utf-8 -*-
fr = open('dev.txt', 'r', encoding='utf-8')
text = fr.read()
print(text)
If you do not know your encoding, you can find it out via your editor or use python to guess it.
Your terminal could also cause the error when it's not configured to print Unicode Characters or map them correctly. You might want to take a look at this question and its answers.
After working with a file, you should close it. You can either do that manually via fr.close() or let Python do it automatically:
with open('dev.txt', 'r', encoding='utf-8') as fr:
    text = fr.read()
# fr is closed automatically when the with block is left
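
To see why the original call fell back to ASCII, it can help to print the encoding that open() uses when none is given. A minimal Python 3 sketch; read_text is a hypothetical helper, and UTF-8 as the file's encoding is the same assumption made in this answer:

```python
import locale


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding it with an explicit encoding."""
    with open(path, "r", encoding=encoding) as fr:
        return fr.read()


# The encoding open() falls back to when no encoding argument is passed.
# If this prints an ASCII variant, reading non-ASCII files without an
# explicit encoding fails in the way shown in the traceback.
print(locale.getpreferredencoding(False))
```

Passing the encoding explicitly makes the script behave the same regardless of how the IDE or terminal sets up the locale.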

file = open("File.txt", "r")
a = str(file.read())
print(a)
Is this what you were looking for?

For example:
with open("fileA.txt", "r") as fileA:
    for line in fileA:
        print(line)

This is a possible solution:
f = open("textfile.txt", "r")
lines = f.readlines()
for line in lines:
    print(line)
f.close()
Save it as for example myscript.py and execute it:
python /path/to/myscript.py
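
The first and third attempts in the question fail for a different reason: the script was started without a file name on the command line, and in the third attempt the positional argument was named "dev.txt", so args.filename would not exist either. A minimal sketch of one way to handle this, assuming Python 3; the argument names, the default of dev.txt and the --encoding option are choices made for this example:

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; the file name is optional."""
    parser = argparse.ArgumentParser(description="Print a text file.")
    # The positional's name becomes the attribute name: args.filename.
    # nargs="?" makes it optional, falling back to the default.
    parser.add_argument("filename", nargs="?", default="dev.txt",
                        help="file to print (default: dev.txt)")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the file (default: utf-8)")
    return parser.parse_args(argv)


def main(argv=None):
    """Read the requested file and print its contents."""
    args = parse_args(argv)
    with open(args.filename, encoding=args.encoding) as f:
        print(f.read())
```

Calling main() at the bottom of the script and running it as python /path/to/myscript.py dev.txt would print the file; without an argument it falls back to dev.txt in the current directory.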

Program (twitter bot) works on Windows machine, but not on Linux machine [duplicate]

I was trying to read a file in Python 2.7, and it was read perfectly. The problem appears when I run the same program with Python 3.4; then I get this error:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
Also, when I run the program on Windows (with Python 3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()
    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]
    f.close()
    return D
Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

In Python2,
f = open(filename,'r')
for line in f:
reads lines from the file as bytes.
In Python 3, the same code reads lines from the file as strings. Python 3 strings are what Python 2 calls unicode objects: bytes decoded according to some encoding. When no encoding is given, open() uses the platform's preferred encoding (locale.getpreferredencoding()), which is typically UTF-8 on Linux and a code page such as cp1252 on Windows. That is why the same program can behave differently on the two machines.
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
shows that Python 3 is trying to decode the bytes as UTF-8. Since there is an error, the file apparently does not contain UTF-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
    for line in f:
        ...  # process the line
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings
def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)
filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
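
The point about several encodings appearing to work can be seen on a single byte. The sketch below uses a made-up sample: the Catalan word "això" encoded as Latin-1, which ends in the byte 0xf2 from the error message. try_decode is a hypothetical helper:

```python
# Sample bytes invented for this sketch: "això" encoded as Latin-1
raw = "això".encode("latin-1")  # the last byte is 0xf2


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes do not fit."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


for enc in ("utf-8", "latin-1", "cp1252", "cp1250"):
    # UTF-8 rejects the bytes; the single-byte code pages all accept
    # them, but they do not all map 0xf2 to the same letter
    print(enc, try_decode(raw, enc))
```

A decode that raises no error is therefore not proof that the encoding is the right one; the decoded text still has to be checked against what the file is known to contain.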

OK, I did what @unutbu suggested. The result was a long list of candidate encodings, one of which is cp1250, so I changed:
f = open(filename,'r')
to
f = open(filename,'r', encoding='cp1250')
as @triplee suggested. Now I can read my files.

In my case I can't change the encoding because my file really is UTF-8 encoded, but some rows are corrupted and cause the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My solution was to open the file in binary mode:
open(filename, 'rb')
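
Binary mode returns bytes, so the text still has to be decoded at some point. Two possible ways to deal with a mostly valid UTF-8 file are sketched below for Python 3; read_lossy and decode_good_lines are hypothetical helper names:

```python
def read_lossy(path, encoding="utf-8"):
    """Read the whole file, replacing undecodable bytes with U+FFFD."""
    # errors="replace" keeps reading instead of raising UnicodeDecodeError
    with open(path, encoding=encoding, errors="replace") as f:
        return f.read()


def decode_good_lines(path, encoding="utf-8"):
    """Decode line by line; return (decoded lines, numbers of bad lines)."""
    good, bad = [], []
    with open(path, "rb") as f:
        for number, raw_line in enumerate(f, start=1):
            try:
                good.append(raw_line.decode(encoding))
            except UnicodeDecodeError:
                bad.append(number)  # remember which rows were corrupted
    return good, bad
```

The first variant keeps every row but marks the damage; the second skips corrupted rows and reports their line numbers so they can be inspected separately.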

Python: Special characters encoding

This is the code I am using to replace special characters in text files and concatenate them into a single file.
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = dirpath+"\\"+fname
        with codecs.open(currentfile, encoding='utf8') as infile:
            #print currentfile
            outfile.write(fname)
            outfile.write('\n')
            outfile.write('\n')
            for line in infile:
                line = line.replace(u"´ı", "i")
                line = line.replace(u"ï¬", "fi")
                line = line.replace(u"ï¬‚", "fl")
                outfile.write(line)
The first line.replace works fine while the others do not (which makes sense), and since no errors were generated, I thought there might be a problem of "visibility" (if that's the term). So I wrote this:
import codecs
currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ï¬" not in line: print "not found!"
which always returns "not found!" proving that those characters aren't read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, I get this error:
Traceback (most recent call last):
  File C:\\path\\to\\concat.py, line 30, in <module>
    outfile.write(line)
  File C:\\Python27\\codecs.py, line 691, in write
    return self.writer.write(data)
  File C:\\Python27\\codecs.py, line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
Since I am not really experienced in Python, I can't figure it out from the sources already available: the Python documentation (1,2) and relevant questions on Stack Overflow (1,2).
I am stuck here. Any suggestions? All answers are welcome!

There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ï¬", u"fi")
                line = line.replace(u"ï¬‚", u"fl")
                outfile.write(line)
The coding declaration tells the interpreter that your source file is saved as UTF-8, so that literals such as u"´ı" are decoded to the correct Unicode values. Passing encoding to codecs.open() ensures that the lines you read are decoded to Unicode values and that the Unicode values you write end up in the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.
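
The option of using Unicode escape codes can be sketched as follows in Python 3. It rests on an assumption the question does not confirm: that the strings "ï¬" and "ï¬‚" are the UTF-8 bytes of the ligatures U+FB01 (fi) and U+FB02 (fl) as displayed by an editor using a Windows code page. LIGATURES and expand_ligatures are hypothetical names:

```python
# Assumed mapping: the "fi" and "fl" ligature code points, written as
# escapes so the source file's own encoding cannot corrupt them
LIGATURES = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
}


def expand_ligatures(line):
    """Replace ligature characters with their plain ASCII letters."""
    for ligature, replacement in LIGATURES.items():
        line = line.replace(ligature, replacement)
    return line
```

Because the escapes name code points rather than bytes, the replacements match whenever the input was decoded with the correct encoding, regardless of how the script itself was saved.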
