python script not encoding to utf-8

python script not encoding to utf-8 - python

I have this Python 3 script to read a json file and save as csv. It works fine except for the special characters like \u00e9. So Montr\u00e9al should be encoded like Montréal, but it is giving me MontrÃ©al instead.
import json
ifilename = 'business.json'
ofilename = 'business.csv'
json_lines = [json.loads( l.strip() ) for l in open(ifilename).readlines() ]
OUT_FILE = open(ofilename, "w", newline='', encoding='utf-8')
root = csv.writer(OUT_FILE)
root.writerow(["business_id","name","neighborhood","address","city","state"])
json_no = 0
for l in json_lines:
root.writerow([l["business_id"],l["name"],l["neighborhood"],l["address"],l["city"],l["state"]])
json_no += 1
print('Finished {0} lines'.format(json_no))
OUT_FILE.close()

It turns out the csv file was displaying correctly when opening it with Notepad++ but not with Excel. So I had to import the csv file with Excel and specify 65001: Unicode (UTF-8).
Thanks for the help.

Try using this at the top of the file
# -*- coding: utf-8 -*-
Consider this example:
# -*- coding: utf-8 -*-
import sys
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
string_demo="Montréal"
print(string_demo)
reload(sys) # just in python2.x
sys.setdefaultencoding('UTF8') # just in python2.x
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
In my case, the output is like this if i run in python2.x:
my default encoding is : ascii
Montréal
my default encoding is : UTF8
('Montr\xc3\xa9al', <type 'str'>)
but when i comment out the reload and setdefaultencoding lines, my output is like this:
my default encoding is : ascii
Montréal
my default encoding is : ascii
Traceback (most recent call last):
File "test.py", line 12, in <module>
print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
It's most a problem with the editor, Python when it's a encode error raise a Exception.

Related

import utf-8 csv in python3.x - special german characters

I have been trying to import a csv file containing special characters (ä ö ü)
in python 2.x all special characters where automatically encoded without need of specifying econding attribute in the open command.
I can´t figure out how to get this to work in python 3.x
import csv
f = open('sample_1.csv', 'rU', encoding='utf-8')
csv_f = csv.reader(f, delimiter=';')
bla = list(csv_f)
print(type(bla))
print(bla[0])
print(bla[1])
print(bla[2])
print()
print(bla[3])
Console output (Sublime Build python3)
<class 'list'>
['\ufeffCat1', 'SEO Meta Text']
['Damen', 'Damen----']
['Damen', 'Damen-Accessoires-Beauty-Geschenk-Sets-']
Traceback (most recent call last):
File "/Users/xxx/importer_tree.py", line 13, in <module>
print(bla[3])
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 37: ordinal not in range(128)
input sample_1.csv (excel file saved as utf-8 csv)
Cat1;SEO Meta Text
Damen;Damen----
Damen;Damen-Accessoires-Beauty-Geschenk-Sets-
Damen;Damen-Accessoires-Beauty-Körperpflege-
Männer;Männer-Sport-Sportschuhe-Trekkingsandalen-
Männer;Männer-Sport-Sportschuhe-Wanderschuhe-
Männer;Männer-Sport-Sportschuhe--
is this only an output format issue or am I also importing the data
wrongly?
how can I print out "Männer"?
thank you for your help/guidance!

thank you to juanpa-arrivillaga and to this answer: https://stackoverflow.com/a/44088439/9059135
Issue is due to my Sublime settings:
sys.stdout.encoding returns US-ASCII
in the terminal same command returns UTF-8
Setting up the Build system in Sublime properly will solve the issue

Writing CSV file with umlauts causing "UnicodeEncodeError: 'ascii' codec can't encode character"

I am trying to write characters with double dots (umlauts) such as ä, ö and Ö. I am able to write it to the file with data.encode("utf-8") but the result b'\xc3\xa4\xc3\xa4\xc3\x96' is not nice (UTF-8 as literal characters). I want to get "ääÖ" as written stored to a file.
How can I write data with umlaut characters to a CSV file in Python 3?
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
a = csv.writer(fp, delimiter=";")
data=resultFile
a.writerows(data)
Traceback:
File "<ipython-input-280-73b1f615929e>", line 5, in <module>
a.writerows(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 15: ordinal not in range(128)

Add a parameter encoding to the open() and set it to 'utf8'.
import csv
data = "ääÖ"
with open("test.csv", 'w', encoding='utf8') as fp:
a = csv.writer(fp, delimiter=";")
a.writerows(data)
Edit: Removed the use of io library as open is same as io.open in Python 3.

This solution should work on both python2 and 3 (not needed in python3):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
a = csv.writer(fp, delimiter=";")
a.writerows(data)
Credits to:
Working with utf-8 encoding in Python source

How do i read and print a whole .txt file using python?

I am totally new to python, and I am supposed to write a program that can read a whole .txt file and print it. The file is an article in my first language(Norwegian), and long. I have three versions that should do the same thing, but all get error. I have tried in bot PyCharm and eclipse with PyDev installed, and i get the same errors on both...
from sys import argv
import pip._vendor.distlib.compat
script, dev = argv
txt = open(dev)
print("Here's your file %r:" % dev)
print(txt.read())
print("Type the filename again:")h
file_again = pip._vendor.distlib.compat.raw_input("> ")
txt_again = open(file_again)
print(txt_again.read())
But this gets the errors:
Traceback (most recent call last):
File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/1A.py", line 5, in <module>
script, dev = argv
ValueError: not enough values to unpack (expected 2, got 1)
Again, i am new to python, and i searched around, but didn't find a solution.
My next attempt was this:
# -*- coding: utf-8 -*-
import sys, traceback
fr = open('dev.txt', 'r')
text = fr.read()
print(text)
But this gets these errors:
Traceback (most recent call last):
File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/v2.py", line 6, in <module>
text = fr.read()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
I do not understand why i doesn't work.
My third attempt looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("dev.txt", help="dev.txt")
args = parser.parse_args()
if args.filename:
with open('dev.txt') as f:
for line in f:
name, _ = line.strip().split('\t')
print(name)
And this gets the errors:
usage: v3.py [-h] dev.txt
v3.py: error: the following arguments are required: dev.txt
Any help to why these doesnt work is welcome.
Thank you in advance :D

For the 2nd approach is the simplest, I'll stick to it.
You stated the contents of dev.txt to be Norwegian, that means, it will include non-ascii characters like Æ,Ø,Å etc. The python interpreter is trying to tell you this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128) It cannot interpret the byte 0xC3 = 195 (decimal) as an ascii character, which is limited to a range of 128 different characters.
I'll assume you're using UTF-8 as encoding but if not, change the parameter in line 2.
# -*- coding: utf-8 -*-
fr = open('dev.txt', 'r', encoding='utf-8')
text = fr.read()
print(text)
If you do not know your encoding, you can find it out via your editor or use python to guess it.
Your terminal could also cause the error when it's not configured to print Unicode Characters or map them correctly. You might want to take a look at this question and its answers.
After operating a file, it is recommended to close it. You can either do that manually via fr.close() or make python do it automatically:
with open('dev.txt', 'r', encoding='utf-8') as fr:
# automatically closes fr when leaving this code-block

file = open("File.txt", "r")
a = str(file.read())
print(a)
Is this what you were looking for?

For example:
open ("fileA.txt", "r") as fileA:
for line in fileA:
print(line);

This is a possible solution:
f = open("textfile.txt", "r")
lines = f.readlines()
for line in lines:
print(line)
f.close()
Save it as for example myscript.py and execute it:
python /path/to/myscript.py

Korean txt file encoding with utf-8

I'm trying to process a Korean text file with python, but it fails when I try to encode the file with utf-8.
#!/usr/bin/python
#-*- coding: utf-8 -*-
f = open('tag.txt', 'r', encoding='utf=8')
s = f.readlines()
z = open('tagresult.txt', 'w')
y = z.write(s)
z.close
=============================================================
Traceback (most recent call last):
File "C:\Users\******\Desktop\tagging.py", line 5, in <module>
f = open('tag.txt', 'r', encoding='utf=8')
TypeError: 'encoding' is an invalid keyword argument for this function
[Finished in 0.1s]
==================================================================
And when I just opens a Korean txt file encoded with utf-8, the fonts are broken like this. What can I do?
\xc1\xc1\xbe\xc6\xc1\xf6\xb4\xc2\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xb0\xc5\xb5\xe7\xbf\xe4\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xbd\xc3\xb4\xc2\n',
'\xc1\xcb\xbc\xdb\xc7\xd1\xb5\xa5\xbf\xe4\n',
'\xc1\xd6\xb1\xb8\xbf\xe4\

I don't know Korean, and don't have sample string to try, but here are some advices for you:
1
f = open('tag.txt', 'r', encoding='utf=8')
You have a typo here, utf-8 not utf=8, this explains for the exception you got.
The default mode of open() is 'r' so you don't have to define it again.
2 Don't just use open, you should use context manager statement to manage the opening/closing file descriptor, like this:
with open('tagresult.txt', 'w') as f:
f.write(s)

In Python 2 the open function does not take an encoding parameter. Instead you read a line and convert it to unicode. This article on kitchen (as in kitchen sink) modules provides details and some lightweight utilities to work with unicode in python 2.x.

python-write to file (ignore non-ascii chars)

I am on Linux and a want to write string (in utf-8) to txt file. I tried many ways, but I always got an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position in position 36: ordinal not in range(128)
Is there any way, how to write to file only ascii characters? And ignore non-ascii characters.
My code:
# -*- coding: UTF-8-*-
import os
import sys
def __init__(self, dirname, speaker, file, exportFile):
text_file = open(exportFile, "a")
text_file.write(speaker.encode("utf-8"))
text_file.write(file.encode("utf-8"))
text_file.close()
Thank you.

You can use the codecs module:
import codecs
text_file = codecs.open(exportFile,mode='a',encoding='utf-8')
text_file.write(...)

Try using the codecs module.
# -*- coding: UTF-8-*-
import codecs
def __init__(self, dirname, speaker, file, exportFile):
with codecs.open(exportFile, "a", 'utf-8') as text_file:
text_file.write(speaker.encode("utf-8"))
text_file.write(file.encode("utf-8"))
Also, beware that your file variable has a name which collides with the builtin file function.
Finally, I would suggest you have a look at http://www.joelonsoftware.com/articles/Unicode.html to better understand what is unicode, and one of these pages (depending on your python version) to understand how to use it in Python:
http://docs.python.org/2/howto/unicode
http://docs.python.org/3/howto/unicode.html

You could decode your input string before writing it;
text = speaker.decode("utf8")
with open(exportFile, "a") as text_file:
text_file.write(text.encode("utf-8"))
text_file.write(file.encode("utf-8"))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

python script not encoding to utf-8 - python

It turns out the csv file was displaying correctly when opening it with Notepad++ but not with Excel. So I had to import the csv file with Excel and specify 65001: Unicode (UTF-8). Thanks for the help.

Related

import utf-8 csv in python3.x - special german characters

Writing CSV file with umlauts causing "UnicodeEncodeError: 'ascii' codec can't encode character"

How do i read and print a whole .txt file using python?

Korean txt file encoding with utf-8

python-write to file (ignore non-ascii chars)

Categories

Resources