import utf-8 csv in python3.x - special german characters - python

I have been trying to import a CSV file containing special characters (ä ö ü).
In Python 2.x, all special characters were handled automatically, without needing to specify an encoding argument in the open() call.
I can't figure out how to get this to work in Python 3.x.
import csv
f = open('sample_1.csv', 'rU', encoding='utf-8')
csv_f = csv.reader(f, delimiter=';')
bla = list(csv_f)
print(type(bla))
print(bla[0])
print(bla[1])
print(bla[2])
print()
print(bla[3])
Console output (Sublime Build python3)
<class 'list'>
['\ufeffCat1', 'SEO Meta Text']
['Damen', 'Damen----']
['Damen', 'Damen-Accessoires-Beauty-Geschenk-Sets-']
Traceback (most recent call last):
  File "/Users/xxx/importer_tree.py", line 13, in <module>
    print(bla[3])
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 37: ordinal not in range(128)
input sample_1.csv (excel file saved as utf-8 csv)
Cat1;SEO Meta Text
Damen;Damen----
Damen;Damen-Accessoires-Beauty-Geschenk-Sets-
Damen;Damen-Accessoires-Beauty-Körperpflege-
Männer;Männer-Sport-Sportschuhe-Trekkingsandalen-
Männer;Männer-Sport-Sportschuhe-Wanderschuhe-
Männer;Männer-Sport-Sportschuhe--
Is this only an output format issue, or am I also importing the data wrongly?
How can I print out "Männer"?
Thank you for your help/guidance!

Thank you to juanpa-arrivillaga and to this answer: https://stackoverflow.com/a/44088439/9059135
The issue is caused by my Sublime settings, not by the import:
in the Sublime build, sys.stdout.encoding returns US-ASCII;
in the terminal, the same command returns UTF-8.
Setting up the build system in Sublime properly solves the issue.
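A minimal sketch of how this could be checked and worked around from inside the script, assuming Python 3.7+ (needed for `reconfigure`). The helper names `read_rows`, `describe_stdout` and `make_stdout_utf8` are hypothetical. Reading with `utf-8-sig` is an extra assumption: it strips the byte order mark that shows up as `\ufeff` in the first row of the output above.

```python
import csv
import sys


def read_rows(path, delimiter=";"):
    """Read a UTF-8 CSV file; 'utf-8-sig' also strips a leading BOM if present."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.reader(f, delimiter=delimiter))


def describe_stdout():
    """Return the encoding Python uses when printing (e.g. 'US-ASCII' or 'utf-8')."""
    return sys.stdout.encoding


def make_stdout_utf8():
    """Switch stdout to UTF-8 when the stream supports it (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
```

Calling `make_stdout_utf8()` before the first `print`, or setting the environment variable `PYTHONIOENCODING=utf-8` in the build system, are two possible ways to avoid the UnicodeEncodeError without changing how the file is read.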

Related

Continuing for loop after exception in Python

First of all, I saw similar questions, but nothing worked or was applicable to my problem.
I'm writing a program that takes in a text file with a lot of search queries to be searched on YouTube. The program iterates through the text file line by line, but some lines contain special UTF-8 characters that cannot be decoded, so at a certain point the program stops with:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
As I cannot check every line of my entries, I want it to catch the error, print the line it was working on, and continue from that point.
As the error does not happen inside my for loop, but in the for statement itself, I don't know how to write a try...except statement for it.
This is the code:
import urllib.request
import re
from unidecode import unidecode
with open('out.txt', 'r') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encounted with Line: " + line)
This is the full error message, which shows that the for statement itself is causing the problem:
Traceback (most recent call last):
  File "ytbysearchtolinks.py", line 6, in <module>
    for line in infh:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
If you need an example of input I'm working with: https://pastebin.com/LEkwdU06
The try-except block looks correct and should catch any exception raised inside it.
Using unidecode probably won't help you, because non-ASCII characters must be percent-encoded in URLs.
One solution is to use quote() from urllib.parse. As per the documentation:
Replace special characters in string using the %xx escape.
This is what works for me with the input you've provided:
import urllib.request
from urllib.parse import quote
import re
with open('out.txt', 'r', encoding='utf-8') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        search_keyword = quote(line)
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])
EDIT:
After thinking about it, I believe you are running into the following problem:
You are running the code on Windows, where Python apparently falls back to the cp1252 encoding when open() is called without an encoding argument, while the file that you shared is in UTF-8 encoding:
$ file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators
This would explain the exception you are getting and why it's not being caught by your try-except block: the lines are decoded as the for statement reads them from the file, which happens outside the try.
Make sure that you are using encoding='utf-8' when opening the file.
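If some lines might still be undecodable even with the right encoding, one way to keep the loop going is to read the file in binary mode and decode each line inside a try block. This is only a sketch; the generator name `iter_decoded_lines` is hypothetical.

```python
def iter_decoded_lines(path, encoding="utf-8"):
    """Yield (line_number, text) per line; report and skip lines that fail to decode."""
    with open(path, "rb") as fh:  # binary mode: iterating does not decode anything
        for number, raw in enumerate(fh, start=1):
            try:
                # Decoding happens here, inside the try, so the error can be caught.
                yield number, raw.decode(encoding).rstrip("\r\n")
            except UnicodeDecodeError as exc:
                print("Skipping line {}: {!r} ({})".format(number, raw, exc))
```

A loop such as `for number, line in iter_decoded_lines('out.txt'):` then continues past bad lines instead of stopping at the first one.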
I ran your code and didn't have any problems. Have you created a virtual environment with virtualenv and installed all the packages you use?

python script not encoding to utf-8

I have this Python 3 script to read a JSON file and save it as CSV. It works fine except for special characters like \u00e9. So Montr\u00e9al should be written as Montréal, but it is giving me MontrÃ©al instead.
import csv
import json
ifilename = 'business.json'
ofilename = 'business.csv'
json_lines = [json.loads( l.strip() ) for l in open(ifilename).readlines() ]
OUT_FILE = open(ofilename, "w", newline='', encoding='utf-8')
root = csv.writer(OUT_FILE)
root.writerow(["business_id","name","neighborhood","address","city","state"])
json_no = 0
for l in json_lines:
    root.writerow([l["business_id"],l["name"],l["neighborhood"],l["address"],l["city"],l["state"]])
    json_no += 1
print('Finished {0} lines'.format(json_no))
OUT_FILE.close()
It turns out the csv file was displaying correctly when opening it with Notepad++ but not with Excel. So I had to import the csv file with Excel and specify 65001: Unicode (UTF-8).
Thanks for the help.
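An alternative to the manual import step, sketched under the assumption that the Excel version in use detects UTF-8 when the file starts with a byte order mark: write the CSV with the `utf-8-sig` codec, which prepends that mark. The function name `write_csv_for_excel` is hypothetical.

```python
import csv


def write_csv_for_excel(path, header, rows):
    """Write rows as CSV with a UTF-8 byte order mark at the start of the file."""
    with open(path, "w", newline="", encoding="utf-8-sig") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
```

For example, `write_csv_for_excel('business.csv', ["name", "city"], [["Cafe", "Montréal"]])` produces a file whose first three bytes are EF BB BF, followed by ordinary UTF-8 text.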
Try using this at the top of the file
# -*- coding: utf-8 -*-
Consider this example:
# -*- coding: utf-8 -*-
import sys
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
string_demo="Montréal"
print(string_demo)
reload(sys) # just in python2.x
sys.setdefaultencoding('UTF8') # just in python2.x
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
In my case, the output is like this if i run in python2.x:
my default encoding is : ascii
Montréal
my default encoding is : UTF8
('Montr\xc3\xa9al', <type 'str'>)
but when i comment out the reload and setdefaultencoding lines, my output is like this:
my default encoding is : ascii
Montréal
my default encoding is : ascii
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
It's most likely a problem with the editor; when Python hits an actual encoding error, it raises an exception.

How do i read and print a whole .txt file using python?

I am totally new to Python, and I am supposed to write a program that can read a whole .txt file and print it. The file is a long article in my first language (Norwegian). I have three versions that should do the same thing, but all of them fail with errors. I have tried in both PyCharm and Eclipse with PyDev installed, and I get the same errors in both.
from sys import argv
import pip._vendor.distlib.compat
script, dev = argv
txt = open(dev)
print("Here's your file %r:" % dev)
print(txt.read())
print("Type the filename again:")
file_again = pip._vendor.distlib.compat.raw_input("> ")
txt_again = open(file_again)
print(txt_again.read())
But this gets the errors:
Traceback (most recent call last):
  File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/1A.py", line 5, in <module>
    script, dev = argv
ValueError: not enough values to unpack (expected 2, got 1)
Again, I am new to Python, and I searched around but didn't find a solution.
My next attempt was this:
# -*- coding: utf-8 -*-
import sys, traceback
fr = open('dev.txt', 'r')
text = fr.read()
print(text)
But this gets these errors:
Traceback (most recent call last):
  File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/v2.py", line 6, in <module>
    text = fr.read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
I do not understand why it doesn't work.
My third attempt looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("dev.txt", help="dev.txt")
args = parser.parse_args()
if args.filename:
    with open('dev.txt') as f:
        for line in f:
            name, _ = line.strip().split('\t')
            print(name)
And this gets the errors:
usage: v3.py [-h] dev.txt
v3.py: error: the following arguments are required: dev.txt
Any help as to why these don't work is welcome.
Thank you in advance :D
Since the 2nd approach is the simplest, I'll stick to it.
You stated that the contents of dev.txt are Norwegian, which means the file will include non-ASCII characters like Æ, Ø, Å etc. The Python interpreter is trying to tell you this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
It cannot interpret the byte 0xC3 = 195 (decimal) as an ASCII character, because ASCII is limited to a range of 128 different characters.
I'll assume you're using UTF-8 as the encoding; if not, change the encoding parameter in line 2.
# -*- coding: utf-8 -*-
fr = open('dev.txt', 'r', encoding='utf-8')
text = fr.read()
print(text)
If you do not know your encoding, you can find it out via your editor or use python to guess it.
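One simple way to let Python guess is to try a short list of candidate encodings and keep the first one that decodes the whole file without error. This is a rough sketch, not a reliable detector: the function name `guess_encoding` and the candidate list are assumptions, and some single-byte encodings (latin-1 in particular) accept any byte sequence, so they only make sense as a last resort.

```python
def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the file, or None."""
    with open(path, "rb") as fh:
        data = fh.read()
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next one
        return name
    return None
```

The returned name can then be passed on, as in `open('dev.txt', 'r', encoding=guess_encoding('dev.txt'))`.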
Your terminal could also cause the error when it's not configured to print Unicode Characters or map them correctly. You might want to take a look at this question and its answers.
After working with a file, it is recommended to close it. You can either do that manually via fr.close() or make Python do it automatically:
with open('dev.txt', 'r', encoding='utf-8') as fr:
    # automatically closes fr when leaving this code-block
    text = fr.read()
file = open("File.txt", "r")
a = str(file.read())
print(a)
Is this what you were looking for?
For example:
with open("fileA.txt", "r") as fileA:
    for line in fileA:
        print(line)
This is a possible solution:
f = open("textfile.txt", "r")
lines = f.readlines()
for line in lines:
    print(line)
f.close()
Save it as for example myscript.py and execute it:
python /path/to/myscript.py
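For the first and third attempts in the question, a minimal sketch of reading the file name from the command line might look like this. The ValueError in the first attempt arises because `script, dev = argv` needs exactly one argument after the script name, and the script was started without one. The third attempt registers its positional argument under the name "dev.txt", so `args.filename` is never set. The names `parse_args`, `read_text` and the `--encoding` option are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means 'use sys.argv[1:]'."""
    parser = argparse.ArgumentParser(description="Print a text file.")
    # The first string is the attribute name on the result: args.filename
    parser.add_argument("filename", help="path to the text file, e.g. dev.txt")
    parser.add_argument("--encoding", default="utf-8", help="encoding of the file")
    return parser.parse_args(argv)


def read_text(path, encoding="utf-8"):
    """Return the whole file as one string, closing the file afterwards."""
    with open(path, "r", encoding=encoding) as fh:
        return fh.read()
```

A script would call `args = parse_args()` followed by `print(read_text(args.filename, args.encoding))`, and be started as `python /path/to/myscript.py dev.txt`.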

Korean txt file encoding with utf-8

I'm trying to process a Korean text file with python, but it fails when I try to encode the file with utf-8.
#!/usr/bin/python
#-*- coding: utf-8 -*-
f = open('tag.txt', 'r', encoding='utf=8')
s = f.readlines()
z = open('tagresult.txt', 'w')
y = z.write(s)
z.close
=============================================================
Traceback (most recent call last):
File "C:\Users\******\Desktop\tagging.py", line 5, in <module>
f = open('tag.txt', 'r', encoding='utf=8')
TypeError: 'encoding' is an invalid keyword argument for this function
[Finished in 0.1s]
==================================================================
And when I just open a Korean txt file encoded with utf-8, the characters are broken like this. What can I do?
\xc1\xc1\xbe\xc6\xc1\xf6\xb4\xc2\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xb0\xc5\xb5\xe7\xbf\xe4\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xbd\xc3\xb4\xc2\n',
'\xc1\xcb\xbc\xdb\xc7\xd1\xb5\xa5\xbf\xe4\n',
'\xc1\xd6\xb1\xb8\xbf\xe4\
I don't know Korean and don't have a sample string to try, but here is some advice for you:
1
f = open('tag.txt', 'r', encoding='utf=8')
You have a typo here: it should be utf-8, not utf=8. Note, though, that the TypeError you got says that open() does not accept an encoding keyword at all, which is the behaviour of the built-in open() in Python 2.
The default mode of open() is 'r', so you don't have to specify it again.
2 Don't just use open(); use a with statement (a context manager) to manage opening and closing the file, like this:
with open('tagresult.txt', 'w') as f:
    f.write(s)
In Python 2 the open function does not take an encoding parameter. Instead you read a line and convert it to unicode. This article on kitchen (as in kitchen sink) modules provides details and some lightweight utilities to work with unicode in python 2.x.
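As a sketch of an alternative that runs on both Python 2.6+ and Python 3: `io.open` accepts an `encoding` argument in both. The byte values shown in the question (for example \xc1, which never occurs in valid UTF-8) suggest the file may actually be stored in a legacy Korean encoding such as CP949/EUC-KR rather than UTF-8. That is an assumption, as are the helper names `read_lines` and `write_lines`.

```python
import io


def read_lines(path, encoding="cp949"):
    """Read a text file into a list of unicode lines (encoding is an assumption)."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.readlines()


def write_lines(path, lines, encoding="utf-8"):
    """Write the lines back out, here converted to UTF-8."""
    with io.open(path, "w", encoding=encoding) as f:
        f.writelines(lines)
```

With these, `write_lines('tagresult.txt', read_lines('tag.txt'))` would convert the file to UTF-8, provided the guess about the source encoding is right.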

Load an UTF8 JSON file with Python

I am trying to parse a JSON file, and I get an error when I want to print a JSON value that is an HTML string.
The error is:
Traceback (most recent call last):
  File "parseJson.py", line 11, in <module>
    print entryContentHTML.prettify()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02c8' in position 196: ordinal not in range(128)
import json
import codecs
from bs4 import BeautifulSoup
with open('cat.json') as f:
    data = json.load(f)
    print data["entryLabel"]
    entryContentHTML = BeautifulSoup(data["entryContent"])
    print entryContentHTML.prettify()
What is the common way to load a UTF-8 encoded JSON file?
You are loading the JSON just fine. It is your print statement that fails.
You are trying to print to a console or terminal that is configured for ASCII handling only. You'll either have to alter your console configuration or explicitly encode your output:
print data["entryLabel"].encode('ascii', 'replace')
and
print entryContentHTML.prettify().encode('ascii', 'replace')
Without more information about your environment it is otherwise impossible to tell how to fix your configuration (if at all possible).
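For comparison, a Python 3 sketch of the same idea (the code in the question is Python 2): the file is opened with an explicit encoding, and text is made safe for an ASCII-only console by escaping what cannot be encoded instead of replacing it with question marks. The helper names `load_json` and `ascii_safe` are hypothetical.

```python
import json


def load_json(path):
    """Load a JSON document from a UTF-8 encoded file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def ascii_safe(text):
    """Return text with non-ASCII characters shown as backslash escapes."""
    return text.encode("ascii", "backslashreplace").decode("ascii")
```

Printing `ascii_safe(data["entryLabel"])` then works even when the console only accepts ASCII, at the cost of showing escapes such as \u02c8 in place of the original characters.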
