Python-generated LaTeX file throws UTF-8 error - python

I am trying to write a python program to generate and compile a Lualatex file with German special characters (ä, ü, ß etc.).
Unfortunately, it throws me this error:
! String contains an invalid utf-8 sequence.
Here is my example code:
import subprocess
import shutil
txtFileRecipe = open(r"C:\Users\canna\OneDrive\Desktop\TestTest.tex", "w")
txtFileRecipe.write(
("\\documentclass[a5paper]{article}\n"
"\\usepackage[ngerman]{babel}\n"
"\\usepackage{fontspec}\n"
"\\begin{document}\n"
"Äpfelmüß\n"
"\\end{document}\n")
)
txtFileRecipe.close()
subprocess.check_call(["LuaLatex", r"C:\Users\canna\OneDrive\Desktop\TestTest.tex"])

Try opening the file in binary mode and encoding the text as UTF-8:
with open(r"C:\Users\canna\OneDrive\Desktop\TestTest.tex", "wb") as txtFileRecipe:
    txtFileRecipe.write(
        ("\\documentclass[a5paper]{article}\n"
         "\\usepackage[ngerman]{babel}\n"
         "\\usepackage{fontspec}\n"
         "\\begin{document}\n"
         "Äpfelmüß\n"
         "\\end{document}\n"
         .encode('utf-8'))  # explicit encoding as utf-8
    )
subprocess.check_call(["LuaLatex", r"C:\Users\canna\OneDrive\Desktop\TestTest.tex"])
(Switched to opening the file with context manager to follow good practices)
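A variant that stays in text mode, as a minimal sketch assuming Python 3: passing encoding="utf-8" to open() should have the same effect as encoding by hand. The temporary path and the names tex_path and tex_source are placeholders for illustration, and the LuaLaTeX call is left out so the snippet runs without a TeX installation.

```python
import os
import tempfile

# Hypothetical output location; the original writes to a fixed Desktop path.
tex_path = os.path.join(tempfile.mkdtemp(), "TestTest.tex")

tex_source = (
    "\\documentclass[a5paper]{article}\n"
    "\\usepackage[ngerman]{babel}\n"
    "\\usepackage{fontspec}\n"
    "\\begin{document}\n"
    "Äpfelmüß\n"
    "\\end{document}\n"
)

# Text mode with an explicit encoding: Python encodes to UTF-8 on write
# instead of falling back to the locale default (often cp1252 on Windows).
with open(tex_path, "w", encoding="utf-8") as tex_file:
    tex_file.write(tex_source)

# Read the raw bytes back to confirm that the umlauts were stored as UTF-8.
with open(tex_path, "rb") as tex_file:
    raw = tex_file.read()
print(raw.decode("utf-8"))
```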

Related

Continuing for loop after exception in Python

So first of all I saw similar questions, but nothing worked/wasn't applicable to my problem.
I'm writing a program that takes in a text file with a lot of search queries to be searched on YouTube. The program iterates through the text file line by line, but some lines contain special UTF-8 characters that cannot be decoded. So at a certain point the program stops with a
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
As I cannot check every line of my entries, I want it to catch the error, print the line it was working on, and continue at that point.
As the error is not happening inside my for loop but in the for statement itself, I don't know how to write a try...except statement.
This is the code:
import urllib.request
import re
from unidecode import unidecode
with open('out.txt', 'r') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encounted with Line: " + line)
This is the full error message, to see that the for loop itself is causing the problem.
Traceback (most recent call last):
  File "ytbysearchtolinks.py", line 6, in <module>
    for line in infh:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
If you need an example of input I'm working with: https://pastebin.com/LEkwdU06
The try-except-block looks correct and should allow you to catch all occurring exceptions.
The usage of unidecode probably won't help you because non-ASCII characters must be encoded in a specific way in URLs, see, e.g., here.
One solution is to use urllib's quote() function. As per documentation:
Replace special characters in string using the %xx escape.
This is what works for me with the input you've provided:
import urllib.request
from urllib.parse import quote
import re
with open('out.txt', 'r', encoding='utf-8') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        # strip the trailing line break so it is not escaped into the query
        search_keyword = quote(line.strip())
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])
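To see what quote() does to a non-ASCII search term, here is a small sketch; the sample keyword is made up for illustration. quote_plus() is the sibling function that writes spaces as "+", which is the form usually seen in query strings.

```python
from urllib.parse import quote, quote_plus

keyword = "Äpfel müß"  # made-up search term with German special characters

# quote() percent-encodes the UTF-8 bytes of every character outside the safe set.
path_style = quote(keyword)

# quote_plus() does the same but turns spaces into "+".
query_style = quote_plus(keyword)

print(path_style)   # %C3%84pfel%20m%C3%BC%C3%9F
print(query_style)  # %C3%84pfel+m%C3%BC%C3%9F
```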
EDIT:
After thinking about it, I believe you are running into the following problem:
You are running the code on Windows, and apparently, Python will try to open the file with cp1252 encoding when on Windows, while the file that you shared is in UTF-8 encoding:
$ file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators
This would explain the exception you are getting and why it's not being caught by your try-except-block: the decoding happens when the for statement reads the next line from the file, which is outside the try block.
Make sure that you are using encoding='utf-8' when opening the file.
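If some lines might still be undecodable and you want to report them and carry on, one possible sketch is to read the file in binary mode and decode each line yourself, so the decoding step sits inside a try block. The helper name decode_lines and the sample bytes are hypothetical.

```python
def decode_lines(raw_lines, encoding="utf-8"):
    """Yield (line_number, text) for decodable lines; report and skip the rest."""
    for number, raw in enumerate(raw_lines, start=1):
        try:
            yield number, raw.decode(encoding).strip()
        except UnicodeDecodeError as exc:
            # The decode call is inside the try block, so a bad line is
            # reported here and the loop moves on to the next one.
            print("Skipping line {}: {!r} ({})".format(number, raw, exc))


# Made-up sample: the second entry contains a byte that is not valid UTF-8.
sample = [b"Caf\xc3\xa9 del Mar\r\n", b"broken \x81 bytes\r\n", b"M\xc3\xa4nner\r\n"]
decoded = list(decode_lines(sample))
print(decoded)
```

With a real file, the same helper could be fed a file object opened with open('out.txt', 'rb'), since iterating over a binary file yields one bytes object per line.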
I ran your code and didn't run into any problems. Have you created a virtual environment with virtualenv and installed all the packages you use?

import utf-8 csv in python3.x - special german characters

I have been trying to import a CSV file containing special characters (ä ö ü).
In Python 2.x all special characters were automatically encoded without the need to specify the encoding attribute in the open command.
I can't figure out how to get this to work in Python 3.x.
import csv
f = open('sample_1.csv', 'rU', encoding='utf-8')
csv_f = csv.reader(f, delimiter=';')
bla = list(csv_f)
print(type(bla))
print(bla[0])
print(bla[1])
print(bla[2])
print()
print(bla[3])
Console output (Sublime Build python3)
<class 'list'>
['\ufeffCat1', 'SEO Meta Text']
['Damen', 'Damen----']
['Damen', 'Damen-Accessoires-Beauty-Geschenk-Sets-']
Traceback (most recent call last):
  File "/Users/xxx/importer_tree.py", line 13, in <module>
    print(bla[3])
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 37: ordinal not in range(128)
input sample_1.csv (excel file saved as utf-8 csv)
Cat1;SEO Meta Text
Damen;Damen----
Damen;Damen-Accessoires-Beauty-Geschenk-Sets-
Damen;Damen-Accessoires-Beauty-Körperpflege-
Männer;Männer-Sport-Sportschuhe-Trekkingsandalen-
Männer;Männer-Sport-Sportschuhe-Wanderschuhe-
Männer;Männer-Sport-Sportschuhe--
Is this only an output format issue, or am I also importing the data wrongly?
How can I print out "Männer"?
Thank you for your help/guidance!
thank you to juanpa-arrivillaga and to this answer: https://stackoverflow.com/a/44088439/9059135
Issue is due to my Sublime settings:
sys.stdout.encoding returns US-ASCII
in the terminal same command returns UTF-8
Setting up the Build system in Sublime properly will solve the issue
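The effect can be reproduced without Sublime. The sketch below pushes the same string through a text stream with a given encoding, which is roughly what print() does with sys.stdout; the helper name render is hypothetical.

```python
import io
import sys

# What the current console or build system reports for standard output.
print(sys.stdout.encoding)


def render(text, encoding):
    """Write text through a text stream that uses the given encoding."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


# A UTF-8 stream can represent the umlaut.
utf8_bytes = render("Männer", "utf-8")

# An ASCII stream cannot, which matches the error in the traceback above.
try:
    render("Männer", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Assuming the build environment can pass environment variables, setting PYTHONIOENCODING to utf-8 for the Python process is one way to change what sys.stdout.encoding reports; on Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another.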

Changing encoding in csv file through python UTF-8 to UTF-16

How do you change the encoding through a python script?
I've got some files that I'm looping doing some other stuff. But before that I need to change the encoding on each file from UTF-8 to UTF-16 since SQL server does not support UTF-8
Tried this, but not working.
data = "UTF-8 data"
udata = data.decode("utf-8")
data = udata.encode("utf-16","ignore")
Cheers!
If you want to convert a file from utf-8 encoding to a file with utf-16 encoding, this script works:
#!/usr/bin/python2.7
import codecs
import shutil
with codecs.open("input_file.utf8.txt", encoding="utf-8") as input_file:
    with codecs.open(
            "output_file.utf16.txt", "w", encoding="utf-16") as output_file:
        shutil.copyfileobj(input_file, output_file)
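On Python 3 the built-in open() accepts an encoding argument, so the codecs module is not needed. The following is a sketch under that assumption; the function name convert_utf8_to_utf16 and the temporary demo files are made up for illustration.

```python
import os
import shutil
import tempfile


def convert_utf8_to_utf16(src_path, dst_path):
    """Re-encode a UTF-8 text file as UTF-16, streaming it in chunks."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-16", newline="") as dst:
        shutil.copyfileobj(src, dst)


# Demo with hypothetical temporary files.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "input_file.utf8.txt")
dst_path = os.path.join(work_dir, "output_file.utf16.txt")
with open(src_path, "w", encoding="utf-8", newline="") as f:
    f.write("Männer;Wanderschuhe\r\n")

convert_utf8_to_utf16(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()
```

The "utf-16" codec writes a byte order mark at the start of the file, which lets readers detect the byte order.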

python-write to file (ignore non-ascii chars)

I am on Linux and I want to write a string (in UTF-8) to a txt file. I tried many ways, but I always got an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)
Is there any way to write only ASCII characters to the file and ignore non-ASCII characters?
My code:
# -*- coding: UTF-8-*-
import os
import sys
def __init__(self, dirname, speaker, file, exportFile):
    text_file = open(exportFile, "a")
    text_file.write(speaker.encode("utf-8"))
    text_file.write(file.encode("utf-8"))
    text_file.close()
Thank you.
You can use the codecs module:
import codecs
text_file = codecs.open(exportFile,mode='a',encoding='utf-8')
text_file.write(...)
Try using the codecs module.
# -*- coding: UTF-8-*-
import codecs
def __init__(self, dirname, speaker, file, exportFile):
    with codecs.open(exportFile, "a", 'utf-8') as text_file:
        text_file.write(speaker.encode("utf-8"))
        text_file.write(file.encode("utf-8"))
Also, beware that your file variable has a name which collides with the builtin file function.
Finally, I would suggest you have a look at http://www.joelonsoftware.com/articles/Unicode.html to better understand what is unicode, and one of these pages (depending on your python version) to understand how to use it in Python:
http://docs.python.org/2/howto/unicode
http://docs.python.org/3/howto/unicode.html
You could decode your input string before writing it;
text = speaker.decode("utf8")
with open(exportFile, "a") as text_file:
    text_file.write(text.encode("utf-8"))
    text_file.write(file.encode("utf-8"))
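If the goal really is to keep only ASCII characters and silently drop everything else, one option on Python 3 (an assumption, since the question appears to use Python 2) is to open the file with encoding="ascii" and errors="ignore". The helper name append_ascii_only and the sample strings are hypothetical.

```python
import os
import tempfile


def append_ascii_only(path, *parts):
    """Append text to path, dropping every character that is not ASCII."""
    # errors="ignore" tells the encoder to skip characters it cannot encode.
    with open(path, "a", encoding="ascii", errors="ignore") as text_file:
        for part in parts:
            text_file.write(part)


# Demo with a hypothetical temporary file and made-up values.
export_file = os.path.join(tempfile.mkdtemp(), "export.txt")
append_ascii_only(export_file, "Jürgen Müller\n", "café.wav\n")

with open(export_file, "rb") as f:
    written = f.read()
```

Note that this loses information: "Jürgen" becomes "Jrgen". Writing UTF-8, as the other answers do, keeps the text intact.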

Some characters (trademark sign, etc) unable to write to a file but is printable on the screen

I've been trying to scrape data from a website and write out the data that I find to a file. More than 90% of the time, I don't run into Unicode errors but when the data has the following characters such as "Burger King®, Hans Café", it doesn't like writing that into the file so my error handling prints it to the screen as is and without any further errors.
I've tried the encode and decode functions and the various encodings but to no avail.
Please find an excerpt of the current code that I've written below:
import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs
f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',',u';')
            if count_detail < 4 :
                stream=stream+","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum)+","+br_name_addr+","+stream.decode(enc_s5)+os.linesep)
    except:
        print "Unicode error ->"+str(storenum)+","+branch_name_address+","+stream
Your f.write() line doesn't make sense to me: stream will be a unicode object, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try:
f.write(unicode(stnum)+br_name_addr+u","+stream+os.linesep)
... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:
line = unicode(stnum)+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))
Have you checked the encoding of the file you're writing to, and made sure the characters can be shown in the encoding you're trying to write to the file? Try setting the character encoding to UTF-8 or something else explicitly to have the characters show up.
