Writing XML to file corrupts files in python

I'm attempting to write the contents of an xml.dom.minidom object to a file. The simple idea is to use the writexml() method:
import codecs
from xml.dom import minidom
def write_xml_native():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    f = codecs.open('codified.xml', mode='w', encoding='utf-8')
    # Using native writexml() method to write
    xmldoc.writexml(f, encoding="utf=8")
    f.close()
The problem is that it corrupts the non-Latin text in the file. The other way is to get the XML as a string and write it to the file explicitly:
def write_xml():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    # Opening file for writing UTF-8, which is XML's default encoding
    f = codecs.open('codified3.xml', mode='w', encoding='utf-8')
    # Writing XML in UTF-8 encoding, as recommended in the documentation
    f.write(xmldoc.toxml("utf-8"))
    f.close()
This results in the following error:
Traceback (most recent call last):
  File "D:\Projects\Semio\semioparser.py", line 45, in <module>
    write_xml()
  File "D:\Projects\Semio\semioparser.py", line 42, in write_xml
    f.write(xmldoc.toxml(encoding="utf-8"))
  File "C:\Python26\lib\codecs.py", line 686, in write
    return self.writer.write(data)
  File "C:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 2064: ordinal not in range(128)
How do I write XML text to a file? What am I missing?
EDIT: The error is fixed by adding a decode call:
f.write(xmldoc.toxml("utf-8").decode("utf-8"))
But Russian characters are still corrupted.
The text is not corrupted when viewed in the interpreter, only when it is written to the file.

Hmm, though this should work:
xml = minidom.parse("test.xml")
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)
Alternatively, you may try:
with codecs.open("test.xml", "r", "utf-8") as inp:
    xml = minidom.parseString(inp.read().encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)
Update: If you construct the XML from a string object, you should encode it before passing it to the minidom parser, like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import xml.dom.minidom as minidom
xml = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)
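For readers on Python 3, the same idea can be sketched without codecs. This is a minimal example under stated assumptions, not the answer author's code: the helper name save_document and the temporary output path are made up for illustration. It relies on toxml(encoding=...) returning bytes in Python 3, so the file is opened in binary mode and no second encoding step can interfere.

```python
import os
import tempfile
import xml.dom.minidom as minidom


def save_document(doc, path):
    """Write a minidom document to path as UTF-8 (hypothetical helper)."""
    # toxml(encoding=...) returns bytes in Python 3 and puts the same
    # encoding name into the XML declaration, so write in binary mode.
    with open(path, "wb") as out:
        out.write(doc.toxml(encoding="utf-8"))


# Sample document with Cyrillic text, taken from the answer above.
doc = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
target = os.path.join(tempfile.mkdtemp(), "out.xml")
save_document(doc, target)
```

Because the bytes on disk and the declared encoding come from the same call, they cannot disagree, which is one common source of text that looks corrupted in a viewer.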

Try this:
with open("codified.xml", "w") as f:
    f.write(xmldoc.toxml("utf-8").decode("utf-8"))
This works for me (under Python 3, though).

Related

Writing CSV file with umlauts causing "UnicodeEncodeError: 'ascii' codec can't encode character"

I am trying to write characters with double dots (umlauts) such as ä, ö and Ö. I am able to write them to the file with data.encode("utf-8"), but the result, b'\xc3\xa4\xc3\xa4\xc3\x96', is not what I want (the UTF-8 bytes appear as a literal). I want "ääÖ" stored in the file as written.
How can I write data with umlaut characters to a CSV file in Python 3?
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    data=resultFile
    a.writerows(data)
Traceback:
  File "<ipython-input-280-73b1f615929e>", line 5, in <module>
    a.writerows(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 15: ordinal not in range(128)

Pass an encoding parameter to open() and set it to 'utf8'.
import csv
data = "ääÖ"
with open("test.csv", 'w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Edit: Removed the use of the io library, since open is the same as io.open in Python 3.
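A slightly fuller sketch of the same fix, assuming Python 3 and made-up sample rows: it writes a list of rows (which is what writerows() expects), passes newline="" as the csv module documentation recommends, and reads the file back to confirm the umlauts survive. The temporary path is only there to keep the example self-contained.

```python
import csv
import os
import tempfile

# Hypothetical sample data: each inner list is one CSV row.
rows = [["name", "city"], ["Jürgen", "Köln"], ["ääÖ", "Österreich"]]
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# An explicit encoding avoids falling back to an ASCII locale default;
# newline="" lets the csv module control line endings itself.
with open(path, "w", encoding="utf-8", newline="") as fp:
    writer = csv.writer(fp, delimiter=";")
    writer.writerows(rows)

# Read the file back with the same settings to check the round trip.
with open(path, encoding="utf-8", newline="") as fp:
    read_back = list(csv.reader(fp, delimiter=";"))
```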

Declaring the source encoding should make this work on both Python 2 and 3 (the declaration is not needed in Python 3, where UTF-8 is the default source encoding):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
    a = csv.writer(fp, delimiter=";")
    a.writerows(data)
Credits to:
Working with utf-8 encoding in Python source

How do i read and print a whole .txt file using python?

I am totally new to Python, and I am supposed to write a program that can read a whole .txt file and print it. The file is a long article in my first language (Norwegian). I have three versions that should do the same thing, but all of them fail with errors. I have tried both PyCharm and Eclipse with PyDev installed, and I get the same errors in both.
from sys import argv
import pip._vendor.distlib.compat
script, dev = argv
txt = open(dev)
print("Here's your file %r:" % dev)
print(txt.read())
print("Type the filename again:")
file_again = pip._vendor.distlib.compat.raw_input("> ")
txt_again = open(file_again)
print(txt_again.read())
But this gets the errors:
Traceback (most recent call last):
  File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/1A.py", line 5, in <module>
    script, dev = argv
ValueError: not enough values to unpack (expected 2, got 1)
Again, I am new to Python, and I searched around but didn't find a solution.
My next attempt was this:
# -*- coding: utf-8 -*-
import sys, traceback
fr = open('dev.txt', 'r')
text = fr.read()
print(text)
But this gets these errors:
Traceback (most recent call last):
  File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/v2.py", line 6, in <module>
    text = fr.read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
I do not understand why it doesn't work.
My third attempt looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("dev.txt", help="dev.txt")
args = parser.parse_args()
if args.filename:
    with open('dev.txt') as f:
        for line in f:
            name, _ = line.strip().split('\t')
            print(name)
And this gets the errors:
usage: v3.py [-h] dev.txt
v3.py: error: the following arguments are required: dev.txt
Any help with why these don't work is welcome.
Thank you in advance :D

Since the 2nd approach is the simplest, I'll stick to it.
You stated that the contents of dev.txt are Norwegian, which means the file will include non-ASCII characters such as Æ, Ø and Å. The Python interpreter is trying to tell you this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
It cannot interpret the byte 0xC3 = 195 (decimal) as an ASCII character, because ASCII is limited to 128 different characters (byte values 0 to 127).
I'll assume the file is encoded as UTF-8; if not, change the encoding parameter in line 2.
# -*- coding: utf-8 -*-
fr = open('dev.txt', 'r', encoding='utf-8')
text = fr.read()
print(text)
If you do not know your encoding, you can find it out via your editor or use Python to guess it.
Your terminal could also cause the error when it is not configured to print Unicode characters or map them correctly. You might want to take a look at this question and its answers.
After working with a file, it is recommended to close it. You can either do that manually via fr.close() or make Python do it automatically:
with open('dev.txt', 'r', encoding='utf-8') as fr:
    text = fr.read()
# fr is closed automatically when leaving the with block
print(text)
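One simple way to "use Python to guess" the encoding is to try a short list of candidates and keep the first one that decodes without error. The sketch below assumes the candidate list ("utf-8", then "cp1252") and the helper name read_text; both are illustrative choices, and a clean decode does not prove the guess is right, only that it is possible.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            with open(path, "r", encoding=name) as handle:
                return handle.read(), name
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings matched: %r" % (encodings,))


# Demo file: Norwegian letters saved as cp1252, which is not valid UTF-8.
sample_path = os.path.join(tempfile.mkdtemp(), "dev.txt")
with open(sample_path, "wb") as raw:
    raw.write(u"Æ, Ø og Å".encode("cp1252"))

text, used = read_text(sample_path)
print(used)
```

UTF-8 is tried first because it is strict: arbitrary bytes from a legacy single-byte encoding rarely form valid UTF-8, so a successful UTF-8 decode is a fairly strong signal.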

file = open("File.txt", "r")
a = str(file.read())
print(a)
Is this what you were looking for?

For example:
with open("fileA.txt", "r") as fileA:
    for line in fileA:
        print(line)

This is a possible solution:
f = open("textfile.txt", "r")
lines = f.readlines()
for line in lines:
    print(line)
f.close()
Save it as for example myscript.py and execute it:
python /path/to/myscript.py

Korean txt file encoding with utf-8

I'm trying to process a Korean text file with python, but it fails when I try to encode the file with utf-8.
#!/usr/bin/python
#-*- coding: utf-8 -*-
f = open('tag.txt', 'r', encoding='utf=8')
s = f.readlines()
z = open('tagresult.txt', 'w')
y = z.write(s)
z.close
=============================================================
Traceback (most recent call last):
File "C:\Users\******\Desktop\tagging.py", line 5, in <module>
f = open('tag.txt', 'r', encoding='utf=8')
TypeError: 'encoding' is an invalid keyword argument for this function
[Finished in 0.1s]
==================================================================
And when I just open a Korean txt file encoded with UTF-8, the characters come out broken like this. What can I do?
\xc1\xc1\xbe\xc6\xc1\xf6\xb4\xc2\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xb0\xc5\xb5\xe7\xbf\xe4\n',
'\xc1\xc1\xbe\xc6\xc7\xcf\xbd\xc3\xb4\xc2\n',
'\xc1\xcb\xbc\xdb\xc7\xd1\xb5\xa5\xbf\xe4\n',
'\xc1\xd6\xb1\xb8\xbf\xe4\

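I don't know Korean and don't have a sample string to try, but here is some advice:
1. Check the encoding name:
f = open('tag.txt', 'r', encoding='utf=8')
You have a typo here: it should be utf-8, not utf=8. On its own the typo would raise a LookupError (unknown encoding); the TypeError in your traceback means the open() you are calling does not accept an encoding argument at all, as is the case for the built-in open() in Python 2.
The default mode of open() is 'r', so you don't have to specify it again.
2. Don't just call open(); use a with statement (a context manager) to handle opening and closing the file, like this:
with open('tagresult.txt', 'w') as f:
    # s is the list returned by readlines(), so use writelines()
    f.writelines(s)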

In Python 2 the open function does not take an encoding parameter. Instead you read a line and convert it to unicode. This article on kitchen (as in kitchen sink) modules provides details and some lightweight utilities to work with unicode in python 2.x.
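As one possible alternative to decoding each line by hand, io.open() accepts an encoding argument on Python 2.6+ and is the same function as the built-in open() on Python 3. The sketch below is an illustration only: the file name and the Korean sample lines are made up, and it assumes the input really is UTF-8.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tag.txt")

# io.open() takes an encoding on both Python 2.6+ and Python 3.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(u"한국어\n")
    out.write(u"파이썬\n")

# Reading with the same encoding yields Unicode strings, not raw bytes.
with io.open(path, "r", encoding="utf-8") as src:
    lines = src.readlines()
```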

Changing encoding in csv file through python UTF-8 to UTF-16

How do you change the encoding through a python script?
I've got some files that I'm looping over and doing some other processing on. But before that I need to change the encoding of each file from UTF-8 to UTF-16, since SQL Server does not support UTF-8.
I tried this, but it is not working:
data = "UTF-8 data"
udata = data.decode("utf-8")
data = udata.encode("utf-16","ignore")
Cheers!

If you want to convert a file from utf-8 encoding to a file with utf-16 encoding, this script works:
#!/usr/bin/python2.7
import codecs
import shutil
with codecs.open("input_file.utf8.txt", encoding="utf-8") as input_file:
    with codecs.open(
            "output_file.utf16.txt", "w", encoding="utf-16") as output_file:
        shutil.copyfileobj(input_file, output_file)
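On Python 3 the same conversion can be sketched with the built-in open(). The function name convert_utf8_to_utf16 and the sample CSV content are assumptions for illustration. Passing newline="" on both sides keeps the original line endings untouched, and the "utf-16" codec writes a byte order mark at the start of the output.

```python
import os
import shutil
import tempfile


def convert_utf8_to_utf16(src_path, dst_path):
    """Re-encode a UTF-8 text file as UTF-16 (hypothetical helper)."""
    # newline="" disables newline translation so \r\n stays \r\n.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-16", newline="") as dst:
        shutil.copyfileobj(src, dst)


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "input.utf8.csv")
target = os.path.join(workdir, "output.utf16.csv")
original = u"id;name\r\n1;Zoë\r\n"

with open(source, "w", encoding="utf-8", newline="") as fh:
    fh.write(original)

convert_utf8_to_utf16(source, target)
```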

Python: Special characters encoding

This is the code I am using to replace special characters in text files and concatenate the files into a single file.
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = dirpath+"\\"+fname
        with codecs.open(currentfile, encoding='utf8') as infile:
            #print currentfile
            outfile.write(fname)
            outfile.write('\n')
            outfile.write('\n')
            for line in infile:
                line = line.replace(u"´ı", "i")
                line = line.replace(u"ï¬", "fi")
                line = line.replace(u"ï¬‚", "fl")
                outfile.write (line)
The first line.replace works fine while the others do not (which makes sense), and since no errors were generated, I thought there might be a problem of "visibility" (if that's the term). So I wrote this:
import codecs
currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ï¬" not in line: print "not found!"
which always returns "not found!" proving that those characters aren't read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, i get this error:
Traceback (most recent call last):
  File C:\\path\\to\\concat.py, line 30, in <module>
    outfile.write(line)
  File C:\\Python27\\codecs.py, line 691, in write
    return self.writer.write(data)
  File C:\\Python27\\codecs.py, line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
Since I am not really experienced in Python, I can't figure it out from the sources already available: the Python documentation (1,2) and relevant questions on StackOverflow (1,2).
I am stuck here. Any suggestions? All answers are welcome!

There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs
dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ï¬", u"fi")
                line = line.replace(u"ï¬‚", u"fl")
                outfile.write (line)
The coding declaration tells the interpreter that you saved your source file as UTF-8, ensuring that literals such as u"´ı" are correctly decoded to Unicode values. Passing encoding to codecs.open() makes sure that the lines you read are decoded to Unicode values and that your Unicode values are written to the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.
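The "unicode escape codes" option mentioned above can be sketched as follows. This is an illustration under an assumption the question does not confirm: that the characters to replace are the typographic ligatures U+FB01 (fi) and U+FB02 (fl). The mapping LIGATURES and the function expand_ligatures are hypothetical names. Escapes keep the source file pure ASCII, so the result no longer depends on how the editor saved it.

```python
# -*- coding: utf-8 -*-

# Assumed targets: the "fi" and "fl" ligatures, written as escapes so the
# source file contains only ASCII characters.
LIGATURES = {
    u"\ufb01": u"fi",
    u"\ufb02": u"fl",
}


def expand_ligatures(text):
    """Replace each ligature code point with its plain-letter spelling."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


sample = u"\ufb01nd the \ufb02oor"
cleaned = expand_ligatures(sample)
```

The function expects text that has already been decoded to Unicode, which is what reading through codecs.open() with an encoding provides.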
