Python: convert UTF-8 file to CP1250

I have a UTF-8 encoded file which I need to save as a CP1250 encoded file, so I did the following:
import codecs
# Read file as UTF-8
with codecs.open("utf.htm", "r", 'utf-8') as sourceFile:
    # Write file as CP1250
    with codecs.open('win1250.htm', "w", "cp1250", "xmlcharrefreplace") as targetFile:
        while True:
            contents = sourceFile.read()
            if not contents:
                break
            targetFile.write(contents)
When I inspect the unicode string contents in my editor, all the characters seem to be fine. But when I open the final file in Notepad, the file is not written correctly: for instance, instead of the letter ř I get the symbol ø. Any ideas what is going wrong here?
Thanks

Notepad probably thinks the file holds text encoded with CP-1252:
>>> 'ř'.encode('cp1250').decode('cp1250')
'ř'
>>> 'ř'.encode('cp1250').decode('cp1252')
'ø'
This is a problem with Notepad. Use a text editor where you can specify the encoding manually, like Notepad++.
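If you want to double-check the conversion itself rather than trust the editor, a quick way is to read the raw bytes back and decode them explicitly. A minimal sketch, assuming the win1250.htm produced above:

# Read the converted file as raw bytes and decode them explicitly as cp1250.
# If characters such as ř come out right here, the conversion is fine and
# Notepad is simply guessing the wrong encoding (cp1252).
with open('win1250.htm', 'rb') as f:
    raw = f.read()
print(raw.decode('cp1250')[:200])  # first 200 characters of the decoded text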

Related

I am trying to read a UTF-8 file in Python and keep the BOM, but the BOM is automatically deleted when I do file.read

I have done the following:
a = open(text, encoding='utf-8').read
I am trying to count the words in this text file and include the BOM.
However, when I use readline, the BOM is not deleted.
Does anyone know how to keep the BOM with read, not readlines?
Hello, I tried this and it works. Note the parentheses on read(): your original code uses file.read without them, which gives you the bound method object instead of the file contents:
a = open('text.txt',encoding='utf-8').read()
print(a)
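For reference, whether the BOM survives read() depends on the codec you name, not on read() versus readline(). A minimal sketch (bom_demo.txt is a made-up file name for the demo):

# Write a small file with a BOM: the utf-8-sig codec prepends it for us.
with open('bom_demo.txt', 'w', encoding='utf-8-sig') as f:
    f.write('hello')

# Plain utf-8 keeps the BOM as the character U+FEFF at the start of the text;
# utf-8-sig strips it while decoding.
kept = open('bom_demo.txt', encoding='utf-8').read()
stripped = open('bom_demo.txt', encoding='utf-8-sig').read()
print(repr(kept))      # '\ufeffhello'
print(repr(stripped))  # 'hello'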

Python 3 How to ignore errors when writing UTF-8 to file

I have the following program:
with open(r'C:\s_f.csv', 'w', encoding="utf-8", errors="ignore") as outf:
    with open(r'C:\street.csv', 'r', encoding="utf-8", errors="ignore") as f:
        for line in f:
            out_line = line
            out_line = out_line.replace('"', '¬')
            out_line = out_line.replace(',', '~')
            outf.write(out_line)
For some reason I am still getting:
File "c:\Program Files\Anaconda3\streets.py", line 5
SyntaxError: Non-UTF-8 code starting with '\xac' in file c:\Program Files\Anaconda3\streets.py on line 5, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
How can I ignore the UTF-8 errors in Python 3?
You have saved your source code as something other than UTF-8, most likely as Latin-1 or Windows Codepage 1252.
Your options are to change encoding used for the source (using your text editor), declare the source code encoding on the first or second line of your source file (as indicated by the error message), or use an ASCII-safe escape sequence.
The latter can be done here by using a \xhh or \uhhhh escape sequence:
out_line = out_line.replace('"','\xAC') # or `'\u00AC'`
\xac and \u00ac (case insensitive) encode the same character in the Unicode standard, the U+00AC NOT SIGN codepoint. Properly encoded to UTF-8 this would be the byte sequence C2 AC, but your .py file was saved with only an AC byte at that point.
If you do know the encoding used but don't want to change it, add a PEP 263 comment to the start of your file (first or second line at the top):
# coding=cp1252
However, your best option is to configure your code editor to save the file as UTF-8; that's the default encoding Python 3 uses to read your source code.
Otherwise this has nothing to do with writing to the CSV file; Python can't even begin to run your code, as it can't read the source properly.
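Putting the escape-sequence fix together, a sketch of the corrected script (paths as in the question); the source now contains only ASCII characters, so it needs no encoding declaration:

# The NOT SIGN is written as the escape \u00ac instead of a literal ¬,
# so this .py file can be saved in any ASCII-compatible encoding.
with open(r'C:\s_f.csv', 'w', encoding='utf-8', errors='ignore') as outf:
    with open(r'C:\street.csv', 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            out_line = line.replace('"', '\u00ac')
            out_line = out_line.replace(',', '~')
            outf.write(out_line)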
Maybe you can use:
# -*- coding: utf-8 -*-
As the first line of your code.

How to check each line of file for UTF-8 and write in another file?

I would like to know how I can write the UTF-8 encoded lines to another file on the fly. I have a folder containing a number of files, and I cannot go and check each and every file for non-UTF-8 characters by hand.
I have tried this code:
import codecs
try:
    f = codecs.open(filename, encoding='utf-8', errors='strict')
    for line in f:
        pass
    print "Valid utf-8"
except UnicodeDecodeError:
    print "invalid utf-8"
This checks whether the whole file is valid UTF-8 or not, but I am trying to check each and every line of the files in a folder and write out those lines which are valid UTF-8.
I would like to delete the lines in my files which are not UTF-8 encoded: while reading a line, if the program finds that the line is valid UTF-8 it should move on to the next line, else it should delete that line. I hope that is clear now.
I would like to know how I can do it with the help of Python. Kindly let me know.
I am not looking to convert them, but to delete them, or to write the UTF-8 valid lines from the files to another file.
This article on how to process text files in Python 3 will be of help.
Basically if you use:
open(fname, encoding="utf-8", errors="strict")
It will raise an exception if the file is not UTF-8 encoded, but you can change the error-handling parameter to read the file anyway and apply your own algorithm to exclude lines.
For example:
open(fname, encoding="utf-8", errors="replace")
This will replace every byte sequence that is not valid UTF-8 with the U+FFFD replacement character (�).
As @Leon says, you need to consider that Chinese and/or Arabic characters can be valid UTF-8.
If you want a more restrictive character set, you can try opening your file with a latin-1 or an ascii encoding (taking into account that UTF-8 and latin-1 are ASCII compatible).
You also need to take into account that there are many character encodings, and they are not all ASCII compatible. It is very difficult to read text files properly if you don't know their encoding; the chardet module can help with that, but it is not 100% reliable.
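One way to do the actual filtering is to read each file line by line in binary mode and keep only the lines that decode cleanly as UTF-8. A minimal sketch with made-up input and output file names:

# Copy only the lines that are valid UTF-8. Binary mode gives us the raw
# bytes, so one bad line cannot abort reading the rest of the file.
with open('input.txt', 'rb') as src, open('clean.txt', 'wb') as dst:
    for raw_line in src:
        try:
            raw_line.decode('utf-8')  # strict error handling by default
        except UnicodeDecodeError:
            continue  # drop lines that are not valid UTF-8
        dst.write(raw_line)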

Change string in file in python

I have a problem changing a file in python. I need to change a string.
The file is not a text file but can be edited with a text editor.
Here is my code:
with open(esp, "r") as f:
    content = f.readlines()

with open(esp_carsat, "w") as f:
    for line in content:
        f.write(line.replace("201", "202"))
The problem is that the content is bytes, I think:
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
So my replace is not working. I tried to play with the encoding, but then the file is not readable afterwards. Furthermore, I have accents in the file (é, è...).
Is there a way to do what I want?
You have UTF-16 encoded data. Decode to Unicode text, replace, and then encode back to UTF-16 again:
>>> data = '\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n\x00'
>>> data.decode('utf16')
u'<InstanceNames>\r\n'
I had to append an extra \x00 to decode that; by reading the file without decoding, Python split the line on the \n and left the \x00 for the next line.
Unicode data can handle accents just fine, no further work required there.
This is easiest done with io.open() to open file objects that do the decoding and encoding for you:
import io
with io.open(esp, "r", encoding='utf16') as f:
    content = f.readlines()

with io.open(esp_carsat, "w", encoding='utf16') as f:
    for line in content:
        f.write(line.replace("201", "202"))
It's UTF-16-LE data:
>>> b
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
>>> print(b[:-1].decode('utf-16-le'))
<InstanceNames>
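For completeness, in Python 3 no io import is needed, since the built-in open() accepts an encoding directly. A sketch of the same fix, with placeholder paths standing in for the question's esp and esp_carsat:

esp = 'instances.cfg'             # placeholder input path
esp_carsat = 'instances_out.cfg'  # placeholder output path

# open() decodes the UTF-16 bytes (the BOM tells it the byte order) and
# encodes again on write, so replace() operates on ordinary str objects.
with open(esp, 'r', encoding='utf-16') as f:
    content = f.readlines()

with open(esp_carsat, 'w', encoding='utf-16') as f:
    for line in content:
        f.write(line.replace('201', '202'))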

Python: How to preserve Ä,Ö,Ü when writing to file

I open 2 files in Python, change and replace some of their content and write the new output into a 3rd file.
My 2 input files are XMLs, encoded in 'UTF-8 without BOM' and they have German Ä,Ö,Ü and ß in them.
When I open my output XML file in Notepad++, the encoding is not specified (i.e. there's no encoding checked in the 'Encoding' tab). My Ä,Ö,Ü and ß are transformed into something like
ü
When I create the output in Python, I use
with open('file', 'w') as fout:
    fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
What do I have to do instead?
I think this should work:
import codecs

with codecs.open("file.xml", 'w', "utf-8") as fout:
    # do stuff with the file pointer
To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:
tree.write('file', encoding='utf-8')
When writing raw bytestrings, you want to open the file in binary mode:
with open('file', 'wb') as fout:
    fout.write(xyz)
Otherwise the open call opens the file in text mode and expects unicode strings instead, and will encode them for you.
To decode is to interpret an encoding (like UTF-8), and the output is unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:
with open(file, 'w', encoding='utf-8') as fout:
    fout.write(xyz.decode('utf-8'))
If you don't specify an encoding, Python will use a default, which usually is a Bad Thing. Note that since you already have UTF-8 encoded byte strings to start with, this decode-then-re-encode step is actually useless.
Note that Python file operations never transform existing unicode codepoints into XML character entities (such as ü); other code you have could do this, but you didn't share that with us.
I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.
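Relatedly, if the goal is for editors like Notepad++ to detect the encoding, it can help to have tree.write() emit an XML declaration; with encoding='utf-8' it omits the declaration unless asked. A short sketch, with placeholder file names:

import xml.etree.ElementTree as ET

tree = ET.parse('input.xml')  # placeholder input file

# ... modify the tree here ...

# xml_declaration=True forces the <?xml version='1.0' encoding='utf-8'?>
# header, which editors use to pick the right encoding when opening the file.
tree.write('output.xml', encoding='utf-8', xml_declaration=True)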
Some explanation of xml.etree.ElementTree for Python 2, and of its function parse(): the function takes the source as the first argument, which can be an open file object or a filename. The function creates the ElementTree instance and then passes the argument on to tree.parse(...), which looks like this:
def parse(self, source, parser=None):
    if not hasattr(source, "read"):
        source = open(source, "rb")
    if not parser:
        parser = XMLParser(target=TreeBuilder())
    while 1:
        data = source.read(65536)
        if not data:
            break
        parser.feed(data)
    self._root = parser.close()
    return self._root
You can see from the third line that if a filename was passed, the file is opened in binary mode. This way, if the file content is UTF-8, you are processing elements with UTF-8 encoded binary content. If that is the case, you should also open the output file in binary mode.
Another possibility is to use codecs.open(filename, encoding='utf-8') for opening the input file and to pass the open file object to xml.etree.ElementTree.parse(...). This way, the ElementTree instance will work with Unicode strings, and you should encode the result to UTF-8 when writing the content back. In that case, you can use codecs.open(...) with UTF-8 for writing as well. You can pass the opened output file object to the mentioned tree.write(f), or you can let tree.write(filename, encoding='utf-8') open the file for you.
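A sketch of that second approach, with placeholder file names; the write step simply lets tree.write() open and encode the output:

import codecs
import xml.etree.ElementTree as ET

# Decode the input to Unicode ourselves and hand the open file object
# to parse(), as described above.
with codecs.open('in.xml', encoding='utf-8') as f:
    tree = ET.parse(f)

# ... modify the tree here ...

# Let tree.write() open the output file and encode back to UTF-8 for us.
tree.write('out.xml', encoding='utf-8')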
