Change string in file in Python

I have a problem changing a file in Python: I need to change a string. The file is not a plain text file, but it can be edited with a text editor.
Here is my code:
with open(esp,"r") as f:
content=f.readlines()
with open(esp_carsat,"w") as f:
for line in content:
f.write(line.replace("201","202")))
The problem is that the content is in bytes, I think:
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
So my replace is not working. I tried to play with encodings, but the file is not readable afterwards. Furthermore, I have accented characters in the file (é, è...).
Is there a way to do what I want?

You have UTF-16 encoded data. Decode to Unicode text, replace, and then encode back to UTF-16 again:
>>> data = '\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n\x00'
>>> data.decode('utf16')
u'<InstanceNames>\r\n'
I had to append an extra \x00 to decode that; by reading the file without decoding, Python split the line on the \n and left the \x00 for the next line.
Unicode data can handle accents just fine, no further work required there.
This is most easily done with io.open(), which opens file objects that do the decoding and encoding for you:
import io
with io.open(esp, "r", encoding='utf16') as f:
content=f.readlines()
with open(esp_carsat, "w", encoding='utf16') as f:
for line in content:
f.write(line.replace("201", "202")))

It's UTF-16-LE data:
>>> b
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
>>> print(b[:-1].decode('utf-16-le'))
<InstanceNames>
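If reading the raw file line by line keeps splitting UTF-16 code units (the stray \x00 mentioned above), a whole-file sketch avoids the problem entirely; the file names are the ones from the question:

with open(esp, "rb") as f:
    text = f.read().decode("utf-16")   # the BOM is consumed here

with open(esp_carsat, "wb") as f:
    # re-encode to UTF-16 (a BOM is written back automatically)
    f.write(text.replace("201", "202").encode("utf-16"))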

Related

Should I use utf8 or utf-8-sig when opening a file to read in Python?

I have always used 'utf8' to read in a file:
with open(filename, 'r', encoding='utf8') as f, open(filename2, 'r', encoding='utf8') as f2:
    for line in f:
        line = line.strip()
        columns = line.split(' ')
    for line in f2:
        line = line.strip()
        columns = line.split(' ')
However, the code above introduced an additional '\ufeff' character in the loop over 'f2', at this line:
columns = line.split(' ')
Now columns[0] contains this character, while 'line' doesn't seem to. Why is that? When I switched to 'utf-8-sig', the problem was gone.
However, the first file 'f' and its 'columns' don't have this issue at all, even with encoding='utf8' only. Both are plain text files.
So I have two questions regarding:
I am using Python3 and when reading a file, should I always use 'utf-8-sig' to be safe?
Why doesn't 'line' contain this additional code, but 'columns' contains it?
UTF-8-encoded files can be written with a signature indicating the file is UTF-8. This signature is called the "byte order mark" (or BOM) and has the Unicode code point value U+FEFF. If a file containing a BOM is viewed in a hex editor, the file will start with the hexadecimal bytes EF BB BF. When viewed in a text editor with a non-UTF-8 encoding, they often appear as ï»¿, but that depends on the encoding.
The 'utf-8-sig' codec can read UTF-8-encoded files written with and without the starting BOM signature and will remove it if present.
Use 'utf-8-sig' for writing a file only if you want a UTF-8 BOM written at the start of the file. Some (usually Windows) programs, such as Excel when reading text files, expect a BOM if the file contains UTF-8, and assume a localized encoding otherwise. Other programs may not expect a BOM and could read it as an extra character, so the choice is yours.
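To see the difference between the two codecs, here is a small self-contained sketch; the file name is made up for the demo:

import codecs

# Write a hypothetical demo file that starts with a UTF-8 BOM.
with open('bom_demo.txt', 'wb') as f:
    f.write(codecs.BOM_UTF8 + 'abc'.encode('utf-8'))

with open('bom_demo.txt', encoding='utf8') as f:
    print(repr(f.read()))      # '\ufeffabc': the BOM survives as a character

with open('bom_demo.txt', encoding='utf-8-sig') as f:
    print(repr(f.read()))      # 'abc': the BOM has been stripped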
So for your two questions:
I am using Python3 and when reading a file, should I always use 'utf-8-sig' to be safe?
Yes, it will remove the BOM if present.
Why doesn't 'line' contain this additional code, but 'columns' contains it?
line.strip() doesn't remove \ufeff, so I can't reproduce your claim. If a UTF-8-with-BOM-encoded file is opened with utf8, the first character should be \ufeff. Are you using print to display the line? \ufeff is invisible (zero-width) when printed:
>>> line = '\ufeffabc'
>>> line
'\ufeffabc'
>>> print(line)
abc
>>> print(line.strip())
abc
>>> line.strip()
'\ufeffabc'

Python convert UTF8 file to CP1250

I have a UTF-8 encoded file which I need to save as a CP1250 encoded file, so I did the following:
import codecs
# Read file as UTF-8
with codecs.open("utf.htm", "r", 'utf-8') as sourceFile:
# Write file as CP1250
with codecs.open('win1250.htm', "w", "cp1250", "xmlcharrefreplace") as targetFile:
while True:
contents = sourceFile.read()
if not contents:
break
targetFile.write(contents)
When I inspect the unicode string contents in my editor, all the characters seem to be fine. But when I open the final file in Notepad, it is not displayed correctly. For instance, instead of the letter ř I get the symbol ø. Any ideas what is going wrong here?
Thanks
Notepad probably thinks the file holds text encoded with CP-1252:
>>> 'ř'.encode('cp1250').decode('cp1250')
'ř'
>>> 'ř'.encode('cp1250').decode('cp1252')
'ø'
This is a problem with Notepad. Use a text editor where you can specify the encoding manually, like Notepad++.
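If you want to confirm that the conversion itself worked and only Notepad's encoding guess is at fault, a quick sketch like this reads the bytes back and decodes them explicitly (using the output file name from the question):

with open('win1250.htm', 'rb') as f:
    raw = f.read()

print(raw.decode('cp1250'))                     # the text as intended; ř stays ř
print(raw.decode('cp1252', errors='replace'))   # roughly what Notepad shows if it guesses CP-1252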

How do I convert unicode to unicode-escaped text [duplicate]

This question already has an answer here:
How to encode Python 3 string using \u escape code?
I'm loading a file with a bunch of unicode characters (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped-unicode form (\u91cb) in Python. I've found a couple of similar questions here on StackOverflow, including this one: Evaluate UTF-8 literal escape sequences in a string in Python3, which does almost exactly what I want, but I can't work out how to save the data.
For example:
Input file:
\xe9\x87\x8b
Python Script
file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()
Output File (That I would like):
\u91cb
You need to encode it again with unicode-escape encoding.
>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'
Your code, modified (using binary mode to reduce unnecessary encodes/decodes):
with open("input.txt", "rb") as f:
text = f.read().rstrip() # rstrip to remove trailing spaces
decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
f.write(decoded.encode('unicode-escape'))
http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq
\xe9\x87\x8b is not a Unicode character. It looks like the representation of a bytestring that represents the 釋 Unicode character encoded using the utf-8 character encoding. \u91cb is a representation of the 釋 character in Python source code (or in JSON format). Don't confuse the text representation and the character itself:
>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr()
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
To read text encoded as utf-8 from a file, specify the character encoding explicitly:
with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()
It is exactly the same for saving Unicode text to a file:
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)
If you omit the explicit encoding parameter, then locale.getpreferredencoding(False) is used, which may produce mojibake if it does not correspond to the character encoding actually used to save the file.
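As a minimal illustration of that kind of mojibake (the file name here is just a made-up example, and the wrong encoding is picked on purpose):

with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('釋')

with open('demo.txt', encoding='cp1252') as f:   # deliberately mismatched encoding
    print(f.read())   # prints 'é‡‹' instead of '釋'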
If your input file literally contains \xe9 (4 characters), then you should fix whatever software generates it. If you need to use 'unicode-escape', something is broken.
It looks as if your input file is UTF-8 encoded, so specify UTF-8 encoding when you open the file (Python 3 is assumed, as per your reference):
with open("input.txt", "r", encoding='utf8') as f:
text = f.read()
text will contain the content of the file as a str (i.e. unicode string). Now you can write it in unicode escaped form directly to a file by specifying encoding='unicode-escape':
with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)
The content of your file will now contain unicode-escaped literals:
$ cat output.txt
\u91cb

Reading UTF-8 file returns unexpected chars

Running Windows 8 64-bit. I have a file where I store some data, saved with the UTF-8 encoding using Windows Notepad. Suppose this is the content of the file:
1,some,data,here,0,-1
I'm reading it like this:
f = open("file.txt", "rb")
f.read()
f.close()
And f.read() returns this:
u"\xef\xbb\xbf1,some,data,here,0,-1"
I can just use f.read()[3:] but that's not a clean solution.
What are those characters at the beginning of the file?
Those first 3 bytes are the UTF-8 BOM, or Byte Order Mark. UTF-8 doesn't need the BOM (it has a fixed byte order unlike UTF-16 and UTF-32), but many tools (mostly Microsoft's) add it anyway to aid in file-encoding detection.
You can test for it and skip it safely; use codecs.BOM_UTF8 to handle it:
import codecs
data = f.read()
if data.startswith(codecs.BOM_UTF8):
    data = data[3:]
You could also use the io.open() function to open the file and have Python decode the file for you to Unicode, and tell it to use the utf_8_sig codec:
import io
with io.open('file.txt', encoding='utf_8_sig') as f:
    data = f.read()
That's the BOM (byte order mark). In reality, UTF-8 has only one valid byte order, but despite that, this 3-byte sequence can appear at the beginning of a file (or of data in general). If these exact values are the first 3 bytes, ignore them.

Unicode error in python when printing a list

Edit: http://pastebin.com/W4iG3tjS - the file
I have a text file encoded in utf8 with some Cyrillic text in it. To load it, I use the following code:
import codecs
fopen = codecs.open('thefile', 'r', encoding='utf8')
fread = fopen.read()
fread dumps the file to the screen as escape sequences (all unicodish); print fread displays it in readable form (ASCII, I guess).
I then try to split it and write it to an empty file with no encoding:
a = fread.split()
for l in a:
    print>>dasFile, l
But I get the following error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Is there a way to dump fread.split() into a file? How can I get rid of this error?
Since you've opened and read the file via codecs.open(), it's been decoded to Unicode. So to output it you need to encode it again, presumably back to UTF-8.
for l in a:
    dasFile.write(l.encode('utf-8'))
print is going to use the default encoding, which is normally "ascii". So you see that error with print. But you can open a file and write directly to it.
a = fopen.readlines() # returns a list of lines already, with line endings intact
# do something with a
dasFile.writelines(a) # doesn't add line endings, expects them to be present already.
assuming the lines in a are encoded already.
PS. You should also investigate the io module.
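For example, io.open() (which is what the built-in open() is in Python 3) attaches the encoding to the file object, so every write is encoded for you. A minimal sketch along those lines, where the output file name is made up and a is the fread.split() list from the question:

import io

with io.open('out.txt', 'w', encoding='utf-8') as das_file:
    for l in a:
        # l is already unicode, so the file object handles the encoding
        das_file.write(l + u'\n')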
