What I want is extremely simple and can be done in PHP with literally one line of code:
file_put_contents('target.txt', iconv('windows-1252', 'utf-8', file_get_contents('source.txt')));
In Python I spent a whole day trying to figure out how to achieve the same trivial thing, but to no avail. When I try to read or write files I usually get UnicodeDecodeError, "'str' object has no attribute 'decode'", and a dozen similar errors. I feel like I've scanned every thread on SO, but I still don't know how I can do this.
Are you specifying the encoding keyword argument when you call open()?

with open('source.txt', encoding='windows-1252') as f_in:
    with open('target.txt', 'w', encoding='utf-8') as f_out:
        f_out.write(f_in.read())
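If the file is too large to read in one go, the same conversion can be streamed with shutil.copyfileobj, which moves fixed-size chunks between the two text wrappers. A minimal sketch (the function name, filenames, and default encodings here are just examples, not part of the question):

```python
import shutil

def convert_encoding(src, dst, src_enc='windows-1252', dst_enc='utf-8'):
    # copyfileobj moves fixed-size chunks between the two text wrappers,
    # so the whole file is never held in memory at once.
    # newline='' disables newline translation on both ends, so line
    # endings pass through unchanged (matching iconv's behavior).
    with open(src, encoding=src_enc, newline='') as f_in, \
         open(dst, 'w', encoding=dst_enc, newline='') as f_out:
        shutil.copyfileobj(f_in, f_out)
```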
Since Python 3.5 you can also write:

from pathlib import Path

Path('target.txt').write_text(
    Path('source.txt').read_text(encoding='windows-1252'),
    encoding='utf-8'
)
Here's the issue I'm running into:
Error: iterator should return strings, not bytes (did you open the file in text mode?)
The code that's causing this looks something like:
import csv
import tarfile

with tarfile.open(filename) as t:
    for fileinfo in t:
        f = t.extractfile(fileinfo)
        reader = csv.DictReader(f)
        reader.fieldnames

The trouble seems to be that the extractfile() method produces an io.BufferedReader, a very basic file-like object with no high-level text interface.
What would be a good way to handle this?
I'm thinking of looking at decoding the bytes from the reader into text but I need to retain streaming because these files are very large. The codebase is Python 3.6 running on Docker/Linux.
Thanks to both #Aran-Fey and #zwer, who led me to another StackOverflow question that answered it. Here's how:

import codecs
import csv
import tarfile

with tarfile.open(filename) as t:
    for fileinfo in t:
        with t.extractfile(fileinfo) as f:
            ft = codecs.getreader("utf-8")(f)
            reader = csv.DictReader(ft)
            reader.fieldnames
This seems to work so far.
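For what it's worth, the standard library's io.TextIOWrapper does the same job as codecs.getreader and is the more commonly recommended wrapper these days. A sketch of the whole loop as a generator (stream_csv_rows and its arguments are my own names, not from the question):

```python
import csv
import io
import tarfile

def stream_csv_rows(tar_path, encoding="utf-8"):
    # Yields one dict per CSV row across every regular member of the
    # archive; TextIOWrapper decodes the BufferedReader incrementally,
    # so large members are never loaded into memory at once.
    with tarfile.open(tar_path) as t:
        for member in t:
            f = t.extractfile(member)
            if f is None:  # directories and special members
                continue
            yield from csv.DictReader(io.TextIOWrapper(f, encoding=encoding))
```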
Okay, this is probably a very vague, ill-posed question... but I'm going to give it a try anyway.
I have read in the first line of a .data file by using
with open('raw_000.data', 'rb') as f:
    A = f.readline()

(using only 'r', it assumed a UTF-8 encoding, and that failed).
This gave me the following byte-string
b'\x10\xa2\x8f\xbc-X\x98?\xfe\xd4\x17>\xdd\xda\x0e\xbf\xdc\xc5d?e\x19\x91?\xe0m\xb0<\xe7\xa8R#=\xca\xbd>\x94\x91\xb2\xbf\xba\xb3u>)\xbe\x01\xc0\x05\x1f\x83\xbf#\x04\xe2\xbf\x80\xbd;>\xe5\x0e<\xc0\x0cS0?\xbd\xcaG?\x15\x9c\x07\xc0lX\x9d?\xc5\xa3j\xc0X+D\xc0T\x91\xad?\x13\x87\xdd\xbfjCs?m\xdd\x02#\xebBi\xbf\xfc\xd8g=*NM\xbf&\x94&\xc0\x94\x91\xb2?=\xca\xbd>\xfc\xbfm\xbf\xf5\x96\x9f?\xf4\x8b\xc0\xbfAz\x12#X\xc6\xee\xbe\x84\t\xcf\xbf\x1d\xdb\x93\xbfpw\x19\xc0\xbc\xe0\x85>|\xd5\xa1?\xe5\x0e\xbc?\x80\xbd\xbb=|\xc0\xf7\xbe\\xc5\xda\xbe\xacB\xe4\xbf\x99\xbb\r#NGB\xbf\xaa\xbd~#;\xc0\xf2\xbf\x1a\xd1\xc8>\xdc\xc5\xe4\xbfe\x19\x11\xc0\x10\xa2\x8f<-X\x98\xbf\n'
Now this should contain some meaningful data. But I have no idea how to 'decode' this, as in.. what type of decoding...
All I know is that
chardet.detect('...')
gave
{'confidence': 0.0, 'encoding': None}.
And besides that, the file raw_000.data comes from an MRI machine by Philips. However, I could not find any documentation in that area either.
What other options do I have?
Well, apparently there is something called little-endian byte ordering. I never knew anything about it, but luckily there is a Wikipedia page:
https://en.wikipedia.org/wiki/Endianness
And of course, we can give this information to Python as well, using the `struct` package like so:

import struct

with open('your_byte_file', 'rb') as f:
    A = f.readline()

res = struct.unpack('<f', A[:4])  # unpack() needs exactly 4 bytes for one float

Here the < tells us that we are dealing with little-endian ordering, and the f tells us that we are expecting a 4-byte float.
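If the whole file is a sequence of packed little-endian floats (which the repeating 4-byte pattern suggests, though that is a guess about this particular scanner format), struct.iter_unpack can decode all of them in one pass. A sketch (read_le_floats is my own name; it simply ignores trailing bytes that don't fill a complete float):

```python
import struct

def read_le_floats(path):
    # Interpret the file as consecutive little-endian 32-bit floats;
    # any trailing partial float (length not divisible by 4) is dropped.
    with open(path, 'rb') as f:
        data = f.read()
    usable = len(data) - (len(data) % 4)
    return [v for (v,) in struct.iter_unpack('<f', data[:usable])]
```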
Please do not behead me for my noob question. I have looked at many other questions on Stack Overflow concerning this topic, but haven't found a solution that works as intended.
The Problem:
I have a fairly large txt file (about 5 MB) that I want to copy, via readlines() or any other built-in string-handling function, into a new file. For smaller files the following code sure works (only schematically coded here):
f = open('C:/.../old.txt', 'r');
n = open('C:/.../new.txt', 'w');
for line in f:
    print(line, file=n);
However, as I found out here (UnicodeDecodeError: 'charmap' codec can't encode character X at position Y: character maps to undefined), internal restrictions of Windows prohibit this from working on larger files. So far, the only solution I came up with is the following:
f = open('C:/.../old.txt', 'r', encoding='utf8', errors='ignore');
n = open('C:/.../new.txt', 'a');
for line in f:
    print(line, file=sys.stderr) and append(line, file='C:/.../new.txt');
f.close();
n.close();
But this doesn't work: I do get a new.txt file, but it is empty. So, how do I iterate through a long txt file and write every line into a new txt file? Is there a way to read sys.stderr as the source for the new file (I actually have no idea what this sys.stderr is)?
I know this is a noob question, but I don't know where to look for an answer anymore.
Thanks in advance!
There is no need to use print(); just write() to the file:

with open('C:/.../old.txt', 'r') as f, open('C:/.../new.txt', 'w') as n:
    n.writelines(f)
However, it sounds like you may have an encoding issue, so make sure that both files are opened with the correct encoding. If you provide the error output perhaps more help can be provided.
BTW: Python doesn't use ; as a line terminator. It can be used to separate two statements if you want to put them on the same line, but this is generally considered bad form.
You can set standard output to a file, like in my code. I successfully copied a 6 MB text file with this:

import sys

bigoutput = open("bigcopy.txt", "w")
sys.stdout = bigoutput
with open("big.txt", "r") as biginput:
    for bigline in biginput.readlines():
        print(bigline.replace("\n", ""))
bigoutput.close()
sys.stdout = sys.__stdout__  # restore stdout so later print() calls work again
Why don't you just use the shutil module and copy the file?
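To spell that out: shutil copies the contents byte-for-byte, so nothing is ever decoded and encoding errors cannot occur. A sketch using a temp file (the demo file and its contents are my own, just to make the snippet self-contained):

```python
import os
import shutil
import tempfile

# Create a small demo file containing arbitrary non-UTF-8 bytes.
src = os.path.join(tempfile.mkdtemp(), 'old.txt')
with open(src, 'wb') as f:
    f.write('größer'.encode('windows-1252'))

# shutil.copyfile copies byte-for-byte without decoding, so it works
# the same for any encoding and any file size.
dst = src.replace('old.txt', 'new.txt')
shutil.copyfile(src, dst)
```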
You can try this code; it works for me:

with open("file_path/../large_file.txt") as f:
    with open("file_path/../new_file", "w") as new_f:  # 'w', not 'wb': f yields str, not bytes
        new_f.writelines(f.readlines())
A friend of mine has written simple poetry using C's fprintf function. It was written using the 'wb' option so the generated file is in binary. I'd like to use Python to show the poetry in plain text.
What I'm currently getting are lots of strings like this: ��������
The code I am using:
with open("read-me-if-you-can.bin", "rb") as f:
    print f.read()
The thing is, when dealing with text written to a file, you have to know (or correctly guess) the character encoding used when writing said file. If the program reading the file is assuming the wrong encoding here, you will end up with strange characters in the text if you're lucky and with utter garbage if you're unlucky.
Don't try to guess, try to know: you need to ask your friend in what character encoding he or she wrote the poetry text to the file. You then have to open the file in Python specifying that character encoding. Let's say the answer is "UTF-16-LE" (for the sake of example); you then write:
with open("poetry.bin", encoding="utf-16-le") as f:
    print(f.read())
It seems you're on Python 2 still though, so there you write:
import io

with io.open("poetry.bin", encoding="utf-16-le") as f:
    print f.read()
You could start by trying UTF-8 first though, that is an often used encoding.
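If asking really is impossible, a pragmatic fallback is to try a short list of likely encodings and keep the first one that decodes without error. A sketch (the helper name and candidate list are my own; the order matters, since windows-1252 accepts almost any byte sequence and so belongs last as a catch-all):

```python
def sniff_decode(data, candidates=("utf-8", "utf-16-le", "windows-1252")):
    # Return (encoding, text) for the first candidate that decodes the
    # bytes without error. This is a heuristic, not proof: several
    # encodings can "succeed" on the same bytes yet yield different text.
    for enc in candidates:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")
```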
So I just spent a long time processing and writing out files for a project I'm working on. The files contain objects pickled with cPickle. Now I'm trying to load the pickled files, and I'm running into the problem "Can't import module ...". The thing is, I can import the module directly from the Python prompt just fine.
I started noticing that my code had the problem reading the file (getting EOF error) and I noted that I was reading it with open('file','r'). Others noted that I need to specify that it's a binary file. I don't get the EOF error anymore, but now I'm getting this error.
It seems to me that I've screwed up the writing of my files initially by writing out with 'w' and not 'wb'.
The question I have is, is there a way to process the binary file and fix what 'w' changed? Possibly by searching for line returns and changing them (which is what I think the big difference is between 'w' and 'wb' on Windows).
Any help doing this would be amazing, as otherwise I will have lost weeks of work. Thanks.
I found the answer here. It talks about someone having the same problem, but not before outlining the traditional solution in Python 2 (to all those that do this, thank you).
The solution comes down to this:
data = open(filename, "rb").read()
newdata = data.replace("\r\n", "\n")
if newdata != data:
    f = open(filename, "wb")
    f.write(newdata)
    f.close()
Basically, just replace all instances of "\r\n" with "\n". It seems to have worked well, I can now open the file and unpickle it just fine.
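For anyone hitting this on Python 3, note that the same repair needs bytes literals, since data read in "rb" mode is bytes rather than str. A sketch of the equivalent fix (fix_crlf is my own name):

```python
def fix_crlf(filename):
    # Rewrite the file in place, turning Windows line endings (b'\r\n')
    # back into b'\n'. Returns True if anything actually changed.
    with open(filename, "rb") as f:
        data = f.read()
    newdata = data.replace(b"\r\n", b"\n")
    if newdata != data:
        with open(filename, "wb") as f:
            f.write(newdata)
        return True
    return False
```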