Decompressing a .bz2 file in Python

So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.
I'm quite a Python newbie, so the answer is probably quite obvious, please help me.
In this bit of the script, I already have the file, and I just want to read it out to a variable, then decompress that? Is that right? I've tried all sorts of ways to do this, and I usually get a "ValueError: couldn't find end of stream" error on the last line in this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.
openZip = open(zipFile, "r")
s = ''
while True:
    newLine = openZip.readline()
    if(len(newLine)==0):
        break
    s += newLine
print s
uncompressedData = bz2.decompress(s)
Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.
METHOD A:
print 'decompressing ' + filename
fileHandle = open(zipFile)
uncompressedData = ''
while True:
    s = fileHandle.read(1024)
    if not s:
        break
    print('RAW "%s"', s)
    uncompressedData += bz2.decompress(s)
uncompressedData += bz2.flush()
newFile = open(steamTF2mapdir + filename.split(".bz2")[0], "w")
newFile.write(uncompressedData)
newFile.close()
I get the error:
uncompressedData += bz2.decompress(s)
ValueError: couldn't find end of stream
METHOD B:
zipFile = steamTF2mapdir + filename
print 'decompressing ' + filename
fileHandle = open(zipFile)
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
Same error:
uncompressedData = bz2.decompress(s)
ValueError: couldn't find end of stream
Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.
By the by, I used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.

You're opening and reading the compressed file as if it was a textfile made up of lines. DON'T! It's NOT.
uncompressedData = bz2.BZ2File(zipFile).read()
seems to be closer to what you're angling for.
Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:
opening ... the compressed file as if
it was a textfile ... It's NOT.
open(filename) and even the more explicit open(filename, 'r') open, for reading, a text file -- a compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). ((my recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more)).
In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).
In Windows (and before then in DOS) it's always been indispensable to distinguish, as Windows' text files, for historical reasons, are peculiar (they use two bytes rather than one to end lines, and, at least in some cases, take a byte with the value 0x1A as meaning a logical end of file), and so the low-level reading and writing code must compensate.
So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in. (though bz2.BZ2File is still simpler, whatever platform you're using!-).
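For completeness, a minimal sketch of what the whole decompress-and-save step might look like with the one-liner approach; the steamTF2mapdir and filename names are taken from the OP's own snippets, so treat the paths as placeholders:
import bz2

zipFile = steamTF2mapdir + filename                     # e.g. some_map.bsp.bz2
outPath = steamTF2mapdir + filename.split(".bz2")[0]

# BZ2File knows it's dealing with a compressed binary file,
# so no explicit 'rb' is needed on this side.
uncompressedData = bz2.BZ2File(zipFile).read()

# Write the result in binary mode so Windows doesn't translate line endings.
newFile = open(outPath, "wb")
newFile.write(uncompressedData)
newFile.close()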

openZip = open(zipFile, "r")
If you're running on Windows, you may want to say openZip = open(zipFile, "rb") here, since the file is likely to contain CR/LF combinations, and you don't want them to be translated.
newLine = openZip.readline()
As Alex pointed out, this is very wrong, as the concept of "lines" is foreign to a compressed stream.
s = fileHandle.read(1024)
[...]
uncompressedData += bz2.decompress(s)
This is wrong for the same reason. 1024-byte chunks aren't likely to mean much to the decompressor, since it's going to want to work with its own block size.
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
If that doesn't work, I'd say it's the new-line translation problem I mentioned above.
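(If you really do need to process the file in fixed-size chunks, the standard library's bz2.BZ2Decompressor keeps the stream state between calls, unlike the one-shot bz2.decompress; a rough sketch, reusing the OP's zipFile variable:)
import bz2

decompressor = bz2.BZ2Decompressor()
uncompressedData = ''
fileHandle = open(zipFile, 'rb')        # 'rb', as discussed above
while True:
    chunk = fileHandle.read(1024)
    if not chunk:
        break
    # The decompressor carries its state across calls,
    # so arbitrary chunk sizes are fine here.
    uncompressedData += decompressor.decompress(chunk)
fileHandle.close()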

This was very helpful.
44 of 2300 files gave an "end of file missing" error with a plain Windows open.
Adding the b(inary) flag to open fixed the problem.
for line in bz2.BZ2File(filename, 'rb', 10000000):
works well. (The 10,000,000 is a buffer size that works well with the large files involved.)
Thanks!

Related

How to avoid file corruption in Python?

I have a pretty basic question; perhaps I didn't know the right keywords, as I couldn't find a previous answer. I use Python scripts to control and gather information in a smarthome environment. I mostly use text files to store and update information within and between the scripts. However, I frequently run into this one issue whenever the server crashes or loses power: the file contents tend to corrupt or vanish while the crash happens.
To write file content, I usually use a structure like this:
try:
    with open(savefile, "r") as file:
        lines = file.readlines()
except:
    lines = []
    pass
lines.append(str(time.time()) + ";" + str(value) + "\n")
if len(lines) > MAX_READINGS:
    lines = lines[-MAX_READINGS:]
with open(savefile, "w") as file:
    file.writelines(lines)
In case of partial corruption, such as blank lines between the data points, I often use a line-by-line loop that only accepts lines with the correct structure (such as a timestamp at the beginning, as in the example). However, sometimes a file gets corrupted to the point where it only contains spaces or is empty, making it useless for the scripts depending on the data.
The filesystem's integrity remains intact in crashes, so it's probably not a lower level problem. But what's the suggested workaround to minimize the corruption risk?
Should I use the "a" mode to append a new line and have another way to deal with the file lengths (the MAX_READINGS), or should I make a temporary copy which I'd then use to overwrite the original after the writing is done? Or might there be an external library providing the right functionality?
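For reference, the temporary-copy idea mentioned above is usually done as a write-then-rename: write the new contents to a temp file in the same directory, flush it to disk, then atomically rename it over the original. A rough sketch, assuming Python 3 (where os.replace is available):
import os
import tempfile

def atomic_write_lines(path, lines):
    # Write to a temporary file in the same directory, then atomically
    # replace the original so a crash cannot leave a half-written file.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(lines)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the data is on disk first
        os.replace(tmp_path, path)   # atomic on POSIX and modern Windows
    except BaseException:
        os.unlink(tmp_path)
        raise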

Python conditional statement based on text file string

Noob question here. I'm scheduling a cron job for a Python script for every 2 hours, but I want the script to stop running after 48 hours, which is not a feature of cron. To work around this, I'm recording the number of executions at the end of the script in a text file using a tally mark x and opening the text file at the beginning of the script to only run if the count is less than n.
However, my script seems to always run regardless of the conditions. Here's an example of what I've tried:
with open("curl-output.txt", "a+") as myfile:
data = myfile.read()
finalrun = "xxxxx"
if data != finalrun:
[CURL CODE]
with open("curl-output.txt", "a") as text_file:
text_file.write("x")
text_file.close()
I think I'm missing something simple here. Please advise if there is a better way of achieving this. Thanks in advance.
The problem with your original code is that you're opening the file in a+ mode, which seems to set the seek position to the end of the file (try print(data) right after you read the file). If you use r instead, it works. (I'm not sure that's how it's supposed to be. This answer states it should write at the end, but read from the beginning. The documentation isn't terribly clear).
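If you want to keep a+ (so the file is created when it doesn't exist yet), a small sketch of rewinding before the read:
with open("curl-output.txt", "a+") as myfile:
    myfile.seek(0)        # a+ leaves the position at the end; rewind first
    data = myfile.read()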
Some suggestions: Instead of comparing against the "xxxxx" string, you could just check the length of the data (if len(data) < 5). Or alternatively, as was suggested, use pickle to store a number, which might look like this:
import pickle

try:
    with open("curl-output.txt", "rb") as myfile:
        num = pickle.load(myfile)
except FileNotFoundError:
    num = 0

if num < 5:
    do_curl_stuff()
    num += 1
    with open("curl-output.txt", "wb") as myfile:
        pickle.dump(num, myfile)
Two more things concerning your original code: You're making the first with block bigger than it needs to be. Once you've read the string into data, you don't need the file object anymore, so you can remove one level of indentation from everything except data = myfile.read().
Also, you don't need to close text_file manually. with will do that for you (that's the point).
Sounds more like a job for scheduling with the at command?
See http://www.ibm.com/developerworks/library/l-job-scheduling/ for different job scheduling mechanisms.
The first bug that is immediately obvious to me is that you are appending to the file even if data == finalrun. So when data == finalrun, you don't run curl, but you do append another 'x' to the file. On the next run, data will again not be equal to finalrun, so it will continue to execute the curl code.
The solution is of course to nest the code that appends to the file under the if statement.
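A sketch of that fix, keeping the structure from the question (the seek(0) follows the a+ discussion above; the curl call is left as a placeholder):
with open("curl-output.txt", "a+") as myfile:
    myfile.seek(0)
    data = myfile.read()

finalrun = "xxxxx"
if data != finalrun:
    # [CURL CODE]
    with open("curl-output.txt", "a") as text_file:
        text_file.write("x")   # only record a run when curl actually ran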
There is probably a trailing newline (\n) character, which means your file will contain something like xx\n and not simply xx. That's probably why your condition does not work :)
EDIT
What happens if through the python command line you type
open('filename.txt', 'r').read() # where filename is the name of your file
you will be able to see whether there is an \n or not
Try using this condition in the if clause instead:
if data.count('x') == 24
The data string may contain extraneous data like newline characters. Check repr(data) to see if it is actually 24 x's.

cPickle and open('w') on Windows

So I just spent a long time processing and writing out files for a project I'm working on. The files contain objects pickled with cPickle. Now I'm trying to load the pickled files and I'm running into the problem: "Can't import module ...". The thing is, I can import the module directly from the python prompt just fine.
I started noticing that my code had the problem reading the file (getting EOF error) and I noted that I was reading it with open('file','r'). Others noted that I need to specify that it's a binary file. I don't get the EOF error anymore, but now I'm getting this error.
It seems to me that I've screwed up the writing of my files initially by writing out with 'w' and not 'wb'.
The question I have is, is there a way to process the binary file and fix what 'w' changed? Possibly by searching for line returns and changing them (which is what I think the big difference is between 'w' and 'wb' on Windows).
Any help doing this would be amazing, as otherwise I will have lost weeks of work. Thanks.
I found the answer here. It talks about a solution to the same problem I'm having, but not before outlining the traditional solution in Python 2 (to all those that do this, thank you).
The solution comes down to this:
data = open(filename, "rb").read()
newdata = data.replace("\r\n", "\n")
if newdata != data:
    f = open(filename, "wb")
    f.write(newdata)
    f.close()
Basically, just replace all instances of "\r\n" with "\n". It seems to have worked well, I can now open the file and unpickle it just fine.
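For future runs, writing the pickle in binary mode in the first place sidesteps the problem entirely; a minimal sketch with cPickle (the object and path here are placeholders, not the OP's real data):
import cPickle

obj = {"example": 1}   # placeholder for the real data

with open("data.pkl", "wb") as f:      # 'wb', not 'w'
    cPickle.dump(obj, f)

with open("data.pkl", "rb") as f:      # 'rb', not 'r'
    obj = cPickle.load(f)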

how to get tell() to work

I'm trying to open a file and read from the last point read. My files are rather big (20 MB to ~1 GB). After doing some research, it seems that tell() and seek() would be among the most efficient ways to do this. I've tried the following code:
opened = open(filename, "rU")
f1 = csv.reader(opened)
k = []
for line in f1:
    k.append(opened.tell())
When I do this, every value in the list is 8272 (a long). Does that mean that I cannot use this implementation? Is there something I'm missing? Thanks for your help!
I'm running python 2.7 in Windows 7
Update
After piecing together everything learned here and trial and error I get the following code
opened = open(filename, "rU")
k = [0]
where = 1
for switch in opened:
    where += len(switch) + 1
    f = StringIO.StringIO(switch)
    interesting = csv.reader(f, delimiter=',')
    good_values = interesting.next()
    k.append(where)
return k
This allows the user to know exactly where in the file to go to while still being able to parse it according to its format. I'm not completely sure of why the offsets need to be constantly added (It seems that the newline is not accurately accounted for in len()).
It looks like the csv.reader is reading the file in chunks of 8272 bytes, which is why you see this number returned from opened.tell() many times -- until, I guess, you have read all the lines from your file in the range of 0-8272. After that you will see 8272*2 a few times; the exact number will depend on the length of the lines in the buffer read.
So, basically, in your program, tell() doesn't give you the offsets of new CSV lines, as you seem to assume. It only tells you the offset of the end of the file region that has currently been read into an internal buffer by the functions used to implement Python's I/O.
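If genuine per-line offsets are what's needed, one workaround (a sketch, written for Python 2.7 as in the question, with filename standing in for the real path) is to open the file in binary mode so no newline translation skews the counts, accumulate the raw line lengths yourself, and hand each line to csv separately:
import csv
import StringIO

offsets = [0]                      # byte offset at which each line starts
rows = []
with open(filename, 'rb') as f:    # binary mode: lengths match the file's bytes
    for raw_line in f:
        rows.append(csv.reader(StringIO.StringIO(raw_line)).next())
        offsets.append(offsets[-1] + len(raw_line))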

Python Does Not Read Entire Text File

I'm running into a problem that I haven't seen anyone on StackOverflow encounter or even google for that matter.
My main goal is to be able to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?
The problem is that when I try to read in a large text file (1-2 gb) of text, python only reads a subset of it.
For example, I'll do a really simply command such as:
newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)
And it only writes the first 382 mb of the original file. Has anyone encountered this problem previously?
I tried a few different solutions such as using:
import fileinput
import sys

for i, line in enumerate(fileinput.input("filename.txt", inplace=1)):
    sys.stdout.write(line.replace("string1", "string2"))
But it has the same effect. Nor does reading the file in chunks such as using
f.read(10000)
I've narrowed it down to mostly likely being a reading in problem and not a writing problem because it happens for simply printing out lines. I know that there are more lines. When I open it in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that python prints.
Can anyone offer any advice or things to try?
I'm currently using a 32-bit version of Windows XP with 3.25 gb of ram, and running Python 2.7
Try:
f = open("filename.txt", "rb")
On Windows, rb means open file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) I believe opening files in text mode on Windows also does something with EOF (hex 1A).
You can also specify the mode when using fileinput:
fileinput.input("filename.txt", inplace=1, mode="rb")
Are you sure the problem is with reading and not with writing out?
Do you close the file that is written to, either explicitly newfile.close() or using the with construct?
Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setting too, closing should fix your initial solutions.
If you use the file like this:
with open("filename.txt") as f:
for line in f:
newfile.write(line.replace("string1", "string2"))
It should only read into memory one line at a time, unless you keep a reference to that line in memory.
After each line is read, it will be up to Python's garbage collector to get rid of it. Give this a try and see if it works for you :)
Found the solution thanks to Gareth Latty. Using an iterator:
def read_in_chunks(file, chunk_size=1000):
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data
This answer was posted as an edit to the question Python Does Not Read Entire Text File by the OP user1297872 under CC BY-SA 3.0.
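A possible way to use that generator for the original replace-and-write task (a sketch; note that this naive version would miss a match that happens to span a chunk boundary):
with open("filename.txt", "rb") as f, open("newfile.txt", "wb") as newfile:
    for chunk in read_in_chunks(f, chunk_size=1000):
        newfile.write(chunk.replace("string1", "string2"))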
