Stream a .zst compressed file line by line - python

I am trying to sift through a big database that is compressed as a .zst file. I am aware that I could simply decompress it and then work on the resulting file, but that uses up a lot of space on my SSD and takes 2+ hours, so I would like to avoid it if possible.
Often when I work with large files I stream them line by line with code like
with open(filename) as f:
    for line in f.readlines():
        do_something(line)
I know gzip has this
with gzip.open(filename, 'rt') as f:
    for line in f:
        do_something(line)
but it doesn't seem to work with .zst, so I am wondering if there are any libraries that can decompress and stream the decompressed data in a similar way. For example:
with zstlib.open(filename) as f:
    for line in f.zstreadlines():
        do_something(line)
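One possibility is the third-party zstandard package (pip install zstandard), which can decompress a stream incrementally. A minimal sketch of wrapping its stream reader for line-by-line iteration; zst_lines is just an illustrative helper name:
import io
import zstandard  # third-party: pip install zstandard

def zst_lines(filename):
    # Yield decoded lines from a .zst file without decompressing it to disk
    # or loading the whole thing into memory.
    with open(filename, 'rb') as raw:
        dctx = zstandard.ZstdDecompressor()
        with dctx.stream_reader(raw) as reader:
            text = io.TextIOWrapper(reader, encoding='utf-8')
            for line in text:
                yield line

for line in zst_lines(filename):
    do_something(line)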

Related

Python zlib-inflate alternative

I have a compressed file which I can uncompress at the Ubuntu command prompt using zlib-flate, as below:
zlib-flate -uncompress < inputfile > outfile
Here inputfile is the compressed file and outfile is the uncompressed version.
The compressed file contains byte data.
I have not found a way to do the same using Python.
Please advise.
If the entire file fits in memory, zlib can do exactly this in a very straightforward manner:
import zlib

with open("input_file", "rb") as input_file:
    input_data = input_file.read()

decompressed_data = zlib.decompress(input_data)

with open("output_file", "wb") as output_file:
    output_file.write(decompressed_data)
If the file is too large to fit in memory, you may instead want to use zlib.decompressobj(), which can do streaming but isn't quite as straightforward.
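A rough sketch of that streaming variant, keeping the same illustrative "input_file"/"output_file" names; the 64 KB chunk size is an arbitrary choice:
import zlib

decompressor = zlib.decompressobj()
with open("input_file", "rb") as src, open("output_file", "wb") as dst:
    while True:
        chunk = src.read(64 * 1024)  # read the compressed input in 64 KB pieces
        if not chunk:
            break
        dst.write(decompressor.decompress(chunk))
    dst.write(decompressor.flush())  # write out any remaining buffered data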

How can I open .ebc (ebcdic) file on my laptop via python?

UPD: I have opened the file and found this format. How can I decode 00000000...?
I need to open a .ebc file in Python. The size of this file is approximately 12 GB.
I have tried a huge number of tools and Python libraries for this, but it is obvious that I am doing something wrong. I can't find a suitable encoding.
I tried to read the file line by line because of its size.
Python ships with code pages for EBCDIC, for example "cp424" (Hebrew) and "cp500" (Western scripts).
Use it like this:
with open(path, encoding='cp500') as f:
    for line in f:
        # process one line of text
Note: if the file is 12 GB in size, you'll want to avoid calling f.read() or f.readlines(), as both would read the entire file into memory.
On a laptop, this is likely to freeze your system.
Instead, iterate over the contents line by line using Python's default line iteration.
If you just want to re-encode the file with a modern encoding, e.g. the very popular UTF-8, use the following pattern:
with open(in_path, encoding='cp500') as src, open(out_path, 'w', encoding='utf8') as dest:
    dest.writelines(src)
This re-encodes the file with a low memory footprint, as it reads, converts, and writes the contents line by line.

Possible to decompress bz2 in python to a file instead of memory

I've worked with decompressing and reading files on the fly in memory with the bz2 library. However, I've read through the documentation and can't seem to find a way to simply decompress the file into a brand new file on the file system, without holding the decompressed data in memory. Sure, you could read line by line using BZ2Decompressor and then write that to a file, but that would be insanely slow (we're decompressing massive files, 50 GB+). Is there some method or library I have overlooked that achieves the same functionality as the terminal command bz2 -d myfile.ext.bz2 in Python, without a hacky solution that calls that terminal command through a subprocess?
Example of why bz2 is so slow:
Decompressing that file via bz2 -d: 104 seconds
Analytics on an already decompressed file (just reading it line by line): 183 seconds
with open(file_src) as x:
    for l in x:
Decompressing on the fly and running the analytics: over 600 seconds (this time should be at most 104 + 183)
if file_src.endswith(".bz2"):
    bz_file = bz2.BZ2File(file_src)
    for l in bz_file:
You could use the bz2.BZ2File object which provides a transparent file-like handle.
(Edit: you seem to use that already, but don't use readlines(), whether on a binary or a text file; in your case the reads are too small, which explains why it's slow.)
Then use shutil.copyfileobj to copy to the write handle of your output file (you can adjust block size if you can afford the memory)
import bz2
import shutil

with bz2.BZ2File("file.bz2") as fr, open("output.bin", "wb") as fw:
    shutil.copyfileobj(fr, fw)
Even if the file is big, it doesn't take more memory than the block size. Adjust the block size like this:
shutil.copyfileobj(fr, fw, length=1000000)  # copy in 1 MB chunks
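For comparison, the incremental bz2.BZ2Decompressor mentioned in the question does not have to be driven line by line either; fed large chunks, it behaves much like the copyfileobj approach. A rough sketch, with an arbitrary 1 MB chunk size:
import bz2

# Note: BZ2Decompressor handles a single bz2 stream; multi-stream archives
# would need a new decompressor whenever decompressor.eof becomes True.
decompressor = bz2.BZ2Decompressor()
with open("file.bz2", "rb") as src, open("output.bin", "wb") as dst:
    while True:
        chunk = src.read(1000000)  # 1 MB of compressed data at a time
        if not chunk:
            break
        dst.write(decompressor.decompress(chunk))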
For smaller files whose decompressed contents fit in memory, you can use bz2.open to decompress the file and save it as a new, uncompressed file.
import bz2

# decompress the data
with bz2.open('compressed_file.bz2', 'rb') as f:
    uncompressed_content = f.read()

# store the decompressed file
with open('new_uncompressed_file.dat', 'wb') as f:
    f.write(uncompressed_content)

Parsing large, possibly compressed, files in Python

I am trying to parse a large file, line by line, for relevant information.
I may receive either an uncompressed or a gzipped file (I may have to handle zip files at a later stage).
I am using the following code, but I feel that, because I am not inside the with statement, I am not parsing the file line by line and am in fact loading the entire file into file_content in memory.
if ".gz" in FILE_LIST['INPUT_FILE']:
with gzip.open(FILE_LIST['INPUT_FILE']) as input_file:
file_content = input_file.readlines()
else:
with open(FILE_LIST['INPUT_FILE']) as input_file:
file_content = input_file.readlines()
for line in file_content:
# do stuff
Any suggestions for how I should handle this?
I would prefer not to unzip the file outside the code block, as this needs to be generic, and I would have to tidy up multiple files.
readlines reads the file fully. So it's a no-go for big files.
Doing two context blocks like you're doing and then using the input_file handle outside them wouldn't work either (operation on a closed file).
To get the best of both worlds, I would use a ternary conditional to choose which function (open or gzip.open) to use for the context block, then iterate over the lines.
open_function = gzip.open if ".gz" in FILE_LIST['INPUT_FILE'] else open
with open_function(FILE_LIST['INPUT_FILE'], "rt") as input_file:
    for line in input_file:
Note that I have used the "rt" mode to make sure we work on text, not binary (gzip.open defaults to binary, and plain "r" is also binary for gzip.open).
Alternative: open_function can be made generic so it doesn't depend on FILE_LIST['INPUT_FILE']:
open_function = lambda f: gzip.open(f, "rt") if ".gz" in f else open(f)
Once defined, you can reuse it at will:
with open_function(FILE_LIST['INPUT_FILE']) as input_file:
    for line in input_file:
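If more formats need to be handled later, the same idea extends to a suffix-to-opener mapping. A hedged sketch; the helper name open_any and the bz2 entry are illustrative additions, not part of the original answer:
import bz2
import gzip

# Map filename suffixes to the matching opener; anything else falls back to open().
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}

def open_any(path):
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt")  # text mode, as with "rt" above
    return open(path, "r")

with open_any(FILE_LIST['INPUT_FILE']) as input_file:
    for line in input_file:
        ...  # process the line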

for each line in file write line to an individual file in python

I have a text file which needs to be separated line by line into individual text files. So if the main file contains the strings:
foo
bar
bla
I would have 3 files, which could be named numerically: 1.txt (containing the string "foo"), 2.txt (containing the string "bar"), and 3.txt (containing the string "bla").
The straightforward way to do this would be to open three files for writing and write line by line into each file. But the problem is when we have a lot of lines, or do not know exactly how many there are. It seems painfully unnecessary to have to create
f1 = open('main_file', 'r')
f2 = open('1.txt', 'w')
f3 = open('2.txt', 'w')
f4 = open('3.txt', 'w')
Is there a way to put a counter in this operation, or a library which can handle this type of task?
Read the lines from the file in a loop, maintaining the line number; open a file with the name derived from the line number, and write the line into the file:
f1 = open('main_file', 'r')
for i, text in enumerate(f1):
    open(str(i + 1) + '.txt', 'w').write(text)
You would want something like this. Using with is the preferred way of dealing with files, since it automatically closes them for you at the end of the with block.
with open('main_file', 'r') as in_file:
    for line_number, line in enumerate(in_file):
        with open("{}.txt".format(line_number + 1), 'w') as out_file:
            out_file.write(line)
Firstly you could read the file into a list, where each element stands for a row in the file.
with open('/path/to/data', 'r') as f:
    data = [line.strip() for line in f]
Then you could use a for loop to write into files separately.
for counter in range(len(data)):
    with open('/path/to/file/' + str(counter), 'w') as f:
        f.write(data[counter])
Notes:
Since you're repeatedly opening numerous files, I highly suggest using
with open() as f:
    # your operation
The advantage of using this is that you can make sure Python releases the resources on time.
Details:
What's the advantage of using 'with .. as' statement in Python?
