Split large .gz files with prefixes - python

Each of my fastq files contains about 20 million reads (or 20 million lines). I need to split the big fastq files into chunks of 1 million reads (or 1 million lines) each, for ease of further analysis. A fastq file is plain text, just like a .txt file.
My thought is to count lines and write them out after every 1 million lines. But the input file is gzip-compressed (fastq.gz); do I need to unzip it first?
How can I do this with Python?
I tried the following command:
zless XXX.fastq.gz | split -l 4000000 prefix
(unzip first, then split the file)
However, it doesn't seem to work with a prefix (I also tried "-prefix", and it still doesn't work). Also, with the split command the output looks like:
prefix-aa, prefix-ab...
If my prefix is XXX.fastq.gz, then the output will be XXX.fastq.gzab, which destroys the .fastq.gz extension.
So what I need is XXX_aa.fastq.gz, XXX_ab.fastq.gz (i.e. the split label as a suffix before the extension). How can I do that?

As posted here
zcat XXX.fastq.gz | split -l 1000000 --additional-suffix=".fastq" --filter='gzip > $FILE.gz' - "XXX_"

...I need to unzip it first.
No you don't, at least not by hand. The gzip module lets you open the compressed file, at which point you can read out a certain number of bytes and write them to a separate compressed file. See the examples at the bottom of the linked documentation for how to both read and write compressed files.
import gzip

with gzip.open(infile, 'rb') as inp:
    # infile, prefix, slicesize and num_slices are placeholders for your own values
    for n in range(num_slices):
        outslice = '{}_{:03d}.gz'.format(prefix, n)
        with gzip.open(outslice, 'wb') as outp:
            outp.write(inp.read(slicesize))
    else:  # only if you're not sure that you got the whole thing
        with gzip.open('{}_rest.gz'.format(prefix), 'wb') as outp:
            outp.write(inp.read())
Note that gzip-compressed files are not random-accessible so you will need to perform the operation in one go unless you want to decompress the source file to disk first.

You can read a gzipped file just like an uncompressed file:
>>> import gzip
>>> for line in gzip.open('myfile.txt.gz', 'r'):
...     process(line)
The process() function would handle the specific line-counting and conditional processing logic that you mentioned.
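For the splitting task itself, here is a minimal sketch along those lines (the numeric part suffix, chunk size, and output names are assumptions rather than anything from the question; itertools.islice is used so only one chunk is held in memory at a time):

import gzip
import itertools

lines_per_chunk = 1000000   # or 4000000 if one "read" is a 4-line fastq record
part = 0

with gzip.open('XXX.fastq.gz', 'rt') as inp:
    while True:
        # pull at most lines_per_chunk lines without reading the whole file
        chunk = list(itertools.islice(inp, lines_per_chunk))
        if not chunk:
            break
        with gzip.open('XXX_{:03d}.fastq.gz'.format(part), 'wt') as outp:
            outp.writelines(chunk)
        part += 1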

Related

Read N number of bytes from stdin of python and output to a temp file for further processing

I would like to read a fixed number of bytes from the stdin of a Python script and write them to a temporary file, batch by batch, for further processing. So once the first N bytes have been written to the temp file, I want to run the subsequent processing and then read the next N bytes from stdin. I am not sure what to iterate over in the outer loop instead of while True. This is an example of what I tried:
import sys
While True:
    data = sys.stdin.read(2330049)  # Number of bytes I would like to read in one iteration
    if data == "":
        break
    file1 = open('temp.fil', 'wb')  # temp file
    file1.write(data)
    file1.close()
    # further_processing on temp.fil (I think this can only be done after file1 is closed)
Two quick suggestions:
You should pretty much never do While True
Use Python 3
Are you trying to read from a file, or from actual standard in (like the output of a script piped to this)?
Here is an answer that I think will work for you if you are reading from a file, pieced together from some other answers listed at the bottom:
with open("in-file", "rb") as in_file, open("out-file", "wb") as out_file:
data = in_file.read(2330049)
while byte != "":
out_file.write(data)
If you want to read from actual standard in, I would read all of it in, then split it up by bytes. The only way this won't work is if you are trying to deal with constant streaming data...which I would most definitely not use standard in for.
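Alternatively, if you do want to process standard input in fixed-size batches, here is a minimal sketch (assuming Python 3, where sys.stdin.buffer is the binary view of stdin; the chunk size and temp-file name are taken from the question) that uses iter() with a sentinel instead of a bare while True:

import sys

CHUNK_SIZE = 2330049  # bytes per batch, as in the question

# iter(callable, sentinel) keeps calling read() until it returns b'' at EOF
for data in iter(lambda: sys.stdin.buffer.read(CHUNK_SIZE), b''):
    with open('temp.fil', 'wb') as temp_file:
        temp_file.write(data)
    # further processing of temp.fil goes here, after the file is closed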
The .encode('UTF-8') and .decode('hex') methods might be of use to you also.
Sources: https://stackoverflow.com/a/1035360/957648 & Python, how to read bytes from file and save it?

how to read a large compressed file in python without loading it all in memory

I have large log files in compressed format, i.e. largefile.gz; these are commonly 4-7 GB each.
Here's the relevant part of the code:
for filename in os.listdir(path):
    if not filename.startswith("."):
        with open(b, 'a') as newfile, gzip.GzipFile(path+filename, 'rb') as oldfile:
            # BEGIN Reads each remaining line from the log into a list
            data = oldfile.readlines()
            for line in data:
                parts = line.split()
After this the code does some calculations (basically totaling up the bytes) and writes to a file something like "total bytes for x criteria = y". All this works fine on a small file, but on a large file it kills the system.
What I think my program is doing is reading the whole file and storing it in data. Correct me if I'm wrong, but I think it is trying to put the whole log into memory first.
Question:
How can I read one line from the compressed file, process it, then move on to the next, without trying to store the whole thing in memory first? (Or is it really already doing that? I'm not sure, but based on the activity monitor my guess is that it is trying to go all in memory.)
Thanks
It isn't storing the entire content in memory until you tell it to. That is to say, instead of:
# BAD: stores your whole file's decompressed contents, split into lines, in data
data = oldfile.readlines()
for line in data:
    parts = line.split()
...use:
# GOOD: iterates a line at a time
for line in oldfile:
    parts = line.split()
...so you aren't storing the entire file in a variable. And obviously, don't store parts anywhere that persists past the one line either.
That easy.
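Putting it together, here is a minimal sketch of the streaming version (the directory, output file name, and the assumption that the byte count is the last whitespace-separated field are all hypothetical, since the question doesn't show that part):

import gzip
import os

path = 'logs/'          # hypothetical directory of .gz log files
total_bytes = 0

for filename in os.listdir(path):
    if not filename.startswith("."):
        with gzip.open(os.path.join(path, filename), 'rt') as oldfile:
            for line in oldfile:               # one line at a time, never the whole file
                parts = line.split()
                total_bytes += int(parts[-1])  # hypothetical: byte count is the last field

with open('totals.txt', 'a') as newfile:
    newfile.write("total bytes = {}\n".format(total_bytes))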

Invalid gz file after splitting

I have a 500 MB gz file and I split it as follows:
split -b 100m "file.gz" "file1.gz.part-"
after splitting the following files are obtained
file1.gz.part-aa
file1.gz.part-ab
file1.gz.part-ac
file1.gz.part-ad
file1.gz.part-ae
I am trying to iterate over the objects in the gzip file using gzip, as follows:
with gzip.open(filename) as f:
    for line in f:
This works for file1.gz.part-aa, but for the other 4 parts I get a
"Not a gzipped file" error.
A gzip file has a header that identifies it as a gzip file. After splitting, only the first file will have this header. Rejoin the files before processing.
You can split before you gzip:
split -l 300000 "file.txt" "tweets1.part-"
^ every 300000 lines
Notice that the input of split is NOT a *.gz file but the original line-oriented file.
Then gzip every part separately:
gzip tweets1.part-*
This will also remove the parts (gzip's -k/--keep option keeps them).
In python, you can now consume each part separately.
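For example, here is a minimal sketch for consuming the recompressed parts one after another (the glob pattern follows the naming above, and process() stands in for whatever you do with each line):

import glob
import gzip

# each recompressed part is now a complete, valid gzip file
for part in sorted(glob.glob('tweets1.part-*.gz')):
    with gzip.open(part, 'rt') as f:
        for line in f:
            process(line)   # stand-in for your per-line handling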

Read large file header (~9GB) inside tarfile without full extraction

I have ~1GB *.tbz files. Inside each of those files there is a single ~9GB file. I just need to read the header of this file, the first 1024 bytes.
I want to do this as fast as possible, as I have hundreds of these 1GB files to process. Full extraction takes about 1m30s.
I tried using full extraction:
tar = tarfile.open(fn, mode='r|bz2')
for item in tar:
    tar.extract(item)
and tarfile.getmembers(), but with no speed improvement:
tar = tarfile.open(fn, mode='r|bz2')
for member in tar.getmembers():
    f = tar.extractfile(member)
    headerbytes = f.read(1024)
    headerdict = parseHeader(headerbytes)
The getmembers() method is what's taking all the time there.
Is there any way I can do this?
I think you should use the standard library bz2 interface. .tbz is the file extension for tar archives compressed with bzip2 (tar's -j option).
As #bbayles pointed out in the comments, you can open your file as a bz2.BZ2File and use seek and read:
read([size])
    Read at most size uncompressed bytes, returned as a string. If the size argument is negative or omitted, read until EOF is reached.
seek(offset[, whence])
    Move to a new file position. Argument offset is a byte count.
import bz2

f = bz2.BZ2File(path)
f.seek(512)
headerbytes = f.read(1024)
You can then parse that with your functions.
headerdict = parseHeader(headerbytes)
If you're sure that every tar archive contains only a single file, you can simply skip the first 512 bytes when reading the decompressed tar stream (NOT 512 bytes of the inner file itself, of course), because the tar format has a padded, fixed-size header, after which your "real" content is stored.
A simple
f.seek(512)
instead of looping over getmembers() should do the trick.
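Combining the two answers, here is a minimal sketch for a batch of these archives (the glob pattern and the 1024-byte header size are assumptions; parseHeader is the question's own function; this relies on each .tbz holding exactly one member):

import bz2
import glob

HEADER_SIZE = 1024   # first 1024 bytes of the embedded file, as in the question

for fn in glob.glob('*.tbz'):
    # open the .tbz as a plain bzip2 stream and skip the 512-byte tar header block
    with bz2.BZ2File(fn) as f:
        f.seek(512)
        headerbytes = f.read(HEADER_SIZE)
    headerdict = parseHeader(headerbytes)   # the question's own parser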

Reading a .txt file in python

I have used the following code to read a .txt file:
f = os.open(os.path.join(self.dirname, self.filename), os.O_RDONLY)
And when I want to output the content I use this:
os.read(f, 10)
This reads only 10 bytes from the beginning of the file, while I need to read the entire content, perhaps by using a value such as -1. What should I do?
You have two options:
Call os.read() repeatedly.
Open the file using the open() built-in (as opposed to os.open()), and just call f.read() with no arguments.
The second approach carries certain risk, in that you might run into memory issues if the file is very large.
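Here is a minimal sketch of both options (the directory and file name are stand-ins for the question's self.dirname and self.filename, and the 4096-byte read size is an arbitrary choice):

import os

# stand-in for the question's os.path.join(self.dirname, self.filename)
path = os.path.join('.', 'notes.txt')

# Option 1: keep using os.open()/os.read() and call os.read() repeatedly
fd = os.open(path, os.O_RDONLY)
chunks = []
while True:
    chunk = os.read(fd, 4096)        # arbitrary chunk size
    if not chunk:                    # empty bytes object means end of file
        break
    chunks.append(chunk)
os.close(fd)
content = b''.join(chunks)

# Option 2: the open() built-in can read the whole file in one call
with open(path, 'rb') as f:
    content = f.read()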
