Invalid gz file after splitting - python

I have a 500MB gz file and I have split it as follows:
split -b 100m "file.gz" "file1.gz.part-"
After splitting, the following files are obtained:
file1.gz.part-aa
file1.gz.part-ab
file1.gz.part-ac
file1.gz.part-ad
file1.gz.part-ae
I am trying to iterate over the lines in the gzip file using the gzip module as follows:
with gzip.open(filename) as f:
    for line in f:
        ...  # process each line
This works for file1.gz.part-aa, but for the other 4 parts I get a "Not a gzipped file" error.

A gzip file has a header that identifies it as a gzip file. After a byte-level split, only the first part contains this header, which is why gzip.open() rejects the others. Rejoin the parts before processing.
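In the shell you can simply do cat file1.gz.part-* > rejoined.gz. A minimal Python sketch of the same rejoin (the part glob and the output name rejoined.gz are assumptions based on the split command above):

import glob
import gzip

# Concatenate the raw bytes of every part back into a single .gz file.
parts = sorted(glob.glob("file1.gz.part-*"))
with open("rejoined.gz", "wb") as out:
    for part in parts:
        with open(part, "rb") as p:
            out.write(p.read())

# The rejoined file is a valid gzip stream again.
with gzip.open("rejoined.gz", "rt") as f:
    for line in f:
        pass  # process each line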

You can split before you gzip:
split -l 300000 "file.txt" "tweets1.part-"
^ every 300000 lines
Notice that the input of split is NOT a *.gz file but the original line-oriented file.
Then gzip every part separately:
gzip tweets1.part-*
This will also remove the uncompressed parts (gzip's -k / --keep option keeps them).
In python, you can now consume each part separately.
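For example, a small sketch of consuming the parts produced above (the naming tweets1.part-*.gz follows the commands in this answer):

import glob
import gzip

# After `gzip tweets1.part-*` each part is its own valid gzip file,
# so every one can be opened and iterated independently.
for part in sorted(glob.glob("tweets1.part-*.gz")):
    with gzip.open(part, "rt") as f:
        for line in f:
            pass  # process each line of this part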

Related

Python - doc to docx file converter input, file path from a txt file

Hi stackoverflow community,
Situation: I'm trying to run this converter, found here.
However, what I want is for it to read an array of file paths from a text file and convert them.
The reason is that these file paths are filtered manually, so I don't have to convert unnecessary files; there is a large number of unnecessary files in the folder.
How can I go about this? Thank you.
with open("file_path",'r') as file_content:
content=file_content.read()
content=content.split('\n')
You can read the data of the file using the method above, Then covert the data of file into a list(or any other iteratable data type) so that we can use it with for loop.I used content=content.split('\n') to split the data of content by '\n' (Every time you press enter key, a new line character '\n' is sended), you can use any other character to split.
for i in content:
# the code you want to execute
Note: some useful links:
Split
File writing
File read and write
By looking at your situation, I guess this is what you want (to only convert certain files in a directory), in which case you don't need an extra '.txt' file to process:
import os

for f in os.listdir(path):
    if f.startswith("Prelim") and f.endswith(".doc"):
        convert(f)
But if for some reason you want to stick with the ".txt" processing, this may help:
with open("list.txt") as f:
lines = f.readlines()
for line in lines:
convert(line)

Unable to do regex based operations in a gzip file in Python

I have a .gz file that contains several strings. My requirement is that I have to do several regex-based operations on the data contained in the .gz file.
I get the following error when I use re.findall() on the lines of data extracted:
File "C:\Users\santoshn\AppData\Local\Continuum\anaconda3\lib\re.py", line 182, in search
return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
I have tried opening with option "r" with the same result.
Do I have to decompress this file first and then do the regex operations, or is there a way to address this?
The data contains several text lines; an example line is listed below:
ThreadContext 432 mov (8) <8;1,2>r2 <8;3,3>r4 Instruction count
I was able to fix this issue by reading the file using gzip.open():
with gzip.open(file,"rb") as f:
binFile = f.readlines()
After this file is read, each line in the file is converted to 'ascii'. Subsequently all regex operations like re.search() and re.findall() work fine.
for line in binFile:  # go over each line
    line = line.strip().decode('ascii')
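Putting the two pieces together, a minimal sketch of the fix (the file name trace.gz and the regex are placeholders for illustration, not from the original post):

import gzip
import re

# Hypothetical pattern matching operands like <8;1,2>r2 from the example line.
pattern = re.compile(r"<\d+;\d+,\d+>r\d+")

with gzip.open("trace.gz", "rb") as f:
    for raw in f:
        line = raw.strip().decode("ascii")  # bytes -> str
        matches = pattern.findall(line)     # a string pattern on a str: no TypeError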
I know this is an old question, but I stumbled on it (as well as the other links in the comments) when trying to sort out this same issue. Rather than opening the gzip file as binary ("rb") and then decoding it to ASCII, the gzip docs led me to simply open the GZ file as text, which allows normal string manipulation after that:
with gzip.open(filepath,"rt") as f:
data = f.readlines()
for line in data:
split_string = date_time_pattern.split(line)
# Whatever other string manipulation you may need.
The date_time_pattern variable is simply my compiled regex for different log date formats.
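For illustration only, such a pattern might be compiled along these lines (the exact timestamp formats are an assumption; the answer does not show them):

import re

# Hypothetical compiled regex covering two common log timestamp styles.
date_time_pattern = re.compile(
    r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"   # e.g. 2023-01-31 12:34:56
    r"|\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}"        # e.g. Jan  3 12:34:56
)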

Split equivalent of gzip files in python

I'm trying to replicate this Bash command, which produces parts of 50MB each from a gzipped file:
split -b 50m "file.dat.gz" "file.dat.gz.part-"
My attempt at the Python equivalent:
import gzip

infile_name = "file.dat.gz"
chunk = 50 * 1024 * 1024  # 50MB

with gzip.open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), '')):
        print(n, chunk)
        with gzip.open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)
This produces parts of about 15MB each when gzipped; when I gunzip the files, they are 50MB each.
How do I split the gzipped file in Python so that the split-up files are each 50MB before gunzipping?
I don't believe that split works the way you think it does. It doesn't split the gzip file into smaller gzip files. I.e. you can't call gunzip on the individual files it creates. It literally breaks up the data into smaller chunks and if you want to gunzip it, you have to concatenate all the chunks back together first. So, to emulate the actual behavior with Python, we'd do something like:
infile_name = "file.dat.gz"
chunk = 50*1024*1024 # 50MB
with open(infile_name, 'rb') as infile:
for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
print(n, chunk)
with open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
outfile.write(raw_bytes)
In reality we'd read multiple smaller input chunks to make one output chunk to use less memory.
We might be able to break the file into smaller files that we can individually gunzip, and still meet our target size. Using something like a BytesIO stream, we could gunzip the file and gzip it into that memory stream until it reached the target size, then write it out and start a new BytesIO stream.
With compressed data, you have to measure the size of the output, not the size of the input as we can't predict how well the data will compress.
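A rough sketch of that BytesIO idea, reusing the names from the snippets above (the measured size is only approximate, since the compressor buffers data internally before writing it out):

import gzip
import io

infile_name = "file.dat.gz"
target = 50 * 1024 * 1024   # 50MB of *compressed* output per part
read_chunk = 1024 * 1024    # decompress 1MB at a time

def write_part(buf, out, part):
    out.close()  # finish the gzip stream (writes the trailer into buf)
    with open('{}.part-{}'.format(infile_name[:-3], part), 'wb') as f:
        f.write(buf.getvalue())

part = 0
buf = io.BytesIO()
out = gzip.GzipFile(fileobj=buf, mode='wb')

with gzip.open(infile_name, 'rb') as infile:
    while True:
        data = infile.read(read_chunk)
        if not data:
            break
        out.write(data)
        if buf.tell() >= target:          # measure the compressed output, not the input
            write_part(buf, out, part)
            part += 1
            buf = io.BytesIO()
            out = gzip.GzipFile(fileobj=buf, mode='wb')

write_part(buf, out, part)                # write whatever is left

Unlike the byte-level split, each part produced this way is itself a valid gzip file.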
Here's a solution for emulating something like the split -l (split on lines) command option that will allow you to open each individual file with gunzip.
import io
import os
import shutil
from xopen import xopen

def split(infile_name, num_lines):
    infile_name_fp = infile_name.split('/')[-1].split('.')[0]  # get first part of file name
    cur_dir = '/'.join(infile_name.split('/')[0:-1])
    out_dir = f'{cur_dir}/{infile_name_fp}_split'
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)  # create in same folder as the original .csv.gz file

    m = 0
    part = 0
    buf = io.StringIO()  # initialize buffer
    with xopen(infile_name, 'rt') as infile:
        for line in infile:
            if m < num_lines:  # fill up buffer
                buf.write(line)
                m += 1
            else:  # write buffer to file
                with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz',
                           mode='wt', compresslevel=6) as outfile:
                    outfile.write(buf.getvalue())
                part += 1
                buf = io.StringIO()  # flush buffer -> faster than seek(0); truncate(0)
                buf.write(line)      # keep the current line so it isn't dropped
                m = 1

    # write whatever is left in the buffer to a file
    with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz',
               mode='wt', compresslevel=6) as outfile:
        outfile.write(buf.getvalue())
    buf.close()
Usage:
split('path/to/myfile.csv.gz', num_lines=100000)
Outputs a folder with split files at path/to/myfile_split.
Discussion: I've used xopen here for additional speed, but you may choose to use gzip.open if you want to stay with Python native packages. Performance-wise, I've benchmarked this to take about twice as long as a solution combining pigz and split. It's not bad, but could be better. The bottleneck is the for loop and the buffer, so maybe rewriting this to work asynchronously would be more performant.
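If you do want to stay with the standard library only, a hedged sketch of the same line-based split using gzip.open instead of xopen (the output directory and part naming here are my own assumptions, not part of the answer above):

import gzip
import io
import os

def split_gz_by_lines(infile_name, num_lines, out_dir='split_out'):
    """Write every `num_lines` lines of a .gz file to its own gzipped part."""
    os.makedirs(out_dir, exist_ok=True)
    part = 0
    count = 0
    buf = io.StringIO()
    with gzip.open(infile_name, 'rt') as infile:
        for line in infile:
            buf.write(line)
            count += 1
            if count == num_lines:
                with gzip.open(os.path.join(out_dir, f'part-{part:05d}.gz'), 'wt') as out:
                    out.write(buf.getvalue())
                part += 1
                count = 0
                buf = io.StringIO()
    if count:  # whatever is left over
        with gzip.open(os.path.join(out_dir, f'part-{part:05d}.gz'), 'wt') as out:
            out.write(buf.getvalue())

Expect this to be noticeably slower than the xopen/pigz combination, since the standard-library gzip module is single-threaded.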

Read large file header (~9GB) inside tarfile without full extraction

I have ~1GB *.tbz files. Inside each of those files there is a single ~9GB file. I just need to read the header of this file, the first 1024 bytes.
I want to do this as fast as possible, as I have hundreds of these 1GB files to process. It takes about 1m30s to extract each one.
I tried using full extraction:
tar = tarfile.open(fn, mode='r|bz2')
for item in tar:
    tar.extract(item)
and tarfile.getmembers(), but with no speed improvement:
tar = tarfile.open(fn, mode='r|bz2')
for member in tar.getmembers():
    f = tar.extractfile(member)
    headerbytes = f.read(1024)
    headerdict = parseHeader(headerbytes)
The getmembers() method is what's taking all the time there.
Is there any way I can do this faster?
I think you should use the standard library bz2 interface. .tbz is the file extension for tar archives compressed with tar's -j option, which specifies the bzip2 format.
As #bbayles pointed out in the comments, you can open your file as a bz2.BZ2File and use seek and read:
read([size])
    Read at most size uncompressed bytes, returned as a string. If the size argument is negative or omitted, read until EOF is reached.
seek(offset[, whence])
    Move to new file position. Argument offset is a byte count.
f = bz2.BZ2File(path)
f.seek(512)
headerbytes = f.read(1024)
You can then parse that with your functions.
headerdict = parseHeader(headerbytes)
If you're sure that every tar archive will contain only a single file, you can simply skip the first 512 bytes of the decompressed data (the tar header, not anything in the raw .tbz file itself), because the tar file format has a padded (fixed-size) header, after which your "real" content is stored.
A simple
f.seek(512)
instead of looping over getmembers() should do the trick.
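Putting the two answers together, a minimal sketch under the assumption that each .tbz really holds exactly one member (the file name archive.tbz is a placeholder):

import bz2

def read_member_header(path):
    # Decompress on the fly, skip the fixed 512-byte tar header,
    # then read the first 1024 bytes of the single member.
    with bz2.BZ2File(path) as f:
        f.seek(512)
        return f.read(1024)

headerbytes = read_member_header("archive.tbz")
# headerdict = parseHeader(headerbytes)   # parseHeader comes from the question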

Split large .gz files with prefixes

Each of my fastq files has about 20 million reads (or 20 million lines). Now I need to split the big fastq files into chunks, each with only 1 million reads (or 1 million lines), for ease of further analysis. A fastq file is just like a .txt file.
My thought is to just count the lines and write out a chunk after every 1 million lines. But the input file is in .gz compressed form (fastq.gz); do I need to unzip it first?
How can I do this with python?
I tried the following command:
zless XXX.fastq.gz |split -l 4000000 prefix
(decompress first, then split the file)
However, it seems it doesn't work with a prefix (I tried "-prefix"); it still doesn't work. Also, with the split command the output is like:
prefix-aa, prefix-ab...
If my prefix is XXX.fastq.gz, then the output will be XXX.fastq.gzab, which will destroy the .fastq.gz format.
So what I need is XXX_aa.fastq.gz, XXX_ab.fastq.gz (i.e. a suffix). How can I do that?
As posted here
zcat XXX.fastq.gz | split -l 1000000 --additional-suffix=".fastq" --filter='gzip > $FILE.gz' - "XXX_"
...I need to unzip it first.
No, you don't, at least not by hand. The gzip module will let you open the compressed file, at which point you can read out a certain number of bytes and write them to a separate compressed file. See the examples at the bottom of the linked documentation to see how to both read and write compressed files.
with gzip.open(infile, 'rb') as inp:
    for <some number of loops>:
        with gzip.open(outslice, 'wb') as outp:
            outp.write(inp.read(slicesize))
    else:  # only if you're not sure that you got the whole thing
        with gzip.open(outslice, 'wb') as outp:
            outp.write(inp.read())
Note that gzip-compressed files are not random-accessible so you will need to perform the operation in one go unless you want to decompress the source file to disk first.
You can read a gzipped file just like an uncompressed file:
>>> import gzip
>>> for line in gzip.open('myfile.txt.gz', 'r'):
...     process(line)
The process() function would handle the specific line-counting and conditional processing logic that you mentioned.
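For completeness, a hedged pure-Python sketch of the whole task, writing gzipped chunks with the XXX_aa.fastq.gz naming the question asks for (the chunk size and helper name are my own choices for illustration):

import gzip
import itertools
import string

def split_fastq_gz(infile="XXX.fastq.gz", lines_per_chunk=4000000):
    # Two-letter suffixes aa, ab, ac, ... mirroring split's naming scheme.
    stem = infile[:-len(".fastq.gz")]
    suffixes = ("".join(p) for p in itertools.product(string.ascii_lowercase, repeat=2))
    out, written = None, 0
    with gzip.open(infile, "rt") as inp:
        for line in inp:
            if out is None or written == lines_per_chunk:
                if out is not None:
                    out.close()
                out = gzip.open("{}_{}.fastq.gz".format(stem, next(suffixes)), "wt")
                written = 0
            out.write(line)
            written += 1
    if out is not None:
        out.close()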
