python script slow read and write gz files

I have an xxx.wig.gz file that has 3,000,000,000 lines in the following format:
fixedStep chrom=chr1 start=1 step=1
0
0
0
0
0
1
2
3
4
5
6
7
8
9
10
...
fixedStep chrom=chr2 start=1 step=1
0
0
0
0
0
11
12
13
14
15
16
17
18
19
20
...
I want to break it down by "chrom": every time I read a line that starts with "fixedStep", I create a new file and close the old one. I also want 0/1 output, comparing each value to a threshold: pass = 1, otherwise 0.
Below is my Python script, which runs super slow (I am projecting it to finish in ~10 hours; so far 2 chromosomes are done after ~1 hour). Can someone help me improve it?
#!/bin/env python
import gzip
import re
import os
import sys

fn = sys.argv[1]
f = gzip.open(fn)
fo_base = os.path.basename(fn).rstrip('.wig').rstrip('.wig.gz')
fo_ext = '.bt.gz'
thres = 100
fo = None
for l in f:
    if l.startswith("fixedStep"):
        # new chromosome: close the previous output file and open a new one
        if fo is not None:
            fo.flush()
            fo.close()
        fon = re.search(r'chrom=(\w*)', l).group(0).split('=')[-1]
        fo = gzip.open(fo_base + "_" + fon + fo_ext, 'wb')
    else:
        # write 1 if the value passes the threshold, otherwise 0
        if int(l.strip()) >= thres:
            fo.write("1\n")
        else:
            fo.write("0\n")
if fo is not None:
    fo.flush()
    fo.close()
f.close()
PS: I assume awk could do it much faster, but I am not great with awk.

Thanks Summer for editing the text.
I added buffered read/write to the script and now it is several times faster (still relatively slow though):
import io
f = io.BufferedReader( gzip.open(fn) )
fo = io.BufferedWriter( gzip.open(fo_base + "." + fon + fo_ext,'wb') )
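For reference, here is a minimal sketch of what the same split-and-threshold loop can look like with batched writes and a lower gzip compression level (Python 3 text modes). The chunk size, compresslevel=1, and the suffix handling are assumptions to tune, not a drop-in replacement:

import gzip
import os
import re
import sys

fn = sys.argv[1]
base = os.path.basename(fn)
if base.endswith('.wig.gz'):
    base = base[:-len('.wig.gz')]
thres = 100

buf = []   # pending 0/1 lines for the current chromosome
fo = None

def flush(out, pending):
    # write all pending lines in one call instead of one write per value
    if out is not None and pending:
        out.write(''.join(pending))
    del pending[:]

with gzip.open(fn, 'rt') as f:
    for l in f:
        if l.startswith('fixedStep'):
            flush(fo, buf)
            if fo is not None:
                fo.close()
            chrom = re.search(r'chrom=(\w+)', l).group(1)
            # compresslevel=1 trades file size for much faster writes (assumption)
            fo = gzip.open(base + '_' + chrom + '.bt.gz', 'wt', compresslevel=1)
        else:
            buf.append('1\n' if int(l) >= thres else '0\n')
            if len(buf) >= 1000000:   # flush in large chunks (tunable assumption)
                flush(fo, buf)
    flush(fo, buf)
    if fo is not None:
        fo.close()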

Related

Appending the length of sentences to file

I found the length and the index of each sentence, and I want to save all of them to a new file in the format: index sentence length.
My code:
file = open("testing_for_tools.txt", "r")
lines_ = file.readlines()
for line in lines_:
    length = len(line) - 1
    print(length)
for item in lines_:
    print(lines_.index(item) + 1, item)
output:
64
18
31
31
23
36
21
9
1
1 i went to city center, and i bought xbox5 , and some other stuff
2 i will go to gym !
3 tomorrow i, sill start my diet!
4 i achive some and i need more ?
5 i lost lots of weights؟
6 i have to , g,o home,, then sleep ؟
7 i have things to do )
8 i hope so
9 o
Desired output, saved to a new file:
1 i went to city center, and i bought xbox5 , and some other stuff 64
2 i will go to gym ! 18
This can be achieved using the following code. Note the use of with ... as f which means we don't have to worry about closing the file after using it. In addition, I've used f-strings (requires Python 3.6), and enumerate to get the line number and concatenate everything into one string, which is written to the output file.
with open("test.txt", "r") as f:
lines_ = f.readlines()
with open("out.txt", "w") as f:
for i, line in enumerate(lines_, start=1):
line = line.strip()
f.write(f"{i} {line} {len(line)}\n")
Output:
1 i went to city center, and i bought xbox5 , and some other stuff 64
2 i will go to gym ! 18
If you wanted to sort the lines based on length, you could just put the following line after the first with block:
lines_.sort(key=len)
This would then give output:
1 i will go to gym ! 18
2 i went to city center, and i bought xbox5 , and some other stuff 64

Replacing a string in a file in python

What my text is:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
What I want:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = new label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
The code I am using:
import re

fo = open("test5.txt", "r+")
num_lines = sum(1 for line in open('test5.txt'))
count = 1
while count <= num_lines:
    line1 = fo.readline()
    j = line1[17:72]
    j1 = re.findall('\d+', j)
    k = map(int, j1)
    if k == [30411]:
        count1 = count - 4
        line2 = fo.readlines()[count1]
        r1 = line2[10:72]
        r11 = str(r1)
        r2 = "new label"
        r22 = str(r2)
        newdata = line2.replace(r11, r22)
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(newdata)
    else:
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(line1)
    count = count + 1
The problem is in the writing of the lines. Once 30411 is found, the code has to go 3 lines back and change the label to the new one. The new output text should have all the lines the same as before, except the label line. But it is not writing properly. Can anyone help?
Apart from many blood-curdling but noncritical problems, you are calling readlines() in the middle of an iteration using readline(), causing you to read lines not from the beginning of the file but from the current position of the fo handle, i.e. after the line containing 30411.
You need to open the input file again with a separate handle or (better) store the last 4 lines in memory instead of rereading the one you need to change.
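A minimal sketch of the second option, keeping a small rolling buffer of the most recent lines instead of rereading the file. The file names and the 3-lines-back offset come from the question; the buffer size and the way the $LABEL line is rewritten are assumptions:

from collections import deque

pending = deque()   # lines read but not yet written to the output

with open("test5.txt") as fin, open("output7.txt", "w") as fout:
    for line in fin:
        pending.append(line)
        if "30411" in line:
            # the $LABEL line sits 3 lines before the $SUBCASE ID line
            idx = len(pending) - 4
            if idx >= 0 and "$LABEL" in pending[idx]:
                pending[idx] = pending[idx].split("=")[0] + "= new label\n"
        # anything older than 4 lines can no longer change, so flush it
        while len(pending) > 4:
            fout.write(pending.popleft())
    fout.writelines(pending)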

python lines concatenate itself when writing to a file

I'm using Python to generate training and testing data for 10-fold cross-validation, and to write the datasets to 2x10 separate files (each fold writes a training file and a testing file). The weird thing is that when writing data to a file, there is always one line "missing". Actually, it might not even be missing: I discovered later that some line (only one line) in the middle of the file gets concatenated onto its previous line. An output file should look like the following (there should be 39150 lines in total):
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 4
50 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
However, I keep getting 39149 lines, and somewhere in the middle of the file seems to mess up like this:
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 450 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
My code:
import math
import random

def k_fold(myfile, myseed=1, k=10):
    # Load data
    data = open(myfile).readlines()
    # Shuffle input
    random.seed = myseed
    random.shuffle(data)
    # Compute partition size given input k
    len_total = len(data)
    len_part = int(math.floor(len_total / float(k)))
    # Create one partition per fold
    train = {}
    test = {}
    for i in range(k):
        test[i] = data[i * len_part:(i + 1) * len_part]
        train[i] = data[0:i * len_part] + data[(i + 1) * len_part:len_total]
    return train, test

if __name__ == "__main__":
    path = '....'  # some path and input
    input = '...'
    # Generate data
    [train, test] = k_fold(input)
    # Write data to files
    for i in range(10):
        train_old = path + 'tmp_train_' + str(i)
        test_old = path + 'tmp_test_' + str(i)
        trainF = open(train_old, 'a')
        testF = open(test_old, 'a')
        print(len(train[i]))
The strange thing is that I'm doing the same thing for the training and the testing dataset. The testing dataset outputs the correct file (4350 lines), but the training dataset has the above problem. I'm sure that the function returns the 39150 lines of training data, so I think the problem is in the file-writing part. Does anybody have any idea what I could have done wrong? Thanks in advance!
I assume that the first half of the double length line is the last line of the original file.
The lines returned by readlines (or by iterating over the file) will all still end with the LF character '\n' except the last line if the file doesn't end with an empty line. In that case, the shuffling that you do will put that '\n'-less line somewhere in the middle of 'data'.
Either append an empty line to your original file or strip all lines before processing and add the newline to each line when writing back to a file.
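A minimal sketch of the strip-and-rejoin option described above; the helper names and file names here are placeholders, not the question's actual paths:

import random

def load_lines(path):
    # strip the newline from every line so a missing final '\n' in the
    # input cannot leave one shuffled entry without a terminator
    with open(path) as f:
        return [line.rstrip('\n') for line in f]

def write_lines(path, lines):
    # add exactly one '\n' back per line when writing out
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')

data = load_lines('input.txt')
random.shuffle(data)
write_lines('tmp_train_0', data)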

Alternative to bash (awk command) with python

Context: I run calculations in a program that gives me result files.
On these result files (extension .h5), I can apply a Python script (which I cannot change) that gives me a square matrix:
oneptdm.py resultfile.h5
gives me, for example:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
points groups
1
2
3
...
in a file called oneptdm.dat
I want to extract the diagonal of this matrix. Usually I simply use bash:
awk '{ for (i=0; i<=NF; i++) if (NR >= 1 && NR == i) print i,$(i) }' oneptdm.dat > diagonal.dat
But for various reasons, I now have to do it with Python. How can I do that?
I could of course use subprocess to call awk again, but I would like to know if there is an alternative way to do it in a pure Python 2.6 script.
The result should be :
(line) (diagonal element)
1 1
2 6
3 11
4 16
You can try something like this:
with open('oneptdm.dat') as f:
    for i, l in enumerate(f):
        print '%d\t%s' % (i + 1, l.split()[i])
This should do the trick. It does assume that the file begins with a square matrix, and that assumption is used to limit the number of lines read from the file.
with open('oneptdm.dat') as f:
    line = next(f).split()
    for i in range(len(line)):
        print('{0}\t{1}'.format(i + 1, line[i]))
        try:
            line = next(f).split()
        except StopIteration:
            break
Output for your sample file:
1 1
2 6
3 11
4 16

How do I print a range of lines after a specific pattern into separate files when this pattern appears several times in an input file

Sorry for my previous post, I had no idea what I was doing. I am trying to cut out certain ranges of lines in a given input file and print that range to a separate file. This input file looks like:
18
generated by VMD
C 1.514895 -3.887949 2.104134
C 2.371076 -2.780954 1.718424
C 3.561071 -3.004933 1.087316
C 4.080424 -4.331872 1.114878
C 3.289761 -5.434047 1.607808
C 2.018473 -5.142150 2.078551
C 3.997237 -6.725186 1.709355
C 5.235126 -6.905640 1.295296
C 5.923666 -5.844841 0.553037
O 6.955216 -5.826197 -0.042920
O 5.269004 -4.590026 0.590033
H 4.054002 -2.184680 0.654838
H 1.389704 -5.910354 2.488783
H 5.814723 -7.796634 1.451618
O 1.825325 -1.537706 1.986256
H 2.319215 -0.796042 1.550394
H 3.390707 -7.564847 2.136680
H 0.535358 -3.663175 2.483943
18
generated by VMD
C 1.519866 -3.892621 2.109595
I would like to print every 100th frame, starting from the first frame, into its own file named "snapshot0.xyz" (the first frame is frame 0).
For example, the above input shows two snapshots. I would like to print lines 1-20 into their own file named snapshot0.xyz, then skip 100 snapshots (2000 lines) and print the 100th snapshot to snapshot1.xyz. My attempt is in Python, but you can choose grep, awk, sed, or Python.
My input file: frames.dat
#!/usr/bin/Python

mest = open('frames.dat', 'r')
test = mest.read().strip().split('\n')

for i in range(len(test)):
    if test[i] == '18':
        f = open("out" + `i` + ".dat", "w")
        for j in range(19):
            print >> f, test[j]
        f.close()
I suggest using the csv module for this input.
import csv

def strip_empty_columns(line):
    return filter(lambda s: s.strip() != "", line)

def is_count(line):
    return len(line) == 1 and line[0].strip().isdigit()

def is_float(s):
    try:
        float(s.strip())
        return True
    except ValueError:
        return False

def is_data_line(line):
    return len(line) == 4 and is_float(line[1]) and is_float(line[2]) and is_float(line[3])

with open('frames.dat', 'r') as mest:
    r = csv.reader(mest, delimiter=' ')
    current_count = 0
    frame_nr = 0
    outfile = None
    for line in r:
        line = strip_empty_columns(line)
        if is_count(line):
            if frame_nr % 100 == 0:
                outfile = open("snapshot%d.xyz" % frame_nr, "w+")
            elif outfile:
                outfile.close()
                outfile = None
            frame_nr += 1  # increment the frame counter every time you see this header line like '18'
        elif is_data_line(line):
            if outfile:
                outfile.write(" ".join(line) + "\n")
The opening post mentions writing every 100th frame to an output file named snapshot0.xyz. I assume the 0 should be a counter, or you would continuously overwrite the file. I updated the code with a frame_nr counter and a few lines which open/close an output file depending on frame_nr and write data if an output file is open.
This might work for you (GNU sed and csplit):
sed -rn '/^18/{x;/x{100}/z;s/^/x/;x};G;/\nx$/P' file | csplit -f snapshot -b '%d.xyz' -z - '/^18/' '{*}'
Filter every 100th frame using sed and pass that file to csplit to create the individual files.
