Replacing a string in a file in Python

What my text is:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
What I want:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = new label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
The code I am using:
import re

fo = open("test5.txt", "r+")
num_lines = sum(1 for line in open('test5.txt'))
count = 1
while (count <= num_lines):
    line1 = fo.readline()
    j = line1[17:72]
    j1 = re.findall('\d+', j)
    k = map(int, j1)
    if (k == [30411]):
        count1 = count - 4
        line2 = fo.readlines()[count1]
        r1 = line2[10:72]
        r11 = str(r1)
        r2 = "new label"
        r22 = str(r2)
        newdata = line2.replace(r11, r22)
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(newdata)
    else:
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(line1)
    count = count + 1
The problem is in the writing of the lines. Once 30411 is found, the code has to go back three lines and change the label to the new one. The output file should have all the lines the same as before, except the label line. But it is not writing properly. Can anyone help?

Apart from many blood-curdling but noncritical problems, you are calling readlines() in the middle of an iteration using readline(), causing you to read lines not from the beginning of the file but from the current position of the fo handle, i.e. after the line containing 30411.
You need to open the input file again with a separate handle or (better) store the last 4 lines in memory instead of rereading the one you need to change.
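For illustration, here is a minimal sketch of that buffered approach (assuming, as in the sample, that the $LABEL line always sits exactly three lines above the matching $SUBCASE line; the file names are taken from the question):

import re

# Keep a sliding window of recent lines so the $LABEL line can be
# rewritten once the matching $SUBCASE ID turns up three lines later.
with open("test5.txt") as fin, open("output7.txt", "w") as fout:
    window = []
    for line in fin:
        window.append(line)
        if re.search(r'\$SUBCASE ID\s*=\s*30411', line):
            # assumption: the label is three lines above this line
            window[-4] = re.sub(r'(\$LABEL\s*=).*', r'\1 new label', window[-4])
        while len(window) > 4:  # flush lines that can no longer change
            fout.write(window.pop(0))
    fout.writelines(window)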

Fast I/O when working with multiple files

I have two input files and I want to mix them and output the result into a third file. In the following I will use a toy example to explain the format of the files and the desired output. Each file contains a repeated 4-line pattern (each repetition with a different sequence); I only include a single 4-line block:
input file 1:
#readheader1
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
input file 2:
#readheader2
AATTAATT
+
FFFFFFFF
...
desired output:
#readheader1_AATTAATT
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
So I want to attach, with an underscore, the short sequence found in the second line of every 4-line block of the second file to the first line of every 4-line block of the first file, and simply output the 2nd, 3rd, and 4th lines of every 4-line block of the first file, as is, into the output.
I am looking for any script (Linux bash, Python, C++, etc.) that can optimize what I have below:
I wrote this code to do the task, but I found it to be slow (it takes more than a day for inputs of 60 GB and 15 GB); note that the input files are in fastq.gz format, so I open them using gzip:
...
r1_file = gzip.open(r1_file_name, 'r')  # input file 1
i1_file = gzip.open(i1_file_name, 'r')  # input file 2
out_file_R1 = gzip.open('_R1_barcoded.fastq.gz', 'wb')  # output file
r1_header = ''
r1_seq = ''
r1_orient = ''
r1_qual = ''
i1_seq = ''
cnt = 1
with gzip.open(r1_file_name, 'r') as r1_file:
    for r1_line in r1_file:
        if cnt == 1:
            r1_header = str.encode(r1_line.decode("ascii").split(" ")[0])
            next(i1_file)
        if cnt == 2:
            r1_seq = r1_line
            i1_seq = next(i1_file)
        if cnt == 3:
            r1_orient = r1_line
            next(i1_file)
        if cnt == 4:
            r1_qual = r1_line
            next(i1_file)
            out_4line = r1_header + b'_' + i1_seq + r1_seq + r1_orient + r1_qual
            out_file_R1.write(out_4line)
            cnt = 0
        cnt += 1
i1_file.close()
out_file_R1.close()
Now that I have the two outputs made from the two datasets, I wish to interleave the output files: 4 lines from the first file, 4 lines from the second file, 4 lines from the first, and so on...
Using the paste utility (from GNU coreutils) and GNU sed:
paste file1 file2 |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
If the files are gzipped, then use:
paste <(gzip -dc file1.gz) <(gzip -dc file2.gz) |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
Note: this assumes there are no tab characters in file1 and file2.
Explanation: assume that file1 and file2 contain these lines:
File1:
Header1
ACACACACAC
XX
FFFFFFFFFFFF
File2:
Header2
AATTAATT
YY
GGGGGG
After the paste command, lines are merged, separated by TABs:
Header1\tHeader2
ACACACACAC\tAATTAATT
XX\tYY
FFFFFFFFFFFF\tGGGGGG
The \t above denotes a tab character. These lines are fed to sed. sed reads the first line, and the pattern space becomes
Header1\tHeader2
The N command adds a newline to the pattern space, then appends the next line (ACACACACAC\tAATTAATT) of input to the pattern space. Pattern space becomes
Header1\tHeader2\nACACACACAC\tAATTAATT
and is matched against regex \t.*\n([^\t]*)\t(.*) as denoted below.
Header1\tHeader2\nACACACACAC\tAATTAATT
       ||^^^^^^^||^^^^^^^^^^||^^^^^^^^
       \t  .*   \n ([^\t]*) \t  (.*)
       ||       ||    \1    ||  \2
The \n denotes a newline character. Then the matching part is replaced with _\2\n\1 by the s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/ command. Pattern space becomes
Header1_AATTAATT\nACACACACAC
The two N commands read the next two lines. Now pattern space is
Header1_AATTAATT\nACACACACAC\nXX\tYY\nFFFFFFFFFFFF\tGGGGGG
The s/\t[^\n]*//g command removes all parts between a TAB (inclusive) and newline (exclusive). After this operation the final pattern space is
Header1_AATTAATT\nACACACACAC\nXX\nFFFFFFFFFFFF
which is printed out as
Header1_AATTAATT
ACACACACAC
XX
FFFFFFFFFFFF
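If you would rather stay in Python, the same merge can be sketched by reading both files in lockstep, four lines at a time. A minimal version, assuming plain-text 4-line records and the toy file names used above (for .gz inputs, gzip.open with mode 'rt' can stand in for open):

from itertools import islice

with open("file1") as f1, open("file2") as f2, open("file.out", "w") as out:
    while True:
        rec1 = list(islice(f1, 4))  # one 4-line record from file 1
        rec2 = list(islice(f2, 4))  # the matching record from file 2
        if len(rec1) < 4 or len(rec2) < 4:
            break  # stop at the end of either file
        # attach file 2's sequence line to file 1's header with an underscore
        out.write(rec1[0].rstrip("\n") + "_" + rec2[1])
        out.writelines(rec1[1:])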

Python print .psl format without quotes and commas

I am working on a Linux system using python3 with a file in .psl format, common in genetics. This is a tab-separated file that contains some cells with comma-separated values. A small example file with some of the features of a .psl is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv

def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in csv_reader:
            # Extract only rows matching chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the format of the input file, with no additional quotes or commas added. I can't seem to get this part right; additional brackets, quotes, and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk statement:
awk '$4 == 9' input.psl > output.psl
But with Python you could solve it like this:
write_psl = open("output.psl", "w")
with open("input.psl") as file:
    for line in file:
        splitted_line = line.split()
        if splitted_line[3] == '9':
            out_line = '\t'.join(splitted_line)
            write_psl.write(out_line + "\n")
write_psl.close()
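Alternatively, if you want to keep the csv module from your original attempt, the key is to iterate over the reader (not the raw file handle) and emit rows with a csv.writer, which adds no quotes as long as the fields contain no tabs or newlines. A sketch under those assumptions:

import csv

with open("input.psl") as input_psl, open("output.psl", "w", newline="") as output_psl:
    reader = csv.reader(input_psl, delimiter='\t')
    writer = csv.writer(output_psl, delimiter='\t')
    for row in reader:
        # keep only rows whose fourth column is the chromosome of interest
        if row[3] == '9':
            writer.writerow(row)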

Separate lines in Python

I have a .txt file with 3 different columns. The first one is just numbers. The second one is numbers from 0 to 7. The final one is a sentence. I want to keep them in different lists so I can match them by their numbers. I want to write a function. How can I separate them into different lists without disrupting them?
The example of .txt:
1234 0 my name is
6789 2 I am coming
2346 1 are you new?
1234 2 Who are you?
1234 1 how's going on?
And I have to keep them like this:
----1----
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
----2----
2346 1 are you new?
----3----
6789 2 I am coming
What I've tried so far:
inputfile = open('input.txt', 'r').read()
m_id = []
p_id = []
packet_mes = []
input_file = inputfile.split(" ")
print(input_file)
input_file = line.split()
m_id = [int(x) for x in input_file if x.isdigit()]
p_id = [x for x in input_file if not x.isdigit()]
With your current approach, you are reading the entire file as a single string and performing a split on a space (you'd much rather split on newlines instead, because each line is separated by a newline). Furthermore, you're not segregating your data into separate columns properly.
You have 3 columns. You can split each line into 3 parts using str.split(None, 2). The None implies splitting on whitespace. Each group will be stored as key-list pairs inside a dictionary. Here I use an OrderedDict in case you need to maintain order, but you can just as easily declare o = {} as a normal dictionary with the same grouping (but no order!).
from collections import OrderedDict

o = OrderedDict()
with open('input.txt', 'r') as f:
    for line in f:
        i, j, k = line.strip().split(None, 2)
        o.setdefault(i, []).append([int(i), int(j), k])
print(dict(o))
{'1234': [[1234, 0, 'my name is'],
[1234, 2, 'Who are you?'],
[1234, 1, "how's going on?"]],
'6789': [[6789, 2, 'I am coming']],
'2346': [[2346, 1, 'are you new?']]}
Always use the with...as context manager when working with file I/O - it makes for clean code. Also, note that for larger files, iterating over each line is more memory efficient.
Maybe you want something like this:
import re

# Collect data from input file
h = {}
with open('input.txt', 'r') as f:
    for line in f:
        res = re.match(r"^(\d+)\s+(\d+)\s+(.*)$", line)
        if res:
            if not res.group(1) in h:
                h[res.group(1)] = []
            h[res.group(1)].append((res.group(2), res.group(3)))

# Output result
for i, x in enumerate(sorted(h.keys())):
    print("-------- %s -----------" % (i + 1))
    for y in sorted(h[x]):
        print("%s %s %s" % (x, y[0], y[1]))
The result is as follows (add more newlines if you like):
-------- 1 -----------
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
-------- 2 -----------
2346 1 are you new?
-------- 3 -----------
6789 2 I am coming
It's based on regexes (the re module in Python). This is a good tool when you want to match simple line-based patterns.
Here it relies on spaces as column separators, but it can just as easily be adapted for fixed-width columns.
The results are collected in a dictionary of lists, each list containing tuples (pairs) of position and text.
The program waits until output time to sort the items.
The following is quite ugly code, but it's quite easy to understand.
raw = []
with open("input.txt", "r") as file:
    for x in file:
        raw.append(x.strip().split(None, 2))
raw = sorted(raw)

title = raw[0][0]
refined = []
cluster = []
for x in raw:
    if x[0] == title:
        cluster.append(x)
    else:
        refined.append(cluster)
        cluster = []
        title = x[0]
        cluster.append(x)
refined.append(cluster)

for number, group in enumerate(refined):
    print("-" * 10 + str(number) + "-" * 10)
    for line in group:
        print(*line)
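For what it's worth, the clustering step can also be expressed with itertools.groupby, which groups consecutive items sharing a key (so the rows must be sorted first). A short sketch on the same input.txt:

from itertools import groupby

with open("input.txt") as file:
    rows = sorted(line.strip().split(None, 2) for line in file)

# group consecutive rows that share the first column
for number, (key, group) in enumerate(groupby(rows, key=lambda r: r[0]), start=1):
    print("-" * 10 + str(number) + "-" * 10)
    for row in group:
        print(*row)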

Run a search over the items of a list, and save each search into a file

I have a data.dat file that has 3 columns; the 3rd column is just the numbers 1 to 6 repeated again and again:
( In reality, column 3 has numbers from 1 to 1917, but for a minimal working example, let's stick to 1 to 6 )
# Title
127.26 134.85 1
127.26 135.76 2
127.26 135.76 3
127.26 160.97 4
127.26 160.97 5
127.26 201.49 6
125.88 132.67 1
125.88 140.07 2
125.88 140.07 3
125.88 165.05 4
125.88 165.05 5
125.88 203.06 6
137.20 140.97 1
137.20 140.97 2
137.20 148.21 3
137.20 155.37 4
137.20 155.37 5
137.20 184.07 6
I would like to:
1) extract the lines that contain 1 in the 3rd column and save them to a file called mode_1.dat.
2) extract the lines that contain 2 in the 3rd column and save them to a file called mode_2.dat.
3) extract the lines that contain 3 in the 3rd column and save them to a file called mode_3.dat.
.
.
.
6) extract the lines that contain 6 in the 3rd column and save them to a file called mode_6.dat.
In order to accomplish this, I have:
a) defined a variable factor = 6
b) created a one_to_factor list that has numbers 1 to 6
c) used a re.search statement to extract the lines for each value of one_to_factor; the %s is replaced by each i from the one_to_factor list
d) appended these results to an empty LINES list.
However, this does not work. I cannot manage to extract the lines that contain i in the 3rd column and save them to a file called mode_i.dat.
I would appreciate it if you could help me.
import re

factor = 6
one_to_factor = range(1, factor + 1)
LINES = []
f_2 = open('data.dat', 'r')
for line in f_2:
    for i in one_to_factor:
        if re.search(r' \b%s$' % i, line):
            print 'line = ', line
            LINES.append(line)
print 'LINES =', LINES
I would do it like this:
no regexes, just use str.split() to split according to whitespace
use last item (the digit) of the current line to generate the filename
use a dictionary to open the file the first time, and reuse the handle for subsequent matches (write title line at file open)
close all handles at the end
code:
title_line = "# Vol \t Freq \t Mod \n"
handles = dict()
next(f_2)  # skip title
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    # create the file the first time its id is encountered
    if filename not in handles:
        handles[filename] = open(filename, "w")
        handles[filename].write(title_line)  # write title
    handles[filename].write(line)
# close all files
for v in handles.values():
    v.close()
EDIT: that's the fastest way, but the problem is that if you have too many suffixes (like in your real example), you'll get a "too many open files" exception. So for this case there's a slightly less efficient method which works too:
import glob, os

# pre-processing: clean up old files if any
for f in glob.glob("mode_*.dat"):
    os.remove(f)

next(f_2)  # skip title
s = set()
title_line = "# Vol \t Freq \t Mod \n"
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    with open(filename, "a") as f:
        if filename not in s:  # first write to this file: add title
            s.add(filename)
            f.write(title_line)
        f.write(line)
It basically opens the file in append mode, writes the line, and closes the file.
(The set is used to detect the first write to each file, so the title can be written before the data.)
There's a directory cleanup first to ensure that no data is left over from a previous computation: append mode expects that no file exists, and if the input data set changes, an identifier present in the old dataset but not in the new one would leave an "orphan" file behind from the previous run.
First, instead of looping over your one_to_factor, you can get the index in one step:
index = line.strip()[-1]  # last character on the (stripped) line
Then, you can check if index is in your one_to_factor list.
You should create a dictionary of lists to store your lines.
Something like:
{ "1" : [line1, line7, ...],
  "2" : ...
}
And then you can use the keys of the dictionary to create the files and populate them with lines.
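A minimal sketch of that suggestion (assuming the data.dat layout above; the names are illustrative):

from collections import defaultdict

lines_by_mode = defaultdict(list)

with open("data.dat") as f_2:
    next(f_2)  # skip the "# Title" header
    for line in f_2:
        index = line.split()[-1]  # third column, e.g. "1".."6"
        lines_by_mode[index].append(line)

# one output file per distinct index
for index, lines in lines_by_mode.items():
    with open("mode_{}.dat".format(index), "w") as out:
        out.writelines(lines)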

How do I print a range of lines after a specific pattern into separate files when this pattern appears several times in an input file

Sorry for my previous post, I had no idea what I was doing. I am trying to cut out certain ranges of lines in a given input file and print each range to a separate file. This input file looks like:
18
generated by VMD
C 1.514895 -3.887949 2.104134
C 2.371076 -2.780954 1.718424
C 3.561071 -3.004933 1.087316
C 4.080424 -4.331872 1.114878
C 3.289761 -5.434047 1.607808
C 2.018473 -5.142150 2.078551
C 3.997237 -6.725186 1.709355
C 5.235126 -6.905640 1.295296
C 5.923666 -5.844841 0.553037
O 6.955216 -5.826197 -0.042920
O 5.269004 -4.590026 0.590033
H 4.054002 -2.184680 0.654838
H 1.389704 -5.910354 2.488783
H 5.814723 -7.796634 1.451618
O 1.825325 -1.537706 1.986256
H 2.319215 -0.796042 1.550394
H 3.390707 -7.564847 2.136680
H 0.535358 -3.663175 2.483943
18
generated by VMD
C 1.519866 -3.892621 2.109595
I would like to print every 100th frame, starting from the first frame, into its own file named "snapshot0.xyz" (the first frame is frame 0).
For example, the above input shows two snapshots. I would like to print out lines 1:20 into their own file named snapshot0.xyz, then skip 100 snapshots (2000 lines) and print out snapshot1.xyz (containing the 100th snapshot). My attempt was in Python, but you can choose either grep, awk, sed, or Python.
My input file: frames.dat. My attempt:
#!/usr/bin/Python

mest = open('frames.dat', 'r')
test = mest.read().strip().split('\n')

for i in range(len(test)):
    if test[i] == '18':
        f = open("out" + `i` + ".dat", "w")
        for j in range(19):
            print >> f, test[j]
        f.close()
I suggest using the csv module for this input.
import csv

def strip_empty_columns(line):
    return [s for s in line if s.strip() != ""]

def is_count(line):
    return len(line) == 1 and line[0].strip().isdigit()

def is_float(s):
    try:
        float(s.strip())
        return True
    except ValueError:
        return False

def is_data_line(line):
    return len(line) == 4 and is_float(line[1]) and is_float(line[2]) and is_float(line[3])

with open('frames.dat', 'r') as mest:
    r = csv.reader(mest, delimiter=' ')
    frame_nr = 0
    outfile = None
    for line in r:
        line = strip_empty_columns(line)
        if is_count(line):
            if frame_nr % 100 == 0:
                outfile = open("snapshot%d.xyz" % frame_nr, "w+")
            elif outfile:
                outfile.close()
                outfile = None
            # increment the frame counter every time you see a header line like '18'
            frame_nr += 1
        elif is_data_line(line):
            if outfile:
                outfile.write(" ".join(line) + "\n")
The opening post mentions writing every 100th frame to an output file named snapshot0.xyz. I assume the 0 should be a counter, or you would continuously overwrite the file. I updated the code with a frame_nr counter and a few lines which open/close an output file depending on frame_nr and write data if an output file is open.
This might work for you (GNU sed and csplit):
sed -rn '/^18/{x;/x{100}/z;s/^/x/;x};G;/\nx$/P' file | csplit -f snapshot -b '%d.xyz' -z - '/^18/' '{*}'
Filter every 100th frame using sed and pass that file to csplit to create the individual files.
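For comparison, the same selection is also short in plain Python. A sketch that assumes fixed-size frames of 20 lines (the count line, the comment line, and 18 atom lines) and reads the whole file into memory:

frame_size = 20  # count line + comment line + 18 atom lines
step = 100       # keep every 100th frame

with open("frames.dat") as f:
    lines = f.readlines()

for out_nr, start in enumerate(range(0, len(lines), frame_size * step)):
    with open("snapshot%d.xyz" % out_nr, "w") as out:
        out.writelines(lines[start:start + frame_size])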
