Increase String by Sequential Index - python

In a file dealing with climatological variables involving a running mean over hours, the hours progress in sequence.
Is there a sed/awk command that would take that hour (a string) in the file and change it by two, so the next time the file is read it's (202), and so on to (204), etc.?
See the number being added to 'i' below.
timeprime = i + 569
'define climomslp = prmslmsl(t = 'timeprime' )
My goal is to increase the number in this case, 569, by one each time the file runs through other commands involved in processing the data.
The next desired number next to i would be
timeprime = i + 570 (where 569 is increased by one)
after that...
timeprime = i + 571 (where 570 is increased by one)
If there isn't a sed/awk command to do such a thing, is there such a thing in any other method?
Thank you for any answers.

You can definitely do this in Python (or Perl, Ruby, or whatever other scripting language you like, but you included a Python tag). For example:
#!/usr/bin/env python
import re
import sys
def replace(m):
    return '{}{}'.format(m.group(1), int(m.group(2)) + 2)

for line in sys.stdin:
    sys.stdout.write(re.sub(r'(timeprime = i \+ )(\d+)', replace, line))
Hopefully the regex itself is trivial to understand:
(timeprime = i \+ )(\d+)
The sub function can take a function to be applied to the match object, instead of a string, as the "replacement". So, lines that don't match will be printed unchanged; lines that do will have the match substituted with the same two parts, but with the second part replaced by int(number)+2.
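For a quick interactive check of that callable-replacement behaviour (the sample line here is just the one from the question):
>>> import re
>>> re.sub(r'(timeprime = i \+ )(\d+)',
...        lambda m: '{}{}'.format(m.group(1), int(m.group(2)) + 2),
...        'timeprime = i + 569')
'timeprime = i + 571'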

Here is an alternative using awk:
awk '/^timeprime = i [+]/{$5+=2} 1' file
Starting with this file:
$ cat file
timeprime = i + 569
'define climomslp = prmslmsl(t = 'timeprime' )
We can use the awk command to create a new file:
$ awk '/^timeprime = i [+]/{$5+=2} 1' file
timeprime = i + 571
'define climomslp = prmslmsl(t = 'timeprime' )
To overwrite the original file with the new one, use:
awk '/^timeprime = i [+]/{$5+=2} 1' file >file.tmp && mv file.tmp file
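If you have GNU awk 4.1 or later (an assumption about your environment), you can also let awk edit the file in place and skip the temporary file:
gawk -i inplace '/^timeprime = i [+]/{$5+=2} 1' file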
How it works
/^timeprime = i [+]/{$5+=2}
This looks for lines that start with timeprime = i + (the ^ anchors the match to the start of the line) and, on those lines, increments the fifth field by 2.
1
This is awk's cryptic shorthand for print the line.
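If the shorthand bothers you, the same command can be spelled out with an explicit print block:
awk '/^timeprime = i [+]/{$5+=2} {print}' file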

Related

python alternative for awk?

I have two fasta files, and I want to search for sequence IDs and assign only the sequence corresponding to the ID to a string in Python.
I currently have:
import os
#use awk on the command line to search reference file and cut the reference sequence
os.system("awk '/LOC_OS05G45410.1/{getline;print}' Ref_seqs.fasta > sangerRef")
#use awk on the command line to cut the aligned sequence
os.system("awk '/seq1/{getline;print}' Sanger_seq_1.fasta > sangerAlign")
Ref_seq = open('sangerRef', 'r').read()
Sanger_seq = open('sangerAlign', 'r').read()
When I print these variables, everything looks fine:
TGGTGAGGCTTTTGACAGGGTTGAGCTGAGCCTGGTCTCCCTGGAGAAACTCTTCCAGAGAGCAAATGATGCTTGCACAGCTGCTGAAGAAATGTACTCCCATGGTCATGGTGGTACTGAACCCAG
CTGCTGCCCAAGTACTTCAAGCACAACAACTTCTCCAGCTTCATCAGGCAGCTCAACGCCTACGGTTTCCGAAAAATCGATCCTGAGAGATGGGAGTTCGCAAACGAGGATTTCATAAGAGGGCACACGCACCTT
However, when I try to read these variables into another function, it doesn't work:
from Bio import pairwise2
from Bio.Align import substitution_matrices
#load sequences
s1=Ref_seq
s2=Sanger_seq
matrix = substitution_matrices.load("NUC.4.4")
gap_open = -10
gap_extend = -0.5
align = pairwise2.align.globalds(s1, s2, matrix, gap_open, gap_extend)
align
I'm thinking it might be better to replace the awk command with a Python command?
I think it's because you haven't parsed the sequences. I don't know if I am using the word 'Parse' right, though.
I think this should work
from Bio import SeqIO
s1 = SeqIO.read('filepath/filename.fasta','fasta')
s2 = SeqIO.read('filepath/file.fasta','fasta')
matrix = substitution_matrices.load("NUC.4.4")
gap_open = -10
gap_extend = -0.5
align = pairwise2.align.globalds(s1.seq, s2.seq, matrix, gap_open, gap_extend)
align
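To see the alignment in a readable form rather than as raw tuples, Biopython also provides a formatter (a small addition to the snippet above):
from Bio.pairwise2 import format_alignment

# print the first (highest-scoring) alignment returned by globalds
print(format_alignment(*align[0]))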
The immediate problem is that read() returns all the lines with a newline at the end of each.
But indeed, your Awk commands should be trivial to replace with native Python.
def getseq(filename, search):
    with open(filename) as reffile:
        for line in reffile:
            if search in line:
                # return the line that follows the matching header
                return next(reffile).rstrip('\n')

s1 = getseq("Ref_seqs.fasta", "LOC_OS05G45410.1")
s2 = getseq("Sanger_seq_1.fasta", "seq1")
Probably BioPython already contains a better function for doing this. In particular, your Awk script (and hence this blind reimplementation) assumes that each sequence only occupies one line in the file.
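For example, here is a minimal sketch using Bio.SeqIO to do the same lookup (the file names are the ones from the question; this assumes the search string appears in the FASTA record ID):
from Bio import SeqIO

def getseq_bio(filename, record_id):
    # Scan the FASTA records and return the first sequence whose ID
    # contains the search string; this also handles multi-line sequences.
    for record in SeqIO.parse(filename, "fasta"):
        if record_id in record.id:
            return str(record.seq)
    return None

s1 = getseq_bio("Ref_seqs.fasta", "LOC_OS05G45410.1")
s2 = getseq_bio("Sanger_seq_1.fasta", "seq1")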

Fast I/O when working with multiple files

I have two input files and I want to mix them and output the result into a third file. In the following I will use a toy example to explain the format of the files and the desired output. Each file contains a repeating 4-line pattern (each repetition holds a different sequence); I only include a single 4-line block:
input file 1:
#readheader1
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
input file 2:
#readheader2
AATTAATT
+
FFFFFFFF
...
desired output:
#readheader1_AATTAATT
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
So I want to attach the first line of every four lines from the first file, joined by an underscore, to the small sequence found in the second line of every four lines from the second file, and simply copy the 2nd, 3rd, and 4th lines of every four lines of the first file, as is, into the output.
I am looking for any script (linux bash, python, c++, etc) that can optimize what I have below:
I wrote this code to do the task, but I found it to be slow (it takes more than a day for inputs of 60 GB and 15 GB); note that the input files are in fastq.gz format, so I open them using gzip:
...
r1_file = gzip.open(r1_file_name, 'r')  # input file 1
i1_file = gzip.open(i1_file_name, 'r')  # input file 2
out_file_R1 = gzip.open('_R1_barcoded.fastq.gz', 'wb')  # output file

r1_header = ''
r1_seq = ''
r1_orient = ''
r1_qual = ''
i1_seq = ''

cnt = 1
with gzip.open(r1_file_name, 'r') as r1_file:
    for r1_line in r1_file:
        if cnt == 1:
            r1_header = str.encode(r1_line.decode("ascii").split(" ")[0])
            next(i1_file)
        if cnt == 2:
            r1_seq = r1_line
            i1_seq = next(i1_file)
        if cnt == 3:
            r1_orient = r1_line
            next(i1_file)
        if cnt == 4:
            r1_qual = r1_line
            next(i1_file)
            out_4line = r1_header + b'_' + i1_seq + r1_seq + r1_orient + r1_qual
            out_file_R1.write(out_4line)
            cnt = 0
        cnt += 1
i1_file.close()
out_file_R1.close()
Now that I have the two outputs made using 2 datasets, I wish to interleave the output files: 4 lines from the first file, 4 lines from the second file, 4 lines from the first, and so on...
Using paste utility (from GNU coreutils) and GNU sed:
paste file1 file2 |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
If files are gzipped then use:
paste <(gzip -dc file1.gz) <(gzip -dc file2.gz) |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
Note: This assumes no tab characters in file1 and file2
Explanation: Assume that file1 and file2 contains these lines:
File1:
Header1
ACACACACAC
XX
FFFFFFFFFFFF
File2:
Header2
AATTAATT
YY
GGGGGG
After the paste command, lines are merged, separated by TABs:
Header1\tHeader2
ACACACACAC\tAATTAATT
XX\tYY
FFFFFFFFFFFF\tGGGGGG
The \t above denotes a tab character. These lines are fed to sed. sed reads the first line, the pattern space becomes
Header1\tHeader2
The N command adds a newline to the pattern space, then appends the next line (ACACACACAC\tAATTAATT) of input to the pattern space. Pattern space becomes
Header1\tHeader2\nACACACACAC\tAATTAATT
and is matched against regex \t.*\n([^\t]*)\t(.*) as denoted below.
Header1\tHeader2\nACACACACAC\tAATTAATT
       ||^^^^^^^||^^^^^^^^^^||^^^^^^^^
       \t   .*  \n ([^\t]*) \t  (.*)
       ||       ||    \1    ||   \2
The \n denotes a newline character. Then the matching part is replaced with _\2\n\1 by the s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/ command. Pattern space becomes
Header1_AATTAATT\nACACACACAC
The two N commands read the next two lines. Now pattern space is
Header1_AATTAATT\nACACACACAC\nXX\tYY\nFFFFFFFFFFFF\tGGGGGG
The s/\t[^\n]*//g command removes all parts between a TAB (inclusive) and newline (exclusive). After this operation the final pattern space is
Header1_AATTAATT\nACACACACAC\nXX\nFFFFFFFFFFFF
which is printed out as
Header1_AATTAATT
ACACACACAC
XX
FFFFFFFFFFFF
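If you would rather stay in Python, here is a minimal sketch of the same merge that avoids the per-line counter by reading both gzipped files in lock-step, four lines at a time (Python 3 text-mode gzip; the file names are placeholders and this has not been benchmarked against the paste/sed pipeline):
import gzip
from itertools import islice

def merge(r1_name, i1_name, out_name):
    with gzip.open(r1_name, 'rt') as r1, gzip.open(i1_name, 'rt') as i1, \
         gzip.open(out_name, 'wt') as out:
        while True:
            block1 = list(islice(r1, 4))  # header, seq, +, qual from file 1
            block2 = list(islice(i1, 4))  # header, seq, +, qual from file 2
            if not block1 or not block2:
                break
            header = block1[0].split(" ")[0].rstrip("\n")
            barcode = block2[1].rstrip("\n")
            out.write(header + "_" + barcode + "\n")
            out.writelines(block1[1:])

merge('R1.fastq.gz', 'I1.fastq.gz', 'R1_barcoded.fastq.gz')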

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns; column 4 for entity 2 would be "column\"this is, a test\"4b", and column 2 for entity 3 would be "column2c". Each column begins and ends with a quote; however, you must be careful because some columns contain escaped quotes. Thanks in advance!
You could do it like this, i.e.:
Read the whole file.
Split the input on newline characters that are not preceded by a comma.
Iterate over the split elements and split again, this time on commas (and the optional newline that follows) that are preceded and followed by double quotes.
Code:
import re
with open(file) as f:
    fil = f.read()

m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
    print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file-name which contains the python script.
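In case the lookaround syntax is unfamiliar, here is a quick interactive illustration of the two patterns used above (the sample strings are made up for the demo):
>>> import re
>>> re.split(r'(?<!,)\n', '"a","b",\n"c"\n"d","e",\n"f"')  # a newline NOT preceded by a comma ends a row
['"a","b",\n"c"', '"d","e",\n"f"']
>>> re.split('(?<="),\n?(?=")', '"a","b",\n"c"')  # a comma between two quotes separates columns
['"a"', '"b"', '"c"']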
Your problem is terribly similar to what I have to deal with thrice every month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print; if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print(matches)
        else:
            # use line.strip() if you need to trim whitespaces
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespaces
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete and print
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print(matches)
            buffer = ""
        # Optional; always good to have a safety backdoor though
        # If there is a problem with the csv itself like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print("Error: cannot parse line:\n" + buffer)
            buffer = ""

grep in python properly

I am used to doing scripting in bash, but I am also learning Python.
So, as a way of learning, I am trying to rewrite a few of my old bash scripts in Python. Say I have a file with lines like:
TOTDOS= 0.38384E+02n_Ef= 0.81961E+02 Ebnd 0.86883E+01
to get the value of TOTDOS in bash, I just do:
grep "TOTDOS=" 630/out-Dy-eos2|head -c 19|tail -c 11
but in Python, I am doing:
#!/usr/bin/python3
import re
import os.path
import sys
f1 = open("630/out-Dy-eos2", "r")
re1 = r'TOTDOS=\s*(.*)n_Ef=\s*(.*)\sEbnd'
for line in f1:
    match1 = re.search(re1, line)
    if match1:
        TD = match1.group(1)
f1.close()
print(TD)
This surely gives the correct result, but it seems like much more work than the bash version (not to mention the fiddling with the regex).
Question is, am I overworking in python, or missing something of it?
A python script that matches your bash line would be more like this:
with open('630/out-Dy-eos2', 'r') as f1:
    for line in f1:
        if "TOTDOS=" in line:
            print(line[8:19])
Looks a little bit better now.
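If you want the value as a number rather than as a fixed slice of the line, here is a small variation (whether you actually need the float conversion is an assumption about what you do with TOTDOS downstream):
import re

with open('630/out-Dy-eos2') as f1:
    for line in f1:
        # capture just the numeric token after TOTDOS=
        match = re.search(r'TOTDOS=\s*([0-9.Ee+-]+)', line)
        if match:
            print(float(match.group(1)))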
[...] but seems to be much more than bash
Maybe (?) generators are the closest Python concept to the "pipe filtering" used in shell.
import itertools

# Simple generator to iterate through a file,
# equivalent of line-by-line reading from an input file
def source(fname):
    with open(fname, "r") as f:
        for l in f:
            yield l

src = source("630/out-Dy-eos2")

# First filter to keep only lines containing the required word
# equivalent to `grep -F`
filter1 = (l for l in src if "TOTDOS=" in l)

# Second filter to keep only lines in the required range
# equivalent of `head -n ... | tail -n ...`
filter2 = itertools.islice(filter1, 10, 20, 1)

# Finally output
output = "".join(filter2)
print(output)
Concerning your specific example, if you need it, you could use regexp in a generator:
re1 = r'TOTDOS=\s*(.*)n_Ef=\s*(.*)\sEbnd'
filter1 = (m.group(1) for m in (re.match(re1, l) for l in src) if m)
Those are only (some of the) basic building blocks available to you.

Bash or Python to go backwards?

I have a text file with a lot of random occurrences of the string #STRING_A, and I would be interested in writing a short script which removes only some of them. Particularly one that scans the file and, once it finds a line which starts with this string, like
#STRING_A
then checks if 3 lines backwards there is another occurrence of a line starting with the same string, like
#STRING_A
#STRING_A
and if it does, to delete the occurrence 3 lines backwards. I was thinking about bash, but I do not know how to "go backwards" with it, so I am sure that this is not possible with bash. I also thought about Python, but then I would have to store all the information in memory in order to go backwards, and for long files that would be unfeasible.
What do you think? Is it possible to do it in bash or python?
Thanks
Funny that after all these hours nobody's yet given a solution to the problem as actually phrased (as @John Machin points out in a comment) -- remove just the leading marker (if followed by another such marker 3 lines down), not the whole line containing it. It's not hard, of course -- here's a tiny mod as needed of @truppo's fun solution, for example:
from itertools import izip, chain

f = "foo.txt"
# the three leading blanks in chain() give `third` a 3-line lag behind `line`
for third, line in izip(chain("   ", open(f)), open(f)):
    if third.startswith("#STRING_A") and line.startswith("#STRING_A"):
        line = line[len("#STRING_A"):]
    print line,
Of course, in real life, one would use itertools.tee instead of reading the file twice, have this code in a function, not repeat the marker constant endlessly, &c ;-).
Of course Python will work as well. Simply store the last three lines in an array and check if the first element in the array is the same as the value you are currently reading. Then delete the value and print out the current array. You would then move over your elements to make room for the new value and repeat. Of course when the array is filled you'd have to make sure to continue to move values out of the array and put in the newly read values, stopping to check each time to see if the first value in the array matches the value you are currently reading.
Here is a more fun solution, using two iterators with a three element offset :)
from itertools import izip, chain, tee

f1, f2 = tee(open("foo.txt"))
# three dummy leading elements give `third` a 3-line lag behind `line`
for third, line in izip(chain("   ", f1), f2):
    if not (third.startswith("#STRING_A") and line.startswith("#STRING_A")):
        print line,
Why shouldn't it be possible in bash? You don't need to keep the whole file in memory, just the last three lines (if I understood correctly), and write what's appropriate to standard output. Redirect that into a temporary file, check that everything worked as expected, and overwrite the source file with the temporary one.
Same goes for Python.
I'd provide a script of my own, but that wouldn't be tested. ;-)
As AlbertoPL said, store lines in a fifo for later use--don't "go backwards". For this I would definitely use python over bash+sed/awk/whatever.
I took a few moments to code this snippet up:
from collections import deque

line_fifo = deque()
for line in open("test"):
    line_fifo.append(line)
    if len(line_fifo) == 4:
        # "look 3 lines backward"
        if line_fifo[0] == line_fifo[-1] == "#STRING_A\n":
            # get rid of that match
            line_fifo.popleft()
        else:
            # print out the top of the fifo
            print line_fifo.popleft(),

# don't forget to print out the fifo when the file ends
for line in line_fifo: print line,
This code will scan through the file and remove lines starting with the marker (when they are followed by another marker line a few lines later). It keeps only three lines in memory by default:
from collections import deque
def delete(fp, marker, gap=3):
    """Delete lines from *fp* if they start with *marker* and are followed
    by another line starting with *marker* *gap* lines later.
    """
    buf = deque()
    for line in fp:
        if len(buf) < gap:
            buf.append(line)
        else:
            old = buf.popleft()
            if not (line.startswith(marker) and old.startswith(marker)):
                yield old
            buf.append(line)
    for line in buf:
        yield line
I've tested it with:
>>> from StringIO import StringIO
>>> fp = StringIO('''a
... b
... xxx 1
... c
... xxx 2
... d
... e
... xxx 3
... f
... g
... h
... xxx 4
... i''')
>>> print ''.join(delete(fp, 'xxx'))
a
b
xxx 1
c
d
e
xxx 3
f
g
h
xxx 4
i
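To apply the generator to an actual file, you can write to a temporary file and then rename it over the original, as suggested in another answer here (the file name is a placeholder):
import os

with open("input.txt") as src, open("input.txt.tmp", "w") as dst:
    dst.writelines(delete(src, "#STRING_A"))
# on POSIX, rename atomically replaces the original file
os.rename("input.txt.tmp", "input.txt")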
This "answer" is for lyrae ... I'll amend my previous comment: if the needle is in the first 3 lines of the file, your script will either cause an IndexError or access a line that it shouldn't be accessing, sometimes with interesting side-effects.
Example of your script causing IndexError:
>>> lines = "#string line 0\nblah blah\n".splitlines(True)
>>> needle = "#string "
>>> for i,line in enumerate(lines):
... if line.startswith(needle) and lines[i-3].startswith(needle):
... lines[i-3] = lines[i-3].replace(needle, "")
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IndexError: list index out of range
and this example shows not only that the Earth is round but also why your "fix" to the "don't delete the whole line" problem should have used .replace(needle, "", 1) or [len(needle):] instead of .replace(needle, "")
>>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
>>> needle = "NEEDLE"
>>> # Expected result: no change to the file
... for i,line in enumerate(lines):
... if line.startswith(needle) and lines[i-3].startswith(needle):
... lines[i-3] = lines[i-3].replace(needle, "")
...
>>> print ''.join(lines)
x y <<<=== whoops!
noddle
nuddle
<<<=== still got unwanted newline in here
>>>
My awk-fu has never been that good... but the following may provide you what you're looking for in a bash-shell/shell-utility form:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "d"
LAST = NR
}' test_file` test_file
Basically... awk is producing a command for sed to strip certain lines. I'm sure there's a relatively easy way to make awk do all of the processing, but this does seem to work.
The bad part? It does read the test_file twice.
The good part? It is a bash/shell-utility implementation.
Edit: Alex Martelli points out that the sample file above might have confused me. (my above code deletes the whole line, rather than the #STRING_A flag only)
This is easily remedied by adjusting the command to sed:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "s/#STRING_A//"
LAST = NR
}' test_file` test_file
This may be what you're looking for?
lines = open('sample.txt').readlines()
needle = "#string "
for i, line in enumerate(lines):
    if line.startswith(needle) and lines[i-3].startswith(needle):
        lines[i-3] = lines[i-3].replace(needle, "")
print ''.join(lines)
this outputs:
string 0 extra text
string 1 extra text
string 2 extra text
string 3 extra text
--replaced -- 4 extra text
string 5 extra text
string 6 extra text
#string 7 extra text
string 8 extra text
string 9 extra text
string 10 extra text
In bash you can use sort -r filename and tail -n filename to read the file backwards.
LINES=`tail -n filename | sort -r`
# now iterate through the lines and do your checking
I would consider using sed. GNU sed supports definition of line ranges. If sed fails you, then there is another beast, awk, and I'm sure you can do it with awk.
O.K. I feel I should put up my awk POC. I could not figure out how to use sed addresses. I have not tried a combination of awk+sed, but it seems to me it would be overkill.
My awk script works as follows:
It reads lines and stores them in a 3-line buffer.
Once the desired pattern is found (/^data.*/ in my case), the 3-line buffer is looked up to check whether the desired pattern was seen three lines ago.
If the pattern has been seen, those 3 lines are scratched.
To be honest, I would probably go with Python as well, given that awk is really awkward.
The awk code follows:
function max(a, b)
{
    if (a > b)
        return a;
    else
        return b;
}

BEGIN {
    w = 0;         # write index
    r = 0;         # read index
    buf[0, 1, 2];  # buffer
}

END {
    # flush buffer
    # start at read index and print out up to w index
    for (k = r % 3; k > r - max(r - 3, 0); k--) {
        # search in 3 line history buf
        if (match(buf[k % 3], /^data.*/) != 0) {
            # found -> remove lines from history
            # by rewriting them -> adjust write index
            w -= max(r, 3);
        }
    }
    buf[w % 3] = $0;
    w++;
}

/^.*/ {
    # store line into buffer; if the history
    # is full, print out the oldest one.
    if (w > 2) {
        print buf[r % 3];
        r++;
        buf[w % 3] = $0;
    }
    else {
        buf[w] = $0;
    }
    w++;
}
