bash 4.4 inside python with os.system - python

I have problems running a bash command inside a Python script, script.py:
import os
bashCommand = """
sed "s/) \['/1, color=\"#ffcccc\", label=\"/g" list.txt | sed 's/\[/ GraphicFeature(start=/g' | sed 's/\:/, end=/g' | sed 's/>//g' | sed 's/\](/, strand=/g' | sed "s/'\]/\"),/g" >list2.txt"""
os.system("bash %s" % bashCommand)
When I run this as python script.py, no list2.txt is written; instead, on the terminal I see that I am inside bash-4.4 rather than the native macOS bash.
Any ideas what could cause this?
The snippet I posted above is part of a bigger script, which first reads in some file and outputs list.txt.
Edit: here comes some more description.
In a first Python script, I parsed a file (a GenBank file, to be specific) to write out a list of items (location, strand, name) into list.txt.
This list.txt has to be transformed to be parsable by a second Python script, hence the sed.
list.txt
[0:2463](+) ['bifunctional aspartokinase/homoserine dehydrogenase I']
[2464:3397](+) ['Homoserine kinase']
[3397:4684](+) ['Threonine synthase']
All the brackets, colons, and quotes have to be replaced so that it looks like the desired output, list2.txt:
GraphicFeature(start=0, end=2463, strand=+1, color="#ffcccc", label="bifunctional aspartokinase/homoserine dehydrogenase I"),
GraphicFeature(start=2464, end=3397, strand=+1, color="#ffcccc", label="Homoserine kinase"),
GraphicFeature(start=3397, end=4684, strand=+1, color="#ffcccc", label="Threonine synthase"),

Read the file in Python, parse each line with a single regular expression, and output an appropriate line constructed from the captured pieces.
import re
import sys

#                      1       2          3
#                     ---     ---        ---
regex = re.compile(r"^\[(\d+):(\d+)\]\(\+\) \['(.*)'\]$")
# 1 - start value
# 2 - end value
# 3 - text value

with open("list2.txt", "w") as out:
    for line in sys.stdin:
        line = line.strip()
        m = regex.match(line)
        if m is None:
            print(line, file=out)
        else:
            print('GraphicFeature(start={}, end={}, strand=+1, color="#ffcccc", label="{}"),'.format(*m.groups()), file=out)
I output lines that don't match the regular expression unmodified; you may want to ignore them altogether or report an error instead.
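As a quick check, the regular expression above can be exercised on sample lines from list.txt (inlined here instead of being read from stdin):

```python
import re

# Same regex as in the script above.
regex = re.compile(r"^\[(\d+):(\d+)\]\(\+\) \['(.*)'\]$")

# Sample lines copied from list.txt in the question.
for line in ["[0:2463](+) ['bifunctional aspartokinase/homoserine dehydrogenase I']",
             "[2464:3397](+) ['Homoserine kinase']"]:
    m = regex.match(line)
    print('GraphicFeature(start={}, end={}, strand=+1, '
          'color="#ffcccc", label="{}"),'.format(*m.groups()))
```

Fed the full list.txt on stdin, the script writes exactly the list2.txt shown in the question.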

Related

Fill missing line numbers into file using sed / awk / bash

I have a (tab-delimited) file where the first "word" on each line is the line number. However, some line numbers are missing. I want to insert new lines (with corresponding line number) so that throughout the file, the number printed on the line matches the actual line number. (This is for later consumption into readarray with cut/awk to get the line after the line number.)
I've written this logic in Python and tested that it works; however, I need to run it in an environment that doesn't have Python. The actual file is about 10M rows. Is there a way to express this logic using sed, awk, or even just plain shell / bash?
import re
import sys

linenumre = re.compile(r"^\d+")

i = 0
for line in sys.stdin:
    i = i + 1
    linenum = int(linenumre.findall(line)[0])
    while i < linenum:
        print(i)
        i = i + 1
    print(line, end='')
test file looks like:
1 foo 1
2 bar 1
4 qux 1
6 quux 1
9 2
10 fun 2
expected output like:
1 foo 1
2 bar 1
3
4 qux 1
5
6 quux 1
7
8
9 2
10 fun 2
Like this, with awk:
awk '{while(++ln!=$1){print ln}}1' input.txt
Explanation, as a multiline script:
{
    # Loop as long as the variable ln (line number)
    # is not equal to the first column, printing the
    # missing line numbers.
    # Note: awk auto-initializes a variable to 0
    # upon its first usage.
    while (++ln != $1) {
        print ln
    }
}
1 # this always evaluates to true, making awk print the input lines
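For a quick sanity check, the one-liner can be run against the first lines of the sample input (fed via printf here rather than input.txt):

```shell
printf '1 foo 1\n2 bar 1\n4 qux 1\n' |
awk '{while(++ln!=$1){print ln}}1'
```

This prints 1 foo 1, 2 bar 1, a bare 3 for the missing line, then 4 qux 1, matching the expected output above.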
I've written this logic in Python and tested that it works; however, I need to run it in an environment that doesn't have Python.
If you need to run Python code where Python is not installed, you can freeze your code. The Hitchhiker's Guide to Python has an overview of tools that can do this. I suggest trying PyInstaller first, as it supports various operating systems and seems easy to use.
This might work for you (GNU join, seq and sed):
join -a1 -t' ' <(seq $(sed -n '$s/ .*//p' file)) file 2>/dev/null
Join the output of seq, generated up to the last line number in file (extracted by the inner sed), with file itself; -a1 keeps the unpaired lines from seq, which fill in the missing numbers.

Fast I/O when working with multiple files

I have two input files and I want to mix them and output the result into a third file. In the following I will use a toy example to explain the format of the files and the desired output. Each file contains a repeating 4-line pattern (each repetition with a different sequence); I only include a single 4-line block:
input file 1:
#readheader1
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
input file 2:
#readheader2
AATTAATT
+
FFFFFFFF
...
desired output:
#readheader1_AATTAATT
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
So I want to attach, with an underscore, the short sequence found in the second line of every 4-line block of the second file to the first line of every 4-line block of the first file, and simply copy the 2nd, 3rd, and 4th lines of every 4-line block of the first file, as is, into the output.
I am looking for any script (Linux bash, Python, C++, etc.) that can optimize what I have below.
I wrote this code to do the task, but I found it to be slow (it takes more than a day for inputs of size 60 GB and 15 GB); note that the input files are in fastq.gz format, so I open them using gzip:
...
r1_file = gzip.open(r1_file_name, 'r')  # input file 1
i1_file = gzip.open(i1_file_name, 'r')  # input file 2
out_file_R1 = gzip.open('_R1_barcoded.fastq.gz', 'wb')  # output file
r1_header = ''
r1_seq = ''
r1_orient = ''
r1_qual = ''
i1_seq = ''
cnt = 1
with gzip.open(r1_file_name, 'r') as r1_file:
    for r1_line in r1_file:
        if cnt == 1:
            r1_header = str.encode(r1_line.decode("ascii").split(" ")[0])
            next(i1_file)
        if cnt == 2:
            r1_seq = r1_line
            i1_seq = next(i1_file)
        if cnt == 3:
            r1_orient = r1_line
            next(i1_file)
        if cnt == 4:
            r1_qual = r1_line
            next(i1_file)
            out_4line = r1_header + b'_' + i1_seq + r1_seq + r1_orient + r1_qual
            out_file_R1.write(out_4line)
            cnt = 0
        cnt += 1
i1_file.close()
out_file_R1.close()
Now that I have the two outputs made from two datasets, I wish to interleave the output files: 4 lines from the first file, 4 lines from the second file, 4 lines from the first, and so on...
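As an aside, the 4-line bookkeeping in the question can be sketched more compactly by reading both streams in parallel, four lines at a time. A hypothetical rewrite on toy plain-text files (the file names and toy contents are made up; on real data you would use gzip.open(..., 'rt') instead of open()):

```python
from itertools import islice

# Toy inputs mirroring the question's example (file names are assumptions).
with open("r1.fastq", "w") as f:
    f.write("#readheader1\nACAC\n+\nFFFF\n")
with open("i1.fastq", "w") as f:
    f.write("#readheader2\nAATTAATT\n+\nFFFF\n")

def records(fh):
    """Yield successive 4-line records from a FASTQ stream."""
    while True:
        rec = list(islice(fh, 4))
        if not rec:
            return
        yield rec

with open("r1.fastq") as r1, open("i1.fastq") as i1, \
        open("out.fastq", "w") as out:
    for (h1, s1, o1, q1), (_, s2, _, _) in zip(records(r1), records(i1)):
        # header_barcode, then the remaining three lines of file 1 unchanged
        out.write(h1.split(" ")[0].rstrip("\n") + "_" + s2)
        out.write(s1 + o1 + q1)
```

Reading in fixed 4-line records also avoids the cnt reset logic entirely.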
Using paste utility (from GNU coreutils) and GNU sed:
paste file1 file2 |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
If files are gzipped then use:
paste <(gzip -dc file1.gz) <(gzip -dc file2.gz) |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
Note: this assumes there are no tab characters in file1 or file2.
Explanation: Assume that file1 and file2 contains these lines:
File1:
Header1
ACACACACAC
XX
FFFFFFFFFFFF
File2:
Header2
AATTAATT
YY
GGGGGG
After the paste command, lines are merged, separated by TABs:
Header1\tHeader2
ACACACACAC\tAATTAATT
XX\tYY
FFFFFFFFFFFF\tGGGGGG
The \t above denotes a tab character. These lines are fed to sed. sed reads the first line, the pattern space becomes
Header1\tHeader2
The N command adds a newline to the pattern space, then appends the next line (ACACACACAC\tAATTAATT) of input to the pattern space. Pattern space becomes
Header1\tHeader2\nACACACACAC\tAATTAATT
and is matched against regex \t.*\n([^\t]*)\t(.*) as denoted below.
Header1\tHeader2\nACACACACAC\tAATTAATT
       ||^^^^^^^||^^^^^^^^^^||^^^^^^^^
       \t  .*   \n ([^\t]*) \t  (.*)
                      \1          \2
The \n denotes a newline character. Then the matching part is replaced with _\2\n\1 by the s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/ command. Pattern space becomes
Header1_AATTAATT\nACACACACAC
The two N commands read the next two lines. Now pattern space is
Header1_AATTAATT\nACACACACAC\nXX\tYY\nFFFFFFFFFFFF\tGGGGGG
The s/\t[^\n]*//g command removes all parts between a TAB (inclusive) and newline (exclusive). After this operation the final pattern space is
Header1_AATTAATT\nACACACACAC\nXX\nFFFFFFFFFFFF
which is printed out as
Header1_AATTAATT
ACACACACAC
XX
FFFFFFFFFFFF
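Putting the walk-through together, the whole pipeline can be reproduced on the toy files (written here with printf; GNU sed is assumed):

```shell
printf 'Header1\nACACACACAC\nXX\nFFFFFFFFFFFF\n' > file1
printf 'Header2\nAATTAATT\nYY\nGGGGGG\n' > file2
paste file1 file2 |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g'
```

which prints the four lines Header1_AATTAATT, ACACACACAC, XX, FFFFFFFFFFFF.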

Output text between two regular expression patterns over multiple lines

I can run the following command if I bring myfile to an environment with python available:
cat myfile | python filter.py
filter.py
import sys

results = []
for line in sys.stdin:
    results.append(line.rstrip("\n\r"))

start_match = "some text"
lines_to_include_before_start_match = 4
end_match = "some other text"
lines_to_include_after_end_match = 4

for line_number, line in enumerate(results):
    if start_match in line:
        for x in xrange(line_number - lines_to_include_before_start_match, line_number):
            print results[x]
        print line
        for x in xrange(line_number + 1, len(results)):
            if end_match in results[x]:
                print results[x]
                for z in xrange(x + 1, x + lines_to_include_after_end_match):
                    print results[z]
                break
            else:
                print results[x]
        print ""
But the environment that I want to run this in doesn't have python. Is my only choice to convert this to perl, which I know exists in the environment? Is there an easy sed or awk command to do this?
I've tried the following but it doesn't quite give me what I'm looking for since it misses the +/- 4 lines:
cat myfile | sed -n '/some text/,/some other text/p'
[EDIT: The python script says lines_to_include_after_end_match is 4 but in reality it returns 3]
This might work for you (GNU sed):
sed ':a;$!{N;s/\n/&/4;Ta};/1st text/{:b;n;/2nd text/!bb;:c;N;s/\n/&/4;Tc;b};$d;D' file
Open up a window of n lines; if those lines contain the 1st text, print them and continue printing until the 2nd text, then read m further lines and print those. Otherwise, if it is the end of the file, delete the buffered lines; else delete the first line in the buffer and repeat.
If the match text must begin at the start or end of a line, use:
sed ':a;$!{N;s/\n/&/4;Ta};/^start/M{:b;n;/end$/M!bb;:c;N;s/\n/&/4;Tc;b};$d;D' file
Given that the line endings are \n, you can try this:
awk '/some text/{if(l4)printf "%s",l4;p=5} /some other text/{e=1} e && p {p--; if (!p) {e=0;l4="";}} !p && !e {l4 = l4 $0 "\n"; if (gsub(/\n/,"&",l4) > 4) sub(/[^\n]*\n/,"",l4)} p' file
Note: the counter needs to be 6 if you want to print an extra 4 lines after the end match.
I think your own Python code will only print another 3 lines after the end match.
Split over several lines for readability:
awk '/some text/{if(l4)printf "%s",l4;p=5}
     /some other text/{e=1}
     e && p {p--; if (!p) {e=0;l4="";}}
     !p && !e {l4 = l4 $0 "\n"; if (gsub(/\n/,"&",l4) > 4) sub(/[^\n]*\n/,"",l4)}
     p' file
With sed, please try:
sed -n "$(($(sed -n '/some text/=' myfile) - 4)),$(($(sed -n '/some other text/=' myfile) + 4))p" myfile
The command sed -n '/some text/=' prints the line number that matches some text.
Then 4 is subtracted from that number.
The next part, sed -n '/some other text/=', works similarly, and 4 is added to the line number obtained.
Note that the script scans the input file three times, so it may not be suitable when execution time is crucial.
[Edit]
In case you have multiple "some other text" in the file, please try instead:
sed -n "$(($(sed -n '/some text/=' myfile) - 4)),\$p" myfile | sed "/some other text/{N;N;N;q}"
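The building block in both commands is sed's = command, which prints the line number of each matching line; a minimal demonstration on made-up input:

```shell
printf 'aaa\nbbb\nsome text\nccc\n' | sed -n '/some text/='
```

This prints 3, the number of the matching line; the arithmetic expansion $((...)) around it then shifts that number by 4.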

Read text file values and assign it to each value to variable

I want to read a text file, find the words that start with 56 in the text below, pass each such word to a variable, and then pass them to a Python file as parameters.
My sample text file content -
51:40:2e:c0:01:c9:53:e8
56:c9:ce:90:4d:77:c6:03
56:c9:ce:90:4d:77:c6:07
51:40:2e:c0:01:c9:54:80
56:c9:ce:90:12:b4:19:01
56:c9:ce:90:12:b4:19:03
I'd like to call the Python file as
mytestfile.py var1 var2 var3
var1 should have value 56:c9:ce:90:4d:77:c6:03
var2 should have value 56:c9:ce:90:4d:77:c6:07
var3 should have value 56:c9:ce:90:12:b4:19:01
and so on.
I wrote code something like below, but it's not working:
#var1 = "51:40:2e:c0:01:c9:53:e8"
#var2 = "51:40:2e:c0:01:c9:53:ea"
filepath = '/root/SDFlex/work/cookbooks/ilorest/files/file.txt'
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    while line:
        print("Line {}: {}".format(cnt, line.strip()))
        line = fp.readline()
        cnt += 1

execute "run create volume script" do
  command "python SDFlexnimblevolcreate.py #{var1} #{node['ilorest']['Test0']} #{var2} #{node['ilorest']['Test1']}"
  cwd "#{platformdirectory}"
  live_stream true
end
Thanks in advance
This code prints a string of the required variables:
# a.py
with open('file.txt') as f:
    result = ' '.join(map(lambda x: f'<{x}>',
                          filter(lambda x: x.startswith('56'),
                                 map(str.strip, f))))
print(result, end='')
Output:
<56:c9:ce:90:4d:77:c6:03> <56:c9:ce:90:4d:77:c6:07> <56:c9:ce:90:12:b4:19:01> <56:c9:ce:90:12:b4:19:03>
You can pass the result of this program to another Python program using xargs. For example, if you have a simple program which prints its arguments:
# b.py
import sys
print(sys.argv)
Then just type in the shell:
python3 a.py | xargs python3 b.py
Output:
['b.py', '<56:c9:ce:90:4d:77:c6:03>', '<56:c9:ce:90:4d:77:c6:07>', '<56:c9:ce:90:12:b4:19:01>', '<56:c9:ce:90:12:b4:19:03>']
Are you trying to run your mytestfile.py once for each line that starts with '56:', like this:
python mytestfile.py 56:c9:ce:90:4d:77:c6:03
python mytestfile.py 56:c9:ce:90:4d:77:c6:07
python mytestfile.py 56:c9:ce:90:12:b4:19:01
Or do you want to run it one time only, giving it all values from the file as arguments, e.g., with your example data:
python mytestfile.py 56:c9:ce:90:4d:77:c6:03 56:c9:ce:90:4d:77:c6:07 56:c9:ce:90:12:b4:19:01
If it is the former (one run per matching line), the easiest way is:
grep '^56' textfile | xargs python mytestfile.py
otherwise, you could do this:
python mytestfile.py `grep '^56' textfile`
Note the latter depends on having no spaces in the file, and is a bit dangerous if the file is very large (you could run into the command-line length limit).
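With the sample lines from the question, the grep building block alone selects the wanted words:

```shell
printf '51:40:2e:c0:01:c9:53:e8\n56:c9:ce:90:4d:77:c6:03\n56:c9:ce:90:4d:77:c6:07\n' |
grep '^56'
```

Only the two lines starting with 56 are printed, one per line, ready for xargs or command substitution.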

Increase String by Sequential Index

In a file dealing with climatological variables involving a running mean over hours, the hours progress in sequence.
Is there a sed/awk command that would take that hour (a string) in the file and change it by two, so the next time the file is read it's (202), and so on to (204), etc.?
See the number being added to 'i' below.
timeprime = i + 569
'define climomslp = prmslmsl(t = 'timeprime' )
My goal is to increase the number in this case, 569, by one each time the file runs through other commands involved in processing the data.
The next desired number next to i would be
timeprime = i + 570 (where 569 is increased by one)
after that...
timeprime = i + 571 (where 570 is increased by one)
If there isn't a sed/awk command to do such a thing, is there such a thing in any other method?
Thank you for any answers.
You can definitely do this in Python (or Perl, Ruby, or whatever other scripting language you like, but you included a Python tag). For example:
#!/usr/bin/env python
import re
import sys

def replace(m):
    return '{}{}'.format(m.group(1), int(m.group(2)) + 2)

for line in sys.stdin:
    sys.stdout.write(re.sub(r'(timeprime = i \+ )(\d+)', replace, line))
Hopefully the regex itself is trivial to understand:
(timeprime = i \+ )(\d+)
The sub function can take a function, to be applied to the match object, instead of a string as the "replacement". So lines that don't match will be printed unchanged; lines that do will have the match substituted with the same two parts, but with the second part replaced by int(number)+2.
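A minimal, self-contained demonstration of the function-as-replacement idea (same regex as above, one sample line from the question):

```python
import re

def replace(m):
    # Group 1 is kept verbatim; group 2 is the number, bumped by 2.
    return '{}{}'.format(m.group(1), int(m.group(2)) + 2)

line = "timeprime = i + 569\n"
print(re.sub(r'(timeprime = i \+ )(\d+)', replace, line), end='')
# timeprime = i + 571
```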
Here is an alternative using awk:
awk '/^timeprime = i [+]/{$5+=2} 1' file
Starting with this file:
$ cat file
timeprime = i + 569
'define climomslp = prmslmsl(t = 'timeprime' )
We can use the awk command to create a new file:
$ awk '/^timeprime = i [+]/{$5+=2} 1' file
timeprime = i + 571
'define climomslp = prmslmsl(t = 'timeprime' )
To overwrite the original file with the new one, use:
awk '/^timeprime = i [+]/{$5+=2} 1' file >file.tmp && mv file.tmp file
How it works
/^timeprime = i [+]/{$5+=2}
This looks for lines that start with ^timeprime = i + and, on those lines, the fifth field is incremented by 2.
1
This is awk's cryptic shorthand for print the line.
