Alternative to bash (awk command) with Python

Context: I run calculations in a program that produces result files.
To these result files (extension .h5) I can apply a Python script (which I cannot change) that extracts a square matrix:
oneptdm.py resultfile.h5
This gives me, for example:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
points groups
1
2
3
...
in a file called oneptdm.dat
I want to extract the diagonal of this matrix. Usually I simply use bash:
awk '{ for (i=0; i<=NF; i++) if (NR >= 1 && NR == i) print i,$(i) }' oneptdm.dat > diagonal.dat
But for various reasons I now have to do it with Python. How can I do that?
I could of course use "subprocess" to call awk again, but I would like to know whether there is a way to do it in a pure Python script, version 2.6.
The result should be:
(line) (diagonal element)
1 1
2 6
3 11
4 16
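
For comparison, the subprocess route mentioned above would look roughly like this (a sketch only, reusing the question's awk program verbatim and assuming awk is on the PATH; the pure-Python answers follow):

import subprocess

# Sketch of the subprocess fallback the question mentions.
# subprocess.check_output does not exist in 2.6, hence Popen/communicate.
awk_prog = '{ for (i=0; i<=NF; i++) if (NR >= 1 && NR == i) print i,$(i) }'
p = subprocess.Popen(['awk', awk_prog, 'oneptdm.dat'], stdout=subprocess.PIPE)
out, _ = p.communicate()
with open('diagonal.dat', 'w') as f:
    f.write(out)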

You can try something like this:
with open('oneptdm.dat') as f:
    for i, l in enumerate(f):
        fields = l.split()
        if i >= len(fields):  # stop past the matrix (e.g. the "points groups" trailer)
            break
        print '%d\t%s' % (i + 1, fields[i])

This should do the trick. It does assume that the file begins with a square matrix, and that assumption is used to limit the number of lines read from the file.
with open('oneptdm.dat') as f:
    line = next(f).split()
    for i in range(len(line)):
        print('{0}\t{1}'.format(i + 1, line[i]))
        try:
            line = next(f).split()
        except StopIteration:
            break
Output for your sample file:
1 1
2 6
3 11
4 16

Related

Fill missing line numbers into file using sed / awk / bash

I have a (tab-delimited) file where the first "word" on each line is the line number. However, some line numbers are missing. I want to insert new lines (with the corresponding line number) so that, throughout the file, the number printed on the line matches the actual line number. (This is for later consumption by readarray, with cut/awk used to get the text after the line number.)
I've written this logic in Python and tested that it works; however, I need to run it in an environment that doesn't have Python. The actual file is about 10M rows. Is there a way to express this logic using sed, awk, or even just plain shell / bash?
import re
import sys

linenumre = re.compile(r"^\d+")
i = 0
for line in sys.stdin:
    i = i + 1
    linenum = int(linenumre.findall(line)[0])
    while i < linenum:
        print(i)
        i = i + 1
    print(line, end='')
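(For reference: the script reads from standard input, so saved under a hypothetical name such as fill_lines.py it would be invoked as python3 fill_lines.py < test.txt, with the filled output on stdout.)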
The test file looks like:
1 foo 1
2 bar 1
4 qux 1
6 quux 1
9 2
10 fun 2
and the expected output is:
1 foo 1
2 bar 1
3
4 qux 1
5
6 quux 1
7
8
9 2
10 fun 2
Like this, with awk:
awk '{while(++ln!=$1){print ln}}1' input.txt
Explanation, as a multiline script:
{
    # Loop as long as the variable ln (line number)
    # is not equal to the first column, printing the
    # missing line numbers.
    # Note: awk auto-initializes a variable to 0 on
    # its first numeric use.
    while (++ln != $1) {
        print ln
    }
}
1 # this always evaluates to true, making awk print the input lines
I've written this logic in Python and tested that it works; however, I need to run it in an environment that doesn't have Python.
If you need to run Python code where Python is not installed, you can freeze your code. The Hitchhiker's Guide to Python has an overview of tools that can do this. I suggest trying PyInstaller first, as it supports various operating systems and seems easy to use.
This might work for you (GNU sed, seq and join):
join -a1 -t' ' <(seq $(sed -n '$s/ .*//p' file)) file 2>/dev/null
This joins a sequence of line numbers (generated by seq, up to the last line number in file, which sed extracts) with file itself; the -a1 option keeps sequence numbers that have no matching line, producing the missing lines.

How to split a long text file at a particular symbol and paste the split parts side by side

I want to split a long single-column text file at a particular symbol (here >) and paste the resulting pieces side by side, as in the example below.
I tried split -l 4 inputfile > outputfile, but it does not help.
For example, I have data as given below:
>
1
2
2
4
>
4
3
5
3
>
4
5
2
3
and I need output as below:
1 4 4
2 3 5
2 5 2
4 3 3
EDIT: As per the OP's comment, the number of lines between > markers may vary. For that case I have come up with the following, which inserts NA for values missing from a given > block. Written and tested with GNU awk, assuming no empty lines in your Input_file.
awk -v RS=">" -v FS="\n" '
FNR==NR{
  max=(max>NF?max:NF)
  next
}
FNR>1{
  for(i=2;i<max;i++){
    val[i]=(val[i]?val[i] OFS:"")($i?$i:"NA")
  }
}
END{
  for(i=2;i<max;i++){
    print val[i]
  }
}' Input_file Input_file
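Note that Input_file is deliberately passed twice: the first pass only computes max, the largest number of lines in any > block, and the second pass builds the transposed rows, substituting NA wherever a block is short.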
Could you please try the following, written and tested on the shown samples with GNU awk.
awk '
/^>/{
  count=""
  next
}
{
  ++count
  val[count]=(val[count]?val[count] OFS:"")$0
}
END{
  for(i=1;i<=count;i++){
    print val[i]
  }
}' Input_file
Explanation: a detailed explanation of the above.
awk '                ##Starting awk program from here.
/^>/{                ##If a line starts with >, then do the following.
  count=""           ##Nullifying the count variable here.
  next               ##next skips all further statements from here.
}
{
  ++count            ##Incrementing the count variable by 1 here.
  val[count]=(val[count]?val[count] OFS:"")$0   ##Appending the current line to val[count], separated by spaces.
}
END{                 ##Starting the END block of this awk program.
  for(i=1;i<=count;i++){   ##Starting a for loop here.
    print val[i]     ##Printing array val with index i here.
  }
}' Input_file        ##Mentioning the Input_file name here.
Please try the program below:
a=""">
1
2
2
4
>
4
3
5
3
>
4
5
2
3"""
res=[[c for c in b.split("\n") if c] for b in a.split(">") if b]
print("\n".join([" ".join([item[i] for item in res]) for i in range(len(res[0]))]))
Output
1 4 4
2 3 5
2 5 2
4 3 3
If you want to read from a file, use the program below.
It produces the same output as above.
with open("input.txt","r") as f, open("output.txt","w") as f1:
a=f.read()
res=[[c for c in b.split("\n") if c] for b in a.split(">") if b]
f1.write("\n".join([" ".join([item[i] for item in res]) for i in range(len(res[0]))]))
A Python solution as you tagged Python:
columns = []  # List of columns; each column will be another list of lines
with open('example.txt', 'r') as f:
    for line in f:
        line = line.strip()  # Remove leading and trailing whitespace like "\n"
        if line == '>':
            columns.append([])  # If we find a ">", append a new column
        else:
            columns[-1].append(line)  # else append the line to the last column
with open('output.txt', 'w') as f:
    for row in zip(*columns):  # zip(*columns) transposes the matrix
        f.write(" ".join(row) + "\n")

How can I paste contents of 2 files or single file multiple times?

I mostly use one-liners in shell scripting.
If I have a file with contents as below:
1
2
3
and want it to be pasted like:
1 1
2 2
3 3
How can I do it with a Python one-liner in a shell script?
PS: I tried the following:
python -c "file = open('array.bin','r' ) ; cont=file.read ( ) ; print cont*3;file.close()"
but it printed the contents like:
1
2
3
1
2
3
file = open('array.bin', 'r')
cont = file.readlines()
for line in cont:
    print line, line
file.close()
You could replace your print cont*3 with the following:
print '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
Here n is the number of columns.
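To see how that expression expands, here is a quick demonstration with n = 2 and the sample contents:

cont = '1\n2\n3\n'   # what file.read() returns for the sample file
n = 2
# ch * n repeats the token ('1' -> '11'); ' '.join then spaces it out ('1 1')
print '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
# 1 1
# 2 2
# 3 3

Note that ' '.join(ch * n) only yields space-separated copies because each token here is a single character; for multi-character tokens you would want ' '.join([ch] * n) instead.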
You need to break up the lines and then reassemble:
One Liner:
python -c "file=open('array.bin','r'); cont=file.readlines(); print '\n'.join([' '.join([c.strip()]*2) for c in cont]); file.close()"
Long form:
file=open('array.bin', 'r')
cont=file.readlines()
print '\n'.join([' '.join([c.strip()]*2) for c in cont])
file.close()
With array.bin having:
1
2
3
Gives:
1 1
2 2
3 3
Unfortunately, you can't use a simple for statement for a one-liner solution (as suggested in a previous answer). As this answer explains, "as soon as you add a construct that introduces an indented block (like if), you need the line break."
Here's one possible solution that avoids this problem:
Open the file and read its lines into a list.
Modify the list (using a list comprehension): for each item, remove the trailing newline character, then join n copies of the item with spaces (n = the number of columns).
Join the modified list using the newline character as separator.
Print the joined list and close the file.
Detailed/long form (n = number of columns):
f = open('array.bin', 'r')
n = 5
original = list(f)
modified = [' '.join([line.strip()] * n) for line in original]
print('\n'.join(modified))
f.close()
One-liner:
python -c "f = open('array.bin', 'r'); n = 5; print('\n'.join([line.strip()*n for line in list(f)])); f.close()"
REPEAT_COUNT=3 && cat contents.txt | python -c "print('\n'.join(' '.join([w.strip()] * ${REPEAT_COUNT}) for w in open('/dev/stdin').readlines()))"
First, test from the command prompt:
paste -d" " array.bin array.bin
EDIT:
The OP wants to use a variable n to specify how many columns are needed.
There are different ways to repeat a command 10 times, such as
for i in {1..10}; do echo array.bin; done
seq 10 | xargs -I -- echo "array.bin"
source <(yes echo "array.bin" | head -n10)
yes "array.bin" | head -n10
Other ways are given by https://superuser.com/a/86353 and I will use a variation of
printf -v spaces '%*s' 10 ''; printf '%s\n' ${spaces// /ten}
My solution is:
paste -d" " $(printf "%*s" $n " " | sed 's/ /array.bin /g')
Here printf prints n spaces and sed replaces each space with "array.bin ", so paste receives the file name n times.

Python script slow to read and write gz files

I have an xxx.wig.gz file that has 3,000,000,000 lines in the following format:
fixedStep chrom=chr1 start=1 step=1
0
0
0
0
0
1
2
3
4
5
6
7
8
9
10
...
fixedStep chrom=chr2 start=1 step=1
0
0
0
0
0
11
12
13
14
15
16
17
18
19
20
...
and I want to:
Break it down by "chrom": every time I read a line starting with "fixedStep", I create a new file and close the old one.
Produce 0/1 output by comparing each value to a threshold: 1 if it passes, otherwise 0.
Below is my Python script, which runs super slow (I project it to finish in ~10 hours; so far 2 chromosomes are done after ~1 hour).
Can someone help me improve it?
#!/bin/env python
import gzip
import re
import os
import sys

fn = sys.argv[1]
f = gzip.open(fn)
fo_base = os.path.basename(fn).rstrip('.wig').rstrip('.wig.gz')
fo_ext = '.bt.gz'
thres = 100
fo = None
for l in f:
    if l.startswith("fixedStep"):
        if fo is not None:
            fo.flush()
            fo.close()
        fon = re.search(r'chrom=(\w*)', l).group(0).split('=')[-1]
        fo = gzip.open(fo_base + "_" + fon + fo_ext, 'wb')
    else:
        if int(l.strip()) >= thres:
            fo.write("1\n")
        else:
            fo.write("0\n")
if fo is not None:
    fo.flush()
    fo.close()
f.close()
PS: I assume awk could do this much faster, but I am not great with awk.
Thanks Summer for editing the text.
I added buffered read/write to the script and now it is several times faster (still relatively slow though):
import io
f = io.BufferedReader(gzip.open(fn))
fo = io.BufferedWriter(gzip.open(fo_base + "." + fon + fo_ext, 'wb'))
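One further idea, offered only as a sketch (it is not from the original thread): batch the 0/1 lines and write each chunk with a single call, so the gzip layer sees far fewer small writes. Python 2 semantics, matching the script above; the helper name and chunk size are assumptions.

import gzip

def write_binarized(values, out_path, thres=100, chunk_size=100000):
    # Hypothetical helper: 'values' is an iterable of numeric lines for one
    # chromosome; buffer the "0"/"1" results and flush them in large chunks.
    fo = gzip.open(out_path, 'wb')
    buf = []
    for v in values:
        buf.append("1\n" if int(v) >= thres else "0\n")
        if len(buf) >= chunk_size:
            fo.write("".join(buf))
            buf = []
    if buf:
        fo.write("".join(buf))
    fo.close()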

How to compare two files and print only the lines of the second file that match the first

I have two files: ref.txt, which has two columns, and file.txt, which has three.
In ref.txt,
1 2
2 3
3 5
In file.txt,
1 2 4 <---here matching
3 4 5
6 9 4
2 3 10 <---here matching
4 7 9
3 5 7 <---here matching
I would like to compare the first two columns of each file and print only the lines of file.txt that match a line in ref.txt.
So, the output should be,
1 2 4
2 3 10
3 5 7
I thought of a two-dictionary comparison like this,
mydict = {}
mydict1 = {}
with open('ref.txt') as f1:
    for line in f1:
        key, key1 = line.split()
        sp1 = mydict[key, key1]
with open('file.txt') as f2:
    for lines in f2:
        item1, item2, value = lines.split()
        sp2 = mydict1[item1, item2]
        if sp1 == sp2:
            print value
How can I compare the two files appropriately, with dictionaries or otherwise?
I have found some Perl and Python code that handles the case where both files have the same number of columns, but in my case one file has two columns and the other has three.
How can I compare the two files and print only the matching values?
Here's another option:
use strict;
use warnings;

my $file = pop;
my %hash = map { chomp; $_ => 1 } <>;
push @ARGV, $file;

while (<>) {
    print if /^(\d+\s+\d+)/ and $hash{$1};
}
Usage: perl script.pl ref.txt file.txt [>outFile]
The last, optional parameter directs output to a file.
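(How it works: pop takes file.txt off @ARGV, so the first <> reads only ref.txt and its chomped lines become hash keys; file.txt is then pushed back onto @ARGV, and the while loop prints each line whose leading two numbers, captured by the regex, exist in the hash.)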
Output on your datasets:
1 2 4
2 3 10
3 5 7
Hope this helps!
grep -Ff ref.txt file.txt
is enough if the amount of whitespace between the characters is the same in both files. If it is not, you can do
awk '{print "^" $1 "[[:space:]]+" $2}' | xargs -I {} grep -E {} file.txt
combining three of my favorite utilities: awk, grep, and xargs... This latter method also ensures that the match only occurs at the start of the line (comparing column 1 with column 1, and column 2 with column 2).
Here's a revised and commented version that should work on your larger data set:
# read in your reference and the file
reference = open("ref.txt").read()
filetext = open("file.txt").read()

# split each file into a list of strings, splitting at every newline
splitReference = reference.split("\n")
splitFile = filetext.split("\n")

# then, for each line in the reference,
for referenceLine in splitReference:
    # split that line into a list of strings at each stretch of whitespace
    referenceCells = referenceLine.split()
    # then, for each line in your 'file',
    for fileLine in splitFile:
        # split it the same way
        lineCells = fileLine.split()
        # skip blank or too-short lines before indexing into them
        if len(referenceCells) > 1 and len(lineCells) > 1:
            # compare the first two columns; on a match, print the file line
            if referenceCells[0] == lineCells[0] and referenceCells[1] == lineCells[1]:
                print fileLine
Output:
1 2 4
2 3 10
3 5 7
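
For completeness, the dictionary idea from the question is workable if the (column 1, column 2) pairs are stored as keys of a set; this is a sketch, not one of the original answers, and it avoids the nested loops above:

# Sketch (not from the original answers): store the (col1, col2) pairs
# from ref.txt in a set, then print file.txt lines whose first two
# columns appear in that set.
pairs = set()
with open('ref.txt') as f1:
    for line in f1:
        fields = line.split()
        if len(fields) >= 2:
            pairs.add((fields[0], fields[1]))

with open('file.txt') as f2:
    for line in f2:
        fields = line.split()
        if len(fields) >= 2 and (fields[0], fields[1]) in pairs:
            print line,   # trailing comma: line already ends with "\n" (Python 2)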
