I have a (tab-delimited) file where the first "word" on each line is the line number. However, some line numbers are missing. I want to insert new lines (with corresponding line number) so that throughout the file, the number printed on the line matches the actual line number. (This is for later consumption into readarray with cut/awk to get the line after the line number.)
I've written this logic in Python and tested that it works; however, I need to run it in an environment that doesn't have Python. The actual file is about 10M rows. Is there a way to express this logic using sed, awk, or even just plain shell/bash?
import re
import sys

linenumre = re.compile(r"^\d+")
i = 0
for line in sys.stdin:
    i = i + 1
    linenum = int(linenumre.findall(line)[0])
    while i < linenum:
        print(i)
        i = i + 1
    print(line, end='')
test file looks like:
1 foo 1
2 bar 1
4 qux 1
6 quux 1
9 2
10 fun 2
expected output like:
1 foo 1
2 bar 1
3
4 qux 1
5
6 quux 1
7
8
9 2
10 fun 2
Like this, with awk:
awk '{while(++ln!=$1){print ln}}1' input.txt
Explanation, as a multiline script:
{
    # Loop as long as the variable ln (line number)
    # is not equal to the first column, printing the
    # missing line numbers on their own.
    # Note: awk auto-initializes a variable to 0
    # upon its first use in a numeric context
    while (++ln != $1) {
        print ln
    }
}
1 # this always evaluates to true, making awk print the input lines
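Since the question also allows plain shell, here is a minimal bash sketch of the same logic (the function name fill_gaps is made up; it assumes the first field is whitespace-delimited and the numbers only increase). Expect it to be much slower than awk on a 10M-row file:

```shell
fill_gaps() {
    local ln=0 line num
    while IFS= read -r line; do
        num=${line%%[[:space:]]*}    # first whitespace-delimited field
        while (( ++ln < num )); do
            printf '%s\n' "$ln"      # emit each missing line number alone
        done
        printf '%s\n' "$line"
    done
}

fill_gaps < input.txt
```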
I've written this logic in python and tested it works, however I need to run this in an environment that doesn't have python.
In case you want to run Python code where Python is not installed, you might freeze your code. The Hitchhiker's Guide to Python has an overview of tools that can do this. I suggest first trying PyInstaller, as it supports various operating systems and seems easy to use.
This might work for you (GNU join, sed, and seq):
join -a1 -t' ' <(seq $(sed -n '$s/ .*//p' file)) file 2>/dev/null
This joins file with the output of seq (counting up to the last line number in file, which sed extracts); -a1 keeps the sequence lines that have no match, so the missing line numbers appear on their own.
I have a text file that looks something like this:
0 1 2 3 4 5
1 3
3 1
5 4
4 5
expected output:
SET : {0,1,2,3,4,5}
RELATION : {(1,3),(3,1),(5,4),(4,5)}
REFLEXIVE : NO
SYMMETRIC : YES
Part of the task is to print the first line inside curly braces, and the rest within one giant set of curly braces with each ordered pair in parentheses. I am still a beginner, but I wanted to know: is there a way in Python to make one loop that treats the first line differently from the rest?
Try this, where filename.txt is your file:
with open("filename.txt", "r") as file:
    set_firstline = []
    first_string = file.readline()
    list_of_first_string = list(first_string)
    for i in range(len(list_of_first_string)):
        if str(i) in first_string:
            set_firstline.append(i)
    print(set_firstline)
OUTPUT : [0,1,2,3,4,5]
I'm new as well, so I hope I can help you.
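Going a step further, here is a hedged sketch that produces the full expected output, treating the first line as the set and the remaining lines as ordered pairs (the function name describe is made up, and the reflexive/symmetric checks are my reading of the expected output):

```python
def describe(lines):
    """Build the SET/RELATION/REFLEXIVE/SYMMETRIC report from raw lines."""
    # First non-blank line is the set; the rest are pairs.
    first, *rest = [line.split() for line in lines if line.strip()]
    s = [int(x) for x in first]
    rel = [(int(a), int(b)) for a, b in rest]
    # Reflexive: every (x, x) must be present; symmetric: (b, a) for every (a, b).
    reflexive = all((x, x) in rel for x in s)
    symmetric = all((b, a) in rel for (a, b) in rel)
    return [
        "SET : {%s}" % ",".join(map(str, s)),
        "RELATION : {%s}" % ",".join("(%d,%d)" % p for p in rel),
        "REFLEXIVE : %s" % ("YES" if reflexive else "NO"),
        "SYMMETRIC : %s" % ("YES" if symmetric else "NO"),
    ]
```

For the sample file this returns the four report lines shown in the expected output.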
I have problems with running a bash script inside a python script script.py:
import os
bashCommand = """
sed "s/) \['/1, color=\"#ffcccc\", label=\"/g" list.txt | sed 's/\[/ GraphicFeature(start=/g' | sed 's/\:/, end=/g' | sed 's/>//g' | sed 's/\](/, strand=/g' | sed "s/'\]/\"),/g" >list2.txt"""
os.system("bash %s" % bashCommand)
When I run this as python script.py, no list2.txt is written, but on the terminal I see that I am inside bash-4.4 instead of the native macOS bash.
Any ideas what could cause this?
The script I posted above is part of a bigger script, where first it reads in some file and outputs list.txt.
edit: here comes some more description
In a first python script, I parsed a file (genbank file, to be specific), to write out a list with items (location, strand, name) into list.txt.
This list.txt has to be transformed to be parsable by a second python script, therefore the sed.
list.txt
[0:2463](+) ['bifunctional aspartokinase/homoserine dehydrogenase I']
[2464:3397](+) ['Homoserine kinase']
[3397:4684](+) ['Threonine synthase']
all the brackets, :, ' have to be replaced to look like desired output list2.txt
GraphicFeature(start=0, end=2463, strand=+1, color="#ffcccc", label="bifunctional aspartokinase/homoserine dehydrogenase I"),
GraphicFeature(start=2464, end=3397, strand=+1, color="#ffcccc", label="Homoserine kinase"),
GraphicFeature(start=3397, end=4684, strand=+1, color="#ffcccc", label="Threonine synthase"),
Read the file in Python, parse each line with a single regular expression, and output an appropriate line constructed from the captured pieces.
import re
import sys

regex = re.compile(r"^\[(\d+):(\d+)\]\(\+\) \['(.*)'\]$")
# group 1 - start value
# group 2 - end value
# group 3 - text value

with open("list2.txt", "w") as out:
    for line in sys.stdin:
        line = line.strip()
        m = regex.match(line)
        if m is None:
            print(line, file=out)
        else:
            print('GraphicFeature(start={}, end={}, strand=+1, color="#ffcccc", label="{}"),'.format(*m.groups()), file=out)
I output lines that don't match the regular expression unmodified; you may want to ignore them altogether or report an error instead.
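As an aside on the original os.system call: `bash STRING` treats STRING as a script *filename*, not a command, which likely explains the odd behavior. A hedged sketch of running a pipeline through the shell directly instead (the sed expression and file contents here are simplified stand-ins, not the OP's real command or data):

```python
import pathlib
import subprocess

# Simplified stand-in for the OP's list.txt (not the real data)
pathlib.Path("list.txt").write_text("[0:2463](+) ['Homoserine kinase']\n")

# shell=True hands the whole string to the shell for parsing,
# so pipes, quotes, and redirection behave as they would interactively.
subprocess.run("sed 's/^.//' list.txt > list2.txt", shell=True, check=True)
```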
I am using mostly one liners in shell scripting.
If I have a file with contents as below:
1
2
3
and want it to be pasted like:
1 1
2 2
3 3
how can I do it in shell scripting using python one liner?
PS: I tried the following:-
python -c "file = open('array.bin','r' ) ; cont=file.read ( ) ; print cont*3;file.close()"
but it printed contents like:-
1
2
3
1
2
3
file = open('array.bin', 'r')
cont = file.readlines()
for line in cont:
    print line, line
file.close()
You could replace your print cont*3 with the following:
print '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
Here n is the number of columns.
You need to break up the lines and then reassemble:
One Liner:
python -c "file=open('array.bin','r'); cont=file.readlines(); print '\n'.join([' '.join([c.strip()]*2) for c in cont]); file.close()"
Long form:
file=open('array.bin', 'r')
cont=file.readlines()
print '\n'.join([' '.join([c.strip()]*2) for c in cont])
file.close()
With array.bin having:
1
2
3
Gives:
1 1
2 2
3 3
Unfortunately, you can't use a simple for statement for a one-liner solution (as suggested in a previous answer). As this answer explains, "as soon as you add a construct that introduces an indented block (like if), you need the line break."
Here's one possible solution that avoids this problem:
- Open the file and read its lines into a list
- Modify the list (using a list comprehension); for each item:
  - remove the trailing newline character
  - multiply it by the number of columns
- Join the modified list using the newline character as separator
- Print the joined list and close the file
Detailed/long form (n = number of columns):
f = open('array.bin', 'r')
n = 5
original = list(f)
modified = [line.strip() * n for line in original]
print('\n'.join(modified))
f.close()
One-liner:
python -c "f = open('array.bin', 'r'); n = 5; print('\n'.join([line.strip()*n for line in list(f)])); f.close()"
REPEAT_COUNT=3 && cat contents.txt| python -c "print('\n'.join(w.strip() * ${REPEAT_COUNT} for w in open('/dev/stdin').readlines()))"
First, test from the command prompt:
paste -d" " array.bin array.bin
EDIT:
OP wants to use a variable n to specify how many columns are needed.
There are different ways to repeat a command 10 times, such as
for i in {1..10}; do echo array.bin; done
seq 10 | xargs -I -- echo "array.bin"
source <(yes echo "array.bin" | head -n10)
yes "array.bin" | head -n10
Other ways are given by https://superuser.com/a/86353 and I will use a variation of
printf -v spaces '%*s' 10 ''; printf '%s\n' ${spaces// /ten}
My solution is
paste -d" " $(printf "%*s" $n " " | sed 's/ /array.bin /g')
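Putting the paste trick together, a small demonstration with a sample array.bin (contents assumed from the question):

```shell
# Create the sample input from the question
printf '1\n2\n3\n' > array.bin

# printf emits n spaces; sed turns each space into "array.bin ",
# so paste receives the file name n times as separate arguments.
n=3
paste -d" " $(printf "%*s" "$n" " " | sed 's/ /array.bin /g')
# → 1 1 1
#   2 2 2
#   3 3 3
```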
I have two files. One has two columns, ref.txt. The other has three columns, file.txt.
In ref.txt,
1 2
2 3
3 5
In file.txt,
1 2 4 <---here matching
3 4 5
6 9 4
2 3 10 <---here matching
4 7 9
3 5 7 <---here matching
I would like to compare two columns for each file, then only print the lines in file.txt matching the ref.txt.
So, the output should be,
1 2 4
2 3 10
3 5 7
I thought of comparing two dictionaries, something like:

mydict = {}
mydict1 = {}

with open('ref.txt') as f1:
    for line in f1:
        key, key1 = line.split()
        sp1 = mydict[key, key1]

with open('file.txt') as f2:
    for lines in f2:
        item1, item2, value = lines.split()
        sp2 = mydict1[item1, item2]
        if sp1 == sp2:
            print value
How can I compare two files appropriately with dictionary or others?
I found some Perl and Python code that solves this when both files have the same number of columns.
In my case, one file has two columns and the other has three.
How to compare two files and only print matching values?
Here's another option:
use strict;
use warnings;

my $file = pop;
my %hash = map { chomp; $_ => 1 } <>;

push @ARGV, $file;

while (<>) {
    print if /^(\d+\s+\d+)/ and $hash{$1};
}
Usage: perl script.pl ref.txt file.txt [>outFile]
The last, optional parameter directs output to a file.
Output on your datasets:
1 2 4
2 3 10
3 5 7
Hope this helps!
grep -Ff ref.txt file.txt
is enough if the amount of whitespace between the characters is the same in both files. If it is not, you can do
awk '{print "^" $1 "[[:space:]]+" $2}' ref.txt | xargs -I {} grep -E {} file.txt
combining three of my favorite utilities: awk, grep, and xargs... This latter method also ensures that the match only occurs at the start of the line (comparing column 1 with column 1, and column 2 with column 2).
Here's a revised and commented version that should work on your larger data set:
#read in your reference and the file
reference = open("ref.txt").read()
filetext = open("file.txt").read()
#split the reference file into a list of strings, splitting each time you encounter a new line
splitReference = reference.split("\n")
#do the same for the file
splitFile = filetext.split("\n")
#then, for each line in the reference,
for referenceLine in splitReference:
#split that line into a list of strings, splitting each time you encouter a stretch of whitespace
referenceCells = referenceLine.split()
#then, for each line in your 'file',
for fileLine in splitFile:
#split that line into a list of strings, splitting each time you encouter a stretch of whitespace
lineCells = fileLine.split()
#now, for each line in 'reference' check to see if the first value is equal to the first value of the current line in 'file'
if referenceCells[0] == lineCells[0]:
#if those are equal, then check to see if the current rows of the reference and the file both have a length of more than one
if len(referenceCells) > 1:
if len(lineCells) > 1:
#if both have a length of more than one, compare the values in their second columns. If they are equal, print the file line
if referenceCells[1] == lineCells[1]:
print fileLine
Output:
1 2 4
2 3 10
3 5 7
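For comparison, a set-based sketch (function name assumed) that avoids the nested loop by hashing the reference pairs, which matters if file.txt is large:

```python
def matching_lines(ref_path, file_path):
    """Yield lines of file_path whose first two columns appear as a pair in ref_path."""
    with open(ref_path) as f:
        # Each ref.txt line becomes a (col1, col2) tuple in a set.
        pairs = {tuple(line.split()) for line in f if line.strip()}
    with open(file_path) as f:
        for line in f:
            cols = line.split()
            # Membership test is O(1) per line instead of a scan of ref.txt.
            if tuple(cols[:2]) in pairs:
                yield line.rstrip("\n")
```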
Here is my problem.
I have n files, and they all have overlapping and common text in them. I want to create a file from these n files such that the new file contains only the unique lines that exist across all of the n files.
I am looking for a bash command or Python API that can do this for me. If there is an algorithm, I can also attempt to code it myself.
If the order of the lines is not important, you could do this:
sort -u file1 file2 ...
This will (a) sort all the lines in all the files, and then (b) remove duplicates. This will give you the lines that are unique among all the files.
For testing common data you can use comm:
DESCRIPTION
The comm utility reads file1 and file2, which should be sorted lexically,
and produces three text columns as output: lines only in file1; lines only in
file2; and lines in both files.
Another useful tool would be merge:
DESCRIPTION
merge incorporates all changes that lead from file2 to file3 into file1.
The result ordinarily goes into file1. merge is useful for combining separate
changes to an original.
sort might mess up your order. You can try the following awk command; it hasn't been tested, so make sure you back up your files first. :)
awk ' !x[$0]++' big_merged_file
This will remove all duplicate lines from your file.
This might work for you:
# ( seq 1 5; seq 3 7; )
1
2
3
4
5
3
4
5
6
7
# ( seq 1 5; seq 3 7; ) | sort -nu
1
2
3
4
5
6
7
# ( seq 1 5; seq 3 7; ) | sort -n | uniq -u
1
2
6
7
# ( seq 1 5; seq 3 7; ) | sort -n | uniq -d
3
4
5
You need to merge everything first, sort, then finally remove duplicates:
#!/bin/bash
for file in test/*
do
    cat "$file" >> final
done
sort final > final2
uniq final2 final
rm -f final2
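The three steps above can also be collapsed into a single pipeline with no temporary files (the test/ directory and sample contents below are assumed for illustration):

```shell
# Hypothetical sample inputs
mkdir -p test
printf 'a\nb\n' > test/f1
printf 'b\nc\n' > test/f2

# Concatenate, sort, and deduplicate in one pipeline
sort test/* | uniq > final

# Or keep first-seen order instead of sorting
cat test/* | awk '!seen[$0]++' > final_ordered
```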