I have a file of 10000 rows, e.g.,
1.2341105289455E+03 1.1348135000000E+00
I would like to have
1.2341105289455E+03 0.0 1.1348135000000E+00 0.0
and insert columns of '0.0' in it.
I tried to replace 'space' into '0.0' it works but I don't think it is the best solution. I tried with awk but I was only able to add '0.0' at the end of the file.
I bet there is a better solution to it. Do you know how to do it? awk? python? emacs?
Use this Perl one-liner:
perl -lane 'print join "\t", $F[0], "0.0", $F[1], "0.0"; ' in_file > out_file
The perl one-liner uses these command line flags:
-e : tells Perl to look for code in-line, instead of in a file.
-n : loop over the input one line at a time, assigning it to $_ by default.
-l : strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : split $_ into array #F on whitespace or on the regex specified in -F option.
SEE ALSO:
perlrun: command line switches
with awk
awk '{print $1,"0.0",$2,"0.0"}' file
If you want to modify the file inplace, you can do it either with GNU awk adding the-i inplace option, or adding > tmp && mv tmp file to the existing command. But always run it first without replacing, to test it and confirm the output.
Related
Basically I want to take as input text from a file, remove a line from that file, and send the output back to the same file. Something along these lines if that makes it any clearer.
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > file_name
however, when I do this I end up with a blank file.
Any thoughts?
Use sponge for this kind of tasks. Its part of moreutils.
Try this command:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | sponge file_name
You cannot do that because bash processes the redirections first, then executes the command. So by the time grep looks at file_name, it is already empty. You can use a temporary file though.
#!/bin/sh
tmpfile=$(mktemp)
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > ${tmpfile}
cat ${tmpfile} > file_name
rm -f ${tmpfile}
like that, consider using mktemp to create the tmpfile but note that it's not POSIX.
Use sed instead:
sed -i '/seg[0-9]\{1,\}\.[0-9]\{1\}/d' file_name
try this simple one
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
Your file will not be blank this time :) and your output is also printed to your terminal.
You can't use redirection operator (> or >>) to the same file, because it has a higher precedence and it will create/truncate the file before the command is even invoked. To avoid that, you should use appropriate tools such as tee, sponge, sed -i or any other tool which can write results to the file (e.g. sort file -o file).
Basically redirecting input to the same original file doesn't make sense and you should use appropriate in-place editors for that, for example Ex editor (part of Vim):
ex '+g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' -scwq file_name
where:
'+cmd'/-c - run any Ex/Vim command
g/pattern/d - remove lines matching a pattern using global (help :g)
-s - silent mode (man ex)
-c wq - execute :write and :quit commands
You may use sed to achieve the same (as already shown in other answers), however in-place (-i) is non-standard FreeBSD extension (may work differently between Unix/Linux) and basically it's a stream editor, not a file editor. See: Does Ex mode have any practical use?
One liner alternative - set the content of the file as variable:
VAR=`cat file_name`; echo "$VAR"|grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' > file_name
Since this question is the top result in search engines, here's a one-liner based on https://serverfault.com/a/547331 that uses a subshell instead of sponge (which often isn't part of a vanilla install like OS X):
echo "$(grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name)" > file_name
The general case is:
echo "$(cat file_name)" > file_name
Edit, the above solution has some caveats:
printf '%s' <string> should be used instead of echo <string> so that files containing -n don't cause undesired behavior.
Command substitution strips trailing newlines (this is a bug/feature of shells like bash) so we should append a postfix character like x to the output and remove it on the outside via parameter expansion of a temporary variable like ${v%x}.
Using a temporary variable $v stomps the value of any existing variable $v in the current shell environment, so we should nest the entire expression in parentheses to preserve the previous value.
Another bug/feature of shells like bash is that command substitution strips unprintable characters like null from the output. I verified this by calling dd if=/dev/zero bs=1 count=1 >> file_name and viewing it in hex with cat file_name | xxd -p. But echo $(cat file_name) | xxd -p is stripped. So this answer should not be used on binary files or anything using unprintable characters, as Lynch pointed out.
The general solution (albiet slightly slower, more memory intensive and still stripping unprintable characters) is:
(v=$(cat file_name; printf x); printf '%s' ${v%x} > file_name)
Test from https://askubuntu.com/a/752451:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do (v=$(cat file_uniquely_named.txt; printf x); printf '%s' ${v%x} > file_uniquely_named.txt); done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Should print:
hello
world
Whereas calling cat file_uniquely_named.txt > file_uniquely_named.txt in the current shell:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do cat file_uniquely_named.txt > file_uniquely_named.txt; done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Prints an empty string.
I haven't tested this on large files (probably over 2 or 4 GB).
I have borrowed this answer from Hart Simha and kos.
This is very much possible, you just have to make sure that by the time you write the output, you're writing it to a different file. This can be done by removing the file after opening a file descriptor to it, but before writing to it:
exec 3<file ; rm file; COMMAND <&3 >file ; exec 3>&-
Or line by line, to understand it better :
exec 3<file # open a file descriptor reading 'file'
rm file # remove file (but fd3 will still point to the removed file)
COMMAND <&3 >file # run command, with the removed file as input
exec 3>&- # close the file descriptor
It's still a risky thing to do, because if COMMAND fails to run properly, you'll lose the file contents. That can be mitigated by restoring the file if COMMAND returns a non-zero exit code :
exec 3<file ; rm file; COMMAND <&3 >file || cat <&3 >file ; exec 3>&-
We can also define a shell function to make it easier to use :
# Usage: replace FILE COMMAND
replace() { exec 3<$1 ; rm $1; ${#:2} <&3 >$1 || cat <&3 >$1 ; exec 3>&- }
Example :
$ echo aaa > test
$ replace test tr a b
$ cat test
bbb
Also, note that this will keep a full copy of the original file (until the third file descriptor is closed). If you're using Linux, and the file you're processing on is too big to fit twice on the disk, you can check out this script that will pipe the file to the specified command block-by-block while unallocating the already processed blocks. As always, read the warnings in the usage page.
The following will accomplish the same thing that sponge does, without requiring moreutils:
shuf --output=file --random-source=/dev/zero
The --random-source=/dev/zero part tricks shuf into doing its thing without doing any shuffling at all, so it will buffer your input without altering it.
However, it is true that using a temporary file is best, for performance reasons. So, here is a function that I have written that will do that for you in a generalized way:
# Pipes a file into a command, and pipes the output of that command
# back into the same file, ensuring that the file is not truncated.
# Parameters:
# $1: the file.
# $2: the command. (With $3... being its arguments.)
# See https://stackoverflow.com/a/55655338/773113
siphon()
{
local tmp file rc=0
[ "$#" -ge 2 ] || { echo "Usage: siphon filename [command...]" >&2; return 1; }
file="$1"; shift
tmp=$(mktemp -- "$file.XXXXXX") || return
"$#" <"$file" >"$tmp" || rc=$?
mv -- "$tmp" "$file" || rc=$(( rc | $? ))
return "$rc"
}
There's also ed (as an alternative to sed -i):
# cf. http://wiki.bash-hackers.org/howto/edit-ed
printf '%s\n' H 'g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' wq | ed -s file_name
You can use slurp with POSIX Awk:
!/seg[0-9]\{1,\}\.[0-9]\{1\}/ {
q = q ? q RS $0 : $0
}
END {
print q > ARGV[1]
}
Example
This does the trick pretty nicely in most of the cases I faced:
cat <<< "$(do_stuff_with f)" > f
Note that while $(…) strips trailing newlines, <<< ensures a final newline, so generally the result is magically satisfying.
(Look for “Here Strings” in man bash if you want to learn more.)
Full example:
#! /usr/bin/env bash
get_new_content() {
sed 's/Initial/Final/g' "${1:?}"
}
echo 'Initial content.' > f
cat f
cat <<< "$(get_new_content f)" > f
cat f
This does not truncate the file and yields:
Initial content.
Final content.
Note that I used a function here for the sake of clarity and extensibility, but that’s not a requirement.
A common usecase is JSON edition:
echo '{ "a": 12 }' > f
cat f
cat <<< "$(jq '.a = 24' f)" > f
cat f
This yields:
{ "a": 12 }
{
"a": 24
}
Try this
echo -e "AAA\nBBB\nCCC" > testfile
cat testfile
AAA
BBB
CCC
echo "$(grep -v 'AAA' testfile)" > testfile
cat testfile
BBB
CCC
I usually use the tee program to do this:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
It creates and removes a tempfile by itself.
I have a folder full of files with lines which look like this:
S149.sh
sox preaching.wav _001 trim 889.11 891.23
sox preaching.wav _002 trim 891.45 893.92
sox preaching.wav _003 trim 1599.95 1606.78
And I want to add the filename without its extension (which is S149) right before the first occurrence of the _ character in every line, so that it ends up looking like this:
sox preaching.wav S149_001 trim 889.11 891.23
sox preaching.wav S149_002 trim 891.45 893.92
sox preaching.wav S149_003 trim 1599.95 1606.78
And I want to automatically do this for every *.sh file in a given folder.
How do I achieve that with either bash (this includes awk, grep, sed, etc.) or python? Any help will be greatly appreciated.
One possibility, using ed, the standard editor and a loop:
for i in *.sh; do
printf '%s\n' ",g/_/ s/_/${i%.sh}&/" w q | ed -s -- "$i"
done
The parameter expansion ${i%.sh} expands to $i where the suffix .sh is removed.
The ed commands are, in the case i=S149.sh:
,g/_/ s/_/S149&/
w
,g/_/ marks all lines containing an underscore and s/_/S149&/ replaces the underscore by S149_. Then w writes the file.
A sed version:
for i in *.sh; do
sed -i "s/_/${i%.*}_/g" "$i"
done
${i%.*} expands to the filename minus the extension used by the in-place replacement operation.
With GNU awk for inplace editing:
awk -i inplace 'FNR==1{f=gensub(/\.[^.]+$/,"",1,FILENAME)} {$3=f$3} 1' *.sh
If you're considering using a shell loop instead, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
#Ruran- In case you do not have an awk which could do Input_file editing while reading the Input_file then following may help you in same.
awk '(FILENAME != P && P && Q){close(P);system("mv " Q OFS P)} {Q=P=FILENAME;sub(/\..*/,X,Q);sub(/_/,Q"&");print > Q;} END{system("mv " Q OFS P)}' *.sh
Logic, behind is simple it is changing the first occurrence of _(char) and then it is keeping the new formatted lines into a tmp file while reading next Input_file it is renaming that temp file into the previous Input_file.
Also one more point which I have not seen here in above posts it as we are using *.sh so let's say you have thousands of Input_files then code may give error which is because of too many Input_files will be opened and we are NOT closing the files, so I am closing them too, let me know if this helps you.
A non-one liner form of solution too as follows.
awk '(FILENAME != P && P && Q){
close(P);
system("mv " Q OFS P)
}
{
Q=P=FILENAME;
sub(/\..*/,X,Q);
sub(/_/,Q"&");
print > Q;
}
END {
system("mv " Q OFS P)
}
' *.sh
Trying to print filename of files that don't have 12 columns.
This works at the command line:
for i in *dim*; do awk -F',' '{if (NR==1 && NF!=12)print FILENAME}' $i; done;
When I try to embed this in subprocess.call in a python script, it doesn't work:
subprocess.call("""for %i in (*dim*.csv) do (awk -F, '{if ("NR==1 && NF!=12"^) {print FILENAME}}' %i)""", shell=True)
The first error I received was "Print is unexpected at this time" so I googled and added ^ within parentheses. Next error was "unexpected newline or end of string" so googled again and added the quotes around NR==1 && NF!=12. With the current code it's printing many lines in each file so I suspect something is wrong with the if statement. I've used awk and for looped before in this style in subprocess.call but not combined and with an if statement.
Multiple input files in AWK
In the string you are passing to subprocess.call(), your if statement is evaluating a string (probably not the comparison you want). It might be easier to just simplify the shell command by doing everything in AWK. You are executing AWK for every $i in the shell's for loop. Since you can give multiple input files to AWK, there is really no need for this loop.
You might want to scan through the entire files until you find any line that has other than 12 fields, and not only check the first line (NR==1). In this case, the condition would be only NF!=12.
If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the "number of records" (across all input files) and FNR is "file number of records" for the current input file only. These are special built-in variables in AWK.
Also, the syntax of AWK allows for the blocks to be executed only if the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of a file with other than 12 fields on the first line, try:
awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv
I have added the .csv to your wildcard *dim* as you had in the Python version. The -F, of course changes the field separator to a comma from the default space. For every line in each file, AWK checks if the number of fields NF is 12, if it's not, it executes the block of code, otherwise it goes on to the next line. This block prints the FILENAME of the current file AWK is processing, then skips to the beginning of the next file with nextfile.
Try running this AWK version with your subprocess module in Python:
subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)
The triple quotes makes it a literal string. The output of AWK goes to stdout and I'm assuming you know how to use this in Python with the subprocess module.
Using only Python
Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.
You can find the names of files (selected from *dim*.csv) with the first line of each file having other than 12 comma-separated fields with:
import glob
files_found = []
for filename in glob.glob('*dim*.csv'):
with open(filename, 'r') as f:
firstline = f.readline()
if len(firstline.split(',')) != 12:
files_found.append(filename)
f.close()
print(files_found)
The glob module gives the listing of files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, it is added to the list files_found.
I have a file of 2000 rows and 1 column
1007_s_at1
1007_s_at2
1007_s_at3
1007_s_at4
1007_s_at5
1007_s_at6
1007_s_at7
1007_s_at8
1007_s_at9
1007_s_at10
looks like above, I want to remove the last numeric value after "at". In principle whatever number is in the last should be truncated.
I have tried things like splitting them and then rejoioning it, but it just complicates the problem and I am far away from answer.
Could you please suggest something in bash or shell or python or perl to solve this.
An output like below is desired
1007_s_at
1007_s_at
1007_s_at
1007_s_at
1007_s_at
1007_s_at
1007_s_at
1007_s_at
1007_s_at
1007_s_at
Thank you
With Perl:
perl -p -e "s/\d+$//" input.txt > output.txt
sed -i -e 's/[[:digit:]]*$//' filename
Just pass string.digits to .rstrip() to remove digits from the right-hand side of your strings:
import string
with open('inputfile') as infile, open('outputfile') as outfile:
for line in infile:
outfile.write(line.rstrip().rstrip(string.digits) + '\n')
If the only the number at the end changes you could potentially splice:
>>> a = '1007_s_at1'
>>> a[0:9]
'1007_s_at'
Python
Just strip all digits from the end.
>>> "1007_s_at10".rstrip('01234567890')
'1007_s_at'
If you are using Linux or Unix a simple one liner solution would be:
perl -i.bak -pe 's/\d+$//g' file.txt
if Windows:
perl -i.bak -pe "s/\d+$//g" file.txt
If you already know what it is doing then well and good otherwise, in very simple terms, -i switch with .bak would first create a backup of your file.txt and name it file.txt.bak.
The -p option would then loop over the entries in the file and print/save the output in file.txt after s/\d+$//g removes the digits in the end.
Nobody's suggested a bash solution yet:
shopt -s extglob
while read line; do
echo "${line%%*([0-9])}"
done < filename
I have a python file (/home/test.py) that has a mixture of spaces and tabs in it.
Is there a programmatic way (note: programmatic, NOT using an editor) to convert this file to use only tabs? (meaning, replace any existing 4-spaces occurrences with a single tab)?
Would be grateful for either a python code sample or a linux command to do the above. Thanks.
Sounds like a task for sed:
sed -e 's/ /\t/g' test.py > test.new
[Put a real tab instead of \t]
However...
Use 4 spaces per indentation level.
--PEP 8 -- Style Guide for Python Code
you can try iterating the file and doing replacing eg
import fileinput
for line in fileinput.FileInput("file",inplace=1):
print line.replace(" ","\t")
or you can try a *nix tool like sed/awk
$ awk '{gsub(/ /,"\t")}1' file > temp && mv temp file
$ ruby -i.bak -ne '$_.gsub!(/ /,"\t")' file