adding character at the end of each string in a file - python

I have a txt file that contains a single column of single words as such:
windfall
winnable
winner
winners
winning
I want to use the words in the file as regex strings for a mapping jobs. When finished the words should look like this:
windfall|winnable|winner|winners|winning
I need to use python or awk to open the file, place a | at the end of each and write the new content to a new file with the new character added and the column converted to a single horizontal line.
any suggestions?

Simplest is tr:
tr '\n' '|' < file.txt

Using Python you could do:
with open('oldfile.txt') as fin:
with open('newfile.txt', 'w') as fout:
fout.write('|'.join(map(str.strip, fin)))
The str.split removes newlines and whitespaces, while the join concatenates the lines with |.

Using sed:
$ cat file
windfall
winnable
winner
winners
winning
$ sed ':a;N;s/\n/|/;ba' file
windfall|winnable|winner|winners|winning
Create a loop using :a
Load the new line N in to execution space
substitute the newline with pipe
rinse and repeat.

In awk, if you don't want the trailing |:
$ awk '{ s=s (NR>1"?"|":"") $0 } END { print s }' file
windfall|winnable|winner|winners|winning
The original version with getline which was basically an (not even the) outcome of an awk jamming session was:
$ awk 'BEGIN {
while(r=getline) { # read until EOF
s=s (p==r?"|":"") $0; # pile it to s, preceed with | after the first
p=r # p revious r eturn value of getline
} print s # out with the pile
}' file
windfall|winnable|winner|winners|winning

awk -v RS= -v OFS="|" '/ /{next}$1=$1' file
windfall|winnable|winner|winners|winning

Use paste:
$ cat /tmp/so.txt
windfall
winnable
winner
winners
winning
$ paste -sd'|' /tmp/so.txt
windfall|winnable|winner|winners|winning

assuming no blank lines in between rows, and input is smaller than 500 MB, then better to keep it simple :
echo 'windfall
winnable
winner
winners
winning' |
{m,g,n}awk NF=NF RS= OFS='|'
windfall|winnable|winner|winners|winning

Related

Output text between two regular expression patterns over multiple lines

I can run the following command if I bring myfile to an environment with python available:
cat myfile | python filter.py
filter.py
import sys
results = []
for line in sys.stdin:
results.append(line.rstrip("\n\r"))
start_match = "some text"
lines_to_include_before_start_match = 4
end_match = "some other text"
lines_to_include_after_end_match = 4
for line_number, line in enumerate(results):
if start_match in line:
for x in xrange(line_number-lines_to_include_before_start_match, line_number):
print results[x]
print line
for x in xrange(line_number+1, len(results)):
if end_match in results[x]:
print results[x]
for z in xrange(x+1, x+lines_to_include_after_end_match):
print results[z]
break
else:
print results[x]
print ""
But the environment that I want to run this in doesn't have python. Is my only choice to convert this to perl, which I know exists in the environment? Is there an easy sed or awk command to do this?
I've tried the following but it doesn't quite give me what I'm looking for since it misses the +/- 4 lines:
cat myfile | sed -n '/some text/,/some other text/p'
[EDIT: The python script says lines_to_include_after_end_match is 4 but in reality it returns 3]
This might work for you (GNU sed):
sed ':a;$!{N;s/\n/&/4;Ta};/1st text/{:b;n;/2nd text/!bb;:c;N;s/\n/&/4;Tc;b};$d;D' file
Open up a window of n lines and if those lines contain 1st text print them and continue printing until 2nd text, then read m further lines and print those. Otherwise , if it is the end of the file delete the buffered lines else delete the first line in the buffer and repeat.
If the match text begin at start or end of a line, use:
sed ':a;$!{N;s/\n/&/4;Ta};/^start/M{:b;n;/end$/M!bb;:c;N;s/\n/&/4;Tc;b};$d;D' file
Given that the line endings are \n, you can try this:
awk '/some text/{if(l4)printf l4;p=5} /some other text/{e=1} e && p {p--; if (!p) {e=0;l4="";}} !p && !e { l4 = l4 $0 "\n"; sub(/[^\n]*\n(([^\n]*\n){4})/,"\1",l4);} p' file
Note the mark needs be 6 if you want print extra 4 lines after the end match.
I think your own python code will only print another 3 lines after end match.
Put in several lines for redability:
awk '/some text/{if(l4)printf l4;p=5}
/some other text/{e=1}
e && p {p--; if (!p) {e=0;l4="";}}
!p && !e { l4 = l4 $0 "\n"; sub(/[^\n]*\n(([^\n]*\n){4})/,"\1",l4);}
p' file
With sed, please try:
sed -n "$(($(sed -n '/some text/=' myfile) - 4)),$(($(sed -n '/some other text/=' myfile) + 4))p" myfile
The command sed -n '/some text/=' returns the line number which matches some text.
Then 4 is subtracted from the number above.
The next part sed -n '/some other text/=' works similarly and the obtained line number is added by 4.
Note that the script scans the input file three times and may not be suitable for the case execution time is crucial.
[Edit]
In case you have multiple "some other text" in the file, please try instead:
sed -n "$(($(sed -n '/some text/=' myfile) - 4)),\$p" myfile | sed "/some other text/{N;N;N;q}"

Split csv file vertically using command line

Is it possible to split a csv file, vertically, into multiple files? I know we can split single large files into smaller files with no of rows mentioned using the command line. I have csv files in which columns are repeating after certain column no and I want to split that file column-wise.Is that possible with the command line, If not then how can we do it with python?
For Eg.
consider above sample in which site and address present multiple times vertically, I want to create 3 different csv files containing single site and single address
Any help would be highly appreciated,
Thanks
Assuming your input files is named ~/Downloads/sites.csv and looks like this:
Google,google.com,Google,google.com,Google,google.com
MS,microsoft.com,MS,microsoft.com,MS,microsoft.com
Apple,apple.com,Apple,apple.com,Apple,apple.com
You can use cut to create 3 files, each containing one pair of company/site:
cut -d "," -f 1-2 < ~/Downloads/sites.csv > file1.csv
cut -d "," -f 3-4 < ~/Downloads/sites.csv > file2.csv
cut -d "," -f 5-6 < ~/Downloads/sites.csv > file3.csv
Explanation:
For the cut command, we declare the comma (,) as a separator, which splits every line into a set for 'fields'.
We then specify for each output file, which fields we want to be included.
HTH!
If the site-address pairs are regularly repeated, how about:
awk '{
n = split($0, ary, ",");
for (i = 1; i <= n; i += 2) {
j = (i + 1) / 2;
print ary[i] "," ary[i+1] >> "file" j ".csv";
}
}' input.csv
The following script produces what you want (based on the SO answer adjusted for your needs: number of columns, field separator). It splits the original file vertically into 2 column chunks (note n=2) and creates 3 different files (tmp.examples.1, tmp.examples.2, tmp.examples.3 or whatever you specify for the f variable):
awk -F "," -v f="tmp.examples" '{for (i=1; i<=NF; i++) printf (i%n==0||i==NF)?$i RS:$i FS > f "." int((i-1)/n+1) }' n=2 example.txt
If your example.txt file has the subsequent data:
site,address,site,address,site,address
Google,google.com,MS,microsoft.com,Apple,apple.com

python: Remove trailing 0's and decimal point from awk command

I need to remove the trailing zero's from an export:
the code is reading original tempFile i need column 2 and 6 which contains:
12|9781624311390|1|1|0|0.0000
13|9781406273687|1|1|0|99.0000
14|9781406273717|1|1|0|104.0000
15|9781406273700|1|1|0|63.0000
the awk command changes the form to comma separated and dumps column 2 and 6 into tempFile2 - and i need to remove the trailing zeros from column 6 so the end result looks like this:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
i believe this should do the trick but have had no luck implementing it:
awk '{sub("\\.*0+$",""); print}'
Below is the code i need to adjust: $6 is the column to remove zero's
if not isError:
print "Translating SQL output to tab delimited format"
awkRunSuccess = os.system(
"awk -F\"|\" '{print $2 \"\\,\" $6}' %s > %s" %
(tempFile, tempFile2)
)
if awkRunSuccess != 0: isError = True
You can use gsub("\\.*0+$","",$2) to do this, as per the following transcript:
pax> echo '9781624311390|0.0000
9781406273687|99.0000
9781406273717|104.0000
9781406273700|63.0000' | awk -F'|' '{gsub("\\.*0+$","",$2);print $1","$2}'
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
However, given you're already within Python (and it's no slouch when it comes to regexes), you'd probably want to use it natively rather than start up an awk process.
Try this awk command
awk -F '[|.]' '{print $2","$(NF-1)}' FileName
Output:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63

Merge two lines generated from contigs.fa

I have a file generated by assemblers. It looks like following.
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAAC
CAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTAT
ACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGA
ACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTAT
TCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTG
TCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGT
CTTCC
I want to merge the lines using python or linux sed command and want result in this way.
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
like every seqeunce consider as single line and Node name as other line.
A small pipe of tr and sed would do this:
$ tr -d '\n' < contigser.fa | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > newfile.fa
In python:
file = open('contigser.fa','r+')
lines= file.read().splitlines()
file.seek(0)
file.truncate()
for line in lines:
if line.startswith('>'):
file.write('\n'+line+'\n')
else:
file.write(line)
Note: the python solution stores the changes back to contigser.fa.
You can use awk to do the job:
awk < input_file '/^>/ {print ""; print; next} {printf "%s", $0} END {print ""}'
This only starts one process (awk). Only drawback: it adds an empty first line. You can avoid such things by adding a state variable (the code belongs on one line, it's just to make it better readable):
awk < input_file '/^>/ { if (flag) print ""; print; flag=0; next }
{ printf "%s", $0; flag=1 } END { if (flag) print "" }'
#how to store it in a new file:
awk < input_file > output_file '/^>/ { .... }'
$ awk '/^>/{printf "%s%s\n",(NR>1?ORS:""),$0; next} {printf "%s",$0} END{print ""}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
$ awk 'NR==1;ORS="";{sub(/>.*$/,"\n&\n");print (NR>1)?$0:""}END{print"\n"}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
This might work for you (GNU sed):
sed '/^>/n;:a;$!N;s/\n\([^>]\)/\1/;ta;P;D' file
Following a line beginning with >, delete any newlines that preceed any character other than a > symbol.

Summing up two columns the Unix way

# To fix the symptom
How can you sum up the following columns effectively?
Column 1
1
3
3
...
Column 2
2323
343
232
...
This should give me
Expected result
2324
346
235
...
I have the columns in two files.
# Initial situation
I use sometimes too many curly brackets such that I have used one more this { than this } in my files.
I am trying to find where I have used the one unnecessary curly bracket.
I have used the following steps in getting the data
Find commands
find . * -exec grep '{' {} + > /tmp/1
find . * -exec grep '}' {} + > /tmp/2
AWK commands
awk -F: '{ print $2 }' /tmp/1 > /tmp/11
awk -F: '{ print $2 }' /tmp/2 > /tmp/22
The column are in the files /tmp/11 and /tmp/22.
I repeat a lot of similar commands in my procedure.
This suggests me that this is not the right way.
Please, suggests me any way such as Python, Perl or any Unix tool which can decrease the number of steps.
If c1 and c2 are youre files, you can do this:
$ paste c1 c2 | awk '{print $1 + $2}'
Or (without AWK):
$ paste c1 c2 | while read i j; do echo $(($i+$j)); done
Using python:
totals = [ int(i)+int(j) for i, j in zip ( open(fname1), open(fname2) ) ]
You can avoid the intermediate steps by just using a command that do the counts and the comparison at the same time:
find . -type f -exec perl -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g' {}\;
This calls the Perl program once per file, the Perl program counts the number of each type curly brace and prints the name of the file if they counts don't match.
You must be careful with the /([}{]])/ section, find will think it needs to do the replacement on {} if you say /([{}]])/.
WARNING: this code will have false positives and negatives if you are trying to run it against source code. Consider the following cases:
balanced, but curlies in strings:
if ($s eq '{') {
print "I saw a {\n"
}
unbalanced, but curlies in strings:
while (1) {
print "}";
You can expand the Perl command by using B::Deparse:
perl -MO=Deparse -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g'
Which results in:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
sub END {
print $ARGV if $h{'{'} != $h{'}'};
}
;
++$h{$_} foreach (/([}{])/g);
}
We can now look at each piece of the program:
BEGIN { $/ = "\n"; $\ = "\n"; }
This is caused by the -l option. It sets both the input and output record separators to "\n". This means anything read in will be broken into records based "\n" and any print statement will have "\n" appended to it.
LINE: while (defined($_ = <ARGV>)) {
}
This is created by the -n option. It loops over every file passed in via the commandline (or STDIN if no files are passed) reading each line of those files. This also happens to set $ARGV to the last file read by <ARGV>.
chomp $_;
This removes whatever is in the $/ variable from the line that was just read ($_), it does nothing useful here. It was caused by the -l option.
sub END {
print $ARGV if $h{'{'} != $h{'}'};
}
This is an END block, this code will run at the end of the program. It prints $ARGV (the name of the file last read from, see above) if the values stored in %h associated with the keys '{' and '}' are equal.
++$h{$_} foreach (/([}{])/g);
This needs to be broken down further:
/
( #begin capture
[}{] #match any of the '}' or '{' characters
) #end capture
/gx
Is a regex that returns a list of '{' and '}' characters that are in the string being matched. Since no string was specified the $_ variable (which holds the line last read from the file, see above) will be matched against. That list is fed into the foreach statement which then runs the statement it is in front of for each item (hence the name) in the list. It also sets $_ (as you can see $_ is a popular variable in Perl) to be the item from the list.
++h{$_}
This line increments the value in $h that is associated with $_ (which will be either '{' or '}', see above) by one.
In Python (or Perl, Awk, &c) you can reasonably do it in a single stand-alone "pass" -- I'm not sure what you mean by "too many curly brackets", but you can surely count curly use per file. For example (unless you have to worry about multi-GB files), the 10 files using most curly braces:
import heapq
import os
import re
curliest = dict()
for path, dirs, files in os.walk('.'):
for afile in files:
fn = os.path.join(path, afile)
with open(fn) as f:
data = f.read()
braces = data.count('{') + data.count('}')
curliest[fn] = bracs
top10 = heapq.nlargest(10, curlies, curliest.get)
top10.sort(key=curliest.get)
for fn in top10:
print '%6d %s' % (curliest[fn], fn)
Reply to Lutz'n answer
My problem was finally solved by this commnad
paste -d: /tmp/1 /tmp/2 | awk -F: '{ print $1 "\t" $2 - $4 }'
your problem can be solved with just 1 awk command...
awk '{getline i<"file1";print i+$0}' file2

Categories

Resources