I can run the following command if I bring myfile to an environment with Python available:
cat myfile | python filter.py
filter.py
import sys

results = []
for line in sys.stdin:
    results.append(line.rstrip("\n\r"))

start_match = "some text"
lines_to_include_before_start_match = 4
end_match = "some other text"
lines_to_include_after_end_match = 4

for line_number, line in enumerate(results):
    if start_match in line:
        for x in xrange(line_number - lines_to_include_before_start_match, line_number):
            print results[x]
        print line
        for x in xrange(line_number + 1, len(results)):
            if end_match in results[x]:
                print results[x]
                for z in xrange(x + 1, x + lines_to_include_after_end_match):
                    print results[z]
                break
            else:
                print results[x]
        print ""
But the environment that I want to run this in doesn't have Python. Is my only choice to convert this to Perl, which I know exists in the environment? Is there an easy sed or awk command to do this?
I've tried the following but it doesn't quite give me what I'm looking for since it misses the +/- 4 lines:
cat myfile | sed -n '/some text/,/some other text/p'
[EDIT: The Python script sets lines_to_include_after_end_match to 4, but it actually prints only 3.]
This might work for you (GNU sed):
sed ':a;$!{N;s/\n/&/4;Ta};/1st text/{:b;n;/2nd text/!bb;:c;N;s/\n/&/4;Tc;b};$d;D' file
Open up a window of n lines; if those lines contain the 1st text, print them and continue printing until the 2nd text, then read m further lines and print those. Otherwise, if it is the end of the file, delete the buffered lines; else delete the first line in the buffer and repeat.
If the match text begins at the start or end of a line, use:
sed ':a;$!{N;s/\n/&/4;Ta};/^start/M{:b;n;/end$/M!bb;:c;N;s/\n/&/4;Tc;b};$d;D' file
Given that the line endings are \n, you can try this:
awk '/some text/{if(l4)printf l4;p=5} /some other text/{e=1} e && p {p--; if (!p) {e=0;l4="";}} !p && !e { l4 = l4 $0 "\n"; sub(/[^\n]*\n(([^\n]*\n){4})/,"\1",l4);} p' file
Note that the mark needs to be 6 if you want to print an extra 4 lines after the end match.
I think your own Python code will only print another 3 lines after the end match.
Split over several lines for readability:
awk '/some text/{if(l4)printf l4;p=5}
/some other text/{e=1}
e && p {p--; if (!p) {e=0;l4="";}}
!p && !e { l4 = l4 $0 "\n"; sub(/[^\n]*\n(([^\n]*\n){4})/,"\1",l4);}
p' file
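For reference, the same rolling-buffer idea can be sketched in Python (only to make the logic explicit, since the target environment has no Python). It keeps the previous 4 lines in a buffer, flushes them when the start text appears, keeps printing until the end text, and then prints the full 4 trailing lines rather than the 3 the original script prints. The patterns and counts are just the values from the question:

import sys
from collections import deque

start_match = "some text"
end_match = "some other text"
before = 4                   # lines of context to keep before the start match
after = 4                    # lines of context to print after the end match

buf = deque(maxlen=before)   # rolling buffer of the most recent lines
printing = False
trailing = 0                 # lines still owed after the end match

for line in sys.stdin:
    line = line.rstrip("\n")
    if not printing and trailing == 0 and start_match in line:
        for old in buf:      # flush the buffered context
            print(old)
        buf.clear()
        printing = True
    if printing:
        print(line)
        if end_match in line:
            printing = False
            trailing = after
    elif trailing > 0:
        print(line)
        trailing -= 1
    else:
        buf.append(line)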
With sed, please try:
sed -n "$(($(sed -n '/some text/=' myfile) - 4)),$(($(sed -n '/some other text/=' myfile) + 4))p" myfile
The command sed -n '/some text/=' prints the line number of the line that matches some text.
Then 4 is subtracted from that number.
The next part, sed -n '/some other text/=', works the same way, and 4 is added to the line number it returns.
Note that the script scans the input file three times, so it may not be suitable when execution time is crucial.
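For comparison, the same three-step idea (locate the two line numbers, then print from 4 lines before the first match to 4 lines after the second) can be checked with a short Python sketch; it assumes both patterns occur and that the file fits in memory, and is only here to illustrate the logic:

# Sketch: single-pass Python equivalent of the three sed invocations above.
with open("myfile") as f:
    lines = f.read().splitlines()

# 0-based indexes of the first start match and the first end match
start = next(i for i, l in enumerate(lines) if "some text" in l)
end = next(i for i, l in enumerate(lines) if "some other text" in l)

for l in lines[max(start - 4, 0):end + 5]:
    print(l)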
[Edit]
In case you have multiple occurrences of "some other text" in the file, please try this instead:
sed -n "$(($(sed -n '/some text/=' myfile) - 4)),\$p" myfile | sed "/some other text/{N;N;N;q}"
I have two sets of data.
The first dataset looks like:
Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51
The second dataset looks like:
Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11
Now I want an output which looks something like this:
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
I tried the join command in Linux to perform this task and failed badly. The join command skipped the rows that didn't match between the two datasets. I can use Matlab, but the size of the data is more than 100 GB, which makes this very difficult.
Can someone please guide me on this one? Can I use SQL or Python to complete this task? I appreciate your help. Thanks.
I think you want a full outer join:
select storm_id, cell_id,
coalesce(d1.wind_speed, 0) as wind_speed,
coalesce(d2.storm_surge, 0) as storm_surge
from dataset1 d1 full join
dataset2 d2
using (storm_id, cell_id);
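Since the question also mentions Python: the same full outer join can be sketched with plain dictionaries. This is only an illustration; the file names dataset1.csv and dataset2.csv are assumptions, and it keeps all keys in memory, so for 100 GB of data you would load the files into a database (as above) or sort and merge them instead.

import csv

wind, surge = {}, {}

# key = (Storm_ID, Cell_ID); value = the measurement from each file
with open("dataset1.csv", newline="") as f:
    for row in csv.DictReader(f):
        wind[(row["Storm_ID"], row["Cell_ID"])] = row["Wind_speed"]

with open("dataset2.csv", newline="") as f:
    for row in csv.DictReader(f):
        surge[(row["Storm_ID"], row["Cell_ID"])] = row["Storm_surge"]

# emulate the full outer join with COALESCE(..., 0)
print("Storm_ID,Cell_ID,Wind_speed,Storm_surge")
for key in sorted(wind.keys() | surge.keys()):
    storm, cell = key
    print("%s,%s,%s,%s" % (storm, cell, wind.get(key, 0), surge.get(key, 0)))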
Shell-Only Solution
Make a backup of your files first
Assuming your files are called wind1.txt and wind2.txt
You could apply this set of shell commands:
perl -pi -E "s/,/_/" wind*
perl -pi -E 's/(.$)/$1,0/' wind1.txt
perl -pi -E "s/,/,0,/" wind2.txt
join --header -a 1 -a 2 wind1.txt wind2.txt > outfile.txt
Intermediate Result
Storm_ID_Cell_ID,Wind_speed,0
2_10236258,27,0
2_10236299,0,0.27
2_10236300,0,0.27
2_10236300,58,0
2_10236301,0,0.35
2_10236301,25,0
2_10240400,0,0.35
2_10240401,0,0.81
3_10240400,51,0
4_10240402,0,0.11
Now rename the "0" in the header line to "Storm_surge", and replace the first "_" with "," in the data lines (and fix the header key):
perl -pi -E "s/Wind_speed,0/Wind_speed,Storm_surge/" outfile.txt
perl -pi -E 's/^(\d+)_/$1,/' outfile.txt
perl -pi -E "s/Storm_ID_Cell_ID/Storm_ID,Cell_ID/" outfile.txt
Intermediate result:
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,0,0.27
2,10236300,58,0
2,10236301,0,0.35
2,10236301,25,0
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
Finally, run this to merge the duplicate Storm_ID/Cell_ID rows, summing the wind speed and storm surge columns separately:
awk 'BEGIN { FS=OFS=SUBSEP="," } { wind[$1,$2]+=$3; surge[$1,$2]+=$4 } END { for (i in wind) print i, wind[i], surge[i] }' outfile.txt | sort
(Sorry - Q was closed while answering)
awk -F, -v OFS=, '{x = $1 "," $2} FNR == NR {a[x] = $3; b[x] = 0; next} {b[x] = $3} !a[x] {a[x] = 0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n
Since awk's for (i in a) loop iterates in an arbitrary order, the output is sorted at the end.
I have a csv file that looks like this:
SKU,QTY
KA006-001,2
KA006-001,33
KA006-001,46
KA009-001,22
KA009-001,7
KA010-001,18
KA014-001,3
KA014-001,42
KA015-001,1
KA015-001,16
KA020-001,6
KA022-001,56
The first column is the SKU. The second column is the QTY number.
Some lines (in the SKU column only) are identical.
I need to achieve the following:
SKU,QTY
KA006-001,81 (2+33+46)
KA009-001,29 (22+7)
KA010-001,18
KA014-001,45 (3+42)
so on...
I tried different things: loop statements and arrays. I got so lost it gave me a headache.
My code:
#!/bin/bash

while IFS=, read sku qty
do
    echo "SKU='$sku' QTY='$qty'"
    if [ "$sku" = "$sku" ]
    then
        #x=("$sku" != "$sku")
        for i in {0..3}; do echo $sku[$i]=$qty; done
    fi
done < 2asg.csv
I'd use awk:
awk -F, 'NR==1{print} NR>1{a[$1] += $2}END{for (i in a) print i","a[i]}' file
If you want to ignore blank lines, you can either ignore lines with fewer than 2 columns:
awk -F, 'NR==1{print} NR>1 && NF>1{a[$1] += $2} END{for (i in a) print i","a[i]}' file
or ignore ones without exactly 2 columns:
awk -F, 'NR==1{print} NR>1 && NF==2{a[$1] += $2} END{for (i in a) print i","a[i]}' file
Alternatively, you can check to see that the second column begins with a digit:
awk -F, 'NR==1{print} NR>1 && $2~/^[0-9]/{a[$1] += $2} END{for (i in a) print i","a[i]}' file
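If you ever want to do the same aggregation in Python rather than awk, a minimal sketch with a dictionary (assuming the file name 2asg.csv from the question) looks like this:

import csv

totals = {}
with open("2asg.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)          # the SKU,QTY header line
    for row in reader:
        if len(row) != 2:          # skip blank or malformed lines
            continue
        sku, qty = row
        totals[sku] = totals.get(sku, 0) + int(qty)

print(",".join(header))
for sku, qty in totals.items():
    print("%s,%d" % (sku, qty))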
For Bash 4:
#!/bin/bash

declare -A astr

while IFS=, read -r col1 col2
do
    if [ "$col1" != "SKU" ] && [ "$col1" != "" ]
    then
        (( astr[$col1] += col2 ))
    fi
done < 2asg.csv

echo "SKU,QTY"

for i in "${!astr[@]}"
do
    echo "$i,${astr[$i]}"
done | sort -t : -k 2n
I need to remove the trailing zeros from an export.
The code reads the original tempFile; I need columns 2 and 6, which contain:
12|9781624311390|1|1|0|0.0000
13|9781406273687|1|1|0|99.0000
14|9781406273717|1|1|0|104.0000
15|9781406273700|1|1|0|63.0000
The awk command changes the format to comma-separated and dumps columns 2 and 6 into tempFile2, and I need to remove the trailing zeros from column 6 so the end result looks like this:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
I believe this should do the trick, but I have had no luck implementing it:
awk '{sub("\\.*0+$",""); print}'
Below is the code I need to adjust; $6 is the column to remove the zeros from:
if not isError:
    print "Translating SQL output to tab delimited format"
    awkRunSuccess = os.system(
        "awk -F\"|\" '{print $2 \"\\,\" $6}' %s > %s" %
        (tempFile, tempFile2)
    )
    if awkRunSuccess != 0: isError = True
You can use gsub("\\.*0+$","",$2) to do this, as per the following transcript:
pax> echo '9781624311390|0.0000
9781406273687|99.0000
9781406273717|104.0000
9781406273700|63.0000' | awk -F'|' '{gsub("\\.*0+$","",$2);print $1","$2}'
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
However, given you're already within Python (and it's no slouch when it comes to regexes), you'd probably want to use it natively rather than start up an awk process.
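For example, here is a rough sketch of doing the whole step natively instead of the os.system() call; tempFile and tempFile2 stand for the same variables used in the question, and the assigned paths are only placeholders:

# Sketch only: produces the same column-2,column-6 output without awk.
tempFile, tempFile2 = "tempFile", "tempFile2"   # placeholder paths

with open(tempFile) as src, open(tempFile2, "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 6:
            continue                     # skip malformed lines
        price = fields[5]
        if "." in price:                 # 99.0000 -> 99, 63.5000 -> 63.5
            price = price.rstrip("0").rstrip(".")
        dst.write("%s,%s\n" % (fields[1], price))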
Try this awk command, which uses both | and . as field separators so that $(NF-1) is the whole-number part of the last field:
awk -F '[|.]' '{print $2","$(NF-1)}' FileName
Output:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
I am used to using awk to retrieve a column from a file.
I need to do something similar now in Python. At the moment I use a subprocess and save the result in a variable.
Is it possible to run something similar to awk in Python without writing a lot of code? I was looking at split, but I don't get how you parse through multiple lines.
The input that I have is similar to the output of a simple ls -la or netstat -r. I would like to get the 3rd column, so I can do what I would do with
awk '{print $3}'
Example of the source:
a b c d e
1 2 4 5 2
X Y Z S R
The shortest approach I can think of is a loop over the lines, splitting each line into strings and printing string[2]. But I am not sure how to write this in the simplest and shortest way, as short as running the awk command in a subprocess.
In bash, using pythonpy
rtb@bartek-laptop ~ $ cat tmp
a b c d e
1 2 4 5 2
X Y Z S R
rtb@bartek-laptop ~ $ cat tmp | py -x "x.split()[2]"
c
4
Z
Or in a script:
with open('tmp') as f:
    result = [line.split()[2] for line in f]

# now result contains list ['c', '4', 'Z']
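If the input comes from a pipe (like ls -la or netstat -r) rather than a file, the same idea works on sys.stdin; the length check below is only a guard for lines with fewer than three columns, such as the "total" line printed by ls -la:

import sys

# Equivalent of: awk '{print $3}'
for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 3:
        print(fields[2])

You could save that as, say, thirdcol.py (a name made up here) and run ls -la | python thirdcol.py.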