Joining two different datasets using multiple key values - python
I have two sets of data.
The first dataset looks like:
Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51
The second dataset looks like:
Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11
Now I want an output which looks something like this:
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
I tried the join command in Linux to perform this task and failed badly: join skipped the rows that didn't have a match in the other dataset. I could use Matlab, but the data is more than 100 GB, which makes that very difficult.
Can someone please guide me on this one? Can I use SQL or Python to complete this task? I appreciate your help. Thanks.
I think you want a full outer join:
select storm_id, cell_id,
       coalesce(d1.wind_speed, 0) as wind_speed,
       coalesce(d2.storm_surge, 0) as storm_surge
from dataset1 d1 full join
     dataset2 d2
     using (storm_id, cell_id);
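Since you also ask about Python: one way to drive the same full outer join from Python is to load both files into an on-disk SQLite database, so nothing has to fit in memory. This is only a minimal sketch, not a tuned solution; the file names dataset1.csv/dataset2.csv, the table names d1/d2 and the output path joined.csv are placeholders, it assumes both CSVs carry the headers shown above, and it emulates the full join with two LEFT JOINs because older SQLite versions lack FULL JOIN.
import csv
import sqlite3

con = sqlite3.connect("storms.db")   # on-disk database, placeholder name
# keep the values as TEXT so the original number formatting is preserved
con.execute("CREATE TABLE d1 (storm_id TEXT, cell_id TEXT, wind_speed TEXT)")
con.execute("CREATE TABLE d2 (storm_id TEXT, cell_id TEXT, storm_surge TEXT)")

def load(path, table):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                  # skip the header row
        con.executemany("INSERT INTO %s VALUES (?, ?, ?)" % table, reader)

load("dataset1.csv", "d1")            # placeholder file names
load("dataset2.csv", "d2")
con.execute("CREATE INDEX idx1 ON d1 (storm_id, cell_id)")
con.execute("CREATE INDEX idx2 ON d2 (storm_id, cell_id)")
con.commit()

# Full outer join emulated as: all d1 rows left-joined to d2,
# plus the d2 rows that have no match in d1.
query = """
SELECT d1.storm_id, d1.cell_id, d1.wind_speed, COALESCE(d2.storm_surge, 0)
FROM d1 LEFT JOIN d2
     ON d1.storm_id = d2.storm_id AND d1.cell_id = d2.cell_id
UNION ALL
SELECT d2.storm_id, d2.cell_id, 0, d2.storm_surge
FROM d2 LEFT JOIN d1
     ON d1.storm_id = d2.storm_id AND d1.cell_id = d2.cell_id
WHERE d1.cell_id IS NULL
ORDER BY 1, 2
"""

with open("joined.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Storm_ID", "Cell_ID", "Wind_speed", "Storm_surge"])
    for row in con.execute(query):
        writer.writerow(row)
Loading 100 GB this way takes a while, but the join itself is done by SQLite on disk, which is the main point of the sketch.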
Shell-Only Solution
Make a backup of your files first: the perl -pi commands below edit them in place.
Assuming your files are called wind1.txt and wind2.txt, you could apply this set of shell commands:
perl -pi -E "s/,/_/" wind*              # merge Storm_ID and Cell_ID into a single join key
perl -pi -E 's/(.$)/$1,0/' wind1.txt    # append a ,0 placeholder column for Storm_surge
perl -pi -E "s/,/,0,/" wind2.txt        # insert a 0 placeholder column for Wind_speed
join --header -a 1 -a 2 wind1.txt wind2.txt > outfile.txt
Intermediate Result
Storm_ID_Cell_ID,Wind_speed,0
2_10236258,27,0
2_10236299,0,0.27
2_10236300,0,0.27
2_10236300,58,0
2_10236301,0,0.35
2_10236301,25,0
2_10240400,0,0.35
2_10240401,0,0.81
3_10240400,51,0
4_10240402,0,0.11
Now rename the ,0 in the header line to Storm_surge, and replace the first _ with , again (in the data rows and in the header):
perl -pi -E "s/Wind_speed,0/Wind_speed,Storm_surge/" outfile.txt
perl -pi -E 's/^(\d+)_/$1,/' outfile.txt
perl -pi -E "s/Storm_ID_Cell_ID/Storm_ID,Cell_ID/" outfile.txt
Intermediate result:
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,0,0.27
2,10236300,58,0
2,10236301,0,0.35
2,10236301,25,0
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
Finally, run this to collapse the duplicate keys into a single row per key, summing the Wind_speed and Storm_surge columns separately (the header line is skipped here, so prepend it again if you need it):
awk 'BEGIN { FS=OFS=SUBSEP="," } NR>1 { wind[$1,$2]+=$3; surge[$1,$2]+=$4 } END { for (i in wind) print i, wind[i], surge[i] }' outfile.txt | sort
(Sorry - Q was closed while answering)
awk -F, -v OFS=, '{x = $1 "," $2} FNR == NR {a[x] = $3; b[x] = 0; next} {b[x] = $3} !a[x] {a[x] = 0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n
Since awk's for (i in a) loop visits keys in arbitrary order, the output is sorted at the end.
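The same two-pass idea translates almost line for line into Python. Here is a minimal sketch under the same assumptions (input files named f1 and f2, each with the header shown in the question); like the awk version, it keeps one entry per key in memory, so it only suits data whose key set fits in RAM:
import csv

wind, surge = {}, {}

with open("f1", newline="") as f:            # Storm_ID,Cell_ID,Wind_speed
    reader = csv.reader(f)
    next(reader)                             # skip header
    for storm, cell, speed in reader:
        wind[(storm, cell)] = speed
        surge.setdefault((storm, cell), "0")

with open("f2", newline="") as f:            # Storm_ID,Cell_ID,Storm_surge
    reader = csv.reader(f)
    next(reader)
    for storm, cell, s in reader:
        surge[(storm, cell)] = s
        wind.setdefault((storm, cell), "0")

print("Storm_ID,Cell_ID,Wind_speed,Storm_surge")
# assumes both ID columns are numeric, as in the sample data
for storm, cell in sorted(wind, key=lambda k: (int(k[0]), int(k[1]))):
    print(",".join([storm, cell, wind[(storm, cell)], surge[(storm, cell)]]))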
Related
access multiple output array of python in bash
I have a Python script that prints out 3 different lists. How can I access them in bash? For example, the Python output is:
[1,2,3,4][a,b,c,d][p,q,r,s]
Now in bash I want to access them as:
list1=[1,2,3,4]
list2=[a,b,c,d]
list3=[p,q,r,s]
So far, I tried something like:
x=$(python myscript.py input.csv)
Now, if I use echo $x I can see the above mentioned lists:
[1,2,3,4][a,b,c,d][p,q,r,s]
How could I get 3 different lists? Thanks for the help.
The Python output does not match the bash syntax. If you cannot print the bash syntax directly from the Python script, you will need to parse the output first. I suggest using the sed command for parsing the output into bash arrays:
echo $x | sed 's|,| |g; s|\[|list1=(|; s|\[|list2=(|; s|\[|list3=(|;s|\]|)\n|g;'
Command explanation:
sed 's|,| |g;       # replaces `,` by blank space
     s|\[|list1=(|; # replaces the 1st `[` by `list1=(`
     s|\[|list2=(|; # replaces the 2nd `[` by `list2=(`
     s|\[|list3=(|; # replaces the 3rd `[` by `list3=(`
     s|\]|)\n|g;'   # replaces all `]` by `)`
The output would be something like:
list1=(1 2 3 4)
list2=(a b c d)
list3=(p q r s)
At this point, these lines are not actual bash arrays yet. To turn the output into bash commands, you can surround the whole command with eval $(...); then the output will be evaluated as bash commands. Putting it all together:
$ eval $(echo $x | sed 's|,| |g; s|\[|list1=(|; s|\[|list2=(|; s|\[|list3=(|;s|\]|)\n|g;')
$ echo ${list1[@]}
1 2 3 4
$ echo ${list2[@]}
a b c d
$ echo ${list3[@]}
p q r s
Here is one approach using bash.
#!/usr/bin/env bash

##: This line is a simple test that it works.
##: IFS='][' read -ra main_list <<< [1,2,3,4][a,b,c,d][p,q,r,s]

IFS='][' read -ra main_list < <(python myscript.py input.csv)

n=1
while read -r list; do
  [[ $list ]] || continue
  read -ra list$((n++)) <<< "${list//,/ }"
done < <(printf '%s\n' "${main_list[@]}")

declare -p list1 list2 list3
Output
declare -a list1=([0]="1" [1]="2" [2]="3" [3]="4")
declare -a list2=([0]="a" [1]="b" [2]="c" [3]="d")
declare -a list3=([0]="p" [1]="q" [2]="r" [3]="s")
As per Philippe's comment, a for loop is also an option.
IFS='][' read -ra main_list < <(python myscript.py input.csv)

n=1
for list in "${main_list[@]}"; do
  [[ $list ]] || continue
  read -ra list$((n++)) <<< "${list//,/ }"
done

declare -p list1 list2 list3
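If you can change the Python script itself, the parsing can be avoided entirely by printing bash array syntax directly, as the first answer hints. A minimal sketch (the list contents here are only illustrative; a real script would build them from input.csv):
# myscript.py -- hypothetical example that emits bash array assignments
lists = [[1, 2, 3, 4], ["a", "b", "c", "d"], ["p", "q", "r", "s"]]

for n, items in enumerate(lists, start=1):
    # bash arrays are space-separated values wrapped in parentheses
    print("list%d=(%s)" % (n, " ".join(str(x) for x in items)))
In bash you can then run eval "$(python myscript.py input.csv)" and use list1, list2 and list3 directly.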
How to find matching rows of the first column and add quantities of the second column? Bash
I have a csv file that looks like this:
SKU,QTY
KA006-001,2
KA006-001,33
KA006-001,46
KA009-001,22
KA009-001,7
KA010-001,18
KA014-001,3
KA014-001,42
KA015-001,1
KA015-001,16
KA020-001,6
KA022-001,56
The first column is SKU. The second column is the QTY number. Some lines (in the SKU column only) are identical. I need to achieve the following:
SKU,QTY
KA006-001,81 (2+33+46)
KA009-001,29 (22+7)
KA010-001,18
KA014-001,45 (3+42)
and so on...
I tried different things, loop statements and arrays. Got so lost, got a headache. My code:
#!/bin/bash

while IFS=, read sku qty
do
  echo "SKU='$sku' QTY='$qty'"
  if [ "$sku" = "$sku" ]
  then
    #x=("$sku" != "$sku")
    for i in {0..3}; do
      echo $sku[$i]=$qty;
    done
  fi
done < 2asg.csv
I'd use awk:
awk -F, 'NR==1{print} NR>1{a[$1] += $2} END{for (i in a) print i","a[i]}' file
If you want to ignore blank lines, you can either ignore lines with fewer than 2 columns:
awk -F, 'NR==1{print} NR>1 && NF>1{a[$1] += $2} END{for (i in a) print i","a[i]}' file
or ignore ones without exactly 2 columns:
awk -F, 'NR==1{print} NR>1 && NF==2{a[$1] += $2} END{for (i in a) print i","a[i]}' file
Alternatively, you can check that the second column begins with a digit:
awk -F, 'NR==1{print} NR>1 && $2~/^[0-9]/{a[$1] += $2} END{for (i in a) print i","a[i]}' file
For Bash 4:
#!/bin/bash

declare -A astr

while IFS=, read -r col1 col2
do
  if [ "$col1" != "SKU" ] && [ "$col1" != "" ]
  then
    (( astr[$col1] += col2 ))
  fi
done < 2asg.csv

echo "SKU,QTY"

for i in "${!astr[@]}"
do
  echo "$i,${astr[$i]}"
done | sort -t : -k 2n
https://github.com/tigertv/stackoverflow-answers
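For reference, the same aggregation in Python is only a few lines with the csv module. A minimal sketch, assuming the input file is named 2asg.csv as above and that QTY is always an integer:
import csv

totals = {}
order = []                                # remember SKUs in first-seen order

with open("2asg.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                          # skip the SKU,QTY header
    for row in reader:
        if len(row) != 2:                 # ignore blank or malformed lines
            continue
        sku, qty = row
        if sku not in totals:
            order.append(sku)
            totals[sku] = 0
        totals[sku] += int(qty)

print("SKU,QTY")
for sku in order:
    print("{},{}".format(sku, totals[sku]))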
Split csv file vertically using command line
Is it possible to split a csv file vertically into multiple files? I know we can split a single large file into smaller files by number of rows using the command line. I have csv files in which the columns repeat after a certain column number, and I want to split those files column-wise. Is that possible with the command line? If not, how can we do it with Python?
For example, consider the sample above, in which site and address are present multiple times vertically. I want to create 3 different csv files, each containing a single site and a single address.
Any help would be highly appreciated. Thanks.
Assuming your input file is named ~/Downloads/sites.csv and looks like this:
Google,google.com,Google,google.com,Google,google.com
MS,microsoft.com,MS,microsoft.com,MS,microsoft.com
Apple,apple.com,Apple,apple.com,Apple,apple.com
You can use cut to create 3 files, each containing one pair of company/site:
cut -d "," -f 1-2 < ~/Downloads/sites.csv > file1.csv
cut -d "," -f 3-4 < ~/Downloads/sites.csv > file2.csv
cut -d "," -f 5-6 < ~/Downloads/sites.csv > file3.csv
Explanation: for the cut command, we declare the comma (,) as the separator, which splits every line into a set of fields. We then specify, for each output file, which fields we want included.
HTH!
If the site-address pairs are regularly repeated, how about:
awk '{
  n = split($0, ary, ",");
  for (i = 1; i <= n; i += 2) {
    j = (i + 1) / 2;
    print ary[i] "," ary[i+1] >> "file" j ".csv";
  }
}' input.csv
The following script produces what you want (based on the SO answer, adjusted for your needs: number of columns, field separator). It splits the original file vertically into 2-column chunks (note n=2) and creates 3 different files (tmp.examples.1, tmp.examples.2, tmp.examples.3, or whatever you specify for the f variable):
awk -F "," -v f="tmp.examples" '{for (i=1; i<=NF; i++) printf (i%n==0||i==NF)?$i RS:$i FS > f "." int((i-1)/n+1) }' n=2 example.txt
This assumes your example.txt file has the following data:
site,address,site,address,site,address
Google,google.com,MS,microsoft.com,Apple,apple.com
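Since the question also asks how to do this in Python, here is a minimal sketch of the same column-wise split. It assumes the input is example.txt, that the columns repeat in groups of two (change PAIR_WIDTH otherwise), and uses placeholder output names file1.csv, file2.csv, and so on:
import csv

PAIR_WIDTH = 2                      # number of columns that belong together
writers = {}                        # one csv writer per output chunk
handles = []

with open("example.txt", newline="") as f:
    for row in csv.reader(f):
        # walk the row in chunks of PAIR_WIDTH columns
        for chunk_no, start in enumerate(range(0, len(row), PAIR_WIDTH), start=1):
            if chunk_no not in writers:
                out = open("file{}.csv".format(chunk_no), "w", newline="")
                handles.append(out)
                writers[chunk_no] = csv.writer(out)
            writers[chunk_no].writerow(row[start:start + PAIR_WIDTH])

for out in handles:
    out.close()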
python: Remove trailing 0's and decimal point from awk command
I need to remove the trailing zeros from an export. The code reads the original tempFile; I need columns 2 and 6, which contain:
12|9781624311390|1|1|0|0.0000
13|9781406273687|1|1|0|99.0000
14|9781406273717|1|1|0|104.0000
15|9781406273700|1|1|0|63.0000
The awk command changes the format to comma separated and dumps columns 2 and 6 into tempFile2 - and I need to remove the trailing zeros from column 6 so the end result looks like this:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
I believe this should do the trick but have had no luck implementing it:
awk '{sub("\\.*0+$",""); print}'
Below is the code I need to adjust; $6 is the column to remove zeros from:
if not isError:
    print "Translating SQL output to tab delimited format"
    awkRunSuccess = os.system(
        "awk -F\"|\" '{print $2 \"\\,\" $6}' %s > %s" % (tempFile, tempFile2)
    )
    if awkRunSuccess != 0:
        isError = True
You can use gsub("\\.*0+$","",$2) to do this, as per the following transcript:
pax> echo '9781624311390|0.0000
9781406273687|99.0000
9781406273717|104.0000
9781406273700|63.0000' | awk -F'|' '{gsub("\\.*0+$","",$2);print $1","$2}'
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
However, given you're already within Python (and it's no slouch when it comes to regexes), you'd probably want to use it natively rather than start up an awk process.
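A rough sketch of doing it natively in Python, under the assumptions of the question (pipe-separated input, columns 2 and 6 wanted); the tempFile paths are placeholders and the regex simply mirrors the awk pattern above:
import re

tempFile = "tempFile.txt"        # placeholder paths; the real script builds these elsewhere
tempFile2 = "tempFile2.txt"

with open(tempFile) as src, open(tempFile2, "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 6:
            continue                              # skip malformed lines
        isbn, price = fields[1], fields[5]
        # like the awk regex: strip an optional dot plus trailing zeros,
        # e.g. 99.0000 -> 99 and 0.0000 -> 0
        price = re.sub(r"\.?0+$", "", price) or "0"
        dst.write("{},{}\n".format(isbn, price))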
Try this awk command:
awk -F '[|.]' '{print $2","$(NF-1)}' FileName
Output:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
Combine lines with matching keys
I have a text file with the following structure:
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
Each ID corresponds to a dataset which is analysed by an operator several times, i.e. J1 and J2 are the first and second attempts by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value lies in the column true.
What I would like to do is to create 3 new text files comparing the results for J1 vs J2, S1 vs S2 and J1 vs S1. Example output for J1 vs J2:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
where a1 is measurement a for J1, etc. Another example is for S1 vs S2:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3
The IDs will not be in alphanumerical order, nor will the operators be clustered for the same ID. I'm not certain how best to approach this task - using Linux tools or a scripting language like perl/python. My initial attempt using Linux quickly hit a brick wall.
First find all unique IDs (sorted):
awk -F, '/^WCBP/ {print $1}' file | uniq | sort -k 1.5n > unique_ids
Loop through these IDs and sort J1, J2:
foreach i (`more unique_ids`)
    grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
This gives me the data sorted:
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,80.4
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,75.3
WCBP12239,J2,66.6,72.9,79.5,76.6,75.3
I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop:
awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'
Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better using a proper scripting language.
You can use the Perl CSV module Text::CSV to extract the fields and then store them in a hash, where ID is the main key, the second field is the secondary key and all the fields are stored as the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop.
use strict;
use warnings;
use Text::CSV;

my %data;
my $csv = Text::CSV->new({
    binary => 1,   # safety precaution
    eol    => $/,  # important when using $csv->print()
});

while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;    # first two fields
    $data{$id}{$J} = $row;   # store line
}
Python Way:
import os, sys, re, itertools

info = ["WCBP12236,J1,75.7,80.6,65.9,83.2,82.1",
        "WCBP12236,J2,76.3,79.6,61.7,81.9,82.1",
        "WCBP12236,S1,77.2,81.5,69.4,84.1,82.1",
        "WCBP12236,S2,68.0,68.0,53.2,68.5,82.1",
        "WCBP12234,J1,63.7,67.7,72.2,71.6,75.3",
        "WCBP12234,J2,68.6,68.4,41.4,68.9,80.4",
        "WCBP12234,S1,81.8,82.7,67.0,87.5,75.3",
        "WCBP12234,S2,66.6,67.9,53.0,70.7,72.7",
        "WCBP12238,J1,78.6,79.0,56.2,82.1,82.1",
        "WCBP12239,J2,66.6,72.9,79.5,76.6,75.3",
        "WCBP12239,S1,86.6,87.8,23.0,23.0,82.1",
        "WCBP12239,S2,86.0,86.9,62.3,89.7,82.1",
        "WCBP12239,J1,70.9,71.3,66.0,73.7,75.3",
        "WCBP12238,J2,75.1,75.2,54.3,76.4,82.1",
        "WCBP12238,S1,65.9,66.0,40.2,66.5,80.4",
        "WCBP12238,S2,72.7,73.2,52.6,73.9,72.7"]

def extract_data(operator_1, operator_2):
    operator_index = 1
    id_index = 0
    data = {}
    result = []
    ret = []
    for line in info:
        conv_list = line.split(",")
        if len(conv_list) > operator_index and (
                (operator_1.strip().upper() == conv_list[operator_index].strip().upper()) or
                (operator_2.strip().upper() == conv_list[operator_index].strip().upper())):
            if data.has_key(conv_list[id_index]):
                iters = [iter(conv_list[int(operator_index)+1:]),
                         iter(data[conv_list[id_index]])]
                data[conv_list[id_index]] = list(it.next() for it in itertools.cycle(iters))
                continue
            data[conv_list[id_index]] = conv_list[int(operator_index)+1:]
    return data

ret = extract_data("j1", "s2")
print ret
O/P:
{'WCBP12239': ['70.9', '86.0', '71.3', '86.9', '66.0', '62.3', '73.7', '89.7', '75.3', '82.1'],
 'WCBP12238': ['72.7', '78.6', '73.2', '79.0', '52.6', '56.2', '73.9', '82.1', '72.7', '82.1'],
 'WCBP12234': ['66.6', '63.7', '67.9', '67.7', '53.0', '72.2', '70.7', '71.6', '72.7', '75.3'],
 'WCBP12236': ['68.0', '75.7', '68.0', '80.6', '53.2', '65.9', '68.5', '83.2', '82.1', '82.1']}
I didn't use Text::CSV like TLP did. If you needed to you could, but for this example, since there were no embedded commas in the fields, I did a simple split on ','. Also, the true fields from both operators are listed (instead of just 1) as I thought the special case of the last value complicates the solution.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/ mesh /;

my %data;

while (<DATA>) {
    chomp;
    my ($id, $op, @vals) = split /,/;
    $data{$id}{$op} = \@vals;
}

my @ops = ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]);

for my $id (sort keys %data) {
    for my $comb (@ops) {
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        print $fh join(",", $id, mesh(@$a1, @$a2)), "\n";
        close $fh or die $!;
    }
}

__DATA__
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12239,J1,78.6,79.0,56.2,82.1,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12238,J1,70.9,71.3,66.0,73.7,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
The output files produced are below.
J1 J2.txt
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3,75.3
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1,82.1
WCBP12238,70.9,75.1,71.3,75.2,66.0,54.3,73.7,76.4,84.1,84.1
WCBP12239,78.6,66.6,79.0,72.9,56.2,79.5,82.1,76.6,82.1,82.1
S1 S2.txt
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3,75.3
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1,82.1
WCBP12238,65.9,72.7,66.0,73.2,40.2,52.6,66.5,73.9,84.1,84.1
WCBP12239,86.6,86.0,87.8,86.9,23.0,62.3,23.0,89.7,82.1,82.1
J1 S1.txt
WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3,75.3
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1,82.1
WCBP12238,70.9,65.9,71.3,66.0,66.0,40.2,73.7,66.5,84.1,84.1
WCBP12239,78.6,86.6,79.0,87.8,56.2,23.0,82.1,23.0,82.1,82.1
Update: To get only 1 true value, the for loop could be written like this:
for my $id (sort keys %data) {
    for my $comb (@ops) {
        local $" = '';
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        pop @$a2;
        my @mesh = grep defined, mesh(@$a1, @$a2);
        print $fh join(",", $id, @mesh), "\n";
        close $fh or die $!;
    }
}
Update: Added 'defined' for the test in the grep expr. as it is the proper way (instead of just testing '$_', which could possibly be 0 and wrongly excluded from the list by grep).
Any problem that awk or sed can solve, there is no doubt that python, perl, java, go, c++ or c can solve too. However, it is not necessary to write a complete program in any of them; use awk in a one-liner.
VERSION 1
For most use cases, VERSION 1 is good enough.
tail -n +2 file |  # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, '$2==x{id=$1;a=$3;b=$4;c=$5;d=$6}
                      id==$1 && $2==y{$3=a","$3; $4=b","$4; $5=c","$5; $6=d","$6; $2=""; $0=$0; $1=$1; print}' \
    x=J1 y=S1
Just replace the values of the arguments x and y with what you like. Please note the values of x and y must follow alphabetical order, e.g., x=J1 y=S1 is OK, but x=S1 y=J1 doesn't work.
VERSION 2
The limitation mentioned in VERSION 1, that you have to specify x and y in alphabetical order, is removed. So x=S1 y=J1 is OK now.
tail -n +2 file |  # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6}
                      id==$1 && $2==z{$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6; $2=""; $0=$0; $1=$1; print}' \
    x=S1 y=J1
However, the data of J1 is still put before the data of S1, which means the column a1 in the resulting output is always the column a of J1 in the input file, and a2 in the resulting output is always the column a of S1 in the input file.
VERSION 3
The limitation mentioned in VERSION 2 is removed. Now with x=S1 y=J1, the output column a1 would be the input column a of S1, and a2 would be the a of J1.
tail -n +2 file |  # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6}
                      id==$1 && $2==z{if (z==y) {$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6}
                                      else {$3=$3","a;$4=$4","b;$5=$5","c;$6=$6","d}
                                      $2=""; $0=$0; $1=$1; print}' \
    x=S1 y=J1
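Reading the file directly in Python is also an option. Below is a minimal sketch that writes one comparison file per operator pair; it assumes the input is named data.txt with the header shown in the question, skips any ID that lacks one of the two operators, and uses placeholder output names such as J1_vs_J2.csv:
import csv

pairs = [("J1", "J2"), ("S1", "S2"), ("J1", "S1")]
rows = {}                                  # rows[id][operator] = [a, b, c, d, true]
order = []                                 # keep IDs in first-seen order

with open("data.txt", newline="") as f:
    reader = csv.reader(f)
    next(reader)                           # skip the header line
    for rec in reader:
        id_, op, vals = rec[0], rec[1], rec[2:]
        if id_ not in rows:
            order.append(id_)
            rows[id_] = {}
        rows[id_][op] = vals

for op1, op2 in pairs:
    with open("{}_vs_{}.csv".format(op1, op2), "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ID", "a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2", "true"])
        for id_ in order:
            v1, v2 = rows[id_].get(op1), rows[id_].get(op2)
            if v1 is None or v2 is None:
                continue                   # skip IDs missing one of the operators
            interleaved = [x for pair in zip(v1[:4], v2[:4]) for x in pair]
            writer.writerow([id_] + interleaved + [v1[4]])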