Summing up two columns the Unix way - python
# To fix the symptom
How can you sum up the following columns effectively?
Column 1
1
3
3
...
Column 2
2323
343
232
...
This should give me
Expected result
2324
346
235
...
I have the columns in two files.
# Initial situation
I sometimes use too many curly brackets, so my files contain one more { than }.
I am trying to find where I have used the one unnecessary curly bracket.
I have used the following steps in getting the data
Find commands
find . * -exec grep '{' {} + > /tmp/1
find . * -exec grep '}' {} + > /tmp/2
AWK commands
awk -F: '{ print $2 }' /tmp/1 > /tmp/11
awk -F: '{ print $2 }' /tmp/2 > /tmp/22
The columns are in the files /tmp/11 and /tmp/22.
I repeat a lot of similar commands in my procedure.
This suggests to me that this is not the right way.
Please suggest any way, such as Python, Perl or any Unix tool, which can decrease the number of steps.
If c1 and c2 are your files, you can do this:
$ paste c1 c2 | awk '{print $1 + $2}'
Or (without AWK):
$ paste c1 c2 | while read i j; do echo $(($i+$j)); done
Using python:
totals = [ int(i)+int(j) for i, j in zip ( open(fname1), open(fname2) ) ]
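A slightly more defensive variant of the Python line above, as a minimal sketch: it closes both files when done. The function name sum_columns is only illustrative and does not come from the answer; the sketch assumes each file holds one integer per line.

```python
def sum_columns(path1, path2):
    """Return the line-by-line sums of two files holding one integer per line."""
    with open(path1) as f1, open(path2) as f2:
        # zip stops at the end of the shorter file, like the one-liner above
        return [int(a) + int(b) for a, b in zip(f1, f2)]
```

With the files from the question, this would be called as sum_columns("/tmp/11", "/tmp/22").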
You can avoid the intermediate steps by using a command that does the counting and the comparison at the same time:
find . -type f -exec perl -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g' {} \;
This calls the Perl program once per file; the Perl program counts the number of each type of curly brace and prints the name of the file if the counts don't match.
You must be careful with the /([}{])/ section: find will think it needs to do the replacement on {} if you say /([{}])/.
WARNING: this code will have false positives and negatives if you are trying to run it against source code. Consider the following cases:
balanced, but curlies in strings:
if ($s eq '{') {
print "I saw a {\n"
}
unbalanced, but curlies in strings:
while (1) {
print "}";
You can expand the Perl command by using B::Deparse:
perl -MO=Deparse -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g'
Which results in:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
sub END {
print $ARGV if $h{'{'} != $h{'}'};
}
;
++$h{$_} foreach (/([}{])/g);
}
We can now look at each piece of the program:
BEGIN { $/ = "\n"; $\ = "\n"; }
This is caused by the -l option. It sets both the input and output record separators to "\n". This means anything read in will be broken into records based on "\n" and any print statement will have "\n" appended to it.
LINE: while (defined($_ = <ARGV>)) {
}
This is created by the -n option. It loops over every file passed in via the commandline (or STDIN if no files are passed) reading each line of those files. This also happens to set $ARGV to the last file read by <ARGV>.
chomp $_;
This removes whatever is in the $/ variable from the line that was just read ($_), it does nothing useful here. It was caused by the -l option.
sub END {
print $ARGV if $h{'{'} != $h{'}'};
}
This is an END block; this code will run at the end of the program. It prints $ARGV (the name of the file last read from, see above) if the values stored in %h associated with the keys '{' and '}' are not equal.
++$h{$_} foreach (/([}{])/g);
This needs to be broken down further:
/
( #begin capture
[}{] #match any of the '}' or '{' characters
) #end capture
/gx
Is a regex that returns a list of '{' and '}' characters that are in the string being matched. Since no string was specified the $_ variable (which holds the line last read from the file, see above) will be matched against. That list is fed into the foreach statement which then runs the statement it is in front of for each item (hence the name) in the list. It also sets $_ (as you can see $_ is a popular variable in Perl) to be the item from the list.
++$h{$_}
This line increments the value in %h that is associated with $_ (which will be either '{' or '}', see above) by one.
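For comparison, the same per-file check can be sketched in Python. This is a rough equivalent under stated assumptions, not a translation of the Perl one-liner: the function name unbalanced_files is hypothetical, files are read as UTF-8 with undecodable bytes replaced, and the same false positives and negatives described in the warning above apply.

```python
import os

def unbalanced_files(root="."):
    """Yield (path, n_open, n_close) for files whose { and } counts differ."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps binary or oddly encoded files
                # from aborting the scan
                with open(path, encoding="utf-8", errors="replace") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable file: skip it
            n_open, n_close = text.count("{"), text.count("}")
            if n_open != n_close:
                yield path, n_open, n_close
```

Looping over unbalanced_files(".") and printing each tuple lists the suspect files together with both counts, which also shows which kind of brace is in excess.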
In Python (or Perl, Awk, &c) you can reasonably do it in a single stand-alone "pass" -- I'm not sure what you mean by "too many curly brackets", but you can surely count curly use per file. For example (unless you have to worry about multi-GB files), the 10 files using most curly braces:
import heapq
import os
curliest = dict()
for path, dirs, files in os.walk('.'):
    for afile in files:
        fn = os.path.join(path, afile)
        with open(fn) as f:
            data = f.read()
        braces = data.count('{') + data.count('}')
        curliest[fn] = braces
# pick the 10 files with the most braces, then print them in ascending order
top10 = heapq.nlargest(10, curliest, curliest.get)
top10.sort(key=curliest.get)
for fn in top10:
    print('%6d %s' % (curliest[fn], fn))
Reply to Lutz'n answer
My problem was finally solved by this command
paste -d: /tmp/1 /tmp/2 | awk -F: '{ print $1 "\t" $2 - $4 }'
your problem can be solved with just 1 awk command...
awk '{getline i<"file1";print i+$0}' file2
Related
Python csv merge multiple files with different columns
I hope somebody can help me with this issue. I have about 20 csv files (each file with its headers), each of this files has hundreds of columns. My problem is related to merging those files, because a couple of them have extra columns. I was wondering if there is an option to merge all those files in one adding all the new columns with related data without corrupting the other files. So far I used I used the awk terminal command: awk '(NR == 1) || (FNR > 1)' *.csv > file.csv to merge removing the headers from all the files expect from the first one. I got this from my previous question Merge multiple csv files into one But this does not solve the issue with the extra column. EDIT: Here are some file csv in plain text with the headers. file 1 "#timestamp","#version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAd
dress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type" 2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, file2 
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName"
,"fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type" 2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, file3 "#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustom
String1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type" 2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, file4 
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName"
,"fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type" 2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, Those are 4 of the 20 files, I included all the headers but no rows because they contain sensitive data. When I run the script on those files, I can see that it writes the timestamp value. But when I run it against the original files (with a lot of data) all what it does, is writing the header and that's it.Please if you need some more info just let me know. Once I run the script on the original file. This is what I get back There are 20 rows (one for each file) but it doesn't write the content of each file. This could be related to the sniffing of the first line? because I think that is checking only the first line of the files and moves forward as in the script. So how is that in a small file, it manage to copy merge also the content?
Your question isn't clear: I don't know whether you want a solution in awk, in Python, or in either, and it doesn't have any sample input/output we can test with, so this is a guess. Is this what you're trying to do (using any awk in any shell on every Unix box)?
$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d
==> file2.csv <==
1,2,3
x,y,z
$ cat tst.awk
BEGIN {
    FS = OFS = ","
    for (i=1; i<ARGC; i++) {
        if ( (getline < ARGV[i]) > 0 ) {
            if ( NF > maxNF ) {
                maxNF = NF
                hdr = $0
            }
        }
    }
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }
$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
See http://awk.freeshell.org/AllAboutGetline for details on when/how to use getline and its associated caveats.
Alternatively, with an assist from GNU head for -q:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    if ( NF > maxNF ) {
        maxNF = NF
        hdr = $0
    }
    next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }
$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.
awk -F ',' -v cols=5 'BEGIN { OFS=FS } FNR == 1 && NR > 1 { next } NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" } 1' *.csv >file.csv
I slightly refactored this to skip the unwanted lines with next rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.
You can easily print the number of columns in each file, and just note the maximum:
awk -F , 'FNR==1 { print NF, FILENAME }' *.csv
If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.
import csv
import sys
# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the line is longer than current max, update
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after first line, skip to next file
            break
# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                # Skip each file's own header line
                continue
            row.extend([''] * (fields - len(row)))
            writer.writerow(row)
This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in different order, you need a more complex solution (though not by much; the Python csv.DictReader class could do most of the heavy lifting).
Demo: https://ideone.com/S998l4
If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN block to read all the files before starting the main script.
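The DictReader idea mentioned above could look roughly like the following. This is a minimal sketch, not the answerer's code: merge_csv is a hypothetical name, and it assumes every file has a header line with unique column names and that no data row is longer than its own header.

```python
import csv

def merge_csv(paths, out):
    """Merge CSV files whose headers differ, leaving missing columns empty."""
    # First pass: collect the union of column names, in first-seen order.
    columns = []
    for path in paths:
        with open(path, newline="") as fh:
            header = next(csv.reader(fh), [])
        for name in header:
            if name not in columns:
                columns.append(name)

    # Second pass: write every row under the merged header.
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for path in paths:
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                writer.writerow(row)
```

Called as merge_csv(sys.argv[1:], sys.stdout) after importing sys, this matches values to columns by name, so extra columns may appear anywhere in a file and in any order.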
Perl: counting many words in many strings efficiently
I often find myself needing to count the number of times words appear in a number of text strings. When I do this, I want to know how many times each word, individually, appears in each text string. I don't believe my approach is very efficient and any help you could give me would be great.
Usually, I will write a loop that (1) pulls in a text from a txt file as a text string, (2) executes another loop that loops over the words I want to count, using a regular expression to check how many times a given word appears and each time pushing the count to an array, and (3) prints the array of counts separated by commas to a file.
Here is an example:
#create array that holds the list of words I'm looking to count;
@word_list = qw(word1 word2 word3 word4);
#create array that holds the names of the txt files I want to count;
$data_loc = "/data/txt_files_for_counting/";
opendir(DIR1,"$data_loc")||die "CAN'T OPEN DIRECTORY";
my @file_names=readdir(DIR1);
#create place to save results;
$out_path_name = "/output/my_counts.csv";
open (OUT_FILE, ">>", $out_path_name);
#run the loops;
foreach $file(@file_names){
    if ($file=~/^\./) {next;}
    #Pull in text from txt file;
    {
        $P_file = $data_loc."/".$file;
        open (B, "$P_file") or die "can't open the file: $P_file: $!";
        $text_of_txt_file = do {local $/; <B>};
        close B or die "CANNOT CLOSE $P_file: $!";
    }
    #preserve the filename so counts are interpretable;
    print OUT_FILE $file;
    foreach $wl_word(@word_list){
        #use regular expression to search for term without any context;
        @finds_p = ();
        @finds_p = $text_of_txt_file =~ m/\b$wl_word\b/g;
        $N_finds = @finds_p;
        print OUT_FILE ",".$N_finds;
    }
    print OUT_FILE ",\n";
}
close(OUT_FILE);
I've found this approach to be very inefficient (slow) as the number of txt files and the number of words I want to count grow. Is there a more efficient way to do this? Is there a perl package that does this? Could it be more efficient in python? (e.g., Is there a python package that will do this?) Thanks!
EDIT: note, I don't want to count the number of words, rather the presence of certain words. Thus, the answer in this question "What's the fastest way to count the number of words in a string in Perl?" doesn't quite apply. Unless I'm missing something.
Here's my take on how your code should be written. I'll spend a while explaining my choices and then update.
Always use strict and use warnings at the top of every Perl program that you write. You will also have to declare every variable using my as close as possible to its first point of use. It is an essential habit to get into as it will reveal many simple errors. They are also mandatory before you ask for help, as without them you will be seen to be negligent.
Don't comment source code that is self-evident. The encouragement to comment everything is a legacy from the 1970s, and has become an excuse for writing poor code. Most of the time, using identifiers and whitespace correctly will explain the function of your program far better than any comment.
You are correct to use the three-parameter form of open, but you should also use lexical file handles. And it is vital to check the result of every open and call die if it fails and the program cannot reasonably continue without access to the file. The die string must include the value of the variable $! to say why the open failed.
If your program opens many files then it is often more convenient to use the autodie pragma, which implicitly checks every IO operation for you.
You should read perldoc perlstyle to familiarise yourself with the format that most Perl programmers are comfortable with. Artifacts like
if ($file=~/^\./) {next;}
should be simply
next if $file =~ /^\./;
You have caught onto the do { local $/; ... } idiom to read an entire file into memory but you have limited its scope. Your block
{
    $P_file = $data_loc."/".$file;
    open (B, "$P_file") or die "can't open the file: $P_file: $!";
    $text_of_txt_file = do {local $/; <B>};
    close B or die "CANNOT CLOSE $P_file: $!";
}
is better written
my $text_of_txt_file = do {
    open my $fh, '<', $file;
    local $/;
    <$fh>;
};
Rather than looping over a list of words, it is faster and more concise to build a regular expression from your word list.
My program below shows this.
use strict;
use warnings;
use 5.010;
use autodie;
use constant DATA_LOC => '/data/txt_files_for_counting/';
use constant OUTPUT_FILE => '/output/my_counts.csv';
my @word_list = qw(word1 word2 word3 word4);
my $word_re = join '|', map quotemeta, @word_list;
$word_re = qr/$word_re/;
chdir DATA_LOC;
my @text_files = grep -f, glob '*.*';
my @find_counts;
for my $file ( @text_files ) {
    next if $file =~ /^\./;
    my $text = do {
        open my $in_fh, '<', $file;
        local $/;
        <$in_fh>
    };
    # Assigning through an empty list forces list context, so this counts
    # every match instead of returning a true/false value
    my $n_finds = () = $text =~ /\b$word_re\b/g;
    push @find_counts, $n_finds;
}
open my $out_fh, '>', OUTPUT_FILE;
print $out_fh join(',', @find_counts), "\n";
close $out_fh;
First off - what you're doing with opendir - I wouldn't, and would suggest glob instead.
And otherwise - there's another useful trick. Compile a regex for your "words". The reason this is useful is that with a variable in a regex, Perl needs to recompile the regex each time, in case the variable has changed. If it's static, then it no longer needs to.
use strict;
use warnings;
use autodie;
my @words = ( "word1", "word2", "word3", "word4", "word5 word6" );
my $words_regex = join( "|", map ( quotemeta, @words ));
$words_regex = qr/\b($words_regex)\b/;
open( my $output, ">", "/output/my_counts.csv" );
# The /* makes glob return the files inside the directory
foreach my $file ( glob("/data/txt_files_for_counting/*") ) {
    open( my $input, "<", $file );
    my %count_of;
    while (<$input>) {
        foreach my $match (m/$words_regex/g) {
            $count_of{$match}++;
        }
    }
    print {$output} $file, "\n";
    foreach my $word (@words) {
        print {$output} $word, " => ", $count_of{$word} // 0, "\n";
    }
    close ( $input );
}
With this approach - you no longer need to 'slurp' the whole file into memory in order to process it. (Which may not be as big an advantage, depending how large the files are).
When fed data of:
word1 word2 word3 word4 word5 word6 word2 word5 word4 word4 word5 word word 45 sdasdfasf word5 word6 sdfasdf sadf
Outputs:
word1 => 1
word2 => 2
word3 => 1
word4 => 3
word5 word6 => 2
I will note however - if you have overlapping substrings in your regex, then this won't work as is - it's possible though, you just need a different regex.
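One common way to cope with terms that share a prefix is to try the longer alternatives first. The sketch below shows that idea in Python; it is an illustration under assumptions, not the "different regex" the answer had in mind, and the name count_terms is hypothetical. It counts non-overlapping matches only, so text consumed by a longer term is not also counted for a shorter one.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Count non-overlapping occurrences of each term, preferring longer terms."""
    # Longest first, so "word5 word6" is tried before "word5" at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    found = Counter(pattern.findall(text))
    # Report every requested term, including those that never matched.
    return {term: found[term] for term in terms}
```

The pattern is compiled once per word list, so the per-file cost is a single scan of the text regardless of how many terms are being counted.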
If you have words separated by spaces, use a collections.Counter dict in Python to count all words:
from collections import Counter
with open("in.txt") as f:
    counts = Counter(word for line in f for word in line.split())
Then access by key to get the count of how many times each word appears, for whatever words you want:
print(counts["foo"])
print(counts["bar"])
.....
So with one pass over the words in the file you get the count for all the words; whether you have 1 or 10000 words to count, you only have to build the dict once. Unlike normal dicts, any word/key that you try to access that is not in the dict won't raise a KeyError; 0 will be returned instead.
If you wanted only certain words to be stored, use a set to hold the words you want to keep and do a lookup for each word:
from collections import Counter
words = {"foo","bar","foobar"}
with open("out.txt") as f:
    counts = Counter(word for line in f for word in line.split() if word in words)
That would only store the count for the words in words; set lookups are on average O(1).
If you wanted to search for a phrase then you could use sum and in, but you would have to do it for each phrase, so multiple passes over the file. Note that this counts the lines containing the phrase:
with open("in.txt") as f:
    count = sum("word1 word2 word3" in line for line in f)
Your biggest bottleneck is the speed at which data are read from the storage medium. Using a small number of parallel processes, your program may be able to read one file while processing others, thus speeding up the process. This is unlikely to yield any benefits unless the files themselves are large.
Keep in mind, overlapping strings are hard. The code below prefers the longest match.
Non-parallelized version
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec::Functions qw( catfile );
use Text::CSV_XS;
die "Need directory and extension\n" unless @ARGV == 2;
my ($data_dir, $ext) = @ARGV;
my $pat = join('|',
    map quotemeta,
    sort { (length($b) <=> length($a)) }
    my @words = (
        'Visual Studio',
        'INCLUDE',
        'Visual',
    )
);
my $csv= Text::CSV_XS->new;
opendir my $dir, $data_dir
    or die "Cannot open directory: '$data_dir': $!";
my %wanted_words;
while (my $file = readdir $dir) {
    next unless $file =~ /[.]\Q$ext\E\z/;
    my $path = catfile($data_dir, $file);
    next unless -f $path;
    open my $fh, '<', $path
        or die "Cannot open '$path': $!";
    my $contents = do { local $/; <$fh> };
    close $fh
        or die "Cannot close '$path': $!";
    while ($contents =~ /($pat)/go) {
        $wanted_words{ $file }{ $1 } += 1;
    }
}
for my $file (sort keys %wanted_words) {
    my $file_counts = $wanted_words{ $file };
    my @fields = ($file, sort keys %$file_counts);
    $csv->combine(@fields)
        or die "Failed to combine [@fields]";
    print $csv->string, "\n";
}
For a test, I ran the script in a directory containing some temporary batch files from a Boost installation:
C:\...\Temp> perl count.pl . cmd
b2_msvc_14.0_vcvarsall_amd64.cmd,INCLUDE,"Visual Studio"
b2_msvc_14.0_vcvarsall_x86.cmd,INCLUDE,"Visual Studio"
b2_msvc_14.0_vcvarsall_x86_arm.cmd,INCLUDE,"Visual Studio"
That is, all occurrences of "Visual" are ignored in favor of "Visual Studio".
For generating CSV output, you should use the combine method in Text::CSV_XS, instead of using join(',' ...).
Version using Parallel::ForkManager
Whether this will get anything done faster depends on the sizes of the input files, and the speed of the storage medium. If there is an improvement, the right number of processes is likely to be between N/2 and N, where N is the number of cores. I did not test this.
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec::Functions qw( catfile );
use Parallel::ForkManager;
use Text::CSV_XS;
die "Need number of processes, directory, and extension\n" unless @ARGV == 3;
my ($procs, $data_dir, $ext) = @ARGV;
my $pat = join('|',
    map quotemeta,
    sort { (length($b) <=> length($a)) }
    my @words = (
        'Visual Studio',
        'INCLUDE',
        'Visual',
    )
);
my $csv= Text::CSV_XS->new;
opendir my $dir, $data_dir
    or die "Cannot open directory: '$data_dir': $!";
my $fm = Parallel::ForkManager->new($procs);
ENTRY:
while (my $file = readdir $dir) {
    next unless $file =~ /[.]\Q$ext\E\z/;
    my $path = catfile($data_dir, $file);
    next unless -f $path;
    my $pid = $fm->start and next ENTRY;
    my %wanted_words;
    open my $fh, '<', $path
        or die "Cannot open '$path': $!";
    my $contents = do { local $/; <$fh> };
    close $fh
        or die "Cannot close '$path': $!";
    while ($contents =~ /($pat)/go) {
        $wanted_words{ $1 } += 1;
    }
    my @fields = ($file, sort keys %wanted_words);
    $csv->combine(@fields)
        or die "Failed to combine [@fields]";
    print $csv->string, "\n";
    $fm->finish;
}
$fm->wait_all_children;
I would prefer using a one-liner:
$ for file in /data/txt_files_for_counting/*; do perl -F'/\W+/' -nale 'BEGIN { @w = qw(word1 word2 word3 word4) } $h{$_}++ for map { $w = lc $_; grep { $_ eq $w } @w } @F; END { print join ",", $ARGV, map { $h{$_} || 0 } @w; }' "$file"; done
Combine lines with matching keys
I have a text file with the following structure
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
Each ID corresponds to a dataset which is analysed by an operator several times, i.e. J1 and J2 are the first and second attempt by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value lies in the column true.
What I would like to do is to create 3 new text files comparing the results for J1 vs J2, S1 vs S2 and J1 vs S1. Example output for J1 vs J2:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
where a1 is measurement a for J1, etc. Another example is for S1 vs S2:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3
The IDs will not be in alphanumerical order, nor will the operators be clustered for the same ID. I'm not certain how best to approach this task - using linux tools or a scripting language like perl/python.
My initial attempt using Linux tools quickly hit a brick wall.
First find all unique IDs (sorted):
awk -F, '/^WCBP/ {print $1}' file | uniq | sort -k 1.5n > unique_ids
Loop through these IDs and sort J1, J2:
foreach i (`more unique_ids`)
    grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
This gives me the data sorted:
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,80.4
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,75.3
WCBP12239,J2,66.6,72.9,79.5,76.6,75.3
I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop:
awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'
Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better using a proper scripting language.
You can use the Perl CSV module Text::CSV to extract the fields, and then store them in a hash, where ID is the main key, the second field is the secondary key and all the fields are stored as the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop.
use strict;
use warnings;
use Text::CSV;
my %data;
my $csv = Text::CSV->new({
    binary => 1,   # safety precaution
    eol    => $/,  # important when using $csv->print()
});
while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;   # first two fields
    $data{$id}{$J} = $row;  # store line
}
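The same two-level lookup (ID first, then operator) can be sketched in Python to show how the comparison step might follow from it. This is only an illustration under assumptions: the helper names `load` and `compare` are invented here, the sample rows are taken from the question, and IDs that lack one of the two operators are simply skipped.

```python
import csv
import io
from collections import defaultdict

# A few rows from the question, deliberately not grouped by operator.
SAMPLE = """ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
"""


def load(stream):
    """Build data[id][operator] = [a, b, c, d, true] from CSV text."""
    data = defaultdict(dict)
    reader = csv.reader(stream)
    next(reader)  # skip the header line
    for row in reader:
        if not row:
            continue
        rec_id, operator, *values = row
        data[rec_id][operator] = values
    return data


def compare(data, op1, op2):
    """Interleave the measurements of op1 and op2, keeping one true value."""
    rows = []
    for rec_id, by_op in data.items():
        if op1 not in by_op or op2 not in by_op:
            continue  # assumption: incomplete pairs are left out
        first, second = by_op[op1], by_op[op2]
        paired = [v for pair in zip(first[:-1], second[:-1]) for v in pair]
        rows.append([rec_id] + paired + [first[-1]])
    return rows


data = load(io.StringIO(SAMPLE))
for row in compare(data, "J1", "J2"):
    print(",".join(row))
```

Calling `compare` once per operator pair and writing each result to its own file would give the three files the question asks for.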
Python way:
# Python 2 (uses dict.has_key, it.next() and the print statement)
import os, sys, re, itertools
info=["WCBP12236,J1,75.7,80.6,65.9,83.2,82.1",
      "WCBP12236,J2,76.3,79.6,61.7,81.9,82.1",
      "WCBP12236,S1,77.2,81.5,69.4,84.1,82.1",
      "WCBP12236,S2,68.0,68.0,53.2,68.5,82.1",
      "WCBP12234,J1,63.7,67.7,72.2,71.6,75.3",
      "WCBP12234,J2,68.6,68.4,41.4,68.9,80.4",
      "WCBP12234,S1,81.8,82.7,67.0,87.5,75.3",
      "WCBP12234,S2,66.6,67.9,53.0,70.7,72.7",
      "WCBP12238,J1,78.6,79.0,56.2,82.1,82.1",
      "WCBP12239,J2,66.6,72.9,79.5,76.6,75.3",
      "WCBP12239,S1,86.6,87.8,23.0,23.0,82.1",
      "WCBP12239,S2,86.0,86.9,62.3,89.7,82.1",
      "WCBP12239,J1,70.9,71.3,66.0,73.7,75.3",
      "WCBP12238,J2,75.1,75.2,54.3,76.4,82.1",
      "WCBP12238,S1,65.9,66.0,40.2,66.5,80.4",
      "WCBP12238,S2,72.7,73.2,52.6,73.9,72.7"
     ]
def extract_data(operator_1, operator_2):
    operator_index=1
    id_index=0
    data={}
    result=[]
    ret=[]
    for line in info:
        conv_list=line.split(",")
        if len(conv_list) > operator_index and ((operator_1.strip().upper() == conv_list[operator_index].strip().upper()) or (operator_2.strip().upper() == conv_list[operator_index].strip().upper()) ):
            if data.has_key(conv_list[id_index]):
                iters = [iter(conv_list[int(operator_index)+1:]), iter(data[conv_list[id_index]])]
                data[conv_list[id_index]]=list(it.next() for it in itertools.cycle(iters))
                continue
            data[conv_list[id_index]]=conv_list[int(operator_index)+1:]
    return data
ret=extract_data("j1", "s2")
print ret
O/P:
{'WCBP12239': ['70.9', '86.0', '71.3', '86.9', '66.0', '62.3', '73.7', '89.7', '75.3', '82.1'],
 'WCBP12238': ['72.7', '78.6', '73.2', '79.0', '52.6', '56.2', '73.9', '82.1', '72.7', '82.1'],
 'WCBP12234': ['66.6', '63.7', '67.9', '67.7', '53.0', '72.2', '70.7', '71.6', '72.7', '75.3'],
 'WCBP12236': ['68.0', '75.7', '68.0', '80.6', '53.2', '65.9', '68.5', '83.2', '82.1', '82.1']}
I didn't use Text::CSV like TLP did. You could if you needed to, but since there are no embedded commas in the fields in this example, I did a simple split on ','. Also, the true fields from both operators are listed (instead of just 1), as I thought the special case of the last value complicates the solution.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/ mesh /;
my %data;
while (<DATA>) {
    chomp;
    my ($id, $op, @vals) = split /,/;
    $data{$id}{$op} = \@vals;
}
my @ops = ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]);
for my $id (sort keys %data) {
    for my $comb (@ops) {
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        print $fh join(",", $id, mesh(@$a1, @$a2)), "\n";
        close $fh or die $!;
    }
}
__DATA__
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12239,J1,78.6,79.0,56.2,82.1,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12238,J1,70.9,71.3,66.0,73.7,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
The output files produced are below.
J1 J2.txt
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3,75.3
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1,82.1
WCBP12238,70.9,75.1,71.3,75.2,66.0,54.3,73.7,76.4,84.1,84.1
WCBP12239,78.6,66.6,79.0,72.9,56.2,79.5,82.1,76.6,82.1,82.1
S1 S2.txt
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3,75.3
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1,82.1
WCBP12238,65.9,72.7,66.0,73.2,40.2,52.6,66.5,73.9,84.1,84.1
WCBP12239,86.6,86.0,87.8,86.9,23.0,62.3,23.0,89.7,82.1,82.1
J1 S1.txt
WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3,75.3
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1,82.1
WCBP12238,70.9,65.9,71.3,66.0,66.0,40.2,73.7,66.5,84.1,84.1
WCBP12239,78.6,86.6,79.0,87.8,56.2,23.0,82.1,23.0,82.1,82.1
Update: To get only 1 true value, the for loop could be written like this:
for my $id (sort keys %data) {
    for my $comb (@ops) {
        local $" = '';
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        pop @$a2;
        my @mesh = grep defined, mesh(@$a1, @$a2);
        print $fh join(",", $id, @mesh), "\n";
        close $fh or die $!;
    }
}
Update: Added 'defined' as the test in the grep expression, as it is the proper way (instead of just testing '$_', which could be 0 and would then be wrongly excluded from the list by grep).
Any problem that awk or sed can solve can no doubt also be solved in Python, Perl, Java, Go, C++ or C. However, it is not necessary to write a complete program in any of them. Use awk in a one-liner.
VERSION 1
For most use cases, I think VERSION 1 is good enough.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, '$2==x{id=$1;a=$3;b=$4;c=$5;d=$6} id==$1 && $2==y{$3=a","$3; $4=b","$4; $5=c","$5; $6=d","$6; $2=""; $0=$0; $1=$1; print}' \
x=J1 y=S1
Just replace the values of the arguments x and y with whatever you like. Please note that the values of x and y must be in alphabetical order, e.g. x=J1 y=S1 is OK, but x=S1 y=J1 doesn't work.
VERSION 2
The limitation mentioned in VERSION 1, that you have to specify x and y in alphabetical order, is removed. For example, x=S1 y=J1 is OK now.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6} id==$1 && $2==z{$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6; $2=""; $0=$0; $1=$1; print}' \
x=S1 y=J1
However, the data of J1 is still put before the data of S1, which means that column a1 in the resulting output is always column a of J1 in the input file, and a2 in the resulting output is always column a of S1 in the input file.
VERSION 3
The limitation mentioned in VERSION 2 is removed. Now with x=S1 y=J1, the output column a1 is the input column a of S1, and a2 is the a of J1.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6} id==$1 && $2==z{if (z==y) {$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6} else {$3=$3","a;$4=$4","b;$5=$5","c;$6=$6","d} $2=""; $0=$0; $1=$1; print}' \
x=S1 y=J1
Extract words from a file, then list files along with line number that contain those words
I have a file called Strings.h that I use to localize an app I have. I want to search through all of my class files, find out if and where I am using each string, and output the classes and line numbers for each string.
My thought is to use Python, but maybe that's the wrong tool for the job. Also, I have a basic algorithm, but I worry it will take too long to run. Can you write this script to do what I want, or even just suggest a better algorithm?
Strings.h looks like this:
#import "NonLocalizedStrings.h"
#pragma mark Coordinate Behavior Strings
#define LATITUDE_WORD NSLocalizedString(@"Latitude", @"used in coordinate behaviors")
#define LONGITUDE_WORD NSLocalizedString(@"Longitude", @"used in coordinate behaviors")
#define DEGREES_WORD NSLocalizedString(@"Degrees", @"used in coordinate behaviors")
#define MINUTES_WORD NSLocalizedString(@"Minutes", @"Used in coordiante behaviors")
#define SECONDS_WORD NSLocalizedString(@"Seconds", @"Used in DMSBehavior.m")
...
The script should take each line that starts with #define, and then make a list of the words that appear after #define (e.g. LATITUDE_WORD).
The pseudocode might be:
file = strings.h
for line in file:
    extract word after #define
    search_words.push(word)
print search_words
[LATITUDE_WORD, LONGITUDE_WORD, DEGREES_WORD, MINUTES_WORD, SECONDS_WORD]
After I have the list of words, my pseudocode is something like:
found_words = {}
for word in words:
    found_words[word] = []
for file in files:
    for line in file:
        for word in search_words:
            if line contains word:
                found_words[word].push((filename, linenumber))
print found_words
So, found_words would look something like:
{
    LATITUDE_WORD: [ (foo.m, 42), (bar.m, 132) ],
    LONGITUDE_WORD: [ (baz.m, 22), (bim.m, 112) ],
}
How about this [in bash]?
$ pattern="\\<($(grep '^#define ' Strings.h | cut -d' ' -f2 | tr '\n' '|' | sed 's/|$//'))\\>"
$ find project_dir -iname '*.m' -exec egrep -Hno "${pattern}" {} + > matches
Output:
project_dir/bar.m:132:LATITUDE_WORD
project_dir/baz.m:22:LONGITUDE_WORD
project_dir/bim.m:112:LONGITUDE_WORD
project_dir/foo.m:42:LATITUDE_WORD
EDIT: I've altered the code above to redirect its output to a file, matches, so we can use that to show words that are never found:
for word in $(grep '^#define ' Strings.h | cut -d' ' -f2)
do
    if ! cut -d':' -f3 matches | grep -q "${word}"
    then
        echo "${word}"
    fi
done
So, it looks like you've got the right idea. Here are some advantages and disadvantages to what you've got.
Advantages:
If you use Python, your pseudocode translates almost line for line directly to your script.
You can learn a little bit more about Python (great skill to have for things like this).
Disadvantages:
Python will run a bit slower than some of the other bash-based solutions that have been posted (which is a problem if you have a lot of files to search).
Your Python script will be a little bit longer than these other solutions, but you can be a little bit more flexible with your output as well.
Answer:
Because I'm familiar with Python, and that's what you asked for originally, here's a bit more code you can use:
#!/usr/bin/env python
# List the files you want to search here
search_files = []
word_file = open('<FILE_PATH_HERE>', 'r')
# Allows for sorted output later.
words = []
# Contains all found instances.
inst_dict = {}
for line in word_file:
    if line[0:7] == "#define":
        w = line[7:].split()[0]
        words.append(w)
        inst_dict[w] = []
word_file.close()
for file_name in search_files:
    file_obj = open(file_name, 'r')
    # Line numbers start at 1, as editors show them.
    for line_num, line in enumerate(file_obj, 1):
        for w in words:
            if w in line:
                inst_dict[w].append((file_name, line_num))
    file_obj.close()
# Do whatever you want with 'words' and 'inst_dict'
words.sort()
for w in words:
    string = w + ":\n"
    for inst in inst_dict[w]:
        string += "\tFile: " + inst[0] + "\n"
        # The line number is an int, so convert it before concatenating.
        string += "\tLine: " + str(inst[1]) + "\n"
    print(string)
I haven't tested the search portion of the code, so use 'as is' at your own risk. Good luck, and feel free to ask questions or augment the code as you need. Your request was pretty simple and has lots of solutions, so I'd rather you understand how this works.
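One possible refinement, sketched here under assumptions rather than taken from the answer: the `w in line` test also fires when a word is only part of a longer identifier, and it rescans each line once per word. Compiling all words into a single whole-word regular expression avoids both. The function names `defined_words` and `find_uses` are invented for this sketch, and the sources are passed in as a name-to-lines mapping so the example runs without touching the file system.

```python
import re
from collections import defaultdict


def defined_words(header_lines):
    """Collect the identifier that follows each #define."""
    words = []
    for line in header_lines:
        match = re.match(r"#define\s+(\w+)", line)
        if match:
            words.append(match.group(1))
    return words


def find_uses(words, sources):
    """Map each word to (file name, line number) pairs, whole words only.

    `sources` maps a file name to an iterable of its lines.
    """
    found = defaultdict(list)
    if not words:
        return found
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    for name, lines in sources.items():
        for line_no, line in enumerate(lines, start=1):
            # set() so a word used twice on one line is reported once.
            for word in set(pattern.findall(line)):
                found[word].append((name, line_no))
    return found


header = [
    '#import "NonLocalizedStrings.h"',
    '#define LATITUDE_WORD NSLocalizedString(@"Latitude", @"used in coordinate behaviors")',
    '#define LONGITUDE_WORD NSLocalizedString(@"Longitude", @"used in coordinate behaviors")',
]
sources = {
    "foo.m": [
        "label.text = LATITUDE_WORD;",
        "other = MY_LATITUDE_WORD;",  # not a whole-word match
        "both = LONGITUDE_WORD + LATITUDE_WORD;",
    ],
}
uses = find_uses(defined_words(header), sources)
for word in sorted(uses):
    print(word, uses[word])
```

For real files, `sources` could be built by opening each class file and passing the open file object as the iterable of lines.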
This solution uses awk and globstar (the latter requires Bash 4). I think there can be further improvements, but consider this a draft of sorts.
shopt -s globstar
awk 'NR==FNR { if ($0 ~ /^#define/) found[$2]=""; next; }
{
    for (word in found){
        if ($0 ~ word)
            found[word]=found[word] "\t" FILENAME ":" FNR "\n";
    }
}
END { for (word in found) print word ":\n" found[word]}
' Strings.h **/*.m
Using the snippet of Strings.h you posted, here's the sort of output I get (with some test files I made up):
LATITUDE_WORD:
    lala1.m, 2
    lala3.m, 1
DEGREES_WORD:
    lala2.m, 5
SECONDS_WORD:
MINUTES_WORD:
    lala3.m, 3
LONGITUDE_WORD:
    lala3.m, 2
P.S. Haven't tested this with globstar since the bash I'm using right now is v3 (pfff!)
Here is a Python program. It can probably be reduced and made simpler, but it works.
import re
l=filecontent.split('\n')
for item in l:
    if item.startswith("#define"):
        print(re.findall("#define .+? ", item)[0].split(' ')[1])
You should try:
grep -oP '^#define\s+\K\S+' strings.h
If your grep lacks the -P option:
perl -lne 'print $& if /^#define\s+\K\S+/' strings.h
#!/bin/bash
# Assuming $files contains a list of your files
word_list=( $(grep '^#define' "${files[@]}" | awk '{ print $2 }') )
Splitting a file into smaller files by number of fields
I'm having a hard time breaking a large (50GB) csv file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes, others are integers, decimals and booleans.
I want to parse the file line by line and split by the number of fields in each row. The strings contain possibly several commas (such as ), as well as a number of empty fields.
,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,
I tried using
perl -pe' s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge ' file >> file2
to change the commas inside the quotes to | but that didn't work. I plan to use
awk -F"|" conditional statement appending to new k_fld_files file2
Is there an easier way to do this please? I'm looking at Python, but I probably need a utility that will stream process the file, line by line.
Using Python: if you just want to parse CSV including embedded delimiters, and stream out with a new delimiter, then something such as:
import csv
import sys
with open('filename.csv') as fin:
    csvout = csv.writer(sys.stdout, delimiter='|')
    for row in csv.reader(fin):
        csvout.writerow(row)
Otherwise, it's not much more difficult to make this do all kinds of stuff. Example of outputting to files per column (untested):
cols_to_output = {}
for row in csv.reader(fin):
    for colno, col in enumerate(row):
        # Open each output file only once; reopening with 'wb' would truncate it.
        if colno not in cols_to_output:
            cols_to_output[colno] = open('column_output.{}'.format(colno), 'wb')
        csv.writer(cols_to_output[colno]).writerow(row)
for fileobj in cols_to_output.itervalues():
    fileobj.close()
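Since the question asks for a split by the number of fields in each row, here is a minimal Python 3 sketch of that variant. It streams row by row and keeps one writer per distinct field count. The function name `split_by_field_count`, the `open_output` callback and the suggested output file names are assumptions made for this illustration; the in-memory buffers only keep the example self-contained.

```python
import csv
import io


def split_by_field_count(rows, open_output):
    """Write each row to the output that belongs to its number of fields.

    `open_output(n)` must return a writable text stream for rows with n
    fields; it is called once per distinct field count.
    Returns the mapping {field count: stream}; the caller closes the streams.
    """
    handles = {}
    writers = {}
    for row in rows:
        n = len(row)
        if n not in writers:
            handles[n] = open_output(n)
            writers[n] = csv.writer(handles[n])
        writers[n].writerow(row)
    return handles


# Small in-memory input with a quoted field that contains commas.
SAMPLE = (
    ',,1,30,50,"Sold by father,son and daughter for $4,000"\n'
    "12,,,20.9,0,\n"
    "1,2,3\n"
)

outputs = split_by_field_count(
    csv.reader(io.StringIO(SAMPLE)),
    lambda n: io.StringIO(),
)
for count in sorted(outputs):
    print(count, "fields:", len(outputs[count].getvalue().splitlines()), "row(s)")
```

For a real file the callback could be something like `lambda n: open("fields_{}.csv".format(n), "w", newline="")`, with the input opened using `newline=""` as the csv module documentation recommends. If the data has very many distinct field counts, the number of simultaneously open files may need to be capped.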
Here's an awk alternative. Assuming the quoted strings are well formatted, i.e. always have starting and terminating quotes, and no quotes within other quotes, you could do the replacement you suggested by doing a gsub on every other field, replacing , with |.
With pipes
Below is an example of how this might go when grabbing columns 3 through 6, 11 and 14-15 with coreutils cut:
awk -F'"' -v OFS='' '
NF > 1 {
    for(i=2; i<=NF; i+=2) {
        gsub(",", "|", $i);
        $i = FS $i FS; # reinsert the quotes
    }
    print
}'\
| cut -d , -f 3-6,11,14-15 \
| awk -F'"' -v OFS='' -e '
NF > 1 {
    for(i=2; i<=NF; i+=2) {
        gsub("\\|", ",", $i)
        $i = FS $i FS; # reinsert the quotes
    }
    print
}'
Note that there is an additional post-processing step that reverts the | to ,.
Entirely in awk
Alternatively, you could do the whole thing in awk with some loss of generality with regard to range specification. Here we only grab columns 3 to 6:
extract.awk
BEGIN {
    OFS = ""
    start = 3
    end = 6
}
{
    for(i=2; i<=NF; i+=2) {
        gsub(",", "|", $i)
        $i = FS $i FS
    }
    split($0, record, ",")
    for(i=start; i<=end-1; i++) {
        gsub("\\|", ",", record[i])
        printf("%s,", record[i])
    }
    gsub("\\|", ",", record[end])
    printf("%s\n", record[end])
}