Splitting a file into smaller files by number of fields - python

I'm having a hard time breaking a large (50GB) csv file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes, others are integers, decimals and booleans.
I want to parse the file line by line and split it by the number of fields in each row. The strings may contain several commas (as in the example below), as well as a number of empty fields.
,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,
I tried using
perl -pe' s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge ' file >> file2
to change the commas inside the quotes to | but that didn't work. I plan to use
awk -F"|" conditional statement appending to new k_fld_files file2
Is there an easier way to do this please? I'm looking at python, but I probably need a utility that will stream process the file, line by line.

Using Python - if you just want to parse CSV including embedded delimiters, and stream out with a new delimiter, then something such as:
import csv
import sys

with open('filename.csv') as fin:
    csvout = csv.writer(sys.stdout, delimiter='|')
    for row in csv.reader(fin):
        csvout.writerow(row)
Otherwise, it's not much more difficult to make this do all kinds of stuff.
Example of outputting to files per column (untested):
cols_to_output = {}
for row in csv.reader(fin):        # fin opened as in the snippet above
    for colno, col in enumerate(row):
        if colno not in cols_to_output:
            cols_to_output[colno] = open('column_output.{}'.format(colno), 'w', newline='')
        csv.writer(cols_to_output[colno]).writerow([col])
for f in cols_to_output.values():
    f.close()
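Going back to the original goal of splitting rows out by their field count, a minimal sketch along the same lines (the k_fld_file naming and the pipe output delimiter are assumptions based on the question):
import csv

# Sketch: stream the file once and route each row to an output file keyed by
# its field count. 'k_fld_file' naming and '|' delimiter are assumptions.
writers = {}
files = {}
with open('filename.csv', newline='') as fin:
    for row in csv.reader(fin):
        k = len(row)
        if k not in writers:
            files[k] = open('k_fld_file.{}'.format(k), 'w', newline='')
            writers[k] = csv.writer(files[k], delimiter='|')
        writers[k].writerow(row)
for f in files.values():
    f.close()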

Here's an awk alternative.
Assuming the quoted strings are well formatted, i.e. always have starting and terminating quotes, and no quotes within other quotes, you could do the replacement you suggested by doing a gsub on every other field replacing , with |.
With pipes
Below is an example of how this might go when grabbing columns 3 through 6, 11 and 14-15 with coreutils cut:
awk -F'"' -v OFS='' '
NF > 1 {
    for(i=2; i<=NF; i+=2) {
        gsub(",", "|", $i)
        $i = FS $i FS    # reinsert the quotes
    }
    print
}' file \
| cut -d , -f 3-6,11,14-15 \
| awk -F'"' -v OFS='' -e '
NF > 1 {
    for(i=2; i<=NF; i+=2) {
        gsub("\\|", ",", $i)
        $i = FS $i FS    # reinsert the quotes
    }
    print
}'
Note that there is an additional post-processing step that reverts the | to ,.
Entirely in awk
Alternatively, you could do the whole thing in awk with some loss of generality with regards to range specification. Here we only grab columns 3 to 6:
extract.awk
BEGIN {
    FS = "\""    # fields are separated by the double quotes
    OFS = ""
    start = 3
    end = 6
}
{
    for(i=2; i<=NF; i+=2) {
        gsub(",", "|", $i)
        $i = FS $i FS
    }
    split($0, record, ",")
    for(i=start; i<=end-1; i++) {
        gsub("\\|", ",", record[i])
        printf("%s,", record[i])
    }
    gsub("\\|", ",", record[end])
    printf("%s\n", record[end])
}
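Using Python instead, the csv module handles the quoted commas directly, so no comma-to-pipe round trip is needed. A minimal sketch that extracts columns 3 through 6, assuming the input file is named file as in the question:
import csv
import sys

# Sketch only: csv copes with the embedded, quoted commas by itself.
# Prints columns 3-6 (1-based) of 'file' to stdout as CSV.
with open('file', newline='') as fin:
    writer = csv.writer(sys.stdout)
    for row in csv.reader(fin):
        writer.writerow(row[2:6])   # Python slicing is 0-based: fields 3..6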

Related

Python csv merge multiple files with different columns

I hope somebody can help me with this issue.
I have about 20 csv files (each file with its headers); each of these files has hundreds of columns.
My problem is related to merging those files, because a couple of them have extra columns.
I was wondering if there is an option to merge all those files into one, adding all the new columns with the related data, without corrupting the other files.
So far I have used the awk terminal command:
awk '(NR == 1) || (FNR > 1)' *.csv > file.csv
to merge the files, removing the headers from all of them except the first one.
I got this from my previous question
Merge multiple csv files into one
But this does not solve the issue with the extra column.
EDIT:
Here are some file csv in plain text with the headers.
file 1
"#timestamp","#version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file2
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file3
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file4
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Those are 4 of the 20 files; I included all the headers but no rows because they contain sensitive data.
When I run the script on those files, I can see that it writes the timestamp value. But when I run it against the original files (with a lot of data), all it does is write the header and that's it. Please let me know if you need more info.
Once I run the script on the original files, this is what I get back:
There are 20 rows (one for each file) but it doesn't write the content of each file. Could this be related to the sniffing of the first line? I think it checks only the first line of each file and then moves on, as in the script. So how is it that with the small files it manages to merge the content as well?
Your question isn't clear: I don't know if you really want a solution in awk or python or either, and it doesn't include any sample input/output we can test with, so this is a guess. Is this what you're trying to do (using any awk in any shell on every Unix box)?
$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d
==> file2.csv <==
1,2,3
x,y,z
$ cat tst.awk
BEGIN {
    FS = OFS = ","
    for (i=1; i<ARGC; i++) {
        if ( (getline < ARGV[i]) > 0 ) {
            if ( NF > maxNF ) {
                maxNF = NF
                hdr = $0
            }
        }
    }
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }
$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
See http://awk.freeshell.org/AllAboutGetline for details on when/how to use getline and its associated caveats.
Alternatively with an assist from GNU head for -q:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    if ( NF > maxNF ) {
        maxNF = NF
        hdr = $0
    }
    next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }
$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.
awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv
I slightly refactored this to skip the unwanted lines with next rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.
You can easily print the number of columns in each file, and just note the maximum:
awk -F , 'FNR==1 { print NF, FILENAME }' *.csv
If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.
import csv
import sys

# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the line is longer than the current max, update
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after the first line, skip to the next file
            break

# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                continue
            row.extend([''] * (fields - len(row)))
            writer.writerow(row)
This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in different order, you need a more complex solution (though not by much; the Python CSV DictReader subclass could do most of the heavy lifting).
Demo: https://ideone.com/S998l4
If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN block to read all the files before starting the main script.
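For completeness, here is a sketch of the DictReader/DictWriter route mentioned above (used directly rather than as a subclass). It copes with extra columns appearing between other columns or in a different order; input files are taken from the command line and output goes to stdout, which is an assumption about how you would run it:
import csv
import sys

# Collect the union of all header names first, then re-read every file and
# write each row with missing columns left empty.
fieldnames = []
for filename in sys.argv[1:]:
    with open(filename, newline='') as raw:
        for name in csv.DictReader(raw).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)

writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames)
writer.writeheader()
for filename in sys.argv[1:]:
    with open(filename, newline='') as raw:
        for row in csv.DictReader(raw):
            writer.writerow({name: row.get(name, '') for name in fieldnames})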

How to Remove Pipe character from a data field in a pipe delimited file

Experts, I have a simple pipe-delimited file from the source system which has a free-flow text field, and for one of the records I see that a "|" character is coming in as part of the data. This is breaking my file unevenly and it is not getting parsed into the correct number of fields. I want to replace the "|" in the data field with a "#".
Record coming in from the source system. There are a total of 9 fields in the file.
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
If you notice the 4th field - Free"flow|Text - this is the complete value from the source, which has a pipe in it.
I want to change it to Free"flow#Text and then read the file with a pipe delimiter.
Desired Outcome-
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
I tried few awk/sed combinations, but didn't get the desired output.
Thanks
Since you know there are 9 fields, and the 4th is a problem: take the first 3 fields and the last 5 fields and whatever is left over is the 4th field.
You did tag shell, so here's some bash: I'm sure the python equivalent is close:
line='OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
IFS='|'
read -ra fields <<<"$line"
first3=( "${fields[@]:0:3}" )
last5=( "${fields[@]: -5}" )
tmp=${line#"${first3[*]}$IFS"}     # remove the first 3 joined with pipe
field4=${tmp%"$IFS${last5[*]}"}    # remove the last 5 joined with pipe
data=( "${first3[@]}" "$field4" "${last5[@]}" )
newline="${first3[*]}$IFS${field4//$IFS/#}$IFS${last5[*]}"
# .......^^^^^^^^^^^^....^^^^^^^^^^^^^^^^^....^^^^^^^^^^^
printf "%s\n" "$line" "$newline"
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
With awk, it's simpler: if there are 10 fields, join fields 4 and 5, and shift the rest down one.
echo "$line" | awk '
    BEGIN { FS = OFS = "|" }
    NF == 10 {
        $4 = $4 "#" $5
        for (i=5; i<NF; i++)
            $i = $(i+1)
        NF--
    }
    1
'
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
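The same first-3/last-5 idea translates directly to Python; a small sketch using the sample record from the question:
# Sketch: keep the first 3 and last 5 fields, and treat whatever is left over
# as the 4th field, turning its pipes into '#'.
line = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
parts = line.split('|')
first3, last5 = parts[:3], parts[-5:]
field4 = '#'.join(parts[3:len(parts) - 5])
print('|'.join(first3 + [field4] + last5))
# OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456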
You tagged your question with Python so I assume a Python-based answer is acceptable.
I assume not all records in your file have the additional "|" in them; only some records have the "|" in the free-text column.
For a more realistic example, I create an input with some correct records and some erroneous records.
I use StringIO to simulate the file; in your environment, read the real file with open().
from io import StringIO
sample = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456\nOutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
infile = StringIO(sample)
outfile = StringIO()
for line in infile.readlines():
    cols = line.split("|")
    if len(cols) > 9:
        print(f"bad column {cols[3:5]}")
        line = "|".join(cols[:3]) + "|" + "#".join(cols[3:5]) + "|" + "|".join(cols[5:])
    outfile.write(line)
print("Corrected file:")
print(outfile.getvalue())
Results in:
> bad column ['Free"flow', 'Text']
> bad column ['Free"flow', 'Text']
> Corrected file:
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456

Split csv file vertically using command line

Is it possible to split a csv file vertically into multiple files? I know we can split a single large file into smaller files by a given number of rows using the command line. I have csv files in which the columns repeat after a certain column number, and I want to split those files column-wise. Is that possible with the command line? If not, how can we do it with python?
For example,
consider a sample in which site and address are present multiple times vertically; I want to create 3 different csv files, each containing a single site and a single address.
Any help would be highly appreciated,
Thanks
Assuming your input files is named ~/Downloads/sites.csv and looks like this:
Google,google.com,Google,google.com,Google,google.com
MS,microsoft.com,MS,microsoft.com,MS,microsoft.com
Apple,apple.com,Apple,apple.com,Apple,apple.com
You can use cut to create 3 files, each containing one pair of company/site:
cut -d "," -f 1-2 < ~/Downloads/sites.csv > file1.csv
cut -d "," -f 3-4 < ~/Downloads/sites.csv > file2.csv
cut -d "," -f 5-6 < ~/Downloads/sites.csv > file3.csv
Explanation:
For the cut command, we declare the comma (,) as the separator, which splits every line into a set of 'fields'.
We then specify, for each output file, which fields we want included.
HTH!
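Since the question also asks about python, the same pairwise split can be done with the csv module. A sketch that assumes the sites.csv layout shown above and writes file1.csv, file2.csv, ... one file per pair:
import csv

# Sketch: split each row into consecutive 2-column pairs, one output file per pair.
with open('sites.csv', newline='') as fin:
    writers = {}
    files = {}
    for row in csv.reader(fin):
        for j in range(0, len(row), 2):
            idx = j // 2 + 1
            if idx not in writers:
                files[idx] = open('file{}.csv'.format(idx), 'w', newline='')
                writers[idx] = csv.writer(files[idx])
            writers[idx].writerow(row[j:j + 2])
    for f in files.values():
        f.close()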
If the site-address pairs are regularly repeated, how about:
awk '{
    n = split($0, ary, ",");
    for (i = 1; i <= n; i += 2) {
        j = (i + 1) / 2;
        print ary[i] "," ary[i+1] >> "file" j ".csv";
    }
}' input.csv
The following script produces what you want (based on the SO answer, adjusted for your needs: number of columns, field separator). It splits the original file vertically into 2-column chunks (note n=2) and creates 3 different files (tmp.examples.1, tmp.examples.2, tmp.examples.3, or whatever you specify for the f variable):
awk -F "," -v f="tmp.examples" '{for (i=1; i<=NF; i++) printf (i%n==0||i==NF)?$i RS:$i FS > f "." int((i-1)/n+1) }' n=2 example.txt
If your example.txt file has the following data:
site,address,site,address,site,address
Google,google.com,MS,microsoft.com,Apple,apple.com

Merge two lines generated from contigs.fa

I have a file generated by assemblers. It looks like the following.
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAAC
CAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTAT
ACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGA
ACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTAT
TCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTG
TCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGT
CTTCC
I want to merge the lines using python or a linux sed command, and I want the result this way:
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
i.e. every sequence should be on a single line, and the node name on its own line.
A small pipe of tr and sed would do this:
$ tr -d '\n' < contigser.fa | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > newfile.fa
In python:
file = open('contigser.fa', 'r+')
lines = file.read().splitlines()
file.seek(0)
file.truncate()
for line in lines:
    if line.startswith('>'):
        file.write('\n' + line + '\n')
    else:
        file.write(line)
file.close()
Note: the python solution stores the changes back to contigser.fa.
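If rewriting contigser.fa in place is not wanted, here is a streaming sketch of the same idea that reads from stdin and writes to stdout (how you name and redirect the script is up to you):
import sys

first = True
for line in sys.stdin:
    line = line.rstrip('\n')
    if line.startswith('>'):
        # put each header on its own line, without a leading blank line
        sys.stdout.write(('' if first else '\n') + line + '\n')
    else:
        # glue sequence fragments together on one line
        sys.stdout.write(line)
    first = False
if not first:
    sys.stdout.write('\n')   # final newline after the last sequence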
You can use awk to do the job:
awk < input_file '/^>/ {print ""; print; next} {printf "%s", $0} END {print ""}'
This only starts one process (awk). Only drawback: it adds an empty first line. You can avoid such things by adding a state variable (the code belongs on one line; it is split here just to make it more readable):
awk < input_file '/^>/ { if (flag) print ""; print; flag=0; next }
{ printf "%s", $0; flag=1 } END { if (flag) print "" }'
#how to store it in a new file:
awk < input_file > output_file '/^>/ { .... }'
$ awk '/^>/{printf "%s%s\n",(NR>1?ORS:""),$0; next} {printf "%s",$0} END{print ""}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
$ awk 'NR==1;ORS="";{sub(/>.*$/,"\n&\n");print (NR>1)?$0:""}END{print"\n"}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
This might work for you (GNU sed):
sed '/^>/n;:a;$!N;s/\n\([^>]\)/\1/;ta;P;D' file
Following a line beginning with >, delete any newlines that precede any character other than a > symbol.

Summing up two columns the Unix way

# To fix the symptom
How can you sum up the following columns effectively?
Column 1
1
3
3
...
Column 2
2323
343
232
...
This should give me
Expected result
2324
346
235
...
I have the columns in two files.
# Initial situation
I sometimes use too many curly brackets, such that I have used one more { than } in my files.
I am trying to find where I have used the one unnecessary curly bracket.
I have used the following steps in getting the data
Find commands
find . * -exec grep '{' {} + > /tmp/1
find . * -exec grep '}' {} + > /tmp/2
AWK commands
awk -F: '{ print $2 }' /tmp/1 > /tmp/11
awk -F: '{ print $2 }' /tmp/2 > /tmp/22
The column are in the files /tmp/11 and /tmp/22.
I repeat a lot of similar commands in my procedure.
This suggests to me that this is not the right way.
Please suggest any way, such as Python, Perl or any Unix tool, that can decrease the number of steps.
If c1 and c2 are your files, you can do this:
$ paste c1 c2 | awk '{print $1 + $2}'
Or (without AWK):
$ paste c1 c2 | while read i j; do echo $(($i+$j)); done
Using python:
totals = [ int(i)+int(j) for i, j in zip ( open(fname1), open(fname2) ) ]
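A slightly fuller sketch that prints each sum as it is computed, using the /tmp/11 and /tmp/22 files from the question:
# Sketch: read both column files in parallel and print the line-by-line sums.
fname1, fname2 = '/tmp/11', '/tmp/22'
with open(fname1) as f1, open(fname2) as f2:
    for a, b in zip(f1, f2):
        print(int(a) + int(b))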
You can avoid the intermediate steps by just using a command that does the counts and the comparison at the same time:
find . -type f -exec perl -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g' {} \;
This calls the Perl program once per file; the Perl program counts the number of each type of curly brace and prints the name of the file if the counts don't match.
You must be careful with the /([}{])/ section: find will think it needs to do the replacement on {} if you say /([{}])/.
WARNING: this code will have false positives and negatives if you are trying to run it against source code. Consider the following cases:
balanced, but curlies in strings:
if ($s eq '{') {
print "I saw a {\n"
}
unbalanced, but curlies in strings:
while (1) {
print "}";
You can expand the Perl command by using B::Deparse:
perl -MO=Deparse -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g'
Which results in:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    sub END {
        print $ARGV if $h{'{'} != $h{'}'};
    }
    ;
    ++$h{$_} foreach (/([}{])/g);
}
We can now look at each piece of the program:
BEGIN { $/ = "\n"; $\ = "\n"; }
This is caused by the -l option. It sets both the input and output record separators to "\n". This means anything read in will be broken into records based on "\n" and any print statement will have "\n" appended to it.
LINE: while (defined($_ = <ARGV>)) {
}
This is created by the -n option. It loops over every file passed in via the commandline (or STDIN if no files are passed) reading each line of those files. This also happens to set $ARGV to the last file read by <ARGV>.
chomp $_;
This removes whatever is in the $/ variable from the line that was just read ($_), it does nothing useful here. It was caused by the -l option.
sub END {
print $ARGV if $h{'{'} != $h{'}'};
}
This is an END block; this code will run at the end of the program. It prints $ARGV (the name of the file last read from, see above) if the values stored in %h associated with the keys '{' and '}' are not equal.
++$h{$_} foreach (/([}{])/g);
This needs to be broken down further:
/
    (     # begin capture
    [}{]  # match any of the '}' or '{' characters
    )     # end capture
/gx
Is a regex that returns a list of '{' and '}' characters that are in the string being matched. Since no string was specified the $_ variable (which holds the line last read from the file, see above) will be matched against. That list is fed into the foreach statement which then runs the statement it is in front of for each item (hence the name) in the list. It also sets $_ (as you can see $_ is a popular variable in Perl) to be the item from the list.
++$h{$_}
This line increments the value in %h associated with $_ (which will be either '{' or '}', see above) by one.
In Python (or Perl, Awk, &c) you can reasonably do it in a single stand-alone "pass" -- I'm not sure what you mean by "too many curly brackets", but you can surely count curly use per file. For example (unless you have to worry about multi-GB files), the 10 files using most curly braces:
import heapq
import os

curliest = dict()
for path, dirs, files in os.walk('.'):
    for afile in files:
        fn = os.path.join(path, afile)
        with open(fn) as f:
            data = f.read()
        braces = data.count('{') + data.count('}')
        curliest[fn] = braces
top10 = heapq.nlargest(10, curliest, key=curliest.get)
top10.sort(key=curliest.get)
for fn in top10:
    print('%6d %s' % (curliest[fn], fn))
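If the goal is specifically what the Perl one-liner above does, i.e. listing the files whose '{' and '}' counts differ, a Python sketch along the same lines:
import os

# Print the name of every file under the current directory whose '{' and '}'
# counts do not match (same check as the Perl one-liner above).
for path, dirs, files in os.walk('.'):
    for afile in files:
        fn = os.path.join(path, afile)
        with open(fn, errors='replace') as f:   # errors='replace' tolerates binary files
            data = f.read()
        if data.count('{') != data.count('}'):
            print(fn)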
Reply to Lutz's answer
My problem was finally solved by this command:
paste -d: /tmp/1 /tmp/2 | awk -F: '{ print $1 "\t" $2 - $4 }'
Your problem can be solved with just one awk command...
awk '{getline i<"file1";print i+$0}' file2
