what is the efficient way to change huge file in place - python

I have a huge file (~2,000,000 lines) and I am trying to replace a few different patterns while reading the file only once.
So I am guessing sed is not a good fit, since I have several different patterns.
I tried to use awk with if/else, but the file is not changed.
#!/usr/bin/awk -f
{
    if ($0 ~ /data for AAA/)
    {
        sub(/^[0-9]+$/, "bla_AAA", $2)
    }
    if ($0 ~ /data for BBB/)
    {
        sub(/^[0-9]+$/, "bla_BBB", $2)
    }
}
I expect the output of
address 01000 data for AAA
....
address 02000 data for BBB
....
to be
address bla_AAA data for AAA
....
address bla_BBB data for BBB
....

I don't see any indication in your question that your file really is large: 2,000,000 lines is nothing and each sample line in your question is short, so chances are this is all you need:
awk '
/data for AAA/ { $2 = "bla_AAA" }
/data for BBB/ { $2 = "bla_BBB" }
{ print }
' file > tmp && mv tmp file
GNU awk has a -i inplace option to do the same kind of "inplace" editing that sed, perl, etc. do (i.e. with a tmp file being used internally).
If you really didn't have enough storage to create a copy of the input file then you could use something like this (untested!):
headLines=10000
beg=1
tmp=$(mktemp) || exit 1
while [ -s file ]; do
    head -n "$headLines" file | awk 'above script' >> "$tmp" &&
        headBytes=$(head -n "$headLines" file | wc -c) &&
        dd if=file bs="$headBytes" skip=1 conv=notrunc of=file &&
        truncate -s "-$headBytes" file
    rslt=$?
done
(( rslt == 0 )) && mv "$tmp" file
so you're never using up more storage than the size of your input file plus headLines lines (massage that number to suit). See https://stackoverflow.com/a/17331179/1745001 for info on what truncate and the 2 lines before it are doing.

Something like this:
(Read a line, do the text manipulation, write the modified data to output file)
with open('in.txt') as f_in:
    with open('out.txt', 'w') as f_out:
        line = f_in.readline().strip()
        while line:
            fields = line.split(' ')
            fields[1] = 'bla_{}'.format(fields[4])
            f_out.write(' '.join(fields) + '\n')
            line = f_in.readline().strip()
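If the goal is for the result to end up back in the original file, one common pattern is to stream through a temporary file in the same directory and atomically swap it in at the end. Below is a minimal sketch of that idea; the pattern-to-replacement mapping and the input file name are assumptions for illustration, not part of the original question:
import os
import re
import tempfile

# Hypothetical mapping of regex patterns to replacement tokens (illustrative only).
replacements = {
    re.compile(r'data for AAA'): 'bla_AAA',
    re.compile(r'data for BBB'): 'bla_BBB',
}

src = 'in.txt'  # assumed input file name

# Write to a temp file in the same directory so os.replace() stays on one filesystem.
fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(src)))
with open(src) as f_in, os.fdopen(fd, 'w') as f_out:
    for line in f_in:
        fields = line.rstrip('\n').split(' ')
        for pattern, token in replacements.items():
            if pattern.search(line):
                fields[1] = token  # the second field holds the number being replaced
                break
        f_out.write(' '.join(fields) + '\n')

os.replace(tmp_path, src)  # atomic rename; the original is only replaced on success
Like the tmp-file approach in the awk answer above, this still needs enough free space for a second copy of the file while it runs.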

Related

Python csv merge multiple files with different columns

I hope somebody can help me with this issue.
I have about 20 csv files (each file with its own headers), and each of these files has hundreds of columns.
My problem is related to merging those files, because a couple of them have extra columns.
I was wondering if there is an option to merge all those files in one adding all the new columns with related data without corrupting the other files.
So far I used the awk terminal command:
awk '(NR == 1) || (FNR > 1)' *.csv > file.csv
to merge them, removing the headers from all the files except the first one.
I got this from my previous question
Merge multiple csv files into one
But this does not solve the issue with the extra column.
EDIT:
Here are some of the csv files in plain text, with their headers.
file 1
"#timestamp","#version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file2
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file3
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file4
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Those are 4 of the 20 files; I included all the headers but no rows because they contain sensitive data.
When I run the script on those files, I can see that it writes the timestamp value. But when I run it against the original files (with a lot of data), all it does is write the header and that's it. Please let me know if you need more info.
Once I run the script on the original files, this is what I get back:
There are 20 rows (one for each file), but it doesn't write the content of each file. Could this be related to the sniffing of the first line? I think it checks only the first line of each file and moves on, as in the script. So how is it that on a small file it manages to copy/merge the content as well?
Your question isn't clear: I don't know whether you really want a solution in awk or python (or either), and there is no sample input/output we can test with, so it's a guess, but is this what you're trying to do (using any awk in any shell on every Unix box)?
$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d
==> file2.csv <==
1,2,3
x,y,z
$ cat tst.awk
BEGIN {
    FS = OFS = ","
    for (i=1; i<ARGC; i++) {
        if ( (getline < ARGV[i]) > 0 ) {
            if ( NF > maxNF ) {
                maxNF = NF
                hdr = $0
            }
        }
    }
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }
$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
See http://awk.freeshell.org/AllAboutGetline for details on when/how to use getline and its associated caveats.
Alternatively with an assist from GNU head for -q:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    if ( NF > maxNF ) {
        maxNF = NF
        hdr = $0
    }
    next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }
$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.
awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv
I slightly refactored this to skip the unwanted lines with next rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.
You can easily print the number of columns in each file, and just note the maximum:
awk -F , 'FNR==1 { print NF, FILENAME }' *.csv
If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.
import csv
import sys

# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the header is longer than the current max, remember it
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after the first line, skip to the next file
            break

# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                continue  # skip each file's own header
            row.extend([''] * (fields - len(row)))
            writer.writerow(row)
This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in different order, you need a more complex solution (though not by much; the Python CSV DictReader subclass could do most of the heavy lifting).
Demo: https://ideone.com/S998l4
If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN block to read all the files before starting the main script.
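Picking up the DictReader idea mentioned above, a rough (untested) sketch of a union-of-headers merge could look like this; unlike the positional version it also copes with extra columns appearing in the middle or in a different order. The merged.csv output name is just a placeholder:
import csv
import sys

files = sys.argv[1:]

# First pass: collect the union of all header names, preserving first-seen order.
fieldnames = []
for filename in files:
    with open(filename, newline='') as f:
        for name in csv.DictReader(f).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: write every row, filling columns a file lacks with empty strings.
with open('merged.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for filename in files:
        with open(filename, newline='') as f:
            for row in csv.DictReader(f):
                writer.writerow(row)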

How to find matching rows of the first column and add quantities of the second column? Bash

I have a csv file that looks like this:
SKU,QTY
KA006-001,2
KA006-001,33
KA006-001,46
KA009-001,22
KA009-001,7
KA010-001,18
KA014-001,3
KA014-001,42
KA015-001,1
KA015-001,16
KA020-001,6
KA022-001,56
The first column is the SKU; the second column is the QTY number.
Some lines are identical (in the SKU column only).
I need to achieve the following:
SKU,QTY
KA006-001,81 (2+33+46)
KA009-001,29 (22+7)
KA010-001,18
KA014-001,45 (3+42)
so on...
I tried different things: loop statements and arrays. I got so lost it gave me a headache.
My code:
#!/bin/bash
while IFS=, read sku qty
do
    echo "SKU='$sku' QTY='$qty'"
    if [ "$sku" = "$sku" ]
    then
        #x=("$sku" != "$sku")
        for i in {0..3}; do echo $sku[$i]=$qty; done
    fi
done < 2asg.csv
I'd use awk:
awk -F, 'NR==1{print} NR>1{a[$1] += $2}END{for (i in a) print i","a[i]}' file
If you want to ignore blank lines, you can either ignore lines less than 2 columns:
awk -F, 'NR==1{print} NR>1 && NF>1{a[$1] += $2} END{for (i in a) print i","a[i]}' file
or ignore ones without exactly 2 columns:
awk -F, 'NR==1{print} NR>1 && NF==2{a[$1] += $2} END{for (i in a) print i","a[i]}' file
Alternatively, you can check to see that the second column begins with a digit:
awk -F, 'NR==1{print} NR>1 && $2~/^[0-9]/{a[$1] += $2} END{for (i in a) print i","a[i]}' file
For Bash 4:
#!/bin/bash
declare -A astr
while IFS=, read -r col1 col2
do
    if [ "$col1" != "SKU" ] && [ "$col1" != "" ]
    then
        (( astr[$col1] += col2 ))
    fi
done < 2asg.csv
echo "SKU,QTY"
for i in "${!astr[@]}"
do
    echo "$i,${astr[$i]}"
done | sort -t : -k 2n
https://github.com/tigertv/stackoverflow-answers
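For comparison, the same aggregation is also short in Python; a minimal sketch, assuming the same 2asg.csv input used above:
import csv

totals = {}  # dicts preserve insertion order in Python 3.7+, so SKUs keep file order

with open('2asg.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the SKU,QTY header
    for row in reader:
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # ignore blank or malformed lines
        totals[row[0]] = totals.get(row[0], 0) + int(row[1])

print("SKU,QTY")
for sku, qty in totals.items():
    print("{},{}".format(sku, qty))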

Filter a smaller file using another huge file

I have a huge csv file with about 10^9 lines where each line has a pair of ids such as:
IDa,IDb
IDb,IDa
IDc,IDd
Call this file1. I have another much smaller csv file with about 10^6 lines in the same format. Call this file2.
I want to simply find the lines in file2 which contain at least one ID that exists somewhere in file1.
Is there a fast way to do this? I don't mind if it is in awk, python or perl.
$ cat > file2 # make test file2
IDb,IDa
$ awk -F, 'NR==FNR{a[$1];a[$2];next} ($1 in a&&++a[$1]==1){print $1} ($2 in a&&++a[$2]==1){print $2}' file2 file1 > file3
$ cat file3 # file2 ids in file1 put to file3
IDa
IDb
$ awk -F, 'NR==FNR{a[$1];next} ($1 in a)||($2 in a){print $0}' file3 file2
IDb,IDa
I would actually use sqlite for something like that. You could create a new database in the same directory as the two files with sqlite3 test.sqlite and then do something like this:
create table file1(id1, id2);
create table file2(id1, id2);
.separator ","
.import file1.csv file1
.import file2.csv file2
WITH all_ids AS (
SELECT id1 FROM file1 UNION SELECT id2 FROM file1
)
SELECT * FROM file2 WHERE id1 IN all_ids OR id2 IN all_ids;
The advantage of using sqlite is that you can manage the memory more intelligently than a simple script that you could write in some scripting language.
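If you would rather drive that from Python, the standard-library sqlite3 module can do the same thing. Here is a rough sketch, assuming both inputs are plain two-column CSV files named file1.csv and file2.csv as above:
import csv
import sqlite3

conn = sqlite3.connect('test.sqlite')  # an on-disk database keeps memory use bounded
conn.execute('CREATE TABLE file1 (id1 TEXT, id2 TEXT)')
conn.execute('CREATE TABLE file2 (id1 TEXT, id2 TEXT)')

# Bulk-load both CSV files into their tables.
for table, path in (('file1', 'file1.csv'), ('file2', 'file2.csv')):
    with open(path, newline='') as f:
        conn.executemany('INSERT INTO {} VALUES (?, ?)'.format(table), csv.reader(f))
conn.commit()

query = """
    WITH all_ids AS (
        SELECT id1 FROM file1 UNION SELECT id2 FROM file1
    )
    SELECT * FROM file2
    WHERE id1 IN (SELECT id1 FROM all_ids) OR id2 IN (SELECT id1 FROM all_ids)
"""
for id1, id2 in conn.execute(query):
    print('{},{}'.format(id1, id2))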
Using these input files for testing:
$ cat file1
IDa,IDb
IDb,IDa
IDc,IDd
$ cat file2
IDd,IDw
IDx,IDc
IDy,IDz
If file1 can fit in memory:
$ awk -F, 'NR==FNR{a[$1];a[$2];next} ($1 in a) || ($2 in a)' file1 file2
IDd,IDw
IDx,IDc
If not but file2 can fit in memory:
$ awk -F, '
ARGIND==2 {
    if ($1 in inBothFiles) {
        inBothFiles[$1] = 1
    }
    if ($2 in inBothFiles) {
        inBothFiles[$2] = 1
    }
    next
}
ARGIND==1 {
    inBothFiles[$1] = 0
    inBothFiles[$2] = 0
    next
}
ARGIND==3 {
    if (inBothFiles[$1] || inBothFiles[$2]) {
        print
    }
}
' file2 file1 file2
IDd,IDw
IDx,IDc
The above uses GNU awk for ARGIND - with other awks just add a FNR==1{ARGIND++} block at the start.
I have the ARGIND==2 block (i.e. the part that processes the 2nd argument which in this case is the 10^9 file1) listed first for efficiency so we don't unnecessarily test ARGIND==1 for every line in the much larger file.
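The "file1 fits in memory" case translates directly to Python as well: build a set of every ID seen in file1 and stream file2 against it. A minimal sketch, assuming the two-column layout shown above and that the set of IDs fits in RAM:
import csv

# Collect every ID that appears anywhere in file1.
ids = set()
with open('file1', newline='') as f:
    for row in csv.reader(f):
        ids.update(row)

# Stream file2 and keep the lines that share at least one ID with file1.
with open('file2', newline='') as f:
    for row in csv.reader(f):
        if any(field in ids for field in row):
            print(','.join(row))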
In perl,
use strict;
use warnings;
use autodie;

# read file2
open my $file2, '<', 'file2';
chomp( my @file2 = <$file2> );
close $file2;

# record file2 line numbers each id is found on
my %id;
for my $line_number (0..$#file2) {
    for my $id ( split /,/, $file2[$line_number] ) {
        push @{ $id{$id} }, $line_number;
    }
}

# look for those ids in file1
my @use_line;
open my $file1, '<', 'file1';
while ( my $line = <$file1> ) {
    chomp $line;
    for my $id ( split /,/, $line ) {
        if ( exists $id{$id} ) {
            @use_line[ @{ $id{$id} } ] = @{ $id{$id} };
        }
    }
}
close $file1;

# print lines whose ids were found
print "$_\n" for @file2[ grep defined, @use_line ];
Sample files:
cat f1
IDa,IDb
IDb,IDa
IDc,IDd
cat f2
IDt,IDy
IDb,IDj
Awk solution:
awk -F, 'NR==FNR {a[$1]=$1;b[$2]=$2;next} ($1 in a)||($2 in b)' f1 f2
IDb,IDj
This stores the first and second columns of file1 in arrays a and b, then prints a line from the second file if either its first or second column has been seen.

Merge two lines generated from contigs.fa

I have a file generated by assemblers. It looks like the following.
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAAC
CAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTAT
ACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGA
ACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTAT
TCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTG
TCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGT
CTTCC
I want to merge the lines using python or the linux sed command, and I want the result to look like this:
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
i.e. every sequence should be on a single line, with the node name on its own line.
A small pipe of tr and sed would do this:
$ tr -d '\n' < contigser.fa | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > newfile.fa
In python:
file = open('contigser.fa', 'r+')
lines = file.read().splitlines()
file.seek(0)
file.truncate()
for line in lines:
    if line.startswith('>'):
        file.write('\n' + line + '\n')
    else:
        file.write(line)
file.close()
Note: the python solution stores the changes back to contigser.fa.
You can use awk to do the job:
awk < input_file '/^>/ {print ""; print; next} {printf "%s", $0} END {print ""}'
This only starts one process (awk). Only drawback: it adds an empty first line. You can avoid that by adding a state variable (the code belongs on one line; it is split here just for readability):
awk < input_file '/^>/ { if (flag) print ""; print; flag=0; next }
{ printf "%s", $0; flag=1 } END { if (flag) print "" }'
#how to store it in a new file:
awk < input_file > output_file '/^>/ { .... }'
$ awk '/^>/{printf "%s%s\n",(NR>1?ORS:""),$0; next} {printf "%s",$0} END{print ""}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
$ awk 'NR==1;ORS="";{sub(/>.*$/,"\n&\n");print (NR>1)?$0:""}END{print"\n"}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
This might work for you (GNU sed):
sed '/^>/n;:a;$!N;s/\n\([^>]\)/\1/;ta;P;D' file
Following a line beginning with >, delete any newlines that precede any character other than a > symbol.
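If the assembly file is too big to slurp into memory (the python answer above reads the whole file at once), the same merge can be done as a stream writing to a new file. A minimal sketch, reusing the contigser.fa/newfile.fa names from above:
with open('contigser.fa') as f_in, open('newfile.fa', 'w') as f_out:
    first = True
    for line in f_in:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if not first:
                f_out.write('\n')  # terminate the previous record's sequence line
            f_out.write(line + '\n')
            first = False
        else:
            f_out.write(line)  # append sequence chunks without newlines in between
    if not first:
        f_out.write('\n')  # terminate the final sequence line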

splitting file into smaller files by number of fields

I'm having a hard time breaking a large (50GB) csv file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes, others are integers, decimals and booleans.
I want to parse the file line by line and split it by the number of fields in each row. The strings may contain several commas (such as in the example below), as well as a number of empty fields.
,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,
I tried using
perl -pe' s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge ' file >> file2
to change the commas inside the quotes to | but that didn't work. I plan to use
awk -F"|" conditional statement appending to new k_fld_files file2
Is there an easier way to do this please? I'm looking at python, but I probably need a utility that will stream process the file, line by line.
Using Python - if you just want to parse CSV including embedded delimiters, and stream out with a new delimiter, then something such as:
import csv
import sys
with open('filename.csv') as fin:
    csvout = csv.writer(sys.stdout, delimiter='|')
    for row in csv.reader(fin):
        csvout.writerow(row)
Otherwise, it's not much more difficult to make this do all kinds of stuff.
Example of outputting to files per column (untested):
cols_to_output = {}
for row in csv.reader(fin):
    for colno, col in enumerate(row):
        # open one output file per column number, on first use
        output_to = cols_to_output.setdefault(colno, open('column_output.{}'.format(colno), 'w'))
        csv.writer(output_to).writerow(row)
for outfile in cols_to_output.values():
    outfile.close()
Here's an awk alternative.
Assuming the quoted strings are well formatted, i.e. always have starting and terminating quotes, and no quotes within other quotes, you could do the replacement you suggested by doing a gsub on every other field replacing , with |.
With pipes
Below is an example of how this might go when grabbing columns 3 through 6, 11 and 14-15 with coreutils cut:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(",", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
print
}'\
| cut -d , -f 3-6,11,14-15 \
| awk -F'"' -v OFS='' -e '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub("\\|", ",", $i)
$i = FS $i FS; # reinsert the quotes
}
print
}'
Note that there is an additional post-processing step that reverts the | to ,.
Entirely in awk
Alternatively, you could do the whole thing in awk with some loss of generality with regards to range specification. Here we only grab columns 3 to 6:
extract.awk
BEGIN {
    OFS = ""
    start = 3
    end = 6
}
{
    for(i=2; i<=NF; i+=2) {
        gsub(",", "|", $i)
        $i = FS $i FS
    }
    split($0, record, ",")
    for(i=start; i<=end-1; i++) {
        gsub("\\|", ",", record[i])
        printf("%s,", record[i])
    }
    gsub("\\|", ",", record[end])
    printf("%s\n", record[end])
}
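And if the goal is the split the question actually asks for (append each row to a file named after its field count), Python's csv module already handles the quoted, comma-containing strings, so no | juggling is needed. A minimal sketch; the input name and the k_fld-style output naming are assumptions for illustration:
import csv

files = {}    # one open output file per field count
writers = {}  # matching csv writer per field count

with open('filename.csv', newline='') as fin:  # assumed input name
    for row in csv.reader(fin):
        k = len(row)
        if k not in files:
            files[k] = open('{}_fld_file.csv'.format(k), 'w', newline='')
            writers[k] = csv.writer(files[k])
        writers[k].writerow(row)

for f in files.values():
    f.close()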
