I am trying to replace some text in a file using either SED, PERL, AWK or a python script. I've tried a couple of things but can't seem to work it out.
I have the following in a text file called data.txt
&st=ALPHA&type=rec&uniId=JIM&acceptCode=123&drainNel=supp&
&st=ALPHA&type=rec&uniId=JIM&acceptCode=167&drainNel=supp&
&st=ALPHA&type=rec&uniId=SARA&acceptCode=231&drainNel=ured&
&st=ALPHA&type=rec&uniId=SARA&acceptCode=344&drainNel=iris&
&st=ALPHA&type=rec&uniId=SARA&acceptCode=349&drainNel=iris&
&st=ALPHA&type=rec&uniId=DAVE&acceptCode=201&drainNel=teef&
1) Script will take an input argument in the form of a number, e.g: 10000
2) I want to replace all the text ALPHA with the given long number as arg and increment by 100 for e.g. if uniId is the same. If it is different it will increment by 5000 for e.g.
3) I want to replace all the acceptCode to change to the first st for all lines with the same uniId
./script 10000
.. still confused? Well, the final result could be this:
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
This ^ should be REPLACED and applied to file data.txt - not just print on screen.
Okay, here's one way, using awk (wrapped in a shell script for convenience because it's a bit too much for a one-liner):
#!/bin/sh
# Usage:
# $./transform.sh [STARTCOUNT] < data.txt > temp.txt
# $ mv -f temp.txt data.txt
awk -F '&' -v "cnt=${1:-10000}" -v 'OFS=&' \
'NR == 1 { ac = cnt; uni = $4; }
NR > 1 && $4 == uni { cnt += 100 }
$4 != uni { cnt += 5000; ac = cnt; uni = $4 }
{ $2 = "st=" cnt; $5 = "acceptCode=" ac; print }'
Running this on a file holding your sample input:
$ ./transform.sh 10000 < data.txt
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
And a perl version that does an in-place edit of the input file:
#!/usr/bin/perl -ani -F'&'
# Usage:
# $ ./transform.pl COUNT datafile
use warnings;
use strict;
use English;
our ($count, $a, $uni);
BEGIN {
$count = shift #ARGV;
die "Missing count argument" unless defined $count and $count =~ /^\d+$/;
$ac = $count;
$uni = "";
$OFS = '&';
}
if ($NR == 1) {
$uni = $F[3];
} elsif ($uni ne $F[3]) {
$count += 5000;
$ac = $count;
$uni = $F[3];
} else {
$count += 100;
}
$F[1] = "st=$count";
$F[4] = "acceptCode=$ac";
print #F;
Running it on your sample input:
$ ./transform.pl 10000 data.txt
$ cat data.txt
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
A few assumptions
Your requirement 2) I want to replace all the text ALPHA with the given long number as arg and increment by 100 for e.g. if uniId is the same. If it is different it will increment by 5000 for e.g._ in conjunction with your example output requires your input data to be sorted on the uniId field. If the file is not sorted, the 100 increments and the 5000 increments will not yield the desired initial values for each uniId
The increment scheme assumes that no one uniId value will have enough records to increment into the next 5000 range set for newly identified uniId values.
#!/usr/bin/env python3
from collections import OrderedDict
import csv
import sys
class TrackingVars(object):
"""
The TrackingVars class manages the business logic for maintaining the
st field counters and the acctCode values for each uniId
"""
def __init__(self, long_number):
self.uniId_table = {}
self.running_counter = long_number
def __initial_value__(self):
"""
The first encounter for a uniId will have st = acctCode
"""
retval = (self.running_counter, self.running_counter)
return retval
def get_uniId(self, id):
"""
A convenience method for returning uniId tracking values
"""
curval, original_value = self.uniId_table.get(id, self.__initial_value__())
return (curval, original_value)
def track(self, uniId):
"""
curval = original_value when a new uniId is encountered.
If the uniId is known, simply increment curval by 100
if the uniId is new and there is at least 1 key in the
tracking table increment curval by 5000
always update tracking variables
"""
curval, original_value = self.get_uniId(uniId)
if uniId in self.uniId_table.keys():
curval = curval + 100
else:
if self.uniId_table:
curval = curval + 5000
original_value = curval
self.running_counter = curval
retval = (curval, original_value)
self.uniId_table[uniId] = retval
return retval
def data_lines(filename):
"""
Read file as input delimited by &
"""
with open(filename, "r", newline=None) as fin:
csvin = csv.reader(fin, delimiter="&")
for row in csvin:
yield row
def transform_data_line(line):
"""
Transform data into key, values pairs
leading and traling & have no valid key, value pairs
"""
head = ("head", None)
tail = ("tail", None)
items = [head]
for field in line[1:-1]:
key, value = field.split("=")
items.append([key, value])
retval = OrderedDict(items)
retval["tail"] = tail
return retval
def process_data_line(record, text_to_replace, tracking_vars):
"""
if st value is ALPHA update record with tracking variables
"""
st = record.get("st")
if st is not None:
if st == text_to_replace:
uniId = record.get("uniId")
curval, original_value = tracking_vars.track(uniId)
record["st"] = curval
record["acceptCode"] = original_value
return record
def process_file():
"""
Get the long number from the command line input.
Initialize the tracking variables.
Process each row of the file.
"""
long_number = sys.argv[1]
tracking_vars = TrackingVars(int(long_number))
for row in data_lines("data.txt"):
record = transform_data_line(row)
retval = process_data_line(record, "ALPHA", tracking_vars)
yield retval
def write(iter_in, filename_out):
"""
Write each row from the iterator to the csv.
make sure the first and last fields are empty.
"""
with open(filename_out, "w", newline=None) as fout:
csvout = csv.writer(fout, delimiter="&")
for row in iter_in:
encoded_row = ["{0}={1}".format(k, v) for k, v in row.items()]
encoded_row[0]=""
encoded_row[-1]=""
csvout.writerow(encoded_row)
if __name__ == "__main__":
write(process_file(), "data.new.txt")
Output
$cat data.net.txt
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
Conclusion
Only you know why the business rules for the incrementing number scheme are the way they are. However having a control break on uniId and the st value dependent upon the previous uniId increment seems problematic to me. You could process unsorted files if each new uniId encountered would start at a new 5000 boundary. For example 15000, 2000, 25000, etc.
P.S
I love the AWK and Perl answers. They are simple and straight forward. They answer the question exactly as it was posed. Now all we need is a SED example :)
just bit more efficient control, in one line gnu awk:
awk -F\& -vi=10000 -vOFS=\& '{if(NR==1) { ac=i; u=$4; } else { if($4==u) i+=100; else { i+=5000; ac=i; u=$4; } }; $2="st=" i; $5 =gensub(/[0-9]+/,ac,1,$5); print } ' data.txt
accept any various string on 5th field.. Thank Shawn.
Related
I have to write a program that correlates smoking with lung cancer risk. For that I have data in two files.
My code is computing the data given in the same lines (eg:America,23.3 with Spain,77.9 and
Italy,24.2 with Russia,60.8)
How to modify my code so that it computes the numbers of the same countries and leaves out the countries that occur only in one file (it shouldn't compute Germany, France, China, Korea because they are only in one file)
Thank you so much for your help in advance:)
smoking file:
Country, Percent Cigarette Smokers Data
America,23.3
Italy,24.2
Russia,23.7
France,14.9
England,17.9
Spain,17
Germany,21.7
second file:
Cases Lung Cancer per 100000
Spain,77.9
Russia,60.8
Korea,61.3
America,73.3
China,66.8
Vietnam,64.5
Italy,43.9
and my code:
def readFiles(smoking_datafile, cancer_datafile):
'''
Reads the data from the provided file objects smoking_datafile
and cancer_datafile. Returns a list of the data read from each
in a tuple of the form (smoking_datafile, cancer_datafile).
'''
# init
smoking_data = []
cancer_data = []
empty_str = ''
# read past file headers
smoking_datafile.readline()
cancer_datafile.readline()
# read data files
eof = False
while not eof:
# read line of data from each file
s_line = smoking_datafile.readline()
c_line = cancer_datafile.readline()
# check if at end-of-file of both files
if s_line == empty_str and c_line == empty_str:
eof = True
# check if end of smoking data file only
elif s_line == empty_str:
raise OSError('Unexpected end-of-file for smoking data file')
# check if at end of cancer data file only
elif c_line == empty_str:
raise OSError('Unexpected end-of-file for cancer data file')
# append line of data to each list
else:
smoking_data.append(s_line.strip().split(','))
cancer_data.append(c_line.strip().split(','))
# return list of data from each file
return (smoking_data, cancer_data)
def calculateCorrelation(smoking_data, cancer_data):
'''
Calculates and returns the correlation value for the data
provided in lists smoking_data and cancer_data
'''
# init
sum_smoking_vals = sum_cancer_vals = 0
sum_smoking_sqrd = sum_cancer_sqrd = 0
sum_products = 0
# calculate intermediate correlation values
num_values = len(smoking_data)
for k in range(0,num_values):
sum_smoking_vals = sum_smoking_vals + float(smoking_data[k][1])
sum_cancer_vals = sum_cancer_vals + float(cancer_data[k][1])
sum_smoking_sqrd = sum_smoking_sqrd + \
float(smoking_data[k][1]) ** 2
sum_cancer_sqrd = sum_cancer_sqrd + \
float(cancer_data[k][1]) ** 2
sum_products = sum_products + float(smoking_data[k][1]) * \
float(cancer_data[k][1])
# calculate and display correlation value
numer = (num_values * sum_products) - \
(sum_smoking_vals * sum_cancer_vals)
denom = math.sqrt(abs( \
((num_values * sum_smoking_sqrd) - (sum_smoking_vals ** 2)) * \
((num_values * sum_cancer_sqrd) - (sum_cancer_vals ** 2)) \
))
return numer / denom
Let's just focus on getting the data into a format that is easy to work with. The code below will get you a dictionary of the form ...
smokers_cancer_data = {
'America': {
'smokers': '23.3',
'cancer': '73.3'
},
'Italy': {
'smokers': '24.2',
'cancer': '43.9'
},
...
}
Once you have this you can get any values you need and perform your calculations. See the code below.
def read_data(filename: str) -> dict:
with open(filename, 'r') as file:
next(file) # Skip the header
data = dict();
for line in file:
cleaned_line = line.rstrip()
# Skip blank lines
if cleaned_line:
data_item = (cleaned_line.split(','))
data[data_item[0]] = float(data_item[1])
return data
# Load data into python dictionaries
smokers_data = read_data('smokersData.txt')
cancer_data = read_data('lungCancerData.txt')
# Build one dictionary that is easy to work with
smokers_cancer_data = dict()
for (key, value) in smokers_data.items():
if key in cancer_data:
smokers_cancer_data[key] = {
'smokers': smokers_data[key],
'cancer' : cancer_data[key]
}
print(smokers_cancer_data)
For example, if you want to calculate the sum of the smoker and cancer values.
smokers_total = 0
cancer_total = 0
for (key, value) in smokers_cancer_data.items():
smokers_total += value['smokers']
cancer_total += value['cancer']
This will return a list of all the countries that have datas, along with the data:
l3 = []
with open('smoking.txt','r') as f1, open('cancer.txt','r') as f2:
l1, l2 = f1.readlines(), f2.readlines()
for s1 in l1:
for s2 in l2:
if s1.split(',')[0] == s2.split(',')[0]:
cty = s1.split(',')[0]
smk = s1.split(',')[1].strip()
cnr = s2.split(',')[1].strip()
l3.append(f"{cty}: smoking: {smk}, cancer: {cnr}")
print(l3)
Output:
['Spain: smoking: 77.9, cancer: 17', 'Russia: smoking: 60.8, cancer: 23.7', 'America: smoking: 73.3, cancer: 23.3', 'Italy: smoking: 43.9, cancer24.2']
After doing the diff between the two lists, the common lines are being shown. But My requirement is, when i compare two lists, the common lines should not to be displayed (come in the diff). Can you give some idea how to suppress the same.
difflist.py
-----------
import difflib
def main():
rawfromlines = open('file1.sql', 'r').readlines()
tolines = open('file2.sql', 'r').readlines()
list_f1 = []
list_f2 = []
for f1 in rawfromlines:
for part in f1.replace('\n','').split(','):
list_f1.append(part)
for f2 in tolines:
for part in f2.replace('\n','').split(','):
list_f2.append(part)
targetfile = open('diff_of_files.sql', 'w')
differ = difflib.Differ()
diffs = list(differ.compare(list_f1, list_f2))
for i in range(0,len(diffs)):
print diffs[i]
file1.sql
----------
CREATE TABLE SALARY
(
SALARY int
);
CREATE TABLE JOB1
(
EMP1 int
);
file2.sql
---------
CREATE TABLE SALARY
(
EMPNAME VARCHAR2(255)
SALARY int
);
CREATE TABLE JOB1
(
EMP1 int
);
Actual Output
---------------
CREATE TABLE SALARY
(
+ EMPNAME VARCHAR2(255)
SALARY int
);
CREATE TABLE JOB1
(
EMP1 int
);
Expected Output
---------------
CREATE TABLE SALARY
(
+ EMPNAME VARCHAR2(255)
SALARY int
);
The common lines are not present.
Unfortunately, the Differ does not have any functionality to print a context like the HtmlDiff does. However, you can easily build something like this yourself by keeping a buffer of context lines. Something like this:
def print_with_context (diff, context = 3):
buf = []
print_more = 0
for line in diff:
if line.startswith('-') or line.startswith('+'):
if len(buf) > context:
print('...')
print('\n'.join(buf[-context:]))
buf = []
print_more = context
print(line)
elif print_more:
print(line)
print_more -= 1
if print_more == 0:
print('...')
else:
buf.append(line)
Used like this:
print_with_context(differ.compare(list_f1, list_f2), 2)
I have a pipe delimited file with 3 columns
aaa|xyz|pqr
another|column
with
line break | last column
The expected output is :
aaa|xyz|pqr
another|column with line break | last column
If I remove the line breaks then I get a single line like this...
aaa|xyz|pqr another|column with line break | last column
But I need 3 columns on each line.
You can try this awk,
awk -F'|' 'NF!=3{ line=line ? line " " $0 : $0; c=split( line, arr, "|"); if(c == 3){ $0=line; }else{ next } }1' yourfile
More readable awk version:
#!/bin/awk -f
BEGIN{
FS="|";
}
NF!=3{
line=line ? line " " $0 : $0;
c=split( line, arr, "|");
if(c == 3) {
$0=line;
}
else {
next;
}
}1
Test:
$ awk -F'|' 'NF!=3{ line=line ? line " " $0 : $0; c=split( line, arr, "|"); if(c == 3){ $0=line; }else{ next } }1' yourfile
aaa|xyz|pqr
another|column with line break | last column
It is working for your sample input.
Python solution:
import sys
def fix_rows(it, n):
row = ''
for line in it:
if row:
row = row.rstrip('\n') + ' ' + line
else:
row = line
if row.count('|') == n - 1:
yield row
row = ''
if row:
yield row
with open('a.csv') as f:
sys.stdout.writelines(fix_rows(f, 3))
output:
aaa|xyz|pqr
another|column with line break | last column
What you are describing is a three field record following this pattern:
(F1, May have CR) | (F2, May have CR) | (F3, No CR)CR
If F3 ever did have a CR, it would be ambiguous which record is which since you would not know whether the CR terminates the record or is embedded into F3 or the following F1 field.
You can easily parse what I have described with a regex in Perl:
$ perl -e '
$str = do { local $/; <> };
while ($str =~ /^\n?((?:[^|]+\|){2}[^\n]+)/gm){
$_=$1;
s/\n/ /g;
print "$_\n";
}
' /tmp/ac.csv
aaa|xyz|pqr
another|column with line break | last column
Which works by using a regex to separate the records from the stream.
Live regex to show how that works.
I'm trying to split a diff (unified format) into each section using the re module in python. The format of a diff is like this...
diff --git a/src/core.js b/src/core.js
index 9c8314c..4242903 100644
--- a/src/core.js
+++ b/src/core.js
## -801,7 +801,7 ## jQuery.extend({
return proxy;
},
- // Mutifunctional method to get and set values to a collection
+ // Multifunctional method to get and set values of a collection
// The value/s can optionally be executed if it's a function
access: function( elems, fn, key, value, chainable, emptyGet, pass ) {
var exec,
diff --git a/src/sizzle b/src/sizzle
index fe2f618..feebbd7 160000
--- a/src/sizzle
+++ b/src/sizzle
## -1 +1 ##
-Subproject commit fe2f618106bb76857b229113d6d11653707d0b22
+Subproject commit feebbd7e053bff426444c7b348c776c99c7490ee
diff --git a/test/unit/manipulation.js b/test/unit/manipulation.js
index 18e1b8d..ff31c4d 100644
--- a/test/unit/manipulation.js
+++ b/test/unit/manipulation.js
## -7,7 +7,7 ## var bareObj = function(value) { return value; };
var functionReturningObj = function(value) { return (function() { return value; }); };
test("text()", function() {
- expect(4);
+ expect(5);
var expected = "This link has class=\"blog\": Simon Willison's Weblog";
equal( jQuery("#sap").text(), expected, "Check for merged text of more then one element." );
## -20,6 +20,10 ## test("text()", function() {
frag.appendChild( document.createTextNode("foo") );
equal( jQuery( frag ).text(), "foo", "Document Fragment Text node was retreived from .text().");
+
+ var $newLineTest = jQuery("<div>test<br/>testy</div>").appendTo("#moretests");
+ $newLineTest.find("br").replaceWith("\n");
+ equal( $newLineTest.text(), "test\ntesty", "text() does not remove new lines (#11153)" );
});
test("text(undefined)", function() {
diff --git a/version.txt b/version.txt
index 0a182f2..0330b0e 100644
--- a/version.txt
+++ b/version.txt
## -1 +1 ##
-1.7.2
\ No newline at end of file
+1.7.3pre
\ No newline at end of file
I've tried the following combinations of patterns but can't quite get it right. This is the closest I have come so far...
re.compile(r'(diff.*?[^\rdiff])', flags=re.S|re.M)
but this yields
['diff ', 'diff ', 'diff ', 'diff ']
How would I match all sections in this diff?
This does it:
r=re.compile(r'^(diff.*?)(?=^diff|\Z)', re.M | re.S)
for m in re.findall(r, s):
print '===='
print m
You don't need to use regex, just split the file:
diff_file = open('diff.txt', 'r')
diff_str = diff_file.read()
diff_split = ['diff --git%s' % x for x in diff_str.split('diff --git') \
if x.strip()]
print diff_split
Why are you using regex? How about just iterating over the lines and starting a new section when a line starts with diff?
list_of_diffs = []
temp_diff = ''
for line in patch:
if line.startswith('diff'):
list_of_diffs.append(temp_diff)
temp_diff = ''
else: temp_diff.append(line)
Disclaimer, above code should be considered illustrative pseudocode only and is not expected to actually run.
Regex is a hammer but your problem isn't a nail.
Just split on any linefeed that's followed by the word diff:
result = re.split(r"\n(?=diff\b)", subject)
Though for safety's sake, you probably should try to match \r or \r\n as well:
result = re.split(r"(?:\r\n|[\r\n])(?=diff\b)", subject)
I have a for loop which references a dictionary and prints out the value associated with the key. Code is below:
for i in data:
if i in dict:
print dict[i],
How would i format the output so a new line is created every 60 characters? and with the character count along the side for example:
0001
MRQLLLISDLDNTWVGDQQALEHLQEYLGDRRGNFYLAYATGRSYHSARELQKQVGLMEP
0061
DYWLTAVGSEIYHPEGLDQHWADYLSEHWQRDILQAIADGFEALKPQSPLEQNPWKISYH
0121 LDPQACPTVIDQLTEMLKETGIPVQVIFSSGKDVDLLPQRSNKGNATQYLQQHLAMEPSQ
It's a finicky formatting problem, but I think the following code:
import sys
class EveryN(object):
def __init__(self, n, outs):
self.n = n # chars/line
self.outs = outs # output stream
self.numo = 1 # next tag to write
self.tll = 0 # tot chars on this line
def write(self, s):
while True:
if self.tll == 0: # start of line: emit tag
self.outs.write('%4.4d ' % self.numo)
self.numo += self.n
# wite up to N chars/line, no more
numw = min(len(s), self.n - self.tll)
self.outs.write(s[:numw])
self.tll += numw
if self.tll >= self.n:
self.tll = 0
self.outs.write('\n')
s = s[numw:]
if not s: break
if __name__ == '__main__':
sys.stdout = EveryN(60, sys.stdout)
for i, a in enumerate('abcdefgh'):
print a*(5+ i*5),
shows how to do it -- the output when running for demonstration purposes as the main script (five a's, ten b's, etc, with spaces in-between) is:
0001 aaaaa bbbbbbbbbb ccccccccccccccc dddddddddddddddddddd eeeeee
0061 eeeeeeeeeeeeeeeeeee ffffffffffffffffffffffffffffff ggggggggg
0121 gggggggggggggggggggggggggg hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
0181 hhhhhhh
# test data
data = range(10)
the_dict = dict((i, str(i)*200) for i in range( 10 ))
# your loops as a generator
lines = ( the_dict[i] for i in data if i in the_dict )
def format( line ):
def splitter():
k = 0
while True:
r = line[k:k+60] # take a 60 char block
if r: # if there are any chars left
yield "%04d %s" % (k+1, r) # format them
else:
break
k += 60
return '\n'.join(splitter()) # join all the numbered blocks
for line in lines:
print format(line)
I haven't tested it on actual data, but I believe the code below would do the job. It first builds up the whole string, then outputs it a 60-character line at a time. It uses the three-argument version of range() to count by 60.
s = ''.join(dict[i] for i in data if i in dict)
for i in range(0, len(s), 60):
print '%04d %s' % (i+1, s[i:i+60])
It seems like you're looking for textwrap
The textwrap module provides two convenience functions, wrap() and
fill(), as well as TextWrapper, the class that does all the work, and
a utility function dedent(). If you’re just wrapping or filling one or
two text strings, the convenience functions should be good enough;
otherwise, you should use an instance of TextWrapper for efficiency.