splitting a diff file using regex in Python

I'm trying to split a diff (unified format) into its sections using the re module in Python. The format of a diff is like this...
diff --git a/src/core.js b/src/core.js
index 9c8314c..4242903 100644
--- a/src/core.js
+++ b/src/core.js
@@ -801,7 +801,7 @@ jQuery.extend({
return proxy;
},
- // Mutifunctional method to get and set values to a collection
+ // Multifunctional method to get and set values of a collection
// The value/s can optionally be executed if it's a function
access: function( elems, fn, key, value, chainable, emptyGet, pass ) {
var exec,
diff --git a/src/sizzle b/src/sizzle
index fe2f618..feebbd7 160000
--- a/src/sizzle
+++ b/src/sizzle
@@ -1 +1 @@
-Subproject commit fe2f618106bb76857b229113d6d11653707d0b22
+Subproject commit feebbd7e053bff426444c7b348c776c99c7490ee
diff --git a/test/unit/manipulation.js b/test/unit/manipulation.js
index 18e1b8d..ff31c4d 100644
--- a/test/unit/manipulation.js
+++ b/test/unit/manipulation.js
@@ -7,7 +7,7 @@ var bareObj = function(value) { return value; };
var functionReturningObj = function(value) { return (function() { return value; }); };
test("text()", function() {
- expect(4);
+ expect(5);
var expected = "This link has class=\"blog\": Simon Willison's Weblog";
equal( jQuery("#sap").text(), expected, "Check for merged text of more then one element." );
@@ -20,6 +20,10 @@ test("text()", function() {
frag.appendChild( document.createTextNode("foo") );
equal( jQuery( frag ).text(), "foo", "Document Fragment Text node was retreived from .text().");
+
+ var $newLineTest = jQuery("<div>test<br/>testy</div>").appendTo("#moretests");
+ $newLineTest.find("br").replaceWith("\n");
+ equal( $newLineTest.text(), "test\ntesty", "text() does not remove new lines (#11153)" );
});
test("text(undefined)", function() {
diff --git a/version.txt b/version.txt
index 0a182f2..0330b0e 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-1.7.2
\ No newline at end of file
+1.7.3pre
\ No newline at end of file
I've tried the following combinations of patterns but can't quite get it right. This is the closest I have come so far...
re.compile(r'(diff.*?[^\rdiff])', flags=re.S|re.M)
but this yields
['diff ', 'diff ', 'diff ', 'diff ']
How would I match all sections in this diff?

This does it:
r = re.compile(r'^(diff.*?)(?=^diff|\Z)', re.M | re.S)
for m in re.findall(r, s):
    print('====')
    print(m)
The (?=^diff|\Z) lookahead ends each match just before the next diff header, or at the end of the input, without consuming it.

You don't need to use regex, just split the file:
with open('diff.txt') as diff_file:
    diff_str = diff_file.read()
diff_split = ['diff --git%s' % x
              for x in diff_str.split('diff --git') if x.strip()]
print(diff_split)

Why are you using regex? How about just iterating over the lines and starting a new section when a line starts with diff?
list_of_diffs = []
temp_diff = ''
for line in patch:
    if line.startswith('diff') and temp_diff:
        list_of_diffs.append(temp_diff)
        temp_diff = line
    else:
        temp_diff += line
if temp_diff:
    list_of_diffs.append(temp_diff)
Disclaimer: above code should be considered illustrative only; test it before relying on it.
Regex is a hammer but your problem isn't a nail.

Just split on any linefeed that's followed by the word diff:
result = re.split(r"\n(?=diff\b)", subject)
Though for safety's sake, you probably should try to match \r or \r\n as well:
result = re.split(r"(?:\r\n|[\r\n])(?=diff\b)", subject)
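For completeness, here is a small self-contained sketch (assuming the diff above is saved as diff.txt) that applies this split and prints the header of each section:
import re

# Read the whole diff; each section starts with a "diff --git" line.
with open('diff.txt') as f:
    subject = f.read()

# Split on any line break immediately followed by the word "diff",
# so every section keeps its own header line.
sections = re.split(r"(?:\r\n|[\r\n])(?=diff\b)", subject)
for section in sections:
    print('====')
    print(section.splitlines()[0])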

Related

PLY Lex: ID could be anything

I have the following simple format:
BLOCK ID {
    SUBBLOCK ID {
        SUBSUBBLOCK ID {
            SOME STATEMENTS;
        };
    };
};
I configured ply to work with this format. But the issue is that ID could be any string including "BLOCK", "SUBBLOCK", etc.
In the lexer I define ID as:
@TOKEN(r'[a-zA-Z_][a-zA-Z_0-9]*')
def t_ID(self, t):
    t.type = self.keyword_map.get(t.value, "ID")
    return t
But this means that the word BLOCK will not be allowed as a block name.
How can I overcome this issue?
The easiest solution is to create a non-terminal name to be used instead of ID in productions which need a name, such as block : BLOCK name braced_statements:
# Docstring is added later
def p_name(self, p):
    p[0] = p[1]
Then you compute the productions for name and assign them to p_name's docstring by executing this before you generate the parser:
Parser.p_name.__doc__ = '\n| '.join(
    ['name : ID']
    + list(Lexer.keyword_map.values())
)
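For illustration, with a keyword_map containing BLOCK, SUBBLOCK and SUBSUBBLOCK, the join above builds exactly this docstring:
name : ID
| BLOCK
| SUBBLOCK
| SUBSUBBLOCK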
I probably don't understand your question, but I think something like the following would work (it's been a long time since I messed with PLY, so keep in mind this is pseudocode):
FUNCTION_ARGS = zeroOrMore(lazy('STATEMENT'),sep=',')
FUNCTION_CALL = t_ID + lparen + FUNCTION_ARGS + rparen
STATEMENT= FUNCTION_CALL | t_Literal | t_ID
SUBBLOCK = Literal('SUBBLOCK') + t_ID + lbrace + STATEMENT + rbrace
BLOCK = Literal('BLOCK') + lbrace + oneOrMore(SUBBLOCK) + rbrace
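For reference, here is a minimal self-contained PLY sketch of the first answer's idea (module-level functions rather than a class, and a deliberately tiny grammar of my own, so treat the details as an approximation):
import ply.lex as lex
import ply.yacc as yacc

# Keywords that should also be usable as names.
keyword_map = {'BLOCK': 'BLOCK', 'SUBBLOCK': 'SUBBLOCK'}
tokens = ['ID', 'LBRACE', 'RBRACE', 'SEMI'] + list(keyword_map.values())

t_LBRACE = r'\{'
t_RBRACE = r'\}'
t_SEMI = r';'
t_ignore = ' \t\n'

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = keyword_map.get(t.value, 'ID')  # promote keywords
    return t

def t_error(t):
    t.lexer.skip(1)

def p_block(p):
    'block : BLOCK name LBRACE RBRACE SEMI'
    p[0] = ('block', p[2])

def p_name(p):
    p[0] = p[1]

# Build "name : ID | BLOCK | SUBBLOCK" before the parser is generated.
p_name.__doc__ = '\n| '.join(['name : ID'] + list(keyword_map.values()))

def p_error(p):
    raise SyntaxError(p)

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('BLOCK BLOCK { };'))  # a keyword used as a name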

PyParsing: parse if not a keyword

I am trying to parse a file as follows:
testp.txt
title = Test Suite A;
timeout = 10000
exp_delay = 500;
log = TRUE;
sect
{
    type = typeA;
    name = "HelloWorld";
    output_log = "c:\test\out.log";
};
sect
{
    name = "GoodbyeAll";
    type = typeB;
    comm1_req = 0xDEADBEEF;
    comm1_resp = (int, 1234366);
};
The file first contains a section with parameters and then some sects. I can parse a file containing just parameters, and I can parse a file containing just sects, but I can't parse both.
from pyparsing import *
from pathlib import Path
command_req = Word(alphanums)
command_resp = "(" + delimitedList(Word(alphanums)) + ")"
kW = Word(alphas+'_', alphanums+'_') | command_req | command_resp
keyName = ~Literal("sect") + Word(alphas+'_', alphanums+'_') + FollowedBy("=")
keyValue = dblQuotedString.setParseAction( removeQuotes ) | OneOrMore(kW,stopOn=LineEnd())
param = dictOf(keyName, Suppress("=")+keyValue+Optional(Suppress(";")))
node = Group(Literal("sect") + Literal("{") + OneOrMore(param) + Literal("};"))
final = OneOrMore(node) | OneOrMore(param)
param.setDebug()
p = Path(__file__).with_name("testp.txt")
with open(p) as f:
    try:
        x = final.parseFile(f, parseAll=True)
        print(x)
        print("...")
        dx = x.asDict()
        print(dx)
    except ParseException as pe:
        print(pe)
The issue I have is that param matches against sect so it expects a =. So I tried putting in ~Literal("sect") in keyName but that just leads to another error:
Exception raised: Found unwanted token, "sect", found '\n' (at char 188), (line:4, col:56)
Expected end of text, found 's' (at char 190), (line:6, col:1)
How do I get it to use one parse method for sect and another (param) if not sect?
My final goal would be to have the whole lot in a Dict with the global params and sects included.
EDIT
Think I've figured it out:
This line...
final = OneOrMore(node) | OneOrMore(param)
...should be:
final = ZeroOrMore(param) + ZeroOrMore(node)
But I wonder if there is a more structured way (as I'd ultimately like a dict)?
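One more structured possibility, as a hedged sketch (a self-contained mini-grammar with my own names; the exact results-name plumbing may vary across pyparsing versions): give each half of the grammar a results name so a single parse yields both parts:
from pyparsing import (CaselessKeyword, Group, Suppress, Word, ZeroOrMore,
                       alphanums, alphas)

key = ~CaselessKeyword("sect") + Word(alphas + "_", alphanums + "_")
value = Word(alphanums)
param = Group(key + Suppress("=") + value + Suppress(";"))
sect = Group(CaselessKeyword("sect") + Suppress("{")
             + ZeroOrMore(param) + Suppress("};"))
doc = ZeroOrMore(param)("params") + ZeroOrMore(sect)("sects")

res = doc.parseString("a = 1; b = 2; sect { c = 3; }; sect { d = 4; };",
                      parseAll=True)
print(res.params)  # the global key/value pairs
print(res.sects)   # one group per sect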

Replace string with custom values and iterate for all occurrences

I am trying to replace some text in a file using either SED, PERL, AWK or a python script. I've tried a couple of things but can't seem to work it out.
I have the following in a text file called data.txt
&st=ALPHA&type=rec&uniId=JIM&acceptCode=123&drainNel=supp&
&st=ALPHA&type=rec&uniId=JIM&acceptCode=167&drainNel=supp&
&st=ALPHA&type=rec&uniId=SARA&acceptCode=231&drainNel=ured&
&st=ALPHA&type=rec&uniId=SARA&acceptCode=344&drainNel=iris&
&st=ALPHA&type=rec&uniId=SARA&acceptCode=349&drainNel=iris&
&st=ALPHA&type=rec&uniId=DAVE&acceptCode=201&drainNel=teef&
1) Script will take an input argument in the form of a number, e.g: 10000
2) I want to replace all the text ALPHA with the given long number from the argument, incrementing by 100 when the uniId is the same as the previous line's, and by 5000 when it is different.
3) I want all the acceptCode values to change to the first st value of the lines with the same uniId.
./script 10000
.. still confused? Well, the final result could be this:
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
This ^ should be REPLACED and applied to the file data.txt - not just printed on screen.
Okay, here's one way, using awk (wrapped in a shell script for convenience because it's a bit too much for a one-liner):
#!/bin/sh
# Usage:
# $ ./transform.sh [STARTCOUNT] < data.txt > temp.txt
# $ mv -f temp.txt data.txt
awk -F '&' -v "cnt=${1:-10000}" -v 'OFS=&' \
    'NR == 1 { ac = cnt; uni = $4; }
     NR > 1 && $4 == uni { cnt += 100 }
     $4 != uni { cnt += 5000; ac = cnt; uni = $4 }
     { $2 = "st=" cnt; $5 = "acceptCode=" ac; print }'
Running this on a file holding your sample input:
$ ./transform.sh 10000 < data.txt
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
And a perl version that does an in-place edit of the input file:
#!/usr/bin/perl -ani -F'&'
# Usage:
# $ ./transform.pl COUNT datafile
use warnings;
use strict;
use English;

our ($count, $ac, $uni);

BEGIN {
    $count = shift @ARGV;
    die "Missing count argument" unless defined $count and $count =~ /^\d+$/;
    $ac = $count;
    $uni = "";
    $OFS = '&';
}

if ($NR == 1) {
    $uni = $F[3];
} elsif ($uni ne $F[3]) {
    $count += 5000;
    $ac = $count;
    $uni = $F[3];
} else {
    $count += 100;
}

$F[1] = "st=$count";
$F[4] = "acceptCode=$ac";
print @F;
Running it on your sample input:
$ ./transform.pl 10000 data.txt
$ cat data.txt
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
A few assumptions
Your requirement 2) ("increment by 100 if uniId is the same, by 5000 if it is different"), in conjunction with your example output, requires your input data to be sorted on the uniId field. If the file is not sorted, the 100 and 5000 increments will not yield the desired initial values for each uniId.
The increment scheme assumes that no one uniId value will have enough records to increment into the next 5000 range reserved for newly identified uniId values.
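To make the control-break logic concrete before the full program, here is a minimal sketch (plain Python, my own names) of just the counter rules:
def st_values(uni_ids, start=10000):
    # Yield (st, acceptCode) for each record, following the rules above.
    cnt = start
    ac = start
    prev = None
    for uid in uni_ids:
        if prev is None:
            prev = uid      # first record: st == acceptCode == start
        elif uid == prev:
            cnt += 100      # same uniId: bump st only
        else:
            cnt += 5000     # new uniId: new 5000 block, new acceptCode
            ac = cnt
            prev = uid
        yield cnt, ac

# JIM, JIM, SARA, SARA, SARA, DAVE reproduces the expected output above.
for st, ac in st_values(["JIM", "JIM", "SARA", "SARA", "SARA", "DAVE"]):
    print(st, ac)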
#!/usr/bin/env python3
from collections import OrderedDict
import csv
import sys

class TrackingVars(object):
    """
    The TrackingVars class manages the business logic for maintaining the
    st field counters and the acceptCode values for each uniId.
    """
    def __init__(self, long_number):
        self.uniId_table = {}
        self.running_counter = long_number

    def __initial_value__(self):
        """
        The first encounter for a uniId will have st = acceptCode.
        """
        retval = (self.running_counter, self.running_counter)
        return retval

    def get_uniId(self, id):
        """
        A convenience method for returning uniId tracking values.
        """
        curval, original_value = self.uniId_table.get(id, self.__initial_value__())
        return (curval, original_value)

    def track(self, uniId):
        """
        curval = original_value when a new uniId is encountered.
        If the uniId is known, simply increment curval by 100.
        If the uniId is new and there is at least 1 key in the
        tracking table, increment curval by 5000.
        Always update the tracking variables.
        """
        curval, original_value = self.get_uniId(uniId)
        if uniId in self.uniId_table.keys():
            curval = curval + 100
        else:
            if self.uniId_table:
                curval = curval + 5000
            original_value = curval
        self.running_counter = curval
        retval = (curval, original_value)
        self.uniId_table[uniId] = retval
        return retval

def data_lines(filename):
    """
    Read the file as input delimited by &.
    """
    # newline='' lets the csv module manage line endings itself.
    with open(filename, "r", newline="") as fin:
        csvin = csv.reader(fin, delimiter="&")
        for row in csvin:
            yield row

def transform_data_line(line):
    """
    Transform data into key, value pairs.
    The leading and trailing & have no valid key, value pairs.
    """
    head = ("head", None)
    tail = ("tail", None)
    items = [head]
    for field in line[1:-1]:
        key, value = field.split("=")
        items.append([key, value])
    retval = OrderedDict(items)
    retval["tail"] = tail
    return retval

def process_data_line(record, text_to_replace, tracking_vars):
    """
    If the st value is ALPHA, update the record with the tracking variables.
    """
    st = record.get("st")
    if st is not None:
        if st == text_to_replace:
            uniId = record.get("uniId")
            curval, original_value = tracking_vars.track(uniId)
            record["st"] = curval
            record["acceptCode"] = original_value
    return record

def process_file():
    """
    Get the long number from the command line input.
    Initialize the tracking variables.
    Process each row of the file.
    """
    long_number = sys.argv[1]
    tracking_vars = TrackingVars(int(long_number))
    for row in data_lines("data.txt"):
        record = transform_data_line(row)
        retval = process_data_line(record, "ALPHA", tracking_vars)
        yield retval

def write(iter_in, filename_out):
    """
    Write each row from the iterator to the csv.
    Make sure the first and last fields are empty.
    """
    with open(filename_out, "w", newline="") as fout:
        csvout = csv.writer(fout, delimiter="&")
        for row in iter_in:
            encoded_row = ["{0}={1}".format(k, v) for k, v in row.items()]
            encoded_row[0] = ""
            encoded_row[-1] = ""
            csvout.writerow(encoded_row)

if __name__ == "__main__":
    write(process_file(), "data.new.txt")
Output
$ cat data.new.txt
&st=10000&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=10100&type=rec&uniId=JIM&acceptCode=10000&drainNel=supp&
&st=15100&type=rec&uniId=SARA&acceptCode=15100&drainNel=ured&
&st=15200&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=15300&type=rec&uniId=SARA&acceptCode=15100&drainNel=iris&
&st=20300&type=rec&uniId=DAVE&acceptCode=20300&drainNel=teef&
Conclusion
Only you know why the business rules for the incrementing number scheme are the way they are. However, having a control break on uniId, with the st value dependent upon the previous uniId's increment, seems problematic to me. You could process unsorted files if each newly encountered uniId started at a new 5000 boundary, for example 15000, 20000, 25000, etc.
P.S
I love the AWK and Perl answers. They are simple and straight forward. They answer the question exactly as it was posed. Now all we need is a SED example :)
Just a bit more efficient control, in one line of GNU awk:
awk -F\& -vi=10000 -vOFS=\& '{if(NR==1) { ac=i; u=$4; } else { if($4==u) i+=100; else { i+=5000; ac=i; u=$4; } }; $2="st=" i; $5 =gensub(/[0-9]+/,ac,1,$5); print } ' data.txt
It accepts any string in the 5th field. Thanks, Shawn.

change column numbers in a csv file

I have a pipe-delimited file with 3 columns
aaa|xyz|pqr
another|column
with
line break | last column
The expected output is:
aaa|xyz|pqr
another|column with line break | last column
If I remove the line breaks then I get a single line like this...
aaa|xyz|pqr another|column with line break | last column
But I need 3 columns on each line.
You can try this awk:
awk -F'|' 'NF!=3{ line=line ? line " " $0 : $0; c=split( line, arr, "|"); if(c == 3){ $0=line; }else{ next } }1' yourfile
More readable awk version:
#!/bin/awk -f
BEGIN {
    FS="|";
}
NF!=3 {
    line = line ? line " " $0 : $0;
    c = split(line, arr, "|");
    if (c == 3) {
        $0 = line;
    }
    else {
        next;
    }
}1
Test:
$ awk -F'|' 'NF!=3{ line=line ? line " " $0 : $0; c=split( line, arr, "|"); if(c == 3){ $0=line; }else{ next } }1' yourfile
aaa|xyz|pqr
another|column with line break | last column
It works for your sample input.
Python solution:
import sys

def fix_rows(it, n):
    row = ''
    for line in it:
        if row:
            row = row.rstrip('\n') + ' ' + line
        else:
            row = line
        if row.count('|') == n - 1:
            yield row
            row = ''
    if row:
        yield row

with open('a.csv') as f:
    sys.stdout.writelines(fix_rows(f, 3))
output:
aaa|xyz|pqr
another|column with line break | last column
What you are describing is a three field record following this pattern:
(F1, May have CR) | (F2, May have CR) | (F3, No CR)CR
If F3 ever did have a CR, it would be ambiguous which record is which since you would not know whether the CR terminates the record or is embedded into F3 or the following F1 field.
You can easily parse what I have described with a regex in Perl:
$ perl -e '
    $str = do { local $/; <> };
    while ($str =~ /^\n?((?:[^|]+\|){2}[^\n]+)/gm) {
        $_ = $1;
        s/\n/ /g;
        print "$_\n";
    }
' /tmp/ac.csv
aaa|xyz|pqr
another|column with line break | last column
This works by using a regex to separate the records from the stream.
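If you prefer Python, the same regex carries over directly (a sketch assuming the input is in a.csv, as in the answer above):
import re

# Same pattern as the Perl version, with re.M standing in for /m.
with open('a.csv') as f:
    data = f.read()
for m in re.finditer(r'^\n?((?:[^|]+\|){2}[^\n]+)', data, re.M):
    print(m.group(1).replace('\n', ' '))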

decompress name

What is the easiest way to decompress a data name?
For example, change compressed form:
abc[3:0]
into decompressed form:
abc[3]
abc[2]
abc[1]
abc[0]
Preferably a one-liner :)
In Perl:
#!perl -w
use strict;
use 5.010;

my @abc = qw/ a b c d /;
say join( " ", reverse @abc[0..3] );
Or if you wanted them in separate variables:
my( $abc3, $abc2, $abc1, $abc0 ) = reverse @abc[0..3];
Edit: Per your clarification:
my $str = "abc[3:0]";
$str =~ /(abc)\[(\d+):(\d+)\]/;
my $base = $1;
my $from = ( $2 < $3 ? $2 : $3 );
my $to = ( $2 > $3 ? $2 : $3 );
my @strs;
foreach my $num ( $from .. $to ) {
    push @strs, $base . '[' . $num . ']';
}
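If Python is an option, a comparable sketch (my own helper, and it preserves the direction of the range, so abc[3:0] comes out descending):
import re

def expand(name):
    # Parse "abc[3:0]" into base and bounds, then walk the range in order.
    base, first, last = re.match(r'(\w+)\[(\d+):(\d+)\]', name).groups()
    first, last = int(first), int(last)
    step = -1 if first > last else 1
    return ['%s[%d]' % (base, i) for i in range(first, last + step, step)]

print(expand('abc[3:0]'))  # ['abc[3]', 'abc[2]', 'abc[1]', 'abc[0]']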
This is a little pyparsing exercise that I've done in the past, adapted to your example (also supports multiple ranges and unpaired indexes, all separated by commas - see the last test case):
from pyparsing import (Suppress, Word, alphas, alphanums, nums, delimitedList,
Combine, Optional, Group)
LBRACK,RBRACK,COLON = map(Suppress,"[]:")
ident = Word(alphas+"_", alphanums+"_")
integer = Combine(Optional('-') + Word(nums))
integer.setParseAction(lambda t : int(t[0]))
intrange = Group(integer + COLON + integer)
rangedIdent = ident("name") + LBRACK + delimitedList(intrange|integer)("indexes") + RBRACK
def expandIndexes(t):
    ret = []
    for ind in t.indexes:
        if isinstance(ind, int):
            ret.append("%s[%d]" % (t.name, ind))
        else:
            offset = (-1, 1)[ind[0] < ind[1]]
            ret.extend(
                "%s[%d]" % (t.name, i) for i in range(ind[0], ind[1] + offset, offset)
            )
    return ret
rangedIdent.setParseAction(expandIndexes)
print(rangedIdent.parseString("abc[0:3]"))
print(rangedIdent.parseString("abc[3:0]"))
print(rangedIdent.parseString("abc[0:3,7,14:16,24:20]"))
Prints:
['abc[0]', 'abc[1]', 'abc[2]', 'abc[3]']
['abc[3]', 'abc[2]', 'abc[1]', 'abc[0]']
['abc[0]', 'abc[1]', 'abc[2]', 'abc[3]', 'abc[7]', 'abc[14]', 'abc[15]', 'abc[16]', 'abc[24]', 'abc[23]', 'abc[22]', 'abc[21]', 'abc[20]']
