Generating and applying diffs in python - python

Is there an 'out-of-the-box' way in python to generate a list of differences between two texts, and then applying this diff to one file to obtain the other, later?
I want to keep the revision history of a text, but I don't want to save the entire text for each revision if there is just a single edited line. I looked at difflib, but I couldn't see how to generate a list of just the edited lines that can still be used to modify one text to obtain the other.

Did you have a look at diff-match-patch from google? Apparantly google Docs uses this set of algoritms. It includes not only a diff module, but also a patch module, so you can generate the newest file from older files and diffs.
A python version is included.
http://code.google.com/p/google-diff-match-patch/

Does difflib.unified_diff do want you want? There is an example here.
The original link is broken. There is an example here

I've implemented a pure python function to apply diff patches to recover either of the input strings, I hope someone finds it useful. It uses parses the Unified diff format.
import re
_hdr_pat = re.compile("^## -(\d+),?(\d+)? \+(\d+),?(\d+)? ##$")
def apply_patch(s,patch,revert=False):
"""
Apply unified diff patch to string s to recover newer string.
If revert is True, treat s as the newer string, recover older string.
"""
s = s.splitlines(True)
p = patch.splitlines(True)
t = ''
i = sl = 0
(midx,sign) = (1,'+') if not revert else (3,'-')
while i < len(p) and p[i].startswith(("---","+++")): i += 1 # skip header lines
while i < len(p):
m = _hdr_pat.match(p[i])
if not m: raise Exception("Cannot process diff")
i += 1
l = int(m.group(midx))-1 + (m.group(midx+1) == '0')
t += ''.join(s[sl:l])
sl = l
while i < len(p) and p[i][0] != '#':
if i+1 < len(p) and p[i+1][0] == '\\': line = p[i][:-1]; i += 2
else: line = p[i]; i += 1
if len(line) > 0:
if line[0] == sign or line[0] == ' ': t += line[1:]
sl += (line[0] != sign)
t += ''.join(s[sl:])
return t
If there are header lines ("--- ...\n","+++ ...\n") it skips over them. If we have a unified diff string diffstr representing the diff between oldstr and newstr:
# recreate `newstr` from `oldstr`+patch
newstr = apply_patch(oldstr, diffstr)
# recreate `oldstr` from `newstr`+patch
oldstr = apply_patch(newstr, diffstr, True)
In Python you can generate a unified diff of two strings using difflib (part of the standard library):
import difflib
_no_eol = "\ No newline at end of file"
def make_patch(a,b):
"""
Get unified string diff between two strings. Trims top two lines.
Returns empty string if strings are identical.
"""
diffs = difflib.unified_diff(a.splitlines(True),b.splitlines(True),n=0)
try: _,_ = next(diffs),next(diffs)
except StopIteration: pass
return ''.join([d if d[-1] == '\n' else d+'\n'+_no_eol+'\n' for d in diffs])
On unix: diff -U0 a.txt b.txt
Code is on GitHub here along with tests using ASCII and random unicode characters: https://gist.github.com/noporpoise/16e731849eb1231e86d78f9dfeca3abc

AFAIK most diff algorithms use a simple Longest Common Subsequence match, to find the common part between two texts and whatever is left is considered the difference. It shouldn't be too difficult to code up your own dynamic programming algorithm to accomplish that in python, the wikipedia page above provides the algorithm too.

Does it have to be a python solution?
My first thoughts as to a solution would be to use either a Version Control System (Subversion, Git, etc.) or the diff / patch utilities that are standard with a unix system, or are part of cygwin for a windows based system.

Probably you can use unified_diff to generate the list of difference in a file. Only the changed texts in your file can be written it into a new text file where you can use it for your future reference.
This is the code which helps you to write only the difference to your new file.
I hope this is what you are asking for !
diff = difflib.unified_diff(old_file, new_file, lineterm='')
lines = list(diff)[2:]
# linesT = list(diff)[0:3]
print (lines[0])
added = [lineA for lineA in lines if lineA[0] == '+']
with open("output.txt", "w") as fh1:
for line in added:
fh1.write(line)
print '+',added
removed = [lineB for lineB in lines if lineB[0] == '-']
with open("output.txt", "a") as fh1:
for line in removed:
fh1.write(line)
print '-',removed
Use this in your code to save only the difference output !

Related

Generating multiple strings by replacing wildcards

So i have the following strings:
"xxxxxxx#FUS#xxxxxxxx#ACS#xxxxx"
"xxxxx#3#xxxxxx#FUS#xxxxx"
And i want to generate the following strings from this pattern (i'll use the second example):
Considering #FUS# will represent 2.
"xxxxx0xxxxxx0xxxxx"
"xxxxx0xxxxxx1xxxxx"
"xxxxx0xxxxxx2xxxxx"
"xxxxx1xxxxxx0xxxxx"
"xxxxx1xxxxxx1xxxxx"
"xxxxx1xxxxxx2xxxxx"
"xxxxx2xxxxxx0xxxxx"
"xxxxx2xxxxxx1xxxxx"
"xxxxx2xxxxxx2xxxxx"
"xxxxx3xxxxxx0xxxxx"
"xxxxx3xxxxxx1xxxxx"
"xxxxx3xxxxxx2xxxxx"
Basically if i'm given a string as above, i want to generate multiple strings by replacing the wildcards that can be #FUS#, #WHATEVER# or with a number #20# and generating multiple strings with the ranges that those wildcards represent.
I've managed to get a regex to find the wildcards.
wildcardRegex = f"(#FUS#|#WHATEVER#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
Which finds correctly the target wildcards.
For 1 wildcard present, it's easy.
re.sub()
For more it gets complicated. Or maybe it was a long day...
But i think my algorithm logic is failing hard because i'm failing to write some code that will basically generate the signals. I think i need some kind of recursive function that will be called for each number of wildcards present (up to maybe 4 can be present (xxxxx#2#xxx#2#xx#FUS#xx#2#x)).
I need a list of resulting signals.
Is there any easy way to do this that I'm completely missing?
Thanks.
import re
stringV1 = "xxx#FUS#xxxxi#3#xxx#5#xx"
stringV2 = "XXXXXXXXXX#FUS#XXXXXXXXXX#3#xxxxxx#5#xxxx"
regex = "(#FUS#|#DSP#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
WILDCARD_FUS = "#FUS#"
RANGE_FUS = 3
def getSignalsFromWildcards(app, can):
sigList = list()
if WILDCARD_FUS in app:
for i in range(RANGE_FUS):
outAppSig = app.replace(WILDCARD_FUS, str(i), 1)
outCanSig = can.replace(WILDCARD_FUS, str(i), 1)
if "#" in outAppSig:
newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
sigList += newSigList
else:
sigList.append((outAppSig, outCanSig))
elif len(re.findall("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", stringV1)) > 0:
wildcard = re.search("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app).group()
tarRange = int(wildcard.strip("#"))
for i in range(tarRange):
outAppSig = app.replace(wildcard, str(i), 1)
outCanSig = can.replace(wildcard, str(i), 1)
if "#" in outAppSig:
newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
sigList += newSigList
else:
sigList.append((outAppSig, outCanSig))
return sigList
if "#" in stringV1:
resultList = getSignalsFromWildcards(stringV1, stringV2)
for item in resultList:
print(item)
results in
('xxx0xxxxi0xxxxx', 'XXXXXXXXXX0XXXXXXXXXX0xxxxxxxxxx')
('xxx0xxxxi1xxxxx', 'XXXXXXXXXX0XXXXXXXXXX1xxxxxxxxxx')
('xxx0xxxxi2xxxxx', 'XXXXXXXXXX0XXXXXXXXXX2xxxxxxxxxx')
('xxx1xxxxi0xxxxx', 'XXXXXXXXXX1XXXXXXXXXX0xxxxxxxxxx')
('xxx1xxxxi1xxxxx', 'XXXXXXXXXX1XXXXXXXXXX1xxxxxxxxxx')
('xxx1xxxxi2xxxxx', 'XXXXXXXXXX1XXXXXXXXXX2xxxxxxxxxx')
('xxx2xxxxi0xxxxx', 'XXXXXXXXXX2XXXXXXXXXX0xxxxxxxxxx')
('xxx2xxxxi1xxxxx', 'XXXXXXXXXX2XXXXXXXXXX1xxxxxxxxxx')
('xxx2xxxxi2xxxxx', 'XXXXXXXXXX2XXXXXXXXXX2xxxxxxxxxx')
long day after-all...

Breaking single line data to multi-line data

Working on project where i am given raw log data and need to parse it out to a readable state, know enought with python to break off all the undeed part and just left with raw data that needs to be split and formated, but cant figure out a way to break it apart if they put multiple records on the same line, which does not always happen.
This is the string value i am getting so far.
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,
I need to break it out apart so that each line starts with the number value, but since i can assume there will only be 1 or two of them i cant think of a way to do it, and the number of other comma seperate values after it vary so i cant go by length. this is what i am looking to get to use will further operations with data from the above example.
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000
*190206*01,2050,0100550,01,4999,0000000,,
txt = "*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,"
output = list()
i = 0
x = txt.split("*")
while i < len(x):
if len(x[i]) == 0:
i += 1
continue
print ("*{0}*{1}".format(x[i],x[i+1]))
output.append("*{0}*{1}".format(x[i],x[i+1]))
i += 2
Use split to tokezine the words between *
Print two constitutive tokens
You can use regex:
([*][0-9]*[*])
You can catch the header part with this and then split according to it.
Same answer as #mujiga but I though a dict might better for further operations
txt = "*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,"
datadict=dict()
i=0
x=txt.split("*")
while i < len(x):
if len(x[i]) == 0:
i += 1
continue
datadict[x[i]]=x[i+1]
i += 2
Adding on to #Ali Nuri Seker's suggestion to use regex, here' a simple one lacking lookarounds (which might actually hurt it in this case)
>>> import re
>>> string = '''*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000*190206*01,2050,0100550,01,4999,0000000,,'''
>>> print(re.sub(r'([\*][0-9,]+[\*]+[0-9,]+)', r'\n\1', string))
#Output
*190205*12,6000,0000000,12,6000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000,13,2590,0000000,13,7000,0000000,13,7000,0000000
*190206*01,2050,0100550,01,4999,0000000,,

How to determine whether a Python string contains an emoji?

I've seen previous responses to this question, but none of them are recent and none of them are working for me in Python 3. I have a list of strings, and I simply want to identify which ones contain emoji. What's the fastest way to do this?
To be more specific, I have a lengthy list of email subject lines from AB tests, and I'm trying to determine which subject lines contained emoji.
this link and this link both count © and other common characters as emoji. Also the former has minor mistakes and the latter still doesn't appear to work.
Here's an implementation that errs on the conservative side using this newer data and this documentation. It only considers code points that are marked with the unicode property Emoji_Presentation (which means it's definitely an emoji), or code points marked only with the property Emoji (which means it defaults to text but it could be an emoji), that are followed by a special variation selector code point fe0f that says to default to an emoji instead. The reason I say this is conservative is because certain systems aren't as picky about the fe0f and will treat characters as emoji wherever they can (read more about this here).
import re
from collections import defaultdict
def parse_line(line):
"""Return a pair (property, codepoints) where property is a string and
codepoints is a set of int unicode code points"""
pat = r'([0-9A-Z]+)(\.\.[0-9A-Z]+)? + ; +(\w+) + #.*'
match = re.match(pat, line)
assert match
codepoints = set()
start = int(match.group(1), 16)
if match.group(2):
trimmed = match.group(2)[2:]
end = int(trimmed, 16) + 1
else:
end = start + 1
for cp in range(start, end):
codepoints.add(cp)
return (match.group(3), codepoints)
def parse_emoji_data():
"""Return a dictionary mapping properties to code points"""
result = defaultdict(set)
with open('emoji-data.txt', mode='r', encoding='utf-8') as f:
for line in f:
if '#' != line[0] and len(line.strip()) > 0:
property, cp = parse_line(line)
result[property] |= cp
return result
def test_parse_emoji_data():
sets = parse_emoji_data()
sizes = {
'Emoji': 1123,
'Emoji_Presentation': 910,
'Emoji_Modifier': 5,
'Emoji_Modifier_Base': 83,
}
for k, v in sizes.items():
assert len(sets[k]) == v
def contains_emoji(text):
"""
Return true if the string contains either a code point with the
`Emoji_Presentation` property, or a code point with the `Emoji`
property that is followed by \uFE0F
"""
sets = parse_emoji_data()
for i, ch in enumerate(text):
if ord(ch) in sets['Emoji_Presentation']:
return True
elif ord(ch) in sets['Emoji']:
if len(text) > i+1 and text[i+1] == '\ufe0f':
return True
return False
test_parse_emoji_data()
assert not contains_emoji('hello')
assert not contains_emoji('hello :) :D 125% #%&*(##%&!#(^*(')
assert contains_emoji('here is a smiley \U0001F601 !!!')
To run this you need ftp://ftp.unicode.org/Public/emoji/3.0/emoji-data.txt in the working directory.
Once the regex module supports emoji properties it will be easier to use that instead.

Trying to solve old GoogleCodeJam for practice

I'm working on some of the old Google Code Jam problems as practice in python since we don't use this language at my school. Here is the current problem I am working on that is supposed to just reverse the order of a string by word.
Here is my code:
import sys
f = open("B-small-practice.in", 'r')
T = int(f.readline().strip())
array = []
for i in range(T):
array.append(f.readline().strip().split(' '))
for ar in array:
if len(ar) > 1:
count = len(ar) - 1
while count > -1:
print ar[count]
count -= 1
The problem is that instead of printing:
test a is this
My code prints:
test
a
is
this
Please let me know how to format my loop so that it prints all in one line. Also, I've had a bit of a learning curve when it comes to reading input from a .txt file and manipulating it, so if you have any advice on different methods to do so for such problems, it would be much appreciated!
print by default appends a newline. If you don't want that behavior, tack a comma on the end.
while count > -1:
print ar[count],
count -= 1
Note that a far easier method than yours to reverse a list is just to specify a step of -1 to a slice. (and join it into a string)
print ' '.join(array[::-1]) #this replaces your entire "for ar in array" loop
There are multiple ways you can alter your output format. Here are a few:
One way is by adding a comma at the end of your print statement:
print ar[count],
This makes print output a space at the end instead of a newline.
Another way is by building up a string, then printing it after the loop:
reversed_words = ''
for ar in array:
if len(ar) > 1:
count = len(ar) - 1
while count > -1:
reversed_words += ar[count] + ' '
count -= 1
print reversed_words
A third way is by rethinking your approach, and using some useful built-in Python functions and features to simplify your solution:
for ar in array:
print ' '.join(ar[::-1])
(See str.join and slicing).
Finally, you asked about reading from a file. You're already doing that just fine, but you forgot to close the file descriptor after you were done reading, using f.close(). Even better, with Python 2.5+, you can use the with keyword:
array = list()
with open(filename) as f:
T = int(f.readline().strip())
for i in range(T):
array.append(f.readline().strip().split(' '))
# The rest of your code here
This will close the file descriptor f automatically after the with block, even if something goes wrong while reading the file.
T = int(raw_input())
for t in range(T):
print "Case #%d:" % (t+1),
print ' '.join([w for w in raw_input().split(' ')][::-1])

Increment a Version Number using Regular Expression

I am trying to increment a version number using regex but I can't seem to get the hang of regex at all. I'm having trouple with the symbols in the string I am trying to read and change. The code I have so far is:
version_file = "AssemblyInfo.cs"
read_file = open(version_file).readlines()
write_file = open(version_file, "w")
r = re.compile(r'(AssemblyFileVersion\s*(\s*"\s*)(\S+))\s*"\s*')
for l in read_file:
m1 = r.match(l)
if m1:
VERSION_ID=map(int,m1.group(2).split("."))
VERSION_ID[2]+=1 # increment version
l = r.sub(r'\g<1>' + '.'.join(['%s' % (v) for v in VERSION_ID]), l)
write_file.write(l)
write_file.close()
The string I am trying to read and change is:
[assembly: AssemblyFileVersion("1.0.0.0")]
What I would like written to the file is:
[assembly: AssemblyFileVersion("1.0.0.1")]
So basically I want to increment the build number by one.
Can anyone help me fix my regualr expression. I seem to have trouble getting to grips with regular expression that have to get around symbols.
Thanks for any help.
If you specify the version as "1.0.0.*" then AFAIK it gets updated on each build automagically, at least if you're using Visual Studio.NET.
I'm not sure regex is your best bet, but one way of doing it would be this:
import re
# Don't bother matching everything, just the bits that matter.
pat = re.compile(r'AssemblyFileVersion.*\.(\d+)"')
# ... lines omitted which set up read_file, write_file etc.
for line in read_file:
m = pat.search(line)
if m:
start, end = m.span(1)
line = line[:start] + str(int(line[start:end]) + 1) + line[end:]
write_file.write(line)
Good luck with regex.
If I had to do the same, I'd convert the string to int by removing the dots, add one and convert back to string.
Well, I'd have also used a integer version number in the first place.

Categories

Resources