I'm currently working on a bioinformatic and modelling project where I need to do some pattern matching. Let's say I have a DNA fragment as follow 'atggcgtatagagc' and I split that fragment in micro-sequences of 8 nucleotides so that I have :
'atggcgta' 'tggcgtat' 'ggcgtata' 'gcgtatag' 'cgtataga' 'gtatagag' 'tatagagc'
And for each of these fragment I want to search in a whole genome and per chromosome the number of time they appear and the positions (starting positions) of the matches.
Here is how my code looks like :
you can download the genome fasta file here :
drive to the fasta file
import re
from Bio.SeqIO.FastaIO import FastaIterator
from Bio.Seq import Seq
def reverse_complement(sequence: str) -> str:
my_sequence = Seq(sequence)
return str(my_sequence.reverse_complement())
# you will need to unzip the file ant change the path below according to your working directory
path = '../data/Genome_S288c.fa'
genome = open(path, "r")
chr_sequences = {}
for record in FastaIterator(genome):
chr_id = record.id
seq = str(record.seq).lower()
rc_seq = reverse_complement(seq)
chr_sequences[chr_id] = {'5to3': seq, '3to5': rc_seq}
genome.close()
sequences = 'ATGACTAACGAAAAGGTCTGGATAGAGAAGTTGGATAATCCAACTCTTTCAGTGTTACCACATGACTTTTTACGCCCACAATCTTTAT'.lower()
micro_size = 8
micro_sequences = []
start = micro_size - 1
for i in range(start, len(sequences), 1):
current_micro_seq = sequences[i - start:i + 1]
micro_sequences.append(current_micro_seq)
genome_count = 0
chr_count = {}
chr_locations = {}
micro_fragment_stats = {}
for ii_micro, micro_seq in enumerate(micro_sequences):
for chr_idx in list(chr_sequences.keys()):
chr_counter = 0
seq = chr_sequences[chr_idx]['5to3']
pos = [m.start() for m in re.finditer(pattern=r'(?=(' + micro_seq + '))', string=seq)]
rc_seq = chr_sequences[chr_idx]['3to5']
rc_pos = [m.start() for m in re.finditer(pattern=r'(?=(' + micro_seq + '))', string=rc_seq)]
chr_locations[chr] = {'5to3': pos, '3to5': rc_pos}
chr_counter += len(pos) + len(rc_pos)
chr_count[chr_idx] = chr_counter
genome_count += chr_counter
micro_fragment_stats[ii_micro] = {'occurrences genome': genome_count,
'occurrences chromosomes': chr_count,
'locations chromosomes': chr_locations}
Actually my fragment is something like 2000bp long, so I took about 1 hour to compute all the micro-sequences. \
By the way, I use the r'(?=('+self.sequence+'))' to avoid the case of pattern that overlaps itself in the sequence, for instance :
pattern = 'aaggaaaaa'
string = 'aaggaaaaaggaaaaa'
expected output : (0, 7)
I am looking for a more efficient regex method that I can use for my case (in python if possible).
Thanks in advance
I would not recommend using regex for repetitive simple pattern matching. Outright comparison is expected to perform better. I did some basic testing and came up with the demo below.
import time
import re
import random
def compare(r1, r2, microseq_len, test_condition=1):
# condition 1: make microseqs/indexes from longer sequence and search against shorter
# condition 2: use regex to find position of microseq in reference sequence
# condition 3: use regex to find position of microseq in reference sequence after verifying if microseq in reference strain
start_time = time.time()
if test_condition == 1:
r1, r2 = r2, r1
# assemble dictionary containing microsequences and index positions
microseq_di = {}
for i in range(len(r1)-microseq_len):
microseq = r1[i:i+microseq_len]
if microseq not in microseq_di:
microseq_di[microseq] = []
microseq_di[microseq].append([i, i+microseq_len])
# mark for deletion
for microseq in microseq_di:
# condition 2
if test_condition == 2:
microseq_di[microseq] = [m.start() for m in re.finditer(pattern=r'(?=('+microseq+'))', string=r2)]
elif microseq not in r2:
microseq_di[microseq] = []
# condition 3
elif test_condition == 3:
microseq_di[microseq] = [m.start() for m in re.finditer(pattern=r'(?=('+microseq+'))', string=r2)]
print(time.time() - start_time) # run time
# delete and return
return({x:y for x, y in microseq_di.items() if y != []})
Input and Output:
r_short = "".join([random.choices(["A", "T", "G", "C"])[0] for x in range(500)])
r_long = "".join([random.choices(["A", "T", "G", "C"])[0] for x in range(100000)])
len(compare(r_short, r_long, 8, test_condition=1).keys())
0.19868111610412598
Out[1]: 400
len(compare(r_short, r_long, 8, test_condition=2).keys())
0.8831210136413574
Out[2]: 399
len(compare(r_short, r_long, 8, test_condition=3).keys())
0.7925639152526855
Out[3]: 399
Test condition 1 (microseqs from longer sequence) performed a lot better than the other two conditions using regex. Relative performance should improve with longer strings.
r_short = "".join([random.choices(["A", "T", "G", "C"])[0] for x in range(2000)])
r_long = "".join([random.choices(["A", "T", "G", "C"])[0] for x in range(1000000)])
len(compare3(r_short, r_long, 8, test_condition=1).keys())
2.2517480850219727
Out[4]: 1970
len(compare3(r_short, r_long, 8, test_condition=2).keys())
35.65084385871887
Out[5]: 1969
len(compare3(r_short, r_long, 8, test_condition=3).keys())
34.994577169418335
Out[6]: 1969
Note that condition 1 is not fully accommodating to your use-case since it doesn't exclude overlapping microseqs.
I've been playing with this question for a while and I end up with some ideas.
The algorithm is mainly divided in two parts: k-mer generation and k-mer searching in the reference.
For the k-mer generation part, I can see that your algorithm is quick, but it generates duplicates (that you have to filter afterwards when generating the dictionary). My approach has been to generate a deduplicated list directly. In my sample code I also modified your method to perform the deduplication at the same time, so you can avoid doing it later and, more important, allows for a fair time comparison with my approach.
You will see that using a set to keep the kmers offers us free deduplication, and is faster than using a list, as it has not to be traversed.
For the search of the kmer in the reference, given that you were doing exact searches, using a regex is overkill. It's far more cheaper to do a standard search. In this code, I used the methods provided by the Seq class: find and index. The idea is to find the first occurrence starting from the beginning, and the repeat the search starting with the next position after the last index found (if you want to avoid overlaps, then start after the last position found plus the k-mer size).
The code generated follows:
import re
from pathlib import Path
from timeit import timeit
from Bio.Seq import Seq
from Bio.SeqIO.FastaIO import FastaIterator
def reverse_complement(sequence: Seq) -> Seq:
return sequence.reverse_complement()
def generate_kmers(sequence: Seq, kmer_size: int) -> set[Seq]:
return {
Seq(sequence[i : i + kmer_size]) for i in range(len(sequence) - kmer_size + 1)
}
def generate_kmers_original(sequence: Seq, kmer_size: int) -> list[Seq]:
kmers: list[Seq] = []
start = kmer_size - 1
for i in range(start, len(sequence), 1):
current_micro_seq = Seq(sequence[i - start : i + 1])
# We had to add this check to avoid the duplication of k-mers
if current_micro_seq not in kmers:
kmers.append(current_micro_seq)
return kmers
def load_fasta(fasta_file: str) -> dict[str, dict[str, Seq]]:
fasta_dict: dict[str, dict[str, Seq]] = {}
with Path(fasta_file).open("r", encoding="UTF-8") as genome:
for record in FastaIterator(genome):
seq = record.seq.lower()
fasta_dict[record.id] = {"5to3": seq, "3to5": reverse_complement(seq)}
return fasta_dict
if __name__ == "__main__":
# Load the big fasta file
chr_sequences = load_fasta(
".../Saccharomyces_cerevisiae/S288c_R64/fasta/scerevisiae.S288c_R64.fasta"
)
# Generate the micro-sequences
micro_size = 8
sequences = Seq(
"ATGACTAACGAAAAGGTCTGGATAGAGAAGTTGGATAATCCAACTCTTTCAGTGTTACCACATGACTTTTTACGCCCACAATCTTTAT"
).lower()
micro_sequences = generate_kmers(sequences, micro_size)
# k-mer generation benchmark
test_size = 1000
kmer_generation_time = timeit(
"generate_kmers(sequences, micro_size)", number=test_size, globals=globals()
)
kmer_generation_original_time = timeit(
"generate_kmers_original(sequences, micro_size)",
number=test_size,
globals=globals(),
)
print(f"New k-mer generation time : {kmer_generation_time}")
print(f"Original k-mer generation time: {kmer_generation_original_time}")
print(f"There are {len(micro_sequences)} k-mers")
# Search for the kmers in the reference
def find_kmers_original(sequence: Seq, kmer: Seq) -> list[int]:
positions = [
m.start()
for m in re.finditer(
pattern=r"(?=(" + str(kmer) + "))", string=str(sequence)
)
]
return positions
def find_kmers_find(sequence: Seq, kmer: Seq) -> list[int]:
current = 0
positions: list[int] = []
while current < len(sequence):
index = sequence.find(kmer, current)
if index == -1:
break
positions.append(index)
current = index + 1
return positions
def find_kmers_index(sequence: Seq, kmer: Seq) -> list[int]:
positions: list[int] = []
current = 0
try:
while True:
index = sequence.index(kmer, current)
positions.append(index)
current = index + 1
except ValueError:
# Exception thrown when the kmer is not found
# This is our exit condition
pass
return positions
# k-mer search benchmark
test_size = 1000
haystack = next(iter(chr_sequences.values()))["5to3"]
needle = next(iter(micro_sequences))
search_original_time = timeit(
"find_kmers_original(haystack, needle)",
number=test_size,
globals=globals(),
)
search_find_time = timeit(
"find_kmers_find(haystack, needle)",
number=test_size,
globals=globals(),
)
search_index_time = timeit(
"find_kmers_index(haystack, needle)",
number=test_size,
globals=globals(),
)
print(f"Search with original time: {search_original_time}")
print(f"Search with find time : {search_find_time}")
print(f"Search with index time : {search_index_time}")
# Actual calculus
genome_count = 0
chr_count: dict[str, int] = {}
chr_locations: dict[str, dict[str, list[int]]] = {}
micro_fragment_stats: dict[
int, dict[str, int | dict[str, int] | dict[str, dict[str, list[int]]]]
] = {}
for ii_micro, micro_seq in enumerate(micro_sequences):
for chr_counter, (chromosome, contents) in enumerate(chr_sequences.items()):
pos = find_kmers_find(contents["5to3"], micro_seq)
rc_pos = find_kmers_find(contents["3to5"], micro_seq)
chr_locations[chromosome] = {"5to3": pos, "3to5": rc_pos}
chr_counter += len(pos) + len(rc_pos)
chr_count[chromosome] = chr_counter
genome_count += chr_counter
micro_fragment_stats[ii_micro] = {
"occurrences genome": genome_count,
"occurrences chromosomes": chr_count,
"locations chromosomes": chr_locations,
}
The output of this toy example is:
New k-mer generation time : 0.6696164240129292
Original k-mer generation time: 5.967410315992311
There are 81 k-mers
Search with original time: 3.1360475399997085
Search with find time : 0.5738343889825046
Search with index time : 0.5662875371053815
You can see that the k-mer generation is 9x faster and the search without the regex is around 5.5x faster.
In general, you will be better taking advantage of comprehensions and built-in data types (like the sets used here). And using the more simple approach also helps with performance. Regexes are powerful, but they need their time; if they are not required, better to avoid them. Specially in loops, where every small performance change is amplified.
Besides all of this benchmarking, you can also try to add the approach introduced by #Ghothi where the long and short sequences are exchanged. Maybe it could lead to some further improvement.
As a side note, Seq.find and Seq.index seems to offer the same performance, but I find it cleaner and more elegant the Seq.index version: you don't need a weird value to test against and the code intent is clearer. Also, the performance is slightly better, as it is avoiding a comparison in the loop, but this is a very minor improvement.
Related
Working on a project to decipher real text from gibberish, I found this code on Github and have made some slight edits to better fit my needs. When testing I keep getting a TypeError on line 17, return [c.lower() for c in line if c.lower() in accepted_chars].
import math
import pickle
accepted_chars = 'abcdefghijklmnopqrstuvwxyz '
pos = dict([(char, idx) for idx, char in enumerate(accepted_chars)])
def normalize(line):
"""Return only the subset of chars from accepted_chars.
This helps keep the model relatively small by ignoring punctuation,
infrequenty symbols, etc. """
return [c.lower() for c in line if c.lower() in accepted_chars]
def ngram(n, l):
"""Return all n grams from l after normalizing """
filtered = normalize(l)
for start in range(0, len(filtered) - n + 1):
yield ''.join(filtered[start:start + n])
def train():
""" Write a simple model as a pickle file"""
k = len(accepted_chars)
# Assume we have seen 10 of each character pair. This acts as a kind of
# prior or smoothing factor. This way, if we see a character transition
# live that we've never observed in the past, we won't assume the entire
# string has 0 probability.
counts = [[10 for i in range(k)] for i in range(k)]
# Count transitions from big text file, taken
# from http://norvig.com/spell-correct.html
for line in open('big.txt'):
for a,b in ngram(2, line):
counts[pos[a]][pos[b]] += 1
# Normalize the counts so that they become log probabilities.
# We use log probabilities rather than straight probabilities to avoid
# numeric underflow issues with long texts.
# This contains a justification:
# http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
for i, row in enumerate(counts):
s = float(sum(row))
for j in range(len(row)):
row[j] = math.log(row[j] / s)
# Find the probability of generating a few arbitrarily choosen good and
# bad phrases.
good_probs = [avg_transition_prob(l, counts) for l in open('good.txt')]
bad_probs = [avg_transition_prob(l, counts) for l in open('bad.txt')]
# Assert that we actually are capable of detecting the junk.
assert min(good_probs) > max(bad_probs)
#And pick a threshhold halfway between the worst good and best bad inputs.
thresh = (min(good_probs) + max(bad_probs)) / 2
pickle.dump({'mat': counts, 'thresh': thresh}, open('gib.model.pki', 'wb'))
def avg_transition_prob(l, log_prob_mat):
""" Return the average transition prob from l through log_prob_mat """
log_prob = 0.0
transition_ct = 0
for a, b in ngram(2,1):
log_prob += log_prob_mat[pos[a]][pos[b]]
transition_ct += 1
return math.exp(log_prob / (transition_ct or 1))
if __name__ == '__main__':
train()
ngram(2,1) calls normalize with the second parameter
normalize then does this:
return [c.lower() for c in line if c.lower() in accepted_chars]
Thus, you can't do for c in 1
Maybe you meant to put l there instead?
I developed the following code to calculate the number of identical sites in an alignment. Unfortunately the code is slow, and I have to iterate it over hundreds of files, it takes close to 12 hours to process more than 1000 alignments, meaning that something ten times faster would be appropriate. Any help would be appreciated:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import AlignIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
from Bio.Align import MultipleSeqAlignment
import time
a = SeqRecord(Seq("CCAAGCTGAATCAGCTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTAAGAGTTCCTTTCGAGCACTACAAGAAGACGATTCGCGCGAACCACCGCAT", generic_dna), id="Alpha")
b = SeqRecord(Seq("CGAAGCTGACTCAGTGGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACAATTCGTGCGAACCACCGCAT", generic_dna), id="Beta")
c = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAATCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Gamma")
d = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Delta")
e = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Epsilon")
align = MultipleSeqAlignment([a, b, c], annotations={"tool": "demo"})
start_time = time.time()
if len(align) != 1:
for n in range(0,len(align[0])):
n=0
i=0
while n<len(align[0]): #part that needs to be faster
column = align[:,n]
if (column == len(column) * column[0]) == True:
i=i+1
n=n+1
match = float(i)
length = float(n)
global_identity = 100*(float(match/length))
print(global_identity)
print("--- %s seconds ---" % (time.time() - start_time))
So, you're trying to check that each of the 5 strings have the same characters in the column? If the characters in the column all match, you increment i, else you increment n.
Your interpretation of the code is correct.
Based on the above, I'd hazard to suggest the following code as a faster alternative.
I suppose that align is a structure like this:
align = [
'AGCTCGCGGAGGCGCTGCT....',
'ACCTCGGAGGGCTGCTGTAC...',
'AGCTCGGAGGGCTGCTGTAC...',
# possibly more ...
]
We try to detect the columns of same characters in it. Above, the first column is AAA (a match), the next is GCG (a mismatch).
def all_equal(items):
"""Returns True iff all items are equal."""
first = items[0]
return all(x == first for x in items)
def compute_match(aligned_sequences):
"""Returns the ratio of same-character columns in ``aligned_sequences``.
:param aligned_sequences: a list of strings or equal length.
"""
match_count = 0
mismatch_count = 0
for chars in zip(*aligned_sequences):
# Here chars is a column of chars,
# one taken from each element of aligned_sequences.
if all_equal(chars):
match_count += 1
else:
mismatch_count += 1
return float(match_count) / float(mismatch_count)
# What would make more sense:
# return float(matches) / len(aligned_sequences[0])
An even shorter version:
def compute_match(aligned_sequences):
match_count = sum(1 for chars in zip(*aligned_sequences) if all_equal(chars))
total = len(aligned_sequences[0])
mismatch_count = total - match_count # Obviously.
return ...
I am trying to find all occurrences of sub-strings in a main string (of all lengths). My function takes one string and then returns a dictionary of every sub-string (which occurs more than once, of course) and how many times it occurs (format of the dictionary: {substring: # of occurrences, ...}). I am using collections.Counter(s) to help me with it.
Here is my function:
from collections import Counter
def patternFind(s):
patterns = {}
for index in range(1, len(s)+1)[::-1]:
d = nChunks(s, step=index)
parts = dict(Counter(d))
patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
return patterns
def nChunks(iterable, start=0, step=1):
return [iterable[i:i+step] for i in range(start, len(iterable), step)]
I have a string, data with about 2500 random letters (in a random order). However, there are 2 strings inserted into it (random points). Say this string is 'TEST'. data.count('TEST') returns 2. However, patternFind(data)['TEST'] gives me a KeyError. Therefore, my program does not detect the two strings in it.
What have I done wrong? Thanks!
Edit: My method of creating testing-instances:
def createNewTest():
n = randint(500, 2500)
x, y = randint(500, n), randint(500, n)
s = ''
for i in range(n):
s += choice(uppercase)
if i == x or i == y: s += "TEST"
return s
Using Regular Expressions
Apart from the count() method you described, regex is an obvious alternative
import re
needle = r'TEST'
haystack = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklagh'
pattern = re.compile(needle)
print len(re.findall(pattern, haystack))
Short Cut
If you need to build a dictionary of substrings, possibly you can do this with only subset of those strings. Assuming you know the needle you are looking for in the data then you only need the dictionary of substrings of data that are the same length of needle. This is very fast.
from collections import Counter
needle = "TEST"
def gen_sub(s, len_chunk):
for start in range(0, len(s)-len_chunk+1):
yield s[start:start+len_chunk]
data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
parts = Counter([sub for sub in gen_sub(data, len(needle))])
print parts[needle]
Brute Force: building dictionary of all substrings
If you need to have a count of all possible substrings, this works but it is very slow:
from collections import Counter
def gen_sub(s):
for start in range(0, len(s)):
for end in range(start+1, len(s)+1):
yield s[start:end]
data = 'khjkzahklahjTESTkahklaghTESTjklajhkhz'
parts = Counter([sub for sub in gen_sub(data)])
print parts['TEST']
Substring generator adapted from this: https://stackoverflow.com/a/8305463/1290420
While jurgenreza has explained why your program didn't work, the solution is still quite slow. If you only examine substrings s for which you know that s[:-1] repeats, you get a much faster solution (typically a hundred times faster and more):
from collections import defaultdict
def pfind(prefix, sequences):
collector = defaultdict(list)
for sequence in sequences:
collector[sequence[0]].append(sequence)
for item, matching_sequences in collector.items():
if len(matching_sequences) >= 2:
new_prefix = prefix + item
yield (new_prefix, len(matching_sequences))
for r in pfind(new_prefix, [sequence[1:] for sequence in matching_sequences]):
yield r
def find_repeated_substrings(s):
s0 = s + " "
return pfind("", [s0[i:] for i in range(len(s))])
If you want a dict, you call it like this:
result = dict(find_repeated_substrings(s))
On my machine, for a run with 2247 elements, it took 0.02 sec, while the original (corrected) solution took 12.72 sec.
(Note that this is a rather naive implementation; using indexes of instead of substrings should be even faster.)
Edit: The following variant works with other sequence types (not only strings). Also, it doesn't need a sentinel element.
from collections import defaultdict
def pfind(s, length, ends):
collector = defaultdict(list)
if ends[-1] >= len(s):
del ends[-1]
for end in ends:
if end < len(s):
collector[s[end]].append(end)
for key, matching_ends in collector.items():
if len(matching_ends) >= 2:
end = matching_ends[0]
yield (s[end - length: end + 1], len(matching_ends))
for r in pfind(s, length + 1, [end + 1 for end in matching_ends if end < len(s)]):
yield r
def find_repeated_substrings(s):
return pfind(s, 0, list(range(len(s))))
This still has the problem that very long substrings will exceed recursion depth. You might want to catch the exception.
The problem is in your nChunks function. It does not give you all the chunks that are necessary.
Let's consider a test string:
s='1test2345test'
For the chunks of size 4 your nChunks function gives this output:
>>>nChunks(s, step=4)
['1tes', 't234', '5tes', 't']
But what you really want is:
>>>def nChunks(iterable, start=0, step=1):
return [iterable[i:i+step] for i in range(len(iterable)-step+1)]
>>>nChunks(s, step=4)
['1tes', 'test', 'est2', 'st23', 't234', '2345', '345t', '45te', '5tes', 'test']
You can see that this way there are two 'test' chunks and your patternFind(s) will work like a charm:
>>> patternFind(s)
{'tes': 2, 'st': 2, 'te': 2, 'e': 2, 't': 4, 'es': 2, 'est': 2, 'test': 2, 's': 2}
here you can find a solution that uses a recursive wrapper around string.find() that searches all the occurences of a substring in a main string.
The collectallchuncks() function returns a defaultdict whith all the substrings as keys and for each substring a list of all the indexes where the substring is found in the main string.
import collections
# Minimum substring size, may be 1
MINSIZE = 3
# Recursive wrapper
def recfind(p, data, pos, acc):
res = data.find(p, pos)
if res == -1:
return acc
else:
acc.append(res)
return recfind(p, data, res+1, acc)
def collectallchuncks(data):
res = collections.defaultdict(str)
size = len(data)
for base in xrange(size):
for seg in xrange(MINSIZE, size-base+1):
chunk = data[base:base+seg]
if data.count(chunk) > 1:
res[chunk] = recfind(chunk, data, 0, [])
return res
if __name__ == "__main__":
data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
allchuncks = collectallchuncks(data)
print 'TEST', allchuncks['TEST']
print 'hklag', allchuncks['hklag']
EDIT: If you just need the number of occurrences of each substring in the main string you can easily obtain it getting rid of the recursive function:
import collections
MINSIZE = 3
def collectallchuncks2(data):
res = collections.defaultdict(str)
size = len(data)
for base in xrange(size):
for seg in xrange(MINSIZE, size-base+1):
chunk = data[base:base+seg]
cnt = data.count(chunk)
if cnt > 1:
res[chunk] = cnt
return res
if __name__ == "__main__":
data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
allchuncks = collectallchuncks2(data)
print 'TEST', allchuncks['TEST']
print 'hklag', allchuncks['hklag']
This is a follow up to a similar question which asked the best way to write
for item in somelist:
if determine(item):
code_to_remove_item
and it seems the consensus was on something like
somelist[:] = [x for x in somelist if not determine(x)]
However, I think if you are only removing a few items, most of the items are being copied into the same object, and perhaps that is slow. In an answer to another related question, someone suggests:
for item in reversed(somelist):
if determine(item):
somelist.remove(item)
However, here the list.remove will search for the item, which is O(N) in the length of the list. May be we are limited in that the list is represented as an array, rather than a linked list, so removing items will need to move everything after it. However, it is suggested here that collections.dequeue is represented as a doubly linked list. It should then be possible to remove in O(1) while iterating. How would we actually accomplish this?
Update:
I did some time testing as well, with the following code:
import timeit
setup = """
import random
random.seed(1)
b = [(random.random(),random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
return (x[1]>.45) and (x[1]<.5)
"""
listcomp = """
c[:] = [x for x in b if tokeep(x)]
"""
filt = """
c = filter(tokeep, b)
"""
print "list comp = ", timeit.timeit(listcomp,setup, number = 10000)
print "filtering = ", timeit.timeit(filt,setup, number = 10000)
and got:
list comp = 4.01255393028
filtering = 3.59962391853
The list comprehension is the asymptotically optimal solution:
somelist = [x for x in somelist if not determine(x)]
It only makes one pass over the list, so runs in O(n) time. Since you need to call determine() on each object, any algorithm will require at least O(n) operations. The list comprehension does have to do some copying, but it's only copying references to the objects not copying the objects themselves.
Removing items from a list in Python is O(n), so anything with a remove, pop, or del inside the loop will be O(n**2).
Also, in CPython list comprehensions are faster than for loops.
If you need to remove item in O(1) you can use HashMaps
Since list.remove is equivalent to del list[list.index(x)], you could do:
for idx, item in enumerate(somelist):
if determine(item):
del somelist[idx]
But: you should not modify the list while iterating over it. It will bite you, sooner or later. Use filter or list comprehension first, and optimise later.
A deque is optimized for head and tail removal, not for arbitrary removal in the middle. The removal itself is fast, but you still have to traverse the list to the removal point. If you're iterating through the entire length, then the only difference between filtering a deque and filtering a list (using filter or a comprehension) is the overhead of copying, which at worst is a constant multiple; it's still a O(n) operation. Also, note that the objects in the list aren't being copied -- just the references to them. So it's not that much overhead.
It's possible that you could avoid copying like so, but I have no particular reason to believe this is faster than a straightforward list comprehension -- it's probably not:
write_i = 0
for read_i in range(len(L)):
L[write_i] = L[read_i]
if L[read_i] not in ['a', 'c']:
write_i += 1
del L[write_i:]
I took a stab at this. My solution is slower, but requires less memory overhead (i.e. doesn't create a new array). It might even be faster in some circumstances!
This code has been edited since its first posting
I had problems with timeit, I might be doing this wrong.
import timeit
setup = """
import random
random.seed(1)
global b
setup_b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
return (x[1]>.45) and (x[1]<.5)
# define and call to turn into psyco bytecode (if using psyco)
b = setup_b[:]
def listcomp():
c[:] = [x for x in b if tokeep(x)]
listcomp()
b = setup_b[:]
def filt():
c = filter(tokeep, b)
filt()
b = setup_b[:]
def forfilt():
marked = (i for i, x in enumerate(b) if tokeep(x))
shift = 0
for n in marked:
del b[n - shift]
shift += 1
forfilt()
b = setup_b[:]
def forfiltCheating():
marked = (i for i, x in enumerate(b) if (x[1] > .45) and (x[1] < .5))
shift = 0
for n in marked:
del b[n - shift]
shift += 1
forfiltCheating()
"""
listcomp = """
b = setup_b[:]
listcomp()
"""
filt = """
b = setup_b[:]
filt()
"""
forfilt = """
b = setup_b[:]
forfilt()
"""
forfiltCheating = '''
b = setup_b[:]
forfiltCheating()
'''
psycosetup = '''
import psyco
psyco.full()
'''
print "list comp = ", timeit.timeit(listcomp, setup, number = 10000)
print "filtering = ", timeit.timeit(filt, setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, setup, number = 10000)
print '\nnow with psyco \n'
print "list comp = ", timeit.timeit(listcomp, psycosetup + setup, number = 10000)
print "filtering = ", timeit.timeit(filt, psycosetup + setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, psycosetup + setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, psycosetup + setup, number = 10000)
And here are the results
list comp = 6.56407690048
filtering = 5.64738512039
forfilter = 7.31555104256
forfiltCheating = 4.8994679451
now with psyco
list comp = 8.0485959053
filtering = 7.79016900063
forfilter = 9.00477004051
forfiltCheating = 4.90830993652
I must be doing something wrong with psyco, because it is actually running slower.
elements are not copied by list comprehension
this took me a while to figure out. See the example code below, to experiment yourself with different approaches
code
You can specify how long a list element takes to copy and how long it takes to evaluate. The time to copy is irrelevant for list comprehension, as it turned out.
import time
import timeit
import numpy as np
def ObjectFactory(time_eval, time_copy):
"""
Creates a class
Parameters
----------
time_eval : float
time to evaluate (True or False, i.e. keep in list or not) an object
time_copy : float
time to (shallow-) copy an object. Used by list comprehension.
Returns
-------
New class with defined copy-evaluate performance
"""
class Object:
def __init__(self, id_, keep):
self.id_ = id_
self._keep = keep
def __repr__(self):
return f"Object({self.id_}, {self.keep})"
#property
def keep(self):
time.sleep(time_eval)
return self._keep
def __copy__(self): # list comprehension does not copy the object
time.sleep(time_copy)
return self.__class__(self.id_, self._keep)
return Object
def remove_items_from_list_list_comprehension(lst):
return [el for el in lst if el.keep]
def remove_items_from_list_new_list(lst):
new_list = []
for el in lst:
if el.keep:
new_list += [el]
return new_list
def remove_items_from_list_new_list_by_ind(lst):
new_list_inds = []
for ee in range(len(lst)):
if lst[ee].keep:
new_list_inds += [ee]
return [lst[ee] for ee in new_list_inds]
def remove_items_from_list_del_elements(lst):
"""WARNING: Modifies lst"""
new_list_inds = []
for ee in range(len(lst)):
if lst[ee].keep:
new_list_inds += [ee]
for ind in new_list_inds[::-1]:
if not lst[ind].keep:
del lst[ind]
if __name__ == "__main__":
ClassSlowCopy = ObjectFactory(time_eval=0, time_copy=0.1)
ClassSlowEval = ObjectFactory(time_eval=1e-8, time_copy=0)
keep_ratio = .8
n_runs_timeit = int(1e2)
n_elements_list = int(1e2)
lsts_to_tests = dict(
list_slow_copy_remove_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
list_slow_copy_keep_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
list_slow_eval_remove_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
list_slow_eval_keep_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
)
for lbl, lst in lsts_to_tests.items():
print()
for fct in [
remove_items_from_list_list_comprehension,
remove_items_from_list_new_list,
remove_items_from_list_new_list_by_ind,
remove_items_from_list_del_elements,
]:
lst_loc = lst.copy()
t = timeit.timeit(lambda: fct(lst_loc), number=n_runs_timeit)
print(f"{fct.__name__}, {lbl}: {t=}")
output
remove_items_from_list_list_comprehension, list_slow_copy_remove_many: t=0.0064229519994114526
remove_items_from_list_new_list, list_slow_copy_remove_many: t=0.006507338999654166
remove_items_from_list_new_list_by_ind, list_slow_copy_remove_many: t=0.006562008995388169
remove_items_from_list_del_elements, list_slow_copy_remove_many: t=0.0076057760015828535
remove_items_from_list_list_comprehension, list_slow_copy_keep_many: t=0.006243691001145635
remove_items_from_list_new_list, list_slow_copy_keep_many: t=0.007145451003452763
remove_items_from_list_new_list_by_ind, list_slow_copy_keep_many: t=0.007032064997474663
remove_items_from_list_del_elements, list_slow_copy_keep_many: t=0.007690364996960852
remove_items_from_list_list_comprehension, list_slow_eval_remove_many: t=1.2495998149970546
remove_items_from_list_new_list, list_slow_eval_remove_many: t=1.1657221479981672
remove_items_from_list_new_list_by_ind, list_slow_eval_remove_many: t=1.2621939050004585
remove_items_from_list_del_elements, list_slow_eval_remove_many: t=1.4632593330024974
remove_items_from_list_list_comprehension, list_slow_eval_keep_many: t=1.1344162709938246
remove_items_from_list_new_list, list_slow_eval_keep_many: t=1.1323430630000075
remove_items_from_list_new_list_by_ind, list_slow_eval_keep_many: t=1.1354237199993804
remove_items_from_list_del_elements, list_slow_eval_keep_many: t=1.3084568729973398
import collections
list1=collections.deque(list1)
for i in list2:
try:
list1.remove(i)
except:
pass
INSTEAD OF CHECKING IF ELEMENT IS THERE. USING TRY EXCEPT.
I GUESS THIS FASTER
As part of my release processes, I have to compare some JSON configuration data used by my application. As a first attempt, I just pretty-printed the JSON and diff'ed them (using kdiff3 or just diff).
As that data has grown, however, kdiff3 confuses different parts in the output, making additions look like giant modifies, odd deletions, etc. It makes it really hard to figure out what is different. I've tried other diff tools, too (meld, kompare, diff, a few others), but they all have the same problem.
Despite my best efforts, I can't seem to format the JSON in a way that the diff tools can understand.
Example data:
[
{
"name": "date",
"type": "date",
"nullable": true,
"state": "enabled"
},
{
"name": "owner",
"type": "string",
"nullable": false,
"state": "enabled",
}
...lots more...
]
The above probably wouldn't cause the problem (the problem occurs when there begin to be hundreds of lines), but thats the gist of what is being compared.
Thats just a sample; the full objects are 4-5 attributes, and some attributes have 4-5 attributes in them. The attribute names are pretty uniform, but their values pretty varied.
In general, it seems like all the diff tools confuse the closing "}" with the next objects closing "}". I can't seem to break them of this habit.
I've tried adding whitespace, changing indentation, and adding some "BEGIN" and "END" strings before and after the respective objects, but the tool still get confused.
If any of your tool has the option, Patience Diff could work a lot better for you. I'll try to find a tool with it (other tha Git and Bazaar) and report back.
Edit: It seems that the implementation in Bazaar is usable as a standalone tool with minimal changes.
Edit2: WTH, why not paste the source of the new cool diff script you made me hack? Here it is, no copyright claim on my side, it's just Bram/Canonical's code re-arranged.
#!/usr/bin/env python
# Copyright (C) 2005, 2006, 2007 Canonical Ltd
# Copyright (C) 2005 Bram Cohen, Copyright (C) 2005, 2006 Canonical Ltd
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
import os
import sys
import time
import difflib
from bisect import bisect
__all__ = ['PatienceSequenceMatcher', 'unified_diff', 'unified_diff_files']
py3k = False
try:
xrange
except NameError:
py3k = True
xrange = range
# This is a version of unified_diff which only adds a factory parameter
# so that you can override the default SequenceMatcher
# this has been submitted as a patch to python
def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
tofiledate='', n=3, lineterm='\n',
sequencematcher=None):
r"""
Compare two sequences of lines; generate the delta as a unified diff.
Unified diffs are a compact way of showing line changes and a few
lines of context. The number of context lines is set by 'n' which
defaults to three.
By default, the diff control lines (those with ---, +++, or ##) are
created with a trailing newline. This is helpful so that inputs
created from file.readlines() result in diffs that are suitable for
file.writelines() since both the inputs and outputs have trailing
newlines.
For inputs that do not have trailing newlines, set the lineterm
argument to "" so that the output will be uniformly newline free.
The unidiff format normally has a header for filenames and modification
times. Any or all of these may be specified using strings for
'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. The modification
times are normally expressed in the format returned by time.ctime().
Example:
>>> for line in unified_diff('one two three four'.split(),
... 'zero one tree four'.split(), 'Original', 'Current',
... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:20:52 2003',
... lineterm=''):
... print line
--- Original Sat Jan 26 23:30:50 1991
+++ Current Fri Jun 06 10:20:52 2003
## -1,4 +1,4 ##
+zero
one
-two
-three
+tree
four
"""
if sequencematcher is None:
import difflib
sequencematcher = difflib.SequenceMatcher
if fromfiledate:
fromfiledate = '\t' + str(fromfiledate)
if tofiledate:
tofiledate = '\t' + str(tofiledate)
started = False
for group in sequencematcher(None,a,b).get_grouped_opcodes(n):
if not started:
yield '--- %s%s%s' % (fromfile, fromfiledate, lineterm)
yield '+++ %s%s%s' % (tofile, tofiledate, lineterm)
started = True
i1, i2, j1, j2 = group[0][3], group[-1][4], group[0][5], group[-1][6]
yield "## -%d,%d +%d,%d ##%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
for tag, i1, i2, j1, j2 in group:
if tag == 'equal':
for line in a[i1:i2]:
yield ' ' + line
continue
if tag == 'replace' or tag == 'delete':
for line in a[i1:i2]:
yield '-' + line
if tag == 'replace' or tag == 'insert':
for line in b[j1:j2]:
yield '+' + line
def unified_diff_files(a, b, sequencematcher=None):
"""Generate the diff for two files.
"""
mode = 'rb'
if py3k: mode = 'r'
# Should this actually be an error?
if a == b:
return []
if a == '-':
file_a = sys.stdin
time_a = time.time()
else:
file_a = open(a, mode)
time_a = os.stat(a).st_mtime
if b == '-':
file_b = sys.stdin
time_b = time.time()
else:
file_b = open(b, mode)
time_b = os.stat(b).st_mtime
# TODO: Include fromfiledate and tofiledate
return unified_diff(file_a.readlines(), file_b.readlines(),
fromfile=a, tofile=b,
sequencematcher=sequencematcher)
def unique_lcs_py(a, b):
"""Find the longest common subset for unique lines.
:param a: An indexable object (such as string or list of strings)
:param b: Another indexable object (such as string or list of strings)
:return: A list of tuples, one for each line which is matched.
[(line_in_a, line_in_b), ...]
This only matches lines which are unique on both sides.
This helps prevent common lines from over influencing match
results.
The longest common subset uses the Patience Sorting algorithm:
http://en.wikipedia.org/wiki/Patience_sorting
"""
# set index[line in a] = position of line in a unless
# a is a duplicate, in which case it's set to None
index = {}
for i in xrange(len(a)):
line = a[i]
if line in index:
index[line] = None
else:
index[line]= i
# make btoa[i] = position of line i in a, unless
# that line doesn't occur exactly once in both,
# in which case it's set to None
btoa = [None] * len(b)
index2 = {}
for pos, line in enumerate(b):
next = index.get(line)
if next is not None:
if line in index2:
# unset the previous mapping, which we now know to
# be invalid because the line isn't unique
btoa[index2[line]] = None
del index[line]
else:
index2[line] = pos
btoa[pos] = next
# this is the Patience sorting algorithm
# see http://en.wikipedia.org/wiki/Patience_sorting
backpointers = [None] * len(b)
stacks = []
lasts = []
k = 0
for bpos, apos in enumerate(btoa):
if apos is None:
continue
# as an optimization, check if the next line comes at the end,
# because it usually does
if stacks and stacks[-1] < apos:
k = len(stacks)
# as an optimization, check if the next line comes right after
# the previous line, because usually it does
elif stacks and stacks[k] < apos and (k == len(stacks) - 1 or
stacks[k+1] > apos):
k += 1
else:
k = bisect(stacks, apos)
if k > 0:
backpointers[bpos] = lasts[k-1]
if k < len(stacks):
stacks[k] = apos
lasts[k] = bpos
else:
stacks.append(apos)
lasts.append(bpos)
if len(lasts) == 0:
return []
result = []
k = lasts[-1]
while k is not None:
result.append((btoa[k], k))
k = backpointers[k]
result.reverse()
return result
def recurse_matches_py(a, b, alo, blo, ahi, bhi, answer, maxrecursion):
"""Find all of the matching text in the lines of a and b.
:param a: A sequence
:param b: Another sequence
:param alo: The start location of a to check, typically 0
:param ahi: The start location of b to check, typically 0
:param ahi: The maximum length of a to check, typically len(a)
:param bhi: The maximum length of b to check, typically len(b)
:param answer: The return array. Will be filled with tuples
indicating [(line_in_a, line_in_b)]
:param maxrecursion: The maximum depth to recurse.
Must be a positive integer.
:return: None, the return value is in the parameter answer, which
should be a list
"""
if maxrecursion < 0:
print('max recursion depth reached')
# this will never happen normally, this check is to prevent DOS attacks
return
oldlength = len(answer)
if alo == ahi or blo == bhi:
return
last_a_pos = alo-1
last_b_pos = blo-1
for apos, bpos in unique_lcs_py(a[alo:ahi], b[blo:bhi]):
# recurse between lines which are unique in each file and match
apos += alo
bpos += blo
# Most of the time, you will have a sequence of similar entries
if last_a_pos+1 != apos or last_b_pos+1 != bpos:
recurse_matches_py(a, b, last_a_pos+1, last_b_pos+1,
apos, bpos, answer, maxrecursion - 1)
last_a_pos = apos
last_b_pos = bpos
answer.append((apos, bpos))
if len(answer) > oldlength:
# find matches between the last match and the end
recurse_matches_py(a, b, last_a_pos+1, last_b_pos+1,
ahi, bhi, answer, maxrecursion - 1)
elif a[alo] == b[blo]:
# find matching lines at the very beginning
while alo < ahi and blo < bhi and a[alo] == b[blo]:
answer.append((alo, blo))
alo += 1
blo += 1
recurse_matches_py(a, b, alo, blo,
ahi, bhi, answer, maxrecursion - 1)
elif a[ahi - 1] == b[bhi - 1]:
# find matching lines at the very end
nahi = ahi - 1
nbhi = bhi - 1
while nahi > alo and nbhi > blo and a[nahi - 1] == b[nbhi - 1]:
nahi -= 1
nbhi -= 1
recurse_matches_py(a, b, last_a_pos+1, last_b_pos+1,
nahi, nbhi, answer, maxrecursion - 1)
for i in xrange(ahi - nahi):
answer.append((nahi + i, nbhi + i))
def _collapse_sequences(matches):
"""Find sequences of lines.
Given a sequence of [(line_in_a, line_in_b),]
find regions where they both increment at the same time
"""
answer = []
start_a = start_b = None
length = 0
for i_a, i_b in matches:
if (start_a is not None
and (i_a == start_a + length)
and (i_b == start_b + length)):
length += 1
else:
if start_a is not None:
answer.append((start_a, start_b, length))
start_a = i_a
start_b = i_b
length = 1
if length != 0:
answer.append((start_a, start_b, length))
return answer
def _check_consistency(answer):
# For consistency sake, make sure all matches are only increasing
next_a = -1
next_b = -1
for (a, b, match_len) in answer:
if a < next_a:
raise ValueError('Non increasing matches for a')
if b < next_b:
raise ValueError('Non increasing matches for b')
next_a = a + match_len
next_b = b + match_len
class PatienceSequenceMatcher_py(difflib.SequenceMatcher):
"""Compare a pair of sequences using longest common subset."""
_do_check_consistency = True
def __init__(self, isjunk=None, a='', b=''):
if isjunk is not None:
raise NotImplementedError('Currently we do not support'
' isjunk for sequence matching')
difflib.SequenceMatcher.__init__(self, isjunk, a, b)
def get_matching_blocks(self):
"""Return list of triples describing matching subsequences.
Each triple is of the form (i, j, n), and means that
a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in
i and in j.
The last triple is a dummy, (len(a), len(b), 0), and is the only
triple with n==0.
>>> s = PatienceSequenceMatcher(None, "abxcd", "abcd")
>>> s.get_matching_blocks()
[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
"""
# jam 20060525 This is the python 2.4.1 difflib get_matching_blocks
# implementation which uses __helper. 2.4.3 got rid of helper for
# doing it inline with a queue.
# We should consider doing the same for recurse_matches
if self.matching_blocks is not None:
return self.matching_blocks
matches = []
recurse_matches_py(self.a, self.b, 0, 0,
len(self.a), len(self.b), matches, 10)
# Matches now has individual line pairs of
# line A matches line B, at the given offsets
self.matching_blocks = _collapse_sequences(matches)
self.matching_blocks.append( (len(self.a), len(self.b), 0) )
if PatienceSequenceMatcher_py._do_check_consistency:
if __debug__:
_check_consistency(self.matching_blocks)
return self.matching_blocks
unique_lcs = unique_lcs_py
recurse_matches = recurse_matches_py
PatienceSequenceMatcher = PatienceSequenceMatcher_py
def main(args):
import optparse
p = optparse.OptionParser(usage='%prog [options] file_a file_b'
'\nFiles can be "-" to read from stdin')
p.add_option('--patience', dest='matcher', action='store_const', const='patience',
default='patience', help='Use the patience difference algorithm')
p.add_option('--difflib', dest='matcher', action='store_const', const='difflib',
default='patience', help='Use python\'s difflib algorithm')
algorithms = {'patience':PatienceSequenceMatcher, 'difflib':difflib.SequenceMatcher}
(opts, args) = p.parse_args(args)
matcher = algorithms[opts.matcher]
if len(args) != 2:
print('You must supply 2 filenames to diff')
return -1
for line in unified_diff_files(args[0], args[1], sequencematcher=matcher):
sys.stdout.write(line)
if __name__ == '__main__':
sys.exit(main(sys.argv[1:]))
Edit 3: I've also made a minimally standalone version of Neil Fraser's Diff Match and Patch, I'd be very interested in a comparison of results for your use case. Again, I claim no copyrights.
Edit 4: I just found DataDiff, which might be another tool to try.
DataDiff is a library to provide
human-readable diffs of python data
structures. It can handle sequence
types (lists, tuples, etc), sets, and
dictionaries.
Dictionaries and sequences will be
diffed recursively, when applicable.
So, I wrote a tool to do unified diffs of JSON files a while ago that might be of some interest.
https://github.com/jclulow/jsondiff
Some examples of input and output for the tool appear on the github page.
You should checkout difflet from substack. It's both a node.js module and command-line utility that does exactly this:
https://github.com/substack/difflet
I know this is a pretty old question, but the python module "JSON Tools" provides another solution for diffing json files:
https://pypi.python.org/pypi/json_tools
https://bitbucket.org/vadim_semenov/json_tools/src/75cc15381188c760badbd5b66aef9941a42c93fa?at=default
Eclipse might do better. Open the two files in an eclipse project, select them both, and right click --> compare --> with each other.
Beyond formatting changes, diffing tool should also order JSON object properties in a stable manner (alphabetically, for example), since the order of properties is semantically meaningless. That is, reordering of properties should not change the meaning of contents.
Other than this, parsing and pretty-printing in a way that puts at most one entry on a single line might allow use of textual diff.
If not, any diff algorithm that works on trees (which is used for xml diffing) should work better.