Is there an elegant way to ignore whitespace in a diff in python (using difflib, or any other module)? Maybe I missed something, but I've scoured the documentation, and was unable to find any explicit support for this in difflib.
My current solution is to just break my text into lists of words, and then diff those:
d.compare(("".join(text1_lines)).split(), ("".join(text2_lines)).split())
The disadvantage of this is that if one wants a report of line-by-line differences, rather than word-by-word, one must merge the output of the diff with the original file text. This is easily doable, but a bit inconvenient.
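One alternative I've been considering (a rough sketch only; collapsing runs of whitespace is just one interpretation of "ignoring whitespace") is to normalize each line before comparing, but report the original lines so the output stays line-by-line:

import difflib

def diff_ignoring_whitespace(lines1, lines2):
    # Compare whitespace-normalized copies of the lines, but print the
    # original lines so the report stays line-by-line.
    norm1 = [" ".join(line.split()) for line in lines1]
    norm2 = [" ".join(line.split()) for line in lines2]
    matcher = difflib.SequenceMatcher(None, norm1, norm2)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for line in lines1[i1:i2]:
                print("  " + line.rstrip("\n"))
        else:
            for line in lines1[i1:i2]:
                print("- " + line.rstrip("\n"))
            for line in lines2[j1:j2]:
                print("+ " + line.rstrip("\n"))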
Assume that I have some text (for example given as a string). Later I am going to "edit" this text, which means that I want to add something somewhere or remove something. In this way I will get another version of the text. However, I do not want to have two strings representing each version of the text, since there are a lot of "repetitions" (similarities) between the two subsequent versions. In other words, the differences between the strings are small, so it makes more sense just to save the differences between them. For example, the first version:
This is my first version of the texts.
The second version:
This is the first version of the text, that I want to use as an example.
I would like to save these two versions as one object (it should not necessarily be XML, I use it just as an example):
This is <removed>my</removed> <added>the</added> first version of the text<removed>s</removed><added>, that I want to use as an example</added>.
Now I want to go further. I want to save all subsequent edits as one object. In other words, I am going to have more than two versions of the text, but I would like to save them as one object such that it is easy to get a given version of the text and easy to find out what are the difference between two subsequent (or any two given) versions.
So, to summarize, my question is: what is the standard way to represent changes in a text, and to work with this representation, using Python?
I would probably go with difflib: https://docs.python.org/2/library/difflib.html
You can use it to represent changes between versions of string and create your own class to store consecutive diffs.
EDIT: I just realised it doesn't really make sense in your use case, as the diffs from difflib essentially store both strings, so you would be better off just storing them all. However, I believe this is the standard (library-wise) way of working with changes in text, so I won't delete this answer.
EDIT2: Although it seems that if you find a way to apply unified_diff to strings this may be your answer. It seems that there is no way to do this with difflib yet: https://bugs.python.org/issue2057
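For completeness, here is a minimal sketch of the store-consecutive-diffs idea (the class name and interface are purely illustrative): keep the original text plus a chain of ndiff deltas, and use difflib.restore to rebuild either side of any delta. As noted above, each delta effectively contains both versions, so this keeps the whole edit history in one object rather than saving space.

import difflib

class VersionedText(object):
    # Keep a chain of ndiff deltas; difflib.restore() can rebuild either
    # side of a delta, so every version is reachable from the chain.
    def __init__(self, text):
        self._deltas = []
        self._current = text.splitlines(True)

    def edit(self, new_text):
        new_lines = new_text.splitlines(True)
        self._deltas.append(list(difflib.ndiff(self._current, new_lines)))
        self._current = new_lines

    def version(self, n):
        # n = 0 is the original text, n = 1 the first edit, and so on.
        if not self._deltas:
            return "".join(self._current)
        if n == 0:
            return "".join(difflib.restore(self._deltas[0], 1))
        return "".join(difflib.restore(self._deltas[n - 1], 2))

Getting the differences between two consecutive versions is then just a matter of looking at the stored delta for that step.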
Using Python 2.6.6.
I was hoping that the re module provided some method of searching that mimicked the way str.find() works, allowing you to specify a start index, but apparently not...
search() lets me find the first match...
findall() will return all (non-overlapping!) matches of a single pattern
finditer() is like findall(), but via an iterator (more efficient)
Here is the situation... I'm data mining in huge blocks of data. For parts of the parsing, regex works great. But once I find certain matches, I need to switch to a different pattern, or even use more specialized parsing to find where to start searching next. If re.search allowed me to specify a starting index, it would be perfect. But in absence of that, I'm looking at:
Using finditer(), but skipping forward until I reach an index that is past where I want to resume using re. Potential problems:
If the embedded binary data happens to contain a match that overlaps a legitimate match just after the binary chunk...
Since I'm not searching for a single pattern, I'd have to juggle multiple iterators, which also has the possibility of a false match hiding the real one.
Slicing, i.e., creating a copy of the remainder of the data each time I want to search again.
This would be robust, but would force a lot of "needless" copying on data that could be many megabytes.
I'd prefer to keep it so that all match locations were indexes into the single original string object, since I may hang onto them for a while and want to compare them. Finding subsequent matches within separate sliced-off copies is a bookkeeping hassle.
Just occurred to me that I may be able to use a "rotating buffer" sort of approach, but haven't thought it through completely. That might introduce a lot of complexity to the code.
Am I missing any obvious alternatives? Not sure if there would be a way to wrap a huge string with a class that would serve slices... Or a slicing sort of iterator or "string cursor" idiom?
Use a two-pass approach. The first pass uses the first regex to find the "interesting bits" and outputs those offsets into a separate file. You didn't say if you can tell where the "end" of each interesting segment is, but you'd include that too if available. The second pass uses the offsets to load sections of the file as independent strings and then applies whatever secondary regex you like on each smaller string.
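A rough sketch of that two-pass idea (the two patterns below are just stand-ins for whatever the real data needs, and I'm assuming the start of the next interesting region can serve as the end of the current one); the reported offsets stay relative to the original string:

import re

INTERESTING = re.compile(r"BEGIN RECORD")   # placeholder pattern
DETAIL = re.compile(r"field=(\w+)")         # placeholder secondary pattern

def two_pass(data):
    # Pass 1: record the offsets of the interesting regions.
    offsets = [m.start() for m in INTERESTING.finditer(data)]
    offsets.append(len(data))  # sentinel that closes the last region
    # Pass 2: run the secondary pattern on each region by itself.
    results = []
    for start, end in zip(offsets, offsets[1:]):
        chunk = data[start:end]  # copies one region at a time, not the whole remainder
        for m in DETAIL.finditer(chunk):
            results.append((start + m.start(), m.group(1)))
    return results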
I'm not certain that regex is the best approach for this, but it seems to be fairly well suited. Essentially I'm currently parsing some PDFs using pdfminer, and the drawback is that these PDFs are exported PowerPoint slides, which means that all animations show up as fairly long repeated copies of strings. Ideally I would like just one copy of each of these strings instead of a copy for each stage of an animation. Right now the regex pattern I'm using is this:
re.sub(r"([\w^\w]{10,})\1{1,}", "\1", string)
For some reason though, this doesn't seem to change the input string. I feel like for some reason python isn't recognizing the capture group, but I'm not sure how to remedy that issue. Any thoughts appreciated.
Examples:
I would like this
text to be
reduced
I would like this
text to be
reduced
output:
I would like this
text to be
reduced
Update:
To get this to pass the pumping lemma I had to specifically make the assertion that all duplicates were adjacent. This was implied before, but I am now making it explicit to ensure that a solution is possible.
Regexps are not the right tool for that task. They are based on the theory of regular languages, and they can't, strictly speaking, match a string that contains duplicates and remove those duplicates. You may find a course on automata and regexps interesting to read on the topic.
I think Josay's suggestion can be efficient and smart, but I think I have a simpler and more Pythonic solution, though it has its limits. You can split your string into a list of lines, and pass it through a set():
>>> s = """I would like this
... text to be
...
... reduced
... I would like this
... text to be
...
... reduced"""
>>> print "\n".join(set(s.splitlines()))
I would like this
text to be
reduced
>>>
The only thing with that solution is that you will lose the original order of the lines (the example being a pretty good counterexample). Also, if you have the same line in two different contexts, you will end up having only one copy of that line.
To fix the first problem, you may have to then iterate over your original string a second time to put that set back in order, or simply use an ordered set.
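For example, a second pass over the lines that keeps only the first occurrence of each while preserving order (continuing the session above; nothing here beyond plain sets and lists):

>>> seen = set()
>>> ordered = []
>>> for line in s.splitlines():
...     if line not in seen:
...         seen.add(line)
...         ordered.append(line)
...
>>> print "\n".join(ordered)
I would like this
text to be

reduced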
If you have some symbol separating each slide, it would help you merge only the duplicates, fixing the second problem of that solution.
Otherwise a more sophisticated algorithm would be needed, so you can take into account proximity and context. For that a suffix tree could be a good idea, and there are python libraries for that (cf that SO answer).
edit:
using your algorithm I could make it work, by adding multiline support and allowing spaces and newlines in the matched text:
>>> re.match(r"([\w \n]+)\n\1", string, re.MULTILINE).groups()
('I would like this\ntext to be\n\nreduced',)
Though, AFAICT, the \1 notation in the matching part is not strictly regular regular-expression syntax, but an extension. But it's getting late here, and I may as well be totally wrong. Maybe I should reread those courses? :-)
I guess the regexp engine's pushdown automaton is able to push the captured match, because it is only one long multiline string that it can later pop to match against. Though I'd expect that to have side effects...
Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However, I have now realized that my approach is maybe not the best, and I'm looking for some improvement.
A cable consists of three parts. The head has some RFC2822-style format; this usually parses correctly. The text part has a more informal specification. For instance, there is a REF line. This should start with REF:, but I found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are spaces in front, different cases, and so on. The text part is mostly plain text with some specially formatted headings.
We want to recognize each field (date, REF, etc.) and put it into a database. We chose Python's SimpleParse. At the moment the parser stops at each field which it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields have some kind of order. When the parser doesn't recognize a field, it should add some 'not recognized' blob to the current field and go on. (Or maybe you have some better approach here.)
What kind of parser or other kind of solution would you suggest? Is something better around?
Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/
I haven't looked at the cables, but let's take a similar problem and consider the options: let's say you wanted to write a parser for RFCs. There's an RFC for the formatting of RFCs, but not all RFCs follow it.
If you write a strict parser, you'll run into the situation you've already run into - the outliers will halt your progress - in which case you've got two options:
Split them into two groups, the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low-hanging fruit, and figure out, based on the number of outliers, what the best options are (hand processing, an outlier parser, etc.).
If the two groups are roughly equal in size, or there are more outliers than standard formats, write a flexible parser. In this case regular expressions are going to be more beneficial to you, as you can process an entire file looking for a series of flexible regexes; if one of the regexes fails, you can easily generate the outlier list. And since you run the search against a series of regexes, you can build a matrix of passes/fails for each regex.
For 'fuzzy' data where some follow the format and some do not, I much prefer the regex approach. That's just me though. (Yes, it is slower, but having to engineer the relationship between each match segment so that you have a single query (or parser) that fits every corner case is a nightmare when dealing with human-generated input.)
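To make that concrete, here is a rough sketch of the flexible, per-field approach (the field names and patterns below are made up, apart from the REF pattern, which is adapted from the question); anything that doesn't match just becomes a 'not recognized' blob, so one bad field never halts the run:

import re

# Illustrative patterns only -- real cables will need more and better ones.
FIELD_PATTERNS = [
    ("date", re.compile(r"^DTG:\s*(.+)$", re.MULTILINE)),
    ("ref", re.compile(r"^\s*[Rr][Ee][Ff][Ss: ]\s*(.+)$", re.MULTILINE)),
    ("subject", re.compile(r"^SUBJECT:\s*(.+)$", re.MULTILINE | re.IGNORECASE)),
]

def extract_fields(cable_text):
    record = {}
    for name, pattern in FIELD_PATTERNS:
        m = pattern.search(cable_text)
        # Unrecognized fields are kept as a marker instead of stopping the parse.
        record[name] = m.group(1).strip() if m else "<not recognized>"
    return record

Collecting which patterns matched for which cables also gives you the pass/fail matrix mentioned above, so the outliers are easy to list afterwards.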
I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:
import csv

hashes = []
for row in csv.reader(open('old.csv', 'rb')):
    hashes.append(hash(str(row)))

for row in csv.reader(open('new.csv', 'rb')):
    if hash(str(row)) not in hashes:
        print 'Not found'
But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, so I went with the hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compress Python lists so that they can be compared in terms of simple equality to other lists? I.e., a hashing system that actually works? Bonus points: why didn't the above method work?
EDIT:
Thanks for all the great suggestions! Let me clarify some things. "Miserable failure" means that two rows which have the exact same data, after being read in by csv.reader, are not hashing to the same value after calling str on the list object. I shall try hashlib, as suggested below. I also cannot hash the raw file lines, since the two lines below contain the same data but different characters on the line:
1, 2.3, David S, Monday
1, 2.3, "David S", Monday
I am also already doing things like string stripping to make the data more uniform, but it seems to no avail. I'm not looking for an extremely smart diff logic, i.e. that 0 is the same as 0.0.
EDIT 2:
Problem solved. What basically worked is that I needed to do a bit more pre-formatting, like converting ints and floats and so forth, AND I needed to change my hashing function. Both of these changes seemed to do the job for me.
It's hard to give a great answer without knowing more about your constraints, but if you can store a hash for each line of each file then you should be ok. At the very least you'll need to be able to store the hash list for one file, which you then would sort and write to disk, then you can march through the two sorted lists together.
The only reason why I can imagine the above not working as written would be because your hashing function doesn't always give the same output for a given input. You could test that a second run through old.csv generates the same list. It may have to do with errant spaces, tabs instead of spaces, differing capitalization, and so on.
Mind, even if the hashes are equivalent you don't know that the lines match; you only know that they might match. You still need to check that the candidate lines do match. (You may also get the situation where more than one line in the input file generates the same hash, so you'll need to handle that as well.)
After you fill your hashes variable, you should consider turning it into a set (hashes = set(hashes)) so that your lookups can be faster than linear.
Given the loose syntactic definition of CSV, it is possible for two rows to be semantically equal while being lexically different. The various Dialect definitions give some clue as to how two rows could be individually well-formed but incommensurable. And this example shows how they could be in the same dialect and not be string-equivalent:
0, 0
0, 0.0
More information would help yield a better answer to your question.
More information would be needed on what exactly "failing miserably" means. If you are just not getting a correct comparison between the two, perhaps hashlib might solve that.
I've run into trouble previously when using the built-in hash function, and solved it with hashlib.
Edit: As someone suggested on another post, the issue could be with assuming that the two files are required to have each line be EXACTLY the same. You might want to try parsing the csv fields and appending them to a string with identical formatting (maybe trim spaces, force lowercase, etc) before computing the hash.
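A rough sketch of that idea, modelled on the loop from the question (the strip/lowercase normalization and the choice of hashlib.sha1 are just assumptions; adjust them to whatever makes your rows "identically formatted"):

import csv
import hashlib

def row_key(row):
    # Normalize the parsed fields, then hash the result with a stable digest.
    normalized = "|".join(field.strip().lower() for field in row)
    return hashlib.sha1(normalized).hexdigest()

old_keys = set()
for row in csv.reader(open('old.csv', 'rb')):
    old_keys.add(row_key(row))

for row in csv.reader(open('new.csv', 'rb')):
    if row_key(row) not in old_keys:
        print 'Not found'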
I'm pretty sure that the "failing miserably" line refers to a failure in time that comes from your current algorithm being O(N^2), which is quite bad for how big your files are. As has been mentioned, you can use a set to alleviate this problem (it will become O(N)), or, if you aren't able to do that for some reason, you can sort the list of hashes and use a binary search on it (it will become O(N log N), which is also doable). You can use the bisect module if you go the binary search route.
Also, it has been mentioned that you may have the problem of a clash in the hashes: two lines yielding the same hash when the lines aren't exactly the same. If you discover that this is a problem that you are experiencing, you will have to store info with each hash about where to seek the line corresponding to the hash in the old.csv file and then seek the line out and compare the two lines.
An alternative to your current method is to sort the two files beforehand (using some sort of merge sort to disk perhaps or shell sort) and, keeping pointers to lines in each file, compare the two lines. Check if they match, and if not then advance the line that is measured as being lesser. This algorithm is also O(N log N) as long as an O(N log N) method is used for sorting. The sorting could also be done by putting each file into a database and having the database sort them.
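A small sketch of that merge-style walk, assuming both sequences (lines or hashes) have already been sorted by one of the methods above:

def missing_rows(sorted_old, sorted_new):
    # Walk the two sorted sequences in step; yield entries of sorted_new
    # that never appear in sorted_old.
    i = 0
    for row in sorted_new:
        while i < len(sorted_old) and sorted_old[i] < row:
            i += 1
        if i >= len(sorted_old) or sorted_old[i] != row:
            yield row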
You need to say what your problem really is. Your description "I need to ensure that a row from one file does not appear in the other file" is consistent with the body of your second loop being if hash(...) in hashes: print "Found (an interloper)" rather than what you have.
We can't tell you "why didn't the above method work" because you haven't told us what the symptoms of "failed miserably" and "didn't work" are.
Have you perhaps considered running a sort (if possible)? You'll have to go over the data twice, of course, but it might solve the memory problem.
This is likely a problem with (mis)using hash. See this SO question; as the answers there point out, you probably want hashlib.