Compare two documents with slight differences in Python

I have two documents that are mostly the same, but with some small differences I want to ignore. Specifically, I know that one has hex values written as "0xFFFFFFFF" while the other has them just as "FFFFFFFF".
Basically, these two documents are lists of variables: their values, their locations in memory, sizes, etc.
But another problem is that they are not in the same order either.
I tried a few things, one being to just pack them all up in two lists of lists and check whether the lists of lists have counterparts in each other, but with the number of variables being almost 100,000, the time it takes to do this is ridiculous (on the order of nearly an hour), so that isn't going to work. I'm not very seasoned in Python, or even the Pythonic way of doing things, so I'm sorry if there is a quick and easy way to do this.
I've read a few other similar questions, but they all assume the files are 100% identical, among other things that aren't true in my case.
Basically, I have two .txts that have series of lines that look like:
***************************************
Variable: Var_name1
Size: 4
Address: 0x00FF00F0 .. 0x00FF00F3
Description: An awesome variable
..
***************************************
I don't care if the Descriptions are different; I just want to make sure that every variable has the same length and is in the same place, address-wise, and if there are any differences, I want to see them. I also want to be sure that every variable in one is present in the other.
And again, the addresses in the first file are written with the hex radix and in the second one without it. And they are in a different order.
--- Output ---
I don't really care about the output's format as long as it is human readable. Ideally, it'd be a .txt document that said something like:
"Var_name1 does not exist in list two"
"Var_name2 has a different size. (Size1, Size2)"
"Var_name4 is located in a different place. (Loc1, Loc2)"

COMPLETE RE-EDIT
[My initial suggestion was to use sets, but further discussion via the comments made me realize that that was nonsense, and that a dictionary was the real solution.]
You want a dictionary, keyed on variable name, where the value is a list, a tuple, a nested dictionary, or even an object containing the size and address. You can add each variable name to the dictionary and update the values as needed.
For comparing the addresses, a regex would do it, but you can probably get by with less overhead: Python's substring check is the in operator (there is no str.contains), or you can sidestep the prefix issue entirely by parsing each address with int(addr, 16), which accepts the value with or without the "0x" prefix.
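Here is a minimal sketch of that approach, assuming the record format shown above (the file names are placeholders). Parsing addresses with int(x, 16) normalizes away the radix difference, and the dictionary makes each lookup O(1), so 100,000 variables should take seconds rather than an hour:

def parse_vars(path):
    # Parse records like the sample above into {name: (size, (lo, hi))}.
    variables = {}
    name = size = addr = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Variable:"):
                name = line.split(":", 1)[1].strip()
            elif line.startswith("Size:"):
                size = int(line.split(":", 1)[1])
            elif line.startswith("Address:"):
                # int(x, 16) parses "0x00FF00F0" and "00FF00F0" alike.
                lo, hi = line.split(":", 1)[1].split("..")
                addr = (int(lo, 16), int(hi, 16))
            elif line.startswith("***") and name is not None:
                variables[name] = (size, addr)
                name = size = addr = None
    if name is not None:  # last record may lack a trailing separator line
        variables[name] = (size, addr)
    return variables

def diff_vars(old, new, report_path):
    with open(report_path, "w") as report:
        for name, (size, addr) in old.items():
            if name not in new:
                report.write(f"{name} does not exist in list two\n")
                continue
            size2, addr2 = new[name]
            if size != size2:
                report.write(f"{name} has a different size. ({size}, {size2})\n")
            if addr != addr2:  # addresses compared numerically, so order of radix doesn't matter
                report.write(f"{name} is located in a different place. ({addr}, {addr2})\n")
        for name in new:
            if name not in old:
                report.write(f"{name} does not exist in list one\n")

diff_vars(parse_vars("vars_a.txt"), parse_vars("vars_b.txt"), "report.txt")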

Related

list or dict? To verify multiple outputs

I'm trying to figure out if I should make my output a list or dictionary? Which one would be easy to verify when you are trying to match 3 different outputs with millions of lines?
I'm going to take two different outputs and verify three pieces for each line.
Problem is, later I'll have an output C which I want to verify against outputs A and B. I'm leaning towards a dictionary?
For example (bold is what I'll be matching):
output A:
10.1.1.1:80 10.2.1.1:81 10.1.1.1:80 10.3.1.1:81 etc etc etc name
...
...
...
output B:
name etc etc etc etc 10.1.1.1/16 10.2.1.1/16
...
...
...
I'm trying to figure out if I should make my output a list or dictionary? Which one would be easy to verify when you are trying to match 3 different outputs with millions of lines?
The key difference between a list and a dictionary is the way you access your data: with a dictionary you use keys, whereas with a list you use a sequence of indexes to get at your data. Which one should you use here? This is sometimes a performance issue and sometimes a matter of code clarity. I don't think the latter applies to your case; performance would be your greatest asset here, because you're dealing with millions of lines to be processed and matched.
There have been many internal optimizations to Python's objects, so it's better to focus on what you're trying to achieve than on the objects themselves. I could very easily recommend dictionaries, but would that be a good recommendation for later developments in your code? I don't know; unless we work on that specific code, any recommendation would be vague.
If you care about performance, this might help:
Python: List vs Dict for look up table
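To make the lookup-cost difference concrete, here is a small sketch (the data and sizes are made up for illustration): a membership test against a list scans elements one by one, while a set or dict jumps straight to the entry via its hash.

import timeit

items = [f"10.1.{i // 256}.{i % 256}:80" for i in range(100_000)]
as_list = items
as_set = set(items)

needle = items[-1]  # worst case for the list: the scan runs to the very end
print(timeit.timeit(lambda: needle in as_list, number=100))  # O(n) per lookup
print(timeit.timeit(lambda: needle in as_set, number=100))   # O(1) average per lookup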

Why are Python dictionaries NOT stored in the order they were created? [duplicate]

This question already has answers here: Why is the order in dictionaries and sets arbitrary?
Just curious more than anything else, but why isn't a dictionary such as the one below ordered the same as it was created? Yet when I print out test, it returns the same order from then on...
test = {'one':'1', 'two':'2', 'three':'3', 'four':'4'}
It's not that I need them ordered, but it's just been on my mind for a while as to what is occurring here.
The only thing I've found on this is a quote from this article:
Python uses complex algorithms to determine where the key-value pairs are stored in a dictionary.
But what are these "complex algorithms" and why?
Python needs to be able to access D[thing] quickly.
If it stores the values in the order that it receives them, then when you ask it for D[thing], it doesn't know in advance where it put that value. It has to go and find where the key thing appears and then find that value. Since it has no control over the order these are received, this would take about N/2 steps on average where N is the number of keys it's received.
But if instead it has a function (called a hash) that can turn thing into a location in memory, it can quickly take thing, calculate that location, and check in that spot of memory. Of course, it's got to do a bit more overhead: checking that D[thing] has actually been defined, and checking for those rare cases where you may have defined D[thing1] and D[thing2] where the hashes of thing1 and thing2 happen to be the same (in which case a "collision" occurs and Python has to figure out a new place to put one of them).
So for your example, you might expect that when you search for test['four'] it just goes to the last entry in a list it's stored and says "aha, that's '4'." But it can't just do that: how does it know that four corresponds to the last entry of the list? It could have come in any order, so it would have to create some other data structure which allows it to quickly tell that four was the last entry. This would take a lot of overhead.
It would be possible to make it output things in the order they were entered, but that would still require additional overhead tracking the order things were entered.
For curiosity's sake, if you want an ordered dictionary, use OrderedDict.
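For illustration, a minimal sketch (note that since Python 3.7 the built-in dict is also guaranteed to preserve insertion order, so the behavior in the question only applies to older interpreters):

from collections import OrderedDict

# An OrderedDict remembers the order in which keys were first inserted.
test = OrderedDict([('one', '1'), ('two', '2'), ('three', '3'), ('four', '4')])
for key, value in test.items():
    print(key, value)  # always one, two, three, four -- insertion order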

Intersection of Two Lists Of Strings

I had an interview question along these lines:
Given two lists of unordered customers, return a list of the intersection of the two lists. That is, return a list of the customers that appear in both lists.
Some things I established:
Assume each customer has a unique name
If the name is the same in both lists, it's the same customer
The names are of the form first name last name
There's no trickery of II's, Jr's, weird characters, etc.
I think the point was to find an efficient algorithm/use of data structures to do this as efficiently as possible.
My progress went like this:
Read one list in to memory, then read the other list one item at a time to see if there is a match
Alphabetize both lists then start at the top of one list and see if each item appears in the other list
Put both lists into ordered lists, then use the shorter list to check item by item (that way, if one list has 2 items, you only check those 2 items)
Put one list into a hash, and check for the existence of keys from the other list
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Any other tricks to do this efficiently?
Side note, this question was in python, and I just read about sets, which seem to do this as efficiently as possible. Any idea what the data structure/algorithm of sets is?
It really doesn't matter how it's implemented... but I believe it is implemented in C, so it is fast. Something like
set([1, 2, 3, 4, 5, 6]).intersection([1, 2, 5, 9])
is likely what they wanted.
In Python, readability counts for a lot! And set operations in Python are used extensively and well vetted...
that said another pythonic way of doing it would be
list_new = [itm for itm in listA if itm in listB]
or
list_new = list(filter(lambda itm: itm in listB, listA))
(wrapped in list() because filter returns a lazy iterator in Python 3)
basically, I believe they were testing whether you were familiar with Python, not whether you could implement the algorithm, since they asked a question that is so well suited to Python
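Spelled out with made-up names (building each set is O(n) and each membership probe averages O(1), so the whole intersection is linear; the two one-liners above are O(n*m) unless listB is first turned into a set):

list_a = ["Ann Smith", "Bob Jones", "Cara Diaz"]  # placeholder customer names
list_b = ["Cara Diaz", "Dan Wu", "Ann Smith"]

common = set(list_a) & set(list_b)  # same as set(list_a).intersection(list_b)
print(sorted(common))               # ['Ann Smith', 'Cara Diaz']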
Put one list into a bloom filter and use that to filter the second list.
Put the filtered second list into a bloom filter and use that to filter the first list.
Sort the two lists and find the intersection by one of the methods above.
The benefit of this approach (besides letting you use a semi-obscure data structure correctly in an interview) is that it doesn't require any O(n) storage until after you have (with high probability) reduced the problem size.
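Here is a minimal bloom-filter sketch of the first step (the bit-array size and hash count are arbitrary here; a real implementation would derive them from the expected element count and the target false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes independent bit positions by salting one strong hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = BloomFilter()
for name in list_a:  # list_a, list_b: the two customer lists
    bloom.add(name)
candidates = [name for name in list_b if bloom.might_contain(name)]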
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Maybe they would just keep asking that until you run out of answers.
http://code.google.com/p/python-bloom-filter/ is a python implementation of bloom filters.

how to store key-value as well as value-key in Python?

I have this somewhat big data structure that stores pairs of data. The individual data is tiny and readily hashable, and there are something like a few hundred thousand data points in there.
At first, this was a simple dict that was accessed only by keys. Later on, however, I discovered that I also needed to access it by value, that is, get the key for a certain value. Since this was done somewhat less often (~1/10) than access by key, I naïvely implemented it by simply iterating over all the dict's items(), which proved a bit sluggish at a few hundred thousand calls per second. It is about 500 times slower.
So my next idea was to just save the reverse dict, too. This seems a rather inelegant solution, however, so I turn to you guys for help.
Do you know any data structure in Python that stores data pairs that can be accessed by either data point of the pair?
You could try bidict.
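If pulling in a dependency isn't an option, the reverse-dict idea can at least be made less inelegant by hiding it behind a small wrapper. A sketch, assuming values are unique and hashable as the question implies:

class TwoWayDict:
    # Keeps a forward and a reverse mapping in sync on every update.
    def __init__(self):
        self.fwd = {}
        self.rev = {}

    def put(self, key, value):
        # Drop any stale pairings first so the two dicts never disagree.
        if key in self.fwd:
            del self.rev[self.fwd[key]]
        if value in self.rev:
            del self.fwd[self.rev[value]]
        self.fwd[key] = value
        self.rev[value] = key

    def by_key(self, key):
        return self.fwd[key]

    def by_value(self, value):
        return self.rev[value]

pairs = TwoWayDict()
pairs.put("var_a", 0x00FF00F0)
print(pairs.by_value(0x00FF00F0))  # "var_a" in O(1), with no items() scan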

Efficient and accurate way to compact and compare Python lists?

I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:
import csv

hashes = []
for row in csv.reader(open('old.csv', 'rb')):
    hashes.append(hash(str(row)))

for row in csv.reader(open('new.csv', 'rb')):
    if hash(str(row)) not in hashes:
        print 'Not found'
But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, and thus I went with the hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compress Python lists so that they can be compared in terms of simple equality to other lists? I.e. a hashing system that actually works? Bonus points: why didn't the above method work?
EDIT:
Thanks for all the great suggestions! Let me clarify some things. "Miserable failure" means that two rows that have the exact same data, after being read in by the csv.reader object, are not hashing to the same value after calling str on the list object. I shall try hashlib, per some suggestions below. I also cannot hash the raw file, since the two lines below contain the same data but different characters on the line:
1, 2.3, David S, Monday
1, 2.3, "David S", Monday
I am also already doing things like string stripping to make the data more uniform, but seemingly to no avail. I'm not looking for extremely smart diff logic, i.e. logic that treats 0 as the same as 0.0.
EDIT 2:
Problem solved. What basically worked is that I needed to do a bit more pre-formatting, like converting ints and floats and so forth, AND I needed to change my hashing function. Both of these changes seemed to do the job for me.
It's hard to give a great answer without knowing more about your constraints, but if you can store a hash for each line of each file then you should be OK. At the very least you'll need to be able to store the hash list for one file, which you would then sort and write to disk; then you can march through the two sorted lists together.
The only reason I can imagine the above not working as written would be that your hashing function doesn't always give the same output for a given input. You could test that a second run through old.csv generates the same list. It may have to do with errant spaces, tabs instead of spaces, differing capitalization, "automatic" formatting changes, and the like.
Mind, even if the hashes are equivalent you don't know that the lines match; you only know that they might match. You still need to check that the candidate lines do match. (You may also get the situation where more than one line in the input file generates the same hash, so you'll need to handle that as well.)
After you fill your hashes variable, you should consider turning it into a set (hashes = set(hashes)) so that your lookups can be faster than linear.
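In other words (keeping the question's Python 2 syntax; the set costs one O(n) pass up front, after which each not in probe averages O(1) instead of a full scan):

hashes = set(hashes)  # one-time conversion
for row in csv.reader(open('new.csv', 'rb')):
    if hash(str(row)) not in hashes:  # now an average O(1) lookup
        print 'Not found'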
Given the loose syntactic definition of CSV, it is possible for two rows to be semantically equal while being lexically different. The various Dialect definitions give some clue as to how two rows could be individually well-formed but incommensurable. And this example shows how they could be in the same dialect and not string-equivalent:
0, 0
0, 0.0
More information would help yield a better answer to your question.
More information would be needed on what exactly "failing miserably" means. If you are just not getting a correct comparison between the two, perhaps hashlib might solve that.
I've run into trouble previously when using the built-in hash() function, and solved it with hashlib.
Edit: As someone suggested on another post, the issue could be with assuming that each line of the two files must be EXACTLY the same. You might want to try parsing the CSV fields and appending them to a string with identical formatting (maybe trim spaces, force lowercase, etc.) before computing the hash.
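A sketch of that normalize-then-hash idea in Python 3 (the specific cleanup, trimming and lowercasing each field, is an assumption about what counts as "identical" for this data):

import csv
import hashlib

def row_fingerprint(row):
    # Hash a canonical form of the row so quoting and whitespace don't matter.
    # "\x1f" (unit separator) keeps ["a,b"] distinct from ["a", "b"].
    canonical = "\x1f".join(field.strip().lower() for field in row)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

with open("old.csv", newline="") as f:
    seen = {row_fingerprint(row) for row in csv.reader(f)}

with open("new.csv", newline="") as f:
    for row in csv.reader(f):
        if row_fingerprint(row) not in seen:
            print("Not found:", row)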
I'm pretty sure that the "failing miserably" line refers to a failure in time that comes from your current algorithm being O(N^2), which is quite bad given how big your files are. As has been mentioned, you can use a set to alleviate this problem (it becomes O(N)), or, if you aren't able to do that for some reason, you can sort the list of hashes and use a binary search on it (O(N log N)), which is also doable. You can use the bisect module if you go the binary-search route.
Also, it has been mentioned that you may have the problem of a clash in the hashes: two lines yielding the same hash when the lines aren't exactly the same. If you discover that this is a problem you are experiencing, you will have to store, with each hash, info about where to find the corresponding line in the old.csv file, and then seek that line out and compare the two lines.
An alternative to your current method is to sort the two files beforehand (using an on-disk merge sort, perhaps, or shell sort) and then, keeping a pointer to a line in each file, compare the two lines: check if they match, and if not, advance the line that compares as lesser. This algorithm is also O(N log N), as long as an O(N log N) method is used for the sorting. The sorting could also be done by putting each file into a database and having the database sort them.
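A sketch of that lockstep walk, assuming both files have already been sorted (externally, e.g. with the Unix sort command):

def diff_sorted(path_a, path_b):
    # Advance whichever side compares lesser; equal lines advance both.
    with open(path_a) as fa, open(path_b) as fb:
        line_a, line_b = fa.readline(), fb.readline()
        while line_a and line_b:
            if line_a == line_b:
                line_a, line_b = fa.readline(), fb.readline()
            elif line_a < line_b:
                print("only in", path_a + ":", line_a.rstrip())
                line_a = fa.readline()
            else:
                print("only in", path_b + ":", line_b.rstrip())
                line_b = fb.readline()
        # Whatever remains on one side exists only in that file.
        for line in [line_a, *fa]:
            if line:
                print("only in", path_a + ":", line.rstrip())
        for line in [line_b, *fb]:
            if line:
                print("only in", path_b + ":", line.rstrip())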
You need to say what your problem really is. Your description "I need to ensure that a row from one file does not appear in the other file" is consistent with the body of your second loop being if hash(...) in hashes: print "Found (an interloper)" rather than what you have.
We can't tell you "why didn't the above method work" because you haven't told us what the symptoms of "failed miserably" and "didn't work" are.
Have you perhaps considered running a sort (if possible)? You'll have to go over the data twice, of course, but it might solve the memory problem.
This is likely a problem with (mis)using hash. See this SO question; as the answers there point out, you probably want hashlib.
