I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:
import csv
hashes = []
for row in csv.reader(open('old.csv', 'rb')):
    hashes.append(hash(str(row)))

for row in csv.reader(open('new.csv', 'rb')):
    if hash(str(row)) not in hashes:
        print 'Not found'
But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, and thus I went with hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compress Python lists so that they can be compared in terms of simple equality to other lists? I.e. a hashing system that actually works? Bonus points: why didn't the above method work?
EDIT:
Thanks for all the great suggestions! Let me clarify some things. "Miserable failure" means that two rows that have exactly the same data, after being read in by csv.reader, are not hashing to the same value after calling str on the list object. I shall try hashlib, per some of the suggestions below. I also cannot hash the raw file, since two lines can contain the same data but different characters on the line:
1, 2.3, David S, Monday
1, 2.3, "David S", Monday
I am also already doing things like string stripping to make the data more uniform, but to no avail. I'm not looking for extremely smart diff logic, e.g. treating 0 as the same as 0.0.
EDIT 2:
Problem solved. What basically worked is that I needed to do a bit more pre-formatting, like converting ints and floats, and so forth, and I needed to change my hashing function. Both of these changes together did the job for me.
It's hard to give a great answer without knowing more about your constraints, but if you can store a hash for each line of each file then you should be OK. At the very least you'll need to be able to store the hash list for one file, which you would then sort and write to disk; you can then march through the two sorted lists together.
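A minimal sketch of that idea, assuming the same old.csv/new.csv files as in the question and hashing a canonical join of the parsed fields (rather than str(row)) so that quoting and whitespace quirks don't matter:

import csv
import hashlib

def row_hashes(filename):
    # Hash a canonical join of the stripped fields so the raw-line formatting is irrelevant.
    hashes = []
    for row in csv.reader(open(filename, 'rb')):
        key = '|'.join(field.strip() for field in row)
        hashes.append(hashlib.md5(key).hexdigest())
    return hashes

old_hashes = sorted(row_hashes('old.csv'))
new_hashes = sorted(row_hashes('new.csv'))

# March through the two sorted lists together, merge-style.
i = 0
for h in new_hashes:
    while i < len(old_hashes) and old_hashes[i] < h:
        i += 1
    if i == len(old_hashes) or old_hashes[i] != h:
        print 'Not found'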
The only reason I can imagine the above not working as written is that your hashing function doesn't always give the same output for a given input. You could test that a second run through old.csv generates the same list. It may have to do with errant spaces, tabs instead of spaces, differing capitalization, and so on.
Mind, even if the hashes are equivalent you don't know that the lines match; you only know that they might match. You still need to check that the candidate lines do match. (You may also get the situation where more than one line in the input file generates the same hash, so you'll need to handle that as well.)
After you fill your hashes variable, you should consider turning it into a set (hashes = set(hashes)) so that your lookups can be faster than linear.
Given the loose syntactic definition of CSV, it is possible for two rows to be semantically equal while being lexically different. The various Dialect definitions give some clue as to how two rows could be individually well-formed but incommensurable. And this example shows how they could be in the same dialect and not be string equivalent:
0, 0
0, 0.0
More information would help yield a better answer to your question.
More information would be needed on what exactly "failing miserably" means. If you are just not getting a correct comparison between the two, perhaps hashlib might solve that.
I've run into trouble previously when using the built-in hash function, and solved it with that.
Edit: As someone suggested on another post, the issue could be with assuming that the two files are required to have each line be EXACTLY the same. You might want to try parsing the csv fields and appending them to a string with identical formatting (maybe trim spaces, force lowercase, etc) before computing the hash.
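A rough sketch of that normalisation idea, again assuming old.csv/new.csv; the trimming and lower-casing rules are placeholders you would tune to your data:

import csv
import hashlib

def row_key(row):
    # Canonicalise the parsed fields (trim spaces, force lowercase) before hashing,
    # so rows like [1, 2.3, David S, Monday] and [1, 2.3, "David S", Monday] match.
    normalized = '|'.join(field.strip().lower() for field in row)
    return hashlib.sha1(normalized).hexdigest()

old_keys = set(row_key(row) for row in csv.reader(open('old.csv', 'rb')))
for row in csv.reader(open('new.csv', 'rb')):
    if row_key(row) not in old_keys:
        print 'Not found:', row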
I'm pretty sure that the "failing miserably" line refers to a failure in time that comes from your current algorithm being O(N^2), which is quite bad for how big your files are. As has been mentioned, you can use a set to alleviate this problem (it becomes O(N)), or, if you aren't able to do that for some reason, you can sort the list of hashes and use a binary search on it (it becomes O(N log N)), which is also doable. You can use the bisect module if you go the binary search route.
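For the binary-search route, a minimal sketch with bisect, assuming hashes is the list built in the question:

import bisect

sorted_hashes = sorted(hashes)

def already_seen(h):
    # bisect_left finds the insertion point in O(log N); check that the element there actually matches.
    i = bisect.bisect_left(sorted_hashes, h)
    return i < len(sorted_hashes) and sorted_hashes[i] == h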
Also, it has been mentioned that you may have the problem of a clash in the hashes: two lines yielding the same hash when the lines aren't exactly the same. If you discover that this is a problem that you are experiencing, you will have to store info with each hash about where to seek the line corresponding to the hash in the old.csv file and then seek the line out and compare the two lines.
An alternative to your current method is to sort the two files beforehand (using some sort of on-disk merge sort, perhaps, or a shell sort) and, keeping pointers to lines in each file, compare the two lines. Check if they match, and if not, advance the pointer of whichever line compares as lesser. This algorithm is also O(N log N) as long as an O(N log N) method is used for sorting. The sorting could also be done by putting each file into a database and having the database sort them.
You need to say what your problem really is. Your description "I need to ensure that a row from one file does not appear in the other file" is consistent with the body of your second loop being if hash(...) in hashes: print "Found (an interloper)" rather than what you have.
We can't tell you "why didn't the above method work" because you haven't told us what the symptoms of "failed miserably" and "didn't work" are.
Have you perhaps considered running a sort (if possible)? You'll have to go over the data twice, of course, but it might solve the memory problem.
This is likely a problem with (mis)using hash. See this SO question; as the answers there point out, you probably want hashlib.
I have a potentially big list of image sequences from Nuke. The format of the string can be:
/path/to/single_file.ext
/path/to/img_seq.###[.suffix].ext
/path/to/img_seq.%0id[.suffix].ext, with i being an integer value and the parts between [] being optional.
The question is: given this string, that can represent a sequence or a still image, check if at least one image on disk corresponds to that string in the fastest way possible.
There is already some code that checks if these files exist, but it's quite slow.
First it checks if the folder exists; if not, it returns False.
Then it checks if the file exists with os.path.isfile; if it does, it returns True.
Then, if no % or # is found in the path and os.path.isfile fails, it returns False.
All this is quite fast.
But then it uses an internal library, which performs a bit faster than pyseq, to try to find an image sequence, and does a few more operations depending on whether start_frame == end_frame or not.
But it still takes a long time to determine whether something is an image sequence, especially on some sections of the network and for big image sequences.
For example, for a 2500 images sequence, the analysis takes between 1 and 3 seconds.
If I take a very naive approach, and just check whether a frame exists by replacing #### with %04d, looping over 10000 frames and breaking if one is found, it takes less than 0.02 seconds to check os.path.isfile(f), especially if the first frame is between 1 and 3000.
Of course I cannot guarantee what the start frame will be, and that approach is not perfect, but in practice many of the sequences do begin between 1 and 3000, and I could return True if a frame is found and fall back to the sequence approach if nothing is found (it would still be quicker for most cases).
I'm not sure what the best approach is for this. I already made it multithreaded when searching for many image sequences, so it's faster than before, but I'm sure there is room for improvement.
You should probably not loop for candidates using os.path.isfile(), but use glob.glob() or os.listdir() and check the returned lists for matching your file patterns, i.e. prefer memory operations over disk accesses.
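As an illustration of the listdir idea, a rough sketch assuming a pattern such as /path/to/img_seq.####.exr; the translation of the frame token into a wildcard is deliberately loose:

import fnmatch
import os
import re

def sequence_exists(path):
    # One os.listdir() call replaces thousands of os.path.isfile() probes,
    # which is usually the big win on network filesystems.
    directory, name = os.path.split(path)
    if not os.path.isdir(directory):
        return False
    if os.path.isfile(path):
        return True                              # a plain still image, no frame token
    # Turn a '####' or printf-style '%04d' token into a shell wildcard.
    name_glob = re.sub(r'#+', '*', name)
    name_glob = re.sub(r'%0?\d*d', '*', name_glob)
    return len(fnmatch.filter(os.listdir(directory), name_glob)) > 0

print sequence_exists('/path/to/img_seq.####.exr')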
If there are potentially so many files that you're worried about wasting memory for a dictionary that holds them all, you could just store a single key for each img_seq.###[.suffix].ext pattern, removing the sequence number as you scan the directory. Then a single lookup will suffice. The values in the dictionary could either be "dummy" booleans because the existence of the key is the only thing you care about, or counters in case you ever want to know how many files you have for a certain sequence.
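A sketch of that indexing idea; the regex that blanks out the frame number (assumed to be the run of digits just before the extension) is an assumption about how the files are named:

import os
import re
from collections import defaultdict

FRAME_RE = re.compile(r'\d+(?=\.[^.]+$)')        # the digits right before the extension

def build_sequence_index(directory):
    # One scan of the directory; every frame of a sequence collapses to the same key.
    index = defaultdict(int)
    for entry in os.listdir(directory):
        index[FRAME_RE.sub('#', entry)] += 1
    return index

index = build_sequence_index('/path/to')
query = re.sub(r'#+|%0?\d*d', '#', 'img_seq.####.exr')
print query in index, index.get(query, 0)        # does the sequence exist, and with how many frames?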
Assume that I have some text (for example, given as a string). Later I am going to "edit" this text, which means that I want to add something somewhere or remove something. In this way I will get another version of the text. However, I do not want to have two strings representing each version of the text, since there are a lot of "repetitions" (similarities) between the two subsequent versions. In other words, the differences between the strings are small, so it makes more sense just to save the differences between them. For example, the first version:
This is my first version of the text.
The second version:
This is the first version of the text, that I want to use as an example.
I would like to save these two versions as one object (it should not necessarily be XML, I use it just as an example):
This is <removed>my</removed> <added>the</added> first version of the text<added>, that I want to use as an example</added>.
Now I want to go further. I want to save all subsequent edits as one object. In other words, I am going to have more than two versions of the text, but I would like to save them as one object such that it is easy to get a given version of the text and easy to find out what are the difference between two subsequent (or any two given) versions.
So, to summarize, my question is: what is the standard way to represent changes in a text and to work with this representation using Python?
I would probably go with difflib: https://docs.python.org/2/library/difflib.html
You can use it to represent changes between versions of a string and create your own class to store consecutive diffs.
EDIT: I just realised it doesn't really make sense in your use case, as the diffs from difflib essentially store both strings, so you would be better off just storing them all. However, I believe that this is the standard (library-wise) way of working with changes in text, so I won't delete this answer.
EDIT2: Although, if you find a way to apply unified_diff output back to strings, that may be your answer. It seems that there is no way to do this with difflib yet: https://bugs.python.org/issue2057
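To illustrate the point about difflib keeping both sides around, here is a small sketch using the question's example strings: ndiff builds the delta, and difflib.restore rebuilds either version from it.

import difflib

version1 = ["This is my first version of the text."]
version2 = ["This is the first version of the text, that I want to use as an example."]

# The delta contains prefixed copies of the lines from both sides, which is why it
# is not a space saving, but either version can be reconstructed from it.
delta = list(difflib.ndiff(version1, version2))

print ''.join(difflib.restore(delta, 1))   # prints version 1
print ''.join(difflib.restore(delta, 2))   # prints version 2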
I have a bunch of flat files that basically store millions of paths and their corresponding info (name, atime, size, owner, etc)
I would like to compile a full list of all the paths stored collectively on the files. For duplicate paths only the largest path needs to be kept.
There are roughly 500 files, with approximately a million paths per text file. The files are also gzipped. So far I've been able to do this in Python, but the solution is not optimized: for each file it basically takes an hour to load and compare against the current list.
Should I go for a database solution? sqlite3? Is there a data structure or better algorithm to go about this in python? Thanks for any help!
So far I've been able to do this in python but the solution is not optimized as for each file it basically takes an hour to load and compare against the current list.
If "the current list" implies that you're keeping track of all of the paths seen so far in a list, and then doing if newpath in listopaths: for each line, then each one of those searches takes linear time. If you have 500M total paths, of which 100M are unique, you're doing O(500M*100M) comparisons.
Just changing that list to a set, and changing nothing else in your code (well, you need to replace .append with .add, and you can probably remove the in check entirely… but without seeing your code it's hard to be specific) makes each one of those checks take constant time. So you're doing O(500M) comparisons—100M times faster.
Another potential problem is that you may not have enough memory. On a 64-bit machine, you've got enough virtual memory to hold almost anything you want… but if there's not enough physical memory available to back that up, eventually you'll spend more time swapping data back and forth to disk than doing actual work, and your program will slow to a crawl.
There are actually two potential sub-problems here.
First, you might be reading each entire file in at once (or, worse, all of the files at once) when you don't need to (e.g., by decompressing the whole file instead of using gzip.open, or by using f = gzip.open(…) but then doing f.readlines() or f.read(), or whatever). If so… don't do that. Just iterate over the lines in each GzipFile, for line in f:.
Second, maybe even a simple set of however many unique lines you have is too much to fit in memory on your computer. In that case, you probably want to look at a database. But you don't need anything as complicated as sqlite. A dbm acts like a dict (except that its keys and values have to be byte strings), but it's stored on disk, caching things in memory where appropriate, instead of stored in memory, paging to disk randomly, which means it will go a lot faster in this case. (And it'll be persistent, too.) Of course you want something that acts like a set, not a dict… but that's easy. You can model a set as a dict whose values are always ''. So instead of paths.add(newpath), it's just paths[newpath] = ''. Yeah, that wastes a few bytes of disk space over building your own custom on-disk key-only hash table, but it's unlikely to make any significant difference.
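A rough sketch of that dbm-as-set idea, using anydbm (called simply dbm in Python 3) and assuming filenames is your list of gzipped files with one path per line:

import anydbm
import gzip

paths = anydbm.open('paths.db', 'c')        # dict-like, but stored on disk
for fn in filenames:                        # assumption: your list of .gz files
    f = gzip.open(fn)
    for line in f:                          # iterate lazily; never read the whole file at once
        paths[line.strip()] = ''            # set membership: only the key matters
    f.close()
paths.close()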
I have two documents that are mostly the same, but with some small differences I want to ignore. Specifically, I know that one has hex values written as "0xFFFFFFFF" while the other has them just as "FFFFFFFF".
Basically, these two documents are lists of variables, their values, their locations in memory, size, etc.
But another problem is that they are not in the same order either.
I tried a few things, one being to just pack them all up in two lists of lists, and compare whether the lists of lists have counterparts in each other, but with the number of variables being almost 100,000 the time it takes to do this is ridiculous (on the order of nearly an hour), so that isn't going to work. I'm not very seasoned in Python, or even the pythonic way of doing things, so I'm sorry if there is a quick and easy way to do this.
I've read a few other similar questions, but they all assume the files are 100% identical, and other things that aren't true in my case.
Basically, I have two .txts that have series of lines that look like:
***************************************
Variable: Var_name1
Size: 4
Address: 0x00FF00F0 .. 0x00FF00F3
Description: An awesome variable
..
***************************************
I don't care if the Descriptions are different, I just want to make sure that every variable has the same length and is in the same place, address-wise, and if there are any differences, I want to see them. I also want to be sure that every variable in one is present in the other.
And again, the addresses in the first one are written with the hex radix and in the second one without it. And they are in a different order.
--- Output ---
I don't really care about the output's format as long as it is human readable. Ideally, it'd be a .txt document that said something like:
"Var_name1 does not exist in list two"
"Var_name2 has a different size. (Size1, Size2)"
"Var_name4 is located in a different place. (Loc1, Loc2)"
COMPLETE RE-EDIT
[My initial suggestion was to use sets, but further discussion via the comments made me realize that that was nonsense, and that a dictionary was the real solution.]
You want a dictionary, keyed on variable name, where the value is a list, a tuple, a nested dictionary, or even an object containing size and address. You can add each variable name to the dictionary and update the values as needed.
For comparing the addresses, a regex would do it, but you can probably get by with less overhead by simply stripping the '0x' prefix, or by parsing both forms with int(addr, 16).
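A rough sketch of that dictionary-based comparison, assuming every block looks like the sample above and using the hypothetical file names file1.txt and file2.txt:

def parse_vars(filename):
    # Parse the 'Variable:/Size:/Address:' blocks into {name: {'size': ..., 'address': (lo, hi)}}.
    variables, name = {}, None
    for line in open(filename):
        line = line.strip()
        if line.startswith('Variable:'):
            name = line.split(':', 1)[1].strip()
            variables[name] = {}
        elif line.startswith('Size:') and name:
            variables[name]['size'] = int(line.split(':', 1)[1])
        elif line.startswith('Address:') and name:
            # int(x, 16) accepts both '0x00FF00F0' and '00FF00F0', so the radix prefix stops mattering.
            lo, hi = [int(a, 16) for a in line.split(':', 1)[1].split('..')]
            variables[name]['address'] = (lo, hi)
    return variables

first, second = parse_vars('file1.txt'), parse_vars('file2.txt')
for name, info in first.items():
    if name not in second:
        print '%s does not exist in list two' % name
    elif info['size'] != second[name]['size']:
        print '%s has a different size. (%s, %s)' % (name, info['size'], second[name]['size'])
    elif info['address'] != second[name]['address']:
        print '%s is located in a different place. (%s, %s)' % (name, info['address'], second[name]['address'])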
Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurrence_count) (everything is an integer) which I am storing in multiple Python dictionaries (tuple -> int). These values are spread across multiple files on the disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried, I ran into performance issues. I am currently trying to store the values in a DB (MySQL), processing multiple dictionaries at a time, since MySQL provides row-level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries).
What are my options here? Is it a good idea to write a partially disk-based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys, and a caveat for mutable values, which need not concern you since your values are ints).
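A rough sketch of that shelf-based aggregation, assuming filenames is the list of pickled dictionary files and that you want totals per word across all corpora:

import shelve
import cPickle as pickle    # plain 'pickle' works too; cPickle is just faster in Python 2

totals = shelve.open('word_totals.db')          # dict-like, but lives on disk
for fn in filenames:                            # assumption: your pickled {(word, corpus): count} files
    with open(fn, 'rb') as f:
        counts = pickle.load(f)                 # only one dictionary in memory at a time
    for (word, corpus), count in counts.iteritems():
        key = str(word)                         # shelf keys must be strings
        totals[key] = totals.get(key, 0) + count
    del counts                                  # free the big dict before loading the next one
totals.close()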
Something like this, if I understand your question correctly:
from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:
    data_dict = pickle.load(open(fn, 'rb'))
    for k, count in data_dict.items():
        word, corpus = k    # each key is a (word, corpus) tuple
        result[k] += count
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
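A minimal sketch of that renumbering, assuming a hypothetical num_words mapping from corpus id to vocabulary size, with word ids running from 0 within each corpus:

import numpy as np

# assumption: num_words[corpus] gives the vocabulary size of each corpus
start_newid, offset = {}, 0
for corpus in sorted(num_words):
    start_newid[corpus] = offset
    offset += num_words[corpus]

counts = np.zeros(offset, dtype=np.int64)        # one flat array replaces the dict keyed on tuples

def add_count(word, corpus, count):
    counts[start_newid[corpus] + word] += count  # newid = word + start_newid[corpus]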
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
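A rough sketch of those passes, assuming 10 chunks and (as one simple way to pick the slices) hashing each (word, corpus) key to decide which pass it belongs to:

import cPickle as pickle

NUM_CHUNKS = 10                                  # assumption: a tenth of the key space fits in memory

for chunk in range(NUM_CHUNKS):
    partial = {}                                 # totals for this slice of (word, corpus) keys only
    for fn in filenames:                         # assumption: your pickled dict files
        with open(fn, 'rb') as f:
            data = pickle.load(f)
        for key, count in data.iteritems():
            if hash(key) % NUM_CHUNKS == chunk:  # this key belongs to the current pass
                partial[key] = partial.get(key, 0) + count
    with open('totals_%d.pkl' % chunk, 'wb') as f:
        pickle.dump(partial, f)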