Intersection of Two Lists Of Strings - python

I had an interview question along these lines:
Given two lists of unordered customers, return a list of the intersection of the two lists. That is, return a list of the customers that appear in both lists.
Some things I established:
Assume each customer has a unique name
If the name is the same in both lists, it's the same customer
The names are of the form first name last name
There's no trickery of II's, Jr's, weird characters, etc.
I think the point was to find an efficient algorithm/use of data structures to do this as efficiently as possible.
My progress went like this:
Read one list in to memory, then read the other list one item at a time to see if there is a match
Alphabetize both lists then start at the top of one list and see if each item appears in the other list
Put both lists into ordered lists, then use the shorter list to check item by item (that way, if one list has only 2 items, you only check those 2 items)
Put one list into a hash, and check for the existence of keys from the other list
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Any other tricks to do this efficiently?
Side note, this question was in python, and I just read about sets, which seem to do this as efficiently as possible. Any idea what the data structure/algorithm of sets is?

It really doesn't matter how it's implemented... but CPython's set is a hash table written in C, so membership tests average O(1), and set([1,2,3,4,5,6]).intersection([1,2,5,9]) is likely what they wanted.
In Python, readability counts for a lot! And set operations in Python are used extensively and well vetted...
that said another pythonic way of doing it would be
list_new = [itm for itm in listA if itm in listB]
or
list_new = filter(lambda itm: itm in listB, listA)
(note that in Python 3, filter returns an iterator, so wrap it in list() if you need a list)
Basically, I believe they were testing whether you were familiar with Python, not whether you could implement the algorithm, since they asked a question that is so well suited to Python.
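For what it's worth, here is a minimal sketch of both approaches side by side (the names are made up). Building each set is O(n) and each membership test averages O(1), so note how pre-building a set turns the O(n*m) comprehension into roughly O(n + m):

listA = ["Ann Lee", "Bob Ray", "Cal Day"]
listB = ["Bob Ray", "Dee Fox", "Ann Lee"]

# Set intersection: roughly O(len(listA) + len(listB)) overall.
common = set(listA) & set(listB)

# The comprehension above is O(n*m) because "itm in listB" scans the
# whole list; pre-building a set fixes that and preserves listA's order.
setB = set(listB)
common_ordered = [itm for itm in listA if itm in setB]

print(sorted(common))    # ['Ann Lee', 'Bob Ray']
print(common_ordered)    # ['Ann Lee', 'Bob Ray']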

Put one list into a bloom filter and use that to filter the second list.
Put the filtered second list into a bloom filter and use that to filter the first list.
Sort the two lists and find the intersection by one of the methods above.
The benefit of this approach (besides letting you use a semi-obscure data structure correctly in an interview) is that it doesn't require O(n) storage of the names themselves (a Bloom filter is a fixed-size bit array) until after you have (with high probability) reduced the problem size.
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Maybe they would just keep asking that until you run out of answers.
http://code.google.com/p/python-bloom-filter/ is a python implementation of bloom filters.
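To make the filtering step concrete, here is a hedged, self-contained sketch; the TinyBloom class below is made up for illustration and is not the linked library's API:

import hashlib

class TinyBloom:
    # A deliberately simple Bloom filter: k bit positions per item in
    # an m-bit array. False positives are possible, false negatives
    # are not, so surviving candidates still need an exact check.
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def intersect(list_a, list_b):
    bloom = TinyBloom()
    for name in list_a:
        bloom.add(name)
    # Probably-common names; a few false positives may slip through...
    candidates = [name for name in list_b if name in bloom]
    # ...so finish with an exact check on the (now much smaller) set.
    a_set = set(list_a)
    return [name for name in candidates if name in a_set]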

Related

python data structures for storing classes

I am trying to design a system in python where my customers can create an order and it will be stored in an array or similar type of structure that will be able to constantly expand to store more orders as they are placed. What is the best way to do this?
I can think of two ways to do this.
Serialization (e.g. pickling the orders to disk).
Create two tables, one called order and the other called order_contents. You can join order and order_contents by order id. Store order-specific data in the order table and content-specific data in order_contents. All contents can be retrieved with a single SQL query this way, or easily in Python with an ORM.
How big would you expect an order to get, and how many orders could there be? Also, what is stored in an order?
If you use numpy arrays, you have the problem that increasing the size of an array is a very expensive operation, so doing it many times on large arrays would be problematic. Numpy arrays are also for data that is all the same type, so you would not use them for things that are combinations of strings (name, item), integers (item reference), and floats (cost).
A simple list is likely the easiest and most inclusive choice. You can put whatever you like in a list and increase its size easily, since a Python list is really just an array of pointers.
A dictionary could be useful if you are expecting to have to search the list often, or have a clear key-item relationship.
It really comes down to your use case. A list is often the choice, but a dictionary could be nice, and numpy arrays are nice if you are doing math with the stored data.
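As a minimal sketch of the list-plus-dict combination (all field names here are invented for illustration):

orders = []  # a plain list grows in amortized O(1) per append

def place_order(customer, items):
    order = {
        "id": len(orders) + 1,
        "customer": customer,
        "items": items,  # e.g. [("widget", 2, 9.99)]: name, qty, price
    }
    orders.append(order)
    return order["id"]

place_order("Ann Lee", [("widget", 2, 9.99)])
place_order("Bob Ray", [("gizmo", 1, 4.50)])

# If you later need fast lookup by id, index the same objects in a dict:
orders_by_id = {o["id"]: o for o in orders}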

list or dict? To verify multiple outputs

I'm trying to figure out if I should make my output a list or dictionary? Which one would be easy to verify when you are trying to match 3 different outputs with millions of lines?
I'm going to take two different outputs and verify three pieces for each line.
Problem is, later I'll have an output C which I want to verify against outputs A and B. I'm leaning towards a dictionary?
For example (bold is what I'll be matching):
output A:
10.1.1.1:80 10.2.1.1:81 10.1.1.1:80 10.3.1.1:81 etc etc etc name
...
...
...
output B:
name etc etc etc etc 10.1.1.1/16 10.2.1.1/16
...
...
...
I'm trying to figure out if I should make my output a list or dictionary? Which one would be easy to verify when you are trying to match 3 different outputs with millions of lines?
The key difference between a list and a dictionary is the way you access your data: with a dictionary you look items up by key, whereas with a list you access items by integer index or scan them in sequence. Which one should you use here? Sometimes this is a performance question and sometimes it's about code clarity. I don't think clarity is the issue in your case; performance would be your greatest concern, because you're dealing with millions of lines to be processed and matched. There have been many internal optimizations to Python objects, so it's better to focus on what you're trying to achieve instead of focusing on the objects themselves. I could very easily recommend dictionaries, but would that be a good recommendation for later developments in your code? I don't know; unless we work on that specific code, any recommendation would be vague.
If you care about performance, this might help:
Python: List vs Dict for look up table
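That said, here is a rough, hedged sketch of the dictionary route; the key function below is an assumption, namely that you match on the bare IP with the port or prefix stripped:

def key_of(addr):
    # "10.1.1.1:80" and "10.1.1.1/16" both reduce to "10.1.1.1"
    return addr.split(":")[0].split("/")[0]

output_a = {key_of("10.1.1.1:80"): "full line from A"}
output_b = {key_of("10.1.1.1/16"): "full line from B"}

# Verifying a later output C is then one O(1) dict lookup per line,
# instead of a scan through millions of list entries:
for k in output_a:
    if k in output_b:
        print(k, "appears in both A and B")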

Compare two documents with slight differences in Python

I have two documents that are mostly the same, but with some small differences I want to ignore. Specifically, I know that one has hex values written as "0xFFFFFFFF" while the other has them just as "FFFFFFFF"
Basically, these two documents are lists of variables: their values, their locations in memory, sizes, etc.
But another problem is that they are not in the same order either.
I tried a few things, one being to just pack them all up into two lists of lists and check whether the lists of lists have counterparts in each other, but with the number of variables being almost 100,000, the time it takes to do this is ridiculous (on the order of nearly an hour), so that isn't going to work. I'm not very seasoned in Python, or even the pythonic way of doing things, so I'm sorry if there is a quick and easy way to do this.
I've read a few other similar questions, but they all assume the files are 100% identical, and other things that aren't true in my case.
Basically, I have two .txts that have series of lines that look like:
***************************************
Variable: Var_name1
Size: 4
Address: 0x00FF00F0 .. 0x00FF00F3
Description: An awesome variable
..
***************************************
I don't care if the Descriptions are different, I just want to make sure that every variable has the same length and is in the same place, address-wise, and if they are any difference, I want to see them. I also want to be sure that every variable in one is present in the other.
And again, the addresses in the first one are written with the hex radix and in the second one without it. And they are in a different order.
--- Output ---
I don't really care about the output's format as long as it is human readable. Ideally, it'd be a .txt document that said something like:
"Var_name1 does not exist in list two"
"Var_name2 has a different size. (Size1, Size2)"
"Var_name4 is located in a different place. (Loc1, Loc2)"
COMPLETE RE-EDIT
[My initial suggestion was to use sets, but further discussion via the comments made me realize that that was nonsense, and that a dictionary was the real solution.]
You want a dictionary; keyed on variable name; and where the value is a list or a tuple or a nested dictionary or even an object, containing size and address. You can add each variable name to the dictionary and update the values as needed.
For comparing the addresses, a regex would do it, but you can probably get by with less overhead using a plain string operation (e.g. str.startswith('0x') plus slicing, or a simple str.replace to drop the radix).
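Here is a rough sketch under the stated assumptions (records delimited by rows of asterisks, field labels exactly as in the question; the file names are made up). Parsing both files into dicts makes the comparison O(n) instead of the list-vs-list scan that was taking an hour:

import re

FIELD = re.compile(r"^(Variable|Size|Address):\s*(.*)$", re.MULTILINE)

def parse(path):
    vars_by_name = {}
    with open(path) as f:
        blocks = re.split(r"\*{5,}", f.read())
    for block in blocks:
        fields = dict(FIELD.findall(block))
        if "Variable" in fields:
            # Normalize addresses: drop the 0x radix so
            # "0x00FF00F0 .. 0x00FF00F3" matches "00FF00F0 .. 00FF00F3".
            addr = fields.get("Address", "").replace("0x", "").upper()
            vars_by_name[fields["Variable"]] = (fields.get("Size"), addr)
    return vars_by_name

a, b = parse("doc1.txt"), parse("doc2.txt")
for name, (size, addr) in a.items():
    if name not in b:
        print(f"{name} does not exist in list two")
    elif b[name][0] != size:
        print(f"{name} has a different size. ({size}, {b[name][0]})")
    elif b[name][1] != addr:
        print(f"{name} is located in a different place. ({addr}, {b[name][1]})")
for name in b:
    if name not in a:
        print(f"{name} does not exist in list one")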

Efficient and accurate way to compact and compare Python lists?

I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:
import csv

hashes = []
for row in csv.reader(open('old.csv','rb')):
    hashes.append(hash(str(row)))

for row in csv.reader(open('new.csv','rb')):
    if hash(str(row)) not in hashes:
        print 'Not found'
But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, and thus I went with the hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compact Python lists so that they can be compared in terms of simple equality to other lists? I.e. a hashing system that actually works? Bonus points: why didn't the above method work?
EDIT:
Thanks for all the great suggestions! Let me clarify some things. "Miserable failure" means that two rows that have the exact same data, after being read in by the csv.reader object, are not hashing to the same value after calling str on the list object. I shall try hashlib, per some suggestions below. I also cannot hash the raw file, since two lines like the ones below contain the same data but different characters on the line:
1, 2.3, David S, Monday
1, 2.3, "David S", Monday
I am also already doing things like string stripping to make the data more uniform, but seemingly to no avail. I'm not looking for extremely smart diff logic, i.e. treating 0 the same as 0.0.
EDIT 2:
Problem solved. What basically worked is that I needed to do a bit more pre-formatting, like converting ints and floats and so forth, AND I needed to change my hashing function. Both of these changes did the job for me.
It's hard to give a great answer without knowing more about your constraints, but if you can store a hash for each line of each file then you should be ok. At the very least you'll need to be able to store the hash list for one file, which you then would sort and write to disk, then you can march through the two sorted lists together.
The only reason I can imagine the above not working as written would be that your hashing function doesn't always give the same output for a given input. You could test that a second run through old.csv generates the same list. It may have to do with errant spaces, tabs-instead-of-spaces, or differing capitalization.
Mind, even if the hashes are equivalent you don't know that the lines match; you only know that they might match. You still need to check that the candidate lines do match. (You may also get the situation where more than one line in the input file generates the same hash, so you'll need to handle that as well.)
After you fill your hashes variable, you should consider turning it into a set (hashes = set(hashes)) so that your lookups can be faster than linear.
Given the loose syntactic definition of CSV, it is possible for two rows to be semantically equal while being lexically different. The various Dialect definitions give some clue as to how two rows could be individually well-formed but incommensurable. And this example shows how they could be in the same dialect and not string equivalent:
0, 0
0, 0.0
More information would help yield a better answer to your question.
More information would be needed on what exactly "failing miserably" means. If you are just not getting a correct comparison between the two, perhaps hashlib might solve that.
I've run into trouble previously when using the built in hash library, and solved it with that.
Edit: As someone suggested on another post, the issue could be with assuming that the two files are required to have each line be EXACTLY the same. You might want to try parsing the csv fields and appending them to a string with identical formatting (maybe trim spaces, force lowercase, etc) before computing the hash.
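A hedged sketch of that idea, combined with hashlib and a set (file names as in the question; the normalization shown, strip and lower-case plus skipinitialspace, is only an example and should match whatever "same data" means for you). Hashing the parsed fields rather than the raw line already makes the 'David S' vs '"David S"' case compare equal:

import csv
import hashlib

def row_key(row):
    # Canonicalize each field before hashing; the \x1f separator
    # avoids collisions that a plain ",".join would allow.
    cells = [c.strip().lower() for c in row]
    return hashlib.md5("\x1f".join(cells).encode("utf-8")).hexdigest()

with open("old.csv", newline="") as f:
    seen = {row_key(row) for row in csv.reader(f, skipinitialspace=True)}

with open("new.csv", newline="") as f:
    for row in csv.reader(f, skipinitialspace=True):
        if row_key(row) not in seen:
            print("Not found:", row)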
I'm pretty sure that the "failing miserably" line refers to a failure in time that comes from your current algorithm being O(N^2), which is quite bad given how big your files are. As has been mentioned, you can use a set to alleviate this problem (each lookup becomes O(1), the whole pass O(N)), or if you can't do that for some reason, you can sort the list of hashes and use a binary search on it (O(N log N), which is also doable). You can use the bisect module if you go the binary search route.
Also, it has been mentioned that you may have the problem of a clash in the hashes: two lines yielding the same hash when the lines aren't exactly the same. If you discover that this is a problem that you are experiencing, you will have to store info with each hash about where to seek the line corresponding to the hash in the old.csv file and then seek the line out and compare the two lines.
An alternative to your current method is to sort the two files beforehand (using some sort of merge sort to disk perhaps or shell sort) and, keeping pointers to lines in each file, compare the two lines. Check if they match, and if not then advance the line that is measured as being lesser. This algorithm is also O(N log N) as long as an O(N log N) method is used for sorting. The sorting could also be done by putting each file into a database and having the database sort them.
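A minimal sketch of that final comparison step, assuming both files have already been sorted (for example by an external merge sort):

def diff_sorted(path_a, path_b):
    # March two pointers through the sorted files, holding only one
    # line of each in memory at a time.
    with open(path_a) as fa, open(path_b) as fb:
        a, b = fa.readline(), fb.readline()
        while a and b:
            if a == b:
                a, b = fa.readline(), fb.readline()
            elif a < b:
                print("only in A:", a.rstrip())
                a = fa.readline()
            else:
                print("only in B:", b.rstrip())
                b = fb.readline()
        while a:
            print("only in A:", a.rstrip())
            a = fa.readline()
        while b:
            print("only in B:", b.rstrip())
            b = fb.readline()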
You need to say what your problem really is. Your description "I need to ensure that a row from one file does not appear in the other file" is consistent with the body of your second loop being if hash(...) in hashes: print "Found (an interloper)" rather than what you have.
We can't tell you "why didn't the above method work" because you haven't told us what the symptoms of "failed miserably" and "didn't work" are.
Have you perhaps considered running a sort (if possible)? You'll have to go over the data twice, of course, but it might solve the memory problem.
This is likely a problem with (mis)using hash. See this SO question; as the answers there point out, you probably want hashlib.

merging dictionaries in python

Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurrence_count), where everything is an integer, which I am storing in multiple Python dictionaries (tuple -> int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key across all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them up in the first place. When I tried, I ran into performance issues. I am currently trying to store the values in a DB (MySQL), processing multiple dictionaries at a time. MySQL provides row-level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries).
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply call str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict, except for the type constraint on keys, and a caveat for mutable values -- though as your values are ints, that need not concern you).
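A minimal sketch of the shelve approach (the file names below are made up, standing in for your pickled dict files):

import pickle
import shelve

filenames = ["counts_0.pkl", "counts_1.pkl"]  # your pickled dicts

with shelve.open("totals.db") as totals:
    for fn in filenames:
        with open(fn, "rb") as f:
            for key, count in pickle.load(f).items():
                skey = str(key)  # shelf keys must be strings
                totals[skey] = totals.get(skey, 0) + count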
Something like this, if I understand your question correctly:

from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:
    # load one pickled dict at a time to keep memory bounded
    with open(fn, 'rb') as f:
        data_dict = pickle.load(f)
    for k, count in data_dict.items():
        result[k] += count  # k is a (word, corpus) tuple
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want every newid to correspond to a unique (word, corpus) pair, so I would count the words in each corpus and then compute, for each corpus, a starting newid. The newid of (word, corpus) will then be word + start_newid[corpus], as in the sketch below.
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
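A hedged sketch of that id scheme, assuming word ids are dense integers starting at 0 within each corpus (the sizes below are made up):

import numpy as np

words_per_corpus = [5000, 8000, 3000]  # made-up vocabulary sizes
start_newid = np.cumsum([0] + words_per_corpus[:-1])

def newid(word, corpus):
    return word + start_newid[corpus]

# Counts then live in one flat array instead of a dict of tuples:
counts = np.zeros(sum(words_per_corpus), dtype=np.int64)
counts[newid(word=7, corpus=1)] += 3  # (word 7, corpus 1) -> +3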
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
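A sketch of the rechunking pass, using a hash of the key to decide which tenth of the key space the current pass is responsible for (file names are made up; tuple-of-int hashing is deterministic, so the chunk assignment is stable across passes):

import pickle
from collections import defaultdict

filenames = ["counts_0.pkl", "counts_1.pkl"]  # made-up names
NCHUNKS = 10  # only ~1/NCHUNKS of the result is in memory at once

for chunk in range(NCHUNKS):
    partial = defaultdict(int)
    for fn in filenames:
        with open(fn, "rb") as f:
            for key, count in pickle.load(f).items():
                if hash(key) % NCHUNKS == chunk:
                    partial[key] += count
    with open(f"result_{chunk}.pkl", "wb") as out:
        pickle.dump(dict(partial), out)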
