I have a boolean matrix in Python and need to find out which rows are duplicates. The representation can also be a list of bitarrays, as I am using that for other purposes anyway. Comparing all rows with all rows is not an option, as this would mean 12500^2 comparisons and I can only do about 500 per second. Converting each row into an integer is also not possible, as each row is about 5000 bits long. Still, it seems to me that the best way would be to sort the list of bitarrays and then compare only consecutive rows. Does anyone have an idea how to map bitarrays to sortable values, or how to sort a list of bitarrays in the first place? Or is there a different approach that is more promising? Also, since I only have to do this once, I prefer less code over efficiency.
OK, so a list of bitarrays is quickly sortable with sort() or sorted(). Furthermore, a probably better way to solve this problem is indicated in Find unique rows in numpy.array.
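For what it's worth, a minimal sketch of the linked approach, assuming the matrix is available as a boolean NumPy array (the random mat below is only placeholder data):

import numpy as np

# placeholder data standing in for the real 12500 x 5000 boolean matrix
mat = np.random.rand(12500, 5000) > 0.5

# np.unique over the rows gives one representative per distinct row plus, for
# every input row, the index of its representative; duplicate groups are the
# representatives that occur more than once
_, inverse, counts = np.unique(mat, axis=0, return_inverse=True, return_counts=True)
dup_groups = [np.flatnonzero(inverse == g) for g in np.flatnonzero(counts > 1)]

Each entry of dup_groups is an array of row indices whose rows are bit-for-bit identical.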
Related
I have an array of events over time. There are more than a billion rows, and each row is made of:
user_id
timestamp
event_type
other high-precision (float64) numerical columns
I need to sort this array in multiple ways, but let's take timestamp as an example. I've tried several methods, including:
sort the entire array directly with arr = arr[arr[:, timestamp_col].argsort()] (timestamp_col being the key column) => awful performance
using the same method as above, sort chunks of the array in parallel and then do a final sort => better performance
Still, even with the "sort in chunks then merge" method, performance rapidly degrades with the number of items. It takes more than 5 minutes to sort 200M rows on a 128 CPU machine. This is a problem because I need to perform multiple sortings consecutively, and I need to do it "on the fly", i.e. persisting the sorted array on disk once and for all is not an option because items are always added and removed, not necessarily in chronological order.
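For concreteness, here is a rough sketch of the chunk-sort-and-merge approach described above; the chunk count, the ProcessPoolExecutor, and sorting on a single float64 key column are all assumptions:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _sorted_run(args):
    # sort one chunk of keys and return them together with their global row indices
    keys, offset = args
    order = np.argsort(keys, kind="stable")
    return keys[order], order + offset

def _merge(run_a, run_b):
    # merge two sorted runs (keys plus row indices) using searchsorted
    ka, ia = run_a
    kb, ib = run_b
    pos = np.searchsorted(ka, kb) + np.arange(len(kb))   # final slots for run_b
    keys = np.empty(len(ka) + len(kb), dtype=ka.dtype)
    idx = np.empty(len(keys), dtype=ia.dtype)
    mask = np.ones(len(keys), dtype=bool)
    mask[pos] = False
    keys[pos], idx[pos] = kb, ib
    keys[mask], idx[mask] = ka, ia
    return keys, idx

def chunked_argsort(keys, n_chunks=8):
    # permutation that sorts `keys`, built from parallel chunk sorts plus pairwise merges
    bounds = np.linspace(0, len(keys), n_chunks + 1, dtype=np.intp)
    jobs = [(keys[lo:hi], int(lo)) for lo, hi in zip(bounds[:-1], bounds[1:])]
    with ProcessPoolExecutor() as pool:
        runs = list(pool.map(_sorted_run, jobs))
    while len(runs) > 1:
        runs = [_merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0][1]

# usage, with timestamp_col being whichever column holds the sort key:
# arr = arr[chunked_argsort(arr[:, timestamp_col])]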
Is there a way to drastically improve the performance of this sorting? For instance, could a Numba implementation of mergesort which works in parallel be a solution? Also, I'd be interested in a method which works for sorting on multiple columns (e.g. sort on [user_id, timestamp]).
Notes:
the array is of dtype np.float64 and needs to stay that way because of the contents of other columns which are not mentioned in the above example.
we thought of using a database with separate tables, one for each particular sorting; the advantage would be that the tables are always perfectly sorted, but the drawback is retrieval speed
for the above reason, we went with local Parquet files, which are blazingly fast to retrieve, but that means we need to sort them once loaded
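On the multi-column part of the question, np.lexsort builds a single permutation from several key columns; a small sketch, where the column positions are assumptions about the layout:

import numpy as np

# placeholder event table: column 0 = user_id, column 1 = timestamp (assumed layout)
arr = np.random.rand(1_000_000, 4)

# single-column sort: argsort only the key column, then reorder the rows
by_time = arr[np.argsort(arr[:, 1], kind="stable")]

# multi-column sort on [user_id, timestamp]: np.lexsort treats the *last*
# key as the primary one, so the primary key goes last in the tuple
by_user_then_time = arr[np.lexsort((arr[:, 1], arr[:, 0]))]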
I'm trying to write code to compute pairwise differences in a bunch of data within and between groups. That is, I've loaded the data into a dictionary so that the ith value of data j in group k is accessible by
data[j][group[k]][i]
I've written for loops to calculate all of the within group pairwise differences, but I'm a little stuck on how to then calculate the between groups pairwise differences. Is there a way to compare all of the values in data[j][group[k]] to all of the values in data[j][*NOT*group[k]]?
Thanks for any suggestions.
You could compare them all and then throw out the ones where the group is the same as the one being compared to (I hope that makes sense).
Or
make a temporary group[l] equal to group[k] minus the instance you are comparing to.
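A minimal sketch of computing the between-group differences directly, assuming data[j] is a dict mapping each group name to its list of values as in the layout above:

from itertools import combinations, product

def between_group_diffs(data_j):
    # data_j maps group name -> list of values, i.e. data[j] in the layout above
    diffs = {}
    for g1, g2 in combinations(data_j, 2):   # each unordered pair of groups exactly once
        diffs[(g1, g2)] = [a - b for a, b in product(data_j[g1], data_j[g2])]
    return diffs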
Attempting to compare two ~100M row HDF5 datasets. The first dataset is the master and the second is the result of the master being mapped and run through a cluster to discern a specific result for each row.
I need to validate that all the intended rows from the master are present, remove any duplicates, and create a list of any missing rows that need to be computed. Hash values would be generated from the common elements between the two datasets. I realize, though, that it likely wouldn't be practical to loop through them row by row in native Python.
Such being the case, what would be a more efficient means of running this task? Do you try to code something in Cython to offset the issue of Python loop speed, or is there a "better" way?
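Not an answer to the Cython question, but a rough sketch of the hash-the-shared-columns idea; the dataset name "table" and the key fields "id" and "value" are placeholders for whatever the two files actually share, and at the full ~100M scale the in-memory sets would need to give way to sorted arrays or chunked comparisons:

import h5py

def row_keys(path, dataset="table", key_fields=("id", "value"), chunk=1_000_000):
    # read the dataset in chunks and reduce each row's key fields to one tuple
    keys = []
    with h5py.File(path, "r") as f:
        ds = f[dataset]
        for start in range(0, ds.shape[0], chunk):
            block = ds[start:start + chunk]          # numpy structured array
            keys.extend(zip(*(block[name] for name in key_fields)))
    return keys

master = row_keys("master.h5")
result = row_keys("result.h5")

duplicates = len(result) - len(set(result))          # rows computed more than once
missing = set(master) - set(result)                  # rows still to be computed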
I had an interview question along these lines:
Given two lists of unordered customers, return a list of the intersection of the two lists. That is, return a list of the customers that appear in both lists.
Some things I established:
Assume each customer has a unique name
If the name is the same in both lists, it's the same customer
The names are of the form first name last name
There's no trickery of II's, Jr's, weird characters, etc.
I think the point was to find the right algorithm/use of data structures to do this as efficiently as possible.
My progress went like this:
Read one list into memory, then read the other list one item at a time to see if there is a match
Alphabetize both lists then start at the top of one list and see if each item appears in the other list
Put both lists into ordered lists, then use the shorter list to check item by item (that way, if one list has 2 items, you only check those 2 items)
Put one list into a hash, and check for the existence of keys from the other list
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Any other tricks to do this efficiently?
Side note: this question was in Python, and I just read about sets, which seem to do this as efficiently as possible. Any idea what the data structure/algorithm of sets is?
It really doesn't matter how it's implemented ... but I believe it is implemented in C, so it is faster and better. set([1,2,3,4,5,6]).intersection([1,2,5,9]) is likely what they wanted.
In Python readability counts for a lot! And set operations in Python are used extensively and are well vetted...
That said, another Pythonic way of doing it would be
list_new = [itm for itm in listA if itm in listB]
or
list_new = filter(lambda itm: itm in listB, listA)
Basically I believe they were testing whether you were familiar with Python, not whether you could implement the algorithm, since they asked a question that is so well suited to Python.
Put one list into a bloom filter and use that to filter the second list.
Put the filtered second list into a bloom filter and use that to filter the first list.
Sort the two lists and find the intersection by one of the methods above.
The benefit of this approach (besides letting you use a semi-obscure data structure correctly in an interview) is that it doesn't require any O(n) storage until after you have (with high probability) reduced the problem size.
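For illustration, a hand-rolled sketch of those three steps; the BloomFilter class is deliberately minimal (sized with the usual optimal-size formulas), and a real library such as the one linked below would be a better choice:

import hashlib
from math import ceil, log

class BloomFilter:
    def __init__(self, expected_items, error_rate=0.01):
        self.size = max(8, ceil(-expected_items * log(error_rate) / log(2) ** 2))
        self.hashes = max(1, round(self.size / max(expected_items, 1) * log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def probable_intersection(list_a, list_b):
    bloom_a = BloomFilter(len(list_a))
    for name in list_a:
        bloom_a.add(name)
    # 1. names from list_b that are probably also in list_a (false positives possible)
    b_candidates = [name for name in list_b if name in bloom_a]

    bloom_b = BloomFilter(len(b_candidates))
    for name in b_candidates:
        bloom_b.add(name)
    # 2. names from list_a that are probably in the reduced list_b
    a_candidates = [name for name in list_a if name in bloom_b]

    # 3. both candidate lists are now small, so an exact intersection is cheap
    #    and removes the remaining false positives
    return set(a_candidates) & set(b_candidates)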
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Maybe they would just keep asking that until you run out of answers.
http://code.google.com/p/python-bloom-filter/ is a python implementation of bloom filters.
I want to store a list of numbers along with some other fields into MySQL. The number of elements in the list is dynamic (at times it could hold about 60 elements).
Currently I'm storing the list in a column of VARCHAR type, and the following operations are done.
e.g. aList = [1234122433,1352435632,2346433334,1234122464]
At storing time, aList is converted to a string as below:
aListStr = str(aList)
and at reading time the string is converted back to a list as below:
aList = eval(aListStr)
There are about 10 million rows, and since I'm storing them as strings, they occupy a lot of space. What is the most efficient way to do this?
Also, what would be an efficient way to store a list of strings instead of numbers?
Since you wish to store integers, an effective way would be to store them in an INT/DECIMAL column.
Create an additional table that will hold these numbers and add an ID column to relate the records to other table(s).
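A rough sketch of that layout through the Python DB-API; the table and column names, the connection details, and the mysql.connector driver are all placeholders:

import mysql.connector   # any DB-API driver would do

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()

# one row per list element, related back to the parent record by its ID
cur.execute("""
    CREATE TABLE IF NOT EXISTS list_items (
        record_id BIGINT NOT NULL,
        position  SMALLINT NOT NULL,
        value     BIGINT NOT NULL,
        PRIMARY KEY (record_id, position)
    )
""")

record_id = 42                       # hypothetical parent record
aList = [1234122433, 1352435632, 2346433334, 1234122464]
cur.executemany(
    "INSERT INTO list_items (record_id, position, value) VALUES (%s, %s, %s)",
    [(record_id, i, v) for i, v in enumerate(aList)])
conn.commit()

# reading the list back in its original order
cur.execute("SELECT value FROM list_items WHERE record_id = %s ORDER BY position",
            (record_id,))
aList = [row[0] for row in cur.fetchall()]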
Also, what would be an efficient way to store a list of strings instead of numbers?
Besides what I said, you can convert them to hex, which will be very easy and take less space.
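As a rough illustration of packing the numbers more compactly, assuming the values fit in unsigned 32-bit integers:

import struct

aList = [1234122433, 1352435632, 2346433334, 1234122464]

# pack each number into 4 bytes and keep the hex string of the blob:
# 8 hex characters per number versus roughly 12 characters each in str(aList)
blob = struct.pack("<%dI" % len(aList), *aList)
hexStr = blob.hex()

# reading it back
aList = list(struct.unpack("<%dI" % (len(hexStr) // 8), bytes.fromhex(hexStr)))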
Note that a big VARCHAR may have a bad influence on performance.
The difference between VARCHAR(2) and VARCHAR(50) does matter when actions like sorting are done, since MySQL allocates fixed-size memory slices for them according to the VARCHAR maximum size.
When those slices are too large to store in memory, MySQL will store them on disk.
MySQL also has a SET type; it works like ENUM but can hold multiple items.
Of course you'd have to have a limited list; currently MySQL only supports up to 64 different items.
I'd be less worried about storage space and more worried about record retrieval, i.e., indexability/searching.
For example, I imagine performing a LIKE or REGEXP in a WHERE clause to find a single item in the list will be quite a bit more expensive than if you normalized each list item into a row in a separate table.
However, if you never need to perform such queries against these columns, then it just won't matter.
Since you are using a relational database, you should know that storing non-atomic values in individual fields breaks even the first normal form. More likely than not, you should follow Don's advice and keep those values in a related table. I can't say that for certain because I don't know your problem domain. It may well be that choosing an RDBMS for this data was a bad choice altogether.