I'm trying to write code to compute pairwise differences in a bunch of data within and between groups. That is, I've loaded the data into a dictionary so that the ith value of data j in group k is accessible by
data[j][group[k]][i]
I've written for loops to calculate all of the within-group pairwise differences, but I'm a little stuck on how to then calculate the between-group pairwise differences. Is there a way to compare all of the values in data[j][group[k]] to all of the values in data[j][*NOT* group[k]]?
Thanks for any suggestions.
You could compare them all and then throw out the pairs where the group is the same as the one being compared to. (I hope that makes sense.)
Or
make a temporary group[l] equal to all of the groups minus the group[k] you are currently comparing against.
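For what it's worth, here is a minimal sketch of the general idea behind both suggestions, assuming data[j] maps each group name to a list of values (the function and variable names here are mine, not from the original code):

    from itertools import combinations

    # Minimal sketch: assumes data_j maps each group name to a list of
    # values, mirroring the data[j][group[k]][i] layout described above.
    def between_group_diffs(data_j, groups):
        """Pairwise differences between values in different groups."""
        diffs = []
        # combinations() yields each unordered pair of distinct groups
        # exactly once, so same-group pairs are never generated at all.
        for g1, g2 in combinations(groups, 2):
            for a in data_j[g1]:
                for b in data_j[g2]:
                    diffs.append(a - b)
        return diffs

Skipping same-group pairs up front avoids generating and then discarding them, which matters once the groups get large.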
The title probably sounds weird, but I'll explain what I mean.
Given k values, I want to map those to m buckets (m < k by at least an order of magnitude), such that each bucket contains either floor(k/m) or floor(k/m)+1 of the initial values.
Writing code to do that is very simple. The issue is that I want to be able to replicate the same distribution, by including something in my code that makes the distribution reproducible (something along the lines of a "seed"). Does anyone have any ideas on how this can be done?
For the record, what I'm doing now in my Python code is selecting a random non-full hash bucket for each of the k values until all buckets contain floor(k/m) values. After this is done, I assign each of the remaining values to a random bucket.
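One way to get the "seed" behaviour: route all the random choices through a private random.Random instance that is seeded explicitly, so the same seed always reproduces the same bucket layout. A sketch under that assumption (the names and the exact assignment strategy are illustrative, not your actual code):

    import random

    def assign_buckets(values, m, seed=42):
        """Distribute len(values) items over m buckets holding floor(k/m)
        or floor(k/m)+1 items each, reproducibly for a given seed."""
        rng = random.Random(seed)      # private RNG: same seed, same layout
        k = len(values)
        base, extra = divmod(k, m)     # 'extra' buckets get one more item
        # Decide reproducibly which buckets receive the extra item.
        big = set(rng.sample(range(m), extra))
        shuffled = list(values)        # shuffle a copy, reproducibly
        rng.shuffle(shuffled)
        buckets, pos = [], 0
        for i in range(m):
            size = base + (i in big)
            buckets.append(shuffled[pos:pos + size])
            pos += size
        return buckets

Using random.Random(seed) rather than the module-level random.seed() keeps the reproducibility local, so other random calls elsewhere in the program can't disturb the sequence.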
I'm designing an algorithm in Python and trying to maintain one invariant, but I don't know if that is even possible to maintain. It's part of an MST algorithm.
I have some "wanted nodes". They are wanted by one or more clusters, each of which is implemented as a list of nodes. If I get a node that is wanted by exactly one cluster, it gets placed into that cluster. However, if more than one cluster wants it, all of those clusters get merged and then the node gets placed in the resulting cluster.
My goal
I am trying to get the biggest cluster in the list of "wanting clusters" in constant time, as if I had a max-heap and could use the updated size of each cluster as the key.
What I am doing so far
The structure that I am using right now is a dict where the keys are nodes and the values are lists of the clusters that want that node. This way, if I get a node, I can check in constant time whether some cluster wants it, and if any do, I loop through the list of clusters checking which one is the biggest. Once I finish the loop, I merge the clusters by updating the information in all of the smaller clusters. This way I get a total merging time of O(n log n) instead of O(n²).
Question
I was wondering if I could store something like a heap in my dict as the value, but I don't know how that heap would be kept up to date with the current size of each cluster. Is it possible to do something like that using pointers and possibly another dict storing the size of each cluster?
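For illustration, here is a rough sketch of that size-dict idea (all names are hypothetical). It skips the heap entirely: a plain dict keeps each cluster's size in sync, and max() runs only over the short list of clusters wanting this node, which is what the O(n log n) smaller-into-larger bound already relies on:

    # Hypothetical sketch: clusters are identified by integer ids.
    cluster_members = {}   # cluster id -> list of nodes
    cluster_size = {}      # cluster id -> len of that list, kept in sync
    wanted_by = {}         # node -> list of cluster ids wanting it

    def place_node(node):
        wanting = wanted_by.pop(node, [])
        if not wanting:
            return
        # max() here is O(len(wanting)), not O(number of clusters);
        # a per-node heap would go stale as cluster sizes change.
        biggest = max(wanting, key=cluster_size.__getitem__)
        for cid in wanting:
            if cid == biggest:
                continue
            cluster_members[biggest].extend(cluster_members.pop(cid))
            cluster_size[biggest] += cluster_size.pop(cid)
        cluster_members[biggest].append(node)
        cluster_size[biggest] += 1

One caveat: after a merge, any other wanted_by lists that still name an absorbed cluster id are stale; a union-find style parent map is the usual way to redirect those references cheaply.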
I'm attempting to compare two ~100M-row HDF5 datasets. The first dataset is the master, and the second is the result of the master being mapped and run through a cluster to compute a specific result for each row.
I need to validate that all the intended rows from the master are present, remove any duplicates, and create a list of any missing rows that need to be computed. Hash values would be generated from the elements common to the two datasets. I realize, though, that it likely wouldn't be practical to loop through them row by row in native Python.
That being the case, what would be a more efficient means of running this task? Do you try to code something in Cython to offset Python's loop speed, or is there a "better" way?
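As a sketch of one direction (not a definitive answer): if each row carries a unique integer key, you can read just that column with h5py and let numpy's sorted set operations do the row-level work in C instead of a Python loop. The file, dataset, and field names below are hypothetical, and this assumes compound datasets with an integer "row_id" field:

    import h5py
    import numpy as np

    # Hypothetical names: adjust to the real files/datasets/fields.
    with h5py.File("master.h5", "r") as f:
        master_ids = f["data"]["row_id"]       # reads the key field only
    with h5py.File("result.h5", "r") as f:
        result_ids = f["data"]["row_id"]

    # np.unique sorts in C, so no Python-level row loop is needed.
    result_unique, counts = np.unique(result_ids, return_counts=True)
    duplicates = result_unique[counts > 1]     # rows computed more than once
    missing = np.setdiff1d(master_ids, result_unique)  # rows still to compute
    print(len(duplicates), "duplicate ids;", len(missing), "missing ids")

Two ~100M-row int64 key columns are around 800 MB each in memory; if that is too much, the same idea works chunk-wise, at the cost of a merge step.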
I have a boolean matrix in Python and need to find out which rows are duplicates. The representation can also be a list of bitarrays, as I am using that for other purposes anyway. Comparing all rows against all rows is not an option, as this would mean 12500^2 comparisons and I can only do about 500 per second. Converting each row into an integer is also not possible, as each row is about 5000 bits long. Still, it seems to me that the best way would be to sort the list of bitarrays and then compare only consecutive rows. Does anyone have an idea how to map bitarrays to sortable values, or how to sort a list of bitarrays in the first place? Or is there a different approach that is more promising? Also, since I only have to do this once, I prefer less code over efficiency.
OK, so a list of bitarrays is quickly sortable with sort() or sorted(). Furthermore, a probably better way to solve this problem is indicated in Find unique rows in numpy.array.
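For example, a short numpy sketch (assuming the matrix fits in memory as a 12500 x 5000 boolean array): pack each row into bytes and push the rows through a dict, so identical rows collide without any pairwise comparison.

    import numpy as np

    def duplicate_rows(rows):
        """Return (first_index, dup_index) pairs for duplicate rows."""
        packed = np.packbits(rows, axis=1)   # 5000 bits -> 625 bytes per row
        seen = {}
        dupes = []
        for i, row in enumerate(packed):
            key = row.tobytes()              # hashable, usable as dict key
            if key in seen:
                dupes.append((seen[key], i))
            else:
                seen[key] = i
        return dupes

np.unique(rows, axis=0, return_index=True) is the one-line variant from the linked question, at the cost of a full sort; the dict version is linear and only 12500 iterations of cheap Python.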
I'm fairly new to Python and I need advice on figuring out how to implement this. I'm not sure what the best structure to use would be.
I need a dictionary-type structure that has two keys for each value. I need to retrieve a value using both keys, but delete it by either key. I also need to be able to find the maximum value and return its key (or a list of keys if there are duplicate maximums).
Basically this is for finding the longest distance between any 2 points on a graph. I will have a list of points and I can calculate all the distances, but at any time I need to get the maximum distance and which points it connects. Any point can be removed at any time so I need to be able to remove values that connect to those points.
Obviously there is no existing structure that does this, so I'll have to write my own class, but does anyone have advice on where to start? At first I was going to use a dictionary with a tuple key, but is there a fast way to find the maximum value and also get its key (or a list of keys, given the possibility of duplicate values)? Also, how can I easily delete values by a single part of the tuple?
I'm not asking anyone to solve this for me; I'm trying to learn, but any advice would help. Thanks in advance.
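Not a full solution, but a sketch of one place to start (class and attribute names are made up): a dict keyed by a frozenset of the two points, plus a per-point index so either endpoint can drive deletion. The maximum here is a linear scan over the distances; a heap with lazy deletion could speed that up later.

    class PairDistances:
        def __init__(self):
            self.dist = {}      # frozenset({p, q}) -> distance
            self.by_point = {}  # point -> set of pair keys touching it

        def add(self, p, q, d):
            key = frozenset((p, q))
            self.dist[key] = d
            self.by_point.setdefault(p, set()).add(key)
            self.by_point.setdefault(q, set()).add(key)

        def remove_point(self, p):
            # Drop every pair touching p, in O(pairs touching p).
            for key in self.by_point.pop(p, set()):
                self.dist.pop(key, None)
                for other in key - {p}:
                    self.by_point.get(other, set()).discard(key)

        def maximum(self):
            # Linear scan; returns all pairs tied for the maximum.
            if not self.dist:
                return None, []
            best = max(self.dist.values())
            return best, [tuple(k) for k, v in self.dist.items() if v == best]

The frozenset key makes (p, q) and (q, p) the same entry, which covers the "2 keys for a value" part, and the by_point index covers "delete the value by either key".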