Histogram of lexicographically ordered list in Python - python

I have a list of tuples, containing floats, e.g.:
myList = [(1.0,2.0), (1.0,0.5), (2.0,1.0), (3.0,2.0), (3.0,0.0)]
The lexicographic order of the tuples is:
mySortedList = [(1.0,0.5), (1.0,2.0), (2.0,1.0), (3.0,0.0), (3.0,2.0)]
I.e. one tuple is smaller than another if its first entry is smaller, or if the first entries are equal and its second entry is smaller.
Now I want to make a histogram that shows the distribution of the data ordered lexicographically like mySortedList. Is there any way to do so with a built-in function in Python? plt.hist works only for one-dimensional lists. By the way, is a histogram a good approach at all to show the density in this case? (My statistics skills are rather limited, sorry.)

In this case:
print(sorted(myList,key=sum))
Would work
Output:
[(1.0,0.5), (1.0,2.0), (2.0,1.0), (3.0,0.0), (3.0,2.0)]
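As for the histogram part of the question: for two-dimensional points, a 2D histogram (or a hexbin plot) may show the density more directly than a 1D histogram of the sorted tuples. A minimal sketch with matplotlib, with the bin count chosen arbitrarily:

import matplotlib.pyplot as plt

myList = [(1.0, 2.0), (1.0, 0.5), (2.0, 1.0), (3.0, 2.0), (3.0, 0.0)]
# Note: tuples already compare lexicographically, so sorted(myList)
# produces mySortedList without needing a key function.
xs, ys = zip(*myList)        # split into x and y coordinates
plt.hist2d(xs, ys, bins=5)   # 2D histogram of point density
plt.colorbar()
plt.show()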

Related

Sort unknown length array within unknown length 2D array - Python

I have a Python script which ends up creating a 2D array based on user input. Therefore, the length of the 2D array is unknown and the lengths of the individual arrays within the 2D array are also unknown until the user has input the information. I would like to sort the individual array pieces based on a value associated with them. An example of a possible output that needs to be sorted is below:
Basically, each individual array is a failure symptom followed by a list of possible components, each having a "score" associated with it that is the likelihood that this component is causing the failure. My goal is to reorder each array so that the components and their scores are in descending order by score, i.e., each component and its score need to be moved together. The problem is, as I said, that I do not know the length of anything until user input is given. There could be only 1 failure symptom input, or there could be 9. A failure symptom could contain only 1 component, or maybe 12. I know it will take nested for loops and if statements, but I haven't been able to figure it out given all the possible scenarios. Some possible scenarios I have thought of:
The array is already in order (move to the next failure symptom)
The first component is correct, but the ones after may not be. Or the first two are correct, but the ones after may not be, etc...
The array is completely backwards in order
The array only contains 1 component, therefore there is no need to sort
The array is in some random order, so some positions for some components may already be in the correct spot while some others aren't
Every time I feel like I am making headway, I think of another scenario which wouldn't hold up. Any help is greatly appreciated!
Your problem is a bit special. You don't only want to sort a multidimensional array, which would be rather simple using the default sorting algorithms; you also want to keep each key together with its value while sorting.
The second problem is that the keys are strings with numbers in them, so simple string comparison wouldn't work: strings are compared letter by letter, so "test9" > "test11" would be true (the second 1 wouldn't even be considered, because 9 > 1).
The simplest solution I figured out is the following:
# get the failure id of one list
def failureId(value):
    return int(value[0].replace("failure", ""))

# get the id of one component
def componentId(value):
    return int(value.replace("component", ""))

# sort one failure list using bubble sort
def sortFailure(failure):
    # iterating through the array twice (only the keys, ignoring the values)
    for i in range(1, len(failure), 2):
        for j in range(1, i, 2):
            # comparing the component ids
            if componentId(failure[j]) > componentId(failure[j+2]):
                # swapping keys and values
                failure[j], failure[j+2] = failure[j+2], failure[j]
                failure[j+1], failure[j+3] = failure[j+3], failure[j+1]

# sorting the full list
def sortData(data):
    # sorting the failures using the default sort algorithm
    data.sort(key=failureId)
    # sorting each single failure list itself
    for failure in data:
        sortFailure(failure)

data = [['failure2', 'component2', 0.15, 'component1', 0.85],
        ['failure3', 'component1', 0.95],
        ['failure1', 'component1', 0.05, 'component3', 0.8, 'component2', 0.1, 'component4', 0.05]]
print(data)
sortData(data)
print(data)
The first two functions are needed to get the numbers (= ids) out of the strings, as mentioned above. The sortFailure function uses "bubble sort" to sort one failure list. It uses a step of 2 in the range calls because we want to skip the score values and only compare the component keys. If two components are in the wrong order, we swap both the keys and the values. In the sortData function we use the built-in sort for lists to sort the whole list by failure id, then take each sublist and sort it with sortFailure.
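For comparison, here is a sketch of the same per-failure sort done with the built-in sorted instead of a hand-written bubble sort, by pairing each component with its score first (sort_failure_builtin is a made-up name for illustration; it reuses componentId from above and returns a new list instead of sorting in place):

def sort_failure_builtin(failure):
    # split off the failure name, pair components with scores,
    # sort the pairs by component id, then flatten back into one list
    head, rest = failure[0], failure[1:]
    pairs = sorted(zip(rest[0::2], rest[1::2]),
                   key=lambda pair: componentId(pair[0]))
    return [head] + [item for pair in pairs for item in pair]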

Compare 2 lists of colours (unsorted and different length)

I have a task that is really stumping me. I have produced an algorithm that sorts colours into bins/sub-groups. I want to assess how well it works compared to human intuition. So I've created some lists of colours (my data) and manually gone through and sorted them into bins/sub-groups of how I think the algorithm should sort the colours (my ground truths). Then I feed those same lists of colours (my data) to the algorithm and compare its sorting to my ground truths.
Here lies my problem. I don't know how best to compare the ground-truths to the results in order to assess how well the algorithm is working. Can anyone provide advice on how to compare 2 lists of colours?
Below is an example of the ground truth and algorithm result. I need to compare these 2 different lists of colours to see how close the result is to the ground truth (the left one). As you can see; the number of bins changes, the length of each bin is variable and the order of colours in each bin is variable. The only constant is both lists will always have the same number of colours (they will just be sorted differently). So this is why it makes it so complex (for me atleast) to figure out how to compare them.
Example input data, ie a just list of colours that is fed to the colour sorter:
[[69,99,121],[59,91,103],[71,107,140],[97,132,162],[85,117,141],[94,136,153],[86,131,144],[65,99,118],[211,214,201],[204,204,191],[203,207,188],[215,216,203],[194,199,180],[222,215,200],[219,213,195],[214,206,191],[197,188,172],[186,177,160],[206,197,181],[206,196,183],[38,35,31],[5,5,12],[31,34,41],[42,39,34],[30,32,27],[12,8,9]]
Example output from the colour sorter (the colours above have been sorted into 4 bins/sub-groups):
[
[[69,99,121],[59,91,103],[71,107,140],[97,132,162],[85,117,141],[94,136,153],[86,131,144],[65,99,118]],
[[211,214,201],[204,204,191],[203,207,188],[215,216,203],[194,199,180]],
[[222,215,200],[219,213,195],[214,206,191],[197,188,172],[186,177,160],[206,197,181],[206,196,183]],
[[38,35,31],[5,5,12],[31,34,41],[42,39,34],[30,32,27],[12,8,9]]
]
Note: I can easily change the format of the sorted colours to something else (like a numpy array or histogram) if you think that would make it easier to compare. Note with a histogram, the number of bins needs to be the same for each so I'd need to pad one of the lists presumably.
How can I compare these 2 python lists when sub-list order doesn't matter much, and the sub-list length is so variable?
Edit: Clarification of the problem: I think I have bin comparison solved (see the code below). The problem is how to know which bin from the ground truth to compare to which bin from the results. For example, in the above image I need to compare bin 2 from the ground truth (left side) to bin 1 from the results (right side), i.e., compare the orange bins from each of them. The problem also arises when there is no bin from the results to compare a ground-truth bin to.
import numpy as np
from scipy.spatial.distance import cdist

def validator(result_bin, ground_truth_bin):
    # todo: pad the shorter bin with black values so each is the same length
    dists = cdist(result_bin, ground_truth_bin, 'euclidean')
    correct_guesses = np.sum(dists < 25, axis=1)
    score = float(len(correct_guesses)) / len(ground_truth_bin)
    return score
RGB is a poor representation of human color perception.
Convert to HSV or Lab first. Then you could use e.g. cosine similarity for each color pair.
Since your lists are of different lengths, finding pairs to compare can be done in many ways. I can suggest a few.
For each color in the longer list, find the closest color in the shorter list; use the Euclidean length of the vector of differences as the scalar measure.
For each color in the shorter list, find the closest color in the longer list, measure the difference as above, and remove the matched colors from the longer list. Now you have two lists again; repeat the process. In the end you have a list of difference measures; average it over the number of runs (arithmetic or geometric mean).
Hope this helps.
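A minimal sketch of the first suggestion (match each color in the longer list to its nearest color in the shorter list and average the distances), using the standard-library colorsys for the HSV conversion; treating HSV as a plain Euclidean space is a simplification, since hue is actually circular:

import colorsys
import math

def to_hsv(rgb):
    # convert an [r, g, b] triple in 0-255 to an (h, s, v) tuple in 0-1
    r, g, b = (c / 255.0 for c in rgb)
    return colorsys.rgb_to_hsv(r, g, b)

def mean_nearest_distance(longer, shorter):
    # for each color in the longer list, find the closest color in the
    # shorter list and return the average distance (lower = more similar)
    shorter_hsv = [to_hsv(c) for c in shorter]
    total = sum(min(math.dist(to_hsv(c), s) for s in shorter_hsv) for c in longer)
    return total / len(longer)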

What is the fastest way of computing powerset of an array in Python?

Given a list of numbers, e.g. x = [1,2,3,4,5] I need to compute its powerset (set of all subsets of that list). Right now, I am using the following code to compute the powerset, however when I have a large array of such lists (e.g. 40K of such arrays), it is extremely slow. So I am wondering if there can be any way to speed this up.
superset = [sorted(x[:i]+x[i+s:]) for i in range(len(x)) for s in range(len(x))]
I also tried the following code, however it is much slower than the code above.
from itertools import chain, combinations

def powerset(x):
    xx = list(x)
    return chain.from_iterable(combinations(xx, i) for i in range(len(xx) + 1))
You can represent a powerset more compactly by having all subsets reference the original set as a list and having each subset carry an integer whose bits indicate which elements are included. You can then enumerate the power set by computing the number of elements n and iterating over the integers from 0 to 2**n - 1. However, as has been noted in the comments, the power set grows extremely fast, so if you can avoid computing or iterating through it at all, you should.
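A minimal sketch of that idea as a lazy generator, since materialising the whole power set at once is usually what hurts:

def powerset_bitmask(x):
    # bit i of mask is set  <=>  x[i] is included in the subset
    n = len(x)
    for mask in range(1 << n):
        yield tuple(x[i] for i in range(n) if mask & (1 << i))

For x = [1, 2, 3] this yields (), (1,), (2,), (1, 2), (3,), (1, 3), (2, 3), (1, 2, 3).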

Map a tuple of arbitrary length to an RGB value

I need to map tuples (of integers) of arbitrary (but fixed) length to RGB values. It would be especially nice if I could have them ordered more or less by magnitude, with any standard way of choosing the sub-ordering between (0,1) and (1,0).
Here's how I'm doing this now:
I have a long list of RGB values of colors.
colors = [(0,0,0),(255,255,255),...]
I take the hash of the tuple mod the number of colors, and use this as the index.
def tuple_to_rgb(atuple):
    index = hash(atuple) % len(colors)
    return colors[index]
This works OK, but I'd like it to work more like a heatmap value, where (5,5,5) has a larger value than (0,0,0), so that adjacent colors make some sense, maybe getting "hotter" as the values get larger.
I know how to map integers onto RGB values, so perhaps if I just had a decent way of generating a unique integer from a tuple that sorted first by the magnitude of the tuple and then by the interior values it might work.
I could simply write my own sort comparator, generate the entire list of possible tuples in advance, and use the order in the list as the unique integer, but it would be much easier if I didn't have to generate all of the possible tuples in advance.
Does anyone have any suggestions? This seems like something do-able, and I'd appreciate any hints to push me in the right direction.
For those who are interested, I'm trying to visualize predictions of electron occupations of quantum dots, like those in Fig 1b of this paper, but with arbitrary number of dots (and thus an arbitrary tuple length). The tuple length is fixed in a given application of the code, but I don't want the code to be specific to double-dots or triple-dots. Probably won't get much bigger than quadruple dots, but experimentalists dream up some pretty wild systems.
Here's an alternative method. Since the dots I've generated so far only have a subset of the possible occupations, the color maps were skewed one way, and didn't look as good. This method requires a list of possible states to be passed in, and thus these must be generated in advance, but the resulting colormaps look much nicer.
class Colormapper2:
    """
    Like Colormapper, but uses a list of possible occupations to
    generate the maps, rather than generating all possible occupations.
    The difference is that the systems I've explored only have a subset
    of the possible states occupied, and the colormaps look better
    this way.
    """
    def __init__(self, occs, **kwargs):
        import matplotlib.pyplot as plt
        colormap = kwargs.get('colormap', 'hot')
        self.occs = sorted(list(occs), key=sum)
        self.n = float(len(self.occs))
        self.cmap = plt.get_cmap(colormap)

    def __call__(self, occ):
        ind255 = int(255 * self.occs.index(occ) / self.n)
        return self.cmap(ind255)
Here's an example of the resulting image:
You can see the colors are better separated than the other version.
Here's the code I came up with:
class Colormapper:
    """
    Create a colormap to map tuples onto RGBA values produced by matplotlib's
    cmap function.
    Arguments are the maximum value of each place in the tuple. The dimension
    of the tuple is inferred from the length of the args array.
    """
    def __init__(self, *args):
        from itertools import product
        import matplotlib.pyplot as plt
        self.occs = sorted(list(product(*[range(arg + 1) for arg in args])), key=sum)
        self.n = float(len(self.occs))
        self.hotmap = plt.get_cmap('hot')

    def __call__(self, occ):
        ind255 = int(255 * self.occs.index(occ) / self.n)
        return self.hotmap(ind255)
Here's an example of the result of this code:
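For illustration, a small usage sketch (the arguments and occupation tuples are made up):

cm = Colormapper(2, 2)   # all tuples (i, j) with 0 <= i, j <= 2
print(cm((0, 0)))        # RGBA near the dark end of the 'hot' colormap
print(cm((2, 2)))        # RGBA near the bright end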

Python list reordering, remember original order?

I'm working on a Bayesian probability project, in which I need to adjust probabilities based on new information. I have yet to find an efficient way to do this. What I'm trying to do is start with an equal probability list for distinct scenarios. Ex.
There are 6 people: E, T, M, Q, L, and Z, and their initial respective probabilities of being chosen are represented in
myList=[.1667, .1667, .1667, .1667, .1667, .1667]
New information surfaces that people in the first third alphabetically have a collective 70% chance of being chosen. A new list is made, sorted alphabetically by name (E, L, M, Q, T, Z), that just includes the new information. (.7/.333 ≈ 2.1, .3/.667 ≈ .45)
newList=[2.1, 2.1, .45, .45, .45, .45]
I need a way to order the newList the same as myList so I can multiply the right values in a list comprehension and reach the adjusted probabilities. Having a single consistent order is important because the process will be repeated several times, each with different criteria (vowels, closest to P, etc.), and in a list with about 1000 items.
Each newList could instead be a newDictionary, and then once the adjustment criteria are created they could be ordered into a list, but transforming multiple dictionaries seems inefficient. Is it? Is there a simple way to do this I'm entirely missing?
Thanks!
For what it's worth, the best thing you can do for the speed of your methods in Python is to use numpy instead of the standard types (you'll then be using pre-compiled C code to perform the arithmetic). This will lead to a dramatic speed increase. Numpy arrays have a fixed ordering anyway, and the syntax maps more directly onto mathematical operations. You just need to consider how to express the operations as matrix operations. For example, your case:
myList = np.ones(6) / 6.
newInfo = np.array( [.7/2, .7/2, .3/4, .3/4, .3/4, .3/4] )
result = myList * newInfo
Since both vectors have unit sum there's no need to normalise (I'm not sure what you were doing in your example, I confess, so if there's a subtlety I've missed let me know), but if you do need to it's trivial:
result /= np.sum(result)
Try storing your info as a list of tuples:
bayesList = [('E', 0.1667), ('M', 0.1667), ...]
Your list comprehension can be along the lines of:
newBayes = [(person, prob * normalizeFactor) for person, prob in bayesList]
where normalizeFactor was calculated before setting up your list comprehension.
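For instance, a sketch of how the scaling factors could be computed for the "first third alphabetically" example from the question (the names factorIn/factorOut and the membership test are made up for illustration):

bayesList = [('E', 0.1667), ('T', 0.1667), ('M', 0.1667),
             ('Q', 0.1667), ('L', 0.1667), ('Z', 0.1667)]

selected = {'E', 'L'}                    # first third alphabetically
priorSelected = sum(prob for person, prob in bayesList if person in selected)
factorIn = 0.7 / priorSelected           # boost for the selected people (~2.1)
factorOut = 0.3 / (1.0 - priorSelected)  # damp for everyone else (~0.45)

newBayes = [(person, prob * (factorIn if person in selected else factorOut))
            for person, prob in bayesList]

Because the comprehension iterates over bayesList, the original order is preserved automatically.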
