Finding several regions of interest in an array - python

Say I have conducted an experiment where I've left a Python program running for some long time, and in that time I've taken several measurements of some quantity against time. Each measurement is separated by some value between 1 and 3 seconds, with the time step used being much smaller than that, say 0.01 s. An example of such an event, if you just take the y axis, might look like:
[...0,1,-1,4,1,0,0,2,3,1,0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1,1,0,0,1,-1,0,0,2,0...]
Here we have a period of inactivity followed by a sharp rise, a fall, a brief pause around 0, a sharp drop, a sharp rise, and a settling back around 0. The dots indicate that this is part of a long stream of data extending in both directions. There will be many of these events over the whole dataset, with varying lengths, separated by low-magnitude regions.
I wish to essentially form an array of 'n' arrays (tuples?) of varying lengths capturing just the events, so I can analyse them separately later. I can't separate purely by an np.absolute()-type threshold, because there are occasional small regions of near-zero values within a given event, as in the example above. In addition, there may be occasional blips in between events with large magnitude but short duration.
The sample above would ideally end up as follows, with a couple of elements from the flat region kept on either side:
[0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1]
I'm thinking something like:
Input:
[0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0]
Split based on some number of consecutive values below a magnitude of 2.
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[8,-7,-1,0],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0],[1,22,4]]
If a sub-array's length is less than, say, 10, then remove it:
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0]]
Is this a good way to approach it? The first step is also confusing me a little: I need to preserve those small low-magnitude regions within an event.
Re-edited! I'm going to be comparing two signals, each measured as a function of time, so they will be zipped together in a list of tuples.

Here is my two cents, based on exponential smoothing.
import itertools
import numpy as np

A = np.array([0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0])
# Pad both ends so the sliding windows line up with A
B = np.hstack(([0, 0], A, [0, 0]))
# Build the 5-element sliding windows
B = np.asanyarray(list(zip(*[B[i:] for i in range(5)])))
# C is the signal smoothed with a 5-element weighted sliding window
C = (B * [0.25, 0.5, 1, 0.5, 0.25]).mean(axis=1)
D = []
# Collect the runs of indices where the smoothed signal has magnitude > 1.5.
# Change 1.5 to control the behaviour.
for key, group in itertools.groupby(enumerate(C), lambda x: abs(x[1]) > 1.5):
    if key:
        D.append(list(group))
E = [D[0]]
for item in D[1:]:
    # Merge interesting regions if they are fewer than 5 indices apart.
    # Change 5 to control the behaviour.
    if (item[0][0] - E[-1][-1][0]) < 5:
        E[-1] = E[-1] + item
    else:
        E.append(item)
print([(item[0][0], item[-1][0]) for item in E])
# Filter out the interesting regions <10 in length.
print([A[item[0][0]:item[-1][0]] for item in E if (item[-1][0] - item[0][0]) > 9])
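As an aside, the sliding-window smoothing above can be written more compactly with np.convolve; since the kernel is symmetric and mode='same' zero-pads the edges just like the hstack, the following should reproduce C (an equivalent reformulation, not part of the original answer):

w = np.array([0.25, 0.5, 1.0, 0.5, 0.25])
C = np.convolve(A, w, mode='same') / 5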

Related

Finding the coordinates to randomly distribute images on a display without overlap

I have the following code which randomly generates a list of (X,Y) tuples:
import random

coords = []
for i in range(10):
    x = random.randint(85, 939)
    y = random.randint(75, 693)
    coords.append((x, y))
In the final list, the X values of each tuple are considered to overlap if the absolute difference between them is less than 85, and the Y values are considered to overlap if the absolute difference is less than 75. How can I make sure that none of the tuples in the final list will overlap in both dimensions?
The easiest way to do this is to just keep sampling and discarding coordinates that would create overlap. This will, however, become very inefficient when you come close to filling the available space. If that's not an issue, you should go with this solution.
A bit more efficient and as far as I can tell statistically equivalent is to sample one coordinate, like the row, first. Then compute the occupied area in that row and sample from the remaining positions if there are any.
To avoid the same problem as in the easy solution, a row with no available space should be removed from the possible sampling outcomes for the row coordinate (along with the margin of 75 in both directions).
Ideally, you don't compute the occupied regions each time but you keep a mapping from the row to the occupied space in that row and the amount of non-full rows and just update this mapping when inserting new images. You will need storage for n_rows + 1 extra numbers.
To clarify: When sampling from a restricted space just subtract the occupied positions and get a sampling result n. Then find the correct position for n by walking along the coordinate axis n steps, skipping all the occupied positions.
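For concreteness, here is a minimal sketch of the easy rejection approach; the max_tries cap is an addition of this sketch, to avoid looping forever when the space is nearly full:

import random

def sample_non_overlapping(n, max_tries=10000):
    coords = []
    tries = 0
    while len(coords) < n and tries < max_tries:
        tries += 1
        x = random.randint(85, 939)
        y = random.randint(75, 693)
        # Two points overlap only if they are close in BOTH dimensions.
        if any(abs(x - cx) < 85 and abs(y - cy) < 75 for cx, cy in coords):
            continue
        coords.append((x, y))
    return coords

print(sample_non_overlapping(10))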

Optimizing random sampling for scaling permutations

I have a two-fold problem here.
1) My initial problem, which was trying to improve the computational time on my algorithm
2) My next problem: one of my 'improvements' appears to consume over 500 GB of RAM, and I don't really know why.
I've been writing a permutation for a bioinformatics pipeline. Basically, the algorithm starts off sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is not met again, this up-scales to 100000, and then 1000000 nulls per variant.
I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly looking comprehension:
output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))
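For readers unfamiliar with the map(list, zip(*...)) idiom used here, a toy illustration with made-up data:

# A list of (name, p_value) pairs becomes two parallel lists.
pairs = [("a", 0.01), ("b", 0.20), ("c", 0.05)]
names, p_values = map(list, zip(*pairs))
print(names)      # ['a', 'b', 'c']
print(p_values)   # [0.01, 0.2, 0.05]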
Basically, all this is doing is calling my random_sampling_for_variant function on each variant in my variant list, and throwing the two outputs from that function into a list of lists (so I end up with two lists of lists, output_names_df and output_p_values_df). I then turn these lists of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:
def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    # Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    # If number of permutations >= number of nulls, sample with replacement
    if num_permutations >= len(permuted_null_variant_table):
        replace = True
    else:
        replace = False
    # Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return (temp_names_df, temp_p_values_df)
When defining permuted_null_variant_table, I'm just querying a pre-defined grouped up DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows in there determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually being done. temp_names_df and temp_p_values_df get the values that I want out of the null table, and then I send my returns out of the function.
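As a toy illustration of that pre-grouped lookup (the column names here are made up for the example), grouping once means each get_group call is a cheap lookup rather than a fresh filter:

import pandas as pd

nulls = pd.DataFrame({"bin": ["a", "a", "b", "b"],
                      "variant": ["v1", "v2", "v3", "v4"],
                      "GWAS_p_value": [0.01, 0.20, 0.05, 0.30]})
options_tables = nulls.groupby("bin")
print(options_tables.get_group("a"))  # only the nulls in bin "a"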
The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to get to, my process got killed because it was apparently going to take up more RAM than the cluster had.
I'm stumped on how to proceed. I've turned all my ints into 8-bit uints, and all my floats into 16-bit floats. I've dropped columns that I don't need where I don't need them, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the computational time of the comprehension is 3/5 that of the loop. Any feedback is appreciated.

Can I make an O(1) search algorithm using a sorted array with a known step?

Background
My software visualizes very large datasets; the data is so large that I can't store it all in RAM at any one time, so it has to be loaded in a paged fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application.
These datasets contain three internal lists I use to visualize: time, height, and dataset. My program plots the data as time vs. height, and additionally users have the option of drawing shapes around regions of the graph that can be extracted to a whole different plot.
The difficult part is that when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array.
While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate that actually exists in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only a small bit off from matplotlib's coordinate.
The Problem
So given some arbitrary value, let's say x1 = 732839.154395 (a representation of the date in number format), and a list of similar values with a constant step:
732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...
What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the time list is huge. Since I know the array is (1) sorted and (2) incremented with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves this kind of problem, or would I simply need to devise a custom one? Here is my current thought process (a sketch follows the list):
grab first and second element of time
subtract second element of time with first, obtain step
subtract bounding x value with first element of time, obtain difference
divide difference by step, obtain index
move time forward to index
check surrounding elements of index to ensure closest representation
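In code, that thought process might look something like this minimal sketch (the clamping and the exact neighbour check are assumptions about edge handling):

def nearest_index(times, x):
    step = times[1] - times[0]                # obtain step
    i = int(round((x - times[0]) / step))     # obtain index
    i = max(0, min(i, len(times) - 1))        # clamp into range
    # check surrounding elements to ensure the closest representation
    candidates = [j for j in (i - 1, i, i + 1) if 0 <= j < len(times)]
    return min(candidates, key=lambda j: abs(times[j] - x))

times = [732839.154392 + 0.0000005 * k for k in range(16)]
print(nearest_index(times, 732839.154395))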
The algorithm you suggest seems reasonable, and it looks like it would work.
As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded, i.e., when the data-generation clock, e.g., the frame rate, is not synced with the computer.)
The easy way around this is to read two points separated by a larger interval: for example, read the first time value and then the 1000th. Everything stays the same in your calculation, but you get your timestep by subtracting the two and then dividing by 1000.
Here's a test that makes data similar to yours:
start = 97523.29783
increment = .000378912098
target = 97585.23452
# build a timeline
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10])  # truncate the time value
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment
# now test
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
    dt = (times[i] - times[0]) / i
    index = int((target - start) / dt)
    print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))
Result:
  span    actual    guess   est dt (actual=.000378912098)
     1    163460   154841   0.0004000000
     2    163460   176961   0.0003500000
     5    163460   162991   0.0003800000
    10    163460   162991   0.0003800000
   100    163460   163421   0.0003790000
  1000    163460   163464   0.0003789000
 10000    163460   163460   0.0003789100
That is, as the space between the sampled points gets larger, the time-interval estimate gets more accurate (compare to increment in the program) and the estimated index (3rd col) gets closer to the actual index (2nd col). Note that the accuracy of the dt estimate is basically just proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from your question statement that this would be difficult; if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by making my time-value recording very coarse, but in general, every power of 10 in your span increases your accuracy by the same amount.
As an example of that last point, if I reduce the coarseness of the time values by changing the truncation line to trunc = float(str(time)[:12]), I get:
  span    actual    guess   est dt (actual=.000378912098)
     1    163460   163853   0.0003780000
    10    163460   163464   0.0003789000
   100    163460   163460   0.0003789100
  1000    163460   163459   0.0003789120
 10000    163460   163459   0.0003789121
So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.
Overall, this is very similar in idea to the linear "interpolation search". It's just a bit easier to implement because it's only making a single guess based on the interpolation, so it just takes one line of code: int((target-start)*i/(times[i] - times[0]))
What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.
The Wikipedia article on interpolation search contains a C++ implementation.
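For reference, here is a short Python sketch of interpolation search (returning the insertion point when the target is absent, so a caller looking for the nearest value would still check the neighbours):

def interpolation_search(arr, target):
    if target <= arr[0]:
        return 0
    if target >= arr[-1]:
        return len(arr) - 1
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:
            break
        # Probe where target should sit if the values are roughly uniform.
        pos = lo + int((target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]))
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo  # insertion point; check neighbours for the nearest value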
What you did is actually find the n-th element of an arithmetic sequence given the first two elements. That is of course a good approach.
Apart from the real question: if you have so much data that you can't fit it into RAM, you could set up something like memory-mapped files, or simply create virtual-memory files (on Linux, swap).
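A minimal sketch of that idea with numpy.memmap; the file name and dtype here are hypothetical. The array lives on disk and the OS pages chunks into RAM only as they are touched:

import numpy as np

times = np.memmap("times.dat", dtype=np.float64, mode="r")
window = times[1_000_000:1_000_100]   # only this region gets paged in
print(window.mean())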

Sorting points on multiple lines

Given that we have two lines on a graph, and we only care about whole-number X-axis intersections, we need to order these points from highest Y value to lowest Y value, regardless of their position on the X axis.
I have a couple of questions:
1) I have to assume this is a known problem, but does it have a particular name?
2) Is there a known optimal solution when dealing with tens of billions (or hundreds of millions) of lines? Our current process of manually calculating each point and then comparing it to a giant list requires hours of processing. Even though we may have a hundred million lines, we typically only want the top 100 or 50,000 results; some lines are so far "below" the others that calculating their points is unnecessary.
Your data structure is a set of tuples
lines = {(y0, Δy0), (y1, Δy1), ...}
You need only the ntop points, hence build a set containing only the top ntop yi values, with a single pass over the data:
top_points = choose(lines, ntop)
EDIT: to choose the ntop points we had to keep track of the smallest one, and this is interesting info, so let's return this value from choose as well; we also need to initialize decremented:
top_points, smallest = choose(lines, ntop)
decremented = top_points
and start a loop...
while True:
    # Generate a set of decremented values
    decremented = {(y - Δy, Δy) for y, Δy in decremented if y > smallest}
    if not decremented: break
    # Generate a set of candidates
    candidates = top_points.union(decremented)
    # Generate a new set of top points
    top_points, smallest = choose(candidates, ntop)
(An earlier version rebuilt decremented from top_points on each pass and stopped when the new top_points equalled the old; keeping track of smallest makes that check no longer necessary.)
The difficult part is the choose function, but I think that the answer to the question "How can I sort 1 million numbers, and only print the top 10 in Python?" could help you.
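A possible sketch of choose using the heapq approach from that answer (returning both the top set and its smallest y, as the EDIT above requires):

import heapq

def choose(lines, ntop):
    # Keep the ntop tuples with the largest y, in a single pass.
    top = heapq.nlargest(ntop, lines, key=lambda p: p[0])
    return set(top), min(p[0] for p in top)

top_points, smallest = choose({(11, 1), (9, 2), (7, 1), (5, 3)}, 2)
print(top_points, smallest)   # {(11, 1), (9, 2)} 9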
It's not a really complicated thing, just a "normal" sorting problem.
Usually sorting requires a large amount of computing time. But your case is one where you don't need to use complex sorting techniques.
Your values on both graphs grow or fall monotonically; there are no "jumps". You can use this to your advantage. The basic algorithm (a merge sketch follows the list):
identify whether a graph is growing or falling.
write a generator that yields the values: from left to right if rising, from right to left if falling.
get the first value from both graphs.
insert the lower one into the result list.
get a new value from the graph that had the lower value.
repeat the last two steps until one generator is "empty".
append the leftover items from the other generator.
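heapq.merge implements exactly this kind of two-stream merge lazily; in the sketch below the (made-up) generators are written to yield each line highest-Y-first, so merging with reverse=True produces the required descending order directly, and lines far "below" the cutoff are never fully expanded:

import heapq
import itertools

def line_points(y0, dy, n):
    # Hypothetical generator: one line's first n whole-number-X points,
    # read in the direction that yields descending Y.
    for i in range(n):
        yield y0 - i * dy

merged = heapq.merge(line_points(11, 1, 11), line_points(9.5, 0.5, 11), reverse=True)
print(list(itertools.islice(merged, 5)))  # the five highest Y values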

Python, PyTables - taking advantage of in-kernel searching

I have HDF5 files with multiple groups, where each group contains a data set with >= 25 million rows. At each time step of simulation, each agent outputs the other agents he/she sensed at that time step. There are ~2000 agents in the scenario and thousands of time steps; the O(n^2) nature of the output explains the huge number of rows.
What I'm interested in calculating is the number of unique sightings by category. For instance, agents belong to a side, red, blue, or green. I want to make a two-dimensional table where row i, column j is the number of agents in category j that were sensed by at least one agent in category i. (I'm using the Sides in this code example, but we could classify the agents in other ways as well, such as by the weapon they have, or the sensors they carry.)
Here's a sample output table. (Note that the simulation does not output blue/blue sensations because they take a ton of room and we aren't interested in them; same for green/green.)
        blue   green   red
blue       0     492   186
green   1075       0   186
red      451     498    26
The columns are
tick - time step
sensingAgentId - id of agent doing sensing
sensedAgentId - id of agent being sensed
detRange - range in meters between two agents
senseType - an enumerated type for what type of sensing was done
Here's the code I am currently using to accomplish this:
def createHeatmap():
    h5file = openFile("someFile.h5")
    run0 = h5file.root.run0.detections
    # A dictionary of dictionaries, {'blue': {'blue': 0, 'red': 0, ...}}
    classHeat = emptyDict(sides)
    # Interested in per-category unique detections
    seenClass = {}
    # Initially each side has seen no one
    for theSide in sides:
        seenClass[theSide] = []
    # In-kernel search filtering out many rows in the file; in this instance
    # 25,789,825 rows are filtered to 4,409,176
    classifications = run0.where('senseType == 3')
    # Iterate and filter
    for row in classifications:
        sensedId = row['sensedAgentId']
        # side is a function that returns the string representation of the
        # side of the agent with that id.
        sensedSide = side(sensedId)
        sensingSide = side(row['sensingAgentId'])
        # The side has already seen this agent before; ignore it
        if sensedId in seenClass[sensingSide]:
            continue
        else:
            classHeat[sensingSide][sensedSide] += 1
            seenClass[sensingSide].append(sensedId)
    return classHeat
Note: I have a Java background, so I apologize if this is not Pythonic. Please point this out and suggest ways to improve this code, I would love to become more proficient with Python.
Now, this is very slow: it takes approximately 50 seconds to do this iteration and membership checking, and this is with the most restrictive set of membership criteria (other detection types have many more rows to iterate over).
My question is: is it possible to move the work out of Python and into the in-kernel search query? If so, how? Is there some glaringly obvious speedup I am missing? I need to be able to run this function for each run in a set of runs (~30) and for multiple sets of criteria (~5), so it would be great if this could be sped up.
Final note: I tried using psyco but that barely made a difference.
If you have N ≈ 2000 agents, I suggest putting all sightings into a numpy array of size N×N. This easily fits in memory (around 16 MB for integers). Just store a 1 wherever a sighting occurred.
Assume that you have an array sightings, where the first coordinate is the sensing agent and the second is the sensed agent. Assume you also have 1-d index arrays listing which agents are on each side. You can get the number of sightings of side B by side A this way:
sideAseesB = sightings[sideAindices, sideBindices]
sideAseesBcount = numpy.logical_or.reduce(sideAseesB, axis=0).sum()
It's possible you'd need to use sightings.take(sideAindices, axis=0).take(sideBindices, axis=1) in the first step, but I doubt it.
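A sketch of the whole recipe with made-up index arrays; np.ix_ builds the open mesh that plain fancy indexing with two 1-d arrays would not give (the latter pairs indices element-wise):

import numpy as np

N = 2000
sightings = np.zeros((N, N), dtype=np.uint8)   # 1 wherever sensing -> sensed occurred
sideAindices = np.array([0, 5, 17])            # hypothetical side membership
sideBindices = np.array([3, 8])

# Select side A's rows and side B's columns as a full submatrix.
sideAseesB = sightings[np.ix_(sideAindices, sideBindices)]
# An agent in B counts once if seen by at least one agent in A.
sideAseesBcount = np.logical_or.reduce(sideAseesB, axis=0).sum()
print(sideAseesBcount)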
