Finding several regions of interest in an array - python
Say I have conducted an experiment where I've left a Python program running for some long time, and in that time I've taken several measurements of some quantity against time. Each measurement is separated by some value between 1 and 3 seconds, with the time step used much smaller than that... say 0.01 s. An example of such an event, if you just take the y axis, might look like:
[...0,1,-1,4,1,0,0,2,3,1,0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1,1,0,0,1,-1,0,0,2,0...]
Here we have some period of inactivity followed by a sharp rise and fall, a brief pause around 0, a sharp drop, a sharp rise, and then settling again around 0. The dots indicate that this is part of a long stream of data extending in both directions. There will be many of these events over the whole dataset, with varying lengths, separated by low-magnitude regions.
I wish to essentially form an array of 'n' arrays (tuples?) of varying lengths capturing just the events, so I can analyse them separately later. I can't separate purely by an np.absolute()-type threshold because there are occasional small regions of near-zero values within a given event, such as in the above example. In addition to this, there may be occasional blips in between measurements with large magnitudes but short duration.
The sample above would ideally end up as follows, with a couple of elements or so from the flat region on either side:
[0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1]
I'm thinking something like:
Input:
[0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0]
Split based on some number of consecutive values below a magnitude of 2.
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[8,-7,-1,0],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0],[1,22,4]]
If a sub-array's length is less than, say, 10, then remove it:
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0]]
Is this a good way to approach it? The first step is confusing me a little. I also need to preserve those small low-magnitude regions within an event.
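For what it's worth, here is a minimal sketch of the split-then-filter idea described above. The helper name split_events and the default thresholds (magnitude 2, a quiet gap of at least 3 samples to split, minimum event length 10) are illustrative assumptions, not part of the question, and would need tuning against real data:

import numpy as np

def split_events(y, mag=2, min_quiet=3, min_len=10):
    y = np.asarray(y)
    active = np.abs(y) >= mag                    # flag samples that look like part of an event
    edges = np.flatnonzero(np.diff(active.astype(int))) + 1
    runs = np.split(np.arange(len(y)), edges)    # alternating active/quiet runs of indices
    events, current = [], []
    for run in runs:
        if run.size == 0:
            continue
        if active[run[0]]:
            current.extend(run.tolist())         # active run: part of the current event
        elif current and len(run) < min_quiet:
            current.extend(run.tolist())         # short quiet gap inside an event: keep it
        elif current:
            events.append(current)               # long quiet run: the event has ended
            current = []
    if current:
        events.append(current)
    return [y[e] for e in events if len(e) >= min_len]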
Re-edited! I'm going to be comparing two signals, each measured as a function of time, so they will be zipped together in a list of tuples.
Here are my two cents, based on exponential smoothing.
import itertools
import numpy as np

A = np.array([0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0])
# Pad both ends so the 5-element windows cover the whole signal
B = np.hstack(([0, 0], A, [0, 0]))
B = np.asanyarray(list(zip(*[B[i:] for i in range(5)])))
# C is the signal smoothed by a 5-element sliding window with
# exponentially decaying weights
C = (B * [0.25, 0.5, 1, 0.5, 0.25]).mean(axis=1)
D = []
for key, group in itertools.groupby(enumerate(C), lambda x: abs(x[1]) > 1.5):
    if key:
        # Collect the (index, value) runs where the smoothed signal has
        # magnitude > 1.5. Change 1.5 to control the behavior.
        D.append(list(group))
E = [D[0]]
for item in D[1:]:
    if (item[0][0] - E[-1][-1][0]) < 5:
        # Merge interesting regions fewer than 5 indices apart.
        # Change 5 to control the behavior.
        E[-1] = E[-1] + item
    else:
        E.append(item)
print([(item[0][0], item[-1][0]) for item in E])
# Drop interesting regions shorter than 10 elements
[A[item[0][0]: item[-1][0]] for item in E if (item[-1][0] - item[0][0]) > 9]
Related
Finding the coordinates to randomly distribute images on a display without overlap
I have the following code, which randomly generates a list of (x, y) tuples:

import random

coords = []
for i in range(10):
    x = random.randint(85, 939)
    y = random.randint(75, 693)
    coords.append((x, y))

In the final list, the x values of each tuple are considered to overlap if the absolute difference between them is less than 85, and the y values are considered to overlap if the absolute difference is less than 75. How can I make sure that none of the tuples in the final list will overlap in both dimensions?
The easiest way to do this is to just keep sampling and discarding coordinates which would create an overlap. This will, however, become very inefficient when you come close to filling the available space. If that's not an issue, you should go with this solution.

A bit more efficient, and as far as I can tell statistically equivalent, is to sample one coordinate first, like the row. Then compute the occupied area in that row and sample from the remaining positions, if there are any. To avoid the same problem as in the easy solution, if there are no available spaces in a row, it should be removed from the possible sampling outcomes for the row (plus the margin of 75 in both directions).

Ideally, you don't compute the occupied regions each time; instead you keep a mapping from each row to the occupied space in that row and the number of non-full rows, and just update this mapping when inserting new images. You will need storage for n_rows + 1 extra numbers.

To clarify: when sampling from a restricted space, just subtract the occupied positions and get a sampling result n. Then find the correct position for n by walking along the coordinate axis n steps, skipping all the occupied positions.
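A minimal sketch of the easy sample-and-discard variant described first, using the ranges and margins from the question (the helper name and the max_tries safety cap are added for illustration):

import random

def sample_non_overlapping(n, max_tries=100000):
    coords = []
    tries = 0
    while len(coords) < n and tries < max_tries:
        tries += 1
        x = random.randint(85, 939)
        y = random.randint(75, 693)
        # discard the sample if it overlaps an existing image in both dimensions
        if any(abs(x - cx) < 85 and abs(y - cy) < 75 for cx, cy in coords):
            continue
        coords.append((x, y))
    return coords

print(sample_non_overlapping(10))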
Optimizing random sampling for scaling permutations
I have a two-fold problem here.

1) My initial problem, which was trying to improve the computational time of my algorithm.
2) My next problem: one of my 'improvements' appears to consume over 500G of RAM, and I don't really know why.

I've been writing a permutation for a bioinformatics pipeline. Basically, the algorithm starts off sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is not met again, this up-scales to 100000, and then 1000000 nulls per variant. I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly looking comprehension:

output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))

Basically all this is doing is calling my random_sampling_for_variant function on each variant in my variant list, and throwing the two outputs from that function into a list of lists (so I end up with two lists of lists, output_names_df and output_p_values_df). I then turn these lists of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:

def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    # Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    # If number of permutations > number of nulls, sample with replacement
    if num_permutations >= len(permuted_null_variant_table):
        replace = True
    else:
        replace = False
    # Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return (temp_names_df, temp_p_values_df)

When defining permuted_null_variant_table, I'm just querying a pre-defined grouped-up DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually being done. temp_names_df and temp_p_values_df get the values that I want out of the null table, and then I send my returns out of the function.

The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to get to, my process got killed because it was apparently going to take up more RAM than the cluster had.

I'm stumped on how to proceed. I've turned all my ints into 8-bit uints, and I've turned all my floats into 16-bit floats. I've dropped columns that I don't need where I don't need them, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the computational time of the comprehension is 3/5 that of the loop. Any feedback is appreciated.
Can I make an O(1) search algorithm using a sorted array with a known step?
Background: my software visualizes very large datasets, e.g. the data is so large I can't store it all in RAM at any one time; it has to be loaded in a paged fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application. These datasets contain three internal lists I use to visualize: time, height, and dataset. My program plots the data as time x height, and additionally users have the option of drawing shapes around regions of the graph that can be extracted to a whole different plot.

The difficult part is, when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array. Here's an example of a shape which bounds a region in my program. While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate existing in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only a small bit off from matplotlib's coordinate.

The Problem: so given some arbitrary value, let's say x1 = 732839.154395 (a representation of the date in number format), and a list of similar values with a constant step:

732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...

What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the size of time is huge. Since I know the array is 1. sorted and 2. increments with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves these kinds of problems? Or would I simply need to devise some custom algorithm? Here is my current thought process:

- grab the first and second elements of time
- subtract the first element from the second to obtain the step
- subtract the first element of time from the bounding x value to obtain the difference
- divide the difference by the step to obtain the index
- move time forward to that index
- check the surrounding elements of the index to ensure the closest representation
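A minimal sketch of that thought process, assuming the step really is constant (the helper name nearest_index and the neighbour-checking window are illustrative, not from the question):

import numpy as np

def nearest_index(times, x):
    step = times[1] - times[0]             # step from the first two elements
    i = int(round((x - times[0]) / step))  # jump straight to the estimated index
    i = max(0, min(i, len(times) - 1))     # clamp to the array bounds
    # check the surrounding elements in case of rounding error
    lo, hi = max(0, i - 1), min(len(times), i + 2)
    window = np.asarray(times[lo:hi])
    return lo + int(np.argmin(np.abs(window - x)))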
The algorithm you suggest seems reasonable and like it would work. As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded -- i.e., the data generation clock, e.g. the frame rate, is not synced with the computer.) The easy way around this is to read two points separated by a larger time: for example, read the first time value and then the 1000th time value. Then everything stays the same in your calculation, but you get your timestep by subtracting and then dividing by 1000.

Here's a test that makes data similar to yours:

start = 97523.29783
increment = .000378912098
target = 97585.23452

# build a timeline
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10])   # truncate the time value
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment

# now test
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
    dt = (times[i] - times[0]) / i
    index = int((target - start) / dt)
    print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))

Result:

  span    actual    guess   est dt (actual=.000378912098)
     1    163460   154841   0.0004000000
     2    163460   176961   0.0003500000
     5    163460   162991   0.0003800000
    10    163460   162991   0.0003800000
   100    163460   163421   0.0003790000
  1000    163460   163464   0.0003789000
 10000    163460   163460   0.0003789100

That is, as the space between the sampled points gets larger, the time interval estimate gets more accurate (compare to increment in the program) and the estimated index (3rd column) gets closer to the actual index (2nd column). Note that the accuracy of the dt estimate is basically just proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from your question statement that this would be difficult; if it's not, it will give the most accurate estimate of your time interval.

Note that here, for clarity, I exaggerated the lack of accuracy by making my time recording very coarse, but in general every power of 10 in your span increases your accuracy by the same amount. As an example of that last point, if I reduce the coarseness of the time values by changing the truncation line to trunc = float(str(time)[:12]), I get:

  span    actual    guess   est dt (actual=.000378912098)
     1    163460   163853   0.0003780000
    10    163460   163464   0.0003789000
   100    163460   163460   0.0003789100
  1000    163460   163459   0.0003789120
 10000    163460   163459   0.0003789121

So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.

Overall, this is very similar in idea to the linear "interpolation search". It's just a bit easier to implement because it's only making a single guess based on the interpolation, so it takes one line of code: int((target-start)*i/(times[i] - times[0]))
What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location. The Wikipedia article on interpolation search contains a C++ implementation.
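For reference, a minimal Python sketch of interpolation search (an illustration, not the Wikipedia C++ version):

def interpolation_search(arr, target):
    # like binary search, but probe where a linear interpolation
    # between the bounds predicts the target to be
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:
            break
        mid = lo + int((target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]))
        if arr[mid] < target:
            lo = mid + 1
        elif arr[mid] > target:
            hi = mid - 1
        else:
            return mid
    # nearest candidate position; check its neighbours for the closest value
    return max(0, min(lo, len(arr) - 1))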
What you did is actually compute the index of the n-th element of an arithmetic sequence given the first two elements, which is of course a sound approach. Apart from the real question: if you have so much data that you can't fit it into RAM, you could set up something like memory-mapped files, or simply create virtual memory files, which on Linux is called swap.
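As an illustration of the memory-mapped route, a sketch using numpy.memmap; the file name "times.dat", the float64 dtype, and the span of 10000 are assumptions for the example:

import numpy as np

# map the large time array from disk instead of loading it into RAM;
# only the pages actually touched are read
times = np.memmap("times.dat", dtype=np.float64, mode="r")

# the constant-step index estimate from the question still works, and
# it touches just a handful of elements
step = (times[10000] - times[0]) / 10000
index = int((732839.154395 - times[0]) / step)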
Sorting points on multiple lines
Given that we have two lines on a graph (I just noticed that I inverted the numbers on the Y axis; this was a mistake, it should go from 11-1), and we only care about whole-number X axis intersections, we need to order these points from highest Y value to lowest Y value regardless of their position on the X axis. (Note: I did these pictures by hand so they may not line up perfectly.)

I have a couple of questions:

1) I have to assume this is a known problem, but does it have a particular name?

2) Is there a known optimal solution when dealing with tens of billions (or hundreds of millions) of lines? Our current process of manually calculating each point and then comparing it to a giant list requires hours of processing. Even though we may have a hundred million lines, we typically only want the top 100 or 50,000 results; some of the lines are so far "below" other lines that calculating their points is unnecessary.
Your data structure is a set of tuples:

lines = {(y0, Δy0), (y1, Δy1), ...}

You need only the ntop points, hence build a set containing only the top ntop yi values, with a single pass over the data:

top_points = choose(lines, ntop)

EDIT --- to choose the ntop points we had to keep track of the smallest one, and this is interesting info, so let's also return this value from choose; we also need to initialize decremented:

top_points, smallest = choose(lines, ntop)
decremented = top_points

and start a loop:

while True:
    # generate a set of decremented values
    decremented = {(y - Δy, Δy) for y, Δy in decremented if y > smallest}
    if not decremented:
        break
    # generate a set of candidates
    candidates = top_points.union(decremented)
    # generate a new set of top points
    top_points, smallest = choose(candidates, ntop)

(The check if new_top_points == top_points: break from the earlier version is no longer necessary; of course we are in a loop, so this simply repeats.)

The difficult part is the choose function, but I think that the answer to the question "How can I sort 1 million numbers, and only print the top 10 in Python?" could help you.
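A possible choose, sketched with heapq.nlargest as the linked answer suggests; it returns the retained set together with the smallest retained y, matching the signature used above:

import heapq

def choose(lines, ntop):
    # keep the ntop tuples with the largest y; also report the
    # smallest y that survived the cut
    top = heapq.nlargest(ntop, lines, key=lambda point: point[0])
    smallest = top[-1][0]
    return set(top), smallest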
It's not a really complicated thing, just a "normal" sorting problem. Usually sorting requires a large amount of computing time, but your case is one where you don't need complex sorting techniques: both of your graphs are growing or falling constantly, there are no "jumps". You can use this to your advantage.

The basic algorithm (see the sketch below):

- identify whether a graph is growing or falling
- write a generator that generates the values: from left to right if rising, from right to left if falling
- get the first value from both graphs
- insert the lower one into the result list
- get a new value from the graph that had the lower value
- repeat the last two steps until one generator is "empty"
- append the leftover items from the other generator
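A minimal sketch of that merge, assuming each line is given as a starting y and a per-step decrement (made-up inputs), and using heapq.merge, which implements exactly the repeat-until-empty loop above for any number of generators; here the streams are produced highest-Y first, so reverse=True gives the highest-to-lowest order the question asks for:

import heapq

def line_points(y0, dy, n):
    # yield the n whole-number-X intersections of one line,
    # highest Y first (dy is the drop per unit of X)
    y = y0
    for _ in range(n):
        yield y
        y -= dy

# merge the per-line streams so values come out highest-Y first
lines = [line_points(11, 1.0, 11), line_points(9, 0.5, 11)]
merged = heapq.merge(*lines, reverse=True)
top = [next(merged) for _ in range(10)]   # take only the top results
print(top)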
Python, PyTables - taking advantage of in-kernel searching
I have HDF5 files with multiple groups, where each group contains a data set with >= 25 million rows. At each time step of the simulation, each agent outputs the other agents he/she sensed at that time step. There are ~2000 agents in the scenario and thousands of time steps; the O(n^2) nature of the output explains the huge number of rows.

What I'm interested in calculating is the number of unique sightings by category. For instance, agents belong to a side: red, blue, or green. I want to make a two-dimensional table where row i, column j is the number of agents in category j that were sensed by at least one agent in category i. (I'm using the sides in this code example, but we could classify the agents in other ways as well, such as by the weapon they have, or the sensors they carry.) Here's a sample output table; note that the simulation does not output blue/blue sensations because it takes a ton of room and we aren't interested in them. Same for green/green.

        blue    green   red
blue       0      492   186
green   1075        0   186
red      451      498    26

The columns are:

tick - time step
sensingAgentId - id of the agent doing the sensing
sensedAgentId - id of the agent being sensed
detRange - range in meters between the two agents
senseType - an enumerated type for what type of sensing was done

Here's the code I am currently using to accomplish this:

def createHeatmap():
    h5file = openFile("someFile.h5")
    run0 = h5file.root.run0.detections

    # A dictionary of dictionaries, {'blue': {'blue': 0, 'red': 0, ...}, ...}
    classHeat = emptyDict(sides)

    # Interested in per-category unique detections
    seenClass = {}

    # Initially each side has seen no one
    for theSide in sides:
        seenClass[theSide] = []

    # In-kernel search filtering out many rows in the file; in this
    # instance 25,789,825 rows are filtered to 4,409,176
    classifications = run0.where('senseType == 3')

    # Iterate and filter
    for row in classifications:
        sensedId = row['sensedAgentId']
        # side is a function that returns the string representation of
        # the side of the agent with that id
        sensedSide = side(sensedId)
        sensingSide = side(row['sensingAgentId'])

        # The side has already seen this agent before; ignore it
        if sensedId in seenClass[sensingSide]:
            continue
        else:
            classHeat[sensingSide][sensedSide] += 1
            seenClass[sensingSide].append(sensedId)

    return classHeat

Note: I have a Java background, so I apologize if this is not Pythonic. Please point this out and suggest ways to improve this code; I would love to become more proficient with Python.

Now, this is very slow: it takes approximately 50 seconds to do this iteration and membership checking, and this is with the most restrictive set of membership criteria (other detection types have many more rows to iterate over). My question is, is it possible to move the work out of Python and into the in-kernel search query? If so, how? Is there some glaringly obvious speedup I am missing? I need to be able to run this function for each run in a set of runs (~30), and for multiple sets of criteria (~5), so it would be great if this could be sped up.

Final note: I tried using psyco, but that barely made a difference.
If you have N ≈ 2000 agents, I suggest putting all sightings into a numpy array of size NxN. This easily fits in memory (around 16 MB for integers). Just store a 1 wherever a sighting occurred.

Assume that you have an array sightings. The first coordinate is the sensing agent, the second is the sensed agent. Assume you also have 1-d index arrays listing which agents are on which side. You can get the number of agents on side B sighted by side A this way:

sideAseesB = sightings[sideAindices, sideBindices]
sideAseesBcount = numpy.logical_or.reduce(sideAseesB, axis=0).sum()

It's possible you'd need to use sightings.take(sideAindices, axis=0).take(sideBindices, axis=1) in the first step, but I doubt it.
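To build the whole table, here is a sketch that loops over side pairs. The side index ranges are made up for illustration, and np.ix_ is used so the row/column selections combine into a 2-d block (this is what the chained take calls would achieve):

import numpy as np

N = 2000
sightings = np.zeros((N, N), dtype=np.uint8)  # 1 wherever a sighting occurred

# hypothetical side membership; replace with the real index arrays
side_indices = {
    'blue': np.arange(0, 700),
    'green': np.arange(700, 1400),
    'red': np.arange(1400, 2000),
}

heat = {a: {} for a in side_indices}
for a, ai in side_indices.items():
    for b, bi in side_indices.items():
        block = sightings[np.ix_(ai, bi)]
        # a sensed agent counts once, no matter how many sensors saw it
        heat[a][b] = int(np.logical_or.reduce(block, axis=0).sum())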