Detecting regions in (x, y) data [closed] - python

I need to be able to detect regions of a list of (x, y) data based on the features of the data. Some example data is shown in the first image. Right now, I need to be able to find the region between the black marks (sorry for the poor quality, imgur's editor isn't very accurate). Unfortunately, the problem is complicated by the data being a different length and shape each time it is collected, as seen in the second image. The sharp drop from ~98 to ~85 is consistent, and the two-dip/two-peak feature between ~1e-9 and ~1.5e-9 should be fairly consistent.
My question is, what is the best approach for detecting events in a signal, based on features of the signal? If I can get this sliced into the three regions marked (beginning to first mark, first to second mark, second mark to end), then I believe I can extend the method to handle my more complex situations.
I've solved similar problems before, but this one is unique in the amount of variation that occurs from one set of data to another. Last time I simply wrote a hand-crafted algorithm to find a local extremum and use it to locate the edge, but I feel like it's a rather ugly and inefficient solution that can't be easily reused.
I'm using Python 2.7.5, but ideally this should be a language agnostic solution so that I can implement it in other environments like VB.NET.

Just based on the two examples that you posted, I have a couple of different suggestions: thresholding or template matching.
Thresholds
Because you mentioned that the vertical drop in the signal is relatively constant, especially for the first event you're detecting, it seems like you could use a thresholding method, where you place the event at the first occurrence of the signal crossing some threshold of interest. For instance, to place the first event (in Python, and assuming that your measurement data live in a sequence of tuples containing x-y pairs):
def detect_onset_event(measurements):
    armed = False
    for offset, (timestamp, value) in enumerate(measurements):
        if value > 90:
            armed = True
        if armed and value < 85:
            return offset
    return -1  # failure condition, might want to raise ValueError()
So here we trigger at the first sample offset that drops below 85 after the signal has gone above 90.
You could do something similar for the second event, but it looks like the signal levels that are significant for that event might be a little less clear-cut. Depends on your application and measurement data. This is a good example of what makes thresholding approaches not so great -- they can be brittle and rely on hard-coded values. But if your measurements are quite regular, then this can work well, with little coding effort.
Templates
In this method, you can create a template for each signal event of interest, and then convolve the templates over your signal to identify similar regions of the signal.
import numpy

def detect_twopeak_event(measurements, template):
    data = numpy.asarray(measurements)  # convert to numpy array
    activations = numpy.convolve(
        data[:, 1],  # convolve over all "value" elements
        template)
    return activations.argmax()
Here you'll need to create a list of the sample measurements that constitute the event you're trying to detect -- for example, you might extract the measurements from the two-peak area of an example signal to use as your template. Then by convolving this template over the measurement data, you'll get a metric for how similar the measurements are to your template. You can just return the index of the best match (as in the code above) or pass these similarity estimates to some other process to pick a "best."
There are many ways to create templates, but I think one of the most promising approaches is to use an average of a bunch of neighborhoods from labeled training events. That is, suppose you have a database of signals paired with the sample offset where a given event happens. You could create a template by averaging a windowed region around these labeled events:
def create_mean_template(signals, offsets, radius=20):
    w = numpy.hanning(2 * radius)
    return numpy.mean(
        [s[o-radius:o+radius] * w for s, o in zip(signals, offsets)],
        axis=0)
This has been used successfully in many signal processing domains like face recognition (e.g., you can create a template for an eye by averaging the pixels around a bunch of labeled eyes).
One place where the template approach will start to fail is if your signal has a lot of areas that look like the template, but these areas don't correspond to events you want to detect. It's tricky to deal with this, so the template method works best if there's a distinctive signal pattern that happens near your event.
Another way the template method will fail is if your measurement data contain, say, a two-peak area that's interesting but occurs at a different frequency than the samples you use as your template. In this case, you might be able to make your templates more robust to slight frequency changes by working in the time-frequency domain rather than the time-amplitude domain. There, instead of making 1D templates that correspond to the temporal pattern of amplitude changes you're interested in, you can run a windowed FFT on your measurements and then come up with kD templates that correspond to the k-dimensional frequency changes over a small region surrounding the event you're interested in.
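For concreteness, a rough sketch of that time-frequency idea using scipy (not part of the original answer; the sample rate, window settings, signal and template patch below are placeholders):
import numpy as np
from scipy import signal

# Compare a 2-D spectrogram patch (the template) against the spectrogram of
# new measurements via cross-correlation. All values are placeholders.
fs = 1e9                                   # assumed sample rate
values = np.random.randn(4096)             # stand-in for the measured signal
template_patch = np.random.randn(20, 15)   # stand-in for a labeled event patch

f, t, S = signal.spectrogram(values, fs=fs, nperseg=64, noverlap=48)
scores = signal.correlate2d(S, template_patch, mode='valid')
best_frame = scores.max(axis=0).argmax()   # time frame with the strongest match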
Hope some of these suggestions are helpful!

You could probably use a Hidden Markov Model with 6+ states. I am no math genius, so I would use one with discrete states and round your data to the nearest integer; my model would look something like this:
state 1: start blob (emissions around 97)
state 2: 'fall' (emissions between 83 and 100)
state 3: interesting stuff (emissions between 82 and 86)
state 4: peak (80-88)
state 5: last peak (80-94)
state 6: base line (85-87)
HMMs are not the perfect tool, because they mostly capture ranges of emissions in each state, but they are good at tolerating features that arrive much earlier or later than expected, because they only care about the emission probabilities within each state and the transition probabilities between them.
I hope this helps and makes sense.
If you are super lazy, you could probably just label 6 spectra by hand, cut the data accordingly, and calculate the emission probabilities for each value in each state.
# pseudo code
from collections import defaultdict

# emissions[state][value] counts how often each rounded value is seen in each state
emissions = defaultdict(lambda: defaultdict(int))
for state_label, value in data:
    emissions[state_label][value] += 1
# then normalize each state's counts to sum to 1 and voila, you have the emission probabilities
The above is super oversimplified, but it should be much better and more robust than the if-statement stuff you usually do :)... HMMs usually also have a transition matrix, but because the signal in your data is so strong you could 'skip' that one and go for my pragmatic solution :)
You can then subsequently use the Viterbi path to label all your future experiments.
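For completeness, a minimal Viterbi sketch that would fit the model above (my own illustration, not the answerer's code; the hand-rolled transition matrix that favours staying in a state or stepping to the next one is an assumption):
import numpy as np

# emission_probs[s] is a dict {rounded_value: probability} for state s, e.g.
# built from the normalized counts in the snippet above.
def viterbi(observations, n_states, emission_probs, p_stay=0.95):
    trans = np.full((n_states, n_states), 1e-12)
    for s in range(n_states):
        trans[s, s] = p_stay
        if s + 1 < n_states:
            trans[s, s + 1] = 1.0 - p_stay
    logtrans = np.log(trans)

    def logemit(s, v):
        return np.log(emission_probs[s].get(v, 1e-12))

    T = len(observations)
    score = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    score[0, 0] = logemit(0, observations[0])        # assume we start in the first state
    for t in range(1, T):
        for s in range(n_states):
            prev = score[t - 1] + logtrans[:, s]
            back[t, s] = prev.argmax()
            score[t, s] = prev.max() + logemit(s, observations[t])
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                 # one state label per sample

# e.g. labels = viterbi(rounded_values, 6, emission_probs)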


How to identify divergences in pymc3 chain using arviz

I would like to identify divergences in a chain sampled by pymc3.
Each sample is associated with 1 group and 1 condition (coordinates in the trace).
For the purpose of this example, the following results are only considering 1 chain, and 1 condition (coordinate of the trace).
I am using Arviz.InferenceData to plot the trace of samples for a specific variable a_kg (where each line represents one group):
import arviz as az

# trace variable coming from pymc3.sample
azdata = az.from_pymc3(
    trace=trace,
    coords={'group': groups, 'condition': conditions},
    dims={'a_kg': ['group', 'condition']}
)
azdata_sel = azdata.sel(chain=[0], condition='Control')
az.plot_trace(azdata_sel, var_names=['a_kg'], divergences='bottom');
The trace for each group is plotted below:
If I understood correctly, the divergences are shown on the bottom of the traces with a rug plot.
If this is correct, there is a divergence around draw 30.
Therefore, I get a slice of samples that has at least one divergence (in this case the slice containing sample 30) to explore this part of the trace in greater detail.
azdata_sel = azdata.sel(draw=slice(25, 35))
az.plot_trace(azdata_sel, var_names=['a_kg'], divergences='bottom')
I would like to identify why the chain diverges to understand better how this model works. However, when I look at the samples for variable a_kg, for each group, around draw 30, all values are restricted in a narrow, finite range:
array([[7.03689753e+01, 7.08419788e+01, 4.18270946e+01, 5.56815107e+01,
2.89069656e+01, 3.21847218e+01, 1.72809154e+01, 6.80358410e+00,
8.27741780e+00, 8.61561309e+00, 9.52030649e+00, 7.42601279e+00,
4.86924384e+00, 4.65123572e+00, 3.42272331e+00, 3.72094392e+00,
3.79496877e+00, 3.63692105e+00, 4.53843102e+00, 4.49938710e+00,
1.16647181e+00, 1.57530039e+00, 1.38785612e+00, 2.93999569e+00,
3.19698360e-01, 1.09373256e+00, 8.91772857e-01, 1.27258163e+00,
7.30115016e-01, 6.48975286e-01, 9.53344198e-01, 7.10095320e-01,
1.94587869e-01, 2.37110242e-01, 1.74995857e-02, 1.09717525e-01,
2.49860304e-01, 1.73485239e-01, 3.15215749e-01]])
Are divergences filtered out from draws during sampling? How would you proceed to diagnose what is going wrong in this case?
I think this doc has a lot of what you need to know: https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html
But to summarise, you should know that understanding divergences is a difficult problem in general and there isn't a silver bullet for it -- you'll have to try many (sometimes many, many) things. Looking at the trace plots alone won't be enough. That being said, the document I linked has a lot of good recommendations.
The general recommendation I can give is that you shouldn't focus on a specific sample that was divergent. That's meaningless. The thing that is diverging is not the sample but the trajectory. Use arviz.plot_pair and focus on where divergences concentrate (set divergences=True). Run much longer chains (more than 10k samples) so that you get a lot more divergences and can easily spot the pathological regions. Once you spot the pathological region, deciding what to do with it will depend on your specific model. Perhaps tweaking the adaptation (e.g. more tuning steps or a higher target acceptance rate), perhaps changing your priors, perhaps re-parametrizing your model.
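As a hedged illustration of that advice, assuming the az.from_pymc3 conversion above populated sample_stats as usual:
import arviz as az
import numpy as np

# Locate the divergent draws directly: sample_stats stores a boolean
# `diverging` flag per (chain, draw).
diverging = azdata.sample_stats['diverging'].values   # shape: (chains, draws)
print(np.where(diverging[0])[0])                      # divergent draw indices, chain 0

# Pair plot with divergences highlighted, to see where they concentrate.
az.plot_pair(azdata, var_names=['a_kg'], divergences=True)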
Since you're talking about groups, I suspect you are using a hierarchical model, in which case I think the best approach would be to try an alternative parametrization. Look up discussions about centred vs non-centred parametrizations in hierarchical models.
Best of luck in your divergence hunting! :)

How to predict the object location in the near future? [closed]

Assume that every object (car, bike, etc.) is connected to the internet and gives me (the Cloud) just its current position and a random ID, which changes every "T" seconds. Since objects change their IDs, it becomes difficult to keep track of every object, especially in a busy area like a city center. Can someone help me design an attacker model which can predict the trajectories of the objects? That means the Cloud has to predict the direction of the objects in the next, say, 15 minutes.
You can assume, say, 4 regions A, B, C, D (5-10 km radius each) which are close to each other. How can I (the Cloud) predict and say how many cars/bikes have moved from Region A to Region B, or that the objects I have seen in Region A and, after 15 minutes, in Region B are the same?
I have studied the Kalman filter. In my case I know only the random IDs and positions. So a first step could be guessing the velocity of each object from 2 consecutive position points, then forming a velocity vector and a position vector, and applying them to a Kalman filter to somehow predict the positions of the objects in the next 15 minutes. Of course, it need not be exact; at least a probable nearby region would do.
Is a Kalman filter the right choice? If so, is there any Python or C++ implementation that can help? Is there any other concept that can help in predicting the user locations? Is there any simulator that can help in simulating these networks? Can someone please help in designing this attacker model? Thanks a lot.
Edit1:
The main idea behind changing the object IDs is to protect the user's (object's) location privacy. My challenge is to design an attacker model where I can prove that even though you (the object) changed your IDs, I can still track you based on the speed and direction you are going. Let's say a car is going on a highway. Its ID is (assume) 1234 at 12:00. At 12:15 it has changed its ID to 2345. Since no one else is going in the same direction with that particular speed, the attacker can say ID 2345 and ID 1234 are from the same object. So this is a valid linkability. But if the same object is moving in a busy area, like a city center, there are many combinations like turns and parking lots (where many objects report the same location with different IDs), and it's difficult to say that ID 2345 is from the same object as ID 1234.
MainGoal:
If I can find a valid linkability between objects seen in Region A (say at 12:00 PM) and in Region B (say at 12:15), that means I can predict the Region B where most of the objects are trying to move. Of course, sometimes there might be false positives. Since the main goal is to protect user privacy, the false positives help the users.
Is kalman filter right choice?
A Kalman filter won't help in cases where the future predicted position depends on whether the vehicle makes a turn. A Markov model may work better for those cases.
On a straightaway with no turns, a Kalman filter will do better. However, Kalman filters assume a Gaussian (normal) distribution for the noise (which is likely only true when the vehicle has no surrounding traffic).
An Unscented Kalman filter can help compensate for non-Gaussian noise, but it too has limits.
If so any python or c++ implementation that can help?
The pykalman package would be a good place to start.
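To give a feel for it, a minimal constant-velocity sketch with pykalman (my own illustration; dt, the default noise settings and the fake track are assumptions to replace with real data):
import numpy as np
from pykalman import KalmanFilter

# State: [x, y, vx, vy]; we only observe (x, y).
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]])

kf = KalmanFilter(transition_matrices=F,
                  observation_matrices=H,
                  initial_state_mean=[0, 0, 0, 0])

positions = np.cumsum(np.random.randn(100, 2), axis=0)   # fake (x, y) track
means, covariances = kf.filter(positions)

# Extrapolate 15 minutes ahead (900 steps of dt = 1 s) from the last state.
state = means[-1]
for _ in range(900):
    state = F @ state
predicted_xy = state[:2]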
A Kalman filter helps model the motion of a single object. The problem you have is that you don't know for certain which measurements (id,position) originate from which object, since they are allowed to change their ids. That makes this a tracking problem as well, where you'll need to estimate, for each object, the list of ids it has used in the past. The reason this matters is that it could take more than 15 minutes to get from Region A to Region B, and all the random-ids you receive from Region B don't match the ones you got from the object when it was in Region A.
There is a lot of work on these sorts of problems (see http://www.probabilistic-robotics.org/ for information).
I will attempt to describe a simple solution, but this is actually a deep topic with a lot of historical work. I'm describing a sort of variant of a Particle Filter here:
Keep a table of "Objects" that will contain all historical information for each object (i.e., each car). This will store the historical list of (random-id, position) pairs you believe the object has used in reporting its position.
Each measurement you receive is (random-id, position, time). Decide which object it belongs to. How to decide? Well, first, if the random-id exactly matches a previous one, and that id has been used for less than 15 minutes, then you can assign it exactly. Otherwise, you'll need to deal with cases where the random-id for the object has now changed. One obvious algorithm is to match it to the object whose last measured position is closest. In general, your motion model (such as a Kalman filter) will determine how to do this correspondence assignment. Sometimes you'll have to decide that the measurement is in fact a new object that appeared out of nowhere, or from the edge of the map.
When you receive a measurement in Region B and have assigned it to an object, now check whether any past measurement of that object was in Region A. That will tell you whether you have a situation of an object moving from Region A to Region B.
What I've described is essentially an online tracking algorithm with MAP estimation of correspondences and a pluggable motion model. The algorithm will continuously maintain a list of "traces" for each unique object. A minimal sketch of the correspondence step follows below.
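A minimal sketch of that correspondence step (my own illustration; the id lifetime, distance threshold and data layout are hypothetical; a real system would plug a motion model such as a Kalman filter into the matching):
import math

ID_LIFETIME = 15 * 60          # seconds a random id stays valid
MAX_MATCH_DISTANCE = 500.0     # metres; beyond this, assume a new object

objects = []  # each object: {'ids': [(id, last_seen)], 'positions': [(x, y, t)]}

def assign_measurement(random_id, position, time):
    # 1. Exact id match within the id lifetime.
    for obj in objects:
        last_id, last_seen = obj['ids'][-1]
        if last_id == random_id and time - last_seen < ID_LIFETIME:
            obj['ids'][-1] = (random_id, time)
            obj['positions'].append((position[0], position[1], time))
            return obj
    # 2. Otherwise, match to the object whose last position is closest.
    best, best_dist = None, MAX_MATCH_DISTANCE
    for obj in objects:
        x, y, _ = obj['positions'][-1]
        d = math.hypot(position[0] - x, position[1] - y)
        if d < best_dist:
            best, best_dist = obj, d
    if best is None:                       # nothing close enough: a new object
        best = {'ids': [], 'positions': []}
        objects.append(best)
    best['ids'].append((random_id, time))
    best['positions'].append((position[0], position[1], time))
    return best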

Time-varying band-pass filter in Python

I am trying to solve a problem very similar to the one discussed in this post
I have a broadband signal, which contains a component with time-varying frequency. I need to monitor the phase of this component over time. I am able to track the frequency shifts by (a somewhat brute force method of) peak tracking in the spectrogram. I need to "clean up" the signal around this time varying peak to extract the Hilbert phase (or, alternatively, I need a method of tracking the phase that does not involve the Hilbert transform).
To summarize that previous post: varying the coefficients of a FIR/IIR filter in time causes bad things to happen (it does not just shift the passband, it also completely confuses the filter state in ways that cause surprising transients). However, there probably is some way to adjust filter coefficients in time (probably by jointly modifying the filter coefficients and the filter state in some intelligent way). This is beyond my expertise, but I'd be open to any solutions.
There were two classes of solutions that seem plausible: one is to use a resonator filter (basically a damped harmonic oscillator driven by the signal) with a time-varying frequency. This model is simple enough to avoid surprising filter transients. I will try this -- but resonators have very poor attenuation in the stop band (if they can even be said to have a stop band?). This makes me nervous as I'm not 100% sure how the resonator filters will behave.
The other suggestion was to use a filter bank and smoothly interpolate between various band-pass filtered signals according to the frequency. This approach seems appealing, but I suspect it has some hidden caveats. I imagine that linearly mixing two band-pass filtered signals might not always do what you would expect, and might cause weird things? But, this is not my area of expertise, so if mixing over a filter bank is considered a safe solution (one that has been analyzed and published before), I would use it.
Another potential class of solutions occurs to me, which is to just take the phase from the frequency peak in a sliding short-time Fourier transform (could be windowed, multitaper, etc). If anyone knows any prior literature on this I'd be very interested. Related, would be to take the phase at the frequency power peak from a sliding complex Morlet wavelet transform over the band of interest.
So, I guess, basically I have three classes of solutions in mind.
1. Resonator filters with time-varying frequency.
2. Using a filter bank, possibly with mixing?
3. Pulling phase from an STFT or CWT (these can be considered a subset of the filter bank approach); a rough sketch of this option follows below.
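A rough sketch of option 3, with a synthetic chirp and arbitrary window settings standing in for the real signal and parameters:
import numpy as np
from scipy import signal

# Track the spectral peak in a sliding STFT and read the phase at that peak.
fs = 1000.0
t = np.arange(0, 10, 1 / fs)
x = np.cos(2 * np.pi * (50 * t + 2 * t ** 2)) + 0.5 * np.random.randn(t.size)

f, frames, Z = signal.stft(x, fs=fs, nperseg=256, noverlap=192)
peak_bins = np.abs(Z).argmax(axis=0)                         # peak frequency bin per frame
peak_freq = f[peak_bins]                                     # tracked frequency over time
peak_phase = np.angle(Z[peak_bins, np.arange(Z.shape[1])])   # phase at the peak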
My suspicion is that in (2, 3) surprising things will happen to the phase from time to time, and that in (1) we may not be able to reject as much noise as we'd like. It's not clear to me that this problem even has a perfect solution (uncertainty principle in time-frequency resolution?).
Anyway, if anyone has solved this before, and... even better, if anyone knows any papers that sound directly applicable here, I would be grateful.
Not sure if this will help, but googling "monitor phase of time varying component" resulted in this: Link
Hope that helps.

Locating and extracting (unknown) book from an image in OpenCV [closed]

I'm trying to locate a (possibly perspective-deformed) book in an image and extract it so that it is "straight" and "front-on" (i.e. perspective-corrected).
The particular book is unknown -- there is no query or reference image to check for matches against (i.e. by some sort of feature descriptor matching process). In other words, I'm trying to hunt through the image and find a bunch of pixels that look like they belong to the object class "book", not a particular book.
The book may be somewhat rotated or otherwise perspective-deformed. However, it is assumed the amount of deformation is within fairly reasonable bounds: the person taking the photo is working "with" me. This means as well that the book should feature prominently in the image -- perhaps 30-90% of total image area (and not as some random item amidst a bunch of other clutter).
Good resources exist for (superficially) similar problems online. For example, this well-written tutorial covers automatic perspective-correction of playing cards: https://opencv-code.com/tutorials/automatic-perspective-correction-for-quadrilateral-objects/.
Currently, the system follows a loosely similar process as this tutorial, with some additions. The general technique stack is:
Pre-processing
Find edges with Canny edge detection
Find edges that look like lines with Hough transform
Find intersection points between lines in the hope of finding book corners
Filter out implausible lines and intersection points based on simple geometric properties
Take convex hull of intersection points
Get polygon approximation to the convex hull and use this to get four corners
Apply perspective/homographic transform
The output points (used to calculate the perspective transform) are known because we assume a known aspect ratio (i.e. book dimensions).
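For reference, a minimal sketch of that final warp step, assuming the four corners have already been found and ordered (the corner coordinates, aspect ratio and file name below are placeholders):
import cv2
import numpy as np

# Given four ordered book corners (top-left, top-right, bottom-right,
# bottom-left) and an assumed aspect ratio, warp the book to a straight,
# front-on view.
corners = np.float32([[120, 80], [540, 95], [560, 700], [105, 690]])
aspect = 0.7                                # assumed width/height of the cover
out_h = 800
out_w = int(out_h * aspect)
dst = np.float32([[0, 0], [out_w - 1, 0],
                  [out_w - 1, out_h - 1], [0, out_h - 1]])

H = cv2.getPerspectiveTransform(corners, dst)
img = cv2.imread('book.jpg')                # hypothetical input image
straight = cv2.warpPerspective(img, H, (out_w, out_h))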
It works for some images where the book is against fairly homogeneous backgrounds (around 1/3 to 1/2 of "nicer" images). After experimenting with the fairly dumb convex hull approach as well as a more involved quadrilateral-enumeration approach, I've concluded that the problem may be impossible using just geometric/spatial information alone -- it would probably need augmenting with colour/texture information (well, this is obvious when you consider the case of 180 degrees rotation/upside-down books).
The obvious challenge is that there is an almost infinite variety of possible book covers, and an almost infinite variety of possible backgrounds. Therefore, solving for the general case would be impossible or at least intractably hard. I knew this when I began the task. But, I hoped it would be the sort of problem that may have a solution enough of the time.
Other approaches I've considered looking at include OCRing the titles/text to work out orientation or possibly general position. The other approach that might conceivably be fruitful is some sort of learning-based classifier.
A related subtask I'm working on is the same goal but in a webcam video stream. This is definitely easier since I can use temporal information (i.e. position across frames). I just started this one yesterday but, after some initial progress, plateaued. A human holding the book generates background movement noise which throws off trivial approaches like frame differencing / background subtraction. Compared with the static image problem, however, I feel this is far more doable.
Sorry if that was a little long-winded. I wanted to make sure I made a sincere effort to articulate the problem(s). What do people think? Anyone have any thoughts as to how these problems might best be tackled?
Does calculating the homography from 4 lines instead of 4 points help? As you probably know, if points are related as p2 = H p1, then lines are related as l2 = H^-T l1. The lines on the book border should be quite prominent, especially if the deformation is not large. Is your main problem selecting the right lines? (You did not actually say what your problem was.) Maybe some kind of Hough-rectangle detection can help to find the lines.
Anyway, selecting lines as the homography input has an additional advantage: a RANSAC homography with a constraint on aspect ratio is likely to keep the right lines as inliers in the presence of numerous outliers from the background. And if those outliers do sneak in, they probably look like another book.

Utilising Genetic algorithm to overcome different size datasets in model

So I realise the question I am asking here is large and complex.
A potential solution to variances in the sizes of datasets:
In all of my searching through statistical forums and posts, I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model.
The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the percentage of the implied strike rate to be used, with the target of the model being to maximise the homology of the number 1 in two columns of the following CSV (ultra-simplified, but it hopefully demonstrates the principle).
Example data
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset, which contains some good examples of where my current methods fall short and how I feel a genetic algorithm can be used to fix this. The dataset above contains 6 unique classes, and the ultimate objective of the algorithm is to create as high a correspondence as possible between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing). In UniqueClass1 we have two identical x/y values. These are comparatively large x/y values if you compare with the average (note the average isn't calculated from this dataset), but common sense says that 3000/9610 is more significant, and therefore more likely to have an achieved rank of 1, than 300/961. So what I want to do is make an adjusted x/y that overcomes these differences in dataset sizes, using a saturating growth relationship defined by the equation:
adjusted xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average xy
Where α is the only dynamic number
If I can explain my logic a little and open myself up to (hopefully) constructive criticism: the graph below shows an exponential (saturating) growth relationship between the size of the dataset and the % of x/y contributing to the adjusted x/y. Essentially, what the above equation says is that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger; whatever percentage is left is made up by the average xy. Hypothetically, it could be 75% x/y and 25% average xy for 300/961 and 95%/5% for 3000/9610, creating an adjusted x/y which more clearly reflects the difference in significance between them.
To help with understanding: lowering α would produce the following relationship, whereby a larger dataset would be required to achieve the same "% of xy contributed".
Conversely, increasing α would produce the following relationship, whereby a smaller dataset would be required to achieve the same "% of xy contributed".
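As a quick numerical illustration of the weighting (alpha = 0.001 is an arbitrary value chosen only for demonstration):
import math

alpha = 0.001
avg_xy = 0.08527
for x, y in [(3000, 9610), (300, 961), (1, 3)]:
    w = 1 - math.exp(-y * alpha)              # share taken from the raw x/y
    adj = w * (x / y) + (1 - w) * avg_xy      # the remainder comes from Average xy
    print("x/y = %d/%d: weight = %.3f, adjusted xy = %.4f" % (x, y, w, adj))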
So I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with in order to help my understanding of how to utilise such abilities of python. If additional detail is required or further clarification about the problem or methods please do ask, I really want to be able to solve this problem and future problems of this nature.
So after much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm to iterate α in order to maximise the homology/correspondence between the rank of the adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone is able to help in that department.
So to clarify, this post is no longer a discussion about methodology
I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average xy
Where adjusted xy applies to each row of the csv. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is by each Unique class only) and Achieved rank.
Minimising this value would maximise the homology and essentially solve the problem presented to me of different size datasets. If any more information is required please ask, I check this post about 20 times a day at the moment so should reply rather promptly. Many thanks SMNALLY.
The problem you are facing sounds to me like the "Bias-Variance Dilemma" from a general point of view. In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets).
May I suggest not focusing on GA but looking at Instance-Based Learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point.
And particularly those slides.
[EDIT]
After a second reading, here is my second understanding:
You have a set of example data with two related attributes X and Y.
You do not want X/Y to dominate when Y is small (considered less representative).
As a consequence you want to "weight" the examples with an adapted value, adjusted_xy.
You want adjusted_xy to be related to a third attribute R (rank). Related such that, per class, adjusted_xy is sorted like R.
To do so you suggest putting it as an optimization problem, searching for the PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy,
with the constraint that D = Distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal.
Your question, at least for me, is in the field of attribute selection/attribute adaptation. (I guess the data set will later be used for supervised learning ).
One problem that I see in your approach (if well understood) is that, at the end, rank will be highly related to adjusted_xy which will bring therefore no interesting supplementary information.
Once this is said, I think you surely know how a GA works. You have to
define the content of the chromosome : this appears to be your alpha parameter.
define an appropriate fitness function
The fitness function for one individual can be a sum of distances over all examples of the dataset.
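A minimal sketch of such a fitness function over alpha, assuming the CSV has been parsed into a list of dicts (the field names and the crude random search are my own illustration, not a full GA):
import math
import random
from collections import defaultdict

# The three rows below are just the UniqueClass1 examples from the question.
rows = [
    {'class': 'UniqueClass1', 'achieved_rank': 1, 'x': 3000, 'y': 9610, 'avg_xy': 0.08527},
    {'class': 'UniqueClass1', 'achieved_rank': 2, 'x': 300,  'y': 961,  'avg_xy': 0.08527},
    {'class': 'UniqueClass1', 'achieved_rank': 3, 'x': 1,    'y': 3,    'avg_xy': 0.08527},
]

def adjusted_xy(x, y, avg_xy, alpha):
    w = 1 - math.exp(-y * alpha)          # weight grows with dataset size y
    ratio = x / y if y else 0.0
    return w * ratio + (1 - w) * avg_xy

def fitness(alpha, rows):
    # Sum, over every class, of |rank by adjusted_xy - achieved rank|.
    by_class = defaultdict(list)
    for r in rows:
        by_class[r['class']].append(r)
    total = 0
    for members in by_class.values():
        ranked = sorted(members,
                        key=lambda r: adjusted_xy(r['x'], r['y'], r['avg_xy'], alpha),
                        reverse=True)          # highest adjusted_xy gets rank 1
        for rank, r in enumerate(ranked, start=1):
            total += abs(rank - r['achieved_rank'])
    return total                               # smaller is better

# Crude random search over alpha; a GA or evolution strategy would refine this.
best_alpha = min((random.uniform(0.0, 0.01) for _ in range(1000)),
                 key=lambda a: fitness(a, rows))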
As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or Simulated Annealing may be more adapted than a GA.
As solving optimization problems is cpu intensive, you might eventually consider C or Java instead of Python. (as fitness at least will be interpreted and thus cost a lot).
Alternatively I would look at using Y as a weight to some supervised learning algorithm (if supervised learning is the target).
Let's start with the problem: you consider that some features lead some of your classes to a 'strike'. You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate. You are also commenting on the effect of some samples in biasing your 'strike' estimate.
The immediate answer is that it looks like you have a lot of variation in your data, therefore you will in one way or another need to collect more to account for that variation. (That is, variation that is inherent to the problem).
The fact that in some cases the numbers end up in 'unusable cases' could also be down to outliers. That is, measurements that are 'out of bounds' for a number of reasons and which you would have to find a way to either exclude them or re-adjust them. But this depends a lot on the context of the problem.
'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates if they come from samples of different sizes, as you have found out too. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at Statistical Power and how the sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.
Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (existence of genetic markers, the frequency of their appearance, or any other quantity) about these items. But some features might not exist for all items, or there is a core group of features but there might be some more that do not appear all the time. The question now is, which classifier do you use to achieve this? Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a Naive Bayesian Classifier. The classifier can be trained with the datasets you have used to derive the 'strike rates', which will provide the a-priori probabilities. When the classifier is 'running' it will be using the features of new data to construct a likelihood that the patient who provided this data should be assigned to each class.
Perhaps the more common example for such a classifier is the spam-email detectors where the likelihood that an email is spam is judged on the existence of specific words in the email (and a suitable training dataset that provides a good starting point of course).
Now, in terms of trying this out practically (and since your post is tagged with python related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, that could potentially help you with those differences in the size of the datasets. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go; the Weka package, book and community are very helpful.
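If you prefer to stay purely in Python, a minimal Naive Bayes sketch with scikit-learn is below (placeholder data; substituting scikit-learn for Weka is my assumption, not part of the answer):
from sklearn.naive_bayes import GaussianNB
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))       # e.g. marker frequencies per item
y_train = rng.integers(0, 3, size=200)    # class labels

clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.predict_proba(X_train[:3]))     # per-class likelihoods for new items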
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points. But the less this fit will mean. Especially since for some groups your sample sizes are small and therefore the measurements have a high random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles.
This is not a puzzle to find the best line through the dots. You are searching for a model that makes sense and brings understanding on the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding.
Take the problem back where it belongs and ask the statisticians instead.
A good model should be based on the theory behind the data. It'll have to match the points on the right side of the graph, where (if I understand you right) most of the samples are. It'll be able to explain in hard probabilities how likely the deviations on the left are and tell you if they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple simulation runs for a population like your subjects. See if the data looks like the data you're looking at or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in features per data point. I personally used a Random Forest classifier (which I wrote in Java). Since your data is highly variant, and therefore hard to model, you could create multiple forests from different random samples of your large dataset and put a control layer on top to classify data against all the forests, then take the best score. I don't write Python, but I found this link
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
which may give you something to play with.
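As a rough Python sketch of the "multiple forests plus a control layer" idea (placeholder data; taking the most confident forest per sample is one possible reading of "take the best score"):
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 3, size=500)

# Train several forests on different random subsamples.
forests = []
for seed in range(5):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X[idx], y[idx])
    forests.append(clf)

# Control layer: for each sample, use the forest that is most confident.
probs = np.stack([f.predict_proba(X[:5]) for f in forests])  # (forests, samples, classes)
best_forest = probs.max(axis=2).argmax(axis=0)
preds = np.array([probs[b, i].argmax() for i, b in enumerate(best_forest)])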
Following Occam's razor, you must select a simpler model for small dataset and may want to switch to a more complex model as your dataset grows.
There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that a given model's fitness is N, but you can never tell what the acceptable value of N is.
Thus, build several models and pick the one with the better tradeoff of predictive power and simplicity using the Akaike information criterion. It has useful properties and is not too hard to understand. :)
There are other tests of course, but AIC should get you started.
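As a toy illustration of AIC-based comparison (my own example; it assumes Gaussian errors, so AIC reduces, up to an additive constant, to 2k + n*ln(RSS/n), and the data are synthetic placeholders):
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

def aic(degree):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    k = degree + 1                            # number of fitted parameters
    return 2 * k + x.size * np.log(rss / x.size)

for d in (1, 3, 7):
    print(d, aic(d))                          # the simple linear model should win here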
For a simple test, check out the p-value.
