I need to compare two values in an if statement using a greater-than comparison.
import numpy as np

if delta_t > tao_threshhold:
    n = np.random.normal(0, 1)
    rxn_vector = propensity*delta_t + (propensity*delta_t)**0.5*n
    new_popul_num = popul_num
but I need the comparison to be "much greater than". In maths the notation used is >>, but in Python >> means something completely different (it is the bitwise right-shift operator).
Is there a way to express this "much greater than" comparison in Python?
Cheers
I think the idea of "much greater than" largely depends on the use case or personal preference. To answer your question: you'll need to decide "how much greater than" you want it to be.
Say I want to check whether delta_t is 5 times greater than tao_threshhold; I would do something like this:
if delta_t > 5*tao_threshhold:
but again, the solution lies in having a well-defined notion of "much greater than".
As everyone is pointing out "much greater" is not a well-defined concept in mathematics/science. It is used in theoretical work to demonstrate concepts but it is open to interpretation when trying to implement the mathematical models in code.
That being said, "much greater" is often understood as "some orders of magnitude greater", but exactly how many is more or less up to you to define using intuition and experiments. It is also highly dependent on the units of measurement and the scaling of the compared values (e.g. is there an upper bound on delta_t? Which values do you consider "much greater" and which do you not? Do you have prior knowledge, or a hint, of how its values are distributed depending on the different parameters of your algorithm?)
Practically, a way to treat it is the following:
Define some order-of-magnitude quantity:
E = 10e3
Implement your if statement as:
if delta_t > E*tao_threshold:
...
Be aware of precision errors: multiplying very large floating-point numbers together can overflow or lose precision.
If you are not sure how to choose appropriate E values, you can start from the following principle:
Intuitively, your algorithm should not depend on E. So, for a given set of parameters and a chosen E value (if your algorithm is deterministic, or proven to converge to a specific value for the chosen parameters), your algorithm should show the same results for nearby E's. You can therefore explore different ranges of E for different sets of parameter values and search for stabilization regions. Assuming you have a scalar output and fewer than 3 parameters, this can be done by plotting the output, as in the sketch below. Just a point here: this is not the same as an optimization search. You want to find E values that lead to "stable" results, not "best" results.
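For instance, a hypothetical sketch of that exploration (run_simulation is a stand-in name for your actual algorithm; its dummy return value exists only to make the example runnable):

import numpy as np
import matplotlib.pyplot as plt

def run_simulation(E):
    # stand-in for your actual algorithm, which would use
    # `if delta_t > E*tao_threshold:` internally
    return 1.0 / (1.0 + 100.0 / E)  # dummy output that flattens for large E

Es = np.logspace(1, 6, num=30)  # explore E from 10 up to 10^6
outputs = [run_simulation(E) for E in Es]

plt.semilogx(Es, outputs)
plt.xlabel("E (order-of-magnitude factor)")
plt.ylabel("algorithm output")
plt.show()

A stabilization region shows up as a flat stretch of the curve; pick an E from inside it.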
Document everything in the code, README, documentation site, and potential technical paper. Allow the user of the code to change the selected value if needed.
Considering that tao_threshold looks like a parameter, it may be simpler just to explore different scales in that variable rather than introducing a "much greater" quantifier. But this greatly depends on the context of your algorithm and it may reduce that parameter's interpretability.
No, "much greater than" is a mathematical concept that has not made its way in to Python (or any other computer language, to my knowledge).
Compilers, both ancient and modern, can determine if one quantity is greater than another. This is a well-defined comparison. But "much greater than," while obvious on paper, does not yield a boolean (yes/no) answer.
I'm a data analysis student and I'm starting to explore Genetic Algorithms at the moment. I'm trying to solve a problem with GA but I'm not sure about the formulation of the problem.
Basically I have a variable whose state is either 0 or 1 (0 means it's in the normal range of values, 1 means it's in a critical state). When the state is 1 I can apply 3 solutions (let's call them Solution A, B and C), and for each solution I know the time when the solution was applied and the time when the state of the variable goes back to 0.
So for each critical event my data contain: the solution applied, the time interval (in minutes) from the critical event to the application of the solution, and the time interval (in minutes) from the application of the solution until the state goes back to 0.
Using a genetic algorithm, I want to find out which solution is best (and fastest) for a critical event. If possible, I would also like to rank the solutions found, so that if in the future one solution can't be applied I can always apply the second best, for example.
I'm thinking of developing the solution in Python since I'm new to GA.
Edit: Specifying the problem (responding to AMack)
Yes, it's more or less that, but with some nuances. For example, function A can be more suitable for making the variable go to F, but because other problems exist with the variable, more than one solution gets applied. So in the data I receive for an event of V, sometimes 3 or 4 functions were applied, but only 1 or 2 of them are specialized for the problem I want to analyze. My objective is to build decision support for which solution to use when a given problem appears. But there can be more than one optimal solution, because for some events function A acts very fast, while in another case of the same event function A doesn't produce a fast response and function C is better. So in the end I want a result that indicates the best solutions to the problem, not only the fastest one, because the solution that is fastest in the majority of cases is sometimes not the fastest for the same issue with a different background.
I'm unsure of what your question is, but here are the elements you need for any GA (a minimal skeleton follows the list):
A population of initial "genomes"
A ranking (fitness) function
Some form of mutation and of crossing over within the genome
Reproduction
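Such a skeleton could look like the following (the bit-string genome and the "one-max" fitness function are placeholders for your problem-specific versions, not anything from the question):

import random

GENOME_LEN = 16
POP_SIZE = 50

def fitness(genome):
    # placeholder ranking function: "one-max", i.e. count the 1 bits
    return sum(genome)

def crossover(a, b):
    # single-point crossover within the genome
    point = random.randrange(1, GENOME_LEN)
    return a[:point] + b[point:]

def mutate(genome, rate=0.05):
    # flip each bit with a small probability
    return [bit ^ 1 if random.random() < rate else bit for bit in genome]

# population of initial genomes
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for generation in range(100):
    # rank, keep the better half, refill by reproduction
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children

print(max(fitness(g) for g in population))  # best score found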
If a critical event is always the same, your GA should work very well. That being said, if you have a different critical event but the same genome, you will run into trouble. GAs evolve functions towards the best possible solution for a set of conditions. If you constantly re-run the GA so that it may adapt to each unique situation, you will get a greater degree of adaptability, but you will have a speed issue.
You have a distinct advantage using Python because string manipulation (which you'll probably use for the genome) is easy. However...
Python is slow.
If the genome is short, the initial population is small, and there are very few generations, this shouldn't be a problem. You may lose possibly better solutions that way, but it will be significantly faster.
have fun...
You should take a look at the GARAGe group at Michigan State. They are a GA research group with a fair number of resources in terms of theory, papers, and software that should provide inspiration.
To start, let's make sure I understand your problem.
You have a set of sample data, each element containing a time series of a binary variable (we'll call it V). When V is set to True, a function (A, B, or C) is applied which returns V to its False state. You would like to apply a genetic algorithm to determine which function (or solution) will return V to False in the least amount of time.
If this is the case, I would stay away from GAs. GAs are typically used for some kind of function optimization / tuning. In general, the underlying assumption is that what you permute is under your control during the algorithm's application (i.e., you are modifying parameters used by the algorithm that are independent of the input data). In your case, my impression is that you just want to find out which of your (I assume) static functions perform best in a wide variety of cases. If you don't feel your current dataset provides a decent approximation of your true input distribution, you can always sample from it and permute the values to see what happens; however, this would not be a GA.
Having said all of this, I could be wrong. If anyone has used GAs in verification like this, please let me know. I'd certainly be interested in learning about it.
As stated in most spelling-correction tutorials, the correct word Ŵ for an incorrectly spelled word X is:
Ŵ = argmax_W P(X|W) P(W)
where P(X|W) is the likelihood and P(W) is the language model.
In the tutorial from which I am learning spelling correction, the instructor says that P(X|W) can be computed by using a confusion matrix which keeps track of how many times a letter in our corpus is mistakenly typed for another letter. I am using the World Wide Web as my corpus, and it can't be guaranteed that a letter was mistakenly typed for another letter. So is it okay if I use the Levenshtein distance between X and W instead of using the confusion matrix? Does it make much of a difference?
The way I am going to compute the Levenshtein distance in Python is this:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
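For example (the ratios in the comments are approximately what SequenceMatcher returns for these particular inputs):

print(similar("speling", "spelling"))  # ≈ 0.93, a close match
print(similar("speling", "dwelling"))  # ≈ 0.67, a much worse match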
See this
And here's the tutorial to make my question clearer: Click here
PS: I am working with Python.
There are a few things to say.
The model you are using to predict the most likely correction is a simple, cascaded probability model: There is a probability for W to be entered by the user, and a conditional probability for the misspelling X to appear when W was meant. The correct terminology for P(X|W) is conditional probability, not likelihood. (A likelihood is used when estimating how well a candidate probability model matches given data. So it plays a role when you machine-learn a model, not when you apply a model to predict a correction.)
If you were to use the Levenshtein distance for P(X|W), you would get integers between 0 and the length of the longer of W and X. This would not be suitable, because you are supposed to use a probability, which has to be between 0 and 1. Even worse, the value you get would be larger the more different the candidate is from the input. That's the opposite of what you want.
However, fortunately, SequenceMatcher.ratio() is not actually an implementation of Levenshtein distance. It's an implementation of a similarity measure and returns values between 0 and 1. The closer to 1, the more similar the two strings are. So this makes sense.
Strictly speaking, you would have to verify that SequenceMatcher.ratio() is actually suitable as a probability measure. For this, you'd have to check if the sum of all ratios you get for all possible misspellings of W is a total of 1. This is certainly not the case with SequenceMatcher.ratio(), so it is not in fact a mathematically valid choice.
However, it will still give you reasonable results, and I'd say it can be used for a practical and prototypical implementation of a spell-checker. There is a performance concern, though: since SequenceMatcher.ratio() is applied to a pair of strings (a candidate W and the user input X), you might have to apply it to a huge number of possible candidates coming from the dictionary to select the best match. That will be very slow when your dictionary is large. To improve this, you'll need to implement your dictionary using a data structure that has approximate string search built into it. You may want to look at this existing post for inspiration (it's for Java, but the answers include suggestions of general algorithms).
Yes, it is OK to use Levenshtein distance instead of a corpus of misspellings. Unless you are Google, you will not get access to a large and reliable enough corpus of misspellings. There are many other metrics that will do the job. I have used Levenshtein distance weighted by the distance of the differing letters on a keyboard. The idea is that abc is closer to abx than to abp, because p is farther away from c on my keyboard than x is. Another option involves accounting for swapped characters: swap is a more likely correction of sawp than saw is, because this is how people type. They often swap the order of characters, but it takes some real talent to type saw and then randomly insert a p at the end.
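As a rough sketch of such a keyboard-weighted distance (the row layout, the cost constants, and the function names are my own illustrative choices, not a standard):

KEYBOARD_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c)
           for r, row in enumerate(KEYBOARD_ROWS)
           for c, ch in enumerate(row)}

def sub_cost(a, b):
    # substitution cost grows with the physical distance between keys
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return 0.5 + 0.1 * (abs(r1 - r2) + abs(c1 - c2))

def weighted_levenshtein(x, w):
    # standard dynamic-programming edit distance, with weighted substitution
    m, n = len(x), len(w)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub_cost(x[i - 1], w[j - 1]))
    return d[m][n]

print(weighted_levenshtein("abx", "abc") < weighted_levenshtein("abp", "abc"))  # True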
The rules above are called an error model: you are trying to leverage knowledge of how real-world spelling mistakes occur to help with your decision. You can (and people have) come up with really complex rules. Whether they make a difference is an empirical question; you need to try and see. Chances are some rules will work better for some kinds of misspellings and worse for others. Google "how does aspell work" for more examples.
PS: All of the example mistakes above are purely due to the use of a keyboard. Sometimes people simply do not know how to spell a word; that is a whole other can of worms. Google "soundex".
I'm interested in programming languages that can reason about their own time complexity. To this end, it would be quite useful to have some way of representing time complexity programmatically, which would allow me to do things like:
f_time = O(n)
g_time = O(n^2)
h_time = O(sqrt(n))
fastest_asymptotically = min(f_time, g_time, h_time) # = h_time
total_time = f_time.inside(g_time).followed_by(h_time) # = O(n^3)
I'm using Python at the moment, but I'm not particularly tied to a language. I've experimented with sympy, but I haven't been able to find what I need out of the box there.
Is there a library that provides this capability? If not, is there a simple way to do the above with a symbolic math library?
EDIT: I wrote a simple library following @Patrick87's advice and it seems to work. I'm still interested if there are other solutions for this problem, though.
SymPy currently only supports the expansion at 0 (you can simulate other finite points by performing a shift). It doesn't support the expansion at infinity, which is what is used in algorithmic analysis.
But it would be a good base package for it, and if you implement it, we would gladly accept a patch (nb: I am a SymPy core developer).
Be aware that in general the problem is tricky, especially if you have two variables, or even symbolic constants. It's also tricky if you want to support oscillatory functions. EDIT: If you are interested in oscillating functions, this SymPy mailing list discussion gives some interesting papers.
EDIT 2: And I would recommend against trying to build this yourself from scratch, without the use of a computer algebra system. You will end up having to write your own computer algebra system, which is a lot of work, and even more work if you want to do it right and not have it be slow. There are already tons of existing systems, including many that can act as libraries for code to be built on top of them (such as SymPy).
Actually you are building/finding an expression simplifier which can deal with:
+ (in your terms: followed_by)
* (in your terms: inside)
^, log, ! (to represent the complexity)
variables (like n, m)
constant numbers (like the 2 in 2^n)
For example, given f_time.inside(g_time).followed_by(h_time), it could be an expression like:
n*(n^2) + n^(1/2)
and you are expecting a processor to output n^3.
So generally speaking, you might want to use a common expression simplifier (if you want it to be interesting, go and check how Mathematica does it) to get a simplified expression like n^3 + n^(1/2), and then you need an additional processor to choose the term with the highest complexity from the expression and get rid of the other terms. That would be easy: just use a table to define the complexity order of each kind of symbol.
Please note that in this case the expressions are just symbols; you should write them as something like strings (for your example: f_time = "O(n)"), not as functions.
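As a toy illustration of that last step (the precedence table and the accepted string shapes are deliberately simplistic and entirely my own invention):

import re
from fractions import Fraction

def rank(term):
    # toy complexity-order table: n^k ranks by its exponent k,
    # a bare n ranks as 1, anything else is treated as a constant
    if m := re.fullmatch(r"n\^\(?([\d/]+)\)?", term):
        return float(Fraction(m.group(1)))
    if term == "n":
        return 1.0
    return 0.0

def dominant_term(expr):
    return max(expr.split("+"), key=rank)

print(dominant_term("n^3+n^(1/2)"))  # n^3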
If you're only working with big-O notation and are interested in whether one function grows more or less quickly than another, asymptotically speaking...
Given functions f and g
Compute the limit as n goes to infinity of f(n)/g(n) using a computer algebra package
If the limit diverges to +infinity, then f > g - in the sense that g = O(f), but f != O(g).
If the limit converges to 0, then f < g - in the sense that f = O(g), but g != O(f).
If the limit converges to a finite, nonzero number, then f = g (in the sense that f = O(g) and g = O(f))
If the limit is undefined... beats me!
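As an illustrative sketch of that recipe with SymPy (the helper name compare_growth is mine, not an existing SymPy API):

import sympy as sp

n = sp.symbols('n', positive=True)

def compare_growth(f, g):
    # compare asymptotic growth via the limit of f(n)/g(n) at infinity
    lim = sp.limit(f / g, n, sp.oo)
    if lim == sp.oo:
        return "f > g: g = O(f), f != O(g)"
    if lim == 0:
        return "f < g: f = O(g), g != O(f)"
    return "f = g: f = O(g) and g = O(f)"

print(compare_growth(n**2, n))        # f > g
print(compare_growth(sp.sqrt(n), n))  # f < g
print(compare_growth(3*n, n))         # f = g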
I have to implement the solution to a 0/1 Knapsack problem with constraints.
My problem will have in most cases few variables (~ 10-20, at most 50).
I recall from university that there are a number of algorithms that in many cases perform better than brute force (I'm thinking, for example, of a branch-and-bound algorithm).
Since my problem is relatively small, I'm wondering if there is an appreciable advantage in terms of efficiency when using a sophisticated solution as opposed to brute force.
If it helps, I'm programming in Python.
One option is a pseudopolynomial algorithm, which uses dynamic programming, if the sum of weights is small enough. You just calculate, for each X and Y, whether you can reach weight X using the first Y items.
This runs in O(NS) time, where N is the number of items and S is the sum of the weights.
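A minimal sketch of that feasibility table, phrased as the set of reachable weights (extending each entry to also remember the best achievable value turns this into the full 0/1 knapsack solution):

def reachable_weights(weights, capacity):
    # after processing the first Y items, `reachable` holds every
    # weight X that some subset of those items can reach
    reachable = {0}
    for wt in weights:
        reachable |= {w + wt for w in reachable if w + wt <= capacity}
    return reachable

print(reachable_weights([2, 3, 4], 5))  # {0, 2, 3, 4, 5}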
Another possibility is to use a meet-in-the-middle approach.
Partition items into two halves and:
For the first half take every possible combination of items (there are 2^(N/2) possible combinations in each half) and store its weight in some set.
For the second half, take every possible combination of items and check whether there is a combination in the first half with a suitable weight.
This should run in O(2^(N/2)) time.
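A minimal sketch of the meet-in-the-middle idea for the feasibility question ("can some subset reach this exact weight?"); the optimization variant additionally keeps the best value per weight:

from itertools import combinations

def subset_sums(items):
    # all 2^(N/2) subset weights of one half
    sums = set()
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            sums.add(sum(combo))
    return sums

def can_reach(weights, target):
    half = len(weights) // 2
    left = subset_sums(weights[:half])
    right = subset_sums(weights[half:])
    # for each sum from the first half, look up the complement in the second
    return any(target - s in right for s in left)

print(can_reach([5, 11, 7, 3, 9], 20))  # True (11 + 9)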
Brute force would work fine for 10 variables, but for, say, 40 you'd get some 2^40 ≈ 10^12 possible solutions, which would probably take too long to enumerate. I'd consider approximation algorithms, e.g. the polynomial-time approximation scheme (see, e.g., http://math.mit.edu/~goemans/18434S06/knapsack-katherine.pdf), or use a search algorithm such as branch-and-bound, maybe with an additional heuristic.
Brute-force algorithms will always return the best solution. The problem with them is that on problems of exponential order they quickly become infeasible.
If you are guaranteed to have at most 20 variables, you will test no more than about a million solutions (2^20 ≈ 1M). Hence brute force is feasible, and no other algorithm will return a better solution.
Heuristics are great, but they should be used only when we have no exact solution to the problem. There is a great book that might help you: How to Solve It, by Michalewicz.
Which is more CPU-intensive: an if (x == num): check, or a sum x + y?
Your question is somewhat incomplete because you are comparing two different operations. If you need to add two things together then testing x==y isn't going to get you anywhere. So presumably you want to compare
if y != 0:
    sum += y

with

sum += y
It's a lot more complex for interpreted languages like Python, but on the hardware a test for non-zero introduces a branch and that in itself can be expensive. But I wouldn't want to say which would be faster without timing.
Throw into the equation different performance characteristics of different architectures and you have another confounding factor.
As always, you are best to write your code in the most natural maintainable way first and then time it. If you feel you need to extract more performance use a profiler to find hot spots and then optimise.
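For example, a quick way to time the two variants above with the standard timeit module (the loop bodies are illustrative stand-ins):

import timeit

setup = "xs = list(range(-1000, 1000)); total = 0"
with_test = "for y in xs:\n    if y != 0:\n        total += y"
plain_add = "for y in xs:\n    total += y"

print(timeit.timeit(with_test, setup=setup, number=10_000))
print(timeit.timeit(plain_add, setup=setup, number=10_000))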