I know Python offers the random module for simple lottery-style draws; random.shuffle(), for example, looks like a good fit.
However, I want to build my own simple one. What should I look into? Is there any specific mathematical philosophy behind lotteries?
Let's say the simplest situation: 100 names, from which 20 names are drawn at random.
I don't want to use shuffle, since I want to learn to build one myself.
I need some advice to start. Thanks.
You can generate your own pseudo-random numbers -- there's a huge amount of theory behind that; start, for example, here. Of course you won't be able to compete with Python's random module and its "Mersenne Twister" (explained partway down the large Wikipedia page I pointed you to) in either quality or speed, but for purposes of understanding, it's a worthwhile endeavor. Alternatively, you can get physically-derived random numbers, for example from /dev/random or /dev/urandom on Linux machines (Windows machines have their own interfaces for that, too): /dev/random insists on real physical entropy and blocks when the pool runs low, while /dev/urandom never blocks and therefore has better performance.
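To make that theory concrete, here is a toy pseudo-random generator you could write yourself: a linear congruential generator, one of the oldest and simplest PRNG designs. The constants are the widely published "Numerical Recipes" ones; this is a learning sketch, nowhere near the quality of the Mersenne Twister.

```python
class LCG:
    """Toy linear congruential generator -- for learning only, far below
    the quality of Python's Mersenne Twister."""
    def __init__(self, seed=12345):
        self.state = seed

    def next_int(self):
        # classic LCG step: state = (a*state + c) mod m, using the
        # widely published "Numerical Recipes" constants
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state

    def randbelow(self, n):
        # map the 32-bit state onto range(n); slightly biased, which is
        # acceptable for a learning exercise
        return self.next_int() % n

gen = LCG(seed=42)
print([gen.randbelow(100) for _ in range(5)])
```

Because the state update is fully deterministic, the same seed always reproduces the same sequence -- which is exactly the property that makes PRNGs predictable, and why they need care for anything security-related.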
Once you do have (or borrow from random;-) a pseudo-random (or really random) number generator, picking 20 items at random from 100 is still an interesting problem. While shuffling is a more general approach, a more immediately understandable one might be, assuming your myrand(N) function returns a random or pseudorandom int between 0 included and N excluded:
def pickfromlist(howmany, thelist):
    result = []
    listcopy = list(thelist)
    while listcopy and len(result) < howmany:
        i = myrand(len(listcopy))
        result.append(listcopy.pop(i))
    return result
Definitely not maximally efficient, but, I hope, maximally clear!-) In words: as long as required and feasible, pick one random item out of the remaining ones (the auxiliary list listcopy gives us the "remaining ones" at any step, and gets modified by .pop without altering the input parameter thelist, since it's a shallow copy).
See the Fisher-Yates Shuffle, described also in Knuth's The Art of Computer Programming.
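For reference, a sketch of Fisher-Yates in Python (using random.randrange as the underlying source of randomness) might look like this:

```python
import random

def fisher_yates_shuffle(items):
    """In-place Fisher-Yates shuffle -- essentially what random.shuffle does."""
    for i in range(len(items) - 1, 0, -1):
        # pick a position from the not-yet-fixed prefix [0, i] and swap it in
        j = random.randrange(i + 1)
        items[i], items[j] = items[j], items[i]

names = ["name%d" % n for n in range(100)]
fisher_yates_shuffle(names)
print(names[:20])  # after a full shuffle, the first 20 entries are a fair draw
```
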
I praise your desire to do this on your own.
Back in the 1950s, random numbers were unavailable to most people without access to a supercomputer (of the time). The RAND Corporation published a book called A Million Random Digits with 100,000 Normal Deviates which had, literally, just that: random numbers. It was awesome because it enabled laypeople to use high-quality random numbers for research purposes.
Now, back to your question.
I recommend you read the instructions on how to use the book (yes, it comes with instructions) and try to implement that in your Python code. This will not be efficient or elegant, but you will understand the implications of the algorithm you ultimately settle for. I love the part that instructs you to
open the book to an unselected page of the digit table and blindly choose a five-digit number; this number with the first number reduced modulo 2 determines the starting line; the two digits to the right of the initially selected five-digit number are reduced modulo 50 to determine the starting column in the starting line
It was an art to read that table of numbers!
To be sure, I'm not encouraging you to reinvent the wheel for production code. I'm encouraging you to learn about the art of randomness by implementing a clever, if not very efficient, random number generator.
My work requires that I use high-quality random numbers. On limited occasions I have found the site www.random.org a very good source of both insight and material. From their website:
RANDOM.ORG offers true random numbers to anyone on the Internet. The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs. People use RANDOM.ORG for holding drawings, lotteries and sweepstakes, to drive games and gambling sites, for scientific applications and for art and music.
Now, go and implement your own lottery.
You can use: random.sample
Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement.
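For the 100-names/20-winners case from the question, that makes the whole task a one-liner (the name list here is made up for illustration):

```python
import random

names = ["person%d" % i for i in range(100)]  # placeholder population
winners = random.sample(names, 20)            # 20 unique picks, no replacement
print(winners)
```
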
For a more low-level approach, use random.choice in a loop:
Return a random element from the non-empty sequence seq.
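Since random.choice draws with replacement, a lottery loop has to remove each winner from the pool itself; a minimal sketch:

```python
import random

names = ["person%d" % i for i in range(100)]  # placeholder population
pool = list(names)   # work on a copy so the original list is untouched
winners = []
for _ in range(20):
    pick = random.choice(pool)
    pool.remove(pick)        # ensure the same name can't be drawn twice
    winners.append(pick)
print(winners)
```
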
The pseudo-random number generator (PRNG) in Python is pretty good. If you want to go even more low-level, you can implement your own. Start with reading this article. The mathematical name for this kind of lottery is "sampling without replacement". Google that for information - here's a good link.
The main shortcoming of software-based methods of generating lottery numbers is the fact that all random numbers generated by software are pseudo-random.
This may not be a problem for your simple application, but you did ask about a 'specific mathematical philosophy'. You will have noticed that all commercial lottery systems use physical methods: balls with numbers.
And behind the scenes, the numbers generated by physical lottery systems will be carefully scrutinised for indications of non-randomness, and steps taken to eliminate it.
As I say, this may not be a consideration for your simple application, but the overriding requirement of a true lottery (the 'specific mathematical philosophy') should be mathematically demonstrable randomness.
I want to implement a spell checker that checks the spelling in a text file and outputs the errors and corrections. I want to create this using Python.
But the main thing is, I want to implement it using a genetic algorithm. How can I implement a genetic algorithm for a spell checker?
Don't expect my idea here to be perfect or optimal, but it might be a good starting point for you if you decide to go this route. A genetic algorithm may not be the best choice for a spell checker though.
For a genetic algorithm, you need to have a starting population, a way to pass the genes to the "next generation" (crossover), a definite means of creating mutations, and a way of selecting which ones are passed on to the next generation (aka a fitness function). Along with this you'll need, of course, a corpus. You can try the dictionary.com API if it's any good (I've never used it) http://www.programmableweb.com/api/dictionary.com.
For the starting population, you have the awkward issue that your starting population will be thousands of copies of the exact same word (i.e. ['hello']*1000). From here you can just check if it's a word, and if it is, just return True (because grammar-checking there vs. their vs. they're will be a pain in the ass).
To start off, you'll need to rely entirely on mutations to gain diversity, so maybe make mutations more likely if it's an earlier generation, and once the diversity grows the chance of mutation decreases. Mutations can be any of: insert a random letter somewhere, remove a letter somewhere, change a letter somewhere, do more than one of these.
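A possible sketch of such a mutation operator (the three operator names and their equal probabilities here are arbitrary illustrative choices, not a prescription):

```python
import random
import string

def mutate(word):
    """Apply one random mutation: insert, delete, or substitute a letter."""
    letters = string.ascii_lowercase
    op = random.choice(["insert", "delete", "substitute"])
    if op == "insert" or not word:
        # insert a random letter at a random position
        i = random.randrange(len(word) + 1)
        return word[:i] + random.choice(letters) + word[i:]
    i = random.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    # substitute: replace the letter at position i
    return word[:i] + random.choice(letters) + word[i + 1:]

print(mutate("hello"))
```
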
For your fitness function, your best bet will be to use a sequence alignment algorithm. See: http://en.wikipedia.org/wiki/Sequence_alignment. If you REALLY want to get advanced, try creating phonetic spellings for each word in your population and see if they match anything in the corpus, and increase the score based on that (i.e. tho and though would have the same pronunciation). I cannot claim to know anything about that. Bear in mind all of this will slow down your application horribly. It might be best to limit your population to 1000-2000.
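As a quick stand-in for a real alignment algorithm, the stdlib's difflib.SequenceMatcher gives a similarity ratio that can serve as a first-cut fitness score (this is a simplification, not Needleman-Wunsch or any other proper alignment):

```python
from difflib import SequenceMatcher

def fitness(candidate, target):
    """Similarity in [0, 1]; 1.0 means an exact match."""
    return SequenceMatcher(None, candidate, target).ratio()

print(fitness("helo", "hello"))  # high score: only one letter off
print(fitness("xyz", "hello"))   # low score: nothing in common
```
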
For your crossover, you should take a few of your samples (early on you may need to use roulette to pick which will be the most fit, but later on you can use tournament for speed purposes). Again you can use the sequence alignment between each "parent", and then decide which letter to pull from each parent (i.e. soeed vs s_eeo can come out to be soeed, seed, seeo, or soeeo).
Don't take this as an expert solution, plus I only put a few minutes of thought into this, but it could be a good start if you decide to use a genetic algorithm.
import string, random, platform, os, sys

def rPass():
    sent = os.urandom(random.randrange(900, 7899))
    print sent, "\n"
    intsent = 0
    for i in sent:
        intsent += ord(i)
    print intsent
    intset = 0

rPass()
I need help figuring out total possible outputs for the bytecode section of this algorithm. Don't worry about the for loop and the ord stuff that's for down the line. -newbie crypto guy out.
I won't worry about the loop and the ord stuff, so let's just throw that out and look at the rest.
Also, I don't understand "I need help figuring out total possible outputs for the unicode section of this algorithm", because there is no Unicode section of the algorithm, or in fact any Unicode anything anywhere in your code. But I can help you figure out the total possible outputs of the whole thing. Which we'll do by simplifying it step by step.
First:
li = []
for a in range(900, 7899):
    li.append(a)
This is exactly equivalent to:
li = range(900, 7899)
Meanwhile:
li[random.randint(0,7000)]
Because li happens to be 6999 elements long, this is essentially random.choice(li). (Strictly speaking, randint(0, 7000) is inclusive of 7000, so indices 6999 and 7000 would raise an IndexError; but the intent is clearly random.choice.)
And, putting the last two together, this means it's equivalent to:
random.choice(range(900,7899))
… which is equivalent to:
random.randrange(900,7899)
But wait, what about that random.shuffle(li, random.random)? Well (ignoring the fact that random.random is already the default for the second parameter), the choice is already random-but-not-cryptographically-so, and adding another shuffle doesn't change that. If someone is trying to mathematically predict your RNG, adding one more trivial shuffle with the same RNG will not make it any harder to predict (while adding a whole lot more work based on the results may make a timing attack easier).
In fact, even if you used a subset of li instead of the whole thing, there's no way that could make your code more unpredictable. You'd have a smaller range of values to brute-force through, for no benefit.
So, your whole thing reduces to this:
sent = os.urandom(random.randrange(900, 7899))
The possible output is: any byte string between 900 and 7898 bytes long (randrange excludes its upper bound).
The length is random, and roughly evenly distributed, but it's not random in a cryptographically-unpredictable sense. Fortunately, that's not likely to matter, because presumably the attacker can see how many bytes he's dealing with instead of having to predict it.
The content is random, both evenly distributed and cryptographically unpredictable, at least to the extent that your system's urandom is.
And that's all there is to say about it.
However, the fact that you've made it much harder to read, write, maintain, and think through puts you at a major disadvantage, with no compensating disadvantage for your attacker.
So, just use the one-liner.
I think in your followup questions, you're asking how many possible values there are for 900-7898 bytes of random data.
Well, how many values are there for 900 bytes? 256**900. How many for 901? 256**901. So, the answer is:
sum(256**i for i in range(900, 7899))
… which is about 2**63184, or 10**19020.
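Python's arbitrary-precision integers let you check that arithmetic directly:

```python
# total number of distinct outputs: one term per possible length
total = sum(256**i for i in range(900, 7899))
print(total.bit_length())  # 63185 bits, i.e. total is about 2**63184
print(len(str(total)))     # 19021 decimal digits, i.e. about 10**19020
```
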
So, 63184 bits of security sounds pretty impressive, right? Probably not. If your algorithm has no flaws in it, 100 bits is more than you could ever need. If your algorithm is flawed (and of course it is, because they all are), blindly throwing thousands more bits at it won't help.
Also, remember, the whole point of crypto is that you want cracking to be 2**N slower than legitimate decryption, for some large N. So, making legitimate decryption much slower makes your scheme much worse. This is why every real-life working crypto scheme uses a few hundred bits of key, salt, etc. (Yes, public-key encryption uses a few thousand bits for its keys, but that's because its keys aren't randomly distributed. And generally, all you do with those keys it to encrypt a randomly-generated session/document key of a few hundred bits.)
One last thing: I know you said to ignore the ord, but…
First you can write that whole part as intsent=sum(bytearray(sent)).
But, more importantly, if all you're doing with this buffer is summing it up, you're using a lot of entropy to generate a single number with a lot less entropy. (This should be obvious once you think about it. If you have two separate bytes, there are 65536 possibilities; if you add them together, there are only 511 possible sums, 0 through 510.)
Also, by generating a few thousand one-byte random numbers and adding them up, that's basically a very close approximation of a normal or gaussian distribution. (If you're a D&D player, think of how 3D6 gives 10 and 11 more often than 3 and 18… and how that's more true for 3D6 than for 2D6… and then consider 6000D6.) But then, by making the number of bytes range from 900 to 7898, you're flattening it back toward a uniform distribution from 900*127.5 to 7898*127.5. At any rate, if you can describe the distribution you're trying to get, you can probably generate that directly, without wasting all this urandom entropy and computation.
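You can see the central-limit effect directly by summing fixed-length urandom buffers (Python 3 syntax assumed):

```python
import os

# Sum 1000 random bytes, many times; each byte averages 127.5, so the
# sums cluster tightly around 1000 * 127.5 = 127500.
n = 1000
sums = [sum(bytearray(os.urandom(n))) for _ in range(200)]
mean = sum(sums) / len(sums)
print(mean)  # lands near 127500
```
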
It's worth noting that there are very few cryptographic applications that can possibly make use of this much entropy. Even things like generating SSL certs use on the order of 128-1024 bits, not 64K bits.
You say:
trying to kill the password.
If you're trying to encrypt a password so it can be, say, stored on disk or sent over the network, this is almost always the wrong approach. You want to use some kind of zero-knowledge proof—store hashes of the password, or use challenge-response instead of sending data, etc. If you want to build a "keep me logged in feature", do that by actually keeping the user logged in (create and store a session auth token, rather than storing the password). See the Wikipedia article password for the basics.
Occasionally, you do need to encrypt and store passwords. For example, maybe you're building a "password locker" program for a user to store a bunch of passwords in. Or a client to a badly-designed server (or a protocol designed in the 70s). Or whatever. If you need to do this, you want one layer of encryption with a relatively small key (remember that a typical password is itself only about 256 bits long, and has less than 64 bits of actual information, so there is absolutely no benefit from using a key thousands of times as long as that). The only way to make it more secure is to use a better algorithm—but really, the encryption algorithm will almost never be the best attack surface (unless you've tried to design one yourself); put your effort into the weakest areas of the infrastructure, not the strongest.
You ask:
Also is urandom's output codependent on the assembler it's working with?
Well… there is no assembler it's working with, and I can't think of anything else you could be referring to that makes any sense.
All that urandom is dependent on is your OS's entropy pool and PRNG. As the docs say, urandom just reads /dev/urandom (Unix) or calls CryptGenRandom (Windows).
If you want to know exactly how that works on your system, man urandom or look up CryptGenRandom in MSDN. But all of the major OS's can generate enough entropy and mix it well enough that you basically don't have to worry about this at all. Under the covers, they all effectively have some pool of entropy, and some cryptographically-secure PRNG to "stretch" that pool, and some kernel device (linux, Windows) or user-space daemon (OS X) that gathers whatever entropy it can get from unpredictable things like user actions to mix it into the pool.
So, what is that dependent on? Assuming you don't have any apps wasting huge amounts of entropy, and your machine hasn't been compromised, and your OS doesn't have a major security flaw… it's basically not dependent on anything. Or, to put it another way, it's dependent on those three assumptions.
To quote the linux man page, /dev/urandom is good enough for "everything except long-lived GPG/SSL/SSH keys". (And on many systems, if someone tries to run a program that, like your code, reads thousands of bytes of urandom, or tries to kill the entropy-seeding daemon, or whatever, it'll be logged, and hopefully the user/sysadmin can deal with it.)
hmmmm python goes through an interpreter of its own so i'm not sure how that plays in
It doesn't. Obviously calling urandom(8) does a bunch of extra stuff before and after the syscall to read 8 bytes from /dev/urandom than you'd do in, say, a C program… but the actual syscall is identical. So the urandom device can't even tell the difference between the two.
but I'm simply asking if urandom will produce different results on a different architecture.
Well, yes, obviously. For example, Linux and OS X use entirely different CSPRNGs and different ways of accumulating entropy. But the whole point is that it's supposed to be different, even on an identical machine, or at a different time on the same machine. As long as it produces "good enough" results on every platform, that's all that matters.
For instance would a processor\assembler\interpreter cause a fingerprint specific to said architecture, which is within reason stochastically predictable?
As mentioned above, the interpreter ultimately makes the same syscall as compiled code would.
As for an assembler… there probably isn't any assembler involved anywhere. The relevant parts of the Python interpreter, the random device, the entropy-gathering service or driver, etc. are most likely written in C. And even if they were hand-coded in assembly, the whole point of coding in assembly is that you pretty much directly control the machine code that gets generated, so different assemblers wouldn't make any difference.
The processor might leave a "fingerprint" in some sense. For example, I'll bet that if you knew the RNG algorithm, and controlled its state directly, you could write code that could distinguish an x86 vs. an x86_64, or maybe even one generation of i7 vs. another, based on timing. But I'm not sure what good that would do you. The algorithm will still generate the same results from the same state. And the actual attacks used against RNGs are about attacking the algorithm, the entropy accumulator, and/or the entropy estimator.
At any rate, I'm willing to bet large sums of money that you're safer relying on urandom than on anything you come up with yourself. If you need something better (and you don't), implement—or, better, find a well-tested implementation of—Fortuna or BBS, or buy a hardware entropy-generating device.
I'm currently working on a website that will allow students from my university to automatically generate valid schedules based on the courses they'd like to take.
Before working on the site itself, I decided to tackle the issue of how to schedule the courses efficiently.
A few clarifications:
Each course at our university (and I assume at every other university) comprises one or more sections. Calculus I, for instance, currently has 4 sections available. The number of sections, and whether or not the course has a lab, drastically affects the scheduling process.
Courses at our university are represented using a combination of subject abbreviation and course code. In the case of Calculus I: MATH 1110.
The CRN is a code unique to a section.
The university I study at is not mixed, meaning males and females study in (almost) separate campuses. What I mean by almost is that the campus is divided into two.
The datetimes and timeranges dicts are meant to decrease calls to datetime.datetime.strptime(), which was a real bottleneck.
My first attempt consisted of the algorithm looping continuously until 30 schedules were found. Schedules were created by randomly choosing a section from one of the inputted courses, and then trying to place sections from the remaining courses to try to construct a valid schedule. If not all of the courses fit into the schedule i.e. there were conflicts, the schedule was scrapped and the loop continued.
Clearly, the above solution is flawed. The algorithm took too long to run, and relied too much on randomness.
The second algorithm does the exact opposite of the old one. First, it generates a collection of all possible schedule combinations using itertools.product(). It then iterates through the schedules, crossing off any that are invalid. To ensure assorted sections, the schedule combinations are shuffled (random.shuffle()) before being validated. Again, there is a bit of randomness involved.
After a bit of optimization, I was able to get the scheduler to run in under 1 second for an average schedule consisting of 5 courses. That's great, but the problem begins once you start adding more courses.
To give you an idea, when I provide a certain set of inputs, the amount of combinations possible is so large that itertools.product() does not terminate in a reasonable amount of time, and eats up 1GB of RAM in the process.
Obviously, if I'm going to make this a service, I'm going to need a faster and more efficient algorithm. Two that have popped up online and in IRC: dynamic programming and genetic algorithms.
Dynamic programming cannot be applied to this problem because, if I understand the concept correctly, it involves breaking up the problem into smaller pieces, solving these pieces individually, and then bringing the solutions of these pieces together to form a complete solution. As far as I can see, this does not apply here.
As for genetic algorithms, I do not understand them much, and cannot even begin to fathom how to apply one in such a situation. I also understand that a GA would be more efficient for an extremely large problem space, and this is not that large.
What alternatives do I have? Is there a relatively understandable approach I can take to solve this problem? Or should I just stick to what I have and hope that not many people decide to take 8 courses next semester?
I'm not a great writer, so I'm sorry for any ambiguities in the question. Please feel free to ask for clarification and I'll try my best to help.
Here is the code in its entirety.
http://bpaste.net/show/ZY36uvAgcb1ujjUGKA1d/
Note: Sorry for using a misleading tag (scheduling).
Scheduling is a very famous constraint satisfaction problem that is generally NP-Complete. A lot of work has been done on the subject, even in the same context as you: Solving the University Class Scheduling Problem Using Advanced ILP Techniques. There are even textbooks on the subject.
People have taken many approaches, including:
Dynamic programming
Genetic algorithms
Neural networks
You need to reduce your problem space and complexity. Make as many simplifying assumptions as possible (maximum number of classes, block-based timing, etc.). There is no silver bullet for this problem, but it should be possible to find a near-optimal solution.
Some semi-recent publications:
QUICK scheduler a time-saving tool for scheduling class sections
Scheduling classes on a College Campus
Did you ever read anything about genetic programming? The idea behind it is that you let the 'thing' you want solved evolve, just by itself, until it has grown into the best solution(s) possible.
You generate a thousand schedules, of which usually zero are anywhere in the right direction of being valid. Next, you change 'some' courses, randomly. From these new schedules you select some of the best, based on ratings you give according to the 'goodness' of the schedule. Next, you let them reproduce, by combining some of the courses on both schedules. You end up with a thousand new schedules, but all of them a tiny fraction better than the ones you had. Let it repeat until you are satisfied, and select the schedule with the highest rating from the last thousand you generated.
There is randomness involved, I admit, but the schedules keep getting better, no matter how long you let the algorithm run. Just like real life and organisms there is survival of the fittest, and it is possible to view the different general 'threads' of the same kind of schedule, that is about as good as another one generated. Two very different schedules can finally 'battle' it out by cross breeding.
A project involving school schedules and genetic programming:
http://www.codeproject.com/Articles/23111/Making-a-Class-Schedule-Using-a-Genetic-Algorithm
I think they explain pretty well what you need.
My final note: I think this is a very interesting project. It is quite difficult to make, but once done it is just great to see your solution evolve, just like real life. Good luck!
The way you're currently generating combinations of sections is probably throwing up huge numbers of combinations that are excluded by conflicts between more than one course. I think you could reduce the number of combinations that you need to deal with by generating the product of the sections for only two courses first. Eliminate the conflicts from that set, then introduce the sections for a third course. Eliminate again, then introduce a fourth, and so on. This should see a more linear growth in the processing time required as the number of courses selected increases.
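A sketch of that incremental approach (the section data model and the conflicts predicate here are made up for illustration):

```python
def conflicts(sec_a, sec_b):
    # hypothetical predicate: two sections clash if they share a time slot
    return bool(set(sec_a["slots"]) & set(sec_b["slots"]))

def build_schedules(courses):
    """Grow partial schedules one course at a time, pruning conflicts early
    instead of materialising the full itertools.product up front."""
    partials = [[]]
    for sections in courses:
        partials = [
            partial + [sec]
            for partial in partials
            for sec in sections
            if all(not conflicts(sec, placed) for placed in partial)
        ]
    return partials

# toy data: each course is a list of sections, each with a list of time slots
courses = [
    [{"crn": "A1", "slots": [1, 2]}, {"crn": "A2", "slots": [3, 4]}],
    [{"crn": "B1", "slots": [2, 3]}, {"crn": "B2", "slots": [5, 6]}],
]
for sched in build_schedules(courses):
    print([s["crn"] for s in sched])
```

Because conflicting partial schedules are discarded as soon as they appear, the working set stays far smaller than the full cartesian product when courses genuinely clash.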
This is a hard problem. If you google something like 'course scheduling problem paper' you will find a lot of references. Genetic algorithm - no; dynamic programming - yes. GAs are much harder to understand and implement than standard DP algorithms. Usually people who use GAs out of the box don't understand the standard techniques. Do some research and you will find different algorithms. You might be able to find some implementations. Coming up with your own algorithm is way, way harder than putting some effort into understanding DP.
The problem you're describing is a Constraint Satisfaction Problem. My approach would be the following:
Check if there are any incompatibilities between courses; if yes, record them as constraints or arcs
While no solution is found:
Select the course with the fewest constraints (that is, the fewest incompatibilities with other courses)
Run the AC-3 algorithm to reduce the search space
I've tried this approach with sudoku solving and it worked (solved the hardest sudoku in the world in less than 10 seconds)
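A minimal backtracking search in the same spirit (much simpler than full AC-3 with constraint propagation, and with a made-up conflict predicate and toy data, but it shows the select-assign-undo structure):

```python
def conflicts(a, b):
    # toy predicate: sections conflict if they share a time slot
    return bool(set(a["slots"]) & set(b["slots"]))

def solve(courses, schedule=()):
    """Assign one section per course, backtracking on conflicts."""
    if len(schedule) == len(courses):
        return list(schedule)
    for section in courses[len(schedule)]:
        if all(not conflicts(section, placed) for placed in schedule):
            result = solve(courses, schedule + (section,))
            if result is not None:
                return result
    return None  # no conflict-free assignment exists

courses = [
    [{"crn": "A1", "slots": [1]}, {"crn": "A2", "slots": [2]}],
    [{"crn": "B1", "slots": [1]}],
]
print([s["crn"] for s in solve(courses)])
```
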
I want to generate tokens and keys that are random strings. What is the acceptable method to generate them?
Is generating two UUIDs via standard library functions and concatenating them acceptable?
os.urandom provides access to the operating system's random number generator.
EDIT: If you are using Linux and are very concerned about security, you should use /dev/random directly. This call will block until sufficient entropy is available.
Computers (without special hardware) can only generate pseudo-random data. After a while, all pseudo-random number generators start to repeat themselves. The amount of data a generator can produce before repeating itself is called its period.
A very popular pseudo-random number generator (also used by Python's random module) is the Mersenne Twister. But it is deemed unsuitable for cryptographic purposes, because it is fairly easy to predict the next iteration after observing only a relatively small number of outputs.
See the Wikipedia page on cryptographically secure pseudo-random number generators for a list of algorithms that seem suitable.
Operating systems like FreeBSD, OpenBSD and OS X use the Yarrow algorithm for their urandom devices. So on those systems using os.urandom might be OK, because it is well-regarded as being cryptographically secure.
Of course what you need to use depends to a large degree on how high your requirements are; how secure do you want it to be? In general I would advise you to use published and tested implementations of algorithms. Writing your own implementation is too easy to get wrong.
Edit: Computers can gather random data by watching e.g. the times at which interrupts arrive. This however does not supply a large amount of random data, and it is therefore often used to seed a PRNG.
I'm working on a project with a friend where we need to generate a random hash. Before we had time to discuss, we both came up with different approaches and because they are using different modules, I wanted to ask you all what would be better--if there is such a thing.
hashlib.sha1(str(random.random())).hexdigest()
or
os.urandom(16).encode('hex')
Typing this question out has got me thinking that the second method is better. Simple is better than complex. If you agree, how reliable is this for 'randomly' generating hashes? How would I test this?
This solution:
os.urandom(16).encode('hex')
is the best since it uses the OS to generate randomness which should be usable for cryptographic purposes (depends on the OS implementation).
random.random() generates pseudo-random values.
Hashing a random value does not add any new randomness.
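As an aside, .encode('hex') is the Python 2 spelling; under Python 3 (an assumption about your environment) the same idea reads:

```python
import os
import secrets

token = os.urandom(16).hex()  # Python 3 equivalent of .encode('hex')
print(token)

# Python 3.6+ packages the same pattern in the secrets module:
print(secrets.token_hex(16))
```
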
random.random() is a pseudo-random generator; that means the numbers are generated from a deterministic sequence. If you call random.seed(some_number), the sequence generated afterwards will always be the same.
os.urandom() gets its random numbers from the OS's RNG, which uses an entropy pool to collect real randomness, usually from random events in hardware devices; there even exist dedicated entropy-generator devices for systems where a lot of random numbers are needed.
On Unix systems there are traditionally two random number generators: /dev/random and /dev/urandom. Reads from the first block if there is not enough entropy available, whereas when you read /dev/urandom and there is not enough entropy data available, it falls back to a pseudo-RNG and doesn't block.
So the choice usually depends on what you need: if you need a few, evenly distributed random numbers, the built-in PRNG should be sufficient. For cryptographic use it's always better to use real random numbers.
The second solution clearly has more entropy than the first. Assuming the quality of the source of the random bits would be the same for os.urandom and random.random:
In the second solution you are fetching 16 bytes = 128 bits worth of randomness
In the first solution you are fetching a floating point value which has roughly 52 bits of randomness (IEEE 754 double, ignoring subnormal numbers, etc...). Then you hash it around, which, of course, doesn't add any randomness.
More importantly, the quality of the randomness coming from os.urandom is expected and documented to be much better than the randomness coming from random.random. os.urandom's docstring says "suitable for cryptographic use".
Testing randomness is notoriously difficult - however, I would choose the second method, but ONLY (or, only as far as comes to mind) for this case, where the hash is seeded by a random number.
The whole point of hashes is to create a number that is vastly different based on slight differences in input. For your use case, the randomness of the input should do. If, however, you wanted to hash a file and detect one eensy byte's difference, that's when a hash algorithm shines.
I'm just curious, though: why use a hash algorithm at all? It seems that you're looking for a purely random number, and there are lots of libraries that generate uuid's, which have far stronger guarantees of uniqueness than random number generators.
if you want a unique identifier (uuid), then you should use
import uuid
uuid.uuid4().hex
https://docs.python.org/3/library/uuid.html