I am thinking of giving one or more sets of introductory lectures to introduce people in my department to Python and related scientific tools, as I did once last summer (py4science # UND).
To make the meetings more interesting and attract more attention, I gave away two pieces of Python learning material to lucky audience members, chosen in the ways shown below:
1) Collect names, assign each a number, and pick one at random as the winner from the resulting dictionary.
import random
lucky = {1: 'Lucky1', ...}
# list() is needed because dict.keys() is not directly indexable in Python 3
print(lucky[random.choice(list(lucky.keys()))])
2) Similar to the previous one, but pop items from the dictionary one at a time, so the last one remaining is the luckiest.
import random
lucky = {1: 'Lucky1', ...}
# pop random entries until only one remains; the survivor is the winner
while len(lucky) > 1:
    lucky.pop(random.choice(list(lucky.keys())))
print(lucky)
Right now, I am looking for at least one more idea that is inherently random and demonstrates a useful language feature, to help me make a more entertaining lottery at the end of one of the sessions.
Cards are also a source of popular (and familiar!) games of chance.
Perhaps you could show how easy it is to generate, shuffle and sample cards:
#!/usr/bin/env python
import random
import itertools

numname = {1: 'Ace', 11: 'Jack', 12: 'Queen', 13: 'King'}
suits = ['Clubs', 'Diamonds', 'Hearts', 'Spades']
numbers = range(1, 14)

cards = ['%s-%s' % (numname.get(number, number), suit)
         for number, suit in itertools.product(numbers, suits)]
print(cards)

random.shuffle(cards)
print(cards)

hand = random.sample(cards, 5)
print(hand)
One of the cutest uses of random numbers for mid-sized crowds is finding cycles. I will describe the physical method, and then some explorations. The Python code is fairly trivial.
Start with your group of about 100 people with their names on pieces of paper in a bowl. Everyone descends on the bowl and takes a random piece of paper. Each person goes to the person with that name. This leads to groups clumping together in various sizes. Not always what people expect.
For example, if Alice picks Bob, Bob picks Charlie, and Charlie picks Alice, then these three people will end up in their own clump. For some groups, have people join hands with their matches to see everyone being pulled this way and that. Also to see how the matches create chains or clumps.
Now write software to watch the number of clumps. Do the math on clumps, asking, for example, "How often is the biggest clump smaller than half the group?" For example, with N students, each has a 1/N chance of drawing their own name, so on average one person does.
Do you need code?
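Here is a minimal simulation sketch (the function name, group size, and trial count are my own choices):

import random

def clump_sizes(n):
    """Simulate n people each drawing one name; return the cycle (clump) sizes."""
    draws = list(range(n))
    random.shuffle(draws)  # draws[i] is the name person i picked
    seen = [False] * n
    sizes = []
    for start in range(n):
        size, i = 0, start
        while not seen[i]:
            seen[i] = True
            i = draws[i]
            size += 1
        if size:
            sizes.append(size)
    return sorted(sizes, reverse=True)

# How often is the biggest clump smaller than half the group?
n, trials = 100, 10000
hits = sum(clump_sizes(n)[0] < n / 2 for _ in range(trials))
print(hits / trials)  # roughly 1 - ln 2, about 0.31

The 0.31 comes from the fact that a random permutation contains a cycle longer than half its length with probability about ln 2.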
Computing Pi is always fun ;-)
import random

def approx_pi(n):
    # n random (x, y) pairs in the unit square (as a generator)
    data = ((random.random(), random.random()) for _ in range(n))
    # fraction of points inside the quarter circle, times 4
    return 4.0 * sum(1 for x, y in data if x**2 + y**2 < 1) / n

print(approx_pi(100000))
I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Ultimately, I am only interested in a DRY metric (how well the code satisfies the DRY principle).
Naively, I tried a simple autocorrelation, but it is hard to find a proper threshold.
import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]
y = np.correlate(v, v, mode="same")
y = y[: int(len(y) / 2)]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)  # remove the overall trend with a cubic fit
f = (y - z)[:-5]
plt.plot(f)
plt.show()
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np

lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]

n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j: continue  # skip same line
        group_size = np.sum([len(x) for x in a])
        if group_size < 5: continue  # skip short lines
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)

dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()
Why not use:
d = np.exp(np.array(d))
Thus, the difflib module looks promising: SequenceMatcher does some magic (Levenshtein?), but I would need some magic constants as well (0.7)... However, this code is worse than O(n^2) and runs very slowly for long files.
What is funny is that the amount of repetition is quite easily spotted with attentive eyes (apologies to this student for taking his code as a good bad example).
I am sure there is a more clever solution out there.
Any hint?
I would build a system based on compressibility, because repeated content is essentially what makes text compressible. Modern compression algorithms already look for repetition, so let's piggyback on that work.
Things that are similar will compress well under any reasonable compression algorithm, e.g. LZ-based ones. Under the hood, the compressed output is essentially text with back-references to itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm and compares the output length with that for lines [0:n+1].
When the incremental length of the compressed output grows by much less than the incremental input, note down that you potentially have a DRY candidate at that location; plus, if you can figure out the format, you can see which previous text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic, you can just pull out the references directly.
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
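Coming back to the "size doesn't grow as much" heuristic, here is a minimal sketch using zlib; the threshold and the file name are arbitrary assumptions:

import zlib

def dry_candidates(lines, threshold=4):
    """Flag lines whose addition barely grows the compressed prefix."""
    candidates = []
    prev_size = len(zlib.compress(b""))
    for i in range(len(lines)):
        compressed = zlib.compress("".join(lines[:i + 1]).encode())
        growth = len(compressed) - prev_size
        # a long line that adds almost nothing to the archive is a repeat suspect
        if len(lines[i]) > 3 * threshold and growth < threshold:
            candidates.append(i)
        prev_size = len(compressed)
    return candidates

print(dry_candidates(open("program.cpp").readlines()))

Recompressing every prefix makes this quadratic overall; feeding lines to a zlib.compressobj and measuring after sync flushes would keep it incremental.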
Looks like you're choosing a long path. I wouldn't go there.
I would look into minifying the code before analyzing it, to completely remove any influence of variable names, extra spacing, formatting, and even slight logic reshuffling.
Another approach would be comparing the students' bytecode. But it may not be a very good idea, since the result will likely have to be cleaned up further.
The dis module would be an interesting option.
I would, most likely, settle on comparing their ASTs. But an AST is likely to give false positives for short functions, because their structure may be too similar; so consider checking short functions with something else, something trivial.
On top of that, I would consider using Levenshtein distance or something similar to numerically calculate the differences between the bytecode/sources/AST/dis output of the students. This would be what, almost O(N^2)? It shouldn't matter.
Or, if needed, make it more elaborate and calculate the distance between each function of student A and each function of student B, highlighting cases where the distance is too short. It may not be needed, though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables but also the logic, and maybe even improve the algorithm, then this student understands the code well enough to defend it and use it without help in the future. I guess that's the kind of help a teacher would want to see exchanged between students.
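As a rough illustration of the AST idea, here is a sketch that blanks out identifiers before comparing dumps (the file names are hypothetical and the list of node types is far from complete):

import ast
import difflib

def normalized_ast(source):
    """Dump the AST with identifiers blanked out, so renamings compare equal."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            node.name = "_"
    return ast.dump(tree)

a = normalized_ast(open("student_a.py").read())
b = normalized_ast(open("student_b.py").read())
print(difflib.SequenceMatcher(None, a, b).ratio())  # 1.0 means structurally identical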
You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
import operator

class Repeat:
    def __init__(self, l):
        self.l = list(l)
        self.memo = {}

    def __call__(self, m, n):
        l = self.l
        memo = self.memo
        k = m, n
        ret = memo.get(k)
        if not ret:
            if not m or not n:
                ret = 0, None
            elif m != n and l[m-1] == l[n-1]:  # critical change here!
                z, tail = self(m-1, n-1)
                ret = z + 1, ((m-1, n-1), tail)
            else:
                ret = max(self(m-1, n), self(m, n-1), key=operator.itemgetter(0))
            memo[k] = ret
        return ret

    def go(self):
        n = len(self.l)
        v = self(n, n)[1]
        ret = []
        while v:
            x, v = v
            ret.append(x)
        ret.reverse()
        return ret

def repeat(l):
    return Repeat(l).go()
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, by symmetry you only need to treat cases with, say, m >= n.
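For example, a quick sanity check on a toy list of already canonicalized lines:

lines = ["a=1", "b=2", "a=1", "b=2", "c=3"]
print(repeat(lines))  # e.g. [(0, 2), (1, 3)]: lines 0-1 repeat as lines 2-3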
Of course, there are also "real" answers and real research on this issue!
Frame challenge: I’m not sure you should do this
It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I’m not sure I would. There’s no good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or somewhere in between”, isn’t something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now, I’d argue this is a job for humans, because it’s easy for us and hard for computers, at least for now.
That said if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like, and then define acceptable false positive and false negative rates and test your code on a wide variety of representative inputs, validating your code against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for and how you’ll measure success and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson. For example if your tool simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, or if the tool doesn’t flag bad code so the student assumes it’s ok. Or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Edit:
Or, if you definitely want to try this anyway, you can make it a teaching tool, though not necessarily as you may have intended: by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment, being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it the way some professors use plagiarism detection tools: as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good for the students.
I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every group of 3 consecutive lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join adjacent duplicated line numbers together to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). Joining the results produces the sequence (100, 101, 200, 201); finding the runs of consecutive numbers in it yields ((100, 101), (200, 201)).
As no nested loops are used, the time complexity is linear: hashing and dictionary insertion are both O(1) per line, so the whole pass is O(n).
Algorithm:
read text line by line
transform it by
removing blanks
removing empty lines, saving a mapping to the original line numbers for later
for each 3 transformed lines, join them and calculate a hash of the result
filter lines whose hashes occur more than once (these are repetitions of at least 3 lines)
find longest sequences and present repetitive text
Code:
from itertools import groupby, cycle
import re

def sequences(l):
    # group consecutive numbers together, e.g. [100, 101, 200, 201] -> (100, 101), (200, 201)
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed) - 2):
    h = hash(''.join(processed[ix:ix+3]))  # join 3 lines
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three-liners
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v)
seqs = sorted(set(seqs))

# find longer sequences
result = {}
for seq in sequences(seqs):
    # .get guards against a duplicate block that ends on the last processed line
    end = text_map.get(seq[-1] + 3, len(text))
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], end))

print('Duplicates found:')
for v in result.values():
    print('-' * 20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1 + line_numbers[0]} : {line_numbers[1]}')
I'm building a web app to match high school students considering a gap year to students who have taken a gap year, based on interests as denoted by tags. A prototype is up at covidgapyears.com. I have never written a matching/recommendation algorithm, and though people have suggested things like collaborative filtering, association rule mining, or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset (a few hundred users right now, a few thousand soon). So I wrote my own algorithm using common sense.
It essentially takes in a list of tags that the student is interested in, then searches for an exact match of those tags with someone who has taken a gap year and registered with the site (and who also selected tags on registration). An exactMatch, as given below, is when the tags the user specifies are ALL contained by some profile (i.e., are a subset). If it can't find an exact match with ALL of the user's inputted tags, it checks all (n-1)-length subsets of the tags list to see if any less selective queries have matches. It does this recursively until at least 3 matches are found. While it works fine for small tag selections (up to 5-7), it gets slow for larger tag selections (7-13), taking several seconds to return a result. When 11-13 tags are selected, it hits a Heroku error due to worker timeout.
I did some tests by putting variables inside the algorithm to count computations, and it seems that when it goes a bit deep into the recursive stack, it checks a few hundred subsets each time (to see if there's an exactMatch for that subset and, if there is, add it to the results list to output), and the total number of computations doubles as you add one more tag (it went 54, 150, 270, 500, 1000, 1900, 3400 operations for more and more tags). It is true that there are a few hundred subsets at each depth. But exactMatches is O(1) as I've written it (no iteration), and aside from the other O(1) operations like IF, the FOR inside the subset loop will, at most, run around 10 times. This agrees with the measured result of a few thousand computations each time.
This did not surprise me, as selecting and iterating over all subsets seems like something that could get non-linearly harder; but my question is about why it's so slow despite only doing a few thousand computations. I know my computer operates at GHz speeds, and I expect web servers are similar, so surely a few thousand computations should be near-instantaneous? What am I missing, and how can I improve this algorithm? Any other approaches I should look into?
import itertools

# takes in a list of length n and returns a list of all combos of subsets of depth n
def arbSubsets(seq, n):
    return list(itertools.combinations(seq, len(seq) - n))

# takes in a tagsList and checks Gapper.objects.all to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches

# takes in tagsList that has been cleaned to remove any tags that NO gappers have
# and then checks gapper objects to find the optimal match
def matchGapper(tagsList, depth, results):
    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []
    # counter is a global variable used to measure complexity for debugging
    counter += 1
    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3
    # now we must check subsets for a match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # need to check because we might be adding depth-2 results to depth-1 results,
                # which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results: break
                    results.append(match)
    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into tags if not enough gappers match
        counter += 1
        return matchGapper(tagsList, depth + 1, results)

# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])
It doesn't seem you are doing a few hundred computation steps. In fact, you have a few hundred options at each depth, so you should not add but multiply the number of steps at each depth to estimate the complexity of your solution.
Additionally, the statement "...or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is also obviously not true. Although these algorithms may be overkill for some very simple cases, they are still valid and will work for them.
Okay, so after much fiddling with timers, I've figured it out. There are a few functions at play when matching: exactMatches, matchGapper, and arbSubsets. When I put the counter into a global variable and measured operations (measured as lines of my code being executed), it came in at around 2-10K for large inputs (around 10 tags).
It is true that arbSubsets, which returns a list of subsets, at first seems like a plausible bottleneck. But if you look closely, we are 1) handling small numbers of tags (on the order of 10-50) and, more importantly, 2) only calling arbSubsets when we recurse into matchGapper, which happens a maximum of about 10 times, since tagsList can only be around 10 long (on the order of 10-50, as above). And when I checked the time it took to generate arbSubsets, it was on the order of 2e-5 seconds. So the total time spent generating the subsets of arbitrary size is only about 2e-4 seconds. In other words, not the source of the 5-30 second waiting time in the web app.
And so with that aside, knowing that arbSubset is only called on the order of 10 times, and is fast at that, and knowing that there are only around a max of 10K computations taking place in my code it starts to become clear that I must be using some out-of-the-box function, I don't know--like set() or .issubset() or something like that--that takes a nontrivial amount of time to compute, and is executed many times. Adding some counters in some more places, it becomes clear that exactMatch() accounts for around 95-99% of all computations that take place (as would be expected if we have to check all combinations of subsets of various sizes for exactMatches).
So the problem, at this point, is reduced to the fact that exactMatch takes around 0.02s (empirically) as implemented, and is called several thousand times. And so we can either try to make it faster by a couple of order of magnitudes (it's already pretty optimal), or take another approach that doesn't involve finding matches using subsets. A friend of mine suggested creating a dict with all the combinations of tags (so 2^len(tagsList) keys) and setting them equal to lists of registered profiles with that exact combination. This way, querying is just traversing a (huge) dict, which can be done fast. Any other suggestions are welcome.
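For what it's worth, a minimal sketch of the simplest fix implied by the timings above: query the database once, keep the tag sets in memory, and make exactMatches a pure set operation (the names are the ones from the question):

# one database query, done once instead of once per subset check
profiles = [(gapper, frozenset(gapper.tags.names()))
            for gapper in Gapper.objects.all()]

def exactMatches(tagsList):
    tagsSet = frozenset(tagsList)
    return [gapper for gapper, tag_set in profiles if tagsSet <= tag_set]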
Right now I am working on a program to calculate the area of a room in order to purchase cans of paint. I am just three weeks into my class and I'm a little overwhelmed. I'm having trouble figuring out how I am supposed to attach each wall/ceiling/window/door to a separate name, such as 'WALL1', 'WALL2', etc., and then be able to call those in a calculation. As far as I have gotten, I can't seem to figure out how to write this variable. I am by no means asking for the code to the whole program, so let's take a look at just walls as an example: "John wants to calculate how much paint he needs for a whole house and has 57 walls of various sizes." How do I allow an unlimited number of walls to be used while attaching each wall to its length and height? Or should I limit the number of walls? Once I establish how many walls there are, how do I attach each wall to its own name? Each 'name' will then be called in the final calculation. Here is what I have so far:
# Area calculation for paint program
print "Area Calculation For Paint"
Project_Name = input('Enter your Project Name:')
print "WALL1."
print "WALL2."
print "WALL3."
print "WALL4."
print "WALL5"
print "WALL6"
print "WALL7"
print "WALL8"....
# Get the user’s choice:
shape = input("Please select a Wall and input the length and height: ")
# Calculate the area for each room
if WALL1 == yes:
height = input("Please enter the height: ")
length = input("Please enter the length: ")
area1 = height*length
WALL1 = area1
# Calculate the total square footage
TOTALSQFT = WALL1 + WALL2 + WALL3 + WALL4 + WALL5 + CEILING1 - WINDOW1 + WINDOW2 + WINDOW3 + DOOR1 + DOOR2... etc
print "Project_Name total square footage is TOTALSQFT"
I have provided my flowchart here as a reference, so hopefully it clarifies what I am trying to explain.
You can use a list or a tuple to store your walls, ceilings, etc.; then it's a matter of running a for loop to do the calculation. You may also want to use a dictionary if you want to call the items by name.
You can create WALL1, WALL2, etc. keys using simple string concatenation and put them in the dictionary, rather than creating variables for each element.
If you clarify how you are going to accept the user input for all 57 walls, etc., we can answer more accurately.
After our discussion in the comments, it appears that your real problem comes from not really understanding the relative difficulty of things yet. In large part that's because you don't really understand programming yet; you've just been making flowcharts in your class. There's a fair difference between flowcharts and programming, since with a flowchart you can just put "something magic happens".
My first recommendation is to check out the Python style guide, called PEP 8.
Most Python developers stick to it, and it will make your life easier when trying to communicate with us.
Next, you want to adjust your expectations. Trying to parse out a bunch of values from something like:
Wall 1 3x4 Wall 2 5x9 Wall 3 9x9 Door 1 2x6.5 Door 2 2x6.5
You can do it, but as a beginning developer it's a bit overwhelming. If you know regular expressions it's pretty trivial, but you don't, and they're not a beginning topic. Just remember the popular saying:
Developers see a problem and say, "Ah, I know, I'll use regular expressions!" Now they have two problems.
Most of the time they're the wrong thing, but occasionally they're the right thing. But as a beginner, they're not the right thing.
Instead, you should aim for something like this:
get the project name
ask the user for wall sizes. When they input an empty/blank string, that's the last wall size
ask the user for ceiling sizes (though you could include this in the wall sizes, no need to have them different). When they input an empty/blank string there are no more ceilings.
ask the user for the door sizes. Same thing about empty strings.
ask the user for window sizes. The same thing applies for ceiling vs walls.
combine the wall/ceiling sizes and subtract (door sizes + window sizes)
You can store the sizes in lists, e.g. walls = [[3, 4], [5, 9], [9, 9]]. Dealing with lists is something that you can learn in the Python tutorial, or many other tutorials on the Internet.
You can iterate (loop) over your lists and write that information to a file, if that's something that you want to do. Tutorials will also cover that.
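A minimal sketch of that approach, assuming a "length x height" input format of my own invention:

def read_sizes(prompt):
    """Collect (length, height) pairs until the user enters a blank line."""
    sizes = []
    while True:
        entry = input(prompt + " (blank to finish): ")
        if not entry.strip():
            return sizes
        length, height = entry.lower().split("x")
        sizes.append((float(length), float(height)))

project_name = input("Enter your project name: ")
walls = read_sizes("Wall size, e.g. 3x4")  # ceilings can go in here too
doors = read_sizes("Door size, e.g. 2x6.5")
windows = read_sizes("Window size, e.g. 2x3")

total_sqft = (sum(l * h for l, h in walls)
              - sum(l * h for l, h in doors)
              - sum(l * h for l, h in windows))
print(f"{project_name} total square footage is {total_sqft}")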
If you take the above approach, you'll find that your project is much easier to complete. Good luck!
Have you considered using pandas DataFrames to store each instance of Wall, Window, and Ceiling? Then you multiply your Width column by your Length column and store the result in a Surface column.
Then you can simply use the groupby function to get your totals and add up the results, or simply sum the Surface column.
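A small sketch of how that could look (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({
    "Type":   ["Wall", "Wall", "Ceiling", "Door", "Window"],
    "Width":  [3.0, 5.0, 10.0, 2.0, 2.0],
    "Length": [4.0, 9.0, 12.0, 6.5, 3.0],
})
df["Surface"] = df["Width"] * df["Length"]
totals = df.groupby("Type")["Surface"].sum()
paintable = (totals.get("Wall", 0) + totals.get("Ceiling", 0)
             - totals.get("Door", 0) - totals.get("Window", 0))
print(paintable)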
I have a group of people, and for each of them a list of friends and a list of foes. I want to line them up (a line, not a circle as at a table) so that preferably no foes, but only friends, are next to each other.
Example with the Input: https://gist.github.com/solars/53a132e34688cc5f396c
I think I need to use graph coloring to solve this, but I'm not sure how - I think I have to leave out the friends (or foes) list to make it easier and map to a graph.
Does anyone know how to solve such problems and can tell me if I'm on the right path?
Code samples or online examples would also be nice, I don't mind the programming language, I usually use Ruby, Java, Python, Javascript
Thanks a lot for your help!
It was already mentioned in the comments that this problem is equivalent to the travelling salesman problem. I would like to elaborate on that:
Every person corresponds to a vertex, and there is an edge between vertices representing people who can sit next to each other. Now, finding a possible seating arrangement is equivalent to finding a Hamiltonian path in the graph.
So this problem is NP-complete. The most naive solution would be to try all possible permutations, resulting in O(n!) running time. There are a lot of well-known approaches that perform better than O(n!) and are freely available on the web. I would like to mention Held-Karp, which runs in O(n^2 * 2^n) and is pretty straightforward to code, here in Python:
# graph[i] contains all possible neighbors of the i-th person
def held_karp(graph):
    n = len(graph)  # number of persons
    # remember the set of already seated persons (as a bitmask) and the last
    # person in the line; thus a configuration consists of the set of seated
    # persons and the last person in the line
    # start with every possible person:
    possible = set((2**i, i) for i in range(n))
    # remember the predecessor configuration for every possible configuration:
    preds = dict(((2**i, i), (0, -1)) for i in range(n))
    # there are at most n persons in the line - every iteration adds a person
    for _ in range(n - 1):
        next_possible = set()
        # iterate through all possible configurations
        for seated, last in possible:
            for neighbor in graph[last]:
                bit_mask = 2**neighbor
                if (bit_mask & seated) == 0:  # this possible neighbor is not yet seated!
                    next_config = (seated | bit_mask, neighbor)  # add neighbor to the bitmask of seated
                    next_possible.add(next_config)
                    preds[next_config] = (seated, last)
        possible = next_possible
    # now reconstruct the line
    if not possible:
        return []  # it is not possible for all to be seated
    line = []
    config = possible.pop()  # any configuration in possible has n persons seated and is good enough!
    while config[1] != -1:
        line.insert(0, config[1])
        config = preds[config]  # go a step back
    return line
Disclaimer: this code is not properly tested, but I hope you can get the gist of it.
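A quick usage sketch with a made-up adjacency list:

# persons 0..3: 0 tolerates only 1, 3 tolerates only 2, and so on
graph = [[1], [0, 2], [1, 3], [2]]
print(held_karp(graph))  # e.g. [0, 1, 2, 3] (or its reverse)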
I have written some Python code for the CRP problem. The problem itself can be found here:
http://cog.brown.edu/~mj/classes/cg168/slides/ChineseRestaurants.pdf
And to give a short description of it:
Suppose we want to assign people entering a restaurant to a potentially infinite number of tables. If $z_i$ denotes the random variable for the table of the $i$'th person entering the restaurant, the following should hold:
With probability $p(z_i=a \mid z_1,\dots,z_{i-1})=\frac{n_a}{i-1+\alpha}$ (for $n_a>0$, where $n_a$ is the number of people already seated at table $a$), the $i$'th person will sit at table $a$, and with probability $p(z_i=a \mid z_1,\dots,z_{i-1})=\frac{\alpha}{i-1+\alpha}$ the $i$'th person will sit at a new table.
I am not quite sure whether my code is correct, because I am surprised how small the final number of tables is.
I would be happy if somebody could say whether the implementation is correct and, if so, whether there are any possible improvements.
import numpy as np

def CRP(alpha, N):
    """Chinese Restaurant Process with alpha as concentration parameter and N
    the number of samples."""
    # summed[i] holds the cumulative number of people sitting at tables 0..i
    summed = np.ones(1)  # first person assigned to the first table
    for i in range(1, N):
        # assign person i to a table:
        # randind is a random number from the interval [0, i+alpha)
        randind = (float(i) + alpha) * np.random.uniform(low=0.0, high=1.0)
        # update is the index of the table where the person should be placed;
        # if randind falls beyond the people seated so far, open a new table
        update = np.searchsorted(summed, randind, side='left')
        if randind > i:
            summed = np.append(summed, i + 1)
        else:
            zerovec = np.zeros(update)
            onevec = np.ones(summed.size - update)
            summed += np.append(zerovec, onevec)
    # convert the cumulative array to a tables array that holds the number
    # of persons assigned to each table
    tables = np.zeros(summed.size)
    tables[0] = summed[0]
    for i in range(1, summed.size):
        tables[i] = summed[i] - summed[i - 1]
    return tables

a = CRP(0.9999, 1000)
print(a)
Suggestion. Forget about the code you have written. Construct declarative tests of the code. By taking that approach, you start with examples for which you know the correct answer. That would have answered Brainiac's question, for example.
Then write your program. You will likely find that if you start approaching problems this way, you may create sub-problems first, for which you can also write tests. Until they all pass, there is no need to rush on to the full problem.
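For instance, one such declarative test could lean on the known fact that a CRP with concentration parameter alpha occupies roughly alpha * ln(N) tables on average for large N, which also addresses the surprise above about how few tables appear (a sketch; the tolerance is a loose, arbitrary choice):

import numpy as np

def test_crp_table_count():
    alpha, N, runs = 0.9999, 1000, 200
    counts = [len(CRP(alpha, N)) for _ in range(runs)]
    # expected number of tables is sum(alpha / (alpha + i)) ~ alpha * ln(N)
    expected = alpha * np.log(N)  # about 6.9 for these parameters
    assert abs(np.mean(counts) - expected) < 2.0

test_crp_table_count()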