How to start creating an automatic cyclical matching system in Python

I am trying to study and build an automatic cyclical matching algorithm. The system's purpose is to match the destinations of employees who want to move from their current division to another one (for reasons such as moving back to their hometown or looking after their family). Unfortunately, company protocol only allows an employee to move when they can find someone from another division who wants to take their position. In practice, employees have had to find partners by posting their destination in a Facebook group and arranging a swap among two or more people, building the cyclical rotation themselves like this:
Mr. A works in division X and wants to move to division Y.
Mr. B works in division Y and wants to move to division Z.
Mr. C works in division Z and wants to move to division X.
In this case, one of them (say Mr. A) has to contact Mr. B and Mr. C to arrange a cyclical move so that all three achieve their goals. More than 2,000 employees in my company face this problem (the company has around 30k employees).
Could anyone suggest how I can study and start building an AI system, or some other algorithm, that helps my friends complete these moves easily?
P.S. I have experience with Python and am familiar with it.
Thanks in advance.

In real life I assume you will be getting your data through a DB query since you talk about thousands of records. But for a small sample let's just put our requests in a list. The idea is, start with one request and see if we have some other people whose target location is the source location of our request, then explore each of those to see if we have some more, until we find a cycle.
Since you want the longest cycle, and probably need other criteria as well, we need to compute all the possibilities. Then another function will select the best choice.
This will find all possible cycles for a given request:
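from dataclasses import dataclass

@dataclass
class Request():
    code : int
    who : str
    source : str
    target : str
    days : int

requests = [Request(1,'A','X','Y',3),
            # remaining sample requests are missing from the original post
           ]
cycles = []

def find_cycles(basereq):
    global cycles
    cycles = []
    # the cycle closes when someone leaves the division basereq wants to join
    find_cycles2(basereq, basereq.target, [basereq.code])
    return cycles

def find_cycles2(basereq, pivot, cycle):
    global cycles
    # candidates: unused requests whose target is the division basereq vacates
    for otherreq in [r for r in requests if r.target == basereq.source and r.code not in cycle]:
        if otherreq.source == pivot:
            cycles.append(cycle + [otherreq.code])
        else:
            find_cycles2(otherreq, pivot, cycle + [otherreq.code])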
>>> find_cycles(requests[0])
[[1, 3, 2], [1, 4]]
>>> find_cycles(requests[1])
[[2, 1, 3]]
>>> find_cycles(requests[4])
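The search described above can also be written without a global list, which makes it easier to test. This is only a sketch: the `SAMPLE` requests are hypothetical stand-ins (the full sample list is not shown above), the `days` field is dropped, and picking the longest cycle with `max` is just one possible selection rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    code: int
    who: str
    source: str
    target: str


def find_cycles(base, requests):
    """Return every cycle (a list of request codes) that starts at `base`."""
    cycles = []

    def extend(current, chain):
        for other in requests:
            # `other` must want the division that `current` vacates
            if other.target != current.source or other.code in chain:
                continue
            if other.source == base.target:
                # `other` vacates the division `base` wants: the cycle closes
                cycles.append(chain + [other.code])
            else:
                extend(other, chain + [other.code])

    extend(base, [base.code])
    return cycles


# Hypothetical sample data, chosen only for illustration.
SAMPLE = [
    Request(1, 'A', 'X', 'Y'),
    Request(2, 'B', 'Y', 'Z'),
    Request(3, 'C', 'Z', 'X'),
    Request(4, 'D', 'Y', 'X'),
    Request(5, 'E', 'W', 'V'),
]

cycles = find_cycles(SAMPLE[0], SAMPLE)
longest = max(cycles, key=len, default=None)
print(cycles, longest)
```

With thousands of requests the number of cycles can grow quickly, so in practice you would probably cap the cycle length or stop at the first acceptable cycle.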


Plotting OpenStreetMap relations does not generate continuous lines

I have been working on an index of all MTB trails worldwide. I'm a Python person so for all steps involved I try to use Python modules.
I was able to grab relations from the OSM overpass API like this:
from OSMPythonTools.overpass import Overpass
overpass = Overpass()
def fetch_relation_coords(relation):
    rel = overpass.query('rel(%s); (._;>;); out;' % relation)
    return rel
rel = fetch_relation_coords("6750628")
I'm choosing this particular relation (6750628) because it is one of several that is resulting in discontinuous (or otherwise erroneous) plots.
I process the "rel" object to get a pandas.DataFrame like this:
import pandas as pd
elements = pd.DataFrame(rel.toJSON()['elements'])
"elements" looks like this:
The Elements pandas.DataFrame contains rows of the types "relation" (1 in this case), several of the type "way" and many of the type "node". It was my understanding that I would use the "relation" row, "members" column to extract the order of the ways (which point to the nodes), and use that order to make a list of the latitudes and longitudes of the nodes (for later use in leaflet), in the correct order, that is, the order that leads to continuous path on a map.
However, that is not the case. For this particular relation, I end up with the following plot:
If we compare that with the way the relation is displayed on itself, we see that it goes wrong (focus on the middle, eastern part of the trail). I have many examples of this happening, although there are also a lot of relations that do display correctly.
So I was wondering, what am I missing? Are there nodes with tags that need to be ignored? I already tried several things, including leaving out nodes with any tags, this does not help. Somewhere my processing is wrong but I don't understand where.
You need to sort the ways inside the relation yourself. Only a few relation types require sorted members, for example some route relations such as route=bus and route=tram. Others may have sorted members, such as route=hiking, route=bicycle etc., but they don't require them. Various other relations, such as boundary relations (type=boundary), usually don't have sorted members.
I'm pretty sure there are already various tools for sorting relation members; obviously this includes the website where this relation is shown correctly. Unfortunately I can't point you to these tools, but I guess a little research will reveal some.
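As a rough illustration of what sorting the ways yourself can look like, here is a minimal sketch that chains ways by their shared end nodes, reversing a way when it points the wrong direction. It assumes each way is given as a list of node ids (as in the "nodes" field of a way element) and that the ways form one unbranched line; `chain_ways` is a hypothetical helper name, not part of any OSM library.

```python
def chain_ways(ways):
    """Join ways (lists of node ids) into one continuous list of node ids."""
    remaining = [list(way) for way in ways]
    chain = remaining.pop(0)
    while remaining:
        for i, way in enumerate(remaining):
            if way[0] == chain[-1]:        # fits at the end, same direction
                chain.extend(way[1:])
            elif way[-1] == chain[-1]:     # fits at the end, reversed
                chain.extend(reversed(way[:-1]))
            elif way[-1] == chain[0]:      # fits at the front, same direction
                chain[:0] = way[:-1]
            elif way[0] == chain[0]:       # fits at the front, reversed
                chain[:0] = list(reversed(way[1:]))
            else:
                continue
            remaining.pop(i)
            break
        else:
            # no remaining way touches either end of the chain
            raise ValueError("ways do not form a single continuous line")
    return chain


print(chain_ways([[1, 2, 3], [5, 4, 3], [5, 6]]))
```

Relations with branches, loops or gaps need more care than this greedy approach provides; it raises `ValueError` when it cannot attach a remaining way.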
If I opt to just plot the different ways on top of each other, I indeed get a continuous plot (index contains the indexes for all nodes per way):
In the database I would have preferred to have the nodes sorted anyway, because I could use them to make a GPX file on the fly. But I guess I answered my own question with this approach; thank you @scai for pointing me in this direction.
You could have a look at shapely.ops.linemerge, which seems to be smart enough to chain multiple linestrings even if the directions are inconsistent. For example (adapted from here):
from shapely import geometry, ops
line_a = geometry.LineString([[0,0], [1,1]])
line_b = geometry.LineString([[1,0], [2,5], [1,1]]) # <- switch direction
line_c = geometry.LineString([[1,0], [2,0]])
multi_line = geometry.MultiLineString([line_a, line_b, line_c])
merged_line = ops.linemerge(multi_line)
# output:
LINESTRING (0 0, 1 1, 2 5, 1 0, 2 0)
Then you just need to make sure that the endpoints match exactly.

Programming the nearest neighbor algorithm in Python 3.6

I am very new to programming and might have bitten off more than I can chew. I am trying to create a program that finds the shortest route to visit all of the National Parks by importing a csv file containing the park names and the distances between each pair of parks. Ideally, I would like it to prompt the user for which park they would like to start with and then run through the other parks to find the shortest distance (e.g. if you wanted to start with Yellowstone, it would find the closest park to Yellowstone, then the closest park to that park, etc., then add up all those distances, returning the total mileage and the order the parks were visited in). I think I need to import the csv file as a dictionary so I can use the park names as keys, but I'm not sure how to work the keys into the algorithm. So far I have the following put together from my limited knowledge:
import csv
import numpy as np

distances = csv.DictReader(open("ds.csv"))
for row in distances:
    ...  # loop body is missing from the original post

startingPark = input('Which park would you like to test?')

def NN(distanceArray, start):
    path = [start]
    cost = 0
    N = distanceArray.shape[0]  # number of locations
    mask = np.ones(N, dtype=bool)  # True for locations not yet visited
    mask[start] = False
    for i in range(N-1):
        last = path[-1]
        next_ind = np.argmin(distanceArray[last][mask]) # find minimum of remaining locations
        next_loc = np.arange(N)[mask][next_ind] # convert to original location
        path.append(next_loc)  # record the visit so the next step starts from here
        mask[next_loc] = False
        cost += distanceArray[last, next_loc]
    return path, cost

print (NN(distanceArray,0))
I know that I have to change all of the array stuff in the actual algorithm part of the code (that's just some code I was able to find through research on here that I am using as a starting point), but I am unsure of A: how to get it to actually use the input I give and B: how to make the algorithm work with the dictionary instead of with arrays that are input as part of the code itself. I've tried using the documentation and such, but it goes a bit over my head. Obviously I don't want anyone to just do it for me, but I'd appreciate any pointers or resources that people may have. I'm trying very hard to learn, but with no real guidance from anyone that knows what they're doing, I'm having a hard time.
Edit: Here is a sample of the data I have to work with. I pulled all of the distances from Google Maps and input them into a csv file. I think the x's I have in place might also be a problem and might need to be replaced with 0's or something similar, but I haven't gotten to handling that issue yet. (Sorry for not just uploading the picture, not enough rep to post one yet)
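One possible way to work park names into the algorithm is a dictionary of dictionaries, so that `distances[a][b]` is the mileage from park `a` to park `b`. The sketch below assumes a CSV layout with a `park` column followed by one column per park and `x` on the diagonal; the layout, the column name and all distances are made up for illustration. Note that nearest neighbor is a greedy heuristic, so the route it returns is not guaranteed to be the shortest one overall.

```python
import csv
import io

# Hypothetical sample in the assumed layout; the distances are invented.
SAMPLE_CSV = """park,Yellowstone,Grand Teton,Glacier,Zion
Yellowstone,x,30,400,600
Grand Teton,30,x,450,550
Glacier,400,450,x,900
Zion,600,550,900,x
"""


def load_distances(csv_file):
    """Build {park: {other_park: distance}} and skip the 'x' placeholders."""
    distances = {}
    for row in csv.DictReader(csv_file):
        start = row.pop("park")
        distances[start] = {dest: float(d) for dest, d in row.items() if d != "x"}
    return distances


def nearest_neighbor(distances, start):
    """Repeatedly move to the closest park that has not been visited yet."""
    route = [start]
    total = 0.0
    unvisited = set(distances) - {start}
    while unvisited:
        current = route[-1]
        nearest = min(unvisited, key=lambda park: distances[current][park])
        total += distances[current][nearest]
        route.append(nearest)
        unvisited.remove(nearest)
    return route, total


distances = load_distances(io.StringIO(SAMPLE_CSV))
route, total = nearest_neighbor(distances, "Yellowstone")
print(route, total)
```

To use a real file, pass `open("ds.csv", newline="")` instead of the `io.StringIO` object, and pass the result of `input(...)` as `start`.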

Minimum removed nodes required to cut path from A to B algorithm in Python

I am trying to solve a problem related to graph theory but can't seem to remember/find/understand the proper/best approach so I figured I'd ask the experts...
I have a list of paths from two nodes (1 and 10 in example code). I'm trying to find the minimum number of nodes to remove to cut all paths. I'm also only able to remove certain nodes.
I currently have it implemented (below) as a brute force search. This works fine on my test set but is going to be an issue when scaling up to graphs that have paths in the 100Ks and available nodes in the 100s (factorial issue). Right now, I'm not caring about the order I remove nodes in, but I will at some point want to take that into account (switch sets to lists in the code below).
I believe there should be a way to solve this using a max flow/min cut algorithm. Everything I'm reading though is going way over my head in some way. It's been several (SEVERAL) years since doing this type of stuff and I can't seem to remember anything.
So my questions are:
1) Is there a better way to solve this problem other than testing all combinations and taking the smallest set?
2) If so, can you either explain it or, preferably, give pseudo code to help explain? I'm guessing there is probably a library that already does this in some way (I have been looking and using networkX lately but am open to others)
3) If not (or even if so), suggestions for how to multithread/multiprocess the solution? I want to try to get every bit of performance I can from the computer. (I have found a few good threads on this question; I just haven't had a chance to implement them, so I figured I'd ask at the same time just in case. I first want to get everything working properly before optimizing.)
4) General suggestions on making code more "Pythonic" (probably will help with performance too). I know there are improvements I can make and am still new to Python.
Thanks for the help.
#!/usr/bin/env python
def bruteForcePaths(paths, availableNodes, setsTested, testCombination, results, loopId):
    #for each node available, we are going to
    # check if we have already tested set with node
    #   if true- move to next node
    #   if false- remove the paths affected,
    #     if there are paths left,
    #       record combo, continue removing with current combo,
    #     if there are no paths left,
    #       record success, record combo, continue to next node
    #local copy
    currentPaths = list(paths)
    currentAvailableNodes = list(availableNodes)
    currentSetsTested = set(setsTested)
    currentTestCombination= set(testCombination)
    currentLoopId = loopId+1
    print("loop ID: %d" % (currentLoopId))
    print("currentAvailableNodes:")
    for set1 in currentAvailableNodes:
        print(" %s" % (set1))
    for node in currentAvailableNodes:
        #add to the current test set
        print("%d-current node: %s current combo: %s" % (currentLoopId, node, currentTestCombination))
        # print "Testing: %s" % currentTestCombination
        # print "Sets tested:"
        # for set1 in currentSetsTested:
        #     print " %s" % set1
        if currentTestCombination in currentSetsTested:
            #we already tested this combination of nodes so go to next node
            print("Already test: %s" % currentTestCombination)
        #get all the paths that don't have node in it
        currentRemainingPaths = [path for path in currentPaths if not (node in path)]
        #if there are no paths left
        if len(currentRemainingPaths) == 0:
            #save this combination
            print("successful combination: %s" % currentTestCombination)
        #add to remember we tested combo
        #now remove the node that was add, and go to the next one
        #this combo didn't work, save it so we don't test it again
        newAvailableNodes = list(currentAvailableNodes)
        print("-------------------")
    #need to pass "up" the tested sets from this loop
    return None

if __name__ == '__main__':
    testPaths = [
        # path list is missing from the original post
    ]
    setsTested = set()
    availableNodes = [2, 3, 6, 7, 9]
    results = list()
    currentTestCombination = set()
    bruteForcePaths(testPaths, availableNodes, setsTested, currentTestCombination, results, 0)
    print("results:")
    for result in sorted(results, key=len):
        print(result)
I reworked the code using itertools for generating the combinations. It makes the code cleaner and faster (and should be easier to multiprocess). Now to try to figure out the dominated nodes as suggested and multiprocess the function.
from itertools import combinations

def bruteForcePaths3(paths, availableNodes, results):
    #try every combination of 1 node, then 2, then 3, etc
    for i in range(1,len(availableNodes)+1):
        print("combo number: %d" % i)
        currentCombos = combinations(availableNodes, i)
        for combo in currentCombos:
            #get a fresh copy of paths for this combination
            currentPaths = list(paths)
            currentRemainingPaths = []
            # print combo
            for node in combo:
                #determine better way to remove nodes, for now- if it's in, we remove
                currentRemainingPaths = [path for path in currentPaths if not (node in path)]
                currentPaths = currentRemainingPaths
            #if there are no paths left
            if len(currentRemainingPaths) == 0:
                #save this combination
                print(combo)
                results.append(combo)
    return None
Here is an answer which ignores the list of paths. It just takes a network, a source node, and a target node, and finds the minimum set of nodes within the network, not either source or target, so that removing these nodes disconnects the source from the target.
If I wanted to find the minimum set of edges, I could find out how just by searching for max-flow min-cut. Note that the Wikipedia article states that there is a generalized max-flow min-cut theorem which considers vertex capacity as well as edge capacity, which is at least encouraging. Note also that edge capacities are given as Cuv, where Cuv is the maximum capacity from u to v. In the diagram they seem to be drawn as u/v. So the edge capacity in the forward direction can be different from the edge capacity in the backward direction.
To disguise a minimum vertex cut problem as a minimum edge cut problem I propose to make use of this asymmetry. First of all give all the existing edges a huge capacity - for example 100 times the number of nodes in the graph. Now replace every vertex X with two vertices Xi and Xo, which I will call the incoming and outgoing vertices. For every edge between X and Y create an edge between Xo and Yi with the existing capacity going forwards but 0 capacity going backwards - these are one-way edges. Now create an edge between Xi and Xo for each X with capacity 1 going forwards and capacity 0 going backwards.
Now run max-flow min-cut on the resulting graph. Because all the original links have huge capacity, the min cut must all be made up of the capacity 1 links (actually the min cut is defined as a division of the set of nodes into two: what you really want is the set of pairs of nodes Xi, Xo with Xi in one half and Xo in the other half, but you can easily get one from the other). If you break these links you disconnect the graph into two parts, as with standard max-flow min-cut, so deleting these nodes will disconnect the source from the target. Because you have the minimum cut, this is the smallest such set of nodes.
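The node-splitting construction described above can be sketched in plain Python with a simple breadth-first augmenting-path max flow. This is an illustration under assumptions rather than a tuned implementation: `min_vertex_cut` and its `removable` parameter are hypothetical names, the graph is assumed to be undirected, and the "huge" capacity is set to the number of nodes plus one, which is already larger than any possible cut.

```python
from collections import defaultdict, deque


def min_vertex_cut(edges, source, target, removable=None):
    """Smallest set of removable nodes whose deletion separates source from target."""
    nodes = {n for edge in edges for n in edge}
    if removable is None:
        removable = nodes
    removable = set(removable) - {source, target}
    big = len(nodes) + 1  # larger than any possible cut, so it acts as "infinite"
    cap = defaultdict(lambda: defaultdict(int))  # residual capacities

    def add_arc(a, b, capacity):
        cap[a][b] += capacity
        cap[b][a] += 0  # create the reverse residual arc with capacity 0

    for n in nodes:
        # Internal arc Xi -> Xo: cutting it corresponds to deleting node X.
        add_arc((n, "in"), (n, "out"), 1 if n in removable else big)
    for u, v in edges:
        # An undirected edge becomes two one-way arcs between the split nodes.
        add_arc((u, "out"), (v, "in"), big)
        add_arc((v, "out"), (u, "in"), big)

    s, t = (source, "out"), (target, "in")

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            a = queue.popleft()
            for b, c in cap[a].items():
                if c > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        return parent

    total = 0
    while True:
        parent = reachable_from_source()
        if t not in parent:
            break
        # Walk back from the target to recover the augmenting path.
        path, b = [], t
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        push = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= push
            cap[b][a] += push
        total += push
        if total >= big:
            raise ValueError("source and target cannot be separated with the removable nodes")

    # A node is in the cut when its "in" half is still reachable but its "out" half is not.
    return {n for n in removable if (n, "in") in parent and (n, "out") not in parent}


print(min_vertex_cut([(1, 2), (1, 3), (2, 4), (3, 4), (4, 10)], 1, 10))
```

Restricting `removable` models the constraint that only certain nodes may be deleted; nodes outside it keep the large internal capacity and therefore never end up in the cut.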
If you can find code for max-flow min-cut, I would expect that it will give you the min-cut. If not, for instance if you do it by solving a linear programming problem because you happen to have a linear programming solver handy, notice that one half of the min cut is the set of nodes reachable from the source when the graph has been modified to subtract out the edge capacities actually used by the solution, so given just the edge capacities used at max flow you can find it pretty easily.
If the paths were not provided as part of the problem I would agree that there should be some way to do this via max flow, given a sufficiently ingenious network construction. However, because you haven't given any indication as to what is a reasonable path and what is not, I am left to worry that a sufficiently malicious opponent might be able to find strange collections of paths which don't arise from any possible network.
In the worst case, this might make your problem as difficult as Set Cover, in the sense that somebody, given a problem in Set Cover, might be able to find a set of paths and nodes that produces a path-cut problem whose solution can be turned into a solution of the original Set Cover problem.
If so - and I haven't even attempted to prove it - your problem is NP-Complete, but since you have only 100 nodes it is possible that some of the many papers you can find on Set Cover will point at an approach that will work in practice, or can provide a good enough approximation for you. Apart from the Wikipedia article, points you at two implementations, and a quick search finds the following summary at the start of a paper in
The SCP is an NP-hard problem in the strong sense (Garey and Johnson, 1979) and many algorithms
have been developed for solving the SCP. The exact algorithms (Fisher and Kedia, 1990; Beasley and
Jørnsten, 1992; Balas and Carrera, 1996) are mostly based on branch-and-bound and branch-and-cut.
Caprara et al. (2000) compared different exact algorithms for the SCP. They show that the best exact
algorithm for the SCP is CPLEX. Since exact methods require substantial computational effort to solve
large-scale SCP instances, heuristic algorithms are often used to find a good or near-optimal solution in a
reasonable time. Greedy algorithms may be the most natural heuristic approach for quickly solving large
combinatorial problems. As for the SCP, the simplest such approach is the greedy algorithm of Chvatal
(1979). Although simple, fast and easy to code, greedy algorithms could rarely generate solutions of good
Edit: If you in fact want to destroy all paths, and not just those from a given list, then max-flow techniques as explained by mcdowella are much better than this approach.
As mentioned by mcdowella, the problem is NP-hard in general. However, the way your example looks, an exact approach might be feasible.
First, you can delete all vertices from the paths that are not available for deletion. Then, reduce the instance by eliminating dominated vertices. For example, every path that contains 15 also contains 2, so it never makes sense to delete 15. In the example if all vertices were available, 2, 3, 9, and 35 dominate all other vertices, so you'd have the problem down to 4 vertices.
Then take a vertex from the shortest path and branch recursively into two cases: delete it (remove all paths containing it) or don't delete it (delete it from all paths). (If the path has length one, omit the second case.) You can then check for dominance again.
This is exponential in the worst case, but might be sufficient for your examples.

How to implement "autoincrement" on Google AppEngine

I have to label something in a "strong monotone increasing" fashion. Be it Invoice Numbers, shipping label numbers or the like.
A number MUST NOT BE used twice
Every number SHOULD BE used when exactly all smaller numbers have been used (no holes).
Fancy way of saying: I need to count 1,2,3,4 ...
The number space I have available is typically 100,000 numbers and I need perhaps 1000 a day.
I know this is a hard problem in distributed systems and often we are much better off with GUIDs. But in this case for legal reasons I need "traditional numbering".
Can this be implemented on Google AppEngine (preferably in Python)?
If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.
If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not affect multiple storage groups.
For example, assuming you have the notion of "users", you can allocate a storage group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, make a new transaction allocating 100 IDs (say) from the global pool, then make a new transaction to add them to the user and simultaneously withdraw one. Assuming each user interacts with the application only sequentially, there will be no concurrency on the user objects.
The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:
>>> from gaetk.sequences import *
>>> init_sequence('invoce_number', start=1, end=0xffffffff)
>>> get_numbers('invoce_number', 2)
[1, 2]
The functionality is basically implemented like this:
from google.appengine.ext import db

def _get_numbers_helper(keys, needed):
    results = []
    for key in keys:
        seq = db.get(key)
        start = seq.current or seq.start
        end = seq.end
        avail = end - start
        consumed = needed
        if avail <= needed:
            # this sequence is used up: deactivate it and take what is left
            seq.active = False
            consumed = avail
        seq.current = start + consumed
        seq.put()  # assumed step: persist the update (not shown in the original listing)
        results += range(start, start + consumed)
        needed -= consumed
        if needed == 0:
            return results
    raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)

def get_numbers(needed):
    query = gaetkSequence.all(keys_only=True).filter('active = ', True)
    return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)
If you aren't too strict on the sequential, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.
Basically, you have one entity that is the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of ids from the master and serve out from their range until they run out of values.
Quick algorithm:
You need to get an ID.
Pick a shard at random.
If the shard's start is less than its end, take its start and increment it.
If the shard's start is equal to (or, uh-oh, greater than) its end, go to the master, take the value and add an amount n to it. Set the shard's start to the retrieved value plus one and its end to the retrieved value plus n.
This can scale quite well, however, the amount you can be out by is the number of shards multiplied by your n value. If you want your records to appear to go up this will probably work, but if you want to have them represent order it won't be accurate. It is also important to note that the latest values may have holes, so if you are using that to scan for some reason you will have to mind the gaps.
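To make the block-reservation idea concrete, here is a small in-memory simulation of the quick algorithm above. It is only a sketch: `MasterCounter` and `Shard` are hypothetical plain Python classes, not Datastore entities, ranges are half-open rather than inclusive, and on App Engine each `reserve` and `next_id` call would have to run in its own transaction.

```python
import random


class MasterCounter:
    """Stands in for the single 'master' entity; hands out blocks of IDs."""

    def __init__(self, first_id=1):
        self.next_free = first_id

    def reserve(self, block_size):
        start = self.next_free
        self.next_free += block_size
        return start, start + block_size  # half-open range [start, end)


class Shard:
    """Serves IDs from its reserved block and refills from the master when empty."""

    def __init__(self, master, block_size):
        self.master = master
        self.block_size = block_size
        self.start = self.end = 0  # empty range forces a reservation on first use

    def next_id(self):
        if self.start >= self.end:
            self.start, self.end = self.master.reserve(self.block_size)
        value = self.start
        self.start += 1
        return value


def allocate(shards, count, rng):
    # Pick a shard at random for every request, as in the steps above.
    return [rng.choice(shards).next_id() for _ in range(count)]


master = MasterCounter()
shards = [Shard(master, block_size=10) for _ in range(3)]
ids = allocate(shards, 100, random.Random(0))
print(ids[:10])
```

Every ID is handed out once, but the order in which IDs appear does not follow request order, and unused parts of reserved blocks show up as holes near the top of the range.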
I needed this for my app (that was why I was searching the question :P ) so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on appengine) and it performed very well. You can find the code on github.
Take a look at how sharded counters are made. It may help you. Also, do you really need them to be numeric? If uniqueness is enough, just use the entity keys.
Alternatively, you could use allocate_ids(), as people have suggested, and then create these entities up front (i.e. with placeholder property values).
first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, id) for id in range(first, last+1)]
Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID such that the placeholder properties have not yet been overwritten with real data.
I haven't put that into practice, but seems like it should work in theory, most likely with the same limitations people have already mentioned.
Remember: Sharding increases the probability that you will get a unique, auto-increment value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-increment.
I implemented something very simplistic for my blog, which increments an IntegerProperty, iden rather than the Key ID.
I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.
def max_iden():
    max_entity = Post.gql("order by iden desc").get()
    if max_entity:
        return max_entity.iden
    return 1000 # If this is the very first entry, start at number 1000
Then, when creating a new blog post, I assign it an iden property of max_iden() + 1
new_iden = max_iden() + 1
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
I wonder if you might also want to add some sort of verification function after this, i.e. to ensure the max_iden() has now incremented, before moving onto the next invoice.
Altogether: fragile, inefficient code.
I'm thinking of using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), then later (using a Cron Task) move the records from CloudSQL back to the Datastore.
The entities also can have a UUID, so we can map the entities from the Datastore in CloudSQL, and also have the sequential ID (for legal reasons).

Random selection ideas

I am thinking of giving one or more sets of introductory lectures to introduce people in my department to Python and related scientific tools, as I did last summer with py4science @ UND.
To make the meetings more interesting and catch more attention, I gave away two Python learning materials to lucky audience members in the following ways:
1-) Get the names, assign each a number, and pick the first one from the resulting dictionary as the winner.
import random
lucky = {1:'Lucky1',...}
2-) Similar to the previous one but pop items from the dictionary, thus the last one becomes the luckiest.
import random
lucky = {1:'Lucky1',...}
Right now, I am looking for at least one more idea that is inherently random and demonstrates a useful language feature, to help make the lottery at the end of one of the sessions more fun.
Cards are also a source of popular (and familiar!) games of chance.
Perhaps you could show how easy it is to generate, shuffle and sample cards:
#!/usr/bin/env python
import random
import itertools
for number,suit in itertools.product(numbers,suits)]
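A self-contained version of the card idea might look like the sketch below. The `ranks` and `suits` lists and the fixed seed are illustrative choices, not taken from the listing above.

```python
import itertools
import random

# Illustrative rank and suit names.
ranks = [str(n) for n in range(2, 11)] + list("JQKA")
suits = ["clubs", "diamonds", "hearts", "spades"]

# itertools.product pairs every rank with every suit: 13 * 4 = 52 cards.
deck = ["%s of %s" % (rank, suit) for rank, suit in itertools.product(ranks, suits)]

rng = random.Random(2024)   # fixed seed so a demo can be repeated
rng.shuffle(deck)           # shuffle in place
hand = rng.sample(deck, 5)  # draw five distinct cards
print(hand)
```

For the lottery itself you could deal one card per attendee and let the highest card win.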
One of the cutest uses of random numbers for mid-sized crowds is finding cycles. I will describe the physical method, and then some explorations. The Python code is fairly trivial.
Start with your group of about 100 people with their names on pieces of paper in a bowl. Everyone descends on the bowl and takes a random piece of paper. Each person goes to the person with that name. This leads to groups clumping together in various sizes. Not always what people expect.
For example, if Alice picks Bob, Bob picks Charlie, and Charlie picks Alice, then these three people will end up in their own clump. For some groups, have people join hands with their matches to see everyone being pulled this way and that. Also to see how the matches create chains or clumps.
Now write software to watch the number of clumps. Do the math on the clumps, asking, for example, "how often is the biggest clump less than half the people?" As another example, among N students each one draws their own name with probability 1/N, so on average exactly one person does.
Do you need code?
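A minimal simulation of the name-drawing game could look like this. It is a sketch with hypothetical function names; people are numbered 0 to n-1 and a random shuffle plays the role of the bowl.

```python
import random


def clump_sizes(picks):
    """picks[i] is the person whose name person i drew; return the cycle lengths."""
    seen = [False] * len(picks)
    sizes = []
    for start in range(len(picks)):
        size, person = 0, start
        # Follow the chain of picks until we return to someone already counted.
        while not seen[person]:
            seen[person] = True
            person = picks[person]
            size += 1
        if size:
            sizes.append(size)
    return sizes


def share_with_small_biggest_clump(n, trials, seed=0):
    """Fraction of trials in which the biggest clump holds fewer than half the people."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        picks = list(range(n))
        rng.shuffle(picks)
        if max(clump_sizes(picks)) < n / 2:
            hits += 1
    return hits / trials


print(clump_sizes([1, 2, 0, 3]))
print(share_with_small_biggest_clump(100, 2000))
```

Someone who draws their own name shows up as a clump of size 1, so the same function can be used to count self-picks.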
Computing Pi is always fun ;-)
import random
def approx_pi( n ):
    # n random (x,y) pairs (as a generator)
    data = ( (random.random(),random.random()) for _ in range(n) )
    # the share of points inside the quarter unit circle approximates pi/4
    return 4.0*sum( 1 for x,y in data if x**2 + y**2 < 1 )/n
print(approx_pi(100000))

