Recursive function with growing number of calls - python

So the idea is that I'm using the Steam API to get the friend list of a given user, to gather some IDs for data analysis. Each time I get a user's friend list, I take 5 of their friends and then fetch the friend lists of those 5, so it grows 5 -> 25 -> 125 and so on, up to some depth, for example 6 levels to get 15,625 IDs. My question is how to actually make this work, because I don't really know how; I'm not so good at recursion.

Basically you can imagine a person as a node that has n neighboring nodes (= friends). You start at one node (= yourself), move on to your neighboring nodes (= friends), then move on to their neighboring nodes, and so on, while always keeping track of which nodes you have already visited. This way you gradually move away from your start node, until either the whole network is explored (which you don't want in your case) or a certain distance (= number of nodes between you and the current person) is reached, for example up to the 6th level as you've described in your post.
This network of friends forms a graph data structure, and what you want to do is a well-known graph algorithm called breadth-first search. The Wikipedia article contains pseudocode, and if you search for breadth-first search you will find many, many resources and implementations in any language you need.
By the way, there is no need for recursion here, so don't use it.
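To make that concrete, here is a minimal iterative sketch, assuming a hypothetical helper get_friends(user_id) that wraps the Steam Web API's GetFriendList call and returns a list of Steam IDs:

from collections import deque

def collect_ids(start_id, get_friends, per_user=5, max_depth=6):
    '''Breadth-first walk over the friend graph, taking up to per_user
    friends of each user and stopping max_depth levels from the start.'''
    visited = {start_id}
    queue = deque([(start_id, 0)])       # (user, distance from start)
    while queue:
        user, depth = queue.popleft()
        if depth == max_depth:
            continue                     # deep enough, don't expand further
        for friend in get_friends(user)[:per_user]:
            if friend not in visited:    # skip people we've already seen
                visited.add(friend)
                queue.append((friend, depth + 1))
    return visited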

Related

Read Graph from multiple files in IGraph (Python)

I have multiple node and edge lists which form a large graph; let's call that the maingraph. My current strategy is to first read all the node lists and import them with add_vertices. Every node then gets an internal ID that depends on the order in which it was ingested and therefore isn't very reliable (as I've read it, if you delete one node, all IDs higher than the deleted one change). I assign every node a name attribute, which corresponds to the external ID I use so I can keep track of my nodes between frameworks, and a type attribute.
Now, how do I add the edges? When I read an edge list it will create a new graph (subgraph) and hence start the internal IDs at 0. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist()) inevitably fails.
It is possible to work around this and use the name attribute from both maingraph and subgraph to find out which internal ID each edge's incident nodes have in the maingraph:
def _get_real_source_and_target_id(edge):
    '''Takes an edge from the to-be-added subgraph and gets the IDs of the
    corresponding nodes in the maingraph by their name.'''
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id, target_id)
And then I tried
edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)
But that is horribly slow. The graph has millions of nodes and edges, which take 10 seconds to load with the fast but incorrect maingraph.add_edges(subgraph.get_edgelist()) approach. With the correct approach explained above, it takes minutes (I usually stop it after 5 minutes or so). I will have to do this tens of thousands of times. I switched from NetworkX to igraph because of the fast loading, but it doesn't really help if I have to do it like this.
Does anybody have a more clever way to do this? Any help much appreciated!
Thanks!
Never mind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the node names as strings, which somehow did funny stuff when the names were incrementing numbers with more than five figures (see my issue report here). Therefore the above solution got stuck when it tried to get the nodes where numpy had messed up the node name: maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldn't find the node. Now I use pandas to read the node names and it works fine.
The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it here in case it helps someone. Feel free to delete it otherwise.
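For anyone who still has the original speed problem: each vs.select(name_eq=...) call scans all vertices, so the usual fix (a sketch of my own, independent of the numpy issue above) is to build a name-to-index dictionary once per merge and do O(1) lookups:

# Build the lookup table once instead of rescanning maingraph.vs per edge.
name_to_id = {v["name"]: v.index for v in maingraph.vs}

edgelist = [(name_to_id[subgraph.vs[s]["name"]],
             name_to_id[subgraph.vs[t]["name"]])
            for s, t in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)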

Basic questions about nested blockmodel in graph-tool

Very briefly, two or three basic questions about the minimize_nested_blockmodel_dl function in the graph-tool library. Is there a way to figure out which vertex falls into which block? In other words, can I extract from each block a list containing the labels of its vertices?
The hierarchical visualization is rather difficult to understand for amateurs in network theory. For example, are the squares with directed edges meant to indicate the main direction of the underlying edges between the two blocks under consideration? The blocks are nicely shown using different colors, but on a conceptual level, which kinds of patterns or edge/vertex properties drive the assignment of vertices to blocks? In other words, when two vertices are in the same block, what can I say about their common properties?
Regarding your first question, it is fairly straightforward: The minimize_nested_blockmodel_dl() function returns a NestedBlockState object:
g = collection.data["football"]
state = minimize_nested_blockmodel_dl(g)
you can query the group membership of the nodes by inspecting the first level of the hierarchy:
lstate = state.levels[0]
This is a BlockState object, from which we get the group memberships via the get_blocks() method:
b = lstate.get_blocks()
print(b[30]) # prints the group membership of node 30
Regarding your second question, the stochastic block model assumes that nodes that belong to the same group have the same probability of connecting to the rest of the network. Hence, nodes that get classified in the same group by the function above have similar connectivity patterns. For example, if we look at the fit for the football network:
state.draw(output="football.png")
We see that nodes that belong to the same group tend to have more connections to other nodes of the same group --- a typical example of community structure. However, this is just one of the many possibilities that can be uncovered by the stochastic block model. Other topological patterns include core-periphery organization, bipartiteness, etc.
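And to get the literal list of vertices per block asked about in the first question, a small sketch (my own, reusing the g and b objects from above):

from collections import defaultdict

members = defaultdict(list)
for v in g.vertices():
    members[int(b[v])].append(int(v))   # block label -> list of vertex indices

# e.g. members[int(b[30])] lists every vertex in node 30's group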

For each test case, display the minimum number of days necessary to complete the plan. ALGORITHM ideas?

I have to write code to solve this question, but first I have to come up with an algorithm for it, and I have no idea how. Here is the question and what I was thinking of doing:
The Canadian Space Agency (CSA) is in a race with their American and Russian counterparts (NASA and Roscosmos, resp.) to land the first man on Mars. They have broken down their plan into n tasks, each taking exactly one day. Each task can only be started when all its prerequisite tasks have finished. Clearly, there is no circular dependency between the tasks. The CSA needs your help to determine the minimum number of days necessary to complete the plan.
We are given a list containing the prerequisites for each task, so I know which task has which prerequisites. For example, a list like [[1,2,3],[],[],[]] means that task 0 has prerequisites 1, 2 and 3, and the other three tasks have no prerequisites.
What I was thinking of doing was checking every list and seeing whether it has length 0. If it does, increase the day count by 1. If it does not, iterate through the prerequisites and see if any of those can be done; if any of the prerequisites have prerequisites of their own, I have to do those first.
I understand the theory, I just have no idea how to implement it. I need to somehow keep track of the prerequisites I have finished and then remove them from every task that requires them. I understand I could just make a list, but if the input file is extremely large, that would take a long time, and my algorithm has to return an answer in under 1 second.
Any help figuring out this algorithm would be greatly appreciated.
This is a pretty standard tree/graph problem. Think about building a data structure that has all the independent tasks as the leaves, with parent nodes for the tasks that depend on them. You should then be able to come up with a search that calculates the number of days as it moves up levels in the tree.
Your node could look something like this:
class node(object):
    def __init__(self, children):
        self.children = children

    def days_to_complete(self):
        # a task takes one day once its longest prerequisite chain is done
        if not self.children:
            return 1
        return 1 + max(child.days_to_complete() for child in self.children)
Note that this class doesn't account for cross dependencies, but I'm not going to solve your homework for you ;)
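Given the 1-second requirement in the question, it may also be worth sketching the standard non-recursive alternative: a level-by-level topological sort (Kahn's algorithm), where each wave of prerequisite-free tasks corresponds to one day. It runs in O(tasks + prerequisite pairs); the function name and structure here are just an illustration:

from collections import deque

def min_days(prereqs):
    '''prereqs[i] lists the tasks that must finish before task i can start.'''
    n = len(prereqs)
    indegree = [len(p) for p in prereqs]    # unfinished prerequisites per task
    dependents = [[] for _ in range(n)]     # dependents[j]: tasks waiting on j
    for i, ps in enumerate(prereqs):
        for j in ps:
            dependents[j].append(i)
    ready = deque(i for i in range(n) if indegree[i] == 0)
    days = 0
    while ready:
        days += 1                           # one wave of ready tasks = one day
        for _ in range(len(ready)):
            task = ready.popleft()
            for d in dependents[task]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    ready.append(d)
    return days

With the example above, min_days([[1,2,3],[],[],[]]) returns 2: tasks 1, 2 and 3 are done on day one, task 0 on day two.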

Need algorithm suggestions for flight routings

I'm in the early stages of thinking through a wild trip that involves visiting every commercial airport in India. A little research shows that the national carrier - Air India, has a special ticket called the Silver Pass that allows unlimited travel on their domestic network for 15 days. I would like to use this as my weapon of choice!
See this for a map of all the airports served by Air India
I have the following information available with me in Excel:
All of the domestic flight routes (departure airports and arrival airports in IATA codes)
Duration for every flight route
Weekly frequency for every flight (not all flights run on all days of the week, for example)
Given this information, how do I figure out the maximum number of airports that I can hit in 15 days using the Silver Pass ticket? Looking online suggests that this is either a traveling salesman problem or a graph traversal problem. What would you recommend that I look at to solve this?
Some background on myself - I'm just beginning to learn Python and would like to figure out a way to solve this problem using that. Given that, what are the python-based algorithms/libraries that I should be looking at that will help me structure an approach to solving this?
Your problem is closely related to the Hamiltonian Path problem and Traveling Salesman Problem, which are NP-Hard.
Given an instance of the Hamiltonian Path problem, build flight data as follows:
Each vertex is an airport
Each edge is a flight
All flights leave at the same times and take the same duration. (*)
(*) The flight duration and the departure times [which are common to all flights] should be chosen so that you can visit all terminals only if you visit each terminal exactly once. This can easily be done in polynomial time: assume we have a fixed time of k hours for the ticket; we construct the flight table such that each flight takes exactly k/(n-1) hours, and there is a flight every k/(n-1) hours as well [1] [remember, all flights leave at the same times].
It is easy to see that you can use the ticket to visit all airports if and only if the graph has a Hamiltonian path, since if we visit some airport twice along the path, we need at least n flights and the total time is at least (k/(n-1)) * n > k, so we fail the time limit. [The other direction is similar.]
Thus your problem [in the general case] is NP-hard, and there is no known polynomial-time solution for it.
[1]: We assume it takes no time to transfer between flights; this can easily be fixed by decreasing each flight's length by the time it takes to "jump" between two flights.
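To make the construction concrete (the numbers are my own): with n = 6 airports and a k = 15-day ticket, every constructed flight lasts k/(n-1) = 3 days. Visiting all 6 airports along a Hamiltonian path takes exactly 5 flights = 15 days, while any route that revisits an airport needs at least 6 flights = 18 days > k.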
Representing your problem as a graph is definitely the right approach. Since the duration, number of flights, and number of airports are relatively limited, and since you are (presumably) happy with approximate solutions, attacking this by brute force ought to be practical and is probably your best option. Here's roughly what I would do:
Represent each airport as a node on the graph, and each flight as an edge.
Given a starting airport and a current time, select all the flights leaving that airport after the current time. Use a scoring function of some sort to rank them, such that flights to airports you haven't visited rank higher than flights to airports you have already visited, and earlier flights rank higher than later ones.
Recursively explore each outgoing edge, in order of score, and repeat the procedure for the arriving airport.
Any time you reach a node with no outgoing valid edges, compare it to the best possible solution. If it's an improvement, output it and set it as the new best solution.
Depending on the number of flights, you may be able to run this procedure exhaustively. The number of solutions grows exponentially with the number of flights, of course, so this will quickly become impractical. This is where the scoring function becomes useful - it prioritizes the solutions more likely to produce useful answers. You can run the procedure for as long as you want, and stop when it produces a solution you're happy with.
The properties of the scoring function will have a big impact on how good the solutions are. If your priority is exploring unique places, you want to put a big premium on unvisited airports, and since you want to explore as many as possible, you need to prioritize short transfer times. My suggestion for a starting point would be to make the penalty for going somewhere you've already been proportional to the time it would take to fly from there to somewhere else. That way, it'll still be explored as a stopover, but avoided where possible. Also, note that your scoring function will need context, namely the set of airports that have been visited by the current candidate path.
You can also use the scoring function to apply other constraints. Say you don't want to travel during the night (a reasonable assumption); you can penalize the score of edges that involve nighttime flights.
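If it helps, here is a rough sketch of that recursive search; every name in it is made up, the weekly-frequency detail is ignored, and times are simplified to plain hour counts rather than a real timetable:

best = {"count": 0, "path": []}

def score(flight, visited):
    dep, arr, dest = flight
    # prefer flights to unvisited airports, then earlier departures
    return (dest in visited, dep)

def explore(airport, now, deadline, visited, path, flights_from):
    '''flights_from maps an airport code to (dep_hour, arr_hour, dest) tuples.'''
    if len(visited) > best["count"]:
        best["count"], best["path"] = len(visited), list(path)
    options = [f for f in flights_from.get(airport, [])
               if f[0] >= now and f[1] <= deadline]
    for dep, arr, dest in sorted(options, key=lambda f: score(f, visited)):
        explore(dest, arr, deadline, visited | {dest}, path + [dest], flights_from)

# e.g. explore("DEL", 0, 15 * 24, {"DEL"}, ["DEL"], flights_from);
# best["path"] then holds the longest itinerary found so far.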

DHT: BitTorrent vs kademlia vs clones (python)

I'm in the middle of implementing my own DHT for an internal cluster. Since it will be used in a file-sharing program like BitTorrent, "Mainline DHT" was the first thing I looked at. After that I found "entangled" (Python, DHT using Twisted), congress (Python, DHT using pyev + libev), and of course the original "kademlia".
They have different approaches to organizing k-buckets:
1) congress and kademlia use 160 fixed buckets, where bucket i covers XOR distances d from our own ID with 2**i <= d < 2**(i+1), for 0 <= i < 160.
2) Mainline DHT and entangled use dynamic buckets. At the start there is just one bucket covering the whole ID space. Once it fills up with 8 live nodes, the bucket is split in two, but ONLY if our own ID lies inside it; if it does not, the bucket is never split. So before long we have roughly 160 buckets closest to us and a few others.
Both variants are good enough, but I have found a HUGE difference in the logic that decides whether a given ID belongs to a given bucket. And this is my question.
congress and kademlia treat bucket boundaries as "minimum distance from us" and "maximum distance from us". So our own ID is ALWAYS in bucket 0, and the at most 2 other IDs in bucket 1 (because it covers distances 2**1 <= x < 2**2) are ALWAYS the closest to us. That much makes sense to me.
But if you look at Mainline DHT or entangled, you will see that bucket boundaries are treated as absolute node ID boundaries, not XOR distances! So in a theoretically full table, the IDs 0,1,2,3,4,5,6,7 would all be in one bucket.
So: why do some implementations treat bucket boundaries as "max/min distance from us", while others treat them as "max/min 160-bit integer values"?
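For reference, here is a tiny sketch (mine, not taken from any of the implementations above) of the distance-based indexing from variant (1), where bucket i covers XOR distances 2**i <= d < 2**(i+1):

def bucket_index(own_id, other_id):
    d = own_id ^ other_id                # Kademlia's XOR metric
    assert d > 0, "we never store our own ID"
    return d.bit_length() - 1            # the i with 2**i <= d < 2**(i+1)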
The Kademlia paper actually calls out the optimization of dynamically splitting buckets as the routing table grows. There is no difference in logic between the two approaches; it's just an optimization to save some space. When implementing a fixed, full-sized routing table, you have to find k nodes to send requests to. If the bucket your target falls in is empty, or has fewer than k nodes in it, you have to pick from neighboring buckets. Given that, leaving the buckets closest to you unsplit in the first place makes that search simpler and faster.
As for your point (1), I think you may have misunderstood Kademlia. The routing table bucket boundaries are always relative to your own node ID, and the span of ID space a bucket covers doubles for each bucket further away from you. Without this property (if, say, each bucket covered an equal range of the ID space) you would not be able to do searches properly, and they would certainly not be log(n).
The Mainline DHT implements Kademlia.
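A hypothetical sketch of that neighboring-bucket fallback with a fixed table, assuming buckets[i] holds the known node IDs at XOR distance [2**i, 2**(i+1)) from us:

def closest_k(target_id, buckets, own_id, k=8):
    i = max((own_id ^ target_id).bit_length() - 1, 0)  # target's bucket
    picked = list(buckets[i])
    lo, hi = i - 1, i + 1
    # widen to neighboring buckets until we have k candidates
    while len(picked) < k and (lo >= 0 or hi < len(buckets)):
        if lo >= 0:
            picked.extend(buckets[lo])
            lo -= 1
        if hi < len(buckets):
            picked.extend(buckets[hi])
            hi += 1
    return sorted(picked, key=lambda n: n ^ target_id)[:k]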
