I'm in the middle of implementing my own DHT for an internal cluster. Since it will be used in a file-sharing program like BitTorrent, "Mainline DHT" was the first thing I looked at. After that I found "entangled" (Python, DHT using Twisted Matrix), congress (Python, DHT using pyev + libev) and of course the original "kademlia".
They take different approaches to organizing k-buckets:
1) congress and kademlia use a fixed set of 160 buckets, where bucket i covers the range 2^i <= (XOR distance of an ID from us) < 2^(i+1), for 0 <= i < 160.
2) Mainline DHT and entangled use dynamic buckets. At start there is just 1 bucket covering the whole space. Once a bucket holds 8 live nodes, it is split into 2 new ones -- but ONLY if our own ID falls inside that bucket. If it does not, the bucket is never split. So before long we end up with up to 160 buckets covering the space closest to us, and a few others.
Both variants are good enough. But I have found a HUGE difference in the logic that decides whether a given ID belongs to a given bucket, and that is my question.
congress and kademlia treat bucket boundaries as "minimum distance from us" and "maximum distance from us". So our own ID will ALWAYS be in bucket 0, and at most 2 other IDs can be in bucket 1 (because it covers distances 2^1 <= x < 2^2), and those will ALWAYS be the closest to us. So my brain doesn't break, because everything is OK.
But if you look at Mainline DHT or entangled, you will see that bucket boundaries are treated as absolute node ID boundaries, not XOR distances! So in a theoretically full table the IDs 0,1,2,3,4,5,6,7 will be in 1 bucket.
So: why do some implementations treat bucket boundaries as "max/min distance from us", while others treat them as "max/min 160-bit integer value"?
The kademlia paper actually calls out the optimization of dynamically splitting buckets as the routing table grows. There is no logic difference between these two approaches; it's just an optimization to save some space. When implementing a fixed, full-sized routing table, you have to find k nodes to send requests to. If the bucket your target falls in is empty, or has fewer than k nodes in it, you have to pick from neighboring buckets. Given that, having the bucket closest to you not be split in the first place makes that search simpler and faster.
As for your point (1), I think you may have misunderstood kademlia. The routing table bucket boundaries are always relative to your own node ID, and the ID space a bucket spans doubles for each bucket further away from you. Without this property (if, say, each bucket covered an equal range of the ID space) you would not be able to do searches properly, and they would certainly not be log(n).
The mainline DHT implements kademlia.
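To make the distance-based view concrete, here is a minimal sketch (my own illustration, not code from any of those projects) of how a fixed-bucket routing table maps an ID to a bucket index purely from the XOR distance; the names are placeholders:

    def bucket_index(our_id: int, other_id: int) -> int:
        """Index of the k-bucket that other_id falls into, relative to our_id.

        Bucket i holds IDs whose XOR distance d satisfies 2**i <= d < 2**(i + 1),
        so the index is just the position of the highest set bit of d.
        """
        d = our_id ^ other_id
        if d == 0:
            raise ValueError("our own ID has no bucket (distance 0)")
        return d.bit_length() - 1  # 0 <= index < 160 for 160-bit IDs

    # IDs at distance 2 or 3 from us both land in bucket 1:
    print(bucket_index(0b1000, 0b1010))  # distance 2 -> 1
    print(bucket_index(0b1000, 0b1011))  # distance 3 -> 1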
The idea is that I'm using the Steam API to get the list of friends of a given user, to gather some IDs for data analysis. Each time I get the friend list of a user I want to get 5 friends of each of his 5 friends. So first I get 5 friends of the first user, then the 5 friends of those 5 friends, so it's 5 -> 25 -> 125 and so on, up to some point -- for example 6 levels to get 15,625 IDs. The question is how to do this, because I don't really know how to make it work. I'm not so good at recursion.
Basically, you can imagine a person as a node with n neighboring nodes (= friends). You start at one node (= yourself), move on to your neighboring nodes (= friends), then move on to their neighboring nodes, and so on, while always keeping track of which nodes you have already visited. This way you gradually move away from your start node until either the whole network is explored (you don't want that in your case) or a certain distance (= number of nodes between you and a friend) is reached -- for example up to the 6th level, as you've described in your post.
The network of friends forms a graph data structure, and what you want to do is a well-known graph algorithm called breadth-first search. The Wikipedia article contains some pseudocode, and if you Google for breadth-first search you will find many, many resources and implementations in any language you need.
By the way, no need for recursion here, so don't use it.
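For example, a minimal iterative sketch in Python -- get_friends here is a placeholder for whatever wrapper you write around the Steam Web API, returning a list of friend IDs:

    from collections import deque

    def collect_ids(start_id, get_friends, max_level=6, per_node=5):
        """Level-limited breadth-first search over the friend graph."""
        visited = {start_id}
        queue = deque([(start_id, 0)])        # (user_id, level away from the start)
        while queue:
            user_id, level = queue.popleft()
            if level == max_level:
                continue                      # don't expand beyond the chosen depth
            for friend_id in get_friends(user_id)[:per_node]:
                if friend_id not in visited:  # skip nodes we've already seen
                    visited.add(friend_id)
                    queue.append((friend_id, level + 1))
        return visited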
After using the Google Cloud Vision API, I received MID values in the format of /m/XXXXXXX (not necessarily 7 characters at the end though). What I would like to do is determine how specific one MID value is compared to the others. Essentially how broad vs. refined a term is. For example, the term Vehicle might be level 1 while the term Van might be level 2.
I have tried to run the MID values through the Google Knowledge Graph API but unfortunately these MIDs are not in that database and return no information. For example, a few MIDs and descriptions I have are as follows:
/m/07s6nbt = text
/m/03gq5hm = font
/m/01n5jq = poster
/m/067408 = album cover
My initial thought on why these MIDs return nothing in the Knowledge Graph API is that they were not carried over after the discontinuation of Freebase. I understand that Google provides an RDF dump of Freebase, but I'm not sure how to read that data in Python and use it to determine the depth of a MID in the hierarchy.
If it's not possible to determine the category level of a MID value, the number of connections a term has would also be an appropriate proxy, assuming broader terms have more connections to other terms than more refined terms. I found an article that discusses the number of "edges" a MID has, which I believe means the number of connections. However, they do some conversion from MID values to long values and use various scripts that keep giving me numerous errors in Python. I was hoping for a simple table with MID values in one column and the number of connections in another, but I'm lost in their code, the value conversion, and the Python errors.
If you have any suggestions for easily determining the number of connections a MID has, or its hierarchical level, it would be greatly appreciated. Thank you!
Those MIDs look like they're for pretty common things, so I'm surprised they're not in the Knowledge Graph. Do you prefix the MIDs to form URIs? That is, with the prefix
"kg": "http://g.co/kg"
the ID would be queried as
"kg:/m/067408"
Freebase and the Knowledge Graph aren't organized as hierarchies, so your level-finding idea doesn't really work. I'm also dubious about your idea of degree (i.e. # of edges) being correlated with broader vs. narrower, but you should be able to use the dump that you've found to test it.
The Freebase ExQ Data Dump that you found is super confusing because they rename Freebase types as topics (not to be confused with Freebase topics), but I think their freebase-nodes-in-out-name.tsv contains the information that you're looking for (# of edges == degree). You can use the inDegree, the outDegree, or the sum of the two.
Their MID-to-integer conversion code doesn't look right to me (and doesn't match its comments), but you'll need to use a compatible implementation to match up with what they've done.
Looking at
/m/02w0000 "Clibadium subsessilifolium"#en
it's encoded as
48484848875048
or
48 48 48 48 87 50 48
0 0 0 0 w 2 0
So they just take the ASCII codes of the ID's characters from right to left and concatenate the decimal values left to right. Confusing, inefficient, and wrong all in one! (MIDs are actually a base-36 (or 37?) encoding.)
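A compatible Python implementation of that scheme might look like the following sketch; it reproduces the example above, and the upper-casing of 'w' to 'W' (ASCII 87) is my assumption based on that example:

    def mid_to_long(mid):
        """Reproduce the dump's MID-to-integer scheme described above:
        reverse the characters after "/m/", then concatenate their decimal
        ASCII codes. Upper-casing is assumed from the /m/02w0000 example."""
        suffix = mid.rsplit("/", 1)[-1]          # "02w0000"
        return int("".join(str(ord(c)) for c in reversed(suffix.upper())))

    print(mid_to_long("/m/02w0000"))             # 48484848875048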
While designing an algorithm in Python I'm trying to maintain an invariant, but I don't know whether that is even possible. It's part of an MST algorithm.
I have some "wanted nodes". They are wanted by one or more clusters, which are implemented as lists of nodes. If I get a node that is wanted by one cluster, it gets placed into that cluster. However, if more than one cluster wants it, all those clusters get merged and then the node gets placed in the resulting cluster.
My goal
I am trying to get the biggest cluster in the list of "wanting" clusters in constant time, as if I had a max-heap in which I could use the updated size of each cluster as the key.
What I am doing so far
The structure I am using right now is a dict where the keys are nodes and the values are lists of the clusters that want that node. This way, if I get a node, I can check in constant time whether some cluster wants it, and if so, I loop through the list of clusters to find the biggest one. Once I finish the loop, I merge the clusters by updating the information in all of the smaller clusters. This way I get a total merging time of O(n log n) instead of O(n²).
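Roughly, what I have looks like this (a simplified sketch, not my exact code; Cluster and wanted_by are just placeholder names):

    class Cluster:
        def __init__(self, nodes=None):
            self.nodes = list(nodes or [])

    wanted_by = {}      # node -> list of clusters that want that node

    def place(node):
        clusters = wanted_by.pop(node, [])
        if not clusters:
            return None
        biggest = max(clusters, key=lambda c: len(c.nodes))   # the linear scan I'd like to avoid
        for c in clusters:
            if c is not biggest:
                biggest.nodes.extend(c.nodes)                 # merge smaller clusters into the biggest
        biggest.nodes.append(node)
        return biggest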
Question
I was wondering if I could store something like a heap in my dict as the value, but I don't know how that heap would be kept updated with the current size of each cluster. Is it possible to do something like that by using pointers and possibly another dict storing the size of each cluster?
A user signs up for my site and enters in their zip code. I want to query for other users, and sort by distance.
I have a database full of zip codes with lat/lon points for each zip code.
zip_code (char)
lat (float)
lon (float)
I have a method which will calculate the distance between two lat/lon pairs, but running it against every other zip code in my DB is expensive -- I'd need to run it on every zip code combination. I suppose I can do it once and store it somewhere, but where would I store it? It seems strange to have a table for every zip code containing the distance to every other zip code. Is there a clean way to do this?
Doing it once and storing it somewhere sounds good to me. Here are some ideas that might give good performance, with some consideration for storage space, without sacrificing accuracy:
There are something like 43,191 zip codes, so the full distance matrix would have 1,865,462,481 entries. But the distances are of course symmetrical and the self-to-self ones are useless, which immediately cuts it down to 932,709,645 entries. We might also cut the space by noticing that a bunch of zip codes are either the same as each other, or one contains the other (e.g. 10178 seems to be inside 10016, and they're both geographically small). Many zip codes will have no users at all, so we might avoid populating those until they're needed (i.e. lazy-load the cache). And finally, you can probably throw away large-distance results, where "large" is defined as any distance greater than is useful for your users.
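A minimal sketch of that lazy, symmetric cache -- the haversine function here is just a stand-in for your existing distance method, and lookup_latlon is a placeholder for however you fetch lat/lon from your zip table:

    import math

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Great-circle distance in miles (stand-in for your existing method)."""
        lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 3959 * 2 * math.asin(math.sqrt(a))

    distance_cache = {}   # (zip_a, zip_b) with zip_a < zip_b -> miles

    def distance_between(zip_a, zip_b, lookup_latlon):
        if zip_a == zip_b:
            return 0.0
        key = (min(zip_a, zip_b), max(zip_a, zip_b))   # symmetry: store each pair once
        if key not in distance_cache:                  # lazy: only compute when asked for
            lat1, lon1 = lookup_latlon(key[0])
            lat2, lon2 = lookup_latlon(key[1])
            distance_cache[key] = haversine_miles(lat1, lon1, lat2, lon2)
        return distance_cache[key]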
For a more algorithmic view, see this previous question: Calculate distance between zip codes and users
Bonus tip: don't forget about non-US users. Poor non-US users.
Here's a solution with a fair amount of overhead, but which will pay off as your dataset size, user base, and/or number of transactions grow:
If you don't already have one, use a database that supports spatial types and spatial indexing. I recommend the PostGIS extension for PostgreSQL, but most of these steps apply to other spatially-enabled databases:
Store your zip code location as a Point geometry type instead of two columns for lat and long.
Create a spatial index against the Point geometry column. Every time you add a new zip code, its location will automatically be added to the spatial index.
Assuming you don't want to show "nearest" neighbors that are thousands of miles away, use a Within function (ST_DWithin in PostGIS) to filter out those zip codes that are too far away. This will significantly reduce the search space for close neighbors.
Finally use a Distance function (ST_Distance in PostGIS) to calculate the distance between your zip code of interest and its closer neighbors, and use the DB to return results sorted by distance.
By using a database with spatial index and a filtering function that uses that index, you can significantly speed up your search. And when the time comes to do more spatial analysis or show maps, you'll already have a framework in place to support that new functionality.
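As a rough illustration of steps 3 and 4 from Python (the table and column names below are assumptions, as is the 80 km radius; treat this as a sketch rather than a tested query):

    import psycopg2

    # Assumed schema: zip_codes(zip_code char, geom geometry(Point, 4326))
    QUERY = """
    SELECT z.zip_code,
           ST_Distance(z.geom::geography, home.geom::geography) AS meters
    FROM zip_codes AS z,
         (SELECT geom FROM zip_codes WHERE zip_code = %s) AS home
    WHERE ST_DWithin(z.geom::geography, home.geom::geography, %s)
    ORDER BY meters;
    """

    def nearby_zips(conn, zip_code, radius_meters=80000):
        with conn.cursor() as cur:
            cur.execute(QUERY, (zip_code, radius_meters))
            return cur.fetchall()   # [(zip_code, distance_in_meters), ...]

    # conn = psycopg2.connect("dbname=mydb")
    # print(nearby_zips(conn, "10016"))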
I'm building a directory for finding athletic tournaments on GAE with
web2py and a Flex front end. The user selects a location, a radius, and a maximum
date from a set of choices. I have a basic version of this query implemented, but it's
inefficient and slow. One way I know I can improve it is by condensing
the many individual queries I'm using to assemble the objects into
bulk queries. I just learned that was possible. But I'm also thinking about a more extensive redesign that utilizes memcache.
The main problem is that I can't query the datastore by location
because GAE won't allow inequality filters (<, <=, >=, >) on more than
one property in a single query. I'm already using one for date, and I'd
need TWO more to check both latitude and longitude, so it's a no go.
Currently, my algorithm looks like this:
1.) Query by date and select
2.) Use destination function from geopy's distance module to find the
max and min latitude and longitudes for supplied distance
3.) Loop through results and remove all with lat/lng outside max/min
4.) Loop through again and use the distance function to check the exact
distance, because step 2 will include some areas outside the radius.
Remove results outside the supplied distance (is this 2/3/4 combination
inefficient? a sketch of what I mean follows this list)
5.) Assemble many-to-many lists and attach to objects (this is where I
need to switch to bulk operations)
6.) Return to client
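Roughly, steps 2-4 look like this in code (a simplified sketch, not my exact code; the .lat/.lng attribute names are just illustrative):

    from geopy.distance import distance

    def filter_by_radius(results, center, radius_miles):
        """Steps 2-4: coarse lat/lng box filter, then exact distance check."""
        lat, lng = center
        # Step 2: walk radius_miles in each compass direction to get the bounding box.
        north = distance(miles=radius_miles).destination((lat, lng), bearing=0)
        south = distance(miles=radius_miles).destination((lat, lng), bearing=180)
        east = distance(miles=radius_miles).destination((lat, lng), bearing=90)
        west = distance(miles=radius_miles).destination((lat, lng), bearing=270)

        # Step 3: cheap rectangle test.
        in_box = [r for r in results
                  if south.latitude <= r.lat <= north.latitude
                  and west.longitude <= r.lng <= east.longitude]

        # Step 4: exact great-circle distance for the survivors only.
        return [r for r in in_box
                if distance((lat, lng), (r.lat, r.lng)).miles <= radius_miles]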
Here's my plan for using memcache... let me know if I'm way out in left
field on this, as I have no prior experience with memcache or server
caching in general.
-Keep a list in the cache filled with "geo objects" that represent all
my data. These have five properties: latitude, longitude, event_id,
event_type (in anticipation of expanding beyond tournaments), and
start_date. This list will be sorted by date.
-Also keep a dict of pointers in the cache which represent the start
and end indices in the cache for all the date ranges my app uses (next
week, 2 weeks, month, 3 months, 6 months, year, 2 years).
-Have a scheduled task that updates the pointers daily at 12am.
-Add new inserts to the cache as well as the datastore; update
pointers.
Using this design, the algorithm would now look like:
1.) Use pointers to slice off appropriate chunk of list based on
supplied date.
2-4.) Same as above algorithm, except with geo objects
5.) Use bulk operation to select full tournaments using remaining geo
objects' event_ids
6.) Assemble many-to-manys
7.) Return to client
Thoughts on this approach? Many thanks for reading and any advice you
can give.
-Dane
GeoModel is the best I found. You can look at how my GAE app returns geospatial queries. For instance, the HTTP query for India (with an optional cc country code) using the geomodel library is lat=20.2095231&lon=79.560344&cc=IN
You might be interested in geohash, which enables you to do an inequality query like this:
SELECT latitude, longitude, title FROM myMarkers
WHERE geohash >= :sw_geohash AND geohash <= :ne_geohash
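In Python the idea might look like the sketch below; the encoder follows the standard geohash bit interleaving, and the GAE filter shown in the final comment is only indicative, not a tested call:

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

    def geohash_encode(lat, lng, precision=6):
        """Plain geohash encoding: interleave longitude/latitude bits, 5 bits per character."""
        lat_lo, lat_hi, lng_lo, lng_hi = -90.0, 90.0, -180.0, 180.0
        chars, bits, bit_count, even = [], 0, 0, True   # even bit -> longitude
        while len(chars) < precision:
            if even:
                mid = (lng_lo + lng_hi) / 2
                bits = (bits << 1) | (lng >= mid)
                lng_lo, lng_hi = (mid, lng_hi) if lng >= mid else (lng_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                bits = (bits << 1) | (lat >= mid)
                lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
            even, bit_count = not even, bit_count + 1
            if bit_count == 5:
                chars.append(BASE32[bits])
                bits, bit_count = 0, 0
        return "".join(chars)

    def geohash_range(sw_lat, sw_lng, ne_lat, ne_lng, precision=6):
        """Geohash bounds of the bounding box's south-west and north-east corners."""
        return (geohash_encode(sw_lat, sw_lng, precision),
                geohash_encode(ne_lat, ne_lng, precision))

    # On GAE this becomes two inequality filters on the SAME property (which the
    # datastore allows), roughly:
    #   markers.filter('geohash >=', sw).filter('geohash <=', ne)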
Have a look at this fine article, which was featured in this month's Google App Engine Community Update blog post.
As a note on your proposed design, don't forget that entities in memcache have no guarantee of staying in memory, and that you cannot have them "sorted by date".