I have multiple nodelists and edgelists which together form a large graph; let's call it the maingraph. My current strategy is to first read all the nodelists and import them with add_vertices. Every node then gets an internal id which depends on the order of ingestion and therefore isn't very reliable (as I've read, if you delete one vertex, all ids higher than the deleted one change). I assign every node a name attribute, which corresponds to the external ID I use so I can keep track of my nodes between frameworks, and a type attribute.
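For reference, the vertex-loading step looks roughly like this (the external IDs and types below are placeholders; the attribute names match the description above):

import igraph as ig

# external_ids and node_types are assumed to come from the nodelist files
external_ids = ["100001", "100002", "100003"]   # hypothetical external IDs
node_types = ["A", "A", "B"]                    # hypothetical types

maingraph = ig.Graph()
maingraph.add_vertices(len(external_ids))
maingraph.vs["name"] = external_ids   # external ID, stable across frameworks
maingraph.vs["type"] = node_types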
Now, how do I add the edges? When I read an edgelist it becomes a new graph (subgraph), whose internal IDs again start at 0. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist()) inevitably fails.
It is possible to work around this by using the name attribute of both maingraph and subgraph to find out which internal IDs each edge's incident nodes have in the maingraph:
def _get_real_source_and_target_id(edge):
    '''Takes an edge from the to-be-added subgraph and returns the ids of the
    corresponding nodes in the maingraph, looked up by their name.'''
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id, target_id)
And then I tried
edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)
But that is horribly slow. The graph has millions of nodes and edges, and it takes about 10 seconds to load with the fast but incorrect maingraph.add_edges(subgraph.get_edgelist()) approach. With the correct approach explained above, it takes minutes (I usually stop it after 5 minutes or so). I will have to do this tens of thousands of times. I switched from NetworkX to igraph because of the fast loading, but that doesn't really help if I have to do it like this.
Does anybody have a more clever way to do this? Any help much appreciated!
Thanks!
Nevermind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the nodes' names as strings, which did funny stuff when the names were incrementing numbers with more than five figures (see my issue report here). Therefore the solution above got stuck when it tried to fetch nodes whose names numpy had mangled: maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldn't find the node. Now I use pandas to read the node names and it works fine.
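A minimal sketch of the pandas-based read (the file name and column layout are assumptions):

import pandas as pd

# force every column to string so long numeric-looking names are not mangled
nodes = pd.read_csv("nodelist.txt", sep=r"\s+", dtype=str)
names = nodes["name"].tolist()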
The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it here in case it helps someone. Feel free to delete it otherwise.
I have a dataset of postal codes for each store and nearby postal codes for each. It looks like the following:
PostalCode    nearPC                                                Travel Time
L2L 3J9       ['N1K 0A1', 'N1K 0A2', 'N1K 0A3', 'N1K 0A4', ...]     [nan, nan, 9, 5, nan, ...]
I know I can explode the data, but that would result in far more rows (~40M). Another preprocessing step I could perform is to remove the values in each list where the travel time is not available; however, I would then also need to remove the corresponding entries from the nearPC list.
Is there a way to incorporate networkx to create this graph? I've tried using
G = nx.from_pandas_edgelist(df,'near_PC','PostalCode',['TravelTime'])
but I don't think it allows lists as the source or targets.
TypeError: unhashable type: 'list'
Is there a way around this? If not, how can I efficiently remove the same indices from each list per row, based on a conditional?
You've identified your problem, although you may not realize it. You have a graph with 40M edges, but you appropriately avoid the table explosion. You do have to code that explosion in some form, because your graph needs all 40M edges.
For what little trouble it might save you, I suggest that you write a simple generator expression for the edges: take one node from PostalCode, iterating through the nearPC list for the other node. Let Python and NetworkX worry about the in-line expansion.
Switch the nx build method you call depending on the format you generate. You will slow down the processing somewhat, but the explosion details get hidden in the language syntax. Also, if there is any built-in parallelization between that generator and the nx method, you'll get that advantage implicitly.
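A minimal sketch of that idea, assuming the column names from the sample above ('PostalCode', 'nearPC', 'Travel Time'); adjust them to your real headers:

import math
import networkx as nx

def edge_gen(df):
    # yield one (store, nearby, attrs) triple per list entry,
    # skipping entries whose travel time is missing
    for store, near_list, times in zip(df['PostalCode'], df['nearPC'], df['Travel Time']):
        for near, t in zip(near_list, times):
            if not (isinstance(t, float) and math.isnan(t)):
                yield store, near, {'travel_time': t}

G = nx.Graph()
G.add_edges_from(edge_gen(df))   # df is the original, unexploded DataFrame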
I am trying to insert millions of vertices and edges in the shortest time possible with gremlin-python.
I have 2 things to consider:
avoid duplicates for both vertices and edges
avoid spending 10 hours inserting all the data
Most of the time is spent looking up existing vertices and creating the relationships.
If I insert edges without checking if a vertex already exists, the script is faster.
I have also tried batching the transactions, like:
g.addV("person").property("name", "X").as_("p1")
.addV("person").property("name", "Y").as_("p2")
.addE("has_address").from("p1").to(g.V().has("address", "name", "street"))
.addE("has_address").from("p2").to(g.V().has("address", "name", "street2")).iterate()
but I have not improved the performance.
Will I get the same query results if there are duplicates? I think queries would also be more expensive later with duplicates, no?
Thanks.
My answer to your last question provided some hints on how to load data "fast" and now that I know your size is in the millions I hope you would consider those strategies.
If you happen to continue with loading with Gremlin and Python, then consider the following points:
1. I'm not sure where your duplicates arise from, but I would look for opportunities to clean the source data and organize it and the load plan so that duplicates are never loaded in the first place, which will save you cleanup later. I can't say whether duplicates will leave you with the same query results, because I know neither your data nor your queries. In some graphs and queries I've known duplicates to be irrelevant and expected, and in others they could be a disaster.
2. Definitely try to organize your load into the batching pattern in the blog post I suggested in my other answer. This approach is faster than building giant chained Gremlin traversals filled with hundreds of addV() and addE().
3. Related to item 1, you've seen the performance pain of looking up graph elements prior to insert. Consider pre-sorting your data in such a way as to avoid repeated vertex lookups. One way is to load all your vertices first, so that you know they exist, and then group/sort your edge load so that you find each vertex once and load all edges among that neighborhood of vertices (see the sketch below).
4. Finally, if you can get 1 and 3 figured out, then perhaps you can parallelize the load.
Again, it's impossible to really offer specifics in this sort of forum, but perhaps these ideas will inspire you to an answer.
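As a rough illustration of item 3 only (not the blog post's exact pattern), here is a minimal gremlin-python sketch that assumes all vertices are already loaded and that edges have been grouped by their source vertex, so each source lookup happens once per group; the labels, property keys, and connection details are hypothetical:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# hypothetical connection details
g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

def load_edges_for_source(g, src_name, targets):
    """targets: list of (edge_label, target_name). Assumes the vertices already exist."""
    t = g.V().has('person', 'name', src_name).as_('src')   # one lookup per source vertex
    for edge_label, dst_name in targets:
        t = t.addE(edge_label).from_('src').to(__.V().has('address', 'name', dst_name))
    t.iterate()   # one small traversal per group, rather than one per edge

# usage: edges grouped by source before loading
edges_by_source = {'X': [('has_address', 'street'), ('has_address', 'street2')]}
for src, targets in edges_by_source.items():
    load_edges_for_source(g, src, targets)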
I'm setting up my database and sometimes I'll need to use an ID. At first I added an ID as a property to my nodes of interest, but then realized I could also just use Neo4j's internal id. Then I stumbled upon CREATE INDEX ON :label(something) and was wondering what exactly this does. I thought an index and the id would be the same thing?
This might be a stupid question, but since I'm kind of a beginner in databases, I may be missing some of these concepts.
Also, I've been reading about which kind of database to use (MySQL, MongoDB or Neo4j) and decided on Neo4j, since my data pretty much follows a graph structure (it will be used to build metabolic models: connections genes -> proteins -> reactions -> compounds).
In SQL the syntax just seemed too complex, as I had to go through several tables to make simple connections that Neo4j accomplishes quite easily...
From what I understand, MongoDB stores documents independently, and since my data is connected, it doesn't really seem to fit the data structure.
But again, since my knowledge of this subject is limited, perhaps I'm not making the right choice?
Thanks in advance.
Graph dbs are ideal for connected data like this, it's a more natural fit for both storing and querying than relational dbs or document stores.
As far as indexes and ids, here's the index section of the docs, but the gist of it is that this has to do with how Neo4j can look up starting nodes. Neo4j only uses indexes for finding these starting nodes (though in 3.5 when we do index lookup like this, if you have ORDER BY on the indexed property, it will use the index to augment the performance of the ordering).
Here is what Neo4j will attempt to use, depending on availability, from fastest to slowest:
Lookup by internal ID - This is always quick, however we don't recommend preserving these internal ids outside the context of a query. The reason for that is that when graph elements are deleted, their ids become eligible for reuse. If you preserve the internal ids outside of Neo4j, and perform a lookup with them later, there is a chance that whatever you expected it to reference could have been deleted, and may point at nothing, or may point at some new node with completely different data.
Lookup by index - This is where you would want to use CREATE INDEX ON (or add a unique constraint, if that makes sense for your model). When you MATCH or MERGE using the label and property (or properties) associated with the index, this is a fast and direct lookup of the node(s) you want.
Lookup by label scan - If you perform a MATCH with a label present in the pattern, but no means to use an index (either no index present for the label/property combination, or only a label is present but no property), then a label scan will be performed, and every node of the given label will be matched to and filtered. This becomes more expensive as more nodes with those labels are added.
All nodes scan - If you do not supply any label in your MATCH pattern, then every node in your db will be scanned and filtered. This is very expensive as your db grows.
You can EXPLAIN or PROFILE a query to see its query plan, which will show you which means of lookup are used to find the starting nodes, and the rest of the operations for executing the query.
Once a starting node or nodes are found, then Neo4j uses relationship traversal and filtering to expand and find all paths matching your desired pattern.
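For instance, here is a minimal sketch with the official Neo4j Python driver (the connection details and the :Gene(name) label/property are hypothetical, chosen to match the metabolic-model example) showing an index being created and PROFILE being used to inspect the starting-node lookup:

from neo4j import GraphDatabase

# hypothetical connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # index on :Gene(name), using the CREATE INDEX ON syntax from the question's era
    session.run("CREATE INDEX ON :Gene(name)")

    # MERGE by label + indexed property: a direct, indexed starting-node lookup
    session.run("MERGE (g:Gene {name: $name})", name="gene_1")

    # PROFILE shows which starting-node strategy the planner actually used
    summary = session.run("PROFILE MATCH (g:Gene {name: $name}) RETURN g",
                          name="gene_1").consume()
    print(summary.profile)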
This is more a logical thinking issue rather than coding.
I already have some working code blocks - one which telnets to a device, one which parses results of a command, one which populates a dictionary etc etc
Now let's say I want to analyse a network with unknown nodes a, b, c, etc. (but I only know about one of them).
I give my code block node a. The results are a table including b and c. I save that in a dictionary.
I then want to use that first entry (b) as a target and see what it can see, possibly d, e, etc., and add those (if any) to the dict.
And then do the same on the next node in this newly populated dictionary. The final output would be that all nodes have been visited exactly once, and all devices seen are recorded in this (or another) dictionary.
However I can't figure out how to keep re-reading the dict as it grows, and I can't figure out how to avoid looking at a device more than once.
I realize this is clearer in my head than I have explained it; apologies if it's confusing.
You are looking at graph algorithms, specifically DFS or BFS. Are you asking specifically about implementation details, or more generally about the algorithms?
Recursion would be a very neat way of doing this.
seen = {}

def DFS(node):
    for neighbour in node.neighbours():          # e.g. telnet to the device and parse its table
        if neighbour not in seen:
            seen[neighbour] = some_info_about_neighbour   # record whatever you parsed for it
            DFS(neighbour)
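If recursion depth becomes a concern on a large network, an iterative breadth-first version with an explicit queue does the same job; here get_neighbours() stands in for your telnet-and-parse blocks and is purely hypothetical:

from collections import deque

def crawl(start_device, get_neighbours):
    """Visit every reachable device exactly once, starting from start_device.
    get_neighbours(device) should return the devices seen from that device."""
    seen = {start_device: None}          # device -> info gathered about it
    queue = deque([start_device])
    while queue:
        device = queue.popleft()
        for neighbour in get_neighbours(device):
            if neighbour not in seen:    # never visit a device twice
                seen[neighbour] = None   # or store the parsed table here
                queue.append(neighbour)
    return seen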
I'm developing an application in which I need a structure to represent a huge graph (between 1,000,000 and 6,000,000 nodes, with 100 to 600 edges per node) in memory. The edge representation will contain some attributes of the relationship.
I have tried a memory-mapped representation, arrays, dictionaries and strings to represent that structure in memory, but it always crashes because of the memory limit.
I would like some advice on how I can represent this, or something similar.
By the way, I'm using python.
If that is 100-600 edges/node, then you are talking about 3.6 billion edges.
Why does this have to be all in memory?
Can you show us the structures you are currently using?
How much memory are we allowed (what is the memory limit you are hitting?)
If the only reason you need this in memory is because you need to be able to read and write it fast, then use a database. Databases read and write extremely fast, often they can read without going to disk at all.
Depending on your hardware resources, an all-in-memory solution for a graph this size is probably out of the question. Two possible options from a graph-specific DB point of view are:
Neo4j - claims to easily handle billions of nodes and has been in development a long time.
FlockDB - newly released by Twitter this is a distributed graph database.
Since you're using Python, have you looked at NetworkX? Out of interest, how far did you get loading a graph of this size, if you have tried it?
I doubt you'll be able to use a memory structure unless you have a LOT of memory at your disposal:
Assume you are talking about 600 directed edges from each node, with a node being 4 bytes (an integer key) and a directed edge being JUST the destination node key (4 bytes each).
Then the raw data for each node is 4 + 600 * 4 = 2404 bytes, times 6,000,000 nodes = over 14.4 GB.
That's without any other overheads or any additional data in the nodes (or edges).
You appear to have very few edges considering the number of nodes - suggesting that most of the nodes aren't strictly necessary. So, instead of actually storing all of the nodes, why not use a sparse structure and only insert them when they're in use? This should be pretty easy to do with a dictionary; just don't insert a node until you use it in an edge.
The edges can be stored using an adjacency list on the nodes.
Of course, this only applies if you really mean 100-600 edges in total. If you mean that many per node, that's a completely different story.
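A minimal sketch of that sparse, insert-on-use idea (the node ids and edge attributes below are made up):

from collections import defaultdict

# sparse adjacency list: a node only appears once it participates in an edge
adjacency = defaultdict(list)

def add_edge(src, dst, **attrs):
    adjacency[src].append((dst, attrs))

add_edge(42, 7, weight=1.5)   # hypothetical node ids and attribute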
The scipy.sparse.csgraph package may be able to handle this -- 5 million nodes * 100 edges on average is 500 million pairs, at 8 bytes per pair (two integer IDs) is just about 4GB. I think csgraph uses compression so it will use less memory than that; this could work on your laptop.
csgraph doesn't have as many features as networkx but it uses waaay less memory.
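A rough sketch of how the graph could be assembled, assuming the edges arrive as integer id arrays (the array contents below are placeholders):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

n_nodes = 5_000_000
src = np.array([0, 0, 1, 2], dtype=np.int32)      # replace with your real edge sources
dst = np.array([1, 2, 2, 3], dtype=np.int32)      # replace with your real edge targets
weights = np.ones(len(src), dtype=np.float32)     # or a real per-edge attribute

# CSR storage: roughly one value plus one column index per edge,
# far less overhead than a dict-of-dicts graph
adj = csr_matrix((weights, (src, dst)), shape=(n_nodes, n_nodes))

n_components, labels = connected_components(adj, directed=True)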
Assuming you mean 600 per node, you could try something like this:
import os.path
import cPickle

class LazyGraph:
    def __init__(self, folder):
        self.folder = folder

    def get_node(self, id):
        f = open(os.path.join(self.folder, str(id)), 'rb')
        node = cPickle.load(f)
        f.close()  # just being paranoid
        return node

    def set_node(self, id, node):
        f = open(os.path.join(self.folder, str(id)), 'wb')
        cPickle.dump(node, f, -1)  # use highest protocol
        f.close()  # just being paranoid
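A minimal usage sketch (the folder, node id, and stored structure are all hypothetical):

graph = LazyGraph('/tmp/graph_nodes')            # the folder must already exist
graph.set_node(42, {'neighbours': [7, 13], 'attrs': {'weight': 1.5}})
node = graph.get_node(42)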
Use arrays (or numpy arrays) to hold the actual node ids, as they are faster.
Note, this will be very very slow.
You could use threading to pre-fetch nodes (assuming you knew which order you were processing them in), but it won't be fun.
Sounds like you need a database and an iterator over the results. Then you wouldn't have to keep it all in memory at the same time but you could always have access to it.
If you do decide to use some kind of database after all, I suggest looking at neo4j and its python bindings. It's a graph database capable of handling large graphs. Here's a presentation from this year's PyCon.