I'm developing an application in which I need a structure to represent a huge graph (between 1000000 and 6000000 nodes and 100 or 600 edges per node) in memory. The edges representation will contain some attributes of the relation.
I have tried a memory map representation, arrays, dictionaries and strings to represent that structure in memory, but these always crash because of the memory limit.
I would like some advice on how I can represent this, or something similar.
By the way, I'm using Python.
If that is 100-600 edges/node, then you are talking about up to 3.6 billion edges (6,000,000 nodes x 600 edges).
Why does this have to be all in memory?
Can you show us the structures you are currently using?
How much memory are you allowed (what is the memory limit you are hitting)?
If the only reason you need this in memory is that you need to read and write it fast, then use a database. Databases read and write extremely fast, and they can often serve reads from cache without going to disk at all.
Depending on your hardware resources, an all-in-memory representation of a graph this size is probably out of the question. Two possible options from a graph-specific DB point of view are:
Neo4j - claims to easily handle billions of nodes, and it's been in development a long time.
FlockDB - newly released by Twitter, this is a distributed graph database.
Since you're using Python, have you looked at NetworkX? Out of interest, if you have, how far did you get loading a graph of this size?
I doubt you'll be able to use a memory structure unless you have a LOT of memory at your disposal:
Assume you are talking about 600 directed edges from each node, with a node being 4-bytes (integer key) and a directed edge being JUST the destination node keys (4 bytes each).
Then the raw data for each node is 4 + 600 * 4 = 2404 bytes; multiplied by 6,000,000 nodes, that is just over 14.4 GB.
That's without any other overheads or any additional data in the nodes (or edges).
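The same estimate can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above, under the same assumptions (4-byte node keys, edges stored as bare 4-byte destination keys); the function name is made up for illustration.

```python
# Back-of-the-envelope size of a packed adjacency structure.
# Assumes 4-byte node keys and edges stored only as 4-byte destination keys.
NODE_KEY_BYTES = 4
EDGE_BYTES = 4

def raw_graph_bytes(num_nodes, edges_per_node):
    """Return the raw payload size in bytes, ignoring all container overhead."""
    per_node = NODE_KEY_BYTES + edges_per_node * EDGE_BYTES
    return num_nodes * per_node

worst_case = raw_graph_bytes(6_000_000, 600)
best_case = raw_graph_bytes(1_000_000, 100)
print(f"worst case: {worst_case / 1e9:.1f} GB")
print(f"best case:  {best_case / 1e9:.1f} GB")
```

Plain Python lists and dicts of int objects cost several times more than this per entry, so the figures are a lower bound rather than a prediction.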
You appear to have very few edges considering the number of nodes, which suggests that most of the nodes aren't strictly necessary. So, instead of actually storing all of the nodes, why not use a sparse structure and only insert them when they're in use? This should be pretty easy to do with a dictionary; just don't insert the node until you use it for an edge.
The edges can be stored using an adjacency list on the nodes.
Of course, this only applies if you really mean 100-600 nodes in total. If you mean per node, that's a completely different story.
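A minimal sketch of that sparse idea, assuming integer node ids and an optional attribute dict per edge (the class and method names are illustrative, not from any library):

```python
from collections import defaultdict

class SparseGraph:
    """Adjacency lists keyed by node id; nodes exist only once an edge uses them."""

    def __init__(self):
        self._adj = defaultdict(list)

    def add_edge(self, src, dst, attrs=None):
        # The source node is created on first use; the destination is
        # registered too so that it is known even with no outgoing edges.
        self._adj[src].append((dst, attrs))
        self._adj.setdefault(dst, [])

    def neighbors(self, node):
        # .get avoids creating an entry for a node that was never used.
        return [dst for dst, _ in self._adj.get(node, [])]

    def __len__(self):
        return len(self._adj)

g = SparseGraph()
g.add_edge(7, 42, {"weight": 0.5})
g.add_edge(7, 99)
print(len(g), g.neighbors(7), g.neighbors(12345))
```

Only the three nodes that appear in an edge are stored, however large the id space is.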
The scipy.sparse.csgraph package may be able to handle this: 5 million nodes * 100 edges on average is 500 million pairs, which at 8 bytes per pair (two integer IDs) is about 4 GB. csgraph works on SciPy's compressed sparse matrix formats, so I think it will use less memory than that; this could work on your laptop.
csgraph doesn't have as many features as networkx but it uses waaay less memory.
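To see where the memory saving in a compressed sparse layout comes from, here is a stdlib-only sketch of a CSR-style structure. It is not scipy itself, just the same idea: one offset per node plus one destination id per edge, instead of two ids per edge. The function names are made up for illustration, and edge attributes are left out.

```python
from array import array

def build_csr(num_nodes, edges):
    """Build CSR-style arrays (indptr, indices) from (src, dst) pairs."""
    # Count outgoing edges per node, then turn the counts into row offsets.
    indptr = array("q", [0] * (num_nodes + 1))
    for src, _ in edges:
        indptr[src + 1] += 1
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]
    # Fill destination ids into each node's slice of `indices`.
    indices = array("I", [0] * len(edges))
    cursor = indptr[:num_nodes]  # copy: next free slot for each node
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return indptr, indices

def neighbors(indptr, indices, node):
    return list(indices[indptr[node]:indptr[node + 1]])

edges = [(0, 1), (0, 2), (2, 0), (1, 2)]
indptr, indices = build_csr(3, edges)
print(list(indptr), list(indices), neighbors(indptr, indices, 0))
print("bytes per edge id:", indices.itemsize)
```

With a 4-byte id type, 500 million edges need roughly 2 GB for the destination array plus a comparatively small offset array.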
Assuming you mean 600 per node, you could try something like this:
import os.path
import pickle  # this was cPickle on Python 2

class LazyGraph:
    """Keeps each node in its own pickle file and loads it on demand."""

    def __init__(self, folder):
        self.folder = folder

    def get_node(self, id):
        # `with` closes the file even if unpickling fails
        with open(os.path.join(self.folder, str(id)), 'rb') as f:
            return pickle.load(f)

    def set_node(self, id, node):
        with open(os.path.join(self.folder, str(id)), 'wb') as f:
            pickle.dump(node, f, -1)  # use highest protocol
Use arrays (or numpy arrays) to hold the actual node ids, as they are faster.
Note, this will be very very slow.
You could use threading to pre-fetch nodes (assuming you knew which order you were processing them in), but it won't be fun.
Sounds like you need a database and an iterator over the results. Then you wouldn't have to keep it all in memory at the same time but you could always have access to it.
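As a rough illustration of "a database and an iterator", here is a sketch using the sqlite3 module from the standard library. The table layout (src, dst, weight) and the function names are assumptions made for the example; a dedicated graph database would look different.

```python
import sqlite3

def open_graph(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edges ("
        "src INTEGER NOT NULL, dst INTEGER NOT NULL, weight REAL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS edges_src ON edges (src)")
    return conn

def add_edges(conn, rows):
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", rows)

def iter_neighbors(conn, node):
    # The cursor yields rows lazily, so a node's edges are never all
    # materialised unless the caller builds a list from them.
    cur = conn.execute(
        "SELECT dst, weight FROM edges WHERE src = ? ORDER BY dst", (node,)
    )
    for dst, weight in cur:
        yield dst, weight

conn = open_graph()
add_edges(conn, [(1, 2, 0.5), (1, 3, 0.25), (2, 3, 1.0)])
print(list(iter_neighbors(conn, 1)))
```

Passing a file path instead of ":memory:" keeps the graph on disk while the index on src keeps per-node lookups fast.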
If you do decide to use some kind of database after all, I suggest looking at Neo4j and its Python bindings. It's a graph database capable of handling large graphs. Here's a presentation from this year's PyCon.
JSON isn't necessarily a high efficiency structure to store data in terms of bytes of overhead and parsing. There's a logical parsing structure, for example, based on syntax rather than being able to look up a specific segment. Let's say you have 20 years of timestep data, ~ 1TB compressed and want to be able to store it efficiently and load / store it as fast as possible for maximum speed simulation.
At first I tried relational databases, but those are actually not that fast - they're designed to load over a network, not locally, and the OSI model has overhead.
I was able to speed this up by creating a custom binary data structure with defined block sizes and header indexes, sort of like a file system, but this was time consuming and highly specified for a single type of data, for example fixed length data nodes. Editing the data wasn't a feature, it was a one time export spanning days of time. I'm sure some library could do it better.
I learned about Pandas, but it seems to load from and save to CSV and JSON most commonly, and both of those are plain text, so storing an int takes one byte per digit rather than letting me choose, for example, a 32-bit unsigned int.
What's the right tool? Can Pandas do this, or is there something better?
I need to be able to specify the data type for each property being stored, so if I only need a 16-bit int, that's the space that gets used.
I need to be able to stream reads and writes over big (1-10 TB) data as fast as the hardware fundamentally allows.
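For reference, the typed fixed-width requirement can be sketched with nothing but the standard struct module. The record layout below (uint32 timestamp, int16 reading, float32 value) is a hypothetical example, not the real data, and the function names are made up for illustration.

```python
import io
import struct

# Hypothetical record: uint32 timestamp, int16 reading, float32 value.
# "<" means little-endian with no padding, so each record is 10 bytes.
RECORD = struct.Struct("<Ihf")

def write_records(stream, rows):
    for row in rows:
        stream.write(RECORD.pack(*row))

def read_records(stream, chunk_records=4096):
    """Yield tuples from the stream, reading a fixed number of records at a time."""
    chunk_bytes = RECORD.size * chunk_records
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        # iter_unpack requires the chunk length to be a multiple of RECORD.size.
        yield from RECORD.iter_unpack(chunk)

def read_record_at(stream, index):
    # Fixed-width records allow random access by seeking to index * size.
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))

buf = io.BytesIO()
write_records(buf, [(1, -5, 1.5), (2, 7, 2.25)])
buf.seek(0)
print(RECORD.size, list(read_records(buf)), read_record_at(buf, 1))
```

A real file opened with open(path, "rb") can stand in for the BytesIO buffer; the chunk size is a tuning knob to trade memory against the number of read calls.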
I have multiple node lists and edge lists which together form a large graph; let's call that the maingraph. My current strategy is to first read all the node lists and import them with add_vertices. Every node then gets an internal ID which depends on the order in which nodes are ingested and therefore isn't very reliable (as I've read, if you delete one node, all IDs higher than the deleted one change). I assign every node a name attribute, which corresponds to the external ID I use so I can keep track of my nodes between frameworks, and a type attribute.
Now, how do I add the edges? When I read an edge list it starts making a new graph (subgraph), and hence the internal IDs start at 0 again. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist()) inevitably fails.
It is possible to work around this and use the name attribute from both maingraph and subgraph to find out which internal ID each edge's incident nodes have in the maingraph:
def _get_real_source_and_target_id(edge):
    ''' takes an edge from the to-be-added subgraph and gets the ids of the corresponding nodes in the
    maingraph by their name '''
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id, target_id)
And then I tried
edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)
But that is hoooooorribly slow. The graph has millions of nodes and edges and takes 10 seconds to load with the fast but incorrect maingraph.add_edges(subgraph.get_edgelist()) approach. With the correct approach explained above, it takes minutes (I usually stop it after 5 minutes or so). I will have to do this tens of thousands of times. I switched from NetworkX to igraph because of the fast loading, but it doesn't really help if I have to do it like this.
Does anybody have a more clever way to do this? Any help much appreciated!
Thanks!
Never mind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the nodes' names as strings, which somehow did funny stuff when the names were incrementing numbers with more than five figures (see my issue report here). Therefore the above solution got stuck when it tried to look up the nodes whose names numpy had messed up. maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldn't find the node. Now I use pandas to read the node names and it works fine.
The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it here in case it helps someone. Feel free to delete it otherwise.
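If the per-edge select calls ever do become the bottleneck, one possible alternative is to build a name-to-index dictionary once and translate every edge through it. The sketch below uses plain lists so it runs without igraph; the comments mark what each list would stand in for, and remap_edges is a made-up helper name.

```python
def remap_edges(sub_names, sub_edges, main_names):
    """Translate subgraph edge endpoints into maingraph vertex indices."""
    # Build the lookup once instead of searching by name for every endpoint.
    index_of = {name: idx for idx, name in enumerate(main_names)}
    return [
        (index_of[sub_names[src]], index_of[sub_names[dst]])
        for src, dst in sub_edges
    ]

main_names = ["a", "b", "c", "d"]  # stands in for maingraph.vs["name"]
sub_names = ["c", "a"]             # stands in for subgraph.vs["name"]
sub_edges = [(0, 1)]               # stands in for subgraph.get_edgelist()
print(remap_edges(sub_names, sub_edges, main_names))
```

A name that is missing from the maingraph raises KeyError immediately, which would also have exposed the mangled node names right away.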
I have a tree data structure that I want to save to disk. Thus, HDF5 with its internal tree structure seemed to be the perfect candidate. However, so far the data overhead is massive, by a factor of 100!
A test tree contains roughly 100 nodes, where leaves usually contain no more than 2 or 3 data items (like doubles). If I take the entire tree and just pickle it, it is about 21kB large. Yet, if I use PyTables and map the tree structure one to one to the HDF5 file, the file takes 2.4MB (!) disk space. Is the overhead that big?
The problem is that the overhead does not seem to be constant but scales linearly with the size of my tree data (both when I add nodes and when I add data per leaf, i.e. more rows in the leaf tables).
Did I miss something regarding PyTables, like enabling compression (I thought PyTables does it by default)? What could possibly be the reason for this massive overhead?
Thanks a lot!
OK, so I have found a way to massively reduce the file size. The point is that, despite my prior beliefs, PyTables does NOT apply compression by default.
You can enable it by using Filters.
Here is an example of how that works:
import tables as pt  # the PyTables package is imported as `tables`

hdf5_file = pt.open_file(filename='myhdf5file.h5',
                         mode='a',
                         title='How to compress data')
# for PyTables < 3 the method is called `openFile`,
# other methods are named analogously (e.g. `createTable`)
myfilters = pt.Filters(complevel=9, complib='zlib')
mydescription = {'mycolumn': pt.IntCol()}  # simple one-column table
mytable = hdf5_file.create_table(where='/', name='mytable',
                                 description=mydescription,
                                 title='My Table',
                                 filters=myfilters)
# Now you can happily fill the table...
The important line here is Filters(complevel=9, complib='zlib'). It specifies the compression level complevel and the compression algorithm complib. By default the level is 0, which means compression is disabled, whereas 9 is the highest compression level. For details on how compression works, see the PyTables reference documentation for Filters.
Next time, I better stick to RTFM :-) (although I did, but I missed the line "One of the beauties of PyTables is that it supports compression on tables and arrays, although it is not used by default")
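To get a feel for what the compression level means when complib is zlib, the effect can be reproduced with the zlib module from the standard library. This is plain zlib on a made-up payload, not PyTables, so it says nothing about HDF5's own per-node overhead.

```python
import struct
import zlib

# 10,000 small integers packed as 4-byte ints: highly repetitive bytes.
payload = b"".join(struct.pack("<i", n % 50) for n in range(10_000))

stored = zlib.compress(payload, 0)  # level 0: no compression, framing only
packed = zlib.compress(payload, 9)  # level 9: strongest, slowest

print(len(payload), len(stored), len(packed))
assert zlib.decompress(packed) == payload
```

At level 0 the output is slightly larger than the input because only framing is added, which mirrors what happens to table data when complevel is left at its default.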
I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset by a 59th variable, which is calculated from the other 58 variables. The variable is a floating-point number with four decimal places.
There are two possible approaches:
1. The normal merge sort.
2. While calculating the 59th variable, I start sending values in particular ranges to particular nodes. I sort the ranges on those nodes and then combine them in the reducer. Since each range is already perfectly sorted and I know where each set of data belongs, merging basically becomes appending.
Which is a better approach and why?
I'll assume that you are looking for a total sort order without a secondary sort for all your rows. I should also mention that 'better' is never a good question, since there is typically a trade-off between time and space, and in Hadoop we tend to think in terms of space rather than time unless you use products that are optimized for time (Teradata has the capability of putting databases in memory for Hadoop use).
Of the two approaches you mention, I think only one would work within the Hadoop infrastructure: number 2. Since Hadoop uses many nodes to do one job, sorting becomes a little trickier to implement, and we typically want the 'shuffle and sort' phase of MR to take care of it, since distributed sorting is at the heart of the programming model.
At the point when the 59th variable is generated, you would want to sample the distribution of that variable so that you can send it through the framework and then merge as you mentioned. Consider the case where a single range of the variable contains 80% of your values. Partitioning naively would send 80% of your data to one reducer, which would do most of the work. This assumes, of course, that some keys will be grouped in the shuffle and sort phase, which would be the case unless you made them unique. It's up to the programmer to set up partitioners that distribute the load evenly by sampling the key distribution.
If, on the other hand, we were to sort in memory, we could accomplish the same thing during reduce, but there are inherent scalability issues: the sort is only as good as the memory available on the node running it, and performance falls off quickly once it has to go to HDFS for the data that did not fit into memory. And if you ignore the sampling issue, you will likely run out of memory unless your key-value pairs are evenly distributed and you understand how your data size relates to the memory capacity.
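The sampling idea can be sketched in plain Python, independent of Hadoop: take a sample of the keys, pick split points at its quantiles, and route each key to the range it falls in. The function names and the skewed test data are made up for illustration; in Hadoop itself a sampled total-order partitioner (TotalOrderPartitioner) plays this role.

```python
import bisect
import random

def choose_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points at evenly spaced sample quantiles."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, boundaries):
    # Partitions are ordered ranges, so concatenating the individually
    # sorted partitions yields a total order.
    return bisect.bisect_right(boundaries, key)

random.seed(0)
# Skewed keys: about 80% fall in [0, 1), the rest spread over [1, 100).
keys = [round(random.random(), 4) if random.random() < 0.8
        else round(random.uniform(1, 100), 4) for _ in range(100_000)]
boundaries = choose_boundaries(random.sample(keys, 1_000), 4)
counts = [0] * 4
for k in keys:
    counts[partition_for(k, boundaries)] += 1
print(boundaries, counts)
```

Equal-width ranges over [0, 100) would put most keys into the first partition, whereas the sampled split points keep the four partitions roughly the same size.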
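Check out the "Hadoop Comparator Class" part of the Hadoop Streaming wiki page.
You can move the datasets to HDFS, write a mapper in Python, and run it as a Hadoop Streaming job. Hadoop Streaming will sort the mapper output by key for you during the shuffle. Note that the sort only happens when the job has at least one reduce task (an identity reducer is enough); a job with zero reducers writes the mapper output unsorted.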
Then you can use hdfs dfs -getmerge and -copyToLocal to move the sorted records back to local if you want.
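A streaming mapper for this could look roughly like the sketch below. The derived_key formula is a hypothetical stand-in for the real 59th-variable calculation, comma-separated input is assumed, and the key is written as zero-padded text so that the default text comparison of keys agrees with numeric order for non-negative values.

```python
def derived_key(fields):
    # Hypothetical stand-in for the real 59th-variable formula.
    return sum(fields) / len(fields)

def map_line(line):
    """Turn one comma-separated input row into 'key<TAB>row' for the shuffle."""
    row = line.rstrip("\n")
    fields = [float(x) for x in row.split(",")]
    # Fixed width with leading zeros: string order matches numeric order.
    return f"{derived_key(fields):020.4f}\t{row}"

def run_mapper(lines):
    for line in lines:
        if line.strip():
            yield map_line(line)

# In a real streaming mapper the input would be sys.stdin and each result
# would be printed; a list is used here so the sketch runs anywhere.
for out in run_mapper(["3,1,2\n", "10,20,30\n"]):
    print(out)
```

Negative keys would need a different encoding, or a numeric key comparator such as the one described on the wiki page mentioned above.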
I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get either added or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to add (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
Things that kind of suck:
Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
How about using one hash table or B-tree per bucket?
On-disk hash tables are standard. Maybe the BerkeleyDB libraries (available in stock Python 2 as the bsddb module) will work for you, but be advised that since they come with transactions they can be slow and may require some tuning. There are a number of other choices, such as gdbm and tdb, all of which you should give a try. Just make sure you check out the API and initialize them with an appropriate size. Some will not resize automatically, and if you feed them too much data their performance drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of 32-bit ints fits in one 64-bit long, and most databases should accept a long as a key; in fact many will accept arbitrary byte sequences as keys.
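Packing a pair into a single key could be done like this. It is a sketch that assumes both ids are unsigned 32-bit values and that the pair is unordered (the smaller id goes first, as in the question's example); the helper names are made up.

```python
import struct

def pair_to_key(a, b):
    """Pack an unordered pair of uint32 ids into one 64-bit integer key."""
    lo, hi = (a, b) if a <= b else (b, a)
    return (lo << 32) | hi

def key_to_pair(key):
    return key >> 32, key & 0xFFFFFFFF

def key_to_bytes(key):
    # Big-endian so that byte-wise ordering in a B-tree matches numeric order.
    return struct.pack(">Q", key)

key = pair_to_key(15378, 3278)
print(key, key_to_pair(key), key_to_bytes(key).hex())
```

The byte form is what a key-value store that only accepts byte strings would take as its key.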
Why not just store a table containing stuff that was deleted since the last re-write?
This table could be the same structure as your main bucket, maybe with a Bloom filter for quick membership checks.
You can re-write the main bucket data without the deleted items either when you were going to re-write it anyway for some other modification, or when the ratio of deleted items:bucket size exceeds some threshold.
This scheme could work either by storing each deleted pair alongside each bucket, or by storing a single table for all deleted documents: I'm not sure which is a better fit for your requirements.
Keeping a single table, it's hard to know when you can remove an item unless you know how many buckets it affects, without just re-writing all buckets whenever the deletion table gets too large. This could work, but it's a bit stop-the-world.
You also have to do two checks for each pair you stream in (i.e., for (3278, 15378), you'd check whether either 3278 or 15378 has been deleted, instead of just checking whether the pair (3278, 15378) has been deleted).
Conversely, the per-bucket table of each deleted pair would take longer to build, but be slightly faster to check, and easier to collapse when re-writing the bucket.
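A minimal sketch of the single-table variant, using a plain Python set as the deletion table (a Bloom filter could sit in front of it for large tables). The names and the 0.2 threshold are hypothetical, chosen only for illustration.

```python
def live_pairs(pairs, deleted_ids):
    """Yield only pairs whose two document ids are both still live."""
    for a, b in pairs:
        # Two membership checks per pair, as with a single per-document table.
        if a not in deleted_ids and b not in deleted_ids:
            yield a, b

def needs_rewrite(num_deleted_pairs, bucket_size, threshold=0.2):
    # threshold is a hypothetical tuning knob, not a recommended value.
    return bucket_size > 0 and num_deleted_pairs / bucket_size > threshold

bucket = [(3278, 15378), (12, 99), (99, 3278), (5, 6)]
deleted = {99}
kept = list(live_pairs(bucket, deleted))
print(kept, needs_rewrite(len(bucket) - len(kept), len(bucket)))
```

Because live_pairs is a generator, it can wrap the existing streaming reader, so deleted pairs are filtered out without holding a bucket in memory.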
You are trying to reinvent what already exists in new-age NoSQL data stores.
There are two very good candidates for your requirements:
Redis
MongoDB
Both support data structures like dictionaries, lists and queues. Operations like append, modify and delete are also available in both, and are very fast.
The performance of both is driven by the amount of data that can reside in RAM.
Since most of your data is integer based, that should not be a problem.
My personal suggestion is to go with Redis, with a good persistence configuration (i.e. the data should periodically be saved from RAM to disk).
Here is a brief overview of Redis data structures:
http://redis.io/topics/data-types-intro
Redis is a lightweight installation, and a client is available for Python.