How can I implement a recursive neural network in TensorFlow? - python

Is there some way of implementing a recursive neural network like the one in [Socher et al. 2011] using TensorFlow?
Note that this is different from recurrent neural networks, which are nicely supported by TensorFlow.
The difference is that the network is not replicated into a linear sequence of operations, but into a tree structure.
I imagine that I could use the While op to construct something like a breadth-first traversal of the tree data structure for each entry of my dataset.
Maybe it would be possible to implement tree traversal as a new C++ op in TensorFlow, similar to While (but more general)?

Your guess is correct: you can use tf.while_loop and tf.cond to represent the tree structure in a static graph. More info:
https://github.com/bogatyy/cs224d/tree/master/assignment3
In my evaluation, it makes training 16x faster compared to re-building the graph for every new tree.
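For a rough idea of what that looks like, here is a simplified sketch in the TF1-era graph API (it is not the code from the repository above; the placeholder names and the bottom-up node ordering are assumptions). The tree is flattened into arrays with children ordered before their parents, and node states are accumulated in a TensorArray:

# simplified sketch, TF1-style graph API
import tensorflow as tf

DIM = 64
n_nodes = tf.placeholder(tf.int32, [])               # nodes in this tree
is_leaf = tf.placeholder(tf.bool, [None])             # one flag per node
left = tf.placeholder(tf.int32, [None])               # child indices (unused for leaves)
right = tf.placeholder(tf.int32, [None])
leaf_emb = tf.placeholder(tf.float32, [None, DIM])    # one row per node; rows for internal nodes unused

W = tf.get_variable("W", [2 * DIM, DIM])
b = tf.get_variable("b", [DIM])

def body(i, states):
    def leaf():
        return leaf_emb[i]
    def internal():
        h = tf.concat([states.read(left[i]), states.read(right[i])], axis=0)
        return tf.tanh(tf.matmul(tf.reshape(h, [1, -1]), W)[0] + b)
    return i + 1, states.write(i, tf.cond(is_leaf[i], leaf, internal))

states = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)
_, states = tf.while_loop(lambda i, s: i < n_nodes, body, [tf.constant(0), states])
root_state = states.read(n_nodes - 1)                 # root assumed last in the ordering

With eager execution in current TensorFlow you can instead write the recursion directly in Python, which sidesteps the static-graph machinery entirely.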

Currently, these models are very hard to implement efficiently and cleanly in TensorFlow because the graph structure depends on the input. That also makes it very hard to do minibatching. It is possible using things like the while loop you mentioned, but doing it cleanly isn't easy.
You can build a new graph for each example, but this will be very annoying. If, for a given input size, you can enumerate a reasonably small number of possible graphs you can select between them and build them all at once, but this won't be possible for larger inputs.
You can also route examples through your graph with complicated tf.gather logic and masks, but this can also be a huge pain.
Ultimately, building the graph on the fly for each example is probably the easiest, and there is a chance that there will be alternatives in the future that support a more imperative style of execution. But as of v0.8 I would expect this to be a bit annoying and to introduce some overhead, as Yaroslav mentions in his comment.
Edit: Since I answered, here is an example using a static graph with while loops: https://github.com/bogatyy/cs224d/tree/master/assignment3
I am not sure how performant it is compared to custom C++ code for models like this, although in principle it could be batched.

Related

Model for generating and detecting communities in dense network

I have a complete undirected weighted graph. Think of a graph where persons are nodes and the edge (u,v,w) indicates the kind of relationship between u and v with weight w. w can take one of the values 1 (they don't know each other, hence the completeness), 2 (acquaintances), or 3 (friends). These relationships naturally form clusters based on the edge weight.
My goal is to define a model that captures this phenomenon, from which I can sample graphs and reproduce the behaviour observed in reality.
So far I've played with stochastic block models (https://graspy.neurodata.io/tutorials/simulations/sbm.html), since there are some papers about the use of these generative models for community-detection tasks. However, I may be overlooking something, since I can't seem to fully represent what I need: g = sbm(list_of_params), where g is complete and there are some discernible clusters among nodes sharing weight 3.
At this point I am not even sure whether sbm is the best approach for this task.
I am also assuming that everything that graph-tool can do, graspy can also do; that was my impression when I first read about both.
Summarizing:
Is there a way to generate a stochastic block model in graspy that yields a complete undirected weighted graph?
Is sbm the best model for the task, or should I be looking at gmm?
Thanks
Is there a way to generate a stochastic block model in graspy that yields a complete undirected weighted graph?
Yes, but as pointed out in the comments above, that's a strange way to specify the model. If you want to benefit from the deep literature on community detection in social networks, you should not use a complete graph. Do what everyone else does: The presence (or absence) of an edge should indicate a relationship (or lack thereof), and an optional weight on the edge can indicate the strength of the relationship.
To generate graphs from SBM with weights, use this function:
https://graspy.neurodata.io/reference/simulations.html#graspologic.simulations.sbm
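A rough, untested sketch of how a complete, weighted two-block graph might be set up with that function. The block sizes, probabilities, and weight distributions below are made up, and the exact wt/wtargs calling convention should be checked against the linked docs:

import numpy as np
from graspologic.simulations import sbm

n = [30, 30]                     # two communities of 30 nodes each
p = [[1.0, 1.0], [1.0, 1.0]]     # edge probability 1 everywhere -> complete graph

# weight sampler per block pair: mostly friends (3) within blocks,
# mostly strangers (1) across blocks
wt = [[np.random.choice, np.random.choice],
      [np.random.choice, np.random.choice]]
wtargs = [[dict(a=[1, 2, 3], p=[0.1, 0.2, 0.7]), dict(a=[1, 2, 3], p=[0.7, 0.2, 0.1])],
          [dict(a=[1, 2, 3], p=[0.7, 0.2, 0.1]), dict(a=[1, 2, 3], p=[0.1, 0.2, 0.7])]]

A = sbm(n=n, p=p, wt=wt, wtargs=wtargs)   # dense weighted adjacency matrix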
I am also assuming that everything that graph-tool can do, graspy can also do.
This is not true. There are (at least) two different popular methods for inferring the parameters of an SBM. Unfortunately, the practitioners of each method seem to avoid citing each other in their papers and code.
graph-tool uses an MCMC statistical inference approach to find the optimal graph partitioning.
graspologic (formerly graspy) uses a trick related to spectral clustering to find the partitioning.
From what I can tell, the graph-tool approach offers more straightforward and principled model selection methods. It also has useful extensions, such as overlapping communities, nested (hierarchical) communities, layered graphs, and more.
I'm not as familiar with the graspologic (spectral) methods, but -- to me -- they seem more difficult to extend beyond merely seeking a point estimate for the ideal community partitioning. You should take my opinion with a hefty bit of skepticism, though. I'm not really an expert in this space.

Python - go beyond RAM limits?

I'm trying to analyze text, but my Mac's RAM is only 8 gigs, and the RidgeRegressor just stops after a while with Killed: 9. I reckon this is because it needs more memory.
Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?
You will need to do it manually.
There are probably two different core-problems here:
A: holding your training-data
B: training the regressor
For A, you can try numpy's memmap which abstracts swapping away.
As an alternative, consider preparing your data to HDF5 or some DB. For HDF5, you can use h5py or pytables, both allowing numpy-like usage.
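Minimal sketches of both options; the file names, dtypes, and shapes below are placeholders:

import numpy as np
import h5py

# numpy memmap: the array lives on disk; slices are paged in on demand
X = np.memmap("features.dat", dtype="float32", mode="r", shape=(2_000_000, 300))
batch = X[:10_000]                      # only this slice is pulled into RAM

# HDF5 via h5py: same idea, numpy-like slicing against a dataset on disk
with h5py.File("features.h5", "r") as f:
    batch = f["X"][:10_000]             # reads just the requested rows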
For B: it's a good idea to use some out-of-core ready algorithm. In scikit-learn those are the ones supporting partial_fit.
Keep in mind that this training process decomposes into at least two new concerns:
memory efficiency: swapping is slow, so you don't want something that holds N^2 auxiliary memory during learning
convergence efficiency
The algorithms in the link above should be okay on both counts.
SGDRegressor can be parameterized to resemble RidgeRegression.
Also: you may need to drive partial_fit manually, obeying the rules of the algorithm (convergence proofs often require some kind of random ordering of the samples). The problem with abstracting swapping away is: if your regressor does a permutation in each epoch without knowing how costly that is, you might be in trouble!
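Putting A and B together, a rough sketch of the scikit-learn out-of-core loop (file names, shapes, and hyperparameters are illustrative; note the manual shuffling of chunk order each epoch):

import numpy as np
from sklearn.linear_model import SGDRegressor

# squared loss (the default) + penalty="l2" makes SGDRegressor approximate Ridge
reg = SGDRegressor(penalty="l2", alpha=1e-4)

X = np.memmap("features.dat", dtype="float32", mode="r", shape=(2_000_000, 300))
y = np.memmap("targets.dat", dtype="float32", mode="r", shape=(2_000_000,))

n_epochs, chunk = 5, 10_000
for epoch in range(n_epochs):
    order = np.random.permutation(X.shape[0] // chunk)   # random chunk order each epoch
    for i in order:
        sl = slice(i * chunk, (i + 1) * chunk)
        reg.partial_fit(np.asarray(X[sl]), np.asarray(y[sl]))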
Because the problem itself is quite hard, there are some special libraries built for this, while sklearn needs some more manual work as explained. One of the most extreme ones (a lot of crazy tricks) might be vowpal_wabbit (where IO is often the bottleneck!). Of course there are other popular libs like pyspark, serving a slightly different purpose (distributed computing).

How can I make my neural network emphasize that some data is more important than the rest?

I looked around online but couldn't find anything, but I may well have missed a piece of literature on this. I am running a basic neural net on a 289 component vector to produce a 285 component vector. In my input, the last 4 pieces of data are critical to change the rest of the input into the resultant 285 for the output. That is to say, the input is 285 + 4, such that the 4 morph the rest of the input into the output.
But when running a neural network on this, I am not sure how to reflect this. Would I need to use convolution on the rest of the input? I want my system to emphasize the 4 data points that critically affect the other 285. I am still new to all of this, so a few pointers would be great!
Again, if there is something already written on this, then that would be awesome too.
I don't think you have any reason to do this, since the network will infer that on its own. The weight for each input will be reduced or enhanced according to its importance for the output.
What you could do, though, is to have a preliminary network that takes the 285 components as input, and then a second network that takes the 4 critical components plus the output of the preliminary network as input, as in the diagram below.
[285 compo.]---[neural network]---+---[neural network]---[output 285 compo.]
                                  |
[4 compo.]------------------------+
For instance, you could treat a picture with convolution networks and then add some meta information later in a fully connected network to process everything.
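A minimal sketch of that two-stage wiring, written here with the Keras functional API; the layer sizes are arbitrary, and this only illustrates the topology, not the answerer's exact setup:

from tensorflow import keras
from tensorflow.keras import layers

x_285 = keras.Input(shape=(285,), name="main_input")
x_4 = keras.Input(shape=(4,), name="critical_input")

# preliminary network: summarize the 285 components
h = layers.Dense(128, activation="relu")(x_285)
h = layers.Dense(64, activation="relu")(h)

# second network: the 4 critical values enter here, next to the summary
z = layers.Concatenate()([h, x_4])
z = layers.Dense(128, activation="relu")(z)
out = layers.Dense(285)(z)

model = keras.Model(inputs=[x_285, x_4], outputs=out)
model.compile(optimizer="adam", loss="mse")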
The neural network should more or less learn this thing by itself. Especially with newer approaches like deep learning & friends, where the amount of hand-tuning is almost zero. However, this does assume that the function which you're trying to learn is learnable and that the system you use has enough power to learn it. That's a function of the complexity of the network involved (number of layers, nodes, types of activations etc.), the learning algorithms involved, as well as the data you supply.
It's really hard to tell without knowing more about the domain you're addressing. What sort of signals are we talking about (I assume they're signals, since you speak of convolution)? What are the four inputs about? I assume they have a different modality than the other 285.
Perhaps this doc will help a little bit though.
Theoretically, you can let the network try to learn this relationship. However, there are good reasons to try to rethink the way you're formulating the problem. Also, the difficulty a neural network will have learning this function is going to depend strongly on your specific problem (and the best way to figure it out is probably just to try it and find out).
Let me try to help by making an analogy to a simpler problem: let's take your 289-element vector and assume that 285 elements take values from -1 to 1 and the remaining four take values from -1000 to 1000. This maintains your original premise: that the four variables are somehow far more important in determining the output than the 285. (I understand that this loses the coupled relationship between the variables, but let's run with the example anyways.)
This is a simpler example for two reasons:
it's easier to see why it's harder to learn
there are a bag of well-understood tricks to solve it
Compared to a scenario where all 289 inputs have the same input range, a gradient descent algorithm will be slower to converge on the heterogeneous case. (Extra credit: try this!) Geoff Hinton has a rather famous set of slides which describes this effect fairly well: Lecture 6. I believe this is also part of a Coursera course now.
Hinton's slides also touch on two ways to attack this simpler version of the problem. The first is just to pre-process your inputs. If you scale down the inputs to have the same mean and variance, your gradient descent optimizer will converge more quickly. The other is to use a more powerful optimization method, specifically one with per-parameter adaptive learning rates, which handles this case as well as trickier scenarios. Andrej Karpathy's fantastic notes from Stanford's CS231n class are a good intro.
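As a concrete instance of the first trick, scikit-learn's StandardScaler is one convenient way to put every input column on the same scale; X_train and X_test below stand in for your own arrays:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # each column: zero mean, unit variance
X_test_scaled = scaler.transform(X_test)         # reuse the statistics fitted on the training set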
But let's tie this back to your problem: that there are four "special" variables which transform the entire input. Given enough time and input, it's possible that a network can learn this function. But understand that if this transformation is complex and makes the optimization landscape rough, your network will likely have some trouble dealing with it.
If there's a way to transform your representation of the problem to avoid this link, I'd say try to pursue that. If not, then be prepared to resort to some bigger guns to solve the problem.
Without knowing the specifics of your problem, it's hard to give more concrete advice. Plus, ultimately, you're the one that will be solving it, so you're going to be the expert eventually!
To emphasize particular elements of your input vector, you can give the network a compressed view of the less important part of the input.
Try encoding the first, less important 285 numbers into one number (or any vector size you like) with a multilayer neural network, then use that number together with the other 4 numbers as the input to a second network.
Example:
v1 = [1, 2, 3, ..., 285]               # the 285 less important components
v2 = [286, 287, 288, 289]              # the 4 critical components
v_out = Neural_network(input_vector=v1, neurons=[100, 1])              # 100 hidden units, one output
v_final = Neural_network(input_vector=v_out + v2, neurons=[100, 285])  # the encoding plus the 4 critical values

What scalability issues are associated with NetworkX?

I'm interested in network analysis on large networks with millions of nodes and tens of millions of edges. I want to be able to do things like parse networks from many formats, find connected components, detect communities, and run centrality measures like PageRank.
I am attracted to NetworkX because it has a nice api, good documentation, and has been under active development for years. Plus because it is in python, it should be quick to develop with.
In a recent presentation (the slides are available on github here), it was claimed that:
Unlike many other tools, NX is designed to handle data on a scale
relevant to modern problems...Most of the core algorithms in NX rely on extremely fast legacy code.
The presentation also states that the base algorithms of NetworkX are implemented in C/Fortran.
However, looking at the source code, it looks like NetworkX is mostly written in python. I am not too familiar with the source code, but I am aware of a couple of examples where NetworkX uses numpy to do heavy lifting (which in turn uses C/Fortran to do linear algebra). For example, the file networkx/networkx/algorithms/centrality/eigenvector.py uses numpy to calculate eigenvectors.
Does anyone know if this strategy of calling an optimized library like numpy is really prevalent throughout NetworkX, or if just a few algorithms do it? Also can anyone describe other scalability issues associated with NetworkX?
Reply from NetworkX Lead Programmer
I posed this question on the NetworkX mailing list, and Aric Hagberg replied:
The data structures used in NetworkX are appropriate for scaling to large problems (e.g. the data structure is an adjacency list). The algorithms have various scaling properties, but some of the ones you mention are usable (e.g. PageRank and connected components are linear complexity in the number of edges).
At this point NetworkX is pure Python code. The adjacency structure is encoded with Python dictionaries, which provides great flexibility at the expense of memory and computational speed. Large graphs will take a lot of memory and you will eventually run out.
NetworkX does use NumPy and SciPy for algorithms that are primarily based on linear algebra. In that case the graph is represented (copied) as an adjacency matrix using either NumPy matrices or SciPy sparse matrices. Those algorithms can benefit from the legacy C and FORTRAN code that is used under the hood in NumPy and SciPy.
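For reference, the operations the question and the reply mention look like this in plain NetworkX (the file name is just a placeholder):

import networkx as nx

G = nx.read_edgelist("edges.txt")                 # parse from an edge-list file
components = list(nx.connected_components(G))     # sets of node labels
ranks = nx.pagerank(G)                             # dict: node -> PageRank score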
This is an old question, but I think it is worth mentioning that graph-tool has a very similar functionality to NetworkX, but it is implemented in C++ with templates (using the Boost Graph Library), and hence is much faster (up to two orders of magnitude) and uses much less memory.
Disclaimer: I'm the author of graph-tool.
Your big issue will be memory. Python simply cannot handle tens of millions of objects without jumping through hoops in your class implementation: the per-object memory overhead is too high, you hit 2 GB, and 32-bit code won't work. There are ways around it, such as __slots__, arrays, or NumPy. It should be OK because networkx was written for performance, but if a few things don't perform well, I would check your memory usage first.
As for scaling, algorithms are basically the only thing that matters with graphs. Graph algorithms tend to have really ugly scaling if they are done wrong, and they are just as likely to be done right in Python as any other language.
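As an illustration of the __slots__ route mentioned above, a sketch of a node class without a per-instance __dict__ (the attribute names are made up):

class Node:
    __slots__ = ("label", "neighbors")   # no per-instance __dict__, so far less overhead per object

    def __init__(self, label):
        self.label = label
        self.neighbors = []               # labels/indices of adjacent nodes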
The fact that networkX is mostly written in Python does not mean that it is not scalable, nor does anyone claim it is perfect. There is always a trade-off: if you throw more money at your machines, you'll get as much scalability as you want, plus the benefits of using a pythonic graph library.
If not, there are other solutions (here and here), which may consume less memory (benchmark and see; I think igraph is fully C-backed, so it will), but you may miss the pythonic feel of NX.

What is the most efficient graph data structure in Python? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I need to be able to manipulate a large (10^7 nodes) graph in python. The data corresponding to each node/edge is minimal, say, a small number of strings. What is the most efficient, in terms of memory and speed, way of doing this?
A dict of dicts is more flexible and simpler to implement, but I intuitively expect a list of lists to be faster. The list option would also require that I keep the data separate from the structure, while dicts would allow for something of the sort:
graph[I][J]["Property"]="value"
What would you suggest?
Yes, I should have been a bit clearer on what I mean by efficiency. In this particular case I mean it in terms of random access retrieval.
Loading the data in to memory isn't a huge problem. That's done once and for all. The time consuming part is visiting the nodes so I can extract the information and measure the metrics I'm interested in.
I hadn't considered making each node a class (properties are the same for all nodes) but it seems like that would add an extra layer of overhead? I was hoping someone would have some direct experience with a similar case that they could share. After all, graphs are one of the most common abstractions in CS.
I would strongly advocate you look at NetworkX. It's a battle-tested war horse and the first tool most 'research' types reach for when they need to do analysis of network-based data. I have manipulated graphs with hundreds of thousands of edges without problem on a notebook. It's feature-rich and very easy to use. You will find yourself focusing more on the problem at hand than on the details of the underlying implementation.
Example of Erdős-Rényi random graph generation and analysis
"""
Create an G{n,m} random graph with n nodes and m edges
and report some properties.
This graph is sometimes called the Erd##[m~Qs-Rényi graph
but is different from G{n,p} or binomial_graph which is also
sometimes called the Erd##[m~Qs-Rényi graph.
"""
__author__ = """Aric Hagberg (hagberg#lanl.gov)"""
__credits__ = """"""
# Copyright (C) 2004-2006 by
# Aric Hagberg
# Dan Schult
# Pieter Swart
# Distributed under the terms of the GNU Lesser General Public License
# http://www.gnu.org/copyleft/lesser.html
from networkx import *
import sys
n=10 # 10 nodes
m=20 # 20 edges
G=gnm_random_graph(n,m)
# some properties
print "node degree clustering"
for v in nodes(G):
print v,degree(G,v),clustering(G,v)
# print the adjacency list to terminal
write_adjlist(G,sys.stdout)
Visualizations are also straightforward.
More visualization: http://jonschull.blogspot.com/2008/08/graph-visualization.html
Even though this question is now quite old, I think it is worthwhile to mention my own Python module for graph manipulation called graph-tool. It is very efficient, since the data structures and algorithms are implemented in C++, with template metaprogramming, using the Boost Graph Library. Therefore its performance (both in memory usage and runtime) is comparable to a pure C++ library, and can be orders of magnitude better than typical Python code, without sacrificing ease of use. I use it myself constantly to work with very large graphs.
As already mentioned, NetworkX is very good, with another option being igraph. Both modules will have most (if not all) the analysis tools you're likely to need, and both libraries are routinely used with large networks.
A dictionary may also carry overhead, depending on the actual implementation. A hashtable usually contains some prime number of available slots to begin with, even though you might only use a couple of them.
Judging by your example, "Property", would you be better off with a class approach for the final level and real properties? Or do the names of the properties change a lot from node to node?
I'd say that what "efficient" means depends on a lot of things, like:
speed of updates (insert, update, delete)
speed of random access retrieval
speed of sequential retrieval
memory used
I think that you'll find that a data structure that is speedy will generally consume more memory than one that is slow. This isn't always the case, but most data structures seem to follow it.
A dictionary might be easy to use and give you relatively uniformly fast access, but it will most likely use more memory than, as you suggest, lists. Lists, however, tend to incur more overhead when you insert data into them, unless you preallocate X nodes, in which case they will again use more memory.
My suggestion, in general, would be to just use the method that seems the most natural to you, and then do a "stress test" of the system, adding a substantial amount of data to it and see if it becomes a problem.
You might also consider adding a layer of abstraction to your system, so that you don't have to change the programming interface if you later on need to change the internal data structure.
As I understand it, random access is in constant time for both Python's dicts and lists, the difference is that you can only do random access of integer indexes with lists. I'm assuming that you need to lookup a node by its label, so you want a dict of dicts.
However, on the performance front, loading it into memory may not be a problem, but if you use too much you'll end up swapping to disk, which will kill the performance of even Python's highly efficient dicts. Try to keep memory usage down as much as possible. Also, RAM is amazingly cheap right now; if you do this kind of thing a lot, there's no reason not to have at least 4GB.
If you'd like advice on keeping memory usage down, give some more information about the kind of information you're tracking for each node.
Making a class-based structure would probably have more overhead than the dict-based structure, since classes in Python are themselves implemented on top of dicts.
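For completeness, a bare-bones dict-of-dicts adjacency structure along the lines the question sketches (incidentally, this is essentially the structure NetworkX uses internally):

graph = {}

def add_edge(graph, u, v, **props):
    graph.setdefault(u, {})[v] = dict(props)
    graph.setdefault(v, {})[u] = dict(props)   # undirected: mirror the edge

add_edge(graph, "a", "b", Property="value", weight=3)
print(graph["a"]["b"]["Property"])             # label-based lookup in constant time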
No doubt NetworkX is the best data structure for graphs so far. It comes with utilities like helper functions, data structures and algorithms, random sequence generators, decorators, Cuthill-McKee ordering, and context managers.
NetworkX is great because it works for graphs, digraphs, and multigraphs. It can write graphs in multiple formats: Adjacency List, Multiline Adjacency List, Edge List, GEXF, GML. It also works with Pickle, GraphML, JSON, SparseGraph6, etc.
It has implementations of various ready-made algorithms, including: Approximation, Bipartite, Boundary, Centrality, Clique, Clustering, Coloring, Components, Connectivity, Cycles, Directed Acyclic Graphs, Distance Measures, Dominating Sets, Eulerian, Isomorphism, Link Analysis, Link Prediction, Matching, Minimum Spanning Tree, Rich Club, Shortest Paths, Traversal, Tree.
