How do I change the height of Graphviz output (DOT question)?

How do I change the height of Graphviz output (DOT question)? - python

I am trying to create a family tree. I found an excellent script for achieving this and I understand the Python code that underlies it. The script can be found here: https://github.com/adrienverge/familytreemaker/blob/master/familytreemaker.py
I am hitting a wall at lines 322-342. The script executes successfully, but the output is restricted to two layers. There are four generations in the current tree, so this compression makes the graph impossible decipher.
def output_descending_tree(self, ancestor):
"""Outputs the whole descending family tree from a given ancestor,
in DOT format.
"""
# Find the first households
gen = [ancestor]
print('digraph {\n' + \
'\tnode [shape=box];\n' + \
'\tedge [dir=none];\n')
for p in self.everybody.values():
print('\t' + p.graphviz() + ';')
print('')
while gen:
self.display_generation(gen)
gen = self.next_generation(gen)
print('}')
I think a big part of my problem is there are many nodes since the family is quite large. If I use a subset of my family, the tree renders without any issue.
It looks like I need to add some parameters around lines 331-332, such as:
'\tgraph [nodesep=1, levelsgap=1, ratio=expand, size="10,20"];\n' + \
OR, add additional parameters to ``nodeand/oredge```.
I found the documentation here: https://www.graphviz.org/doc/info/attrs.html, and I tried a few different parameters without much success. The plot did change size, but the contents became blurry rather than expanding. The parameters I have tried changing are: dimen, esep, fontsize, height, levels, and ratio.
This was a helpful thread: Change Size (Width and Height) of Graph (GraphViz & dot), but my interpreter didn't like the syntax (specifically: nodesep = 1.5; and ranksep = 1).
Any advice is greatly appreciated. Thank you!
EDIT: I have attached an anonymized version of my output. This version is less compressed than the version I originally had when I wrote this post. Maybe the changes I have made to the code in the mean time achieved my goals? Not sure which one did it, but thank you all for your feedback.

Related

Multiplying a numpy array within Psychopy

TL;DR: Can I multiply a numpy.average by 2? If yes, how?
For an orientation discrimination experiment, during which people respond on how well they're able to discriminate the angle between an visible grating and non-visible reference grating, I want to calculate the Just Noticeable Difference (JND).
At the end of the code I have this:
#write JND to logfile (average of last 10 reversals)
if len(staircase[stairnum].reversalIntensities) < 10:
dataFile.write('JND = %.3f\n' % numpy.average(staircase[stairnum].reversalIntensities))
else:
dataFile.write('JND = %.3f\n' % numpy.average(staircase[stairnum].reversalIntensities[-10:]))
This is where the JND is written to the file, and I thought it'd be easy to multiply that "numpy.average" line by 2, which doesn't work. I thought of making two different variables that contained the same array, and using numpy.sum to add them together.
#Possible solution
x=numpy.average(staircase[stairnum].reversalIntensities[-10:]))
y=numpy.average(staircase[stairnum].reversalIntensities[-10:]))
numpy.sum(x,y, [et cetera])
I am sure the procedure is very simple, but my current capabilities of programming are limited and the psychopy and python reference materials did not provide what I was looking for (if there is, please share!).

Why is Python randint() generating bizarre grid effect when used in script to transform image?

I am playing around with images and the random number generator in Python, and I would like to understand the strange output picture my script is generating.
The larger script iterates through each pixel (left to right, then next row down) and changes the color. I used the following function to offset the given input red, green, and blue values by a randomly determined integer between 0 and 150 (so this formula is invoked 3 times for R, G, and B in each iteration):
def colCh(cVal):
random.seed()
rnd = random.randint(0,150)
newVal = max(min(cVal - 75 + rnd,255),0)
return newVal
My understanding is that random.seed() without arguments uses the system clock as the seed value. Given that it is invoked prior to the calculation of each offset value, I would have expected a fairly random output.
When reviewing the numerical output, it does appear to be quite random:
Scatter plot of every 100th R value as x and R' as y:
However, the picture this script generates has a very peculiar grid effect:
Output picture with grid effect hopefully noticeable:
Furthermore, fwiw, this grid effect seems to appear or disappear at different zoom levels.
I will be experimenting with new methods of creating seed values, but I can't just let this go without trying to get an answer.
Can anybody explain why this is happening? THANKS!!!
Update: Per Dan's comment about possible issues from JPEG compression, the input file format is .jpg and the output file format is .png. I would assume only the output file format would potentially create the issue he describes, but I admittedly do not understand how JPEG compression works at all. In order to try and isolate JPEG compression as the culprit, I changed the script so that the colCh function that creates the randomness is excluded. Instead, it merely reads the original R,G,B values and writes those exact values as the new R,G,B values. As one would expect, this outputs the exact same picture as the input picture. Even when, say, multiplying each R,G,B value by 1.2, I am not seeing the grid effect. Therefore, I am fairly confident this is being caused by the colCh function (i.e. the randomness).
Update 2: I have updated my code to incorporate Sascha's suggestions. First, I moved random.seed() outside of the function so that it is not reseeding based on the system time in every iteration. Second, while I am not quite sure I understand how there is bias in my original function, I am now sampling from a positive/negative distribution. Unfortunately, I am still seeing the grid effect. Here is my new code:
random.seed()
def colCh(cVal):
rnd = random.uniform(-75,75)
newVal = int(max(min(cVal + rnd,255),0))
return newVal
Any more ideas?

As imgur is down for me right now, some guessing:
Your usage of PRNGs is a bit scary. Don't use time-based seeds in very frequently called loops. It's very much possible, that the same seeds are generated and of course this will generate patterns. (granularity of time + number of random-bits used matter here)
So: seed your PRNG once! Don't do this every time, don't do this for every channel. Seed one global PRNG and use it for all operations.
There should be no pattern then.
(If there is: also check the effect of interpolation = image-size change)
Edit: As imgur is on now, i recognized the macro-block like patterns, like Dan mentioned in the comments. Please change your PRNG-usage first before further analysis. And maybe show more complete code.
It may be possible, that you recompressed the output and JPEG-compression emphasized the effects observed before.
Another thing is:
newVal = max(min(cVal - 75 + rnd,255),0)
There is a bit of a bias here (better approach: sample from symmetric negative/positive distribution and clip between 0,255), which can also emphasize some effect (what looked those macroblocks before?).

graphviz cluster's label multiple lines

I'm using python+graphviz in order to create networking topologies out of the information contained in Racktables. I've succeded pretty well so far but I'm willing now to add multiple lines labels to a cluster (not a node).
For example, I have the following code with python:
for router in routers:
[...]
cluster_name = "cluster"+str(i)
router_label=router_name+"\n"+router_hw
c = gv.Graph(cluster_name)
c.body.append('label='+router_label)
When ever I run that program, I get the following:
ST120_CMS70_SARM
SARM
ST202_P9J70_SARM
SARM
Error: node "SARM" is contained in two non-comparable clusters "cluster1" and "cluster0"
But, if I change this router_label=router_name+"\n"+router_hw to this router_label=router_name+"_"+router_hw, I get no error and the topology gets drawn, but, of course, a one line label.
Any hint on this?
Many thanks!
Lucas

Ok, I've found the solution. The multiline label is achieved using HTML like labels, like the following one...
router_label="<"+router_name+"<BR />"+router_ip+">"
c = gv.Graph(cluster_name)
c.body.append('label='+router_label)
This code provides the following:
Thanks!
Lucas

Minimum removed nodes required to cut path from A to B algorithm in Python

I am trying to solve a problem related to graph theory but can't seem to remember/find/understand the proper/best approach so I figured I'd ask the experts...
I have a list of paths from two nodes (1 and 10 in example code). I'm trying to find the minimum number of nodes to remove to cut all paths. I'm also only able to remove certain nodes.
I currently have it implemented (below) as a brute force search. This works fine on my test set but is going to be an issue when scaling up to a graphs that have paths in the 100K and available nodes in the 100 (factorial issue). Right now, I'm not caring about the order I remove nodes in, but I will at some point want to take that into account (switch sets to list in code below).
I believe there should be a way to solve this using a max flow/min cut algorithm. Everything I'm reading though is going way over my head in some way. It's been several (SEVERAL) years since doing this type of stuff and I can't seem to remember anything.
So my questions are:
1) Is there a better way to solve this problem other than testing all combinations and taking the smallest set?
2) If so, can you either explain it or, preferably, give pseudo code to help explain? I'm guessing there is probably a library that already does this in some way (I have been looking and using networkX lately but am open to others)
3) If not (or even of so), suggestions for how to multithread/process solution? I want to try to get every bit of performance I can from computer. (I have found a few good threads on this question I just haven't had a chance to implement so figured I'd ask at same time just in chance. I first want to get everything working properly before optimizing.)
4) General suggestions on making code more "Pythonic" (probably will help with performance too). I know there are improvements I can make and am still new to Python.
Thanks for the help.
#!/usr/bin/env python
def bruteForcePaths(paths, availableNodes, setsTested, testCombination, results, loopId):
#for each node available, we are going to
# check if we have already tested set with node
# if true- move to next node
# if false- remove the paths effected,
# if there are paths left,
# record combo, continue removing with current combo,
# if there are no paths left,
# record success, record combo, continue to next node
#local copy
currentPaths = list(paths)
currentAvailableNodes = list(availableNodes)
currentSetsTested = set(setsTested)
currentTestCombination= set(testCombination)
currentLoopId = loopId+1
print "loop ID: %d" %(currentLoopId)
print "currentAvailableNodes:"
for set1 in currentAvailableNodes:
print " %s" %(set1)
for node in currentAvailableNodes:
#add to the current test set
print "%d-current node: %s current combo: %s" % (currentLoopId, node, currentTestCombination)
currentTestCombination.add(node)
# print "Testing: %s" % currentTestCombination
# print "Sets tested:"
# for set1 in currentSetsTested:
# print " %s" % set1
if currentTestCombination in currentSetsTested:
#we already tested this combination of nodes so go to next node
print "Already test: %s" % currentTestCombination
currentTestCombination.remove(node)
continue
#get all the paths that don't have node in it
currentRemainingPaths = [path for path in currentPaths if not (node in path)]
#if there are no paths left
if len(currentRemainingPaths) == 0:
#save this combination
print "successful combination: %s" % currentTestCombination
results.append(frozenset(currentTestCombination))
#add to remember we tested combo
currentSetsTested.add(frozenset(currentTestCombination))
#now remove the node that was add, and go to the next one
currentTestCombination.remove(node)
else:
#this combo didn't work, save it so we don't test it again
currentSetsTested.add(frozenset(currentTestCombination))
newAvailableNodes = list(currentAvailableNodes)
newAvailableNodes.remove(node)
bruteForcePaths(currentRemainingPaths,
newAvailableNodes,
currentSetsTested,
currentTestCombination,
results,
currentLoopId)
currentTestCombination.remove(node)
print "-------------------"
#need to pass "up" the tested sets from this loop
setsTested.update(currentSetsTested)
return None
if __name__ == '__main__':
testPaths = [
[1,2,14,15,16,18,9,10],
[1,2,24,25,26,28,9,10],
[1,2,34,35,36,38,9,10],
[1,3,44,45,46,48,9,10],
[1,3,54,55,56,58,9,10],
[1,3,64,65,66,68,9,10],
[1,2,14,15,16,7,10],
[1,2,24,7,10],
[1,3,34,35,7,10],
[1,3,44,35,6,10],
]
setsTested = set()
availableNodes = [2, 3, 6, 7, 9]
results = list()
currentTestCombination = set()
bruteForcePaths(testPaths, availableNodes, setsTested, currentTestCombination, results, 0)
print "results:"
for result in sorted(results, key=len):
print result
UPDATE:
I reworked the code using itertool for generating the combinations. It make the code cleaner and faster (and should be easier to multiprocess. Now to try to figure out the dominate nodes as suggested and multiprocess function.
def bruteForcePaths3(paths, availableNodes, results):
#start by taking each combination 2 at a time, then 3, etc
for i in range(1,len(availableNodes)+1):
print "combo number: %d" % i
currentCombos = combinations(availableNodes, i)
for combo in currentCombos:
#get a fresh copy of paths for this combiniation
currentPaths = list(paths)
currentRemainingPaths = []
# print combo
for node in combo:
#determine better way to remove nodes, for now- if it's in, we remove
currentRemainingPaths = [path for path in currentPaths if not (node in path)]
currentPaths = currentRemainingPaths
#if there are no paths left
if len(currentRemainingPaths) == 0:
#save this combination
print combo
results.append(frozenset(combo))
return None

Here is an answer which ignores the list of paths. It just takes a network, a source node, and a target node, and finds the minimum set of nodes within the network, not either source or target, so that removing these nodes disconnects the source from the target.
If I wanted to find the minimum set of edges, I could find out how just by searching for Max-Flow min-cut. Note that the Wikipedia article at http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem#Generalized_max-flow_min-cut_theorem states that there is a generalized max-flow min-cut theorem which considers vertex capacity as well as edge capacity, which is at least encouraging. Note also that edge capacities are given as Cuv, where Cuv is the maximum capacity from u to v. In the diagram they seem to be drawn as u/v. So the edge capacity in the forward direction can be different from the edge capacity in the backward direction.
To disguise a minimum vertex cut problem as a minimum edge cut problem I propose to make use of this asymmetry. First of all give all the existing edges a huge capacity - for example 100 times the number of nodes in the graph. Now replace every vertex X with two vertices Xi and Xo, which I will call the incoming and outgoing vertices. For every edge between X and Y create an edge between Xo and Yi with the existing capacity going forwards but 0 capacity going backwards - these are one-way edges. Now create an edge between Xi and Xo for each X with capacity 1 going forwards and capacity 0 going backwards.
Now run max-flow min-cut on the resulting graph. Because all the original links have huge capacity, the min cut must all be made up of the capacity 1 links (actually the min cut is defined as a division of the set of nodes into two: what you really want is the set of pairs of nodes Xi, Xo with Xi in one half and Xo in the other half, but you can easily get one from the other). If you break these links you disconnect the graph into two parts, as with standard max-flow min-cut, so deleting these nodes will disconnect the source from the target. Because you have the minimum cut, this is the smallest such set of nodes.
If you can find code for max-flow min-cut, such as those pointed to by http://www.cs.sunysb.edu/~algorith/files/network-flow.shtml I would expect that it will give you the min-cut. If not, for instance if you do it by solving a linear programming problem because you happen to have a linear programming solver handy, notice for example from http://www.cse.yorku.ca/~aaw/Wang/MaxFlowMinCutAlg.html that one half of the min cut is the set of nodes reachable from the source when the graph has been modifies to subtract out the edge capacities actually used by the solution - so given just the edge capacities used at max flow you can find it pretty easily.

If the paths were not provided as part of the problem I would agree that there should be some way to do this via http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem, given a sufficiently ingenious network construction. However, because you haven't given any indication as to what is a reasonable path and what is not I am left to worry that a sufficiently malicious opponent might be able to find strange collections of paths which don't arise from any possible network.
In the worst case, this might make your problem as difficult as http://en.wikipedia.org/wiki/Set_cover_problem, in the sense that somebody, given a problem in Set Cover, might be able to find a set of paths and nodes that produces a path-cut problem whose solution can be turned into a solution of the original Set Cover problem.
If so - and I haven't even attempted to prove it - your problem is NP-Complete, but since you have only 100 nodes it is possible that some of the many papers you can find on Set Cover will point at an approach that will work in practice, or can provide a good enough approximation for you. Apart from the Wikipedia article, http://www.cs.sunysb.edu/~algorith/files/set-cover.shtml points you at two implementations, and a quick search finds the following summary at the start of a paper in http://www.ise.ufl.edu/glan/files/2011/12/EJORpaper.pdf:
The SCP is an NP-hard problem in the strong sense (Garey and Johnson, 1979) and many algorithms
have been developed for solving the SCP. The exact algorithms (Fisher and Kedia, 1990; Beasley and
JØrnsten, 1992; Balas and Carrera, 1996) are mostly based on branch-and-bound and branch-and-cut.
Caprara et al. (2000) compared different exact algorithms for the SCP. They show that the best exact
algorithm for the SCP is CPLEX. Since exact methods require substantial computational effort to solve
large-scale SCP instances, heuristic algorithms are often used to find a good or near-optimal solution in a
reasonable time. Greedy algorithms may be the most natural heuristic approach for quickly solving large
combinatorial problems. As for the SCP, the simplest such approach is the greedy algorithm of Chvatal
(1979). Although simple, fast and easy to code, greedy algorithms could rarely generate solutions of good
quality....

Edit: If you want to destroy in fact all paths, and not those from a given list, then max-flow techniques as explained by mcdowella is much better than this approach.
As mentioned by mcdowella, the problem is NP-hard in general. However, the way your example looks, an exact approach might be feasible.
First, you can delete all vertices from the paths that are not available for deletion. Then, reduce the instance by eliminating dominated vertices. For example, every path that contains 15 also contains 2, so it never makes sense to delete 15. In the example if all vertices were available, 2, 3, 9, and 35 dominate all other vertices, so you'd have the problem down to 4 vertices.
Then take a vertex from the shortest path and branch recursively into two cases: delete it (remove all paths containing it) or don't delete it (delete it from all paths). (If the path has length one, omit the second case.) You can then check for dominance again.
This is exponential in the worst case, but might be sufficient for your examples.

error with gdalbuildvrt, in Python

I am new to python/GDAL and am running into perhaps a trivial issue. This may stem from the fact that I don't really understand how to use GDAL properly in python, or something careless, but even though I think I am following the help doc, I keep getting a syntax error when trying to use "gdalbuildvrt".
What I want to do is take several (amount varies for each set, call it N) geotagged 1-band binary rasters [all values are either 0 or 1] of different sizes (each raster in the set overlaps for the most part though), and "stack" them on top of each other so that they are aligned properly according to their coordinate information. I want this "stack" simply so I can sum the values and produce a 'total' tiff that has an extent to match the exclusive extent (meaning not just the overlap region) of all the original rasters. The resulting tiff would have values ranging from 0 to N, to represent the total number of "hits" the pixel in that location received over the course of the N rasters.
I was led to gdalbuildvrt [http://www.gdal.org/gdalbuildvrt.html] and after reading about it, it seemed that by using the keyword -separate, I would be able to achieve what I need. However, each time I try to run my program, I get a syntax error. The following shows two of the several different ways I tried calling gdalbuildvrt:
gdalbuildvrt -separate -input_file_list stack.vrt inputlist.txt
gdalbuildvrt -separate stack.vrt inclassfiles
Where inputlist.txt is a text file with a path to the tif on every line, just like the help doc specifies. And inclassfiles is a python list of the pathnames. Every single time, no matter which way I call it, I get a syntax error on the first word after the keywords (i.e. 'inputlist' in inputlist.txt, or 'stack' in stack.vrt).
Could someone please shed some light on what I might be doing wrong? Alternatively, does anyone know how else I could use python to get what I need?
Thanks so much.

gdalbuildvrt is a GDAL command line utility. From your example its a bit unclear how you actually run it, but when running from within Python you should execute it as a subprocess.
And in your first line you have the .vrt and the .txt in the wrong order. The textfile containing the files should follow directly after the -input_file_list.
From within Python you can call gdalbuildvrt like:
import os
os.system('gdalbuildvrt -separate -input_file_list inputlist.txt stack.vrt')
Note that the command is provided as a string. Using a Python list with the files can be done with something like:
os.system('gdalbuildvrt -separate stack.vrt %s') % ' '.join(data)
The ' '.join(data) part converts the list to a string with a space between the items.
Depending on how your GDAL is build, its sometimes possible to use wildcards as well:
os.system('gdalbuildvrt -separate stack.vrt *.tif')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.