How to create a DAG from a list in python

I am using networkx and manually entering the (u, v, weight) triples for a graph. As the input grows, inserting nodes and edges by hand becomes tedious and error-prone. I have not been able to figure out how to do this without the manual work.
Sample Input:
my_list = ["s1[0]", "d1[0, 2]", "s2[0]", "d2[1, 3]", "d3[0, 2]", "d4[1, 4]", "d5[2, 3]", "d6[1, 4]"]
Manual Insertion:
Before inserting nodes into the graph I need to number them, so that the first occurrence of 's' or 'd' can be differentiated from later occurrences, e.g. s1, s2, s3, ... and d1, d2, d3, ...
I am aware this is similar to SSA form (compilers), but I could not find anything helpful for my case.
Manually inserting (u, v, weight) into a DiGraph():
my_graph.add_weighted_edges_from([("s1", "d1", 0), ("d1", "s2", 0), ("d1", "d3", 2), ("s2", "d3", 0),
                                  ("d2", "d4", 1), ("d2", "d5", 3), ("d3", "d5", 2), ("d4", "d6", 1), ("d4", "d6", 4)])
Question:
How to automatically convert that input list(my_list) into a DAG(my_graph), avoiding manual insertion?
Complete Code:
This is what I have written so far.
import networkx as nx
from networkx.drawing.nx_agraph import write_dot, graphviz_layout
from matplotlib import pyplot as plt
my_graph = nx.DiGraph()
my_graph.add_weighted_edges_from([("s1", "d1", 0), ("d1", "s2", 0), ("d1", "d3", 2), ("s2", "d3", 0),
                                  ("d2", "d4", 1), ("d2", "d5", 3), ("d3", "d5", 2), ("d4", "d6", 1), ("d4", "d6", 4)])
write_dot(my_graph, "graph.dot")
plt.title("draw graph")
pos = graphviz_layout(my_graph, prog='dot')
nx.draw(my_graph, pos, with_labels=True, arrows=True)
plt.show()
plt.clf()
Explanation:
's' and 'd' are instructions that require 1 or 2 registers respectively to perform an operation.
In the example above we have 2 's' operations and 6 'd' operations, and there are five registers [0,1,2,3,4].
Each operation performs some calculation and stores the result in the relevant register(s).
From the input we can see that d1 uses registers 0 and 2, so it cannot run until both of these registers are free. Therefore d1 depends on s1, because s1 comes before d1 and uses register 0. As soon as s1 finishes, d1 can run, since register 2 is already free.
E.g. we initialize all registers with 1. s1 doubles its input, while d1 sums its two inputs and stores the result in its second register:
so after s1[0] reg-0 * 2 -> 1 * 2 => reg-0 = 2
and after d1[0, 2] reg-0 + reg-2 -> 2 + 1 => reg-0 = 2 and reg-2 = 3
Update 1: The graph will be a dependency graph based on some resources [0...4]; each node requires 1 (for 's') or 2 (for 'd') of these resources.
Update 2: Two questions were causing confusion so I'm separating them. For now I have changed my input list and there is only a single task of converting that list into a DAG. I have also included an explanation section.
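PS: write_dot and graphviz_layout from nx_agraph need Graphviz and the pygraphviz package, so you might need to install them (e.g. pip install pygraphviz) if you don't already have them.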

OK, now that I have a better idea of how the mapping works, it just comes down to describing the process in code: keep a mapping of which op is currently using which resource and, while iterating over the operations, generate an edge whenever an operation uses a resource that a previous operation used. I think this is along the lines of what you are looking for:
import ast
class UniqueIdGenerator:
    def __init__(self, initial=1):
        self.auto_indexing = {}
        self.initial = initial
    def get_unique_name(self, name):
        "adds number after given string to ensure uniqueness."
        if name not in self.auto_indexing:
            self.auto_indexing[name] = self.initial
        unique_idx = self.auto_indexing[name]
        self.auto_indexing[name] += 1
        return f"{name}{unique_idx}"
def generate_DAG(source):
    """
    takes iterable of tuples in format (name, list_of_resources) where
    - name doesn't have to be unique
    - list_of_resources is a list of resources in any hashable format (list of numbers or strings is typical)
    generates edges in the format (name1, name2, resource),
    - name1 and name2 are unique-ified versions of names in input
    - resource is the value in the list of resources
    each "edge" represents a handoff of resource, so name1 and name2 use the same resource sequentially.
    """
    # format {resource: name} for each resource in use.
    resources = {}
    g = UniqueIdGenerator()
    for (op, deps) in source:
        op = g.get_unique_name(op)
        for resource in deps:
            # for each resource this operation requires, if a previous operation used it
            if resource in resources:
                # yield the new edge
                yield (resources[resource], op, resource)
            # either first or yielded an edge, this op is now using the resource.
            resources[resource] = op
my_list = ["s[0]", "d[0, 2]", "s[0]", "d[1, 3]", "d[0, 2]", "d[1, 4]", "d[2, 3]", "d[1, 4]"]
data = generate_DAG((a[0], ast.literal_eval(a[1:])) for a in my_list)
print(*data, sep="\n")
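If the names in the input are already numbered, as in the question's sample input, the same bookkeeping can be sketched without the ID generator. This is only a minimal sketch: the helper names parse_op and dependency_edges are made up for illustration, and it assumes every string has the form name[resources]. The resulting triples have the (u, v, weight) shape that the question passes to add_weighted_edges_from.

```python
import ast

def parse_op(text):
    """Split a string like 'd1[0, 2]' into ('d1', [0, 2]). Hypothetical helper."""
    name, bracket, rest = text.partition("[")
    if not bracket:
        raise ValueError(f"no resource list in {text!r}")
    return name.strip(), ast.literal_eval(bracket + rest)

def dependency_edges(ops):
    """Return (previous_user, current_op, resource) for every resource handoff."""
    last_user = {}  # resource -> name of the most recent operation that used it
    edges = []
    for text in ops:
        name, resources = parse_op(text)
        for resource in resources:
            if resource in last_user:
                edges.append((last_user[resource], name, resource))
            # this operation is now the latest user of the resource
            last_user[resource] = name
    return edges

my_list = ["s1[0]", "d1[0, 2]", "s2[0]", "d2[1, 3]", "d3[0, 2]", "d4[1, 4]", "d5[2, 3]", "d6[1, 4]"]
edges = dependency_edges(my_list)
print(*edges, sep="\n")
```

For this input the edges match the manually entered list from the question, apart from their order.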

Related

construct a tree out of list of strings

I have 400 lists that look like that:
[A ,B, C,D,E]
[A, C, G, B, E]
[A,Z,B,D,E]
...
[A,B,R,D,E]
Each has 5 items and starts with A.
I would like to construct a tree or directed acyclic graph (with counts as weights) where each level is the index of the item, i.e. A has edges to all possible items at the first index, those have edges to their children at the second index, and so on.
Is there an easy way to build this in networkx? What I thought of doing is creating all the tuples for each level, i.e. for level 0: (A,B), (A,C), (A,Z) etc., but I'm not sure how to proceed from there.

If I understood you correctly, you can add each list as a path of a directed graph using nx.add_path.
l = [['A' ,'B', 'C','D','E'],
['A', 'C','G', 'B', 'E'],
['A','Z','B','D','E'],
['A','B','R','D','E']]
However, since the same item appears at several levels, you should probably rename the nodes according to their level, because you cannot have multiple nodes with the same name. One way could be:
l = [[f'{j}_level{lev}' for lev,j in enumerate(i, 1)] for i in l]
#[['A_level1', 'B_level2', 'C_level3', 'D_level4', 'E_level5'],
# ['A_level1', 'C_level2', 'G_level3', 'B_level4', 'E_level5'],
# ['A_level1', 'Z_level2', 'B_level3', 'D_level4', 'E_level5'],
# ['A_level1', 'B_level2', 'R_level3', 'D_level4', 'E_level5']]
And now construct the graph with:
G = nx.DiGraph()
for path in l:
    nx.add_path(G, path)
Then you could create a tree-like structure using Graphviz's dot layout:
from networkx.drawing.nx_agraph import graphviz_layout
pos=graphviz_layout(G, prog='dot')
nx.draw(G, pos=pos,
node_color='lightgreen',
node_size=1500,
with_labels=True,
arrows=True)
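The question also asks for counts as weights, which add_path on its own does not record. One possible sketch, assuming the weight of an edge is the number of lists in which that step occurs, is to count the level-tagged pairs first with collections.Counter. The function name weighted_level_edges is invented for this example; the (u, v, weight) triples it returns could then be handed to G.add_weighted_edges_from.

```python
from collections import Counter

paths = [['A', 'B', 'C', 'D', 'E'],
         ['A', 'C', 'G', 'B', 'E'],
         ['A', 'Z', 'B', 'D', 'E'],
         ['A', 'B', 'R', 'D', 'E']]

def weighted_level_edges(paths):
    """Count how often each level-tagged (parent, child) step occurs."""
    counts = Counter()
    for path in paths:
        # same renaming scheme as above: 'A' at index 0 becomes 'A_level1'
        tagged = [f"{item}_level{level}" for level, item in enumerate(path, 1)]
        counts.update(zip(tagged, tagged[1:]))
    return [(u, v, n) for (u, v), n in counts.items()]

edges = weighted_level_edges(paths)
for u, v, n in edges:
    print(u, "->", v, "weight", n)
```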

iGraph: selecting vertices connected to

Suppose I have the following graph:
g = ig.Graph([(0,1), (0,2), (2,3), (3,4), (4,2), (2,5), (5,0), (6,3), (5,6)], directed=False)
g.vs["name"] = ["Alice", "Bob", "Claire", "Dennis", "Esther", "Frank", "George"]
and I wish to see who Bob is connected to. Bob is only connected to one person, Alice. However, if I try to find the edge:
g.es.select(_source=1)
>>> <igraph.EdgeSeq at 0x7f15ece78050>
I simply get the above response. How do I infer the vertex index from it? Or, if that isn't possible, how do I find the vertices connected to Bob?

This seems to work. The keyword arguments consist of the property, e.g. _source and _target, and an operator, e.g. eq (=). It also seems you need to check both the source and the target of the edges (even though it's an undirected graph). After filtering the edges, you can use a list comprehension to loop through them and extract the source or target:
connected_from_bob = [edge.target for edge in g.es.select(_source_eq=1)]
connected_to_bob = [edge.source for edge in g.es.select(_target_eq=1)]
connected_from_bob
# []
connected_to_bob
# [0]
Then the vertices connected with Bob are the union of the two lists:
connected_with_bob = connected_from_bob + connected_to_bob
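To make the "check both ends" idea concrete without depending on igraph, here is a plain-Python sketch over the same edge list. The function name neighbours is made up for this example. As far as I know igraph also offers g.neighbors(vertex) for this, which is probably the more direct route, but the sketch below only uses the standard library.

```python
edges = [(0, 1), (0, 2), (2, 3), (3, 4), (4, 2), (2, 5), (5, 0), (6, 3), (5, 6)]
names = ["Alice", "Bob", "Claire", "Dennis", "Esther", "Frank", "George"]

def neighbours(edges, vertex):
    """Return the sorted indices of all vertices sharing an edge with `vertex`."""
    found = set()
    for source, target in edges:
        # an undirected edge can list the vertex at either end
        if source == vertex:
            found.add(target)
        elif target == vertex:
            found.add(source)
    return sorted(found)

bob = names.index("Bob")
connected_with_bob = [names[i] for i in neighbours(edges, bob)]
print(connected_with_bob)
```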

Python: Best way to store the top ten numbers

I have the following problem: I run parameter tests and, for every single parameter combination, create a new object, which is replaced by the next object created with other parameters. The object has an attribute for the Jaccard coefficient and an attribute ID. In every step I want to store the Jaccard coefficient of the object. At the end I want the top ten Jaccard coefficients and their corresponding IDs.
r=["%.2f" % r for r in np.arange(3,5,1)]
fs=["%.2f" % fs for fs in np.arange(2,5,1)]
co=["%.2f" % co for co in np.arange(1,5,1)]
frc_networks=[]
bestJC = []
bestPercent = []
best10Candidates = []
count = 0
for parameters in itertools.product(r,fs,co):
    args = parser.parse_args(["path1.csv","path2.csv","--r",parameters[0],"--fs",parameters[1],"--co",parameters[2]])
    if not os.path.isfile('FCR_Network_Coordinates_ID_{}_r_{}_x_{}_y_{}_z_{}_fcr_{}_co_{}_1.csv'.format(count, args.r, args.x, args.y, args.z, args.fs,args.co)):
        FRC_Network(count,args.p[0],args.p[1],args.x,args.y,args.z,args.r,args.fs,args.co)
The attributes can be accessed as FRC_Network.ID and FRC_Network.JC

I think I'd use heapq.heappushpop() for this. That way, no matter how large your input set is, your data requirement is limited to a list of 10 tuples.
Note the use of tuples to keep the JC and ID values together. Since tuple comparison is lexicographic, the heap is ordered by JC first, and ID only breaks ties.
Also, note that the final call to .sort() is optional. If you just want the ten best, skip the call. If you want the ten best in order, keep the call.
import heapq
#UNTESTED
best = []
for parameters in itertools.product(r,fs,co):
    # ...
    if len(best) < 10:
        heapq.heappush(best, (FRC_Network.JC, FRC_Network.ID))
    else:
        heapq.heappushpop(best, (FRC_Network.JC, FRC_Network.ID))
best.sort(reverse=True)
Here is a tested version that demonstrates the concept:
import heapq
import random
from pprint import pprint
best = []
for ID in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ':
    JC = random.randint(0, 100)
    if len(best) < 10:
        heapq.heappush(best, (JC, ID))
    else:
        heapq.heappushpop(best, (JC, ID))
pprint(best)
Result:
[(81, 'E'),
(82, 'd'),
(83, 'G'),
(92, 'i'),
(95, 'Z'),
(100, 'p'),
(89, 'q'),
(98, 'a'),
(96, 'z'),
(97, 'O')]
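Since the random demo gives a different result on every run, here is a deterministic sketch of the same pattern wrapped in a function. The name top_n and the made-up scores are assumptions for illustration only. It also compares the result against heapq.nlargest, which is an alternative when all (JC, ID) pairs can be collected in one iterable.

```python
import heapq

def top_n(pairs, n=10):
    """Keep only the n largest (score, id) tuples while streaming through pairs."""
    best = []  # min-heap: best[0] is always the smallest score kept so far
    for pair in pairs:
        if len(best) < n:
            heapq.heappush(best, pair)
        else:
            # push the new pair, then drop whichever tuple is now the smallest
            heapq.heappushpop(best, pair)
    return sorted(best, reverse=True)

# Made-up scores standing in for (JC, ID); all 50 values are distinct.
scores = [((i * 37) % 101, f"id{i}") for i in range(50)]
top = top_n(scores, n=10)
print(top)
print(top == heapq.nlargest(10, scores))
```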

Mapping modified string indices to original string indices in Python

I'm relatively new to programming and wanted to get some help on a problem I've had. I need to figure out a way to map the indices of a string back to an original string after removing certain positions. For example, say I had a string:
original_string = 'abcdefgh'
And I removed a few elements to get:
new_string = 'acfh'
I need a way to get the "true" indices of new_string. In other words, I want the indices of the positions I've kept as they were in original_string. Thus returning:
original_indices_of_new_string = [0,2,5,7]
My general approach has been something like this:
I find the positions I've removed in the original_string to get:
removed_positions = [1,3,4,6]
Then given the indices of new_string:
new_string_indices = [0,1,2,3]
Then I think I should be able to do something like this:
original_indices_of_new_string = []
for i in new_string_indices:
    offset = 0
    corrected_value = i + offset
    if corrected_value in removed_positions:
        #somehow offset to correct value
        offset+=1
    else:
        original_indices_of_new_string.append(corrected_value)
This doesn't really work because the offset is reset to 0 on every iteration, whereas it should keep growing by one for each removed position that has been skipped (i.e. I want to offset by 2 for removed positions 3 and 4, but only by 1 if consecutive positions weren't removed).
I need to do this based on the positions I've removed rather than those I've kept, because further down the line I'll be removing more positions and I'd like to have an easy function to map those back to the original each time. I also can't just search for the parts I've removed, because the real string isn't unique enough to guarantee that the correct portion gets found.
Any help would be much appreciated. I've been using Stack Overflow for a while now and have always found my question answered in a previous thread, but I couldn't find anything this time, so I decided to post a question myself! Let me know if anything needs clarification.
*Letters in the string are not unique

Given your string original_string = 'abcdefgh', you can create a list of (index, character) tuples:
>>> li=[(i, c) for i, c in enumerate(original_string)]
>>> li
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h')]
Then remove your desired characters:
>>> new_li=[t for t in li if t[1] not in 'bdeg']
>>> new_li
[(0, 'a'), (2, 'c'), (5, 'f'), (7, 'h')]
Then rejoin that into a string:
>>> ''.join([t[1] for t in new_li])
'acfh'
Your 'answer' is the method used to create new_li and referring to the index there:
>>> ', '.join(map(str, (t[0] for t in new_li)))
'0, 2, 5, 7'

You can create a new class to deal with this:
class String:
    def __init__(self, myString):
        self.myString = myString
        self.myMap = {}
        self.__createMapping(self.myString)
    def __createMapping(self, myString):
        index = 0
        for character in myString:
            # If the character already exists in the map, append the index to the list
            if character in self.myMap:
                self.myMap[character].append(index)
            else:
                self.myMap[character] = [index,]
            index += 1
    def removeCharacters(self, myList):
        for character in self.myString:
            if character in myList:
                self.myString = self.myString.replace(character, '')
                # pop() instead of del: a repeated character would otherwise raise KeyError
                self.myMap.pop(character, None)
        return self.myString
    def getIndeces(self):
        return self.myMap
if __name__ == '__main__':
    myString = String('abcdef')
    print(myString.removeCharacters(['a', 'b']))  # Prints cdef
    print(myString.getIndeces())  # Prints each remaining character and a list of the original indices it occurs at
This will give a mapping of the characters to a list of the indices at which they occur in the original string. You can add more functionality if you want a single list returned, etc. Hopefully this gives you an idea of how to start.

If removing by index, you simply need to start with a list of all indexes, e.g.: [0, 1, 2, 3, 4] and then, as you remove at each index, remove it from that list. For example, if you're removing indexes 1 and 3, you'll do:
idxlst.remove(1)
idxlst.remove(3)
idxlst # => [0, 2, 4]
[update]: if not removing by index, it's probably easiest to find the index first and then proceed with the above solution, e.g. if removing 'c' from 'abc', do:
i = mystr.index('c')
# remove 'c'
idxlst.remove(i)

Trying to stay as close as possible to what you were originally trying to accomplish, this code should work:
big = 'abcdefgh'
small='acfh'
l = []
current = 0
while len(small) >0:
    if big[current] == small[0]:
        l.append(current)
        small = small[1:]
    else:
        current += 1
print(l)
The idea is to work from the front so you don't need to worry about the offset.
A precondition is of course that small is actually obtained by removing a few indices from big. Otherwise, an IndexError is raised. If you need the code to be more robust, just catch the exception at the very end and return an empty list or something. Otherwise the code should work fine.

Assuming the characters in your input string are unique, this is what is happening with your code:
original_indices_of_new_string = []
for i in new_string_indices:
    offset = 0
    corrected_value = i + offset
    if corrected_value in removed_positions:
        #somehow offset to correct value
        offset+=1
    else:
        original_indices_of_new_string.append(corrected_value)
Setting offset to 0 on every iteration means corrected_value is always i + 0, so you might as well use i directly. That boils your code down to:
for i in new_string_indices:
    if i in removed_positions:
        #somehow offset to correct value
        pass
    else:
        original_indices_of_new_string.append(i)
This code gives the output [0, 2], and the logic is right (again assuming the characters in the input are unique). What you should be doing is running the loop over the length of original_string. That will give you what you want, like this:
original_indices_of_new_string = []
for i in range(len(original_string)):
    if i in removed_positions:
        #somehow offset to correct value
        pass
    else:
        original_indices_of_new_string.append(i)
print(original_indices_of_new_string)
This prints:
[0, 2, 5, 7]
A simpler one-liner to achieve the same (valid only when the characters are unique, since str.index returns the first match) would be:
original_indices_of_new_string = [original_string.index(c) for c in new_string]
Hope this helps.
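Since the question mentions removing further positions later on and mapping back to the original each time, here is one possible sketch of that, based purely on positions so it does not depend on the letters being unique. The function names surviving_indices and remove_more are invented for this example.

```python
def surviving_indices(length, removed_positions):
    """Original indices that remain after removing the given positions."""
    removed = set(removed_positions)
    return [i for i in range(length) if i not in removed]

def remove_more(index_map, removed_positions):
    """Remove positions of the *current* string; keep the original indices of the rest.

    index_map[k] is the original index of character k of the current string.
    """
    removed = set(removed_positions)
    return [orig for k, orig in enumerate(index_map) if k not in removed]

original_string = 'abcdefgh'
# First round: remove original positions 1, 3, 4, 6 -> 'acfh'
index_map = surviving_indices(len(original_string), [1, 3, 4, 6])
print(index_map)
# Second round: remove position 1 of 'acfh' (the 'c') -> 'afh'
index_map = remove_more(index_map, [1])
print(index_map)
print(''.join(original_string[i] for i in index_map))
```

The map is just a list of original indices, so each later removal only needs the positions in the current string.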

It may help to map the characters of the new string to their positions in the original string in a dictionary, and recover the new string like this:
import operator
chars = {'a':0, 'c':2, 'f':5, 'h':7}
sorted_chars = sorted(chars.items(), key=operator.itemgetter(1))
new_string = ''.join([char for char, pos in sorted_chars]) # 'acfh'

python pandas beginner: multi-dimensional data-analysis workflow (groupby+agg+plot)

I'm new to pandas and am trying to learn how to process my multi-dimensional data.
My data
Let's assume my data is a big CSV with the columns ['A', 'B', 'C', 'D', 'E', 'F', 'G']. This data describes some simulation results, where ['A', 'B', ..., 'F'] are simulation parameters and 'G' is one of the outputs (the only output in this example!).
EDIT / UPDATE:
As Boud suggested in the comments, let's generate some data that is compatible with mine:
import pandas as pd
import itertools
import numpy as np
npData = np.zeros(5000, dtype=[('A','i4'),('B','f4'),('C','i4'), ('D', 'i4'), ('E', 'f4'), ('F', 'i4'), ('G', 'f4')])
A = [0,1,2,3,6] # param A: int
B = [1000.0, 10.000] # param B: float
C = [100,150,200,250,300] # param C: int
D = [10,15,20,25,30] # param D: int
E = [0.1, 0.3] # param E: float
F = [0,1,2,3,4,5,6,7,8,9] # param F = random-seed = int -> 10 runs per scenario
# some beta-distribution parameters for randomizing the results in column "G"
aDistParams = [ (6,1),
(5,2),
(4,3),
(3,4),
(2,5),
(1,6),
(1,7) ]
counter = 0
for i in itertools.product(A,B,C,D,E,F):
    npData[counter]['A'] = i[0]
    npData[counter]['B'] = i[1]
    npData[counter]['C'] = i[2]
    npData[counter]['D'] = i[3]
    npData[counter]['E'] = i[4]
    npData[counter]['F'] = i[5]
    np.random.seed(i[5])  # call seed(); assigning to np.random.seed would overwrite the function
    npData[counter]['G'] = np.random.beta(a=aDistParams[i[0]][0], b=aDistParams[i[0]][1])
    counter += 1
data = pd.DataFrame(npData)
data = data.reindex(np.random.permutation(data.index)) # shuffle rows because my original data doesn't give any guarantees
Because the parameters ['A', 'B', ..., 'F'] are generated as a cartesian product (meaning: nested for-loops; a priori), I want to use groupby to obtain each possible 'simulation scenario' before analysing the output.
The parameter 'F' describes multiple runs of each scenario (each scenario is defined by 'A', 'B', ..., 'E'; let's assume that 'F' is the random seed), so my code becomes:
grouped = data.groupby(['A','B','C','D','E'])
# -> every group defines one simulation scenario
grouped_agg = grouped.agg(({'G' : np.mean}))
# -> the mean of the simulation output in 'G' over 'F' is calculated for each group/scenario
What do I want to do now?
I: display all the (unique) values of each scenario parameter within these groups -> grouped_agg gives me an iterable of tuples, where for example all the entries at position 0 give me all the values for 'A' (so with a few lines of Python I would get my unique values, but maybe there is a function for that)
Update: my approach
list(set(grouped_agg.index.get_level_values('A'))) (when interested in 'A'; using set to obtain unique values; probably not what you want to do if you need high performance)
=> [0, 1, 2, 3, 6]
II: generate some plots (of lower dimension) -> I need to hold some variables constant and filter/select my data before plotting (which is why step I is needed) =>
'B' const
'C', const
'E' const
'D' = x-axis
'G' = y-axis / output from my aggregation
'A' = one more dimension = multiple colors within 2d-plot -> one G/y-axis for each value of 'A'
How would I generate a plot like that?
I think that reshaping my data is the key step and that pandas' plotting capabilities will handle the rest. Maybe a shape with 5 columns (one for each value of parameter A), holding the corresponding G values for each index selection + param-A selection, is enough, but I wasn't able to achieve that form yet.
Thanks for your input!
(I'm using pandas 0.12 within Enthought Canopy)
Sascha

I: If I understand your example and desired output, I don't see why grouping is necessary.
data.A.unique()
II: Updated....
I will implement the example you sketch above. Assume that we have averaged 'G' over the random seed ('F') like so:
data = data.groupby(['A','B','C','D','E']).agg(({'G' : np.mean})).reset_index()
Start by selecting the rows where B, C, and E have some constant values that you specify.
df1 = data[(data['B'] == const1) & (data['C'] == const2) & (data['E'] == const3)]
Now we want to plot 'G' as a function of 'D', with a different color for every value of 'A'.
df1.set_index('D').groupby('A')['G'].plot(legend=True)
I tested the above on some dummy data, and it works as you describe. The values of 'G' corresponding to each 'A' are plotted in a distinct color on the same axes.
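The question also asks for a shape with one column per value of 'A'. One way to sketch that reshape, on a tiny made-up data set rather than the simulated data above, is pivot_table, which averages 'G' over the seed 'F' and spreads 'A' across the columns in one step. The constants 10.0, 100 and 0.1 are arbitrary values chosen for this example.

```python
import pandas as pd

# Tiny invented data set: two values each for A, D and the seed F.
rows = [
    {"A": a, "B": 10.0, "C": 100, "D": d, "E": 0.1, "F": f, "G": a + d / 10 + f}
    for a in (0, 1) for d in (10, 15) for f in (0, 1)
]
data = pd.DataFrame(rows)

# Hold B, C and E constant, as in the filtering step above.
selected = data[(data["B"] == 10.0) & (data["C"] == 100) & (data["E"] == 0.1)]

# Rows: D (x-axis); columns: one per value of A; cells: mean of G over F.
wide = selected.pivot_table(index="D", columns="A", values="G", aggfunc="mean")
print(wide)
# wide.plot() would then draw one line per value of A against D.
```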
III: I don't know how to answer that broad question.
IV: No, I don't think that's an issue for you here.
I suggest playing with simpler, small data sets and getting more familiar with pandas.
