I went through the manual but couldn't completely figure it out.
I have a massive file with 3 columns: node1 hits node2 with a certain strength. From this, NetworkX generates many clusters, and that works perfectly. However, I cannot load the result into, for example, Cytoscape, so I need to write every cluster to a separate file.
I tried:
for n in G: nx.write_weighted_edgelist(G[n], 'test'+str(count))
I also looked into G.number_of_nodes / G.number_of_edges, G.graph.keys() and dir(G), but this doesn't give me what I want.
Is there a way to store every cluster separately with the strength?
With Clusters = nx.connected_components(G) I can obtain the clusters, yet I lose all the connection information.
for n, nbrs in G.adjacency_iter():
    for nbr, eattr in nbrs.items():
        data = eattr['weight']
        if data < 2:
            print('(%s, %s, %s)' % (n, nbr, data))
When I use that, I think the empty lines in the output mark where one cluster ends and the next begins.
##########Solution
count = 0
Clusters = nx.connected_components(G)
for Cluster in Clusters:
    count = count + 1
    cfile = open("tmp/Cluster_" + str(count) + ".clus", "w")
    for C in Cluster:
        hit = G[C]  # neighbours of C and their edge-attribute dicts
        for h in hit:
            # hit[h].values()[0] is the edge weight (Python 2 dict access)
            cfile.write('\t'.join([str(C), str(h), str(hit[h].values()[0]), "\n"]))
    cfile.close()
Try using graphs = nx.connected_component_subgraphs(G). That will return a list of graphs, which you can write out individually in whatever format works for Cytoscape.
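For illustration, here is a minimal sketch of writing each component out as a weighted edge list (newer NetworkX releases have removed connected_component_subgraphs, so this uses G.subgraph on each component instead; the tmp/ output path is just an example):

import networkx as nx

# G is assumed to be the weighted graph already built from the 3-column file
for count, nodes in enumerate(nx.connected_components(G), start=1):
    cluster = G.subgraph(nodes)  # keeps the edges and their 'weight' attributes
    nx.write_weighted_edgelist(cluster, "tmp/Cluster_" + str(count) + ".clus")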
I'm using the networkx package. How can I randomly remove multiple edges without causing any disconnection (the number of nodes should stay the same)?
I've tried stratified sampling on the dataframe, but it isn't working and I don't know how to do it.
Any advice is appreciated.
Here is how I do it:
removed_edge_cnt = 0
remove_list = set()
pbar = tqdm(total=n)
while removed_edge_cnt < n:
    removed = True
    drop_indices = np.random.choice(orig_data_copy.index, 1, replace=False)
    edge = orig_data_copy.loc[drop_indices, :].values.ravel()  # look the row up by label
    orig_data_copy = orig_data_copy.drop(drop_indices)  # drop the row whether or not the edge gets removed, so the same edge is never drawn twice
    # print('{}-{}:{}'.format(edge[0], edge[1], G.has_edge(edge[0], edge[1])))
    if G.has_edge(edge[0], edge[1]):  # the edge exists in the graph
        G.remove_edge(edge[0], edge[1])
        if not nx.is_weakly_connected(G):  # would removing it disconnect the graph?
            G.add_edge(edge[0], edge[1])
            removed = False
        if removed:
            removed_edge_cnt += 1
            remove_list.add((edge[0], edge[1]))
            pbar.update(1)
pbar.close()
A completely connected graph is one where there is a path from every node to every other node. I assume that this is what you mean by "no disconnection".
To check whether a graph is completely connected, do a depth-first search from any node and confirm that every node is visited.
So, remove your random edge(s) and check whether the graph is still completely connected. If not, put the edge(s) back and try again until you find an edge that can be removed.
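A minimal sketch of that remove-and-check loop with networkx (assuming an undirected graph G; a directed graph would use nx.is_weakly_connected instead, and the loop never terminates if fewer than n edges can be removed safely):

import random
import networkx as nx

def remove_random_edges(G, n):
    """Remove n random edges from G without disconnecting it."""
    removed = []
    while len(removed) < n:
        u, v = random.choice(list(G.edges()))
        data = G.get_edge_data(u, v)   # keep the attributes in case we have to restore the edge
        G.remove_edge(u, v)
        if nx.is_connected(G):         # the reachability check described above
            removed.append((u, v))
        else:                          # removal would split the graph, so put the edge back
            G.add_edge(u, v, **data)
    return removed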
I have a list of tuples like this.
a = [(1,2),(1,3),(1,4),(2,5),(6,5),(7,8)]
In this list, 1 relates to 2, 2 relates to 5, and 5 relates to 6; therefore 1 relates to 6. Similarly, I need to find the relations between the other elements in the tuples. I need a function that takes the input values and produces output as follows:
input = (1,6) #output = True
input = (5,3) #output = True
input = (2,8) #output = False
I do not have knowledge of itertools or map functions. Can they be used to solve these kinds of problems?
And, out of curiosity and interest, where can I find these kinds of questions to practise on, and where are such problems encountered in real-life situations?
This can be easily done by considering the tuples as edges in a graph. The question is then reduced to checking if there is a path between the two nodes.
There exist lots of nice libraries for this; see e.g. networkx:
import networkx as nx
a = [(1,2),(1,3),(1,4),(2,5),(6,5),(7,8)]
G = nx.Graph(a)
nx.has_path(G, 1, 6) # True
nx.has_path(G, 5, 3) # True
nx.has_path(G, 2, 8) # False
This answer here nicely states your problem as a graph problem, where every time you need to run your algorithm you need to check for the existence of a path between your input vertices. The time complexity for every query then depends on the size, order, diameter, degree of the underlying graph.
However, if you intend to run this algorithm many times with the same array a, it may be worth doing some preprocessing on the input graph to find the connected components (Wikipedia : connected components) first. In that case you can get constant time for every query. Here is the code I suggest :
# NOTE : tested using python 3.6.1
# WARNING : no input sanitization
a = [(1,2),(1,3),(1,4),(2,5),(6,5),(7,8)]
n = 8 # order of the underlying graph

# prepare graph as lists of neighbors for every vertex, i.e. adjacency lists (extra unused vertex '0', just to match the value range of the problem)
graph = [[] for i in range(n+1)]
for edge in a:
    graph[edge[0]].append(edge[1])
    graph[edge[1]].append(edge[0])
print( "graph : " + str(graph) )

# list of unprocessed vertices : contains all of them at the beginning
unprocessed_vertices = {i for i in range(1,n+1)}

# subroutine to discover the connected component of a vertex
def build_component():
    component = [] # current connected component
    curr_vertices = {unprocessed_vertices.pop()} # locally unprocessed vertices, initialize with one of the globally unprocessed vertices
    while len(curr_vertices) > 0:
        curr_vertex = curr_vertices.pop() # vertex to be processed
        # add unprocessed neighbours of current vertex to the set of vertices to process
        for neighbour in graph[curr_vertex]:
            if neighbour in unprocessed_vertices:
                curr_vertices.add(neighbour)
                unprocessed_vertices.remove(neighbour)
        component.append(curr_vertex)
    return component

# main algorithm : graph traversal on multiple connected components
components = []
while len(unprocessed_vertices) > 0:
    components.append( build_component() )
print( "components : " + str(components) )

# assign a number to each component
component_numbers = [None] * (n+1)
curr_number = 1
for comp in components:
    for vertex in comp:
        component_numbers[vertex] = curr_number
    curr_number += 1
print( "component_numbers : " + str(component_numbers) )

# main functionality
def is_connected( pair ):
    return component_numbers[pair[0]] == component_numbers[pair[1]]

# run main functionality on inputs : every call is executed in constant time now, regardless of the size of the graph
print( is_connected( (1,6) ) )
print( is_connected( (5,3) ) )
print( is_connected( (2,8) ) )
I don't really know about the most likely situations where this problem could be encountered, but I suppose it can have applications in some clustering tasks, or if you want to know whether it is possible to go from one place to another. If the edges of the graph represent dependencies between modules, this problem would tell you whether two parts depend on each other, so there may be applications in compiling or the management of large projects. The underlying problem is the "connected components" problem, which is among the problems we know polynomial algorithms for.
It is generally very useful to model these kinds of problems with graphs, as these objects have a very simple structure, and most of the time we can reduce the original problem to a well-known problem on graphs.
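For what it's worth, here is the same component precomputation expressed with networkx, just to show how short the reduction to the standard connected-components problem becomes (a sketch reusing the list a from the question):

import networkx as nx

a = [(1,2),(1,3),(1,4),(2,5),(6,5),(7,8)]
G = nx.Graph(a)

# one label per connected component, computed once
label = {}
for i, comp in enumerate(nx.connected_components(G)):
    for v in comp:
        label[v] = i

def is_connected(pair):
    # constant-time lookup afterwards
    return label[pair[0]] == label[pair[1]]

print(is_connected((1, 6)))  # True
print(is_connected((2, 8)))  # False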
I have this data frame df1,
id lat_long
400743 2504043 (175.0976323, -41.1141412)
43203 1533418 (173.976683, -35.2235338)
463952 3805508 (174.6947496, -36.7437555)
1054906 3144009 (168.0105269, -46.36193)
214474 3030933 (174.6311167, -36.867717)
1008802 2814248 (169.3183615, -45.1859095)
988706 3245376 (171.2338968, -44.3884099)
492345 3085310 (174.740957, -36.8893026)
416106 3794301 (174.0106383, -35.3876921)
937313 3114127 (174.8436185, -37.80499)
I have constructed the tree for search here,
def construct_geopoints(s):
    data_geopoints = [tuple(x) for x in s[['longitude','latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

tree = construct_geopoints(actualdata)
Now, I am trying to find all the geopoints which are within 1 km of every geopoint in my data frame df1. Here is how I am doing it:
dfs = []
for name, group in df1.groupby(np.arange(len(df1)) // 10000):
    s = group.reset_index(drop=True).copy()
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    dfs.append(s)
output = pd.concat(dfs, axis=0)
Everything here works fine; however, I am trying to parallelise this task, since my df1 has 2M records and this process runs for more than 8 hours. Can anyone help me with this? Another issue: the result returned by query_ball_point is a list, so it throws a memory error when I process it for a huge number of records. Is there any way to handle this?
EDIT: Memory issue, look at the VIRT size.
It should be possible to parallelize your last segment of code with something like this:
from multiprocessing import Pool
...

def process_group(group):
    s = group[1].reset_index(drop=True)  # .copy() is implicit
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    return s

groups = df1.groupby(np.arange(len(df1)) // 10000)
p = Pool(5)
dfs = p.map(process_group, groups)
output = pd.concat(dfs, axis=0)
But watch out, because the multiprocessing library pickles all the data on its way to and from the workers, and that can add a lot of overhead for data-intensive tasks, possibly cancelling the savings due to parallel processing.
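One way to reduce that overhead, sketched under the assumption of a fork-based start method (e.g. Linux), is to pass only chunk numbers to a hypothetical process_chunk helper, so df1 and tree are inherited by the workers rather than pickled on the way in, and only the results travel back:

from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_chunk(i):
    # df1 and tree are globals inherited by the forked worker
    s = df1.iloc[i * 10000:(i + 1) * 10000].reset_index(drop=True)
    s['neighbours'] = pd.Series(tree.query_ball_point(list(s['lat_long']), 1))
    return s

p = Pool(5)
dfs = p.map(process_chunk, range(-(-len(df1) // 10000)))  # ceiling division for the chunk count
output = pd.concat(dfs, axis=0)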
I can't see where you'd be getting out-of-memory errors from. 8 million records is not that much for pandas. Maybe if your searches are producing hundreds of matches per row that could be a problem. If you say more about that I might be able to give some more advice.
It also sounds like pysal may be taking longer than necessary to do this. You might get better performance by using GeoPandas, or by "rolling your own" solution like this (a rough code sketch follows the steps):
assign each point to a surrounding 1-km grid cell (e.g., calculate UTM coordinates x and y, then create columns cx=x//1000 and cy=y//1000);
create an index on the grid cell coordinates cx and cy (e.g., df=df.set_index(['cx', 'cy']));
for each point, find the points in the 9 surrounding cells; you can select these directly from the index via df.loc[[(cx-1,cy-1),(cx-1,cy),(cx-1,cy+1),(cx,cy-1),...(cx+1,cy+1)], :];
filter the points you just selected to find the ones within 1 km.
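A rough sketch of that grid-cell idea with plain pandas/numpy, assuming hypothetical columns x and y that hold the coordinates already projected to metres (e.g. UTM):

import numpy as np
import pandas as pd

# df is assumed to have projected coordinates x, y in metres
df['cx'] = (df['x'] // 1000).astype(int)
df['cy'] = (df['y'] // 1000).astype(int)
cells = df.groupby(['cx', 'cy']).indices   # (cx, cy) -> array of row positions

def neighbours_within_1km(i):
    """Row positions of all points within 1 km of the point at row position i."""
    x0, y0 = df['x'].values[i], df['y'].values[i]
    cx, cy = df['cx'].values[i], df['cy'].values[i]
    # gather candidates from the 9 surrounding cells
    candidates = np.concatenate([
        cells.get((cx + dx, cy + dy), np.array([], dtype=int))
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
    ])
    # keep only the candidates that are really within 1 km
    d2 = (df['x'].values[candidates] - x0) ** 2 + (df['y'].values[candidates] - y0) ** 2
    return candidates[d2 <= 1000.0 ** 2]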
I am using code I found and slightly modified for my purposes. The problem is, it is not doing exactly what I want, and I am stuck with what to change to fix it.
I am searching for all neighbouring polygons that share a common border (a line), not just a point.
My goal: 135/12 is a neighbour of 319/2, 135/4 and 317, but not of 320/1.
This is what I get in my QGIS table after I run my script:
NEIGHBORS are the neighbouring polygons,
SUM is the number of neighbouring polygons.
The code I use also includes 320/1 as a neighbouring polygon. How can I fix it?
from qgis.utils import iface
from PyQt4.QtCore import QVariant

_NAME_FIELD = 'Nr'
_SUM_FIELD = 'calc'
_NEW_NEIGHBORS_FIELD = 'NEIGHBORS'
_NEW_SUM_FIELD = 'SUM'

layer = iface.activeLayer()

# add the two result fields to the layer
layer.startEditing()
layer.dataProvider().addAttributes(
    [QgsField(_NEW_NEIGHBORS_FIELD, QVariant.String),
     QgsField(_NEW_SUM_FIELD, QVariant.Int)])
layer.updateFields()

feature_dict = {f.id(): f for f in layer.getFeatures()}

# build a spatial index for fast bounding-box lookups
index = QgsSpatialIndex()
for f in feature_dict.values():
    index.insertFeature(f)

for f in feature_dict.values():
    print 'Working on %s' % f[_NAME_FIELD]
    geom = f.geometry()
    intersecting_ids = index.intersects(geom.boundingBox())
    neighbors = []
    neighbors_sum = 0
    for intersecting_id in intersecting_ids:
        intersecting_f = feature_dict[intersecting_id]
        if (f != intersecting_f and
                not intersecting_f.geometry().disjoint(geom)):
            neighbors.append(intersecting_f[_NAME_FIELD])
            neighbors_sum += intersecting_f[_SUM_FIELD]
    f[_NEW_NEIGHBORS_FIELD] = ','.join(neighbors)
    f[_NEW_SUM_FIELD] = neighbors_sum
    layer.updateFeature(f)

layer.commitChanges()
print 'Processing complete.'
I have found somewhat of a workaround for it. Before using my script, I create a small buffer (for my purposes, 0.01 m was enough) around all the joints. Then I use the Difference tool to remove the buffer areas from my main layer, thus removing the unwanted neighbouring polygons. With that done, the code works fine.
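As an alternative to the buffer workaround, an untested sketch: instead of only testing disjoint(), inspect what the two geometries actually share and accept a neighbour only when the intersection is a line (QGis comes from qgis.core if it is not already in scope):

# inside the loop over intersecting_ids, replacing the disjoint() test
shared = geom.intersection(intersecting_f.geometry())
if f != intersecting_f and shared.type() == QGis.Line:
    # the two features share a line segment, not just a corner point
    neighbors.append(intersecting_f[_NAME_FIELD])
    neighbors_sum += intersecting_f[_SUM_FIELD]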
I'm loading a directed weighted graph from a csv file into a graph-tool graph in Python. The organization of the input csv file is:
1,2,300
2,4,432
3,89,1.24
...
where the first two entries of a line identify the source and target of an edge and the third number is the weight of the edge.
Currently I'm using:
g = gt.Graph()
e_weight = g.new_edge_property("float")
csv_network = open(in_file_directory + '/' + network_input, 'r')
csv_data_n = csv_network.readlines()
for line in csv_data_n:
    edge = line.replace('\r\n', '')
    edge = edge.split(delimiter)
    e = g.add_edge(edge[0], edge[1])
    e_weight[e] = float(edge[2])
However, it takes quite long to load the data (I have a network of 10 million nodes and it takes about 45 min).
I have tried to make it faster by using g.add_edge_list, but this works only for unweighted graphs. Any suggestion on how to make it faster?
This has been answered in graph-tool's mailing list:
http://lists.skewed.de/pipermail/graph-tool/2015-June/002043.html
In short, you should use the function g.add_edge_list(), as you said, and put the weights in separately via the array interface for property maps:
e_weight.a = weight_list
The weight list should have the same ordering as the edges you passed to g.add_edge_list().
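A minimal sketch of that approach, assuming the vertex ids in the csv are plain integer indices and the file has no header (in_file_directory, network_input and the comma delimiter are taken from the question):

import numpy as np
import graph_tool.all as gt

data = np.loadtxt(in_file_directory + '/' + network_input, delimiter=',')
edges = data[:, :2].astype(int)    # source and target columns
weights = data[:, 2]               # weight column

g = gt.Graph(directed=True)
e_weight = g.new_edge_property("float")
g.add_edge_list(edges)             # bulk insert; edges keep the input order
e_weight.a = weights               # weights assigned through the array interface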
I suggest you try the performance you get by using the csv library. This example reads each line into edge, a list of the 3 values.
import csv

reader = csv.reader(open(in_file_directory + '/' + network_input, 'r'), delimiter=",")
for edge in reader:
    if len(edge) == 3:
        edge_float = [float(param) for param in edge]
So you would get the following to work with...
edge_float = [1.0, 2.0, 300.0]