I'm just starting out with my Python graphs adventure and have just encountered a conceptual problem. My data is a set of activities that have happened over time: activity, timestamp, and some additional data. I want to create a graph that shows consecutive steps, so I join the data on a time condition. My ideal graph would show A->B->C->D, but because A is always the initial step I get A->B (which is OK) plus the obsolete A->C and A->D. So my question is: do you know any smart way to prune those edges, leaving only the important ones? I was thinking there might be a function in networkx for this. Please help.
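A hedged pointer: if the joined graph is a directed acyclic graph, what is described here is exactly transitive reduction, and networkx ships it as networkx.transitive_reduction. A minimal sketch (the edge list is made up for illustration):

import networkx as nx

# Build the joined graph; edges are illustrative only.
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"),
                  ("A", "C"), ("A", "D")])  # includes the obsolete shortcuts

# transitive_reduction keeps only edges that cannot be inferred
# from a longer path; it requires a DAG (no cycles).
reduced = nx.transitive_reduction(G)
print(sorted(reduced.edges()))  # [('A', 'B'), ('B', 'C'), ('C', 'D')]

Note that transitive_reduction does not copy node or edge attributes to the returned graph, so any extra data would need to be copied across afterwards.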
I'm currently trying to use a sensor to measure a process's consistency. The sensor output varies wildly in its actual reading but displays features that are statistically different across three categories [dark, appropriate, light], with dark and light being out-of-control items. For example, one output could read approximately 0V; the process repeats and the sensor then reads 0.6V. Both the 0V reading and the 0.6V reading could represent an in-control process. There is a consistent difference between sensor readings for out-of-control items and in-control items. An example set for an in-control item can be found here, and an example set for two out-of-control items can be found here.

Because of the wildness of the sensor and the characteristic shapes of each category's data, I think the best way to assess the readings is to process them with an AI model. This is my first foray into creating a model that makes a categorical prediction from a time-series window. I haven't been able to find anything on the internet with my searches (I'm possibly looking for the wrong thing). I'm certain that what I'm attempting is feasible and a strong case for an AI model; I'm just not certain what the optimal way to build it is. One idea I had was to treat the data similarly to how an image is treated by an object-detection model, with the readings as the input array and the category as the output, but I'm not certain this is the best way to go about solving the problem. If anyone can help point me in the right direction or give me a resource, I would greatly appreciate it. Thanks for reading my post!
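A hedged pointer for anyone with the same question: the search term that tends to unlock results here is "time series classification", and the image-like idea above maps naturally onto a 1-D convolutional network. A minimal sketch in Keras, where the window length, layer sizes, and the 0/1/2 label encoding for [dark, appropriate, light] are all assumptions:

import tensorflow as tf
from tensorflow.keras import layers

WINDOW = 200    # assumed number of sensor readings per window
N_CLASSES = 3   # dark, appropriate, light

# Small 1-D CNN: the convolutions pick up the characteristic local
# shapes, global pooling keeps the head independent of position.
model = tf.keras.Sequential([
    layers.Input(shape=(WINDOW, 1)),
    layers.Conv1D(16, kernel_size=7, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: float array of shape (n_windows, WINDOW, 1), y: int labels in {0, 1, 2}.
# model.fit(X, y, epochs=20, validation_split=0.2)

With few labeled windows, a classic feature-based approach (summary statistics fed to a random forest) is also worth trying before a neural network.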
I'm in the process of collecting O2 data for work. This data shows periodic behavior. I would like to parse out each repetition and thereby get statistical information like the average and theoretical error. Data Figure
Is there a convenient way to do the following programmatically:
Identify cyclical data?
Pick out starting and ending indices so that each repeating cycle can be concatenated, post-processed, etc.?
I had a few ideas, but I'm lacking the Python programming experience:
Brute force: condition the data in Excel beforehand. (I will likely collect similar data in the future and would like a more robust method.)
Train a neural network to identify each cycle and output the indices. (Limited training set; I would have to label it.)
Decompose into trend/seasonal components and apply a Fourier series to the seasonal data. Pick out N cycles.
Heuristics, i.e. identify rate-of-change thresholds and do event detection (difficult due to the secondary hump; please see the data). A sketch of this idea appears below, after the sample data link.
Is there a Python program that systematically does this for me? Any help would be greatly appreciated.
Sample Data
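A hedged sketch of the heuristic route (idea 4): scipy.signal.find_peaks can locate one landmark per cycle, and the indices between consecutive landmarks delimit the repetitions. The file name and the distance/prominence values below are assumptions to tune against your data; prominence in particular is what lets you skip the secondary hump:

import numpy as np
from scipy.signal import find_peaks

y = np.loadtxt("o2_data.txt")  # placeholder file name

# One main peak per cycle: `distance` enforces a minimum spacing
# between detected peaks, `prominence` rejects the secondary hump.
peaks, _ = find_peaks(y, distance=500, prominence=0.5)

# Slice the signal into individual cycles between consecutive peaks.
cycles = [y[peaks[i]:peaks[i + 1]] for i in range(len(peaks) - 1)]

# Per-cycle statistics, e.g. the mean and its standard error.
means = np.array([c.mean() for c in cycles])
print(means.mean(), means.std(ddof=1) / np.sqrt(len(means)))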
I have multiple node- and edge-lists which form a large graph; let's call it the maingraph. My current strategy is to first read all the node lists and import them with add_vertices. Every node then gets an internal ID which depends on the order in which they are ingested and therefore isn't very reliable (as I've read, if you delete one node, all IDs higher than the deleted one change). I assign every node a name attribute, which corresponds to the external ID I use so I can keep track of my nodes between frameworks, and a type attribute.
Now, how do I add the edges? When I read an edge list it starts a new graph (subgraph) and hence starts the internal IDs at 0. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist()) inevitably fails.
It is possible to work around this and use the name attribute from both maingraph and subgraph to find out which internal IDs each edge's incident nodes have in the maingraph:
def _get_real_source_and_target_id(edge):
    '''Takes an edge from the to-be-added subgraph and gets the IDs of the
    corresponding nodes in the maingraph by their name.'''
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id, target_id)
And then I tried
edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)
But that is horribly slow. The graph has millions of nodes and edges, which take 10 seconds to load with the fast but incorrect maingraph.add_edges(subgraph.get_edgelist()) approach. With the correct approach explained above, it takes minutes (I usually stop it after 5 minutes or so), and I will have to do this tens of thousands of times. I switched from NetworkX to igraph because of the fast loading, but that doesn't really help if I have to do it like this.
Does anybody have a more clever way to do this? Any help much appreciated!
Thanks!
Never mind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the nodes' names as strings, which somehow did funny stuff when the names were incrementing numbers with more than five figures (see my issue report here). Therefore the above solution got stuck when it tried to get the nodes whose names numpy had messed up: maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldn't find the node. Now I use pandas to read the node names and it works fine.
The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it here in case it helps someone. Feel free to delete it otherwise.
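A hedged note for anyone who still finds the select-based lookup slow: maingraph.vs.select(name_eq=...) scans the whole vertex sequence on every call, so building a name-to-index dictionary once per merge turns each lookup into O(1). A minimal sketch using the same maingraph/subgraph names as above:

# Build the lookup table once per merge; each select() call above
# scans all vertices, while a dict lookup is constant-time.
name_to_id = {v["name"]: v.index for v in maingraph.vs}

edgelist = [
    (name_to_id[subgraph.vs[s]["name"]], name_to_id[subgraph.vs[t]["name"]])
    for s, t in subgraph.get_edgelist()
]
maingraph.add_edges(edgelist)

Depending on the python-igraph version, add_edges may also accept vertex names directly in the edge tuples, which would skip the translation step entirely.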
I am working on a graph flow model in the context of transport networks. I have the position of sensors (lat/lon) and would like to associate these sensors with nodes on a graph retrieved using osmnx.
At present, I use get_nearest_node to map a sensor to a node. However, this isn't optimal, as I'm at the mercy of the cartographer -- straight roads will have fewer nodes, and so the mean displacement (and therefore error) will be higher, even when dealing with unsimplified graphs. I had considered using get_nearest_edge, but I'd still need to edit the graph to insert a new node at the position of the sensor.
Instead, I thought a reasonable way of achieving this would be to upsample the graph (perhaps using redistribute_vertices), apply get_nearest_node, and then re-simplify the graph, somehow whitelisting the node that is now associated with a sensor to prevent it from being removed.
However, it's not clear to me how to go from the output of redistribute_vertices to a graph -- it returns a LineString or MultiLineString rather than a new graph.
I saw this question posted on the osmnx GitHub project: https://github.com/gboeing/osmnx/issues/304, in which a GeoDataFrame is generated, with a new column containing the redistributed way as a (Multi)LineString. However, I'm not sure how I can map this new gdf back to a Graph -- the corresponding node dataframe hasn't been updated, and u and v values remain the same in the new edges table.
Any pointers (including telling me I'm going about this the wrong way and should be using function XYZ) would be really appreciated.
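A hedged sketch of one alternative that avoids upsampling altogether: take the nearest edge, then use shapely to project the sensor onto the edge's geometry; the projected point is where a new node would be inserted, splitting the edge in two. The geometries below are placeholders, and both should be in the same projected CRS:

from shapely.geometry import LineString, Point

edge_geom = LineString([(0.0, 0.0), (100.0, 0.0)])  # placeholder: nearest edge's geometry
sensor_pt = Point(40.0, 7.5)                        # placeholder: sensor position

# Distance along the edge at which the sensor projects onto it ...
dist_along = edge_geom.project(sensor_pt)
# ... and the actual nearest point on the edge geometry.
node_pt = edge_geom.interpolate(dist_along)

# node_pt is where the new node would sit; the original edge would
# then be replaced by two segments meeting at dist_along.
print(dist_along, node_pt.x, node_pt.y)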
I have created a 4-cluster k-means customer segmentation in scikit-learn (Python). The idea is that every month, the business gets an overview of the shifts in the number of customers in each cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may slightly shift, but I want to keep the old clusters (even though they fit the data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes to its respective cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.
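In scikit-learn terms that answer amounts to persisting the fitted model (whose cluster_centers_ are exactly those means) and calling predict on new data instead of refitting. A minimal sketch, where the feature matrices are placeholders:

import joblib
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 6)      # placeholder: original customer features
X_new = np.random.rand(520, 6)  # placeholder: this month's customers

# Fit once and persist, so the centroids stay fixed month to month.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
joblib.dump(kmeans, "segmentation_kmeans.joblib")

# Later months: load and assign, never refit. predict() puts each
# case in the cluster with the nearest stored mean.
kmeans = joblib.load("segmentation_kmeans.joblib")
labels_new = kmeans.predict(X_new)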