Dataset for experimenting with graph algorithms on road networks - Python

I'm working on a geographic problem: finding the fastest path for an electric vehicle.
To facilitate experimentation with the algorithms I've created, I need some kind of road network dataset. I have been looking at real-world data sources such as OpenStreetMap, but that seems like an awfully complicated thing to integrate.
All I really need is a road network dataset that contains distances and speed limits; being able to work with it in Python is preferable.

Researchers frequently use the graphs from the 9th DIMACS Implementation Challenge for experiments with shortest-path algorithms. Coordinates, distances, and estimated travel times are all provided. The format is simple and textual; I estimate that a dozen lines of Python would suffice to read them.
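For reference, here is a minimal sketch of such a reader, assuming the challenge's .gr (arc) and .co (coordinate) text formats; the file names at the bottom are only examples of what the challenge distributes.

from collections import defaultdict

def read_gr(path):
    """Read a DIMACS .gr file into an adjacency dict {u: {v: weight}}."""
    graph = defaultdict(dict)
    with open(path) as f:
        for line in f:
            if line.startswith("a"):          # arc line: "a u v weight"
                _, u, v, w = line.split()
                graph[int(u)][int(v)] = int(w)
    return graph

def read_co(path):
    """Read a DIMACS .co file into a dict {node: (x, y)}."""
    coords = {}
    with open(path) as f:
        for line in f:
            if line.startswith("v"):          # vertex line: "v id x y"
                _, node, x, y = line.split()
                coords[int(node)] = (int(x), int(y))
    return coords

# Example file names (e.g. the New York road network from the challenge);
# substitute whichever graphs you download.
distances = read_gr("USA-road-d.NY.gr")   # edge weights are distances
times = read_gr("USA-road-t.NY.gr")       # edge weights are travel times
positions = read_co("USA-road-d.NY.co")

From there it is easy to wrap the dictionaries in a networkx graph or feed them straight into your own algorithms.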

Related

Which algorithm to use when finding shortest path in multiple overlapping datasets?

I've been tasked with writing an algorithm in Python that can establish the shortest journey on a metro train system. The network has multiple train "lines", each consisting of different stations, and some stations appear on several lines; it essentially works exactly like the London tube network. I've been given the stations and the lines, as well as the time between each pair of adjacent stations, which will act as the cost in the algorithm. Could someone give me a brief idea of how I should approach this? Thanks.
I'm aware Dijkstra's algorithm would be the best choice for this; I'm just having trouble wrapping my head around how to actually approach it.
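One possible sketch, assuming the data can be flattened into (station, station, minutes) tuples (the station names and times below are invented): ignore which line an edge belongs to and treat the whole network as a single weighted graph, then run Dijkstra with a priority queue.

import heapq
from collections import defaultdict

# Hypothetical input: each tuple is (station_a, station_b, minutes between them).
# Stations that appear on several lines are simply reused as the same node.
edges = [
    ("Baker Street", "Bond Street", 2),
    ("Bond Street", "Oxford Circus", 1),
    ("Baker Street", "Oxford Circus", 4),
]

graph = defaultdict(list)
for a, b, minutes in edges:
    graph[a].append((b, minutes))
    graph[b].append((a, minutes))          # assume travel is symmetric

def dijkstra(source, target):
    """Return (total_minutes, route) for the quickest journey."""
    queue = [(0, source, [source])]        # (cost so far, station, route)
    seen = set()
    while queue:
        cost, station, route = heapq.heappop(queue)
        if station == target:
            return cost, route
        if station in seen:
            continue
        seen.add(station)
        for neighbour, minutes in graph[station]:
            if neighbour not in seen:
                heapq.heappush(queue, (cost + minutes, neighbour, route + [neighbour]))
    return float("inf"), []

print(dijkstra("Baker Street", "Oxford Circus"))
# -> (3, ['Baker Street', 'Bond Street', 'Oxford Circus'])

If you later need to penalize changing lines, you can make the nodes (station, line) pairs and add small transfer edges between the copies of the same station.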

Community detection for a larger-than-memory embeddings dataset

I currently have a dataset of textual embeddings (768 dimensions). The current number of records is ~1 million. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py
It works great, but it doesn't really scale as the data set grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have been able to find clustering algorithms that scale to larger than memory, but I always have to specify a fixed number of clusters ahead of time. I need the system to detect the number of clusters for me.
I'm certain there is a class of algorithms, and hopefully a Python library, that can handle this situation, but I have been unable to locate it. Does anyone know of an algorithm or a solution I could use?
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question: typically the play is to cluster the data into a few chunks (overlapping or not) that fit in memory and then apply a higher-quality in-memory clustering algorithm to each chunk. One typical strategy for cosine similarity is to cluster by SimHashes, but there's a whole literature out there; if you already have a scalable clustering algorithm you like, you can use that.
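A minimal sketch of that SimHash-style bucketing, under the assumption that 8 random hyperplanes produce buckets small enough to fit in memory (tune n_bits as needed); the downstream clustering call is left as a placeholder.

import numpy as np

rng = np.random.default_rng(0)

def simhash_buckets(embeddings, n_bits=8):
    # Project onto random hyperplanes and keep only the signs; vectors with
    # high cosine similarity are likely to share the resulting bit signature.
    hyperplanes = rng.standard_normal((embeddings.shape[1], n_bits))
    bits = (embeddings @ hyperplanes) > 0                # shape (n, n_bits)
    keys = np.packbits(bits, axis=1)                     # one compact id per row
    buckets = {}
    for i, key in enumerate(keys):
        buckets.setdefault(key.tobytes(), []).append(i)
    return buckets

# Toy usage: 10,000 random vectors stand in for the real 1M x 768 matrix,
# which you would stream from disk chunk by chunk.
emb = rng.standard_normal((10_000, 768)).astype(np.float32)
for key, indices in simhash_buckets(emb).items():
    chunk = emb[indices]
    # ...run the in-memory community detection (e.g. the linked
    # fast_clustering approach) on `chunk` here...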

Is there a community detection algorithm with low time complexity, like Louvain, that penalizes overly large communities?

I am building a massive network that contains many isolated nodes but also some rather large clusters. I have used the Louvain algorithm to obtain a partition; however, some of the resulting communities are too large. I was curious which algorithms (preferably with Python frameworks) have a run time similar to Louvain but penalize overly large communities while still achieving good modularity.
You may try iterating the community detection algorithm (Louvain or another) by running it again on the too-large communities you find at first; this will partition them into smaller ones (see the sketch below).
Notice also that Louvain and other community detection algorithms generally do not produce the best partition, but a good partition with respect to a given quality function. In most cases, finding the best partition is NP-hard.
With this in mind, one may include a scale parameter in the quality function and detect relevant communities at different scales: Post-Processing Hierarchical Community Structures: Quality Improvements and Multi-scale View
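A sketch of that iteration using the Louvain implementation built into networkx (>= 2.8); the size threshold, resolution, and random graph are placeholders.

import networkx as nx

def split_large_communities(G, max_size=500, resolution=1.0, seed=0):
    """Run Louvain, then re-run it on any community larger than max_size."""
    final = []
    pending = list(nx.community.louvain_communities(G, resolution=resolution, seed=seed))
    while pending:
        community = pending.pop()
        if len(community) <= max_size:
            final.append(community)
            continue
        sub = G.subgraph(community)
        parts = nx.community.louvain_communities(sub, resolution=resolution, seed=seed)
        if len(parts) == 1:                 # Louvain refused to split it further
            final.append(community)
        else:
            pending.extend(parts)
    return final

# Toy usage on a random graph; replace with your own network.
G = nx.gnm_random_graph(2_000, 8_000, seed=0)
communities = split_large_communities(G, max_size=200)
print(len(communities), max(len(c) for c in communities))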

Routing problems with a large number of points and one constraint

I am currently tackling a routing problem where I have to create daily schedules for workers who repair installations. There are 200,000 installations and a worker can only work 8 hours per day. The goal is to make optimal routes on a daily basis, i.e. to minimize the distance between the points a worker has to visit each day, but there is also a constraint on the priority of each installation. Each installation has a priority between 0 and 1, and higher-priority points should be given higher weights.
I am just looking for some suggestions, as I have tried implementing some solutions (https://developers.google.com/optimization/routing/tsp) but with this many points the computation time is too long.
As you know, there is no perfect answer to your problem, but maybe I can guide your research:
Alpha-beta pruning: I've used it to reduce the number of possibilities for an AI playing the game of Hex.
A* pathfinding: I've used it to simulate a futuristic hyperloop-like, capsule-based network, as a complement to Dijkstra's algorithm.
You can customize both algorithms according to your needs.
Hope this is useful!
Due to the large scale of the described problem it is nearly impossible to obtain the optimal solution in every case. You could try something based on mixed integer programming, especially a TSP or vehicle routing problem (VRP) formulation, but I assume that it won't work in your case.
What you should try, at least in my opinion, are heuristic approaches to the TSP/VRP: tabu search, simulated annealing, hill climbing. Given enough time and a proper set of constraints, one of these methods will produce "good enough" solutions that are much better than random guessing. Take a look at something like Google OR-Tools; a sketch follows below.
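For illustration, a minimal OR-Tools sketch using a cheap construction heuristic plus guided local search under a time limit; the tiny distance matrix and single vehicle are placeholders, and the priorities would still need to be modelled, for example as drop penalties via AddDisjunction.

from ortools.constraint_solver import pywrapcp, routing_enums_pb2

# Placeholder symmetric distance matrix for 5 installations; index 0 is the depot.
distance_matrix = [
    [0, 2, 9, 10, 7],
    [2, 0, 6, 4, 3],
    [9, 6, 0, 8, 5],
    [10, 4, 8, 0, 6],
    [7, 3, 5, 6, 0],
]

manager = pywrapcp.RoutingIndexManager(len(distance_matrix), 1, 0)  # nodes, vehicles, depot
routing = pywrapcp.RoutingModel(manager)

def distance_callback(from_index, to_index):
    return distance_matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit_index = routing.RegisterTransitCallback(distance_callback)
routing.SetArcCostEvaluatorOfAllVehicles(transit_index)

# Heuristic search: cheap construction, then guided local search until the time limit.
params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
params.local_search_metaheuristic = routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH
params.time_limit.FromSeconds(10)

solution = routing.SolveWithParameters(params)
if solution:
    index = routing.Start(0)
    route = []
    while not routing.IsEnd(index):
        route.append(manager.IndexToNode(index))
        index = solution.Value(routing.NextVar(index))
    print(route, solution.ObjectiveValue())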
That's a massive problem. You will need to cluster it into smaller subproblems before tackling it. We've applied sophisticated fuzzy clustering techniques to experimentally solve a 20,000-location problem. For 200,000 locations you'll probably need to aggregate by geographic region (e.g. postcode/zipcode) before you could attempt to run some kind of clustering to split it up. Alternatively, you may just want to try a hard split based on geography first.
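As a rough sketch of such a hard geographic split (not the fuzzy clustering mentioned above), one could bucket installations by coordinates with MiniBatchKMeans; the coordinates and cluster count below are invented, and plain Euclidean distance on lat/lon is only a rough approximation over a small area.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Placeholder (lat, lon) coordinates for 200,000 installations.
coords = rng.uniform([48.0, 2.0], [49.0, 3.0], size=(200_000, 2))

# Split into geographic sub-problems of a few hundred points each.
n_clusters = 400
labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit_predict(coords)
for k in range(n_clusters):
    subproblem = coords[labels == k]
    # ...solve a daily VRP on `subproblem` with the heuristic of your choice...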

Graphing a very large graph in networkx

I am attempting to draw a very large networkx graph with approximately 5000 nodes and 100000 edges; it represents the road network of a large city. I cannot tell whether the computer is hanging or whether it simply takes forever. The line of code it seems to hang on is the following:
# a is my network
pos = networkx.spring_layout(a)
Is there perhaps a better method for plotting such a large network?
Here is the good news: it wasn't broken. It was working, and you wouldn't want to wait for it even if you could.
Check out my answer to this question to see what your end result would look like.
Drawing massive networkx graph: Array too big
I think the spring layout is an n^3 algorithm, which for your roughly 5000 nodes would take about 125,000,000,000 calculations to get the positions for your graph. The best thing for you is to choose a different layout type or to plot the positions yourself.
So another alternative is pulling out the relevant points yourself using a tool called Gephi.
As Aric said, if you know the locations, that's probably the best option.
If instead you only know distances but don't have locations to plug in, there is a calculation you can do that reproduces the locations quite well (up to a rotation): do a principal component analysis of the distances and project into 2 dimensions, and it will probably do a very good job of estimating the geographic locations. (It was an example I saw in a linear algebra class once.)
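A sketch of that idea, here done with metric MDS on a precomputed distance matrix in scikit-learn (the same PCA-style embedding of the distances); the distance matrix below is generated from fake locations purely for illustration, and with a real road network you would plug in your own pairwise distances.

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Fake "true" locations used only to build an example distance matrix;
# in practice you would already have the pairwise distances.
true_xy = rng.uniform(0, 10, size=(50, 2))
dist = np.linalg.norm(true_xy[:, None, :] - true_xy[None, :, :], axis=-1)

# Embed the distances back into 2-D; the result matches the true layout
# up to rotation/reflection and translation.
xy = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)

# Use the recovered coordinates as fixed positions for networkx drawing:
# pos = {node: xy[i] for i, node in enumerate(a.nodes())}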
