Need algorithm suggestions for flight routings - python

I'm in the early stages of thinking through a wild trip that involves visiting every commercial airport in India. A little research shows that the national carrier, Air India, has a special ticket called the Silver Pass that allows unlimited travel on its domestic network for 15 days. I would like to use this as my weapon of choice!
See this for a map of all the airports served by Air India
I have the following information available with me in Excel:
All of the domestic flight routes (departure airports and arrival airports in IATA codes)
Duration for every flight route
Weekly frequency for every flight (not all flights run on all days of the week, for example)
Given this information, how do I figure out the maximum number of airports that I can hit in 15 days using the Silver Pass ticket? Looking online suggests this is either a traveling salesman problem or a graph traversal problem. What would you recommend I look at to solve this?
Some background on myself: I'm just beginning to learn Python and would like to solve this problem with it. Given that, what are the Python-based algorithms/libraries I should be looking at that will help me structure an approach to solving this?

Your problem is closely related to the Hamiltonian Path problem and Traveling Salesman Problem, which are NP-Hard.
Given an instance of the Hamiltonian Path problem, build flight data as follows:
Each vertex is an airport
Each edge is a flight
All flights leave at the same times and take the same duration. (*)
(*) The flight durations and departure times [which are common to all flights] should be chosen so that you can visit all airports only if you visit each airport exactly once. This can easily be done in polynomial time: assuming the ticket is valid for a fixed k hours, construct the flight table such that each flight takes exactly k/(n-1) hours and there is a flight every k/(n-1) hours as well [1] [remember that all flights depart at the same times].
It is easy to see that you can use the ticket to visit all airports if and only if the graph has a Hamiltonian path: if we visit some airport twice along the path, we need at least n flights, so the total time is at least n * k/(n-1) > k and we exceed the time limit. [The other direction is similar.]
Thus your problem [in the general case] is NP-hard, and no polynomial-time solution for it is known.
[1]: We assume it takes no time to transfer between flights; this can easily be fixed by decreasing each flight's length by the time it takes to "jump" between two flights.
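To make the reduction concrete, here is a small illustrative sketch (assuming networkx; all names here are made up) of how such a synthetic flight table could be generated from a graph:

import networkx as nx

def hamiltonian_to_timetable(G: nx.Graph, k: float):
    # Every edge of G becomes a flight of duration k / (n - 1) hours,
    # departing at t = 0, k/(n-1), 2*k/(n-1), ... within the k-hour window.
    n = G.number_of_nodes()
    slot = k / (n - 1)                        # common duration of every flight
    departures = [i * slot for i in range(n - 1)]
    timetable = []
    for u, v in G.edges():
        for t in departures:
            # a flight in each direction at every departure slot
            timetable.append((u, v, t, slot))
            timetable.append((v, u, t, slot))
    return timetable

With this table, the question "can I visit all n airports within the k-hour ticket window?" has the answer yes exactly when G has a Hamiltonian path.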

Representing your problem as a graph is definitely the best option. Since the duration, number of flights, and number of airports are relatively limited, and since you are (presumably) happy with approximate solutions, attacking this by brute force ought to be practical, and is probably your best option. Here's roughly what I would do:
Represent each airport as a node on the graph, and each flight as an edge.
Given a starting airport and a current time, select all the flights leaving that airport after the current time. Use a scoring function of some sort to rank them, such that flights to airports you haven't visited are ranked higher than flights to airports you have already visited, and sooner flights are ranked higher than later ones.
Recursively explore each outgoing edge, in order of score, and repeat the procedure for the arriving airport.
Any time you reach a node with no valid outgoing edges, compare the path to the best solution found so far. If it's an improvement, output it and set it as the new best solution.
Depending on the number of flights, you may be able to run this procedure exhaustively. The number of solutions grows exponentially with the number of flights, of course, so this will quickly become impractical. This is where the scoring function becomes useful - it prioritizes the solutions more likely to produce useful answers. You can run the procedure for as long as you want, and stop when it produces a solution you're happy with.
The properties of the scoring function will have a big impact on how good the solutions are. If your priority is exploring unique places, you want to put a big premium on unvisited airports, and since you want to explore as many as possible, you need to prioritize short transfer times. My suggestion for a starting point would be to make the penalty for going somewhere you've already been proportional to the time it would take to fly from there to somewhere else. That way, it'll still be explored as a stopover, but avoided where possible. Also, note that your scoring function will need context, namely the set of airports that have been visited by the current candidate path.
You can also use the scoring function to apply other constraints. Say you don't want to travel during the night (a reasonable assumption); you can penalize the score of edges that involve nighttime flights.
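A minimal sketch of this search, assuming the weekly schedule has already been expanded into concrete departure/arrival datetimes per airport (the data structure, weights, and penalties below are placeholders to adapt):

from datetime import timedelta

# flights[airport] -> list of (departure_dt, arrival_dt, dest) tuples,
# expanded from the weekly frequencies over the 15-day window
flights = {}                      # placeholder: fill from the Excel data
best = {"count": 0, "path": []}

def score(flight, visited, now):
    dep, arr, dest = flight
    s = 0.0
    if dest not in visited:
        s += 1000.0               # big premium for unvisited airports
    s -= (dep - now).total_seconds() / 3600.0   # prefer sooner departures
    if dep.hour < 6 or dep.hour >= 23:
        s -= 500.0                # example extra constraint: avoid night flights
    return s

def search(airport, now, deadline, visited, path):
    # candidate flights: leave after 'now' and land before the deadline
    options = [f for f in flights.get(airport, [])
               if f[0] >= now and f[1] <= deadline]
    if not options:               # dead end: compare with the best found so far
        if len(visited) > best["count"]:
            best["count"], best["path"] = len(visited), list(path)
        return
    options.sort(key=lambda f: score(f, visited, now), reverse=True)
    for dep, arr, dest in options:
        search(dest, arr, deadline, visited | {dest}, path + [(airport, dest, dep)])

Kick it off with search(start, t0, t0 + timedelta(days=15), {start}, []) and interrupt it whenever the best result found so far is good enough; run it to completion only if the search space turns out to be small.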

Related

How to simply compute the travel time from one point to another? (Without a plot)

I spent a lot of time reading and testing the example notebooks of OSMnx, but I couldn't figure out a way to simply calculate the travel time from a given point (GPS coordinates) to another one.
I would like to estimate, for each point in my list, how long it takes to get to a specific point (sometimes 100 km away). I don't need to generate a graph/map/plot, as I only need the duration of each trip (and I think OSMnx maps render better at city scale).
I am pretty desperate, as I could not find a simple way to do this across different Python libraries... If doing this calculation for ~10k points within a country-scale map is asking too much from OSMnx, could a locally stored .pbf file of the country be helpful for another solution?
There are inherent trade-offs when you want to model a large study area such as an entire region or an entire country: 1) model precision vs 2) area size vs 3) memory/speed. You need to trade off one of these three.
For the first, you can model a coarser-grained network, such as only major roads in the region/country, rather than millions of fine-grained residential streets and paths. For the second, you can study a smaller area. For the third, you can provision a machine with lots of memory and then let the script run for a while to complete the process. What you trade off will be up to your own needs for this analysis.
In the example code below, I chose to trade off #1: I've modeled this region (West Midlands) by its motorways and trunk roads. Given a different analytical goal, you may trade off other things instead. After creating the model, I randomly sample 1000 origin and destination lat-long points, snap them to the nearest nodes in the graph, and solve the shortest paths by travel time (accounting for speed limits) with multiprocessing.
import osmnx as ox
# get boundaries of West Midlands region by its OSM ID
gdf = ox.geocode_to_gdf('R151283', by_osmid=True)
polygon = gdf.iloc[0]['geometry']
# get network of motorways and trunk roads, with speed and travel time
cf = '["highway"~"motorway|motorway_link|trunk|trunk_link"]'
G = ox.graph_from_polygon(polygon, network_type='drive', custom_filter=cf)
G = ox.add_edge_speeds(G)
G = ox.add_edge_travel_times(G)
# randomly sample lat-lng points across the graph
origin_points = ox.utils_geo.sample_points(ox.get_undirected(G), 1000)
origin_nodes = ox.nearest_nodes(G, origin_points.x, origin_points.y)
dest_points = ox.utils_geo.sample_points(ox.get_undirected(G), 1000)
dest_nodes = ox.nearest_nodes(G, dest_points.x, dest_points.y)
%%time
# solve 1000 shortest paths between origins and destinations
# minimizing travel time, using all available CPUs
paths = ox.shortest_path(G, origin_nodes, dest_nodes, weight='travel_time', cpus=None)
# elapsed time: 9.8 seconds
For faster modeling, you can load the network data from a .osm XML file instead of having to make numerous calls to the Overpass API. By default, OSMnx divides your query area into 50km x 50km pieces, then queries Overpass for each piece one at a time so as not to exceed the server's per-query memory limits. You can configure this max_query_area_size parameter, as well as the server memory allocation, if you prefer to use OSMnx's API querying functions rather than its from-file functionality.
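For instance, the from-file and settings options look roughly like this (a sketch assuming OSMnx 1.x; setting names have moved between versions, the file path is a placeholder, and a .pbf extract would first need converting to .osm XML, e.g. with osmium):

import osmnx as ox

# Option A: build the graph from a locally stored .osm XML extract
# (no Overpass calls at all)
G = ox.graph_from_xml('west-midlands.osm')

# Option B: keep querying Overpass, but tune the query subdivision size
# and the memory allocation requested from the server
ox.settings.max_query_area_size = 4 * 50_000 * 50_000   # in square meters
ox.settings.memory = 2_000_000_000                      # in bytes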

Trying to work out how to produce a synthetic data set using python or javascript in a repeatable way

I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I'm assuming there's a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug into a PRNG to produce a series of data points that are not only distributed like the original's, but also fall within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do the analysis and generate the data points? As I said, my maths is OK, but my statistics isn't great (and the docs for the tools I've seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
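For concreteness, option b) is commonly done along these lines with pandas and scipy.stats (a rough sketch only; the file name, column, and candidate distributions are placeholders):

import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv('transactions.csv')        # the real-world source dataset
prices = df['UnitPrice'].to_numpy()

# fit a few off-the-shelf distributions and keep the best-scoring one
candidates = [stats.lognorm, stats.gamma, stats.expon]
best_dist, best_params, best_ks = None, None, np.inf
for dist in candidates:
    params = dist.fit(prices)
    ks_stat, _ = stats.kstest(prices, dist.name, args=params)
    if ks_stat < best_ks:
        best_dist, best_params, best_ks = dist, params, ks_stat

# generate synthetic values, clipped to the original's observed range
synthetic = best_dist.rvs(*best_params, size=10_000)
synthetic = np.clip(synthetic, prices.min(), prices.max())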
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I'm doing for timestamp and then 'make up' a product for each price that's generated, but I discarded that for a couple of reasons: it might be consistent 'within' a produced dataset, but not 'across' datasets, and I imagine that on largish sets it would double-count quite a bit.
So my next thought was that I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but I'm struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out which product to select. I thought maybe I could generate some sort of bucketed histogram to work out the frequency of purchases within each range of costs (say $0-1, $1-2, etc.). I could then use that frequency to define the probability that a given transaction's cost falls within one of those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
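A minimal sketch of that bucket-and-select idea (column names are taken from the source dataset shown below; the file name, bucket edges, and seed are arbitrary placeholders):

import numpy as np
import pandas as pd

df = pd.read_csv('transactions.csv')         # placeholder file name

# persistent product lookup table: one row per stock code with its unit price
products = (df.drop_duplicates('StockCode')[['StockCode', 'UnitPrice']]
              .reset_index(drop=True))

# bucket the transactions by price and measure purchase frequency per bucket
bins = [0, 1, 2, 5, 10, 20, 50, np.inf]
buckets = pd.cut(df['UnitPrice'], bins)
bucket_probs = buckets.value_counts(normalize=True).sort_index()

rng = np.random.default_rng(42)

def pick_product():
    # pick a price bucket with its observed probability, then pick a product
    # whose unit price falls inside that bucket (assumes every bucket with
    # transactions contains at least one product)
    bucket = rng.choice(bucket_probs.index.to_numpy(), p=bucket_probs.to_numpy())
    pool = products[pd.cut(products['UnitPrice'], bins) == bucket]
    return pool.iloc[rng.integers(len(pool))]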
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom

Is there a library or suggested tactic for shift planning with hours and breaks?

I'm trying to think through a sort of extra-credit project: optimizing our schedule.
Givens:
"Demand" numbers at half-hour granularity. These tell us the ideal number of people we'd have on at any given time;
8-hour shifts, plus a 1-hour lunch break taken more than 2 hours from both the start and the end of the shift (so 9 hours from start to finish);
Breaks: two 30-minute breaks in the middle of the shift;
For simplicity, assume an employee has the same schedule every day.
Desired result:
A dictionary or data frame with the best-case distribution of start times, breaks, and lunches across a given number of employees, such that the difference between staffed and demanded labor is minimized.
I have pretty basic Python, so my first guess was to just come up with all of the possible shift permutations (the points at which one could take breaks or lunches), then ask Python to select x of them (x = number of employees available) at random a lot of times, and tell me which combination best allocates the labor (see the rough sketch after this question). That seems a bit cumbersome and silly, but my limitations are such that I can't see beyond such a solution.
I have tried to look for libraries or tools that help with this, but the question here (how to distribute start times and breaks within a shift) doesn't seem to be widely discussed. I'm open to hearing that this is several years off for me, but...
Appreciate anyone’s guidance!
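A bare-bones sketch of the random-sampling idea mentioned in the question, assuming a day of 48 half-hour slots; the demand array, the candidate-shift enumeration, the sample count, and the employee count are all placeholders to adapt:

import itertools
import random
import numpy as np

SLOTS = 48                          # half-hour slots in one day
demand = np.zeros(SLOTS)            # placeholder: fill with the demand numbers
n_employees = 10                    # placeholder

def shift_coverage(start, lunch, b1, b2):
    # one employee: 9-hour span on site, minus a 1-hour lunch and two
    # 30-minute breaks (all indices are half-hour slots)
    cov = np.zeros(SLOTS)
    cov[start:start + 18] = 1
    cov[lunch:lunch + 2] = 0
    cov[b1] = 0
    cov[b2] = 0
    return cov

# enumerate candidate shifts; lunch sits more than 2 hours (4 slots) from
# both ends, breaks roughly in the middle (simplified constraints)
candidates = []
for start in range(0, SLOTS - 18):
    for lunch in range(start + 5, start + 12):
        for b1, b2 in itertools.combinations(range(start + 4, start + 14), 2):
            if b1 in (lunch, lunch + 1) or b2 in (lunch, lunch + 1):
                continue
            candidates.append(shift_coverage(start, lunch, b1, b2))

def cost(schedule):
    # total absolute gap between staffed and demanded labor
    return np.abs(schedule.sum(axis=0) - demand).sum()

best_cost, best_schedule = float('inf'), None
for _ in range(20_000):             # sample many random schedules
    schedule = np.array(random.choices(candidates, k=n_employees))
    c = cost(schedule)
    if c < best_cost:
        best_cost, best_schedule = c, schedule

An integer-programming library such as PuLP or OR-Tools can express the same problem exactly and will usually scale better than random sampling, but the sketch above matches the brute-force idea described in the question.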

Machine Learning Method to Classify using Demand

I have the following scenario:
My input is a set of points. Each point (long/lat coordinates) corresponds to the centroid of a subsection of a region and has a demand for schooling, e.g. 50 children in the neighborhood that need a school.
After using a clustering method (like k-means or DBSCAN) to aggregate these points by proximity, I want to allocate demand points to schools in such a way that the cluster demands (the sum of all point demands in that cluster) are satisfied.
In other words, I want to create schools on that cluster and allocate the children (points) to these schools.
Schools have a fixed capacity restriction.
E.g.: I need 3 schools (with a capacity of 40 each) to supply the demand of 100 children (P1, P2, P3) in cluster C4.
The main objective is, of course, to know the locations of these schools, but I can derive those from the resulting allocation.
What method should I use to fill the capacity of a cluster?
Is this the correct approach?
For nicely distributed data, I expect that the most effective way will be to start with a k-means clustering. If each resulting cluster fits within the schools' capacities, you have a solution.
However, your "worry" case is where at least one school is over capacity. For instance, you have 20 children on the north side of a wide river, 90 on the south side, and the schools have a capacity of 40: you need to assign at least 10 children from the south to the north.
The algorithmic way to deal with this is to implement a different error function: add a term that heavily penalizes (e.g. with +infinity cost) adding a 41st student to a cluster.
Another way is to allow the clusters to aggregate normally, but adjust afterward. Say the SE school has 46 students and the SW has 44: send the 6 and 4 students nearest to the north school to that school.
Is this enough guidance to work for you? Do you have cases where you would have multiple schools both over and under capacity? I don't want to over-engineer a solution.
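A minimal sketch of the cluster-then-rebalance idea (assuming scikit-learn; the capacity, the random data, and the rebalancing rule are simplified placeholders):

import numpy as np
from sklearn.cluster import KMeans

CAPACITY = 40
points = np.random.rand(110, 2)            # placeholder: your centroid coordinates
demand = np.random.randint(1, 10, 110)     # placeholder: children per point

n_schools = int(np.ceil(demand.sum() / CAPACITY))
km = KMeans(n_clusters=n_schools, n_init=10).fit(points, sample_weight=demand)
labels = km.labels_.copy()

def loads(labels):
    return np.array([demand[labels == c].sum() for c in range(n_schools)])

# post-adjustment: repeatedly move, from the most over-capacity cluster, the
# point closest to an under-capacity centroid into that cluster
for _ in range(1000):                      # iteration guard for a rough sketch
    load = loads(labels)
    if not (load > CAPACITY).any():
        break
    over = int(np.argmax(load))
    under = np.where(load < CAPACITY)[0]
    if len(under) == 0:
        break
    members = np.where(labels == over)[0]
    d = np.linalg.norm(points[members][:, None, :]
                       - km.cluster_centers_[under][None, :, :], axis=2)
    i, j = np.unravel_index(np.argmin(d), d.shape)
    labels[members[i]] = under[j]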

How to weight the beginnings of strings in Levenshtein distance algorithm

I am trying to use the Levenshtein distance algorithm (in Python, if it makes any difference) to do a fuzzy string comparison between two lists of company names. An example would be where list A contains XYZ INDUSTRIAL SUPPLY but list B says XYZ INDUSTRIAL SUPPLY, INC.; they should still match.
Right now, my implementation is terribly inaccurate. As a second example, the algorithm currently finds abc metal finishing and xyz's mach & metal finishing very similar because of their endings, but they are totally different companies. I want to improve this accuracy, and one way I think I can do that is by weighting the beginning of the strings somehow. If the company names are supposed to match, chances are they start out similarly. Looking at my first example, the whole beginning matches; it's only at the very end that variations occur. Is there a way to accomplish this? I haven't been able to work it out.
Thanks!
Edit for more examples:
Should match:
s west tool supply, southwest tool supply
abc indust inc, abc industries
icg usa, icg usa llc
Shouldn't match (but do currently):
ohio state university, iowa state university
m e gill corporation, s g corporation
UPDATE:
Progress has been made :) In case anyone is ever interested in this sort of thing, I ended up experimenting with the costs of inserts, deletes, and substitutions. My idea was to weight the beginning of the strings more heavily, so I based the weights on the current location within the matrix. The issue this created, though, was that everything was matching to a couple of very short names because of how my weights were being distributed. I fixed this by including the lengths in my calculation. My insertion weight, for example, ended up being (1 if index<=2/3*len(target) else .1*(len(source)-index)), where source is always the longer of the two strings. I'm planning to continue tweaking this and experimenting with other values, but it has already shown a great improvement. This is by no means anywhere close to an exact science, but if it's working, that's what counts!
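For anyone who lands here later, here is a sketch of what a position-weighted Levenshtein distance can look like (the weight function below is just one illustrative choice, not the exact one described in the update above):

def weighted_levenshtein(source, target, weight):
    # standard Levenshtein dynamic program, except that the cost of an edit
    # at position (i, j) is given by weight(i, j), e.g. higher near the start
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + weight(i, 0)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + weight(0, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else weight(i, j)
            d[i][j] = min(d[i - 1][j] + weight(i, j),       # deletion
                          d[i][j - 1] + weight(i, j),       # insertion
                          d[i - 1][j - 1] + sub)            # substitution
    return d[m][n]

# example weight: edits within roughly the first five characters cost 3x more
front_heavy = lambda i, j: 3.0 if (i + j) / 2 <= 5 else 1.0

print(weighted_levenshtein("ohio state university",
                           "iowa state university", front_heavy))

With a front-heavy weight like this, the early ohio/iowa mismatch costs more than a trailing ", INC." insertion, which is the behavior the question is after.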
