Machine Learning Method to Classify using Demand - python

I have the following scenario:
My input is a set of points. Each point (lat/long coordinates) corresponds to the centroid of a subsection of a region and has a demand for schooling, e.g. 50 children in that neighborhood who need a school.
After using a clustering method (like k-means or DBSCAN) to aggregate these points by proximity, I want to allocate demand points to schools in such a way that the cluster demands (the sum of all point demands in that cluster) are satisfied.
In other words, I want to create schools in each cluster and allocate the children (points) to these schools.
Schools have a fixed capacity restriction.
E.g.: I need 3 schools (capacity of 40 each) to supply the demand of 100 children (P1, P2, P3) in cluster C4.
The main objective is, of course, to know the locations of these schools, but I can work those out with some logic afterward.
What method should I use to fill the capacity of a cluster?
Is this the correct approach?

For nicely distributed data, I expect that the most effective way will be to start with a k-means clustering. If each resulting cluster fits within the schools' capacities, you have a solution.
However, your "worry" case is where at least one school is over capacity. For instance, you have 20 children on the north side of a wide river, 90 on the south side, and the schools have a capacity of 40: you need to assign at least 10 children from the south to the north.
The algorithmic way to deal with this is to implement a different error function: add a clause that heavily penalizes (i.e. +infinity cost) adding a 41st student to that cluster.
Another way is to allow the clusters to aggregate normally, but adjust afterward. Say the SE school has 46 students and the SW has 44: send the 6 and 4 students nearest to the north school to that school.
Is this enough guidance to work for you? Do you have cases where you would have multiple schools both over and under capacity? I don't want to over-engineer a solution.
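To make the "adjust afterward" idea concrete, here is a rough sketch in Python: plain k-means, then greedily moving the points closest to another center out of over-capacity clusters. The capacity value, the toy data, and every variable name are assumptions for illustration, and whole demand points are moved rather than individual children.

    import numpy as np
    from sklearn.cluster import KMeans

    CAPACITY = 40                                   # children per school (assumed)
    rng = np.random.default_rng(0)
    coords = rng.uniform(size=(30, 2))              # toy lat/long point centroids
    demand = rng.integers(1, 10, size=30)           # toy demand: children per point

    k = int(np.ceil(demand.sum() / CAPACITY))       # enough schools for the total demand
    km = KMeans(n_clusters=k, n_init=10).fit(coords)
    labels, centers = km.labels_.copy(), km.cluster_centers_

    def load(c):
        return demand[labels == c].sum()

    # Greedy rebalancing: while some cluster exceeds capacity, move its member point
    # that is closest to an under-capacity center (the loop cap keeps the sketch finite).
    for _ in range(1000):
        over = [c for c in range(k) if load(c) > CAPACITY]
        if not over:
            break
        c = over[0]
        members = np.where(labels == c)[0]
        under = [d for d in range(k) if load(d) < CAPACITY]
        dists = np.linalg.norm(coords[members][:, None, :] - centers[under][None, :, :], axis=2)
        i, j = np.unravel_index(dists.argmin(), dists.shape)
        labels[members[i]] = under[j]

    print([int(load(c)) for c in range(k)])         # per-school demand after rebalancing

Because whole points are moved, a cluster can still end slightly over capacity when one point's demand exceeds the remaining slack; the penalty-based error function described above avoids that, at the cost of a custom assignment loop.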

Related

Finding a big enough sample size by expanding search categories. Algorithmic clustering?

I'm interested in finding 50+ similar samples within a dataset of 3M+ rows and 200 columns.
Consider a .csv database of vehicles. Every row is one car, and the columns are features like brand, mileage, engine size, etc.:
brand | year bin  | engine bin | mileage
Ford  | 2014-2016 | 1-2        | 20K-30K
The procedure to automate:
When I receive a new sample, I want to find 50+ similar ones. If I can't find exact matches, I can drop or broaden some information. For example, the same model of Ford between 2012 and 2016 is nearly the same car, so I would expand the search with a bigger year bin. I expect that if I expand the search across enough categories, I will always find a big enough population.
After this, I end up with a "search query" like the one below, which returns 50+ samples, so it is as precise as possible while still being big enough to observe the mean, variance, etc.:
brand | year bin  | engine bin | mileage
Ford  | 2010-2018 | 1-2        | 10K-40K
Is there anything like this already implemented?
I've tried k-means clustering the vehicles on those features, but it isn't precise enough and isn't easily interpretable for people without a data science background. I think distance-based metrics can't capture "hard" constraints such as never searching across different brands. But maybe there is a way of weighting features?
I'm happy to receive any suggestions!
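A minimal sketch of the bin-widening procedure described above, assuming a pandas DataFrame with hypothetical numeric columns year, engine, and mileage plus a categorical brand column that is never relaxed; the starting half-widths are made up.

    import pandas as pd

    def find_similar(cars, sample, min_rows=50, max_steps=10):
        """Widen the numeric bins around `sample` until at least `min_rows` rows match."""
        half_widths = {"year": 1, "engine": 0.5, "mileage": 5_000}    # assumed starting bins
        for step in range(1, max_steps + 1):
            mask = cars["brand"] == sample["brand"]                   # hard constraint, never relaxed
            for col, width in half_widths.items():
                half = width * step                                   # widen every bin each step
                mask &= cars[col].between(sample[col] - half, sample[col] + half)
            subset = cars[mask]
            if len(subset) >= min_rows:
                return subset, {col: width * step for col, width in half_widths.items()}
        return cars[cars["brand"] == sample["brand"]], None           # fall back to brand only

    # Example usage (hypothetical file and columns):
    # cars = pd.read_csv("vehicles.csv")
    # subset, bins = find_similar(cars, {"brand": "Ford", "year": 2015, "engine": 1.6, "mileage": 25_000})

Widening every bin by the same factor is the simplest policy; relaxing the least informative column first would stay closer to the "maximally precise" goal described above.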

Finding out which cluster a newly entered data point belongs to and returning the other items in the same cluster

I have a set of product data with specifications etc.
I have applied k-modes clustering to the dataset to form clusters of the most similar products.
When I enter a new data point, I want to know which cluster it belongs to and which other products are almost the same as this new product. How do I go about this?
Use the nearest neighbors.
No need to rely on clustering, which tends to be unstable and produce unbalanced clusters. It's fairly common to have 90% of your data rightfully in the same cluster (e.g. a "normal users" cluster, or a "single visit" cluster). So you should ask yourself: what do you gain by doing this, and what is the cost-benefit ratio?
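A minimal sketch of the nearest-neighbors suggestion, assuming the product specifications have been one-hot encoded into a 0/1 matrix; the toy data, the Hamming metric, and k=10 are illustrative choices.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(500, 30))           # toy one-hot encoded product features
    new_product = rng.integers(0, 2, size=(1, 30))   # encoded specification of the new product

    nn = NearestNeighbors(n_neighbors=10, metric="hamming").fit(X)
    dist, idx = nn.kneighbors(new_product)
    print(idx[0])                                    # indices of the 10 most similar products

If a cluster label is still wanted, the same query answers it: the new item can simply inherit the most common cluster label among its nearest neighbors.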

How to weight the beginnings of strings in Levenshtein distance algorithm

I am trying to use the Levenshtein distance algorithm (in Python, if it makes any difference) to do a fuzzy string comparison between two lists of company names. An example would be where list A contains XYZ INDUSTRIAL SUPPLY while list B says XYZ INDUSTRIAL SUPPLY, INC.; they should still match.
Right now, my implementation is terribly inaccurate. As a second example, the algorithm currently finds abc metal finishing and xyz's mach & metal finishing very similar because of their endings, but they are totally different companies. I want to improve this accuracy, and one way I think I can do that is by weighting the beginnings of strings somehow. If the company names are supposed to match, chances are they start out similarly. Looking at my first example, the whole beginning matches; it's only at the very end that variations occur. Is there a way to accomplish this? I haven't been able to work it out.
Thanks!
Edit for more examples:
Should match:
s west tool supply, southwest tool supply
abc indust inc, abc industries
icg usa, icg usa llc
Shouldn't match (but do currently):
ohio state university, iowa state university
m e gill corporation, s g corporation
UPDATE:
Progress has been made :) In case anyone is ever interested in this sort of thing, I ended up experimenting with the costs of inserts, deletes, and substitutions. My idea was to weight the beginning of the strings more heavily, so I based the weights on the current location within the matrix. The issue this created, though, was that everything was matching to a couple of very short names because of how my weights were being distributed. I fixed this by including the string lengths in my calculation. My insertion weight, for example, ended up being (1 if index<=2/3*len(target) else .1*(len(source)-index)), where source is always the longer of the two strings. I'm planning to continue tweaking this and experimenting with other values, but it has already shown a great improvement. This is by no means anywhere close to an exact science, but if it's working, that's what counts!
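For anyone experimenting along the same lines, here is a generic sketch of a position-weighted Levenshtein distance in which edits near the beginning of the strings cost more than edits near the end; the weight function is illustrative and not the exact one tuned in the update above.

    def weighted_levenshtein(source, target, early_cost=1.0, late_cost=0.3):
        """Edit distance where early edits cost `early_cost` and late edits cost `late_cost`."""
        n, m = len(source), len(target)

        def w(i, j):
            # weight by how far along the longer string the edit happens
            pos = max(i, j) / max(n, m, 1)
            return early_cost if pos <= 2 / 3 else late_cost

        # dp[i][j] = cost of editing source[:i] into target[:j]
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dp[i][0] = dp[i - 1][0] + w(i, 0)               # deletions
        for j in range(1, m + 1):
            dp[0][j] = dp[0][j - 1] + w(0, j)               # insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0.0 if source[i - 1] == target[j - 1] else w(i, j)
                dp[i][j] = min(dp[i - 1][j] + w(i, j),      # delete
                               dp[i][j - 1] + w(i, j),      # insert
                               dp[i - 1][j - 1] + sub)      # substitute or match
        return dp[n][m]

    # Differences at the end now cost far less than differences at the start:
    print(weighted_levenshtein("xyz industrial supply", "xyz industrial supply, inc."))
    print(weighted_levenshtein("abc metal finishing", "xyz's mach & metal finishing"))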

Statistics: Estimating U.S. population-weighted average temperature from 100+ daily airport station measurements

I have recently signed up for a developer's key for the Census API (http://www.census.gov/developers/) and will be using a Python wrapper class to access the Census database.
I also have access to a data feed of the daily average temperatures and forecasts from 100+ airport stations distributed across the U.S. (These stations are largely representative of the U.S. population since they are located in major cities.) With minimal assumptions, what would be the best way to map the entire population of the United States onto the set of 100+ airports, so that I may derive a population-weighted average temperature? This would probably entail some kind of distance/climate function. What are some nuances I should consider when doing this?
(1) Sounds like you need something akin to a Voronoi tessellation, but built on zip code regions instead of continuous space. Essentially you need to assign each zip code region to the "nearest" airport, then weight the airport's observations by the fraction of population in all the nearby zip codes. (I'm assuming the census data is organized by zip codes.) I say "nearest" in quote marks because there could be different ways to consider that; e.g. distance to geographical center of region, distance to population center of region, time to travel from center to airport, probably others. You can probably use a brute force algorithm to assign zip codes to airports: just cycle through all zip codes and find the airport which is "nearest" in the sense you've chosen. That could be slow but you only have to do it once (well, once for each definition of "nearest").
(2) You might get more traction for this question on CrossValidated.
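A brute-force sketch of the assignment in point (1): each zip-code centroid is mapped to its nearest airport, and each airport's temperature is then weighted by the population it serves. The haversine distance, the array names, and the toy data are all assumptions for illustration.

    import numpy as np

    def haversine(lat1, lon1, lat2, lon2, r=6371.0):
        """Great-circle distance in km between points given in degrees (broadcasts over arrays)."""
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
        return 2 * r * np.arcsin(np.sqrt(a))

    rng = np.random.default_rng(0)
    zip_latlon = rng.uniform([25, -125], [49, -67], size=(1000, 2))   # toy zip-code centroids
    zip_pop = rng.integers(500, 50_000, size=1000)                    # toy populations per zip
    ap_latlon = rng.uniform([25, -125], [49, -67], size=(100, 2))     # toy airport coordinates
    ap_temp = rng.uniform(-5, 35, size=100)                           # toy daily mean temps (deg C)

    # Nearest airport for every zip code: brute-force 1000 x 100 distance matrix.
    d = haversine(zip_latlon[:, None, 0], zip_latlon[:, None, 1],
                  ap_latlon[None, :, 0], ap_latlon[None, :, 1])
    nearest = d.argmin(axis=1)

    # Population served by each airport, then the population-weighted average temperature.
    served = np.bincount(nearest, weights=zip_pop, minlength=len(ap_temp))
    print(round(float(np.sum(served * ap_temp) / served.sum()), 2))

Swapping in distance to the population center of each zip, or travel time, only changes the distance function; the weighting step stays the same.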

Need algorithm suggestions for flight routings

I'm in the early stages of thinking through a wild trip that involves visiting every commercial airport in India. A little research shows that the national carrier, Air India, has a special ticket called the Silver Pass that allows unlimited travel on their domestic network for 15 days. I would like to use this as my weapon of choice!
See this for a map of all the airports served by Air India
I have the following information available with me in Excel:
All of the domestic flight routes (departure airports and arrival airports in IATA codes)
Duration for every flight route
Weekly frequency for every flight (not all flights run on all days of the week, for example)
Given this information, how do I figure out the maximum number of airports that I can hit in 15 days using the Silver Pass ticket? Looking online shows that this is either a traveling salesman problem or a graph traversal problem. What would you guys recommend that I look at to solve this?
Some background on myself - I'm just beginning to learn Python and would like to figure out a way to solve this problem using that. Given that, what are the python-based algorithms/libraries that I should be looking at that will help me structure an approach to solving this?
Your problem is closely related to the Hamiltonian Path problem and Traveling Salesman Problem, which are NP-Hard.
Given an instance of Hamiltonian Path Problem - build a flight data:
Each vertex is an airport
Each edge is a flight
All flights leave at the same, synchronized times and have the same duration. (*)
(*) The flight duration and the departure times [which are common to all flights] should be chosen so that you can visit all airports only if you visit each airport exactly once. This can easily be done in polynomial time: assuming a fixed time of k hours for the ticket, we construct the flight table such that each flight takes exactly k/(n-1) hours, and there is a flight every k/(n-1) hours as well (1) [remember, all flights leave at the same times].
It is easy to see that you can use the ticket to visit all airports if and only if the graph has a Hamiltonian path: if we visit some airport twice along the path, we need at least n flights, so the total time is at least (k/(n-1)) * n > k and we exceed the time limit. [The other direction is similar.]
Thus your problem [for general case] is NP-Hard, and there is no known polynomial solution for it.
(1) We assume it takes no time to transfer between flights; this can easily be fixed by simply shortening each flight by the time it takes to "jump" between two flights.
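Spelling out the time-budget step above: with n airports and every flight lasting k/(n-1) hours,

    (n-1)\cdot\frac{k}{n-1} = k,
    \qquad
    n\cdot\frac{k}{n-1} = k + \frac{k}{n-1} > k \quad (n > 1),

so a route that visits every airport exactly once (n-1 flights) fits the ticket exactly, while any route that revisits an airport needs at least n flights and breaks the k-hour budget.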
Representing your problem as a graph is definitely the best option. Since the duration, number of flights, and number of airports are relatively limited, and since you are (presumably) happy with approximate solutions, attacking this by brute force ought to be practical and is probably the way to go. Here's roughly what I would do:
Represent each airport as a node on the graph, and each flight as an edge.
Given a starting airport and a current time, select all the flights leaving that airport after the current time. Use a scoring function of some sort to rank them, such that flights to airports you haven't visited are ranked higher than flights to airports you have already visited, and earlier flights are ranked higher than later ones.
Recursively explore each outgoing edge, in order of score, and repeat the procedure for the arriving airport.
Any time you reach a node with no valid outgoing edges, compare it to the best solution found so far. If it's an improvement, output it and set it as the new best solution.
Depending on the number of flights, you may be able to run this procedure exhaustively. The number of solutions grows exponentially with the number of flights, of course, so this will quickly become impractical. This is where the scoring function becomes useful - it prioritizes the solutions more likely to produce useful answers. You can run the procedure for as long as you want, and stop when it produces a solution you're happy with.
The properties of the scoring function will have a big impact on how good the solutions are. If your priority is exploring unique places, you want to put a big premium on unvisited airports, and since you want to explore as many as possible, you need to prioritize short transfer times. My suggestion for a starting point would be to make the penalty for going somewhere you've already been proportional to the time it would take to fly from there to somewhere else. That way, it'll still be explored as a stopover, but avoided where possible. Also, note that your scoring function will need context, namely the set of airports that have been visited by the current candidate path.
You can also use the scoring function to apply other constraints. Say you don't want to travel during the night (a reasonable assumption); you can penalize the score of edges that involve nighttime flights.
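A rough sketch of this recursive search in Python. The flight-table format (origin, destination, departure hour on an absolute 0 to 360 hour timeline, duration in hours), the scoring function, and the toy data are assumptions for illustration, not the real Air India schedule.

    import random

    HOURS = 15 * 24                                    # Silver Pass budget in hours

    def search(flights, airport, now, visited):
        """Depth-first search over outgoing flights, best-scoring edges first."""
        options = [f for f in flights
                   if f[0] == airport and f[2] >= now and f[2] + f[3] <= HOURS]

        def score(f):
            return (f[1] not in visited, -f[2])        # unvisited first, then earliest departure

        best = len(visited)
        for f in sorted(options, key=score, reverse=True):
            best = max(best, search(flights, f[1], f[2] + f[3], visited | {f[1]}))
        return best

    # Toy example: 6 airports and 25 random two-hour flights in the 15-day window.
    random.seed(0)
    airports = ["DEL", "BOM", "MAA", "CCU", "BLR", "HYD"]
    flights = [(a, b, random.randrange(0, HOURS - 3), 2)
               for _ in range(25)
               for a, b in [random.sample(airports, 2)]]
    print(search(flights, "DEL", 0, {"DEL"}))          # most distinct airports reachable from DEL

This version runs exhaustively and only reports the best count; to recover the itinerary itself, or to stop early, track the path alongside visited and fold the penalty ideas from the answer into score().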
