So I have been interested in a project to help my dad with his business, or at least for my own whimsy. Basically, the job involves going to different fields spread throughout the county, and a lot of how we do it now is inefficient and leapfroggy. So I would try to create a system that will find an optimized path. I'm not asking someone to build any of this for me, I just need to know the right direction to look, for gathering information on how to do this. So we have a map of our county and or county, and luckily because we live in Nebraska all county's are just big grids. And we have a bunch of different fields we need to get too, for this task, there are 2 to 3 different teams who each drive there own truck( so 1 to 2 trucks). And in some cases, there is certain fields truck A has to check. So I just would like some help researching this, I would prefer to write this all in python. I know about pathfinding algorithms, but that's about it. So really here are my questions: How do I make, or use a roadmap in python? How can I institute a pathfinding algorithm to that map? How can I make 2 of those algorithms making there own path of the same length, ignoring certain fields? Any help is appreciated. Here is a low-quality picture of our field map https://drive.google.com/file/d/1L5GNoUrtzTxJvfKoS04wGO8EgkK8Ulue/view?usp=sharing
Related
Hello all and thank you for your time.
I have a group of about 100 people. Each user has liked 3 of the other 99 users. These people need to be divided in about 4 groups of roughly equal size. They need to be grouped together so that the groups consist mostly of people who like each other, or at least as much as possible.
To make the problem a bit more challenging, I need to take into account gender and nationality but that could be done on a second level.
I have tried to create a matrix (directed graph) in Google Sheets but I am now stuck when it comes to clustering them. I am happy to use Python once I have identified the algorithms needed. Any suggestions please?
Thank you!
I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug in to a PRNG to produce a series of data points that not are not only distributed like the original's, but also within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do do the analysis and generate the data points? As I said, my maths is ok, but my statistics isn't great (and the docs for the tools I’ve seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I’m doing for timestamp and then ‘make up’ a product for each price that’s generated, but I discarded that for a couple of reasons: It might be consistent ‘within’ a produced dataset, but not ‘across’ data sets. And I imagine on largish sets would double count quite a bit.
So my next thought was I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but Im struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out what product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0-1, 1-2$ etc). I could then use that frequency to define the probability that a given transaction's cost would fall within one those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom
First of all, my apologies if I am not following some of the best practices of this site, as you will see, my home is mostly MSE (math stack exchange).
I am currently working on a project where I build a vacation recommendation system. The initial idea was somewhat akin to 20 questions: We ask the user certain questions, such as "Do you like museums?", "Do you like architecture", "Do you like nightlife" etc., and then based on these answers decide for the user their best vacation destination. We answer these questions based on keywords scraped from websites, and the decision tree we would implement would allow us to effectively determine the next question to ask a user. However, we are having some difficulties with the implementation. Some examples of our difficulties are as follows:
There are issues with granularity of questions. For example, to say that a city is good for "nature-lovers" is great, but this does not mean much. Nature could involve say, hot, sunny and wet vacations for some, whereas for others, nature could involve a brisk hike in cool woods. Fortunately, the API we are currently using provides us with a list of attractions in a city, down to a fairly granular level (for example, it distinguishes between different watersport activities such as jet skiing, or white water rafting). My question is: do we need to create some sort of hiearchy like:
nature-> (Ocean,Mountain,Plains) (Mountain->Hiking,Skiing,...)
or would it be best to simply include the bottom level results (the activities themselves) and just ask questions regarding those? I only ask because I am unfamiliar with exactly how the classification is done and the final output produced. Is there a better sort of structure that should be used?
Thank you very much for your help.
I think using a decision tree is a great idea for this problem. It might be an idea to group your granular activities, and for the "nature lovers" category list a number of different climate types: Dry and sunny, coastal, forests, etc and have subcategories within them.
For the activities, you could make a category called watersports, sightseeing, etc. It sounds like your dataset is more granular than you want your decision tree to be, but you can just keep dividing that granularity down into more categories on the tree until you reach a level you're happy with. It might be an idea to include images too, of each place and activity. Maybe even without descriptive text.
Bins and sub bins are a good idea, as is the nature, ocean_nature thing.
I was thinking more about your problem last night, TripAdvisor would be a good idea. What I would do is, take the top 10 items in trip advisor and categorize them by type.
Or, maybe your tree narrows it down to 10 cities. You would rank those cities according to popularity or distance from the user.
I’m not sure how to decide which city would be best for watersports, etc. You could even have cities pay to be top of the list.
Ive started learning Python and decided to give myself a golf related project to work on. My question revolves around choosing the best data type to use. Now i know th3nanswer to this is based on requirements but tht isnt helping me.
Besides simple data like name, date, name of course, etc., ill alao be generating 9 and 18 hole scores for multiple players in my locl society.
While keeping a historical record of past scores is nice i may want to perform some analytics across my dataset to find handicaps, hardest holes, etc. And, yes, i know there are apps out there aldeady im doing this to learn. ;)
So....which data structur should i use to work with? Lists, dictionaries, numpy arrats, objects, or a combination?
Many thanks!
I would like to be able to associate various models (Venues, places, landmarks) with a City/Country.
But I am not sure what some good ways of implementing this would be.
I'm following a simple route, I have implemented a Country and City model.
Whenever a new city or country is mentioned it is automatically created.
Unfortunately I have various problems:
The database can easily be polluted
Django has no real knowledge of what those City/Countries really are
Any tips or ideas? Thanks! :)
A good starting places would be to get a location dataset from a service like Geonames. There is also GeoDjango which came up in this question.
As you encounter new location names, check them against your larger dataset before adding them. For your 2nd point, you'll need to design this into your object model and write your code accordingly.
Here are some other things you may want to consider:
Aliases & Abbreviations
These come up more than you would think. People often use the names of suburbs or neighbourhoods that aren't official towns. You can also consider ones like LA -> Los Angeles MTL for Montreal, MT. Forest -> Mount Forest, Saint vs (ST st. ST-), etc.
Fuzzy Search
Looking up city names is much easier when differences in spelling are accounted for. This also helps reduce the number of duplicate names for the same place.
You can do this by pre-computing the Soundex or Double Metaphone values for the cities in your data set. When performing a lookup, compute the value for the search term and compare against the pre-computed values. This will work best for English, but you may have success with other romance language derivatives (unsure what your options are beyond these).
Location Equivalence/Inclusion
Be able to determine that Brooklyn is a borough of New York City.
At the end of the day, this is a hard problem, but applying these suggestions should greatly reduce the amount of data corruption and other headaches you have to deal with.
Geocoding datasets from yahoo and google can be a good starting poing, Also take a look at geopy library in django.