Python PuLP Optimization - How to improve performance? - python

I'm writing a script to help me schedule shifts for a team. I have the model and the script to solve it, but when I try to run it, it takes too long. I think it might be an issue with the size of the problem I'm trying to solve. Before getting into the details, here are some characteristics of the problem:
Team has 13 members.
Schedule shifts for a month: 18 operating hours/day, 7 days/week, for 4 weeks.
I want to schedule non-fixed shifts: members can be on shift during any of the 18 operating hours. They can even be on for a couple hours, off for another hour, and then on again, etc. This is subject to some constraints.
Sets of the problem:
m: each of the members
h: operating hours (0 to 17)
d: days of the week (0 to 6)
w: weeks of the month (0 to 3)
Variables:
X[m,h,d,w]: Binary. 1 if member m starts a shift at hour h on day d of week w; 0 otherwise.
Y[m,h,d,w]: Binary. 1 if member m is on shift at hour h on day d of week w; 0 otherwise.
W[m,d,w]: Binary. 1 if member m had any shift on day d,w. 0 otherwise.
G[m,w]: Binary. 1 if member m had the weekend off during week w. 0 otherwise.
The problem has 20 constraints: 13 are "real constraints" of the problem and 7 are relations between the variables, needed for the model to work.
When I run it, I get this message:
At line 2 NAME MODEL
At line 3 ROWS
At line 54964 COLUMNS
At line 295123 RHS
At line 350083 BOUNDS
At line 369067 ENDATA
Problem MODEL has 54959 rows, 18983 columns and 201097 elements
Coin0008I MODEL read with 0 errors
I left it running overnight and didn't even get a solution. Then I tried changing all variables to continuous variables and it took ~25 seconds to find the optimal solution.
I don't know if I'm having issues because of the size of the problem, because I'm using only binary variables, or a combination of both.
Are there any general tips or best practices that could be used to improve performance on a model? Or is it always related to the specific model I'm trying to solve?

The long solve time is almost certainly due to the number of binary variables in your model. The model may also be degenerate, meaning there are many solutions that produce the same value of the objective, so the solver is just thrashing around trying all of the (relatively equal) combinations.
The fact that it does solve when you relax the integrality constraints is a good sign, and good troubleshooting. Assuming the model is constructed properly, here are a couple of things to try:
Solve for 1 week at a time (4 separate solves). It isn't clear from your description what the linkage is from week to week, but this would reduce the model to 1/4 of its size.
Change your time blocks to more than 1 hour. If you used 3-hour blocks, your problem would again shrink to 1/3 of its size. You would only need to reduce the set H to {0,1,2,3,4,5} and then do the mental math to align that with the actual hours. Or you could use 2-hour blocks.
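As a sketch of that aggregation: the remapping is just integer division, so the model's hour set shrinks while a small helper translates back to wall-clock hours (the helper names here are hypothetical, not from your model):

```python
# Aggregate the 18 operating hours (0-17) into 3-hour blocks (0-5).
# Hypothetical helper names; adapt to your own model's sets.
HOURS = range(18)
BLOCKS = sorted({h // 3 for h in HOURS})  # the reduced set H

def block_of(hour):
    """3-hour block that a given operating hour falls into."""
    return hour // 3

def hours_in(block):
    """Operating hours covered by a given block."""
    return [3 * block, 3 * block + 1, 3 * block + 2]
```

This cuts the h index of X and Y from 18 values to 6, shrinking the variable count accordingly.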
You should also tinker with the optimality gap. Sometimes the difference between 95% optimal and 100% optimal is days/weeks/years of solver time. Have you tried a 0.02, 0.05, 0.10, or 0.25 relative gap? You may still get the optimal answer, but you forgo the guarantee of optimality.
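In PuLP, the relative gap and a time limit can be passed to the bundled CBC solver. A minimal sketch on a toy binary model (the variables and data are made up, not from your model; the gapRel keyword needs PuLP 2.x, older versions called it fracGap):

```python
import pulp

# Toy binary model, just to show where the solver options go.
prob = pulp.LpProblem("toy", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(5)]
prob += pulp.lpSum((i + 1) * x[i] for i in range(5))  # objective
prob += pulp.lpSum(x) <= 3                            # pick at most 3

# Accept any solution within 5% of optimal; give up after 60 seconds.
solver = pulp.PULP_CBC_CMD(msg=False, gapRel=0.05, timeLimit=60)
prob.solve(solver)
status = pulp.LpStatus[prob.status]
```

On a model your size, a 2-5% gap plus a time limit often turns an overnight run into minutes.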

Related

What is window_size in time-series and What are the advantages and disadvantages of having small and large window_size?

I am quite a beginner in machine learning. I have tried hard to understand this concept but could not, even with Google. I need to understand it in an easy way.
Please explain this question in easy words and in detail.
This question is better suited for Stack Exchange, as it is not a specific coding question.
Window size is the duration of observations that you ask an algorithm to consider when learning a time series. For example, if you need to predict tomorrow's temperature and you use a window of 5 days, then the algorithm will divide your entire time series into segments of 6 days (5 training days and 1 prediction day) and try to learn, from the historic records, how to use only 5 days of data to predict the next day.
Advantage of short window:
You get more samples out of the time series, so your estimates of short-term effects are more reliable (a 100-day historic time series will provide you around 95 samples if you are using a 5-day window, so the model is more certain about the influence the past 5 days have on next-day temperature)
Advantage of long window:
Long windows allow you to better learn seasonal and trend effects (think of events that happen yearly, monthly, etc.). If your window is small, say 5 days, your model will not learn any seasonal effect that occurs monthly. However, if your window is 60 days, then every sample of data you feed to the model contains at least 2 occurrences of the monthly seasonal effect, which enables your model to learn such seasonality.
The downside of a long window is that the number of samples decreases. With a 100-day time series, a 60-day window yields only 40 samples of data. Every parameter of your model is then fitted on a much smaller sample, which may reduce the reliability of the model.
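The sample counts above follow directly from how the series is segmented; a quick check of the 95-vs-40 arithmetic:

```python
def n_samples(series_len, window, horizon=1):
    """Number of (window, horizon) training pairs a series yields."""
    return series_len - window - horizon + 1

# 100-day series: a 5-day window gives 95 samples, a 60-day window gives 40.
short_window = n_samples(100, 5)
long_window = n_samples(100, 60)
```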
"Window size" typically refers to the number of time periods used to calculate a statistic or fit a model.
The advantages and disadvantages of various window sizes come down to the balance between sensitivity to changes in the data and susceptibility to noise and outliers.
If you have ever dealt with moving-average indicators on the stock market, you will know that each window size has a purpose, and different window sizes are often used in combination to get a more holistic view/understanding, e.g. MA20 vs MA50 vs MA100. Each of these indicators uses a different window size to calculate the moving average of the stock of interest.
Image Source: Yahoo Finance
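For concreteness, here is a plain-Python moving average over a small made-up price list (the real MA20/MA50/MA100 indicators are the same idea with bigger windows):

```python
def moving_average(prices, window):
    """Simple moving average; the first window-1 entries have no value yet."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out

prices = [10, 12, 11, 13, 15, 14]   # made-up closing prices
ma3 = moving_average(prices, 3)     # a 3-period window
```

Notice how the 3-period average smooths the day-to-day wiggles; a 20- or 50-period window smooths correspondingly more, at the cost of reacting later to real changes.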

How do I detect and remove multiple substrings, which may vary?

Every week, my school makes us type up which assignments we will be doing for the week. I want to do this automatically with Python.
Below is an example of the input I might want the program to take.
Unit 1 presentation - Due in 2 days - 40 minutes spent
English Unit 2 Test - Due in 3 days - 1 hour spent
History Essay - Due in 7 days - 10 minutes spent
I want to get rid of everything other than the names of the assignments. How can I achieve this? I can't simply use the find() method, since the substrings may vary.
Google hasn't been much help. I'm a beginner, so please bear with me.
I would process each line like this:
assignment_name = line.split("-")[0].strip()
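A fuller sketch of the same idea, using " - " (with spaces) as the separator so a hyphen inside an assignment name would survive; the sample lines are the ones from the question:

```python
text = """Unit 1 presentation - Due in 2 days - 40 minutes spent
English Unit 2 Test - Due in 3 days - 1 hour spent
History Essay - Due in 7 days - 10 minutes spent"""

# Keep only the part before the first " - " on each line.
names = [line.split(" - ")[0].strip() for line in text.splitlines()]
```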

Employee Scheduling Constraints Issue

I am currently trying to develop an employee scheduling tool in order to reduce the daily workload. I am using Pyomo to set up the model, but unfortunately I am stuck on one of the constraints.
Here is the simplified background:
4 shifts are available for assignment: RDO (Regular Day Off), M (Morning), D (Day), and N (Night). All of them are 8-hour shifts.
Every employee gets 1 RDO per week, and a constant RDO is preferred (say, staff A would ideally have Monday as their day off every week, but this can be violated).
The same working shift (M / D / N) is preferred for every staff member week by week (the constraint I am stuck on):
a. Example 1 (RDO on Monday): the shifts from Tuesday to Sunday should be, or would ideally be, the same
b. Example 2 (RDO on Thursday): the shifts from Monday to Wednesday should match the last working day of the prior week, while the shifts from Friday to Sunday this week also need to match each other, though not any particular shift
Since the RDO day (Mon - Sun) differs among employees, the constraint in point 3 also needs to change conditionally from person to person (say, if RDO == "Mon" then do A; else if RDO == "Tue" then do B). I have no idea how this can be reflected in the constraints, as IF / ELSE statements can't really work in a solver.
Appreciate if you can give me some hints or direction. Thanks very much!
The constraints you are trying to create could be moderately complicated, are very dependent on how you set up the problem, how many time periods you look at in the model, and so on, and are probably beyond the scope of one answer. Are you taking an LP course in school? If so, you might want to bounce your framework off your instructor for ideas.
That aside, you might want to tackle the RDO by assigning each person a cost table based on their preferences and then putting a small penalty in the objective based on those "costs" to influence the solver to give them their "pick" -- this assumes the picks are relatively well distributed and not everybody wants Friday off, etc.
You could probably do the same with the shifts, essentially making a parameter that is indexed by [employee, shift] with "costs" and using that in the obj in a creative way. This would be the easiest solution... others get into counting variables, big-M, etc.
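Outside any solver, the bookkeeping for such a cost table might look like the sketch below (the employees, shifts, and cost values are all made up). In Pyomo you would declare cost as a Param indexed by (employee, shift) and add the equivalent sum over the assignment variables to the objective expression:

```python
# Hypothetical preference costs, indexed by (employee, shift).
# 0 = that employee's preferred shift; larger = less preferred.
cost = {
    ("A", "M"): 0, ("A", "D"): 2, ("A", "N"): 5,
    ("B", "M"): 3, ("B", "D"): 0, ("B", "N"): 1,
}

# A candidate schedule: employee -> assigned shift for each day.
schedule = {"A": ["M", "M", "D"], "B": ["D", "N", "D"]}

# The penalty term that would be added to the objective.
penalty = sum(cost[emp, s] for emp, shifts in schedule.items()
              for s in shifts)
```

The solver then trades this penalty off against the hard constraints, so preferences bend only when they must.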
In this case you could use a switch-case construct: use the RDO as the input, with the days of the week as the cases, and inside each case you can then do the rest of the planning.
Here is a good reference:
https://pythongeeks.org/switch-in-python/

Is there a library or suggested tactic for shift planning with hours and breaks?

I’m trying to think through a sort of extra-credit project: optimizing our schedule.
Givens:
“demand” numbers that go down to the 1/2 hour. These tell us the ideal number of people we’d have on at any given time;
8-hour shift, plus a 1-hour lunch break that must fall more than 2 hours from the start and end of the shift (9 hours from start to finish);
Breaks: 2x 30 minute breaks in the middle of the shift;
For simplicity, can assume an employee would have the same schedule every day.
Desired result:
Dictionary or data frame with the best-case distribution of start times, breaks, lunches across an input number of employees such that the difference between staffed and demanded labor is minimized.
I have pretty basic Python, so my first guess was to come up with all of the possible shift permutations (the points at which one could take breaks or lunches), then ask Python to select x of them at random (x = number of employees available) a lot of times, and tell me which selection best allocates the labor. That seems a bit cumbersome and silly, but my limitations are such that I can’t see beyond such a solution.
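Roughly, the loop I have in mind would look like this (the templates and demand below are tiny made-up stand-ins for real half-hour data; with few enough employees you could even enumerate every combination instead of sampling):

```python
import itertools

# Each template is coverage per half-hour slot (1 = on the floor).
templates = [
    (1, 1, 0, 1, 1),   # break in slot 2
    (1, 0, 1, 1, 1),   # break in slot 1
    (0, 1, 1, 1, 1),   # later start
]
demand = (2, 2, 1, 2, 2)   # ideal headcount per slot

def mismatch(chosen):
    """Total absolute gap between staffed and demanded labor."""
    staffed = [sum(t[i] for t in chosen) for i in range(len(demand))]
    return sum(abs(s - d) for s, d in zip(staffed, demand))

# With 2 employees, just score every pair of templates and keep the best.
best = min(itertools.combinations_with_replacement(templates, 2),
           key=mismatch)
```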
I have tried to look for libraries or tools that help with this, but the question here (how to distribute start times and breaks within a shift) doesn’t seem to be widely discussed. I’m open to hearing that this is several years off for me, but...
Appreciate anyone’s guidance!

Understanding SLOCCount output

I recently ran the SLOCCount tool because I needed to estimate the number of lines in a large project.
This is what it showed:
Totals grouped by language (dominant language first):
python: 7826 (100.00%)
Total Physical Source Lines of Code (SLOC) = 7,826
Development Effort Estimate, Person-Years (Person-Months) = 1.73 (20.82)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months) = 0.66 (7.92)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 2.63
Total Estimated Cost to Develop = $ 234,346
(average salary = $56,286/year, overhead = 2.40).
I'm not entirely sure how it comes up with all those estimates but one in particular threw me off, the Development Effort Estimate. I read about the COCOMO model but I'm still a bit lost.
What is the meaning of this estimate in simple words?
The Development Effort Estimate is a measure of how much time it might have taken to create the 7.8k lines of Python code.
If you believe in divisible man-months of effort, it would have taken one person about 21 months to produce (might be about right), or two people about 11 months (a bit optimistic), or three people about 7 months (quite optimistic). In practice, it doesn't scale linearly like that, and some tasks are indivisible. Putting 9 women to work to produce a baby in 1 month doesn't work, even though it takes 1 woman 9 months to produce a baby.
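For reference, every figure in the SLOCCount output can be reproduced from the two Basic COCOMO formulas it prints, plus the salary and overhead defaults shown in the question:

```python
ksloc = 7.826  # 7,826 SLOC, in thousands

person_months = 2.4 * ksloc ** 1.05              # effort, ~20.82
schedule_months = 2.5 * person_months ** 0.38    # calendar time, ~7.92
developers = person_months / schedule_months     # average team size, ~2.63
cost = (person_months / 12) * 56286 * 2.40       # salary x overhead, ~$234k
```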
Is $56k really the average salary for a programmer these days?
COCOMO estimates how long it would take the average developer in a large company to create this software.
It's a very rough estimate, but there are parameters (called drivers) that you can tweak to make it more accurate to your case.
Some tools like ProjectCodeMeter can auto-detect these parameters and make the calculation for you.
