I recently ran the SLOCCount tool because I needed to estimate the number of lines of code in a large project.
This is what it showed:
Totals grouped by language (dominant language first):
python: 7826 (100.00%)
Total Physical Source Lines of Code (SLOC) = 7,826
Development Effort Estimate, Person-Years (Person-Months) = 1.73 (20.82)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months) = 0.66 (7.92)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 2.63
Total Estimated Cost to Develop = $ 234,346
(average salary = $56,286/year, overhead = 2.40).
I'm not entirely sure how it comes up with all those estimates, but one in particular threw me off: the Development Effort Estimate. I read about the COCOMO model, but I'm still a bit lost.
What is the meaning of this estimate in simple words?
The Development Effort Estimate is a measure of how much time it might have taken to create the 7.8k lines of Python code.
If you believe in divisible man-months of effort, it would have taken one person about 21 months to produce (might be about right), two people about 11 months (a bit optimistic), or three people about 7 months (quite optimistic). In practice, effort doesn't scale linearly like that, and some tasks are indivisible: putting 9 women to work doesn't produce a baby in 1 month, even though it takes 1 woman 9 months to produce one.
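For reference, here is a minimal sketch of the basic COCOMO arithmetic that SLOCCount reports; the formulas and constants come straight from the output above, so you can reproduce the numbers yourself:

```python
# Basic COCOMO, using the formulas printed in the SLOCCount output above.
ksloc = 7826 / 1000.0                     # thousands of source lines of code

effort_pm = 2.4 * ksloc ** 1.05           # person-months   -> ~20.8
schedule_m = 2.5 * effort_pm ** 0.38      # calendar months -> ~7.9
developers = effort_pm / schedule_m       # average headcount -> ~2.6

# Cost = person-years * average salary * overhead factor (values from the report).
cost = (effort_pm / 12) * 56286 * 2.40    # -> roughly $234,000

print(f"Effort:     {effort_pm:.2f} person-months")
print(f"Schedule:   {schedule_m:.2f} months")
print(f"Developers: {developers:.2f}")
print(f"Cost:       ${cost:,.0f}")
```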
Is $56k really the average salary for a programmer these days?
COCOMO estimates how long it would take the average developer in a large company to create this software.
It's a very rough estimate, but there are parameters (called cost drivers) that you can tweak to make it more accurate for your case.
Some tools like ProjectCodeMeter can auto-detect these parameters and make the calculation for you.
I am quite a beginner in machine learning. I have tried hard to understand the concept of window size, but I couldn't find an explanation on Google that I could follow.
Please explain it in simple words and in as much detail as possible.
This question is better suited for Stack Exchange, as it is not a specific coding question.
Window size is the duration of observations that you ask an algorithm to consider when learning a time series. For example, if you need to predict tomorrow's temperature and you use a window of 5 days, the algorithm will divide your entire time series into segments of 6 days (5 training days and 1 prediction day) and try to learn, from the historic records, how to use only 5 days of data to predict the next day.
Advantage of a short window:
You get more samples out of the time series, so your estimates of short-term effects are more reliable (a 100-day historic time series will provide around 95 samples if you are using a 5-day window, so the model is more certain about what influence the past 5 days have on the next day's temperature).
Advantage of a long window:
Long windows allow you to better learn seasonal and trend effects (think of events that happen yearly, monthly, etc.). If your window is small, say 5 days, your model will not learn any seasonal effect that occurs monthly. However, if your window is 60 days, then every sample of data that you feed to the model contains at least 2 occurrences of the monthly seasonal effect, which enables your model to learn that seasonality.
The downside of a long window is that the number of samples decreases. For a 100-day time series, a 60-day window will only yield 40 samples. This means every parameter of your model is fitted on a much smaller sample of data, which may reduce the reliability of the model.
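To make the windowing and the sample counts concrete, here is a minimal sketch (the data and the helper function are illustrative, not from any particular library):

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (past window, next value) training pairs."""
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])  # e.g. the past 5 days
        y.append(series[start + window])        # the day to predict
    return np.array(X), np.array(y)

# 100 days of fake daily temperatures.
temps = np.random.default_rng(0).normal(20, 5, size=100)

X5, y5 = make_windows(temps, window=5)
X60, y60 = make_windows(temps, window=60)
print(X5.shape, X60.shape)  # (95, 5) vs (40, 60): the short window yields more samples
```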
"window size" typically refers to the number of time periods that are used to calculate a statistic or model.
The advantages and disadvantages of various window sizes come down to the balance between sensitivity to changes in the data and susceptibility to noise and outliers.
If you have ever dealt with moving-average indicators on the stock market, you will understand that each window size has a purpose, and different window sizes are often used in combination to get a more holistic view, e.g. MA20 vs MA50 vs MA100. Each of these indicators uses a different window size to calculate the moving average of the stock of interest.
(Chart omitted. Image source: Yahoo Finance.)
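For example, a minimal pandas sketch of the same idea (the price series here is made up; in practice it would come from your data source):

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices standing in for a real stock series.
close = pd.Series(np.linspace(100, 160, 200) + np.random.default_rng(0).normal(0, 3, 200))

ma = pd.DataFrame({
    "close": close,
    "MA20": close.rolling(window=20).mean(),    # short window: reacts quickly, noisier
    "MA50": close.rolling(window=50).mean(),
    "MA100": close.rolling(window=100).mean(),  # long window: smoother, lags behind
})
print(ma.tail())
```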
I am a non-programmer here, but I would like to learn by solving one problem at a time.
I want to make a production plan based on the availability of raw materials, where the percentage of each raw material (RM) determines the product type.
Example -
Product_1: RM_1 - 50%, RM_2 - 30%, RM_3 - 20%.............1st Choice
Product_2: RM_1 - 50%, RM_4 - 10%, RM_5 - 40%.............2nd Choice
Product_3: RM_2 - 40%, RM_3 - 20%, RM_5 - 30%.............3rd Choice
The number of products is fixed, i.e., 4.
The number of raw materials is fixed, i.e., 8.
I have 2 production lines with different capacities (line_1: 10 tons per day, line_2: 15 tons per day).
I intend to produce Product_1 for the whole year but the main constraint is that the Raw materials of choice (RM_1, RM_2 and RM_3) may not be readily available all the time. All the raw materials have to be imported. They are shipped 70-80 times a year. Hence until the inventory is replenished, I have to switch to Product_2 or Product_3 depending on the ratio of raw materials available to keep my plant running.
So, I am looking for a production planner that can consider all these constraints and formulate a day-wise production plan.
If someone would be kind enough to walk me through the steps I have to perform, I would be grateful.
I tried to do this in Excel, but there are too many constraints to formulate without programming.
I read about you good folks while googling for a solution, hence I am here. Thanks.
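Not a complete answer, but one common way to handle this kind of problem in Python is to formulate it as a linear program with a library such as PuLP. The sketch below covers a single day, uses made-up inventory numbers, and ignores shipment timing and the split between the two lines; it is only meant to show how the recipes, preferences, and capacity could be encoded.

```python
import pulp

# Recipes: fraction of each raw material per ton of product (numbers from the question).
recipes = {
    "Product_1": {"RM_1": 0.5, "RM_2": 0.3, "RM_3": 0.2},
    "Product_2": {"RM_1": 0.5, "RM_4": 0.1, "RM_5": 0.4},
    "Product_3": {"RM_2": 0.4, "RM_3": 0.2, "RM_5": 0.3},
}
preference = {"Product_1": 3, "Product_2": 2, "Product_3": 1}        # higher = more preferred
inventory = {"RM_1": 8, "RM_2": 5, "RM_3": 3, "RM_4": 6, "RM_5": 6}  # tons on hand (made up)
capacity = 10 + 15  # line_1 + line_2, tons per day

model = pulp.LpProblem("daily_plan", pulp.LpMaximize)
make = {p: pulp.LpVariable(f"make_{p}", lowBound=0) for p in recipes}  # tons produced today

# Objective: prefer Product_1, then Product_2, then Product_3.
model += pulp.lpSum(preference[p] * make[p] for p in recipes)

# Total production cannot exceed the combined line capacity.
model += pulp.lpSum(make.values()) <= capacity

# Raw material usage cannot exceed what is in stock.
for rm in inventory:
    model += pulp.lpSum(recipes[p].get(rm, 0) * make[p] for p in recipes) <= inventory[rm]

model.solve(pulp.PULP_CBC_CMD(msg=False))
print({p: make[p].value() for p in recipes})
```

Extending this to a day-wise plan for the year would mean adding a day index to the variables and linking inventory from one day to the next (consumption plus incoming shipments).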
I work in the gym space, and I'm trying to predict the number of gym leavers we will see next month, the following month, and so on.
The number of leavers is directly impacted by the number of joiners we had 13 months ago (for a 12-month contract) or 4 months ago (for a 3-month contract), since members need to give a month's notice.
There is some seasonality in Jan/Sept, but ultimately the type and length of the contract a member joins on is the biggest contributor to how long they are likely to stay.
We have over a hundred permutations of contract type and length.
What is the best way to model this in Python, and which methods should I use?
I've created a proof-of-concept model in Excel, which looks at historic churn rates at month 1/2/3, etc., by contract, and applies that to our current member mix and their tenure to predict how many will leave this month, but it's extremely messy across lots of worksheets. It is accurate, though, and outside of irregular macro events it predicts leavers within the next month very well.
I've tried a linear regression of the leaver volume this month against all the joiners in t-1, t-2... t-64, but it spits out a bunch of coefficients that don't produce any reasonable number; some are positive and some are negative. I thought that over a long enough period the number of joiners could estimate leavers.
I've thought about a time series model next, but I struggle to understand how to set the data up to run that, as I have so many contract mixes. In a way, I need to look at the data and say: this person is on this contract, has been with us X months, and so has this chance of leaving.
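One way to set this up in Python that mirrors the Excel proof of concept is a cohort (hazard-rate) table: for each contract type, estimate from history the probability of leaving at each month of tenure, then apply those rates to the current member mix. A minimal pandas sketch, assuming hypothetical file and column names:

```python
import pandas as pd

# Hypothetical panel data: one row per member per month of membership, with columns
# 'contract', 'tenure_months' (months since joining) and 'left' (1 if they left that month).
history = pd.read_csv("members_history.csv")

# Churn rate by contract and tenure: the share of members at that tenure who left that month.
rates = (history.groupby(["contract", "tenure_months"])["left"]
                .mean()
                .rename("leave_rate")
                .reset_index())

# Current active members: one row each, with their contract and current tenure.
current = pd.read_csv("members_current.csv")
current["tenure_months"] += 1  # the tenure they will reach next month

# Expected leavers next month = sum of each member's churn probability at that tenure.
merged = current.merge(rates, on=["contract", "tenure_months"], how="left")
print(f"Expected leavers next month: {merged['leave_rate'].fillna(0).sum():.0f}")
```

Rolling this forward month by month (incrementing tenure and removing the expected leavers) gives the forecast for the following months; a survival-analysis library would be the more formal version of the same idea.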
I'm writing a script to help me schedule shifts for a team. I have the model and the script to solve it, but when I try to run it, it takes too long. I think it might be an issue with the size of the problem I'm trying to solve. Before getting into the details, here are some characteristics of the problem:
Team has 13 members.
Schedule shifts for a month: 18 operating hours/day, 7 days/week, for 4 weeks.
I want to schedule non-fixed shifts: members can be on shift during any of the 18 operating hours. They can even be on for a couple hours, off for another hour, and then on again, etc. This is subject to some constraints.
Sets of the problem:
m: each of the members
h: operating hours (0 to 17)
d: days of the week (0 to 6)
w: week of the month (0 to 4)
Variables:
X[m,h,d,w]: Binary. 1 if member m starts a shift at hour h of day d, week w; 0 otherwise.
Y[m,h,d,w]: Binary. 1 if member m is on shift during hour h of day d, week w; 0 otherwise.
W[m,d,w]: Binary. 1 if member m had any shift on day d,w. 0 otherwise.
G[m,w]: Binary. 1 if member m had the weekend off during week w. 0 otherwise.
The problem has 20 constraints: 13 are "real" constraints of the problem and 7 are relations between the variables, needed for the model to work.
When I run it, I get this message:
At line 2 NAME MODEL
At line 3 ROWS
At line 54964 COLUMNS
At line 295123 RHS
At line 350083 BOUNDS
At line 369067 ENDATA
Problem MODEL has 54959 rows, 18983 columns and 201097 elements
Coin0008I MODEL read with 0 errors
I left it running overnight and didn't even get a solution. Then I tried changing all variables to continuous variables and it took ~25 seconds to find the optimal solution.
I don't know if I'm having issues because of the size of the problem, because I'm using only binary variables, or because of a combination of both.
Are there any general tips or best practices that could be used to improve performance on a model? Or is it always related to the specific model I'm trying to solve?
The long solve time is almost certainly due to the number of binary variables in your model. The model may also be degenerate, meaning there are many solutions that produce the same objective value, so the solver is just thrashing around trying all of the (roughly equal) combinations.
The fact that it does solve when you relax the integer constraint is good and good troubleshooting. Assuming the model is constructed properly, here are a couple things to try:
Solve for 1 week at a time (4 separate solves). It isn't clear what the linkage is from week-to-week from your description, but that would reduce the model to 1/4 of its size.
Change your time-blocks to more than 1 hour. If you used 3 hour blocks, your problem would again reduce to 1/3 of its size by your description. You would only need to reduce the set H to {1,2,3,4,5,6} and then do the mental math to align that with the actual hours. Or you could do 2 hour blocks.
You should also tinker with the optimality gap. Sometimes the difference between 95% optimal and 100% optimal is days/weeks/years of solver time. Have you tried a 0.02, 0.05, 0.10, or 0.25 relative gap? You may still get the optimal answer, but you forgo the guarantee of optimality.
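For example, if the model is built with PuLP and handed to CBC (which the log format above suggests), a relative gap and a time limit can be passed on the solver call. A minimal sketch with a placeholder model; the exact parameter names depend on the PuLP version (older releases used fracGap and maxSeconds):

```python
import pulp

# Placeholder model standing in for the real shift-scheduling problem.
prob = pulp.LpProblem("shifts", pulp.LpMinimize)
x = pulp.LpVariable("x", cat="Binary")
prob += x

solver = pulp.PULP_CBC_CMD(
    msg=True,
    timeLimit=600,   # stop after 10 minutes and return the best solution found so far
    gapRel=0.05,     # accept any solution proven to be within 5% of optimal
)
prob.solve(solver)
print(pulp.LpStatus[prob.status])
```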
I'm trying to think through a sort of extra-credit project: optimizing our schedule.
Givens:
“demand” numbers that go down to the 1/2 hour. These tell us the ideal number of people we’d have on at any given time;
An 8-hour shift, plus a 1-hour lunch break that must be more than 2 hours from both the start and the end of the shift (9 hours from start to finish);
Breaks: 2x 30 minute breaks in the middle of the shift;
For simplicity, can assume an employee would have the same schedule every day.
Desired result:
Dictionary or data frame with the best-case distribution of start times, breaks, lunches across an input number of employees such that the difference between staffed and demanded labor is minimized.
I have pretty basic Python, so my first guess was to come up with all of the possible shift permutations (the points at which one could take breaks or lunches), then have Python select x of them at random (x = the number of employees available) a lot of times, and then tell me which combination best allocates the labor. That seems a bit cumbersome and silly, but my limitations are such that I can't see beyond such a solution.
I have tried to look for libraries or tools that help with this, but the question here (how to distribute start times and breaks within a shift) doesn't seem to be widely discussed. I'm open to hearing that this is several years off for me, but...
Appreciate anyone’s guidance!
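Not a full answer, but here is a minimal sketch of the random-search idea described in the question: enumerate feasible shift templates (start time plus lunch and break placement, in half-hour slots), then repeatedly sample one template per employee and keep the combination whose staffing is closest to demand. The demand numbers and the assumption that breaks reduce coverage are placeholders.

```python
import random
import numpy as np

SLOTS = 48                          # half-hour slots in a day
rng = np.random.default_rng(1)
demand = rng.integers(2, 8, SLOTS)  # placeholder demand per half-hour slot

def shift_templates():
    """Enumerate shifts: 9h span (18 slots), a 1h lunch >2h from either end, two 30-min breaks."""
    templates = []
    for start in range(SLOTS - 18):
        for lunch in range(start + 4, start + 13):          # lunch occupies 2 slots
            for b1 in range(start + 1, lunch - 1):          # first break before lunch
                for b2 in range(lunch + 2, start + 17):     # second break after lunch
                    cover = np.zeros(SLOTS, dtype=int)
                    cover[start:start + 18] = 1
                    cover[[lunch, lunch + 1, b1, b2]] = 0   # off during lunch and breaks
                    templates.append(cover)
    return templates

def random_search(templates, n_employees, iters=5000):
    best, best_err = None, float("inf")
    for _ in range(iters):
        pick = random.choices(templates, k=n_employees)
        staffed = np.sum(pick, axis=0)
        err = int(np.abs(staffed - demand).sum())           # total over/under-staffing
        if err < best_err:
            best, best_err = pick, err
    return best, best_err

templates = shift_templates()
best, err = random_search(templates, n_employees=10)
print(len(templates), "templates; best total mismatch:", err)
```

A cleaner next step would be to feed the same templates into an integer program (e.g. with PuLP), choosing how many employees get each template so that the staffing mismatch is minimized, rather than sampling at random.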