I have a dataframe called df_salaire that has age, marital status, ..., number of working hours per week, ...
I want to test whether the number of working hours per week is around 40h. I used dataframe.mean() and the output was 40.37.
However, when I run a t-test using ttest_1samp like this:
import numpy as np
from scipy.stats import ttest_1samp

ttest, pval = ttest_1samp(np.array(df_salaire['heures.par.semaine']), 40)
print(pval)
I get a p-value that is less than 0.05, so the number of working hours is not around 40h per week.
Which is contradictory.
Am I missing something?
Your issue is one of statistics, not of coding. "Around" 40 hours is a very fuzzy concept. If you're dealing with the amount of time someone works, 40.37 is "around" 40 hours. If you're dealing with computer uptime in microseconds, then 40.37 may not be close enough to 40.00000 hours to be acceptable.
So what you need to do is specify what your acceptable range is. If you are willing to accept anything that is +/- 0.5 hours, then you have two tests that you need to make:
Is the average statistically greater than 39.5 hours?
Is the average statistically less than 40.5 hours?
If both of those tests pass, then you know that your average is statistically within 0.5 hours of 40.
Both of those statistical questions can be "phrased" as 1-tailed t-tests, though you may need to do some reading to figure out how to properly "phrase the question".
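As a rough sketch of those two one-sided tests (assuming scipy >= 1.6, which added the alternative parameter to ttest_1samp; the 0.5-hour margin is just the example tolerance from above):

import numpy as np
from scipy.stats import ttest_1samp

data = np.array(df_salaire['heures.par.semaine'])

# Test 1: H0: mean <= 39.5 vs. H1: mean > 39.5
_, p_lower = ttest_1samp(data, 39.5, alternative='greater')
# Test 2: H0: mean >= 40.5 vs. H1: mean < 40.5
_, p_upper = ttest_1samp(data, 40.5, alternative='less')

alpha = 0.05
if p_lower < alpha and p_upper < alpha:
    print("mean is statistically within +/- 0.5 hours of 40")

This two-one-sided-tests procedure is known in the statistics literature as TOST (equivalence testing).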
I'm writing a script to help me schedule shifts for a team. I have the model and the script to solve it, but when I try to run it, it takes too long. I think it might be an issue with the size of the problem I'm trying to solve. Before getting into the details, here are some characteristics of the problem:
Team has 13 members.
Schedule shifts for a month: 18 operating hours/day, 7 days/week, for 4 weeks.
I want to schedule non-fixed shifts: members can be on shift during any of the 18 operating hours. They can even be on for a couple hours, off for another hour, and then on again, etc. This is subject to some constraints.
Sets of the problem:
m: each of the members
h: operating hours (0 to 17)
d: days of the week (0 to 6)
w: week of the month (0 to 3)
Variables:
X[m,h,d,w]: Binary. 1 if member m starts a shift on hour h,d,w. 0 otherwise.
Y[m,h,d,w]: Binary. 1 if member m is on shift on hour h,d,w. 0 otherwise.
W[m,d,w]: Binary. 1 if member m had any shift on day d,w. 0 otherwise.
G[m,w]: Binary. 1 if member m had the weekend off during week w. 0 otherwise.
The problem has 20 constraints, 13 of which are "real constraints" of the problem and 7 of which are relations between the variables, needed for the model to work.
When I run it, I get this message:
At line 2 NAME MODEL
At line 3 ROWS
At line 54964 COLUMNS
At line 295123 RHS
At line 350083 BOUNDS
At line 369067 ENDATA
Problem MODEL has 54959 rows, 18983 columns and 201097 elements
Coin0008I MODEL read with 0 errors
I left it running overnight and didn't even get a solution. Then I tried changing all variables to continuous variables and it took ~25 seconds to find the optimal solution.
I don't know if I'm having issues because of the size of the problem, or because I'm using only binary variables, or a combination of both.
Are there any general tips or best practices that could be used to improve performance on a model? Or is it always related to the specific model I'm trying to solve?
The long solve time is almost certainly due to the number of binary variables in your model, and the model may be degenerate, meaning that there are many solutions that produce the same value for the objective, so the solver is just thrashing around trying all of the (relatively equal) combinations.
The fact that it does solve when you relax the integer constraints is a good sign, and good troubleshooting. Assuming the model is constructed properly, here are a couple of things to try:
Solve for 1 week at a time (4 separate solves). It isn't clear what the linkage is from week-to-week from your description, but that would reduce the model to 1/4 of its size.
Change your time blocks to more than 1 hour. If you used 3-hour blocks, your problem would again reduce to 1/3 of its size by your description. You would only need to reduce the set h to {0, 1, 2, 3, 4, 5} and then do the mental math to align that with the actual hours. Or you could do 2-hour blocks.
You should also tinker with the optimality gap. Sometimes the difference between 95% optimal and 100% optimal is days/weeks/years of solver time. Have you tried a 0.02, 0.05, 0.10, or 0.25 relative gap? You may still get the optimal answer, but you forgo the guarantee of optimality.
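For example, a minimal sketch of setting a relative gap, assuming the model is built in PuLP with the bundled CBC solver (which the CBC log above suggests) and a recent PuLP version (gapRel replaced the older fracGap argument); the gap and time-limit values are just illustrations:

import pulp

# ... build prob = pulp.LpProblem(...) with the X, Y, W, G variables ...

solver = pulp.PULP_CBC_CMD(
    gapRel=0.05,    # stop once within 5% of the proven bound
    timeLimit=600,  # also cap the run at 10 minutes as a safety net
    msg=True,       # keep the CBC log visible
)
prob.solve(solver)
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))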
I have a dataset that looks as follows:
What I would like to do with this data is calculate how much time was spent in specific states, per day. So say, for example, I wanted to know how long the unit was running today: I would just like to know the sum of the time the unit spent RUNNING: 45 minutes, NOT_RUNNING: 400 minutes, WARMING_UP: 10 minutes, etc.
I know how to summarize the column data on its own, but I'm looking to reference the timestamp I have available: subtract the first time it was on from the last time it was on and get that measure of difference. I haven't had any luck searching for this solution, but there's no way I'm the first to come across this, and I know it can be done somehow; I'm just looking to learn how. Anything helps, thanks!
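A minimal sketch of one common approach, assuming the frame is called df and has a 'timestamp' column holding the time of each state change and a 'state' column (both names are assumptions, since the sample data isn't shown): treat each row's duration as the gap to the next row, then group by day and state.

import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values('timestamp')

# Duration of each state = time until the next state change.
# The last row has no successor, so its duration is NaT and drops out of the sum.
df['duration'] = df['timestamp'].shift(-1) - df['timestamp']

per_day = df.groupby([df['timestamp'].dt.date, 'state'])['duration'].sum()
print(per_day)

Note that in this sketch a state that spans midnight is credited entirely to the day it began.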
I am checking my data for stationarity. I have a time series with daily data between 1986 and 2019. I would like to know what is the right way to choose the window for the rolling mean and std. I was thinking 252 - the number of business days a year, but I am not sure if this is too big a period.
There are various factors to consider before choosing the right window for a rolling mean (moving average). First you should have a clear aim (forecasting, smoothing, etc.). Mostly people use it for forecasting, e.g., stock prices (thus I assume you have daily data for a stock/commodity/tradeable product and would like to predict its prices using a rolling mean).
Check the seasonality component and how much fluctuation is happening in the data. If by observation you can see that it has a cycle of, say, 2 months (60 days, or approximately 50 working days), then your rolling-mean window should be smaller than that.
In short, there is no sure-shot formula for finding the most suitable rolling-mean window; there is subjectivity involved.
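A minimal sketch for eyeballing a few candidate windows, assuming the series lives in a pandas Series called ts with a daily DatetimeIndex (the window sizes are just examples):

import matplotlib.pyplot as plt

ts.plot(label='raw', alpha=0.4)
for window in (30, 90, 252):  # ~1 month, ~1 quarter, ~1 business year
    ts.rolling(window).mean().plot(label=f'rolling mean, w={window}')
ts.rolling(252).std().plot(label='rolling std, w=252', linestyle='--')
plt.legend()
plt.show()

If the rolling mean and std look stable for the smaller windows too, the 252-day window is probably not hiding anything; if they diverge, try windows shorter than the dominant seasonal cycle, as suggested above.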
I recently ran the SLOCCount tool because I needed to estimate the number of lines in a large project.
This is what it showed:
Totals grouped by language (dominant language first):
python: 7826 (100.00%)
Total Physical Source Lines of Code (SLOC) = 7,826
Development Effort Estimate, Person-Years (Person-Months) = 1.73 (20.82)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months) = 0.66 (7.92)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 2.63
Total Estimated Cost to Develop = $ 234,346
(average salary = $56,286/year, overhead = 2.40).
I'm not entirely sure how it comes up with all those estimates but one in particular threw me off, the Development Effort Estimate. I read about the COCOMO model but I'm still a bit lost.
What is the meaning of this estimate in simple words?
The Development Effort Estimate is a measure of how much time it might have taken to create the 7.8k lines of Python code.
If you believe in divisible man-months of effort, it would have taken one person about 21 months to produce (might be about right), or two people about 11 months (a bit optimistic), or three people about 7 months (quite optimistic). In practice, it doesn't scale linearly like that — and some tasks are indivisible. Putting 9 women to work to produce a baby in 1 month doesn't work, even though it takes 1 woman 9 months to produce a baby.
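Plugging the report's own numbers into the Basic COCOMO formulas it quotes reproduces the estimates, which is a quick way to demystify them (plain Python, no assumptions beyond the formulas shown above):

ksloc = 7.826                        # 7,826 physical SLOC
effort = 2.4 * ksloc ** 1.05         # person-months
schedule = 2.5 * effort ** 0.38      # calendar months
print(round(effort, 2))              # 20.82 person-months (~1.73 person-years)
print(round(schedule, 2))            # 7.92 months (~0.66 years)
print(round(effort / schedule, 2))   # 2.63 average developers
print(round(effort / 12 * 56286 * 2.4))  # ~234,348: the reported $234,346 up to rounding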
Is $56k really the average salary for a programmer these days?
COCOMO calculates how long it takes the average developer in a large company to create this software.
It's a very rough estimate, but there are parameters (called drivers) that you can tweak to make it more accurate to your case.
Some tools like ProjectCodeMeter can auto-detect these parameters and make the calculation for you.
While reviewing my past answers, I noticed I'd proposed code such as this:
import time

def dates_between(start, end):
    # muck around between the 9k+ time representation systems in Python
    # now start and end are seconds since epoch
    # return [start, start + 86400, start + 86400*2, ...]
    return range(start, end + 1, 86400)
When rereading this piece of code, I couldn't help but feel the ghastly touch of Tony the Pony on my spine, gently murmuring "leap seconds" to my ears and other such terrible, terrible things.
When does the "a day is 86,400 seconds long" assumption break, for epoch definitions of 'second', if ever? (I assume functions such as Python's time.mktime already return DST-adjusted values, so the above snippet should also work on DST switching days... I hope?)
Whenever doing calendrical calculations, it is almost always better to use whatever API the platform provides, such as Python's datetime and calendar modules, or a mature high-quality library, than it is to write "simpler" code yourself. Date and calendar APIs are ugly and complicated, but that's because real-world calendars have a lot of weird behavior.
For example, if it is "10:00:00 AM" right now, then the number of seconds to "10:00:00 AM tomorrow" could be a few different things, depending on what timezone(s) you are using, whether DST is starting or ending tonight, and so on.
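A minimal sketch of that pitfall, assuming Python 3.9+ for zoneinfo; the zone and date are just an example around a US spring-forward transition:

from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")
now = datetime(2021, 3, 13, 10, 0, tzinfo=tz)  # 10:00 AM before the spring-forward night

# Wall-clock arithmetic: "10:00 AM tomorrow".
tomorrow = now + timedelta(days=1)
print(tomorrow)  # 2021-03-14 10:00:00-04:00

# Real elapsed time between the two instants: convert to UTC first,
# because subtracting two datetimes that share a tzinfo ignores the offset change.
elapsed = tomorrow.astimezone(timezone.utc) - now.astimezone(timezone.utc)
print(elapsed)   # 23:00:00, not 24: the clocks sprang forward overnight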
Any time the constant 86400 appears in your code, there is a good chance you're doing something that's not quite right.
And things get even more complicated when you need to determine the number of seconds in a week, a month, a year, a quarter, and so on. Learn to use those calendar libraries.
The number of seconds in a day depends on the time system you use. For example, in POSIX, a day is exactly 86400 seconds by definition:
As represented in seconds since the Epoch, each and every day shall be
accounted for by exactly 86400 seconds.
In UTC, there can be a leap second included, i.e., a day can be 86401 SI seconds (and theoretically 86399 SI seconds). As of June 30, 2015, it has happened 26 times.
If we measure days by the apparent motion of the Sun, then the length of a (solar) day varies through the year by ~16 minutes from the mean.
That in turn differs from UT1, which is also based on the rotation of the Earth (mean solar time). An apparent solar day can be 20 seconds shorter or 30 seconds longer than a mean solar day. UTC is kept within 0.9 seconds of UT1 by the introduction of occasional intercalary leap seconds.
If you define a day by the local clock, then it may be very chaotic due to bizarre political timezone changes. It is not correct to assume that a day can change only by an hour due to DST.
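A minimal sketch of how wild local-clock days can get, assuming Python 3.9+ with zoneinfo and an up-to-date tz database (Samoa skipped 30 December 2011 entirely when it moved across the International Date Line):

from datetime import datetime
from zoneinfo import ZoneInfo

apia = ZoneInfo("Pacific/Apia")
# The UTC offset jumps by a full 24 hours across the transition,
# so 2011-12-30 never existed on Samoan wall clocks.
print(datetime(2011, 12, 29, 12, 0, tzinfo=apia).utcoffset())  # -1 day, 14:00:00 (UTC-10)
print(datetime(2011, 12, 31, 12, 0, tzinfo=apia).utcoffset())  # 14:00:00 (UTC+14)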
According to Wikipedia,
UTC days are almost always 86 400 s long, but due to "leap seconds"
are occasionally 86 401 s and could be 86 399 s long (though the
latter option has never been used as of December 2010); this keeps the
days synchronized with the rotation of the Earth (or Universal Time).
I expect that a double leap second could in fact make the day 86402s long, if that were ever to be used.
EDIT again: I second-guessed myself due to the confusing Python documentation. time.mktime always returns UTC epoch seconds. There, done. :)
In all time zones that "support" daylight saving time, you'll get two days a year that don't have 24 hours: they'll have 25h or 23h respectively. And don't even think of hardcoding those dates; they change every year, and between time zones.
Oh, and here's a list of 34 other reasons that you hadn't thought about, and why you shouldn't do what you're doing.