I have a dataframe called df_salaire that has age, marital status, ..., number of working hours per week, ...
I want to test whether the number of working hours per week is around 40h. I used dataframe.mean() and the output was 40.37.
However, when I run a t-test using ttest_1samp like this:
from scipy.stats import ttest_1samp
import numpy as np

ttest, pval = ttest_1samp(np.array(df_salaire['heures.par.semaine']), 40)
print(pval)
I get a p-value that is less than 0.05, so the test says the number of working hours is not around 40h per week.
That seems contradictory.
Am I missing something?
Your issue is one of statistics, not of coding. "Around" 40 hours is a very fuzzy concept. If you're dealing with the amount of time someone works, 40.37 is "around" 40 hours. If you're dealing with computer uptime in microseconds, then 40.37 may not be close enough to 40.00000 hours to be acceptable.
So what you need to do is specify your acceptable range. If you are willing to accept anything that is +/- 0.5 hours, then there are two tests you need to run:
Is the average statistically greater than 39.5 hours?
Is the average statistically less than 40.5 hours?
If both of those tests pass, then you know that your average is statistically within 0.5 hours of 40.
Both of those statistical questions can be "phrased" as 1-tailed t-tests, though you may need to do some reading to figure out how to properly "phrase the question".
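A minimal sketch of what those two one-sided tests could look like with SciPy, assuming the dataframe and column name from the question and a SciPy recent enough (1.6+) to have the alternative= keyword on ttest_1samp:

from scipy.stats import ttest_1samp

hours = df_salaire['heures.par.semaine'].to_numpy()

# Test 1: is the mean statistically greater than 39.5?
# H0: mean <= 39.5, H1: mean > 39.5
_, p_low = ttest_1samp(hours, 39.5, alternative='greater')

# Test 2: is the mean statistically less than 40.5?
# H0: mean >= 40.5, H1: mean < 40.5
_, p_high = ttest_1samp(hours, 40.5, alternative='less')

# If both one-sided p-values are below your significance level, the mean
# is statistically within +/- 0.5 hours of 40 (an equivalence-style check).
print(p_low, p_high)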
I have a dataset that looks as follows:
What I would like to do with this data is calculate how much time was spent in specific states, per day. So say, for example, I wanted to know how long the unit was running today. I would just like to know the sum of the time the unit spent RUNNING: 45 minutes, NOT_RUNNING: 400 minutes, WARMING_UP: 10 minutes, etc.
I know how to summarize the column data on its own, but I'm looking to use the timestamps I have available: subtract the first time a state was entered from the last time it was active, and get that measure of difference. I haven't had any luck searching for this solution, but there's no way I'm the first to come across this, and I know it can be done somehow; I'm just looking to learn how. Anything helps, thanks!
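A rough sketch of one way to get there with pandas, assuming hypothetical column names 'timestamp' (a datetime column) and 'state' in a frame called df, since the sample data isn't shown:

import pandas as pd

df = df.sort_values('timestamp')
# Each row lasts until the next recorded state change
df['duration'] = df['timestamp'].shift(-1) - df['timestamp']
df['day'] = df['timestamp'].dt.date
# Total time spent in each state, per calendar day
per_day = df.groupby(['day', 'state'])['duration'].sum()
print(per_day)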
I have a time-series that contains a hidden periodicity in the data.
I already found the period itself (e.g., 60 minutes, 100 minutes).
How can I find the specific sequence that has the periodicity? Note that the periodicity can even start in the middle of the time series.
What can I do about it?
Check out cycle detection algorithms. If your cycle is exact, they give you the cycle length and the stretch leading up to the first cycle.
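A minimal sketch of the idea, assuming the repetition is exact and the period p is already known (as stated in the question); it returns the index where the periodic tail begins, or None if nothing before the final period matches:

def find_cycle_start(series, p):
    """First index from which series[j] == series[j + p] holds to the end."""
    n = len(series)
    start = n - p  # the trailing p samples have nothing to compare against
    for j in range(n - p - 1, -1, -1):
        if series[j] != series[j + p]:
            break
        start = j
    return start if start < n - p else None

# Example: the period-2 pattern (2, 5) starts at index 3
print(find_cycle_start([3, 1, 4, 2, 5, 2, 5, 2, 5], 2))  # 3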
I’m trying to think through a sort of extra-credit project: optimizing our schedule.
Givens:
“demand” numbers that go down to the 1/2 hour. These tell us the ideal number of people we’d have on at any given time;
8-hour shift, plus a one-hour lunch break that is more than 2 hours from both the start and the end of the shift (9 hours from start to finish);
Breaks: two 30-minute breaks in the middle of the shift;
For simplicity, can assume an employee would have the same schedule every day.
Desired result:
Dictionary or data frame with the best-case distribution of start times, breaks, lunches across an input number of employees such that the difference between staffed and demanded labor is minimized.
I have pretty basic Python, so my first guess was to just come up with all of the possible shift permutations (points at which one could take breaks or lunches), then ask Python to select x of them (x = number of employees available) at random a lot of times, and then tell me which combination best allocates the labor. That seems a bit cumbersome and silly, but my limitations are such that I can’t see beyond such a solution.
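A rough sketch of that brute-force idea, with everything hypothetical for illustration: the day is modeled as 48 half-hour slots, demand is a list of 48 ideal head counts, a shift spans 18 slots with a 2-slot lunch at least 4 slots from each end, and the two 30-minute breaks are left out to keep it short. A proper optimization library (e.g., PuLP for integer programming) would be the more principled route.

import random

SLOTS = 48   # half-hour slots in a day
SHIFT = 18   # 9 hours from start to finish
LUNCH = 2    # 1 hour, unpaid
EDGE = 4     # lunch must sit at least 2 hours from the shift edges

def shift_patterns():
    """Enumerate coverage vectors: 1 = staffed in that slot, 0 = not."""
    patterns = []
    for start in range(SLOTS - SHIFT + 1):
        for lunch in range(start + EDGE, start + SHIFT - LUNCH - EDGE + 1):
            cover = [0] * SLOTS
            for s in range(start, start + SHIFT):
                cover[s] = 1
            cover[lunch] = cover[lunch + 1] = 0  # unpaid lunch hour
            patterns.append(cover)
    return patterns

def score(assignment, demand):
    """Total absolute gap between staffed and demanded labor across the day."""
    staffed = [sum(p[s] for p in assignment) for s in range(SLOTS)]
    return sum(abs(staffed[s] - demand[s]) for s in range(SLOTS))

def random_search(demand, n_employees, n_iter=10_000, seed=0):
    """Repeatedly pick n_employees shift patterns at random; keep the best."""
    rng = random.Random(seed)
    patterns = shift_patterns()
    best, best_score = None, float('inf')
    for _ in range(n_iter):
        candidate = [rng.choice(patterns) for _ in range(n_employees)]
        s = score(candidate, demand)
        if s < best_score:
            best, best_score = candidate, s
    return best, best_score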
I have tried to look for libraries or tools that help with this, but the question here (how to distribute start times and breaks within a shift) doesn’t seem to be widely discussed. I’m open to hearing that this is several years off for me, but...
Appreciate anyone’s guidance!
While reviewing my past answers, I noticed I'd proposed code such as this:
import time

def dates_between(start, end):
    # muck around between the 9k+ time representation systems in Python
    # now start and end are seconds since epoch
    # return [start, start + 86400, start + 86400*2, ...]
    return range(start, end + 1, 86400)
When rereading this piece of code, I couldn't help but feel the ghastly touch of Tony the Pony on my spine, gently murmuring "leap seconds" to my ears and other such terrible, terrible things.
When does the "a day is 86,400 seconds long" assumption break, for epoch definitions of 'second', if ever? (I assume functions such as Python's time.mktime already return DST-adjusted values, so the above snippet should also work on DST switching days... I hope?)
Whenever doing calendrical calculations, it is almost always better to use whatever API the platform provides, such as Python's datetime and calendar modules, or a mature high-quality library, than it is to write "simpler" code yourself. Date and calendar APIs are ugly and complicated, but that's because real-world calendars have a lot of weird behavior.
For example, if it is "10:00:00 AM" right now, then the number of seconds to "10:00:00 AM tomorrow" could be a few different things, depending on what timezone(s) you are using, whether DST is starting or ending tonight, and so on.
Any time the constant 86400 appears in your code, there is a good chance you're doing something that's not quite right.
And things get even more complicated when you need to determine the number of seconds in a week, a month, a year, a quarter, and so on. Learn to use those calendar libraries.
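A hedged illustration of that "10:00 AM tomorrow" point using only the standard library (zoneinfo is available from Python 3.9), with the 2021 US fall-back date picked purely as an example:

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")
# 10:00 AM on 2021-11-06; US clocks fall back overnight
today = datetime(2021, 11, 6, 10, 0, tzinfo=tz)
tomorrow = today + timedelta(days=1)   # "10:00 AM tomorrow" on the calendar

# The elapsed time between them is 25 hours, not 24
print(tomorrow - today)                    # 1 day, 1:00:00
print((tomorrow - today).total_seconds())  # 90000.0, not 86400.0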
The number of seconds in a day depends on the time system that you use. For example, in POSIX, a day is exactly 86400 seconds by definition:
As represented in seconds since the Epoch, each and every day shall be
accounted for by exactly 86400 seconds.
In UTC, a leap second may be included, i.e., a day can be 86401 SI seconds long (and theoretically 86399 SI seconds). As of Jun 30, 2015, this has happened 26 times.
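A small illustration of the POSIX rule: the leap second inserted at the end of 2015-06-30 (23:59:60 UTC) simply isn't counted in epoch seconds, so the gap between the surrounding midnights is still exactly 86400.

import calendar

# 2015-06-30 00:00:00 UTC and 2015-07-01 00:00:00 UTC as epoch seconds
t1 = calendar.timegm((2015, 6, 30, 0, 0, 0, 0, 0, 0))
t2 = calendar.timegm((2015, 7, 1, 0, 0, 0, 0, 0, 0))
print(t2 - t1)  # 86400, even though that UTC day contained 86401 SI seconds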
If we measure days by the apparent motion of the Sun, then apparent solar time drifts from mean solar time by up to ~16 minutes over the course of the year (the equation of time).
In turn, it is different from UT1, which is also based on the rotation of the Earth (mean solar time). An apparent solar day can be about 20 seconds shorter or 30 seconds longer than a mean solar day. UTC is kept within 0.9 seconds of UT1 by the introduction of occasional intercalary leap seconds.
If you define a day by the local clock, then it may be very chaotic due to bizarre political timezone changes. It is not correct to assume that a day can change only by an hour due to DST.
According to Wikipedia,
UTC days are almost always 86 400 s long, but due to "leap seconds" are occasionally 86 401 s and could be 86 399 s long (though the latter option has never been used as of December 2010); this keeps the days synchronized with the rotation of the Earth (or Universal Time).
I expect that a double leap second could in fact make the day 86402s long, if that were to ever be used.
EDIT again: I second-guessed myself due to the confusing Python documentation. time.mktime takes a local time tuple but always returns seconds since the (UTC-based) epoch. There, done. :)
In all time zones that "support" daylight saving time, you'll get two days a year that don't have 24h: one with 23h (when the clocks go forward) and one with 25h (when they go back). And don't even think of hardcoding those dates. They change every year, and between time zones.
Oh, and here's a list of 34 other reasons that you hadn't thought about, and why you shouldn't do what you're doing.
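A short sketch of how you might measure those day lengths instead of hardcoding them, using zoneinfo (Python 3.9+), with Europe/Berlin and the 2021 transition dates purely as an example:

from datetime import date, datetime, time, timedelta
from zoneinfo import ZoneInfo

tz = ZoneInfo("Europe/Berlin")

def day_length_hours(d):
    """Real elapsed hours between local midnight on day d and the next midnight."""
    start = datetime.combine(d, time.min, tzinfo=tz)
    end = datetime.combine(d + timedelta(days=1), time.min, tzinfo=tz)
    return (end - start).total_seconds() / 3600

print(day_length_hours(date(2021, 3, 28)))   # 23.0 (clocks go forward)
print(day_length_hours(date(2021, 10, 31)))  # 25.0 (clocks go back)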