How can I iterate through list data faster using multiprocessing? - python

I'm trying to determine the amount of time worked by a list of employees during their shift - this data is given to me in the form of a CSV file.
I populate a matrix with this data and iterate through it using a while loop, applying the necessary conditionals (for example, deducting 30 minutes for lunch). The results are then put into a new list, which is used to make an Excel worksheet.
My script does what it is meant to do, but takes a very long time when having to loop through a lot of data (it needs to loop through approximately 26 000 rows).
My idea is to use multiprocessing to do the following three loops in parallel:
Convert the time from hh:mm:ss to minutes.
Loop through and apply conditionals.
Round values and convert back to hours, so that this is not done within the big while loop.
Is this a good idea?
If so, how would I have the loops run in parallel when I need data from one loop to be used in the next? My first thought is to use the time function to give a delay, but then I'm concerned that my loops may "catch up" with one another and spit out that the list index being called does not exist.
Any more experienced opinions would be amazing, thanks!
My script:
import pandas as pd
# Function: To round down the time to the next lowest ten minutes --> 77 = 70 ; 32 = 30:
def floor_time(n, decimals=0):
    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier
# Function: Get data from excel spreadsheet:
def get_data():
    df = pd.read_csv('/Users/Chadd/Desktop/dd.csv', sep = ',')
    list_of_rows = [list(row) for row in df.values]
    data = []
    i = 0
    while i < len(list_of_rows):
        data.append(list_of_rows[i][0].split(';'))
        data[i].pop()
        i += 1
    return data
# Function: Convert time index in data to 24 hour scale:
def get_time(time_data):
    return int(time_data.split(':')[0])*60 + int(time_data.split(':')[1])
# Function: Loop through data in CSV applying conditionals:
def get_time_worked():
    i = 0 # Looping through entry data
    j = 1 # Looping through departure data
    list_of_times = []
    while j < len(get_data()):
        start_time = get_time(get_data()[i][3])
        end_time = get_time(get_data()[j][3])
        # Morning shift - start time < end time
        if start_time < end_time:
            time_worked = end_time - start_time # end time - start time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 6*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 06:00:00:
            if start_time < 6*60: # Early
                time_worked = end_time - 6*60
        # Afternoon shift - start time > end time
        elif start_time > end_time:
            time_worked = 24*60 - start_time + end_time # 24*60 - start time + end time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 18*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 18:00:00:
            if start_time > 18*60: # Early
                time_worked = 24*60 - 18*60 + end_time
        # If time worked exceeds 5 hours, deduct 30 minutes for lunch:
        if time_worked >= 5*60:
            time_worked = time_worked - 30
        # Set max time worked to 11.5 hours:
        if time_worked > 11.5*60:
            time_worked = 11.5*60
        list_of_times.append([get_data()[i][1], get_data()[i][2], round(floor_time(time_worked, decimals = -1)/60, 2)])
        i += 2
        j += 2
    return list_of_times
# Save the data into Excel worksheet:
def save_data():
    file_heading = '{} to {}'.format(get_data()[0][2], get_data()[len(get_data())-1][2])
    file_heading_2 = file_heading.replace('/', '_')
    df = pd.DataFrame(get_time_worked())
    writer = pd.ExcelWriter('/Users/Chadd/Desktop/{}.xlsx'.format(file_heading_2), engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Hours Worked', index=False)
    writer.save()
save_data()

You can look at multiprocessing.Pool, which allows executing a function multiple times with different input values. From the docs:
from multiprocessing import Pool
def f(x):
    return x*x
if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
Then, it's a matter of splitting up your data into chunks (instead of the [1, 2, 3] in the example).
My personal preference, though, is to take the time to learn a tool that is distributed by default, such as Spark (via pyspark). It will help you immensely in the long run.
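As a rough illustration of the chunking idea, here is a minimal sketch that converts time strings to minutes in parallel. The helper names (to_minutes, split_into_chunks, convert_chunk) and the sample times are made up for this example and are not part of the original script.

```python
from multiprocessing import Pool


def to_minutes(hhmmss):
    """Convert an 'hh:mm:ss' string to whole minutes since midnight."""
    hours, minutes = hhmmss.split(":")[:2]
    return int(hours) * 60 + int(minutes)


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def convert_chunk(chunk):
    """Worker function: convert every time string in one chunk."""
    return [to_minutes(t) for t in chunk]


if __name__ == "__main__":
    # Made-up sample data standing in for the time column of the CSV
    times = ["06:05:00", "14:35:10", "18:00:00", "05:58:41"]
    with Pool(2) as pool:
        # map returns results in input order, so flattening keeps row order
        results = pool.map(convert_chunk, split_into_chunks(times, 2))
    minutes = [m for chunk in results for m in chunk]
    print(minutes)  # [365, 875, 1080, 358]
```

Because each chunk is processed independently, this only fits steps where one row does not depend on another chunk's result. For very cheap per-row work, the cost of starting processes and copying data may outweigh the gain.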

Related

calculate working hours time - Python Pandas ( hours worked total ) hours worked in morning, afternoon , night

I'm pretty new to Python and pandas (beginner level).
I have a pandas dataframe and I'm trying to calculate how many hours/minutes/seconds were worked in the morning, afternoon, and evening.
Mornings are the time range 03:00 to 12:00.
Afternoons are the time range 12:00 to 17:00.
Evenings are the time range 17:00 to 03:00.
data = [ ('employee1', '2022-10-28', '12:06', '13:00:00', '00:00:00', '00:00:23'),
('employee2','2022-10-28', '10:00', '06:00:00', '00:00:00', '00:00:16'),
('employee3', '2022-05-06', '16:13', '08:00:00', '00:54:00', '00:00:09'),
('employee4', '2022-06-03', '2:33', '09:00:00', '00:19:00', '00:00:56'),
('employee5', '2022-08-12', '9:50', '20:00:00', '00:27:00', '00:00:22'),
('employee6', '2022-02-15', '6:52', '00:00:00', '00:35:00','00:00:35')]
df = pd.DataFrame(data, columns =['Name','date','start_time','hours_worked','minutes_worked','seconds_worked'])
df[['enddate_time','time_worked_morning','time_worked_afternoon','time_worked_evening','total_time_wkd']]=None
Can someone please help me calculate how much of the total time worked falls within each of these 3 ranges?
Since you mentioned you are a beginner, I will include some extra comments in my code. I will also present 2 solutions: a "slow" one that uses mostly Python loops and a "fast" one that uses numpy vectorized code.
Both solutions will use the following code snippet. All dates/times in your dataframe are stored as strings, which are hard to work with. Convert them to pd.Timestamp and pd.Timedelta for easier manipulation:
start_time = pd.to_datetime(df["date"] + " " + df["start_time"])
time_worked = (
pd.to_timedelta(df["hours_worked"])
+ pd.to_timedelta(df["minutes_worked"])
+ pd.to_timedelta(df["seconds_worked"])
)
end_time = start_time + time_worked
The slow solution
Loop over each employee, then over each shift to see how much time they worked in that shift. Then tally up the results and assign them as new columns in your dataframe:
import numpy as np
# You did not specify a unit for your time_worked_morning, _afternoon, etc.
# I assume you want to measure them in hours.
unit = pd.Timedelta(1, "h")
# The shift boundaries. There are 4 boundaries for 3 shifts because we need to
# account for the start and end time of each shift.
# Shift start time: 3, 12, 17
# Shift end time: 12, 17, 27
# The numbers represent hours since midnight of the day
boundaries = [pd.Timedelta(i, "h") for i in [3, 12, 17, 27]]
# Now loop through each employee's work record to break down their working hours
# into shifts
data = []
for work_st, work_et in zip(start_time, end_time):
    # Get the midnight of the day
    start_of_day = work_st.normalize()
    # Work time in each shift starts with 0
    work_time = [0] * (len(boundaries) - 1)
    # loop through the shifts
    for i, (lb, ub) in enumerate(zip(boundaries[:-1], boundaries[1:])):
        # Calculate the start and end time of each shift
        shift_st = start_of_day + lb
        shift_et = start_of_day + ub
        # Work time in shift = (effective end time) - (effective start time)
        # Effective end time = (work end time) or (shift end time), whichever is EARLIER => use min
        # Effective start time = (work start time) or (shift start time), whichever is LATER => use max
        # The "/ unit" operation is to convert the pd.Timedelta object into a
        # float representing the hour
        t = (min(work_et, shift_et) - max(work_st, shift_st)) / unit
        # Our algorithm above sometimes causes an `Effective end time` that is
        # before the `Effective start time`. An employee can't spend negative
        # time in a shift so clip it to 0 if negative.
        work_time[i] = max(0, t)
    data.append(work_time)
# Convert `data` to a numpy array for easier slicing
data = np.array(data)
# Add the extra columns to your data frame
df["enddate_time"] = end_time
df["time_worked_morning"] = data[:, 0]
df["time_worked_afternoon"] = data[:, 1]
df["time_worked_evening"] = data[:, 2]
df["total_time_wkd"] = time_worked / unit
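The min/max step is an ordinary interval-overlap calculation. As a small sketch with plain numbers (hours since midnight) instead of timestamps, and with a hypothetical helper name overlap_hours and a made-up 10:00 to 16:00 work record:

```python
def overlap_hours(work_start, work_end, shift_start, shift_end):
    """Hours of [work_start, work_end] that fall inside [shift_start, shift_end]."""
    # Effective end is the earlier end, effective start is the later start;
    # a negative difference means no overlap, so clip it to 0.
    return max(0, min(work_end, shift_end) - max(work_start, shift_start))


# Shift boundaries in hours since midnight, as in the answer: 3-12, 12-17, 17-27
shifts = [(3, 12), (12, 17), (17, 27)]

# Hypothetical employee working from 10:00 to 16:00
per_shift = [overlap_hours(10, 16, start, end) for start, end in shifts]
print(per_shift)  # [2, 4, 0]
```

The employee spends 2 hours in the morning shift, 4 in the afternoon shift, and none in the evening shift, which sums to the 6 hours worked.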
The fast solution
Instead of working with one employee and one shift at a time, we will deal with 2D arrays containing all employees in all shifts. This enables us to use several vectorized operations offered by numpy.
# The boundaries here are pretty much the same as in the "slow" solution, only
# as a numpy array instead of list
boundaries = np.array([np.timedelta64(i, "h") for i in [3, 12, 17, 27]])
n = len(boundaries) - 1
# For every employee, repeat the work start time and end time once per shift
work_st = np.tile(start_time.to_numpy(), (n, 1)).T
work_et = np.tile(end_time.to_numpy(), (n, 1)).T
# Calculate the shift's start and end time for each day
start_of_day = start_time.dt.normalize().to_numpy()[:, None]
shift_st = start_of_day + boundaries[:-1]
shift_et = start_of_day + boundaries[1:]
# The effective start and end time are calculated identically to above, only
# in vectorized form
effective_st = np.max([work_st, shift_st], axis=0)
effective_et = np.min([work_et, shift_et], axis=0)
# The time worked per shift uses the same calculation as the slow solution:
# effective end time - effective start time, with a minimum of 0
data = np.clip((effective_et - effective_st) / unit, 0, None)
# Add the extra columns to your data frame
df["enddate_time"] = end_time
df["time_worked_morning"] = data[:, 0]
df["time_worked_afternoon"] = data[:, 1]
df["time_worked_evening"] = data[:, 2]
df["total_time_wkd"] = time_worked / unit

How to find the difference between two times

I'm trying to find a way to take two times from the same day and compute the difference between them. So far, as shown in the code below, I split both time strings and convert the parts to integers. This works well, but when the clock-in minute value is higher than the clock-out minute value, the minutes part of the output is negative.
My current code is:
from datetime import datetime
now = datetime.now()
clocked_in = now.strftime("%H:%M")
clocked_out = '18:10'
def calc_total_hours(clockedin, clockedout):
    in_hh, in_mm = map(int, clockedin.split(':'))
    out_hh, out_mm = map(int, clockedout.split(':'))
    hours = out_hh - in_hh
    mins = out_mm - in_mm
    return f"{hours}:{mins}"
print(calc_total_hours(clocked_in, clocked_out))
If the clocked-in value is 12:30 and the clocked-out value is 18:10, the output is:
6:-20
The output needs to be converted back into a standard time format (H:M:S) when everything is done.
Thanks for your assistance, and sorry for the lack of quality code. I'm still learning! :D
First, in order to fix your code, you need to convert both times to minutes, compute the difference, and then convert it back to hours and minutes:
clocked_in = '12:30'
clocked_out = '18:10'
def calc_total_hours(clockedin, clockedout):
    in_hh, in_mm = map(int, clockedin.split(':'))
    out_hh, out_mm = map(int, clockedout.split(':'))
    diff = (in_hh * 60 + in_mm) - (out_hh * 60 + out_mm)
    hours, mins = divmod(abs(diff), 60)
    return f"{hours}:{mins}"
print(calc_total_hours(clocked_in, clocked_out))
# 5:40
Better way to implement the time difference:
import time
import datetime
t1 = datetime.datetime.now()
time.sleep(5)
t2 = datetime.datetime.now()
diff = t2 - t1
print(str(diff))
Output:
#h:mm:ss
0:00:05.013823
Probably the most reliable way is to represent the times as datetime objects, and then subtract one from the other, which will give you a timedelta.
from datetime import datetime
clock_in = datetime.now()
clock_out = clock_in.replace(hour=18, minute=10)
seconds_diff = abs((clock_out - clock_in).total_seconds())
# total_seconds() returns a float, so convert to int before splitting
hours, minutes = divmod(int(seconds_diff) // 60, 60)
print(f"{hours}:{minutes:02d}")
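When the two times arrive as strings, as in the question, a similar sketch can parse them with datetime.strptime first. The helper name elapsed and the default format string are illustrative assumptions, and the sketch assumes both times fall on the same day.

```python
from datetime import datetime


def elapsed(clocked_in, clocked_out, fmt="%H:%M"):
    """Return the timedelta between two same-day time strings."""
    start = datetime.strptime(clocked_in, fmt)
    end = datetime.strptime(clocked_out, fmt)
    return end - start


# str() of a timedelta already uses an H:MM:SS layout
print(elapsed("12:30", "18:10"))  # 5:40:00
```

The returned timedelta can also be used for further arithmetic, for example summing several shifts.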

Why is the result not 00:00:XX?

I expected output like 00:00:0X, but 09:00:0X came out. How can I make it print 00:00:0X?
import time
start = input("Press Enter to start the timer.")
begin = time.time()
while True:
    time.sleep(1)
    count = time.time()
    result = time.localtime(count - begin)
    print(count - begin)
    print(time.strftime('%I:%M:%S', result))
result:
1.0102884769439697
09:00:01
2.0233511924743652
09:00:02
3.0368154048919678
time.time() gives you the number of seconds since 1 January 1970 (UTC).
So begin is a huge number, and count is that same huge number plus about 1. Subtracting them gives about 1.
If you pass this to time.localtime(), it is interpreted as a timestamp: 1 January 1970 plus 1 second, converted to your local time zone, so the hour shows your UTC offset. In your case that is evidently +9 hours.
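This can be checked directly with a small sketch. The variable names are illustrative, and the local result depends on the machine's time zone, so no output is shown for it.

```python
import time

elapsed = 1.0  # seconds, like count - begin after the first sleep

# Interpreted as a timestamp in UTC: one second after the epoch
as_utc = time.gmtime(elapsed)
print(as_utc.tm_year, as_utc.tm_hour, as_utc.tm_min, as_utc.tm_sec)  # 1970 0 0 1

# Interpreted in local time: the hour is shifted by the local UTC offset
as_local = time.localtime(elapsed)
print(time.strftime('%H:%M:%S', as_local))
```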
What you probably wanted is time.gmtime() and output in 24 hour format. This will work...
import time
start = input("Press Enter to start the timer.")
begin = time.time()
while True:
    time.sleep(1)
    count = time.time()
    result = time.gmtime(count - begin)
    print(count - begin)
    print(time.strftime('%H:%M:%S', result))
but it is semantically incorrect. If you subtract 2 dates, the result is a timespan, not a date. What is the difference?
If someone asks, how old you are, you have a look at the current year and you subtract the year of your birth. Then you say "I'm 25 years old". You don't add 1.1.1970 and say "I'm 1995 years old".
So the following is semantically much better:
import time
from datetime import timedelta
start = input("Press Enter to start the timer.")
begin = time.time()
while True:
    time.sleep(1)
    count = time.time()
    timespan = timedelta(seconds=count - begin)
    print(timespan)
It shows 09:00:00 because you're in the UTC+9 timezone. For example, I'm in UTC+1 (France) and it shows 01:00:00 for me. Therefore, your code will have different outputs depending on where you run it.
To remove this timezone constraint, simply use datetime.timedelta:
import datetime
import time
begin = time.time()
while True:
    time.sleep(1)
    count = time.time()
    print(datetime.timedelta(seconds=round(count - begin)))
Output:
0:00:01
0:00:02
0:00:03
0:00:04
0:00:05

How Do I Keep Total Time with Multiple Duplicate Entries?

I will attach my code, but basically I am importing a csv file with start times/end times for picking cases of a particular item. All the cases go to a "cart", which is identified by an ID number. I want to find the total time to pick all the cases. The format of the time is hh:mm:ss and, initially, I was using the datetime module but I could not figure out the documentation, so I ended up just converting all the times to seconds, subtracting end/start for each case, and adding that duration to the total time. In the end, converted total time to hours. Already had number of cases picked total, and divided by total time in hrs to get cases picked per hr. Is this correct logic? I got a number that was very, very low: 7.99 cases/hr, which leads me to believe my timing/duration code is incorrect (already checked that quantity was correct).
#instantiate totalTime to zero
totalTime = 0
#every line/row in file; assume already opened above
for line in lines:
    #if there is a different case to pick, find the start time
    if taskId != entryList[0]: #this is so it doesnt duplicate times
        timestart = entryList[7]
        colonStartIndex = timestart.find(":")
        hourstart = int(timestart[0:colonStartIndex])
        minutestart = int(timestart[colonStartIndex+1:colonStartIndex+3])
        colonStartIndex2 = timestart.find(":", colonStartIndex+1)
        secondstart = int(timestart[colonStartIndex2 +1:colonStartIndex2 +3])
        start = hourstart*3600 + minutestart*60 + secondstart
        #start = datetime(year=1, month=1, day=1,hour=hourstart,minute=minutestart,second=secondstart)
        #start = datetime.time(start)
        timeend = entryList[9]
        colonEndIndex = timeend.find(":")
        hourend = int(timeend[0:colonEndIndex])
        minuteend = int(timeend[colonEndIndex+1:colonEndIndex+3])
        colonEndIndex2 = timeend.find(":", colonEndIndex+1)
        secondend = int(timeend[colonEndIndex2+1:colonEndIndex2+3])
        end = hourend*3600 + minuteend*60 + secondend
        #end = datetime(year=1,month=1,day=1,hour=hourend,minute=minuteend,second=secondend)
        #end = datetime.time(end)
        #duration = datetime.combine(date.today(), end) - datetime.combine(date.today(), start)
        duration = end - start
        if duration >= 0:
            duration = duration
        elif duration < 0:
            duration = -1*duration
        totalTime = totalTime + duration
    taskId = entryList[0] #first entry in csv file of each line is cartID
totalTime = totalTime/3600
print(totalTime)
print(quantityCount)
avgNumCases = quantityCount/totalTime
print(avgNumCases)
Thank you so much for any help!! Also, I included the datetime stuff, commented out, so if you could suggest a solution based on that, I am open to it:) I was just frustrated because I spent a good bit of time trying to figure it out, but I'm not super familiar w it and the documentation is pretty hard to understand (esp b/c datetime objects, blah blah)
There is an obvious problem in this section:
duration = end - start
if duration >= 0:
    duration = duration
elif duration < 0:
    duration = -1*duration
If your start time is 22:00:00 and your end time is 21:00:00 (the task ran past midnight), this code reports a duration of 1 hour instead of 23 hours.
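One possible way to handle that case is to wrap the difference around a 24-hour day with the modulo operator instead of taking the absolute value. This is a sketch under two assumptions that the question does not confirm: an end time earlier than the start time always means the task crossed midnight, and no task lasts 24 hours or longer. The helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600


def to_seconds(hhmmss):
    """Convert an 'hh:mm:ss' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def duration_seconds(start, end):
    """Elapsed seconds from start to end, wrapping past midnight if needed."""
    # Python's % always returns a non-negative result for a positive modulus
    return (to_seconds(end) - to_seconds(start)) % SECONDS_PER_DAY


print(duration_seconds("22:00:00", "21:00:00") / 3600)  # 23.0
print(duration_seconds("09:15:00", "09:45:30"))         # 1830
```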

Python: what's the subtraction value between two time.time()?

import time
print(time.time() - time.time())
Is the unit of the result milliseconds or seconds?
What I want to do is check whether the time span between two operations is larger than 10 minutes.
Return the time in seconds since the epoch as a floating point number. (from documentation)
So the difference between two time.time() values is also in seconds.
import time
time_a = time.time()
# ... some operations ...
ten_minutes = 10 * 60
time_span = time.time() - time_a
if time_span > ten_minutes:
    # time span is larger than 10 minutes.
    print("More than 10 minutes elapsed")
