Speed up code that doesn't use groupby()? - python

I have two pieces of code (doing the same job) which take in an array of datetimes and produce clusters of datetimes that are 1 hour apart.
First piece is:
def findClustersOfRuns(data):
    runClusters = []
    for k, g in groupby(itertools.izip(data[0:-1], data[1:]),
                        lambda (i, x): (i - x).total_seconds() / 3600):
        runClusters.append(map(itemgetter(1), g))
Second piece is:
def findClustersOfRuns(data):
    if len(data) <= 1:
        return []
    current_group = [data[0]]
    delta = 3600
    results = []
    for current, next in itertools.izip(data, data[1:]):
        if abs((next - current).total_seconds()) > delta:
            # Here, `current` is the last item of the previous subsequence
            # and `next` is the first item of the next subsequence.
            if len(current_group) >= 2:
                results.append(current_group)
            current_group = [next]
            continue
        current_group.append(next)
    return results
The first piece of code takes 5 minutes to execute, while the second takes a few seconds. I am trying to understand why.
The data over which I am running the code has size:
data.shape
(13989L,)
The data contents are:
data
array([datetime.datetime(2016, 10, 1, 8, 0),
datetime.datetime(2016, 10, 1, 9, 0),
datetime.datetime(2016, 10, 1, 10, 0), ...,
datetime.datetime(2019, 1, 3, 9, 0),
datetime.datetime(2019, 1, 3, 10, 0),
datetime.datetime(2019, 1, 3, 11, 0)], dtype=object)
How do I improve the first piece of code to make it run as fast?

Based on the size, it looks like you have a huge list of elements, i.e. a huge len. Your second snippet has just one for loop, whereas your first approach iterates over the data several times. You see just one explicit loop, right? The others are hidden inside map() and groupby(). Multiple passes over a huge list add a lot to the running time, and these implicit iterations are also slower than a plain for loop.
I made a comparison for another post which you might find useful: Comparing list comprehensions and explicit loops.
Also, the use of a lambda function adds extra time.
You may further improve the execution time of the second snippet by storing results.append in a separate variable, say my_func, and making the call as my_func(current_group).
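As a rough sketch, those suggestions applied to the first snippet could look like this (written for Python 2 to match the question's izip; gap_in_hours and the local append binding are illustrative names, and the grouping logic itself is kept unchanged):
from itertools import groupby, izip   # on Python 3, use zip instead of izip
from operator import itemgetter

def findClustersOfRuns(data):
    # Named key function instead of the tuple-unpacking lambda.
    def gap_in_hours(pair):
        earlier, later = pair
        return (earlier - later).total_seconds() / 3600

    runClusters = []
    append = runClusters.append        # bind the method once instead of looking it up per group
    for _, group in groupby(izip(data[:-1], data[1:]), gap_in_hours):
        append(map(itemgetter(1), group))
    return runClusters
Whether these micro-optimisations alone close the five-minute gap would need to be timed on the real data.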
A few more comparisons:
Python 3: Loops, list comprehension and map slower compared to Python 2
Speed/efficiency comparison for loop vs list comprehension vs other methods

Related

Inserting a string with time into a fixed-length sorted list of times

I have a piece of code, shown below, where if a time is less than a value in a list of five times, I want to insert the new value into the list and delete the fifth value, essentially updating the list with the new top five times.
hours = str(input('enter hours:'))
minutes = int(input('enter minutes:'))
seconds = int(input('enter seconds:'))
if minutes < 10:
    minutes = '0' + str(minutes)
if seconds < 10:
    seconds = '0' + str(seconds)
currentTime = hours + ':' + str(minutes) + ':' + str(seconds)
print(currentTime)
hours = int(currentTime[0])
minutes = int(currentTime[2] + currentTime[3])
seconds = int(currentTime[5] + currentTime[6])
print(hours, minutes, seconds)
for i in range(1, 6):
    aTime = str(Times[0][i])
    print(Times[0][i])
    if hours == int(aTime[1]):
        if minutes == int(aTime[3] + aTime[4]):
            if seconds <= int(aTime[6] + aTime[7]):
                # equal values deemed quicker
                print('quicker time than time', str(i))
                # insert value in Times[0][i]
                break
        elif minutes < int(aTime[3] + aTime[4]):
            print('quicker time than time', str(i))
            # insert value in Times[0][i]
            break
    elif hours < int(aTime[1]):
        print('quicker time than time', str(i))
        # insert value in Times[0][i]
        break
An example of what I want this to achieve: if
Times=['1:30:00','2:00:00','2:00:00','2:00:00','2:00:00']
and, for example, currentTime='1:45:00',
then Times should become:
Times=['1:30:00','1:45:00','2:00:00','2:00:00','2:00:00']
I feel like this type of question has been asked before, but I couldn't find one.
Thanks in advance.
Side note: I imagine this is probably a very inefficient way of comparing the times, but the method is correct; I just need to know how to insert a value into a list.
This looks like a problem that would be best solved by coming up with a data structure.
We'll call it a MaxList since:
It only retains the smallest elements
It has a maximum number of elements
class MaxList:
    def __init__(self, size, remaining=None):
        # `size` is the maximum number of elements; `remaining` optionally
        # seeds the list with an already-sorted starting set.
        self._size = size
        self._contents = []
        if remaining:
            self._contents = remaining

    def append(self, item):
        # Insert before the first larger entry, then trim back to the size limit.
        for i, entry in enumerate(self._contents):
            if item < entry:
                self._contents.insert(i, item)
                break
        self._contents = self._contents[:self._size]

    def __repr__(self):
        return self._contents.__repr__()
This implementation assumes the list starts sorted, and works as follows:
>>> ls = MaxList(5, [1, 3, 5, 7, 9]) # 5 is the total length, rest is the list
>>> ls
[1, 3, 5, 7, 9]
>>> ls.append(8)
>>> ls
[1, 3, 5, 7, 8] # `9` was pushed out
>>> ls.append(10)
>>> ls
[1, 3, 5, 7, 8] # Notice `10` is not in the list
>>> ls.append(-1)
>>> ls
[-1, 1, 3, 5, 7]
>>> ls.append(-5)
>>> ls
[-5, -1, 1, 3, 5]
I'd recommend using a non-string representation for times: it will make it easier to compare and decide if one value is less than or greater than another. The datetime objects in Python's standard library could be a good place to start.
For example:
from datetime import time
times = MaxList(5, [
    time(1, 30, 0),
    time(2, 0, 0),
    time(2, 0, 0),
    time(2, 0, 0),
    time(2, 0, 0),
])
print(times)
times.append(time(1, 45, 0))
print(times)
Output:
[datetime.time(1, 30), datetime.time(2, 0), datetime.time(2, 0), datetime.time(2, 0), datetime.time(2, 0)]
[datetime.time(1, 30), datetime.time(1, 45), datetime.time(2, 0), datetime.time(2, 0), datetime.time(2, 0)]
timeint = int(newtime.replace(':', ''))
for n, i in enumerate(times):
    i = int(i.replace(':', ''))
    if i > timeint:
        times[n] = newtime   # overwrite the first slower time
        break
print(times)
where newtime is the new time you are trying to compare and times is the list of previous times.
You can use the bisect module for this.
import bisect
Times=['1:30:00','2:00:00','2:00:00','2:00:00','2:00:00']
Time='1:45:00'
Times[bisect.bisect(Times, Time)]=Time
>>> Times
['1:30:00', '1:45:00', '2:00:00', '2:00:00', '2:00:00']
This only works, in general, for a list that is already sorted (yours is) with an element that will compare to the existing elements in the desired way. String representations of times can be written so that lexicographic ordering matches chronological ordering.
See ISO 8601; time-only strings are a subset that can be treated the same way.
While strings will work, a better way to store and compare times and dates in Python is with the datetime module. You can use the bisect module with datetime objects to insert into a sorted list the same way.
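A minimal sketch combining the two (the variable names and the fixed length of five are illustrative):
import bisect
from datetime import time

times = [time(1, 30), time(2, 0), time(2, 0), time(2, 0), time(2, 0)]
new_time = time(1, 45)

if new_time < times[-1]:
    bisect.insort(times, new_time)   # insert while keeping the list sorted
    times.pop()                      # drop the slowest time to stay at five entries
print(times)
# [datetime.time(1, 30), datetime.time(1, 45), datetime.time(2, 0), datetime.time(2, 0), datetime.time(2, 0)]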

Is there any faster way to delete Excel rows using openpyxl?

I have a list of 2138 Excel row numbers that I want to delete using openpyxl. Here is the code:
delete_this_row = [1,2,....,2138]
for delete in delete_this_row:
    worksheet.delete_rows(delete)
But it's too slow: it takes 45 seconds to 1 minute to finish the process.
Is there any faster way to complete the task?
There's almost always a faster way to do something. Sometimes the cost is too high but not in this case, I suspect :-)
If it's just a set of contiguous rows you want to delete, you can just use:
worksheet.delete_rows(1, 2138)
Documentation here, copied below for completeness:
delete_rows(idx, amount=1): Delete row or rows from row==idx.
Your solution is slow since, every time you delete a single row, it has to shift everything beneath that point up one row then delete the final row.
By passing in the row count, it instead does one shift, shifting rows 2139..max straight up to rows 1..max-2138, then deletes all the rows that are below max-2138.
This is likely to be roughly 2,138 times faster than what you have now :-)
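For example, a minimal sketch in context (the file name here is purely illustrative):
from openpyxl import load_workbook

workbook = load_workbook("report.xlsx")   # illustrative file name
worksheet = workbook.active
worksheet.delete_rows(1, 2138)            # one shift instead of 2138 separate shifts
workbook.save("report.xlsx")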
If you have arbitrary row numbers in your array, you can still use this approach to optimise it as much as possible.
The idea here is to first turn your row list into a tuple list where each tuple has:
the starting row; and
the number of rows to delete from there.
Ideally, you'd also generate this in reverse order so you could just process it as is. The following snippet shows how you could do this, with the openpyxl calls being printed rather than called:
def reverseCombiner(rowList):
    # Don't do anything for an empty list. Otherwise,
    # make a copy and sort.
    if len(rowList) == 0: return []
    sortedList = rowList[:]
    sortedList.sort()

    # Init: empty tuple list, use first item for previous and
    # first in this run.
    tupleList = []
    firstItem = sortedList[0]
    prevItem = sortedList[0]

    # Process all other items in order.
    for item in sortedList[1:]:
        # If start of new run, add tuple and use new first-in-run.
        if item != prevItem + 1:
            tupleList = [(firstItem, prevItem + 1 - firstItem)] + tupleList
            firstItem = item
        # Regardless, current becomes previous for next loop.
        prevItem = item

    # Finish off the final run and return tuple list.
    tupleList = [(firstItem, prevItem + 1 - firstItem)] + tupleList
    return tupleList
# Test data, hit me with anything :-)
myList = [1, 70, 71, 72, 98, 21, 22, 23, 24, 25, 99]
# Create tuple list, show original and that list, then process.
tuples = reverseCombiner(myList)
print(f"Original: {myList}")
print(f"Tuples: {tuples}\n")
for tuple in tuples:
    print(f"Would execute: worksheet.delete_rows({tuple[0]}, {tuple[1]})")
The output is:
Original: [1, 70, 71, 72, 98, 21, 22, 23, 24, 25, 99]
Tuples: [(98, 2), (70, 3), (21, 5), (1, 1)]
Would execute: worksheet.delete_rows(98, 2)
Would execute: worksheet.delete_rows(70, 3)
Would execute: worksheet.delete_rows(21, 5)
Would execute: worksheet.delete_rows(1, 1)
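To actually apply the result, the loop would call into openpyxl instead of printing (a brief sketch, assuming worksheet is an already-loaded openpyxl worksheet and delete_this_row is the row list from the question):
for start, count in reverseCombiner(delete_this_row):
    worksheet.delete_rows(start, count)
Because the tuples come back in reverse order, each deletion leaves the row numbers of the remaining tuples unaffected.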

Returning multiple values from Python Dict with range of datetime

I'm running into a problem using datetime as a dict key. My goal is to bring in information from one data source that includes a datetime, then look it up in a dictionary and return all values whose keys are within +-2 days of the input datetime.
For example, my input would be:
datetime.datetime(2018, 9, 20, 12, 48)
My dictionary to reference would be:
example = {datetime.datetime(2018, 9, 20, 14, 43):'A', datetime.datetime(2018, 9, 18, 19, 41):'B', datetime.datetime(2018, 9, 15, 9, 12):'C'}
In that case, I would return: A, B
I have considered sorting the dictionary and then creating a dictionary of indexes, maybe for odd-numbered dates; then taking my input date, working out the +-2 day bounds around it, referencing the index dict, and using those indexes to loop through the reference dict only between them and return all the values found there.
My main issue is that I can't predict what the dict datetimes or the input datetimes will be, so I'm not certain whether I can return values for a range of keys in a dict other than by looping through the keys in sorted order. A for loop over all keys is not efficient here because of the number of keys to look through; I am already reducing this list by deduplicating it as much as possible and only bringing in the minimum amount of reference data.
One other point is that my inputs will be hundreds of thousands of datetimes to look up, many of which will be minutes, seconds, or hours apart, so reducing the number of lookups and loops will be essential to keep the runtime down.
I apologize if this isn't quite a proper question with full code to look at, but I'm basically not sure where to start, so I didn't think it would help to include anything beyond the example input, the reference dictionary, and the goal output.
First, sort the dictionary by date, transforming it into a sorted list of tuples:
import bisect
import datetime

dic_dates = {
    datetime.datetime(2018, 9, 20, 14, 43): 'A',
    datetime.datetime(2018, 9, 18, 12, 41): 'B',
    datetime.datetime(2018, 9, 15, 9, 12): 'C',
}
sorted_dates = sorted(dic_dates.items())
Then use bisect to find the position of your date inside that list:
dat = datetime.datetime(2018, 9, 20, 12, 48)
insert_index = bisect.bisect_left(sorted_dates, (dat,))  # a 1-tuple sorts before any (dat, value) pair
Look from this position to the left and break as soon as an element does not satisfy the condition, then do the same starting from the position to the right. (You can use your own condition, as I found that part quite unclear in your example: +-2 days should not select 'B' IMO, but that's not the point.)
if insert_index:
    # if insert_index = 0, do not loop on left side
    dat_min = dat - datetime.timedelta(days=2)
    for d in sorted_dates[insert_index-1::-1]:
        if d[0] > dat_min:
            print(d[1])
        else:
            break

dat_max = dat + datetime.timedelta(days=2)
for d in sorted_dates[insert_index:]:
    if d[0] < dat_max:
        print(d[1])
    else:
        break
EDIT
One example of a bisect_left implementation:
def bisect_left(l, e, start=0):
    if not l:
        return start
    pos = int(len(l) / 2)
    if l[pos] < e and (pos + 1 >= len(l) or l[pos + 1] > e):
        return start + pos + 1
    elif l[pos] >= e:
        return bisect_left(l[:pos], e, start)
    else:
        return bisect_left(l[pos:], e, start + pos)
I strongly advise you to use the bisect module, as it will be quicker and more reliable.
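For completeness, the whole lookup can be packaged so the sort happens once and each of the many queries costs only two bisections (a sketch; the function names are illustrative, the +-2 day window follows the question, and dic_dates is the dictionary defined above):
import bisect
import datetime

def build_index(dic_dates):
    # Sort once and keep keys and values in parallel lists.
    items = sorted(dic_dates.items())
    return [k for k, _ in items], [v for _, v in items]

def values_within(keys, values, query, window=datetime.timedelta(days=2)):
    lo = bisect.bisect_left(keys, query - window)
    hi = bisect.bisect_right(keys, query + window)
    return values[lo:hi]

keys, values = build_index(dic_dates)
print(values_within(keys, values, datetime.datetime(2018, 9, 20, 12, 48)))   # ['A'] with the dic_dates above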

Adding up datetime.datetime that are in single list

I've looked everywhere and I can't seem to find what I need. I have a list of datetimes that I need to combine to find the sum. The list is parsed from a file and can have any number of datetime items in it. For example, it looks like this:
[datetime.datetime(1900, 1, 1, 1, 19, 42, 89000), datetime.datetime(1900, 1, 1, 2, 8, 4, 396000), datetime.datetime(1900, 1, 1, 0, 43, 54, 883000), datetime.datetime(1900, 1, 1, 0, 9, 13, 343000)]
The code I'm using to get this is:
time = [i[8] for i in smaller_list]
try:
    times = [datetime.datetime.strptime(x, "%H:%M:%S.%f") for x in time]
except ValueError:
    times = [datetime.datetime.strptime(x, "%M:%S.%f") for x in time]
time gets its values from a larger nested list that I created to separate lines of data. I've tried datetime.datetime.combine(), but I'm not really sure how to use it for items in a single list. Do I need to create a nested list of datetimes and sum them up? How do I iterate through this list and combine all the times for a sum? If I do have to create a nested list, how do I iterate through it to add up the times? I'm trying to wrap my head around this.
When I print time, this is what is returned, so the example list above is exactly what I'm working with.
[datetime.datetime(1900, 1, 1, 1, 19, 42, 89000), datetime.datetime(1900, 1, 1, 2, 8, 4, 396000), datetime.datetime(1900, 1, 1, 0, 43, 54, 883000), datetime.datetime(1900, 1, 1, 0, 9, 13, 343000)]
This is what the original times look like. I need to add up times such as these for a total time, usually minutes with microseconds included and rarely hours.
25:21.442
09:52.149
28:03.604
27:12.113
If anyone else runs into this problem, here is the code I used:
time = [i[8] for i in smaller_list]
sumtime = datetime.timedelta()
for i in time:
    try:
        (h, m, s) = i.split(':')
        d = datetime.timedelta(hours=int(h), minutes=int(m), seconds=float(s))
    except ValueError:
        (m, s) = i.split(':')
        d = datetime.timedelta(minutes=int(m), seconds=float(s))
    sumtime += d
print(str(sumtime))
If you're learning Python, it's pretty confusing trying to wrap your mind around datetime and timedelta. For durations you need timedelta: split the values up, pass the correct pieces to timedelta, and then you can sum them to get the total duration. Hopefully this helps someone out there.
If you need to round the microseconds to seconds, you can use this in place of d:
d = datetime.timedelta(minutes=int(m), seconds=round(float(s)))
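The same idea can also be written more compactly as a helper plus the built-in sum() (a sketch; raw_times stands in for the strings pulled from the file):
import datetime

def to_timedelta(text):
    # Accept "MM:SS.fff" or "HH:MM:SS.fff" by padding missing leading fields with 0.
    parts = [float(p) for p in text.split(':')]
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return datetime.timedelta(hours=hours, minutes=minutes, seconds=seconds)

raw_times = ['25:21.442', '09:52.149', '28:03.604', '27:12.113']
total = sum((to_timedelta(t) for t in raw_times), datetime.timedelta())
print(total)   # 1:30:29.308000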

Problem with quicksort and python

I'm writing a Python program that plays poker for a class, and I need to sort a list of five-card hands. I have a function called wins(), which takes two hands and returns True if the first one beats the second, False if it doesn't. I wrote an implementation of quicksort to sort the list of hands, and I noticed it was taking much longer than expected, so I made it print the length of each list it was sorting. The function looks like this:
def sort(l):
    if len(l) <= 1:
        return l
    print len(l)
    pivot = choice(l)
    l.remove(pivot)
    left = []
    right = []
    for i in l:
        if wins(i, pivot) == True:
            right.append(i)
        else:
            left.append(i)
    return sort(left) + [pivot] + sort(right)
and when I had it sort 64 hands, it printed:
64, 53, 39, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12,
11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 12, 7, 3, 2, 3, 2, 4, 3, 2, 13, 9, 6,
2, 3, 2, 2, 3, 10, 8, 2, 5, 4, 3, 2
Notice the consecutive sequence in the middle? I can't figure out why it does that, but it's causing quicksort to behave like O(n^2). It doesn't make sense for it to choose the strongest hand as the pivot on every iteration, but that's what seems to be happening. What am I overlooking?
Edited answer after comments:
Could you try the following code and share your results? It introduces an equivalence class to reduce the population of the "lose or tie" (i.e. "smaller than or equal to") group while building the answer via recursion.
# modified quicksort with equivalence class
def sort(l):
    if len(l) <= 1:
        return l
    print len(l)
    pivot = choice(l)
    l.remove(pivot)
    left = []
    right = []
    # I added an equivalence class
    pivotEquivalence = [pivot]
    # and an equivalence partitioning
    # to reduce the number of elements
    # in the upcoming recursive calls
    for i in l:
        if wins(i, pivot) == True:
            right.append(i)
        elif wins(pivot, i) == True:
            left.append(i)
        else:
            pivotEquivalence.append(i)
    return sort(left) + pivotEquivalence + sort(right)
======
Instead of choosing the pivot randomly, try taking the middle-indexed element; on average that should give you the expected O(N log N) behaviour.
Also, for big-O observations you need to run many simulations and collect empirical data; a single example can be very misleading.
Try printing the pivot and its index as well. Shuffle your list with random.shuffle(list) and try again to see the distribution, and make sure the random generator is seeded (for example from the system time). A small sketch combining these suggestions follows below.
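Something along these lines (a rough sketch written for Python 3; wins() here is a stand-in comparison and plain integers stand in for the poker hands):
import random

# Stand-in for the question's wins(a, b): True if a beats b.
def wins(a, b):
    return a > b

def sort(l):
    if len(l) <= 1:
        return l
    pivot = l[len(l) // 2]            # middle-indexed pivot instead of choice(l)
    print(len(l), pivot)              # watch how evenly each call splits the list
    rest = l[:len(l) // 2] + l[len(l) // 2 + 1:]
    left = [i for i in rest if not wins(i, pivot)]
    right = [i for i in rest if wins(i, pivot)]
    return sort(left) + [pivot] + sort(right)

random.seed()                         # seeded from OS entropy / system time
hands = list(range(64))
random.shuffle(hands)
print(sort(hands) == sorted(hands))   # True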
Could you post the complete code and your data so the community can reproduce the problem?
Sincerely,
Umut
