creating a hash-based sorting algorithm - python

For experimental and learning purposes. I was trying to create a sorting algorithm from a hash function that gives a value based on the alphabetical sequence of the string; it would then ideally place each string in the right position from that hash. I tried looking for a hash-based sorting function, but I only found one for integers, and it would have been a memory hog if adapted for my purposes.
The reasoning is that, theoretically, if done right this algorithm can achieve O(n) speed, or nearly so.
So here is what I have worked out in Python so far:
letters = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7,
           'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14,
           'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21,
           'w': 22, 'x': 23, 'y': 24, 'z': 25,
           'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7,
           'I': 8, 'J': 9, 'K': 10, 'L': 11, 'M': 12, 'N': 13, 'O': 14,
           'P': 15, 'Q': 16, 'R': 17, 'S': 18, 'T': 19, 'U': 20, 'V': 21,
           'W': 22, 'X': 23, 'Y': 24, 'Z': 25}

def sortlist(listToSort):
    listLen = len(listToSort)
    newlist = []
    for i in listToSort:
        # 1st part: build a base-26 value from the letters of the word
        k = letters[i[0]]
        for j in i[1:]:
            k = (k * 26) + letters[j]
        norm = k / pow(26, len(i))  # a float hash, normalized to [0, 1) (i think that's what it's called)
        # 2nd part
        idx = int(norm * len(newlist))  # get a general idea of where it should go
        if newlist:  # find the right place from idx
            if norm < newlist[idx][1]:
                while norm < newlist[idx][1] and idx > 0:
                    idx -= 1
                if norm > newlist[idx][1]:
                    idx += 1
            else:
                while norm > newlist[idx][1] and idx < (len(newlist) - 1):
                    idx += 1
                if norm > newlist[idx][1]:
                    idx += 1
        newlist.insert(idx, [i, norm])  # put it in the right place, keeping "norm" to reference while sorting
    return newlist
I think that the 1st part is good, but the 2nd part needs help. So the questions would be: what would be the best way to do something like this, and is it even possible to get O(n) time (or near it) out of this?
In my testing, an 88,000-word list took probably about 5 minutes, and a 10,000-word list took about 30 seconds; it got a lot worse as the list size went up.
If this idea actually works out, then I would recode it in C to get some real speed and optimizations.
The 2nd part is there only because it works, even if it is slow; I can't think of a better way to do it for the life of me. I would like to replace it with something that does not have to do the extra loops, if at all possible.
Thanks for any advice or ideas that you could give.

On sorting in O(n): you can't do it in general, for all inputs, with a comparison-based sort, period. It is simply, fundamentally, mathematically impossible.
Here's the nice, short information-theoretic proof of impossibility: to sort, you have to be able to distinguish among the n! possible orderings of the input; to do so, you have to gather log2(n!) bits of information; since each comparison yields at most one bit, you need Omega(log2(n!)) comparisons, and log2(n!) grows as Theta(n log n) (Stirling's approximation). Any sorting algorithm that claims to run in O(n) is either running on specialized data (e.g. data with a fixed number of bits), or is not correct.
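For a concrete sense of the "specialized data" loophole, here is a minimal counting sort sketch (a standard illustration, not code from this thread): it sorts integers known to lie in range(k) in O(n + k) time, sidestepping the comparison lower bound by indexing with the values themselves instead of comparing them.
def counting_sort(values, k):
    counts = [0] * k
    for v in values:          # O(n): tally each value
        counts[v] += 1
    result = []
    for v in range(k):        # O(k): emit each value in sorted order
        result.extend([v] * counts[v])
    return result

# counting_sort([3, 1, 4, 1, 5], 10) -> [1, 1, 3, 4, 5]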
Implementing a sorting algorithm is a good learning exercise, but you may want to stick to existing algorithms until you are comfortable with the concepts and methods commonly employed. It might be rather frustrating otherwise if the algorithm doesn't work.
Have fun learning!
P.S. Python's built-in timsort algorithm is really good on a lot of real-world data. So, if you need a general sorting algorithm for production code, you can usually rely on .sort/sorted to be fast enough for your needs. (And, if you can understand timsort, you'll do better than 90% of the Python-wielding population :)
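As a quick illustration, the case-insensitive alphabetical ordering the question is going for is a one-liner with a key function (the word list here is a hypothetical sample):
words = ["banana", "Apple", "cherry"]    # hypothetical sample data
print(sorted(words, key=str.lower))      # ['Apple', 'banana', 'cherry']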

Related

what approach is best to decrease the time complexity of this problem

I want to preface this thread by stating that I am still learning the basics of data structures and algorithms. I'm not looking for the correct code for this problem, but rather for what the correct approach is, so that I can learn which situations call for which data structure. That being said, I am now going to try to correctly explain this code.
The code below is a solution I had written for a medium-level LeetCode problem. Please see the link to read the problem.
Correct me if I am wrong, but currently the time complexity of this algorithm is O(n).
from typing import List

class Solution:
    def canCompleteCircuit(self, gas: List[int], cost: List[int]):
        startingStation = 0
        didCircuit = -1
        tank = 0
        i = 0
        while i <= len(gas):
            if startingStation == len(gas):
                return -1
            if startingStation == i:
                didCircuit += 1
                if didCircuit == 1:
                    return startingStation
            tank += gas[i] - cost[i]
            if tank >= 0:
                i += 1
                if i == len(gas):
                    i = 0
            if tank < 0:
                didCircuit = -1
                startingStation += 1
                i = startingStation
                tank = 0
The code works fine, but it is too slow to get through every test case. What I am asking is: if this algorithm is O(n), what approach could I have used to make its runtime complexity O(log(n)), or just faster?
Side question: I know having a lot of if statements is bad and ugly code, but if all of the operations are O(1), does the number of if statements have any impact on the performance of this function when scaled to a high iteration count?
"Correct me if I am wrong, currently the time complexity of this algorithn is O(n)"
This algorithm is O(n^2) rather than O(n). In the best case, it will return an answer in only "n" iterations of the while loop, but in the situation where there is no answer, it needs to run the loop (n*(n+1))/2 times.
O() notation tells us to ignore practical values of n and remove terms that become insignificant as n grows very large. So we ignore the +n and the /2 in the iterations, with the most significant component being the n^2.
So it is an O(n^2) algorithm.
"if all of the iterations are O(1) does the amount of if statements have any impact on the performance of this function if scaled to a high iteration count"
No, the O() of the algorithm is not impacted by the number of logic statements, but beware of hidden loops and expensive operations. For example, a logic statement of if x in list can be O(n) on the number of items in the list without data-specific optimizations, so if you have an O(n) loop around it (for the same list) you could have an O(n^2) algorithm. None of your logic statements have this issue, you can ignore them for O() purposes.
Assignments can be treated the same.
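To make the hidden-loop point concrete, here is a small illustrative sketch (mine, not code from the answer): the same membership test costs O(n) against a list but O(1) on average against a set.
def count_common_slow(a, b):
    # "x in b" scans the list b element by element: O(len(a) * len(b)) overall
    return sum(1 for x in a if x in b)

def count_common_fast(a, b):
    b_set = set(b)  # one O(len(b)) pass to build the set
    # set membership is O(1) on average: O(len(a) + len(b)) overall
    return sum(1 for x in a if x in b_set)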
"What I am asking is if this algorithm is O(n) what approach could I have used to make the runtime complexity of this algorithm O(log(n)) or just faster?"
Since the algorithm is not O(n), better to ask how you might get there. You can get there by finding a way to not have to loop over the arrays more than once.
You ask about data structures, but you talk about time complexity.
The best algorithm in this case is O(n) in time, and O(1) in additional space. It requires you to store one integer in addition to the two arrays. You can even implement it with three integers of storage if you keep reading the gas and cost values from streams of data.
"I'm not looking for the correct code for this problem but rather what the correct approach is"
They've given you a gift with the statement that any success solution is unique. From this we know that the amount of gas available is no more than the sum of all costs plus the smallest difference between a station's cost and gas. If it were otherwise, then there would two points in the loop where you could start.
That means that as soon as we find an i where the sum of the gas available at stations 0 to i exceeds the cost of travel from 0 to i we have found the unique starting position. If we get to the end of the line and have not found this, we know it is impossible to do so for any starting position.
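For reference, here is a sketch of a common single-pass formulation of this idea (the standard greedy, which may differ in detail from the description above): keep a running tank, and whenever it goes negative, no station seen since the last restart can be the answer, so restart from the next station.
def can_complete_circuit(gas, cost):
    total = 0    # net gas over the whole loop; negative means impossible
    tank = 0     # net gas since the current candidate start
    start = 0
    for i in range(len(gas)):
        diff = gas[i] - cost[i]
        total += diff
        tank += diff
        if tank < 0:          # cannot reach station i+1 from `start`
            start = i + 1     # every station in between fails as well
            tank = 0
    return start if total >= 0 else -1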

Most efficient way to find mode in an array using python? Return type is an array of integers

Here is my solution, which works in O(N) time and O(N) space:
def find_mode(array):
    myDict = {}
    result = []
    for i in range(len(array)):
        if array[i] in myDict:
            myDict[array[i]] += 1
        else:
            myDict[array[i]] = 1
    maximum = max(myDict.values())
    for key, value in myDict.items():
        if value == maximum:
            result.append(key)
    return result
I can't think of a more efficient solution than O(N), but if anyone has any improvements to this function, please let me know. The return type is an array of integers.
First, you should note that O(n) worst-case time cannot be improved upon with a deterministic, non-randomized algorithm, since we may need to check all elements.
Second, since you want all modes, not just one, the best space complexity of any possible algorithm is O(|output|), not O(1).
Third, this is as hard as the element distinctness problem. This implies that any algorithm expressible in terms of decision trees alone needs Omega(n log n) time. To beat this, you need to be able to hash elements, use numbers to index the computer's memory, or perform some other non-combinatorial operation. This isn't a rigorous proof that O(|output|) space complexity with O(n) time is impossible, but it means you'll need to specify a model of computation to get a more precise bound on runtime, or specify bounds on the range of integers in your array.
Lastly, and most importantly, you should profile your code if you are worried about performance. If this is truly the bottleneck in your program, then Python may not be the right language to achieve the absolute minimum number of operations needed to solve this problem.
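If you do profile, the standard library's timeit module is a quick way to compare versions (the data here is just a made-up sample):
import timeit

data = list(range(1000)) * 5  # hypothetical sample input
print(timeit.timeit(lambda: find_mode(data), number=100))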
Here's a more Pythonic approach, using the standard library's very useful collections.Counter(). The Counter initialization (in CPython) is usually done through a C function, which will be faster than your for loop. It is still O(n) time and space, though.
import collections
from typing import List

def find_mode(array: List[int]) -> List[int]:
    counts = collections.Counter(array)
    maximum = max(counts.values())
    return [key for key, value in counts.items()
            if value == maximum]
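Worth noting: on Python 3.8+ the standard library already exposes exactly this behavior as statistics.multimode, which returns all of the most common values:
import statistics

print(statistics.multimode([1, 2, 2, 3, 3]))  # [2, 3]
print(statistics.multimode([4]))              # [4]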

Under what circumstances is bidirectional bubble sort better than standard bubble sort?

I have implemented the bidirectional bubble sort algorithm, but I can't think of a scenario where bidirectional bubble sort is better than standard bubble sort. Can someone give me a clue?
My implementation in Python:
def bubbleSort_v(my_list):
    s = 0
    e = 0
    right = True
    for index in range(len(my_list)-1, 0, -1):
        if right:
            right = False
            for idx in range(s, index+e, 1):
                if my_list[idx] > my_list[idx+1]:
                    my_list[idx], my_list[idx+1] = my_list[idx+1], my_list[idx]
            s += 1
        else:
            right = True
            for idx in range(index-1+s, e, -1):
                if my_list[idx] < my_list[idx-1]:
                    my_list[idx], my_list[idx-1] = my_list[idx-1], my_list[idx]
            e += 1
    return my_list
Thanks!
Consider the case where an element at the right of the list (for instance at the last index) should be moved to the left side (for instance to the first index). This will take a long time with single-directional bubble sort: each pass will move it only one step.
If we perform bi-directional bubble sort, however, the element will be carried all the way to the left in the first right-to-left pass.
So in general it is better when one or more elements must be moved (over a large number of places) in the direction opposite to the one in which single-directional bubble sort sweeps.
For your implementation of bubble sort, however, it will not make much difference: usually bubble sort tests while it sorts, and if it can do a full pass without swaps, it simply stops.
For example a single-directional bubblesort that moves to the right:
def single_bubble(data):
    for i in range(len(data)):
        can_exit = True
        for j in range(len(data)-i-1):
            if data[j] > data[j+1]:
                data[j], data[j+1] = data[j+1], data[j]
                can_exit = False
        if can_exit:
            return
So if you want to move an element a large number of places to the left, then for each such step you will have to do a full pass again. We can optimize the above method a bit more, but this behavior cannot be eliminated.
Bi-directional bubblesort can be implemented like:
def bidirectional_bubble(data):
    for i in range(len(data)):
        can_exit = True
        # rightward pass: bubble the largest remaining element to the end
        for j in range(len(data)-i-1):
            if data[j] > data[j+1]:
                data[j], data[j+1] = data[j+1], data[j]
                can_exit = False
        if can_exit:
            return
        # leftward pass: bubble the smallest remaining element to the front
        for j in range(len(data)-i-2, i, -1):
            if data[j-1] > data[j]:
                data[j-1], data[j] = data[j], data[j-1]
                can_exit = False
        if can_exit:
            return
That being said, bubble sort is not a good sorting algorithm in general. There exist way better algorithms like quicksort, mergesort, timsort, radixsort (for numerical data), etc.
Bubble sort is actually quite a bad algorithm even among O(n^2) algorithms, since it moves an element only one place at a time. Insertion sort, by contrast, first finds where an element has to go and then shifts the intervening part of the list in one go, saving a lot of useless moves. These algorithms can however serve an educational purpose when learning to design, implement and analyze algorithms, precisely because they perform significantly worse than more advanced ones.
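Here is a short sketch of insertion sort to illustrate that contrast (a standard textbook version, not code from the answer):
def insertion_sort(data):
    for i in range(1, len(data)):
        value = data[i]
        j = i
        # shift larger elements one slot to the right in a single scan,
        # rather than swapping `value` one place at a time
        while j > 0 and data[j-1] > value:
            data[j] = data[j-1]
            j -= 1
        data[j] = value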
Implementing a (general-purpose) sorting function yourself is probably not beneficial: good algorithms have already been implemented for all popular programming languages, and those implementations are fast, consume little memory, etc.

Memoized to DP solution - Making Change

Recently I read a problem to practice DP. I wasn't able to come up with a DP solution, so I tried a recursive solution, which I later modified to use memoization. The problem statement is as follows:
Making Change. You are given n types of coin denominations of values v(1) < v(2) < ... < v(n) (all integers). Assume v(1) = 1, so you can always make change for any amount of money C. Give an algorithm which makes change for an amount of money C with as few coins as possible. [on problem set 4]
I got the question from here
My solution was as follows:
def memoized_make_change(L, index, cost, d):
    if index == 0:
        return cost
    if (index, cost) in d:
        return d[(index, cost)]
    count = cost // L[index]  # how many of this coin fit into cost (integer division)
    val1 = memoized_make_change(L, index-1, cost % L[index], d) + count
    val2 = memoized_make_change(L, index-1, cost, d)
    x = min(val1, val2)
    d[(index, cost)] = x
    return x
This is how I've understood my solution to the problem. Assume that the denominations are stored in L in ascending order. As I iterate from the end to the beginning, I have a choice to either choose a denomination or not choose it. If I choose it, I then recurse to satisfy the remaining amount with lower denominations. If I do not choose it, I recurse to satisfy the current amount with lower denominations.
Either way, at a given function call, I find the best (lowest) count to satisfy a given amount.
Could I have some help in bridging the thought process from here onward to reach a DP solution? I'm not doing this as any HW, this is just for fun and practice. I don't really need any code either, just some help in explaining the thought process would be perfect.
[EDIT]
I recall reading that function calls are expensive, and that this is why bottom-up (iteration-based) DP might be preferred. Is that possible for this problem?
Here is a general approach for converting memoized recursive solutions to "traditional" bottom-up DP ones, in cases where this is possible.
First, let's express our general "memoized recursive solution". Here, x represents all the parameters that change on each recursive call. We want this to be a tuple of positive integers - in your case, (index, cost). I omit anything that's constant across the recursion (in your case, L), and I suppose that I have a global cache. (But FWIW, in Python you should just use the lru_cache decorator from the standard library functools module rather than managing the cache yourself.)
To solve for(x):
    If x in cache: return cache[x]
    Handle base cases, i.e. where one or more components of x is zero
    Otherwise:
        Make one or more recursive calls
        Combine those results into `result`
        cache[x] = result
        return result
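As a concrete aside on the lru_cache suggestion above, the question's function could be wrapped like this (a sketch that keeps the question's recursion unchanged; the inner helper closes over a tuple of denominations because lru_cache needs hashable arguments):
from functools import lru_cache

def make_change_memoized(L, cost):
    coins = tuple(L)  # lru_cache requires hashable arguments

    @lru_cache(maxsize=None)
    def solve(index, cost):
        if index == 0:
            return cost
        count = cost // coins[index]
        return min(solve(index - 1, cost % coins[index]) + count,
                   solve(index - 1, cost))

    return solve(len(coins) - 1, cost)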
The basic idea in dynamic programming is simply to evaluate the base cases first and work upward:
To solve for(x):
    For y starting at (0, 0, ...) and increasing towards x:
        Do all the stuff from above
However, two neat things happen when we arrange the code this way:
As long as the order of y values is chosen properly (this is trivial when there's only one vector component, of course), we can arrange that the results for the recursive call are always in cache (i.e. we already calculated them earlier, because y had that value on a previous iteration of the loop). So instead of actually making the recursive call, we replace it directly with a cache lookup.
Since every component of y will use consecutively increasing values, and will be placed in the cache in order, we can use a multidimensional array (nested lists, or else a Numpy array) to store the values instead of a dictionary.
So we get something like:
To solve for(x):
    cache = multidimensional array sized according to x
    for i in range(first component of x):
        for j in ...:
            (as many loops as needed; better yet use `itertools.product`)
            If this is a base case, write the appropriate value to cache
            Otherwise, compute "recursive" index values to use, look up
            the values, perform the computation and store the result
    return the appropriate ("last") value from cache
I suggest considering the relationship between the value you are constructing and the values you need for it.
In this case you are constructing a value for index, cost based on:
index-1 and cost
index-1 and cost%L[index]
What you are searching for is a way of iterating over the choices such that you will always have precalculated everything you need.
In this case you can simply change the code to the iterative approach:
for each choice of index, 0 upwards:
    for each choice of cost:
        compute the value corresponding to index, cost
In practice, I find that the iterative approach can be significantly faster (perhaps 4x) for simple problems, as it avoids the overhead of function calls and of checking the cache for preexisting values.
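To make the iterative approach concrete, here is a sketch of the classic bottom-up formulation, indexed by amount rather than by denomination (the more common textbook shape; it relies on the guarantee that v(1) = 1, so every amount is reachable):
def make_change(L, C):
    # best[c] = fewest coins needed to make amount c
    best = [0] * (C + 1)
    for c in range(1, C + 1):
        best[c] = 1 + min(best[c - coin] for coin in L if coin <= c)
    return best[C]

# make_change([1, 5, 7], 24) -> 4  (7 + 7 + 5 + 5)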

Fastest way to apply function on different lists in python

I have a complex algorithm to build in order to select the best combination of elements in my list.
I have a list of 20 elements. I make all the combinations of this list using the algorithm below; the result is a list of combinations of size 2^20 - 1 (without duplications):
from itertools import combinations

def get_all_combinations(input_list):
    for i in range(len(input_list)):
        for item in combinations(input_list, r=i + 1):
            yield list(item)

input_list = [1, 4, 6, 8, 11, 13, 5, 98, 45, 10, 21, 34, 46, 85, 311, 133, 35, 938, 345, 310]
# len() does not work on a generator, so count by exhausting it
print(sum(1 for _ in get_all_combinations(input_list)))  # 1048575
I have another function that is applied to every combination; I then take the max.
# this is just an example
from math import sqrt

def calcul_factor(item):
    return max(item) * min(item) / sqrt(min(item))
I tried to do it this way, but it's taking a long time:
factorsList = []
l = []
columnsList = get_all_combinations(input_list)
for x in columnsList:
    i = calcul_factor(x)
    factorsList.append(i)
    l.append(x)
print("max", max(factorsList))
print("Best combinations:", l[factorsList.index(max(factorsList))])
Would using map/lambda expressions help parallelize the calculation of the maximum?
Any hints on how to do that?
In case you can't find a better algorithm (which might be needed here), you can avoid creating those big lists by using generators.
With the help of itertools.chain you can combine the itertools.combinations generators. Furthermore, the max function can take a function as its key.
Your code can be reduced to:
from itertools import chain, combinations

# the range runs to len(input_list) + 1 so the full 20-element combination is included
all_combinations = chain.from_iterable(
    combinations(input_list, i) for i in range(1, len(input_list) + 1))
best = max(all_combinations, key=calcul_factor)
Since this code relies solely on generators it might be faster (doesn't mean fast enough).
Edit: I generally agree with Hugh Bothwell that you should try to find a better algorithm before going with an implementation like this, especially if your lists are going to contain more than 20 elements.
If you can easily calculate calcul_factor(item + [k]) given calcul_factor(item), you might greatly benefit from a dynamic-programming approach.
If you can eliminate some bad solutions early, it will also greatly reduce the total number of combinations to consider (branch-and-bound).
If the calculation is reasonably well-behaved, you might even be able to use, e.g., the simplex method or a linear solver and walk directly to a solution (something like O(n**2 log n) runtime instead of O(2**n)).
Could you show us the actual calcul_factor code and an actual input_list?
