Finding Median in Large Integer File of Integers

I was asked the following in an interview. I didn't get it, but I am trying to solve it at home. I believe we have to use the median of medians algorithm...
Q: Finding Median in Large Integer File of Integers
Find the median from a large file of integers. You can not access the
numbers by index, can only access it sequentially. And the numbers
cannot fit in memory.
I found a solution online (and rewrote it in Python), but there are a few things I do not understand. I mostly get the algorithm, but I am not 100% sure.
a) Why do we check left >= right?
b) When count < k, we call self.findMedianInLargeFile(numbers,k,max(result+1,guess),right). Why do we call max(result+1, guess) as left?
c) when count > k, why do we use result as right?
class Solution:
    def findMedianInLargeFile(self, numbers,k,left,right):
        if left >= right:
            return left
        result = left
        guess = (left + right ) // 2
        count = 0
        # count the number that is less than guess
        for i in numbers:
            if i <= guess:
                count+=1
                result = max(result,i)
        if count == k:
            return result
        elif count < k: # if the number of items < guess is < K
            return self.findMedianInLargeFile(numbers,k,max(result+1,guess),right)
        else:
            return self.findMedianInLargeFile(numbers,k,left,result)
    def findMedian(self, numbers):
        length = len(numbers)
        if length % 2 == 1: # odd
            return self.findMedianInLargeFile(numbers,length//2 + 1,-999999999,999999999)
        else:
            return (self.findMedianInLargeFile(numbers,length//2,-999999999,999999999) + self.findMedianInLargeFile(numbers,length//2 +1 ,-999999999,999999999)) / 2

This is just a binary search over the value of the median.
Compare it with this example code:
function binary_search(A, n, T):
    L := 0
    R := n − 1
    while L <= R:
        m := floor((L + R) / 2)
        if A[m] < T:
            L := m + 1
        else if A[m] > T:
            R := m − 1
        else:
            return m
    return unsuccessful
a) if left >= right: stops the iteration when the borders collide.
b) When count < k, we call self.findMedianInLargeFile(numbers,k,max(result+1,guess),right) because our guess was too small, and the median value is bigger than the guessed value.
c) The else case is the same situation reversed: the guess was too big, so the median is at most result, the largest value seen that is <= guess.
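
The same idea can also be sketched as an iterative binary search over the value range. This is a simplified variant, not the code from the question: it looks for the smallest value v such that at least k numbers are <= v, which is the k-th smallest number. The names kth_smallest, median and read_numbers are made up for illustration; read_numbers is assumed to return a fresh sequential iterator over the file on every call, and every value is assumed to lie between lo and hi.

```python
def kth_smallest(read_numbers, k, lo, hi):
    """Return the k-th smallest value (1-based), assuming lo <= every value <= hi."""
    while lo < hi:
        guess = (lo + hi) // 2
        # One sequential pass over the data: how many numbers are <= guess?
        count = sum(1 for x in read_numbers() if x <= guess)
        if count >= k:
            hi = guess        # the k-th smallest value is <= guess
        else:
            lo = guess + 1    # fewer than k values are <= guess, so it is larger
    return lo


def median(read_numbers, n, lo=-999999999, hi=999999999):
    """Median of n integers that can only be read sequentially."""
    if n % 2 == 1:
        return kth_smallest(read_numbers, n // 2 + 1, lo, hi)
    lower = kth_smallest(read_numbers, n // 2, lo, hi)
    upper = kth_smallest(read_numbers, n // 2 + 1, lo, hi)
    return (lower + upper) / 2


data = [7, 1, 9, 3, 5, 3]   # stands in for the file (hypothetical sample)
print(median(lambda: iter(data), len(data)))   # prints 4.0
```

Each loop iteration is one pass over the file and halves the value range, so with the bounds used in the question this takes roughly 31 passes per call and only constant memory.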

You could perform an external merge sort, which is O(n log n), since it is designed to operate on data sequentially; the median is then the middle element of the sorted output.
An interesting solution could be through an order-statistic tree, with an implementation available here: Median of large amount of numbers for each sets of given size
If you have any questions, let me know!

Related

Making the complexity smaller (better)

I have an algorithm that counts the good pairs in a list of numbers. A good pair is a pair of indices i < j with arr[i] < arr[j]. It currently has a complexity of O(n^2), but I want to make it O(n log n) using divide and conquer. How can I go about doing that?
Here's the algorithm:
def goodPairs(nums):
    count = 0
    for i in range(0,len(nums)):
        for j in range(i+1,len(nums)):
            if i < j and nums[i] < nums[j]:
                count += 1
                j += 1
            j += 1
    return count
Here's my attempt at making it but it just returns 0:
def goodPairs(arr):
    count = 0
    if len(arr) > 1:
        # Finding the mid of the array
        mid = len(arr)//2
        # Dividing the array elements
        left_side = arr[:mid]
        # into 2 halves
        right_side = arr[mid:]
        # Sorting the first half
        goodPairs(left_side)
        # Sorting the second half
        goodPairs(right_side)
        for i in left_side:
            for j in right_side:
                if i < j:
                    count += 1
    return count

The previously accepted answer by Fire Assassin doesn't really answer the question, which asks for better complexity. It's still quadratic, and about as fast as a much simpler quadratic solution. Benchmark with 2000 shuffled ints:
387.5 ms original
108.3 ms pythonic
104.6 ms divide_and_conquer_quadratic
4.1 ms divide_and_conquer_nlogn
4.6 ms divide_and_conquer_nlogn_2
Code (Try it online!):
def original(nums):
    count = 0
    for i in range(0,len(nums)):
        for j in range(i+1,len(nums)):
            if i < j and nums[i] < nums[j]:
                count += 1
                j += 1
            j += 1
    return count
def pythonic(nums):
    count = 0
    for i, a in enumerate(nums, 1):
        for b in nums[i:]:
            if a < b:
                count += 1
    return count
def divide_and_conquer_quadratic(arr):
    count = 0
    left_count = 0
    right_count = 0
    if len(arr) > 1:
        mid = len(arr) // 2
        left_side = arr[:mid]
        right_side = arr[mid:]
        left_count = divide_and_conquer_quadratic(left_side)
        right_count = divide_and_conquer_quadratic(right_side)
        for i in left_side:
            for j in right_side:
                if i < j:
                    count += 1
    return count + left_count + right_count
def divide_and_conquer_nlogn(arr):
    mid = len(arr) // 2
    if not mid:
        return 0
    left = arr[:mid]
    right = arr[mid:]
    count = divide_and_conquer_nlogn(left)
    count += divide_and_conquer_nlogn(right)
    i = 0
    for r in right:
        while i < mid and left[i] < r:
            i += 1
        count += i
    arr[:] = left + right
    arr.sort() # linear, as Timsort takes advantage of the two sorted runs
    return count
def divide_and_conquer_nlogn_2(arr):
    mid = len(arr) // 2
    if not mid:
        return 0
    left = arr[:mid]
    right = arr[mid:]
    count = divide_and_conquer_nlogn_2(left)
    count += divide_and_conquer_nlogn_2(right)
    i = 0
    arr.clear()
    append = arr.append
    for r in right:
        while i < mid and left[i] < r:
            append(left[i])
            i += 1
        append(r)
        count += i
    arr += left[i:]
    return count
from timeit import timeit
from random import shuffle
arr = list(range(2000))
shuffle(arr)
funcs = [
    original,
    pythonic,
    divide_and_conquer_quadratic,
    divide_and_conquer_nlogn,
    divide_and_conquer_nlogn_2,
]
for func in funcs:
    print(func(arr[:]))
for _ in range(3):
    print()
    for func in funcs:
        arr2 = arr[:]
        t = timeit(lambda: func(arr2), number=1)
        print('%5.1f ms ' % (t * 1e3), func.__name__)

One of the most well-known divide-and-conquer algorithms is merge sort. And merge sort is actually a really good foundation for this algorithm.
The idea is that when comparing two numbers from two different 'partitions', you already have a lot of information about the remaining part of these partitions, as they're sorted in every iteration.
Let's take an example!
Consider the following partitions, which have already been sorted individually and whose "good pairs" have been counted.
Partition x: [1, 3, 6, 9].
Partition y: [4, 5, 7, 8].
It is important to note that the numbers from partition x are located further to the left in the original list than those from partition y. In particular, the index i of every element in x is smaller than the index j of every element in y.
We will start off by comparing 1 and 4. Obviously 1 is smaller than 4. But since 4 is the smallest element in partition y, 1 must also be smaller than the rest of the elements in y. Consequently, we can conclude that there are 4 additional good pairs, since the index of 1 is also smaller than the index of the remaining elements of y.
The exact same thing happens with 3, and we can add 4 new good pairs to the sum.
For 6 we will conclude that there are two new good pairs, (6, 7) and (6, 8). The comparison between 6 and 4 did not yield a good pair, and likewise for 6 and 5.
You might now notice how these additional good pairs are counted: if the element from x is less than the element from y, add the number of elements remaining in y to the sum. Rinse and repeat.
Since merge sort is an O(n log n) algorithm, and the additional work per comparison in this algorithm is constant, we can conclude that this algorithm is also an O(n log n) algorithm.
I will leave the actual programming as an exercise for you.
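
For readers who want to check their own attempt, the counting rule described above can be sketched as follows. This is one possible way to write it, not the answerer's code; the names count_good_pairs and sort_and_count are made up for illustration.

```python
def count_good_pairs(nums):
    """Count pairs i < j with nums[i] < nums[j] using a merge sort."""

    def sort_and_count(a):
        # Returns (sorted copy of a, number of good pairs inside a).
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, left_count = sort_and_count(a[:mid])
        right, right_count = sort_and_count(a[mid:])
        count = left_count + right_count
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] < right[j]:
                # left[i] is smaller than right[j] and everything after it
                count += len(right) - j
                merged.append(left[i])
                i += 1
            else:
                # not a good pair (right[j] <= left[i]); move on in right
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, count

    return sort_and_count(list(nums))[1]


# The example partitions x and y from above, placed side by side
print(count_good_pairs([1, 3, 6, 9, 4, 5, 7, 8]))   # prints 22
```

Of the 22 pairs, 10 are the cross pairs counted in the walkthrough (4 + 4 + 2 + 0), and each already sorted half contributes 6 more.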

@niklasaa has added an explanation for the merge sort analogy, but your implementation still has an issue.
You are partitioning the array and calculating the result for either half, but:
1. You haven't actually sorted either half. So when you're comparing their elements, your two-pointer approach isn't correct.
2. You haven't used their results in the final computation. That's why you're getting an incorrect answer.
For point #1, you should look at merge sort, especially the merge() function. That logic is what will give you the correct pair count without having O(N^2) iteration.
For point #2, store the result for either half first:
# Sorting the first half
leftCount = goodPairs(left_side)
# Sorting the second half
rightCount = goodPairs(right_side)
While returning the final count, add these two results as well.
return count + leftCount + rightCount

Like @Abhinav Mathur stated, you have most of the code down; your problem is with these lines:
# Sorting the first half
goodPairs(left_side)
# Sorting the second half
goodPairs(right_side)
You want to store these in variables that should be declared before the if statement. Here's an updated version of your code:
def goodPairs(arr):
    count = 0
    left_count = 0
    right_count = 0
    if len(arr) > 1:
        mid = len(arr) // 2
        left_side = arr[:mid]
        right_side = arr[mid:]
        left_count = goodPairs(left_side)
        right_count = goodPairs(right_side)
        for i in left_side:
            for j in right_side:
                if i < j:
                    count += 1
    return count + left_count + right_count
Recursion can be difficult at times; look into merge sort and quicksort to get a better idea of how divide-and-conquer algorithms work.

Number of ways to get sum of number(Integer Partition) using recursion or other methods

Question from codewars https://www.codewars.com/kata/52ec24228a515e620b0005ef/python
In number theory and combinatorics, a partition of a positive integer n, also called an integer partition, is a way of writing n as a sum of positive integers. Two sums that differ only in the order of their summands are considered the same partition. If order matters, the sum becomes a composition. For example, 4 can be partitioned in five distinct ways:
4
3 + 1
2 + 2
2 + 1 + 1
1 + 1 + 1 + 1
Given number n, write a function exp_sum(n) that returns the total number of ways n can be partitioned.
Eg: exp_sum(4) = 5
Why does the recursion approach:
def exp_sum(n):
    arr = list(range(1, n+1))
    mem = {}
    return rec(n, arr, mem)
def rec(n, arr, mem):
    key = str(n)+ ":" + str(arr)
    if key in mem:
        return mem[key]
    elif n < 0:
        return 0
    elif n == 0:
        return 1
    elif n > 0 and not arr:
        return 0
    else:
        to_return = rec(n - arr[-1], arr, mem) + rec(n, arr[:-1], mem)
        mem[key] = to_return
        return to_return
take so much longer to run compared to this particular method (top solution of this kata)?
def exp_sum(n):
    if n < 0:
        return 0
    dp = [1]+[0]*n
    for num in range(1,n+1):
        for i in range(num,n+1):
            dp[i] += dp[i-num]
    return dp[-1]
Even with using memoisation, the recursion approach barely managed to pass the test case at a time of about 10000ms, compared to the 1000ms taken for the above approach.
And can anyone explain how the particular method above works and the logic behind it or if it uses some particular algorithm which I can read up about?
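
One way to read the second solution (this is an interpretation, not something stated in the kata): after the outer loop has processed num, dp[i] holds the number of partitions of i that use only parts of size at most num. It is the same bottom-up table as in the classic coin change counting problem. The memoized recursion computes the same values, but every call builds a string of the whole list for its key and copies the list with arr[:-1], which adds a lot of work per call. The sketch below repeats the loops and records the table after each part size; exp_sum_trace is a made-up name and n >= 0 is assumed.

```python
def exp_sum_trace(n):
    """Same loops as the kata solution, plus a snapshot of dp after each part size."""
    dp = [1] + [0] * n            # dp[0] = 1: the empty sum
    snapshots = []
    for num in range(1, n + 1):   # allow parts of size num from now on
        for i in range(num, n + 1):
            # partitions of i that contain at least one part equal to num
            dp[i] += dp[i - num]
        snapshots.append((num, dp[:]))
    return dp[-1], snapshots


total, snapshots = exp_sum_trace(4)
for num, row in snapshots:
    print("parts up to", num, "->", row)
print(total)   # prints 5
```

For n = 4 the rows are [1, 1, 1, 1, 1], [1, 1, 2, 2, 3], [1, 1, 2, 3, 4] and [1, 1, 2, 3, 5], so the last entry grows to the expected 5.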

Calculate the number of combinations of unique positive integers with minimum and maximum differences between each other?

How do I write a Python program to calculate the number of combinations of unique sorted positive integers over a range of integers that can be selected where the minimum difference between each of the numbers in the set is one number and the maximum difference is another number?
For instance, if I want to calculate the number of ways I can select 6 numbers from the positive integers from 1-50 such that the minimum difference between each number is 4 and the maximum difference between each number is 7, I would want to count the combination {1,6,12,18,24,28} since the minimum difference is 4 and the maximum difference is 6, but I would not want to count combinations like {7,19,21,29,41,49} since the minimum difference is 2 and the maximum difference is 12.
I have the following code so far, but the problem is that it has to loop through every combination, which takes an extremely long time in many cases.
import itertools
def min_max_differences(integer_list):
    i = 1
    diff_min = max(integer_list)
    diff_max = 1
    while i < len(integer_list):
        diff = (integer_list[i]-integer_list[i-1])
        if diff < diff_min:
            diff_min = diff
        if diff > diff_max:
            diff_max = diff
        i += 1
    return (diff_min,diff_max)
def total_combinations(lower_bound,upper_bound,min_difference,max_difference,elements_selected):
    numbers_range = list(range(lower_bound,upper_bound+1,1))
    all_combos = itertools.combinations(numbers_range,elements_selected)
    min_max_diff_combos = 0
    for c in all_combos:
        if min_max_differences(c)[0] >= min_difference and min_max_differences(c)[1] <= max_difference:
            min_max_diff_combos += 1
    return min_max_diff_combos
I do not have a background in combinatorics, but I am guessing there is a much more algorithmically efficient way to do this using some combinatorial methods.

You can use a recursive function with caching to get your answer.
This method will work even if you have a large array because some positions are repeated many times with the same parameters.
Here is some code for you (forgive me if I made any mistakes in Python, because I don't normally use it).
If there is any flaw in the logic, please let me know.
# function to get the number of ways to select {target} numbers from the
# array {numbers} with minimum difference {min} and maximum difference {max}
# starting from position {p}, with the help of caching
dict = {}
def Combinations(numbers, target, min, max, p):
    if target == 1: return 1
    # get a unique key for this position
    key = target * 1000000000000 + min * 100000000 + max * 10000 + p
    if key in dict: return dict[key]
    ans = 0
    # current start value
    pivot = numbers[p]
    p += 1
    # increase the position until you reach the minimum
    while p < len(numbers) and numbers[p] - pivot < min:
        p += 1
    # get all the values in the range of min <--> max
    while p < len(numbers) and numbers[p] - pivot <= max:
        ans += Combinations(numbers, target - 1, min, max, p)
        p += 1
    # store the ans for further inquiry
    dict[key] = ans
    return ans
# any range of numbers (must be SORTED as you asked)
numbers = []
for i in range(0,50): numbers.append(i+1)
# number of numbers to select
count = 6
# minimum difference
min = 4
# maximum difference
max = 7
ans = 0
for i in range(0,len(numbers)):
    ans += Combinations(numbers, count, min, max, i)
print(ans)
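
In Python 3, the same cached recursion can be sketched with functools.lru_cache instead of a hand-built key. This is an illustrative rewrite under the assumption that the numbers form a contiguous integer range, not the answer's code; count_gap_combinations and its parameter names are made up here.

```python
from functools import lru_cache


def count_gap_combinations(lower, upper, min_gap, max_gap, k):
    """Count increasing k-element selections from lower..upper whose
    consecutive differences all lie in min_gap..max_gap (assumes k >= 1)."""

    @lru_cache(maxsize=None)
    def ways(last, remaining):
        # Number of ways to pick `remaining` more values after `last`.
        if remaining == 0:
            return 1
        total = 0
        for nxt in range(last + min_gap, min(last + max_gap, upper) + 1):
            total += ways(nxt, remaining - 1)
        return total

    # Try every possible smallest element, then pick the other k - 1.
    return sum(ways(first, k - 1) for first in range(lower, upper + 1))


print(count_gap_combinations(1, 50, 4, 7, 6))   # prints 23040
```

The cache holds at most (upper - lower + 1) * k entries, so the running time no longer depends on the number of combinations.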

Here is a very simple (and non-optimized) recursive approach:
Code
import numpy as np
from time import time
""" PARAMETERS """
SET = range(50) # Set of elements to choose from
N = 6 # N elements to choose
MIN_GAP = 4 # Gaps
MAX_GAP = 7 # ""
def count(N, CHOSEN=[]):
    """ assumption: N > 0 at start """
    if N == 0:
        return 1
    else:
        return sum([count(N-1, CHOSEN + [val])
                    for val in SET if (val not in CHOSEN)
                    and ((not CHOSEN) or ((val - CHOSEN[-1]) >= MIN_GAP))
                    and ((not CHOSEN) or ((val - CHOSEN[-1]) <= MAX_GAP))])
start_time = time()
count_ = count(N)
print('used time in secs: ', time() - start_time)
print('# solutions: ', count_)
Output
('used time in secs: ', 0.1174919605255127)
('# solutions: ', 23040)
Remarks
It outputs the same result as Ayman's approach.
Ayman's approach is much more powerful (in terms of asymptotic speed).

Find max value in for loop range - Python

I've been working on a small program that takes two numbers as input and gives their greatest common divisor. I have managed to get the program to at least print out all common divisors until the greatest one is reached, but all I need to print is the max value. Unfortunately, I can't seem to get this to work. I've tried passing i through max() but received an error saying that 'int' object is not iterable. Thus I am wondering if anyone could help me find a solution that will allow me to print only the max value as opposed to all values, without having to employ much more complex coding methods. Here is the code:
def great_divisor():
    m = int(raw_input("Choose a number"))
    n = int(raw_input("Choose another number"))
    #lowest number assigned to d
    if m > n:
        d = n
    else:
        d = m
    for i in range(1, d + 1):
        if (n%i == 0 and m%i == 0):
            print(max(i))
    return

The easiest way is to use range(d, 0, -1) and just return the first divisor you find. No need to use max.
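
A minimal sketch of that suggestion, assuming both inputs are positive integers (the function name greatest_common_divisor is made up for illustration, and the input prompts are left out):

```python
def greatest_common_divisor(m, n):
    """Largest i that divides both m and n, assuming m and n are positive integers."""
    d = min(m, n)                      # no common divisor can exceed the smaller number
    for i in range(d, 0, -1):          # count downwards: d, d-1, ..., 1
        if m % i == 0 and n % i == 0:
            return i                   # the first hit going downwards is the greatest


print(greatest_common_divisor(12, 18))   # prints 6
```

The loop always returns, because 1 divides every positive integer.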

How about this?
maxn = 0
for i in range(1, d + 1):
    if (n%i == 0 and m%i == 0):
        maxn = i
return maxn

When called with a single argument, max needs an iterable, a list for example.
You can add all the common divisors to a list and get the max.
divisors = []
for i in range(1, d + 1):
    if (n%i == 0 and m%i == 0):
        divisors.append(i)
print(max(divisors))

How to calculate the number of palindromes in a large number interval?

I want to calculate how many numbers are palindromes in a large interval, say up to 10^15.
My simple code (python) snippet is:
def count_palindromes(start, end):
    count = 0
    for i in range(start, end + 1):
        if str(i) == str(i)[::-1]:
            count += 1
    return count
start = 1000 #some initial number
end = 10000000000000 #some other large number
if __name__ == "__main__":
    print count_palindromes(start, end)
It's a simple program which checks each number one by one. It's very time consuming and takes a lot of computer resources.
Is there any other method/technique by which we can count Palindrome numbers? Any Algorithm to use for this?
I want to minimize time taken in producing the output.

When you want to count the numbers having some given property between two limits, it is often useful to solve the somewhat simpler problem
How many numbers with the given property are there between 0 and n?
Keeping one limit fixed can make the problem significantly simpler to tackle. When the simpler problem is solved, you can get the solution to the original problem with a simple subtraction:
countBetween(a,b) = countTo(b) - countTo(a)
or countTo(b ± 1) - countTo(a ± 1), depending on whether the limit is included in countTo and which limits shall be included in countBetween.
If negative limits can occur (not for palindromes, I presume), countTo(n) should be <= 0 for negative n (one can regard the function as an integral with respect to the counting measure).
So let us determine
palindromes_below(n) = #{ k : 0 <= k < n, k is a palindrome }
We get more uniform formulae for the first part if we pretend that 0 is not a palindrome, so for the first part, we do that.
Part 1: How many palindromes with a given number d of digits are there?
The first digit cannot be 0, otherwise it's unrestricted, hence there are 9 possible choices (b-1 for palindromes in an arbitrary base b).
The last digit is equal to the first by the fact that it shall be a palindrome.
The second digit - if d >= 3 - can be chosen arbitrarily and independently from the first. That also determines the penultimate digit.
If d >= 5, one can also freely choose the third digit, and so on.
A moment's thought shows that for d = 2*k + 1 or d = 2*k + 2, there are k digits that can be chosen without restriction, and one digit (the first) that is subject to the restriction that it be non-zero. So there are
9 * 10**k
d-digit palindromes then ((b-1) * b**k for base b).
That's a nice and simple formula. From that, using the formula for a geometric sum, we can easily obtain the number of palindromes smaller than 10**n (that is, with at most n digits):
if n is even, the number is
2 * ∑_{k=0}^{n/2-1} 9*10**k = 18 * ∑_{k=0}^{n/2-1} 10**k = 18 * (10**(n/2) - 1) / (10 - 1) = 2 * (10**(n/2) - 1)
if n is odd, the number is
2 * (10**((n-1)/2) - 1) + 9 * 10**((n-1)/2) = 11 * 10**((n-1)/2) - 2
(for general base b, the numbers are 2 * (b**(n/2) - 1) resp. (b+1) * b**((n-1)/2) - 2).
That's not quite as uniform anymore, but still simple enough:
def palindromes_up_to_n_digits(n):
    if n < 1:
        return 0
    if n % 2 == 0:
        return 2*10**(n//2) - 2
    else:
        return 11*10**(n//2) - 2
(remember, we don't count 0 yet).
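
As a sanity check, the two closed forms can be compared with a brute-force count for small n. This is only a verification sketch; brute_force is a made-up helper, and the function is repeated so the snippet runs on its own.

```python
def palindromes_up_to_n_digits(n):
    """Closed form from the text: palindromes with 1..n digits (0 is not counted)."""
    if n < 1:
        return 0
    if n % 2 == 0:
        return 2 * 10 ** (n // 2) - 2
    return 11 * 10 ** (n // 2) - 2


def brute_force(n):
    """Count palindromes in 1 .. 10**n - 1 by checking every number."""
    return sum(1 for i in range(1, 10 ** n) if str(i) == str(i)[::-1])


for n in range(1, 6):
    # Both columns should agree: 9, 18, 108, 198, 1098
    print(n, palindromes_up_to_n_digits(n), brute_force(n))
```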
Now for the remaining part. Given n > 0 with k digits, the palindromes < n are either
palindromes with fewer than k digits, there are palindromes_up_to_n_digits(k-1) of them, or
palindromes with exactly k digits that are smaller than n.
So it remains to count the latter.
Part 2:
Let m = (k-1)//2 and
d[1] d[2] ... d[m] d[m+1] ... d[k]
the decimal representation of n (the whole thing works with the same principle for other bases, but I don't explicitly mention that in the following), so
n = ∑_{j=1}^{k} d[j]*10**(k-j)
For each 1 <= c[1] < d[1], we can choose the m digits c[2], ..., c[m+1] freely to obtain a palindrome
p = c[1] c[2] ... c[m+1] {c[m+1]} c[m] ... c[2] c[1]
(the digit c[m+1] appears once for odd k and twice for even k). Now,
c[1]*(10**(k-1) + 1) <= p < (c[1] + 1)*10**(k-1) <= d[1]*10**(k-1) <= n,
so all these 10**m palindromes (for a given choice of c[1]!) are smaller than n.
Thus there are (d[1] - 1) * 10**m k-digit palindromes whose first digit is smaller than the first digit of n.
Now let us consider the k-digit palindromes with first digit d[1] that are smaller than n.
If k == 2, there is one if d[1] < d[2] and none otherwise. If k >= 3, for each 0 <= c[2] < d[2], we can freely choose the m-1 digits c[3] ... c[m+1] to obtain a palindrome
p = d[1] c[2] c[3] ... c[m] c[m+1] {c[m+1]} c[m] ... c[3] c[2] d[1]
We see p < n:
d[1]*(10**(k-1) + 1) + c[2]*(10**(k-2) + 10)
<= p < d[1]*(10**(k-1) + 1) + (c[2] + 1)*(10**(k-2) + 10)
<= d[1]*(10**(k-1) + 1) + d[2]*(10**(k-2) + 10) <= n
(assuming k > 3, for k == 3 replace 10**(k-2) + 10 with 10).
So that makes d[2]*10**(m-1) k-digit palindromes with first digit d[1] and second digit smaller than d[2].
Continuing, for 1 <= r <= m, there are
d[r+1]*10**(m-r)
k-digit palindromes whose first r digits are d[1] ... d[r] and whose (r+1)-st digit is smaller than d[r+1].
Summing up, there are
(d[1]-1)*10**m + d[2]*10**(m-1) + ... + d[m]*10 + d[m+1]
k-digit palindromes that have one of the first m+1 digits smaller than the corresponding digit of n and all preceding digits equal to the corresponding digit of n. Obviously, these are all smaller than n.
There is one k-digit palindrome p whose first m+1 digits are d[1] .. d[m+1], we must count that too if p < n.
So, wrapping up, and now incorporating 0 too, we get
def palindromes_below(n):
    if n < 1:
        return 0
    if n < 10:
        return n # 0, 1, ..., n-1
    # General case
    dec = str(n)
    digits = len(dec)
    count = palindromes_up_to_n_digits(digits-1) + 1 # + 1 for 0
    half_length = (digits-1) // 2
    front_part = dec[0:half_length + 1]
    count += int(front_part) - 10**half_length
    i, j = half_length, half_length+1
    if digits % 2 == 1:
        i -= 1
    while i >= 0 and dec[i] == dec[j]:
        i -= 1
        j += 1
    if i >= 0 and dec[i] < dec[j]:
        count += 1
    return count
Since the limits are both to be included in the count for the given problem (unless the OP misunderstood), we then have
def count_palindromes(start, end):
    return palindromes_below(end+1) - palindromes_below(start)
for a fast solution:
>>> bench(10**100,10**101-1)
900000000000000000000000000000000000000000000000000 palindromes between
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
and
99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999
in 0.000186920166016 seconds

Actually, it's a problem for Google Codejam (which I'm pretty sure you're not supposed to get outside help on) but alas, I'll throw in my 2 cents.
The idea I came up with (but failed to implement) for the large problem was to precompile (generated at runtime, not hardcoded into the source) a list of all palindromic numbers less than 10^15 (there's not very many, it takes like ~60 seconds) then find out how many of those numbers lie between the bounds of each input.
EDIT: This won't work on the 10^100 problem, like you said, that would be a mathematical solution (although there is a pattern if you look, so you'd just need an algorithm to generate all numbers with that pattern)

I presume this is for something like Project Euler... My rough idea would be to generate all numbers up to half the length of your limit (for example, if you're going to 99999, go up to 99). Then reverse them, append them to the unreversed one, and potentially add a digit in the middle (for the numbers with odd lengths). You might have to do some filtering for duplicates or weird ones (like a zero at the beginning of the number), but that should be a lot faster than what you were doing.
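
A rough sketch of this mirroring idea, written so that each palindrome is built from its leading half (including the middle digit for odd lengths). The function names are made up for illustration, and 1 <= start <= end is assumed, so 0 is never produced.

```python
def palindromes_with_digits(d):
    """Yield every d-digit palindrome (d >= 1) in increasing order."""
    half = (d + 1) // 2                                  # freely chosen leading digits
    for front in range(10 ** (half - 1), 10 ** half):    # no leading zero
        s = str(front)
        if d % 2 == 1:
            yield int(s + s[:-1][::-1])   # odd length: the middle digit appears once
        else:
            yield int(s + s[::-1])        # even length: mirror the whole front part


def count_palindromes_by_generation(start, end):
    """Count palindromes p with start <= p <= end, assuming 1 <= start <= end."""
    total = 0
    for d in range(len(str(start)), len(str(end)) + 1):
        total += sum(1 for p in palindromes_with_digits(d) if start <= p <= end)
    return total


print(list(palindromes_with_digits(2)))
print(count_palindromes_by_generation(1000, 100000))
```

Because every palindrome is determined by its front half, this variant produces no duplicates and no leading zeros, so no extra filtering is needed. The work grows with the number of palindromes (about the square root of the limit) instead of with the size of the interval.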
