Compressing a list of integers in Python

I have a list of positive (random) integers with the following properties:
Number of elements: 78495
Maximum value of element: 999982
Length of list when converted to a string: 517115 (string looks like "6,79384,238956,...")
Size of list in text file on disk: 520 kb
I am trying to use this list as a precomputed list for an online judge problem because it takes a long time to actually generate this list. However, it is too large to be accepted if I paste it directly into the source code, which has a cap of 50 kb.
I looked into zlib as a way to compress the string but it only seemed to cut the size in half.
Is there a way to really shrink this down so I can unpack it / use it in the source code?

Given your definition ...
it is a list of smallest-k values for which 10^k = 1 mod p for primes p > 5
... am I wrong to believe that your values are of the form (p - 1) / x where x is an integer significantly smaller than p?
For instance, for p < 50, we have:
p = 7 : 10^6 = 1 (mod 7) => k = 6 = (p - 1) / 1 => x = 1
p = 11 : 10^2 = 1 (mod 11) => k = 2 = (p - 1) / 5 => x = 5
p = 13 : 10^6 = 1 (mod 13) => k = 6 = (p - 1) / 2 => x = 2
p = 17 : 10^16 = 1 (mod 17) => k = 16 = (p - 1) / 1 => x = 1
p = 19 : 10^18 = 1 (mod 19) => k = 18 = (p - 1) / 1 => x = 1
p = 23 : 10^22 = 1 (mod 23) => k = 22 = (p - 1) / 1 => x = 1
p = 29 : 10^28 = 1 (mod 29) => k = 28 = (p - 1) / 1 => x = 1
p = 31 : 10^15 = 1 (mod 31) => k = 15 = (p - 1) / 2 => x = 2
p = 37 : 10^3 = 1 (mod 37) => k = 3 = (p - 1) / 12 => x = 12
p = 41 : 10^5 = 1 (mod 41) => k = 5 = (p - 1) / 8 => x = 8
p = 43 : 10^21 = 1 (mod 43) => k = 21 = (p - 1) / 2 => x = 2
p = 47 : 10^46 = 1 (mod 47) => k = 46 = (p - 1) / 1 => x = 1
The list of x values should compress much better than the list of k values. (For instance, I'd be willing to bet that the most frequent value of x will be '1'.)
And because it's rather easy and fast to compute primes up to 1 million (which I think is your upper bound), you may be able to quickly rebuild the list of k values based on the compressed list of x values and the real-time computed list of primes.
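A sketch of that reconstruction, assuming sympy is available (primerange and n_order are real sympy functions; everything else here is illustrative). Note that there are exactly 78495 primes between 7 and 10^6, matching the question's element count:

from sympy import primerange, n_order

xs = []
for p in primerange(7, 1000000):
    k = n_order(10, p)       # smallest k with 10^k = 1 (mod p)
    xs.append((p - 1) // k)  # store x = (p - 1) / k, which compresses better

# at runtime, each k is recovered as (p - 1) // x while iterating the primes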
You probably should have explained from the beginning what exactly you were trying to compress to get more accurate answers.

In short, no.
log2(999982) ~= 20
So the largest number would take 20 bits to store. Let's say that on average, each number takes 10 bits to store (evenly distributed between 0 and the max).
~80,000 numbers * 10 bits per number = 800,000 bits = 100,000 bytes
So these numbers, stored as efficiently as possible, would take ~100KB of space.
Compression will only work if there's some non-randomness to the numbers. If they're truly random, as you say, then a general compression algorithm won't be able to make this any smaller, so 100KB is about the best you can hope to do.
EDIT
Note that things are even worse, in that you want to paste these into source code, so you can't just use arbitrary binary data. You'll need something text-friendly, like base64 encoding, which will add another ~33% of overhead. Also, you can't really store numbers based on the average number of bits required, because you'd need some way to know how many bits were used by each individual number. There are possible encoding schemes, but all will carry some additional overhead.
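To make the overhead concrete, here's a rough sketch of fixed-width 20-bit packing plus base64 (illustrative only, not a complete codec):

import base64

def pack20(numbers):
    bits = ''.join(format(n, '020b') for n in numbers)  # 20 bits per number
    bits += '0' * (-len(bits) % 8)                      # pad to a whole byte
    raw = int(bits, 2).to_bytes(len(bits) // 8, 'big')
    return base64.b64encode(raw)                        # text-safe, ~33% bigger

encoded = pack20([6, 79384, 238956])

At 20 bits per number, 78495 numbers come to ~196 KB before base64 and ~262 KB after, still far over the 50 KB cap.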
SECOND EDIT
Based on the comments above, the data is not actually random as originally stated. A general compression algorithm therefore might work, and if not, there are presumably other solutions (e.g. just shipping the code that generated the numbers in the first place, which is likely smaller than 50KB).

The best general-purpose text compressors only get this down to roughly 12-17% of the original size (about 62-90 kB), so you're not going to meet your 50 kB threshold. Your data are also, as you say, random, which generally makes compression algorithms perform worse.
Look at an alternative approach instead, such as making the generation process faster, or, if you don't need the full list up front (just the integers one at a time), creating a separate "producer" thread that generates the integers (using whatever actual math you need) and a "consumer" thread that does work on those integers as they come in. That way, your program could still make progress even if generating the full list takes a long time.
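A minimal sketch of that producer/consumer idea (generate() and process() are hypothetical stand-ins for the real computation and the real work):

import threading
import queue

def generate():              # stand-in for the slow number generation
    for n in range(78495):
        yield (n * n) % 999983

def process(n):              # stand-in for the per-integer work
    pass

q = queue.Queue(maxsize=1000)

def producer():
    for n in generate():
        q.put(n)             # blocks while the queue is full
    q.put(None)              # sentinel: tells the consumer to stop

threading.Thread(target=producer, daemon=True).start()
while (item := q.get()) is not None:
    process(item)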

Here I've tested the easily available compression algorithms in Python on two byte strings: one generated randomly with a non-uniform distribution, the other with some structure. lzma seems to do best.
# check the compression ratio of several stdlib compressors
import lzma
import zlib
import gzip
import bz2
import numpy as np

compressors = {'lzma': lzma, 'zlib': zlib, 'gzip': gzip, 'bz2': bz2}

a = np.exp(np.random.rand(1024))  # random, non-uniform distribution
b = np.arange(1024)               # structured, with two outliers
b[32] = -10
b[96] = 20000
a = bytes(a)
b = bytes(b)

for name, mod in compressors.items():
    a_compressed = mod.compress(a)
    b_compressed = mod.compress(b)
    print("{} compression ratio: ".format(name))
    print(len(a_compressed) / len(a), len(b_compressed) / len(b))
    print("\n")
The output:
lzma compression ratio:
0.93115234375 0.08984375
zlib compression ratio:
0.95068359375 0.1944580078125
gzip compression ratio:
0.9521484375 0.196533203125
bz2 compression ratio:
0.9925537109375 0.1268310546875
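Following up on these results: one way to actually embed the winner in source code is to pair lzma with base64 so the compressed bytes survive pasting (the data below is a stand-in for the real list):

import base64
import lzma

data = ",".join(str(i) for i in range(78495)).encode()  # stand-in for the list
blob = base64.b64encode(lzma.compress(data))

# paste repr(blob) into the submission, then reverse the steps at runtime:
restored = lzma.decompress(base64.b64decode(blob))
assert restored == data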

Related

Sparse Binary Decomposition

A non-negative integer N is called sparse if its binary representation does not contain two consecutive bits set to 1. For example, 41 is sparse, because its binary representation is "101001" and it does not contain two consecutive 1s. On the other hand, 26 is not sparse, because its binary representation is "11010" and it contains two consecutive 1s.
Two non-negative integers P and Q are called a sparse decomposition of integer N if P and Q are sparse and N = P + Q.
For example:
8 and 18 are a sparse decomposition of 26 (binary representation of 8 is "1000", binary representation of 18 is "10010");
9 and 17 are a sparse decomposition of 26 (binary representation of 9 is "1001", binary representation of 17 is "10001");
2 and 24 are not a sparse decomposition of 26; though 2 + 24 = 26, the binary representation of 24 is "11000", which is not sparse.
I need a function that, given a non-negative integer N, returns any integer that is one part of a sparse decomposition of N. The function should return −1 if there is no sparse decomposition of N.
For example, given N = 26 the function may return 8, 9, 17 or 18, as explained in the example above. All other possible results for N = 26 are 5, 10, 16 and 21.
I tried the following, which works when N = 26, 1166, or 1031, but it does not work for very big numbers like 74901729 because of a runtime error (timeout):
import re

def solution(N):
    for i in range(N):
        x = N - i
        is_x_sparse = not re.findall('11+', bin(x))
        is_i_sparse = not re.findall('11+', bin(i))
        if is_x_sparse and is_i_sparse:
            return i
As per my comment, one solution for any x is the pair (x & 0x55555555, x & 0xAAAAAAAA), of which you can return either element.
Now, why does this work? Let's look at the masks in binary:
0x55555555 = 0b01010101010101010101010101010101
0xAAAAAAAA = 0b10101010101010101010101010101010
They both have alternating 1s and 0s, so the result of the bitwise and of any number with one of the masks will never have two consecutive ones.
The only missing thing is whether the two values sum to the original x. This is indeed the case: each bit of x that was set to 1 will be in exactly one of the two items, and during the addition no carry will ever be generated (when doing the sum, we never sum two 1s). So the addition reduces to the binary or of the operands, and the result will be the original x.
As a final note, the masks I mentioned are 32bit, but can be adapted to any width by extending or reducing the same pattern.
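As a minimal sketch of that width adaptation (illustrative, building the alternating mask on the fly rather than hard-coding 32 bits):

def solution(N):
    mask = 0
    bit = 1
    while bit <= N:   # build the ...010101 pattern out to N's width
        mask |= bit
        bit <<= 2
    return N & mask   # the other element of the pair is N & (mask << 1)

print(solution(26))  # 16 (its partner is 26 - 16 = 10)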
Your code doesn't short-circuit when it finds '11' in the binary expansion of i but instead finds all matches in both i and N-i.
Here is a solution which uses the simple in operator on strings rather than re. It also iterates up to (N+1)//2 rather than N. It takes advantage both of the short-circuiting nature of in and the short-circuiting nature of and:
def solution(N):
    for i in range((N + 1) // 2):
        x = N - i
        if '11' not in bin(i) and '11' not in bin(x):
            return i
    return -1
It is noticeably faster on 74901729.
def solution(N):
    if N == 0:
        return 0
    if N == 1:
        return 1
    for P in range(1, N):
        if P <= N // 2:
            Q = N - P
            s1 = format(P, 'b')
            s2 = s1[1:] + '0'  # P shifted left by one bit, top bit dropped
            if int(s1, 2) & int(s2, 2) == 0:  # no two consecutive 1s in P
                s3 = format(Q, 'b')
                s4 = s3[1:] + '0'
                if int(s3, 2) & int(s4, 2) == 0:  # no two consecutive 1s in Q
                    return P
    return -1
This works correctly but fails on performance; however, you can find a 100% solution under this link:
https://gist.github.com/tcarrio/f90efb54c72cc84c2aa05ce8fc7d5e7d

Code for factoring doesn't work with large numbers?

I have a large 512-bit number n and I need to rewrite n-1 as m * 2^k.
Here is the code I wrote:
# write (n-1) = m * 2^k (where m is odd)
k = 0  # number of times we were able to divide by 2
total = n - 1
while total % 2 == 0:
    total /= 2
    k += 1
m = int(total)
assert (n - 1) == (2 ** k) * m  # this does not hold true for large values of n for some reason
The problem is that it doesn't work for large (512-bit) values of n such as:
8711599454063889217821738854601954834373650047096243407624954758041578156381215983765719390767527065267731131102484447503200895621045535585981917487924709
For the above value of n, my code found k = 460 and m = 2926172291557515.
When I evaluate 2926172291557515 * 2**460 in Python I get:
8711599454063889889401923055669626316647070894345982715097720460936366477064539266279767451213791729696559357170292404522606916263895951485640687369584640
This does not equal n-1. Does anyone know why this could be happening? I assume it's related to having such large numbers (the code works fine for the smaller values of n that I test it with).
The problem arises because you are using /=, which is float division. Replace it with //= (integer division) and your code will work.
@Primusa is correct. In Python 3.x, / (or /=) returns a float value and so loses a lot of precision.
>>> n = 8711599454063889217821738854601954834373650047096243407624954758041578156381215983765719390767527065267731131102484447503200895621045535585981917487924709
>>> total = (n-1)
>>> total / 2
4.355799727031945e+153
>>> total // 2
4355799727031944608910869427300977417186825023548121703812477379020789078190607991882859695383763532633865565551242223751600447810522767792990958743962354
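Putting the fix back into the original loop (the same code as the question, with //= substituted), the assertion now holds:

k = 0
total = n - 1
while total % 2 == 0:
    total //= 2  # integer division keeps arbitrary precision
    k += 1
m = total        # already an int, no cast needed
assert n - 1 == (2 ** k) * m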

Fast modular exponentiation, help me find the mistake

I am trying to implement a fast exponentiation scheme. The degree (exponent) is represented in binary form:
def pow_h(base, degree, module):
    degree = bin(degree)[2:]
    r = 1
    for i in range(len(degree) - 1, -1, -1):
        r = (r ** 2) % module
        r = (r * base ** int(degree[i])) % module
    return r
But function is not working properly, where is the mistake?
As I said in the comments, the built-in pow function already does fast modular exponentiation, but I guess it's a reasonable coding exercise to implement it yourself.
Your algorithm is close, but you're squaring the wrong thing. You need to square base, not r, and you should do it after the multiplying step.
def pow_h(base, degree, module):
    degree = bin(degree)[2:]
    r = 1
    for i in range(len(degree) - 1, -1, -1):
        r = (r * base ** int(degree[i])) % module
        base = (base ** 2) % module
    return r

# test
for i in range(16):
    print(i, 2**i, pow_h(2, i, 100))
output
0 1 1
1 2 2
2 4 4
3 8 8
4 16 16
5 32 32
6 64 64
7 128 28
8 256 56
9 512 12
10 1024 24
11 2048 48
12 4096 96
13 8192 92
14 16384 84
15 32768 68
Using r * base ** int(degree[i]) is a cute trick, but it's probably more efficient to use an if statement than exponentiation. And you can use arithmetic to get the bits of degree rather than strings, although bin is rather efficient. Anyway, here's my version:
def pow_h(base, power, modulus):
    a = 1
    while power:
        power, d = power // 2, power % 2
        if d:
            a = a * base % modulus
        base = base * base % modulus
    return a
Such fast exponentiation must act differently if the current exponent is even or odd, but you have no such check in your code. Here are some hints:
To find x**y, you need an "accumulator" variable to hold the value calculated so far. Let's use a. So you are finding a*(x**y), with your code decreasing y and increasing a and/or x until y becomes zero and a is your final answer.
If y is even, say y==2*k, then a*x**(2*k) == a*(x**2)**k. This decreased y to y//2 and increased x to x**2.
If y is odd, say y==2*k+1, then a*x**(2*k+1) == (a*x)*x**(2*k). This decreased y to y-1 and increased a to a*x.
You should be able to figure the algorithm from here. I did not include using the modulus: that should be easy.
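For reference, here is a minimal sketch of the algorithm those hints describe, with the modulus folded back in:

def pow_h(x, y, m):
    a = 1                # invariant: answer == a * x**y (mod m)
    while y:
        if y % 2:        # y odd: move one factor of x into the accumulator
            a = a * x % m
            y -= 1
        else:            # y even: square x and halve y
            x = x * x % m
            y //= 2
    return a

assert pow_h(2, 15, 100) == pow(2, 15, 100) == 68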

Capturing all data in non-whole train, test, and validate splits

just wondering if a better solution exists for this sort of problem.
We know that for a X/Y percentage split of an even number we can get an exact split of the data - for example for data size 10:
10 * .6 = 6
10 * .4 = 4
10
Splitting data this way is easy, and we can guarantee we have all of the data and nothing is lost. However where I am struggling is on less friendly numbers - take 11
11 * .6 = 6.6
11 * .4 = 4.4
11
However we can't index into an array at i = 6.6, for example. So we have to decide how to do this. If we take JUST the integer portion we lose 1 data point -
First set = 0..6
Second set = 6..10
This would be the same case if we floored the numbers.
However, if we take the ceiling of the numbers:
First set = 0..7
Second set = 7..12
And we've read past the end of our array.
This gets even worse when we throw in a 3rd or 4th split (30,30,20,20 for example).
Is there a standard splitting procedure for these kinds of problems? Is data loss accepted? It seems like data loss would be unacceptable for dependent data, such as time series.
Thanks!
EDIT: The values .6 and .4 are chosen by me. They could be any two numbers that sum to 1.
First of all, notice that your problem is not limited to odd-sized arrays as you claim, but any-sized arrays. How would you make the 56%-44% split of a 10 element array? Or a 60%-40% split of a 4 element array?
There is no standard procedure. In many cases, programmers do not care that much about an exact split and they either do it by flooring or rounding one quantity (the size of the first set), while taking the complementary (array length - rounded size) for the other (the size of the second).
This might be OK in most cases, when this is a one-off calculation and accuracy is not required. You have to ask yourself what your requirements are. For example: are you taking thousands of 10-sized arrays and each time splitting them 56%-44%, doing some calculations and returning a result? You have to ask yourself what accuracy you want. Do you care if your result ends up being the 60%-40% split or the 50%-50% split?
As another example imagine that you are doing a 4-way equal split of 25%-25%-25%-25%. If you have 10 elements and you apply the rounding technique you end up with 3,3,3,1 elements. Surely this will mess up your results.
If you do care about all these inaccuracies then the first step is to consider whether you can adjust either the array size and/or the split ratio(s).
If these are set in stone then the only way to have an accurate split of any ratios of any sized array is to make it probabilistic. You have to split multiple arrays for this to work (meaning you have to apply the same split ratio to same-sized arrays multiple times). The more arrays the better (or you can use the same array multiple times).
So imagine that you have to make a 56%-44% split of a 10 sized array. This means that you need to split it in 5.6 elements and 4.4 elements on the average.
There are many ways you can achieve a 5.6 element average. The easiest one (and the one with the smallest variance in the sequence of tries) is to have 60% of the time a set with 6 elements and 40% of the time a set that has 5 elements.
0.6*6 + 0.4*5 = 5.6
In terms of code this is what you can do to decide on the size of the set each time:
import random

array_size = 10
first_split = 0.56
avg_split_size = array_size * first_split
floored_split_size = int(avg_split_size)
if avg_split_size > floored_split_size:
    if random.uniform(0, 1) > avg_split_size - floored_split_size:
        this_split_size = floored_split_size
    else:
        this_split_size = floored_split_size + 1
else:
    this_split_size = floored_split_size  # the split is already exact
You could make the code more compact, I just made an outline here so you get the idea. I hope this helps.
Instead of using ceil() or floor(), use round(). For example:
>>> round(6.6)
7.0
In Python 2 the value returned is a float (Python 3's round() already returns an int). For getting the integer value, type-cast it to int:
>>> int(round(6.6))
7
This will be the value of your first split. For getting the second split, calculate it as len(data) - split1_val. This applies to the two-split problem.
For a three-way split, take the round() value of two splits and take the value of the third as len(my_list) - val_split_1 - val_split_2.
Generically, for N splits: take the round() value of the first N-1 splits, and for the last one use len(data) minus the sum of those N-1 rounded values, where len() gives the length of the list.
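A small sketch of that generic recipe (the function name is just illustrative):

def split_sizes(data, ratios):
    sizes = [int(round(len(data) * r)) for r in ratios[:-1]]
    sizes.append(len(data) - sum(sizes))  # the last split absorbs the rounding error
    return sizes

print(split_sizes(range(11), [0.6, 0.4]))            # [7, 4]
print(split_sizes(range(10), [0.3, 0.3, 0.2, 0.2]))  # [3, 3, 2, 2]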
Let's first consider just splitting the set into two pieces.
Let n be the number of elements we are splitting, and p and q be the proportions, so that
p+q == 1
I assert that the parts after the decimal point will always sum to either 1 or 0, so we should use floor on one and ceil on the other, and we will always be right.
Here is a function that does that, along with a test. I left the print statements in but they are commented out.
import math

def simpleSplitN(n, p, q):
    "split n into proportions p and q and return indices"
    np = math.ceil(n * p)
    nq = math.floor(n * q)
    # print(n, sum([np, nq]))  # np and nq are the sizes of the two parts
    return [0, np]  # these are the indices we would use

# test for simpleSplitN
for i in range(1, 10):
    p = i / 10.0
    q = 1 - p
    simpleSplitN(37, p, q)
For the mathematically inclined, here is the proof that the decimal proportions will sum to 1
-----------------------
Write p*n == k_1 + f_1, where k_1 is an integer and f_1 is the fractional part, so that 0 <= f_1 < 1. Likewise write q*n == k_2 + f_2 with 0 <= f_2 < 1.
It is important to note that f_1 and f_2 are exactly the parts after the decimal point when we take the two proportions of n.
Now add them together:
f_1 + f_2 == (p*n + q*n) - (k_1 + k_2) == n - (k_1 + k_2)
But by closure of the integers, n - (k_1 + k_2) is an integer, and since
0 <= f_1 + f_2 < 2
the sum of the fractional parts must be either 0 or 1. It will only be 0 in the case that our n is divided evenly.
Otherwise we can now see that our fractional parts will always sum to 1.
-----------------------
We can do a very similar (but slightly more complicated) proof for splitting n into an arbitrary number (say N) parts, but instead of them summing to 1, they will sum to an integer less than N.
Here is the general function, it has uncommented print statements for verification purposes.
import math
import random

def splitN(n, c):
    """Compute indices that can be used to split
    a dataset of n items into a list of proportions c
    by first dividing them naively and then distributing
    the decimal parts of said division randomly
    """
    nc = [n * i for i in c]
    nr = [n * i - int(n * i) for i in c]  # the decimal parts
    N = int(round(sum(nr)))  # sum of all decimal parts
    print(N, nc)
    for i in range(0, len(nc)):
        nc[i] = math.floor(nc[i])
    for i in range(N):  # randomly distribute leftovers
        nc[random.randint(1, len(nc)) - 1] += 1
    print(n, sum(nc))  # nc now contains the sizes of the parts
    out = [0]  # compute a cumulative sum
    for i in range(0, len(nc) - 1):
        out.append(out[-1] + nc[i])
    print(out)
    return out

# test for splitN with various proportions
c = [.1, .2, .3, .4]
c = [.2, .2, .2, .2, .2]
c = [.3, .2, .2, .3]
for n in range(10, 40):
    print(splitN(n, c))
#test for splitN with various proportions
c = [.1,.2,.3,.4]
c = [.2,.2,.2,.2,.2]
c = [.3, .2, .2, .3]
for n in range( 10, 40 ):
print splitN(n, c)
If we have leftovers, we will never get an even split, so we distribute them randomly, like @Thanassis said. If you don't like the dependency on random, then you could just add them all at the beginning or at even intervals.
Both of my functions output indices but they compute proportions and thus could be slightly modified to output those instead per user preference.

count number of ones in a given integer

How do you count the number of ones in a given integer's binary representation?
Say you are given a number 20, which is 10100 in binary, so number of ones is 2.
What you're looking for is called the Hamming weight, and there are a lot of algorithms to do it. Here's another straightforward one:
def ones(n):
    w = 0
    while n:
        w += 1
        n &= n - 1  # clears the lowest set bit, so this loops once per 1-bit
    return w
Use the awesome collections module.
>>> from collections import Counter
>>> binary = bin(20)[2:]
>>> Counter(binary)
Counter({'0': 3, '1': 2})
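Indexing the Counter then gives just the count of ones:
>>> Counter(binary)['1']
2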
Or you can use the built-in function count():
>>> binary = bin(20)[2:]
>>> binary.count('1')
2
Or even:
>>> sum(1 for i in bin(20)[2:] if i == '1')
2
But that last solution is slower than using count()
>>> num = 20
>>> bin(num)[2:].count('1')
2
The usual way to make this blinding fast is to use lookup tables:
table = [bin(i)[2:].count('1') for i in range(256)]
def pop_count(n):
cnt = 0
while n > 0:
cnt += table[n & 255]
n >>= 8
return cnt
In Python, any solution using bin and list.count will be faster, but this is nice if you want to write it in assembler.
Since Python 3.10, the int type has an int.bit_count() method that returns the number of ones in the binary expansion of a given integer, also known as the population count:
>>> n = 20
>>> bin(n)
'0b10100'
>>> n.bit_count()
2
n.bit_count() returns 2, as there are 2 ones in the binary representation.
The str.count method and bin function make short work of this little challenge:
>>> def ones(x):
...     "Count the number of ones in an integer's binary representation"
...     return bin(x).count('1')
...
>>> ones(20)
2
You can do this using bit shifting >> and bitwise and & to inspect the least significant bit, like this:
def count_ones(x):
    result = 0
    while x > 0:
        result += x & 1  # add the least significant bit
        x = x >> 1
    return result
This works by shifting the bits right until the value becomes zero, counting the number of times the least significant bit is 1 along the way.
I am a new coder and I found this logic simple; it might be easier for newbies to understand.
def onesInDecimal(n):
    count = 0
    while n != 0:
        if n % 2 != 0:
            count = count + 1
            n = n - 1
        n = n // 2  # integer division, so n stays exact for big numbers
    return count
For a special case when you need to check quickly whether the binary form of the integer x has only a single 1 (and thus is a power of 2), you can use this check:
if x == -(x | (-x)):
...
The expression -(x | (-x)) is the number that you get if you replace all 1s except the lowest set bit in the binary representation of x with 0 (it is equal to x & (-x)).
Example:
12 = 1100 in binary
-12 = ...110100 in binary (with an infinite number of leading 1s)
12 | (-12) = ...111100 in binary (with an infinite number of leading 1s)
-(12 | (-12)) = 100 in binary
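A quick way to convince yourself is to check the expression against the usual x & (x - 1) == 0 power-of-two test (note that both tests also accept x == 0):

for x in range(1, 1025):
    assert (x == -(x | -x)) == ((x & (x - 1)) == 0)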
If the input number is 'number'
number =20
len(bin(number)[2:].replace('0',''))
Another solution is
from collections import Counter
Counter(list(bin(number))[2:])['1']
