Searchlight Inefficient Compression Scheme

Searchlight Inefficient Compression Scheme - python

Write a compression and decompression algorithm for SICS which works as follows: Find the most popular characters, and in order of popularity assign them a hex value 0 to E, F0 to FE, FF0 to FFE, etc. Note, an F indicates that there are more nibbles to follow, anything else is the terminal nibble.
Compress the message by replacing characters with their assigned value. Below are the sample text, but the code should work universally for any given text.
Test case:
text = "Marley was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge’s name was good upon ’Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Mind! I don’t mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country’s done for. You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail."
solution = "f826b1d0e2a08128fd0340f61f0750e739f50fe916107a054084cf630e9231ff01602f64c303923f5 0fe91061f07a31604f3097a0f6c672b0e2a0a7f05180f6d03910f2b16f0df125f403910f2b16f9f4039 10c581632f916f4025803910f2971f30f14c6516f50ff1f2644f010a7f0518073fd02580ff1f2644f01fa a052f110e2a0f04480cf7450faff2925f01f40f346025d3975f00910f294a10340f7c3097a09258034f 50ff3b80f826b1d0e2a02a0812802a0208446fb527bf50f8758ff40fc0845fa30f11250340a2d03923 0fc0f954ef404f30f1d04e50f954eb18f01f40e92303916107a0f72637f2cb26bd0812802f64c30208 446fb527bf50fc0f17f093092ff010f6115075f2b7518f40f1da1bf3f4034061f0268020f24f3f375fb52 7b02a0391081281a30f771f2104f307645f145f016d0750391036281f50ff5c303910e7a84f104f30 4c6025f21a346a07a07503910a7f17b1ff602580f1d0c592bb4e1809258a0a92bb0543087a3c6f6 073f404603910ff24c536dfaa084510f346f50ff74c0e7bb039161f34610f716f1730f11034061f7123 f401f1f79237f22bbdf4039230f826b1d0e2a02a0812802a0208446fb527bf5

I am adding a code that can help to solve your problem. It will ask for raw input and you can modify it as per your need.
from collections import Counter
key = []
def initCompute(data): # here i am finding the unique chars and its occurrence to find the popular char
uChar = list(set(data))
xCount = dict(Counter(data))
xCount = dict(sorted(xCount.items(), key=lambda item: item[1], reverse=True))
return len(uChar), xCount
def compute(data, keyList, dataDict): # assigning the hex value(key) to the each characters
j = 0
comDict = {}
decdict = {}
sol = ""
for k in dataDict:
comDict[k] = [keyList[j], dataDict[k]]
decdict[keyList[j]] = k
j += 1
for c in data:
sol += comDict[c][0]
return sol, decdict
def decompression(keyDict,
cData): # this is to find the decompressed data by having the dict of
# hex value assigned to char and compressed data as inputs
sol = ""
fNib = ""
for s in cData:
if s == 'f':
fNib += s
else:
fNib += s
sol += keyDict[fNib]
fNib = ""
# print(sol)
return sol
def compression(): # find the key(hex value) and framing the compressed data
i = 0
fac = 16
pwr = 1
keyLen, dDict = initCompute(text)
while len(key) <= keyLen:
if i <= (fac - 2):
key.append(str(hex(i))[2:]) # finding the hex value and storing in key
i += 1
else:
pwr += 1
fac = pow(16, pwr)
i = fac - 16
sol, newDict = compute(text, key, dDict)
print("Assigned hex values for each character: ")
print(newDict)
return sol, newDict, keyLen
if __name__ == '__main__':
text = input("Input your Data here: ") # input
compressed_data, new_dict, key_len = compression()
print("Compressed data: ",compressed_data)
print(compressed_data)
print("Decompressed data:")
print(decompression(new_dict, compressed_data))

Related

Compressing multiple nested `for` loops

Similar to this and many other questions, I have many nested loops (up to 16) of the same structure.
Problem: I have 4-letter alphabet and want to get all possible words of length 16. I need to filter those words. These are DNA sequences (hence 4 letter: ATGC), filtering rules are quite simple:
no XXXX substrings (i.e. can't have same letter in a row more than 3 times, ATGCATGGGGCTA is "bad")
specific GC content, that is number of Gs + number of Cs should be in specific range (40-50%). ATATATATATATA and GCGCGCGCGCGC are bad words
itertools.product will work for that, but data structure here gonna be giant (4^16 = 4*10^9 words)
More importantly, if I do use product, then I still have to go through each element to filter it out. Thus I will have 4 billion steps times 2
My current solution is nested for loops
alphabet = ['a','t','g','c']
for p1 in alphabet:
for p2 in alphabet:
for p3 in alphabet:
...skip...
for p16 in alphabet:
word = p1+p2+p3+...+p16
if word_is_good(word):
good_words.append(word)
counter+=1
Is there good pattern to program that without 16 nested loops? Is there a way to parallelize it efficiently (on multi-core or multiple EC2 nodes)
Also with that pattern i can plug word_is_good? check inside middle of the loops: word that starts badly is bad
...skip...
for p3 in alphabet:
word_3 = p1+p2+p3
if not word_is_good(word_3):
break
for p4 in alphabet:
...skip...

from itertools import product, islice
from time import time
length = 16
def generate(start, alphabet):
"""
A recursive generator function which works like itertools.product
but restricts the alphabet as it goes based on the letters accumulated so far.
"""
if len(start) == length:
yield start
return
gcs = start.count('g') + start.count('c')
if gcs >= length * 0.5:
alphabet = 'at'
# consider the maximum number of Gs and Cs we can have in the end
# if we add one more A/T now
elif length - len(start) - 1 + gcs < length * 0.4:
alphabet = 'gc'
for c in alphabet:
if start.endswith(c * 3):
continue
for string in generate(start + c, alphabet):
yield string
def brute_force():
""" Straightforward method for comparison """
lower = length * 0.4
upper = length * 0.5
for s in product('atgc', repeat=length):
if lower <= s.count('g') + s.count('c') <= upper:
s = ''.join(s)
if not ('aaaa' in s or
'tttt' in s or
'cccc' in s or
'gggg' in s):
yield s
def main():
funcs = [
lambda: generate('', 'atgc'),
brute_force
]
# Testing performance
for func in funcs:
# This needs to be big to get an accurate measure,
# otherwise `brute_force` seems slower than it really is.
# This is probably because of how `itertools.product`
# is implemented.
count = 100000000
start = time()
for _ in islice(func(), count):
pass
print(time() - start)
# Testing correctness
global length
length = 12
for x, y in zip(*[func() for func in funcs]):
assert x == y, (x, y)
main()
On my machine, generate was just a bit faster than brute_force, at about 390 seconds vs 425. This was pretty much as fast as I could make them. I think the full thing would take about 2 hours. Of course, actually processing them will take much longer. The problem is that your constraints don't reduce the full set much.
Here's an example of how to use this in parallel across 16 processes:
from multiprocessing.pool import Pool
alpha = 'atgc'
def generate_worker(start):
start = ''.join(start)
for s in generate(start, alpha):
print(s)
Pool(16).map(generate_worker, product(alpha, repeat=2))

Since you happen to have an alphabet of length 4 (or any "power of 2 integer"), the idea of using and integer ID and bit-wise operations comes to mind instead of checking for consecutive characters in strings. We can assign an integer value to each of the characters in alphabet, for simplicity lets use the index corresponding to each letter.
Example:
6546354310 = 33212321033134 = 'aaaddcbcdcbaddbd'
The following function converts from a base 10 integer to a word using alphabet.
def id_to_word(word_id, word_len):
word = ''
while word_id:
rem = word_id & 0x3 # 2 bits pet letter
word = ALPHABET[rem] + word
word_id >>= 2 # Bit shift to the next letter
return '{2:{0}>{1}}'.format(ALPHABET[0], word_len, word)
Now for a function to check whether a word is "good" based on its integer ID. The following method is of a similar format to id_to_word, except a counter is used to keep track of consecutive characters. The function will return False if the maximum number of identical consecutive characters is exceeded, otherwise it returns True.
def check_word(word_id, max_consecutive):
consecutive = 0
previous = None
while word_id:
rem = word_id & 0x3
if rem != previous:
consecutive = 0
consecutive += 1
if consecutive == max_consecutive + 1:
return False
word_id >>= 2
previous = rem
return True
We're effectively thinking of each word as an integer with base 4. If the Alphabet length was not a "power of 2" value, then modulo % alpha_len and integer division // alpha_len could be used in place of & log2(alpha_len) and >> log2(alpha_len) respectively, although it would take much longer.
Finally, finding all the good words for a given word_len. The advantage of using a range of integer values is that you can reduce the number of for-loops in your code from word_len to 2, albeit the outer loop is very large. This may allow for more friendly multiprocessing of your good word finding task. I have also added in a quick calculation to determine the smallest and largest IDs corresponding to good words, which helps significantly narrow down the search for good words
ALPHABET = ('a', 'b', 'c', 'd')
def find_good_words(word_len):
max_consecutive = 3
alpha_len = len(ALPHABET)
# Determine the words corresponding to the smallest and largest ids
smallest_word = '' # aaabaaabaaabaaab
largest_word = '' # dddcdddcdddcdddc
for i in range(word_len):
if (i + 1) % (max_consecutive + 1):
smallest_word = ALPHABET[0] + smallest_word
largest_word = ALPHABET[-1] + largest_word
else:
smallest_word = ALPHABET[1] + smallest_word
largest_word = ALPHABET[-2] + largest_word
# Determine the integer ids of said words
trans_table = str.maketrans({c: str(i) for i, c in enumerate(ALPHABET)})
smallest_id = int(smallest_word.translate(trans_table), alpha_len) # 1077952576
largest_id = int(largest_word.translate(trans_table), alpha_len) # 3217014720
# Find and store the id's of "good" words
counter = 0
goodies = []
for i in range(smallest_id, largest_id + 1):
if check_word(i, max_consecutive):
goodies.append(i)
counter += 1
In this loop I have specifically stored the word's ID as opposed to the actual word itself incase you are going to use the words for further processing. However, if you are just after the words then change the second to last line to read goodies.append(id_to_word(i, word_len)).
NOTE: I receive a MemoryError when attempting to store all good IDs for word_len >= 14. I suggest writing these IDs/words to a file of some sort!

Finding the longest repetitive piece in a string python

I want to write a function "longest" where my input doc test looks like this (python)
"""
>>>longest('1211')
1
>>>longest('1212')
2
>>>longest('212111212112112121222222212212112121')
2
>>>lvs('1')
0
>>>lvs('121')
0
>>>lvs('12112')
0
"""
What I am trying to achieve is that for example in the first case the 1 is repeated in the back with "11" so the repeated part is 1 and this repeated part is 1 character long it is this length that this function should return.
So in the case of the second you got "1212" so the repeated part is "12" which is 2 characters long.
The tricky thing here is that the longest is "2222222" but this doesn't matter since it is not in the front nor the back. The solution for the last doc test is that 21 is being repeated which is 2 characters long.
The code I have created this far is following
import re
def repetitions(s):
r = re.compile(r"(.+?)\1+")
for match in r.finditer(s):
yield (match.group(1), len(match.group(0)) / len(match.group(1)))
def longest(s):
"""
>>> longest('1211')
1
"""
nummer_hoeveel_keer = dict(repetitions(s)) #gives a dictionary with as key the number (for doctest 1 this be 1) and as value the length of the key
if nummer_hoeveel_keer == {}: #if there are no repetitive nothing should be returnd
return 0
sleutels = nummer_hoeveel_keer.keys() #here i collect the keys to see which has has the longest length
lengtes = {}
for sleutel in sleutels:
lengte = len(sleutel)
lengtes[lengte] = sleutel
while lengtes != {}: #as long there isn't a match and the list isn't empty i keep looking for the longest repetitive which is or in the beginning or in the back
maximum_lengte = max(lengtes.keys())
lengte_sleutel = {v: k for k, v in lengtes.items()}
x= int(nummer_hoeveel_keer[(lengtes[maximum_lengte])])
achter = s[len(s) - maximum_lengte*x:]
voor = s[:maximum_lengte*x]
combinatie = lengtes[maximum_lengte]*x
if achter == combinatie or voor == combinatie:
return maximum_lengte
del lengtes[str(maximum_lengte)]
return 0
when following doc test is put in this code
"""
longest('12112')
0
""
there is a key error where I put "del lengtes[str(maximum_lengte)]"
after a suggestion of #theausome I used his code as a base to work further with (see answer): this makes my code right now look like this:
def longest(s):
if len(s) == 1:
return 0
longest_patt = []
k = s[-1]
longest_patt.append(k)
for c in s[-2::-1]:
if c != k:
longest_patt.append(c)
else:
break
rev_l = list(reversed(longest_patt))
character = ''.join(rev_l)
length = len(rev_l)
s = s.replace(' ','')[:-length]
if s[-length:] == character:
return len(longest_patt)
else:
return 0
l = longest(s)
print l
Still there are some doc tests that are troubling me like for example:
>>>longest('211211222212121111111')
3 #I get 1
>>>longest('2111222122222221211221222112211')
4 #I get 1
>>>longest('122211222221221112111')
4 #I get 1
>>>longest('121212222112222112')
6 #I get 1
Anyone has ideas how to deal with/ approach this problem, maybe find a more graceful way around the problem ?

Try the below code. It works perfectly for your input doc tests.
def longest(s):
if len(s) == 1:
return 0
longest_patt = []
k = s[-1]
longest_patt.append(k)
for c in s[-2::-1]:
if c != k:
longest_patt.append(c)
else:
break
rev_l = list(reversed(longest_patt))
character = ''.join(rev_l)
length = len(rev_l)
s = s.replace(' ','')[:-length]
if s[-length:] == character:
return len(longest_patt)
else:
return 0
l = longest(s)
print l
Output:
longest('1211')
1
longest('1212')
2
longest('212111212112112121222222212212112121')
2
longest('1')
0
longest('121')
0
longest('12112')
0

Encoding a 128-bit integer in Python?

Inspired by the "encoding scheme" of the answer to this question, I implemented my own encoding algorithm in Python.
Here is what it looks like:
import random
from math import pow
from string import ascii_letters, digits
# RFC 2396 unreserved URI characters
unreserved = '-_.!~*\'()'
characters = ascii_letters + digits + unreserved
size = len(characters)
seq = range(0,size)
# Seed random generator with same randomly generated number
random.seed(914576904)
random.shuffle(seq)
dictionary = dict(zip(seq, characters))
reverse_dictionary = dict((v,k) for k,v in dictionary.iteritems())
def encode(n):
d = []
n = n
while n > 0:
qr = divmod(n, size)
n = qr[0]
d.append(qr[1])
chars = ''
for i in d:
chars += dictionary[i]
return chars
def decode(str):
d = []
for c in str:
d.append(reverse_dictionary[c])
value = 0
for i in range(0, len(d)):
value += d[i] * pow(size, i)
return value
The issue I'm running into is encoding and decoding very large integers. For example, this is how a large number is currently encoded and decoded:
s = encode(88291326719355847026813766449910520462)
# print s -> "3_r(AUqqMvPRkf~JXaWj8"
i = decode(s)
# print i -> "8.82913267194e+37"
# print long(i) -> "88291326719355843047833376688611262464"
The highest 16 places match up perfectly, but after those the number deviates from its original.
I assume this is a problem with the precision of extremely large integers when dividing in Python. Is there any way to circumvent this problem? Or is there another issue that I'm not aware of?

The problem lies within this line:
value += d[i] * pow(size, i)
It seems like you're using math.pow here instead of the built-in pow method. It returns a floating point number, so you lose accuracy for your large numbers. You should use the built-in pow or the ** operator or, even better, keep the current power of the base in an integer variable:
def decode(s):
d = [reverse_dictionary[c] for c in s]
result, power = 0, 1
for x in d:
result += x * power
power *= size
return result
It gives me the following result now:
print decode(encode(88291326719355847026813766449910520462))
# => 88291326719355847026813766449910520462

Python iterator behaviour

Given an arbitrary input string I'm meant to find the sum of all numbers in that string.
This obviously requires that i know the NEXT element in string while iterating through it...and make the decision whether its an integer. if the previous element was an integer also, the two elements form a new integer, all other characters are ignored and so on.
For instance an input string
ab123r.t5689yhu8
should result in the sum of 123 + 5689 + 8 = 5820.
All this is to be done without using regular expressions.
I have implemented an iterator in python, whose (next()) method i think returns the next element, but passing the input string
acdre2345ty
I'm getting the following output
a
c
d
r
e
2
4
t
y
Some numbers 3 and 5 are missing...why is this? I need that the next() to work for me to be able to sift through an input string and do the calculations correctly
Better still, how should i implement the next method so that it yields the element to the immediate right during a given iteration?
Here is my code
class Inputiterator(object):
'''
a simple iterator to yield all elements from a given
string successively from a given input string
'''
def __init__(self, data):
self.data = data
self.index = 0
def __iter__(self):
return self
def next(self):
"""
check whether we've reached the end of the input
string, if not continue returning the current value
"""
if self.index == len(self.data)-1:
raise StopIteration
self.index = self.index + 1
return self.data[self.index]
# Create a method to get the input from the user
# simply return a string
def get_input_as_string():
input=raw_input("Please enter an arbitrary string of numbers")
return input
def sort_by_type():
maininput= Inputiterator(get_input_as_string())
list=[]
s=""
for char in maininput:
if str(char).isalpha():
print ""+ str(char)
elif str(char).isdigit() and str(maininput.next()).isdigit():
print ""+ str(char)
sort_by_type()

Python strings are already iterable, no need to create you own iterator.
What you want is thus simply achieved without iterators:
s = "acdre2345ty2390"
total = 0
num = 0
for c in s:
if c.isdigit():
num = num * 10 + int(c)
else:
total += num
num = 0
total += num
Which results in:
>>> print total
4735

This can be done with itertools.groupby:
from itertools import groupby
s = 'ab123r#t5689yhu8'
tot = 0
for k, g in groupby(s, str.isdigit):
if k:
tot += int(''.join(g))
Or in one line (as suggested in the comments down below):
tot = sum((int(''.join(g)) for k, g in groupby(s, str.isdigit) if k)

Edit: I first deleted this answer as there are much better solutions for your problem in this thread, but as you are directly asking how to use the next method to get your code working I have recovered it, in case you find it useful.
Try this (I mocked the iterator for convenience):
def sort_by_type():
maininput = iter("acdre2345ty")
for char in maininput:
if char.isalpha():
print char
elif char.isdigit():
number = char
while True:
# try/except could take care of the StopIteration exception
# when a digit is last in the string
#
# try:
# char = maininput.next()
# except StopIteration:
# char = ""
#
# however using next(iterator, default) is much better:
#
char = next(maininput, "")
if char.isdigit():
number += char
else:
break
print number
print char
if produces:
a
c
d
r
e
2345
t
y

For entertainment purposes only (I couldn't resist, 49 chars):
eval(''.join([['+0+',x][x.isdigit()]for x in s]))

Encrypt a string using a public key

I need to take a string in Python and encrypt it using a public key.
Can anyone give me an example or recommendation about how to go about doing this?

You'll need a Python cryptography library to do this.
Have a look at ezPyCrypto: "As a reaction to some other crypto libraries, which can be painfully complex to understand and use, ezPyCrypto has been designed from the ground up for absolute ease of use, without compromising security."
It has an API:
encString(self, raw)
which looks like what you're after: "High-level func. encrypts an entire string of data, returning the encrypted string as binary."

PyMe provides a Python interface to the GPGME library.
You should, in theory, be able to use that to interact with GPG from Python to do whatever encrypting you need to do.
Here's a very simple code sample from the documentation:
This program is not for serious encryption, but for example purposes only!
import sys
from pyme import core, constants
# Set up our input and output buffers.
plain = core.Data('This is my message.')
cipher = core.Data()
# Initialize our context.
c = core.Context()
c.set_armor(1)
# Set up the recipients.
sys.stdout.write("Enter name of your recipient: ")
name = sys.stdin.readline().strip()
c.op_keylist_start(name, 0)
r = c.op_keylist_next()
# Do the encryption.
c.op_encrypt([r], 1, plain, cipher)
cipher.seek(0,0)
print cipher.read()

I looked at the ezPyCrypto library that was recommended in another answer.
Please don't use this library. It is very incomplete and in some cases incorrect and highly insecure. Public key algorithms have many pitfalls and need to be implemented carefully.
For example, RSA message should use a padding scheme such as PKCS #1, OAEP etc to be secure. This library doesn't pad. DSA signatures should use the SHA1 hash function. This library uses the broken MD5 hash and there is even a bigger bug in the random number generation. Hence the DSA implementation is neither standards conform nor secure. ElGamal is also implemented incorrectly.
Following standards does make implementations somewhat more complex. But not following any is not an option. At least not if you care about security.

An even "simpler" example without the use of any additional libraries would be:
def rsa():
# Choose two prime numbers p and q
p = raw_input('Choose a p: ')
p = int(p)
while isPrime(p) == False:
print "Please ensure p is prime"
p = raw_input('Choose a p: ')
p = int(p)
q = raw_input('Choose a q: ')
q = int(q)
while isPrime(q) == False or p==q:
print "Please ensure q is prime and NOT the same value as p"
q = raw_input('Choose a q: ')
q = int(q)
# Compute n = pq
n = p * q
# Compute the phi of n
phi = (p-1) * (q-1)
# Choose an integer e such that e and phi(n) are coprime
e = random.randrange(1,phi)
# Use Euclid's Algorithm to verify that e and phi(n) are comprime
g = euclid(e,phi)
while(g!=1):
e = random.randrange(1,phi)
g = euclid(e,phi)
# Use Extended Euclid's Algorithm
d = extended_euclid(e,phi)
# Public and Private Key have been generated
public_key=(e,n)
private_key=(d,n)
print "Public Key [E,N]: ", public_key
print "Private Key [D,N]: ", private_key
# Enter plain text to be encrypted using the Public Key
sentence = raw_input('Enter plain text: ')
letters = list(sentence)
cipher = []
num = ""
# Encrypt the plain text
for i in range(0,len(letters)):
print "Value of ", letters[i], " is ", character[letters[i]]
c = (character[letters[i]]**e)%n
cipher += [c]
num += str(c)
print "Cipher Text is: ", num
plain = []
sentence = ""
# Decrypt the cipher text
for j in range(0,len(cipher)):
p = (cipher[j]**d)%n
for key in character.keys():
if character[key]==p:
plain += [key]
sentence += key
break
print "Plain Text is: ", sentence
# Euclid's Algorithm
def euclid(a, b):
if b==0:
return a
else:
return euclid(b, a % b)
# Euclid's Extended Algorithm
def extended_euclid(e,phi):
d=0
x1=0
x2=1
y1=1
orig_phi = phi
tempPhi = phi
while (e>0):
temp1 = int(tempPhi/e)
temp2 = tempPhi - temp1 * e
tempPhi = e
e = temp2
x = x2- temp1* x1
y = d - temp1 * y1
x2 = x1
x1 = x
d = y1
y1 = y
if tempPhi == 1:
d += phi
break
return d
# Checks if n is a prime number
def isPrime(n):
for i in range(2,n):
if n%i == 0:
return False
return True
character = {"A":1,"B":2,"C":3,"D":4,"E":5,"F":6,"G":7,"H":8,"I":9,"J":10,
"K":11,"L":12,"M":13,"N":14,"O":15,"P":16,"Q":17,"R":18,"S":19,
"T":20,"U":21,"V":22,"W":23,"X":24,"Y":25,"Z":26,"a":27,"b":28,
"c":29,"d":30,"e":31,"f":32,"g":33,"h":34,"i":35,"j":36,"k":37,
"l":38,"m":39,"n":40,"o":41,"p":42,"q":43,"r":44,"s":45,"t":46,
"u":47,"v":48,"w":49,"x":50,"y":51,"z":52, " ":53, ".":54, ",":55,
"?":56,"/":57,"!":58,"(":59,")":60,"$":61,":":62,";":63,"'":64,"#":65,
"#":66,"%":67,"^":68,"&":69,"*":70,"+":71,"-":72,"_":73,"=":74}

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.