Representing a word as a sequence of bits - Python

I want to represent a word as a sequence of 26 bits. If the 25th bit is set, it means that the letter 'y' is present in that word.
For example: word: "abekz"
representation: 10000000000000010000010011
This is very easy to do in C/C++, since they have a 32-bit int type. But Python's int has arbitrary precision, so I'm unable to do it.
Here's my (wrong) solution:
def representAsBits(string):
    mask = 0
    for each_char in string:
        bit_position = ord(each_char) - 97  # string consists of only lower-case letters
        mask = mask | (1 << bit_position)
    return bin(mask)

print representAsBits("abze")  # gives me 0b10000000000000000000010011
print representAsBits("wxcc")  # gives me 0b110000000000000000000100 (2 bits missing here)
What changes can I make? Thanks!

You can't store leading zeroes in an integer. Thankfully, you're using bin(), which returns a string.
With a little creative slicing and zero-padding, we can format it however we want:
return "0b%032d" % int(bin(mask)[2:])
will give:
>>> representAsBits("abekz")
'0b00000010000000000000010000010011'
That being said, to compare masks you don't need bin() at all unless you want to display the binary form. Compare the integers themselves, which are equal whenever the letter sets are equal.
With return mask instead of return bin(mask):
>>> representAsBits("z") == representAsBits("zzz")
True
Since equal masks always produce equal strings, the padding you choose doesn't affect comparisons: any string containing only the characters w, x, and c will yield the same output, whichever formatting method you use.
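As a rough sketch of the same idea with a fixed width of 26 (one bit per letter), the following assumes Python 3 and lowercase ASCII input only. The name represent_as_bits and the width parameter are illustrative and not part of the original code; format() with a zero-padded 'b' spec is swapped in for the slicing approach.

```python
def represent_as_bits(word, width=26):
    """Return (mask, bit_string) for a word of lowercase ASCII letters.

    Hypothetical helper: bit 0 stands for 'a', bit 25 for 'z'.
    """
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord('a'))
    # '026b' style spec: binary digits, zero-padded on the left to `width`
    return mask, format(mask, '0{}b'.format(width))


mask, bits = represent_as_bits("wxcc")
print(bits)  # always 26 characters, leading zeros included
```

Keeping the integer mask alongside the string means comparisons can use the integer, while the string is only needed for display.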

Related

How to encode with actual bits in Python?

I have built a huffman encoder in Python, but because I'm storing the bits (which represent the characters) as strings, the encoded text is larger than the original. How can I use actual bits to compress text properly?
You can convert a str of 1s and 0s to an int type variable like this:
>>> int('10110001',2)
177
And you can convert ints back to strs of 1s and 0s like this:
>>> format(177,'b')
'10110001'
Also, note that you can write int literals in binary using a leading 0b, like this:
>>> foo = 0b10110001
>>> foo
177
Now, before you say "No, I asked for bits, not ints!" think about that for a second. An int variable isn't stored in the computer's hardware as a base-10 representation of the number; it's stored directly as bits.
EDIT: Stefan Pochmann points out that this will drop leading zeros. Consider:
>>> code = '000010110001'
>>> bitcode = int(code, 2)
>>> format(bitcode, 'b')
'10110001'
So how do you keep the leading zeros? There are a few ways. Which one you choose will likely depend on whether you want to convert each character's code into an int first and then concatenate them, or concatenate the strings of 1s and 0s before converting the whole thing to an int. The latter will probably be much simpler. One way that works well for the latter is to store the length of the code and then use it with this syntax:
>>> format(bitcode, '012b')
'000010110001'
where '012b' tells the format function to pad the left of the string with enough zeros to ensure a minimum length of 12. So you can use it in this way:
>>> code = '000010110001'
>>> code_length = len(code)
>>> bitcode = int(code, 2)
>>> format(bitcode, '0{}b'.format(code_length))
'000010110001'
Finally, if that {} and second format is unfamiliar to you, read up on string formatting.
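To actually shrink the output, the int still has to be written out as bytes. Here is a minimal sketch, assuming Python 3, of packing a string of 1s and 0s into bytes together with its bit length and recovering it again. The names pack_bits and unpack_bits are hypothetical; int.to_bytes and int.from_bytes are the standard int methods.

```python
def pack_bits(code):
    """Return (bit_length, payload) for a string of '0'/'1' characters."""
    n = len(code)
    value = int(code, 2) if code else 0
    # Round the bit count up to whole bytes; the padding zeros go on the left
    return n, value.to_bytes((n + 7) // 8, 'big')


def unpack_bits(n, payload):
    """Rebuild the original '0'/'1' string, leading zeros included."""
    if n == 0:
        return ''
    value = int.from_bytes(payload, 'big')
    return format(value, '0{}b'.format(n))


length, payload = pack_bits('000010110001')
print(length, payload)
print(unpack_bits(length, payload))
```

The stored length is what makes the round trip lossless: without it, the leading zeros and the left padding would be indistinguishable.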

How to compute a double precision float score from the first 8 bytes of a string in Python?

Trying to get a double-precision floating point score from a UTF-8 encoded string object in Python. The idea is to grab the first 8 bytes of the string and create a float, so that the strings, ordered by their score, would be ordered lexicographically according to their first 8 bytes (or possibly their first 63 bits, after forcing them all to be positive to avoid sign errors).
For example:
get_score(u'aaaaaaa') < get_score(u'aaaaaaab') < get_score(u'zzzzzzzz')
I have tried to compute the score in an integer using bit-shift-left and XOR, but I am not sure of how to translate that into a float value. I am also not sure if there is a better way to do this.
How should the score for a string be computed so the condition I specified before is met?
Edit: The string object is UTF-8 encoded (as per @Bakuriu's comment).
float won't give you 64 bits of precision. Use integers instead.
import struct

def get_score(s):
    return struct.unpack('>Q', (u'\0\0\0\0\0\0\0\0' + s[:8])[-8:])[0]
In Python 3:
import struct

def get_score(s):
    # 'strict' raises UnicodeEncodeError on non-ASCII input
    return struct.unpack('>Q', ('\0\0\0\0\0\0\0\0' + s[:8])[-8:].encode('ascii', 'strict'))[0]
EDIT:
For floats, with 6 characters:
def get_score(s):
    return struct.unpack('>d', (u'\0\1' + (u'\0\0\0\0\0\0\0\0' + s[:6])[-6:]).encode('ascii', 'strict'))[0]
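For comparison, here is a hedged variant of the integer score, assuming Python 3 and a str input. It differs from the snippets above in one deliberate way: short strings are padded with zero bytes on the right rather than the left, so that a string sorts before any longer string it is a prefix of and 'aa' sorts before 'b'. The name get_score_padded and the width parameter are hypothetical.

```python
def get_score_padded(s, width=8):
    """Big-endian integer built from the first `width` UTF-8 bytes of s."""
    raw = s.encode('utf-8')[:width]
    # Right-pad with zero bytes so shorter strings compare as smaller
    return int.from_bytes(raw.ljust(width, b'\0'), 'big')


words = ['zzzzzzzz', 'b', 'aaaaaaab', 'aa', 'aaaaaaa']
print(sorted(words, key=get_score_padded))
```

Because UTF-8 preserves code point order byte by byte, comparing these integers matches comparing the first 8 encoded bytes lexicographically.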
You will need to set up the entire alphabet and do the conversion by hand, since conversions to bases above 36 are not built in. To do that, you only need to define the complete alphabet to use. For an ASCII string, for instance, you would convert the input string to a long in base 256, using the whole ASCII table as the alphabet.
You have an example of the full functions to do it here: string to base 62 number
Also you don't need to worry about negative-positive numbers when doing this, since the encoding of the string with the first character in the alphabet will yield the minimum possible number in the representation, which is the negative value with the highest absolute value, in your case -2**63 which is the correct value and allows you to use < > against it.
Hope it helps!

Python, string, integer

I have a string variable:
str1 = '0000120000210000'
I want to convert the string into an integer without losing the first 4 zero characters. In other words, I want the integer variable to also store the first 4 zero digits as part of the integer.
I tried the int() function, but I'm not able to retain the first four digits.
You can use two integers, one to store the width of the number, and the other to store the number itself:
kw = len(str1)
k = int(str1)
To put the number back together in a string, use format:
print '{:0{width}}'.format(k, width=kw) # prints 0000120000210000
But, in general, you should not store identifiers (such as credit card numbers, student IDs, etc.) as integers, even if they appear to be. Numbers in these contexts should only be used if you need to do arithmetic, and you don't usually do arithmetic with identifiers.
What you want simply cannot be done. An integer value does not store leading zeros, because there could be any number of them, so there is no way to say how many to store.
But if you want to print it like that, that can be done by formatting the output.
EDIT:
Added @TimPietzcker's comment from the OP to make this a complete answer:
You should never store a number as an integer unless you're planning on doing arithmetic with it. In all other cases, it should be stored as a string.
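If the integer form really is needed, the width-plus-value idea from the answers above can be sketched as a round trip. This assumes Python 3 and a string of decimal digits; the helper names are hypothetical, and str.zfill is used here as an alternative to the format call.

```python
def to_int_with_width(text):
    """Split a digit string into (value, original_width)."""
    return int(text), len(text)


def to_padded_string(number, width):
    """Rebuild the digit string, restoring the leading zeros."""
    return str(number).zfill(width)


value, width = to_int_with_width('0000120000210000')
print(value, width)
print(to_padded_string(value, width))
```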

Get the x Least Significant Bits from a String in Python

How can I get the x LSBs from a string (str) in Python?
Specifically, I have a 256-bit string consisting of 32 chars, each occupying 1 byte, from which I need to get a "char" string holding the 50 least significant bits.
So here are the ingredients for an approach that works using strings (simple, but not the most efficient variant):
ord(x) yields the number (i.e. essentially the bits) for a char (e.g. ord('A') = 65). Note that ord here expects a character that really is one byte long (no special signs such as € or similar).
bin(x)[2:] creates a string representing the number x in binary. It drops leading zeros, so each result has to be padded to 8 digits with zfill(8).
Thus, we can do (mystr holds your string):
l = [bin(ord(x))[2:].zfill(8) for x in mystr]  # every character as an 8-digit binary number
bits = ""
for x in l:  # concatenate all bits
    bits = bits + x
bits[-50:]  # retrieve the last 50 bits
Note that this approach is surely not the most efficient one due to the heavy string operations that could be replaced by plain integer operations (using bit-shifts and such). However, it is the simplest variant that came to my mind.
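The integer-based alternative mentioned above might look roughly like this. It is a sketch that assumes Python 3, a bytes input, and that the last byte of the string holds the least significant bits (big-endian order). The function names are hypothetical.

```python
def low_bits(data, n):
    """Return the n least significant bits of a bytes object as an int."""
    value = int.from_bytes(data, 'big')  # whole string as one big-endian number
    return value & ((1 << n) - 1)        # keep only bits 0 .. n-1


def low_bits_as_bytes(data, n):
    """Same bits, repacked into the fewest whole bytes (zero-padded on the left)."""
    return low_bits(data, n).to_bytes((n + 7) // 8, 'big')


data = bytes(range(32))  # stand-in for a 256-bit string
print(low_bits_as_bytes(data, 50))
```

For n = 50 the result is 7 bytes long: 2 bits taken from the seventh-from-last byte followed by the last 6 bytes unchanged.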
I think that a possible answer could be this function (mystr holds my string):
def get_lsbs_str(mystr):
    chrlist = list(mystr)
    result1 = [chr(ord(chrlist[-7]) & 3)]
    result2 = chrlist[-6:]
    return "".join(result1 + result2)
This function takes the 2 LSBs of the seventh-from-last char of mystr (these are the 2 MSBs of the 50 LSBs),
then takes the last 6 characters of mystr (these are the 48 LSBs of the 50 LSBs).
Please let me know if I am in error.
If it's only for display, wouldn't slicing help?
yourString[14:63]
You can also use
yourString[-50:]
For more information, see here

How do I calculate the numeric value of a string with unicode components in python?

Along the lines of my previous question, How do I convert unicode characters to floats in Python? , I would like to find a more elegant solution to calculating the value of a string that contains unicode numeric values.
For example, take the strings "1⅕" and "1 ⅕". I would like these to resolve to 1.2
I know that I can iterate through the string by character, check for unicodedata.category(x) == "No" on each character, and convert the unicode characters by unicodedata.numeric(x). I would then have to split the string and sum the values. However, this seems rather hacky and unstable. Is there a more elegant solution for this in Python?
I think this is what you want...
import unicodedata
def eval_unicode(s):
    # sum all the unicode fractions
    u = sum(map(unicodedata.numeric, filter(lambda x: unicodedata.category(x) == "No", s)))
    # eval the regular digits (with optional dot) as a float, or default to 0
    n = float("".join(filter(lambda x: x.isdigit() or x == ".", s)) or 0)
    return n + u
or the "comprehensive" solution, for those who prefer that style:
import unicodedata
def eval_unicode(s):
    # sum all the unicode fractions
    u = sum(unicodedata.numeric(i) for i in s if unicodedata.category(i) == "No")
    # eval the regular digits (with optional dot) as a float, or default to 0
    n = float("".join(i for i in s if i.isdigit() or i == ".") or 0)
    return n + u
But beware, there are many unicode characters that seem to have no numeric value assigned in Python (for example, ⅜ and ⅝ don't work... or maybe it's just a problem with my keyboard xD).
Another note on the implementation: it's "too robust". It will work even with malformed numbers like "123½3 ½" and will evaluate that to 1234.0, but it won't work if there is more than one dot.
>>> import unicodedata
>>> b = '10 ⅕'
>>> int(b[:-1]) + unicodedata.numeric(b[-1])
10.2
def convert_dubious_strings(s):
    try:
        return int(s)
    except ValueError:  # includes UnicodeEncodeError, which is a subclass of ValueError
        return int(s[:-1]) + unicodedata.numeric(s[-1])
and if the string might have no integer part, then another try-except sub-block needs to be added.
This might be sufficient for you, depending on the strange edge cases you want to deal with:
val = 0
for c in my_unicode_string:
    if unicodedata.category(c) == 'No':
        cval = unicodedata.numeric(c)
    elif c.isdigit():
        cval = int(c)
    else:
        continue
    if cval == int(cval):
        val *= 10
    val += cval
print val
Whole digits are assumed to be another digit in the number, fractional characters are assumed to be fractions to add to the number. Doesn't do the right thing with spaces between digits, repeated fractions, etc.
I think you'll need a regular expression, explicitly listing the characters that you want to support. Not all numerical characters are suitable for the kind of composition that you envision - for example, what should be the numerical value of
u"4\N{CIRCLED NUMBER FORTY TWO}2\N{SUPERSCRIPT SIX}"
???
Do
for i in range(65536):
    if unicodedata.category(unichr(i)) == 'No':
        print hex(i), unicodedata.name(unichr(i))
and go through the list defining which ones you really want to support.
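A regular expression along those lines could be sketched as follows, assuming Python 3. The supported set is an assumption chosen for illustration: an optional run of ASCII digits followed by at most one vulgar fraction character (U+00BC to U+00BE and U+2150 to U+215E), with optional whitespace around them. The names _MIXED and parse_mixed are hypothetical.

```python
import re
import unicodedata

# Optional ASCII integer part, then at most one vulgar fraction character
_MIXED = re.compile(r'^\s*([0-9]+)?\s*([\u00bc-\u00be\u2150-\u215e])?\s*$')


def parse_mixed(s):
    """Parse strings such as '1⅕' or '1 ⅕'; reject everything else."""
    m = _MIXED.match(s)
    if m is None or not (m.group(1) or m.group(2)):
        raise ValueError('unsupported number: %r' % s)
    whole = int(m.group(1)) if m.group(1) else 0
    frac = unicodedata.numeric(m.group(2)) if m.group(2) else 0.0
    return whole + frac


print(parse_mixed('1 \u2155'))
```

Listing the accepted characters explicitly means that inputs like circled numbers or superscripts raise ValueError instead of being silently folded into the result.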
