using numpy to convert an int to an array of bits - python

I need to convert 20 million 32- and 64-bit integers into their corresponding bit arrays, so the approach has to be memory- and time-efficient. On advice from a different question/answer here on SO, I'm attempting to do this with numpy.unpackbits. While experimenting with it I ran into an unexpected result:
np.unpackbits(np.array([1], dtype=np.uint64).view(np.uint8))
produces:
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
I would expect the 1 to be the last element, not somewhere in the middle. So I'm obviously missing something that preserves the byte order. What am I missing?

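np.unpackbits unpacks each byte most-significant bit first, but on a little-endian machine the least significant byte of the integer is stored first, so the 1 lands at index 7. Try a big-endian dtype such as '>i8' ('>u8' is the unsigned equivalent), like so: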
In [6]: np.unpackbits(np.array([1], dtype='>i8').view(np.uint8))
Out[6]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=uint8)
Reference:
http://docs.scipy.org/doc/numpy/user/basics.byteswapping.html
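For the original goal of converting many integers at once, the same big-endian trick can be applied to a whole array. The following is a minimal sketch, not a benchmarked solution: the helper name to_bits and its width parameter are made up for illustration, and the input is assumed to be a one-dimensional sequence of non-negative integers that fit in the chosen width.

```python
import numpy as np

def to_bits(values, width=64):
    """Return an (n, width) uint8 array of bits, most significant bit first.

    Hypothetical helper: `values` is assumed to be a 1-D sequence of
    non-negative integers that fit in `width` bits (32 or 64).
    """
    big_endian = {32: ">u4", 64: ">u8"}[width]
    # Force big-endian storage so the most significant byte comes first.
    as_bytes = np.asarray(values).astype(big_endian).view(np.uint8)
    # One row of width // 8 bytes per integer, then expand each byte to 8 bits.
    return np.unpackbits(as_bytes.reshape(-1, width // 8), axis=1)

bits = to_bits([1, 5], width=32)
print(bits.shape)    # (2, 32)
print(bits[1, -3:])  # [1 0 1]
```

Note that the result uses one byte per bit, so 20 million 64-bit integers expand to about 1.28 GB.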

Indexing 2 dimensional array in python

I've been trying to change a single item in a 2-dimensional array in Python using the syntax x[2][3]=1, but instead of changing only that one item, it changes the same column in every row. My code is below:
population = [[0]*20]*5
population[2][3] = 1
for row in population:
    print(row)
This outputs
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
but I only want
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
How would I index the item such that it only changes the 2nd row and 3rd column?
I'm using python 3.7.4 on repl.it
Link here: https://repl.it/#ajqe/2d-array-test
Use:
population = [[0]*20 for _ in range(5)]
to generate the lists instead. [[0]*20]*5 builds one inner list and then stores five references to that same object, instead of creating 5 separate lists. To check this you can use the is operator:
>>> population = [[0]*20]*5
>>> population[0] is population[1]
True
>>> population = [[0]*20 for _ in range(5)]
>>> population[0] is population[1]
False
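The fix can be wrapped in a small helper and checked against the behaviour the question asks for. This is only a sketch: the name make_grid and its parameters are made up for illustration.

```python
def make_grid(rows, cols, fill=0):
    """Build a rows x cols list of lists where every row is a separate list."""
    return [[fill] * cols for _ in range(rows)]

population = make_grid(5, 20)
population[2][3] = 1

# Collect the indices of the rows whose item at index 3 was changed.
changed_rows = [i for i, row in enumerate(population) if row[3] == 1]
print(changed_rows)  # [2]
```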

Compute Cosine Similarity Within Groups

I have a dataframe that consists of rows like the following. My goal is to compute the cosine similarity of every row with every other row in the same category, so that I end up with a dataframe with 3 columns: category, vecs, and dist, where dist is an n-length array containing the distance between that row and every row in the same category.
category vecs
0 a [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
1 a [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
2 b [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
3 b [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
The inefficient solution that I've thought of would be to loop through each row, check whether the category matches, and then either compute the distance and add it to a list or continue the loop. This solution would be n^2, though, and I'm looking for something more efficient. I have 8115 rows in this dataframe and am looking for something that could scale to even larger datasets.
The other possible solution I've looked at would be using sklearn pairwise distance (metric = cosine) and somehow only include computations with same categories, but I'm struggling to think about how to do this.
Would someone be willing to help or suggest a different efficient solution?
You need to do (more or less) n(n-1)/2 computations for a group of n rows.
This is irreducible: unless the vectors have some hidden structure, each pairwise similarity has to be computed somehow.
You can use scipy's pdist to compute the pairwise distances, and the squareform function to expand its flattened triangular output back into a regular symmetric matrix:
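import numpy as np
from scipy.spatial.distance import pdist, squareform

similarities = dict()
for cat, group in df.groupby("category"):
    # Stack this category's vectors into a 2-D array, one row per vector.
    a = tuple(row.vecs for _, row in group.iterrows())
    b = np.array(a)
    # Cosine similarity = 1 - cosine distance; squareform leaves the diagonal at 0.
    sim_mat = squareform(1 - pdist(b, metric='cosine'))
    similarities[cat] = sim_mat
for k, v in similarities.items():
    print(k, v, sep='\n')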
a
[[0. 1.]
[1. 0.]]
b
[[0. 0.70710678]
[0.70710678 0. ]]
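As an alternative to pdist, the per-group similarity matrix can also be obtained by scaling every vector to unit length and taking a matrix product. The sketch below swaps in that technique and uses only numpy; the function name cosine_within_groups and the toy data are made up for illustration, and it assumes no vector is all zeros. Unlike the squareform result above, the diagonal here is 1 (each row compared with itself).

```python
import numpy as np

def cosine_within_groups(categories, vectors):
    """Map each category to the cosine-similarity matrix of its own rows.

    Hypothetical helper: `categories` is a list of labels, `vectors` a
    2-D array-like with one row per label and no all-zero rows.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    result = {}
    for cat in sorted(set(categories)):
        mask = np.array([c == cat for c in categories])
        block = unit[mask]
        result[cat] = block @ block.T
    return result

cats = ["a", "a", "b", "b"]
vecs = [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
sims = cosine_within_groups(cats, vecs)
print(np.round(sims["b"], 4))
```

The work is still quadratic within each group, but it runs as one matrix product per category instead of a Python-level loop over row pairs.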

How to get frequency of Unicode strings in file

This is my code to count the number of times a word occurs in a file (all entries are in Unicode):
Text_file = open("Mytext.txt", 'r').read()
Wordlist = {'മാന്നാര്‍':[], 'മാന്‍':[]}
for line in Text_file:
    for word in Wordlist.keys():
        Wordlist[word].append(line.count(word))
My expected result is
'മാന്നാര്‍' _ 5
മാന്‍ _ 1
My_text =
കുരുവികളോട്‌ കൂട്ട്‌ കൂടാന്‍ … മട്ടാഞ്ചേരി കുരുവികളോടൊത്ത്‌ കൂട്ടുകൂടാനും സംരക്ഷിക്കുവാനും കുരുന്നുമനസ്സുകളില്‍ ബോധമുണര്‍ത്താന്‍ ജെയിന്‍ ഫൗണ്ടേഷന്‍ രംഗത്ത്‌ ലോക കുരുവി ദിനമായ ഇന്നലെ കുരുന്നുകള്‍ക്ക്‌ കുരുവിക്കൂടും കുടിവെള്ളപാത്രവും നല്‍കിക്കൊണ്ടാണ്‌ ഫൗണ്ടേഷന്‍ പക്ഷി-മൃഗാദി പരിശീലന പദ്ധതി നടപ്പിലാക്കുന്നത്‌ സ്ക്കൂളുകള്‍ ലൈബ്രറികള്‍ എന്നിവ കേന്ദ്രീകരിച്ചാണ്‌ ഫൗണ്ടേഷന്‍ പദ്ധതി നടപ്പിലാക്കുന്നത്‌ കുരുവികളെ സംരക്ഷിക്കുന്നതിനും പരിചരിക്കുന്നതിനുമായി പരിസ്ഥിതി സൗഹൃദമായ മണ്‍കുടങ്ങളാണ്‌ ഫൗണ്ടേഷന്‍ സമ്മാനിച്ചത്‌ വേനല്‍കാല ചൂടില്‍ ദാഹമകറ്റുന്നതിന്‌ മണ്‍കലങ്ങളും ഇതിനോടൊപ്പം നല്‍കുകയും ചെയ്തു
ലോകകുരുവി ദിനത്തില്‍ നടന്ന കുരുവികള്‍ക്ക്‌ കൂടൊരുക്കാം പരിപാടിയില്‍ വിദേശികളും സ്വദേശികളും സാക്ഷികളായി ഫോര്‍ട്ടുകൊച്ചിയിലെ സെന്റ്‌ മാര്‍ക്കസ്‌ സ്ക്കൂളിലെ കുട്ടികള്‍ക്കാണ്‌ ഫൗണ്ടേഷന്‍ കുരുവിക്കൂടുകള്‍ നല്‍കിയത്‌ ജൈന്‍ ഫൗണ്ടേഷന്‍ ജനമൈത്രി പോലീസ്‌ സെന്റ്മാര്‍ക്കസ്‌ സ്ക്കൂള്‍ എന്നിവരുമായി കൈകോര്‍ത്ത്‌ സംഘടിപ്പിച്ച പരിപാടിയില്‍ ജനമൈത്രി പോലീസ്‌ സി ആര്‍ ഒ പി യു ഹരിദാസ്‌ സ്ക്കൂള്‍ പ്രിന്‍സിപ്പല്‍ ഹേറിന്‍ ഫെര്‍ണാണ്ടസിന്‌ നല്‍കി പദ്ധതി ഉദ്ഘാടനം ചെയ്തു ഫൗണ്ടേഷന്‍ ഭാരവാഹി മുകേഷ്‌ ജെയിന്‍ ശാന്തി മേനോന്‍ പ്രിയ കെനറ്റ്‌ എം എം സലീം സുധി എന്നിവര്‍ സംസാരിച്ചു
But I am getting
{'\xe0\xb4\xae\xe0\xb4\xbe\xe0\xb4\xa8\xe0\xb5\x8d\xe0\xb4\xa8\xe0\xb4\xbe\xe0\xb4\xb0\xe0\xb5\x8d\xe2\x80\x8d': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'മാന്‍': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
What is the error here ?
Your script file must be saved in a Unicode encoding, and Python must decode the input file using the file's actual encoding (UTF-8, UTF-16, or whatever it is). For example,
import codecs
f = codecs.open('Mytext.txt', encoding='utf-16')
for line in f:
    print(repr(line))
See http://docs.python.org/2/howto/unicode.html
Apart from that, your dictionary should map each counted string to its count, not to a list, as in:
Wordlist = {'മാന്നാര്‍':0, 'മാന്‍':0}
When you need to increment the dictionary entry:
Wordlist['മാന്നാര്‍'] += 1
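Putting these pieces together in Python 3 might look like the sketch below. It is illustrative only: the function name count_occurrences is made up, the two target words are taken from the sample text rather than from the question, and an in-memory io.StringIO stands in for the file (for a real file, the "utf-8" encoding in the comment is an assumption). Note also that iterating over the result of read() walks the text one character at a time rather than line by line, which is likely why the question's lists contain only zeros; iterating over the file object itself yields lines.

```python
import io

def count_occurrences(lines, targets):
    """Count how often each target string occurs across an iterable of lines."""
    counts = {word: 0 for word in targets}
    for line in lines:
        for word in targets:
            # str.count counts non-overlapping substring matches in the line.
            counts[word] += line.count(word)
    return counts

# Illustrative targets taken from the sample text, not the asker's words.
targets = ["കുരുവി", "പദ്ധതി"]
# In-memory stand-in for open("Mytext.txt", encoding="utf-8").
sample = io.StringIO("{0} {1}\n{1} {0} {1}\n".format(*targets))
counts = count_occurrences(sample, targets)
print(counts[targets[0]], counts[targets[1]])  # 2 3
```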

How to process pygame.key.get_pressed() to receive input

pygame.key.get_pressed() returns a long sequence of 0s and 1s, one entry for each key that pygame can map (sample below). Is there a straightforward way to extract the letter representation of the key pressed?
I'm trying to avoid a long chain of if statements testing whether K_a, K_b, etc. is pressed. Is there a way to process the 1s and 0s below?
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)
It looks like a number in binary representation, so you could convert it into an integer and use bitwise AND to compare it with a mask that represents the keys you need. I do not know if that is worth doing.
To test several keys at once (for example h, e, l, o) you can use
import pygame
from pygame.locals import K_h, K_e, K_l, K_o
pressed = pygame.key.get_pressed()
if all(pressed[x] for x in (K_h, K_e, K_l, K_o)):
    print("all keys are pressed: h, e, l, o")
if any(pressed[x] for x in (K_h, K_e, K_l, K_o)):
    print("at least one key is pressed: h, e, l, o")
You can turn it into a function
def test_all_keys(list_of_keys, pressed):
    return all(pressed[x] for x in list_of_keys)
if test_all_keys((K_h, K_e, K_l, K_o), pressed):
    print("all keys are pressed: h, e, l, o")
If you need a list of the pressed keys:
list_of_pressed = [i for i in range(len(pressed)) if pressed[i]]
if K_a in list_of_pressed:
    print("key 'a' was pressed")
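To get letter names without a chain of if statements, the indices of the pressed keys can be mapped to names. This is a minimal sketch under stated assumptions: the helper pressed_key_names is made up for illustration, and the demo uses a hand-built list and chr as stand-ins so that it runs without pygame. In a pygame program you would presumably pass pygame.key.get_pressed(), a range such as range(pygame.K_a, pygame.K_z + 1), and pygame.key.name instead.

```python
def pressed_key_names(pressed, key_codes, name_of):
    """Return the names of the keys in key_codes that are currently down.

    Hypothetical helper: `pressed` is any sequence indexable by key code,
    `name_of` turns a key code into a readable name.
    """
    return [name_of(code) for code in key_codes if pressed[code]]

# Stand-in for pygame.key.get_pressed(): 'h' (104) and 'o' (111) are held down.
fake_pressed = [0] * 323
fake_pressed[ord("h")] = 1
fake_pressed[ord("o")] = 1

letters = range(ord("a"), ord("z") + 1)
names = pressed_key_names(fake_pressed, letters, chr)
print(names)  # ['h', 'o']
```

Restricting key_codes to the keys of interest keeps the lookup independent of how long the pressed sequence is.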
