Take non-zero elements in a macro-list - python

I have a problem with the instruction np.nonzero() in python. I want to take all the indices of a given list that are non zero. So, consider that I have the following code:
import numpy as np
from scipy.special import binom
M=4
N=3
def generate(N,nb):
states = np.zeros((int(binom(nb+N-1, nb)), N), dtype=int)
states[0, 0]=nb
ni = 0 # init
for i in xrange(1, states.shape[0]):
states[i,:N-1] = states[i-1, :N-1]
states[i,ni] -= 1
states[i,ni+1] += 1+states[i-1, N-1]
if ni >= N-2:
if np.any(states[i, :N-1]):
ni = np.nonzero(states[i, :N-1])[0][-1]
else:
ni += 1
return states
base = generate(M,N)
The result of base is given by:
base = [[3 0 0 0]
[2 1 0 0]
[2 0 1 0]
[2 0 0 1]
[1 2 0 0]
[1 1 1 0]
[1 1 0 1]
[1 0 2 0]
[1 0 1 1]
[1 0 0 2]
[0 3 0 0]
[0 2 1 0]
[0 2 0 1]
[0 1 2 0]
[0 1 1 1]
[0 1 0 2]
[0 0 3 0]
[0 0 2 1]
[0 0 1 2]
[0 0 0 3]]
The point is that for a given index j,k I want to take all the items in base that has non-zero components in the sites j,k, for example:
Taking j=0,k=1 I have to obtain:
result = [1 4 5 6]
which corresponds to the elements 1,4,5,6 of base that satisfies this condition. On the other hand, I have used the command:
np.nonzero((base[:, j]) & (base[:, k]))[0]
but it doesn't work correctly, any idea why?

First of all, the syntax for list index base[:, j] is wrong, use : [:][j] instead
also:
np.nonzero((base[:, j]) & (base[:, k]))[0]
won't work ,because the & sign is not applicable here..
you could use numpy like this:
b = np.array(base);
j=0;k=1;
np.nonzero(b.T[j]* b.T[k])[0]
which will give:
array([1, 4, 5, 6])

Related

one hot encode with pandas get_dummies missing values

I have a dataset in the form of a DataFrame and each row has a label ranging from 1-5. I am doing a one hot encode using pd.get_dummies(). If my dataset has all 5 labels there is not problem. However not all sets contain all 5 numbers so the encode just skips the missing value and creates a problem for new datasets coming in. Can I set a range so that the one hot encode knows there should be 5 labels? Or would I have to append 1,2,3,4,5 to the end of the array before I perform the encode and then delete the last 5 entries?
Correct encode: values 1-5 are encoded
arr = np.array([1,2,5,3,1,5,1,4])
df = pd.DataFrame(arr, columns = ['test'])
hotarr = np.array(pd.get_dummies(df['test']))
>>>[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 0 0 1 0]]
Missing value encode: this dataset is missing label 4.
arr = np.array([1,2,5,3,1,5,1,])
df = pd.DataFrame(arr, columns = ['test'])
hotarr = np.array(pd.get_dummies(df['test']))
>>>[[1 0 0 0]
[0 1 0 0]
[0 0 0 1]
[0 0 1 0]
[1 0 0 0]
[0 0 0 1]
[1 0 0 0]]
Set up the CategoricalDtype before encoding to ensure all categories are represented when getting dummies:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 5, 3, 1, 5, 1])
df = pd.DataFrame(arr, columns=['test'])
# Setup Categorical Dtype
df['test'] = df['test'].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
hotarr = np.array(pd.get_dummies(df['test']))
print(hotarr)
Alternatively can reindex after get_dummies with fill_value=0 to add the missing columns:
hotarr = np.array(pd.get_dummies(df['test'])
.reindex(columns=[1, 2, 3, 4, 5], fill_value=0))
Both produce hotarr with 5 columns even though input does not contain 4:
[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]]

Use Python Loop to create NX3 matrix

I am quite new to Python. The ongoing project involves creating a N X 3 matrix from a set of numbers, numbers∈[0,1,2] where each row is to be filled by a combination of these numbers, like [0,0,0], [0,0,1], [0,0,2]...[2,2,2].
The code is:
import numpy as np
numbers = []
x = 0
while x < 2:
y = 0
while y < 2:
z = 0
while z < 2:
numbers.append((x,y,z))
z += 1
numbers.append((x,y,z))
y += 1
numbers.append((x,y,z))
x += 1
print(np.asarray(numbers))
But the output is only:
[[0 0 0]
[0 0 1]
[0 0 2]
[0 1 0]
[0 1 1]
[0 1 2]
[0 2 2]
[1 0 0]
[1 0 1]
[1 0 2]
[1 1 0]
[1 1 1]
[1 1 2]
[1 2 2]]
It should contain 27 rows. It can also be done using itertools.product though. But how can the code be rewritten to get all the rows?
The quicker and easy to read code is:
result = np.array(list(itertools.product([0,1,2], repeat=3)))
(don't re-invent the wheel if you can use a library function).
But if you insist on your 3 nested loops, run:
numbers = []
for x in range(3):
for y in range(3):
for z in range(3):
numbers.append((x,y,z))
result = np.array(numbers)
The errors in your code are that:
it should contain only the most inner numbers.append((x,y,z)),
all 3 loops should be while ... <= 2:
But my code is shorter.

Python: get all the possible combinations for allocating x apples to y baskets subject to constraint

Suppose we have x apples and y baskets, we want to allocate all the apples to baskets such that each basket at most get z apples. How to write Python codes to get all possible combinations.
For small number of y, I can just loop with respect to y as follows (x=5, y=3, z=2).
all_chances = np.zeros((0,3))
for a in range(3):
for b in range(3):
for c in range(3):
if a+b+c == 5:
all_chances = np.vstack((all_chances, np.array([a,b,c])))
Basically, all_chances are
array([[1., 2., 2.],
[2., 1., 2.],
[2., 2., 1.]])
My question is: what if y is a large number, like x = 30, y = 26, z=2? Do I need to loop 26 times?
I messed around with your question... tried implementing a sort of tree-based approach bc I thought it'd be clever, but my laptop chokes on it. I was curious how many permutations we're looking for with these large numbers anyway, and changed the problem (for myself) to simply counting the permutations to see if it was even doable on a light-weight laptop.
I get 154,135,675,070 unique permutations.
To get started... I messed around with itertools, and permutations took forever with lists of length 26. So... to remind myself of the long-forgotten formula to at least count distinct permutations, I found this... https://socratic.org/questions/how-many-distinct-permutations-can-be-made-from-the-letters-of-the-word-infinity
With that I ran the following to get a count. It runs in under a second.
from numpy import prod
from math import factorial
import itertools
# number of unique permutations
def count_distinct_permutations(tup):
value_counts = [len(list(grp)) for _, grp in itertools.groupby(tup)]
return factorial(sum(value_counts)) / prod([float(factorial(x)) for x in value_counts])
# starting values
x = 30 # apples
y = 26 # baskets
z = 3 # max per basket
# count possible results
result = 0
for combos in itertools.combinations_with_replacement(range(z), y):
if sum(combos) == x:
result += count_distinct_permutations(combos)
Now... this obviously does NOT answer your specific question. Honestly I couldn't hold the result you're looking for in memory anyway. But... you can make some inferences with this... with your chosen values, there's only 12 combinations of values, but between 15k and 50 million permutations of each combination.
You could look at each combination... in the count_distinct_permutations() function, itertools.groupby feeds you how many of each number from (0,1,2) is in the combination, and you could work with each of those twelve results to infer some stuff. Not sure what, but I also am not quite sure what to do with 154 billion lists of length 26. :)
Hope there was something useful here, even if it didn't answer your exact question. Good luck!
Here is a method based on Young diagrams. For example, 4 baskets, 6 eggs, max 3 eggs per basket. If we order the baskets by how full they are we get Young diagrams.
x x x x x x x x x x x x x x x x
x x x x x x x x x x
x x x x
The code below enumerates all possible Young diagrams and for each enumerates all possible permutations.
The same logic can also be used to count.
from itertools import product, combinations
from functools import lru_cache
import numpy as np
def enum_ord_part(h, w, n, o=0):
if h == 1:
d = n
for idx in combinations(range(w), d):
idx = np.array(idx, int)
out = np.full(w, o)
out[idx] = o+1
yield out
else:
for d in range((n-1)//h+1, min(w, n) + 1):
for idx, higher in product(combinations(range(w), d),
enum_ord_part(h-1, d, n-d, o+1)):
idx = np.array(idx)
out = np.full(w, o)
out[idx] = higher
yield out
def bc(n, k):
if 2*k > n:
k = n-k
return np.prod(np.arange(n-k+1, n+1, dtype='O')) // np.prod(np.arange(1, k+1, dtype='O'))
#lru_cache(None)
def count_ord_part(h, w, n):
if h == 1:
return bc(w, n)
else:
return sum(bc(w, d) * count_ord_part(h-1, d, n-d)
for d in range((n-1)//h+1, min(w, n) + 1))
Few examples:
>>> for i, l in enumerate(enum_ord_part(3, 4, 6), 1):
... print(l, end=' ' if i % 8 else '\n')
...
[3 3 0 0] [3 0 3 0] [3 0 0 3] [0 3 3 0] [0 3 0 3] [0 0 3 3] [3 2 1 0] [2 3 1 0]
[3 1 2 0] [2 1 3 0] [1 3 2 0] [1 2 3 0] [2 2 2 0] [3 2 0 1] [2 3 0 1] [3 1 0 2]
[2 1 0 3] [1 3 0 2] [1 2 0 3] [2 2 0 2] [3 0 2 1] [2 0 3 1] [3 0 1 2] [2 0 1 3]
[1 0 3 2] [1 0 2 3] [2 0 2 2] [0 3 2 1] [0 2 3 1] [0 3 1 2] [0 2 1 3] [0 1 3 2]
[0 1 2 3] [0 2 2 2] [3 1 1 1] [1 3 1 1] [1 1 3 1] [1 1 1 3] [2 2 1 1] [2 1 2 1]
[2 1 1 2] [1 2 2 1] [1 2 1 2] [1 1 2 2]
>>>
>>> print(f'{count_ord_part(2, 26, 30):,}')
154,135,675,070
>>> print(f'{count_ord_part(50, 30, 1000):,}')
63,731,848,167,716,295,344,627,252,024,129,873,636,437,590,711

How to find longest consecutive ocurrence of non-zero elements in 2D numpy array

I am simulating protein folding on a 2D grid where every angle is either ±90° or 0°, and have the following problem:
I have an n-by-n numpy array filled with zeros, except for certain places where the value is any integer from 1 to n. Every integer appears just once. Integer k is always a nearest neighbour to k-1 and k + 1, except for the endpoints. The array is saved as an object in the class Grid which I have created for doing energy calculations and folding the protein. Example array, with n=5:
>>> from Grid import Grid
>>> a = Grid(5)
>>> a.show()
[[0 0 0 0 0]
[0 0 0 0 0]
[1 2 3 4 5]
[0 0 0 0 0]
[0 0 0 0 0]]
My goal is to find the longest consecutive line of non-zero elements withouth any bends. In the above case, the result should be 5.
My idea so far are something like this:
def getDiameter(self):
indexes = np.zeros((self.n, 2))
for i in range(1, self.n + 1):
indexes[i - 1] = np.argwhere(self.array == i)[0]
for i in range(self.n):
j = 1
currentDiameter = 1
while indexes[0][i] == indexes[0][i + j] and i + j <= self.n:
currentDiameter += 1
j += 1
while indexes[i][0] == indexes[i + j][0] and i + j <= self.n:
currentDiameter += 1
j += 1
if currentDiameter > diameter:
diameter = currentDiameter
return diameter
This has two problems: (1) it doesn't work, and (2) it is horribly inefficient if I get it to work. I am wondering if anybody has a better way of doing this. If anything is unclear, please let me know.
Edit:
Less trivial example
[[ 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 10 0 0 0]
[ 0 0 0 0 0 0 9 0 0 0]
[ 0 0 0 0 0 0 8 0 0 0]
[ 0 0 0 4 5 6 7 0 0 0]
[ 0 0 0 3 0 0 0 0 0 0]
[ 0 0 0 2 1 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0]]
The correct answer here is 4 (both the longest column and the longest row have four non-zero elements).
What I understood from your question is you need to find the length of longest occurance of consecutive elements in numpy array (row by row).
So for this below one, the output should be 5:
[[1 2 3 4 0]
[0 0 0 0 0]
[10 11 12 13 14]
[0 1 2 3 0]
[1 0 0 0 0]]
Because [10 11 12 13 14] are consecutive elements and they have the longest length comparing to any consecutive elements in any other row.
If this is what you are expecting, consider this:
import numpy as np
from itertools import groupby
a = np.array([[1, 2, 3, 4, 0],
[0, 0, 0, 0, 0],
[10, 11, 12, 13, 14],
[0, 1, 2, 3, 0],
[1, 0, 0, 0, 0]])
a = a.astype(float)
a[a == 0] = np.nan
b = np.diff(a) # Calculate the n-th discrete difference. Consecutive numbers will have a difference of 1.
counter = []
for line in b: # for each row.
if 1 in line: # consecutive elements differ by 1.
counter.append(max(sum(1 for _ in g) for k, g in groupby(line) if k == 1) + 1) # find the longest length of consecutive 1's for each row.
print(max(counter)) # find the max of list holding the longest length of consecutive 1's for each row.
# 5
For your particular example:
[[0 0 0 0 0]
[0 0 0 0 0]
[1 2 3 4 5]
[0 0 0 0 0]
[0 0 0 0 0]]
# 5
Start by finding the longest consecutive occurrence in a list:
def find_longest(l):
counter = 0
counters =[]
for i in l:
if i == 0:
counters.append(counter)
counter = 0
else:
counter += 1
counters.append(counter)
return max(counters)
now you can apply this function to each row and each column of the array, and find the maximum:
longest_occurrences = [find_longest(row) for row in a] + [find_longest(col) for col in a.T]
longest_occurrence = max(longest_occurrences)

quickly calculate randomized 3D numpy array from 2D numpy array

I have a 2-dimensional array of integers, we'll call it "A".
I want to create a 3-dimensional array "B" of all 1s and 0s such that:
for any fixed (i,j) sum(B[i,j,:])==A[i.j], that is, B[i,j,:] contains A[i,j] 1s in it
the 1s are randomly placed in the 3rd dimension.
I know how I would do this using standard python indexing but this turns out to be very slow.
I am looking for a way to do this that takes advantage of the features that can make Numpy fast.
Here is how I would do it using standard indexing:
B=np.zeros((X,Y,Z))
indexoptions=range(Z)
for i in xrange(Y):
for j in xrange(X):
replacedindices=np.random.choice(indexoptions,size=A[i,j],replace=False)
B[i,j,[replacedindices]]=1
Can someone please explain how I can do this in a faster way?
Edit: Here is an example "A":
A=np.array([[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4]])
in this case X=Y=5 and Z>=5
Essentially the same idea as #JohnZwinck and #DSM, but with a shuffle function for shuffling a given axis:
import numpy as np
def shuffle(a, axis=-1):
"""
Shuffle `a` in-place along the given axis.
Apply numpy.random.shuffle to the given axis of `a`.
Each one-dimensional slice is shuffled independently.
"""
b = a.swapaxes(axis,-1)
# Shuffle `b` in-place along the last axis. `b` is a view of `a`,
# so `a` is shuffled in place, too.
shp = b.shape[:-1]
for ndx in np.ndindex(shp):
np.random.shuffle(b[ndx])
return
def random_bits(a, n):
b = (a[..., np.newaxis] > np.arange(n)).astype(int)
shuffle(b)
return b
if __name__ == "__main__":
np.random.seed(12345)
A = np.random.randint(0, 5, size=(3,4))
Z = 6
B = random_bits(A, Z)
print "A:"
print A
print "B:"
print B
Output:
A:
[[2 1 4 1]
[2 1 1 3]
[1 3 0 2]]
B:
[[[1 0 0 0 0 1]
[0 1 0 0 0 0]
[0 1 1 1 1 0]
[0 0 0 1 0 0]]
[[0 1 0 1 0 0]
[0 0 0 1 0 0]
[0 0 1 0 0 0]
[1 0 1 0 1 0]]
[[0 0 0 0 0 1]
[0 0 1 1 1 0]
[0 0 0 0 0 0]
[0 0 1 0 1 0]]]

Categories

Resources