Let's say we have a rank-2 array a with n entries that contain integer values in {0,1,2,...,m}. For each of those integers I want to find the indices of the entries of a with this value (called index_i, index_j in the following examples). (So what I'm looking for is like np.unique(..., return_index=True), but for 2D arrays and with the possibility of returning all indices of each unique value.)
A naive approach would use boolean indexing, which results in O(m*n) operations (see below), but I'd like to have only O(n) operations. While I found a solution that does that, I feel like there should be a built-in method or at least something that simplifies this, or that would at least remove these ugly loops:
import numpy as np
a = np.array([[0,0,1],[0,2,1],[2,2,1]])
m = a.max()
#"naive" in O(n*m)
i,j = np.mgrid[range(a.shape[0]), range(a.shape[1])]
index_i = [[] for _ in range(m+1)]
index_j = [[] for _ in range(m+1)]
for k in range(m+1):
    index_i[k] = i[a==k]
    index_j[k] = j[a==k]
#all the zeros:
print(a[index_i[0], index_j[0]])
#all the ones:
print(a[index_i[1], index_j[1]])
#all the twos:
print(a[index_i[2], index_j[2]])
#"sophisticated" in O(n)
index_i = [[] for _ in range(m+1)]
index_j = [[] for _ in range(m+1)]
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        index_i[a[i,j]].append(i)
        index_j[a[i,j]].append(j)
#all the zeros:
print(a[index_i[0], index_j[0]])
#all the ones:
print(a[index_i[1], index_j[1]])
#all the twos:
print(a[index_i[2], index_j[2]])
(Note that I will need these indices for write access later, that is, for replacing the values stored in the array. But between these operations I do need the 2D structure.)
Here's one based on sorting, with the intention of doing minimal work when iterating to build a dictionary whose keys are the unique elements and whose values are the indices -
shp = a.shape
idx = a.ravel().argsort()
idx_sorted = np.c_[np.unravel_index(idx,shp)]
count = np.bincount(a.ravel())
valid_idx = np.flatnonzero(count!=0)
cs = np.r_[0,count[valid_idx].cumsum()]
out = {e:idx_sorted[i:j] for (e,i,j) in zip(valid_idx,cs[:-1],cs[1:])}
Sample input, output -
In [155]: a
Out[155]:
array([[0, 2, 6],
       [0, 2, 6],
       [2, 2, 1]])
In [156]: out
Out[156]:
{0: array([[0, 0],
        [1, 0]]), 1: array([[2, 2]]), 2: array([[0, 1],
        [1, 1],
        [2, 0],
        [2, 1]]), 6: array([[0, 2],
        [1, 2]])}
If every integer from 0 up to the maximum occurs in the array, we could simplify it a bit -
shp = a.shape
idx = a.ravel().argsort()
idx_sorted = np.c_[np.unravel_index(idx,shp)]
cs = np.r_[0,np.bincount(a.ravel()).cumsum()]
out = {iterID:idx_sorted[i:j] for iterID,(i,j) in enumerate(zip(cs[:-1],cs[1:]))}
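As a hedged illustration of the sorting approach above, here is a self-contained sketch wrapped in a function and applied to the array from the question, including the write access the question asks about. The name group_indices is hypothetical, and kind="stable" is an added assumption so that positions within each group stay in row-major order. Note that sorting costs O(n log n) rather than the O(n) asked for, although the work happens inside NumPy.

```python
import numpy as np

def group_indices(a):
    """Map each value present in `a` to a (k, 2) array of its (row, col) positions.

    Assumes `a` holds non-negative integers (required by np.bincount).
    """
    flat = a.ravel()
    # Stable sort keeps positions of equal values in row-major order.
    order = np.argsort(flat, kind="stable")
    coords = np.column_stack(np.unravel_index(order, a.shape))
    counts = np.bincount(flat)
    present = np.flatnonzero(counts)  # values that actually occur
    bounds = np.concatenate(([0], np.cumsum(counts[present])))
    return {int(v): coords[s:e]
            for v, s, e in zip(present, bounds[:-1], bounds[1:])}

a = np.array([[0, 0, 1], [0, 2, 1], [2, 2, 1]])
groups = group_indices(a)

# Write access: replace every 2 with 7 using the stored positions.
rows, cols = groups[2].T
a[rows, cols] = 7
print(a)
```

Transposing a group gives separate row and column index arrays, which is the form fancy indexing expects for both reading and writing.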
All I am trying to do is populate an array with numbers in order. So, array[0][0] = 0, array[0][1]=1, etc. Why is this not working? I cannot figure it out.
def populateArray(ar):
    count = 0
    i=0
    j=0
    while (i<len(ar)):
        while (j<len(ar[i])):
            ar[i][j] = count
            count = count + 1
            j=j+1
        j=0
        i=i+1
    return ar
numColumns = input('how many columns?')
numRows = input('how many rows?')
arr = [[0]*int(numColumns)]*int(numRows)
arr=populateArray(arr)
print(arr)
When you initialize your arr variable, you multiply a list containing one inner list by a number. This results in a "mess", because the multiplication only copies the reference to that inner list: arr ends up holding one list and a bunch of references to that same list. To fix this you can do:
arr = [[0] * int(numColumns) for _ in range(int(numRows))]
or:
arr = [[0 for _ in range(int(numColumns))] for _ in range(int(numRows))]
after changing this in your code, for numRows = 3 and numColumns = 4 you get:
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
When you use this syntax to create a multidimensional array
arr = [[0]*int(numColumns)]*int(numRows)
a reference to the same inner list is stored many times, so if you assign a value through one of them you are effectively changing every row, because they all refer to the same data.
For example:
arr = [[0]*int(numColumns)]*int(numRows)
arr[0][1]=2
print(arr)
output
[[0, 2], [0, 2], [0, 2], [0, 2]]
I only changed one element, and this is the result.
You should use:
arr = [[0 for i in range(int(numColumns))] for j in range(int(numRows))]
arr[0][1]=2
print(arr)
output :
[[0, 2], [0, 0], [0, 0], [0, 0]]
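The difference between the two constructions can be checked directly with the `is` operator. The following is a small sketch with illustrative names (aliased, independent) that are not part of the original code:

```python
rows, cols = 3, 2

# The outer multiplication copies the reference to one inner list `rows` times.
aliased = [[0] * cols] * rows
# The comprehension evaluates `[0] * cols` once per row, giving distinct lists.
independent = [[0] * cols for _ in range(rows)]

print(all(row is aliased[0] for row in aliased))              # True
print(any(row is independent[0] for row in independent[1:]))  # False

aliased[0][1] = 2
independent[0][1] = 2
print(aliased)      # [[0, 2], [0, 2], [0, 2]]
print(independent)  # [[0, 2], [0, 0], [0, 0]]
```

The inner `[0] * cols` is safe because integers are immutable; only the outer multiplication of a mutable list causes the aliasing.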
You can do this:
numColumns = input('how many columns?')
numRows = input('how many rows?')
arr = [[0 for _ in range(int(numColumns))] for _ in range(int(numRows))]
arr=populateArray(arr)
print(arr)
The problem with your code is that you put the same list into the main list multiple times, like [l, l, l], where l is a list.
So whenever you change an element of l, the change shows up in every l in your list.
So your code works fine, but each time you write to another row, all of the previous rows are affected as well.
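If the goal is only to fill the grid with consecutive numbers, the values can also be computed while the rows are built, which avoids the shared-row problem entirely. A minimal sketch, assuming a hypothetical helper name numbered_grid:

```python
def numbered_grid(num_rows, num_cols):
    """Return a num_rows x num_cols list of lists filled with 0, 1, 2, ... row by row."""
    # Each inner list is created fresh, so no two rows share storage.
    return [[r * num_cols + c for c in range(num_cols)]
            for r in range(num_rows)]

print(numbered_grid(3, 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```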
You can also use numpy to create the structure and then call the array's tolist() method to convert it to a Python list
import numpy as np
numColumns = int(input('how many columns?'))
numRows = int(input('how many rows?') )
arr = np.arange(numRows*numColumns).reshape(numRows, numColumns).tolist()
print(arr)
I need to construct a 2D matrix knowing a row vector. What is the easiest way to construct this matrix?
import numpy as np
a = [1,2,3]
A = np.zeros(3,3)
for i in range(0,3):
    A[i][:]= a[i:3]
Using a[i+1:i+2] + b[:-1], I shift all elements of b one position to the right and put the next element of the original a at the beginning
a = [0,1,2,3]
A = []
b = a[:] # first row without changes
for i in range(len(a)):
    print(b)
    A.append(b)
    b = a[i+1:i+2] + b[:-1] # in next row move right and add new element at the beginning
print(A)
Result:
[0, 1, 2, 3]
[1, 0, 1, 2]
[2, 1, 0, 1]
[3, 2, 1, 0]
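The result above has the property that entry (i, j) equals a[|i - j|], so the same matrix can be built without a loop by indexing a with a table of offsets. This is a sketch of an alternative technique (NumPy fancy indexing), not the method used in the answer; the names offsets and n are illustrative.

```python
import numpy as np

a = np.array([0, 1, 2, 3])
n = len(a)

# offsets[i, j] = |i - j|, built by broadcasting a column against a row.
offsets = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])

# Fancy indexing picks a[|i - j|] for every position.
A = a[offsets]
print(A)
```

If SciPy is available, scipy.linalg.toeplitz(a) should give the same symmetric matrix.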
Suppose I have the following list of lists:
a = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]
]
I want to have the average of each n-th element in the arrays. However, when wanting to do this in a simple way, Python generated out-of-bounds errors because of the different lengths. I solved this by giving each array the length of the longest array, and filling the missing values with None.
Unfortunately, doing this made it impossible to compute an average, so I converted the arrays into masked arrays. The code shown below works, but it seems rather cumbersome.
import numpy as np
import numpy.ma as ma
a = [ [1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6] ]
# Determine the length of the longest list
lenlist = []
for i in a:
    lenlist.append(len(i))
max = np.amax(lenlist)
# Fill each list up with None's until required length is reached
for i in a:
    if len(i) <= max:
        for j in range(max - len(i)):
            i.append(None)
# Fill temp_list up with the n-th elements
# and add it to masked_arrays
temp_list = []
masked_arrays = []
for j in range(max):
    for i in range(len(a)):
        temp_list.append(a[i][j])
    masked_arrays.append(ma.masked_values(temp_list, None))
    del temp_list[:]
# Compute the average of each array
avg_array = []
for i in masked_arrays:
    avg_array.append(np.ma.average(i))
print(avg_array)
Is there a way to do this more quickly? The final list of lists will contain 600000 'rows' and up to 100 'columns', so efficiency is quite important :-).
itertools.izip_longest would do all the padding with Nones for you, so your code can be reduced to:
import numpy as np
import numpy.ma as ma
from itertools import izip_longest
a = [ [1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6] ]
averages = [np.ma.average(ma.masked_values(temp_list, None)) for temp_list in izip_longest(*a)]
print(averages)
[2.0, 3.0, 4.0, 6.0]
I have no idea what the fastest way is as far as the numpy logic goes, but this is definitely going to be a lot more efficient than your own code.
If you wanted a faster pure python solution:
from itertools import izip_longest, imap
a = [[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]]
def avg(x):
    x = filter(None, x)
    return sum(x, 0.0) / len(x)
filt = imap(avg, izip_longest(*a))
print(list(filt))
[2.0, 3.0, 4.0, 6.0]
If you have 0s in the arrays, that won't work, because filter(None, x) drops every falsy value, including 0. In that case you have to filter with a list comprehension, but it will still be faster:
def avg(x):
    x = [i for i in x if i is not None]
    return sum(x, 0.0) / len(x)
filt = imap(avg, izip_longest(*a))
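The snippets above are written for Python 2. In Python 3, izip_longest is named zip_longest, imap no longer exists (the built-in map is already lazy), and filter returns an iterator, so len(x) would fail on its result. A minimal Python 3 sketch of the same idea, with column_averages as a hypothetical helper name and the assumption that the data itself never contains None:

```python
from itertools import zip_longest

def column_averages(rows):
    """Average the n-th elements of lists of unequal length."""
    averages = []
    # zip_longest pads the shorter lists with None by default.
    for column in zip_longest(*rows):
        present = [v for v in column if v is not None]
        averages.append(sum(present) / len(present))
    return averages

a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]
print(column_averages(a))  # [2.0, 3.0, 4.0, 6.0]
```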
Here's an almost* fully vectorized solution based on np.bincount and np.cumsum -
# Store lengths of each list and their cumulative and entire summations
lens = np.array([len(i) for i in a]) # Only loop to get lengths
C = lens.cumsum()
N = lens.sum()
# Create ID array such that the first element of each list is 0,
# the second element as 1 and so on. This is needed in such a format
# for use with bincount later on.
shifts_arr = np.ones(N,dtype=int)
shifts_arr[C[:-1]] = -lens[:-1]+1
id_arr = shifts_arr.cumsum()-1
# Use bincount to get the summations and thus the
# averages across all lists based on their positions.
avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)
* Almost, because we get the lengths of the lists with a loop; since there is minimal computation involved there, it should not affect the total runtime much.
Sample run -
In [109]: a = [ [1, 2, 3],
...: [2, 3, 4],
...: [3, 4, 5, 6] ]
In [110]: lens = np.array([len(i) for i in a])
...: C = lens.cumsum()
...: N = lens.sum()
...:
...: shifts_arr = np.ones(N,dtype=int)
...: shifts_arr[C[:-1]] = -lens[:-1]+1
...: id_arr = shifts_arr.cumsum()-1
...:
...: avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)
...:
In [111]: avg_out
Out[111]: array([ 2., 3., 4., 6.])
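To make the role of the ID array more concrete, the sketch below wraps the same steps in a function and also returns the intermediate id_arr, so it can be inspected. The function name ragged_column_means is hypothetical, and the sketch assumes every inner list is non-empty (an empty list would make two reset positions collide).

```python
import numpy as np

def ragged_column_means(lists):
    """Return (id_arr, means) for a list of non-empty lists of numbers."""
    lens = np.array([len(row) for row in lists])
    ends = lens.cumsum()               # position where each list ends in the flat data
    steps = np.ones(ends[-1], dtype=int)
    # At the start of every list after the first, step back to position 0.
    steps[ends[:-1]] = 1 - lens[:-1]
    id_arr = steps.cumsum() - 1        # position of each element inside its own list
    sums = np.bincount(id_arr, weights=np.concatenate(lists))
    counts = np.bincount(id_arr)
    return id_arr, sums / counts

a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]
id_arr, means = ragged_column_means(a)
print(id_arr)  # [0 1 2 0 1 2 0 1 2 3]
print(means)   # [2. 3. 4. 6.]
```

Here np.bincount with weights sums all values that share a position, and np.bincount without weights counts how many lists reach that position.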
You can already clean your code to compute the max length: this single line does the job:
len(max(a,key=len))
Combining with other answer you will get the result like so:
[np.mean([x[i] for x in a if len(x) > i]) for i in range(len(max(a,key=len)))]
You can also avoid the masked array and use np.nan instead:
from itertools import zip_longest  # named izip_longest in Python 2
def replaceNoneTypes(x):
    return tuple(np.nan if isinstance(y, type(None)) else y for y in x)
averages = [np.nanmean(replaceNoneTypes(temp_list)) for temp_list in zip_longest(*a, fillvalue=np.nan)]
On your test array:
[np.mean([x[i] for x in a if len(x) > i]) for i in range(4)]
returns
[2.0, 3.0, 4.0, 6.0]
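The np.nan idea can also be applied to the whole data set at once: pad everything into one rectangular float array and let np.nanmean ignore the padding. This is a sketch under the assumption that the padded array fits in memory (600000 rows by 100 columns of float64 is roughly 480 MB); the names mask and padded are illustrative.

```python
import numpy as np

a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]

lens = np.array([len(row) for row in a])
# mask[r, c] is True where row r actually has an element at position c.
mask = np.arange(lens.max()) < lens[:, None]

padded = np.full(mask.shape, np.nan)
# Boolean assignment fills the True cells in row-major order,
# which matches the order of the concatenated lists.
padded[mask] = np.concatenate(a)

print(np.nanmean(padded, axis=0))  # [2. 3. 4. 6.]
```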
If you are using Python version >= 3.4, then import the statistics module
from statistics import mean
if using lower versions, create a function to calculate mean
def mean(array):
    sum = 0
    if (not(type(array) == list)):
        print("there is some bad format in your input")
    else:
        for elements in array:
            try:
                sum = sum + float(elements)
            except:
                print("non numerical entry found")
        average = (sum + 0.0) / len(array)
        return average
Create a list of lists, for example
myList = [[1,2,3],[4,5,6,7,8],[9,10],[11,12,13,14],[15,16,17,18,19,20,21,22],[23]]
iterate through myList
for i, lists in enumerate(myList):
    print(i, mean(lists))
This prints the index n and the average of the nth list.
To find the average of only the nth list, create a function
def mean_nth(array, n):
    if((type(n) == int) and n >= 1 and type(array) == list):
        return mean(array[n-1])
    else:
        print("there is some bad format of your input")
Note that indexing starts from zero, so if you are looking for the mean of the 5th list, it is at index 4. This explains the n-1 in the code.
And then call the function, for example
avg_5thList = mean_nth(myList, 5)
print(avg_5thList)
Running the above code on myList yields following result:
0 2.0
1 6.0
2 9.5
3 12.5
4 18.5
5 23.0
18.5
where the first six lines are generated from the iterative loop, and display the index of nth list and list average. Last line (18.5) displays the average of 5th list as a result of mean_nth(myList, 5) call.
Further, for a list like yours,
a = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]
]
Let's say you want the average of the 1st elements, i.e. (1+2+3)/3 = 2, or of the 2nd elements, i.e. (2+3+4)/3 = 3, or of the 4th elements, i.e. 6/1 = 6. You will need the length of each list so that you can tell whether the nth element exists in that list or not.
You can either
1) first sort the main list according to size of constituent lists iteratively, and then go through the sorted list to identify if the constituent lists are of sufficient length
2) or you can iteratively look into the original list for length of constituent lists.
(I can definitely get back with working out a faster recursive algorithm if needed)
Computationally, the second one is more efficient. So, assuming that your 5th element means index 4 (of 0, 1, 2, 3, 4), i.e. the nth element is at index n-1, let's go with that and create a function
def find_nth_average(array, n):
    if(not(type(n) == int and (int(n) >= 1))):
        return "Bad input format for n"
    else:
        if (not(type(array) == list)):
            return "Bad input format for main list"
        else:
            total = 0
            count = 0
            for i, elements in enumerate(array):
                if(not(type(elements) == list)):
                    return("non list constituent found at location " + str(i+1))
                else:
                    listLen = len(elements)
                    if(int(listLen) >= n):
                        try:
                            total = total + elements[n-1]
                            count = count + 1
                        except:
                            return ("non numerical entity found in constituent list " + str(i+1))
            if(int(count) == 0):
                return "No such n-element exists"
            else:
                average = float(total)/float(count)
                return average
Now let's call this function on your list a
print(find_nth_average(a, 0))
print(find_nth_average(a, 1))
print(find_nth_average(a, 2))
print(find_nth_average(a, 3))
print(find_nth_average(a, 4))
print(find_nth_average(a, 5))
print(find_nth_average(a, 'q'))
print(find_nth_average(a, 2.3))
print(find_nth_average(5, 5))
The corresponding results are:
Bad input format for n
2.0
3.0
4.0
6.0
No such n-element exists
Bad input format for n
Bad input format for n
Bad input format for main list
If you have an erratic list, like
a = [[1, 2, 3], 2, [3, 4, 5, 6]]
that contains a non-list element, you get the output:
non list constituent found at location 2
If your constituent list is erratic, like:
a = [[1, 'p', 3], [2, 3, 4], [3, 4, 5, 6]]
that contains a non-numerical entity in a list, and you ask for the average of the 2nd elements with print(find_nth_average(a, 2)),
you get an output:
non numerical entity found in constituent list 1
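The function above reports problems by returning strings, which means a caller cannot tell an error message from a result without checking the type. As a design alternative, here is a compact sketch that raises exceptions instead; nth_average is a hypothetical name and this is not the answer's own implementation.

```python
def nth_average(rows, n):
    """Average of the n-th elements (1-based) over all lists long enough to have one."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be an integer >= 1")
    values = [row[n - 1] for row in rows if len(row) >= n]
    if not values:
        raise ValueError("no list has an element at position %d" % n)
    # A non-numeric entry makes sum() raise TypeError on its own.
    return sum(values) / float(len(values))

a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]
print(nth_average(a, 1))  # 2.0
print(nth_average(a, 4))  # 6.0
```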
I have a matrix
a = [[1,2,-3,1],
[-1,0,0,1],
[1,1,1,1]]
I want to modify it so that the result contains only the columns that have no negative values.
a = [[2, 1],
[0, 1],
[1, 1]]
def removing_missing_data(x):
    """
    input: lists of lists.
    return: non-negative values.
    """
    for i in x:
        f = []
        for k in i:
            if k < 0:
                f.append(i.index(k))
        t(f,x)
def t(x,y):
    count = 0
    for i in x:
        i = i - count
        for l in y:
            l.pop(i)
        count+=1
The code above works, but it is not an efficient way to handle a large matrix. Is there any way to optimize the running time so that it can be used on large data sets?
You can find the minimum of each column and then keep only the columns whose minimum is non-negative. This can be done very easily with numpy:
>>> import numpy as np
>>> a = [[1,2,-3,1],
... [-1,0,0,1],
... [1,1,1,1]]
>>> b = np.array(a)
>>> b[:,(b.min(axis=0)>=0)]
array([[2, 1],
[0, 1],
[1, 1]])
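For data that stays in plain Python lists, the same "minimum per column" test can be written with zip, which transposes rows into columns. This is a sketch of a pure-Python alternative, not a replacement for the numpy version on large inputs; the names kept_columns and result are illustrative.

```python
a = [[1, 2, -3, 1],
     [-1, 0, 0, 1],
     [1, 1, 1, 1]]

# zip(*a) yields the columns; keep those whose smallest entry is non-negative.
kept_columns = [col for col in zip(*a) if min(col) >= 0]

# Transpose back to rows.
result = [list(row) for row in zip(*kept_columns)]
print(result)  # [[2, 1], [0, 1], [1, 1]]
```

Note that if no column qualifies, result is an empty list rather than a list of empty rows.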
I have a 2D list and I need to search for the index of an element. As I am a beginner at programming, I used the following function:
def in_list(c):
    for i in xrange(0,no_classes):
        if c in classes[i]:
            return i;
    return -1
Here classes is a 2D list and no_classes denotes the number of classes, i.e. the first dimension of the list. -1 is returned when c is not in the array. Is there any way I can optimize the search?
You don't need to define no_classes yourself. Use enumerate():
def in_list(c, classes):
    for i, sublist in enumerate(classes):
        if c in sublist:
            return i
    return -1
Use list.index(item)
a = [[1,2],[3,4,5]]
def in_list(item,L):
    for i in L:
        if item in i:
            return L.index(i)
    return -1
print(in_list(3,a))
# prints 1
If order doesn't matter and you have no duplicates in your data, I suggest turning your 2D list into a list of sets:
>>> l = [[1, 2, 4], [6, 7, 8], [9, 5, 10]]
>>> l = [set(x) for x in l]
>>> l
[set([1, 2, 4]), set([8, 6, 7]), set([9, 10, 5])]
After that, your original function will run faster, because a membership test on a set takes constant time on average (while searching a list is linear), so your algorithm becomes O(N) and not O(N^2).
Note that you should not do this conversion inside your function, or it would be repeated each time the function is called.
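Going one step further than a list of sets, a single dictionary from element to sublist index makes each lookup one hash access. This is a sketch of a related technique rather than the answer's code; build_lookup is a hypothetical name, and it assumes the elements are hashable and that, for duplicates, the first sublist containing the element should win.

```python
def build_lookup(classes):
    """Map each element to the index of the first sublist that contains it."""
    lookup = {}
    for i, sublist in enumerate(classes):
        for item in sublist:
            # setdefault keeps the earliest index if an item appears more than once.
            lookup.setdefault(item, i)
    return lookup

classes = [[1, 2, 4], [6, 7, 8], [9, 5, 10]]
lookup = build_lookup(classes)  # build once, outside the search function

print(lookup.get(7, -1))   # 1
print(lookup.get(42, -1))  # -1
```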
If your "2D" list is rectangular (same number of columns for each line), you should convert it to a numpy.ndarray and use numpy functionalities to do the search. For an array of integers, you can use == for comparison. For an array of float numbers, you should use np.isclose instead:
a = np.array(c, dtype=int)
i,j = np.where(a == element)
or
a = np.array(c, dtype=float)
i,j = np.where(np.isclose(a, element))
such that i and j contain the line and column indices, respectively.
Example:
a = np.array([[1, 2],
[3, 4],
[2, 6]], dtype=float)
i, j = np.where(np.isclose(a, 2))
print(i)
# [0 2]
print(j)
# [1 0]
Using a list comprehension (2D list to 1D):
a = [[1,22],[333,55555,6666666]]
d1 = [x for b in a for x in b]
print(d1)
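The standard library offers an equivalent way to flatten one level of nesting, shown here as a brief sketch next to the comprehension above:

```python
from itertools import chain

a = [[1, 22], [333, 55555, 6666666]]

# chain.from_iterable walks the sublists one after another without copying them first.
flat = list(chain.from_iterable(a))
print(flat)  # [1, 22, 333, 55555, 6666666]
```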