Select elements from matrix within range - python

I have a question regarding Python and selecting elements within a range.
If I have an n x m matrix with n rows and m columns, I have a defined range for each column (so I have m min and m max values).
Now I want to select those rows where all values are within their columns' ranges.
Looking at the following example:
input = matrix([[1, 2], [3, 4], [5, 6], [1, 8]])
boundaries = matrix([[2, 1], [8, 5]])
# Note:
# col1min = 2
# col1max = 8
# col2min = 1
# col2max = 5
print(input)
desired_result = matrix([[3, 4]])
print(desired_result)
Here, 3 rows were discarded because they contained values beyond the boundaries.
While I was able to get values within one range for a given array, I did not manage to solve this problem efficiently.
Thank you for your help.

I believe there is a more elegant solution, but I came up with this:
def foo(data, boundaries):
    zipped_bounds = list(zip(*boundaries))
    output = []
    for item in data:
        for index, bound in enumerate(zipped_bounds):
            if not (bound[0] <= item[index] <= bound[1]):
                break
        else:
            output.append(item)
    return output
data = [[1, 2], [3, 4], [5, 6], [1, 8]]
boundaries = [[2, 1], [8, 5]]
foo(data, boundaries)
Output:
[[3, 4]]
I know there is no checking that raises an exception when the array sizes don't match; I leave that for the OP to implement.
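For completeness, since the question uses matrix(), here is a hedged NumPy sketch of a vectorized version (assuming, as in the example, that boundaries holds the per-column minima in its first row and the per-column maxima in its second):
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [1, 8]])
boundaries = np.array([[2, 1], [8, 5]])  # row 0: per-column minima, row 1: per-column maxima

# broadcast each column against its own [min, max] range and
# keep only the rows where every column check passes
mask = ((data >= boundaries[0]) & (data <= boundaries[1])).all(axis=1)
print(data[mask])  # [[3 4]]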

Your example data syntax, matrix([[],..]), is not correct, so it needs to be restructured like this:
matrix = [[1, 2], [3, 4], [5, 6], [1, 8]]
bounds = [[2, 1], [8, 5]]
I'm not sure exactly what you mean by "efficient", but this solution is readable, computationally efficient, and modular:
# Test columns in row against column bounds or first bounds
def row_in_bounds(row, bounds):
    for ci, colVal in enumerate(row):
        bi = ci if len(bounds[0]) >= ci + 1 else 0
        if not bounds[1][bi] >= colVal >= bounds[0][bi]:
            return False
    return True
# Use a list comprehension to apply the test to n rows
print([r for r in matrix if row_in_bounds(r, bounds)])
>>> [[3, 4]]
First we create a reusable test function for rows that accepts a list of bounds lists (tuples are probably more appropriate, but I stuck with lists as per your specification).
Then we apply the test to your matrix of n rows with a list comprehension. If a row's column index exceeds the highest bounds index available, the first set of bounds provided is used as a fallback.
Keeping the row iterator out of the row-testing function allows you to do things like get min/max from the filtered elements as required. This way you will not need to define a new function for every manipulation of the data.
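As an example of that modularity, here is a hypothetical follow-up that reuses row_in_bounds from above to filter first and then reduce the surviving rows (the column choice is just for illustration):
matrix = [[1, 2], [3, 4], [5, 6], [1, 8]]
bounds = [[2, 1], [8, 5]]

# filter once, then post-process the in-bounds rows however you like
in_bounds = [r for r in matrix if row_in_bounds(r, bounds)]
print(max(r[1] for r in in_bounds))  # 4 -- max of the second column among in-bounds rows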

Related

compare each row with each column in matrix using only pure python

I have a certain function that I made and I want to run it on each column and each row of a matrix, to check if there are rows and columns that produce the same output.
for example:
matrix = [[1, 2, 3],
          [7, 8, 9]]
I want to run the function, let's call it myfun, on each column ([1,7], [2,8], and [3,9]) separately, and also run it on each row ([1,2,3] and [7,8,9]). If a row and a column produce the same result, the counter ct goes up by 1. All of this happens in another function, called count_good, which basically counts rows and columns that produce the same result.
Here is the code so far:
def count_good(mat):
    ct = 0
    for i in mat:
        for j in mat:
            if myfun(i) == myfun(j):
                ct += 1
    return ct
However, when I use print to check my code I get this:
mat = [[1,2,3],[7,8,9]]

for i in mat:
    for j in mat:
        print(i, j)

[1, 2, 3] [1, 2, 3]
[1, 2, 3] [7, 8, 9]
[7, 8, 9] [1, 2, 3]
[7, 8, 9] [7, 8, 9]
I see that the code does not return what I need, which means that the count_good function won't work. How can I run a function on each row and each column? I need to do it without any help from outside libraries: no map, zip, or stuff like that, only very pure Python.
Let's start by using itertools and collections for this, then translate it back to "pure" python.
from itertools import product, starmap, chain # combinations?
from collections import Counter
To iterate in a nested loop efficiently, you can use itertools.product. You can use starmap to expand the arguments of a function as well. Here is a generator of the values of myfun over the pairs of rows:
starmap(myfun, product(matrix, repeat=2))
To transpose the matrix and iterate over the columns, use the zip(*...) idiom:
starmap(myfun, product(zip(*matrix), repeat=2))
You can use collections.Counter to map all the repeats for each possible return value:
Counter(starmap(myfun, chain(product(matrix, repeat=2), product(zip(*matrix), repeat=2))))
If you want to avoid running myfun on the same elements, replace product(..., repeat=2) with combinations(..., 2).
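Putting the pieces together on a toy example (assuming, purely for illustration, a two-argument myfun that compares sums):
from itertools import chain, product, starmap
from collections import Counter

matrix = [[1, 2, 3], [7, 8, 9]]

def myfun(a, b):  # illustrative stand-in: do the two sequences have equal sums?
    return sum(a) == sum(b)

counts = Counter(starmap(myfun, chain(product(matrix, repeat=2),
                                      product(zip(*matrix), repeat=2))))
print(counts)  # Counter({False: 8, True: 5})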
Now that you have the layout of how to do this, replace all the external library stuff with equivalent builtins:
counter = {}
for i in range(len(matrix)):
    for j in range(len(matrix)):
        result = myfun(matrix[i], matrix[j])
        counter[result] = counter.get(result, 0) + 1
for i in range(len(matrix[0])):
    for j in range(len(matrix[0])):
        c1 = [matrix[row][i] for row in range(len(matrix))]
        c2 = [matrix[row][j] for row in range(len(matrix))]
        result = myfun(c1, c2)
        counter[result] = counter.get(result, 0) + 1
If you want combinations instead, replace the loop pairs with
for i in range(len(...) - 1):
    for j in range(i + 1, len(...)):
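Concretely, the row pass would then look like this (a sketch; matrix and myfun are assumed to exist as above, and the column pass is analogous):
counter = {}
n = len(matrix)
for i in range(n - 1):
    for j in range(i + 1, n):
        result = myfun(matrix[i], matrix[j])
        counter[result] = counter.get(result, 0) + 1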
Using native Python:
def count_good(mat):
    ct = 0
    columns = [[row[col_idx] for row in mat] for col_idx in range(len(mat[0]))]
    for row in mat:
        for column in columns:
            if myfun(row) == myfun(column):
                ct += 1
    return ct
However, this is inefficient: it is a nested for-loop that recomputes myfun for every row/column pairing. I would suggest using numpy instead.
e.g.
import numpy as np

def count_good(mat):
    ct = 0
    mat = np.array(mat)
    for row in mat:
        for column in mat.T:
            if myfun(row) == myfun(column):
                ct += 1
    return ct
TL;DR
To get a column from a 2D list of N lists of M elements, first flatten the list to a 1D list of N×M elements; choosing elements from the 1D list with a stride equal to M, the number of columns, then gives you a column of the original 2D list.
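A tiny illustration of the stride trick (a hypothetical 2×3 example):
matrix = [[1, 2, 3], [7, 8, 9]]
flat = [1, 2, 3, 7, 8, 9]   # the matrix flattened row by row
ncols = 3
print(flat[1::ncols])       # [2, 8] -- the second column of the matrix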
First, I create a matrix of random integers, as a list of lists of equal length. Here I take some liberty with the objective of "pure" Python; the OP will probably input some assigned matrix by hand.
from random import randrange, seed
seed(20220914)
dim = 5
matrix = [[randrange(dim) for column in range(dim)] for row in range(dim)]
print(*matrix, sep='\n')
We need a function to be applied to each row and each column of the matrix, which I assume is supplied as a list. Here I choose a simple summation of the elements.
def myfun(l_st):
    the_sum = 0
    for value in l_st: the_sum = the_sum+value
    return the_sum
To proceed, we are going to do something unexpected: we unwrap the matrix. Starting from an empty list, we loop over the rows and "sum" the current row to unwrapped; note that summing two lists gives you a single list containing all the elements of the two lists.
unwrapped = []
for row in matrix: unwrapped = unwrapped+row
In the following we will need the number of columns in the matrix; this number can be computed by counting the elements in the last row of the matrix (row still references it after the loop above).
ncols = 0
for value in row: ncols = ncols+1
Now we can compute the values produced by applying myfun to each column, counting how many times we get the same value.
We use an auxiliary variable, start, that is initialized to zero and incremented in every iteration of the following loop, which scans, using a dummy variable, all the elements of the current row; hence start takes the values 0, 1, ..., ncols-1, so that unwrapped[start::ncols] is a list containing exactly one of the columns of the matrix.
count_of_column_values = {}
start = 0
for dummy in row:
    column_value = myfun(unwrapped[start::ncols])
    if column_value not in count_of_column_values:
        count_of_column_values[column_value] = 1
    else:
        count_of_column_values[column_value] = count_of_column_values[column_value] + 1
    start = start+1
At this point, we are ready to apply myfun to the rows:
count = 0
for row in matrix:
    row_value = myfun(row)
    if row_value in count_of_column_values: count = count+count_of_column_values[row_value]
print(count)
Executing the code above prints
[1, 4, 4, 1, 0]
[1, 2, 4, 1, 4]
[1, 4, 4, 0, 1]
[4, 0, 3, 1, 2]
[0, 0, 4, 2, 2]
3

How to Partition an array into 2 arrays with equal sums

We have an array of integers that has to be partitioned into 2 arrays. My goal is not just to say whether it's possible or not; it has to return the 2 arrays as output.
Input = [1, 2, 3, 4, 6]
output = [1, 3, 4] [2, 6]
Both the arrays need to have the same sum. In this case, it is 8 for both arrays. All the elements should be used and no integers should repeat again in the output.
This is what I have tried:
def partition(nums):
    if sum(nums) % 2:
        return "Not possible"
    target = (sum(nums))/2
    possible = set()
    possible.add(0)
    for i in range(len(nums)):
        next = set()
        for t in possible:
            next.add(t + nums[i])
            if t + nums[i] == target:
                sub = [t, nums[i]]
                print(sub)
            next.add(t)
        possible = next
nums = [1, 2, 3, 4, 6]
print(partition(nums))
This code repeats the same elements and makes an array like [4, 4]. I don't understand what to do to stop that.
I am a newbie, so you can completely rewrite it and come up with your own technique. Is it even possible to do something like that?
One approach is the knapsack algorithm: the knapsack has to hold a weight of (total sum)/2, so find the items whose weights sum to (total sum)/2; the remaining items will then have the same total weight.
Another approach is backtracking: run through the list for a combination of numbers summing to (total sum)/2 and return it once found. But this will be inefficient.
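A minimal sketch of the subset-sum idea with reconstruction (assuming non-negative integers; one possible implementation, not the only one): record every reachable subset sum together with the number that reached it, then walk back from the target.
def partition(nums):
    total = sum(nums)
    if total % 2:
        return None  # an odd total can never be split evenly
    target = total // 2
    # parent[s] = (previous_sum, number_used); records one way to reach sum s
    parent = {0: None}
    for num in nums:
        # snapshot the keys so sums created in this round cannot reuse num
        for s in list(parent):
            if s + num not in parent:
                parent[s + num] = (s, num)
    if target not in parent:
        return None
    # walk back from the target to recover one subset
    subset, s = [], target
    while parent[s] is not None:
        s, num = parent[s]
        subset.append(num)
    # the complement is everything else (removing by count handles duplicates)
    rest = list(nums)
    for num in subset:
        rest.remove(num)
    return subset, rest

print(partition([1, 2, 3, 4, 6]))  # ([4, 3, 1], [2, 6])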

Determining the indices of each group of duplicate values in an array in Python the fastest way

I want to find the indices of each group of duplicate values, like this:
s = [2,6,2,88,6,...]
The result must return the indices into the original s, e.g. [[0,2],[1,4],..], or it can be shown another way.
I have found many solutions; the fastest way I found to get the duplicate groups is:
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
But after the sort, the indices no longer match the original s.
In my case, I have ~200 million values in the list, and I want to find the fastest way to do this. I use an array to store the values because I want to use a GPU to make it faster.
Using hash structures like dict helps.
For example:
import numpy as np
from collections import defaultdict

a = np.array([2, 4, 2, 88, 15, 4])
table = defaultdict(list)
for ind, num in enumerate(a):
    table[num] += [ind]
Outputs:
{2: [0, 2], 4: [1, 5], 88: [3], 15: [4]}
If you want to show duplicated elements in the order from small to large:
for k, v in sorted(table.items()):
    if len(v) > 1:
        print(k, ":", v)
Outputs:
2 : [0, 2]
4 : [1, 5]
The speed is determined by how many different values are in the number list.
See if this meets your performance requirements (here, s is your input array):
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = np.split(sorted_inds, cum_counts[:-1])
Notes:
The result would be a list of arrays.
Each of these arrays would contain the indices of one repeated value in s. E.g., if the value 13 is repeated 7 times in s, there would be an array with those 7 indices among the arrays of result.
If you want to ignore singleton values of s (values that occur only once in s), you can pass minlength=2 to np.bincount()
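A quick demonstration on the toy input from the question (dropping groups with fewer than two indices, since values absent from s produce empty arrays):
import numpy as np

s = np.array([2, 6, 2, 88, 6])
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = np.split(sorted_inds, cum_counts[:-1])
# keep only the actual duplicate groups
print([g for g in result if len(g) > 1])  # [array([0, 2]), array([1, 4])]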
(This is a variation of my other answer. Here, instead of splitting the large array sorted_inds, we take slices from it, so it's likely to have a different kind of performance characteristic)
If s is the input array:
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = [sorted_inds[:cum_counts[0]]] + [sorted_inds[cum_counts[i]:cum_counts[i+1]] for i in range(cum_counts.size-1)]

Iterating through a two dimensional array in Python?

I'm trying to iterate through a two-dimensional array in Python and compare items in the array to ints; however, I am faced with a ton of various errors whenever I attempt to do so. I'm using numpy and pandas.
My dataset is created as follows:
filename = "C:/Users/User/My Documents/JoeTest.csv"
datas = pandas.read_csv(filename)
dataset = datas.values
Then, I attempt to go through the data, grabbing certain elements of it.
def model_building(data):
    global blackKings
    flag = 0
    blackKings.append(data[0][1])
    for i in data:
        if data[i][39] == 1:
            if data[i][40] == 1:
                values.append(1)
            else:
                values.append(-1)
        else:
            if data[i][40] == 1:
                values.append(-1)
            else:
                values.append(1)
        for j in blackKings:
            if blackKings[j] != data[i][1]:
                flag = 1
        if flag == 1:
            blackKings.append(data[i][1])
            flag = 0
However, doing so leaves me with "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()". I don't want to use either of these, as I'm looking to compare the actual value of one specific instance. Is there another way around this problem?
You need to tell us something about this: dataset = datas.values
It's probably a 2d array, since it derives from a load of a csv. But what shape and dtype? Maybe even a sample of the array.
Is that the data argument in the function?
What are blackKings and values? You treat them like lists (with append).
for i in data:
    if data[i][39] == 1:
This doesn't make sense. With for i in data, if data is 2d, i is the first row, then the second row, etc. If you want i to be an index, you use something like
for i in range(data.shape[0]):
2d array indexing is normally done with data[i,39].
But in your case data[i][39] is probably an array.
Anytime you use an array in an if statement, you'll get this ValueError, because there are multiple values.
If i were proper indexes, then data[i,39] would be a single value.
To illustrate:
In [41]: data = np.random.randint(0, 4, (4, 4))
In [42]: data
Out[42]:
array([[0, 3, 3, 2],
       [2, 1, 0, 2],
       [3, 2, 3, 1],
       [1, 3, 3, 3]])
In [43]: for i in data:
    ...:     print('i', i)
    ...:     print('data[i]', data[i].shape)
    ...:
i [0 3 3 2]      # 1st row
data[i] (4, 4)   # a (4, 4) array, not a single value
i [2 1 0 2]      # 2nd row
data[i] (4, 4)
...
Here i is a 4-element array; using that to index, data[i] actually produces a (4, 4) array; it isn't selecting one value, but rather many values.
Instead you need to iterate in one of these ways:
In [46]: for row in data:
    ...:     if row[3] == 1:
    ...:         print(row)
[3 2 3 1]
In [47]: for i in range(data.shape[0]):
    ...:     if data[i, 3] == 1:
    ...:         print(data[i])
[3 2 3 1]
To debug a problem like this you need to look at intermediate values, and especially their shapes. Don't just assume. Check!
I'm going to attempt to rewrite your function:
def model_building(data):
    global blackKings
    blackKings.append(data[0, 1])

    # Your nested if statements were performing an xor.
    # This is a vectorized version of the same thing.
    values = np.logical_xor(*(data.T[[39, 40]] == 1)) * -2 + 1

    # Not sure where `values` is defined. If you really wanted to
    # append to it, you can do:
    # values = np.append(values, np.logical_xor(*(data.T[[39, 40]] == 1)) * -2 + 1)

    # Your blackKings / flag logic can be reduced:
    mask = (blackKings[:, None] != data[:, 1]).all(1)
    blackKings = np.append(blackKings, data[:, 1][mask])
This may not be perfect, because it is difficult to parse your logic when some pieces are missing. But hopefully you can adopt some of what I've included here and improve your code.
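To see how the xor-to-±1 mapping works in isolation, here is a toy illustration (columns 0 and 1 stand in for 39 and 40; this is not the OP's data):
import numpy as np

# four rows covering every combination of the two flag columns
data = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])

# xor is True exactly when one flag is 1 and the other is not;
# multiplying by -2 and adding 1 maps True -> -1 and False -> 1
values = np.logical_xor(*(data.T[[0, 1]] == 1)) * -2 + 1
print(values)  # [ 1 -1 -1  1]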

Selecting unique random values from the third column of an array in Python

I have a 41000x3 numpy array that I call "sortedlist" in the function below. The third column has a bunch of values, some of which are duplicates and some of which are not. I'd like to take a sample of unique values (no duplicates) from the third column, which is sortedlist[:,2]. I think I can do this easily with numpy.random.sample(sortedlist[:,2], sample_size). The problem is that I'd like to return not only those values, but all three columns of the rows whose last column holds the randomly chosen values I get from numpy.random.sample.
EDIT: By unique values I mean I want the chosen values to appear only once in the sample. So if I had an array:
array = [[0, 6, 2],
         [5, 3, 9],
         [3, 7, 1],
         [5, 3, 2],
         [3, 1, 1],
         [5, 2, 8]]
and I wanted to choose 4 values of the third column, I'd want to get something like new_array_1 out:
new_array_1 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [5, 2, 8]]
But I don't want something like new_array_2, where two values in the 3rd column are the same:
new_array_2 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [3, 1, 1]]
I have the code to choose random values but without the criterion that they shouldn't be duplicates in the third column.
sample_size = 100
rand_sortedlist = sortedlist[np.random.randint(len(sortedlist), size=sample_size), :]
I'm trying to enforce this criterion by doing something like this:
array_index = where( array[:,2] == sample(SelectionWeight, sample_size) )
But I'm not sure if I'm on the right track. Any help would be greatly appreciated!
I can't think of a clever numpythonic way to do this that doesn't involve multiple passes over the data. (Sometimes numpy is so much faster than pure Python that it's still the fastest way to go, but it never feels right.)
In pure Python, I'd do something like
import random

def draw_unique(vec, n):
    # group indices by value
    d = {}
    for i, x in enumerate(vec):
        d.setdefault(x, []).append(i)
    # sample n distinct values, then pick one index for each
    drawn = [random.choice(d[k]) for k in random.sample(list(d), n)]
    return drawn
which would give
>>> a = np.random.randint(0, 10, (41000, 3))
>>> drawn = draw_unique(a[:,2], 3)
>>> drawn
[4219, 6745, 25670]
>>> a[drawn]
array([[5, 6, 0],
[8, 8, 1],
[5, 8, 3]])
I can think of some tricks with np.bincount and scipy.stats.rankdata, but they hurt my head, and there always winds up being one step at the end that I can't see how to vectorize... and if I'm not vectorizing the whole thing, I might as well use the above, which at least is simple.
I believe this will do what you want. Note that the running time will almost certainly be dominated by whatever method you use to generate your random numbers. (An exception is if the dataset is gigantic but you only need a small number of rows, in which case very few random numbers need to be drawn.) So I'm not sure this will run much faster than a pure python method would.
import numpy as np

# arrayify your list of lists
# please don't use `array` as a variable name!
a = np.asarray(arry)

# sort by the third column ... always the first step for efficiency
a2 = a[np.argsort(a[:, 2])]

# identify rows that are duplicates (3rd column value repeats after sorting)
# Note this has length one less than a2
duplicate_rows = np.diff(a2[:, 2]) == 0

# if duplicate_rows[N], then we want to remove rows N and N+1
keep_mask = np.ones(len(a2), dtype=bool)  # all True
keep_mask[:-1][duplicate_rows] = False    # remove row N
keep_mask[1:][duplicate_rows] = False     # remove row N + 1

# now actually slice the array
a3 = a2[keep_mask]

# select rows from a3 using your preferred random number generator
# I actually prefer `random` over numpy.random for sampling w/o replacement
import random
result = a3[random.sample(range(len(a3)), DESIRED_NUMBER_OF_ROWS)]
