Modify matrix based on column values

I have a matrix
a = [[1,2,-3,1],
[-1,0,0,1],
[1,1,1,1]]
I want to modify it so that the result contains only the columns that have no negative values:
a = [[2, 1],
[0, 1],
[1, 1]]
def removing_missing_data(x):
    """
    input: lists of lists.
    return: non-negative values.
    """
    for i in x:
        f = []
        for k in i:
            if k < 0:
                f.append(i.index(k))
        t(f,x)
def t(x,y):
    count = 0
    for i in x:
        i = i - count
        for l in y:
            l.pop(i)
        count+=1
The code above works, but it is not efficient when the matrix is large. Is there a way to optimize the running time so it can be used on large data sets?

You can find the minimum of each column and then keep only the columns whose minimum is non-negative. This is very easy with numpy:
>>> import numpy as np
>>> a = [[1,2,-3,1],
... [-1,0,0,1],
... [1,1,1,1]]
>>> b = np.array(a)
>>> b[:,(b.min(axis=0)>=0)]
array([[2, 1],
[0, 1],
[1, 1]])
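If NumPy is not available, the same column filter can be sketched in plain Python by transposing with zip. This is only an illustration; the function name keep_nonnegative_columns is made up for this example and does not come from the question.

```python
def keep_nonnegative_columns(matrix):
    # zip(*matrix) yields the columns of the matrix as tuples.
    kept = [col for col in zip(*matrix) if min(col) >= 0]
    # Transpose back to rows; an empty list means no column qualified.
    return [list(row) for row in zip(*kept)]


a = [[1, 2, -3, 1],
     [-1, 0, 0, 1],
     [1, 1, 1, 1]]
print(keep_nonnegative_columns(a))  # [[2, 1], [0, 1], [1, 1]]
```

For large inputs the NumPy version is likely to be faster, since the filtering happens in compiled code.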

Related

python populating array using while loop

All I am trying to do is populate an array with numbers in order. So, array[0][0] = 0, array[0][1]=1, etc. Why is this not working? I cannot figure it out.
def populateArray(ar):
    count = 0
    i=0
    j=0
    while (i<len(ar)):
        while (j<len(ar[i])):
            ar[i][j] = count
            count = count + 1
            j=j+1
        j=0
        i=i+1
    return ar
numColumns = input('how many columns?')
numRows = input('how many rows?')
arr = [[0]*int(numColumns)]*int(numRows)
arr=populateArray(arr)
print(arr)

When you initialize arr, you multiply a list of lists by a number. This copies only the reference to the inner list, so arr ends up holding the same list numRows times rather than numRows separate lists. To fix this you can do:
arr = [[0] * int(numColumns) for _ in range(int(numRows))]
or:
arr = [[0 for _ in range(int(numColumns))] for _ in range(int(numRows))]
After changing this in your code, for numRows = 3 and numColumns = 4 you get:
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
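The difference between the two ways of building the list can be checked directly with the is operator. The snippet below is a small sketch with fixed sizes (3 rows, 4 columns) chosen only for illustration.

```python
rows, cols = 3, 4

# Multiplying the outer list repeats a reference to ONE inner list.
aliased = [[0] * cols] * rows
# The comprehension evaluates [0] * cols once per row, giving separate lists.
independent = [[0] * cols for _ in range(rows)]

print(all(row is aliased[0] for row in aliased))          # True
print(all(row is independent[0] for row in independent))  # False

aliased[0][1] = 2
independent[0][1] = 2
print(aliased)      # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(independent)  # [[0, 2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```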

When you use this syntax to create a multi-dimensional array:
arr = [[0]*int(numColumns)]*int(numRows)
a reference to the same inner list is created many times, so if you assign a value through one of them, you change every row, because they all refer to the same data.
For example (with numColumns = 2 and numRows = 4):
arr = [[0]*int(numColumns)]*int(numRows)
arr[0][1]=2
print(arr)
Output:
[[0, 2], [0, 2], [0, 2], [0, 2]]
I only changed one element, and this is the result.
You should use:
arr = [[0 for i in range(int(numColumns))] for j in range(int(numRows))]
arr[0][1]=2
print(arr)
Output:
[[0, 2], [0, 0], [0, 0], [0, 0]]

You can do this:
numColumns = input('how many columns?')
numRows = input('how many rows?')
arr = [[i+j for i in range(int(numColumns))] for j in range(int(numRows))]
arr=populateArray(arr)
print(arr)
The problem with your code is that you put the same list into the main list multiple times, like this: [l, l, l], where l is a list.
So whenever you change an element of l, the change shows up in every l in your list.
Your loop works fine, but each time you fill in another row, all of the previous rows are affected too.

You can also use numpy to create the structure, then call the array's tolist() method to convert it to a Python list:
import numpy as np
numColumns = int(input('how many columns?'))
numRows = int(input('how many rows?') )
arr = np.arange(numRows*numColumns).reshape(numRows, numColumns).tolist()
print(arr)
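For comparison, the same numbering can be produced without NumPy by computing each value from its row and column index. This is a sketch; the sizes are hard-coded here instead of being read with input(), and the names num_rows and num_columns are chosen for the example.

```python
num_rows, num_columns = 3, 4  # fixed sizes for the example

# The element at row r, column c is its position in row-major order.
arr = [[r * num_columns + c for c in range(num_columns)]
       for r in range(num_rows)]
print(arr)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Because each inner list is built by its own comprehension, the rows are independent objects and the aliasing problem from the question does not occur.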

Numpy: How to quickly replace equal values in a matrix?

Let's say we have a rank 2 array a with n entries that contain integer values in {0,1,2,...,m}. Now for each of those integers I want to find the indices of the entries of a with this value (called index_i, index_j in the following examples). (So what I'm looking for is like np.unique(...,return_index=True) but for 2d arrays and with the possibility to return all indices of each unique value.)
A naive approach would use boolean indexing, which results in O(m*n) operations (see below), but I'd like to need only O(n) operations. While I found a solution that does that, I feel like there should be a built-in method, or at least something that simplifies this or removes these ugly loops:
import numpy as np
a = np.array([[0,0,1],[0,2,1],[2,2,1]])
m = a.max()
#"naive" in O(n*m)
i,j = np.mgrid[range(a.shape[0]), range(a.shape[1])]
index_i = [[] for _ in range(m+1)]
index_j = [[] for _ in range(m+1)]
for k in range(m+1):
    index_i[k] = i[a==k]
    index_j[k] = j[a==k]
#all the zeros:
print(a[index_i[0], index_j[0]])
#all the ones:
print(a[index_i[1], index_j[1]])
#all the twos:
print(a[index_i[2], index_j[2]])
#"sophisticated" in O(n)
index_i = [[] for _ in range(m+1)]
index_j = [[] for _ in range(m+1)]
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        index_i[a[i,j]].append(i)
        index_j[a[i,j]].append(j)
#all the zeros:
print(a[index_i[0], index_j[0]])
#all the ones:
print(a[index_i[1], index_j[1]])
#all the twos:
print(a[index_i[2], index_j[2]])
(Note that I will need these indices for write access later, that is, for replacing the values stored in the array. But between these operations I do need the 2d structure.)

Here's one based on sorting, intended to keep the work done while iterating to a minimum. It builds a dictionary whose keys are the unique elements and whose values are the indices -
shp = a.shape
idx = a.ravel().argsort()
idx_sorted = np.c_[np.unravel_index(idx,shp)]
count = np.bincount(a.ravel())
valid_idx = np.flatnonzero(count!=0)
cs = np.r_[0,count[valid_idx].cumsum()]
out = {e:idx_sorted[i:j] for (e,i,j) in zip(valid_idx,cs[:-1],cs[1:])}
Sample input, output -
In [155]: a
Out[155]:
array([[0, 2, 6],
[0, 2, 6],
[2, 2, 1]])
In [156]: out
Out[156]:
{0: array([[0, 0],
[1, 0]]), 1: array([[2, 2]]), 2: array([[0, 1],
[1, 1],
[2, 0],
[2, 1]]), 6: array([[0, 2],
[1, 2]])}
If all integers in the sequence are covered in the array, we could simplify it a bit -
shp = a.shape
idx = a.ravel().argsort()
idx_sorted = np.c_[np.unravel_index(idx,shp)]
cs = np.r_[0,np.bincount(a.ravel()).cumsum()]
out = {iterID:idx_sorted[i:j] for iterID,(i,j) in enumerate(zip(cs[:-1],cs[1:]))}
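Since the question mentions needing the indices for write access, here is a hedged sketch of how the dictionary could be used to replace one value in the 2d array. It rebuilds out for the sample input; kind="stable" and the replacement value 9 are choices made for this example, not part of the answer above.

```python
import numpy as np

a = np.array([[0, 2, 6],
              [0, 2, 6],
              [2, 2, 1]])

# Same construction as above; a stable sort keeps the indices of each
# value in row-major order.
idx = a.ravel().argsort(kind="stable")
idx_sorted = np.c_[np.unravel_index(idx, a.shape)]
count = np.bincount(a.ravel())
valid_idx = np.flatnonzero(count != 0)
cs = np.r_[0, count[valid_idx].cumsum()]
out = {int(e): idx_sorted[i:j] for e, i, j in zip(valid_idx, cs[:-1], cs[1:])}

# Each dictionary value is a (k, 2) array of (row, col) pairs;
# transposing splits it into the two index arrays NumPy expects.
rows, cols = out[2].T
b = a.copy()
b[rows, cols] = 9  # replace every 2 with 9 (arbitrary example value)
print(b)
```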

Sum of all numbers in the first or second place in an array

I have a 2d list, for example:
list1 = [[1,2],[3,4],[5,6],[7,8]]
and I want to find the sum of all the numbers at the n'th place of every element.
For example if I want the answer for 0, I would calculate:
my_sum = list1[0][0] + list1[1][0] + list1[2][0] + list1[3][0]
or
my_sum = 0
place = 0
for i in range(len(list1)):
    my_sum += list1[i][place]
print(my_sum)
Output: 16
Is there a more elegant way to do this? Or one that uses only one line of code?
I mean as fictional code for example:
fictional_function(list1,place) = 16

Since you are looking for a functional solution, consider operator.itemgetter:
from operator import itemgetter
L = [[1,2],[3,4],[5,6],[7,8]]
res = sum(map(itemgetter(0), L)) # 16
For performance and simpler syntax, you can use a 3rd party library such as NumPy:
import numpy as np
A = np.array([[1,2],[3,4],[5,6],[7,8]])
res = A[:, 0].sum() # 16

As a generalization, if you want multiple indices (e.g. 0 and 1), you could use reduce combined with an element-wise sum, something like this:
from functools import reduce
def fictional_function(lst, *places):
    s_places = set(places)
    def s(xs, ys):
        return [x + y for x, y in zip(xs, ys)]
    return [x for i, x in enumerate(reduce(s, lst)) if i in s_places]
list1 = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(fictional_function(list1, 0))
print(fictional_function(list1, 0, 1))
print(fictional_function(list1, *[1, 0]))
Output
[16]
[16, 20]
[16, 20]
The idea is that the function s sums two lists element-wise, for example:
s([1, 2], [3, 4]) # [4, 6]
reduce then applies s across the list of lists, and finally the result is filtered to keep only the intended indices (places).

list1 = [[1,2],[3,4],[5,6],[7,8]]
ind = 0
sum_ind = sum(list(zip(*list1))[ind])
The above can also be written as a function that takes the list and the index as input and returns the sum at that index.
Here zip first groups the elements that share an index into their own tuples; we then pick the tuple for the index we want and pass it to sum.
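A minimal sketch of such a function follows, in two variants: one using the zip transposition described here and one using a generator expression that avoids building the transposed list. The function names are made up for this example.

```python
def column_sum_zip(rows, place):
    # Transpose with zip, pick the tuple for the wanted index, and sum it.
    return sum(list(zip(*rows))[place])


def column_sum(rows, place):
    # Add up the element at index `place` of every inner list directly.
    return sum(row[place] for row in rows)


list1 = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(column_sum_zip(list1, 0))  # 16
print(column_sum(list1, 1))      # 20
```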

Python 3.6. Get average Y for all same X coordinates

I have a list of coordinates that looks like this:
my_list = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]
As we can see, the first three coordinates share the same X value with different Y values, and the same is true for the other two coordinates. I want to make a new list which looks like this:
new_list = [[1, 3], [2, 2]]
where y1 = 3 = (1+3+5)/3 and y2 = 2 = (1+3)/2.
I have written my code which is below, but it works slowly.
I work with hundreds of thousands of coordinates, so the question is: how can I make this code faster? Is there any optimization or open source library that can speed it up?
Thank you in advance.
mass = my_list  # the coordinate list shown above
x_mass = []
for m in mass:
    x_mass.append(m[0])
set_x_mass = set(x_mass)
list_x_mass = list(set_x_mass)
performance_points = []
def function(i):
    unique_x_mass = []
    for m in mass:
        if m[0] == i:
            unique_x_mass.append(m)
    summ_y = 0
    for m in unique_x_mass:
        summ_y += m[1]
    point = [float(m[0]), float(summ_y/len(unique_x_mass))]
    performance_points.append(point)
    return performance_points
for x in list_x_mass:
    function(x)

Create a DataFrame and aggregate with mean:
import pandas as pd
L = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]
L1 = pd.DataFrame(L).groupby(0, as_index=False)[1].mean().values.tolist()
print (L1)
[[1.0, 3.0], [2.0, 2.0]]

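The pandas solution offered by @jezrael is elegant but slow (like almost everything pandas). I would suggest using the itertools and statistics modules. Note that itertools.groupby only groups consecutive items, so L must already be sorted by X, as it is in the example: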
from statistics import mean
from itertools import groupby
grouper = groupby(L, key=lambda x: x[0])
#The next line is again more elegant, but slower:
#grouper = groupby(L, key=operator.itemgetter(0))
[[x, mean(yi[1] for yi in y)] for x,y in grouper]
The result is, of course, the same. For the sample list, it runs about two orders of magnitude faster.
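If the coordinates are not sorted by X, a single pass with dictionaries avoids the sorting requirement of itertools.groupby. The following is a sketch under that assumption; the function name average_y_by_x is invented for this example, and no timing claim is made for it.

```python
from collections import defaultdict


def average_y_by_x(points):
    sums = defaultdict(float)   # running sum of Y per X
    counts = defaultdict(int)   # number of points seen per X
    for x, y in points:
        sums[x] += y
        counts[x] += 1
    # Emit the groups in ascending X order.
    return [[x, sums[x] / counts[x]] for x in sorted(sums)]


my_list = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]
print(average_y_by_x(my_list))  # [[1, 3.0], [2, 2.0]]
```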

Python: compute average of n-th elements in list of lists with different lengths

Suppose I have the following list of lists:
a = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]
]
I want to have the average of each n-th element in the arrays. However, when wanting to do this in a simple way, Python generated out-of-bounds errors because of the different lengths. I solved this by giving each array the length of the longest array, and filling the missing values with None.
Unfortunately, doing this made it impossible to compute an average, so I converted the arrays into masked arrays. The code shown below works, but it seems rather cumbersome.
import numpy as np
import numpy.ma as ma
a = [ [1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6] ]
# Determine the length of the longest list
lenlist = []
for i in a:
    lenlist.append(len(i))
max = np.amax(lenlist)
# Fill each list up with None's until required length is reached
for i in a:
    if len(i) <= max:
        for j in range(max - len(i)):
            i.append(None)
# Fill temp_list with the n-th element of each list,
# then add it to masked_arrays as a masked array
temp_list = []
masked_arrays = []
for j in range(max):
    for i in range(len(a)):
        temp_list.append(a[i][j])
    masked_arrays.append(ma.masked_values(temp_list, None))
    del temp_list[:]
# Compute the average of each array
avg_array = []
for i in masked_arrays:
    avg_array.append(np.ma.average(i))
print avg_array
Is there a way to do this more quickly? The final list of lists will contain 600000 'rows' and up to 100 'columns', so efficiency is quite important :-).

itertools.izip_longest (zip_longest in Python 3) does all the padding with None for you, so your code can be reduced to:
import numpy as np
import numpy.ma as ma
from itertools import izip_longest
a = [ [1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6] ]
averages = [np.ma.average(ma.masked_values(temp_list, None)) for temp_list in izip_longest(*a)]
print(averages)
[2.0, 3.0, 4.0, 6.0]
I don't know what the fastest way is as far as the numpy logic goes, but this is definitely going to be a lot more efficient than your own code.
If you want a faster pure Python solution:
from itertools import izip_longest, imap
a = [[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]]
def avg(x):
    x = filter(None, x)
    return sum(x, 0.0) / len(x)
filt = imap(avg, izip_longest(*a))
print(list(filt))
[2.0, 3.0, 4.0, 6.0]
If you have 0's in the arrays, that won't work, because 0 is falsy and gets filtered out. In that case you have to use a list comprehension to filter, but it will still be faster:
def avg(x):
    x = [i for i in x if i is not None]
    return sum(x, 0.0) / len(x)
filt = imap(avg, izip_longest(*a))
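The snippets above are written for Python 2. In Python 3, izip_longest is named zip_longest and imap no longer exists, so a rough equivalent of the pure Python version could look like this sketch:

```python
from itertools import zip_longest

a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]


def avg(column):
    # Drop the None padding added by zip_longest, but keep zeros.
    values = [v for v in column if v is not None]
    return sum(values, 0.0) / len(values)


averages = [avg(column) for column in zip_longest(*a)]
print(averages)  # [2.0, 3.0, 4.0, 6.0]
```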

Here's an almost* fully vectorized solution based on np.bincount and np.cumsum -
# Store lengths of each list and their cumulative and entire summations
lens = np.array([len(i) for i in a]) # Only loop to get lengths
C = lens.cumsum()
N = lens.sum()
# Create ID array such that the first element of each list is 0,
# the second element as 1 and so on. This is needed in such a format
# for use with bincount later on.
shifts_arr = np.ones(N,dtype=int)
shifts_arr[C[:-1]] = -lens[:-1]+1
id_arr = shifts_arr.cumsum()-1
# Use bincount to get the summations and thus the
# averages across all lists based on their positions.
avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)
* Almost, because we get the lengths of the lists with a loop; since that involves minimal computation, it should not affect the total runtime much.
Sample run -
In [109]: a = [ [1, 2, 3],
...: [2, 3, 4],
...: [3, 4, 5, 6] ]
In [110]: lens = np.array([len(i) for i in a])
...: C = lens.cumsum()
...: N = lens.sum()
...:
...: shifts_arr = np.ones(N,dtype=int)
...: shifts_arr[C[:-1]] = -lens[:-1]+1
...: id_arr = shifts_arr.cumsum()-1
...:
...: avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)
...:
In [111]: avg_out
Out[111]: array([ 2., 3., 4., 6.])
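To make the role of the ID array easier to follow, the same steps can be wrapped in a function that also returns the IDs. This is a sketch with a made-up function name, and it assumes that none of the inner lists is empty.

```python
import numpy as np


def positional_means(lists):
    lens = np.array([len(row) for row in lists])
    # Flat positions where the second and later lists begin.
    starts = lens.cumsum()[:-1]
    # A step of +1 everywhere, except a jump back to 0 at each list start.
    shifts = np.ones(lens.sum(), dtype=int)
    shifts[starts] = 1 - lens[:-1]
    ids = shifts.cumsum() - 1
    values = np.concatenate(lists)
    # Sum per position divided by count per position.
    means = np.bincount(ids, weights=values) / np.bincount(ids)
    return ids, means


a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]
ids, means = positional_means(a)
print(ids)    # position of each value within its own list
print(means)
```

For the sample input, ids restarts from 0 at the beginning of each list, which is what lets bincount group values by their position.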

You can already simplify the code that computes the max length. This single line does the job:
len(max(a,key=len))
Combined with the other answer, you get the result like so:
[np.mean([x[i] for x in a if len(x) > i]) for i in range(len(max(a,key=len)))]

You can also avoid the masked array and use np.nan instead:
import numpy as np
from itertools import zip_longest
def replaceNoneTypes(x):
    return tuple(np.nan if y is None else y for y in x)
averages = [np.nanmean(replaceNoneTypes(temp_list)) for temp_list in zip_longest(*a, fillvalue=np.nan)]

On your test array:
[np.mean([x[i] for x in a if len(x) > i]) for i in range(4)]
returns
[2.0, 3.0, 4.0, 6.0]

If you are using Python version >= 3.4, import mean from the statistics module:
from statistics import mean
If you are using a lower version, create a function to calculate the mean:
def mean(array):
    sum = 0
    if (not(type(array) == list)):
        print("there is some bad format in your input")
    else:
        for elements in array:
            try:
                sum = sum + float(elements)
            except:
                print("non numerical entry found")
        average = (sum + 0.0) / len(array)
        return average
Create a list of lists, for example
myList = [[1,2,3],[4,5,6,7,8],[9,10],[11,12,13,14],[15,16,17,18,19,20,21,22],[23]]
iterate through myList
for i, lists in enumerate(myList):
print(i, mean(lists))
This prints the index n and the average of the nth list.
To find the average of only the nth list, create a function:
def mean_nth(array, n):
    if((type(n) == int) and n >= 1 and type(array) == list):
        return mean(array[n-1])
    else:
        print("there is some bad format of your input")
Note that indexing starts from zero, so if you are looking for the mean of the 5th list, it is at index 4. This explains the n-1 in the code.
And then call the function, for example
avg_5thList = mean_nth(myList, 5)
print(avg_5thList)
Running the above code on myList yields following result:
0 2.0
1 6.0
2 9.5
3 12.5
4 18.5
5 23.0
18.5
where the first six lines are generated by the loop and display the index of each list and its average. The last line (18.5) is the average of the 5th list, returned by the mean_nth(myList, 5) call.
Further, for a list like yours,
a = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]
]
Let's say you want the average of the 1st elements, i.e. (1+2+3)/3 = 2, or the 2nd elements, i.e. (2+3+4)/3 = 3, or the 4th elements, i.e. 6/1 = 6. You need the length of each list so that you can tell whether the nth element exists in that list or not.
You can either
1) first sort the main list by the size of its constituent lists, and then go through the sorted list to check whether the constituent lists are long enough,
2) or iterate over the original list and check the length of each constituent list as you go.
(I can definitely get back with a faster recursive algorithm if needed.)
Computationally, the second one is more efficient. Assuming that your 5th element is the one at index 4 (of 0, 1, 2, 3, 4), i.e. the nth element is at index n-1, let's go with that and create a function:
def find_nth_average(array, n):
    if(not(type(n) == int and (int(n) >= 1))):
        return "Bad input format for n"
    else:
        if (not(type(array) == list)):
            return "Bad input format for main list"
        else:
            total = 0
            count = 0
            for i, elements in enumerate(array):
                if(not(type(elements) == list)):
                    return("non list constituent found at location " + str(i+1))
                else:
                    listLen = len(elements)
                    if(int(listLen) >= n):
                        try:
                            total = total + elements[n-1]
                            count = count + 1
                        except:
                            return ("non numerical entity found in constituent list " + str(i+1))
            if(int(count) == 0):
                return "No such n-element exists"
            else:
                average = float(total)/float(count)
                return average
Now let's call this function on your list a:
print(find_nth_average(a, 0))
print(find_nth_average(a, 1))
print(find_nth_average(a, 2))
print(find_nth_average(a, 3))
print(find_nth_average(a, 4))
print(find_nth_average(a, 5))
print(find_nth_average(a, 'q'))
print(find_nth_average(a, 2.3))
print(find_nth_average(5, 5))
The corresponding results are:
Bad input format for n
2.0
3.0
4.0
6.0
No such n-element exists
Bad input format for n
Bad input format for n
Bad input format for main list
If you have an irregular list, like
a = [[1, 2, 3], 2, [3, 4, 5, 6]]
that contains a non-list element, you get the output:
non list constituent found at location 2
If a constituent list is irregular, like:
a = [[1, 'p', 3], [2, 3, 4], [3, 4, 5, 6]]
that contains a non-numerical entity, and you ask for the average of the 2nd elements with print(find_nth_average(a, 2)),
you get the output:
non numerical entity found in constituent list 1
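When the input is known to be a well-formed list of numeric lists, the validation can be dropped and the per-position average becomes much shorter. The sketch below makes that assumption; the name nth_average is invented for this example, n is 1-based as in the function above, and it returns None instead of a message when no list is long enough.

```python
def nth_average(lists, n):
    # Collect the n-th element (1-based) from every list that has one.
    values = [row[n - 1] for row in lists if len(row) >= n]
    # Avoid dividing by zero when no list is long enough.
    return sum(values) / len(values) if values else None


a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]
print(nth_average(a, 1))  # 2.0
print(nth_average(a, 4))  # 6.0
print(nth_average(a, 5))  # None
```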
