Walk along 2D numpy array as long as values remain the same - python

Short description
I want to walk along a numpy 2D array starting from different points in specified directions (either 1 or -1) until the value in another column changes (see below).
Current code
First, let's generate a dataset:
import numpy as np
import time

# Generate a big random dataset:
# the first column is an id, the second a metadata value,
# the third a walking direction (1 or -1)
np.random.seed(123)
c1 = np.random.randint(0, 100, size=1000000)
c2 = np.random.randint(0, 20, size=1000000)
c3 = np.random.choice([1, -1], 1000000)
m = np.vstack((c1, c2, c3)).T
m = m[m[:, 0].argsort()]
Then I wrote the following code, which starts at specific rows of the matrix (start_array) and keeps extending in the specified direction (direction_array) until the metadata changes:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = start_mat[:, 1]
    direction_array = start_mat[:, 2]
    walk_array = start_array
    while True:
        walk_array = np.add(walk_array, direction_array)
        try:
            walk_mat = mat[walk_array]
            walk_metadata = walk_mat[:, 1]
            if sorted(metadata) != sorted(walk_metadata):
                raise IndexError
        except IndexError:
            return start_mat, mat[walk_array + (direction_array * -1)]
s = time.time()
for i in range(100000):
    start_points = np.random.randint(0, 1000000, size=3)
    res = walk(m, start_points)
Question
While the above code works fine, I think there must be an easier/more elegant way to walk along a numpy 2D array from different start points until the value of another column changes. For example, this approach requires me to slice the input array for every step of the while loop, which seems quite inefficient (especially when I have to run walk millions of times).

You don't have to index the whole input array inside the while loop; you can use just the column whose values you want to check.
I also refactored your code a little so that there is no while True statement and no if that raises an error for no particular reason.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:, 1])
    direction_array = start_mat[:, 2]
    data = mat[:, 1]
    walk_array = np.add(start_array, direction_array)
    try:
        while metadata == sorted(data[walk_array]):
            walk_array = np.add(walk_array, direction_array)
    except IndexError:
        pass
    return start_mat, mat[walk_array - direction_array]
In this particular case, if len(start_array) is large (thousands of elements), you could use collections.Counter instead of sorted to compare the metadata, as it will be much faster.
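For example, a minimal sketch of that multiset comparison (Counter builds the counts in linear time instead of sorting):
from collections import Counter
import numpy as np

a = np.array([3, 1, 2, 1])
b = np.array([1, 1, 3, 2])
# Multiset equality without sorting; .tolist() converts the
# numpy scalars to plain Python ints, which hash faster.
print(Counter(a.tolist()) == Counter(b.tolist()))  # True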
I was also thinking of another approach that builds an array of the desired slices in the correct direction.
This approach seems quite dirty, but I will post it anyway; maybe you will find it useful.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:, 1])
    direction_array = start_mat[:, 2]
    data = mat[:, 1]
    walk_slices = zip(*[
        data[start_array[i] + direction_array[i]::direction_array[i]]
        for i in range(len(start_array))
    ])
    for step, walk_metadata in enumerate(walk_slices):
        if metadata != sorted(walk_metadata):
            break
    return start_mat, mat[start_array + (direction_array * step)]

To perform the operation starting from a single row, define the following class:
class Walker:
    def __init__(self, tbl, row):
        self.tbl = tbl
        self.row = row
        self.dir = self.tbl[self.row, 2]

    # How many rows can I move from "row" in the indicated direction
    # while the metadata doesn't change?
    def numEq(self):
        # Metadata from "row" onwards in the required direction
        md = self.tbl[self.row::self.dir, 1]
        return ((md != md[0]).cumsum() == 0).sum() - 1

    # Get the row "n" positions from "row" in the indicated direction
    def getRow(self, n):
        return self.tbl[self.row + n * self.dir]
Then, to get the result, run:
def walk_2(m, start_points):
    # Create a walker for each starting point
    wlk = [Walker(m, n) for n in start_points]
    # How many rows can all walkers move together?
    dist = min([w.numEq() for w in wlk])
    # Return the rows at the farthest common position before any metadata changes
    return np.vstack([w.getRow(dist) for w in wlk])
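A minimal usage sketch against the m generated above (the start points are arbitrary):
start_points = np.random.randint(0, 1000000, size=3)
res = walk_2(m, start_points)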
The execution time of my code is roughly the same as yours,
but in my opinion my code is more readable and concise.

Related

How to create coordinates for nodes of graph

I have a list of strings (shown below). The values of each string are nothing but the navigation steps/actions of the same procedure performed by different users. I want to create coordinates for these steps/actions and store them for building a graph. Each unique step/action will have one coordinate. My idea is to consider the string with the most steps first and assign its steps coordinates ranging from (1,0) to (n,0); this first string has y = 0, meaning all of its actions sit in one layer. When I check the steps/actions of the second string, any missing ones get assigned (1,1) to (n,1), and so on. Care has to be taken that if the first steps/actions of one string fall in the middle of another, bigger string, their coordinates should come after that point.
This sounds confusing, but in simple terms, I want to create coordinates for the user flow of a website.
Assume the list:
A = ['A___O___B___C___D___E___F___G___H___I___J___K___L___M___N',
'A___O___B___C___E___D___F___G___H___I___J___K___L___M___N',
'A___B___C___D___E___F___G___H___I___J___K___L___M___N',
'A___B___C___E___D___F___G___H___I___J___K___L___M___N',
'A___Q___C___D___E___F___G___H___I___J___K___L___M___N',
'E___P___F___G___H___I___J___K___L___M___N']
I started the code below, but it is getting complicated. Any help is appreciated.
A1 = [i.split('___') for i in A]
# A1.sort(key=len, reverse=True)
A1 = sorted(A1, reverse=True)
if len(A1) > 1:
    Actions = {}
    horizontalVal = {}
    verticalVal = {}
    restActions = []
    for i in A1:
        for j in i[1:]:
            restActions.append(j)
    for i in range(len(A1)):
        if A1[i][0] not in restActions and A1[i][0] not in Actions.keys():
            Actions[A1[i][0]] = [i, 0]
            horizontalVal[A1[i][0]] = i
            verticalVal[A1[i][0]] = 0
    unmarkedActions = []
    for i in range(len(sortedLen)):  # NB: sortedLen is never defined in this snippet
        currLen = sortedLen[i]
        for j in range(len(A1)):
            if len(A1[j]) == currLen:
                if j == 0:
                    for k in range(len(A1[j])):
                        currK = A1[j][k]
                        if currK not in Actions.keys():
                            Actions[currK] = [k, 0]
                            horizontalVal[currK] = k
                            verticalVal[currK] = 0
                else:
                    currHori = []
                    print(A1[j])
                    for k in range(len(A1[j])):
                        currK = A1[j][k]
                        # ... to be continued

Find int values in a numpy array that are "close in value" and combine them

I have a numpy array with these values:
[10620.5, 11899., 11879.5, 13017., 11610.5]
import numpy as np
array = np.array([10620.5, 11899, 11879.5, 13017, 11610.5])
I would like to find values that are "close" (in this instance, 11899 and 11879.5), average them, and replace them with a single instance of the new number, resulting in this:
[10620.5, 11889.25, 13017, 11610.5]
The term "close" would be configurable; let's say a difference of 50.
The purpose of this is to create Spans on a Bokeh graph, and some lines are just too close together.
I am super new to Python in general (a couple of weeks of intense dev).
I would think that I could arrange the values in order, somehow grab the ones to the left and right, and do some math on them, replacing a match with the average value, but at the moment I just don't have any idea how.
Try something like this; I added a few extra steps just to show the flow.
The idea is to group the data into adjacent groups and decide whether to combine each group based on how spread out it is.
So, as you describe, you can combine your data in sets of 3 numbers, and if the difference between the max and min numbers is less than 50 you average them; otherwise you leave them as is.
import numpy as np

arr = np.ravel([1, 24, 5.3, 12, 8, 45, 14, 18, 33, 15, 19, 22])
arr.sort()

def reshape_arr(a, n):
    # n is the number of consecutive adjacent items you want to compare for averaging
    hold = len(a) % n
    if hold != 0:
        container = a[-hold:]  # numbers that do not fit in the array will be excluded from averaging
        a = a[:-hold].reshape(-1, n)
    else:
        a = a.reshape(-1, n)
        container = None
    return a, container

def get_mean(a, close):
    # close = how close adjacent numbers need to be in order to be averaged together
    my_list = []
    for i in range(len(a)):
        if a[i].max() - a[i].min() > close:
            for j in range(len(a[i])):
                my_list.append(a[i][j])
        else:
            my_list.append(a[i].mean())
    return my_list

def final_list(a, c):
    # add any elements held in the container to the final list
    if c is not None:
        c = c.tolist()
        for i in range(len(c)):
            a.append(c[i])
    return a

arr, container = reshape_arr(arr, 3)
arr = get_mean(arr, 5)
final_list(arr, container)
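For the sample arr above, this works out to (a quick hand-check; values rounded): the groups [12, 14, 15] and [18, 19, 22] each span less than 5 and get averaged, while [1, 5.3, 8] and [24, 33, 45] spread too far and are kept, giving
[1.0, 5.3, 8.0, 13.67, 19.67, 24.0, 33.0, 45.0]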
You could use fuzzywuzzy here to gauge the ratio of closeness between two data sets.
See details here: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/
Taking Gustavo's answer and tweaking it to my needs:
def reshape_arr(a, close):
    flag = True
    while flag:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace them with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace them with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a
This will do it if I call it like this:
candlesticks['support'] = reshape_arr(supres_df['support'], 150)
where candlesticks is the main DataFrame I am using and supres_df is another DataFrame that I am massaging before applying it to the main one.
It works, but it is extremely slow; I am trying to optimize it now.
I added a while loop because, after averaging, the averages themselves can become close enough to be averaged again, so I loop until nothing needs averaging anymore. This is total newbie work, so if you see something silly, please comment.
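For what it's worth, the grouping itself can be vectorized in numpy instead of replacing values one by one. Here is a minimal sketch; note it merges each chain of pairwise-close values in a single pass rather than re-averaging repeatedly, so results can differ slightly from the loop above:
import numpy as np

def merge_close(values, close):
    # Sort, then start a new group wherever the gap to the
    # previous value is at least `close`; average each group.
    v = np.sort(np.asarray(values, dtype=float))
    group_ids = np.concatenate(([0], (np.diff(v) >= close).cumsum()))
    sums = np.bincount(group_ids, weights=v)
    counts = np.bincount(group_ids)
    return sums / counts

print(merge_close([10620.5, 11899, 11879.5, 13017, 11610.5], 50))
# [10620.5   11610.5   11889.25  13017.  ]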

Compute Higher Moments of Data Matrix

This probably calls for scipy/numpy, but right now I'm happy with any functionality, as I couldn't find anything suitable in those packages. I have a matrix that contains data for a multivariate distribution (let's say 2 variables, for the fun of it). Is there any function to compute (higher) moments of that? All I could find was numpy.mean() and numpy.cov() :o
Thanks :)
/edit:
So, some more detail: I have multivariate data, that is, a matrix where rows are variables and columns are observations. Now I would like a simple way of computing the joint moments of that data, as defined in http://en.wikipedia.org/wiki/Central_moment#Multivariate_moments .
I'm pretty new to python/scipy, so I'm not sure I'd be the best person to code this one up, especially for the n-variable case (note that the Wikipedia definition is for n = 2), and I kind of expected there to be some out-of-the-box thing to use, as I thought this would be a standard problem.
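For reference, a minimal sketch of estimating a joint raw moment directly from the definition (raw_moment is a hypothetical helper, not a library function; data is laid out with rows as variables and columns as observations, as described above):
import numpy as np

def raw_moment(data, powers):
    # Sample estimate of E[X1**p1 * X2**p2 * ...] for
    # data of shape (n_variables, n_observations).
    powers = np.asarray(powers)[:, None]
    return np.prod(data ** powers, axis=0).mean()

data = np.random.rand(2, 1000)
print(raw_moment(data, [2, 1]))  # sample E[X**2 * Y]
# For central moments, subtract the row means first:
# raw_moment(data - data.mean(axis=1, keepdims=True), [2, 1])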
/edit2:
Just for the future, in case someone wants to do something similar, the following code (which is still under review) should give the sample equivalents of the raw moments E(X^2), E(Y^2), etc. It only works for two variables right now, but it should be extensible if one feels the need. If you see any mistakes or unclean/un-Pythonic code, feel free to comment.
from numpy import *

# this function should return something like
# moments[0] = 1
# moments[1] = mean(X), mean(Y)
# moments[2] = 1/n*X'X, 1/n*X'Y, 1/n*Y'Y
# moments[3] = mean(X'X'X), mean(X'X'Y), mean(X'Y'Y), mean(Y'Y'Y)
# etc.
def getRawMoments(data, moment, axis=0):
    a = moment
    if axis == 0:
        n = data.shape[1]
        X = matrix(data[0, :]).reshape((n, 1))
        Y = matrix(data[1, :]).reshape((n, 1))
    else:
        n = data.shape[0]
        X = matrix(data[:, 0]).reshape((n, 1))
        Y = matrix(data[:, 1]).reshape((n, 1))
    Z = hstack((X, Y))
    iota = ones((1, n))
    moments = {}
    moments[0] = 1
    # first, generate a huge-ass matrix containing all x-y combinations
    # for every power combination k, l such that k + l = i,
    # for all 0 <= i <= a
    for i in arange(1, a):
        if i == 2:
            moments[i] = moments[i - 1] * Z
        # if odd, postmultiply with Z.T
        elif i % 2 == 1:
            moments[i] = kron(moments[i - 1], Z.T)
        # if even, postmultiply with Z
        elif i % 2 == 0:
            temp = moments[i - 1]
            temp2 = temp[:, 0:n] * Z
            temp3 = temp[:, n:2 * n] * Z
            moments[i] = hstack((temp2, temp3))
    # since we now have many duplicated moments
    # such as x**2*y and x*y*x, filter out the non-distinct elements
    momentsDistinct = {}
    momentsDistinct[0] = 1
    for i in arange(1, a):
        if i % 2 == 0:
            data = 1 / n * moments[i]
        elif i == 1:
            temp = moments[i]
            data = 1 / n * (temp[:, 0:n] * iota.T)
        else:
            temp = moments[i]
            temp2 = temp[:, 0:n] * iota.T
            temp3 = temp[:, n:2 * n] * iota.T
            data = 1 / n * hstack((temp2, temp3))
        momentsDistinct[i] = unique(data.flat)
    return momentsDistinct

Can a python list hold a multi-dimensional array as its element?

I am trying to do image processing using python.
I try to create a list which holds numpy.ndarrays.
My code looks like this,
def Minimum_Close(Shade_Corrected_Image, Size):
    uint32_Shade_Corrected_Image = pymorph.to_int32(Shade_Corrected_Image)
    Angles = []
    [Row, Column] = Shade_Corrected_Image.shape
    Angles = [i * 15 for i in range(12)]
    Image_Close = [0 for x in range(len(Angles))]
    Image_Closing = numpy.zeros((Row, Column))
    for s in range(len(Angles)):
        Struct_Element = pymorph.seline(Size, Angles[s])
        Image_Closing = pymorph.close(uint32_Shade_Corrected_Image, Struct_Element)
        Image_Close[s] = Image_Closing
    Min_Close_Image = numpy.zeros(Shade_Corrected_Image.shape)
    temp_array = []
    Temp_Cell = numpy.zeros((Row, Column))
    for r in range(1, Row):
        for c in range(1, Column):
            for Cell in Image_Close:
                Temp_Cell = Image_Close[Cell]
                temp_array[Cell] = Temp_Cell[r][c]
            Min_Close_Image[r][c] = min(temp_array)
    Min_Close_Image = Min_Close_Image - Shade_Corrected_Image
    return Min_Close_Image
While running this code I'm getting error:
Temp_Cell = Image_Close[Cell]
TypeError: only integer arrays with one element can be converted to an index
How can I make a data structure which holds different multi-dimensional arrays and lets me traverse through them?
Making a list of arrays is not necessary when you're using numpy.
I suggest rewriting the whole function like this:
import numpy as np

def Minimum_Close(shade_corrected_image, size):
    uint32_shade_corrected_image = pymorph.to_int32(shade_corrected_image)
    angles = np.arange(12) * 15
    def pymorph_op(angle):
        struct_element = pymorph.seline(size, angle)
        return pymorph.close(uint32_shade_corrected_image, struct_element)
    image_close = np.dstack([pymorph_op(a) for a in angles])
    min_close_image = np.min(image_close, axis=-1) - shade_corrected_image
    return min_close_image
I lower-cased the variable names so that they stop getting highlighted as class names.
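To answer the title question directly: yes, a plain Python list can hold ndarrays of any shape; stacking them into a single array is simply more convenient for reductions. A tiny sketch with dummy data:
import numpy as np

layers = [np.zeros((4, 4)), np.ones((4, 4))]  # a list of 2-D arrays works fine
cube = np.dstack(layers)                      # shape (4, 4, 2)
per_pixel_min = cube.min(axis=-1)             # elementwise minimum across the list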
What about:
for cnt, Cell in enumerate(Image_Close):
    Temp_Cell = Image_Close[cnt]

Row, column assignment without for-loop

I wrote a small script to assign values to a numpy array by knowing their row and column coordinates:
import numpy as np

gridarray = np.zeros([3, 3])
gridarray_counts = np.zeros([3, 3])
# randint's upper bound is exclusive, so this draws from {0, 1, 2}
cols = np.random.randint(0, 3, 15)
rows = np.random.randint(0, 3, 15)
data = np.random.randint(0, 10, 15)
for nn in np.arange(len(data)):
    gridarray[rows[nn], cols[nn]] += data[nn]
    gridarray_counts[rows[nn], cols[nn]] += 1
This way I know how many values are stored in the same grid cell and what their sum is. However, performing this on arrays of length 100000+ gets quite slow. Is there another way without using a for-loop?
Is an approach like the following possible? (I know it does not work as written.)
gridarray[rows,cols] += data
gridarray_counts[rows,cols] += 1
I would use bincount for this, but for now bincount only takes 1-D arrays, so you'll need to write your own ndbincount, something like:
def ndbincount(x, weights=None, shape=None):
    if shape is None:
        shape = x.max(1) + 1
    x = np.ravel_multi_index(x, shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    out.shape = shape
    return out
Then you can do:
gridarray = np.zeros([3,3])
cols = np.random.randint(0, 3, 15)
rows = np.random.randint(0, 3, 15)
data = np.random.randint(0, 10, 15)
x = np.vstack([rows, cols])
temp = ndbincount(x, data, gridarray.shape)
gridarray = gridarray + temp
gridarray_counts = ndbincount(x, shape=gridarray.shape)
You can do this almost directly, but note that a plain fancy-indexed += does not accumulate repeated indices (which is exactly why the attempt in the question fails); np.add.at performs unbuffered in-place addition and handles duplicates correctly:
np.add.at(gridarray, (rows, cols), data)
np.add.at(gridarray_counts, (rows, cols), 1)
