Calculating mean value using ArcPy Statistics_analysis, in_memory - python

From the selected features I want to calculate the mean value.

arcpy.env.workspace = r"Database Connections\local.sde"
pLoc = "local.DBO.Parcels"
luLoc = "local.DBO.Land_Use"
luFields = ["MedYrBlt", "MedVal", "OCCount"]
arcpy.MakeFeatureLayer_management(pLoc, "cities_lyr")
arcpy.SelectLayerByAttribute_management("cities_lyr", "NEW_SELECTION", "YrBlt > 1000")

From the selected cities_lyr I want to calculate the mean of the YrBlt field:

with arcpy.da.SearchCursor(luLoc, ["OID@", "SHAPE@", luFields[0], luFields[1], luFields[2]]) as cursor:
    for row in cursor:
        if arcpy.Exists('in_memory/stats'):
            arcpy.Delete_management('in_memory/stats')
        arcpy.SelectLayerByLocation_management('cities_lyr', select_features=row[1])
        arcpy.Statistics_analysis('cities_lyr', 'in_memory/stats', 'YrBlt MEAN', 'OBJECTID')
Here comes the question: I just want to see the mean value; how can I do that?
The luFields = ["MedYrBlt", "MedVal", "OCCount"] fields are going to be used later and are not important for now.
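The Summary Statistics output table can be read back with another search cursor; a minimal sketch, assuming the output field follows the tool's usual MEAN_<field> naming (verify against your actual table):

# a minimal sketch, assuming the output field is named MEAN_YrBlt
with arcpy.da.SearchCursor('in_memory/stats', ['MEAN_YrBlt']) as stats_cursor:
    for stats_row in stats_cursor:
        print(stats_row[0])  # the mean YrBlt for the current selection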

Append the values to an empty list and then calculate the mean of that list. For example:

# create a list and cycle through the selected features, appending the YrBlt values
yrArray = []
with arcpy.da.SearchCursor("cities_lyr", ["YrBlt"]) as cursor:
    for row in cursor:
        yrArray.append(row[0])
# get the sum of all values in the list
x = 0
for i in yrArray:
    x += i
# get the average by dividing the sum by the length of the list
meanYrBlt = x / len(yrArray)
On another note, it may be beneficial to separate these processes into their own classes. For example:

class arrayAvg:
    def __init__(self, array):
        x = 0
        for i in array:
            x += i  # accumulate the sum of the values
        arrayLength = len(array)
        self.avg = x / arrayLength
        self.count = arrayLength
This way you can reuse the code by calling:

yrBltAvg = arrayAvg(yrArray)
avg = yrBltAvg.avg      # returns the average
count = yrBltAvg.count  # returns the count

The second portion is unnecessary, but it lets you take advantage of object-oriented programming, and you can expand on that throughout the program.
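On Python 3 the standard library can also do this directly; a minimal sketch, assuming yrArray is the list built above:

import statistics

meanYrBlt = statistics.mean(yrArray)  # same value as the manual loop
count = len(yrArray)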


Using Returned Value in Later Function

Let's say I wanted to call a function to do some calculation, but I also wanted to use that calculated value in a later function. When I return the value of the first function, can I not just send it to the next function? Here is an example of what I am talking about:
def add(x, y):
    addition = x + y
    return addition

def square(a):
    result = a * a
    return result

sum = add(1, 4)
product = square(addition)
If I call the add function, it'll return 5 as the addition result. But I want to use that number 5 in the next function; can I just send it along as shown? In the main program I am working on, it does not work like this.
Edit: this is a sample of the code I am actually working on, which will give a better idea of what the problem is. The problem occurs when I send the mean to the calculateStdDev function.
#import libraries to be used
import time
import StatisticsCalculations

#global variables
mean = 0
stdDev = 0

#get file from user
fileChoice = input("Enter the .csv file name: ")
inputFile = open(fileChoice)
headers = inputFile.readline().strip('\n').split(',')  # create headers for the columns and strip unnecessary characters

#create a list with header-number of lists in it
dataColumns = []
for i in headers:
    dataColumns.append([])  # fills the initial list with as many empty lists as there are columns

#count how many rows there are and add a column of data into each empty list
rowCount = 0
for row in inputFile:
    rowCount = rowCount + 1
    comps = row.strip().split(',')  # components of data
    for j in range(len(comps)):
        dataColumns[j].append(float(comps[j]))  # appends the jth entry into the jth column, separating data into categories

k = 0
for entry in dataColumns:
    # format each data entry to be right-aligned and correctly spaced in its column
    print("{:>11}".format(headers[k]), "|",
          "{:>10.2f}".format(StatisticsCalculations.findMax(dataColumns[k])), "|",
          "{:>10.2f}".format(StatisticsCalculations.findMin(dataColumns[k])), "|",
          "{:>10.2f}".format(StatisticsCalculations.calculateMean(dataColumns[k], rowCount)), "|",
          "{:>10.2f}".format())
    # printing a break line for each row
    k = k + 1  # counting until dataColumns is exhausted
inputFile.close()
And the StatisticsCalculations module:
import math

def calculateMean(data, rowCount):
    sumForMean = 0
    for entry in data:
        sumForMean = sumForMean + entry
    mean = sumForMean / rowCount
    return mean

def calculateStdDev(data, mean, rowCount, entry):
    stdDevSum = 0
    for x in data:
        stdDevSum = float(stdDevSum) + ((float(entry[x]) - mean) ** 2)  # getting the sum of squared differences to be used in the std dev formula
    stdDev = math.sqrt(stdDevSum / rowCount)  # using stdDevSum for the remaining part of the std dev formula
    return stdDev

def findMin(data):
    lowestNum = 1000
    for component in data:
        if component < lowestNum:
            lowestNum = component
    return lowestNum

def findMax(data):
    highestNum = -1
    for number in data:
        if number > highestNum:
            highestNum = number
    return highestNum
First of all, sum is the name of a built-in function, so you shouldn't shadow it with a variable.
You can do it this way:
def add(x, y):
    addition = x + y
    return addition

def square(a):
    result = a * a
    return result

s = add(1, 4)
product = square(s)
Or directly:
product = square(add(1, 4))
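Applied to the code in the question, the same idea is to capture calculateMean's return value and pass it straight into calculateStdDev. A minimal sketch with toy data; note it also iterates the values directly instead of indexing with entry[x], which is an assumed fix for the loop in the question:

import math

def calculateMean(data, rowCount):
    total = 0
    for entry in data:
        total = total + entry
    return total / rowCount

def calculateStdDev(data, mean, rowCount):
    stdDevSum = 0.0
    for x in data:
        stdDevSum += (x - mean) ** 2  # squared difference of each value from the mean
    return math.sqrt(stdDevSum / rowCount)

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m = calculateMean(values, len(values))
s = calculateStdDev(values, m, len(values))  # the returned mean feeds the next call
print(m, s)  # 5.0 2.0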

Find int values in a numpy array that are "close in value" and combine them

I have a numpy array with these values:
[10620.5, 11899., 11879.5, 13017., 11610.5]
import numpy as np
array = np.array([10620.5, 11899, 11879.5, 13017, 11610.5])
I would like to find values that are "close" (in this instance, 11899 and 11879.5), average them, and replace both with a single instance of the new number, resulting in this:
[10620.5, 11889.25, 13017, 11610.5]
the term "close" would be configurable. let's say a difference of 50
the purpose of this is to create Spans on a Bokah graph, and some lines are just too close
I am super new to python in general (a couple weeks of intense dev)
I would think that I could arrange the values in order, and somehow grab the one to the left, and right, and do some math on them, replacing a match with the average value. but at the moment, I just dont have any idea yet.
Try something like this; I added a few extra steps just to show the flow.
The idea is to group the data into adjacent groups and decide whether or not to combine them based on how spread out they are.
So, as you describe, you can combine your data in sets of 3 numbers, and if the difference between the max and min of a set is less than 50 you average them; otherwise you leave them as is.
import pandas as pd
import numpy as np

arr = np.ravel([1, 24, 5.3, 12, 8, 45, 14, 18, 33, 15, 19, 22])
arr.sort()

def reshape_arr(a, n):  # n is the number of consecutive adjacent items you want to compare for averaging
    hold = len(a) % n
    if hold != 0:
        container = a[-hold:]  # numbers that do not fit in the array will be excluded from averaging
        a = a[:-hold].reshape(-1, n)
    else:
        a = a.reshape(-1, n)
        container = None
    return a, container

def get_mean(a, close):  # close = how close adjacent numbers need to be in order to be averaged together
    my_list = []
    for i in range(len(a)):
        if a[i].max() - a[i].min() > close:
            for j in range(len(a[i])):
                my_list.append(a[i][j])
        else:
            my_list.append(a[i].mean())
    return my_list

def final_list(a, c):  # add any elements held in the container to the final list
    if c is not None:
        c = c.tolist()
        for i in range(len(c)):
            a.append(c[i])
    return a

arr, container = reshape_arr(arr, 3)
arr = get_mean(arr, 5)
final_list(arr, container)
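For reference, with the sample data above this returns [1.0, 5.3, 8.0, 13.67, 19.67, 24.0, 33.0, 45.0] (rounded): the two tight triples are averaged, the spread-out ones are kept as-is, and nothing is left in the container since 12 divides evenly by 3.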
You could use fuzzywuzzy here to gauge the ratio of closeness between two data sets.
See details here: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/
Taking Gustavo's answer and tweaking it to my needs:
def reshape_arr(a, close):
    flag = True
    while flag:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace them with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace them with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a
This will do it if I call it like this:

candlesticks['support'] = reshape_arr(supres_df['support'], 150)

where candlesticks is the main DataFrame I am using and supres_df is another DataFrame that I am massaging before applying it to the main one.
It works, but is extremely slow; I am trying to optimize it now.
I added a while loop because, after averaging, the averages can become close enough to be averaged again, so I keep looping until nothing needs averaging anymore. This is total newbie work, so if you see something silly, please comment.
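For the speed problem, a minimal vectorized sketch of the same idea: sort the values once, start a new group wherever the gap to the previous value reaches the threshold, and average each group with np.add.reduceat. Note it merges chains of small gaps in one pass, so results can differ slightly from the iterative version at group boundaries:

import numpy as np

def merge_close(values, close):
    v = np.sort(np.asarray(values, dtype=float))
    # a new group starts wherever the gap to the previous value is >= close
    starts = np.r_[0, np.flatnonzero(np.diff(v) >= close) + 1]
    sums = np.add.reduceat(v, starts)        # sum of each group
    counts = np.diff(np.r_[starts, len(v)])  # size of each group
    return sums / counts                     # mean of each group

print(merge_close([10620.5, 11899, 11879.5, 13017, 11610.5], 50))
# [10620.5  11610.5  11889.25 13017.  ]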

efficient way to split temporal Numpy vector automatically

I have a temporal vector as in the following image:
Numpy vector: https://drive.google.com/file/d/0B4Jac-wNMDxHS3BnUzBoUkdmOGs/view?usp=sharing
I would like to know an efficient way to split the vector in numpy and extract the 5 chunks of the signal that drop significantly in amplitude.
I could separate them by taking 2.302 as the cut-off amplitude: a chunk starts at the index where the signal drops below this value and ends at the index where it goes back above it.
Is there an efficient way to do this in numpy?
So I've programmed the solution in pure Python and lists:

import numpy as np
import matplotlib.pyplot as plt

vec = np.load('vector_numpy.npy')
# plt.plot(vec)
# plt.show()
print(vec.shape)

temporal_vec = []
flag = 0
flag_start = 0
flag_end = 0
all_vectors = []
all_index = []
count = -1
for element in vec:
    count = count + 1
    # print(element)
    if element < 2.302:
        if flag_start == 0:
            all_index.append(count)
            flag_start = 1
        temporal_vec.append(element)
        flag = 1
    if flag == 1:
        if element >= 2.302:
            if flag_start == 1:
                all_index.append(count)
                flag_start = 0
            all_vectors.append(temporal_vec)
            temporal_vec = []
            flag = 0

print(all_vectors)
for element in all_vectors:
    print(len(element))
    plt.plot(element)
    plt.show()
print(all_index)

Any fancier way in numpy, or better/shorter Python code?
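A minimal numpy sketch of the same threshold split, assuming the same 2.302 cut-off and that the signal starts and ends above the threshold: the boolean mask flips exactly at the chunk boundaries, so the flip indices pair up into (start, end) ranges.

import numpy as np

vec = np.load('vector_numpy.npy')
below = vec < 2.302                                  # mask of the low-amplitude samples
edges = np.flatnonzero(np.diff(below.astype(int)))   # indices where the mask flips
starts, ends = edges[::2] + 1, edges[1::2] + 1       # falling edge -> rising edge pairs
chunks = [vec[s:e] for s, e in zip(starts, ends)]    # the 5 low-amplitude chunks
boundaries = np.c_[starts, ends]                     # same info as all_index, as (start, end) rows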

Compute Higher Moments of Data Matrix

this probably leads to scipy/numpy, but right now I'm happy with any functionality, as I couldn't find anything in those packages. I have a matrix that contains data for a multivariate distribution (let's say 2 variables, for the fun of it). Is there any function to compute (higher) moments of that? All I could find were numpy.mean() and numpy.cov() :o
Thanks :)
/edit:
So, some more detail: I have multivariate data, that is, a matrix where rows are variables and columns are observations. Now I would like a simple way of computing the joint moments of that data, as defined in http://en.wikipedia.org/wiki/Central_moment#Multivariate_moments .
I'm pretty new to python/scipy, so I'm not sure I'd be the best person to code this one up, especially for the n-variable case (note that the Wikipedia definition is for n = 2), and I kind of expected there to be some out-of-the-box thing to use, as I thought this would be a standard problem.
/edit2:
Just for the future, in case someone wants to do something similar, the following code (which is still under review) should give the sample equivalents of the raw moments E(X^2), E(Y^2), etc. It only works for two variables right now, but it should be extendable if one feels the need. If you see any mistakes or unclean/unpythonic code, feel free to comment.
from numpy import *

# this function should return something like
# moments[0] = 1
# moments[1] = mean(X), mean(Y)
# moments[2] = 1/n*X'X, 1/n*X'Y, 1/n*Y'Y
# moments[3] = mean(X'X'X), mean(X'X'Y), mean(X'Y'Y), mean(Y'Y'Y)
# etc.
def getRawMoments(data, moment, axis=0):
    a = moment
    if axis == 0:
        n = int(data.shape[1])
        X = matrix(data[0, :]).reshape((n, 1))
        Y = matrix(data[1, :]).reshape((n, 1))
    else:
        n = int(data.shape[0])
        X = matrix(data[:, 0]).reshape((n, 1))
        Y = matrix(data[:, 1]).reshape((n, 1))
    Z = hstack((X, Y))
    iota = ones((1, n))
    moments = {0: 1}
    # first, generate a huge matrix containing all x-y combinations
    # for every power combination k, l such that k + l = i,
    # for all 0 <= i <= a
    for i in arange(1, a):
        if i == 2:
            moments[i] = moments[i - 1] * Z
        # if odd, postmultiply with Z.T
        elif i % 2 == 1:
            moments[i] = kron(moments[i - 1], Z.T)
        # else, postmultiply with Z
        elif i % 2 == 0:
            temp = moments[i - 1]
            temp2 = temp[:, 0:n] * Z
            temp3 = temp[:, n:2 * n] * Z
            moments[i] = hstack((temp2, temp3))
    # since we now have many duplicate moments, such as x**2*y and
    # x*y*x, filter out the non-distinct elements
    momentsDistinct = {0: 1}
    for i in arange(1, a):
        if i % 2 == 0:
            vals = 1.0 / n * moments[i]
        elif i == 1:
            temp = moments[i]
            vals = 1.0 / n * (temp[:, 0:n] * iota.T)
        else:
            temp = moments[i]
            temp2 = temp[:, 0:n] * iota.T
            temp3 = temp[:, n:2 * n] * iota.T
            vals = 1.0 / n * hstack((temp2, temp3))
        momentsDistinct[i] = unique(asarray(vals).flat)
    return momentsDistinct
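For comparison, a minimal modern-numpy sketch of the same sample raw moments, computed directly as averages of products of powers (a hypothetical helper, not part of the code above); it generalizes to any order and any number of variables:

import itertools
import numpy as np

def raw_moments(data, order):
    """Sample raw moments E[X1**k1 * ... * Xd**kd] for all k1+...+kd == order.
    `data` has one row per variable, one column per observation."""
    d, n = data.shape
    out = {}
    # every exponent combination (k1, ..., kd) with k1 + ... + kd == order
    for ks in itertools.product(range(order + 1), repeat=d):
        if sum(ks) == order:
            term = np.prod([data[v] ** k for v, k in enumerate(ks)], axis=0)
            out[ks] = term.mean()
    return out

rng = np.random.default_rng(0)
data = rng.normal(size=(2, 1000))  # 2 variables, 1000 observations
print(raw_moments(data, 2))        # E[X^2], E[XY], E[Y^2]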

Row, column assignment without for-loop

I wrote a small script to assign values to a numpy array by knowing their row and column coordinates:
import numpy as np

gridarray = np.zeros([3, 3])
gridarray_counts = np.zeros([3, 3])
cols = np.random.randint(0, 3, 15)   # random_integers is deprecated; randint's upper bound is exclusive
rows = np.random.randint(0, 3, 15)
data = np.random.randint(0, 10, 15)
for nn in np.arange(len(data)):
    gridarray[rows[nn], cols[nn]] += data[nn]
    gridarray_counts[rows[nn], cols[nn]] += 1
This way I know how many values are stored in each grid cell and what their sum is. However, performing this on arrays of length 100000+ gets quite slow. Is there another way without using a for-loop?
Is an approach similar to this possible? I know this is not working yet.
gridarray[rows,cols] += data
gridarray_counts[rows,cols] += 1
I would use bincount for this, but for now bincount only takes 1D arrays, so you'll need to write your own ndbincount, something like:
def ndbincount(x, weights=None, shape=None):
    if shape is None:
        shape = x.max(1) + 1
    x = np.ravel_multi_index(x, shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    out.shape = shape
    return out
Then you can do:
gridarray = np.zeros([3, 3])
cols = np.random.randint(0, 3, 15)
rows = np.random.randint(0, 3, 15)
data = np.random.randint(0, 10, 15)

x = np.vstack([rows, cols])
temp = ndbincount(x, data, gridarray.shape)
gridarray = gridarray + temp
gridarray_counts = ndbincount(x, shape=gridarray.shape)
You can almost do this directly, but with repeated (row, col) pairs a plain fancy-indexed += applies each cell only once; np.add.at does the unbuffered, accumulating version:

np.add.at(gridarray, (rows, cols), data)
np.add.at(gridarray_counts, (rows, cols), 1)
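A quick demonstration of the duplicate-index behaviour with toy values:

import numpy as np

a = np.zeros(3)
idx = np.array([0, 0, 1])
a[idx] += 1
print(a)  # [1. 1. 0.]  -- the duplicated index 0 is only applied once

b = np.zeros(3)
np.add.at(b, idx, 1)
print(b)  # [2. 1. 0.]  -- np.add.at accumulates repeated indices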
