Related
I have a variable with zeros and ones. Each sequence of ones represent "a phase" I want to observe, each sequence of zeros represent the space/distance that intercurr between these phases.
It may happen that a phase carries a sort of "impulse response", for example it can be the echo of a voice: in this case we will have 1,1,1,1,0,0,1,1,1,0,0,0 as an output, the first sequence ones is the shout we made, while the second one is just the echo cause by the shout.
So I made a function that doesn't take into account the echos/response of the main shout/action, and convert the ones sequence of the echo/response into zeros.
(1) If the sequence of zeros is greater or equal than the input threshold nearby_thr the function will recognize that the sequence of ones is an independent phase and it won't delete or change anything.
(2) If the sequence of zeros (between two sequences of ones) is smaller than the input threshold nearby_thr the function will recognize that we have "an impulse response/echo" and we do not take that into account. Infact it will convert the ones into zeros.
I made a naive function that can accomplish this result but I was wondering if pandas already has a function like that, or if it can be accomplished in few lines, without writing a "C-like" function.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
# import utili_funzioni.util00 as ut0
x1 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.DataFrame([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
# rule = x1==1 ## counting number of consecutive ones
# cumsum_ones = rule.cumsum() - rule.cumsum().where(~rule).ffill().fillna(0).astype(int)
def detect_nearby_el_2(df, nearby_thr):
global el2del
# df = consecut_zeros
# i = 0
print("")
print("")
j = 0
enterOnce_if = 1
reset_count_0s = 0
start2detect = False
count0s = 0 # init
start2_getidxs = False # if this is not true, it won't store idxs to delete
el2del = [] # store idxs to delete elements
for i in range(df.shape[0]):
print("")
print("i: ", i)
x_i = df.iloc[i, 0]
if x_i == 1 and j==0: # first phase (ones) has been detected
start2detect = True # first phase (ones) has been detected
# j += 1
print("count0s:",count0s)
if start2detect == True: # first phase, seen/detected, --> (wait) has ended..
if x_i == 0: # 1st phase detected and ended with "a zero"
if reset_count_0s == 1:
count0s = 0
reset_count_0s = 0
count0s += 1
if enterOnce_if == 1:
start2_getidxs=True # avoiding to delete first phase
enterOnce_0 = 0
if start2_getidxs==True: # avoiding to delete first phase
if x_i == 1 and count0s < nearby_thr:
print("this is NOT a new phase!")
el2del = [*el2del, i] # idxs to delete
reset_count_0s = 1 # reset counter
if x_i == 1 and count0s >= nearby_thr:
print("this is a new phase!") # nothing to delete
reset_count_0s = 1 # reset counter
return el2del
def convert_nearby_el_into_zeros(df,idx):
df0 = df + 0 # error original dataframe is modified!
if len(idx) > 0:
# df.drop(df.index[idx]) # to delete completely
df0.iloc[idx] = 0
else:
print("no elements nearby to delete!!")
return df0
######
print("")
x1_2del = detect_nearby_el_2(df=x1,nearby_thr=3)
x2_2del = detect_nearby_el_2(df=x2,nearby_thr=3)
## deleting nearby elements
x1_a = convert_nearby_el_into_zeros(df=x1,idx=x1_2del)
x2_a = convert_nearby_el_into_zeros(df=x2,idx=x2_2del)
## PLOTTING
# ut0.grayplt()
fig1 = plt.figure()
fig1.suptitle("x1",fontsize=20)
ax1 = fig1.add_subplot(1,2,1)
ax2 = fig1.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x1)
line2, = ax2.plot(x1_a)
fig2 = plt.figure()
fig2.suptitle("x2",fontsize=20)
ax1 = fig2.add_subplot(1,2,1)
ax2 = fig2.add_subplot(1,2,2,sharey=ax1)
ax1.title.set_text("PRE-detect")
ax2.title.set_text("POST-detect")
line1, = ax1.plot(x2)
line2, = ax2.plot(x2_a)
You can see that x1 has two "response/echoes" that I want to not take into account, while x2 has none, infact nothing changed in x2
My question is: How this can be accomplished in few lines using pandas?
Thank You
Interesting problem, and I'm sure there's a more elegant solution out there, but here is my attempt - it's at least fairly performant:
x1 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1])
x2 = pd.Series([0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1,0])
def remove_echos(series, threshold):
starting_points = (series==1) & (series.shift()==0)
echo_starting_points = starting_points & series.shift(threshold)==1
echo_starting_points = series[echo_starting_points].index
change_points = series[starting_points].index.to_list() + [series.index[-1]]
for (start, end) in zip(change_points, change_points[1:]):
if start in echo_starting_points:
series.loc[start:end] = 0
return series
x1 = remove_echos(x1, 3)
x2 = remove_echos(x2, 3)
(I changed x1 and x2 to be Series instead of DataFrame, it's easy to adapt this code to work with a df if you need to.)
Explanation: we define the "starting point" of each section as a 1 preceded by a 0. Of those we define an "echo" starting point if the point threshold places before is a 1. (The assumption is that we don't have a phases which is shorter than threshold.) For each echo starting point, we zero from it to the next starting point or the end of the Series.
I have a set of data for which has an ID, timestamp, and identifiers. I have to go through it, calculate the entropy and save some other links for the data. At each step more identifiers are added to the identifiers dictionary and I have to re-compute the entropy and append it. I have really large amount of data and the program gets stuck due to growing number of identifiers and their entropy calculation after each step. I read the following solution but it is about the data consisting of numbers.
Incremental entropy computation
I have copied two functions from this page and the incremental calculation of entropy gives different values than the classical full entropy calculation at every step.
Here is the code I have:
from math import log
# ---------------------------------------------------------------------#
# Functions copied from https://stackoverflow.com/questions/17104673/incremental-entropy-computation
# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0
# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
S = S1+S2
return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)
# compute entropy using the classic equation
def entropy(L):
n = 1.0*sum(L)
return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#
# Below is the input data (Actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
["7","2008-01-06T02:13:00Z","x,y"]]
total_identifiers = {} # To store the occurrences of identifiers. Values shows the number of occurrences
all_entropies = [] # Classical way of calculating entropy at every step
updated_entropies = [] # Incremental way of calculating entropy at every step
for item in input_data:
temp = item[2].split(",")
identifiers_sum = sum(total_identifiers.values()) # Sum of all identifiers
old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1] # Get previous entropy calculation
for identifier in temp:
S_new = len(temp) # sum of new samples
temp_dictionaty = {a:1 for a in temp} # Store current identifiers and their occurrence
if identifier not in total_identifiers:
total_identifiers[identifier] = 1
else:
total_identifiers[identifier] += 1
current_entropy = entropy(total_identifiers.values()) # Entropy for current set of identifiers
updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
updated_entropies.append(updated_entropy)
entropy_value = entropy(total_identifiers.values()) # Classical entropy calculation for comparison. This step becomes too expensive with big data
all_entropies.append(entropy_value)
print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum) # Gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies) # print for comparison
print("All Updated Entropies: ", updated_entropies)
The other issue is that when I print "Sum of total_identifiers", it gives 12 instead of 14! (Due to very large amount of data, I read the actual file line by line and write the results directly to the disk and do not store it in the memory apart from the dictionary of identifiers).
The code above uses Theorem 4; it seems to me that you want to use Theorem 5 instead (from the paper in the next paragraph).
Note, however, that if the number of identifiers is really the problem then the incremental approach below isn't going to work either---at some point the dictionaries are going to get too large.
Below you can find a proof-of-concept Python implementation that follows the description from Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
import collections
import math
import random
def log2(p):
return math.log(p, 2) if p > 0 else 0
CountChange = collections.namedtuple('CountChange', ('label', 'change'))
class EntropyHolder:
def __init__(self):
self.counts_ = collections.defaultdict(int)
self.entropy_ = 0
self.sum_ = 0
def update(self, count_changes):
r = sum([change for _, change in count_changes])
residual = self._compute_residual(count_changes)
self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
self._update_counts(count_changes)
return self.entropy_
def _compute_residual(self, count_changes):
r = sum([change for _, change in count_changes])
residual = 0
for label, change in count_changes:
p_new = (self.counts_[label] + change) / (self.sum_ + r)
p_old = self.counts_[label] / (self.sum_ + r)
residual += p_new * log2(p_new) - p_old * log2(p_old)
return residual
def _update_counts(self, count_changes):
for label, change in count_changes:
self.sum_ += change
self.counts_[label] += change
def entropy(self):
return self.entropy_
def naive_entropy(counts):
s = sum(counts)
return sum([-(r/s) * log2(r/s) for r in counts])
if __name__ == '__main__':
print(naive_entropy([1, 1]))
print(naive_entropy([1, 1, 1, 1]))
entropy = EntropyHolder()
freq = collections.defaultdict(int)
for _ in range(100):
index = random.randint(0, 5)
entropy.update([CountChange(index, 1)])
freq[index] += 1
print(naive_entropy(freq.values()))
print(entropy.entropy())
Thanks #blazs for providing the entropy_holder class. That solves the problem. So the idea is to import entropy_holder.py from (https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738) and use it to store the previous entropy and update at every step when new identifiers come.
So the minimum working code would look like this:
import entropy_holder
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
["7","2008-01-06T02:13:00Z","x,y"]]
entropy = entropy_holder.EntropyHolder() # This class will hold the current entropy and counts of identifiers
for item in input_data:
for identifier in item[2].split(","):
entropy.update([entropy_holder.CountChange(identifier, 1)])
print(entropy.entropy())
This entropy by using the Blaz's incremental formulas is very close to the entropy calculated the classical way and saves from iterating over all the data again and again.
So, among the selected values want to calculate median value.
arcpy.env.workspace = r"Database Connections\local.sde"
pLoc = "local.DBO.Parcels"
luLoc = "local.DBO.Land_Use"
luFields = ["MedYrBlt","MedVal","OCCount"]
arcpy.MakeFeatureLayer_management(pLoc,"cities_lyr")
arcpy.SelectLayerByAttribute_management("cities_lyr", "NEW_SELECTION", "YrBlt > 1000")
from selected cities_lyr want to calculate mean value field from YrBlt
with arcpy.da.SearchCursor(luLoc, ["OID#", "SHAPE#", luFields[0], luFields[1], luFields[2]]) as cursor:
for row in cursor:
if arcpy.Exists('in_memory/stats'):
arcpy.Delete_management(r'in_memory/stats')
arcpy.SelectLayerByLocation_management('cities_lyr', select_features = row[1])
arcpy.Statistics_analysis('cities_lyr', 'in_memory/stats','YrBlt MEAN','OBJECTID')
Here comes a question:
I just want to see the mean value, how can I do that?
luFields = ["MedYrBlt","MedVal","OCCount"]
are going to be used later not important for now.
Append values to an empty array and then calculate mean of that array. For example:
# Create array & cycle through years, append values to array
yrArray =[]
for row in cursor:
val = getValue("yrBlt")
yrArray.append(val)
#get sum of all values in array
x = 0
for i in yrArray:
x += i
#get average by dividing above sum by the length of the array.
meanYrBlt = x / len(yrArray)
On another note it may be beneficial to separate these processes out into their own classes. For example:
class arrayAvg:
def __init__(self,array):
x = 0
for i in array:
x += 1
arrayLength = len(array)
arrayAvg = x/arrayLength
self.avg = arrayAvg
self.count = arrayLength
This way you can reuse the code by calling:
yrBltAvg = arrayAvg(yrArray)
avg = yrBltAvg.avg #returns average
count = yrBltAvg.count #returns count
The second portion is unnecessary, but allows you to take advantage of object oriented programming, and you can expand upon that throughout the program.
I'm writing a method for calculating the covariance of 2 to 8 time-series variables. I'm intending for the variables to be contained in list objects when they are passed to this method. The method should return 1 number, not a covariance matrix.
The method works fine the first time it's called. Anytime it's called after that, it returns a 0. An example is attached at the bottom, below my code. Any advice/feeback regarding the variable scope issues here would be greatly appreciated. Thanks!
p = [3,4,4,654]
o = [4,67,4,1]
class Toolkit():
def CovarianceScalar(self, column1, column2 = [], column3 = [], column4 = [],column5 = [],column6 = [],column7 = [],column8 = []):
"""Assumes all columns have length equal to Len(column1)"""
#If only the first column is passed, this will act as a variance function
import numpy as npObject
#This is a binary-style number that is assigned a value of 1 if one of the input vectors/lists has zero length. This way, the CovarianceResult variable can be computed, and the relevant
# terms can have a 1 added to them if they would otherwise go to 0, preventing the CovarianceResult value from incorrectly going to 0.
binUnityFlag2 = 1 if (len(column2) == 0) else 0
binUnityFlag3 = 1 if (len(column3) == 0) else 0
binUnityFlag4 = 1 if (len(column4) == 0) else 0
binUnityFlag5 = 1 if (len(column5) == 0) else 0
binUnityFlag6 = 1 if (len(column6) == 0) else 0
binUnityFlag7 = 1 if (len(column7) == 0) else 0
binUnityFlag8 = 1 if (len(column8) == 0) else 0
# Some initial housekeeping: ensure that all input column lengths match that of the first column. (Will later advise the user if they do not.)
lngExpectedColumnLength = len(column1)
inputList = [column2, column3, column4, column5, column6, column7, column8]
inputListNames = ["column2","column3","column4","column5","column6","column7","column8"]
for i in range(0,len(inputList)):
while len(inputList[i]) < lngExpectedColumnLength: #Empty inputs now become vectors of 1's.
inputList[i].append(1)
#Now start calculating the covariance of the inputs:
avgColumn1 = sum(column1)/len(column1) #<-- Each column's average
avgColumn2 = sum(column2)/len(column2)
avgColumn3 = sum(column3)/len(column3)
avgColumn4 = sum(column4)/len(column4)
avgColumn5 = sum(column5)/len(column5)
avgColumn6 = sum(column6)/len(column6)
avgColumn7 = sum(column7)/len(column7)
avgColumn8 = sum(column8)/len(column8)
avgList = [avgColumn1,avgColumn2,avgColumn3,avgColumn4,avgColumn5, avgColumn6, avgColumn7,avgColumn8]
#start building the scalar-valued result:
CovarianceResult = float(0)
for i in range(0,lngExpectedColumnLength):
CovarianceResult +=((column1[i] - avgColumn1) * ((column2[i] - avgColumn2) + binUnityFlag2) * ((column3[i] - avgColumn3) + binUnityFlag3 ) * ((column4[i] - avgColumn4) + binUnityFlag4 ) *((column5[i] - avgColumn5) + binUnityFlag5) * ((column6[i] - avgColumn6) + binUnityFlag6 ) * ((column7[i] - avgColumn7) + binUnityFlag7)* ((column8[i] - avgColumn8) + binUnityFlag8))
#Finally, divide the sum of the multiplied deviations by the sample size:
CovarianceResult = float(CovarianceResult)/float(lngExpectedColumnLength) #Coerce both terms to a float-type to prevent return of array-type objects.
return CovarianceResult
Example:
myInst = Toolkit() #Create a class instance.
First execution of the function:
myInst.CovarianceScalar(o,p)
#Returns -2921.25, the covariance of the numbers in lists o and p.
Second time around:
myInst.CovarianceScalar(o,p)
#Returns: 0.0
I belive that the problem you are facing is due to mutable default arguments. Basicily, when you first execute myInst.CovarianceScalar(o,p) all columns other than first two are []. During this execution, you change the arguments. Thus when you execute the same function as before, myInst.CovarianceScalar(o,p), the other columns in the arguments are not [] anymore. They take values of whatever value they have as a result of the first execution.
this probably leads to scipy/numpy, but right now I'm happy with any functionality as I couldn't find anything in those packages. I have a matrix that contains data for a multi-variate distribution (let's say, 2, for the fun of it). Is there any function to compute (higher) moments of that? All I could find was numpy.mean() and numpy.cov() :o
Thanks :)
/edit:
So some more detail: I have multivariate data, that is, a matrix where rows display variables and columns observations. Now I would like to have a simple way of computing the joint moments of that data, as defined in http://en.wikipedia.org/wiki/Central_moment#Multivariate_moments .
I'm pretty new to python/scipy so I'm not sure I'd be the best person to code this one up, especially for the n-variables case (note that the wikipedia definition is for n=2), and I kind of expected there to be some out-of-the-box thing to use as I thought this would be a standard problem.
/edit2:
Just for the future, in case someone wants to do something similar, the following code (which is still under review) should give the sample equivalent of the raw moments E(X^2), E(Y^2), etc. It only works for two variables right now, but it should be extendable if one feels the need. If you see some mistakes or unclean/unpython-nish code, feel free to comment.
from numpy import *
# this function should return something as
# moments[0] = 1
# moments[1] = mean(X), mean(Y)
# moments[2] = 1/n*X'X, 1/n*X'Y, 1/n*Y'Y
# moments[3] = mean(X'X'X), mean(X'X'Y), mean(X'Y'Y),
# mean(Y'Y'Y)
# etc
def getRawMoments(data, moment, axis=0):
a = moment
if (axis==0):
n = float(data.shape[1])
X = matrix(data[0,:]).reshape((n,1))
Y = matrix(data[1,:]).reshape((n,1))
else:
n = float(data.shape[0])
X = matrix(data[:,0]).reshape((n,1))
Y = matrix(data[:,1]).reshape((n,11))
result = 1
Z = hstack((X,Y))
iota = ones((1,n))
moments = {}
moments[0] = 1
#first, generate huge-ass matrix containing all x-y combinations
# for every power-combination k,l such that k+l = i
# for all 0 <= i <= a
for i in arange(1,a):
if i==2:
moments[i] = moments[i-1]*Z
# if even, postmultiply with X.
elif i%2 == 1:
moments[i] = kron(moments[i-1], Z.T)
# Else, postmultiply with X.T
elif i%2==0:
temp = moments[i-1]
temp2 = temp[:,0:n]*Z
temp3 = temp[:,n:2*n]*Z
moments[i] = hstack((temp2, temp3))
# since now we have many multiple moments
# such as x**2*y and x*y*x, filter non-distinct elements
momentsDistinct = {}
momentsDistinct[0] = 1
for i in arange(1,a):
if i%2 == 0:
data = 1/n*moments[i]
elif i == 1:
temp = moments[i]
temp2 = temp[:,0:n]*iota.T
data = 1/n*hstack((temp2))
else:
temp = moments[i]
temp2 = temp[:,0:n]*iota.T
temp3 = temp[:,n:2*n]*iota.T
data = 1/n*hstack((temp2, temp3))
momentsDistinct[i] = unique(data.flat)
return momentsDistinct(result, axis=1)