Addition of values in a 10x10 python pandas dataframe excluding the diagonal - python

Hi, I have a data frame example below. I want to add up all the off-diagonal values to calculate a misclassification score, which is 1 minus the misclassification rate. How do I add all the off-diagonal values up?
I have tried this code.
1 - (my_LINEARSVC_cross.iloc[0, 4] + my_LINEARSVC_cross.iloc[4, 0]) / np.sum(my_LINEARSVC_cross.values)
How can I amend this to add all the off-diagonal values?

Simply compute the sum of the whole matrix minus its trace (the sum of its diagonal), which in numpy would be
m.sum() - m.trace()
so if it is a pandas DataFrame you can convert it first:
import numpy as np
m = np.array(my_LINEARSVC_cross)
print(m.sum() - m.trace())
or pull the array out through .values:
print(my_LINEARSVC_cross.values.sum() - my_LINEARSVC_cross.values.trace())
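For example, a minimal sketch on a made-up 10x10 confusion matrix standing in for my_LINEARSVC_cross:
import numpy as np
import pandas as pd

# made-up 10x10 confusion matrix standing in for my_LINEARSVC_cross
rng = np.random.default_rng(0)
cm = pd.DataFrame(rng.integers(0, 5, size=(10, 10)) + 50 * np.eye(10, dtype=int))

m = cm.values
off_diagonal = m.sum() - m.trace()   # total number of misclassified samples
score = 1 - off_diagonal / m.sum()   # equivalently m.trace() / m.sum()
print(score)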

Related

How can I assign a subset of one DataArray with the non-NaN values from another in xarray?

Setting
I am programming a derivative function that treats points next to missing values differently (forward/backward differences instead of central ones).
Ansatz
In order to do that, I create a DataArray that contains only those points and their indices, and calculate the values there.
Problem
I can't find a way to create a DataArray that takes the values from my subselection array (the points next to missing ones, x_nan in the example) where that array is not NaN, and the values from my original DataArray everywhere else. The xr.where statement at the end of the example summarizes my desired behaviour quite well, but it fails due to unequal indices.
Minimal, Reproducible Example
import numpy as np
import xarray as xr
# I use the example dataarray from the xr.where() documentation
x = xr.DataArray(
    0.1 * np.arange(10),
    dims=["lat"],
    coords={"lat": np.arange(10)},
    name="sst",
)
# set some values to NaN to setup problem case
x[{'lat':slice(3,5)}] = np.nan
# create subselection array and calculate the correct values
# (just an assignment for the sake of simplicity)
x_nan = x.where(np.isnan(x.shift({'lat':-1})), drop=True)
x_nan[:] = np.arange(1,3.1,1)
xr.where(x.isnull(), x_nan, x)
This fails with:
ValueError: indexes along dimension 'lat' are not equal
I know I could use drop=False when creating x_nan, but if possible I would like to avoid that, as it would complicate the calculation of the x_nan values (not shown here). Thanks in advance!
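One possible way around the unequal indexes (an assumption, not something from the question) is to reindex x_nan back onto x's coordinates before the xr.where call, or to use combine_first:
# reindex x_nan onto x's coordinates so the 'lat' indexes line up (missing points become NaN)
result = xr.where(x.isnull(), x_nan.reindex_like(x), x)
# or, equivalently, take x_nan where it has values and fall back to x everywhere else
result = x_nan.combine_first(x)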

How to extract specific parts of a numpy array?

I have a correlation function that looks like the following.
I want to extract only the main peak of the function into a separate array. The central peak has the form of a Gaussian. I want to cut out the peak with a width of approximately four times the FWHM of the Gaussian peak. I have the correlation function stored in a numpy array. Any tips/ideas on how to approach this?
NumPy's argmax function returns the index of the maximum value of a numpy array. With that value you can then get the values around that index.
Example:
import numpy

m = numpy.argmax(arr)               # index of the maximum value
values = arr[m - width:m + width]   # 'width' is the number of samples to keep on each side
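If you want the cut width tied to the FWHM, one possible sketch (on a synthetic Gaussian standing in for the real correlation function) estimates the FWHM from the samples above half of the maximum and keeps roughly four times that width around the peak:
import numpy

# synthetic correlation function standing in for the real data: a Gaussian peak plus noise
x = numpy.linspace(-50, 50, 1001)
arr = numpy.exp(-x**2 / (2 * 3.0**2)) + 0.01 * numpy.random.rand(x.size)

m = numpy.argmax(arr)                        # index of the central peak
half_max = arr[m] / 2.0
fwhm = numpy.count_nonzero(arr > half_max)   # rough FWHM in samples (assumes one dominant peak)
width = 2 * fwhm                             # 2*FWHM on each side ~ 4*FWHM total width
peak = arr[max(m - width, 0):m + width]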

How to find peaks in 1d array

I am reading a CSV file in python and preparing a dataframe out of it. I have a Microsoft Kinect which is recording an arm abduction exercise and generating this CSV file.
I have this array of Y-coordinates of the ElbowLeft joint. You can visualize this here. Now, I want to come up with a solution which can count the number of peaks or local maxima in this array.
Can someone please help me to solve this problem?
You can use the find_peaks_cwt function from the scipy.signal module to find peaks within 1-D arrays:
from scipy import signal
import numpy as np
y_coordinates = np.array(y_coordinates) # convert your 1-D array to a numpy array if it's not, otherwise omit this line
peak_widths = np.arange(1, max_peak_width) # max_peak_width: the widest peak (in samples) you expect
peak_indices = signal.find_peaks_cwt(y_coordinates, peak_widths)
peak_count = len(peak_indices) # the number of peaks in the array
More information here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks_cwt.html
It's easy: put the data in a 1-D array and compare each value with its neighbours; a point n is a peak if the values at n-1 and n+1 are both smaller than the value at n.
Read data as Robert Valencia suggests
max_local = 0
for u in range(1, len(data) - 1):
    if data[u] > data[u - 1] and data[u] > data[u + 1]:
        max_local = max_local + 1
You could try to smooth the data with a smoothing filter and then find all values where the value before and after are less than the current value. This assumes you want all peaks in the sequence. The reason you need the smoothing filter is to avoid local maxima. The level of smoothing required will depend on the noise present in your data.
A simple smoothing filter sets the current value to the average of the N values before and N values after the current value in your sequence along with the current value being analyzed.
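A rough sketch of that idea, on a made-up noisy signal standing in for the ElbowLeft Y-coordinates:
import numpy as np

# made-up noisy signal standing in for the ElbowLeft Y-coordinates
t = np.linspace(0, 6 * np.pi, 600)
data = np.sin(t) + 0.1 * np.random.rand(t.size)

N = 5                                              # half-width of the smoothing window
kernel = np.ones(2 * N + 1) / (2 * N + 1)
smoothed = np.convolve(data, kernel, mode='same')  # simple moving-average filter

# a point is a peak if it is larger than both of its neighbours
peaks = np.flatnonzero((smoothed[1:-1] > smoothed[:-2]) &
                       (smoothed[1:-1] > smoothed[2:])) + 1
print(len(peaks), "peaks found")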

Get median value in each bin in a 2D grid

I have a 2-D array of coordinates, and each coordinate corresponds to a value z (like z = f(x, y)). Now I want to divide this whole 2-D coordinate set into, for example, 100 even bins and calculate the median value of z in each bin, then use the scipy.interpolate.griddata function to create an interpolated z surface. How can I achieve this in python? I was thinking of using np.histogram2d, but I don't think it has a median function. And I have a hard time understanding how scipy.stats.binned_statistic works. Can someone help me, please? Thanks.
With numpy.histogram2d you can both count the number of data points and sum them, which gives you what you need to compute the average.
I would try something like this:
import numpy as np
coo = np.array([np.arange(1000), np.arange(1000)]).T  # your array of coordinates

def func(x, y):
    return x * (1 - x) * np.sin(np.pi * x) / (1.5 + np.sin(2 * np.pi * y**2)**2)

z = func(coo[:, 0], coo[:, 1])

(n, ex, ey) = np.histogram2d(coo[:, 0], coo[:, 1], bins=100)               # counts per bin
(tot, ex, ey) = np.histogram2d(coo[:, 0], coo[:, 1], bins=100, weights=z)  # sum of z per bin
average = tot / n
average = np.nan_to_num(average)  # cure 0/0 for empty bins
print(average)
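Since the question asks for the median rather than the average, scipy.stats.binned_statistic_2d can compute it per bin directly; a minimal sketch reusing coo and z from above:
from scipy import stats

med, xedges, yedges, binnumber = stats.binned_statistic_2d(
    coo[:, 0], coo[:, 1], z, statistic='median', bins=100)
med = np.nan_to_num(med)   # empty bins come back as NaN
print(med)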
You'll need one or a few functions, depending on how you want to structure things:
A function to create the bins should take in your data, determine how big each bin is, and return an array (or a list of arrays).
Happy to help with this, but I would need more information about the data.
To get the median of the bins:
NumPy has a median function
http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.median.html
Essentially, the median of an array called bin would be:
numpy.median(bin)
Note: numpy.median accepts an axis argument, so if your bins are stacked in a 2-D array you can get the median of every bin at once: numpy.median(bins, axis=1) returns an array with the median of each bin (each row).
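For example, with 100 hypothetical bins stacked as rows of a 2-D array:
import numpy as np

bins = np.random.rand(100, 50)          # 100 hypothetical bins of 50 values each
bin_medians = np.median(bins, axis=1)   # one median per bin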
Updated
Not 100% sure about your example code, so here goes:
import numpy as np
# added some parentheses as I wasn't sure of the math; also removed the ;'s
def bincalc(x, y):
    return x * (1 - x) * (np.sin(np.pi * x)) / (1.5 + np.sin(2 * (np.pi * y)**2)**2)

coo = np.random.rand(1000, 2)

a = []
for x, y in coo:                 # evaluate the function at every coordinate pair
    a.append(bincalc(x, y))

z_med = np.median(a)
print(z_med)

pandas rolling_std only perform every Nth calculation

I am working on some code optimization. Currently I use the pandas rolling_mean and rolling_std to compute normalized cross-correlations of time series data from seismic instruments. For non-pertinent technical reasons I am only interested in every Nth value of the output of these pandas rolling mean and rolling std calls, so I am looking for a way to only compute every Nth value. I may have to write Cython code to do this, but I would prefer not to. Here is an example:
import pandas as pd
import numpy as np

As = 5000  # array size
ws = 150   # moving window size
N = 3      # only interested in every Nth value of the output array
ar = np.random.rand(As)               # generate a generic random array
RSTD = pd.rolling_std(ar, ws)[ws-1:]  # don't return the NaNs before the windows overlap
foo = RSTD[::N]                       # use array indexing to decimate RSTD to every Nth value
Is there a good pandas way to calculate only every Nth value of RSTD, rather than calculating all the values and decimating?
Thanks
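One possible sketch (plain numpy rather than a pandas API, assuming numpy >= 1.20 and reusing ar, ws and N from above) builds the windows as views and computes the std only for every Nth one:
from numpy.lib.stride_tricks import sliding_window_view

windows = sliding_window_view(ar, ws)[::N]     # keep only every Nth window (views, no copies)
rstd_decimated = windows.std(axis=1, ddof=1)   # ddof=1 matches pandas' rolling std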
