Apply SciPy newton method to optimize a pandas dataframe Weibull sum - python

I'm a novice programmer, but I know my way around Excel. However, I'm trying to teach myself Python to enable myself to work with much larger datasets and, primarily, because I'm finding it really interesting and enjoyable.
I'm trying to figure out how to recreate the Excel goal seek function (I believe SciPy's newton should be equivalent) within the script I have written. However, instead of defining a simple function f(x) whose root I want to find, I'm trying to find the root of the sum of a dataframe column, and I have no idea how to approach this.
My code up until the goal seek part is as follows:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import weibull_min
# need to use a gamma function later on, so import math
import math
%matplotlib inline
# create dataframe using lidar experimental data
df = pd.read_csv(r'C:\Users\Latitude\Documents\Coursera\Wind Resource\Proj' \
                 r'ect\Wind_Lidar_40and140.txt',
                 sep=' ',
                 header=None,
                 names=['Year','Month','Day','Hour','v_40','v_140'])
# add in columns for velocity cubed
df['v_40_cubed'] = df['v_40']**3
df['v_140_cubed'] = df['v_140']**3
# calculate mean wind speed, mean cubed wind speed, mean wind speed cubed
# use these to calculate energy patter factor, c and k
v_40_bar = df['v_40'].mean()
v_40_cubed_bar = df['v_40_cubed'].mean()
v_40_bar_cubed = v_40_bar ** 3
# energy pattern factor = epf
epf = v_40_cubed_bar / v_40_bar_cubed
# shape parameter = k
k_40 = 1 + 3.69/epf**2
# scale factor = c
# use imported math library to use gamma function math.gamma
c_40 = v_40_bar / math.gamma(1+1/k_40)
# create new dataframe from current, using bins of 0.25, and generate frequency for these bins
bins_1 = np.linspace(0,16,65,endpoint=True)
freq_df = df.apply(pd.Series.value_counts, bins=bins_1)
# tidy up the dataframe by dropping superfluous columns and adding in a % time column for
# frequency
freq_df_tidy = freq_df.drop(['Year','Month','Day','Hour','v_40_cubed','v_140_cubed'], axis=1)
freq_df_tidy['v_40_%time'] = freq_df_tidy['v_40']/freq_df_tidy['v_40'].sum()
# add in usable bin value for potential calculation of weibull
freq_df_tidy['windspeed_bin'] = np.linspace(0,16,64,endpoint=False)
# calculate weibull column and wind power density from the weibull fit
freq_df_tidy['Weibull_40'] = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c_40)/4
freq_df_tidy['Wind_Power_Density_40'] = 0.5 * 1.225 * freq_df_tidy['Weibull_40'] * freq_df_tidy['windspeed_bin']**3
# calculate wind power density from experimental data
df['Wind_Power_Density_40'] = 0.5 * 1.225 * df['v_40']**3
At this stage, the result from the Weibull data, round(freq_df_tidy['Wind_Power_Density_40'].sum(),2) gives 98.12.
The result from the experimental data, round(df['Wind_Power_Density_40'].mean(),2) gives 101.14.
My aim now is to optimise the parameter c_40, which is used in the Weibull-based wind power density calculation (98.12), so that the result of round(freq_df_tidy['Wind_Power_Density_40'].sum(),2) is as close as possible to the experimental wind power density (101.14).
Any help on this would be hugely appreciated. Apologies if I've entered too much code into the request - I wanted to provide as much detail as possible. From my research, I think the SciPy newton method should do the trick, but I can't figure out how to apply it here.
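One possible approach - a minimal sketch, assuming the variables defined above (df, freq_df_tidy, k_40 and c_40) are in scope - is to wrap the Weibull power-density sum in a function of the scale parameter c, subtract the experimental target, and hand that residual to scipy.optimize.newton. The helper name power_density_residual is made up for illustration.
from scipy.optimize import newton

# experimental wind power density we want the Weibull fit to reproduce (~101.14)
target = df['Wind_Power_Density_40'].mean()

def power_density_residual(c):
    # rebuild the Weibull-based power density for a trial scale parameter c
    weibull = weibull_min.pdf(freq_df_tidy['windspeed_bin'], k_40, loc=0, scale=c) / 4
    wpd = 0.5 * 1.225 * weibull * freq_df_tidy['windspeed_bin']**3
    return wpd.sum() - target

# start the root search from the analytic estimate of c
c_40_optimised = newton(power_density_residual, x0=c_40)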

Related

Pinescript correlation(source_a, source_b, length) -> to python

I need help translating the Pine correlation function to Python. I've already translated the stdev and swma functions, but this one is a bit confusing for me.
I've also found this explanation but didn't quite understand how to implement it:
in python try using pandas with .rolling(window).corr where window is the correlation coefficient period; pandas allows you to compute any rolling statistic by using rolling(). The correlation coefficient from pine is calculated with: sma(y*x,length) - sma(y,length)*sma(x,length) divided by stdev(length*y,length)*stdev(length*x,length), where stdev is based on the naïve algorithm.
Pine documentation for this function:
> Correlation coefficient. Describes the degree to which two series tend
> to deviate from their sma values.
> correlation(source_a, source_b, length) → series[float]
> RETURNS: Correlation coefficient.
> ARGUMENTS:
> source_a (series) Source series.
> source_b (series) Target series.
> length (integer) Length (number of bars back).
Using pandas is indeed the best option; TA-Lib also has a CORREL function. To give you a better idea of how the correlation function in Pine is implemented, here is Python code making use of numpy. Note that this is not an efficient solution.
import numpy as np
from matplotlib import pyplot as plt

def sma(src, m):
    coef = np.ones(m)/m
    return np.convolve(src, coef, mode="valid")

def stdev(src, m):
    a = sma(src*src, m)
    b = np.power(sma(src, m), 2)
    return np.sqrt(a - b)

def correlation(x, y, m):
    cov = sma(x*y, m) - sma(x, m)*sma(y, m)
    den = stdev(x, m)*stdev(y, m)
    return cov/den

ts = np.random.normal(size=500).cumsum()
n = np.linspace(0, 1, len(ts))
cor = correlation(ts, n, 14)
plt.subplot(2, 1, 1)
plt.plot(ts)
plt.subplot(2, 1, 2)
plt.plot(cor)
plt.show()
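For completeness, a minimal sketch of the more efficient pandas route mentioned above, assuming the same ts and n data; rolling(window).corr computes the rolling Pearson correlation directly.
import numpy as np
import pandas as pd

ts = pd.Series(np.random.normal(size=500).cumsum())
n = pd.Series(np.linspace(0, 1, len(ts)))
# rolling Pearson correlation over a 14-bar window; NaN until the window fills
cor = ts.rolling(14).corr(n)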

One Cycle Fourier Window optimization. My code is inefficient

Good day
EDIT:
What I want: from any current/voltage waveform on a Power System (PS) I want the filtered 50 Hz (fundamental) RMS magnitudes (and effectively their angles). The current as measured contains all harmonics from 100 Hz to 1250 Hz depending on the equipment. One cannot analyse and calculate using a wave with these harmonics; the error gets so big (depending on the equipment) that PS protection equipment calculates incorrect quantities. The attached signal also has many other frequency components involved.
My aim: PS protection relays are special and calculate a 20 ms window in a very short time. I'm not trying to replicate that. I'm using external recording equipment and testing that what the relays see is true and that they operate correctly. Thus I need to do what they do and keep only the 50 Hz values, without any harmonics or DC.
Important expected result: given any frequency component that may be in the signal, I want to see the magnitude of any given harmonic (e.g. 150 Hz and 250 Hz, the 3rd and 5th harmonics of the fundamental) as well as the magnitude of the DC. This will tell me what type of PS equipment possibly injects these frequencies. It is important that I can provide a frequency and get back a vector of that frequency only, with all other values filtered out.
The RMS of the fundamental differs from the plain RMS: roughly 4000 A (50 Hz only) versus 4500 A (with the other frequencies included).
This code calculates a One Cycle Fourier value (RMS) for a given frequency - sometimes called a Fourier filter, I think? I use it for Power System 50 Hz/0 Hz/150 Hz analogue analysis. (The answers have been tested and are correct fundamental RMS values; see https://users.wpi.edu/~goulet/Matlab/overlap/trigfs.html.)
For a large sample the code is very slow: for 55000 data points it takes 12 seconds, and for 3 voltages and 3 currents this gets to be VERY slow. I look at hundreds of records a day.
How do I enhance it? What Python tips, tricks, and libraries are there to speed up appending to my lists/arrays?
(Also feel free to rewrite or use the code.) I use the code to pick certain values out of a signal at given times (which is like reading the values from a specialized program for power system analysis).
Edited: I've included how I load the files and use them; the code works as pasted:
import matplotlib.pyplot as plt
import csv
import math
import numpy as np
import cmath

# FILES ATTACHED TO POST
filenamecfg = r"E:/Python_Practise/2019-10-21 13-54-38-482.CFG"
filename = r"E:/Python_Practise/2019-10-21 13-54-38-482.DAT"
t = []
IR = []
newIR = []
with open(filenamecfg, 'r') as csvfile1:
    cfgfile = [row for row in csv.reader(csvfile1, delimiter=',')]
    numberofchannels = int(np.array(cfgfile)[1][0])
    scaleval = float(np.array(cfgfile)[3][5])
    scalevalI = float(np.array(cfgfile)[8][5])
    samplingfreq = float(np.array(cfgfile)[numberofchannels+4][0])
    numsamples = int(np.array(cfgfile)[numberofchannels+4][1])
    freq = float(np.array(cfgfile)[numberofchannels+2][0])
    intsample = int(samplingfreq/freq)
    #TODO need to get number of samples and frequency and detect automatically
    #scaleval = np.array(cfgfile)[3]
print('multiplier:', scaleval)
print('SampFrq:', samplingfreq)
print('NumSamples:', numsamples)
print('Freq:', freq)

with open(filename, 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    for row in plots:
        t.append(float(row[1])/1000000)  # get time from us to s
        IR.append(float(row[6]))
newIR = np.array(IR) * scalevalI
t = np.array(t)
def mag_and_theta_for_given_freq(f, IVsignal, Tsignal, samples): # samples is the sample window size you want to calculate for (256 in my case)
    # f in hertz, IVsignal, Tsignal in numpy.array
    timegap = Tsignal[2] - Tsignal[3]
    pi = math.pi
    w = 2*pi*f
    Xr = []
    Xc = []
    Cplx = []
    mag = []
    theta = []
    #print("Calculating for frequency:", f)
    for i in range(len(IVsignal) - samples):
        newspan = range(i, i + samples)
        timewindow = Tsignal[newspan]
        #print("this is my time: ", timewindow)
        Sig20ms = IVsignal[newspan]
        N = len(Sig20ms)  # get number of samples of my current Freq
        RealI = np.multiply(Sig20ms, np.cos(w*timewindow))    # Get Real and Imaginary part of any signal for given frequency
        ImagI = -1*np.multiply(Sig20ms, np.sin(w*timewindow)) # Filters and calculates 1 WINDOW RMS (root mean square value).
        # calculate for whole signal and create a new vector. This is the RMS vector (used everywhere in power system analysis)
        Xr.append((math.sqrt(2)/N)*sum(RealI))  ### TAKES SO MUCH TIME
        Xc.append((math.sqrt(2)/N)*sum(ImagI))  ## these steps make RMS
        Cplx.append(complex(Xr[i], Xc[i]))
        mag.append(abs(Cplx[i]))
        theta.append(np.angle(Cplx[i]))  # th*180/pi can be used to get degrees if necessary
    # also for freq 0 (DC): if the offset is negative, how do I return a negative to indicate this when I'm using MAGnitude or Absolute value?
    return Cplx, mag, theta  # mag[:,1]#,theta # BUT THE MAGNITUDE WILL NEVER BE zero

myZ, magn, th = mag_and_theta_for_given_freq(freq, newIR, t, intsample)
plt.plot(newIR[0:30000], 'b', linewidth=0.4)  #, label='CFG has been loaded!')
plt.plot(magn[0:30000], 'r', linewidth=1)
plt.show()
The code as pasted runs smoothly given the files attached
Regards
EDIT: Please find a test csvfile and COMTRADE TEST files here:
CSV:
https://drive.google.com/open?id=18zc4Ms_MtYAeTBm7tNQTcQkTnFWQ4LUu
COMTRADE
https://drive.google.com/file/d/1j3mcBrljgerqIeJo7eiwWo9eDu_ocv9x/view?usp=sharing
https://drive.google.com/file/d/1pwYm2yj2x8sKYQUcw3dPy_a9GrqAgFtD/view?usp=sharing
Foreword
As I said in my previous comment:
Your code mainly relies on a for loop with a lot of indexation and
scalar operations. You already have imported numpy so you should take
advantage of vectorization.
This answer is a start towards your solution.
Lightweight MCVE
First we create a trial signal for the MCVE:
import numpy as np
# Synthetic signal sampler: 5 s sampled at 400 Hz
fs = 400 # Hz
t = 5 # s
t = np.linspace(0, t, fs*t+1)
# Synthetic signal: amplitude is about 325 V @ 50 Hz
A = 325 # V
f = 50 # Hz
y = A*np.sin(2*f*np.pi*t) # V
Then we can compute the RMS of this signal using the usual formula:
# Actual definition of RMS:
yrms = np.sqrt(np.mean(y**2)) # 229.75 V
Or alternatively we can compute it using DFT (implemented as rfft in numpy.fft):
# RMS using FFT:
Y = 2*np.fft.rfft(y)/y.size
Yrms = np.sqrt(np.real(Y[0]**2 + np.sum(Y[1:]*np.conj(Y[1:]))/2)) # 229.64 V
A demonstration of why this last formula works can be found here. It is valid because Parseval's theorem implies that the Fourier transform conserves energy.
Both versions make use of vectorized functions; there is no need to split the real and imaginary parts to perform the computation and then reassemble them into a complex number.
MCVE: Windowing
I suspect you want to apply this function as a window on a long-term time series where the RMS value changes over time. We can then tackle this problem using the pandas library, which provides time-series conveniences.
import pandas as pd
We encapsulate the RMS function:
def rms(y):
    Y = 2*np.fft.rfft(y)/y.size
    return np.sqrt(np.real(Y[0]**2 + np.sum(Y[1:]*np.conj(Y[1:]))/2))
We generate a damped signal (variable RMS):
y = np.exp(-0.1*t)*A*np.sin(2*f*np.pi*t)
We wrap our trial signal into a dataframe to take advantage of the rolling or resample methods:
df = pd.DataFrame(y, index=t*pd.Timedelta('1s'), columns=['signal'])
A rolling RMS of your signal is:
df['rms'] = df.rolling(int(fs/f)).agg(rms)
A periodically sampled RMS is:
df['signal'].resample('1s').agg(rms)
The latter returns:
00:00:00 2.187840e+02
00:00:01 1.979639e+02
00:00:02 1.791252e+02
00:00:03 1.620792e+02
00:00:04 1.466553e+02
Signal Conditioning
Addressing your need to keep only the fundamental harmonic (50 Hz), a straightforward solution could be a linear detrend (to remove the constant and linear trend) followed by a Butterworth band-pass filter.
We generate a synthetic signal with other frequencies and linear trend:
y = np.exp(-0.1*t)*A*(np.sin(2*f*np.pi*t) \
    + 0.2*np.sin(8*f*np.pi*t) + 0.1*np.sin(16*f*np.pi*t)) \
    + A/20*t + A/100
Then we condition the signal:
from scipy import signal
yd = signal.detrend(y, type='linear')
filt = signal.butter(5, [40,60], btype='band', fs=fs, output='sos', analog=False)
yfilt = signal.sosfilt(filt, yd)
Graphically it leads to:
In short, it comes down to applying the signal conditioning before the RMS computation.
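Putting the pieces together, here is a minimal sketch (assuming the fs, f, t, A, y and rms definitions from above) that applies the conditioning and then reuses the same rolling RMS:
from scipy import signal
import pandas as pd

# condition the raw samples: remove the linear trend, keep only the 40-60 Hz band
yd = signal.detrend(y, type='linear')
filt = signal.butter(5, [40, 60], btype='band', fs=fs, output='sos')
yfilt = signal.sosfilt(filt, yd)

# wrap the conditioned signal and apply the FFT-based rms() over one-cycle (20 ms) windows
dfc = pd.DataFrame(yfilt, index=t*pd.Timedelta('1s'), columns=['signal'])
dfc['rms'] = dfc['signal'].rolling(int(fs/f)).agg(rms)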

Is there a way to estimate Poisson interaction effect in python statsmodels?

Does statsmodels in Python have a way to estimate interaction with a 95% confidence interval? This would be the linear combination of the model's parameter estimates.
Given the example below, I would like to get the effect of being in arm 'b' among people in place 'there'. It would require estimating the linear combination of model parameters beta_arm + delta_arm:place, but also including the appropriate confidence interval.
I'm aware of mod.params and mod.conf_int(), but does statsmodels have other methods for linear combinations?
import random
import pandas as pd
import statsmodels.api as sm
import patsy
import numpy as np
cases = np.array([random.randint(0,10) for i in range(200)])
arm = [random.choice(['a', 'b']) for i in range(200)]
place = [random.choice(['here', 'there']) for i in range(200)]
df = pd.DataFrame({'arm': arm, 'place': place})
exog = patsy.dmatrix('arm + place + arm * place', df, return_type='dataframe')
mod = sm.GLM(endog=cases, exog=exog, family=sm.families.Poisson()).fit()
mod.summary()
Bollen's Delta Method is frequently used to get the confidence interval for the linear combination b1 * x + b2 * x * z.
I'm not sure how and to what extent Statsmodels incorporates the Delta Method.
If you want to go down the results.get_prediction route just make sure all the 'other covariates' (if any) are set to their sample or population mean.
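For what it's worth, a minimal sketch of one way statsmodels can report a linear combination of coefficients with a confidence interval is the results object's t_test method, which accepts a contrast (restriction) vector over the model's parameters. The vector below assumes patsy ordered the columns as Intercept, arm[T.b], place[T.there], arm[T.b]:place[T.there] (check exog.columns first), and mod is the fitted GLM from the question.
print(exog.columns)               # confirm the parameter order first
# contrast picking out beta_arm + delta_arm:place
combo = mod.t_test([0, 1, 0, 1])
print(combo)                      # estimate, standard error, z statistic, p-value
print(combo.conf_int())           # 95% confidence interval (on the log-rate scale)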

Python-Generating numbers according to a corellation matrix

Hi, I am trying to generate correlated data as close to the first table as possible (first three rows shown out of a total of 13). The correlation matrix for the relevant columns is also shown (corr_total).
I am trying the following code, which shows the error:
"LinAlgError: 4-th leading minor not positive definite"
from scipy.linalg import cholesky
# Correlation matrix
# Compute the (upper) Cholesky decomposition matrix
upper_chol = cholesky(corr_total)
# What should be here? The mu and sigma of one row of a table?
rnd = np.random.normal(2.57, 0.78, size=(10,7))
# Finally, compute the inner product of upper_chol and rnd
ans = rnd @ upper_chol
My question is what should go into the values of mu and sigma, and how to resolve the error shown above.
Thanks!
P.S. I have edited the question to show the original table. It shows data for four patients. I basically want to make synthetic data for more cases that replicates the patterns found in these patients.
Thank you for answering my question about what data you have access to. The error that you received was generated when you called cholesky. cholesky requires that your matrix be positive definite. One way to check whether a matrix is positive definite is to see if all of its eigenvalues are greater than zero. One of the eigenvalues of your correlation/covariance matrix is nearly zero, so I think cholesky is just being fussy. You can use scipy.linalg.sqrtm as an alternative decomposition.
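A quick way to see this for yourself - a sketch, assuming corr_total is the correlation matrix from the question:
import numpy as np

# eigenvalues of the symmetric matrix; any value at or barely above zero
# is what makes cholesky() reject it
print(np.linalg.eigvalsh(corr_total))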
For your question on the generation of multivariate normals, the random normal that you generate should be a standard random normal, i.e. a mean of 0 and a standard deviation of 1. Numpy provides a standard random normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.
from scipy.linalg import cholesky, sqrtm

relavant_columns = ['Affecting homelife',
                    'Affecting mobility',
                    'Affecting social life/hobbies',
                    'Affecting work',
                    'Mood',
                    'Pain Score',
                    'Range of motion in Doc']

# df is a pandas dataframe containing the data frame from figure 1
mu = df[relavant_columns].mean().values
cov = df[relavant_columns].cov().values
number_of_sample = 10

# generate using affine transformation
#c2 = cholesky(cov).T
c2 = sqrtm(cov).T
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)

# transpose so each row is a sample
s = s.T
Numpy also has a built-in function which can generate multivariate normals directly
s = np.random.multivariate_normal(mu, cov, size=number_of_sample)
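As a sanity check - a sketch, assuming the mu and cov computed above - the sample mean and covariance of a large draw should be close to the inputs:
check = np.random.multivariate_normal(mu, cov, size=10000)
print(check.mean(axis=0))  # should be close to mu
print(np.cov(check.T))     # should be close to cov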

Recreating time series data using FFT results without using ifft

I analyzed the sunspots.dat data (below) using fft, which is a classic example in this area. I obtained results from fft as real and imaginary parts. Then I tried to use these coefficients (the first 20) to recreate the data following the formula for the Fourier transform. Thinking the real parts correspond to a_n and the imaginary parts to b_n, I have:
import numpy as np
from scipy import *
from matplotlib import pyplot as gplt
from scipy import fftpack
def f(Y, x):
    total = 0
    for i in range(20):
        total += Y.real[i]*np.cos(i*x) + Y.imag[i]*np.sin(i*x)
    return total
tempdata = np.loadtxt("sunspots.dat")
year=tempdata[:,0]
wolfer=tempdata[:,1]
Y=fft(wolfer)
n=len(Y)
print(n)
xs = linspace(0, 2*pi,1000)
gplt.plot(xs, [f(Y, x) for x in xs], '.')
gplt.show()
For some reason, however, my plot does not mirror the one generated by ifft (I use the same number of coefficients on both sides). What could be wrong?
Data:
http://linuxgazette.net/115/misc/andreasen/sunspots.dat
When you called fft(wolfer), you told the transform to assume a fundamental period equal to the length of the data. To reconstruct the data, you have to use basis functions with the same fundamental period, i.e. angular frequency 2*pi/N. By the same token, your time index xs has to range over the time samples of the original signal.
Another mistake was forgetting to do the full complex multiplication. It's easier to think of this as Y[omega]*exp(1j*n*omega/N).
Here's the fixed code. Note I renamed i to ctr to avoid confusion with sqrt(-1), and n to N to follow the usual signal-processing convention of using lower case for a sample index and upper case for the total sample length. I also imported __future__ division to avoid confusion about integer division.
Something I forgot to add earlier: note that SciPy's fft doesn't divide by N after accumulating. I didn't divide this out before using Y[n]; you should if you want to get back the same numbers, rather than just seeing the same shape.
And finally, note that I am summing over the full range of frequency coefficients. When I plotted np.abs(Y), it looked like there were significant values in the upper frequencies, at least until sample 70 or so. I figured it would be easier to understand the result by summing over the full range, seeing the correct result, then paring back coefficients and seeing what happens.
from __future__ import division
import numpy as np
from scipy import *
from matplotlib import pyplot as gplt
from scipy import fftpack
def f(Y, x, N):
    total = 0
    for ctr in range(len(Y)):
        total += Y[ctr] * (np.cos(x*ctr*2*np.pi/N) + 1j*np.sin(x*ctr*2*np.pi/N))
    return real(total)
tempdata = np.loadtxt("sunspots.dat")
year=tempdata[:,0]
wolfer=tempdata[:,1]
Y=fft(wolfer)
N=len(Y)
print(N)
xs = range(N)
gplt.plot(xs, [f(Y, x, N) for x in xs])
gplt.show()
The answer from mtrw was extremely helpful and helped me answer the same question as the OP, but my head almost exploded trying to understand the nested loop.
Here's the last part but with numpy broadcasting (not sure if this even existed when the question was asked) rather than calling the f function:
xs = np.arange(N)
omega = 2*np.pi/N
phase = omega * xs[:,None] * xs[None,:]
reconstruct = Y[None,:] * (np.cos(phase) + 1j*np.sin(phase))
reconstruct = (reconstruct).sum(axis=1).real / N
# same output
plt.plot(reconstruct)
plt.plot(wolfer)
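As a quick cross-check - a sketch, assuming Y, N and reconstruct from above - the broadcasting reconstruction should agree with numpy's built-in inverse transform:
# np.fft.ifft performs the same sum (including the 1/N factor), so the two match
print(np.allclose(np.fft.ifft(Y).real, reconstruct))  # expected: True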
