Related
I am new to spectrogram and try to plot spectrogram by using relative velocity variations value of ambient seismic noise.
So the format of the data I have is 'time', 'station pair', 'velocity variation value' as below. (If error is needed, I can add it on the data)
2013-11-24,05_PK01_05_SS01,0.057039371136200
2013-11-25,05_PK01_05_SS01,-0.003328071661900
2013-11-26,05_PK01_05_SS01,0.137221779659000
2013-11-27,05_PK01_05_SS01,0.068823721831000
2013-11-28,05_PK01_05_SS01,-0.006876687060810
2013-11-29,05_PK01_05_SS01,-0.023895268916200
2013-11-30,05_PK01_05_SS01,-0.105762098404000
2013-12-01,05_PK01_05_SS01,-0.028069540807700
2013-12-02,05_PK01_05_SS01,0.015091601414300
2013-12-03,05_PK01_05_SS01,0.016353885353700
2013-12-04,05_PK01_05_SS01,-0.056654092859700
2013-12-05,05_PK01_05_SS01,-0.044520608528500
2013-12-06,05_PK01_05_SS01,0.020226437197700
...
But I searched for it, I can only see people using data of network, station, location, channel, or wav data.
Therefore, I have no idea what I have to start because the data format is different..
If you know some ways to get spectrogram by using 'value' of timeseries.
p.s. I would compute cross correlation with velocity variation value and other environmental data such as air temperature, air pressure etc.
###Edit (I add two pictures but the notice pops up that I cannot post images yet but only link)
I would write about groundwater level or other environmental data because those are easier to see variations.
The plot that I want to make similarly is from David et al., 2021 as below.
enter image description here
X axis shows time series and y axis shows cycles/day.
So when the light color is located at 1 then it means diurnal cycle (if 2, semidiurnal cycle).
Now I plot spectrogram and make the frequency as cycles / 1day.
enter image description here
But what I found to edit are two.
In the reference, it is normalized as log scale.
So I need to find the way to edit it as log scale.
In the reference, the x axis becomes 1*10^7.
But in my data, there are only 755 points in time series (dates in 2013-2015).
So what do I have to do to make x axis to time series?
p.s. The code I made
fil=pd.read_csv('myfile.csv')
cf=fil.iloc[:,1]
cf=cf/max(abs(cf))
nfft=128 #The number of data points
fs=1/86400 #Hz [0, fs/2] cycles / unit time
n=len(cf)
fr=fs/n
spec, freq, tt, pplot = pylab.specgram(cf, NFFT=nfft, Fs=fs, detrend=pylab.detrend,
window=pylab.window_hanning, noverlap=100, mode='psd')
pylab.title('%s' % e_n)
plt.colorbar()
plt.ylabel("Frequency (cycles / %s Day)" % str(1/fs/86400))
plt.xlabel("days")
plt.show()
If you look closely at it, wav data is basically just an array of numbers (sound amplitude), recorded at a certain interval.
Note: You have an array of equally spaced samples, but they are for velocity difference, not amplitude. So while the following is technically valid, I don't think that the resulting frequencies represent seismic sound frequencies?
So the discrete Fourier transform (in the form of np.fft.rfft) would normally be the right thing to use.
If you give the function np.fft.rfft() n numbers, it will return n/2+1 frequencies. This is because of the inherent symmetry in the transform.
However, one thing to keep in mind is the frequency resolution of FFT. For example if you take n=44100 samples from a wav file sampled at Fs=44100 Hz, you get a convenient frequency resolution of Fs/n = 1 Hz. Which means that the first number in the FFT result is 0 Hz, the second number is 1 Hz et cetera.
It seems that the sampling frequency in your dataset is once per day, i.e. Fs= 1/(24x3600) =0.000012 Hz. Suppose you have n = 10000 samples, then the FFT will return 5001 numbers, with a frequency resolution of Fs/n= 0.0000000012 Hz. That means that the highest frequency you will be able to detect from data sampled at this frequncy is 0.0000000012*5001 = 0.000006 Hz.
So the highest frequency you can detect is approximately Fs/2!
I'm no domain expert, but that value seems to be a bit low for seismic noise?
I have data from a number of high frequency data capture devices connected to generators on an electricity grid. These meters collect data in ~1 second "bursts" at ~1.25ms frequency, ie. fast enough to actually see the waveform. See below graphs showing voltage and current for the three phases shown in different colours.
This timeseries has a changing fundamental frequency, ie the frequency of the electricity grid is changing over the length of the timeseries. I want to roll this (messy) waveform data up to summary statistics of frequency and phase angle for each phase, calculated/estimated every 20ms (approx once per cycle).
The simplest way that I can think of would be to just count the gap between the 0 passes (y=0) on each wave and use the offset to calculate phase angle. Is there a neat way to achieve this (ie. a table of interpolated x values for which y=0).
However the above may be quite noisy, and I was wondering if there is a more mathematically elegant way of estimating a changing frequency and phase angle with pandas/scipy etc. I know there are some sophisticated techniques available for periodic functions but I'm not familiar enough with them. Any suggestions would be appreciated :)
Here's a "toy" data set of the first few waves as a pandas Series:
import pandas as pd, datetime as dt
ds_waveform = pd.Series(
index = pd.date_range('2020-08-23 12:35:37.017625', '2020-08-23 12:35:37.142212890', periods=100),
data = [ -9982., -110097., -113600., -91812., -48691., -17532.,
24452., 75533., 103644., 110967., 114652., 92864.,
49697., 18402., -23309., -74481., -103047., -110461.,
-113964., -92130., -49373., -18351., 24042., 75033.,
103644., 111286., 115061., 81628., 61614., 19039.,
-34408., -62428., -103002., -110734., -114237., -92858.,
-49919., -19124., 23542., 74987., 103644., 111877.,
115379., 82720., 62251., 19949., -33953., -62382.,
-102820., -111053., -114555., -81941., -62564., -19579.,
34459., 62706., 103325., 111877., 115698., 83084.,
62888., 20949., -33362., -61791., -102547., -111053.,
-114919., -82805., -62882., -20261., 33777., 62479.,
103189., 112195., 116380., 83630., 63843., 21586.,
-32543., -61427., -102410., -111553., -115374., -83442.,
-63565., -21217., 33276., 62024., 103007., 112468.,
116471., 84631., 64707., 22405., -31952., -61108.,
-101955., -111780., -115647., -84261.])
how to show variance of these data points over time? I used this plot to show them but because the time starts from 0 to 20 000 seconds and it is difficult to see all the points properly to observe the variance or invariance, the problem is: the points are overlapped to each other.
after zoom in
I finally could solve this problem by subtracting each time from the minimum time for each subject. Now all the times starts from 0 and the variance between subjects can be seen easily
Normalize your axes to 1 by dividing with the maximum value. Afterwards you can scale your axis by a factor X.
I'm developing a line chart. The data is being generated by a sensor and is a tuple (timestamp, value). Sensor creates a new datapoint every 60 seconds or so.
Now I want to display it in a graph and my limitation is about 900 points on then graph. In a daily view of that graph, I'd get about 1440 points and that's too much.
I'm looking for a general way how to shrink my dataset of any size to fixed size (in my case 900) while it keeps the timestamp distribution linear.
Thanks
I believe you are trying to resample your data. Your current sample rate is 1/60 samples per second and you are trying to get to 1/96 samples per second (900 / (24*60*60)). The ratio between the two rates is 5/8.
If you search for "python resample" you will find other similar questions and articles involving numpy and pandas which have built in routines for it.
To do it manually you can first upsample by 5 to get to 7200 samples per second and then downsample by 8 to get down to 900 samples per second.
To upsample you can make a new list five times as long and fill in every fifth element with your existing data. Then you can do, say, linear interpolation to fill in the gaps.
One you do that you can downsample by simply taking every eighth element.
Here's my final solution using pandas:
df = pd.read_json('co2.json')
# calculates the 'rule' parameter for resampling
seconds = int(df.tail(1)[0]) - int(df.head(1)[0])
rule = seconds // 960
df.index = pd.to_datetime(df[0], unit='s')
df.resample('%sS' % rule).mean()
I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).
The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.
The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.
So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)
The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).
My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped aces, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).
The nature of the times series data makes Python's numpy.correlate() unusable. I've also looked at R's Zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.
Anyone have any clues for what I can do, or approaches I should research?
Update 28 Feb 2011: Added some plots here showing examples of the data.
My interpretation of your question: Given two very long, noisy time series, find a shift of one that matches large 'bumps' in one signal to large bumps in the other signal.
My suggestion: interpolate the data so it's uniformly spaced, rectify and smooth the data (assuming the phase of the fast oscillations is uninteresting), and do a one-point-at-a-time cross correlation (assuming a small shift will line up the data).
import numpy
from scipy.ndimage import gaussian_filter
"""
sig1 and sig 2 are assumed to be large, 1D numpy arrays
sig1 is sampled at times t1, sig2 is sampled at times t2
t_start, t_end, is your desired sampling interval
t_len is your desired number of measurements
"""
t = numpy.linspace(t_start, t_end, t_len)
sig1 = numpy.interp(t, t1, sig1)
sig2 = numpy.interp(t, t2, sig2)
#Now sig1 and sig2 are sampled at the same points.
"""
Rectify and smooth, so 'peaks' will stand out.
This makes big assumptions about your data;
these assumptions seem true-ish based on your plots.
"""
sigma = 10 #Tune this parameter to get the right smoothing
sig1, sig2 = abs(sig1), abs(sig2)
sig1, sig2 = gaussian_filter(sig1, sigma), gaussian_filter(sig2, sigma)
"""
Now sig1 and sig2 should look smoothly varying, with humps at each 'event'.
Hopefully we can search a small range of shifts to find the maximum of the
cross-correlation. This assumes your data are *nearly* lined up already.
"""
max_xc = 0
best_shift = 0
for shift in range(-10, 10): #Tune this search range
xc = (numpy.roll(sig1, shift) * sig2).sum()
if xc > max_xc:
max_xc = xc
best_shift = shift
print 'Best shift:', best_shift
"""
If best_shift is at the edges of your search range,
you should expand the search range.
"""
If the data contains gaps of unknown sizes that are different in each time series, then I would give up on trying to correlate entire sequences, and instead try cross correlating pairs of short windows on each time series, say overlapping windows twice the length of a typical event (300 samples long). Find potential high cross correlation matches across all possibilities, and then impose a sequential ordering constraint on the potential matches to get sequences of matched windows.
From there you have smaller problems that are easier to analyze.
This isn't a technical answer, but it might help you come up with one:
Convert the plot to an image, and stick it into a decent image program like gimp or photoshop
break the plots into discrete images whenever there's a gap
put the first series of plots in a horizontal line
put the second series in a horizontal line right underneath it
visually identify the first correlated event
if the two events are not lined up vertically:
select whichever instance is further to the left and everything to the right of it on that row
drag those things to the right until they line up
This is pretty much how an audio editor works, so you if you converted it into a simple audio format like an uncompressed WAV file, you could manipulate it directly in something like Audacity. (It'll sound horrible, of course, but you'll be able to move the data plots around pretty easily.)
Actually, audacity has a scripting language called nyquist, too, so if you don't need the program to detect the correlations (or you're at least willing to defer that step for the time being) you could probably use some combination of audacity's markers and nyquist to automate the alignment and export the clean data in your format of choice once you tag the correlation points.
My guess is, you'll have to manually build an offset table that aligns the "matches" between the series. Below is an example of a way to get those matches. The idea is to shift the data left-right until it lines up and then adjust the scale until it "matches". Give it a try.
library(rpanel)
#Generate the x1 and x2 data
n1 <- rnorm(500)
n2 <- rnorm(200)
x1 <- c(n1, rep(0,100), n2, rep(0,150))
x2 <- c(rep(0,50), 2*n1, rep(0,150), 3*n2, rep(0,50))
#Build the panel function that will draw/update the graph
lvm.draw <- function(panel) {
plot(x=(1:length(panel$dat3))+panel$off, y=panel$dat3, ylim=panel$dat1, xlab="", ylab="y", main=paste("Alignment Graph Offset = ", panel$off, " Scale = ", panel$sca, sep=""), typ="l")
lines(x=1:length(panel$dat3), y=panel$sca*panel$dat4, col="red")
grid()
panel
}
#Build the panel
xlimdat <- c(1, length(x1))
ylimdat <- c(-5, 5)
panel <- rp.control(title = "Eye-Ball-It", dat1=ylimdat, dat2=xlimdat, dat3=x1, dat4=x2, off=100, sca=1.0, size=c(300, 160))
rp.slider(panel, var=off, from=-500, to=500, action=lvm.draw, title="Offset", pos=c(5, 5, 290, 70), showvalue=TRUE)
rp.slider(panel, var=sca, from=0, to=2, action=lvm.draw, title="Scale", pos=c(5, 70, 290, 90), showvalue=TRUE)
It sounds like you want to minimize the function (Ax'+By) + (Az'+Bx) + (Ay'+Bz) for a pair of values: Namely, the time-offset: t0 and a time scale factor: tr. where Ax' = tr*(Ax + t0), etc..
I would look into SciPy's bivariate optimize functions. And I would use a mask or temporarily zero the data (both Ax' and By for example) over the "gaps" (assuming the gaps can be programmatically determined).
To make the process more efficient, start with a coarse sampling of A and B, but set the precision in fmin (or whatever optimizer you've selected) that is commensurate with your sampling. Then proceed with progressively finer-sampled windows of the full dataset until your windows are narrow and are not down-sampled.
Edit - matching axes
Regarding the issue of trying to identify which axis is co-linear with a given axis, and not knowing at thing about the characteristics of your data, i can point towards a similar question. Look into pHash or any of the other methods outlined in this post to help identify similar waveforms.