I have an "ideal" formula in which I have a sum of values like
score[i] = SUM(properties[i]) * frequency[i] + recency[i]
with properties being a vector of values, and frequency and recency scalar values, all taken from a given dataset of N items. While all the variables here are numeric with discrete integer values, the recency value is a UNIX timestamp within a given time range (for example the last month, or the last week, on a daily basis).
In the dataset, each item i has a date value expressed as recency[i], a frequency value frequency[i], and the list properties[i]. All properties of item[i] are therefore evaluated on each day, expressed as recency[i], within the proposed time range.
According to this formula, the recency contribution to the score of item[i] is a negative one: the older the timestamp, the better the score (hence the + sign in that formula).
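For scale, here is a tiny illustrative computation (the item values below are made up, not taken from any dataset) showing why the raw UNIX timestamp needs rescaling before it can be summed with the other terms:
# purely illustrative values, not from any real dataset
properties_i = [3, 5, 2]           # vector of discrete property values
frequency_i = 4                    # scalar frequency
recency_i = 1_700_000_000          # raw UNIX timestamp, in seconds

score_i = sum(properties_i) * frequency_i + recency_i
print(score_i)                     # ~1.7e9: the timestamp term dwarfs the rest of the formula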
My idea was to use a rescaling approach over the given range, like
from sklearn.preprocessing import MinMaxScaler

# rescale the values into the range spanned by the recency vector itself
scaler = MinMaxScaler(feature_range=(min(recencyVec), max(recencyVec)))
scaler = scaler.fit(values)
normalized = scaler.transform(values)
where recencyVec collects the recency values of all data points, min(recencyVec) is the first day, and max(recencyVec) is the last day,
using the scikit-learn MinMaxScaler object, hence transforming the recency values by scaling each feature to the given range, as suggested in "How to Normalize and Standardize Time Series Data in Python".
Is this the correct approach for this numerical formulation? What alternative approaches are possible for normalizing the timestamp values when they are summed with other discrete numeric values?
Is recency then an absolute UNIX timestamp, or do you already subtract the current timestamp? If not, then depending on your goal it might be sufficient to simply take the difference between recency and the current UNIX timestamp, so that recency consistently describes a time delta ("seconds from now") instead of an absolute UNIX time.
Of course, that would still produce quite a large score, but it will be consistent.
What scaling you use depends on your goal (what is an acceptable score?), but many choices are valid as long as they are monotonic. In addition to min-max scaling (where I would suggest using 0 as the minimum and setting the maximum to some known maximum time offset), you might also want to consider a log transformation.
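As a rough sketch of those suggestions (the recency values, the "current" timestamp, and the one-month maximum offset below are all assumed for illustration; with fixed bounds the min-max step reduces to a simple division, so MinMaxScaler is not strictly needed):
import numpy as np

now = 1_700_000_000                                                     # assumed "current" UNIX timestamp
recencyVec = np.array([now - 86_400 * d for d in (1, 3, 7, 20, 29)])    # made-up recency values

# 1) work with a time delta ("seconds before now") instead of the absolute timestamp
deltas = now - recencyVec

# 2) min-max style scaling against a fixed, known range: 0 .. max_offset
max_offset = 30 * 86_400                                                # e.g. a known one-month window
scaled = deltas / max_offset                                            # values now lie in [0, 1]

# 3) alternatively, a log transformation (also monotonic)
logged = np.log1p(deltas)                                               # log(1 + seconds), safe at 0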
So I have a pandas DataFrame ('df_forecast') that consists of two columns: a timestamp column ('DI_Interval') and a column of values in log10 format ('RRP_SA').
I want to determine the log10 inverse of the 'RRP_SA' column and create it as a new column called 'RRPForecast', which I have done with the following code:
# Calculate the inverse log10 of the number and create as new column
df_forecast["RRPForecast"] = 10 ** df_forecast["RRP_SA"]
The issue is that the original 'RRP_SA' column has some negative numbers in it which also need to be converted back to their original number format. Could someone please help me edit my code so that it has the flexibility to do the log10 inverse of both positive and negative numbers?
Reason for having positive and negative log10 numbers: I am forecasting electricity prices for the spot market. In order to train my machine-learning model, both the negative and positive electricity prices are converted into log10 format so that the model trains faster. However, I need the prices back in their normal number format so that I can compare them with the actual electricity prices. Thank you.
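One possible direction, sketched under an explicit assumption: the question does not show the forward transform, and negative prices have no plain log10, so if a signed log10 such as sign(p) * log10(1 + |p|) was used to encode them, the inverse would undo both the sign and the magnitude:
import numpy as np

# Assumption: the forward transform was a signed log10, t = sign(p) * log10(1 + abs(p)).
# The matching inverse is p = sign(t) * (10**abs(t) - 1).
df_forecast["RRPForecast"] = np.sign(df_forecast["RRP_SA"]) * (10 ** df_forecast["RRP_SA"].abs() - 1)

# If instead a plain log10 of strictly positive prices was used, then
# 10 ** df_forecast["RRP_SA"] is already correct even for negative RRP_SA values:
# they simply correspond to prices between 0 and 1.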
So imagine I have a database with identifiers and a timestamp:
ID    Time_Stamp_Col
id1   2017-10-16 17:54:28
id2   2016-09-13 17:14:17
id3   2019-10-01 19:30:37
id4   2017-08-27 20:55:30
id5   2017-11-19 10:56:15
id6   2018-02-12 09:59:24
and an arbitrary number of timestamps (2 for this example):
2018-02-12 09:55:29
2017-11-19 10:21:12
How do I return a column that holds the minimum timestamp difference between Time_Stamp_Col and the arbitrary number of timestamps?
(I am using Python, so I am totally okay with making a loop to generate repetitive text to fit the arbitrary number of timestamps.)
I have this so far:
SELECT
    LEAST(DATEDIFF('2018-02-12 09:55:29', b.Time_Stamp_Col),
          DATEDIFF('2017-11-19 10:21:12', b.Time_Stamp_Col))
FROM
    DataBaseInQuestion b
But it is so incredibly slow. DataBaseInQuestion has 14 million rows. Is there a faster way?
Find the "median timestamp range" which has minimal summary difference for given "arbitrary number of timestamps".
If the amount of "arbitrary timestamps" is odd then this is the median timestamp. Take the timestamp equal to this median. If such timestamp not exists then take any timestamp within the range of the timestamps adjacent to the median timestamp or, if no such timestamp, take the timestamp closest to this range.
If the amount of "arbitrary timestamps" is even then this is a range between two median timestamps. Take any timestamp within this range or, if no such timestamp, take the timestamp closest to this range.
In both variants "closest timestamp" means "the timestamp which has minimal amount of arbitrary timestamps between self and closest range border, if there are a couple of such timestamp then take closest by the difference".
We need not a formula or theory here but a practical solution. Steps:
We have an array of "data timestamps". Let's say it is DTS[1..X]; it contains X timestamps.
We have an array of "arbitrary timestamps". Let's say it is ATS[1..N]; it contains N timestamps.
Calculate the indices of the two median elements in ATS (for an odd-sized array this will be the same element): N1 = (N+1) DIV 2; N2 = (N+2) DIV 2.
In DTS, find the timestamp DTS[K1] closest to but not above ATS[N2], and the timestamp DTS[K2] closest to but not below ATS[N1].
Calculate the "summary distance" (the sum of absolute differences to all elements of ATS) for DTS[K1] and DTS[K2].
If the sums are equal, then both elements and all elements between them (yes, they may not be adjacent in this case!) are the solution.
If they differ, the element with the least sum is the solution. It seems that in this case there cannot be more than one solution (but you may test its neighbors to make sure).
Why must this work?
Imagine that ATS contains only 2 timestamps. Take one DTS element between them; it has some difference sum. Move it 1 s to the left: the distance to the left ATS decreases by 1 s, the distance to the right one increases by 1 s, and the total sum is unchanged. Move it once more, and again... the sum stays constant until we reach the left ATS. Once we cross it, the sum increases by 2 for every 1 s move.
Now imagine a 3-element ATS. Again take one DTS element and put it over the middle ATS. Move it to the left or to the right by 1 s: the partial sum to the left and right ATS elements does not change, the distance to the middle one increases by 1 s, so the total sum increases by 1 s. Move further: once we cross an extreme ATS, the sum increases by 3 per step...
Extend this to 4, 5, ... elements in ATS. The timestamp with the minimal sum matches the median timestamp or the median timestamp range. Moving away from it increases the sum, and crossing an arbitrary timestamp increases the rate of increase.
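A small Python sketch of these steps (a sketch only: DTS and ATS are assumed to be sorted lists of UNIX seconds, and it returns a single best candidate rather than enumerating the whole range when the two sums are equal):
import bisect

def summary_distance(t, ats):
    # sum of absolute differences between candidate timestamp t and all arbitrary timestamps
    return sum(abs(t - a) for a in ats)

def best_data_timestamp(dts, ats):
    # dts: sorted "data timestamps", ats: sorted "arbitrary timestamps" (UNIX seconds)
    n = len(ats)
    n1 = (n + 1) // 2 - 1                      # lower median index (0-based)
    n2 = (n + 2) // 2 - 1                      # upper median index (0-based)

    i = bisect.bisect_right(dts, ats[n2]) - 1  # DTS element closest to but not above ATS[N2]
    j = bisect.bisect_left(dts, ats[n1])       # DTS element closest to but not below ATS[N1]

    candidates = [dts[k] for k in (i, j) if 0 <= k < len(dts)]
    return min(candidates, key=lambda t: summary_distance(t, ats))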
I have the following data in a pandas dataframe:
FileName Onsets Offsets
FileName1 [0, 270.78, 763.33] [188.56, 727.28, 1252.90]
FileName2 [0, 634.34, 1166.57, 1775.95, 2104.01] [472.04, 1034.37, 1575.88, 1970.79, 2457.09]
FileName3 [0, 560.97, 1332.21, 1532.47] [356.79, 1286.26, 1488.54, 2018.61]
These are data from audio files. Each row contains a series of onset and offset times for each of the sounds I'm researching. This means that the numbers are coupled, e.g. the second offset time marks the end of the sound that began at the second onset time.
To test a hypothesis, I need to select random offset times within various ranges. For instance, I need to multiply each offset time by a factor between 0.95 and 1.05 to create random adjustments within a +/- 5% range around the actual offset time; then 0.90 to 1.10, and so forth.
Importantly, the adjustment must not push the offset time earlier than the preceding onset time or later than the subsequent onset time. I think this means that I need to first calculate the largest acceptable adjustment for each offset time, and then set the maximum allowable adjustment for the whole dataset to whatever the lowest acceptable adjustment is. I'll be using this code for different datasets, so this maximum adjustment percentage shouldn't be hardcoded.
How can I code this function?
The code below generates adjustments, but I haven't figured out how to calculate and set the bounds yet.
import random

# note: a single random factor is drawn and applied to every offset in the array
Offsets_5 = Offsets * random.uniform(0.95, 1.05)
Offsets_10 = Offsets * random.uniform(0.90, 1.10)
Offsets_15 = Offsets * random.uniform(0.85, 1.15)
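One way to derive the bound described above, sketched under two assumptions that are not stated in the question: the Onsets/Offsets columns hold equal-length lists per file (as in the sample frame, with strictly positive offsets), and each offset is constrained only by its own onset below and by the next onset in the same file above:
import numpy as np

def max_allowed_pct(df):
    # Largest symmetric +/- percentage such that no scaled offset crosses
    # its own onset (lower bound) or the next onset in the same file (upper bound).
    worst = np.inf
    for _, row in df.iterrows():
        onsets, offsets = row["Onsets"], row["Offsets"]
        for k, off in enumerate(offsets):                   # assumes off > 0, as in the sample data
            lower = 1 - onsets[k] / off                     # room before hitting the current onset
            upper = onsets[k + 1] / off - 1 if k + 1 < len(onsets) else np.inf   # room before the next onset
            worst = min(worst, lower, upper)
    return worst

# usage: draw per-offset random factors inside the dataset-wide bound
# pct = max_allowed_pct(df)
# df["Offsets_adj"] = [list(np.array(o) * np.random.uniform(1 - pct, 1 + pct, len(o)))
#                      for o in df["Offsets"]]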
I am trying to align my data so that, when I later use another comparison method, the two data sets are aligned where they are most similar. So far I have cross-correlated the two Pandas Series and found the lag position of highest correlation. How can I then shift my data so that, when the Series are cross-correlated again, the new highest-correlation lag position is 0?
I have 4 fairly large Pandas Series. One of these Series is a Query to be compared to the other 3 Series and itself.
To find the offset of highest correlation between a query-target series pair, I have used np.correlate() and calculated the lag position of highest correlation. Having found this lag position, I have tried to incorporate this lag into each of the series so that, once the cross-correlation is recalculated, the new lag of highest correlation is 0. Unfortunately, this has not been very successful.
I feel there are a few ways I could be going wrong in my methodology here, I'm very new to coding, so any pointers will be very much appreciated.
What I Have So Far
Producing a DataFrame containing the original lag positions for highest correlation in each comparison.
import numpy as np
import pandas as pd

# peak position of each full cross-correlation, minus (approximately) the centre of the output
lags_o = pd.DataFrame({"a": [np.correlate(s4, s1, mode='full').argmax() - np.correlate(s4, s1, mode='full').size/2],
                       "b": [np.correlate(s4, s2, mode='full').argmax() - np.correlate(s4, s2, mode='full').size/2],
                       "c": [np.correlate(s4, s3, mode='full').argmax() - np.correlate(s4, s3, mode='full').size/2],
                       "d": [np.correlate(s4, s4, mode='full').argmax() - np.correlate(s4, s4, mode='full').size/2]})
When this is run I get the expected value of 0 for the "d" column, indicating that the two series are optimally aligned (which makes sense). The other columns return non-zero values, so now I want to incorporate these required shifts into the new cross-correlation.
# shifting the series by the lag given in lags_o for that comparison (shift() needs an integer)
s1_lagged = s1.shift(int(lags_o["a"].item()))
# selecting all non-NaN values in the series for the next correlation
s1_lagged = s1_lagged[~np.isnan(s1_lagged)]
# this is repeated for the other series - selecting the appropriate column of lags_o
What I expected to get back, when the query series and the newly shifted target series were passed to the cross-correlation again, was that each lag position in lags_n would be 0. However, this is not what I am getting at all. Even more confusingly, the new lag position does not seem to relate to the old lag position (as in, the lag does not seem to shift along with the shift applied to the series). I have tried shifting both the query and target series in turn but have not managed to get the required value.
So my question is: how should I correctly manipulate these Series so that I can align these data sets? Happy New Year, and thank you for your time and any suggestions you may have.
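A sketch of one way to make the re-correlation peak at lag 0 (an illustration under assumptions, not the questioner's actual data: two equally sampled Series, the lag measured from the zero-lag index len(target) - 1, and the NaNs produced by shift() filled rather than dropped, since dropping them moves the values straight back to their old positions):
import numpy as np
import pandas as pd

def align_to_query(query, target):
    # lag of the cross-correlation peak, measured from the zero-lag index (len(target) - 1)
    xcor = np.correlate(query, target, mode="full")
    lag = int(xcor.argmax() - (len(target) - 1))
    # shift the target by that lag and fill the gap: dropping the NaNs created by
    # shift() would collapse the values back to their original positions
    aligned = target.shift(lag).fillna(0.0)
    return aligned, lag

# toy check with made-up data: the target is the query advanced by 5 samples
query = pd.Series(np.sin(np.linspace(0, 20, 200)))
target = query.shift(-5).fillna(0.0)
aligned, lag = align_to_query(query, target)                       # lag comes out as 5
new_lag = np.correlate(query, aligned, mode="full").argmax() - (len(aligned) - 1)
print(lag, new_lag)                                                # new_lag should now be 0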
Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most likely (top x many) to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross correlation?
I would first convert the string to a numeric representation, so replace your - and _ with 1 and 0
You can do that using the string's replace method (e.g. signal = signal.replace("-", "1").replace("_", "0"))
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross-correlation value (degree of match) between your 2 signals.
You might want to take a look at the docs for np.correlate and np.convolve to decide which mode (full, same, or valid) you want to use, as that is determined by the length of your data and what you want to happen if your signals are different lengths.
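Putting those pieces together on the exact strings from the question (a sketch; it ignores the ~1 Hz vs ~30 Hz sampling difference, which would need DS2 resampled to DS1's rate before the comparison is meaningful):
import numpy as np

ds1 = "_-__-_--___----_-__--_-__---__"   # DS1.event from the question
ds2 = "__--_-__--"                       # DS2.event from the question

# replace the symbols with 1/0 and convert to integer lists
event1 = [int(x) for x in ds1.replace("-", "1").replace("_", "0")]
event2 = [int(x) for x in ds2.replace("-", "1").replace("_", "0")]

xcor = np.correlate(event1, event2, "full")
best = int(np.argmax(xcor))
offset = best - (len(event2) - 1)        # index into DS1 where DS2 lines up best
print("cross-correlation value:", xcor[best])
print("offset into DS1 (in DS1 samples):", offset)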