Generating data associated with a trend - python

I want to create 3 different datasets, each with a column of dates (dd/mm/yyyy). These dates need to fall in a range of about 3 months, say January 2019 to April 2019. The count for each date represents the number of searches on that date. Each dataset should have 2000 entries, and dates can be repetitive. The 3 datasets should be created such that one has an upward trend in the count, one has a downward trend in the count, and one is normally distributed.
Upward trend with time, i.e. increasing entries with time (lower count in the beginning, increasing moving forward).
Declining trend with time, i.e. decreasing entries with time (higher count in the beginning, decreasing moving forward).
I am able to generate a normal distribution using the data generator plugin of
www.generatedata.com
I am now interested in the other 2 use cases, i.e. the upward and declining trends. Can anyone advise me how to do the same? For the random distribution, I was also able to achieve it using the faker library:
from faker import Factory
from datetime import datetime
import random
import numpy as np

faker = Factory.create()

def date_between(d1, d2):
    f = '%b%d-%Y'
    return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))

def fakerecord():
    return {'ID': faker.numerify('######'),
            'S_date': date_between('jan01-2019', 'apr01-2019')
            }
Can anyone advise how I can incorporate trends into the dataset?
Thanks

You can do it like below.
The trend function defines your trend: if start is higher than end it is a downward trend, and vice versa. You can also control the rate of the trend by changing the difference between start and end.
import numpy as np
import pandas as pd

dates = pd.date_range("2019-1-1", "2019-4-1", freq="D")

def trend(count, start_weight=1, end_weight=3):
    lin_sp = np.linspace(start_weight, end_weight, count)
    return lin_sp / sum(lin_sp)

date_trends = np.random.choice(dates, size=20000, p=trend(len(dates)))
print("Total dates", len(date_trends))
print("counts of each date")
print(np.unique(date_trends, return_counts=True)[1])
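If you then want this as a two-column dataset (date plus search count), a small follow-up step like the sketch below would do it; the column names and file name are just placeholders:

import pandas as pd

# Collapse the sampled dates into one row per date with its count
trend_df = pd.Series(date_trends).value_counts().sort_index().reset_index()
trend_df.columns = ['date', 'searches']
trend_df.to_csv('upward_trend.csv', index=False)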

I edited my first answer to make it clearer.
With the function below you can set the relative probabilities of generating a search on the start and end dates of your choice.
For example, if starting_prob = 0.1 and ending_prob = 1.0, then the probability of seeing a search on the start date is 1/10 of the probability of seeing a search on the end date.
If starting_prob = 1.0 and ending_prob = 0.1, then the probability of seeing a search on the end date is 1/10 of the probability of seeing a search on the start date.
import datetime
import numpy as np

def random_dates(start, end, starting_prob=0.1, ending_prob=1.0, num_samples=2000):
    """
    Generate increasing or decreasing counts of datetimes between `start` and `end`.

    Parameters:
        start: string in format '%b%d-%Y' (i.e. 'Sep19-2019')
        end: string in format '%b%d-%Y'. Must be after start
        starting_prob: (float) relative probability of seeing a search on the first day
        ending_prob: (float) relative probability of seeing a search on the last day
        num_samples: number of dates in the list
    """
    start_date = datetime.datetime.strptime(start, '%b%d-%Y')
    end_date = datetime.datetime.strptime(end, '%b%d-%Y')
    # Get days between `start` and `end`
    num_days = (end_date - start_date).days
    linear_probabilities = np.linspace(starting_prob, ending_prob, num_days)
    # normalize probabilities so they add up to 1
    linear_probabilities /= np.sum(linear_probabilities)
    rand_days = np.random.choice(num_days, size=num_samples, replace=True,
                                 p=linear_probabilities)
    rand_date = [(start_date + datetime.timedelta(int(rand_days[ii]))).strftime('%b%d-%Y')
                 for ii in range(num_samples)]
    # return list of date strings
    return rand_date
You could use the function to generate different sets of dates (each with 20000 samples):
rdates_decreasing = random_dates("Jan01-2019", "Apr30-2019",
                                 starting_prob=1.0, ending_prob=0.1,
                                 num_samples=20000)
rdates_increasing = random_dates("Jan01-2019", "Apr30-2019",
                                 starting_prob=0.1, ending_prob=1.0,
                                 num_samples=20000)
rdates_random = random_dates("Jan01-2019", "Apr30-2019",
                             starting_prob=1.0, ending_prob=1.0,
                             num_samples=20000)
You can use pandas to save a csv file. Each column will have a list of dates.
import pandas as pd

pd.DataFrame({'dates_decreasing': rdates_decreasing,
              'dates_increasing': rdates_increasing,
              'dates_random': rdates_random,
              }).to_csv("path/to/datefile.csv", index=False)
You could convert your dates to counts in a data frame like this:
from collections import Counter
import matplotlib.pyplot as plt

# create dataframe with counts
df1 = pd.DataFrame({"dates_decreasing": list(Counter(rdates_decreasing).keys()),
                    "counts_decreasing": list(Counter(rdates_decreasing).values()),
                    "dates_increasing": list(Counter(rdates_increasing).keys()),
                    "counts_increasing": list(Counter(rdates_increasing).values()),
                    "dates_random": list(Counter(rdates_random).keys()),
                    "counts_random": list(Counter(rdates_random).values()),
                    })
# convert to datetime
df1['dates_decreasing']= pd.to_datetime(df1['dates_decreasing'])
df1['dates_increasing']= pd.to_datetime(df1['dates_increasing'])
df1['dates_random']= pd.to_datetime(df1['dates_random'])
# plot
fig, ax = plt.subplots()
ax.plot(df1.dates_decreasing, df1.counts_decreasing, "o", label = "decreasing")
ax.plot(df1.dates_increasing, df1.counts_increasing, "o", label = "increasing")
ax.plot(df1.dates_random, df1.counts_random, "o", label = "random")
ax.set_ylabel("count")
ax.legend()
fig.autofmt_xdate()
plt.show()

Related

Optimum way to plot sine wave from oscilloscope .csv with matplotlib in python

I wish to plot 1 or 2 periods of a sine wave from csv data obtained from an oscilloscope.
The DF columns are TIME and CH1.
I have managed to produce something workable, but it requires manual work for each new set of csv data. I am sure there is a better way.
The code I have finds the start and end times of one period by obtaining the row indexes where the amplitude is greater than zero but less than some manually inputted value obtained by trial and error. The thinking is that a sine wave will cross the x axis at the start and end of a period.
import pandas as pd
import matplotlib.pyplot as plot
import numpy as np

df = pd.read_csv("tek0000.csv", skiprows=20)
Sin_wave_vals = np.where((df['CH1'] > 0) & (df['CH1'] < 0.021))
# gives output (array([ 355,  604,  730, 1230, 1480, 1604, 1730, 2980, 3604, 3854, 4979,
#                      5230, 5980, 7730, 9355, 9980]),)
time_df = df['TIME']
# select the rows from the df for a single period
time_df = time_df.iloc[Sin_wave_vals[0][0]:Sin_wave_vals[0][1]]
amplitude_ch1 = df.iloc[355:604, 1]
plot.plot(time_df, amplitude_ch1)
plot.title('Sine wave')
plot.xlabel('Time')
plot.ylabel('Amplitude = sin(time)')
plot.grid(True, which='both')
plot.axhline(y=0, color='k')
plot.show()
This works OK and plots what I require, but it is too manual since I have about 20 of these to plot. I tried to obtain the upper limit automatically using the following:
upper_lim = min(filter(lambda x: x > 0, df['CH1']))
# returns 0.02
Sin_wave_vals = np.where((df['CH1'] > 0) & (df['CH1'] < upper_lim))
However, this did not work out how I intended: since upper_lim is itself the smallest positive value, the strict comparison CH1 < upper_lim matches nothing.
@Tim Roberts that works well, thank you.
This function works for a clean wave; with noise and a low amplitude there is an issue, as the wave can cross the x axis multiple times within a period (see the smoothing sketch after the function).
def Plot_wave(df, title, periods, time_col_iloc, amp_col_iloc):
    ## the time and amplitude column locations in the df are the final 2 arguments and should be int
    zero_crossings = np.where(np.diff(np.sign(df.iloc[:, amp_col_iloc])))[0]
    if periods == 1:
        time = df.iloc[zero_crossings[0]:zero_crossings[2], [time_col_iloc]]
        amplitude = df.iloc[zero_crossings[0]:zero_crossings[2], amp_col_iloc]
    if periods == 2:
        time = df.iloc[zero_crossings[0]:zero_crossings[4], [time_col_iloc]]
        amplitude = df.iloc[zero_crossings[0]:zero_crossings[4], amp_col_iloc]
    if periods > 2:
        return("please enter period of 1 OR 2")
    plot.plot(time, amplitude)
    plot.title(title)
    plot.xlabel('Time')
    plot.ylabel('Amplitude = sin(time)')
    plot.grid(True, which='both')
    plot.axhline(y=0, color='k')
    plot.show()
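One way to reduce the spurious crossings caused by noise (my own suggestion, not something from the thread) is to smooth the amplitude column with a rolling mean before taking the sign; the window size here is a guess that would need tuning per signal:

# Smooth before detecting sign changes so noise near zero does not
# create extra crossings; window=25 is an arbitrary starting point
smoothed = df['CH1'].rolling(window=25, center=True, min_periods=1).mean()
zero_crossings = np.where(np.diff(np.sign(smoothed)))[0]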

'The `start` argument could not be matched to a location related to the index of the data.'

I'm trying to forecast with a simple model.
Whenever I try to use the predict method I get the error 'The start argument could not be matched to a location related to the index of the data.'
Can anyone please help?
df_comp['Date'] = pd.to_datetime(df_comp['Date'])
df_comp= df_comp.set_index("Date")
size = int(len(df_comp)*0.8)
df, df_test = df_comp.iloc[:size], df_comp.iloc[size:]
model_ar = ARIMA(df.Fullmonth, order = (1,0,0))
results_ar = model_ar.fit()
start_date="2021-12-01"
end_date="2022-03-01"
df_pred_AR = results_ar.predict(start = start_date, end = end_date)
Judging by the documentation:
'What this means is that you cannot specify forecasting steps by dates, and the output of the forecast and get_forecast methods will not have associated dates. The reason is that without a given frequency, there is no way to determine what date each forecast should be assigned to.'
Perhaps this is the reason, so it is not possible to specify dates outside of the training data. I did the following: I took the number of test elements, in this case 11, and passed it to the predict function. As an example, I used data from web.DataReader and got a forecast, which is drawn with an orange line. When drawing, I used the test date indexes.
import pandas_datareader.data as web
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_model import ARIMA
df_comp = web.DataReader('^GSPC', 'yahoo', start='2022-02-15', end='2022-05-01')
x = len(df_comp)
size = int(x * 0.8)
index = x - size
df = df_comp.iloc[:size]
df_test = df_comp.iloc[size:]
model_ar = ARIMA(df.Close, order=(1, 0, 0))
results_ar = model_ar.fit()
df_pred_AR = results_ar.predict(1, index)
fig, ax = plt.subplots()
ax.plot(df_comp.index, df_comp['Close'].values, label='Price')
ax.plot(df_comp.index[size:], df_pred_AR)
plt.show()
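For what it's worth, the error in the question usually also goes away if the DatetimeIndex carries an explicit frequency, because statsmodels can then map date strings to forecast steps. A minimal sketch against the question's variables, assuming monthly data ('MS' is an assumption; use whatever spacing the Date column really has):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA  # the current ARIMA implementation

df_comp['Date'] = pd.to_datetime(df_comp['Date'])
df_comp = df_comp.set_index('Date').asfreq('MS')  # attach a frequency to the index

size = int(len(df_comp) * 0.8)
df, df_test = df_comp.iloc[:size], df_comp.iloc[size:]

results_ar = ARIMA(df.Fullmonth, order=(1, 0, 0)).fit()
# with a frequency on the index, predict() accepts date strings again
df_pred_AR = results_ar.predict(start="2021-12-01", end="2022-03-01")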

Fourier Result on Time Series explained python

I have passed my time series data, which is essentially pressure measurements from a sensor, through a Fourier transformation, similar to what is described in https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101.
The file used can be found here:
https://docs.google.com/spreadsheets/d/1MLETSU5Trl5gLGO6pv32rxBsR8xZNkbK/edit?usp=sharing&ouid=110574180158524908052&rtpof=true&sd=true
The related code is this:
import pandas as pd
import numpy as np
file='test.xlsx'
df=pd.read_excel(file,header=0)
#df=pd.read_csv(file,header=0)
df.head()
df.tail()
# drop ID
df=df[['JSON_TIMESTAMP','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_ADH_COATWEIGHT_SP']]
# extract year month
df["year"] = df["JSON_TIMESTAMP"].str[:4]
df["month"] = df["JSON_TIMESTAMP"].str[5:7]
df["day"] = df["JSON_TIMESTAMP"].str[8:10]
df= df.sort_values( ['year', 'month','day'],
ascending = [True, True,True])
df['JSON_TIMESTAMP'] = df['JSON_TIMESTAMP'].astype('datetime64[ns]')
df = df.sort_values(by='JSON_TIMESTAMP', ascending=True)
df1=df.copy()
df1 = df1.set_index('JSON_TIMESTAMP')
df1 = df1[["ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB"]]
import matplotlib.pyplot as plt
#plt.figure(figsize=(15,7))
plt.rcParams["figure.figsize"] = (25,8)
df1.plot()
#df.plot(style='k. ')
plt.show()
df1.hist(bins=20)
from scipy.fft import rfft,rfftfreq
## https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101
# convert into x and y
x = list(range(len(df1.index)))
y = df1['ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB']
# apply fast fourier transform and take absolute values
f=abs(np.fft.fft(df1))
# get the list of frequencies
num=np.size(x)
freq = [i / num for i in list(range(num))]
# get the list of spectrums
spectrum=f.real*f.real+f.imag*f.imag
nspectrum=spectrum/spectrum[0]
# plot nspectrum per frequency, with a semilog scale on nspectrum
plt.semilogy(freq,nspectrum)
nspectrum
type(freq)
freq= np.array(freq)
freq
type(nspectrum)
nspectrum = nspectrum.flatten()
# improve the plot by adding periods in number of days rather than frequency
import pandas as pd
results = pd.DataFrame({'freq': freq, 'nspectrum': nspectrum})
results['period'] = results['freq'] / (1/365)
plt.semilogy(results['period'], results['nspectrum'])
# improve the plot by converting the data into grouped per day to avoid peaks
results['period_round'] = results['period'].round()
grouped_day = results.groupby('period_round')['nspectrum'].sum()
plt.semilogy(grouped_day.index, grouped_day)
#plt.xticks([1, 13, 26, 39, 52])
My end result is this:
Result of Fourier Transformation for the data (see linked plot)
My question is: what does this eventually show for our data, and intuitively, what does the spike at the last section mean? What can I do with such a result?
Thanks in advance, all!

How to split dataframe according to intersection point in Python?

I am working on a project which aims to show the difference between good form and bad form of an exercise. To do this we collected acceleration data with a wrist-based accelerometer. The image above shows 2 sets of a fitness exercise (bench press). Each set has 10 repetitions, and the image below shows the 10 repetitions of 1 set. I have a raw data set which consists of 10 sets of an exercise. What I want to do is split the raw data into 10 parts, each containing the part between the 2 black lines in the image above, so I can analyze the data easily. My supervisor gave me a starting point, which is choosing a cutpoint in each set: take a cutpoint, find the first interruption time, start cutting 3 seconds before that time, count to 10, and finish cutting.
This is an idea that I don't know how to apply. At least, if you can tell me how to cut a dataframe according to a cutpoint, I would be grateful.
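To make it concrete, I imagine the cutting step looking roughly like the sketch below (the DataFrame name, timestamp and window length are all made up); what I don't know is how to find the cutpoints automatically.

import pandas as pd

# acc: accelerometer DataFrame with a DatetimeIndex (placeholder name)
cutpoint = pd.Timestamp("2016-01-01 11:26:00")        # made-up interruption time
window_start = cutpoint - pd.Timedelta(seconds=3)      # start 3 s before the interruption
window_end = window_start + pd.Timedelta(seconds=60)   # roughly the length of one set

one_set = acc.loc[window_start:window_end]              # slice holding a single set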
Well, I found another way to detect the periodic parts of my accelerometer data. So, here is my code:
import numpy as np
from peakdetect import peakdetect
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib import style
from pandas import DataFrame as df

style.use('ggplot')

def get_periodic(path):
    periodics = []
    # DataFrame.from_csv is old pandas API (pre-1.0); read_csv is the modern equivalent
    data_frame = df.from_csv(path)
    data_frame.columns = ['z', 'y', 'x']
    if path.__contains__('1'):
        if path.__contains__('bench'):
            bench_press_1_week = data_frame.between_time('11:24', '11:52')
            peak_indexes = get_peaks(bench_press_1_week.y, lookahead=3000)
            for i in range(0, len(peak_indexes)):
                time_indexes = bench_press_1_week.index.tolist()
                start_time = time_indexes[0]
                periodic_start = start_time.to_datetime() + dt.timedelta(0, peak_indexes[i] / 100)
                periodic_end = periodic_start + dt.timedelta(0, 60)
                periodic = bench_press_1_week.between_time(periodic_start.time(), periodic_end.time())
                periodics.append(periodic)
    return periodics

def get_peaks(data, lookahead):
    peak_indexes = []
    correlation = np.correlate(data, data, mode='full')
    # integer division so the slice index is an int
    realcorr = correlation[correlation.size // 2:]
    maxpeaks, minpeaks = peakdetect(realcorr, lookahead=lookahead)
    for i in range(0, len(maxpeaks)):
        peak_indexes.append(maxpeaks[i][0])
    return peak_indexes

def show_segment_plot(data, periodic_area, exercise_name):
    plt.figure(8)
    gs = gridspec.GridSpec(7, 2)
    ax = plt.subplot(gs[:2, :])
    plt.title(exercise_name)
    ax.plot(data)
    k = 0
    for i in range(2, 7):
        for j in range(0, 2):
            ax = plt.subplot(gs[i, j])
            title = "{} {}".format(k + 1, ".Set")
            plt.title(title)
            ax.plot(periodic_area[k])
            k = k + 1
    plt.show()
Firstly, this question gave me another perspective on my problem. The image below shows the raw accelerometer data of the bench press with 10 sets. It has 3 axes (x, y, z) and its major axis is y (blue in the image).
I used the autocorrelation function for detecting the periodic parts. In the image above, every peak represents 1 set of the exercise. With this peak detection algorithm I found each peak's x-axis value:
In[196]: maxpeaks
Out[196]:
[[16204, 32910.14013671875],
[32281, 28726.95849609375],
[48515, 24583.898681640625],
[64436, 22088.130859375],
[80335, 19582.248291015625],
[96699, 16436.567626953125],
[113081, 12100.027587890625],
[129027, 8098.98486328125],
[145184, 5387.788818359375]]
Basically, each x-value represents a sample index. My sampling frequency was 100 Hz, so 16204/100 = 162.04 seconds. To find the time of a periodic part I added 162.04 sec to the start time. Each bench press set took approximately 1 min, and in this example the exercise's starting time was 11:24, so the first periodic part's start time is about 11:26 and its ending time is 1 min later. There is some lag, but this is the best solution I found.
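The index-to-time conversion described above can also be done directly in code; a small sketch assuming the 100 Hz sampling rate and the maxpeaks list from above (the start timestamp is a placeholder):

import datetime as dt

fs = 100  # sampling frequency in Hz
set_start = dt.datetime(2016, 1, 1, 11, 24)  # placeholder for the exercise start time

# Convert each peak's sample index into an absolute timestamp
peak_times = [set_start + dt.timedelta(seconds=p[0] / fs) for p in maxpeaks]
# first element -> about 11:26:42, matching the ~11:26 start described above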

numpy function to aggregate a signal for time?

I want to compute the aggregated average of a signal over time, within a certain period. I don't know what this is called scientifically.
Example: I have electricity consumption for a full year in 15-minute values. I want to know my average consumption by hour of the day (24 values). But it is more complex: there are more measurements in between the 15-minute steps, and I cannot foresee where they are. However, they should be taken into account, with a correct 'weight'.
I wrote a function that works, but it is extremely slow. Here is a test setup:
import numpy as np

signal = np.arange(6)
time = np.array([0, 2, 3.5, 4, 6, 8])
period = 4
interval = 2

def aggregate(signal, time, period, interval):
    pass

aggregated = aggregate(signal, time, period, interval)
# This should be the result: aggregated = array([ 2.   ,  3.125])
aggregated should have period/interval values. This is the manual computation:
aggregated[0] = (np.trapz(y=np.array([0, 1]), x=np.array([0, 2]))/interval + \
np.trapz(y=np.array([3, 4]), x=np.array([4, 6]))/interval) / (period/interval)
aggregated[1] = (np.trapz(y=np.array([1, 2, 3]), x=np.array([2, 3.5, 4]))/interval + \
np.trapz(y=np.array([4, 5]), x=np.array([6, 8]))/interval) / (period/interval)
One last detail: it has to be efficient, that's why my own solution is not useful. Maybe I'm overlooking a numpy or scipy method? Or is this something pandas can do?
Thanks a lot for your help.
I would strongly recommend using Pandas. Here I'm using version 0.8 (soon to be released). I think this is close to what you want.
import pandas as p
import numpy as np
import matplotlib.pyplot as plt
# Make up some data:
time = p.date_range(start='2011-05-23', end='2012-05-23', freq='min')
watts = np.linspace(0, 3.14 * 365, time.size)
watts = 38 * (1.5 + np.sin(watts)) + 8 * np.sin(5 * watts)
# Create a time series
ts = p.Series(watts, index=time, name='watts')
# Resample down to 15 minute pieces, using mean values
ts15 = ts.resample('15min', how='mean')
ts15.plot()
Pandas can easily do many other things with your data (like determine your average weekly energy profile). Check out p.read_csv() for reading in your data.
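To get from the resampled series to the per-hour-of-day averages the question asks for, one extra step (my addition, not part of the original answer) is to group the samples by the hour of their timestamp:

# Average consumption for each hour of the day (24 values),
# grouping every 15-minute sample by the hour it falls in
hourly_profile = ts15.groupby(ts15.index.hour).mean()
print(hourly_profile)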
I think this is pretty close to what you need. I'm not sure I interpreted interval and period correctly, but I think I got it right within some constant factor.
import numpy as np

def aggregate(signal, time, period, interval):
    assert (period % interval) == 0
    ipp = period // interval
    midpoint = np.r_[time[0], (time[1:] + time[:-1])/2., time[-1]]
    cumsig = np.r_[0, (np.diff(midpoint) * signal).cumsum()]
    grid = np.linspace(0, time[-1], int(np.floor(time[-1]/period))*ipp + 1)
    cumsig = np.interp(grid, midpoint, cumsig)
    return np.diff(cumsig).reshape(-1, ipp).sum(0) / period
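On the test setup from the question, this gives the expected result:

signal = np.arange(6)
time = np.array([0, 2, 3.5, 4, 6, 8])
print(aggregate(signal, time, period=4, interval=2))  # -> [2.    3.125]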
I worked out a function that does exactly what I wanted based on the previous answers and on pandas.
import numpy as np
import pandas

def aggregate_by_time(signal, time, period=86400, interval=900, label='left'):
    """
    Function to calculate the aggregated average of a timeseries by
    period (typically a day) in bins of interval seconds (default = 900s).

    label = 'left' or 'right'. 'left' means that label i contains data from
    i till i+1, 'right' means that label i contains data from i-1 till i.

    Returns an array with period/interval values, one for each interval
    of the period.

    Note: the period has to be a multiple of the interval.
    """

    def make_datetimeindex(array_in_seconds, year):
        """
        Create a pandas DatetimeIndex from a time vector in seconds and the year.
        """
        start = pandas.datetime(year, 1, 1)
        datetimes = [start + pandas.datetools.timedelta(t/86400.) for t in array_in_seconds]
        return pandas.DatetimeIndex(datetimes)

    interval_string = str(interval) + 'S'
    dr = make_datetimeindex(time, 2012)
    df = pandas.DataFrame(data=signal, index=dr, columns=['signal'])
    df15min = df.resample(interval_string, closed=label, label=label)

    # now create bins for the groupby() method
    time_s = df15min.index.asi8/1e9
    time_s -= time_s[0]
    df15min['bins'] = np.mod(time_s, period)

    df_aggr = df15min.groupby(['bins']).mean()
    # if you only need the numpy array: take df_aggr.values
    return df_aggr
