Pandas dataframe.resample and mean gives higher values with increasing sample width

I have a time series of rain intensity (in µm/s), which I resample to 1-minute intervals. The data already has a 1-minute time step, but there can be gaps caused by quality checks or equipment failure. Resampling ensures that I have a consistent, equidistant time series to loop over, which is the fastest approach I have found so far.
The problem is that, in theory, I can choose another time step for the calculation, say 5 minutes. I found that this gives larger dimensions for a rainwater basin, which seemed odd to me. I figured out that this is because summing the resampled series systematically gives higher values, i.e. more precipitation and therefore a larger basin.
Why does resample give this odd result? Is it because it can take the same time steps and count them in several resampled time steps?
File is uploaded here
import numpy
import pandas
from matplotlib import pyplot as plt
# Load the 1-minute rain intensity series, indexed by timestamp.
data1 = pandas.read_csv("rain_1min.txt", sep=";", parse_dates=["time"], index_col="time")
test = list(range(1, 121))  # resampling widths in minutes
sums = []
for timestep in test:
    # Mean intensity per bin; bins without any data become NaN and are set to 0.
    data_rs = data1["rain"].resample(f"{timestep}min").mean().fillna(0.0)
    sums.append(numpy.nansum(data_rs))
fig, ax = plt.subplots(figsize=[8, 4], dpi=100)
ax.plot(test, sums)
ax.set_xlabel("Rule = x Min")
ax.set_ylabel("Sum of mean()")
plt.show()
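
One possible explanation, assuming the gaps in the data are the cause (the linked file is not reproduced here, so this is an assumption): mean() averages only the samples that are present in each bin. In a wider bin that is only partly filled, the missing minutes are effectively assigned the bin mean instead of zero, so converting the bin means back to a rain depth (mean × bin length) gives more rain than the 1-minute series, where every gap counts as zero. A minimal sketch with synthetic data; total_depth_um and all values are made up for illustration:

```python
import pandas as pd

# Synthetic 1-minute rain intensity (µm/s): constant rain for one hour.
# These values are invented for illustration, not taken from the question.
index = pd.date_range("2020-01-01 00:00", periods=60, freq="1min")
rain = pd.Series(2.0, index=index)

# Simulate outages: keep only every other minute.
rain_with_gaps = rain.iloc[::2]


def total_depth_um(series, minutes):
    """Total rain depth in µm: bin mean intensity (µm/s) times bin length (s)."""
    bin_means = series.resample(f"{minutes}min").mean()
    # Bins without any sample are NaN; count them as zero rain.
    return float((bin_means.fillna(0.0) * minutes * 60).sum())


depth_1min = total_depth_um(rain_with_gaps, 1)  # gaps count as zero
depth_5min = total_depth_um(rain_with_gaps, 5)  # gaps inherit the bin mean

print(depth_1min, depth_5min)
```

If this is what happens in the real data, filling the gaps explicitly on the 1-minute grid (for example with zeros) before resampling to a wider step should make the totals agree.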

Related

Time series graph analysis by detecting specific point increment and decrement

I have one month hourly flow data as follows
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame(index=pd.date_range(start="01/01/2010", end="02/01/2010"),
                  data=np.random.rand(32) * 140,
                  columns=['data'])
df.plot(label='data')
plt.legend(); plt.ylabel('flow')
The maximum and minimum flow values are 140 and 30. The data on the x-axis is distributed hourly.
"My task is to distribute the flow data so that a decrease from 100 to 30 takes 4.5 hours. If the flow value is below 100 (while going down), it should follow the same slope as in the previous condition."
Put simply, I have to draw a line from 100 to 30 with a slope of about -15.56 per hour (-70 / 4.5) while the trend is decreasing, but I don't know how to detect the point where the flow reaches 100 and the time at which it occurs on the x-axis.
I need some code and analysis advice to do this.
I am very new to Python and trying to learn the techniques. I would appreciate suggestions for different ways to do this, with some explanation if possible. Thank you.
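
A minimal sketch of one way to find where the flow drops through 100, assuming hourly samples and linear interpolation between neighbouring samples. The flow values, ramp_value, and all other names are made up for illustration:

```python
import pandas as pd

# Hypothetical hourly flow values (not the data from the question).
index = pd.date_range("2010-01-01", periods=8, freq="h")
flow = pd.Series([120.0, 140.0, 110.0, 90.0, 60.0, 105.0, 95.0, 30.0], index=index)

threshold = 100.0
slope = (30.0 - 100.0) / 4.5  # about -15.56 flow units per hour

# A downward crossing: the previous sample is at or above the threshold
# and the current sample is below it.
previous = flow.shift(1)
crossed_down = (previous >= threshold) & (flow < threshold)

crossing_times = []
for stamp in flow.index[crossed_down]:
    prev_value = previous.loc[stamp]
    # Fraction of the hour that passed before the line between the two
    # samples reached the threshold.
    fraction = (prev_value - threshold) / (prev_value - flow.loc[stamp])
    crossing_times.append(stamp - pd.Timedelta(hours=1.0 - float(fraction)))


def ramp_value(hours_after_crossing):
    """Value on the prescribed line, clipped at the minimum flow of 30."""
    return max(30.0, threshold + slope * hours_after_crossing)


print(crossing_times)
print(ramp_value(0.0), ramp_value(4.5))
```

Each entry in crossing_times is a candidate start point for the line with the prescribed slope.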

Interpreting and understanding fft plots of time series data

I have time series sensor data covering a period of 24 hours, sampled every minute (so 1440 data points per day in total). I ran an FFT on it to find the dominant frequencies, but what I got is a very noisy spectrum and a strong peak at zero.
I have already subtracted the mean to remove the DC component at bin 0, but I still get a strong peak at zero. I can't figure out what else could cause this or what I should try next to remove it.
The graph is very different from what I have usually seen online while learning about the FFT, in the sense that I can't see dominant peaks the way they usually appear. Is my FFT wrong?
Attaching the code that I tried and the images:
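import numpy as np
from matplotlib import pyplot as plt
from scipy.fftpack import fft, fftfreq
# Example data: 1440 uniformly distributed random values between 29 and 32.
x = np.random.default_rng().uniform(29, 32, 1440)
# Subtract the mean to remove the DC component (bin 0).
x = x - x.mean()
N = len(x)
# Sample spacing in seconds (one sample per minute).
T = 60.0
yf = fft(x)
yf_abs = np.abs(yf)
# Magnitude against bin index.
plt.plot(yf_abs)
plt.show()
# Magnitude against frequency in Hz.
freqs = fftfreq(N, T)
plt.plot(freqs, yf_abs)
plt.show()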
Frequency vs amplitude
Since I'm new to this, I'm not able to figure out where I'm wrong or interpret the results. Any help will be appreciated. Thanks! :)
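
For comparison, a minimal sketch of what a dominant peak looks like when the signal really contains a periodic component. Purely random samples, like the example data above, have no preferred frequency, so a flat, noisy spectrum is expected for them. The 1-hour sine wave, its amplitude, and all names below are assumptions made for illustration:

```python
import numpy as np

n_samples = 1440          # one day of 1-minute samples
sample_spacing_s = 60.0   # seconds between samples

rng = np.random.default_rng(0)
t = np.arange(n_samples) * sample_spacing_s

# Hypothetical signal: a 1-hour cycle buried in uniform noise.
noise = rng.uniform(-1.5, 1.5, n_samples)
signal = np.sin(2 * np.pi * t / 3600.0) + noise
signal = signal - signal.mean()  # remove the DC component

# One-sided spectrum for real input; frequencies are in Hz.
spectrum = np.abs(np.fft.rfft(signal))
freqs_hz = np.fft.rfftfreq(n_samples, d=sample_spacing_s)

peak_index = int(np.argmax(spectrum))
peak_period_s = 1.0 / freqs_hz[peak_index]
print(peak_index, peak_period_s)
```

Plotting spectrum against freqs_hz shows a single clear peak at the frequency of the sine wave, standing out from the noise floor.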

How can I plot a graph of a random array spanning 20 minutes at an 8 ms interval?

I have a problem with plotting a graph in Python. I need to create a random array and a time array, and then plot them against each other.
The time array should start at 0 and end at 20 minutes, with an interval of 8 milliseconds.
I tried the code below, but the graph doesn't look good. Could anyone help me, please?
Code:
import numpy as np
import matplotlib.pyplot as plt
random_array = np.random.rand(150000)
time_array = np.linspace(0, 1200, 150000)
# The number 1200 is the conversion of 20 minutes to seconds.
# And the number 150000 is for interval. (1200 second / 0.008 second )
plt.plot(time_array, random_array)
plt.xlabel('Time (second)')
plt.ylabel('Value')

I made a few changes to your code, and uploaded the plot.
(1) changed the plt.plot to plt.scatter
(2) moved your comments to the top, and used comment syntax of # instead of "
(3) added np.sort() to the y-variable data, just to make it easier to see in this example.
For your actual dataset, don't use np.sort(), because the time variable (x-axis) is presumably aligned with your dependent variable (y-axis), and sorting would break that alignment.
Without the np.sort() in the example, it just shows a block of random scatter dots.
# (The number 1200 is the conversion of 20 minutes to seconds.
# And the number 150000 is for interval. (1200 second / 0.008 second ))
import numpy as np
import matplotlib.pyplot as plt
random_array= np.random.rand(150000)
time_array= np.linspace(0,1200,150000)
plt.scatter(time_array,np.sort(random_array))
plt.xlabel('Time (second)')
plt.ylabel('Value')

As far as I can tell, your code is fine. I'd just try to express your constraints directly in the code, e.g.:
import numpy as np
time_seconds = np.arange(0, 20 * 60, 0.008)
random_values = np.random.rand(*time_seconds.shape)
I.e., just calculate 20 * 60 seconds and use 0.008 directly, putting the units somewhere obvious in the variable name. Python can work out that this is 150k samples for us (it's a computer!), and we just reuse the resulting shape to draw the random values.
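
One caveat: the NumPy documentation warns that np.arange with a non-integer step can return an array whose length is off by one because of floating-point rounding. A minimal sketch of a variant that fixes the sample count first and then scales an integer range; the variable names are only illustrative:

```python
import numpy as np

duration_s = 20 * 60   # 20 minutes in seconds
interval_s = 0.008     # 8 ms in seconds

# Fix the number of samples as an integer, then scale an integer range.
n_samples = round(duration_s / interval_s)
time_seconds = np.arange(n_samples) * interval_s
random_values = np.random.default_rng().random(n_samples)

print(n_samples, time_seconds[0], time_seconds[-1])
```

Here the last time stamp is one interval short of 1200 s, i.e. the end point is excluded, just as with np.arange(0, 20 * 60, 0.008).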

Normalizing time series measurements

I have read the following sentence:
Figure 3 depicts how the pressure develops during a touch event. It
shows the mean over all button touches from all users. To account for
the different hold times of the touch events, the time axis has been
normalized before averaging the pressure values.
They have measured the touch pressure over touch events and made a plot. I think normalizing the time axis means to scale the time axis to 1s, for example. But how can this be made? Let's say for example I have a measurement which spans 3.34 seconds (1000 timestamps and 1000 measurements). How can I normalize this measurement?

If you want to normalize your data, you can do as you suggest and simply calculate:
z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}
where z_i is your i-th normalized time value and x_i is your i-th absolute time value.
An example using numpy:
import numpy
x = numpy.random.rand(10) # generate 10 random values
normalized = (x - x.min()) / (x.max() - x.min())
print(x,normalized)
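
To average several touch events of different durations, as the quoted passage describes, each event also has to be brought onto a common time grid after its time axis has been normalized. A minimal sketch, assuming linear interpolation with np.interp is acceptable; the two synthetic events, normalize_event, and n_points are made up for illustration and are not taken from the paper:

```python
import numpy as np


def normalize_event(timestamps, values, n_points=101):
    """Rescale time to [0, 1] and interpolate the values onto a common grid."""
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    normalized_time = (timestamps - timestamps[0]) / (timestamps[-1] - timestamps[0])
    common_grid = np.linspace(0.0, 1.0, n_points)
    return common_grid, np.interp(common_grid, normalized_time, values)


# Two synthetic touch events with different hold times and sample counts.
t_a = np.linspace(0.0, 3.34, 1000)   # 3.34 s, 1000 samples
p_a = np.sin(np.pi * t_a / 3.34)     # made-up pressure curve
t_b = np.linspace(0.0, 1.20, 500)    # 1.20 s, 500 samples
p_b = np.sin(np.pi * t_b / 1.20)

grid, curve_a = normalize_event(t_a, p_a)
_, curve_b = normalize_event(t_b, p_b)

# Both curves now share the same x-axis, so they can be averaged point by point.
mean_curve = (curve_a + curve_b) / 2.0
print(grid[:3], mean_curve[:3])
```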

how to plot two time series that have different sample rates on the same graph with matplotlib

I have two sets of data that I would like to plot on the same graph. Both sets of data have 200 seconds worth of data. DatasetA (BLUE) is sampled at 25 Hz and DatasetB (Red) is sampled at 40Hz. Hence DatasetA has 25*200 = 5000 (time,value) samples... and DatasetB has 40*200 = 8000 (time,value) samples.
datasets with different sample rates
As you can see above, I have managed to plot these in matplotlib using the 'plot_date' function. As far as I can tell, the 'plot' function will not work because the number of (x,y) pairs are different in each sample. The issue I have is the format of the xaxis. I would like the time to be a duration in seconds, rather than an exact time of the format hh:mm:ss. Currently, the seconds value resets back to zero when it hits each minute (as seen in the zoomed out image below).
zoomed out full time scale
How can I make the plot show the time increasing from 0-200 seconds rather than showing hours:min:sec ?
Is there a matplotlib.dates.DateFormatter that can do this (I have tried, but can't figure it out...)? Or do I somehow need to manipulate the datetime x-axis values to be a duration, rather than an exact time? (how to do this)?
FYI:
The code below is how I am converting the original csv list of float values (in seconds) into datetime objects, and again into matplotlib date-time objects -- to be used with the axes.plot_date() function.
from matplotlib import dates
import datetime
# Arbitrary start date: we're dealing with milliseconds here, so only the time is shown on the graph.
base_datetime = datetime.datetime(2018, 1, 1)
# csvTime is the list of float time values (in seconds) read from the csv.
csvDateTime = [base_datetime + datetime.timedelta(seconds=x) for x in csvTime]
csvMatTime = dates.date2num(csvDateTime)
Thanks for your help/suggestions!

Well, thanks to ImportanceOfBeingErnst for pointing out that I was vastly over-complicating things...
It turns out that I really only need the ax.plot(x,y) function rather than the ax.plot_date(mdatetime, y) function. Plot can actually plot varied lengths of data as long as each individual trace has the same number of x and y values. Since the data is all given in seconds I can easily plot using 0 as my "reference time".
For anyone else struggling with plotting a duration rather than exact times: you can simply manipulate the "time" (x) data using Python's map() function, or better yet a list comprehension, to "time shift" the data or convert it to a single unit of time (e.g. turn minutes into seconds by multiplying by 60).
"Time Shifting" might look like:
# build some sample 25 Hz time data
time = range(0,1000,1)
time = [x*.04 for x in time]
# "time shift" it by 5 seconds, since this data is recorded 5 seconds after the other signal
time = [x+5 for x in time]
Here is my plotting code for any other matplotlib beginners like me :) (this will not run, since I have not converted my variables to generic data... but nevertheless it is a simple example of using matplotlib.)
fig, ax = plt.subplots()
ax.grid()
ax.set_title(plotTitle)
ax.set_xlabel("time (s)")
ax.set_ylabel("value")
# begin looping over the different sets of data.
tup = 0
while (tup < len(alldata)):
    outTime = alldata[tup][1].get("time")
    # each signal is time shifted 5 seconds later.
    # in addition each signal has different sampling frequency,
    # so len(outTime) is different for almost every signal.
    outTime = [x + (5*tup) for x in outTime]
    for key in alldata[tup][1]:
        if (key not in channelSelection):
            ## if we dont want to plot that data then skip it.
            continue
        else:
            data = alldata[tup][1].get(key)
            ## using list comprehension to scale y values.
            data = [100*x for x in data]
            ax.plot(outTime, data, linestyle='solid', linewidth='1', marker='')
    tup += 1
plt.show()
