Finding/Plotting average of series with non-consistent x - python

I have a tricky problem concerning an unusual dataset.
Basically I have 25 replicates of a model; each records fire sizes, and those fire sizes are summed cumulatively.
A short sample of the data looks like this:
     fires_size     rep  cumsum
0             1   rep_9       1
1             1   rep_9       2
2             1   rep_9       3
...
50           59   rep_9    4000
51           75   rep_9    4075
...
150           1  rep_20       1
151           1  rep_20       2
152           1  rep_20       3
...
200          12  rep_20    3500
201          70  rep_20    3570
So when I plot this pandas DataFrame, with fire size as x and cumulative area burnt as y, I get something like the blue lines in the linked image (two lines, as I have two different datasets).
[Image linked there, as I can't upload pictures.]
So now a cool thing would be to create an average replicate that could be drawn on top of my other reps to show the average distribution; even better would be to also calculate the standard deviation and use fill_between to show the variability.
My problem is that my fire sizes (the x axis) are not consistent across replicates (and x values repeat, since y is cumulative), so I have no idea how I could do that. I tried a few trendlines, lowess, and things like that, but none of them gives a really good result.
So is there an easy way to do that? I am lacking some basic statistical knowledge here, and I can't find any existing answer I can relate to, as I don't even know how to describe my dataset.
Thank you so much!
I will post the link to the full data in a comment as I can't post more than one link
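One possible sketch of the averaging step (the replicate values below are made up, since the full data is only linked): interpolate every replicate's cumulative curve onto a shared x grid with np.interp, then take the pointwise mean and standard deviation and draw the band with fill_between.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up replicates: each has its own x values (fire sizes) and
# cumulative y values (area burnt), like the reps in the question.
reps = {
    "rep_9":  (np.array([1.0, 2.0, 5.0, 9.0]),  np.array([1.0, 2.0, 4.0, 7.0])),
    "rep_20": (np.array([1.0, 3.0, 4.0, 10.0]), np.array([1.0, 3.0, 5.0, 8.0])),
}

# A shared x grid covering the range of interest.
grid = np.linspace(1.0, 9.0, 50)

# Interpolate every replicate onto the grid, then average pointwise.
curves = np.array([np.interp(grid, x, y) for x, y in reps.values()])
mean = curves.mean(axis=0)
std = curves.std(axis=0)

plt.plot(grid, mean, color="black", label="average replicate")
plt.fill_between(grid, mean - std, mean + std, alpha=0.3)
plt.legend()
```

Note that np.interp holds the end values flat outside each replicate's own x range, so keep the grid inside the range common to all replicates if that flat extrapolation is not acceptable.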

Related

Identifying outliers in an event sequence using a Python Dataframe

I'm experimenting with machine learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in meters at a site downstream of the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the DataFrame:
[Image: DataFrame showing a typical sequence of rainfall and river stage data]
And here is an example of what it looks like when two sites have been tested:
[Image: DataFrame showing abnormal rainfall data due to two sites being tested]
I've come across several ways to statistically cluster data and identify outliers, but none of them really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which the sites are located within the catchment, so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp   Site 1   Site 2   Site 3
09:00:00         0        0        0
10:00:00         0       20        0
11:00:00         0        0        0
20mm of rainfall at one site, with zero rainfall at the adjacent sites and at the same site for the hour before and the hour after, is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp   Site 1   Site 2   Site 3
09:00:00         6        4        0
10:00:00         0       20        2
11:00:00         0        0       11
This is a less obvious example:
Timestamp   Site 1   Site 2   Site 3
09:00:00         1        0        0
10:00:00         0       20        2
11:00:00         0        3        1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and flag it as abnormal if the difference is greater than 15 (or some other arbitrary threshold).
The exact criteria will probably change as I discover more about the data; the mechanism for applying those criteria to the DataFrame is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that function across the DataFrame?
An extra complication is how to check values for Site 1, where there is no preceding site to the left, and Site 3, where there is no following site to the right.
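A minimal sketch of that neighbourhood comparison in pandas, using the "less obvious" table and the threshold of 15 from above (the neighbourhood here is the previous/next hour and the adjacent site columns): shift the whole frame in each of the eight directions, take the element-wise maximum, and flag cells that exceed it by the threshold. Values shifted past an edge become NaN and drop out of the max, which also handles the Site 1 / Site 3 edge complication.

```python
import pandas as pd

# The "less obvious" example table from the question.
df = pd.DataFrame(
    {"Site 1": [1, 0, 0], "Site 2": [0, 20, 3], "Site 3": [0, 2, 1]},
    index=["09:00:00", "10:00:00", "11:00:00"],
)

def neighbour_max(frame):
    # Maximum over the up-to-8 surrounding cells: shift the frame by
    # -1/0/+1 rows (time) and columns (sites) and take the max per cell.
    # Values shifted past an edge become NaN and are skipped by max().
    shifted = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            shifted.append(frame.shift(dr).shift(dc, axis=1))
    return pd.concat(shifted).groupby(level=0).max()

# Flag cells that exceed their neighbourhood maximum by the threshold.
outliers = (df - neighbour_max(df)) > 15
```

On this table only the 10:00:00 reading for Site 2 is flagged (20 against a neighbourhood maximum of 3).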

Pandas infrastructure data statistics plot with date per user

I am trying to display some infrastructure usage daily statistics with Pandas but I'm a beginner and can't figure it out after many hours of research.
Here's my data types per column:
Name              object
UserService       object
ItemSize          int64
ItemsCount        int64
ExtractionDate    datetime64[ns]
Each day I have a new extraction for each user, so I probably need to use groupby before plotting.
Data sample:
Name UserService ItemSize ItemsCount ExtractionDate
1 xyzf_s xyfz 40 1 2018-12-12
2 xyzf1 xyzf 53 5 2018-12-12
3 xyzf2 xyzf 71 4 2018-12-12
4 xyzf3 xyzf 91 3 2018-12-12
14 vo12 vo 41 5 2018-12-12
One of the graphs I am trying to display is as follows:
x axis should be the extraction date
y axis should be the items count (it's divided by 1000, so it runs from 1 to 100 thousand items)
Each line on the graph should represent a user's evolution (to look at data spikes); I guess I would have to display only the top 10 or 50, because a graph of 1500 users would be unreadable.
I'm also interested in any other ways you would exploit this data to look for increases and anomalies in data consumption.
Assuming the user is shown in the Name column and there is only one line per user per day, you can get the plot you are explicitly asking for with the following code:
import matplotlib.pyplot as plt

# Limit to 10 users
users_to_plot = df.Name.unique()[:10]
for u in users_to_plot:
    mask = df['Name'] == u
    values = df[mask]
    plt.plot('ExtractionDate', 'ItemsCount',
             data=values.sort_values('ExtractionDate'))
plt.show()
It's important to look at the data and think about what information you are trying to extract and what that looks like. It's probably worth exploring with some individuals first and getting an idea of what is the thing you are trying to identify. Think about what makes that unique and if you can make it pop on a graph.
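If "top 10" should mean the users with the largest totals rather than the first 10 names encountered, a groupby plus a pivot avoids the loop entirely. A sketch with a made-up miniature frame (the column names match the question; nlargest picks the users with the largest total ItemsCount):

```python
import pandas as pd

# Made-up miniature of the question's frame.
df = pd.DataFrame({
    "Name": ["a", "a", "b", "b", "c"],
    "ItemsCount": [5, 7, 40, 44, 2],
    "ExtractionDate": pd.to_datetime(
        ["2018-12-12", "2018-12-13", "2018-12-12", "2018-12-13", "2018-12-12"]),
})

# Rank users by total ItemsCount and keep the 10 largest.
top = df.groupby("Name")["ItemsCount"].sum().nlargest(10).index

# One column per user, indexed by date; .plot() then draws one line each.
pivoted = (df[df["Name"].isin(top)]
           .pivot(index="ExtractionDate", columns="Name", values="ItemsCount"))
pivoted.plot()
```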

Pandas series function that shows the probability of the up and down moves of the stock price

Days  Adjusted stock price
0     100
1     50
2     200
3     210
4     220
5     34
6     35
7     36
8     89
Assuming this table is a pandas DataFrame, can someone help me write a function that shows the probability of the up and down moves of the stock price? For example, what is the probability of the stock price having two up days in a row?
Thanks! I am new to Python, and I have been trying to figure this out for a while!
Actual stock price movement prediction is both a broad and a deep subject usually associated with time series analysis which I would consider out of the scope of this question.
However, the naive approach would be to assume the Bernoulli model where each price move is considered independent both of any previous moves and of time.
In this case, the probability of the price moving up can be inferred by measuring all the up moves against all moves recorded.
# df is a single-column pandas DataFrame storing the price
((df['price'] - df['price'].shift(1)) > 0).sum()/(len(df) - 1)
which for the data you posted gives 0.75.
Given the above, the probability of the price going up on two consecutive days would be 0.75 * 0.75 = 0.5625, or approximately 0.56.
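You can also check the Bernoulli assumption against the data itself by counting consecutive up moves directly. A sketch on the posted prices (moves and both_up are names introduced here):

```python
import pandas as pd

# The prices from the question.
prices = pd.Series([100, 50, 200, 210, 220, 34, 35, 36, 89], name="price")

up = prices.diff() > 0            # True where the price moved up
p_up = up.iloc[1:].mean()         # 6 of the 8 moves are up -> 0.75

# Under the Bernoulli (independence) assumption, two up days in a row:
p_two_up = p_up ** 2              # 0.5625

# Empirical frequency of two consecutive up moves, for comparison:
moves = up.iloc[1:]
both_up = moves & moves.shift(1, fill_value=False)
empirical = both_up.iloc[1:].mean()   # 4 of the 7 adjacent pairs -> ~0.57
```

With only 8 moves the empirical estimate is very noisy, but the comparison shows how far the data sits from the independence assumption.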

Deep learning training the dataset which has gap

I have a dataset of a sensor (station) for several years with this structure:
station  Direction  year  month  day  dayOfweek  hour  volume
1009     3          2015  1      1    5          0     37
1009     3          2015  1      1    5          1     20
1009     3          2015  1      1    5          2     24
...
There are plenty of gaps (missing values) in this data; for example, a whole month or several days might be missing. I fill the missing volumes with 0. I want to predict volume based on previous data. I used an LSTM, and the mean absolute percentage error (MAPE) is quite high, around 20, and I need to reduce it.
The main problem I have is that even for training I have gaps. Is there any other technique in deep learning for this kind of data?
There are multiple ways to handle missing values, as listed here: https://machinelearningmastery.com/handle-missing-data-python/
If I have enough data, I just omit the rows with missing data. If I do not have enough data and/or have to predict on cases where data is missing, I normally try the following two approaches and choose the one with the higher accuracy.
The first is the same as yours: choose a distinct value that is not included in the dataset, like 0 in your case, and fill that in. The other approach is to use the mean or median of the training set. I use the same value (calculated on the training set) in my validation and test sets. The median is better than the mean when the mean does not make sense in the current context (2014.5 as a year, for example).
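A small sketch of the mean/median approach in pandas (the train/test frames are made up; the point is that the fill value is computed on the training set only and reused unchanged):

```python
import pandas as pd

# Made-up split of the volume column into training and test data.
train = pd.DataFrame({"volume": [37.0, 20.0, None, 24.0, 30.0]})
test = pd.DataFrame({"volume": [None, 15.0]})

# Fill value from the training set only; the median of 37, 20, 24, 30 is 27.
fill = train["volume"].median()
train["volume"] = train["volume"].fillna(fill)
test["volume"] = test["volume"].fillna(fill)
```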

How to remove transients in time-series data in Python (or Pandas)?

I have a time-series set of data recording the flow and temperature of a heat pump. For the first few minutes after the system kicks on, the flows and temperatures aren't fully developed, and I'd like to filter those readings out.
Time (min)  Flow      Supply T    Return T
...
45          0         0           0
46          0         0           0
47          1.338375  92.711328   78.72152
48          2.267975  82.578552   74.239624
49          0.778125  96.073136   74.288664
50          0.778125  101.3998    74.686288
51          0.7885    102.1189    74.490528
...
For instance, I don't want to do any calculations with the first 3 minutes of operation (minutes 47-49).
I can do that with a loop, but the dataset is very large (a >200 MB text file) and takes a really long time to loop through. I was wondering if there's a more efficient way to pull those rows out, perhaps using pandas?
Any help or advice is appreciated! Thanks in advance!!
Please try the following; I think it should work. It keeps only the rows where the Flow value 3 rows earlier is neither 0 nor NaN, which assumes that when there is no flow you have a value of 0:
In [12]:
df[(df.Flow.shift(3) != 0) & (df.Flow.shift(3).notnull())]
Out[12]:
   Time_(min)      Flow  Supply_T   Return_T
5          50  0.778125  101.3998  74.686288
6          51  0.788500  102.1189  74.490528
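Put together as a runnable sketch on the sample rows from the question (column names as in the output above), the filter keeps a row only once Flow has been non-zero for three consecutive samples:

```python
import pandas as pd

# The sample rows from the question.
df = pd.DataFrame({
    "Time_(min)": [45, 46, 47, 48, 49, 50, 51],
    "Flow": [0, 0, 1.338375, 2.267975, 0.778125, 0.778125, 0.7885],
})

# Keep a row only if the Flow value 3 rows earlier exists and is non-zero,
# i.e. the system has already been running for 3 samples.
steady = df[(df.Flow.shift(3) != 0) & df.Flow.shift(3).notnull()]
```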
