I have a pandas df similar to the following:
time price
00:00:00 2
00:10:00 6
00:20:00 3
01:25:00 16
02:25:00 7
etc...
I would like to plot on the same graph:
A time series showing the variation of price as a function of time.
A bar graph showing the number of observations taken in each hour. For the example above, at 01:00:00 there is a bar of height 3 (from 00:00:00 to 01:00:00 we had 3 observations) and at 02:00:00 a bar of height 1.
I was able to transform the data to get the bar graph data as a separate df using .groupby(pd.Grouper(key='time', freq='H')). Now I have two dfs and am trying to plot them on the same figure.
My data is around 100K+ points, already cleaned of outliers.
Any pointers?
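One possible approach, sketched below under the assumption that time is a datetime64 column and that the hourly counts come from the Grouper you already have: draw the price line on one axis and the hourly counts as bars on a twinned axis.

import pandas as pd
import matplotlib.pyplot as plt

# df is the dataframe from the question, with 'time' and 'price' columns.
# Hourly observation counts -- the second df from the question.
counts = df.groupby(pd.Grouper(key='time', freq='H'))['price'].count()

fig, ax1 = plt.subplots(figsize=(12, 6))
ax2 = ax1.twinx()  # second y-axis sharing the same time axis

# bars for the per-hour counts, kept translucent so the line stays visible
ax2.bar(counts.index, counts.values, width=pd.Timedelta(minutes=50),
        align='edge', alpha=0.3, color='grey')
ax1.plot(df['time'], df['price'], color='tab:blue')

ax1.set_ylabel('price')
ax2.set_ylabel('observations per hour')
plt.show()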
I have massive data from a CSV spanning every hour for a whole year. Plotting the whole data (or specific columns) through the whole year has not been difficult.
However, I would like to take a closer look at a single month (for example, just plot January or February), and for the life of me, I haven't found out how to do that.
Date Company1 Company2
2020-01-01 00:00:00 100 200
2020-01-01 01:00:00 110 180
2020-01-01 02:00:00 90 210
2020-01-01 03:00:00 100 200
.... ... ...
2020-12-31 21:00:00 100 200
2020-12-31 22:00:00 80 230
2020-12-31 23:00:00 120 220
All of the columns are correctly formatted, the datetime is correctly formatted. How can I slice or define exactly the period I want to plot?
You can extract the month portion of a pandas datetime using .dt.month on a datetime series. Then check if that is equal to the month in question:
df_january = df[df['Date'].dt.month == 1]
You can then plot using your df_january dataframe. N.B. this will also pick up January data from other years if your dataset ever expands to cover more than one year.
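If that matters, a small variant (assuming the 2020 dates from the question) filters on year and month together:

# keep only January 2020, even if other years are present
df_january = df[(df['Date'].dt.year == 2020) & (df['Date'].dt.month == 1)]
df_january.plot(x='Date', y=['Company1', 'Company2'])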
@WakemeUpNow had the solution I hadn't noticed: defining xlim while plotting did the trick.
import matplotlib.pyplot as plt

df.plot(x='Date', y=['Company1', 'Company2'], xlim=('2020-01-01 00:00:00', '2020-01-31 23:00:00'))
plt.show()
I have two separate DataFrames, which both contain rainfall amounts and dates corresponding to them.
df1:
time tp
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 0.0
3 2013-01-01 03:00:00 0.0
4 2013-01-01 04:00:00 0.0
... ...
8755 2013-12-31 19:00:00 0.0
8756 2013-12-31 20:00:00 0.0
8757 2013-12-31 21:00:00 0.0
8758 2013-12-31 22:00:00 0.0
8759 2013-12-31 23:00:00 0.0
[8760 rows x 2 columns]
df2:
time tp
0 2013-07-18T18:00:01 0.002794
1 2013-07-18T20:00:00 0.002794
2 2013-07-18T21:00:00 0.002794
3 2013-07-18T22:00:00 0.002794
4 2013-07-19T00:00:00 0.000000
... ...
9656 2013-12-30T13:30:00 0.000000
9657 2013-12-30T23:30:00 0.000000
9658 2013-12-31T00:00:00 0.000000
9659 2013-12-31T00:00:00 0.000000
9660 2014-01-01T00:00:00 0.000000
[9661 rows x 2 columns]
I'm trying to plot a scatter graph comparing the two data frames. The way I'm doing it is by choosing a specific date and time and plotting the df1 tp on one axis and df2 tp on the other axis.
For example,
If the date/time on both dataframes = 2013-12-31 19:00:00, then plot tp for df1 on the x-axis and tp for df2 on the y-axis.
To solve this, I tried using the following:
df1['dates_match'] = np.where(df1['time'] == df2['time'], 'True', 'False')
which will tell me if the dates match, and if they do I can plot. The problem arises because I have a different number of rows in each dataframe, and most methods only allow comparison of dataframes with exactly the same number of rows.
Does anyone know of an alternative method I could use to plot the graph?
Thanks in advance!
The main goal is to compare two time series that apparently don't share the same frequency by plotting them together.
Since the main issue here is the mismatched timestamps, let's tackle that with pandas resample so that each observation sits on a uniform timestamp grid. To take the sum over 30-minute intervals you can do the following (feel free to change the interval and the aggregation function):
df1.set_index("time", inplace=True)
df2.set_index("time", inplace=True)
df1_resampled = df1.resample("30T").sum()  # taking the sum of 30-minute intervals
df2_resampled = df2.resample("30T").sum()  # taking the sum of 30-minute intervals
Now that the timestamps are aligned you can join the two resampled dataframes and plot them:
df_joined = df1_resampled.join(df2_resampled, lsuffix="_1", rsuffix="_2")
df_joined.plot(marker="o", figsize=(12,6))
# df_joined.plot(subplots=True) if you want to plot them separately
Since df1 starts on 2013-01-01 and df2 on 2013-07-18, there will be an initial period where only df1 exists. If you want to keep only the overlapping period, pass how="inner" when joining the two dataframes.
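If you specifically want the scatter comparison described in the question (df1's tp on one axis, df2's on the other), the joined frame gives it directly; the tp_1/tp_2 names come from the suffixes used above:

# drop timestamps missing from either frame, then compare the two columns
df_joined.dropna().plot.scatter(x="tp_1", y="tp_2")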
Question:
I have a timeseries dataset with irregular intervals, and I want to compute the averages per regular time interval.
What is the best way to do this in Python?
Example:
Below a simplified dataset as a Pandas series:
import pandas as pd
from datetime import timedelta

base = pd.to_datetime('2021-01-01 12:00')
mydict = {
    base: 5,
    base + timedelta(minutes=5): 10,
    base + timedelta(minutes=7): 12,
    base + timedelta(minutes=12): 6,
    base + timedelta(minutes=25): 8,
}
series = pd.Series(mydict)
Returns:
2021-01-01 12:00:00 5
2021-01-01 12:05:00 10
2021-01-01 12:07:00 12
2021-01-01 12:12:00 6
2021-01-01 12:25:00 8
My solution:
I want to resample this to a regular 15 minute interval and take the mean. I can do this by first resampling to a very small interval (seconds) and then resampling to 15 minutes:
series.resample('S').ffill().resample('15T').mean()
Returns:
2021-01-01 12:00:00 8.200000
2021-01-01 12:15:00 6.003328
It does not feel Pythonic to first resample to a small interval before resampling to the desired interval, and I expect it will also get quite slow with large datasets that require high accuracy. Is there a better way to do this?
P.S. In case you are wondering: If you resample to 15 minutes right away you do not get the desired result:
series.resample('15T').mean()
Returns:
2021-01-01 12:00:00 8.25
2021-01-01 12:15:00 8.00
If the timestamps in your data represent breakpoints between intervals, then your data describes a step function. You can use a package called staircase which is built upon pandas and numpy for analysis with step functions.
Using the setup code you provided, create a staircase.Stairs object from series. These objects represent step functions and are to staircase what Series are to pandas.
import staircase as sc
sf = sc.Stairs.from_values(initial_value=0, values=series)
There are lots of things you can do with Stairs objects, including plotting:
sf.plot(style="hlines")
Next, create your 15-minute bins, e.g.
bins = pd.date_range(base, periods=5, freq="15min")
bins looks like this
DatetimeIndex(['2021-01-01 12:00:00', '2021-01-01 12:15:00',
'2021-01-01 12:30:00', '2021-01-01 12:45:00',
'2021-01-01 13:00:00'],
dtype='datetime64[ns]', freq='15T')
Next we slice the step function into pieces with the bins and take the mean of each piece. This is analogous to groupby-apply with dataframes in pandas.
means = sf.slice(bins).mean()
means is a pandas.Series indexed by the bins (a pandas.IntervalIndex) with the mean values
[2021-01-01 12:00:00, 2021-01-01 12:15:00) 8.200000
[2021-01-01 12:15:00, 2021-01-01 12:30:00) 6.666667
[2021-01-01 12:30:00, 2021-01-01 12:45:00) 8.000000
[2021-01-01 12:45:00, 2021-01-01 13:00:00) 8.000000
dtype: float64
If you just want the start points of the intervals as the index, you can do this:
means.index = means.index.left
Or, similarly, use the right endpoints. If you're feeding this data into an ML algorithm, use the right endpoints to avoid data leakage.
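The endpoint variant is the analogous one-liner, using the IntervalIndex attribute:

means.index = means.index.right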
Summary:
import staircase as sc
sf = sc.Stairs.from_values(initial_value=0, values=series)
bins = pd.date_range(base, periods=5, freq="15min")
means = sf.slice(bins).mean()
Note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
I would like to plot a timedelta with pandas 0.22.0. Unfortunately, the y-axis only goes up to just over 3. Why is that?
import pandas as pd
df = pd.DataFrame({'date': ['2017-07-01', '2017-07-02', '2017-07-03', '2017-07-04', '2017-07-05'], 'minutes': [195, 69, 76, 25, 540]})
df.index = pd.to_datetime(df['date'])
series = pd.Series(data=pd.to_timedelta(df['minutes'], 'm'))
series.plot()
With series.describe() it shows me everything correctly:
series.describe()
Out[6]:
count 5
mean 0 days 03:01:00
std 0 days 03:30:20.768597
min 0 days 00:25:00
25% 0 days 01:09:00
50% 0 days 01:16:00
75% 0 days 03:15:00
max 0 days 09:00:00
Name: minutes, dtype: object
[Picture of the plot: the y-axis tops out just above 3.]
The timedeltas are being stored as timedelta64[ns] by default, so you're seeing the data in nanoseconds. However, when you run describe, the stats appear in days. If you convert to seconds it becomes much clearer:
series.dt.seconds.describe()
count 5.000000
mean 10860.000000
std 12620.768598
min 1500.000000
25% 4140.000000
50% 4560.000000
75% 11700.000000
max 32400.000000
And we can see the max at 32400 seconds, which appears to match your plot and data. However, you're currently plotting in nanoseconds, which you will see if you hover the mouse over the values and check the y-axis. You may want to construct your plot like so:
series.dt.seconds.plot()
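If you would rather see minutes on the y-axis, one alternative (using the series from the question) is to divide by a Timedelta, which yields plain floats:

# dividing a timedelta Series by a Timedelta produces float minutes
(series / pd.Timedelta(minutes=1)).plot()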
Imagine you have a DataFrame containing value observations of variables. Each observation is saved as a triple (Variable, Timestamp, Value). This layout is essentially an "observation dataframe".
#Variable Time Value
#852-YF-007 2016-05-10 23:00:00 4
#852-YF-007 2016-05-11 04:00:00 4
#...
#852-YF-008 2016-05-10 23:00:00 5
#852-YF-008 2016-05-11 04:00:00 3
#...
#852-YF-009 2016-05-10 23:00:00 2
#852-YF-009 2016-05-11 04:00:00 9
#...
That data is loaded into a Spark DataFrame, and the timestamps are sampled so that there is one value per variable for each specific timestamp.
Question: How can I convert/transpose that efficiently into a "Instants Dataframe" like this:
#Time 852-YF-007 852-YF-008 852-YF-009
#2016-05-10 23:00:00 4 5 2
#2016-05-11 04:00:00 4 3 9
#...
The number of columns depends on the number of variables. Each column is a time series (all sampled values for that variable) while the rows are the timestamps. Note: the number of timestamps will be much larger than the number of variables.
Update: this is related to pivot tables, but I do not have a fixed number of columns; that number varies with the number of variables.
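A minimal sketch of the usual approach, assuming PySpark and the column names from the example: group by the timestamp and pivot on the variable name. pivot() discovers the distinct variable names itself, so the number of output columns does not need to be fixed in advance.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# the observation dataframe from the example
df = spark.createDataFrame(
    [("852-YF-007", "2016-05-10 23:00:00", 4),
     ("852-YF-007", "2016-05-11 04:00:00", 4),
     ("852-YF-008", "2016-05-10 23:00:00", 5),
     ("852-YF-008", "2016-05-11 04:00:00", 3),
     ("852-YF-009", "2016-05-10 23:00:00", 2),
     ("852-YF-009", "2016-05-11 04:00:00", 9)],
    ["Variable", "Time", "Value"],
)

# one output column per distinct Variable; first() is safe because each
# (Time, Variable) pair holds exactly one sampled value
instants = df.groupBy("Time").pivot("Variable").agg(F.first("Value"))
instants.orderBy("Time").show()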