I'm generating some plots based on data that I'm holding in a pandas DataFrame; a snapshot of what this data (call it data) looks like is below:
CIG CLD DPT OBV P06 P12 POS POZ Q06 Q12 TMP \
2010-10-01 18:00:00 8 CL 54 N NaN NaN 0 0 NaN NaN 85
2010-10-01 21:00:00 8 CL 50 N NaN NaN 0 0 NaN NaN 89
2010-10-02 00:00:00 8 CL 51 N 0 NaN 0 0 0 NaN 81
2010-10-02 03:00:00 8 CL 52 N NaN NaN 0 0 NaN NaN 67
2010-10-02 06:00:00 8 CL 52 N 0 NaN 0 0 0 NaN 62
2010-10-02 09:00:00 8 CL 51 N NaN NaN 0 0 NaN NaN 59
...
The idea for one of the plots is to overlay traces of the TMP and DPT fields (generated with data['TMP'].plot()) on top of shading corresponding to the CLD field. For instance, the block of time between 2010-10-01 18:00:00 and 2010-10-01 19:30:00 might be a light gray, and if the next entry for CLD were something other than "CL", then the block 2010-10-01 19:30:00-2010-10-01 22:30:00 might be a darker color. That way I can see how the CLD field changes contemporaneously with the other fields.
My idea was to use a Rectangle patch from matplotlib.patches to accomplish this shading. Since I'm basing the bounds of the plot on the traces of TMP and DPT, I always know exactly what the height of the patch is, and I also always know its left boundary and its width - but the wrinkle is that I know them in datetime coordinates, not in x-y coordinates. So, if bnd_left is the left boundary as a datetime, ylo and height are floats, and width is a datetime.timedelta, I'm trying to make a patch like
shading_patch = Rectangle([bnd_left, ylo], width, height)
But this doesn't work: a TypeError is raised when the patch is constructed, since one cannot add a float and a datetime.timedelta. In the documentation, I can't find anything on how to transform datetime coordinates to floats in the native transform of the plot that the DataFrame.plot() method created when drawing the traces I want underneath.
Is there any simple way to draw patches on those plots generated with DataFrame.plot()?
OK, after some more digging a much easier solution came up - use the axvspan method. There is a caveat, though. In pandas v0.12, if you slice through a DataFrame or TimeSeries using the .ix attribute, for some reason the formatting of the x-axis dates gets broken. When you plot, you must plot with my_dataframe.plot(ax=ax, x_compat=True) and configure the ticks yourself, or the shading from axvspan won't line up.
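As a minimal sketch of this approach (with made-up values standing in for the TMP trace, and one hypothetical "CL" block chosen for shading):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, just for this sketch
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical stand-in for the TMP trace, using timestamps like the snapshot's
idx = pd.date_range("2010-10-01 18:00", periods=6, freq="3h")
tmp = pd.Series([85, 89, 81, 67, 62, 59], index=idx, name="TMP")

fig, ax = plt.subplots()
# x_compat=True keeps the x-axis in plain matplotlib date units,
# so axvspan's datetime coordinates line up with the trace
tmp.plot(ax=ax, x_compat=True)
# shade one block (here: the first two intervals) in light gray,
# behind the trace thanks to zorder=0
ax.axvspan(idx[0], idx[2], color="0.85", zorder=0)
```

Note that axvspan accepts datetime coordinates directly, which sidesteps the float-vs-timedelta problem entirely; the shading spans the full height of the axes, so you don't need ylo or height at all.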
I have a DataFrame like this (it's just the head):
Timestamp Function_code Node_id Delta
0 2000-01-01 10:39:51.790683 Tx_PDO_2 54 551.0
1 2000-01-01 10:39:51.791650 Tx_PDO_2 54 601.0
2 2000-01-01 10:39:51.792564 Tx_PDO_3 54 545.0
3 2000-01-01 10:39:51.793511 Tx_PDO_3 54 564.0
There are only two types of Function_code : Tx_PDO_2 and Tx_PDO_3
I plot, in two separate windows, a graph with Timestamp on the x-axis and Delta on the y-axis: one for Tx_PDO_2 and the other for Tx_PDO_3:
delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta", )
Now, I want to know which window corresponds to which Function_code
I tried to use title=delta_rx_tx_df.groupby("Function_code").groups but it did not work.
There may be a better way, but for starters, you can assign the titles to the plots after they are created:
plots = delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta")
plots.reset_index() \
     .apply(lambda x: x[0].set_title(x['Function_code']), axis=1)
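Alternatively, a plainer sketch of the same idea (using a hypothetical miniature of the question's DataFrame): loop over the groups yourself and pass each group's key as the title.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, just for this sketch
import pandas as pd

# hypothetical miniature of the DataFrame from the question
delta_rx_tx_df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2000-01-01 10:39:51.790683", "2000-01-01 10:39:51.791650",
        "2000-01-01 10:39:51.792564", "2000-01-01 10:39:51.793511",
    ]),
    "Function_code": ["Tx_PDO_2", "Tx_PDO_2", "Tx_PDO_3", "Tx_PDO_3"],
    "Node_id": [54, 54, 54, 54],
    "Delta": [551.0, 601.0, 545.0, 564.0],
})

axes = {}
for code, group in delta_rx_tx_df.groupby("Function_code"):
    # each group gets its own figure, titled with its Function_code
    axes[code] = group.plot(x="Timestamp", y="Delta", title=code)
```

The groupby key (code) is exactly the Function_code string, so each window is labeled with the group it shows.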
I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10 s sampling rate, but if it is necessary to have an even time interval, I guess I can just use df.resample().
Timestamp Power [W]
2022-05-05 06:00:05+02:00 2.0
2022-05-05 06:00:15+02:00 1.2
2022-05-05 06:00:25+02:00 0.3
2022-05-05 06:00:35+02:00 4.3
2022-05-05 06:00:45+02:00 1.1
...
2022-05-06 20:59:19+02:00 1.4
2022-05-06 20:59:29+02:00 2.0
2022-05-06 20:59:39+02:00 4.1
2022-05-06 20:59:49+02:00 1.3
2022-05-06 20:59:59+02:00 0.8
So I want to be able to integrate over both hours and days, so my output could look like:
Timestamp Energy [Wh]
2022-05-05 07:00:00+02:00 some values
2022-05-05 08:00:00+02:00 .
2022-05-05 09:00:00+02:00 .
2022-05-05 10:00:00+02:00 .
2022-05-05 11:00:00+02:00
...
2022-05-06 20:00:00+02:00
2022-05-06 21:00:00+02:00
(hour 07:00 is to include values between 06:00-07:00, and so on...)
and
Timestamp Energy [Wh]
2022-05-05 .
2022-05-06 .
So how do I achieve this? I was thinking I could use scipy.integrate, but my outputs look a bit weird.
Thank you.
You could create a new column representing your Timestamp truncated to hours:
df['Timestamp_hour'] = df['Timestamp'].dt.floor('h')
Please note that in that case, the rows between hour 6:00 and hour 6:59 will be grouped under hour 6, not hour 7.
Then you can group your rows by your new column before applying your integration computation:
df_integrated_hour = (
df
.groupby('Timestamp_hour')
.agg({
'Power': YOUR_INTEGRATION_FUNCTION
})
.rename(columns={'Power': 'Energy'})
.reset_index()
)
Hope this helps.
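As a sketch of what YOUR_INTEGRATION_FUNCTION could be (assuming the regular ~10 s sampling from the question): mean power in W over one hour, times 1 h, gives energy in Wh. A miniature example with a made-up constant 2 W signal:

```python
import pandas as pd

# hypothetical integration function: with regular sampling, the mean
# power (W) over one hour times 1 h gives the energy for that hour in Wh
def integrate_hour(power):
    return power.mean() * 1.0  # Wh, since each group spans one hour

# made-up data: 360 samples at 10 s covers exactly one hour of 2 W
ts = pd.date_range("2022-05-05 06:00:05", periods=360, freq="10s")
df = pd.DataFrame({"Timestamp": ts, "Power": 2.0})

df["Timestamp_hour"] = df["Timestamp"].dt.floor("h")
out = (
    df.groupby("Timestamp_hour")
      .agg({"Power": integrate_hour})
      .rename(columns={"Power": "Energy"})
      .reset_index()
)
```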
Here's a very simple solution using rectangle integration, with rectangles spaced at 10-second intervals starting at zero and therefore NOT centered exactly on the data points (assuming that the data arrives at regular intervals and none is missing), a.k.a. a simple average.
from numpy import random
import pandas as pd
times = pd.date_range('2022-05-05 06:00:04+02:00', '2022-05-06 21:00:00+02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])
hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)
hours_in_a_day = 24 # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean() * hours_in_a_day # mean power (W) * 24 h = Wh per day
daily.columns = ["Energy [Wh]"]
print(daily)
Output:
Energy [Wh]
2022-05-05 6 2.625499
7 2.365678
8 2.579349
9 2.569170
10 2.543611
11 2.742332
12 2.478145
13 2.444210
14 2.507821
15 2.485770
16 2.414057
17 2.567755
18 2.393725
19 2.609375
20 2.525746
21 2.421578
22 2.520466
23 2.653466
2022-05-06 0 2.559110
1 2.519032
2 2.472282
3 2.436023
4 2.378289
5 2.549572
6 2.558478
7 2.470721
8 2.429454
9 2.390543
10 2.538194
11 2.537564
12 2.492308
13 2.387632
14 2.435582
15 2.581616
16 2.389549
17 2.461523
18 2.576084
19 2.523577
20 2.572270
Energy [Wh]
2022-05-05 60.597007
2022-05-06 59.725029
Trapezoidal integration should give a slightly better approximation, but it's harder to implement correctly. You'd have to deal carefully with the hour boundaries - basically a matter of inserting interpolated values twice at each full hour (at 09:59:59.999 and 10:00:00). But then you'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example go from 06:00:05 back to 06:00:00. And be careful: what do you do if your measurements only start somewhere in the middle, like 06:17:23?
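A bare sketch of the trapezoidal variant, deliberately ignoring the hour-boundary and endpoint issues just described, and using hypothetical 10 s samples:

```python
import numpy as np
import pandas as pd

# hypothetical 10 s power samples (W)
idx = pd.date_range("2022-05-05 06:00:05", periods=7, freq="10s")
s = pd.Series([2.0, 1.2, 0.3, 4.3, 1.1, 2.2, 0.9], index=idx)

def trapz_wh(group):
    # trapezoidal rule with time expressed in hours, so the result is in Wh
    t = group.index.astype("int64") / 3.6e12  # ns since epoch -> hours
    y = group.to_numpy()
    dt = np.diff(t)
    return float(((y[:-1] + y[1:]) / 2 * dt).sum())

# group by the hour each sample falls in, then integrate each group
hourly = s.groupby(s.index.floor("h")).apply(trapz_wh)
```

Because no points are inserted at the hour boundaries, each hour's integral here only covers the span from its first to its last sample, which is exactly the caveat above.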
This solution uses a package called staircase, which is part of the pandas ecosystem and exists to make working with step functions (i.e. piecewise constant) easier.
It will create a Stairs object (which represents a step function) from a pandas.Series, then bin across arbitrary DatetimeIndex values, then integrate.
This solution requires staircase 2.4.2 or above
setup
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            [
                "2022-05-05 06:00:05+02:00",
                "2022-05-05 06:00:15+02:00",
                "2022-05-05 06:00:25+02:00",
                "2022-05-05 06:00:35+02:00",
                "2022-05-05 06:00:45+02:00",
            ]
        ),
        "Power [W]": [2.0, 1.2, 0.3, 4.3, 1.1],
    }
)
solution
import staircase as sc

# create step function
sf = sc.Stairs.from_values(
    initial_value=0,
    values=df.set_index("Timestamp")["Power [W]"],
)

# optional: plot
sf.plot(style="hlines")

# create the bins (datetime index) over which you want to integrate
# using 20s intervals in this example
bins = pd.date_range(
    "2022-05-05 06:00:00+02:00", "2022-05-05 06:01:00+02:00", freq="20s"
)

# slice into bins and integrate
result = sf.slice(bins).integral()
result will be a pandas.Series with an IntervalIndex and Timedelta values. The IntervalIndex retains timezone info, it just doesn't display it:
[2022-05-05 06:00:00, 2022-05-05 06:00:20) 0 days 00:00:26
[2022-05-05 06:00:20, 2022-05-05 06:00:40) 0 days 00:00:30.500000
[2022-05-05 06:00:40, 2022-05-05 06:01:00) 0 days 00:00:38
dtype: timedelta64[ns]
You can change the index to be the "left" values (and see this timezone info) like this:
result.index = result.index.left
You can change values to a float with division by an appropriate Timedelta. Eg to convert to minutes:
result/pd.Timedelta("1min")
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
I'm struggling to get a stacked vbar working.
With Python/pandas and Bokeh I want to plot several statistics about the players of a football team. The DataFrame is nicely filled; the values are strings where text is expected and numeric where a number is expected.
I used the Bokeh sample and tried to adjust it for my purpose, but I'm stuck on this error:
ValueError: Keyword argument sequences for broadcasting must be the same length as stackers
My code (without imports and scraping pieces) is:
source = ColumnDataSource(data=statsdfsource[['goals','assists','naam']])
p = figure(plot_height=250, title="Fruit Counts by Year",
           toolbar_location=None, tools="")
p.vbar_stack(['goals','assists'], x='naam', width=0.9, color=colors,
             source=source)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)
The dataframe I fill the columndatasource with is
goals assists naam
0 NaN NaN Miguel Santos
1 NaN NaN Aykut Özer
2 NaN NaN Job van de Walle
3 NaN NaN Rowen Koot
4 8.0 6.0 Perr Schuurs
5 4.0 2.0 Wessel Dammers
6 12.0 2.0 Stefan Askovski
7 1.0 NaN Mica Pinto
8 NaN NaN Christopher Braun
9 1.0 4.0 Marco Ospitalieri
10 NaN 1.0 Clint Esser
The result I want to reach is a stacked columnframe, where on the x-axis is the name of the player, with 2 columns above it, one with the goals the player made and one with the assists.
I think I'm messing up somewhere with how my dataframe is built, but I'm a bit lost as to how it should be formed (on the other hand, I can't really imagine that the dataframe doesn't fit the purpose).
When using categorical ranges, you have to tell figure what the categories for the axis are and what order you want them to show up, e.g. provide x_range something like:
# specify all the factors for the x-axis by passing x_range
p = figure(..., x_range=sorted(df.naam.unique()))
It's also possible the NaN values are a problem, since they are "contagious". I'd recommend changing them to zeros instead in any case.
Finally the error message probably indicates that your colors list is the wrong length. You are stacking two bars in each column, so the list of colors needs to also be two (one color for each "row" in the stack).
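Putting the fixes together on the data side (a sketch with a made-up subset of the question's rows; the vbar_stack call itself stays as it was, just fed these values):

```python
import numpy as np
import pandas as pd

# hypothetical subset of the question's DataFrame
statsdfsource = pd.DataFrame({
    "naam": ["Perr Schuurs", "Wessel Dammers", "Mica Pinto"],
    "goals": [8.0, 4.0, 1.0],
    "assists": [6.0, 2.0, np.nan],
})

# NaNs are "contagious" when stacking, so zero them out first
statsdfsource[["goals", "assists"]] = statsdfsource[["goals", "assists"]].fillna(0)

# one color per stacker: two stacked series -> exactly two colors
colors = ["#718dbf", "#e84d60"]

# the categorical factors to pass as x_range to figure(...)
x_range = sorted(statsdfsource["naam"].unique())
```

The hex colors here are arbitrary placeholders; the point is only that len(colors) must equal the number of stackers, which is what the ValueError is complaining about.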
I have a dataframe as show below:
df =
index boolvalue
2014-05-21 10:00:00 1
2014-05-21 11:00:00 1
2014-05-21 12:00:00 0
2014-05-21 13:00:00 1
2014-05-21 14:00:00 0
2014-05-21 15:00:00 1
....
The column just has two values, "1" and "0".
This is the original code, and the figure it produces:
plt.scatter(df.index, df.boolvalue, s=5, c='b')
plt.ylim([-2, 2])
I would like to plot it as a scatter plot, with value "1" in blue and "0" in red. Because the index (time series) is long, I think it is better not to use a for-loop. Does anyone have an idea how to do it? Thanks in advance!
You might consider if you want a scatterplot or two boxplots.
For a scatterplot, you could use
for (v, c) in [(1, 'b'), (0, 'r')]:
    plt.scatter(df.index[df.boolvalue == v], df.boolvalue[df.boolvalue == v], s=5, c=c)
Given the sample data, this colors the 1s blue and the 0s red.
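To avoid looping entirely, you could also map each value to its color in one vectorized step (a sketch using the sample data from the question):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, just for this sketch
import pandas as pd
import matplotlib.pyplot as plt

# sample data from the question
idx = pd.date_range("2014-05-21 10:00", periods=6, freq="h")
df = pd.DataFrame({"boolvalue": [1, 1, 0, 1, 0, 1]}, index=idx)

# vectorized: map 1 -> blue, 0 -> red; no Python loop over rows
colors = df["boolvalue"].map({1: "b", 0: "r"})
plt.scatter(df.index, df["boolvalue"], s=5, c=colors)
plt.ylim([-2, 2])
```

scatter accepts a sequence of colors for c, so a single call handles both values.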
For two boxplots, consider using seaborn.boxplot:
import seaborn as sns
sns.boxplot(x="boolvalue", y="index", data=df.reset_index())
I have a Pandas Series with 76 elements, when I try to print out the Series (for debugging) it is abbreviated with "..." in the output. Is there a way to pretty print all of the elements of the Series?
In this example, the Series is called "data"
print str(data)
gives me this
Open 40.4568
High 40.4568
Low 39.806
Close 40.114
Volume 796146.2
Active 1
TP1_ema 700
stop_ema_width 0.5
LS_ema 10
stop_window 210
target_width 3
LS_width 0
TP1_pct 1
TP1_width 4
stop_ema 1400
...
ValueSharesHeld NaN
AccountIsWorth NaN
Profit NaN
BuyPrice NaN
SellPrice NaN
ShortPrice NaN
BtcPrice NaN
LongStopPrice NaN
ShortStopPrice NaN
LongTargetPrice NaN
ShortTargetPrice NaN
LTP1_Price NaN
STP1_Price NaN
TradeOpenPrice NaN
TheEnd False
Name: 2000-11-03 14:00, Length: 76, dtype: object
Note the "..." inserted in the middle. I'm debugging using PTVS (Python Tools for Visual Studio) on Visual Studio 2013. I get the same behaviour with Enthought Canopy.
pd.options.display.max_rows = 100
The default is set at 60 (so dataframes or series with more elements will be truncated when printed).
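For example (with a made-up 76-element Series standing in for "data"), you can raise the limit globally, or temporarily with option_context so the global setting is untouched:

```python
import pandas as pd

s = pd.Series(range(76))

# global: Series/DataFrames with up to 100 rows now print in full
pd.options.display.max_rows = 100
print(s)

# or temporarily, restoring the previous setting on exit:
with pd.option_context("display.max_rows", 100):
    print(s)
```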