Plotting categorical data over time in Python

I have a data set as follows:
[Time of notification], [Station], [Category]
2019-02-04 19:36:22, Location A, Alert
2019-02-04 20:06:35, Location B, Request
2019-02-05 07:04:53, Location A, Incident
Time of notification is in datetime64[ns] format. The time span is one year.
I am trying to get the following line graphs:
One per station
Time on the x axis. Preferably accumulated by day of the week and hour (e.g. all Mondays together, all Tuesdays together, and so on, so that a daily/weekly trend over the whole year becomes visible).
Number of notifications (for that station) on the y axis. Category is irrelevant.
I have tried a lot, but I am new to time series and visualization, and I am getting nowhere after hours of trying. I have been trying with plt.subplots, value_counts, etc. I also tried making this graph for one station first, but even that didn't work out.
Can anyone help?
Thank you!
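A minimal sketch of one possible approach, assuming the DataFrame is named df with a datetime64[ns] column 'Time' and a column 'Station' (the names are assumptions; adjust them to your data):
import pandas as pd
import matplotlib.pyplot as plt

df['weekday'] = df['Time'].dt.dayofweek          # 0 = Monday ... 6 = Sunday
df['hour'] = df['Time'].dt.hour
# count notifications per station for each (weekday, hour) cell, accumulated over the year
counts = df.groupby(['Station', 'weekday', 'hour']).size().rename('n').reset_index()
for station, grp in counts.groupby('Station'):
    fig, ax = plt.subplots()
    # x axis: hour of the week (0-167), so the daily/weekly trend shows in one line
    ax.plot(grp['weekday'] * 24 + grp['hour'], grp['n'])
    ax.set_title(station)
    ax.set_xlabel('hour of week (Mon 00:00 = 0)')
    ax.set_ylabel('number of notifications')
plt.show()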

Related

Understanding Plotly Time Difference units

So, I have a problem similar to this question. I have a DataFrame with a column 'diff' and a column 'date' with the following dtypes:
delta_df['diff'].dtype
>>> dtype('<m8[ns]')
delta_df['date'].dtype
>>> datetime64[ns, UTC]
According to this answer, they are (kind of) equivalent. However, when I plot using Plotly (I used histogram and scatter), the 'diff' axis has a weird unit: something like 2T, 2.5T, 3T, etc. What is this? The data in the 'diff' column looks like 0 days 00:29:36.000001, so I don't understand what is happening (the 'date' column looks like 2018-06-11 01:04:25.000005+00:00).
BTW, the diff column was generated using df['date'].diff().
So my question is:
What is this T? Is it a standard chosen by Plotly, like T = 30 mins so that 2T is 1 hour? If so, how do I check the value of the chosen T?
Maybe more importantly, how do I plot with the axis shown as it appears in the column, so that it's easier to read?
The "T" you see in the axis label of your plot represents a time unit, and in Plotly, it stands for "Time". By default, Plotly uses seconds as the time unit, but if your data spans more than a few minutes, it will switch to larger time units like minutes (T), hours (H), or days (D). This is probably what is causing the weird units you see in your plot.
It's worth noting that using "T" as a shorthand for minutes is a convention adopted by some developers and libraries because "M" is already used to represent months.
To confirm that the weird units you see are due to Plotly switching to larger time units, you can check the largest value in your 'diff' column. If the largest value is more than a few minutes, Plotly will switch to using larger time units.
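For the second part of the question, a common workaround is to convert the timedelta column to an explicit numeric unit before plotting, so the axis shows ordinary numbers. A sketch, assuming plotly.express is available and delta_df already holds the 'diff' column:
import plotly.express as px

# convert the timedelta to plain minutes so the axis is unambiguous
delta_df['diff_minutes'] = delta_df['diff'].dt.total_seconds() / 60
fig = px.histogram(delta_df, x='diff_minutes', labels={'diff_minutes': 'diff (minutes)'})
fig.show()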

How to get traded volume from tick data?

I have Excel data with price movement and traded volume.
By using mydf[mydf.columns[5]].resample('1Min').ohlc() I get OHLC data, but I don't know how to get the traded volume for each minute. I have a few issues in mind:
the tick frequency is not uniform (for one particular minute I may have, say, 100 samples and for another it may be 120, so .groupby() may not work for me)
the ohlc() function takes care of the previous issue automatically once I make column G the datetime index
Can I have code which, based on the "G" column, sums the volume in a particular minute and then subtracts the previous minute's value, so that I get the exact traded volume for that particular minute?
(Screenshots of the ohlc input and the resulting output omitted.)
I am not interested in CE as of now.
I just want another column added to this dataframe with the volume for each minute.
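One way to get there, sketched under the assumption that column G is already the DatetimeIndex and that the volume column is cumulative and named 'Volume' (both assumptions; rename to match your sheet):
# per-minute OHLC, as in the question
ohlc = mydf[mydf.columns[5]].resample('1Min').ohlc()
# take the cumulative volume at the end of each minute, then difference consecutive
# minutes to get the volume actually traded within each minute
ohlc['volume'] = mydf['Volume'].resample('1Min').last().diff()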

remove/isolate days when there is no change (in pandas)

I have annual hourly energy data for two AC systems in two hotel rooms. I want to figure out when the rooms were occupied or not by isolating/removing the days when the AC was not used for 24 hours.
I did df[df.Meter2334Diff > 0.1] for one room, which gives me all the hours when the AC was turned on; however, it also removes the hours of days when the room was most likely occupied but the AC was turned off. This is where my knowledge stops, so I enquire the assistance of the oracles of the internet.
(Screenshots: my dataframe, and the results after df[df.Meter2334Diff > 0.1].)
If I've interpreted your question correctly, you want to extract all the days from the dataframe where the Meter2334Diff value was zero?
As your data currently has an hourly frequency, we can resample it in pandas using the resample() function. We pass resample() a frequency rule that tells pandas at what time interval to aggregate the data. There are lots of options (see the docs), but in your case we can use 'D' to group by day.
Then we can calculate each day's sum of the Meter2334Diff column and keep the days whose total == 0 (obviously, without knowledge of your dataset I don't know whether 0 is the correct value).
total_daily_meter_diff = df.resample('D')['Meter2334Diff'].sum()
# keep the dates (the index) of the days whose daily total is zero
days_less_than_cutoff = total_daily_meter_diff[total_daily_meter_diff == 0].index
We can then use these days to filter the original dataset:
df.loc[df.index.floor('D').isin(days_less_than_cutoff) , :]
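Conversely, if the goal is to keep only the occupied days (dropping the zero-usage days rather than extracting them), negate the mask:
# keep every hour of the days that had some AC usage
occupied = df.loc[~df.index.floor('D').isin(days_less_than_cutoff), :]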

Group dataframe by time slots in pandas (Python)

I'm working with a dataset built from the data sent by underground sensor stations, which provide an estimate of the flow of cars going through them.
My data are grouped by hour for each sensor over the same period of time; this is what the df looks like (screenshot omitted):
I thought of finding some trends in the flow across various time slots (like morning, afternoon, evening, night).
My question is:
Is there a way to group the data for each station_id into time slots?
For example, group the data of each station from 00:00 to 06:00, from 06:00 to 12:00, and so on, and for every subgroup calculate the mean of the flow value.
Concerning the time, I'm interested in keeping only the day and the month for each time slot.
I've read the datetime documentation and tried some of its methods, but without success.
I'll appreciate anyone who replies and helps me with any tip.
Create the bins and group by them:
import pandas as pd

df = pd.read_csv('readings_by_hour.csv')
df['time'] = pd.to_datetime(df['time'])
# floor each timestamp to the start of its 6-hour slot (00:00, 06:00, 12:00, 18:00)
df['time_bins'] = df['time'].dt.floor('6h')
df.groupby(['station_id', 'time_bins'])['flow'].mean()
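To keep only the day and the month (plus the slot's start hour) as asked, one possibility is to format the bin labels before grouping; a sketch, assuming the strftime pattern below suits your needs:
# e.g. '06-15 12h' = June 15th, slot starting at 12:00
df['slot'] = df['time_bins'].dt.strftime('%m-%d %Hh')
df.groupby(['station_id', 'slot'])['flow'].mean()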

Take maximum rainfall value for each season over a time period (xarray)

I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way, the output has a shape of (4, 145, 192), indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season of every year; in other words, the output should have a shape like (40, 145, 192) (4 values for each year x 10 years).
I've looked into doing this with Dataset.resample as well, using time='3M' as the frequency, but then it doesn't split the months up correctly. If I have to, I can alter the dataset so that it starts in the right place, but I was hoping there would be an easier way, considering there's already a function that groups it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency, but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
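Put together as a sketch, assuming the rainfall variable is named 'precip' and the file is rainfall.nc (both names hypothetical):
import xarray as xr

ds = xr.open_dataset('rainfall.nc')              # hypothetical filename
# 'QS-Mar' starts quarters in Mar/Jun/Sep/Dec, matching the MAM/JJA/SON/DJF seasons
seasonal_max = ds['precip'].resample(time='QS-Mar').max('time')
# over 10 years this yields ~40 time steps: one per season per year
print(seasonal_max.sizes)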
