I have two data frames in pandas, one with four time series of second-by-second data, like the following:
timestamp ID value1 value2 value3 value4
2016/01/01T01:01:01 1234 100 50 50 60
2016/01/01T01:01:02 1234 101 48 48 52
....
and a second with averages over selected intervals:
ID start_time end_time avg_value1 avg_value2 avg_value3 avg_value4
1234 01:01:01 01:01:15 100.1 50.2 49 55
...
I would like to plot these two as time series superimposed on each other, with the averages appearing as flat lines starting at start_time and ending at end_time. How would I go about doing this in the latest version of pandas?
The easiest way is to put all the data into a single DataFrame and use the built-in .plot() method.
Assuming your original DataFrame is called df, the code below should solve your issue (you might need to strip out the "ID" column first):
# assumes df is indexed by its timestamp column (a DatetimeIndex);
# pd.TimeGrouper was removed from pandas, use pd.Grouper instead
means = df.groupby(pd.Grouper(freq='15s')).mean()
means.columns = ['avg_' + col for col in df.columns]
# align the averages with the raw rows and forward-fill across each interval
merged_df = pd.concat([df, means], axis=1).ffill()
merged_df.plot()
Using some intraday 1-second candle stock data, you get something like this:
If you want to further customize your plots I am afraid you will have to spend a few hours/days studying the basics of matplotlib.
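If you instead want the flat lines to come straight from your second DataFrame (rather than recomputing the averages), matplotlib's hlines draws exactly such segments. A rough sketch, assuming the second frame is called avg_df, that its start_time/end_time columns hold full timestamps on the same scale as df, and that df.timestamp is already datetime:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# raw second-by-second series, plotted with matplotlib directly so the
# x-axis uses plain datetimes
ax.plot(df['timestamp'], df['value1'], label='value1')

# one flat segment per averaging interval
for _, row in avg_df.iterrows():
    ax.hlines(row['avg_value1'], row['start_time'], row['end_time'], colors='red')

ax.legend()
plt.show()
The same loop can be repeated for the other three value columns.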
I have a data frame like this (it's just the head):
Timestamp Function_code Node_id Delta
0 2000-01-01 10:39:51.790683 Tx_PDO_2 54 551.0
1 2000-01-01 10:39:51.791650 Tx_PDO_2 54 601.0
2 2000-01-01 10:39:51.792564 Tx_PDO_3 54 545.0
3 2000-01-01 10:39:51.793511 Tx_PDO_3 54 564.0
There are only two types of Function_code: Tx_PDO_2 and Tx_PDO_3.
I plot two windows, each a graph with Timestamp on the x-axis and Delta on the y-axis: one for Tx_PDO_2 and the other for Tx_PDO_3:
delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta")
Now I want to know which window corresponds to which Function_code.
I tried to use title=delta_rx_tx_df.groupby("Function_code").groups but it did not work.
There may be a better way, but for starters, you can assign the titles to the plots after they are created:
# .plot on a GroupBy returns a Series of matplotlib Axes, indexed by group key
plots = delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta")
# move the group key into a column, then set each Axes' title from it
plots.reset_index()\
     .apply(lambda x: x[0].set_title(x['Function_code']), axis=1)
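Equivalently, and perhaps more readably, you can just loop over that Series of Axes (a sketch relying on the same return value as above):
for func_code, ax in plots.items():
    ax.set_title(func_code)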
I have a dataframe with two columns: timeStamp and eventMessage (string).
timeStamp: eventMessage:
2020-10-19T10:07:56.7450775+02:00 transaction successful
2020-10-19T10:08:13.025169+02:00 transaction successful
I want to end up with a dataframe that has two columns : hour and numberOfEvents per that hour.
hour: numberOfEvents:
1 41
2 0
... ...
24 32
I've tried df.resample('H', on='timeStamp', how='count'), but I think how='count' is deprecated now?
Is there a new quick pandas way to do it?
UPDATE: thanks to Ami Tavory's tip, the df now looks like this:
timeStamp
10 792
11 792
14 594
15 198
16 198
I'm not actually sure whether it's a DataFrame with one column or some other type entirely. And how do I fill in the hours that had zero events?
Miniupdate: It's pandas.core.series.Series
Converted it to df with:
series = df.eventMessage.groupby(pd.to_datetime(df.timeStamp).dt.hour).count()
df2 = pd.DataFrame({'hour': series.index, 'counted': series.values})
Now I just need to figure out how to add and fill in other hours from 1 to 24 that had no events with a zero.
Regarding your new question (after the edit).
Converted it to df with:
You can more easily convert it with:
df = series.to_frame()
Now I just need to figure out how to add and fill in other hours from 1 to 24 that had no events with a zero.
import numpy as np
# .dt.hour yields 0-23, so cover all 24 hours and fill missing ones with 0
new_index = pd.Index(np.arange(0, 24), name="hour")
df = df.set_index("hour").reindex(new_index).fillna(0)
Group by the hour, and count:
df.eventMessage.groupby(pd.to_datetime(df.timeStamp).dt.hour).count()
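Putting the pieces together, here is a minimal end-to-end sketch (column names follow the question; the zero-fill assumes the 0-23 hours produced by .dt.hour):
import numpy as np
import pandas as pd

# count events per hour of day, filling hours with no events with 0
hourly = (
    df.eventMessage
      .groupby(pd.to_datetime(df.timeStamp).dt.hour)
      .count()
      .reindex(np.arange(0, 24), fill_value=0)
      .rename_axis("hour")
      .reset_index(name="numberOfEvents")
)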
I have the following dataset:
dataset.head(7)
Transaction_date Product Product Code Description
2019-01-01 A 123 A123
2019-01-02 B 267 B267
2019-01-09 B 267 B267
2019-02-11 C 139 C139
2019-02-11 A 125 C125
2019-02-12 C 139 C139
2019-02-12 A 123 A123
The dataset stores transaction information, for which a transaction date is available. In other words, data is not available for every day.
Ultimately, I want to create a time series plot, showing me the number of transactions per day.
So far, I have done a simple countplot:
ax = sns.countplot(x=dataset["Transaction_date"], data=dataset)
This plot shows me the dates where a transaction happened, but I would also like to see the dates where no transaction happened, preferably shown as 0.
I have tried the following, but get an error message:
groupbydate = dataset.groupby("Transaction_date")
ax = sns.tsplot(x="Transaction_date", y="Product", data=groupbydate.fillna(0))
But I get the error
cannot label index with a null key
Due to restrictions, I can only use seaborn 0.8.1
I believe reindex should work for you:
# First convert the index to datetime (this assumes Transaction_date is the index)
dataset.index = pd.DatetimeIndex(dataset.index)
# Then reindex! You can also use the min and max of the index for the limits;
# leaving fill_value unset inserts real NaN (the string "NaN" would not be one)
dataset = dataset.reindex(pd.date_range("2019-01-01", "2019-02-12"))
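For the actual goal, a daily transaction count that includes the zero days, here is a minimal sketch building on the same reindex idea (it assumes Transaction_date parses as a datetime):
import pandas as pd

# count transactions per day, then fill in the missing days with 0
dates = pd.to_datetime(dataset["Transaction_date"])
daily = (
    dates.value_counts()
         .sort_index()
         .reindex(pd.date_range(dates.min(), dates.max()), fill_value=0)
)
daily.plot()  # simple time series of transactions per day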
You can drop the rows containing NaN values using pandas.DataFrame.dropna, and then plot the chart. For example:
dataset.dropna(thresh=2)
keeps only the rows with at least two non-NaN values and drops the rest.
You may also want to fill the NaN values using pandas.DataFrame.fillna
I'm using/learning pandas to load a CSV-style dataset in which a time column can be used as the index. The data is sampled at roughly 100 Hz. Here is a simplified snippet of the data:
Time (sec) Col_A Col_B Col_C
0.0100 14.175 -29.97 -22.68
0.0200 13.905 -29.835 -22.68
0.0300 12.257 -29.32 -22.67
... ...
1259.98 -0.405 2.205 3.825
1259.99 -0.495 2.115 3.735
There are 20 min of data, resulting in about 120,000 rows at 100 Hz. My goal is to select those rows within a certain time range, say 100-200 sec.
Here is what I've figured out:
import pandas as pd
df = pd.DataFrame(my_data) # my_data is a numpy array
df.set_index(0, inplace=True)
df.columns = ['Col_A', 'Col_B', 'Col_C']
df.index = pd.to_datetime(df.index, unit='s', origin='1900-1-1') # the date in origin is just a placeholder
My dataset doesn't include a date. How can I avoid setting a fake date like I did above? It feels wrong, and it is also quite annoying when I plot the data against time.
I know there are ways to remove the date from the datetime object, like here.
But my goal is to select rows within a certain time range, which means I need to use pd.date_range(). This function does not seem to work without a date.
It's not the end of the world if I just use a fake date throughout my project. But I'd like to know if there are more elegant ways around it.
I don't see why you need to use datetime64 objects for this. Your time column is a number, so you can very easily select time intervals with inequalities. You can also plot the columns without issue.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Time': np.arange(0, 1200, 0.01),
                   'Col_A': np.random.randint(1, 100, 120000),
                   'Col_B': np.random.randint(1, 10, 120000)})
Select Data between 100 and 200 seconds.
df[df.Time.between(100,200)]
Outputs:
Time Col_A Col_B
10000 100.00 75 9
10001 100.01 23 7
...
19999 199.99 39 7
20000 200.00 25 2
Plotting against time
#First 100 rows just for illustration
df[0:100].plot(x='Time')
Convert to timedelta64
If you really wanted to, you could convert the column to timedelta64[ns]:
# equivalently: df['Time'] = pd.to_timedelta(df.Time, unit='s')
df['Time'] = pd.to_datetime(df.Time, unit='s') - pd.to_datetime('1970-01-01')
print(df.head())
# Time Col_A Col_B
#0 00:00:00 67 6
#1 00:00:00.010000 93 1
#2 00:00:00.020000 99 3
#3 00:00:00.030000 18 2
#4 00:00:00.040000 84 3
df.dtypes
#Time timedelta64[ns]
#Col_A int32
#Col_B int32
#dtype: object
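A nice side effect, sketched here under the assumption that the conversion above has been applied: with the timedeltas set as the index, you can slice time ranges with plain strings, since a TimedeltaIndex supports partial string indexing:
# select the 100-200 second window by string slicing on the TimedeltaIndex
window = df.set_index('Time').loc['100s':'200s']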
I'm facing a bit of an issue adding a new column to my pandas DataFrame: I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip ID. Imagine the DataFrame looks kind of like this:
TripID Lat Lon time
0 42 53.55 9.99 74
1 42 53.58 9.99 78
3 42 53.60 9.98 79
6 12 52.01 10.04 64
7 12 52.34 10.05 69
Now I would like to delete the records of all trips that have fewer than a minimum number of records. I figured I could simply get the number of records of each trip like so:
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series, matched to the trip ID of each record. I would then be able to get rid of all rows in which the value of the length column is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would any one have an idea for that or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
TripID Lat Lon time length
0 42 53.55 9.99 74 3
1 42 53.58 9.99 78 3
3 42 53.60 9.98 79 3
6 12 52.01 10.04 64 2
7 12 52.34 10.05 69 2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Groupby, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID').time.agg(['min', 'max']).reset_index()
df_new['length_of_trip'] = df_new.max - df_new.min
df_new = df_new.loc[df_new.length_of_trip > 90] # to pick a random number
That'll get you all the rows with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df[["TripID", "time"]].groupby("TripID").transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the pandas documentation. It gets rid of all groups that have two or fewer elements, i.e. trips of two records or fewer in my case.
I hope this will help someone else out as well.
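For completeness, the mapping idea sketched in the question also works directly; a minimal version (the length column name follows the desired output above):
# value_counts gives the record count per TripID; map aligns it onto each row
df['length'] = df['TripID'].map(df['TripID'].value_counts())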