I have a DataFrame like this:
df = pd.DataFrame(data=numpy_data, columns=['value', 'date'])
       value        date
0     64.885  2018-01-11
1     74.839  2018-01-15
2     41.481  2018-01-17
3     22.027  2018-01-17
4     53.747  2018-01-18
..       ...         ...
514   61.017  2018-12-22
515   68.376  2018-12-21
516   79.079  2018-12-26
517   73.975  2018-12-26
518   76.923  2018-12-26

[519 rows x 2 columns]
I want to plot value vs. date, and I am using this:
df.plot(x='date', y='value')
And I get this:
The point here is that this plot has too many fluctuations, and I want to smooth it. My idea is to group the values by date intervals and take the mean of each interval; for example, with 10-day intervals, take the mean between July 1 and July 10 and create a point at July 5.
The long way would be: get the date range, split it into N sub-ranges with start and end dates, filter the data by date, calculate the mean of each sub-range, and put the results in another DataFrame (a rough sketch of this is below).
Is there a shorter way to do that?
PS: Ignore the peaks.
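For reference, a rough sketch of that long way, just to make the idea concrete (assuming pandas is imported as pd):
# Split the full span into fixed 10-day bins, average each bin,
# and label the result at the bin's midpoint.
df['date'] = pd.to_datetime(df['date'])
rows = []
for left in pd.date_range(df['date'].min(), df['date'].max(), freq='10D'):
    right = left + pd.Timedelta(days=10)
    chunk = df[(df['date'] >= left) & (df['date'] < right)]
    rows.append({'date': left + pd.Timedelta(days=5), 'value': chunk['value'].mean()})
smoothed = pd.DataFrame(rows)
smoothed.plot(x='date', y='value')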
One thing you could do, for instance, is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
And having taken the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
Based on yatu's answer
The problem with that answer is that rolling treats the rows by position, not by date. With some transformations, rolling can work on Timestamps and use a time span as the window [pandas.rolling]:
df = pd.DataFrame(data=numpy_data, columns=['value', 'date'])
df['date'] = pd.to_datetime(df['date'])  # parse strings into Timestamps
df = df.set_index('date')                # a time-based window needs a DatetimeIndex
df.sort_index(inplace=True)              # the index must be sorted for rolling('10d')
df.rolling('10d').mean().plot(ylim=(30, 100), figsize=(16, 5), grid=True)
Final result:
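As a footnote, the fixed-interval grouping described in the question (10-day bins rather than a sliding window) can also be done in one step with resample; a minimal sketch, assuming the DatetimeIndex built above:
# Each point is the mean of one fixed 10-day bin; shifting the labels
# by 5 days places each point roughly at the middle of its interval.
binned = df.resample('10D').mean()
binned.index = binned.index + pd.Timedelta(days=5)
binned.plot(ylim=(30, 100), figsize=(16, 5), grid=True)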
Consider this sample data created by this code:
import numpy as np
import pandas as pd

np.random.seed(0)
rng = pd.date_range('2017-09-19', periods=1000, freq='D')
randomlist = np.random.choice(1000, 10000, replace=True)
print(f'randomlist length is {len(randomlist)}')
test = pd.DataFrame({'id': randomlist[:len(rng)], 'Date': rng, 'Val': np.random.randn(len(rng))})
The desired output is a groupby on id, summing all values, but only within a particular date range of the Date column. Even more complicated than that, I want to see the total Val by id for the following dates:
from the date one month later than the earliest date for each id, up to one year after that starting date (i.e. one year plus one month after the earliest date).
So, for example, if my data appeared this way:
id Date Val
0 684 2017-09-19 0.640472
1 684 2017-10-20 -0.732568
2 501 2017-08-21 -1.141365
3 501 2017-09-22 -0.283020
4 501 2017-09-23 0.725941
5 684 2017-09-24 0.56789
I would want the groupby to only consider the dates for id 684 between 2017-10-19 (i.e. one month later than the earliest date) and 2018-10-19 (i.e. one year after the earliest date plus one month).
I have tried a straight groupby and Grouper to no avail. Neither seems able to limit the rows considered by date. Perhaps I am missing something easy? Thanks for taking a look.
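For what it's worth, the closest I can sketch by hand (using the column names from the sample above) is a manual window filter, though I was hoping for something built in:
# Each id's window starts one month after its earliest Date and ends
# one year after that start; keep rows in the window, then sum Val per id.
start = test.groupby('id')['Date'].transform('min') + pd.DateOffset(months=1)
end = start + pd.DateOffset(years=1)
result = test[test['Date'].between(start, end)].groupby('id')['Val'].sum()
print(result)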
I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but with only the rows from 1 hour before to 1 hour after a specified timestamp, i.e. only rows within this 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1 hour before 2011-07-14 06:15:00 and 1 hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
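As a side note, the slicing attempt in the question is nearly right; label-based slicing just needs .loc and a sorted DatetimeIndex. A sketch in that style, using the same df as above:
hour = pd.Timedelta(hours=1)
date = pd.Timestamp("2011-01-15 05:20:00")
# .loc slicing on a sorted DatetimeIndex is inclusive on both endpoints.
print(df.set_index('date').sort_index().loc[date - hour : date + hour])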
I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should include all dates of the year and their corresponding values). As the head and tail output below shows, certain dates and their corresponding values are missing (30th Jan & 29th Dec). There would be many more such gaps in the data frame, sometimes spanning more than one consecutive date.
Is there a way to detect the missing dates, insert them into the data frame, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows of the data frame)? I appreciate any inputs.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d').rolling('7D').mean()
If you need all the dates of the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
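If the intent is instead to keep the observed values untouched and fill only the newly inserted dates (my reading of the question, so treat this as an assumption), a possible sketch:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d')  # missing dates appear as NaN
week_mean = df['value'].rolling('7D', min_periods=1).mean()
df['value'] = df['value'].fillna(week_mean)  # fill only the gaps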
I have a dataframe with two columns: timeStamp and eventMessage (string).
timeStamp: eventMessage:
2020-10-19T10:07:56.7450775+02:00 transaction successful
2020-10-19T10:08:13.025169+02:00 transaction successful
I want to end up with a dataframe that has two columns: hour and numberOfEvents for that hour.
hour: numberOfEvents:
1 41
2 0
... ...
24 32
I've tried the df.resample('H', on='timeStamp', how='count'), but I think the how='count' is deprecated now?
Is there a new quick pandas way to do it?
UPDATE: thanks to Ami Tavory's tip, the df now looks like this:
timeStamp
10 792
11 792
14 594
15 198
16 198
I'm not actually sure if it's a dataframe with one column or some other type completely. And how do I fill in the hours that had zero events?
Mini-update: it's a pandas.core.series.Series.
Converted it to df with:
series = df.eventMessage.groupby(pd.to_datetime(df.timeStamp).dt.hour).count()
df2 = pd.DataFrame({'hour': series.index, 'counted': series.values})
Now I just need to figure out how to add and fill in other hours from 1 to 24 that had no events with a zero.
Regarding your new question (after the edit).
Converted it to df with:
You can more easily convert it with
df = series.to_frame()
Now I just need to figure out how to add and fill in other hours from 1 to 24 that had no events with a zero.
new_index = pd.Index(range(24), name="hour")  # all hours 0-23
df.set_index("hour").reindex(new_index).fillna(0)
Group by the hour, and count:
df.eventMessage.groupby(pd.to_datetime(df.timeStamp).dt.hour).count()
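Putting both parts together, a sketch of the full pipeline, counting events per hour of day and filling empty hours with zero (numberOfEvents is just the name from your question):
hours = pd.to_datetime(df.timeStamp).dt.hour
counts = (df.eventMessage.groupby(hours).count()
            .reindex(range(24), fill_value=0)  # hours with no events -> 0
            .rename_axis('hour')
            .reset_index(name='numberOfEvents'))
print(counts)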
I was trying out time series analysis with pandas data frames and found that there are easy ways to select specific rows, like all the rows of a year, the rows between two dates, etc.
For example, consider
ind = pd.date_range('2004-01-01', '2019-08-13')
data = np.random.randn(len(ind))
df = pd.DataFrame(data, index=ind)
Here, we can select all the rows between and including the dates '2014-01-23' and '2014-06-18' with
df['2014-01-23':'2014-06-18']
and all the rows of the year '2015' with just
df['2015']
Is there a similar way to select all the rows belonging to a specific month but for all years?
I found ways to get all the rows of a particular month and a particular year with syntax like
df['01-2015'] #all rows of January 2015
I was hoping pandas would have a way with simple syntax to get all rows of a month irrespective of the year. Does such a way exist?
Use DatetimeIndex.month, compare, and filter with boolean indexing:
print (df[df.index.month == 1])
0
2004-01-01 2.398676
2004-01-02 2.074744
2004-01-03 0.106972
2004-01-04 0.294587
2004-01-05 0.243768
...
2019-01-27 -1.623171
2019-01-28 -0.043810
2019-01-29 -0.999764
2019-01-30 -0.928471
2019-01-31 -0.304730
[496 rows x 1 columns]
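As a follow-up, the same boolean-indexing pattern extends to several months at once; a small sketch (the month numbers are arbitrary examples):
# All rows from January and February, across every year in the index.
print(df[df.index.month.isin([1, 2])])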