I have a regularly spaced time series stored in a pandas data frame:
1998-01-01 00:00:00 5.71
1998-01-01 12:00:00 5.73
1998-01-02 00:00:00 5.68
1998-01-02 12:00:00 5.69
...
I also have a list of dates that are irregularly spaced:
1998-01-01
1998-07-05
1998-09-21
....
I would like to calculate the average of the time series between each time interval of the list of dates. Is this somehow possible using pandas.DataFrame.resample? If not, what is the easiest way to do it?
Edited:
For example, calculate the mean of 'series' in between the dates in 'dates', created by the following code:
import pandas as pd
import numpy as np
import datetime
rng = pd.date_range('1998-01-01', periods=365, freq='D')
series = pd.DataFrame(np.random.randn(len(rng)), index=rng)
dates = [pd.Timestamp('1998-01-01'), pd.Timestamp('1998-07-05'), pd.Timestamp('1998-09-21')]
You can loop through the dates and select only the rows falling between consecutive dates like this:
import pandas as pd
import numpy as np
import datetime
rng = pd.date_range('1998-01-01', periods=365, freq='D')
series = pd.DataFrame(np.random.randn(len(rng)), index=rng)
dates = [pd.Timestamp('1998-01-01'), pd.Timestamp('1998-07-05'), pd.Timestamp('1998-09-21')]
for i in range(len(dates) - 1):
    start = dates[i]
    end = dates[i + 1]
    sample = series.loc[(series.index > start) & (series.index <= end)]
    print(f'Mean value between {start} and {end} : {sample.mean()[0]}')
# Output
Mean value between 1998-01-01 00:00:00 and 1998-07-05 00:00:00 : -0.024342221543215112
Mean value between 1998-07-05 00:00:00 and 1998-09-21 00:00:00 : 0.13945008064765074
Instead of a loop, you can also use a list comprehension like this,
print([series.loc[(series.index > dates[i]) & (series.index <= dates[i+1])].mean()[0] for i in range(len(dates) - 1) ]) # [-0.024342221543215112, 0.13945008064765074]
You could iterate over the dates like this:
for ti in range(1, len(dates)):
    start_date, end_date = dates[ti - 1], dates[ti]
    mask = (series.index > start_date) & (series.index <= end_date)
    print(series[mask].mean())
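To address the resample part of the question: resample needs a fixed frequency rule, so it does not apply directly to irregular boundaries, but you can get the same interval means without an explicit loop by binning the index with pandas.cut. A minimal sketch, assuming the series and dates from the example above (pd.cut uses right-closed intervals by default, matching the (start, end] selection in the loops):
import pandas as pd

# Assign each timestamp of the series to one of the (start, end] intervals defined by `dates`
bins = pd.cut(series.index, bins=dates)
interval_means = series.groupby(bins).mean()
print(interval_means)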
Related
Is there a way to create a dataframe (using pandas) from a start date to an end date with random values?
For example, the required data frame:
date value
2015-06-25 12
2015-06-26 23
2015-06-27 3.4
2015-06-28 5.6
In this dataframe I need to set a "from date" and a "to date" so that there is no need to type the dates manually (the number of rows keeps increasing), and the "value" column should be filled with randomly generated values.
You can use pandas.date_range and numpy.random.uniform:
import pandas as pd
import numpy as np

dates = pd.date_range('2015-06-25', '2015-06-28', freq='D')
df = pd.DataFrame({'date': dates,
                   'value': np.random.uniform(0, 50, size=len(dates))})
Output:
date value
0 2015-06-25 39.496971
1 2015-06-26 29.947877
2 2015-06-27 7.328549
3 2015-06-28 6.738427
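If you need the random values to be reproducible between runs, you can seed the generator first; a small sketch using NumPy's Generator API (this seeding step is an addition, not part of the answer above):
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the values are identical on every run
dates = pd.date_range('2015-06-25', '2015-06-28', freq='D')
df = pd.DataFrame({'date': dates,
                   'value': rng.uniform(0, 50, size=len(dates))})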
import pandas as pd
import numpy as np

beg = '2015-06-25'
end = '2015-06-28'
index = pd.date_range(beg, end)
values = np.random.rand(len(index))
df = pd.DataFrame(values, index=index)
df
index                                 0
2015-06-25 00:00:00  0.865455754057059
2015-06-26 00:00:00  0.959574113247196
2015-06-27 00:00:00  0.8634928651970131
2015-06-28 00:00:00  0.6448579391510935
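If you prefer the two-column layout from the question (a date column plus a value column) rather than a date index, you can move the index into a column; a small sketch based on the df built above:
# Turn the DatetimeIndex into a regular 'date' column and give the value column a name
df = df.reset_index().rename(columns={'index': 'date', 0: 'value'})
print(df)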
id timestamp energy
0 a 2012-03-18 10:00:00 0.034
1 b 2012-03-20 10:30:00 0.052
2 c 2013-05-29 11:00:00 0.055
3 d 2014-06-20 01:00:00 0.028
4 a 2015-02-10 12:00:00 0.069
I want to plot these data like below, with just the time on the x-axis (not the date or the full datetime), because I want to see the values for each hour of the day.
https://i.stack.imgur.com/u73eJ.png
But this code plots like this:
plt.plot(df['timestamp'], df['energy'])
https://i.stack.imgur.com/yd6NL.png
I tried some code, but it just formats the x data to hide the date part and plots like the second graph.
Note: df['timestamp'] is of datetime type.
What should I do? Thanks.
You can convert your datetime into time. If your df["timestamp"] is already in datetime format, then:
df["time"] = df["timestamp"].map(lambda x: x.time())
plt.plot(df['time'], df['energy'])
if df["timestamp"] is of type string then you can add one more line in front as df["timestamp"] = pd.to_datetime(df["timestamp"])
Update: it looks like matplotlib does not accept time types, so just convert to a string:
df["time"] = df["timestamp"].map(lambda x: x.strftime("%H:%M"))
plt.scatter(df['time'], df['energy'])
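If you would rather keep a numeric x-axis than plot against string labels, another option (not part of the answer above) is to convert each timestamp to a fractional hour of the day; a minimal sketch, assuming df['timestamp'] is already of datetime type:
import matplotlib.pyplot as plt

# Fractional hours since midnight, e.g. 10:30 -> 10.5
df['hour_of_day'] = df['timestamp'].dt.hour + df['timestamp'].dt.minute / 60

plt.scatter(df['hour_of_day'], df['energy'])
plt.xlabel('hour of day')
plt.ylabel('energy')
plt.show()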
First, check whether df["timestamp"] is in datetime format. If not:
import pandas as pd
time = pd.to_datetime(df["timestamp"])
print(type(time))
Then,
import matplotlib.pyplot as plt

values = df['energy']
plt.plot_date(time, values)
plt.xticks(rotation=45)
plt.show()
I have three different strings
'0300' , '0600' and '03125455'.
I want to convert them into pandas timestamps as
'03:00:0000' , '06:00:0000' and '03:12:5455'
so that I can interpolate the values associated with the first two at the time given by the third one. I do not have any date data. What I am using is the following:
import pandas as pd

time1 = pd.to_datetime('2018050103000000')  # Dummy date 2018-05-01
time2 = pd.to_datetime('2018050106000000')
timeX = pd.to_datetime('2018050103125455')
val1 = 100
val2 = 200
df = pd.DataFrame([(time1, val1), (time2, val2)], columns=['Times', 'Values'])
df = df.set_index('Times')
df = pd.Series(df['Values'], index=df.index)
interp = df.resample('S').interpolate(method='linear')
valX = interp.loc[timeX]
But I am getting the following error:
OverflowError: Python int too large to convert to C long
How should I properly convert these strings to datetime with or without using dummy dates? I just need time values, not dates.
Done in two steps
import pandas as pd
import datetime as dt
df = pd.DataFrame([('0300',100),('0600',200)], columns=['Times', 'Values'])
df
Out[25]:
Times Values
0 0300 100
1 0600 200
Convert to datetime column
df['Times2'] = df.Times.apply(lambda x: pd.to_datetime(x.ljust(8, '0'), format='%H%M%S%f'))
Out[49]:
Times Values Times2
0 0300 100 1900-01-01 03:00:00
1 0600 200 1900-01-01 06:00:00
Then convert the datetime column to just a time column:
df['Times2'] = df.Times2.apply(lambda x: dt.datetime.time(x))
df
Out[51]:
Times Values Times2
0 0300 100 03:00:00
1 0600 200 06:00:00
Looks like you need something like this.
Ex:
import pandas as pd

time_data = ['0300', '0600', '03125455']
for t in time_data:
    print(pd.to_datetime(t.ljust(8, "0"), format="%H%M%S%f").strftime("%H:%M:%S%f"))
Output:
03:00:00000000
06:00:00000000
03:12:54550000
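Once the strings are parsed onto a common dummy date (format='%H%M%S%f' defaults to 1900-01-01), the interpolation the question asks about can be done directly; a minimal sketch of that last step, assuming the values 100 and 200 from the question:
import pandas as pd

# All three timestamps share the default dummy date 1900-01-01
time1 = pd.to_datetime('0300'.ljust(8, '0'), format='%H%M%S%f')
time2 = pd.to_datetime('0600'.ljust(8, '0'), format='%H%M%S%f')
timeX = pd.to_datetime('03125455', format='%H%M%S%f')

s = pd.Series([100, 200], index=[time1, time2])
# Add the target timestamp to the index, then interpolate along the time axis
valX = s.reindex(s.index.union([timeX])).interpolate(method='time').loc[timeX]
print(valX)  # roughly 107.2, linearly between 100 at 03:00 and 200 at 06:00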
I have a dataset with measurements acquired almost every 2 hours over a week. I would like to calculate the mean of the measurements taken at the same time on different days. For example, I want to calculate the mean of every measurement taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# generating test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today + timedelta(72),
                                         freq='2H20MIN')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment, 'measurment': data})
df = df.set_index('measurementTimestamp')

# Calculating the mean for measurements taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I was trying to solve my problem with the rolling_mean function, but I could not find a way to apply it to my static case.
Use the built-in floor functionality of DatetimeIndex, which allows you to easily create 2-hour time bins:
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
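If you want integer hour labels like the desired output (0, 2, 4, ..., 22) instead of time-of-day labels, you can group by the even hour that starts each 2-hour bin; a small sketch with the df from the question:
# Map hours 0-1 -> 0, 2-3 -> 2, ..., 22-23 -> 22 and average within each bin
two_hourly_average = df.groupby(df.index.hour // 2 * 2)['measurment'].mean()
print(two_hourly_average)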
I am working on a project and I am trying to calculate the number of business days within a month. What I did was extract all of the unique months from one dataframe into a different dataframe and create a second column with:
df2['Signin Date Shifted'] = df2['Signin Date'] + pd.DateOffset(months=1)
Thus the current dataframe looks like:
I know I can do dt.daysinmonth or a timedelta but that gives me all of the days within a month including Sundays/Saturdays (which I don't want).
Use busday_count from numpy.
Ex:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Signin Date": ["2018-01-01", "2018-02-01"]})
df["Signin Date"] = pd.to_datetime(df["Signin Date"])
df['Signin Date Shifted'] = pd.DatetimeIndex(df['Signin Date']) + pd.DateOffset(months=1)
df["bussDays"] = np.busday_count( df["Signin Date"].values.astype('datetime64[D]'), df['Signin Date Shifted'].values.astype('datetime64[D]'))
print(df)
Output:
Signin Date Signin Date Shifted bussDays
0 2018-01-01 2018-02-01 23
1 2018-02-01 2018-03-01 20
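np.busday_count also accepts a holidays argument if you need to exclude specific dates in addition to weekends; a small sketch (the holiday list below is only an illustration, not part of the original answer):
import numpy as np

# Hypothetical holiday list; substitute your own calendar
holidays = ['2018-01-15', '2018-02-19']
print(np.busday_count('2018-01-01', '2018-02-01', holidays=holidays))  # 22 instead of 23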