I am working with hourly monitoring data which consists of incomplete time series, i.e. several hours during a year (or during several years) will be absent from my dataframe.
I would like to determine the data capture, i.e. the percentage of values present in a month, a season, or a year.
This works with the following code (written for monthly resampling as a demonstration); however, that piece of code seems somewhat inefficient, because I need to create a second hourly dataframe and resample two dataframes.
Is there a more elegant solution to this?
import numpy as np
import pandas as pd
# create dummy series
t1 = pd.date_range(start="1997-01-01 05:00", end="1997-04-25 17:00", freq="H")
t2 = pd.date_range(start="1997-06-11 15:00", end="1997-06-15 12:00", freq="H")
t3 = pd.date_range(start="1997-06-18 00:00", end="1997-08-22 23:00", freq="H")
df1 = pd.DataFrame(np.random.randn(len(t1)), index=t1)
df2 = pd.DataFrame(np.random.randn(len(t2)), index=t2)
df3 = pd.DataFrame(np.random.randn(len(t3)), index=t3)
df = pd.concat((df1, df2, df3))
# create time index with complete hourly coverage over entire years
tstart = "%i-01-01 00:00"%(df.index.year[0])
tend = "%i-12-31 23:00"%(df.index.year[-1])
tref = pd.date_range(start=tstart, end=tend, freq="H")
dfref = pd.DataFrame(np.zeros(len(tref)), index=tref)
# count number of values in reference dataframe and actual dataframe
# Example: monthly resampling
cntref = dfref.resample("MS").count()
cnt = df.resample("MS").count().reindex(cntref.index).fillna(0)
for i in range(len(cnt.index)):
    print(cnt.index[i], cnt.values[i], cntref.values[i], cnt.values[i] / cntref.values[i])
pandas' Timedelta will do the trick:
# Time delta between consecutive rows of the df
delta = df.index.to_series().diff()
# Any delta > 1H indicates a gap; the missing time in each gap is the delta
# minus the one expected hourly step
missing = (delta[delta > pd.Timedelta('1H')] - pd.Timedelta('1H')).sum()
# Sum of missing time divided by the total period spanned by the data
ratio_missing = missing / (df.index[-1] - df.index[0])
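The data capture is then just the complement of this ratio. A small usage sketch (assuming the df from the question):
print("data capture: {:.1%}".format(1 - ratio_missing))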
You can use pd.Grouper (called pd.TimeGrouper in older pandas versions).
# Create an hourly index spanning the range of your data.
idx = pd.date_range(df.index[0].floor('H'), df.index[-1].floor('H'), freq='H')
# Use pd.Grouper to calculate, per month, the fraction of hours in the
# reference index that have an observation in `df`.
>>> (df.groupby(pd.Grouper(freq='M')).size() /
...  pd.Series(1, index=idx).groupby(pd.Grouper(freq='M')).size())
1997-01-31 1.000000
1997-02-28 1.000000
1997-03-31 1.000000
1997-04-30 0.825000
1997-05-31 0.000000
1997-06-30 0.563889
1997-07-31 1.000000
1997-08-31 1.000000
Freq: M, dtype: float64
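The question also asks about seasons and years; the same pattern works there by changing the grouping frequency. A sketch (assuming the same df and idx as above), using quarters anchored in December as meteorological seasons:
(df.groupby(pd.Grouper(freq='QS-DEC')).size() /
 pd.Series(1, index=idx).groupby(pd.Grouper(freq='QS-DEC')).size())
For calendar years, use an annual frequency such as 'AS' (or 'YS' in newer pandas) instead.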
As there have been no further suggestions, it appears as if the originally posted solution is most efficient.
Not sure about performance, but as a (very long) one-liner you can do this once you have created df... It at least has the benefit of not requiring a dummy dataframe, and it should work for any period of data input and resampling.
month_counts = df.resample('H').mean().resample('M').count() / df.resample('H').ffill().fillna(1).resample('M').count()
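A slightly more readable equivalent (a sketch, assuming the same df): upsampling to hourly frequency inserts NaN rows for the missing hours, and the monthly mean of a "value present" indicator is exactly the capture fraction. Like the one-liner above, it only covers the span between the first and last observation.
coverage = df.resample('H').mean().notna().resample('M').mean()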
Related
I have the following problem. Suppose I have a data frame consisting of three columns (a mock example follows below). Essentially, it contains three factors, A, B and C, for which I have a value on each business day within a time range.
import pandas as pd
import numpy as np
index_d = pd.bdate_range(start='10/5/2022', end='10/27/2022')
index = np.repeat(index_d,3)
values = np.random.randn(3 * len(index_d))
columns_v = len(index_d) * ["A", "B", "C"]
df = pd.DataFrame()
df["x"] = np.asarray(index)
df["y"] = values
df["factor"] = columns_v
I would like to plot the business weekly averages of the three factors over time. A business week runs from Monday to Friday. However, in the example above I start and end mid-week, so the first weekly average consists only of the data points on the 5th, 6th and 7th of October, and similarly for the last week. Ideally, the output should have the form
import datetime as dt
dt1 = dt.datetime.strptime("20221007", "%Y%m%d").date()
dt2 = dt.datetime.strptime("20221014", "%Y%m%d").date()
dt3 = dt.datetime.strptime("20221021", "%Y%m%d").date()
dt4 = dt.datetime.strptime("20221027", "%Y%m%d").date()
d = np.repeat([dt1, dt2, dt3, dt4], 3)
values = np.random.randn(len(d))
factors = 4*["A","B","C"]
df_output = pd.DataFrame()
df_output["time"] = d
df_output["values"] = values
df_output["factors"] = factors
I can then plot the weekly averages using seaborn as a lineplot with hue. Note that the time value for each weekly average is always the last business day of that week (a Friday, except for the last week, where it is a Thursday).
I was thinking of groupby. However, given that my real data is much larger and possibly contains NaNs, I'm not sure how to do it, in particular with regard to the arbitrary start/end points, which need not fall on a Monday/Friday.
Try as follows:
res = df.groupby([pd.Grouper(key='x', freq='W-FRI'), df.factor])['y'].mean()\
        .reset_index(drop=False)
res = res.rename(columns={'x': 'time', 'factor': 'factors', 'y': 'values'})
res['time'] = res.time.map(
    pd.merge_asof(df.x, res.time, left_on='x', right_on='time',
                  direction='forward')
      .groupby('time').last()['x']).astype(str)
print(res)
time factors values
0 2022-10-07 A 0.171228
1 2022-10-07 B -0.250432
2 2022-10-07 C -0.126960
3 2022-10-14 A 0.455972
4 2022-10-14 B 0.582900
5 2022-10-14 C 0.104652
6 2022-10-21 A -0.526221
7 2022-10-21 B 0.371007
8 2022-10-21 C 0.012099
9 2022-10-27 A -0.123510
10 2022-10-27 B -0.566441
11 2022-10-27 C -0.652455
Plot data:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
fig, ax = plt.subplots(figsize=(8,5))
ax = sns.lineplot(data=res, x='time', y='values', hue='factors')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.show()
Result: a line plot of the weekly mean of each factor over time (figure not shown).
Explanation
First, apply df.groupby. Grouping by factor is of course easy; for the dates we can use pd.Grouper with freq parameter set to W-FRI (each week through to Friday), and then we want to get the mean for column y (NaN values will just be ignored).
In the next step, let's use df.rename to rename the columns.
We are basically done now, except for the fact that pd.Grouper will use each Friday (even if it isn't present in the actual set). E.g.:
print(res.time.unique())
['2022-10-07T00:00:00.000000000' '2022-10-14T00:00:00.000000000'
'2022-10-21T00:00:00.000000000' '2022-10-28T00:00:00.000000000']
If you are OK with this, you can just start plotting (but see below). If you would like to get '2022-10-27' instead of '2022-10-28', we can combine Series.map applied to column time with pd.merge_asof, and perform another groupby to get last in column x. I.e. this will get us the closest match to each Friday within each week (so, in fact, just the Friday itself in all cases, except the last: 2022-10-27).
In either scenario, before plotting, make sure to turn the datetime values into strings: res['time'] = res['time'].astype(str)!
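As a possibly simpler alternative to the merge_asof construction (a sketch, assuming the same df and res as above, with res['time'] still holding the Friday bin labels): group the actual dates by the same W-FRI grouper, take the latest date in each week, and map the Friday labels through that.
last_actual = df.groupby(pd.Grouper(key='x', freq='W-FRI'))['x'].max()
res['time'] = res['time'].map(last_actual).astype(str)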
You can add a column with the calendar week:
df['week'] = df.x.dt.isocalendar().week
Get a mask for all the Fridays, and for the last day:
last_of_week = (df.x.dt.isocalendar().day == 5).values
last_of_week[-1] = True
Get the actual dates:
last_days = df.x[last_of_week].unique()
Group by week and factor, take the mean:
res = df.groupby(['week', 'factor'])['y'].mean().reset_index()
Clean up:
res = res.drop('week', axis=1)
res['x'] = pd.Series(last_days).repeat(3).reset_index(drop=True)
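To plot this result with seaborn in the same way as above (a small usage sketch; note the date column is called x and the value column y here):
import seaborn as sns
import matplotlib.pyplot as plt
res['x'] = res['x'].astype(str)  # plot the dates as categorical labels
ax = sns.lineplot(data=res, x='x', y='y', hue='factor')
plt.show()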
I have downloaded a NetCDF4 file of total hourly precipitation across Sierra Leone from 1974 to Present, and have started to create a code to analyze it.
I'm trying to form a table in Python that will display my average annual rainfall for different rainfall durations, rather like the one below (table image not shown):
I'm wondering if anyone has done anything similar before and could help me out, as I'm very new to programming.
Here is the script I've written so far, which records the hourly data for each year. From here I need to find a way to store this information in a table, then change the duration to, say, 2 hours, and repeat until I have a complete table:
import glob
import numpy as np
from netCDF4 import Dataset
import pandas as pd
import xarray as xr
all_years = []
for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[11:16]
    all_years.append(year)

year_start = '01-01-1979'
year_end = '31-12-2021'
date_range = pd.date_range(start=str(year_start),
                           end=str(year_end),
                           freq='H')
df = pd.DataFrame(0.0, columns=['tp'], index=date_range)

lat_freetown = 8.4657
lon_freetown = 13.2317

all_years.sort()
for yr in range(1979, 2021):
    data = Dataset('era5_year' + str(yr) + '.nc', 'r')
    lat = data.variables['latitude'][:]
    lon = data.variables['longitude'][:]
    sq_diff_lat = (lat - lat_freetown)**2
    sq_diff_lon = (lon - lon_freetown)**2
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    tp = data.variables['tp']

    start = str(yr) + '-01-01'
    end = str(yr) + '-12-31'
    d_range = pd.date_range(start=start, end=end, freq='H')

    for t_index in np.arange(0, len(d_range)):
        print('Recording the value for: ' + str(d_range[t_index]) + ' ' + str(tp[t_index, min_index_lat, min_index_lon]))
        df.loc[d_range[t_index], 'tp'] = tp[t_index, min_index_lat, min_index_lon]
I gave this a try, I hope it helps.
I downloaded two years of coarse US precip data here:
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2000.nc
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2001.nc
import xarray as xr
import pandas as pd
# Read two datasets and append them so there are multiple years of hourly data
# (the factor of 25.4 looks like an inches-to-millimetres conversion)
precip_full1 = xr.open_dataset('precip.hour.2000.nc') * 25.4
precip_full2 = xr.open_dataset('precip.hour.2001.nc') * 25.4
precip_full = xr.concat([precip_full1,precip_full2],dim='time')
# Select only the Western half of the US
precip = precip_full.where(precip_full.lon<257,drop=True)
# Initialize output
output = []
# Select number of hours to sum
# This assumes that the data is hourly
intervals = [1,2,6,12,24]
# Loop through each desired interval
for interval in intervals:
    # Take rolling sum
    # This means the value at any time is the sum of the preceding times
    # So when interval is 6, it's the sum of the previous six values
    roll = precip.rolling(time=interval, center=False).sum()
    # Take the annual mean and average over all space
    annual = roll.groupby('time.year').mean('time').mean(['lat', 'lon'])
    # Convert output to a pandas dataframe
    # and rename the column to correspond to the interval length
    tab = annual.to_dataframe().rename(columns={'precip': str(interval)})
    # Keep track of the output by appending it to the output list
    output.append(tab)

# Combine the dataframes into one, by columns
output = pd.concat(output, axis=1)
The output looks like this:
1 2 6 12 24
year
2000 0.014972 0.029947 0.089856 0.179747 0.359576
2001 0.015610 0.031219 0.093653 0.187290 0.374229
Again, this assumes that the data is already hourly. It also takes the average of any (for example) 6-hour period, so it's not just 00:00-06:00, 06:00-12:00, etc., it's 00:00-06:00, 01:00-07:00, etc., and then the annual mean. If you wanted the former, you could use xarray's resample function after taking the rolling sum.
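For fixed, non-overlapping windows, a minimal sketch (assuming the same hourly precip object as above) could sum straight into 6-hour bins with resample and then average as before:
# Non-overlapping 6-hour totals instead of a sliding 6-hour window
six_hour_bins = precip.resample(time='6H').sum()
annual_6h = six_hour_bins.groupby('time.year').mean('time').mean(['lat', 'lon'])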
I have taken the following sample of data.
trip_id,vehicle_id,customer_id,fleet,trip_start,distance_miles,journey_duration
1,d3550e496af4,442342ac078e,Salt Lake City,2020-06-02 16:12:22,2.30266927956152,0 days 00:13:12.549351
2,2afc10228a2b,4d3ea6d8bb4b,Provo,2020-06-02 16:17:21,0.495335235709548,0 days 00:02:48.407770
3,442342ac078e,442342ac078e,Salt Lake City,2020-06-02 16:43:05,0.7933172567617909,0 days 00:15:33.417755
4,8701da8e6582,567c93d144ed,Provo,2020-06-02 19:34:40,0.9158009891104686,0 days 00:07:04.912849
5,b70fa4bc1486,391526cd2b71,Provo,2020-06-02 20:02:37,1.6248457639858709,0 days 00:11:51.821411
6,f6f0a689fc3a,2b9d754d1c4f,Provo,2020-06-02 20:57:27,0.8310125874177197,0 days 00:07:37.959237
I read this data into a df using:
df = pd.read_clipboard(sep=',')
What I'm struggling to figure out is how to create a summary table from this information (the desired output table image is not shown here).
Here, I want to group by city, while being able to calculate the total number of unique vehicles, unique customers, and trips, and the sum of the total distance and duration (in minutes) of every journey in that city.
For example, we can see that for rows 0 and 2 there are 2 unique vehicles, but they belong to the same customer.
I have tried using groupby/summing/unique methods but have had issues when it comes to certain values I want to obtain. Any idea of where to go next? Cheers
You need to convert a few columns and then you can just group and summarise
df['trip_start'] = pd.to_datetime(df['trip_start'], format='%Y-%m-%d %H:%M:%S')
df['journey_duration'] = pd.to_timedelta(df['journey_duration'])
df['Date'] = df['trip_start'].dt.strftime('%b %Y')
df.groupby(['Date', 'fleet']).agg(
Total_Customers = ('customer_id', 'nunique'),
Total_Vehicles = ('vehicle_id', 'nunique'),
Total_Trips = ('trip_id', 'nunique'),
Total_Distance = ('distance_miles', 'sum'),
Total_Duration = ('journey_duration', 'sum'),
)
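If you want the duration in minutes rather than as a Timedelta (as the question asks), you can convert it after aggregating. A small sketch, assuming the aggregation above was assigned to a variable named summary:
summary['Total_Duration'] = summary['Total_Duration'].dt.total_seconds() / 60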
I have the following pandas dataframe:
Here is what I am trying to do:
Take the difference of values in the start_time column and find the indices with values less than 0.05
Remove those values from the end_time and start_time columns accounting for the difference
Let's take the example dataframe below. The start_time values at index 2 and 3 differ by less than 0.05 (36.956 - 36.908667 ≈ 0.047).
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
3 37.001667 36.956000 37.039667
4 37.210333 37.197333 37.306333
This is what I am trying to achieve: remove the start_time of the 3rd row and the end_time of the 2nd row.
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 37.039667
4 37.210333 37.197333 37.306333
This cannot be achieved by a simple shift.
In addition, care should be taken when dealing with consecutive start_time differences < 0.05.
Import pandas.
import pandas as pd
Read data. Note that I add one additional row to the sample data above.
df = pd.DataFrame({
'peak_time': [30.691333, 36.918000, 37.001667, 37.1, 37.210333],
'start_time': [30.670667, 36.908667, 36.956000, 36.96, 37.197333],
'end_time': [30.710333, 36.932333, 37.039667, 37.1, 37.306333]
})
Calculate the forward and backward difference of start_time column.
df['start_time_diff1'] = abs(df['start_time'].diff(1))
df['start_time_diff-1'] = abs(df['start_time'].diff(-1))
We can see that ROW 2 has both differences less than 0.05, indicating that this row has to be deleted first.
After deleting it, we record (via a shift) the end_time of the following row, which is about to be deleted in the next step.
df2 = df[~(df['start_time_diff1'].lt(0.05) & df['start_time_diff-1'].lt(0.05))].copy()
df2['end_time_shift'] = df2['end_time'].shift(-1)
Then, we can use the simple diff to filter out ROW 3.
df2 = df2[~df2['start_time_diff1'].lt(0.05)].copy()
Finally, paste the end_time to the correct place.
df2.loc[df2['start_time_diff-1'].lt(0.05), 'end_time'] = df2.loc[
df2['start_time_diff-1'].lt(0.05), 'end_time_shift']
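Optionally, drop the helper columns afterwards (a small usage sketch):
df2 = df2.drop(columns=['start_time_diff1', 'start_time_diff-1', 'end_time_shift'])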
You can use .shift() to compare each row to the prior row and take the difference, building a boolean mask s that marks rows where the difference is less than .05. Then, with ~, simply filter out those rows:
s = df['start_time'] - df.shift()['start_time'] < .05
df = df[~s]
df
Out[1]:
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
4 37.210333 37.197333 37.306333
Another way is to use .diff()
df[~(df.start_time.diff()<0.05)]
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
4 37.210333 37.197333 37.306333
I have two data frames.
The first data frame is:
import pandas as pd
df1 = pd.DataFrame({'serialNo':['aaaa','bbbb','cccc','ffff','aaaa','bbbb','aaaa'],
'Name':['Sayonti','Ruchi','Tony','Gowtam','Toffee','Tom','Sayonti'],
'testName': [4402, 3747 ,5555,8754,1234,9876,3602],
'moduleName': ['singing', 'dance','booze', 'vocals','drama','paint','singing'],
'endResult': ['WARNING', 'FAILED', 'WARNING', 'FAILED','WARNING','FAILED','WARNING'],
'Date':['2018-10-5','2018-10-6','2018-10-7','2018-10-8','2018-10-9','2018-10-10','2018-10-8'],
'Time_df1':['23:26:39','22:50:31','22:15:28','21:40:19','21:04:15','20:29:11','19:54:03']})
The second data frame is:
df2 = pd.DataFrame({'serialNo':['aaaa','bbbb','aaaa','ffff','xyzy','aaaa'],
'Food':['Strawberry','Coke','Pepsi','Nuts','Apple','Candy'],
'Work': ['AP', 'TC','OD', 'PU','NO','PM'],
'Date':['2018-10-1','2018-10-6','2018-10-2','2018-10-3','2018-10-5','2018-10-10'],
'Time_df2':['09:00:00','10:00:00','11:00:00','12:00:00','13:00:00','14:00:00']
})
I am joining the two based on serial number:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
result = pd.merge(df1,df2,on=['serialNo'],how='inner')
Now I want Date_y to lie within 3 days of Date_x, starting from Date_x, which means Date_x + (1, 2, or 3 days) should cover Date_y. I can get that as below, but I also want to check the time range, which I do not know how to achieve.
result = result[result.Date_x.sub(result.Date_y).dt.days.between(0,3)]
I want to check that Time_df2 is within 6 hours of the start time Time_df1. Please help.
You could have a column within your dataframe that combines the date and the time. Here's an example of combining a single row in the dataframe:
import datetime

# Combining Date_x and Time_df1
value_1_x = datetime.datetime.combine(result['Date_x'][0].date(),
                                      datetime.datetime.strptime(result['Time_df1'][0], '%H:%M:%S').time())
# Combining Date_y and Time_df2
value_2_y = datetime.datetime.combine(result['Date_y'][0].date(),
                                      datetime.datetime.strptime(result['Time_df2'][0], '%H:%M:%S').time())
Then given two datetime objects, you can simply subtract to find the difference you are looking for:
difference = value_1_x - value_2_y
print(difference)
Which gives the output:
4 days, 14:26:39
My understanding is that you are looking to see if something is within 3 days and 6 hours (or a total of 78 hours). You can convert this to hours easily, and then make the desired comparison:
hours_difference = abs(value_1_x - value_2_y).total_seconds() / 3600.0
print(hours_difference)
Which gives the output:
110.44416666666666
Hope that helps!
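A vectorized sketch of the same idea over the whole merged frame (assuming result from the question, and the 78-hour interpretation above):
# Time_df1/Time_df2 strings like '23:26:39' parse directly as offsets with pd.to_timedelta
start = result['Date_x'] + pd.to_timedelta(result['Time_df1'])
other = result['Date_y'] + pd.to_timedelta(result['Time_df2'])
# Keep rows where the df2 timestamp falls within 0-78 hours after the df1 timestamp
result = result[(other - start).between(pd.Timedelta(0), pd.Timedelta(hours=78))]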