remove/isolate days when there is no change (in pandas) - python

I have a year of hourly energy data for the AC systems of two hotel rooms. I want to figure out when the rooms were occupied or not by isolating/removing the days when the AC was not used at all for 24 hours.
I tried df[df.Meter2334Diff > 0.1] for one room, which gives me all the hours when the AC was turned on; however, it also removes the hours of days when the room was most likely occupied but the AC happened to be off. This is where my knowledge stops, so I turn to the oracles of the internet for assistance.
(screenshot: my dataframe)
(screenshot: results after df[df.Meter2334Diff > 0.1])

If I've interpreted your question correctly, you want to extract all the days from the dataframe where the Meter2334Diff value was zero?
As your data currently has an hourly frequency, we can resample it in pandas using the resample() function. resample() takes a freq parameter which tells pandas at what time interval to aggregate the data. There are lots of options (see the docs), but in your case we can set freq='D' to group by day.
Then we can calculate the daily sum of the Meter2334Diff column and keep only the days whose sum is 0 (obviously, without knowledge of your dataset I can't be sure 0 is the right cutoff).
total_daily_meter_diff = df.resample('D')['Meter2334Diff'].sum()
days_less_than_cutoff = total_daily_meter_diff[total_daily_meter_diff == 0]
We can then use these days to filter in the original dataset:
df.loc[df.index.floor('D').isin(days_less_than_cutoff.index), :]
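Since the original goal is to drop the unused days rather than keep them, the same day index can be negated with ~. A minimal, self-contained sketch with made-up hourly values (the column name Meter2334Diff is taken from the question; the numbers are invented):
import numpy as np
import pandas as pd

# Made-up hourly meter data: day 1 has no AC use at all, days 2-3 do
idx = pd.date_range('2019-01-01', periods=72, freq='H')
df = pd.DataFrame({'Meter2334Diff': np.r_[np.zeros(24), np.random.rand(48)]}, index=idx)

# Daily totals; days summing to 0 had no AC use
total_daily = df.resample('D')['Meter2334Diff'].sum()
unused_days = total_daily[total_daily == 0].index

# Keep every hour of the days where the AC ran at least once,
# including the hours of those days when it happened to be off
occupied_hours = df.loc[~df.index.floor('D').isin(unused_days)]
print(occupied_hours.index.normalize().unique())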

Related

Handle NaN values in mean pandas

I calculated the average of the values contained in a column within my df as follows:
meanBpm = df['tempo'].mean()
The average is calculated for different days of the week; for some days the value I expect is returned, while for other days it returns NaN. This happens because the bpm (the tempo column) may be missing for a certain day, for example because I did not listen to any songs that day. I would like to replace these NaNs in the output with a default value, which could be 0 or -1.
EDIT: I solved it, thanks a lot everyone for the replies.
What you're looking for is:
df['tempo'].fillna(0).mean()
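Note that fillna(0).mean() on the raw column counts the missing rows as zeros when computing a single overall average. If the NaNs instead show up in the per-day averages (because a whole day has no tempo values), filling after the aggregation leaves the other days' means untouched. A minimal sketch, assuming a hypothetical day_of_week column that is not in the original question:
import numpy as np
import pandas as pd

# Hypothetical data: no songs (all-NaN tempo) on Wednesday
df = pd.DataFrame({
    'day_of_week': ['Mon', 'Mon', 'Tue', 'Wed'],
    'tempo':       [120.0, 130.0, 110.0, np.nan],
})

# Per-day mean first, then replace the missing days' results with a default of 0
mean_bpm_per_day = df.groupby('day_of_week')['tempo'].mean().fillna(0)
print(mean_bpm_per_day)  # Mon 125.0, Tue 110.0, Wed 0.0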

Changing data for a specific time period throughout a year to zeros

I have solar performance data at 5-second intervals which spans a full year. The issue is that the data also reports performance at night. I would like to change these values all to zero.
I have tried using:
data.between_time('00:00:00','06:11:00')['SYSTEM MPPT PV Power [W]'] = 0
and
data.between_time('17:00:00','23:59:55')['SYSTEM MPPT PV Power [W]'] = 0
and
data.at_time('00:00:00')['SYSTEM MPPT PV Power [W]'] = 0
seeing that these are the periods that should have no performance.
However, this does not allow me to make an assignment to the column that is reflecting the performance.
Is there any way to do this?
pandas.DataFrame.between_time returns a copy of the rows between the given times (assuming the dataframe has a DatetimeIndex); it does not select them in place, so the assignment operator does not modify the original dataframe.
Assuming your dataframe has a DatetimeIndex (it should, since you are using between_time), you can use between_time to find the indexes at which you want zeros, and then do the assignment with .loc:
zeroidx = data.between_time('00:00:00', '06:11:00').index.append(
    [data.between_time('17:00:00', '23:59:55').index,
     data.at_time('00:00:00').index])
data.loc[zeroidx, 'SYSTEM MPPT PV Power [W]'] = 0
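An alternative that avoids building an explicit index is a boolean mask over the index's time component, assigned through .loc in one step. A sketch on synthetic 5-second data (the column name and night-time windows are taken from the question; the values are invented):
from datetime import time

import numpy as np
import pandas as pd

# Synthetic 5-second data covering one day, standing in for the real dataset
idx = pd.date_range('2019-06-01', periods=17280, freq='5S')
data = pd.DataFrame({'SYSTEM MPPT PV Power [W]': np.random.rand(len(idx)) * 1000}, index=idx)

# True for timestamps inside the night-time windows from the question
t = data.index.time
night = (t <= time(6, 11)) | (t >= time(17, 0))

data.loc[night, 'SYSTEM MPPT PV Power [W]'] = 0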

Get sum of business days in dataframe python with resample

I have a time series where I want to get the sum of the business-day values for each week. A snapshot of the dataframe (df) used is shown below. Note that 2017-06-01 is a Friday, so the missing days represent the weekend.
I use resample to group the data by week, and my aim is to get the sum. When I apply this function, however, I get results which I can't justify. I was expecting the first row to be 0, which is the sum of the values contained in the first week, then 15 for the next week, etc.
df_resampled = df.resample('W', label='left').sum()
df_resampled.head()
Can someone explain what I am missing, since it seems like I have not understood the resampling function correctly?
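For what it's worth, 'W' is an alias for 'W-SUN': each weekly bin runs Monday through Sunday and is closed on the right, and label='left' stamps each bin with the preceding Sunday rather than the Sunday it ends on, which can make the first row look off by a week. A small sketch with made-up values (the question's actual snapshot is not reproduced here):
import pandas as pd

# Business days only, with invented values
idx = pd.bdate_range('2017-06-01', '2017-06-09')
df = pd.DataFrame({'value': [0, 1, 2, 3, 4, 5, 6]}, index=idx)

# 'W' means 'W-SUN': bins end (and are closed) on Sunday;
# label='left' labels each bin with the previous Sunday
print(df.resample('W', label='left').sum())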

Extract future timeseries data and join on past timeseries that are 12 hours apart?

I am in a data science course and my instructor isn't very strong in python.
Use a shift function to pull prices by 12 hours (aligning prices 12 hours in the future with a row's current prices). Then create a new column populated with this info.
So I should have my index, column 1, and the new column.
I have tried a few different ways. I have tried extracting the 12 hours into a list and merging, I have tried using .slice, and I have tried creating a function.
https://imgur.com/a/AYaM1Ye
This seemed to work:
sliced = currency[currency.index.min():currency.index.max()]
# Move the datetime values forward by 12 hours
shifted = sliced.shift(periods=1, freq='12H')
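To finish the stated task (a new column holding the price 12 hours ahead of each row), the shifted series can simply be assigned back; shifting with periods=-1 moves the index backwards so future values line up with the present, whereas periods=1 aligns past values instead. A minimal sketch, assuming an hourly DatetimeIndex and a hypothetical price column (neither is confirmed by the question):
import numpy as np
import pandas as pd

# Hypothetical hourly price data standing in for the 'currency' dataframe
idx = pd.date_range('2019-01-01', periods=48, freq='H')
currency = pd.DataFrame({'price': np.arange(48, dtype=float)}, index=idx)

# Shift the index back 12 hours so each row is paired with the value observed
# 12 hours later; assignment aligns the two series on the index
currency['price_12h_ahead'] = currency['price'].shift(periods=-1, freq='12H')

print(currency.head())
print(currency.tail())  # the final 12 rows are NaN: no future data exists for them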

Python/Pandas: sort by date and compute two week (rolling?) average

So far I've read in 2 CSVs and merged them based on a common element. I take the output of the merged CSV and iterate through the unique element they've been merged on. While I have them separated, I want to generate a daily count line and a two-week rolling average from the current date going backward. I cannot index based on the 'Date Opened' field, but I still need my outputs organized by this with the most recent first. Once these are sorted by date, my daily count plotting issue will be rectified.

My remaining task is to compute a two-week rolling average of the count. I've looked into the pandas documentation and I think rolling_mean will work, but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28) but that doesn't seem to work. I know there is an easier way to do this, but I think I've hit a roadblock with the documentation available. The end result should look something like this graph. Right now my daily count graph isn't sorted (even though I think I've instructed it to be) and is unusable in line form.
def data_sort():
    data_merge = data_extract()
    domains = data_merge.groupby('PWx Domain')
    for domain in domains.groups.items():
        dsort = data_merge.loc[domain[1]]
        print(dsort.head())
        open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
        # open_dt.to_csv('output\''+str(domain)+'_out.csv', sep=',')
        open_ct = open_dt.value_counts(sort=False)
        biwk_avg = pd.rolling_mean(open_ct, 28)
        plt.plot(open_ct, 'bo')
        plt.show()

data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group the data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like this:
for _, domain in data_merge.groupby('PWx Domain'):
    # Convert date to the index
    domain.index = pd.to_datetime(domain['Date Opened'])
    # Sort by dates
    domain.sort_index(inplace=True)
    # Do the averaging: daily resample followed by a 14-day rolling mean
    rolling = domain.resample('1D').mean().rolling(14).mean()
    plt.plot(rolling, 'bo')
    plt.show()
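Since the question asks for a two-week rolling average of the daily count rather than of the values, resample('1D').size() may be closer to the goal than .mean(); pd.rolling_mean itself has since been replaced by the .rolling() method. A sketch on synthetic data (the 'Date Opened' column name is taken from the question; everything else is invented):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic ticket-like data standing in for one 'PWx Domain' group
rng = np.random.default_rng(0)
dates = pd.to_datetime('2019-01-01') + pd.to_timedelta(rng.integers(0, 60, size=500), unit='D')
domain = pd.DataFrame({'Date Opened': dates})

# Daily count of opened items, then a 14-day rolling average of that count
daily_counts = domain.resample('1D', on='Date Opened').size()
biwk_avg = daily_counts.rolling(14, min_periods=1).mean()

plt.plot(daily_counts, 'bo', label='daily count')
plt.plot(biwk_avg, 'r-', label='14-day rolling average')
plt.legend()
plt.show()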
