I calculated the average of the values contained in a column within my df as follows:
meanBpm = df['tempo'].mean()
the average is calculated for different days of the week and for some days the value I expect is returned, while for other days it returns NaN. This is because it is possible that the bpm (the tempo column) for a certain day is not there because for example I have not listened to any songs. I would like to replace these NaNs in ouput with a default value which could be 0 or -1
EDIT: i solved it, thanks a lot everyone for the replies
what your'e looking for is -
df['tempo'].fillna(0).mean()
Related
I have a dataset containing hourly temperatures for a year. So, I have 24 entries for each day (temp for every hour) and I want to find out the 5 days with highest temp. I am aware of nlargest() function to find out 5 max values but those values happen to be on a single day only. How do I find out the 5 max values but on different days?
I tried using nlargest() and .loc() but could not find the solution. Please help.
I have attached what the dataset looks like.
You might want to get the max per group with groupby.max then find the top 5 with nlargest
df.groupby(['year','month','day'])['temp'].max().nlargest(5)
You can use groupby
An example would be
max_5_temps = df.groupby('date_column')['temperature_column'].max().nlargest(5)
I have a dataframe, called PORResult, of daily temperatures where rows are years and each column is a day (121 rows x 365 columns). I also have an array, called Percentile_90, of a threshold temperature for each day (length=365). For every day for every year in the PORResult dataframe I want to find out if the value for that day is higher than the value for that day in the Percentile_90 array. The results of which I want to store in a new dataframe, called Count (121rows x 365 columns). To start, the Count dataframe is full of zeros, but if the daily value in PORResult is greater than the daily value in Percentile_90. I want to change the daily value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
if PORResult.loc[i] > Percentile_90[i]:
CountResult[i]+=1
But when I try this I get KeyError:0. What else can I try?
(Edited:)
Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90,axis=0).astype(int)
should do the trick. Generally, the toolset provided in pandas is sufficient that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).
I have annual hourly energy data for two AC systems for two hotel rooms. I want to figure out when the rooms were occupied or not by isolating/removing the days when the ac was not used for 24 hours.
I did df[df.Meter2334Diff > 0.1] for one room which gives me all the hours when AC was turned on, however it removes the hours of the days when the room was most likely occupied and the AC was turned off. This is where my knowledge stops. I therefore enquire the assistance from the oracles of the internet.
my dataframe above
results after df[df.Meter2334Diff > 0.1]
If I've interpreted your question correctly, you want to extract all the days from the dataframe where the Meter2334Diff value was zero?
As your data is currently has a frequency of every hour, we can resample it in pandas using the resample() function. To resample() we can pass the freq parameter which tells pandas at what time interval to aggregate the data. There are lots of options (see the docs) but in your case we can set freq='D' to group by day.
Then we can calculate the sum of that day for the Meter2334Diff column. If we then filter out the days that have a value == 0 (obviously without knowledge of your dataset etc I don't know whether 0 is the correct value).
total_daily_meter_diff = df.resample('D')['Meter2334Diff'].sum()
days_less_than_cutoff = total_daily_meter_diff.query('MeterDiff2334 == 0')
We can then use these days to filter in the original dataset:
df.loc[df.index.floor('D').isin(days_less_than_cutoff) , :]
I have a sheet like this. I need to calculate absolute of "CURRENT HIGH" - "PREVIOUS DAY CLOSE PRICE" of particular "INSTRUMENT" and "SYMBOL".
So I used .shift(1) function of pandas dataframe to create a lagged close column and then I am subtracting current HIGH and lagged close column but that also subtracts between 2 different "INSTRUMENTS" and "SYMBOL". But if a new SYMBOL or INSTRUMENTS appears I want First row to be NULL instead of subtracting current HIGH and lagged close column.
What should I do?
I believe you need if all days are consecutive per groups:
df['new'] = df['HIGH'].sub(df.groupby(['INSTRUMENT','SYMBOL'])['CLOSE'].shift())
I have a time-series where I want to get the sum the business day values for each week. A snapshot of the dataframe (df) used is shown below. Note that 2017-06-01 is a Friday and hence the days missing represent the weekend
I use resample to group the data by each week, and my aim is to get the sum. When I apply this function however I get results which I can't justify. I was expecting in the first row to get 0 which is the sum of the values contained in the first week, then 15 for the next week etc...
df_resampled = df.resample('W', label='left').sum()
df_resampled.head()
Can someone explain to me what am I missing since it seems like I have not understood the resampling function correctly?