Pandas Aggregate Daily Data to Monthly Timeseries - python

I have a time series that looks like this (below), and I want to resample it monthly, so that 2019-10 is equal to the average of all the PTS values for October, November is the average of all the PTS values for November, etc.
However, when I use the .resample('M').mean() method, if the final day of a month does not have a value, it fills in a NaN in my data frame. How do I solve this?
Date PTS
2019-10-23 14.0
2019-10-26 14.0
2019-10-27 8.0
2019-10-29 29.0
2019-10-31 17.0
2019-11-03 12.0
2019-11-05 2.0
2019-11-07 15.0
2019-11-08 7.0
2019-11-14 16.0
2019-11-16 12.0
2019-11-20 22.0
2019-11-22 9.0
2019-11-23 20.0
2019-11-25 18.0

Would this work?
df.resample('M').mean().dropna()
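For the data in the question, a minimal sketch of that suggestion (assuming Date is already parsed as a DatetimeIndex):
import pandas as pd

# Sample data from the question, with Date as a DatetimeIndex.
df = pd.DataFrame(
    {"PTS": [14.0, 14.0, 8.0, 29.0, 17.0, 12.0, 2.0, 15.0, 7.0,
             16.0, 12.0, 22.0, 9.0, 20.0, 18.0]},
    index=pd.to_datetime([
        "2019-10-23", "2019-10-26", "2019-10-27", "2019-10-29", "2019-10-31",
        "2019-11-03", "2019-11-05", "2019-11-07", "2019-11-08", "2019-11-14",
        "2019-11-16", "2019-11-20", "2019-11-22", "2019-11-23", "2019-11-25"]),
)

# Month-end average, then drop any months that came out all-NaN.
monthly = df.resample("M").mean().dropna()
print(monthly)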

Do you have a code sample? This works:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
days = np.arange(31)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(days, 60),
"values": rng.integers(0, 60, size=60)})
data.set_index("dates", inplace=True)
# Set the last day to null.
data.loc["2019-03-31"] = np.nan
# This works
data.resample("M").mean()
It also works with an incomplete month:
incomplete_days = np.arange(10)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(incomplete_days, 10),
"values": rng.integers(0, 60, size=10)})
data.set_index("dates", inplace=True)
data.resample("M").mean()
You should check your data and types more thoroughly in case the NaN you're receiving indicates a more pressing issue.
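For example, a quick check along these lines shows whether the index is really datetime-typed and whether NaNs were already present before resampling (a sketch; the column and index names are taken from the question):
print(df.index.dtype)                    # should be datetime64[ns], not object
print(df.dtypes)                         # PTS should be a numeric dtype
print(df["PTS"].isna().sum())            # missing values already in the data
print(df.index.is_monotonic_increasing)  # confirm the dates are in order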

Related

Time interval between rows in a column, Python

Input
New Time
11:59:57
12:42:10
12:48:45
18:44:53
18:49:06
21:49:54
21:54:48
5:28:20
Below is the code I wrote to compute the interval in minutes.
import pandas as pd
import numpy as np
df = pd.read_csv(r"D:\test\test1.csv")
df['Interval in min'] = (pd.to_timedelta(df['New Time'].astype(str)).diff(1).dt.floor('T').dt.total_seconds().div(60))
print(df)
Output
New Time Interval in min
11:59:57 NaN
12:42:10 42.0
12:48:45 6.0
18:44:53 356.0
18:49:06 4.0
21:49:54 180.0
21:54:48 4.0
5:28:20 -987.0
The last interval, -987 min, is not correct; it should be 453 min (+1 day).
Assuming you want to consider a negative difference to be a new day, you could use:
s = pd.to_timedelta(df['New Time']).diff()
# Each negative difference marks a wrap to the next day; add one day per wrap seen so far.
df['Interval in min'] = (s
    .add(pd.to_timedelta(s.lt('0').cumsum(), unit='d'))
    .dt.floor('T').dt.total_seconds().div(60)
)
output:
New Time Interval in min
0 11:59:57 NaN
1 12:42:10 42.0
2 12:48:45 6.0
3 18:44:53 356.0
4 18:49:06 4.0
5 21:49:54 180.0
6 21:54:48 4.0
7 5:28:20 453.0
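For reference, the same approach end-to-end on the times from the question (a sketch; only the 'New Time' column name is taken from the question):
import pandas as pd

df = pd.DataFrame({"New Time": ["11:59:57", "12:42:10", "12:48:45", "18:44:53",
                                "18:49:06", "21:49:54", "21:54:48", "5:28:20"]})

s = pd.to_timedelta(df["New Time"]).diff()
# Add one day for every day boundary (negative difference) seen so far.
df["Interval in min"] = (
    s.add(pd.to_timedelta(s.lt("0").cumsum(), unit="d"))
     .dt.floor("T")
     .dt.total_seconds()
     .div(60)
)
print(df)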

Sum of same days of complex Pandas Data Frame

This question builds on the following SO question:
Groupy brings only one key from Pandas dictionary
The dataframe looks like:
ALUP11 Return % Day CESP6 Return % Day TAEE11 Return % Day
Data
2020-08-13 23.81 0.548986 13.0 29.38 -2.747435 13.0 28.33 -0.770578 13.0
2020-09-01 23.68 1.067008 1.0 30.21 0.365449 1.0 28.55 1.205246 1.0
2020-08-31 23.43 -1.139241 31.0 30.10 -2.336145 31.0 28.21 -0.669014 31.0
2020-08-28 23.70 1.455479 28.0 30.82 1.615562 28.0 28.40 0.459851 28.0
2020-08-27 23.36 -0.680272 27.0 30.33 -1.717434 27.0 28.27 0.354988 27.0
After building the dataframe from the dictionary, I need the sum over the same days, but with
result = df.groupby('Day').agg({'Return %': ['sum']})
result
I get the error:
ValueError: Grouper for 'Day' not 1-dimensional
For each symbol I would like to sum the same days of the month. In the example I have 3 symbols, so the result should look like:
If your data looks like the data in the answer to your previous question, the error is because you have two columns named Day. As they appear to have the same data you could drop the last column and then your groupby will work:
df = df.iloc[:, :-1].groupby('Day')
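A minimal sketch of that idea on data shaped like the frame above, assuming the flat, duplicated column names shown (the relabelling of the 'Return %' columns by symbol is an assumption for illustration):
import pandas as pd

df = pd.DataFrame(
    [[23.81, 0.548986, 13.0, 29.38, -2.747435, 13.0, 28.33, -0.770578, 13.0],
     [23.68, 1.067008, 1.0, 30.21, 0.365449, 1.0, 28.55, 1.205246, 1.0],
     [23.43, -1.139241, 31.0, 30.10, -2.336145, 31.0, 28.21, -0.669014, 31.0]],
    index=pd.to_datetime(["2020-08-13", "2020-09-01", "2020-08-31"]),
    columns=["ALUP11", "Return %", "Day", "CESP6", "Return %", "Day",
             "TAEE11", "Return %", "Day"],
)

day = df["Day"].iloc[:, 0]                       # keep a single 'Day' column
returns = df["Return %"]                         # the three 'Return %' columns
returns.columns = ["ALUP11", "CESP6", "TAEE11"]  # assumed symbol order
result = returns.groupby(day).sum()              # sum of returns per day of month
print(result)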

How to add rows to a table with missing dates (days), filling in the added rows with the values copied from the rows below?

I have a table:
import pandas as pd
import numpy as np
df = pd.DataFrame([
    ("2019-01-22", np.nan, np.nan),
    ("2019-01-25", 10, 15),
    ("2019-01-28", 200, 260),
    ("2019-02-03", 3010, 3800),
    ("2019-02-05", 40109, 45009)],
    columns=["date", "col1", "col2"])
I need to add new rows to the table where the date (day) is missing. In the added rows, in columns col1 and col2, there must be values copied from the row located below in the table (from rows with more recent dates).
I need to get the following table:
Use pandas.to_datetime and asfreq:
df.set_index(pd.to_datetime(df['date'])).drop(columns='date').asfreq('D').bfill().reset_index()
Output:
date col1 col2
0 2019-01-22 10.0 15.0
1 2019-01-23 10.0 15.0
2 2019-01-24 10.0 15.0
3 2019-01-25 10.0 15.0
4 2019-01-26 200.0 260.0
5 2019-01-27 200.0 260.0
6 2019-01-28 200.0 260.0
7 2019-01-29 3010.0 3800.0
8 2019-01-30 3010.0 3800.0
9 2019-01-31 3010.0 3800.0
10 2019-02-01 3010.0 3800.0
11 2019-02-02 3010.0 3800.0
12 2019-02-03 3010.0 3800.0
13 2019-02-04 40109.0 45009.0
14 2019-02-05 40109.0 45009.0
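For clarity, the same chain spelled out step by step (a sketch of what the one-liner above does; out is just a scratch variable):
out = df.set_index(pd.to_datetime(df["date"]))  # use the parsed dates as the index
out = out.drop(columns="date")                  # drop the original string column
out = out.asfreq("D")                           # add a row for every missing calendar day
out = out.bfill()                               # copy values from the next known row
out = out.reset_index()                         # move the dates back into a column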
df = df.sort_values("date")
df = df.fillna(method='bfill')
Sort the dataframe according to date and fill the nulls with the next non-null values.
try this code:
import pandas as pd
import numpy as np

df = pd.DataFrame([
    ("2019-01-22", np.nan, np.nan),
    ("2019-01-25", 10, 15),
    ("2019-01-28", 200, 260),
    ("2019-02-03", 3010, 3800),
    ("2019-02-05", 40109, 45009)],
    columns=["date", "col1", "col2"])

df['date'] = pd.to_datetime(df['date'])
df.index = df['date']
df.drop(columns='date', inplace=True)
df = df.resample('D').asfreq().bfill()
df.reset_index(inplace=True)
convert date to an actual date object (it was a str)
set the index to be the date column (because of how resample/bfill works)
drop the date column
resample the dates on a daily basis, backfilling missing data
reset the index so it's back to being a regular column

How to fillna/missing values for an irregular timeseries for a Drug when Half-life is known

I have a dataframe (df) where column A is the drug amount dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration implied by the half-life of the drug (180 min). I am struggling with the code in pandas. Would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I want to fillna(values) as a function of time elapsed and the half-life of the drug, something like:
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp A
1991-04-21 09:09:00 9.0
1991-04-21 13:00:00 NaN
1991-04-21 19:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN """
df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
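As a quick sanity check on the constants, the decay factor should be exactly 0.5 one half-life (180 minutes) after a dose:
import numpy as np

# Half-life of 180 min = 10,800 s; lamda is chosen so exp(-t / lamda) halves every 10,800 s.
lamda = 10800 / np.log(2)
print(np.exp(-10800 / lamda))   # 0.5 after one half-life
print(np.exp(-21600 / lamda))   # 0.25 after two half-lives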

Extracting metadata from comment fields using pandas

I need to download and process the Australian Bureau of Meteorology weather files. So far the following Python works well; it extracts and cleanses the data exactly as I want:
import pandas as pd
df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#', skiprows=3, na_values=-9999.0, quotechar='"', skipfooter=1, names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns', 'rain', 'prob'], header=0, converters={'stn': str})
The issue is that the file is overwritten daily, and the metadata indicating the day and time the forecast was produced is in the comment fields on the first two lines, i.e. the file contains the following data:
# date=20131111
# time=06
[fcst_DB]
stn[7] , per, evap, amax, amin, gmin, suns, rain, prob
"001006", 0,-9999.0, 39.9,-9999.0,-9999.0,-9999.0, 4.0, 100.0
"001006", 1,-9999.0, 39.4, 26.5,-9999.0,-9999.0, 6.0, 100.0
"001006", 2,-9999.0, 35.5, 26.2,-9999.0,-9999.0, 7.0, 100.0
Is it possible, using pandas, to include the first two lines in the result? Ideally by adding a date and time column to the result and using the values 20131111 and 06 for each row in the output.
Regards
Dave
Will the first two lines always be a date and time? In that case I'd suggest parsing those separately and handing the rest of the stream off to read_csv.
import urllib2

url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"

In [29]: r = urllib2.urlopen(url)
In [30]: date = next(r).strip('# date=').rstrip()
In [31]: time = next(r).strip('# time=').rstrip()
In [32]: stamp = pd.to_datetime(date + ' ' + time)
In [33]: stamp
Out[33]: Timestamp('2013-11-12 00:00:00', tz=None)
Then use your code to read (I changed the skiprows to 1)
In [34]: df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#',
                          skiprows=1, na_values=-9999.0, quotechar='"', skipfooter=1,
                          names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns',
                                 'rain', 'prob'], header=0, converters={'stn': str})
In [43]: df['timestamp'] = stamp
In [44]: df.head()
Out[44]:
stn per evap amax amin gmin suns rain prob timestamp
0 001006 0 NaN 39.9 NaN NaN NaN 2.9 100.0 2013-11-12 00:00:00
1 001006 1 NaN 35.8 25.8 NaN NaN 7.0 100.0 2013-11-12 00:00:00
2 001006 2 NaN 37.0 25.5 NaN NaN 4.0 71.4 2013-11-12 00:00:00
3 001006 3 NaN 39.0 26.0 NaN NaN 1.0 60.0 2013-11-12 00:00:00
4 001006 4 NaN 41.2 26.1 NaN NaN 0.0 40.0 2013-11-12 00:00:00
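For what it's worth, the same approach under Python 3 (where urllib2 no longer exists) might look roughly like this; a sketch using the question's own URL and read_csv parameters, not tested against the live server:
import io
import urllib.request

import pandas as pd

url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"
raw = urllib.request.urlopen(url).read().decode("ascii", errors="replace")

# The first two comment lines carry the metadata.
lines = raw.splitlines()
date = lines[0].replace("# date=", "").strip()
time = lines[1].replace("# time=", "").strip()
stamp = pd.to_datetime(date + " " + time, format="%Y%m%d %H")

# Hand the full text to read_csv with the question's original parameters.
df = pd.read_csv(io.StringIO(raw), comment="#", skiprows=3, na_values=-9999.0,
                 quotechar='"', skipfooter=1, engine="python",
                 names=["stn", "per", "evap", "amax", "amin", "gmin",
                        "suns", "rain", "prob"],
                 header=0, converters={"stn": str})
df["timestamp"] = stamp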
