So I wanted to resample my yearly data to monthly using the ffill method.
I have this data:
2020-01-01 1.248310e+06
2021-01-01 1.259511e+06
2022-01-01 1.276312e+06
2023-01-01 1.298714e+06
The output should be:
2020-01-01 1.248310e+06
2020-02-01 1.248310e+06
2020-03-01 1.248310e+06
.... ...
2023-10-01 1.298714e+06
2023-11-01 1.298714e+06
2023-12-01 1.298714e+06
Here is what I tried
down_sampling = df.resample('MS').ffill()  # resample's fill_method= argument was removed; call .ffill() instead
I get something like:
2020-01-01 1.248310e+06
2020-02-01 1.248310e+06
2020-03-01 1.248310e+06
.... ...
2022-11-01 1.276312e+06
2022-12-01 1.276312e+06
2023-01-01 1.298714e+06
The problem here is that the year 2023 gets only one month.
Can you suggest any idea on how to fix it?
Thank you.
You can do it like this:
import pandas as pd

index = pd.date_range('1/1/2020', periods=4, freq='YS')
series = pd.Series([1.248310e+06, 1.259511e+06, 1.276312e+06, 1.298714e+06], index=index)
series2 = pd.Series(1.298714e+06, pd.date_range('12/1/2023', periods=1))
series = pd.concat([series, series2])  # Series.append was removed in pandas 2.0
down_sampling = series.resample('MS').ffill()
A hacky but pythonic solution:
pd.concat([df, df.iloc[[-1]].set_index(df.iloc[[-1]].index.shift(1, freq="D"))]).resample("h").ffill()[:-1]
It picks the last row (as a DataFrame: df.iloc[[-1]]),
increases its index by one step (here 1 day: .index.shift(1, freq="D")),
concatenates it back onto df and resamples: .resample("h").ffill(),
and removes the single dummy row at the end ([:-1]).
I'd actually expect the closed= parameter of resample to do the job.
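Adapting the same trick to the yearly-to-monthly case from the question, a sketch (using pd.concat, since the append methods were removed in pandas 2.0) might look like:

```python
import pandas as pd

# Yearly series from the question
index = pd.date_range("2020-01-01", periods=4, freq="YS")
series = pd.Series([1.248310e6, 1.259511e6, 1.276312e6, 1.298714e6], index=index)

# Duplicate the last value one year ahead, upsample to month starts,
# forward-fill, then drop the dummy row again
tail = pd.Series(series.iloc[-1], index=series.index[-1:].shift(1, freq="YS"))
monthly = pd.concat([series, tail]).resample("MS").ffill()[:-1]
```

This yields 48 monthly rows, from 2020-01-01 through 2023-12-01, with 2023 forward-filled for the whole year.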
I'm having trouble finding an efficient way to update some column values in a large pandas DataFrame.
The code below creates a DataFrame in a similar format to what I'm working with. A summary of the data: the DataFrame contains three days of consumption data, with each day split into 10 measurement periods. Each measurement period is also recorded during four separate processes: a preliminary reading, an end-of-day reading, and two later revisions, with every update stamped in the Last_Update column with the date.
import numpy as np
import pandas as pd

dates = ['2022-01-01']*40 + ['2022-01-02']*40 + ['2022-01-03']*40
periods = list(range(1,11))*12
versions = (['PRELIM'] * 10 + ['DAILY'] * 10 + ['REVISE'] * 20) * 3
data = {'Date': dates,
        'Period': periods,
        'Version': versions,
        'Consumption': np.random.randint(1, 30, 120)}
df = pd.DataFrame(data)
df.Date = pd.to_datetime(df.Date)
## Add random times to the REVISE Last_Update values
df['Last_Update'] = df['Date'].apply(lambda x: x + pd.Timedelta(hours=np.random.randint(1,23), minutes=np.random.randint(1,59)))
df['Last_Update'] = df['Last_Update'].where(df.Version == 'REVISE', df['Date'])
The problem is that the two revision categories are both specified by the same value: "REVISE". One of these "REVISE" values must be changed to something like "REVISE_2". If you group the data in the following way df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].sum() you can see there are two Last_Update dates for each period in each day for REVISE. So we need to set the REVISE with the largest date to REVISE_2.
The only way I've managed to find a solution is a very convoluted function used with the apply method that tests which date is larger, stores its index, and then changes the value using loc. This ended up taking a huge amount of time even for small segments of the data (the full dataset is millions of rows).
I feel like there is an easy solution using groupby functions, but I'm having difficulties navigating the multi-index output.
Any help would be appreciated, cheers.
We figure out the index of the latest REVISE date using idxmax after some grouping, and then change the labels:
last_revised_date_idx = df[df['Version'] == 'REVISE'].groupby(['Date', 'Period'], group_keys = False)['Last_Update'].idxmax()
df.loc[last_revised_date_idx, 'Version'] = 'REVISE_2'
Check the output:
df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].count().head(20)
produces
Date Period Version Last_Update
2022-01-01 1 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 03:50:00 1
REVISE_2 2022-01-01 12:10:00 1
2 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 10:45:00 1
REVISE_2 2022-01-01 22:05:00 1
3 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 17:03:00 1
REVISE_2 2022-01-01 19:10:00 1
4 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 15:23:00 1
REVISE_2 2022-01-01 18:08:00 1
5 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 12:19:00 1
REVISE_2 2022-01-01 18:04:00 1
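For reference, the idxmax relabeling can be sketched on a tiny self-contained frame (timestamps invented for illustration):

```python
import pandas as pd

# Two REVISE rows per (Date, Period), with made-up update times
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01"] * 4),
    "Period": [1, 1, 2, 2],
    "Version": ["REVISE"] * 4,
    "Last_Update": pd.to_datetime([
        "2022-01-01 03:50", "2022-01-01 12:10",
        "2022-01-01 10:45", "2022-01-01 22:05",
    ]),
})

# Row index of the latest REVISE per (Date, Period), then relabel in place
idx = df[df["Version"] == "REVISE"].groupby(["Date", "Period"])["Last_Update"].idxmax()
df.loc[idx, "Version"] = "REVISE_2"
```

No apply over rows is needed, so this stays vectorized even on millions of rows.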
I have a dataframe containing time series with hourly measurements with the following structure: name, time, output. For each name the measurements come from more or less the same time period. I am trying to fill in the missing values, such that for each day all 24h appear in the time column.
So I'm expecting a table like this:
name time output
x 2018-02-22 00:00:00 100
...
x 2018-02-22 23:00:00 200
x 2018-02-24 00:00:00 300
...
x 2018-02-24 23:00:00 300
y 2018-02-22 00:00:00 100
...
y 2018-02-22 23:00:00 200
y 2018-02-25 00:00:00 300
...
y 2018-02-25 23:00:00 300
For this I groupby name and then try to apply a custom function that adds the missing timestamps in the corresponding dataframe.
def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq='1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = pd.concat([df, pd.DataFrame({'GSRN': [name] * len(new_dates), 'time': new_dates})])
    return df
For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
Edit 1:
This is different than the [question here][1] because they didn't need all 24 values/day -- resampling between 2pm and 10pm will only give the values in between.
Edit 2:
I found a (not great) solution by creating a multi index with all name-timestamp pairs and combining it with the table. Code below for anyone curious, but I'm still after a better solution:
import datetime

start_date = datetime.datetime.combine(df.time.min().date(), datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(),datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq = '1H')
mux = pd.MultiIndex.from_product([df['name'].unique(),new_idx], names=('name','time'))
df_complete = pd.DataFrame(index=mux).reset_index().combine_first(df)
df_complete = df_complete.groupby(["name", df_complete.time.dt.date]).filter(lambda g: g["output"].count() > 0)
The last line removes any days that were completely missing for the specific name in the initial dataframe.
Try this:
First create a DataFrame spanning the min date to the max date at an hourly interval, then join the two together.
df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
dates_range = pd.date_range(min_date, max_date, freq = '1H')
df.set_index('time', inplace=True)
df3 = pd.DataFrame(dates_range).set_index(0)
df4 = df3.join(df)
df4:
name output
2018-02-22 00:00:00 x 100.0
2018-02-22 00:00:00 y 100.0
2018-02-22 01:00:00 NaN NaN
2018-02-22 02:00:00 NaN NaN
2018-02-22 03:00:00 NaN NaN
... ... ...
2018-02-25 19:00:00 NaN NaN
2018-02-25 20:00:00 NaN NaN
2018-02-25 21:00:00 NaN NaN
2018-02-25 22:00:00 NaN NaN
2018-02-25 23:00:00 y 300.0
98 rows × 2 columns
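Another option (not in the original answers, just a sketch) is to resample each name's series separately, which keeps an independent hourly grid per name instead of one shared grid:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["x", "x", "y", "y"],
    "time": pd.to_datetime([
        "2018-02-22 00:00", "2018-02-22 03:00",
        "2018-02-22 00:00", "2018-02-22 02:00",
    ]),
    "output": [100, 200, 100, 300],
})

# Hourly grid per name; hours missing from a name's series become NaN
filled = (
    df.set_index("time")
      .groupby("name")["output"]
      .resample("1h")
      .asfreq()
      .reset_index()
)
```

Here x gets a 4-hour grid and y a 3-hour grid, each bounded by that name's own first and last timestamps.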
I have a dataframe that contains NaN values, and I want to fill the missing data using information from the same month.
The dataframe looks like this:
data = {'x':[208.999,-894.0,-171.0,108.999,-162.0,-29.0,-143.999,-133.0,-900.0],
'e':[0.105,0.209,0.934,0.150,0.158,'',0.333,0.089,0.189],
}
df = pd.DataFrame(data, index=['2020-01-01', '2020-02-01',
'2020-03-01', '2020-01-01',
'2020-02-01','2020-03-01',
'2020-01-01','2020-02-01',
'2020-03-01'])
df.index = pd.to_datetime(df.index)
df['e'] =df['e'].apply(pd.to_numeric, errors='coerce')
Now I'm using df = df.fillna(df['e'].mean()) to fill the NaN value, but it uses the whole column's data and gives me 0.27. Is there a way to use only the data from the same month? The result should be 0.56.
Try grouping on index.month, getting the (transformed) mean, then fillna:
df.index = pd.to_datetime(df.index)
out = df.fillna({'e':df.groupby(df.index.month)['e'].transform('mean')})
print(out)
x e
2020-01-01 208.999 0.1050
2020-02-01 -894.000 0.2090
2020-03-01 -171.000 0.9340
2020-01-01 108.999 0.1500
2020-02-01 -162.000 0.1580
2020-03-01 -29.000 0.5615
2020-01-01 -143.999 0.3330
2020-02-01 -133.000 0.0890
2020-03-01 -900.000 0.1890
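Condensed into a runnable check with just the e column: the March mean (0.934 + 0.189) / 2 = 0.5615 lands in the gap:

```python
import numpy as np
import pandas as pd

e = pd.Series(
    [0.105, 0.209, 0.934, 0.150, 0.158, np.nan, 0.333, 0.089, 0.189],
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"] * 3),
)

# Each NaN is replaced by the mean of its own calendar month
filled = e.fillna(e.groupby(e.index.month).transform("mean"))
```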
Maybe you could use interpolate() instead of fillna(), but you have to sort the index first, i.e.:
df.e.sort_index().interpolate()
Output:
2020-01-01 0.1050
2020-01-01 0.1500
2020-01-01 0.3330
2020-02-01 0.2090
2020-02-01 0.1580
2020-02-01 0.0890
2020-03-01 0.9340
2020-03-01 0.5615
2020-03-01 0.1890
Name: e, dtype: float64
By default linear interpolation is used, so for a single NaN you get the mean of its neighbours; the missing value is replaced by 0.5615, as you expected. However, if the NaN were the first sample of a month after sorting, the result would be the mean of the previous month's last value and this month's next value. On the other hand, it still works when a whole month is NaN and there is nothing to average, so depending on how strict you are about the same-month requirement and how your missing values are spread across the dataframe, you may or may not accept this solution.
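A self-contained version of the interpolate() approach (using a stable sort so the order of duplicate dates, and hence the result, is deterministic):

```python
import numpy as np
import pandas as pd

e = pd.Series(
    [0.105, 0.209, 0.934, 0.150, 0.158, np.nan, 0.333, 0.089, 0.189],
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"] * 3),
)

# A stable sort keeps duplicate dates in input order, so the NaN ends up
# between the two other March values and is linearly interpolated
out = e.sort_index(kind="stable").interpolate()
```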
I have an Excel file with a column named StartTime holding hh:mm:ss XX data; the cells use the 'h:mm:ss AM/PM' custom format. For example,
ID StartTime
1 12:00:00 PM
2 1:00:00 PM
3 2:00:00 PM
I used the following code to read the file
df = pd.read_excel('./mydata.xls',
sheet_name='Sheet1',
converters={'StartTime' : str},
)
df shows
ID StartTime
1 12:00:00
2 1:00:00
3 2:00:00
Is this a bug, or how do you overcome it? Thanks.
[Update: 7-Dec-2018]
I guess I may have made changes to the Excel file that made it behave oddly. I created another Excel file and present it here (I could not attach an Excel file here, and it would not be safe anyway):
I created the following code to test:
import pandas as pd
df = pd.read_excel('./Book1.xlsx',
sheet_name='Sheet1',
converters={'StartTime': str,
'EndTime': str
}
)
df['Hours1'] = pd.NaT
df['Hours2'] = pd.NaT
print(df,'\n')
df.loc[~df.StartTime.isnull() & ~df.EndTime.isnull(),
'Hours1'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
df['Hours2'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
print(df)
The outputs are
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 NaT NaT
1 1 12:00:00 13:00:00 NaT NaT
2 2 13:00:00 14:00:00 NaT NaT
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 3600000000000 01:00:00
1 1 12:00:00 13:00:00 3600000000000 01:00:00
2 2 13:00:00 14:00:00 3600000000000 01:00:00
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
Now the question has become: "Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel". I have changed the title of the question too. Thank you to those who replied and tried it out.
The question is
How to represent the time value to hour instead of microseconds?
It seems that the StartTime column is formatted as text in your file.
Have you tried reading it with parse_dates along with a parser function specified via the date_parser parameter? It should work similarly to read_csv(), although the docs don't list these options explicitly despite them being available.
Like so:
from datetime import datetime  # pd.datetime was removed in pandas 2.0

pd.read_excel(r'./mydata.xls',
              parse_dates=['StartTime'],
              date_parser=lambda x: datetime.strptime(x, '%I:%M:%S %p').time())
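The format string itself can be sanity-checked in isolation ('%I' is the 12-hour clock, '%p' the AM/PM marker):

```python
from datetime import datetime

# '1:00:00 PM' parsed on a 12-hour clock becomes the 24-hour time 13:00:00
t = datetime.strptime("1:00:00 PM", "%I:%M:%S %p").time()
```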
Given the update:
pd.read_excel(r'./mydata.xls', parse_dates=['StartTime', 'EndTime'])
(df['EndTime'] - df['StartTime']).dt.seconds//3600
alternatively
# '//' is available since pandas v0.23.4, otherwise use '/' and round
(df['EndTime'] - df['StartTime'])//pd.Timedelta(1, 'h')
both resulting in the same
0 1
1 1
2 1
dtype: int64
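Both hour expressions can be verified on a couple of synthetic timestamps; each floors the delta to whole hours:

```python
import pandas as pd

# Time-only strings parse onto the same (today's) date, so the deltas are exact
start = pd.Series(pd.to_datetime(["11:00:00", "12:00:00", "13:00:00"]))
end = pd.Series(pd.to_datetime(["12:00:00", "13:30:00", "16:00:00"]))
delta = end - start

hours_a = delta.dt.seconds // 3600       # seconds component floored to hours
hours_b = delta // pd.Timedelta(1, "h")  # whole-hour floor division
```

Note that .dt.seconds is only the seconds component of the timedelta and wraps past 24 hours, so the Timedelta division is the safer choice for deltas longer than a day.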
The data is given as following:
return
2010-01-04 0.016676
2010-01-05 0.003839
...
2010-01-05 0.003839
2010-01-29 0.001248
2010-02-01 0.000134
...
What I want is to extract every value falling on the last day of a month that appears in the data.
2010-01-29 0.001248
2010-02-28 ......
If I directly use pandas.resample, i.e., df.resample('M').last(), I would select the correct rows but with the wrong index (it automatically uses the last calendar day of the month as the index):
2010-01-31 0.001248
2010-02-28 ......
How can I get the correct answer in a Pythonic way?
An assumption made here is that your date data is part of the index. If not, I recommend setting it first.
Single Year
I don't think the resampling or grouper functions would do. Let's group on the month number instead and call DataFrameGroupBy.tail.
df.groupby(df.index.month).tail(1)
Multiple Years
If your data spans multiple years, you'll need to group on the year and month. Using a single grouper created from strftime:
df.groupby(df.index.strftime('%Y-%m')).tail(1)
Or, using multiple groupers:
df.groupby([df.index.year, df.index.month]).tail(1)
Note: if your index is not a DatetimeIndex as assumed here, you'll need to replace df.index with pd.to_datetime(df.index, errors='coerce') above.
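A quick sketch with toy data shows tail(1) keeping the original (not calendar month-end) dates, which is what the question asks for:

```python
import pandas as pd

# Daily-ish index spanning three months, with gaps around month boundaries
df = pd.DataFrame(
    {"ret": [0, 1, 2, 3, 4, 5]},
    index=pd.to_datetime([
        "2010-01-28", "2010-01-29", "2010-02-01",
        "2010-02-26", "2010-03-01", "2010-03-02",
    ]),
)

# Last observed row of each (year, month), preserving the original dates
last = df.groupby([df.index.year, df.index.month]).tail(1)
```

Unlike resample('M').last(), the January row keeps its true date 2010-01-29 rather than being relabeled 2010-01-31.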
Although this doesn't answer the question exactly, I'll leave it in case someone is interested.
An approach that only works if you are certain you have all days (important!) is to add one day with pd.Timedelta and check whether day == 1. In a small timing test it was about 6x faster than the groupby solution.
df[(df['dates'] + pd.Timedelta(days=1)).dt.day == 1]
Or if index:
df[(df.index + pd.Timedelta(days=1)).day == 1]
Full example:
import pandas as pd
df = pd.DataFrame({
    'dates': pd.date_range(start='2016-01-01', end='2017-12-31'),
    'i': 1
}).set_index('dates')
dfout = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(dfout)
Returns:
i
dates
2016-01-31 1
2016-02-29 1
2016-03-31 1
2016-04-30 1
2016-05-31 1
2016-06-30 1
2016-07-31 1
2016-08-31 1
2016-09-30 1
2016-10-31 1
2016-11-30 1
2016-12-31 1
2017-01-31 1
2017-02-28 1
2017-03-31 1
2017-04-30 1
2017-05-31 1
2017-06-30 1
2017-07-31 1
2017-08-31 1
2017-09-30 1
2017-10-31 1
2017-11-30 1
2017-12-31 1