I have a large dataset that includes categorical data, which are my labels (with non-uniform timestamps). I have another dataset which is an aggregate of the measurements.
When I want to assemble these two datasets, they have two different timestamp indexes (aggregated vs. non-aggregated).
Categorical dataframe (df_Label)
count 1185
unique 10
top ABCD
freq 1165
Aggregated dataset (MeasureAgg):
To assemble the label dataframe with the measurement dataframe, I use:
df_Label = df_Label.reindex(MeasureAgg.index, method='nearest')
The issue is that this reindexing eliminates many of my labels, so df_Label.describe() becomes:
count 4
unique 2
top ABCD
freq 3
I looked into several of the rows where the labels get replaced by NaN, but couldn't find any indication of where this comes from.
I suspected the issue might be due to clustering of the labels between two timestamps, which would eliminate many of them, but this is not the case.
I tried df_Label = df_Label.reindex(MeasureAgg.index, method='nearest') on a fabricated dataset and it worked as expected, so I am not sure why it is not working in my case.
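For reference, a minimal sketch of the kind of fabricated test described, with all names and values made up:

import pandas as pd

# Labels on an irregular (non-uniform) timestamp index
label_idx = pd.to_datetime([
    "2000-01-01 00:00:10.870",
    "2000-01-01 00:00:10.940",
    "2000-01-01 00:00:11.160",
    "2000-01-01 00:00:11.640",
    "2000-01-01 00:00:12.460",
])
df_Label = pd.Series(["ABCD", "ABCD", "EFGH", "ABCD", "EFGH"],
                     index=label_idx, name="SUM")

# A coarser index standing in for MeasureAgg.index
agg_idx = pd.date_range("2000-01-01 00:00:10", periods=4, freq="1s")

# For each aggregated timestamp, take the label at the nearest label timestamp
print(df_Label.reindex(agg_idx, method="nearest"))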
My apologies for the vague nature of my question; I couldn't replicate the issue with a fabricated dataset (there it worked fine). I would greatly appreciate it if anyone could guide me to an alternative or modified way to assemble these two dataframes.
Thanks in advance
Update:
The head below shows only timestamps with NaN values, since the data is mostly missing:
df_Label.head(5)
Time
2000-01-01 00:00:10.870 NaN
2000-01-01 00:00:10.940 NaN
2000-01-01 00:00:11.160 NaN
2000-01-01 00:00:11.640 NaN
2000-01-01 00:00:12.460 NaN
Name: SUM, dtype: object
df_Label.describe()
count 1185
unique 10
top 9_33_2_0_0_0
freq 1165
Name: SUM, dtype: object
MeasureAgg.head(5)
Time mean std skew kurt
2000-01-01 00:00:00 0.0 0.0
2010-01-01 00:00:00 0.0
2015-01-01 00:00:00
2015-12-01 00:00:00
2015-12-01 12:40:00 0.0
MeasureAgg.describe()
mean std skew kurt
count 407.0 383.0 382.0 382.0
mean 487.3552791234544 35.67631749396375 -0.7545081710390299 2.52171909979003
std 158.53524231679074 43.66050329988979 1.3831195437535115 6.72280956322486
min 0.0 0.0 -7.526780108501018 -1.3377292623812096
25% 474.33696969696973 11.5126181533734 -1.1790982769904146 -0.4005545816076801
50% 489.03428571428566 13.49696931937243 -0.2372819584684056 -0.017202890096714274
75% 532.3371929824561 51.40084557371704 0.12755009341999793 1.421205718986767
max 699.295652173913 307.8822231525122 1.2280152015331378 66.9243304128838
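For nearest-timestamp joins, pd.merge_asof with direction='nearest' may be worth trying as an alternative; a sketch, assuming both frames are sorted by Time. (Note also that since the labels are mostly NaN, as in the head above, a nearest match can legitimately land on a NaN label row, which could explain a shrinking count.)

import pandas as pd

# As-of join: match each aggregated timestamp to the nearest label timestamp
merged = pd.merge_asof(
    MeasureAgg.sort_index().reset_index(),
    df_Label.sort_index().reset_index(),
    on="Time",
    direction="nearest",
)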
I have a problem explaining a gap between the historic data and the forecast.
The blue is the historic data, and the orange is the linear (lin-lin) regression forecast with future values.
Dataframe df is the training dataset with columns year, pax, RealGDPLP.
Dataframe FutureValCPs has the columns year and RealGDPLP.
How do you explain that it is not continuous (in other cases it is)?
The OLS results are attached. Is there anything that gives an indication?
Thank you.
With no data, no code, and no details about the plotting library used to produce your plot, it's going to be hard to be absolutely certain. But your forecasts seem perfectly fine compared to your historical data, in that they at the very least predict a smooth future increase in your values. If the blue line represents your entire dataset, there's really not much more that can be said using OLS.
The reason there's a gap in your plot is that the two lines are two different series and don't share a common timestamp at the transition between historical and forecasted values. There are ways to visually remedy this, but as I've mentioned, I have no idea how you've estimated the model or produced this plot.
Edit: Extended answer based on more information from OP:
This should resemble your issue with regards to the plot:
I'm assuming that the following dataframe will represent your situation:
historic forecast
dates
2020-01-01 1.0 NaN
2020-01-02 2.0 NaN
2020-01-03 3.0 NaN
2020-01-04 3.0 NaN
2020-01-05 6.0 NaN
2020-01-06 4.0 NaN
2020-01-07 8.0 NaN
2020-01-08 NaN 6.0
2020-01-09 NaN 7.0
2020-01-10 NaN 8.0
2020-01-11 NaN 9.0
2020-01-12 NaN 10.0
2020-01-13 NaN 11.0
2020-01-14 NaN 12.0
And I think this is a perfectly natural situation for series of historic and forecasted values; there's no reason why there should not be a visual gap between them. Now, one way to visually remedy this is to include the forecasted value of 6.0 at index 2020-01-08 in the historic series, or the historic value of 8.0 at index 2020-01-07 in the forecast series. You can do so using df.loc['2020-01-07', 'forecast'] = 8.0 or df.loc['2020-01-08', 'historic'] = 6.0. This can of course be done more smoothly by programmatically determining the inserted value and the index. But here's the result either way:
Complete code:
import pandas as pd
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline

# Historic and forecast series on adjoining date ranges
df_historic = pd.DataFrame({'dates': pd.date_range("20200101", periods=7),
                            'historic': [1, 2, 3, 3, 6, 4, 8]}).set_index('dates')
df_forecast = pd.DataFrame({'dates': pd.date_range("20200108", periods=7),
                            'forecast': [6, 7, 8, 9, 10, 11, 12]}).set_index('dates')
df = pd.merge(df_historic, df_forecast, how='outer', left_index=True, right_index=True)

# Bridge the visual gap by duplicating one value across the transition;
# .loc[row, col] avoids chained-assignment warnings
# df.loc['2020-01-07', 'forecast'] = 8.0
df.loc['2020-01-08', 'historic'] = 6.0

for column in df.columns:
    g = sns.lineplot(x=df.index, y=df[column])
g.tick_params(axis='x', labelrotation=-20)  # rotate the date labels
I hope this helps!
I have 2 datasets in which there are multiple repeated rows for each date, due to different recording times of each health attribute pertaining to that date.
I want to shrink my dataset by aggregating the values of each column pertaining to the same day. I don't want to create a new dataframe, because I then need to merge the result with other datasets. After trying the code below, my df's shape still has the same number of rows. Any help would be appreciated.
sample data:
count calorie update_time speed distance date
101 4.290000 2018-04-30 18:35:00.291 1.527778 78.420000 2018-04-30
25 0.960000 2018-04-13 19:55:00.251 1.027778 14.360000 2018-04-13
38 1.530000 2018-04-02 10:14:58.210 1.194444 24.190000 2018-04-02
35 1.450000 2018-04-27 10:55:01.281 1.500000 27.450000 2018-04-27
0 0.000000 2018-04-21 13:46:36.801 0.000000 0.000000 2018-04-21
34 1.820000 2018-04-01 08:35:05.481 2.222222 30.260000 2018-04-01
df_SC['date']=df_SC.groupby('date').agg({"distance": "sum","calorie":"sum",
"count":"sum","speed":"mean"}).reset_index()
I expect the sum of distance, calorie, and count, and the mean of speed, to show up under each respective column, with one row per date.
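For what it's worth, the aggregation itself looks right; the likely culprit is assigning its result into the single 'date' column rather than rebinding the frame, which leaves the row count unchanged. A sketch of the presumably intended version:

# Rebind the aggregated frame instead of writing it into the 'date' column
df_SC = df_SC.groupby('date').agg({"distance": "sum",
                                   "calorie": "sum",
                                   "count": "sum",
                                   "speed": "mean"}).reset_index()
# df_SC now has one row per date and can still be merged with other
# datasets afterwards, e.g. pd.merge(df_SC, other_df, on='date')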
I need to calculate a moving average using pandas.
import numpy as np
import pandas as pd

ser = pd.Series(np.random.randn(100),
                index=pd.date_range('1/1/2000', periods=100, freq='1min'))
ser.rolling(window=20).mean().tail(5)
[Out]
2000-01-01 01:35:00 0.390383
2000-01-01 01:36:00 0.279308
2000-01-01 01:37:00 0.173532
2000-01-01 01:38:00 0.194097
2000-01-01 01:39:00 0.194743
Freq: T, dtype: float64
But after appending a new row like this,
new_row = pd.Series([1.0], index=[pd.to_datetime("2000-01-01 01:40:00")])
ser = pd.concat([ser, new_row])  # Series.append was removed in pandas 2.0
I have to recalculate all the moving averages, like this:
ser.rolling(window=20).mean().tail(5)
[Out]
2000-01-01 01:36:00 0.279308
2000-01-01 01:37:00 0.173532
2000-01-01 01:38:00 0.194097
2000-01-01 01:39:00 0.194743
2000-01-01 01:40:00 0.201918
dtype: float64
I think I only need to compute the value for the newly appended row (2000-01-01 01:40:00 → 0.201918), but I can't find a pandas API that calculates only the last appended row's value; rolling().mean() always recalculates over the whole series.
This is a simple example, but in my real project the series has more than 1,000,000 rows, and each rolling calculation consumes a lot of time.
Is there a way to solve this problem in pandas?
As Anton vBR wrote in his comment, after you append the row, you can calculate the last value with
ser.tail(20).mean()
which takes time independent of the series length (1000000 in your example).
If you do this operation often, you can calculate it a bit more efficiently. The mean after appending the row is:

(20 × the previous last rolling mean + the newly appended value − the value at the 21st-to-last index, which drops out of the window) / 20

This is more complicated to implement, though.
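A minimal, self-contained sketch of that incremental update (the window size and appended value follow the example above):

import numpy as np
import pandas as pd

window = 20
ser = pd.Series(np.random.randn(100),
                index=pd.date_range('1/1/2000', periods=100, freq='1min'))
prev_mean = ser.rolling(window).mean().iloc[-1]  # last mean before the append

new_row = pd.Series([1.0], index=[pd.to_datetime("2000-01-01 01:40:00")])
dropped = ser.iloc[-window]                      # value that leaves the window
ser = pd.concat([ser, new_row])

new_mean = prev_mean + (new_row.iloc[0] - dropped) / window
# new_mean matches ser.rolling(window).mean().iloc[-1] up to float rounding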
I have a pandas DataFrame like this.
As you can see, the data corresponds to end-of-month data. The problem is that the end-of-month date is not the same for all the columns. (The underlying reason is that the last trading day of the month does not always coincide with the end of the month.)
Currently, the end of January 2016 has two rows, "2016-01-29" and "2016-01-31." It should be just one row; for example, the end of January 2016 should just be 451.1473, 1951.218, 1401.093 for Index A, Index B, and Index C.
Another point is that even though each row almost always corresponds to end-of-month data, the data might not be clean and could conceivably include mid-month data in a random column. In that case, I don't want to make any adjustment, so that any prior data-collection error is caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the ends of 2015-05, 2015-10, and 2016-01. However, the rows for 2015-07 and 2015-08 simply do not have data, so I would like to leave those as NaN while merging the end-of-month rows for 2015-05, 2015-10, and 2016-01. Hopefully this provides more insight into what I am trying to do.
You can use:
df = df.groupby(pd.Grouper(freq='M')).ffill()
df = df.resample('M').last()
to create a new DatetimeIndex ending on the last day of each month and sample the last available data point for each month. The within-month ffill() ensures that a column missing data on the month's last available date picks up the prior available value. (Older pandas spelled this pd.TimeGrouper('M'), fillna(method='ffill'), and resample(rule='M', how='last').)
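As a quick check on the two 2015-05 rows from the sample above (a sketch, assuming the frame is indexed by DATE as shown):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Index A": [2107.39, np.nan],
     "Index B": [np.nan, 1550.39],
     "Index C": [np.nan, 229.1]},
    index=pd.to_datetime(["2015-05-29", "2015-05-31"]),
)
out = df.groupby(pd.Grouper(freq="M")).ffill().resample("M").last()
# out holds a single 2015-05-31 row: 2107.39, 1550.39, 229.1; a month where a
# column has no data at all (e.g. Index B in 2015-07) would stay NaN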
I'm running into problems when taking a lower-frequency time series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
import numpy as np
import pandas as pd

data = np.arange(3, dtype=np.float64)
s = pd.Series(data, index=pd.date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN').asfreq()  # on the old API this was simply s.resample('W-SUN')
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(pd.date_range(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I made a GitHub issue regarding your question; the relevant feature needs to be added to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W', fill_method='ffill')
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
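(On current pandas the fill_method keyword has been removed; the same result comes from chaining the fill: s.resample('W').ffill().)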
But for the others you'll have to do some contorting right now, which will hopefully be remedied by the GitHub issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.
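In the meantime, one possible contortion for case (1), as a sketch using plain index arithmetic (not the eventual pandas feature):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(3, dtype=np.float64),
              index=pd.date_range('2012-01-01', periods=3, freq='M'))

# Weekly grid spanning the data, then place each monthly value on the
# last W-SUN date falling on or before its month end
weekly = pd.date_range(s.index[0].replace(day=1), s.index[-1], freq='W-SUN')
out = pd.Series(np.nan, index=weekly)
targets = weekly[weekly.searchsorted(s.index, side='right') - 1]
out[targets] = s.values
# out matches the desired case (1) output shown in the question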