Python - Pandas Downsampling with first returns NaN

I am trying to use pandas to resample vessel tracking data from seconds to minutes using how='first'. The dataframe is called hg1s, the unique ID is called MMSI, and the datetime index is TX_DTTM. Here is a data sample:
TX_DTTM MMSI LAT LON NS
2013-10-01 00:00:02 367542760 29.660550 -94.974195 15
2013-10-01 00:00:04 367542760 29.660550 -94.974195 15
2013-10-01 00:00:07 367451120 29.614161 -94.954459 0
2013-10-01 00:00:15 367542760 29.660210 -94.974069 15
2013-10-01 00:00:13 367542760 29.660210 -94.974069 15
The code to resample:
hg1s1min = hg1s.groupby('MMSI').resample('1Min', how='first')
And a data sample of the output:
hg1s1min[20000:20004]
MMSI TX_DTTM NS LAT LON
367448060 2013-10-21 00:42:00 NaN NaN NaN
2013-10-21 00:43:00 NaN NaN NaN
2013-10-21 00:44:00 NaN NaN NaN
2013-10-21 00:45:00 NaN NaN NaN
It's safe to assume that there are several data points within each minute, so I don't understand why this isn't picking up the first record with that method. I looked at this link: Pandas Downsampling Issue, because it seemed similar to my problem. I tried passing label='left' and label='right'; neither worked.
How do I return the first record in every minute for each MMSI?

As it turns out, the problem isn't with the method but with my assumption about the data. The full data set spans a month, or 44640 minutes. While every record in my dataset has the relevant values, there isn't 100% overlap in time. In this case MMSI = 367448060 is present at 2013-10-17 23:24:31 and again at 2013-10-29 20:57:32; between those two data points there is no data to sample, so NaN is the correct result.
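For reference, the how= keyword has since been removed from resample; a minimal sketch of the equivalent call in current pandas (same hg1s and MMSI as above, with the dropna step being an optional extra) is:
hg1s1min = hg1s.groupby('MMSI').resample('1Min').first()
# optionally drop the minutes where a vessel has no positions at all
hg1s1min = hg1s1min.dropna(how='all')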


Pandas unstack() and pivot(): MemoryError

Problem description
I'd like to unstack or pivot a DataFrame, but it raises the numpy exception MemoryError: Unable to allocate 1.72 GiB for an array with shape (1844040704,) and data type bool. I have tried this with a DataFrame with a numerical index (df.pivot()) and with a MultiIndex (df.unstack()). Both raise the same exception and I don't know a way around it. I don't feel like I have an exceptionally large dataset at 175200 rows; I have previously used unstack on DataFrames with more than 5 million rows. The df will even become 2x larger for the complete analysis!
I try to unstack with df_unstacked = df.unstack(level=0)
Additional info
Before pivot/unstack, I had to add a unique index with df['row_num'] = np.arange(len(df)), because the dataset contains (wanted) duplicate index entries. That's due to daylight saving time, where one day in October has 25 hours; the 2nd hour is duplicated.
I work with JupyterLab from a virtualenv with Python 3.7.
Package versions:
pandas==1.1.2
numpy==1.19.2
jupyterlab==2.2.8
Example data
value
target_frame row_num year
2017-01-01 01:00:00 0 2016 10,3706
2017-01-01 01:15:00 1 2016 27,2456
2017-01-01 01:30:00 2 2016 20,4022
2017-01-01 01:45:00 3 2016 14,4911
2017-01-01 02:00:00 4 2016 14,2611
... ...
2017-12-31 23:45:00 175195 2020 30,7177
2017-01-01 00:00:00 175196 2020 21,4708
2017-01-01 00:15:00 175197 2020 44,9192
2017-01-01 00:30:00 175198 2020 37,8560
2017-01-01 00:45:00 175199 2020 30,9901
[175200 rows x 1 columns]
Desired result
The index will contain duplicates. For the record, I don't care if it's an index or a regular column.
value
year 2016 2017 ... 2020
target_frame
2017-01-01 01:00:00 10,3706 11 ... 32
2017-01-01 01:15:00 27,2456 12 ... 32
2017-01-01 01:30:00 20,4022 13 ... 541
2017-01-01 01:45:00 14,4911 51 ... 123
2017-01-01 02:00:00 14,2611 56 ... 12
... ...
2017-12-31 23:45:00 30,7177 12 ... 12
2017-01-01 00:00:00 21,4708 21 ... 12
2017-01-01 00:15:00 44,9192 21 ... 13
2017-01-01 00:30:00 37,8560 21 ... 11
2017-01-01 00:45:00 30,9901 12 ... 10
[35040 rows x 5 columns]
I will try to help by addressing the lack of memory and a way to deal with it.
Since the array pandas is trying to allocate already has almost 2 billion elements and the error is memory related, I will focus on that without going into the transformations themselves.
If you are using something like df, df_pivoted, df_unstacked, etc., each transformation creates a new variable and multiplies your memory consumption, so it is important to free memory along the way, even if your data doesn't seem big enough to consume all of it.
One way to solve this problem is to work in "chunks" and save each transformation step to a file in order to clear the memory.
So the first step is to save the file, with a simple DataFrame.to_csv().
The second step is to make the transformations using parts of the data that fit in memory.
For this, the pandas.read_csv() function has an argument called chunksize that turns the import into a TextFileReader iterator.
That way, if you want to access the data, you need to iterate over it:
iterator = pandas.read_csv('file.csv', chunksize=32)
iterator.shape # will raise an error
AttributeError: 'TextFileReader' object has no attribute 'shape'
The right way to do it:
for chunk in iterator:
    print(chunk.shape)
output:
(32, ncols)
That way, to deal with your problem, you can work with chunks and use the concat/join functions to do the analysis as you need the data.
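A rough sketch of that workflow (the file name, chunk size, and per-chunk aggregation below are placeholders, not from the question):
import pandas as pd
# step 1: persist the intermediate result so the in-memory frame can be freed
df.to_csv('intermediate.csv')
del df
# step 2: process the file in pieces that fit in memory
pieces = []
for chunk in pd.read_csv('intermediate.csv', chunksize=50000):
    # placeholder per-chunk work; replace with the real transformation
    pieces.append(chunk.groupby('year')['value'].sum())
# step 3: combine the partial results
result = pd.concat(pieces).groupby(level=0).sum()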
I think this might be a bug in pandas or numpy. There are different error messages with different pandas and numpy versions (Anaconda vs. pip). I coded the transformation myself and it runs in no time.
# Get the 2017 timestamps for the side_df
side_df = pd.DataFrame({'timestamp': next_df.loc[next_df['year'] == 2017]['target_frame']})
for year in next_df['year'].unique():
    side_df[year] = next_df.loc[next_df['year'] == year]['value']
display(side_df)
Results in:
timestamp 2016 2017 2018 2019 2020
8839 2017-01-01 01:00:00 10,3706 4,4184 14,7919 30,6942 31,0594
8840 2017-01-01 01:15:00 27,2456 23,7641 31,0019 40,2778 46,8350
8841 2017-01-01 01:30:00 20,4022 14,9732 23,8531 34,4941 41,3688
8842 2017-01-01 01:45:00 14,4911 9,4986 17,0181 28,8678 37,8213
8843 2017-01-01 02:00:00 14,2611 5,1241 14,0869 24,3203 34,4150
... ... ... ... ... ... ...
43874 2017-12-31 23:45:00 10,9256 15,2959 22,6000 40,1677 NaN
43875 2017-01-01 00:00:00 10,9706 4,8184 11,5150 30,9208 NaN
43876 2017-01-01 00:15:00 35,6275 25,8251 30,2893 41,5722 NaN
43877 2017-01-01 00:30:00 24,555 17,7821 24,2928 35,5510 NaN
43878 2017-01-01 00:45:00 5,61 11,7059 20,0477 31,2884 NaN
There are still some problems in the dataset (like the NaNs), but that has nothing to do with this question.

pandas resample starting before the dataset first entry

Dear experienced community,
I can't find an elegant solution to my problem.
I have a subsample of my dataset which I want to resample weekly, but starting some weeks before the first entry in my data frame (so a few weeks with 0 counts).
A sample of the data:
In:
print(df_pec.head())
Out:
Count Image_Sequence_DateTime
18 1 2015-11-06 03:22:19
21 1 2015-11-11 01:48:51
22 1 2015-11-11 07:30:47
37 1 2015-11-25 09:42:23
48 1 2015-12-05 12:12:34
With the earliest image sequence at:
In:
df_pec.Image_Sequence_DateTime.min()
Out:
2015-09-30 15:16:38
I have another function that gives me the starting point of the first week and the last point of the last week ever measured in that experiment, which are:
In:
print(s_startend)
Out:
Start 2015-09-28
End 2017-12-25
dtype: datetime64[ns]
My problem is that I want to resample df_pec weekly, but starting on the very first second of the very first day of the very first week of the experimental deployment (using s_startend as reference).
I try:
df_pec=df_pec.resample('1W', on='Image_Sequence_DateTime').sum()
print(df_pec.head(),'\n',df_pec.tail())
Out:
Count
Image_Sequence_DateTime
2015-10-04 26.0
2015-10-11 92.0
2015-10-18 204.0
2015-10-25 193.0
2015-11-01 187.0
Count
Image_Sequence_DateTime
2017-11-19 20.0
2017-11-26 34.0
2017-12-03 16.0
2017-12-10 11.0
2017-12-17 3.0
This is pretty weird because it even skips the first days of data in df_pec (which starts 2015-09-30 15:16:38).
And even if it worked, I have no way of telling the resampling to start and end at specified values (s_startend from my example), even if there are no records in the earliest and latest weeks of my subsample df_pec.
I thought about artificially adding two entries to df_pec with the real start and real end, but that is not so elegant and I don't want to add meaningless keys to my df.
Thank you very much for your wisdom!
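A minimal sketch of one possible workaround, assuming df_pec and s_startend as above and weeks labelled by their end date as in the output shown: resample as before, then reindex onto the full weekly range built from s_startend so the empty leading and trailing weeks appear with a count of 0.
weekly = df_pec.resample('1W', on='Image_Sequence_DateTime').sum()
# full weekly range covering the whole deployment
full_range = pd.date_range(start=s_startend['Start'], end=s_startend['End'], freq='1W')
# weeks with no images become 0 instead of being absent
weekly = weekly.reindex(full_range, fill_value=0)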

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day).
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....
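For the tl;dr example, a minimal sketch of one way to compute the PreviousMonthMean column described above (a plain per-row loop rather than a resample; it reproduces the table for this small frame):
import pandas as pd
from pandas.tseries.offsets import DateOffset
rng = pd.date_range('2017-01-03', periods=20, freq='8D')
df = pd.DataFrame({'x': rng.day}, index=rng)
df['MonthPrior'] = df.index - DateOffset(months=1)
# for each row, mean of x over the half-open window [MonthPrior, index); NaN when empty
df['PreviousMonthMean'] = [
    df.loc[(df.index >= start) & (df.index < end), 'x'].mean()
    for start, end in zip(df['MonthPrior'], df.index)
]
print(df)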

How to fillna/missing values for an irregular timeseries for a Drug when Half-life is known

I have a dataframe (df) where column A is the drug amount dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration given the half-life of the drug (180 mins). I am struggling with the code in pandas. Would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I wanted to fillna(values) as a function of time elapsed and the half-life of the drug,
something like
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp              A
1991-04-21 09:09:00    9.0
1991-04-21 13:00:00    NaN
1991-04-21 19:00:00    NaN
1991-04-22 07:35:00    10.0
1991-04-22 13:40:00    NaN
1991-04-22 16:56:00    NaN"""
df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)

# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]

# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()

# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()

# apply exponential decay
decay = np.exp(-tdiffs / lamda)

# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
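As a quick sanity check on those numbers (my own arithmetic): the first gap is 231 minutes (09:09 to 13:00), so the filled value is 9.0 * 0.5 ** (231 / 180) ≈ 3.698, matching the 3.697606 above.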

Pandas, Filling between dates with average change between previous rows

I think this is best illustrated with examples. Let's say we have a DataFrame like this:
295340 299616
2014-11-02 304.904110 157.123288
2014-12-02 597.303413 305.488493
2015-01-02 896.310372 454.614630
2015-02-02 1192.379580 599.588466
2015-02-04 1211.285484 NaN
2015-03-02 NaN 726.622932
Now let's say I want to reindex this, like so:
rng = pd.date_range(df.index[0], df.index[-1])
df.reindex(rng)
295340 299616
2014-11-02 304.904110 157.123288
2014-11-03 NaN NaN
2014-11-04 NaN NaN
2014-11-05 NaN NaN
...
2014-11-29 NaN NaN
2014-11-30 NaN NaN
2014-12-01 NaN NaN
2014-12-02 597.303413 305.488493
Now if we look at 295340, we see the difference between its values is (597.30 - 304.90) = 292.39.
The number of days between the two values is 31, so the average increase is 9.43 a day.
So what I would want is something like this:
295340 299616
2014-11-02 304.904110 157.123288
2014-11-03 314.336345 NaN
2014-11-04 323.768581 NaN
2014-11-05 333.200816 NaN
The way I calculated that was:
304.904110 + (((597.303413-304.904110) / 31) * N)
where N is 1 for the first row after 2014-11-02, 2 for the next row, and so on.
I would obviously want all the columns filled this way, so 299616 with the same method and such.
Any ideas for something that is as efficient as possible? I know of ways to do this, but nothing seems efficient, and it feels like there should be some kind of fillna() or similar that works for this type of finance-related problem.
NOTE: The columns will not all be spaced out the same. Each one can have numbers anywhere within the range of dates, so I can't just assume that the next number for each column will be at X date.
You can use DataFrame.interpolate with the "time" method after a resample. (It won't give quite the numbers you gave, because there are only 30 days between 2 Nov and 2 Dec, not 31):
>>> dnew = df.resample("1d").interpolate("time")
>>> dnew.head(100)
295340 299616
2014-11-02 304.904110 157.123288
2014-11-03 314.650753 162.068795
[...]
2014-11-28 558.316839 285.706466
2014-11-29 568.063483 290.651972
2014-11-30 577.810126 295.597479
2014-12-01 587.556770 300.542986
2014-12-02 597.303413 305.488493
2014-12-03 606.948799 310.299014
[...]
2014-12-30 867.374215 440.183068
2014-12-31 877.019600 444.993589
2015-01-01 886.664986 449.804109
2015-01-02 896.310372 454.614630
[...]
2015-02-01 1182.828960 594.911891
2015-02-02 1192.379580 599.588466
[...]
The downside here is that it'll extrapolate using the last value at the end:
[...]
2015-01-31 1173.278341 590.235315
2015-02-01 1182.828960 594.911891
2015-02-02 1192.379580 599.588466
2015-02-03 1201.832532 604.125411
2015-02-04 1211.285484 608.662356
2015-02-05 1211.285484 613.199302
2015-02-06 1211.285484 617.736247
[...]
So you'd have to decide how you want to handle that.
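One option (a sketch assuming pandas 0.23 or later, not part of the original answer) is to restrict the interpolation to the region between existing observations with limit_area='inside', so dates after the last valid value in each column stay NaN instead of repeating it:
# build the daily grid first, then interpolate only the interior gaps
dnew = df.resample("1d").asfreq().interpolate("time", limit_area="inside")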
