start time series data when there are no NaN's

start time series data when there are no NaN's - python

I currently have some time series data that I applied a rolling mean on with a window of 17520.
Thus before the head of my data looked like this:
SETTLEMENTDATE ==
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
And now it looks like this:
SETTLEMENTDATE =
0 2006/01/01 00:30:00 NaN ... NaN NaN
1 2006/01/01 01:00:00 NaN ... NaN NaN
2 2006/01/01 01:30:00 NaN ... NaN NaN
3 2006/01/01 02:00:00 NaN ... NaN NaN
4 2006/01/01 02:30:00 NaN ... NaN NaN
How can I get it so that my data only begins, when there is not a NaN? (also making sure that the date matches)
=

You can try with rolling with min_periods = 1
data['NSW DEMAND'] = data['NSW DEMAND'].rolling(17520,min_periods=17520).mean()
Also try using for loo, you do not need to write the columns one by one
youcols=['xxx'...'xxx1']
for x in youcols:
data[x]=data[x].rolling(17520,min_periods=1).mean()
Base on your comments
for x in youcols:
data[x]=data[x].rolling(17520,min_periods=1).mean()
then ,
data=data.dropna(subset=youcols,thresh =1)

Related

Pandas fillna() method not filling all missing values

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is my dataframe I want to use to fill the NaN values
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substitued.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?

when fillna, you probably want a method, like fill using previous/next value, mean of column etc, what we can do is like this
nulls_index = df['rain_gauge_value'].isnull()
df = df.fillna(method='ffill') # use ffill as example
nulls_after_fill = df[nulls_index]
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

You need to inform pandas how you want to patch. It may be obvious to you want to use the "patch" dataframe's values when the date and times line up, but it won't be obvious to pandas. see my dummy example:
raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time= [time(0,0,0),time(0,0,1)],temp=[1.,np.nan],rain=[4.,np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
you need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time)
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0

Take time points, and make labels against datetime object to correlate for things around points

I'm trying to use the usual times I take medication (so + 4 hours on top of that) and fill in a data frame with a label, of being 2,1 or 0, for when I am on this medication, or for the hour after the medication as 2 for just being off of the medication.
As an example of the dataframe I am trying to add this column too,
<bound method NDFrame.to_clipboard of id sentiment magnitude angry disgusted fearful \
created
2020-05-21 12:00:00 23.0 -0.033333 0.5 NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:45:00 46022.0 -1.000000 1.0 NaN NaN NaN
happy neutral sad surprised
created
2020-05-21 12:00:00 NaN NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN
... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN
2021-04-20 01:45:00 NaN NaN NaN NaN
[32024 rows x 10 columns]>
And the data for the timestamps for when i usually take my medication,
['09:00 AM', '12:00 PM', '03:00 PM']
How would I use those time stamps to get this sort of column information?
Update
So, trying to build upon the question, How would I make sure it only adds medication against places where there is data available, and making sure that the after medication timing of one hour is applied correctly!
Thanks

Use np.select() to choose the appropriate label for a given condition.
First dropna() if all values after created are null (subset=df.columns[1:]). You can change the subset depending on your needs (e.g., subset=['id'] if rows should be dropped just for having a null id).
Then generate datetime arrays for taken-, active-, and after-medication periods based on the duration of the medication. Check whether the created times match any of the times in active (label 1) or after (label 2), otherwise default to 0.
# drop rows that are empty except for column 0 (i.e., except for df.created)
df.dropna(subset=df.columns[1:], inplace=True)
# convert times to datetime
df.created = pd.to_datetime(df.created)
taken = pd.to_datetime(['09:00:00', '12:00:00', '15:00:00'])
# generate time arrays
duration = 2 # hours
active = np.array([(taken + pd.Timedelta(f'{h}H')).time for h in range(duration)]).ravel()
after = (taken + pd.Timedelta(f'{duration}H')).time
# define boolean masks by label
conditions = {
1: df.created.dt.floor('H').dt.time.isin(active),
2: df.created.dt.floor('H').dt.time.isin(after),
}
# create medication column with np.select()
df['medication'] = np.select(conditions.values(), conditions.keys(), default=0)
Here is the output with some slightly modified data that better demonstrate the active / after / nan scenarios:
created id sentiment magnitude medication
0 2020-05-21 12:00:00 23.0 -0.033333 0.5 1
3 2020-05-21 12:45:00 39.0 -0.500000 0.5 1
4 2020-05-21 13:00:00 90.0 -0.500000 0.5 1
5 2020-05-21 13:15:00 100.0 -0.033333 0.1 1
9 2020-05-21 14:15:00 1000.0 0.033333 0.5 2
10 2020-05-21 14:30:00 3.0 0.001000 1.0 2
17 2021-04-20 01:00:00 46022.0 -1.000000 1.0 0
20 2021-04-20 01:45:00 46022.0 -1.000000 1.0 0

Finding maximum null values in stretch and generating flag

I have dataframe with datetime and two columns.I have to find maximum stretch of null values in a 'particular date' for column 'X' and replace it with zero in both column for that particular date. In addition to that I have to create third column with name 'flag' which will carry value of 1 for every zero imputation in other two column or else value of 0. In example below, January 1st the maximum stretch null value is 3 times, so I have to replace this with zero. Similarly, I have to replicate the process for 2nd January.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
"Below is the result that I am expecting"
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is the extension of previous question. Here is the link Python - Find maximum null values in stretch and replacing with 0

First create consecutive groups for each column filled by unique values:
df1 = df.isna()
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
Then get groups with maximum count - here group 4:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Get mask for match rows for set 0 and for Flag column cast mask to integer to Tru/False to 1/0 mapping:
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
EDIT:
Added new condition for match dates from list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
#solution working also if no NaNs per filtered rows (prevent IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0

How to skip remainder of dataframe after blank line?

The following routine retrieves a data file.
wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")
df = pd.read_excel("C:/temp/sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')
The first three data lines are incomplete so I can slice them off like this df[3:]
At about line 1640 there is a blank line. I wish to skip the rest of the file after that line. I tried to find that line like so and get its index so I could do another slice, but I get nan for the index value.
df[df.isnull().all(1)].index.values[0]
How can I find that line and skip the rest of the file?

I think you have two nan-row problems in this file:
The first row after the header is already an empty row leading to a nan index.
The reason for your post here, the empty row which indicates the end of the data you're interested in.
first import the data as you did it:
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')
df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
then remove first empty row (nan-index), problem No1:
df = df[1:]
df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00
And now you want to index all rows before the first nan-index, problem No2.
Idea: create a boolean array with True entries for all nan- indices, cast to integer and build the cumulative sum. Now you have an array, which is 0 for all the data you want and >0 from any unwanted line on until the end.
This tested against 0 returns a boolean index for your data:
data_idx = df.index.isna().astype(int).cumsum() == 0
Applied to your dataframe:
df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89
[1635 rows x 12 columns]

Choosing time from 2300-0000 for different days

So I'm having a issue with with the 23:00-00:00 time for different days within in Python.
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-10 23:00:00 NaN 0.207653 0.205911 0.202886
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
What I'm looking for is to mainly select the 00:00:00 hour which is why I've applied df = df.reset_index().groupby(df.index.date).first().set_index('times') but if that doesn't exist that it should use the 23:00:00 of the previous days as the 00:00:00 of the next day. The following is wrong:
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
How do I get it to look at the 23:00:00 of the previous day to the 00:00:00 of the next day, to achieve this solution.
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.