Need to replace NaN values of a time series dataframe with logic - python

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000', '3/13/2000', '3/14/2000',
                            '3/15/2000', '3/16/2000', '3/17/2000', '3/18/2000'],
                   'value': [2, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 25]})
In this dataframe, I want to replace the NaN values with the following logic:
In this case, the difference in days between the two dates where the value column is not NaN is 8 days, i.e. 3/18/2000 - 3/10/2000 = 8 days. And let's say delta = 23, which we get from subtracting 25 - 2.
I want to replace the NaN value for each such day as 2 + delta * (t/8), where t is the number of days between that day and the first non-NaN value.
My desired outcome of value column is :
[2,4.875,7.75,10.625,13.5,16.375,19.25,22.125,25]

You can convert the date column to datetime, set it as the index, and interpolate with the 'index' method:
df['value'] = (df
               .assign(date=pd.to_datetime(df['date']))
               .set_index('date')['value']
               .interpolate('index')
               .values
               )
output:
date value
0 3/10/2000 2.000
1 3/11/2000 4.875
2 3/12/2000 7.750
3 3/13/2000 10.625
4 3/14/2000 13.500
5 3/15/2000 16.375
6 3/16/2000 19.250
7 3/17/2000 22.125
8 3/18/2000 25.000
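For reference, here is a minimal sketch (my addition, not part of the original answer) that computes the same values directly from the formula stated in the question, assuming the first row holds the first non-NaN value:

import pandas as pd

dates = pd.to_datetime(df['date'])
t = (dates - dates.iloc[0]).dt.days    # days since the first non-NaN value
delta = 25 - 2                         # difference between the two known values
manual = 2 + delta * (t / 8)           # the formula from the question
print(manual.tolist())
# [2.0, 4.875, 7.75, 10.625, 13.5, 16.375, 19.25, 22.125, 25.0]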

Related

How to shift a column by 1 year in Python

With the pandas shift function, you can offset values by a number of rows. I'm looking to offset values by a specified time instead, which is 1 year in this case.
Here is my sample data frame. The value_py column is what I'm trying to produce with a shift function. This is an oversimplified example of my problem. How do I specify a date offset as the shift parameter instead of a number of rows?
import pandas as pd
import numpy as np
test_df = pd.DataFrame({'dt': ['2020-01-01', '2020-08-01', '2021-01-01', '2022-01-01'],
                        'value': [10, 13, 15, 14]})
test_df['dt'] = pd.to_datetime(test_df['dt'])
test_df['value_py'] = [np.nan, np.nan, 10, 15]
I have tried this, but the index gets shifted by 1 year instead of the shifted values being aligned back to the original dates:
test_df.set_index('dt')['value'].shift(12, freq='MS')
This should solve your problem:
test_df['new_val'] = test_df['dt'].map(test_df.set_index('dt')['value'].shift(12, freq='MS'))
test_df
dt value value_py new_val
0 2020-01-01 10 NaN NaN
1 2020-08-01 13 NaN NaN
2 2021-01-01 15 10.0 10.0
3 2022-01-01 14 15.0 15.0
Use .map() to map the values of the shifted dates back onto the original dates.
Also note that the shift parameter should be 12 (shifting past values forward to the current date), not -12.
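A minimal end-to-end sketch of this approach, using the question's own data:

import pandas as pd

test_df = pd.DataFrame({'dt': ['2020-01-01', '2020-08-01', '2021-01-01', '2022-01-01'],
                        'value': [10, 13, 15, 14]})
test_df['dt'] = pd.to_datetime(test_df['dt'])

# Shift the date index forward by 12 month-starts, then look up each original
# date in the shifted series; dates with no observation a year earlier get NaN.
shifted = test_df.set_index('dt')['value'].shift(12, freq='MS')
test_df['new_val'] = test_df['dt'].map(shifted)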

How do I split a column into two in Python based on the data in it

For instance, the column I want to split here is duration. It has data points like 110 or 2 Seasons; I want to make a separate column for seasons, and in place of the season entries my current column should say null, since that would let the column's type be int instead of string.
[screenshot of my data]
I tried the split function, but that is for splitting within a single data point, not for separating different kinds of data points.
I have tried to replicate a portion of your dataframe in order to provide the solution below - note that it will also change the np.nan values to 'Null' as requested.
Creating the sample dataframe off of your screenshot:
import numpy as np
import pandas as pd

movies_dic = {'release_year': [2021, 2020, 2021, 2021, 2021, 1940, 2018, 2008, 2021],
              'duration': [np.nan, 94, 108, 97, 104, 60, '4 Seasons', 90, '1 Season']}
stack_df = pd.DataFrame(movies_dic)
stack_df
The issue is likely that the 'duration' column is of object dtype - namely, it contains both string and integer values. I have made 2 small functions that check each value's data type and allocate it to the appropriate column. The first takes all the string rows and places them in the 'series_duration' column:
def series(x):
    # Keep string values (e.g. '4 Seasons'); everything else becomes 'Null'
    if type(x) == str:
        return x
    else:
        return 'Null'
Then the movies function keeps the integer values (i.e. those without the word 'Season' in them) as is:
def movies(x):
    # Keep integer values; strings and NaN become 'Null'
    if type(x) == int:
        return x
    else:
        return 'Null'
stack_df['series_duration'] = stack_df['duration'].apply(lambda x: series(x))
stack_df['duration'] = stack_df['duration'].apply(lambda x: movies(x))
stack_df
release_year duration series_duration
0 2021 Null Null
1 2020 94 Null
2 2021 108 Null
3 2021 97 Null
4 2021 104 Null
5 1940 60 Null
6 2018 Null 4 Seasons
7 2008 90 Null
8 2021 Null 1 Season
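As an aside, a vectorized variant of the same split (my addition, starting again from the original mixed 'duration' column) uses pd.to_numeric with errors='coerce', which turns non-numeric entries into NaN:

import pandas as pd

stack_df = pd.DataFrame(movies_dic)  # rebuild, since 'duration' was overwritten above
num = pd.to_numeric(stack_df['duration'], errors='coerce')
# Season strings go to 'series_duration'; note that pre-existing NaNs stay
# NaN here rather than becoming the string 'Null'.
stack_df['series_duration'] = stack_df['duration'].where(num.isna())
stack_df['duration'] = num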
I have created an example to give you some ideas about how to manage the problem.
First of all, I created a DataFrame with ints, strings in the format 'X seasons', and negative numbers:
import numpy as np
import pandas as pd

data = [5, 4, 3, 4, 5, 6, '4 seasons', -110, 10]
df = pd.DataFrame(data, columns=['Numbers'])
Then I created the following loop. What it does is create new columns depending on the format of the value (string or negative number), insert the value there, and replace the original value with NaN.
index = 0
for n in df['Numbers']:
    if type(n) == str:
        # Move season strings to their own column
        df.loc[index, 'Seasons'] = n
        df['Numbers'] = df['Numbers'].replace([n], np.nan)
    elif n < 0:
        # Move negative numbers to their own column
        df.loc[index, 'Negatives'] = n
        df['Numbers'] = df['Numbers'].replace([n], np.nan)
    index += 1
The output would be like this:
Numbers Seasons Negatives
0 5.0 NaN NaN
1 4.0 NaN NaN
2 3.0 NaN NaN
3 4.0 NaN NaN
4 5.0 NaN NaN
5 6.0 NaN NaN
6 NaN 4 seasons NaN
7 NaN NaN -110.0
8 10.0 NaN NaN
Then you can adjust this example as you want.

Replace values of outliers with mean of closest normal data points in Pandas DataFrame

I need to perform seasonal decomposition on around 1000 time series which can contain outliers. I want to replace the outliers with a mean value, since outliers can corrupt the seasonality extraction.
df
TimeStamp | value
2021-01-01 1
2021-01-02 5
2021-01-03 23
2021-01-04 18
2021-01-05 7
2021-01-06 3
...
Outliers are defined as any sample with an absolute z-score larger than 3.
df['zscore'] = scipy.stats.zscore(df['value'])
I can identify the timestamps of all outliers:
df[df['zscore'].abs() >= 3].index
which in above example would return
[2021-01-03,2021-01-04]
Given this list of indexes, how do I replace the value with the mean of the closest previous and next neighbors such that I get the following output?
df_mod
TimeStamp | value
2021-01-01 1
2021-01-02 5
2021-01-03 6
2021-01-04 6
2021-01-05 7
2021-01-06 3
...
Would appreciate any help on how to realize this type of function / logic.
EDIT
There can exist NaN values in the time series from the beginning, which I do not want to replace with the mean.
You can replace values with NaN by condition using Series.mask, and then replace each masked value with the mean of its nearest non-NaN neighbors by adding the forward-filled and back-filled series and dividing by 2:
df = df.reset_index(drop=True)
orig = df.index                     # remember the full index
df = df.dropna(subset=['value'])    # set the pre-existing NaNs aside
df['value'] = df['value'].mask(df['zscore'].abs() >= 3)
df['value'] = df['value'].ffill().add(df['value'].bfill()).div(2)
df = df.reindex(orig)               # bring the pre-existing NaN rows back
Solution without helper column:
zscore = pd.Series(scipy.stats.zscore(df['value'], nan_policy='omit'), index=df.index)
df['value'] = df['value'].mask(zscore.abs() >= 3)
df['value'] = df['value'].ffill().add(df['value'].bfill()).div(2)
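To see why the ffill/bfill trick yields the neighbor mean, here is a toy sketch (my addition; the values mirror the question's example after masking):

import numpy as np
import pandas as pd

s = pd.Series([1, 5, np.nan, np.nan, 7, 3])
# ffill gives [1, 5, 5, 5, 7, 3] and bfill gives [1, 5, 7, 7, 7, 3];
# their average fills each gap with the mean of the nearest valid
# values on either side.
print(s.ffill().add(s.bfill()).div(2).tolist())
# [1.0, 5.0, 6.0, 6.0, 7.0, 3.0]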

How to lag data by x specific days on a multi index pandas dataframe?

I have a DataFrame that has dates, assets, and then price/volume data. I'm trying to pull in data from 7 days ago, but the issue is that I can't use shift() because my table has missing dates in it.
date cusip price price_7daysago
1/1/2017 a 1
1/1/2017 b 2
1/2/2017 a 1.2
1/2/2017 b 2.3
1/8/2017 a 1.1 1
1/8/2017 b 2.2 2
I've tried creating a function using loc and timedelta to do this shifting, but I was only able to output empty numpy arrays:
def row_delta(x, df, days, colname):
    if datetime.strptime(x['recorddate'], '%Y%m%d') - timedelta(days) in [datetime.strptime(d, '%Y%m%d') for d in df['recorddate'].unique().tolist()]:
        return df.loc[(df['recorddate_date'] == df['recorddate_date'] - timedelta(days)) & (df['cusip'] == x['cusip']), colname]
    else:
        return 'nothing'
I also thought of doing something similar to this in order to fill in missing dates, but my issue is that I have multiple indexes, the dates and the cusips so I can't just reindex on this.
Merge the DataFrame with itself while adding 7 days to the date column of the right frame. Use the suffixes argument to name the columns appropriately.
import pandas as pd

df['date'] = pd.to_datetime(df.date)
df.merge(df.assign(date=df.date + pd.Timedelta(days=7)),
         on=['date', 'cusip'],
         how='left', suffixes=['', '_7daysago'])
Output: df
date cusip price price_7daysago
0 2017-01-01 a 1.0 NaN
1 2017-01-01 b 2.0 NaN
2 2017-01-02 a 1.2 NaN
3 2017-01-02 b 2.3 NaN
4 2017-01-08 a 1.1 1.0
5 2017-01-08 b 2.2 2.0
you can set date and cusip as index and use unstack and shift together; pass freq='D' so the shift is by 7 calendar days rather than 7 rows, since the table has missing dates
shifted = df.set_index(["date", "cusip"]).unstack().shift(7, freq="D").stack()
then simply merge shifted with your original df
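A sketch of that final merge (my addition; it assumes df has the columns date, cusip and price from the question):

shifted = (df.set_index(['date', 'cusip'])
             .unstack()
             .shift(7, freq='D')   # calendar-based shift, as above
             .stack())

# Join the week-ago prices back onto the original rows; rows with no
# observation 7 days earlier get NaN.
out = df.merge(shifted.rename(columns={'price': 'price_7daysago'}),
               left_on=['date', 'cusip'], right_index=True, how='left')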

Fast elementwise apply function using index and column name - pandas

I have a dataframe which can be simplified like this:
df = pd.DataFrame(index=['01/11/2017', '01/11/2017', '01/11/2017',
                         '02/11/2017', '02/11/2017', '02/11/2017'],
                  columns=['Period', '_A', '_B', '_C'])
df.Period = [1, 2, 3, 1, 2, 3]
df
which looks like:
Date Period _A _B _C
01/11/2017 1 NaN NaN NaN
01/11/2017 2 NaN NaN NaN
01/11/2017 3 NaN NaN NaN
02/11/2017 1 NaN NaN NaN
02/11/2017 2 NaN NaN NaN
02/11/2017 3 NaN NaN NaN
And I want to apply my function to each cell
Get_Y(Date, Period, Location)
(where _A, _B, _C, ... are the locations).
Get_Y is a complex function, that looks up data from other dataframes using the Date, Period and Location, and based on criteria gives a value for Y (a float between 0 and 1).
I have managed to make this work with iterrows:
for index, row in PeriodDF.iterrows():
    date = index
    Period = row.loc[row.index[0]]
    LocationList = row.index[1:]
    print(date, Period)
    for Location in LocationList:
        PeriodDF.loc[(PeriodDF.index == date) & (PeriodDF.Period == Period), Location] = Get_Y(date, Period, Location)
But this takes over 1 hour.
There must be a way to do this faster in pandas.
I have tried creating 3 dataframes (one an array of the Period, one an array of the Location, and one of the Date), but I'm not sure how to apply Get_Y elementwise using the value from each dataframe.
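Since no answer is included in this excerpt, here is a hedged sketch of one common speedup, assuming Get_Y has to stay a per-cell black box (PeriodDF and Get_Y are the question's names): reshape to long format once, so the inner loop avoids the repeated boolean-mask indexing.

import pandas as pd

# One row per (Date, Period, Location) cell
long = (PeriodDF.rename_axis('Date')
                .reset_index()
                .melt(id_vars=['Date', 'Period'],
                      var_name='Location', value_name='Y'))

# A single plain loop over rows, with no per-cell DataFrame lookups
long['Y'] = [Get_Y(d, p, loc)
             for d, p, loc in zip(long['Date'], long['Period'], long['Location'])]

# Back to the original wide shape
result = long.pivot(index=['Date', 'Period'], columns='Location', values='Y')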
