I have a highly sparse dataframe (only one non-zero value per row) indexed by non-regular Timestamps for which I am trying to do the following.
For each non-zero value in a given column, I want to count the number of other non-zero values in other columns within a given timedelta. In a way, I am trying to compute something similar to a rolling cross_tab.
My solution so far is ugly and slow as I haven't figured out how to do this using slicing and rolling. It looks something like:
delta = 1
values = pd.DataFrame(0, index=df.columns, columns=df.columns)
for j in df.columns:
    for i in range(len(df[df[j] != 0].index) - 1):
        # min is used to avoid overlapping windows
        values[j] += df[(df.index < min((df[df[j] != 0].index + pd.tseries.timedeltas.to_timedelta(delta, unit='h'))[i],
                                        df[df[j] != 0].index[i + 1]))
                        & (df.index >= df[df[j] != 0].index[i])].astype(bool).sum()
values = values.T
and a toy-example dataframe is:
df = pd.DataFrame.from_dict({"2016-01-01 10:00.00": [0, 1],
                             "2016-01-01 10:30.00": [1, 0],
                             "2016-01-01 12:00.00": [0, 1],
                             "2016-01-01 14:00.00": [1, 0]},
                            orient="index")
df.columns=['a','b']
df.index = pd.to_datetime(df.index)
a b
2016-01-01 10:00:00 0 1
2016-01-01 10:30:00 1 0
2016-01-01 12:00:00 0 1
2016-01-01 14:00:00 1 0
The desired output should look like (with the counts depending on the timedelta):
a b
a 1 0
b 1 1
Hard to tell what exactly you want, but it sounded kind of like this.
I want to use a new feature of pandas 0.19: time-aware rolling. In order to use it, we need a sorted index.
d1 = df.sort_index()
Now, let's assume we want to count within plus or minus two hours. Let's start by adding two hours to every element of the index:
d1.index = d1.index + pd.offsets.Hour(2)
Then we'll roll through, looking back four hours. This will be like looking forward two hours and backwards two hours relative to the original indices.
d2 = d1.rolling('4H').sum()
d2.index = d2.index - pd.offsets.Hour(2)
d2
a b
2016-01-01 10:00:00 0.0 1.0
2016-01-01 10:30:00 1.0 1.0
2016-01-01 12:00:00 1.0 2.0
2016-01-01 14:00:00 2.0 1.0
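If the end goal is the column-by-column matrix from the question, one possible way to get there is to weight these windowed counts by where each column actually has a non-zero entry. This is only a rough sketch built on the d2 above, not the exact one-hour, non-overlapping logic of the original loop, so the numbers will differ from the example output:
# Rough sketch: for every column j, add up the windowed counts in d2 at the
# timestamps where column j itself is non-zero. The result is a
# columns-by-columns matrix in the spirit of the rolling cross-tab
# (self-counts included, window is +/- 2 hours as set up above).
mask = d1.astype(bool).astype(int)   # 1 where an event occurred, 0 otherwise
crosstab = mask.T.dot(d2)            # index: event column, columns: counted column
print(crosstab)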
I need to perform seasonal decomposition on around 1000 time series, which can contain outliers. I want to replace the outliers with a mean value, since outliers can corrupt the seasonality extraction.
df
TimeStamp | value
2021-01-01 1
2021-01-02 5
2021-01-03 23
2021-01-04 18
2021-01-05 7
2021-01-06 3
...
Outliers are defined as any sample with an absolute z-score larger than 3.
df['zscore'] = scipy.stats.zscore(df['value'])
I can identify the timestamps of all outliers with
df[df['zscore'].abs() >= 3].index
which in the above example would return
[2021-01-03, 2021-01-04]
Given this list of indexes, how do I replace the value with the mean of the closest previous and next neighbors such that I get the following output?
df_mod
TimeStamp | value
2021-01-01 1
2021-01-02 5
2021-01-03 6
2021-01-04 6
2021-01-05 7
2021-01-06 3
...
Would appreciate any help on how to realize this type of function / logic.
EDIT
There can exist NaN values in the time series from the beginning, which I do not want to replace with the mean.
You can replace values with NaN by condition using Series.mask, and then fill each masked value with the mean of its nearest valid neighbors by adding the forward-filled and back-filled series and dividing by 2:
df = df.reset_index(drop=True)
orig = df.index
# temporarily drop pre-existing NaNs so they are not filled
df = df.dropna(subset=['value'])
# mask outliers as NaN
df['value'] = df['value'].mask(df['zscore'].abs() >= 3)
# mean of nearest valid previous and next values
df['value'] = df['value'].ffill().add(df['value'].bfill()).div(2)
# restore the dropped rows, so pre-existing NaNs stay NaN
df = df.reindex(orig)
Solution without helper column:
zscore = scipy.stats.zscore(df['value'])
df['value'] = df['value'].mask(zscore.abs() >= 3)
df['value'] = df['value'].ffill().add(df['value'].bfill()).div(2)
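As a quick check on the toy frame from the question (a hedged sketch; the outlier positions are hard-coded to the two timestamps the question flags, since the six-point sample is too small for the |z| >= 3 rule to actually fire):
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 5, 23, 18, 7, 3]},
    index=pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03",
         "2021-01-04", "2021-01-05", "2021-01-06"]
    ),
)

# pretend these two rows were flagged by the z-score rule
outliers = pd.to_datetime(["2021-01-03", "2021-01-04"])

df["value"] = df["value"].mask(df.index.isin(outliers))
df["value"] = df["value"].ffill().add(df["value"].bfill()).div(2)
print(df)
# 2021-01-03 and 2021-01-04 both become (5 + 7) / 2 = 6.0; all other rows are unchanged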
I'm trying to transform my dataframe based on certain conditions. The following is my input dataframe:
In [11]: df
Out[11]:
DocumentNumber I_Date N_Date P_Date Amount
0 1234 2016-01-01 2017-01-01 2017-10-23 38.38
1 2345 2016-01-02 2017-01-02 2018-03-26 41.00
2 1324 2016-01-12 2017-01-03 2018-03-26 30.37
3 5421 2016-01-13 2017-01-02 2018-03-06 269.00
4 5532 2016-01-15 2017-01-04 2018-06-30 271.00
Desired solution:
Each row is a unique document, and my aim is to find the number of documents and their total amount that meet the condition below for each date-and-delta combination.
I am able to get to my desired result via a for-loop, but I know it is not the ideal way and it gets slower as my data increases. Since I am new to Python, I need help getting rid of the loop via a list comprehension or any other faster option.
Code:
d1 = datetime.date(2017, 1, 1)
d2 = datetime.date(2017, 1, 15)
mydates = pd.date_range(d1, d2).tolist()
Delta = pd.Series(range(0,5)).tolist()
df_A = []
for i in mydates:
    for j in Delta:
        A = df[(df["I_Date"] < i) & (df["N_Date"] > i + j) & (df["P_Date"] > i)]
        A["DateCutoff"] = i
        A["Delta"] = j
        A = A.groupby(['DateCutoff','Delta'], as_index=False).agg({'Amount':'sum','DocumentNumber':'count'})
        A.columns = ['DateCutoff','Delta','A_PaymentAmount','A_DocumentNumber']
        df_A.append(A)
df_A = pd.concat(df_A, sort=False)
Output:
In [14]: df_A
Out[14]:
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1
I don't see a way to remove the loop from your code, because the loop is creating individual dataframes based on the contents of mydates and Delta.
In this example you are creating 75 different dataframes.
On each dataframe you .groupby, then .agg the sum of payments and the count of document numbers.
Each dataframe is appended to a list.
Finally, pd.concat combines the complete list into a single dataframe.
One significant improvement
Check the Boolean condition before creating the dataframe and performing the remaining operations. In this example, operations were performed on 69 empty dataframes. By checking the condition first, operations will only be performed on the 6 dataframes containing data.
condition.any() returns True as long as at least one element is True
Minor changes
datetime + int is deprecated, so change that to datetime + timedelta(days=x)
pd.Series(range(0,5)).tolist() is overkill for making a list. Now timedelta objects are needed, so use [timedelta(days=x) for x in range(5)]
Instead of iterating with two for-loops, use itertools.product on mydates and Delta. This creates a generator of tuples in the form (Timestamp('2017-01-01 00:00:00', freq='D'), datetime.timedelta(0))
Use .copy() when creating dataframe A, to prevent SettingWithCopyWarning
Note:
List comprehensions were mentioned in the question. They are just a Pythonic way of writing a for-loop and don't necessarily improve performance.
All of the calculations are using pandas methods, not for-loops. The for-loop only creates the dataframe from the condition.
Updated Code:
from itertools import product
import pandas as pd
from datetime import date, timedelta
d1 = date(2017, 1, 1)
d2 = date(2017, 1, 15)
mydates = pd.date_range(d1, d2)
Delta = [timedelta(days=x) for x in range(5)]
df_list = list()
for t in product(mydates, Delta):
    condition = (df["I_Date"] < t[0]) & (df["N_Date"] > t[0] + t[1]) & (df["P_Date"] > t[0])
    if condition.any():
        A = df[condition].copy()
        A["DateCutoff"] = t[0]
        A["Delta"] = t[1]
        A = A.groupby(['DateCutoff','Delta'], as_index=False).agg({'Amount':'sum','DocumentNumber':'count'})
        A.columns = ['DateCutoff','Delta','A_PaymentAmount','A_DocumentNumber']
        df_list.append(A)
df_CutOff = pd.concat(df_list, sort = False)
Output
The same as the original
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1
Let's assume a dataframe using datetimes as the index, where we have a column named 'score', initially set to 10:
score
2016-01-01 10
2016-01-02 10
2016-01-03 10
2016-01-04 10
2016-01-05 10
2016-01-06 10
2016-01-07 10
2016-01-08 10
I want to subtract a fixed value (let's say 1) from the score, but only when the index is between certain dates (for example between the 3rd and the 6th):
score
2016-01-01 10
2016-01-02 10
2016-01-03 9
2016-01-04 9
2016-01-05 9
2016-01-06 9
2016-01-07 10
2016-01-08 10
Since my real dataframe is big, and I will be doing this for different date ranges with a different fixed value N for each of them, I'd like to achieve this without having to create a new column set to -N for each case.
Something like numpy's where function, but for a certain range, allowing me to add to/subtract from the current value if the condition is met, and do nothing otherwise. Is there something like that?
Use index slicing:
df.loc['2016-01-03':'2016-01-06', 'score'] -= 1
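A minimal, self-contained version of that (the frame construction is only assumed from the example above):
import pandas as pd

idx = pd.date_range("2016-01-01", periods=8, freq="D")
df = pd.DataFrame({"score": 10}, index=idx)

# .loc slicing on a DatetimeIndex is label-based and inclusive of both endpoints
df.loc["2016-01-03":"2016-01-06", "score"] -= 1
print(df)
# 2016-01-03 through 2016-01-06 drop to 9; all other rows stay at 10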
Assuming dates are datetime dtype:
#if date is own column:
df.loc[df['date'].dt.day.between(3,6), 'score'] = df['score'] - 1
#if date is index:
df.loc[df.index.day.isin(range(3,7)), 'score'] = df['score'] - 1
I would do something like this using query:
import pandas as pd
import numpy as np

df = pd.DataFrame({"score": np.random.randint(1, 10, 100)},
                  index=pd.date_range(start="2018-01-01", periods=100))
start = "2018-01-05"
stop = "2018-04-08"
df.query('#start <= index <= #stop ') - 1
Edit: note that eval, which returns a boolean mask, can be used as well, but in a different manner, because pandas' where acts on the False values:
df.where(~df.eval('#start <= index <= #stop '),
df['score'] - 1, axis=0, inplace=True)
Note how I inverted the boolean mask (with ~) in order to get what I wanted. It's efficient but not really clear. Of course, you can also use np.where and all is good in the world.
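A hedged sketch of that np.where variant, reusing the start/stop bounds defined above:
# keep the score unchanged outside [start, stop], subtract 1 inside
df["score"] = np.where((df.index >= start) & (df.index <= stop),
                       df["score"] - 1,
                       df["score"])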
I have a DataFrame that has dates, assets, and then price/volume data. I'm trying to pull in data from 7 days ago, but the issue is that I can't use shift() because my table has missing dates in it.
date cusip price price_7daysago
1/1/2017 a 1
1/1/2017 b 2
1/2/2017 a 1.2
1/2/2017 b 2.3
1/8/2017 a 1.1 1
1/8/2017 b 2.2 2
I've tried creating a lambda function to try to use loc and timedelta to create this shifting, but I was only able to output empty numpy arrays:
def row_delta(x, df, days, colname):
    if datetime.strptime(x['recorddate'], '%Y%m%d') - timedelta(days) in [datetime.strptime(x, '%Y%m%d') for x in df['recorddate'].unique().tolist()]:
        return df.loc[(df['recorddate_date'] == df['recorddate_date'] - timedelta(days)) & (df['cusip'] == x['cusip']), colname]
    else:
        return 'nothing'
I also thought of doing something similar to this in order to fill in missing dates, but my issue is that I have multiple indexes (the dates and the cusips), so I can't just reindex on this.
Merge the DataFrame with itself while adding 7 days to the date column for the right frame. Use the suffixes argument to name the columns appropriately.
import pandas as pd
df['date'] = pd.to_datetime(df.date)
df.merge(df.assign(date = df.date+pd.Timedelta(days=7)),
on=['date', 'cusip'],
how='left', suffixes=['', '_7daysago'])
Output: df
date cusip price price_7daysago
0 2017-01-01 a 1.0 NaN
1 2017-01-01 b 2.0 NaN
2 2017-01-02 a 1.2 NaN
3 2017-01-02 b 2.3 NaN
4 2017-01-08 a 1.1 1.0
5 2017-01-08 b 2.2 2.0
You can set date and cusip as the index and use unstack and shift together:
shifted = df.set_index(["date", "cusip"]).unstack().shift(7).stack()
Then simply merge shifted back onto your original df.
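For completeness, a possible sketch of that merge step (column names are assumed from the question, and the date column is assumed to already be datetime; note that shift(7) shifts by 7 positions of the unstacked date index, as in the line above):
# bring the row-shifted prices back as an extra column on the original frame
shifted = (df.set_index(["date", "cusip"])
             .unstack()
             .shift(7)
             .stack()
             .reset_index())

out = df.merge(shifted, on=["date", "cusip"], how="left",
               suffixes=["", "_7daysago"])
print(out)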
Having this DataFrame:
import pandas
dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)
df
I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).
When I try:
df['foo'][0] = df['foo'][0].resample('D').sum().dropna()
That seems to work well:
However, I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The question is, how should I do this instead?
Notes
Things I have tried but do not work (resampling or not, the assignment raises an exception):
df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']
A bit more information about the data (in case it is relevant):
The real DataFrame has more columns in the multi-index. Not all of them necessarily integers, but more generally numerical and categorical. The index is unique (i.e.: there is only one row with a given index value).
The real DataFrame has, of course, many more rows in it (thousands).
There are not necessarily only two columns in the DataFrame, and there may be more than one column containing a Series type. Columns usually contain series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or series).
The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).
Using Python 3.5.1 and Pandas 0.18.1.
This should work:
df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()
Pandas is complaining about chained indexing, but when you don't do it that way it has trouble assigning a whole Series to a single cell. With iat you can force an assignment like that. I don't think it is the preferable thing to do, but it seems like a working solution.
Simply set df.is_copy = False before assignment of the new value.
Hierarchical data in pandas
It really seems like you should consider restructuring your data to take advantage of pandas features such as MultiIndexing and DatetimeIndex. This will allow you to still operate on an index in the typical way while being able to select on multiple columns across the hierarchical data (a, b, and bar).
Restructured Data
import pandas as pd

# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Place Series in Hierarchical DataFrame
heirIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s, columns=heirIndex)
print(df)
a 1
b 2
bar 8
2016-01-01 00:00:00 0
2016-01-01 01:00:00 1
2016-01-01 02:00:00 2
2016-01-01 03:00:00 3
2016-01-01 04:00:00 4
Resampling
With the data in this format, resampling becomes very simple.
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()
print(df_resampled)
a 1
b 2
bar 8
2016-01-01 10
Update (from data description)
If the data has variable-length Series, each with a different index, and non-numeric categories, that is fine. Let's make an example:
# Define Series
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Define Series
dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)
# Define DataFrames
df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=['a', 'b', 'bar', 'c']))
df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=['a', 'b', 'bar', 'c']))
df = pd.concat([df1, df2])
print(df)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 1.0 NaN
2016-01-01 02:00:00 2.0 NaN
2016-01-01 03:00:00 3.0 NaN
2016-01-01 04:00:00 4.0 NaN
2016-01-14 00:00:00 NaN -200.0
2016-01-14 01:00:00 NaN 10.0
2016-01-14 02:00:00 NaN 24.0
2016-01-14 03:00:00 NaN 30.0
2016-01-14 04:00:00 NaN 40.0
2016-01-14 05:00:00 NaN 100.0
The only issue is that, after resampling, you will want to use how='all' when dropping NaN rows, like this:
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')
print(df_resampled)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 10.0 NaN
2016-01-14 NaN 4.0