Duplicating n rows in a DataFrame using 2nd level index? - python

I have a pandas DataFrame that for instance is looking like this.
df
Values
Timestamp
2020-02-01 A
2020-02-02 B
2020-02-03 C
I would like (to ease processing to be done afterward) to keep a window of n row and duplicate it for each timestamp, and creating a 2nd level index with local int index.
With n=2, this would give:
df_new
Values
Timestamp 2nd_level_index
2020-02-01 0 NaN
1 A
2020-02-02 0 A
1 B
2020-03-03 0 B
1 C
Is there any kind of pandas built-in function that would help me do that?
A rolling window with fixed size (n) seems to be the start, but then how do I duplicate the window and store it for each row using a 2nd level index?
Thanks in advance for any help!
Bests,
EDIT 04/05
Taking propose code, and changing a bit the output format, I adapted it for a 2-column DataFrame.
I ended up with following code.
import pandas as pd
import numpy as np
from random import seed, randint
def transpose_n_rows(df: pd.DataFrame, n_rows: int) -> pd.DataFrame:
array = np.concatenate((np.full((len(df.columns),n_rows-1), np.nan), df.transpose()), axis=1)
shape = array.shape[:-1] + (array.shape[-1] - n_rows + 1, n_rows)
strides = array.strides + (array.strides[-1],)
array = np.lib.stride_tricks.as_strided(array, shape=shape, strides=strides)
midx = pd.MultiIndex.from_product([df.columns, range(n_rows)], names=['Data','Position'])
transposed = pd.DataFrame(np.concatenate(array, axis=1), index=df.index, columns=midx)
return transposed
n = 4
start = '2020-01-01 00:00+00:00'
end = '2020-01-01 12:00+00:00'
pr2h = pd.period_range(start=start, end=end, freq='2h')
seed(1)
values1 = [randint(0,10) for ts in pr2h]
values2 = [randint(20,30) for ts in pr2h]
df2h = pd.DataFrame({'Values1' : values1, 'Values2': values2}, index=pr2h)
df2h_new = transpose_n_rows(df2h, n)
Which gives.
In [29]:df2h
Out[29]:
Values1 Values2
2020-01-01 00:00 2 27
2020-01-01 02:00 9 30
2020-01-01 04:00 1 26
2020-01-01 06:00 4 23
2020-01-01 08:00 1 21
2020-01-01 10:00 7 27
2020-01-01 12:00 7 20
In [30]:df2h_new
Out[30]:
Data Values1 Values2
Position 0 1 2 3 0 1 2 3
2020-01-01 00:00 NaN NaN NaN 2.0 NaN NaN NaN 27.0
2020-01-01 02:00 NaN NaN 2.0 9.0 NaN NaN 27.0 30.0
2020-01-01 04:00 NaN 2.0 9.0 1.0 NaN 27.0 30.0 26.0
2020-01-01 06:00 2.0 9.0 1.0 4.0 27.0 30.0 26.0 23.0
2020-01-01 08:00 9.0 1.0 4.0 1.0 30.0 26.0 23.0 21.0
2020-01-01 10:00 1.0 4.0 1.0 7.0 26.0 23.0 21.0 27.0
2020-01-01 12:00 4.0 1.0 7.0 7.0 23.0 21.0 27.0 20.0
However, I am calling this function transpose_n_rows in a for loop for a significant number of DataFrames. This first use makes me a bit afraid with performance issues.
I could read that one should avoid multiple calls to np.concatenate or pd.concat, and here, I have 2 of them for a use that maybe can be bypassed?
Please, is there any advice to get rid of them if this is possible?
I thank you in advance for any help! Bests,

I think there is not built-in method in pandas.
Possible solution with strides for generate rolling 2d array:
n = 2
#added Nones for first values of 2d array
x = np.concatenate([[None] * (n-1), df['Values']])
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(x, n)
print (a)
[[None 'A']
['A' 'B']
['B' 'C']]
And then create MultiIndex by MultiIndex.from_product and flatten values of array by numpy.ravel:
mux = pd.MultiIndex.from_product([df.index, range(n)], names=('times','level1'))
df = pd.DataFrame({'Values': np.ravel(a)}, index=mux)
print (df)
Values
times level1
2020-02-01 0 None
1 A
2020-02-02 0 A
1 B
2020-02-03 0 B
1 C
If values are numbers add missing values:
print (df)
Values
Timestamp
2020-02-01 1
2020-02-02 2
2020-02-03 3
n = 2
x = np.concatenate([[np.nan] * (n-1), df['Values']])
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(x, n)
print (a)
[[nan 1.]
[ 1. 2.]
[ 2. 3.]]
mux = pd.MultiIndex.from_product([df.index, range(n)], names=('times','level1'))
df = pd.DataFrame({'Values': np.ravel(a)}, index=mux)
print (df)
Values
times level1
2020-02-01 0 NaN
1 1.0
2020-02-02 0 1.0
1 2.0
2020-02-03 0 2.0
1 3.0

Related

Subtract one column from another in pandas - with a condition

I have this code that will subtract, for each person (AAC or AAB), timepoint 1 from time point 2 data.
i.e this is the original data:
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 1 2.0 NaN 4.0
1 4 3 2.0 6.0 NaN
2 4 3 NaN 6.0 NaN
3 4 5 2.0 6.0 NaN
This is the code:
import sys
import numpy as np
from sklearn.metrics import auc
import pandas as pd
from numpy import trapz
#read in file
df = pd.DataFrame([[0,1,2,np.nan,4],[4,3,2,6,np.nan],[4,3,np.nan,6,np.nan],[4,5,2,6,np.nan]],columns=['pep_seq','AAC-T01','AAC-T02','AAB-T01','AAB-T02'])
#standardise the data by taking T0 away from each sample
df2 = df.drop(['pep_seq'],axis=1)
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"]))
df2.insert(0,'pep_seq',df['pep_seq'])
print(df)
print(df2)
This is the output (i.e. df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN NaN
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
...but what I actually wanted was to subtract the T01 columns from all the others EXCEPT for when the T01 value is NaN in which case keep the original value, so the desired output was (see the 4.0 in AAB-T02):
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0 NaN
2 4 0 NaN 0 NaN
3 4 0 -3.0 0 NaN
Could someone show me where I went wrong? Note that in real life, there are ~100 timepoints per person, not just two.
You can fill the nan to 0 when doing subtraction
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"].fillna(0)))
# ^^^^ Changes here
df2.insert(0,'pep_seq',df['pep_seq'])
print(df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
I hope that I understand you correctly but numpy.where() should do it for you.
Have a look here: condition based substraction

Taking first and last value in a rolling window

Initial problem statement
Using pandas, I would like to apply function available for resample() but not for rolling().
This works:
df1 = df.resample(to_freq,
closed='left',
kind='period',
).agg(OrderedDict([('Open', 'first'),
('Close', 'last'),
]))
This doesn't:
df2 = df.rolling(my_indexer).agg(
OrderedDict([('Open', 'first'),
('Close', 'last') ]))
>>> AttributeError: 'first' is not a valid function for 'Rolling' object
df3 = df.rolling(my_indexer).agg(
OrderedDict([
('Close', 'last') ]))
>>> AttributeError: 'last' is not a valid function for 'Rolling' object
What would be your advice to keep first and last value of a rolling windows to be put into two different columns?
EDIT 1 - with usable input data
import pandas as pd
from random import seed
from random import randint
from collections import OrderedDict
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0,10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
# First & last work with resample
resampled_first = df.resample('3H',
closed='left',
kind='period',
).agg(OrderedDict([('Values', 'first')]))
resampled_last = df.resample('3H',
closed='left',
kind='period',
).agg(OrderedDict([('Values', 'last')]))
# They don't with rolling
rolling_first = df.rolling(3).agg(OrderedDict([('Values', 'first')]))
rolling_first = df.rolling(3).agg(OrderedDict([('Values', 'last')]))
Thanks for your help!
Bests,
You can use own function to get first or last element in rolling window
rolling_first = df.rolling(3).agg(lambda rows: rows[0])
rolling_last = df.rolling(3).agg(lambda rows: rows[-1])
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
df['last'] = df['Values'].rolling(3).agg(lambda rows: rows[-1])
print(df)
Result
Values first last
2020-01-01 00:00:00+00:00 2 NaN NaN
2020-01-01 01:00:00+00:00 9 NaN NaN
2020-01-01 02:00:00+00:00 1 2.0 1.0
2020-01-01 03:00:00+00:00 4 9.0 4.0
2020-01-01 04:00:00+00:00 1 1.0 1.0
2020-01-01 05:00:00+00:00 7 4.0 7.0
2020-01-01 06:00:00+00:00 7 1.0 7.0
2020-01-01 07:00:00+00:00 7 7.0 7.0
2020-01-01 08:00:00+00:00 10 7.0 10.0
2020-01-01 09:00:00+00:00 6 7.0 6.0
2020-01-01 10:00:00+00:00 3 10.0 3.0
2020-01-01 11:00:00+00:00 1 6.0 1.0
2020-01-01 12:00:00+00:00 7 3.0 7.0
2020-01-01 13:00:00+00:00 0 1.0 0.0
2020-01-01 14:00:00+00:00 6 7.0 6.0
2020-01-01 15:00:00+00:00 6 0.0 6.0
2020-01-01 16:00:00+00:00 9 6.0 9.0
2020-01-01 17:00:00+00:00 0 6.0 0.0
2020-01-01 18:00:00+00:00 7 9.0 7.0
2020-01-01 19:00:00+00:00 4 0.0 4.0
2020-01-01 20:00:00+00:00 3 7.0 3.0
2020-01-01 21:00:00+00:00 9 4.0 9.0
2020-01-01 22:00:00+00:00 1 3.0 1.0
2020-01-01 23:00:00+00:00 5 9.0 5.0
2020-01-02 00:00:00+00:00 0 1.0 0.0
EDIT:
Using dictionary you have to put directly lambda, not string
result = df['Values'].rolling(3).agg({'first': lambda rows: rows[0], 'last': lambda rows: rows[-1]})
print(result)
The same with own function - you have to put its name, not string with name
def first(rows):
return rows[0]
def last(rows):
return rows[-1]
result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
result = df['Values'].rolling(3).agg({'first': lambda rows: rows[0], 'last': lambda rows: rows[-1]})
print(result)
def first(rows):
return rows[0]
def mylast(rows):
return rows[-1]
result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)
In case anyone else needs to find the difference between the first and last value in a 'rolling-window'. I used this on stock market data and wanted to know the price difference from the beginning to the end of the 'window' so I created a new column which used the current row 'close' value and the 'open' value using .shift() so it is taking the "open" value from 60 rows above.
df[windowColumn] = df["close"] - (df["open"].shift(60))
I think it's a very quick method for large datasets.

Date subraction not working with date range function

I am trying subtract two datetimes when there valid there is valid value for T1,T2 obtain the difference. The difference is caluclated by considering only weekdays between the dates not considering saturday and sunday.
Code works for only some rows. How can this be fixed.
T1 T2 Diff
0 2017-12-04 05:48:15 2018-01-05 12:15:22 NaN
1 2017-07-10 08:23:11 2018-01-05 15:28:22 NaN
2 2017-09-11 05:10:37 2018-01-29 15:02:07 NaN
3 2017-12-21 04:51:12 2018-01-29 16:06:43 NaN
4 2017-10-13 10:11:00 2018-02-22 16:19:04 NaN
5 2017-09-28 21:44:31 2018-01-29 12:42:02 NaN
6 2018-01-23 20:00:58 2018-01-29 14:40:33 NaN
7 2017-11-28 15:39:38 2018-01-31 11:57:04 NaN
8 2017-12-21 12:44:00 2018-01-31 13:12:37 30.0
9 2017-11-09 05:52:29 2018-01-22 11:42:01 53.0
10 2018-02-12 04:21:08 NaT NaN
df[['T1','T2','diff']].dtypes
T1 datetime64[ns]
T2 datetime64[ns]
diff float64
df['T1'] = pd.to_datetime(df['T1'])
df['T2'] = pd.to_datetime(df['t2'])
def fun(row):
if row.isnull().any():
return np.nan
ts = pd.DataFrame(pd.date_range(row["T1"],row["T2"]), columns=["date"])
ts["dow"] = ts["date"].dt.weekday
return (ts["dow"]<5).sum()
df["diff"] = df.apply(lambda x: fun(x), axis=1)
Instead of trying to check for a null value in the row, use a try/except to capture the error when it does the calculation with a null value.
This worked for me in, I think, the manner you want.
import pandas as pd
import numpy as np
df = pd.read_csv("/home/rightmire/Downloads/test.csv", sep=",")
# df = df[["m1","m2"]]
print(df)
# print(df[['m1','m2']].dtypes)
df['m1'] = pd.to_datetime(df['m1'])
df['m2'] = pd.to_datetime(df['m2'])
print(df[['m1','m2']].dtypes)
#for index, row in df.iterrows():
def fun(row):
try:
ts = pd.DataFrame(pd.date_range(row["m1"],row["m2"]), columns=["date"])
# print(ts)
ts["dow"] = ts["date"].dt.weekday
result = (ts["dow"]<5).sum()
# print("Result = ", result)
return result
except Exception as e:
# print("ERROR:{}".format(str(e)))
result = np.nan
# print("Result = ", result)
return result
df["diff"] = df.apply(lambda x: fun(x), axis=1)
print(df["diff"])
OUTPUT OF INTEREST:
dtype: object
0 275.0
1 147.0
2 58.0
3 28.0
4 95.0
5 87.0
6 4.0
7 46.0
8 30.0
9 96.0
10 NaN
11 27.0
12 170.0
13 158.0
14 79.0
Name: diff, dtype: float64

How to calculate inverse cumsum in pandas

I am trying to find a way to calculate an inverse cumsum for pandas. This means applying cumsum but from bottom to top. The problem I'm facing is, I'm trying to find the number of workable day for each month for Spain both from top to bottom (1st workable day = 1, 2nd = 2, 3rd = 3, etc...) and bottom to top (last workable day = 1, day before last = 2, etc...).
So far I managed to get the top to bottom order to work but can't get the inverse order to work, I've searched a lot and couldn't find a way to perform an inverse cummulative sum:
import pandas as pd
from datetime import date
from workalendar.europe import Spain
import numpy as np
cal = Spain()
#print(cal.holidays(2019))
rng = pd.date_range('2019-01-01', periods=365, freq='D')
df = pd.DataFrame({ 'Date': rng})
df['flag_workable'] = df['Date'].apply(lambda x: cal.is_working_day(x))
df_workable = df[df['flag_workable'] == True]
df_workable['month'] = df_workable['Date'].dt.month
df_workable['workable_day'] = df_workable.groupby('month')['flag_workable'].cumsum()
print(df)
print(df_workable.head(30))
Output for January:
Date flag_workable month workable_day
1 2019-01-02 True 1 1.0
2 2019-01-03 True 1 2.0
3 2019-01-04 True 1 3.0
6 2019-01-07 True 1 4.0
7 2019-01-08 True 1 5.0
Example for last days of January:
Date flag_workable month workable_day
24 2019-01-25 True 1 18.0
27 2019-01-28 True 1 19.0
28 2019-01-29 True 1 20.0
29 2019-01-30 True 1 21.0
30 2019-01-31 True 1 22.0
This would be the expected output after applying the inverse cummulative:
Date flag_workable month workable_day inv_workable_day
1 2019-01-02 True 1 1.0 22.0
2 2019-01-03 True 1 2.0 21.0
3 2019-01-04 True 1 3.0 20.0
6 2019-01-07 True 1 4.0 19.0
7 2019-01-08 True 1 5.0 18.0
Last days of January:
Date flag_workable month workable_day inv_workable_day
24 2019-01-25 True 1 18.0 5.0
27 2019-01-28 True 1 19.0 4.0
28 2019-01-29 True 1 20.0 3.0
29 2019-01-30 True 1 21.0 2.0
30 2019-01-31 True 1 22.0 1.0
Invert the row order of the DataFrame prior to grouping so that the cumsum is calculated in reverse order within each month.
df['inv_workable_day'] = df[::-1].groupby('month')['flag_workable'].cumsum()
df['workable_day'] = df.groupby('month')['flag_workable'].cumsum()
# Date flag_workable month inv_workable_day workable_day
#1 2019-01-02 True 1 5.0 1.0
#2 2019-01-03 True 1 4.0 2.0
#3 2019-01-04 True 1 3.0 3.0
#6 2019-01-07 True 1 2.0 4.0
#7 2019-01-08 True 1 1.0 5.0
#8 2019-02-01 True 2 1.0 1.0
Solution
Whichever column you want to apply cumsum to you have two options:
Order descending a copy of that column by index, followed by cumsum and then order ascending by index. Finally assign it back to the data frame column.
Use numpy:
import numpy as np
array = df.column_data.to_numpy()
array = np.flip(array) # to flip the order
array = np.cumsum(array)
array = np.flip(array) # to flip back to original order
df.column_data_cumsum = array

Pandas' diff but user defined function

Pandas' pandas.DataFrame.diff almost does what I'm attempting to do.
From the documentation
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
... 'b': [1, 1, 2, 3, 5, 8],
... 'c': [1, 4, 9, 16, 25, 36]})
>>> df
a b c
0 1 1 1
1 2 1 4
2 3 2 9
3 4 3 16
4 5 5 25
5 6 8 36
df.diff(axis=0) and df.diff(axis=1) respectively produces
>>> df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
>>> df.diff(axis=1)
a b c
0 NaN 0.0 0.0
1 NaN -1.0 3.0
2 NaN -1.0 7.0
3 NaN -1.0 13.0
4 NaN 0.0 20.0
5 NaN 2.0 28.0
What df.diff is doing is essentially applying this function
def diff_func(columns):
return columns[1:] - columns[0:-1]
I want to define my own function, which replace the diff_func.
What I want to is "apply" my own function (possibly nonlinear) to the consecutive (periods=1) columns/rows. For example, func(x,y) = sin(x)*cos(y) where x,y are consecutive columns or rows of periods=n
The you should consider shift
df-df.shift(1)
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
Here's a way that works for me using pandas DataFrame built-in method 'apply'.
I wrapped the custom diff function, product, which takes two arguments and returns one value, in diff_custom which takes a vector argument and returns an equal length vector, appropriately NaN-padded. We execute diff_custom using the the pandas DataFrame's built-in apply method:
import pandas
import numpy
df = pandas.DataFrame([[4, 9],[5,10],[22,44]], columns=['A', 'B'])
# Function that acts on neighboring row or column values, val1 and val2
def product(val1,val2):
return(val1*val2)
# Function that acts on DataFrame row or column, x
def diff_custom(x):
vals = [product(x[n],x[n+1]) for n in range(x.shape[0]-1)]
ret = list([numpy.nan]) # pad return vector however you need to
ret = ret + vals
return(ret)
# Use DataFrame built-in 'apply' method
df.apply(diff_custom,axis=1)
A B
0 NaN 36.0
1 NaN 50.0
2 NaN 968.0
df.apply(diff_custom,axis=0)
A B
0 NaN NaN
1 20.0 90.0
2 110.0 440.0

Categories

Resources