Get the Minimum and Maximum value within specific date range in DataFrame - python

I have a DataFrame that has the columns 'From' (datetime), 'To' (datetime). There are some overlapping in the ranges of different rows of the table.
Here is the simplified version of criteria dataframe (the date range is vary and overlapping with each other):
df1= pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31',freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04',freq='2D')})
From To
0 2020-01-01 2020-01-05
1 2020-01-03 2020-01-07
2 2020-01-05 2020-01-09
3 2020-01-07 2020-01-11
4 2020-01-09 2020-01-13
5 2020-01-11 2020-01-15
6 2020-01-13 2020-01-17
7 2020-01-15 2020-01-19
8 2020-01-17 2020-01-21
9 2020-01-19 2020-01-23
10 2020-01-21 2020-01-25
11 2020-01-23 2020-01-27
12 2020-01-25 2020-01-29
13 2020-01-27 2020-01-31
14 2020-01-29 2020-02-02
15 2020-01-31 2020-02-04
And I have a dataframe which keep the daily high and low value like this
random.seed(0)
df2= pd.DataFrame({'Date': pd.date_range(start='2020-01-01', end='2020-01-31'), 'High': [random.randint(7,15)+5 for i in range(31)], 'Low': [random.randint(0,7)-1 for i in range(31)]})
Date High Low
0 2020-01-01 18 6
1 2020-01-02 18 6
2 2020-01-03 12 3
3 2020-01-04 16 -1
4 2020-01-05 20 -1
5 2020-01-06 19 0
6 2020-01-07 18 5
7 2020-01-08 16 -1
8 2020-01-09 19 6
9 2020-01-10 17 4
10 2020-01-11 15 2
11 2020-01-12 20 4
12 2020-01-13 14 0
13 2020-01-14 16 2
14 2020-01-15 14 2
15 2020-01-16 13 2
16 2020-01-17 16 1
17 2020-01-18 20 6
18 2020-01-19 14 0
19 2020-01-20 16 0
20 2020-01-21 13 4
21 2020-01-22 13 6
22 2020-01-23 17 0
23 2020-01-24 19 3
24 2020-01-25 20 3
25 2020-01-26 13 0
26 2020-01-27 17 4
27 2020-01-28 18 2
28 2020-01-29 17 3
29 2020-01-30 15 6
30 2020-01-31 20 0
Then I hope to get the maximum and minimum value based on the From Date and To Date in df1, Here is the expected result:
result = pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31',freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04',freq='2D'), 'High':[20,20,20,19,20,20,16,20,20,17,20,20,20,20,20,20], 'Low':[-1,-1,-1,-1,0,0,1,0,0,0,0,0,0,0,0,0]})
From To High Low
0 2020-01-01 2020-01-05 20 -1
1 2020-01-03 2020-01-07 20 -1
2 2020-01-05 2020-01-09 20 -1
3 2020-01-07 2020-01-11 19 -1
4 2020-01-09 2020-01-13 20 0
5 2020-01-11 2020-01-15 20 0
6 2020-01-13 2020-01-17 16 1
7 2020-01-15 2020-01-19 20 0
8 2020-01-17 2020-01-21 20 0
9 2020-01-19 2020-01-23 17 0
10 2020-01-21 2020-01-25 20 0
11 2020-01-23 2020-01-27 20 0
12 2020-01-25 2020-01-29 20 0
13 2020-01-27 2020-01-31 20 0
14 2020-01-29 2020-02-02 20 0
15 2020-01-31 2020-02-04 20 0
I have tried to use resampling method, but it seems not support custom date range. I'm looking for a reasonably efficient and elegant way of doing this. Thank you very much.

With the size of the data, I think you should consider another approach, the idea is to vectorize by chunk over df1 the comparison between dates with df2. It is lot more lines than other solutions, but it will be way faster for large dataframes.
# this is a parameter you can play with,
# but if your df1 is in memory, this value should work
nb_split = int((len(df1)*len(df2))//4e6)+1
# work with arrays of flaot
arr1 = df1[['From','To']].astype('int64').to_numpy().astype(float)
arr2 = df2.astype('int64').to_numpy().astype(float)
# create result array
arr_out = np.zeros((len(arr1), 2), dtype=float)
i = 0 #index position
for arr1_sp in np.array_split(arr1, nb_split, axis=0):
# get length of the chunk
lft = len(arr1_sp)
# get the min datetime in From and max in To
min_from = arr1_sp[:, 0].min()
max_to = arr1_sp[:, 1].max()
# select the rows of arr2 tht are within the min and max date of the split
arr2_sp = arr2[(arr2[:,0]>=min_from)&(arr2[:,0]<=max_to), :]
# create an bool arraywith True when the date in arr2_sp is above from and below to
# each row is the reuslt for each row of arr1_sp
m = np.less_equal.outer(arr1_sp[:,0], arr2_sp[:, 0])\
&np.greater_equal.outer(arr1_sp[:,1], arr2_sp[:, 0])
# use this mask to get the values high and low within the range row-wise
# and replace where the mask was False by np.nan
arr_high = arr2_sp[:,1]*m
arr_high[~m] = np.nan
arr_low = arr2_sp[:,2]*m
arr_low[~m] = np.nan
# put the result in the result array
arr_out[i:i+lft, 0] = np.nanmax(arr_high, axis=1)
arr_out[i:i+lft, 1] = np.nanmin(arr_low, axis=1)
i += lft #update first idx position for next loop
# create the columns in df1
df1['High'] = arr_out[:, 0]
df1['Low'] = arr_out[:, 1]
I tried with df1 with 10000 rows and df2 5000 rows, and this method is about 102ms while the method with apply getHighLow2is about 8s, so 80 time faster this way. Adn the results where the same.

Here is a function which does this:
Checks the dates which are in the from/to interval
Gets the maximum and minimum values of the High and Low columns respectively
def get_high_low(d1):
high = df2.loc[df2["Date"].isin(pd.date_range(d1["From"], d1["To"])), "High"].max()
low = df2.loc[df2["Date"].isin(pd.date_range(d1["From"], d1["To"])), "Low"].max()
return pd.Series([high, low], index=["High", "Low"])
Then we can just apply this function and concatenate the result with the dates.
pd.concat([df1, df1.apply(get_high_low, axis=1)], axis=1)
The result
From To High Low
0 2020-01-01 2020-01-05 19 4
1 2020-01-03 2020-01-07 17 5
2 2020-01-05 2020-01-09 19 5
3 2020-01-07 2020-01-11 19 2
4 2020-01-09 2020-01-13 17 4
5 2020-01-11 2020-01-15 19 4
6 2020-01-13 2020-01-17 19 5
7 2020-01-15 2020-01-19 18 5
8 2020-01-17 2020-01-21 18 0
9 2020-01-19 2020-01-23 19 3
10 2020-01-21 2020-01-25 19 5
11 2020-01-23 2020-01-27 19 5
12 2020-01-25 2020-01-29 17 5
13 2020-01-27 2020-01-31 17 3
14 2020-01-29 2020-02-02 17 1
15 2020-01-31 2020-02-04 13 -1

I would do a cross merge and query, then groupby:
(df1.assign(dummy=1)
.merge(df2.assign(dummy=1), on='dummy') # this is cross merge
.drop('dummy', axis=1) # remove the `dummy` column
.query('From<=Date<=To') # only choose valid data
.groupby(['From','To']) # groupby `From` and `To`
.agg({'High':'max','Low':'min'}) # aggregation
.reset_index()
)
Output:
From To High Low
0 2020-01-01 2020-01-05 20 -1
1 2020-01-03 2020-01-07 20 -1
2 2020-01-05 2020-01-09 20 -1
3 2020-01-07 2020-01-11 19 -1
4 2020-01-09 2020-01-13 20 0
5 2020-01-11 2020-01-15 20 0
6 2020-01-13 2020-01-17 16 0
7 2020-01-15 2020-01-19 20 0
8 2020-01-17 2020-01-21 20 0
9 2020-01-19 2020-01-23 17 0
10 2020-01-21 2020-01-25 20 0
11 2020-01-23 2020-01-27 20 0
12 2020-01-25 2020-01-29 20 0
13 2020-01-27 2020-01-31 20 0
14 2020-01-29 2020-02-02 20 0
15 2020-01-31 2020-02-04 20 0

You can create a simple function that gets the min and max within a given date renge. Than use the apply function to add the columns.
def MaxMin(row):
dfRange = df2[(df2['Date']>=row['From'])&(df2['Date']<=row['To'])] # df2 rows within a given date range
row['High'] = dfRange['High'].max()
row['Low'] = dfRange['Low'].min()
return row
df1 = df1.apply(MaxMin, axis =1)

Define the following function:
def getHighLow(row):
wrk = df2[df2.Date.between(row.From, row.To)]
return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])
Then run:
df1.join(df1.apply(getHighLow, axis=1))
According to the DRY rule, it is better to find wrk (a set of rows between
given dates) once and then (form wrk) extract maximal High and
minimal Low.
Another advantage over the other solution: My code runs quicker by about
30 % (at least on my computer, measurements performed using %timeit).
Edit
Yet quicker solution is when the search in df2 can be performed by index
instead of "from regular column".
As a preparatory step run:
df2a = df2.set_index('Date')
Then define another variant of getHighLow function:
def getHighLow2(row):
wrk = df2a.loc[row.From : row.To]
return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])
To get the result, run:
df1.join(df1.apply(getHighLow2, axis=1))
For your data, the execution time is about a half of the other solution
(not including the time to create df2a, but it can be created just in this form (with Date as the index)).

Related

Pandas: Accumulated Shares Holdings per Ticker per Day from a List of Trades

I have a pd.DataFrame (pandas.core.frame.DataFrame) with some stock trades.
data = {'Date': ['2021-01-15', '2021-01-21', '2021-02-28', '2021-01-30', '2021-02-16', '2021-03-22', '2021-01-08', '2021-03-02', '2021-02-25', '2021-04-04', '2021-03-15', '2021-04-08'], 'Ticker': ['MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL'], 'Quantity': [2,3,7,2,6,4,-3,8,-2,9,11,1]}
df = pd.DataFrame(data)
Date Ticker Quantity
0 2021-01-15 MFST 2
1 2021-01-21 AMZN 3
2 2021-02-28 GOOG 7
3 2021-01-30 AAPL 2
4 2021-02-16 MFST 6
5 2021-03-22 AMZN 4
6 2021-01-08 GOOG -3
7 2021-03-02 AAPL 8
8 2021-02-25 MFST -2
9 2021-04-04 AMZN 9
10 2021-03-15 GOOG 11
11 2021-04-08 AAPL 1
Quantity refers to the number of shares bought.
I am looking for an efficient way to create a new df which contains the number of shares for each Ticker per day.
The first trade was on 2021-01-08 and the last on 2021-04-08. I want a new dataframe that contains all days between those to dates as rows and the tickers as columns. Values shall be the number of shares I hold at a specific day. Hence, if I buy 4 shares of a stock at 2021-03-15 (assuming no further buying or selling) I will have them from 2021-03-15 till 2021-04-08 which should be represented as a 4 in every row for this specific ticker. If I decide to buy more shares this number will change on that day and all following days.
Could be something like this:
Date MFST AMZN GOOG APPL
2021-01-08 2 3 1 0
2021-01-09 2 3 1 0
2021-01-10 2 3 1 0
...
2021-04-08 2 3 1 7
My first guess was to create an empty DataFrame and then iterate with two for loops over all its Dates and Tickers. However, I think that is not the most efficient way. I am thankful for any recommendation!
You can use df.pivot() to transform your data into a tabular form, as shown on the expected output layout, as follows:
df.pivot(index='Date', columns='Ticker', values='Quantity').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
If you need to aggregate Quantity for same date for each stock, you can use df.pivot_table() with parameter aggfunc='sum', as follows:
df.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
Result:
Date AAPL AMZN GOOG MFST
0 2021-01-21 0 3 0 0
1 2021-02-28 0 0 1 0
2 2021-03-15 0 0 0 2
3 2021-04-30 7 0 0 0
Additional Test Case:
To showcase the aggregation function of df.pivot_table(), I have added some data as follows:
data = {'Date': ['2021-03-15',
'2021-01-21',
'2021-01-21',
'2021-02-28',
'2021-02-28',
'2021-04-30',
'2021-04-30'],
'Ticker': ['MFST', 'AMZN', 'AMZN', 'GOOG', 'GOOG', 'AAPL', 'AAPL'],
'Quantity': [2, 3, 4, 1, 2, 7, 2]}
df = pd.DataFrame(data)
Date Ticker Quantity
0 2021-03-15 MFST 2
1 2021-01-21 AMZN 3
2 2021-01-21 AMZN 4
3 2021-02-28 GOOG 1
4 2021-02-28 GOOG 2
5 2021-04-30 AAPL 7
6 2021-04-30 AAPL 2
df.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
Date AAPL AMZN GOOG MFST
0 2021-01-21 0 7 0 0
1 2021-02-28 0 0 3 0
2 2021-03-15 0 0 0 2
3 2021-04-30 9 0 0 0
Edit
Based on latest requirement:
The first trade was on 2021-03-15 and the last on 2021-04-30. I want a
new dataframe that contains all days between those to dates as rows
and the tickers as columns. Values shall be the number of shares I
hold at a specific day. Hence, if I buy 4 shares of a stock at
2021-03-15 (assuming no further buying or selling) I will have them
from 2021-03-15 till 2021-04-30 which should be represented as a 4 in
every row for this specific ticker. If I decide to buy more shares
this number will change on that day and all following days.
Here is the enhanced solution:
data = {'Date': ['2021-01-15', '2021-01-21', '2021-02-28', '2021-01-30', '2021-02-16', '2021-03-22', '2021-01-08', '2021-03-02', '2021-02-25', '2021-04-04', '2021-03-15', '2021-04-08'], 'Ticker': ['MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL'], 'Quantity': [2,3,7,2,6,4,-3,8,-2,9,11,1]}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df1 = df.set_index('Date').asfreq('D')
df1['Ticker'] = df1['Ticker'].ffill().bfill()
df1['Quantity'] = df1['Quantity'].fillna(0)
df2 = df1.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
df3 = df2[['Date']].join(df2.iloc[:,1:].cumsum())
Result:
print(df3)
Date AAPL AMZN GOOG MFST
0 2021-01-08 0 0 -3 0
1 2021-01-09 0 0 -3 0
2 2021-01-10 0 0 -3 0
3 2021-01-11 0 0 -3 0
4 2021-01-12 0 0 -3 0
5 2021-01-13 0 0 -3 0
6 2021-01-14 0 0 -3 0
7 2021-01-15 0 0 -3 2
8 2021-01-16 0 0 -3 2
9 2021-01-17 0 0 -3 2
10 2021-01-18 0 0 -3 2
11 2021-01-19 0 0 -3 2
12 2021-01-20 0 0 -3 2
13 2021-01-21 0 3 -3 2
14 2021-01-22 0 3 -3 2
15 2021-01-23 0 3 -3 2
16 2021-01-24 0 3 -3 2
17 2021-01-25 0 3 -3 2
18 2021-01-26 0 3 -3 2
19 2021-01-27 0 3 -3 2
20 2021-01-28 0 3 -3 2
21 2021-01-29 0 3 -3 2
22 2021-01-30 2 3 -3 2
23 2021-01-31 2 3 -3 2
24 2021-02-01 2 3 -3 2
25 2021-02-02 2 3 -3 2
26 2021-02-03 2 3 -3 2
27 2021-02-04 2 3 -3 2
28 2021-02-05 2 3 -3 2
29 2021-02-06 2 3 -3 2
30 2021-02-07 2 3 -3 2
31 2021-02-08 2 3 -3 2
32 2021-02-09 2 3 -3 2
33 2021-02-10 2 3 -3 2
34 2021-02-11 2 3 -3 2
35 2021-02-12 2 3 -3 2
36 2021-02-13 2 3 -3 2
37 2021-02-14 2 3 -3 2
38 2021-02-15 2 3 -3 2
39 2021-02-16 2 3 -3 8
40 2021-02-17 2 3 -3 8
41 2021-02-18 2 3 -3 8
42 2021-02-19 2 3 -3 8
43 2021-02-20 2 3 -3 8
44 2021-02-21 2 3 -3 8
45 2021-02-22 2 3 -3 8
46 2021-02-23 2 3 -3 8
47 2021-02-24 2 3 -3 8
48 2021-02-25 2 3 -3 6
49 2021-02-26 2 3 -3 6
50 2021-02-27 2 3 -3 6
51 2021-02-28 2 3 4 6
52 2021-03-01 2 3 4 6
53 2021-03-02 10 3 4 6
54 2021-03-03 10 3 4 6
55 2021-03-04 10 3 4 6
56 2021-03-05 10 3 4 6
57 2021-03-06 10 3 4 6
58 2021-03-07 10 3 4 6
59 2021-03-08 10 3 4 6
60 2021-03-09 10 3 4 6
61 2021-03-10 10 3 4 6
62 2021-03-11 10 3 4 6
63 2021-03-12 10 3 4 6
64 2021-03-13 10 3 4 6
65 2021-03-14 10 3 4 6
66 2021-03-15 10 3 15 6
67 2021-03-16 10 3 15 6
68 2021-03-17 10 3 15 6
69 2021-03-18 10 3 15 6
70 2021-03-19 10 3 15 6
71 2021-03-20 10 3 15 6
72 2021-03-21 10 3 15 6
73 2021-03-22 10 7 15 6
74 2021-03-23 10 7 15 6
75 2021-03-24 10 7 15 6
76 2021-03-25 10 7 15 6
77 2021-03-26 10 7 15 6
78 2021-03-27 10 7 15 6
79 2021-03-28 10 7 15 6
80 2021-03-29 10 7 15 6
81 2021-03-30 10 7 15 6
82 2021-03-31 10 7 15 6
83 2021-04-01 10 7 15 6
84 2021-04-02 10 7 15 6
85 2021-04-03 10 7 15 6
86 2021-04-04 10 16 15 6
87 2021-04-05 10 16 15 6
88 2021-04-06 10 16 15 6
89 2021-04-07 10 16 15 6
90 2021-04-08 11 16 15 6
Use df.groupby
df.groupby(['Date']).agg('sum')

Python Pandas interpolation: redistribute value forwards over missing date range

I have time trend data on facility traffic (admissions to and releases from a facility over time), with gaps. Because of the structure of this data, when a gap appears, the "releases" one day prior to the gap are artificially high (accounting for all unseen individuals released over the period of the gap), and the "admissions" one day after the gap are artificially high (for the same reason: any individual who was admitted during the gap and remains in the facility will appear as an "admission" on this date).
Here is a sample Pandas series involving such a data gap (with zeroes implying missing data on 2020-01-04 through 2020-01-07):
date(index) releases admissions
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 50 14
2020-01-04 0 0
2020-01-05 0 0
2020-01-06 0 0
2020-01-07 0 0
2020-01-08 8 100
2020-01-09 11 19
2020-01-10 9 17
A visualization of this (ignore the separate linear interpolation over the missing total population) looks like the following:
I want to smooth this data, but I'm not sure what interpolation method to use. What I want to accomplish is redistribution forwards of the "releases" on date gap(0)-1 and redistribution backwards of "admissions" on date gap(n)+1. For instance, if a gap is 4 days long and on day gap(n)+1 there are 100 admissions, I want to redistribute such that, on each day of the gap, there are 20 admissions, and on day gap(n)+1 admissions are revised to show 20.
Using the above example series, redistribution would look like the following:
date(index) releases admissions
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 10 14
2020-01-04 10 20
2020-01-05 10 20
2020-01-06 10 20
2020-01-07 10 20
2020-01-08 8 20
2020-01-09 11 19
2020-01-10 9 17
You can create groups with consecutive zeros + one value before for releases and one value after for admissions, and then use transform('mean') to calculate average for each group:
# releases
df['releases'] = df.groupby(
df['releases'].replace(0, np.nan).notna().cumsum()
)['releases'].transform('mean')
# admissions
df['admissions'] = df.groupby(
df['admissions'].replace(0, np.nan).notna().iloc[::-1].cumsum().iloc[::-1]
)['admissions'].transform('mean')
Output:
releases admissions
date
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 10 14
2020-01-04 10 20
2020-01-05 10 20
2020-01-06 10 20
2020-01-07 10 20
2020-01-08 8 20
2020-01-09 11 19
2020-01-10 9 17
Update: For keeping the existing NA values:
# releases
df['releases_i'] = df.groupby(
df['releases'].ne(0).cumsum()
)['releases'].transform('mean')
# admissions
df['admissions_i'] = df.groupby(
df['admissions'].ne(0).iloc[::-1].cumsum().iloc[::-1]
)['admissions'].transform('mean')

How to divide a pandas dataframe into several dataframes by month and year

I have a dataframe with different columns (like price, id, product and date) and I need to divide this dataframe into several dataframes based on the current date of the system (current_date = np.datetime64(date.today())).
For example, if today is 2020-02-07 I want to divide my main dataframe into three different ones where df1 would be the data of the last month (data of 2020-01-07 to 2020-02-07), df2 would be the data of the last three months (excluding the month already in df1 so it would be more accurate to say from 2019-10-07 to 2020-01-07) and df3 would be the data left on the original dataframe.
Is there some easy way to do this? Also, I've been trying to use Grouper but I keep getting this error over an over again: NameError: name 'Grouper' is not defined (my Pandas version is 0.24.2)
You can use offsets.DateOffset for last 1mont and 3month datetimes, filter by boolean indexing:
rng = pd.date_range('2019-10-10', periods=20, freq='5d')
df = pd.DataFrame({'date': rng, 'id': range(20)})
print (df)
date id
0 2019-10-10 0
1 2019-10-15 1
2 2019-10-20 2
3 2019-10-25 3
4 2019-10-30 4
5 2019-11-04 5
6 2019-11-09 6
7 2019-11-14 7
8 2019-11-19 8
9 2019-11-24 9
10 2019-11-29 10
11 2019-12-04 11
12 2019-12-09 12
13 2019-12-14 13
14 2019-12-19 14
15 2019-12-24 15
16 2019-12-29 16
17 2020-01-03 17
18 2020-01-08 18
19 2020-01-13 19
current_date = pd.to_datetime('now').floor('d')
print (current_date)
2020-02-07 00:00:00
last1m = current_date - pd.DateOffset(months=1)
last3m = current_date - pd.DateOffset(months=3)
m1 = (df['date'] > last1m) & (df['date'] <= current_date)
m2 = (df['date'] > last3m) & (df['date'] <= last1m)
#filter non match m1 or m2 masks
m3 = ~(m1 | m2)
df1 = df[m1]
df2 = df[m2]
df3 = df[m3]
print (df1)
date id
18 2020-01-08 18
19 2020-01-13 19
print (df2)
date id
6 2019-11-09 6
7 2019-11-14 7
8 2019-11-19 8
9 2019-11-24 9
10 2019-11-29 10
11 2019-12-04 11
12 2019-12-09 12
13 2019-12-14 13
14 2019-12-19 14
15 2019-12-24 15
16 2019-12-29 16
17 2020-01-03 17
print (df3)
date id
0 2019-10-10 0
1 2019-10-15 1
2 2019-10-20 2
3 2019-10-25 3
4 2019-10-30 4
5 2019-11-04 5

pandas multiple date ranges from column of dates

Current df:
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018
.. ..
I have the df with ID and Date. ID is unique in the original df.
I would like to create a new df based on date. Each ID has a Max Date, I would like to use that date and go back 4 days(5 rows each ID)
There are thousands of IDs.
Expect to get:
ID Date
11 3/15/2018
11 3/16/2018
11 3/17/2018
11 3/18/2018
11 3/19/2018
22 1/1/2018
22 1/2/2018
22 1/3/2018
22 1/4/2018
22 1/5/2018
33 2/8/2018
33 2/9/2018
33 2/10/2018
33 2/11/2018
33 2/12/2018
… …
I tried the following method, i think use date_range might be right direction, but I keep get error.
pd.date_range
def date_list(row):
list = pd.date_range(row["Date"], periods=5)
return list
df["Date_list"] = df.apply(date_list, axis = "columns")
Here is another by using df.assign to overwrite date and pd.concat to glue the range together. cᴏʟᴅsᴘᴇᴇᴅ's solution wins in performance but I think this might be a nice addition as it is quite easy to read and understand.
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
Alternative:
dates = (pd.date_range(*x) for x in zip(df['Date']-pd.Timedelta(days=4), df['Date']))
df = (pd.DataFrame(dict(zip(df['ID'],dates)))
.T
.stack()
.reset_index(0)
.rename(columns={'level_0': 'ID', 0: 'Date'}))
Full example:
import pandas as pd
data = '''\
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018'''
# Recreate dataframe
df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')
df['Date']= pd.to_datetime(df.Date)
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
df.sort_values(by=['ID','Date'], ascending = [True,True], inplace=True)
print(df)
Returns:
ID Date
0 11 2018-03-15
0 11 2018-03-16
0 11 2018-03-17
0 11 2018-03-18
0 11 2018-03-19
1 22 2018-01-01
1 22 2018-01-02
1 22 2018-01-03
1 22 2018-01-04
1 22 2018-01-05
2 33 2018-02-08
2 33 2018-02-09
2 33 2018-02-10
2 33 2018-02-11
2 33 2018-02-12
reindexing with pd.date_range
Let's try creating a flat list of date-ranges and reindexing this DataFrame.
from itertools import chain
v = df.assign(Date=pd.to_datetime(df.Date)).set_index('Date')
# assuming ID is a string column
v.reindex(chain.from_iterable(
pd.date_range(end=i, periods=5) for i in v.index)
).bfill().reset_index()
Date ID
0 2018-03-14 11
1 2018-03-15 11
2 2018-03-16 11
3 2018-03-17 11
4 2018-03-18 11
5 2018-03-19 11
6 2017-12-31 22
7 2018-01-01 22
8 2018-01-02 22
9 2018-01-03 22
10 2018-01-04 22
11 2018-01-05 22
12 2018-02-07 33
13 2018-02-08 33
14 2018-02-09 33
15 2018-02-10 33
16 2018-02-11 33
17 2018-02-12 33
concat based solution on keys
Just for fun. My reindex solution is definitely more performant and easier to read, so if you were to pick one, use that.
v = df.assign(Date=pd.to_datetime(df.Date))
v_dict = {
j : pd.DataFrame(
pd.date_range(end=i, periods=5), columns=['Date']
)
for j, i in zip(v.ID, v.Date)
}
(pd.concat(v_dict, axis=0)
.reset_index(level=1, drop=True)
.rename_axis('ID')
.reset_index()
)
ID Date
0 11 2018-03-14
1 11 2018-03-15
2 11 2018-03-16
3 11 2018-03-17
4 11 2018-03-18
5 11 2018-03-19
6 22 2017-12-31
7 22 2018-01-01
8 22 2018-01-02
9 22 2018-01-03
10 22 2018-01-04
11 22 2018-01-05
12 33 2018-02-07
13 33 2018-02-08
14 33 2018-02-09
15 33 2018-02-10
16 33 2018-02-11
17 33 2018-02-12
group by ID, select the column Date, and for each group generate a series of five days leading up to the greatest date.
rather than writing a long lambda, I've written a helper function.
def drange(x):
e = x.max()
s = e-pd.Timedelta(days=4)
return pd.Series(pd.date_range(s,e))
res = df.groupby('ID').Date.apply(drange)
Then drop the extraneous level from the resulting multiindex and we get our desired output
res.reset_index(level=0).reset_index(drop=True)
# outputs:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
5 22 2018-01-01
6 22 2018-01-02
7 22 2018-01-03
8 22 2018-01-04
9 22 2018-01-05
10 33 2018-02-08
11 33 2018-02-09
12 33 2018-02-10
13 33 2018-02-11
14 33 2018-02-12
Compact alternative
# Help function to return Serie with daterange
func = lambda x: pd.date_range(x.iloc[0]-pd.Timedelta(days=4), x.iloc[0]).to_series()
res = df.groupby('ID').Date.apply(func).reset_index().drop('level_1',1)
You can try groupby with date_range
df.groupby('ID').Date.apply(lambda x : pd.Series(pd.date_range(end=x.iloc[0],periods=5))).reset_index(level=0)
Out[793]:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
0 22 2018-01-01
1 22 2018-01-02
2 22 2018-01-03
3 22 2018-01-04
4 22 2018-01-05
0 33 2018-02-08
1 33 2018-02-09
2 33 2018-02-10
3 33 2018-02-11
4 33 2018-02-12

Python how to get values in one dataframe from the other dataframe

import pandas as pd
import numpy as np
df1=pd.DataFrame(np.arange(25).reshape((5,5)),index=pd.date_range('2015/01/01',periods=5,freq='D')))
df1['trading_signal']=[1,-1,1,-1,1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
how would I get the value of trading_signal from df1 and sent it to df2.
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge you need to reset_index, which is less memory efficient ans slower than using join. Please read the docs on Joining a single index to a multi index:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining right for a reason, if you don't understand what this means please read Brief primer on merge methods (relational algebra).

Categories

Resources