I have a pd.DataFrame (pandas.core.frame.DataFrame) with some stock trades.
data = {'Date': ['2021-01-15', '2021-01-21', '2021-02-28', '2021-01-30', '2021-02-16', '2021-03-22', '2021-01-08', '2021-03-02', '2021-02-25', '2021-04-04', '2021-03-15', '2021-04-08'], 'Ticker': ['MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL'], 'Quantity': [2,3,7,2,6,4,-3,8,-2,9,11,1]}
df = pd.DataFrame(data)
Date Ticker Quantity
0 2021-01-15 MFST 2
1 2021-01-21 AMZN 3
2 2021-02-28 GOOG 7
3 2021-01-30 AAPL 2
4 2021-02-16 MFST 6
5 2021-03-22 AMZN 4
6 2021-01-08 GOOG -3
7 2021-03-02 AAPL 8
8 2021-02-25 MFST -2
9 2021-04-04 AMZN 9
10 2021-03-15 GOOG 11
11 2021-04-08 AAPL 1
Quantity refers to the number of shares bought.
I am looking for an efficient way to create a new df which contains the number of shares for each Ticker per day.
The first trade was on 2021-01-08 and the last on 2021-04-08. I want a new dataframe that contains all days between those two dates as rows and the tickers as columns. Values shall be the number of shares I hold on a specific day. Hence, if I buy 4 shares of a stock on 2021-03-15 (assuming no further buying or selling), I will have them from 2021-03-15 until 2021-04-08, which should be represented as a 4 in every row for this specific ticker. If I decide to buy more shares, this number will change on that day and all following days.
Could be something like this:
Date        MFST  AMZN  GOOG  AAPL
2021-01-08 2 3 1 0
2021-01-09 2 3 1 0
2021-01-10 2 3 1 0
...
2021-04-08 2 3 1 7
My first guess was to create an empty DataFrame and then iterate with two for loops over all its Dates and Tickers. However, I think that is not the most efficient way. I am thankful for any recommendation!
You can use df.pivot() to transform your data into a tabular form, matching the expected output layout, as follows:
df.pivot(index='Date', columns='Ticker', values='Quantity').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
If you need to aggregate Quantity over the same date for each stock, you can use df.pivot_table() with the parameter aggfunc='sum', as follows:
df.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
Result:
          Date  AAPL  AMZN  GOOG  MFST
0   2021-01-08     0     0    -3     0
1   2021-01-15     0     0     0     2
2   2021-01-21     0     3     0     0
3   2021-01-30     2     0     0     0
4   2021-02-16     0     0     0     6
5   2021-02-25     0     0     0    -2
6   2021-02-28     0     0     7     0
7   2021-03-02     8     0     0     0
8   2021-03-15     0     0    11     0
9   2021-03-22     0     4     0     0
10  2021-04-04     0     9     0     0
11  2021-04-08     1     0     0     0
Additional Test Case:
To showcase the aggregation function of df.pivot_table(), I have added some data as follows:
data = {'Date': ['2021-03-15',
'2021-01-21',
'2021-01-21',
'2021-02-28',
'2021-02-28',
'2021-04-30',
'2021-04-30'],
'Ticker': ['MFST', 'AMZN', 'AMZN', 'GOOG', 'GOOG', 'AAPL', 'AAPL'],
'Quantity': [2, 3, 4, 1, 2, 7, 2]}
df = pd.DataFrame(data)
Date Ticker Quantity
0 2021-03-15 MFST 2
1 2021-01-21 AMZN 3
2 2021-01-21 AMZN 4
3 2021-02-28 GOOG 1
4 2021-02-28 GOOG 2
5 2021-04-30 AAPL 7
6 2021-04-30 AAPL 2
df.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')
Date AAPL AMZN GOOG MFST
0 2021-01-21 0 7 0 0
1 2021-02-28 0 0 3 0
2 2021-03-15 0 0 0 2
3 2021-04-30 9 0 0 0
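Note that plain df.pivot() cannot be used on this duplicated data; a quick sketch to illustrate the error path:
try:
    df.pivot(index='Date', columns='Ticker', values='Quantity')
except ValueError as err:
    # duplicate (Date, Ticker) pairs cannot be reshaped without aggregating
    print(err)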
Edit
Based on the latest requirement:
The first trade was on 2021-01-08 and the last on 2021-04-08. I want a
new dataframe that contains all days between those two dates as rows
and the tickers as columns. Values shall be the number of shares I
hold on a specific day. Hence, if I buy 4 shares of a stock on
2021-03-15 (assuming no further buying or selling) I will have them
from 2021-03-15 until 2021-04-08, which should be represented as a 4 in
every row for this specific ticker. If I decide to buy more shares,
this number will change on that day and all following days.
Here is the enhanced solution:
import pandas as pd

data = {'Date': ['2021-01-15', '2021-01-21', '2021-02-28', '2021-01-30', '2021-02-16', '2021-03-22', '2021-01-08', '2021-03-02', '2021-02-25', '2021-04-04', '2021-03-15', '2021-04-08'], 'Ticker': ['MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL','MFST', 'AMZN', 'GOOG', 'AAPL'], 'Quantity': [2,3,7,2,6,4,-3,8,-2,9,11,1]}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')

# Expand to one row per calendar day; filler rows get NaN Ticker/Quantity
df1 = df.set_index('Date').asfreq('D')
# Give the filler rows some ticker (any will do, since their Quantity is 0)
df1['Ticker'] = df1['Ticker'].ffill().bfill()
df1['Quantity'] = df1['Quantity'].fillna(0)

# Wide daily-trades table: one column per ticker, 0 where nothing was traded
df2 = df1.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum').rename_axis(columns=None).reset_index().fillna(0, downcast='infer')

# Running position: cumulative sum of the daily trades per ticker
df3 = df2[['Date']].join(df2.iloc[:,1:].cumsum())
Result:
print(df3)
Date AAPL AMZN GOOG MFST
0 2021-01-08 0 0 -3 0
1 2021-01-09 0 0 -3 0
2 2021-01-10 0 0 -3 0
3 2021-01-11 0 0 -3 0
4 2021-01-12 0 0 -3 0
5 2021-01-13 0 0 -3 0
6 2021-01-14 0 0 -3 0
7 2021-01-15 0 0 -3 2
8 2021-01-16 0 0 -3 2
9 2021-01-17 0 0 -3 2
10 2021-01-18 0 0 -3 2
11 2021-01-19 0 0 -3 2
12 2021-01-20 0 0 -3 2
13 2021-01-21 0 3 -3 2
14 2021-01-22 0 3 -3 2
15 2021-01-23 0 3 -3 2
16 2021-01-24 0 3 -3 2
17 2021-01-25 0 3 -3 2
18 2021-01-26 0 3 -3 2
19 2021-01-27 0 3 -3 2
20 2021-01-28 0 3 -3 2
21 2021-01-29 0 3 -3 2
22 2021-01-30 2 3 -3 2
23 2021-01-31 2 3 -3 2
24 2021-02-01 2 3 -3 2
25 2021-02-02 2 3 -3 2
26 2021-02-03 2 3 -3 2
27 2021-02-04 2 3 -3 2
28 2021-02-05 2 3 -3 2
29 2021-02-06 2 3 -3 2
30 2021-02-07 2 3 -3 2
31 2021-02-08 2 3 -3 2
32 2021-02-09 2 3 -3 2
33 2021-02-10 2 3 -3 2
34 2021-02-11 2 3 -3 2
35 2021-02-12 2 3 -3 2
36 2021-02-13 2 3 -3 2
37 2021-02-14 2 3 -3 2
38 2021-02-15 2 3 -3 2
39 2021-02-16 2 3 -3 8
40 2021-02-17 2 3 -3 8
41 2021-02-18 2 3 -3 8
42 2021-02-19 2 3 -3 8
43 2021-02-20 2 3 -3 8
44 2021-02-21 2 3 -3 8
45 2021-02-22 2 3 -3 8
46 2021-02-23 2 3 -3 8
47 2021-02-24 2 3 -3 8
48 2021-02-25 2 3 -3 6
49 2021-02-26 2 3 -3 6
50 2021-02-27 2 3 -3 6
51 2021-02-28 2 3 4 6
52 2021-03-01 2 3 4 6
53 2021-03-02 10 3 4 6
54 2021-03-03 10 3 4 6
55 2021-03-04 10 3 4 6
56 2021-03-05 10 3 4 6
57 2021-03-06 10 3 4 6
58 2021-03-07 10 3 4 6
59 2021-03-08 10 3 4 6
60 2021-03-09 10 3 4 6
61 2021-03-10 10 3 4 6
62 2021-03-11 10 3 4 6
63 2021-03-12 10 3 4 6
64 2021-03-13 10 3 4 6
65 2021-03-14 10 3 4 6
66 2021-03-15 10 3 15 6
67 2021-03-16 10 3 15 6
68 2021-03-17 10 3 15 6
69 2021-03-18 10 3 15 6
70 2021-03-19 10 3 15 6
71 2021-03-20 10 3 15 6
72 2021-03-21 10 3 15 6
73 2021-03-22 10 7 15 6
74 2021-03-23 10 7 15 6
75 2021-03-24 10 7 15 6
76 2021-03-25 10 7 15 6
77 2021-03-26 10 7 15 6
78 2021-03-27 10 7 15 6
79 2021-03-28 10 7 15 6
80 2021-03-29 10 7 15 6
81 2021-03-30 10 7 15 6
82 2021-03-31 10 7 15 6
83 2021-04-01 10 7 15 6
84 2021-04-02 10 7 15 6
85 2021-04-03 10 7 15 6
86 2021-04-04 10 16 15 6
87 2021-04-05 10 16 15 6
88 2021-04-06 10 16 15 6
89 2021-04-07 10 16 15 6
90 2021-04-08 11 16 15 6
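As a footnote, the asfreq/ffill detour can be skipped entirely. Here is a minimal sketch (it assumes the same sample df with a datetime Date column; daily, full_days and holdings are illustrative names) that reindexes the pivoted frame onto the full calendar and then accumulates:
# Wide daily trades: one column per ticker, NaN where nothing was traded
daily = df.pivot_table(index='Date', columns='Ticker', values='Quantity', aggfunc='sum')
# Reindex onto every calendar day, treat missing days as zero trades, accumulate
full_days = pd.date_range(df['Date'].min(), df['Date'].max(), freq='D')
holdings = daily.reindex(full_days).fillna(0).cumsum().astype(int)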
You can also use df.groupby to aggregate the quantities:
df.groupby(['Date']).agg('sum')
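For that groupby route to reach the question's wide layout, the sums still need to be taken per (Date, Ticker) and the Ticker level pivoted out into columns; a minimal sketch:
# Sum per (Date, Ticker), then unstack the tickers into columns
wide = df.groupby(['Date', 'Ticker'])['Quantity'].sum().unstack(fill_value=0)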
I have time trend data on facility traffic (admissions to and releases from a facility over time), with gaps. Because of the structure of this data, when a gap appears, the "releases" one day prior to the gap are artificially high (accounting for all unseen individuals released over the period of the gap), and the "admissions" one day after the gap are artificially high (for the same reason: any individual who was admitted during the gap and remains in the facility will appear as an "admission" on this date).
Here is a sample pandas DataFrame involving such a data gap (with zeroes implying missing data on 2020-01-04 through 2020-01-07):
date(index) releases admissions
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 50 14
2020-01-04 0 0
2020-01-05 0 0
2020-01-06 0 0
2020-01-07 0 0
2020-01-08 8 100
2020-01-09 11 19
2020-01-10 9 17
I want to smooth this data, but I'm not sure what interpolation method to use. What I want to accomplish is redistribution forwards of the "releases" on date gap(0)-1 and redistribution backwards of "admissions" on date gap(n)+1. For instance, if a gap is 4 days long and on day gap(n)+1 there are 100 admissions, I want to redistribute such that, on each day of the gap, there are 20 admissions, and on day gap(n)+1 admissions are revised to show 20.
Using the above example series, redistribution would look like the following:
date(index) releases admissions
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 10 14
2020-01-04 10 20
2020-01-05 10 20
2020-01-06 10 20
2020-01-07 10 20
2020-01-08 8 20
2020-01-09 11 19
2020-01-10 9 17
You can create groups that pair each run of consecutive zeros with the one nonzero value before it (for releases) or after it (for admissions), and then use transform('mean') to calculate the average for each group:
import numpy as np

# releases: each zero-run groups with the preceding nonzero value
df['releases'] = df.groupby(
    df['releases'].replace(0, np.nan).notna().cumsum()
)['releases'].transform('mean')

# admissions: reverse first, so each zero-run groups with the following nonzero value
df['admissions'] = df.groupby(
    df['admissions'].replace(0, np.nan).notna().iloc[::-1].cumsum().iloc[::-1]
)['admissions'].transform('mean')
Output:
releases admissions
date
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 10 14
2020-01-04 10 20
2020-01-05 10 20
2020-01-06 10 20
2020-01-07 10 20
2020-01-08 8 20
2020-01-09 11 19
2020-01-10 9 17
Update: to keep any existing NaN values as NaN (each NaN lands in its own group, while zeros still join their neighbouring value), compare against 0 directly:
# releases
df['releases_i'] = df.groupby(
df['releases'].ne(0).cumsum()
)['releases'].transform('mean')
# admissions
df['admissions_i'] = df.groupby(
df['admissions'].ne(0).iloc[::-1].cumsum().iloc[::-1]
)['admissions'].transform('mean')
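For reference, a self-contained version of the sample that the snippets above can be run against (values copied from the question; the ne(0) variant is used so only pandas is needed):
import pandas as pd

df = pd.DataFrame(
    {'releases':   [15, 8, 50, 0, 0, 0, 0, 8, 11, 9],
     'admissions': [23, 20, 14, 0, 0, 0, 0, 100, 19, 17]},
    index=pd.date_range('2020-01-01', periods=10, name='date'))

# Each zero-run joins the nonzero value before it (releases) ...
df['releases'] = df.groupby(df['releases'].ne(0).cumsum())['releases'].transform('mean')
# ... or after it (admissions), via the double reversal
adm_groups = df['admissions'].ne(0).iloc[::-1].cumsum().iloc[::-1]
df['admissions'] = df.groupby(adm_groups)['admissions'].transform('mean')
print(df)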
I have a dataframe with different columns (like price, id, product and date) and I need to divide this dataframe into several dataframes based on the current date of the system (current_date = np.datetime64(date.today())).
For example, if today is 2020-02-07 I want to divide my main dataframe into three different ones where df1 would be the data of the last month (data of 2020-01-07 to 2020-02-07), df2 would be the data of the last three months (excluding the month already in df1 so it would be more accurate to say from 2019-10-07 to 2020-01-07) and df3 would be the data left on the original dataframe.
Is there some easy way to do this? Also, I've been trying to use Grouper, but I keep getting this error over and over again: NameError: name 'Grouper' is not defined (my Pandas version is 0.24.2).
You can use pd.DateOffset to build the one-month and three-month cutoff datetimes, then filter by boolean indexing. (The NameError, by the way, is because Grouper has to be referenced through the pandas namespace, i.e. pd.Grouper.)
rng = pd.date_range('2019-10-10', periods=20, freq='5d')
df = pd.DataFrame({'date': rng, 'id': range(20)})
print (df)
date id
0 2019-10-10 0
1 2019-10-15 1
2 2019-10-20 2
3 2019-10-25 3
4 2019-10-30 4
5 2019-11-04 5
6 2019-11-09 6
7 2019-11-14 7
8 2019-11-19 8
9 2019-11-24 9
10 2019-11-29 10
11 2019-12-04 11
12 2019-12-09 12
13 2019-12-14 13
14 2019-12-19 14
15 2019-12-24 15
16 2019-12-29 16
17 2020-01-03 17
18 2020-01-08 18
19 2020-01-13 19
current_date = pd.to_datetime('now').floor('d')
print (current_date)
2020-02-07 00:00:00
last1m = current_date - pd.DateOffset(months=1)
last3m = current_date - pd.DateOffset(months=3)
m1 = (df['date'] > last1m) & (df['date'] <= current_date)
m2 = (df['date'] > last3m) & (df['date'] <= last1m)
#filter non match m1 or m2 masks
m3 = ~(m1 | m2)
df1 = df[m1]
df2 = df[m2]
df3 = df[m3]
print (df1)
date id
18 2020-01-08 18
19 2020-01-13 19
print (df2)
date id
6 2019-11-09 6
7 2019-11-14 7
8 2019-11-19 8
9 2019-11-24 9
10 2019-11-29 10
11 2019-12-04 11
12 2019-12-09 12
13 2019-12-14 13
14 2019-12-19 14
15 2019-12-24 15
16 2019-12-29 16
17 2020-01-03 17
print (df3)
date id
0 2019-10-10 0
1 2019-10-15 1
2 2019-10-20 2
3 2019-10-25 3
4 2019-10-30 4
5 2019-11-04 5
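If you later need more than three buckets, the same mask logic generalizes to a loop; a sketch assuming month offsets of your choosing and no dates after current_date:
# Hypothetical generalization: consecutive (lower, upper] month windows
bounds = [current_date] + [current_date - pd.DateOffset(months=m) for m in (1, 3)]
frames = [df[(df['date'] > lo) & (df['date'] <= hi)]
          for hi, lo in zip(bounds[:-1], bounds[1:])]
frames.append(df[df['date'] <= bounds[-1]])  # everything older than the last cutoff
df1, df2, df3 = frames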
Current df:
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018
.. ..
I have the df with ID and Date. ID is unique in the original df.
I would like to create a new df based on date. Each ID has a max Date; I would like to use that date and go back 4 days (5 rows for each ID).
There are thousands of IDs.
Expect to get:
ID Date
11 3/15/2018
11 3/16/2018
11 3/17/2018
11 3/18/2018
11 3/19/2018
22 1/1/2018
22 1/2/2018
22 1/3/2018
22 1/4/2018
22 1/5/2018
33 2/8/2018
33 2/9/2018
33 2/10/2018
33 2/11/2018
33 2/12/2018
… …
I tried the following method; I think date_range might be the right direction, but I keep getting an error.
pd.date_range
def date_list(row):
    dates = pd.date_range(row["Date"], periods=5)
    return dates

df["Date_list"] = df.apply(date_list, axis="columns")
Here is another approach, using df.assign to overwrite the date and pd.concat to glue the ranges together. cᴏʟᴅsᴘᴇᴇᴅ's solution wins in performance, but I think this might be a nice addition as it is quite easy to read and understand.
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
Alternative:
dates = (pd.date_range(*x) for x in zip(df['Date']-pd.Timedelta(days=4), df['Date']))
df = (pd.DataFrame(dict(zip(df['ID'],dates)))
.T
.stack()
.reset_index(0)
.rename(columns={'level_0': 'ID', 0: 'Date'}))
Full example:
import pandas as pd
from io import StringIO
data = '''\
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018'''
# Recreate dataframe
df = pd.read_csv(StringIO(data), sep='\s+')  # pd.compat.StringIO was removed in newer pandas
df['Date']= pd.to_datetime(df.Date)
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
df.sort_values(by=['ID','Date'], ascending = [True,True], inplace=True)
print(df)
Returns:
ID Date
0 11 2018-03-15
0 11 2018-03-16
0 11 2018-03-17
0 11 2018-03-18
0 11 2018-03-19
1 22 2018-01-01
1 22 2018-01-02
1 22 2018-01-03
1 22 2018-01-04
1 22 2018-01-05
2 33 2018-02-08
2 33 2018-02-09
2 33 2018-02-10
2 33 2018-02-11
2 33 2018-02-12
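A closely related, fully vectorized variant of the same idea, as a sketch: starting again from the original three-row df with a datetime Date column (one row per ID holding that ID's max date), repeat every row five times and step each copy back 0-4 days:
import numpy as np
rep = df.loc[df.index.repeat(5)].reset_index(drop=True)
# Offsets 4,3,2,1,0 per original row, so dates come out ascending per ID
rep['Date'] = rep['Date'] - pd.to_timedelta(np.tile(np.arange(4, -1, -1), len(df)), unit='D')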
reindexing with pd.date_range
Let's try creating a flat list of date-ranges and reindexing this DataFrame.
from itertools import chain
v = df.assign(Date=pd.to_datetime(df.Date)).set_index('Date')
# assuming ID is a string column
v.reindex(chain.from_iterable(
pd.date_range(end=i, periods=5) for i in v.index)
).bfill().reset_index()
         Date  ID
0  2018-03-15  11
1  2018-03-16  11
2  2018-03-17  11
3  2018-03-18  11
4  2018-03-19  11
5  2018-01-01  22
6  2018-01-02  22
7  2018-01-03  22
8  2018-01-04  22
9  2018-01-05  22
10 2018-02-08  33
11 2018-02-09  33
12 2018-02-10  33
13 2018-02-11  33
14 2018-02-12  33
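One caveat, hedged: with the numeric IDs in this sample, the reindex step inserts NaN before bfill runs, so the filled ID column comes back as float. A sketch (res is an assumed name for the reindexed result) with a final cast:
res = v.reindex(chain.from_iterable(
    pd.date_range(end=i, periods=5) for i in v.index)
).bfill().reset_index()
res['ID'] = res['ID'].astype(int)  # NaN introduced by reindex upcasts numeric IDs to float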
concat-based solution on keys
Just for fun. My reindex solution is definitely more performant and easier to read, so if you were to pick one, use that.
v = df.assign(Date=pd.to_datetime(df.Date))
v_dict = {
j : pd.DataFrame(
pd.date_range(end=i, periods=5), columns=['Date']
)
for j, i in zip(v.ID, v.Date)
}
(pd.concat(v_dict, axis=0)
.reset_index(level=1, drop=True)
.rename_axis('ID')
.reset_index()
)
    ID       Date
0   11 2018-03-15
1   11 2018-03-16
2   11 2018-03-17
3   11 2018-03-18
4   11 2018-03-19
5   22 2018-01-01
6   22 2018-01-02
7   22 2018-01-03
8   22 2018-01-04
9   22 2018-01-05
10  33 2018-02-08
11  33 2018-02-09
12  33 2018-02-10
13  33 2018-02-11
14  33 2018-02-12
Group by ID, select the Date column, and for each group generate a series of five days leading up to the greatest date.
Rather than writing a long lambda, I've written a helper function:
def drange(x):
e = x.max()
s = e-pd.Timedelta(days=4)
return pd.Series(pd.date_range(s,e))
res = df.groupby('ID').Date.apply(drange)
Then drop the extraneous level from the resulting MultiIndex and we get our desired output:
res.reset_index(level=0).reset_index(drop=True)
# outputs:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
5 22 2018-01-01
6 22 2018-01-02
7 22 2018-01-03
8 22 2018-01-04
9 22 2018-01-05
10 33 2018-02-08
11 33 2018-02-09
12 33 2018-02-10
13 33 2018-02-11
14 33 2018-02-12
Compact alternative
# Helper function returning a Series with the date range ending at the ID's date
func = lambda x: pd.date_range(x.iloc[0]-pd.Timedelta(days=4), x.iloc[0]).to_series()
res = df.groupby('ID').Date.apply(func).reset_index().drop(columns='level_1')
You can try groupby with date_range
df.groupby('ID').Date.apply(lambda x : pd.Series(pd.date_range(end=x.iloc[0],periods=5))).reset_index(level=0)
Out[793]:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
0 22 2018-01-01
1 22 2018-01-02
2 22 2018-01-03
3 22 2018-01-04
4 22 2018-01-05
0 33 2018-02-08
1 33 2018-02-09
2 33 2018-02-10
3 33 2018-02-11
4 33 2018-02-12
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(25).reshape((5,5)), index=pd.date_range('2015/01/01', periods=5, freq='D'))
df1['trading_signal']=[1,-1,1,-1,1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
How would I get the value of trading_signal from df1 and send it to df2?
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge, you need to reset_index, which is less memory efficient and slower than using join. Please read the docs on joining a single index to a MultiIndex:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining with how='right' for a reason: it keeps every row of df2. If you don't understand what this means, please read the pandas docs primer on merge methods (relational algebra).
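For comparison, the merge route mentioned above would look roughly like this sketch (out is an assumed name; it relies on df1's index having been named Date, as done earlier):
# Flatten df2's MultiIndex, merge on Date, then restore the index
out = (df2.reset_index()
          .merge(df1[['trading_signal']].reset_index(), on='Date', how='left')
          .set_index(['Date', 'Time']))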