I'm trying to build a time series of the market value of my portfolio. The whole website is built on the Django framework, so the datasets will be dynamic.
I have a dataset named dataset, which contains stocks' closing prices:
YAR.OL NHY.OL
date
2000-01-03 NaN 18.550200
2000-01-04 NaN 18.254101
2000-01-05 NaN 17.877100
2000-01-06 NaN 18.523300
2000-01-07 NaN 18.819500
... ... ...
2020-07-27 381.799988 26.350000
2020-07-28 382.399994 26.490000
2020-07-29 377.899994 26.389999
2020-07-30 372.000000 25.049999
2020-07-31 380.700012 25.420000
And I have a dataframe named positions consisting of the positions in a user's portfolio:
Date Direction Ticker Price ... FX-rate Comission Short Cost-price
0 2020-07-27 Buy YAR.OL 381.0 ... 1.0 0.0 False 381.0
1 2020-07-31 Sell YAR.OL 380.0 ... 1.0 0.0 False -380.0
2 2020-07-28 Buy NHY.OL 26.5 ... 1.0 0.0 False 26.5
Code for the positions dataframe:
data = zip(date_list, direction_list, ticker_list, price_list, new_volume_list,
           exchange_list, commision_list, short_list, cost_price_list)
df = pd.DataFrame(data, columns=['Date', 'Direction', 'Ticker', 'Price', 'Volume',
                                 'FX-rate', 'Comission', 'Short', 'Cost-price'])
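For reference, a minimal self-contained sketch that reproduces the frame above; the list contents here are sample values I assumed, standing in for the lists built elsewhere in the app:
import pandas as pd

# sample values standing in for the dynamically built lists (assumed)
date_list = ['2020-07-27', '2020-07-31', '2020-07-28']
direction_list = ['Buy', 'Sell', 'Buy']
ticker_list = ['YAR.OL', 'YAR.OL', 'NHY.OL']
price_list = [381.0, 380.0, 26.5]
new_volume_list = [1, 1, 1]
exchange_list = [1.0, 1.0, 1.0]
commision_list = [0.0, 0.0, 0.0]
short_list = [False, False, False]
cost_price_list = [381.0, -380.0, 26.5]

data = zip(date_list, direction_list, ticker_list, price_list, new_volume_list,
           exchange_list, commision_list, short_list, cost_price_list)
df = pd.DataFrame(data, columns=['Date', 'Direction', 'Ticker', 'Price', 'Volume',
                                 'FX-rate', 'Comission', 'Short', 'Cost-price'])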
Further, I have managed to split the positions dataframe into one dataframe per ticker:
dataset = self.dataset_creator(n_ticker_list)
dataset.index = pd.to_datetime(dataset.index)
positions = self.get_all_positions(selected_portfolio)
for ticker in n_ticker_list:
    s = positions.loc[positions['Ticker'] == ticker]
    s = s.sort_values(by='Date')
    print(s)
This gives me:
Date Direction Ticker Price ... FX-rate Comission Short Cost-price
0 2020-07-27 Buy YAR.OL 381.0 ... 1.0 0.0 False 381.0
1 2020-07-31 Sell YAR.OL 380.0 ... 1.0 0.0 False -380.0
[2 rows x 9 columns]
Date Direction Ticker Price ... FX-rate Comission Short Cost-price
2 2020-07-28 Buy NHY.OL 26.5 ... 1.0 0.0 False 26.5
I have made this in Excel, and the end goal is to create the yellow dataframe:
Please note that this is dynamic; I have used two stocks and a shorter timeframe to make the example easier to follow, but it could just as easily be 10 stocks.
Overview / summary
Keep one data frame for each 'concept' -- closing prices, positions, etc.
Then multiply data frames (value = positions x price).
Separate into multiple data frames for reporting.
from io import StringIO
import pandas as pd
# create data frame with closing prices
data = '''date YAR.OL NHY.OL
2020-07-27 381.799988 26.350000
2020-07-28 382.399994 26.490000
2020-07-29 377.899994 26.389999
2020-07-30 372.000000 25.049999
2020-07-31 380.700012 25.420000
'''
closing_prices = (pd.read_csv(StringIO(data),
                              sep=r'\s+', engine='python',
                              parse_dates=['date'])
                  .set_index('date')
                  .sort_index()
                  .sort_index(axis=1)
                  )
print(closing_prices.round(2))
NHY.OL YAR.OL
date
2020-07-27 26.35 381.8
2020-07-28 26.49 382.4
2020-07-29 26.39 377.9
2020-07-30 25.05 372.0
2020-07-31 25.42 380.7
Now create positions (typed in from the Excel screenshot). I assumed each entry was a buy or sell for that day. A cumulative sum then gives the then-current positions.
positions = [
('YAR.OL', '2020-07-27', 1),
('YAR.OL', '2020-07-31', -1),
('NHY.OL', '2020-07-28', 1),
]
# changed cost_price to volume
positions = pd.DataFrame(positions, columns=['tickers', 'date', 'volume'])
positions['date'] = pd.to_datetime(positions['date'])
positions = (positions.pivot(index='date', columns='tickers', values='volume')
             .sort_index()
             .sort_index(axis=1)
             )
positions = positions.reindex(closing_prices.index).fillna(0).cumsum()
print(positions)
tickers NHY.OL YAR.OL
date
2020-07-27 0.0 1.0 # <-- positions held after the cumulative sum
2020-07-28 1.0 1.0
2020-07-29 1.0 1.0
2020-07-30 1.0 1.0
2020-07-31 1.0 0.0
Now, the portfolio value is positions times closing price, with one column per stock, and we can compute the total for each day with sum(axis=1):
port_value = positions * closing_prices
port_value['total'] = port_value.sum(axis=1)
print(port_value.round(2))
tickers NHY.OL YAR.OL total
date
2020-07-27 0.00 381.8 381.80
2020-07-28 26.49 382.4 408.89
2020-07-29 26.39 377.9 404.29
2020-07-30 25.05 372.0 397.05
2020-07-31 25.42 0.0 25.42
UPDATE - suggestions for further work
Include traded price in the Positions data frame.
Also include trade timestamp in the Positions data frame.
The end-of-day portfolio value would use end-of-day prices; profit/loss additionally needs the purchase/sale prices. Which do you want?
The data frame Index (and MultiIndex) along with broadcasting are relevant concepts for this application.
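As a sketch of the profit/loss direction, assuming a hypothetical trades frame that carries the traded price (the names here are made up; closing_prices and port_value come from the code above):
import pandas as pd

# hypothetical trades frame: signed volume plus the price actually traded at
trades = pd.DataFrame({
    'date':   pd.to_datetime(['2020-07-27', '2020-07-28', '2020-07-31']),
    'ticker': ['YAR.OL', 'NHY.OL', 'YAR.OL'],
    'volume': [1, 1, -1],            # positive = buy, negative = sell
    'price':  [381.0, 26.5, 380.0],  # traded price, not the close
})

# cash flow per trade: buys consume cash, sells release it
trades['cash_flow'] = -trades['volume'] * trades['price']

# cumulative cash per day, aligned to the price calendar
cash = (trades.groupby('date')['cash_flow'].sum()
        .reindex(closing_prices.index, fill_value=0)
        .cumsum())

# profit/loss = market value of the holdings + cumulative cash
pnl = port_value['total'] + cash
print(pnl.round(2))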
Related
DataFrame1 :
origin 2001-01-01 00:00:00 2002-01-01 00:00:00 2003-01-01 00:00:00 2004-01-01 00:00:00 ... 2008-01-01 00:00:00 2009-01-01 00:00:00 2010-01-01 00:00:00 Grand Total
Simulation 1 1.597942e+13 NaN 1.114312e+20 4.370424e+26 ... 3.633710e+52 3.388095e+58 1.103886e+64 3.159025e+71
Simulation 2 1.852542e+13 NaN 1.280181e+20 4.958904e+26 ... 7.830853e+52 1.077502e+59 5.605342e+64 1.852667e+72
Simulation 3 1.978941e+13 NaN 1.024391e+20 5.038746e+26 ... 6.922672e+52 9.431727e+58 5.947689e+63 4.921311e+71
Simulation 4 1.845122e+13 NaN 1.050210e+20 4.305396e+26 ... 6.529340e+52 1.004737e+59 4.311079e+63 6.250895e+71
Simulation 5 1.733954e+13 NaN 1.082353e+20 4.400699e+26 ... 4.554812e+52 2.587384e+58 5.571276e+63 1.459044e+71
I'm trying to filter the column Grand Total from the above DataFrame1.
DataFrame2 :
CI Var
0 60.0 2.059017e+72
1 70.0 2.402186e+72
2 80.0 2.745356e+72
3 90.5 3.105684e+72
In DataFrame2, the first value in column Var is 2.059017e+72. For each value of Var, I need to collect the values from the Grand Total column of DataFrame1 that are greater than that Var value and store them in a separate dataframe.
You can filter the columns like this:
DataFrame3 = DataFrame1.loc[DataFrame2['Var'][0] < DataFrame1['Grand Total']]
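If you need one filtered frame for each value of Var, a minimal sketch along the same lines:
# one filtered frame per threshold in DataFrame2['Var']
filtered = {var: DataFrame1.loc[DataFrame1['Grand Total'] > var]
            for var in DataFrame2['Var']}
# filtered[DataFrame2['Var'][0]] holds the rows above the first threshold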
Do you want to print the values or save them as an extra column of df2?
I am working with UPC (product#), date_expected, and quantity_picked columns and need my data organized to show the total quantity_picked per day (for every day) for each UPC. Example data below:
UPC quantity_picked date_expected
0 0001111041660 1.0 2019-05-14 15:00:00
1 0001111045045 1.0 2019-05-14 15:00:00
2 0001111050268 1.0 2019-05-14 15:00:00
3 0001111086132 1.0 2019-05-14 15:00:00
4 0001111086983 1.0 2019-05-14 15:00:00
5 0001111086984 1.0 2019-05-14 15:00:00
... ... ...
39694 0004470036000 6.0 2019-06-24 20:00:00
39695 0007225001116 1.0 2019-06-24 20:00:00
I was able to successfully organize my data in this manner using the code below, but the output leaves out dates with quantity_picked = 0:
orders = pd.read_sql_query(SQL, con=sql_conn)
order_daily = orders.copy()
order_daily['date_expected'] = order_daily['date_expected'].dt.normalize()
order_daily['date_expected'] = pd.to_datetime(order_daily.date_expected, format='%Y-%m-%d')
# Groups by date and UPC, getting the sum of quantity_picked for each,
# then resets index to fill in dates for all rows
tipd = order_daily.groupby(['UPC', 'date_expected']).sum().reset_index()
# Rearranging of columns to put UPC column first
tipd = tipd[['UPC','date_expected','quantity_picked']]
This gives the following output:
UPC date_expected quantity_picked
0 0000000002554 2019-05-21 4.0
1 0000000002554 2019-05-24 2.0
2 0000000002554 2019-06-02 2.0
3 0000000002554 2019-06-17 2.0
4 0000000003082 2019-05-15 2.0
5 0000000003082 2019-05-16 2.0
6 0000000003082 2019-05-17 8.0
... ... ...
31588 0360600051715 2019-06-17 1.0
31589 0501072452748 2019-06-15 1.0
31590 0880100551750 2019-06-07 2.0
When I try to follow the solution given in:
Pandas filling missing dates and values within group
I adjust my code to:
tipd = order_daily.groupby(['UPC', 'date_expected']).sum().reindex(idx, fill_value=0).reset_index()
# Rearranging of columns to put UPC column first
tipd = tipd[['UPC','date_expected','quantity_picked']]
# Viewing first 10 rows to check format of dataframe
print('Preview of Total per Item per Day')
print(tipd.iloc[0:10])
And receive the following error:
TypeError: Argument 'tuples' has incorrect type (expected numpy.ndarray, got DatetimeArray)
I need each date to be listed for each product, even when quantity picked is zero. I plan on creating two new columns using .shift and .diff for calculations, and those columns will not be accurate if my data is skipping dates.
Any guidance is very much appreciated.
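That error typically means the idx passed to reindex does not line up with the grouped frame's (UPC, date) MultiIndex. A minimal sketch of building such an index with pandas.MultiIndex.from_product, assuming order_daily from the code above (the variable names are assumptions):
import pandas as pd

# sum per UPC and day, keeping the (UPC, date) MultiIndex
grouped = order_daily.groupby(['UPC', 'date_expected'])['quantity_picked'].sum()

# every UPC paired with every calendar day in the data's range
all_days = pd.date_range(order_daily['date_expected'].min(),
                         order_daily['date_expected'].max(), freq='D')
idx = pd.MultiIndex.from_product([order_daily['UPC'].unique(), all_days],
                                 names=['UPC', 'date_expected'])

# missing (UPC, day) pairs become quantity_picked = 0
tipd = grouped.reindex(idx, fill_value=0).reset_index()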
I have to normalize the values of one dataframe column, Allocation, by month.
data=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.001905 9.55 0.0 0.0
2018-11-01 00:15:00 0.001794 9.55 0.0 0.0
2018-11-01 00:30:00 0.001700 9.55 0.0 0.0
2018-11-01 00:45:00 0.001607 9.55 0.0 0.0
This means: for 2018-11, divide Allocation by 11.116; for 2018-12, divide Allocation by 2473.65; and so on. (These values come from a list Volume, where Volume[0] corresponds to 2018-11 and Volume[7] corresponds to 2019-06.)
Date_From is the index and a timestamp.
data_normalized=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.000171 9.55 0.0 0.0
2018-11-01 00:15:00 0.000097 9.55 0.0 0.0
...
My approach was to use itertuples:
for row in data.itertuples(index=True, name='index'):
    if row.index == '2018-11':
        data['Allocation'] / Volume[0]
Here, the if statement is never true...
Another approach was
if ((row.index >='2018-11-01 00:00:00') & (row.index<='2018-11-31 23:45:00')):
However, here I get the error TypeError: '>=' not supported between instances of 'builtin_function_or_method' and 'str'
Can I solve my problem with this approach, or should I use a different one? I would be happy about any help.
Cheers!
Maybe you can put your list Volume in a dataframe where the date (or index) is the first day of every month.
import pandas as pd
import numpy as np
N = 16
date = pd.date_range(start='2018-01-01', periods=N, freq='15d')
df = pd.DataFrame({"date": date, "Allocation": np.random.randn(N)})
# a dataframe associating a volume with every month
df_vol = pd.DataFrame({"month": pd.date_range(start="2018-01-01", periods=8, freq="MS"),
                       "Volume": np.arange(8) + 1})
# snap every date to the first day of its month
df["month"] = df["date"].dt.to_period("M").dt.to_timestamp()
# merge
df1 = pd.merge(df, df_vol, on="month", how="left")
# divide Allocation by Volume; this is vectorized,
# since every date was merged with the right volume
df1["norm"] = df1["Allocation"] / df1["Volume"]
The following is my dataframe, which holds values from multiple Excel files. I wanted to do a time series analysis, so I made the index a DatetimeIndex, but the index is not arranged according to date:
Item Details Unit Op. Qty Price Op. Amt. Cl. Qty Price.1 Cl. Amt.
Month
2013-04-01 5 In 1 Pcs -56.0 172.78 -9675.58 -68.0 175.79 -11953.96
2013-04-01 Adaptor Pcs -17.0 9.00 -152.99 -17.0 9.00 -152.99
2013-04-01 Agro Tape Pcs -2.0 26.25 -52.50 -2.0 26.25 -52.50
...
2014-01-01 12" Angal Pcs -6.0 31.50 -189.00 -6.0 31.50 -189.00
2014-01-01 13 Mm Electrical Drill Check Set -1.0 247.50 -247.50 -1.0 247.50 -247.50
2014-01-01 14" Blad Pcs -5.0 157.49 -787.45 -5.0 157.49 -787.45
...
2013-09-01 Zinc Bolt 1/4 X 2"(box) Box -1.0 899.99 -899.99 -1.0 899.99 -899.99
2013-09-01 Zorik 88 32gram Pcs -1.0 45.00 -45.00 -1.0 45.00 -45.00
2013-09-01 Zorrik 311 Gram Pcs -1.0 270.01 -270.01 -1.0 270.01 -270.01
It is not sorted according to the date. I want to sort the index and its respective rows as well. I googled and found that there is a way to sort the DatetimeIndex, as follows:
all_data.index.sort_values()
DatetimeIndex(['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-01',
'2013-04-01', '2013-04-01', '2013-04-01', '2013-04-01',
'2013-04-01', '2013-04-01',
...
'2014-02-01', '2014-02-01', '2014-02-01', '2014-02-01',
'2014-02-01', '2014-02-01', '2014-02-01', '2014-02-01',
'2014-02-01', '2014-02-01'],
dtype='datetime64[ns]', name=u'Month', length=71232, freq=None)
But this sorts only the index; how can I sort the entire dataframe according to the sorted index? Kindly help.
I think you need sort_index:
all_data = all_data.sort_index()
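Unlike Index.sort_values, which returns only the sorted index, DataFrame.sort_index reorders the rows together with the index; a quick check with made-up values:
import pandas as pd

df = pd.DataFrame({'Op. Qty': [-6.0, -56.0]},
                  index=pd.to_datetime(['2014-01-01', '2013-04-01']))
print(df.sort_index())  # the 2013-04-01 row moves first, keeping its value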
I have a dataframe (df) where column A is drug units dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration given the half-life of the drug (180 mins). I am struggling with the code in pandas. Would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I want to fillna as a function of the time elapsed and the half-life of the drug, something like:
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO

text = """TimeStamp                 A
1991-04-21 09:09:00     9.0
1991-04-21 13:00:00     NaN
1991-04-21 19:00:00     NaN
1991-04-22 07:35:00    10.0
1991-04-22 13:40:00     NaN
1991-04-22 16:56:00     NaN"""

df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
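A quick sanity check on row 1: 13:00 is 231 minutes after 09:09, so the value should be 9 * 0.5 ** (231 / 180):
print(9 * 0.5 ** (231 / 180))  # ~3.6976, matching row 1 above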