The following dataframe holds values from multiple Excel files. I want to do a time series analysis, so I made the index a DatetimeIndex. But the index is not in date order. Here is my dataframe:
Item Details Unit Op. Qty Price Op. Amt. Cl. Qty Price.1 Cl. Amt.
Month
2013-04-01 5 In 1 Pcs -56.0 172.78 -9675.58 -68.0 175.79 -11953.96
2013-04-01 Adaptor Pcs -17.0 9.00 -152.99 -17.0 9.00 -152.99
2013-04-01 Agro Tape Pcs -2.0 26.25 -52.50 -2.0 26.25 -52.50
...
2014-01-01 12" Angal Pcs -6.0 31.50 -189.00 -6.0 31.50 -189.00
2014-01-01 13 Mm Electrical Drill Check Set -1.0 247.50 -247.50 -1.0 247.50 -247.50
2014-01-01 14" Blad Pcs -5.0 157.49 -787.45 -5.0 157.49 -787.45
...
2013-09-01 Zinc Bolt 1/4 X 2"(box) Box -1.0 899.99 -899.99 -1.0 899.99 -899.99
2013-09-01 Zorik 88 32gram Pcs -1.0 45.00 -45.00 -1.0 45.00 -45.00
2013-09-01 Zorrik 311 Gram Pcs -1.0 270.01 -270.01 -1.0 270.01 -270.01
It is not sorted by date. I want to sort the index together with its rows. I searched and found that a DatetimeIndex can be sorted as follows:
all_data.index.sort_values()
DatetimeIndex(['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-01',
'2013-04-01', '2013-04-01', '2013-04-01', '2013-04-01',
'2013-04-01', '2013-04-01',
...
'2014-02-01', '2014-02-01', '2014-02-01', '2014-02-01',
'2014-02-01', '2014-02-01', '2014-02-01', '2014-02-01',
'2014-02-01', '2014-02-01'],
dtype='datetime64[ns]', name=u'Month', length=71232, freq=None)
But this sorts only the index. How can I sort the entire dataframe by the sorted index? Kindly help.
I think you need sort_index:
all_data = all_data.sort_index()
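Here is a minimal sketch of what sort_index does to the rows. The small frame is a hypothetical stand-in for all_data, with made-up values borrowed from the question:

```python
import pandas as pd

# Hypothetical miniature of all_data: a DatetimeIndex named "Month" in file order.
all_data = pd.DataFrame(
    {"Item Details": ['12" Angal', "Zinc Bolt", "Adaptor"],
     "Op. Qty": [-6.0, -1.0, -17.0]},
    index=pd.DatetimeIndex(["2014-01-01", "2013-09-01", "2013-04-01"], name="Month"),
)

# sort_index returns a new frame; each row moves together with its index label.
sorted_data = all_data.sort_index()
print(sorted_data)
```

If the original order of rows that share a date matters, passing kind="mergesort" to sort_index should keep it, since that sort is stable.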
I have a time series that looks like this (below).
I want to resample it monthly, so that 2019-10 equals the average of all the PTS values in October, 2019-11 the average of all the PTS values in November, and so on.
However, when I use the .resample('M').mean() method, a NaN appears in my data frame if the final day of a month does not have a value. How do I solve this?
Date PTS
2019-10-23 14.0
2019-10-26 14.0
2019-10-27 8.0
2019-10-29 29.0
2019-10-31 17.0
2019-11-03 12.0
2019-11-05 2.0
2019-11-07 15.0
2019-11-08 7.0
2019-11-14 16.0
2019-11-16 12.0
2019-11-20 22.0
2019-11-22 9.0
2019-11-23 20.0
2019-11-25 18.0
Would this work?
df.resample('M').mean().dropna()
Do you have a code sample? This works:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
days = np.arange(31)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(days, 60),
                     "values": rng.integers(0, 60, size=60)})
data.set_index("dates", inplace=True)
# Set the last day to null.
data.loc["2019-03-31"] = np.nan
# This works
data.resample("M").mean()
It also works with an incomplete month:
incomplete_days = np.arange(10)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(incomplete_days, 10),
                     "values": rng.integers(0, 60, size=10)})
data.set_index("dates", inplace=True)
data.resample("M").mean()
You should check your data and types more thoroughly in case the NaN you're receiving indicates a more pressing issue.
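As a rough sketch of such a check on the data from the question, this assumes the Date column is converted with pd.to_datetime and PTS is numeric (both are assumptions about the asker's frame). It groups by calendar month with to_period("M") instead of resample, which is a swapped-in technique that only produces rows for months that actually have data:

```python
import pandas as pd

dates = ["2019-10-23", "2019-10-26", "2019-10-27", "2019-10-29", "2019-10-31",
         "2019-11-03", "2019-11-05", "2019-11-07", "2019-11-08", "2019-11-14",
         "2019-11-16", "2019-11-20", "2019-11-22", "2019-11-23", "2019-11-25"]
pts = [14.0, 14.0, 8.0, 29.0, 17.0, 12.0, 2.0, 15.0, 7.0, 16.0,
       12.0, 22.0, 9.0, 20.0, 18.0]

# Make sure the index really is datetime-typed before aggregating.
df = pd.DataFrame({"PTS": pts}, index=pd.to_datetime(dates))
df.index.name = "Date"

# Group by calendar month; every value in a month contributes to its mean.
monthly = df.groupby(df.index.to_period("M"))["PTS"].mean()
print(monthly)
```

With this data the October mean is 16.4 and the November mean is 13.3, even though 2019-11-30 has no row.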
This question builds on the following Stack Overflow question:
Groupy brings only one key from Pandas dictionary
Dataframe looks like:
ALUP11 Return % Day CESP6 Return % Day TAEE11 Return % Day
Data
2020-08-13 23.81 0.548986 13.0 29.38 -2.747435 13.0 28.33 -0.770578 13.0
2020-09-01 23.68 1.067008 1.0 30.21 0.365449 1.0 28.55 1.205246 1.0
2020-08-31 23.43 -1.139241 31.0 30.10 -2.336145 31.0 28.21 -0.669014 31.0
2020-08-28 23.70 1.455479 28.0 30.82 1.615562 28.0 28.40 0.459851 28.0
2020-08-27 23.36 -0.680272 27.0 30.33 -1.717434 27.0 28.27 0.354988 27.0
After building the dataframe from a dictionary, I need to sum the returns for matching days, but this attempt:
result = df.groupby('Day').agg({'Return %': ['sum']})
result
I get this error:
ValueError: Grouper for 'Day' not 1-dimensional
For each symbol I would like to sum same days of month. In the example I have 3 symbols, so the result should be like:
If your data looks like the data in the answer to your previous question, the error is because you have two columns named Day. As they appear to have the same data you could drop the last column and then your groupby will work:
df = df.iloc[:, :-1].groupby('Day')
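An alternative sketch that keeps every symbol instead of dropping a column: pick one of the repeated Day columns as the grouping key and give the repeated Return % columns unique names. The values and the two-symbol layout below are made up for illustration, and the renaming step is my assumption about what the asker wants:

```python
import pandas as pd

# Hypothetical frame with repeated "Return %" and "Day" labels, as in the question.
df = pd.DataFrame(
    [[0.5, 13.0, -2.0, 13.0],
     [1.0, 1.0, 0.25, 1.0],
     [1.5, 13.0, 1.0, 13.0]],
    columns=["Return %", "Day", "Return %", "Day"],
)

# df["Day"] is a DataFrame because the label repeats; take the first copy as the key.
day = df["Day"].iloc[:, 0]

# Select all "Return %" columns and rename them per symbol (names are assumptions).
returns = df["Return %"].copy()
returns.columns = ["ALUP11", "CESP6"]

# Sum each symbol's returns over rows that share the same day of month.
result = returns.groupby(day).sum()
print(result)
```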
I'm trying to build a time series of the market value of my portfolio. The whole website is built on the Django framework, so the datasets will be dynamic.
I have a dataframe named dataset that contains the stocks' closing prices:
YAR.OL NHY.OL
date
2000-01-03 NaN 18.550200
2000-01-04 NaN 18.254101
2000-01-05 NaN 17.877100
2000-01-06 NaN 18.523300
2000-01-07 NaN 18.819500
... ... ...
2020-07-27 381.799988 26.350000
2020-07-28 382.399994 26.490000
2020-07-29 377.899994 26.389999
2020-07-30 372.000000 25.049999
2020-07-31 380.700012 25.420000
And I have a dataframe named positions consisting of the positions in a user's portfolio:
Date Direction Ticker Price ... FX-rate Comission Short Cost-price
0 2020-07-27 Buy YAR.OL 381.0 ... 1.0 0.0 False 381.0
1 2020-07-31 Sell YAR.OL 380.0 ... 1.0 0.0 False -380.0
2 2020-07-28 Buy NHY.OL 26.5 ... 1.0 0.0 False 26.5
Code for the positions dataframe:
data = zip(date_list, direction_list ,ticker_list,price_list,new_volume_list,exchange_list,commision_list,short_list, cost_price_list)
df = pd.DataFrame(data,columns=['Date','Direction','Ticker','Price','Volume','FX-rate','Comission','Short','Cost-price'])
Further, I have managed to split the positions dataframe into one dataframe per ticker:
dataset = self.dataset_creator(n_ticker_list)
dataset.index = pd.to_datetime(dataset.index)
positions = self.get_all_positions(selected_portfolio)
for ticker in n_ticker_list:
    s = positions.loc[positions['Ticker']==ticker]
    s = s.sort_values(by='Date')
    print(s)
This gives me:
Date Direction Ticker Price ... FX-rate Comission Short Cost-price
0 2020-07-27 Buy YAR.OL 381.0 ... 1.0 0.0 False 381.0
1 2020-07-31 Sell YAR.OL 380.0 ... 1.0 0.0 False -380.0
[2 rows x 9 columns]
Date Direction Ticker Price ... FX-rate Comission Short Cost-price
2 2020-07-28 Buy NHY.OL 26.5 ... 1.0 0.0 False 26.5
I have made this in Excel, and the end goal is to create the yellow dataframe:
Please note that this is dynamic: I have used two stocks and a shorter timeframe to make the example easier to create, but it could just as easily be 10 stocks.
Overview / summary
Keep one data frame for each 'concept' -- closing prices, positions, etc.
Then multiply data frames (value = positions x price).
Separate into multiple data frames for reporting.
from io import StringIO
import pandas as pd
# create data frame with closing prices
data = '''date YAR.OL NHY.OL
2020-07-27 381.799988 26.350000
2020-07-28 382.399994 26.490000
2020-07-29 377.899994 26.389999
2020-07-30 372.000000 25.049999
2020-07-31 380.700012 25.420000
'''
closing_prices = (pd.read_csv(StringIO(data),
                              sep=r'\s+', engine='python',
                              parse_dates=['date']
                              )
                  .set_index('date')
                  .sort_index()
                  .sort_index(axis=1)
                  )
print(closing_prices.round(2))
NHY.OL YAR.OL
date
2020-07-27 26.35 381.8
2020-07-28 26.49 382.4
2020-07-29 26.39 377.9
2020-07-30 25.05 372.0
2020-07-31 25.42 380.7
Now create positions (by typing in from the Excel screen shot). I assumed each entry was buy or sell for that day. Cumulative sum gives then-current positions.
positions = [
('YAR.OL', '2020-07-27', 1),
('YAR.OL', '2020-07-31', -1),
('NHY.OL', '2020-07-28', 1),
]
# changed cost_price to volume
positions = pd.DataFrame(positions, columns=['tickers', 'date', 'volume'])
positions['date'] = pd.to_datetime(positions['date'])
positions = (positions.pivot(index='date', columns='tickers', values='volume')
             .sort_index()
             .sort_index(axis=1)
             )
positions = positions.reindex( closing_prices.index ).fillna(0).cumsum()
print(positions)
tickers NHY.OL YAR.OL
date
2020-07-27 0.0 1.0 # <-- these are transaction volumes
2020-07-28 1.0 1.0
2020-07-29 1.0 1.0
2020-07-30 1.0 1.0
2020-07-31 1.0 0.0
Now, the portfolio value is positions times closing price. There is one column for each stock. And we can compute the sum for each day with 'sum(axis=1)'
port_value = positions * closing_prices
port_value['total'] = port_value.sum(axis=1)
print(port_value.round(2))
tickers NHY.OL YAR.OL total
date
2020-07-27 0.00 381.8 381.80
2020-07-28 26.49 382.4 408.89
2020-07-29 26.39 377.9 404.29
2020-07-30 25.05 372.0 397.05
2020-07-31 25.42 0.0 25.42
UPDATE - suggestions for further work
Include traded price in the Positions data frame.
Also include trade timestamp in the Positions data frame.
The end-of-day portfolio value would use end-of-day prices. Profit/loss also includes purchase/sale price. Which do you want?
The data frame Index (and MultiIndex) along with broadcasting are relevant concepts for this application.
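To illustrate why the Index matters here, a small sketch: pandas aligns two data frames on both row and column labels before multiplying, so the order of rows and columns does not need to match. The frames are cut-down, partly made-up versions of the ones above:

```python
import pandas as pd

dates = pd.to_datetime(["2020-07-27", "2020-07-28"])

closing_prices = pd.DataFrame(
    {"NHY.OL": [26.35, 26.49], "YAR.OL": [381.80, 382.40]}, index=dates
)

# Same labels, but columns and rows deliberately in a different order.
positions = pd.DataFrame(
    {"YAR.OL": [1.0, 1.0], "NHY.OL": [1.0, 0.0]}, index=dates[::-1]
)

# Multiplication matches cells by (date, ticker) label, not by position.
port_value = positions * closing_prices
total = port_value.sum(axis=1)
print(port_value)
print(total)
```

Labels present in only one of the frames would come out as NaN, which is why the answer reindexes positions to closing_prices.index first.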
Here is Python code that reads a CSV file from an Alpha Vantage URL and converts it to a pandas data frame. There are multiple issues with it.
Before describing the issues, here is the code:
dailyurl = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=NSE:{}&apikey=key&outputsize=full&datatype=csv'.format(Ticker)
cols = ['timestamp', 'open', 'high', 'low', 'close','adjusted_close','volume','dividend_amount','split_coefficient']
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None,names=cols)
dfmonthly = pd.read_csv(monthlyurl, skiprows=0, header=None,names=cols)
dfdaily.rename(columns = {'timestamp':'date'}, inplace = True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.drop(dfdaily.index[:1], inplace=True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.reset_index(inplace=True, drop=False)
print(dfdaily.head(6))
Issues:
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None,names=cols) returns values that do not seem to fit a numeric pandas dataframe (it looks like they are strings), so when I use this dataframe I get the error "high is not double".
The value returned from this URL contains a multi-index, as below:
0 1 2 3 4
0 Timestamp open High Low close
1 09-02-2017 100 110 99 96
The first row of column labels above (0, 1, 2, 3, 4) is not wanted, so I added
dfdaily.drop(dfdaily.index[:1], inplace=True). Is there a better way to convert this CSV to a pandas dataframe?
Since the values are read as strings, I tried converting the dataframe to numeric values with this line:
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
This converts the date values to 0.0, which defeats the purpose: the dates should be kept as they are. With this many lines of conversion code it also takes a long time, so I really need a better way to get the desired output.
The output I am getting is :
index date open high low close adjusted_close volume
0 1 0.0 1629.05 1655.00 1617.30 1639.40 1639.40 703720.0
1 2 0.0 1654.00 1679.00 1638.05 1662.15 1662.15 750746.0
2 3 0.0 1680.00 1687.00 1620.60 1641.65 1641.65 1466983.0
3 4 0.0 1530.00 1683.75 1511.20 1662.15 1662.15 2109416.0
4 5 0.0 1600.00 1627.95 1546.50 1604.95 1604.95 1472164.0
5 6 0.0 1708.05 1713.00 1620.20 1628.90 1628.90 1645045.0
The multi-index is not required, and the date should stay a date rather than "0";
the other columns (open, high, low, close) should be numeric.
Please shed some light on how to optimize this: I would like concise code that yields a numeric pandas dataframe indexed by "date", so that it can be used for further arithmetic and logical operations.
I think you need to omit the names parameter, because the CSV already has a header. For a DatetimeIndex, add index_col to make the first column the index and parse_dates to convert it to datetimes. Finally, rename_axis renames timestamp to date:
dfdaily = pd.read_csv(dailyurl, index_col=[0], parse_dates=[0])
dfdaily = dfdaily.rename_axis('date')
print (dfdaily.head())
open high low close adjusted_close volume \
date
2018-02-09 20.25 21.0 20.25 20.25 20.25 21700
2018-02-08 20.50 20.5 20.25 20.50 20.50 1688900
2018-02-07 20.50 20.5 20.25 20.50 20.50 301800
2018-02-06 20.25 21.0 20.25 20.25 20.25 39400
2018-02-05 20.50 21.0 20.25 20.50 20.50 5400
dividend_amount split_coefficient
date
2018-02-09 0.0 1.0
2018-02-08 0.0 1.0
2018-02-07 0.0 1.0
2018-02-06 0.0 1.0
2018-02-05 0.0 1.0
print (dfdaily.dtypes)
open float64
high float64
low float64
close float64
adjusted_close float64
volume int64
dividend_amount float64
split_coefficient float64
dtype: object
print (dfdaily.index)
DatetimeIndex(['2018-02-09', '2018-02-08', '2018-02-07', '2018-02-06',
'2018-02-05', '2018-02-02', '2018-02-01', '2018-01-31',
'2018-01-30', '2018-01-29',
...
'2000-01-14', '2000-01-13', '2000-01-12', '2000-01-11',
'2000-01-10', '2000-01-07', '2000-01-06', '2000-01-05',
'2000-01-04', '2000-01-03'],
dtype='datetime64[ns]', name='date', length=4556, freq=None)
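The same approach can be tried without network access. In this sketch csv_text is a hypothetical stand-in for the CSV body the URL returns, with fewer columns than the real response, and the final sort_index is an optional extra so the dates run oldest to newest:

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for the CSV text returned by the URL (header row included).
csv_text = """timestamp,open,high,low,close,volume
2018-02-09,20.25,21.0,20.25,20.25,21700
2018-02-08,20.50,20.5,20.25,20.50,1688900
"""

# No names= argument: the header row supplies the column names,
# so the numeric columns are parsed as numbers rather than strings.
dfdaily = pd.read_csv(StringIO(csv_text), index_col=0, parse_dates=[0])
dfdaily = dfdaily.rename_axis("date").sort_index()

print(dfdaily.dtypes)
print(dfdaily.index)
```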
I have a dataframe (df) where column A holds drug units dosed at the time given by Timestamp. I want to fill the missing values (NaN) with the drug concentration implied by the drug's half-life (180 minutes). I am struggling with the pandas code and would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 minutes, I want to fill the NaN values as a function of the time elapsed and the half-life,
something like:
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO  # Python 3; the Python 2 module was StringIO
# columns are separated by two or more spaces
text = """TimeStamp  A
1991-04-21 09:09:00  9.0
1991-04-21 13:00:00  NaN
1991-04-21 19:00:00  NaN
1991-04-22 07:35:00  10.0
1991-04-22 13:40:00  NaN
1991-04-22 16:56:00  NaN"""
df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
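On recent pandas versions, groupby(...).apply may return a result indexed differently from the original frame, so here is a hedged variant of the same idea that uses transform("first") to look up the most recent dose time. The CSV layout and the names half_life_s, dose_time and filled are my own choices for the sketch, not part of the answer above:

```python
from io import StringIO
import pandas as pd

# Same observations as above, written as comma-separated text; empty A means NaN.
text = """TimeStamp,A
1991-04-21 09:09:00,9.0
1991-04-21 13:00:00,
1991-04-21 19:00:00,
1991-04-22 07:35:00,10.0
1991-04-22 13:40:00,
1991-04-22 16:56:00,
"""
df = pd.read_csv(StringIO(text), parse_dates=["TimeStamp"])

half_life_s = 180 * 60  # 180 minutes in seconds

# Each dose starts a new group that also contains the NaN rows following it.
group = df["A"].notna().cumsum()

# Timestamp of the dose that opens each group, broadcast to every row in the group.
dose_time = df["TimeStamp"].groupby(group).transform("first")
elapsed = (df["TimeStamp"] - dose_time).dt.total_seconds()

# Half-life decay: the amount halves every half_life_s seconds.
filled = df["A"].ffill() * 0.5 ** (elapsed / half_life_s)
print(filled)
```

Writing the decay as 0.5 ** (elapsed / half_life_s) is algebraically the same as np.exp(-tdiffs / lamda) above, so the values should match the output shown.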