Sorry if this seems like a stupid question,
I have a dataset which looks like this:
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed
T 2017-10-07 10:44:48 28.750766667 77.088805000 783.5 0.0 2017-10-07_10-44-48 0.0 00:00:00
T 2017-10-07 10:44:58 28.752345000 77.087840000 853.5 7.8 198.70532 00:00:10
T 2017-10-07 10:45:00 28.752501667 77.087705000 854.5 7.7 220.53915 00:00:12
I'm not exactly sure how to approach this. Calculating acceleration requires taking the difference of speed and time; any suggestions on what I may try?
Thanks in advance
Assuming your data was loaded from a CSV as follows:
type,time,latitude,longitude,altitude (m),speed (km/h),name,desc,currentdistance,timeelapsed
T,2017-10-07 10:44:48,28.750766667,77.088805000,783.5,0.0,2017-10-07_10-44-48,,0.0,00:00:00
T,2017-10-07 10:44:58,28.752345000,77.087840000,853.5,7.8,,,198.70532,00:00:10
T,2017-10-07 10:45:00,28.752501667,77.087705000,854.5,7.7,,,220.53915,00:00:12
The time column is converted to a datetime object, and the timeelapsed column is converted into seconds. From this you could add an acceleration column by
calculating the difference in speed (km/h) between each row and dividing by the difference in time between each row as follows:
from datetime import datetime
import pandas as pd
df = pd.read_csv('input.csv', parse_dates=['time'], dtype={'name': str, 'desc': str})
# convert the HH:MM:SS timeelapsed strings into total seconds
df['timeelapsed'] = (pd.to_datetime(df['timeelapsed'], format='%H:%M:%S') - datetime(1900, 1, 1)).dt.total_seconds()
# acceleration in (km/h) per second: row-to-row change in speed over change in time
df['acceleration'] = (df['speed (km/h)'] - df['speed (km/h)'].shift(1)) / (df['timeelapsed'] - df['timeelapsed'].shift(1))
print(df)
Giving you:
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed acceleration
0 T 2017-10-07 10:44:48 28.750767 77.088805 783.5 0.0 2017-10-07_10-44-48 NaN 0.00000 0.0 NaN
1 T 2017-10-07 10:44:58 28.752345 77.087840 853.5 7.8 NaN NaN 198.70532 10.0 0.78
2 T 2017-10-07 10:45:00 28.752502 77.087705 854.5 7.7 NaN NaN 220.53915 12.0 -0.05
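A side note on units: speed is in km/h while timeelapsed is in seconds, so the column above is in (km/h) per second. If you want SI units (m/s²) instead, a small sketch on the same dataframe (assuming df was built as above):
# 1 km/h = 1/3.6 m/s, so convert the speed first
df['acceleration (m/s^2)'] = (df['speed (km/h)'] / 3.6).diff() / df['timeelapsed'].diff()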
Input
New Time
11:59:57
12:42:10
12:48:45
18:44:53
18:49:06
21:49:54
21:54:48
5:28:20
Below is the code I wrote to create the interval in minutes.
import pandas as pd
df = pd.read_csv(r"D:\test\test1.csv")
# difference between consecutive times, floored to whole minutes
df['Interval in min'] = (pd.to_timedelta(df['New Time'].astype(str)).diff(1).dt.floor('T').dt.total_seconds().div(60))
print(df)
Output
New Time Interval in min
11:59:57 NaN
12:42:10 42.0
12:48:45 6.0
18:44:53 356.0
18:49:06 4.0
21:49:54 180.0
21:54:48 4.0
5:28:20 -987.0
The last interval, i.e. -987 min, is not correct; it should rather be 453 min (+1 day).
Assuming you want to consider a negative difference to be a new day, you could use:
s = pd.to_timedelta(df['New Time']).diff()
# a negative diff means the clock wrapped past midnight: add one day to those rows
df['Interval in min'] = (s
    .add(pd.to_timedelta(s.lt(pd.Timedelta(0)).astype(int), unit='d'))
    .dt.floor('T').dt.total_seconds().div(60)
)
Output:
New Time Interval in min
0 11:59:57 NaN
1 12:42:10 42.0
2 12:48:45 6.0
3 18:44:53 356.0
4 18:49:06 4.0
5 21:49:54 180.0
6 21:54:48 4.0
7 5:28:20 453.0
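Alternatively (a sketch on the same data), since each interval spans at most one midnight wrap, you could take the difference modulo one day, which maps -987 min to 453 min directly:
s = pd.to_timedelta(df['New Time']).diff()
# modulo one day turns a negative wrap-around diff into its positive remainder
df['Interval in min'] = (s % pd.Timedelta(days=1)).dt.floor('T').dt.total_seconds().div(60)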
This table is a pandas dataframe. Can someone help me out with writing a function that shows the probability of the price going up for 5 consecutive days in a row over the past 1000 days? Then I will know the probability of the price going up tomorrow if the past 4 days' prices have been increasing.
Any help is appreciated.
import ccxt
import pandas as pd
binance = ccxt.binance()
def get_price(pair):
    df = binance.fetch_ohlcv(pair, timeframe="1d", limit=1000)  # limit = 30
    df = pd.DataFrame(df).rename(columns={0: "date", 1: "open", 2: "high", 3: "low", 4: "close", 5: "volume"})
    df['date'] = pd.to_datetime(df['date'], unit='ms') + pd.Timedelta(hours=8)
    df.set_index("date", inplace=True)
    return df
df = get_price("BTC/USDT")
df["daily_return"] = df.close.pct_change()
Based on the context, I think you're after the empirical probability, in which case this is a simple one-liner using pandas rolling. If this isn't the case, you probably need to explain what you mean by "probability", or explain in words what you're after.
df["probability_5"] = (df["daily_return"] > 0).rolling(5).mean()
df[["daily_return", "probability_5"]].head(15)
Output:
daily_return probability_5
date
2017-08-17 08:00:00 NaN NaN
2017-08-18 08:00:00 -0.041238 NaN
2017-08-19 08:00:00 0.007694 NaN
2017-08-20 08:00:00 -0.012969 NaN
2017-08-21 08:00:00 -0.017201 0.2
2017-08-22 08:00:00 0.005976 0.4
2017-08-23 08:00:00 0.018319 0.6
2017-08-24 08:00:00 0.049101 0.6
2017-08-25 08:00:00 -0.008186 0.6
2017-08-26 08:00:00 0.013260 0.8
2017-08-27 08:00:00 -0.006324 0.6
2017-08-28 08:00:00 0.017791 0.6
2017-08-29 08:00:00 0.045773 0.6
2017-08-30 08:00:00 -0.007050 0.6
2017-08-31 08:00:00 0.037266 0.6
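Note that the rolling mean above is the fraction of up days within each 5-day window. If you literally want the conditional probability of an up day given that the previous 4 days were all up, a sketch of an empirical estimate (reusing the df built above) could be:
up = df["daily_return"] > 0
# True where the 4 preceding days (excluding today) were all up days
prev4_up = up.astype(int).rolling(4).sum().shift(1).eq(4)
# empirical P(price up today | previous 4 days all up)
print(up[prev4_up].mean())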
Just to frame the question properly, I believe you are trying to calculate the relative frequency of n consecutive positive or negative days in a price series/array.
Some research:
https://medium.com/@mikeharrisny/probability-in-trading-and-finance-96344108e1d9
Please see my implementation below, using a pandas DataFrame:
import pandas as pd
random_prices = [100, 90, 95, 98, 99, 98, 97, 100, 99, 98]
df = pd.DataFrame(random_prices, columns=['price'])
def consecutive_days_proba(pandas_series, n, positive=True):
    # Transform to daily returns
    daily_return = pandas_series.pct_change()
    # Drop NA values; this shortens the original series by one
    daily_return.dropna(inplace=True)
    # Count the total number of days in the new series
    total_days = len(daily_return)
    if positive:
        # count the number of n consecutive days with positive returns
        consecutive_n = ((daily_return > 0).rolling(n).sum() == n).sum()
    else:
        # count the number of n consecutive days with negative returns
        consecutive_n = ((daily_return < 0).rolling(n).sum() == n).sum()
    return ((consecutive_n / total_days) * 100).round(2)
consecutive_days_proba(df['price'], n=3, positive=True)
So this returns 11.11%, which is 1/9. Although the original series has a length of 10, I don't think it makes sense to use the null days as part of the base.
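If the df from get_price above is available, you could then apply this to the original question (my assumption about the intended call) as:
# relative frequency of 5 consecutive up days over the last 1000 daily closes
print(consecutive_days_proba(df['close'], n=5, positive=True))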
I have to normalize the values of the dataframe column Allocation by month.
data=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.001905 9.55 0.0 0.0
2018-11-01 00:15:00 0.001794 9.55 0.0 0.0
2018-11-01 00:30:00 0.001700 9.55 0.0 0.0
2018-11-01 00:45:00 0.001607 9.55 0.0 0.0
This means, if we are in 2018-11, divide Allocation by 11.116, while in 2018-12, divide Allocation by 2473.65, and so on... (These values come from a list Volume, where Volume[0] corresponds to 2018-11, until Volume[7] corresponds to 2019-06.)
Date_From is an index and a timestamp.
data_normalized=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.000171 9.55 0.0 0.0
2018-11-01 00:15:00 0.000097 9.55 0.0 0.0
...
My approach was the use of itertuples:
for row in data.itertuples(index=True, name='index'):
    if row.index == '2018-11':
        data['Allocation'] / Volume[0]
Here, the if statement is never true...
Another approach was
if ((row.index >='2018-11-01 00:00:00') & (row.index<='2018-11-31 23:45:00')):
However, here I get the error TypeError: '>=' not supported between instances of 'builtin_function_or_method' and 'str'
Can I solve my problem with this approach, or should I use a different one? I am happy about any help.
Cheers!
Maybe you can put your list Volume in a dataframe where the date (or index) is the first day of every month.
import pandas as pd
import numpy as np
N = 16
date = pd.date_range(start='2018-01-01', periods=N, freq="15d")
df = pd.DataFrame({"date": date, "Allocation": np.random.randn(N)})
# A dataframe where every month is associated with a volume
df_vol = pd.DataFrame({"month": pd.date_range(start="2018-01-01", periods=8, freq="MS"),
                       "Volume": np.arange(8) + 1})
# convert every date to the beginning of its month
df["month"] = df["date"].dt.to_period("M").dt.to_timestamp()
# merge
df1 = pd.merge(df, df_vol, on="month", how="left")
# divide Allocation by Volume.
# Now it's vectorial, as every date was merged with the right volume.
df1["norm"] = df1["Allocation"] / df1["Volume"]
Here is the Python code which tries to read a CSV file from the Alphavantage URL and convert it to a pandas dataframe. There are multiple issues with this.
Before raising the issues, here is the code:
dailyurl = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=NSE:{}&apikey=key&outputsize=full&datatype=csv'.format(Ticker)
cols = ['timestamp', 'open', 'high', 'low', 'close','adjusted_close','volume','dividend_amount','split_coefficient']
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None,names=cols)
dfmonthly = pd.read_csv(monthlyurl, skiprows=0, header=None,names=cols)
dfdaily.rename(columns = {'timestamp':'date'}, inplace = True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.drop(dfdaily.index[:1], inplace=True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.reset_index(inplace=True, drop=False)
print(dfdaily.head(6))
Issues:
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None, names=cols) returns values that do not seem to match a pandas dataframe (it looks like it contains strings), hence when I use this dataframe I get the error "high is not double".
The URL's return value contains a multi-index, as below:
0 1 2 3 4
0 Timestamp open High Low close
1 09-02-2017 100 110 99 96
In the above, the first 0,1,2,3,4 column index is not wanted, hence I added dfdaily.drop(dfdaily.index[:1], inplace=True). Is there a better way to get the dataframe output when converting this from CSV to a pandas dataframe?
As I see the read values are strings, I tried making the dataframe numeric by using this line:
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
but this converts the date values to 0.0, which defeats the purpose; the dates should be retained as they are. And with this many lines of code for converting the dataframe it takes a lot of time, so a better way of getting the desired output is really needed.
The output I am getting is:
index date open high low close adjusted_close volume
0 1 0.0 1629.05 1655.00 1617.30 1639.40 1639.40 703720.0
1 2 0.0 1654.00 1679.00 1638.05 1662.15 1662.15 750746.0
2 3 0.0 1680.00 1687.00 1620.60 1641.65 1641.65 1466983.0
3 4 0.0 1530.00 1683.75 1511.20 1662.15 1662.15 2109416.0
4 5 0.0 1600.00 1627.95 1546.50 1604.95 1604.95 1472164.0
5 6 0.0 1708.05 1713.00 1620.20 1628.90 1628.90 1645045.0
A multi-index is not required, the date shall stay a date (not "0"), and the other columns (open, high, low, close) shall be in numerical format.
Please shed some light on this optimization: nice code which gives a numerical pandas dataframe with "date" as the index, so that it can be used for further arithmetic and logical operations.
I think you need to omit the parameter names, because the CSV already has a header. Also, for a DatetimeIndex, add the parameter index_col to set the first column as the index, and parse_dates to convert it to datetimes. Finally, rename_axis renames timestamp to date:
dfdaily = pd.read_csv(dailyurl, index_col=[0], parse_dates=[0])
dfdaily = dfdaily.rename_axis('date')
print (dfdaily.head())
open high low close adjusted_close volume \
date
2018-02-09 20.25 21.0 20.25 20.25 20.25 21700
2018-02-08 20.50 20.5 20.25 20.50 20.50 1688900
2018-02-07 20.50 20.5 20.25 20.50 20.50 301800
2018-02-06 20.25 21.0 20.25 20.25 20.25 39400
2018-02-05 20.50 21.0 20.25 20.50 20.50 5400
dividend_amount split_coefficient
date
2018-02-09 0.0 1.0
2018-02-08 0.0 1.0
2018-02-07 0.0 1.0
2018-02-06 0.0 1.0
2018-02-05 0.0 1.0
print (dfdaily.dtypes)
open float64
high float64
low float64
close float64
adjusted_close float64
volume int64
dividend_amount float64
split_coefficient float64
dtype: object
print (dfdaily.index)
DatetimeIndex(['2018-02-09', '2018-02-08', '2018-02-07', '2018-02-06',
'2018-02-05', '2018-02-02', '2018-02-01', '2018-01-31',
'2018-01-30', '2018-01-29',
...
'2000-01-14', '2000-01-13', '2000-01-12', '2000-01-11',
'2000-01-10', '2000-01-07', '2000-01-06', '2000-01-05',
'2000-01-04', '2000-01-03'],
dtype='datetime64[ns]', name='date', length=4556, freq=None)
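One small follow-up (my observation, not part of the original answer): the index above is in descending date order, so for further time-series arithmetic such as pct_change or resampling you would typically sort it first:
dfdaily = dfdaily.sort_index()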
I have a dataframe (df) where column A is the drug units dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration, given the half-life of the drug (180 mins). I am struggling with the code in pandas. Would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I wanted to fillna(values) as a function of the time elapsed and the half-life of the drug,
something like
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp            A
1991-04-21 09:09:00  9.0
1991-04-21 13:00:00  NaN
1991-04-21 19:00:00  NaN
1991-04-22 07:35:00  10.0
1991-04-22 13:40:00  NaN
1991-04-22 16:56:00  NaN"""
df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
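As a quick sanity check of the decay logic (reusing lamda from above): at exactly one half-life, 10,800 seconds, the decay factor should be 0.5:
print(np.exp(-10800 / lamda))  # 0.5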