Creating an array of timestamps between two timestamps in pyspark

Creating an array of timestamps between two timestamps in pyspark - python

I have two timestamp columns in my pyspark dataframe. I want to create a third column which has the array of timestamp hours between the two timestamps.
This is the code I wrote for that..
# Creating udf function
def getBetweenStamps(st_date, dc_date):
import numpy as np
hr = 0
date_list = []
runnig_date = st_date
while (dc_date>runnig_date):
runnig_date = st_date+timedelta(hours=hr)
date_list.append(runnig_date)
hr+=1
dates = np.array(date_list)
return dates
udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))
# Using udf function
orders.withColumn('date_array', udf_betweens(F.col('start_date'), F.col('ICUDischargeDate'))).show()
However this is showing the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I think the inputs to the functions are going in as two arrays and not as two datetimes causing the error. Is there any way around this? Any other way of solving this problem?
Thank you very much.

You are getting the error when returning numpy array from your udf. You can simply return the date_list and it will work.
def getBetweenStamps(st_date, dc_date):
import numpy as np
hr = 0
date_list = []
runnig_date = st_date
while (dc_date>runnig_date):
runnig_date = st_date+timedelta(hours=hr)
date_list.append(runnig_date)
hr+=1
return date_list
udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))
To test the above function:
df = spark.sql("select current_timestamp() as t1").withColumn("t2", col("t1") + expr("INTERVAL 1 DAYS"))
df.withColumn('date_array', udf_betweens(F.col('t1'), F.col('t2'))).show()

Related

Timeseries dataframe returns an error when using Pandas Align - valueError: cannot join with no overlapping index names

My goal:
I have two time-series data frames, one with a time interval of 1m and the other with a time interval of 5m. The 5m data frame is a resampled version of the 1m data. What I'm doing is computing a set of RSI values that correspond to the 5m df using the vectorbt library, then aligning and broadcasting these values to the 1m df using df.align
The Problem:
When trying to do this line by line, it works perfectly. Here's what the final result looks like:
However, when applying it under the function, it returns the following error while having overlapping index names:
ValueError: cannot join with no overlapping index names
Here's the complete code:
import vectorbt as vbt
import numpy as np
import pandas as pd
import datetime
end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=3)
btc_price = vbt.YFData.download('BTC-USD',
interval='1m',
start=start_date,
end=end_date,
missing_index='drop').get('Close')
def custom_indicator(close, rsi_window=14, ma_window=50):
close_5m = close.resample('5T').last()
rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill')
print(rsi) #to check
print(close) #to check
return
#setting up indicator factory
ind = vbt.IndicatorFactory(
class_name='Combination',
short_name='comb',
input_names=['close'],
param_names=['rsi_window', 'ma_window'],
output_names=['value']).from_apply_func(custom_indicator,
rsi_window=14,
ma_window=50,
keep_pd=True)
res = ind.run(btc_price, rsi_window=21, ma_window=50)
print(res)
Thank you for taking the time to read this. Any help would be appreciated!

if you checked the columns of both , rsi and close
print('close is', close.columns)
print('rsi is', rsi.columns)
you will find
rsi is MultiIndex([(14, 'Close')],
names=['rsi_window', None])
close is Index(['Close'], dtype='object')
as it has two indexes, one should be dropped, so it can be done by the below code
rsi.columns = rsi.columns.droplevel()
to drop one level of the indexes, so it could be align,

The problem is that the data must be a time series and not a pandas data frame for table joins using align
You need to fix the data type
# Time Series
close = close['Close']
close_5m = close.resample('15min').last()
rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')

When you are aligning the data make sure to include join='right'
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right'

Find a certain date inside Timestamp vector

I have a certain timestamp vector and I need to find the position index of the date inside this vector. Let's say I want to find inside this vector the position index of 2017-01-01.
Here below is the basic code that creates a ts vector:
import numpy as np
import pandas as pd
ts_vec = []
t = pd._libs.tslibs.timestamps.Timestamp('2016-03-03 00:00:00')
for i in range(1000):
ts_vec = [*ts_vec,t]
t = t+pd.Timedelta(days=1)
ts_vec = np.array(ts_vec)
How should I do this? Thank You

outp = np.where(ts_vec==pd._libs.tslibs.timestamps.Timestamp('2017-01-01 00:00:00'))

python pandas: vectorized time series window function

I have a pandas dataframe in the following format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I have written the following to create two new fields indicating 30 day windows:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
window_start_date_idx = np.argmax(x < start_date_period.end_time)
return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
window_end_date_idx = np.argmin(x > end_date_period.start_time)
return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
Unfortunately, this is far too slow doing the row-wise apply for my application. I would greatly appreciate any tips on vectorizing these functions if possible.
EDIT:
The resultant dataframe should have this layout:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
It does not need to be resampled or windowed in the formal sense. It just needs 'window_start_dt' and 'window_end_dt' columns to be added. The current code works, it just need to be vectorized if possible.

EDIT 2: pandas.cut is built-in:
tt=[[1,'2004-01-02',0.1,25,47],
[1,'2004-01-17',0.2,150,8],
[2,'2004-01-29',0.2,150,25],
[3,'2017-07-15',0.3,55,17],
[3,'2016-05-12',0.3,55,47],
[4,'2012-02-23',0.2,150,22],
[4,'2009-10-10',0.1,25,12],
[4,'2014-04-04',0.2,150,2],
[5,'2008-07-09',0.2,150,43]]
start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')
df = pd.DataFrame(tt,columns=['customer_id','transaction_dt','product','price','units'])
df['transaction_dt'] = pd.Series([pd.to_datetime(sub_t[1],format='%Y-%m-%d') for sub_t in tt])
the_cut = pd.cut(df['transaction_dt'],bins=start_date_period,right=True,labels=False,include_lowest=True)
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
print(df.head())
win_start_test and win_end_test should be equal to their counterparts computed using your function.
The ValueError was coming from not casting x to int in the relevant line. I also added a NaN check, though it wasn't needed for this toy example.
Note the change to pd.date_range and the use of the start-of-month and end-of-month flags M and MS, as well as converting the date strings into datetime.

Storing Datetime in a matrix to be used to define points of interest (Python)

I have bunch of CSV files that contain rows of dates corresponding to data, with column headers Using pandas, I have been able to import the CSV files. Now, I made a CSV file that labels the points of interest by datetime. I have also used pandas to import this file. I need to store the start time and end time in a matrix/array/something to call later to parse with my data which is labeled with these dates. Currently, using pd.to_datetime I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store this. This is my third day using Python, so I apologize for the newbie question. I am a relatively advanced user of Matlab. I will provide my code, but I will not be able to provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3
from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()
startdatetime = []
enddatetime = []
#Up to line 27 done on MatLab script
for d in range (0, list_m):
Hull = PeriodList[d,0]
Engine = PeriodList[d,1]
startdatetime[d] = pd.to_datetime(PeriodList[d,2])
enddatetime[d] = pd.to_datetime(PeriodList[d,3])
#startdatetime = pd.to_datetime(PeriodList[d,2])

Instead of iterating through the dataframe you can store the start and ending dates in a new dataframe and convert the columns to timeseries and then you can access the data by iloc method :
dates = PeriodList[['START','END']]
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])
# You can access the dates based on index using iloc
dates.iloc[3]
#If you Start date you can use the column name
dates.iloc[3]['START']
Incase you want to store specifically under existing data structure, you can use dictionary with key as index and values as dataframe values
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference of the end date and start date you can simply subtract the columns i.e
dates['Difference'] = dates['END']-dates['START']
I suggest you to go through pandas documentation for more info about accessing the data here
Edit :
You can also use dictionary in your code i.e
startdatetime = {}
enddatetime = {}
#Up to line 27 done on MatLab script
for d in range (0, list_m):
Hull = PeriodList[d,0]
Engine = PeriodList[d,1]
startdatetime[d] = pd.to_datetime(PeriodList[d,2])
enddatetime[d] = pd.to_datetime(PeriodList[d,3])
Hope this helps

Figured out a solution: Make empty strings, so then the loop stores the value each iteration. Since it is an empty string, there will not be a "cannot convert to float" error. Thanks for the help #Bharath Shetty
Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()
startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]
#Up to line 27 done on MatLab script
for d in range (0, list_m):
Hull = PeriodList[d,0]
Engine = PeriodList[d,1]
startdatetime[d] = pd.to_datetime(PeriodList[d,2])
enddatetime[d] = pd.to_datetime(PeriodList[d,3])
#startdatetime = pd.to_datetime(PeriodList[d,2])

Excel xlwings data input for Python Technical Indicators

I am trying to replicate a simple Technical-Analysis indicator using xlwings. However, the list/data seems not to be able to read Excel values. Below is the code
import pandas as pd
import datetime as dt
import numpy as np
#xw.func
def EMA(df, n):
EMA = pd.Series(pd.ewma(df['Close'], span = n, min_periods = n - 1), name = 'EMA_' + str(n))
df = df.join(EMA)
return df
When I enter a list of excel data : EMA = ({1,2,3,4,5}, 5}, I get the following error message
TypeError: list indices must be integers, not str EMA = pd.Series(pd.ewma(df['Close'], span = n, min_periods = n - 1), name = 'EMA_' + str(n))
(Expert) help much appreciated! Thanks.

EMA() expects a DataFrame df and a scalar n, and it returns the EMA in a separate column in the source DataFrame. You are passing a simple list of values, this is not supposed to work.
Construct a DataFrame and assign the values to the Close column:
v = range(100) # use your list of values instead
df = pd.DataFrame(v, columns=['Close'])
Call EMA() with this DataFrame:
EMA(df, 5)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Creating an array of timestamps between two timestamps in pyspark - python

Related

Timeseries dataframe returns an error when using Pandas Align - valueError: cannot join with no overlapping index names

Find a certain date inside Timestamp vector

python pandas: vectorized time series window function

Storing Datetime in a matrix to be used to define points of interest (Python)

Excel xlwings data input for Python Technical Indicators

Categories

Resources