Storing Datetime in a matrix to be used to define points of interest (Python)

I have a bunch of CSV files that contain rows of dates corresponding to data, with column headers. Using pandas, I have been able to import the CSV files. I then made a CSV file that labels the points of interest by datetime, and I have imported this file with pandas as well. I need to store the start time and end time in a matrix/array/something I can call later to parse my data, which is labeled with these dates. Currently, using pd.to_datetime, I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store them. This is my third day using Python, so I apologize for the newbie question. I am a relatively advanced user of Matlab. I will provide my code, but I cannot provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3

from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio

PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:)   - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:)   - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()

startdatetime = []
enddatetime = []
# Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])

Instead of iterating through the dataframe, you can store the start and end dates in a new dataframe, convert those columns to datetimes, and then access the data with the iloc method:
dates = PeriodList[['START','END']]
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])
# You can access the dates based on index using iloc
dates.iloc[3]
# If you want the START date you can use the column name
dates.iloc[3]['START']
In case you want to store it under your existing data structure specifically, you can use a dictionary with the index as keys and the dataframe values as values:
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference between the end date and start date, you can simply subtract the columns, i.e.
dates['Difference'] = dates['END']-dates['START']
I suggest you go through the pandas documentation for more info about accessing the data.
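To then use those start/end pairs for the stated goal of pulling out the rows of the data that fall inside each period, here is a rough sketch (my addition, not from the thread; it assumes Pdata_1 is kept as a DataFrame and that its timestamp column is called 'time', which is a hypothetical name):
# Hedged sketch: boolean-mask slicing of the parametric data for each period.
# 'time' is a placeholder; substitute the actual timestamp column name.
Pdata_1['time'] = pd.to_datetime(Pdata_1['time'])
for d in range(len(dates)):
    mask = (Pdata_1['time'] >= dates.iloc[d]['START']) & \
           (Pdata_1['time'] <= dates.iloc[d]['END'])
    period_slice = Pdata_1.loc[mask]
    # ...analyze period_slice here...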
Edit:
You can also use dictionaries in your code, i.e.
startdatetime = {}
enddatetime = {}
# Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
Hope this helps

Figured out a solution: preallocate lists of empty strings, so the loop can store a value at each index on every iteration. Since the placeholders are empty strings, there is no "cannot convert to float" error. Thanks for the help, @Bharath Shetty.
Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:)   - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:)   - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()

startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]
# Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])
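A more compact variant of the same idea (a sketch, not from the thread; it assumes columns 2 and 3 of PeriodList hold the start and end strings, exactly as in the loop above): pd.to_datetime accepts a whole array at once, so the preallocation and loop can be dropped.
# Vectorized sketch: convert each column of date strings in a single call.
# Both results are DatetimeIndex objects that can be indexed like arrays.
startdatetime = pd.to_datetime(PeriodList[:, 2])
enddatetime = pd.to_datetime(PeriodList[:, 3])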

Related

Big O notation: limited input

As an exercise, I am trying to set up a Monte Carlo simulation on a chosen ticker symbol.
from numpy.random import randint
from datetime import date
from datetime import timedelta
import pandas as pd
import yfinance as yf
from math import log

# ticker symbol
ticker_input = "AAPL"  # change

# start day + end day for Yahoo Finance API, 5 years of data
start_date = date.today()
end_date = start_date - timedelta(days=1826)

# retrieve data from Yahoo Finance
data = yf.download(ticker_input, end_date, start_date)
yf_data = data.reset_index()

# dataframe: define columns
df = pd.DataFrame(columns=['date', 'ln_change', 'open_price', 'random_num'])

open_price = []
date_historical = []
for column in yf_data:
    open_price = yf_data["Open"].values
    date_historical = yf_data["Date"].values

# list order: descending
open_price[:] = open_price[::-1]
date_historical[:] = date_historical[::-1]

# Populate data into dataframe
for i in range(0, len(open_price) - 1):
    # date
    day = date_historical[i]
    # ln_change
    lnc = log(open_price[i] / open_price[i + 1], 2)
    # random number
    rnd = randint(1, 1258)
    # values in the same order as the columns defined above
    df.loc[i] = [day, lnc, open_price[i], rnd]
I was wondering how to calculate Big O when you have, e.g., nested loops or exponential complexity but a limited input like the one in my example: the maximum input size is 1259 float values, and the input size is not going to change.
How do you calculate code complexity in that scenario?
It is a matter of point of view. Both ways of seeing it are technically correct. The question is: what information do you wish to convey to the reader?
Consider the following code:
quadraticAlgorithm(n) {
    for (i <- 1...n)
        for (j <- 1...n)
            doSomethingConstant();
}

quadraticAlgorithm(1000);
The function is clearly O(n²). And yet the program will always run in the same, constant time, because it just contains one function call with n=1000. It is still perfectly valid to refer to the function as O(n²). And we can refer to the program as O(1).
But sometimes the boundaries are not that clear. Then it is up to you to choose whether you wish to see it as an algorithm with a time complexity that is some function of n, or as a piece of constant code that runs in O(1). What matters is making clear to the reader how you define things.
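The same point in Python, as a small illustrative sketch (the function and its body are only stand-ins for the pseudocode above):
def quadratic_algorithm(n):
    # O(n^2) as a function of n: two nested loops of n iterations each.
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1          # stands in for doSomethingConstant()
    return count

# The program as a whole only ever calls it with n = 1000, so it always performs
# the same 1,000,000 constant-time steps: viewed that way, it runs in O(1).
quadratic_algorithm(1000)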

Find a certain date inside Timestamp vector

I have a certain timestamp vector and I need to find the position index of a date inside this vector. Let's say I want to find the position index of 2017-01-01 inside this vector.
Here below is the basic code that creates a ts vector:
import numpy as np
import pandas as pd

ts_vec = []
# pd.Timestamp is the public alias for pd._libs.tslibs.timestamps.Timestamp
t = pd.Timestamp('2016-03-03 00:00:00')
for i in range(1000):
    ts_vec = [*ts_vec, t]
    t = t + pd.Timedelta(days=1)
ts_vec = np.array(ts_vec)
How should I do this? Thank you!
outp = np.where(ts_vec == pd.Timestamp('2017-01-01 00:00:00'))
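A compact alternative sketch (my addition, not from the original answer): building the same daily vector as a DatetimeIndex gives a direct positional lookup, with no Python loop and no full scan with np.where.
import pandas as pd

# Hedged sketch: pd.date_range reproduces the loop's daily vector directly,
# and get_loc returns the integer position of the requested date.
ts_vec = pd.date_range('2016-03-03', periods=1000, freq='D')
pos = ts_vec.get_loc(pd.Timestamp('2017-01-01'))   # 304 for this range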

How do I make this function iterable (getting IndexError)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the iloc position of that row's index
    iloc_ix = short.index.get_loc(ix)
    # get the +/-10 iloc rows (+11 because that is how slices work), i.e. 10 trading days either side
    short_data = short.iloc[iloc_ix - 10: iloc_ix + 11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row: iterate over the csv file, pass each row's values to the function, and accumulate the resulting Series in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once, and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year, so I converted them from that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return a Series for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the iloc position of that row's index
        iloc_ix = short.index.get_loc(ix)
        # get the +/-10 iloc rows (+11 because that is how slices work), i.e. 10 trading days either side
        short_data = short.iloc[iloc_ix - 10: iloc_ix + 11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. That probably deserves a separate question; its mcve should just include a few minimal Pandas Series with a couple of different date ranges and tickers.
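That said, if a single rough table is enough, one possible sketch (my assumption, not part of the original answer) is to drop the misses and stack whatever slices came back with pd.concat, using keys so each block stays identifiable:
# Hedged sketch: discard the None results and concatenate the rest.
# keys= adds an outer index level so each block of rows stays traceable.
found = [r for r in results if r is not None]
combined = pd.concat(found, keys=list(range(len(found))))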

How to append data to a dataframe without overwriting?

I'm new to python but I need it for a personal project, and so I have this lump of code. The goal is to create a table and update it as necessary. The problem is that the table keeps being overwritten and I don't know why. Also, I'm struggling with correctly assigning the starting position of the new lines to append; that's why total (which ends up overwritten as well) and pos are there, but I haven't figured out how to use them correctly. Any tips?
import datetime
import pandas as pd
import numpy as np

total = {}
entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[i] = [timeStamp, ID, VQ]
    entryTable.to_csv("Inventory_Table.csv")
    total[i] = 1
pos = sum(total.values())
print(pos)
inventoryTable = pd.read_csv("Inventory_Table.csv", index_col=0)
Your variable i runs from index 0 up to newEntries - 1. When you add new data to row i of your Pandas dataframe, you are overwriting the existing data in that row. If you want to add new data, use n + i, where n is the initial number of entries. You can determine n with either
n = len(entryTable)
or
n = entryTable.shape[0]
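Applied to the question's loop, a minimal sketch (same file names and column order as above; the CSV is also written once, after the loop, so earlier rows are not rewritten on every pass):
import datetime
import pandas as pd

entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))

n = len(entryTable)                                # rows already in the table
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[n + i] = [timeStamp, ID, VQ]    # append after the existing rows

# write once, after the loop, so earlier rows are kept
entryTable.to_csv("Inventory_Table.csv", index=False)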

How to convert an array of dates (format 'mm/dd/yy HH:MM:SS') to numerics?

I recently (1 week ago) decided to migrate my work from Matlab to Python. Since I am used to Matlab, I sometimes find it difficult to get the exact equivalent of what I want to do in Python.
Here's my problem:
I have a set of csv files that I want to process. So far, I have succeeded in loading them into groups. Each column has a size of more than 600000 x 1. One of the columns in the csv files is the time, which has the format 'mm/dd/yy HH:MM:SS'. I want to convert the time column to numbers, and I am using date2num from matplotlib for that. Is there a 'matrix' way of doing it? The command in Matlab for doing that is datenum(time, 'mm/dd/yyyy HH:MM:SS'), where time is a 600000 x 1 matrix.
Thanks
Here is an example of the code that I am talking about:
import csv
import time
import datetime
from datetime import date
from matplotlib.dates import date2num

time = []
otherColumns = []
for d in csv.DictReader(open('MyFile.csv')):
    time.append(str(d['time']))
    otherColumns.append(float(d['otherColumns']))
timeNumeric = date2num(datetime.datetime.strptime(time, "%d/%m/%y %H:%M:%S"))
you could use a generator:
def pre_process(dict_sequence):
    for d in dict_sequence:
        d['time'] = date2num(datetime.datetime.strptime(d['time'], "%d/%m/%y %H:%M:%S"))
        yield d
now you can process your csv:
for d in pre_process(csv.DictReader(open('MyFile.csv'))):
    process(d)
the advantage of this solution is that it doesn't copy sequences that are potentially large.
Edit:
So you want the contents of the file in a numpy array?
import numpy
reader = csv.DictReader(open('MyFile.csv'))
# you might want to get rid of the intermediate list if the file is really big
data = numpy.array([list(d.values()) for d in pre_process(reader)])
Now you have a nice big array that allows all kinds of operations. If you want only the first column, to get your 600000x1 matrix:
data[:,0] # assuming time is the first column
The closest thing in Python to Matlab's matrix/vector operations is a list comprehension. If you would like to apply a Python function to each item in a list, you can do:
new_list = [date2num(data) for data in old_list]
or
new_list = list(map(date2num, old_list))
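For the 'matrix'-style operation the question asks about, a fully vectorized sketch using pandas (my assumption, not from either answer; the column name 'time' and the mm/dd/yy format come from the question):
import pandas as pd
from matplotlib.dates import date2num

# Hedged sketch: parse every timestamp in the column with one call,
# then convert the resulting datetime64 array to matplotlib date numbers.
df = pd.read_csv('MyFile.csv')
parsed = pd.to_datetime(df['time'], format='%m/%d/%y %H:%M:%S')
time_numeric = date2num(parsed.to_numpy())   # one float per row, like Matlab's datenum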
