Formatting time and plotting it - Python

I have the following Excel file, with timestamps in the format
20180821_2330
covering a lot of days.
1) How would I convert them to a standard datetime so that I can plot them against the other sensor values?
2) I would like to have one big plot of, for example, the sensor 1 readings against all the days. Is that possible?
https://www.mediafire.com/file/m36ha4777d6epvd/median_data.xlsx/file

Is this something you are looking for? I improvised and created an 'n' dict to build a data frame that could stand in for your 'timestamp' column. Basically, what I think you should do is apply a function - let's call it 'apply_fun' - to the column that stores the timestamps; it takes each element and parses it with strptime().
import datetime
import pandas as pd

n = {'timestamp': ['20180822_2330', '20180821_2334', '20180821_2334', '20180821_2330']}
data_series = pd.DataFrame(n)

def format_dates(n):
    x = n.find('_')
    y = datetime.datetime.strptime(n[:x] + n[x+1:], '%Y%m%d%H%M')
    return y

def apply_fun(dataset):
    dataset['timestamp2'] = dataset['timestamp'].apply(format_dates)
    return dataset

print(apply_fun(data_series))
When it comes to the 2nd point, I am not able to reach the site due to the McAfee agent at work, which does not allow me to open it. Once you have the 1st, you can ask about the 2nd separately.
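For the first point on its own, a minimal sketch (the column names and values below are made up, since I cannot open the file): pandas can parse the whole timestamp column in one call with an explicit format, after which plotting against a sensor column is straightforward.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical stand-in for your Excel data
df = pd.DataFrame({
    'timestamp': ['20180821_2330', '20180822_2330', '20180823_2330'],
    'sensor1': [1.2, 2.5, 1.8],
})
# parse the whole column at once; the underscore goes straight into the format
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y%m%d_%H%M')
df.plot(x='timestamp', y='sensor1')
plt.show()
```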

Big O notation: limited input

As an exercise, I am trying to set up a Monte Carlo simulation on a chosen ticker symbol.
from numpy.random import randint
from datetime import date
from datetime import timedelta
import pandas as pd
import yfinance as yf
from math import log

# ticker symbol
ticker_input = "AAPL"  # change
# start day + end day for the Yahoo Finance API, 5 years of data
end_date = date.today()
start_date = end_date - timedelta(days=1826)
# retrieve data from Yahoo Finance
data = yf.download(ticker_input, start_date, end_date)
yf_data = data.reset_index()
# dataframe: define columns
df = pd.DataFrame(columns=['date', 'ln_change', 'open_price', 'random_num'])
open_price = yf_data["Open"].values
date_historical = yf_data["Date"].values
# list order: descending
open_price[:] = open_price[::-1]
date_historical[:] = date_historical[::-1]
# Populate data into dataframe
for i in range(0, len(open_price)-1):
    # date
    day = date_historical[i]
    # ln_change: log base 2 of the day-over-day open-price ratio
    lnc = log(open_price[i]/open_price[i+1], 2)
    # random number
    rnd = randint(1, 1258)
    # row order matches the column definition above
    df.loc[i] = [day, lnc, open_price[i], rnd]
I was wondering how to calculate Big O when you have e.g. nested loops or exponential complexity but a limited input like the one in my example: the maximum input size is 1259 floats, and it is not going to change.
How do you calculate code complexity in that scenario?
It is a matter of points of view. Both ways of seeing it are technically correct. The question is: What information do you wish to convey to the reader?
Consider the following code:
quadraticAlgorithm(n) {
    for (i <- 1...n)
        for (j <- 1...n)
            doSomethingConstant();
}

quadraticAlgorithm(1000);
The function is clearly O(n²). And yet the program will always run in the same, constant time, because it just contains one function call with n=1000. It is still perfectly valid to refer to the function as O(n²). And we can refer to the program as O(1).
But sometimes the boundaries are not that clear. Then it is up to you to choose if you wish to see it as an algorithm with a time complexity as some function of n, or as a piece of constant code that runs in O(1). The importance is to make it clear to the reader how you define things.
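To make the distinction concrete, here is the pseudocode above as a runnable Python sketch that counts the constant-time steps:

```python
def quadratic_algorithm(n):
    # O(n^2) as a function of n: the nested loops perform n*n constant-time steps
    steps = 0
    for i in range(n):
        for j in range(n):
            steps += 1  # stands in for doSomethingConstant()
    return steps

# As a whole program with the fixed input n=1000, this always performs
# exactly 1,000,000 steps, i.e. it runs in constant time: O(1).
print(quadratic_algorithm(1000))  # prints 1000000
```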

How do I make this function iterable (getting IndexError)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
import urllib
import datetime

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the positional index for that label
    iloc_ix = short.index.get_loc(ix)
    # slice +/-10 rows (+11 because that is how slices work), i.e. 10 trading days either side
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting Series in a list.
I modified your function a bit: I removed the DataFrame creation so it is only done once, and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like day/month/year, so I converted them from that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return a Series for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    try:
        # get the index label of the row of interest
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the positional index for that label
        iloc_ix = short.index.get_loc(ix)
        # slice +/-10 rows (+11 because that is how slices work), i.e. 10 trading days either side
        short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. You should probably ask a separate question regarding that; its minimal reproducible example should just include a few small Pandas Series with a couple of different date ranges and tickers.
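If you do later want everything in one table, one hedged sketch (all names and sample values below are made up): keep only the successful lookups and stack them with pandas.concat, labelling each block with its ticker and issue date.

```python
import pandas as pd

# stand-ins for what get_data() would return: a small frame, and a failed lookup
r1 = pd.DataFrame({'Date': pd.to_datetime(['2017-08-04', '2017-08-07']),
                   'ShortVolume': [31685, 30000]})
results = [r1, None]
keys = [('ARAY', '07/08/2017'), ('ACETQ', '16/11/2015')]

# drop the failed lookups and their keys in step
frames = [r for r in results if r is not None]
labels = [k for r, k in zip(results, keys) if r is not None]

# each block gets a (ticker, issue_date) prefix on its index
combined = pd.concat(frames, keys=labels, names=['ticker', 'issue_date', 'row'])
print(combined)
```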

Error "numpy.float64 object is not iterable" for CSV file creation in Python

I have some very noisy (astronomy) data in csv format. Its shape is (815900, 2): 815k points giving the mass of a disk at a certain time. The fluctuations are pretty noticeable when you look at it close up. For example, here is a snippet of the data, where the first column is time in seconds and the second is mass in kg:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
So it looks like there is a 1.53E+028 data point of noise, and also probably the 2.19E+028 and 2.35E+028 points.
To fix this, I am trying to write a Python script that will read in the csv data, apply a restriction so that if the mass is e.g. < 2.35E+028 the whole row is removed, and then create a new csv file with only the "good" data points:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41242600,2.40936E+028
Following the top answer by n8henrie to this old question, I so far have:
import pandas as pd
import csv

# Here are the locations of the csv file of my original data and an EMPTY csv
# file that will contain my good, noiseless set of data
originaldata = '/Users/myname/anaconda2/originaldata.csv'
gooddata = '/Users/myname/anaconda2/gooddata.csv'
# I use pandas to read in the original data because then I can separate the
# columns of time as 't' and mass as 'm'
originaldata = pd.read_csv('originaldata.csv', delimiter=',', header=None, names=['t','m'])
# Numerical values of the mass column
M = originaldata['m'].values
# Now to put a restriction in
for row in M:
    new_row = []
    for column in row:
        if column > 2.35E+028:
            new_row.append(column)
    csv.writer(open(newfile,'a')).writerow(new_row)
print('\n\n')
print('After:')
print(open(newfile).read())
However, when I run this, I get this error:
TypeError: 'numpy.float64' object is not iterable
I know the first column (time) is dtype int64 and the second column (mass) is dtype float64... but as a beginner, I'm still not quite sure what this error means or where I'm going wrong. Any help at all would be appreciated. Thank you very much in advance.
You can select rows by a boolean operation. Example:
import pandas as pd
from io import StringIO
data = StringIO('''\
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
''')
df = pd.read_csv(data,names=['t','m'])
good = df[df.m > 2.35e+28]
out = StringIO()
good.to_csv(out,index=False,header=False)
print(out.getvalue())
Output:
40023700,2.40896e+28
40145700,2.44487e+28
40267700,2.44487e+28
40389700,2.44478e+28
40755400,2.44496e+28
40877200,2.44489e+28
40999000,2.44489e+28
41242600,2.40936e+28
This returns a one-dimensional column of scalars: M = originaldata['m'].values
So when you do for row in M:, each row is a single numpy.float64 value; it is not a sequence, so the inner for column in row: cannot iterate over it.
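A minimal reproduction of the error, using a couple of made-up mass values:

```python
import numpy as np

M = np.array([2.40896e28, 1.535e28])  # like originaldata['m'].values
row = M[0]                            # a single numpy.float64 scalar
try:
    for column in row:                # a scalar cannot be iterated
        pass
except TypeError as err:
    print(err)                        # the error from the question
```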

How do I execute my function on an array of data in Python?

In MATLAB, if I write my own function I can pass an array to that function and it automagically just deals with it. I'm trying to do the same thing in Python, the other Data Science language, and it's not dealing with it.
Is there an easy way to do this without having to make loops every time I want to operate on all values in an array? That's a lot more work! It seems like the great minds working in Python would have had something like this need before.
I tried switching the data type to a list() since that's iterable, but that didn't seem to work. I'm still getting an error that basically says it doesn't want an array object.
Here's my code:
import scipy
from collections import deque
import numpy as np
import os
from datetime import date, timedelta

def GetExcelData(filename, rowNum, titleCol):
    csv = np.genfromtxt(filename, delimiter=",")
    Dates = deque(csv[rowNum,:])
    if titleCol == True:
        Dates.popleft()
    return list(Dates)

def from_excel_ordinal(ordinal, _epoch=date(1900, 1, 1)):
    if ordinal > 59:  # the function stops working here when I pass my array
        ordinal -= 1  # Excel leap year bug, 1900 is not a leap year!
    return _epoch + timedelta(days=ordinal - 1)  # epoch is day 1

os.chdir("C:/Users/blahblahblah")
filename = "SamplePandL.csv"
Dates = GetExcelData(filename, 1, 1)
Dates = from_excel_ordinal(Dates)  # this is the call with an issue
print(Dates)
You can use the map function for that. Replace the line
Dates = from_excel_ordinal(Dates)
with the following:
Dates = list(map(from_excel_ordinal, Dates))
Now from_excel_ordinal is called for each value in Dates; the result is converted into a list and stored back into Dates.
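For example, with the from_excel_ordinal from the question and a few made-up Excel serial numbers:

```python
from datetime import date, timedelta

def from_excel_ordinal(ordinal, _epoch=date(1900, 1, 1)):
    if ordinal > 59:
        ordinal -= 1  # Excel leap year bug, 1900 is not a leap year!
    return _epoch + timedelta(days=ordinal - 1)  # epoch is day 1

ordinals = [1, 61, 43466]  # made-up sample serials
dates = list(map(from_excel_ordinal, ordinals))
print(dates)  # [datetime.date(1900, 1, 1), datetime.date(1900, 3, 1), datetime.date(2019, 1, 1)]
```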

How to convert an array of dates (format 'mm/dd/yy HH:MM:SS') to numerics?

I recently (1 week ago) decided to migrate my work from MATLAB to Python. Since I am used to MATLAB, I sometimes find it difficult to get the exact equivalent of what I want to do in Python.
Here's my problem:
I have a set of csv files that I want to process. So far, I have succeeded in loading them into groups. Each column has a size of more than 600000 x 1. One of the columns in the csv file is the time, which has the format 'mm/dd/yy HH:MM:SS'. I want to convert the time column to numbers, and I am using date2num from matplotlib for that. Is there a 'matrix' way of doing it? The equivalent command in MATLAB is datenum(time, 'mm/dd/yyyy HH:MM:SS'), where time is a 600000 x 1 matrix.
Thanks
Here is an example of the code that I am talking about:
import csv
import time
import datetime
from datetime import date
from matplotlib.dates import date2num

time = []
otherColumns = []
for d in csv.DictReader(open('MyFile.csv')):
    time.append(str(d['time']))
    otherColumns.append(float(d['otherColumns']))
timeNumeric = date2num(datetime.datetime.strptime(time, "%d/%m/%y %H:%M:%S"))
you could use a generator:
def pre_process(dict_sequence):
    for d in dict_sequence:
        d['time'] = date2num(datetime.datetime.strptime(d['time'], "%d/%m/%y %H:%M:%S"))
        yield d
now you can process your csv:
for d in pre_process(csv.DictReader(open('MyFile.csv'))):
process(d)
the advantage of this solution is that it doesn't copy sequences that are potentially large.
Edit:
So you want the contents of the file in a numpy array?
import numpy
reader = csv.DictReader(open('MyFile.csv'))
# you might want to get rid of the intermediate list if the file is really big.
data = numpy.array(list(d.values() for d in pre_process(reader)))
Now you have a nice big array that allows all kinds of operations. You want only the first column to get your 600000x1 matrix:
data[:,0] # assuming time is the first column
The closest thing in Python to MATLAB's matrix/vector operations is a list comprehension. If you would like to apply a Python function to each item in a list you could do:
new_list = [date2num(data) for data in old_list]
or
new_list = list(map(date2num, old_list))  # in Python 3, map returns an iterator, hence the list()
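For instance, applying strptime in the 'mm/dd/yy HH:MM:SS' format from the question title to a list of made-up date strings:

```python
from datetime import datetime

old_list = ['01/02/10 03:04:05', '06/07/11 08:09:10']  # made-up samples

def parse(s):
    return datetime.strptime(s, '%m/%d/%y %H:%M:%S')

new_list = [parse(s) for s in old_list]   # list comprehension
same_list = list(map(parse, old_list))    # map, wrapped in list() for Python 3
print(new_list[0])  # 2010-01-02 03:04:05
```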
