How to set a timestamp as an initial value (0) - python

I have a timestamp from a database and need to make a plot (rates vs time). I have the timestamps from when the process starts and ends, but I need to make the starting timestamp equal to 0 min (initial value) and the ending value equal to 20-30ish minutes (depending on the trial). I'm not sure what to use.
Also, I have the rates as a list and need to put them in an array for matplotlib. I used np.asarray() and the result is reported as an array, but only one number (the last one) shows up on my plot. Any ideas on how to solve this?
code:
# timestamp comes out as 2.0080506043443555 e-16 because of the float
# need to change that into minutes for each run
L3time = L3timestamp.split()
del L3time[0]
for k in range(0, len(L3time)):
    print(float(L3time[k]))
#print("this is the L3 rates")
for k in range(1, len(L3time)):
    L3rate = (float(L3[k]) - float(L3[k-1]))*1000/(float(L3time[k]) - float(L3time[k-1]))
    print(float(L3rate))
# putting the L3 rate into an array
L3RateArray = np.asarray(L3rate)
# putting the timestamp into an array
timestampArray = np.asarray(L3time)
for k in inFile.readlines():
    plt.plot([timestampArray], [L3RateArray], 'ro')
plt.xlabel("time (m)")
plt.ylabel("L3 Rates (Hz)")
plt.suptitle("L3 Rates vs. Time")
plt.show()
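One possible way to handle both problems, as a sketch rather than a definitive fix: subtract the first timestamp from all of them so the run starts at 0, and compute every rate at once instead of overwriting a single L3rate variable inside the loop (which is why only the last number survives). The sample values below are made up, timestamps and counts are hypothetical stand-ins for L3time and L3, and the timestamps are assumed to be in seconds. The factor of 1000 from the original formula is left out because its units are not stated.

```python
import numpy as np

# Hypothetical sample data standing in for L3time and L3 (assumed: seconds, counts)
timestamps = [1000.0, 1060.0, 1120.0, 1180.0]
counts = [0.0, 30.0, 90.0, 120.0]

t = np.asarray(timestamps, dtype=float)
c = np.asarray(counts, dtype=float)

# Shift so the first sample is at 0, then convert seconds to minutes
elapsed_min = (t - t[0]) / 60.0

# One rate per interval; np.diff returns the differences between neighbours
rates = np.diff(c) / np.diff(t)

# Each rate is attributed to the end of its interval, so skip the first time point
x = elapsed_min[1:]

print(x)
print(rates)
```

With arrays of equal length like these, a single call such as plt.plot(x, rates, 'ro') outside of any loop, and without wrapping the arrays in extra square brackets, should draw every point.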

Related

How to find AVERAGE drawdown of 7 assets?

I'm currently tasked with finding the average drawdown of 7 assets. This is what I have so far:
end = dt.datetime.today()
start = end - dt.timedelta(365)
tickers = ["SBUX", "MCD", "CMG", "WEN", "DPZ", "YUM", "DENN"]
bench = ['SPY', 'IWM', 'DIA']
table_1 = pd.DataFrame(index=tickers)
data = yf.download(tickers+bench, start, end)['Adj Close']
log_returns = np.log(data/data.shift())
table_1["drawdown"] = (log_returns.min() - log_returns.max() ) / log_returns.max()
However, this only gives me the maximum drawdown, when I actually want the average.
You will need scipy to find local max/min:
from scipy.signal import argrelextrema
I've defined a function that calculates the local min and max of the time series. Then simply calculate the relative difference between each local maximum and next local minimum and compute the mean:
def av_dd(series):
    series = series.values # convert to numpy array
    drawdowns = []
    loc_max = argrelextrema(series, np.greater)[0] # getting indexes of local maximums
    loc_min = argrelextrema(series, np.less)[0] # getting indexes of local minimums
    # adding first value of series if first local minimum comes before first local maximum (you want the first drawdown to be taken into account)
    if series[0] > series[1]:
        loc_max = np.insert(loc_max, 0, 0)
    # adding last value of series if last local maximum comes after last local minimum (you want the last drawdown to be taken into account)
    if len(loc_max) > len(loc_min):
        loc_min = np.append(loc_min, len(series) - 1)
    for i in range(len(loc_max)):
        drawdowns.append(series[loc_min[i]] / series[loc_max[i]] - 1)
    return sum(drawdowns) / len(drawdowns)
Both if statements in the function make sure that the first and last drawdowns are also taken into account, depending on which local extrema occur at the beginning and end of the time series.
You simply need to apply this function to each column of your price DataFrame (data in the question's code):
table_1['drawdown'] = data.apply(av_dd)

Calculate average and normalize GPS data in Python

I have a dataset in json with gps coordinates:
"utc_date_and_time":"2021-06-05 13:54:34", # timestamp
"hdg":"018.0", # heading
"sog":"000.0", # speed
"lat":"5905.3262N", # latitude
"lon":"00554.2433E" # longitude
This data will be imported into a database, with one entry every second for every "vessel".
As you can imagine this is a huge amount of data that provides a level of accuracy I do not need.
My goal:
Create a new entry in the database for every X seconds
If I set X to 60 (a minute) and 10 entries are missing within this period, the remaining 50 entries should be used. Data can be missing for certain periods, and I do not want this to create bogus positions.
Use the timestamp from the last entry in the period.
Use the heading (hdg) that appears most often within this period.
Calculate the average speed within this period.
Latitude and longitude could use the last entry, but I have seen "spikes" that need to be filtered out; alternatively, use the average and remove values that differ too much.
My script is now pushing all the data to the database via a for loop with different data-checks inside it, and this is working.
I am new to Python and still learning every day through reading and YouTube videos, but it would be great if anyone could point me in the right direction on how to achieve the above goal.
As of now the data is imported into a dictionary. And I am wondering if creating a dictionary where the timestamp is the key is the way to go, but I am a little lost.
Code:
import os
import json
from pathlib import Path
from datetime import datetime, timedelta, date
def generator(data):
    for entry in data:
        yield entry
data = json.load(open("5_gps_2021-06-05T141524.1397180000.json"))["gps_data"]
gps_count = len(data)
start_time = None
new_gps = list()
tempdata = list()
seconds = 60
i = 0
for entry in generator(data):
    i = i+1
    if start_time is None:
        start_time = datetime.fromisoformat(entry['utc_date_and_time'])
    # TODO: Filter out values with too much deviation
    tempdata.append(entry)
    elapsed = (datetime.fromisoformat(entry['utc_date_and_time']) - start_time).total_seconds()
    if (elapsed >= seconds) or (i == gps_count):
        # TODO: Calculate average values etc. instead of using last
        new_gps.append(tempdata)
        tempdata = []
        start_time = None
print("GPS count before:" + str(gps_count))
print("GPS count after:" + str(len(new_gps)))
Output:
GPS count before:1186
GPS count after:20
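A minimal sketch of the aggregation described in the goal list, using only the standard library. The helper names (parse_coord, summarise, bucket) and the sample rows are assumptions made up for illustration. The coordinate parsing assumes the NMEA-style ddmm.mmmm / dddmm.mmmm layout that the sample values suggest, and the median is used as one simple way to resist spikes, not the only one. Unlike the loop above, the windows here are aligned to the first entry rather than restarted after each period.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median


def parse_coord(text, deg_digits):
    """Convert 'ddmm.mmmmH' (assumed NMEA-style) to signed decimal degrees."""
    hemisphere = text[-1]
    degrees = int(text[:deg_digits])
    minutes = float(text[deg_digits:-1])
    value = degrees + minutes / 60.0
    return -value if hemisphere in "SW" else value


def summarise(entries):
    """Collapse the entries of one period into a single record."""
    headings = Counter(e["hdg"] for e in entries)
    return {
        "utc_date_and_time": entries[-1]["utc_date_and_time"],  # last timestamp
        "hdg": headings.most_common(1)[0][0],                   # most frequent heading
        "sog": mean(float(e["sog"]) for e in entries),          # average speed
        "lat": median(parse_coord(e["lat"], 2) for e in entries),
        "lon": median(parse_coord(e["lon"], 3) for e in entries),
    }


def bucket(data, seconds=60):
    """Group entries into fixed windows measured from the first entry."""
    groups = {}
    first = datetime.fromisoformat(data[0]["utc_date_and_time"])
    for entry in data:
        stamp = datetime.fromisoformat(entry["utc_date_and_time"])
        key = int((stamp - first).total_seconds() // seconds)
        groups.setdefault(key, []).append(entry)
    # Only windows that actually contain entries produce a record
    return [summarise(groups[k]) for k in sorted(groups)]


# Made-up sample rows: (timestamp, heading, speed, latitude)
rows = [
    ("2021-06-05 13:54:34", "018.0", "000.0", "5905.3262N"),
    ("2021-06-05 13:54:35", "018.0", "002.0", "5905.3262N"),
    ("2021-06-05 13:54:36", "020.0", "004.0", "5915.3262N"),  # latitude spike
    ("2021-06-05 13:55:40", "020.0", "004.0", "5905.3300N"),  # falls in the next window
]
sample = [
    {"utc_date_and_time": t, "hdg": h, "sog": s, "lat": la, "lon": "00554.2433E"}
    for t, h, s, la in rows
]
result = bucket(sample, seconds=60)
print(len(result), result[0]["hdg"], result[0]["sog"])
```

Because each window is summarised from whatever entries it contains, missing seconds simply mean fewer values go into the average, and a window with no data produces no record at all.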

Pandas- locate a value based on logical statements

I am using this dataset for a project.
I am trying to find the total yield for each inverter over the 34-day duration of the dataset (basically the difference between the final and initial values available for each inverter). I have been able to get the list of inverters using pd.unique() (there are 22 inverters for each solar power plant).
I am having trouble querying the total_yield data for each inverter.
Here is what I have tried:
def get_yields(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True, drop=True)
        initial = initial.at[0,"TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0,"TOTAL_YIELD"]
        delta[index] = final - initial
        index = index + 1
    return delta
Reference: arr is the array of inverters, listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval. This makes this function only work for the inverters at the first plant, not the second one.
My second approach was to filter by the inverter first, then take the first and last data points. But I get an error: 'Series' objects are mutable, thus they cannot be hashed
Here is the code for that so far:
def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta
List of inverters at plant 1 for reference(labeled as SOURCE_KEY):
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']
List of inverters at plant 2:
['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
Thank you very much.
I can't download the dataset to test this; I'm getting a "Too Many Requests" error.
However, you should be able to do this with a groupby.
import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])
So if I'm understanding this right, what you want is the change in TOTAL_YIELD for each inverter over the time period starting 15-05-2020 02:00 and ending 17-06-2020 23:45. Try this:
delta = np.zeros(len(arr))
start = pd.to_datetime('15-05-2020 02:00:00', dayfirst=True)
end = pd.to_datetime('17-06-2020 23:45:00', dayfirst=True)
# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # filter the info to between the two dates, without assuming that
    # each inverter's data starts and ends exactly at those dates
    # (DATE_TIME must already be a datetime column, e.g. converted with pd.to_datetime)
    inverter_df = df.loc[(df['DATE_TIME'] >= start) & (df['DATE_TIME'] <= end)]
    inverter_df = inverter_df.loc[inverter_df['INVERTER_ID'] == code]
    # sort by date
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial
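As a sketch, the two suggestions above can also be combined: sort by time, then take the first and last available TOTAL_YIELD per inverter with a groupby. The small DataFrame below is made-up sample data, the column names follow the question, and DATE_TIME is assumed to be in day-first format. As a side note, the "'Series' objects are mutable" error in the second attempt most likely comes from calling df.loc(...) with parentheses; .loc is indexed with square brackets.

```python
import pandas as pd

# Made-up sample data; inverter "B" has no reading at the first or last interval
df = pd.DataFrame({
    "DATE_TIME": ["15-05-2020 02:00", "15-05-2020 02:15", "17-06-2020 23:45",
                  "15-05-2020 02:15", "17-06-2020 23:30"],
    "INVERTER_ID": ["A", "A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 105.0, 180.0, 50.0, 95.0],
})
df["DATE_TIME"] = pd.to_datetime(df["DATE_TIME"], format="%d-%m-%Y %H:%M")

# Sort by time so that 'first' and 'last' mean earliest and latest reading
ordered = df.sort_values("DATE_TIME")
result = ordered.groupby("INVERTER_ID")["TOTAL_YIELD"].agg(["first", "last"])
result["delta"] = result["last"] - result["first"]
print(result["delta"])
```

Compared with max minus min, this version does not rely on TOTAL_YIELD increasing monotonically, and it needs no explicit loop over the inverter list.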

Python image file manipulation

Python beginner here. I am trying to make use of some data stored in a dictionary.
I have some .npy files in a folder. It is my intention to build a dictionary that encapsulates the following: reading of the map, done with np.load, the year, month, and date of the current map (as integers), the fractional time in years (given that a month has 30 days - it does not affect my calculations afterwards), and the number of pixels, and number of pixels above a certain value. At the end I expect to get a dictionary like:
{'map0':'array(from np.load)', 'year', 'month', 'day', 'fractional_time', 'pixels'
'map1':'....}
What I managed until now is the following:
import glob
file_list = glob.glob('*.npy')
def only_numbers(seq): # for getting rid of any '.npy' or any other string
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
maps = {}
for i in range(0, len(file_list)-1):
    maps[i] = np.load(file_list[i])
    numbers[i] = list(only_numbers(file_list[i]))
I have no idea how to get a dictionary to hold more than one value per entry inside the for loop. I can only manage to generate a new dictionary, or a list (e.g. numbers), for every task. For the numbers dictionary, I have no idea how to manipulate the date in the format YYYYMMDD to get the integers I am looking for.
For the pixels, I managed to get it for a single map, using:
data = np.load('20100620.npy')
print('Total pixel count: ', data.size)
c = (data > 50).astype(int)
print('Pixel >50%: ',np.count_nonzero(c))
Any hints? Until now, image processing seems to be quite a challenge.
Edit: Managed to split the dates and make them integers using
date = list(numbers.values())
year = int(date[i][0:4])
month = int(date[i][4:6])
day = int(date[i][6:8])
print(year, month, day)
If anyone is interested, I managed to do something else. I dropped the idea of a dictionary containing everything, as I needed something easier to manipulate further. I did the following:
import glob
import numpy as np
import matplotlib.pyplot as plt
file_list = glob.glob('data/...') # files named YYYYMMDD.npy
file_list.sort()
value = 50 # example threshold (50 was used for the single map above); set your own
def only_numbers(seq): # removes every character that is not a digit from the file name
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
numbers = {}
time = []
np_above_value = []
for i in range(len(file_list)): # len(file_list) - 1 would skip the last file
    maps = np.load(file_list[i])
    maps[np.isnan(maps)] = 0 # had some NaNs and was getting errors
    numbers[i] = only_numbers(file_list[i]) # dictionary of file names reduced to their dates, using the function defined earlier
    date = list(numbers.values()) # the file names (digits only) as a list
    year = int(date[i][0:4]) # first 4 characters (YYYY) as an integer
    month = int(date[i][4:6]) # next 2 characters (MM)
    day = int(date[i][6:8]) # next 2 characters (DD)
    time.append(year + ((month - 1) * 30 + day) / 360) # fractional time
    print('Total pixel count for map '+ str(i) +':', maps.size) # total number of pixels for the current map
    c = (maps > value).astype(int)
    np_above_value.append(np.count_nonzero(c)) # number of pixels with a value bigger than value
    print('Pixels with concentration >value% for map '+ str(i) +':', np.count_nonzero(c)) # same count, printed for the current map
plt.plot(time, np_above_value) # pixels with concentration above value as a function of time
I know it might be very clumsy. Second week of python, so please overlook that. It does the trick :)
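For reference, a minimal sketch of the nested dictionary originally asked for, with one inner dictionary per map. The function name describe_map, the key names and the threshold of 50 are assumptions made up for illustration, and a small random array stands in for np.load(...) so that the snippet runs without any files on disk.

```python
import os

import numpy as np


def describe_map(path, data, threshold=50):
    """Build one dictionary entry from a file named YYYYMMDD.npy and its array."""
    stem = os.path.splitext(os.path.basename(path))[0]  # e.g. '20100620'
    year, month, day = int(stem[0:4]), int(stem[4:6]), int(stem[6:8])
    return {
        "array": data,
        "year": year,
        "month": month,
        "day": day,
        # 30-day months, as in the question
        "fractional_time": year + ((month - 1) * 30 + day) / 360,
        "pixels": data.size,
        "pixels_above": int(np.count_nonzero(data > threshold)),
    }


rng = np.random.default_rng(0)
files = ["20100620.npy", "20100720.npy"]  # hypothetical file names
maps = {}
for i, path in enumerate(sorted(files)):
    data = rng.uniform(0, 100, size=(4, 5))  # stand-in for: data = np.load(path)
    maps[f"map{i}"] = describe_map(path, data)

print(maps["map0"]["year"], maps["map0"]["month"], maps["map0"]["day"])
```

Each value is then reachable by name, for example maps["map1"]["fractional_time"], which avoids keeping several parallel lists in step with each other.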

Reducing numpy array for drawing chart

I want to draw a chart in my Python application, but the source numpy array is too large for this (1,000,000+ elements). I want to take the mean value of neighboring elements. My first idea was to do it in C++ style:
step = 19000 # every 19 seconds (for example) make a new point with the mean value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
cur = 0
res = []
while cur < len(index):
    next = cur
    while next < len(index) and index[next] == index[cur]:
        next += 1
    res.append(np.mean(value[cur:next]))
    cur = next
but this solution is very slow. I also tried this:
step = 19000 # every 19 seconds (for example) make a new point with the mean value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
data = np.arange(index[0], index[-1] + 1, step)
res = [value[index == i].mean() for i in data]
This solution is slower than the first one. What is the best solution for this problem?
np.histogram can provide sums over arbitrary bins. If you have time series, e.g.:
import numpy as np
data = np.random.rand(1000) # Random numbers between 0 and 1
t = np.cumsum(np.random.rand(1000)) # Random time series, from about 1 to 500
then you can calculate the binned sums across 5 second intervals using np.histogram:
t_bins = np.arange(0., 500., 5.) # Or whatever range you want
sums = np.histogram(t, t_bins, weights=data)[0]
If you want the mean rather than the sum, divide the sums by the bin tallies (the same histogram without the weights):
means = sums / np.histogram(t, t_bins)[0]
This method is similar to the one in this answer.
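Putting the pieces of this answer together as a runnable sketch. One addition that is not in the answer: a bin with no samples would lead to a division by zero, so empty bins are set to NaN here. The helper name binned_mean and the sample data are made up for illustration.

```python
import numpy as np


def binned_mean(t, values, bin_edges):
    """Mean of `values` within each time bin; NaN where a bin is empty."""
    sums, _ = np.histogram(t, bins=bin_edges, weights=values)
    counts, _ = np.histogram(t, bins=bin_edges)
    means = np.full(len(counts), np.nan)
    filled = counts > 0
    means[filled] = sums[filled] / counts[filled]
    return means


t = np.array([0.5, 1.5, 2.5, 6.0, 7.0, 16.0])
values = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 5.0])
edges = np.arange(0.0, 25.0, 5.0)  # bin edges at 0, 5, 10, 15, 20

result = binned_mean(t, values, edges)
print(result)
```

Both histogram calls are vectorised, so this avoids the Python-level loops and the repeated boolean masks that made the two attempts in the question slow.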
