Sub DataFrame extraction error (list index out of range) - python

I am getting an error saying: IndexError: list index out of range
when running this code:
import pandas as pd
import pandas_datareader as wb
import datetime as dt
data = wb.DataReader('spy', 'yahoo', start='1/1/1978', end='30/10/2019')
data['Change'] = data['Close'].pct_change() * 100
data['Gaps'] = (((data['Open'] - data['Close'].shift(1))/data['Close'].shift(1)) * 100)
data['Gaps'].astype(float)
data['Performance during day'] = ((data['Close'] - data['Open'])/data['Open']) * 100
data.reset_index(inplace=True)
data['Date'] = data['Date'].dt.date
data = round(data, 2)
filtered_data = list(data[data['Gaps'] > 2].index.astype(int))
list_of_slices = []
for each in filtered_data:
    event = data.iloc[filtered_data[each]-30:filtered_data[each]+60]
    list_of_slices.append(event)
I want to extract parts of the DataFrame and build a new sub-DataFrame from each extracted slice, to plot candlestick charts afterwards.
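The IndexError most likely comes from the last loop: each element of filtered_data is already a row position (the index labels are 0..n-1 after reset_index), but it is used again as a position into filtered_data itself, which has far fewer elements. A minimal sketch of the corrected loop, reusing data and filtered_data from above and clamping at 0 so the slice cannot wrap around:

list_of_slices = []
for each in filtered_data:
    # `each` is already the row position; slice around it directly
    event = data.iloc[max(each - 30, 0):each + 60]
    list_of_slices.append(event)

Each element of list_of_slices is then a sub-DataFrame covering up to 30 rows before and 60 rows after a gap day, ready for a candlestick plot.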

Convert Array to DataFrame with Longitude, Latitude coordinates

Imported Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
I am trying to create a heatmap out of my Strava dataset (a CSV file of 155,479 rows with geographical coordinates). I first tried to display the whole dataset with Folium in Python, but Folium seemed to crash when I uploaded the whole dataset (it worked with a sample).
Meanwhile I found this post https://towardsdatascience.com/create-a-heatmap-from-the-logs-of-your-activity-tracker-c9fc7ace1657 whose code works for displaying the whole dataset:
size_x, size_y = 1000, 1000

df2 = df[(df.lat > LAT_MIN) & (df.lat < LAT_MAX) &
         (df.lon > LAT_MIN) & (df.lon < LAT_MAX)].copy()
df2['x'] = (size_x * (df2.lon - df2.lon.min()) / (df2.lon.max() - df2.lon.min())).astype(int)
df2['y'] = (size_y * (df2.lat - df2.lat.min()) / (df2.lat.max() - df2.lat.min())).astype(int)
data = np.zeros((size_x, size_y))
width = 2
df3 = df2[['x', 'y', 'type']].groupby(['x', 'y']).count().reset_index()
for index, row in df3.iterrows():
    x = int(row['x'])
    y = int(row['y'])
    data[y - width:y + width, x - width:x + width] += row['type']
max = len(df2.source.unique()) * 1
and for creating a decent heatmap:
data[data > max] = max
data = (data - data.min()) / (data.max() - data.min())
cmap = plt.get_cmap('hot')
data = cmap(data)
However, when I try to convert the resulting array to a DataFrame,
df_data = pd.DataFrame(data)
df_data.head()
I don't understand the error below:
ValueError: Must pass 2-d input. shape=(1000, 1000, 4)
The error means that Pandas can't organize your data into a table. By definition, tables have 2 dimensions (rows and columns), but the data you passed has 3 dimensions: 1000, 1000 and 4.
To make it work, you should reshape the data to 2 dimensions.
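For example, since cmap(data) replaces every cell with an RGBA 4-tuple, two possible fixes (a sketch, assuming data has shape (1000, 1000, 4) as in the error): build the DataFrame from the 2-d grid before applying the colormap, or flatten the color channels into columns.

import numpy as np
import pandas as pd

# Option 1: keep a single 2-d channel, shape (1000, 1000)
df_data = pd.DataFrame(data[:, :, 0])

# Option 2: one row per pixel, one column per RGBA channel
flat = data.reshape(-1, data.shape[-1])   # shape (1000000, 4)
df_rgba = pd.DataFrame(flat, columns=['r', 'g', 'b', 'a'])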

How to efficiently extract time-series data from a netCDF file?

I want to extract time series of data from a single netCDF file.
I have to extract three time series of daily temperatures for more than 500 cities from 2004 to 2016 (more precisely, I extract one time series for each of 3 point coordinates per city).
The following program works, but it is very slow (more than 8 hours to obtain the time series for one location). I have already tried to split the coordinates into several CSV files and run the program separately on each of them, but that is not very efficient either.
Maybe I should chunk the netCDF file (5 GB) into smaller files to reduce the reading overhead, but I don't know how to do that.
from netCDF4 import Dataset
from datetime import datetime
import pandas as pd
import os
import numpy as np

os.chdir('D:PATH/tmp/')
date_range = pd.date_range(start="2004-01-01", end="2016-12-31", freq='D')
df = pd.DataFrame(0.0, columns=['Temp1', 'Temp2', 'Temp3'], index=date_range)
cities = pd.read_csv(r'D:\PATH\cities_coordinates.csv', sep=',')
cities['NUTS_ID'] = cities['NUTS_ID'].map(str)

for index, row in cities.iterrows():
    location = row['NUTS_ID']
    location_latitude1 = row['lat1']
    location_longitude1 = row['lon1']
    location_latitude2 = row['lat2']
    location_longitude2 = row['lon2']
    location_latitude3 = row['lat3']
    location_longitude3 = row['lon3']
    for day in date_range:
        data = Dataset("D:/PATH/temperature.nc", 'r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['latitude'][:]
        lon = data.variables['longitude'][:]
        # Squared difference between the specified lat/lon and the lat/lon of the netCDF
        sq_diff_lat1 = (lat - location_latitude1)**2
        sq_diff_lon1 = (lon - location_longitude1)**2
        sq_diff_lat2 = (lat - location_latitude2)**2
        sq_diff_lon2 = (lon - location_longitude2)**2
        sq_diff_lat3 = (lat - location_latitude3)**2
        sq_diff_lon3 = (lon - location_longitude3)**2
        # Identify the index of the min value for lat and lon
        min_index_lat1 = sq_diff_lat1.argmin()
        min_index_lon1 = sq_diff_lon1.argmin()
        min_index_lat2 = sq_diff_lat2.argmin()
        min_index_lon2 = sq_diff_lon2.argmin()
        min_index_lat3 = sq_diff_lat3.argmin()
        min_index_lon3 = sq_diff_lon3.argmin()
        # Accessing the temperature data
        tx = data.variables['tx']
        start = '2004-01-01'
        end = '2016-12-31'
        d_range = pd.date_range(start=start, end=end, freq='D')
        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for: ' + str(d_range[t_index]))
            df.loc[d_range[t_index]]['Temp1'] = tx[t_index, min_index_lat1, min_index_lon1]
            df.loc[d_range[t_index]]['Temp2'] = tx[t_index, min_index_lat2, min_index_lon2]
            df.loc[d_range[t_index]]['Temp3'] = tx[t_index, min_index_lat3, min_index_lon3]
    df.to_csv(location + '.csv')
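One way to speed this up substantially (a sketch, assuming the time axis of temperature.nc is daily starting at 2004-01-01, matching date_range): open the file once outside all loops, compute the nearest grid indices once per point, and read the whole time axis in a single slice instead of one value per day.

from netCDF4 import Dataset
import numpy as np
import pandas as pd

data = Dataset("D:/PATH/temperature.nc", 'r')   # opened once, not once per day
lat = data.variables['latitude'][:]
lon = data.variables['longitude'][:]
tx = data.variables['tx']
date_range = pd.date_range(start="2004-01-01", end="2016-12-31", freq='D')

for index, row in cities.iterrows():
    df = pd.DataFrame(index=date_range)
    for i in (1, 2, 3):
        # nearest grid point for this coordinate pair
        ilat = np.abs(lat - row['lat%d' % i]).argmin()
        ilon = np.abs(lon - row['lon%d' % i]).argmin()
        # one read along the whole time axis instead of one read per day
        df['Temp%d' % i] = np.asarray(tx[:len(date_range), ilat, ilon])
    df.to_csv(row['NUTS_ID'] + '.csv')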

heatmap of values grouped by time - seaborn

I'm plotting the counts of a variable grouped by time as a heatmap. However, when including both hour and minute, the counts are quite low, so the resulting heatmap doesn't provide any real insight. Is it possible to group the counts into bigger blocks of time? I'm hoping to test a few different periods (5, 10 minutes).
I'm also hoping to plot time on the x-axis, similar to the output attached.
import seaborn as sns
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta

start = datetime(1900, 1, 1, 10, 0, 0)
end = datetime(1900, 1, 1, 13, 0, 0)
seconds = (end - start).total_seconds()
step = timedelta(minutes=1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
    array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns={0: 'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size=len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
g = df2.groupby(['Hour', 'Min', 'Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace=True)
sns.heatmap(count_df)
For cases like this, I think the easiest approach is to downsample the data; the bin size is also easy to change. The axis labels in the output graph will need some adjustment, but I recommend this method.
import seaborn as sns
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
start = datetime(1900,1,1,10,0,0)
end = datetime(1900,1,1,13,0,0)
seconds = (end - start).total_seconds()
step = timedelta(minutes = 1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
    array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns = {0:'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size = len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
df2.set_index('Time', inplace=True)
count_df = df2.resample('10min')['Count'].value_counts().unstack()
count_df.fillna(0, inplace = True)
sns.heatmap(count_df.T)
Another way to achieve this is to create a column of group numbers in which each number is repeated once for every minute in the block.
For example:
minutes = 3
x = [0,1,2]
np.repeat(x, repeats=minutes, axis=0)
>>>> [0,0,0,1,1,1,2,2,2]
and then group your data using this column.
So your code would look like:
...
minutes = 5
x = [i for i in range(int(df2.shape[0] / minutes))]
df2['group'] = np.repeat(x, repeats=minutes, axis=0)
g = df2.groupby(['group', 'Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace=True)
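As in the first answer, the binned counts can then be rendered with a transposed heatmap:

sns.heatmap(count_df.T)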

np.log returns a dataframe full of NaNs

I have made 2 functions, one for the cumulative logarithmic returns and the other for the total relative return.
Cumulative logarithmic returns:
# Cumulative logarithmic returns function:
def tlog_r(data, start, end):
    tlog_return = copy.deepcopy(data)
    for t in range(0, len(tlog_return)):
        x = data[t]
        y = data[0]
        tlog_return[t] = x / y
    tlog_return = np.log(tlog_return)
    tlog_return[0] = 0
    return tlog_return
Total relative returns:
# Total relative returns function:
def tr_rel(data):
    tlog_return = copy.deepcopy(data)
    for t in range(0, len(tlog_return)):
        x = data[t]
        y = data[0]
        tlog_return[t] = x / y
    tlog_return = np.log(tlog_return)
    tlog_return[0] = 0
    tr_relative = copy.deepcopy(tlog_return)
    for t in range(0, len(tr_relative)):
        tr_relative[t] = 100 * (np.exp(tr_relative[t]) - 1)
    print(tr_relative)
    return tr_relative
I want to calculate them for a stock's prices between two dates.
The code raises no error, but unless the dates start in 2000, 2005 or 2011 it returns a DataFrame full of NaNs, except for the value at index [0].
Why is this happening? How can I solve it?
In case you need it, this is the part of the code where I call the functions:
from relative_returns_functions import tlog_r, tr_rel
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import copy
ticker='AAPL'
start_date='2000-01-01'
end_date='2019-12-31'
price='Close'
# Program
panel_data = data.DataReader(ticker , 'yahoo', start_date, end_date)[price]
title = '{} {} price'.format(ticker, price) #Plot title
panel_data.plot(title=title)
# Data processing
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
panel_data = panel_data.reindex(all_weekdays)
panel_data = panel_data.fillna(method='ffill')
# Plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,6))
comp_title = '{} returns comparison'.format(ticker)
fig.suptitle(comp_title)
sum_log_returns = tlog_r(panel_data, start_date, end_date)
ax1.plot(sum_log_returns.index, sum_log_returns, label=ticker)
ax1.set_ylabel('Cumulative log returns')
ax1.legend(loc='best')
tot_logreturns = tr_rel(panel_data)
ax2.plot(tot_logreturns.index, tot_logreturns, label=ticker)
ax2.set_ylabel('Total relative returns (%)')
ax2.legend(loc='best')
plt.show()
Here is a minimal reproducible example; you will need to import the functions, pandas, numpy and copy.
ticker='AAPL'
start_date='2000-01-01'
end_date='2019-12-31'
price='Close'
panel_data = data.DataReader(ticker , 'yahoo', start_date, end_date)[price]
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
panel_data = panel_data.reindex(all_weekdays)
panel_data = panel_data.fillna(method='ffill')
sum_log_returns = tlog_r(panel_data, start_date, end_date)
print(sum_log_returns)
tot_logreturns = tr_rel(panel_data)
print(tot_logreturns)
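One likely cause, assuming the downloaded data itself is fine: if start_date falls on a weekend or a market holiday, reindexing to business days puts a NaN at the very first row, and ffill cannot fill it because there is nothing before it to fill from. Both functions then divide every value by data[0], which is NaN, so the whole result becomes NaN. 2000, 2005 and 2011 all start on a Saturday, so their first business day (January 3) has a price. A quick check and a possible fix:

print(panel_data.head())                        # is the first value NaN?
panel_data = panel_data.fillna(method='bfill')  # back-fill the leading NaN
# or simply drop leading rows that have no price:
# panel_data = panel_data.dropna()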

Pandas rolling standard deviation

Is anyone else having trouble with the new rolling.std() in pandas? The deprecated method was rolling_std(). The new method runs fine but produces a constant number that does not roll with the time series.
Sample code is below. If you trade stocks, you may recognize the formula for Bollinger bands. The output I get from rolling.std() tracks the stock day by day and is obviously not rolling.
This is in pandas 0.19.1. Any help would be appreciated.
import datetime
import pandas as pd
import pandas_datareader.data as web
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2012,12,31)
g = web.DataReader(['AAPL'], 'yahoo', start, end)
stocks = g['Close']
stocks['Date'] = pd.to_datetime(stocks.index)
stocks['AAPL_LO'] = stocks['AAPL'] - stocks['AAPL'].rolling(20).std() * 2
stocks['AAPL_HI'] = stocks['AAPL'] + stocks['AAPL'].rolling(20).std() * 2
stocks.dropna(axis=0, how='any', inplace=True)
import pandas as pd
from pandas_datareader import data as pdr
import numpy as np
import datetime
end = datetime.date.today()
begin=end-pd.DateOffset(365*10)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
def bollinger_strat(data, window, no_of_std):
    rolling_mean = data['Close'].rolling(window).mean()
    rolling_std = data['Close'].rolling(window).std()
    # write the bands onto the DataFrame that was passed in
    data['Bollinger High'] = rolling_mean + (rolling_std * no_of_std)
    data['Bollinger Low'] = rolling_mean - (rolling_std * no_of_std)

bollinger_strat(data, 20, 2)
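A quick sanity check, assuming matplotlib is available: plotting the bands should show them widening and narrowing with volatility, which confirms the standard deviation really is rolling.

import matplotlib.pyplot as plt

data[['Close', 'Bollinger High', 'Bollinger Low']].plot(figsize=(10, 5))
plt.show()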
