Say I have an xarray DataArray. One of the dimensions is a time dimension:
import numpy as np
import xarray as xr
import pandas as pd
time = pd.date_range('1980-01-01', '2017-12-01', freq='MS')
time = xr.DataArray(time, dims=('time',), coords={'time':time})
da = xr.DataArray(np.random.rand(len(time)), dims=('time',), coords={'time':time})
Now if I only want the years from 1990 to 2000, selecting them is easy:
da.sel(time=slice('1990', '2000'))
But what if I want to drop these years? I want the data for all years except those.
da.drop_sel(time=slice('1990', '2000'))
fails with
TypeError: unhashable type: 'slice'
What's going on? What's the proper way to do that?
At the moment, I'm creating a new DataArray:
tdrop = da.time.sel(time=slice('1990', '2000'))
da.drop_sel(time=tdrop)
But that seems unnecessarily convoluted.
What about using where with the optional drop parameter set to True to filter on the year? drop_sel expects explicit labels (hashable values) rather than a slice, which is why the TypeError is raised. Using the example below, the data with 1990 <= year <= 2000 is dropped.
da = da.where((da["time.year"] < 1990) | (da["time.year"] > 2000), drop=True)
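A quick check confirms the selected years are gone (this just inspects the da produced above):
print(np.unique(da["time.year"]))  # 1980-1989 and 2001-2017 remain; 1990-2000 removed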
I have a pandas dataframe with daily data. At the last day of each month, I would like to compute a quantity that depends on the daily data of the previous n months (e.g., n=3).
My current solution is to use the pandas rolling function to compute this quantity for every day, and then, only keep the quantities of the last days of each month (and discard all the other quantities). This however implies that I perform a lot of unnecessary computations.
Does anyone know how I can improve this?
Thanks a lot in advance!
EDIT:
In the following, I add two examples. In both cases, I compute rolling regressions of stock returns. The first (short) example shows the problem described above and is a sub-problem of my actual problem. The second (long) example shows my actual problem. Therefore, I would either need a solution to the first example that can be embedded in my algorithm for solving the second example, or a completely different solution to the second example. Note: the dataframe that I'm using is very large, which means that keeping multiple copies of the entire dataframe is not feasible.
Example 1:
import numpy as np
import pandas as pd
import random
import statsmodels.api as sm
# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
df = pd.DataFrame(index=dates,columns=['Y','X']).sort_index()
# Generate Data
df['X'] = np.array(range(0,365))
df['Y'] = 3.1*df['X'] - 2.5
df = df.iloc[random.sample(range(365),280)] # some days are missing
df.iloc[random.sample(range(280),20),0] = np.nan # some observations are missing
df = df.sort_index()
# Compute Beta
def estimate_beta(ser):
    # rolling.apply only passes one column, so look up both Y and X via the window's index
    return sm.OLS(df.loc[ser.index, 'Y'], sm.add_constant(df.loc[ser.index, 'X']),
                  missing='drop').fit().params[-1]
df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta) # use last 60 days and require at least 10 observations
# Get last entries per month
df_monthly = df[['beta']].groupby([pd.Grouper(freq='M', level='date')]).agg('last')
df_monthly
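One way to avoid the per-day computation is to evaluate the regression only at the last available row of each month. A sketch, reusing df and estimate_beta from Example 1 (the min_periods=10 requirement is approximated by counting non-missing Y values):
# evaluate the 60-day regression only at the last available date of each month
month_ends = df.groupby(pd.Grouper(freq='M', level='date')).tail(1).index
betas = {}
for d in month_ends:
    window = df.loc[d - pd.Timedelta('59D'):d]   # same span as rolling('60D')
    if window['Y'].notna().sum() >= 10:          # approximate min_periods=10
        betas[d] = estimate_beta(window['Y'])
df_monthly = pd.Series(betas, name='beta').to_frame()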
Example 2:
import numpy as np
import pandas as pd
from pandas import IndexSlice as idx
import random
import statsmodels.api as sm
# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
arrays = [dates.tolist()+dates.tolist(),["10000"]*365+["10001"]*365]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["Date", "Stock"])
df = pd.DataFrame(index=index,columns=['Y','X']).sort_index()
# Generate Data
df.loc[idx[:,"10000"],'X'] = X = np.array(range(0,365)).astype(float)
df.loc[idx[:,"10000"],'Y'] = 3*X-2
df.loc[idx[:,"10001"],'X'] = X
df.loc[idx[:,"10001"],'Y'] = -X+1
df = df.iloc[random.sample(range(365*2),360*2)] # some days are missing
df.iloc[random.sample(range(280*2),20*2),0] = np.nan # some observations are missing
# Estimate beta
def estimate_beta_grouped(df_in):
    def estimate_beta(ser):
        return sm.OLS(df.loc[ser.index, 'Y'].astype(float),
                      sm.add_constant(df.loc[ser.index, 'X'].astype(float)),
                      missing='drop').fit().params[-1]
    df = df_in.droplevel('Stock').reset_index().set_index(['Date']).sort_index()
    df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)
    return df[['beta']]
df_beta = df.groupby(level='Stock').apply(estimate_beta_grouped)
# Extract beta at last day per month
df_monthly = df.groupby([pd.Grouper(freq='M', level='Date'), df.index.get_level_values(1)]).agg('last') # get last observations
df_monthly = df_monthly.merge(df_beta, left_index=True, right_index=True, how='left') # merge beta on df_monthly
df_monthly
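If the month-end-only idea from Example 1 pans out, it could presumably be embedded here as well: replace the rolling(...).apply(...) line inside estimate_beta_grouped with the month-end loop sketched after Example 1, so each stock's beta is only estimated at its last date per month.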
I've been trying to create a generic weather importer that can resample data to set intervals (e.g. from 20 min to hourly or the like; I've used 60 min in the code below).
For this I wanted to use the pandas resample function. After a bit of puzzling I came up with the code below (which is not the prettiest). I had one problem with averaging the wind direction over the set periods, which I've tried to solve with pandas' resampler.apply.
However, I've hit a problem with the function definition, which gives the following error:
TypeError: can't convert complex to float
I realise I'm trying to force a square peg into a round hole, but I have no idea how to overcome this. Any hints would be appreciated.
raw data
import pandas as pd
import os
from datetime import datetime
from pandas import ExcelWriter
from math import *
os.chdir('C:\\test')
file = 'bom.csv'
df = pd.read_csv(file,skiprows=0, low_memory=False)
#custom dataframe resampler (.resample().apply)
def custom_resampler(thetalist):
    try:
        s = 0
        c = 0
        n = 0.0
        for theta in thetalist:
            s = s + sin(radians(theta))
            c = c + cos(radians(theta))
            n += 1
        s = s / n
        c = c / n
        eps = (1 - (s**2 + c**2))**0.5
        sigma = asin(eps) * (1 + (2.0 / 3.0**0.5 - 1) * eps**3)
    except ZeroDivisionError:
        sigma = 0
    return degrees(sigma)
# create time index and format dataframes
df['DateTime'] = pd.to_datetime(df['DateTime'],format='%d/%m/%Y %H:%M')
df.index = df['DateTime']
df = df.drop(['Year','Month', 'Date', 'Hour', 'Minutes','DateTime'], axis=1)
dfws = df
dfwdd = df
dfws = dfws.drop(['WDD'], axis=1)
dfwdd = dfwdd.drop(['WS'], axis=1)
#resample data to xxmin and merge data
dfwdd = dfwdd.resample('60T').apply(custom_resampler)
dfws = dfws.resample('60T').mean()
dfoutput = pd.merge(dfws, dfwdd, right_index=True, left_index=True)
# write series to Excel
writer = pd.ExcelWriter('bom_out.xlsx', engine='openpyxl')
dfoutput.to_excel(writer, sheet_name='bom_out')
writer.save()
Did a bit more research and found that changing the definition worked best.
However, this gave a weird outcome for opposing angles (180 degrees apart), which I discovered by accident. I had to subtract a small value, which introduces a small degree error in the actual outcome.
I would still be interested to know:
what was done wrong with the complex math
a better solution for opposing angles (180 degrees)
# changed the imports
from math import sin,cos,atan2,pi
import numpy as np
#changed the definition
def custom_resampler(angles, weights=None, setting='degrees'):
    '''computes the mean angle'''
    if weights is None:  # None instead of 0, so a passed-in weights array is never compared to a scalar
        weights = np.ones(len(angles))
    sumsin = 0
    sumcos = 0
    if setting == 'degrees':
        angles = np.array(angles) * pi / 180
    for i in range(len(angles)):
        sumsin += weights[i] / sum(weights) * sin(angles[i])
        sumcos += weights[i] / sum(weights) * cos(angles[i])
    average = atan2(sumsin, sumcos)
    if setting == 'degrees':
        average = average * 180 / pi
    if average == 180 or average == -180:  # added since 290 degrees and 110 degrees averaged gave a weird outcome
        average -= 0.1
    elif average < 0:
        average += 360
    return round(average, 1)
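On the two open points, a plausible reading of the original code (an assumption, not verified against the original data): the complex error most likely comes from floating-point rounding pushing s**2 + c**2 a hair above 1, so (1 - (s**2 + c**2))**0.5 produces a complex number, which math.asin rejects with exactly TypeError: can't convert complex to float; clamping the radicand, e.g. eps = max(0.0, 1 - (s**2 + c**2))**0.5, would avoid that. As for opposing angles, the sine and cosine sums cancel, so the resultant vector has length near zero and the mean direction is mathematically undefined; atan2 then returns whatever rounding noise dictates, which is why 290 and 110 degrees behave oddly. A more robust guard would test the resultant length rather than the exact +/-180 result; a self-contained sketch, with a hypothetical tolerance:
from math import sin, cos, atan2, radians, degrees, hypot
a, b = radians(290), radians(110)
s = (sin(a) + sin(b)) / 2  # opposing components cancel
c = (cos(a) + cos(b)) / 2
print(hypot(s, c))            # ~1e-17: resultant length is essentially zero
print(degrees(atan2(s, c)))   # rounding noise, not a meaningful mean direction
if hypot(s, c) < 1e-9:        # hypothetical tolerance for "no mean direction"
    average = float('nan')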
I am trying to predict the stock price of Facebook on the 1664th row of the .csv file. I am encountering an error when appending to a np.array. Here's my code:
##predicts price of facebook stock for one day
from sklearn.svm import SVR
import numpy as np
import pandas as pd
##store and show data
df = pd.read_csv(r'fb.csv')
##get and print last row of data
actual_price = df.tail(1)
#print(actual_price)
##prepare and print svr models
##get all of the data except for last row
df = df.head(len(df)-1)
ind = (np.arange((len(df.index))))
df["index"] = ind
##create empty arrays to store dependent and independent data
days = np.array([])
adj_close_prices = np.array([])
##get the date and adjusted close prices
df_days = df.loc[:, 'index']
df_adj_close = df.loc[:, 'Adj Close']
##create the independent dataset ### this part to specify
for day in df_days:
    days = np.append(float(day))
And the error which keeps occurring is the following:
days = np.append(float(day))
File "<__array_function__ internals>", line 4, in append
TypeError: _append_dispatcher() missing 1 required positional argument: 'values'
I have very basic level of Python knowledge and have been using YouTube and online resources to come up with what I have already.
I haven't checked your whole code, but the append problem isn't too hard to fix. Check the numpy.append() documentation and you will notice that it takes two required parameters (and a third that is optional), in the form numpy.append(array_to_append_to, value_to_append). So in your code it should look like days = np.append(days, float(day)).
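As a side note, a sketch reusing df_days and df_adj_close from the question: growing an array with np.append inside a loop copies it on every iteration, which is quadratic; a vectorized conversion does the same job in one step.
# build the arrays in one step instead of appending in a loop
days = df_days.to_numpy(dtype=float).reshape(-1, 1)   # 2-D, as sklearn estimators expect
adj_close_prices = df_adj_close.to_numpy(dtype=float)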
I am trying to select several years from a dataframe in monthly resolution.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import netCDF4 as nc
#-- open net-cdf and read in variables
data = nc.Dataset('test.nc')
time = nc.num2date(data.variables['Time'][:],
data.variables['Time'].units)
df = pd.DataFrame(data.variables['mgpp'][:,0,0], columns=['mgpp'])
df['dates'] = time
df = df.set_index('dates')
print(df.head())
This is what the head looks like:
mgpp
dates
1901-01-01 0.040735
1901-02-01 0.041172
1901-03-01 0.053889
1901-04-01 0.066906
Now I managed to extract one year:
df_cp = df[df.index.year == 2001]
but how would I extract several years, say 1997, 2001, and 2007, and have them stored in the same dataframe? Is there a one- or two-line solution? My only idea for now is to iterate and then merge the dataframes, but maybe there is a better solution!
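One possible one-liner, a sketch assuming the df built above: boolean indexing with Index.isin selects all three years at once.
df_cp = df[df.index.year.isin([1997, 2001, 2007])]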
I have a pandas dataframe in the following format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I have written the following to create two new fields indicating 30 day windows:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
Unfortunately, this is far too slow doing the row-wise apply for my application. I would greatly appreciate any tips on vectorizing these functions if possible.
EDIT:
The resultant dataframe should have this layout:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
It does not need to be resampled or windowed in the formal sense. It just needs 'window_start_dt' and 'window_end_dt' columns to be added. The current code works; it just needs to be vectorized if possible.
EDIT 2: pandas.cut is built-in:
tt=[[1,'2004-01-02',0.1,25,47],
[1,'2004-01-17',0.2,150,8],
[2,'2004-01-29',0.2,150,25],
[3,'2017-07-15',0.3,55,17],
[3,'2016-05-12',0.3,55,47],
[4,'2012-02-23',0.2,150,22],
[4,'2009-10-10',0.1,25,12],
[4,'2014-04-04',0.2,150,2],
[5,'2008-07-09',0.2,150,43]]
start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')
df = pd.DataFrame(tt,columns=['customer_id','transaction_dt','product','price','units'])
df['transaction_dt'] = pd.Series([pd.to_datetime(sub_t[1],format='%Y-%m-%d') for sub_t in tt])
the_cut = pd.cut(df['transaction_dt'],bins=start_date_period,right=True,labels=False,include_lowest=True)
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
print(df.head())
win_start_test and win_end_test should be equal to their counterparts computed using your function.
The ValueError was coming from not casting x to int in the relevant line. I also added a NaN check, though it wasn't needed for this toy example.
Note the change to pd.date_range, the use of the start-of-month and end-of-month flags MS and M, and the conversion of the date strings into datetimes.
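If fixed 30-day windows anchored at 2004-01-01 are acceptable (as in the original period_range), here is an alternative sketch that vectorizes the bucketing with plain timedelta arithmetic, no pd.cut needed:
# fully vectorized 30-day bucketing via integer division on day offsets
anchor = pd.Timestamp('2004-01-01')
offset = (df['transaction_dt'] - anchor).dt.days // 30          # window number per row
df['window_start_dt'] = anchor + pd.to_timedelta(offset * 30, unit='D')
df['window_end_dt'] = df['window_start_dt'] + pd.Timedelta(days=29)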