Linear Regression on Multiindex Pandas Dataframe in Python


I'm trying to perform a regression of annual temperatures over time, and obtain a slope/linear trend (number generated by the regression) for each latitude and longitude coordinate (the full dataset has many lat/lon locations). I want to replace the year and temp for each location with this slope value. My end goal is to map these trends with cartopy.
Here is some test data in a pandas multi index dataframe
                        tempanomaly
lat   lon    time_bnds
-89.0 -179.0 1957          0.606364
             1958          0.495000
             1959          0.134286
this is my goal:
lat lon trend
-89.0 -179.0 -0.23604
this is my regression function
def regress(y):
    # X is the year or index, y is the temperature
    X = np.array(range(len(y))).reshape(len(y), 1)
    y = y.array
    fit = np.polyfit(X, y, 1)
    return (fit[0])
and here is how I'm attempting to call it
reg = df.groupby(["lat", "lon"]).transform(regress)
The error I'm receiving is TypeError: Transform function invalid for data types.
In the debugging process, I found that the regression was running for each line (3 times, using the test data), as opposed to once for each location (only one location is in the test data). I believe the problem lies in the method I'm using to call the regression, but can't figure out another way to iterate through and perform a regression by lat/lon pairs—I appreciate any help!

I think your regress function also has an error: np.polyfit expects X to be a 1D vector, but the reshape makes it 2D. Here is the fixed regress function:
def regress(y):
    # X is the year or index, y is the temperature
    X = np.array(range(len(y)))
    y = y.array
    fit = np.polyfit(X, y, 1)
    return (fit[0])
According to the pandas documentation, pandas.DataFrame.transform must produce a result with the same axis length as its input, so it cannot reduce each group to a single value.
Therefore aggregate is a better option for your case.
reg = df.groupby(["lat", "lon"]).aggregate(trend=pd.NamedAgg('tempanomaly', regress)).reset_index()
which produces:
lat lon trend
-89.0 -179.0 -0.236039
with the sample data created as follows:
lat_lon = [(-89.0, -179.0), (-89.0, -179.0), (-89.0, -179.0)]
index = pd.MultiIndex.from_tuples(lat_lon, names=["lat", "lon"])
df = pd.DataFrame({
    'time_bnds': [1957, 1958, 1959],
    'tempanomaly': [0.606364, 0.495000, 0.134286]
}, index=index)
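If time_bnds is kept as an index level, as in the question's original frame, the slope can also be fitted against the actual years instead of the row position. The following is only a minimal sketch of that idea: the second location (-88.0, -179.0) and its values are made up for illustration, and the grouping by index level is one possible approach, not necessarily what the answer above intended.

```python
import numpy as np
import pandas as pd

# Sample frame shaped like the question's data: (lat, lon, time_bnds) MultiIndex.
# The second location is synthetic, added only to show one trend per location.
index = pd.MultiIndex.from_tuples(
    [(-89.0, -179.0, 1957), (-89.0, -179.0, 1958), (-89.0, -179.0, 1959),
     (-88.0, -179.0, 1957), (-88.0, -179.0, 1958), (-88.0, -179.0, 1959)],
    names=["lat", "lon", "time_bnds"],
)
df = pd.DataFrame(
    {"tempanomaly": [0.606364, 0.495000, 0.134286, 0.1, 0.2, 0.3]},
    index=index,
)

def regress(y):
    # Use the real years from the index as X; y is the temperature anomaly
    years = y.index.get_level_values("time_bnds").to_numpy(dtype=float)
    return np.polyfit(years, y.to_numpy(dtype=float), 1)[0]

# One slope per (lat, lon) pair
trend = (
    df.groupby(level=["lat", "lon"])["tempanomaly"]
    .agg(regress)
    .rename("trend")
    .reset_index()
)
print(trend)
```

Because the years here are consecutive, the slope for (-89.0, -179.0) matches the one obtained from the positional index; with gaps in the record, the two would differ.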

Related

Calculate mean from only one variable in pandas dataframe and netcdf

I am aiming to calculate daily climatology from a dataset, i.e. obtain the sea surface temperature (SST) for each day of the year by averaging all the years (for example, for January 1st, the average SST of all January 1st from 1982 to 2018). To do so, I made the following steps:
DATA PREPARATION STEPS
Here is a Drive link to both datasets to make the code reproducible:
link to datasets
First, I load two datasets:
ds1 = xr.open_dataset('./anomaly_dss/archive_to2018.nc') #from 1982 to 2018
ds2 = xr.open_dataset('./anomaly_dss/realtime_from2018.nc') #from 2018 to present
Then I convert to pandas dataframe and merge both in one:
ds1 = ds1.where(ds1.time > np.datetime64('1982-01-01'), drop=True) # Grab all data since 1/1/1982
ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive
# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')
# Merge these datasets
df = df1.combine_first(df2)
So far, this is what my dataframe looks like:
NOTE THAT THE LAT,LON GOES FROM LAT(35,37.7), LON(-10,-5), THIS MUST REMAIN LIKE THAT
ANOMALY CALCULATION STEPS
# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()
# Calculate a daily average
df_daily = df.resample('1D').mean()
# Calculate the anomaly for each yearday
df_daily['anomaly'] = df_daily['analysed_sst'].groupby(df_daily.index.dayofyear).transform(standardize)
I obtain the following dataframe:
As you can see, I obtain the mean values of all three variables.
QUESTION
As I want to plot the climatology data on a map, I DO NOT want lat/lon variables to be averaged to one point. I need the anomaly from all the points lat/lon points, and I don't really know how to achieve that.
Any help would be very appreciated!!

I think you can do all that in a simpler and more straightforward way without converting your dataarray to a dataframe:
import os
import xarray as xr
# Will open and combine automatically the 2 datasets
DS = xr.open_mfdataset(os.path.join('./anomaly_dss', '*.nc'))
da = DS.analysed_sst
# Resampling
da = da.resample(time = '1D').mean()
# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()
# (recent xarray versions name this method .map; .apply is the older spelling)
da_anomaly = da.groupby(da.time.dt.dayofyear).apply(standardize)
Then you can plot the anomaly for any day with:
da_anomaly[da_anomaly.dayofyear == 1].plot()
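For readers who prefer to stay in pandas, the same idea can be sketched by grouping on the grid point as well as the day of year, so that lat/lon are never averaged away. This is only an illustration on synthetic data: the frame below, its two grid points, and the random SST values are assumptions, not the asker's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged frame: 2 grid points, 3 non-leap years of daily data
times = pd.date_range("2001-01-01", "2003-12-31", freq="D")
points = [(35.0, -10.0), (37.7, -5.0)]
rng = np.random.default_rng(0)
df = pd.DataFrame(
    [(t, lat, lon, rng.normal(18.0, 2.0)) for t in times for lat, lon in points],
    columns=["time", "lat", "lon", "analysed_sst"],
)

def standardize(x):
    return (x - x.mean()) / x.std()

# Group by grid point AND day of year, so each (lat, lon) keeps its own anomaly series
keys = [df["lat"], df["lon"], df["time"].dt.dayofyear.rename("dayofyear")]
df["anomaly"] = df.groupby(keys)["analysed_sst"].transform(standardize)
```

Since transform returns a result of the same length as its input, every original row keeps its lat/lon and simply gains an anomaly value.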

Y intercept of pandas dataframe with multiple series for linear regression

count 716865 716873 716884 716943
0 -0.16029615828413712 -0.07630309240006158 0.11220663712532133 -0.2726775504078691
1 -0.6687265363491811 -0.6135022705188075 -0.49097425130988914 -0.736020384028633
2 0.06735205699309535 0.07948417451634422 0.09240256047258057 0.0617964313591086
3 0.372935701728449 0.44324822316416074 0.5625073287879649 0.3199599294007491
4 0.39439310866886124 0.45960496068147993 0.5591549439131621 0.34928093849248304
5 -0.08007381002566456 -0.021313801077641505 0.11996141286735541 -0.15572679401876433
I have this dataframe named df2_norm on python. I compute slope with the following code:
allowableCorr = self.df2_norm.corr(method = 'pearson')
self.slope = allowableCorr * (self.df2_norm.std().values / self.df2_norm.std().values[:, np.newaxis])
Q1) How do I compute the y intercept using pandas,numpy and matplotlib only into a matrix that is like a heat/correlation map?
Q2) Is there a way to compute the scatter plot for each column as the train data and the rest as the test data?
Thank you.
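For Q1, one possible approach follows from the least-squares identity intercept = mean(y) - slope * mean(x), applied pairwise in the same layout as the slope matrix. The sketch below is an assumption-laden illustration: the random data stands in for df2_norm, and the convention that row i is the predictor and column j is the response is inferred from the slope formula in the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df2_norm (column names borrowed from the question)
rng = np.random.default_rng(0)
cols = ["716865", "716873", "716884", "716943"]
df2_norm = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)

corr = df2_norm.corr(method="pearson")
std = df2_norm.std().to_numpy()
mean = df2_norm.mean().to_numpy()

# slope[i, j]: slope when column j (y) is regressed on column i (x)
slope_values = corr.to_numpy() * (std / std[:, np.newaxis])
# intercept[i, j] = mean(y_j) - slope[i, j] * mean(x_i)
intercept_values = mean[np.newaxis, :] - slope_values * mean[:, np.newaxis]

slope = pd.DataFrame(slope_values, index=cols, columns=cols)
intercept = pd.DataFrame(intercept_values, index=cols, columns=cols)
```

The resulting intercept frame has the same shape as the correlation matrix, so it can be drawn as a heat map in the same way, for example with matplotlib's imshow.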

Is it possible to shorten individual columns in pandas dataframes?

I am working with a 1000x40 data frame where I am fitting each column with a function.
For this, I am normalizing the data to run from 0 to 1 and then I fit each column by this sigmoidal function,
import numpy as np
from scipy.optimize import curve_fit

def func_2_2(x, slope, halftime):
    yfit = 0 + 1 / (1+np.exp(-slope*(x-halftime)))
    return yfit
# initial guesses for function
slope_guess = 0.5
halftime_guess = 100
# Construct initial guess array
p0 = np.array([slope_guess, halftime_guess])
# set up curve fit
col_params = {}
for col in dfnormalized.columns:
    x = df.iloc[:,0]
    y = dfnormalized[col].values
    popt = curve_fit(func_2_2, x, y, p0=p0, maxfev=10000)
    col_params[col] = popt[0]
This code is working well for me, but the data fitting would physically make more sense if I could cut each column shorter on an individual basis. Some columns already plateau at virtually 1 after e.g. 500 data points, others after 700. I would like to implement a function that simply cuts off a column once it arrives at 1, so that the remaining 300 or more data points are not included in the fit. I thought of cutting off 50 data points at a time, starting from the end, if their average is close to 1, and dropping them until I arrive at the data that I want to be included.
When I try to add a function where I try to determine the average of the last 50 datapoints with e.g. passing the y-vector from above like this:
def cutdata(y):
    lastfifty = y.tail(50).average
I receive the error message
AttributeError: 'numpy.ndarray' object has no attribute 'tail'
Does my approach make sense and is it possible within the data frame?
- Thanks in advance, any help is greatly appreciated.
print(y)
gives
[0.00203105 0.00407113 0.00145333 ... 0.99178177 0.97615621 0.97236191]

This has to do with the use of pd.Series.values, which will give you an np.ndarray instead of a pd.Series.
A conservative change to your code would move the use of .values into the curve_fit call. It may not even be necessary there, since a pd.Series is array-like and is accepted by most NumPy and SciPy functions.
for col in dfnormalized.columns:
    x = df.iloc[:,0]
    y = dfnormalized[col] # No more .values here.
    popt = curve_fit(func_2_2, x, y.values, p0=p0, maxfev=10000)
    col_params[col] = popt[0]
The essential part is highlighted by the comment, which is that your y variable will remain a pd.Series. Then you can get the average of the last observations.
y.tail(50).mean()
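Building on that, the trimming idea from the question could look roughly like the sketch below. The function name trim_plateau, the window of 50 points, and the tolerance of 0.02 are all illustrative assumptions, and the sigmoid test column is synthetic.

```python
import numpy as np
import pandas as pd

def trim_plateau(y, window=50, tol=0.02):
    """Drop trailing blocks of `window` points while their mean is within `tol` of 1."""
    end = len(y)
    while end > window and abs(y.iloc[end - window:end].mean() - 1.0) < tol:
        end -= window
    return y.iloc[:end]

# Synthetic normalized column: a sigmoid that plateaus long before the last point
x = np.arange(1000)
y = pd.Series(1 / (1 + np.exp(-0.05 * (x - 100))))

trimmed = trim_plateau(y)
print(len(y), len(trimmed))
```

The trimmed Series keeps its original index, so the matching x values for curve_fit can be selected with the same index, for example x.loc[trimmed.index] when x is a Series aligned with y.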

applying a generalized additive model to an xarray

I have a netCDF file which I have read with xarray. The array contains times, latitude, longitude and only one data variable (i.e. index values)
# read the netCDF files
with xr.open_mfdataset('wet_tropics.nc') as wet:
    print(wet)
Out[]:
<xarray.Dataset>
Dimensions: (time: 1437, x: 24, y: 20)
Coordinates:
* y (y) float64 -1.878e+06 -1.878e+06 -1.878e+06 -1.878e+06 ...
* x (x) float64 1.468e+06 1.468e+06 1.468e+06 1.468e+06 ...
* time (time) object '2013-03-29T00:22:28.500000000' ...
Data variables:
index_values (time, y, x) float64 dask.array<shape=(1437, 20, 24), chunksize=(1437, 20, 24)>
So far, so good.
Now I need to apply a generalized additive model to each grid cell in the array. The model I want to use comes from Facebook Prophet (https://facebook.github.io/prophet/) and I have successfully applied it to a pandas array of data before. For example:
cns_ap['y'] = cns_ap['av_index'] # Prophet requires specific names 'y' and 'ds' for column names
cns_ap['ds'] = cns_ap['Date']
cns_ap['cap'] = 1
m1 = Prophet(weekly_seasonality=False, # disables weekly_seasonality
             daily_seasonality=False, # disables daily_seasonality
             growth='logistic', # logistic because indices have a maximum
             yearly_seasonality=4, # fourier transform. int between 1-10
             changepoint_prior_scale=0.5).fit(cns_ap)
future1 = m1.make_future_dataframe(periods=60, # 5 year prediction
                                   freq='M', # monthly predictions
                                   include_history=True) # fits model to all historical data
future1['cap'] = 1 # sets cap at maximum index value
forecast1 = m1.predict(future1)
# m1.plot_components(forecast1, plot_cap=False);
# m1.plot(forecast1, plot_cap=False, ylabel='CNS index', xlabel='Year');
The problem is that now I have to:
1) iterate through every cell of the netCDF file,
2) get all the values for that cell through time,
3) apply the GAM (using fbprophet), and then export and plot the results.
The question: do you have any ideas on how to loop through the raster and get the index_values of each pixel for all times so that I can run the GAM?
I think that a nested for loop would be feasible, although I don't know how to make one that goes through every cell.
Any help is appreciated
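One way to picture the per-cell loop, shown here on a plain NumPy array rather than on the netCDF file itself, is the sketch below. Everything in it is an assumption for illustration: cube is synthetic data in the same (time, y, x) order as index_values, and fit_cell is a hypothetical stand-in that fits a linear trend where the Prophet model would go. With xarray, the underlying array could presumably be obtained with something like wet['index_values'].values.

```python
import numpy as np

def fit_cell(series):
    """Hypothetical stand-in for the per-pixel model (e.g. a Prophet fit).

    Here it just returns the slope of a linear trend through the series.
    """
    t = np.arange(series.size)
    return np.polyfit(t, series, 1)[0]

# Synthetic cube with the same dimension order as index_values: (time, y, x)
rng = np.random.default_rng(0)
cube = rng.normal(size=(30, 4, 5))

# One output value per grid cell
result = np.full(cube.shape[1:], np.nan)
for iy, ix in np.ndindex(*cube.shape[1:]):
    series = cube[:, iy, ix]       # all values of this cell through time
    if np.isnan(series).all():     # skip cells with no data
        continue
    result[iy, ix] = fit_cell(series)
```

np.ndindex replaces the nested for loop over y and x, and the result array keeps the spatial shape, so it can be plotted or written back out as a 2D field.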

Time series forecasting using statsmodels

I am attempting to forecast a year's worth of values in a time series (ts) using an ARIMA model, but I can't get sensible forecasted values. The predicted values are on a different scale (the last value from the dataset is 339, while the predicted ones are very small), and I am not sure where to tweak the code. I tried changing fill_value to a different value, but I don't know if this is the proper method.
I suppose this might also have something to do with this line:
predictions_ARIMA_log = pd.Series(ts_log.ix[0], index=ts_log.index)
Is there a way to extend the index to cover forecasted values?
The code is below:
ts_log = np.log(ts)
ts_log_diff = ts_log - ts_log.shift()
model = ARIMA(ts_log, order=(2, 1, 2))
results_ARIMA = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_log_diff)**2))
predictions_ARIMA_diff = pd.Series(results_ARIMA.predict('1949-02-01','1961-12-01'), copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.ix[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))
So here you can see how the results look, first one is the last value I have and beginning with 1961-01-01 I have predicted values.
1960-12-01 339.216967
1961-01-01 3.111950
1961-02-01 3.295407
1961-03-01 3.540066
1961-04-01 3.789093
1961-05-01 3.980322
1961-06-01 4.068641
1961-07-01 4.045327
1961-08-01 3.939715
1961-09-01 3.802622
1961-10-01 3.684713
1961-11-01 3.622262
1961-12-01 3.632668
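The output above is consistent with the base series only covering ts_log.index: for the forecast months, add(..., fill_value=0) has no base level to add to, so only the cumulative differences are exponentiated. A possible fix, sketched here with pandas only, is to build the base series on an index that also covers the forecast dates. The synthetic ts_log and the constant pred_diff below are stand-ins for the real series and for results_ARIMA.predict(...), and .iloc[0] is used in place of .ix[0].

```python
import numpy as np
import pandas as pd

# Synthetic monthly log-series standing in for ts_log
idx = pd.date_range("1949-01-01", "1960-12-01", freq="MS")
ts_log = pd.Series(np.linspace(4.7, 5.8, len(idx)), index=idx)

# Stand-in for the predicted differences: history plus 12 forecast months
full_idx = pd.date_range("1949-02-01", "1961-12-01", freq="MS")
pred_diff = pd.Series(0.008, index=full_idx)

pred_diff_cumsum = pred_diff.cumsum()

# Base level defined on the extended index, so forecast months also start from ts_log[0]
extended_idx = ts_log.index.union(pred_diff_cumsum.index)
base = pd.Series(ts_log.iloc[0], index=extended_idx)

pred_log = base.add(pred_diff_cumsum, fill_value=0)
pred = np.exp(pred_log)
```

With the base level present for every date, the forecast months end up on the same scale as the historical values.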
