Pyspark using Window function with my own function - python

I have some pandas code that calculates the R2 of a linear regression over a rolling window of size x. Here is my code:
import numpy as np
from sklearn.linear_model import LinearRegression

def lr_r2_Sklearn(data):
    # Regress the window values on their positions 0..n-1 and return the R2.
    data = np.array(data)
    X = np.arange(len(data)).reshape(-1, 1)
    Y = data.reshape(-1, 1)
    regressor = LinearRegression()
    regressor.fit(X, Y)
    return regressor.score(X, Y)

r2_rolling = df[['value']].rolling(300).agg([lr_r2_Sklearn])
I use a rolling window of size 300 and calculate the R2 for each window. I want to do exactly the same thing with PySpark and a Spark DataFrame. I know I must use the Window function, but it is harder to understand than pandas, so I am lost.
I have this, but I don't know how to make it work:
w = Window().partitionBy(lit(1)).rowsBetween(-299,0)
data.select(lr_r2('value').over(w).alias('r2')).show()
(lr_r2 returns the R2.)
Thanks !

You need a pandas UDF applied over a bounded window. This is not possible before Spark 3.0, where support is still in development.
See the answer to: User defined function to be applied to Window in PySpark?
However, you can explore the ml package of PySpark:
http://spark.apache.org/docs/2.4.0/api/python/pyspark.ml.html#pyspark.ml.classification.LinearSVC
You can define a model such as LinearSVC and pass it different parts of the DataFrame after assembling the features. I suggest using a pipeline whose stages are an assembler and a classifier, then calling it in a loop over the parts of your DataFrame, selecting each part by filtering on some unique id.
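For the specific case in the question (regressing the window values on their positions 0..n-1), the R2 equals the squared Pearson correlation between position and value, so it can be computed from plain sums without fitting a model. The following is a minimal pure-Python sketch of that identity, not Spark code. The helper names r2_against_index and rolling_r2 are made up for illustration, and returning NaN for a constant window is a choice made here.

```python
import math

def r2_against_index(window):
    """R2 of an OLS fit of the window values on 0..n-1 (squared Pearson correlation)."""
    n = len(window)
    mean_x = (n - 1) / 2
    mean_y = sum(window) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(window))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    syy = sum((y - mean_y) ** 2 for y in window)
    if sxx == 0 or syy == 0:
        return math.nan  # single point or constant window: R2 is undefined here
    return sxy * sxy / (sxx * syy)

def rolling_r2(values, size):
    """One R2 per row over the trailing `size` rows; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < size:
            out.append(None)
        else:
            out.append(r2_against_index(values[i + 1 - size : i + 1]))
    return out

print(rolling_r2([1.0, 2.0, 3.0, 5.0, 4.0], 3))
```

Because the result depends only on sums over the window, the same quantity can in principle be expressed with ordinary aggregate expressions over a row-based window instead of a Python UDF. Treat that as a direction to explore rather than a tested recipe.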

Related

Scikit-learn QuantileRegressor memory allocation error. No issue with statsmodel QuantReg with the same data

I'm trying to fit a quantile regression model to my input data. I would like to use sklearn, but I am getting a memory allocation error when I try to fit the model. The same data with the statsmodels equivalent function is working fine.
The error I get is the following:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 55.9 GiB for an array with shape (86636, 86636) and data type float64
This doesn't make sense: my X and y have shapes (86636, 4) and (86636, 1) respectively.
Here's my script:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import QuantileRegressor
training_df = pd.read_csv("/path/to/training_df.csv") # 86,000 rows
FEATURES = [
    "feature_1",
    "feature_2",
    "feature_3",
    "feature_4",
]
TARGET = "target"
# STATSMODELS WORKS FINE WITH 86,000, RUNS IN 2-3 SECONDS.
model_statsmodels = sm.QuantReg(training_df[TARGET], training_df[FEATURES]).fit(q=0.5)
# SKLEARN GIVES A MEMORY ALLOCATION ERROR, OR TAKES MINUTES TO RUN IF I SIGNIFICANTLY TRIM THE DATA TO < 1000 ROWS.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0)
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])
I've checked the sklearn documentation and I'm pretty sure my inputs are fine as DataFrames; I get the same issue with NumPy ndarrays. So I'm not sure what the problem is. Could there be an issue with something under the hood?
[Here][1] is the scikit-learn documentation for QuantileRegressor.
Many thanks for any help / ideas.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html
The sklearn QuantileRegressor class solves the quantile regression problem with linear programming, which is much more computationally expensive than the iteratively reweighted least squares used by the statsmodels QuantReg class.
Here is a github issue for the same problem: https://github.com/scikit-learn/scikit-learn/issues/22922
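The size in the error message is what a dense square array with one row and one column per training sample costs, which is consistent with a linear-programming formulation that carries per-sample auxiliary variables. A quick check of the arithmetic, using only the numbers from the question (the variable names are just for illustration):

```python
n_samples = 86636          # rows in the training data, from the question
n_features = 4             # columns in X, from the question
bytes_per_float64 = 8

# Dense (n_samples, n_samples) float64 array, as reported in the error.
dense_bytes = n_samples * n_samples * bytes_per_float64
dense_gib = dense_bytes / 2**30

# The feature matrix itself, for comparison.
features_mib = n_samples * n_features * bytes_per_float64 / 2**20

print(f"{dense_gib:.1f} GiB")     # 55.9 GiB
print(f"{features_mib:.2f} MiB")  # 2.64 MiB
```

So the data itself is tiny; the memory goes into the n-by-n structure built around it, which grows quadratically with the number of rows.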

Loading saved params (a Pandas series) into a Statsmodels state-space model

I'm building a dynamic factor model using the excellent python package statsmodels, and I would like to pickle an estimated parameter vector so I can build the model again later, and load those params into it. (C.f., this Notebook built by Chad Fulton: https://github.com/ChadFulton/tsa-notebooks/blob/master/dfm_coincident.ipynb.)
In the following block of code, initial parameters are estimated with mod.fit() (using the Powell algo) and then given back to mod.fit() to complete the estimation (using the EM algo) using the initial parameters as initial_res.params. (The latter is a Pandas Series.)
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
I would like to pickle res.params (again, a small Pandas Series, and a small disk footprint). Then later build the model from scratch again, and load my saved parameters into it without having to re-estimate the model. Anyone know how that can be done?
Examples I have seen suggest pickling the results object res, but that can be a pretty big save. Building it from scratch is pretty simple, but estimation takes a while. It may be that estimation starting from the saved optimal params is quicker; but still, that's pretty amateurish, right?
TIA,
Drew
You can use the smooth method on any state space model to construct a results object from specific parameters. In your example:
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
res.params.to_csv(...)
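# ...later...
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)  # rebuild the model, no fitting
params = pd.read_csv(..., index_col=0).squeeze(axis="columns")  # read back as a Series
res = mod.smooth(params)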
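One detail worth checking when going through CSV is that pd.read_csv returns a DataFrame, so the saved vector has to be turned back into a Series. Below is a small sketch of that round trip. The parameter names and values are hypothetical stand-ins for res.params, and an in-memory buffer replaces the file on disk.

```python
import io
import pandas as pd

# Hypothetical parameter vector standing in for res.params.
params = pd.Series({"loading.f1.y1": 0.5, "sigma2.y1": 1.25, "L1.f1.f1": -0.25})

buffer = io.StringIO()      # stands in for a file on disk
params.to_csv(buffer)       # writes the parameter names and one value column
buffer.seek(0)

loaded_frame = pd.read_csv(buffer, index_col=0)   # a one-column DataFrame
loaded = loaded_frame.squeeze(axis="columns")     # back to a Series

print(type(loaded_frame).__name__, type(loaded).__name__)
print(loaded.tolist() == params.tolist())
```

Pickling the Series (params.to_pickle and pd.read_pickle) is an alternative that keeps the type without the extra step; CSV has the advantage of being readable by hand.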

Calculate xarray dataarray from coordinate labels

I have a DataArray with two variables (meteorological data) over time, y, x coordinates. The x and y coordinates are in a projected coordinate system (EPSG:3035) and are aligned so that each cell covers almost exactly one standard cell of the 1 km LAEA reference grid.
I want to prepare the data for further use in pandas and/or database tables, so I want to add the LAEA grid cell number/label, which can be calculated directly from x and y via the following (pseudo) function:
def func(cell):
    return r'1km{}{}'.format(int(cell['y']/1000), int(cell['x']/1000)) # e.g. 1kmN2782E4850
But as far as I can see, there seems to be no way to apply this function to a DataArray or Dataset so that I have access to these coordinate variables (at least .apply_ufunc() wasn't really working for me).
I am able to calculate this in pandas later on, but some of my datasets consist of 60 to 120 million cells/rows, and pandas (even with Numba) seems to have trouble with that amount. With xarray I can process this on 32 cores via Dask.
I would be grateful for any advice on how to get this working.
EDIT: Some more insight into the data I'm working with:
This is the largest one, with 500 million cells, but I can downsample it to square-kilometre resolution, which leaves about 160 million cells.
If the dataset is small enough, I can export it as a pandas DataFrame and calculate there, but that's slow and not very robust, as the kernel crashes quite often.
This is how you can apply your function:
import xarray as xr

# ufunc
def func(x, y):
    #print(y)
    return r'1km{}{}'.format(int(y), int(x))

# test data
ds = xr.tutorial.load_dataset("rasm")

xr.apply_ufunc(
    func,
    ds.x,
    ds.y,
    vectorize=True,
)
Note that you don't have to list input_core_dims in your case.
Also, since your function isn't vectorized, you need to set vectorize=True:
    vectorize : bool, optional
        If True, then assume func only takes arrays defined over core
        dimensions as input and vectorize it automatically with
        :py:func:`numpy.vectorize`. This option exists for convenience, but is
        almost always slower than supplying a pre-vectorized function.
        Using this option requires NumPy version 1.12 or newer.
Using vectorize might not be the most performant option, as it is essentially just looping, but if you have your data in chunks and use dask, it might be good enough.
If not, you could look into creating a vectorized function, e.g. with numba, which should speed things up.
More info can be found in the xarray tutorial on applying ufuncs.
You can use apply_ufunc in an unvectorised way:
def func(x, y):
    return f'1km{int(y/1000)}{int(x/1000)}' # e.g. 1kmN2782E4850

xr.apply_ufunc(
    func, # first the function
    x.x, # now arguments in the order expected by 'func'
    x.y
)
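Since the label depends only on the coordinate values, it can also be built with NumPy array operations, without calling a Python function per cell. The sketch below is one way to do that under some assumptions: the coordinate values are made up, and the "N" and "E" prefixes are taken from the example label in the question (1kmN2782E4850), which the question's function as written does not add.

```python
import numpy as np

# Hypothetical cell coordinates in metres (EPSG:3035), for illustration only.
y = np.array([2782000.0, 2783000.0])
x = np.array([4850000.0, 4851000.0, 4852000.0])

# Format each axis once: len(y) + len(x) conversions instead of one per cell.
y_part = np.char.add("1kmN", (y // 1000).astype(np.int64).astype(str))
x_part = np.char.add("E", (x // 1000).astype(np.int64).astype(str))

# Broadcast (ny, 1) against (1, nx) to get one label per grid cell.
labels = np.char.add(y_part[:, None], x_part[None, :])

print(labels.shape)   # (2, 3)
print(labels[0, 0])   # 1kmN2782E4850
```

The labels do not depend on time, so the resulting 2-D array only needs to be computed once per grid and could then be attached to the dataset as a coordinate over (y, x).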

pyspark ML LabeledPoint not working with LinearRegression

I'm studying Spark 3.0.1 with pyspark, and have set up some data for a simple OLS regression using
data = results.select('OrderMonthYear', 'SaleAmount').rdd.map(lambda row: LabeledPoint(row[1], [row[0]])).toDF()
The OrderMonthYear is my feature column (int), and SaleAmount is the response (float). The LabeledPoint method was imported from pyspark.mllib.regression. I then try to fit the regression model with
from pyspark.ml.regression import LinearRegression
lr = LinearRegression()
modelA = lr.fit(data, {lr.regParam:0.0})
to get this exception
IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
This is clearly not very helpful, as the required and the actual feature types appear to be the same struct. I've searched online and only found answers to this problem for Java, or for someone building the struct themselves. The exception was raised from a utility function that just rethrows the Java exception while hiding the non-Pythonic JVM stack trace, so I can't debug further.
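The RDD-based pyspark.mllib API is in maintenance mode, and its LabeledPoint produces pyspark.mllib.linalg vectors, whereas pyspark.ml estimators expect pyspark.ml.linalg vectors. The two types share the same struct representation, which is why the message shows identical structs. I suggest using the VectorAssembler from pyspark.ml instead: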
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
data = spark.createDataFrame([[0,1],[1,2],[2,3]]).toDF('OrderMonthYear', 'SaleAmount')
# OrderMonthYear is the feature and SaleAmount the response, as in the question.
va = VectorAssembler(inputCols=['OrderMonthYear'], outputCol='features')
data2 = va.transform(data)
lr = LinearRegression(labelCol='SaleAmount')
model = lr.fit(data2)
For anyone else following the same LI Learning course: I modified the accepted answer above to match what I was seeing in the course. Here is what the Cmd 4 cell should look like:
# build the 'features' vector column and rename the response to 'label'
from pyspark.ml.feature import VectorAssembler
data = VectorAssembler(inputCols=['OrderMonthYear'], outputCol='features').transform(results.select("OrderMonthYear", "SaleAmount")).drop('OrderMonthYear').withColumnRenamed('SaleAmount', 'label')
display(data)
Alternatively, you can use the following which also works:
from pyspark.ml.linalg import Vectors
data = results.rdd.map(lambda r: (Vectors.dense(r[0]), r[1])).toDF(["features","label"])
display(data)
Then you should be good to go. Note that you'll want to make the same changes to Cmd 4 in notebooks 4.4 and 4.5 as well. Hope this helps!

Different predictions on multiple run of the same algorithm scikit neural network

An MLP can, in principle, implement any function, so I am trying to implement the AND function with the following code. However, when I run the program multiple times, I get different predicted values. Why is this happening? Also, how does one determine which type of activation function to use at different layers?
from sknn.mlp import Regressor,Layer,Classifier
import numpy as np
X_train = np.array([[0,0],[0,1],[1,0],[1,1]])
y_train = np.array([0,0,0,1])
nn = Classifier(layers=[Layer("Softmax", units=2),Layer("Linear", units=2)],learning_rate=0.001,n_iter=25)
nn.fit(X_train, y_train)
X_example = np.array([[0,0],[0,1],[1,0],[1,1]])
y_example = nn.predict(X_example)
print (y_example)
- You get different values on every run because the weights are randomly initialized.
- Activation functions have different properties. You can either rely on experience to decide which is best for your situation, or read about how they work (https://stats.stackexchange.com/questions/115258/comprehensive-list-of-activation-functions-in-neural-networks-with-pros-cons).
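To illustrate the first point, here is a minimal sketch of how fixing the random seed makes a run repeatable. It uses a single threshold unit written in plain Python rather than the sknn network from the question (an assumption made to keep the example dependency-free), and the names train_perceptron and predict are hypothetical.

```python
import random

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 0, 0, 1]  # logical AND

def train_perceptron(seed, lr=0.1, max_epochs=1000):
    """Train one threshold unit on AND, starting from seeded random weights."""
    rng = random.Random(seed)                  # private, reproducible generator
    w = [rng.uniform(-1, 1) for _ in range(2)]
    b = rng.uniform(-1, 1)
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), target in zip(X, Y):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != target:
                step = lr * (target - pred)    # classic perceptron update
                w[0] += step * x1
                w[1] += step * x2
                b += step
                mistakes += 1
        if mistakes == 0:                      # every point classified correctly
            break
    return w, b

def predict(w, b):
    return [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for x1, x2 in X]

w_a, b_a = train_perceptron(seed=0)
w_b, b_b = train_perceptron(seed=0)   # same seed: same starting weights
w_c, b_c = train_perceptron(seed=1)   # different seed: different starting weights

print(predict(w_a, b_a))              # [0, 0, 0, 1]
print((w_a, b_a) == (w_b, b_b))       # True
```

Most libraries expose an equivalent seed or random-state setting; check the documentation of the one you use to see how it is named.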
