I would like to use pandas and statsmodels to fit a linear model on subsets of a dataframe and return the predicted values. However, I am having trouble figuring out the right pandas idiom to use. Here is what I am trying to do:
import pandas as pd
import statsmodels.formula.api as sm
import seaborn as sns
tips = sns.load_dataset("tips")
def fit_predict(df):
    m = sm.ols("tip ~ total_bill", df).fit()
    return pd.Series(m.predict(df), index=df.index)
tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
This raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-139-b3d2575e2def> in <module>()
----> 1 tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
3033 return self._transform_general(func, *args, **kwargs)
3034 except:
-> 3035 return self._transform_general(func, *args, **kwargs)
3036
3037 # a reduction transform
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _transform_general(self, func, *args, **kwargs)
2988 group.T.values[:] = res
2989 else:
-> 2990 group.values[:] = res
2991
2992 applied.append(group)
ValueError: could not broadcast input array from shape (62) into shape (62,6)
The error makes sense in that I think .transform wants to map a DataFrame to a DataFrame. But is there a way to do a groupby operation on a DataFrame, pass each chunk into a function that reduces it to a Series (with the same index), and then combine the resulting Series into something that can be inserted into the original dataframe?
The top part here is the same; I'm just using a toy dataset because I'm behind a firewall.
tips = pd.DataFrame({'day': list('MMMFFF'), 'tip': range(6),
                     'total_bill': [10, 40, 20, 80, 50, 40]})

def fit_predict(df):
    m = sm.ols("tip ~ total_bill", df).fit()
    return pd.Series(m.predict(df), index=df.index)
If you change 'transform' to 'apply', you'll get:
tips.groupby("day").apply(fit_predict)
day
F    3    2.923077
     4    4.307692
     5    4.769231
M    0    0.714286
     1    1.357143
     2    0.928571
That's not quite what you want, but if you drop level 0 of the resulting index, you can proceed as desired:
tips['predicted'] = tips.groupby("day").apply(fit_predict).reset_index(level=0, drop=True)
day tip total_bill predicted
0 M 0 10 0.714286
1 M 1 40 1.357143
2 M 2 20 0.928571
3 F 3 80 2.923077
4 F 4 50 4.307692
5 F 5 40 4.769231
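Alternatively, passing group_keys=False to groupby keeps the original index, so the reset_index step isn't needed; a minimal sketch using the same fit_predict as above:

tips['predicted'] = tips.groupby("day", group_keys=False).apply(fit_predict)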
I'm working with a pandas DataFrame in Python, but in order to plot my data as a map I have to transform it into an xarray Dataset, since the library I'm using to plot (salem) works best with this class. The problem I'm having is that the grid of my data isn't regular, so I can't seem to create the Dataset.
My DataFrame has the latitude and longitude, as well as the value at each point:
lon lat value
0 -104.936302 -51.339233 7.908411
1 -104.827377 -51.127686 7.969049
2 -104.719154 -50.915470 8.036676
3 -104.611641 -50.702595 8.096765
4 -104.504814 -50.489056 8.163690
... ... ... ...
65995 -32.911377 15.359591 25.475702
65996 -32.957718 15.579139 25.443994
65997 -33.004040 15.798100 25.429346
65998 -33.050335 16.016472 25.408105
65999 -33.096611 16.234255 25.383844
[66000 rows x 3 columns]
In order to create the Dataset using lat and lon as coordinates and fill all of the missing values with NaN, I was trying the following:
ds = xr.Dataset({
    'ts': xr.DataArray(
        data = value,   # enter data here
        dims = ['lon', 'lat'],
        coords = {'lon': lon, 'lat': lat},
        attrs = {
            '_FillValue': np.nan,
            'units': 'K'
        }
    )},
    attrs = {'attr': 'RegCM output'}
)
ds
But I got the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [41], in <cell line: 1>()
1 ds = xr.Dataset({
----> 2 'ts': xr.DataArray(
3 data = value, # enter data here
4 dims = ['lon','lat'],
5 coords = {'lon': lon, 'lat':lat},
6 attrs = {
7 '_FillValue': np.nan,
8 'units' : 'K'
9 }
10 )},
11 attrs = {'example_attr': 'this is a global attribute'}
12 )
14 # ds = xr.Dataset(
15 # data_vars=dict(
16 # variable=(["lon", "lat"], value)
(...)
25 # }
26 # )
27 ds
File ~\anaconda3\lib\site-packages\xarray\core\dataarray.py:406, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
404 data = _check_data_shape(data, coords, dims)
405 data = as_compatible_data(data)
--> 406 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
407 variable = Variable(dims, data, attrs, fastpath=True)
408 indexes = dict(
409 _extract_indexes_from_coords(coords)
410 ) # needed for to_dataset
File ~\anaconda3\lib\site-packages\xarray\core\dataarray.py:123, in _infer_coords_and_dims(shape, coords, dims)
121 dims = tuple(dims)
122 elif len(dims) != len(shape):
--> 123 raise ValueError(
124 "different number of dimensions on data "
125 f"and dims: {len(shape)} vs {len(dims)}"
126 )
127 else:
128 for d in dims:
ValueError: different number of dimensions on data and dims: 1 vs 2
I would really appreciate any insights to solve this.
If you really require a rectangularly gridded dataset, you need to resample your data onto a regular grid (rasterio, pyresample, etc. provide useful functionality for that); a minimal sketch of that route is shown below.
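For the resampling route, here is a minimal sketch using scipy.interpolate.griddata (one option among several; rasterio and pyresample give you more control). It assumes df is the lon/lat/value DataFrame from the question, and the 200x200 target grid is an arbitrary choice:

import numpy as np
import xarray as xr
from scipy.interpolate import griddata

# target grid: the 200x200 resolution is an arbitrary choice
lon_grid = np.linspace(df["lon"].min(), df["lon"].max(), 200)
lat_grid = np.linspace(df["lat"].min(), df["lat"].max(), 200)
lon2d, lat2d = np.meshgrid(lon_grid, lat_grid)

# interpolate the scattered points onto the regular grid;
# points outside the convex hull of the data become NaN
gridded = griddata((df["lon"], df["lat"]), df["value"],
                   (lon2d, lat2d), method="linear")

ds = xr.Dataset(
    {"ts": (("lat", "lon"), gridded, {"units": "K"})},
    coords={"lat": lat_grid, "lon": lon_grid},
    attrs={"attr": "RegCM output"},
)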
However, if you just want to plot the data, resampling is not necessary! Not sure about salem (never used it so far), but I've tried my best to simplify the plotting of irregularly sampled data in EOmaps, the visualization library I'm developing!
You could get a contour-plot-like appearance if you use a Delaunay triangulation to visualize the data:
import pandas as pd
df = pd.read_csv("... path-to df.csv ...", index_col=0)
from eomaps import Maps
m = Maps()
m.add_feature.preset.coastline()
m.set_data(df, x="lon", y="lat", crs=4326, parameter="value")
m.set_shape.delaunay_triangulation()
m.plot_map()
I have the following dataframe called 'data':
            Revenue Index
Month
1920-01-01           1.72
1920-02-01           1.83
1920-03-01           1.94
...                   ...
2021-10-01         114.20
2021-11-01         115.94
2021-12-01         116.01
This is essentially a monthly revenue index on which I am trying to use seasonal_decompose with the following code:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative')
But unfortunately I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-08e3139bbf77> in <module>()
----> 1 result = seasonal_decompose(data['Consumptieprijsindex'], model='multiplicative')
2 rcParams['figure.figsize'] = 12, 6
3 plt.rc('lines', linewidth=1, color='r')
4
5 fig = result.plot()
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq, two_sided, extrapolate_trend)
125 freq = pfreq
126 else:
--> 127 raise ValueError("You must specify a freq or x must be a "
128 "pandas object with a timeseries index with "
129 "a freq not set to None")
ValueError: You must specify a freq or x must be a pandas object with a timeseries index with a freq not set to None
Does anyone know how to solve this issue? Thanks!
The following code in the comments answered my question:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative', period=12)
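An alternative is to give the index an explicit frequency so seasonal_decompose can infer the period itself; a small sketch, assuming the index is already a DatetimeIndex of month starts:

data = data.asfreq('MS')  # mark the index as monthly (month start)
result = seasonal_decompose(data['Revenue Index'], model='multiplicative')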
I want to use xarray functionality to reduce a dataset by a custom/external function across a named dimension.
Create dataset to demonstrate the problem
import xarray as xr
import numpy as np
import pandas as pd
time = pd.date_range("2000-01-01", "2001-01-01", freq="D")
sids = np.arange(4)
obs = np.random.random(size=(len(time), len(sids)))
sim = np.random.random(size=(len(time), len(sids)))
original = xr.Dataset(
    {"obs": (("time", "station_id"), obs),
     "sim": (("time", "station_id"), sim)},
    coords={"time": time, "station_id": sids},
)
I want to calculate the mean_squared_error using the two variables in original, calculating the metric by collapsing the "time" dimension. This should return an xr.Dataset like the following:
<xarray.Dataset>
Dimensions: (station_id: 4)
Coordinates:
* station_id (station_id) int64 0 1 2 3
Data variables:
mean_squared_error (station_id) float64 0.4411 0.183 0.06754 0.9662
I have tried using the reduce function:
from sklearn.metrics import mean_squared_error
original.reduce(mean_squared_error, dim="time")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-243-51111f05437b> in <module>
----> 1 original.reduce(mean_squared_error, dim="time")
~/miniconda3/envs/ml/lib/python3.8/site-packages/xarray/core/dataset.py in reduce(self, func, dim, keep_attrs, keepdims, numeric_only, **kwargs)
4915 # the former is often more efficient
4916 reduce_dims = None # type: ignore[assignment]
-> 4917 variables[name] = var.reduce(
4918 func,
4919 dim=reduce_dims,
~/miniconda3/envs/ml/lib/python3.8/site-packages/xarray/core/variable.py in reduce(self, func, dim, axis, keep_attrs, keepdims, **kwargs)
1721 )
1722 if axis is not None:
-> 1723 data = func(self.data, axis=axis, **kwargs)
1724 else:
1725 data = func(self.data, **kwargs)
~/miniconda3/envs/ml/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
TypeError: mean_squared_error() got an unexpected keyword argument 'axis'
There is a package called xskillscore, which has a method to calculate the MSE.
pip install xskillscore
xskillscore.mse(original.obs, original.sim, 'time')
I believe this would work (note that the np.sqrt makes it the RMSE; drop it if you want the plain MSE):
np.sqrt(np.square(original["sim"] - original["obs"]).mean(dim="time"))
One solution does not use the internal functions of xarray, but instead requires you to loop over all values of your station_id dimension.
from collections import defaultdict

# calculate error metric for each station separately
out = defaultdict(list)
for sid in original.station_id.values:
    data = original.sel(station_id=sid)
    orig_err = np.sqrt(mean_squared_error(data["obs"], data["sim"]))
    out["original"].append(orig_err)
    out["station_id"].append(sid)

rmse = pd.DataFrame(out).set_index("station_id").to_xarray()
This gives you the solution but does not use the internal broadcasting features of xarray and so would struggle with larger datasets.
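For completeness, a vectorized alternative is xr.apply_ufunc, which collapses "time" as a core dimension without an explicit Python loop; a sketch, assuming obs and sim contain no NaNs:

def mse(a, b, axis=-1):
    # apply_ufunc moves the core dimension ("time") to the last axis
    return np.square(a - b).mean(axis=axis)

result = xr.apply_ufunc(
    mse, original["obs"], original["sim"],
    input_core_dims=[["time"], ["time"]],
)
ds_out = result.to_dataset(name="mean_squared_error")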
I am using the playerStats.csv file, which includes 8 columns, of which I only need 2. So I'm trying to create a new DataFrame with only those 2 columns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv("HLTVData/playerStats.csv")
dataset.head(20)
I only need the ADR and the Rating.
So I first create a matrix from the dataset.
mat = dataset.as_matrix()
#4 is the ADR and 6 is the Rating
newDataSet = pd.DataFrame(dataset, index=indexMatrix, columns=(mat[:,4], mat[:,6]))
But it didn't work; it threw an exception:
NameError Traceback (most recent call last)
<ipython-input-10-1f975cc2514a> in <module>()
1 #4 is the ADR and 6 is the Rating
----> 2 newDataSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
NameError: name 'indexMatrix' is not defined
I also tried using the dataset directly:
newDataSet = pd.DataFrame(dataset, index=np.array(range(dataset.shape[0])), columns=dataset['ADR'])
/home/tensor/miniconda3/envs/tensorflow35openvc/lib/python3.5/site-packages/pandas/core/internals.py in _make_na_block(self, placement, fill_value)
3984
3985 dtype, fill_value = infer_dtype_from_scalar(fill_value)
-> 3986 block_values = np.empty(block_shape, dtype=dtype)
3987 block_values.fill(fill_value)
3988 return make_block(block_values, placement=placement)
MemoryError:
I think you need parameter usecols in read_csv:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=['ADR','Rating'])
Or:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=[4,6])
Given the code:
import statsmodels.api as sm
import statsmodels.formula.api as smf
df.reset_index(drop=True, inplace=True)
display(df.describe())
md = smf.mixedlm("c ~ iscorr", df, groups=df.subnum)
mdf = md.fit()
Where df is a pandas.DataFrame, I get the following error out of smf.mixedlm:
IndexError Traceback (most recent call last)
<ipython-input-34-5373fe9b774a> in <module>()
4 df.reset_index(drop=True, inplace=True)
5 display(df.describe())
----> 6 md = smf.mixedlm("c ~ iscorr", df, groups=df.subnum)
7 # mdf = md.fit()
/home/lthibault/.pyenv/versions/3.5.0/lib/python3.5/site-packages/statsmodels/regression/mixed_linear_model.py in from_formula(cls, formula, data, re_formula, subset, *args, **kwargs)
651 subset=None,
652 exog_re=exog_re,
--> 653 *args, **kwargs)
654
655 # expand re names to account for pairs of RE
/home/lthibault/.pyenv/versions/3.5.0/lib/python3.5/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, *args, **kwargs)
148 kwargs.update({'missing_idx': missing_idx,
149 'missing': missing})
--> 150 mod = cls(endog, exog, *args, **kwargs)
151 mod.formula = formula
152
/home/lthibault/.pyenv/versions/3.5.0/lib/python3.5/site-packages/statsmodels/regression/mixed_linear_model.py in __init__(self, endog, exog, groups, exog_re, use_sqrt, missing, **kwargs)
537
538 # Split the data by groups
--> 539 self.endog_li = self.group_list(self.endog)
540 self.exog_li = self.group_list(self.exog)
541 self.exog_re_li = self.group_list(self.exog_re)
/home/lthibault/.pyenv/versions/3.5.0/lib/python3.5/site-packages/statsmodels/regression/mixed_linear_model.py in group_list(self, array)
671 if array.ndim == 1:
672 return [np.array(array[self.row_indices[k]])
--> 673 for k in self.group_labels]
674 else:
675 return [np.array(array[self.row_indices[k], :])
/home/lthibault/.pyenv/versions/3.5.0/lib/python3.5/site-packages/statsmodels/regression/mixed_linear_model.py in <listcomp>(.0)
671 if array.ndim == 1:
672 return [np.array(array[self.row_indices[k]])
--> 673 for k in self.group_labels]
674 else:
675 return [np.array(array[self.row_indices[k], :])
IndexError: index 7214 is out of bounds for axis 1 with size 7214
Why is this error occurring? len(df) reports that there are 7296 rows, so there should be no issue indexing the 7214th, and the explicit re-indexing ensures that the indices span from zero to 7295.
You may download df here to fiddle around with it if you'd like.
You have 82 null values in iscorr:
>>> df.iscorr.isnull().sum()
82
Drop them and you will be fine:
df = df[df.iscorr.notnull()]
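Equivalently, with dropna:

df = df.dropna(subset=['iscorr'])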
Per the function's docstring:
Notes
-----
`data` must define __getitem__ with the keys in the formula
terms; args and kwargs are passed on to the model
instantiation. E.g., a numpy structured or rec array, a
dictionary, or a pandas DataFrame.

If `re_formula` is not provided, the default is a random
intercept for each group.

This method currently does not correctly handle missing
values, so missing values should be explicitly dropped from
the DataFrame before calling this method.
Output:
>>> mdf.params
Intercept 0.032000
iscorr[T.True] 0.030670
Intercept RE -0.057462