I have an xarray.Dataset with temperature data and want to calculate the binned temperature for every element of the array using a 7-day rolling window.
I have data in this form:
import xarray as xr

ds = xr.Dataset(
    {'t2m': (['time', 'lat', 'lon'], t2m)},
    coords={
        'lon': lon,
        'lat': lat,
        'time': time,
    }
)
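For reference, a minimal synthetic setup that makes the snippet above runnable; the names t2m, lat, lon and time here are placeholders, not the real data:

import numpy as np
import pandas as pd

lat = np.linspace(-5, 5, 10)                                   # hypothetical 10 x 10 grid
lon = np.linspace(30, 40, 10)
time = pd.date_range('2021-01-01', periods=24, freq='D')       # 24 daily steps
t2m = 15 + 8 * np.random.rand(len(time), len(lat), len(lon))   # fake 2 m temperatures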
And then I use the rolling() method and apply a function on each window array:
r = ds.t2m.\
    chunk({'time': 10}).\
    rolling(time=7)
import numpy as np
from tqdm import tqdm

window_results = []
for label, arr_window in tqdm(r):
    max_temp = arr_window.max(dim=...).values
    min_temp = arr_window.min(dim=...).values
    if not np.isnan(max_temp):
        bins = np.arange(min_temp, max_temp, 2)
        buckets = np.digitize(arr_window.isel(time=-1), bins=bins)
        buckets_arr = xr.DataArray(buckets,
                                   dims={
                                       'lat': arr_window.lat.values,
                                       'lon': arr_window.lon.values
                                   })
        buckets_arr = buckets_arr.assign_coords({'time': label})
        window_results.append(buckets_arr)
At the end, I get a list with one window-calculation (a binned array) for each timestep:
ds_concat = xr.concat(window_results, dim='time')
ds_concat
>> <xarray.DataArray (time: 18, lat: 10, lon: 10)>
array([[[1, 2, 2, ..., 2, 2, 3],
        [1, 3, 3, ..., 1, 1, 2],
        [2, 3, 2, ..., 1, 2, 3],
        ...,
        [2, 2, 2, ..., 2, 2, 2],
        [2, 2, 2, ..., 1, 2, 2],
        [2, 2, 3, ..., 2, 3, 2]],
       ...
This code yields the results I am looking for, but I believe there must be a better way to apply this same process, either using apply_ufunc or dask. I am also using a dask.distributed.Client, so I am looking for a way to optimize my code to run faster.
Any help is appreciated.
I finally figured it out! Hope this can help someone with the same problem.
One of the coolest features of dask.distributed is dask.delayed. I can rewrite the loop above using a lazy function:
import dask
import numpy as np
import xarray as xr

@dask.delayed
def create_bucket_window(arr, label):
    max_temp = arr.max(dim=...).values
    min_temp = arr.min(dim=...).values
    if not np.isnan(max_temp):
        bins = np.arange(min_temp, max_temp, 2)
        buckets = np.digitize(arr.isel(time=-1), bins=bins)
        buckets_arr = xr.DataArray(buckets,
                                   dims={
                                       'lat': arr.lat.values,
                                       'lon': arr.lon.values
                                   })
        buckets_arr = buckets_arr.assign_coords({'time': label})
        return buckets_arr
and then:
window_results = []
for label, arr_window in tqdm(r):
    bucket_array = create_bucket_window(arr=arr_window, label=label)
    window_results.append(bucket_array)
Once I do this, dask will generate these arrays lazily and only evaluate them when needed:
dask.compute(*window_results)
And there you will have a collection of results!
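Since the question also asks about apply_ufunc, a fully vectorised sketch built on rolling(...).construct() may be worth comparing. It assumes the same binning rule as above (2-degree-wide bins anchored at each window's minimum over time, lat and lon) and may differ slightly from np.digitize at the very top bin:

windows = ds.t2m.rolling(time=7).construct('window')   # adds a length-7 'window' dimension
w_min = windows.min(dim=['window', 'lat', 'lon'])      # one scalar minimum per timestep
buckets = np.floor((ds.t2m - w_min) / 2) + 1           # bin index for each grid cell

Because these are plain xarray/dask operations, they parallelise over the existing chunks instead of creating one delayed task per window.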
I have a 2d numpy array called arm_resets that has positive integers. The first column has all positive integers < 360. For all columns other than the first, I need to replace all values over 360 with the value that is in the same row in the 1st column. I thought this would be a relatively easy thing to do; here's what I have:
i = 300
over_360 = arm_resets[:, [i]] >= 360
print(arm_resets[:, [i]][over_360])
print(arm_resets[:, [0]][over_360])
arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
print(arm_resets[:, [i]][over_360])
And here's what prints:
[3600 3609 3608 ... 3600 3611 3605]
[ 0 9 8 ... 0 11 5]
[3600 3609 3608 ... 3600 3611 3605]
Since all the numbers shown in the first print (the first 3 and last 3) are above 360, they should be getting replaced by the values in the second print, so the third print should differ. Why is this not working?
edit: reproducible example:
import numpy as np
import pandas as pd

df = pd.DataFrame({"start": [1, 2, 5, 6], "freq": [1, 5, 6, 9]})
periods = 6
arm_resets = df[["start"]].values
freq = df[["freq"]].values
arm_resets = np.pad(arm_resets, ((0, 0), (0, periods - 1)))
for i in range(1, periods):
    arm_resets[:, [i]] = arm_resets[:, [i-1]] + freq
    #over_360 = arm_resets[:, [i]] >= periods
    #arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
arm_resets
Given commented out code here's what prints:
array([[ 1,  2,  3,  4,  5,  6],
       [ 2,  7, 12, 17, 22, 27],
       [ 3,  9, 15, 21, 27, 33],
       [ 4, 13, 22, 31, 40, 49]])
What I would expect:
array([[ 1,  2,  3,  4,  5,  1],
       [ 2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4]])
Now if it helps, the final 2d array I'm actually trying to create is a 1/0 array that indicates which are filled in, so in this example I'd want this:
array([[ 0,  1,  1,  1,  1,  1],
       [ 0,  0,  1,  0,  0,  0],
       [ 0,  0,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  1,  0]])
The code I use to achieve this from the above arm_resets is this:
fin = np.zeros((len(arm_resets), periods), dtype=int)
for i in range(len(arm_resets)):
    fin[i, arm_resets[i]] = 1
The slice arm_resets[:, [i]] is a fancy index, and therefore makes a copy of the i-th column of the data. arm_resets[:, [i]][over_360] = ... therefore calls __setitem__ on a temporary array that is discarded as soon as the statement executes. If you want the assignment to stick, call __setitem__ on arm_resets itself (note that for these forms the mask needs to be one-dimensional, e.g. over_360 = arm_resets[:, i] >= 360):
arm_resets[over_360, [i]] = ...
You also don't need to make the index into a list. It's generally better to use simple indices, especially when doing assignments, since they create views rather than copies:
arm_resets[over_360, i] = ...
With slicing, even the following should work, since it calls __setitem__ on a view:
arm_resets[:, i][over_360] = ...
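If it helps to see the view-versus-copy distinction directly, here is a quick check (a small side sketch, not part of the fix itself):

a = np.arange(12).reshape(3, 4)
print(np.shares_memory(a, a[:, 1]))    # True: a simple index returns a view
print(np.shares_memory(a, a[:, [1]]))  # False: a fancy index returns a copy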
This index does not help you process the data row by row either, since i selects a column. In fact, you can process the entire matrix in one step, without looping, if you use integer indices rather than a boolean mask. Indices are useful because they let you match each hit to the value in the first column of the same row:
rows, cols = np.nonzero(arm_resets[:, 1:] >= 360)
arm_resets[rows, cols + 1] = arm_resets[rows, 0]   # cols is shifted by 1 because the mask skips the first column
You can use np.where()
first_col = arm_resets[:, 0]                      # first column
first_col = first_col.reshape(first_col.size, 1)  # reshape into a 2-d column so it broadcasts
arm_resets = np.where(arm_resets >= 360, first_col, arm_resets)
You can see in detail how np.where works here, but basically it compares arm_resets >= 360: where the condition is true it puts the first_col value in place (there is another detail here with broadcasting), and where it is false it keeps the arm_resets value.
Edit: as suggested by Mad Physicist, you can use arm_resets[:, 0, None] directly instead of creating the first_col variable:
arm_resets = np.where(arm_resets >= 360, arm_resets[:, 0, None], arm_resets)
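One note on the reproducible example: the commented-out replacement sits inside the column loop, which suggests the wrap-around is meant to be applied as each column is generated, so later columns build on the already-wrapped values. A sketch of wiring np.where in at that point (an assumption about the intent; it reuses periods as the threshold, as in the commented-out code):

for i in range(1, periods):
    arm_resets[:, [i]] = arm_resets[:, [i - 1]] + freq
    # wrap this column immediately so the next column starts from the wrapped value
    arm_resets[:, i] = np.where(arm_resets[:, i] >= periods,
                                arm_resets[:, 0],
                                arm_resets[:, i])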
I have a dataframe which stores different variables. I'm using OLS linear regression and using all of the variables to predict the 'price' column.
import pandas as pd
import statsmodels.api as sm

data = {'accommodates': [2, 2, 3, 2, 2, 6, 8, 4, 3, 2],
        'bedrooms': [1, 2, 1, 1, 3, 4, 2, 2, 2, 3],
        'instant_bookable': [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
        'availability_365': [123, 3, 33, 14, 15, 16, 3, 41, 61, 74],
        'minimum_nights': [3, 12, 1, 4, 6, 7, 2, 3, 6, 10],
        'beds': [2, 2, 3, 4, 1, 5, 6, 2, 3, 2],
        'price': [59, 234, 15, 162, 56, 42, 28, 52, 22, 31]}

df = pd.DataFrame(data, columns=['accommodates', 'bedrooms', 'instant_bookable', 'availability_365',
                                 'minimum_nights', 'beds', 'price'])
I have a for loop which calculates the Adjusted R squared value for each variable:
fit_d = {}
for columns in [x for x in df.columns if x != 'price']:
    Y = df['price']
    X = df[columns]
    X = sm.add_constant(X)

    model = sm.OLS(Y, X, missing='drop').fit()
    fit_d[columns] = model.rsquared
fit_d
How can I modify my code in order to find the combination of variables that give the largest Adjusted R squared value? Ideally the function would find the variable with the largest adj. R squared value first, then using the 1st variable iterate with the remaining variables to get 2 variables that give the highest value, then 3 variables etc. until the value cannot be increased further. I'd like the output to be something like
Best variables: {'accommodates', 'availability', 'bedrooms'}
Here is a "brute force" way to try all possible combinations (via itertools) of different lengths and find the set of variables with the highest R² value. The idea is to use two loops: one over the number of variables to try, and one over all combinations of that length.
from itertools import combinations

# all possible columns for X
cols = [x for x in df.columns if x != 'price']

# define Y once, the same across all loops
Y = df['price']

# result dictionary
fit_d = {}

# loop over every possible number of variables
for i in range(1, len(cols)+1):
    # loop over every combination of length i
    for comb in combinations(cols, i):
        # define X from the combination
        X = df[list(comb)]
        X = sm.add_constant(X)

        # perform the OLS operation
        model = sm.OLS(Y, X, missing='drop').fit()
        # save the rsquared in a dictionary
        fit_d[comb] = model.rsquared
# extract the key for the max R value
key_max = max(fit_d, key=fit_d.get)
print(f'Best variables {key_max} for a R-value of {round(fit_d[key_max], 5)}')
# Best variables ('accommodates', 'bedrooms', 'instant_bookable', 'availability_365', 'minimum_nights', 'beds') for a R-value of 0.78506
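Since the question describes a greedy, stepwise search rather than trying every combination, here is a sketch of that approach as well (an assumption about the desired procedure, reusing df, Y and sm from above and scoring with rsquared_adj, i.e. adjusted R²):

remaining = [x for x in df.columns if x != 'price']
selected = []
best_adj_r2 = float('-inf')

while remaining:
    # score every candidate added on top of the variables chosen so far
    scores = {}
    for col in remaining:
        X = sm.add_constant(df[selected + [col]])
        scores[col] = sm.OLS(Y, X, missing='drop').fit().rsquared_adj
    best_col = max(scores, key=scores.get)
    # stop as soon as adding another variable no longer improves adjusted R²
    if scores[best_col] <= best_adj_r2:
        break
    best_adj_r2 = scores[best_col]
    selected.append(best_col)
    remaining.remove(best_col)

print('Best variables:', selected)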
I have a function which implements the k-means algorithm and I want to use it with DataFrames in order to take the indexes into account. For the moment I use DataFrame.values and it works, yet I don't get the indexes in the output.
import random

import numpy as np

def cluster_points(X, mu):
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

def reevaluate_centers(mu, clusters):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis=0))
    return newmu

def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))

def find_centers(X, K):
    # Initialize to K random centers
    oldmu = random.sample(X, K)
    mu = random.sample(X, K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return (mu, clusters)
For instance, with this minimal and sufficient example:
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 5)), index=list(range(10)), columns=list(range(5)))
df.index.name = 'subscriber_id'
df.columns.name = 'ad_id'
I get:
find_centers(df.values, 2)
([array([ 3.8, 3. , 3.6, 2. , 3.6]),
  array([ 6.8, 3.6, 5.6, 6.8, 6.8])],
 {0: [array([2, 0, 5, 6, 4]),
      array([1, 1, 2, 3, 3]),
      array([6, 0, 4, 0, 3]),
      array([7, 9, 4, 1, 7]),
      array([3, 5, 3, 0, 1])],
  1: [array([6, 2, 5, 9, 6]),
      array([8, 9, 7, 2, 8]),
      array([7, 5, 3, 7, 8]),
      array([7, 1, 5, 7, 6]),
      array([6, 1, 8, 9, 6])]})
I have the values but don't have the indexes.
If you want to get the array of values including the index, you can simply add the index to the columns with reset_index():
values_with_index = df.reset_index().values
Update
If what you want is to have the index on the output, but not use it during the actual clustering, you can do the following. First, pass the actual data frame object to find_centers:
find_centers(df, 2)
Then change cluster_points as follows:
def cluster_points(X, mu):
clusters = {}
for _, x in X.iterrows():
bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
for i in enumerate(mu)], key=lambda t:t[1])[0]
# You can replace this try/except block with
# clusters.setdefault(bestmukey, []).append(x)
try:
clusters[bestmukey].append(x)
except KeyError:
clusters[bestmukey] = [x]
return clusters
The centers in the output will still be arrays, but the clusters will contain series objects with each row. The name property of each of these series is the index value in the data frame.
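To see what that buys you, here is a small hypothetical follow-up, assuming clusters is the dictionary returned by the modified cluster_points:

# map each cluster id to the subscriber_id labels of its member rows
cluster_ids = {k: [row.name for row in rows] for k, rows in clusters.items()}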
How do I get it to print just a list of the averages? I need it to be in exactly the same format as my numpy arrays so I can compare them and see whether they are the same or not.
Code:
import numpy as np
from pprint import pprint

centroids = np.array([[3,44],[4,15],[5,15]])
dataPoints = np.array([[2,4],[17,4],[45,2],[45,7],[16,32],[32,14],[20,56],[68,33]])

def size(vector):
    return np.sqrt(sum(x**2 for x in vector))

def distance(vector1, vector2):
    return size(vector1 - vector2)

def distances(array1, array2):
    lists = [[distance(vector1, vector2) for vector2 in array2] for vector1 in array1]
    #print lists.index(min, zip(*lists))
    smallest = [min(zip(l, range(len(l)))) for l in zip(*lists)]
    clusters = {}
    for j, (_, i) in enumerate(smallest):
        clusters.setdefault(i, []).append(dataPoints[j])
    pprint(clusters)
    print '\nAverage of Each Point'
    avgDict = {}
    for k, v in clusters.iteritems():
        avgDict[k] = sum(v) / (len(v))
    avgList = np.asarray(avgDict)
    pprint(avgList)

distances(centroids, dataPoints)
Current Output:
{0: [array([16, 32]), array([20, 56])],
 1: [array([2, 4])],
 2: [array([17, 4]),
     array([45, 2]),
     array([45, 7]),
     array([32, 14]),
     array([68, 33])]}
Average of Each Point
array({0: array([18, 44]), 1: array([2, 4]), 2: array([41, 12])}, dtype=object)
Desired Output:
[[18,44],[2,4],[41,12]]
Or whatever format is best for comparing my arrays/lists. I am aware I should have just stuck with one data type.
Are you trying to cluster the dataPoints by the index of the nearest centroid, and then find the average position of the clustered points? If so, I advise using numpy's broadcasting rules to get the output you need.
Consider this,
np.linalg.norm(centroids[None, :, :] - dataPoints[:, None, :], axis=-1)
It creates a matrix showing all distances between dataPoints and centroids,
array([[ 40.01249805, 11.18033989, 11.40175425],
       [ 42.3792402 , 17.02938637, 16.2788206 ],
       [ 59.39696962, 43.01162634, 42.05948169],
       [ 55.97320788, 41.77319715, 40.79215611],
       [ 17.69180601, 20.80865205, 20.24845673],
       [ 41.72529209, 28.01785145, 27.01851217],
       [ 20.80865205, 44.01136217, 43.65775991],
       [ 65.9241989 , 66.48308055, 65.520989  ]])
And you can compute the indices of the nearest centroids with this trick (split into three lines for readability):
In: t0 = centroids[None, :, :] - dataPoints[:, None, :]
In: t1 = np.linalg.norm(t0, axis=-1)
In: t2 = np.argmin(t1, axis=-1)
Now t2 has the indices,
array([1, 2, 2, 2, 0, 2, 0, 2])
To find the #1 cluster, use the boolean mask t2 == 0,
In: dataPoints[t2 == 0]
Out: array([[16, 32],
            [20, 56]])
In: dataPoints[t2 == 1]
Out: array([[2, 4]])
In: dataPoints[t2 == 2]
Out: array([[17, 4],
            [45, 2],
            [45, 7],
            [32, 14],
            [68, 33]])
Or just calculate the average in your case,
In: np.mean(dataPoints[t2 == 0], axis=0)
Out: array([ 18., 44.])
In: np.mean(dataPoints[t2 == 1], axis=0)
Out: array([ 2., 4.])
In: np.mean(dataPoints[t2 == 2], axis=0)
Out: array([ 41.4, 12. ])
Of course, the latter blocks can be rewritten as a for-loop if you want; see the sketch below for how the per-cluster means can be stacked into a single array. In my opinion, it is good practice to formulate the solution using numpy's conventions.
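A possible sketch (assuming t2 holds the nearest-centroid index of each point, as above) that stacks the per-cluster means into one plain array, which is then easy to compare against other arrays or lists:

# one row of means per centroid, in centroid order
averages = np.array([np.mean(dataPoints[t2 == k], axis=0)
                     for k in range(len(centroids))])
print(averages.tolist())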
I have a set of data, and a set of thresholds for creating bins:
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0,5,10])
bins = np.digitize(data, thresholds, right=True)
For each of the elements in bins, I want to know the base percentile. For example, the smallest bin should start at the 0th percentile and the next bin at, say, the 20th percentile, so that if a value in data falls between the 0th and 20th percentile of data, it belongs in the first bin.
I've looked into pandas rank(pct=True) but can't seem to get this done correctly.
Suggestions?
You can calculate the percentile for each element in your data array as described in a previous StackOverflow question (Map each list value to its corresponding percentile).
import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
Method 1: Using scipy.stats.percentileofscore:
data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
        36.36363636, 59.09090909, 59.09090909, 95.45454545,
        95.45454545, 72.72727273, 81.81818182])
Method 2: Using scipy.stats.rankdata and normalising to 100 (faster):
ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
        36.36363636, 59.09090909, 59.09090909, 95.45454545,
        95.45454545, 72.72727273, 81.81818182])
Now that you have a list of percentiles, you can bin them as before using numpy.digitize :
bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
This gives you the data binned according to the indices of your chosen list of percentiles. If desired, you could also return the actual (upper) percentiles using numpy.take :
data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20, 20, 40, 40, 40, 60, 60, 100, 100, 80, 100])
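For reference, since the question mentions pandas rank(pct=True): that route would look something like the following sketch, and should give the same percentiles as Method 2 up to floating-point differences:

import pandas as pd

data_percentile_pd = pd.Series(data).rank(pct=True) * 100
data_binned_indices_pd = np.digitize(data_percentile_pd, bins_percentile, right=True)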