I have a dataset which looks like
   ID  Target  Weight  Score  Scale_Cat  Scale_num
0  A   D       65.1    87     Up          1
1  A   X       35.8    87     Up          1
2  B   C       34.7    37.5   Down       -2
3  B   P       33.4    37.5   Down       -2
4  C   B       33.1    37.5   Down       -2
5  S   X       21.4    12.5   NA          9
This dataset consists of nodes (ID) and targets (neighbors), and it is used as a sample for testing label propagation. Classes/labels are in the column Scale_num and take integer values from -2 to 2. The label 9 means unlabelled, and it is the label that I would like to predict using the label propagation algorithm.
Looking for examples of label propagation on Google, I found this code useful (the difference is in the label assignment, since my df already records which rows are labelled, from -2 to 2 in steps of 1, and which are unlabelled, i.e. 9): https://mybinder.org/v2/gh/thibaudmartinez/label-propagation/master?filepath=notebook.ipynb
However, trying to use my classes instead of (-1,0,1) as in the original code, I have got some errors. A user has provided some help here: RunTimeError during one hot encoding, for fixing a RunTimeError, unfortunately still without success.
In the answer provided on that link, 40 obs and labels are randomly generated.
import random
labels = list()
for i in range(0,40):
    labels.append(list([(lambda x: x+2 if x !=9 else 5)(random.sample(classes,1)[0])]))
index_aka_labels = torch.tensor(labels)
torch.zeros(40, 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
The error I am getting, still a RunTimeError, seems to be still due to a wrong encoding. What I tried is the following:
import random
labels = list(df['Scale_num'])
index_aka_labels = torch.tensor(labels)
torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
getting the error
---> 7 torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
RuntimeError: Index tensor must have the same number of dimensions as self tensor
For sure, I am missing something (e.g., the way to use classes and labels as well as src, which has never been defined in the answer provided in that link).
The two functions in the original code which are causing the error are as follows:
def _one_hot_encode(self, labels):
    # Get the number of classes
    classes = torch.unique(labels) # probably this should be replaced
    classes = classes[classes != -1] # unlabelled. In my df the unlabelled class is identified by 9
    self.n_classes = classes.size(0)
    # One-hot encode labeled data instances and zero rows corresponding to unlabeled instances
    unlabeled_mask = (labels == -1) # In my df the unlabelled class is identified by 9
    labels = labels.clone() # defensive copying
    labels[unlabeled_mask] = 0
    self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
    self.one_hot_labels[unlabeled_mask, 0] = 0
    self.labeled_mask = ~unlabeled_mask

def fit(self, labels, max_iter, tol):
    self._one_hot_encode(labels)
    self.predictions = self.one_hot_labels.clone()
    prev_predictions = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    for i in range(max_iter):
        # Stop iterations if the system is considered at a steady state
        variation = torch.abs(self.predictions - prev_predictions).sum().item()
        prev_predictions = self.predictions
        self._propagate()
I would like to understand how to use in the right way my classes/labels definition and info from my df in order to run the label propagation algorithm with no errors.
I suspect it's complaining about index_aka_labels lacking the singleton dimension. Note that in your example which works:
import random
labels = list()
for i in range(0,40):
    labels.append(list([(lambda x: x+2 if x !=9 else 5)(random.sample(classes,1)[0])]))
index_aka_labels = torch.tensor(labels)
torch.zeros(40, 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
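If you run index_aka_labels.shape, it returns torch.Size([40, 1]). When you turn your pandas series directly into a tensor, however, you get a 1-D tensor of shape (M,) (where M is the length of the series), and scatter_ requires the index to have the same number of dimensions as the tensor it writes into. If you simply run: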
import random
labels = list(df['Scale_num'])
index_aka_labels = torch.tensor(labels)[:,None] #create another dimension
torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
the error should disappear.
One more thing, you are not converting your labels into indices as you did in the top example. To do that, you can run:
import random
labels = list(df['Scale_num'])
index_aka_labels = torch.tensor(labels)[:,None] #create another dimension
index_aka_labels = index_aka_labels + 2 # labels are [-2,-1,0,1,2] and convert them to [0,1,2,3,4]
index_aka_labels[index_aka_labels==11] = 5 #convert label 9 to index 5
torch.zeros(len(df), 6, dtype=src.dtype).scatter_(1, index_aka_labels, 1)
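The two steps above (adding the singleton dimension and mapping labels to indices) can be put together in a self-contained sketch. The list scale_num is a hypothetical stand-in for df['Scale_num'], and, unlike the snippet above, this version uses 5 columns and leaves the rows of unlabelled nodes all zero, which is what the notebook's _one_hot_encode appears to do; treat it as one possible layout rather than the required one.

```python
import torch

# Hypothetical stand-in for df['Scale_num']: classes -2..2, with 9 meaning "unlabelled".
scale_num = [1, 1, -2, -2, -2, 9]

labels = torch.tensor(scale_num, dtype=torch.long)
unlabelled = labels == 9

# Shift classes -2..2 to column indices 0..4; park unlabelled rows on index 0 for now.
index = labels + 2
index[unlabelled] = 0

# scatter_ needs an index with the same number of dimensions as the target: (N, 1).
one_hot = torch.zeros(len(scale_num), 5, dtype=torch.float)
one_hot.scatter_(1, index.unsqueeze(1), 1.0)

# Zero out the rows of unlabelled nodes so they carry no class information.
one_hot[unlabelled] = 0.0
print(one_hot)
```

To map a predicted column index back to the original class, subtract 2 again.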
I'm in one of those weird places where I know exactly what I want to do and could easily code it up using for loops, but I'm trying to learn NumPy and can't work out how to express it in NumPy.
I want to have a 2D array, or parameter space: all values between 1200 and 1800, and all combinations thereof. So [1200, 1200], [1200, 1201], [1200, 1202] .... [1201, 1200], [1201, 1201] etc.
I want to apply a function across this entire parameter space. The function uses two further arrays, check_array1 and check_array2, which hold random values in the same 1200-1800 range, e.g. [1356, 1689, 1436, ...] and [1768, 1495, 1358, ...].
The function needs to move through the parameter space checking a condition, which is basically: if x < check_array1 and y < check_array2 then 1 else 0, where x and y are the coordinates of each specific point in the 2D parameter space. It needs to check against every pair of values in the check arrays, sum the total, compare it to another static value, and return the difference.
Each unique combination in the parameter space grid will then have a unique value associated with it, based on how those specific x and y values from the parameter space compare to the 2 check arrays.
Hopefully the above makes sense; I just can't figure out how to turn this into a NumPy-friendly problem. Sorry for the wall of text.
Edit: I've written it in more basic Python to better illustrate what I'm trying to do.
import numpy as np

check1 = np.random.randint(1200, 1801, 300)
check2 = np.random.randint(1200, 1801, 300)

def check_this_double(i, j, check1, check2):
    # count the check pairs for which either comparison holds
    total = 0
    for num in range(0, len(check1)):
        if ((i < check1[num]) or (j < check2[num])):
            total += 1
    return total

outputs = {}
for i in range(1200, 1801):
    for j in range(1200, 1801):
        outputs[i,j] = check_this_double(i, j, check1, check2)
Edit 2: I believe I have it.
Following on from Mountain's code, which creates p_space, and then using np.vectorize on a normal Python function.
check1 = np.random.randint(1200, 1801, 300)
check2 = np.random.randint(1200, 1801, 300)

def rate_calc(i, j):
    # 1 where either comparison holds for a pair of check values, else 0
    total = np.where(np.logical_or(i < check1, j < check2), 1, 0)
    return total.sum()

rate_calc_v = np.vectorize(rate_calc)
final = rate_calc_v(p_space[:, 0], p_space[:, 1])
Feels kind of like cheating :), there must be way to do it without np.vectorize. But this works for me I believe.
I don't fully understand the problem you are trying to solve, but I hope the following gives you a starting point on how NumPy can be used. I recommend going through an introductory NumPy tutorial.
NumPy boolean indexing and vectorized math can improve speed and reduce the need for loops.
Here is my understanding of the first part of your question.
import numpy as np
xv, yv = np.meshgrid(np.arange(1200, 1801), np.arange(1200, 1801))
p_space = np.stack((xv, yv), axis=-1) # the 2d array described
# print original values
print(p_space[0,:10,0])
print(p_space[0,-10:,0])
old_shape = p_space.shape
p_space = p_space.reshape(-1, 2) # flatten the array for the compare
check1 = np.random.randint(1200, 1801, len(p_space))
check2 = np.random.randint(1200, 1801, len(p_space))
# you can use this to access and modify values that meet the condition
index_array = np.logical_and(p_space[:, 0] < check1, p_space[:, 1] < check2)
# do some sort of complex math
p_space[index_array] = p_space[index_array] / 2 + 10
# get the sum across the two columns
print(np.sum(p_space, axis=0))
p_space = p_space.reshape(old_shape) # return to the grid shape
# print modified values
print(p_space[0,:10,0]) # likely to be changed based on checks
print(p_space[0,-10:,0]) # unlikely to be changed
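For the counting loop in the question's first edit, broadcasting is one way to avoid both the Python loops and np.vectorize. The following is a minimal sketch, assuming the same "or" condition as check_this_double; the names grid, i_below, j_below and totals are illustrative only. Note that the intermediate boolean array has shape (601, 601, 300), roughly 100 MB, so this trades memory for speed.

```python
import numpy as np

rng = np.random.default_rng(0)
check1 = rng.integers(1200, 1801, 300)
check2 = rng.integers(1200, 1801, 300)

grid = np.arange(1200, 1801)  # values taken by both i and j

# Compare every grid value against every check value once: shape (601, 300) each.
i_below = grid[:, None] < check1[None, :]
j_below = grid[:, None] < check2[None, :]

# Broadcast (601, 1, 300) | (1, 601, 300) -> (601, 601, 300),
# then count the True entries along the check axis.
totals = (i_below[:, None, :] | j_below[None, :, :]).sum(axis=-1)

# totals[a, b] is the count for i = grid[a], j = grid[b].
print(totals.shape)
```

Subtracting the static comparison value mentioned in the question is then a single array operation on totals.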
I am working with a 1000x40 data frame where I am fitting each column with a function.
For this, I am normalizing the data to run from 0 to 1 and then I fit each column by this sigmoidal function,
import numpy as np
from scipy.optimize import curve_fit

def func_2_2(x, slope, halftime):
    yfit = 0 + 1 / (1+np.exp(-slope*(x-halftime)))
    return yfit

# initial guesses for function
slope_guess = 0.5
halftime_guess = 100
# Construct initial guess array
p0 = np.array([slope_guess, halftime_guess])
# set up curve fit
col_params = {}
for col in dfnormalized.columns:
    x = df.iloc[:,0]
    y = dfnormalized[col].values
    popt = curve_fit(func_2_2, x, y, p0=p0, maxfev=10000)
    col_params[col] = popt[0]  # curve_fit returns (parameters, covariance)
This code is working well for me, but the fit would make more physical sense if I could cut each column shorter on an individual basis. For some columns the data already plateaus at virtually 1 after e.g. 500 data points, for others after 700. I would like to implement a function that simply cuts off the column once it arrives at 1, so that the remaining 300 or more data points are not included in the fit. I thought of checking the last 50 data points and cutting them off if their average is close to 1, then repeating this from the end until I arrive at the data that I want to be included.
When I try to add a function where I try to determine the average of the last 50 datapoints with e.g. passing the y-vector from above like this:
def cutdata(y):
    lastfifty = y.tail(50).average
I receive the error message
AttributeError: 'numpy.ndarray' object has no attribute 'tail'
Does my approach make sense and is it possible within the data frame?
- Thanks in advance, any help is greatly appreciated.
print(y)
gives
[0.00203105 0.00407113 0.00145333 ... 0.99178177 0.97615621 0.97236191]
This has to do with the use of pd.Series.values, which will give you an np.ndarray instead of a pd.Series.
A conservative change to your code would move the use of .values into the curve_fit call. It may not even be necessary there, since curve_fit accepts array-like input and a pd.Series behaves like an np.ndarray for most purposes.
for col in dfnormalized.columns:
    x = df.iloc[:,0]
    y = dfnormalized[col] # No more .values here.
    popt = curve_fit(func_2_2, x, y.values, p0=p0, maxfev=10000)
    col_params[col] = popt[0]
The essential part is highlighted by the comment, which is that your y variable will remain a pd.Series. Then you can get the average of the last observations.
y.tail(50).mean()
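Building on that, the cutting idea from the question (drop blocks of 50 points from the end while their average is close to 1) could look roughly like the sketch below. The helper name plateau_cutoff, the block size and the 0.98 threshold are assumptions for illustration, not values from the question, and the data is synthetic.

```python
import numpy as np

def plateau_cutoff(y, block=50, threshold=0.98):
    """Return the index at which to cut y so trailing plateau blocks are dropped.

    Hypothetical helper: `block` and `threshold` are illustrative defaults.
    """
    y = np.asarray(y, dtype=float)  # accepts a pd.Series or an ndarray
    end = len(y)
    # Drop blocks from the end while the last `block` points average near 1.
    while end > block and y[end - block:end].mean() >= threshold:
        end -= block
    return end

# Synthetic, already-normalized column that plateaus well before the end.
x = np.arange(1000)
y = 1 / (1 + np.exp(-0.05 * (x - 300)))

end = plateau_cutoff(y)
x_fit, y_fit = x[:end], y[:end]  # pass these to curve_fit instead of the full column
print(end, len(y_fit))
```

Inside the fitting loop, the same cutoff index would be applied to both x and y of each column so that they stay the same length.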
I'm using the tf.data.Dataset API and am trying to truncate a bunch of tensors to length 100. Here's what my dataset looks like:
dataset = tf.data.Dataset.from_tensor_slices(({'reviews': x}, y))
My reviews are just movie reviews (strings), so I perform some preprocessing and map that function on my dataset:
def preprocess(x, y):
    # split on whitespace
    x['reviews'] = tf.string_split([x['reviews']])
    # turn into integers
    x['reviews'], y = data_table.lookup(x['reviews']), labels_table.lookup(y)
    x['reviews'] = tf.sparse_tensor_to_dense(x['reviews'])
    # truncate at length 100
    x['reviews'] = x['reviews'][:100]
    x['reviews'] = x['reviews'][0]
    x['reviews'] = tf.pad(x['reviews'],
                          paddings=[[100 - tf.shape(x['reviews'])[0], 0]],
                          mode='CONSTANT',
                          name='pad_input',
                          constant_values=0)
    return x, y
dataset = dataset.map(preprocess)
However, my code fails with:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: -140 0
On an input of length 240. So, it seems like my padding step calculates 100 - 240 = -140 and I get this error.
Here's my question: how is this possible, given that I truncate to length 100 with:
x['reviews'] = x['reviews'][:100]
It seems clear that this line isn't having any effect, so I'm trying to understand why. The docs are very clear that this is acceptable syntactic sugar for tf.slice:
Note that tf.Tensor.__getitem__ is typically a more pythonic way to
perform slices, as it allows you to write foo[3:7, :-2] instead of
tf.slice(foo, [3, 0], [4, foo.get_shape()[1]-2]).
Any ideas?
Thanks!
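One likely explanation, sketched here with NumPy as a stand-in for the tensor shapes: tf.string_split is given a list containing one string, so the dense result has shape (1, n_tokens), and [:100] then slices the first axis (length 1) rather than the tokens. The array tokens below is a made-up example with 240 token ids, matching the length in the error message; it is an analogy for the shapes, not TensorFlow code.

```python
import numpy as np

# Stand-in for the dense tensor built from tf.string_split([review]):
# one row (the single input string) by 240 token ids.
tokens = np.arange(1, 241).reshape(1, 240)

# Order used in the question: slice first, then take row 0.
sliced_first = tokens[:100]   # slices axis 0, which only has length 1
row_untruncated = sliced_first[0]
print(sliced_first.shape, row_untruncated.shape)

# Taking the row first makes the slice act on the token axis.
row = tokens[0][:100]
pad = 100 - row.shape[0]      # never negative after truncation
padded = np.pad(row, (pad, 0), mode="constant", constant_values=0)
print(row.shape, padded.shape)
```

If this is what is happening, swapping the two lines in preprocess so that [0] comes before [:100] should keep the padding amount non-negative.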
I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test for each coordinate and not just for each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. This process would then be repeated for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it, and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, that breaks the loop: only the result from the last coordinate is kept instead of all of them. If I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test

data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
    for j in range(192):
        # store each result in the pre-allocated object array
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of values I have at every coordinate that will be on my map (i.e. it's just 145*192) and 46 is the number of years the data spans.
I then used the following loop, which I modified from Grr's code, to find Kendall's tau and its respective p-value at each latitude and longitude over the 46 year period.
# sc is assumed to be scipy, with scipy.stats imported
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x,out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time as shown: newdata = np.reshape(y,(145,192,2)), so that the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!
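As a possible shortcut for the flattening step, a single reshape and transpose gives the same (coordinates, years) layout without repeated np.append calls. The sketch below uses a small random array as a stand-in for yrmax (the real one is (46, 145, 192)); the names series, tau_map and p_map are illustrative.

```python
import numpy as np
from scipy.stats import kendalltau

# Small random stand-in for yrmax with layout (years, lat, lon).
rng = np.random.default_rng(0)
yrmax = rng.random((46, 4, 5))
n_years, n_lat, n_lon = yrmax.shape

# One row per coordinate, one column per year;
# row k corresponds to coordinate (k // n_lon, k % n_lon).
series = yrmax.reshape(n_years, -1).T

years = np.arange(n_years)
tau_map = np.empty(n_lat * n_lon)
p_map = np.empty(n_lat * n_lon)
for k, row in enumerate(series):
    tau_map[k], p_map[k] = kendalltau(years, row)

# Back to the map layout for plotting.
tau_map = tau_map.reshape(n_lat, n_lon)
p_map = p_map.reshape(n_lat, n_lon)
print(tau_map.shape, p_map.shape)
```

Writing into pre-allocated arrays also avoids the quadratic cost of growing an array with np.append inside a loop.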
Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
import numpy

SIZE = (145,192)
year_matrices = load_years() # list of one 145x192 array per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        # copy out the values at this coordinate, one per year
        coord_trend = [d[x][y] for d in year_matrices]
        result_matrix[x][y] = analyze_trend(coord_trend)
print(result_matrix)
Now, there are things like itertools.izip (Python 2; in Python 3 the built-in zip is already lazy) that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" works with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4
coord_arrays = list(zip(*year_arrays)) # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')
flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap (in Python 3 the built-in zip and map already behave this way), then the copies needn't occur: itertools wraps the original arrays and keeps track internally of where it should be fetching values from.
There's a catch, though: to take advantage of itertools you need to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just show flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
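The NumPy point about transposes can be illustrated with a short sketch. The random arrays stand in for the yearly data, and the names stacked, by_coord and series are made up for this example; np.shares_memory is used only to show that no copy was made.

```python
import numpy as np

# Random stand-ins for 46 yearly arrays of shape (145, 192).
years = [np.random.rand(145, 192) for _ in range(46)]

stacked = np.stack(years)                # one copy: shape (46, 145, 192)
by_coord = stacked.transpose(1, 2, 0)    # view: shape (145, 192, 46)

# The 46-year series at one coordinate, still a view of `stacked`.
series = by_coord[10, 20]

print(by_coord.shape, series.shape)
print(np.shares_memory(stacked, by_coord), np.shares_memory(stacked, series))
```

So after the single copy made by np.stack, pulling out the time series for any coordinate needs no further copying.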
I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is working out the adjustment for tied values. I modified the code from this answer to compute the number of tied values for each record in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm

def countTies(x):
    '''Count number of ties in rows of a 2D matrix

    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the number of ties are inserted at (not really) arbitrary
            locations.
            The locations of tie numbers are not important, since
            they will be subsequently put into a formula of sum(t*(t-1)*(2t+5)).

    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")
    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')
    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))
    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()
    return result

def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns

    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to data in each column in <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to data in each row in <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")
    # always put records in rows and do M-K test on each row
    if axis == 0:
        data = data.T
    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))
    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)
    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8 # avoid dividing 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)
    if tails==2:
        p=p*2
    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
        2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
        4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
        1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])
# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)
import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.
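For spot-checking individual cells without installing another package, a plain single-series version of the same formulas can serve as a reference. This is a minimal sketch under the assumption of a two-tailed test with the same continuity correction and tie adjustment as above; the function name mann_kendall_1d is made up for illustration, and it is far slower than the vectorized version.

```python
import numpy as np
from scipy.stats import norm

def mann_kendall_1d(x):
    """Two-tailed Mann-Kendall z score and p-value for one series (reference sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S statistic: sum of sign(x_j - x_i) over all pairs with i < j.
    s = 0.0
    for i in range(n - 1):
        s += np.sign(x[i + 1:] - x[i]).sum()
    # Tie adjustment: sum of t*(t-1)*(2t+5) over groups of equal values
    # (groups of size 1 contribute zero).
    _, counts = np.unique(x, return_counts=True)
    tie_term = (counts * (counts - 1) * (2 * counts + 5)).sum()
    var = (n * (n - 1) * (2 * n + 5) - tie_term) / 18.0
    # Continuity-corrected standard score; define z = 0 when the variance is 0.
    z = (s - np.sign(s)) / np.sqrt(var) if var > 0 else 0.0
    p = 2 * norm.sf(abs(z))
    return z, p

z_check, p_check = mann_kendall_1d(np.arange(1, 11))
print(z_check, p_check)
```

Running it on one row of arr and comparing against the corresponding entries of z and p is a quick consistency check.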