How to calculate the steepness of a trend in Python

I am using the regression slope as follows to calculate the steepness (slope) of the trend.
Scenario 1:
For example, consider that I am using sales figures (x-axis: 1, 4, 6, 8, 10, 15) for 6 days (y-axis).
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
X = [[1], [4], [6], [8], [10], [15]]
y = [1, 2, 3, 4, 5, 6]
regressor.fit(X, y)
print(regressor.coef_)
This gives me 0.37709497
Scenario 2:
When I run the same program for different sales figures (e.g., 1, 2, 3, 4, 5, 6) I get a slope of 1.
However, you can see that sales are much more productive in scenario 1 than in scenario 2, yet the slope I get for scenario 2 is higher than for scenario 1.
Therefore, I am not sure whether the regression slope captures what I need. Is there any other approach I can use instead to calculate the steepness of the trend?
I am happy to provide more details if needed.

I believe the problem is that your variables are switched. If you want to track sales performance over time, you should perform the regression the other way around. You can invert the slopes you've calculated to get the correct values, which will show higher sales performance in scenario 1.
1 / 0.377 = 2.65
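For reference, a minimal sketch of the regression done the other way around, assuming days on the x-axis and the scenario 1 sales figures on the y-axis:
from sklearn.linear_model import LinearRegression
X = [[1], [2], [3], [4], [5], [6]]  # days
y = [1, 4, 6, 8, 10, 15]            # scenario 1 sales
regressor = LinearRegression()
regressor.fit(X, y)
print(regressor.coef_)  # slope of about 2.57 sales per day, versus 1 for scenario 2
The fitted slope (about 2.57) differs slightly from the simple inverse 2.65 because the points are not perfectly linear, but the comparison with scenario 2 (slope 1) comes out the same way.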
Here is a visualization of your data:
import pandas as pd
import matplotlib.pyplot as plt
days = [1, 2, 3, 4, 5, 6]
sales1 = [1, 4, 6, 8, 10, 15]
sales2 = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame({'days': days, 'sales1': sales1, 'sales2': sales2})
df = df.set_index('days')
df.plot(marker='o', linestyle='--')
plt.show()

The sklearn.manifold.TSNE gives different results for same input vectors

I give TSNE a list of vectors; some of these vectors are exactly the same, but the output of fit_transform() can be different for each of them!
Is this expected behavior? How can I ensure that each input vector is mapped to the same output vector?
I cannot tell for sure, but I have even noticed that the first entry in the input list of vectors always gets a different, unexpected value.
Consider the following simple example.
Notice that the first three vectors are the same and random_state is fixed, but the first three 2D vectors in the output can differ from each other.
from sklearn import manifold
import numpy as np
X = np.array([[2, 1, 3, 5],
              [2, 1, 3, 5],
              [2, 1, 3, 5],
              [2, 1, 3, 5],
              [12, 1, 3, 5],
              [87, 22, 3, 5],
              [3, 23, 9, 5],
              [43, 87, 3, 5],
              [121, 65, 3, 5]])
m = manifold.TSNE(
    n_components=2,
    perplexity=0.666666,
    verbose=0,
    random_state=42,
    angle=.99,
    init='pca',
    metric='cosine',
    n_iter=1000)
X_embedded = m.fit_transform(X)
# The following assertions might fail
assert sum(X_embedded[1] - X_embedded[2]) == 0
assert sum(X_embedded[0] - X_embedded[1]) == 0
Update....
sklearn.__version__ is '1.2.0'
t-SNE, as presented by van der Maaten and Hinton (2008), is a technique that "visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map".
There is no guarantee that two identical points are mapped to the same low-dimensional point. As a matter of fact, this almost never happens, as one can see from Algorithm 1 in van der Maaten and Hinton (2008): the points in the low-dimensional space are obtained by gradient descent minimizing a cost function after a random initialisation.
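If identical inputs really must map to identical outputs, one practical workaround (a sketch of my own, not part of t-SNE itself) is to embed only the unique rows and broadcast the result back with np.unique:
import numpy as np
from sklearn import manifold

# same data as in the question: the first rows are identical
X = np.array([[2, 1, 3, 5],
              [2, 1, 3, 5],
              [2, 1, 3, 5],
              [2, 1, 3, 5],
              [12, 1, 3, 5],
              [87, 22, 3, 5],
              [3, 23, 9, 5],
              [43, 87, 3, 5],
              [121, 65, 3, 5]])

# collapse duplicate rows, embed only the unique rows, then broadcast the
# embedding back so identical inputs share exactly the same output point
X_unique, inverse = np.unique(X, axis=0, return_inverse=True)
m = manifold.TSNE(n_components=2, perplexity=2, random_state=42, init='pca')
X_embedded = m.fit_transform(X_unique)[inverse]
assert np.allclose(X_embedded[0], X_embedded[1])
assert np.allclose(X_embedded[1], X_embedded[2])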

Re-calculate similarity ranking/sorting without re-sorting

I have some code which calculates the nearest neighbors amongst some vectors (values).
However, the values of these vectors are dependent on weights. Each column of the vectors has a different weight at every iteration.
Just for the sake of the example, in the code below I try every time to find the nearest neighbor of the last vector (values[3]).
That's a very simplified version of my code:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=1)
values = [
    [2, 5, 1],
    [4, 2, 3],
    [1, 5, 2],
    [4, 5, 4]
]
weights = [
    [1, 3, 1],
    [0.5, 2, 1],
    [3, 1, 2]
]
# weights set No1
new_values = []
for line in values:
    new_values.append([a * b for a, b in zip(line, weights[0])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
# weights set No2
new_values = []
for line in values:
    new_values.append([a * b for a, b in zip(line, weights[1])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
# weights set No3
new_values = []
for line in values:
    new_values.append([a * b for a, b in zip(line, weights[2])])
knn.fit(new_values)
print(knn.kneighbors([new_values[3]]))
(Obviously I could use a for loop over the different weight sets, but I just wanted to point out the repetition.)
My question is: is there any way to avoid running the KNN 3 times, and instead use it just once at the beginning to do the initial similarity ranking/sorting and then just do some re-calculations?
In other words, is there any way to reduce the computational complexity of this code in terms of calling the KNN fewer times?
PS
I know that there are KNN implementations which are much faster than the scikit-learn one, but that's not really the point; the point is more about using the KNN just once instead of N=3 times or something like that.
Assuming "calling the KNN fewer times" means the number of times the KNN is fit, yes, it's possible. If it means the number of times kneighbors is invoked, that might be difficult, because relative distances aren't preserved under affine transformations.
This solution runs in O(wk log n) time compared to the original O(wn) time with w being the number of weights.
What you're doing is:
1. taking the input points
2. scaling their dimensions (projecting the input points into a new coordinate space)
3. building a knn model from the scaled inputs
4. classifying the target based on the scaled inputs.
However, consider instead:
1. taking the input points
2. building a knn model from the (unscaled) input points
3. inverse-scaling the target point (projecting the target into the original coordinate space)
4. classifying the inverse-scaled target based on the inputs.
The result of this process is that steps 1 and 2 can be reused for each target point. Weights with value 0 will require special handling.
This would look something like:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree")
values = [
    [2, 5, 1],
    [4, 2, 3],
    [1, 5, 2],
    [4, 5, 4]
]
weights = [
    [1, 3, 1],
    [0.5, 2, 1],
    [3, 1, 2]
]
targets = [
    [4, 15, 4],    # values[3] * weights[0]
    [2.0, 10, 4],  # values[3] * weights[1]
    [12, 5, 8]     # values[3] * weights[2]
]
knn.fit(values)
# weights set No1
print(knn.kneighbors([[a / b for a, b in zip(targets[0], weights[0])]]))
# weights set No2
print(knn.kneighbors([[a / b for a, b in zip(targets[1], weights[1])]]))
# weights set No3
print(knn.kneighbors([[a / b for a, b in zip(targets[2], weights[2])]]))

How to increase the steps of scipy.stats.randint?

I'm trying to generate a frozen discrete uniform distribution (like stats.randint(low, high)) but with steps greater than one. Is there any way to do this with scipy?
I think it could be something close to hyperopt's hp.uniformint.
rv_discrete(values=(xk, pk)) constructs a distribution with support xk and probabilities pk.
See an example in the docs:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_discrete.html
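A minimal sketch of that rv_discrete approach, using an assumed low=2, high=10 (inclusive) and step=3 for illustration:
import numpy as np
from scipy import stats

low, high, step = 2, 10, 3
xk = np.arange(low, high + 1, step)   # support: array([2, 5, 8])
pk = np.full(len(xk), 1 / len(xk))    # equal probability for each value
stepped_uniform = stats.rv_discrete(name='stepped_uniform', values=(xk, pk))
print(stepped_uniform.rvs(size=10))   # draws from {2, 5, 8}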
IIUC, you want to generate a uniform discrete variable with a step (e.g., step=3 with low=2 and high=10 gives a universe of [2, 5, 8]).
You can generate a different uniform variable and rescale:
from scipy import stats
low = 2
high = 10
step = 3
r = stats.randint(0, (high-low+1)//step)
low+r.rvs(size=10)*step
example output: array([2, 2, 2, 2, 8, 8, 2, 5, 5, 2])

Python: Zero Crossing method for Frequency Estimation

I'm trying to understand the zero-crossing method for frequency estimation. After searching, I found this code:
est_freq = round(framerate / np.mean(np.diff(zero_crossings)) / 2)
Dissecting further to learn, I wrote the code below:
import numpy as np
framerate = 1e3  # sampling rate in samples per second
a = [1, 2, 1, 1, -3, -4, 7, 8, 9, 10, -2, 1, -3, 5, 6, 7, -10]
signs = np.sign(a)                            # +1 / -1 / 0 for each sample
diff = np.diff(signs)                         # non-zero wherever the sign changes
indices_of_zero_crossing = np.where(diff)[0]  # sample indices of the zero crossings
print(a)
print(signs)
print(diff)
print(indices_of_zero_crossing)
total_points = np.diff(indices_of_zero_crossing)  # samples between consecutive crossings
print(total_points)
average_of_total_points = np.mean(total_points)   # mean spacing between crossings
print(average_of_total_points)
freq = framerate/average_of_total_points/2
My question is: what is happening at the line freq = framerate/average_of_total_points/2? What is the purpose of finding the mean of the differences between zero crossings and of dividing by 2?
Could anyone care to explain? Thank you.
I am not sure where you got the sampling frequency from (framerate), but in digital signal processing there is the Nyquist limit: you cannot reliably represent frequencies above half the sampling frequency, which may explain your factor 2. Another way to see it: each full period of the signal contains two zero crossings, so the mean spacing between consecutive crossings is half a period.
Do note that since Python division is evaluated left to right, freq = framerate/average_of_total_points/2 is the same as freq = framerate/(2*average_of_total_points), which matches the original snippet.
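A quick sanity check with a synthetic signal (an assumed 50 Hz sine sampled at 1 kHz) recovers the known frequency with exactly this formula:
import numpy as np

framerate = 1000                      # samples per second (assumed)
true_freq = 50                        # Hz, known test frequency (assumed)
t = np.arange(0, 1, 1 / framerate)    # one second of samples
a = np.sin(2 * np.pi * true_freq * t)

signs = np.sign(a)
zero_crossings = np.where(np.diff(signs))[0]
average_spacing = np.mean(np.diff(zero_crossings))

# two zero crossings per period, so the period is twice the mean spacing
est_freq = framerate / average_spacing / 2
print(est_freq)                       # approximately 50.0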

train test split in python but consider patient information?

I'm wondering if there's an easy way to do train/test splitting (I'm mainly interested in cross-validation) in Python such that I don't end up with data points from the same patient in both the train and test sets. That is, I'd like to first split the patients into train and test and then split the observations accordingly.
Is there functionality for this kind of scenario, or do I have to code it manually?
Sklearn's GroupKFold should solve this task. It is a K-fold iterator variant with non-overlapping groups: the same group will not appear in two different folds:
from sklearn.model_selection import GroupKFold
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])  # e.g. one ID per patient
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
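To actually see which samples end up in which fold, you can iterate over the splits; a minimal sketch, where the group values stand in for patient IDs:
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])  # stand-ins for patient IDs

group_kfold = GroupKFold(n_splits=2)
for train_index, test_index in group_kfold.split(X, y, groups):
    # samples from the same group (patient) never appear in both sets
    print("train:", train_index, "test:", test_index)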
