I am trying to run a model in Python (not built by me) and I get an error. The error comes from this code:
from seirsplus.models import *
import networkx
numNodes = 10000
baseGraph = networkx.barabasi_albert_graph(n=numNodes, m=9)
G_normal = custom_exponential_graph(baseGraph, scale=100)
# Social distancing interactions:
G_distancing = custom_exponential_graph(baseGraph, scale=10)
# Quarantine interactions:
G_quarantine = custom_exponential_graph(baseGraph, scale=5)
model = SEIRSNetworkModel(G=G_normal, beta=0.155, sigma=1/5.2, gamma=1/12.39, mu_I=0.0004, p=0.5,
                          Q=G_quarantine, beta_D=0.155, sigma_D=1/5.2, gamma_D=1/12.39, mu_D=0.0004,
                          theta_E=0.02, theta_I=0.02, phi_E=0.2, phi_I=0.2, psi_E=1.0, psi_I=1.0, q=0.5,
                          initI=10)
checkpoints = {'t': [20, 100], 'G': [G_distancing, G_normal], 'p': [0.1, 0.5], 'theta_E': [0.02, 0.02], 'theta_I': [0.02, 0.02], 'phi_E': [0.2, 0.2], 'phi_I': [0.2, 0.2]}
model.run(T=300, checkpoints=checkpoints)
model.figure_infections()
I am attaching an image so you can see the highlighted part.
From what I understand, this has to do with the way the class SEIRSNetworkModel is constructed. I already forked the original repository:
https://github.com/ryansmcgee/seirsplus/wiki/SEIRSNetworkModel-class
but I don't know where to look for this constructor, or what to search for in order to fix this problem. This may be very simple, but I can't find my way.
I'd appreciate any help, as simple as possible please, since as you can see I don't know my way around here.
I was doing a normality test in Python with Spark MLlib and saw what I think is a bug.
Here is the setup: I have a dataset that is normalized (range -1 to 1).
When I do a histogram, I can clearly see that the data is NOT normal:
>>> prices_norm.histogram(10)
([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
[226, 269, 119, 95, 52, 26, 8, 2, 2, 5])
When I run the Kolmogorov-Smirnov test I get the following results:
>>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm")
>>> print testResults
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.46231145770077375
pValue = 1.742039845709087E-11
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
The Kolmogorov-Smirnov test defines the null hypothesis (H0) as: the data follows a specified distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).
In this case the p-value is very low, so we should reject the null hypothesis. This makes sense, as it is clearly not normal.
So why then, does it say:
Sample follows theoretical distribution
Isn't this wrong? Shouldn't it say that the sample does NOT follow a theoretical distribution? Am I missing something?
This was driving me crazy, so I went to look at the source code directly:
git://git.apache.org/spark.git
spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala
The code is correct; the null hypothesis is set as:
object NullHypothesis extends Enumeration {
  type NullHypothesis = Value
  val OneSampleTwoSided = Value("Sample follows theoretical distribution")
}
The verbiage of the string message is just restating the null hypothesis:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
(the text after the colon is H0, quoted verbatim)
Arguably the verbiage is confusing as it could be interpreted both ways. But it is indeed correct.
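To see the same convention outside Spark, here is a quick sketch with SciPy (my own illustration, not taken from the Spark source; assumes numpy/scipy are installed). A small p-value is evidence against H0, i.e. against "sample follows the distribution":
import numpy as np
from scipy import stats

np.random.seed(0)
skewed = np.random.exponential(size=500)     # clearly non-normal data
stat, pvalue = stats.kstest(skewed, "norm")  # H0: sample ~ N(0, 1)
print stat, pvalue                           # tiny p-value -> reject H0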
Let's say I have a list of numbers (all numbers are within 0.5 to 1.5 in this particular example, and of course it is a discrete set).
my_list= [0.564, 1.058, 0.779, 1.281, 0.656, 0.863, 0.958, 1.146, 0.742, 1.139, 0.957, 0.548, 0.572, 1.204, 0.868, 0.57, 1.456, 0.586, 0.718, 0.966, 0.625, 0.951, 0.766, 1.458, 0.83, 1.25, 0.7, 1.334, 1.015, 1.43, 1.376, 0.942, 1.252, 1.441, 0.795, 1.25, 0.851, 1.383, 0.969, 0.629, 1.008, 0.729, 0.841, 0.619, 0.63, 1.189, 0.514, 0.899, 0.807, 0.63, 1.101, 0.528, 1.385, 0.838, 0.538, 1.364, 0.702, 1.129, 0.639, 0.557, 1.28, 0.664, 1.021, 1.43, 0.792, 1.229, 0.837, 1.183, 0.54, 0.831, 1.279, 1.385, 1.377, 0.827, 1.32, 0.537, 1.19, 1.446, 1.222, 0.762, 1.302, 0.626, 1.352, 1.316, 1.286, 1.239, 1.027, 1.198, 0.961, 0.515, 0.989, 0.979, 1.123, 0.889, 1.484, 0.734, 0.718, 0.758, 0.782, 1.163, 0.579, 0.744, 0.711, 1.13, 0.598, 0.913, 1.305, 0.684, 1.108, 1.373, 0.945, 0.837, 1.129, 1.005, 1.447, 1.393, 1.493, 1.262, 0.73, 1.232, 0.838, 1.319, 0.971, 1.234, 0.738, 1.418, 1.397, 0.927, 1.309, 0.784, 1.232, 1.454, 1.387, 0.851, 1.132, 0.958, 1.467, 1.41, 1.359, 0.529, 1.139, 1.438, 0.672, 0.756, 1.356, 0.736, 1.436, 1.414, 0.921, 0.669, 1.21, 1.041, 0.597, 0.541, 1.162, 1.292, 0.538, 1.011, 0.828, 1.356, 0.897, 0.831, 1.018, 1.412, 1.363, 1.371, 1.231, 1.278, 0.564, 1.134, 1.324, 0.593, 1.307, 0.66, 1.376, 1.469, 1.315, 0.959, 1.099, 1.313, 1.032, 1.128, 1.175, 0.64, 0.581, 1.09, 0.934, 0.698, 1.272]
I can plot a histogram of its distribution as
from matplotlib.pyplot import hist, show
hist(my_list, bins=20, range=[0.5, 1.5])
show()
which produces [histogram of my_list: 20 bins over the range 0.5 to 1.5]
Now I want to create another list of random numbers (say the new list consists of 100 numbers) that follows the same distribution as the old list (my_list), so that a histogram plotted from the new list essentially reproduces the same distribution. (I am not sure how to link a discrete set to a continuous distribution!)
Is there any way to do this in Python 2.7? I appreciate any help in advance.
You first need to "bucket up" the range of interest, and of course you can do it with tools from scipy &c, but for the sake of understanding what's going on a little Python version might help - with no optimizations, for ease of understanding:
import collections

def buckets(discrete_set, amin=None, amax=None, bucket_size=None):
    if amin is None: amin = min(discrete_set)
    if amax is None: amax = max(discrete_set)
    if bucket_size is None: bucket_size = (amax - amin) / 20
    def to_bucket(sample):
        if not (amin <= sample <= amax): return None  # no bucket fits
        return int((sample - amin) // bucket_size)
    b = collections.Counter(to_bucket(s)
                            for s in discrete_set if to_bucket(s) is not None)
    return amin, amax, bucket_size, b
So, now you have a Counter (essentially a dict) mapping each bucket (numbered from 0 up) to its count as observed in the discrete set.
Next, you'll want to generate a random sample matching the bucket distribution measured by calling buckets(discrete_set). A Counter's elements method can help, but you need a list for random.choice...:
mi, ma, bs, bks = buckets(discrete_set)
buckelems = list(bks.elements())
(this may waste a lot of space, but you can optimize it later, separately from this understanding-focused overview:-).
Now it's easy to get an N-sized sample, e.g.:
import random

def makesample(N, buckelems, mi, ma, bs):
    s = []
    for _ in range(N):
        buck = random.choice(buckelems)
        x = random.uniform(mi + buck*bs, mi + (buck+1)*bs)
        s.append(x)
    return s
Here I'm assuming the buckets are fine-grained enough that it's OK to use a uniform distribution within each bucket.
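Putting the pieces together with the asker's my_list (just a sketch, reusing the hist/show imports from the question):
mi, ma, bs, bks = buckets(my_list)
buckelems = list(bks.elements())
new_list = makesample(100, buckelems, mi, ma, bs)
hist(new_list, bins=20, range=[0.5, 1.5])  # should resemble the original histogram
show()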
Now, optimizing this is of course interesting -- buckelems will have as many items as originally were in discrete_set, and if that imposes an excessive load on memory, cumulative distributions can be built and used instead.
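For instance, a cumulative-counts variant might look like this (my own sketch of that idea, not part of the original code above):
import bisect
import random

def makesample_cdf(N, bks, mi, bs):
    keys = sorted(bks)
    cum, total = [], 0
    for k in keys:                    # cumulative counts, in bucket order
        total += bks[k]
        cum.append(total)
    s = []
    for _ in range(N):
        r = random.uniform(0, total)  # pick a bucket with probability ~ its count
        buck = keys[bisect.bisect_left(cum, r)]
        s.append(random.uniform(mi + buck*bs, mi + (buck+1)*bs))
    return s
This stores one cumulative count per bucket instead of one element per original sample.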
Or, one could bypass the Counter altogether, and just "round" each item in the discrete set to its bucket's lower bound, if memory's OK but one wants more speed. Or, one could leave discrete_set alone and random.choice within it before "perturbing" the chosen value (in different ways depending on the constraints of the exact problem). No end of fun...!-)
Don't read too much into valleys and peaks of histograms with low sample sizes when you're trying to do distribution fitting.
I performed a Kolmogorov-Smirnov test on your data to test the hypothesis that they come from a Uniform(0.5,1.5) distribution, and failed to reject. Consequently, you can generate any size sample you want of Uniform(0.5,1.5)'s.
Given your statement that the underlying distribution is continuous, I think that a distribution-fitting approach is better than a histogram/bucket-based approach.
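For concreteness, that check and the resulting generator might look like this in SciPy (a sketch; loc=0.5, scale=1.0 is SciPy's parametrization of Uniform(0.5, 1.5)):
import numpy as np
from scipy import stats

stat, pvalue = stats.kstest(my_list, "uniform", args=(0.5, 1.0))
print stat, pvalue   # a large p-value -> fail to reject uniformity

new_list = np.random.uniform(0.5, 1.5, size=100)   # sample of any size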
I want to develop some Python code to align datasets obtained by different instruments recording the same event.
As an example, say I have two sets of measurements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
                      'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
                      'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2]})
data2 = pd.DataFrame({'TIME': [0.9, 2.1, 2.9, 4.2],
                      'VALUE': [18.4, 18.7, 18.9, 18.8],
                      'ERROR': [0.3, 0.2, 0.5, 0.4]})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE, yerr=data2.ERROR, fmt='bo')
plt.show()
The result is plotted here:
What I would like to do now is to align the second dataset (data2) to the first one (data1), i.e. to get this:
The second dataset must be shifted to match the first one by subtracting a constant (to be determined) from all its values. All I know is that the datasets are correlated since the two instruments are measuring the same event but with different sampling rates.
At this stage I do not want to make any assumptions about what function best describes the data (fitting will be done after alignment).
I am cautious about using means to perform shifts, since that may produce bad results depending on how the data is sampled. I was considering taking each time point in data2 and working out the shortest distance to the data1 points, then minimizing the sum of those distances. But I am not sure that would work well either.
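To make that idea concrete, here is a rough sketch of what I mean (nearest_offset is just an illustrative name; np and the data frames come from the code above):
def nearest_offset(d1, d2):
    # for each d2 point, find the d1 point nearest in TIME
    idx = [np.argmin(np.abs(d1.TIME.values - t)) for t in d2.TIME]
    diffs = d2.VALUE.values - d1.VALUE.values[idx]
    # the median minimizes the sum of absolute residuals over a constant shift
    return np.median(diffs)

aligned2 = data2.VALUE - nearest_offset(data1, data2)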
Does anyone have any suggestions on a good method to use? I looked at mlpy but it seems to only work on 1D arrays.
Thanks.
You can subtract the mean of the difference: data2.VALUE-(data2.VALUE - data1.VALUE).mean()
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({
'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2],
})
data2 = pd.DataFrame({
'TIME': [0.9, 2.1, 2.9, 4.2],
'VALUE': [18.4, 18.7, 18.9, 18.8],
'ERROR': [0.3, 0.2, 0.5, 0.4],
})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE-(data2.VALUE - data1.VALUE).mean(),
yerr=data2.ERROR, fmt='bo')
plt.show()
Another possibility is to subtract the mean of each series.
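A minimal sketch of that variant, using the same data as above (note this centres both series on zero rather than moving data2 onto data1):
plt.errorbar(data1.TIME, data1.VALUE - data1.VALUE.mean(),
             yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE - data2.VALUE.mean(),
             yerr=data2.ERROR, fmt='bo')
plt.show()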
You can calculate the offset of the averages and subtract that from every value. If you do this for every value, the series should align relatively well. This assumes both datasets look relatively similar, so it might not work best in every case.
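As a sketch (reusing data1/data2 from the question):
offset = data2.VALUE.mean() - data1.VALUE.mean()   # average offset between the series
aligned2 = data2.VALUE - offset                    # data2 shifted onto data1's level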
Although this question is not Matlab related, you might still be interested in this:
Remove unknown DC Offset from a non-periodic discrete time signal