I looked into the question Best fit Distribution plots and found that the best-fit distribution for my dataset is the hyperbolic secant (hypsecant) distribution. When I ran the code on one part of my dataset, I got these parameters:
loc=0.040593036736931196 ; scale=-0.008338984851829193
However, when I pass this data (for instance) into hypsecant.pdf with that loc and scale, I get NaNs as a result instead of values. Why is that?
Here is my input:
[ 1. 0.99745057 0.93944708 0.83424019 0.78561204 0.63211263
0.62151259 0.57883015 0.57512492 0.43878694 0.43813347 0.37548233
0.33836774 0.29610761 0.32102182 0.31378472 0.24809515 0.24638145
0.22580595 0.18480387 0.19404362 0.18919147 0.16377272 0.16954728
0.10912106 0.12407758 0.12819846 0.11673824 0.08957689 0.10353764
0.09469576 0.08336001 0.08591166 0.06309568 0.07445366 0.07062173
0.05535625 0.05682546 0.06803674 0.05217558 0.0492794 0.05403819
0.04535857 0.04562529 0.04259798 0.03830373 0.0374102 0.03217575
0.03291147 0.0288506 0.0268235 0.02467415 0.02409625 0.02486308
-0.02563436 -0.02801487 -0.02937738 -0.02948851 -0.03272476 -0.03324265
-0.03435844 -0.0383104 -0.03864602 -0.04091095 -0.04269355 -0.04428056
-0.05009069 -0.05037519 -0.05122204 -0.05770342 -0.06348465 -0.06468936
-0.06849683 -0.07477151 -0.08893675 -0.097383 -0.1033376 -0.10796748
-0.11835636 -0.13741154 -0.14920072 -0.16698451 -0.1715277 -0.20449029
-0.22241856 -0.25270058 -0.25699927 -0.26731036 -0.31098857 -0.35426224
-0.36204168 -0.44059844 -0.46754863 -0.53560093 -0.61463112 -0.65583547
-0.66378605 -0.70644849 -0.75217157 -0.92236344]
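A likely cause is the negative scale: every scipy.stats distribution requires scale > 0, and an invalid scale makes every pdf call return NaN. A minimal pure-Python sketch of the hypsecant density (mirroring scipy's parameterization; the fitted values above are used as illustrative inputs) shows the behaviour. If the fit really produced a negative scale, the fit itself is worth re-checking.

```python
import math

def hypsecant_pdf(x, loc=0.0, scale=1.0):
    # scipy's parameterization: f(x) = sech((x - loc) / scale) / (pi * scale).
    # The scale of any scipy.stats distribution must be strictly positive;
    # scipy returns NaN for every input when scale <= 0.
    if scale <= 0:
        return float("nan")
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * math.cosh(z))

print(hypsecant_pdf(0.04, loc=0.0406, scale=-0.0083))  # nan: negative scale is invalid
print(hypsecant_pdf(0.04, loc=0.0406, scale=0.0083))   # finite density with a positive scale
```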
I have time series data
c1= c(0.558642328,
0.567173803,
0.572518969,
0.579917556,
0.592155421,
0.600239837,
0.598955071,
0.608857572,
0.615442061,
0.613502347,
0.618076897,
0.626769781,
0.633930194,
0.645518577,
0.66773088,
0.68128165,
0.695552504,
0.6992836,
0.702771866,
0.700840271,
0.684032428,
0.665082645,
0.646948862,
0.621813893,
0.597888613,
0.577744126,
0.555984044,
0.533597678,
0.523645413,
0.522041142,
0.525437844,
0.53053292,
0.543152606,
0.549038792,
0.555300856,
0.563411331,
0.572663951,
0.584438777,
0.589476192,
0.604197562,
0.61670388,
0.624161184,
0.624345171,
0.629342985,
0.630379665,
0.620067096,
0.597480375,
0.576228619,
0.561285031,
0.543921304,
0.530826211,
0.519563568,
0.514228535,
0.515202665,
0.516663855,
0.525673366,
0.543545395,
0.551681638,
0.558951402,
0.566816133,
0.573842585,
0.578611696,
0.589180577,
0.603297615,
0.624550509,
0.641310155,
0.655093217,
0.668385196,
0.671600127,
0.658876967,
0.641041982,
0.605081463,
0.585503519,
0.556173635,
0.527428073,
0.502755737,
0.482510734,
0.453295642,
0.439938772,
0.428757811,
0.422361642,
0.40945864,
0.399504355,
0.412688798,
0.42684828,
0.456935656,
0.48355422,
0.513727218,
0.541630101,
0.559122121,
0.561763656,
0.572532833,
0.576761365,
0.576146233,
0.580199403,
0.584954906)
corresponding to dates
dates = seq(as.Date("2016-09-01"), as.Date("2020-07-30"), by=15)
What I want to do is compute the normalized spectral entropy of this time series. I have found in the literature that a high value indicates high stability of a system.
I have found a function here: https://rdrr.io/cran/ForeCA/man/spectral_entropy.html, but cannot generate what I want.
I'm new to this topic, so any help with interpretation would be appreciated too.
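For reference, normalized spectral entropy can be sketched without ForeCA: treat the normalized periodogram as a probability distribution and divide its Shannon entropy by the log of the number of frequency bins, so the result lies in [0, 1]. This is a sketch of the common periodogram-based definition; ForeCA's estimator may differ in smoothing details.

```python
import numpy as np

def spectral_entropy(x):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                    # remove the mean (DC component)
    psd = np.abs(np.fft.rfft(x)) ** 2   # raw periodogram
    psd = psd[1:]                       # drop the zero-frequency bin
    p = psd / psd.sum()                 # normalize to a probability mass
    p = p[p > 0]                        # avoid log(0)
    # Shannon entropy, normalized by log(#bins) so the result lies in [0, 1]
    return float(-(p * np.log(p)).sum() / np.log(len(psd)))

n = np.arange(256)
print(spectral_entropy(np.sin(2 * np.pi * 8 * n / 256)))   # near 0: strongly periodic
print(spectral_entropy(np.random.default_rng(0).standard_normal(1024)))  # near 1: noise-like
```

Note on interpretation: with this definition, values near 0 mean the power is concentrated at few frequencies (strong periodic structure), while values near 1 mean the series is noise-like.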
I wrote a program for a statistics course exercise. Simply put, I was supposed to predict prices for the next 250 days, then extract the lowest and highest price from 10k runs of 250-day predictions.
I followed the instructions in the problem to use gauss from the random module with the mean and std of the given sample.
The highest and lowest prices in the test are in the range 45-55, but I predict 18-88. Is there a problem with my code, or is it just not a good method for prediction?
from random import gauss

with open('AAPL_train.csv', 'r') as sheet:  # we categorize the data here
    Date = []
    Open = []
    High = []
    Low = []
    Close = []
    Adj_Close = []
    Volume = []
    # note: [1:-1] skips the header but also drops the last row;
    # use [1:] if every line after the header is data
    for line in sheet.readlines()[1:-1]:
        words = line.strip().split(',')
        Date.append(words[0])
        Open.append(float(words[1]))
        High.append(float(words[2]))
        Low.append(float(words[3]))
        Close.append(float(words[4]))
        Adj_Close.append(float(words[5]))
        Volume.append(int(words[6]))

# find the pattern of price changes via the day-to-day differences
subtract = []
for i in range(1, len(Volume)):
    subtract.append(Adj_Close[i] - Adj_Close[i - 1])

# find the mean and std of the change pattern
mean = sum(subtract) / len(subtract)
accum = 0
for amount in subtract:
    accum += (amount - mean) ** 2
var = accum / len(subtract)
stdev = var ** 0.5

def Getwb():  # simulate one 250-day path and return its extremes
    index = Adj_Close[-1]
    index_lst = []
    for i in range(250):
        index += gauss(mean, stdev)
        index_lst.append(index)
    return min(index_lst), max(index_lst)

worst = []
best = []
for i in range(10000):  # predict 10000 times, then extract the highest and lowest results
    x, y = Getwb()
    worst.append(x)
    best.append(y)

print(min(worst))
print(max(best))
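As a sanity check on the wide range: under this random-walk model, the standard deviation of the 250-day endpoint grows like stdev * sqrt(250), so the extremes over 10,000 runs are expected to land far outside the historical day-to-day band. A small sketch with assumed illustrative parameters (mean=0, stdev=1, starting price 50):

```python
import math
from random import gauss, seed

seed(0)
mean, stdev = 0.0, 1.0      # assumed illustrative daily-change parameters
n_days, n_runs = 250, 2000

finals = []
for _ in range(n_runs):
    price = 50.0            # assumed starting price
    for _ in range(n_days):
        price += gauss(mean, stdev)
    finals.append(price)

avg = sum(finals) / n_runs
spread = (sum((f - avg) ** 2 for f in finals) / n_runs) ** 0.5
# The empirical spread of the endpoints should be close to
# stdev * sqrt(n_days) ~= 15.8, far wider than any single day's move.
print(spread, stdev * math.sqrt(n_days))
```

So a wide predicted band is inherent to the method, not necessarily a bug in the code.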
I am reading all my datasets into one ensemble, 'ens1', as shown below.
wrf_dict = {"ens1" : [Dataset("wrfout_d01_2020-12-03_01_00_00"),
Dataset("wrfout_d01_2020-12-03_02_00_00"),
Dataset("wrfout_d01_2020-12-03_03_00_00"),
Dataset("wrfout_d01_2020-12-03_04_00_00"),
Dataset("wrfout_d01_2020-12-03_05_00_00"),
Dataset("wrfout_d01_2020-12-03_06_00_00"),
Dataset("wrfout_d01_2020-12-03_07_00_00"),
Dataset("wrfout_d01_2020-12-03_08_00_00"),
Dataset("wrfout_d01_2020-12-03_09_00_00"),
Dataset("wrfout_d01_2020-12-03_10_00_00"),
Dataset("wrfout_d01_2020-12-03_11_00_00"),
Dataset("wrfout_d01_2020-12-03_12_00_00"),
Dataset("wrfout_d01_2020-12-03_13_00_00"),
Dataset("wrfout_d01_2020-12-03_14_00_00"),
Dataset("wrfout_d01_2020-12-03_15_00_00"),
Dataset("wrfout_d01_2020-12-03_16_00_00"),
Dataset("wrfout_d01_2020-12-03_17_00_00"),
Dataset("wrfout_d01_2020-12-03_18_00_00"),
Dataset("wrfout_d01_2020-12-03_19_00_00"),
Dataset("wrfout_d01_2020-12-03_20_00_00"),
Dataset("wrfout_d01_2020-12-03_21_00_00")]}
I have read a common variable, QCLOUD, from all 21 datasets:
LWC = getvar(wrf_dict, "QCLOUD", timeidx=ALL_TIMES)[:,0,:,:]
Now I want to take the mean of all 21 QCLOUD fields at each grid location. Can anyone suggest a solution?
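Assuming getvar with timeidx=ALL_TIMES stacks the 21 files along a leading Time axis (the usual wrf-python behaviour), the per-grid-point mean is just a mean over axis 0. A minimal NumPy sketch with a hypothetical stand-in array:

```python
import numpy as np

# Hypothetical stand-in for the stacked QCLOUD variable: 21 time steps on a
# small (south_north, west_east) grid. The real LWC from getvar(...) with
# timeidx=ALL_TIMES has the same leading Time axis.
lwc = np.random.default_rng(0).random((21, 4, 5))

# Mean over the 21 time steps at every grid location.
lwc_mean = lwc.mean(axis=0)    # shape (4, 5)
print(lwc_mean.shape)
```

If wrf-python's xarray support is enabled, LWC is a DataArray and the equivalent is LWC.mean(dim="Time"), which also keeps the coordinate metadata.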
This is my first question here.
I've been wanting to create a dataset from the popular IMDb dataset for learning purposes. The directories are as follows: .../train/pos/ and .../train/neg/ . I created a function which merges text files with their labels, and I'm getting an error. I need your help to debug!
def datasetcreate(filepath, label):
filepaths = tf.data.Dataset.list_files(filepath)
return tf.stack([tf.data.Dataset.from_tensor_slices((_, tf.constant(label, dtype='int32'))) for _ in tf.data.TextLineDataset(filepaths)])
datasetcreate(['aclImdb/train/pos/*.txt'],1)
And this is the error I'm getting:
ValueError: Value tf.Tensor(b'An American in Paris was, in many ways, the ultimate.....dancers of all time.', shape=(), dtype=string) has insufficient rank for batching.
Why does this happen and what can I do to get rid of this? Thanks.
Your code has two problems:
First, the way you load your TextLineDatasets, the loaded tensors contain string objects, which have an empty shape, i.e. a rank of zero. (The rank of a tensor is the length of its shape property.)
Secondly, you are trying to stack two tensors with different ranks, which would throw another error: a sentence (a sequence of tokens) has rank 1, while the label, as a scalar, has rank 0.
If you just need the dataset, I recommend the TensorFlow Datasets package, which has many ready-to-use datasets available.
If you want to solve your particular problem, one way to fix your data pipeline is to use the Dataset.interleave and Dataset.zip functions.
import tensorflow as tf

# load positive sentences
filepaths = list(tf.data.Dataset.list_files('aclImdb/train/pos/*.txt'))
sentences_ds = tf.data.Dataset.from_tensor_slices(filepaths)
sentences_ds = sentences_ds.interleave(lambda text_file: tf.data.TextLineDataset(text_file))
sentences_ds = sentences_ds.map(lambda text: tf.strings.split(text))

# dataset for labels: create one label per file
# (note the trailing comma: shape must be a sequence, not a bare int)
labels = tf.constant(1, dtype="int32", shape=(len(filepaths),))
label_ds = tf.data.Dataset.from_tensor_slices(labels)

# combine text with label datasets
dataset = tf.data.Dataset.zip((sentences_ds, label_ds))
print(list(dataset.as_numpy_iterator()))
First, you use the interleave function to combine multiple text datasets into one dataset. Next, you use tf.strings.split to split each text into its tokens. Then you create a dataset for your positive labels. Finally, you combine the two datasets using zip.
IMPORTANT: to train/run any DL models on your dataset, you will likely need further pre-processing of the sentences, e.g. building a vocabulary and training word embeddings.
When using Python's sklearn for LDA topic modeling, I can use the transform function to get a "document-topic distribution" from the LDA results, like here:
document_topic_distribution = lda_model.transform(document_term_matrix)
Now I have also tried the R structural topic model (stm) package, and I want to get the same thing. Is there any function in the stm package which can produce the same (a document-topic distribution)?
I have the stm-object created as follows:
stm_model <- stm(documents = out$documents, vocab = out$vocab,
K = number_of_topics, data = out$meta,
max.em.its = 75, init.type = "Spectral" )
But I didn't find out how I can get the desired distribution out of this object. The documentation didn't really help me either.
As emilliman5 pointed out, your stm_model provides access to the underlying parameters of the model, as is shown in the documentation.
Indeed, the theta parameter is a
Number of Documents by Number of Topics matrix of topic proportions.
This requires a bit of unpacking: it is an N_DOCS by N_TOPICS matrix, i.e. it has N_DOCS rows, one per document, and N_TOPICS columns, one per topic. The values are the topic proportions; e.g. if stm_model$theta[1, ] == c(.3, .2, .5), then Document 1 is 30% Topic 1, 20% Topic 2 and 50% Topic 3.
To find the topic that dominates a document, take the (column!) index of the maximum value in each row, e.g. by calling apply with MARGIN=1 (which basically says "do this row-wise"); which.max simply returns the index of the maximum value:
apply(stm_model$theta, MARGIN=1, FUN=which.max)
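For comparison with the sklearn workflow mentioned in the question, the same row-wise argmax can be sketched in Python on a hypothetical proportions matrix standing in for stm_model$theta:

```python
import numpy as np

# Hypothetical 3-document, 3-topic proportion matrix (each row sums to 1),
# standing in for stm_model$theta.
theta = np.array([[0.3, 0.2, 0.5],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1]])

# Row-wise argmax; +1 converts 0-based indices to R-style 1-based topic numbers.
dominant_topic = theta.argmax(axis=1) + 1
print(dominant_topic)  # [3 1 2]
```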