I have time series data and I am trying to fit an ARMA(p,q) model to it, but I am not sure what 'p' and 'q' to use. I came across a link to an implementation of this model, along with its usage documentation.
But I don't think it automatically decides what 'p' and 'q' to use. It seems like I need to know which 'p' and 'q' are appropriate.
You'll have to do a bit of reading outside of the statsmodels package documentation.
See some of the content in this answer:
https://stackoverflow.com/a/12361198/6923545
There's a guy named Rob Hyndman who wrote a great book on forecasting, and it would be a fine idea to start there. Chapters 8.3 and 8.4 cover the bulk of what you're looking for.
From Rob's book chapter 8.3,
In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable.
This is describing p -- the number of past values used to forecast a value
From Rob's book chapter 8.4,
a moving average model uses past forecast errors in a regression-like model.
This is describing q, the number of previous forecast errors in the model.
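Putting the two quotes together, an ARMA(p, q) model combines p lagged values of the series with q lagged forecast errors (a standard formulation, written here for reference):

$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$$

so choosing p and q amounts to deciding how many lagged values and how many lagged errors to include.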
This link gives you a little bit of theory and some examples.
CASE 1: you already know the values of p and q (the orders of the ARMA model), and the algorithm finds the best coefficients.
CASE 2: if you don't know them, you can specify a range of possible values, and the algorithm finds the best ARMA(p,q) model that fits the data and estimates the corresponding coefficients.
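If you end up in CASE 2, a minimal sketch of that search with statsmodels (not necessarily the library from the linked example) is to fit a small grid of (p, q) candidates and keep the one with the lowest AIC; `y` here stands for your series:

import itertools
from statsmodels.tsa.arima.model import ARIMA

best = None
for p, q in itertools.product(range(4), range(4)):
    try:
        # an ARMA(p, q) model is an ARIMA(p, 0, q) model
        res = ARIMA(y, order=(p, 0, q)).fit()
    except Exception:
        continue  # some (p, q) combinations may fail to converge
    if best is None or res.aic < best[0]:
        best = (res.aic, p, q)

print("best (p, q) by AIC:", best[1], best[2])

statsmodels also ships arma_order_select_ic, which performs essentially this comparison for you.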
So I am doing this homework in one of the Stanford courses, and I managed to solve all the questions, but I am trying to understand the last one; correct me if I am wrong.
One part says "build original system": that is, building the model.
The other is "bake off": that is, comparing different models to each other to see which one performs best.
Am I correct?
This is the link to the homework: https://www.youtube.com/watch?v=vqNj1dr8-HM
It is at the very end. It is just that these terms are very confusing and new to me.
Thanks in advance.
I need to know the exact steps for building the original system. What does it mean? What is the bake off?
Backoff means you go back to an (n-1)-gram level to calculate the probabilities when you encounter a word with probability 0 in the current context. So in our case you would use a 3-gram model to calculate the probability of "sunny" in the context "is a very".
The most commonly used scheme is called "stupid backoff": whenever you go back one level, you multiply the score by 0.4. So if "sunny" exists in the 3-gram model, the score would be 0.4 * P("sunny"|"is a very").
You can go back all the way to the unigram model if needed, multiplying by 0.4^n where n is the number of times you backed off.
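A rough sketch of stupid backoff, assuming you already have raw n-gram counts in a dictionary keyed by word tuples (the 0.4 discount is the usual choice; the toy counts below are purely illustrative):

def stupid_backoff(word, context, counts, total_words, alpha=0.4):
    # counts maps tuples of words to raw counts, e.g. counts[("a", "very", "sunny")]
    # context is a tuple of preceding words, e.g. ("is", "a", "very")
    discount = 1.0
    while context:
        ngram = context + (word,)
        if counts.get(ngram, 0) > 0:
            return discount * counts[ngram] / counts[context]
        context = context[1:]   # back off: drop the leftmost context word
        discount *= alpha       # multiply by 0.4 for every level you back off
    # last resort: unigram relative frequency
    return discount * counts.get((word,), 0) / total_words

# toy counts just to make the sketch runnable
counts = {("is", "a", "very"): 10, ("a", "very"): 12, ("a", "very", "sunny"): 3,
          ("very",): 40, ("sunny",): 5}
total_words = 1000
print(stupid_backoff("sunny", ("is", "a", "very"), counts, total_words))  # 0.4 * 3/12 = 0.1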
I would like to create in Python an RL algorithm that would interact with a very big DataFrame representing stock prices. The algorithm would tell us: knowing all of the prices and price changes in the market, what would be the best places to buy/sell (minimizing loss, maximizing reward)? It has to look at the entire DataFrame each step (or else it wouldn't have all of the information from the market).
Is it possible to build such an algorithm (which works relatively fast on a large DataFrame)? How should it be done? What should my environment look like, which algorithm (specifically) should I use for this type of RL, and which reward system? Where should I start?
I think you are a little confused here. What I think you want to do is check whether the stock price of a particular company will go up or not, or which company's stock price will shoot up, given that you already have a dataset regarding the problem statement.
Regarding RL: it does not work on just any dataset. It is a technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
You can check this blog for some explanation; don't get confused:
https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861
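To make the agent/environment distinction concrete, here is a very rough sketch of how a price DataFrame could be wrapped as an environment that an RL agent steps through. The column name "close", the action encoding, and the reward definition are all assumptions for illustration, not a recommendation:

import pandas as pd

class PriceEnv:
    def __init__(self, df: pd.DataFrame):
        self.df = df.reset_index(drop=True)
        self.t = 0
        self.position = 0            # 0 = flat, 1 = holding one unit

    def reset(self):
        self.t, self.position = 0, 0
        return self.df.loc[self.t]   # the state the agent observes

    def step(self, action):          # action: 0 = hold, 1 = buy, 2 = sell
        price_now = self.df.loc[self.t, "close"]
        price_next = self.df.loc[self.t + 1, "close"]
        if action == 1:
            self.position = 1
        elif action == 2:
            self.position = 0
        reward = self.position * (price_next - price_now)  # P&L of the held position
        self.t += 1
        done = self.t >= len(self.df) - 1
        return self.df.loc[self.t], reward, done

# usage (assuming prices_df is your DataFrame):
# env = PriceEnv(prices_df); state = env.reset()
# state, reward, done = env.step(1)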
Objective
I am trying to build an Ontology-based semantic search engine specific to our data.
Problem Statement
Our data is just a bunch of sentences, and what I want is to give a phrase and get back the sentences which:
Are similar to that phrase
Have a part that is similar to the phrase
Have contextually similar meanings
Let me try to give you an example. Suppose I search for the phrase "Buying Experience"; I should get back sentences like:
I never thought car buying could take less than 30 minutes to sign and buy.
I found a car that I liked and the purchase process was straightforward and easy
I absolutely hated going car shopping, but today I’m glad I did
The search term will always be one to three-word phrases. Ex: buying experience, driving comfort, infotainment system, interiors, mileage, performance, seating comfort, staff behavior.
Implementation explored already
OpenSemanticSearch
AWS Comprehend
Current Implementation
Right now I am exploring one implementation ('Holmes Extractor') which mostly suits the objective I am trying to achieve. For my use case, I am using the 'Topic Matching' feature available in Holmes.
Holmes has already incorporated many important concepts in its design such as NER, pronoun resolution, word embedding based similarity, ontology-based extraction, which makes this implementation even more promising.
What I have tried so far on Holmes
Manually curated a labeled set with sentences
Registered and serialized our dataset with unique label id for each document (sentence)
For comparison prepared a template to get precision, recall, and f1_score of each iteration of running queries (search query is the same as the label)
Using spaCy's 'en_core_web_lg' for word embedding
Score template looks something like:
Query | Predicted Rows | Actual Number of rows in the queried set | Precision | Recall | f1_score
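For reference, a minimal sketch of how each row of that template can be filled in, assuming the documents returned for a query and the documents actually tagged with that label are available as sets of ids:

def query_scores(predicted_ids, actual_ids):
    # predicted_ids: ids of documents returned by the topic match for the query
    # actual_ids: ids of documents manually tagged with that query's label
    true_pos = len(predicted_ids & actual_ids)
    precision = true_pos / len(predicted_ids) if predicted_ids else 0.0
    recall = true_pos / len(actual_ids) if actual_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return len(predicted_ids), len(actual_ids), precision, recall, f1

print(query_scores({1, 2, 3}, {2, 3, 4}))  # (3, 3, 0.667, 0.667, 0.667)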
Please find below the parameters used in each iteration. For all the iterations, I generated scores for Manager.overall_similarity_threshold in range(0.0 to 1.0) and topic_match_documents_returning_dictionaries_against.embedding_penalty in range(0.0 to 1.0).
Iteration 1:
Without any Ontology and Manager.embedding_based_matching_on_root_words=False
Iteration 2:
Without any Ontology and Manager.embedding_based_matching_on_root_words=True
Build a custom Ontology using Protege, which is specific to our data set
Iteration 3:
With custom Ontology and Manager.embedding_based_matching_on_root_words=False, Ontology.symmetric_matching=True
Iteration 4:
With custom Ontology and Manager.embedding_based_matching_on_root_words=True, Ontology.symmetric_matching=True
At this stage, I can observe that
custom Ontology,
Manager.embedding_based_matching_on_root_words=True,
Manager.overall_similarity_threshold in range (0.6-0.8),
topic_match_documents_returning_dictionaries_against.embedding_penalty in range(0.6-0.8),
together are producing very strong scores.
average scores for 0.8 -> precision: 0.78, recall: 0.712, f1_score: 0.738
average scores for 0.7 -> precision: 0.738, recall: 0.7435, f1_score: 0.726
Still, the results are not that accurate even after providing the custom ontology, because of a lack of stemmed keyword variants in the ontology graph.
For example, if in the ontology we have given something like Mileage -> fuel efficiency, fuel economy,
Holmes will not match sentences containing "fuel efficient" under Mileage, since "fuel efficient" is not mentioned in the graph.
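One pragmatic workaround (outside Holmes itself) is to expand the ontology synonyms with the inflected variants that actually occur in the corpus, by grouping candidate phrases that share a stem. A rough sketch with NLTK's Porter stemmer; the phrase lists here are illustrative:

from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(phrase):
    # "fuel efficiency" and "fuel efficient" both stem to "fuel effici"
    return " ".join(stemmer.stem(tok) for tok in phrase.lower().split())

ontology_synonyms = ["fuel efficiency", "fuel economy"]            # synonyms of Mileage
corpus_phrases = ["fuel efficient", "fuel economy", "gas mileage"] # phrases seen in the data

by_stem = defaultdict(set)
for phrase in ontology_synonyms + corpus_phrases:
    by_stem[stem_phrase(phrase)].add(phrase)

# any corpus phrase sharing a stem with an existing synonym can be added to the ontology
for synonym in ontology_synonyms:
    print(synonym, "->", by_stem[stem_phrase(synonym)])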
Iteration 5:
With custom Ontology and Manager.embedding_based_matching_on_root_words=False, Ontology.symmetric_matching=False
Iteration 6:
With custom Ontology and Manager.embedding_based_matching_on_root_words=True, Ontology.symmetric_matching=False
Iteration 7:
Along with Iteration 4 parameters, and topic_match_documents_returning_dictionaries_against.tied_result_quotient in range(0.9 to 0.1)
Again results are better in 0.8-0.9 range
Iteration 8:
I had pre-downloaded ontologies from a few sources:
Automobile Ontology:
https://github.com/mfhepp/schemaorg/blob/automotive/data/schema.rdfa
Vehicle Sales Ontology:
http://www.heppnetz.de/ontologies/vso/ns
Product Ontology:
http://www.productontology.org/dump.rdf
Mesh Thesaurus:
https://data.europa.eu/euodp/en/data/dataset/eurovoc
English thesaurus:
https://github.com/mromanello/skosifaurus
https://raw.githubusercontent.com/mromanello/skosifaurus/master/thesaurus.rdf
General Ontologies:
https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30
https://wiki.dbpedia.org/Downloads2015-04#dbpedia-ontology
https://lod-cloud.net/dataset/dbpedia
https://wiki.dbpedia.org/services-resources/ontology
https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
https://tools.wmflabs.org/wikidata-exports/rdf/
All of the above ontologies are available in either ttl, n3 or rdf format, while Holmes Extractor (which uses RDFlib under the hood) works on OWL syntax. Therefore, converting these formats to OWL for heavy files is another challenge.
I have tried loading the first 4 ontologies into Holmes after conversion to OWL, but the conversion process takes time, and loading big ontologies into Holmes itself also takes a good amount of time.
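For reference, the format conversion itself only needs a few lines of rdflib (the file names below are placeholders); it does not make the process any faster for very large dumps, though:

import rdflib

g = rdflib.Graph()
g.parse("vso.ttl", format="turtle")               # use format="n3" or "xml" to match the source file
g.serialize(destination="vso.owl", format="xml")  # RDF/XML output, the syntax Holmes/rdflib expects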
As per Holmes documentation, Holmes works best with the ontologies that have been built for specific subject domains and use cases.
What next
The challenges which I am facing here are:
Finding the proper Ontologies available around customer experience in the automobile domain
Making the conversion process of ontologies faster
How can we easily generate the domain-specific ontology from our existing data
Configuring the right set of parameters along with Ontology to get more accurate results for search phrase queries
Any help will be appreciated. Thanks a lot in advance.
I am trying to fit an ARIMA model. I have 3 months of data, and it shows a count (float) for every minute. Which order should I pass to arima.fit()?
I need to predict for every minute.
A basic ARIMA(p,d,q) model isn't going to work with your data.
Your data violates the assumptions of ARIMA, one of which is that the parameters have to be consistent over time.
I see clusters of 5 spikes, so I'm presuming that you have 5 busy workdays and quiet weekends. Basic ARIMA isn't going to know the difference between a weekday and weekend, so it probably won't give you useful results.
There is such a thing as a SARIMA (Seasonal Autoregressive Integrated Moving Average). That would be useful if you were dealing with daily data points, but not really suitable for minute data either.
I would suggest that you try to filter your data so that it excludes evenings and weekends. Then you might be dealing with a dataset that has consistent parameters. If you're using python, you could try using pyramid's auto_arima() function on your filtered time series data (s) to have it try to auto-find the best parameters p, d, q.
It also does a lot of the statistical tests you might want to look into for this type of analysis. I actually don't always agree with auto_arima's parameter choices, but it's a start.
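A rough pandas sketch of that filtering, assuming your minute counts are in a Series s with a DatetimeIndex (the business hours are placeholders); the auto_arima call below can then be run on the filtered series:

s = s[s.index.dayofweek < 5]          # keep Monday-Friday, drop Saturday (5) and Sunday (6)
s = s.between_time("08:00", "18:00")  # drop evenings and nights; adjust to your opening hours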
import pyramid  # the pyramid-arima package (published as pmdarima in newer releases)

model = pyramid.arima.auto_arima(s)  # let auto_arima search for p, d, q on the filtered series
print(model.summary())
http://pyramid-arima.readthedocs.io/en/latest/_submodules/arima.html
1) Is your data even suitable for a Box-Jenkins model (ARIMA)?
2) I see a higher mean towards the end and clear seasonality in the data, so plain ARIMA will fail. Please try seasonality-aware models: SARIMA, as rightly suggested above, or Prophet, another nice forecasting algorithm by Facebook that handles seasonality (R implementation below).
https://machinelearningstories.blogspot.com/2017/05/facebooks-phophet-model-for-forecasting.html
3) Don't rely on ARIMA alone. Try other time series approaches such as STL, BSTS (Bayesian structural time series), TBATS, hybrid models, etc. Let me know if you want information about some R packages.
I have a dataset for store inventory management. For every product I have the history of order renewals. For example, for a product A, I have:
A,last_time_of_renawal,volume_order,time_of_order
A,last_time_of_renawal1,volume_order1,time_of_order1
For every line, I also have other information (category of product, sales number, stock_volume, ...).
How can I use this dataset and TensorFlow (or another deep learning library) to predict the next time_of_order for a product, knowing the last_time_of_order?
That is too broad a question for StackOverflow. Try to narrow it down using this guide.
But essentially you want to do a regression on the delta between time_of_order and last_time_of_order. That's your y. Then you have your features, such as category of product etc. (your X).
Now you have a wide world of statistical analysis at your disposal.
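A minimal sketch of that framing with scikit-learn; the file name and column names are taken from the question's sample lines and are otherwise assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("orders.csv", parse_dates=["time_of_order", "last_time_of_renawal"])

# y: the delta (in days) between the last renewal and the next order
y = (df["time_of_order"] - df["last_time_of_renawal"]).dt.days

# X: the other per-line information (category of product, sales number, stock volume, ...)
X = pd.get_dummies(df[["category", "sales_number", "stock_volume"]], columns=["category"])

model = RandomForestRegressor(n_estimators=100).fit(X, y)
print(model.predict(X.head()))  # predicted delta in days for the first few products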
If you insist on using deep learning: try setting up a "simple" neural network by following a YouTube playlist. Once that works, you can try using your own data.
And if you encounter problems... come back to StackOverflow with a specific programming question :) Have fun!