So I am doing this homework in one of the Stanford courses and I managed to solve all the questions, but I am trying to understand the last one; correct me if I am wrong.
One part says "build original system": that is building the model.
The other one is "bake-off": that is comparing different models to each other to see which one performs best.
Am I correct?
This is the link to the homework: https://www.youtube.com/watch?v=vqNj1dr8-HM
It is at the very end. It is just that these terms are very confusing and new to me.
Thanks in advance.
I need to know the exact steps for building the original system. What does it mean? What is the bake-off?
Backoff means you go back to the (n-1)-gram level to calculate the probabilities when you encounter a word with probability 0. So in our case you would use a 3-gram model to calculate the probability of "sunny" in the context "is a very".
The most widely used scheme is called "stupid backoff": whenever you back off one level, you multiply the score by 0.4. So if "sunny" exists in the 3-gram model, the score would be 0.4 * P("sunny" | "is a very").
You can go all the way back to the unigram model if needed, multiplying by 0.4^n, where n is the number of times you backed off.
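Here is a rough sketch of that scheme in Python, assuming the n-gram counts live in a dictionary keyed by tuples of words (the names and data layout are just illustrative, not from the assignment):

# A minimal sketch of stupid backoff. counts maps tuples of words to counts,
# and counts[()] is assumed to hold the total number of tokens so that the
# unigram case works too. All names here are illustrative.
def stupid_backoff(word, context, counts, alpha=0.4):
    """Score `word` given a tuple `context`, backing off one level at a time."""
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    if not context:
        return 0.0  # the word was never seen, even as a unigram
    # drop the oldest context word and pay a factor of alpha for each backoff
    return alpha * stupid_backoff(word, context[1:], counts, alpha)

# e.g. scoring "sunny" after "is a very":
# stupid_backoff("sunny", ("is", "a", "very"), counts)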
I have spent the best part of the last few days searching forums and reading papers trying to solve the following question. I have thousands of time series arrays, each of varying length, containing a single column vector. This column vector contains the time between clicks for dolphins using echolocation.
I have managed to cluster these into similar groups using DTW and want to check which trains have a high degree of self-similarity, i.e. repeated patterns. I only want to measure each train's similarity with itself and don't need to compare them with other trains, as I have already applied DTW for that. I'm hoping some of these clusters will contain trains with a high proportion of repeated patterns.
I have already applied the Ljung–Box test to each series to check for autocorrelation, but I think I should maybe be using something with the FFT and the power spectrum. I don't have much experience in this but have tried to do so using the Python package waipy. Ultimately, I just want to know whether there is some kind of repeated pattern in the data, ideally tested with a p-value. The image I have attached shows an example train across the top. The maximum length of my trains is 550.
(Image: example output from waipy)
I know this is quite a complex question but any help would be greatly appreciated even if it is a link to a helpful Python library.
Thanks,
Dex
For anyone in a similar position: I decided to go with motifs, as they are able to find repeated patterns in a time series using Euclidean distance. There is a really good package in Python called Stumpy which makes this very easy!
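For reference, this is roughly what it looks like; the window length m and the random series are placeholders, so swap in one of your own click-interval trains:

# A minimal sketch of motif discovery with Stumpy; data and window length are placeholders.
import numpy as np
import stumpy

train = np.random.rand(550)   # replace with one inter-click-interval train
m = 20                        # length of the repeated pattern (motif) to look for

mp = stumpy.stump(train, m)            # matrix profile: each window's distance to its nearest neighbour
motif_idx = int(np.argmin(mp[:, 0]))   # start index of the best-matching subsequence
neighbor_idx = mp[motif_idx, 1]        # start index of its closest match elsewhere in the train

print("motif at", motif_idx, "repeated at", neighbor_idx, "distance", mp[motif_idx, 0])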
Thanks,
Dex
I have a dataset which has items with the following layout/schema:
{
  words: "Hi! How are you? My name is Helennastica",
  ratio: 0.32,
  importantNum: 382,
  wordArray: ["dog", "cat", "friend"],
  isItCorrect: false,
  type: 2
}
where I have a lot of different types of data, including:
Arrays (of one type only, e.g. an array of strings or an array of numbers, never both)
Booleans
Numbers with a fixed min/max (i.e. on a scale of 0 to 1)
Limitless integers (any integer -∞ to ∞)
Strings, some from a dictionary and some new words
The task is to create an RNN (well, generally, a system that can quickly retrain when given one extra bit of data instead of reprocessing it all - I think an RNN is the best choice; see below for reasoning) which can use all of these factors to categorise each item into one of 4 categories, labelled by the type key in the above example, a number 0-3.
I have a large set of examples in the above format (with the answer provided), and I have a database filled with uncategorised examples. My intention is to run the ML model on that database and sort all of the examples into categories. The reason I need to be able to retrain quickly is the feedback feature: if the AI gets something wrong, any user can report it, in which case that specific JSON will be added to the dataset. Obviously, having to retrain on 1000+ JSONs just to add one extra would take ages; if I am not mistaken, an RNN can get around this.
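To make the retraining requirement concrete, here is a minimal sketch of the kind of incremental update I have in mind, assuming each item has already been encoded into a fixed-length numeric feature vector (the feature size, layer sizes and random data are placeholders):

# Minimal sketch: a small Keras classifier that can be nudged with one new example
# via train_on_batch instead of re-fitting on the whole dataset.
# NUM_FEATURES, the layer sizes and the random data are placeholders.
import numpy as np
import tensorflow as tf

NUM_FEATURES = 32   # length of the encoded feature vector per item
NUM_CLASSES = 4     # the `type` key: 0-3

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# initial training on the labelled examples
X = np.random.rand(1000, NUM_FEATURES)
y = np.random.randint(0, NUM_CLASSES, size=1000)
model.fit(X, y, epochs=5, verbose=0)

# a user reports one mis-categorised item: update on just that example
x_new = np.random.rand(1, NUM_FEATURES)
y_new = np.array([2])
model.train_on_batch(x_new, y_new)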
I have found many possible use-cases for something like this, yet I have spent literal hours browsing through GitHub trying to find an implementation, or some TensorFlow module/add-on to make this easier or to copy, but to no avail.
I assume this would not be too difficult using TensorFlow, and I understand a bit of the maths and logic behind it (but I'm not formally educated, so I probably have gaps!), but unfortunately I have essentially no experience with using TensorFlow or any other ML frameworks (beyond copy-pasting code for some other projects). If someone could point me in the right direction in the form of a GitHub repo/Python framework, or even write some demo code to help solve this problem, it would be greatly appreciated. And if you're just going to correct some of my technical knowledge/tell me where I've gone horrendously wrong, I'd appreciate that feedback too (just leave it as a comment).
Thanks in advance!
I have time series data and I am trying to fit an ARMA(p, q) model to it, but I am not sure what 'p' and 'q' to use. I came across this link: enter link description here
The usage for this model is enter link description here
But I don't think it automatically decides what 'p' and 'q' to use. It seems like I need to know what values of 'p' and 'q' are appropriate.
You'll have to do a bit of reading outside of the statsmodels package documentation.
See some of the content in this answer:
https://stackoverflow.com/a/12361198/6923545
There's a guy named Rob Hyndman who wrote a great book on forecasting, and it would be a fine idea to start there. Sections 8.3 and 8.4 are the bulk of what you're looking for.
From Rob's book, section 8.3:
In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable.
This is describing p -- the number of past values used to forecast a value.
From Rob's book, section 8.4:
a moving average model uses past forecast errors in a regression-like model.
This is describing q, the number of previous forecast errors in the model.
This link gives you a little bit of theory and some examples.
CASE 1: you already know the values of p and q (the orders of the ARMA model), and the algorithm finds the best coefficients.
CASE 2: if you don't know them, you can specify a range of possible values, and the algorithm finds the best ARMA(p, q) model that fits the data and estimates the corresponding coefficients.
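For case 2, here is a minimal sketch of a brute-force search by AIC with statsmodels; the series is random noise just to make the example runnable, and the search ranges are arbitrary:

# Grid-search p and q by AIC using statsmodels; ranges and data are illustrative.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = rng.normal(size=200)   # replace with your own time series

best_aic, best_order = np.inf, None
for p in range(4):
    for q in range(4):
        try:
            fit = ARIMA(series, order=(p, 0, q)).fit()   # d=0 gives ARMA(p, q)
        except Exception:
            continue                                     # skip orders that fail to converge
        if fit.aic < best_aic:
            best_aic, best_order = fit.aic, (p, q)

print("best (p, q) by AIC:", best_order)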
I'm trying to predict whether a fan is going to turn out to a sporting event or not. My data (a pandas DataFrame) consists of fan information (demographics, etc.) and whether or not they attended the last 10 matches (g1_attend - g10_attend).
fan_info  age  neighborhood  g1_attend  g2_attend  ...  g1_neigh_turnout
2717      22   downtown      0          1               .47
2219      67   east side     1          1               .78
How can I predict if they're going to attend g11_attend, when g11_attend doesn't exist in the DataFrame?
Originally, I was going to look into applying some of the basic models in scikit-learn for classification, and possibly just add a g11_attend column into the DataFrame. This all has me quite confused for some reason. I'm thinking now that it would be more appropriate to treat this as a time-series, and was looking into other models.
You are correct, you can't just add a new category (i.e. output class) to a classifier -- this requires something that handles time series.
But there is a fairly standard technique for using a classifier on time series: asserting (conditional) time independence and using windowing.
In short, we are going to make the assumption that whether or not someone attends a game depends only on variables we have captured, and not on some other time factor (or other factor in general).
That is, we assume we can shift their history of games attended around the year and the probability will stay the same.
This is clearly wrong, but we do it anyway because machine learning techniques will deal with some noise in the data.
It is clearly wrong because some people are going to avoid games in winter because it is too cold, etc.
So now on to the classifier:
We have inputs, and we want just one output.
So the basic idea is that we are going to train a model
that, given as input whether they attended the first 9 games, predicts whether they will attend the 10th.
So our inputs are[1] age, neighbourhood, g1_attend, g2_attend, ..., g9_attend,
and the output is g10_attend -- a binary value.
This gives us training data.
Then when it is time to use it, we shift everything across by one: g2_attend takes the place of g1_attend, g3_attend takes the place of g2_attend, ..., and g10_attend takes the place of g9_attend.
And then our prediction output will be for g11_attend.
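A minimal sketch of that setup, assuming the data is in a pandas DataFrame called df with the columns shown in the question (the choice of logistic regression is just illustrative):

# Windowing sketch: train on age/neighborhood + g1..g9 to predict g10,
# then shift the window by one game to predict g11. Model choice is illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

games = [f"g{i}_attend" for i in range(1, 11)]

X_train = pd.get_dummies(df[["age", "neighborhood"] + games[:9]], columns=["neighborhood"])
y_train = df["g10_attend"]
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# shift everything across by one: g2..g10 take the place of g1..g9
X_future = df[["age", "neighborhood"] + games[1:]].copy()
X_future.columns = ["age", "neighborhood"] + games[:9]   # rename so the columns line up
X_future = pd.get_dummies(X_future, columns=["neighborhood"])
X_future = X_future.reindex(columns=X_train.columns, fill_value=0)

g11_pred = model.predict(X_future)   # predicted g11_attend for each fan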
You can also train several models with different window sizes.
E.g. only looking at the last 2 games to predict attendance at the 3rd.
This gives you a lot more training data, since you can do
g1, g2 -> g3 and g2, g3 -> g4, etc., for each row.
You could train a bundle of different window sizes and merge the results with some ensemble technique.
In particular it is a good idea to train g1,...,g8-> g9,
and then use that to predict g10 (using g2,...,g9 as inputs)
to check if it is working.
I suggest that in future you may like to ask these questions on Cross Validated. While this may be on topic on Stack Overflow, it is more on topic there, and that site has a lot more statisticians and machine learning experts.
[1] I suggest discarding fan_info for now as an input. I just don't think it will get you anywhere, but it is beyond this question to explain why.
I am working on a simple naive Bayes classifier and I had a conceptual question about it.
I know that the training set is extremely important, so I wanted to know what constitutes a good training set in the following example. Say I am classifying web pages and deciding whether they are relevant or not. The decision is based on the probabilities of certain attributes being present on the page: certain keywords that increase the relevancy of the page. The keywords are apple, banana, mango. The relevant/irrelevant score is per user. Assume that a user is equally likely to mark a page relevant or irrelevant.
Now, for the training data, to get the best training for my classifier, would I need to have the same number of relevant results as irrelevant results? Do I need to make sure that each user has relevant/irrelevant results present for them to make a good training set? What do I need to keep in mind?
This is a slightly endless topic, as there are millions of factors involved. Python is a good example, as it drives most of Google (as far as I know). And this brings us to the very beginning of Google: there was an interview with Larry Page some years ago in which he spoke about the search engines before Google; for example, when he typed the word "university", the first result he found had the word "university" a few times in its title.
Going back to naive Bayes classifiers: there are a few very important key factors - assumptions and pattern recognition. And relations, of course. For example, you mentioned apples; that could have a few possibilities. For example:
Apple - if eating, vitamins, and shape are present, we assume that we are most likely talking about the fruit.
If we are mentioning electronics, screens, maybe Steve Jobs - that should be obvious.
If we are talking about religion, God, gardens, snakes - then it must have something to do with Adam and Eve.
So depending on your needs, you could have basic segments of data that each of these falls into, or a complex structure containing far more detail. So yes, you base most of those on plain assumptions, and based on those you can create more complex patterns for further recognition: Apple, iPod and iPad have similar patterns in their names, contain similar keywords, and mention certain people, so they are most likely related to each other.
Irrelevant data is very hard to spot. At this very point you are probably thinking that I own multiple Apple devices and am writing on a large iMac, while that couldn't be further from the truth, so it would be a very wrong assumption to begin with. The classifiers themselves must do very good segmentation and analysis before jumping to exact conclusions.
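To make that concrete, here is a minimal sketch of a keyword-based relevance classifier with scikit-learn; the example pages and labels are made up, and the roughly balanced relevant/irrelevant training examples (or an explicit class_prior) are essentially the knob your question is about:

# A minimal sketch of a keyword-based relevance classifier (scikit-learn).
# The example pages and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = [
    "apple banana mango recipes and vitamins",      # relevant
    "mango smoothie with banana and apple slices",  # relevant
    "latest phone screens and electronics news",    # irrelevant
    "garden snakes and other reptiles",             # irrelevant
]
labels = [1, 1, 0, 0]  # roughly balanced relevant/irrelevant examples

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

# class_prior can be set explicitly instead of relying on a balanced training set
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["fresh apple and mango fruit salad"])))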