I have bandwidth data which identifies protocol usage by tonnage and hour. Based on the protocol, you can tell when something is merely connected versus actually being used (on the order of 1,000 bits compared to millions or billions of bits) in that hour. The problem is that, looking at each protocol, they are all heavily right-skewed: roughly 80% of the records are the "just connected" traffic, or what I'm calling "noise."
The task I have is to separate out this noise and focus only on when the protocol is actually being used. My classmates are all doing this manually, removing records below some low threshold. I was hoping there was a way to automate this using statistics instead of just picking a threshold that "looks good." We have something like 30 different protocols, each with a different bit count that represents "noise": a download protocol might register 1,000 bits when connected but not in full use, where a messaging app might register 75 bits. Similarly, they have different means and different gaps between noise and usage, e.g. the download mean is 215,000,000 and the messaging mean is 5,000,000. There isn't any set pattern between them.
Also, this "noise" has many connections but only accounts for 1-3% of the total bandwidth being used, which is why we are tasked with identifying actual usage versus passive usage.
I don't want any actual code, as I'd like to practice with the implementation and solution building myself. But the logic, process, or name of a statistical method would be very helpful.
Do you have labeled examples, and do you have other data besides the bandwidth? One way to do this would be to train some kind of ML classifier if you have a decent amount of data where you know whether it's in use or not. If you have enough data you also might be able to do this unsupervised. For a start, a simple Naive Bayes classifier works well for binary problems. As you may be aware, NB was the original basis for spam detection (is it spam or not), so your case of "is it noise or not" should also work, but you will get more robust results if you have other data in addition to the bandwidth to train on. Also, I am wondering if there isn't a way to improve the title of your post so that it communicates your question more quickly.
Suppose that I have a data set S containing the service times for different jobs, S = {t1, t2, t3, ..., tn}, where ti is the service time for the ith job and n is the total number of jobs in my data set; here n is 300k. S is only a sample from a population. I would like to study the impact of long service times, as some jobs take very long and some do not, and my intuition is to study this impact based on data gathered from a real system. The system under study has billions of jobs, and this number increases by 100 new jobs every few seconds. Also, service time is measured by benchmarking the jobs on a local machine, so it is practically expensive to keep expanding the data set. Thus, I decided to randomly pick 300k jobs.
I am conducting simulation experiments where I have to generate a large number of jobs with their service times (say millions) and then do some other calculations.
On how to use S as a population in my simulation, I came across the following options:
1. Use S itself. I could use bootstrapping: sample with replacement (or sample without replacement).
2. Fit a theoretical distribution model to S and then draw from it.
Am I correct? Which approach is best (pros and cons)? The first approach seems easy, as it just means picking a random service time from S each time, but is it reliable? Any suggestion is appreciated, as I am not good at stats.
Quoting from this tutorial in the 2007 Winter Simulation Conference:
At first glance, trace-driven simulation seems appealing. That is
where historical data are used directly as inputs. It’s hard to argue
about the validity of the distributions when real data from the
real-world system is used in your model. In practice, though, this
tends to be a poor solution for several reasons. Historical data may
be expensive or impossible to extract. It certainly won’t be available
in unlimited quantities, which significantly curtails the statistical
analysis possible. Storage requirements are high. And last, but not
least, it is impossible to assess “what-if?” strategies or try to
simulate a prospective system, i.e., one which doesn’t yet exist.
One of the major uses of simulation is to study alternative configurations or policies, and trace data is not suitable for that—it can only show you how you're currently operating. Trace data cannot be used for studying systems which are under consideration but don't yet exist.
Bootstrapping resamples your existing data. This removes the data quantity limitations, but at a potential cost. Bootstrapping is premised on the assumption that your data are representative and independent. The former may not be an issue with 300k observations, but often comes up when your sample size is smaller due to cost or availability issues. The latter is a big deal if your data come from a time series where the observations are serially correlated or non-homogeneous. In that case, independent random sampling (rather than sequential playback) can lose significant information about the behaviors being studied.
If sequential playback is required you're back to being limited to 300k observations, and that may not be nearly as much data as you think for statistical measures. Variance estimation is essential to calculating margins of error for confidence intervals, and serial correlation has a huge impact on the variance of a sample mean. Getting valid confidence interval estimates can take several orders of magnitude more data than is required for independent data.
In summary, distribution fitting takes more work up front but is usually more useful in the long run.
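The two approaches can be sketched side by side. Below is a minimal Python illustration, assuming (purely for the sketch) that service times are roughly lognormal; `sample` is a synthetic stand-in for the real 300k observations.

```python
import math
import random
import statistics

random.seed(42)
# Stand-in for S: a hypothetical sample of service times (the real one has 300k).
sample = [random.lognormvariate(2.0, 0.75) for _ in range(5000)]

# Approach 1: bootstrap -- draw simulated service times from S with replacement.
def bootstrap_draw(data, k):
    return [random.choice(data) for _ in range(k)]

# Approach 2: fit a theoretical model to S (here a lognormal, estimated from
# the log-moments of the sample) and draw fresh values from it.
logs = [math.log(t) for t in sample]
mu, sigma = statistics.mean(logs), statistics.stdev(logs)

def fitted_draw(k):
    return [random.lognormvariate(mu, sigma) for _ in range(k)]
```

Note the trade-off mentioned above in miniature: the bootstrap can never produce a service time outside the range already seen in S, while the fitted model can generate unseen, more extreme values, which matters if the long-service-time tail is what you are studying.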
For my master's thesis I am using a 3rd party program (SExtractor) in addition to a python pipeline to work with astronomical image data. SExtractor takes a configuration file with numerous parameters as input, which influences (after some intermediate steps) the statistics of my data. I've already spent way too much time playing around with the parameters, so I've looked a little bit into machine learning and have gained a very basic understanding.
What I am wondering now is: is it reasonable to use a machine learning algorithm to optimize SExtractor's parameters, when the only way to judge the performance or quality of a parameter set is the final statistics of an analysis run (which takes at least an hour on my machine), and more than 6 parameters influence those statistics?
As an example, I have included two different versions of the statistics I am referring to, produced by slightly different SExtractor parameter settings. The red line in the left image is the median of the standard deviation (as it should be). The blue line is the median of the standard deviation as I actually get it. The right images display the differences between the objects in the two data sets.
I know this is a very specific question, but as I am new to machine learning, I can't really judge whether this is feasible. So it would be great if someone could tell me whether this is a pointless endeavor and point me in the right direction.
You can try an educated guess based on the data that you already have. You are trying to optimize the parameters such that the median of the standard deviation has the desired value. You could assume various models and try to estimate the parameters based on the model and the observed data. But I think you need a good understanding of machine learning to do so; by "good" I mean beyond an undergraduate course.
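Before anything model-based, a cheap baseline worth knowing about is plain random search over the parameter space, keeping whichever setting brings the statistic closest to the target. A sketch, where `run_pipeline`, `TARGET`, and the parameter names are all hypothetical stand-ins (a real run takes about an hour, so `n_trials` would be small):

```python
import random

random.seed(0)

# Hypothetical stand-in for one SExtractor + pipeline run: maps a parameter
# dict to the resulting median standard deviation. In reality this is the
# hour-long analysis; here it is a made-up smooth function for illustration.
def run_pipeline(params):
    return (params["detect_thresh"] - 1.5) ** 2 \
        + 0.1 * abs(params["deblend_mincont"] - 0.005) + 1.0

TARGET = 1.0  # desired median standard deviation (illustrative)

# Each parameter gets a sampler over a plausible range (names are made up).
space = {
    "detect_thresh": lambda: random.uniform(0.5, 5.0),
    "deblend_mincont": lambda: random.uniform(0.0001, 0.1),
}

def random_search(n_trials):
    """Try n_trials random settings; keep the one closest to TARGET."""
    best, best_err = None, float("inf")
    for _ in range(n_trials):
        params = {name: draw() for name, draw in space.items()}
        err = abs(run_pipeline(params) - TARGET)
        if err < best_err:
            best, best_err = params, err
    return best, best_err
```

Random search scales better than a full grid when more than a handful of parameters matter, and it needs no ML background; more sample-efficient methods (e.g. Bayesian optimization) exist but require more machinery.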
Introduction
I'd like to know what other topic modellers consider an optimal topic-modelling workflow, all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others interested in learning about best practices for the end-to-end process.
Proposed Solution Specifications
I'd like the proposed solution to preferably rely on R for text processing (Python is fine too) and for the topic modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels package in R; however, I would like to switch to MALLET, as it offers many benefits over topicmodels: it can handle a lot of data, it does not rely on specific text pre-processing tools, and it appears to be widely used for this purpose. However, some of the issues outlined below are relevant for topicmodels too. I'd like to know how others approach topic modelling and which of the steps below could be improved. Any useful piece of advice is welcome.
Outline
Here is how it's going to work: I'm going to go through the workflow which in my opinion works reasonably well, and I'm going to outline problems at each step.
Proposed Workflow
1. Clean text
This involves removing punctuation marks, digits and stop words, stemming words, and other text-processing tasks. Many of these can be done as part of the term-document matrix decomposition, through functions such as TermDocumentMatrix from R's tm package.
Problem: This may, however, need to be performed on the text strings directly, using functions such as gsub, in order for MALLET to consume the strings. Performing it on the strings directly is less efficient, as it involves repetition (e.g. the same word would have to be stemmed several times).
2. Construct features
In this step we construct a term-document matrix (TDM), followed by filtering of terms based on frequency and TF-IDF values. It is preferable to limit your bag of features to about 1000 or so. Next, go through the terms and identify what needs to be (1) dropped (some stop words will make it through), (2) renamed, or (3) merged with existing entries. While I'm familiar with the concept of stem completion, I find that it rarely works well.
Problem: (1) Unfortunately MALLET does not work with TDM constructs and to make use of your TDM, you would need to find the difference between the original TDM -- with no features removed -- and the TDM that you are happy with. This difference would become stop words for MALLET. (2) On that note I'd also like to point out that feature selection does require a substantial amount of manual work and if anyone has ideas on how to minimise it, please share your thoughts.
Side note: If you decide to stick with R alone, then I can recommend the quanteda package, which has a function dfm that accepts a thesaurus as one of its parameters. This thesaurus allows you to capture patterns (usually regex) as opposed to the words themselves, so for example you could have a pattern \\bsign\\w*.?ups? that would match sign-up, signed up and so on.
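The same thesaurus-style pattern can be checked quickly in Python's `re` module (the pattern below is the one from the side note, with R's doubled backslashes written as a raw string):

```python
import re

# Matches "sign-up", "signed up", "signups", etc., so all variants can be
# collapsed into a single feature before modelling.
pattern = re.compile(r"\bsign\w*.?ups?")

texts = ["Please sign-up today", "I signed up yesterday", "No signups allowed"]
hits = [bool(pattern.search(t)) for t in texts]
```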
3. Find optimal parameters
This is a hard one. I tend to break data into test-train sets and run cross-validation fitting a model of k topics and testing the fit using held-out data. Log likelihood is recorded and compared for different resolutions of topics.
Problem: Log likelihood does help to understand how good the fit is, but (1) it often suggests more topics than is practically sensible, and (2) given how long it generally takes to fit a model, it is virtually impossible to find or test a grid of optimal values for parameters such as iterations, alpha, burn-in and so on.
Side note: When selecting the optimal number of topics, I generally select a range of topics incrementing by 5 or so as incrementing a range by 1 generally takes too long to compute.
4. Maintenance
It is easy to classify new data into a set of existing topics. However, if you are running it over time, you would naturally expect that some of your topics may cease to be relevant, while new topics may appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for, as you are dealing with a problem that requires an unsupervised solution, and yet for it to be tracked over time you need to approach it in a supervised way.
Problem: To overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on the new data, (3) monitor log likelihood values over time and devise a threshold for when to switch from old to new, and (4) merge old and new solutions somehow so that the evolution of topics would be revealed to a lay observer.
Recap of Problems
String cleaning for MALLET to consume the data is inefficient.
Feature selection requires manual work.
Optimal number of topics selection based on LL does not account for what is practically sensible.
Computational complexity does not give the opportunity to find an optimal grid of parameters (other than the number of topics).
Maintenance of topics over time poses challenging issues as you have to retain history but also reflect what is currently relevant.
If you've read this far, thank you; this is a rather long post. If you are interested in the subject, feel free to add further questions in the comments that you think are relevant, or to offer your thoughts on how to overcome some of these problems.
Cheers
Thank you for this thorough summary!
As an alternative to topicmodels try the package mallet in R. It runs Mallet in a JVM directly from R and allows you to pull out results as R tables. I expect to release a new version soon, and compatibility with tm constructs is something others have requested.
To clarify, it's a good idea for documents to be at most around 1000 tokens long (not vocabulary). Any more and you start to lose useful information. The assumption of the model is that the position of a token within a given document doesn't tell you anything about that token's topic. That's rarely true for longer documents, so it helps to break them up.
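That pre-processing step can be sketched simply: break a tokenized document into consecutive chunks of at most ~1000 tokens before feeding it to the model.

```python
def split_document(tokens, max_len=1000):
    """Break a token list into consecutive chunks of at most max_len tokens,
    so the bag-of-words assumption holds better within each chunk."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```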
Another point I would add is that documents that are too short can also be a problem. Tweets, for example, don't seem to provide enough contextual information about word co-occurrence, so the model often devolves into a one-topic-per-doc clustering algorithm. Combining multiple related short documents can make a big difference.
Vocabulary curation is in practice the most challenging part of a topic modeling workflow. Replacing selected multi-word terms with single tokens (for example by swapping spaces for underscores) before tokenizing is a very good idea. Stemming is almost never useful, at least for English. Automated methods can help vocabulary curation, but this step has a profound impact on results (much more than the number of topics) and I am reluctant to encourage people to fully trust any system.
Parameters: I do not believe that there is a right number of topics. I recommend using a number of topics that provides the granularity that suits your application. Likelihood can often detect when you have too few topics, but after a threshold it doesn't provide much useful information. Using hyperparameter optimization makes models much less sensitive to this setting as well, which might reduce the number of parameters that you need to search over.
Topic drift: This is not a well understood problem. More examples of real-world corpus change would be useful. Looking for changes in vocabulary (e.g. proportion of out-of-vocabulary words) is a quick proxy for how well a model will fit.
I'm using an event-loop-based server in Twisted Python that stores files, and I'd like to be able to classify the files according to their compressibility.
If the probability that they'd benefit from compression is high, they would go to a directory with btrfs compression switched on, otherwise they'd go elsewhere.
I do not need certainty: 80% accuracy would be plenty and would save a lot of disk space. But since there are CPU and filesystem performance costs too, I cannot just save everything compressed.
The files are in the low megabytes. I cannot test-compress them without using a huge chunk of CPU and unduly delaying the event loop, or without refactoring a compression algorithm to fit into the event loop.
Is there any best practice to give a quick estimate for compressibility? What I came up with is taking a small chunk (few kB) of data from the beginning of the file, test-compress it (with a presumably tolerable delay) and base my decision on that.
Any suggestions? Hints? Flaws in my reasoning and/or problem?
Just 10K from the middle of the file will do the trick. You don't want the beginning or the end, since they may contain header or trailer information that is not representative of the rest of the file. 10K is enough to get some amount of compression with any typical algorithm. That will predict a relative amount of compression for the whole file, to the extent that that middle 10K is representative. The absolute ratio you get will not be the same as for the whole file, but the amount that it differs from no compression will allow you to set a threshold. Just experiment with many files to see where to set the threshold.
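A minimal sketch of that probe, using Python's zlib (DEFLATE here is just a stand-in for whatever compressor the filesystem is configured with):

```python
import zlib

def estimate_compressibility(path, probe_size=10 * 1024):
    """Compress ~10 KB from the middle of the file and return the ratio
    compressed/original for that chunk (lower means more compressible)."""
    with open(path, "rb") as f:
        f.seek(0, 2)                     # seek to end to learn the file size
        size = f.tell()
        f.seek(max(0, (size - probe_size) // 2))  # jump to the middle
        chunk = f.read(probe_size)
    if not chunk:
        return 1.0                       # empty file: nothing to gain
    return len(zlib.compress(chunk)) / len(chunk)
```

As the answer says, the absolute ratio won't match the whole file; it is the relative distance from "no compression" that you threshold on, with the cutoff tuned by experimenting on your own files.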
As noted, you can save time by doing nothing for files that are obviously already compressed, e.g. .png. .jpg., .mov, .pdf, .zip, etc.
Measuring entropy is not necessarily a good indicator, since it only gives the zeroth-order estimate of compressibility. If the entropy indicates that it is compressible enough, then it is right. If the entropy indicates that it is not compressible enough, then it may or may not be right. Your actual compressor is a much better estimator of compressibility. Running it on 10K won't take long.
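For comparison, the zeroth-order entropy estimate mentioned above is cheap to compute:

```python
import math
from collections import Counter

def byte_entropy(data):
    """Zeroth-order (per-byte) entropy in bits: 0.0 for constant data,
    up to 8.0 for uniformly distributed bytes."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

This matches the asymmetry described above: low entropy does guarantee the chunk will compress, while entropy near 8 bits/byte does not rule out gains from higher-order structure, which is why running the real compressor on 10K is the better probe.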
I think what you are looking for is "How to calculate the entropy of a file?"
That question contains all kinds of methods to calculate the entropy of a file (and from that you can get the 'compressibility' of the file). Here's a quote from the abstract of the article "Relationship Between Entropy and Test Data Compression" (Kedarnath J. Balakrishnan, Member, IEEE, and Nur A. Touba, Senior Member, IEEE):
The entropy of a set of data is a measure of the amount of information contained in it. Entropy calculations for fully specified data have been used to get a theoretical bound on how much that data can be compressed. This paper extends the concept of entropy for incompletely specified test data (i.e., that has unspecified or don't care bits) and explores the use of entropy to show how bounds on the maximum amount of compression for a particular symbol partitioning can be calculated. The impact of different ways of partitioning the test data into symbols on entropy is studied. For a class of partitions that use fixed-length symbols, a greedy algorithm for specifying the don't cares to reduce entropy is described. It is shown to be equivalent to the minimum entropy set cover problem and thus is within an additive constant error with respect to the minimum entropy possible among all ways of specifying the don't cares. A polynomial time algorithm that can be used to approximate the calculation of entropy is described. Different test data compression techniques proposed in the literature are analyzed with respect to the entropy bounds. The limitations and advantages of certain types of test data encoding strategies are studied using entropy theory
And to be more constructive, check out this site for a Python implementation of entropy calculations on chunks of data.
Compressed files usually don't compress well. This means that just about any media file is not going to compress very well, since most media formats already include compression. Clearly there are exceptions to this, such as BMP and TIFF images, but you can probably build a whitelist of well-compressed filetypes (PNGs, MPEGs, and venturing away from visual media - gzip, bzip2, etc) to skip and then assume the rest of the files you encounter will compress well.
If you feel like getting fancy, you could build feedback into the system (observe the results of any compression you do and associate the resulting ratio with the filetype). If you come across a filetype that has consistently poor compression, you could add it to the whitelist.
These ideas depend on being able to identify a file's type, but there are standard utilities which do a pretty good job of this (generally much better than 80%) - file(1), /etc/mime.types, etc.
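A sketch of that skip-list idea with the standard library's mimetypes module; the set of types below is illustrative only, to be extended by the feedback loop described above:

```python
import mimetypes

# MIME types whose contents are almost always already compressed.
# Illustrative, not exhaustive -- grow this set from observed ratios.
ALREADY_COMPRESSED = {
    "image/png", "image/jpeg", "video/mp4",
    "application/zip", "application/pdf",
}

def probably_compressible(filename):
    """Guess from the file name whether compression is worth attempting."""
    mime, encoding = mimetypes.guess_type(filename)
    if encoding in ("gzip", "bzip2", "compress", "xz"):
        return False              # .gz/.bz2/.Z/.xz are already compressed
    return mime not in ALREADY_COMPRESSED
```

Name-based guessing is weaker than content sniffing with file(1), but it costs nothing in the event loop and can gate which files get the test-compression probe at all.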
I'd like to begin thinking about how to scale up the algorithms I write for data analysis so that they can be applied to arbitrarily large data sets. I wonder which concepts (threads, concurrency, immutable data structures, recursion) and tools (Hadoop/MapReduce, Terracotta, Eucalyptus) are relevant, and how these concepts and tools relate to each other. I have a rudimentary background in R, Python, and bash scripting, and also C and Fortran programming, and I'm familiar with some basic functional programming concepts. Do I need to change the way I program, use a different language (Clojure, Haskell, etc.), or simply (or not so simply!) adapt something like R/Hadoop (RHIPE), or write wrappers for Python to enable multi-threading or Hadoop access? I understand this might involve additional hardware requirements, and I would like some basic idea of the requirements and options available. My apologies for this rather large and yet vague question, but I'm just trying to get started; thanks in advance!
While languages and associated technologies/frameworks are important for scaling, they tend to pale in comparison to the importance of the algorithms, data structure, and architectures. Forget threads: the number of cores you can exploit that way is just too limited -- you want separate processes exchanging messages, so you can scale up at least to a small cluster of servers on a fast LAN (and ideally a large cluster as well!-).
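A minimal sketch of the processes-exchanging-messages pattern with Python's multiprocessing, where the only "messages" are the chunk sent to each worker and the partial result sent back:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Worker: runs in a separate process; its return value is the one
    message sent back to the parent."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_procs=4):
    """Split the data into one chunk per process and combine partial sums."""
    step = max(1, len(data) // n_procs)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with Pool(n_procs) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

The same shape (independent workers, explicit messages, a final reduce) is what lets you swap local processes for machines on a LAN, which is exactly the step a shared-memory thread pool cannot make.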
Relational databases may be an exception to "technologies pale" -- they can really clamp you down when you're trying to scale up a few orders of magnitude. Is that your situation -- are you worried about mere dozens or at most hundreds of servers, or are you starting to think about thousands or myriads? In the former case, you can still stretch relational technology (e.g. by horizontal and vertical sharding) to support you -- in the latter, you're at the breaking point, or well past it, and must start thinking in terms of key/value stores.
Back to algorithms: "data analysis" covers a wide range... most of my work for Google over the last few years falls in that range, e.g. in cluster management software, and currently in business intelligence. Do you need deterministic analysis (e.g. for accounting purposes, where you can't possibly overlook a single penny out of 8-digit figures), or can you stand some non-determinism? Most "data mining" applications fall into the second category -- you don't need total precision and determinism, just a good estimate of the range that your results can be proven to fall within, with, say, 95% probability.
This is particularly crucial if you ever need to do near-real-time data analysis -- near-real-time and 100% accuracy constraints on the same computation do not a happy camper make. But even in bulk/batch off-line data mining, if you can deliver results that are 95% guaranteed orders of magnitude faster than it would take for 99.99% guarantees (I don't know if data mining can ever be 100.00%!-), that may be a wonderful tradeoff.
The work I've been doing over the last few years has had a few requirements for "near-real-time" and many more requirements for off-line, "batch" analysis -- and only a very few cases where absolute accuracy is an absolute must. Gradually-refined sampling (when full guaranteed accuracy is not required), especially coupled with stratified sampling (designed closely with a domain expert!!!), has proven, over and over, to be a great approach; if you don't understand this terminology, and still want to scale up, beyond the terabytes, to exabytes and petabytes' worth of processing, you desperately need a good refresher course in Stats 201, or whatever course covers these concepts in your part of the woods (or on iTunes University, or the YouTube offerings in university channels, or blip.tv's, or whatever).
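A toy illustration of the stratified sampling idea (the strata and sizes below are made up): sample each stratum in proportion to its size and reweight, instead of sampling the pooled population blindly and hoping the rare stratum shows up.

```python
import random

random.seed(7)

# Hypothetical job population with two very different strata, e.g. short
# interactive jobs versus long batch jobs.
population = {
    "short": [random.uniform(0.1, 1.0) for _ in range(90000)],
    "long": [random.uniform(50.0, 500.0) for _ in range(10000)],
}

def stratified_mean(strata, n_total):
    """Estimate the population mean from a proportional stratified sample."""
    sizes = {name: len(values) for name, values in strata.items()}
    total = sum(sizes.values())
    estimate = 0.0
    for name, values in strata.items():
        n_k = max(1, round(n_total * sizes[name] / total))
        sample = random.sample(values, n_k)            # sample within stratum
        estimate += (sizes[name] / total) * (sum(sample) / n_k)  # reweight
    return estimate
```

The real payoff, as the answer says, comes from designing the strata with a domain expert so the high-variance subpopulation gets enough samples.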
Python, R, C++, whatever, only come into play after you've mastered these algorithmic issues, the architectural issues that go with them (can you design a computation architecture to "statistically survive" the death of a couple of servers out of your myriad, recovering to within statistically significant accuracy without a lot of rework...?), and the supporting design and storage-technology choices.
The main thing for scaling up to large data is to avoid situations where you're reading huge datasets into memory at once. In pythonic terms this generally means using iterators to consume the dataset in manageable pieces.
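Two minimal examples of that pattern: a generator that yields a large file in fixed-size pieces, and a one-pass aggregate that consumes any iterator in O(1) memory.

```python
def read_in_chunks(path, chunk_size=1 << 20):
    """Yield a large file in ~1 MB pieces instead of reading it all at once."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk

def running_mean(numbers):
    """Consume any iterator of numbers in one pass, using constant memory."""
    total = count = 0
    for x in numbers:
        total += x
        count += 1
    return total / count if count else 0.0
```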