How to do Log Mining? - python

In order to figure out (or at least guess) how one of our proprietary desktop tools, built with wxPython, is actually used, I injected a logging decorator into several selected class methods. Each log record looks like the following:
Right now, there are more than 3M log records in the database and I have started to think, "What can I get from all this?". I can get information like:
hit rate of (klass, method) over a period of time (e.g., a week).
power users, by record counts.
approximate crash rate, by comparing missing closing logs against opening logs.
I guess the relevant technique might be log mining. Does anyone have ideas about further information I can retrieve from this really simple log? I'm really interested in getting more out of it.
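For instance, the first metric on the list (hit rate per (klass, method) over a week) could be computed with collections.Counter. The record fields below are assumptions, since the actual log format isn't shown:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical record shape -- the real schema may differ.
records = [
    {"klass": "MainFrame", "method": "on_open", "ts": datetime(2020, 1, 6)},
    {"klass": "MainFrame", "method": "on_open", "ts": datetime(2020, 1, 7)},
    {"klass": "Editor", "method": "save", "ts": datetime(2020, 1, 8)},
]

def hit_counts(records, start, days=7):
    """Count hits per (klass, method) within a window, e.g. one week."""
    end = start + timedelta(days=days)
    return Counter(
        (r["klass"], r["method"]) for r in records if start <= r["ts"] < end
    )
```

Against a 3M-row table you would push the filtering and grouping into SQL (GROUP BY klass, method) rather than doing it in Python, but the idea is the same.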

SpliFF is right, you'll have to decide which questions are important to you and then figure out if you're collecting the right data to answer them. Making sense of this sort of operational data can be very valuable.
You probably want to start by seeing if you can answer some basic questions, and then move on to the tougher stuff once you have your log collection and analysis workflow established. Some longer-term questions you might consider:
What are the most common, most severe bugs encountered "in the wild", ranked by frequency and impact? Data: capture stack traces / call points and method arguments if possible.
Can you simplify some of the common actions your users perform? If X is the most common, can the number of steps be reduced or can individual steps be simplified? Data: Sessions, clickstreams for the common workflows. Features ranked by frequency of use, number and complexity of steps.
Some features may be confusing, or have conflicting options, leading to user mistakes. Sessions where the user backs up several times to repeat a step, or starts over from the beginning, may be telling.
You may also want to notify users that data is being collected for quality purposes, and even solicit some feedback from within the app's interface.

Patterns!
Patterns preceding failures. Say a failure was logged; now consider exploring these questions:
What was the sequence of klass-method combos that preceded it?
What about other combos?
Is it always the same sequence that precedes the same failures?
Does a sequence of minor failures precede a major failure?
etc
One way to compare patterns is as follows:
Classify each message
Represent each class/type with a unique ID, so you now have a sequence of IDs
Slice the sequence into time periods to compare
Compare the slices (arrays of IDs) with a diff algorithm
Retain samples of periods to establish the common patterns, then compare new samples for the same periods to establish a degree of anomaly
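A minimal sketch of the last two steps, using difflib from the standard library (the ID sequences are made up):

```python
from difflib import SequenceMatcher

def similarity(slice_a, slice_b):
    """Compare two time slices (sequences of message-type IDs).

    Returns a ratio in [0, 1]: 1.0 means identical sequences, and a low
    value against an established baseline suggests an anomalous period.
    """
    return SequenceMatcher(None, slice_a, slice_b).ratio()

# Made-up type-ID sequences: a baseline period vs. a new one
baseline = [3, 7, 7, 2, 9, 4]
current = [3, 7, 2, 9, 9, 4]
score = similarity(baseline, current)
```

The ratio is a cheap starting point; for multi-week baselines you would retain several reference slices per period and flag new slices whose best ratio falls below a chosen threshold.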

Related

Trying to work out how to produce a synthetic data set using python or javascript in a repeatable way

I have a reasonably technical background and have done a fair bit of Node development, but I'm a bit of a novice when it comes to statistics and a complete novice with Python, so any advice on a synthetic-data-generation experiment I'm trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I'm assuming there's a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug into a PRNG to produce a series of data points that are not only distributed like the original's, but also fall within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do the analysis and generate the data points? As I said, my maths is OK, but my statistics isn't great (and the docs for the tools I've seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
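For reference, scipy.stats covers option b) (each distribution object has a .fit method), while option a) can be approximated with plain NumPy by inverse-CDF sampling of the empirical distribution. A sketch of the latter, with a made-up stand-in for the real data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the real transactional column (prices, inter-arrival times...)
observed = rng.lognormal(mean=1.0, sigma=0.5, size=5000)

def empirical_sampler(observed, n):
    """Draw n synthetic points by inverse-CDF sampling the empirical
    distribution: the draws are distributed like the original data and
    stay within its range, with no named distribution required."""
    u = rng.uniform(0.0, 1.0, size=n)
    return np.quantile(observed, u)

synthetic = empirical_sampler(observed, 1000)
```

This keeps the generation repeatable (fixed seed) and bounded by the original data's range, at the cost of never producing values outside what was observed.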
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I'm doing for timestamp and then 'make up' a product for each generated price, but I discarded that for a couple of reasons: it might be consistent 'within' a produced dataset, but not 'across' datasets, and I imagine it would double-count quite a bit on largish sets.
So my next thought was to create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but I'm struggling with two aspects of that:
I'd need to generate the list itself. I imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamps (i.e. generate a set of products that have a similar spread of unit costs to the original data, and then Faker the rest of the fields).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out which product to select. I thought maybe I could generate some sort of bucketed histogram to work out the frequency of purchases within a range of costs (say $0-1, $1-2, etc.). I could then use that frequency to define the probability that a given transaction's cost would fall within one of those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy-to-understand tool (or at least one that's documented in plain English :))?
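The bucketed-histogram lookup described above maps directly onto random.choices, which accepts frequencies as weights. The buckets, frequencies and product IDs here are all made up:

```python
import random

random.seed(0)

# Hypothetical buckets derived from the real data's cost histogram,
# and a pre-built product lookup per bucket.
buckets = ["$0-1", "$1-2", "$2-5"]
frequencies = [0.2, 0.5, 0.3]  # observed share of purchases per bucket
products_by_bucket = {
    "$0-1": ["P001", "P002"],
    "$1-2": ["P003"],
    "$2-5": ["P004", "P005"],
}

def pick_product():
    """Pick a cost bucket by observed frequency, then a product within it."""
    bucket = random.choices(buckets, weights=frequencies, k=1)[0]
    return random.choice(products_by_bucket[bucket])
```

Seeding the generator keeps runs repeatable, which matches the "repeatable way" requirement in the title.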
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom

Data Cleanse: Grouping within variable company names

I'm working on some research on nursing homes, which are often owned by a chain. We have a list of 9,000+ nursing homes' corporate owners. Now, if I were MERGING this data into anything, I don't think it would be too much of a challenge, but I am being asked to group the facilities that are associated with each other for another analysis.
For example:
ABCM
ABCM CORP
ABCM CORPORATION
ABCM CORPORATE
I have already removed all the extra spaces and non-alphanumeric characters, and upcased everything. I'm just trying to think of a way to do this to within about 90% accuracy. The variation being within the same variable is the part that is throwing me off. I do have some other details such as ownership, state, ZIP, etc. I use Stata, SAS, and Python, if that helps!
Welcome to SO.
String matching is, broadly speaking, a pain whatever software you are using, and in most cases it needs human intervention to yield satisfactory results.
In Stata you may want to try matchit (ssc install matchit) for fuzzy string matching. I won't go into the details (I suggest you look at the help file, which is pretty well laid out), but the command returns each string matched with multiple similar entries, where "similar" depends on the chosen method, and you can specify a threshold for the level of similarity kept or discarded.
Even with all the above options, though, the final step is up to you: my personal experience tells me that no matter how restrictive you are, you'll always end up with several "false positives" that you'll have to work through yourself!
Good luck!
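If you end up in Python, a cheap first pass that often collapses most of this variation is normalizing away the legal suffixes before any fuzzy matching; a sketch (the suffix list is illustrative, not exhaustive):

```python
import re

# Illustrative, not exhaustive: extend with the suffixes in your data
SUFFIXES = re.compile(r"\b(CORPORATION|CORPORATE|CORP|COMPANY|INC|LLC|LTD)\b\.?")

def canonical(name):
    """Strip legal suffixes and collapse whitespace to get a grouping key."""
    cleaned = SUFFIXES.sub("", name.upper())
    return " ".join(cleaned.split())

def group_names(names):
    """Group raw names by their canonical form."""
    groups = {}
    for name in names:
        groups.setdefault(canonical(name), []).append(name)
    return groups
```

Whatever survives this pass can then go through a proper fuzzy matcher (matchit in Stata, or a string-similarity library in Python), with the remaining false positives reviewed by hand as described above.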

Implementation of multi-variable, multi-type RNN in Python

I have a dataset which has items with the following layout/schema:
{
words: "Hi! How are you? My name is Helennastica",
ratio: 0.32,
importantNum: 382,
wordArray: ["dog", "cat", "friend"],
isItCorrect: false,
type: 2
}
where I have a lot of different types of data, including:
Arrays (of one type only, e.g. an array of strings or an array of numbers, never both)
Booleans
Numbers with a fixed min/max (i.e. on a scale of 0 to 1)
Unbounded integers (any integer, -∞ to ∞)
Strings, with some dictionary words and some new words
The task is to create an RNN (well, more generally, a system that can quickly retrain when given one extra piece of data instead of reprocessing it all; I think an RNN is the best choice, see below for my reasoning) which can use all of these factors to categorise each example into one of 4 categories, labelled by the type key in the above example, a number 0-3.
I have a set of many examples in the above format (with the answer provided), and I have a database filled with uncategorised examples. My intention is to run the ML model on that set and sort all of them into categories. The reason I need to be able to retrain quickly is the feedback feature: if the AI gets something wrong, any user can report it, in which case that specific JSON will be added to the dataset. Obviously, having to retrain on 1,000+ JSONs just to add one extra would take ages; if I am not mistaken, an RNN can get around this.
I have found many possible use cases for something like this, yet I have spent literal hours browsing GitHub trying to find an implementation, or some TensorFlow module/addon to make this easier or to copy, but to no avail.
I assume this would not be too difficult using TensorFlow, and I understand a bit of the maths and logic behind it (but I'm not formally educated, so I probably have gaps!), but unfortunately I have essentially no experience with TensorFlow or any other ML framework (beyond copy-pasting code for some other projects). If someone could point me in the right direction in the form of a GitHub repo or Python framework, or even write some demo code to help solve this problem, it would be greatly appreciated. And if you're just going to correct some of my technical knowledge or tell me where I've gone horrendously wrong, I'd appreciate that feedback too (just leave it as a comment).
Thanks in advance!
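One note on the fast-retraining requirement: it is usually met by incremental (online) learning rather than by RNNs per se; in scikit-learn, several estimators expose this through partial_fit. As an illustration only, here is a toy multiclass perceptron with one-example updates (the feature vectors are made up, and real inputs would first need the strings/arrays encoded numerically):

```python
import numpy as np

class OnlinePerceptron:
    """Toy multiclass perceptron supporting one-example (online) updates."""

    def __init__(self, n_features, n_classes):
        self.W = np.zeros((n_classes, n_features))

    def predict(self, x):
        return int(np.argmax(self.W @ x))

    def partial_fit(self, x, y):
        """Update on a single example -- no full retraining pass needed."""
        pred = self.predict(x)
        if pred != y:
            self.W[y] += x
            self.W[pred] -= x

# Feed corrections back one at a time, e.g. from user reports
clf = OnlinePerceptron(n_features=2, n_classes=2)
clf.partial_fit(np.array([1.0, 0.0]), 0)
clf.partial_fit(np.array([0.0, 1.0]), 1)
```

A real perceptron would be far too weak for this task; the point is only the update pattern, which carries over to any estimator with a partial_fit-style interface.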

How to represent and then load custom dataset with varying number of columns for some entries into sci-kit learn

I am working on a keystroke-biometric authentication project. It is like a wrapper over your traditional password-based authentication: if the password is right, it checks the "typing rhythm" and gives a positive output if it matches the user's profile; else, a negative output is given. The "typing rhythm" is checked by mapping some timing properties that are extracted while typing the password. There are essentially 5 features, namely PP (press-press time), PR (press-release time), RP (release-press time), RR (release-release time) and total time. PP is the time between pressing two consecutive keys (characters). RR is the time between releasing two consecutive keys. PR is the time for which a key was pressed and then released. RP is the time between releasing a key and pressing the next key. Total time is the time between pressing the first key of the password and releasing the last key.
I'm using an open database, GREYC-Web based KeyStroke dynamics, for the project. Each session of data collection contains the ASCII value of the key pressed and the timestamps for PP, PR, RP, RR and total time. It also records whether the actual user or an impostor is typing the password. While collecting the data, users were allowed to use a password of their own, so naturally there are passwords of varying lengths. Apart from that, a user might press extra keys (like Shift, Caps, Backspace, Delete, etc.), so even for a particular user, different sessions of typing the password might have different password lengths. Note: password length in this context is the total number of keys (characters) that the user typed. So, for example, suppose the user's actual password is "abcd". In one session he types it properly and the password length is 4. In another session he types the following keys: a, l, BACKSPACE, b, c, d, and thus the password length is 6.
Here is some context on the proposed system (the block diagram is not shown here). The "Input Feature Space Partition" creates subsets of the actual database to be fed to different classifiers, namely Gaussian, k-NN and OC-SVM. The outputs of these classifiers are fed to a Back-Propagation Neural Network (BPNN), whose result is the final output. The BPNN is used to punish the classifiers that give a wrong result and reward those that give a correct one.
My question is: how do I represent this varying-length data in a structured format so that it can be processed and used in scikit-learn?
I have looked into pandas and NumPy for pre-processing the data, but my problem precedes the pre-processing stage.
Thanks, in advance!
An option would be a recurrent neural network (RNN). These are networks that effectively feed into themselves, creating a function of time, or in your case, of relative position in a word. Values are passed not only between nodes in the network, but also between timesteps. This generalized structure allows for the embedding of arbitrary time spans, or in your case, arbitrary word lengths.
A common variant of RNNs that is able to achieve even better results on some problems is the LSTM, or long short-term memory network.
To avoid an overly complex introductory answer, I will not go into too much detail on these. Basically, they have more complex "hidden units", which facilitate more complex decisions about what data is kept and what is "forgotten".
If you would like to implement these on your own, look into TensorFlow. If you have a library that you are more comfortable with, feel free to research its implementation of RNNs and LSTMs, but if not, TensorFlow is a great place to start.
Good luck in your research, and I hope this helps!
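If the scikit-learn classifiers from the question are kept instead, the usual workaround for variable-length sessions is to pad or truncate each timing sequence to a fixed width so everything fits in one rectangular array; a minimal (lossy) sketch:

```python
import numpy as np

def pad_sequences(seqs, max_len, pad_value=0.0):
    """Pad (or truncate) variable-length feature sequences to a fixed width."""
    out = np.full((len(seqs), max_len), pad_value)
    for i, seq in enumerate(seqs):
        trimmed = seq[:max_len]
        out[i, : len(trimmed)] = trimmed
    return out

# e.g. PP timings from two sessions of different lengths
X = pad_sequences([[0.12, 0.20, 0.31], [0.40]], max_len=4)
```

The choice of max_len and pad_value matters: padding with zeros implicitly tells distance-based classifiers like k-NN that short sessions had instantaneous keystrokes, so a mask column or a sentinel value may be preferable.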

Store data series in file or database if I want to do row level math operations?

I'm developing an app that handles sets of financial series data (input as CSV or OpenDocument); one set could be, say, tens of columns by thousands of rows of double-precision numbers (simplifying, but that's what matters).
I plan to do operations on that data (e.g. sums, differences, averages, etc.), including generating, say, another column based on computations on the input. This will be between columns (row-level operations) within one set, and also between columns across many (potentially all) sets, again at the row level. I plan to write it in Python, and it will eventually need an intranet-facing interface to display the results/graphs etc.; for now, CSV output based on some input parameters will suffice.
What is the best way to store and manipulate the data? So far I see my choices as either (1) writing CSV files to disk and trawling through them to do the math, or (2) putting them into a database and relying on the database to handle the math. My main concern is speed/performance as the number of datasets grows, since there will be inter-dataset row-level math to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all be read in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as the differing timestamps of each row. Since each dataset will have a differing, non-fixed length, I'll need some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure whether it makes more sense to store these (upsampling/interpolating to a common higher length) or just regenerate them each time they're needed.
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
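For the flat-file route, the "simple loop" might look like this (the file layout and column name are hypothetical):

```python
import csv

def column_sum(path, column):
    """Sum one numeric column of a CSV flat file with a plain loop."""
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
    return total
```

With an RDBMS the equivalent would be a one-line SELECT SUM(column) FROM fact, with the loop replaced by SQL.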
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results out of using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has reduced runtime in projects I've worked on dramatically, regardless of whether the data was stored in a database or text (in my case, XML).
While sums and averages will require runtime in only O(n), more complex calculations could easily push that into O(n^2) without applying this strategy. O(n^2) would be a performance hit that would likely have far more of a perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
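As a concrete instance of that strategy: if rows reference other rows and you aggregate over those references, building a dictionary index once keeps the whole job at O(n) instead of an O(n^2) nested scan. The field names here are made up:

```python
# Hypothetical rows where children reference a parent row by id
rows = [
    {"id": 1, "parent": None, "value": 10.0},
    {"id": 2, "parent": 1, "value": 2.5},
    {"id": 3, "parent": 1, "value": 4.0},
]

# Build the index once: O(n)
by_parent = {}
for row in rows:
    by_parent.setdefault(row["parent"], []).append(row)

# Aggregate per parent in a single pass instead of a scan per row
totals = {
    parent: sum(r["value"] for r in children)
    for parent, children in by_parent.items()
    if parent is not None
}
```

The same dict-index trick works whether the rows came from a CSV reader or a database cursor, which is why it is independent of the storage question.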
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
What matters most is whether all the data will fit simultaneously into memory. From the sizes you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database and would do all operations directly in Python. Depending on what other processing you need, I would probably rather use binary pickles than CSV.
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
edit: If the data fits in memory then a simple CSV is fine. Plain-text data formats are always easier to deal with than opaque ones if you can use them.
