Outlier detection in non-normally distributed data - python

I have a large dataset containing the yearly reports of companies.
In this dataset I want to detect errors/outliers. These outliers are mainly human input errors. I'm having trouble deciding which strategy is best for this problem, since my data is not normally distributed.
My dataset contains about 100 columns.
Does anyone have some input on techniques for detecting human errors?
Think of misplaced commas, too many zeros, etc.
Thank you in advance.

Well, it looks like a complicated problem.
It looks like your data has the following features:
1. NLP knowledge: company reports are pieces of text. To analyze them, NLP has to be applied.
2. High dimensionality: you currently have about 100 columns; counting the features produced by NLP decomposition, you might have thousands of columns in some cases.
3. Not normally distributed.
To solve it, you could try to:
1. Use NLP to transform the text into numeric information.
2. Use typical novelty or outlier detection tools to find the errors. You can try the scikit-learn models:
https://scikit-learn.org/stable/modules/outlier_detection.html
Hope this helps.
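For the specific error types named in the question (misplaced commas, extra zeros), a simpler per-column check may also be worth trying next to the scikit-learn detectors. The sketch below is not taken from the answer above: it assumes such errors shift a value by orders of magnitude, so it works on a log10 scale and uses a median/MAD "robust z-score", which does not rely on normality. The function name, the 3.5 threshold and the sample values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def flag_magnitude_errors(values: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Flag values whose order of magnitude is far from the column's median.

    Hypothetical helper: works on log10(|x|) and uses the median absolute
    deviation (MAD), so a few extreme entries do not distort the scale.
    """
    # Zeros have no logarithm; turn them into NaN so they are never flagged.
    logs = np.log10(values.abs().where(values != 0))
    median = logs.median()
    mad = (logs - median).abs().median()
    if mad == 0 or np.isnan(mad):
        # No spread to compare against: flag nothing.
        return pd.Series(False, index=values.index)
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    robust_z = 0.6745 * (logs - median) / mad
    return robust_z.abs() > threshold

# Made-up column: one value with extra zeros, one with a misplaced decimal.
revenue = pd.Series([120.0, 95.0, 130.0, 110.0, 105.0, 98.0, 125.0, 115000.0, 102.0, 1.1])
flags = flag_magnitude_errors(revenue)
print(revenue[flags])
```

Running a check like this column by column keeps the result easy to explain, at the cost of ignoring relationships between columns, which is where the multivariate scikit-learn detectors come in.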

Related

What should be used between Doc2Vec and Word2Vec when analyzing product reviews?

I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped the reviews per product, so that different reviews follow one after the other in my dataframe (i.e. different authors for one product). I have also already tokenized the reviews (and applied all the other pre-processing steps). Below is a mock-up of the dataframe I have (the list of tokens per product is actually very long, and there are many more products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which would be the most efficient between Doc2Vec and Word2Vec. I would initially go for Doc2Vec, since it can find similarities by taking the paragraph/sentence into account and find its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews come from different authors and might therefore bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which is giving me a fairly good silhouette score (~0.7).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per product, tagged with its row position
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df['tokens'])]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# Infer one vector per product from its token list
product2vec = [model3.infer_vector(doc.split(' ')) for doc in df['tokens']]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If anything is unclear, please tell me.
Thank you for your help.
EDIT: Below are the clusters I'm obtaining:
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
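The reusable scoring pipeline suggested at the start of this answer could be sketched roughly as follows. It assumes each candidate method (Doc2Vec, averaged Word2Vec, and so on) has already produced one vector per product, and it reuses the silhouette score from the question as the metric. The function name and the synthetic vectors are placeholders, not real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def score_embeddings(candidates, n_clusters, seed=0):
    """Cluster each candidate embedding matrix and return its silhouette score.

    `candidates` maps a method name to an array of shape (n_products, n_dims).
    """
    scores = {}
    for name, vectors in candidates.items():
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(vectors)
        scores[name] = silhouette_score(vectors, labels)
    return scores

# Synthetic stand-ins for real embeddings: one with two clear groups,
# one that is pure noise.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
separated = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
noise = rng.normal(size=(40, 2))

scores = score_embeddings({"separated": separated, "noise": noise}, n_clusters=2)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

A silhouette score only measures how compact the clusters are, not whether they match the topics you care about, so it is worth pairing it with a manual look at a few clusters per method.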

Null values in dataset

I'm using a dataset to predict the effects of covid-19 on the economy. The dataset contains 9k rows, and around 1k rows in each column are empty. Do I need to fill them manually by looking at other datasets online, can I fill in the average, or should I leave the dataset as it is?
Generally, I'd say that combining datasets from multiple sources without being really clear about your rationale can raise pretty big questions about the reliability of your data.
Otherwise, assuming averages and leaving nulls are both valid options, depending on what you're trying to do. If you're using scikit-learn (for example), you'll probably find that nulls throw errors, so filling with assumed averages is a relatively common thing to do. You might want to watch out, though, given that you've got more than 10% nulls!
From experience, I'd say think about what you're trying to do, and what will help you get there best. Then be really clear about presenting your methodology with your findings.
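If filling with averages is the route you take, a minimal sketch with scikit-learn's SimpleImputer might look like this. The column names and values are made up for illustration; checking the share of missing values first makes it easier to report the methodology afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up example data with gaps
df = pd.DataFrame({
    "gdp_growth": [2.0, np.nan, 4.0, 6.0],
    "unemployment": [5.0, 7.0, np.nan, 9.0],
})

# Share of missing values per column, worth reporting alongside the results
missing_share = df.isna().mean()
print(missing_share)

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(filled)
```

When the imputer feeds a model, fitting it on the training split only and reusing it on the test split avoids leaking test information into the averages.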

Handle missing values : When 99% of the data is missing from most columns (important ones)

I am facing a dilemma with a project of mine. A few of the variables don't have enough data: almost 99% of their observations are missing.
I am thinking of a couple of options:
Impute missing values with mean/KNN imputation
Impute missing values with 0.
I couldn't think of anything else in this direction. If someone can help, that would be great.
P.S. I don't feel comfortable using mean imputation when 99% of the data is missing. Does someone have a reasoning for that? Kindly let me know.
The data has 397576 observations, out of which below are the missing values
99% of the data is missing!!!???
Well, if your dataset has fewer than 100,000 examples, then you may want to remove those columns instead of imputing them by any method.
If you have a larger dataset, then mean imputation or KNN imputation would be ...OK. These methods don't capture the statistics of your data and can eat up memory. Instead, use Bayesian machine learning methods, such as fitting a Gaussian Process to your data or a Variational Auto-Encoder to those sparse columns.
1.) Here are a few links to learn about Gaussian processes and use them to sample missing values from the dataset:
What is a Random Process?
How to handle missing values with GP?
2.) You can also use a VAE to impute the missing values!!!
Try reading this paper
I hope this helps!
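The first option in this answer, removing the mostly empty columns instead of imputing them, can be sketched in pandas as below. The helper name, the 0.9 cut-off and the sample frame are assumptions for illustration; pick the threshold that fits your own data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.9) -> pd.DataFrame:
    """Keep only the columns whose share of missing values is at most max_missing."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= max_missing]

# Made-up frame: "rare" is 99% missing, "half" is 50% missing
df = pd.DataFrame({
    "id": range(100),
    "rare": [1.0] + [np.nan] * 99,
    "half": [1.0, np.nan] * 50,
})

kept = drop_sparse_columns(df)
print(list(kept.columns))
```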
My first question, in order to give a good answer, would be:
What are you actually trying to achieve with the completed data?
People impute data for different reasons, and the use case makes a big difference. For example, you could use imputation as:
Preprocessing step for training a machine learning model
Solution to have a nice Graphic/Plot that does not have gaps
Statistical inference tool to evaluate scientific or medical studies
99% missing data is a lot. In most cases you can expect that nothing meaningful will come out of it.
For some variables it still might make sense and produce at least something meaningful - but you have to handle this with care and think a lot about your solution.
In general you can say, imputation does not create entries out of thin air. A pattern must be present in the existing data - which then is applied to the missing data.
You probably will have to decide on a variable basis what makes sense.
Take your variable email as an example:
Depending on how your data is structured, it might be that each row represents a different customer with a specific email address, so that every row is supposed to hold a unique address. In this case imputation won't have any benefit: how should the algorithm guess the email? But if the data is structured differently and customers appear in multiple rows, then an algorithm can still fill in some meaningful data, for example by seeing that customer number 4 always has the same email address and filling it in for rows where only customer number 4 is given and the email is missing.
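The second case in the email example can be illustrated with a small pandas sketch. The column names customer_id and email and the addresses are made up; the point is only that a value is copied within a customer's own rows and nothing is invented for customers with no known address.

```python
import pandas as pd

# Made-up data: customer 4 and customer 7 each have one known address,
# customer 9 has none.
df = pd.DataFrame({
    "customer_id": [4, 4, 4, 7, 7, 9],
    "email": ["a@example.com", None, None, None, "b@example.com", None],
})

# Within each customer, copy the known address forwards and backwards
# over the gaps. Customers without any address stay missing.
df["email_filled"] = df.groupby("customer_id")["email"].transform(
    lambda s: s.ffill().bfill()
)
print(df)
```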

Classifying sentences with overlapping words

I have a CSV file containing comments (tweets and Facebook comments). I want to classify them into 4 categories, viz.
Pre Sales
Post Sales
Purchased
Service query
Now the problems that I'm facing are these:
There is a huge number of overlapping words between the categories, hence Naive Bayes is failing.
With tweets being only 160 chars long, what is the best way to prevent words from one category falling into another?
How should I select features that can handle both the 160-char tweets and the somewhat lengthier Facebook comments?
Please point me to any reference or tutorial I can follow on this, as I'm a newbie in this field.
Thanks
I wouldn't be so quick to write off Naive Bayes. It does fine in many domains where there are lots of weak clues (as in "overlapping words"), but no absolutes. It all depends on the features you pass it. I'm guessing you are blindly passing it the usual "bag of words" features, perhaps after filtering for stopwords. Well, if that's not working, try a little harder.
A good approach is to read a couple of hundred tweets and see how you know which category you are looking at. That'll tell you what kind of things you need to distill into features. But be sure to look at lots of data, and focus on the general patterns.
An example (but note that I haven't looked at your corpus): Time expressions may be good clues on whether you are pre- or post-sale, but they take some work to detect. Create some features "past expression", "future expression", etc. (in addition to bag-of-words features), and see if that helps. Of course you'll need to figure out how to detect them first, but you don't have to be perfect: You're after anything that can help the classifier make a better guess. "Past tense" would probably be a good feature to try, too.
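As a rough sketch of the idea above, the bag-of-words counts can be stacked next to a couple of hand-made "time expression" features before they go into Naive Bayes. Everything specific here is an assumption: the keyword lists are deliberately crude stand-ins for a real detector, and the four example texts and their labels are invented.

```python
import re
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Crude, hypothetical cue lists; a real detector would need more work.
PAST = re.compile(r"\b(bought|ordered|received|yesterday|last week)\b", re.I)
FUTURE = re.compile(r"\b(will|planning|thinking of|tomorrow|next week)\b", re.I)

def time_features(texts):
    """Return two binary columns: has a past expression, has a future expression."""
    rows = [[int(bool(PAST.search(t))), int(bool(FUTURE.search(t)))] for t in texts]
    return csr_matrix(np.array(rows))

# Invented training examples
texts = [
    "Thinking of getting the new model next week, any offers?",
    "I bought the phone yesterday and the screen is cracked",
    "Will this laptop be in stock tomorrow?",
    "Received my order last week, charger stopped working",
]
labels = ["Pre Sales", "Post Sales", "Pre Sales", "Post Sales"]

vectorizer = CountVectorizer()
# Bag-of-words counts plus the two extra feature columns
X = hstack([vectorizer.fit_transform(texts), time_features(texts)]).tocsr()
clf = MultinomialNB().fit(X, labels)

new_texts = ["I ordered it yesterday"]
X_new = hstack([vectorizer.transform(new_texts), time_features(new_texts)]).tocsr()
print(clf.predict(X_new))
```

Whether the extra columns help can only be judged on held-out tweets from the real corpus, so treat the feature list as something to iterate on.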
This is going to be a complex problem.
How do you define the categories? Get as many tweets and FB posts as you can and tag them all with the correct categories to get some ground truth data.
Then you can identify which words/phrases are best for identifying a particular category, using e.g. PCA.
Look into scikit-learn; it has tutorials for text processing and classification.

Preprocess large datafile with categorical and continuous features

First, thanks for reading, and thanks a lot if you can give me any clue to help solve this.
As I'm new to Scikit-learn, don't hesitate to provide any advice that can help me to improve the process and make it more professional.
My goal is to classify data between two categories. I would like to find a solution that would give me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values : 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for training, and I test on 100K lines.
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used with several classifiers.
I tried several things:
LabelEncoder : this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
OneHotEncoder : if I understand correctly, it is nearly perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so everything has to be LabelEncoded first.
StandardScaler : this is quite useful but not for what I need. I decided to integrate it to scale my continuous values.
FeatureHasher : first I didn't understand what it does. Then, I saw that it was mainly used for Text analysis. I tried to use it for my problem. I cheated by creating a new array containing the result of the transformation. I think it was not built to work that way and it was not even logical.
DictVectorizer : could be useful but looks like OneHotEncoder and put even more data in memory.
partial_fit : this method is offered by only 5 classifiers. I would like to be able to use it with Perceptron, KNearest and RandomForest at least, so it doesn't match my needs.
I looked through the documentation and found this information on the Preprocessing and Feature Extraction pages.
I would like to have a way to encode all the nominal values so that they will not be considered as ordered. This solution can be applied on large datasets with a lot of categories and weak resources.
Is there any way I didn't explore that can fit my needs?
Thanks for any clue and piece of advice.
To convert unordered categorical features, you can try get_dummies in pandas; see its documentation for more details. Another way is to use catboost, which can handle categorical features directly without transforming them into a numerical type.
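A minimal get_dummies sketch follows, using made-up column names and a few invented rows loosely modelled on the sample line in the question. Passing sparse=True is an assumption about what may help with the memory errors mentioned there; it stores the indicator columns as sparse arrays.

```python
import pandas as pd

# Invented rows; only the two nominal columns are encoded
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT"],
    "fuel": ["Diesel", "Essence", "Diesel"],
    "year": [2010, 2008, 2012],
})

# One indicator column per category, with no ordering implied between them
encoded = pd.get_dummies(df, columns=["brand", "fuel"], sparse=True)
print(list(encoded.columns))
```

Categories that appear only in the test data will not get a column this way, so the training and test frames may need to be encoded together or reindexed to the same columns.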
