Detecting duplicates in text files - python

I am trying to find the best way to detect/remove duplicates in text data. By duplicates I mean texts with a really high similarity, for example texts that are identical except for one sentence. Furthermore, the length can vary (by one or two sentences more or less), so Hamming distance is not an option. Is there a way to compute a similarity factor? Should I use term frequency matrices?
About my data: I have it in a JSON file with date, title and body (content). Therefore the similarity coefficient could take these three fields into account.
Since I am looking for the approach (not the code), I do not think presenting the data is necessary.
Kind regards,

You can use the tf-idf ranking method. Look here for more details: Similarity between two text documents
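As a minimal sketch of that idea, using scikit-learn's TfidfVectorizer and cosine similarity (the sample texts and the 0.9 threshold are placeholders to tune on your own data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents; in practice, load the "body" field from your JSON file.
docs = [
    "First report about the event. Extra closing sentence.",
    "First report about the event.",
    "A completely different article.",
]

# Build a TF-IDF matrix and compute pairwise cosine similarities.
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)

# Flag pairs above an arbitrary threshold as near-duplicates.
THRESHOLD = 0.9
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] >= THRESHOLD:
            print(f"docs {i} and {j} look like near-duplicates (sim={sim[i, j]:.2f})")
```

The same score could be computed separately for title and body and combined with weights, which is one way to use the three fields mentioned in the question.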

Related

Construct a suffix tree of a concatenation of a million words and query it with a test set to find the closest match and classify

The problem I'm trying to solve: I have a million words (in multiple languages) and the classes they fall into as my training corpus. Given a testing corpus of words (which is bound to increase in number over time), I want to get the closest match for each of those words in the training corpus and classify that word with the class of its closest match.
My Solution: Initially I did this by brute force, which doesn't scale. Now I'm thinking of building a suffix tree over the concatenation of the training corpus (O(n)) and querying it with the testing corpus (query time independent of the training-corpus size). I'm trying to do this in Python.
I'm looking for tools or packages that will get me started, or for other, more efficient ways to solve the problem at hand. Thanks in advance.
Edit 1: As for how I am finding the closest match, I was thinking of a combination of exact-match alignment (from the suffix tree) and then, for the part of the input string that is left over, a local alignment with an affine gap penalty.
What distance metric are you using for the closest match?
There are papers that cover how to do an edit-distance search using a suffix tree. For each suffix there is an extension of the edit matrix, and these can be ordered so as to let one do a ranked search of the suffix tree, finding the matching items in order of increasing distance.
An example is Top-k String Similarity Search with Edit-Distance Constraints (2013): https://doi.org/10.1109/ICDE.2013.6544886 (Google Scholar: https://scholar.google.com/scholar?cluster=13387662751776693983)
The solution presented avoids computing all the entries in the table as columns are added.
In your problem it seems that each word has classes that apply to it. If the classes don't depend on context, then the above would work and a word-to-class map would be all that is needed. But if they do depend on context, then this seems closer to part-of-speech tagging.
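For the context-independent case, here is a minimal sketch of the word-to-class-map idea. It uses difflib from the standard library as a stand-in for the ranked suffix-tree/edit-distance search described above, and the training words and classes are made up for illustration:

```python
import difflib

# Hypothetical training data: word -> class label.
word_to_class = {
    "bonjour": "greeting",
    "merci": "thanks",
    "hello": "greeting",
    "thanks": "thanks",
}

def classify(word, cutoff=0.6):
    """Classify a test word by the class of its closest training word."""
    matches = difflib.get_close_matches(word, list(word_to_class), n=1, cutoff=cutoff)
    return word_to_class[matches[0]] if matches else None

print(classify("helo"))   # expected: "greeting"
print(classify("mercy"))  # expected: "thanks"
```

This is the brute-force flavour the question wants to move away from, so it only illustrates the mapping step, not the scalable search.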

Consolidating and comparing the text per document

I just started learning how NLP works. What I can do right now is get the frequency of a specific word per document. But what I'm trying to do is compare the four documents I have, finding their similarities and differences, as well as displaying the words that are shared and the words that are unique to each document.
My documents are in .csv format, imported using pandas, and each row has its own sentiment.
To be honest, the question you're asking is very high level and difficult (maybe impossible) to answer on a forum like this. So here are some ideas that might be helpful:
You could try to use [term frequency–inverse document frequency (TFIDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to compare the vocabularies for similarities and differences. This is not a large step from your current word-frequency analysis.
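A minimal sketch of that TF-IDF step with scikit-learn (the document contents below are placeholders; in practice you would read your four .csv files with pandas and join their rows into one string each):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents standing in for the four .csv files.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
    "doc3": "cats and dogs are pets",
    "doc4": "stock markets fell sharply today",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs.values())
terms = np.array(vectorizer.get_feature_names_out())

# Words with the highest TF-IDF weight are the most distinctive terms per document.
for name, row in zip(docs, matrix.toarray()):
    top = terms[row.argsort()[::-1][:3]]
    print(name, "distinctive words:", ", ".join(top))
```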
For a more detailed analysis, it might be a good idea to substitute the words of your documents with something like WordNet's synsets. This makes it possible to compare sentence meanings at a higher level of abstraction than the actual words themselves. For example, if each of your documents mentions "planes", "trains", and "automobiles", there is an underlying similarity (vehicle references) that a simple word comparison will not be able to detect.
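For the synset idea, a minimal sketch with NLTK's WordNet interface (requires the WordNet data, e.g. via nltk.download("wordnet"); the two words are just the example from above):

```python
from nltk.corpus import wordnet as wn

# Compare the first noun synsets of two surface words; a shared hypernym
# captures the "both refer to vehicles" kind of similarity mentioned above.
plane = wn.synsets("plane", pos=wn.NOUN)[0]
car = wn.synsets("automobile", pos=wn.NOUN)[0]

print(plane.name(), "vs", car.name())
print("lowest common hypernym:", plane.lowest_common_hypernyms(car))
print("path similarity:", plane.path_similarity(car))
```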

Creating vector space

I've got a question:
I have a lot of documents, and each line is built from some pattern.
Of course, I have this array of patterns.
I want to create a vector space, then vectorize these patterns by some rule (I have no idea yet what this rule should be), i.e. make these patterns the "centroids" of my vector space.
Then I want to vectorize each line of the current document (again by this rule) and find the centroid closest to that line (i.e. the minimum distance between the two vectors).
I don't know how to do this.
I know about the sklearn vectorizers (CountVectorizer/TfidfVectorizer/HashingVectorizer), but these depend on the vocabulary size. Since I have a lot of documents, the vocabulary would become very large, and a new document might contain words the vocabulary doesn't have, so this seems like the wrong way to solve my problem.
The Keras text-preprocessing utilities won't solve my problem either. E.g. one_hot encodes a text into a list of word indexes of size n. But each document may have a different size and, of course, a different word order, so comparing two vectors may give a large distance even when the vectors (the words they encode) are in fact very similar.
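A minimal sketch of the centroid-matching step described above. The choice of HashingVectorizer here is only for illustration (its output dimension is fixed by n_features rather than by a learned vocabulary), and the pattern strings are made up:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical patterns that act as the "centroids" of the space.
patterns = [
    "ERROR connection to database lost",
    "INFO user logged in",
    "WARNING disk space low",
]

# Fixed-size hashed bag-of-words space: unseen words in new documents
# still map into the same 1024-dimensional space.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
centroids = vec.transform(patterns)

def closest_pattern(line):
    """Return the index of the pattern whose vector is nearest to this line."""
    distances = cosine_distances(vec.transform([line]), centroids)[0]
    return distances.argmin()

print(closest_pattern("ERROR connection to db lost after timeout"))  # likely 0
```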

Restrict gensim similarity calculations to a subset of a corpus

I am looking to calculate the similarity between documents using gensim in Python.
I want a way to restrict the calculations to only a subset of the corpus. Specifically, my documents have an associated year, and I want a way of computing similarities only between the search document and other documents that have the same value for that variable.
I cannot see any instructions, e.g. at http://radimrehurek.com/gensim/simserver.html, on how to associate additional variables with each document and in turn restrict the similarities to only those documents; indeed, what I am trying to do may not be feasible. My question is thus: is this possible, or is the only way to achieve it to use multiple corpora?
You could work around it by simply ignoring results that are not for your target year:
1. Create a document2year_dict (document -> year) for your documents.
2. Get the list of documents in distance order from target_document.
3. Iterate through the list and discard documents where document2year_dict[current_document] != target_year.
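A rough sketch of that workaround with gensim (the toy corpus, the years list and the tokenised query are placeholders; in practice you would build the dictionary, TF-IDF model and index from your own documents):

```python
from gensim import corpora, models, similarities

# Toy corpus; each document has an associated year (placeholder data).
texts = [["economy", "growth"], ["football", "match"], ["economy", "crisis"]]
years = [2012, 2012, 2013]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

def similar_same_year(query_tokens, target_year, topn=10):
    """Rank every document by similarity, then keep only the target year."""
    query = tfidf[dictionary.doc2bow(query_tokens)]
    ranked = sorted(enumerate(index[query]), key=lambda item: -item[1])
    return [(doc_id, score) for doc_id, score in ranked
            if years[doc_id] == target_year][:topn]

print(similar_same_year(["economy"], 2012))
```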

Method to do Feature Agglomeration/summation?

I.e., combining the least frequent or least informative bigram frequency counts together.
E.g., if I have frequency counts of letter pairs for a sequence, what's a good way to merge similar features together? (For example, merging "KR" and "RK" into a single feature, or combining all the pairs with a count of 0 together.)
I know scikit-learn has something called Ward agglomerative clustering, but that seems aimed at visual data/pixels, and I'm interested in text data (protein sequences and bioinformatics). I'd rather avoid clustering if there's a more direct method for combining the features. (I lack the background, haven't done clustering before, and analysis of the features is important to us.)
Thanks!
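If direct merging is all that's needed, it can be done with plain column arithmetic. A minimal sketch assuming the bigram counts sit in a pandas DataFrame (the column names and counts are made up):

```python
import pandas as pd

# Hypothetical bigram counts per sequence (columns are letter pairs).
counts = pd.DataFrame(
    {"KR": [3, 0, 1], "RK": [1, 2, 0], "AA": [0, 0, 0], "GG": [0, 0, 0]},
    index=["seq1", "seq2", "seq3"],
)

# Merge symmetric pairs such as "KR"/"RK" into a single feature by summing.
counts["KR+RK"] = counts["KR"] + counts["RK"]
counts = counts.drop(columns=["KR", "RK"])

# Pool every all-zero (uninformative) column into one catch-all feature.
zero_cols = [c for c in counts.columns if (counts[c] == 0).all()]
counts["rare_or_zero"] = counts[zero_cols].sum(axis=1)
counts = counts.drop(columns=zero_cols)

print(counts)
```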
