reformat text documents for easier comparison - python

For those who want to skip the reasoning behind the question, jump to the TL;DR.
Hi, I'm currently reading a lot of companies' annual financial reports. While the first one is the most interesting, the documents that follow are often largely the same. So obviously I'm more interested in the differences between them. The documents come as PDFs, which are hard to compare, so I thought it would be nice to extract them as plain text and compare them with a diff tool. That's what I did: I piped the following two PDFs through pdftotext with the parameters below:
annual report for 2018
annual report for 2019
pdftotext -enc UTF-8 -nopgbrk -eol mac
I then realized that compare tools seem to have problems with line breaks. If I have the exact same sentences but with different line breaks in the two documents, that is shown as a difference. Bullet points in the PDFs are also transformed into different symbols in the text files, which leads to further differences. So I turned to NLP and thought I might get some help there.
TL;DR
I just want to reformat the two snippets below in a defined way so that I no longer get diffs in a diff tool, e.g. lines that are at most 80 characters long and some normalized/canonical way of printing bullet points and the like.
I'm currently using spaCy, and here is an example of two text snippets that are essentially the same but lead to a lot of diffs in diff tools. How can I reprint both snippets to a text document so that the line breaks end up the same? Is there even a method to detect that two sentences are exactly the same except that one of them contains one additional word? I would like to reformat that case as well without shifting every following line break by one word.
import spacy
nlp = spacy.load("en_core_web_sm")
SE_2018_10k_string = '''x
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique
account through which payments are made in more than one online game or in more than one market is counted as more than one paying user.
“QPUs” refers to the aggregate number of paying users during the quarterly period;
x'''
doc1 = nlp(SE_2018_10k_string)
print('SE_2018_10k_string')
for token in doc1:
    print(token.text)
SE_2019_10k_string = '''●
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique account
through which payments are made in more than one online game or in more than one market is counted as more than one paying user. “QPUs” refers to
the aggregate number of paying users during the quarterly period;
●'''
doc2 = nlp(SE_2019_10k_string)
print('SE_2019_10k_string')
for token in doc2:
    print(token.text)
print(doc1.similarity(doc2))

There is no universal way to get rid of the problems you are seeing.
If you find that you have line breaks in different places but your texts are otherwise the same, you can normalize things by removing line breaks. If only spaces differ, you can remove them, or collapse any run of spaces into a single space. If bullets are an issue, you can remove them or convert them to a single canonical character (but how do you tell whether something is a bullet in code? There is no standard way).
Appropriate normalization depends on your data, and for OCR it's typically going to just be hard.
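As a minimal sketch of that kind of normalization, reusing the snippet variables from the question: the bullet set below is illustrative (it covers the ● from the 2019 snippet, but the literal x used as a bullet in the 2018 snippet cannot be distinguished from an ordinary letter this way).
import re
import textwrap

BULLET_GLYPHS = {"●", "•", "◦", "▪"}

def normalize(text, width=80):
    # map known bullet glyphs to one canonical character
    text = "".join("-" if ch in BULLET_GLYPHS else ch for ch in text)
    # collapse every run of whitespace (including line breaks) to a single space
    text = re.sub(r"\s+", " ", text).strip()
    # re-wrap deterministically at the chosen width
    return "\n".join(textwrap.wrap(text, width=width))

print(normalize(SE_2018_10k_string))
print(normalize(SE_2019_10k_string))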
Is there even a method to detect that two sentences are exactly the same except that one of them contains one additional word?
You can use edit distance metrics like Levenshtein distance to find this. It won't help you with existing diff tools though, since they show any difference.
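If you only need a similarity score rather than a visual diff, the standard library's difflib (a Ratcliff/Obershelp-style matcher rather than true Levenshtein) gives a quick word-level approximation; a sketch reusing the two snippets from the question:
import difflib

words_2018 = SE_2018_10k_string.split()
words_2019 = SE_2019_10k_string.split()
# should be close to 1.0, since only the bullet characters differ
print(difflib.SequenceMatcher(None, words_2018, words_2019).ratio())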

Related

Use word2vec to expand a glossary in order to classify texts

I have a database containing about 3 million texts (tweets). I put the cleaned texts (after removing stop words, tags, ...) into a list of lists of tokens called sentences (so it contains a list of tokens for each text).
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
I also have a list of words (belonging to the same topic, in this case economics) called terms. I found that 7% of the texts contain at least one of these words (so we can say that 7% of all tweets talk about economics).
My goal is to expand the list terms in order to retrieve more texts belonging to the economic topic.
Then I use
results = model.most_similar(terms, topn=5000)
to find, within the list of lists of tokens sentences, the words most similar to those contained in terms.
Finally if I create the data frame
df = pd.DataFrame(results, columns=['key', 'similarity'])
I get something like this:
key similarity
word1 0.795432
word2 0.787954
word3 0.778942
... ...
Now I think I have two possibilities to define the expanded glossary:
I take the first N words (what should be the value of N?);
I look at the suggested words one by one and decide which one to include in the expanded glossary based on my knowledge (does this word really belong to the economic glossary?)
How should I proceed in a case like this?
There's no general answer for what the cutoff should be, or how much you should use your own manual judgement versus cruder (but fast/automatic) processes. Those are inherently decisions which will be heavily influenced by your data, model quality, & goals – so you have to try different approaches & see what works there.
If you had a goal for what percentage of the original corpus you want to take – say, 14% instead of 7% – you could go as deeply into the ranked candidate list of 'similar words' as necessary to hit that 14% target.
Note that when you retrieve model.most_similar(terms), you are asking the model to 1st average all words in terms together, then return words close to that one average point. To the extent your seed set of terms clusters tightly around the idea of economics, that might find words close to that generic average idea – but might not find other interesting words, such as close synonyms of your seed words that you just hadn't thought of. For that, you might want to get not 5000 neighbors for one generic average point, but (say) 3 neighbors for every individual term. To the extent the 'shape' of the topic isn't a perfect sphere around someplace in the word-vector-space, but rather some lumpy complex volume, that might better reflect your intent.
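A rough sketch of that per-term variant, reusing the model and terms names from the question (in gensim 4.x the word vectors live on model.wv; older versions also accepted model.most_similar directly):
per_term_candidates = {}
for term in terms:
    if term in model.wv:                      # skip seed words missing from the vocabulary
        per_term_candidates[term] = model.wv.most_similar(term, topn=3)

for term, neighbours in per_term_candidates.items():
    print(term, neighbours)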
Instead of using your judgement of the candidate words standing alone to decide whether a word is economics-related, you could instead look at the texts that a word uniquely brings in. That is, for new word X, look at the N texts that contain that word. How many, when applying your full judgement to their full text, deserve to be in your 'economics' subset? Only if it's above some threshold T would you want to move X into your glossary.
But such an exercise may just highlight: using a simple glossary – "for any of these hand-picked N words, every text mentioning at least 1 word is in" – is a fairly crude way of assessing a text's topic. There are other ways to approach the goal of "pick a relevant subset" in an automated way.
For example, you could view your task as that of training a text binary classifier to classify texts as 'economics' or 'not-economics'.
In such a case, you'd start with some training data - a set of example documents that are already labeled 'economics' or 'not-economics', perhaps via individual manual review, or perhaps via some crude bootstrapping (like labeling all texts with some set of glossary words as 'economics', & all others 'not-economics'). Then you'd draw from the full range of potential text-preprocessing, text-feature-extraction, & classification options to train & evaluate classifiers that make that judgement for you. Then you'd evaluate/tune those – a process which might also improve your training data, as you add new definitively 'economics' or 'not-economics' texts – & eventually settle on one that works well.
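A minimal sketch of that bootstrapped setup, assuming the raw tweet strings are in a list called texts and that terms is the hand-picked glossary from the question; the TF-IDF plus logistic-regression pipeline is just one reasonable starting point, not the only option:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# crude bootstrap labels: 1 if a text mentions any glossary word, else 0
labels = [int(any(t in text.lower() for t in terms)) for text in texts]

clf = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# rank all texts by the predicted probability of being 'economics'
econ_probability = clf.predict_proba(texts)[:, 1]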
Alternatively, you could use some other richer topic-modeling methods (LDA, word2vec-derived Doc2Vec, deeper neural models etc) for modeling the whole dataset, then from some seed-set of definite-'economics' texts, expand outward from them – finding nearest-examples to known-good documents, either auto-including them or hand-reviewing them.
Separately: min_count=1 is almost always a mistake in word2vec & related algorithms, which do better if you discard words so rare they lack the variety of multiple usage examples the algorithm needs to generate good word-vectors.

Split text into logical blocks

I have an array of (insurance) contracts (in .docx format) whose processing I'm trying to automate.
The current task at hand is to split every contract into so-called clauses - parts of the contract which describe some specific risk or exclusion from cover.
For example, it can be just one sentence – “This contract covers loss or damage due to fire” or several paragraphs of text that give more details and explain what type of fire this contract covers and what damage is reimbursed.
The good thing is that the contracts are usually formatted in some way or another. In the best possible scenario, the whole contract is a numbered list with items and sub-items, and we can simply split it at a certain level of the list hierarchy.
The bad thing is that this is not always the case: the list can be alphabetical rather than numbered, or not a list at all in Word terms, with each line starting with a number or letter the user typed in manually. Or it can be neither letters nor numbers but some amount of spaces or tabs. Or the clauses can be separated by titles typed in ALL CAPS.
So the visual representation of structure varies from contract to contract.
So my question is what is the best approach to this task? Regexp? Some ML algo? Maybe there are open source scripts out there that were written to deal with this or similar tasks? Any help will be most welcome!
EDIT (24.12.2019):
Found this repo on github: https://github.com/bmmidei/SliceCast
From its description: "This repository explores a neural network approach to segment podcasts based on topic of discussion. We model the problem as a binary classification task where each sentence is either labeled as the first sentence of a new segment or a continuation of the current segment. We embed sentences using the Universal Sentence Encoder and use an LSTM-based classification network to obtain the cutoff probabilities. Our results indicate that neural network models are indeed suitable for topical segmentation on long, conversational texts, but larger datasets are needed for a truly viable product.
Read the full report for this work here: Neural Text Segmentation on Podcast Transcripts"
The best course of action for this task is to improve the semantic information found in the document using annotations that rely on word styles. For instance:
add a block style for contracts
add a paragraph style for the title of the contract
add a paragraph style for the features of the contract
You can drill down to the inline level and add inline styles that allow you to extract more granular information, like a keyword inline style.
Then you can process the .docx file using a Python library, or convert it with LibreOffice to another format and process that.
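For the processing step, a small sketch with the python-docx package; the style names 'Contract Title' and 'Clause' are hypothetical and would need to match whatever styles you define in Word:
from docx import Document

doc = Document("contract.docx")              # path is illustrative
for paragraph in doc.paragraphs:
    style = paragraph.style.name
    if style == "Contract Title":
        print("TITLE:", paragraph.text)
    elif style == "Clause":
        print("CLAUSE:", paragraph.text)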
That is a classic annotation task for text documents. It is much easier and less costly to set up than alternatives like building a specific (web) app to input the different features you need.

How to generate homophones on substring level?

I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities; many others are far less well known).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.
How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what the word sounds like. This is a very hard problem that requires guessing. (E.g. what sound does "read" make? It depends on whether you are going to read, or you already read!) However, text to phonemes converter suggests that ARPAbet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms that are used for spelling autocorrect, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
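As a rough sketch of the dictionary-plus-edit-distance part, using NLTK's copy of the CMU Pronouncing Dictionary (ARPAbet phonemes); note it only covers in-dictionary words, so out-of-vocabulary names like those in the question still need the grapheme-to-phoneme step above:
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

prons = cmudict.dict()                        # word -> list of phoneme sequences

def phoneme_distance(a, b):
    # plain Levenshtein distance over two phoneme sequences
    dp = [[i + j for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[-1][-1]

# words whose first listed pronunciation is within one phoneme of "google"
if "google" in prons:
    target = prons["google"][0]
    close = [w for w, ps in prons.items() if phoneme_distance(target, ps[0]) <= 1]
    print(close[:20])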
And that is it.

Automatic tagging of words or phrases

I want to automatically tag a word/phrase with one of the defined words/phrases from a list. My list contains about 230 words in column A which are tagged in column B. There are around 16 unique tags, and each of those 230 words is tagged with one of these 16 tags.
Have a look at my list:
The words/phrases in column A are tagged as words/phrases in column B.
From time to time, new words are added for which a tag has to be assigned manually.
I want to build a predictive algorithm/model to tag new words automatically (or suggest a tag). So if I enter a new word, let's say 'MIP Reserve' (A36), it should predict the tag as 'Escrow Deposits' (B36) and not 'Operating Reserve' (B33). How can I predict the tag of a new word precisely even if its words do not match the words already under that tag?
If someone is willing to see the full list, I can happily share.
Short version
I think your question is a little ill-defined and doesn't have a short coding or macro answer. Given that each item contains so little information, I don't think it is possible to build a good predictive model from your source data. Instead, do the tagging exercise once and look at how you control tagging in the future.
Long version
Here are the steps I would take to create a predictive model and why I don't think you can do this.
Understand why you want to have a predictive program at all
Why do you need a predictive program? Are you sorting through hundreds or thousands of records, all of which are changing and need tagging? If so, I agree, you wouldn't want to do this manually.
If this is a one-off exercise, because over time the tags have become corrupted from their original meaning, your problem is that your tags have become corrupted, not that you need to somehow predict where each item should be tagged. You should be looking at controlling use of the tags, not at predicting how people in the future might mistag or misname something.
Don't forget that there are lots of tools in Excel to make the problem easier. Let's say you know for certain that all items with 'cash' definitely go to 'Operating Cash'. Put an AutoFilter on the list and filter on the word 'cash' - now just copy and paste 'Operating Cash' next to all of these. This way, you can quickly get rid of the obvious ones from your list and focus on the tricky ones.
Understand the characteristics of the tags you want to use.
Take time to look at the tags you are using - what do each of them mean? What are the unique features or combinations of features that this tag is representing?
For example, your tag 'Operating Cash' carries the characteristics of being cash (i.e. not tied up so available for use fairly quickly) and as being earmarked for operations. From these, we could possibly derive further characteristics that it is held in a certain place, or a certain person has responsibility for it.
If you had more source data to go on, you could perhaps use fields such as 'year created', or 'customer' to help you categorise further.
Understand what it is about the items you want to tag that could give you an idea of where they should go.
This is your biggest problem. A quick example - what in the string "MIP Reserve" gives any clues that it should be linked to "Escrow Deposits"? You have no easy way of matching many of the items in your list - many words appear in multiple items across multiple tags.
However, try and look for unique identifiers that will give you clues - for example, all items with the word 'developer' seem to be tagged to 'Developer Fee Note & Interest'. Do you have any more of those? Use these to reduce your problem, since they should be a straightforward mapping.
Any unique identifiers will allow you to set up rules for these strings. You don't even need to stick to one word - perhaps when you see several words, you can narrow down where it will end up e.g. when I see 'egg' this could go into 'bird' or 'reptile', but if 'egg' is paired with 'wing', I can be fairly confident it's 'bird'.
You need to match the characteristics of the items you want to tag with the unique identifiers of the tags you developed in step 1.
Write a program or macro to look for the identifiers in step 2 and return the relevant tag from step 1.
This is the straightforward bit. Look for the identifiers you want (e.g. uses 'cash', contains tag 'Really Important Customer') and look for the best match in the tags you have earlier.
Ensure you catch any errors - what happens if no tag is found? Does it create a new one? Does it recommend contacting you for help? What happens if more than one tag is relevant? What are your tiebreaking criteria?
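A toy sketch of such a program in Python, assuming column A has been exported as plain strings; the identifier-to-tag rules below just restate the examples used in this answer ('developer', 'cash', and the egg/wing case), not the actual list:
RULES = [
    ({"developer"}, "Developer Fee Note & Interest"),
    ({"cash"}, "Operating Cash"),
    ({"egg", "wing"}, "bird"),
]

def suggest_tag(item_name):
    words = set(item_name.lower().split())
    matches = [tag for identifiers, tag in RULES if identifiers <= words]
    if not matches:
        return "NO MATCH - tag manually"      # error case: no identifier found
    if len(matches) > 1:
        return "AMBIGUOUS - tag manually"     # tiebreaking criteria would go here
    return matches[0]

print(suggest_tag("Developer fee note 2019"))  # -> Developer Fee Note & Interest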
But be aware of...
Understand how you will control use of these unique identifiers.
Imagine you somehow manage to come up with a list of unique identifiers. How will you control their use? If you have decided to send any item with the word 'cash' to the tag 'Operating Cash' and then in a year, someone comes along and makes an item 'Capital Cash', because they want somewhere to put cash that is about to be spent on capital items, how do you stop this? How are you going to control use of these words?
You will effectively need to take control of the item naming system and set up an agreed list of identifying words. Whenever anyone makes an item, they need to include your identifiers somewhere. I can tell you that this will not work. Either they will use the wrong words and you will end up manually doing it anyway, or they will ring you up confused and you will end up manually doing it anyway.
If you are the only person doing this, just do the exercise once, to your own standard (that you record) and stick to that standard. When you need to hand it over, it's clearly ordered and makes sense. If more than one person is doing this, do the exercise once between you and the team and then agree a way of controlling it.
Writing a predictive program sounds great and might save you some time. But consider why you are writing it. Are you likely to need to tag accounts constantly in the future? If so, control their naming centrally and make it so a tag is mandatory when they are made. If not, why are you writing a program to do this? Just do it once, manually.

Measuring wealth of information on text using NLP

Is there any metric that measures the wealth of information in a text?
I am thinking in terms of anything that can reliably show unique information segments within a text. Simple metrics using frequency distributions or unique words are okay, but they don't quite show unique information in sentences.
Using coding methods I would have to manually code each sentence/word or anything that would count as a unique piece of information in a text, but that could take a while. So I wonder if I could use NLP as an alternative.
UPDATE
As an example:
Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.
If I were to use coding analysis, I can count 4 unique information points: What is Navtilos, where is it, how it was created and when.
Obviously a human interprets text differently from a computer. I just wonder if there is a measure that can identify unique information within sentences/texts. It does not have to produce the same result as mine, but it should be reliable across different sentences.
A frequency distribution may work effectively but I wonder if there are other metrics for this.
What you seem to be looking for is a keyword/term extractor (for a list of keyword extractors see, for example, this, "External Links"). An extractor will extract phrases consisting of one or more words that capture some notions mentioned in the text, but without classifying them into classes (as named entity recognisers would do).
See, for example, this demo. From the sentence in your example, it extracts:
small volcanic islet
Navtilos
Santorini
If you have lots of documents, you can then use the frequency distribution of each keyword across documents to measure how specific it is to each document (assuming that uniqueness of a keyword to a document reflects how well it describes the contents of the document). For this, you can use a measure like tf-idf.
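A small sketch of the tf-idf step with scikit-learn, assuming the documents are plain Python strings (the second example document is made up, and get_feature_names_out needs scikit-learn 1.0 or newer):
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.",
    "Another document about an entirely different subject.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# keywords with the highest tf-idf weight in the first document
terms = vectorizer.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda pair: -pair[1])[:5])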
