Split text into logical blocks - python

I have a collection of insurance contracts (in .docx format) whose processing I'm trying to automate.
The current task is to split every contract into so-called clauses: parts of the contract that describe a specific risk or an exclusion from cover.
For example, a clause can be just one sentence – “This contract covers loss or damage due to fire” – or several paragraphs that give more detail and explain what type of fire the contract covers and what damage is reimbursed.
The good thing is that contracts are usually formatted in some way or another. In the best case, the whole contract is a numbered list with items and sub-items, and we can simply split it at a certain level of the list hierarchy.
The bad thing is that this is not always the case: the list may be lettered rather than numbered, or it may not be a list at all in Word terms, with each line starting with a number or letter the user typed manually. Or there may be no letters or numbers at all, just some amount of spaces or tabs. Or the clauses may be separated by titles typed in ALL CAPS.
So the visual representation of structure varies from contract to contract.
So my question is: what is the best approach to this task? Regular expressions? Some ML algorithm? Maybe there are open-source scripts out there that were written to deal with this or similar tasks? Any help will be most welcome!
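To illustrate the crudest case I have in mind, here is a sketch (not code I actually use, the pattern is only a guess) of splitting plain contract text at lines that look like manually numbered items or ALL-CAPS clause titles:

import re

# Sketch only: treat a line as the start of a new clause if it begins with
# manual numbering ("1.", "1.2", "a)") or consists of an ALL-CAPS title.
CLAUSE_START = re.compile(
    r"^(?:\d+(?:\.\d+)*[.)]?\s+"      # 1.  /  1.2  /  3)
    r"|[a-z][.)]\s+"                  # a)  /  b.
    r"|[A-Z][A-Z \-]{3,}$)",          # GENERAL EXCLUSIONS
    re.MULTILINE,
)

def split_into_clauses(text):
    starts = [m.start() for m in CLAUSE_START.finditer(text)]
    if not starts:
        return [text.strip()]
    starts.append(len(text))
    return [text[a:b].strip() for a, b in zip(starts, starts[1:])]

But as said above, the visual structure varies a lot from contract to contract, so I doubt a single pattern like this will hold up.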
EDIT (24.12.2019):
Found this repo on github: https://github.com/bmmidei/SliceCast
From its description: "This repository explores a neural network approach to segment podcasts based on topic of discussion. We model the problem as a binary classification task where each sentence is either labeled as the first sentence of a new segment or a continuation of the current segment. We embed sentences using the Universal Sentence Encoder and use an LSTM-based classification network to obtain the cutoff probabilities. Our results indicate that neural network models are indeed suitable for topical segmentation on long, conversational texts, but larger datasets are needed for a truly viable product.
Read the full report for this work here: Neural Text Segmentation on Podcast Transcripts"

The best course of action for this task is to enrich the semantic information in the document using annotations that rely on Word styles. For instance:
add a block style for contracts
add a paragraph style for the title of a contract
add a paragraph style for the features of a contract
You can drill down to the inline level and add inline styles that allow you to extract more granular information, such as a keyword inline style.
Then you can process the .docx file with a Python library, or convert it with LibreOffice first and process the result.
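As a sketch of the processing side, assuming the python-docx package and the style names suggested above (the style names are placeholders; they must match whatever you define in your Word template):

from docx import Document  # python-docx

# Placeholder style names - adapt to the styles defined in your template
TITLE_STYLE = "Contract Title"
FEATURE_STYLE = "Contract Feature"

doc = Document("contract.docx")

titles, features = [], []
for para in doc.paragraphs:
    style = para.style.name if para.style is not None else ""
    if style == TITLE_STYLE:
        titles.append(para.text)
    elif style == FEATURE_STYLE:
        features.append(para.text)

print(titles)
print(features)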
This is a classic annotation task for text documents. It is much easier and less costly to set up than alternatives like building a specific (web) app to input the different features you need.


reformat text documents for easier comparison

For those who want to skip the reasoning behind the question, jump to the TL;DR.
Hi, I'm currently reading a lot of companies' financial annual reports. While the first one is the most interesting, the documents that come after it are often the same in many regards, so obviously I'm more interested in the differences between them. The documents come as PDFs, which are hard to compare, so I thought it would be nice to get them as plain text and compare them with a diff tool. That's what I did: I piped the following two PDFs through pdftotext with the parameters below:
annual report for 2018
annual report for 2019
pdftotext -enc UTF-8 -nopgbrk -eol mac
I then realized that diff tools seem to have problems with line breaks: if both documents contain the exact same sentences but with line breaks in different places, that is shown as a difference. Bullet points in the PDFs are also transformed into different symbols in the text files, which leads to differences as well. So I turned to NLP and thought I might get some help there.
TL;DR
I just want to reformat the two snippets below in a defined way so that I no longer get diffs in a diff tool, e.g. lines are at most 80 characters long, and bullet points are printed in some normalized/canonical form.
I'm currently using spaCy, and here is an example of two text snippets that are essentially the same but lead to a lot of diffs in diff tools. How can I reprint both snippets to a text document so that the line breaks are the same? Is there even a method to find things like two sentences that are exactly the same except that one contains one additional word? I would like to reformat that as well without the extra word shifting the line breaks.
import spacy

# en_core_web_sm ships without static word vectors, so doc.similarity() only
# gives a rough estimate here; a model with vectors (e.g. en_core_web_md)
# would be more meaningful
nlp = spacy.load("en_core_web_sm")

SE_2018_10k_string = '''x
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique
account through which payments are made in more than one online game or in more than one market is counted as more than one paying user.
“QPUs” refers to the aggregate number of paying users during the quarterly period;
x'''
doc1 = nlp(SE_2018_10k_string)
print('SE_2018_10k_string')
for token in doc1:
    print(token.text)

SE_2019_10k_string = '''●
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique account
through which payments are made in more than one online game or in more than one market is counted as more than one paying user. “QPUs” refers to
the aggregate number of paying users during the quarterly period;
●'''
doc2 = nlp(SE_2019_10k_string)
print('SE_2019_10k_string')
for token in doc2:
    print(token.text)

print(doc1.similarity(doc2))
There is no universal way to get rid of the problems you are seeing.
If you find that you have line breaks in different places but your texts are otherwise the same, you can normalize things by removing line breaks. If you find only spaces are different, you can remove spaces, or convert any run of spaces to a single space. If bullets are an issue you can remove them or convert them to a single type of character (but how do you tell if something is a bullet in code? there is no standard way).
Appropriate normalization depends on your data, and for OCR it's typically going to just be hard.
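A minimal sketch of that kind of normalization, applied to the snippets from the question (the set of bullet characters is just a guess for this particular data; there is no general rule):

import re

BULLET_CHARS = "●•▪x"  # assumed bullet glyphs seen in these pdftotext outputs

def normalize(text):
    # map an assumed bullet character at the start of a line to a canonical "- "
    text = re.sub(rf"^[{BULLET_CHARS}]\s*", "- ", text, flags=re.MULTILINE)
    # join line breaks and collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# for the two snippets above this should now print True
print(normalize(SE_2018_10k_string) == normalize(SE_2019_10k_string))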
Is there even a method to find things like two sentences that are exactly the same except that one contains one additional word?
You can use edit distance metrics like Levenshtein distance to find this. It won't help you with existing diff tools though, since they show any difference.
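For example, with the standard library's difflib (a sketch; a dedicated package such as python-Levenshtein gives a true edit distance, while SequenceMatcher's ratio is a related similarity score):

import difflib

# word-level comparison of two nearly identical sentences
a = "A unique account through which payments are made is counted once.".split()
b = "A unique account through which payments are made is always counted once.".split()

matcher = difflib.SequenceMatcher(None, a, b)
print(round(matcher.ratio(), 3))  # close to 1.0: only one extra word

# show exactly where the two sentences diverge
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, a[i1:i2], "->", b[j1:j2])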

What is the best approach to measuring similarity between texts in multiple languages in Python?

So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include the name of a product (for example, milk), and they may include a producer and/or a size, and maybe some other characteristics of the product.
I have a whole set of such texts, and when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).
The thing is: the texts may be in two different languages, Ukrainian and Russian. Also, if there is a foreign brand (like Coca Cola), it will be written in English.
My initial idea for solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between the texts. However, I am not sure how efficient this will be or, if it is okay, what to start with.
Because each text I have is just a set of product characteristics, context-based word embeddings may not work well (I'm not sure about this statement, it is just my assumption).
So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with the faiss installation.
Hence, my questions are:
Is my idea with word embeddings worth trying?
Is there maybe a better approach?
If the idea with word embeddings is okay, which ones should I use?
Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.
Thanks in advance for any help! Any advice would be highly appreciated!
You could try Milvus, which uses Faiss to search for similar vectors. It's easy to install with Docker on Windows.
Word embeddings are meaningful within a language but are not transferable to other languages. The intuition behind this statement: if two words frequently co-occur within sentences, their embeddings end up near each other. Hence, since there is no one-to-one mapping between two arbitrary languages, you cannot directly compare their word embeddings.
However, if two languages are similar enough that words map roughly one-to-one, your idea may work.
In sum, without translation, your idea is not applicable to two arbitrary languages.
Does the data contain lots of numerical information (e.g. nutritional facts)? If yes, this could be used to compare the products to some extent. My advice is to think of it not as a linguistic problem but as pattern matching, since these texts have presumably been produced semi-automatically using translation memories. Therefore similar texts across languages may have a similar form, and if so, this should be used for comparison.
Multilingual text comparison is not a trivial task and I don't think there are any reasonably good out-of-the-box solutions for it. Yes, multilingual embeddings exist, but they have to be fine-tuned to work on specific downstream tasks.
Let's say that your task is about fine-grained entity recognition. You have well-defined entities: brand, size, etc.
Each of the features that define a product could be a vector, which means a product could be represented as a matrix.
You can potentially represent each feature with an embedding,
or with a mixture of embeddings and one-hot vectors.
Here is how.
Define a list of product features:
product name, brand name, size, weight.
For each product feature, you need a text recognition model:
e.g. a brand recognizer finds which part of the text is the brand name.
Use machine translation, if possible, to get a unified language representation for all sub-texts,
e.g. ru Кока-Кола and en Coca Cola map to the same form.
Use contextual embeddings (e.g. Hugging Face multilingual BERT or something better) to convert each extracted feature text into one vector.
To compare two products, compare their feature vectors: for example, the average similarity between the two feature arrays. You can also decide how much weight each feature gets (see the sketch after this list).
Try other vectorization methods. Perhaps you don't want to mix up brand knock-offs: "Coca Cola" is similar to "Cool Cola". So maybe embeddings aren't good for brand names, size and weight, but are good enough for product names. If you want an exact match, use a hash of the unified-language text instead.
You can also extend each feature vector by concatenating several embeddings, a one-hot vector of the source language, and things like that.
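A rough sketch of the comparison step, assuming the sentence-transformers package with its multilingual model paraphrase-multilingual-MiniLM-L12-v2 (any multilingual encoder would do) and assuming the per-feature texts have already been extracted; the example products and weights are made up:

import numpy as np
from sentence_transformers import SentenceTransformer

# hypothetical output of the feature-recognition step for two products
product_a = {"name": "молоко 2.5%", "brand": "Галичина", "size": "900 мл"}
product_b = {"name": "молоко ультрапастеризоване 2.5%", "brand": "Галичина", "size": "0.9 л"}

weights = {"name": 0.5, "brand": 0.3, "size": 0.2}  # per-feature weights, chosen arbitrarily

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def product_similarity(a, b):
    # weighted average of per-feature cosine similarities
    score = 0.0
    for feature, w in weights.items():
        emb_a, emb_b = model.encode([a[feature], b[feature]])
        score += w * cosine(emb_a, emb_b)
    return score

print(product_similarity(product_a, product_b))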
There is no definitive answer here; you need to experiment and test to see what the best solution is. You can create a test set and benchmark your solutions against it.

How to extract text between two headings with regex, requires complicated non-capture groups

I want to pull abstracts out of a large corpus of scientific papers using a Python script. The papers are all saved as strings in a large csv. I want to do something like this: extracting text between two headers. I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper. They can be ALL CAPS or Just Capitalized. They can be one word or a long phrase that spans two lines. They are usually followed by one or two newlines. This is what I came up with:
abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)',row[0],re.DOTALL)
Here is an example of an abstract:
'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for
sequential\ndata but they do not scale well with long sequences. We
propose a scalable inference and learning algorithm for FHMMs that
draws on ideas from the stochastic\nvariational inference, neural
network and copula literatures. Unlike existing approaches, the
proposed algorithm requires no message passing procedure among\nlatent
variables and can be distributed to a network of computers to speed up
learning. Our experiments corroborate that the proposed algorithm does
not introduce\nfurther approximation bias compared to the proven
structured mean-field algorithm,\nand achieves better performance with
long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'
So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However, it could be 'ABSTRACT' and 'INTRODUCTION', or 'ABSTRACT' and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'.
Help?
Recognizing the next section is a bit vague - perhaps we can rely on Abstract-section ending with two newlines?
ABSTRACT\n(.*)\n\n
Or maybe we'll just assume that the next section title starts with an uppercase letter and is followed by any number of word characters. (That's rather vague too, and assumes there'll be no \n\n within the Abstract.)
ABSTRACT\n(.*)\n\n\U[\w\s]*\n\n
Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match - maybe we can stepwise refine it.
N.B.: as Wiktor pointed out, I could not use inline case-insensitive modifiers here, so the whole regex should be used with the case-insensitive flag switched on.
Update 1: the challenge here is really how to identify that a new section has begun... and not to confuse that with paragraph breaks within the Abstract. Perhaps that can also be dealt with by changing the rather tolerant [\w\s]* to [\w\s]{1,100}, which would only recognize text in a new paragraph as the title of the "abstract successor" if it had between 2 and 101 characters (the minimum is 2 rather than 1 because of the leading \U (uppercase character)).
ABSTRACT\n(.*)\n\n\U[\w\s]{1,100}\n\n
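A rough Python adaptation of that idea (a sketch only: Python's re has no \U for "an uppercase letter", so an explicit [A-Z] is used, and the Abstract heading is matched by alternation rather than a case-insensitive flag, which would also make [A-Z] match lowercase):

import re

abstract_re = re.compile(
    r"(?:ABSTRACT|Abstract)\s*\n+"                    # the Abstract heading
    r"(.*?)"                                          # the abstract body
    r"\n{2,}(?=(?:\d+\s*\n+)?[A-Z][\w\s]{1,100}\n)",  # a short following heading, optionally preceded by a section number
    re.DOTALL,
)

# shortened version of the example paper text from the question
paper_text = ("...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for "
              "sequential\ndata but they do not scale well with long sequences.\n\n"
              "1\n\nIntroduction\n\n...")

match = abstract_re.search(paper_text)
if match:
    print(match.group(1).strip())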

How to automatically label a cluster of words using semantics?

The context is: I already have clusters of words (phrases, actually) resulting from k-means applied to internet search queries, using common URLs in the search engine results as a distance (co-occurrence of URLs rather than words, if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries: ['my husband attacked me','he was arrested by the police','the trial is still going on','my husband can go to jail for harrassing me ?','free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem, so the label could be "legal", for example.
I am new to NLP, but I should specify that I don't want to extract words using POS tagging (or at least this is not the expected final outcome, though maybe a necessary preliminary step).
I read about WordNet for sense disambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input), nor obtain the definition of one selected word from the context provided by the whole bunch of words (which word would I select in that case?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization via the XML structure of WordNet) and then summarize that context in one or a few words.
Any ideas? I can use R or Python; I read a little about NLTK but can't find a way to use it in my context.
Your best bet is probably to label the clusters manually, especially if there are few of them. This is a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do that automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits: 1) you may discover you had the wrong number of clusters (the k parameter) or that there was too much junk in the input to begin with; 2) you will gain qualitative insight into what is being talked about and what topics there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need quantitative results too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, and so on.
When we talk about semantics in this area we mean statistical semantics. Statistical or distributional semantics is very different from other definitions of semantics that have logic and reasoning behind them. Statistical semantics is based on the distributional hypothesis, which treats context as the meaning-bearing aspect of words and phrases. Meaning in this very abstract and general sense is, in various literatures, called topics. There are several unsupervised methods for modelling topics, such as LDA, or even word2vec, which basically provides a word similarity metric or suggests a list of similar words as another kind of context. Usually, when you have these unsupervised clusters, you need a domain expert to tell you the meaning of each cluster.
However, for several reasons you might accept a low-accuracy assignment of a word as the general topic (or, in your words, the "global semantic") of a list of phrases. If this is the case, I would suggest taking a look at Word Sense Disambiguation tasks that look for coarse-grained word senses. For WordNet, this is known as the supersense tagging task.
This paper is worth a look: More or less supervised supersense tagging of Twitter
And regarding your question about choosing words from the current phrases, there is also an active question about converting phrases to vectors; my word2vec-style answer to that question might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if any come to mind.
The paper Automatic Labelling of Topic Models explains the authors' approach to this problem. To give an overview: they generate label candidates using information retrieved from Wikipedia and Google, and once they have the list of candidates in place, they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.
The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.
One possible approach, which the papers below suggest, is identifying a set of keywords from the cluster, getting all their synonyms, and then finding the hypernyms of each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: a word cluster containing the words dog and wolf should not be labelled with either word but as canids. They achieve this using synonymy and hypernymy (a WordNet sketch of the idea follows the references below).
Cluster Labeling by Word Embeddings and WordNet’s Hypernymy
Automated Text Clustering and Labeling using Hypernyms
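A minimal sketch of the hypernym idea with NLTK's WordNet interface (assumes the wordnet corpus has been downloaded; the sense choice here is naively fixed to the first noun synset, which a real system would have to disambiguate):

from collections import Counter
from itertools import combinations

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

words = ["dog", "wolf", "fox"]

# naive sense choice: first noun synset of each word
synsets = [wn.synsets(w, pos=wn.NOUN)[0] for w in words]

# count the lowest common hypernyms over all pairs and pick the most frequent one
counts = Counter()
for s1, s2 in combinations(synsets, 2):
    for hyp in s1.lowest_common_hypernyms(s2):
        counts[hyp] += 1

label = counts.most_common(1)[0][0]
print(label.name(), "-", label.definition())  # e.g. canine.n.02 for dog/wolf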

Measuring wealth of information on text using NLP

Is there any metric that measures the wealth of information in a text?
I am thinking in terms of anything that can reliably show unique information segments within a text. Simple metrics using frequency distributions or unique words are okay but they don't quite show unique information in sentences.
Using coding methods, I would have to manually code each sentence/word or anything else that would count as a unique piece of information in a text, but that could take a while. So I wonder if I could use NLP as an alternative.
UPDATE
As an example:
Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.
If I were to use coding analysis, I would count 4 unique information points: what Navtilos is, where it is, how it was created, and when.
Obviously a human interprets text differently from a computer. I just wonder if there is a measure that can identify unique information within sentences/texts. It does not have to produce the same result as mine, but it should be reliable across different sentences.
A frequency distribution may work effectively but I wonder if there are other metrics for this.
What you seem to be looking for is a keyword/term extractor (for a list of keyword extractors see, for example, this, "External Links"). An extractor will extract phrases consisting of one or more words that capture some notions mentioned in the text, but without classifying them into classes (as named entity recognisers would do).
See, for example, this demo. From the sentence in your example, it extracts:
small volcanic islet
Navtilos
Santorini
If you have lots of documents, you can then use the frequency distribution of each keyword across documents to measure how specific it is to each document (assuming that uniqueness of a keyword to a document reflects how well it describes the contents of the document). For this, you can use a measure like tf-idf.
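For instance, a small sketch with scikit-learn's TfidfVectorizer (the toy documents are made up; a keyword extractor could be plugged in instead of the default word tokenization):

from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus; in practice each entry would be one of your documents
docs = [
    "Navtilos, a small volcanic islet of the Santorini volcano, created in the eruption of 1928.",
    "Santorini is a Greek island in the southern Aegean Sea.",
    "The 1928 eruption reshaped part of the volcanic complex.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # shape: (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

# the terms with the highest tf-idf weight in the first document are the most
# document-specific ones, i.e. the best candidates for "unique information"
row = tfidf[0].toarray().ravel()
print(sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5])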
