Finding and outputting matches in differing translations of the same novel - python

I am comparing two translations (English to French) of the same novel that are quite different from one another. I am interested in locating any significant matches that exist in both (>3 words in order).
My first instinct was to look at difflib or filecmp, but they seem to mostly output quantitative data when I want qualitative. They also seem built mostly for line-by-line comparison, whereas I want to compare the texts in their entirety. Given the large size of the .txt files (novel-length), am I crazy to think this is even possible?
I'm honestly open to any programming language that can solve this, but partial to python.
thanks!
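One word-level sketch with difflib (which the question already mentions), assuming both texts fit in memory and treating "significant" as more than 3 words in a row; the file names are placeholders:

import difflib

# Compare the two translations word by word and keep runs of more than
# three matching words. File names and whitespace tokenization are assumptions.
with open("translation_a.txt", encoding="utf-8") as f:
    words_a = f.read().split()
with open("translation_b.txt", encoding="utf-8") as f:
    words_b = f.read().split()

# autojunk=False stops difflib from discarding very frequent tokens
# (common words would otherwise be treated as junk on novel-length input).
# Worst case this is quadratic, so it may take a while on two full novels.
matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)

for block in matcher.get_matching_blocks():
    if block.size > 3:  # "significant" match: more than 3 words in order
        print(" ".join(words_a[block.a:block.a + block.size]))

Note that SequenceMatcher aligns the two texts, so a shared phrase that sits in a very different position in each translation can be missed; it is a starting point, not a guarantee of catching every common n-gram.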

Related

reformat text documents for easier comparison

For those who want to skip the reasoning behind the question, jump to the TL;DR.
Hi, I'm currently reading a lot of financial annual reports of companies. While the first one is the most interesting, the documents that come after it are often the same in a lot of regards, so I'm mostly interested in the differences between them. The documents come as PDFs, which are hard to compare, so I thought it would be nice to get them as pure text and compare them with a diff tool. That's what I did. I piped the following two PDFs through pdftotext with the parameters below:
annual report for 2018
annual report for 2019
pdftotext -enc UTF-8 -nopgbrk -eol mac
I then realized that compare tools seem to have problems with line breaks. If I have exactly the same sentences, but with different line breaks in the two documents, that is shown as a difference. Bullet points in the PDFs are transformed into different symbols in the text files, which leads to differences as well. So I looked into NLP and thought I might get some help there.
TL;DR
I just want to reformat the two snippets below in a defined way so that I no longer get diffs in a diff tool: lines should be at most 80 characters long, and there should be some normalized/canonical way of printing bullet points and the like.
I'm currently using spaCy, and here is an example of two text snippets that are essentially the same but lead to a lot of diffs in diff tools. How can I reprint both snippets to a text document so that the line breaks are the same? Is there even a method to find things like two sentences that are exactly the same except that one has one additional word? I would like to reformat that as well without shifting the line breaks by one word.
import spacy

nlp = spacy.load("en_core_web_sm")

SE_2018_10k_string = '''x
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique
account through which payments are made in more than one online game or in more than one market is counted as more than one paying user.
“QPUs” refers to the aggregate number of paying users during the quarterly period;
x'''
doc1 = nlp(SE_2018_10k_string)
print('SE_2018_10k_string')
for token in doc1:
    print(token.text)

SE_2019_10k_string = '''●
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique account
through which payments are made in more than one online game or in more than one market is counted as more than one paying user. “QPUs” refers to
the aggregate number of paying users during the quarterly period;
●'''
doc2 = nlp(SE_2019_10k_string)
print('SE_2019_10k_string')
for token in doc2:
    print(token.text)

# Note: en_core_web_sm ships without word vectors, so similarity() only gives
# a rough score and warns; en_core_web_md/lg produces a meaningful value.
print(doc1.similarity(doc2))
There is no universal way to get rid of the problems you are seeing.
If you find that you have line breaks in different places but your texts are otherwise the same, you can normalize things by removing line breaks. If only spaces are different, you can remove spaces, or convert any run of spaces to a single space. If bullets are an issue, you can remove them or convert them to a single type of character (but how do you tell whether something is a bullet in code? There is no standard way).
Appropriate normalization depends on your data, and for OCR it's typically going to just be hard.
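As a concrete illustration of that kind of normalization (the bullet glyphs and the 80-column re-wrap are assumptions; adjust them to your data), something like this gives both snippets identical line breaks:

import re
import textwrap

BULLETS = "●•▪x"  # the 2018 snippet above uses a literal 'x' as its bullet glyph

def normalize(text, width=80):
    # Drop bullet glyphs that stand at the start of a line.
    text = re.sub(rf"^[{re.escape(BULLETS)}]+\s*", "", text, flags=re.MULTILINE)
    # Collapse line breaks and runs of spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Re-wrap deterministically so both documents get the same line breaks.
    return textwrap.fill(text, width=width)

Running normalize() over both SE_2018_10k_string and SE_2019_10k_string before diffing removes the line-break and bullet differences; real PDF/OCR output will need more rules than this.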
"Is there even a method to find things like two sentences are exactly the same but in one sentence there is one additional word?"
You can use edit distance metrics like Levenshtein distance to find this. It won't help you with existing diff tools though, since they show any difference.
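A word-level variant of edit distance is enough for the "one extra word" case; a small self-contained sketch (treating a distance of 1 as "one word added, removed, or changed" is an assumption):

def word_edit_distance(sent_a, sent_b):
    """Levenshtein distance over words rather than characters."""
    a, b = sent_a.split(), sent_b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete a word
                           cur[j - 1] + 1,              # insert a word
                           prev[j - 1] + (wa != wb)))   # substitute a word
        prev = cur
    return prev[-1]

# distance 1 == "the same sentence, but one word added/removed/changed"
print(word_edit_distance("the quick brown fox", "the quick brown lazy fox"))  # 1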

How to extract text between two headings with regex, requires complicated non-capture groups

I want to pull abstracts out of a large corpus of scientific papers using a Python script. The papers are all saved as strings in a large CSV. I want to do something like this: extracting text between two headers. I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper: they can be ALL CAPS or Just Capitalized, they can be one word or a long phrase that spans two lines, and they are usually followed by one or two newlines. This is what I came up with:
abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)',row[0],re.DOTALL)
Here is an example of an abstract:
'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for
sequential\ndata but they do not scale well with long sequences. We
propose a scalable inference and learning algorithm for FHMMs that
draws on ideas from the stochastic\nvariational inference, neural
network and copula literatures. Unlike existing approaches, the
proposed algorithm requires no message passing procedure among\nlatent
variables and can be distributed to a network of computers to speed up
learning. Our experiments corroborate that the proposed algorithm does
not introduce\nfurther approximation bias compared to the proven
structured mean-field algorithm,\nand achieves better performance with
long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'
So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However, it could be 'ABSTRACT' and 'INTRODUCTION', or 'ABSTRACT' and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'.
Help?
Recognizing the next section is a bit vague; perhaps we can rely on the Abstract section ending with two newlines?
ABSTRACT\n(.*)\n\n
Or maybe we'll just assume that the next section title will start with an uppercase letter and be followed by any number of word characters. (That's rather vague too, and it assumes there'll be no \n\n within the Abstract.)
ABSTRACT\n(.*)\n\n[A-Z][\w\s]*\n\n
Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match - maybe we can stepwise refine it.
N.B.: as Wiktor pointed out, I could not use inline case-insensitive modifiers, so the whole regex should be used with the case-insensitive flag switched on.
Update 1: the challenge here is really how to identify that a new section has begun, and not to confuse that with paragraph breaks within the Abstract. Perhaps that can also be dealt with by changing the rather tolerant [\w\s]* to [\w\s]{1,100}, which would only recognize text in a new paragraph as the title of the "abstract successor" if it had between 2 and 101 characters (the lower bound works out to 2 because the leading [A-Z] (uppercase character) already accounts for one).
ABSTRACT\n(.*)\n\n[A-Z][\w\s]{1,100}\n\n
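Roughly how that last pattern could be applied in Python, with the case-insensitive flag mentioned above (the shortened sample mimics the question's data):

import re

# Shortened sample in the shape of the question's data: an 'Abstract' heading,
# the abstract body, a stray section number, then the 'Introduction' heading.
sample = ("...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful "
          "models for\nsequential data but they do not scale well with long "
          "sequences.\n\n1\n\nIntroduction\n\n...")

# DOTALL lets . cross line breaks; IGNORECASE covers ABSTRACT vs. Abstract.
pattern = re.compile(r"ABSTRACT\n(.*)\n\n[A-Z][\w\s]{1,100}\n\n",
                     re.DOTALL | re.IGNORECASE)

match = pattern.search(sample)
if match:
    # On this sample the capture still ends with the stray "1" that sits
    # between the abstract and the Introduction heading, so section/page
    # numbers need a separate cleanup pass afterwards.
    print(match.group(1))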

How to generate homophones on substring level?

I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two well-known named entities; many are far more obscure).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.
How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what the word sounds like. This is a very hard problem with guessing required. (E.g., what sound does "read" make? It depends on whether you are going to read or you already read!) However, the "text to phonemes converter" question suggests that ARPAbet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms used for spelling autocorrect, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html, or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
And that is it.
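A rough sketch of the "store pronunciations and search for nearby phoneme sequences" part, using NLTK's CMU Pronouncing Dictionary (ARPAbet); for out-of-dictionary names like Google you would first need the grapheme-to-phoneme step described above:

import nltk
from nltk.corpus import cmudict

# nltk.download('cmudict')  # one-time download of the pronouncing dictionary
pron = cmudict.dict()  # word -> list of ARPAbet phoneme sequences

def sounds_like(word, max_dist=1):
    """Return dictionary words whose phonemes are within max_dist edits
    of the query word's first listed pronunciation."""
    target = pron[word.lower()][0]
    hits = []
    # Linear scan for clarity; a real system would index phoneme n-grams
    # the way the FastSS/autocorrect references above describe.
    for cand, prons in pron.items():
        if any(nltk.edit_distance(target, p) <= max_dist for p in prons):
            hits.append(cand)
    return hits

# e.g. sounds_like("night") should include "knight" among the hits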

Use Python to find and remove duplicate text in a collection of files

I have a collection of 40-50 text files that contain markdown. Some of them contain duplicate words, sentences, and paragraphs. I'm looking for a script/algorithm to scan the files and help me identify matches (or near matches). Where can I find such a thing? Searching for this type of thing online yielded results for other types of problems, but not this one. Would appreciate any clues to help me narrow my search...
Basically, a simple brute force can solve all of your problems, but you should consider other algorithms depending on your requirements (timing, memory, ...): Boyer–Moore, the Rabin–Karp string search algorithm, or the Knuth–Morris–Pratt algorithm.
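For the exact-duplicate case the brute-force pass is genuinely short; a sketch (the "notes" folder, the *.md glob and the blank-line paragraph split are assumptions about how the collection is laid out):

from collections import defaultdict
from pathlib import Path

# Report paragraphs that appear verbatim (up to whitespace and case) in
# more than one file.
seen = defaultdict(list)  # normalized paragraph -> [(file name, paragraph index)]

for path in Path("notes").glob("*.md"):
    for i, para in enumerate(path.read_text(encoding="utf-8").split("\n\n")):
        key = " ".join(para.split()).lower()  # collapse whitespace, ignore case
        if key:
            seen[key].append((path.name, i))

for key, places in seen.items():
    if len(places) > 1:
        print(f"{key[:60]}... appears in {places}")

For near matches rather than exact ones, scoring candidate paragraph pairs with difflib.SequenceMatcher(None, a, b).ratio() is the usual next step.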

Algorithm or tool in python to distinguish between gibberish/errors and foreign words/names?

I'm doing some machine extraction of sometimes-garbled PDF text, which often ends up with words incorrectly split up by spaces, or chunks of words put in incorrect order, resulting in pure gibberish.
I'd like a tool that can scan through and recognize these chunks of pure gibberish while skipping non-dictionary words that are likely to be proper names or simply words in a foreign language.
Not sure if this is even possible, but if it is I imagine something like this could be done using NLTK. I'm just wondering if this has been done before to save me the trouble of reinventing the wheel.
Hm, I imagine you could train an SVM or neural network on character n-grams... but you'd need pretty darn long ones. The problem is that this would probably have a high rate of false negatives (throwing out what you wanted), because you can have drastically different rates of character clusters in various languages.
Take Polish, for example (it's my only second language in easy-to-type Latin characters). Skrzywdy would be a highly unlikely series of letters in English, but is easily pronounceable in Polish.
A better technique might be to use language detection to detect languages used in a document above a certain probability, and then check the dictionaries for those languages...
This won't help for (for instance) a Linguistics textbook where a large variety of snippets of various languages are frequently used.
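A rough sketch of that detection-plus-dictionary idea using the langdetect package (the wordlist paths are placeholders you would have to supply, one per language you expect):

from langdetect import detect_langs  # pip install langdetect

WORDLISTS = {"en": "wordlists/en.txt", "pl": "wordlists/pl.txt"}  # hypothetical paths

def likely_languages(text, threshold=0.1):
    # Keep every language the detector assigns at least `threshold` probability.
    return [entry.lang for entry in detect_langs(text)
            if entry.prob >= threshold and entry.lang in WORDLISTS]

def load_vocab(langs):
    vocab = set()
    for lang in langs:
        with open(WORDLISTS[lang], encoding="utf-8") as f:
            vocab.update(w.strip().lower() for w in f)
    return vocab

def gibberish_tokens(text):
    vocab = load_vocab(likely_languages(text))
    # Tokens found in none of the detected languages' dictionaries are
    # gibberish candidates; proper names will still slip through, as noted above.
    return [tok for tok in text.split()
            if tok.strip(".,;:!?\"'").lower() not in vocab]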
EDIT:
Idea 2:
You say this is bibliographic information. Meta-information like its position in the text, or any font information your OCR software is returning, is almost certainly more important than the series of characters you see showing up. If it's in the title, near the position where the author goes, or in italics, it's worth considering as foreign...
