Keeping Numbers in Doc2Vec Tokenization - python

I’m in the process of trying to get document similarity values for a corpus of approximately 5,000 legal briefs with Doc2Vec (I recognize that the corpus may be a little bit small, but this is a proof-of-concept project for a larger corpus of approximately 15,000 briefs I’ll have to compile later).
Basically, every other component in the creation of the model is going relatively well so far – each brief is in its own text file within a larger folder, so I compiled them in my script using glob.glob – but I’m running into a tokenization problem. The difficulty is that, as these documents are legal briefs, they contain numbers I’d like to keep, and many of the guides I’ve been using use Gensim’s simple_preprocess (which I believe eliminates digits from the corpus) in tandem with the TaggedDocument feature. However, I want to do as little preprocessing on the texts as possible.
Below is the code I’ve used. I’ve tried swapping simple_preprocess for gensim.utils.tokenize, but that returns generator objects that don’t appear workable in my final Doc2Vec model, and I can’t actually see how the corpus looks. When I’ve tried other tokenizers, like NLTK’s, I don’t know how to fit them into the TaggedDocument component.
import codecs
import gensim

brief_corpus = []
for brief_filename in brief_filenames:  # brief_filenames collected earlier with glob.glob
    with codecs.open(brief_filename, "r", "utf-8") as brief_file:
        brief_corpus.append(
            gensim.models.doc2vec.TaggedDocument(
                gensim.utils.simple_preprocess(
                    brief_file.read()),
                ["{}".format(brief_filename)]))  # tagging each brief with its filename
I’d appreciate any advice that anyone can give that would help me combine a tokenizer that just separated on whitespace and didn’t eliminate any numbers with the TaggedDocument feature. Thank you!
Update: I was able to write some rudimentary tokenization code (I do plan on refining it further) without having to resort to Gensim's simple_preprocess function. However, I'm having difficulty (again!) when using the TaggedDocument feature - but this time, the tags (which I want to be the file names of each brief) don't match the tokenized documents. Basically, each document gets a tag, but it's not the right one.
Can anyone possibly advise where I might have gone wrong with the new code below? Thanks!
briefs = []
BriefList = [p for p in os.listdir(FILEPATH) if p.endswith('.txt')]
for brief in BriefList:
    str = open(FILEPATH + brief, 'r').read()
    tokens = re.findall(r"[\w']+|[.,!?;]", str)
    tagged_data = [TaggedDocument(tokens, [brief]) for brief in BriefList]
    briefs.append(tagged_data)

You're likely going to want to write your own preprocessing/tokenization functions. But don't worry, it's not hard to outdo Gensim's simple_preprocess, even with very crude code.
The only thing Doc2Vec needs as the words of a TaggedDocument is a list of string tokens (typically words).
So first, you might be surprised how well it works to just do a default Python string .split() on your raw strings - which just breaks text on whitespace.
Sure, a bunch of the resulting tokens will then be mixes of words & adjoining punctuation, which may be nearly nonsense.
For example, the word 'lawsuit' at the end of a sentence might appear as 'lawsuit.', which then won't be recognized as the same token as 'lawsuit', and might not appear min_count times often enough to even be considered, or may otherwise barely rise above noise.
But, especially for longer documents and larger datasets, no one token, or even 1% of all tokens, has that much influence. This isn't exact-keyword-search, where failing to return a document with 'lawsuit.' for a query on 'lawsuit' would be a fatal failure. A bunch of words 'lost' to such cruft may have hardly any effect on overall document, or model, performance.
As your datasets seem manageable enough to run lots of experiments, I'd suggest trying this dumbest-possible tokenization – only .split() – just as a baseline to become confident that the algorithm still mostly works as well as some more intrusive operation (like simple_preprocess()).
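For concreteness, here's a minimal sketch of that baseline, assuming (as in your second snippet) that FILEPATH names the folder of .txt briefs; note that the tag comes from the same loop variable as the tokens, which also avoids the mismatched-tags problem from your update:

import glob
import os
from gensim.models.doc2vec import TaggedDocument

brief_corpus = []
for brief_filename in glob.glob(os.path.join(FILEPATH, '*.txt')):
    with open(brief_filename, 'r', encoding='utf-8') as brief_file:
        tokens = brief_file.read().split()   # whitespace-only split; digits survive intact
    brief_corpus.append(TaggedDocument(tokens, [brief_filename]))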
Then, as you notice, or suspect, or ideally measure with some repeatable evaluation, that some things you'd want to be meaningful tokens aren't treated right, gradually add extra steps of stripping/splitting/canonicalizing characters or tokens. But as much as possible: checking that the extra complexity of code, and runtime, is actually delivering benefits.
For example, further refinements could be some mix of the following (a rough sketch combining a couple of them appears after this list):
For each token created by the simple split(), strip off any non-alphanumeric leading/trailing chars. (Advantages: eliminates that punctuation-fouling-words cruft. Disadvantages: might lose useful symbols, like the leading $ of monetary amounts.)
Before splitting, replace certain single-character punctuation-marks (like say ['.', '"', ',', '(', ')', '!', '?', ';', ':']) with the same character with spaces on both sides - so that they're never connected with nearby words, and instead survive a simple .split() as standalone tokens. (Advantages: also prevents words-plus-punctuation cruft. Disadvantages: breaks up numbers like 2,345.77 or some useful abbreviations.)
At some appropriate stage in tokenization, canonicalize many varied tokens into a smaller set of tokens that may be more meaningful than each of them as rare standalone tokens. For example, $0.01 through $0.99 might all be turned into $0_XX - which then has a better chance of influencing the model, & being associated with 'tiny amount' concepts, than the original standalone tokens. Or replace all digits with #, so that numbers of similar magnitudes share influence, without diluting the model with a token for every distinct number.
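As one illustration, here is a rough sketch combining the first and third refinements; every rule in it (what to strip, which characters to keep, how to mask digits) is an illustrative choice to tune and evaluate, not a recommendation:

import re

def refined_tokenize(text):
    tokens = []
    for tok in text.split():
        # Strip non-alphanumeric leading/trailing characters, but keep a leading '$'.
        tok = re.sub(r'^[^\w$]+|[^\w]+$', '', tok)
        if not tok:
            continue
        # Canonicalize digits so '2,345.77' and '6,812.04' share the shape '#,###.##'.
        tok = re.sub(r'\d', '#', tok)
        tokens.append(tok)
    return tokens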
The exact mix of heuristics, and order of operations, will depend on your goals. But with a corpus only in the thousands of docs (rather than hundreds-of-thousands or millions), even if you do these replacements in a fairly inefficient way (lots of individual string- or regex- replacements in serial), it'll likely be a manageable preprocessing cost.
But you can start simple & only add complexity that your domain-specific knowledge, and evaluations, justifies.

Related

Gensim word2vec and large amount of texts

I need to put the texts contained in a column of a MySQL database (about 3 million rows) into a list of lists of tokens. These texts (which are tweets, therefore they are generally short) must be preprocessed before being included in the list (stop words, hashtags, tags etc. must be removed). This list should be passed later as a Word2Vec parameter. This is the part of the code involved
import mysql.connector
import re
from gensim.models import Word2Vec
import preprocessor as p

p.set_options(
    p.OPT.URL,
    p.OPT.MENTION,
    p.OPT.HASHTAG,
    p.OPT.NUMBER
)

conn = mysql.connector.connect(...)
cursor = conn.cursor()
query = "SELECT text FROM tweet"
cursor.execute(query)
table = cursor.fetchall()

stopwords = open('stopwords.txt', encoding='utf-8').read().split('\n')

sentences = []
for row in table:
    sentences = sentences + [[w for w in re.sub(r'[^\w\s-]', ' ', p.clean(row[0])).lower().split() if w not in stopwords and len(w) > 2]]

cursor.close()
conn.close()

model = Word2Vec(sentences)
...
Obviously it takes a lot of time and I know that my method is probably inefficient. Can anyone recommend a better one? I know it is not a question directly related to gensim and Word2Vec but perhaps those who use them have already faced the problem of working with a large amount of texts.
You haven't mentioned how long your code takes to run, but some potential sources of slowdown in your current technique might include:
the overhead of regex-based preprocessing, especially if a large number of independent regexes are each applied, separately, to the same texts
the inefficiency of expanding a Python list by appending one new item at a time - which as the list grows larger can sometimes be a factor
virtual-memory swapping, if the size of your data exceeds physical RAM
You can check the swapping issue by using a platform-specific tool (like top on Linux systems) to watch memory usage during the operation. If that's a contributor, using a machine with more RAM, or making other code changes to reduce RAM usage (see below), will help.
Your full preprocessing code isn't shown, but a common approach is a lot of independent steps, each of which involves one or more regular expressions but then returns a plain modified string (for future steps).
As appealingly simple & pluggable as that is, it often becomes a source of avoidable slowness in preprocessing large amounts of text. For example, each regex/step itself might have to repeat detecting token-boundaries, or splitting then re-concatenating a string. Or, the regexes might use complex match patterns, or techniques (like backtracking) that can be expensive on worst-case inputs.
Often this sort of preprocessing can be greatly improved by one or more of:
coalescing multiple regexes into a single step, so a string faces one front-to-back pass, rather than N
breaking into short tokens early, then leaving the text as a list-of-tokens for later steps - thus never redundantly splitting/joining, and letting later token-oriented steps work on smaller strings and perhaps even simpler (non-regex) string tests (see the sketch just below)
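A small sketch of both ideas, keeping the question's tweet-preprocessor cleaning (p) and stopword list; turning the stopword list into a set is an extra, assumed micro-optimization on top of what's described above:

import re

NON_WORD = re.compile(r'[^\w\s-]')   # one pre-compiled pattern, applied in a single pass per text
stopword_set = set(stopwords)        # set lookups are much cheaper than scanning a list

def preprocess(tweet_text):
    # Clean and strip unwanted characters in one regex pass, then split into tokens early;
    # the remaining checks are simple per-token tests on short strings.
    cleaned = NON_WORD.sub(' ', p.clean(tweet_text)).lower()
    return [w for w in cleaned.split() if w not in stopword_set and len(w) > 2]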
Also, even if the preprocessing is still a bit time-consuming, a big process improvement is usually to be sure to only repeat it when the data changes. That is, if you're going to try a bunch of different downstream steps, like different Word2Vec parameters, make sure you're not doing the expensive preprocessing every time. Do it once, write the results aside to a file, then reuse the results file until it needs to be regenerated (because the data or preprocessing rules have changed).
Finally, if the append-one-more pattern is contributing to your slowness, you could pre-allocate your sentences list (sentences = [None] * desired_length), then replace each row in your loop rather than append (sentences[row_num] = preprocessed_text). But that might not be a major factor, and in fact the suggestion above, about reusing a results file, is a better way to minimize list-ops/RAM usage, as well as enable reuse across alternate runs.
That is, open a new working file before your loop. Append each preprocessed text – with spaces between the tokens, and a newline at the end – as one new line to this file. Then, have your Word2Vec step work directly from that file. (In Gensim, you can do this by wrapping the file with a LineSentence utility object, which reads a file of that format back as a re-iterable sequence, with each item being a list-of-tokens, or by using the corpus_file parameter to feed the filename directly to Word2Vec.)
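Here's a sketch of that write-once-then-reuse approach, assuming a preprocess() function like the one sketched above and a hypothetical output filename:

from gensim.models.word2vec import Word2Vec, LineSentence

# Write each preprocessed tweet as one space-separated line ('tweets_tokenized.txt' is illustrative).
with open('tweets_tokenized.txt', 'w', encoding='utf-8') as out_file:
    for row in table:
        out_file.write(' '.join(preprocess(row[0])) + '\n')

# Later runs skip the expensive preprocessing and stream straight from the file:
model = Word2Vec(sentences=LineSentence('tweets_tokenized.txt'))
# ...or hand the filename directly to Word2Vec:
model = Word2Vec(corpus_file='tweets_tokenized.txt')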
From that list of possible tactics, I'd try:
First, time your existing code for preprocessing (creating your sentences list).
Then, eliminate all fancy preprocessing, doing nothing more complicated than .split(), and re-time. If there's a big change, then yes, the preprocessing is the major slowdown, and concentrate on improving that.
If even that minimal preprocessing still seems slower-than-desired, then maybe the RAM/concatenation issues are a concern, and try writing to an interim file.
Separately: it's not strictly necessary to worry about removing stop-words in word2vec training - much published work doesn't bother with that step, and the algorithm already includes a sample parameter which causes it to skip a lot of the very-overrepresented words during training as less-interesting. Similarly, 2- and even 1- character tokens may still be interesting, especially in the domain of tweets, so you might not want to always discard them. (For example, lone emoji can be significant 'words'.)

Word2vec on documents each one containing one sentence

I have some unsupervised data (100,000 files) and each file has a paragraph containing one sentence. The preprocessing went wrong and deleted all stop points (.).
I used word2vec on a small sample (2000 files) and it treated each document as one sentence.
Should I continue the process on all remaining files? Or would this result in a bad model?
Thank you
Did you try it, and get bad results?
I'm not sure what you mean by "deleted all stop points". But, Gensim's Word2Vec is oblivious to what your tokens are, and doesn't really have any idea of 'sentences'.
All that matters is the lists-of-tokens you provide. (Sometimes people include punctuation like '.' as tokens, and sometimes it's stripped - it doesn't make a very big difference either way, and to the extent it does, whether it's good or bad may depend on your data & goals.)
Any lists-of-tokens that include neighboring related tokens, for the sort of context-window training that's central to the word2vec algorithm, should work well.
For example, it can't learn anything from one-word texts, where there are no neighboring words. But running together sentences, paragraphs, and even full documents into long texts works fine.
Even concatenating wholly-unrelated texts doesn't hurt much: the bit of random noise from unrelated words now in-each-others' windows is outweighed, with enough training, by the meaningful relationships in the much-longer runs of truly-related text.
The main limit to consider is that each training text (list of tokens) shouldn't be more than 10,000 tokens long, as internal implementation limits up through Gensim 4.0 mean tokens past the 10,000th position will be ignored. (This limit might eventually be fixed - but until then, just splitting overlong texts into 10,000-token chunks is a fine workaround with negligible effects via the lost contexts at the break points.)
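A minimal sketch of that workaround (the 10,000 cap comes from the internal limit described above):

def split_long_text(tokens, max_len=10000):
    # Break an over-long list of tokens into consecutive <=10,000-token chunks,
    # so no tokens past the internal limit are silently ignored.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]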

How to extract text between two headings with regex, requires complicated non-capture groups

I want to pull abstracts out of a large corpus of scientific papers using a python script. The papers are all saved as strings in a large csv. I want to do something like this: extracting text between two headers. I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper. They can be ALL CAPS or Just Capitalized. They can be one word or a long phrase and span two lines. They are usually followed by one or two newlines. This is what I came up with:
abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)',row[0],re.DOTALL)
Here is an example of an abstract:
'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for
sequential\ndata but they do not scale well with long sequences. We
propose a scalable inference and learning algorithm for FHMMs that
draws on ideas from the stochastic\nvariational inference, neural
network and copula literatures. Unlike existing approaches, the
proposed algorithm requires no message passing procedure among\nlatent
variables and can be distributed to a network of computers to speed up
learning. Our experiments corroborate that the proposed algorithm does
not introduce\nfurther approximation bias compared to the proven
structured mean-field algorithm,\nand achieves better performance with
long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'
So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However it could be 'ABSTRACT' and 'INTRODUCTION', or ABSTRACT and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'
Help?
Recognizing the next section is a bit vague - perhaps we can rely on the Abstract section ending with two newlines?
ABSTRACT\n(.*)\n\n
Or maybe we'll just assume that the next section title will start with an uppercase letter and be followed by any number of word characters. (That's rather vague too, and assumes there'll be no \n\n within the Abstract.)
ABSTRACT\n(.*)\n\n\U[\w\s]*\n\n
Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match - maybe we can stepwise refine it.
N.B.: as Wiktor pointed out, I could not use the case-insensitive modifiers. So the whole rx should be used with switches for case-insensitive matching.
Update1: the challenge here is really how to identify that a new section has begun... and not to confuse that with paragraph breaks within the Abstract. Perhaps that can also be dealt with by changing the rather tolerant [\w\s]* to [\w\s]{1,100}, which would only recognize text in a new paragraph as the title of the "abstract-successor" if it had between 2 and 101 characters (the quantifier's lower bound is 1 because the \U already matches one leading uppercase character).
ABSTRACT\n(.*)\n\n\U[\w\s]{1,100}\n\n
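For completeness, here is a rough Python rendering of that last idea - an assumption-laden sketch: Python's re module has no \U class, so a literal [A-Z] stands in for the uppercase first letter, and the two casings of the Abstract heading are listed explicitly instead of a case-insensitive switch.

import re

abstract_re = re.compile(
    r'(?:ABSTRACT|Abstract)\s*\n+'       # the Abstract heading, in either casing
    r'(.*?)'                             # the abstract body, non-greedy
    r'\n\n(?=[A-Z][\w\s]{1,100}\n\n)',   # lookahead: a short, capitalized heading paragraph
    re.DOTALL)

match = abstract_re.search(paper_text)   # paper_text: one paper's full string (placeholder name)
abstract = match.group(1).strip() if match else None

On the sample above this still sweeps the stray section number '1' into the capture, which is exactly the "where does the next section really begin" ambiguity described in Update1.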

Negation handling in NLP

I'm currently working on a project, where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), I can't however simply prefix words in a sentence that contains a negation-word, as those words would simply not show up in conceptnet5's API.
Here's an example:
The movie wasn't that good.
Hence, I figured that I could use wordnet's lemma functionality to replace adjectives in sentences that contain negation-words like (not, ...).
In the previous example, the algorithm would detect wasn't and would replace it with was not.
Further, it would detect a negation-word not, and replace good with its antonym bad.
The sentence would read:
The movie was that bad.
While I see that this isn't the most elegant way, and it does probably in many cases produce the wrong result, I'd still like to handle negation that way as I frankly don't know any better approach.
Considering my problem:
Unfortunately, I did not find any library that would allow me to replace all occurrences of appended negation-words (wasn't => was not).
I mean I could do it manually, by replacing the occurrences with a regex, but then I would be stuck with the English language.
Therefore I'd like to ask if some of you know a library, function or better method that could help me here.
Currently I'm using python nltk, still it doesn't seem that it contains such functionality, but I may be wrong.
Thanks in advance :)
Cases like wasn't can be simply parsed by tokenization (tokens = nltk.word_tokenize(sentence)): wasn't will turn into was and n't.
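For example (the download line is only needed once, to fetch the tokenizer data):

import nltk
nltk.download('punkt')   # one-time download of the tokenizer data, if not already present

tokens = nltk.word_tokenize("The movie wasn't that good.")
# -> ['The', 'movie', 'was', "n't", 'that', 'good', '.']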
But negative meaning can also be formed by 'Quasi negative words, like hardly, barely, seldom' and 'Implied negatives, such as fail, prevent, reluctant, deny, absent' - look into this paper. Even more detailed analysis can be found in Christopher Potts' On the negativity of negation.
Considering your initial problem, sentiment analysis, most modern approaches, as far as I know, don't process negations explicitly; instead, they use supervised approaches with high-order n-grams. Those that do process negation usually append a special prefix NOT_ to all words between the negation and the next punctuation mark.
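A rough sketch of that NOT_ convention, with assumed (and deliberately simple) trigger words and scope rules:

NEGATORS = {'not', "n't", 'no', 'never'}     # assumed trigger tokens
PUNCTUATION = {'.', ',', '!', '?', ';', ':'}

def mark_negation(tokens):
    # Prefix every token between a negation word and the next punctuation mark.
    out, in_scope = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            in_scope = False
            out.append(tok)
        elif tok.lower() in NEGATORS:
            in_scope = True
            out.append(tok)
        else:
            out.append('NOT_' + tok if in_scope else tok)
    return out

# mark_negation(['The', 'movie', 'was', "n't", 'that', 'good', '.'])
# -> ['The', 'movie', 'was', "n't", 'NOT_that', 'NOT_good', '.']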

Algorithm or tool in python to distinguish between gibberish/errors and foreign words/names?

I'm doing some machine extraction of sometimes-garbled PDF text, which often ends up with words incorrectly split up by spaces, or chunks of words put in incorrect order, resulting in pure gibberish.
I'd like a tool that can scan through and recognize these chunks of pure gibberish while skipping non-dictionary words that are likely to be proper names or simply words in a foreign language.
Not sure if this is even possible, but if it is I imagine something like this could be done using NLTK. I'm just wondering if this has been done before to save me the trouble of reinventing the wheel.
Hm, I imagine you could train an SVM or neural network on character n-grams... but you'd need pretty darn long ones. The problem is that this would probably have a high rate of false negatives (throwing out what you wanted), because you can have drastically different rates of character clusters in various languages.
Take Polish, for example (it's the only second language of mine written in easy-to-type Latin characters). Skrzywdy would be a highly unlikely series of letters in English, but is easily pronounceable in Polish.
A better technique might be to use language detection to detect languages used in a document above a certain probability, and then check the dictionaries for those languages...
This won't help for (for instance) a Linguistics textbook where a large variety of snippets of various languages are frequently used.
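For what it's worth, a minimal sketch of the detect-then-check-dictionaries idea, using the langdetect package (any language-ID library that reports per-language probabilities would do; the 0.1 threshold and the document_text variable are just placeholders):

from langdetect import detect_langs   # pip install langdetect

candidates = [lang for lang in detect_langs(document_text) if lang.prob > 0.1]
# e.g. [en:0.86, pl:0.13] -> only check non-dictionary words against wordlists for those languages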
** EDIT **
Idea 2:
You say this is bibliographic information. Meta-information, like its position in the text or any font information your OCR software is returning to you, is almost certainly more important than the series of characters you see showing up. If it's in the title, or near the position where the author's name goes, or in italics, it's worth considering as foreign...
