I am using spaCy's NLP model to work out the POS of input data so that my Markov chains can be a bit more grammatically correct, as with the example in the Python markovify library found here. However, the way that spaCy splits tokens makes it difficult to reconstruct them, because certain grammatical elements are also split up: for example, "don't" becomes ["do", "n't"]. This means that you can't rejoin generated Markov chains simply with spaces anymore; you need to know whether adjacent tokens make up one word.
I assumed that the is_left_punct and is_right_punct properties of tokens might relate to this, but they don't seem to. My current code simply accounts for PUNCT tokens, but the ["do", "n't"] problem persists.
Is there a property of the tokens that I can use to tell the method that joins sentences together when to omit spaces, or some other way to know this?
spaCy tokens have a whitespace_ attribute, which is always set.
You can always use that: it holds the actual trailing whitespace that followed the token in the original text, or an empty string when there was none.
The empty string shows up in cases like the one you mention, where the tokenisation splits a continuous string.
So for "don't", the whitespace_ of the "do" token will be the empty string.
For example
[bool(token.whitespace_) for token in nlp("don't")]
Should produce
[False, False]
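To actually rebuild text from the tokens, you can append each token's whitespace_ (or use text_with_ws). A minimal sketch, assuming the en_core_web_sm model is installed:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Don't panic, it's fine.")

# Rebuild the original string: each token followed by its own trailing whitespace.
rebuilt = "".join(token.text + token.whitespace_ for token in doc)
assert rebuilt == doc.text  # exact reconstruction, no guessing about spaces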
I have a text to parse that contains a certain amount of material that is irrelevant to the parsing. For this reason I would like to be able to tokenize as "TEXT" anything that does not follow the specific patterns I am looking for.
For example, let's say I am looking for the sequences "HELP!" and "OVER HERE!". I would like the sequence "some random text HELP! lorem ipsum" to be tokenized as:
(TEXT, 'some random text '), (HELP, 'HELP!'), (TEXT, ' lorem ipsum').
If I do that:
import ply.lex as lex

tokens = (
    'TEXT',
    'SIGNAL1',
    'SIGNAL2'
)

t_SIGNAL1 = "HELP!"
t_SIGNAL2 = "OVER HERE!"
t_TEXT = r'[\s\S]+'

data = "some random text HELP! lorem ipsum"

lexer = lex.lex()
lexer.input(data)

while True:
    tok = lexer.token()
    if not tok:
        break  # No more input
    print(tok)
It fails, of course, because the TEXT token grabs the whole text.
I could change the regex for t_TEXT into something fancier, but as I have a dozen or more different specific sequences I want to capture, it would be completely unreadable.
I feel like there should be an easy solution for that, but can't figure one out.
Ply's lexer tries patterns in a determined order, which can be exploited to define a default token. But there are a couple of down-sides to this approach.
The defined order is:
1. Ignored characters, from the definition of t_ignore.
2. Tokens matched by a token function, in order by function definition order.
3. Tokens matched by a token variable, in reverse order by regular expression length.
4. Literal characters, from the definition of literals.
5. The t_error function, which is called if none of the above match.
Aside from t_error, all of the above are conditional on some pattern matching at least one character at the current input point. So the only reliable fallback (default) pattern would be one which matches any single character: (?s:.) (or just ., if you're willing to globally set the re.S flag). That could be used as the last token function, provided you don't use any token variables nor literal characters, or it could be used as a token variable, provided that it is shorter than any other token variable pattern, and that you don't use literal characters. (That might be easier if you could use ., but you'd still need to ensure that no other variable's pattern had a length of 1.)
The main problem with this approach (other than the inefficiencies it creates) is that the default token is just one character long. In order to implement something like your TEXT token, consisting of the entire sea between the islands you want to parse, you would need to consolidate sequences of consecutive TEXT tokens into a single token. This could be done reasonably easily in your parser, or it could be done using a wrapper around the lexical scanner, but either way it's an additional complication.
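For concreteness, here is a sketch of that approach using the question's two signal patterns: every rule is a function (as noted above, the fallback must come last and there must be no token variables or literals), and a small wrapper of my own glues consecutive one-character TEXT tokens back together:
import ply.lex as lex

tokens = ('SIGNAL1', 'SIGNAL2', 'TEXT')

def t_SIGNAL1(t):
    r'HELP!'
    return t

def t_SIGNAL2(t):
    r'OVER\ HERE!'
    return t

# Single-character fallback; defined last so every other rule is tried first.
def t_TEXT(t):
    r'(?s:.)'
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

def merged_tokens(lexer, data):
    # Glue runs of consecutive TEXT tokens into a single TEXT token.
    lexer.input(data)
    pending = None
    for tok in iter(lexer.token, None):
        if tok.type == 'TEXT':
            if pending is None:
                pending = tok
            else:
                pending.value += tok.value
            continue
        if pending is not None:
            yield pending
            pending = None
        yield tok
    if pending is not None:
        yield pending

for tok in merged_tokens(lexer, "some random text HELP! lorem ipsum"):
    print(tok)
Note the escaped space in OVER\ HERE!: Ply compiles rule patterns with re.VERBOSE, so literal spaces must be escaped (or written as \s).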
Alternatively, you could use t_error as a fallback. t_error is only called if nothing else matches, and if t_error returns a token, the Ply lexer will use that token. So in some ways, t_error is the ideal fallback. (But note that Ply does consider t_error to indicate an error. For example, it will be logged as an error, if you've enabled debug logging.)
The advantage of this approach is that the t_error function can absorb as many input characters as desired, using whatever mechanism you consider most appropriate. In fact, it must do this, by explicitly incrementing the value of t.lexer.lexpos (which is what the skip method does); otherwise, an exception will be raised.
But there's a problem: before calling t_error(t), the lexer sets t.value to (a copy of) the input string starting at the current input point. If t_error is called frequently, the cost of these copies could add up, possibly even turning the parse from linear time to quadratic time.
That doesn't free you from the problem of figuring out what the extent of the fallback token should be. As mentioned, t_error is not limited to the use of a precise regular expression, but it's not always obvious what other mechanism could be used.
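As a sketch of that idea (the island_re helper below is my own scaffolding, not something Ply provides): t_error looks ahead for the next known pattern, absorbs everything up to it as a TEXT token, and advances lexpos itself:
import re
import ply.lex as lex

tokens = ('SIGNAL1', 'SIGNAL2', 'TEXT')

t_SIGNAL1 = r'HELP!'
t_SIGNAL2 = r'OVER\ HERE!'

# Hand-maintained regex for "the next interesting token"; used only by t_error.
island_re = re.compile(r'HELP!|OVER HERE!')

def t_error(t):
    m = island_re.search(t.lexer.lexdata, t.lexer.lexpos)
    end = m.start() if m else len(t.lexer.lexdata)
    t.type = 'TEXT'
    t.value = t.lexer.lexdata[t.lexer.lexpos:end]
    t.lexer.lexpos = end  # must advance, or Ply raises an error
    return t

lexer = lex.lex()
lexer.input("some random text HELP! lorem ipsum")
for tok in iter(lexer.token, None):
    print(tok)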
So that brings us to the third possibility, which is to construct a regular expression which actually matches the text between useful tokens.
In most cases, that can actually be done mechanically, given that all the token patterns are available, either as the value of specific member variables or as the docstring of specific member functions. If you have this list of regular expression strings, you can create a pattern which will match the text up to the first such match, using a lookahead assertion:
from ply.lex import TOKEN

# This leaves out the construction of the list of patterns.
@TOKEN(f".*?(?={'|'.join(f'(?:{p})' for p in patterns)})")
def t_TEXT(t):
    return t
Note that patterns must also include a pattern which matches the character sets of t_ignore and literals.
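Putting the pieces together, a self-contained sketch of this third approach (my own variant: it assumes no t_ignore and no literals, defines the islands as functions so they take priority over TEXT, and adds \Z to the lookahead so trailing text after the last island is still matched):
import ply.lex as lex
from ply.lex import TOKEN

tokens = ('SIGNAL1', 'SIGNAL2', 'TEXT')

# Islands are defined as functions *before* t_TEXT so their alternatives come
# first in the master regular expression.
def t_SIGNAL1(t):
    r'HELP!'
    return t

def t_SIGNAL2(t):
    r'OVER\ HERE!'
    return t

# Collect the island patterns from the docstrings, then match everything up to
# the next island (or the end of input, so trailing text is kept too).
patterns = [t_SIGNAL1.__doc__, t_SIGNAL2.__doc__]
text_re = ".*?(?=" + "|".join(f"(?:{p})" for p in patterns) + r"|\Z)"

@TOKEN(text_re)
def t_TEXT(t):
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("some random text HELP! lorem ipsum")
for tok in iter(lexer.token, None):
    print(tok)
The ordering matters: if t_TEXT were tried before the islands, its pattern could match an empty string right in front of an island and the lexer would never advance.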
I'm building a simple spam classifier, and from a cursory look at my dataset, most spam messages put spaces between "spammy" words, which I assume is meant to bypass spam classifiers. Here are some examples:
c redi t card
mort - gage
I would like to be able to take these and encode them in my dataframe as the correct words:
credit card
mortgage
I'm using Python by the way.
This depends a lot on whether you have a list of all spam words or not.
If you do have a list of spam words and you know that there are always only ADDED spaces (e.g. give me your cred it card in formation) but never MISSING spaces (e.g. give me yourcredit cardinformation), then you could use a simple rule-based approach:
import itertools

spam_words = {"credit card", "rolex"}
spam_words_no_spaces = {"".join(s.split()) for s in spam_words}

sentence = "give me your credit car d inform ation and a rol ex"
tokens = sentence.split()

# Check every unique order-preserving combination of tokens against the
# space-stripped spam words.
for length in range(1, len(tokens) + 1):
    for t in set(itertools.combinations(tokens, length)):
        if "".join(t) in spam_words_no_spaces:
            print(t)
Which prints:
> ('rol', 'ex')
> ('credit', 'car', 'd')
So first create a set of all spam words, then, for easier comparison, remove all spaces from them (although you could adjust the method to only consider correctly spaced spam words).
Then split the sentence into tokens, and finally generate all possible unique order-preserving combinations of tokens (from one-word sequences up to the whole sentence without whitespace) and check whether any of them is in the set of spam words.
If you don't have a list of spam words, your best chance would probably be to do general whitespace correction on the data. Check out Optical Character Recognition (OCR) post-correction, which you can find some pretrained models for. Also check out this thread, which talks about how to add spaces to spaceless text and even mentions a Python package for that. So in theory you could remove all spaces and then try to split the text again into meaningful words, to increase the chance that the spam words are found. Generally your problem (and the opposite one, missing whitespace) is called word boundary detection, so you might want to check some resources on that.
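As a small illustration of the "remove the spaces, then re-split into words" idea, here is a sketch using the third-party wordsegment package (just one option; not necessarily the package the linked thread mentions):
from wordsegment import load, segment  # pip install wordsegment

load()  # loads the word-frequency data the segmenter relies on
text = "c redi t card"
collapsed = "".join(text.split())  # -> "creditcard"
print(segment(collapsed))          # e.g. ['credit', 'card']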
Also, you should be aware that modern pretrained models such as common transformer models often use sub-token-level embeddings for unknown words, so they can still fairly easily combine what they have learned about the split and non-split versions of a common word.
I'm trying to split words like 'olive-oil', 'high-fat' and 'all-purpose', which are tokenized as one chunk.
The desired tokenization would be:
['olive','-','oil','high','-','fat','all','-','purpose']
I looked into retokenizer and the usage was like below.
doc = nlp("I live in NewYork")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    attrs = {"POS": ["PROPN", "PROPN"],
             "DEP": ["pobj", "compound"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
As you can see in the last line,
to retokenize a chunk into pieces, I have to provide what the result should be.
I don't think this is an efficient way of processing words, because it means I would have to manually type out all the possibilities, which does not seem like a feasible plan.
If I have to know all the cases and provide the end result one by one anyway,
it would probably be more efficient to just find the words to be replaced and replace them with what I want manually.
I believe there must be a way to generalize this.
If anyone knows a way to tokenize the words I listed at the top, could you help me?
Thank you
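One way this could be generalized without listing every hyphenated word is to change the tokenizer itself rather than retokenizing afterwards, for example by adding a hyphen infix rule. A sketch only: it assumes the en_core_web_sm model is installed, and depending on your spaCy version the default rules may already split these.
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Split on a hyphen that sits between two letters, for every word at once.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])-(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("olive-oil high-fat all-purpose")])
# expected: ['olive', '-', 'oil', 'high', '-', 'fat', 'all', '-', 'purpose']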
I am looking for something slightly more reliable for unpredictable strings than just checking if "word" in "check for word".
To paint an example, let's say I have the following sentence:
"Learning Python!"
If the sentence contains "Python", I'd want it to evaluate to true, but what if it were:
"Learning #python!"
Doing a split with a space as a delimiter would give me ["learning", "#python"], which does not match "python".
(Note: While I do understand that I could remove the # for this particular case, the problem with this is that 1. I am tagging programming languages and don't want to strip out the # in C#, and 2. this is just an example case; there are a lot of different ways human-typed titles could include these hints that I'd still like to catch.)
I'd basically like to check whether, beyond reasonable doubt, the sequence of characters I'm looking for is there, despite any weird ways it might be mentioned. What are some ways to do this? I have looked at fuzzy search a bit, but I haven't seen any use cases for looking for single words.
The end goal here is that I have tags of programming languages, and I'd like to take in the titles of people's streams and tag the language if it's mentioned in the title.
This code prints True if the input string contains "python", ignoring case.
import re
input = "Learning Python!"
print(re.search("python", input, re.IGNORECASE) is not None)
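If a plain substring search is too loose (for example, a tag like "c" would match inside "code"), one possible refinement (my own sketch, not part of the answer above) is to require that the tag is not glued to other letters or digits, which still catches "#python" and keeps the "#" in "C#":
import re

def mentions_tag(title, tag):
    # The tag may be surrounded by punctuation (#python, C#!), but not by
    # further letters/digits, so "c" will not match inside "code".
    pattern = rf"(?<![A-Za-z0-9]){re.escape(tag)}(?![A-Za-z0-9])"
    return re.search(pattern, title, re.IGNORECASE) is not None

print(mentions_tag("Learning #python!", "python"))    # True
print(mentions_tag("Live coding in C# today", "c#"))  # True
print(mentions_tag("Cleaning up my code base", "c"))  # False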
What would be the best strategy to mask only specific words during LM training?
My aim is to dynamically mask, at batch time, only words of interest which I have previously collected in a list.
I have already had a look at the mask_tokens() function in the DataCollatorForLanguageModeling class, which is the function that actually masks the tokens during each batch, but I cannot find any efficient and smart way to mask only specific words and their corresponding IDs.
I tried a naive approach consisting of matching all the IDs of each batch against a list of word IDs to mask. However, a for-loop approach has a negative impact on performance.
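For illustration, a rough sketch of how that ID matching could be vectorized with torch.isin instead of a Python loop (the function and variable names are hypothetical, and it skips the usual random-replacement details of mask_tokens()):
import torch

def mask_only_listed_ids(input_ids, maskable_ids, mask_token_id, mlm_probability=0.15):
    # Positions whose token ID is in the allow-list are candidates for masking;
    # everything else is left untouched and ignored by the loss (-100).
    labels = input_ids.clone()
    candidates = torch.isin(input_ids, maskable_ids)
    probs = torch.full(input_ids.shape, mlm_probability) * candidates
    masked = torch.bernoulli(probs).bool()
    inputs = input_ids.clone()
    inputs[masked] = mask_token_id
    labels[~masked] = -100
    return inputs, labels

# Hypothetical usage on a (batch, seq_len) tensor of token IDs:
batch = torch.tensor([[846, 44068, 6374, 2], [468, 44068, 6374, 2]])
ids_to_mask = torch.tensor([846, 468, 44068, 6374])
inputs, labels = mask_only_listed_ids(batch, ids_to_mask, mask_token_id=50264)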
Side issue about word prefixed space - Already fixed
Thanks to #amdex1 and #cronoik for helping with a side issue.
This problem arose because the tokenizer not only splits a single word into multiple tokens, but also adds special characters if the word does not occur at the beginning of a sentence.
E.g.:
The word "Valkyria":
at the beginning of a sentence gets split as ['V', 'alky', 'ria'], with corresponding IDs [846, 44068, 6374],
while in the middle of a sentence it gets split as ['ĠV', 'alky', 'ria'], with corresponding IDs [468, 44068, 6374].
It is solved by setting add_prefix_space=True in the tokenizer.
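For reference, a minimal sketch of that fix (assuming a byte-level BPE tokenizer such as RoBERTa's; the model name is only an example):
from transformers import AutoTokenizer

# With add_prefix_space=True, a word is tokenized the same way whether it
# appears at the start of a sentence or in the middle of one.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(tokenizer.tokenize("Valkyria"))
print(tokenizer.tokenize("The Valkyria"))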