Split joined/concatenated words from different languages - Python

I'm trying to split words from different languages that are joined.
My expected result:
input = ['françaisenglishtext']
output = ['français','english','text']
The first word in the output is French and the other words are English.
I tried to use the Python wordninja library from this question: How to split text without spaces into list of words, by first splitting with it (it only knows English) and then using pyenchant to remove the non-English tokens and keep only the English words.
The problem with this method is that wordninja also splits the French part as if it were English, so I cannot tell which part of the output was French after splitting. It also drops the special French characters.
My current result:
result_with_wordninja = ['fran', 'a', 'is', 'english', 'text']
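For reference, here is roughly what that attempt looks like (assuming wordninja and pyenchant are installed; the exact filtered output depends on the dictionary):
import wordninja
import enchant

english = enchant.Dict("en_US")

pieces = wordninja.split('françaisenglishtext')
print(pieces)  # ['fran', 'a', 'is', 'english', 'text'] -- the French part is shredded

# keep only the tokens pyenchant recognises as English
english_only = [p for p in pieces if english.check(p)]
print(english_only)  # the French word can no longer be recovered from what is left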
Finally, I tried to change the dictionary of wordninja, but I still face the same problem.
I've also checked this answer: Split a paragraph containing words in different languages, but it doesn't work for my case since my list contains only Latin characters.
Is there a specific library or method that can split joined words from different languages?
Thank you,

Related

Is there a way to detect if unnecessary characters are added to strings to bypass spam detection?

I'm building a simple spam classifier, and from a cursory look at my dataset, most spam messages put spaces in between "spammy" words, which I assume is meant to bypass the spam classifier. Here are some examples:
c redi t card
mort - gage
I would like to be able to take these and encode them in my dataframe as the correct words:
credit card
mortgage
I'm using Python by the way.
This depends a lot on whether you have a list of all spam words or not.
If you do have a list of spam words and you know that there are always only ADDED spaces (e.g. give me your cred it card in formation) but never MISSING spaces (e.g. give me yourcredit cardinformation), then you could use a simple rule-based approach:
import itertools

spam_words = {"credit card", "rolex"}
# remove the spaces so e.g. "credit card" becomes "creditcard"
spam_words_no_spaces = {"".join(s.split()) for s in spam_words}

sentence = "give me your credit car d inform ation and a rol ex"
tokens = sentence.split()
for length in range(1, len(tokens) + 1):  # include the whole sentence as one candidate
    for t in set(itertools.combinations(tokens, length)):
        if "".join(t) in spam_words_no_spaces:
            print(t)
Which prints:
> ('rol', 'ex')
> ('credit', 'car', 'd')
So first create a set of all spam words, then remove all spaces from them for an easier comparison (although you could adjust the method to only consider correctly spaced spam words).
Then split the sentence into tokens, and finally generate all unique subsequences of the token list (in order, though not necessarily adjacent, from single words up to the whole sentence without whitespace) and check whether each one is in the set of space-free spam words.
If you don't have a list of spam words, your best bet would probably be to do general whitespace correction on the data. Check out Optical Character Recognition (OCR) post-correction, for which you can find some pretrained models. Also check out this thread, which talks about how to add spaces to spaceless text and even mentions a Python package for that. So in theory you could remove all spaces and then try to split the text again into meaningful words, to increase the chance that the spam words are found. Generally your problem (and the opposite one, missing whitespace) is called word boundary detection, so you might want to check some resources on that.
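Here is a hedged sketch of the "remove all spaces, then re-split" idea, using the wordninja package mentioned in that thread (assuming it is installed):
import wordninja

spammy = "give me your cred it card in formation"
collapsed = spammy.replace(" ", "")  # 'givemeyourcreditcardinformation'
print(wordninja.split(collapsed))
# something like ['give', 'me', 'your', 'credit', 'card', 'information']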
Also, you should be aware that modern pretrained models, such as the common transformer models, often use sub-token-level embeddings for unknown words, so they can still, relatively easily, combine what they learned from the split and non-split versions of a common word.
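To illustrate the sub-token idea (this assumes the Hugging Face transformers package is installed and can download the bert-base-uncased tokenizer; the exact pieces depend on its vocabulary):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("creditcard"))
# something like ['credit', '##card'] -- the unknown word is split into known sub-tokens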

Regex for matching exact words that contain apostrophes in Python?

For the purpose of this project, I'm using more exact regex expressions rather than more general ones. I'm counting occurrences of words from a list of words in a text file that I import into my script, called vocabWords, where each word in the list is in the format \bword\b.
When I run my script, \bwhat\b will pick up the words "what" and "what's", but \bwhat's\b will pick up no words. If I switch the order so the apostrophe word comes before the root word, the words are counted correctly. How can I change my regex list so the words are counted correctly? I understand the problem is with "\b", but I haven't been able to find out how to fix this. I cannot use a more general regex, and I have to include the words themselves in the regex pattern.
vocabWords:
\bwhat\b
\bwhat's\b
\biron\b
\biron's\b
My code:
import re

matched = []
regex_all = re.compile('|'.join(vocabWords))
for row in df['test']:  # df is a pandas DataFrame with a 'test' column of strings
    matched.append(re.findall(regex_all, row))
There are at least two other solutions:
Test that the next symbol isn't an apostrophe: r"\bwhat(?!')\b"
Use a more general rule, r"\bwhat(?:'s)?\b", to catch both variants, with and without the apostrophe.
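A quick demonstration of both variants on a made-up sentence:
import re

text = "what's done is what it is"
print(re.findall(r"\bwhat(?!')\b", text))    # ['what'] -- skips "what's"
print(re.findall(r"\bwhat(?:'s)?\b", text))  # ["what's", 'what']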
If you sort your wordlist by length before turning it into a regexp, longer words (like "what's") will precede shorter words (like "what"). This should do the trick.
regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
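A small check of the difference the sorting makes, on a made-up sentence with the vocabWords from the question:
import re

vocabWords = [r"\bwhat\b", r"\bwhat's\b", r"\biron\b", r"\biron's\b"]
text = "what's done is done, iron sharpens iron"

unsorted_re = re.compile('|'.join(vocabWords))
sorted_re = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))

print(unsorted_re.findall(text))  # ['what', 'iron', 'iron'] -- "what's" is truncated
print(sorted_re.findall(text))    # ["what's", 'iron', 'iron']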

How to handle with words which have space between characters?

I am using nltk.word_tokenize for the Dari language. The problem is that some single words contain a space.
For example, the word "زنده گی", which means life. There are many other words like this: whenever the first part of a word ends with the character "ه", we have to put a space before the next part, otherwise the two parts get joined, as in "زندهگی".
Can anyone help me, using regex or any other way, so that the tokenizer does not split a word when one part ends with "ه" and the next part starts with "گ"?
To resolve this problem, in Persian we have a character called the zero-width non-joiner (نیم‌فاصله in Persian, also known as a half space or semi space), which has two symbol codes. One is standard and the other is not standard but widely used:
\u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
\u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
As far as I know, Dari is very similar to Persian. So first of all you should correct all words like زنده گی to زنده‌گی, converting the wrong spaces to half spaces; then you can simply use this regex to match all the words of a sentence:
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
Online demo (the black bullet in the test string is the half space, which regex101 does not render; but if you check the match information panel and look at Match 5, you will see it is matched correctly).
For converting the wrong spaces of a large text to half spaces, there is an add-on for Microsoft Word called Virastyar, which is free and open source. You can install it and refine your whole text. But keep in mind that this add-on was created for Persian, not Dari. For example, in Persian we write زنده‌گی as زندگی, so it cannot correct that particular word for you. But other words like می شود are easily corrected and converted to می‌شود. You can also add custom words to its database.
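Here is a small Python sketch of the idea (the specific ه/گ replacement rule is an assumption based on the question; the character class is the regex above):
import re

text = "زنده گی خوب است"

# replace the plain space between a part ending in "ه" and a part starting
# with "گ" by a zero-width non-joiner (U+200C) -- this rule is an assumption
normalized = re.sub(r"(ه)\s+(گ)", "\\1\u200c\\2", text)

# match whole words, treating the half space as part of the word
words = re.findall("[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+", normalized)
print(words)  # three tokens; the first is زنده + ZWNJ + گی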

How to tokenize continuous words with no whitespace delimiters?

I'm using Python with nltk. I need to process some English text that contains no whitespace, but the word_tokenize function in nltk can't deal with a problem like this. So how can I tokenize text without any whitespace? Are there any tools for this in Python?
I am not aware of such tools, but the solution to your problem depends on the language.
For the Turkish language you can scan the input text letter by letter and accumulate the letters into a word. When you are sure that the accumulated letters form a valid word from a dictionary, you save it as a separate token, clear the buffer to accumulate the next word, and continue the process.
You can try this for English, but I assume you may run into situations where the ending of one word is also the beginning of some dictionary word, and this can cause you problems.
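A minimal sketch of that letter-by-letter scan, with a tiny hard-coded dictionary standing in for a real word list:
dictionary = {"the", "quick", "brown", "fox"}

def greedy_split(text, words):
    tokens, buffer = [], ""
    for ch in text:
        buffer += ch
        if buffer in words:      # the accumulated letters form a valid word
            tokens.append(buffer)
            buffer = ""          # start accumulating the next word
    return tokens, buffer        # buffer keeps any unmatched remainder

print(greedy_split("thequickbrownfox", dictionary))
# (['the', 'quick', 'brown', 'fox'], '')
Note that this greedy version runs into exactly the problem described above when a prefix of one word is itself a dictionary word.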

Splitting words in running text using Python?

I am writing a piece of code which will extract words from running text. The text can contain delimiters like \r, \n, etc.
I want to discard all these delimiters and extract only full words. How can I do this with Python? Is there any library available for crunching text in Python?
Assuming your definition of "word" agrees with that of the regular expression module (re), that is, letters, digits and underscores, it's easy:
import re
fullwords = re.findall(r'\w+', thetext)
where thetext is the string in question (e.g., coming from an f.read() of a file object f open for reading, if that's where you get your text from).
If you define words differently (e.g. you want to include apostrophes so for example "it's" will be considered "one word"), it isn't much harder -- just use as the first argument of findall the appropriate pattern, e.g. r"[\w']+" for the apostrophe case.
If you need to be very, very sophisticated (e.g., deal with languages that use no breaks between words), then the problem suddenly becomes much harder and you'll need some third-party package like nltk.
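For example, on a made-up sample string:
import re

thetext = "It's a test -- don't split contractions"
print(re.findall(r"\w+", thetext))     # ['It', 's', 'a', 'test', 'don', 't', 'split', 'contractions']
print(re.findall(r"[\w']+", thetext))  # ["It's", 'a', 'test', "don't", 'split', 'contractions']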
Assuming your delimiters are whitespace characters (like space, \r and \n), then basic str.split() does what you want:
>>> "asdf\nfoo\r\nbar too\tbaz".split()
['asdf', 'foo', 'bar', 'too', 'baz']
