Half space (\u200c) not supported in CountVectorizer - Python

In Python's CountVectorizer (from scikit-learn), I want Persian words that contain a half space (zero-width non-joiner) to be kept as a single token instead of being split into two words.
I would be grateful for any guidance.
Thank you.
I passed "درخت‌های زیبا" to CountVectorizer. I wanted it to become ["درخت‌های","زیبا"], but it turned into ["درخت","ها","زیبا"].
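One possible workaround (a sketch, assuming this is scikit-learn's CountVectorizer): the default token_pattern treats the zero-width non-joiner U+200C as a word boundary, so you can pass a custom pattern that allows it inside a token:
from sklearn.feature_extraction.text import CountVectorizer

# allow \u200c (the half space) inside tokens instead of splitting on it
vectorizer = CountVectorizer(token_pattern=r"(?u)[\w\u200c]+")
vectorizer.fit(["درخت‌های زیبا"])
print(vectorizer.get_feature_names_out())  # expected: ['درخت‌های', 'زیبا'] (recent scikit-learn)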

Related

Word Tokenization When There is No Space

I am wondering what the term is in Machine Learning, Deep Learning, or Natural Language Processing for splitting the words in a paragraph when there is no space between them.
example:
"iwanttocook"
become:
"i want to cook"
It wouldn't be easy, since there is no delimiter character to tokenize on.
I would appreciate any help.
You could achieve this using the polyglot package. There is an option for morphological analysis.
This kind of analysis is based on Morfessor models trained on the most frequent words in order to identify morphemes ("primitive units of syntax, the smallest individually meaningful elements in the utterances of a language").
From the documentation:
from polyglot.text import Text
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
print(text.morphemes)
The output would be:
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
Note that if you want to start working with polyglot, you should first read the documentation carefully, as there are a few things to consider, for example the downloading of language specific models.
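The language-specific models are downloaded separately; for English morphology that would presumably be something along these lines (the package name morph2.en is taken from the polyglot documentation, so treat it as an assumption):
polyglot download morph2.en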
I created a simple application in Scala to do this.
https://github.com/shredder47/Nonspaced-Sentence-Tokenizer

Find similar/synonyms/context words Python

Hello, I'm looking for a solution to my issue:
I want to find a list of similar words in both French and English.
For example:
Name could be: first name, last name, nom, prénom, username...
Postal address could be: city, country, street, ville, pays, code postale...
The other answer, and comments, describe how to get synonyms, but I think you want more than that?
I can suggest two broad approaches: WordNet and word embeddings.
Using nltk and wordnet, you want to explore the adjacent graph nodes. See http://www.nltk.org/howto/wordnet.html for an overview of the functions available. I'd suggest that once you've found your start word in Wordnet, follow all its relations, but also go up to the hypernym, and do the same there.
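As a rough sketch of that idea (assuming the WordNet data has been fetched with nltk.download('wordnet')), you can collect the lemmas of each sense together with the lemmas of its hypernyms and hyponyms:
from nltk.corpus import wordnet as wn

candidates = set()
for synset in wn.synsets('address'):
    # the synset's own lemmas...
    candidates.update(lemma.name() for lemma in synset.lemmas())
    # ...plus the lemmas of neighbouring nodes in the graph
    for neighbour in synset.hypernyms() + synset.hyponyms():
        candidates.update(lemma.name() for lemma in neighbour.lemmas())
print(sorted(candidates))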
Finding the start word is not always easy:
http://wordnetweb.princeton.edu/perl/webwn?s=Postal+address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
Instead it seems I have to use "address": http://wordnetweb.princeton.edu/perl/webwn?s=address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
and then decide which of those is the correct sense here. Then try clicking the hypernym, hyponym, sister term, etc.
To be honest, none of those feels quite right.
Open Multilingual WordNet tries to link different languages. http://compling.hss.ntu.edu.sg/omw/ So you could take your English WordNet code, and move to the French WordNet with it, or vice versa.
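NLTK also ships the Open Multilingual WordNet data (nltk.download('omw-1.4')), so one way to hop between languages, sketched here, is to read the French lemma names attached to an English synset:
from nltk.corpus import wordnet as wn

for synset in wn.synsets('address', pos=wn.NOUN)[:3]:
    # French lemmas linked to this English synset through OMW
    print(synset.name(), synset.lemma_names('fra'))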
The other approach is to use word embeddings. You find the (say, 300-dimensional) vector of your source word, and then hunt for the nearest words in that vector space. This will return words that are used in similar contexts, so they could be similar in meaning, or just similar syntactically.
Spacy has a good implementation, see https://spacy.io/usage/spacy-101#vectors-similarity and https://spacy.io/usage/vectors-similarity
Regarding English and French, normally you would work in the two languages independently. But if you search for "multilingual word embeddings" you will find some papers and projects where the vector stays the same for the same concept in different languages.
Note: the API is geared towards telling you how two words are similar, not finding similar words. To find similar words you need to take your vector and compare with every other word vector, which is O(N) in the size of the vocabulary. So you might want to do this offline, and build your own "synonyms-and-similar" dictionary for each word of interest.
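A rough sketch of that offline nearest-neighbour scan with spaCy, assuming a model that ships with vectors (e.g. en_core_web_md) is installed:
import spacy

nlp = spacy.load("en_core_web_md")  # python -m spacy download en_core_web_md

# brute-force search over the whole vector table (the O(N) scan mentioned above)
query = nlp.vocab["address"].vector.reshape(1, -1)
keys, _, scores = nlp.vocab.vectors.most_similar(query, n=10)
print([nlp.vocab.strings[int(k)] for k in keys[0]])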
from PyDictionary import PyDictionary
dictionary=PyDictionary()
answer = dictionary.synonym(word)
word is the word for which you are finding the synonyms.

How to calculate the distance in meaning of two words in Python

I am wondering if it's possible to calculate the distance/similarity between two related words in Python (like "fraud" and "steal"). These two words are not synonymous per se but they are clearly related. Are there any concepts/algorithms in NLP that can show this relationship numerically? Maybe via NLTK?
I'm not looking for the Levenshtein distance as that relates to the individual characters that make up a word. I'm looking for how the meaning relates.
Would appreciate any help provided.
My suggestion is as follows:
Put each word through the same thesaurus, to get a list of synonyms.
Get the size of the overlap between the two lists of synonyms.
That is a measure of similarity between the words.
If you would like to do a more thorough analysis:
Also get the antonyms for each of the two words.
Get the size of the intersection of the sets of antonyms for the two words.
If you would like to go even further:
Put each word through the same thesaurus, to get a list of synonyms.
Use the top n (=5, or whatever) words from the query result to initiate a new query.
Repeat this to a depth you feel is adequate.
Make a collection of synonyms from the repeated synonym queries.
Get the size of the overlap between the two collections of synonyms.
That is a measure of similarity between the words.
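A minimal sketch of the simple version of this idea, using NLTK's WordNet as the thesaurus (the choice of thesaurus is an assumption; the answer doesn't name one):
from nltk.corpus import wordnet as wn

def synonym_overlap(word1, word2):
    # each word's synonym set = all lemma names across all of its senses
    syns1 = {lemma.name() for s in wn.synsets(word1) for lemma in s.lemmas()}
    syns2 = {lemma.name() for s in wn.synsets(word2) for lemma in s.lemmas()}
    if not syns1 or not syns2:
        return 0.0
    # shared synonyms, normalised by the size of the union
    return len(syns1 & syns2) / len(syns1 | syns2)

print(synonym_overlap('fraud', 'steal'))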
NLTK's wordnet is the tool you'd want to use for this. First get the set of all the senses of each word using:
synonymSet = wordnet.synsets(word)
Then loop through each possible sense of each of the 2 words and compare them to each other in a nested loop:
similarity = synonym1.res_similarity(synonym2,semcor_ic)
Either average that value or use the maximum you find; up to you.
This example is using a word similarity comparison that uses "IC" or information content. This will score similarity higher if the word is more specific, or contains more information, so generally it's closer to what we mean when we think about word similarity.
To use this stuff you'll need the imports and variables:
import nltk
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
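Putting those pieces together, a rough end-to-end sketch (taking the maximum over noun senses, and assuming the wordnet, wordnet_ic and SemCor data have been downloaded via nltk.download):
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic

semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def word_similarity(word1, word2):
    best = 0.0
    # Resnik similarity is only defined between senses of the same part of speech,
    # so compare noun senses with noun senses
    for syn1 in wordnet.synsets(word1, pos=wordnet.NOUN):
        for syn2 in wordnet.synsets(word2, pos=wordnet.NOUN):
            best = max(best, syn1.res_similarity(syn2, semcor_ic))
    return best

print(word_similarity('fraud', 'steal'))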
As @jose_bacoy suggested above, the Gensim library can provide a measure of similarity between words using the word2vec technique. The example below is modified from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download training data from Google News, and load it up
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. To use your example:
word_vectors.similarity('fraud', 'steal')
>>> 0.19978741
A similarity of about twenty percent may seem surprisingly low if you consider these words to be related. But consider that fraud is a noun while steal is generally a verb, which gives them very different associations as viewed by word2vec.
They become much more similar if you modify the noun to become a verb:
word_vectors.similarity('defraud', 'steal')
>>> 0.43293646

Python treat multiple words as single

Is there any way to treat multiple words as a single token in Python? I've written a script to find the TF-IDF value of words in a collection of documents. The problem is that it gives the TF-IDF for individual words, but there are cases where I have to treat multiple words as one: terms like Big Data and Machine Learning should be treated as single tokens, and TF-IDF scores should be calculated for them. Any help would be highly useful.
I would approach it using scikit-learn and the TfidfVectorizer. Tweaking some of its parameters basically allows you to do all the work.
It's hard to show its functionality without a good example, though.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["lots of text", "more documents here"]  # fit_transform expects an iterable of documents, not a single string
vectorizer = TfidfVectorizer(ngram_range=(2,2))
result = vectorizer.fit_transform(corpus)
Note that the ngram_range parameter lets you choose whether you are interested in, for example, bigrams or trigrams by specifying a range.
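For instance, with a toy corpus (illustrative only, not from the question), ngram_range=(1, 2) puts both single words and two-word phrases such as "machine learning" into the vocabulary, each with its own TF-IDF score:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "big data and machine learning are popular",
    "machine learning needs big data",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(corpus)
# vocabulary contains unigrams and bigrams, e.g. 'big data', 'machine learning'
print(vectorizer.get_feature_names_out())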

How to tokenize a Malayalam word?

ഇതുഒരുസ്ടലംമാണ്
itu oru stalam anu
This is a Unicode string meaning "this is a place".
import nltk
nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))
is not working for me.
nltk.word_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))
is also not working.
other examples
"കണ്ടില്ല " = കണ്ടു +ഇല്ല,
"വലിയൊരു" = വലിയ + ഒരു
Correct split:
ഇത് ഒരു സ്ഥാലം ആണ്
output:
[u'\u0d07\u0d24\u0d4d\u0d12\u0d30\u0d41\u0d38\u0d4d\u0d25\u0d32\u0d02\u0d06\u0d23\u0d4d']
I just need to split the words as shown in the other examples; the other-examples section is there for testing. The problem is not with Unicode, it is with the morphology of the language. For this you need to use a morphological analyzer.
Have a look at this paper.
http://link.springer.com/chapter/10.1007%2F978-3-642-27872-3_38
After a crash course in the language from Wikipedia (http://en.wikipedia.org/wiki/Malayalam), there are some issues with your question and with the tools you've requested for your desired output.
Conflated Task
Firstly, the OP conflated the tasks of morphological analysis, segmentation and tokenization. Often there is a fine distinction between them, especially for agglutinative languages such as Turkish/Malayalam (see http://en.wikipedia.org/wiki/Agglutinative_language).
Agglutinative NLP and best practices
Next, I don't think a tokenizer is appropriate for Malayalam, an agglutinative language. For Turkish, one of the most studied agglutinative languages in NLP, researchers adopted a different strategy when it comes to "tokenization": they found that a full-blown morphological analyzer is necessary (see http://www.denizyuret.com/2006/11/turkish-resources.html, www.andrew.cmu.edu/user/ko/downloads/lrec.pdf).
Word Boundaries
Tokenization is defined as the identification of linguistically meaningful units (LMU) from the surface text (see Why do I need a tokenizer for each language?), and different languages require different tokenizers to identify their word boundaries. People have approached the problem of finding word boundaries in different ways, but in summary the NLP community has subscribed to the following:
Agglutinative languages require a full-blown morphological analyzer trained with some sort of language model. There is often only a single tier when identifying what a token is, and that is at the morphemic level, hence the NLP community has developed different language models for their respective morphological analysis tools.
Polysynthetic languages with specified word boundaries have the choice of two-tier tokenization, where the system can first identify isolated words and then, if necessary, perform morphological analysis to obtain finer-grained tokens. A coarse-grained tokenizer can split a string using certain delimiters (e.g. NLTK's word_tokenize or wordpunct_tokenize, which use whitespace/punctuation for English). Then, for finer-grained analysis at the morphemic level, people usually use finite state machines to split words into morphemes (e.g. for German: http://canoo.net/services/WordformationRules/Derivation/To-N/N-To-N/Pre+Suffig.html).
Polysynthetic languages without specified word boundaries often require a segmenter first, to add whitespace between the tokens, because the orthography doesn't differentiate word boundaries (e.g. in Chinese: https://code.google.com/p/mini-segmenter/). Then, from the delimited tokens, morphemic analysis can be done if necessary to produce finer-grained tokens (e.g. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). Often these finer-grained tokens are tied to POS tags.
In brief, the answer to the OP's question is that the OP used the wrong tools for the task:
To output tokens for Malayalam, a morphological analyzer is necessary; a simple coarse-grained tokenizer in NLTK would not work.
NLTK's tokenizers are meant to tokenize polysynthetic languages with specified word boundaries (e.g. English/European languages), so it is not that the tokenizer is not working for Malayalam, it just wasn't meant to tokenize agglutinative languages.
To achieve the desired output, a full-blown morphological analyzer needs to be built for the language, and someone has built one (aclweb.org/anthology//O/O12/O12-1028.pdf); the OP should contact the author of the paper if he/she is interested in the tool.
Short of building a morphological analyzer with a language model, I encourage the OP to first look for common delimiters that split words into morphemes in the language, and then perform a simple re.split() to obtain a baseline tokenizer.
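As a toy illustration of that baseline (the delimiter used here, the vowel sign U+0D41, is only an assumption, borrowed from a later answer):
# -*- coding: utf-8 -*-
import re

text = u'ഇതുഒരുസ്ഥാലമാണ്'
# split wherever the assumed delimiter appears, dropping empty pieces
tokens = [t for t in re.split(u'\u0d41', text) if t]
print(tokens)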
A tokenizer is indeed the right tool; certainly this is what the NLTK calls them. A morphological analyzer (as in the article you link to) is for breaking words into smaller parts (morphemes). But in your example code, you tried to use a tokenizer that is appropriate for English: It recognizes space-delimited words and punctuation tokens. Since Malayalam evidently doesn't indicate word boundaries with spaces, or with anything else, you need a different approach.
So the NLTK doesn't provide anything that detects word boundaries for Malayalam. It might provide the tools to build a decent one fairly easily, though.
The obvious approach would be to try dictionary lookup: Try to break up your input into strings that are in the dictionary. But it would be harder than it sounds: You'd need a very large dictionary, you'd still have to deal with unknown words somehow, and since Malayalam has non-trivial morphology, you may need a morphological analyzer to match inflected words to the dictionary. Assuming you can store or generate every word form with your dictionary, you can use an algorithm like the one described here (and already mentioned by @amp) to divide your input into a sequence of words.
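A tiny sketch of that kind of dictionary-driven segmentation, using a toy dictionary and the English example from the related question above (a real Malayalam dictionary is the hard part):
def segment(text, dictionary):
    # dynamic-programming word break: best[i] is a segmentation of text[:i], or None
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in dictionary:
                best[i] = best[j] + [text[j:i]]
                break
    return best[-1]

print(segment("iwanttocook", {"i", "want", "to", "cook"}))  # ['i', 'want', 'to', 'cook']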
A better alternative would be to use a statistical algorithm that can guess where the word boundaries are. I don't know of such a module in the NLTK, but there has been quite a bit of work on this for Chinese. If it's worth your trouble, you can find a suitable algorithm and train it to work on Malayalam.
In short: The NLTK tokenizers only work for the typographical style of English. You can train a suitable tool to work on Malayalam, but the NLTK does not include such a tool as far as I know.
PS. The NLTK does come with several statistical tokenization tools; the PunktSentenceTokenizer can be trained to recognize sentence boundaries using an unsupervised learning algorithm (meaning you don't need to mark the boundaries in the training data). Unfortunately, the algorithm specifically targets the issue of abbreviations, and so it cannot be adapted to word boundary detection.
Maybe the Viterbi algorithm could help?
This answer to another SO question (and the other high-vote answer) could help: https://stackoverflow.com/a/481773/583834
It seems like your space is the unicode character u'\u0d41'. So you should split normally with str.split().
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
x = 'ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8')
y = x.split(u'\u0d41')
print " ".join(y)
[out]:
ഇത ഒര സ്ഥാലമാണ്
I tried the following:
# encoding=utf-8
import nltk
cheese = nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8'))
for var in cheese:
print var.encode('utf8'),
And as output, I got the following:
ഇത ു ഒര ു സ ് ഥ ാ ലമ ാ ണ ്
Is this anywhere close to the output that you want? I'm a little in the dark here, since it's difficult to get this right without understanding the language.
Morphological analysis example
from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")
Gives
[('കേരളം<np><genitive>', 179)]
URL: https://gitlab.com/smc/mlmorph
If you are using Anaconda, first install git in the Anaconda prompt:
conda install -c anaconda git
Then clone the repository using the following command:
git clone https://gitlab.com/smc/mlmorph.git
