I am working with a huge collection of documents written in several languages. I want to compute cosine distance between documents from their tf-idf scores. So far I have:
from sklearn.feature_extraction.text import TfidfVectorizer
# The documents are located in the same folder as the script
text_files = [r'doc1', r'doc2', r'doc3']
files = [open(f) for f in text_files]
documents = [f.read() for f in files]
vectorizer = TfidfVectorizer(ngram_range=(1,1))
tfidf = vectorizer.fit_transform(documents)
vocabulary = vectorizer.vocabulary_
When the three documents doc1, doc2 and doc3 contain English text, the algorithm works like a charm and vocabulary does indeed contain unigrams from the different bodies of text. I tried with Russian too, and it also worked great. However, when I try with some Japanese text, the algorithm no longer works as intended.
The problem arises from the fact that Japanese does not separate words with spaces, so TfidfVectorizer cannot tell where one word ends and the next begins. For example, I end up with something like this in my unigram vocabulary:
診多索いほ権込真べふり告車クノ般宮えぼぜゆ注携ゆクく供9時ク転組けが意見だっあ税新ト復生ひり教台話辞ゃに
which is clearly a sentence and not a word. How can I solve this problem?
You should provide a tokenizer for Japanese:
vectorizer = TfidfVectorizer(ngram_range=(1,1), tokenizer=jap_tokenizer)
where jap_tokenizer is either a function you create or one like this.
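For instance, a minimal sketch of such a tokenizer, assuming the janome morphological analyzer is installed (MeCab, TinySegmenter or any other Japanese segmenter would do the same job):
from janome.tokenizer import Tokenizer  # pip install janome
from sklearn.feature_extraction.text import TfidfVectorizer

_t = Tokenizer()

def jap_tokenizer(text):
    # return the surface form of each Japanese morpheme as a separate token
    return [token.surface for token in _t.tokenize(text)]

vectorizer = TfidfVectorizer(ngram_range=(1, 1), tokenizer=jap_tokenizer)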
This appears to be the English version of documents, basically:
documents = ['one word after another', 'two million more words', 'finally almost there']
For your Japanese documents, call them j_doc1, j_doc2, and j_doc3, documents probably looks like this (just an example; bear with me as I didn't bother creating random Japanese sentences):
documents = ['診多索いほ', '診多索いほ', '台話辞ゃに']
The current tokenizer looks for spaces, which your string doesn't have. You could try this:
documents = [" ".join(char for char in d) for d in documents]
Now documents looks like this, which may be more feasible (although that's up to you, as I don't know whether it's appropriate to always add a space between each Japanese character):
documents
Out[40]: ['診 多 索 い ほ', '診 多 索 い ほ', '台 話 辞 ゃ に']
Or define your own tokenizer, as referred to in another answer.
I am using spaCy to extract nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re
nlp = spacy.load("en_core_web_sm")
sentence= "HANDBRAKE - slow and fast (SFX)"
string= sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)
cleanString = cleanString.replace("_", " ")
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly, for the sentence "fast foward2", I get the spaCy noun
foward2
This shows that the extracted nouns include some meaningless words like: sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway etc.
I have tried using WordNet to keep only the nouns common to both WordNet and spaCy, but it is a bit strict and filters out some sensible nouns as well. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.
Therefore, I am looking for a solution with which I can filter the spaCy noun list I have obtained down to the sensible nouns.
It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are miss-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant #pip install pyenchant
d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")
sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => [example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip]
While using the pyenchant spellchecker, I have found it useful to do the check after converting the word fully to uppercase. Also, splitting the sentence/phrase and feeding in the words one at a time gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas using the lowercase words returns False.
If you need to mix in more than one spellchecker, another good option is to check the spaCy library's is_oov (out-of-vocabulary) flag after loading en_core_web_lg.
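A minimal sketch combining the two checks (the helper name and example words are mine, not from the original answer):
import enchant  # pip install pyenchant
import spacy

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_lg")  # the large model, so is_oov is meaningful

def looks_like_real_word(word):
    # accept a word if the spellchecker knows its uppercase form,
    # or if it is in spaCy's vocabulary (not out-of-vocabulary)
    return d.check(word.upper()) or not nlp.vocab[word].is_oov

print(looks_like_real_word("landsat"))   # True via the uppercase check
print(looks_like_real_word("foward2"))   # likely False for a misspelling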
I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
#### Stemming
ps = PorterStemmer() # PorterStemmer imported from nltk.stem
stemText = []
for word in swFiltText: # Tagged text w/o stop words
    stemText.append(ps.stem(word))

#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i) # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged # Combine tagged words into list
    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the tagged word list being formatted as tuples like [('come', 'VB'), ('hous', 'JJ'), ...].
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!
You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about what part-of-speech tag a word can be.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag first, it can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part-of-speech into account, eg housing could be a noun (lemma: housing) or a verb (lemma: house). With p-o-s information a lemmatiser can make the correct choice there.
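A minimal sketch of that order of operations with NLTK (the tag-mapping helper and the example sentence are mine, not part of the original question):
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
# may require nltk.download('wordnet'), nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') the first time

def wordnet_pos(treebank_tag):
    # map Penn Treebank tags to the coarse tags WordNetLemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(nltk.word_tokenize("The housing market is coming back"))  # tag the raw tokens first
lemmas = [(word, tag, lemmatizer.lemmatize(word, wordnet_pos(tag))) for word, tag in tagged]
print(lemmas)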
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences, in my experience, are:
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
I would suggest using lemmatization over stemming. Stemming just chops letters off the end of a word until the root/stem is reached. Lemmatization also looks at the surrounding text to determine the given word's part of speech.
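A small illustration of the difference (the example words are mine):
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
print(ps.stem("housing"), ps.stem("houses"))                                 # hous hous
print(wnl.lemmatize("housing", pos="n"), wnl.lemmatize("houses", pos="n"))   # housing house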
So there is an Excel file which I have read through pandas and stored in a dataframe 'df'. That Excel file contains 24 columns of 'questions' and 631 rows of 'responses/answers'.
So I converted one such question into a list so that I can tokenize it and apply further NLP-related tasks to it.
df_lst = df['Q8 Why do you say so ?'].values.tolist()
Now, this gives me a list of 631 sentences, some of which are non-English. So I want to filter out the non-English sentences so that in the end I am left with a list containing only English sentences.
What i have:
df_lst = ["The executive should be able to understand the customer's problem", "Customers should get correct responses to their queries", "This text is in a random non english language", ...]
Output (What i want):
english_words = ["The executive should be able to understand the customer's problem", "Customers should get correct responses to their queries", ...]
Also, I read about a Python library named pyenchant which should be able to do this, but it's not compatible with Windows 64-bit and Python 3. Is there any other way this can be done?
Thanks!
There is another library (closely related to NLTK), TextBlob.
It was initially aimed at sentiment analysis, but you can also use it for translation and language detection; see the docs here: https://textblob.readthedocs.io/en/dev/quickstart.html
Section "Translation and Language Detection"
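A minimal sketch using TextBlob's detect_language() (note: this method goes through an online translation service and has been deprecated in recent TextBlob releases, so treat it as a sketch for older versions; the example sentences are mine):
from textblob import TextBlob

df_lst = ["Customers should get correct responses to their queries",
          "Este texto no está escrito en inglés"]
# keep only the sentences detected as English
english_only = [s for s in df_lst if TextBlob(s).detect_language() == "en"]
print(english_only)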
Good luck!
Have you considered taking advantage of the number of English "stopwords" in your sentences? Take a look at the nltk package. You can check the English stopwords using the following code:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords') # If you just installed the package
set(stopwords.words('english'))
You could add a new column indicating the number of English stopwords present in each of your sentences. Presence of stopwords could be used as a predictor of English language.
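A minimal sketch of that idea (the threshold of one stopword and the made-up non-English sentence are both arbitrary choices of mine):
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
english_stopwords = set(stopwords.words('english'))

def english_stopword_count(sentence):
    # count how many lowercased tokens of the sentence are English stopwords
    return sum(1 for word in sentence.lower().split() if word in english_stopwords)

df_lst = ["Customers should get correct responses to their queries",
          "Ceci n'est pas une phrase en anglais"]
english_sentences = [s for s in df_lst if english_stopword_count(s) >= 1]
print(english_sentences)   # keeps only the English sentence here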
Another thing that could work: if you know for a fact that most answers are in English to begin with, make a frequency ranking of words (possibly for each question in your data). In your example, the word "customer" shows up quite consistently for the question under study, so you could engineer a variable that indicates the presence of very frequent words in an answer. That could also serve as a predictor. Don't forget to make all words either lowercase or uppercase and to deal with plurals and 's, so that you don't rank "customer", "Customer", "customers", "Customers", "customer's" and "customers'" all as different words.
After engineering the variables above, you can set up a threshold above which you consider the sentence to be written in English, or you can go for something a bit more fancy in terms of unsupervised learning.
I have a corpus that I'm using the tm package on in R (and also mirroring the same script in NLTK in Python). I'm working with unigrams, but would like a parser of some kind to combine commonly co-located words so they are treated as a single word; i.e., I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see this particular pair represented as "New York" as if it were a single word, alongside the other unigrams.
What is this process called, of transforming meaningful, common n-grams onto the same footing as unigrams? Is it not a thing? Finally, what would the tm_map look like for this?
mydata.corpus <- tm_map(mydata.corpus, fancyfunction)
And/or in python?
I recently had a similar question and played around with collocations.
This was the solution I chose to identify pairs of collocated words:
import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

text = <a long text read in as a string>
tokenized_text = word_tokenize(text)

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokenized_text)
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(scored, key=lambda s: s[1], reverse=True))
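To actually merge the pairs you keep into single tokens (the "New York" part of the question), one option, not from the original answer, is NLTK's MWETokenizer; a minimal sketch:
from nltk.tokenize import MWETokenizer, word_tokenize

# build the tokenizer from whichever scored bigrams you decide to keep
mwe_tokenizer = MWETokenizer([("New", "York")], separator=" ")
tokens = mwe_tokenizer.tokenize(word_tokenize("I moved to New York last year"))
print(tokens)   # ['I', 'moved', 'to', 'New York', 'last', 'year']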
Should I use NLTK or regular expressions to split it?
How can I do the selection for pronouns (he/she)? I want to select any sentence that has a pronoun.
This is a part of a larger project and I am new to Python. Could you please point me to any helpful code?
I am working on an NLP project which has similar needs. I suggest you use NLTK since it makes things really easy and gives a lot of flexibility. Since you need to collect all sentences containing pronouns, you can split the text into sentences and hold them in a list. Then, iterate over the list and look for sentences containing pronouns. Also make sure you note down the index of each such sentence (in the list), or form a new list.
Sample code below:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []
for sentence in sentences:
    words = word_tokenize(sentence)
    for word in words:
        word_pos = pos_tag([word])
        if word_pos[0][1] == 'PRP':
            sentences_with_pronouns.append(sentence)
            break
print(sentences_with_pronouns)
Output:
['she also loves to play chess with him']
P.S. Also check out the pylucene and whoosh libraries, which are pretty useful NLP Python packages.
NLTK is your best bet. Given a string of sentences as input, you can obtain a list of those sentences containing pronouns by doing:
from nltk import pos_tag, sent_tokenize, word_tokenize
paragraph = "This is a sentence with no pronouns. Take it or leave it."
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])
Returns:
['Take it or leave it.']
Basically we split the string into a list of sentences, split those sentences into lists of words, and convert the list of words for each sentence into a set of part-of-speech tags (this is important because otherwise, when we have multiple pronouns in a sentence, we would get duplicate sentences).