get ngrams with positional information - python

I'm trying to group similar short descriptions together and currently using ngrams to extract text features. Here's the ngrams function that I'm using:
import re

def generate_ngrams(text, n):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    tokens = [token for token in text.split(" ") if token != ""]
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
However, I'm experiencing some undesired results after clustering. Suppose I have the following two texts:
00011122abc
00111224abc
By using ngrams(n=3), my clustering model grouped these together, which is not what I want. So I think I need to pass a new analyzer function into the TF-IDF vectorizer instead of generate_ngrams. I think I need to anchor the first char and create prefix substrings as my features, so for the first text it will be something like this:
[000, 0001, 00011, 000111, 0001112, ...]
Has anyone else experienced similar problems or is there a better way to approach this? Thanks!
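One way to try the anchored-prefix idea described above is to pass a custom analyzer callable to scikit-learn's TfidfVectorizer. The sketch below is only a minimal illustration of that idea, not the asker's code; prefix_analyzer and min_len are made-up names for this example:

from sklearn.feature_extraction.text import TfidfVectorizer

def prefix_analyzer(text, min_len=3):
    # Emit every prefix of the string, anchored at the first character.
    text = text.lower().strip()
    return [text[:i] for i in range(min_len, len(text) + 1)]

vectorizer = TfidfVectorizer(analyzer=prefix_analyzer)
X = vectorizer.fit_transform(["00011122abc", "00111224abc"])
print(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn

Because every feature is anchored at the first character, the two example strings above share no prefixes of length 3 or more, so a downstream clustering model should no longer group them together.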

Related

How to get n-grams from a column in pandas dataframe

I have some doubts regarding n-grams.
Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:
Sentences
For each topic, we will explore the words occurring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many words and how many times those words appear.
Save this to ‘bow_corpus’, then check our selected document earlier.
To do this, I used the following function
def n_grams(lines , min_length=2, max_length=4):
    lenghts=range(min_length,max_length+1)
    ngrams={length:collections.Counter() for length in lengths)
    queue= collection.deque(maxlen=max_length)
but it does not work since I got None as output.
Can you please tell me what is wrong in the code?
Your ngrams dictionary has empty Counter() objects because you don't pass anything to count. There are also a few other problems:
Function names can't include - in Python.
collection.deque is invalid, I think you wanted to call collections.deque()
I think there are better options to fix your code than using the collections library. Two of them are as follows:
You might fix your function using a list comprehension:
def n_grams(lines, min_length=2, max_length=4):
    tokens = lines.split()
    ngrams = dict()
    for n in range(min_length, max_length + 1):
        ngrams[n] = [tokens[i:i+n] for i in range(len(tokens) - n + 1)]
    return ngrams
Or you might use nltk, which supports tokenization and n-grams natively.
from nltk import ngrams
from nltk.tokenize import word_tokenize

def n_grams(lines, min_length=2, max_length=4):
    tokens = word_tokenize(lines)
    # Use a different local name so nltk's ngrams function isn't shadowed inside the comprehension.
    result = {n: list(ngrams(tokens, n)) for n in range(min_length, max_length + 1)}
    return result
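If the sentences live in a DataFrame column as in the question, either version can be applied row by row. A small usage sketch (the column name Sentences is taken from the example above):

import pandas as pd

df = pd.DataFrame({"Sentences": ["We will check where our test document would be classified."]})
df["ngrams"] = df["Sentences"].apply(n_grams)
print(df["ngrams"].iloc[0][2])  # the 2-grams of the first row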

How can I look for specific bigrams in text example - python?

I am interested in finding how often (as a percentage) a set of words, stored in n_grams, appears in a sentence.
example_txt = ["order intake is strong for Q4"]

def find_ngrams(text):
    text = re.findall('[A-z]+', text)
    content = [w for w in text if w.lower() in n_grams] # you can calculate %stopwords using "in"
    return round(float(len(content)) / float(len(text)), 5)

#the goal is for the above procedure to work on a pandas dataframe, but for now lets use 'text' as an example.
#full_MD['n_grams'] = [find_ngrams(x) for x in list(full_MD.loc[:,'text_no_stopwords'])]
Below you see two examples. The first one works, the second doesn't.
n_grams= ['order']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.16667]
n_grams= ['order intake']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.0]
How can I make the find_ngrams() function process bigrams, so the last example from above works?
Edit: Any other ideas?
You can use spaCy's Matcher:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "orderintake" with no callback and one pattern
pattern = [{"LOWER": "order"}, {"LOWER": "intake"}]
matcher.add("orderintake", None, pattern)  # spaCy v2 signature; in v3 use matcher.add("orderintake", [pattern])
doc = nlp("order intake is strong for Q4")
matches = matcher(doc)
print(len(matches)) #Number of times the bi-gram appears in text
Maybe you have already explored this option, but why not use a simple .count combined with len:
(example_txt[0].count(n_grams[0]) * len(n_grams[0])) / len(example_txt[0])
or, if you are not interested in the spaces as part of your calculation, you can use the following:
(example_txt[0].count(n_grams[0])* len(n_grams[0])) / len(example_txt[0].replace(' ',''))
Of course you can use these in a list comprehension; the lines above were just for demonstration purposes.
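For example, wrapped in a list comprehension over the example list (just restating the first expression above):

res = [(txt.count(n_grams[0]) * len(n_grams[0])) / len(txt) for txt in example_txt]
print(res)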
The line
re.findall('[A-z]+', text)
returns
['order', 'intake', 'is', 'strong', 'for', 'Q'].
For this reason, the string 'order intake' will never be matched in your loop here:
content = [w for w in text if w.lower() in n_grams]
If you want it to match, you'll need to build a single string from each bigram of consecutive tokens.
Instead of checking single words, you should probably use a proper bigram finder.
For general n-grams, have a look at this answer.
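A minimal sketch of that idea, keeping the structure of the original function but joining consecutive tokens into bigram strings before the membership test (the regex is widened so that tokens like 'Q4' survive, and the percentage is computed over bigrams rather than single tokens, which may or may not be what you want):

import re

n_grams = ['order intake']

def find_ngrams(text):
    tokens = re.findall('[A-Za-z0-9]+', text)
    # Build space-joined, lowercased bigram strings from consecutive tokens.
    bigrams = [' '.join(pair).lower() for pair in zip(tokens, tokens[1:])]
    hits = [bg for bg in bigrams if bg in n_grams]
    return round(len(hits) / len(bigrams), 5) if bigrams else 0.0

print([find_ngrams(x) for x in ["order intake is strong for Q4"]])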

Sklearn TfIdfVectorizer remove docs containing all stopwords

I am using sklearn's TfidfVectorizer to vectorize my corpus. In my analysis, there are some documents whose terms are all filtered out because they contain only stopwords. To reduce the sparsity issue, and because it is meaningless to include them in the analysis, I would like to remove them.
Looking into the TfidfVectorizer docs, there is no parameter that can be set to do this. Therefore, I am thinking of removing these documents manually before passing the corpus into the vectorizer. However, this has a potential issue: the stopword list that I have is not the same as the list used by the vectorizer, since I also use the min_df and max_df options to filter out terms.
Is there any better way to achieve what I am looking for (i.e. removing/ignoring documents containing only stopwords)?
Any help would be greatly appreciated.
You can:
specify your stopwords and then, after running TfidfVectorizer,
filter out empty rows.
The following code snippet shows a simplified example that should set you in the right direction:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["aa ab", "aa ab ac"]
stop_words = ["aa", "ab"]

tfidf = TfidfVectorizer(stop_words=stop_words)
corpus_tfidf = tfidf.fit_transform(corpus)

# Boolean mask of documents whose tf-idf row sums to zero (i.e. every term was removed).
idx = np.array(corpus_tfidf.sum(axis=1) == 0).ravel()
corpus_filtered = corpus_tfidf[~idx]
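If you also want to drop the same documents from the raw corpus, so that the texts stay aligned with the filtered matrix, the same mask can be reused; this is just a small addition to the snippet above:

corpus_kept = [doc for doc, is_empty in zip(corpus, idx) if not is_empty]
print(corpus_kept)  # ['aa ab ac'] for the toy corpus above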
Feel free to ask questions if you still have any!
So, you can use this:
import re
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters or digits (e.g., raw punctuation)
    punctuations = "?:!.,;'"
    for token in tokens:
        if token in punctuations:
            continue
        if re.search('[a-zA-Z0-9]', token):
            filtered_tokens.append(token)
    # return the token list: TfidfVectorizer expects its tokenizer to return a list of tokens
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.01, stop_words='english',
                                   use_idf=True, tokenizer=tokenize)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])  # df is your DataFrame with a 'text' column
ids = np.array(tfidf_matrix.sum(axis=1) == 0).ravel()
tfidf_filtered = tfidf_matrix[~ids]
This way you can remove stopwords, empty rows and use min_df and max_df.

How to find bi-grams which include pre-defined words?

I know it is possible to find bigrams which have a particular word from the example in the link below:
import nltk
from nltk.collocations import BigramCollocationFinder

finder = BigramCollocationFinder.from_words(text.split())
word_filter = lambda w1, w2: "man" not in (w1, w2)
finder.apply_ngram_filter(word_filter)
bigram_measures = nltk.collocations.BigramAssocMeasures()
raw_freq_ranking = finder.nbest(bigram_measures.raw_freq, 10)  # top-10
(from: nltk: how to get bigrams containing a specific word)
But I am not sure how this can be applied if I need bigrams containing both words pre-defined.
Example:
My Sentence: "hello, yesterday I have seen a man walking. On the other side there was another man yelling: "who are you, man?"
Given a list:["yesterday", "other", "I", "side"]
How can I get a list of bi-grams containing the given words, i.e.:
[("yesterday", "I"), ("other", "side")]?
What you want is probably a word_filter function that returns False only if all the words in a particular bigram are part of the list:
def word_filter(x, y):
    if x in lst and y in lst:
        return False
    return True
where lst = ["yesterday", "I", "other", "side"]
Note that this function is accessing lst from the outer scope, which is a dangerous thing, so make sure you don't make any changes to lst within the word_filter function.
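Plugged into the finder from the question it might look like the sketch below; apply_ngram_filter removes every bigram for which the filter returns True, so only bigrams whose two words are both in lst survive (the text and lst are taken from the example above):

import nltk
from nltk.collocations import BigramCollocationFinder

text = ("hello, yesterday I have seen a man walking. On the other side "
        "there was another man yelling: who are you, man?")
lst = ["yesterday", "I", "other", "side"]

finder = BigramCollocationFinder.from_words(text.split())
finder.apply_ngram_filter(word_filter)  # keep only bigrams where both words are in lst
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.raw_freq, 10))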
First you can create all possible bigrams for your vocabulary and feed them as the input to a CountVectorizer, which can transform your given text into bigram counts.
Then, you filter the generated bigrams based on the counts given by CountVectorizer.
Note: I have changed the token pattern to account for single-character tokens as well. By default, it skips single characters.
from sklearn.feature_extraction.text import CountVectorizer
import itertools

corpus = ["hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"]
unigrams = ["yesterday", "other", "I", "side"]
bi_grams = [' '.join(bi_gram).lower() for bi_gram in itertools.combinations(unigrams, 2)]

vectorizer = CountVectorizer(vocabulary=bi_grams, ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)
# use vectorizer.get_feature_names() on older scikit-learn versions
print([word for count, word in zip(X.sum(0).tolist()[0], vectorizer.get_feature_names_out()) if count])
output:
['yesterday i', 'other side']
This approach works better when you have many documents and only a few words in the vocabulary. If it's the other way around, you can find all the bigrams in the document first and then filter them using your vocabulary, as sketched below.
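A rough sketch of that reverse approach, using NLTK to extract all bigrams from the document and then intersecting them with the vocabulary (corpus and bi_grams reuse the variables from the snippet above):

from nltk import ngrams, word_tokenize

wanted = set(bi_grams)  # the pre-built vocabulary of bigram strings
found = [' '.join(bg).lower() for bg in ngrams(word_tokenize(corpus[0]), 2)]
print([bg for bg in found if bg in wanted])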

Extracting sentence from a dataframe with description column based on a phrase

I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case the result would be the sentences containing the phrase, e.g. "This is a superb product." and "I so so loved this superb product that I wanna gift to all."
I have used this:
searched_words = ['superb product', 'SUPERB PRODUCT']

print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output for this is not what I want, though it works if I put just one word in the searched_words list.
There are a lot of methods to do that. @ChootsMagoots gave you a good answer, but spaCy is also efficient: you can simply choose the pattern that will lead you to that sentence. Before that, though, you need to define a function that splits the text into sentences. Here's the code:
import spacy
from spacy.matcher import Matcher

def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the default sentencizer to ignore this token
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser")  # Insert before the parser can build its own sentences (spaCy v2 style)

text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# Two lowercase tokens, since "superb product" is two tokens in the text.
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("superb_product", None, pattern)  # spaCy v2 signature; in v3 use matcher.add("superb_product", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)
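To put the extracted sentences into a new column, as the question asks, one rough way (assuming df has the 'description' column and nlp and matcher are set up as above) is:

def extract_sentences(text):
    doc = nlp(text)
    # Collect the sentence of every "superb product" match, without duplicates.
    sents = {doc[start:end].sent.text for _, start, end in matcher(doc)}
    return ' '.join(sents)

df['extracted_sentence'] = df['description'].apply(extract_sentences)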
Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].items():  # .iteritems() on older pandas versions
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df.loc[index, 'extracted_sentence'] = sentence
This is going to be quite slow, but idk if there's a better way.
