How to convert sentences into vectors - Python

I have a dictionary where the keys are words and the values are vectors of those words.
I have a list of sentences which I want to convert into an array. Right now I'm getting an array of all the words, but I would like an array of sentences made of word vectors, so I can feed it into a neural network.
sentences=["For last 8 years life, Galileo house arrest espousing man's theory",
'No. 2: 1912 Olympian; football star Carlisle Indian School; 6 MLB seasons Reds, Giants & Braves',
'The city Yuma state record average 4,055 hours sunshine year'.......]
word_vec={'For': [0.27452874183654785, 0.8040047883987427],
'last': [-0.6316165924072266, -0.2768899202346802],
'years': [-0.2496756911277771, 1.243837594985962],
'life,': [-0.9836481809616089, -0.9561406373977661].....}
I want to convert the above sentences into vectors of their corresponding words from the dictionary.

Try this:
def sentence_to_list(sentence, words_dict):
    return [w for w in sentence.split() if w in words_dict]
So the first of the sentences in your example will be converted to:
['For', 'last', 'years', 'life,'] # words not in the dictionary are not present here
Update.
I guess you need to remove punctuation characters. There are several ways to split a string on multiple delimiter characters; check this answer: Split Strings into words with multiple word boundary delimiters

This will create vectors, a list of lists of word vectors (one inner list per sentence):
vectors = []
for sentence in sentences:
    sentence_vec = [word_vec[word] for word in sentence.split() if word in word_vec]
    vectors.append(sentence_vec)
If you want to omit punctuation (,.: etc.), use re.findall (import re) instead of .split:
words = re.findall(r"[\w']+", sentence)
sentence_vec = [word_vec[word] for word in words if word in word_vec]
If you don't want to skip words that are not available in word_vec, use:
sentence_vec = [word_vec[word] if word in word_vec else [0, 0] for word in words]
It will place [0, 0] for each missing word.
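Since the goal is to feed the result into a neural network, the ragged list of lists usually has to become a fixed-shape array. Here is a minimal sketch of zero-padding, assuming NumPy and 2-dimensional word vectors as in the question (the sample data below is illustrative, not taken from your dictionary):
import numpy as np

# vectors: one inner list of word vectors per sentence, built as shown above
vectors = [
    [[0.27, 0.80], [-0.63, -0.27]],               # sentence with 2 known words
    [[-0.24, 1.24], [-0.98, -0.95], [0.1, 0.2]],  # sentence with 3 known words
]

max_len = max(len(s) for s in vectors)  # longest sentence, in words
dim = 2                                 # dimensionality of each word vector

# pad every sentence with [0, 0] vectors up to max_len
padded = np.zeros((len(vectors), max_len, dim))
for i, sentence_vec in enumerate(vectors):
    padded[i, :len(sentence_vec), :] = sentence_vec

print(padded.shape)  # (2, 3, 2)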

Related

How to get bigrams/trigrams of words from pre-listed unigrams in a document corpus / dataframe column

I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for analysis and words associated with it (and later make a wordcloud and counter of occurrences) to understand topics /context associated with such keywords.
Use case:
df.text_column()
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.
Another entry reads: 'lime soda is the best selling item in fast food stores'.
My objective is to get bigrams/trigrams like:
'coca_cola','coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me to do that [Python only]
First, you can optionally load NLTK's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop words list, or even extend NLTK's list with additional words. Then you can use the nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to include the following once in your code, in order to download the stop words list.
nltk.download('stopwords')
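word_tokenize additionally relies on NLTK's punkt tokenizer models; if they are not already installed, they can be downloaded the same way:
nltk.download('punkt')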
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']

def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        for m in matching_list:
            if m not in matches:
                matches.append(m)
    return matches

all_matching_bigrams = find_matches(bigrams_list)    # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list)  # find all matching trigrams

# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
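Since you also mention building a wordcloud and counting occurrences, a Counter over the matched n-grams (accumulated across all rows of your text column) gives the frequencies directly; the list below is illustrative:
from collections import Counter

# all_matches would be collected over every row of the dataframe's text column
all_matches = ['coca_cola', 'soft_drinks', 'aerated_water', 'soft_drinks', 'lime_soda']
ngram_counts = Counter(all_matches)
print(ngram_counts.most_common())
# [('soft_drinks', 2), ('coca_cola', 1), ('aerated_water', 1), ('lime_soda', 1)]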

Checking whether sentences in a database contain certain words from a dictionary

I have a huge dictionary (in Python) containing millions of words and a score for each of them indicating popularity. (Note: I have it as a dictionary, but I can easily work with it as a dataframe too.) I also have a database/SQL table with a few hundred sentences, each sentence having an ID.
I want to see whether each sentence contains a popular word, that is, whether it contains a word with score below some number n. Is it inefficient to iterate through every sentence and each time check through every word to see if it is in the dictionary, and what score it has?
Is there any other more efficient way to do it?
Here is an approach you can go with; 6 in the example code below plays the role of the n from your question.
import re

words = {
    'dog': 5,
    'ant': 6,
    'elephant': 1
}
n = 6
sentences = ['an ant', 'a dog', 'an elephant']

# Get all the popular words (score below n)
popular_words = [key for key, val in words.items() if int(val) < int(n)]
popular_words = "|".join(popular_words)

for sentence in sentences:
    # Check if the sentence contains any of the popular words
    if re.search(rf"{popular_words}", sentence):
        print(sentence)
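As for efficiency, a set of popular words plus a per-token membership test is a common alternative; unlike the regex above, this sketch also avoids matching inside longer words (e.g. 'ant' inside 'want'). It reuses the words, n and sentences defined above:
# build the set of popular words once
popular = {word for word, score in words.items() if score < n}

for sentence in sentences:
    # intersect the sentence's tokens with the set of popular words
    if popular.intersection(sentence.lower().split()):
        print(sentence)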

Python extracting sentence containing 2 words with conditions of window size

I have the same problem that was discussed in this link
Python extracting sentence containing 2 words
but the difference is that I need to extract only sentences that contain the two words within a defined window size. For example:
sentences = [ 'There was peace and happiness','hello every one',' How to Find Inner Peace ,love and Happiness ','Inner peace is closely related to happiness']
search_words= ['peace','happiness']
windows_size = 3 # search only the three words after the first word, 'peace'
#output must be :
output= ['There was peace and happiness',' How to Find Inner Peace love and Happiness ']
Here is a crude solution.
def search(sentences, keyword1, keyword2, window=3):
    res = []
    for sentence in sentences:
        words = sentence.lower().split(" ")
        if keyword1 in words and keyword2 in words:
            keyword1_idx = words.index(keyword1)
            keyword2_idx = words.index(keyword2)
            if keyword2_idx - keyword1_idx <= window:
                res.append(sentence)
    return res
Given a sentences list and two keywords, keyword1 and keyword2, we iterate through the sentences list one by one. We split the sentence into words, assuming that the words are separated by a single space. Then, after performing a cursory check of whether or not both keywords are present in the words list, we find the index of each keyword in words to make sure that the indices are at most window apart, i.e. the words are close together within window words. We append only the sentences that satisfy this condition to the res list, and return that result.
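For the sample data from the question, calling the function with lowercase keywords (the sentences are lowercased inside the function) reproduces the expected output:
sentences = ['There was peace and happiness', 'hello every one',
             ' How to Find Inner Peace ,love and Happiness ',
             'Inner peace is closely related to happiness']

print(search(sentences, 'peace', 'happiness', window=3))
# ['There was peace and happiness', ' How to Find Inner Peace ,love and Happiness ']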

Reconstructing string of words using index position of words

I compressed a file and it gave each unique word in my string a value (0, 1, 2, 3, etc.).
I now have the list of numbers in order of appearance, e.g. (0,1,2,1,3,4,5,2,2, etc.).
Using the numbers and the list of unique words, is there a way to decompress the sentence and get back the original sentence I started with?
I have a text file with the following
[0,1,2,3,2,4,5,6,2,7,8,2,9,2,11,12,13,15,16,17,18,19]
["Lines","long","lines","very","many","likes","for","i","love","how","amny","does","it","take","to","make","a","cricle..","big","questions"]
My code compressed the original sentence by storing the positions and the unique words.
The original sentence was "Lines long lines very lines amny likes for lines i love lines how many lines does it take to make a cricle"
Now I want to be able to reconstruct the sentence using the list of unique words and the position list. I want to be able to do this with any sentence, not just this one example sentence.
To go back to words, you can access your map of words and, for each number, append the corresponding word to the sentence.
numbers = [1, 2]
sentence = ""
words = {1: "hello", 2: "world"}
for number in numbers:
    sentence += words[number] + " "
sentence = sentence[:-1]  # removes the trailing space
You could use either a dict or a list and a generator expression with str.join:
words = ["I", "like", "boat", "sun", "forest", "dog"]
other_words = {0: "Okay", 1: "Example", 2: "...", 3: "$", 4:"*", 5: "/"}
sentence = (0,1,2,1,3,4,5,2,2)
print(" ".join(words[i] for i in sentence))
# I like boat like sun forest dog boat boat
print(" ".join(other_words[i] for i in sentence))
# Okay Example ... Example $ * / ... ...
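Applied to the data in the question, a small sketch that reads the two lists back from the text file and rebuilds the sentence might look like this (the filename is illustrative; it assumes the index list and the word list sit on the first two lines, formatted as shown above):
import json

with open("compressed.txt") as f:
    positions = json.loads(f.readline())     # e.g. [0, 1, 2, 3, ...]
    unique_words = json.loads(f.readline())  # e.g. ["Lines", "long", ...]

reconstructed = " ".join(unique_words[i] for i in positions)
print(reconstructed)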

Create a stemmer to reduce words to a base form

I am dealing with a case now for which I would like to create my own stemming algorithm. I know that there are some excellent libraries for this, but they do not work for this use case.
In essence I would like to import a dictionary so I can loop through words in a sentence and if a word is present in a list, reduce it to its base form.
So, for example, reduce 'banker' to 'bank'. I have produced this, but it is not scalable:
list_bank = ('banking', 'banker')
sentence = "There's a banker"
banker_tags = []
for word in sentence.split():
    print(word)
    # So, for example, reduce 'banker' to 'bank'
    if word in list_bank:
        # replace word
        pass
Any suggestion on how I can get this working?
Put the words and their stems in a dictionary and then use that to look up the stemmed form:
dictionary = {'banker': 'bank', 'banking': 'bank'}  # Add the rest of your words and stems
sentence = "There's a banker"
for word in sentence.split():
    if word in dictionary:
        word = dictionary[word]
    print(word)
There's
a
bank
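To rebuild the whole sentence rather than print it word by word, dict.get with the word itself as the default keeps unknown words unchanged; a minimal variation on the lookup above:
dictionary = {'banker': 'bank', 'banking': 'bank'}
sentence = "There's a banker"

# replace each word by its base form when it is in the dictionary
stemmed = " ".join(dictionary.get(word, word) for word in sentence.split())
print(stemmed)  # There's a bank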
