Reconstructing string of words using index position of words - python

I compressed a file and it gave each unique word in my string a value (0, 1, 2, 3, etc.).
I now have the list of numbers in order of appearance, e.g. (0, 1, 2, 1, 3, 4, 5, 2, 2, etc.).
Using the numbers and the list of unique words, is there a way to decompress the sentence and get back the original sentence I started with?
I have a text file with the following:
[0,1,2,3,2,4,5,6,2,7,8,2,9,2,11,12,13,15,16,17,18,19]
["Lines","long","lines","very","many","likes","for","i","love","how","amny","does","it","take","to","make","a","cricle..","big","questions"]
My code compressed the original sentence by getting the positions and the unique words.
The original sentence was "Lines long lines very lines amny likes for lines i love lines how many lines does it take to make a cricle"
Now I want to be able to reconstruct the sentence using the list of unique words and the position list. I want to be able to do this with any sentence, not just this one example sentence.

To go back to words, you can access your map of words and for each of the numbers add a word onto the sentence.
numbers = [1, 2]
sentence = ""
words = {1: "hello", 2: "world"}
for number in numbers:
    sentence += words[number] + " "
sentence = sentence[:-1]  # removes the trailing space
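For the small example above, printing the result gives the reconstructed string:
print(sentence)  # hello world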

You could use either a dict or a list and a comprehension with str.join:
words = ["I", "like", "boat", "sun", "forest", "dog"]
other_words = {0: "Okay", 1: "Example", 2: "...", 3: "$", 4:"*", 5: "/"}
sentence = (0,1,2,1,3,4,5,2,2)
print(" ".join(words[i] for i in sentence))
# I like boat like sun forest dog boat boat
print(" ".join(other_words[i] for i in sentence))
# Okay Example ... Example $ * / ... ...
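If the index list and the word list are stored in a text file exactly as shown in the question (one Python-style list per line), a minimal sketch for loading them and rebuilding the sentence could look like this (the file name compressed.txt is just a placeholder):

import ast

with open("compressed.txt") as f:
    positions = ast.literal_eval(f.readline())     # e.g. [0, 1, 2, 3, ...]
    unique_words = ast.literal_eval(f.readline())  # e.g. ["Lines", "long", "lines", ...]

sentence = " ".join(unique_words[i] for i in positions)
print(sentence)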

Related

Checking whether sentences in a database contain certain words from a dictionary

I have a huge dictionary (in Python) containing millions of words and a score for each of them indicating popularity. (Note: I have it as a dictionary, but I can easily work with it as a dataframe too.) I also have a database/SQL table with a few hundred sentences, each sentence having an ID.
I want to see whether each sentence contains a popular word, that is, whether it contains a word with score below some number n. Is it inefficient to iterate through every sentence and each time check through every word to see if it is in the dictionary, and what score it has?
Is there any other more efficient way to do it?
Here is an approach you can go with; in my example code, 6 is the value of n that you mentioned in the question.
import re

words = {
    'dog': 5,
    'ant': 6,
    'elephant': 1
}
n = 6
sentences = ['an ant', 'a dog', 'an elephant']

# Get all the popular words
popular_words = [key for key, val in words.items() if int(val) < int(n)]
popular_words = "|".join(popular_words)

for sentence in sentences:
    # Check if the sentence contains any of the popular words
    if re.search(rf"{popular_words}", sentence):
        print(sentence)
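For comparison, the direct per-word lookup the question describes is usually fast enough on its own, since membership checks against a dict or set are O(1) on average. A minimal sketch of that approach, reusing the words, n and sentences defined above (it prints the same two sentences here):

# Build a set of popular words once, then test each word of each sentence against it.
popular = {word for word, score in words.items() if score < n}

for sentence in sentences:
    if any(word in popular for word in sentence.split()):
        print(sentence)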

Simplest way to convert char offsets to word offsets

I have a python string and a substring of selected text. The string for example could be
stringy = "the bee buzzed loudly"
I want to select the text "bee buzzed" within this string. I have the character offsets, i.e. 4-14 for this particular string, because those are the character-level indices that the selected text sits between.
What is the simplest way to convert these to word-level indices, i.e. 1-2, because the second and third words are being selected? I have many strings that are labeled like this and I would like to convert the indices simply and efficiently. The data is currently stored in a dictionary like so:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
I would like to convert it to this form
data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}
Thank you!
It seems like a tokenisation problem.
My solution would be to use a span tokenizer and then search for your substring's span among the token spans.
So using the nltk library:
import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b, sub_e = 4, 14 # substring begin and end
[i for i, (b, e) in enumerate(tokenizer.span_tokenize(stringy))
 if b >= sub_b and e <= sub_e]
But this is kind of intricate.
tokenizer.span_tokenize(stringy) returns spans for each token/word it identified.
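To make that concrete: this particular sentence has no punctuation, so both tokenizers effectively just split on the three spaces, and the spans should come out as follows:
list(tokenizer.span_tokenize(stringy))
# [(0, 3), (4, 7), (8, 14), (15, 21)]
# so the comprehension above keeps indices [1, 2], i.e. "bee" and "buzzed"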
Here's a simple list-index approach:
# set up data
string = "the bee buzzed loudly"
words = string[4:14].split(" ")  # get the selected words from the string using the character indices
stringLst = string.split(" ")  # split the string into words
dictionary = {"string": "", "start_word": 0, "end_word": 0}

# process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0])  # index of the first selected word
dictionary["end_word"] = stringLst.index(words[-1])  # index of the last selected word
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
Take note that this relies on list.index, which returns the first occurrence of a word, so it assumes the selected words don't also appear earlier in the string.
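If you'd rather avoid re-searching the words (which can break when a word occurs more than once), you can count word boundaries directly from the character offsets. A minimal sketch, assuming words are separated by single spaces; char_span_to_word_span is just an illustrative name, not a library function:

def char_span_to_word_span(string, start_char, end_char):
    # the word index is the number of spaces to the left of the character offset
    start_word = string[:start_char].count(" ")
    end_word = string[:end_char].count(" ")
    return start_word, end_word

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start, end = char_span_to_word_span(data["string"], data["start_char"], data["end_char"])
print(start, end)  # 1 2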
Try this code:
def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char
    dic[arg[1]] = end_char

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))
char_change(data, start_char, end_char, "start_char", "end_char")
print(data)
Default Dictionary:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
INPUT
Please enter your start character: 1
Please enter your end character: 2
OUTPUT Dictionary:
{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}

Python extracting sentence containing 2 words with conditions of window size

I have the same problem that was discussed in this link
Python extracting sentence containing 2 words
but the difference is that I need to extract only sentences that contain the two words within a defined-size search window. For example:
sentences = [ 'There was peace and happiness','hello every one',' How to Find Inner Peace ,love and Happiness ','Inner peace is closely related to happiness']
search_words= ['peace','happiness']
windows_size = 3  # search only the three words after the first word, 'peace'
# output must be:
output= ['There was peace and happiness',' How to Find Inner Peace love and Happiness ']
Here is a crude solution.
def search(sentences, keyword1, keyword2, window=3):
    res = []
    for sentence in sentences:
        words = sentence.lower().split(" ")
        if keyword1 in words and keyword2 in words:
            keyword1_idx = words.index(keyword1)
            keyword2_idx = words.index(keyword2)
            if keyword2_idx - keyword1_idx <= window:
                res.append(sentence)
    return res
Given a sentences list and two keywords, keyword1 and keyword2, we iterate through the sentences list one by one. We split the sentence into words, assuming that the words are separated by a single space. Then, after performing a cursory check of whether or not both keywords are present in the words list, we find the index of each keyword in words to make sure that the indices are at most window apart, i.e. the words are close together within window words. We append only the sentences that satisfy this condition to the res list, and return that result.
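For example, calling the function on the question's data keeps only the first and third sentences, where both keywords fall within three words of each other:

sentences = ['There was peace and happiness',
             'hello every one',
             ' How to Find Inner Peace ,love and Happiness ',
             'Inner peace is closely related to happiness']
print(search(sentences, 'peace', 'happiness'))
# ['There was peace and happiness', ' How to Find Inner Peace ,love and Happiness ']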

Remove close matches / similar phrases from list

I am working on removing similar phrases in a list, but I have hit a small roadblock.
I have sentences and phrases; the phrases are related to the sentences. All the phrases of a sentence are in a single list.
Let the phrase list be : p=[['This is great','is great','place for drinks','for drinks'],['Tonight is a good','good night','is a good','for movies']]
I want my output to be [['This is great','place for drinks'],['Tonight is a good','for movies']]
Basically, I want to get all the longest unique phrases of a list.
I took a look at the fuzzywuzzy library, but I was unable to arrive at a good solution.
Here is my code:
def remove_dup(arr, threshold=80):
    ret_arr = []
    for item in arr:
        if item[1] < threshold:
            ret_arr.append(item[0])
    return ret_arr

def find_important(sents=sents, phrase=phrase):
    import os, random
    from fuzzywuzzy import process, fuzz
    all_processed = []  # final array to be returned
    for i in range(len(sents)):
        new_arr = []  # reshaped phrases for a single sentence
        for item in phrase[i]:
            new_arr.append(item)
        new_arr.sort(reverse=True, key=lambda x: len(x))  # sort with highest length
        important = []  # array to store terms
        important = process.extractBests(new_arr[0], new_arr)  # to get levenshtein distance matches
        to_proc = remove_dup(important)  # remove_dup removes all relatively matching terms.
        to_proc.append(important[0][0])  # the term with highest match is obviously the important term.
        all_processed.append(to_proc)  # add non duplicates to all_processed[]
    return all_processed
Can someone point out what I am missing, or what is a better way to do this?
Thanks in advance!
I would use the difference between each phrase and all the other phrases.
If, compared with every other phrase, a phrase has at least one word that the other phrase lacks (i.e. it is not a subset of any other phrase), then it's unique and should be kept.
I've also made it robust to exact matches and to extra spaces.
sentences = [['This is great', 'is great', 'place for drinks', 'for drinks'],
             ['Tonight is a good', 'good night', 'is a good', 'for movies'],
             ['Axe far his favorite brand for deodorant body spray', ' Axe far his favorite brand for deodorant spray', 'Axe is']]

new_sentences = []
s = " "
for phrases in sentences:
    new_phrases = []
    phrases = [phrase.split() for phrase in phrases]
    for i in range(len(phrases)):
        phrase = phrases[i]
        if all([len(set(phrase).difference(phrases[j])) > 0 or i == j for j in range(len(phrases))]):
            new_phrases.append(phrase)
    new_phrases = [s.join(phrase) for phrase in new_phrases]
    new_sentences.append(new_phrases)
print(new_sentences)
Output:
[['This is great', 'place for drinks'],
['Tonight is a good', 'good night', 'for movies'],
['Axe far his favorite brand for deodorant body spray', 'Axe is']]

How to convert sentences in to vectors

I have a dictionary where the keys are words and the values are vectors of those words.
I have a list of sentences which I want to convert into an array. I'm getting an array of all the words, but I would like to have an array of sentences with word vectors so I can feed it into a neural network.
sentences=["For last 8 years life, Galileo house arrest espousing man's theory",
'No. 2: 1912 Olympian; football star Carlisle Indian School; 6 MLB seasons Reds, Giants & Braves',
'The city Yuma state record average 4,055 hours sunshine year'.......]
word_vec={'For': [0.27452874183654785, 0.8040047883987427],
'last': [-0.6316165924072266, -0.2768899202346802],
'years': [-0.2496756911277771, 1.243837594985962],
'life,': [-0.9836481809616089, -0.9561406373977661].....}
I want to convert the above sentences into vectors of their corresponding words from the dictionary.
Try this:
def sentence_to_list(sentence, words_dict):
    return [w for w in sentence.split() if w in words_dict]
So the first of the sentences in your example will be converted to:
['For', 'last', 'years', 'life,']  # words not in the dictionary are not present here
Update: I guess you need to remove punctuation characters. There are several ways to split a string on multiple delimiter characters; check this answer: Split Strings into words with multiple word boundary delimiters
This will create vectors, containing list of lists of vectors (one list per one sentence):
vectors = []
for sentence in sentences:
    sentence_vec = [word_vec[word] for word in sentence.split() if word in word_vec]
    vectors.append(sentence_vec)
If you want to omit punctuation (,.: etc.), use re.findall (import re) instead of .split:
words = re.findall(r"[\w']+", sentence)
sentence_vec = [ word_vec[word] for word in words if word in word_vec ]
If you don't want to skip words not available in word_vec, use:
sentence_vec = [ word_vec[word] if word in word_vec else [0,0] for word in words ]
It will place [0, 0] for each missing word.
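As a quick sanity check using only the four vectors shown in the question's word_vec, the first example sentence maps to those four vectors in order:

word_vec = {'For': [0.27452874183654785, 0.8040047883987427],
            'last': [-0.6316165924072266, -0.2768899202346802],
            'years': [-0.2496756911277771, 1.243837594985962],
            'life,': [-0.9836481809616089, -0.9561406373977661]}
sentence = "For last 8 years life, Galileo house arrest espousing man's theory"
sentence_vec = [word_vec[word] for word in sentence.split() if word in word_vec]
print(len(sentence_vec))  # 4 -- the vectors for 'For', 'last', 'years' and 'life,'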
