I've managed to get my program to store a sentence or two into a dictionary and at the same time create a word position list.
What I need to do now is recreate the original sentence from just the dictionary and the position list. I've done lots of searches, but the results I'm getting are either not what I need or too confusing and beyond me.
Any help would be much appreciated, thanks.
Here is my code so far:
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
print ('This is the sentence with spaces before the punctuations:', sentence)
words_list = sentence.split()
print ('A list of the words in the sentence:', words_list)
dictionary = {}
word_pos_list = []
counter = 0
for word in words_list:
    if word not in dictionary:
        counter += 1
        dictionary[word] = counter
    word_pos_list.append(dictionary[word])
print ('The positions of the words in the sentence are:', word_pos_list)
John
While, as mentioned in the comments, dictionaries were traditionally unordered data structures (plain dicts do preserve insertion order as of Python 3.7), if you are breaking up a sentence, indexing it into a dictionary, and trying to put it back together, you can use an OrderedDict from the collections module to do what you're doing.
That said, this is without any further background on how you are splitting your sentence (punctuation etc.); I suggest looking into NLTK if you are doing any sort of natural language processing (NLP).
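If pulling in NLTK is overkill, a rough word/punctuation tokenizer can be sketched with the standard library's re module (a simple approximation, not a full NLP tokenizer):

```python
import re

def tokenize(s):
    # \w+ matches runs of word characters; [^\w\s] matches any single
    # punctuation mark, so punctuation comes out as its own token.
    return re.findall(r"\w+|[^\w\s]", s)

tokens = tokenize("See spot jump over the brown fox.")
# tokens == ['See', 'spot', 'jump', 'over', 'the', 'brown', 'fox', '.']
```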
from collections import OrderedDict

def index_sentence(s):
    # Note: list.index returns the first occurrence, so duplicate words
    # would collapse onto one key; fine for this duplicate-free example.
    return {s.split(' ').index(i): i for i in s.split(' ')}

def build_sentence_from_dict(d):
    return ' '.join(OrderedDict(d).values())

s = 'See spot jump over the brown fox.'
d = index_sentence(s)
# d == {0: 'See', 1: 'spot', 2: 'jump', 3: 'over', 4: 'the', 5: 'brown', 6: 'fox.'}
build_sentence_from_dict(d)
# 'See spot jump over the brown fox.'
To reconstruct from your list you have to reverse the location mapping:
# reconstruct
reversed_dictionary = {x:y for y, x in dictionary.items()}
print(' '.join(reversed_dictionary[x] for x in word_pos_list))
This can be done more cleanly using a defaultdict (a dictionary with a predefined default value, in your case a list of locations for each word):
#!/usr/bin/env python3.4
from collections import defaultdict
# preprocessing
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
punctuation = '()?:;,.!/"\''
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
# using defaultdict this time
word_to_locations = defaultdict(list)
for position, word in enumerate(sentence.split()):
    word_to_locations[word].append(position)
# word -> list of locations
print(word_to_locations)
# location -> word
location_to_word = dict((y, x) for x in word_to_locations for y in word_to_locations[x])
print(location_to_word)
# reconstruct
print(' '.join(location_to_word[x] for x in range(len(location_to_word))))
It's not the randomness of dictionary keys that's the problem here, it's the failure to record every position at which a word was seen, duplicate or not. The following does that and then unwinds the dictionary to produce the original sentence, sans punctuation:
from collections import defaultdict
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = set('()?:;\\,.!/"\'')
sentence = ''.join(character for character in sentence if character not in punctuation)
print ('This is the sentence with no punctuation:', sentence)
words = sentence.split()
print('A list of the words in the sentence:', words)
dictionary = defaultdict(list)
last_word_position = 0
for word in words:
    last_word_position += 1
    dictionary[word].append(last_word_position)
print('A list of unique words in the sentence and their positions:', dictionary.items())
# Now the tricky bit to unwind our random dictionary:
sentence = []
for position in range(1, last_word_position + 1):
    sentence.extend([word for word, positions in dictionary.items() if position in positions])
print(*sentence)
The output of the various print() statements:
This is the sentence: This Sentence is a very, very good sentence. Did you like my very good sentence?
This is the sentence with no punctuation: This Sentence is a very very good sentence Did you like my very good sentence
A list of the words in the sentence: ['This', 'Sentence', 'is', 'a', 'very', 'very', 'good', 'sentence', 'Did', 'you', 'like', 'my', 'very', 'good', 'sentence']
A list of unique words in the sentence and their positions: dict_items([('Sentence', [2]), ('is', [3]), ('a', [4]), ('very', [5, 6, 13]), ('This', [1]), ('my', [12]), ('Did', [9]), ('good', [7, 14]), ('you', [10]), ('sentence', [8, 15]), ('like', [11])])
This Sentence is a very very good sentence Did you like my very good sentence
Related
I have a list of tuples with some data like this: [('word1','sentence1'),('word2','sentence1'),('word3','sentence1') ...], I loop on every tuple to get each word and sentence like this:
for collection in tup:
    qword = collection[0]
    sentence = collection[1]
so far, so good. I needed to remove each word from the sentence so I did this:
q_sentence_split = sentence.split()
new_sentence_split = [word.replace(qword, '.....') for word in q_sentence_split]
new_sentence = ' '.join(new_sentence_split)
but this didn't give me what I needed, as it replaces the characters of the tuple's word inside each word of q_sentence_split; what I need is to match the whole word only, not compare the word's characters against the characters of each word of the sentence.
I tried putting if after the for word in q_sentence_split like this:
new_sentence_split = [word.replace(qword, '.....') for word in q_sentence_split if word == qword]
but this just removed every word from the sentence, so I don't know what is wrong with my code.
Try this:
lst = [('word1', 'The word1 is in this sentence'), ('word2', 'The word2blawblaw is in this sentence'),
('word3', 'The word1 is in this sentence')]
for word, sentence in lst:
    print(' '.join(i for i in sentence.split() if word not in i))
Output:
The is in this sentence
The is in this sentence
The word1 is in this sentence
I intentionally put word1 in the third sentence.
Explanation :
First we iterate through lst; each iteration gives us a tuple, which we unpack into word and sentence.
After splitting, the sentence becomes a list such as ['The', 'word1', 'is', 'in', 'this', 'sentence'] (in the first iteration).
Then we iterate through this list, which gives us the individual words; all we have to do is check whether our word is inside any of these words. If it isn't, that is a word we want to keep.
Finally we use ' '.join() to rebuild the sentence.
Here is what you can do:
list1 = [('word1', 'The word1 is in this sentence'), ('word2', 'The word2blawblaw is in this sentence'),
('word3', 'The word1 is in this sentence')]
for x, y in list1:
    print(' '.join(i for i in y.split() if x not in i))
Python noob so sorry for simple question but I can't find the exact solution for my situation.
I've got a Python list and I want to remove stop words from it. My code isn't removing a stopword when it's paired with another token.
from nltk.corpus import stopwords
rawData = ['for', 'the', 'game', 'the movie']
text = [each_string.lower() for each_string in rawData]
newText = [word for word in text if word not in stopwords.words('english')]
print(newText)
current output:
['game', 'the movie']
desired output
['game', 'movie']
I'd prefer to use list comprehension for this.
It took me a while to do this because list comprehensions are not my thing. Anyways, this is how I did it:
import functools
stopwords = ["for", "the"]
rawData = ['for', 'the', 'game', 'the movie']
lst = functools.reduce(lambda x,y: x+y, [i.split() for i in rawData])
newText = [word for word in lst if word not in stopwords]
print(newText)
Basically, the functools.reduce line splits each list item into words (making a nested list) and then flattens that nested list into one dimension.
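For what it's worth, the flattening can also happen inside a single list comprehension with a nested for clause, which avoids the functools.reduce step (a sketch using the same hard-coded stopword list):

```python
stopwords = ["for", "the"]
rawData = ['for', 'the', 'game', 'the movie']

# The first `for` walks the original items, the second walks the
# words inside each item, so multi-word entries are split up too.
newText = [word for item in rawData for word in item.split()
           if word not in stopwords]
print(newText)  # ['game', 'movie']
```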
I have a dictionary and a text:
{"love":1, "expect":2, "annoy":-2}
test="i love you, that is annoying"
I need to remove the words from the string if they appear in the dictionary. I have tried this code:
for k in dict:
    if k in test:
        test = test.replace(k, "")
However the result is:
i you,that is ing
And this is not what I am looking for, as it should not remove "annoy" as a part of the word, the whole word should be evaluated. How can I achieve it?
First, you should not assign to variable names that are also names of built-in classes, such as dict.
Variable test is a string composed of characters. When you say, if k in test:, you will be testing k to see if it is a substring of test. What you want to do is break up test into a list of words and compare k against each complete word in that list. If words are separated by a single space, then they may be "split" with:
test.split(' ')
The only complication is that it will create the following list:
['i', '', 'you,', 'that', 'is', 'annoying']
Note that the third item still has a , in it. So we should first get rid of punctuation marks we might expect to find in our sentence:
test.replace('.', '').replace(',', ' ').split(' ')
Yielding:
['i', '', 'you', '', 'that', 'is', 'annoying']
The following will actually get rid of all punctuation:
import string
test.translate(str.maketrans('', '', string.punctuation))
So now our code becomes:
>>> import string
>>> d = {"love":1, "expect":2, "annoy":-2}
>>> test="i love you, that is annoying"
>>> for k in d:
...     if k in test.translate(str.maketrans('', '', string.punctuation)).split(' '):
...         test = test.replace(k, "")
...
>>> print(test)
i you, that is annoying
>>>
You may now find you have extra spaces in your sentence, but you can figure out how to get rid of those.
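One common way to get rid of those extra spaces is to split on runs of whitespace and rejoin (a sketch on the post-replacement string):

```python
test = "i  you, that is annoying"
# str.split() with no argument splits on any run of whitespace and
# drops empty strings, so rejoining normalizes the spacing.
test = ' '.join(test.split())
print(test)  # i you, that is annoying
```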
you can use this:
query = "i love you, that is annoying"
query = query.replace('.', '').replace(',', '')
my_dict = {"love": 1, "expect": 2, "annoy": -2}
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in my_dict]
result = ' '.join(resultwords)
print(result)
>> i you that is annoying
If you want to exclude all words case-insensitively, convert all keys in my_dict to lowercase:
my_dict = {k.lower(): v for k, v in my_dict.items()}
I have to count all the words in a file and create a histogram of the words. I am using the following python code.
for word in re.split('[,. ]', f2.read()):
    if word not in histogram:
        histogram[word] = 1
    else:
        histogram[word] += 1
f2 is the file I am reading. I tried to parse the file with multiple delimiters, but it still does not work: it counts all strings in the file and makes a histogram, but I only want words. I get results like this:
1-1-3: 3
where "1-1-3" is a string that occurs 3 times. How do I check so that only actual words are counted? Casing does not matter. I also need to repeat this exercise for two-word sequences, so an output that looks like:
and the: 4
where "and the" is a two-word sequence that appears 4 times. How would I group two-word sequences together for counting?
>>> from collections import Counter
>>> from nltk.tokenize import RegexpTokenizer
>>> from nltk import bigrams
>>> from string import punctuation
# preparatory stuff
>>> tokenizer = RegexpTokenizer(r'[^\W\d]+')
>>> my_string = "this is my input string. 12345 1-2-3-4-5. this is my input"
# single words
>>> tokens = tokenizer.tokenize(my_string)
>>> Counter(tokens)
Counter({'this': 2, 'input': 2, 'is': 2, 'my': 2, 'string': 1})
# word pairs
>>> nltk_bigrams = bigrams(my_string.split())
>>> bigrams_list = [' '.join(x).strip(punctuation) for x in list(nltk_bigrams)]
>>> Counter([x for x in bigrams_list if x.replace(' ','').isalpha()])
Counter({'is my': 2, 'this is': 2, 'my input': 2, 'input string': 1})
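If NLTK isn't available, the bigram step can be approximated with just the standard library by zipping the word list against itself shifted by one (same filtering idea as above):

```python
from collections import Counter

text = "this is my input string. 12345 1-2-3-4-5. this is my input"
words = text.split()

# zip(words, words[1:]) yields consecutive word pairs (bigrams);
# strip trailing punctuation, then keep only all-letter pairs.
pairs = [' '.join(p).strip('.') for p in zip(words, words[1:])]
counts = Counter(p for p in pairs if p.replace(' ', '').isalpha())
print(counts)
```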
Assuming you want to count all words in a string, you could do something like this using a defaultdict as a counter:
#!/usr/bin/env python3
# coding: utf-8
from collections import defaultdict
# For the sake of simplicty we are using a string instead of a read file
sentence = "The quick brown fox jumps over the lazy dog. THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG. The quick brown fox"
# Specify the word pairs you want to count as a single phrase
special_pairs = [('the', 'quick')]
# Convert sentence / input to lowercase in order to neglect case sensitivity and print lowercase sentence to double-check
sentence = sentence.lower()
print(sentence)
# Split string into single words
word_list = sentence.split(' ')
print(word_list)
# Since we know all the words in our input sentence, we have to correct the word_list with our word pairs, which
# need to be counted as a single phrase and not as two single words
for pair in special_pairs:
    for index, word in enumerate(word_list):
        if pair[0] == word and pair[1] == word_list[index+1]:
            word_list.remove(pair[0])
            word_list.remove(pair[1])
            word_list.append(' '.join([pair[0], pair[1]]))

d = defaultdict(int)
for word in word_list:
    d[word] += 1
print(d.items())
Output:
the quick brown fox jumps over the lazy dog. the quick brown fox jumps over the lazy dog. the quick brown fox
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox']
dict_items([('lazy', 2), ('dog.', 2), ('fox', 3), ('brown', 3), ('jumps', 2), ('the quick', 3), ('the', 2), ('over', 2)])
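Note that removing from word_list while enumerating it is fragile: list.remove deletes the first matching occurrence, not necessarily the one at the current index, and word_list[index+1] can raise IndexError when the first word of a pair is the last token. A single forward pass that builds a new token list avoids both issues (a sketch, same sentence and pair):

```python
from collections import Counter

sentence = ("The quick brown fox jumps over the lazy dog. "
            "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG. The quick brown fox")
words = sentence.lower().split()
special_pairs = {('the', 'quick')}

# Single pass: greedily join a special pair into one token,
# otherwise emit the word unchanged; never mutate the list being walked.
tokens = []
i = 0
while i < len(words):
    if i + 1 < len(words) and (words[i], words[i + 1]) in special_pairs:
        tokens.append(words[i] + ' ' + words[i + 1])
        i += 2
    else:
        tokens.append(words[i])
        i += 1

counts = Counter(tokens)
print(counts['the quick'], counts['the'])  # 3 2
```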
I would appreciate someone's help on this probably simple matter: I have a long list of words in the form ['word', 'another', 'word', 'and', 'yet', 'another']. I want to compare these words against a list that I specify, i.e. check whether each of my target words is contained in the first list or not.
I would like to output which of my "search" words are contained in the first list and how many times they appear. I tried something like list(set(a).intersection(set(b))), but it splits up the words and compares letters instead.
How can I write in a list of words to compare with the existing long list? And how can I output co-occurences and their frequencies? Thank you so much for your time and help.
>>> lst = ['word', 'another', 'word', 'and', 'yet', 'another']
>>> search = ['word', 'and', 'but']
>>> [(w, lst.count(w)) for w in set(lst) if w in search]
[('and', 1), ('word', 2)]
This code basically iterates through the unique elements of lst and, if the element is in the search list, adds the word, along with its number of occurrences, to the resulting list.
Preprocess your list of words with a Counter:
from collections import Counter
a = ['word', 'another', 'word', 'and', 'yet', 'another']
c = Counter(a)
# c == Counter({'word': 2, 'another': 2, 'and': 1, 'yet': 1})
Now you can iterate over your new list of words and check whether they are contained in this Counter dictionary; the value gives you their number of appearances in the original list:
words = ['word', 'no', 'another']
for w in words:
    print(w, c.get(w, 0))
which prints:
word 2
no 0
another 2
or output it in a list:
[(w, c.get(w, 0)) for w in words]
# returns [('word', 2), ('no', 0), ('another', 2)]