I'm new to python and I'm trying to teach myself how to use it by completing tasks. I am trying to complete the task below and have written the code beneath it. However, my code does not disregard the punctuation of the input sentence and does not store the sentence's words in a list. What do I need to add to it? (keep in mind, I am BRAND NEW to python, so I have very little knowledge)
Develop a program that identifies individual words in a sentence, stores these in a list and replaces each word in the original sentence with the position of that word in the list.
For example, the sentence:
ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR
COUNTRY
contains the words ASK, NOT, WHAT, YOUR, COUNTRY, CAN, DO, FOR, YOU
The sentence can be recreated from the positions of these words in this list using the sequence
1,2,3,4,5,6,7,8,9,1,3,9,6,7,8,4,5
Save the list of words and the positions of these words in the sentence as separate files or as a single
file.
Analyse the requirements for this system and design, develop, test and evaluate a program to:
• identify the individual words in a sentence and store them in a list
• create a list of positions for words in that list
• save these lists as a single file or as separate files.
restart = 'y'
while (True):
    sentence = input("What is your sentence?: ")
    sentence_split = sentence.split()
    sentence2 = [0]
    print(sentence)
    for count, i in enumerate(sentence_split):
        if sentence_split.count(i) < 2:
            sentence2.append(max(sentence2) + 1)
        else:
            sentence2.append(sentence_split.index(i) + 1)
    sentence2.remove(0)
    print(sentence2)
    restart = input("would you like restart the programme y/n?").lower()
    if (restart == "n"):
        print("programme terminated")
        break
    elif (restart == "y"):
        pass
    else:
        print("Please enter y or n")
Since this is several questions in one, here are a few pointers (I won't help you with the file I/O, as that's not really part of the problem).
First, to filter punctuation from a sentence, refer to this question.
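For example, one rough sketch (just one of the approaches covered there) strips the punctuation with str.translate before splitting:

import string

sentence = input("What is your sentence?: ")
# remove punctuation characters such as , . ! ? before splitting into words
cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
sentence_split = cleaned.split()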
Second, in order to get an ordered list of unique words and their first positions, you can use an ordered dictionary. Demonstration:
>>> from collections import OrderedDict
>>> s = 'ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY'
>>> words = s.split()
>>> word2pos = OrderedDict()
>>>
>>> for index, word in enumerate(words, 1):
...     if word not in word2pos:
...         word2pos[word] = index
...
>>> list(word2pos.keys())
['ASK', 'NOT', 'WHAT', 'YOUR', 'COUNTRY', 'CAN', 'DO', 'FOR', 'YOU']
If you are not allowed to use an ordered dictionary you will have to work a little harder and read through the answers of this question.
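One rough sketch of that harder route, keeping a plain list so the first-seen order is preserved:

s = 'ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY'
words = s.split()

unique_words = []                 # first occurrence order is preserved
for word in words:
    if word not in unique_words:
        unique_words.append(word)

positions = [unique_words.index(word) + 1 for word in words]
print(unique_words)   # ['ASK', 'NOT', 'WHAT', 'YOUR', 'COUNTRY', 'CAN', 'DO', 'FOR', 'YOU']
print(positions)      # [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 3, 9, 6, 7, 8, 4, 5]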
Finally, once you have a mapping of words to their first position, no matter how you acquired it, creating the list of positions is straight forward:
>>> [word2pos[word] for word in words]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 3, 9, 6, 7, 8, 4, 5]
You have to consider a few things beforehand, such as what to do with punctuation, as you already noted. Now, considering that you are trying to teach yourself, I will only give you some tips and information that you can look at.
The strip method lets you remove certain characters from the start and end of a string, such as a ',' or a '.'.
The split command will split a string into a list of smaller strings, based on the separator you give it. However, to see the place each word had in the original string you can look at its index in the list. For instance, in your sentence_split list you can get the first word by accessing sentence_split[0] and so forth.
However, considering that words can be repeated this will be a bit trickier, so you might look into something called a dictionary, which is perfect for what you want to do as it allows you to do something like the following:
words = {'Word': 'Ask', 'Position': [1,10]}
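A minimal sketch of that idea (the variable names here are my own), mapping each word to every position where it appears:

sentence = "ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY"
positions = {}
for index, word in enumerate(sentence.split(), 1):
    positions.setdefault(word, []).append(index)

print(positions['ASK'])   # [1, 10]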
Now, if you stick with the simplistic approach (using a list), you can iterate over the list with an index and process each word individually to write it to a file, for instance along the lines of (warning, this is pseudocode):
for index, word in sentence:
    do things with word
    write things to a file
To get a more 'true' starting point, check the spoiler below:
for index, word in enumerate(sentence_split):
    filename = str(word) + ".txt"
    with open(filename, 'w') as fw:
        fw.write("Word: " + str(word) + "\tPlace: " + str(index))
I hope this gets you under way!
I would like to first extract repeating n-grams from within a single sentence using Gensim's Phrases, then use those to get rid of duplicates within sentences. Like so:
Input: "Testing test this test this testing again here testing again here"
Desired output: "Testing test this testing again here"
My code seemed to have worked for generating up to 5-grams using multiple sentences but whenever I pass it a single sentence (even a list full of the same sentence) it doesn't work. If I pass a single sentence, it splits the words into characters. If I pass the list full of the same sentence, it detects nonsense like non-repeating words while not detecting repeating words.
I thought my code was working because I used like 30MB of text and produced very intelligible n-grams up to n=5 that seemed to correspond to what I expected. I have no idea how to tell its precision and recall, though. Here is the full function, which recursively generates all n-grams from 2 to n:
def extract_n_grams(documents, maximum_number_of_words_per_group=2, threshold=10, minimum_count=6, should_print=False, should_use_keywords=False):
    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    tokens = [doc.split(" ") for doc in documents] if type(documents) == list else [documents.split(" ") for _ in range(100)]  # this is what I tried
    final_n_grams = []
    for current_n in range(maximum_number_of_words_per_group - 1):
        n_gram = Phrases(tokens, min_count=minimum_count, threshold=threshold, connector_words=connecting_words)
        n_gram_phraser = Phraser(n_gram)
        resulting_tokens = []
        for token in tokens:
            resulting_tokens.append(n_gram_phraser[token])
        current_n_gram_final = []
        for token in resulting_tokens:
            for word in token:
                if '_' in word:
                    # no n_gram should have a comma between words
                    if ',' not in word:
                        word = word.replace('_', ' ')
                        if word not in current_n_gram_final and all([word not in gram for gram in final_n_grams]):
                            current_n_gram_final.append(word)
        tokens = n_gram[tokens]
        final_n_grams.append(current_n_gram_final)
In addition to trying to repeat the sentence in the list, I also tried using NLTK's word_tokenize as suggested here. What am I doing wrong? Is there an easier approach?
The Gensim Phrases class is designed to statistically detect when certain pairs of words appear so often together, compared to independently, that it might be useful to combine them into a single token.
As such, it's unlikely to be helpful for your example task, of eliminating the duplicate 3-word ['testing', 'again', 'here'] run-of-tokens.
First, it never eliminates tokens – only combines them. So, if it saw the couplet ['again', 'here'] appearing very often together, rather than as separate 'again' and 'here', it'd turn it into 'again_here' – not eliminate it.
But second, it does these combinations not for every repeated n-token grouping, but only if the large amount of training data implies, based on the threshold configured, that certain pairs stick out. (And it only goes beyond pairs if run repeatedly.) Your example 3-word grouping, ['testing', 'again', 'here'], does not seem likely to stick out as a composition of extra-likely pairings.
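To make the combining (rather than eliminating) behaviour concrete, here is a rough toy sketch; the corpus and the low threshold are made up for illustration, and the exact output depends on the collected statistics:

from gensim.models import Phrases

# toy corpus in which "again here" co-occurs very often
sentences = [["again", "here"]] * 50 + [["testing", "this"]] * 5
bigram = Phrases(sentences, min_count=1, threshold=0.1)

print(bigram[["testing", "again", "here"]])
# expect something like ['testing', 'again_here'] -- the frequent pair is
# merged into one token, nothing is removed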
If you have a more rigorous definition of which tokens/runs-of-tokens need to be eliminated, you'd probably want to run other Python code on the lists-of-tokens to enforce that de-duplication. Can you describe in more detail, perhaps with more examples, the kinds of n-grams you want removed? (Will they only be at the beginning or end of a text, or also the middle? Do they have to be next-to each other, or can they be spread throughout the text? Why are such duplicates present in the data, & why is it thought important to remove them?)
Update: Based on the comments about the real goal, a few lines of Python that check, at each position in a token-list, whether the next N tokens match the previous N tokens (and thus can be ignored) should do the trick. For example:
def elide_repeated_ngrams(list_of_tokens):
    return_tokens = []
    i = 0
    while i < len(list_of_tokens):
        # check every candidate repeat length, up to everything collected so far
        for candidate_len in range(1, len(return_tokens) + 1):
            if list_of_tokens[i:i+candidate_len] == return_tokens[-candidate_len:]:
                i = i + candidate_len  # skip the repeat
                break  # begin fresh forward repeat-check
        else:
            # this token not part of any repeat; include & proceed
            return_tokens.append(list_of_tokens[i])
            i += 1
    return return_tokens
On your test case:
>>> elide_repeated_ngrams("Testing test this test this testing again here testing again here".split())
['Testing', 'test', 'this', 'testing', 'again', 'here']
Hi, so I'm currently taking a class and one of our assignments is to create a Hangman AI. There are two parts to this assignment, and currently I am stuck on the first task: given the state of the hangman puzzle and a list of words as a dictionary, filter out non-possible answers to the puzzle from the dictionary. As an example, if the puzzle given is t--t, then we should filter out all words that are not 4 letters long. Following that, we should filter out words which do not have t as their first and last letter (i.e. tent and test are acceptable, but type and help should be removed). Currently, for some reason, my code seems to remove all entries from the dictionary and I have no idea why. Help would be much appreciated. I have also attached my code below for reference.
def filter(self, puzzle):
    wordlist = dictionary.getWords()
    newword = {i: wordlist[i] for i in range(len(wordlist)) if len(wordlist[i]) == len(str(puzzle))}
    string = puzzle.getState()
    array = []
    for i in range(len(newword)):
        n = newword[i]
        for j in range(len(string)):
            if string[j].isalpha() and n[j] == string[j]:
                continue
            elif not string[j].isalpha():
                continue
            else:
                array += i
                print(i)
                break
    array = list(reversed(array))
    for i in array:
        del newword[i]
Some additional information:
puzzle.getState() is a function given to us that returns a string describing the state of the puzzle, e.g. a-a--- or t---t- or elepha--
dictionary.getWords essentially creates a dictionary from the list of words
Thanks!
This may be a bit late, but using regular expressions:
import re

def remove(regularexpression, somewords):
    words_to_remove = []
    for word in somewords:
        #if not re.search(regularexpression, word):
        if re.search(regularexpression, word) == None:
            words_to_remove.append(word)
    for word in words_to_remove:
        somewords.remove(word)
For example, suppose you have the list words = ['hello', 'halls', 'harp', 'heroic', 'tests'].
You can now do:
remove('h[a-zA-Z]{4}$', words) and words would become ['hello', 'halls'], while remove('h[a-zA-Z]{3}s$', words) only leaves words as ['halls']
This line
newword = {i : wordlist[i] for i in range(len(wordlist)) if len(wordlist[i])==len(str(puzzle))}
creates a dictionary with non-contiguous keys. You want only the words that are the same length as puzzle so if your wordlist is ['test', 'type', 'string', 'tent'] and puzzle is 4 letters newword will be {0:'test', 1:'type', 3:'tent'}. You then use for i in range(len(newword)): to iterate over the dictionary based on the length of the dictionary. I'm a bit surprised that you aren't getting a KeyError with your code as written.
I'm not able to test this without the rest of your code, but I think changing the loop to:
for i in newword.keys():
will help.
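If it helps as a cross-check, here is a rough sketch of the same filtering done with a list comprehension instead of index bookkeeping (assuming getWords() gives a list of strings and getState() gives something like 't--t'; the helper name matches is mine):

def matches(word, state):
    # same length, and every revealed letter must line up with the word
    return len(word) == len(state) and all(
        not s.isalpha() or s == w for s, w in zip(state, word)
    )

state = "t--t"                                  # e.g. puzzle.getState()
wordlist = ["tent", "test", "type", "help"]     # e.g. dictionary.getWords()
candidates = [w for w in wordlist if matches(w, state)]
print(candidates)   # ['tent', 'test']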
I'm writing a Python script. I need to search a text file for words that end in "s", "es" or "ies", where the word must be longer than three letters. I need to know the number of such words and the words themselves. It's a hard task and I can't work it out, please help me.
I agree with the comment that you need to go work on the basics. Here are some ideas to get you started.
1) You say "search a file." Open a file and read line by line like this:
with open('myFile.txt', 'r') as infile:
    for line in infile:
        # do something to each line
2) You probably want to store each line in a data structure, like a list:
# before you open the file...
lines = []
# while handling the file:
lines.append(line)
3) You'll need to work with each word. Look into the 'split' method of strings.
4) You'll need to look at individual letters of each word. Look into 'string slicing.'
All said and done, you can probably do this with 10 - 15 lines of code.
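As a rough sketch of how those pieces might fit together (the file name is a placeholder, and "ends in s/es/ies" boils down to "ends in s"):

words_found = []

with open('myFile.txt', 'r') as infile:
    for line in infile:
        for word in line.split():
            # words longer than three letters that end in s (covers es and ies too)
            if len(word) > 3 and word.endswith('s'):
                words_found.append(word)

print(len(words_found))   # how many such words
print(words_found)        # the words themselves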
Try to divide the task into different tasks if it feels overwhelming.
The following code is by no means good, but hopefully it is clear enough so you can get the point.
1 First you need to get your text. If your text is in a file in your computer you need to put it into something that python can use.
# this code takes the content of "text.txt" and stores it into my_text
with open("text.txt") as file:
    my_text = file.read()
2 Now you need to work with every individual word. All your words are together in a string called my_text, and you would like them separated (split) into a list so you can work with them individually. Usually words are separated by spaces, so that's what you use to separate them:
# take the text and split it into words
my_words = my_text.split(" ")
3 I don't know exactly what you want, but let's suppose you want to store separately the words in different lists. Then you will need those lists:
# three list to store the words:
words_s = []
words_es = []
words_ies = []
4 Now you need to iterate through the words and do stuff with them. For that the easiest thing to do is to use a for loop:
#iterate through each word
for word in my_words:
    # you're not interested in short words:
    if len(word) <= 3:
        continue  # this means: do nothing with this word
    # now, if the word's length is greater than 3, you classify it:
    if word.endswith("ies"):
        words_ies.append(word)  # add it to the list
    if word.endswith("es"):
        words_es.append(word)  # add it to the list
    if word.endswith("s"):
        words_s.append(word)  # add it to the list
5 Finally, outside the for loop, you can print the list of words and also get the length of the list:
print(words_s)
print(len(words_s))
Something that you need to consider is whether you want the words repeated or not. Note that the condition 'words that end in "s", "es" or "ies"' is equivalent to 'words that end in "s"'. The code above will distribute the words into the different lists redundantly: if a word ends with "ies" it also ends with "es" and "s", so it'll be stored in all three lists. If you want to avoid that overlap, you can chain the checks with elif instead of separate if statements, as in the sketch below.
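For example, the classification from step 4 could become (same lists and my_words as above, just mutually exclusive branches):

for word in my_words:
    if len(word) <= 3:
        continue
    if word.endswith("ies"):
        words_ies.append(word)     # only here
    elif word.endswith("es"):
        words_es.append(word)      # only if it didn't end in "ies"
    elif word.endswith("s"):
        words_s.append(word)       # only if it didn't end in "es" or "ies"

Each word now lands in exactly one list instead of up to three.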
Keep learning the basics as other answers suggest and soon you'll be able to understand scary code like this :D
with open("text.txt") as myfile:
words = [word for word in myfile.read().split(" ") if word.endswith("s") and len(word) > 3]
print("There are {} words ending with 's' and longer than 3".format(len(words)))
How can I return a duplicated word in a list?
I was asked to create a function, word_count(text, n). The text is converted into a list, and the function should then return the words that are repeated n times. I've tried to write it, but it seems to return every single word.
>>> repeat_word_count("one one was a racehorse two two was one too", 3)
['one']
I've used a for loop with a condition on it. I wanted to post my code, but I'm scared that my school will find the code online :(
I think I know what you are trying to do, and without seeing your code I can't point out exactly where you're going wrong, so I'll take you through creating this function step by step and you should be able to figure out where you went wrong.
I think you are trying to create this:
def function(a, b):
    """where a is a sentence and b is the target number.
    The function will return to you each word in the
    given sentence that occurs exactly b times."""
To do this we have to do the following:
converts the sentence into a list of words and removes punctuation, capitalization, and spaces.
iterates through each unique word in the sentence and prints it if it occurs exactly b times in the sentence
put these together to make a function
so in your example your sentence was "one one was a racehorse two two was one too", and you're looking for all words that occur exactly 3 times, so the function should return the word "one"
We'll look at each step one at a time.
FIRST STEP -
we have to take the sentence or sentences and convert them into a list of words. Since I don't know if you will be using sentences with punctuation and/or capitalization I'll have to assume that its possible and plan to deal with them. We'll have to omit any punctuation/spaces from the list and also change all the letters in each word to lowercase if they happened to have a capital letter because even though "Cat" and "cat" are the same word, according to a computer brain, "Cat" does NOT equal any of these:
"cat" - lowercase c doesn't match uppercase C in "Cat"
" Cat" - there is an extra space at the start of the word
"Cat." - There is a period after the word
"Cat " - There is a space after the word
So if we use "One one was a racehorse two two was one, too." as our input we'll need to handle spaces, punctuation, and capitalization. Luckily all of this work can be done with 2 lines of code by using a regular expression and list comprehension to get rid of all the junk and create a list of words.
import re
wordlist=[i.lower() for i in re.findall(r"[\w']+",sentence)]
This gives us our list of words:
['one', 'one', 'was', 'a', 'racehorse', 'two', 'two', 'was', 'one', 'too']
SECOND STEP -
Now we need to iterate through each unique word in the wordlist and see if it occurs exactly b times. Since we only need unique words, we can create a collection that contains each word exactly once by converting the wordlist from a list to a set, then loop through each word in the set and count the number of times it appears in the wordlist. Any that occur exactly b times are our solutions. I'm not exactly sure how you want the results returned, but I'm going to assume you want each word that fits the criteria printed one at a time.
for word in set(wordlist):
    if wordlist.count(word) == b:
        print(word)
THIRD STEP -
Now I'll put all this together to create my function:
import re

def repeat_word_count(a, b):
    wordlist = [i.lower() for i in re.findall(r"[\w']+", a)]
    for word in set(wordlist):
        if wordlist.count(word) == b:
            print(word)
I hope this helps you understand a bit better
I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations in approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement this in Python.
We try to split the domain name (the argument s) into any number of words (not just 2) from a set of known words (the argument words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, like in ["example", "cart", ...])
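A quick illustration with a tiny hand-made word set (my own example, not the real dictionary built below):

>>> list(substrings_in_set("carttrading", {"car", "cart", "t", "trading"}))
[['car', 't', 'trading'], ['cart', 'trading']]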
Then we build the english dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []

domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []

for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

from collections import Counter
c = Counter(found)  # this is what you want
print c
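The snippet above assumes an all_sub_strings helper that isn't defined anywhere; a minimal sketch of one possible version (in practice you would also strip the newline and the ".com"/".co.uk" part from each domain before calling it):

def all_sub_strings(s):
    # yield every contiguous substring of s, e.g. "abc" -> a, ab, abc, b, bc, c
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            yield s[start:end]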
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)

with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute force method which only tries to split the domains into 2 english words. If the domain doesn't split into 2 english words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you're clever. Fortunately I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1
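For a quick sanity check of guess_split on its own, with a tiny made-up word list instead of /usr/share/dict/words:

>>> words = ['example', 'car', 'cart', 'trading', 'pensions']
>>> guess_split('cartrading')
['car', 'trading']
>>> guess_split('examplecartrading')
[]

The empty result for 'examplecartrading' is the two-word limitation mentioned above: there is no way to split it into exactly two known words, so that domain gets junked.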