Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
Suppose I have a text like
text="I came from the moon. He went to the other room. She went to the drawing room."
The most frequent group of three words here is "went to the".
I know how to find the most frequent bigram or trigram, but I am stuck on this. I want a solution that does not use the NLTK library.
import string
text = "I came from the moon. He went to the other room. She went to the drawing room."
# Replace punctuation with spaces; split() then collapses any run of whitespace
for character in string.punctuation:
    text = text.replace(character, " ")
words = text.split()
# Collect every run of three consecutive words
wordlist = []
for i in range(len(words) - 2):
    wordlist.append([words[i], words[i+1], words[i+2]])
frequency_dict = dict()
for three_words in wordlist:
    frequency = wordlist.count(three_words)
    frequency_dict[", ".join(three_words)] = frequency
best = max(frequency_dict, key=frequency_dict.get)
print(best, frequency_dict[best])
Output: went, to, the 2
Unfortunately lists are not hashable. Otherwise it would help to create a set of the three_words items.
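Tuples, unlike lists, are hashable, so one way around this is to store each trigram as a tuple and count the tuples directly. A minimal sketch, assuming punctuation can simply be dropped and that trigrams may span sentence boundaries; `most_common_trigram` is a name chosen here for illustration:

```python
import re
from collections import Counter


def most_common_trigram(text):
    """Return (trigram_tuple, count) for the most frequent 3-word run, or None."""
    words = re.findall(r"\w+", text)
    # Zipping three staggered views yields every run of three consecutive words
    trigrams = zip(words, words[1:], words[2:])
    counts = Counter(trigrams)
    return counts.most_common(1)[0] if counts else None


text = "I came from the moon. He went to the other room. She went to the drawing room."
print(most_common_trigram(text))  # (('went', 'to', 'the'), 2)
```

Because the keys are tuples, they could equally be placed in a set or used as dictionary keys.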
nltk makes this problem trivial, but since you don't want that dependency, here is a simple implementation using only the standard library. The code works on Python 2.7 and Python 3.x and uses collections.Counter to count n-gram frequencies. Computationally, it is O(NM), where N is the number of words in the text and M is the number of n-gram lengths being counted (so if one were to count unigrams and bigrams, M = 2).
import collections
import re
import sys
import time

# Convert a string to lowercase and split into words (w/o punctuation)
def tokenize(string):
    return re.findall(r'\w+', string.lower())

def count_ngrams(lines, min_length=2, max_length=4):
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams

def print_most_frequent(ngrams, num=10):
    for n in sorted(ngrams):
        print('----- {} most common {}-grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))
        print('')

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ngrams.py filename')
        sys.exit(1)

    start_time = time.time()
    with open(sys.argv[1]) as f:
        ngrams = count_ngrams(f)
    print_most_frequent(ngrams)
    elapsed_time = time.time() - start_time
    print('Took {:.03f} seconds'.format(elapsed_time))
import re
from collections import Counter
text = "I came from the moon. He went to the other room. She went to the drawing room."
fixed_text = re.sub("[^a-zA-Z ]", " ", text)
text_list = fixed_text.split()
# len - 2 start positions, so the final trigram is included
print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list) - 2)).most_common(1))
I guess ... maybe?
>>> import re
>>> from collections import Counter
>>> text = "I came from the moon. He went to the other room. She went to the drawing room."
>>> fixed_text = re.sub("[^a-zA-Z ]", " ", text)
>>> text_list = fixed_text.split()
>>> print(Counter(" ".join(text_list[i:i+3]) for i in range(len(text_list) - 2)).most_common(1))
[('went to the', 2)]
>>>
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
A =["I like apple"]
B =["I like playing basketball"]
C =["how are you doing"]
listt=[A,B,C]
I need to return the string with the most words in the list (in this example, B and C).
First, I want to count the number of words in each string, but my code doesn't work. Does anyone know how to return the string that has the most words?
number = len(re.findall(r'\w+',listt))
Use max:
print(max(A+B+C,key=lambda x: len(x.split())))
If you want to show both:
print([i for i in A+B+C if i == max(A+B+C,key=lambda x: len(x.split()))])
Try this:
[len(i[0].split(' ')) for i in listt]
You can just count the spaces that separate the words: [s for s in sentences if s.count(' ') == max(s.count(' ') for s in sentences)] (use s[0] if each sentence is in its own list, and consider computing the max once beforehand to save time).
If your words can also be separated by punctuation, you will probably want to use re, as in your example, calling findall on every sentence, like this:
import re
pattern = re.compile(r'\w+')
# note I changed some stuff to have words only separated by punctuation
sentences = [["I like:apple"], ["I (really)like playing basketball"], ["how are you doing"]]
current_s = []
current_len = 0
for s in sentences:
    no = len(pattern.findall(s[0]))  # [0] because you have each sentence in a separate list
    if no == current_len:
        current_s.append(s[0])  # store the string itself so join() works below
    elif no > current_len:
        current_s = [s[0]]
        current_len = no
print('the most number of words is', current_len)
print('\n'.join(current_s))
Split the sentences and order by number of words:
A = ["I like apple"]
B = ["I like playing basketball"]
C = ["how are you doing today?"]
sorted_sentences = sorted([(sentence, len(sentence[0].split(' '))) for sentence in [A, B, C]],
                          key=lambda x: x[1], reverse=True)
print('The sentence: "{}" has the maximum number of words: {}'.format(sorted_sentences[0][0],sorted_sentences[0][1]))
Output
The sentence: "['how are you doing today?']" has the maximum number of words: 5
After some research, I found a solution using the max function: use max to find the largest word count, then print the sentences that have that many words.
A ="I like apple"
B ="I like playing basketball"
C ="how are you doing"
sentences=[A,B,C]
max_words=max(len(x.split(' ')) for x in sentences)
print([i for i in sentences if len(i.split(' '))==max_words])
Basic Python logic, without any special functions:
A ="I like apple"
B ="I like playing basketball"
C ="how are you doing"
sentences=[A,B,C]
max_words=0
# First pass: find the largest word count
for sentence in sentences:
    if len(sentence.split(' '))>max_words:
        max_words=len(sentence.split(' '))
# Second pass: print every sentence with that many words
for sentence in sentences:
    if len(sentence.split(' '))==max_words:
        print (sentence)
A = ["I like apple"]
B = ["I like playing basketball"]
C = ["how are you doing"]
items = A + B + C
longest_length = len(max(items, key=lambda k: len(k.split())).split())
result = [i for i in items if len(i.split()) == longest_length]
print(result)
Output:
['I like playing basketball', 'how are you doing']
Note that, by default, max compares strings alphabetically (lexicographically) rather than by length or word count.
That is why I pass the key function key=lambda k: len(k.split()), which changes the comparison to the number of words in each string.
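To make the difference concrete, here is a small sketch comparing the default behaviour of max with the keyed version, using the three sentences from the question (the variable names `by_alphabet` and `by_words` are just for illustration):

```python
items = ["I like apple", "I like playing basketball", "how are you doing"]

# Default comparison is lexicographic: lowercase 'h' sorts after uppercase 'I'
by_alphabet = max(items)
print(by_alphabet)  # how are you doing

# With a key, max compares word counts; on a tie it returns the first maximum
by_words = max(items, key=lambda k: len(k.split()))
print(by_words)  # I like playing basketball
```

Since max returns only the first of several tied items, the list comprehension is still needed to collect every sentence with the top word count.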
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a file with one sentence. I let the user choose the number of 'rows' and 'columns'. I want to check how many times I can write the sentence in this kind of table without splitting words. I would like the text to be laid out like this:
Input:
rows=3
columns=10
sentence from file: Cat has dog.
Output:
Cat has***
dog. Cat**
has dog.**
The program can't split words, and it should put stars in the places where a word can't fit. Here is the part of the code I wrote, but I feel I am not going in the right direction.
My questions:
1. How can I improve my code?
2. How to make it count chars but also words?
3. General tips for this task.
My code:
import sys
columns, rows, path = sys.argv[1:]
columns=int(columns)
rows=int(rows)
# Use a context manager so the file is closed; avoid shadowing the built-in list
with open(path,"r") as file:
    text=file.read()
words=text.split()
length=len(words)
for i in range(length):
    k=len(words[i])
    if k<=columns:
        print(words[i], end=" ")
    else:
        print("*")
This was tougher than I thought it would be. There might be an easier solution out there but you can try this:
list_of_words = text.split()
current_character_count_for_row = 0
current_string_for_row = ""
current_row = 1
how_many_times_has_sentence_been_written = 0
is_space_available = True
# Keep going until told to stop.
while is_space_available:
    for word in list_of_words:
        # If a word is too long for a row, then return false.
        if len(word) > columns:
            is_space_available = False
            break
        # Check if we can add word to row.
        if len(word) + current_character_count_for_row < columns:
            # If at start of row, then just add word if short enough.
            if current_character_count_for_row == 0:
                current_string_for_row = current_string_for_row + word
                current_character_count_for_row += len(word)
            # otherwise, add word with a space before it.
            else:
                current_string_for_row = current_string_for_row + " " + word
                current_character_count_for_row += len(word) + 1
        # Word doesn't fit into row.
        else:
            # Fill rest of current row with *'s.
            current_string_for_row = current_string_for_row + "*"*(columns - current_character_count_for_row)
            # Print it.
            print(current_string_for_row)
            # Break if on final row.
            if current_row == rows:
                is_space_available = False
                break
            # Otherwise start a new row with the word
            current_row += 1
            current_character_count_for_row = len(word)
            current_string_for_row = word
            if current_row > rows:
                is_space_available = False
                break
    # Have got to end of words. Increment the count, unless we've gone over the row count.
    if is_space_available:
        how_many_times_has_sentence_been_written += 1
print(how_many_times_has_sentence_been_written)
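The same idea can be packaged as a function that returns the rows instead of printing them, which makes it easier to check. This is only a sketch under a few assumptions: `fill_table` is a hypothetical name, a copy of the sentence counts only when all of its words landed in a finished row, and a word exactly as wide as the row is allowed to occupy it alone.

```python
def fill_table(sentence, rows, columns):
    """Lay out repeated copies of `sentence` in a rows x columns grid.

    Returns (lines, complete_copies). Words are never split and the
    unused cells at the end of each row are filled with '*'.
    """
    words = sentence.split()
    if rows <= 0 or not words or any(len(w) > columns for w in words):
        return [], 0  # nothing to place, or some word can never fit

    lines = []
    current = ""
    placed = 0  # total number of words placed so far
    while True:
        word = words[placed % len(words)]  # cycle through the sentence
        candidate = word if not current else current + " " + word
        if len(candidate) <= columns:
            current = candidate
            placed += 1
            continue
        # The word does not fit: pad the row with stars and close it
        lines.append(current.ljust(columns, "*"))
        current = ""
        if len(lines) == rows:
            break
    return lines, placed // len(words)


lines, copies = fill_table("Cat has dog.", rows=3, columns=10)
print("\n".join(lines))
print(copies)  # 2
```

For the example from the question this yields the rows `Cat has***`, `dog. Cat**` and `has dog.**`.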
This question already has answers here:
How to find the most similar word in a list in python
(2 answers)
Closed 6 years ago.
word = "work"
word_set = {"word","look","wrap","pork"}
How can I find the similar words, given that both "word" and "pork" need only one letter changed to become "work"?
I am wondering whether there is a method to find the difference between a string and each item in a set.
Use difflib.get_close_matches() from the standard library:
import difflib
word = "work"
word_set = {"word","look","wrap","pork"}
difflib.get_close_matches(word, word_set)
returns:
['word', 'pork']
EDIT: If needed, difflib.SequenceMatcher.get_opcodes() can be used to get a rough measure of the edit distance (it counts the non-matching opcode blocks, not individual characters):
matcher = difflib.SequenceMatcher(b=word)
for test_word in word_set:
    matcher.set_seq1(test_word)
    distance = len([m for m in matcher.get_opcodes() if m[0]!='equal'])
    print(distance, test_word)
You could do something like:
word = "work"
word_set = set(["word","look","wrap","pork"])
for example in word_set:
    if len(example) != len(word):
        continue
    num_chars_out = sum([1 for c1,c2 in zip(example, word) if c1 != c2])
    if num_chars_out == 1:
        print(example)
I would recommend the editdistance Python package, which provides an editdistance.eval function that calculates the number of characters you need to change to get from the first word to the second word. Edit distance is the same as Levenshtein distance, which was suggested by MattDMo.
In your case, if you want to identify words within 1 edit distance of each other, you could do:
import editdistance as ed
thresh = 1
w1 = "work"
word_set = set(["word","look","wrap","pork"])
neighboring_words = [w2 for w2 in word_set if ed.eval(w1, w2) <= thresh]
print(neighboring_words)
with neighboring_words evaluating to ['pork', 'word'].
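If installing a third-party package is not an option, Levenshtein distance is short enough to write by hand. The following is a plain dynamic-programming sketch using only the standard language; the function name `levenshtein` is chosen here for illustration and is not part of any library mentioned above.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


word = "work"
word_set = {"word", "look", "wrap", "pork"}
# sorted() gives a stable order, since sets are unordered
print(sorted(w for w in word_set if levenshtein(word, w) <= 1))  # ['pork', 'word']
```

Unlike the same-length zip comparison, this also handles words that differ by an inserted or deleted letter.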
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
A couple days ago, I made a program that allowed me to pick a letter from a string and it would tell me how many times a chosen letter appeared. Now I want to use that code to create a program that takes all the letters and counts how many times each letter appears. For example, if I put "dog" in as my string, I would want the program to say that d appears once, o appears once, and g appears once. Here is my current code below.
from collections import Counter
import string
pickedletter= ()
count = 0
word = ()
def count_letters(word):
    global count
    wordsList = word.split()
    for words in wordsList:
        if words == pickedletter:
            count = count+1
    return count
word = input("what do you want to type? ")
pickedletter = input("what letter do you want to pick? ")
print (word.count(pickedletter))
from collections import Counter
def count_letters(word):
    counts = Counter(word)
    for char in sorted(counts):
        print(char, "appears", counts[char], "times in", word)
I'm not sure why you're importing anything for this, especially Counter. This is the approach I would use:
def count_letters(s):
    """Count the number of times each letter appears in the
    provided string.
    """
    results = {}  # Results dictionary.
    for x in s:
        if x.isalpha():
            try:
                results[x.lower()] += 1  # Case insensitive.
            except KeyError:
                results[x.lower()] = 1
    return results

if __name__ == '__main__':
    s = 'The quick brown fox jumps over the lazy dog and the cow jumps over the moon.'
    results = count_letters(s)
    print(results)
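For comparison, the same case-insensitive tally can also be expressed with collections.Counter, which the question already imports. This is only an alternative sketch; `count_letters_counter` is an illustrative name, and it is meant to produce the same letter counts as the dictionary version above.

```python
from collections import Counter


def count_letters_counter(s):
    """Case-insensitive letter counts, ignoring anything that is not a letter."""
    return Counter(ch.lower() for ch in s if ch.isalpha())


counts = count_letters_counter("dog")
for letter in sorted(counts):
    print(letter, "appears", counts[letter], "time(s)")
```

Which form to prefer is largely a matter of taste: the plain dictionary avoids an import, while Counter adds helpers such as most_common.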
I have a small python script I am working on for a class homework assignment. The script reads a file and prints the 10 most frequent and infrequent words and their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine, however the third part of the assignment is to print the total number of unique words in the document. Unique words meaning count every word in the document, only once.
Without changing my current script too much, how can I count all the words in the document only one time?
p.s. I am using Python 2.6 so please don't mention the use of collections.Counter
from string import punctuation
from collections import defaultdict
import re

number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count

"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1

# Most Frequent Words
top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]
print "Most Frequent Words: "
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]
print " "
print "Least Frequent Words: "
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique
Count the number of keys in your counter dictionary:
total_unique = len(counter.keys())
Or more simply:
total_unique = len(counter)
A defaultdict is great, but it might be more than what you need. You do need it for the part about the most frequent words, but if that requirement were absent, a defaultdict would be overkill. In that situation, I would suggest using a set instead:
words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)
num_unique_words = len(words)
Now words contains only unique words.
I am only posting this because you say that you are new to Python, so I want to make sure that you are aware of sets as well. Again, for your purposes, a defaultdict works fine and is justified.
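To see that the two approaches agree, the sketch below runs both over the same input. It assumes a small in-memory list of lines (`lines`, made up here) as a stand-in for the file, and it is written for Python 3 rather than the Python 2.6 used in the question.

```python
import re
from collections import defaultdict
from string import punctuation

words_only = re.compile(r'^[a-z]{2,}$')
lines = ["The cat saw the dog.", "A dog saw the cat!"]  # stand-in for the file

counter = defaultdict(int)  # word -> frequency
unique = set()              # distinct words only
for line in lines:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):  # keeps words of 2+ letters, so "a" is skipped
            counter[word] += 1
            unique.add(word)

# Both containers hold one entry per distinct word
print(len(counter), len(unique))
```

The number of keys in the dictionary and the size of the set are the same by construction, so len(counter) is enough when the frequencies are being collected anyway.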