Most frequent words in Python


I was trying to write code that would find the 10 most frequent words in a text. I'm new to Python and am more used to languages like C#, Java, or even C++. Here is what I did:
f = open("bigtext.txt","r")
word_count = {}
Basically, my idea is to create a dictionary that contains the number of times each word is present in my text. If the word is not present, I add it to the dictionary with the value 1. If the word is already present in the dictionary, I increment its value by 1.
for x in f.read().split():
    if x not in word_count:
        word_count[x] = 1
    else:
        word_count[x] += 1
sorted(word_count.values)
Here, I sort my dictionary by values (since I'm looking for the 10 most frequent words, I need the 10 words with the biggest values).
for keys,values in word_count.items():
    values = values + 1
    print(word_count[-values])
    if values == 10:
        break
Here is the part where it all fails. Since I sorted my dictionary by value, I assumed that my 10 most frequent words are the last 10 elements of my dictionary, and I want to display those. So I decided to start values at 1 and walk the dictionary backward until values = 10, so that I don't display more than I need. Unfortunately, I get the following error:
File "<ipython-input-19-f5241b4c239c>", line 13
for keys,values in word_count.items()
^
SyntaxError: invalid syntax
I know my mistake is that I'm not displaying my dictionary backward correctly, but I don't know how else to proceed. So if someone can tell me how to properly display the last 10 elements of my dictionary, I would very much appreciate it. Thank you.

If you didn’t want to use collections.Counter, you could do something like this:
for word, count in sorted(word_count.items(), key=lambda x: -x[1])[:10]:
    print(word, count)
This gets all the words in the dictionary, along with their counts, into a list of tuples; sorts that list by the second item in each tuple (the count) in descending order; and then prints only the first (i.e. highest) ten of those.
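For comparison, the collections.Counter route mentioned above might look like the following minimal sketch. The sample text is made up for illustration; in the question it would come from reading bigtext.txt.

```python
from collections import Counter

# Hypothetical sample text; the question reads its text from "bigtext.txt".
text = "the cat and the dog and the bird"

# Counter tallies each whitespace-separated word.
word_count = Counter(text.split())

# most_common(n) returns up to n (word, count) pairs, highest count first.
top_words = word_count.most_common(10)
for word, count in top_words:
    print(word, count)
```

Because Counter is a dict subclass, the rest of the question's code that reads word_count[x] would keep working unchanged.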

I would like to address a big thank you to Ben who told me that I can't sort a dictionary like that.
So this is my final solution (hoping it helps someone else):
my_words = []
for keys, values in word_count.items():
    my_words.append((values, keys))
I created a list and added each count from my dictionary to it, paired with its word.
my_words.sort(reverse=True)
I then sorted the list by count in reverse order (so that my 10 most frequent words would be the first 10 elements of the list).
print("The 10 most frequent words in this text are:")
print()
for count, word in my_words[:10]:
    print(count, word)
I then simply displayed the first 10 elements of my list.
I would also like to thank all of you who told me about NLTK. I will try it later to have a more optimal and accurate solution.
Thank You so much for your help.

Related

How to get the count of a few specific words that occur in a text

Hello everyone. For my problem I have to input a text and a few words.
The result I need is for the program to show how many times each word occurs in the text. An example of the expected input and output is shown at the bottom of this post.
The code I have at the moment is this:
tekst = str(input().lower())
wordsToCount = str(input().lower())
D = dict()
words = tekst.split()
wordsToCount = wordsToCount.split()
for word in words:
    for wordToCount in wordsToCount:
        if wordToCount == word:
            if wordToCount not in D:
                D[wordToCount] = 1
            else:
                D[wordToCount] += 1
for key in list(D.keys()):
    print(D[key])
With output
3
1
2
1
This seems really close, but it prints the count for "or" first instead of "reading", because "or" comes first in the text.
INPUT:
Alice was beginning to get very tired of sitting by her sisiter on the
bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thougt Alice
without pictures or conversations
-----
reading
or
what
pictures
Output:
1
3
1
2

You are relying on the default dictionary key ordering. If you need the results printed in a specific order, you need to do that yourself.
for word in wordsToCount:
    print(D[word])

The collections module has a Counter class that is exactly what you need for this. It's a dict subclass whose initialization takes an iterable containing items to be counted. The resulting dictionary has the count of each word. You can use a generator expression to filter the words being counted.
from collections import Counter
# same input and initialization of your variables
D = Counter(w for w in words if w in wordsToCount)
You can also make wordsToCount a set to make things a little faster.
The order you see depends on your Python version. CPython 3.6 keeps dictionary keys in insertion order as an implementation detail, and Python 3.7 guarantees it; earlier versions make no ordering promise. In every case, the dict's order is not the order of wordsToCount, so get the keys from wordsToCount instead of from the dict (and keep it a list so its order is retained). Since some wordsToCount may not appear in the original tekst, check whether each word is in the dictionary before printing.
for key in wordsToCount:
    if key in D:
        print(D[key])
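Putting the pieces of this answer together, a minimal sketch might look like this. The function name count_selected is hypothetical, the text is a shortened version of the question's input, and reporting 0 for absent words (rather than skipping them) is an assumption about the desired output.

```python
from collections import Counter

def count_selected(text, words_to_count):
    """Return the count of each requested word, in the order requested."""
    wanted = set(words_to_count)  # set membership tests are fast
    counts = Counter(w for w in text.lower().split() if w in wanted)
    # A Counter returns 0 for a missing key, so absent words are reported as 0.
    return [counts[w] for w in words_to_count]

# Shortened, hypothetical version of the question's input text.
text = ("she had peeped into the book her sister was reading but it had "
        "no pictures or conversations in it and what is the use of a book "
        "without pictures or conversations")
result = count_selected(text, ["reading", "or", "what", "pictures"])
for n in result:
    print(n)
```

Iterating over the requested words, not over the Counter, is what fixes the output order regardless of Python version.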

How to find the top 10 words from a given string at any given point of time. Python

Let's say I have a string s, and words keep being added to it over time. I have to maintain the top 10 recurring words at any given point in time.
My approach is to create a dictionary of key-value pairs, as shown below.
dic = {}
for i in s:
    if i in dic:
        dic[i] +=1
    else:
        dic[i] = 1
Now I want to maintain the frequencies of the top 10 words in the above dictionary.
The possible ways I can think of are:
I can sort the dictionary after every iteration, but that results in high complexity, as the dictionary may contain millions of records.
I can use Counter from collections, but I don't want to use any built-in helper.
I want the program to work in linear time. I know this question has been asked before, but I was not able to find a linear solution.

You don't need to sort the entire dict every time. Just check whether the incremented value is now larger than the existing 10th largest value.
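One way to read that suggestion is sketched below. The function name update_top and its parameters are hypothetical, and ties between equally frequent words are broken arbitrarily. Each update scans at most k tracked entries, so for a fixed k the total work grows linearly with the number of words.

```python
def update_top(word, counts, top, k=10):
    """Count one more occurrence of `word` and keep `top` as the k most frequent."""
    counts[word] = counts.get(word, 0) + 1
    if word in top:
        top[word] = counts[word]          # already tracked: refresh its count
    elif len(top) < k:
        top[word] = counts[word]          # room left: just add it
    else:
        weakest = min(top, key=top.get)   # scan of at most k entries
        if counts[word] > top[weakest]:
            del top[weakest]              # evict the current k-th word
            top[word] = counts[word]

counts, top = {}, {}
for w in "a b a c b a d".split():
    update_top(w, counts, top, k=2)
print(top)
```

The call to min could be replaced by a hand-written loop over top if built-ins are off limits; the cost per update stays bounded by k either way.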

This is how I am able to achieve what I was asking.
file = open('read.txt','r')
text = file.read().split()
word_cout = {}
top_words = {}
for i in text:
    if i in word_cout:
        word_cout[i] +=1
    else:
        word_cout[i] = 1
    if i not in top_words:
        for key in top_words.copy():
            if top_words[key] < word_cout[i]:
                top_words.pop(key)
                top_words[i] = word_cout[i]
                break
    if(len(top_words) < 10):
        top_words[i] = word_cout[i]
    # print(word_cout)
print((top_words))

Converting list (array values) into integers and then print out the input which the array holds as the converted integers not words

firstsentence = input('Enter a sentence: ')
firstsentence = firstsentence.lower()
words = firstsentence.split(' ')
my_list = []
my_list.append(words)
for (i, firstsentence) in enumerate(my_list):
    if (aword == my_list): #This is where i get confused
        print(str(i+1)+ ' position')
        print (my_list)
What I am trying to do is ask the user for an input, "firstsentence". I then try to split the sentence into words and store them in the array (list) my_list. I am stuck on how to convert the separated words into numbers (their positions in the sentence) and then print out the input ("firstsentence") as the integer version. 'aword' is what I used to try to identify whether a word was repeated in the sentence. This is quite confusing, so here is an example:
"I like to do coding because I like it and i like it a lot"
There are 15 words here. Of these 15, only 10 are different. So repeated words MUST be given the same integer value in the array (even though their positions are different, the actual word is the same). The output I want for the sentence above should look like this:
1 2 3 4 5 6 1 2 7 8 1 2 7 9 10. I am really confused about how to store the separated words as different values (integers), especially how to store repeated words as one value, and then how to print the whole thing out. Thanks to those who help. It would be much appreciated :-).

My answer is deliberately incomplete. It's clear you're new at this, and part of being new is discovering how code works. But you're missing some basic concepts about what to do with lists in Python, and this should point you in the right direction.
firstsentence = input('Enter a sentence: ')
firstsentence = firstsentence.lower()
words = firstsentence.split(' ')
my_list = []
for word in words:
    print(word in my_list)
    # You can do something more useful than print here, right?
    # Maybe add the specific word to my_list...
for word in words:
    print(my_list.index(word))
    # This gives you the index, but you need to figure out how to output it the way you want
    # Note that indices in python start at 0 while your sample output starts
    # at 1. So maybe do some addition somewhere in here.
So, what's happening here?
We're doing two passes.
Generally, you'll get very far in programming if you can decompose a problem into discrete steps. In this case, there are two steps to your problem - find the index of the first occurrence of each word, and then output the right indices.
In the first pass, you'll create a list of unique words, where their ordering reflects their ordering in the sentence.
In the second pass, you'll go through the original sentence and look up the place where it first occurred, and format that the way you need.
If you're really attentive, you'll realize: you could easily do this in one pass. What does the second pass require? Just that the word you're looking for be in my_list. As long as you meet that requirement, you could combine the two loops. This is a good thing to do - it may not matter when you're looking at 20 words, but what if you were looking at 20,000,000,000 words? You'd really only want one loop. But take small steps. Start by figuring out how to do it with two loops, and then maybe move on to putting it all in one.
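For readers who want to check their own attempt, here is one possible one-pass version. It is a sketch, not the only way to do it: the function name word_positions is hypothetical, and it swaps the list lookup (my_list.index) for a dictionary so that each lookup stays fast on long inputs.

```python
def word_positions(sentence):
    """Give each distinct word a 1-based number in order of first appearance."""
    first_seen = {}   # word -> number assigned when it was first encountered
    result = []
    for word in sentence.lower().split():
        if word not in first_seen:
            # A new word gets the next unused number.
            first_seen[word] = len(first_seen) + 1
        result.append(first_seen[word])
    return result

numbers = word_positions("I like to do coding because I like it and i like it a lot")
print(" ".join(str(n) for n in numbers))
# prints: 1 2 3 4 5 6 1 2 7 8 1 2 7 9 10
```

The single loop both records new words and looks up known ones, which is the combination of the two passes described above.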

How can I get a ranked list from a dictionary?

I am working in Python. The dictionary I have looks like this:
score = {'a':{4:'c', 3:'d'}, 'b':{6:'c', 3:'d'}}
And I need to order it like this:
rank = [{a:3, b:6}, {a:4, b:3}]
Where the sub-dictionary with the greatest combination of exclusive key values is the first element, the second greatest combination is the second element, and so forth. The greatest-combination logic would be: 1. Grab the biggest combination (total sum) of keys from each dictionary (in this case it would be a->4:'c' and b->6:'d'). Remove those values from the dictionary and grab the next biggest combination of keys (in this case, it would be a->4:'c' and b->3:'d'). This should continue until the original dictionary is empty.
It is exclusive because once a value has been used from the original dict, it should be removed, or excluded from being used again in any future combinations.
I have tried all the different approaches I know, but algorithmically I am missing something.

I think I made what you're looking for? It's a weird algorithm, and it's kinda dirty due to the try/except block, but it works.
Edit: added comments and removed unneeded code.
def rank(toSort):
    #importing from the string library
    from string import ascii_lowercase as alph
    #temporary list
    _ranks=[]
    #populate with empty dictionaries
    for i in range(len(toSort)):
        _ranks.append({})
    #the actual sorting algorithm
    for i in range(len(toSort)-1):
        #iterate all k/v pairs in the supplied dictionary
        for k,v in toSort.items():
            #iterate all k/v pairs in v element
            for a,b in v.items():
                #if the alpha index of an element is equal to
                #the max alpha index of elements in its containing dictionary...
                if alph.index(b)==max(map(alph.index,v.values())):
                    _ranks[i][k]=a
                #if it isn't..
                else:
                    try:
                        _ranks[i+1][k]=a
                    except IndexError:
                        _ranks[-1][k]=a
    return _ranks

Sorting a concordance?

For my homework, I need to isolate the most frequent 50 words in a text. I have tried a whole lot of things, and in my most recent attempt, I have done a concordance using this:
concordance = {}
lineno = 0
for line in vocab:
    lineno = lineno + 1
    words = re.findall(r'[A-Za-z][A-Za-z\'\-]*', line)
    for word in words:
        word = word.title()
        if word in concordance:
            concordance[word].append(lineno)
        else:
            concordance[word] = [lineno]
listing = []
for key in sorted(concordance.keys()):
    listing.append( [key, concordance[key] ])
What I would like to know is whether I can sort the subsequent concordance in order of most frequently used word to least frequently used word, and then isolate and print the top 50? I am not permitted to import any modules other than re and sys, and I'm struggling to come up with a solution.

sorted is a builtin which does not require import. Try something like:
sorted(concordance.items(), key=lambda kv: len(kv[1]), reverse=True)[:50]
Not tested, but you get the idea.
The key sorts by the number of line entries recorded for each word (its frequency), and reverse=True puts the most frequent words first. sorted returns a list, so you can slice it directly.
There are probably slightly more efficient ways to take the first 50, but I doubt it matters here.

A few hints:
Use enumerate(list) in your for loop to get the line number and the line at once.
Try using \w for word characters in your regular expression instead of listing [A-Za-z...].
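Read about the dict.items() method. It returns the (key, value) pairs (a list in Python 2, a view in Python 3, which you can wrap in list()).
Manipulate that list with list.sort(key=function_returning_a_sort_key). The key function takes one item and returns the value to sort by.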
You can define that function with a lambda, but it is not necessary.
Use the len(list) function to get the length of the list. You can use it to get the number of matches of a word (which are stored in a list).
UPDATE: Oh yeah, and use slices to get a part of the resulting list. list[:50] to get the first 50 items (equivalent to list[0:50]), and list[5:10] to get the items from index 5 inclusive to index 10 exclusive.
To print them, loop through the resulting list, then print every word. Alternatively, you can use something similar to print('[separator]'.join(list)) to print a string with all the items separated by '[separator]'.
Good luck.
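The hints above could be combined roughly as follows. This is a sketch under a few assumptions: the function name top_words and the sample lines are hypothetical, and the regular expression is kept from the question instead of switching to \w (which would also match digits and underscores).

```python
import re

def top_words(lines, n=50):
    """Return the n most frequent words as (word, line_numbers) pairs."""
    concordance = {}
    # enumerate yields (line number, line); start=1 makes numbering 1-based.
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z][A-Za-z'\-]*", line):
            concordance.setdefault(word.title(), []).append(lineno)
    # Each word's list has one entry per occurrence, so its length is the frequency.
    ranked = sorted(concordance.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ranked[:n]

sample = ["the cat sat", "the cat ran", "the end"]  # hypothetical input
best = top_words(sample, n=2)
for word, linenos in best:
    print(word, len(linenos), linenos)
```

Only re is imported, which stays within the homework's restriction on modules.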
