Python: How to count the keywords in a py file?

I'm trying to count the number of keywords in another .py file.
Here's what I made:
import keyword
infile = open('xx.py', 'r')
contentbyword = infile.read().split()
num_of_keywords = 0
for word in contentbyword:
    if keyword.iskeyword(word) or keyword.iskeyword(word.replace(':', '')):
        num_of_keywords += 1
I know it's buggy: even if a keyword is inside a quote or after a # sign, it still gets counted.
So what is a better way to count the orange-highlighted words (IDLE default) in Python? Many thanks <(_ _)>

The correct way to do this is with the tokenize module, which takes care of all the edge cases (strings, comments, and so on).
import token
import keyword
import tokenize

s = open('hi.py').readline
counter = 0
l = []
for tok in tokenize.generate_tokens(s):
    if tok.type == token.NAME and keyword.iskeyword(tok.string):
        counter += 1
        l.append(tok.string)
print(counter)
print(l)
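To make the tokenize approach reusable, the token loop can be wrapped in a small function; the function name and the use of io.StringIO (so it also works on in-memory source, not just files) are my own additions:

```python
import io
import keyword
import token
import tokenize

def count_keywords(source):
    """Count Python keywords in a source string, ignoring strings and comments."""
    count = 0
    # generate_tokens wants a readline callable, which StringIO provides
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            count += 1
    return count
```

Because string literals arrive as STRING tokens and comments as COMMENT tokens, keywords inside them are never misclassified as NAME tokens.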

You should consider using Counter; here is a snippet that collects all keywords in a file according to a list of keywords:
import re
from collections import Counter

def get_kws(file_in, keywords_list):
    with open(file_in) as fin:
        # load all content => not suitable for large files
        content = fin.read()
    # split on non-word characters
    words = re.split(r"\W", content)
    counter = Counter(words)
    for word, c in counter.items():
        if word and word in keywords_list:
            yield word, c
EDIT: for the list of Python keywords:
keywords_list = ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'else', 'if', 'pass', 'yield', 'break', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'continue', 'finally', 'is', 'return', 'def', 'for', 'lambda', 'try']
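Rather than maintaining the list by hand (the one above is Python 2's: it still contains print and exec), the stdlib exposes the running interpreter's keywords directly; a minimal sketch:

```python
import keyword

# keyword.kwlist always matches the running interpreter, so it picks up
# additions like 'nonlocal' on Python 3 and drops 'print'/'exec',
# which became built-in functions.
keywords_list = keyword.kwlist
```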

The code below should do for Python 2.7 code. Note that the regexes are rough: they only strip single-line, double-quoted strings and # comments.
import keyword
import re

handle = open("asdf.py", "r")
data = handle.read()
data = re.sub(r'".*?"', "", data)  # strip (single-line) double-quoted strings
data = re.sub(r'#.*', "", data)    # strip comments
mystr = data.split()
mykeys = keyword.kwlist
count = 0
for i in mystr:
    i = re.sub(r':', '', i)
    if i in mykeys:
        print i
        count = count + 1
print count
Replace the filename as needed, cheers!


Print the dictionary value and key(index) in python

I am trying to write code which inputs a line from the user, splits it, and feeds it to a majestic dictionary named counts. All is well until we ask her majesty for some data. I want the data in a format such that the word is printed first and the number of times it repeats is printed next to it. Below is the code I managed to write.
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('Counting:')
for word in words:
    counts[word] = counts.get(word, 0) + 1
for wording in counts:
    print('trying', counts[wording], '')
When it executes, its output is unforgivable.
Words: ['You', 'will', 'always', 'only', 'get', 'an', 'indent', 'error', 'if', 'there', 'is', 'actually', 'an', 'indent', 'error.', 'Double', 'check', 'that', 'your', 'final', 'line', 'is', 'indented', 'the', 'same', 'was', 'as', 'the', 'other', 'lines', '--', 'either', 'with', 'spaces', 'or', 'with', 'tabs.', 'Most', 'likely,', 'some', 'of', 'the', 'lines', 'had', 'spaces', '(or', 'tabs)', 'and', 'the', 'other', 'line', 'had', 'tabs', '(or', 'spaces).']
Counting:
trying 1
trying 1
trying 1
trying 1
trying 1
trying 2
trying 2
trying 1
trying 1
trying 1
trying 2
trying 1
trying 1
trying 1
trying 1
trying 1
trying 1
trying 1
trying 2
It just prints "trying" and the number of times each word is repeated, without the word itself (I think it is called the key in a dictionary; correct me if I am wrong).
Thank you.
Please help me, and when replying to this question please keep in mind I am a newbie, both to Python and Stack Overflow.
Nowhere in your code do you attempt to print the word. How did you expect it to appear in the output? If you want the word, put it in the list of things to print:
print(wording, counts[wording])
For more education, look up the collections package and use the Counter class.
counts = Counter(words)
will do all of your word counts for you.
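A runnable sketch of the Counter suggestion (the sample input is my own):

```python
from collections import Counter

words = 'the quick brown fox jumps over the lazy dog the end'.split()
counts = Counter(words)  # a dict subclass mapping word -> occurrences

# Counter supports normal dict iteration, so printing word and count works
for word, count in counts.items():
    print(word, count)
```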
I'm confuzled as to why you print trying.
Try this instead.
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('Counting:')
for word in words:
    counts[word] = counts.get(word, 0) + 1
for wording in counts:
    print(wording, counts[wording], '')
You should use counts.items() to iterate over the key and value of counts as follows:
counts = dict()
print('Enter a line of text:')
line = input('')
words = line.split()
print('Words:', words)
print('Counting:')
for word in words:
    counts[word] = counts.get(word, 0) + 1
for word, count in counts.items():  # notice this!
    print(f'trying {word} {count}')
Also notice that you can use an f-string when printing.
The code you have iterates over the dictionary keys and prints only the count in the dictionary. You would want to do something like this:
for word, count in counts.items():
    print('trying', word, count)
You might also want to use
from collections import defaultdict
counts = defaultdict(int)
So while adding to the dictionary, the code would be as simple as
counts[word] += 1
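Put together, the defaultdict version might look like this (the sample words are my own):

```python
from collections import defaultdict

counts = defaultdict(int)  # missing keys default to 0, so no .get() dance
for word in 'a b a c a b'.split():
    counts[word] += 1

for word, count in counts.items():
    print(word, count)
```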

Split file every three words & create lists of triplets

I am having a txt file with a text that I import in Python and I want to separate it at every 3 words.
For example,
Python is an interpreted, high-level and general-purpose programming language
I want to be,
[['Python', 'is', 'an'],['interpreted,', 'high-level','and'],['general-purpose','programming','language']].
My code so far,
lines = [word.split() for word in open(r"c:\python\4_TRIPLETS\Sample.txt", "r")]
print(lines)
gives me this output,
[['Python', 'is', 'an', 'interpreted,', 'high-level', 'and', 'general-purpose', 'programming', 'language.', "Python's", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace.', 'Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear,', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects.']]
Any ideas?
Use a list comprehension to convert the list into chunks of n items:
with open(r'c:\python\4_TRIPLETS\Sample.txt', 'r') as file:
    data = file.read().split()  # split() already handles newlines
lines = [data[i:i + 3] for i in range(0, len(data), 3)]
print(lines)
You can split the string to separate each word and then go through the list, grouping the words into chunks of 3:
final = []
temp = []
for x in lines:  # lines here is the flat list of words
    temp.append(x)
    if len(temp) == 3:
        final.append(temp)
        temp = []
if temp:  # keep any trailing partial chunk
    final.append(temp)
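Another common chunking idiom, as a sketch (the sample sentence is from the question; note this one silently drops a trailing partial chunk, unlike the slicing approach):

```python
words = ('Python is an interpreted, high-level and '
         'general-purpose programming language').split()

# zip(*[iter(words)]*3) pulls three items from the same iterator per step,
# so consecutive words land in the same tuple
triplets = [list(chunk) for chunk in zip(*[iter(words)] * 3)]
print(triplets)
```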

removing common words from a text file

I am trying to remove common words from a text. for example the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just unique words. This means removing "it", "but", "a" etc. I have a text file that has all the common words and another text file that contains a paragraph. How can I delete the common words in the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently. I have a text file called common.txt that has all the common words listed. How do I use that list to remove identical words in the sentence above. End output I want:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
You would want to use "set" objects in Python.
If order and number of occurrence are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
# Using a set for membership tests is much faster
common_set = set(common_words)
[s for s in str_list if s not in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",", "").replace(".", "").split(" ")
occurs = {}
for word in l:
    occurs[word] = l.count(word)
resultx = ''
for word in occurs.keys():
    if occurs[word] < 3:
        resultx += word + " "
resultx = resultx[:-1]
You can change 3 to whatever threshold you think suitable, or base it on the average count:
sum(occurs.values()) / len(occurs)
Additionally, if you want it to be case-insensitive, change the first line to:
l = text.replace(",", "").replace(".", "").lower().split(" ")
The simplest method would be just to read() your common.txt and then use a list comprehension, keeping only the words that do not appear in the content we read. Note that i not in content is a substring test, so a word is also dropped if it merely appears inside a longer word in common.txt; split the file contents into a set of words if you need exact matching.
with open('common.txt') as f:
    content = f.read()
s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))

How do I remove duplicate words from a list in python without using sets?

I have the following python code which almost works for me (I'm SO close!). I have text file from one Shakespeare's plays that I'm opening:
Original text file:
"But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"
And the result the code I wrote gives me is this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and',
'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill',
'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the',
'through', 'what', 'window', 'with', 'yonder']
So this is almost what I want: it's already in a list sorted the way I want it, but how do I remove the duplicate words? I'm trying to create a new ResultList and append the words to it, but it gives me the above result without getting rid of the duplicate words. If I "print ResultList" it just dumps a ton of words out. The way I have it now is close, but I want to get rid of the extra "and"s, "is"s, "sun"s and "the"s.... I want to keep it simple and use append(), but I'm not sure how I can get it to work. I don't want to do anything crazy with the code. What simple thing am I missing from my code in order to remove the duplicate words?
fname = raw_input("Enter file name: ")
fhand = open(fname)
NewList = list()    #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
    line.rstrip()         #strip white space
    words = line.split()  #split lines of words and make list
    NewList.extend(words) #make the list from 4 lists to 1 list
    for word in line.split():         #for each word in line.split()
        if words not in line.split(): #if a word isn't in line.split
            NewList.sort()            #sort it
            ResultList.append(words)  #append it, but this doesn't work.
print NewList
#print ResultList (doesn't work the way I want it to)
mylist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(set(mylist), key=lambda x:mylist.index(x))
print(newlist)
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist contains a list of the set of unique values from mylist, sorted by each item's index in mylist.
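On Python 3.7+, where dicts preserve insertion order, the same first-seen ordering can be had without the repeated index lookups; a small sketch with my own sample data:

```python
mylist = ['the', 'sun', 'the', 'moon', 'sun']

# dict.fromkeys keeps only the first occurrence of each key, in order
newlist = list(dict.fromkeys(mylist))
print(newlist)  # ['the', 'sun', 'moon']
```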
You did have a couple of logic errors in your code. I fixed them; hope it helps.
fname = "stuff.txt"
fhand = open(fname)
AllWords = list()   #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
    line.rstrip()          #strip white space
    words = line.split()   #split lines of words and make list
    AllWords.extend(words) #make the list from 4 lists to 1 list
AllWords.sort() #sort list
for word in AllWords:          #for each word in the full word list
    if word not in ResultList: #if the word isn't already in the results
        ResultList.append(word) #append it.
print(ResultList)
Tested on Python 3.4, no importing.
The function below might help.
def remove_duplicate_from_list(temp_list):
    if temp_list:
        my_list_temp = []
        for word in temp_list:
            if word not in my_list_temp:
                my_list_temp.append(word)
        return my_list_temp
    else:
        return []
This should work; it walks the list and appends an element to a new list if it differs from the last element added.
def unique(lst):
    """ Assumes lst is already sorted """
    unique_list = []
    for el in lst:
        # guard the first element, when unique_list is still empty
        if not unique_list or el != unique_list[-1]:
            unique_list.append(el)
    return unique_list
You could also use itertools.groupby, which works similarly:
from itertools import groupby
# lst must already be sorted
unique_list = [key for key, _ in groupby(lst)]
A good alternative to using a set would be to use a dictionary. The collections module contains a class called Counter, which is a specialized dictionary for counting the number of times each of its keys is seen. Using it you could do something like this:
from collections import Counter
wordlist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and',
'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is',
'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun',
'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(Counter(wordlist),
key=lambda w: w.lower()) # case insensitive sort
print(newlist)
Output:
['already', 'and', 'Arise', 'breaks', 'But', 'east', 'envious', 'fair',
'grief', 'is', 'It', 'Juliet', 'kill', 'light', 'moon', 'pale', 'sick',
'soft', 'sun', 'the', 'through', 'what', 'Who', 'window', 'with', 'yonder']
There is a problem with your code. I think you mean:
for word in line.split():      #for each word in line.split()
    if word not in ResultList: #if a word isn't in ResultList
Use plain old lists. Almost certainly not as efficient as Counter.
fname = raw_input("Enter file name: ")
Words = []
with open(fname) as fhand:
    for line in fhand:
        line = line.strip()
        # lines probably not needed
        #if line.startswith('"'):
        #    line = line[1:]
        #if line.endswith('"'):
        #    line = line[:-1]
        Words.extend(line.split())
UniqueWords = []
for word in Words:
    if word.lower() not in UniqueWords:
        UniqueWords.append(word.lower())
print Words
UniqueWords.sort()
print UniqueWords
This always checks against the lowercase version of the word, to ensure the same word in a different case configuration is not counted as two different words.
I added checks to remove the double quotes at the start and end of the file, but they are commented out since the quotes may not be present in the actual file; those lines can be disregarded.
This should do the job:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
lst.sort()
print(lst)

How to determine if word in string is a double word?

I want to write a function that takes a file as a string, and returns True if the file has duplicate words and False otherwise.
So far I have:
def double(filename):
    infile = open(filename, 'r')
    res = False
    l = infile.read().split()  # note: read() is needed; a file object has no split()
    infile.close()
    for word in l:
        #if word is in l twice
        res = True
    return res
if my file contains:
"there is is a same word"
I should get True
if my file contains:
"there is not a same word"
I should get False
How do I determine if there is a duplicate of a word in the string?
P.S. The duplicate word does not have to come right after the other;
e.g. "there is a same word in the sentence over there" should return True because "there" is also duplicated.
The str.split() method doesn't work well for splitting words in natural English text because of apostrophes and punctuation. You usually need the power of regular expressions for this:
>>> import re
>>> text = """I ain't gonna say ain't, because it isn't
in the dictionary. But my dictionary has it anyways."""
>>> text.lower().split()
['i', "ain't", 'gonna', 'say', "ain't,", 'because', 'it', "isn't", 'in', 'the',
'dictionary.', 'but', 'my', 'dictionary', 'has', 'it', 'anyways.']
>>> words = re.findall(r"[a-z']+", text.lower())
>>> words
['i', "ain't", 'gonna', 'say', "ain't", 'because', 'it', "isn't", 'in', 'the',
'dictionary', 'but', 'my', 'dictionary', 'has', 'it', 'anyways']
To find whether there are any duplicate words, you can use set operations:
>>> len(words) != len(set(words))
True
To list the duplicate words, use the multiset operations in collections.Counter:
>>> from collections import Counter
>>> sorted(Counter(words) - Counter(set(words)))
["ain't", 'dictionary', 'it']
def has_duplicates(filename):
    seen = set()
    for line in open(filename):
        for word in line.split():
            if word in seen:
                return True
            seen.add(word)
    return False
Use a set to detect duplicates:
def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        for line in infile:
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False
You could shorten that to:
def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        return any(word in seen or seen.add(word)
                   for line in infile for word in line.split())
Both versions exit early: as soon as a duplicate word is found, the function returns True. It does have to read the whole file to determine there are no duplicates and return False.
a = set()
for word in l:  # l is the list of words from the file
    if word in a:
        return True
    a.add(word)
return False
Another general approach to detecting duplicate words, involving collections.Counter:
from itertools import chain
from collections import Counter

with open('test_file.txt') as f:
    x = Counter(chain.from_iterable(line.split() for line in f))
for key, value in x.items():
    if value > 1:
        print(key)
