Python common word finder


I have a small program that looks at a text file and displays how many times each word was used. Instead of printing words, it prints the most commonly used letters, and I don't understand what the problem is.
import re
from collections import Counter
words = re.findall(r'\w', open('words.txt').read().lower())
count = Counter(words).most_common(8)
print(count)

I hope this helps. This is a regular-expression answer that goes through the file word by word.
import re
with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\w+', line):
            # word by word
            print(word)
If you do not have quotes around your data and you just want one word at a time (ignoring the difference between spaces and line breaks in the file), try this:
with open('words.txt','r') as f:
    for line in f:
        for word in line.split():
            print(word)

import string
from collections import Counter
words = open('words.txt').read().lower()
# skip punctuation
words = words.translate(str.maketrans('', '', string.punctuation)).split()
count = Counter(words).most_common(8)
print(count)

In a regex, \w matches a single word character (a letter, digit, or underscore), not a whole word. You can get a list of words by splitting the text on whitespace:
words = open('words.txt').read().lower().split()
And then you perform what you were doing:
count = Counter(words).most_common(8)
print(count)
I guess that should suffice, tell me if it isn't working.
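To see the difference between \w and \w+ concretely, here is a minimal sketch. The sample string is a made-up stand-in for the contents of words.txt, so nothing needs to exist on disk.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the contents of words.txt.
text = "the cat and the hat and the bat"

letters = re.findall(r'\w', text.lower())   # one match per word character
words = re.findall(r'\w+', text.lower())    # one match per run of word characters

letter_counts = Counter(letters).most_common(2)
word_counts = Counter(words).most_common(2)

print(letter_counts)  # counts of single characters
print(word_counts)    # counts of whole words
```

With the + quantifier the pattern consumes a whole run of word characters, which is why the second Counter sees words rather than letters.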

Assuming you have the following text file:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
And you want to calculate word frequencies:
import operator
with open('text.txt') as f:
    words = f.read().split()
result = {}
for word in words:
    result[word] = words.count(word)
result = sorted(result.items(), key=operator.itemgetter(1), reverse=True)
print(result)
You'll get a list of words with the number of occurrences of each word, sorted in descending order:
[('in', 3), ('dolor', 2), ('ut', 2), ('dolore', 2), ('Lorem', 1),
('ipsum', 1), ...

Related

How would I limit the number of characters per line in python from an input

Would there be a way to limit the amount of characters that are printed per line?
while 1:
    user_message = ""
    messageQ = input("""\nDo you want to enter a message?
[1] Yes
[2] No
[>] Select an option: """)
    if messageQ == "1":
        message = True
    elif messageQ == "2":
        message = False
    else:
        continue
    if message == True:
        print(
"""
-----------------------------------------------------------------
You can enter a custom message that is below 50 characters.
""")
        custom_message = input("""\nPlease enter your custom message:\n \n> """)
        if len(custom_message) > 50:
            print("[!] Only 50 characters allowed")
            continue
        else:
            print(f"""
Your Custom message is:
{custom_message}""") #here is where I need to limit the number of characters per line to 25
    break
So where I print it here:
Your Custom message is:
{custom_message}""") #here is where I need to limit the number of characters per line to 25
I need to limit the output to 25 characters per line.

You can do
message = "More than 25 characters in this message!"
print(f"{message:.25}")
Output
More than 25 characters i

You can use textwrap.fill to break an excessively long string into lines. Example usage:
import textwrap
message = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
print(textwrap.fill(message, 25))
output
Lorem ipsum dolor sit
amet, consectetur
adipiscing elit, sed do
eiusmod tempor incididunt
ut labore et dolore magna
aliqua. Ut enim ad minim
veniam, quis nostrud
exercitation ullamco
laboris nisi ut aliquip
ex ea commodo consequat.
Duis aute irure dolor in
reprehenderit in
voluptate velit esse
cillum dolore eu fugiat
nulla pariatur. Excepteur
sint occaecat cupidatat
non proident, sunt in
culpa qui officia
deserunt mollit anim id
est laborum.

>>> my_str = """This is a really long message that is longer than 25 characters"""
# For 25 characters TOTAL
>>> print(f"This is your custom message: {my_str}"[:25])
This is your custom messa
# For 25 characters in custom message
>>> print(f"This is your custom message: {my_str[:25]}")
This is your custom message: This is a really long mes
This takes advantage of string slicing, which cuts off any characters past the 25th.

As we have already checked that the message is not more than 50 characters, we just need to know whether it is longer than 25 characters.
if len(custom_message) <= 25:
    print(custom_message)
else:
    print(custom_message[:25])  # first 25 characters
    print(custom_message[25:])  # the remainder (at most 25 more)
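The two-slice idea generalises to messages of any length by stepping through the string in fixed-width chunks. The following is only a sketch; the helper name split_fixed_width and the sample message are made up for illustration. Unlike textwrap.fill, it cuts strictly by character count and may split a word in the middle.

```python
def split_fixed_width(message, width=25):
    """Return consecutive slices of `message`, each at most `width` characters."""
    return [message[i:i + width] for i in range(0, len(message), width)]


# Hypothetical message, shorter than the 50-character limit from the question.
custom_message = "This message is somewhere under fifty characters"

lines = split_fixed_width(custom_message)
for line in lines:
    print(line)
```

If breaking on word boundaries matters more than an exact cut, textwrap.wrap(custom_message, 25) returns a similar list of lines.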

Struggling with reading a text file and use of nested loops

I have attempted to write a program that counts the number of occurrences of "[AB]" in a text file by searching each line individually (after loading and opening the file, of course), but it doesn't seem to work, and I have no idea why.
Here is the program:
# NOTE: to make it work try making more functions that return values and check if
# for the beginning and end of the names
# to deal with the issue of local variable scope
#imports and reads first line of text file
print("Opening and closing file")
print("\nReading characters from file.")
text_file = open("chat3.txt", "r")
#prints current line just for checking(can remove later)
x = 0
ABcount = 0
d = 0
length = len(text_file.readlines())
print("There are no of lines ", length)
line = text_file.readline()
print("the current line is ", line)
#loop to find most commonly used words( a tuple with word(string): no of occurences(int))
print("point 1(before loop 1)")
for d in range(0, length):
    print("point 2(just into loop 1)")
    c = text_file.readline()#reads one line and stores it in variable c as a string
    count = len(c)#gets the length of line/no of characters in it as the next loop will iterate for each one
    print(c)
    print("point 3(in loop 1 after printing current line)")
    for x in range(0, count):
        print("This is count number", x+1)
        c2 = c[x]
        print("Current char is ", c2)
        if(('[' in c) and (c2 == '[')):
            start = c.index('[') + 1
            end = c.index(':')
            ABcount += 1
            print("There is/are ", ABcount, c[start:end])
        elif ( not '[' in c):
            break
text_file.close()
And the contents of chat3.txt are:
nn an an [AB:2020]
[AB]
[AB]
And the output from running it is:
PS C:\Users\test> python counter.py
Opening and closing file
Reading characters from file.
There are no of lines 3
the current line is
point 1(before loop 1)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
PS C:\Users\test>

Use regex for this kind of thing
t.txt
Deserunt velit ipsum quis id aliquip commodo deserunt nulla officia ea dolor reprehenderit pariatur. Sit laboris culpa in non et. Do laborum aliqua sunt voluptate occaecat anim magna eu. Est tempor ad non consectetur ea reprehenderit est quis et. Culpa eu sit amet est ullamco eiusmod et sit excepteur et cupidatat ullamco consectetur Lorem. Dolore elit dolore proident consectetur ipsum non. Sunt veniam incididunt duis veniam dolor sunt fugiat irure eiusmod.
Nulla eiusmod voluptate aute tempor amet aliquip ad culpa dolor labore consequat ut ea proident. Qui minim velit elit ut excepteur fugiat nisi esse do et sit. Consequat est pariatur officia incididunt et pariatur laborum aute veniam do adipisicing.
Eu aliqua ex ex irure. Mollit adipisicing est id quis eiusmod aliqua ullamco cupidatat. Lorem ea esse magna aliqua aute occaecat. Velit in enim ut ad eu magna amet fugiat labore amet ea.
Adipisicing duis enim tempor ipsum magna duis. Consectetur ullamco adipisicing est aute fugiat qui excepteur nostrud nisi laboris ipsum. Officia sunt eiusmod consectetur dolor do et adipisicing duis cillum. Adipisicing esse exercitation deserunt labore Lorem deserunt consectetur ad laboris anim sit veniam ex ea. Minim voluptate pariatur dolor adipisicing commodo voluptate consectetur aute id officia irure elit. Cillum eiusmod esse nulla enim nostrud mollit voluptate incididunt ullamco anim cillum officia.
script
import re
with open('t.txt','r') as file:
    f=file.read()
print(re.findall('ab',f))
# ['ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab']

To answer your question: it does not enter your inner loop because the first call to readlines() moves the file position to the end of the file, so every subsequent readline() returns an empty string. This might help: Why the second time I run "readlines" on the same file nothing is returned?
If you want to loop over a file line by line, just do for line in file:
For the rest, as suggested in other answers, there are most certainly better ways to do this, but I believe that is not the question here.
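The file-position behaviour can be reproduced without touching the disk. In this sketch an io.StringIO object stands in for chat3.txt (an assumption made only to keep the example self-contained), and str.count is swapped in for the character-by-character loop from the question.

```python
import io

# In-memory stand-in for chat3.txt so the sketch is self-contained.
text_file = io.StringIO("nn an an [AB:2020]\n[AB]\n[AB]\n")

length = len(text_file.readlines())  # reads everything; position is now at the end
leftover = text_file.readline()      # returns '' because nothing is left to read

text_file.seek(0)                    # rewind to the start of the file
ab_count = 0
for line in text_file:               # iterate over the file line by line
    ab_count += line.count("[AB")    # count every "[AB" on this line

print(length, repr(leftover), ab_count)
```

With a real file the same pattern applies: either call seek(0) after readlines(), or keep the list that readlines() returned and loop over that instead.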

How to find total number of characters in paragraph(excluding spaces) in PYTHON?

Finding the total number of words in a paragraph with split was easy, but how do I find the number of characters in a paragraph using Python?

I would go for a list comprehension:
len([c for c in "la a a" if c not in (' ', '\n') ])

There are a lot of ways to do that, so I wondered how they fare against each other and timed them. I implemented all the methods in this question and any additional ones from here.
from timeit import timeit
setup = """
from collections import Counter
import string
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In malesuada eget tortor vel tempor. Cras condimentum risus a mi sagittis, at lobortis dui efficitur. Suspendisse fringilla ligula at eros consequat aliquet. Praesent volutpat sapien non aliquam cursus. Suspendisse auctor sapien ac leo luctus scelerisque. Cras eget fringilla mauris. Vivamus fermentum, nisl et mollis consequat, nisl sapien lacinia ex, ac finibus eros dui vel sapien. Curabitur dignissim porttitor ex sed vestibulum. Nullam nulla lorem, aliquam in turpis at, egestas tempor turpis. Aenean at lorem molestie, placerat eros in, tempus dui. Morbi magna nulla, blandit ac vestibulum faucibus, luctus sit amet libero. Ut quis lorem porta, cursus nunc id, malesuada mauris. Aenean luctus diam ac tortor rutrum mattis. Donec ultrices nibh quis est varius pellentesque. Donec tempor, est vel commodo ultricies, mauris tortor egestas orci, ut hendrerit orci ex eget risus. In eu ullamcorper odio, lacinia auctor urna.'
def split_join(s):
    return len(''.join(s.split()))
def list_comprehension(s):
    return len([c for c in s if c != " "])
def sum_len_split(s):
    return sum([len(x) for x in s.split()])
def map_len_split(s):
    return sum(map(len, s.split()))
def conventional_loop(s):
    x = 0
    for n in s:
        if not n.isspace():
            x += 1
    return x
def replace_space(s):
    return len(s.replace(" ", ""))
def counter(s):
    valid_letters = string.ascii_letters
    count = Counter(s)
    return sum(count[letter] for letter in valid_letters)
def discount_space(s):
    return len(s) - s.count(" ")
"""
functions = [line[4:-4] for line in setup.split('\n')
             if line.startswith('def ')]
n = 100000
results = []
for function in functions:
    results.append((timeit('{}(text)'.format(function), setup=setup, number=n), function))
results.sort()
for time, function in results:
    print(function, time)
Results
discount_space 0.1687837346025738
replace_space 0.5508266038467227
split_join 1.231192897388388
map_len_split 1.5719588628305754
sum_len_split 2.2983778970212896
list_comprehension 5.715995796916212
counter 7.133700537385263
conventional_loop 11.01061941802605
Of course, unless you have millions of characters in your string, performance isn't an issue and code clarity is more important. In that case I would still argue that discount_space() is the most clear and direct.

You can split your string (which removes whitespace and gives a list of strings), then count the characters of each word and sum the total.
myString = "Hello you !" # 11 characters
totalCharacters = sum([len(x) for x in myString.split()])
print(totalCharacters) # output: 9

By characters, I assume you mean anything that is not whitespace, punctuation included. This is one way it can be done:
#!/usr/bin/python
s = 'abc def ghi....\n":)*9.w wer'
x = 0
for n in s:
    if not n.isspace():
        x += 1
print("total number of characters is {}".format(x))

paragraph = "The quick brown fox jumps over the lazy dog"
Remove the spaces and print the length:
print(len(paragraph.replace(" ", "")))
# 35

I would use split and join:
def char_count(s):
    return len(''.join(s.split()))

from collections import Counter
import string
def count_letters(word, valid_letters=string.ascii_letters):
    count = Counter(word) # this counts all the letters
    return sum(count[letter] for letter in valid_letters) # valid letters

Python Grabbing String in between characters

If I have a string like /Hello how are you/, how am I supposed to grab this line and delete it using a Python script?
import sys
import re
i_file = sys.argv[1];
def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        if not i % 2:
            lst[i] = re.sub("\s+", "", item)
    return '"'.join(lst)
with open(i_file) as i_file_comment_strip:
    i_files_names = i_file_comment_strip.readlines()
    for line in i_files_names:
        with open(line, "w") as i_file_data:
            i_file_comment = i_file_data.readlines();
            for line in i_file_comment:
                i_file_comment_data = i_file_comment.strip()
In the i_file_comment I have the lines from i_file_data and i_file_comment contains the lines with the "/.../" format. Would I use a for loop through each character in the line and replace every one of those characters with a ""?

If you want to remove the /Hello how are you/ you can use regex:
import re
x = 'some text /Hello how are you/ some more text'
print (re.sub(r'/.*/','', x))
Output:
some text some more text

If you know you have occurrences of a fixed string in your lines, you can simply do
for line in i_file_comment:
line = line.replace('/Hello how are you/', '')
However, if what you have is multiple occurrences of strings delimited by / (i.e. /foo/, /bar/), I think a simple regex will suffice:
>>> import re
>>> regex = re.compile(r'\/[\w\s]+\/')
>>> s = """
... Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
... /Hello how are you/ ++ tempor incididunt ut labore et dolore magna aliqua.
... /Hello world/ -- ullamco laboris nisi ut aliquip ex ea commodo
... """
>>> print(re.sub(regex, '', s)) # find substrings matching the regex, replace them with '' on string s
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
++ tempor incididunt ut labore et dolore magna aliqua.
-- ullamco laboris nisi ut aliquip ex ea commodo
>>>
just adjust the regex to what you need to get rid of :)
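One adjustment worth knowing about: a greedy pattern such as /.*/ runs from the first slash on a line to the last one, so it also swallows any text between two delimited blocks. A negated character class avoids that. The sample line below is made up for illustration.

```python
import re

# Hypothetical line containing two separate /.../ blocks.
line = "keep /first comment/ this /second comment/ too"

# Greedy: matches from the first slash all the way to the last slash.
greedy = re.sub(r'/.*/', '', line)

# Negated class: each match stops at the next slash, so blocks are removed one by one.
per_block = re.sub(r'/[^/]*/', '', line)

print(repr(greedy))
print(repr(per_block))
```

Unlike [\w\s]+, the [^/]* form also accepts punctuation inside the delimiters, which may or may not be what the data needs.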

get words from large file, using low memory in python

I need to iterate over the words in a file. The file could be very big (over 1TB), the lines could be very long (maybe just one line). Words are English, so reasonable in size. So I don't want to load in the whole file or even a whole line.
I have some code that works, but may explode if lines are too long (longer than ~3GB on my machine).
def words(file):
    for line in file:
        words=re.split(r"\W+", line)
        for w in words:
            word=w.lower()
            if word != '': yield word
Can you tell me how I can simply rewrite this iterator function so that it does not hold more than needed in memory?

Don't read line by line, read in buffered chunks instead:
import re
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        for word in (w.lower() for w in words if w):
            yield word
    if buffer:
        yield buffer.lower()
I'm using the callable-and-sentinel version of the iter() function to handle reading from the file until file.read() returns an empty string; I prefer this form over a while loop.
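For readers who have not met the two-argument form of iter() before, here is a minimal sketch of how it behaves on its own. The io.StringIO object is an in-memory stand-in for a real text file, and the chunk size of 4 is arbitrary.

```python
import io

# In-memory stand-in for a real text file object.
stream = io.StringIO("abcdefghij")

# iter(callable, sentinel) calls the callable repeatedly and stops as soon
# as it returns the sentinel (here the empty string that read() gives at EOF).
chunks = list(iter(lambda: stream.read(4), ''))

print(chunks)
```

For a file opened in binary mode, read() returns bytes, so the sentinel would need to be b'' instead of ''.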
If you are using Python 3.3 or newer, you can use generator delegation here:
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        yield from (w.lower() for w in words if w)
    if buffer:
        yield buffer.lower()
Demo using a small chunk size to demonstrate this all works as expected:
>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
...
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur
