I have attempted to write a program that counts the number of occurrences of "[AB]" in a text file by searching each line individually (after opening and reading the file, of course), but it doesn't seem to work, and I have no idea why.
Here is the program:
# NOTE: to make it work try making more functions that return values and check if
# for the beginning and end of the names
# to deal with the issue of local variable scope
#imports and reads first line of text file
print("Opening and closing file")
print("\nReading characters from file.")
text_file = open("chat3.txt", "r")
#prints current line just for checking(can remove later)
x = 0
ABcount = 0
d = 0
length = len(text_file.readlines())
print("There are no of lines ", length)
line = text_file.readline()
print("the current line is ", line)
#loop to find most commonly used words( a tuple with word(string): no of occurences(int))
print("point 1(before loop 1)")
for d in range(0, length):
    print("point 2(just into loop 1)")
    c = text_file.readline()#reads one line and stores it in variable c as a string
    count = len(c)#gets the length of line/no of characters in it as the next loop will iterate for each one
    print(c)
    print("point 3(in loop 1 after printing current line)")
    for x in range(0, count):
        print("This is count number", x+1)
        c2 = c[x]
        print("Current char is ", c2)
        if(('[' in c) and (c2 == '[')):
            start = c.index('[') + 1
            end = c.index(':')
            ABcount += 1
            print("There is/are ", ABcount, c[start:end])
        elif ( not '[' in c):
            break
text_file.close()
And the contents of chat3.txt are:
nn an an [AB:2020]
[AB]
[AB]
And the results from running it are:
PS C:\Users\test> python counter.py
Opening and closing file
Reading characters from file.
There are no of lines 3
the current line is
point 1(before loop 1)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
PS C:\Users\test>
Use regex for this kind of thing
t.txt
Deserunt velit ipsum quis id aliquip commodo deserunt nulla officia ea dolor reprehenderit pariatur. Sit laboris culpa in non et. Do laborum aliqua sunt voluptate occaecat anim magna eu. Est tempor ad non consectetur ea reprehenderit est quis et. Culpa eu sit amet est ullamco eiusmod et sit excepteur et cupidatat ullamco consectetur Lorem. Dolore elit dolore proident consectetur ipsum non. Sunt veniam incididunt duis veniam dolor sunt fugiat irure eiusmod.
Nulla eiusmod voluptate aute tempor amet aliquip ad culpa dolor labore consequat ut ea proident. Qui minim velit elit ut excepteur fugiat nisi esse do et sit. Consequat est pariatur officia incididunt et pariatur laborum aute veniam do adipisicing.
Eu aliqua ex ex irure. Mollit adipisicing est id quis eiusmod aliqua ullamco cupidatat. Lorem ea esse magna aliqua aute occaecat. Velit in enim ut ad eu magna amet fugiat labore amet ea.
Adipisicing duis enim tempor ipsum magna duis. Consectetur ullamco adipisicing est aute fugiat qui excepteur nostrud nisi laboris ipsum. Officia sunt eiusmod consectetur dolor do et adipisicing duis cillum. Adipisicing esse exercitation deserunt labore Lorem deserunt consectetur ad laboris anim sit veniam ex ea. Minim voluptate pariatur dolor adipisicing commodo voluptate consectetur aute id officia irure elit. Cillum eiusmod esse nulla enim nostrud mollit voluptate incididunt ullamco anim cillum officia.
script
import re

# read the whole file (t.txt, shown above) into one string
with open('t.txt','r') as file:
    f=file.read()

print(re.findall('ab',f))
# ['ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab']
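For the "[AB" case from the question, note that "[" is special in a regular expression: an unescaped "[AB]" is a character class that matches a single "A" or a single "B". A minimal sketch, assuming the goal is to count every "[AB" in the sample contents of chat3.txt (the string below stands in for the file):

```python
import re

# Sample contents copied from the question's chat3.txt.
text = "nn an an [AB:2020]\n[AB]\n[AB]\n"

# Escape the bracket so it is matched literally.
matches = re.findall(r"\[AB", text)
print(len(matches))  # 3
```

If only the exact token "[AB]" should count, the pattern r"\[AB\]" would be the one to use instead.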
To answer your question: it does not enter your inner loop because the first call to readlines() moves the file cursor to the end of the file, so every later readline() returns an empty string and count is 0. This might help: Why the second time I run "readlines" on the same file nothing is returned?
If you want to loop over a file line by line, just do for line in file:
For the rest, as suggested in other answers, there are almost certainly better ways to do this, but I believe that is not the question here.
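As a rough sketch of that advice, the file can be read exactly once by iterating over the file object. The helper name count_tag and the temporary file are made up for this illustration; the sample text is the chat3.txt content from the question.

```python
import os
import tempfile

def count_tag(path, tag="[AB"):
    """Count how many times `tag` appears in the file, reading line by line."""
    total = 0
    with open(path, "r") as text_file:
        for line in text_file:  # iterating the file object yields one line at a time
            total += line.count(tag)
    return total

# Demo: write the question's sample contents to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("nn an an [AB:2020]\n[AB]\n[AB]\n")

print(count_tag(tmp.name))  # 3
os.remove(tmp.name)
```

Because the file is never read with readlines() first, the cursor problem described above does not arise.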
Related
Would there be a way to limit the amount of characters that are printed per line?
while 1:
    user_message = ""
    messageQ = input("""\nDo you want to enter a message?
[1] Yes
[2] No
[>] Select an option: """)
    if messageQ == "1":
        message = True
    elif messageQ == "2":
        message = False
    else:
        continue
    if message == True:
        print(
"""
-----------------------------------------------------------------
You can enter a custom message that is below 50 characters.
""")
        custom_message = input("""\nPlease enter your custom message:\n \n> """)
        if len(custom_message) > 50:
            print("[!] Only 50 characters allowed")
            continue
        else:
            print(f"""
Your Custom message is:
{custom_message}""") #here is where I need to limit the number of characters per line to 25
    break
So where I print it here:
Your Custom message is:
{custom_message}""") #here is where I need to limit the number of characters per line to 25
I need to limit the output to 25 characters per line.
You can do
message = "More than 25 characters in this message!"
print(f"{message:.25}")
Output
More than 25 characters i
You might use textwrap.fill to break an excessively long string into lines. Example usage:
import textwrap
message = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
print(textwrap.fill(message, 25))
output
Lorem ipsum dolor sit
amet, consectetur
adipiscing elit, sed do
eiusmod tempor incididunt
ut labore et dolore magna
aliqua. Ut enim ad minim
veniam, quis nostrud
exercitation ullamco
laboris nisi ut aliquip
ex ea commodo consequat.
Duis aute irure dolor in
reprehenderit in
voluptate velit esse
cillum dolore eu fugiat
nulla pariatur. Excepteur
sint occaecat cupidatat
non proident, sunt in
culpa qui officia
deserunt mollit anim id
est laborum.
>>> my_str = """This is a really long message that is longer than 25 characters"""
#For 25 characters TOTAL
>>> print(f"This is your custom message: {my_str}"[:25])
This is your custom messa
#For 25 characters in custom message
>>> print(f"This is your custom message: {my_str[:25]}")
This is your custom message: This is a really long mes
This takes advantage of string slicing, which cuts off any characters past the 25th.
As we have already checked that the message is no more than 50 characters, we just need to know whether it is longer than 25 characters.
if len(custom_message) <= 25:
    print(custom_message)
else:
    print(custom_message[:25])  # first 25 characters
    print(custom_message[25:])  # the remainder (at most 25 more)
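The same slicing idea generalises to any message length. The sketch below uses a hypothetical helper named chunk and a made-up sample message; note that, unlike textwrap, it cuts in the middle of words.

```python
def chunk(text, size=25):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Made-up sample message for the demo.
custom_message = "This message is somewhere between 25 and 50 chars"
for part in chunk(custom_message):
    print(part)
```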
I have a list of words (lowercase) parsed from an article. I joined them together using .join() with a space into a long string. Punctuation will be treated like words (ie. with spaces before and after).
I want to write this string into a file with at most X characters (in this case, 90 characters) per line, without breaking any words. Each line cannot start with a space or end with a space.
As part of the assignment I am not allowed to import modules; from my understanding, textwrap would otherwise have helped.
I have basically a while loop nested in a for loop that goes through every 90 characters of the string, and firstly checks if it is not a space (ie. in the middle of a word). The while loop would then iterate through the string until it reaches the next space (ie. incorporates the word unto the same line). I then check if this line, minus the leading and trailing whitespaces, is longer than 90 characters, and if it is, the while loop iterates backwards and reaches the character before the word that extends over 90 characters.
x = 0
for i in range(89, len(text), 90):
    while text[i] != " ":
        i += 1
    if len(text[x:i].strip()) > 90:
        while text[i - 1] != " ":
            i = i - 1
    file.write("".join(text[x:i]).strip() + "\n")
    x = i
The code works for 90% of the file after comparing with the file with correct outputs. Occasionally there are lines where it would exceed 90 characters without wrapping the extra word into the next line.
EX:
Actual Output on one line (93 chars):
extraordinary thing , but i never read a patent medicine advertisement without being impelled
Expected Output with "impelled" on new line (84 chars + 8 chars):
extraordinary thing , but i never read a patent medicine advertisement without being\nimpelled
Are there better ways to do this? Any suggestions would be appreciated.
You could consider using a "buffer" to hold the data as you build each line of output. As you read each new word, check whether adding it to the buffer would exceed the line length. If it would, print the buffer and then reset it, starting with the word that couldn't fit.
data = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis a risus nisi. Nunc arcu sapien, ornare sit amet pretium id, faucibus et ante. Curabitur cursus iaculis nunc id convallis. Mauris at enim finibus, fermentum est non, fringilla orci. Proin nibh orci, tincidunt sed dolor eget, iaculis sodales justo. Fusce ultrices volutpat sapien, in tincidunt arcu. Vivamus at tincidunt tortor. Sed non cursus turpis. Sed tempor neque ligula, in elementum magna vehicula in. Duis ultricies elementum pellentesque. Pellentesque pharetra nec lorem at finibus. Pellentesque sodales ligula sed quam iaculis semper. Proin vulputate, arcu et laoreet ultrices, orci lacus pellentesque justo, ut pretium arcu odio at tellus. Maecenas sit amet nisi vel elit sagittis tristique ac nec diam. Suspendisse non lacus purus. Sed vulputate finibus facilisis."""
sentence_limit = 40
buffer = ""
for word in data.split():
    word_length = len(word)
    buffer_length = len(buffer)

    if word_length > sentence_limit:
        print(f"ERROR: the word '{word}' is longer than the sentence limit of {sentence_limit}")
        break

    if buffer_length + word_length < sentence_limit:
        if buffer:
            buffer += " "
        buffer += word
    else:
        print(buffer)
        buffer = word
print(buffer)
OUTPUT
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Duis a risus nisi. Nunc
arcu sapien, ornare sit amet pretium id,
faucibus et ante. Curabitur cursus
iaculis nunc id convallis. Mauris at
enim finibus, fermentum est non,
fringilla orci. Proin nibh orci,
tincidunt sed dolor eget, iaculis
sodales justo. Fusce ultrices volutpat
sapien, in tincidunt arcu. Vivamus at
tincidunt tortor. Sed non cursus turpis.
Sed tempor neque ligula, in elementum
magna vehicula in. Duis ultricies
elementum pellentesque. Pellentesque
pharetra nec lorem at finibus.
Pellentesque sodales ligula sed quam
iaculis semper. Proin vulputate, arcu et
laoreet ultrices, orci lacus
pellentesque justo, ut pretium arcu odio
at tellus. Maecenas sit amet nisi vel
elit sagittis tristique ac nec diam.
Suspendisse non lacus purus. Sed
vulputate finibus facilisis.
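Applied to the question's setup (a list of words, 90 characters per line, no imports), the buffer idea might look like the sketch below. The function name wrap_words is made up here, and the word list is the problem line quoted in the question.

```python
def wrap_words(words, width):
    """Greedily pack words into lines of at most `width` characters."""
    lines = []
    current = ""
    for word in words:
        if not current:
            current = word  # a word longer than `width` ends up on its own line
        elif len(current) + 1 + len(word) <= width:
            current += " " + word  # +1 accounts for the joining space
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

words = ("extraordinary thing , but i never read a patent medicine "
         "advertisement without being impelled").split()
for line in wrap_words(words, 90):
    print(line)
```

Since each line is built from words joined by single spaces, no line can start or end with a space. Writing the result to the already open file could then be a single file.write("\n".join(lines) + "\n").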
Using a regular expression:
import re
with open('f0.txt', 'r') as f:
    # file must be one long single line of text
    text = f.read().rstrip()

for line in re.finditer(r'(.{1,70})(?:$|\s)', text):
    print(line.group(1))
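To see what the pattern does, here is the same idea with the width as a parameter. This is only a sketch: wrap_regex is a made-up name, and it assumes single-spaced text on one line with no word longer than the width.

```python
import re

def wrap_regex(text, width=70):
    """Wrap text using a greedy 'up to width characters, then whitespace or end' pattern."""
    pattern = r'(.{1,%d})(?:\s|$)' % width
    return [m.group(1) for m in re.finditer(pattern, text)]

for line in wrap_regex("the quick brown fox jumps over the lazy dog", 10):
    print(line)
```

The greedy .{1,width} grabs as much as it can and then backtracks until the next character is whitespace or the end of the string, which is what keeps words intact.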
To approach another way without regex:
# Constant
J = 70
# output list
out = []
with open('f0.txt', 'r') as f:
    # assumes file is 1 long line of text
    line = f.read().rstrip()

i = 0
while i+J < len(line):
    idx = line.rfind(' ', i, i+J)
    if idx != -1:
        out.append(line[i:idx])
        i = idx+1
    else:
        out.append(line[i:i+J] + '-')
        i += J
out.append(line[i:]) # get ending line portion

for line in out:
    print(line)
Here are the file contents (1 long single string):
I have basically a while loop nested in a for loop that goes through every 90 characters of the string, and firstly checks if it is not a space (ie. in the middle of a word). The while loop would then iterate through the string until it reaches the next space (ie. incorporates the word unto the same line). I then check if this line, minus the leading and trailing whitespaces, is longer than 90 characters, and if it is, the while loop iterates backwards and reaches the character before the word that extends over 90 characters.
Output:
I have basically a while loop nested in a for loop that goes through
every 90 characters of the string, and firstly checks if it is not a
space (ie. in the middle of a word). The while loop would then
iterate through the string until it reaches the next space (ie.
incorporates the word unto the same line). I then check if this line,
minus the leading and trailing whitespaces, is longer than 90
characters, and if it is, the while loop iterates backwards and
reaches the character before the word that extends over 90 characters.
I have a string with a large text and need to split it into multiple substrings with length <= N characters (as close to N as it's possible; N is always bigger than the largest sentence), but I also need not to break the sentences.
For example, if I have N = 80 and given text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel.
I want to get list of strings:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam."
"Nam sit amet iaculis lacus, non sagittis nulla."
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
And also I want this to work with English and Russian.
How to achieve this?
The steps I'd take:
1. Initiate a list to store the lines and a current line variable to store the string of the current line.
2. Split the paragraph into sentences - this requires you to .split on '.', remove the trailing empty sentence (""), strip leading and trailing whitespace (.strip) and then add the fullstops back.
3. Loop through these sentences and:
   - if the sentence can be added onto the current line, add it
   - otherwise add the current working line string to the list of lines and set the current line string to be the current sentence
So, in Python, something like:
para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
lines = []
line = ''
for sentence in (s.strip()+'.' for s in para.split('.')[:-1]):
    if not line: #first sentence => no leading space
        line = sentence
    elif len(line) + len(sentence) + 1 > 80: #can't fit on that line => start new one
        lines.append(line)
        line = sentence
    else: #can fit on => add a space then this sentence
        line += ' ' + sentence
if line: #don't forget the last line
    lines.append(line)
giving lines as:
[
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam.",
"Nam sit amet iaculis lacus, non sagittis nulla.",
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
]
There's no built-in for this that I can find, so here's a start. You can make it smarter by checking before and after for where to move the sentences, instead of just before. Length includes spaces, because I'm splitting naïvely instead of with regular expressions or something.
def get_sentences(text, min_length):
    # naive split; put back the ". " that split() removed
    # (note: the final sentence also gets an extra ". " appended)
    sentences = (sentence + ". "
                 for sentence in text.split(". "))
    current_line = ""
    for sentence in sentences:
        if len(current_line) >= min_length:
            yield current_line
            current_line = sentence
        else:
            current_line += sentence
    yield current_line
It's slow for long lines, but it does the job.
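Since the question also asks for Russian, here is a hedged sketch that splits on sentence-ending punctuation with a regular expression instead of on ". ". The name pack_sentences and the assumption that a sentence ends with ".", "!", "?" or "…" followed by whitespace are mine, not from the answers above; abbreviations such as "ie." would be split incorrectly.

```python
import re

def pack_sentences(text, limit):
    """Group whole sentences into chunks of at most `limit` characters."""
    # Assumption: a sentence ends with . ! ? or … followed by whitespace.
    sentences = re.split(r'(?<=[.!?…])\s+', text.strip())
    chunks = []
    current = ''
    for sentence in sentences:
        if not sentence:
            continue
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= limit:
            current += ' ' + sentence  # +1 for the joining space
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(pack_sentences("Привет. Как дела? Всё хорошо.", 20))
```

Python 3 strings are Unicode, so len() counts Cyrillic characters the same way it counts Latin ones.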
As an assignment I have to take in a long string of text then output it justified with each line being x characters long.
The method I am currently trying is not working and I cannot figure out why; it just gets stuck in an infinite loop.
I would appreciate some help with debugging my code.
code:
words = 'Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem. Maecenas nec odio et ante tincidunt tempus. Donec vitae sapien ut libero venenatis faucibus. Nullam quis ante. Etiam sit amet orci eget eros faucibus tincidunt. Duis leo. Sed fringilla mauris sit amet nibh. Donec sodales sagittis magna. Sed consequat, leo eget bibendum sodales, augue velit cursus nunc, quis gravida magna mi a libero. Fusce vulputate eleifend sapien. Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus. Nullam accumsan lorem in dui. Cras ultricies mi eu turpis hendrerit fringilla. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; In ac dui quis mi consectetuer lacinia.'.split()
max_len = 60
line = ''
lines = []
for word in words:
    if len(line) + len(word) <= max_len:
        line += (' ' + word)
    else:
        lines.append(line.strip())
        line = ''

import re

def JustifyLine(oline, maxLen):
    if len(oline) < maxLen:
        s = 1
        nline = oline
        while len(nline) < maxLen:
            match = '\w(\s{%i})\w' % s
            replacement = ' ' * (s + 1)
            nline = re.sub(match, replacement, nline, 1)
            if len(re.findall(match, nline)) == 0:
                s = s + 1
                replacement = s + 1
            elif len(nline) == maxLen:
                return nline
    return oline

for l in lines[:-1]:
    string = JustifyLine(l, max_len)
    print(string)
Your major problem is that you are replacing letter-whitespace-letter with more white space, deleting the letters on either side of it. So your line never gets longer, and your loop never terminates.
Put the letters in their own groups, and add references (e.g., \1) to the replacement string.
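One way this suggestion could be applied, as a sketch rather than a drop-in fix for the code above: capture the character on each side of the gap and put both back with \1 and \2. The name justify_line and the guard for lines without any gap are my additions.

```python
import re

def justify_line(line, max_len):
    """Widen gaps between words, left to right, until the line is max_len long."""
    # Nothing to do for long lines or lines without a gap to widen.
    if len(line) >= max_len or not re.search(r'\S +\S', line):
        return line
    gap = 1
    while len(line) < max_len:
        # Exactly `gap` spaces between two non-space characters, both captured.
        pattern = r'(\S) {%d}(\S)' % gap
        replacement = r'\1' + ' ' * (gap + 1) + r'\2'
        new_line, replaced = re.subn(pattern, replacement, line, count=1)
        if replaced:
            line = new_line  # grew by exactly one character
        else:
            gap += 1  # every gap of this size is used up; try wider ones
    return line

print(justify_line("a b c", 8))
```

Each pass through the loop either lengthens the line by one character or increases the gap size, so the loop terminates as long as the line contains at least one gap.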
Stephen's answer gives you a bit more than I was going to give you.
Suggestions for the future:
Work out which loop isn't terminating, e.g. by adding print statements to the suspect loops, with a different character for each.
Print out the key values for the loop condition and check that they are heading the right way; in this case, the length of nline. If it isn't increasing every time through, you need to worry that the loop won't terminate.
Think carefully before having two loop exits (the condition on the loop and the return); it can make it harder to reason about the behaviour.
Here is the contents of a txt file:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
egestas, enim et consectetuer ullamcorper, lectus ligula rutrum leo, a
elementum elit tortor eu quam. Duis tincidunt nisi ut ante. Nulla
facilisi. Sed tristique eros eu libero. Pellentesque vel arcu. Vivamus
purus orci, iaculis ac, suscipit sit amet, pulvinar eu,
lacus. Praesent placerat tortor sed nisl. Nunc blandit diam egestas
dui. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Aliquam viverra fringilla
leo. Nulla feugiat augue eleifend nulla. Vivamus mauris. Vivamus sed
mauris in nibh placerat egestas. Suspendisse potenti. Mauris massa. Ut
eget velit auctor tortor blandit sollicitudin. Suspendisse imperdiet
justo.
and here is my code:
import mmap
import re
import contextlib
pattern = re.compile(r'[\S\s]{5,15}elementum......',
re.DOTALL | re.IGNORECASE | re.MULTILINE)
with open('lorem.txt', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        for match in pattern.findall(m):
            print match.replace('\n', ' ')
Print fails to include anything from the prior line, even though I'm telling the program to delete newlines and I'm matching on everything. How do I match the text on the prior line of my sample file?
Your screenshot suggests you're on Windows. With Windows line endings (\r\n) in lorem.txt, the output becomes " rutrum leo, a\r elementum elit ". The \r (carriage return) causes the cursor to hop back to the start of the line, so the first part is overwritten by the second:
$ python foo.py | od -tc
0000000 r u t r u m l e o , a \r e
0000020 l e m e n t u m e l i t \n
0000037
To make the code platform-independent, use os.linesep instead of '\n'.
Another option is to use regular file reading functions instead of mmap, and to specify mode 'r' (to assume platform-local line endings) or 'rU' (to accept any of \r, \r\n and \n). This makes sure all line endings get converted to \n automatically.
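A minimal sketch of the second option, assuming Python 3 (the question's code is Python 2): in Python 3, text mode with the default newline handling already converts \r\n and \r to \n on reading. The helper name find_context and the two-line sample file are made up for this illustration.

```python
import os
import re
import tempfile

PATTERN = re.compile(r'[\S\s]{5,15}elementum......')

def find_context(path):
    """Return the matches with newlines turned into spaces."""
    # Text mode with the default newline=None translates \r\n and \r to \n.
    with open(path, 'r') as f:
        text = f.read()
    return [m.replace('\n', ' ') for m in PATTERN.findall(text)]

# Demo: a small file written with Windows line endings.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'lectus ligula rutrum leo, a\r\nelementum elit tortor eu quam.\r\n')

print(find_context(tmp.name))
os.remove(tmp.name)
```

With the carriage returns gone before the regex runs, replacing '\n' is enough and the text from the prior line is no longer overwritten when printed.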