Here are the contents of a txt file:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
egestas, enim et consectetuer ullamcorper, lectus ligula rutrum leo, a
elementum elit tortor eu quam. Duis tincidunt nisi ut ante. Nulla
facilisi. Sed tristique eros eu libero. Pellentesque vel arcu. Vivamus
purus orci, iaculis ac, suscipit sit amet, pulvinar eu,
lacus. Praesent placerat tortor sed nisl. Nunc blandit diam egestas
dui. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Aliquam viverra fringilla
leo. Nulla feugiat augue eleifend nulla. Vivamus mauris. Vivamus sed
mauris in nibh placerat egestas. Suspendisse potenti. Mauris massa. Ut
eget velit auctor tortor blandit sollicitudin. Suspendisse imperdiet
justo.
and here is my code:
import mmap
import re
import contextlib

pattern = re.compile(r'[\S\s]{5,15}elementum......',
                     re.DOTALL | re.IGNORECASE | re.MULTILINE)

with open('lorem.txt', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        for match in pattern.findall(m):
            print match.replace('\n', ' ')
Print fails to include anything from the prior line, even though I'm telling the program to delete newlines and I'm matching on everything. How do I match the text on the prior line of my sample file?
Your screenshot suggests you're on Windows. With Windows line endings (\r\n) in lorem.txt, the output becomes " rutrum leo, a\r elementum elit ". The \r (carriage return) causes the cursor to hop back to the start of the line, so the first part is overwritten by the second:
$ python foo.py | od -tc
0000000 r u t r u m l e o , a \r e
0000020 l e m e n t u m e l i t \n
0000037
To make the code platform-independent, use os.linesep instead of '\n'.
Another option is to use regular file reading functions instead of mmap, and to specify mode 'r' (to assume platform-local line endings) or 'rU' (to accept any of \r, \r\n and \n). This makes sure all line endings get converted to \n automatically.
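A sketch of that second option, using an assumed helper name and reading the whole file into memory rather than mmap-ing it (fine for a small sample file): in text mode, Python normalizes \r\n to \n before the regex ever runs, so no carriage return survives into the output.

```python
import re

# Same pattern as the question; text-mode reading normalizes \r\n to \n,
# so the \r never reaches the terminal.
pattern = re.compile(r'[\S\s]{5,15}elementum......',
                     re.DOTALL | re.IGNORECASE | re.MULTILINE)

def find_matches(path):
    with open(path, 'r') as f:   # universal newlines: \r\n becomes \n
        text = f.read()
    return [m.replace('\n', ' ') for m in pattern.findall(text)]
```

Calling `find_matches('lorem.txt')` then yields the match with a plain space where the line break was, regardless of platform.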
I have a list of words (lowercase) parsed from an article. I joined them together using .join() with a space into a long string. Punctuation will be treated like words (ie. with spaces before and after).
I want to write this string into a file with at most X characters (in this case, 90 characters) per line, without breaking any words. Each line cannot start with a space or end with a space.
As part of the assignment I am not allowed to import modules, which from my understanding, textwrap would've helped.
I have basically a while loop nested in a for loop that goes through every 90 characters of the string, and first checks if it is not a space (i.e. in the middle of a word). The while loop would then iterate through the string until it reaches the next space (i.e. incorporates the word onto the same line). I then check if this line, minus the leading and trailing whitespace, is longer than 90 characters, and if it is, the while loop iterates backwards until it reaches the character before the word that extends over 90 characters.
x = 0
for i in range(89, len(text), 90):
    while text[i] != " ":
        i += 1
    if len(text[x:i].strip()) > 90:
        while text[i - 1] != " ":
            i = i - 1
    file.write("".join(text[x:i]).strip() + "\n")
    x = i
The code works for about 90% of the file when compared against the file with the correct output. Occasionally there are lines that exceed 90 characters without wrapping the extra word onto the next line.
EX:
Actual Output on one line (93 chars):
extraordinary thing , but i never read a patent medicine advertisement without being impelled
Expected Output with "impelled" on new line (84 chars + 8 chars):
extraordinary thing , but i never read a patent medicine advertisement without being\nimpelled
Are there better ways to do this? Any suggestions would be appreciated.
You could consider using a "buffer" to hold the data as you build each line to output. As you read each new word check if adding it to the "buffer" would exceed the line length, if it would then you print the "buffer" and then reset the "buffer" starting with the word that couldn't fit in the sentence.
data = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis a risus nisi. Nunc arcu sapien, ornare sit amet pretium id, faucibus et ante. Curabitur cursus iaculis nunc id convallis. Mauris at enim finibus, fermentum est non, fringilla orci. Proin nibh orci, tincidunt sed dolor eget, iaculis sodales justo. Fusce ultrices volutpat sapien, in tincidunt arcu. Vivamus at tincidunt tortor. Sed non cursus turpis. Sed tempor neque ligula, in elementum magna vehicula in. Duis ultricies elementum pellentesque. Pellentesque pharetra nec lorem at finibus. Pellentesque sodales ligula sed quam iaculis semper. Proin vulputate, arcu et laoreet ultrices, orci lacus pellentesque justo, ut pretium arcu odio at tellus. Maecenas sit amet nisi vel elit sagittis tristique ac nec diam. Suspendisse non lacus purus. Sed vulputate finibus facilisis."""
sentence_limit = 40

buffer = ""
for word in data.split():
    word_length = len(word)
    buffer_length = len(buffer)

    if word_length > sentence_limit:
        print(f"ERROR: the word '{word}' is longer than the sentence limit of {sentence_limit}")
        break

    if buffer_length + word_length < sentence_limit:
        if buffer:
            buffer += " "
        buffer += word
    else:
        print(buffer)
        buffer = word

print(buffer)
OUTPUT
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Duis a risus nisi. Nunc
arcu sapien, ornare sit amet pretium id,
faucibus et ante. Curabitur cursus
iaculis nunc id convallis. Mauris at
enim finibus, fermentum est non,
fringilla orci. Proin nibh orci,
tincidunt sed dolor eget, iaculis
sodales justo. Fusce ultrices volutpat
sapien, in tincidunt arcu. Vivamus at
tincidunt tortor. Sed non cursus turpis.
Sed tempor neque ligula, in elementum
magna vehicula in. Duis ultricies
elementum pellentesque. Pellentesque
pharetra nec lorem at finibus.
Pellentesque sodales ligula sed quam
iaculis semper. Proin vulputate, arcu et
laoreet ultrices, orci lacus
pellentesque justo, ut pretium arcu odio
at tellus. Maecenas sit amet nisi vel
elit sagittis tristique ac nec diam.
Suspendisse non lacus purus. Sed
vulputate finibus facilisis.
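For comparison only, since the assignment rules out imports: the standard library's textwrap.fill does the same word-boundary wrapping in one call (width 90 here, matching the question's example).

```python
import textwrap

# textwrap.fill wraps at word boundaries and never starts or ends
# a line with a space.
text = ("extraordinary thing , but i never read a patent medicine "
        "advertisement without being impelled")
wrapped = textwrap.fill(text, width=90)
print(wrapped)
```

This puts "impelled" on its own line, exactly as the expected output above requires.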
Using a regular expression:
import re
with open('f0.txt', 'r') as f:
    # file must be 1 long single line of text
    text = f.read().rstrip()

for line in re.finditer(r'(.{1,70})(?:$|\s)', text):
    print(line.group(1))
To approach another way without regex:
# constant: maximum line width
J = 70
# output list
out = []

with open('f0.txt', 'r') as f:
    # assumes file is 1 long line of text
    line = f.read().rstrip()

i = 0
while i + J < len(line):
    idx = line.rfind(' ', i, i + J)
    if idx != -1:
        out.append(line[i:idx])
        i = idx + 1
    else:
        out.append(line[i:i + J] + '-')
        i += J
out.append(line[i:])  # get ending line portion

for line in out:
    print(line)
Here are the file contents (1 long single string):
I have basically a while loop nested in a for loop that goes through every 90 characters of the string, and firstly checks if it is not a space (ie. in the middle of a word). The while loop would then iterate through the string until it reaches the next space (ie. incorporates the word unto the same line). I then check if this line, minus the leading and trailing whitespaces, is longer than 90 characters, and if it is, the while loop iterates backwards and reaches the character before the word that extends over 90 characters.
Output:
I have basically a while loop nested in a for loop that goes through
every 90 characters of the string, and firstly checks if it is not a
space (ie. in the middle of a word). The while loop would then
iterate through the string until it reaches the next space (ie.
incorporates the word unto the same line). I then check if this line,
minus the leading and trailing whitespaces, is longer than 90
characters, and if it is, the while loop iterates backwards and
reaches the character before the word that extends over 90 characters.
I have a string with a large text and need to split it into multiple substrings with length <= N characters (as close to N as it's possible; N is always bigger than the largest sentence), but I also need not to break the sentences.
For example, if I have N = 80 and given text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel.
I want to get list of strings:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam."
"Nam sit amet iaculis lacus, non sagittis nulla."
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
I also want this to work with both English and Russian.
How can I achieve this?
The steps I'd take:
Initiate a list to store the lines and a current line variable to store the string of the current line.
Split the paragraph into sentences - this requires you to .split on '.', remove the trailing empty sentence (""), strip leading and trailing whitespace (.strip) and then add the fullstops back.
Loop through these sentences and:
if the sentence can be added onto the current line, add it
otherwise add the current working line string to the list of lines and set the current line string to be the current sentence
So, in Python, something like:
para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
lines = []
line = ''
for sentence in (s.strip() + '.' for s in para.split('.')[:-1]):
    if len(line) + len(sentence) + 1 >= 80:  # can't fit on this line => start a new one
        lines.append(line)
        line = sentence
    else:  # can fit => add a space then this sentence
        line += ' ' + sentence
giving lines as:
[
"Lorem ipsum dolor sit amet, consectetur adipiscing elit.Integer in tellus quam.",
"Nam sit amet iaculis lacus, non sagittis nulla.",
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
]
There's no built-in for this that I can find, so here's a start. You can make it smarter by checking before and after for where to move the sentences, instead of just before. Length includes spaces, because I'm splitting naïvely instead of with regular expressions or something.
def get_sentences(text, min_length):
    sentences = (sentence + ". "
                 for sentence in text.split(". "))
    current_line = ""
    for sentence in sentences:
        if len(current_line) >= min_length:
            yield current_line
            current_line = sentence
        else:
            current_line += sentence
    yield current_line
It's slow for long lines, but it does the job.
Finding the total number of words in a paragraph using split() was easy, but how do I find the number of characters in a paragraph using Python?
I would go for a list comprehension:
len([c for c in "la a a" if c not in (' ', '\n')])
There are a lot of ways to do this, so I wondered how they fare against each other, and timed them. I implemented all the methods in this question and any additional ones from here.
from timeit import timeit
setup = """
from collections import Counter
import string
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In malesuada eget tortor vel tempor. Cras condimentum risus a mi sagittis, at lobortis dui efficitur. Suspendisse fringilla ligula at eros consequat aliquet. Praesent volutpat sapien non aliquam cursus. Suspendisse auctor sapien ac leo luctus scelerisque. Cras eget fringilla mauris. Vivamus fermentum, nisl et mollis consequat, nisl sapien lacinia ex, ac finibus eros dui vel sapien. Curabitur dignissim porttitor ex sed vestibulum. Nullam nulla lorem, aliquam in turpis at, egestas tempor turpis. Aenean at lorem molestie, placerat eros in, tempus dui. Morbi magna nulla, blandit ac vestibulum faucibus, luctus sit amet libero. Ut quis lorem porta, cursus nunc id, malesuada mauris. Aenean luctus diam ac tortor rutrum mattis. Donec ultrices nibh quis est varius pellentesque. Donec tempor, est vel commodo ultricies, mauris tortor egestas orci, ut hendrerit orci ex eget risus. In eu ullamcorper odio, lacinia auctor urna.'
def split_join(s):
    return len(''.join(s.split()))

def list_comprehension(s):
    return len([c for c in s if c != " "])

def sum_len_split(s):
    return sum([len(x) for x in s.split()])

def map_len_split(s):
    return sum(map(len, s.split()))

def conventional_loop(s):
    x = 0
    for n in s:
        if not n.isspace():
            x += 1
    return x

def replace_space(s):
    return len(s.replace(" ", ""))

def counter(s):
    valid_letters = string.ascii_letters
    count = Counter(s)
    return sum(count[letter] for letter in valid_letters)

def discount_space(s):
    return len(s) - s.count(" ")
"""
functions = [line[4:-4] for line in setup.split('\n')
             if line.startswith('def ')]

n = 100000
results = []
for function in functions:
    results.append((timeit('{}(text)'.format(function), setup=setup, number=n), function))
results.sort()

for time, function in results:
    print(function, time)
Results
discount_space 0.1687837346025738
replace_space 0.5508266038467227
split_join 1.231192897388388
map_len_split 1.5719588628305754
sum_len_split 2.2983778970212896
list_comprehension 5.715995796916212
counter 7.133700537385263
conventional_loop 11.01061941802605
Of course, unless you have millions of characters in your string, performance isn't an issue and code clarity is more important. In that case I would still argue that discount_space() is the most clear and direct.
You can split your string (which removes the spaces and gives a list of words), then count the characters of each word and sum the totals.
myString = "Hello you !"  # 11 characters
totalCharacters = sum([len(x) for x in myString.split()])
print(totalCharacters)  # output: 9
By characters, I assume you mean anything that is not whitespace, so punctuation is included. This is one way it can be done:
#!/usr/bin/python
s = 'abc def ghi....\n":)*9.w wer'
x = 0
for n in s:
    if not n.isspace():
        x += 1
print("total number of characters is {}".format(x))
paragraph = "The quick brown fox jumps over the lazy dog"
Remove the spaces and print the length:
print(len(paragraph.replace(" ", "")))
# 35
I would use split and join:
def char_count(s):
    return len(''.join(s.split()))
from collections import Counter
import string

def count_letters(word, valid_letters=string.ascii_letters):
    count = Counter(word)  # this counts all the letters
    return sum(count[letter] for letter in valid_letters)  # keep only valid letters
As an assignment I have to take in a long string of text then output it justified with each line being x characters long.
The current method I am trying to use is not working and I can not figure out why, it just gets stuck in an infinite loop.
I would appreciate some help with debugging my code.
code:
words = 'Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem. Maecenas nec odio et ante tincidunt tempus. Donec vitae sapien ut libero venenatis faucibus. Nullam quis ante. Etiam sit amet orci eget eros faucibus tincidunt. Duis leo. Sed fringilla mauris sit amet nibh. Donec sodales sagittis magna. Sed consequat, leo eget bibendum sodales, augue velit cursus nunc, quis gravida magna mi a libero. Fusce vulputate eleifend sapien. Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus. Nullam accumsan lorem in dui. Cras ultricies mi eu turpis hendrerit fringilla. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; In ac dui quis mi consectetuer lacinia.'.split()
max_len = 60
line = ''
lines = []
for word in words:
    if len(line) + len(word) <= max_len:
        line += (' ' + word)
    else:
        lines.append(line.strip())
        line = ''
import re

def JustifyLine(oline, maxLen):
    if len(oline) < maxLen:
        s = 1
        nline = oline
        while len(nline) < maxLen:
            match = '\w(\s{%i})\w' % s
            replacement = ' ' * (s + 1)
            nline = re.sub(match, replacement, nline, 1)
            if len(re.findall(match, nline)) == 0:
                s = s + 1
                replacement = s + 1
            elif len(nline) == maxLen:
                return nline
    return oline

for l in lines[:-1]:
    string = JustifyLine(l, max_len)
    print(string)
Your major problem is that you are replacing letter-whitespace-letter with more white space, deleting the letters on either side of it. So your line never gets longer, and your loop never terminates.
Put the letters in their own groups, and add references (e.g., \1) to the replacement string.
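A sketch of that fix, with two deliberate changes worth noting: it matches \S rather than \w so gaps next to punctuation still qualify, and it uses re.subn so a failed substitution (no gap of the current width left) can be detected directly.

```python
import re

def justify_line(oline, max_len):
    # Pad the gaps between words, one space at a time, until the line
    # reaches max_len.
    if ' ' not in oline.strip():
        return oline              # single word: nothing to stretch
    s = 1
    nline = oline
    while len(nline) < max_len:
        # Capture the characters on either side of the gap and put them
        # back with \1 and \3, so only the whitespace in between grows.
        pattern = r'(\S)(\s{%d})(\S)' % s
        nline, n = re.subn(pattern, r'\1' + ' ' * (s + 1) + r'\3',
                           nline, count=1)
        if n == 0:                # no gap of width s left: widen the search
            s += 1
    return nline
```

Each successful substitution lengthens the line by exactly one character, so the loop is guaranteed to terminate.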
Stephen's answer gives you a bit more than I was going to give you.
Suggestions for the future:
Work out which loop isn't terminating, e.g. by adding print statements to suspect loops, with a different character for each.
Print out the key values for the loop condition and check that they are heading the right way. In this case the length of nline. If it isn't increasing every time through you need to worry that it won't terminate.
Think carefully before having two loop exits (the condition on the loop and the return); it can make it harder to reason about the behaviour.
I need to iterate over the words in a file. The file could be very big (over 1TB), the lines could be very long (maybe just one line). Words are English, so reasonable in size. So I don't want to load in the whole file or even a whole line.
I have some code that works, but may explode if lines are too long (longer than ~3GB on my machine).
import re

def words(file):
    for line in file:
        words = re.split(r"\W+", line)
        for w in words:
            word = w.lower()
            if word != '':
                yield word
Can you tell me how I can, simply, rewrite this iterator function so that it does not hold more than needed in memory?
Don't read line by line, read in buffered chunks instead:
import re

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        for word in (w.lower() for w in words if w):
            yield word
    if buffer:
        yield buffer.lower()
I'm using the callable-and-sentinel version of the iter() function to handle reading from the file until file.read() returns an empty string; I prefer this form over a while loop.
If you are using Python 3.3 or newer, you can use generator delegation here:
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        yield from (w.lower() for w in words if w)
    if buffer:
        yield buffer.lower()
Demo using a small chunk size to demonstrate this all works as expected:
>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
...
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur