Python .replace() function not working correctly

I'm trying to figure out why the .replace() function in Python isn't working correctly. I spent all of yesterday searching for an answer, but alas, I have not found one.
I'm trying to open and read a file, copy it into a list, count the number of lines in the list, and remove all the punctuation (i.e. , . ! ? etc.). I can do everything except remove the punctuation (and I must use the .replace() function instead of importing a module).
with open('Small_text_file.txt', 'r') as myFile:  # adding lines from file to list
    contents = myFile.readlines()

fileList = []
# punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in contents:
    fileList.append(i.rstrip())

print('The Statistics are:\n', 'Number of lines:', len(fileList))  # first part of question

for item in fileList:
    fileList = item.replace(',', "")
    fileList = item.replace('.', "")
print(fileList)
The "Small text file" is:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Vivamus condimentum sagittis lacus? laoreet luctus ligula laoreet ut.
Vestibulum ullamcorper accumsan velit vel vehicula?
Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum mi.
In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus pellentesque.
Nullam id elementum ipsum. Suspendisse
Running the code returns the following:
The Statistics are:
Number of lines: 6
Nullam id elementum ipsum Suspendisse
So the code DOES remove the comma and period characters but it also removes the preceding 5 lines of the text and only prints the very last line. What am I doing wrong here?

Use enumerate:
for x, item in enumerate(fileList):
    fileList[x] = item.replace(',', "").replace('.', "")
Note: item.replace() returns the replaced string, which you need to store back at the right index of the list. enumerate helps you keep track of that index while iterating. In your original loop, fileList = item.replace(...) rebinds the name fileList to a plain string on every pass, so after the loop only the (cleaned) last line is left, which is exactly what you observed.

It should be:
for i, item in enumerate(fileList):
    fileList[i] = item.replace(',', "").replace('.', "")
Without enumerate:
for i in range(len(fileList)):
    fileList[i] = fileList[i].replace(',', "").replace('.', "")
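If you don't need to update the list in place, a list comprehension that rebuilds it is a common alternative; a minimal sketch, covering just the comma and period from the question:

fileList = [item.replace(',', '').replace('.', '') for item in fileList]
print(fileList)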

Related

How can I wrap text into a paragraph of x characters without importing any modules?

I have a list of words (lowercase) parsed from an article. I joined them together using .join() with a space into one long string. Punctuation is treated like words (i.e. with spaces before and after).
I want to write this string into a file with at most X characters (in this case, 90 characters) per line, without breaking any words. A line cannot start or end with a space.
As part of the assignment I am not allowed to import modules; from my understanding, textwrap would otherwise have helped.
I basically have a while loop nested in a for loop that goes through every 90th character of the string and first checks that it is not a space (i.e. that it is in the middle of a word). The while loop then iterates through the string until it reaches the next space (i.e. it carries the word onto the same line). I then check whether this line, minus leading and trailing whitespace, is longer than 90 characters; if it is, the while loop iterates backwards to the character before the word that extends past 90 characters.
x = 0
for i in range(89, len(text), 90):
    while text[i] != " ":
        i += 1
    if len(text[x:i].strip()) > 90:
        while text[i - 1] != " ":
            i = i - 1
    file.write("".join(text[x:i]).strip() + "\n")
    x = i
The code works for about 90% of the file when compared against the file with the correct output. Occasionally, though, a line exceeds 90 characters without wrapping the extra word onto the next line.
EX:
Actual Output on one line (93 chars):
extraordinary thing , but i never read a patent medicine advertisement without being impelled
Expected Output with "impelled" on new line (84 chars + 8 chars):
extraordinary thing , but i never read a patent medicine advertisement without being\nimpelled
Are there better ways to do this? Any suggestions would be appreciated.
You could consider using a "buffer" to hold the data as you build each line of output. As you read each new word, check whether adding it to the "buffer" would exceed the line length; if it would, print the "buffer" and then reset it, starting with the word that couldn't fit on the line.
data = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis a risus nisi. Nunc arcu sapien, ornare sit amet pretium id, faucibus et ante. Curabitur cursus iaculis nunc id convallis. Mauris at enim finibus, fermentum est non, fringilla orci. Proin nibh orci, tincidunt sed dolor eget, iaculis sodales justo. Fusce ultrices volutpat sapien, in tincidunt arcu. Vivamus at tincidunt tortor. Sed non cursus turpis. Sed tempor neque ligula, in elementum magna vehicula in. Duis ultricies elementum pellentesque. Pellentesque pharetra nec lorem at finibus. Pellentesque sodales ligula sed quam iaculis semper. Proin vulputate, arcu et laoreet ultrices, orci lacus pellentesque justo, ut pretium arcu odio at tellus. Maecenas sit amet nisi vel elit sagittis tristique ac nec diam. Suspendisse non lacus purus. Sed vulputate finibus facilisis."""
sentence_limit = 40
buffer = ""
for word in data.split():
    word_length = len(word)
    buffer_length = len(buffer)
    if word_length > sentence_limit:
        print(f"ERROR: the word '{word}' is longer than the sentence limit of {sentence_limit}")
        break
    if buffer_length + word_length < sentence_limit:
        if buffer:
            buffer += " "
        buffer += word
    else:
        print(buffer)
        buffer = word
print(buffer)
OUTPUT
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Duis a risus nisi. Nunc
arcu sapien, ornare sit amet pretium id,
faucibus et ante. Curabitur cursus
iaculis nunc id convallis. Mauris at
enim finibus, fermentum est non,
fringilla orci. Proin nibh orci,
tincidunt sed dolor eget, iaculis
sodales justo. Fusce ultrices volutpat
sapien, in tincidunt arcu. Vivamus at
tincidunt tortor. Sed non cursus turpis.
Sed tempor neque ligula, in elementum
magna vehicula in. Duis ultricies
elementum pellentesque. Pellentesque
pharetra nec lorem at finibus.
Pellentesque sodales ligula sed quam
iaculis semper. Proin vulputate, arcu et
laoreet ultrices, orci lacus
pellentesque justo, ut pretium arcu odio
at tellus. Maecenas sit amet nisi vel
elit sagittis tristique ac nec diam.
Suspendisse non lacus purus. Sed
vulputate finibus facilisis.
Using a regular expression:
import re

with open('f0.txt', 'r') as f:
    # file must be 1 long single line of text
    text = f.read().rstrip()

# each match captures up to 70 characters, backtracking so that it ends at a space (or end of input)
for line in re.finditer(r'(.{1,70})(?:$|\s)', text):
    print(line.group(1))
To approach it another way, without regex:
# Constant: maximum line width
J = 70
# output list
out = []

with open('f0.txt', 'r') as f:
    # assumes file is 1 long line of text
    line = f.read().rstrip()

i = 0
while i + J < len(line):
    # find the last space within the current window
    idx = line.rfind(' ', i, i + J)
    if idx != -1:
        out.append(line[i:idx])
        i = idx + 1
    else:
        # no space in the window: hyphenate and hard-break the word
        out.append(line[i:i + J] + '-')
        i += J
out.append(line[i:])  # get ending line portion

for line in out:
    print(line)
Here are the file contents (one long single line):
I have basically a while loop nested in a for loop that goes through every 90 characters of the string, and firstly checks if it is not a space (ie. in the middle of a word). The while loop would then iterate through the string until it reaches the next space (ie. incorporates the word unto the same line). I then check if this line, minus the leading and trailing whitespaces, is longer than 90 characters, and if it is, the while loop iterates backwards and reaches the character before the word that extends over 90 characters.
Output:
I have basically a while loop nested in a for loop that goes through
every 90 characters of the string, and firstly checks if it is not a
space (ie. in the middle of a word). The while loop would then
iterate through the string until it reaches the next space (ie.
incorporates the word unto the same line). I then check if this line,
minus the leading and trailing whitespaces, is longer than 90
characters, and if it is, the while loop iterates backwards and
reaches the character before the word that extends over 90 characters.

Parsing and updating markdown file with Python

I'm creating a script that will traverse a markdown file and update any image tags from
![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif)
to
![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif?alt-text="Daffy Duck")
I'm new to Python, so I'm unsure about syntax and my approach, but my current thinking is to create a new empty string, traverse the original markdown line by line, and, if an image tag is detected, splice the alt text into the correct location and add the lines to the new markdown string. The code I have so far looks like:
import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension

originalMarkdown = '''
## New Article
Lorem ipsum dolor sit amet, consectetur adipiscing elit. In pretium nunc ligula. Quisque bibendum vel lectus sed pulvinar. Phasellus id magna ac arcu iaculis facilisis. Curabitur tincidunt sed ipsum vel lacinia. Nulla et semper urna. Quisque ultrices hendrerit magna nec tempor.
![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif)
Quisque accumsan sem mi. Nunc orci justo, laoreet vel metus nec, interdum euismod ipsum.
![Bugs Bunny](http://www.nationalnannies.com/wp-content/uploads/2012/03/bugsbunny.png)
Suspendisse augue odio, pharetra ac erat eget, volutpat ornare velit. Sed ac luctus quam. Sed id mauris erat. Duis lacinia faucibus metus, nec vehicula metus consectetur eu.
'''

updatedMarkdown = ""

# First create the treeprocessor
class AltTextExtractor(Treeprocessor):
    def run(self, doc):
        "Find all alt_text and append to markdown.alt_text."
        self.markdown.alt_text = []
        for image in doc.findall('.//img'):
            self.markdown.alt_text.append(image.get('alt'))

# Then traverse the markdown file and concatenate the alt text to the end of any image tags
class ImageTagUpdater(Treeprocessor):
    def run(self, doc):
        # Set a counter
        count = 0
        # Go through markdown file line by line
        for line in doc:
            # if line is an image tag
            if line > ('.//img'):
                # grab the array of the alt text
                img_ext = ImgExtractor(md)
                # find the second to last character of the line
                location = len(line) - 1
                # insert the alt text
                line += line[location] + '?' + '"' + img_ext[count] + '"'
                # add line to new markdown file
                updatedMarkdownadd.add(line)
The above code is pseudo code. I'm able to successfully extract the strings I need from the original file but I'm unable to concatenate those strings to their respective image tags and update the original file.
Provided your files aren't huge, it might be easier to overwrite the file, rather than try to wedge little bits in here or there.
orig = '![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif)'
new = '![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif?alt-text="Daffy Duck")'

with open(filename, 'r') as f:
    # splitlines() drops the trailing newlines, so the comparison below can match
    text = f.read().splitlines()

new_text = "\n".join([line if line != orig else new for line in text])

with open(filename, 'w') as f:
    f.write(new_text)
You could also use re.sub from the re module, but I suppose it's a matter of preference.
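For completeness, a minimal sketch of the re.sub variant, assuming every image tag has the basic ![alt](url) form with no nested brackets and no existing query string:

import re

# capture the alt text and the URL of each image tag
IMG_TAG = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)\)')

def add_alt_text(markdown_text):
    # rebuild each tag with its own alt text appended as a query parameter
    return IMG_TAG.sub(
        lambda m: '![{0}]({1}?alt-text="{0}")'.format(m.group(1), m.group(2)),
        markdown_text)

Applied to the question's example, this turns ![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif) into ![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif?alt-text="Daffy Duck") in one pass over the whole text.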

Python - Removing all punctuation from a string and only printing words that contain an "i" and are at least five characters long

testText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque nec mauris nec tellus mollis ullamcorper. Vestibulum sit amet arcu placerat, sagittis quam sed, rutrum sem. Morbi vulputate odio non lacus."
splitText = testText.split(" ")
print(splitText)
cleanedText = ''
for letter in testText:
    if letter in list('.,:;?!'):
        cleanedText.append(letter)
''.join(cleanedText)
I am trying to remove all punctuation from the paragraph above, but I am running into an "AttributeError: 'str' object has no attribute 'append'".
What could be going wrong, and how should I go about resolving it?
Additionally, how would I then only print words that are equal to or longer than five characters and contain an 'i'?
A simple trick to remove a character is to replace it with an empty string (using replace). For the second part we check two conditions: that 'i' is in the word and that the length is equal to or greater than 5. Beware that an uppercase 'I' would not match the lowercase check!
testText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque nec mauris nec tellus mollis ullamcorper. Vestibulum sit amet arcu placerat, sagittis quam sed, rutrum sem. Morbi vulputate odio non lacus."
str_to_remove = list('.,:;?!')
for letter in str_to_remove:
    testText = testText.replace(letter, '')

for word in testText.split(' '):
    if 'i' in word and len(word) >= 5:
        print(word)
Try this (note that cleanText has to be initialized as an empty string first, and characters are accumulated with +=):

cleanText = ''
for letter in testText:
    if letter not in list('.,:;?!\''):
        cleanText += letter
print(cleanText)
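For comparison, the same cleanup can also be done in a single pass with str.join and a generator expression (a sketch using the same punctuation set as above):

cleanedText = ''.join(letter for letter in testText if letter not in '.,:;?!')
print(cleanedText)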

How to split string to substrings with given length but not breaking sentences?

I have a string holding a large text and need to split it into multiple substrings with length <= N characters (as close to N as possible; N is always bigger than the longest sentence), but I also must not break the sentences.
For example, if I have N = 80 and given text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel.
I want to get list of strings:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam."
"Nam sit amet iaculis lacus, non sagittis nulla."
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
I also want this to work with both English and Russian.
How can I achieve this?
The steps I'd take:
Initiate a list to store the lines and a current line variable to store the string of the current line.
Split the paragraph into sentences - this requires you to .split on '.', remove the trailing empty sentence (""), strip leading and trailing whitespace (.strip) and then add the full stops back.
Loop through these sentences and:
if the sentence can be added onto the current line, add it
otherwise add the current working line string to the list of lines and set the current line string to be the current sentence
So, in Python, something like:
para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
lines = []
line = ''
for sentence in (s.strip() + '.' for s in para.split('.')[:-1]):
    if not line:  # first sentence of a new line: take it without a leading space
        line = sentence
    elif len(line) + len(sentence) + 1 >= 80:  # can't fit on that line => start new one
        lines.append(line)
        line = sentence
    else:  # can fit => add a space then this sentence
        line += ' ' + sentence
lines.append(line)  # don't drop the line still being built
giving lines as:
[
 "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
 "Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla.",
 "Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
 "Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
]
There's no built-in for this that I can find, so here's a start. You can make it smarter by checking before and after for where to move the sentences, instead of just before. Length includes spaces, because I'm splitting naïvely instead of with regular expressions or something.
def get_sentences(text, min_length):
    sentences = (sentence + ". "
                 for sentence in text.split(". "))
    current_line = ""
    for sentence in sentences:
        if len(current_line) >= min_length:
            yield current_line
            current_line = sentence
        else:
            current_line += sentence
    yield current_line
It's slow for long lines, but it does the job.
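A quick usage sketch with a couple of sentences from the question's sample text (a min_length of 60 is an arbitrary choice); note that the naïve ". " split leaves a doubled full stop after the final sentence:

para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla."
for chunk in get_sentences(para, 60):
    print(chunk)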

get words from large file, using low memory in python

I need to iterate over the words in a file. The file could be very big (over 1 TB), and the lines could be very long (the whole file may be just one line). Words are English, so reasonable in size. So I don't want to load in the whole file or even a whole line.
I have some code that works, but it may explode if lines are too long (longer than ~3 GB on my machine).
import re

def words(file):
    for line in file:
        words = re.split(r"\W+", line)
        for w in words:
            word = w.lower()
            if word != '':
                yield word
Can you tell me how I can, simply, rewrite this iterator function so that it does not hold more than needed in memory?
Don't read line by line; read in buffered chunks instead:
import re

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk, or empty
        for word in (w.lower() for w in words if w):
            yield word
    if buffer:
        yield buffer.lower()
I'm using the callable-and-sentinel version of the iter() function to handle reading from the file until file.read() returns an empty string; I prefer this form over a while loop.
If you are using Python 3.3 or newer, you can use generator delegation here:
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk, or empty
        yield from (w.lower() for w in words if w)
    if buffer:
        yield buffer.lower()
Demo using a small chunk size to demonstrate this all works as expected:
>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
...
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur
