I have written two functions in Python.
When I run replace(), it looks at the data structure named replacements. It takes each key, iterates through the document, and when a key matches a word in the document, it replaces the word with the corresponding value.
Now, because the dictionary also contains the reverse mappings ('stopped' changes to 'suspended' and 'suspended' changes to 'stopped', depending on what is in the text file), it seems that as it goes through the file some words are changed and then changed back, i.e. no changes end up being made.
When I run replace2(), I take each word from the text document and check whether it is a key in replacements. If it is, I replace it. What I have noticed, though, is that when I run this, "suspended" (which contains the substring "ended") ends up as "suspfinished".
Is there an easier way to iterate through the text file and change each word only once, if found? I think replace2() does roughly what I want, although I'm losing multi-word phrases, and it also picks up substrings, which it should not, since I did use the split() function.
def replace():
    fileinput = open('tennis.txt').read()
    out = open('tennis.txt', 'w')
    for i in replacements.keys():
        fileinput = fileinput.replace(i, replacements[i])
        print(i, " : ", replacements[i])
    out.write(fileinput)
    out.close
def replace2():
    fileinput = open('tennis.txt').read()
    out = open('tennis.txt', 'w')
    #for line in fileinput:
    for word in fileinput.split():
        for i in replacements.keys():
            print(i)
            if word == i:
                fileinput = fileinput.replace(word, replacements[i])
    out.write(fileinput)
    out.close
replacements = {
    'suspended' : 'stopped',
    'stopped' : 'suspended',
    'due to' : 'because of',
    'ended' : 'finished',
    'finished' : 'ended',
    '40' : 'forty',
    'forty' : '40',
    'because of' : 'due to' }
the match ended due to rain a mere 40 minutes after it started. it was
suspended because of rain.
Improved version of rawbeans' answer. That one didn't work as expected because some of your replacement keys contain multiple words.
Tested with your example line, it outputs: the match finished because of rain a mere forty minutes after it started. it was stopped due to rain.
import re

def replace2():
    fileinput = open('tennis.txt').read()
    out = open('tennisout.txt', 'w')
    #for line in fileinput:
    wordpats = '|'.join(replacements.keys())
    pattern = r'({0}|\w+|\W|[.,!?;-_])'.format(wordpats)
    words = re.findall(pattern, fileinput)
    output = "".join(replacements.get(x, x) for x in words)
    out.write(output)
    out.close()
replacements = {
    'suspended' : 'stopped',
    'stopped' : 'suspended',
    'due to' : 'because of',
    'ended' : 'finished',
    'finished' : 'ended',
    '40' : 'forty',
    'forty' : '40',
    'because of' : 'due to' }

if __name__ == '__main__':
    replace2()
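One caveat worth noting (my addition, not part of the answer above): wordpats is built by joining the raw dictionary keys into an alternation. If a key ever contained a regex metacharacter, or if one key were a prefix of another, building the alternation along these lines would be safer:

# Hypothetical hardening of the pattern construction above:
# escape each key and try longer keys before shorter ones.
wordpats = '|'.join(re.escape(key)
                    for key in sorted(replacements, key=len, reverse=True))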
is there an easier way to iterate through the text file and only change the word once, if found?
There's a much simpler way:
output = " ".join(replacements.get(x, x) for x in fileinput.split())
out.write(output)
To account for punctuation, use a regular expression instead of split():
output = " ".join(replacements.get(x, x) for x in re.findall(r"[\w']+|[.,!?;]", fileinput))
out.write(output)
This way, punctuation will be ignored during the replace, but will be present in the final string. See this post for an explanation and potential caveats.
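Put together as a self-contained snippet (my rearrangement of the above, using the question's dictionary and example sentence, with no file I/O), the approach looks like this:

import re

replacements = {'suspended': 'stopped', 'stopped': 'suspended',
                'due to': 'because of', 'ended': 'finished',
                'finished': 'ended', '40': 'forty',
                'forty': '40', 'because of': 'due to'}

text = ("the match ended due to rain a mere 40 minutes after it started. "
        "it was suspended because of rain.")

tokens = re.findall(r"[\w']+|[.,!?;]", text)
print(" ".join(replacements.get(t, t) for t in tokens))
# -> the match finished due to rain a mere forty minutes after it started . it was stopped because of rain .
# Each token is looked up only once, so nothing is swapped back. Multi-word keys
# such as 'due to' are not matched, and a space appears before punctuation --
# these are the caveats discussed above.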
Related
I have a long list of elements; each element is a string. See the sample below:
data = ['BAT.A.100', 'Regulation 2020-1233', 'this is the core text of', 'the regulation referenced ',
        'MOC to BAT.A.100', 'this', 'is', 'one method of demonstrating compliance to BAT.A.100',
        'BAT.A.120', 'Regulation 2020-1599', 'core text of the regulation ...', ' more free text', 'more free text',
        'BAT.A.145', 'Regulation 2019-3333', 'core text of', 'the regulation1111',
        'MOC to BAT.A.145', 'here is how you can show compliance to BAT.A.145', 'more free text',
        'MOC2 to BAT.A.145', ' here is yet another way of achieving compliance']
My desired output is ultimately a Pandas DataFrame with the columns Short_ref, Full_Reg_ref, Reg_text, Moc_text and MOC2.
As the strings may have to be concatenated, I first join all the elements into a single string, using ## to separate the pieces that have been joined.
I am going for an all-regex approach because there would be a lot of conditions to check otherwise.
import re
import pandas as pd

re_req = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
                    r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
                    r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))'
                    r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to \1)(?P<MOC2>(?:##)?.*?(?=##BAT\.A\.\d{3})(?!\1)|.+)'
                    r'|(?P<Moc_text_temp>.*?(?=##BAT\.A\.\d{3})(?!\1)))')

final_list = []
for match in re_req.finditer("##" + "##".join(data)):
    inner_list = [match.group('Short_ref').replace("##", " "),
                  match.group('Full_Reg_ref').replace("##", " "),
                  match.group('Reg_text').replace("##", " ")]

    if match.group('Moc_text_temp'):  # just Moc_text is present
        inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
    elif match.group('Moc_text') and match.group('MOC2'):  # both Moc_text and MOC2 are present
        inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
    else:  # neither Moc_text nor MOC2 is present
        inner_list += ["", ""]

    final_list.append(inner_list)

final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])
The first and second lines of the regex are the same as what you posted earlier and identify the first two columns.
The third line of the regex, r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))', matches all text up to "MOC to Short_ref", or else all the text before the next reference block. The (?=##BAT\.A\.\d{3})(?!\1) part takes the text up to the next Short_ref pattern, provided that Short_ref is not the current one.
The fourth line covers the case when both Moc_text and MOC2 are present, and it is OR'd with the fifth line, which covers the case when just Moc_text is present. This part of the regex is similar to the third line.
Finally, loop over all the matches using finditer and construct the rows of the DataFrame, final_df.
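As a stripped-down illustration of the same finditer-to-DataFrame pattern (the text and regex below are simplified, made-up stand-ins, not the full expression above), each match's groupdict() becomes one row:

import re
import pandas as pd

# Simplified, hypothetical sample: reference, regulation, free text, repeated.
text = ("##BAT.A.100##Regulation 2020-1233##some text"
        "##BAT.A.120##Regulation 2020-1599##other text")

pattern = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
                     r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
                     r'##(?P<Reg_text>.*?)(?=##BAT\.A\.\d{3}|$)')

# One dict of named groups per match, one DataFrame row per dict.
rows = [m.groupdict() for m in pattern.finditer(text)]
df = pd.DataFrame(rows, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text'])
print(df)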
The code below is supposed to clean out the word frack, and eventually a list of bad words. For now the issue is with the function clean_line: if a line of text contains frack more than once, it only replaces the first occurrence, and it also does not react to capital letters.
class Cleaner:
    def __init__(self, forbidden_word = "frack"):
        """ Set the forbidden word """
        self.word = forbidden_word

    def clean_line(self, line):
        """Clean up a single string, replacing the forbidden word by *beep!*"""
        found = line.find(self.word)
        if found != -1:
            return line[:found] + "*beep!*" + line[found+len(self.word):]
        return line

    def clean(self, text):
        for i in range(len(text)):
            text[i] = self.clean_line(text[i])
example_text = [
    "What the frack! I am not going",
    "to honour that question with a response.",
    "In fact, I think you should",
    "get the fracking frack out of here!",
    "Frack you!"
]

clean_text = Cleaner().clean(example_text)
for line in example_text: print(line)
Assuming that you just want to get rid of any word with frack in it, you could do something like the code below. If you need to also get rid of trailing whitespace, then you will need to change the regular expression a little bit. If you need to learn more about regular expressions, I would recommend checking out regexone.com.
# Using regular expressions makes string manipulation easier
import re
example_text = [
    "What the frack! I am not going",
    "to honour that question with a response.",
    "In fact, I think you should",
    "get the fracking frack out of here!",
    "Frack you!"
]
# The pattern below gets rid of all words which start with 'frack'
filter = re.compile(r'frack\w*', re.IGNORECASE)
# We then apply this filter to each element in the example_text list
clean = [filter.sub("", e) for e in example_text]
print(clean)
Output
['What the ! I am not going',
'to honour that question with a response.',
'In fact, I think you should',
'get the out of here!',
' you!']
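If the goal is to bleep the word out, as in the question's clean_line, rather than delete it, the same regex idea can be reworked into the Cleaner class. This is a minimal sketch of my own, not part of the answer above; it replaces every occurrence and ignores case:

import re

class Cleaner:
    def __init__(self, forbidden_word="frack"):
        # Compile once; IGNORECASE also catches "Frack", "FRACK", ...
        self.pattern = re.compile(re.escape(forbidden_word), re.IGNORECASE)

    def clean_line(self, line):
        # re.sub replaces every occurrence, not just the first one
        return self.pattern.sub("*beep!*", line)

    def clean(self, text):
        # Return a new list instead of mutating the input in place
        return [self.clean_line(line) for line in text]

print(Cleaner().clean(["What the frack! I am not going", "Frack you!"]))
# ['What the *beep!*! I am not going', '*beep!* you!']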
Use this simple code to clean up your line from a bad word:
line = "frack one Frack two"
bad_word = "frack"
line = line.lower()
if bad_word in line:
clean_line = line.replace(bad_word, "")
Resulting in clean_line being:
"one two"
Is there a way to replace a word within a string without using a "string replace function", e.g. string.replace(string, word, replacement)?
[out] = forecast('This snowy weather is so cold.','cold','awesome')
out => 'This snowy weather is so awesome.'
Here the word cold is replaced with awesome.
This is from my MATLAB homework, which I am trying to do in Python. When doing this in MATLAB we were not allowed to use strrep().
In MATLAB, I can use strfind to find the index and work from there. However, I noticed that there is a big difference between lists and strings: strings are immutable in Python, so I will likely have to convert the string to a different data type in order to manipulate it the way I want without using a string replace function.
just for fun :)
st = 'This snowy weather is so cold .'.split()
given_word = 'awesome'

for i, word in enumerate(st):
    if word == 'cold':
        st.pop(i)
        st.insert(i, given_word)
        break  # stop after the first match

print(' '.join(st))
Here's another answer that might be closer to the solution you described using MATLAB:
st = 'This snowy weather is so cold.'
given_word = 'awesome'
word_to_replace = 'cold'

n = len(word_to_replace)
index_of_word_to_replace = st.find(word_to_replace)
print(st[:index_of_word_to_replace] + given_word + st[index_of_word_to_replace+n:])
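Wrapped into the forecast signature from the question, the same find-and-slice idea might look like this (a sketch; no replace function is used):

def forecast(sentence, old, new):
    # Locate the target word and rebuild the string by slicing.
    i = sentence.find(old)
    if i == -1:
        return sentence  # word not found, nothing to do
    return sentence[:i] + new + sentence[i + len(old):]

print(forecast('This snowy weather is so cold.', 'cold', 'awesome'))
# This snowy weather is so awesome.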
You can convert your string into a list object, find the index of the word you want to replace and then replace the word.
sentence = "This snowy weather is so cold"
# Split the sentence into a list of the words
words = sentence.split(" ")
# Get the index of the word you want to replace
word_to_replace_index = words.index("cold")
# Replace the target word with the new word based on the index
words[word_to_replace_index] = "awesome"
# Generate a new sentence
new_sentence = ' '.join(words)
Using Regex and a list comprehension.
import re

def strReplace(sentence, toReplace, toReplaceWith):
    return " ".join([re.sub(toReplace, toReplaceWith, i) if re.search(toReplace, i) else i for i in sentence.split()])

print(strReplace('This snowy weather is so cold.', 'cold', 'awesome'))
Output:
This snowy weather is so awesome.
With the following code (a bit messy, I acknowledge) I separate a string by commas, but the condition is that it must not separate when the string contains comma-separated single words. For example:
It doesn't separate "Yup, there's a reason why you want to hit the sack just minutes after climax", but it separates "The increase in heart rate, which you get from masturbating, is directly beneficial to the circulation, and can reduce the likelihood of a heart attack" into ['The increase in heart rate', 'which you get from masturbating', 'is directly beneficial to the circulation', 'and can reduce the likelihood of a heart attack'].
The problem is that the purpose of the code fails when it encounters a string such as: "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow." I don't want a separation after oxytocin, but I do after prolactin. I need a regex to do that.
import os
import textwrap
import re
import io
from textblob import TextBlob

string = str(input_string)
listy = [x.strip() for x in string.split(',')]
listy = [x.replace('\n', '') for x in listy]
listy = [re.sub('(?<!\d)\.(?!\d)', '', x) for x in listy]
listy = filter(None, listy)  # Remove any empty strings
newstring = []
for segment in listy:
    wc = TextBlob(segment).word_counts
    if listy[len(listy)-1] != segment:
        if len(wc) > 3:  # len(segment.split(' ')) > 7:
            newstring.append(segment + "&&")
        else:
            newstring.append(segment + ",")
    else:
        newstring.append(segment)
sep = [x.strip() for x in (' '.join(newstring)).split('&&')]
Consider the below..
mystr="When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow."
rExp=r",(?!\s+(?:and\s+)?\w+,)"
mylst=re.compile(rExp).split(mystr)
print(mylst)
should give the output below:
['When men ejaculate', ' it releases a slew of chemicals including oxytocin, vasopressin, and prolactin', ' all of which naturally help you hit the pillow.']
Let's look at how we split the string...
,(?!\s+\w+,)
Split on every comma that is not followed by \s+\w+, i.e. whitespace and a word that has a comma right after it; the (?! ... ) is a negative lookahead.
The above would fail on "vasopressin, and", since "and" is not followed by a comma. So introduce an optional and\s+ inside the lookahead.
,(?!\s+(?:and\s+)?\w+,)
Although I might want to use the below
,(?!\s+(?:(?:and|or)\s+)?\w+,)
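For instance, on a made-up sentence with an "or" in the list (my example, not from the question):

import re

s = "It releases oxytocin, vasopressin, or prolactin, which naturally helps."
print(re.split(r",(?!\s+(?:(?:and|or)\s+)?\w+,)", s))
# ['It releases oxytocin, vasopressin, or prolactin', ' which naturally helps.']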
In essence consider replacing your line
listy= [x.strip() for x in string.split(',')]
with
listy= [x.strip() for x in re.split(r",(?!\s+(?:and\s+)?\w+,)",string)]
I need to construct a program for my class which will read messed-up text from a file and give the text a book form, so it turns this input:
This is programing story , for programmers . One day a variable
called
v comes to a bar and ordred some whiskey, when suddenly
a new variable was declared .
a new variable asked : " What did you ordered? "
into this output:
This is programing story,
for programmers. One day
a variable called v comes
to a bar and ordred some
whiskey, when suddenly a
new variable was
declared. A new variable
asked: "what did you
ordered?"
I am a total beginner at programming, and my code is here:
def vypis(t):
    cely_text = ''
    for riadok in t:
        cely_text += riadok.strip()
    a = 0
    for i in range(0, 80):
        if cely_text[0+a] == " " and cely_text[a+1] == " ":
            cely_text = cely_text.replace("  ", " ")
        a += 1
    d = 0
    for c in range(0, 80):
        if cely_text[0+d] == " " and (cely_text[a+1] == "," or cely_text[a+1] == "." or cely_text[a+1] == "!" or cely_text[a+1] == "?"):
            cely_text = cely_text.replace(" ", "")
        d += 1

def vymen(riadok):
    for ch in riadok:
        if ch in '.,":':
            riadok = riadok[ch-1].replace(" ", "")

x = int(input("Zadaj x"))
t = open("text.txt", "r")
v = open("prazdny.txt", "w")
print(vypis(t))
This code has deleted some spaces, and I have also tried to delete the spaces before characters like ".,?", but that did not work. Why? Thanks for the help :)
You want to do quite a lot of things, so let's take them in order:
Let's get the text in a nice text form (a list of strings):
>>> with open('text.txt', 'r') as f:
... lines = f.readlines()
>>> lines
['This is programing story , for programmers . One day a variable',
'called', 'v comes to a bar and ordred some whiskey, when suddenly ',
' a new variable was declared .',
'a new variable asked : " What did you ordered? "']
You have newlines all around the place. Let's replace them by spaces and join everything into a single big string:
>>> text = ' '.join(line.replace('\n', ' ') for line in lines)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'
Now we want to remove any multiple spaces. We split by space, tabs, etc... and keep only the non-empty words:
>>> words = [word for word in text.split() if word]
>>> words
['This', 'is', 'programing', 'story', ',', 'for', 'programmers', '.', 'One', 'day', 'a', 'variable', 'called', 'v', 'comes', 'to', 'a', 'bar', 'and', 'ordred', 'some', 'whiskey,', 'when', 'suddenly', 'a', 'new', 'variable', 'was', 'declared', '.', 'a', 'new', 'variable', 'asked', ':', '"', 'What', 'did', 'you', 'ordered?', '"']
Let us join our words by spaces... (only one this time)
>>> text = ' '.join(words)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'
We now want to remove all the <SPACE>., <SPACE>, etc...:
>>> for char in (',', '.', ':', '"', '?', '!'):
...     text = text.replace(' ' + char, char)
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. a new variable asked:" What did you ordered?"'
OK, the work is not done, as the " characters are still messed up, the capitalization is not fixed, etc. You can still incrementally update your text. For the capitalization, consider for instance:
>>> sentences = text.split('.')
>>> sentences
['This is programing story, for programmers', ' One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared', ' a new variable asked:" What did you ordered?"']
See how you can fix it?
The trick is to take only string transformations such that:
A correct sentence is UNCHANGED by the transformation
An incorrect sentence is IMPROVED by the transformation
This way you can compose them and improve your text incrementally.
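For example, here are two small transformations that satisfy both properties, composed in order (a sketch of the idea on a made-up snippet, not part of the original walkthrough):

>>> messy = 'One  day a   variable was declared .'
>>> def squeeze_spaces(s):
...     return ' '.join(s.split())
>>> def fix_punctuation(s):
...     for char in (',', '.', ':', '?', '!'):
...         s = s.replace(' ' + char, char)
...     return s
>>> for transform in (squeeze_spaces, fix_punctuation):
...     messy = transform(messy)
>>> messy
'One day a variable was declared.'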
Once you have a nicely formatted text, like this:
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. A new variable asked: "what did you ordered?"'
You have to define similar syntactic rules for printing it out in book format. Consider for instance the function:
>>> def prettyprint(text):
...     return '\n'.join(text[i:i+50] for i in range(0, len(text), 50))
It will print each line with an exact length of 50 characters:
>>> print(prettyprint(text))
This is programing story, for programmers. One day
a variable called v comes to a bar and ordred som
e whiskey, when suddenly a new variable was declar
ed. A new variable asked: "what did you ordered?"
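As an aside, the standard library's textwrap module (which the question's code already imports) wraps at word boundaries instead of cutting words like "som e"; something along these lines:

>>> import textwrap
>>> print(textwrap.fill(text, width=50))
This is programing story, for programmers. One
day a variable called v comes to a bar and ordred
some whiskey, when suddenly a new variable was
declared. A new variable asked: "what did you
ordered?"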
The prettyprint output is not bad, but it can be better. Just like we previously juggled with text, lines, sentences and words to match the syntactic rules of the English language, we want to do exactly the same to match the syntactic rules of printed books.
In that case, both the English language and printed books work on the same units: words, arranged in sentences. This suggests we might want to work on these directly. A simple way to do that is to define your own objects:
>>> class Sentence(object):
...     def __init__(self, content, punctuation):
...         self.content = content
...         self.endby = punctuation
...     def pretty(self):
...         nice = []
...         content = self.content.pretty()
...         # A sentence starts with a capital letter
...         nice.append(content[0].upper())
...         # The rest has already been prettified by the content
...         nice.extend(content[1:])
...         # Do not forget the punctuation sign
...         nice.append(self.endby)
...         return ''.join(nice)
>>> class Paragraph(object):
...     def __init__(self, sentences):
...         self.sentences = sentences
...     def pretty(self):
...         # Separating our sentences by a single space
...         return ' '.join(sentence.pretty() for sentence in self.sentences)
etc... This way you can represent your text as:
>>> Paragraph(
...     Sentence(
...         Propositions([Proposition(['this',
...                                    'is',
...                                    'programming',
...                                    'story']),
...                       Proposition(['for',
...                                    'programmers'])],
...                      ','),
...         '.'),
...     Sentence(...
etc...
Converting from a string (even a messed up one) to such a tree is relatively straightforward as you only break down to the smallest possible elements. When you want to print it in book format, you can define your own book methods on each element of the tree, e.g. like this, passing around the current line, the output lines and the current offset on the current line:
class Proposition(object):
    ...
    def book(self, line, lines, offset, line_length):
        for word in self.words:
            if offset + len(word) > line_length:
                lines.append(' '.join(line))
                line = []
                offset = 0
            line.append(word)
            offset += len(word) + 1  # track how full the current line is
        return line, lines, offset
    ...

class Propositions(object):
    ...
    def book(self, line, lines, offset, line_length):
        line, lines, offset = self.Proposition1.book(line, lines, offset, line_length)
        if offset + len(self.punctuation) + 1 > line_length:
            # Need to add the punctuation sign with the last word
            # to a new line
            word = line.pop()
            lines.append(' '.join(line))
            line = [word + self.punctuation + ' ']
            offset = len(word + self.punctuation + ' ')
        line, lines, offset = self.Proposition2.book(line, lines, offset, line_length)
        return line, lines, offset
And work your way up to Sentence, Paragraph, Chapter...
This is a very simplistic implementation (and actually a non-trivial problem) which does not take into account syllabification or justification (which you would probably like to have), but this is the way to go.
Note that I did not mention the string module, string formatting or regular expressions, which are tools to use once you have defined your syntactic rules or transformations. These are extremely powerful tools, but the most important thing here is to know exactly the algorithm that transforms an invalid string into a valid one. Once you have some working pseudocode, regexps and format strings can help you achieve it with less pain than plain character iteration. (In my previous example of a tree of words, for instance, regexps can tremendously ease the construction of the tree, and Python's powerful string formatting functions can make the writing of the book or pretty methods much easier.)
To strip the multiple spaces you could use a simple regex substitution.
import re
cely_text = re.sub(' +',' ', cely_text)
Then for punctuation you could run a similar sub:
cely_text = re.sub(r' +([,.:])', r'\g<1>', cely_text)
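Applied to one line of the question's input, for example:

import re

cely_text = "This is programing story , for programmers ."
cely_text = re.sub(' +', ' ', cely_text)
cely_text = re.sub(r' +([,.:])', r'\g<1>', cely_text)
print(cely_text)
# This is programing story, for programmers.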