Censoring a txt file with python - python

Hi I could really use some help on a python project that I'm working on. Basically I have a list of banned words and I must go through a .txt file and search for these specific words and change them from their original form to a ***.
text_file = open('filename.txt','r')
text_file_read = text_file.readlines()
banned_words = ['is','activity', 'one']
words = []
i = 0
while i < len(text_file_read):
words.append(text_file_read[i].strip().lower().split())
i += 1
i = 0
while i < len(words):
if words[i] in banned_words:
words[i] = '*'*len(words[i])
i += 1
i = 0
text_file_write = open('filename.txt', 'w')
while i < len(text_file_read):
print(' '.join(words[i]), file = text_file_write)
i += 1
The expected output would be:
This **
********
***?
However its:
This is
activity
one?
Any help is greatly appreciated! I'm also trying not to use external libraries as well

I cannot solve this for you (haven't touched python in a while), but the best debugging tip I can offer is: print everything. Take the first loop, print every iteration, or print what "words" is afterwards. It will give you an insight on what's going wrong, and once you know what is working in an unexpected way, you can search how you can fix it.
Also, if you're just starting, avoid concatenating methods. It ends up a bit unreadable, and you can't see what each method is doing. In my opinion at least, it's better to have 30 lines of readable and easy-to-understand code, than 5 that take some brain power to understand.
Good luck!

A simpler way if you just need to print it
banned_words = ['is','activity', 'one']
output = ""
f = open('filename.txt','r')
for line in f:
for word in line.rsplit():
if not word in banned_words:
output += word + " "
else:
output += "*"*len(word) + " "
output += "\n"
print(output)

Related

output differences between 2 texts when lines are dissimilar

I am relatively new to Python so apologies in advance for sounding a bit ditzy sometimes. I'll try took google and attempt your tips as much as I can before asking even more questions.
Here is my situation: I am working with R and stylometry to find out the (likely) authorship of a text. What I'd like to do is see if there is a difference in the stylometry of a novel in the second edition, after one of the (assumed) co-authors died and therefore could not have contributed. In order to research that I need
Text edition 1
Text edition 2
and for python to output
words that appear in text 1 but not in text 2
words that appear in text 2 but not in text 1
And I would like to have the words each time they appear so not just 'the' once, but every time the program encounters it when it differs from the first edition (yep I know I'm asking for a lot sorry)
I have tried approaching this via
file1 = open("FRANKENST18.txt", "r")
file2 = open("FRANKENST31.txt", "r")
file3 = open("frankoutput.txt", "w")
list1 = file1.readlines()
list2 = file2.readlines()
file3.write("here: \n")
for i in list1:
for j in list2:
if i==j:
file3.write(i)
but of course this doesn't work because the texts are two giant balls of texts and not separate lines that can be compared, plus the first text has far more lines than the second one. Is there a way to go from lines to 'words' or the text in general to overcome that? Can I put an entire novel in a string lol? I assume not.
I have also attempted to use difflib, but I've only started coding a few weeks ago and I find it quite complicated. For example, I used fraxel's script as a base for:
from difflib import Differ
s1 = open("FRANKENST18.txt", "r")
s1 = open("FRANKENST31.txt", "r")
def appendBoldChanges(s1, s2):
#"Adds <b></b> tags to words that are changed"
l1 = s1.split(' ')
l2 = s2.split(' ')
dif = list(Differ().compare(l1, l2))
return " ".join(['<b>'+i[2:]+'</b>' if i[:1] == '+' else i[2:] for i in dif
if not i[:1] in '-?'])
print appendBoldChanges
but I couldn't get it to work.
So my question is is there any way to output the differences between texts that are not similar in lines like this? It sounded quite do-able but I've greatly underestimated how difficult I found Python haha.
Thanks for reading, any help is appreciated!
EDIT: posting my current code just in case it might help fellow learners that are googling for answers:
file1 = open("1stein.txt")
originaltext1 = file1.read()
wordlist1={}
import string
text1 = [x.strip(string.punctuation) for x in originaltext1.split()]
text1 = [x.lower() for x in text1]
for word1 in text1:
if word1 not in wordlist1:
wordlist1[word1] = 1
else:
wordlist1[word1] += 1
for k,v in sorted(wordlist1.items()):
#print "%s %s" % (k, v)
col1 = ("%s %s" % (k, v))
print col1
file2 = open("2stein.txt")
originaltext2 = file2.read()
wordlist2={}
import string
text2 = [x.strip(string.punctuation) for x in originaltext2.split()]
text2 = [x.lower() for x in text2]
for word2 in text2:
if word2 not in wordlist2:
wordlist2[word2] = 1
else:
wordlist2[word2] += 1
for k,v in sorted(wordlist2.items()):
#print "%s %s" % (k, v)
col2 = ("%s %s" % (k, v))
print col2
what I hope still to edit and output is something like this:
using the dictionaries' key and value system (applied to col1 and col2): {apple 3, bridge 7, chair 5} - {apple 1, bridge 9, chair 5} = {apple 2, bridge -2, chair 5}?
You want to output:
words that appear in text 1 but not in text 2
words that appear in
text 2 but not in text 1
Interesting. A set difference is what you need.
import re
s1 = open("FRANKENST18.txt", "r").read()
s1 = open("FRANKENST31.txt", "r").read()
words_s1 = re.findall("[A-Za-z]",s1)
words_s2 = re.findall("[A-Za-z]",s2)
set_s1 = set(words_s1)
set_s2 = set(words_s2)
words_in_s1_but_not_in_s2 = set_s1 - set_s2
words_in_s2_but_not_in_s1 = set_s2 - set_s1
words_in_s1 = '\n'.join(words_in_s1_but_not_in_s2)
words_in_s2 = '\n'.join(words_in_s2_but_not_in_s1)
with open("s1_output","w") as s1_output:
s1_output.write(words_in_s1)
with open("s2_output","w") as s2_output:
s2_output.write(words_in_s2)
Let me know if this isn't exactly what you're looking for, but it seems like you want to iterate through lines of a file, which you can do very easily in python. Here's an example, where I omit the newline character at the end of each line, and add the lines to a list:
f = open("filename.txt", 'r')
lines = []
for line in f:
lines.append(f[:-1])
Hope this helps!
I'm not completely sure if you're trying to compare the differences in words as they occur or lines as they occur, however one way you could do this is by using a dictionary. If you want to see which lines change you could split the lines on periods by doing something like:
text = 'this is a sentence. this is another sentence.'
sentences = text.split('.')
This will split the string you have (which contains the entire text I assume) on the periods and will return an array (or list) of all the sentences.
You can then create a dictionary with dict = {}, loop over each sentence in the previously created array, make it a key in the dictionary with a corresponding value (could be anything since most sentences probably don't occur more than once). After doing this for the first version you can go through the second version and check which sentences are the same. Here is some code that will give you a start (assuming version1 contains all the sentences from the first version):
for sentence in version1:
dict[sentence] = 1 #put a counter for e
You can then loop over the second version and check if the same sentence is found in the first, with something like:
for sentence in version2:
if sentence in dict: #if the sentence is in the dictionary
pass
#or do whatever you want here
else: #if the sentence isn't
print(sentence)
Again not sure if this is what you're looking for but hope it helps

Python 3 - How to remove empty paragraphs on specific lines only - pythondocx

from docx import Document
alphaDic = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','!','?','.','~',',','(',')','$','-',':',';',"'",'/']
doc = Document('realexample.docx')
docIndex = 0
def delete_paragraph(paragraph):
p = paragraph._element
p.getparent().remove(p)
p._p = p._element = None
while docIndex < len(doc.paragraphs):
firstSen = doc.paragraphs[docIndex].text
rep_dic = {ord(k):None for k in alphaDic + [x.upper() for x in alphaDic]}
translation = (firstSen.translate(rep_dic))
removeExcessSpaces = " ".join(translation.split())
if removeExcessSpaces != '':
doc.paragraphs[docIndex].text = removeExcessSpaces
else:
delete_paragraph(doc.paragraphs[docIndex])
docIndex -=1 # go one step back in the loop because of the deleted index
docIndex +=1
So the test document looks like this
Hello
你好
Good afternoon
朋友们
Good evening
晚上好
And I'm trying to achieve this result below.
你好
朋友们
晚上好
Right now the code removes all empty paragraphs and excessive spaces and does this, so I'm kinda stuck here. I only want to erase the line breaks that were caused from the English words.
你好
朋友们
晚上好
what you can do is looking for english words, once you find the english word "WORD", append it with "\n" and then remove this new result "WORD\n" from the document. The way you append strings in python is with + sign. Just do "WORD" + "\n"

Trying to solve old GoogleCodeJam for practice

I'm working on some of the old Google Code Jam problems as practice in python since we don't use this language at my school. Here is the current problem I am working on that is supposed to just reverse the order of a string by word.
Here is my code:
import sys
f = open("B-small-practice.in", 'r')
T = int(f.readline().strip())
array = []
for i in range(T):
array.append(f.readline().strip().split(' '))
for ar in array:
if len(ar) > 1:
count = len(ar) - 1
while count > -1:
print ar[count]
count -= 1
The problem is that instead of printing:
test a is this
My code prints:
test
a
is
this
Please let me know how to format my loop so that it prints all in one line. Also, I've had a bit of a learning curve when it comes to reading input from a .txt file and manipulating it, so if you have any advice on different methods to do so for such problems, it would be much appreciated!
print by default appends a newline. If you don't want that behavior, tack a comma on the end.
while count > -1:
print ar[count],
count -= 1
Note that a far easier method than yours to reverse a list is just to specify a step of -1 to a slice. (and join it into a string)
print ' '.join(array[::-1]) #this replaces your entire "for ar in array" loop
There are multiple ways you can alter your output format. Here are a few:
One way is by adding a comma at the end of your print statement:
print ar[count],
This makes print output a space at the end instead of a newline.
Another way is by building up a string, then printing it after the loop:
reversed_words = ''
for ar in array:
if len(ar) > 1:
count = len(ar) - 1
while count > -1:
reversed_words += ar[count] + ' '
count -= 1
print reversed_words
A third way is by rethinking your approach, and using some useful built-in Python functions and features to simplify your solution:
for ar in array:
print ' '.join(ar[::-1])
(See str.join and slicing).
Finally, you asked about reading from a file. You're already doing that just fine, but you forgot to close the file descriptor after you were done reading, using f.close(). Even better, with Python 2.5+, you can use the with keyword:
array = list()
with open(filename) as f:
T = int(f.readline().strip())
for i in range(T):
array.append(f.readline().strip().split(' '))
# The rest of your code here
This will close the file descriptor f automatically after the with block, even if something goes wrong while reading the file.
T = int(raw_input())
for t in range(T):
print "Case #%d:" % (t+1),
print ' '.join([w for w in raw_input().split(' ')][::-1])

python, string.replace() and \n

(Edit: the script seems to work for others here trying to help. Is it because I'm running python 2.7? I'm really at a loss...)
I have a raw text file of a book I am trying to tag with pages.
Say the text file is:
some words on this line,
1
DOCUMENT TITLE some more words here too.
2
DOCUMENT TITLE and finally still more words.
I am trying to use python to modify the example text to read:
some words on this line,
</pg>
<pg n=2>some more words here too,
</pg>
<pg n=3>and finally still more words.
My strategy is to load the text file as a string. Build search-for and a replace-with strings corresponding to a list of numbers. Replace all instances in string, and write to a new file.
Here is the code I've written:
from sys import argv
script, input, output = argv
textin = open(input,'r')
bookstring = textin.read()
textin.close()
pages = []
x = 1
while x<400:
pages.append(x)
x = x + 1
pagedel = "DOCUMENT TITLE"
for i in pages:
pgdel = "%d\n%s" % (i, pagedel)
nplus = i + 1
htmlpg = "</p>\n<p n=%d>" % nplus
bookstring = bookstring.replace(pgdel, htmlpg)
textout = open(output, 'w')
textout.write(bookstring)
textout.close()
print "Updates to %s printed to %s" % (input, output)
The script runs without error, but it also makes no changes whatsoever to the input text. It simply reprints it character for character.
Does my mistake have to do with the hard return? \n? Any help greatly appreciated.
In python, strings are immutable, and thus replace returns the replaced output instead of replacing the string in place.
You must do:
bookstring = bookstring.replace(pgdel, htmlpg)
You've also forgot to call the function close(). See how you have textin.close? You have to call it with parentheses, like open:
textin.close()
Your code works for me, but I might just add some more tips:
Input is a built-in function, so perhaps try renaming that. Although it works normally, it might not for you.
When running the script, don't forget to put the .txt ending:
$ python myscript.py file1.txt file2.txt
Make sure when testing your script to clear the contents of file2.
I hope these help!
Here's an entirely different approach that uses re(import the re module for this to work):
doctitle = False
newstr = ''
page = 1
for line in bookstring.splitlines():
res = re.match('^\\d+', line)
if doctitle:
newstr += '<pg n=' + str(page) + '>' + re.sub('^DOCUMENT TITLE ', '', line)
doctitle = False
elif res:
doctitle = True
page += 1
newstr += '\n</pg>\n'
else:
newstr += line
print newstr
Since no one knows what's going on, it's worth a try.

Counting words in a dictionary (Python)

I have this code, which I want to open a specified file, and then every time there is a while loop it will count it, finally outputting the total number of while loops in a specific file. I decided to convert the input file to a dictionary, and then create a for loop that every time the word while followed by a space was seen it would add a +1 count to WHILE_ before finally printing WHILE_ at the end.
However this did not seem to work, and I am at a loss as to why. Any help fixing this would be much appreciated.
This is the code I have at the moment:
WHILE_ = 0
INPUT_ = input("Enter file or directory: ")
OPEN_ = open(INPUT_)
READLINES_ = OPEN_.readlines()
STRING_ = (str(READLINES_))
STRIP_ = STRING_.strip()
input_str1 = STRIP_.lower()
dic = dict()
for w in input_str1.split():
if w in dic.keys():
dic[w] = dic[w]+1
else:
dic[w] = 1
DICT_ = (dic)
for LINE_ in DICT_:
if ("while\\n',") in LINE_:
WHILE_ += 1
elif ('while\\n",') in LINE_:
WHILE_ += 1
elif ('while ') in LINE_:
WHILE_ += 1
print ("while_loops {0:>12}".format((WHILE_)))
This is the input file I was working from:
'''A trivial test of metrics
Author: Angus McGurkinshaw
Date: May 7 2013
'''
def silly_function(blah):
'''A silly docstring for a silly function'''
def nested():
pass
print('Hello world', blah + 36 * 14)
tot = 0 # This isn't a for statement
for i in range(10):
tot = tot + i
if_im_done = false # Nor is this an if
print(tot)
blah = 3
while blah > 0:
silly_function(blah)
blah -= 1
while True:
if blah < 1000:
break
The output should be 2, but my code at the moment prints 0
This is an incredibly bizarre design. You're calling readlines to get a list of strings, then calling str on that list, which will join the whole thing up into one big string with the quoted repr of each line joined by commas and surrounded by square brackets, then splitting the result on spaces. I have no idea why you'd ever do such a thing.
Your bizarre variable names, extra useless lines of code like DICT_ = (dic), etc. only serve to obfuscate things further.
But I can explain why it doesn't work. Try printing out DICT_ after you do all that silliness, and you'll see that the only keys that include while are while and 'while. Since neither of these match any of the patterns you're looking for, your count ends up as 0.
It's also worth noting that you only add 1 to WHILE_ even if there are multiple instances of the pattern, so your whole dict of counts is useless.
This will be a lot easier if you don't obfuscate your strings, try to recover them, and then try to match the incorrectly-recovered versions. Just do it directly.
While I'm at it, I'm also going to fix some other problems so that your code is readable, and simpler, and doesn't leak files, and so on. Here's a complete implementation of the logic you were trying to hack up by hand:
import collections
filename = input("Enter file: ")
counts = collections.Counter()
with open(filename) as f:
for line in f:
counts.update(line.strip().lower().split())
print('while_loops {0:>12}'.format(counts['while']))
When you run this on your sample input, you correctly get 2. And extending it to handle if and for is trivial and obvious.
However, note that there's a serious problem in your logic: Anything that looks like a keyword but is in the middle of a comment or string will still get picked up. Without writing some kind of code to strip out comments and strings, there's no way around that. Which means you're going to overcount if and for by 1. The obvious way of stripping—line.partition('#')[0] and similarly for quotes—won't work. First, it's perfectly valid to have a string before an if keyword, as in "foo" if x else "bar". Second, you can't handle multiline strings this way.
These problems, and others like them, are why you almost certainly want a real parser. If you're just trying to parse Python code, the ast module in the standard library is the obvious way to do this. If you want to be write quick&dirty parsers for a variety of different languages, try pyparsing, which is very nice, and comes with some great examples.
Here's a simple example:
import ast
filename = input("Enter file: ")
with open(filename) as f:
tree = ast.parse(f.read())
while_loops = sum(1 for node in ast.walk(tree) if isinstance(node, ast.While))
print('while_loops {0:>12}'.format(while_loops))
Or, more flexibly:
import ast
import collections
filename = input("Enter file: ")
with open(filename) as f:
tree = ast.parse(f.read())
counts = collections.Counter(type(node).__name__ for node in ast.walk(tree))
print('while_loops {0:>12}'.format(counts['While']))
print('for_loops {0:>14}'.format(counts['For']))
print('if_statements {0:>10}'.format(counts['If']))

Categories

Resources