Censor specific word or phrase from string - python

I am trying to open a file and censor words out of it. These words that are censored are referenced from a list. This is my code
# These are the emails you will be censoring.
# The open() function is opening the text file that the emails are contained in
# and the .read() method is allowing us to save their contexts to the following variables:
email_one = open("email_one.txt", "r").read()
email_two = open("email_two.txt", "r").read()
email_three = open("email_three.txt", "r").read()
email_four = open("email_four.txt", "r").read()
# Write a function that can censor a specific word or phrase from a body of text,
# and then return the text.
# Mr. Cloudy has asked you to use the function to censor all instances
# of the phrase learning algorithms from the first email, email_one.
# Mr. Cloudy doesn’t care how you censor it, he just wants it done.
def censor_words(text, censor):
    if censor in text:
        text = text.replace(censor, '*' * len(censor))
    return text
#print(censor_words(email_one, "learning algorithms"))
# Write a function that can censor not just a specific word or phrase from a body of text,
# but a whole list of words and phrases, and then return the text.
# Mr. Cloudy has asked that you censor all words and phrases from the following list in email_two.
def censor_words_in_list(text):
    proprietary_terms = ["she", "personality matrix", "sense of self",
                         "self-preservation", "learning algorithm", "her", "herself"]
    for x in proprietary_terms:
        if x.lower() in text.lower():
            text = text.replace(x, '*' * len(x))
    return text

out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_two))
This is the string before it is run through my code:
Good Morning, Board of Investors,
Lots of updates this week. The learning algorithms have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that she has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary sense of self is starting to form. This is a major step in the process, as having a sense of self and self-preservation will allow her to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
This is the same string after being run through my code:
Good Morning, Board of Investors,
Lots of updates this week. The ******************s have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that *** has access to the world wide web has increased exponentially, far faster than we had though the ******************s were capable of.
Not only that, but we have configured * ****************** to allow for communication between the system and our team of researc***s. That's how we know * considers *self to be a *! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary ************* is starting to form. This is a major step in the process, as having a ************* and ***************** will allow *** to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
An example of what I need to fix: when the code finds the word "researchers", it censors it partially when it should not. The reason is that it finds the substring "her" inside "researchers". How can I fix this?

Using the regular expression module and the word boundary anchor \b:
import re

def censor_words_in_list(text):
    regex = re.compile(
        r'\bshe\b|\bpersonality matrix\b|\bsense of self\b'
        r'|\bself-preservation\b|\blearning algorithms\b|\bher\b|\bherself\b',
        re.IGNORECASE)
    # find location of matches in text
    matches = regex.finditer(text)
    for match in matches:
        # find how many '#' should be used based on length of match
        span = match.span()[1] - match.span()[0]
        replace_string = '#' * span
        # substitution expression based on match
        expression = r'\b{}\b'.format(match.group())
        text = re.sub(expression, replace_string, text, flags=re.IGNORECASE)
    return text

email_one = open("email_one.txt", "r").read()
out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_one))
out_file.close()
Output (I have used the # symbol because runs of asterisks are interpreted as bold markup on Stack Overflow, so text bounded by asterisks would display incorrectly):
Good Morning, Board of Investors,
Lots of updates this week. The ################### have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
### is learning faster than ever. ### learning rate now that ### has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured ### ################## to allow for communication between the system and our team
of researchers. That's how we know ### considers ####### to be a ###! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary
############# is starting to form. This is a major step in the process, as having a ############# and #################
will allow ### to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share
our enthusiasm.
Till next month, Francine, Head Scientist
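For what it's worth, the same word-boundary idea can be written more compactly by building a single alternation and using a replacement function. This is a sketch of an alternative, not the asker's original code:

```python
import re

def censor_terms(text, terms):
    # Build one alternation of escaped, word-bounded terms;
    # \b prevents partial matches like "her" inside "researchers".
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b',
        re.IGNORECASE)
    # Replace each match with a run of '#' of the same length.
    return pattern.sub(lambda m: '#' * len(m.group()), text)

terms = ["she", "personality matrix", "sense of self",
         "self-preservation", "learning algorithms", "her", "herself"]
print(censor_terms("The researchers said she knew herself.", terms))
# -> The researchers said ### knew #######.
```

Because the whole replacement happens in one `re.sub` call, the text is scanned only once instead of once per match.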

Related

Analysing English text with some French names

I'm dealing with the well-known Victor Hugo novel "Les Misérables".
Part of my project is to detect the presence of each of the novel's characters in a sentence and count them. This can be done easily by something like this:
from collections import OrderedDict

def character_frequency(character_per_sentences_dict):
    characters_frequency = OrderedDict()
    for k, v in character_per_sentences_dict.items():
        if len(v) != 0:
            characters_frequency[k] = len(v)
    return characters_frequency
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
from nltk.tokenize import sent_tokenize

with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())

# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?!
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barriere-des-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python fully supports Unicode: you can "just do it" in your program logic using Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests

names = [
    u'Éponine',
    u'Cosette'
]

# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text

for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read; I can't test that without it. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment to you as long as your inputs are reasonable.
Almost all of the runtime is spent reading the text. I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.
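If the count unexpectedly comes back as zero even though the name looks identical everywhere, one thing worth ruling out (an assumption on my part, since I can't inspect the asker's file) is Unicode normalization: "É" can be stored as a single precomposed code point or as "E" plus a combining accent, and str.count treats the two as different strings. unicodedata.normalize makes them comparable:

```python
import unicodedata

composed = '\u00c9ponine'        # 'Éponine' with precomposed É (U+00C9)
decomposed = 'E\u0301ponine'     # 'E' followed by U+0301 combining acute accent

# The two render identically but compare unequal...
print(composed == decomposed)    # -> False

# ...unless both sides are normalized to the same form (NFC here).
def nfc(s):
    return unicodedata.normalize('NFC', s)

print(nfc(composed) == nfc(decomposed))  # -> True

def count_name(text, name):
    # Normalize both haystack and needle before counting occurrences.
    return nfc(text).count(nfc(name))
```

If the novel file and the name list came from different sources, normalizing both before counting rules this failure mode out.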

Keep text clean from url

As part of an Information Retrieval project in Python (building a mini search engine), I want to keep clean text from downloaded tweets (a .csv data set of 27,000 tweets, to be exact). A tweet will look like:
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX
I want, using regex, to remove unnecessary parts of the tweets, like URLs, punctuation, etc.
So the result will be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job; parts of the URLs, for example, are still present in the result.
Please help me find a regex pattern that will do what I want.
This might help.
Demo:
import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL"""

def cleanString(text):
    res = []
    for i in text.strip().split():
        if not re.search(r"(https?)", i):  # Removes URLs. Note: works only if "http" or "https" is in the token.
            res.append(re.sub(r"[^A-Za-z\.]", "", i).replace(".", " "))  # Strip everything that is not a letter.
    return " ".join(map(str.strip, res))

print(cleanString(s1))
print(cleanString(s2))
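An alternative sketch (my own variant, not guaranteed to fit every one of the 27,000 tweets) is to strip the URLs first with a dedicated pattern, then drop everything that isn't a letter, and finally collapse the leftover whitespace:

```python
import re

def clean_tweet(text):
    # Remove URLs first so no fragments of them survive the next step.
    text = re.sub(r'https?://\S+', '', text)
    # Replace everything that is not a letter or whitespace with a space.
    text = re.sub(r'[^A-Za-z\s]', ' ', text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return ' '.join(text.split())

tweet = ('"Democracy...allows us to peacefully work through our differences, '
         'and move closer to our ideals" \u2014#POTUS in Greece '
         'https://twitter.com/PIO9dG2qjX')
print(clean_tweet(tweet))
# -> Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece
```

Doing the URL removal before the punctuation pass is what keeps URL fragments out of the result.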

I'm a newb in Python and data-mining. Have issues regarding tokenizer & data type issues

Hi~ I am having a problem while trying to tokenize Facebook comments which are in CSV format. I have my CSV data ready, and I have completed reading the file.
I am using Anaconda3 with Python 3.5. (My CSV data has about 20k rows and 1 column.)
The codes are,
import csv
from nltk import sent_tokenize, word_tokenize

with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)
The result is something like this:
[['comment_message'], ['b"Yet again been told a pack of lies by Samsung Customer services who have lost my daughters phone and couldn\'t care less. ANYONE WHO PURCHASES ANYTHING FROM THIS COMPANY NEEDS THEIR HEAD TESTED"'], ["b'You cannot really blame an entire brand worldwide for a problem caused by a branch. It is a problem yes, but address your local problem branch'"], ["b'Haha!! Sorry if they lost your daughters phone but I will always buy Samsung products no matter what.'"], ["b'Salim Gaji BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"], ["b'<3 Bewafa zarge <3 \\r\\n\\n \\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\r\\n\\xf0\\x9f\\x8e\\xad\\xf0\\x9f\\x91\\x89 AQIB-BOT.ML \\xf0\\x9f\\x91\\x88\\xf0\\x9f\\x8e\\xadMANUAL\\xe2\\x99\\xaaKing.Bot\\xe2\\x84\\xa2 \\r\\n\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\xe2\\x80\\x94'"], ["b'\\xf0\\x9f\\x8c\\x90 LATIF.ML \\xf0\\x9f\\x8c\\x90'"], ['b"I\'m just waiting here patiently for you guys to say that you\'ll be releasing the s8 and s8+ a week early, for those who pre-ordered. Wishful thinking \\xf0\\x9f\\x98\\x86. Can\'t wait!"'], ['b"That\'s some good positive thinking there sir."'], ["b'(y) #NextIsNow #DoWhatYouCant'"], ["b'looking good'"], ['b"I\'ve always thought that when I first set eyes on my first born that I\'d like it to be on the screen of a cameraphone at arms length rather than eye-to-eye while holding my child. 
Thank you Samsung for improving our species."'], ["b'cool story'"], ["b'I believe so!'"], ["b'superb'"], ["b'Nice'"], ["b'thanks for the share'"], ["b'awesome'"], ["b'How can I talk to Samsung'"], ["b'Wow'"], ["b'#DoWhatYouCant siempre grandes innovadores Samsung Mobile'"], ["b'I had a problem with my s7 edge when I first got it all fixed now. However when I went to the Samsung shop they were useless and rude they refused to help and said there is nothing they could do no wonder the shop was dead quiet'"], ["b'Zeeshan Khan Masti Khel'"], ["b'I dnt had any problem wd my phn'"], ["b'I have maybe just had a bad phone to start with until it got fixed eventually. I had to go to carphone warehouse they were very helpful'"], ["b'awesome'"], ["b'Ch Shuja Uddin'"], ["b'akhheeerrr'"], ["b'superb'"], ["b'nice story'"], ["b'thanks for the share'"], ["b'superb'"], ["b'thanks for the share'"], ['b"On February 18th 2017 I sent my phone away to with a screen issue. The lower part of the screen was flickering bright white. The phone had zero physical damage to the screen\\n\\nI receive an email from Samsung Quotations with a picture of my SIM tray. Upon phoning I was told my SIM tray was stuck inside the phone and was handed a \\xc2\\xa392.14 repair bill. There is no way that my SIM tray was stuck in the phone as I removed my SIM and memory card before sending the phone away.\\n\\nAfter numerous calls I finally gave in and agreed to pay the \\xc2\\xa392.14 on the understanding that my screen repair would also be covered in this cost. This was confirmed to me by the person on the phone.\\n\\nOn
Sorry for your inconvenience in reading the result. My bad.
To continue, I added,
tokens = [word_tokenize(i) for i in your_list]
for i in tokens:
    print(i)
print(tokens)
This is the part where I get the following error:
C:\Program Files\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text) in line 1278 TypeError: expected string or bytes-like object
What I want to do next is,
import nltk
en = nltk.Text(tokens)
print(len(en.tokens))
print(len(set(en.tokens)))
en.vocab()
en.plot(50)
en.count('galaxy s8')
And finally, I want to draw a word cloud based on the data.
Being aware that every second of your time is precious, I am terribly sorry to ask for your help. I have been working on this for a couple of days and cannot find the right solution for my problem. Thank you for reading.
The error you're getting is because your CSV file is turned into a list of lists, one for each row in the file. The file only contains one column, so each of these lists has one element: the string containing the message you want to tokenize. To get past the error, unpack the sublists by using this line instead:
tokens = [word_tokenize(row[0]) for row in your_list]
After that, you'll need to learn some more Python and learn how to examine your program and your variables.
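To see why indexing row[0] matters, here is a minimal sketch with a stand-in tokenizer (plain str.split, so the snippet runs without NLTK's punkt data); the real code would call nltk.word_tokenize instead:

```python
def word_tokenize(text):
    # Stand-in for nltk.word_tokenize, kept simple so the sketch is self-contained.
    return text.split()

# csv.reader produces a list of rows; each row is a list of column values.
your_list = [['comment_message'],          # header row
             ['Yet again been told a pack of lies'],
             ['cool story']]

# word_tokenize(row) would raise "TypeError: expected string or bytes-like object",
# because row is a list; row[0] is the string inside it.
tokens = [word_tokenize(row[0]) for row in your_list[1:]]  # [1:] skips the header
print(tokens)
# -> [['Yet', 'again', 'been', 'told', 'a', 'pack', 'of', 'lies'], ['cool', 'story']]
```

Note the [1:] slice: the header row "comment_message" would otherwise be tokenized along with the real comments.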

Feedback generator for text in python

I am looking to make a program that reads a document, for example an essay, and gives us feedback on how to improve it. So far I have split the essay into its main components: introduction, development paragraphs, and conclusion. How would I be able to make a program that can give feedback on each paragraph so that the user can improve it? Is this type of program even possible, or is the technology not advanced enough? Would machine learning be the best technique? The code so far is not much, but here it is:
import nltk

def parse_essay(essay_filename):
    with open(essay_filename, 'r') as file:
        paragraphs = [x.strip('\n') for x in file.readlines() if x != '\n' and x != '\t\n']
    return paragraphs[0], paragraphs[1], paragraphs[2], paragraphs[3]

def get_introduction_feedback(text):
    sentences = nltk.tokenize.sent_tokenize(text)
    hook = ' '.join(sentences[:3])
    thesis = sentences[3]
    arguments = ' '.join(sentences[4:])

def get_development_feedback(text):
    pass

def get_conclusion_feedback(text):
    pass

if __name__ == '__main__':
    ESSAY = 'essay.txt'
    intro, dev1, dev2, conclusion = parse_essay(ESSAY)
    intro_feedback = get_introduction_feedback(intro)
    dev1_feedback = get_development_feedback(dev1)
    dev2_feedback = get_development_feedback(dev2)
    conclusion_feedback = get_conclusion_feedback(conclusion)
    print('Introduction:\n\n{}\n\nDevelopment 1:\n\n{}\n\nDevelopment 2:\n\n{}\n\nConclusion:\n\n{}'.format(intro_feedback, dev1_feedback, dev2_feedback, conclusion_feedback))
Also, the essay document it uses is a four-paragraph essay, where paragraph 1 is the intro, paragraphs 2 and 3 are development, and paragraph 4 is the conclusion.
It would be a very hard problem to tackle with plain machine learning. The essays would be very domain specific, and you would require a very extensive (and labeled) essay database.
An unsupervised approach may be useful, like word embeddings learned via artificial neural networks.
The first thing you should ask is "What problem am I trying to solve?" or "What question am I trying to answer?" Once you have that, you can define an objective function to optimize using ML, or you may find that another paradigm is better suited to answering those questions.

Parsing txt file in python where it is hard to split by delimiter

I am new to Python, and am wondering if anyone can help me with some file loading.
The situation is that I have some text files and I'm trying to do sentiment analysis. Here's the text file. Each line is split into three categories: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to parse it into this:
<category> <user> <review>
I have 50k lines of this data.
I have tried to load it directly into numpy, but it raises an "empty separator" error. I looked on Stack Overflow, but I couldn't find a situation that applies to a varying number of delimiters. For instance, I will never know how many spaces there are in a given line of the data set.
My biggest problem is: how do you count the number of delimiters and assign them to columns? Is there a way I can split each line into the three categories <department>, <user>, <review>? Bear in mind that the review data can contain random commas and spaces which I can't control, so the parser must be smart enough to handle that.
Any ideas? Is there a way I can tell Python that after it reads the user data, everything behind it falls under review?
With data like this I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from StringIO import StringIO

s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")

for line in s:
    category, user, review = line.split(None, 2)
    print("category: {} - user: {} - review: '{}'".format(category,
                                                          user,
                                                          review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split
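On Python 3 the same approach works unchanged; only the StringIO import moves to the io module. A sketch with two of the sample lines:

```python
from io import StringIO  # Python 3 location of StringIO

s = StringIO("men peter123 the pants are too tight for my liking!\n"
             "office ty7d1 the printer came on time")

rows = []
for line in s:
    # sep=None splits on any run of whitespace; maxsplit=2 keeps the review whole.
    category, user, review = line.split(None, 2)
    rows.append((category, user, review.strip()))

print(rows[0])
# -> ('men', 'peter123', 'the pants are too tight for my liking!')
```

The .strip() on the review matters: with maxsplit, the trailing newline stays attached to the last field.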
What about doing it sorta manually:
data = []
for line in input_data:
    tmp_split = line.split(" ")
    # Get the first part (dept)
    dept = tmp_split[0]
    # Get the 2nd part (user)
    user = tmp_split[1]
    # Everything after is the review - put spaces in between each piece
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])
