Analysing English text with some French names - Python

I'm dealing with Victor Hugo's well-known novel "Les Misérables".
Part of my project is to detect each of the novel's characters in every sentence and count their occurrences. This can be done easily with something like this:
from collections import OrderedDict

def character_frequency(character_per_sentences_dict):
    characters_frequency = OrderedDict()
    for k, v in character_per_sentences_dict.items():
        if len(v) != 0:
            characters_frequency[k] = len(v)
    return characters_frequency
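For context, here is a hypothetical input and the result (the names and sentence indices below are made up for illustration):

per_sentence = {"Cosette": [0, 3, 7], "Javert": [2], "Èponine": []}
print(character_frequency(per_sentence))
# OrderedDict([('Cosette', 3), ('Javert', 1)])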
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
from nltk.tokenize import sent_tokenize

with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())

# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barriere-des-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."

I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. That said, I'm not sure what your problem is here. Python fully supports Unicode: you can "just do it" in terms of using Unicode in your program logic with Python's standard string ops and libraries.
For example, this works:
#!/usr/bin/env python
import requests

names = [
    u'Éponine',
    u'Cosette',
]

# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text

for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I can't tell whether the problem is in how it is encoded or in how it is being read; I can't test that without it. In this code, I fetch the source text off the internet. My point is just that non-ASCII characters should not pose any impediment as long as your inputs are reasonable.
Virtually all of the runtime is spent fetching the text; even if you added dozens of names, the counting wouldn't add a noticeable delay on any decent computer. So this method works just fine.
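One thing worth ruling out locally (an assumption on my part, since I can't inspect your file) is an encoding or Unicode-normalization mismatch: if the file is not really latin1, or if "Èponine" is stored in decomposed form (E plus a combining grave accent), it won't compare equal to the precomposed literal in your source. A minimal sketch, with the file name made up:

import unicodedata

# Hypothetical file name; try utf-8 if latin1 was only a guess.
with open('volume-v1.txt', encoding='utf-8') as fp:
    novel = unicodedata.normalize('NFC', fp.read())

# Normalize the needle the same way before counting.
print(novel.count(unicodedata.normalize('NFC', 'Èponine')))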


Gtts stutters after 2 sentences

I'm trying to make two gTTS voices, Sarah and Mary, talk to each other, reading a standard script. After two sentences using the same voice, they start repeating the last word until the duration of the sentence is over.
from moviepy.editor import *
from gtts import gTTS

dialog = ("Mary: Mom, can we get a dog?"
          "\nSarah: What kind of dog do you want?"
          "\nMary: I’m not sure. I’ve been researching different breeds and I think I like corgis."
          "\nSarah: Corgis? That’s a pretty popular breed. What do you like about them?"
          "\nMary: Well, they’re small, so they won’t take up too much room. They’re also very loyal and friendly. Plus, they’re really cute!"
          "\nSarah: That’s true. They do seem like a great breed. Have you done any research on their care and grooming needs?"
          "\nMary: Yes, I have. They don’t need a lot of grooming, but they do need regular brushing and occasional baths. They’re also very active, so they need plenty of exercise."
          "\nSarah: That sounds like a lot of work. Are you sure you’re up for it?"
          "\nMary: Yes, I am. I’m willing to put in the effort to take care of a corgi."
          "\nSarah: Alright, if you’re sure. Let’s look into getting a corgi then."
          "\nMary: Yay! Thank you, Mom!")
lines = dialog.split("\n")
combined = AudioFileClip(r"Z:\Programming Stuff\Music\Type_Beat__BPM105.wav").set_duration(10)  # add intro music
for line in lines:
    if "Sarah:" in line:
        # Use a voice for Person 1
        res = line.split(' ', 1)[1]  # Removes the speaker's name
        tts = gTTS(text=str(res), lang='en')  # Accent changer
        tts.save("temp6.mp3")  # Temp save file, because the audio clips must be mixed
        combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
    elif "Mary:" in line:
        # Use a voice for Person 2
        res = line.split(' ', 1)[1]  # Removes the speaker's name
        tts = gTTS(text=str(res), lang='en', tld='com.au')  # Accent changer
        tts.save("temp6.mp3")  # Temp save file, because the audio clips must be mixed
        combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
combined.write_audiofile("output3.mp3")  # Final file name
OUTPUT:
It's an audio file that is almost exactly the intended output, except that after "They're also very loyal and friendly." it keeps repeating "plus". It also repeats at "Yes, I have. They don't need a lot of grooming, but they do need regular brushing and occasional baths.", where it repeats "baths" many times.
It appears it just starts repeating after saying two sentences, and I have no idea why.
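One possible culprit (a guess on my part, not a confirmed diagnosis): every clip is loaded from the same temp6.mp3, and moviepy's AudioFileClip reads its file lazily rather than at construction time, so by the time write_audiofile runs, earlier clips may be reading from a file that has since been overwritten by a shorter recording, which can produce exactly this kind of repeated-word padding. A minimal sketch that gives each line its own temp file (reusing lines from the code above):

from gtts import gTTS
from moviepy.editor import AudioFileClip, concatenate_audioclips

clips = [AudioFileClip(r"Z:\Programming Stuff\Music\Type_Beat__BPM105.wav").set_duration(10)]  # intro music
for i, line in enumerate(lines):
    speaker = line.split(' ', 1)[0]
    tld = {'Sarah:': 'com', 'Mary:': 'com.au'}.get(speaker)  # 'com' is gTTS's default accent
    if tld is None:
        continue
    text = line.split(' ', 1)[1]
    path = "temp_{}.mp3".format(i)  # a unique file per line, never overwritten
    gTTS(text=text, lang='en', tld=tld).save(path)
    clips.append(AudioFileClip(path))

combined = concatenate_audioclips(clips)
combined.write_audiofile("output3.mp3")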

Censor specific word or phrase from string

I am trying to open a file and censor words out of it. The words to be censored are referenced from a list. This is my code:
# These are the emails you will be censoring.
# The open() function is opening the text file that the emails are contained in
# and the .read() method is allowing us to save their contexts to the following variables:
email_one = open("email_one.txt", "r").read()
email_two = open("email_two.txt", "r").read()
email_three = open("email_three.txt", "r").read()
email_four = open("email_four.txt", "r").read()
# Write a function that can censor a specific word or phrase from a body of text,
# and then return the text.
# Mr. Cloudy has asked you to use the function to censor all instances
# of the phrase learning algorithms from the first email, email_one.
# Mr. Cloudy doesn’t care how you censor it, he just wants it done.
def censor_words(text, censor):
    if censor in text:
        text = text.replace(censor, '*' * len(censor))
    return text
#print(censor_words(email_one, "learning algorithms"))
# Write a function that can censor not just a specific word or phrase from a body of text,
# but a whole list of words and phrases, and then return the text.
# Mr. Cloudy has asked that you censor all words and phrases from the following list in email_two.
def censor_words_in_list(text):
    proprietary_terms = ["she", "personality matrix", "sense of self",
                         "self-preservation", "learning algorithm", "her", "herself"]
    for x in proprietary_terms:
        if x.lower() in text.lower():
            text = text.replace(x, '*' * len(x))
    return text

out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_two))
This is the string before it is run through my code.
Good Morning, Board of Investors,
Lots of updates this week. The learning algorithms have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that she has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary sense of self is starting to form. This is a major step in the process, as having a sense of self and self-preservation will allow her to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
This is the same string after being run through my code.
Good Morning, Board of Investors,
Lots of updates this week. The ******************s have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that *** has access to the world wide web has increased exponentially, far faster than we had though the ******************s were capable of.
Not only that, but we have configured * ****************** to allow for communication between the system and our team of researc***s. That's how we know * considers *self to be a *! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary ************* is starting to form. This is a major step in the process, as having a ************* and ***************** will allow *** to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
An example of what I need to fix: when the function finds the word "researchers", it censors part of it when it should not, because it finds the substring "her" inside "researchers". How can I fix this?
Using the regular expression module and the word boundary anchor \b:
import re

def censor_words_in_list(text):
    regex = re.compile(
        r'\bshe\b|\bpersonality matrix\b|\bsense of self\b'
        r'|\bself-preservation\b|\blearning algorithms\b|\bher\b|\bherself\b',
        re.IGNORECASE)
    matches = regex.finditer(text)
    # find the location of each match in the text
    for match in matches:
        # find how many # should be used, based on the length of the match
        span = match.span()[1] - match.span()[0]
        replace_string = '#' * span
        # substitution expression based on the match
        expression = r'\b{}\b'.format(match.group())
        text = re.sub(expression, replace_string, text, flags=re.IGNORECASE)
    return text

email_one = open("email_one.txt", "r").read()
out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_one))
out_file.close()
Output (I have used the # symbol because ** is used to create bold text, so the answer displays incorrectly on Stack Overflow for text bounded by asterisks):
Good Morning, Board of Investors,
Lots of updates this week. The ################### have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
### is learning faster than ever. ### learning rate now that ### has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured ### ################## to allow for communication between the system and our team
of researchers. That's how we know ### considers ####### to be a ###! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary
############# is starting to form. This is a major step in the process, as having a ############# and #################
will allow ### to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share
our enthusiasm.
Till next month, Francine, Head Scientist
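As an aside, a more compact variant of the same word-boundary idea (my own sketch, not part of the answer above): build a single alternation with re.escape and let re.sub compute each replacement from the match length:

import re

def censor_terms(text, terms):
    pattern = re.compile(
        "|".join(r"\b{}\b".format(re.escape(t)) for t in terms),
        re.IGNORECASE)
    # Replace each match with as many # characters as it has letters.
    return pattern.sub(lambda m: "#" * len(m.group()), text)

terms = ["she", "personality matrix", "sense of self",
         "self-preservation", "learning algorithms", "her", "herself"]
print(censor_terms("That's how we know she considers herself to be a she!", terms))
# That's how we know ### considers ####### to be a ###!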

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char, up until the . in '.net'?
Could I use the re module?
import re

text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com"""

# The pattern below catches all URLs in the text and returns them in a list.
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case, you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
To go into more detail about (.*?)\.: the dot means any character (except the newline character), the star means zero or more occurrences, and the ? after them makes the repetition non-greedy, so it matches as little as possible. The parentheses place it all into a capturing group, and the final \. is an escaped literal dot. All of this together means it will find everything in between URL= and the first . that follows.
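A quick check of the corrected pattern:

import re

line = 'URL=http://example.net'
print(re.findall(r'URL=(.*?)\.', line))  # ['http://example']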
You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is only going to extract one URL. Assuming you're working with a large text from which you want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it: achieve iteration via while or for loops and, depending on how you iterate, keep track of the position of the last extracted URL, and so on (see the sketch below).
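A minimal sketch of that idea (the function name and loop shape are mine, under the same 'URL=...' assumption as above):

def extract_urls(text, marker='URL='):
    """Yield everything between each marker and the next '.'."""
    position = 0
    while True:
        start = text.find(marker, position)
        if start == -1:
            return
        start += len(marker)
        end = text.find('.', start)
        if end == -1:
            return
        yield text[start:end]
        position = end  # resume the search after this match

print(list(extract_urls('URL=http://example.net and URL=http://foo.org')))
# ['http://example', 'http://foo']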
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember that whatever problem you are encountering, there is a very high chance that somebody on this forum has already encountered it and received an answer; you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

Keep text clean from url

As part of an Information Retrieval project in Python (building a mini search engine), I want to keep clean text from downloaded tweets (a .csv data set of tweets, 27,000 to be exact). A tweet will look like:
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX
I want, using regex, to remove unnecessary parts of the tweets, like URLs, punctuation, and so on.
So the result will be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job; parts of the URLs, for example, are still present in the result.
Please help me find a regex pattern that will do what I want.
This might help.
Demo:
import re
s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL"""
def cleanString(text):
    res = []
    for i in text.strip().split():
        if not re.search(r"(https?)", i):  # skips URLs; note: this works only if http or https appears in the token
            res.append(re.sub(r"[^A-Za-z\.]", "", i).replace(".", " "))  # strip everything that is not a letter
    return " ".join(map(str.strip, res))
print(cleanString(s1))
print(cleanString(s2))
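An alternative single-pass sketch (my own variant, not part of the answer above): strip URLs first, then replace everything that isn't a letter, then collapse the whitespace:

import re

def clean_tweet(text):
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"[^A-Za-z]+", " ", text)   # replace punctuation and digits with spaces
    return " ".join(text.split())             # collapse runs of whitespace

print(clean_tweet('"Democracy...allows us to peacefully work through our differences" —#POTUS in Greece https://twitter.com/PIO9dG2qjX'))
# Democracy allows us to peacefully work through our differences POTUS in Greece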

I'm a newb in Python and data-mining, and I have issues with the tokenizer & data types

Hi~ I am having a problem while trying to tokenize Facebook comments that are in CSV format. I have my CSV data ready, and I have finished reading the file.
I am using Anaconda3 with Python 3.5. (My CSV data has about 20k rows and 1 column.)
The code is:
import csv
from nltk import sent_tokenize, word_tokenize

with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)
What comes, as a result, is something like this:
[['comment_message'], ['b"Yet again been told a pack of lies by Samsung Customer services who have lost my daughters phone and couldn\'t care less. ANYONE WHO PURCHASES ANYTHING FROM THIS COMPANY NEEDS THEIR HEAD TESTED"'], ["b'You cannot really blame an entire brand worldwide for a problem caused by a branch. It is a problem yes, but address your local problem branch'"], ["b'Haha!! Sorry if they lost your daughters phone but I will always buy Samsung products no matter what.'"], ["b'Salim Gaji BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"], ["b'<3 Bewafa zarge <3 \\r\\n\\n \\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\r\\n\\xf0\\x9f\\x8e\\xad\\xf0\\x9f\\x91\\x89 AQIB-BOT.ML \\xf0\\x9f\\x91\\x88\\xf0\\x9f\\x8e\\xadMANUAL\\xe2\\x99\\xaaKing.Bot\\xe2\\x84\\xa2 \\r\\n\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\xe2\\x80\\x94'"], ["b'\\xf0\\x9f\\x8c\\x90 LATIF.ML \\xf0\\x9f\\x8c\\x90'"], ['b"I\'m just waiting here patiently for you guys to say that you\'ll be releasing the s8 and s8+ a week early, for those who pre-ordered. Wishful thinking \\xf0\\x9f\\x98\\x86. Can\'t wait!"'], ['b"That\'s some good positive thinking there sir."'], ["b'(y) #NextIsNow #DoWhatYouCant'"], ["b'looking good'"], ['b"I\'ve always thought that when I first set eyes on my first born that I\'d like it to be on the screen of a cameraphone at arms length rather than eye-to-eye while holding my child. Thank you Samsung for improving our species."'], ["b'cool story'"], ["b'I believe so!'"], ["b'superb'"], ["b'Nice'"], ["b'thanks for the share'"], ["b'awesome'"], ["b'How can I talk to Samsung'"], ["b'Wow'"], ["b'#DoWhatYouCant siempre grandes innovadores Samsung Mobile'"], ["b'I had a problem with my s7 edge when I first got it all fixed now. However when I went to the Samsung shop they were useless and rude they refused to help and said there is nothing they could do no wonder the shop was dead quiet'"], ["b'Zeeshan Khan Masti Khel'"], ["b'I dnt had any problem wd my phn'"], ["b'I have maybe just had a bad phone to start with until it got fixed eventually. I had to go to carphone warehouse they were very helpful'"], ["b'awesome'"], ["b'Ch Shuja Uddin'"], ["b'akhheeerrr'"], ["b'superb'"], ["b'nice story'"], ["b'thanks for the share'"], ["b'superb'"], ["b'thanks for the share'"], ['b"On February 18th 2017 I sent my phone away to with a screen issue. The lower part of the screen was flickering bright white. The phone had zero physical damage to the screen\\n\\nI receive an email from Samsung Quotations with a picture of my SIM tray. Upon phoning I was told my SIM tray was stuck inside the phone and was handed a \\xc2\\xa392.14 repair bill. There is no way that my SIM tray was stuck in the phone as I removed my SIM and memory card before sending the phone away.\\n\\nAfter numerous calls I finally gave in and agreed to pay the \\xc2\\xa392.14 on the understanding that my screen repair would also be covered in this cost. This was confirmed to me by the person on the phone.\\n\\nOn
Sorry for the inconvenience in reading the result. My bad.
To continue, I added:
tokens = [word_tokenize(i) for i in your_list]
for i in tokens:
    print(i)
print(tokens)
This is the part where I get the following error:
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1278, in _slices_from_text
TypeError: expected string or bytes-like object
What I want to do next is,
import nltk
en = nltk.Text(tokens)
print(len(en.tokens))
print(len(set(en.tokens)))
en.vocab()
en.plot(50)
en.count('galaxy s8')
And finally, I want to draw a wordcloud based on the data.
Being aware of the fact that every second of your time is precious, I am terribly sorry to ask for your help. I have been working on this for a couple of days and cannot find the right solution for my problem. Thank you for reading.
The error you're getting is because your CSV file is turned into a list of lists-- one for each row in the file. The file only contains one column, so each of these lists has one element: The string containing the message you want to tokenize. To get past the error, unpack the sublists by using this line instead:
tokens = [word_tokenize(row[0]) for row in your_list]
After that, you'll need to learn some more python and learn how to examine your program and your variables.
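Putting it together, a minimal sketch of the corrected pipeline (note that nltk.Text expects one flat sequence of tokens, so the per-comment lists are flattened; that flattening step is my addition):

import csv
import nltk
from nltk import word_tokenize

with open('facebook_comments_samsung.csv', 'r') as f:
    rows = list(csv.reader(f))

# Each row is a one-element list; tokenize the single column.
tokens = [word_tokenize(row[0]) for row in rows]

# Flatten the per-comment token lists into one stream for nltk.Text.
flat_tokens = [tok for comment in tokens for tok in comment]

en = nltk.Text(flat_tokens)
print(len(en.tokens))
print(len(set(en.tokens)))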
