Gtts stutters after 2 sentences

I'm trying to make two gtts voices, Sarah and Mary, talk to each other reading a standard script. After 2 sentences of using the same voice, they start repeating the last word until the duration of the sentence is over.
from moviepy.editor import AudioFileClip, concatenate_audioclips
from gtts import gTTS

dialog = (
    "Mary: Mom, can we get a dog?"
    "\nSarah: What kind of dog do you want?"
    "\nMary: I’m not sure. I’ve been researching different breeds and I think I like corgis."
    "\nSarah: Corgis? That’s a pretty popular breed. What do you like about them?"
    "\nMary: Well, they’re small, so they won’t take up too much room. They’re also very loyal and friendly. Plus, they’re really cute!"
    "\nSarah: That’s true. They do seem like a great breed. Have you done any research on their care and grooming needs?"
    "\nMary: Yes, I have. They don’t need a lot of grooming, but they do need regular brushing and occasional baths. They’re also very active, so they need plenty of exercise."
    "\nSarah: That sounds like a lot of work. Are you sure you’re up for it?"
    "\nMary: Yes, I am. I’m willing to put in the effort to take care of a corgi."
    "\nSarah: Alright, if you’re sure. Let’s look into getting a corgi then."
    "\nMary: Yay! Thank you, Mom!"
)
lines = dialog.split("\n")
# Raw string so the backslashes in the Windows path are taken literally
combined = AudioFileClip(r"Z:\Programming Stuff\Music\Type_Beat__BPM105.wav").set_duration(10)  # add intro music
for line in lines:
    if "Sarah:" in line:
        # Use a voice for Person 1
        res = line.split(' ', 1)[1]  # removes the speaker's name
        tts = gTTS(text=str(res), lang='en')  # accent changer
        tts.save("temp6.mp3")  # temporary file so the speech can be loaded as an audio clip
        combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
    elif "Mary:" in line:
        # Use a voice for Person 2
        res = line.split(' ', 1)[1]  # removes the speaker's name
        tts = gTTS(text=str(res), lang='en', tld='com.au')  # accent changer
        tts.save("temp6.mp3")  # temporary file so the speech can be loaded as an audio clip
        combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
combined.write_audiofile("output3.mp3")  # final file name
OUTPUT:
It's an audio file that outputs almost exactly the intended output, except after "They're also very loyal and friendly." it keeps repeating "plus". It also repeats at "Yes, I have. They don't need a lot of grooming, but they do need regular brushing and occasional baths." It repeats "baths" many times.
It appears it just repeats after saying 2 sentences and I have no idea why.
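One detail that may be worth ruling out (this is a guess, not a confirmed diagnosis): every line is saved to the same temp6.mp3, so clips created earlier still point at a file that is overwritten later. A minimal sketch of giving each line its own file name is below. The helper plan_clips, the VOICES mapping and the line_NNN.mp3 naming scheme are hypothetical and only illustrate the idea; the sketch deliberately does not call gTTS or moviepy.

```python
def plan_clips(dialog, voices):
    """Pair every spoken line with its own output filename (hypothetical helper)."""
    plan = []
    for index, line in enumerate(dialog.split("\n")):
        speaker, sep, text = line.partition(":")
        speaker = speaker.strip()
        if not sep or speaker not in voices:
            continue  # skip blank lines and unknown speakers
        plan.append({
            "speaker": speaker,
            "text": text.strip(),
            # One file per line, so no clip shares a file with another clip
            "filename": "line_{:03d}.mp3".format(index),
            "gtts_kwargs": voices[speaker],
        })
    return plan


# Assumed mapping from speaker to the gTTS keyword arguments used in the question
VOICES = {"Sarah": {"lang": "en"}, "Mary": {"lang": "en", "tld": "com.au"}}

plan = plan_clips(
    "Mary: Mom, can we get a dog?\nSarah: What kind of dog do you want?",
    VOICES,
)
for entry in plan:
    print(entry["filename"], entry["speaker"], entry["text"])
```

Each entry could then be passed to gTTS(text=entry["text"], **entry["gtts_kwargs"]).save(entry["filename"]) and loaded with AudioFileClip(entry["filename"]), exactly as the original loop does, just with a distinct file per line.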

Analysing English text with some French name

I'm working with Victor Hugo's well-known novel "Les Miserables".
Part of my project is to detect which of the novel's characters appear in each sentence and count the occurrences. This can be done easily with something like this:
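from collections import OrderedDict

def character_frequency(character_per_sentences_dict):
    characters_frequency = OrderedDict([])
    for k, v in character_per_sentences_dict.items():
        # Keep only characters that appear in at least one sentence
        if len(v) != 0:
            characters_frequency[k] = len(v)
    # characters_in_vol is not defined in this snippet (presumably set elsewhere in the project)
    return characters_frequency, characters_in_vol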
This piece of code works well for all of the characters except "Èponine".
I read the text with the following piece of code:
from nltk.tokenize import sent_tokenize

with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())
# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barrieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."

I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python fully supports Unicode. You can "just do it": use Unicode in your program logic with Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests

names = [
    u'Éponine',
    u'Cosette'
]
# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text
for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read, and I can't test that without the file. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment as long as your inputs are reasonable.
All of the time to run this is spent reading the text. I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.
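To illustrate how the way a file is read can make a name "disappear", here is a small sketch. It assumes (this is not confirmed by the question) that the novel's file is stored as UTF-8 while being opened with encoding='latin1', as in the question's code. The helper normalized_count is hypothetical and only shows one way to guard against composed versus decomposed accents.

```python
import unicodedata

text = "As for Éponine, she was not at her post."
raw = text.encode("utf-8")           # bytes as they would sit in a UTF-8 file

as_latin1 = raw.decode("latin1")     # what open(..., encoding='latin1') would yield
as_utf8 = raw.decode("utf-8")        # what open(..., encoding='utf-8') would yield

print(as_latin1.count("Éponine"))    # the accented letter was split into two characters
print(as_utf8.count("Éponine"))


def normalized_count(haystack, needle):
    """Count needle in haystack after putting both into NFC form (hypothetical helper)."""
    nfc_haystack = unicodedata.normalize("NFC", haystack)
    nfc_needle = unicodedata.normalize("NFC", needle)
    return nfc_haystack.count(nfc_needle)


# "E" followed by a combining acute accent looks identical on screen
decomposed = "E\u0301ponine had disappeared."
print(decomposed.count("Éponine"), normalized_count(decomposed, "Éponine"))
```

It is also worth checking which accent the search string carries: "Èponine" (grave) and "Éponine" (acute) are different strings, so one will never match the other.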

Keep text clean from url

As part of an Information Retrieval project in Python (building a mini search engine), I want to extract clean text from downloaded tweets (a .csv data set of 27,000 tweets). A tweet looks like:
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." ‚Äî#POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" ‚Äî#POTUS in Greece https://twitter.com/PIO9dG2qjX
Using a regex, I want to remove the unnecessary parts of the tweets, such as URLs, punctuation, etc.
So the result will be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job; for example, parts of the URL are still present in the result.
Please help me find a regex pattern that will do what I want.

This might help.
Demo:
import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" ‚Äî#POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." ‚Äî#POTUS https://twitter.com/OZRd5o4wRL"""

def cleanString(text):
    res = []
    for i in text.strip().split():
        if not re.search(r"(https?)", i):  # Removes URLs. Note: works only if http or https is in the token.
            # Strip everything that is not a letter (upper or lower case); dots become spaces
            res.append(re.sub(r"[^A-Za-z\.]", "", i).replace(".", " "))
    return " ".join(map(str.strip, res))

print(cleanString(s1))
print(cleanString(s2))
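Since the question asked for a regex pattern, here is an alternative sketch that does the same job in two regex passes: first drop anything that looks like a URL, then keep only runs of ASCII letters. The names URL_RE, WORD_RE and clean_tweet are made up for this example, and it assumes that every URL starts with http:// or https://.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # a URL runs until the next whitespace
WORD_RE = re.compile(r"[A-Za-z]+")     # runs of ASCII letters only


def clean_tweet(text):
    """Remove URLs, then keep only alphabetic words joined by single spaces."""
    without_urls = URL_RE.sub(" ", text)
    return " ".join(WORD_RE.findall(without_urls))


sample = '"Democracy...allows us, and move closer to our ideals" #POTUS in Greece https://twitter.com/PIO9dG2qjX'
print(clean_tweet(sample))
```

Because digits and punctuation never match WORD_RE, numbers such as 1234 and the hash sign are dropped without any extra rules.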

How to read list element in Python from a text file?

My text file is like below.
[0, "we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n"]
[1, "I want to write a . I think I will.\n"]
[2, "#va_stress broke my twitter..\n"]
[3, "\" \"Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n"]
[4, "aww great \"Picture to burn\"\n"]
[5, "#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n"]
[6, "http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n"]
[7, "cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n"]
[8, "\" couples in public\n"]
[9, "damn wendy's commerical got that damn in my head.\n"]
[10, "i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n"]
[11, "\" getting ready for school. after i print out this\n"]
I want to read the second element of every list, i.e. all the tweet texts, into an array.
I wrote:
tweets = []
for line in open('tweets.txt').readlines():
    print(line[1])
    tweets.append(line)
but when I look at the output, it just takes the second character of every line.

When you read a text file in Python, the lines are just strings. They aren't automatically converted to some other data structure.
In your case, it looks like each line in your file contains a JSON list. In that case, you can parse the line first using json.loads(). This converts the string to a Python list which you can then take the second element of:
import json
with open('tweets.txt') as fp:
    tweets = [json.loads(line)[1] for line in fp]

Maybe you should consider using the json.loads method:
import json

tweets = []
for line in open('tweets.txt').readlines():
    tweet = json.loads(line)[1]  # second element of the parsed list
    print(tweet)
    tweets.append(tweet)
There is a more Pythonic way in @Erik Cederstrand's comment.

Rather than guessing what format the data is in, you should find out.
If you're generating it yourself, and don't know how to parse back in what you're creating, change your code to generate something that can be easily parsed with the same library used to generate it, like JsonLines or CSV.
If you're ingesting it from some API, read the documentation for that API and parse it the way it's documented.
If someone handed you the file and told you to parse it, ask that someone what format it's in.
Occasionally, you do have to deal with some crufty old file in some format that was never documented and nobody remembers what it was. In that case, you do have to reverse engineer it. But what you want to do then is guess at likely possibilities, and try to parse it with as much validation and error handling as possible, to verify that you guessed right.
In this case, the format looks a lot like either JSON lines or ndjson. Both are slightly different ways of encoding multiple objects with one JSON text per line, with specific restrictions on those texts and the way they're encoded and the whitespace between them.
So, while a quick&dirty parser like this will probably work:
import json

with open('tweets.txt') as f:
    for line in f:
        tweet = json.loads(line)
        dosomething(tweet)
You probably want to use a library like jsonlines:
import jsonlines

with jsonlines.open('tweets.txt') as f:
    for tweet in f:
        dosomething(tweet)
The fact that the quick&dirty parser works on JSON lines is, of course, part of the point of that format—but if you don't actually know whether you have JSON lines or not, you're better off making sure.
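As a rough sketch of what "as much validation and error handling as possible" could look like with only the standard library: the function below parses each line as JSON, checks that it has the [index, text] shape seen in the sample, and collects problems instead of failing on the first one. The name parse_tweet_lines and the exact shape check are assumptions made for this example, not part of any library.

```python
import json


def parse_tweet_lines(lines):
    """Return (tweets, errors) for an iterable of lines, one JSON text per line."""
    tweets, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, "invalid JSON: {}".format(exc.msg)))
            continue
        # Assumed shape: a two-element list of [integer index, tweet text]
        if (isinstance(record, list) and len(record) == 2
                and isinstance(record[0], int) and isinstance(record[1], str)):
            tweets.append(record[1])
        else:
            errors.append((lineno, "unexpected shape"))
    return tweets, errors


sample_lines = ['[0, "first tweet\\n"]', 'not json at all', '[1, 2]', '']
tweets, errors = parse_tweet_lines(sample_lines)
print(tweets)
print(errors)
```

If the error list comes back non-empty on real data, that is a hint that the guess about the format was wrong or that some lines were damaged.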

Since your input looks like Python expressions, I'd use ast.literal_eval to parse them.
Here is an example:
import ast
with open('tweets.txt') as fp:
    tweets = [ast.literal_eval(line)[1] for line in fp]
print(tweets)
Output:
['we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n', 'I want to write a . I think I will.\n', '#va_stress broke my twitter..\n', '" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n', 'aww great "Picture to burn"\n', '#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n', 'http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n', 'cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n', '" couples in public\n', "damn wendy's commerical got that damn in my head.\n", 'i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n', '" getting ready for school. after i print out this\n']

I'm a newb in Python and data-mining. Have issues regarding tokenizer & data type issues

Hi~ I am having a problem while I am trying to tokenize facebook comments which are in CSV format. I have my CSV data ready, and I completed reading the file.
I am using Anaconda3; Python 3.5. (My CSV data has about 20k in rows and 1 in cols)
The code is:
import csv
from nltk import sent_tokenize, word_tokenize

with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)  # each row becomes a list of column values
print (your_list)
What comes, as a result, is something like this:
[['comment_message'], ['b"Yet again been told a pack of lies by Samsung Customer services who have lost my daughters phone and couldn\'t care less. ANYONE WHO PURCHASES ANYTHING FROM THIS COMPANY NEEDS THEIR HEAD TESTED"'], ["b'You cannot really blame an entire brand worldwide for a problem caused by a branch. It is a problem yes, but address your local problem branch'"], ["b'Haha!! Sorry if they lost your daughters phone but I will always buy Samsung products no matter what.'"], ["b'Salim Gaji BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"], ["b'<3 Bewafa zarge <3 \\r\\n\\n \\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\r\\n\\xf0\\x9f\\x8e\\xad\\xf0\\x9f\\x91\\x89 AQIB-BOT.ML \\xf0\\x9f\\x91\\x88\\xf0\\x9f\\x8e\\xadMANUAL\\xe2\\x99\\xaaKing.Bot\\xe2\\x84\\xa2 \\r\\n\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\xe2\\x80\\x94'"], ["b'\\xf0\\x9f\\x8c\\x90 LATIF.ML \\xf0\\x9f\\x8c\\x90'"], ['b"I\'m just waiting here patiently for you guys to say that you\'ll be releasing the s8 and s8+ a week early, for those who pre-ordered. Wishful thinking \\xf0\\x9f\\x98\\x86. Can\'t wait!"'], ['b"That\'s some good positive thinking there sir."'], ["b'(y) #NextIsNow #DoWhatYouCant'"], ["b'looking good'"], ['b"I\'ve always thought that when I first set eyes on my first born that I\'d like it to be on the screen of a cameraphone at arms length rather than eye-to-eye while holding my child. 
Thank you Samsung for improving our species."'], ["b'cool story'"], ["b'I believe so!'"], ["b'superb'"], ["b'Nice'"], ["b'thanks for the share'"], ["b'awesome'"], ["b'How can I talk to Samsung'"], ["b'Wow'"], ["b'#DoWhatYouCant siempre grandes innovadores Samsung Mobile'"], ["b'I had a problem with my s7 edge when I first got it all fixed now. However when I went to the Samsung shop they were useless and rude they refused to help and said there is nothing they could do no wonder the shop was dead quiet'"], ["b'Zeeshan Khan Masti Khel'"], ["b'I dnt had any problem wd my phn'"], ["b'I have maybe just had a bad phone to start with until it got fixed eventually. I had to go to carphone warehouse they were very helpful'"], ["b'awesome'"], ["b'Ch Shuja Uddin'"], ["b'akhheeerrr'"], ["b'superb'"], ["b'nice story'"], ["b'thanks for the share'"], ["b'superb'"], ["b'thanks for the share'"], ['b"On February 18th 2017 I sent my phone away to with a screen issue. The lower part of the screen was flickering bright white. The phone had zero physical damage to the screen\\n\\nI receive an email from Samsung Quotations with a picture of my SIM tray. Upon phoning I was told my SIM tray was stuck inside the phone and was handed a \\xc2\\xa392.14 repair bill. There is no way that my SIM tray was stuck in the phone as I removed my SIM and memory card before sending the phone away.\\n\\nAfter numerous calls I finally gave in and agreed to pay the \\xc2\\xa392.14 on the understanding that my screen repair would also be covered in this cost. This was confirmed to me by the person on the phone.\\n\\nOn
Sorry for your inconvenience in reading the result. My bad.
To continue, I added,
tokens = [word_tokenize(i) for i in your_list]
for i in tokens:
    print (i)
print (tokens)
This is the part where I get the following error:
C:\Program Files\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text) in line 1278 TypeError: expected string or bytes-like object
What I want to do next is,
import nltk
en = nltk.Text(tokens)
print(len(en.tokens))
print(len(set(en.tokens)))
en.vocab()
en.plot(50)
en.count('galaxy s8')
And finally, I want to draw a wordcloud based on the data.
I know that every second of your time is precious, and I am sorry to ask for your help. I have been working on this for a couple of days and cannot find the right solution for my problem. Thank you for reading.

The error you're getting is because your CSV file is turned into a list of lists, one for each row in the file. The file only contains one column, so each of these lists has one element: the string containing the message you want to tokenize. word_tokenize expects a string, not a list, hence the TypeError. To get past the error, unpack the sublists by using this line instead:
tokens = [word_tokenize(row[0]) for row in your_list]
After that, you'll need to learn some more Python and learn how to examine your program and your variables.
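A small self-contained sketch of the same idea follows. It uses an in-memory CSV instead of the real file, skips the header row, and uses str.split as a stand-in for word_tokenize so that it runs without NLTK. The decode_cell helper is hypothetical; it assumes the cells hold the printed form of bytes objects (b'...'), as the output in the question suggests, and turns them back into text.

```python
import ast
import csv
import io

# Stand-in for the CSV file: one header row, one column, cells shaped like b'...'
csv_text = (
    "comment_message\n"
    "b'looking good'\n"
    "b'BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'\n"
)


def decode_cell(cell):
    """Turn the text "b'...'" back into a str; leave other cells unchanged."""
    if cell.startswith(("b'", 'b"')):
        return ast.literal_eval(cell).decode("utf-8")
    return cell


rows = list(csv.reader(io.StringIO(csv_text)))
# rows[0] is the header; every other row is a one-element list
comments = [decode_cell(row[0]) for row in rows[1:] if row]
tokens = [comment.split() for comment in comments]  # word_tokenize(comment) with NLTK
print(tokens)
```

Note that the result is a list of token lists. nltk.Text expects one flat sequence of tokens, so the per-comment lists would need to be flattened before the later steps in the question.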

Parsing txt file in python where it is hard to split by delimiter

I am new to Python, and am wondering if anyone can help me with some file loading.
I have some text files and I'm trying to do sentiment analysis. Each line of a file is split into three fields: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to turn each line into this:
<category> <user> <review>
I have 50k lines of this data.
I tried to load it directly into numpy, but I get an "empty separator" error. I searched Stack Overflow, but I couldn't find a case that covers a varying number of delimiters. I can never know in advance how many spaces a line in my data set contains.
My biggest problem is how to count the delimiters and assign the pieces to columns. Is there a way to split each line into three fields, <department>, <user>, <review>? Bear in mind that the review can contain arbitrary commas and spaces that I can't control, so the parsing must be smart enough to cope with that.
Any ideas? Is there a way to tell Python that everything after the user field belongs to the review?

With data like this I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
    # Split on any run of whitespace, at most twice: the rest of the line is the review
    category, user, review = line.split(None, 2)
    print ("category: {} - user: {} - review: '{}'".format(category,
                                                          user,
                                                          review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split

What about doing it sorta manually:
data = []
for line in input_data:
    tmp_split = line.split(" ")
    # Get the first part (dept)
    dept = tmp_split[0]
    # Get the 2nd part
    user = tmp_split[1]
    # Everything after is the review - put spaces in between each piece
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])
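The question mentions that the number of spaces between fields is unknown. The sketch below contrasts the two splitting styles on such a line: split(" ") produces empty strings for repeated spaces, while split(None, 2) treats any run of whitespace as one separator. The function name parse_review and the error handling are illustrative choices, not something taken from the answers above.

```python
def parse_review(line):
    """Split a line into (department, user, review) on runs of whitespace."""
    parts = line.split(None, 2)
    if len(parts) < 3:
        raise ValueError("expected three fields, got {}: {!r}".format(len(parts), line))
    department, user, review = parts
    # The remainder keeps its trailing newline, so strip it here
    return department, user, review.rstrip()


messy = "men   peter123\tthe pants are  too tight, for my liking!\n"

print(messy.split(" ")[:4])   # repeated spaces show up as empty strings
print(parse_review(messy))    # the review keeps its inner commas and spacing
```

Raising an error on short lines is one possible design; with 50k lines it may be more practical to collect such lines and inspect them afterwards.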
