so i'm making alittle project as i am a beginner and i'm doing some webscraping. I wanted to print the lyrics of a song each on it's line using beautifulsoup in python but instead it's printing like this:
I looked out this morning and the sun was goneTurned on some music to start my dayI lost myself in a familiar songI closed my eyes and I slipped awayIt's more than a feeling (more than a feeling)When I hear that old song they used to play (more than a feeling)And I begin dreaming (more than a feeling)Till I see Marianne walk awayI see my Marianne walkin' awaySo many people have come and goneTheir faces fade as the years go byYet I still recall as I wander onAs clear as the sun in the summer skyIt's more than a feeling (more than a feeling)When I hear that old song they used to play (more than a feeling)And I begin dreaming (more than a feeling)Till I see Marianne walk awayI see my Marianne walkin' awayWhen I'm tired and thinking coldI hide in my music, forget the dayAnd dream of a girl I used to knowI closed my eyes and she slipped awayShe slipped awayIt's more than a feeling (more than a feeling)When I hear that old song they used to play (more than a feeling)And I begin dreaming (more than a feeling)Till I see Marianne walk away
This is my code:
import urllib
from bs4 import BeautifulSoup
html = urllib.urlopen("http://www.metrolyrics.com/more-than-a-feeling-lyrics-boston.html")
bsObj = BeautifulSoup(html, "lxml")
namelist = bsObj.find_all("div", {"id": "lyrics-body-text"})
print("".join([p.get_text(strip=True) for p in namelist]))
You need to remove the strip = True parameter to get_text. That strips the string resulting in the joined output you see.
By removing it:
print("".join([p.get_text() for p in namelist]))
It prints fine.
Try writing it into a simple for loop
for p in namelist:
print(p.get_text(strip=True))
Related
I'm trying to make two gtts voices, Sarah and Mary, talk to each other reading a standard script. After 2 sentences of using the same voice, they start repeating the last word until the duration of the sentence is over.
from moviepy.editor import *
import moviepy.editor as mp
from gtts import gTTS
dialog = "Mary: Mom, can we get a dog?
\nSarah: What kind of dog do you want?
\nMary: I’m not sure. I’ve been researching different breeds and I think I like corgis.
\nSarah: Corgis? That’s a pretty popular breed. What do you like about them?
\nMary: Well, they’re small, so they won’t take up too much room. They’re also very loyal and friendly. Plus, they’re really cute!
\nSarah: That’s true. They do seem like a great breed. Have you done any research on their care and grooming needs?
\nMary: Yes, I have. They don’t need a lot of grooming, but they do need regular brushing and occasional baths. They’re also very active, so they need plenty of exercise.
\nSarah: That sounds like a lot of work. Are you sure you’re up for it?
\nMary: Yes, I am. I’m willing to put in the effort to take care of a corgi.
\nSarah: Alright, if you’re sure. Let’s look into getting a corgi then.
\nMary: Yay! Thank you, Mom!"
lines = dialog.split("\n")
combined = AudioFileClip("Z:\Programming Stuff\Music\Type_Beat__BPM105.wav").set_duration(10) #ADD INTRO MUSIC
for line in lines:
if "Sarah:" in line:
# Use a voice for Person 1
res = line.split(' ', 1)[1] #Removes the first name
tts = gTTS(text=str(res), lang='en') #Accent Changer
tts.save("temp6.mp3")#temp save file cuz audio must mix audio clips
combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
elif "Mary:" in line:
# Use a voice for Person 2
res = line.split(' ', 1)[1] #Removes the first name
tts = gTTS(text=str(res), lang='en', tld = 'com.au') #Accent Changer
tts.save("temp6.mp3") #temp save file cuz audio must mix audio clips
combined = concatenate_audioclips([combined, AudioFileClip("temp6.mp3")])
combined.write_audiofile("output3.mp3") #Final File Nmae
OUTPUT:
It's an audio file that outputs almost exactly the intended output, except after "They're also very loyal and friendly." it keeps repeating "plus". It also repeats at "Yes, I have. They don't need a lot of grooming, but they do need regular brushing and occasional baths." It repeats "baths" many times.
It appears it just repeats after saying 2 sentences and I have no idea why.
I'm dealing with the Well-known novel of Victor Hugo "Les Miserables".
A part of my project is to detect the existence of each of the novel's character in a sentence and count them. This can be done easily by something like this:
def character_frequency(character_per_sentences_dict,):
characters_frequency = OrderedDict([])
for k, v in character_per_sentences_dict.items():
if len(v) != 0:
characters_frequency[k] = len(v)
return characters_frequency, characters_in_vol
This pies of could works well for all of the characters except "Èponine".
I also read the text with the following piece code:
import codecs
import nltk.tokenize
with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
novel = ' '.join(fp.readlines())
# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the dictation of "Èponine" is the same everywhere.
Any idea what's going on ?!
Here is a sample in which this name apears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barrieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with #BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python is fully Unicode supportive. You can "just do it" in terms of using Unicode in your program logic using Python's standard string ops and libraries.
For example, this works:
#!/usr/bin/env python
import requests
names = [
u'Éponine',
u'Cosette'
]
# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text
for name in names:
c = t.count(name)
print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know if how it is encoded, or how it is being read is the problem. I can't test that without having it. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment to you as long as your inputs are reasonable.
All of the time to run this is spent reading the text. I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.
Let's say I open a text file with Python3:
fname1 = "filename.txt"
with open(fname1, "rt", encoding='latin1') as in_file:
readable_file = in_file.read()
The output is a standard text file of paragraphs:
\n\n"Well done, Mrs. Martin!" thought Emma. "You know what you are about."\n\n"And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen. Mrs. Goddard had dressed it on a Sunday, and asked all\nthe three teachers, Miss Nash, and Miss Prince, and Miss Richardson,\nto sup with her."\n\n"Mr. Martin, I suppose, is not a man of information beyond the line\nof his own business? He does not read?"\n\n"Oh yes!--that is, no--I do not know--but I believe he has\nread a good deal--but not what you would think any thing of.\nHe reads the Agricultural Reports, and some other books that lay\nin one of the window seats--but he reads all _them_ to himself.\nBut sometimes of an evening, before we went to cards, he would read\nsomething aloud out of the Elegant Extracts, very entertaining.\nAnd I know he has read the Vicar of Wakefield. He never read the\nRomance of the Forest, nor The Children of the Abbey. He had never\nheard of such books before I mentioned them, but he is determined\nto get them now as soon as ever he can."\n\nThe next question was--\n\n"What sort of looking man is Mr. Martin?"
How can one save only a certain string within this file? For example, how does one save the sentence
And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen.
into a separate text file? How do you know the indices where to access this sentence?
CLARIFICATION: There should be no decision statements to make. The end goal is to create a program which the user could "save" sentences or paragraphs separately. I am asking a more general question at the moment.
Let's say there's a certain paragraph I like in this text. I would like a way to append this quickly to a JSON file or text file. In principle, how does one do this? Tagging all sentences? Is there a way to isolate paragraphs? (To repeat above) How do you know the indices where to access this sentence? (especially if there is no "decision logic")
If I know the indices, couldn't I simply slice the string?
I have the following text:
"It's the show your only friend and pastor have been talking about!
<i>Wonder Showzen</i> is a hilarious glimpse into the black
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth,
nature, diversity, and history – all inside the prison of
your mind! Where else can you..."
What I want to do with this is remove the html tags and encode it into unicode. I am currently doing:
def remove_tags(text):
return TAG_RE.sub('', text)
Which only strips the tag. How would I correctly encode the above for database storage?
You could try passing your text through a HTML parser. Here is an example using BeautifulSoup:
from bs4 import BeautifulSoup
text = '''It's the show your only friend and pastor have been talking about!
<i>Wonder Showzen</i> is a hilarious glimpse into the black
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth,
nature, diversity, and history – all inside the prison of
your mind! Where else can you...'''
soup = BeautifulSoup(text)
>>> soup.text
u"It's the show your only friend and pastor have been talking about! \nWonder Showzen is a hilarious glimpse into the black \nheart of childhood innocence! Get ready as the complete first season of MTV2's Wonder Showzen tackles valuable life lessons like birth, \nnature, diversity, and history \u2013 all inside the prison of \nyour mind! Where else can you..."
You now have a unicode string with the HTML entities converted to unicode escaped characters, i.e. – was converted to \u2013.
This also removes the HTML tags.
I've written a Scrapy spider that extracts text from a page. The spider parses and outputs correctly on many of the pages, but is thrown off by a few. I'm trying to maintain line breaks and formatting in the document. Pages such as http://www.state.gov/r/pa/prs/dpb/2011/04/160298.htm are formatted properly like such:
April 7, 2011
Mark C. Toner
2:03 p.m. EDT
MR. TONER: Good afternoon, everyone. A couple of things at the top,
and then I’ll take your questions. We condemn the attack on innocent
civilians in southern Israel in the strongest possible terms, as well
as ongoing rocket fire from Gaza. As we have reiterated many times,
there’s no justification for the targeting of innocent civilians,
and those responsible for these terrorist acts should be held
accountable. We are particularly concerned about reports that indicate
the use of an advanced anti-tank weapon in an attack against civilians
and reiterate that all countries have obligations under relevant
United Nations Security Council resolutions to prevent illicit
trafficking in arms and ammunition. Also just a brief statement --
QUESTION: Can we stay on that just for one second?
MR. TONER: Yeah. Go ahead, Matt.
QUESTION: Apparently, the target of that was a school bus. Does that
add to your outrage?
MR. TONER: Well, any attack on innocent civilians is abhorrent, but
certainly the nature of the attack is particularly so.
While pages like http://www.state.gov/r/pa/prs/dpb/2009/04/121223.htm have output like this with no line breaks:
April 2, 2009
Robert Wood
11:53 a.m. EDTMR. WOOD: Good morning, everyone. I think it’s just
about still morning. Welcome to the briefing. I don’t have anything,
so – sir.QUESTION: The North Koreans have moved fueling tankers, or
whatever, close to the site. They may or may not be fueling this
missile. What words of wisdom do you have for the North Koreans at
this moment?MR. WOOD: Well, Matt, I’m not going to comment on, you
know, intelligence matters. But let me just say again, we call on the
North to desist from launching any type of missile. It would be
counterproductive. It’s provocative. It further inflames tensions in
the region. We want to see the North get back to the Six-Party
framework and focus on denuclearization.Yes.QUESTION: Japan has also
said they’re going to call for an emergency meeting in the Security
Council, you know, should this launch go ahead. Is this something that
you would also be looking for?MR. WOOD: Well, let’s see if this test
happens. We certainly hope it doesn’t. Again, calling on the North
not to do it. But certainly, we will – if that test does go forward,
we will be having discussions with our allies.
The code I'm using is as follows:
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
hxs = HtmlXPathSelector(response)
speaker = hxs.select("//span[contains(#class, 'official_s_name')]") #gets the speaker
speaker = speaker.select('string()').extract()[0] #extracts speaker text
date = hxs.select('//*[#id="date_long"]') #gets the date
date = date.select('string()').extract()[0] #extracts the date
content = hxs.select('//*[#id="centerblock"]') #gets the content
content = content.select('string()').extract()[0] #extracts the content
texts = "%s\n\n%s\n\n%s" % (date, speaker, content) #puts everything together in a string
filename = ("/path/StateDailyBriefing-" + '%s' ".txt") % (date) #creates a file using the date
#opens the file defined above and writes 'texts' using utf-8
with codecs.open(filename, 'w', encoding='utf-8') as output:
output.write(texts)
I think they problem lies in the formatting of the HTML of the page. On the pages that output the text incorrectly, the paragraphs are separated by <br> <p></p>, while on the pages that output correctly the paragraphs are contained within <p align="left" dir="ltr">. So, while I've identified this, I'm not sure how to make everything output consistently in the correct form.
The problem is that when you getting text() or string(), <br> tags are not converted to newline.
Workaround - replace <br> tags before doing XPath requests. Code:
response = response.replace(body=response.body.replace('<br />', '\n'))
hxs = HtmlXPathSelector(response)
And let me give some advice, if you know, that there is only one node, you can use text() instead string():
date = hxs.select('//*[#id="date_long"]/text()').extract()[0]
Try this xpath:
//*[#id="centerblock"]//text()