Replacing header contents with empty string in Beautiful Soup - python

I have some code to remove the text inside the head tag. soup is the parsed HTML of a website:
for link in soup.findAll('head'):
    link.replaceWith("")
I am trying to replace the entire content with "". However, this is not working. How can I remove all text between the head tags from soup completely?

Try this:
[head.extract() for head in soup.findAll('head')]
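For what it's worth, here is a minimal sketch of the same idea written as a plain loop. It assumes Beautiful Soup 4 (the bs4 package) and a made-up HTML string. decompose() removes the tag and discards it, whereas extract() removes it and hands it back to you:

```python
from bs4 import BeautifulSoup

# Hypothetical page, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <head> tag together with everything inside it.
for head in soup.find_all("head"):
    head.decompose()

cleaned = str(soup)
print(cleaned)
```

If you still need the removed tag afterwards, swap decompose() for extract() and keep the return value.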

You need to use """ (3 quotes), where you appear to be using only two.
Example:
"""
This block
is commented out
"""
Happy coding!
EDIT: This is not what the user was asking, my apologies.
I'm not experienced with Beautiful Soup, but I found a snippet of code on SO that might work for you (source):
soup = BeautifulSoup(source.lower())
# Change the tag name inside '' to choose which tag gets removed, e.g. 'a' or 'head'
to_extract = soup.findAll('a')
for item in to_extract:
    item.extract()
By the look of it, it might just remove every link on your page, though.
I'm sorry if this doesn't help you more!

Related

Appending an existing tag to soup causes angled brackets to become HTML entities

I'm trying to write a BeautifulSoup object to a file. Note that I append something to the soup object. The thing I append is a div containing HTML/JavaScript from Plotly's to_html() function, which gives me a chart in HTML form. I narrowed down the problem to the following code:
from bs4 import BeautifulSoup
file_writer = open("path/to/file", "w")
html_outline = """<html>
<head></head>
<body>
<p>Hello World!</p>
<div></div>
</body>
</html>"""
soup = BeautifulSoup(html_outline, "html.parser")
soup.div.append({plotly HTML/JavaScript})
file_writer.write(soup)
file_writer.close()
Inside the write function, I've tried various functions for the soup object to convert it to a string, like str(soup), soup.prettify(), and more that I'm forgetting, and those indeed successfully write to the file, but the angled brackets ("<>") from the Plotly HTML I insert become HTML entities (I believe that's what they're called), so a
<div>
becomes:
&lt;div&gt;
inside the file I write to. I will note here that only the angled brackets for the HTML I appended into the soup object turn into HTML entities, the html, head, and body tags are all proper angled brackets.
My question is, how can I convert the soup object directly into a string that has proper angled brackets and no HTML entities?
I guess I can maybe write a function that parses the file for those HTML entities and replaces them with proper angled brackets, but I'm hoping there's a better solution before I do that. I tried searching this problem up multiple times but nothing came up for it.
I asked this question previously but it was marked as a duplicate, but the duplicate question linked didn't help because that was for adding empty tags. I'm appending a whole existing div with JavaScript and other content to my soup object here.
Thanks in advance!
I found out that I was able to use bs4's .prettify() function, but I had to change the formatter to None. So my line of code that writes the HTML to the file becomes:
file_writer.write(soup.prettify(formatter=None))
This isn't best practice because, according to bs4's docs, formatter=None may generate invalid HTML/XML. I know the docs say that output should convert HTML entities to Unicode characters by default, so I'm not sure why that didn't work for me. While I'm not in urgent need of a solution anymore, I posted this because I thought that someone may find it useful in the future. Hopefully someone can give a better solution, though!
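One possible alternative, sketched under assumptions: a plain string appended to a tag is stored as text, and text is escaped on output. Parsing the fragment first turns it into real tags, so nothing gets escaped. The chart_html string below is a made-up stand-in for whatever to_html() returns:

```python
from bs4 import BeautifulSoup

html_outline = """<html>
<head></head>
<body>
<p>Hello World!</p>
<div></div>
</body>
</html>"""

# Stand-in for the string that Plotly's to_html() returns (made up here).
chart_html = '<div id="chart"><script>var x = 1;</script></div>'

soup = BeautifulSoup(html_outline, "html.parser")

# Parse the fragment so its tags become real nodes instead of plain text.
fragment = BeautifulSoup(chart_html, "html.parser")
for node in list(fragment.contents):
    soup.div.append(node)

output = str(soup)
print(output)
```

Copying fragment.contents into a list first matters because append() moves each node out of the fragment while the loop is running.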

How to remove html tags from text using python?

I am new to using python and I am trying to create a simple script that prints out the word of the day from Urban Dictionary.
import requests
from bs4 import BeautifulSoup
# requests urban dictionary home page
r = requests.get('https://www.urbandictionary.com')
soup = BeautifulSoup(r.text, 'html.parser')
# finds the title
title = soup.find('title').text
print(title)
# finds the definition
definition = soup.find('meta', attrs={'property': 'og:description'})
print(definition)
I use ".text" for the title to get rid of the html tags and it works, but when I try to use it on the definition all of the text disappears. So, at the moment definition prints out with the html tags. What are some other ways besides ".text" to remove the html tags? When I try to paste the output here part of it doesn't show up so here is a picture of the output.
This is my first time posting on here so I'm sorry if I didn't format my question correctly but any help would be greatly appreciated.
... when I try to use [the text property] on the definition all of the text disappears...
This is because the tag you're targeting looks like this:
<meta content="foo bar baz..." name="Description" property="og:description">
When you try to access the text property on this object in Beautiful Soup, there isn't any text that's a child of the element. Instead, you're looking to extract the "content" attribute, which you can do with the square bracket "array"-style notation:
definition['content']
This feature is documented in the Attributes section of the Beautiful Soup documentation.
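A small self-contained sketch of that, using made-up markup in place of the downloaded page. Using .get() rather than square brackets is my own choice here, so that a missing attribute gives None instead of an exception:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded page.
html = """<html><head>
<title>Example Title</title>
<meta content="foo bar baz" name="Description" property="og:description">
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")
meta = soup.find("meta", attrs={"property": "og:description"})

# The meta tag has no child text, so its text is empty;
# the definition lives in the "content" attribute instead.
definition = meta.get("content") if meta is not None else None
print(definition)
```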

Using regex to find something in the middle of a href while looping

For "extra credit" in a beginners class in Python that I am taking I wanted to extract data out of a URL using regex. I know that there are other ways I could probably do this, but my regex desperately needs work so...
Given a URL to start at, find the xth occurrence of a href on the page, and use that link to go down a level. Rinse and repeat until I have found the required link on the page at the requested depth on the site.
I am using Python 3.7 and Beautiful Soup 4.
At the beginning of the program, after all of the house-keeping is done, I have:
starting_url = 'http://blah_blah_blah_by_Joe.html'
extracted_name = re.findall('(?<=by_)([a-zA-Z0-9]+)[^.html]*', starting_url)
selected_names.append(extracted_name)
# Just for testing purposes
print(selected_names)
# [['Joe']]
Hmm, a bit odd didn't expect a nested list, but I know how to flatten a list, so ok. Let's go on.
I work my way through a couple of loops, opening each url for the next level down by using:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
Continue processing and, in the loop where the program should have found the href I want:
# Testing to check I have found the correct href
print(desired_link)
# <a href="http://blah_blah_blah_by_Mary.html">blah
# blah</a>
type(desired_link)
# bs4.element.Tag
Correct link, but a "type" new to me and not something I can use re.findall on. So more research and I have found:
for link in soup.find_all('a'):
    tags = link.get('href')
    type(tags)  # str
    print(tags)
# Output:
# http://blah_blah_blah_by_George.html
# http://blah_blah_blah_by_Bill.html
# http://blah_blah_blah_by_Mary.html
# etc.
Right type, but when I look at what printed, I think what I am looking at is maybe just one long string? And I need a way to just assign the third href in the string to a variable that I can use in re.findall('regex expression', desired_link).
Time to ask for help, I think.
And, while we are at it, any ideas about why I get the nested list the first time I used re.findall with the regex?
Please let me know how to improve this question so it is clearer what I've done and what I'm looking for (I KNOW you guys will, without me even asking).
You've printed every link on the page. But each time in the loop tags contains only one of them (you can print len(tags) to validate it easily).
Also I suggest replacing [a-zA-Z0-9]+ with \w+ - it will catch letters, numbers and underscores and is much cleaner.
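As for the nested list: re.findall always returns a list, so append() stores that whole list as a single element. A rough sketch of one way around it, with the hrefs collected into a list first so the third one can be picked by index. The hrefs list and the name_pattern variable are made up for illustration:

```python
import re

# Hypothetical hrefs, as link.get('href') would return them one at a time.
hrefs = [
    "http://blah_blah_blah_by_George.html",
    "http://blah_blah_blah_by_Bill.html",
    "http://blah_blah_blah_by_Mary.html",
]

name_pattern = re.compile(r"(?<=by_)\w+")

# extend() adds the matched strings themselves; append() would add the list.
selected_names = []
selected_names.extend(name_pattern.findall("http://blah_blah_blah_by_Joe.html"))

# Pick the third link by position (index 2), then search just that string.
desired_link = hrefs[2]
match = name_pattern.search(desired_link)
if match:
    selected_names.append(match.group(0))

print(selected_names)
```

In the real loop you would build hrefs with something like a list of link.get('href') values instead of hard-coding it.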

Getting html stripped of script and style tags with BeautifulSoup?

I have a simple script where I am fetching an HTML page, passing it to BeautifulSoup to remove all script and style tags, then I want to pass the HTML result to another method. Is there an easy way to do this? Skimming the BeautifulSoup.py, I haven't seen it yet.
soup = BeautifulSoup(html)
for script in soup("script"):
    soup.script.extract()
for style in soup("style"):
    soup.style.extract()
contents = soup.html.contents
text = loader.extract_text(contents)
contents = soup.html.contents just gets a list and everything is defined in classes there. Is there a method that just returns the raw html after soup manipulates it? Or do I just need to go through the contents list and piece the html back together excluding the script & style tags?
Or is there an even better solution to accomplish what I want?
unicode(soup) gives you the HTML (on Python 3, use str(soup) instead).
Also what you want is this:
for elem in soup.findAll(['script', 'style']):
    elem.extract()
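Put together for Python 3 and Beautiful Soup 4, this might look like the sketch below. The HTML string is invented for the example, and decompose() is swapped in for extract() because the removed tags are not needed afterwards:

```python
from bs4 import BeautifulSoup

# Invented page containing one style block and one script block.
html = """<html><head><style>p { color: red; }</style></head>
<body><script>alert("hi");</script><p>Keep me</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# One pass over both tag names; each match is removed with its contents.
for elem in soup.find_all(["script", "style"]):
    elem.decompose()

# str() serializes the modified tree back to HTML.
cleaned_html = str(soup)
print(cleaned_html)
```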

Decomposing HTML to link text and target

Given an HTML link like
texttxt
how can I isolate the url and the text?
Updates
I'm using Beautiful Soup, and am unable to figure out how to do that.
I did
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
    print "link content:", link.content," and attr:",link.attrs
I get
*link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root /support.asp')]* ...
...
Why am I missing the content?
edit: elaborated on 'stuck' as advised :)
Use Beautiful Soup. Doing it yourself is harder than it looks, you'll be better off using a tried and tested module.
EDIT:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to try opening the URL there, as if it goes wrong it could get ugly.
EDIT 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)
for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # This a tag has no href attribute
        pass
    else:
        print link
Here's a code example, showing getting the attributes and contents of the links:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
Looks like you have two issues there:
link.contents, not link.content
attrs is not a string: it holds the key/value pairs for each attribute in an HTML element. In Beautiful Soup 3 it is a list of (name, value) tuples, as your output shows; in Beautiful Soup 4 it is a dictionary. link['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute.
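The two points above could be sketched like this on Python 3 with Beautiful Soup 4, which is an assumption since the question uses the older BeautifulSoup module. The HTML string is made up, and .get() stands in for the "check" mentioned above:

```python
from bs4 import BeautifulSoup

# Made-up markup: one normal link and one anchor without an href.
html = '<p><a href="/support.asp">Support <b>page</b></a> <a name="top">No href</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = []
for link in soup.find_all("a"):
    href = link.get("href")   # None when the tag has no href attribute
    text = link.get_text()    # all text inside the tag, nested tags included
    links.append((href, text))

print(links)
```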
Though I suppose the others might be correct in pointing you to using Beautiful Soup, they might not, and using an external library might be massively over-the-top for your purposes. Here's a regex which will do what you ask.
/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/
Here's what it matches:
'<a href="url">text</a>'
// Parts: "url", "text"
'<a href="url">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"
If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets.
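In Python, the same two-step approach might look like the following sketch. The input string and the variable names are made up for illustration, and the second pattern is just one simple way to strip tags:

```python
import re

# The regex from above, without the surrounding / delimiters.
link_pattern = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>')
# Second pass: anything between angle brackets.
tag_pattern = re.compile(r"<[^>]+>")

html = '<a href="url">text<span>something</span></a>'

match = link_pattern.search(html)
url, inner = match.groups()
plain_text = tag_pattern.sub("", inner)

print(url, inner, plain_text)
```

This only holds up on simple, well-behaved markup; attribute values in single quotes or links spanning several lines would need a different pattern.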
