Remove HTML tags from a string - python

I have looked everywhere for a solution to my problem, but none of them seem to work. Essentially, I want to know the simplest way to remove HTML tags from a string. For example,
PriceTag = Soup.find_all(class_="text-robux-lg wait-for-i18n-format-render")
print(PriceTag)
This returns [<span class="text-robux-lg wait-for-i18n-format-render">1,250</span>] which is very much expected, but I don't know how to take 'PriceTag' and remove the HTML tags.

Try using the .text attribute:
print(PriceTag.text)
This will strip the HTML tags and give you the inner text of the selected element.
Since you used find_all, PriceTag is a list of results, so you need a for-loop to traverse it:
for price_tag in PriceTag:
    print(price_tag.text)
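Putting that together, here is a minimal end-to-end sketch (assuming the soup comes from the page in the question; get_text(strip=True) also trims surrounding whitespace):
from bs4 import BeautifulSoup

html = '<span class="text-robux-lg wait-for-i18n-format-render">1,250</span>'
Soup = BeautifulSoup(html, 'html.parser')

for tag in Soup.find_all(class_="text-robux-lg wait-for-i18n-format-render"):
    text = tag.get_text(strip=True)      # "1,250": tags removed, whitespace trimmed
    price = int(text.replace(',', ''))   # 1250 as a number, if you need it
    print(text, price)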

I am not that experienced, but I'll have a go at your question:
for price in PriceTag:
    print(price.text.strip())

Related

Using regex to find something in the middle of a href while looping

For "extra credit" in a beginners class in Python that I am taking I wanted to extract data out of a URL using regex. I know that there are other ways I could probably do this, but my regex desperately needs work so...
Given a URL to start at, find the xth occurrence of a href on the page, and use that link to go down a level. Rinse and repeat until I have found the required link on the page at the requested depth on the site.
I am using Python 3.7 and Beautiful Soup 4.
At the beginning of the program, after all of the house-keeping is done, I have:
starting_url = 'http://blah_blah_blah_by_Joe.html'
extracted_name = re.findall('(?<=by_)([a-zA-Z0-9]+)[^.html]*', starting_url)
selected_names.append(extracted_name)
# Just for testing purposes
print(selected_names)
# output: [['Joe']]
Hmm, a bit odd; I didn't expect a nested list, but I know how to flatten a list, so OK. Let's go on.
I work my way through a couple of loops, opening each url for the next level down by using:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
I continue processing and, in the loop where the program should have found the href I want:
# Testing to check I have found the correct href
print(desired_link)
# output: <a href="http://blah_blah_blah_by_Mary.html">blah blah</a>
type(desired_link)
# output: bs4.element.Tag
Correct link, but a "type" new to me and not something I can use re.findall on. So after more research I found:
for link in soup.find_all('a'):
    tags = link.get('href')
    print(tags)

# type(tags) is str; the output looks like:
# http://blah_blah_blah_by_George.html
# http://blah_blah_blah_by_Bill.html
# http://blah_blah_blah_by_Mary.html
# etc.
Right type, but when I look at what was printed, I think what I am looking at may just be one long string? And I need a way to assign just the third href to a variable that I can use in re.findall('regex expression', desired_link).
Time to ask for help, I think.
And, while we are at it, any ideas about why I get the nested list the first time I used re.findall with the regex?
Please let me know how to improve this question so it is clearer what I've done and what I'm looking for (I KNOW you guys will, without me even asking).
You've printed every link on the page, but on each pass through the loop tags contains only one of them: a single href string, not one long concatenated string (you can print len(tags) inside the loop to confirm this easily).
Also, I suggest replacing [a-zA-Z0-9]+ with \w+; it matches letters, numbers and underscores and is much cleaner.
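If you want the third href specifically, here is a minimal sketch under the same setup (soup already built and re already imported, as in your code; the index 2 picks the third link, and the \w+ pattern is the replacement suggested above):
hrefs = [link.get('href') for link in soup.find_all('a')]
desired_link = hrefs[2]                            # the third href on the page
extracted_name = re.findall(r'(?<=by_)\w+', desired_link)
print(extracted_name)                              # e.g. ['Mary']
As for the nested list: re.findall already returns a list (['Joe']), and appending that whole list to selected_names gives [['Joe']]; use selected_names.extend(...) or take element [0] if you want it flat.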

XPath: Select specific item from property string

I'm trying to drill down to a specific XPath for a URL inside a longer string. I've gotten down to each of the listed blocks, but can't seem to get any further than the long string of properties.
Example code:
<div class="abc class">
<a class="123" title="abc" keys="xyz" href="url string">
Right now I have...
.//*[@id='content']/div/div[1]/a
That only retrieves the whole string of data, from class through href. What would I need to just retrieve the "url string" from that part? Would this need to be accomplished with a subsequent 'for' argument in the python input?
A pure XPath solution would involve just adding the @href step to the expression:
.//*[@id='content']/div/div[1]/a/@href
In Python, assuming you are using lxml.html, you can get the attribute through .attrib:
for link in root.xpath(".//*[@id='content']/div/div[1]/a"):
    print(link.attrib['href'])
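For what it's worth, lxml will also evaluate the attribute step directly and return plain strings, so the pure-XPath form above works from Python as well (a small sketch, assuming the same root tree):
hrefs = root.xpath(".//*[@id='content']/div/div[1]/a/@href")
for href in hrefs:
    print(href)  # each entry is already a plain string, e.g. "url string"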
Try to avoid relying on the positional index ([1]) like that.
If your class name is unique, you can do it like this:
//*[@id='content']/div/div[@class='abc class']/a[@keys='xyz']/@href
Hope it will help you :)

Selenium - cant get text from span element

I'm very confused by getting text using Selenium.
There are span tags with some text inside them. When I search for them using driver.find_element_by_..., everything works fine.
But the problem is that I can't get the text out of it.
The span tag is definitely found, because I can use the .get_attribute('outerHTML') command and I can see this:
<span class="branding">ThrivingHealthy</span>
But if I change .get_attribute('outerHTML') to .text, it returns an empty string, which is not correct, as you can see above.
Here is the example (the outputs are pieces of a dictionary):
display_site = element.find_element_by_css_selector('span.branding').get_attribute('outerHTML')
# 'display_site': u'<span class="branding">ThrivingHealthy</span>'
display_site = element.find_element_by_css_selector('span.branding').text
# 'display_site': u''
As you can clearly see, there is text, but .text does not find it. What could be wrong?
EDIT: I've found a kind of workaround: I just changed .text to .get_attribute('innerText').
But I'm still curious why it works this way?
The problem is that there are a LOT of tags that match span.branding. When I queried that page using find_elements (plural), it returned 20 tags. Each tag seems to be doubled... I'm not sure why, but my guess is that one of each pair is hidden while the other is visible. From what I can tell, the first of each pair is hidden, and that's probably why you aren't able to pull text from it: Selenium is designed not to interact with elements that a user can't interact with, which is likely why you can get the element but pulling text from it comes back empty. Your best bet is to pull the entire set with find_elements and then loop through it getting the text. You will loop through all 20 and only get text from 10, but you'll still get the entire set anyway. It's weird, but it should work.
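A minimal sketch of that loop, in the same find_elements_* style as the question (.is_displayed() is one way to skip the hidden copies; element here is whatever container you were already searching within):
brandings = element.find_elements_by_css_selector('span.branding')
visible_texts = []
for tag in brandings:
    if tag.is_displayed():                  # skip the hidden duplicate
        visible_texts.append(tag.text)      # .text only returns text for visible elements
print(visible_texts)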

Replacing header contents with empty string in Beautiful Soup

I have code to remove the text that is inside the head tag. soup is the parsed HTML of a website:
for link in soup.findAll('head'):
    link.replaceWith("")
I am trying to replace the entire content with "". However, this is not working. How can I completely remove all of the text between the head tags from soup?
Try this:
[head.extract() for head in soup.findAll('head')]
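A small sketch of the same idea (extract() removes the matched tags from the tree and returns them; if there is only one head, soup.head.decompose() would also work):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>hi</title></head><body>kept</body></html>", "html.parser")
[head.extract() for head in soup.findAll('head')]
print(soup)  # <html><body>kept</body></html>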
You need to use """ (3 quotes), where you appear to be using only two.
Example:
"""
This block
is commented out
"""
Happy coding!
EDIT: This is not what the user was asking, my apologies.
I'm not experienced with Beautiful Soup, but I found a snippet of code on SO that might work for you (source):
soup = BeautifulSoup(source.lower())
to_extract = soup.findAll('ahref') #Edit the stuff inside '' to change which tag you want items to be removed from, like 'ahref' or 'head'
for item in to_extract:
    item.extract()
By the look of it, it might just remove every link on your page, though.
I'm sorry if this doesn't help you more!

Find text and elements with python and selenium?

When I go to a certain webpage I am trying to find a certain element and piece of text:
<span class="Bold Orange Large">0</span>
This didn't work (it gave an error about compound class names or something...):
elem = browser.find_elements_by_class_name("Bold Orange Large")
So I tried this (but I'm not sure it worked, because I don't really understand the right way to do CSS selectors in Selenium...):
elem = browser.find_elements_by_css_selector("span[class='Bold Orange Large']")
Once I find the span element, I want to find the number that is inside...
num = elem.(what to put here??)
Any help with css selectors, class names, and finding element text would be great!!
Thanks.
EDIT:
My other problem is that there are multiple of those exact span elements, but with different numbers inside... how can I deal with that?
You're correct in your usage of CSS selectors! Also, your first attempt was failing because there are spaces in the class name, and Selenium does not seem to be able to find compound class names with spaces at all. I think that is a bad development practice to begin with, so it's not your problem. Selenium itself does not include an HTML editor, because that's already been done before.
Try looking here: How to find/replace text in html while preserving html tags/structure.
Also this one is relevant and popular as well: RegEx match open tags except XHTML self-contained tags
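To actually read the number(s), here is a minimal sketch in the same find_elements_* style as the question (the dotted selector span.Bold.Orange.Large is the usual way to match a compound class name; .text then gives the inner text, and using find_elements handles the case where several of these spans exist):
elems = browser.find_elements_by_css_selector("span.Bold.Orange.Large")
for elem in elems:
    num = int(elem.text)  # "0" -> 0; each span yields its own number
    print(num)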
