I need help parsing a web page with Python and the requests-html library. Here is the <div> that I want to analyze:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>
It renders as:
Text
I need to get Te<b>x</b>t as a result of parsing, without <div> and <span> but with <b> tags.
Using element as a requests-html object, here is what I am getting.
element.html:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>
element.text:
ATe\nx\nt
element.full_text:
AText
Could you please tell me how I can get rid of the <span> but still keep the <b> tags in the parsing result?
Don't overcomplicate it.
How about some simple string processing to get the string between two boundaries:
Use element.html
Take everything after the closing </span>
Take everything before the closing </div>
Like this:
myHtml = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
myAnswer = myHtml.split("</span>")[1]
myAnswer = myAnswer.split("</div>")[0]
print(myAnswer)
output:
Te<b>x</b>t
Seems to work for the sample you provided. If you have more complex requirements, let us know and I'm sure someone can adapt this further.
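If the markup ever gets less predictable than this, here is a small BeautifulSoup sketch (my own variant, not part of the answer above) that drops the span and keeps the div's inner HTML:
from bs4 import BeautifulSoup

myHtml = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
soup = BeautifulSoup(myHtml, "html.parser")
div = soup.find("div", class_="answer")
div.find("span", class_="marker").extract()   # remove the marker span entirely
print(div.decode_contents())                  # Te<b>x</b>t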
So I have pieces of HTML like this that I'm trying to parse. What I want to grab is the price ("84.00 USD"):
<div class="HeaderAndValues_headerDetailSection__3c2SZ ProductCatalog_price__25i2r">
<div class="HeaderAndValues_header__3dB61">Wholesale</div>
<span class="notranslate">
<div class="">84.00 USD</div>
</span>
</div>
soup.find(text="Wholesale").find_next().text gives me exactly what I need, but only for the first search result. Is there any way I could combine find_all() and find_next()? Something like soup.find_all(text="Wholesale").find_next() that would grab the next text for each "Wholesale" it finds.
OK, I've found it! Someone might still find it useful:
[x.find_next().text for x in page.find_all(text = "Wholesale")]
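For anyone who wants to run it end to end, here is a minimal sketch against the snippet from the question (the surrounding soup setup is assumed; recent BeautifulSoup versions prefer string="Wholesale" over text=, but both work):
from bs4 import BeautifulSoup

html = ('<div class="HeaderAndValues_headerDetailSection__3c2SZ ProductCatalog_price__25i2r">'
        '<div class="HeaderAndValues_header__3dB61">Wholesale</div>'
        '<span class="notranslate"><div class="">84.00 USD</div></span></div>')
page = BeautifulSoup(html, "html.parser")
prices = [x.find_next().text for x in page.find_all(text="Wholesale")]
print(prices)   # ['84.00 USD']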
I've looked around and found solutions that have worked, or are supposed to work, for this exact question, but they will not work for this situation. Does anyone have a reason why it would work there but not here? Or simply show me what I'm doing wrong, and I can work out the difference.
Keep in mind, I'm just giving a snippet of the HTML; it contains much more with the same span and class='boldText'. I'm specifically after the tag with Status: as its text, and then the text/content that follows it.
import bs4
html1 = '''<span class="boldText"><b>Date:</b> </span>12/04/2018<br/>
<span class="boldText"><b>Name:</b> </span>Aaron Rodgers<br/>
<span class="boldText"><b>Status:</b> </span>Questionable<br/><br/>
<br/>
<br/><br/><br/>'''
soup = bs4.BeautifulSoup(html1,'html.parser')
status = soup.find(text='Status:').next_sibling
I'm just trying to get the text: 'Questionable'
so looking for output:
>>> print(status)
Questionable
The problem is that the text you find ('Status:') is the only child of the b tag, so it has no next sibling. It's easier to see when the markup is formatted like this:
<span class="boldText">
<b>Status:</b>
</span>
Questionable
<br/>
See how the b is the only child of the span? The string "Questionable" is actually a sibling of the parent span, so you need to navigate to it as follows:
print(soup.find('b', string='Status:').parent.next_sibling)
# => 'Questionable'
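If you want every label/value pair rather than just Status, here is a small sketch built on the same idea, assuming the soup object from the snippet above (the variable names are mine):
fields = {}
for b in soup.find_all('b'):
    label = b.get_text(strip=True).rstrip(':')   # 'Date', 'Name', 'Status'
    value = b.parent.next_sibling                # the text right after the closing </span>
    if value:
        fields[label] = value.strip()

print(fields)   # {'Date': '12/04/2018', 'Name': 'Aaron Rodgers', 'Status': 'Questionable'}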
How can I get an image from code like this:
<div class="galery-images">
<div class="galery-images-slide" style="width: 760px;">
<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>
I want to get 136666697057736800.jpg
I wrote:
images = soup.select("div.galery-item")
And I get a list:
[<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136013892671126300.jpg);" ></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136666699218876700.jpg);"></div>]
I don't understand: how can I get all the images?
Use a regex or a CSS parser to extract the URL, concatenate the host onto the beginning of the URL, and finally download the image like this.
import urllib   # Python 2; on Python 3 use urllib.request.urlretrieve
urllib.urlretrieve("https://www.google.com/images/srpr/logo11w.png", "google.png")
To make your life easier, you should use a regex:
import re

urls = []
pattern = re.compile(r'.*background-image:\s*url\((.*)\);')
for ele in soup.find_all('div', attrs={'class': 'galery-images-slide'}):
    match = pattern.match(ele.div['style'])
    if match:
        urls.append(match.group(1))
This works by finding every div with the class 'galery-images-slide', then running a regex over the style attribute of its first child div to pull out the background-image URL.
So, from your above example, this will output:
[u'/images/photo/1/20130206/30323/136666697057736800.jpg']
Now, to download the specified image, prepend the site name to the URL and you should be able to fetch it.
NOTE:
This requires the regex module (re) in Python in addition to BeautifulSoup.
And, the regex I used is quite naive. But, you can adjust this as required to suit your needs.
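Putting the two parts together, here is a rough sketch that grabs every background-image URL from the galery-item divs and downloads each file (the host name here is made up; soup is the BeautifulSoup object you already have):
import re
import urllib   # Python 2; use urllib.request.urlretrieve on Python 3

host = "http://example.com"          # replace with the real site host
pattern = re.compile(r'url\((.*?)\)')

for div in soup.select("div.galery-item"):
    match = pattern.search(div.get("style", ""))
    if match:
        path = match.group(1)                           # e.g. /images/photo/.../136666697057736800.jpg
        urllib.urlretrieve(host + path, path.rsplit("/", 1)[-1])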
I am currently writing scraper code and am becoming more and more a fan of Python, and especially BeautifulSoup.
Still... when parsing HTML I came across a tricky part that I could only handle in a not-so-beautiful way.
I want to scrape HTML code, in particular the following snippet:
<div class="title-box">
<h2>
<span class="result-desc">
Search results <strong>1</strong>-<strong>10</strong> out of <strong>10,009</strong> about <strong>paul mccartney</strong>Create email Alert
</span>
</h2>
</div>
So what I do is I identify the div by using:
comment = TopsySoup.find('div', attrs={'class' : 'title-box'})
Then the ugly part comes in. To catch the number I want (10,009), I use:
catcher = comment.strong.next.next.next.next.next.next.next
Can somebody tell me if there is a nicer way?
How about comment.find_all('strong')[2].text?
It can actually be shortened as comment('strong')[2].text, since calling a Tag object as though it is a function is the same as calling find_all on it.
>>> comment('strong')[2].text
u'10,009'
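Put together with the HTML from the question, a quick sanity check could look like this:
from bs4 import BeautifulSoup

html = '''<div class="title-box">
  <h2>
    <span class="result-desc">
      Search results <strong>1</strong>-<strong>10</strong> out of <strong>10,009</strong> about <strong>paul mccartney</strong>Create email Alert
    </span>
  </h2>
</div>'''

TopsySoup = BeautifulSoup(html, 'html.parser')
comment = TopsySoup.find('div', attrs={'class': 'title-box'})
print(comment('strong')[2].text)   # 10,009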
I want to read the amount value (24.40) from this HTML.
<div id="order-total" class="clear-fix" style="margin-bottom:20px;">
<h3 class="col-left">Order total</h3>
<h3 class="col-right" style="display: block;">
<span class="credit-total-to-order" data-total-to-order="24.40">$ 24.40</span>
credits
</h3>
</div>
xpath - /html/body/div/header/section/form/div[5]/h3[2]/span
css - html body.ui-lang-en div#slave-edit.string-v2 header#slave-edit-header.edit
section#order-form form#frm-order-translation div#order-total.clear-fix
h3.col-right span.credit-total-to-order
I know I should use find_element_by_class_name or find_element_by_css_selector.
But not sure what should be the argument.
How can I do it?
Why not select the text from the element and parse the string to get the answer you need? For example, you can split the string and disregard the dollar sign to return just the number.
someString = driver.find_element_by_css_selector(".credit-total-to-order").text
someString.split(' ')[1]
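For example, with the text that span renders:
someString = "$ 24.40"
print(someString.split(' ')[1])   # 24.40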
Bear in mind - this will only work for the example you have provided.
It's not necessary to use find_element_by_class_name or find_element_by_css_selector. You can achieve it with XPath like this:
driver.find_element_by_xpath("//span[@class='credit-total-to-order']").text
UPDATE:
As per your updated HTML, it looks like the style is what makes your element hidden. Meanwhile, I also noticed that the value you want is stored in the data-total-to-order attribute as well.
So you can do something like this:
driver.find_element_by_xpath("//span[@class='credit-total-to-order']").get_attribute("data-total-to-order")
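A minimal end-to-end sketch, assuming driver is an already-initialised WebDriver and the order page is loaded (Selenium 4 users would write driver.find_element(By.XPATH, ...) instead of the _by_* helpers):
# The data attribute holds the amount even when the span itself is hidden.
element = driver.find_element_by_xpath("//span[@class='credit-total-to-order']")
amount = element.get_attribute("data-total-to-order")
print(amount)   # 24.40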