How can I get all images if I'm using Beautiful Soup? - python

How can I get an image if the code looks like this:
<div class="galery-images">
<div class="galery-images-slide" style="width: 760px;">
<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>
I want to get 136666697057736800.jpg
I wrote:
images = soup.select("div.galery-item")
And I get a list:
[<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136013892671126300.jpg);" ></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136666699218876700.jpg);"></div>]
I don't understand: how can I get all the images?

Use a regex or a CSS parser to extract the URL, prepend the host to it, and finally download the image like this:
import urllib.request
urllib.request.urlretrieve("https://www.google.com/images/srpr/logo11w.png", "google.png")

To make your life easier, you should use a regex:
import re

pattern = re.compile(r'.*background-image:\s*url\((.*)\);')
urls = []
for slide in soup.find_all('div', attrs={'class': 'galery-images-slide'}):
    for item in slide.find_all('div'):
        match = pattern.match(item.get('style', ''))
        if match:
            urls.append(match.group(1))
This works by finding the parent div (which has the class 'galery-images-slide') and then parsing its child divs, collecting the background-image URL from any whose style attribute matches the regex.
So, from your above example, this will output:
['/images/photo/1/20130206/30323/136666697057736800.jpg',
 '/images/photo/1/20130206/30323/136013892671126300.jpg',
 '/images/photo/1/20130206/30323/136666699218876700.jpg']
Now, to download a given image, prepend the site's host name to its URL and fetch it with urlretrieve as shown above.
NOTE:
This requires Python's regex module (re) in addition to BeautifulSoup.
The regex I used is quite naive, but you can adjust it as required to suit your needs.
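Putting it all together, here is a minimal sketch of the download step; base_url is a hypothetical placeholder for whatever host the page actually came from:
from urllib.parse import urljoin
import urllib.request

base_url = 'http://example.com'  # hypothetical; replace with the real host
for url in urls:
    filename = url.rsplit('/', 1)[-1]  # e.g. 136666697057736800.jpg
    urllib.request.urlretrieve(urljoin(base_url, url), filename)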

Related

Exclude span from parsing with requests-html

I need help with parsing a web page with Python and requests-html lib. Here is the <div> that I want to analyze:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>
It renders as:
Text
I need to get Te<b>x</b>t as a result of parsing, without <div> and <span> but with <b> tags.
Using element as a requests-html object, here is what I am getting.
element.html:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>
element.text:
ATe\nx\nt
element.full_text:
AText
Could you please tell me how can I get rid of <span> but still get <b> tags in the parsing result?
Don't overcomplicate it.
How about some simple string processing to get the string between two boundaries:
Use element.html
Take everything after the closing </span>
Take everything before the closing </div>
Like this:
myHtml = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
myAnswer = myHtml.split("</span>")[1]
myAnswer = myAnswer.split("</div>")[0]
print(myAnswer)
output:
Te<b>x</b>t
Seems to work for the sample you provided. If you have more complex requirements, let us know and I'm sure someone can adapt this further.
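If you'd rather lean on a parser than string splitting, here is a small sketch using BeautifulSoup (an assumption on my part; your question uses requests-html, but the element's HTML can be re-parsed the same way): remove the span with decompose(), then take the div's inner HTML with decode_contents():
from bs4 import BeautifulSoup

my_html = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
soup = BeautifulSoup(my_html, 'html.parser')
div = soup.find('div', class_='answer')
div.find('span', class_='marker').decompose()  # remove the <span> and its contents
print(div.decode_contents())  # -> Te<b>x</b>t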

Python code to keep only a set of html tags in a input string

I have text like this:
<div>
<script></script>
<h1>name</h1>
<p> Description </p>
<i> italic </i>
</div>
I want to remove all html tags except h tags and p tags. For this I'm trying to make a more generic method like this:
def strip_tags(text, a_list_of_tags_to_not_remove):
Using the following Beautiful Soup code I can remove all the HTML tags, but it doesn't let me keep a list of tags while removing the others.
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, 'html.parser').text
Can I do this using Beautiful Soup or are there any other python library to do this?
Yes, you can.
You can use .find_all() with a list of the tags you don't care about, then call .unwrap() on each to get rid of the tag while keeping its content.
You can use the find_all function:
soup.find_all(['h1', 'p'])
to get a list of the tags you need, instead of having to find all the tags you don't want.
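Putting the first suggestion into the shape of the strip_tags helper you asked for, a minimal sketch (the default keep list and the special handling of <script> are my assumptions):
from bs4 import BeautifulSoup

def strip_tags(text, keep=('h1', 'h2', 'h3', 'p')):
    soup = BeautifulSoup(text, 'html.parser')
    for tag in soup.find_all(True):  # True matches every tag
        if tag.name in keep:
            continue
        if tag.name == 'script':
            tag.decompose()  # drop <script> and its contents entirely
        else:
            tag.unwrap()     # drop the tag but keep its contents
    return str(soup)

print(strip_tags('<div><script></script><h1>name</h1><p> Description </p><i> italic </i></div>'))
# -> <h1>name</h1><p> Description </p> italic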

Pattern match not working as expected python

I was playing around with pattern matching in the HTML code of different sites and noticed something weird. I used this pattern:
pat = <div class="id-app-orig-desc">.*</div>
I used it on an app page of the Play Store (picked a random app). According to me it should just give what's between the div tags (i.e. the description), but that's not what happens. It gives everything starting from the first occurrence of the pattern and going on till the end of the page, completely ignoring what's in between. Anyone know what's happening?!
And I checked the length of the list returned: it's just 1.
First of all, do not parse HTML with regex; use a specialized tool: an HTML parser. What you are seeing, incidentally, is the greedy .* matching as much as it can, all the way to the last </div> on the page. For example, with BeautifulSoup:
from bs4 import BeautifulSoup
data = """
<div>
<div class="id-app-orig-desc">
Do not try to get me with a regex, please.
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('div', {'class': 'id-app-orig-desc'}).text.strip())
Prints:
Do not try to get me with a regex, please.
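To make the greediness concrete, a tiny demonstration (not a recommendation to parse HTML this way; the sample string is made up):
import re

html = '<div class="id-app-orig-desc">description</div> more page ... <div>footer</div>'
greedy = re.findall(r'<div class="id-app-orig-desc">.*</div>', html, re.DOTALL)
lazy = re.findall(r'<div class="id-app-orig-desc">.*?</div>', html, re.DOTALL)
print(greedy)  # one match running all the way to the LAST </div>
print(lazy)    # ['<div class="id-app-orig-desc">description</div>']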

Select all anchor tags with an href attribute that contains one of multiple values via xpath in lxml / Python

I need to automatically scan lots of html documents for ad banners that are surrounded by an anchor tag, e.g.:
<a href="http://ad_network.com/abc.html">
<img src="ad_banner.jpg">
</a>
As a newbie with xpath, I can select such anchors via lxml like so:
import lxml.html

text = '''
<a href="http://ad_network.com/abc.html">
<img src="ad_banner.jpg">
</a>'''
root = lxml.html.fromstring(text)
print(root.xpath('//a[contains(@href, "ad_network.") or contains(@href, "other_ad_network.")][descendant::img]'))
In the example I check on two different domains: "ad_network." and "other_ad_network.". However, there are over 25 domains to check, and the XPath expression would get terribly long by chaining all those contains-directives with "or". I also fear the expression would be pretty inefficient in terms of CPU resources. Is there some syntax for checking multiple "contains" values?
I could get the links in question via a regex in a single line of code. Yet, although the HTML is normalized by lxml, regex never seems to be a good choice for that kind of work ... Any help appreciated!
It might not be that bad just to do a bunch of 'or's. Build the XPath with Python so that you don't get writer's cramp, and then precompile it. The actual XPath evaluation happens in libxml and should be fast.
from lxml import etree

sites = ['aaa', 'bbb']
contains = ' or '.join('contains(@href, "%s")' % site for site in sites)
anchor_xpath = etree.XPath('//a[%s][descendant::img]' % contains)
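Applied to the snippet from the question, usage would look something like this (sketch; 'text' is the HTML string from the question):
import lxml.html

root = lxml.html.fromstring(text)
for a in anchor_xpath(root):
    print(a.get('href'))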

XPath - get <img> that is not inside <a>

With XPath, how to select all the images that are not inside <a> tag?
For example, here:
<a href='foo'> <img src='bar'/> </a>
<img src='ham' />
I should get "ham" image as a result. To get the first image, I would use \\a\\img. Is there anything like \\not(a)\\img ?
I use python + lxml so python hacks are welcome, if pure xpath would be to hairy.
That's easily done with
//img[not(ancestor::a)]
Read the spec on XPath axes if you want to find out about the other ones besides ancestor.
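For completeness, a minimal sketch running this with lxml on the snippet from the question:
import lxml.html

text = '''
<a href='foo'> <img src='bar'/> </a>
<img src='ham' />
'''
root = lxml.html.fromstring(text)
print([img.get('src') for img in root.xpath('//img[not(ancestor::a)]')])
# -> ['ham']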
