I am trying to automate downloading imgur files, and for this purpose I am using BeautifulSoup to get the link. To be honest, I am pretty lost as to why this doesn't work, because according to my research it should:
soup = BeautifulSoup("http://imgur.com/ha0WYYQ")
imageUrl = soup.select('.image a')[0]['href']
The code above just returns an empty list, so indexing it raises an error. I tried to modify it, but to no avail. Any and all input is appreciated.
<div class="post-image">
<a href="//i.imgur.com/ha0WYYQ.jpg" class="zoom">
<img src="//i.imgur.com/ha0WYYQ.jpg" alt="Frank in his bb8 costume" itemprop="contentURL">
</a>
</div>
This is the image tag. The class name post-image is a single word and cannot be split, so the selector needs the full class name:
imageUrl = soup.select('.post-image a')[0]['href']
A shortcut for selecting a single tag is select_one:
imageUrl = soup.select_one('.post-image a')['href']
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
There are a few things wrong with your approach:
BeautifulSoup does not accept a URL, so you will need to fetch the HTML with another library first; and
Your selector is invalid; based on the markup you posted it should be .post-image a.
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://imgur.com/ha0WYYQ').read()
soup = BeautifulSoup(r, 'lxml')
soup.select('.post-image a')[0]['href']
Or more elegant:
with urlopen('http://imgur.com/ha0WYYQ') as f:
    r = f.read()
    soup = BeautifulSoup(r, 'lxml')
    result = soup.select('.post-image a')[0]['href']
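If you prefer requests, here is a minimal sketch of the same idea (assuming requests is installed and that imgur still serves this markup); the None check avoids the IndexError you get when the selector matches nothing:
import requests
from bs4 import BeautifulSoup

# Fetch the page first, then hand the HTML text to BeautifulSoup
response = requests.get('http://imgur.com/ha0WYYQ')
soup = BeautifulSoup(response.text, 'lxml')

tag = soup.select_one('.post-image a')
if tag is not None:
    print(tag['href'])  # e.g. //i.imgur.com/ha0WYYQ.jpg
else:
    print('No .post-image link found; the page layout may have changed')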
How can I get the src attribute of this link tag with BS4 library?
Right now I'm using the code below, but I can't get the value I'm after.
<li class="active" id="server_0" data-embed="<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>"><a><span><i class="fa fa-eye"></i></span> <strong>vk</strong></a></li>
I want this value: src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'
This is my code. I can access ['data-embed'], but I don't know how to extract the link from it:
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper()
access = "https://w.mycima.cc/play.php?vid=d4d8322b9"
response = scraper.get(access)
doc2 = bs(response.content, "lxml")
container2 = doc2.find("div", id='player').find("ul", class_="list_servers list_embedded col-sec").find("li")
link = container2['data-embed']
print(link)
Result
<Response [200]>
https://w.mycima.cc/play.php?vid=d4d8322b9
<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>
Process finished with exit code 0
From the Beautiful Soup documentation:
You can access a tag’s attributes by treating the tag like a
dictionary
They give the example:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
# 'boldest'
Reference and further details,
see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So, for your case specifically, you could write
print(link.find("iframe")['src'])
If link turns out to be plain text rather than a soup object (which may be the case for your particular example, based on the comments), then you can resort to string searching, regex, or more BeautifulSoup parsing, for example:
link = """<Response [200]>https://w.mycima.cc/play.php?vid=d4d8322b9<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'></iframe>"""
iframe = re.search(r"<iframe.*>", link)
if iframe:
soup = BeautifulSoup(iframe.group(0),"html.parser")
print("src=" + soup.find("iframe")['src'])
I'm trying to extract image URLs from this code:
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
How can I find the URLs in data-src?
I'm using Beautiful Soup and the find function, but I have no idea how to extract the links because I don't see an img tag as usual...
Thank you for your time in advance
If you can't use an HTML parser for whatever reason, then you can use regex.
import re
text = '''
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
'''
parsed = re.search('(?<=data-src=").*(?=" )', text).group(0)
print(parsed)
You can try the following:
from bs4 import BeautifulSoup
html = """
<div class="theme-screenshot one attachment-theme-screenshot size-theme-screenshot wp-post-image loaded" data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
"""
soup = BeautifulSoup(html, "html.parser")
url = soup.select_one(
    "div.theme-screenshot.one.attachment-theme-screenshot.size-theme-screenshot.wp-post-image.loaded"
).get("data-src")
print(url)
This will return:
https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg
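The long chain of class names works but is brittle; since the data-src attribute itself identifies the element, a CSS attribute selector may be simpler. A small sketch, assuming bs4 4.7 or newer (which supports CSS selectors via the soupsieve backend), reusing the html string from above:
soup = BeautifulSoup(html, "html.parser")

# select_one with an attribute selector: any <div> that carries a data-src attribute
div = soup.select_one("div[data-src]")
if div is not None:
    print(div["data-src"])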
Documentation for BeautifulSoup(bs4) can be found at:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading (see the first screenshot).
This is how the webpage looks after clicking on it (second screenshot).
I want to get the names of all the places displayed on the webpage, but I'm having trouble because the URL doesn't change when the filter is applied.
I am using python urllib for this.
Here is my code:
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4. bs4 (Beautiful Soup) is a Python module that lets you pull specific things out of web pages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something that is not the text, maybe something with a certain tag you can also use bs4:
soup.find_all('p')  # Getting all p tags
soup.find_all('p', class_='Title')  # Getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
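For instance, a rough sketch of that pattern, reusing the html string fetched in your code; the class name below is only a placeholder, so inspect the page to find the one that actually wraps each place name:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, features='html5lib')  # 'html' is the page text fetched in your code

# 'listing_title' is a placeholder class name, not taken from the real page;
# use the browser inspector to find the actual class around each place name
for tag in soup.find_all('a', class_='listing_title'):
    print(tag.get_text(strip=True))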
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
This is strange. I am trying to scrape a website where the closing tag for <p> is <\/p> instead of just </p>. As a result, every time I call soup.find_all('p') it returns nothing. There is no such problem with a or div, since both are well-formed tags with proper </a> and </div> closing tags. I don't have any clue how to solve this.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
url = 'http://www.gmanetwork.com/news/story/656223/money/economy/iphone-worries-weigh-on-wall-street'
page = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
data = page.read()
soup = BeautifulSoup(data, 'html.parser')
print(soup.find_all('p'))
EDIT
As suggested, I found headless browsers like Splinter a little bit frustrating since the module needs a browser dependency (correct me if I'm wrong).
Well, you can do something like this:
import re
a = "<p> This is a text <\\/p>"
match = re.match("""^.*<p>(.*)<\\\\/p>.*$""", a).group(1)
print(match)
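Escaped closing tags like <\/p> usually mean the markup is embedded inside a JavaScript string in the page source. One alternative to the regex is to strip that escaping from the raw bytes before parsing, so that find_all('p') works again; whether the paragraphs you want actually live inside that escaped markup depends on the page:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'http://www.gmanetwork.com/news/story/656223/money/economy/iphone-worries-weigh-on-wall-street'
page = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
data = page.read()

# Turn <\/p> back into </p> before handing the document to the parser
cleaned = data.replace(b'<\\/', b'</')
soup = BeautifulSoup(cleaned, 'html.parser')
print(soup.find_all('p'))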
My code only returns an empty string, and I have no idea why.
import urllib2

def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read()  # Gives HTML to parse
    start = page.find('<a img=')
    end = page.find('>', start)
    img = page[start:end]
    return img
It would only return the first image it finds, so it's not a very good image scraper; that said, my primary goal right now is just to be able to find an image. I'm unable to.
Consider using BeautifulSoup to parse your HTML:
from BeautifulSoup import BeautifulSoup
import urllib
url = 'http://www.google.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for img in soup.findAll('img'):
    print img['src']
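That snippet uses the old BeautifulSoup 3 package and Python 2 print syntax. If you are on Python 3, the same idea looks roughly like this with the bs4 package (a sketch, not tied to the original answer):
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.google.com').read()
soup = BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img'):
    print(img.get('src'))  # .get() returns None instead of raising if src is missing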
You should use a library for this and there are several out there, but to answer your question by changing the code you showed us...
Your problem is that you are trying to find images, but images don't use the <a ...> tag. They use the <img ...> tag. Here is an example:
<img src="smiley.gif" alt="Smiley face" height="42" width="42">
What you should do is change your start = page.find('<a img=') line to start = page.find('<img ') like so:
def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read()  # Gives HTML to parse
    start = page.find('<img ')
    end = page.find('>', start)
    img = page[start:end+1]
    return img
Article on screen scraping with ruby:
http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
It's not about scraping images, but it's a good article and may help.
Extracting the image information this way is not a good idea. There are several better options, depending on your knowledge and your motivation to learn something new:
http://scrapy.org/ is a very good framework for extracting data from web pages. As it looks like you're a beginner, it might be a bit of overkill.
Learn regular expressions to extract the information: http://docs.python.org/library/re.html and Learning Regular Expressions (a rough sketch follows below).
Use http://www.crummy.com/software/BeautifulSoup/ to parse data from the result of page.read().
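For the regex route, a rough sketch of pulling src values out of <img> tags; regex is fragile against real-world HTML, so treat this as an illustration rather than a robust solution:
import re

html = '<img src="smiley.gif" alt="Smiley face" height="42" width="42">'

# findall returns the captured group, i.e. just the src value of each <img> tag
for src in re.findall(r'<img[^>]+src="([^"]+)"', html):
    print(src)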
Some instructions that might be of help:
Use Google Chrome. Hover the mouse over the image and right-click. Select "Inspect element". That will open a panel where you'll be able to see the HTML near the image.
Use Beautiful Soup to parse the html:
from BeautifulSoup import BeautifulSoup
import urllib2

request = urllib2.Request(url)
response = urllib2.urlopen(request)
html = response.read()
soup = BeautifulSoup(html)
imgs = soup.findAll("img")
items = []
for img in imgs:
    print img['src']  # print the image location
    items.append(img['src'])  # store the locations for downloading later
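To actually download the collected locations afterwards, one rough sketch (in Python 3 syntax, so adapt the names to your own code; relative or protocol-relative src values need to be joined with the page URL first):
import os
from urllib.parse import urljoin
from urllib.request import urlretrieve

for i, src in enumerate(items):
    absolute = urljoin(url, src)  # resolve relative and protocol-relative srcs
    filename = os.path.basename(absolute.split('?')[0]) or 'image_%d' % i
    urlretrieve(absolute, filename)  # save the image next to the script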