how to find image links by using webscraping

I want to extract the image links from a web page. I tried the code below, but it raises an error.
#!usr/bin/python
import requests
from bs4 import BeautifulSoup
url=raw_input("enter website")
r=requests.get("http://"+ url)
data=r.img
soup=BeautifulSoup(data)
for link in soup.find_all('img'):
    print link.get('src')
Error:
  File "img.py", line 6, in <module>
    data=r.img
AttributeError: 'Response' object has no attribute 'img'

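Your error is that you are trying to read `img` from the `Response` object rather than from the page's source code. A `Response` has no `img` attribute; the HTML source is in `r.text`.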
r = requests.get("http://" + url)
# data = r.img  # wrong: a `Response` object has no `img` attribute
data = r.text   # the page's HTML source is in `text`
# the rest of the code is unchanged
soup = BeautifulSoup(data)
for link in soup.find_all('img'):
    print link.get('src')

Below is a working Python 3 version that uses urllib.request and BeautifulSoup:
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'http://python.org'
with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('img'):
    print('img path as written in the page')
    print(link['src'])
    print('absolute path')
    # urljoin handles relative paths and leaves absolute URLs unchanged
    print(urljoin(url, link['src']))
I hope this helps you :-)
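One detail the loop above does not cover: `link['src']` raises KeyError for an `<img>` tag that has no `src` attribute, whereas `link.get('src')` returns None. A minimal sketch of a helper that skips such tags, assuming the HTML is already available as a string; the function name `image_links` and the sample markup are made up for illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def image_links(html, base_url):
    """Return absolute URLs for every <img> in html that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for img in soup.find_all("img"):
        src = img.get("src")  # .get returns None instead of raising KeyError
        if src:
            links.append(urljoin(base_url, src))
    return links


# Hypothetical page content, used only for illustration.
sample_html = """
<html><body>
  <img src="/static/img/logo.png" alt="logo">
  <img alt="placeholder without a src">
  <img src="https://cdn.example.com/banner.jpg">
</body></html>
"""

for link in image_links(sample_html, "http://python.org"):
    print(link)
```

Because the helper takes a string, it works the same whether the HTML came from `requests`, `urllib.request`, or a saved file.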

Related

Problem Following Web Scraping Tutorial Using Python

I am following this web scraping tutorial and I am getting an error.
My code is as follows:
import requests
URL = "http://books.toscrape.com/" # Replace this with the website's URL
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
print(getURL.status_code)
from bs4 import BeautifulSoup
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
print(images)
imageSources=[]
for image in images:
    imageSources.append(image.get("src"))
print(imageSources)
for image in imageSources:
    webs=requests.get(image)
    open("images/"+image.split("/")[-1], "wb").write(webs.content)
Unfortunately, I am getting an error in the line webs=requests.get(image), which is as follows:
MissingSchema: Invalid URL 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg': No schema supplied. Perhaps you meant http://media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg?
I am totally new to scraping and I don't know what this means. Any suggestions are appreciated.

You need to supply a proper URL in this line:
webs=requests.get(image)
The value media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg is a relative path, not a complete URL: it has no scheme (http://) and no host, which is why requests raises the MissingSchema error.
For example:
full_image_url = f"http://books.toscrape.com/{image}"
This gives you:
http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
Full code:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://books.toscrape.com/").text, 'html.parser')
images = soup.find_all('img')

imageSources = []
for image in images:
    imageSources.append(image.get("src"))

for image in imageSources:
    full_image_url = f"http://books.toscrape.com/{image}"
    webs = requests.get(full_image_url)
    # Save under the image's own file name; `with` closes the file afterwards
    with open(image.split("/")[-1], "wb") as f:
        f.write(webs.content)
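Prefixing the site root by hand works for the home page, but it may produce wrong URLs on other pages, where `src` can start with `../` or already be absolute. A hedged alternative is `urllib.parse.urljoin`, which resolves a `src` against the URL of the page it was found on. The sketch below only builds URLs and makes no requests; apart from the first entry, the `src` values are hypothetical:

```python
from urllib.parse import urljoin

# (page the <img> tag was found on, src value as written in that page)
examples = [
    ("http://books.toscrape.com/",
     "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"),
    # Hypothetical src values below, for illustration only.
    ("http://books.toscrape.com/catalogue/page-2.html",
     "../media/cache/aa/bb/cover.jpg"),
    ("http://books.toscrape.com/", "/static/logo.png"),
    ("http://books.toscrape.com/", "https://cdn.example.com/banner.png"),
]

# Resolve every src against the page it came from.
resolved = [urljoin(page_url, src) for page_url, src in examples]
for url in resolved:
    print(url)
```

In the loop from the answer, this would mean replacing the f-string with `urljoin(page_url, image)`, where `page_url` is whatever URL was passed to `requests.get`.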

BeautifulSoup: No connection adapters were found error

My goal is to search Google for a specified string and save the image from the URL that is found. I have been following online tutorials but keep getting the same InvalidSchema error and do not know why.
from PIL import Image
from bs4 import BeautifulSoup
import requests
text= "animal+crossing"
html_page = requests.get("https://www.google.com/search?q="+text)
soup = BeautifulSoup(html_page.text, 'html.parser')
image = soup.find('img')
img_url = image['src']
img = Image.open(requests.get(img_url, stream = True).raw)
img.save('image.jpg')

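The value you're grabbing in this step, img_url = image['src'], is not an http(s) URL. It is a data: URI with the image bytes embedded inline, and requests has no connection adapter for the data: scheme, hence the InvalidSchema error. Here's the value I see for img_url when I run your code: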
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
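If the embedded image itself is wanted, a base64 `data:` URI can be decoded locally without any HTTP request. This is a minimal sketch using only the standard library; the helper name `load_image_bytes` is made up for illustration, and it handles only the base64 form of `data:` URIs:

```python
import base64
from urllib.parse import urlparse


def load_image_bytes(src):
    """Return (media_type, raw_bytes) for a base64 data: URI, else None."""
    if urlparse(src).scheme != "data":
        return None  # a normal URL; fetch it with requests.get instead
    header, _, payload = src.partition(",")
    if not header.endswith(";base64"):
        return None  # non-base64 data URIs are not handled in this sketch
    media_type = header[len("data:"):].split(";")[0]
    return media_type, base64.b64decode(payload)


img_url = ("data:image/gif;base64,"
           "R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==")
media_type, raw = load_image_bytes(img_url)
print(media_type, raw[:6])
```

The decoded bytes here are a 1x1 GIF, which suggests a placeholder rather than a real search result, so the images of interest are probably referenced elsewhere in the page.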

BeautifulSoup sometimes prints None

I tried to scrape an image from a Reddit post. When I run this code snippet, it sometimes shows me an HTML snippet, but sometimes it prints None (no error occurs). Can anybody tell me why? Here is the code.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.reddit.com/r/programmingmemes/').text
soup = BeautifulSoup(source, 'lxml')
img = soup.find('div', class_='_3Oa0THmZ3f5iZXAQ0hBJ0k')
print(img)

Check the status code of the response:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.reddit.com/r/programmingmemes/')
if source.status_code == 200:
    soup = BeautifulSoup(source.text, 'lxml')
    img = soup.find('div', class_='_3Oa0THmZ3f5iZXAQ0hBJ0k')
    print(img)
else:
    # Report the numeric status code, not the Response object
    print(f"Error (code {source.status_code})")
Also check whether the class name stays constant over time; it may be generated and change between page loads.
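Even with a 200 response, `soup.find` returns None whenever the class is absent from that particular page, so it may help to handle that case explicitly. A minimal sketch, assuming the HTML is passed in as a string; the function name, class names, and sample markup are hypothetical:

```python
from bs4 import BeautifulSoup


def first_image_src(html, class_name):
    """Return the src of the first <img> inside a div with class_name, or None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=class_name)
    if container is None:
        return None  # class not present in this response
    img = container.find("img")
    return img.get("src") if img is not None else None


# Hypothetical responses: one contains the class, one does not.
with_class = '<div class="post"><img src="https://example.com/meme.png"></div>'
without_class = '<div class="blocked">Please try again later</div>'

print(first_image_src(with_class, "post"))
print(first_image_src(without_class, "post"))
```

Saving the raw HTML of a run that printed None and inspecting it is a quick way to see whether the server returned a different page or only a different class name.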

How to save graph / image from CGI website in python?

http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA
On the link above, I am trying to save the "Monthly Weather History Graph" in a Python script. I have tried everything I can think of using BeautifulSoup and urllib.
What I have been able to do is get to the point below, which I can extract, but I can not figure out how to save that graph as an image/HTML/PDF/anything. I am really not familiar with CGI, so any guidance here is much appreciated.
<div id="history-graph-image">
<img src="/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614" alt="Monthly Weather History Graph" />

Get the page with requests, parse the HTML with BeautifulSoup, find the img tag inside div with id="history-graph-image" and get the src attribute value:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.wunderground.com'
url = 'http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA'
response = requests.get(url)
soup = BeautifulSoup(response.content)
image_relative_url = soup.find('div', id='history-graph-image').img.get('src')
image_url = urljoin(base_url, image_relative_url)
print image_url
Prints:
http://www.wunderground.com/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614
Then, download the file with urllib.urlretrieve():
import urllib
urllib.urlretrieve(image_url, "image.gif")
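The snippets above are Python 2. On Python 3 the same two functions live in different modules: `urljoin` is in `urllib.parse` and `urlretrieve` is in `urllib.request`. A minimal sketch of the equivalent, with the relative URL hard-coded from the question instead of scraped; the `save_image` helper is made up for illustration and is not called here because it needs network access:

```python
from urllib.parse import urljoin        # Python 3 home of urlparse.urljoin
from urllib.request import urlretrieve  # Python 3 home of urllib.urlretrieve

base_url = "http://www.wunderground.com"
image_relative_url = ("/cgi-bin/histGraphAll"
                      "?day=17&year=2014&month=11&ID=KMDW&type=1&width=614")

# Build the absolute image URL from the site root and the src value.
image_url = urljoin(base_url, image_relative_url)
print(image_url)


def save_image(url, filename="image.gif"):
    """Download url to filename; performs a network request when called."""
    path, _headers = urlretrieve(url, filename)
    return path


# save_image(image_url)  # uncomment to download; the old URL may no longer resolve
```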

'NoneType' object is not callable beautifulsoup error while using get_text

I wrote this code for extracting all text from a web page:
from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://www.pythonforbeginners.com').read())
print(soup.get_text())
The problem is I get this error:
print(soup.get_text())
TypeError: 'NoneType' object is not callable
Any idea about how to solve this?

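In the old BeautifulSoup 3 package that you are importing, the method is called soup.getText(), i.e. camelCased; get_text() exists only in bs4.
You get TypeError instead of AttributeError because an unknown attribute on a soup object is treated as a search for a child tag with that name. There is no get_text tag in the page, so the lookup returns None, and calling None fails.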
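The same attribute-as-tag-search behaviour can be observed in bs4, where `get_text()` does exist. A small sketch on a hypothetical HTML string; `no_such_method` is a deliberately invalid name:

```python
from bs4 import BeautifulSoup

# Hypothetical page content, used only for illustration.
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# bs4 provides get_text(); separator and strip tidy up the whitespace.
text = soup.get_text(separator=" ", strip=True)
print(text)

# An unknown attribute is treated as a tag search, so it yields None
# rather than raising AttributeError.
missing = soup.no_such_method
print(missing)

try:
    soup.no_such_method()
except TypeError as exc:
    print("TypeError:", exc)
```

Switching the import to `from bs4 import BeautifulSoup` is therefore another way to make the original `soup.get_text()` call work.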

As Markku suggests in the comments, I would recommend breaking your code up.
from BeautifulSoup import BeautifulSoup
import urllib2
URL = "http://www.pythonforbeginners.com"
page = urllib2.urlopen(URL)
html = page.read()
soup = BeautifulSoup(html)
print(soup.get_text())
If it's still not working, throw in some print statements to see what's going on.
from BeautifulSoup import BeautifulSoup
import urllib2
URL = "http://www.pythonforbeginners.com"
print("URL is {} and its type is {}".format(URL, type(URL)))
page = urllib2.urlopen(URL)
print("Page is {} and its type is {}".format(page, type(page)))
html = page.read()
print("html is {} and its type is {}".format(html, type(html)))
soup = BeautifulSoup(html)
print("soup is {} and its type is {}".format(soup, type(soup)))
print(soup.get_text())
