Problem Following Web Scraping Tutorial Using Python - python

I am following this web scraping tutorial and I am getting an error.
My code is as follows:
import requests
URL = "http://books.toscrape.com/" # Replace this with the website's URL
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
print(getURL.status_code)
from bs4 import BeautifulSoup
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
print(images)
imageSources=[]
for image in images:
    imageSources.append(image.get("src"))
print(imageSources)
for image in imageSources:
    webs=requests.get(image)
    open("images/"+image.split("/")[-1], "wb").write(webs.content)
Unfortunately, I am getting an error in the line webs=requests.get(image), which is as follows:
MissingSchema: Invalid URL 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg': No schema supplied. Perhaps you meant http://media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg?
I am totally new to scraping and I don't know what this means. Any suggestion is appreciated.

You need to supply a proper URL in this line:
webs=requests.get(image)
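The value media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg is a relative path with no scheme (such as http://), so requests cannot use it as a URL; hence the MissingSchema error. Prefix it with the site's base URL.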
For example:
full_image_url = f"http://books.toscrape.com/{image}"
This gives you:
http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
Full code:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://books.toscrape.com/").text, 'html.parser')
images = soup.find_all('img')
imageSources = []
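for image in images:
    imageSources.append(image.get("src"))
for image in imageSources:
    # src is a relative path, so prefix the site root to get a full URL
    full_image_url = f"http://books.toscrape.com/{image}"
    webs = requests.get(full_image_url)
    # save under the file name at the end of the path
    with open(image.split("/")[-1], "wb") as f:
        f.write(webs.content)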
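Plain string concatenation works here because the page being scraped sits at the site root. As an alternative that is not part of the original answer, the sketch below uses urllib.parse.urljoin from the standard library to resolve each src against the URL of the page it was found on. The helper name absolute_url, the catalogue page URL, the "../" path, and the example.com URL are illustrative assumptions.

```python
from urllib.parse import urljoin

PAGE_URL = "http://books.toscrape.com/"


def absolute_url(page_url, src):
    """Resolve an img src (relative or absolute) against the page it came from."""
    return urljoin(page_url, src)


# The relative path from the error message, resolved against the front page.
print(absolute_url(PAGE_URL, "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"))

# Hypothetical case: a page in a subdirectory whose src climbs back up with "../".
print(absolute_url("http://books.toscrape.com/catalogue/page-2.html", "../media/cache/a.jpg"))

# An src that is already absolute is returned unchanged.
print(absolute_url(PAGE_URL, "https://example.com/cover.jpg"))
```

The result of absolute_url can be passed straight to requests.get in place of the f-string.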

Related

BeautifulSoup: No connection adapters were found error

My goal is to search Google for a specified string and save the image from the URL that is found. I have been following online tutorials but keep getting the same InvalidSchema error and do not know why.
from PIL import Image
from bs4 import BeautifulSoup
import requests
text= "animal+crossing"
html_page = requests.get("https://www.google.com/search?q="+text)
soup = BeautifulSoup(html_page.text, 'html.parser')
image = soup.find('img')
img_url = image['src']
img = Image.open(requests.get(img_url, stream = True).raw)
img.save('image.jpg')
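The value that you're grabbing in this step, img_url = image['src'], is not an http:// or https:// URL. It is a data: URI with the image bytes embedded as base64, and requests has no connection adapter for that scheme, which is why you get InvalidSchema. Here's the value I see for img_url when I run your code: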
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
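If the embedded image itself is wanted, a data: URI can be decoded locally instead of being passed to requests. Below is a minimal sketch using only the standard library, assuming the URI is base64-encoded like the value above; decode_data_uri is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
import base64


def decode_data_uri(uri):
    """Split a base64 data: URI into (media_type, raw_bytes)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    header, payload = uri[len("data:"):].split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("only base64-encoded data: URIs are handled here")
    media_type = header[: -len(";base64")]
    return media_type, base64.b64decode(payload)


img_url = "data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
media_type, data = decode_data_uri(img_url)
print(media_type, data[:6])  # media type and the GIF file signature
```

The decoded bytes here are a 1x1 pixel GIF, that is, a placeholder, so the actual search-result images have to be located elsewhere in the page.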

Can't extract src attribute from "img" tag with BeautifulSoup

I'm working on a project and I'm trying to extract the pictures' URLs from a website. I'm a noob at this so please bear with me. Based on the HTML code, the class of the pictures that I want is "fotorama__img". However, when I execute my code, it prints nothing. Does anyone know why that's the case? Also, how come the src attribute doesn't contain the whole URL, just a part of it? Example: the link to the image is https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg but the src attribute of the img tag is "/files_SYS/images/System/sysThumb/SYS-120U-TNR_main.png".
Here is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR")
soup = BeautifulSoup(page.content,'lxml')
images = soup.find_all("img", {"class": "fotorama__img"})
for image in images:
    print(image.get("src"))
And here is the picture of the HTML code for the page
Thank you for your help!
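The class is added dynamically via JavaScript, so BeautifulSoup doesn't see it in the HTML that requests downloads. The src is only a partial URL because it is a path relative to the site root, so the site's origin has to be prepended. To extract the images from this site, you can do: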
import requests
from bs4 import BeautifulSoup
page = requests.get(
    "https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR"
)
soup = BeautifulSoup(page.content, "lxml")
images = [
    "https://www.supermicro.com" + a["href"]
    for a in soup.select(".fotorama > a")
]
print(*images, sep="\n")
Prints:
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_main.png
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_angle.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_top.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_rear.jpg

Having issues scraping the image from a website using bs4

Hey I can't seem to scrape the images from this website
https://www.nike.com/gb/w/new-mens-shoes-3n82yznik1zy7ok
I am using the following code
product.find('img', {'class': 'css-1fxh5tw product-card__hero-image'})['src']]
It returns this
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Your code is not wrong. Some of the img tags on the page carry a data: URI placeholder in src; skipping those leaves the real image URLs:
import requests
from bs4 import BeautifulSoup
url = "https://www.nike.com/gb/w/new-mens-shoes-3n82yznik1zy7ok"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
images = soup.find_all('img', {'class': 'css-1fxh5tw product-card__hero-image'}, src=True)
for i in images:
    if 'data:image' not in i['src']:
        print(i['src'])
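The substring test can also be written as a check on the URL scheme, which additionally skips any other value that is not an http or https address. This is a small sketch under that assumption; is_http_url is a hypothetical helper, and the example.com URL in the sample list is made up for illustration.

```python
from urllib.parse import urlparse


def is_http_url(src):
    """Return True only for absolute http(s) URLs."""
    return urlparse(src).scheme in ("http", "https")


sources = [
    "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
    "https://example.com/shoe.png",  # hypothetical image URL
]
downloadable = [s for s in sources if is_http_url(s)]
print(downloadable)
```

Relative paths also fail this check, so any relative src should be resolved to an absolute URL before filtering.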

Read data from URL / XML with python

This is my first question.
I'm trying to learn some Python, and I have this problem:
how can I get data from this URL, which returns its content as XML?
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/v01/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup)
Output:
[image: Output]
But I want to access data of this type, for example the value of the <RUTEmisor> element:
[image: linkurl_invoice]
I hope you can advise me on the code and on how to read XML documents.
By examining the URL you gave, it seems that the data is actually held a few links away at the following URL: http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49
As such, you can access it directly as follows:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find('RUTEmisor').text)
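For reading XML in general, the standard library's xml.etree.ElementTree is another option once the document has been downloaded. The sketch below parses a made-up sample string rather than the real response: the element nesting, the namespace, and the value are assumptions for illustration, and only the RUTEmisor tag name comes from the question. The "{*}" wildcard, which matches the tag in any namespace or none, requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded document; the real file's
# structure, namespace, and values may differ.
SAMPLE_XML = """
<DTE xmlns="http://example.com/dte">
  <Documento>
    <Encabezado>
      <Emisor>
        <RUTEmisor>11111111-1</RUTEmisor>
      </Emisor>
    </Encabezado>
  </Documento>
</DTE>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# ".//" searches all descendants; "{*}" ignores whatever namespace is declared.
node = root.find(".//{*}RUTEmisor")
rut_emisor = node.text if node is not None else None
print(rut_emisor)
```

With a real response, ET.fromstring(document.content) would take the place of the sample string.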

How to save graph / image from CGI website in python?

http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA
On the link above, I am trying to save the "Monthly Weather History Graph" in a Python script. I have tried everything I can think of using BeautifulSoup and urllib.
What I have been able to do is get to the point below, which I can extract, but I cannot figure out how to save that graph as an image/HTML/PDF/anything. I am really not familiar with CGI, so any guidance here is much appreciated.
<div id="history-graph-image">
<img src="/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614" alt="Monthly Weather History Graph" />
Get the page with requests, parse the HTML with BeautifulSoup, find the img tag inside div with id="history-graph-image" and get the src attribute value:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.wunderground.com'
url = 'http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# the src is relative to the site root, so join it with the base URL
image_relative_url = soup.find('div', id='history-graph-image').img.get('src')
image_url = urljoin(base_url, image_relative_url)
print(image_url)
Prints:
http://www.wunderground.com/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614
Then, download the file with urllib.request.urlretrieve():
import urllib.request
urllib.request.urlretrieve(image_url, "image.gif")
