I am trying to test my website using Selenium, and I want to check whether every image has a non-empty alt attribute. How can I check this?
<img src="/media/images/biren.png" alt="N. Biren" class ="img-fluid" >
I'm not well versed in Selenium, but this can be done easily using requests with bs4 (simple web scraping).
Please find example code below:
import requests, bs4
url = 'HTML URL HERE!'
# get the url html txt
page = requests.get(url).text
# parse the html txt
soup = bs4.BeautifulSoup(page, 'html.parser')
# get all img tags
for image in soup.find_all('img'):
    alt = image.get('alt')
    if alt:
        # print alt text if it exists and is not empty
        print(alt)
    else:
        # print the complete img tag if not
        print(image)
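For a check that returns the offending images instead of printing them, here is a minimal sketch. It swaps bs4 for the standard library's html.parser, so it is not the answer's method, and it treats a blank alt the same as a missing one. The class and function names and the two extra image paths in the sample are made up for illustration.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> whose alt is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # alt is None when the attribute is absent or written without a value
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src"))


def images_without_alt(html):
    """Return the src values of images that lack a usable alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


# Hypothetical sample: only the first tag comes from the question.
sample = (
    '<img src="/media/images/biren.png" alt="N. Biren" class="img-fluid">'
    '<img src="/media/images/empty.png" alt="">'
    '<img src="/media/images/none.png">'
)
print(images_without_alt(sample))
```

With Selenium, the string to check could come from driver.page_source, so a test could assert that images_without_alt(driver.page_source) is an empty list.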
Why does my code result in an empty list? It's as if the page is too big and it doesn't parse it all... could it be the case?
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.youtube.com/nitroparkour')
soup = BeautifulSoup(source.text, 'lxml')
doc = soup.findAll("a", id="video-title")
print(doc)
If you right-click on the page and choose "View page source", you will see the HTML that the server actually sends. Search it for any of the video titles, e.g. "Super Mario Maker", and you will find them stored as JSON inside a script tag in the HTML.
Then why do you see the videos inside a tag with id="video-title" when you "Inspect element" in the dev tools?
That's because YouTube uses JavaScript to render the page.
Here is how to capture that JSON; you will need to explore it and figure out which data you need.
import requests, json, re
from bs4 import BeautifulSoup
source = requests.get('https://www.youtube.com/nitroparkour')
soup = BeautifulSoup(source.text, 'lxml')
unparsed_js = soup.find(string=re.compile('var ytInitialData ='))
js = json.loads(unparsed_js.replace('var ytInitialData = ', '').rstrip(';'))
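The replace-and-strip step above depends on the script text containing nothing but the assignment and a trailing semicolon. As a hedged alternative, this sketch uses json.JSONDecoder.raw_decode, which parses one JSON value and ignores whatever follows it. The function name and the sample script string are made up for illustration; the real page's script content may differ.

```python
import json

MARKER = "var ytInitialData ="


def extract_initial_data(script_text):
    """Return the JSON object assigned to ytInitialData in script_text."""
    marker_pos = script_text.index(MARKER)
    # raw_decode does not skip whitespace, so start exactly at the opening brace
    json_start = script_text.index("{", marker_pos)
    data, _end = json.JSONDecoder().raw_decode(script_text, json_start)
    return data


# Hypothetical stand-in for the text of the script tag.
sample_script = 'var ytInitialData = {"title": "Super Mario Maker", "tags": {"n": 1}}; var other = 1;'
print(extract_initial_data(sample_script)["title"])
```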
I'm working on a project and I'm trying to extract the pictures' URL from a website. I'm a noob at this so please bear with me. Based on the HTML code, the class of the pictures that I want is "fotorama__img". However, when I execute my code, it doesn't seem to work. Anyone knows why that's the case? Also, how come the src attribute doesn't contain the whole URL, just a part of it? Example: the link to the image is https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg but the src attribute of the img tag is "/files_SYS/images/System/sysThumb/SYS-120U-TNR_main.png".
Here is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR")
soup = BeautifulSoup(page.content,'lxml')
images = soup.find_all("img", {"class": "fotorama__img"})
for image in images:
    print(image.get("src"))
And here is the picture of the HTML code for the page
Thank you for your help!
The class is added dynamically via JavaScript, so beautifulsoup doesn't see it. To extract the images from this site, you can do:
import requests
from bs4 import BeautifulSoup
page = requests.get(
    "https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR"
)
soup = BeautifulSoup(page.content, "lxml")
images = [
    "https://www.supermicro.com" + a["href"]
    for a in soup.select(".fotorama > a")
]
print(*images, sep="\n")
Prints:
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_main.png
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_angle.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_top.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_front.jpg
https://www.supermicro.com/files_SYS/images/System/SYS-120U-TNR_callout_rear.jpg
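As for why the src attribute holds only part of the URL: a value starting with "/" is a relative URL, which the browser resolves against the address of the page it came from. A small sketch of doing the same resolution with the standard library's urljoin, using the page URL and src value from the question:

```python
from urllib.parse import urljoin

# Page URL and relative src taken from the question.
page_url = "https://www.supermicro.com/en/products/system/Ultra/1U/SYS-120U-TNR"
src = "/files_SYS/images/System/sysThumb/SYS-120U-TNR_main.png"

# A leading "/" replaces the whole path but keeps the scheme and host.
absolute = urljoin(page_url, src)
print(absolute)
```

This is an alternative to concatenating the host by hand, and it also handles src values that are already absolute.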
This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading.
This is what the webpage looks like after clicking on the heading.
I want to get the names of all the places displayed on the webpage, but I am having trouble because the URL doesn't change when the filter is applied.
I am using python urllib for this.
Here is my code:
from urllib.request import urlopen
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4 (Beautiful Soup 4), a Python module for extracting data from web pages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the text, such as elements with a certain tag, you can also use bs4:
soup.find_all('p') # getting all p tags
soup.find_all('p', class_='Title') # getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA
On the link above, I am trying to save the "Monthly Weather History Graph" in a python script. I have tried everything I can think of using BeautifulSoup and urllib.
What I have been able to do is get to the point below, which I can extract, but I cannot figure out how to save that graph as an image/HTML/PDF/anything. I am really not familiar with CGI, so any guidance here is much appreciated.
<div id="history-graph-image">
<img src="/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614" alt="Monthly Weather History Graph" />
Get the page with requests, parse the HTML with BeautifulSoup, find the img tag inside div with id="history-graph-image" and get the src attribute value:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.wunderground.com'
url = 'http://www.wunderground.com/history/airport/KMDW/2014/11/17/MonthlyHistory.html?req_city=NA&req_state=NA&req_statename=NA'
response = requests.get(url)
soup = BeautifulSoup(response.content)
image_relative_url = soup.find('div', id='history-graph-image').img.get('src')
image_url = urljoin(base_url, image_relative_url)
print image_url
Prints:
http://www.wunderground.com/cgi-bin/histGraphAll?day=17&year=2014&month=11&ID=KMDW&type=1&width=614
Then, download the file with urllib.urlretrieve():
import urllib
urllib.urlretrieve(image_url, "image.gif")
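In Python 3, urllib.urlretrieve lives at urllib.request.urlretrieve (and urlparse.urljoin at urllib.parse.urljoin). A minimal sketch of the download step under Python 3 follows; save_image is a hypothetical helper name, not part of any library.

```python
from urllib.request import urlretrieve


def save_image(image_url, filename):
    """Download image_url to filename and return the path it was saved to."""
    # urlretrieve returns a (local_filename, headers) tuple
    saved_path, _headers = urlretrieve(image_url, filename)
    return saved_path
```

Calling save_image(image_url, "image.gif") with the image_url printed above would then write the graph to image.gif, assuming the server still serves that address.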
My code only returns an empty string, and I have no idea why.
import urllib2
def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse
    start = page.find('<a img=')
    end = page.find('>', start)
    img = page[start:end]
    return img
It would only return the first image it finds, so it's not a very good image scraper; that said, my primary goal right now is just to be able to find an image. I'm unable to.
Consider using BeautifulSoup to parse your HTML:
from BeautifulSoup import BeautifulSoup
import urllib
url = 'http://www.google.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for img in soup.findAll('img'):
    print img['src']
You should use a library for this and there are several out there, but to answer your question by changing the code you showed us...
Your problem is that you are trying to find images, but images don't use the <a ...> tag. They use the <img ...> tag. Here is an example:
<img src="smiley.gif" alt="Smiley face" height="42" width="42">
What you should do is change your start = page.find('<a img=') line to start = page.find('<img ') like so:
def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse
    start = page.find('<img ')
    end = page.find('>', start)
    img = page[start:end+1]
    return img
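Since the question notes that this only returns the first image, here is a hedged sketch of the same string-search idea extended to collect every tag. It is written for Python 3 and works on an already decoded HTML string (in Python 3, urlopen(...).read() returns bytes that need decoding first). The function name and the second tag in the sample are made up for illustration, and plain string searching stays fragile compared with a real parser.

```python
def find_img_tags(page):
    """Return every '<img ...>' tag found in the HTML string page."""
    tags = []
    start = page.find('<img ')
    while start != -1:
        end = page.find('>', start)
        if end == -1:
            break  # unterminated tag, stop searching
        tags.append(page[start:end + 1])
        # continue searching after the tag just collected
        start = page.find('<img ', end)
    return tags


# First tag is the example from the answer; the second is hypothetical.
sample = (
    '<p><img src="smiley.gif" alt="Smiley face" height="42" width="42"></p>'
    '<img src="b.png">'
)
print(find_img_tags(sample))
```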
Article on screen scraping with ruby:
http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
It's not about scraping images, but it's a good article and may help.
Extracting the image information this way is not a good idea. There are several better options, depending on your knowledge and your motivation to learn something new:
http://scrapy.org/ is a very good framework for extracting data from web pages. As it looks like you're a beginner, it might be a bit overkill.
Learn regular expressions to extract the information: http://docs.python.org/library/re.html and Learning Regular Expressions
Use http://www.crummy.com/software/BeautifulSoup/ to parse data from the result of page.read().
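To illustrate the regular-expression option from the list above, here is a rough sketch in Python 3. The pattern and the sample HTML are assumptions made for illustration; the pattern only handles quoted src values and would also match attributes such as data-src, which is one reason a parser is usually the safer choice.

```python
import re

# Matches an <img ...> tag and captures the quoted value of its src attribute.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def image_sources(html):
    """Return the src value of each <img> tag found in html."""
    return IMG_SRC.findall(html)


# Hypothetical sample HTML.
sample = '<img src="smiley.gif" alt="Smiley face"><IMG class="x" SRC=\'/a/b.png\'>'
print(image_sources(sample))
```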
Some instructions that might be of help:
Use Google Chrome. Set the mouse over the image and right click. Select "Inspect element". That will open a section where you'll be able to see the html near the image.
Use Beautiful Soup to parse the html:
import urllib2
from BeautifulSoup import BeautifulSoup
request = urllib2.Request(url)  # url is the address of the page to scrape
response = urllib2.urlopen(request)
html = response.read()
soup = BeautifulSoup(html)
imgs = soup.findAll("img")
items = []
for img in imgs:
    print img['src'] #print the image location
    items.append(img['src']) #store the locations for downloading later