I have written a web-scraping program that goes to an online marketplace like www.tutti.ch, searches for a category keyword, and then downloads all the photos from the search results to a folder.
#! python3
# imageSiteDownloader_stack.py - A program that goes to an online marketplace like
# tutti.ch, searches for a category of photos, and then downloads all the
# resulting images.
import requests, os
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click() # accepts cookies terms
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys('Gartenstuhl') # enters search key word
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # clicks submit button
os.makedirs('tuttiBilder', exist_ok=True) # creates new folder
images = browser.find_elements(By.TAG_NAME, 'img') # stores every img element in a list
for im in images:
    imageURL = im.get_attribute('src') # get the URL of the image
    print('Downloading image %s...' % (imageURL))
    res = requests.get(imageURL) # downloads the image
    res.raise_for_status()
    imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
    for chunk in res.iter_content(100000): # writes to the image file
        imageFile.write(chunk)
    imageFile.close()
print('Done.')
browser.quit()
My program crashes at line 26 with an exception. It downloads the first couple of photos correctly, but then it suddenly crashes.
Looking for solutions on Stack Overflow, I found this post: Requests : No connection adapters were found for, error in Python3
The answer to that post suggests that the problem arises because of a newline character in the URL.
I checked the source URLs in the HTML code of the photos that couldn't be downloaded. They seem to be OK.
The problem seems to be either that the browser.find_elements() method parses the 'src' attribute values incorrectly, or that the .get_attribute() method cannot fetch some of the URLs correctly. Instead of getting something like
https://c.tutti.ch/images/23452346536.jpg
the method gives back strings like
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
Of course, this is not a valid URL that the requests.get() method can use to download the image. I did some research and found out that this might be a base64 string...
Why does the .get_attribute() method return a base64 string in some cases? Can I prevent it from doing so? Or do I have to convert it to a normal string?
Update: Here is another approach that uses BeautifulSoup for parsing instead of WebDriver. (This code is not working yet; the data URLs are still a problem.)
import requests, sys, os, bs4
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click()
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys(sys.argv[1:])
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # https://www.tutorialspoint.com/how-to-locate-element-by-partial-id-match-in-selenium
os.makedirs('tuttiBilder', exist_ok=True)
url = browser.current_url
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#Check for errors from here
images = soup.select('div[style] > img')
for im in images:
    imageURL = im.get('src') # get the URL of the image
    print('Downloading image %s...' % (imageURL))
    res = requests.get(imageURL) # downloads the image
    res.raise_for_status()
    imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
    for chunk in res.iter_content(100000): # writes to the image file
        imageFile.write(chunk)
    imageFile.close()
print('Done.')
browser.quit()
The program crashes because you attempt to download a file (image) using a base64-encoded string, which is not a valid image URL. The reason these base64 strings show up in your images list is that each image (in <img> tags) initially appears as a base64 string, and once it is loaded, the src value changes to a valid image URL. You can check that by opening the DevTools in your browser while on https://...ganze-schweiz?q=Gartenstuhl and searching for "base64" in the Elements section of the DevTools; by moving to the next image in the search results, using the arrow buttons, you'll notice the behaviour described above. This is also the reason (as shown in your cmd window snippet, and as I tested myself as well) that only 3 to 5 images are found and downloaded: those few images are the ones appearing at the top of the page, and they are successfully loaded and given a valid image URL when the page is accessed, whereas the remaining <img> tags still contain a base64 string.
So, the first step is, once the search results have appeared, to slowly scroll down the page so that every image on the page gets loaded and given a valid URL. You can achieve that by using the method described here. You can adjust the speed as you wish, as long as it allows items/images to load properly.
The second step is to ensure that only valid URLs are passed to the requests.get() method. Although every base64 string will be replaced by a valid URL thanks to the fix above, there might still be invalid image URLs in the list; in fact, there seems to be one (which is not related to the items) starting with https://bat.bing.com/action/0?t..... Thus, it is prudent to check that the requested URLs are valid image URLs before attempting to download them. You can do that with the str.endswith() method, looking for strings ending with specific suffixes (extensions), such as ".png" and ".jpg". If a string in the images list ends with one of these extensions, you can proceed with downloading the image. A working example is given below (please note that it downloads the images appearing on the first page of search results; if you need further image results, you can extend the program to navigate to the next page and repeat the same steps).
Update 1
The code below has been updated so that you can obtain further results by navigating to the following pages and downloading the images there. You can set the number of "next pages" from which you would like to get results by adjusting the next_pages_no variable.
import requests, os
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
suffixes = (".png", ".jpg")
next_pages_no = 3
browser = webdriver.Firefox() # Opens Firefox webbrowser
#browser = webdriver.Chrome() # Opens Chrome webbrowser
wait = WebDriverWait(browser, 10)
os.makedirs('tuttiBilder', exist_ok=True)
def scroll_down_page(speed=40):
    current_scroll_position, new_height = 0, 1
    while current_scroll_position <= new_height:
        current_scroll_position += speed
        browser.execute_script("window.scrollTo(0, {});".format(current_scroll_position))
        new_height = browser.execute_script("return document.body.scrollHeight")
def save_images(images):
    for im in images:
        imageURL = im.get_attribute('src') # gets the URL of the image
        if imageURL.endswith(suffixes):
            print('Downloading image %s...' % (imageURL))
            res = requests.get(imageURL, stream=True) # downloads the image
            res.raise_for_status()
            imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
            for chunk in res.iter_content(1024): # writes to the image file
                imageFile.write(chunk)
            imageFile.close()
def get_first_page_results():
    browser.get('https://www.tutti.ch/')
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click() # accepts the cookie terms
    wait.until(EC.presence_of_element_located((By.XPATH, '//form//*[name()="input"][@data-automation="li-text-input-search"]'))).send_keys('Gartenstuhl') # enters the search keyword
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # clicks the submit button
    scroll_down_page() # scroll down the page slowly for the images to load
    images = browser.find_elements(By.TAG_NAME, 'img') # stores every img element in a list
    save_images(images)
def get_next_page_results():
    wait.until(EC.visibility_of_element_located((By.XPATH, '//button//*[name()="svg"][@data-testid="NavigateNextIcon"]'))).click()
    scroll_down_page() # scroll down the page slowly for the images to load
    images = browser.find_elements(By.TAG_NAME, 'img') # stores every img element in a list
    save_images(images)
get_first_page_results()
for _ in range(next_pages_no):
    get_next_page_results()
print('Done.')
browser.quit()
Update 2
As per your request, here is an alternative approach to the problem, using Python requests to download the HTML content of a given URL and the BeautifulSoup library to parse that content in order to get the image URLs. In the HTML content, both base64 strings and actual image URLs are included (there are exactly as many base64 strings as image URLs). Thus, you can use the same approach as above and check their suffixes before downloading them. A complete working example is below (adjust the page range in the for loop as you wish).
import requests
from bs4 import BeautifulSoup as bs
import os
suffixes = (".png", ".jpg")
os.makedirs('tuttiBilder', exist_ok=True)
def save_images(imageURLS):
    for imageURL in imageURLS:
        if imageURL.endswith(suffixes):
            print('Downloading image %s...' % (imageURL))
            res = requests.get(imageURL, stream=True) # downloads the image
            res.raise_for_status()
            imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
            for chunk in res.iter_content(1024): # writes to the image file
                imageFile.write(chunk)
            imageFile.close()
def get_results(page_no, search_term):
    response = requests.get('https://www.tutti.ch/de/li/ganze-schweiz?o=' + str(page_no) + '&q=' + search_term)
    soup = bs(response.content, 'html.parser')
    images = soup.findAll("img")
    imageURLS = [image['src'] for image in images]
    save_images(imageURLS)
for i in range(1, 4): # get results from page 1 to page 3
    get_results(i, "Gartenstuhl")
Update 3
To clear things up: the base64 strings are all the same, i.e. R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7. You can check this by saving the received HTML content to a file (to do this, add the code below to the get_results() method of the second solution), opening it with a text editor and searching for "base64".
with open("page.html", 'wb') as f:
    f.write(response.content)
If you enter the above base64 string into an online "base64-to-image" converter, then download and open the image with a graphics editor (such as Paint), you will see that it is a 1px image (usually called a "tracking pixel"). This tracking pixel is used in the web beacon technique to verify that a user has accessed some content; in your case, a product in the list.
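You can also verify this in Python instead of an online converter; here is a minimal sketch that decodes the string above and inspects it with Pillow:
import base64
from io import BytesIO
from PIL import Image

pixel_b64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
img = Image.open(BytesIO(base64.b64decode(pixel_b64)))  # decode and load the image in memory
print(img.format, img.size)  # GIF (1, 1), i.e. a 1x1 tracking pixel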
The base64 string is not an invalid URL that somehow turns into a valid one. It is an encoded image string, which can be decoded to recover the image. Thus, in the first solution using Selenium, when you scroll down the page, those base64 strings are not converted into valid image URLs; rather, they tell the website that you have accessed some content, and the website then removes/hides them, which is why they do not show up in the results. The images (and hence the image URLs) appear as soon as you scroll down to a product, because a common technique called "image lazy loading" is used (to improve performance, user experience, etc.). Lazy loading instructs the browser to defer loading images that are off-screen until the user scrolls near them.
In the second solution, since requests.get() is used to retrieve the HTML content, the base64 strings are still in the HTML document, one per product. Again, those base64 strings are all the same and are 1px images used for the purpose mentioned earlier, so you don't need them in your results and they should be ignored. Both solutions above download all the product images present in the webpage; you can check that by looking into the tuttiBilder folder after running the programs.
If you still want to save those base64 images (which is pointless, as they are all identical and not useful), replace the save_images() method in the second solution (i.e., the one using BeautifulSoup) with the one below. Make sure to import the extra libraries (as shown below). This will save all the base64 images, along with the product images, in the same tuttiBilder folder, assigning them unique identifiers as filenames (as they don't carry a filename).
import re
import base64
import uuid
def save_images(imageURLS):
    for imageURL in imageURLS:
        if imageURL.endswith(suffixes):
            print('Downloading image %s...' % (imageURL))
            res = requests.get(imageURL, stream=True) # downloads the image
            res.raise_for_status()
            imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
            for chunk in res.iter_content(1024): # writes to the image file
                imageFile.write(chunk)
            imageFile.close()
        elif imageURL.startswith("data:image/"):
            base64string = re.sub(r"^.*?/.*?,", "", imageURL)
            image_as_bytes = str.encode(base64string) # convert string to bytes
            recovered_img = base64.b64decode(image_as_bytes) # decode the base64 string
            filename = os.path.join('tuttiBilder', str(uuid.uuid4()) + ".png")
            with open(filename, "wb") as f:
                f.write(recovered_img)
Can I suggest not using Selenium? There is a backend API that serves the data for each page. The only tricky thing is that requests to the API need to carry a certain UUID hash, which is in the HTML of the landing page. So you can grab it when you go to the landing page, then use it to sign your subsequent API calls. Here is an example that loops through the pages and the images for each post:
import requests
import re
import os
search = 'Gartenstuhl'
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = f'https://www.tutti.ch/de/li/ganze-schweiz?q={search}'
step = requests.get(url,headers=headers)
print(step)
uuids = re.findall( r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}',step.text)
print(f'tutti hash code: {uuids[0]}') #used to sign requests to api
os.makedirs('tuttiBilder', exist_ok=True)
for page in range(1,10):
    api = f'https://www.tutti.ch/api/v10/list.json?aggregated={page}&limit=30&o=1&q={search}&with_all_regions=true'
    new_headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'x-tutti-hash': uuids[0],
        'x-tutti-source': 'web latest-staging'
    }
    resp = requests.get(api, headers=new_headers).json()
    for item in resp['items']:
        for image in item['image_names']:
            image_url = 'https://c.tutti.ch/images/' + image
            pic = requests.get(image_url)
            with open(os.path.join('tuttiBilder', os.path.basename(image)), 'wb') as f:
                f.write(pic.content)
            print(f'Saved: {image}')
That is not a URL of any kind. The actual image data is stored right in there, hence it is base64-encoded. Try copying it into your browser's address bar (starting with the data: part) and you will see the image.
What happened is that the image is not hosted at a separate URL; it is embedded into the website, and your browser merely decoded that data to render the image. If you want the raw image data, base64-decode everything after the ;base64, part.
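For example, here is a minimal sketch of decoding such a data URL in Python (the output filename is just illustrative):
import base64

data_url = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
header, encoded = data_url.split(";base64,", 1)  # keep everything after the ';base64,' part
raw_bytes = base64.b64decode(encoded)            # the raw image data

with open("embedded.gif", "wb") as f:            # extension taken from the 'data:image/gif' header
    f.write(raw_bytes)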
I've got a Python web scraper that crawls through a bunch of TIFF pages online and converts each to PDF, but I can't figure out how to combine all the converted PDFs into one and write it to my computer.
import img2pdf, requests
outPDF = []
for pgNum in range(1, 20):
    tiff = requests.get("http://url-to-tiff-file.com/page=" + str(pgNum)).content
    pdf = img2pdf.convert(tiff)
    outPDF.append(pdf)
with open("file", "wb") as f:
    f.write(''.join(outPDF))
I get the following error when I run it:
f.write(''.join(outPDF))
TypeError: sequence item 0: expected str instance, bytes found
Update
If you go to http://oris.co.palm-beach.fl.us/or_web1/details_img.asp?doc_id=23543456&pg_num=1 and open the web developer console in your browser, you can see a form tag with a bunch of ".tif" URLs in a number of hidden input tags.
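For reference, one way I could collect those hidden-input values with requests and BeautifulSoup would be something like the sketch below; the filtering on ".tif" is just a guess at how the values are laid out:
import requests
from bs4 import BeautifulSoup

url = "http://oris.co.palm-beach.fl.us/or_web1/details_img.asp?doc_id=23543456&pg_num=1"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Collect the values of hidden <input> tags that look like TIFF URLs.
tiff_urls = [inp.get("value") for inp in soup.find_all("input", type="hidden")
             if inp.get("value", "").lower().endswith(".tif")]
print(tiff_urls)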
img2pdf has some quirks when it comes to converting TIFF and PNG files. The code below solves some of the potential issues in your code, because it uses Pillow to reformat the image files for processing with img2pdf.
import img2pdf
from PIL import Image
image_list = []
test_images = ['image_01.tiff', 'image_02.tiff', 'image_03.tiff']
for image in test_images:
    im = Image.open(f'{image}').convert('RGB')
    im.save(f'mod_{image}')
    image_list.append(f'mod_{image}')

with open('test.pdf', 'wb') as f:
    letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
    layout = img2pdf.get_layout_fun(letter)
    f.write(img2pdf.convert(image_list, layout_fun=layout))
I modified your code to use the approach above, but I cannot test it, because I don't know which website you're querying. So please let me know if something fails and I will troubleshoot it.
import img2pdf
import requests
from PIL import Image
from io import BytesIO
outPDF = []
for pgNum in range(1, 20):
    tiff = requests.get("http://url-to-tiff-file.com/page=" + str(pgNum)).content
    im = Image.open(BytesIO(tiff)).convert('RGB')
    buffer = BytesIO()
    im.save(buffer, format='TIFF')  # re-save the reformatted image to an in-memory buffer
    outPDF.append(buffer.getvalue())

with open("file.pdf", "wb") as f:
    letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
    layout = img2pdf.get_layout_fun(letter)
    f.write(img2pdf.convert(outPDF, layout_fun=layout))
UPDATED ANSWER
After you provided the actual URL of the target website, I determined that the best way to accomplish your task is to go another route. Based on your use case, you want the PDF file that is produced from all the hidden TIFF files. The source website will generate that PDF for you, without your having to download all those TIFF files.
Here is the code to get that generated PDF and download it to your system.
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
capabilities = DesiredCapabilities().CHROME
chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")
download_directory = os.path.abspath('chrome_pdf_downloads')
prefs = {"download.default_directory": download_directory,
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)
url_main = 'http://oris.co.palm-beach.fl.us/or_web1/details_img.asp?doc_id=23543456&pg_num=1'
driver.get(url_main)
WebDriverWait(driver, 60)
driver.find_element_by_xpath("//input[#name='button' and #onclick='javascript:ValidateAndSubmit(this.form)']").submit()
If you still want to get the TIFF files, please let me know and I will look into downloading and processing them to produce the PDF file that the code above is obtaining.
Are you trying to create a multi-page PDF out of multiple single-page PDFs? Your use of join() is certainly not correct.
Take a look at this tutorial. It's a couple of years old but certainly still valid:
https://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/
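For what it's worth, here is a minimal sketch of merging several single-page PDFs with PyPDF2; the filenames are placeholders:
from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
for path in ["page_1.pdf", "page_2.pdf", "page_3.pdf"]:  # placeholder single-page PDFs
    merger.append(path)  # append each PDF in order

merger.write("combined.pdf")  # one multi-page PDF
merger.close()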
I am working on a machine learning project and need a LOT of pictures for the data set that will train my program. The website https://its.txdot.gov/ITS_WEB/FrontEnd/default.html?r=SAT&p=San%20Antonio&t=cctv has pictures that are updated every six minutes. I need to save the image at LP 1604 at Kyle Seale Pkwy, but can't figure out how. I'm trying to right-click on the image using ActionChains to save it. Here's what I have so far:
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()  # assuming a Chrome driver here; any WebDriver works
driver.get('https://its.txdot.gov/ITS_WEB/FrontEnd/default.html?r=SAT&p=San%20Antonio&t=cctv')
time.sleep(5) #to let the site load
driver.find_element_by_id('LP-1604').click() #to get to the 1604 tab
time.sleep(5) #to let the site load
pic = driver.find_element_by_id('LP 1604 at Kyle Seale Pkwy__SAT')
action = ActionChains(driver)
action.context_click(pic)
The drop-down menu that usually pops up when you right-click is not showing up, and I feel like there has to be a better way to do this than right-clicking. I know how to wrap this in a loop that executes every six minutes, so I don't need help there; it's just the downloading-the-image part. One of the problems I run into is that all the images are served under the same URL, and most examples out there rely on URLs. Any suggestions would be helpful.
I think this could help you save the images to your PC:
from PIL import Image
def save_image_on_disk(driver, element, path):
    location = element.location
    size = element.size
    # saves a screenshot of the entire page
    driver.save_screenshot(path)
    # uses the PIL library to open the image in memory
    image = Image.open(path)
    left = location['x']
    top = location['y'] + 0
    right = location['x'] + size['width']
    bottom = location['y'] + size['height'] + 0
    image = image.crop((left, top, right, bottom)) # defines crop points
    image = image.convert('RGB')
    image.save(path, 'png') # saves the new cropped image
def your_main_method():
    some_element_img = driver.find_element_by_xpath('//*[@id="id-of-image"]')
    save_image_on_disk(driver, some_element_img, 'my-image.png')
Regarding the timing, you can use time.sleep(6*60).
The image data is located in the src property of the currentSnap element. It is encoded in base64, so you need to capture it and decode it. Then, using PIL, you can do anything you like with the image.
Also, you can use Selenium's built-in wait functions instead of hard-coding sleeps. In this case the image sometimes loads even after the image element is present, so there is still an additional short sleep in the code to allow it to load.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from PIL import Image
from io import BytesIO
import base64
import re
import time
# Max time to wait for page to load
timeout=10
driver = webdriver.Chrome()
driver.get('https://its.txdot.gov/ITS_WEB/FrontEnd/default.html?r=SAT&p=San%20Antonio&t=cctv')
# Wait for element to load before clicking
element_present = EC.presence_of_element_located((By.ID, 'LP-1604'))
WebDriverWait(driver, timeout).until(element_present)
driver.find_element_by_id('LP-1604').click() #to get to the 1604 tab
# Wait for the image to load before capturing data
element_present = EC.presence_of_element_located((By.ID, 'currentSnap'))
WebDriverWait(driver, timeout).until(element_present)
# Sometimes the image still loads after the element is present, give it a few more seconds
time.sleep(4)
# Get base64 encoded image data from src
pic = driver.find_element_by_id('currentSnap').get_attribute('src')
# Strip prefix
pic = re.sub('^data:image/.+;base64,', '', pic)
# Load image file to memory
im = Image.open(BytesIO(base64.b64decode(pic)))
# Write to disk
im.save('image.jpg')
# Display image in Jupyter
im
# Open in your default image viewer
im.show()
I have been able to capture screenshots of some elements as PNGs, such as the one produced with the following code:
from selenium import webdriver
from PIL import Image
from io import BytesIO
from os.path import expanduser
from time import sleep
# Define url and driver
url = 'https://www.formula1.com/'
driver = webdriver.Chrome('chromedriver')
# Go to url, scroll down to right point on page and find correct element
driver.get(url)
driver.execute_script('window.scrollTo(0, 4100)')
sleep(4) # Wait a little for page to load
element = driver.find_element_by_class_name('race-list')
location = element.location
size = element.size
png = driver.get_screenshot_as_png()
driver.quit()
# Store image as bytes, crop it and save to desktop
im = Image.open(BytesIO(png))
im = im.crop((200, 150, 700, 725))
path = expanduser('~/Desktop/')
im.save(path + 'F1-info.png')
This outputs a cropped screenshot of the race list, which is what I want, but not exactly how I want it. I needed to manually input some scrolling down, and as I couldn't get the element I wanted (class='race step-1 step-2 step-3') I had to manually crop the image too.
Any better solutions?
In case someone is wondering, this is how I managed it in the end. First I found and scrolled to the right part of the page like this:
element = browser.find_element_by_css_selector('.race.step-1.step-2.step-3')
browser.execute_script('arguments[0].scrollIntoView()', element)
browser.execute_script('window.scrollBy(0, -80)')
and then cropped the image
im = im.crop((200, 80, 700, 560))
I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I'm finding a template instead, e.g. "{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print("Finding...")
for tag in soup.find_all('formatPrice median'):
    print(tag.text)
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance something like Handlebars), this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses JavaScript to alter what it received and to create new DOM elements. urllib does the requesting part like a browser, but not the template-rendering part. A good description of the issues can be found here. That article discusses three main solutions:
1. parse the AJAX JSON directly (a sketch of this is shown below)
2. use an offline JavaScript interpreter to process the request (SpiderMonkey, Crowbar)
3. use a browser automation tool (Splinter)
This answer provides a few more suggestions for option 3, such as Selenium or Watir. I've used Selenium for automated web testing and it's pretty handy.
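For option 1, the idea is to open the browser's network tab, find the XHR endpoint the page calls, and request that JSON directly. A minimal sketch, in which the endpoint URL and the response field are purely hypothetical placeholders:
import requests

# Hypothetical endpoint as discovered in the browser's network tab; the real URL
# and response fields depend entirely on the site being scraped.
endpoint = "https://example.com/api/quicklook?typeid=34"
data = requests.get(endpoint).json()
print(data["median"])  # assumed field name, shown for illustration only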
EDIT
From your comments it looks like it is a Handlebars-driven site. I'd recommend Selenium and Beautiful Soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
print tag.text
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
I used Selenium + Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')`
Building off another answer: I had a similar issue. wget and curl no longer work well for getting the content of a web page; they are particularly broken with dynamic and lazy content. Using Chrome (or Firefox, or the Chromium version of Edge) lets you deal with redirects and scripting.
The code below will launch an instance of Chrome, increase the page-load timeout to 5 seconds, and navigate the browser instance to a URL. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Here is an example of grabbing dynamic content and taking screenshots of anchor ("a" tag) HTML objects, i.e. hyperlinks:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of the object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2 - x1, y2 - y1])
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [
    (el, x1, y1, x2, y2, width, height)
    for el, x1, y1, x2, y2, width, height in locations
    if not (
        ## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
        x2 <= x1 or y2 <= y1 or x2 < 0 or y2 < 0
        ## Further restrict if you expect the objects to be around a specific size
        ## or width<200 or height<100
    )
]
for el,x1,y1,x2,y2,width,height in tqdm(locations[:10]):
    try:
        print('-'*100, f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I used a Mac for the original answer, and Ubuntu + the Windows 11 preview via WSL2 after updating. Chrome ran from the Linux side, with an X service on Windows rendering the UI.
Regarding responsibility, please respect robots.txt on each site.
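If you want to check programmatically whether a path may be fetched before scraping it, here is a small sketch using the standard library's urllib.robotparser; the site URL and user-agent string are placeholders:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.org/robots.txt")  # placeholder site
rp.read()

# True if the given user agent may fetch this path according to robots.txt
print(rp.can_fetch("my-scraper-bot", "https://www.example.org/some/page"))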
I know this is an old question, but sometimes there is a better solution than using heavyweight Selenium.
This requests module for Python comes with JS support (in the background it is still Chromium), and you can still use BeautifulSoup as usual.
Though sometimes, if you have to click elements or similar, I guess Selenium is the only option.
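Assuming the module referred to is requests-html (which matches the description of JS support backed by Chromium), a minimal sketch would be:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get("https://www.example.org")  # placeholder URL
r.html.render()  # runs the page's JavaScript in a headless Chromium

# The rendered HTML can then be parsed with BeautifulSoup as usual.
soup = BeautifulSoup(r.html.html, "html.parser")
print(soup.title)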