I want to download Bing search images using Python code.
Example URL: https://www.bing.com/images/search?q=sketch%2520using%20iphone%2520students
In my project I generate some words in Python, and my code builds a Bing image search URL from them, like the example above. The next step is to download all of the images shown on that search results page to my local desktop using Python.
To download an image, you need to make a request to the actual image URL, the one that ends in .png, .jpg, and so on.
Bing provides an "m" attribute on each <a> element that stores the needed data in JSON format. The image URL is stored under its "murl" key, so you can parse it out and download the file afterward.
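As a rough illustration, parsing that attribute looks like this (the JSON below is a shortened, made-up sample; the real attribute contains more keys):
import json

# hypothetical, trimmed value of the "m" attribute from one <a class="iusc"> element
m_attribute = '{"murl": "https://example.com/full-size-photo.jpg", "turl": "https://tse1.mm.bing.net/th?id=OIP.abc", "t": "sketch"}'

data = json.loads(m_attribute)  # the attribute value is plain JSON
print(data["murl"])             # the full-resolution image URL to download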
To download all of the images to your computer, you can use either of two approaches:
# bs4
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", "wb") as file:
            file.write(image.content)
# urllib
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")
In the first case, you use the with open() context manager to save the image locally. In the second case, you use the urllib.request.urlretrieve method from the urllib.request module.
Also, make sure you pass a user-agent in the request headers so the request looks like a "real" user visit. The default requests user-agent is python-requests, and websites can tell that such a request most likely comes from a script. Check what your user-agent is.
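One quick way to check what user-agent your script actually sends is to ask an echo service such as httpbin.org (just an example service; any header-echoing endpoint works):
import requests

# without custom headers this prints something like {'user-agent': 'python-requests/2.28.1'}
print(requests.get("https://httpbin.org/user-agent", timeout=30).json())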
Note: an error might occur with the urllib.request.urlretrieve method when a request hits a captcha or something else that returns an unsuccessful status code. The bigger problem is that it's hard to check the response code with urlretrieve, whereas requests provides a status_code attribute for exactly that.
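If you stick with urlretrieve, a minimal sketch of one way to handle those failures is to catch the exceptions it raises instead of checking a status code (img_url here is assumed to come from the loop above):
import urllib.request as req
import urllib.error

try:
    req.urlretrieve(img_url, "images/example.jpg")
except urllib.error.HTTPError as error:
    print(f"Skipped {img_url}: HTTP {error.code}")   # e.g. 403 when blocked or captcha-walled
except urllib.error.URLError as error:
    print(f"Skipped {img_url}: {error.reason}")      # connection or DNS problems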
Full code:
from bs4 import BeautifulSoup
import requests, lxml, json, os

query = "sketch using iphone students"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,
    "first": 1
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}

response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")

os.makedirs("images", exist_ok=True)  # make sure the output folder exists

for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", "wb") as file:
            file.write(image.content)
Using urllib.request.urlretrieve.
from bs4 import BeautifulSoup
import requests, lxml, json, os
import urllib.request as req

query = "sketch using iphone students"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,
    "first": 1
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}

response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")

os.makedirs("images", exist_ok=True)  # make sure the output folder exists

for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")
Edit your code to find the designated image URL, then use urllib.request to download it:
import urllib.request as req
imgurl ="https://i.ytimg.com/vi/Ks-_Mh1QhMc/hqdefault.jpg"
req.urlretrieve(imgurl, "image_name.jpg")
Related
requests.get() does not seem to be returning the expected bytes for Wikipedia image URLs, such as https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg:
import wikipedia
import requests
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link)
req.content
b'<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n*...
Most websites block requests that come in without a valid browser User-Agent. Wikimedia is one of them.
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
res = requests.get('https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg', headers=headers)
res.content
which will give you the expected output.
I tried your code and got "Error: 403, Forbidden". Wikipedia requires a User-Agent header in the request.
import wikipedia
import requests
headers = {
    'User-Agent': 'My User Agent 1.0'
}
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link, headers=headers, stream=True)
req.content
For the user agent, you should probably supply something a bit more descriptive than the placeholder I use in my example. Maybe the name of your script, or just the word "script" or something like that.
I tested it and it works fine; you will get back the image as expected.
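For example, a more descriptive (made-up) user agent that names your script and a contact address could look like this:
headers = {
    # hypothetical descriptive user agent: script name, version, and a way to contact you
    'User-Agent': 'MyImageScript/1.0 (contact: you@example.com)'
}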
I am trying to download a picture from "https://prnt.sc", but the URL of the image is very long and I also can't find that URL when printing r.content.
This is my code for getting the HTML:
import requests
import random
import string

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36 OPR/63.0.3368.71'
}

register_data = {"path": "luk111"}
print(register_data)

with requests.Session() as s:
    url = 'https://prnt.sc/luk111'
    r = s.post(url, json=register_data, headers=headers)
    print(r)
    print(r.content)
The whole url has around 81954 characters, so I need a better way to download it. Any ideas?
This is my code for downloading the .jpg image:
import random
import urllib.request

def download_web_image(url):
    name = "xddd"
    full_name = "screen/" + str(name) + ".jpg"
    urllib.request.urlretrieve(url, full_name)

xd = input("paste url")
download_web_image(xd)
That long URL on the page is not a real URL but the image's data encoded in base64.
I turned off JavaScript in the web browser and checked the page without it, because requests and BeautifulSoup can't run JavaScript.
Without JavaScript, the page shows a normal URL for the image, and I had no problem downloading it.
import requests
from bs4 import BeautifulSoup as BS

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36 OPR/63.0.3368.71'
}

with requests.Session() as s:
    url = 'https://prnt.sc/luk111'
    r = s.get(url, headers=headers)

    soup = BS(r.content, 'html.parser')
    img_url = soup.find('img', {'id': 'screenshot-image'})['src']

    r = s.get(img_url, headers=headers)
    with open('temp.png', 'wb') as f:
        f.write(r.content)
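If you do end up with the JavaScript-rendered page and the img src turns out to be a base64 data URL rather than a normal link, you can decode it directly instead of requesting it; a rough sketch, reusing soup from the snippet above:
import base64

src = soup.find('img', {'id': 'screenshot-image'})['src']

if src.startswith('data:'):
    encoded = src.split(',', 1)[1]          # drop the "data:image/png;base64" prefix
    with open('temp.png', 'wb') as f:
        f.write(base64.b64decode(encoded))  # write the embedded image bytes directly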
What I'm trying to do is acquire a product ID from a script tag inside an HTML document. Unfortunately, StockX doesn't offer a public API, so I have to scrape the data from an HTML document. Here are my attempts at it (both work):
Attempt 1
import requests
PRODUCT_URL = 'https://stockx.com/supreme-steiff-bear-heather-grey'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url=PRODUCT_URL, headers=HEADERS).text
PRODUCT_ID = response[response.find('"product":{"id":"')+17:].partition('"')[0]
PRODUCT_NAME = response[response.find('<title>')+7:].partition('<')[0]
Attempt 2
from bs4 import BeautifulSoup
import requests
# Gets HTML document
PRODUCT_URL = 'https://stockx.com/supreme-steiff-bear-heather-grey'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html_content = requests.get(url=PRODUCT_URL, headers=HEADERS)
# Make BeautifulSoup parser from HTML document
soup = BeautifulSoup(html_content.text, 'html.parser')
# Get product name
PRODUCT_NAME = soup.title.text
# Get script tag data with product ID
js_content = soup.find_all('script', type='text/javascript')[9].text
PRODUCT_ID = js_content[50:86]
print(PRODUCT_ID)
Output:
884861d2-abe6-4b09-90ff-c8ad1967ac8c
However, I feel like there is a better approach to this problem instead of just "hard-coding" in where to find the ID.
If you view the page source of the product URL and search for "product":{"id":, you will find that the ID is inside a nested dictionary assigned to a JavaScript object inside a <script> tag.
Is there any better way to go about obtaining the product ID from an HTML document?
EDIT: Here is the content of html_content: https://gist.github.com/leecharles50/9b6b11fb458767cabcfc0ed4f961984d
My first idea was to parse the JavaScript inside that <script> tag. There is a package called slimit that can do this; see for example this answer.
However, in your case there is an even easier solution. I searched the DOM for the ID you gave (884861d2-abe6-4b09-90ff-c8ad1967ac8c) and found an occurrence inside the following tag:
<script type="application/ld+json">
{
[...]
"sku" : "884861d2-abe6-4b09-90ff-c8ad1967ac8c",
[...]
}
</script>
which contains valid JSON. Simply find the tag with BeautifulSoup:
tag = soup('script', {'type': 'application/ld+json'})[-1]
and decode the JSON within:
import json
product_id = json.loads(tag.text)['sku']
As you can see from the product URL, this has been tested on multiple product pages.
import requests
import json
from bs4 import BeautifulSoup
#product_url = 'https://stockx.com/supreme-steiff-bear-heather-grey'
product_url = 'https://stockx.com/air-jordan-1-retro-high-shattered-backboard-3'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html_content = requests.get(url=product_url, headers=headers)
soup = BeautifulSoup(html_content.text, 'lxml')
script_tags = soup.find_all('script', attrs={'type': 'application/ld+json'})
product_info_text = script_tags[-1].text
# contains a bunch of useful info
product_info_json = json.loads(product_info_text, strict=False)
print(json.dumps(product_info_json, indent=4))
product_sku = product_info_json['sku']
print(product_sku)
I will try to implement the use of a SoupStrainer.
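For reference, a rough sketch of what that might look like; a SoupStrainer limits parsing to the matching tags, which can speed things up on large pages (this assumes the same application/ld+json tag as above):
import json
import requests
from bs4 import BeautifulSoup, SoupStrainer

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html = requests.get('https://stockx.com/supreme-steiff-bear-heather-grey', headers=headers).text

# parse only <script type="application/ld+json"> tags instead of the whole document
only_ld_json = SoupStrainer('script', attrs={'type': 'application/ld+json'})
soup = BeautifulSoup(html, 'lxml', parse_only=only_ld_json)

product_sku = json.loads(soup.find_all('script')[-1].text, strict=False)['sku']
print(product_sku)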
Here is an alternative using regex:
import requests
import re

product_uuid = re.compile(r'"product":{"id":"(\w{8}-(?:\w{4}-){3}\w{12}){1}"')
product_name = re.compile(r'<title>(.*)</title>')

url = 'https://stockx.com/supreme-steiff-bear-heather-grey'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

content = requests.get(url, headers=headers)

if content.ok:
    PRODUCT_NAME = product_name.findall(content.text)[0]
    PRODUCT_UUID = product_uuid.findall(content.text)[0]
    print(PRODUCT_NAME)
    print(PRODUCT_UUID)
Slightly hard-coded, but easy to adjust, and it depends only on requests and the standard re module.
If you want to scrape at large volumes, you can use the Piloterr API.
I am trying to learn how to use BS4, but I ran into this problem. I am trying to find the text in the Google Search results page that shows the number of results for the search, but I can't find the text 'results' either in the html_page or in the soup HTML parser. This is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
print(b'results' in html_page)
print('results' in soup)
Both prints return False. What am I doing wrong, and how do I fix it?
EDIT:
Turns out the language of the webpage was a problem, adding &hl=en to the URL almost fixed it.
url = 'https://www.google.com/search?q=stack&hl=en'
The first print is now True but the second is still False.
The requests library returns response.content in a raw (bytes) format. So, to answer your second question, replace res.content with res.text.
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')
print('results' in soup)
Output: True
Keep in mind that Google is usually very active in handling scrapers. To avoid getting blocked or captcha'ed, you can add a user agent to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
It's not that res.content has to be changed to res.text, as 0xInfection mentioned; it would still return the result.
However, res.content gives you the body as bytes; gzip and deflate transfer-encodings are automatically decoded by requests into readable content (correct me in the comments or edit this answer if I'm wrong).
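To make the bytes/str distinction concrete, a quick sketch (example.com is used purely for illustration):
import requests

res = requests.get("https://example.com", timeout=30)

print(type(res.content))          # <class 'bytes'> - raw response body
print(type(res.text))             # <class 'str'>   - decoded using the detected encoding
print(b"Example" in res.content)  # bytes need a bytes pattern
print("Example" in res.text)      # str works with a plain string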
The real issue is that there's no user-agent specified, so Google will eventually block the request: the default requests user-agent is python-requests, and Google understands that it's most likely a bot or a script. Learn more about request headers.
Pass a user-agent into the request headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
Full code:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "fus ro dah definition",  # query
    "gl": "us",                    # country to make request from
    "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params).content
soup = BeautifulSoup(response, 'lxml')

number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 114,000 results
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want, without having to figure out how to extract it or how to bypass blocks from Google or other search engines, since that is already done for the end user.
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)
# 112000
Disclaimer: I work for SerpApi.
For some pages, when I use BeautifulSoup, it returns nothing... just blank pages.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use BeautifulSoup on any other site except this one, and I don't know why...
This URL requires certain headers to be passed with the request. Pass the headers parameter below while requesting the URL and you will get the HTML.
HTML = requests.get(URL, headers=headers).content
where
headers = {
    "method": "GET",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "Host": "gall.dcinside.com",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As far as I can see, this site is using cookies. You can see the headers in the browser's developer tools. You can get the cookie as follows:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Cookie": ck,
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}

req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some website servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check to see which User-Agent is being sent by the browser. In this case as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example, see this list of Firefox User-Agent strings:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)
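If cookies are also needed, a sketch of one way to handle them (not specific to this site) is to let a requests.Session store and resend them automatically while still using a browser-like User-Agent:
from bs4 import BeautifulSoup
import requests

url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
headers = {'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}

with requests.Session() as session:
    session.headers.update(headers)  # every request from this session sends the custom User-Agent
    first = session.get(url)         # any Set-Cookie headers are stored in session.cookies
    second = session.get(url)        # ...and sent back automatically on this request

soup = BeautifulSoup(second.content, "html.parser")
print(soup.title)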