requests.get() does not seem to be returning the expected bytes for Wikipedia image URLs, such as https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg:
import wikipedia
import requests
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link)
req.content
b'<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n*...
Many websites block requests that come in without a browser-like (or otherwise valid) User-Agent header, and Wikimedia is one of them.
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
res = requests.get('https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg', headers=headers)
res.content
which will give you the expected output.
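If the goal is to save the image, you can then write the bytes to a file (a minimal sketch; the filename is just an illustration):
with open('beach.jpg', 'wb') as outfile:
    outfile.write(res.content)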
I tried your code and got an "Error: 403, Forbidden." response. Wikipedia requires a User-Agent header in the request.
import wikipedia
import requests
headers = {
'User-Agent': 'My User Agent 1.0'
}
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link, headers=headers, stream=True)
req.content
For the user agent, you should probably supply something a bit more descriptive than the placeholder I use in my example. Maybe the name of your script, or just the word "script" or something like that.
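For instance, something along these lines (the name and contact URL here are only placeholders, not a required format):
headers = {'User-Agent': 'MyImageFetcher/1.0 (https://example.com/contact)'}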
I tested it and it works fine. You will get back the image as you are expecting.
Related
I'm new to Python. I have to download some images from the web and save them to my local file system. I've noticed that the response content does not contain any image data.
The problem only occurs with this specific url, with every other image url the code works fine.
I know the easiest solution would be to just use another URL, but I'd still like to ask whether someone has had a similar problem.
import requests
url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'
filename = "bitcoin.png"
response = requests.get(url, stream = True)
response.raw.decode_content = True
with open(f'images/{filename}', 'wb') as outfile:
    outfile.write(response.content)
First, look at the content of the response with response.text; you'll see the website blocked your request:
Please turn JavaScript on and reload the page.
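A quick way to confirm the block before changing anything (a minimal sketch):
print(response.status_code)  # an error status such as 403 is likely
print(response.text[:200])   # shows the "Please turn JavaScript on..." page instead of image bytes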
Then, you can try to check if changing the User-Agent of your request fixes your issue.
response = requests.get(
url,
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
},
stream = True
)
If it doesn't, you may need to fetch the data with something that can run JavaScript, such as Selenium or Puppeteer.
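As a rough illustration only (assumes the selenium package and Chrome/ChromeDriver are installed; whether it is enough depends on how the site's protection works), one approach is to let a real browser pass the JavaScript check and then reuse its cookies and User-Agent with requests:
from selenium import webdriver
import requests

url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'

driver = webdriver.Chrome()
driver.get(url)
# collect whatever cookies the browser ended up with after the JS check
cookies = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}
user_agent = driver.execute_script("return navigator.userAgent")
driver.quit()

response = requests.get(url, headers={'User-Agent': user_agent}, cookies=cookies, stream=True)
with open('images/bitcoin.png', 'wb') as outfile:
    outfile.write(response.content)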
I'm trying to get the page source for an Instagram post using the code below. Funnily enough, it worked a few times, but then it said that I wasn't logged in (which changes the entire source code). Is there any way to get the source code you would see while logged in, without using automation tools like Selenium, because that would be pretty slow?
import requests
import urllib
from urllib.request import urlopen, URLError, Request
def getSource(rawLink):
    req = Request(
        rawLink,
        data=None,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
        }
    )
    with urlopen(req) as response:
        source = response.read().decode('utf-8')
    return source
link = "https://www.instagram.com/p/COF47v4HoC9/"
source = getSource(link)
print(source[0:100])
As you can see, the output line <html lang="en" class="no-js not-logged-in client-root"> indicates that I'm not logged in.
I have a question re: requests module in Python.
So far I have been using this to scrape and it's been working well.
However, when I run it against one particular website (code below), it just doesn't complete; the Jupyter notebook cell shows [*] forever.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest adding headers like the ones below to speed it up, but that doesn't work for me either:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url = url, headers = headers)
Not sure what's going on (this is the first time this has happened to me), but I might be missing something obvious. If someone can explain why this is not working, or if it does work on your machine, please let me know!
The page attempts to set a cookie the first time you visit it. Using the requests module without supplying that cookie prevents you from connecting to the page.
I've modified your script to include my cookie which should work - if it doesn't, copy your cookie (for this host domain) from the browser to the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url = url, headers = headers, cookies=cookies)
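Alternatively, a rough sketch of letting a requests.Session pick up whatever cookie the site sets on the first visit instead of hard-coding it (the timeout value is just an illustration, and this only helps if the cookie is set via HTTP headers rather than JavaScript):
import requests

url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}

with requests.Session() as session:
    session.headers.update(headers)
    # the first request stores any Set-Cookie values on the session
    session.get(url, timeout=30)
    # the follow-up request is sent with those cookies automatically
    req = session.get(url, timeout=30)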
I'm trying to automatically fetch the MAC addresses for some vendors in Python. I found a website that is really helpful, but I'm not able to access its information from Python. When I run this:
import grequests
rs = (grequests.get(u) for u in ['https://aruljohn.com/mac/000000'])
requests = grequests.map(rs)
for response in requests:
    print(response)
It prints None. Does anyone know how to solve this?
It looks like the issue is just not setting a User-Agent in the headers. I was able to request the website without any issues using plain requests, and the same thing should work fine with grequests. That said, you might want to look for a more actively maintained library; aiohttp is very active and I've had a great experience with it.
import requests
from lxml import html

def request_website(mac):
    url = 'https://aruljohn.com/mac/' + mac
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    return r.text

response = request_website('000000')
tree = html.fromstring(response)
results = tree.cssselect('.results p')[0].text
print(results)
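For reference, a minimal aiohttp sketch of the same request (assumes the aiohttp package is installed; the User-Agent string is just an example):
import asyncio
import aiohttp

async def request_website(mac):
    url = 'https://aruljohn.com/mac/' + mac
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as response:
            return await response.text()

print(asyncio.run(request_website('000000'))[:200])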
I want to download Bing image search results using Python code.
Example URL: https://www.bing.com/images/search?q=sketch%2520using%20iphone%2520students
My Python code generates a Bing search URL, as shown in the example. The next step is to download all the images shown at that link to my local desktop.
In my project I am generating some words in Python, and my code builds a Bing image search URL from them. All I need is to download the images shown on that search page using Python.
To download an image, you need to make a request to the image URL that ends with .png, .jpg etc.
But Bing provides an "m" attribute inside the <a> element that stores the needed data in JSON format, from which you can parse the image URL stored in the "murl" key and download it afterward.
To download all images locally to your computer, you can use two methods:
# bs4 (assumes soup, headers and query from the full example below)
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", 'wb') as file:
            file.write(image.content)
# urllib (assumes soup, query and the "req" alias for urllib.request from the full example below)
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")
In the first case, you can use the with open() context manager to save the image locally. In the second case, you can use the urllib.request.urlretrieve method.
Also, make sure you send a User-Agent request header so the request looks like a "real" user visit. The default requests User-Agent is python-requests, and websites understand that such a request is most likely coming from a script. Check what your user-agent is.
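One quick way to see what requests sends by default (a minimal sketch):
import requests
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.28.1'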
Note: an error might occur with the urllib.request.urlretrieve method when a request hits a captcha or something else that returns an unsuccessful status code. The big drawback is that it's hard to check the response code, whereas requests exposes a status_code attribute for that.
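One hedged way to surface the status code anyway is to catch the HTTPError that urlretrieve raises on a failing response (the URL below is only a placeholder for a murl value):
import urllib.request
import urllib.error

img_url = "https://example.com/some_image.jpg"  # placeholder, not a real murl value
try:
    urllib.request.urlretrieve(img_url, "out.jpg")
except urllib.error.HTTPError as error:
    # HTTPError carries the status code that urlretrieve itself does not return
    print(error.code, error.reason)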
Code and full examples:
from bs4 import BeautifulSoup
import requests, lxml, json
query = "sketch using iphone students"
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": query,
"first": 1
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}
response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", 'wb') as file:
            file.write(image.content)
Using urllib.request.urlretrieve.
from bs4 import BeautifulSoup
import requests, lxml, json
import urllib.request as req
query = "sketch using iphone students"
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": query,
"first": 1
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}
response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")
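Both examples write into an images/ folder; if that folder doesn't already exist, open() and urlretrieve will fail with FileNotFoundError, so it may help to create it first:
import os
os.makedirs("images", exist_ok=True)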
Edit your code to find the designated image URL, then use urllib.request to download it:
import urllib.request as req
imgurl ="https://i.ytimg.com/vi/Ks-_Mh1QhMc/hqdefault.jpg"
req.urlretrieve(imgurl, "image_name.jpg")