I am trying to resolve a DOI like this:
import requests
url = 'https://dx.doi.org/10.3847/1538-4357/aafd31'
r1 = requests.get(url)
actual_url = r1.url
But the requests.get call actually takes of the order of 10s of seconds up to 5 minutes (it varies)! I tried stream=True or verify=False but that does not really help.
It seems they are slowing you down on purpose. Try setting a valid user agent.
Below code runs ok (quick response) for me;
import requests
url = 'https://dx.doi.org/10.3847/1538-4357/aafd31'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
req = requests.get(url, headers=headers)
print(req.text)
If you are doing multiple requests just make sure you do it slow enough and possibly use multiple user agents at random
try:
import urllib.request
response = urllib.request.urlopen('https://dx.doi.org/10.3847/1538-4357/aafd31')
html = response.read()
I had the same problem. My solution is to create a new environment with more recent python version.
Related
Requests.get() does not seem to be returning the expected bytes for Wikipedia image URLs, such as https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg:
import wikipedia
import requests
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link)
req.content
b'<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n*...
Most websites block requests that come in without a valid browser as a User-Agent. Wikimedia is one such.
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
res = requests.get('https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg', headers=headers)
res.content
which will give you expected output
I typed your code and it seems to be an "Error: 403, Forbidden.". Wikipedia requires a user agent header in the request.
import wikipedia
import requests
headers = {
'User-Agent': 'My User Agent 1.0'
}
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link, headers=headers, stream=True)
req.content
For the user agent, you should probably supply something a bit more descriptive than the placeholder I use in my example. Maybe the name of your script, or just the word "script" or something like that.
I tested it and it works fine. You will get back the image as you are expecting.
I'm new to python. I have to download some images from the web and save it to my local file system. I've noticed that the response content does not contain any image data.
The problem only occurs with this specific url, with every other image url the code works fine.
I know the easiest solution would be just use another url but still i'd like to ask if someone had a similar problem.
import requests
url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'
filename = "bitcoin.png"
response = requests.get(url, stream = True)
response.raw.decode_content = True
with open(f'images/{filename}', 'wb') as outfile:
outfile.write(response.content)
First, look at the content of the response with response.text, you'll see the website blocked your request.
Please turn JavaScript on and reload the page.
Then, you can try to check if changing the User-Agent of your request fixes your issue.
response = requests.get(
url,
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
},
stream = True
)
If it doesn't, you may need to get your data with something which can parse javascript like selenium or Puppeteer.
I have a question re: requests module in Python.
So far I have been using this to scrape and it's been working well.
However when I do it against one particular website (code below - and refer to the Jupyter Notebook snapshot), it just doesn't want to complete the task (showing [*] forever).
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest using headers such as below to speed it up but it doesnt work for me as well:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url = url, headers = headers)
Not sure what's going on (this is the first time for me) but I might be missing on something obvious. If someone can explain why this is not working? Or if it's working in your machine, please do let me know!
The page attempts to add a cookie the first time you visit it. By using the requests module and not defining a cookie will prevent you from being able to connect to the page.
I've modified your script to include my cookie which should work - if it doesn't, copy your cookie (for this host domain) from the browser to the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url = url, headers = headers, cookies=cookies)
I'm trying to fetch automatically the MAC addresses for some vendors in Python. I found a website that is really helpful, but I'm not being able to access its information from Python. When I run this:
import grequests
rs = (grequests.get(u) for u in ['https://aruljohn.com/mac/000000'])
requests = grequests.map(rs)
for response in requests:
print(response)
It prints None. Does anyone know how to solve this?
Looks like the issue is just not setting a user-agent in the headers. I was able to request the website without any issues. I just used normal python request but it should work fine with grequests. I do think you might want to find a more active library. You could check out aiohttp. Very active and I have had a wonderful experience using aiohttp.
import requests
from lxml import html
def request_website(mac):
url = 'https://aruljohn.com/mac/' + mac
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
r = requests.get(url, headers=headers)
return r.text
response = request_website('000000')
tree = html.fromstring(response)
results = tree.cssselect('.results p')[0].text
print (results)
Im using the requests module in python to try and make a search on the following webiste http://musicpleer.audio/, however this website appears to be blocking me as it issues nothing but a 403 when i attempt to access it, im wondering how i can get around this, ive tried sending it the user agent of my web browser(chrome) and it still returns error 403. any suggestions on how i could get around this an example of downloading a song from the site would be very helpful. Thanks in advance
My code:
import requests, os
def funGetList:
start_path = 'C:/Users/Jordan/Music/' # current directory
list = []
for path,dirs,files in os.walk(start_path):
for filename in files:
temp = (os.path.join(path,filename))
tempLen = len(temp)
"print(tempLen)"
iterate = 0
list.append(temp[22:(len(temp))-4])
def funDownloadMP3:
for i in list:
print(i)
payload = {'searchQuery': 'meme', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
print(requests.post(url, data=payload))
Putting the User-Agent in the headers seems to work:
In []:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
r = requests.get('{}#!{}'.format(url, 'meme'), headers=headers)
r.status_code
Out[]:
200
Note: It looks like the search url is simple '#!<search-term>'
HTML 403 Forbidden error code.
The server might be expecting some more request headers like Host or Cookies etc.
You might want to use Postman to debug it with ease