Why does grequests.get() return None? - python

I'm trying to automatically fetch MAC address vendor information in Python. I found a website that is really helpful, but I'm not able to access its information from Python. When I run this:
import grequests
rs = (grequests.get(u) for u in ['https://aruljohn.com/mac/000000'])
requests = grequests.map(rs)
for response in requests:
    print(response)
It prints None. Does anyone know how to solve this?

It looks like the issue is just that you're not setting a User-Agent in the headers. I was able to request the website without any issues. I used plain Python requests, but the same fix should work fine with grequests (see the sketch after the code below). I do think you might want to look for a more actively maintained library, though. You could check out aiohttp; it's very active and I have had a wonderful experience using it.
import requests
from lxml import html

def request_website(mac):
    url = 'https://aruljohn.com/mac/' + mac
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    return r.text

response = request_website('000000')
tree = html.fromstring(response)
results = tree.cssselect('.results p')[0].text
print(results)
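For reference, here is a minimal sketch of the same User-Agent fix applied to the original grequests code (grequests passes keyword arguments like headers through to requests), plus a rough aiohttp equivalent. I haven't tested either against this exact site, and the variable names are my own:

import grequests

# Same Chrome User-Agent string as in the requests example above
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

rs = (grequests.get(u, headers=headers) for u in ['https://aruljohn.com/mac/000000'])
for response in grequests.map(rs):
    print(response)  # should now be a Response object rather than None

An aiohttp version might look roughly like this:

import asyncio
import aiohttp

async def fetch(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            return await resp.text()

html_text = asyncio.run(fetch('https://aruljohn.com/mac/000000'))
print(html_text[:200])  # print the start of the page to confirm it came back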

Related

Python requests.get error on Wikipedia image URL

Requests.get() does not seem to be returning the expected bytes for Wikipedia image URLs, such as https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg:
import wikipedia
import requests
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link)
req.content
b'<!DOCTYPE html>\n<html lang="en">\n<meta charset="utf-8">\n<title>Wikimedia Error</title>\n<style>\n*...
Many websites block requests that arrive without a valid browser User-Agent; Wikimedia is one of them.
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
res = requests.get('https://upload.wikimedia.org/wikipedia/commons/0/05/20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg', headers=headers)
res.content
which will give you the expected output.
I ran your code and it returns a "403 Forbidden" error. Wikipedia requires a User-Agent header in the request.
import wikipedia
import requests
headers = {
    'User-Agent': 'My User Agent 1.0'
}
page = wikipedia.page("beach")
first_image_link = page.images[0]
req = requests.get(first_image_link, headers=headers, stream=True)
req.content
For the user agent, you should probably supply something a bit more descriptive than the placeholder I use in my example. Maybe the name of your script, or just the word "script" or something like that.
I tested it and it works fine; you will get back the image you're expecting.
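If the goal is to save the image rather than just inspect req.content, a minimal sketch might look like this; the output filename and the descriptive User-Agent string (as suggested above) are placeholders of my own:

import requests

# A descriptive, hypothetical User-Agent identifying the script, as suggested above
headers = {'User-Agent': 'my-image-downloader/1.0 (example@example.com)'}
url = ('https://upload.wikimedia.org/wikipedia/commons/0/05/'
       '20100726_Kalamitsi_Beach_Ionian_Sea_Lefkada_island_Greece.jpg')

with requests.get(url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open('beach.jpg', 'wb') as f:
        # stream=True lets us write the body in chunks instead of loading it all into memory
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)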

Why doesn't requests work on a specific URL?

I have a question about the requests module in Python.
So far I have been using this to scrape and it's been working well.
However, when I run it against one particular website (code below), it just never completes the request (the Jupyter Notebook cell shows [*] forever).
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest adding headers like the ones below to speed it up, but that doesn't work for me either:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url = url, headers = headers)
I'm not sure what's going on (this is the first time this has happened to me), but I might be missing something obvious. Can someone explain why this isn't working? Or if it works on your machine, please let me know!
The page attempts to set a cookie the first time you visit it. Using the requests module without providing a cookie will prevent you from connecting to the page.
I've modified your script to include my cookie, which should work; if it doesn't, copy your own cookie (for this domain) from the browser into the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
    'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url = url, headers = headers, cookies=cookies)
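As a variation, and assuming the first request does eventually come back, a requests.Session will remember any cookies the server sets, so you don't have to copy them out of the browser by hand; adding a timeout also keeps the call from hanging forever. A rough, untested sketch:

import requests

url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}

with requests.Session() as s:
    s.headers.update(headers)
    # First request: the server may set cookies here; the Session stores them
    first = s.get(url, timeout=15)
    # Second request: cookies from the first response are sent back automatically
    second = s.get(url, timeout=15)
    print(second.status_code)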

requests.get is very slow

I am trying to resolve a DOI like this:
import requests
url = 'https://dx.doi.org/10.3847/1538-4357/aafd31'
r1 = requests.get(url)
actual_url = r1.url
But the requests.get call actually takes on the order of tens of seconds, and up to 5 minutes (it varies)! I tried stream=True and verify=False, but that does not really help.
It seems they are slowing you down on purpose. Try setting a valid user agent.
The code below runs fine for me (quick response):
import requests
url = 'https://dx.doi.org/10.3847/1538-4357/aafd31'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
req = requests.get(url, headers=headers)
print(req.text)
If you are making multiple requests, just make sure you pace them slowly enough, and possibly use multiple user agents at random.
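A minimal sketch of that pacing/rotation idea; the delay range and the pool of User-Agent strings are arbitrary choices of mine:

import random
import time
import requests

# A small pool of browser User-Agent strings to rotate through (examples only)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36',
]

dois = ['https://dx.doi.org/10.3847/1538-4357/aafd31']  # add more DOIs as needed

for url in dois:
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    r = requests.get(url, headers=headers, timeout=30)
    print(r.url)  # the resolved URL after redirects
    time.sleep(random.uniform(1, 3))  # pause between requests so you don't hammer the server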
You can also try urllib from the standard library:
import urllib.request
response = urllib.request.urlopen('https://dx.doi.org/10.3847/1538-4357/aafd31')
html = response.read()
I had the same problem. My solution was to create a new environment with a more recent Python version.

Python 3, BeautifulSoup returns nothing on specific pages

On some pages, when I use BeautifulSoup, it returns nothing... just blank pages.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use BeautifulSoup on any other site except this one, and I don't know why...
This URL requires certain headers to be passed with the request.
Pass this headers parameter when requesting the URL and you will get the HTML:
import requests

HTML = requests.get(URL, headers=headers).content
where
headers = {
    "method": "GET",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "Host": "gall.dcinside.com",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As far as I can see, this site uses cookies. You can see the headers in the browser's developer tools. You can get the cookie like this:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Cookie": ck,
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some website servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check to see which User-Agent is being sent by the browser. In this case as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example, see this list of Firefox User-Agent strings, e.g.:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few and find one that the server is happy with (a short sketch of this appears after the code below). In your case, only the header needs to be changed in order to get HTML returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)
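Acting on the "try a few" advice above, here is a rough sketch that loops over a handful of candidate User-Agent strings (an arbitrary list of mine) until the server returns a 200:

import requests

url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
candidate_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0',
    'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405',
]

for agent in candidate_agents:
    resp = requests.get(url, headers={'User-Agent': agent}, timeout=10)
    if resp.status_code == 200 and resp.content:
        print('Server accepted:', agent)
        break
else:
    print('None of the candidate User-Agent strings worked')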

Python requests html 403 response

I'm using the requests module in Python to try to run a search on the website http://musicpleer.audio/. However, this website appears to be blocking me: it issues nothing but a 403 when I attempt to access it. I'm wondering how I can get around this. I've tried sending it the user agent of my web browser (Chrome) and it still returns error 403. Any suggestions on how I could get around this, and an example of downloading a song from the site, would be very helpful. Thanks in advance.
My code:
import requests, os

def funGetList():
    start_path = 'C:/Users/Jordan/Music/'  # current directory
    list = []
    for path, dirs, files in os.walk(start_path):
        for filename in files:
            temp = os.path.join(path, filename)
            tempLen = len(temp)
            # print(tempLen)
            iterate = 0
            list.append(temp[22:len(temp) - 4])

def funDownloadMP3():
    for i in list:
        print(i)
        payload = {'searchQuery': 'meme', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
        url = 'http://musicpleer.audio/'
        print(requests.post(url, data=payload))
Putting the User-Agent in the headers seems to work:
In []:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
r = requests.get('{}#!{}'.format(url, 'meme'), headers=headers)
r.status_code
Out[]:
200
Note: it looks like the search URL is simply '#!<search-term>'.
HTTP 403 is the Forbidden status code.
The server might be expecting some additional request headers, such as Host or Cookie.
You might want to use Postman to debug it more easily.
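As a rough illustration of sending extra request headers, something like the following could be worth experimenting with; the specific header values the site expects are guesses on my part:

import requests

url = 'http://musicpleer.audio/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Host': 'musicpleer.audio',  # guessed from the URL
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Cookie': '...'            # paste a cookie from your browser here if the site requires one
}
r = requests.get(url, headers=headers)
print(r.status_code)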
