I have a question re: requests module in Python.
So far I have been using this to scrape and it's been working well.
However when I do it against one particular website (code below - and refer to the Jupyter Notebook snapshot), it just doesn't want to complete the task (showing [*] forever).
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest using headers such as below to speed it up but it doesnt work for me as well:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url = url, headers = headers)
Not sure what's going on (this is the first time for me) but I might be missing on something obvious. If someone can explain why this is not working? Or if it's working in your machine, please do let me know!
The page attempts to add a cookie the first time you visit it. By using the requests module and not defining a cookie will prevent you from being able to connect to the page.
I've modified your script to include my cookie which should work - if it doesn't, copy your cookie (for this host domain) from the browser to the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url = url, headers = headers, cookies=cookies)
Related
I am trying to scrape data from a website but it shows this error. I don't know how to fix this.
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
This is my code
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
page = requests.get(url).content
page
Output
You need to add user-agent and it works.
If you do not put user-agent of some browser, the site thinks that you are bot and block you.
from bs4 import BeautifulSoup
import requests
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(url, headers=headers).content
print(page)
I am trying to resolve a DOI like this:
import requests
url = 'https://dx.doi.org/10.3847/1538-4357/aafd31'
r1 = requests.get(url)
actual_url = r1.url
But the requests.get call actually takes of the order of 10s of seconds up to 5 minutes (it varies)! I tried stream=True or verify=False but that does not really help.
It seems they are slowing you down on purpose. Try setting a valid user agent.
Below code runs ok (quick response) for me;
import requests
url = 'https://dx.doi.org/10.3847/1538-4357/aafd31'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
req = requests.get(url, headers=headers)
print(req.text)
If you are doing multiple requests just make sure you do it slow enough and possibly use multiple user agents at random
try:
import urllib.request
response = urllib.request.urlopen('https://dx.doi.org/10.3847/1538-4357/aafd31')
html = response.read()
I had the same problem. My solution is to create a new environment with more recent python version.
I'm trying to fetch automatically the MAC addresses for some vendors in Python. I found a website that is really helpful, but I'm not being able to access its information from Python. When I run this:
import grequests
rs = (grequests.get(u) for u in ['https://aruljohn.com/mac/000000'])
requests = grequests.map(rs)
for response in requests:
print(response)
It prints None. Does anyone know how to solve this?
Looks like the issue is just not setting a user-agent in the headers. I was able to request the website without any issues. I just used normal python request but it should work fine with grequests. I do think you might want to find a more active library. You could check out aiohttp. Very active and I have had a wonderful experience using aiohttp.
import requests
from lxml import html
def request_website(mac):
url = 'https://aruljohn.com/mac/' + mac
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
r = requests.get(url, headers=headers)
return r.text
response = request_website('000000')
tree = html.fromstring(response)
results = tree.cssselect('.results p')[0].text
print (results)
The below code got stuck after printing hi in output. Can you please check what is wrong with this? And if the site is secure and I need some special authentication?
from bs4 import BeautifulSoup
import requests
print('hi')
rooturl='http://www.hoovers.com/company-information/company-search.html'
r=requests.get(rooturl);
print('hi1')
soup=BeautifulSoup(r.content,"html.parser");
print('hi2')
print(soup)
Unable to read html page from beautiful soup
Why you got this problem is website consider that you are robots, they won't send anything to you. And they even hang up the connection let you wait forever.
You just imitate browser's request, then server will consider you are not an robot.
Add headers is the simplest way to deal with this problem. But something you should not pass User-Agent only(like this time). Remember copy your browser's request and remove the useless element(s) through testing. If you are lazy use browser's headers straightly, but you must not copy all of them when you want to upload files
from bs4 import BeautifulSoup
import requests
rooturl='http://www.hoovers.com/company-information/company-search.html'
with requests.Session() as se:
se.headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
"Accept-Encoding": "gzip, deflate",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en"
}
resp = se.get(rooturl)
print(resp.content)
soup = BeautifulSoup(resp.content,"html.parser")
Was having the same issue as you. Just sat there.
I tried by adding user-agent, and it pulled it realtively quickly. Don't know why that is though.
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
print('hi')
rooturl='http://www.hoovers.com/company-information/company-search.html'
r=requests.get(rooturl, headers=headers)
print('hi1')
soup=BeautifulSoup(r.content,"html.parser");
print('hi2')
print(soup)
EDIT: So odd. Now it's not working for me again. It first didn't work. Then it did. Now it doesn't. But there is another potential option with the use of Selenium.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')
r = browser.page_source
print('hi1')
soup=BeautifulSoup(r,"html.parser")
print('hi2')
print(soup)
browser.close()
In some pages, when I use beautifulsoup, return nothing...just blank pages.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use beautifulsoup any other site except this site. and I dont know way...
This URL will require certain headers passed while requesting.
Pass this headers parameter while requesting the URL and you will get the HTML.
HTML = requests.get(URL , headers = headers).content
while
headers = {
"method":"GET",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
"Host":"gall.dcinside.com",
"Pragma":"no-cache",
"Upgrade-Insecure-Requests":"1",
"Accept":"text/html,application/xhtml+xml,
application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As I can see, this site is using cookies. You can see the headers in the browser's developer tool. You can get the cookie by following:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Cookie": ck,
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some website servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check to see which User-Agent is being sent by the browser. In this case as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example see this list of Firefox User-Agent strings e.g.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)