I am using the requests library to fetch a website, but it returns the mobile version of the site. How can I get the HTML page of the desktop version?
import requests
sess = requests.Session()
sess.get("https://google.com/")
Try this:
from requests_html import HTMLSession

session = HTMLSession()
# Pretend to be a desktop Chrome browser so the site serves the desktop page.
session.headers['user-agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
response = session.get("https://google.com/")  # session headers are sent automatically
Looking at the request in Firefox's developer tools, I can see that Google receives a User-Agent header identifying the browser:
User-Agent: Mozilla/5.0 etc.
Maybe you can try to send a header that tricks Google into thinking it's being accessed from a desktop browser.
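For example, with plain requests (a minimal sketch; copy the exact string from your own browser's developer tools):

import requests

# A desktop User-Agent copied from a browser's developer tools;
# any current desktop browser string should work here.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/78.0.3904.97 Safari/537.36"
    )
}
resp = requests.get("https://google.com/", headers=headers)
print(resp.text[:200])  # first bytes of the desktop page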
I am trying to scrape data from a website, but it returns this error and I don't know how to fix it:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
This is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
page = requests.get(url).content
page
The output is the Not Acceptable! error shown above.
You need to add a User-Agent header, then it works.
If you do not send the User-Agent of a real browser, the site decides that you are a bot and blocks you.
from bs4 import BeautifulSoup
import requests
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(url, headers=headers).content
print(page)
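Since BeautifulSoup is already imported above, a quick follow-up check (continuing the snippet) confirms the real page came back:

soup = BeautifulSoup(page, "html.parser")
print(soup.title.text)  # the article title instead of "Not Acceptable!"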
I am trying to extract the source code of an HTML page. It was working fine before, but now the web server wants more evidence that I am NOT a bot. The error is: "your IP is blocked." My IP is definitely NOT blocked, since I can still open the page manually in any browser. Do I need to change any parameters before making the request? Thanks.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
req = requests.get(url, headers=headers)
url_content = req.content
url_content = url_content.replace(b'data-imgid', b'\ndata-imgid')
output_file = open('downloaded.txt', 'wb')
output_file.write(url_content)
output_file.close()
I am using Python requests to get an HTML page.
I am using the latest version of Chrome in the user agent, but the response says "Please update your browser."
Here is my sample code:
import requests

s = requests.Session()
url = 'https://www.choicehotels.com/alabama/mobile/quality-inn-hotels/al045/hotel-reviews/4'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36', 'content-type': 'application/xhtml+xml', 'referer': url}
url_response = s.get(url, headers=headers, timeout=15)
print url_response.text
I am using Python 2.7 on a Windows server, but when I run the same code on my local machine I get the required output.
"Please update your browser" is the answer: you cannot do HTTPS with an old client, and requests on Python 2.7 can look like an old browser to the server. There have been many security problems in HTTPS protocols, so servers no longer allow connections using insecure encryption and outdated connection standards.
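If you want to see which TLS version a modern client negotiates with the server, here is a small Python 3 sketch (the hostname is taken from the question's URL):

import socket
import ssl

hostname = "www.choicehotels.com"
context = ssl.create_default_context()
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        # Prints the negotiated protocol, e.g. 'TLSv1.2' or 'TLSv1.3';
        # a client that only speaks older versions will be rejected.
        print(tls.version())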
On some pages, when I use BeautifulSoup, it returns nothing, just blank pages.
from bs4 import BeautifulSoup
import urllib.request
URL = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use BeautifulSoup on any other site except this one, and I don't know why.
This URL requires certain headers to be passed with the request.
Pass the headers parameter below while requesting the URL and you will get the HTML.
import requests

url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "Host": "gall.dcinside.com",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
HTML = requests.get(url, headers=headers).content
As far as I can see, this site uses cookies. You can see the headers in the browser's developer tools. You can get the cookie as follows:
import urllib.request

URL = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"  # the page from the question
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')  # the cookie string the server sets
Now you can create the header like this and send it with subsequent requests.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Cookie": ck,
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some web servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check which User-Agent is being sent with the request. In this case, as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
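You can confirm what your own installation sends by default (the version number will differ):

import requests

# Inspect the default headers a fresh session would send.
print(requests.Session().headers['User-Agent'])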
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example, see this list of Firefox User-Agent strings:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)
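If cookies turn out to be required as well, a requests.Session stores and resends them automatically; a minimal sketch reusing the same User-Agent:

from bs4 import BeautifulSoup
import requests

url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'

session.get(url)                  # any Set-Cookie response is stored on the session
html = session.get(url).content   # stored cookies are sent back automatically
soup = BeautifulSoup(html, "html.parser")
print(soup.title)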
I'm using the requests module in Python to try to make a search on the following website: http://musicpleer.audio/. However, the website appears to be blocking me, as it returns nothing but a 403 when I attempt to access it. I've tried sending it the user agent of my web browser (Chrome) and it still returns error 403. Any suggestions on how I could get around this? An example of downloading a song from the site would be very helpful. Thanks in advance.
My code:
import requests, os

def funGetList():
    start_path = 'C:/Users/Jordan/Music/'  # current directory
    list = []
    for path, dirs, files in os.walk(start_path):
        for filename in files:
            temp = os.path.join(path, filename)
            tempLen = len(temp)
            # print(tempLen)
            iterate = 0
            list.append(temp[22:len(temp) - 4])

def funDownloadMP3():
    for i in list:
        print(i)
        payload = {'searchQuery': 'meme', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
        url = 'http://musicpleer.audio/'
        print(requests.post(url, data=payload))
Putting the User-Agent in the headers seems to work:
In []:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
r = requests.get('{}#!{}'.format(url, 'meme'), headers=headers)
r.status_code
Out[]:
200
Note: it looks like the search URL is simply '#!<search-term>'.
HTTP 403 Forbidden error code.
The server might be expecting some more request headers, like Host or Cookie.
You might want to use Postman to debug it with ease.
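For example, a minimal sketch that sends extra headers and lets a session carry cookies (which headers the server actually checks is an assumption; the required set varies by site):

import requests

url = 'http://musicpleer.audio/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    # Assumption: the server may also look at Accept; add more headers
    # (Referer, Cookie, ...) copied from a real browser session as needed.
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

session = requests.Session()   # keeps any cookies the site sets between requests
r = session.get(url, headers=headers)
print(r.status_code)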