I would like to get the information on this page:
http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1
From the browser's developer tools I can see that the information comes from this file:
http://www.jnfdc.gov.cn/r/house/757e06e0-c5b3-4384-9a14-2cb1eac011d1_154810896.xml
But when I try to access that URL directly in the browser, I can't get the file, and I don't know why. I'm using Python:
import urllib2
#url1 = 'http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1'
url = 'http://www.jnfdc.gov.cn/r/house/757e06e0-c5b3-4384-9a14-2cb1eac011d1_113649432.xml'
headers = {
"Accept" :"*/*",
"Accept-Encoding" :"gzip, deflate, sdch",
"Accept-Language" :"zh-CN,zh;q=0.8",
"Cache-Control" :"max-age=0",
"Connection" :"keep-alive",
"Cookie" :"JSESSIONID=A205D8D7B0807FD34F879D6CB6EEB0CE",
"DNT" :"1",
"Host" :"www.jnfdc.gov.cn",
"Referer" :"http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1",
"User-Agent" :"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3051.400 QQBrowser/9.6.11301.400"
}
req = urllib2.Request(url, headers=headers)
resp = urllib2.urlopen(req)  # this line raises urllib2.HTTPError: Not Found
What should I do? Thanks.
To get the data as the browser renders it, you can try using Selenium - see the Selenium docs.
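As a minimal sketch, assuming Selenium and a matching Chrome driver are available (the URL is the page from the question): the driver runs the page's JavaScript and exposes the rendered HTML, so the data loaded from the XML file ends up in the page source.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml"
           "?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1")
html = driver.page_source  # HTML after the browser has executed the page's scripts
driver.quit()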
Related
I have a script that used to work with urllib and now has to use requests. I have a URL I use to put data into a database. The URL is
http://www.example.com/insert.php?network=testnet&id=1245100&c=2800203&lat=7555344
This URL worked through urllib (urlopen), but I get 403 Forbidden when requesting it through requests.get:
import requests

HEADER = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36' }
headers = requests.utils.default_headers()
headers.update = ( HEADER,)  # note: this assigns a tuple over the update method instead of calling headers.update(HEADER)
payload={'network':'testnet','id':'1245300','c':'2803824', 'lat':'7555457'}
response = requests.get("http://www.example.com/insert.php", headers=headers, params=payload)
print(f"Remote commit: {response.text}")
print(response.url)
The URL works in a browser and gets a simple JSON "ok" response.
The script produces:
Remote commit: <html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
http://www.example.com/insert.php?network=testnet&id=1245300&c=2803824&lat=7555457
Not sure what I am doing wrong.
Edit: changed https to http.
A 403 Forbidden is often correlated with an SSL/TLS certificate verification failure. Try calling requests.get with verify=False, as follows.
Fixing the SSL certificate issue:
requests.get("https://www.example.com/insert.php?network=testnet&id=1245300&c=2803824&lat=7555457", verify=False)
Fixing the TLS certificate issue: check out my answer on fixing TLS certificate verification.
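That linked answer is not included here; as a sketch of the same idea, assuming the failure really is certificate verification rather than the 403 itself, you can point requests at an up-to-date CA bundle (certifi) or at your own bundle instead of disabling verification:
import certifi
import requests

response = requests.get(
    "https://www.example.com/insert.php",
    params={"network": "testnet", "id": "1245300", "c": "2803824", "lat": "7555457"},
    verify=certifi.where(),  # or a path to a custom CA bundle, e.g. "my-ca.pem" (hypothetical)
)
print(response.text)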
Somehow I had overcomplicated it; the absolute minimum below works:
import requests
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36' }
response = requests.get("http://www.example.com/insert.php?network=testnet&id=1245200&c=2803824&lat=7555457", headers=headers)
print(response.text)
I can load this webpage in Google Chrome, but I can't access it via requests. Any idea what the compression problem is?
Code:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {'Accept-Encoding':'gzip, deflate, compress, br, identity'}
r = requests.get(url, headers=headers)
Result:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
Use a user agent that emulates a browser:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
r = requests.get(url, headers=headers)
You're getting a 403 Forbidden error, which you can see using requests.head. Use RJ's suggestion to defeat huffpost's robot blocking.
>>> requests.head(url)
<Response [403]>
For some pages, when I use BeautifulSoup, it returns nothing... just a blank page.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use BeautifulSoup on any other site except this one, and I don't know why...
This URL requires certain headers to be passed with the request.
Pass the headers parameter shown below when requesting the URL and you will get the HTML:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "Host": "gall.dcinside.com",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}

HTML = requests.get(URL, headers=headers).content
As far as I can see, this site uses cookies. You can inspect the headers in the browser's developer tools. You can get the cookie as follows:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Cookie": ck,
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some website servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check to see which User-Agent is being sent by the browser. In this case as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example, see this list of Firefox User-Agent strings:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)
I would like to get the response data from a specific website.
I have this site:
https://enjoy.eni.com/it/milano/map/
If I open the browser's debugger console, I can see a POST request that gives a JSON response.
How can I get this response in Python by scraping the website?
Thanks
Apparently the web service validates a PHPSESSID cookie, so we need to obtain it first using a proper User-Agent:
import requests
import json
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
}
r = requests.get('https://enjoy.eni.com/it/milano/map/', headers=headers)
session_id = r.cookies['PHPSESSID']
headers['Cookie'] = 'PHPSESSID={};'.format(session_id)
res = requests.post('https://enjoy.eni.com/ajax/retrieve_vehicles', headers=headers, allow_redirects=False)
json_obj = json.loads(res.content)
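As a variation on the same idea (a sketch, assuming the endpoint behaves the same way), a requests.Session carries the PHPSESSID cookie automatically, so the Cookie header does not have to be built by hand:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
}
with requests.Session() as s:
    s.headers.update(headers)
    s.get('https://enjoy.eni.com/it/milano/map/')  # sets PHPSESSID on the session
    res = s.post('https://enjoy.eni.com/ajax/retrieve_vehicles', allow_redirects=False)
    json_obj = res.json()  # same parsed JSON as json.loads(res.content)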
I'm trying to log in to an ASPX page and then get the contents of another page as a logged-in user.
import requests
from bs4 import BeautifulSoup
URL="https://example.com/Login.aspx"
durl="https://example.com/Daily.aspx"
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36'
language = 'en-US,en;q=0.8'
encoding = 'gzip, deflate'
accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
connection = 'keep-alive'
headers = {
"Accept": accept,
"Accept-Encoding": encoding,
"Accept-Language": language,
"Connection": connection,
"User-Agent": user_agent
}
username="user"
password="pass"
s=requests.Session()
s.headers.update(headers)
r=s.get(URL)
print(r.cookies)
soup=BeautifulSoup(r.content, "html.parser")
LASTFOCUS=soup.find(id="__LASTFOCUS")['value']
EVENTTARGET=soup.find(id="__EVENTTARGET")['value']
EVENTARGUMENT=soup.find(id="__EVENTARGUMENT")['value']
VIEWSTATEFIELDCOUNT=soup.find(id="__VIEWSTATEFIELDCOUNT")['value']
VIEWSTATE=soup.find(id="__VIEWSTATE")['value']
VIEWSTATE1=soup.find(id="__VIEWSTATE1")['value']
VIEWSTATE2=soup.find(id="__VIEWSTATE2")['value']
VIEWSTATE3=soup.find(id="__VIEWSTATE3")['value']
VIEWSTATE4=soup.find(id="__VIEWSTATE4")['value']
VIEWSTATEGENERATOR=soup.find(id="__VIEWSTATEGENERATOR")['value']
login_data={
"__LASTFOCUS":"",
"__EVENTTARGET":"",
"__EVENTARGUMENT":"",
"__VIEWSTATEFIELDCOUNT":"5",
"__VIEWSTATE":VIEWSTATE,
"__VIEWSTATE1":VIEWSTATE1,
"__VIEWSTATE2":VIEWSTATE2,
"__VIEWSTATE3":VIEWSTATE3,
"__VIEWSTATE4":VIEWSTATE4,
"__VIEWSTATEGENERATOR":VIEWSTATEGENERATOR,
"__SCROLLPOSITIONX":"0",
"__SCROLLPOSITIONY":"100",
"ctl00$NameTextBox":"",
"ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$UserName":username,
"ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$Password":password,
"ctl00$ContentPlaceHolderNavPane$LeftSection$UserLogin$LoginButton":"Login",
"ctl00$ContentPlaceHolder1$RetrievePasswordUserNameTextBox":"",
"hiddenInputToUpdateATBuffer_CommonToolkitScripts":"1"
}
r1=s.post(URL, data=login_data)
print (r1.cookies)
d=s.get(durl)
print (d.cookies)
dsoup=BeautifulSoup(r1.content, "html.parser")
print (dsoup)
But the cookies are not preserved in the session, and I can't get to the next page as a logged-in user.
Can someone give me some pointers on this?
Thanks.
When you post to the login page:
r1=s.post(URL, data=login_data)
The server is likely issuing a redirect to another page. The response to the POST request carries the cookies, but what ends up captured in r1 is the redirected page, and that response does not contain the cookies.
Try the same command but not allowing redirects:
r1 = s.post(URL, data=login_data, allow_redirects=False)
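A minimal sketch of how that might look with the question's session, assuming the login really does answer with a redirect and Set-Cookie (variable names are taken from the question):
r1 = s.post(URL, data=login_data, allow_redirects=False)
print(r1.status_code)              # a successful ASP.NET login usually answers 302
print(r1.headers.get("Location"))  # where the login page wanted to redirect
print(r1.cookies)                  # cookies set by the login response itself
print(s.cookies)                   # the session jar; it should now hold the auth cookie
d = s.get(durl)                    # subsequent requests send the stored cookies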