URL that worked with urllib.urlopen does not work with requests.get - Python

I have a script that used to work with urllib and now has to use requests. I have a URL I use to insert records into a database:
http://www.example.com/insert.php?network=testnet&id=1245100&c=2800203&lat=7555344
This URL worked through urllib (urlopen), but I get 403 Forbidden when requesting it through requests.get:
import requests

HEADER = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'}
headers = requests.utils.default_headers()
headers.update = (HEADER,)
payload = {'network': 'testnet', 'id': '1245300', 'c': '2803824', 'lat': '7555457'}
response = requests.get("http://www.example.com/insert.php", headers=headers, params=payload)
print(f"Remote commit: {response.text}")
print(response.url)
The URL works in a browser and returns a simple JSON OK response. The script produces:
Remote commit: <html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
http://www.example.com/insert.php?network=testnet&id=1245300&c=2803824&lat=7555457
Not sure what I am doing wrong.
Edit: changed https to http.

A 403 Forbidden response sometimes correlates with an SSL/TLS certificate verification failure. Try calling requests.get with verify=False:
requests.get("https://www.example.com/insert.php?network=testnet&id=1245300&c=2803824&lat=7555457", verify=False)
Check out my answer related to the TLS certificate verification fix.
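If verify=False resolves it, note that requests (via urllib3) will then emit an InsecureRequestWarning on every call. A minimal sketch of silencing it, assuming you accept the risk of skipping certificate verification:
import requests
import urllib3

# verify=False skips certificate checks; suppress the warning it triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

payload = {'network': 'testnet', 'id': '1245300', 'c': '2803824', 'lat': '7555457'}
response = requests.get("https://www.example.com/insert.php", params=payload, verify=False)
print(response.text)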

Somehow I overcomplicated it; the absolute minimum below works. (The culprit was headers.update = (HEADER,), which replaces the dict's update method instead of calling headers.update(HEADER), which is presumably why the request still went out with the default python-requests User-Agent.)
import requests
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36' }
response = requests.get("http://www.example.com/insert.php?network=testnet&id=1245200&c=2803824&lat=7555457", headers=headers)
print(response.text)
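For reference, a sketch of the original structure with the update call fixed; same endpoint and payload as in the question:
import requests

HEADER = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'}
headers = requests.utils.default_headers()
headers.update(HEADER)  # call update(); assigning to it replaces the method

payload = {'network': 'testnet', 'id': '1245300', 'c': '2803824', 'lat': '7555457'}
response = requests.get("http://www.example.com/insert.php", headers=headers, params=payload)
print(f"Remote commit: {response.text}")
print(response.url)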

Related

Can't log in with Python POST - error shows up

When I try to log in to GameStop (gamestop.ca) with requests, this is what I get. What am I doing wrong, or what am I missing? I tried adding many other headers, including the authority header shown in Chrome's dev tools under Network. I don't understand why this doesn't work when Selenium does: if this is bot detection, isn't Selenium supposed to be detected much more easily?
import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
payload = {
    'UserName': '*****',
    'Password': '******',
    'RememberMe': 'false'
}
with requests.Session() as s:
    r = s.post('https://www.gamestop.ca/Account/LogOn', headers=headers, data=payload)
    print(r.status_code)
    print(r.content)
    print(r.text)
403
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.gamestop.ca/Account/LogOn" on this server.<P>\nReference #18.70fd017.1635453520.211da9c5\n</BODY>\n</HTML>\n'
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://www.gamestop.ca/Account/LogOn" on this server.<P>
Reference #18.70fd017.1635453520.211da9c5
</BODY>
</HTML>
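One thing worth trying before giving up on requests is to prime the session with a GET to the login page so the site's cookies are set before the POST. A sketch, with no guarantee against the bot protection (an Access Denied page with a Reference # is typical of Akamai):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
           'Referer': 'https://www.gamestop.ca/Account/LogOn'}
payload = {'UserName': '*****', 'Password': '******', 'RememberMe': 'false'}

with requests.Session() as s:
    s.headers.update(headers)
    s.get('https://www.gamestop.ca/Account/LogOn')  # pick up the site's cookies first
    r = s.post('https://www.gamestop.ca/Account/LogOn', data=payload)
    print(r.status_code)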

Can't find the right compression for this webpage (Python requests.get)

I can load this webpage in Google Chrome, but I can't access it via requests. Any idea what the compression problem is?
Code:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {'Accept-Encoding':'gzip, deflate, compress, br, identity'}
r = requests.get(url, headers=headers)
Result:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
Use a user agent that emulates a browser:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
r = requests.get(url, headers=headers)
You're getting a 403 Forbidden error, which you can see using requests.head; the error body apparently does not match the gzip encoding the response headers advertise, which is why decoding fails. Use RJ's suggestion to defeat HuffPost's robot blocking.
>>> requests.head(url)
<Response [403]>
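To surface the HTTP error instead of a decode failure, you can defer the body with stream=True and check the status before reading; a small sketch:
import requests

url = 'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'

# stream=True defers the body download, so a blocked request surfaces as an
# HTTPError from raise_for_status() rather than a ContentDecodingError
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    print(r.status_code)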

I get an HTTPError: Not Found exception when I open the URL

I would like to get the information on this page:
http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1
From the browser's debugger tools I can see the information comes from this file:
http://www.jnfdc.gov.cn/r/house/757e06e0-c5b3-4384-9a14-2cb1eac011d1_154810896.xml
But when I use the browser to access that URL directly, I can't get the file, and I don't know why. I am using Python:
import urllib2

#url1 = 'http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1'
url = 'http://www.jnfdc.gov.cn/r/house/757e06e0-c5b3-4384-9a14-2cb1eac011d1_113649432.xml'
headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "JSESSIONID=A205D8D7B0807FD34F879D6CB6EEB0CE",
    "DNT": "1",
    "Host": "www.jnfdc.gov.cn",
    "Referer": "http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3051.400 QQBrowser/9.6.11301.400"
}
req = urllib2.Request(url, headers=headers)
resp = urllib2.urlopen(req)  # this line throws HTTPError: Not Found
What should I do? Thanks.
To get the data the way the browser does, you can try Selenium (see the Selenium docs).
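A minimal Selenium sketch of loading the page the browser can render (it assumes Chrome with a matching chromedriver is installed; the URL is the one from the question):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.jnfdc.gov.cn/onsaling/viewhouse.shtml'
           '?fmid=757e06e0-c5b3-4384-9a14-2cb1eac011d1')
html = driver.page_source  # the HTML after the browser has loaded the data
driver.quit()
print(html[:500])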

Convert curl command to Python [pycurl or requests or anything!]

I'm trying to post data to a specific URL, get the response, and save it to a file; with curl this works fine. The page shows a response page if the posted data is correct, and otherwise redirects to a URL like:
http://example.com/url/foo/bar/error
My curl command is:
curl --fail --user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36" --data "mydataexample" --referer "http://example.com/url/foo/bar" http://example.com/url/foo/bar --output test.html
But when I code it in Python with requests, status_code is always OK [200], even with wrong data, and there is no correct response to save!
Here is my python code:
import requests
data = 'myexampledata'
headers = { 'referer':'http://example.com/url/foo/bar' ,'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36' }
url = 'http://example.com/url/foo/bar'
r = requests.post(url,params=data,headers=headers)
# check headers and status_code for test if data is wrong...
print r.headers
print r.status_code
And now, how do I write Python code that solves this and works exactly like the curl command? Any advice with requests or pycurl to fix it?
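A sketch of one way to mirror the curl behaviour with requests, assuming (as described above) that bad data redirects to the .../error URL. Note that curl's --data sends a request body, so it maps to data=, not params=:
import requests

headers = {
    'Referer': 'http://example.com/url/foo/bar',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36',
}

r = requests.post('http://example.com/url/foo/bar',
                  data='mydataexample',  # curl --data: the POST body
                  headers=headers)
r.raise_for_status()  # roughly curl --fail: raise on HTTP 4xx/5xx

if r.url.rstrip('/').endswith('/error'):  # redirects are followed, so inspect the final URL
    raise RuntimeError('server redirected to the error page: wrong data')

with open('test.html', 'w', encoding='utf-8') as f:  # curl --output test.html
    f.write(r.text)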

Login to website via Python Requests

For a university project I am currently trying to log in to a website and scrape a small detail (a list of news articles) from my user profile.
I am new to Python, but I have done this before on another website. My first two approaches produce different HTTP errors. I have considered problems with the headers my request is sending, but my understanding of this site's login process appears to be insufficient.
This is the login page: http://seekingalpha.com/account/login
My first approach looks like this:
import requests

with requests.Session() as c:
    requestUrl = 'http://seekingalpha.com/account/orthodox_login'
    USERNAME = 'XXX'
    PASSWORD = 'XXX'
    userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
    login_data = {
        "slugs[]": None,
        "rt": None,
        "user[url_source]": None,
        "user[location_source]": "orthodox_login",
        "user[email]": USERNAME,
        "user[password]": PASSWORD
    }
    c.post(requestUrl, data=login_data, headers={"referer": "http://seekingalpha.com/account/login", 'user-agent': userAgent})
    page = c.get("http://seekingalpha.com/account/email_preferences")
    print(page.content)
This results in "403 Forbidden"
My second approach looks like this:
from requests import Request, Session
requestUrl ='http://seekingalpha.com/account/orthodox_login'
USERNAME = 'XXX'
PASSWORD = 'XXX'
userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
# c.get(requestUrl)
login_data = {
"slugs[]":None,
"rt":None,
"user[url_source]":None,
"user[location_source]":"orthodox_login",
"user[email]":USERNAME,
"user[password]":PASSWORD
}
headers = {
"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language":"de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4",
"origin":"http://seekingalpha.com",
"referer":"http://seekingalpha.com/account/login",
"Cache-Control":"max-age=0",
"Upgrade-Insecure-Requests":1,
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
}
s = Session()
req = Request('POST', requestUrl, data=login_data, headers=headers)
prepped = s.prepare_request(req)
prepped.body ="slugs%5B%5D=&rt=&user%5Burl_source%5D=&user%5Blocation_source%5D=orthodox_login&user%5Bemail%5D=XXX%40XXX.com&user%5Bpassword%5D=XXX"
resp = s.send(prepped)
print(resp.status_code)
In this approach I was trying to prepare the headers exactly as my browser would. Sorry for the redundancy. This results in HTTP error 400.
Does someone have an idea what went wrong? Probably a lot.
Instead of spending a lot of energy on manually logging in and playing with Session, I suggest you just scrape the pages right away using your cookie.
When you log in, a cookie is usually added to your requests to identify you. Please see this for example:
Your code will be like this:
import requests

response = requests.get("https://www.example.com", cookies={
    "c_user": "my_cookie_part",
    "xs": "my_other_cookie_part"
})
print(response.content)
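The cookie names above (c_user and xs are Facebook's) vary per site; copy your own from the browser's dev tools after logging in. A hypothetical sketch using a Session so the cookie rides along on every request - the name session_id is made up, use whatever the site actually sets:
import requests

session = requests.Session()
# 'session_id' is a placeholder; substitute the real cookie name and value from dev tools
session.cookies.set('session_id', 'value-from-devtools', domain='seekingalpha.com')
page = session.get('http://seekingalpha.com/account/email_preferences')
print(page.content)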
