Hi, I'm doing research on Zhihu, a Chinese Q&A website similar to Quora, using social network analysis. I'm writing a crawler in Python, but I've run into a problem:
I want to scrape the info of the users who follow a specific user, like Kaifu-Lee. Kaifu-Lee's followers page is http://www.zhihu.com/people/kaifulee/followers
There is a load-more button at the bottom of the followers list, and I need to get the full list.
Here's what I do with Python requests:
import requests
import re
s = requests.session()
login_data = {'email': '***', 'password': '***', }
# post the login data.
s.post('http://www.zhihu.com/login', login_data)
# verify that I've logged in successfully. This step has definitely succeeded.
r = s.get('http://www.zhihu.com')
Then, I jumped to the target page:
r = s.get('http://www.zhihu.com/people/kaifulee/followers')
and get a 200 response:
In [7]: r
Out[7]: <Response [200]>
So the next step is to analyze the load-more request under the "Network" tab of Chrome's developer tools. Here's the information:
Request URL: http://www.zhihu.com/node/ProfileFollowersListV2
Request Method: POST
Request Headers
Connection:keep-alive
Host:www.zhihu.com
Origin:http://www.zhihu.com
Referer:http://www.zhihu.com/people/kaifulee/followers
Form data
method:next
params:{"hash_id":"12135f10b08a64c54e8bfd537dd7bee7","order-by":"created","offset":20}
_xsrf:ea63beee3a3444bfb853f36b7d968ad1
So I try to POST:
global header_info
header_info = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1581.2 Safari/537.36',
    'Host': 'www.zhihu.com',
    'Origin': 'http://www.zhihu.com',
    'Connection': 'keep-alive',
    'Referer': 'http://www.zhihu.com/people/zihaolucky/followers',
    'Content-Type': 'application/x-www-form-urlencoded',
}

# form data.
data = r.text
raw_hash_id = re.findall('hash_id(.*)', data)
hash_id = raw_hash_id[0][14:46]
payload = {"method": next, "hash_id": str(hash_id), "order_by": "created", "offset": 20}

# post with parameters.
url = 'http://www.zhihu.com/node/ProfileFollowersListV2'
r = requests.post(url, data=payload, headers=header_info)
BUT, it returns <Response [404]>.
Did I make a mistake somewhere?
Someone said I made a mistake in dealing with the params: the form data has 3 parameters (method, params, _xsrf) and I had left out _xsrf, so I put all three into a dictionary.
So I modified the code:
# form data.
data = r.text
raw_hash_id = re.findall('hash_id(.*)',data)
hash_id = raw_hash_id[0][14:46]
raw_xsrf = re.findall('xsrf(.*)',r.text)
_xsrf = raw_xsrf[0][9:-3]
payload = {"method":"next","params":{"hash_id":hash_id,"order_by":"created","offset":20,},"_xsrf":_xsrf,}
# reuse the session object, but still error.
>>> r = s.post(url,data=payload,headers=header_info)
>>> <Response [500]>
You can't pass nested dictionaries to the data parameter. Requests just doesn't know what to do with them.
It's not clear, but it looks like the value of the params key is probably JSON. This means your payload code should look like this:
import json

params = json.dumps({"hash_id": hash_id, "order_by": "created", "offset": 20})
payload = {"method": "next", "params": params, "_xsrf": _xsrf}
Give that a try.
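For completeness, here is a minimal end-to-end sketch of the corrected request. It reuses the session object and the same (admittedly brittle) string-slicing extraction from the question; the only substantive changes are JSON-encoding the params value and sending _xsrf along with it. Field names and the offset come straight from the captured request above.

import json
import re
import requests

s = requests.session()
s.post('http://www.zhihu.com/login', {'email': '***', 'password': '***'})

# fetch the followers page to pull out hash_id and the _xsrf token
r = s.get('http://www.zhihu.com/people/kaifulee/followers')
hash_id = re.findall('hash_id(.*)', r.text)[0][14:46]
_xsrf = re.findall('xsrf(.*)', r.text)[0][9:-3]

payload = {
    'method': 'next',
    'params': json.dumps({'hash_id': hash_id, 'order_by': 'created', 'offset': 20}),
    '_xsrf': _xsrf,
}
url = 'http://www.zhihu.com/node/ProfileFollowersListV2'
# keep posting through the same session; header_info from the question can be passed as headers= too
r = s.post(url, data=payload)
print(r.status_code)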
Related
I'm trying to get the whole table from this website: https://br.investing.com/commodities/aluminum-historical-data
But when I run this code:
with requests.Session() as s:
    r = s.post('https://br.investing.com/commodities/aluminum-historical-data',
               headers={"curr_id": "49768", "smlID": "300586", "header": "Alumínio Futuros Dados Históricos",
                        'User-Agent': 'Mozilla/5.0', 'st_date': '01/01/2017', 'end_date': '29/09/2018',
                        'interval_sec': 'Daily', 'sort_col': 'date', 'sort_ord': 'DESC', 'action': 'historical_data'})
    bs2 = BeautifulSoup(r.text, 'lxml')
    tb = bs2.find('table', {"id": "curr_table"})
It only returns a piece of the table, not the whole date range I just filtered.
Can anyone help me get the whole table I just filtered?
You made two mistakes with your code.
The first one is the URL.
You need to use the correct URL to request the data from investing.com.
Your current URL is 'https://br.investing.com/commodities/aluminum-historical-data'.
However, if you open the inspector and click 'Network', the request URL is https://br.investing.com/instruments/HistoricalDataAjax.
Your second mistake is in s.post(...). As Federico Rubbi pointed out in the other answer, what you assigned to headers must be assigned to data instead.
With those two mistakes fixed, only one step remains: you have to add {'X-Requested-With': 'XMLHttpRequest'} to your_headers. Since you have already checked the Network tab in the inspector, you can probably see why that header is needed.
So the entire code should be as follows.
import requests
import bs4 as bs

with requests.Session() as s:
    url = 'https://br.investing.com/instruments/HistoricalDataAjax'  # making up for the first mistake
    your_headers = {'User-Agent': 'Mozilla/5.0'}
    s.get(url, headers=your_headers)

    c_list = s.cookies.get_dict().items()
    cookie_list = [key + '=' + value for key, value in c_list]
    cookie = ','.join(cookie_list)

    your_headers = {**{'X-Requested-With': 'XMLHttpRequest'}, **your_headers}
    your_headers['Cookie'] = cookie
    data = {}  # your form data; making up for the second mistake
    response = s.post(url, data=data, headers=your_headers)
The problem is that you're passing the form data as headers.
You have to send it with the data keyword argument of requests.Session.post:
with requests.Session() as session:
    url = 'https://br.investing.com/commodities/aluminum-historical-data'
    data = {
        "curr_id": "49768",
        "smlID": "300586",
        "header": "Alumínio Futuros Dados Históricos",
        'User-Agent': 'Mozilla/5.0',
        'st_date': '01/01/2017',
        'end_date': '29/09/2018',
        'interval_sec': 'Daily',
        'sort_col': 'date',
        'sort_ord': 'DESC',
        'action': 'historical_data',
    }
    your_headers = {}  # your headers here
    response = session.post(url, data=data, headers=your_headers)
    bs2 = BeautifulSoup(response.text, 'lxml')
    tb = bs2.find('table', {"id": "curr_table"})
I'd also recommend including your headers (especially a User-Agent) in the POST request, because the site may block bots. If you include them, your requests are harder to detect as a bot.
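A minimal sketch of what that could look like, reusing the session, url and data variables from the snippet above (the User-Agent value is only an example string):

your_headers = {
    'User-Agent': 'Mozilla/5.0',           # identify as a regular browser
    'X-Requested-With': 'XMLHttpRequest',  # as noted in the other answer
}
response = session.post(url, data=data, headers=your_headers)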
I need to log in to a website with requests, but nothing I have tried works:
from bs4 import BeautifulSoup as bs
import requests

s = requests.session()
url = 'https://www.ent-place.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=5'

def authenticate():
    headers = {'username': 'myuser', 'password': 'mypasss', '_Id': 'submit'}
    page = s.get(url)
    soup = bs(page.content)
    value = soup.form.find_all('input')[2]['value']
    headers.update({'value_name': value})
    auth = s.post(url, params=headers, cookies=page.cookies)

authenticate()
or:
import requests

payload = {
    'inUserName': 'user',
    'inUserPass': 'pass'
}

with requests.Session() as s:
    p = s.post('https://www.ent-place.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=5', data=payload)
    print(p.text)
    print(p.status_code)

    r = s.get('A protected web page url')
    print(r.text)
When I check .status_code with this, it returns 200, but I want 401 or 403 so that I can write something like 'if logged in'...
I have found the following, but I think it only works in Python 2; I use Python 3 and I don't know how to convert it:
import requests
import sys

payload = {
    'username': 'sopier',
    'password': 'somepassword'
}

with requests.Session(config={'verbose': sys.stderr}) as c:
    c.post('http://m.kaskus.co.id/user/login', data=payload)
    r = c.get('http://m.kaskus.co/id/myform')
    print 'sopier' in r.content
Does somebody know how to do this?
I have tested every script I have found, and none of them work...
When you submit the logon form, the POST request is sent to https://www.ent-place.fr/CookieAuth.dll?Logon, not to https://www.ent-place.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=5 -- you only get redirected to that URL afterwards.
When I tested this, the POST request contained the following parameters:
curl:Z2F
flags:0
forcedownlevel:0
formdir:5
username:username
password:password
SubmitCreds.x:69
SubmitCreds.y:9
SubmitCreds:Ouvrir une session
So, you'll likely need to supply those additional parameters as well.
Also, the line s.post(url, params=headers, cookies=page.cookies) is not correct. You should pass headers into the keyword argument data, not params -- params gets encoded into the request URL, while you need to send it as form data. (And I'm assuming you really mean payload when you say headers.)
s.post(url, data=headers, cookies=page.cookies)
The site you're trying to login to has an onClick JavaScript when you process the login form. requests won't be able to execute JavaScript for you. This may cause issues with the site functionality.
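Putting that together, here is a minimal sketch of the corrected login POST. The extra form fields are copied from the request captured above; the credentials are placeholders, and because of the JavaScript handling just mentioned there is no guarantee the site will accept a purely scripted login.

import requests

s = requests.Session()
login_url = 'https://www.ent-place.fr/CookieAuth.dll?Logon'
payload = {
    'curl': 'Z2F',
    'flags': '0',
    'forcedownlevel': '0',
    'formdir': '5',
    'username': 'myuser',        # your username
    'password': 'mypass',        # your password
    'SubmitCreds.x': '69',
    'SubmitCreds.y': '9',
    'SubmitCreds': 'Ouvrir une session',
}
r = s.post(login_url, data=payload)
print(r.status_code)

# then request a protected page with the same session to check whether the login stuck
# r = s.get('A protected web page url')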
I am trying to make a signed call to the Instagram API in Python. Currently my headers look like this:
user_agent = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7'
headers = {
    'User-Agent': user_agent,
    "Content-type": "application/x-www-form-urlencoded"
}
I tried several permutations of the instructions given at this page (Restrict API Requests # instagram), including the HMAC method and enabling "Enforce Signed Header" in my API settings page.
But I keep getting either a 'headers not found' or a 403 error. I just can't figure out how to properly build X-Insta-Forwarded-For.
Can you please help me make a signed call with this header in Python?
Much appreciated...
This should do it for you. You'll need the Crypto Python library (pycrypto) as well.
import requests
from Crypto.Hash import HMAC, SHA256

# change these accordingly
client_secret = "mysecret"
client_ip = "127.0.0.1"

hmac = HMAC.new(client_secret, digestmod=SHA256)
hmac.update(client_ip)
signature = hmac.hexdigest()
header_string = "%s|%s" % (client_ip, signature)
headers = {
    "X-Insta-Forwarded-For": header_string,
    # and the rest of your headers
}

# or use requests.post / requests.delete, since those are the only
# calls that use this header... just conveying the concept
resp = requests.get(insta_url, headers=headers)
If you test it with the example given in the reference you listed, you can verify that you get the correct hash using this method:
ip = "200.15.1.1"
secret = "6dc1787668c64c939929c17683d7cb74"
hmac = HMAC.new(secret, digestmod=SHA256)
hmac.update(ip)
signature = hmac.hexdigest()
# should be 7e3c45bc34f56fd8e762ee4590a53c8c2bbce27e967a85484712e5faa0191688
Per the reference docs: "To enable this setting, edit your OAuth Client configuration and mark the Enforce signed header checkbox." So make sure you have done that too.
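As a side note, here is a minimal sketch of the same computation with only the standard library (useful if you don't want the pycrypto dependency, or on Python 3, where the HMAC inputs must be bytes):

import hashlib
import hmac

ip = "200.15.1.1"
secret = "6dc1787668c64c939929c17683d7cb74"

signature = hmac.new(secret.encode('utf-8'), ip.encode('utf-8'), hashlib.sha256).hexdigest()
header_string = "%s|%s" % (ip, signature)
# signature should again be 7e3c45bc34f56fd8e762ee4590a53c8c2bbce27e967a85484712e5faa0191688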
I am looking for a way to sign in to amazon.com by sending a POST request with the login data in the request body. I am new to Python and still learning the specifics. When I sniff the HTTP traffic of a login to amazon.com, I see several headers and form fields, which I put into my code using requests. I verified the names of each and that none of them change. I write the result to a file that I can open as a web page to check: it loads Amazon's homepage, but it shows I am not signed in. When I change my email address or password to something incorrect, it makes no difference, and I still get code 200. I don't know what I'm doing wrong with the sign-in POST. I would greatly appreciate any help. My code is as follows:
import requests

s = requests.Session()
login_data = {
    'email': 'myemail',
    'password': 'mypasswd',
    'appAction': 'SIGNIN',
    'appActionToken': 'RjZHAvZ7X4o8bm0eM2vFJFj2BYqZMj3D',
    'openid.pape.max_auth_age': 'ape:MA==',
    'openid.ns': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjA=',
    'openid.ns.pape': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvZXh0ZW5zaW9ucy9wYXBlLzEuMA==',
    'prevRID': 'ape:MVQ4MVBLWDVEMUI0QjA3WlkyMEE=',  # changes
    'pageId': 'ape:dXNmbGV4',
    'openid.identity': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=',
    'openid.claimed_id': 'ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=',
    'openid.mode': 'ape:Y2hlY2tpZF9zZXR1cA==',
    'openid.assoc_handle': 'ape:dXNmbGV4',
    'openid.return_to': 'ape:aHR0cHM6Ly93d3cuYW1hem9uLmNvbS9ncC95b3Vyc3RvcmUvaG9tZT9pZT1VVEY4JmFjdGlvbj1zaWduLW91dCZwYXRoPSUyRmdwJTJGeW91cnN0b3JlJTJGaG9tZSZyZWZfPW5hdl95b3VyYWNjb3VudF9zaWdub3V0JnNpZ25Jbj0xJnVzZVJlZGlyZWN0T25TdWNjZXNzPTE=',
    'create': '0',
    'metadata1': 'j1QDaKdThighCUrU/Wa7pRWvjeOlkT+BQ7QFj2d9TXIlN6VtoSYle6l9S2z2chZ/EwnMx1xy2EjY5NMRS6hW+jgTFNBJlwSlw+HphykjlOm8Ejx47RrNRXOrwwdFHiUBSB6Jt+GGCp/+5gPXrt8DyrgYFxqVlGRkupsqFjMjtGWaYhEvNsBQmUZ4932fJpeYRJzWxi5KbqG4US53JQCB++aGYgicdOcSY2aTMGEqEkrAPu+xBE9NEboMBepju2OPcAsCzkf+hpugIPYl4xYhFpBarAYlRMgpbypi82+zX6TMCj5/b2k9ky1/fV4LvvvOd3embFEJXrzMIH9enK8F1BH5LJiRQ7e45ld+KqlcSb/cdJMqIXPIeJAy8r88LX3pB65IxR46Z89Sol1XmaOSfP0P626nIvIYv7a0oEyg9SHiAJ2LUpZLlqCh1Tax9f/Mz1ZiqMmFRal8UxyUZbcc5qWI24xKY86P/jwG007kWQiXhcBPUbQnjCWXVPcRV14FAyLQ9OOsiu7OyjElo/Sd1NvqaKVtz6DKy2RyDkZo82WvgPsj6BkUr0oIEUAYggSHHZxhsDicaIzgfJ5/OQeCLEvLUPjLJbnl2FK8xOG0FQAsl2Floso4Lgqd46V38frC4kD87bqQezQqmCr343FgT4uCRgP0LGBf5iSP70kj9JZolGQRSfHy+7DqwLsoEtZa9kJgrJlUNWTrC/XQZZyvA8yzWvz9jaEcPH3nGrUEMLI8airUMwpvK0JrGEHNzJZjPIWzz4bI5G1f1TL8F2eY+iZ6jDPK8mBQHh4zO2b/InaKn2NydtX42QF5lGYagsMEPn3vMgvYgrHwu5YS38ywDUobDxjDhnUCbCvNy4Cot2XMjBz1/S3X5tv4b540yDXhgJWH4h6OZOgs9gjlotv3rk24xYYPlZp+6WgrAQPPZ5VZJorh4dyMvkM7KNhzaK+ejQDMTIZG7096kGf+iQkfudXzg8k8YAXoerRvKpWgckUeZyY2cEwXpZCBsK2zZvuvuyuaHdKbVr8VTgJo'
}

r = s.post('https://www.amazon.com/ap/signin', login_data)
s.cookies
r = s.get('https://www.amazon.com')
for cookie in s.cookies:
    print 'cookie name ' + cookie.name
    print 'cookie value ' + cookie.value
with open("requests_results.html", "w") as f:
    f.write(r.content)
If you just want to log in to Amazon, then try the Python mechanize module. It is much simpler and neater.
For reference, check out this sample code:
import mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
sign_in = browser.open('https://www.amazon.com/gp/sign-in.html')
browser.select_form(name="sign-in")
browser["email"] = '' #provide email
browser["password"] = '' #provide passowrd
logged_in = browser.submit()
===============================
Edited: requests module
import requests
session = requests.Session()
data = {'email': '', 'password': ''}
header = {'User-Agent': 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data, headers=header)
print response.content
I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems.
The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php. The page is in Dutch, but this is irrelevant to my problem. It may be noteworthy that the form action redirects to http://zrs.leidenuniv.nl/ul/query.php.
First of all, this is the urllib/urllib2 method I've tried:
import urllib, urllib2
import socket, cookielib
url = 'http://zrs.leidenuniv.nl/ul/start.php'
params = {'day': 1, 'month': 5, 'year': 2012, 'quickselect': "unchecked",
          'res_instantie': '_ALL_', 'selgebouw': '_ALL_', 'zrssort': "locatie",
          'submit': "Uitvoeren"}
http_header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
               "Accept-Language": "nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4"}
timeout = 15
socket.setdefaulttimeout(timeout)
request = urllib2.Request(url, urllib.urlencode(params), http_header)
response = urllib2.urlopen(request)
cookies = cookielib.CookieJar()
cookies.extract_cookies(response, request)
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler, cookie_handler)
response = opener.open(request)
html = response.read()
However, when I try to print the retrieved html I get the original page, not the one the form action refers to. So any hints as to why this doesn't submit the form would be greatly appreciated.
Because the above didn't work, I also tried to use mechanize to submit the form. However, this results in a ParseError with the following code:
import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
br.select_form(nr = 0)
where the last line exits with the following: "ParseError: unexpected '-' char in declaration". Now I realize that this error may indicate an error in the DOCTYPE declaration, but since I can't edit the form page I'm not able to try different declarations. Any help on this error is also greatly appreciated.
Thanks in advance for your help.
It's because the DOCTYPE part is malformed.
Also it contains some strange tags like:
<!Co Dreef / Eelco de Graaff Faculteit der Rechtsgeleerdheid Universiteit Leiden><!e-mail j.dreef#law.leidenuniv.nl >
Try validating the page yourself...
Nonetheless, you can just strip off the junk to make mechanize's HTML parser happy:
import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
response.set_data(response.get_data()[177:])
br.set_response(response)
br.select_form(nr = 0)
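The hard-coded 177 only works while the broken preamble keeps exactly that length. A slightly more robust sketch (assuming the page really does contain an <html tag and the rest of the markup parses cleanly) is to cut at the first <html tag instead:

import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)

# drop everything before the opening <html> tag, wherever the junk ends
data = response.get_data()
idx = data.find('<html')
if idx > 0:
    response.set_data(data[idx:])
    br.set_response(response)

br.select_form(nr=0)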