How can I load a CookieJar to a new requests.Session object?
cj = cookielib.MozillaCookieJar("mycookies.txt")
s = requests.Session()
This is what I create, now the session will store cookies, but I want it to have my cookies from the file (The session should load the cookieJar).
How can this be achieved?
I searched the documentation but I can only find code examples and they are never loading a cookieJar, just saving cookies during one session.
Here is a fully working Python 3.x example; the code is largely self-explanatory.
It properly handles "session cookies", preserving them between runs. By default those are not saved to disk, which means most websites would require you to log in again on every run. With the technique below, all session cookies are kept too!
This is the code you are looking for.
import os
import pathlib
import requests
from http.cookiejar import MozillaCookieJar
cookiesFile = str(pathlib.Path(__file__).parent.absolute() / "cookies.txt") # Places "cookies.txt" next to the script file.
cj = MozillaCookieJar(cookiesFile)
if os.path.exists(cookiesFile):  # Only attempt to load if the cookie file exists.
    cj.load(ignore_discard=True, ignore_expires=True)  # Loads session cookies too (expirydate=0).
s = requests.Session()
s.headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
    "Accept-Language": "en-US,en"
}
s.cookies = cj # Tell Requests session to use the cookiejar.
# DO STUFF HERE WHICH REQUIRES THE PERSISTENT COOKIES...
s.get("https://www.somewebsite.com/")
cj.save(ignore_discard=True, ignore_expires=True) # Saves session cookies too (expirydate=0).
In Python 3.x
import requests
import http.cookiejar
s = requests.Session()
s.cookies = http.cookiejar.MozillaCookieJar("anything.txt")
For example, I will access the Google site and save the cookie jar to the file "anything.txt":
s.get("https://google.com")
s.cookies.save()
And in the future, I can access Google again with my saved cookie jar:
s.cookies.load()
s.get("https://google.com")
There's an optional cookies= argument that can be provided to requests.Session (as well as request) objects:
cookies = None
A CookieJar containing all currently outstanding cookies set on this
session. By default it is a RequestsCookieJar, but may be any other cookielib.CookieJar compatible object.
see: https://2.python-requests.org/en/latest/api/#requests.Session.cookies
So it becomes:
s = requests.Session(cookies=cj)
Update: I was confusing this with requests.get, requests.post, etc. As mata correctly pointed out in the comments, cookies is an attribute of the session object, not an init parameter, so the line above won't work. Setting it after constructing the session will. Therefore, use:
s = requests.Session()
s.cookies = cj
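For completeness, a minimal end-to-end sketch, under the assumption that mycookies.txt already exists in Mozilla/Netscape format:
import requests
from http.cookiejar import MozillaCookieJar

cj = MozillaCookieJar("mycookies.txt")
cj.load(ignore_discard=True, ignore_expires=True)  # also pick up session cookies

s = requests.Session()
s.cookies = cj  # assign after construction; Session() takes no cookies= argument
r = s.get("https://www.somewebsite.com/")  # placeholder URL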
Related
I am trying to scrape data that generates a chart on a website using Python's requests module.
My code currently looks like this:
# load modules
import os
import json
import requests as r
# url to send the call to
postURL = <insert website>
# use GET to pull cookie data
cookie_intel = r.get(postURL, verify = False)
# get cookies
search_cookies = cookie_intel.cookies
#### Request Information ####
# API request data
post_data = <insert request json>
# header information
headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
# results
results_post = r.post(postURL, data = post_data, cookies = search_cookies, headers = headers, verify = False)
# result
print(results_post.json())
As a quick summary: I first loaded the site to inspect it; from there I identified the URL for the request in the Network tab, then checked the required request data in the Payload tab, and took the user-agent from the Request Headers tab.
The request itself works; however, the response is always empty. I have tried altering all sorts of inputs but without success. I would highly appreciate any tips that would help me solve this issue. Thank you in advance!
In this case you have to use json= instead of data= when making the POST request, according to the requests documentation. By replacing this part of your code you should get the expected response:
results_post = r.post(postURL, json = post_data, cookies = search_cookies, headers = headers, verify = False)
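The difference is that data= sends the payload form-encoded (or as a raw string), while json= serializes it to JSON and sets the Content-Type: application/json header, which is what most JSON APIs expect. A quick sketch against httpbin.org (an echo service, used here only for illustration):
import requests

payload = {"key": "value"}

# Form-encoded body (Content-Type: application/x-www-form-urlencoded)
r1 = requests.post("https://httpbin.org/post", data=payload)
print(r1.json()["form"])  # {'key': 'value'}

# JSON body (Content-Type: application/json)
r2 = requests.post("https://httpbin.org/post", json=payload)
print(r2.json()["json"])  # {'key': 'value'}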
You can also try other scraping tools like Scrapy to crawl this data, and perhaps run the crawler in the cloud using estela.
I want to send a value for "User-agent" while requesting a web page using Python Requests. I am not sure if it is okay to send this as part of the header, as in the code below:
import sys
import requests

debug = {'verbose': sys.stderr}
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers = user_agent, config=debug)
The debug information isn't showing the headers being sent during the request.
Is it acceptable to send this information in the header? If not, how can I send it?
The user-agent should be specified as a field in the header.
Here is a list of HTTP header fields, and you'd probably be interested in request-specific fields, which includes User-Agent.
If you're using requests v2.13 and newer
The simplest way to do what you want is to create a dictionary and specify your headers directly, like so:
import requests
url = 'SOME URL'
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.example'  # This is another valid field
}
response = requests.get(url, headers=headers)
If you're using requests v2.12.x and older
Older versions of requests clobbered default headers, so you'd want to do the following to preserve default headers and then add your own to them.
import requests
url = 'SOME URL'
# Get a copy of the default headers that requests would use
headers = requests.utils.default_headers()
# Update the headers with your custom ones
# You don't have to worry about case-sensitivity with
# the dictionary keys, because default_headers uses a custom
# CaseInsensitiveDict implementation within requests' source code.
headers.update(
    {
        'User-Agent': 'My User Agent 1.0',
    }
)
response = requests.get(url, headers=headers)
It's more convenient to use a session, this way you don't have to remember to set headers each time:
session = requests.Session()
session.headers.update({'User-Agent': 'Custom user agent'})
session.get('https://httpbin.org/headers')
By default, session also manages cookies for you. In case you want to disable that, see this question.
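If you do want to turn cookie handling off, one approach (a sketch, not taken verbatim from the linked question) is to install a deny-all cookie policy from the standard library's http.cookiejar:
import requests
from http import cookiejar

class BlockAllCookies(cookiejar.CookiePolicy):
    # Refuse to store or return any cookie.
    def set_ok(self, cookie, request): return False
    def return_ok(self, cookie, request): return False
    def domain_return_ok(self, domain, request): return False
    def path_return_ok(self, path, request): return False
    netscape = True  # attributes CookieJar expects on a policy
    rfc2965 = hide_cookie2 = False

session = requests.Session()
session.cookies.set_policy(BlockAllCookies())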
This will send the request as if it came from a browser:
import requests
url = 'https://Your-url'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
response = requests.get(url.strip(), headers=headers, timeout=10)
Background info:
I am scraping Amazon. I need to set up the session cookies before using requests.Session.get() to fetch the final version of a URL's page source.
Code:
import requests
# I am currently working in China, so it's cn.
# Use the homepage to get cookies. Then use it later to scrape data.
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = requests.get(homepage,headers = headers)
cookies = response.cookies
#set up the Session object, so as to preserve the cookies between requests.
session = requests.Session()
session.headers = headers
session.cookies = cookies
# now begin downloading the source code
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
response = session.get(url)
Desired Result:
When I navigate to the Amazon homepage in Chrome, part of the cookies set by the response to our homepage request is "ubid-acbcn" (visible in the browser's developer tools), which is also part of the request header, probably left over from a previous visit.
So that is the cookie I want, and what I attempted to get with the above code.
In Python code, it should be a CookieJar or a dictionary. Either way, its content should contain 'ubid-acbcn' and 'session-id':
{'ubid-acbcn':'453-7613662-1073007','session-id':'455-1363863-7141553','otherparts':'otherparts'}
What I am getting instead:
The 'session-id' is there, but the 'ubid-acbcn' is missing.
>>> homepage = 'http://www.amazon.cn'
>>> headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
>>> response = requests.get(homepage, headers=headers)
>>> cookies = response.cookies
>>> print(cookies.get_dict())
{'session-id': '456-2975694-3270026', 'otherparts': 'otherparts'}
Related Info:
OS: WINDOWS 10
PYTHON: 3.5
requests: 2.11.1
I am sorry for being a bit verbose.
What I tried and figured:
I googled for certain keywords, but nobody seems to be facing this problem.
I figure it might be something to do with Amazon's anti-scraping measures. But other than changing my headers to disguise myself as a human, I don't know what else I should do.
I have also entertained the possibility that it might not be a case of a missing cookie, but rather that I have not set up my requests.get(homepage, headers=headers) call properly, and hence response.cookies is not as expected. Given this, I have tried copying the request headers from my browser, leaving out only the cookie part, but the response cookie is still missing the 'ubid-acbcn' part. Maybe some other parameter has to be set up?
You're trying to get cookies from a simple "nameless" GET request. But if you send it "on behalf" of a Session, you get the required ubid-acbcn value:
session = requests.Session()
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = session.get(homepage,headers = headers)
cookies = response.cookies
print(cookies.get_dict())
Output:
{'ubid-acbcn': '456-2652288-5841140' ...}
The cookies being set come from other pages/resources, probably loaded by JavaScript code. So you probably need to use the Selenium web driver for it. Check out the link below for a detailed discussion:
not getting all cookie info using python requests module
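If you do go the Selenium route, a rough sketch of transferring the browser's cookies into a requests session (assumes Chrome and the selenium package are installed):
from selenium import webdriver
import requests

driver = webdriver.Chrome()
driver.get("http://www.amazon.cn")       # JavaScript runs, so all cookies get set
browser_cookies = driver.get_cookies()   # list of dicts: name, value, domain, ...
driver.quit()

s = requests.Session()
for c in browser_cookies:
    s.cookies.set(c["name"], c["value"], domain=c.get("domain"))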
I am trying to learn Python, but I have no knowledge about HTTP. I read some posts here about how to use requests to log in to a web site, but it doesn't work. My simple code is here (not the real number and password):
#!/usr/bin/env python3
import requests
login_data = {'txtDID': '111111111',
              'txtPswd': 'mypassword'}
with requests.Session() as c:
    c.post('http://phone.ipkall.com/login.asp', data=login_data)
    r = c.get('http://phone.ipkall.com/update.asp')
    print(r.text)

print("Done")
But I can't get my personal information, which should be shown after login. Can anyone give me some hints, or point me in a direction? I have no idea what's going wrong.
Servers don't like bots (scripts), for security reasons, so your script has to behave like a human using a real browser. First use get() to obtain the session cookies, and set the user-agent in the headers to a real one. Use http://httpbin.org/headers to see what user-agent is sent by your browser.
Always check the results r.status_code and r.url.
So you can start with this:
(I don't have an account on this server, so I can't test it.)
#!/usr/bin/env python3
import requests
s = requests.Session()
s.headers.update({
    'User-agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0",
})
# --------
# to get cookies, session ID, etc.
r = s.get('http://phone.ipkall.com/login.asp')
print( r.status_code, r.url )
# --------
login_data = {
    'txtDID': '111111111',
    'txtPswd': 'mypassword',
    'submit1': 'Submit'
}
r = s.post('http://phone.ipkall.com/process.asp?action=verify', data=login_data)
print( r.status_code, r.url )
# --------
BTW: If the page uses JavaScript, you have a problem, because requests can't run JavaScript on the page.
I would like to write a program that changes my user agent string.
How can I do this in Python?
I assume you mean a user-agent string in an HTTP request? This is just an HTTP header that gets sent along with your request.
Using Python's urllib2:
import urllib2
url = 'http://foo.com/'
# add a header to define a custom User-Agent
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
req = urllib2.Request(url, headers=headers)  # no data argument, so the request stays a GET
response = urllib2.urlopen(req).read()
In urllib, it's done like this:
import urllib
class AppURLopener(urllib.FancyURLopener):
    version = "MyStrangeUserAgent"

urllib._urlopener = AppURLopener()
and then just use urllib.urlopen normally. In urllib2, use req = urllib2.Request(...) with a headers=somedict parameter to set all the headers you want (including the user agent) in the new request object req, and then call urllib2.urlopen(req).
Other ways of sending HTTP requests have other ways of specifying headers, of course.
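For instance, a minimal sketch of the urllib2 form just described (the URL is a placeholder):
import urllib2

req = urllib2.Request('http://example.com/',
                      headers={'User-Agent': 'MyStrangeUserAgent'})
response = urllib2.urlopen(req).read()  # no data argument, so this stays a GET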
In Python you can use urllib to download web pages and override the version attribute to change the user-agent.
There is a very good example on http://wolfprojects.altervista.org/changeua.php
Here is an example copied from that page:
>>> from urllib import FancyURLopener
>>> class MyOpener(FancyURLopener):
...     version = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '
...                'Gecko/20071127 Firefox/2.0.0.11')
...
>>> myopener = MyOpener()
>>> page = myopener.open('http://www.google.com/search?q=python')
>>> page.read()
[…]Results <b>1</b> - <b>10</b> of about <b>81,800,000</b> for <b>python</b>[…]
urllib2 is nice because it's built in, but I tend to use mechanize when I have the choice. It extends a lot of urllib2's functionality (though much of it has been added to Python in recent years). Anyhow, if it's what you're using, here's an example from their docs on how you'd change the user-agent string:
import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]
Best of luck.
As mentioned in the answers above, the user-agent field in the HTTP request header can be changed using built-in modules in Python such as urllib2. At the same time, it is also important to analyze what exactly the web server sees. A recent post on user agent detection gives sample code and output describing what the web server sees when a programmatic request is sent.
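One easy way to check what the server actually receives is to ask an echo service; for example, a quick sketch against httpbin.org (mentioned in another answer above):
import requests

headers = {'User-Agent': 'MyProgram/0.1'}
r = requests.get('https://httpbin.org/headers', headers=headers)
print(r.json()['headers']['User-Agent'])  # echoes back what was actually sent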
If you want to change the user agent string you send when opening web pages, google around for a Firefox plugin. ;) For example, I found this one. Or you could write a proxy server in Python, which changes all your requests independently of the browser.
My point is, changing the string is going to be the easy part; your first question should be: where do I need to change it? If you already know that (at the browser? the proxy server? on the router between you and the web servers you're hitting?), we can probably be more helpful. Or, if you're just doing this inside a script, go with any of the urllib answers. ;)
Updated for Python 3.2 (py3k):
import urllib.request
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
url = 'http://www.google.com'
request = urllib.request.Request(url, headers=headers)  # no data argument, so the request stays a GET
response = urllib.request.urlopen(request).read()