I am trying to get cookies from a website. After reading a bit on this topic in other StackOverflow posts, I came up with the code below, because the other snippets I found did not work either.
import requests
s = requests.Session()
print(s.get("https://instagram.com").cookies.get_dict())
Unfortunately, it returns an empty dictionary.
I already tried browser_cookie3, but it either did not work or it does not support Safari.
Am I missing something important?
You are getting the cookies from your requests.Session(). The request will not execute any JavaScript code, so there are no JavaScript-set cookies to read. That is why you are getting an empty dictionary.
If the cookies were set on the server side itself, you would be able to read them.
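For example, here is just a sketch against httpbin.org, whose /cookies/set endpoint sets a cookie purely through a Set-Cookie response header; a plain Session picks it up without running any JavaScript:
import requests

# httpbin.org sets this cookie server-side via a Set-Cookie header,
# so no JavaScript is needed for the session to receive it.
with requests.Session() as s:
    s.get("https://httpbin.org/cookies/set/example/1", allow_redirects=False)
    print(s.cookies.get_dict())  # expected: {'example': '1'}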
By the way, browser_cookie3 currently supports Chrome, Firefox, Opera, Edge, and Chromium, so Safari is not on that list.
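If you log in with one of those supported browsers instead of Safari, something along these lines should work; this is only a sketch and assumes Chrome has already stored cookies for instagram.com:
import browser_cookie3
import requests

# Read the cookies Chrome has already saved on disk, filtered to the domain.
cj = browser_cookie3.chrome(domain_name="instagram.com")
print(requests.utils.dict_from_cookiejar(cj))

# The cookie jar can be passed straight to requests.
response = requests.get("https://www.instagram.com", cookies=cj)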
I've already seen multiple posts on Stackoverflow regarding this. However, some of the answers are outdated (such as using PhantomJS) and others didn't work for me.
I'm using selenium to scrape a few sports websites for their data. However, every time I try to scrape these sites, a few of them block me because they know I'm using chromedriver. I'm not sending very many requests at all, and I'm also using a VPN. I know the issue is with chromedriver because anytime I stop running my code but try opening these sites on chromedriver, I'm still blocked. However, when I open them in my default web browser, I can access them perfectly fine.
So, I wanted to know if anyone has any suggestions of how to avoid getting blocked from these sites when scraping them in selenium. I've already tried changing the '$cdc...' variable within the chromedriver, but that didn't work. I would greatly appreciate any ideas, thanks!
Obviously they can tell you're not using a common browser. Could it have something to do with the User Agent?
Try it out with something like Postman and see what the responses are. Try messing with the user agent and other request fields. Look at the request headers when you access the site with a regular browser (like Chrome) and try to spoof those.
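For the Selenium side, here is a minimal sketch of overriding the user agent through Chrome options; the UA string below is only an example, so copy the one your real browser actually sends:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Example UA string - replace it with the one your normal browser reports.
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.execute_script("return navigator.userAgent"))  # verify the spoofed UA
driver.quit()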
Edit: just remembered this and realized the page might be performing some checks in JS and whatnot. It's worth looking into what happens when you block JS on the site with a regular browser.
I am confused on this particular topic. I built a bot for two different websites using Python's requests module to manually simulate sending HTTP POST and GET requests.
I implemented SOCKS proxies and also used user agents in my requests, as well as referrer URLs when necessary (I verified the actual requests a browser sends on these sites using Burp Suite), in order to make them look genuine.
However, any accounts I run through my bots keep getting suspended. It got me wondering what I'm doing wrong. A friend suggested that maybe I should use one of the headless solutions (PhantomJS), and I am leaning towards that route, but I am still confused and would like to know what the difference is between using the HTTP requests module and using a headless browser like PhantomJS.
I am not sure if there is any need to paste my source code here. I am just looking for some direction on this project. Thank you for taking the time to read such a long wall of text :)
You probably have to set cookies.
To make your requests look more genuine, you should set other headers such as Host and Referer. However, the Cookie header should change every time. You can get the cookies this way:
from requests import Session

with Session() as session:
    # Send a first request to get the cookies.
    response = session.get('your_url', headers=your_headers, proxies=proxies)  # eventually add the params keyword
    cookies = response.cookies.get_dict()
    # Reuse those cookies on the next request.
    response = session.get('your_url', headers=your_headers, cookies=cookies, proxies=proxies)
Or maybe the site is scanning for bots in some way.
In this case, you could try adding a delay between requests with time.sleep(); you can see the real timings in your browser's Dev Tools. Alternatively, you could emulate all the requests your browser sends when you connect to the site, such as AJAX calls, etc.
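As a rough sketch of the delay idea (the URLs and the 2-5 second range are placeholders; tune them to the timings you actually see in Dev Tools):
import random
import time

import requests

urls_to_fetch = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

with requests.Session() as session:
    for url in urls_to_fetch:
        response = session.get(url)
        # Randomized pause so the request pattern looks less mechanical.
        time.sleep(random.uniform(2.0, 5.0))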
In my experience, using requests or using Selenium webdrivers doesn't make much difference in terms of detection, because with plain Selenium you still can't access request and response headers or data. Also, note that PhantomJS is no longer supported; it's preferred to use headless Chrome instead.
If none of the requests-based approaches work, I suggest using Selenium Wire or Mobilenium, modified versions of Selenium that allow access to request and response data.
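With Selenium Wire, for instance, the captured traffic is exposed on the driver object. This is only a sketch; the package is installed as selenium-wire and imported as seleniumwire:
from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://example.com")

# Every request the browser made, with its response attached once available.
for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code)

driver.quit()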
Hope it helps.
I've been writing automated tests with Selenium WebDriver 2.45 in Python. To get through some of the things I need to test, I must retrieve the various JSESSION cookies generated by the site. When I use WebDriver's get_cookies() function with Firefox or Chrome, all of the needed cookies are returned to me. When I do the same thing with IE11, I do not see the cookies that I need. Does anyone know how I can retrieve session cookies from IE?
What you describe sounds like an issue I ran into a few months ago. My tests ran fine with Chrome and Firefox but not in IE, and the problem was cookies. Upon investigation what I found is that my web site had set its session cookies to be HTTP-only. When a cookie has this flag turned on, the browser will send the cookie over the HTTP(S) protocol and allow it to be set by the server in responses but it will make the cookie inaccessible to JavaScript. (Which is consistent with your comment that you cannot see the cookies you want in document.cookie.) It so happens that when you use Selenium with Chrome or Firefox, Selenium is able to ignore this flag and obtain the cookies from the browser anyway. However, it cannot do the same with IE.
I worked around this issue by turning off the HTTP-only flag when running my site in testing mode. I use Django for my server so I had to create a special test_settings.py file with SESSION_COOKIE_HTTPONLY = False in it.
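For reference, the override itself is tiny. This is just a sketch; the import path depends on where your project's base settings module actually lives:
# test_settings.py
from myproject.settings import *  # noqa: F401,F403  -- "myproject" is a placeholder

# Keep session cookies HTTP-only in production; only the test settings relax
# this so Selenium (and IE) can read them.
SESSION_COOKIE_HTTPONLY = False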
There is an open issue with IE and Safari: those drivers will not return correct cookie information, or at least not the domain. See this
I am using mechanize to retrieve data from many web sites. When I tried to log into www.douban.com, I found that a lot of cookies were not set even when I logged in successfully. Finally, I found that they came from Google Analytics and were set by JavaScript. However, mechanize cannot handle JavaScript, so how can I get these cookies? Without these cookies I still cannot visit www.douban.com.
PhantomJS is a headless WebKit-based client supporting all the bells and whistles, JavaScript included. It had a Python API (PyPhantomJS) which was unfortunately removed due to lack of a maintainer. You may still want to take a look.
Sorry to say it, but unless your crawler knows how to run JavaScript code, you are unable to fetch cookies set by JavaScript.
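If moving off mechanize is an option, any JavaScript-capable driver will expose those cookies after the page's scripts have run. Here is a sketch with Selenium and headless Chrome, which is just one possible stand-in for PhantomJS and not something the answers above prescribe:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://www.douban.com")
# get_cookies() returns cookies set by the server as well as by JavaScript.
for cookie in driver.get_cookies():
    print(cookie["name"], cookie["value"])

driver.quit()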
When I log into a page in my browser, I get 3 cookies: tips, ipb_member_id and ip_pass_hash. I need those last two to access some pages I can only see when logged in. When I log in via the browser it works fine, but under mechanize I only get the tips cookie.
Are there any flags I have to set up for this to work, or is there any module I might need? I can't link to the page here. Though I do know Python's Mechanize + cookielib stores the cookies correctly, since I already have a working version for it.
I am working on the same issue (I want to get all cookies loaded on a page).
I think it's impossible with mechanize. One reason is that it doesn't support JavaScript, so anything even a little bit complex (such as an image loaded on a JS event which sets a new cookie) will not work.
I am considering other options such as WebKit: http://stackoverflow.com/questions/4730906/automating-chrome
If you find a good way to gather all the cookies, let me know :)