I am fairly proficient in Python and have started exploring the requests library to formulate simple HTTP requests. I have also taken a look at Session objects, which allow me to log in to a website and, using the session key, continue to interact with the website through my account.
Here comes my problem: I am trying to build a simple API in Python to perform certain actions that I would be able to do via the website. However, I do not know what certain HTTP requests need to look like in order to implement them via the requests library.
In general, when I know how to perform a task via the website, how can I identify:
the type of HTTP request (GET or POST will suffice in my case)
the URL, i.e. where the resource is located on the server
the body parameters that I need to specify for the request to be successful
This has nothing to do with Python itself, but you can use an intercepting proxy to examine your requests.
Download an intercepting proxy like Burp Suite
Set up your browser to route all traffic through Burp Suite (the default listener is localhost:8080)
Deactivate packet interception (in the Proxy tab)
Browse to your target website normally
Examine the request history in Burp Suite. You will find all the information you need: the request method, the URL, and the body parameters. A sketch of turning that into a requests call follows below.
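As a minimal sketch, not a definitive recipe: once Burp Suite shows you the method, URL, and body of the login request, you can reproduce it with the requests library. The URL and field names below are placeholders for whatever you actually see in the proxy history.

```python
import requests

session = requests.Session()

# POST the same body parameters you observed for the login request in Burp
session.post(
    "https://example.com/login",
    data={"username": "me", "password": "secret"},
)

# The Session keeps the cookies set by the login, so further requests
# are made as your logged-in account
profile = session.get("https://example.com/api/profile")
print(profile.status_code, profile.text[:200])
```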
So, the question is pretty straightforward: on my dedicated server I have a /26 subnet (62 usable IPs), and I want to somehow tell urllib which IP it should use for any given request.
I don't want to use any third-party library, just urllib, since I made my own session/request module based on urllib.
The problem looks like this: my employees have accounts on different websites where they receive notifications. We made an app that is a wrapper over every website; when they receive a notification on a website, we forward it to our services and then show it in our GUI.
If a login to one of the websites fails, we cannot log in with any account for the next 30 minutes. I want to be able to switch to another IP address and continue with the next account. The failed account will be flagged so we never try to log in with it until it is unflagged, or we may try 3-4 times in a row to really confirm the flag.
Can I tell urllib which IP it should use for requests?
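For what it's worth, here is a minimal sketch of one way to do this with only the standard library: http.client connections accept a source_address, and a custom connection class can be plugged into a urllib opener. The 203.0.113.10 address is a placeholder for one of the IPs in your /26; for HTTPS you would do the same with HTTPSHandler/HTTPSConnection.

```python
import http.client
import urllib.request

class BoundHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that binds its socket to a chosen local IP."""
    def __init__(self, *args, source_ip=None, **kwargs):
        # source_address=(ip, 0) lets the OS pick an ephemeral local port
        kwargs["source_address"] = (source_ip, 0)
        super().__init__(*args, **kwargs)

class BoundHTTPHandler(urllib.request.HTTPHandler):
    def __init__(self, source_ip):
        super().__init__()
        self.source_ip = source_ip

    def http_open(self, req):
        # do_open forwards extra keyword arguments to the connection class
        return self.do_open(BoundHTTPConnection, req, source_ip=self.source_ip)

# Placeholder address; use one of the addresses in your /26
opener = urllib.request.build_opener(BoundHTTPHandler("203.0.113.10"))
response = opener.open("http://example.com/")
print(response.status)
```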
I have a use case where I need to get data from a specific site that requires requests to go through a session every time. I have created the session in Python, and cookies are set which contain my logged-in details.
I am currently hosting my script in a data center, but the account is getting blocked. I am thinking of requesting the data via a proxy, but if my session is created from a different machine and a proxy is used to fetch data through that session, what are the chances that the proxy IP is going to be blacklisted?
What are the possible solutions to address this kind of problem?
The behaviour of a Python Requests session differs across regions.
For example, in some countries certain headers are not permitted in the request due to local laws.
Since you are already creating a logged-in session, your user-agent and other headers are being set according to the response from the login request.
One of the solutions might be to use a proxy in a country which does not have strict rules about data extraction from that platform.
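A minimal sketch of routing an existing requests session through such a proxy; the proxy endpoint, URLs, and form fields are placeholders, and whether the proxy IP eventually gets blacklisted still depends on the site's rate limits and detection.

```python
import requests

session = requests.Session()

# Placeholder proxy endpoint; substitute the proxy you use in the target country
proxy = "http://user:pass@proxy.example.com:3128"
session.proxies.update({"http": proxy, "https": proxy})

# Log in through the proxy so the session cookies are tied to that exit IP
session.post("https://example.com/login",
             data={"email": "me@example.com", "password": "secret"})

# Subsequent requests reuse both the cookies and the proxy
data = session.get("https://example.com/account/data")
print(data.status_code)
```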
I want to access the browser name and version in Python by sending out a request. Is this the ideal method, or is there another way? All the methods which provide user agents give Python-urllib/2.7 as the user agent; I want my actual user agent.
I'll assume you're familiar with HTTP requests and their structure; if not, here's a link to the RFC documentation for HTTP/1.1 requests. At the bottom of the page there is a list of links to the header fields.
The user-agent is a field in the HTTP request header that identifies the entity that sends the request; by entity I mean the program you used to send the request, running on your machine. Things like the browser type, version, and operating system are sent in the user-agent field.
So, when you use urllib.request to send a request, urllib fills the HTTP request headers with the values you provide to it; otherwise, default values are used. That's why you get Python-urllib/2.7 as the user-agent.
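For example, a minimal sketch of overriding that default with urllib.request (the user-agent string here is just an illustrative placeholder):

```python
import urllib.request

req = urllib.request.Request(
    "https://httpbin.org/user-agent",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},  # placeholder value
)
with urllib.request.urlopen(req) as resp:
    # httpbin echoes back the user-agent it received
    print(resp.read().decode())
```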
If you need the user-agent of a specific browser, you need to send the request using that browser. You can do that in Python with a browser automation tool like Selenium WebDriver, which you can use to launch an instance of your browser and go to websites.
I've worked only with Selenium WebDriver, and it doesn't have the capability to inspect sent/received packets/requests; in other words, you can't get HTTP requests/responses directly from Selenium.
As a workaround, you can use Selenium (or any other automation tool) to launch your browser, then go to a website that will give you your user-agent and may even parse it for you.
Here's a link to the Selenium documentation; it explains how to get started with Selenium and how to download the required packages.
If you search Google for "what is my user agent", Google will tell you what your user agent is.
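A minimal sketch of that workaround with Selenium; it assumes a matching driver binary (e.g. chromedriver) is installed and on your PATH.

```python
from selenium import webdriver

driver = webdriver.Chrome()  # or webdriver.Firefox(), etc.
try:
    # navigator.userAgent is the string the real browser sends with its requests
    user_agent = driver.execute_script("return navigator.userAgent;")
    print(user_agent)
finally:
    driver.quit()
```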
I have an application with many users; some of these users have an account on an external website with data I want to scrape.
This external site has a members area protected with an email/password form. This sets some cookies when submitted (a couple of ASP ones). You can then pull up the needed page and grab the data the external site holds for the user who just logged in.
The external site has no API.
I envisage my application asking users for their credentials to the external site, logging in on their behalf and grabbing the data we want.
How would I go about this in Python, i.e. do I need to run a GUI web browser on the server that Python prods to handle the cookies (I'd rather not)?
Find the call the page makes to the backend by inspecting the format of the login request in your browser's inspector.
Make the same request after collecting the user's credentials, either with getpass from the terminal or via a GUI. You can use urllib2 to make the requests.
Save all the cookies from the response in a cookiejar.
Reuse the cookies in subsequent requests and fetch data.
Then, profit.
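A minimal sketch of those steps, using the Python 3 equivalents of urllib2 (urllib.request plus http.cookiejar); the login URL and form field names are assumptions you would replace with what the browser inspector shows.

```python
import getpass
import http.cookiejar
import urllib.parse
import urllib.request

# The cookie jar stores whatever cookies the login response sets
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

credentials = urllib.parse.urlencode({
    "email": input("Email: "),
    "password": getpass.getpass("Password: "),
}).encode()

# Make the same call the login form makes; the cookies land in the jar
opener.open("https://members.example.com/login", data=credentials)

# The cookies are reused automatically for subsequent requests
page = opener.open("https://members.example.com/data").read().decode()
print(page[:200])
```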
Usually, this is performed with a session.
I recommend using the requests library (http://docs.python-requests.org/en/latest/) to do that.
You can use the Session feature (http://docs.python-requests.org/en/latest/user/advanced/#session-objects). Simply perform an authentication HTTP request (the URL and parameters depend on the website you want to request), and then perform a request for the resource you want to scrape.
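As a sketch of what that typically looks like (the login URL and parameter names below are assumptions, since they depend on the website):

```python
import requests

session = requests.Session()

# Authentication request; the session stores the cookies it sets
session.post("https://external-site.example.com/login",
             data={"email": "user@example.com", "password": "secret"})

# Request for the resource you want to scrape, sent with those cookies
resource = session.get("https://external-site.example.com/members/data")
print(resource.text[:200])
```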
Without further information, we cannot help you more.
The site that I'm trying to scrape uses JavaScript to create a cookie. What I was thinking was that I could create that cookie in Python and then use it to scrape the site. However, I don't know any way of doing that. Does anybody have any ideas?
Please see Python httplib2 - Handling Cookies in HTTP Form Posts for an example of adding a cookie to a request.
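The linked example uses httplib2; here is a minimal sketch of the same idea with the requests library instead, assuming you have worked out (e.g. by reading the page's JavaScript or inspecting your browser) the name and value of the cookie the JS would create.

```python
import requests

session = requests.Session()

# "js_token" and its value are placeholders for the cookie the page's JS sets
session.cookies.set("js_token", "abc123", domain="example.com", path="/")

# The cookie is sent along with every request made through this session
response = session.get("https://example.com/protected-page")
print(response.status_code)
```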
I often need to automate tasks in web based applications. I like to do this at the protocol level by simulating a real user's interactions via HTTP. Python comes with two built-in modules for this: urllib (higher level Web interface) and httplib (lower level HTTP interface).
If you want to do more involved browser emulation (including setting cookies), take a look at mechanize. Its simulation capabilities are almost complete (no JavaScript support, unfortunately); I've used it to build several scrapers with much success.
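A minimal sketch of a mechanize-based login, assuming the site's login form is the first form on the page and has fields named "email" and "password":

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # many members-only areas disallow robots.txt
br.open("https://members.example.com/login")

br.select_form(nr=0)          # pick the first form on the page
br["email"] = "user@example.com"
br["password"] = "secret"
br.submit()                   # the Browser keeps the session cookies for you

data = br.open("https://members.example.com/data").read()
print(data[:200])
```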