Hello, I am currently trying to solve this problem. How would one go about getting a specific resource, e.g. "resources/2021/helloworld", from a web service at a specific URL (example.com/example) using Python? I also need to specify a user agent (Chrome on Android in this case) and a link it was referred by (i.e. google.com/resourcelink), and then preferably print the text of the resource or write it to a file.
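A minimal sketch of one way to do this with the requests library (the URL and referrer are the examples from the question; the Chrome-on-Android User-Agent string is an illustrative value, not an exact one):

    import requests

    headers = {
        # Illustrative Chrome-on-Android user agent; substitute a current one.
        "User-Agent": ("Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36"),
        # The page this request should appear to be referred by.
        "Referer": "https://google.com/resourcelink",
    }

    response = requests.get(
        "https://example.com/example/resources/2021/helloworld", headers=headers)
    response.raise_for_status()          # fail loudly on 4xx/5xx

    print(response.text)                 # print the resource text...
    with open("helloworld.txt", "w", encoding="utf-8") as f:
        f.write(response.text)           # ...or write it to a file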
Hey, I'm trying to create something like Fiddler's AutoResponder.
I want to replace the content of one specific URL with another, for example: replace the content of https://randomdomain.com with https://cliqqi.ml.
I've researched everywhere but couldn't find anything. I tried creating a proxy server, a Node.js script, and a Python script. Nothing worked.
PS: I'm doing this because I want to intercept an Electron app fetching the main game file from a site, and redirect that request to my own site.
If you're looking for a way to do this programmatically in Node.js, I've written a library you can use to do exactly that, called Mockttp.
You can use Mockttp to create an HTTPS-intercepting & rewriting proxy, which will allow you to send mock responses directly, redirect traffic from one address to another, rewrite anything including the headers & body of existing traffic, or just log everything that's sent & received. There's a full guide here: https://httptoolkit.tech/blog/javascript-mitm-proxy-mockttp/
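If you'd rather do the same rewrite in Python, a mitmproxy addon is another option; a minimal sketch, using the host names from the question (run with mitmdump -s rewrite.py, with mitmproxy's CA certificate trusted by the client):

    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        # Send any request aimed at randomdomain.com to cliqqi.ml instead.
        if flow.request.pretty_host == "randomdomain.com":
            flow.request.host = "cliqqi.ml"
            flow.request.scheme = "https"
            flow.request.port = 443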
I want to write a program that returns the current or most recently visited URL in my browser on my computer (Windows 10). Is there any way I can get that URL?
I tried using Python and SQLite to access the Chrome history database at C:\Users\%USERNAME%\AppData\Local\Google\Chrome\User Data\Default\History, and it worked, but if I'm using the browser, the database gets locked.
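For reference, a sketch of that Python-and-SQLite approach; copying the History file first sidesteps the lock, though the copy only reflects history up to that moment (this assumes Chrome's standard urls table with url and last_visit_time columns):

    import os, shutil, sqlite3, tempfile

    history = os.path.expandvars(
        r"C:\Users\%USERNAME%\AppData\Local\Google\Chrome\User Data\Default\History")
    copy = os.path.join(tempfile.gettempdir(), "History_copy")
    shutil.copy2(history, copy)        # Chrome locks the original, not the copy

    con = sqlite3.connect(copy)
    row = con.execute(
        "SELECT url FROM urls ORDER BY last_visit_time DESC LIMIT 1").fetchone()
    con.close()
    print(row[0] if row else "no history found")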
I know that by using Wireshark one can see the packets when accessing a URL, but I cannot find the complete URL in those packet fields, only the server name (i.e. stackoverflow.com).
I'd like to know whether there is a way to see that information the way Wireshark does, but getting only the complete URL, nothing else. Thank you!
I found a solution to this by using mitmproxy: https://mitmproxy.org/. This video on YouTube helped me with the installation and setup process: https://www.youtube.com/watch?v=7BXsaU42yok. The video explains the installation on Mac, but it's not so different from Windows. Then you can use Python to capture and process the URLs contained within the HTTPS requests by using the flow.request.pretty_url property: https://docs.mitmproxy.org/stable/addons-scripting/.
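As a rough sketch, the addon script can be as small as this (save it as log_urls.py and run mitmproxy -s log_urls.py; mitmproxy's CA certificate must be installed so HTTPS can be decrypted):

    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        # pretty_url is the complete URL: scheme, host, path and query string
        print(flow.request.pretty_url)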
I searched the net but couldn't find anything that works.
I am trying to write a Python script which will trigger a timer if a particular URL is opened in the current browser.
How do I obtain the URL from the browser?
You cannot do it in a platform-independent way.
You need to use pywin32 on Windows (or any other suitable module which provides access to the platform API, for example pywm) to access the window (you can get it by its window name). After that you should analyse all its children to get to the window which represents the URL string. Finally you can get the text of that.
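A rough sketch of the first step, assuming pywin32 is installed (pip install pywin32); this only reads the window title, and walking the child controls for the URL text is then done as described above:

    import win32gui

    hwnd = win32gui.GetForegroundWindow()    # handle of the currently active window
    title = win32gui.GetWindowText(hwnd)     # e.g. "Some page - Google Chrome"
    print(title)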
I have a scraper script that pulls binary content off publishers' websites. It's built to replace the manual action of saving hundreds of individual PDF files that colleagues would otherwise have to undertake.
The websites are credential-based, and we have the correct credentials and permissions to collect this content.
I have encountered a website that has the PDF file inside an iframe.
I can extract the content URL from the HTML. When I feed the URL to the content grabber, I collect a small piece of HTML that says: <html><body>Forbidden: Direct file requests are not allowed.</body></html>
I can feed the URL directly to the browser, and the PDF file resolves correctly.
I am assuming that there is a session cookie (or something, I'm not 100% comfortable with the terminology) that gets sent with the request to show that the GET request comes from a live session, not a remote link.
I looked at the referring URL, and saw these different URLs that point to the same article, collected over a day of testing (I have scrubbed identifiers from the URLs):
http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3NjYyMS4wNjU3MzY%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njc3Mi4wOTY3MDM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njg3Ni4yOTc0NDg%3D/elibrary//title/issue/article.pdf
This suggests that there is something unique in the URL that needs to be associated with something else to circumvent the direct-link detector.
Any suggestions on how to get round this problem?
OK. The answer was cookies and headers. I collected the GET header info via HttpFox and made an identical header object in my script, and I grabbed the session ID from request.cookie and sent the cookie with each request.
For good measure I also set the user agent to a known working browser agent, just in case the server was checking agent details.
Works fine.
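For anyone after the same thing, a sketch of that cookies-and-headers approach using Python's requests library (the URLs are placeholders standing in for the scrubbed ones above, and the header values should be copied from a known working browser request):

    import requests

    session = requests.Session()
    session.headers.update({
        # Values copied from a working browser request (HttpFox/dev tools).
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Referer": "http://content_provider.com/elibrary/title/issue/",
    })

    # Request the article page first so the server sets its session cookie;
    # the Session object stores it and resends it with the PDF request.
    session.get("http://content_provider.com/elibrary/title/issue/article")

    pdf = session.get("http://content_provider.com/TOKEN/elibrary/title/issue/article.pdf")
    with open("article.pdf", "wb") as f:
        f.write(pdf.content)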
I am trying to query counts of bookmarks for research papers in CiteULike. I am using the "http://www.citeulike.org/api/posts/for/doi/" URL to request (using the urllib2 library for Python) an XML document which contains information on the bookmarks for a given DOI (a unique identifier for papers). However, I keep getting an HTTP 403 Error: Forbidden.
Does anyone know why I am getting this error? I've tried putting the URL with the DOI in the browser and that returns the XML just fine, so the problem seems related to my automated requests.
Thanks,
Nathanael
You should read http://wiki.citeulike.org/index.php/Importing_and_Exporting#Scripting_CiteULike
If you access CiteULike via an automated process, you MUST provide a means to identify yourself via the User-Agent string. Please use "<username>/<email> <application>", e.g., "fred/fred@wilma.com myscraper/1.0". Any scripting of the site without a means to identify you may result in a block.
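Applied to the question's urllib2 request, that guideline looks something like this (the DOI is a placeholder, and the identification string is the wiki's example format):

    import urllib2

    url = "http://www.citeulike.org/api/posts/for/doi/10.1000/xyz123"  # placeholder DOI
    request = urllib2.Request(url, headers={
        # Identify yourself as the wiki requires: "<username>/<email> <application>"
        "User-Agent": "fred/fred@wilma.com myscraper/1.0",
    })
    print(urllib2.urlopen(request).read())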