When I use the python requests module, calling requests.get(url), I have found that the response from the url is being truncated.
import requests
url = 'https://gtfsrt.api.translink.com.au/Feed/SEQ'
response = requests.get(url)
print(response.text)
The response I get from the URL is being truncated. Is there a way to get requests to retrieve the full set of data and not truncate it?
Note: The given URL is a public transport feed which puts out a huge quantity of data during the peak of the day.
I ran into the same issue. The problem is not your Python code. It is probably PyCharm, or whatever tool you are using to run the script: the console has a buffer limit. You may have to increase that limit to see the full response.
Refer to this article for more help:
Increase output buffer when running or debugging in PyCharm
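One way to check whether the console, rather than requests, is cutting off the output is to write the body to a file and look at its size. This is only a minimal sketch: the helper name save_body and the file name used below are hypothetical, not part of requests.

```python
from pathlib import Path


def save_body(content, path):
    """Write the raw response bytes to path and return the number of bytes written."""
    return Path(path).write_bytes(content)


# Stand-in for response.content so the sketch runs without a network call.
sample = b"x" * 100000
print(save_body(sample, "feed_dump.bin"), "bytes written")
```

With the code from the question this would be called as save_body(response.content, 'feed_dump.bin'). If the file holds the complete feed, the truncation is happening in the console.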
Add "gzip":true to your request options.
That fixed the issue for me.
I'm trying to make one simple request:
import requests
from fake_useragent import UserAgent
ua = UserAgent()
req = requests.get('https://www.casasbahia.com.br/', headers={'User-Agent': ua.random})
I would understand if I received <Response [403]> or something like that, but instead I receive nothing; the code keeps running with no response.
using logging I see:
I know I could use a timeout to avoid keeping the code running, but I just want to understand why I don't get a response.
thanks in advance
I never used this API before, but from what I researched on here just now, there are sites that can block requests from fake users.
So, for reproducing this example on my PC, I installed fake_useragent and requests modules on my Python 3.10, and tried to execute your script. It turns out that with my Authentic UserAgent string, the request can be done. When printed on the console, req.text shows the entire HTML file received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably developed to detect and reject requests from fake agents (or bots).
Though again, this is just a theory. I have no way to access this site's server files to confirm it.
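To compare user agents without the script hanging, the request can be given a timeout. The sketch below is only an illustration: probe is a hypothetical helper, and its session argument is assumed to behave like requests.Session.

```python
def probe(session, url, user_agent, timeout=10):
    """GET url with a fixed User-Agent header and return the HTTP status code.

    session is assumed to behave like requests.Session (hypothetical helper).
    timeout is in seconds; with it, a silent server causes an exception
    instead of an indefinite wait.
    """
    response = session.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    return response.status_code
```

It could be called as probe(requests.Session(), 'https://www.casasbahia.com.br/', some_user_agent_string), once with a real browser string and once with ua.random. If the server does not answer within the timeout, requests raises requests.exceptions.Timeout.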
I have a database of thousands of files online, and I want to check what their status is (e.g. if the file exists, if it sends us to a 404, etc.) and update this in my database.
I've used urllib.request to download files to a python script. However, obviously downloading terabytes of files is going to take a long time. Parallelizing the process would help, but ultimately I just don't want to download all the data, just check the status. Is there an ideal way to check (using urllib or another package) the HTTP response code of a certain URL?
Additionally, if I can get the file size from the server (which would be in the HTTP response), then I can also update this in my database.
If your web server is standards-based, you can use a HEAD request instead of a GET. It returns the same status without actually fetching the page.
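A minimal sketch of the HEAD approach with urllib.request, which the question already uses. The function name check_url is hypothetical, and the size is only available when the server sends a Content-Length header.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10):
    """Send a HEAD request and return (status_code, content_length).

    content_length is None when the server sends no Content-Length header.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            length = response.headers.get("Content-Length")
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx, but the exception still carries status and headers.
        status = error.code
        length = error.headers.get("Content-Length")
    return status, (int(length) if length is not None else None)
```

Connection failures (DNS errors, refused connections) are raised as urllib.error.URLError and are deliberately not caught here, so the caller can record them separately from HTTP status codes.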
The requests module can check the status response of a request.
Just do:
import requests
url = 'https://www.google.com' # Change to your link
response = requests.get(url)
print(response.status_code)
This code prints 200 for me, so the request was successful.
I am trying to download torrent file from this code :
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
open('test123.torrent', 'wb').write(r.content)
It downloads a torrent file, but when I load it into BitTorrent an error occurs.
It says: Unable to Load, Torrent Is Not Valid Bencoding
Can anybody please help me resolve this problem? Thanks in advance.
This page uses Cloudflare to prevent scraping. Bypassing Cloudflare is very hard if you only use requests, and the measures Cloudflare takes are updated often. The page checks whether your browser supports JavaScript; if it does not, the server will not send you the bytes of the file, which is why the file you saved cannot be loaded. (You can look at r.text to see the response content: it is an HTML page, not a torrent file.)
In this situation, I think you should consider using Selenium.
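Before saving the download, it can help to check what was actually received. A .torrent file is a bencoded dictionary, so it starts with the byte d and ends with the byte e. The two helpers below are hypothetical and only a rough heuristic, not a full bencoding parser.

```python
def looks_like_torrent(content):
    """Heuristic: bencoded dictionaries start with b'd' and end with b'e'."""
    return content.startswith(b"d") and content.endswith(b"e")


def looks_like_html(content):
    """Heuristic: the body starts with a doctype or an <html> tag."""
    head = content[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


print(looks_like_torrent(b"d8:announce3:urle"))
print(looks_like_html(b"<!DOCTYPE html><html></html>"))
```

With the code from the question, looks_like_html(r.content) returning True would confirm that an HTML page was saved instead of the torrent.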
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it. Please don't forget that your code may break in the future because Cloudflare changes their techniques periodically. Well, if you use the library, you will just need to update the library (at least you should hope for that).
I used a similar library only in NodeJS, but I see python also has something like that - cloudscraper
Example:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text) # => "<!DOCTYPE html><html><head>..."
Depending on your usage you may need to consider using proxies - CloudFlare can still block you if you send too many requests.
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It is a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
We can do this by adding cookies to the headers.
But the cookie expires after some time.
Therefore the only reliable solution is to download the file by opening a browser.
I need to get the live stream url using a scripting language such as python or shell
eg: http://rt.com/on-air/
I can get the URL by using a tool such as the network monitor in Firefox, but I need to be able to get it via a script.
After a quick look at the requests documentation:
import requests
from contextlib import closing
with closing(requests.get('http://rt.com/on-air/', stream=True)) as r:
    # Do things with the response here.
    print(r.status_code)
If it doesn't help, please check another way:
import requests
r = requests.get('http://rt.com/on-air/', stream=True)
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        # do something with each line
        print(line)
You need to identify it in the source of the page. It is pretty much the same as using the network tool from FF.
For python you can use beautifulsoup to parse the page and get more info out of it... or a simple regex.
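A minimal sketch of the regex route. It assumes, without having checked this particular page, that the stream is an HLS playlist whose URL ends in .m3u8 and appears unescaped in the page source. The function name and the sample HTML are hypothetical.

```python
import re

# Assumption: the stream URL ends in .m3u8 and is present verbatim in the HTML.
STREAM_URL = re.compile(r"""https?://[^\s"'<>]+\.m3u8[^\s"'<>]*""")


def find_stream_urls(html):
    """Return the distinct .m3u8 URLs found in html, in order of appearance."""
    return list(dict.fromkeys(STREAM_URL.findall(html)))


sample_html = '<video src="https://example.com/live/index.m3u8?token=abc"></video>'
print(find_stream_urls(sample_html))
```

With requests, the page source would be passed in as find_stream_urls(requests.get(page_url).text). If the URL is built by JavaScript at run time or is escaped inside a JSON blob, this pattern will not find it.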
I have been having problems with a script I am developing whereby I am receiving no output and the memory usage of the script is getting larger and larger over time. I have figured out the problem lies with some of the URLs I am checking with the Requests library. I am expecting to download a webpage however I download a large file instead. All this data is then stored in memory causing my issues.
What I want to know is: is there any way with the requests library to check what is being downloaded? With wget I can see: Length: 710330974 (677M) [application/zip].
Is this information available in the headers with requests? If so is there a way of terminating the download upon figuring out it is not a HTML webpage?
Thanks in advance.
Yes, the headers can tell you a lot about the page; most pages will include a Content-Length header.
By default, however, the response is downloaded in its entirety before the .get() or .post(), etc. call returns. Set the stream=True keyword to defer loading the response body:
response = requests.get(url, stream=True)
Now you can inspect the headers and just discard the response if you don't like what you find:
length = int(response.headers.get('Content-Length', 0))
if length > 1048576:
    print('Response larger than 1MB, discarding')
    response.close()  # release the connection without reading the body
Subsequently accessing the .content or .text attributes, or the .json() method will trigger a full download of the response.
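Putting the pieces together, a sketch that also checks Content-Type, since the question is about skipping anything that is not an HTML page. fetch_html is a hypothetical helper and its session argument is assumed to behave like requests.Session.

```python
MAX_BYTES = 1048576  # 1 MB, the same limit as in the snippet above


def fetch_html(session, url, max_bytes=MAX_BYTES):
    """Return the page text, or None if the response is not HTML or is too large.

    session is assumed to behave like requests.Session (hypothetical helper).
    """
    response = session.get(url, stream=True)  # only the headers are read here
    try:
        content_type = response.headers.get("Content-Type", "")
        length = int(response.headers.get("Content-Length", 0))
        if "text/html" not in content_type or length > max_bytes:
            return None
        return response.text  # accessing .text downloads the body
    finally:
        response.close()
```

It would be called as fetch_html(requests.Session(), url). A server that omits Content-Length passes the size check, so reading the body in chunks with iter_content and stopping at a byte limit may be a safer choice for untrusted URLs.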