Can't download file over HTTP - session is hanging - python

A friend told me that when he visited the following website:
http://tatoochange.com/watch/OGgmlnav-joe-gould-s-secret/vip.html
and played the video, he could see the path of the file in Fiddler (http://85.217.223.24/vids/joe_goulds_secret_2000.mp4).
He tried to download it directly from the browser, but he received an error.
I checked the GET request with Burp while playing the video:
GET /vids/joe_goulds_secret_2000.mp4 HTTP/1.1
Host: 85.217.223.24
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0
Accept: video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5
Accept-Language: en-US,en;q=0.5
Referer: http://entervideo.net/watch/3accec760b23ad4
Range: bytes=0-
Connection: close
I converted it to a Python script:
import requests

session = requests.Session()
headers = {
    "Accept": "video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Referer": "http://entervideo.net/watch/3accec760b23ad4",
    "Connection": "close",
    "Accept-Language": "en-US,en;q=0.5",
    "Range": "bytes=0-"
}
response = session.get("http://85.217.223.24/vids/joe_goulds_secret_2000.mp4", headers=headers)
print("Status code: %i" % response.status_code)
print("Response body: %s" % response.content)
When I run it, it just hangs.
I have no idea whether it is downloading anything or not.
My first question is: why can't I download the file from the browser just by accessing the URL?
Second, why does the script hang even though it doesn't raise any error?

Using session.get() on its own is not advisable for downloading a large file; without stream=True, requests reads the entire response body into memory before returning, which is why your script appears to hang. A plain get() is primarily suited to web calls that receive a small JSON or XML payload. To download large files you should stream the response, following the method shown in this thread:
Download large file in python with requests
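In short, the approach from that thread looks roughly like this (a minimal sketch; the download_file name and the 8192-byte chunk size are illustrative choices, not fixed requirements):
import requests

def download_file(url, file_name):
    # stream=True keeps requests from buffering the whole body in memory
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return file_name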

I managed to do it.
Note that the response status was 206, which is Partial Content: the server is honoring the Range header.
The solution:
import requests

def download():
    headers = {"Accept": "video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5",
               "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
               "Referer": "http://entervideo.net/watch/3accec760b23ad4", "Connection": "close",
               "Accept-Language": "en-US,en;q=0.5", "Range": "bytes=0-"}
    # stream=True defers the download until the content is iterated over
    get_response = requests.get("http://85.217.223.24/vids/joe_goulds_secret_2000.mp4",
                                headers=headers, stream=True)
    # file_name = url.split("/")[-1]
    file_name = r'c:\tmp\joe_goulds_secret_2000.mp4'
    with open(file_name, 'wb') as f:
        count = 0
        for chunk in get_response.iter_content(chunk_size=1024):
            print('chunk: ' + str(count))
            count += 1
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

download()
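If you want to confirm up front what the server is going to send, a small sketch along the same lines (same URL as above; the other request headers are omitted here for brevity):
import requests

url = "http://85.217.223.24/vids/joe_goulds_secret_2000.mp4"
headers = {"Range": "bytes=0-"}  # same Range header as in the question

response = requests.get(url, headers=headers, stream=True)
# a Range request should be answered with 206 Partial Content;
# a plain 200 means the server ignored the Range header
print(response.status_code)
# Content-Length is the number of bytes the server intends to send
print(response.headers.get("Content-Length"))
response.close()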

Related

can't find the right compression for this webpage (python requests.get)

I can load this webpage in Google Chrome, but I can't access it via requests. Any idea what the compression problem is?
Code:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {'Accept-Encoding':'gzip, deflate, compress, br, identity'}
r = requests.get(url, headers=headers)
Result:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
Use a user agent that emulates a browser:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
r = requests.get(url, headers=headers)
You're actually getting a 403 Forbidden error, which you can see using requests.head; the decoding error comes from the 403 error page, whose body is not valid gzip despite the Content-Encoding header. Use RJ's suggestion of a browser User-Agent to get past HuffPost's robot blocking.
>>> requests.head(url)
<Response [403]>
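Putting the two answers together, a minimal sketch (same URL as above; the Chrome User-Agent string is just one example of a browser-like value):
import requests

url = 'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}

# the default python-requests User-Agent is rejected outright
print(requests.head(url).status_code)   # 403
# a browser-like User-Agent gets the real page
r = requests.get(url, headers=headers)
print(r.status_code)                    # 200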

unable to fetch json data - JSONDecodeError: Expecting value

I'm new to Python and struggling with the issue below.
The page URL is https://www.nseindia.com/market-data/equity-derivatives-watch; when we select "Nifty 50 Futures" and inspect the network traffic, we get the API URL https://www.nseindia.com/api/liveEquity-derivatives?index=nse50_fut.
The issue is that this JSON opens fine in the browser, but from Python it fails with a JSONDecodeError. I have included the right headers but it still fails.
One more observation: after I load the API URL directly in the browser, the Python code gets the JSON data once, but it does not work thereafter. I also noticed that a new cookie is set on every page refresh.
Can anyone point out what I'm missing?
Code:
import requests

header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
          "accept-language": "en-US,en;q=0.9", "accept-encoding": "gzip, deflate, br", "accept": "*/*"}
URL = "https://www.nseindia.com/api/liveEquity-derivatives?index=nse50_fut"
fut_json = requests.get(URL, headers=header).json()
print(fut_json)
File "C:\ProgramData\Anaconda3\lib\site-packages\simplejson\decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
JSONDecodeError: Expecting value
You need cookies to get the response as JSON; without them you get a "Resource not found" error.
Here's how:
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
}

with requests.Session() as s:
    # hit the home page first so the session picks up the required cookies
    r = s.get("https://www.nseindia.com", headers=headers)
    api_url = "https://www.nseindia.com/api/liveEquity-derivatives?index=nse50_fut"
    response = s.get(api_url, headers=headers).json()
    print(response["marketStatus"]["marketStatusMessage"])
Output:
Market is Closed
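Since the question mentions that a new cookie is set on every refresh, here is a hedged variation for repeated polling; the retry-on-ValueError logic and the 10-second interval are my assumptions, not part of the original answer:
import time
import requests

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}
API_URL = "https://www.nseindia.com/api/liveEquity-derivatives?index=nse50_fut"

with requests.Session() as s:
    s.get("https://www.nseindia.com", headers=HEADERS)  # pick up the initial cookies
    for _ in range(5):
        r = s.get(API_URL, headers=HEADERS)
        try:
            print(r.json()["marketStatus"]["marketStatusMessage"])
        except ValueError:
            # the cookies went stale, so refresh them from the home page
            s.get("https://www.nseindia.com", headers=HEADERS)
        time.sleep(10)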

The website exists but request.head/get times out

I have written a Python script to check whether a website exists. Everything works fine except when checking http://www.dhl.com: the request times out. I have tried both GET and HEAD methods. I also used https://httpstatus.io/ and https://app.urlcheckr.com/ to check the DHL website, and the result is an error. The DHL website DOES exist! Here is my code:
import requests

a = 'http://www.dhl.com'

def check(url):
    try:
        header = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
        request = requests.head(url, headers=header, timeout=60)
        code = request.status_code
        if code < 400:
            return "Exist", str(code)
        else:
            return "Not exist", str(code)
    except Exception as e:
        return "Not Exist", str(type(e).__name__)

print(check(a))
How can I resolve this error?
Testing with curl shows you need a couple of other headers for that DHL site:
import requests

url = 'http://www.dhl.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,fil;q=0.8',
}
request = requests.head(url, headers=headers, timeout=60, allow_redirects=True)
print(request.status_code, request.reason)
print(request.history)
Without these headers, curl never gets a response.
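Folding those headers back into the question's check() function gives something like this sketch (only the headers and allow_redirects differ from the original):
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,fil;q=0.8',
}

def check(url):
    try:
        # allow_redirects=True follows the redirect chain to the final page
        request = requests.head(url, headers=HEADERS, timeout=60, allow_redirects=True)
        code = request.status_code
        return ("Exist" if code < 400 else "Not exist"), str(code)
    except Exception as e:
        return "Not Exist", str(type(e).__name__)

print(check('http://www.dhl.com'))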

Response [412] when using the requests python package to access this webpage, how to get around it?

This is the reproducible code:
import requests
url = 'http://wjw.hubei.gov.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}
res = requests.get(url,headers=headers)
print(res)
The code print(res) gives the following output:
<Response [412]>
I can open the webpage fine on my computer with Chrome.
Is there something missing in the header? Is there a way to get around the 412 error? Thanks in advance!
That website requires a valid Cookie in order to respond to you.
I've tried several ways, such as calling the main website and then retrieving the Cookie under requests.Session(), but the website is not letting me through.
So for now your only options are to use Selenium or to pass a valid Cookie to requests.
Here's how to get the Cookie and User-Agent via the browser (copy them from the request headers shown in your browser's developer tools):
Using the following code:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
    "Cookie": "Hm_lvt_5544783ae3e1427d6972d9e77268f25d=1578572654; Hm_lpvt_5544783ae3e1427d6972d9e77268f25d=1578572671; dataHide2=64fa0f2a-a6aa-43b4-adf0-ce901e8d1a37; FSSBBIl1UgzbN7N80S=sXE0qXcyGkTm4uVerLqfZyUU3XFMZzkm22k.eqVABLPe0eYMo3D8uX5ZJ07.7cCr; FSSBBIl1UgzbN7N80T=4aY.P74ZFvDef6i1BgsPAGpjsGOCcIHJFaOyshl4_fJ1WvTk1nqBkdG9PsyX3VRZcIuI8zdYiRJw4rEBQfx.Mv.GS_wT6Hzgiw.AY.UMP.Mw4iCKXGDzY1UeIH2gUd15impxzBVzZpN3MnSdqD0TUqcxSq0RrvIuE8RKT5pFLAqaNnVqtbeSACx43yIYtKJ41y8Isu6a6lNOlWNeaFJ8bx22pKm3lAIO.HIDhGSZqrUP76.q3i4Iux59f7dqJPuSRF90G1LSUBE8t8HrlWzBcSwJJJARX4Ioc0iHmHvdkVoigUitTRjLUHJM4ieOV1sLBDsq"
}
r = requests.get("http://wjw.hubei.gov.cn/", headers=headers)
print(r)
Output:
<Response [200]>
Update: you can also bootstrap the Cookie from the parent domain within a session and reuse it:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"}

with requests.Session() as req:
    # grab a fresh Cookie from the parent site and reuse it for the sub-site
    r = req.get("http://www.hubei.gov.cn/")
    headers['Cookie'] = r.headers.get("Set-Cookie")
    for item in range(10):
        new = req.get("http://wjw.hubei.gov.cn/", headers=headers)
        print(new)

Python3, download file from url by button click

I need to download a file from a link like this: https://freemidi.org/getter-13560
But I can't simply use urllib.request or the requests library, because they download the HTML page, not the MIDI file. Is there any solution? And here is the link with the button itself: link
By adding the proper headers and using a session, we can download and save the file with the requests module.
import requests

headers = {
    "Host": "freemidi.org",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
# the first request makes the website set its cookies
req1 = session.get("https://freemidi.org/getter-13560", headers=headers)
# request again, now with the cookies, to download the actual file
req2 = session.get("https://freemidi.org/getter-13560", headers=headers)
print(len(req2.content))  # the size of the MIDI file in bytes

with open("testFile.mid", "wb") as saveMidi:
    saveMidi.write(req2.content)
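One optional sanity check of my own (not part of the original answer): a standard MIDI file begins with the four bytes MThd, so you can verify that you saved the actual file rather than an HTML page:
# hypothetical helper: confirm the saved file is MIDI, not an HTML error page
def looks_like_midi(path):
    with open(path, "rb") as f:
        return f.read(4) == b"MThd"  # every standard MIDI file starts with 'MThd'

print(looks_like_midi("testFile.mid"))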
