Downloading video files of specific size - python

To do:
Download a video from a URL. The download should stop once a specific file size has been reached.
This is what I have tried so far:
import requests

with open(local_file_name, 'wb') as f:  # binary mode; 'w' corrupts binary data
    r = requests.get(url, stream=True)
    for chunk in r.iter_content(chunk_size=5000000):
        if chunk:
            f.write(chunk)
        break
I used 'break' after writing that chunk of the target size to the file.
The file is created and its size is 5 MB, but I am not able to view it: opening it throws an error.
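For reference, here is a minimal sketch of capping the download at a byte count (url, local_file_name, and the 5 MB limit are taken from the question; the counting logic is my addition). Note that the truncated file will usually still be unplayable, because video containers such as MP4 need their complete metadata to be opened:

import requests

MAX_BYTES = 5_000_000  # assumed 5 MB cap from the question
written = 0

r = requests.get(url, stream=True)  # url as in the question
with open(local_file_name, 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)
        written += len(chunk)
        if written >= MAX_BYTES:
            break  # stop once the size limit is reached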

Related

How to download a huge gz file (around 3 GB size) from a URL where there is no file name present using python

I am trying to download a file from a URL using Python. However, it's not working: instead of the file I am getting index.html.
Please help with this.
import requests

target_url = "https://transparency-in-coverage.uhc.com/?file=2022-07-01_United-HealthCare-Services_Third-Party-Administrator_EP1-50_C1_in-network-rates.json.gz&origin=uhc"
filename = "2022-07-01_United-HealthCare-Services_Third-Party-Administrator_EP1-50_C1_in-network-rates.json.gz"

with requests.get(target_url, stream=True) as r:
    r.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
That's because the URL you specified is for an HTML page that subsequently starts the download of the .gz file you want.
This is the link for the file:
https://mrfstorageprod.blob.core.windows.net/mrf-even/2022-07-01_ALL-SAVERS-INSURANCE-COMPANY_Insurer_PS1-50_C2_in-network-rates.json.gz?sv=2021-04-10&st=2022-07-05T22%3A19%3A13Z&se=2022-07-09T22%3A19%3A13Z&skoid=89efab61-5daa-4cf2-aa04-ce3ba9d1d1e8&sktid=db05faca-c82a-4b9d-b9c5-0f64b6755421&skt=2022-07-05T22%3A19%3A13Z&ske=2022-07-09T22%3A19%3A13Z&sks=b&skv=2021-04-10&sr=b&sp=r&sig=NaLrw2KG239S%2BpfZibvw7%2B25AAQsf9GYZ1gFK0KRN20%3D&rscd=attachment
To find it, you need to have the browser inspector open on the 'Network' tab while loading the page (or you can click on the file in the list once the page has loaded the list of files). When the download starts you'll see two requests pop up, one of which is the actual URL of the .gz file.
The URL does have a timestamp (and a signed expiry) in it, so it might not work at a later time, I don't know.
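As a quick sanity check (my own sketch, not part of the original answer, reusing target_url and filename from above), you can inspect the response's Content-Type before writing, so you notice when the server hands back the HTML landing page instead of the gzip payload:

import requests

with requests.get(target_url, stream=True) as r:
    r.raise_for_status()
    content_type = r.headers.get("Content-Type", "")
    if "text/html" in content_type:
        # We got the landing page, not the archive itself.
        raise RuntimeError("Expected a gzip file, got %r" % content_type)
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=64 * 1024):
            f.write(chunk)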

Download bz2, read compressed files in memory (avoid memory overflow)

As the title says, I'm downloading a bz2 file which has a folder inside with a lot of text files...
My first version decompressed everything in memory. Although the archive is only 90 MB, it contains 60 files of 750 MB each... the computer goes boom! It obviously can't handle ~40 GB of RAM.
So the problem is that the files are too big to keep all in memory at the same time. I'm using this code, which works but is far too slow:
import os
import tarfile
import requests

response = requests.get('https://fooweb.com/barfile.bz2')

# Save the file to disk:
compress_filepath = '{0}/files/sources/{1}'.format(zsets.BASE_DIR, check_time)
with open(compress_filepath, 'wb') as local_file:
    local_file.write(response.content)

# Extract the files into a folder:
extract_folder = compress_filepath + '_ext'
with tarfile.open(compress_filepath, "r:bz2") as tar:
    tar.extractall(extract_folder)

# Process one file at a time:
for filename in os.listdir(extract_folder):
    filepath = '{0}/{1}'.format(extract_folder, filename)
    with open(filepath, 'r') as f:
        for line in f:
            some_processing(line)
Is there a way I could do this without dumping it to disk, decompressing and reading one file from the .bz2 at a time?
Thank you very much for your time in advance; I hope somebody knows how to help me with this...
#!/usr/bin/python3
import sys
import requests
import tarfile

got = requests.get(sys.argv[1], stream=True)
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():
            ent = tar.extractfile(info)
            # now process ent as a file, however you like
            print(info.name, len(ent.read()))
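If you want the per-line processing from the question, one option (my own sketch; some_processing is the question's placeholder) is to wrap each extracted member in a text stream instead of reading it whole:

import io

# Inside the loop above, for each regular member, instead of ent.read():
with io.TextIOWrapper(tar.extractfile(info), encoding='utf-8') as text:
    for line in text:
        some_processing(line)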
I did it this way:
import io
import tarfile
import requests

response = requests.get(my_url_to_file)
memfile = io.BytesIO(response.content)

# We extract files in memory, one by one:
filecount = 0
with tarfile.open(fileobj=memfile, mode="r:bz2") as tar:
    for member_name in tar.getnames():
        filecount += 1
        member = tar.extractfile(member_name)  # already a file-like object; do not open() it again
        for line in io.TextIOWrapper(member, encoding="utf-8"):
            process_line(line)

Download file using python without knowing its extension - content-type - stream

Hey, I just did some research and found that I could download images from URLs which end with filename.extension, like 000000.jpeg. I wonder now how I could download a picture whose URL doesn't have any extension.
Here is the URL of the image I want to download: http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api
When I put the URL directly into the browser it displays an image.
Furthermore, here is what I tried:
from six.moves import urllib

thumbnail = 'http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api'
req = urllib.request.Request(thumbnail)
pic = urllib.request.urlopen(req).read()  # raw image bytes; nothing is written to disk yet
Any help will be appreciated so much.
This is a way to do it using the HTTP response headers:
import requests
import time

r = requests.get("http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api", stream=True)

# Convert the response's MIME type to an extension (may not work with everything):
ext = r.headers['content-type'].split('/')[-1]

with open("%s.%s" % (time.time(), ext), 'wb') as f:  # open for writing as binary; use 'w' for text files
    for chunk in r.iter_content(1024):  # iterate over the stream in 1 KB chunks
        f.write(chunk)  # write the file
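A possibly more robust alternative (my own suggestion, not from the original answer) is the standard library's mimetypes module, which handles types like image/svg+xml where the naive split would produce svg+xml; note that guess_extension may return any registered extension for a type (older Pythons returned '.jpe' for image/jpeg):

import mimetypes
import requests
import time

r = requests.get(thumbnail, stream=True)  # thumbnail URL from the question
mime = r.headers.get('content-type', '').split(';')[0]
ext = mimetypes.guess_extension(mime) or '.bin'  # fall back if the type is unknown
with open('%s%s' % (time.time(), ext), 'wb') as f:
    for chunk in r.iter_content(1024):
        f.write(chunk)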

Downloading video with python not playing while loading

I have been trying to download a video file with Python and play it in VLC at the same time.
I have tried a few ways. One of them is to download in a single thread, continuously fetching and appending data. This way is slow, but the video plays. The code is something like this:
# page is a urllib2 response object for the video URL
self.fp = open(dest, "wb")  # binary mode
while not self.stop_down and _continue:
    try:
        size = 1024 * 8
        data = page.read(size)
        bytes_read += size
        self.fp.write(data)
    except Exception:
        break  # error handling was elided in the question
This function takes longer to download, but I am able to play the video while it's loading.
However, I have been trying to download the file in multiple parts at the same time,
with proper threading logic:
req = urllib2.Request(self.url)
req.headers['Range'] = 'bytes=%s-%s' % (self.startPos, self.end)
response = urllib2.urlopen(req)
content = response.read()

if os.path.exists(self.dest):
    out_fd = open(self.dest, "r+b")
else:
    out_fd = open(self.dest, "w+b")

out_fd.seek(self.startPos, 0)
out_fd.write(content)
out_fd.close()
With my threading I am making sure that each part of the file is saved sequentially.
But for some reason I can't play this file at all while it is downloading.
Is there anything I am not doing right? Should the "Range" header be set differently?
Turns out the Range boundaries were off by one byte. HTTP byte ranges are inclusive at both ends, so bytes=0-1023 returns 1024 bytes, and the next block has to start at byte 1024; overlapping the ranges (or leaving a one-byte gap) corrupts the file.
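A minimal sketch of computing the inclusive ranges (my own reconstruction, assuming the total file size is known; not the poster's threading code):

def byte_ranges(total_size, block_size):
    """Yield inclusive (start, end) pairs suitable for HTTP Range headers."""
    start = 0
    while start < total_size:
        end = min(start + block_size, total_size) - 1  # inclusive end
        yield start, end
        start = end + 1  # the next block starts one past the previous end

# For a 3000-byte file with 1024-byte blocks:
# (0, 1023), (1024, 2047), (2048, 2999)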

downloading large number of files using python

test.txt contains the list of files to be downloaded:
http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif
How can these files be downloaded with Python at the maximum download speed?
My thinking was as follows:
import urllib.request

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
What comes after that? How do I select the download directory?
Select a path to your desired output directory (output_dir). In your for loop, split every URL on the / character and use the last piece as the filename. Also open the files for writing in binary mode wb, since response.read() returns bytes, not str.
import os
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
        output_file = os.path.join(output_dir, line.split('/')[-1])
        with open(output_file, 'wb') as writer:
            writer.write(response.read())
Note:
Downloading multiple files can be faster if you use multiple threads, since a single download rarely uses the full bandwidth of your internet connection.
Also, if the files you are downloading are pretty big, you should probably stream the read (reading chunk by chunk). As @Tiran commented, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).
I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024) # (at least 5 MB), since the default value of 16 KB is really small and will just slow things down.
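A sketch of the threaded variant (my own illustration, reusing the same test.txt and output_dir assumptions; not part of the original answer):

import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

output_dir = 'path/to/your/output/dir'

def fetch(url):
    output_file = os.path.join(output_dir, url.split('/')[-1])
    with urllib.request.urlopen(url) as response, open(output_file, 'wb') as writer:
        shutil.copyfileobj(response, writer, 5 * 1024 * 1024)  # 5 MB buffer
    return output_file

with open('test.txt') as f:
    urls = f.read().splitlines()

with ThreadPoolExecutor(max_workers=4) as pool:
    for path in pool.map(fetch, urls):
        print('downloaded', path)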
This works fine for me (note that fileName must be just the file name, including its extension, for example 'afaf1.tif'):
import urllib, os

def download(baseUrl, fileName, layer=0):
    print 'Trying to download file:', fileName
    url = baseUrl + fileName
    name = os.path.join('foldertodwonload', fileName)
    try:
        # Note that the folder needs to exist
        urllib.urlretrieve(url, name)
    except:
        # Upon failure, retry up to 5 times in total
        print 'Download failed'
        print 'Could not download file:', fileName
        if layer > 4:
            return
        else:
            layer += 1
            print 'retrying', str(layer) + '/5'
            download(baseUrl, fileName, layer)
    print fileName + ' downloaded'

for fileName in nameList:
    download(url, fileName)
(Moved unnecessary code out of the try block.)
