Downloading files from raw.githubusercontent.com is immensely slow - python

I'm building an application in Python 3 that requires downloading a whole bunch of *.java files from raw.githubusercontent.com. Basically, I use GitHub's API v3 to obtain all the paths ending with ".java" in a given repository, then I download them through raw.githubusercontent.com. The trouble is that this is really slow (< 10 kB/s). Sometimes it starts off at a decent rate (40-50 kB/s), but it usually drops off pretty quickly.
I've tried keeping a persistent connection using requests.Session(), and I've also tried using an authorization token, as someone suggested. Neither gave an improvement.
This is what my code looks like:
with requests.Session() as s:
    path_index = ""
    for path in paths.splitlines():
        file_url = githubusercontent_prefix + path
        filename = path.split("/")[-1]
        res = s.get(file_url, stream=True, allow_redirects=True)
        outf = open("sources/" + filename, 'w')
        outf.write(res.text)
        outf.close()
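For reference, the listing step described above might look roughly like this (a sketch only; the repository and branch names are placeholders, and the Git Trees API is just one way to collect the paths):
import requests

# Hypothetical setup for the listing step; "octocat/Hello-World" and "master"
# are placeholder values, not taken from the question.
owner_repo = "octocat/Hello-World"
branch = "master"

# The Git Trees API with recursive=1 returns every path in the repository.
tree_url = "https://api.github.com/repos/{0}/git/trees/{1}?recursive=1".format(owner_repo, branch)
tree = requests.get(tree_url).json()

# paths and githubusercontent_prefix as used in the snippet above
paths = "\n".join(
    entry["path"] for entry in tree.get("tree", []) if entry["path"].endswith(".java")
)
githubusercontent_prefix = "https://raw.githubusercontent.com/{0}/{1}/".format(owner_repo, branch)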

Related

How can I get Helm's binary from their GitHub repo?

I'm trying to download Helm's latest release using a script. I want to download the binary and copy it to a file. I tried looking at the documentation, but it's very confusing to read and I don't understand it. I have found a way to download specific files, but nothing regarding the binary. So far, I have:
from github import Github

def get_helm(filename):
    f = open(filename, 'w')  # The file I want to copy the binary to
    g = Github()
    r = g.get_repo("helm/helm")
    # Get binary and use f.write() to transfer it to the file
    f.close()
    return filename
I am also well aware of the query rate limits, since I'm not using any credentials.
For Helm in particular, you're not going to have a good time since they apparently don't publish their release files via GitHub, only the checksum metadata.
See https://github.com/helm/helm/releases/tag/v3.6.0 ...
Otherwise, this would be as simple as:
1. get the JSON data from https://api.github.com/repos/{repo}/releases
2. get the first release in the list (it's the newest)
3. look through the assets of that release to find the file you need (e.g. for your architecture)
4. download it using your favorite HTTP client (e.g. the one you used to get the JSON data in the first step)
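In code, that generic pattern might look roughly like this (a sketch assuming a project with ordinary release assets; the repository name and architecture string are placeholders, not something from the question):
import requests

# Placeholder repository and architecture; swap in whatever project/asset you need.
repo = "cli/cli"
arch = "linux_amd64"

# Steps 1-2: fetch the releases list and take the first (newest) entry.
releases = requests.get("https://api.github.com/repos/{0}/releases".format(repo)).json()
newest = releases[0]

# Steps 3-4: find a matching asset and stream it to disk.
for asset in newest["assets"]:
    if arch in asset["name"]:
        resp = requests.get(asset["browser_download_url"], stream=True)
        resp.raise_for_status()
        with open(asset["name"], "wb") as f:
            for chunk in resp.iter_content(chunk_size=65536):
                f.write(chunk)
        break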
Nevertheless, here's a script that works for Helm's additional hoops-to-jump-through:
import requests

def download_binary_with_progress(source_url, dest_filename):
    binary_resp = requests.get(source_url, stream=True)
    binary_resp.raise_for_status()
    with open(dest_filename, "wb") as f:
        for chunk in binary_resp.iter_content(chunk_size=524288):
            f.write(chunk)
            print(f.tell(), "bytes written")
    return dest_filename

def download_newest_helm(desired_architecture):
    releases_resp = requests.get(
        "https://api.github.com/repos/helm/helm/releases"
    )
    releases_resp.raise_for_status()
    releases_data = releases_resp.json()
    newest_release = releases_data[0]
    for asset in newest_release.get("assets", []):
        name = asset["name"]
        # For a project using regular releases, this would be simplified to
        # checking for the desired architecture and doing
        # download_binary_with_progress(asset["browser_download_url"], name)
        if desired_architecture in name and name.endswith(".tar.gz.asc"):
            tarball_filename = name.replace(".tar.gz.asc", ".tar.gz")
            tarball_url = "https://get.helm.sh/{0}".format(tarball_filename)
            return download_binary_with_progress(
                source_url=tarball_url, dest_filename=tarball_filename
            )
    raise ValueError("No matching release found")

download_newest_helm("darwin-arm64")

urlretrieve stopping at 4k

I'm using a script to download images from a website, and up until recently it has been working fine. Of note, the page is HTTPS, but per the urllib docs that shouldn't be an issue. The script first requests the page and uses a regex to pull the download links from it. From there it goes into a loop to download the files, and the inner loop looks like this:
dllink = m[0].replace('\">Download','')
print dllink
#m = re.findall('[a-f0-9]+.[\w]+',dllink)
extension = re.findall('.[\w]+$',dllink)[0]
fname = post_id + extension
urllib.urlretrieve(dllink,cpath + "/" + fname)
printLine(post_id + " ")
delay = random.uniform(32.0,64.0)
dlcount = dlcount + 1
time.sleep(delay)
Again, it downloads a file, but the files I'm downloading are on the order of 200 kB-4 MB, and every file has started coming back as 4 kB. I've copy-pasted the download links into browsers and they pull the right image, and wget also downloads it just fine, so I'm not sure what's wrong with my code such that it only downloads 4 kB of the file. If this is a server-side issue, is there a way to call wget from Python to accomplish the same thing without urlretrieve? Thanks in advance for any help!
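If shelling out to wget ends up being the workaround, a minimal sketch using subprocess (assuming wget is installed and on the PATH) could stand in for the urlretrieve call:
import subprocess

# Roughly equivalent to the urlretrieve call in the loop above, delegated to wget:
# -q suppresses progress output, -O sets the output file path.
subprocess.check_call(["wget", "-q", "-O", cpath + "/" + fname, dllink])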

Downloading contents of several html pages using python

I'm new to Python and was trying to figure out how to code a script that will download the contents of HTML pages. I was thinking of doing something like:
Y = 0
while Y != 500:
    X = "example.com/example/" + str(Y)
    # (code to download the file at X)
    Y += 1
So (Y) is the file name, and I need to download files from example.com/example/1 all the way to file number 500, regardless of the file type.
Read this official docs page:
This module provides a high-level interface for fetching data across the World Wide Web.
In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames.
Some restrictions apply — it can only open URLs for reading, and no seek operations are available.
So you have code like this:
import urllib
content = urllib.urlopen("http://www.google.com").read()
#urllib.request.urlopen(...).read() in python 3
The following code should meet your needs. It will download 500 pages and save them to disk.
import urllib2

def grab_html(url):
    response = urllib2.urlopen(url)
    mimetype = response.info().getheader('Content-Type')
    return response.read(), mimetype

for i in range(500):
    filename = str(i)  # Use the digit as the filename
    url = "http://example.com/example/{0}".format(filename)
    contents, _ = grab_html(url)
    with open(filename, "w") as fp:
        fp.write(contents)
Notes:
If you need parallel fetching, the concurrent.futures docs have a great example: https://docs.python.org/3/library/concurrent.futures.html
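For instance, a minimal thread-pool sketch along those lines (Python 3, reusing the URL pattern from the question) might look like:
import concurrent.futures
import urllib.request

def fetch(i):
    # Download one page and save it under its digit filename.
    url = "http://example.com/example/{0}".format(i)
    data = urllib.request.urlopen(url).read()
    with open(str(i), "wb") as fp:
        fp.write(data)
    return url

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    for url in executor.map(fetch, range(500)):
        print(url, "done")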

downloading a file, not the contents

I am trying to automate downloading a .Z file from a website, but the file I get is 2 kB when it should be around 700 kB, and it contains a listing of the contents of the page (i.e. all the files available for download). I am able to download it manually without a problem. I have tried urllib and urllib2 and different configurations of each, but each does the same thing. I should add that the urlVar and fileName variables are generated in a different part of the code, but I have given an example of each here to demonstrate.
import urllib2

urlVar = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z"
fileName = "txga1000.14d.Z"
downFile = urllib2.urlopen(urlVar)
with open(fileName, "wb") as f:
    f.write(downFile.read())
At least the urllib2 documentation suggests you should use the Request object. This works for me:
import urllib2
req = urllib2.Request("ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z")
response = urllib2.urlopen(req)
data = response.read()
Data length seems to be 740725.
I was able to download what seems to be the correct size for your file with the following Python 2 code:
import urllib2
filename = "txga1000.14d.Z"
url = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/{}".format(filename)
reply = urllib2.urlopen(url)
buf = reply.read()
with open(filename, "wb") as fh:
    fh.write(buf)
Edit: The answer above was posted faster and is much better. I thought I'd post this anyway since I had already tested and written it out.

Can Python be used to open multiple tabs in a browser in one shot?

I am looking for a faster way to do my task. I have 40,000 downloadable file URLs, and I would like to download them to my local desktop. Currently, what I am doing is placing each link in the browser and then downloading it via a script. What I am looking for now is to pass 10 URLs at a time to the address bar and have those 10 files download simultaneously. If that's possible, I hope the overall time will decrease.
Sorry I was late to give the code; here it is:
import os
import urllib

def _download_file(url, filename):
    """
    Given a URL and a filename, this method will save a file locally to the
    destination_directory path.
    """
    if not os.path.exists(destination_directory):
        print 'Directory [%s] does not exist, Creating directory...' % destination_directory
        os.makedirs(destination_directory)
    try:
        urllib.urlretrieve(url, os.path.join(destination_directory, filename))
        print 'Downloading File [%s]' % (filename)
    except:
        print 'Error Downloading File [%s]' % (filename)

def _download_all(main_url):
    """
    Given a URL list, this method will download each file in the destination
    directory.
    """
    url_list = _create_url_list(main_url)
    for url in url_list:
        _download_file(url, _get_file_name(url))
Thanks,
Why use a browser? This seems like an XY problem.
To download files, I'd use a library like requests (or make a system call to wget).
Something like this:
import requests

def download_file_from_url(url, file_save_path):
    r = requests.get(url)
    if r.ok:  # checks if the download succeeded
        with open(file_save_path, 'wb') as f:  # binary mode, since the content may not be text
            f.write(r.content)
        return True
    else:
        return r.status_code

download_file_from_url('http://imgs.xkcd.com/comics/tech_support_cheat_sheet.png', 'new_image.png')
# will download the image and save it to the current directory as 'new_image.png'
You first have to install requests using whatever Python package manager you prefer, e.g. pip install requests. You can also get fancier.
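For example, to approximate the "10 at a time" behaviour the question asks about, one option is to push the downloads through a small thread pool on top of the function above; a rough sketch (url_list is assumed to come from the question's own _create_url_list()):
from multiprocessing.dummy import Pool  # thread-backed pool with the multiprocessing.Pool API

def download_one(url):
    # Derive a local filename from the URL; adjust as needed.
    return download_file_from_url(url, url.split("/")[-1])

# url_list would come from something like _create_url_list() in the question's code.
pool = Pool(10)  # roughly "10 downloads at a time"
results = pool.map(download_one, url_list)
pool.close()
pool.join()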
