I am new to GitHub API.
I am writing a Python program (using requests) that should list all the changed/added files of a pull request in a given repository.
Using the API I am able to list all the pull requests and get their numbers. However, when I try to get the information about the files, the response does not contain all the files in the pull request.
pf = session.get(f'https://api.github.com/repos/{r}/pulls/{pull_num}/files')
pj = pf.json()
pprint.pprint(pf.json())
for i in range(len(pj)):
    print(pj[i]['filename'])
(I know there might be a prettier way; Python is not really my cup of coffee yet, but when I compare pf.text with the output of this snippet, the result is identical.)
I know that there is a limit of 300 files, as mentioned in the documentation, but the problem occurs even when their total number is less than 300.
I created a test repo with a single pull request that adds files called file1, file2, ..., file222, and after I send the GET request, the response only contains the filenames:
file1, file10, file100, file101, file102, file103, file104, file105, file106, file107, file108, file109, file11, file110, file111, file112, file113, file114, file115, file116, file117, file118, file119, file12, file120, file121, file122, file123, file124, file125
Is there another limit that I don't know about? Or why would the response contain only those filenames? How do I get all of them?
I found a solution a while after I posted the question. The API paginates the results: each response contains only a page of entries (filenames) plus a link to the next page in the response headers. The files listed in the question are simply the first page, in alphabetical order.
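For anyone hitting the same thing, here is a minimal sketch of following that pagination with requests, assuming the same session, r, and pull_num as in the snippet above; per_page=100 is, as far as I know, the largest page size the API accepts for this endpoint:

url = f'https://api.github.com/repos/{r}/pulls/{pull_num}/files'
params = {'per_page': 100}  # ask for the largest page size
filenames = []
while url:
    resp = session.get(url, params=params)
    resp.raise_for_status()
    filenames.extend(item['filename'] for item in resp.json())
    # requests exposes the parsed Link header as resp.links
    url = resp.links.get('next', {}).get('url')
    params = None  # the "next" URL already carries the paging parameters
print(len(filenames))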
Related
So I am reading URLs, e.g. "domain.xyz", from a .csv file. The purpose is to use the requests module to get GET/HEAD responses. I am using this code as a workaround to prepend a string:
x = "http://www."+str('domain.com')
response = requests.head(x)
The problem here is that not all "domain.com" entries in my .csv start with the standard http://www. prefix. What's the best way to complete the URL before passing it to the requests module?
P.S. I am looking for something similar to what Chrome's address bar does to complete a URL. For instance, when we enter 'abc.com', it completes it to "http://www.abc.xyz".
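One possible approach (a sketch, not the only way): parse each entry and prepend a scheme only when it is missing. complete_url below is a hypothetical helper, not something provided by the requests library.

import requests
from urllib.parse import urlparse

def complete_url(entry):
    # prepend a scheme only when the CSV entry lacks one
    entry = entry.strip()
    if not urlparse(entry).scheme:
        entry = 'http://' + entry
    return entry

response = requests.head(complete_url('domain.com'), allow_redirects=True)

Passing allow_redirects=True lets the server redirect to its preferred host (for example the www. variant) instead of you guessing the prefix yourself; requests.head does not follow redirects by default.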
I've been using python-libtorrent to check what pieces belong to a file in a torrent containing multiple files.
I'm using the below code to iterate over the torrent file
info = libtorrent.torrent_info('~/.torrent')
for f in info.files():
    print f
But this returns <libtorrent.file_entry object at 0x7f0eda4fdcf0> and I don't know how to extract information from this.
I'm not aware of a torrent_info property that would return the piece information for the various files. Any help is appreciated.
The API is documented here and here. Obviously the Python API can't always match the C++ one exactly, but generally the interface takes a file index and returns some property of that file.
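As a rough sketch of that pattern (assuming a reasonably recent libtorrent binding where files() returns a file_storage object; older bindings instead expose .path and .size attributes on each entry), something along these lines maps each file to its first and last piece:

import libtorrent

info = libtorrent.torrent_info('example.torrent')  # placeholder path
fs = info.files()

for i in range(fs.num_files()):
    size = fs.file_size(i)
    # map_file() translates a byte offset within a file into piece coordinates
    first_piece = info.map_file(i, 0, 0).piece
    last_piece = info.map_file(i, max(size - 1, 0), 0).piece
    print(fs.file_path(i), size, 'pieces', first_piece, '-', last_piece)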
I'm working on a chess related project for which I have to download a very large quantity of files from ChessTempo.
When running the following code:
import urllib.request
url = "http://chesstempo.com/requests/download_game_pgn.php?gameids="
for i in range(3, 500):
    urllib.request.urlretrieve(url + str(i), 'Games/Game ' + str(i) + ".pgn")
    print("Downloaded file nº " + str(i))
I get the expected list of ~500 files, but they are all blank except the second and third files, which have the correct data in them.
When I open the URLs by hand, it all works perfectly. What am I missing?
In fact, I can only download files 2 & 3, all others are empty...
Were you logged in while accessing those files "manually" (which I assume means using a web browser)?
If so, FYI: an HTTP request does not consist only of the URL; lots of other information is transferred. So if you are not getting the same data, you are almost certainly not making the same request.
In Chrome you can see the requests a page makes.
From Developer Tools, go to Network > select a name from the list > Request Headers.
The thing you are most likely missing is the cookies.
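For example, here is a hedged sketch of reusing a logged-in browser session with requests; the cookie name and value below are placeholders you would copy from the Request Headers shown in the Network tab, not ChessTempo's actual cookie names:

import requests

session = requests.Session()
# placeholder cookie copied from the browser's logged-in request
session.cookies.update({'SESSIONID': 'value-from-your-browser'})

url = "http://chesstempo.com/requests/download_game_pgn.php?gameids="
for i in range(3, 500):
    resp = session.get(url + str(i))
    resp.raise_for_status()
    with open('Games/Game ' + str(i) + '.pgn', 'wb') as fh:
        fh.write(resp.content)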
Hope it helps.
I'm trying to collect data from a JSON file using Python. I was able to access several chunks of text, but when I get to the third object in the JSON file I get a key error. The first three lines work fine, but the last line gives me a key error.
response = urllib.urlopen("http://asn.desire2learn.com/resources/D2740436.json")
data = json.loads(response.read())
title = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/dc/elements/1.1/title"][0]["value"]
description = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/dc/terms/description"][0]["value"]
topics = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/gem/qualifiers/hasChild"]
topicDesc = data["http://asn.desire2learn.com/resources/S2743916"]
Here is the JSON file I'm using: http://s3.amazonaws.com/asnstaticd2l/data/rdf/D2742493.json. I went through all the braces and can't figure out why I'm getting this error. Does anyone know why I might be getting this?
topics = data["http://asn.desire2learn.com/resources/D2740436"]["http://purl.org/gem/qualifiers/hasChild"]
I don't see the key "http://asn.desire2learn.com/resources/D2740436" anywhere in your source file. You didn't include your stack trace, but my first thought would be a typo resulting in a bad key, which would give you an error like:
KeyError: "http://asn.desire2learn.com/resources/D2740436"
which means that key does not exist in the data you are referencing.
The link in your code and your AWS link go to very different files. Open up the link in your code in a web browser, and you will find that it's much shorter than the file on AWS. It doesn't actually contain the key you're looking for.
You say that you are using the linked file, in which the key "http://asn.desire2learn.com/resources/S2743916" turns up once.
However, your code is downloading a different file - one in which the key does not appear.
Try using the file you linked (the AWS one) in your code, and you should see that the key works.
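As a quick sanity check (a sketch using Python 3's urllib here, rather than the Python 2 urllib.urlopen in the question), you can inspect which top-level keys the downloaded document actually contains before indexing into it:

import json
import urllib.request

url = "http://asn.desire2learn.com/resources/D2740436.json"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())

key = "http://asn.desire2learn.com/resources/S2743916"
print(key in data)          # False if the fetched file lacks this resource
topic_desc = data.get(key)  # None instead of a KeyError when the key is missing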
I am attempting to use Biopython to download all of the proteins of a list of organisms sequenced by a specific institution. I have the organism names and the BioProjects associated with each organism; specifically, I am looking to analyze the proteins found in some recent genome sequences. I'd like to download the protein files in bulk, in the friendliest manner possible, with efetch. My most recent attempt at downloading all of the protein FASTA sequences for an associated organism is as follows:
net_handle = Entrez.efetch(db="protein",
                           id=mydictionary["BioPROJECT"][i],
                           rettype="fasta")
There are roughly 3000-4500 proteins associated with each organism, so using esearch and trying to efetch each protein one at a time is not realistic. Plus, I'd like to have a single FASTA file for each organism that encompasses all of its proteins.
Unfortunately when I run this line of code, I receive the following error:
urllib2.HTTPError: HTTP Error 400: Bad Request.
It appears that for all of the organisms I am interested in, I can't simply find their genome sequence in the Nucleotide database and download the "Protein encoding Sequences".
How may I obtain the protein sequences I want in a manner that won't overload the NCBI servers? I was hoping that I could replicate what I can do in NCBI's web interface: select the protein database, search for the BioProject number, and then save all of the found protein sequences into a single FASTA file (under the "Send to" drop-down menu).
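One way that roughly mirrors that browser workflow is Biopython's Entrez history-server pattern: esearch with usehistory, then efetch in batches. A hedged sketch follows; the BioProject accession, email address, and batch size are placeholder assumptions, and the [BioProject] search field is assumed to match your records:

from Bio import Entrez

Entrez.email = "you@example.com"   # required by NCBI; placeholder address
bioproject = "PRJNA000000"         # hypothetical BioProject accession

# Search the protein database for everything linked to the BioProject and
# keep the result set on NCBI's history server instead of fetching all IDs.
search = Entrez.read(Entrez.esearch(db="protein",
                                    term=f"{bioproject}[BioProject]",
                                    usehistory="y"))
count = int(search["Count"])

# Fetch in batches against the history server, as NCBI recommends,
# appending everything to one FASTA file per organism.
batch_size = 500
with open(f"{bioproject}.fasta", "w") as out:
    for start in range(0, count, batch_size):
        handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
                               retstart=start, retmax=batch_size,
                               webenv=search["WebEnv"],
                               query_key=search["QueryKey"])
        out.write(handle.read())
        handle.close()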
Try downloading the sequences from PATRIC's FTP instead; it is a gold mine: first, it is much better organized, and second, the data are a lot cleaner than NCBI's. PATRIC is backed by the NIH, by the way.
PATRIC contains some 15,000+ genomes and provides their DNA, proteins, the DNA of protein-coding regions, EC numbers, pathways, and GenBank records in separate files. Super convenient. Have a look for yourself:
ftp://ftp.patricbrc.org/patric2.
I suggest you download all the desired files from all organisms first and then pick out those you need once you have them all on your hard drive. The following Python script downloads the EC number annotation files provided by PATRIC in one go (if you are behind a proxy, you need to configure it in the commented section):
from ftplib import FTP
import sys

####### if you are behind a proxy
#### fill in your proxy IP here
#site = FTP('1.1.1.1')
#site.set_debuglevel(1)
#msg = site.login('anonymous#ftp.patricbrc.org')
site = FTP("ftp.patricbrc.org")
site.login()
site.cwd('/patric2/current_release/ec/')
bacteria_list = []
site.retrlines('LIST', bacteria_list.append)
output = sys.argv[1]
if not output.endswith("/"):
    output += "/"
print("bacteria_list:", len(bacteria_list))
for c in bacteria_list:
    path_name = c.strip(" ").split()[-1]
    if "PATRIC.ec" in path_name:
        filename = path_name.split("/")[-1]
        # retrbinary delivers raw bytes, so write the local file in binary mode
        site.retrbinary('RETR ' + path_name, open(output + filename, 'wb').write)
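Run it with the output directory as its only argument, for example python patric_ec_download.py /path/to/output/ (the script name here is hypothetical); the trailing slash is appended automatically if missing, and each organism's *.PATRIC.ec file ends up in that directory.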
While I have no experience with Python, let alone Biopython, a quick Google search found a couple of things for you to look at:
urllib2 HTTP Error 400: Bad Request
urllib2 gives HTTP Error 400: Bad Request for certain urls, works for others