I am attempting to use Biopython to download all of the proteins of a list of organisms sequenced by a specific institution. I have the organism names and the BioProject associated with each organism; specifically, I am looking to analyze the proteins found in some recent genome sequences. I'd like to download the protein files in bulk, in the friendliest manner possible, with efetch. My most recent attempt at downloading all of the protein FASTA sequences for one organism is as follows:
net_handle = Entrez.efetch(db="protein",
                           id=mydictionary["BioPROJECT"][i],
                           rettype="fasta")
There are roughly 3000-4500 proteins associated with each organism, so using esearch and trying to efetch each protein one at a time is not realistic. Plus, I'd like to have a single FASTA file for each organism that encompasses all of its proteins.
Unfortunately when I run this line of code, I receive the following error:
urllib2.HTTPError: HTTP Error 400: Bad Request.
It appears that for all of the organisms I am interested in, I can't simply find their genome sequence in the Nucleotide database and download the "Protein encoding Sequences".
How may I obtain these protein sequences in a manner that won't overload the NCBI servers? I was hoping I could replicate what I can do in NCBI's web browser: select the protein database, search for the BioProject number, and then save all of the found protein sequences into a single FASTA file (under the "Send to" drop-down menu).
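For what it's worth, a minimal sketch of that same "search, then Send to file" workflow with Biopython's Entrez module might look like this; the e-mail address and BioProject accession are placeholders, and the Entrez history server is used so the whole result set comes back in one efetch call instead of one request per protein:
from Bio import Entrez

Entrez.email = "you@example.com"   # always identify yourself to NCBI
bioproject = "PRJNA000000"         # placeholder, e.g. mydictionary["BioPROJECT"][i]

# Search the protein database for everything linked to the BioProject and
# park the result set on the Entrez history server.
search = Entrez.read(Entrez.esearch(db="protein",
                                    term=bioproject + "[BioProject]",
                                    usehistory="y"))

# Fetch the whole result set in a single call and write one FASTA file.
fetch = Entrez.efetch(db="protein",
                      rettype="fasta",
                      retmode="text",
                      webenv=search["WebEnv"],
                      query_key=search["QueryKey"])

with open(bioproject + "_proteins.fasta", "w") as out:
    out.write(fetch.read())
fetch.close()
For several thousand records, NCBI also suggests fetching in batches with retstart/retmax, which works with the same WebEnv/QueryKey pair.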
Try downloading the sequences from PATRIC's FTP, which is a gold mine: first, it is much better organized, and second, the data are a lot cleaner than NCBI's. PATRIC is backed by the NIH, by the way.
PATRIC contains some 15000+ genomes and provides their DNA, protein, the DNA of protein coding regions, EC, pathway, genbank in separate files. Super convenient. Have a look yourself there:
ftp://ftp.patricbrc.org/patric2.
I suggest you download all the desired files from all organisms first, and then pick out the ones you need once you have them all on your hard drive. The following Python script downloads the EC number annotation files provided by PATRIC in one go (if you are behind a proxy, you need to configure it in the commented-out section):
from ftplib import FTP
import sys, os

####### if you are behind a proxy:
#### fill in your proxy IP here
#site = FTP('1.1.1.1')
#site.set_debuglevel(1)
#msg = site.login('anonymous#ftp.patricbrc.org')

site = FTP("ftp.patricbrc.org")
site.login()
site.cwd('/patric2/current_release/ec/')

# Grab the directory listing, one line per entry.
bacteria_list = []
site.retrlines('LIST', bacteria_list.append)

output = sys.argv[1]
if not output.endswith("/"):
    output += "/"
print "bacteria_list: ", len(bacteria_list)

# Download every *.PATRIC.ec annotation file into the output directory.
for c in bacteria_list:
    path_name = c.strip(" ").split()[-1]
    if "PATRIC.ec" in path_name:
        filename = path_name.split("/")[-1]
        site.retrbinary('RETR ' + path_name, open(output + filename, 'wb').write)
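Run it with the output directory as its only argument, e.g. python patric_ec.py /path/to/output/ (the script name is just a placeholder for whatever you saved it as); a trailing slash is optional since the script adds one if it is missing.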
While I have no experience with Python, let alone Biopython, a quick Google search found a couple of things for you to look at:
urllib2 HTTP Error 400: Bad Request
urllib2 gives HTTP Error 400: Bad Request for certain urls, works for others
I am fairly new to Python, so please be kind. I am a network administrator but have been tasked with automating several of our processes using Python.
I am trying to take a list of network IDs and plug them into a URL using a for loop.
file = open('networkid.txt', 'r')

def main(file):
    for x in file:
        print(x)
        link = ('https://api.meraki.com/api/v0/networks/') + (Network ID) + ('/syslogServers')
Each line in the .txt file contains a network ID, and I need that ID to be injected where (Network ID) is in the script; then the script should run the rest of the code not posted here and continue until all IDs have been exhausted.
The example layout above is not exactly how my script is set up, but bits and pieces were cut out to give you an idea of what I am aiming for.
To clarify the question at hand: how do I reference each line in the text file, where each line contains a network ID that I need to inject into the URL? From there, I am trying to establish a proper for loop to continue this process until all network IDs in the list have been exhausted.
x contains the network ID after you strip off the newline.
for line in file:
    networkID = line.strip()
    link = 'https://api.meraki.com/api/v0/networks/' + networkID + '/syslogServers'
    # do something with link
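If it helps, here is a slightly fuller sketch of the same idea using requests; it assumes the X-Cisco-Meraki-API-Key header that the v0 Dashboard API expects, and the key value is a placeholder:
import requests

API_KEY = "your_meraki_api_key_here"            # placeholder
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}   # auth header for the v0 API

with open('networkid.txt', 'r') as f:
    for line in f:
        network_id = line.strip()
        if not network_id:
            continue  # skip blank lines
        link = 'https://api.meraki.com/api/v0/networks/' + network_id + '/syslogServers'
        response = requests.get(link, headers=HEADERS)
        print(network_id, response.status_code)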
I am new to the GitHub API.
I am writing a Python program (using requests) that should list all the changed/added files of a pull request in a given repository.
Using the API I am able to list all the pull requests and get their numbers. However, when I try to get the information about the files, the response does not contain all the files in the pull request.
pf = session.get(f'https://api.github.com/repos/{r}/pulls/{pull_num}/files')
pj = pf.json()
pprint.pprint(pf.json())
for i in range(len(pj)):
    print(pj[i]['filename'])
(I know there might be a prettier way, Python is not really my cup of coffee yet, but when I compare the pf.text with the output of this snippet, the result is identical.)
I know that there is a limit of 300 files as mentioned in the documentation, but the problem occurs even if their total number is less than 300.
I created a test repo with a single pull request that adds files called file1, file2, ..., file222, and after I send the GET request, the response only contains the filenames of:
file1, file10, file100, file101, file102, file103, file104, file105, file106, file107, file108, file109, file11, file110, file111, file112, file113, file114, file115, file116, file117, file118, file119, file12, file120, file121, file122, file123, file124, file125
Is there another limit that I don't know about? Or why would the response contain only those filenames? How do I get all of them?
I found a solution a while after I posted the question. The API paginates the results: the response body only contains the first page of entries (filenames), and the Link header of the response points to the next page. The files listed in the question are just the first page, in alphabetical order.
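For anyone else hitting this, a rough sketch of following that pagination with requests might look like the following; the repo and pull request number are placeholders, resp.links is requests' parsed view of the Link header, and per_page just raises the page size:
import requests

session = requests.Session()
repo = "owner/repo"         # placeholder, e.g. the value of r in the question
pull_num = 1                # placeholder pull request number

url = f'https://api.github.com/repos/{repo}/pulls/{pull_num}/files'
params = {'per_page': 100}  # ask for up to 100 entries per page

filenames = []
while url:
    resp = session.get(url, params=params)
    resp.raise_for_status()
    filenames.extend(item['filename'] for item in resp.json())
    # Follow the "next" link from the Link header, if there is one.
    url = resp.links.get('next', {}).get('url')
    params = None  # the "next" URL already carries the query string

print(len(filenames))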
I'm using the DiscoveryV1 module of the watson_developer_cloud Python library to ingest 700+ documents into a WDS collection. Each time I attempt a bulk ingestion, many of the documents fail to be ingested; it is nondeterministic, and usually around 100 documents fail.
Each time I call discovery.add_document(env_id, col_id, file_info=file_info) I find that the response contains a WDS document_id. After I've made this call for all documents in my corpus, I use the corresponding document_ids to call discovery.get_document(env_id, col_id, doc_id) and check the document's status. Around 100 of these calls return the status Document failed to be ingested and indexed. There is no pattern among the files that fail; they range in size and are of both msword (doc) and pdf file types.
My code to ingest a document was written based on the WDS Documentation, it looks something like this:
with open(f_path) as file_data:
    if f_path.endswith('.doc') or f_path.endswith('.docx'):
        re = discovery.add_document(env_id, col_id, file_info=file_data, mime_type='application/msword')
    else:
        re = discovery.add_document(env_id, col_id, file_info=file_data)
Because my corpus is relatively large, ~3 GB in size, I receive Service is busy processing... responses from the discovery.add_document(env_id, col_id, file_info=file_info) calls, in which case I call sleep(5) and try again.
I've exhausted the WDS documentation without any luck. How can I get more insight into the reason that these files are failing to be ingested?
You should be able to use the https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Queries/queryNotices API to see errors/warnings that happen during ingestion along with details that might give more information on why the ingestion failed.
Unfortunately, at the time of this posting it does not look like the Python SDK has a method to wrap this API yet, so you can use the Watson Discovery Tooling or use curl to query the API directly (replacing the values in {} with your collection-specific values):
curl -u "{username}:{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/notices?version=2017-01-01"
The python-sdk now supports querying notices.
from watson_developer_cloud import DiscoveryV1

discovery = DiscoveryV1(
    version='2017-10-16',
    ## url is optional, and defaults to the URL below. Use the correct URL for your region.
    url='https://gateway.watsonplatform.net/discovery/api',
    iam_api_key='your_api_key')

discovery.federated_query_notices('env_id', ['collection_id'])
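As a rough follow-up sketch, assuming the SDK returns the parsed JSON as a plain dict (as the older watson_developer_cloud releases did) and using the results field of the query-notices response, you could inspect whatever comes back:
# Assumption: federated_query_notices returns the parsed JSON response.
notices = discovery.federated_query_notices('env_id', ['collection_id'])

# Each entry under "results" describes a document together with its notices
# (severity, notice_id, description, ...).
for result in notices.get('results', []):
    print(result)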
I want to use a Gene Ontology term to get related sequences from UniProt. It is simple to do manually; however, I want to use Python to achieve it. Does anybody have ideas? For example, I have GO:0070337 and I want to download all of the search results in a FASTA file. Thanks!
To do it fully automated, I recommend using requests:
import requests
from io import StringIO       # Python 3 (on Python 2: from StringIO import StringIO)
from Bio import SeqIO         # Biopython, needed to parse the FASTA response

params = {"query": "GO:0070337", "format": "fasta"}
response = requests.get("http://www.uniprot.org/uniprot/", params=params)

for record in SeqIO.parse(StringIO(response.text), "fasta"):
    # Do what you need here with your sequences.
    print(record.id)
I would use the REST interface provided by UniProt. You just have to build a search query with your requirements, i.e. your GO term, species, and file format.
This query will give you all the human proteins with the GO term for protein binding that haven't been reviewed, in FASTA format:
http://www.uniprot.org/uniprot/?query=%28go%3A%22protein+binding+%5B0005515%5D%22+AND+organism%3A%22Homo+sapiens+%5B9606%5D%22%29+AND+reviewed%3Ano&sort=score&format=fasta
More details are available at:
http://www.uniprot.org/faq/28
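As a small sketch, the same query can be pulled from Python and written straight to a FASTA file; the query parameters below are just the URL-decoded pieces of the link above, and the output filename is a placeholder:
import requests

params = {
    "query": '(go:"protein binding [0005515]" AND organism:"Homo sapiens [9606]") AND reviewed:no',
    "sort": "score",
    "format": "fasta",
}
response = requests.get("http://www.uniprot.org/uniprot/", params=params)
response.raise_for_status()

# Write the returned FASTA text to a single output file.
with open("uniprot_go_0005515_human_unreviewed.fasta", "w") as out:
    out.write(response.text)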
I'm using fpocket to find pockets in my PDB protein structures. The output is an ordered list of pockets: pocket0_atm.pdb, pocket1_atm.pdb, etc. Some files are read into Bio.PDB.PDBParser without incident. Others fail with an AssertionError.
Attempts to compare the .pdb files that work to those that fail have not shown me a consistent difference. Any ideas?
Here's the relevant section of code that's giving me trouble:
from Bio import PDB

def get_pdb_limits(pdb_file):
    '''Return the X, Y, Z size limits of a PDB file.'''
    p = PDB.PDBParser()
    structure = p.get_structure('test', pdb_file)
According to the fpocket documentation, the pocketX_atm.pdb file only contains the atoms that are in contact with the spheres used to extract the pocket. In other words, the pocket files don't contain complete residues, which could be a source of problems in parsing.
Without a stack trace, it's impossible to know exactly what your problem is. However, PDB.PDBParser is built to tolerate and compensate for some errors in PDB files. Try setting PERMISSIVE to True, like below, and see if you still get errors.
p = PDB.PDBParser(PERMISSIVE=1)
p.get_structure("pdb_id", pdb_file)
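If the permissive parser still trips up, a small sketch like this one captures the PDBConstructionWarning messages it emits so you can see what it is complaining about (pocket0_atm.pdb is just the example filename from the question):
import warnings
from Bio import PDB
from Bio.PDB.PDBExceptions import PDBConstructionWarning

# Collect, rather than print or raise, the warnings the permissive parser emits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", PDBConstructionWarning)
    parser = PDB.PDBParser(PERMISSIVE=1)
    structure = parser.get_structure("pocket", "pocket0_atm.pdb")

for w in caught:
    print(w.message)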