I want to use a Gene Ontology term to retrieve related sequences from UniProt. It is simple to do manually, but I want to use Python to achieve it. Does anybody have ideas? For example, I have GO:0070337, and I want to download all the search results into a FASTA file. Thanks!
To do it fully automated, I recommend using requests:
import requests
try:
    from io import StringIO  # Python 3
except ImportError:
    from StringIO import StringIO  # Python 2
from Bio import SeqIO  # Biopython, for parsing the FASTA response

params = {"query": "GO:0070337", "format": "fasta"}
response = requests.get("http://www.uniprot.org/uniprot/", params=params)

for record in SeqIO.parse(StringIO(response.text), "fasta"):
    # Do what you need here with your sequences.
    print(record.id)
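If you only need the raw results written straight to a FASTA file, a minimal sketch (the output filename is my assumption):
with open("GO_0070337.fasta", "w") as out:
    out.write(response.text)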
I would use the REST interface provided by UniProt. You just have to build a search query with your requirements - i.e. your GO term, species, and file format.
This query will give you, in FASTA format, all the human proteins annotated with the GO term for protein binding that haven't been reviewed:
http://www.uniprot.org/uniprot/?query=%28go%3A%22protein+binding+%5B0005515%5D%22+AND+organism%3A%22Homo+sapiens+%5B9606%5D%22%29+AND+reviewed%3Ano&sort=score&format=fasta
More details are available at:
http://www.uniprot.org/faq/28
I am working on a process to automate the generation of offer letters for candidates. The candidate information is in Excel and contains the standard details needed for offer letter generation, such as candidate name, date of joining, location, job title, CTC, etc.
Is there a way to generate multiple offer letters (output file name _.docx) while preserving the formatting of the docx template?
Using Stack Overflow's help, I was able to use the python-docx package and generate multiple offer letters. This approach, however, strips all the formatting from the offer letter.
import os
from pandas import *
import datetime
from docxtpl import DocxTemplate
doc = DocxTemplate("\\template\\offer_letter_template.docx")
xls = ExcelFile("\\data\\candidate_data.xlsx")
df = xls.parse(xls.sheet_names[0])
print (df.to_json(orient='records'))
Output:
[{"offer_letter_date":"July 27, 2019","candidate_name":"John Wick","candidate_email":"john.wick#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":283000},{"offer_letter_date":"July 17, 2019","candidate_name":"Jane Doe","candidate_email":"jane.doe#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":290000}]
context = df.to_json(orient='records')
doc.render(context)
I am struggling with creating a loop around context so that each candidate's information is saved in its own file rather than in a single file. Can someone please help?
Jinja2 for Word templating was really helpful, but I could not replicate it with a loop.
It is possible to create multiple docx files. Unfortunately, the docxtpl documentation doesn't say that once you load the template, replacements are done in place, which prevents any further context replacements.
A workaround you may like is reopening the template file at every iteration.
Something like:
# Use to_dict so each row is a real dict (render() expects a mapping, not a JSON string).
records = df.to_dict(orient='records')
for record in records:
    # Reopen the template at every iteration so each render starts from a clean copy.
    doc = DocxTemplate("\\template\\offer_letter_template.docx")
    doc.render(record)
    doc.save("docs-folder\\%s.docx" % record['candidate_name'])
^Might need some revision, but you get the point.
I'm using the DiscoveryV1 module of the watson_developer_cloud Python library to ingest 700+ documents into a WDS collection. Each time I attempt a bulk ingestion, many of the documents fail to be ingested; it is nondeterministic, and usually around 100 documents fail.
Each time I call discovery.add_document(env_id, col_id, file_info=file_info) I find that the response contains a WDS document_id. After I've made this call for all documents in my corpus, I use the corresponding document_ids to call discovery.get_document(env_id, col_id, doc_id) and check each document's status. Around 100 of these calls return the status Document failed to be ingested and indexed. There is no pattern among the files that fail; they range in size and are of both msword (doc) and pdf file types.
My code to ingest a document was written based on the WDS documentation; it looks something like this:
with open(f_path) as file_data:
    if f_path.endswith('.doc') or f_path.endswith('.docx'):
        re = discovery.add_document(env_id, col_id, file_info=file_data, mime_type='application/msword')
    else:
        re = discovery.add_document(env_id, col_id, file_info=file_data)
Because my corpus is relatively large, ~3 GB in size, I receive Service is busy processing... responses from discovery.add_document(env_id, col_id, file_info=file_info) calls, in which case I call sleep(5) and try again.
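For reference, my sleep-and-retry wrapper looks roughly like this (a sketch only; I'm assuming the busy response surfaces as an exception mentioning "busy", and add_with_retry/max_attempts are my own names):
from time import sleep

def add_with_retry(env_id, col_id, file_data, mime_type=None, max_attempts=5):
    # Retry add_document when the service says it is busy (assumption: the busy
    # response surfaces as an exception whose message contains "busy").
    for attempt in range(max_attempts):
        try:
            return discovery.add_document(env_id, col_id, file_info=file_data, mime_type=mime_type)
        except Exception as e:
            if 'busy' in str(e).lower() and attempt < max_attempts - 1:
                sleep(5)
            else:
                raise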
I've exhausted the WDS documentation without any luck. How can I get more insight into the reason that these files are failing to be ingested?
You should be able to use the https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Queries/queryNotices API to see errors/warnings that happen during ingestion, along with details that might give more information on why the ingestion failed.
Unfortunately, at the time of this posting it does not look like the Python SDK has a method to wrap this API yet, so you can use the Watson Discovery Tooling or use curl to query the API directly (replacing the values in {} with your collection-specific values):
curl -u "{username}:{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/notices?version=2017-01-01"
The python-sdk now supports querying notices.
from watson_developer_cloud import DiscoveryV1

discovery = DiscoveryV1(
    version='2017-10-16',
    # url is optional and defaults to the URL below; use the correct URL for your region.
    url='https://gateway.watsonplatform.net/discovery/api',
    iam_api_key='your_api_key')

response = discovery.federated_query_notices('env_id', ['collection_id'])
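A rough sketch of inspecting what comes back (the results/notices/severity/description field names are assumptions based on the Discovery API reference, and the exact return shape depends on the SDK version):
# Sketch: walk the notices in the query response and print the reported problems.
for result in response.get('results', []):
    for notice in result.get('notices', []):
        print(notice.get('severity'), notice.get('description'))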
I am trying to import a JSON file into a Python editor so that I can perform analysis on the data. I am quite new to Python, so I am not sure how I am meant to achieve this. My JSON file is full of tweet data; an example is shown here:
{"id":441999105775382528,"score":0.0,"text":"blablabla","user_id":1441694053,"created":"Fri Mar 07 18:09:33 GMT 2014","retweet_id":0,"source":"twitterfeed","geo_long":null,"geo_lat":null,"location":"","screen_name":"SevenPS4","name":"Playstation News","lang":"en","timezone":"Amsterdam","user_created":"2013-05-19","followers":463,"hashtags":"","mentions":"","following":1062,"urls":"http://bit.ly/1lcbBW6","media_urls":"","favourites_count":4514,"reply_status_id":0,"reply_user_id":0,"is_truncated":false,"is_retweet":false,"original_text":null,"status_count":4514,"description":"Tweeting the latest Playstation news!","url":null,"utc_offset":3600}
My questions:
How do I import the JSON file so that I can perform analysis on it in a Python editor?
How do I perform analysis on only a set number of the tweets (i.e. 100 or 200 of them instead of all of them)?
Is there a way to get rid of some of the fields, such as score, user_id, created, etc., without having to go through all of my data manually?
Some of the tweets have invalid/unusable symbols within them; is there any way to get rid of those without having to go through them manually?
I'd use Pandas for this job, as you will not only load the JSON but also perform some data analysis on it. Depending on the size of your JSON file, this should do it:
import json
import pandas as pd

# read the json file (replace the name with your file location)
with open("yourfilename") as f:
    j = json.load(f)

# you might select the relevant keys before constructing the data-frame
df = pd.DataFrame([{k: v for k, v in j.items() if k in ["id", "retweet_count"]}])

# select a subset (the first five rows)
df.iloc[:5]

# do some analysis
df.retweet_count.sum()
>>> 200
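If your file actually holds one tweet object per line (JSON Lines), a hedged sketch that also covers limiting the rows and dropping unwanted fields might look like this (lines=True is an assumption about the file layout):
import pandas as pd

# Assumes one JSON object per line; drop lines=True if the file is a single JSON array.
df = pd.read_json("yourfilename", lines=True)

# keep only the first 100 tweets
df = df.head(100)

# drop fields you don't need
df = df.drop(columns=["score", "user_id", "created"])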
I need to pull information on a long list of JIRA issues that live in a CSV file. I'm using the JIRA REST API in Python in a small script to see what kind of data I can expect to retrieve:
#!/usr/bin/python
import csv
import sys
from jira.client import JIRA
*...redacted*
csvfile = list(csv.reader(open(sys.argv[1])))
for row in csvfile:
    r = str(row).strip("'[]'")
    i = jira.issue(r)
    print i.id, i.fields.summary, i.fields.fixVersions, i.fields.resolution, i.fields.resolutiondate
The ID (Key), Summary, and Resolution date are human-readable as expected. The fixVersions and Resolution fields come back as resources, as follows:
[<jira.resources.Version object at 0x105096b11>], <jira.resources.Resolution object at 0x105096d91>
How do I use the API to get the set of available fixVersions and Resolutions, so that I can populate this correctly in my output CSV?
I understand how JIRA stores these values, but the documentation on the jira-python code doesn't explain how to harness it to grab those base values. I'd be happy to just snag the available fixVersion and Resolution values globally, but the resource info I receive doesn't map to them in an obvious way.
You can use resolution.name, and name on each entry of fixVersions (it's a list), to get the string versions of those values.
User mdoar answered this question in his comment:
How about using version.name and resolution.name?
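For example, in the loop from the question, something like this should give plain strings (a sketch; fixVersions is a list and resolution can be None for unresolved issues):
# Sketch: fixVersions is a list of Version resources; resolution may be None.
fix_versions = ", ".join(v.name for v in i.fields.fixVersions)
resolution = i.fields.resolution.name if i.fields.resolution else ""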
I have a piece of software called Rss-Aware that I'm trying to use. It's basically a desktop feed-checker that checks whether RSS feeds are updated and gives a notification through Ubuntu's Notify-OSD system.
However, to know which feeds to check, you have to list the feed URLs in a text file at ~/.rss-aware/rssfeeds.txt, one after the other, with a line break between each feed URL. Something like:
http://example.com/feed.xml
http://othersite.org/feed.xml
http://othergreatsite.net/rss.xml
...Seems pretty simple, right? Well, the list of feeds I'd like to use is exported from Google Reader as an OPML file (a type of XML), and I have no clue how to parse it to output just the feed URLs. It seems like it should be pretty straightforward, yet I'm stumped.
I'd love it if anyone could give an implementation in Python or Ruby or something I could do quickly from a prompt. A bash script would be awesome.
Thank you so much for the help; I'm a really weak programmer and would love to learn how to do this basic parsing.
EDIT: Also, here is the OPML file I'm trying to extract the feed urls from.
I wrote a subscription list parser for this very purpose. It's called listparser, and it's written in Python. I just tested your OPML file, and it appears to parse the file perfectly. It will also make your feeds' labels available.
If you've ever used feedparser, the interface should be familiar:
>>> import listparser as lp
>>> d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
>>> len(d.feeds)
112
>>> d.feeds[100].url
u'http://longreads.com/rss'
>>> d.feeds[100].tags
[u'reading']
It's possible to create the file with feed URLs using a script similar to:
import listparser as lp
d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
f = open('/home/USERNAME/.rss-aware/rssfeeds.txt', 'w')
for i in d.feeds:
    f.write(i.url + '\n')
f.close()
Just replace USERNAME with your actual username. Done!
XML parsing was so easy to implement and worked great for me.
from xml.etree import ElementTree

def extract_rss_urls_from_opml(filename):
    urls = []
    with open(filename, 'rt') as f:
        tree = ElementTree.parse(f)
    for node in tree.findall('.//outline'):
        url = node.attrib.get('xmlUrl')
        if url:
            urls.append(url)
    return urls

urls = extract_rss_urls_from_opml('your_file')
Since it's an XML file, you can use an XPath query to extract the urls.
In the XML file, it looks like the rss feed urls are stored in xmlUrl attributes. The XPath expression //#xmlUrl will select all values of that attribute.
If you want to test this out in your web-browser, you can use an online XPath tester. If you want to perform this XPath query in Python, this question explains how to use XPath in Python. Additionally, the lxml docs have a page on using XPath in lxml that might be helpful.
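For example, a minimal sketch with lxml (the filename is a placeholder):
from lxml import etree

# Select every xmlUrl attribute value in the OPML file.
tree = etree.parse("google-reader-subscriptions.xml")
urls = tree.xpath("//@xmlUrl")
print("\n".join(urls))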
You could also use a regex. I used the following search-and-replace regex to convert my Google Reader OPML export to a Firefox HTML live-bookmark import:
^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>
<DT><A FEEDURL="$2" HREF="$3">$1</A>
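If you only need the feed URLs rather than the bookmark format, a rough Python equivalent of the regex approach (regexes on XML are brittle, so treat this as a sketch; the filename is a placeholder):
import re

# Pull every xmlUrl="..." value out of the OPML text.
with open("google-reader-subscriptions.xml") as f:
    opml = f.read()

urls = re.findall(r'xmlUrl="(.*?)"', opml)
print("\n".join(urls))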