How do i get uniprotkb given pdb id? - python

I need to get information about length and domain structure of a particular protein, for example 1btk. For this I need to get UniprotKB, how can I do it?
from web site http://www.rcsb.org/pdb/explore.do?structureId=1BTK
the UniprotKB is 'Q06187'

you may use urllib2 for download pdb file, next to use regular expression for extract the Uniprot id
url_template = "http://www.rcsb.org/pdb/files/{}.pdb"
protein = "1BTK"
url = url_template.format(protein)
import urllib2
response = urllib2.urlopen(url)
pdb = response.read()
response.close() # best practice to close the file
import re
m = re.search('UNP\ +(\w+)', pdb)
m.group(1)
# you get 'Q06187'
bonus, if you wish parser the pdb file:
from Bio.PDB.PDBParser import PDBParser
response = urllib2.urlopen(url)
parser = PDBParser()
structure = parser.get_structure(protein, response)
response.close() # best practice to close the file
header = parser.get_header()
trailer = parser.get_trailer()
#info about protein in structure, header and trailer

Related

Python download image from short url by keeping its own name

I would like to download image file from shortener url or generated url which doesn't contain file name on it.
I have tried to use [content-Disposition]. However my file name is not in ASCII code. So it can't print the name.
I have found out i can use urlretrieve, request to download file but i need to save as different name.
I want to download by keeping it's own name..
How can i do this?
matches = re.match(expression, message, re.I)
url = matches[0]
print(url)
original = urlopen(url)
remotefile = urlopen(url)
#blah = remotefile.info()['content-Disposition']
#print(blah)
#value, params = cgi.parse_header(blah)
#filename = params["filename*"]
#print(filename)
#print(original.url)
#filename = "filedown.jpg"
#urlretrieve(url, filename)
These are the list that i have try but none of them work
I was able to get this to work with the requests library because you can use it to get the url that the shortened url redirects to. Then, I applied your code to the redirected url and it worked. There might be a way to only use urllib (I assume thats what you are using) with this, but I dont know.
import requests
from urllib.request import urlopen
import cgi
def getFilenameFromURL(url):
req = requests.request("GET", url)
# req.url is now the url the shortened url redirects to
original = urlopen(req.url)
value, params = cgi.parse_header(original.info()['Content-Disposition'])
filename = params["filename*"]
print(filename)
return filename
getFilenameFromURL("https://shorturl.at/lKOY3")
You can then use urlretrieve with this. Its inefficient but it works... Also since you can get the actual url with the requests library, you can probably get the filname through there.

how to get an XML file from a website using python?

using 'bottle' library, I have to create my own API based on this website http://dblp.uni-trier.de so I have to get data for each author. For this reason I am using the following link format http://dblp.uni-trier.de/pers/xx/'first letter of the last name'/'lastnamefirstname'.xml
Could you help me get the XML format to be able to parse it and get the information I need.
thank you
import bottle
import requests
import re
r = requests.get("https://dblp.uni-trier.de/")
#the format of my request is
#http://localhost:8080/lastname firstname
#bottle.route('/info/<name>')
def info(name):
first_letter = name[:1]
#mettre au format Lastname:Firstname
...
data = requests.get("http://dblp.uni-trier.de/pers/xx/" + first_letter + "/" + family_name + ".xml")
return data
bottle.run(host='localhost', port=8080)
from xml.etree import ElementTree
import requests
url = 'some url'
response = requests.get(url)
xml_root = ElementTree.fromstring(response.content)
fromstring Parses an XML section from a string constant. This function can be used to embed “XML literals” in Python code. text is a
string containing XML data. parser is an optional parser instance. If
not given, the standard XMLParser parser is used. Returns an Element
instance.
HOW TO Load XML from a string into an ElementTree
from xml.etree import ElementTree
root = ElementTree.fromstring("<root><a>1</a></root>")
ElementTree.dump(root)
OUTPUT
<root><a>1</a></root>
The object returned from requests.get is not the raw data. You need to use text property to get the contents
Response Content Documentation
Note that:
response.text returns content as unicode
response.content returns content as bytes

How to parse XML by using python

I want to parse this url to get the text of \Roman\
http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for count in counts
print count.get('Roman')
But it didn't work.
Try tree.findall('.//{urn:yahoo:jp:jlp:FuriganaService}Word') . It seems you need to specify the namespace too .
I recently ran into a similar issue to this. It was because I was using an older version of the xml.etree package and to workaround that issue I had to create a loop for each level of the XML structure. For example:
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for result in tree.findall('Result'):
for wordlist in result.findall('WordList'):
for word in wordlist.findall('Word'):
print(word.get('Roman'))
Edit:
With the suggestion from #omu_negru I was able to get this working. There was another issue, when getting the text for "Roman" you were using the "get" method which is used to get attributes of the tag. Using the "text" attribute of the element you can get the text between the opening and closing tags. Also, if there is no 'Roman' tag, you'll get a None object and won't be able to get an attribute on None.
# encoding: utf-8
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
ns = '{urn:yahoo:jp:jlp:FuriganaService}'
counts = tree.findall('.//%sWord' % ns)
for count in counts:
roman = count.find('%sRoman' % ns)
if roman is None:
print 'Not found'
else:
print roman.text

Using configparser for text read from URL using urllib2

I have to read a txt ini file from my browser. [this is required]
res = urllib2.urlopen(URL)
inifile = res.read()
Then I want to basically use this the same way as I would have read any txt file.
config = ConfigParser.SafeConfigParser()
config.read( inifile )
But now looks like I can't use it as this is actually a string now
Can anybody suggest a way around?
You want configparser.readfp -- Presumably, you might even be able to get away with:
res = urllib2.urlopen(URL)
config = ConfgiParser.SafeConfigParser()
config.readfp(res)
assuming that urllib2.urlopen returns an object that is sufficiently file-like (i.e. it has a readline method). For easier debugging, you could do:
config.readfp(res, URL)
If you have to read it the data from a string, you could pack the whole thing into a io.StringIO (or StringIO.StringIO) buffer and read from that:
import io
res = urllib2.urlopen(URL)
inifile_text = res.read()
inifile = io.StringIO(inifile_text)
inifile.seek(0)
config.readfp(inifile)

Writing data to csv or text file using python

I am trying to write some data to csv file by checking some condition as below
I will have a list of urls in a text file as below
urls.txt
www.example.com/3gusb_form.aspx?cid=mum
www.example_second.com/postpaid_mum.aspx?cid=mum
www.example_second.com/feedback.aspx?cid=mum
Now i will go through each url from the text file and read the content of the url using urllib2 module in python and will search a string in the entire html page. If the required string founds i will write that url in to a csv file.
But when i am trying to write data(url) in to csv file,it is saving like each character in to one coloumn as below instead of saving entire url(data) in to one column
h t t p s : / / w w w......
Code.py
import urllib2
import csv
search_string = 'Listen Capcha'
html_urls = open('/path/to/input/file/urls.txt','r').readlines()
outputcsv = csv.writer(open('output/path' + 'urls_contaning _%s.csv'%search_string, "wb"),delimiter=',', quoting=csv.QUOTE_MINIMAL)
outputcsv.writerow(['URL'])
for url in html_urls:
url = url.replace('\n','').strip()
if not len(url) == 0:
req = urllib2.Request(url)
response = urllib2.urlopen(req)
if str(search_string) in response.read():
outputcsv.writerow(url)
So whats wrong with the above code, what needs to be done in order to save the entire url(string) in to one column in a csv file ?
Also how can we write data to a text file as above ?
Edited
Also i had a url suppose like http://www.vodafone.in/Pages/tuesdayoffers_che.aspx ,
this url will be redirected to http://www.vodafone.in/pages/home_che.aspx?cid=che in browser actually, but when i tried through code as below it is same as the above given url
import urllib2, httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request("http://www.vodafone.in/Pages/tuesdayoffers_che.aspx")
opener = urllib2.build_opener()
f = opener.open(request)
print f.geturl()
Result
http://www.vodafone.in/pages/tuesdayoffers_che.aspx?cid=che
So finally how to catch the redirected url with urllib2 and fetch the data from it ?
Change the last line to:
outputcsv.writerow([url])

Categories

Resources