BeautifulSoup takes forever, can this be done faster?


I'm using a Raspberry Pi 1B+ w/ Debian Linux:
Linux rbian 3.18.0-trunk-rpi #1 PREEMPT Debian 3.18.5-1~exp1+rpi16 (2015-03-28) armv6l GNU/Linux
As part of a larger Python program I'm using this code:
#!/usr/bin/env python
import time
from urllib2 import Request, urlopen
from bs4 import BeautifulSoup
_url="http://xml.buienradar.nl/"
s1 = time.time()
req = Request(_url)
print "Request = {0}".format(time.time() - s1)
s2 = time.time()
response = urlopen(req)
print "URLopen = {0}".format(time.time() - s2)
s3 = time.time()
output = response.read()
print "Read = {0}".format(time.time() - s3)
s4 = time.time()
soup = BeautifulSoup(output)
print "Soup (1) = {0}".format(time.time() - s4)
s5 = time.time()
MSwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windsnelheidms)
GRwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windrichtinggr)
ms = MSwind.replace("<"," ").replace(">"," ").split()[1]
gr = GRwind.replace("<"," ").replace(">"," ").split()[1]
print "Extracting info = {0}".format(time.time() - s5)
s6 = time.time()
soup = BeautifulSoup(urlopen(_url))
print "Soup (2) = {0}".format(time.time() - s6)
s5 = time.time()
MSwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windsnelheidms)
GRwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windrichtinggr)
ms = MSwind.replace("<"," ").replace(">"," ").split()[1]
gr = GRwind.replace("<"," ").replace(">"," ").split()[1]
print "Extracting info = {0}".format(time.time() - s5)
When I run it, I get this output:
Request = 0.00394511222839
URLopen = 0.0579500198364
Read = 0.0346400737762
Soup (1) = 23.6777830124
Extracting info = 0.183892965317
Soup (2) = 36.6107468605
Extracting info = 0.382317781448
So, the BeautifulSoup command takes about half a minute to process the _url.
I would really love it if this could be done in under 10 seconds.
Any suggestions that would significantly speed up the code (cutting the run time by at least 60%) would be extremely welcome.

Install the lxml library; once it is installed, BeautifulSoup will use it as the default parser.
lxml parses the page using the libxml2 C library, which is significantly faster than the default html.parser backend, which is implemented in pure Python.
You can then also parse the page as XML instead of as HTML:
soup = BeautifulSoup(output, 'xml')
Parsing your given page with lxml should be faster; I can parse the page almost 50 times per second:
>>> timeit("BeautifulSoup(output, 'xml')", 'from __main__ import BeautifulSoup, output', number=50)
1.1700470447540283
Still, I wonder if you are missing some other Python acceleration libraries, as I certainly cannot reproduce your results even with the built-in parser:
>>> timeit("BeautifulSoup(output, 'html.parser')", 'from __main__ import BeautifulSoup, output', number=50)
1.7218239307403564
Perhaps you are memory constrained and the large-ish document causes your OS to swap memory a lot? Memory swapping (writing pages to disk and loading other pages from disk) can bring even the fastest programs to a grinding halt.
Note that instead of using str() on tag elements and splitting off the tags, you can get the value from a tag simply by using the .string attribute:
station_6350 = soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350)
ml = station_6350.windsnelheidMS.string
gr = station_6350.windrichtingGR.string
If you are using the XML parser, take into account that tagnames must match case (HTML is a case-insensitive mark-up language).
Since this is an XML document, another option would be to use the lxml ElementTree model; you can use XPath expressions to extract the data:
from lxml import etree
response = urlopen(_url)
for event, elem in etree.iterparse(response, tag='weerstation'):
    if elem.get('id') == '6350':
        ml = elem.find('windsnelheidMS').text
        gr = elem.find('windrichtingGR').text
        break
    # clear elements we are not interested in, adapted from
    # http://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory
    elem.clear()
    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]
This should only build the minimal object tree required, clearing out the weather stations you don't need as you go along the document.
Demo:
>>> from lxml import etree
>>> from urllib2 import urlopen
>>> _url = "http://xml.buienradar.nl/"
>>> response = urlopen(_url)
>>> for event, elem in etree.iterparse(response, tag='weerstation'):
...     if elem.get('id') == '6350':
...         ml = elem.find('windsnelheidMS').text
...         gr = elem.find('windrichtingGR').text
...         break
...     # clear elements we are not interested in
...     elem.clear()
...     for ancestor in elem.xpath('ancestor-or-self::*'):
...         while ancestor.getprevious() is not None:
...             del ancestor.getparent()[0]
...
>>> ml
'4.64'
>>> gr
'337.8'
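For readers who cannot install lxml, the same streaming idea can be sketched with the standard library's xml.etree.ElementTree.iterparse. This is only a Python 3 sketch under assumptions: the inline sample merely mimics the shape of the feed (its values are copied from the outputs shown in this thread, not fetched), and find_station is a hypothetical helper name. The standard-library iterparse has no tag= filter, so the tag is checked by hand.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample that only mimics the structure of the real feed.
SAMPLE = b"""<buienradarnl><weergegevens><actueel_weer><weerstations>
<weerstation id="6391"><windsnelheidMS>2.35</windsnelheidMS><windrichtingGR>232.6</windrichtingGR></weerstation>
<weerstation id="6350"><windsnelheidMS>4.64</windsnelheidMS><windrichtingGR>337.8</windrichtingGR></weerstation>
</weerstations></actueel_weer></weergegevens></buienradarnl>"""


def find_station(stream, station_id):
    """Return (wind speed, wind direction) text for one station, or None."""
    # Only "end" events are requested, so each element is complete when seen.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "weerstation":
            continue
        if elem.get("id") == station_id:
            return elem.findtext("windsnelheidMS"), elem.findtext("windrichtingGR")
        # Drop the children of stations we do not need, to keep memory low.
        elem.clear()
    return None


result = find_station(io.BytesIO(SAMPLE), "6350")
print(result)
```

In real use the stream would be the HTTP response object instead of the in-memory sample. Whether this is fast enough on a Raspberry Pi has not been measured here.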

Using requests and regular expressions can be a lot shorter and faster. For such relatively simple data gathering regexes work fine.
#!/usr/bin/env python
from __future__ import print_function
import re
import requests
import time
_url = "http://xml.buienradar.nl/"
_regex = '<weerstation id="6391">.*?'\
'<windsnelheidMS>(.*?)</windsnelheidMS>.*?'\
'<windrichtingGR>(.*?)</windrichtingGR>'
s1 = time.time()
br = requests.get(_url)
print("Request = {0}".format(time.time() - s1))
s5 = time.time()
MSwind, GRwind = re.findall(_regex, br.text)[0]
print("Extracting info = {0}".format(time.time() - s5))
print('wind speed', MSwind, 'm/s')
print('wind direction', GRwind, 'degrees')
On my desktop (which is not a raspberry, though :-) ) this runs very fast;
Request = 0.0723416805267334
Extracting info = 0.0009412765502929688
wind speed 2.35 m/s
wind direction 232.6 degrees
Of course this particular regex would fail if the windsnelheidMS and windrichtingGR tags were reversed. But given that the XML is most probably computer-generated that doesn't seem likely.
There is a solution for that, though: first use a regex to capture the text between <weerstation id="6391"> and </weerstation>, and then use two other regexes to find the wind speed and direction within that text.
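The two-step variant could look roughly like the sketch below (Python 3). The sample string is made up to mimic the feed, with the two tags deliberately reversed for station 6391, and wind_for_station is a hypothetical helper name, not part of the original answer. It also passes re.DOTALL so that the patterns still match if the tags are spread over several lines.

```python
import re

# Made-up fragment mimicking the feed; note the reversed tag order in 6391.
SAMPLE = (
    '<weerstation id="6350"><windsnelheidMS>4.64</windsnelheidMS>'
    '<windrichtingGR>337.8</windrichtingGR></weerstation>'
    '<weerstation id="6391"><windrichtingGR>232.6</windrichtingGR>'
    '<windsnelheidMS>2.35</windsnelheidMS></weerstation>'
)


def wind_for_station(xml_text, station_id):
    """Return (speed, direction) strings, or None if anything is missing."""
    # Step 1: isolate the block that belongs to one station.
    block = re.search(
        r'<weerstation id="%s">(.*?)</weerstation>' % re.escape(station_id),
        xml_text,
        re.DOTALL,
    )
    if block is None:
        return None
    body = block.group(1)
    # Step 2: look up each tag on its own, so their order does not matter.
    speed = re.search(r"<windsnelheidMS>(.*?)</windsnelheidMS>", body, re.DOTALL)
    direction = re.search(r"<windrichtingGR>(.*?)</windrichtingGR>", body, re.DOTALL)
    if speed is None or direction is None:
        return None
    return speed.group(1), direction.group(1)


print(wind_for_station(SAMPLE, "6391"))
```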

Related

issue extracting html page's string using bs4

I'm writing a program to find song lyrics. The program is almost done, but I have a small problem with a bs4 data type.
My question is: how do I extract plain text from the lyric variable at the end of the script?
import re
import requests
import bs4
from urllib import unquote
def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")
    match = re.search('songmeanings\.com\/[^image].*?\/"',page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # this line used to remove a " at the end of line
    else:
        return(1)
def getText(link):
    page = requests.get(str(link))
    soup = bs4.BeautifulSoup(page.content ,"lxml")
    return(soup)
Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print (lyric)
And here is the output:
[\n\t\t\t\t\t\tPlease could you stop the noise,\nI'm trying to get some rest\nFrom all the unborn chicken voices in my head\nWhat's that?\nWhat's that?\n\nWhen I am king, you will be first against the wall\nWith your opinion which is of no consequence at all\nWhat's that?\nWhat's that?\n\nAmbition makes you look pretty ugly\nKicking and squealing Gucci little piggy\nYou don't remember\nYou don't remember\nWhy don't you remember my name?\nOff with his head, man\nOff with his head, man\nWhy don't you remember my name?\nI guess he does\n\nRain down, rain down\nCome on rain down on me\nFrom a great height\nFrom a great height, height\nRain down, rain down\nCome on rain down on me\nFrom a great height\nFrom a great height, height,\nRain down, rain down\nCome on rain down on me\n\nThat's it, sir\nYou're leaving\nThe crackle of pigskin\nThe dust and the screaming\nThe yuppies networking\nThe panic, the vomit\nThe panic, the vomit\nGod loves his children,\nGod loves his children, yeah!\nEdit Lyrics\nEdit Wiki\nAdd Video\n ]

Add the following line of code:
lyric = ''.join([tag.text for tag in lyric])
after
lyric = Soup.findAll(attrs={"lyric-box"})
You'll get output like this:
Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?
...

First trim the leading and trailing [] by doing stringvar[1:-1] then on each line call linevar.strip() which will strip off all that whitespace.
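The per-line stripping suggested above can be sketched as follows, assuming the lyrics are already available as one plain string (the short sample is taken from the output in the question; tidy is a hypothetical helper name):

```python
def tidy(text):
    """Strip every line, then drop leading and trailing blank lines."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


# A short piece of the raw text, with the tabs and trailing spaces it had.
raw = "\n\t\t\t\tPlease could you stop the noise,\nI'm trying to get some rest\n   "
cleaned = tidy(raw)
print(cleaned)
```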

For anyone who likes the idea: with a few small changes, my code finally looks like this :)
import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO
def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToPage)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    tab_content = str(soup.find_all(attrs={"tab-content"}))
    pattern = r'\"\/\/songmeanings.com\/.+?\"'
    links = re.findall(pattern,tab_content)
    """returns first mached item without double quote
    at the beginning and at the end of the string"""
    return("http:"+links[0][1:-1:])
def getText(linkToSong):
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToSong)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)
link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)

Optimize Python Script to parse xml

I'm parsing the US Patent XML files (downloaded from Google patent dumps) using Python and BeautifulSoup; the parsed data is exported to a MySQL database.
Each year's data contains close to 200-300K patents, which means parsing 200-300K XML files.
The server on which I'm running the Python script is pretty powerful (16 cores, 160 GB of RAM, etc.), but it still takes close to 3 days to parse one year's worth of data.
I've been learning and using Python for 2 years, so I can get stuff done but do not know how to get it done in the most efficient manner. I'm reading up on it.
How can I optimize the below script to make it efficient?
Any guidance would be greatly appreciated.
Below is the code:
from bs4 import BeautifulSoup
import pandas as pd
from pandas.core.frame import DataFrame
import MySQLdb as db
import os
cnxn = db.connect('xx.xx.xx.xx','xxxxx','xxxxx','xxxx',charset='utf8',use_unicode=True)
def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()
def get_data(soup):
    df = pd.DataFrame(columns = ['doc_id','patcit_num','patcit_document_id_country', 'patcit_document_id_doc_number','patcit_document_id_kind','patcit_document_id_name','patcit_document_id_date','category'])
    if soup.findAll('us-citation'):
        cit = soup.findAll('us-citation')
    else:
        cit = soup.findAll('citation')
    doc_id = soup.findAll('publication-reference')[0].find('doc-number').text
    for x in cit:
        try:
            patcit_num = x.find('patcit')['num']
        except:
            patcit_num = None
        try:
            patcit_document_id_country = x.find('country').text
        except:
            patcit_document_id_country = None
        try:
            patcit_document_id_doc_number = x.find('doc-number').text
        except:
            patcit_document_id_doc_number = None
        try:
            patcit_document_id_kind = x.find('kind').text
        except:
            patcit_document_id_kind = None
        try:
            patcit_document_id_name = x.find('name').text
        except:
            patcit_document_id_name = None
        try:
            patcit_document_id_date = x.find('date').text
        except:
            patcit_document_id_date = None
        try:
            category = x.find('category').text
        except:
            category = None
        print doc_id
        val = {'doc_id':doc_id,'patcit_num':patcit_num, 'patcit_document_id_country':patcit_document_id_country,'patcit_document_id_doc_number':patcit_document_id_doc_number, 'patcit_document_id_kind':patcit_document_id_kind,'patcit_document_id_name':patcit_document_id_name,'patcit_document_id_date':patcit_document_id_date,'category':category}
        df = df.append(val, ignore_index=True)
    df.to_sql(name = 'table_name', con = cnxn, flavor='mysql', if_exists='append')
    print '1 doc exported'
i=0
l = os.listdir('/path/')
for item in l:
    f = '/path/'+item
    print 'Currently parsing - ',item
    for xml_string in separated_xml(f):
        soup = BeautifulSoup(xml_string,'xml')
        if soup.find('us-patent-grant'):
            print item, i, xml_string[177:204]
            get_data(soup)
        else:
            print item, i, xml_string[177:204],'***********************************soup not found********************************************'
        i+=1
print 'DONE!!!'

Here is a tutorial on multi-threading, because currently that code will run on 1 thread, 1 core.
Remove all try/except statements and handle missing values explicitly instead. Exceptions are expensive.
Run a profiler to find the chokepoints, and multi-thread those or find a way to do them fewer times.
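As a rough illustration of the last two points, the sketch below (Python 3, standard library only) replaces a try/except lookup with an explicit None check and then profiles the work with cProfile. The names text_or_none and extract_many and the tiny sample document are hypothetical. The helper relies only on find() returning None for a missing child, which is true both for xml.etree.ElementTree (used here) and for BeautifulSoup tags.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET

# Made-up citation with some fields present and others (e.g. date) missing.
SAMPLE = "<citation><country>US</country><kind>A</kind></citation>"


def text_or_none(parent, tag):
    """Return the text of the first matching child, or None if it is absent."""
    node = parent.find(tag)
    return node.text if node is not None else None


def extract_many(xml_text, repeat):
    """Parse once, then look up a present and a missing field repeatedly."""
    root = ET.fromstring(xml_text)
    rows = []
    for _ in range(repeat):
        rows.append((text_or_none(root, "country"), text_or_none(root, "date")))
    return rows


profiler = cProfile.Profile()
profiler.enable()
rows = extract_many(SAMPLE, 1000)
profiler.disable()

# Collect the five most expensive calls, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

On the real script, the same profiler calls would be wrapped around the processing of one input file; the report then shows whether parsing, the lookups, or the database export dominates.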

So, you're doing two things wrong. First, you're using BeautifulSoup, which is slow, and second, you're using a "find" call, which is also slow.
As a first cut, look at lxml's ability to pre-compile XPath queries (look at the heading "The XPath class"). That will give you a huge speed boost.
Alternatively, I've been working on a library called yankee that does this kind of parsing declaratively, using best practices for lxml speed, including precompiled XPath.
Yankee on PyPI |
Yankee on GitHub
You could do the same thing with yankee like this:
from yankee.xml import Schema, fields as f
# Create a schema for citations
class Citation(Schema):
    num = f.Str(".//patcit")
    country = f.Str(".//country")
    # ... and so forth for the rest of your fields
# Then create a "wrapper" to get all the citations
class Patent(Schema):
    citations = f.List(".//us-citation|.//citation")
# Then just feed the Schema your lxml.etrees for each patent:
import lxml.etree as ET
schema = Patent()
for _, doc in ET.iterparse(xml_string, "xml"):
    result = schema.load(doc)
The result will look like this:
{
"citations": [
{
"num": "<some value>",
"country": "<some value>",
},
{
"num": "<some value>",
"country": "<some value>",
},
]
}
I would also check out Dask to help you multithread it more efficiently. Pretty much all my projects use it.

Why are the videos on the most_recent standard feed so out of date?

I'm trying to grab the most recently uploaded videos. There's a standard feed for that - it's called most_recent. I don't have any problems grabbing the feed, but when I look at the entries inside, they're all half a year old, which is hardly recent.
Here's the code I'm using:
import requests
import os.path as P
import sys
from lxml import etree
import datetime
namespaces = {"a": "http://www.w3.org/2005/Atom", "yt": "http://gdata.youtube.com/schemas/2007"}
fmt = "%Y-%m-%dT%H:%M:%S.000Z"
class VideoEntry:
    """Data holder for the video."""
    def __init__(self, node):
        self.entry_id = node.find("./a:id", namespaces=namespaces).text
        published = node.find("./a:published", namespaces=namespaces).text
        self.published = datetime.datetime.strptime(published, fmt)
    def __str__(self):
        return "VideoEntry[id='%s']" % self.entry_id
def paginate(xml):
    root = etree.fromstring(xml)
    next_page = root.find("./a:link[@rel='next']", namespaces=namespaces)
    if next_page is None:
        next_link = None
    else:
        next_link = next_page.get("href")
    entries = [VideoEntry(e) for e in root.xpath("/a:feed/a:entry", namespaces=namespaces)]
    return entries, next_link
prefix = "https://gdata.youtube.com/feeds/api/standardfeeds/"
standard_feeds = set("top_rated top_favorites most_shared most_popular most_recent most_discussed most_responded recently_featured on_the_web most_viewed".split(" "))
feed_name = sys.argv[1]
assert feed_name in standard_feeds
feed_url = prefix + feed_name
all_video_ids = []
while feed_url is not None:
    r = requests.get(feed_url)
    if r.status_code != 200:
        break
    text = r.text.encode("utf-8")
    video_ids, feed_url = paginate(text)
    all_video_ids += video_ids
all_upload_times = [e.published for e in all_video_ids]
print min(all_upload_times), max(all_upload_times)
As you can see, it prints the min and max timestamps for the entire feed.
misha@misha-antec$ python get_standard_feed.py most_recent
2013-02-02 14:40:02 2013-02-02 14:54:00
misha@misha-antec$ python get_standard_feed.py top_rated
2006-04-06 21:30:53 2013-07-28 22:22:38
I've glanced through the downloaded XML and it appears to match the output. Am I doing something wrong?
Also, on an unrelated note, the feeds I'm getting are all about 100 entries (I'm paginating through them 25 at a time). Is this normal? I expected the feeds to be a bit bigger.

Regarding the "most_recent" feed: There is a ticket for this one here. Unfortunately, the YouTube API team hasn't responded or solved the problem so far.
Regarding the number of entries: That depends on the type of standard feed, but for the most_recent feed it's usually around 100.
Note: You could try using the "orderby=published" parameter to get recent videos, although I don't know how "recent" they are.
https://gdata.youtube.com/feeds/api/videos?orderby=published&prettyprint=True
You can combine this query with the "category"-parameter or other ones (region-specific queries - like for the standard feeds - are not possible, afaik).

LXML Xpath does not seem to return full path

OK, I'll be the first to admit that it is a path; it's just not the path I want, and I don't know how to get that one.
I'm using Python 3.3 in Eclipse with the PyDev plugin, on Windows 7 at work and on Ubuntu 13.04 at home. I'm new to Python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyds market insurance message, find all the tags and dump them in a .csv where we can easily update them and then reimport them to create an updated xml.
I have managed to do all of that except when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their paths. For example, I want to show Count as ItemsInGroupTotal/Count, but I can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print( xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()
fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I removed the first two characters because they are b' and it complained that the text didn't start with a tag.
Update:
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?

ElementTree objects have a getpath(element) method, which returns a structural, absolute XPath expression to find that element.
Calling getpath on each element in an iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.
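If lxml's getpath() is not available, a similar listing can be approximated with the standard library by walking the tree and building each path by hand. This is a Python 3 sketch under assumptions: walk and local_name are hypothetical helper names, and the sample is a shortened copy of the fragment from the question. Unlike getpath(), it adds no positional indices, so repeated siblings share one path. It also drops the namespace part of each tag, which may not suit the asker's need to keep the xis tags identifiable.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment from the question.
SAMPLE = """<TechAccount Sender="broker" Receiver="insurer">
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>"""


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def walk(element, prefix=""):
    """Yield (path, text) for the element and its descendants, depth first."""
    path = prefix + "/" + local_name(element.tag)
    yield path, (element.text or "").strip()
    for child in element:
        yield from walk(child, path)


paths = list(walk(ET.fromstring(SAMPLE)))
for path, text in paths:
    print("%s, %s" % (path, text))
```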

getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath

How can I talk to UniProt over HTTP in Python?

I'm trying to get some results from UniProt, which is a protein database (details are not important). I'm trying to use some script that translates from one kind of ID to another. I was able to do this manually on the browser, but could not do it in Python.
In http://www.uniprot.org/faq/28 there are some sample scripts. I tried the Perl one and it seems to work, so the problem is my Python attempts. The (working) script is:
## tool_example.pl ##
use strict;
use warnings;
use LWP::UserAgent;
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';
my $params = {
  from => 'ACC', to => 'P_REFSEQ_AC', format => 'tab',
  query => 'P13368 P20806 Q9UM73 P97793 Q17192'
};
my $agent = LWP::UserAgent->new;
push @{$agent->requests_redirectable}, 'POST';
print STDERR "Submitting...\n";
my $response = $agent->post("$base/$tool/", $params);
while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  print STDERR "Checking...\n";
  $response = $agent->get($response->base);
}
$response->is_success ?
  print $response->content :
  die 'Failed, got ' . $response->status_line .
    ' for ' . $response->request->uri . "\n";
My questions are:
1) How would you do that in Python?
2) Will I be able to massively "scale" that (i.e., use a lot of entries in the query field)?

question #1:
This can be done using Python's urllib and urllib2 modules:
import urllib, urllib2
import time
import sys
# join the IDs given on the command line (skipping the script name)
query = ' '.join(sys.argv[1:])
# encode params as a list of 2-tuples
params = ( ('from','ACC'), ('to', 'P_REFSEQ_AC'), ('format','tab'), ('query', query))
# url encode them
data = urllib.urlencode(params)
url = 'http://www.uniprot.org/mapping/'
# fetch the data
try:
    foo = urllib2.urlopen(url, data)
except urllib2.HTTPError, e:
    if e.code == 503:
        # the server asks us to wait; read the delay from the Retry-After header
        wait_time = int(e.hdrs.get('Retry-after', 0))
        print 'Sleeping %i seconds...' % (wait_time,)
        time.sleep(wait_time)
        foo = urllib2.urlopen(url, data)
    else:
        # any other HTTP error is not handled here
        raise
# foo is a file-like object, do with it what you will.
foo.read()
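On Python 3 the same request is built with urllib.parse and urllib.request. The sketch below only constructs the request and does not send it. The URL and parameters are the ones used in this thread and may no longer be served, and the query is shortened to two of the IDs from the question.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Same parameters as the answer above; the IDs come from the question.
params = (
    ("from", "ACC"),
    ("to", "P_REFSEQ_AC"),
    ("format", "tab"),
    ("query", "P13368 P20806"),
)

# urlopen() in Python 3 expects the POST body as bytes, not str.
data = urlencode(params).encode("ascii")

# Building the Request does not contact the server; passing data makes it a POST.
request = Request("http://www.uniprot.org/mapping/", data=data)
print(request.get_method(), data)
```

Sending it would be a call to urllib.request.urlopen(request), with the 503 and Retry-After handling done through urllib.error.HTTPError in the same way as above.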

You're probably better off using the Protein Identifier Cross Reference service from the EBI to convert one set of IDs to another. It has a very good REST interface.
http://www.ebi.ac.uk/Tools/picr/
I should also mention that UniProt has very good web services available. Though if you are tied to using simple HTTP requests for some reason, then they're probably not useful.

Let's assume that you are using Python 2.5.
We can use httplib to directly call the web site:
import httplib, urllib
querystring = {}
#Build the query string here from the following keys (query, format, columns, compress, limit, offset)
querystring["query"] = ""
querystring["format"] = "" # one of html | tab | fasta | gff | txt | xml | rdf | rss | list
querystring["columns"] = "" # the columns you want, comma separated
querystring["compress"] = "" # yes or no
## These may be optional
querystring["limit"] = "" # I guess if you only want a few rows
querystring["offset"] = "" # bring on paging
##From the examples - query=organism:9606+AND+antigen&format=xml&compress=no
##Delete the following and replace with your query
querystring = {}
querystring["query"] = "organism:9606 AND antigen"
querystring["format"] = "xml" #make it human readable
querystring["compress"] = "no" #I don't want to have to unzip
conn = httplib.HTTPConnection("www.uniprot.org")
conn.request("GET", "/uniprot/?"+ urllib.urlencode(querystring))
r1 = conn.getresponse()
if r1.status == 200:
    data1 = r1.read()
    print data1 #or do something with it
You could then make a function around creating the query string and you should be away.
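Such a function could look like the sketch below (Python 3). The name build_query_url and its defaults are assumptions for illustration; the host, path and parameter names are the ones used in the code above. It only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode


def build_query_url(query, fmt="xml", compress="no", **optional):
    """Return the GET URL for a search; optional keys (e.g. limit) are appended."""
    params = [("query", query), ("format", fmt), ("compress", compress)]
    # Skip optional parameters that were left unset (None).
    params += [(key, value) for key, value in sorted(optional.items()) if value is not None]
    return "http://www.uniprot.org/uniprot/?" + urlencode(params)


url = build_query_url("organism:9606 AND antigen", limit=10)
print(url)
```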

Check out bioservices; it provides Python interfaces to a lot of databases.
https://pythonhosted.org/bioservices/_modules/bioservices/uniprot.html
conda install bioservices --yes

To complement O.rka's answer:
Question 1:
from bioservices import UniProt
u = UniProt()
res = u.get_df("P13368 P20806 Q9UM73 P97793 Q17192".split())
This returns a dataframe with all information about each entry.
Question 2: same answer. This should scale up.
Disclaimer: I'm the author of bioservices

There is a python package in pip which does exactly what you want
pip install uniprot-mapper
