I'm trying to adapt a script (found here: https://gist.github.com/bonzanini/5a4c39e4c02502a8451d) to search and retrieve data from PubMed.
Here's what I have so far:
#!/usr/bin/env python
from Bio import Entrez
import datetime
import json

# Create dictionary of journals (the official abbreviations are not used here...)
GroupA = ["Nature", "Science", "PNAS", "JACS"]
GroupB = ["E-life", "Mol Cell", "Plos Computational", "Nature communication", "Cell"]
GroupC = ["Nature Biotech", "Nature Chem Bio", "Nature Str Bio", "Nature Methods"]

Journals = {}
for item in GroupA:
    Journals.setdefault("A", []).append(item)
for item in GroupB:
    Journals.setdefault("B", []).append(item)
for item in GroupC:
    Journals.setdefault("C", []).append(item)

# Set dates for search
today = datetime.datetime.today()
numdays = 15
dateList = []
for x in range(0, numdays):
    dateList.append(today - datetime.timedelta(days=x))
dateList[1:numdays - 1] = []

today = dateList[0].strftime("%Y/%m/%d")
lastdate = dateList[1].strftime("%Y/%m/%d")
print 'Retrieving data from %s to %s' % (lastdate, today)

for value in Journals['A']:
    Entrez.email = "email"
    handle = Entrez.esearch(db="pubmed", term="gpcr[TI] AND value[TA]",
                            sort="pubdate", retmax="10", retmode="xml",
                            datetype="pdat", mindate=lastdate, maxdate=today)
    record = Entrez.read(handle)
    print(record["IdList"])
I would like to use each "value" from the for loop (in this case, the journal titles) as a parameter of the Entrez.esearch function. There is no built-in parameter for this, so it has to go inside the term parameter, but it doesn't work as written above.
Once I have an ID list, I will then use Entrez.efetch to retrieve and print the data that I want, but that's another question...
I hope this is clear enough, first question for me! Thanks!
If I understand you correctly, I think this is what you are looking for:
term="gpcr[TI] AND {}[TA]".format(value)
Using this, each term in the loop will be:
"gpcr[TI] AND Nature[TA]"
"gpcr[TI] AND Science[TA]"
"gpcr[TI] AND PNAS[TA]"
"gpcr[TI] AND JACS[TA]"
I am working with this data, which I extracted from a public playlist and shows all of the number 1's since 1953 with their audio features: https://raw.githubusercontent.com/StanWaldron/StanWaldron.github.io/main/FinalData.csv
I am now trying to loop through and find their album ids so that I can retrieve their release date and plot their audio features against other time series data, using this code:
import pandas as pd  # needed for pd.read_csv below

def find_album_release(name):
    album_ids = []
    for x in name:
        # sp is assumed to be an already-authenticated Spotify client (e.g. spotipy)
        results = sp.search(q="album:" + x, type="album")
        if not results["albums"]["items"]:
            return []
        album_id = results['albums']['items'][0]['uri']
        album_ids.append(album_id)
        print(album_id)
    return album_ids

final = pd.read_csv('FinalData.csv')
albumlist = final['album']
finalalbums = find_album_release(albumlist)
It works for the first 7 and then returns nothing. Without the if statement, it reports that the index is out of range. I have tested the 8th element by hard-coding in its album name and it returns the correct result; the same is true for the next 4 in the list, so it isn't an issue with searching for these album names. I have played around with the lists but I am not entirely sure what is out of range of what.
Any help is greatly appreciated
The 8th row's album name contains single quotes (Don't Stop Me Eatin'). I tried removing the quotes and it worked. Maybe you should check which characters are allowed in the query parameters.
def find_album_release(name):
    album_ids = []
    for x in name:
        x = x.replace("'", "")  # Remove the quotes from album name
        results = sp.search(q="album:" + x, type="album")
        ....
    ....

final = pd.read_csv('FinalData.csv')
albumlist = final['album']
finalalbums = find_album_release(albumlist)
The output for me:
spotify:album:31lHUoHC3P6BRFzKYLyRJO
spotify:album:6s84u2TUpR3wdUv4NgKA2j
spotify:album:4OyzQQJHEfKXRfyN4QyLR7
spotify:album:2Hjcfw8zHN4dJDZJGOzLd6
spotify:album:1zEBi4O4AaY5M55dUcUp3z
spotify:album:0Hi8bTOS35xZM0zZ6S89hT
spotify:album:5GGIgiGtxIgcVJQnsKQW94
spotify:album:3rLjiJI34bHFNIFqeK3y9s
spotify:album:6q1MiYTIE28nFzjkvLLt0I
spotify:album:61ulfFSmmxMhc2wCdmdMkN
spotify:album:3euz4vS7ezKGnNSwgyvKcd
spotify:album:1pFaEu56zqpzSviJc3htZN
spotify:album:4PTxbJPI4jj0Kx8hwr1v0T
spotify:album:2ogiazbrNEx0kQHGl5ZBTQ
spotify:album:5glfCPECXSHzidU6exW8wO
spotify:album:1XMw3pBrYeXzNXZXc84DNw
spotify:album:623PL2MBg50Br5dLXC9E9e
spotify:album:4TqgXMSSTwP3RCo3MMSR6t
spotify:album:3xIwVbGJuAcovYIhzbLO3J
spotify:album:3h2xv1tJgDnJZGy5Roxs5A
spotify:album:66xP0vUo8to8ALVpkyKc41
spotify:album:6XcYTEonLIpg9NpAbJnqrC
spotify:album:5sXSHscDjBez8VF20cSyad
spotify:album:6pQZPa398NswBXGYyqHH7y
spotify:album:0488X5veBK6t3vSmIiTDJY
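A small follow-up on the design: the return [] inside the question's loop discards every ID collected so far as soon as a single title fails to match. A minimal sketch (still assuming sp is an authenticated Spotify client) that skips unmatched titles instead of bailing out:
def find_album_release(names):
    album_ids = []
    for x in names:
        x = x.replace("'", "")  # strip quotes, as suggested above
        results = sp.search(q="album:" + x, type="album")
        items = results["albums"]["items"]
        if not items:
            print("No match for: " + x)  # skip this title instead of returning early
            continue
        album_ids.append(items[0]["uri"])
    return album_ids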
I'm trying to search PubMed using search terms derived from a CSV file. I've combined the search terms into a form understandable by Biopython's Entrez module, like so:
term1 = ['"' + name + " AND " + disease + '"' for name, disease in zip(names, diseases)]
where 'names' and 'diseases' refers to the parameters I'm combining into a search using eSearch.
Subsequently, to execute the search, this is the code I wrote:
from Bio import Entrez

Entrez.email = "theofficialvelocifaptor#gmail.com"

for entry in range(0, len(term1)):
    handle = Entrez.esearch(db="pubmed", term=term1[entry], retmax="10")
    record = Entrez.read(handle)
    record["IdList"]
    print("The first 10 are\n{}".format(record["IdList"]))
Now, what I'm expecting the code to do is iterate the search over the entire list stored in term1. However, this is the output I'm getting:
['Botanical name', 'Asystasia salicifalia', 'Asystasia salicifalia', 'Asystasia salicifalia', 'Barleria strigosa', 'Justicia procumbens', 'Justicia procumbens', 'Strobilanthes auriculata', 'Thunbergia laurifolia', 'Thunbergia similis']
['Disease', 'Puerperal illness', 'Puerperium', 'Puerperal disorder', 'Tonic', 'Lumbago', 'Itching', 'Malnutrition', 'Detoxificant', 'Tonic']
The first 10 are
['31849133', '31751652', '31359527', '31178344', '31057654', '30725751', '28476677', '27798405', '27174082', '26923540']
The first 10 are
[]
The first 10 are
[]
The first 10 are
[]
The first 10 are
[]
The first 10 are
[]
The first 10 are
The first 10 are
[]
The first 10 are
[]
The first 10 are
[]
Surely, there's something I'm missing, because the iteration seems to be shorting out prematurely. I've been at it for a solid 5 hours at the time of writing, and I feel very silly. I should also mention that I am new to Python, so if I'm making any obvious mistakes, I don't see them.
Your loop is working fine; there are simply no PubMed results for the last 9 name/disease combinations.
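If you want to see that directly, the esearch result also includes a Count field, so you can print each query next to its number of hits. A minimal sketch using your existing term1 list (the email address is a placeholder):
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; use your real address

for term in term1:
    handle = Entrez.esearch(db="pubmed", term=term, retmax="10")
    record = Entrez.read(handle)
    handle.close()
    # Count is the total number of PubMed matches for this query
    print("{!r}: {} hits -> {}".format(term, record["Count"], record["IdList"]))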
I'm looking for ways to make the code more efficient (runtime and memory complexity)
Should I use something like a Max-Heap?
Is the bad performance due to the string concatenation, to sorting the dictionary out of place, or to something else?
Edit: I replaced the dictionary/map object with a Counter applied to a list of all retrieved names (with duplicates).
Minimal requirement: the script should take less than 30 seconds.
Current runtime: 54 seconds.
# Try to implement the program efficiently (running the script should take less than 30 seconds)
import requests
# Requests is an elegant and simple HTTP library for Python, built for human beings.
# Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
# Requests is not a built-in module (it does not come with the default Python installation), so you will have to install it:
# http://docs.python-requests.org/en/v2.9.1/
# installing it for PyCharm is not so easy and takes a lot of troubleshooting (problems with pip's main version)
# use conda/pip install requests instead
import json
# dict subclass for counting hashable objects
from collections import Counter
#import heapq
import datetime
url = 'https://api.namefake.com'
# a "global" list object. TODO: try to make it "static" (local to the file)
words = []
#####################################################################################
# Calls the site http://www.namefake.com 100 times and retrieves random names
# Examples for the format of the names from this site:
# Dr. Willis Lang IV
# Lily Purdy Jr.
# Dameon Bogisich
# Ms. Zora Padberg V
# Luther Krajcik Sr.
# Prof. Helmer Schaden etc....
#####################################################################################
requests.packages.urllib3.disable_warnings()
t = datetime.datetime.now()
for x in range(100):
    # for each name, break it to first and last name
    # no need for authentication
    # http://docs.python-requests.org/en/v2.3.0/user/quickstart/#make-a-request
    responseObj = requests.get(url, verify=False)
    # Decoding JSON data from returned response object text
    # Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    # containing a JSON document) to a Python object.
    jsonData = json.loads(responseObj.text)
    x = jsonData['name']
    newName = ""
    for full_name in x:
        # make a string from the decoded python object concatenation
        newName += str(full_name)
    # split by whitespaces
    y = newName.split()
    # parse the first name (check first if a header exists (Prof., Dr., Mr., Miss))
    if "." in y[0] or "Miss" in y[0]:
        words.append(y[2])
    else:
        words.append(y[0])
    words.append(y[1])
# Return the top 10 words that appear most frequently, together with the number of times, each word appeared.
# Output example: ['Weber', 'Kris', 'Wyman', 'Rice', 'Quigley', 'Goodwin', 'Lebsack', 'Feeney', 'West', 'Marlen']
# (We don't care whether the word was a first or a last name)
# list of tuples
top_ten = Counter(words).most_common(10)
top_names_list = [name[0] for name in top_ten]
print((datetime.datetime.now()-t).total_seconds())
print(top_names_list)
You are calling an API endpoint that generates dummy information one person at a time; that takes a considerable amount of time.
The rest of the code is taking almost no time.
Change the endpoint you are using (the one you use offers no bulk name generation) or use dummy data provided by Python modules.
You can clearly see that "counting and processing names" is not the bottleneck here:
from faker import Faker # python module that generates dummy data
from collections import Counter
import datetime
fake = Faker()
c = Counter()
# get 10.000 names, split them and add 1st part
t = datetime.datetime.now()
c.update( (fake.name().split()[0] for _ in range(10000)) )
print(c.most_common(10))
print((datetime.datetime.now()-t).total_seconds())
Output for 10000 names:
[('Michael', 222), ('David', 160), ('James', 140), ('Jennifer', 134),
('Christopher', 125), ('Robert', 124), ('John', 120), ('William', 111),
('Matthew', 111), ('Lisa', 101)]
in
1.886564 # seconds
General advice for code optimization: measure first, then optimize the bottlenecks.
If you need a code review, you can check https://codereview.stackexchange.com/help/on-topic and see whether your code fits the requirements of the Code Review Stack Exchange site. As with SO, some effort should be put into the question first, i.e. analyzing where the majority of your time is being spent.
Edit - with performance measurements:
import requests
import json
from collections import defaultdict
import datetime
# defaultdict is (in this case) better then Counter because you add 1 name at a time
# Counter is superiour if you update whole iterables of names at a time
d = defaultdict(int)
def insertToDict(n):
    d[n] += 1

url = 'https://api.namefake.com'
api_times = []
process_times = []
requests.packages.urllib3.disable_warnings()

for x in range(10):
    # for each name, break it to first and last name
    try:
        t = datetime.datetime.now()  # start time for API call
        # no need for authentication
        responseObj = requests.get(url, verify=False)
        jsonData = json.loads(responseObj.text)
        # end time for API call
        api_times.append((datetime.datetime.now() - t).total_seconds())
        x = jsonData['name']
        t = datetime.datetime.now()  # start time for name processing
        newName = ""
        for name_char in x:
            # make a string from the decoded python object concatenation
            newName = newName + str(name_char)
        # split by whitespaces
        y = newName.split()
        # parse the first name (check first if a header exists (Prof., Dr., Mr., Miss))
        if "." in y[0] or "Miss" in y[0]:
            insertToDict(y[2])
        else:
            insertToDict(y[0])
        insertToDict(y[1])
        # end time for name processing
        process_times.append((datetime.datetime.now() - t).total_seconds())
    except:
        continue
newA = sorted(d, key=d.get, reverse=True)[:10]
print(newA)
print(sum(api_times))
print(sum( process_times ))
Output:
['Ruecker', 'Clare', 'Darryl', 'Edgardo', 'Konopelski', 'Nettie', 'Price',
'Isobel', 'Bashirian', 'Ben']
6.533625
0.000206
You can make the parsing part better; I did not, because it does not matter.
It is better to use timeit for performance testing (it calls the code multiple times and averages, smoothing out artifacts due to caching/lag/...) (thanks @bruno desthuilliers). In this case I did not use timeit because I do not want to call the API 100000 times to average the results.
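For completeness, a minimal timeit sketch for just the name-counting part might look like this (the sample names below are made up purely for illustration):
import timeit
from collections import Counter

# hypothetical sample data standing in for names returned by the API
sample = ["Dr. Willis Lang IV", "Lily Purdy Jr.", "Dameon Bogisich"] * 1000

def count_first_names(names):
    c = Counter()
    for full_name in names:
        parts = full_name.split()
        # skip a leading title such as "Dr." or "Miss"
        first = parts[1] if ("." in parts[0] or parts[0] == "Miss") else parts[0]
        c.update([first])
    return c.most_common(10)

# total time for 100 repetitions of the counting step
print(timeit.timeit(lambda: count_first_names(sample), number=100))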
Hey guys, many thanks for taking the time to look at my problem. I have been working on this code for about 1 week (I am new to coding and to Python, 1 week also). Currently the loop only works if the x in xrange(x) and the 'rp' : 'x' parameter match the number of rows actually available in this XML. The XML updates throughout the day, so I was wondering if anyone can offer a solution to make x dynamic?
import mechanize
import urllib
import json
import re
from sched import scheduler
from time import time, sleep
s = scheduler(time, sleep)
def run_periodically(start, end, interval, func):
    event_time = start
    while event_time < end:
        s.enterabs(event_time, 0, func, ())
        event_time += interval
    s.run()
def getData():
    post_url = "urlofinterest_xml"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    ###### These are the parameters you've got from checking with the aforementioned tools
    parameters = {'page' : '1',
                  'rp' : '8',
                  'sortname' : 'roi',
                  'sortorder' : 'desc'
                  }
    ##### Encode the parameters
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')
    xmlload1 = json.loads(trans_array)
    pattern1 = re.compile('> (.*)<')
    pattern2 = re.compile('/control/profile/view/(.*)\' title=')
    pattern3 = re.compile('<span style=\'font-size:12px;\'>(.*)<\/span>')
    pattern4 = re.compile('title=\'Naps posted: (.*) Winners:')
    pattern5 = re.compile('Winners: (.*)\'><img src=')
    for i in xrange(8):
        user_delimiter = xmlload1['rows'][i]['cell']['username']
        selection_delimiter = xmlload1['rows'][i]['cell']['race_horse']
        username_delimiter_results = re.findall(pattern1, user_delimiter)[0]
        userid_delimiter_results = int(re.findall(pattern2, user_delimiter)[0])
        user_selection = re.findall(pattern3, selection_delimiter)[0]
        user_numberofselections = float(re.findall(pattern4, user_delimiter)[0])
        user_numberofwinners = float(re.findall(pattern5, user_delimiter)[0])
        strikeratecalc1 = user_numberofwinners/user_numberofselections
        strikeratecalc2 = strikeratecalc1*100
        print "user id = ", userid_delimiter_results
        print "username = ", username_delimiter_results
        print "user selection = ", user_selection
        print "best price available as decimal = ", xmlload1['rows'][i]['cell']['tws.best_price']
        print "race time = ", xmlload1['rows'][i]['cell']['race_time']
        print "race meeting = ", xmlload1['rows'][i]['cell']['race_meeting']
        print "ROI = ", xmlload1['rows'][i]['cell']['roi']
        print "number of selections = ", user_numberofselections
        print "number of winners = ", user_numberofwinners
        print "Strike rate = ", strikeratecalc2, "%"
        print ""
getData()
run_periodically(time()+5, time()+1000000, 15, getData)
Kind regards AEA
First, I'm going to talk about how you iterate over your results. Based on your code, xmlload1['rows'] is an array of dicts, so instead of choosing an arbitrary number, you can iterate over it directly instead. To make this a better example, I'm going to set up some arbitrary data to make this clear:
xmlload1 = {
"rows": [{"cell": {"username": "one", "race_horse":"b"}}, {"cell": {"username": "two", "race_horse": "c"}}]
}
So, given the data above, you can just iterate through rows in a for loop, like this:
for row in xmlload1['rows']:
    cell = row["cell"]
    print cell["username"]
    print cell["race_horse"]
On each iteration, row takes on the value of the next element in the iterable (the list in xmlload1['rows']), and cell is pulled out of it. This works with any container or sequence that supports iteration (like lists, tuples, dicts, generators, etc.).
Notice how I haven't used any magic numbers anywhere, so xmlload1['rows'] could be arbitrarily long and it would still work.
You can set the requests to be dynamic by using a function, like this:
def get_data(rp=8, page=1):
    parameters = {'page' : str(page),
                  'rp' : str(rp),
                  'sortname' : 'roi',
                  'sortorder' : 'desc'
                  }
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')
    return json.loads(trans_array)
Now, you can call get_data(rp=5) to get 5 rows, or get_data(rp=8) to get 8 rows [and get_data(rp=8, page=3) to get the third page], etc. And you can clearly add additional variables or even pass in the parameters dict to the function directly.
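To tie this back to the original question, the hard-coded xrange(8) can then be dropped entirely by iterating over whatever get_data returns; a rough sketch:
xmlload1 = get_data(rp=8)        # rp is just the requested page size
for row in xmlload1['rows']:     # works for however many rows actually came back
    cell = row['cell']
    user_delimiter = cell['username']
    selection_delimiter = cell['race_horse']
    # ... same regex parsing and printing as in getData() above ...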
I'm not sure I understand your question, but I think what you want is this:
rows = xmlload1['rows']
for row in rows:
    user_delimiter = row['cell']['username']
    selection_delimiter = row['cell']['race_horse']
    # ...
If you need the row index as well as the row itself, use enumerate:
rows = xmlload1['rows']
for i, row in enumerate(rows):
    user_delimiter = row['cell']['username']
    selection_delimiter = row['cell']['race_horse']
    # ...
In general, if you're doing for i in range(…) for any purpose other than a fixed number of iterations, you're doing it wrong. There's usually a collection you want to iterate over; just find it and iterate over it.
I am trying to grep some results pages for work, and then eventually output them to an HTML page so someone does not have to manually look through each section.
How I would eventually use it: I feed this function a results page, it greps through the 5 different sections, and then I can produce an HTML output (that's what the print/substitute area is for) with all the different results.
OK, MASSIVE EDIT: I actually removed the old code because I was asking too many questions. I fixed my code following some of the suggestions, but I am still interested in the advantage of using a human-readable dict instead of just a list. Here is my working code that gets all the right results into a 'list of lists'; I then output the first section in my eventual HTML block.
import urllib
import re
import string
import sys

def ipv6_results(input_page):
    sections = ['/spec.p2/summary.html', '/nd.p2/summary.html',
                '/addr.p2/summary.html', '/pmtu.p2/summary.html',
                '/icmp.p2/summary.html']
    variables_output = []
    for s in sections:
        temp_list = []
        page = input_page + s
        #print page
        url_reference = urllib.urlopen(page)
        html_page = url_reference.read()
        m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        temp_list.append(int(m.group(1)))
        m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        temp_list.append(int(m.group(1)))
        m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        temp_list.append(int(m.group(1)))
        variables_output.append(temp_list)

    #print variables to check them :)
    print "------"
    print variables_output
    print "Ready Logo Phase 2"
    print "Section | Total | Pass | Fail |"

    #this next part is eventually going to output an html block
    output = string.Template("""
    1 - RFC2460-IPv6 Specs $spec_total $spec_pass $spec_fail
    """)
    print output.substitute(spec_total=variables_output[0][0], spec_pass=variables_output[0][1],
                            spec_fail=variables_output[0][2])
    return 1
Imagine the tabbing is correct :( I wish this were more like Pastebin; suggestions welcome on how to paste code in here.
Generally, you don't declare the shape of the list first, and then fill in the values. Instead, you build the list as you discover the values.
Your variables has a lot of structure. You've got inner lists of 3 elements, always in the order of 'total', 'pass', 'fail'. Perhaps these 3-tuples should be made namedtuples. That way, you can access the three parts with humanly-recognizable names (data.total, data.passed, data.fail; note that pass itself is a reserved word in Python, so the field needs a different name), instead of cryptic index numbers (data[0], data[1], data[2]).
Next, your 3-tuples differ by prefixes: 'spec', 'nd', 'addr', etc.
These sound like keys to a dict rather than elements of a list.
So perhaps consider making variables a dict. That way, you can access the particular 3-tuple you want with the humanly-recognizable variables['nd'] instead of variables[1]. And you can access the nd_fail value with variables['nd'].fail instead of variables[1][2]:
import collections

# Define the namedtuple class Point (used below).
# 'pass' is a reserved word in Python, so the middle field is named 'passed'.
Point = collections.namedtuple('Point', 'total passed fail')

# Notice we declare `variables` empty at first; we'll fill in the values later.
variables = {}
keys = ('spec', 'nd', 'addr', 'pmtu', 'icmp')

# Walk the sections and their keys in step (they correspond one-to-one).
for s, key in zip(sections, keys):
    page = input_page + s
    url_reference = urllib.urlopen(page)
    html_page = url_reference.read()
    m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    ntotal = int(m.group(1))
    m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    npass = int(m.group(1))
    m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    nfail = int(m.group(1))
    # We create an instance of the namedtuple on the right-hand side
    # and store the value in `variables[key]`, thus building the
    # variables dict incrementally.
    variables[key] = Point(ntotal, npass, nfail)
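As a quick usage sketch (assuming the loop above has run), each section's numbers can then be read by name instead of by index:
for key in keys:
    result = variables[key]
    print "%s: total=%d passed=%d fail=%d" % (key, result.total, result.passed, result.fail)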
The first thing to note is that those lists only hold the values of the variables at the time of assignment; you would be changing the list values, but not the variables themselves.
I would seriously consider using classes and building structures of those, including lists of class instances.
For example:
class SectionResult:
    # 'pass' is a reserved word in Python, so the attribute is called 'passed'
    def __init__(self, total=0, passed=0, fail=0):
        self.total = total
        self.passed = passed
        self.fail = fail
Since it looks like each group should link up with a section, you can create a list of dictionaries (or perhaps a list of classes?) with the bits associated with a section:
sections = [{'results' : SectionResult(), 'filename': '/addr.p2/summary.html'}, ....]
Then in the loop:
for section in sections:
    page = input_page + section['filename']
    url_reference = urllib.urlopen(page)
    html_page = url_reference.read()
    m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    section['results'].total = int(m.group(1))
    m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    section['results'].passed = int(m.group(1))
    m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
    section['results'].fail = int(m.group(1))
I would use a dictionary inside a list. Maybe something like:
def ipv6_results(input_page):
    # use string keys; bare names like file_name would be undefined, and pass is a reserved word
    results = [{'file_name': '/spec.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/nd.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/addr.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/pmtu.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0},
               {'file_name': '/icmp.p2/summary.html', 'total': 0, 'pass': 0, 'fail': 0}]
    for r in results:
        url_reference = urllib.urlopen(input_page + r['file_name'])
        html_page = url_reference.read()
        m = re.search(r'TOTAL</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        r['total'] = int(m.group(1))
        m = re.search(r'PASS</B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        r['pass'] = int(m.group(1))
        m = re.search(r'FAIL</FONT></B></TD><TD>:</TD><TD>([0-9,]+)', html_page)
        r['fail'] = int(m.group(1))

    for r in results:
        print r['total']
        print r['pass']
        print r['fail']