Bioinformatics: Programmatic Access to the BacDive Database - Python

The resource at BacDive (http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information, and parameters such as optimal growth temperatures.
I have a scenario in which I have a set of organism names in a plain text file, and I would like to programmatically search them one by one against the BacDive database (which does not offer a flat-file download), retrieve the relevant information, and populate my text file accordingly.
What are the main modules (such as BeautifulSoup) that I would need to accomplish this? Is it straightforward? Is it allowed to programmatically access web pages? Do I need permission?
An example bacterium name would be "Pseudomonas putida". Searching for it gives 60 hits on BacDive. Clicking one of the hits takes us to the strain-specific page, where the line "Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C" is the most important.
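For instance, pulling the number out of a line like the one quoted above can be done with a short regular expression; a minimal sketch (the exact wording on the page may vary):

import re

line = 'Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C'
match = re.search(r'Recommended growth temperature\s*:\s*(\d+)', line)
if match:
    print(int(match.group(1)))  # 26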
The script would have to access BacDive (which I have tried accessing using requests, but I feel they do not allow programmatic access; I asked a moderator about this, and they said I should register for their API first).
I now have API access. This is the page: http://www.bacdive.dsmz.de/api/bacdive/. This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.
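Since the API returns structured data rather than HTML, one possible starting point is the requests module. The sketch below is only an illustration: the endpoint path, the credential handling, and the response structure are assumptions to be checked against the API documentation linked above, not a confirmed BacDive interface.

import requests

# Placeholders: use the credentials obtained when registering for the API.
USER, PASSWORD = 'your-login', 'your-password'
BASE = 'http://bacdive.dsmz.de/api/bacdive/'

# Hypothetical endpoint; take the real path from the API documentation page.
resp = requests.get(BASE + 'taxon/Pseudomonas/putida/',
                    auth=(USER, PASSWORD),
                    headers={'Accept': 'application/json'})
resp.raise_for_status()
data = resp.json()
print(data)  # inspect the structure, then drill down to the growth-temperature field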

Here is the solution...
import re
import urllib
from bs4 import BeautifulSoup

def get_growth_temp(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    no_hits = int(map(float, re.findall(r'[+-]?[0-9]+', str(soup.find_all("span", class_="searchresultlayerhits"))))[0])
    if no_hits > 1:
        letters = soup.find_all("li", class_="searchresultrow1") + soup.find_all("li", class_="searchresultrow2")
        all_urls = []
        for i in letters:
            all_urls.append('http://bacdive.dsmz.de/index.php' + i.a["href"])
        max_temp = []
        for ind_url in all_urls:
            soup = BeautifulSoup(urllib.urlopen(ind_url).read())
            a = soup.body.findAll(text=re.compile('Recommended growth temperature :'))
            if a:
                max_temp.append(int(map(float, re.findall(r'[+-]?[0-9]+', str(a)))[0]))
        print "Recommended growth temperature : %d °C:\t" % max(max_temp)

url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'

if __name__ == "__main__":
    # To open a file and then iterate through the URLs/bacteria:
    # with open('file.txt', 'rU') as f:
    #     for url in f:
    #         get_growth_temp(url)
    get_growth_temp(url)
Edit:
Here I am passing a single URL. If you want to pass multiple URLs and get their growth temperatures, call the function for each URL after opening your file; that code is commented out above.
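To make that concrete, a minimal driver in the same Python 2 style might look like the following (the file name organisms.txt is just a placeholder, with one organism name per line):

import urllib

with open('organisms.txt', 'rU') as f:
    for name in f:
        name = name.strip()
        if not name:
            continue
        # build the BacDive search URL for this organism and reuse get_growth_temp()
        search_url = 'http://bacdive.dsmz.de/index.php?search=' + urllib.quote_plus(name)
        get_growth_temp(search_url)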
Hope it helped you..
Thanks

Related

Beautifulsoup unable to get data from mds-data-table from morningstar

I'm trying to get the dividend information from Morningstar.
The following code works for scraping info from Finviz, but the dividend information there is not the same as on my broker platform.
import urllib3
from bs4 import BeautifulSoup

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'
http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')
html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    test = list(L.children)
    if show:
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return test

test = display_elements(html, 1)
I have no issue printing out the elements but cannot find the element that houses the information such as "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?
Great question! I've actually worked on this specifically, but years ago. Morningstar only loads the tables after running a script, to prevent this exact type of scraping behavior. If you view the page source immediately on load, you won't be able to see any of that HTML.
What you're going to want to do is find the JavaScript code that is loading the elements and point bs4 at whatever that code fetches. You'll have to poke around the files, but somewhere deep in those JS files you'll find a dynamic URL. It'll be hidden, but it'll be in there somewhere. I'll go look at some of my old code and see if I can find something that helps.
So here's an edited sample of what used to work for me:
import logging
import time
from urllib.request import urlopen

exchange = 'NYSE'
ticker = 'V'

if exchange == 'NYSE':
    exchange_code = "XNYS"
elif exchange in ["NasdaqNM", "NASDAQ"]:
    exchange_code = "XNAS"
else:
    logging.info("Unknown Exchange Code for {}".format(ticker))
    raise SystemExit  # bail out if the exchange is unknown

time_now = int(time.time())
time_delay = int(time.time() + 150)
morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
print(morningstar_raw)
Granted, this solution is from a file last edited sometime in 2018, and they may have changed up their scripting, but you can find this and much more in my GitHub project wxStocks.
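If digging through the JavaScript for the hidden URL proves brittle, another common approach (not part of the original answer) is to let a real browser render the page and then hand the resulting HTML to BeautifulSoup. A rough sketch with Selenium, assuming chromedriver is installed and that the table's class still contains mds-data-table:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.morningstar.com/stocks/xnys/bxs/dividends'
driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get(url)
time.sleep(5)  # crude wait for the JavaScript to populate the tables
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# The selector is an assumption based on the class name mentioned in the question.
for cell in soup.select('table[class*="mds-data-table"] td'):
    print(cell.get_text(strip=True))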

How does one scrape all the products from a random website?

I tried to get all the products from this website, but somehow I don't think I chose the best method, because some of them are missing and I can't figure out why. It's not the first time I've gotten stuck on something like this.
The way I'm doing it now is like this:
go to the index page of the website
get all the categories from there (A-Z 0-9)
access each of the above categories and recursively go through all the subcategories from there until I reach the products page
when I reach the products page, check if the product has more SKUs. If it has, get the links. Otherwise, that's the only SKU.
Now, the code below works, but it just doesn't get all the products and I don't see any reason why it would skip some. Maybe the way I approached everything is wrong.
from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep

from requests import Session

INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()

def retry(link):
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)

def get_category_sections():
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au

def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()
    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)

def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)
    for link in page.xpath(
            '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link
    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)

def check_if_more_products(tree):
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods

def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)

if __name__ == '__main__':
    main()
Now, the question might be too generic but I wonder if there's a rule of thumb to follow when someone wants to get all the data (products, in this case) from a website. Could someone please walk me through the whole process of discovering what's the best way to approach a scenario like this?
If your ultimate goal is to scrape the entire product listing for each category, it may make sense to target the full product listings for each category on the index page. This program uses BeautifulSoup to find each category on the index page and then iterates over each product page under each category. The final output is a list of namedtuples storing each category name with the current page link and the full product titles for each link:
url = "https://www.richelieu.com/us/en/index"

import urllib
import re
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import itertools

s = soup(str(urllib.urlopen(url).read()), 'lxml')
blocks = s.find_all('div', {'id': re.compile('index\-[A-Z]')})
results_data = {[c.text for c in i.find_all('h2', {'class': 'h1'})][0]: [b['href'] for b in i.find_all('a', href=True)] for i in blocks}
final_data = []
category = namedtuple('category', 'abbr, link, products')
for category1, links in results_data.items():
    for link in links:
        page_data = str(urllib.urlopen(link).read())
        print "link: ", link
        page_links = re.findall(';page\=(.*?)#results">(.*?)</a>', page_data)
        if not page_links:
            final_page_data = soup(page_data, 'lxml')
            final_titles = [i.text for i in final_page_data.find_all('h3', {'class': 'itemHeading'})]
            new_category = category(category1, link, final_titles)
            final_data.append(new_category)
        else:
            page_numbers = set(itertools.chain(*list(map(list, page_links))))
            full_page_links = ["{}?imgMode=m&sort=&nbPerPage=48&page={}#results".format(link, num) for num in page_numbers]
            for page_result in full_page_links:
                new_page_data = soup(str(urllib.urlopen(page_result).read()), 'lxml')
                final_titles = [i.text for i in new_page_data.find_all('h3', {'class': 'itemHeading'})]
                new_category = category(category1, link, final_titles)
                final_data.append(new_category)

print final_data
The output will garner results in the format:
[category(abbr=u'A', link='https://www.richelieu.com/us/en/category/tools-and-shop-supplies/workshop-accessories/tool-accessories/sander-accessories/1058847', products=[u'Replacement Plate for MKT9924DB Belt Sander', u'Non-Grip Vacuum Pads', u'Sandpaper Belt 2\xbd " x 14" for Compact Belt Sander PC371 or PC371K', u'Stick-on Non-Vacuum Pads', u'5" Non-Vacuum Disc Pad Hook-Face', u'Sanding Filter Bag', u'Grip-on Vacuum Pads', u'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x 10.79 cm (3" x 4-1/4")', u'4" Abrasive for Finishing Tool', u'Sander Backing Pad for RO 150 Sander', u'StickFix Sander Pad for ETS 125 Sander', u'Sub-Base Pad for Stocked Sanders', u'(5") Non-Vacuum Disc Pad Vinyl-Face', u'Replacement Sub-Base Pads for Stocked Sanders', u"5'' Multi-Hole Non-Vaccum Pad", u'Sander Backing Pad for RO 90 DX Sander', u'Converting Sanding Pad', u'Stick-On Vacuum Pads', u'Replacement "Stik It" Sub Base', u'Drum Sander/Planer Sandpaper'])....
To access each attribute, call like so:
categories = [i.abbr for i in final_data]
links = [i.link for i in final_data]
products = [i.products for i in final_data]
I believe the benefit of using BeautifulSoup in this instance is that it provides a higher level of control over the scraping and is easily modified. For instance, should the OP change his mind regarding which facets of the product/index he would like to scrape, only simple changes to the find_all parameters should be needed, since the general structure of the code above centers around each product category from the index page.
First of all, there is no definite answer to your generic question of how one would know whether the data one has already scraped is all the available data. That is at least web-site specific and is rarely actually revealed. Plus, the data itself might be highly dynamic. On this web-site, though, you can more or less use the product counters to verify the number of results found.
Your best bet here would be to debug: use the logging module to print out information while scraping, then analyze the logs and look for why a product was missed and what caused that (a logging sketch follows at the end of this answer).
Some of the ideas I currently have:
could it be that retry() is the problematic part? Could it be that session_.get(link).text does not raise an error but also does not contain the actual data in the response?
I think the way you extract category links is correct, and I don't see you missing categories on the index page
dig_up_products() is questionable: when you extract links to the subcategories, you have the carouselSegment2b id used in the XPath expression, but I see that on at least some of the pages (like this one) the id value is carouselSegment1b. In any case, I would probably use just //h2[contains(., "CATEGORIES")]/following-sibling::div//li//a/@href here
I also don't like that the imgWrapper class is used to find a product link (could it be that products missing images are skipped?). Why not just //ul[@id="prodResult"]/li//a/@href - this would bring in some duplicates, which you can address separately. But you can also look for the link in the "info" section of the product container: //ul[@id="prodResult"]/li//div[contains(@class, "infoBox")]//a/@href.
There can also be an anti-bot, anti-web-scraping strategy deployed that may temporarily ban your IP and/or User-Agent, or even obfuscate the response. Check for that too.
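As a concrete version of the logging suggestion above, here is a hedged variant of the question's retry() that records every fetch (the log file name and format are arbitrary; the rest of the script stays unchanged):

import logging
from random import randint
from time import sleep

from requests import Session

logging.basicConfig(filename='scrape.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
session_ = Session()

def retry(link):
    wait = randint(0, 10)
    try:
        response = session_.get(link)
        # record the status and payload size of every request for later analysis
        logging.info('GET %s -> %s (%d bytes)', link, response.status_code, len(response.text))
        return response.text
    except Exception as e:
        logging.warning('Retrying %s in %s seconds because: %s', link, wait, e)
        sleep(wait)
        return retry(link)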
As pointed out by @mzjn and @alecxe, some websites employ anti-scraping measures. To hide their intentions, scrapers should try to mimic a human visitor.
One particular way for websites to detect a scraper is to measure the time between subsequent page requests, which is why scrapers typically keep a (random) delay between requests.
Besides, hammering a web server that is not yours without giving it some slack is not considered good netiquette.
From Scrapy's documentation:
RANDOMIZE_DOWNLOAD_DELAY
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same used by wget --random-wait option.
If DOWNLOAD_DELAY is zero (default) this option has no effect.
Oh, and make sure the User-Agent string in your HTTP request resembles that of an ordinary web browser.
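A minimal request-level sketch of both points, independent of Scrapy (the delay value and the User-Agent string are arbitrary examples):

import random
import time

import requests

# Any current browser's User-Agent string will do here.
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/100.0 Safari/537.36'}

def polite_get(url, base_delay=2.0):
    # Random delay between 0.5x and 1.5x the base delay, mirroring Scrapy's default policy.
    time.sleep(base_delay * random.uniform(0.5, 1.5))
    return requests.get(url, headers=HEADERS, timeout=30)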
Further reading:
https://exposingtheinvisible.org/guides/scraping/
Sites not accepting wget user agent header

How to retrieve google URL from search query

So I'm trying to create a Python script that will take a search term or query, then search Google for that term. It should then return 5 URLs from the results of the search.
I spent many hours trying to get PyGoogle to work, but later found out that Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point.
So my question here is... What would be the most compact/simple way of doing this?
I would like to do this entirely in Python.
Thanks for any help
Use BeautifulSoup and requests to get the links from the google search results
import requests
from bs4 import BeautifulSoup
keyword = "Facebook" #enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div',{'id':'search'})
url = container.find("cite").text
print(url)
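Note that Google often serves different markup to clients that do not send a browser-like User-Agent, so the cite element above may be missing. A hedged variant that sends one (the header string is just an example):

import requests
from bs4 import BeautifulSoup

keyword = "Facebook"
search = "https://www.google.co.uk/search?q=" + keyword
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/100.0 Safari/537.36'}
r = requests.get(search, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div', {'id': 'search'})
if container and container.find("cite"):
    print(container.find("cite").text)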
What issues are you having with pygoogle? I know it is no longer supported, but I've utilized that project on many occasions and it would work fine for the menial task you have described.
Your question did make me curious though--so I went to Google and typed "python google search". Bam, found this repository. Installed with pip and within 5 minutes of browsing their documentation got what you asked:
import google

for url in google.search("red sox", num=5, stop=1):
    print(url)
Maybe try a little harder next time, ok?
Here is a link to the xgoogle library, which does the same thing.
I tried something similar to get the top 10 links; it also counts occurrences of the search word in each linked page. I have added the code snippet for your reference:
import operator
import urllib
# This line will import the GoogleSearch and SearchError classes from xgoogle/search.py
from xgoogle.search import GoogleSearch, SearchError

my_dict = {}
print "Enter the word to be searched : "
# read user input
yourword = raw_input()
try:
    # This will perform a Google search on our keyword
    gs = GoogleSearch(yourword)
    gs.results_per_page = 80
    # get the Google search results
    results = gs.get_results()
    source = ''
    # loop through all results to get each link and its content
    for res in results:
        # print res.url.encode('utf8')
        # this will give the URL
        parsedurl = res.url.encode("utf8")
        myurl = urllib.urlopen(parsedurl)
        # the line above opens the URL; below we read the content of that web page
        source = myurl.read()
        # This line will count occurrences of the entered keyword in our web page
        count = source.count(yourword)
        # We store our result in a dictionary: for each URL, we store its word count
        my_dict[parsedurl] = count
except SearchError, e:
    print "Search failed: %s" % e

print my_dict
# sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    print(key, my_dict[key])

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the URLs also look like this, meaning that there are multiple, separately listed statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(statute)
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
        break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your HTML. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch page content here; it's just a recommended and very easy-to-use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could possibly be similarly accommodating.

How can I use Google Map Feed API to get a list of my Google Maps using Python?

I want to create a script in Python which downloads the current KML files of all the Maps I created on Google Maps.
To do so manually, I can use this:
http://maps.google.com.br/maps/ms?msid=USER_ID.MAP_ID&msa=0&output=kml
where USER_ID is a constant number Google uses to identify me, and MAP_ID is the individual map identifier generated by the link icon on top-right corner.
This is not very straightforward, because I have to manually browse "My Places" page on Google Maps, and get the links one by one.
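For illustration, once a USER_ID/MAP_ID pair is known, that manual pattern can be scripted directly. A small sketch in the same Python 2 style as the answers below (the two IDs are placeholders; whether the msid URL still works today is another matter):

import urllib

USER_ID = 'USER_ID'  # placeholder: the constant number Google uses to identify the account
MAP_ID = 'MAP_ID'    # placeholder: the per-map identifier from the link icon

kml_url = ('http://maps.google.com.br/maps/ms?msid=%s.%s&msa=0&output=kml'
           % (USER_ID, MAP_ID))
urllib.urlretrieve(kml_url, MAP_ID + '.kml')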
From Google Maps API HTTP Protocol Reference:
The Map Feed is a feed of user-created maps.
This feed's full GET URI is:
http://maps.google.com/maps/feeds/maps/default/full
This feed returns a list of all maps for the authenticated user.
Note: the page says this service is no longer available, so I wonder if there is a way to do the same thing at present.
So, the question is: is there a way to get/download the list of MAP_IDs of all my maps, preferably using Python?
Thanks for reading
The correct answer to this question involves using the Google Maps Data API (HTML interface), which, by the way, is deprecated but still solves my need in a more official way, or at least one more convincing than parsing a web page. Here it goes:
# coding: utf-8
import urllib2, urllib, re, getpass

username = 'heltonbiker'
senha = getpass.getpass('Password for user ' + username + ':')

dic = {
    'accountType': 'GOOGLE',
    'Email': (username + '@gmail.com'),
    'Passwd': senha,
    'service': 'local',
    'source': 'helton-mapper-1'
}

url = 'https://www.google.com/accounts/ClientLogin?' + urllib.urlencode(dic)
output = urllib2.urlopen(url).read()
authid = output.strip().split('\n')[-1].split('=')[-1]

request = urllib2.Request('http://maps.google.com/maps/feeds/maps/default/full')
request.add_header('Authorization', 'GoogleLogin auth=%s' % authid)
source = urllib2.urlopen(request).read()

for link in re.findall('<link rel=.alternate. type=.text/html. href=((.)[^\1]*?)>', source):
    s = link[0]
    if 'msa=0' in s:
        print s
I arrived with this solution with a bunch of other questions in SO, and a lot of people helped me a lot, so I hope this code might help anyone else trying to do so in the future.
A quick and dirty way I have found, which skips the Google Maps API completely and perhaps might break in the near future, is this:
# coding: utf-8
import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-2]
            print shown, mapid, mapname
            shown += 1
            # uncomment if you want to download the KML files:
            # urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
            #                    '&msa=0&output=kml', mapname + '.kml')
    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break
Of course it only prints a numbered list, but from there, saving a dictionary and/or automating the KML download via the &output=kml URL trick follows naturally.
