I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:
http://www.ebi.ac.uk/ena/data/view/ERS019623
I have thousands of assemblies that I want to track down and extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies is of the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies so I can construct the URL for each one.
I've tried a few different methods. Firstly, if you add "&display=XML" to the end of the URL it prints the XML (or at least I'm presuming it's the XML for the entire page; the problem is that the study accession "PRJ******" is nowhere to be seen there). I had used this to extract another code that I needed from the same webpage, the run accession, which is always of the format "ERR******", using the code below:
import urllib2
from bs4 import BeautifulSoup
import re
import csv

with open('/Users/bj5/Desktop/web_scrape_test.csv', 'rb') as f:
    reader = csv.reader(f)  # opens csv containing the list of ERS numbers
    for row in reader:
        sample = row[0]  # reads index 0 (1st column of each row)
        ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml"  # creates URL using the ERS number
        page = urllib2.urlopen(ERSpage)  # opens the URL and assigns it to variable page
        soup = BeautifulSoup(page, "html.parser")  # parses the html/xml from page into variable soup
        page_text = soup.text  # returns the text from soup, i.e. no tags
        ERS = re.search('ERS......', page_text, flags=0).group(0)  # returns first ERS followed by six wildcards
        ERR = re.search('ERR......', page_text, flags=0).group(0)  # returns first ERR followed by six wildcards
        print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample  # prints ERS,ERR,URL
This worked very well, but as the study accession is not in the XML, I can't use this approach to extract it.
I also attempted to use BeautifulSoup again to download the HTML by doing this:
from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"

def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print soup

get_category_links(BASE_URL)
But again I can't see the study accession in the output from this either...
I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.
When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.
So my question is this: what is the code that I'm looking at in Inspect Element, XML or HTML (or something else)? Why does it look different from the code that prints in my console when I try to use BeautifulSoup to parse the HTML? And finally, how can I scrape the study accessions (PRJ******) from these webpages?
(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)
from bs4 import BeautifulSoup
import requests
import re

r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')

ERS = soup.find('primary_id').text  # the PRIMARY_ID element holds the ERS accession (lxml lowercases tag names)
ERR = soup.find('id', text=re.compile(r'^ERR')).text  # first ID element whose text starts with ERR
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)
print(ERS, ERR, url)
out:
ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623
bs4 can parse XML files too; just treat them like HTML, it's all the same to the parser, so there is no need to use regex to extract the info.
I found a TEXT download link:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt
This link's fields can be changed to get the data you want, like this:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession&download=txt
By doing so, you can get all your data in a text file.
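For example, a minimal sketch that loops over a list of ERS accessions and pulls the study accession out of each report (the helper name is mine, and it assumes the first line of the response is the header row):

import requests

BASE = 'http://www.ebi.ac.uk/ena/data/warehouse/filereport'

def get_study_accession(ers):  # hypothetical helper, not part of the ENA API
    params = {'accession': ers, 'result': 'read_run',
              'fields': 'study_accession', 'download': 'txt'}
    r = requests.get(BASE, params=params)
    lines = r.text.strip().splitlines()
    # line 0 is the header ('study_accession'); the values follow
    return lines[1] if len(lines) > 1 else None

for ers in ['ERS019623']:  # replace with your full list of ERS numbers
    print(ers, get_study_accession(ers))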
In your sample, soup is a BeautifulSoup object: a representation of the parsed document.
If you want to print the entire HTML of the document, you can call print(soup.prettify()); if you want only the text within it, print(soup.get_text()).
The soup object has other ways to access the parts of the document you are interested in: navigating the parsed tree, searching in it, and so on.
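For example, a tiny sketch of those possibilities on a toy document:

from bs4 import BeautifulSoup

html = '<html><body><h1>Title</h1><p class="intro">Hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)                     # navigate by tag name -> 'Title'
print(soup.find('p', class_='intro'))   # search by tag and class
print(soup.select_one('p.intro').text)  # search with a CSS selector -> 'Hello'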
Related
I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from jQuery. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06", but I don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup

link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value > b').text
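Wrapped in a full request, that would look something like this (assuming the div.field-redshift element is present in the static HTML, as it appears to be on this page):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx').text
soup = BeautifulSoup(html, 'html.parser')

# the redshift value sits in a <b> inside div.field-redshift > div.value
print(soup.select_one('div.field-redshift > div.value > b').text)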
If you view the Page Source of the URL, you will find that there are two script elements that contain CDATA. But the script element you are interested in also has jQuery in it, so you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of the CDATA tags and the jQuery wrapper. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json

page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')

scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        # strip the CDATA wrapper and the jQuery.extend(...) call, leaving bare JSON
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break

jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
I am trying to scrape data to get the text I need. I want to find the line that says Aberdeen and all the lines after it, which contain the airport info. Here is a pic of the HTML hierarchy:
I am trying to locate the text elements inside the class "i1" with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
But I am not getting the values I expect at all. Here is a link to the data if curious. I am new to scraping obviously.
The problem is your BeautifulSoup parser: html.parser and lxml can build different trees from messy markup, and lxml handles this page better:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
If what you want is the text elements, you can use:
soup.get_text()
Note: this will give you all the text elements.
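For example, a small sketch (separator and strip are standard get_text() options):

from bs4 import BeautifulSoup

html = '<div class="i1">Aberdeen (ABZ)<br>Abilene (ABI)</div>'
soup = BeautifulSoup(html, 'lxml')

# one line per text node, whitespace trimmed
print(soup.get_text(separator='\n', strip=True))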
Why are people suggesting Selenium? This page doesn't load its data dynamically... requests + re is all you need; you don't even need BeautifulSoup:
import re
import requests

data = requests.get('http://www.airportcodes.org/').text
cities_and_codes = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", data)
Just look for a run of letters (also allowing commas and spaces) followed by exactly 3 uppercase letters in parentheses.
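Continuing from that snippet, a quick sanity check that prints the first few (city, code) pairs; the exact matches depend on the live page:

for city, code in cities_and_codes[:5]:  # first few matches only
    print(city.strip(), code)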
A small disclaimer for all: this is my first programming language and I am still getting used to it, so any suggestions are welcome.
The problem that was given is as follows:
Scraping Numbers from HTML using BeautifulSoup. In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute the sum of the numbers in the file.
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.
Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
Actual data: http://py4e-data.dr-chuck.net/comments_97465.html (Sum ends with 19)
You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.
Data Format
The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>
You are to find all the <span> tags in the file, pull out the numbers from each tag, and sum the numbers.
Look at the sample code provided. It shows how to find all of a certain kind of tag, loop through the tags and extract the various aspects of the tags.
...
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
You need to adjust this code to look for span tags, pull out the text content of each span tag, convert it to an integer, and add them up to complete the assignment.
I have written this:
import urllib
from bs4 import BeautifulSoup
page = urllib.urlopen(input("Enter URL: "))
soup = BeautifulSoup(page, "html.parser")
spans = soup('span')
numbers = []
for span in spans:
    numbers.append(int(span.string))
print(sum(numbers))
The error message says that bs4 is not a module (even though it is installed), and the program is neither asking me for the URL nor giving me any output. I have no clue how to solve it.
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter -')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

numbers = []
spans = soup('span')
for span in spans:
    numbers.append(int(span.string))
print(sum(numbers))
Use this code; it works for me.
I need to check a webpage search results and compare them to user input.
import urllib
from bs4 import BeautifulSoup

ui = raw_input()  # for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica = urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty
Since there are 1502 results, my idea was to change the k=10 to k=1502. Now I need some kind of function to check whether the search results contain my user input. I know that my names are the text after TEXT, so how do I do that? Maybe using regex?
The second part is: if there are matching results, how to get the link of the result. Again, I know that the link is inside that href="", but how do I get it out and make it usable?
Finding if Niels Bohr is listed is as easy as using a large batch number and loading the resulting page:
import sys
import urllib2
from bs4 import BeautifulSoup

url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

for link in soup.find_all(class_='AllWordsTextHit', text=name):
    print link
This produces any links that contain the text 'Bohr, Niels' as the link text. You can use a regular expression if you need a partial match.
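For example, a small sketch of a partial match (text= accepts a compiled pattern in bs4):

import re

# match any hit whose link text contains 'Bohr', not just an exact match
for link in soup.find_all(class_='AllWordsTextHit', text=re.compile(u'Bohr')):
    print link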
The link object has a (relative) href attribute you can then use to load the next page:
professor_page = 'http://www.enciklopedija.hr/' + link['href']
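Continuing from there, a sketch of fetching and parsing that page (printing the title is just a hypothetical sanity check):

# fetch the professor's page and parse it like the results page
prof_soup = BeautifulSoup(urllib2.urlopen(professor_page).read())
print prof_soup.title.string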
I am using python and beautifulsoup for html parsing.
I am using the following code :
from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True):
    print a[href]
but I am not getting output links like :
http://www.wikipathways.org/index.php/Pathway:WP26
Also, importantly, there are 107 pathways, but I will not get all the links, as the remaining ones depend on the "show links" button at the bottom of the page.
So, how can I get all 107 links from that URL?
Your problem is line 8, content = url.read(). You're not actually reading the webpage there; url is just a string, so you're doing nothing useful (if anything, you should be getting an error).
main_url is what you want to read, so change line 8 to:
content = main_url.read()
You also have another error, print a[href]. href should be a string, so it should be:
print a['href']
I would suggest using lxml; it's faster and better for parsing HTML, and worth investing the time to learn.
from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')
That should get you going.
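To pull out the hrefs and make them absolute, a short sketch (make_links_absolute and the a[href] selector are standard lxml/cssselect features; the Pathway: filter simply matches the URLs shown in the question):

# make every link absolute, then collect the href of each anchor
dom.make_links_absolute('http://www.wikipathways.org/')
hrefs = [a.get('href') for a in dom.cssselect('a[href]')]
for h in hrefs:
    if '/index.php/Pathway:' in h:  # keep only the pathway links
        print(h)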