Python: collecting links / script values from a page

I am trying to make a program to collect links and some values from a website. It works mostly well but I have come across a page in which it does not work.
With Firebug I can see that this is the HTML code of the elusive "link" (I can't find it when viewing the page's source, though):
<a class="visit" href="/tet?id=12&mv=13&san=221">
221
</a>
and this is the script:
<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>
I'm wondering how to get the "link" ("/tet?id=12&mv=13&san=221") from the HTML code, and the string "221" from either the script or the HTML, using Selenium, mechanize or requests (or some other library).
I have made an unsuccessful attempt at getting it with mechanize using the br.links() function, which collected a number of links from the site, just not the one I am after.
Extra info (this might be important): to get to the page I have to click on a button with this code:
<a id="f33" class="button-flat small selected-no" onclick="qc.pA('visitform', 'f33', 'QClickEvent', '', 'f52'); if ($j('#f44').length == 0) { $j('f44').style.display='inline'; }; $j('#f38').hide();qc.recordControlModification('f38', 'DisplayStyle', 'hide'); document.getElementById('forumpanel').className = 'section-3'; return false;" href="#">
load2
</a>
after which a "new page" loads in a part of the window (but the URL never changes).

I think you pasted the wrong script of yours ;)
I'm not sure what you need exactly - there are at least two different approaches.
Matching all hrefs using regex
Matching specific tags and using get_attribute(...)
For the first one, you have to get the whole HTML source of the page with something like webdriver.page_source and use something like the following regex (you will have to escape either the single or the double quotes!):
<a.+?href=['"](.*?)['"].*?/?>
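A rough sketch of that regex route (assuming webdriver here is the already-initialised Selenium driver instance, as in the rest of this answer):
import re

html = webdriver.page_source
# escape the single quotes so the pattern can live in a single-quoted string
link_pattern = re.compile(r'<a.+?href=[\'"](.*?)[\'"].*?/?>')
hrefs = link_pattern.findall(html)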
If you need the hrefs of all matching links, you could use something similar to webdriver.find_elements_by_css_selector('.visit') (take care to choose find_elements_... instead of find_element_...!) to obtain a list of webelements and iterate through them to get their attributes.
This could result in code like this:
hrefs = []
elements = webdriver.find_elements_by_css_selector('.visit')
for element in elements:
    hrefs.append(element.get_attribute('href'))
Or a one-liner using a list comprehension:
hrefs = [element.get_attribute('href') for element
         in webdriver.find_elements_by_css_selector('.visit')]
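Since the question also asks for the visible string "221", here is a small follow-up sketch that reads both the href and the text of each match (assuming the button click that loads the dynamic content has already happened):
# collect (href, text) pairs, e.g. ('/tet?id=12&mv=13&san=221', '221')
results = [(element.get_attribute('href'), element.text)
           for element in webdriver.find_elements_by_css_selector('.visit')]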

Related

Find a link by href in selenium python

Let's take the example of spotify because I'm listening to music on it right now.
I would like to get the value contained in the href attribute of the following tag.
<a data-testid="nowplaying-track-link" href="/album/3xIwVbGJuAcovYIhzbLO3J">Toosie Slide</a>
What I want is to get "/album/3xIwVbGJuAcovYIhzbLO3J" or if that's not possible, get "Toosie Slide" in order to store it in a variable to compare it with a constant.
The difficulty with Spotify (and many other sites) is that this href tag is present several times on the web page. So I'd like to get only the link that's contained in "nowplaying-track-link" which is a data-testid.
There, I hope I was clear.
PS: I already know the commands like: driver.find_element_by_xpath, etc... but I can't use them in this case...
I'm not sure what you mean about not being able to use commands of that type, but this is how you would get the info you're seeking:
element = driver.find_element_by_css_selector('[data-testid="nowplaying-track-link"]')
href = element.get_attribute('href')
element_text = element.text
if you want to put together the link, you can do it this way:
link = driver.current_url + href
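A slightly fuller sketch that waits for the element to be present before reading it (useful on Spotify's dynamically rendered pages); note that Selenium's get_attribute('href') usually returns the already-resolved absolute URL, so the concatenation above is only needed when the attribute comes back relative:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the now-playing link to appear, then read it
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, '[data-testid="nowplaying-track-link"]')
    )
)
href = element.get_attribute('href')  # usually already an absolute URL
track_name = element.text             # e.g. "Toosie Slide"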

XPath clicking not working at all

Quick info: I'm using Mac OS, Python 3.
I have like 800 links that need to be clicked on a page (and many more pages to go so need automation).
They were hidden because you only see those links when you hover over.
I fixed that by injecting a CSS rule (just saying, in case it's the reason it's not working).
When I try to find elements by XPath, it does not want to click the links afterwards, and it also doesn't always find all of them, just 4 (even when more are displayed in view).
HTML:
(The HTML snippet was shown as a screenshot in the original question; the link's visible text is "Display".)
When I click on "Copy XPath" in the inspector, it gives me:
//*[@id="tiles"]/li[3]/div[2]/ul/li[2]/a
But it doesn't work when I use it like this:
driver.find_elements_by_xpath('//*[@id="tiles"]/li[3]/div[2]/ul/li[2]/a')
So two questions:
How do I get them all?
How do I get it to click on each of them?
The pattern in the XPath is the same, with /li[3] being the only number that changes; for this I created a for loop to generate them all based on the count on the page, which I did successfully.
So if it can be done with the XPaths I generated myself, corresponding to what I get when I copy the XPath in the inspector, then I only need question 2 answered.
PS: this is the HTML of the parent of that first element:
<li onclick="openPopup(event, 'collect', {item_id: 165214})" class="collect" data-item-id="165214">Display</li>
This XPath,
//a[.="Display"]
will select all a links with anchor text equal to "Display".
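Dropped into the Selenium call from the question, that is simply (a minimal sketch):
# all links whose anchor text is exactly "Display"
display_links = driver.find_elements_by_xpath('//a[.="Display"]')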
As per your question, the HTML you have shared and your code attempts, there is no necessity to get the <li> tags. Instead we will get the <a> tags in a list. So to answer your first question, How do I get them all, you can use the following line of code:
all_Display = driver.find_elements_by_xpath("//*[@id='tiles']//li/div[2]/ul/li[@class='collect']/a[@title='Display']")
Next, to click on each of them you have to create a loop to iterate through all the <a> tags as follows:
all_Display = driver.find_elements_by_xpath("//*[@id='tiles']//li/div[2]/ul/li[@class='collect']/a[@title='Display']")
for each_Display in all_Display:
    each_Display.click()
Using an XPath with elements by position is not ideal. Instead use a CSS selector to match the attributes for the targeted elements.
Something like:
all_Display = driver.find_elements_by_css_selector("#tiles li[onclick][data-item-id] a[title]")
You can then click them in a loop if none of them is loading a new page:
for element in all_Display:
    element.click()
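One caveat: if each click mutates the page (for example by opening the popup wired up in the onclick handler), previously located elements can go stale. A defensive sketch that re-locates the list on every pass:
from selenium.common.exceptions import StaleElementReferenceException

selector = "#tiles li[onclick][data-item-id] a[title]"
total = len(driver.find_elements_by_css_selector(selector))
for i in range(total):
    try:
        # re-query each time so DOM changes do not leave a stale handle
        driver.find_elements_by_css_selector(selector)[i].click()
    except StaleElementReferenceException:
        continue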

Want to pull a journal title from an RCSB Page using python & BeautifulSoup

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.
To do this I am using the Python libraries requests and BeautifulSoup. To try and build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+s and saving the HTML to my desktop).
First things to note:
1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48
2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.
3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.
4) Searching through the HTML, one finds the journal title located inside a form here:
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
<p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>
<p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Skjeldal, L.');">Skjeldal, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Gran, L.');">Gran, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Sletten, K.');">Sletten, K.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Volkman, B.F.');">Volkman, B.F.</a></p>
<p>
<b>Journal:</b>
(2002)
<span class="se_journal">Arch.Biochem.Biophys.</span>
<span class="se_journal"><b>399: </b>142-148</span>
</p>
A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".
And so I wrote the following code:
def JournalLookup():
    PDBID = '1K48'
    import requests
    from bs4 import BeautifulSoup
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
Ideally I'd be able to use find instead of findAll as these are the only two in the document, but I used findAll to at least verify I'm getting an empty list. I assumed that it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.
After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the requests doc does not include the lines I want at all.
Does anybody know why this is the case, and what I could possibly do to fix it?
Thanks.
The content you are interested in is generated by JavaScript. It's easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off.
This site will not function correctly without Javascript."
For JavaScript-driven pages, you cannot use Python Requests alone. There are some alternatives, one being dryscrape.
PS: Do not import libraries/modules within a function. Python does not recommend it, and PEP 8 says that:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it's not the recommended way to do it.
The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:
my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza'), or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
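A short sketch putting those pieces together (the function and field names are the ones mentioned above; they may differ between PyPDB versions, and not every entry carries journal metadata):
from pypdb import describe_pdb, get_all_info

my_desc = describe_pdb('4lza')
print(my_desc.get('title'))
print(my_desc.get('citation_authors'))

# fall back to the broader query if the short description lacks the journal
all_info = get_all_info('4lza')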

Python 2.7 Beautiful Soup - parsing a list of links

I am trying to parse all of the links on this page which have an identical hierarchy. I am not getting any traceback, but not getting the data either.
I am trying to get the href tag from the highlighted portion of code:
My current code is:
def link_parser(soup, itemsList):
    for item in soup.findAll("div", {"class": "tileInfo"}):
        for link in item.findAll("a", {"class": "productClick productTitle"}):
            try:
                itemsList.put(removeNonAscii(html_parser.unescape(link.string)).replace(',', ' ') + "," + clean_a_url(link['href']))
            except Exception:
                print "Formatting error: "
                traceback.print_exc(file=sys.stdout)
                return ""
It looks like you were trying to scrape Target's website - perhaps this page.
You've encountered one of the fundamental difficulties with web page scraping - what you see is not always what you get. In this case they are AJAXing in a bunch of content after you load the page. Notice the little pinwheel animation when you first load the page - the content you were trying to access simply does not exist in the DOM until all the various js scripts they've got on that page run. (and they've got a whole lot of them)
I clicked through a bit and it looks like the code responsible for generating that content is this bit of jquery:
<script id="productTitleTmpl" type="text/x-jquery-tmpl" >
{{if $item.parent.parent.viewType != "details"}}
{{tmpl($data.itemAttributes) "#productBrandTmpl"}}
{{/if}}
<a class="productClick productTitle" id="prodTitle-{{= $item.parent.parent.viewType}}-{{= $item.parent.parent.currentPageNumber}}-{{= $item.parent.productCounter}}" href="/{{= productDetailPageURL}}#prodSlot={{= $item.parent.parent.viewType}}_{{= $item.parent.parent.currentPageNumber}}_{{= $item.parent.productCounter}}" title="{{= title}}" name="prodTitle_{{= $item.catalogEntryId}}">
{{= $item.parent.parent.fetchProductTitleForView($item.productTitle)}}
</a>
So, anyway. If you really are dead-set on scraping this page, you will need to ditch urllib (or whatever you were using to fetch the html). Instead visit this page with a javascript-enabled headless browser (like selenium), let the javascript run, and then scrape it. All of that is outside the realm of this answer but you can google around for various headless browser solutions and find one that works for you.
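As a hedged illustration of that route, here is a minimal Selenium-plus-BeautifulSoup sketch (the URL is a placeholder for the listing page from the question, and the fixed sleep is a crude stand-in for a proper explicit wait):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

page_url = "..."  # the Target listing page from the question

driver = webdriver.Firefox()
driver.get(page_url)
time.sleep(5)  # let the AJAX templates render before grabbing the DOM
soup = BeautifulSoup(driver.page_source)
for link in soup.findAll("a", {"class": "productClick productTitle"}):
    print link.get("href")
driver.quit()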

Select all anchor tags with an href attribute that contains one of multiple values via xpath in lxml / Python

I need to automatically scan lots of html documents for ad banners that are surrounded by an anchor tag, e.g.:
<a href="http://ad_network.com/abc.html">
<img src="ad_banner.jpg">
</a>
As a newbie with xpath, I can select such anchors via lxml like so:
import lxml.html

text = '''
<a href="http://ad_network.com/abc.html">
<img src="ad_banner.jpg">
</a>'''
root = lxml.html.fromstring(text)
print root.xpath('//a[contains(@href, "ad_network.") or contains(@href, "other_ad_network.")][descendant::img]')
In the example I check for two different domains: "ad_network." and "other_ad_network.". However, there are over 25 domains to check, and the XPath expression would get terribly long by connecting all those contains directives with "or". I also fear the expression would be pretty inefficient in terms of CPU resources. Is there some syntax for checking multiple "contains" values?
I could also get the links concerned via a regex in a single line of code. Yet, although the HTML code is normalized by lxml, regex never seems to be a good choice for that kind of work... Any help appreciated!
It might not be that bad just to do a bunch of 'or's. Build the xpath with python so that you don't get writer's cramp and then precompile it. The actual xpath code is in libxml and should be fast.
from lxml import etree

sites = ['aaa', 'bbb']
contains = ' or '.join('contains(@href, "%s")' % site for site in sites)
anchor_xpath = etree.XPath('//a[%s][descendant::img]' % contains)
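Applied to the question's example, a brief usage sketch (reusing the root element parsed there):
ad_anchors = anchor_xpath(root)  # the precompiled expression can be reused across documents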
