I am trying to scrape a bunch of links (or rather, paths that can be appended to the root domain to form a link) from https://www.media.mit.edu/groups
The HTML itself looks like this:
<div class="container-item listing-layout-item selectorgadget_selected" data-href="/groups/viral-communications/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/social-machines/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/space-enabled/overview/" '="">
The link data is stored within the data-href part, and I have been trying to use CSS selectors to get this data.
In the Scrapy shell I have been trying
response.css('.data-href::text').extract()
but it returns an empty list.
Any suggestions would be greatly appreciated!
data-href is an attribute, not text content, so the ::text pseudo-element will not match it. Try
response.xpath('//div/@data-href').extract()
to get the required values.
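For context, here is a minimal sketch of how those paths could be turned into absolute links inside a Scrapy callback (the div.container-item selector comes from the HTML above; the parse callback name and the yielded dict are only assumptions for illustration):
def parse(self, response):
    # Grab the data-href attribute from every listing item
    paths = response.css('div.container-item::attr(data-href)').getall()
    for path in paths:
        # e.g. '/groups/viral-communications/overview/' -> 'https://www.media.mit.edu/groups/viral-communications/overview/'
        yield {'url': response.urljoin(path)}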
One website I am trying to scrape has a specific structure for prices. It is something like:
<span class="sale-price" data-sup-product-price="" data-item-price="2.02" ...>
2,
<sup>02 E</sup>
</span>
Is it possible to directly access the data-item-price data nested in the span?
I mean, not something like:
response.css("span.sale-price").extract()
but another way, using data-item-price?
Try response.css("span.sale-price::attr(data-item-price)").get() to get the data from this field. Or, if you want all spans that have this field, use the selector span[data-item-price].
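Putting both together, a short sketch (the float conversion is just an assumption about how the value might be used):
raw_prices = response.css('span[data-item-price]::attr(data-item-price)').getall()
prices = [float(p) for p in raw_prices]  # e.g. ['2.02'] -> [2.02]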
First of all thank you if you are reading this.
I have been scraping away for some time to collect minor data; however, I want to pull in some additional information and I got stuck on pagination.
I would like to get the data-href of the link, but the match needs to depend on what the link contains.
I have been using [contains()] on attributes, but how do you get the data-href when the <a> needs to contain an element with a specific class?
<li><a class="cursor" data-type="js" data-href="test"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
I have been using the following:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[contains(@class,"cursor")]/@data-href').extract_first()
which works but not for the correct data-href
Many thanks for the help
Full source code:
<div class="pagination-container margin-bottom-20"> <div class="text-center"><ul class="pagination"><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li><li>1</li><li class="active"><a>2</a></li><li>3</li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs">12</li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs">22</li><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li></ul></div> </div> </div>
Huh... Turned out to be such a simple case (:
Your mistake is .extract_first(); you need to extract the last item to get the next page.
next_page = response.xpath('//a[@class="cursor"]/@data-href').extract()[-1]
This will do the trick. But I'd recommend extracting all the links from the pagination list, since Scrapy handles duplicate filtering for you. That does a better job and leaves less room for mistakes:
pages = response.xpath('//ul[@class="pagination"]//a/@data-href').extract()
for url in pages:
    yield scrapy.Request(url=response.urljoin(url), callback=self.whatever)
And so on..
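If you only want the "next" arrow rather than the last item, you can also key off the child <i> class from your snippet. A sketch, assuming this sits inside your spider's parse method (the callback name is a placeholder):
next_href = response.xpath('//ul[@class="pagination"]//a[i[contains(@class, "fa-chevron-right")]]/@data-href').extract_first()
if next_href:
    yield scrapy.Request(url=response.urljoin(next_href), callback=self.parse)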
Try with that:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[@class="cursor"]/@data-href').extract_first()
I'd suggest you make sure that your element exists in the initial HTML first:
just press Ctrl+U in Chrome and then Ctrl+F to find the element.
If the element can be found there, something's wrong with your XPath selector.
Otherwise the element is generated by JavaScript and you have to use another way to get the data.
PS: You shouldn't use the Chrome DevTools "Elements" tab to check whether the element exists, because that tab shows the DOM after the JS code has already been applied. So check the source only (Ctrl+U).
I am trying to write a Python script to scrape data from a webpage. However, I am not able to figure out the correct XPath usage to retrieve the value. Please help me fix this.
The url in question is https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00&type=CE&expiry=30NOV2017
I am trying to get the VWAP value, which at present is 27.16 (this value changes every business day). When I inspect the value in Chrome, I get the following element for the required value:
<span id="vwap">27.16</span>
Following an online tutorial, I wrote the following Python script:
from lxml import html
import requests
page = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00&type=CE&expiry=30NOV2017')
tree = html.fromstring(page.content)
vwap = tree.xpath('//span[#id="vwap"]/text()')
print(vwap)
But when I execute this script, I get the following output:
[]
instead of
27.16
I have also tried replacing the XPath line with the following, as per another answer on Stack Overflow, but I am still not getting the correct output:
vwap = tree.xpath('//*[#id="vwap"]/text()')
Please let me know what to put inside the XPath so that I get the correct value in the vwap variable.
Any other solutions (other than lxml) are also welcome.
If you check the page source as it initially appears, the required node looks like
<li><a style="color: #000000;" title="VWAP">VWAP</a> <span id="vwap"></span></li>
while this is how it appears after JavaScript has executed:
<li><a style="color: #000000;" title="VWAP">VWAP</a> <span id="vwap">27.16</span></li>
Note that there is no text content in the first HTML sample.
It seems that the value comes from the node below:
<div id="responseDiv" style="display:none">
{"valid":"true","isinCode":null,"lastUpdateTime":"29-NOV-2017 15:30:30","ocLink":"\/marketinfo\/sym_map\/symbolMapping.jsp?symbol=NIFTY&instrument=-&date=-&segmentLink=17&symbolCount=2","tradedDate":"29NOV2017","data":[{"change":"-17.80","sellPrice1":"13.80","buyQuantity3":"450","sellPrice2":"13.85","buyQuantity4":"150","buyQuantity1":"13,725","ltp":"-243019.52","buyQuantity2":"6,225","sellPrice5":"14.00","sellPrice3":"13.90","buyQuantity5":"450","sellPrice4":"13.95","underlying":"NIFTY","bestSell":"-2,41,672.50","annualisedVolatility":"9.44","optionType":"CE","prevClose":"31.10","pChange":"-57.23","lastPrice":"13.30","lowPrice":"11.00","strikePrice":"10400.00","premiumTurnover":"11,707.33","numberOfContractsTraded":"5,74,734","underlyingValue":"10,361.30","openInterest":"58,96,350","impliedVolatility":"12.73","vwap":"27.16","totalBuyQuantity":"10,49,850","openPrice":"35.10","closePrice":"17.85","bestBuy":"-2,43,852.25","changeinOpenInterest":"1,60,800","clientWisePositionLimits":"30517526","totalSellQuantity":"11,07,825","dailyVolatility":"0.49","sellQuantity5":"19,800","marketLot":"75","expiryDate":"30NOV2017","marketWidePositionLimits":"-","sellQuantity2":"75","sellQuantity1":"3,825","buyPrice1":"13.00","sellQuantity4":"900","buyPrice2":"12.90","sellQuantity3":"2,025","buyPrice4":"12.75","buyPrice3":"12.80","buyPrice5":"12.65","turnoverinRsLakhs":"44,94,632.53","pchangeinOpenInterest":"2.80","settlementPrice":"-","instrumentType":"OPTIDX","highPrice":"40.85"}],"companyName":"Nifty 50","eqLink":""}
</div>
So the code that you might need is:
import json
vwap = json.loads(tree.xpath('//div[@id="responseDiv"]/text()')[0].strip())['data'][0]['vwap']
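For completeness, a minimal end-to-end sketch reusing the requests/lxml setup from the question (the JSON structure is taken from the responseDiv snippet above):
import json
import requests
from lxml import html

url = 'https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00&type=CE&expiry=30NOV2017'
page = requests.get(url)
tree = html.fromstring(page.content)
# The hidden responseDiv holds the quote data as a JSON string; parse it and read the vwap field
raw = tree.xpath('//div[@id="responseDiv"]/text()')[0].strip()
vwap = json.loads(raw)['data'][0]['vwap']
print(vwap)  # e.g. '27.16'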
Here is the HTML I'm dealing with
<a class="_54nc" href="#" role="menuitem">
<span>
<span class="_54nh">Other...</span>
</span>
</a>
I can't seem to get my XPath structured correctly to find this element with the link. There are other elements on the page with the same attributes as <a class="_54nc"> so I thought I would start with the child and then go up to the parent.
I've tried a number of variations, but I would think something like this:
crawler.get_element_by_xpath('//span[#class="_54nh"][contains(text(), "Other")]/../..')
None of the things I've tried seem to be working. Any ideas would be much appreciated.
Or, a cleaner option is //*[.='Other...']/../.., where . matches on the element's own text and /../.. climbs up to the enclosing parent element.
In another scenario, if you want to find the <a> tag, use the CSS selector [role='menuitem'], which is a better option if the role attribute is unique.
How about trying this:
crawler.get_element_by_xpath('//a[@class="_54nc"][./span/span[contains(text(), "Other")]]')
Try this:
crawler.get_element_by_xpath('//a[@class="_54nc"]//span[.="Other..."]')
This will search for the a element with class "_54nc" that contains the exact text/innerHTML "Other...". Furthermore, you can swap the text "Other..." for other text to find the respective element(s).
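For reference, a minimal sketch of both approaches with plain lxml, since the crawler object's API isn't shown here (lxml is only an assumption for illustration):
from lxml import html

html_source = '''
<a class="_54nc" href="#" role="menuitem">
  <span>
    <span class="_54nh">Other...</span>
  </span>
</a>
'''
tree = html.fromstring(html_source)
# Start from the inner span's text and climb up to the enclosing <a>
links = tree.xpath('//span[@class="_54nh"][contains(text(), "Other")]/../..')
# Or target the <a> directly via its role attribute, as suggested above
menu_items = tree.xpath('//a[@role="menuitem"][.//span[.="Other..."]]')
print(len(links), len(menu_items))  # both selectors should find the single <a>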
I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.
To do this I am using the Python libraries requests and BeautifulSoup. To try and build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+s and saving the HTML to my desktop).
First things to note:
1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48
2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.
3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.
4) Searching through the HTML, one finds the journal title located inside a form here:
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
<p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>
<p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Skjeldal, L.');">Skjeldal, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Gran, L.');">Gran, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Sletten, K.');">Sletten, K.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Volkman, B.F.');">Volkman, B.F.</a></p>
<p>
<b>Journal:</b>
(2002)
<span class="se_journal">Arch.Biochem.Biophys.</span>
<span class="se_journal"><b>399: </b>142-148</span>
</p>
A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".
And so I wrote the following code:
def JournalLookup():
    PDBID = '1K48'
    import requests
    from bs4 import BeautifulSoup
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
Ideally I'd be able to use find instead of findAll as these are the only two in the document, but I used findAll to at least verify I'm getting an empty list. I assumed that it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.
After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the requests doc does not include the lines I want at all.
Does anybody know why this is the case, and what I could possibly do to fix it?
Thanks.
The content you are interested in is generated by JavaScript. It's easy to find out: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off.
This site will not function correctly without Javascript."
For JavaScript-driven pages, you cannot use Python Requests alone. There are some alternatives, one being dryscrape.
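If it helps, a rough sketch of that route, pairing dryscrape with the BeautifulSoup parsing you already have (dryscrape must be installed separately, and this is untested against the RCSB page, so treat it as an assumption rather than a drop-in fix):
import dryscrape
from bs4 import BeautifulSoup

session = dryscrape.Session()
session.visit('http://www.rcsb.org/pdb/explore.do?structureId=1K48')
# session.body() returns the HTML after JavaScript has run
doc = BeautifulSoup(session.body(), 'html.parser')
journal = doc.find('span', class_='se_journal')
print(journal.get_text() if journal else 'not found')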
PS: Do not import libraries/modules within a function. Python does not recommend it, and PEP 8 says that:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it's not the recommended way to do it.
The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:
my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza'), or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
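A short sketch based on the functions and fields mentioned above (the exact import path and available keys may vary with the PyPDB version, so treat this as an assumption):
from pypdb import describe_pdb, get_all_info

# Short description; fields like 'title' and 'citation_authors' live here when present
my_desc = describe_pdb('1K48')
print(my_desc.get('title'))
print(my_desc.get('citation_authors'))

# Fall back to the broader record if the short description lacks journal details
full_info = get_all_info('1K48')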