I am trying to write a Python script to scrape data from a webpage, but I am not able to figure out the correct XPath to retrieve the value. Please help me fix this.
The url in question is https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00&type=CE&expiry=30NOV2017
I am trying to get the VWAP value, which at present is 27.16 (this value changes every business day). When I inspect the value in Chrome, I see the following markup for the required value:
<span id="vwap">27.16</span>
Following an online tutorial, I wrote the following Python script:
from lxml import html
import requests
page = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00&type=CE&expiry=30NOV2017')
tree = html.fromstring(page.content)
vwap = tree.xpath('//span[@id="vwap"]/text()')
print(vwap)
But when I execute this script, I get the following output:
[]
instead of
27.16
I have also tried replacing the XPath line with the following, as per another answer on Stack Overflow, but I still do not get the correct output.
vwap = tree.xpath('//*[@id="vwap"]/text()')
Please let me know what to put inside the XPath so that the correct value ends up in the vwap variable.
Any other solutions (other than lxml) are also welcome.
If you check the page source as it initially arrives, the required node looks like this:
<li><a style="color: #000000;" title="VWAP">VWAP</a> <span id="vwap"></span></li>
while this is how it appears after the JavaScript has executed:
<li><a style="color: #000000;" title="VWAP">VWAP</a> <span id="vwap">27.16</span></li>
Note that there is no text content in the first HTML sample.
It seems that the value comes from the node below:
<div id="responseDiv" style="display:none">
{"valid":"true","isinCode":null,"lastUpdateTime":"29-NOV-2017 15:30:30","ocLink":"\/marketinfo\/sym_map\/symbolMapping.jsp?symbol=NIFTY&instrument=-&date=-&segmentLink=17&symbolCount=2","tradedDate":"29NOV2017","data":[{"change":"-17.80","sellPrice1":"13.80","buyQuantity3":"450","sellPrice2":"13.85","buyQuantity4":"150","buyQuantity1":"13,725","ltp":"-243019.52","buyQuantity2":"6,225","sellPrice5":"14.00","sellPrice3":"13.90","buyQuantity5":"450","sellPrice4":"13.95","underlying":"NIFTY","bestSell":"-2,41,672.50","annualisedVolatility":"9.44","optionType":"CE","prevClose":"31.10","pChange":"-57.23","lastPrice":"13.30","lowPrice":"11.00","strikePrice":"10400.00","premiumTurnover":"11,707.33","numberOfContractsTraded":"5,74,734","underlyingValue":"10,361.30","openInterest":"58,96,350","impliedVolatility":"12.73","vwap":"27.16","totalBuyQuantity":"10,49,850","openPrice":"35.10","closePrice":"17.85","bestBuy":"-2,43,852.25","changeinOpenInterest":"1,60,800","clientWisePositionLimits":"30517526","totalSellQuantity":"11,07,825","dailyVolatility":"0.49","sellQuantity5":"19,800","marketLot":"75","expiryDate":"30NOV2017","marketWidePositionLimits":"-","sellQuantity2":"75","sellQuantity1":"3,825","buyPrice1":"13.00","sellQuantity4":"900","buyPrice2":"12.90","sellQuantity3":"2,025","buyPrice4":"12.75","buyPrice3":"12.80","buyPrice5":"12.65","turnoverinRsLakhs":"44,94,632.53","pchangeinOpenInterest":"2.80","settlementPrice":"-","instrumentType":"OPTIDX","highPrice":"40.85"}],"companyName":"Nifty 50","eqLink":""}
</div>
so the code that you might need is
import json
vwap = json.loads(tree.xpath('//div[@id="responseDiv"]/text()')[0].strip())['data'][0]['vwap']
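Put together, a minimal end-to-end sketch might look like this (the browser-like User-Agent header is an assumption on my part; nseindia.com sometimes rejects bare scripted requests):

import json
import requests
from lxml import html

# The hidden responseDiv carries the quote data as JSON; parse it and read
# data[0]["vwap"]. The User-Agent header is an assumption, not something the
# site documents.
url = ('https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/'
       'GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00'
       '&type=CE&expiry=30NOV2017')
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
tree = html.fromstring(page.content)
raw = tree.xpath('//div[@id="responseDiv"]/text()')[0].strip()
vwap = json.loads(raw)['data'][0]['vwap']
print(vwap)  # e.g. "27.16"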
I am trying to scrape a bunch of links (or path fragments that can be appended to the root domain to make a link) from https://www.media.mit.edu/groups
The HTML itself looks like this:
<div class="container-item listing-layout-item selectorgadget_selected" data-href="/groups/viral-communications/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/social-machines/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/space-enabled/overview/" '="">
The link data is stored in the data-href attribute, and I have been trying to use CSS selectors to get it.
In the Scrapy shell I have been trying
response.css('.data-href::text').extract()
but it returns an empty list.
Any suggestions would be greatly appreciated!
Try using
response.xpath('//div/@data-href').extract()
to get the required values.
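If you would rather stay with CSS selectors, note that data-href is an attribute rather than text, so select it with ::attr(). A sketch using the container-item class from your HTML:

# ::attr(data-href) returns the attribute value instead of the (empty) text node.
paths = response.css('div.container-item::attr(data-href)').extract()
urls = [response.urljoin(p) for p in paths]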
I am writing a web crawler to check a kind of availability.
I want to check the title for a specific time slot. If the title is 'NO' there is no href, otherwise there is one, so the element's XPath depends on the title. Since the title changes every time, I can't locate the element by a fixed XPath.
If I want to check the availability of 09:00~11:00, how can I do that?
I tried to find it by XPath, but since the XPath changes as described, I can't check the specific time I want.
Thanks in advance.
Below is the HTML code.
<span class="rs">07:00~09:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
<span class="rs">09:00~11:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
<span class="rs">11:00~13:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
<span class="rs">13:00~15:00</span><img src="../images/reservation_btn03.gif" title="YES"><br>
<span class="rs">15:00~17:00</span><img src="../images/reservation_btn03.gif" title="YES"><br>
<span class="rs">17:00~19:00</span><img src="../images/reservation_btn03.gif" title="YES"><br>
<span class="rs">19:00~21:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
As per the HTML you have shared, to check the availability of any timespan, e.g. 09:00~11:00, you can use the following solution.
You can create a function which takes the timespan as an argument and extracts the availability as follows:
def check_availability(myTimeSpan):
    print(driver.find_element_by_xpath("//span[@class='rs'][.='" + myTimeSpan + "']//following::img[1]").get_attribute("title"))
Now you can call the function check_availability() with any timespan as follows:
check_availability("09:00~11:00")
If the text 09:00~11:00 is fixed, you can locate the img element like this -
element = driver.find_element_by_xpath("//span[@class='rs' and contains(text(),'09:00~11:00')]/following-sibling::img")
To check whether the title attribute of the element is "YES" -
if element.get_attribute("title") == 'YES':
    # do whatever you want
To get the href attribute of your required element-
source = driver.find_element_by_xpath("//span[@class='rs' and contains(text(),'09:00~11:00')]/following-sibling::img[@title='YES']/preceding-sibling::a").get_attribute("href")
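Combining the two approaches above, here is a rough sketch of a helper that returns the booking link only when a slot is available. It assumes, as in the XPath just shown, that the <a> carrying the href sits next to the img as a sibling, and that a Selenium driver is already open on the page:

def slot_href_if_available(driver, time_span):
    # Locate the img that follows the span holding the requested timespan.
    imgs = driver.find_elements_by_xpath(
        "//span[@class='rs'][.='" + time_span + "']/following-sibling::img[1]")
    if not imgs or imgs[0].get_attribute("title") != "YES":
        return None  # slot missing or marked "NO"
    # If available, the href is assumed to live on a sibling <a> element.
    links = imgs[0].find_elements_by_xpath("./preceding-sibling::a")
    return links[0].get_attribute("href") if links else None

print(slot_href_if_available(driver, "13:00~15:00"))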
First of all thank you if you are reading this.
I have been scraping for some time to collect minor data, but now I want to pull in some additional information and I am stuck on pagination.
I would like to get the data-href of the next-page link, but that link can only be told apart by the element it contains.
I have been using contains(), but how do you get the data-href when the anchor needs to contain an <i> element with a specific class?
<li><a class="cursor" data-type="js" data-href="test"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
I have been using the following:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[contains(@class,"cursor")]/@data-href').extract_first()
which works but not for the correct data-href
Many thanks for the help
Full source code:
<div class="pagination-container margin-bottom-20"> <div class="text-center"><ul class="pagination"><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li><li>1</li><li class="active"><a>2</a></li><li>3</li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs">12</li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs">22</li><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li></ul></div> </div> </div>
Huh... Turned out to be such a simple case (:
Your mistake is .extract_first(); you should extract the last item to get the next page.
next_page = response.xpath('//a[@class="cursor"]/@data-href').extract()[-1]
This will do the trick. But I'd recommend extracting all the links from the pagination list, since Scrapy manages duplicate filtering itself. That does a better job and leaves less room for mistakes:
pages = response.xpath('//ul[@class="pagination"]//a/@href').extract()
for url in pages:
    yield scrapy.Request(url=response.urljoin(url), callback=self.whatever)
And so on..
Try this:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[@class="cursor"]/@data-href').extract_first()
I'd suggest you first make sure that your element exists in the initial HTML:
just press Ctrl+U in Chrome and then Ctrl+F to find the element.
If the element can be found there, something is wrong with your XPath selector.
Otherwise the element is generated by JavaScript and you have to use another way to get the data.
PS: You shouldn't use the Chrome DevTools "Elements" tab to check whether an element exists, because that tab shows the DOM after JavaScript has already run. Check the raw source only (Ctrl+U).
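The same check can be scripted. A small sketch (the domain below is a placeholder; substitute the site you are actually crawling):

import requests

# Fetch the page without any JavaScript execution and see whether the
# attribute you are after is present in the raw source at all.
raw = requests.get('https://example.com/used-truck/1-32/truck-ads.html').text
print('data-href' in raw)  # False suggests the pagination is built by JavaScript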
I'm following this tutorial for scraping information from a website after a login.
Part of the code makes use of an XPath expression to scrape specific content. I'm not familiar with XPath, and after a lot of searching I can't find the right solution. I hope one of you can help me out!
I need the value within the "price" <span>:
<div class="price-box">
<span class="regular-price" id="product-price-64">
<span class="price">€ 4,90</span>
</span>
</div>
My piece of code right now is:
# Scrape url
result = session_requests.get(URL, headers = dict(referer = URL))
tree = html.fromstring(result.content)
bucket_names = tree.xpath("//span[@class='price']/text()")
What should the XPath be to get the information from within the <span>?
Edit: It seems indeed, as per the comments, that the initial page source did not come through correctly.
Your XPath looks almost fine; maybe you forgot a dot?
bucket_names = tree.xpath(".//span[@class='price']/text()")
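Given the edit above, a quick diagnostic (just a sketch reusing the variables from your own snippet) can confirm whether the span is present in the fetched HTML at all before adjusting the XPath any further:

# If this prints False, the price is not in the initial response: it is filled
# in later by JavaScript or depends on the login session, and no XPath
# (with or without the leading dot) will find it.
print('class="price"' in result.text)
print(tree.xpath("//span[@class='price']/text()"))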
Please check the method you call on the XPath result; the XPath expression itself seems correct. Since tree.xpath() returns a list of elements, try:
bucket_names = tree.xpath("//span[@class='price']")[0].text
I am trying to get specific information about the original citing paper in the Protein Data Bank, given only the 4-letter PDB ID of the protein.
To do this I am using the Python libraries requests and BeautifulSoup. To build up the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting Command+S and saving it to my desktop).
First things to note:
1) The URL for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48
2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.
3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.
4) Searching through the HTML, one finds the journal title located inside a form here:
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
<p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>
<p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Skjeldal, L.');">Skjeldal, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Gran, L.');">Gran, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Sletten, K.');">Sletten, K.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Volkman, B.F.');">Volkman, B.F.</a></p>
<p>
<b>Journal:</b>
(2002)
<span class="se_journal">Arch.Biochem.Biophys.</span>
<span class="se_journal"><b>399: </b>142-148</span>
</p>
A lot more is in the form, but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys.", is located within a span tag with class "se_journal".
And so I wrote the following code:
def JournalLookup():
    PDBID = '1K48'
    import requests
    from bs4 import BeautifulSoup
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
Ideally I'd be able to use find instead of findAll, as these are the only two such spans in the document, but I used findAll to at least verify that I'm getting an empty list. I assumed it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.
After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the document returned by requests does not include the lines I want at all.
Does anybody know why this is the case, and what I could possibly do to fix it?
Thanks.
The content you are interested in is generated by JavaScript. It's easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off.
This site will not function correctly without Javascript."
For JavaScript-driven pages, you cannot use plain Python Requests. There are some alternatives, one being dryscrape.
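A minimal dryscrape sketch, assuming its usual Session/visit/body API (dryscrape needs webkit-server and Qt installed, so treat this as an outline rather than a drop-in fix):

import dryscrape
from bs4 import BeautifulSoup

# Render the page so its JavaScript runs, then parse the resulting DOM
# with BeautifulSoup exactly as before.
session = dryscrape.Session()
session.visit('http://www.rcsb.org/pdb/explore.do?structureId=1K48')
soup = BeautifulSoup(session.body(), 'html.parser')
print(soup.find_all('span', class_='se_journal'))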
PS: Do not import libraries/modules within a function. It is not recommended, and PEP 8 says:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it is not the recommended way to do it.
The Python package PyPDB can do this task. The repository can be found here, and it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:
my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
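A short usage sketch, assuming the functions described above are importable from the top-level pypdb package and keeping in mind that the available fields vary by entry:

from pypdb import describe_pdb, get_all_info

# describe_pdb returns a metadata dictionary for the entry; not every entry
# carries journal information, so use .get() rather than indexing.
my_desc = describe_pdb('4lza')
print(my_desc.get('title'))
print(my_desc.get('citation_authors'))

# Fall back to the broader dump if the summary lacks what you need.
everything = get_all_info('4lza')
print(type(everything))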