I was using Python 3.8, XPath and Scrapy, where things just seemed to work; I took my XPath expressions for granted.
Now I'm using Python 3.8, XPath and lxml.html, and things are much less forgiving. For example, using this URL and this XPath:
//dt[text()='Services/Products']/following-sibling::dd[1]
It would return a paragraph or a list depending on what the inner HTML was. This is how I am attempting to extract the text now:
from lxml import html

data = response.text  # response obtained earlier (e.g. via requests)
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
which returns a list of Element objects. For this page the <dd> holds a list of "li" elements, but other times this field can be any of these:
<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
  <ul>
    <li>some text</li>
    <li>some text</li>
  </ul>
</dd>
or
<dd>
  <ul>
    <li><p>some text</p></li>
    <li><p>some text</p></li>
  </ul>
</dd>
What is the best practice for extracting text in situations like this, where the target field can be any of a number of different things?
I used this test code to see what my options are:
with open('html_01.txt', 'r') as file:
    data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)
That returned this:
Health
Health
doctors
Health
doctors
Which is not correct. Here's a screenshot of it in Google Chrome:
[screenshot: the XPath tool in Google Chrome along with the HTML in question]
What's the best way to scrape this data using Python and XPath, or other options?
Thank you.
After spending hours googling and then writing this post above, it just came to me:
old code:
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
and new code that returns a nice list of text:
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li/text()")
Adding "/text()" on the end fixed it.
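Two notes that may help anyone hitting the same thing: in lxml, an XPath that starts with // searches the whole document even when called on an element node; prefixing it with a dot (.//) keeps the search inside that element. And when the <dd> can hold bare text, a <p>, or nested lists, text_content() flattens everything regardless of structure. A sketch, assuming the tree built above:

dd = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# .// restricts the search to this <dd>'s descendants
items = [t.strip() for t in dd.xpath(".//li/text()") if t.strip()]

# structure-agnostic fallback: all text inside the <dd>, whatever the markup
flat = dd.text_content().strip()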
I am scraping a website with Python and Beautiful Soup and I cannot seem to get this one tag right. How do I extract the page information?
This is the html code:
<div class="pull-right">
<span class="pages">page 1 / 7</span>
<span class="sep">|</span>
Next »
</div>
I have done this:
page=soup.find_all("span",{"class":"pages"})
and it produced this output: [page 1 / 7]. However, I only want part of it ("1/7" or "page 1/7").
Can anyone help?
What you are trying gives you a list in which every element with a matching tag is captured. An easy fix could be to access the element by index 0, but in some cases that can be a problem, since it will still collect every value with the same tag.
If you just want the 'page 1 / 7', use this:
Code:
element = soup.find("span", {"class": "pages"})
if element is not None:
    print(element.text)
Output:
page 1 / 7
If you want just '1 / 7' as your answer, use a regex:

import re

re.findall(r'\d+\s*/\s*\d+', element.text)[0]
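For reference, a minimal end-to-end sketch using the HTML from the question ('html.parser' is just one parser choice):

import re
from bs4 import BeautifulSoup

html = '<div class="pull-right"><span class="pages">page 1 / 7</span></div>'
soup = BeautifulSoup(html, 'html.parser')
element = soup.find("span", {"class": "pages"})
print(re.findall(r'\d+\s*/\s*\d+', element.text)[0])  # -> 1 / 7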
Hope this will solve your problem
The best way to do this is:

page = soup.find("span", attrs={"class": "pages"})  # find (not find_all) returns a single Tag, which has get_text()
page = page.get_text()
Try this, hope it will help you
First of all, thank you if you are reading this.
I have been scraping away for some time to collect minor data; however, I want to pull in some additional information, but I got stuck on pagination.
I would like to get the data-href of the link; however, the <a> needs to contain the <i> element.
I have been using contains(), but how do you get the data-href when the link needs to contain an element with a specific class?
<li><a class="cursor" data-type="js" data-href="test"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
I have been using the following:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[contains(@class,"cursor")]/@data-href').extract_first()
which works, but not for the correct data-href.
Many thanks for the help
Full source code:
<div class="pagination-container margin-bottom-20">
  <div class="text-center">
    <ul class="pagination">
      <li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li>
      <li>1</li>
      <li class="active"><a>2</a></li>
      <li>3</li>
      <li class="hidden-xs no-link"><a>...</a></li>
      <li class="hidden-xs">12</li>
      <li class="hidden-xs no-link"><a>...</a></li>
      <li class="hidden-xs">22</li>
      <li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
    </ul>
  </div>
</div>
</div>
Huh... Turned out to be such a simple case (:
Your mistake is .extract_first(); you should extract the last item to get the next page.
next_page = response.xpath('//a[@class="cursor"]/@data-href').extract()[-1]
This will do the trick. But I'd recommend you extract all the links from the pagination list, since Scrapy manages duplicate filtering for you. That does a better job and leaves less room for mistakes:
pages = response.xpath('//ul[@class="pagination"]//a/@data-href').extract()
for url in pages:
    yield scrapy.Request(url=response.urljoin(url), callback=self.whatever)
And so on..
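To answer the question as asked (selecting the <a> whose child <i> carries a specific class), a nested predicate works too. A sketch, assuming the pagination HTML above:

# match only the <a> containing the right-chevron icon, i.e. the "next" arrow
next_page_url = response.xpath(
    '//a[@class="cursor"][i[contains(@class, "fa-chevron-right")]]/@data-href'
).extract_first()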
Try this:

next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[@class="cursor"]/@data-href').extract_first()
I'd suggest you make sure that your element exists in the initial HTML first:
just press Ctrl+U in Chrome and then Ctrl+F to find the element.
If the element can be found there, something's wrong with your XPath selector.
Otherwise the element is generated by JavaScript and you have to use another way to get the data.
PS. You shouldn't use the Chrome DevTools "Elements" tab to check whether an element exists or not, because that tab shows the elements with the JS already applied. So check the source only (Ctrl+U).
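A programmatic version of the same check (a sketch; the URL and the 'data-href' needle are placeholder assumptions):

import requests

# fetch the raw HTML exactly as Ctrl+U would show it, before any JS runs
raw = requests.get('https://example.com/used-truck/1-32/truck-ads.html').text  # placeholder URL
print('data-href' in raw)  # True only if the attribute is in the initial HTML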
I am trying to write a Python script to scrape data from a webpage. However, I am not able to figure out the correct usage of XPath to retrieve the value. Please help me fix this.
The url in question is https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00&type=CE&expiry=30NOV2017
I am trying to get the VWAP value, which at present is 27.16 (this value changes every business day). When I inspect the value in Chrome, I see the following element for the required value:
<span id="vwap">27.16</span>
As per an online tutorial, I wrote the following Python script:
from lxml import html
import requests
page = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuoteFO.jsp?underlying=NIFTY&instrument=OPTIDX&strike=10400.00&type=CE&expiry=30NOV2017')
tree = html.fromstring(page.content)
vwap = tree.xpath('//span[@id="vwap"]/text()')
print(vwap)
But when I execute this script, I get the following output:
[]
instead of
27.16
I have also tried replacing the XPath line with the following, as per another answer on Stack Overflow, but I am still not getting the correct output.
vwap = tree.xpath('//*[@id="vwap"]/text()')
Please let me know what to put inside the XPath so that I get the correct value in the vwap variable.
Any other solutions (other than lxml) are also welcome.
If you check the page source as it initially arrives, the required node looks like this:
<li><a style="color: #000000;" title="VWAP">VWAP</a> <span id="vwap"></span></li>
while this is how it appears after the JavaScript has executed:
<li><a style="color: #000000;" title="VWAP">VWAP</a> <span id="vwap">27.16</span></li>
Note that there is no text content in the first HTML sample.
It seems that the value comes from the node below:
<div id="responseDiv" style="display:none">
{"valid":"true","isinCode":null,"lastUpdateTime":"29-NOV-2017 15:30:30","ocLink":"\/marketinfo\/sym_map\/symbolMapping.jsp?symbol=NIFTY&instrument=-&date=-&segmentLink=17&symbolCount=2","tradedDate":"29NOV2017","data":[{"change":"-17.80","sellPrice1":"13.80","buyQuantity3":"450","sellPrice2":"13.85","buyQuantity4":"150","buyQuantity1":"13,725","ltp":"-243019.52","buyQuantity2":"6,225","sellPrice5":"14.00","sellPrice3":"13.90","buyQuantity5":"450","sellPrice4":"13.95","underlying":"NIFTY","bestSell":"-2,41,672.50","annualisedVolatility":"9.44","optionType":"CE","prevClose":"31.10","pChange":"-57.23","lastPrice":"13.30","lowPrice":"11.00","strikePrice":"10400.00","premiumTurnover":"11,707.33","numberOfContractsTraded":"5,74,734","underlyingValue":"10,361.30","openInterest":"58,96,350","impliedVolatility":"12.73","vwap":"27.16","totalBuyQuantity":"10,49,850","openPrice":"35.10","closePrice":"17.85","bestBuy":"-2,43,852.25","changeinOpenInterest":"1,60,800","clientWisePositionLimits":"30517526","totalSellQuantity":"11,07,825","dailyVolatility":"0.49","sellQuantity5":"19,800","marketLot":"75","expiryDate":"30NOV2017","marketWidePositionLimits":"-","sellQuantity2":"75","sellQuantity1":"3,825","buyPrice1":"13.00","sellQuantity4":"900","buyPrice2":"12.90","sellQuantity3":"2,025","buyPrice4":"12.75","buyPrice3":"12.80","buyPrice5":"12.65","turnoverinRsLakhs":"44,94,632.53","pchangeinOpenInterest":"2.80","settlementPrice":"-","instrumentType":"OPTIDX","highPrice":"40.85"}],"companyName":"Nifty 50","eqLink":""}
</div>
so the code that you might need is

import json

vwap = json.loads(tree.xpath('//div[@id="responseDiv"]/text()')[0].strip())['data'][0]['vwap']
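The same lookup split into steps for readability (a sketch, assuming the tree built in the question):

raw = tree.xpath('//div[@id="responseDiv"]/text()')[0].strip()  # the hidden JSON payload
payload = json.loads(raw)
vwap = payload['data'][0]['vwap']  # -> '27.16'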
I am currently writing code for scrapers and am becoming more and more of a fan of Python, and especially BeautifulSoup.
Still... when parsing HTML I came across a difficult part that I could only handle in a not-so-beautiful way.
I want to scrape HTML Code and especially the following snippet:
<div class="title-box">
<h2>
<span class="result-desc">
Search results <strong>1</strong>-<strong>10</strong> out of <strong>10,009</strong> about <strong>paul mccartney</strong>Create email Alert
</span>
</h2>
</div>
So what I do is identify the div by using:
comment = TopsySoup.find('div', attrs={'class' : 'title-box'})
Then the ugly part comes in. To catch the number I want, 10,009, I use:
catcher = comment.strong.next.next.next.next.next.next.next
Can somebody tell me if there is a nicer way?
How about comment.find_all('strong')[2].text?
It can actually be shortened as comment('strong')[2].text, since calling a Tag object as though it is a function is the same as calling find_all on it.
>>> comment('strong')[2].text
u'10,009'
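For completeness, a self-contained sketch using the snippet from the question:

from bs4 import BeautifulSoup

html = '''
<div class="title-box">
<h2>
<span class="result-desc">
Search results <strong>1</strong>-<strong>10</strong> out of <strong>10,009</strong> about <strong>paul mccartney</strong>Create email Alert
</span>
</h2>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
comment = soup.find('div', attrs={'class': 'title-box'})
print(comment('strong')[2].text)  # 10,009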