Scrapy Xpath getting the correct pagination - python

First of all thank you if you are reading this.
I have been scraping away for some time to collect minor data; however, I now want to pull in some additional information, but I got stuck on a pagination.
I would like to get the data-href of the link, but the right link is the one that contains an <i> element with a specific class.
I have been using [contains()], but how do you get the data-href when the matching <a> needs to contain a child element with a specific class?
<li><a class="cursor" data-type="js" data-href="test"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
I have been using the following:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[contains(@class,"cursor")]/@data-href').extract_first()
which runs, but returns the wrong data-href (the first match is the previous-page link, not the next-page one)
Many thanks for the help
Full source code:
<div class="pagination-container margin-bottom-20"> <div class="text-center"><ul class="pagination"><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li><li>1</li><li class="active"><a>2</a></li><li>3</li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs">12</li><li class="hidden-xs no-link"><a>...</a></li><li class="hidden-xs">22</li><li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li></ul></div> </div> </div>

Huh... Turned out to be such a simple case (:
Your mistake is using .extract_first(); you should extract the last item to get the next page.
next_page = response.xpath('//a[@class="cursor"]/@data-href').extract()[-1]
This will do the trick. But I'd recommend extracting all the links from the pagination list, since Scrapy manages duplicate request filtering for you. This does a better job and leaves less room for mistakes:
pages = response.xpath('//ul[@class="pagination"]//a/@data-href').extract()
for url in pages:
    yield scrapy.Request(url=response.urljoin(url), callback=self.whatever)
And so on..
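To see why feeding every pagination link back into the crawl is safe, here is a small stdlib-only sketch (no Scrapy required; the example.com base URL is made up) of the resolve-and-deduplicate behaviour that response.urljoin() plus Scrapy's duplicate-request filter give you:

```python
from urllib.parse import urljoin

def unique_page_urls(base_url, hrefs):
    """Resolve relative pagination hrefs against the page URL and drop
    duplicates, mimicking Scrapy's built-in duplicate-request filter."""
    seen = set()
    unique = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

# The two data-href values from the question's pagination block,
# resolved against a made-up page URL:
pages = unique_page_urls(
    "https://example.com/used-truck/1-32/truck-ads.html",
    ["/used-truck/1-32/truck-ads.html",
     "/used-truck/1-32/truck-ads.html?p=3",
     "/used-truck/1-32/truck-ads.html"],  # duplicate gets filtered out
)
```

So even if the previous-page and next-page chevrons both end up yielded, the crawl doesn't loop.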

Try this:
next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[@class="cursor"]/@data-href').extract_first()

I'd suggest you first make sure that your element exists in the initial HTML:
just press Ctrl+U in Chrome and then Ctrl+F to find the element.
If the element can be found there, something's wrong with your XPath selector.
Otherwise the element is generated by JavaScript and you have to use another way to get the data.
PS: You shouldn't use the Chrome DevTools "Elements" tab to check whether an element exists, because that tab shows the DOM with JS code already applied. So check the source only (Ctrl+U).

Scrapy How to Get Values from data-href

I am trying to scrape a bunch of links, or rather paths that can be appended to the root domain to form links, from https://www.media.mit.edu/groups
The html itself looks like this:
<div class="container-item listing-layout-item selectorgadget_selected" data-href="/groups/viral-communications/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/social-machines/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/space-enabled/overview/" '="">
The link data is stored within the data-href part, and I have been trying to use CSS selectors to get this data.
In the Scrapy shell I have been trying
response.css('.data-href::text').extract(), but it returns an empty list.
Any suggestions would be greatly appreciated!
Try using
response.xpath('//div/@data-href').extract()
to get the required values.
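If you want to see the attribute extraction work without spinning up Scrapy, here is a stdlib-only sketch using html.parser (the sample markup is trimmed down from the question) that collects every data-href on the page:

```python
from html.parser import HTMLParser

class DataHrefCollector(HTMLParser):
    """Collect the data-href attribute of every tag that has one."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-href":
                self.hrefs.append(value)

collector = DataHrefCollector()
collector.feed(
    '<div class="container-item" data-href="/groups/viral-communications/overview/"></div>'
    '<div class="container-item" data-href="/groups/social-machines/overview/"></div>'
)
```

The collected paths can then be joined onto the root domain exactly as the question describes.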

How to extract the img title next to span as per the given HTML through Selenium and Python

I am writing a web crawler to check a kind of availability.
I want to check the title for a specific time. If the title is 'NO' there is no href, otherwise there is one, so the XPath depends on the title. The title changes every time, so I can't locate it by XPath.
If I want to check the availability of 09:00~11:00, how can I do that?
I tried to find it by XPath. However, since the XPath changes as described, I can't check the specific time I want.
Thanks in advance.
Below is the HTML code.
<span class="rs">07:00~09:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
<span class="rs">09:00~11:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
<span class="rs">11:00~13:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
<span class="rs">13:00~15:00</span><img src="../images/reservation_btn03.gif" title="YES"><br>
<span class="rs">15:00~17:00</span><img src="../images/reservation_btn03.gif" title="YES"><br>
<span class="rs">17:00~19:00</span><img src="../images/reservation_btn03.gif" title="YES"><br>
<span class="rs">19:00~21:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>
As per the HTML you have shared, to check the availability of any timespan, e.g. 09:00~11:00, you can use the following solution:
You can create a function which takes the timespan as an argument and extracts the availability as follows:
def check_availability(myTimeSpan):
    print(driver.find_element_by_xpath("//span[@class='rs'][.='" + myTimeSpan + "']//following::img[1]").get_attribute("title"))
Now you can call the function check_availability() with any timespan as follows:
check_availability("09:00~11:00")
If the text 09:00~11:00 is fixed, you can locate the img element like this -
element = driver.find_element_by_xpath("//span[@class='rs' and contains(text(),'09:00~11:00')]/following-sibling::img")
To check whether the title attribute of the element is "YES" -
if element.get_attribute("title") == 'YES':
    # do whatever you want
To get the href attribute of your required element-
source = driver.find_element_by_xpath("//span[@class='rs' and contains(text(),'09:00~11:00')]/following-sibling::img[@title='YES']/preceding-sibling::a").get_attribute("href")
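If you only need the availability mapping and not a live browser, here is a stdlib-only sketch (regex-based, assuming the markup stays as simple and regular as in the question; for anything messier use a real HTML parser) that pairs each time span with the title of the img that follows it:

```python
import re

# Trimmed copy of the question's markup: each slot is a
# <span class="rs"> followed by an <img> whose title is YES or NO.
HTML = (
    '<span class="rs">07:00~09:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>'
    '<span class="rs">09:00~11:00</span><img src="../images/reservation_btn04.gif" title="NO"><br>'
    '<span class="rs">13:00~15:00</span><img src="../images/reservation_btn03.gif" title="YES"><br>'
)

def availability(html):
    """Map every time span to the title of the img right after it."""
    return dict(re.findall(
        r'<span class="rs">([^<]+)</span><img[^>]*title="([^"]+)"', html))

slots = availability(HTML)
```

With the full page source (e.g. driver.page_source), slots["09:00~11:00"] would answer the question directly.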

Find span value with xpath

I'm following this
tutorial for scraping information from a website after a login.
Now, part of the code makes use of an XPath variable to scrape specific content. I'm not familiar with XPath, and after a lot of searching I can't find the right solution. I hope one of you can help me out!
I need the value within the "price" <span>:
<div class="price-box">
<span class="regular-price" id="product-price-64">
<span class="price">€ 4,90</span>
</span>
</div>
My piece of code right now is:
# Scrape url
result = session_requests.get(URL, headers = dict(referer = URL))
tree = html.fromstring(result.content)
bucket_names = tree.xpath("//span[@class='price']/text()")
What should be the xpath code to get the information from within the <span>?
Edit: it seems indeed, as per the comments, that the initial page source did not come through correctly.
Your XPath looks almost fine; maybe you forgot a dot?
bucket_names = tree.xpath(".//span[@class='price']/text()")
Your XPath seems correct, but note that tree.xpath(...) returns a list of elements.
Try with: tree.xpath("//span[@class='price']")[0].text
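As a quick sanity check that the value really is in the markup you received, here is a stdlib-only sketch without lxml (a regex is acceptable for this one fixed snippet, though a real parser is preferable in general):

```python
import re

# The snippet from the question, collapsed onto one string.
html = ('<div class="price-box">'
        '<span class="regular-price" id="product-price-64">'
        '<span class="price">€ 4,90</span>'
        '</span>'
        '</div>')

# Capture whatever sits inside the inner price span.
match = re.search(r'<span class="price">([^<]+)</span>', html)
price = match.group(1) if match else None
```

If this finds the price but your lxml XPath does not, the page source your session fetched differs from what the browser shows, which matches the edit above.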

Selenium Webscraping Twitter - Getting hold of tweet timestamp?

When inspecting a twitter results page, within the following class:
<small class="time">
....
</small>
Is a timestamp for each tweet 'data-time':
<span class="_timestamp js-short-timestamp js-relative-timestamp" data-time="1510698047" data-time-ms="1510698047000" data-long-form="true" aria-hidden="true">12m</span>
Within selenium i am using the following code:
tweet_date = browser.find_elements_by_class_name('_timestamp')
But looking at a single entry, this only returns, in this case, 12m.
How is it possible to access one of the other attributes of the element within Selenium?
I usually use find_elements_by_xpath; this lets you grab a specific element from a page without worrying about names. Or so that's how it seems to work.
EDIT
Alright so I think I've got it figured out. First, find element by xpath and assign.
ts = browser.find_elements_by_xpath('//*[@id="stream-item-tweet-929138668551380992"]/div/div[2]/div[1]/small/a/span')
Remember that if you use "elements" instead of "element" you get a list back, so you'll need to add something like this:
ts = ts[0]
Then you can use the get_attribute method to get the info associated with 'data-time' in the html.
raw_time=ts.get_attribute('data-time')
Returns
raw_time == '1510358895'
Thank you to SuperStew who found the key to the answer - get_attribute()
My final solution for anyone wondering:
tweet_date = browser.find_elements_by_class_name("_timestamp")
And then for any date in that list:
tweet_date[1].get_attribute('data-time')
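Once you have the raw value, data-time is just a Unix epoch in seconds (data-time-ms is the same value in milliseconds), so turning it into a readable timestamp is a one-liner with the stdlib:

```python
from datetime import datetime, timezone

# The data-time value from the question's <span>: a Unix epoch in seconds.
raw_time = "1510698047"
posted = datetime.fromtimestamp(int(raw_time), tz=timezone.utc)
print(posted.isoformat())  # 2017-11-14T22:20:47+00:00
```

That recovers the absolute posting time instead of the relative "12m" label Twitter renders.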

python, collecting links / script values from page

I am trying to make a program to collect links and some values from a website. It works mostly well, but I have come across a page on which it does not.
With Firebug I can see that this is the HTML code of the elusive "link" (I can't find it when viewing the page's source, though):
<a class="visit" href="/tet?id=12&mv=13&san=221">
221
</a>
and this is the script:
<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>
I'm wondering how to get either the "link" ("/tet?id=12&mv=13&san=221") from the HTML code or the string "221" from either the script or the HTML, using Selenium, mechanize, or requests (or some other library).
I have made an unsuccessful attempt at getting it with mechanize using the br.links() function, which collected a number of links from the site, just not the one I am after.
extra info: This might be important. to get to the page I have to click on a button with this code:
<a id="f33" class="button-flat small selected-no" onclick="qc.pA('visitform', 'f33', 'QClickEvent', '', 'f52'); if ($j('#f44').length == 0) { $j('f44').style.display='inline'; }; $j('#f38').hide();qc.recordControlModification('f38', 'DisplayStyle', 'hide'); document.getElementById('forumpanel').className = 'section-3'; return false;" href="#">
load2
</a>
after which a "new page" loads in a part of the window (but the url never changes)
I think you pasted the wrong script of yours ;)
I'm not sure what you need exactly - there are at least two different approaches.
Matching all hrefs using regex
Matching specific tags and using get_attribute(...)
For the first one, you have to get the whole html source of the page with something like webdriver.page_source and using something like the following regex (you will have to escape either the normal or the double quotes!):
<a.+?href=['"](.*?)['"].*?/?>
If you need the hrefs of all matching links, you could use something similar to webdriver.find_elements_by_css_selector('.visit') (take care to choose find_elements_... instead of find_element_...!) to obtain a list of webelements and iterate through them to get their attributes.
This could result in code like this:
hrefs = []
elements = webdriver.find_elements_by_css_selector('.visit')
for element in elements:
    hrefs.append(element.get_attribute('href'))
Or a one-liner using a list comprehension:
hrefs = [element.get_attribute('href') for element
         in webdriver.find_elements_by_css_selector('.visit')]
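For completeness, here is a stdlib-only sketch of the first approach, applying the regex shown above to the script fragment from the question; keep in mind that regex over HTML is brittle against markup changes, so prefer the selector approach whenever a browser is available:

```python
import re

# The fragment from the question, with escaping removed.
html = '<td><a href="/tet?id=12&mv=13&san=221" class="visit">221</a></td>'

# Same href pattern as above, allowing both quote styles.
hrefs = re.findall(r'<a.+?href=[\'"](.*?)[\'"]', html)
# And the visible "221" text sitting inside the a.visit element.
texts = re.findall(r'class=[\'"]visit[\'"]>([^<]+)<', html)
```

Running this against webdriver.page_source (after clicking the "load2" button, so the JS-generated content is present) yields both pieces the question asks for.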
