Scrapy / How to select data directly nested into span - python

One website I am trying to scrape has a specific structure for prices. It looks something like this:
<span class="sale-price" data-sup-product-price="" data-item-price="2.02" ...>
2,
<sup>02 E</sup>
</span>
Is it possible to directly access the data-item-price value stored on the span?
I mean, not something like:
response.css("span.sale-price").extract()
but another way, using data-item-price?

Try response.css("span.sale-price::attr(data-item-price)").get() to read the value of that attribute. Or, if you want to select every span that has this attribute, use the selector span[data-item-price].
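The difference between reading the attribute and reading the element text can be checked offline. The sketch below is only an illustration: it uses Python's standard-library ElementTree as a stand-in for Scrapy's response object (an assumption, made so that it runs without a crawl), and the markup is a simplified copy of the span from the question, with the elided attributes left out and a wrapping div added.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the markup from the question; the wrapping <div> is
# hypothetical and only gives the snippet a single root element.
markup = (
    '<div>'
    '<span class="sale-price" data-sup-product-price="" data-item-price="2.02">'
    '2,<sup>02 E</sup>'
    '</span>'
    '</div>'
)
root = ET.fromstring(markup)

# Counterpart of the CSS selector span[data-item-price]:
# every span that carries the attribute.
spans = root.findall(".//span[@data-item-price]")

# Counterpart of ::attr(data-item-price): read the attribute value itself.
prices = [span.get("data-item-price") for span in spans]

# The visible text is split between the span and the nested <sup>,
# which is why the attribute is the easier value to extract.
visible_text = "".join(spans[0].itertext())

print(prices)
print(visible_text)
```

In a real spider the same two ideas map onto the selectors given in the answer; only the parsing library differs.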

Related

Extract text from div class with scrapy

I am using Python along with Scrapy. I want to extract the text from a div tag that is nested inside a div with a class. For example:
<div class="ld-header">
<h1>2013 Gulfstream G650ER for Sale</h1>
<div id="header-price">Price - $46,500,000</div>
</div>
I've extracted the text from the h1 tag:
result.xpath('//div[@class="ld-header"]/h1/text()').extract()
but I can't extract the price. I've tried:
'price': result.xpath('//div[@class="ld-header"]/div[@id="header-price"]/text()').extract()
Since the element has an id, you do not need the complete path to it; ids are supposed to be unique within a page.
This XPath:
//div[@id="header-price"]/text()
used on the given HTML will return:
'Price - $46,500,000'
For debugging XPath and CSS selectors, I always find it helpful to use an online checker (a quick search will turn up several).
Try this one and tell me how it goes :)
price = [x.replace('Price - ', '').replace('$', '') for x in result.xpath('//div[@id="header-price"]/text()').extract()]
This is a list comprehension over all the items in the extraction; the replace() method strips out the parts you don't need.
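The two answers above can be combined and checked without running a spider. This is a minimal sketch, assuming the standard-library ElementTree as a stand-in for Scrapy's selector (it understands the same id predicate); the final conversion to an integer is an extra step that is not in the original answers.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question.
markup = (
    '<div class="ld-header">'
    '<h1>2013 Gulfstream G650ER for Sale</h1>'
    '<div id="header-price">Price - $46,500,000</div>'
    '</div>'
)
root = ET.fromstring(markup)

# Look the element up by its id instead of spelling out the full path.
price_text = root.find('.//div[@id="header-price"]').text

# Strip the label and the currency symbol, as in the second answer.
cleaned = price_text.replace('Price - ', '').replace('$', '')

# Extra step (not in the original): drop the thousands separators
# so the value can be used as a number.
amount = int(cleaned.replace(',', ''))

print(price_text)
print(cleaned)
print(amount)
```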

How to get data-timestamp using python/selenium

Below is the html of the table I want to extract the data-timestamp from.
The webpage is at https://nl.soccerway.com/national/argentina/primera-division/20182019/regular-season/r47779/matches/?ICID=PL_3N_02
So far I have tried various approaches I found on here, but nothing seemed to work. Can someone help me extract, for example, 1536962400? In other words, I want to extract every data-timestamp value in the table. Any suggestions are more than welcome! I have used Selenium with Python to extract table data from the website, but data-timestamp always gives errors.
data-timestamp is an attribute of the tr element, so you can try this:
element_list = driver.find_elements_by_xpath("//table[contains(@class,'matches')]/tbody/tr")
for item in element_list:
    print(item.get_attribute('data-timestamp'))
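get_attribute() returns the value as a string, so a follow-up step is usually needed before it can be used as a date. The sketch below assumes that data-timestamp holds Unix seconds, which the page does not state; the helper name timestamp_to_utc is hypothetical.

```python
from datetime import datetime, timezone


def timestamp_to_utc(raw_value):
    """Convert an attribute string such as '1536962400' to an aware datetime.

    Assumes the value is a Unix timestamp in seconds.
    """
    return datetime.fromtimestamp(int(raw_value), tz=timezone.utc)


# The example value quoted in the question.
kickoff = timestamp_to_utc("1536962400")
print(kickoff.isoformat())
```

Rows without the attribute make get_attribute() return None, so a real loop would need to skip those before converting.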

Scrapy How to Get Values from data-href

I am trying to scrape a bunch of links, or things which can be appended to the root domain to make a link from https://www.media.mit.edu/groups
The html itself looks like this:
<div class="container-item listing-layout-item selectorgadget_selected" data-href="/groups/viral-communications/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/social-machines/overview/" '="">
<div class="container-item listing-layout-item selectorgadget_suggested" data-href="/groups/space-enabled/overview/" '="">
The link data is stored within the data-href part, and I have been trying to use CSS selectors to get this data.
When I use the Scrapy shell, I have been trying to use
response.css('.data-href::text').extract() but it returns an empty list.
Any suggestions would be greatly appreciated!
Try
response.xpath('//div/@data-href').extract()
to get the required values.
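The original attempt returns an empty list because .data-href is a class selector, and none of the divs has a class named data-href; the CSS counterpart of the XPath above would be div[data-href]::attr(data-href). Once the values are extracted, they still have to be joined with the site root to form full links. A minimal sketch of that step, using the three values shown in the question (the names BASE_URL and data_hrefs are hypothetical):

```python
from urllib.parse import urljoin

# Page the values were scraped from (taken from the question).
BASE_URL = "https://www.media.mit.edu/groups"

# Values as the XPath //div/@data-href would return them for the sample markup.
data_hrefs = [
    "/groups/viral-communications/overview/",
    "/groups/social-machines/overview/",
    "/groups/space-enabled/overview/",
]

# A leading slash makes each path root-relative, so urljoin keeps only
# the scheme and host of BASE_URL.
links = [urljoin(BASE_URL, href) for href in data_hrefs]

for link in links:
    print(link)
```

Inside a spider, response.urljoin(href) does the same thing relative to the URL of the page being parsed.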

Selenium Webscraping Twitter - Getting hold of tweet timestamp?

When inspecting a Twitter results page, inside the following element:
<small class="time">
....
</small>
there is a timestamp for each tweet, 'data-time':
<span class="_timestamp js-short-timestamp js-relative-timestamp" data-time="1510698047" data-time-ms="1510698047000" data-long-form="true" aria-hidden="true">12m</span>
In Selenium I am using the following code:
tweet_date = browser.find_elements_by_class_name('_timestamp')
But looking at the text of a single entry only returns, in this case, 12m.
How can I access one of the other attributes of that element with Selenium?
I usually use find_elements_by_xpath; this will let you grab a specific element from a page without worrying about names. Or so that's how it seems to work.
EDIT
Alright, so I think I've got it figured out. First, find the element by XPath and assign it.
ts=browser.find_elements_by_xpath('//*[@id="stream-item-tweet-929138668551380992"]/div/div[2]/div[1]/small/a/span')
I forgot that if you use "elements" instead of "element" you get a list back, so you'll need to add something like this.
ts=ts[0]
Then you can use the get_attribute method to get the info associated with 'data-time' in the html.
raw_time=ts.get_attribute('data-time')
Returns
raw_time == '1510358895'
Thank you to SuperStew who found the key to the answer - get_attribute()
My final solution for anyone wondering:
tweet_date = browser.find_elements_by_class_name("_timestamp")
And then for any date in that list:
tweet_date[1].get_attribute('data-time')
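The distinction that matters here, element text versus element attributes, can be reproduced offline. The sketch below is an illustration only: it uses the standard-library ElementTree instead of a browser, where .text plays the role of Selenium's text property and .get() plays the role of get_attribute(). Placing the span directly inside the small element is an assumption, since the question elides that part of the markup, and treating data-time as Unix seconds is also an assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The span is copied from the question; nesting it directly inside
# <small class="time"> is an assumption made for this sketch.
markup = (
    '<small class="time">'
    '<span class="_timestamp js-short-timestamp js-relative-timestamp" '
    'data-time="1510698047" data-time-ms="1510698047000" '
    'data-long-form="true" aria-hidden="true">12m</span>'
    '</small>'
)
root = ET.fromstring(markup)
span = root.find("span")

# The visible text is only the relative label.
visible = span.text

# The attributes carry the actual timestamps, as strings.
raw_time = span.get("data-time")
raw_time_ms = span.get("data-time-ms")

# Assumption: data-time is a Unix timestamp in seconds.
posted_at = datetime.fromtimestamp(int(raw_time), tz=timezone.utc)

print(visible, raw_time, raw_time_ms)
print(posted_at.isoformat())
```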

What is a unique identifier and how to use it to select?

I use Selenium and I am trying to automate a task on a website and in order to select an item I have to use this:
select = driver.find_element_by_*whatever*
However, all the whatevers like find_element_by_id, by name, by tag name etc. are either unavailable or are shared by several items. The only one that seems to be unique to each item is a "data-id" number but there isn't a find_element_by_data_id function as far as I know.
I can get a unique identifier which looks like this:
div.item:nth-child(453)
It seems to fit since it doesn't change when I reload the page and is unique to only one item.
How can I use this unique identifier to select the object? Alternatively, could you suggest a way of how I could select the desired item?
Here's the HTML pertaining to the object:
...
</div>
<div data-id="3817366931"
data-slot="secondary"
data-classes="pyro"
data-content="Level: 30<br/>"
data-appid="440"
class="item hoverable quality6 app440"
style="opacity:1;background-image:url(https://steamcdn-a.akamaihd.net/apps/440/icons/c_drg_manmelter.b76b87bda3242806c05a6201a4024a560269e805.png);"
data-title="Manmelter"
data-defindex="595">
</div>
<div data-id="3820690816"
data-slot="primary"
data-classes="pyro"
data-content="Level: 10<br/>"
data-appid="440"
class="item hoverable quality6 app440"
style="opacity:1;background-image:url(https://steamcdn-a.akamaihd.net/apps/440/icons/c_drg_phlogistinator.99b83086e28b2f85ed4c925ac5e3c6e123289aec.png);"
data-title="Phlogistinator"
data-defindex="594">
</div>
<div data-id="3819377317"
data-slot="primary"
data-classes="pyro"
data-content="Level: 10<br/>"
data-appid="440"
class="item hoverable quality6 app440"
style="opacity:1;background-image:url(https://steamcdn-a.akamaihd.net/apps/440/icons/c_drg_phlogistinator.99b83086e28b2f85ed4c925ac5e3c6e123289aec.png);"
data-title="Phlogistinator"
data-defindex="594">
So the items in the two bottom boxes are the same, and the one at the top is different. Let's say I would like a way to select the item in the second box.
I am not sure how easy it will be to automate this scenario with an HTML structure like this. I would suggest talking to the devs to see whether they can add some kind of id to each parent div; otherwise the selector will be too brittle. I also see that the data-id attribute is unique in every case, so that could be your best bet if you somehow know the ids beforehand. If you do not have any other option, the CSS nth-child() function is the next most reliable mechanism, but in that case you have to know the parent. nth-child() is well explained here
On the other hand, if the intention is to find the second div whose data-slot is primary, you can use the following XPath:
//div[@data-slot='primary'][2]
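Although there is no find_element_by_data_id function, any attribute can be matched with an attribute selector: the CSS selector div[data-id="3820690816"] (usable with find_element_by_css_selector) or the XPath //div[@data-id='3820690816']. The sketch below checks that idea offline with the standard-library ElementTree rather than a browser. The markup is a trimmed copy of the three divs from the question (style and data-content are left out), and the wrapping div with class "inventory" is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the three items from the question; the outer
# <div class="inventory"> is a hypothetical parent element.
markup = """
<div class="inventory">
  <div data-id="3817366931" data-slot="secondary" data-title="Manmelter"
       class="item hoverable quality6 app440"></div>
  <div data-id="3820690816" data-slot="primary" data-title="Phlogistinator"
       class="item hoverable quality6 app440"></div>
  <div data-id="3819377317" data-slot="primary" data-title="Phlogistinator"
       class="item hoverable quality6 app440"></div>
</div>
"""
root = ET.fromstring(markup)

# Select one item by its unique data-id value.
by_id = root.find(".//div[@data-id='3820690816']")

# Select the second item whose data-slot is "primary". The filtering is
# done in Python because ElementTree supports only a subset of XPath.
primaries = [d for d in root.findall("div") if d.get("data-slot") == "primary"]
second_primary = primaries[1]

print(by_id.get("data-title"), by_id.get("data-slot"))
print(second_primary.get("data-id"))
```

Note that the two lookups pick different elements here: the data-id lookup returns the item in the second box, while the second primary item is the one in the third box.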
