XPath not working in Scrapy

I have the following XPath that I am trying to extract data from:
/html/body/div[2]/div[2]/div/div/div[4]/ul[2]/li/div
I am trying to simply test this through Scrapy Shell, so I do the following:
scrapy shell "https://www.rentler.com/listing/520583"
and then:
hxs.select('/html/body/div[2]/div[2]/div/div/div[4]/ul[2]/li/div').extract()
But this returns [].
Any ideas?
Edit
The reason I want to do this is that I need to break up these 5 items into individual variables, not one array (which I currently have working):
<ul class="basic-stats">
<li>
<div class="count">4</div>
<div class="label">Bed</div>
</li>
<li>
<div class="count">2</div>
<div class="label">Bath</div>
</li>
<li>
<div class="count">1977</div>
<div class="label">Year</div>
</li>
<li>
<div class="count">1960</div>
<div class="label">SqFt</div>
</li>
<li>
<div class="count">0</div>
<div class="label">Acres</div>
</li>

I solved this. To access the individual items above, add a positional index to the li step of the XPath: li[1], li[2], etc.
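A minimal sketch of the positional approach, under some assumptions: it parses a closed copy of the basic-stats list above with the standard library's xml.etree.ElementTree instead of Scrapy, so it runs offline, and the helper name count_at and the variable names are illustrative only. In the Scrapy shell, the same li[1], li[2] steps would go into the XPath string passed to the selector.

```python
import xml.etree.ElementTree as ET

# Closed copy of the list from the question (the closing </ul> is assumed).
HTML = """
<ul class="basic-stats">
<li><div class="count">4</div><div class="label">Bed</div></li>
<li><div class="count">2</div><div class="label">Bath</div></li>
<li><div class="count">1977</div><div class="label">Year</div></li>
<li><div class="count">1960</div><div class="label">SqFt</div></li>
<li><div class="count">0</div><div class="label">Acres</div></li>
</ul>
"""

stats = ET.fromstring(HTML.strip())  # the <ul> element


def count_at(position):
    """Return the count text of the <li> at a 1-based position."""
    return stats.find(f"./li[{position}]/div[@class='count']").text


# XPath positions start at 1, so li[1] is the first item.
beds = count_at(1)
baths = count_at(2)
year = count_at(3)
sqft = count_at(4)
acres = count_at(5)

# Alternative that does not depend on item order: key each count by its label.
by_label = {
    li.findtext("div[@class='label']"): li.findtext("div[@class='count']")
    for li in stats.findall("li")
}

print(beds, baths, year, sqft, acres)
print(by_label)
```

Keying by label is a little longer, but it keeps working if the site reorders or drops one of the items.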

Related

How to pick up specific data with Python and Selenium

I'm Hiro from Japan.
I have just started studying Python with Selenium by myself.
I would be happy if somebody could help me solve the following problem.
The number of "data" items enclosed in "p" tags changes every time, so I don't know how many will be shown.
Data4 always appears, but its position changes.
All of these items are assigned the same class.
For example, in sample A Data4 is 4th, but in sample B it is 2nd.
Please help me pick up "Data4".
Thank you in advance for your help.
(sample A)
----------------------------------------
<div class="elist" id="test">
<ul class="ijkl">
<li class="elRow">
<div class="elRowTitle">
<p>Code1</p>
</div>
<div class="elRowData">
<p>Data1</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code2</p>
</div>
<div class="elRowData">
<p>Data2</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code3</p>
</div>
<div class="elRowData">
<p>Data3</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code4</p>
</div>
<div class="elRowData">
<p>Data4</p>
</div>
</li>
</ul>
</div>
----------------------------------------
(sample B)
----------------------------------------
<div class="elist" id="test">
<ul class="ijkl">
<li class="elRow">
<div class="elRowTitle">
<p>Code1</p>
</div>
<div class="elRowData">
<p>Data1</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code4</p>
</div>
<div class="elRowData">
<p>Data4</p>
</div>
</li>
</ul>
</div>
----------------------------------------
I wrote the following script using the "find_elements_by_xxxxx" methods for both sample A and B, but it did not work.
from selenium import webdriver
browser = webdriver.Chrome('chromedriver.exe')
elem_ItemCodes = driver.find_elements_by_tag_name('div')
elem_ItemCodes2 = elem_ItemCodes.find_elements_by_class_name('elRowData')
elem_ItemCode = elem_ItemCodes2[3].text
print(elem_ItemCode)

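You can use the parent element as a reference.
The XPath equivalent is:
(//div[@id="test"]//div[@class="elRowData"])[4]/p
Your approach is correct, but make a few changes:
from selenium import webdriver
browser = webdriver.Chrome('chromedriver.exe')
# find_element (singular) returns one element that can be searched further
elem_ItemCodes = browser.find_element_by_id('test')
elem_ItemCodes2 = elem_ItemCodes.find_elements_by_class_name('elRowData')
# Python lists are 0-based, so the 4th row is index 3
elem_ItemCode = elem_ItemCodes2[3].find_element_by_tag_name("p").text
print(elem_ItemCode)
The text is inside the p tag. Note that XPath indexes start from 1, while Python list indexes start from 0. A fixed index only matches sample A, where Data4 is the 4th row; in sample B it is the 2nd, so matching on the Code4 title is more robust.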
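Because the position of Data4 changes between samples, one option is to find the row whose title is Code4 and read that row's data. The sketch below illustrates the lookup on the two samples with the standard library's xml.etree.ElementTree, so it runs without a browser. The function name data_for_code is hypothetical, and the Selenium XPath in the comment is an untested assumption about how the same idea could be expressed there.

```python
import xml.etree.ElementTree as ET


def row(code, data):
    """Build one <li class="elRow"> shaped like the rows in the samples."""
    return (
        '<li class="elRow">'
        f'<div class="elRowTitle"><p>{code}</p></div>'
        f'<div class="elRowData"><p>{data}</p></div>'
        "</li>"
    )


def sample(rows):
    """Wrap rows in the same outer markup as sample A and sample B."""
    return '<div class="elist" id="test"><ul class="ijkl">' + "".join(rows) + "</ul></div>"


SAMPLE_A = sample([row(f"Code{i}", f"Data{i}") for i in (1, 2, 3, 4)])
SAMPLE_B = sample([row("Code1", "Data1"), row("Code4", "Data4")])


def data_for_code(markup, code):
    """Return the data text of the row whose title text equals code, or None."""
    root = ET.fromstring(markup)
    for li in root.iter("li"):
        if li.findtext("div[@class='elRowTitle']/p") == code:
            return li.findtext("div[@class='elRowData']/p")
    return None


# Assumed equivalent as a single XPath for a browser-backed lookup:
#   //li[div[@class="elRowTitle"]/p="Code4"]/div[@class="elRowData"]/p
print(data_for_code(SAMPLE_A, "Code4"))
print(data_for_code(SAMPLE_B, "Code4"))
```

The lookup gives the same answer for both samples because it never relies on the row index.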

Any limitation on HTML <li>?

I have the following code with Python:
<div id="sidebar-wrapper" class="container-fluid" style="background-color: lightgray">
<nav id="spy" class="nav nav-pills navbar-stacked">
<ul class="sidebar-nav nav">
<li class="">
<a href="{% url 'PHIproduct' %}" data-scroll="" class="">
<span class="fa fa-anchor solo"><h3>Product List</h3></span>
</a>
<li class="">
{% for i in loop_times_product %}
<a href="{% url 'PHI' %}?id={{ i }}" data-scroll="" class="">
<span class="fa fa-anchor solo" id="{{ i }}">{{ i|safe }}</span>
</a>
{% endfor %}
<li class="">
{% for i in loop_times %}
<a href="{% url 'PHIc' %}?id={{ i }}" data-scroll="" class="">
<span class="fa fa-anchor solo" id="{{ i }}">{{i|safe}}</span>
</a> {% endfor %}
<li class="">
{% for i in loop_timesc %}
<a href="{% url 'button' %}?id={{ i }}" data-scroll="" class="">
<span class="fa fa-anchor solo" id="{{ i }}">{{i|safe}}</span>
</a> {% endfor %}
</li>
</li>
</li>
</li>
</ul>
</nav>
</div>
The main purpose is to add the following feature:
After I apply this code, when product A is clicked, car and motor do not show, which means this part of the code is not running:
<li class="">
{% for i in loop_timesc %}<span class="fa fa-anchor solo" id="{{ i }}">{{i|safe}}</span>
{% endfor %}
</li>
Is there any limitation on li code or am I writing the wrong code here? Can anyone help me look at this because I already spent 2 days trying to find the mistake here but have failed.

I didn't inspect your code in great detail, but one thing jumped out at me: you're nesting <li> elements directly inside each other. You can't do that; an <li> needs to be a direct child of an <ol> or <ul>.
Forget about Python for the moment and just look at a simple HTML example.
Invalid:
<ul>
<li>
One
<li>
One A
</li>
</li>
</ul>
Valid:
<ul>
<li>
One
<ul>
<li>
One A
</li>
</ul>
</li>
</ul>
There may be other problems in your code, but this is certainly one to fix.
Another tip: if you're working a suspected HTML issue like this where one of the problems may be that the generated HTML simply isn't valid, don't try to figure out everything from your Python template source code. Instead, do a View Source in the browser where you can see exactly what the browser sees.
In fact, you can do a Select All and Copy from the View Source window, and then paste into the W3C HTML Validator to see if the HTML is valid. If you're generating invalid HTML, all bets are off, so that is the first thing to check.
If you treat your server code (including templates) separately from the actual downloaded HTML that the browser sees, you'll have a much easier time debugging. The server generates HTML code; the browser parses and renders the HTML code that the server generated.
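As a rough local complement to the validator, the nesting rule can also be checked with a few lines of Python. This is a sketch only, not a real HTML validator: it uses the standard library's html.parser as a plain tokenizer, tracks open tags on a stack, and records every li whose nearest open tag is not a list container. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class LiNestingChecker(HTMLParser):
    """Record <li> start tags whose nearest open tag is not ul, ol, or menu."""

    LIST_PARENTS = {"ul", "ol", "menu"}
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.problems = []   # (line number, offending parent tag)

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            parent = self.stack[-1] if self.stack else None
            if parent not in self.LIST_PARENTS:
                self.problems.append((self.getpos()[0], parent))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_bad_li(html):
    """Return a list of (line, parent) pairs for misplaced <li> elements."""
    checker = LiNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


INVALID = "<ul>\n<li>\nOne\n<li>\nOne A\n</li>\n</li>\n</ul>"
VALID = "<ul>\n<li>\nOne\n<ul>\n<li>\nOne A\n</li>\n</ul>\n</li>\n</ul>"

print(find_bad_li(INVALID))
print(find_bad_li(VALID))
```

Feeding it the HTML copied from View Source, rather than the template source, keeps the check aligned with what the browser actually receives.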

Extract text with a Python XPath expression

I want to display http:///gb/groceries/easter-essentials--%28approx-205kg%29.
In scrapy I used this XPath expression:
response.xpath('//div[@class="productNameAndPromotions"]/h3/a/href').extract()
but it didn't work!
<div class="product ">
<div class="productInfo">
<div class="productNameAndPromotions">
<h3>
<a href="http:///gb/groceries/easter-essentials--%28approx-205kg%29">
<img src="http:co.uk/wcsstore7.20.1.145/ExtendedSitesCatalogAssetStore/image/catalog/productImages/08/020000008_L.jpeg" alt="" />
</a>
</h3>
</div>
</div>
</div>

This //div[@class="productNameAndPromotions"]/h3/a/href means you want to get an element named href that is a child of a.
To extract a node's attribute, e.g. href, you need to use the @attribute syntax. Try the following:
//div[@class="productNameAndPromotions"]/h3/a/@href
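The element-versus-attribute distinction can be seen offline with the standard library. This is a sketch under stated assumptions: xml.etree.ElementTree supports only a subset of XPath and cannot return attribute nodes, so instead of ending the path in /@href it selects the a element and reads the attribute with get(). The markup is the snippet from the question; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<div class="product ">
<div class="productInfo">
<div class="productNameAndPromotions">
<h3>
<a href="http:///gb/groceries/easter-essentials--%28approx-205kg%29">
<img src="http:co.uk/wcsstore7.20.1.145/ExtendedSitesCatalogAssetStore/image/catalog/productImages/08/020000008_L.jpeg" alt="" />
</a>
</h3>
</div>
</div>
</div>
"""

root = ET.fromstring(SNIPPET.strip())

# ".../a/href" asks for a child ELEMENT named href, and there is none.
missing = root.find(".//div[@class='productNameAndPromotions']/h3/a/href")

# href is an ATTRIBUTE of <a>, so select <a> and read the attribute from it.
link = root.find(".//div[@class='productNameAndPromotions']/h3/a")
href = link.get("href")

print(missing)
print(href)
```

In Scrapy the attribute step can stay inside the XPath itself, as in the corrected expression above.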

Bootstrap Tabs not showing content when I use tal:repeat to display <li> elements in Pyramid Framework

I want to create dynamic tabs and their content in Pyramid using Bootstrap and the Chameleon template engine for Python, but only the first tab remains active despite clicking on the other tabs.
My HTML Code:
<ul class="nav nav-tabs responsive">
<li><a data-toggle="tab" href="#tips">Tips</a></li>
<li tal:repeat="key dict"><a data-toggle="tab" href="#${key}">${key} </a></li>
</ul>
<div class="tab-content responsive">
<div tal:repeat="(keys,value) dict.iteritems()" id="${keys}" class="tab-pane fade">
<p>${value}</p>
</div>
<div id="tips" class="tab-pane fade in active">
<p>tips</p>
</div>
</div>

How to debug strange artefacts in django template?

I'm using django template to render my hierarchical tree in a web page. In the process of rendering of a tree I see these strange whitespaces between nodes:
Here is my recursive template:
index.html:
<ul class="Container">
<li class="IsRoot">
<div class="Expand">
</div>
<div class="Content">
Содержание
</div>
</li>
{% include 'list.html' with data=list %}
</ul>
and list.html (as a recursive part):
<ul class="Container">
<li class="Node ExpandClosed">
<div class="Expand"></div>
<div class="Content">
<a href="/help/{{data.name}}">
{{data.content}}
</a>
</div>
{% for item in data.decendent %}
{% include 'list.html' with data=item %}
{% endfor %}
</li>
</ul>
How can I debug what is wrong with this template, and at what stage does it happen? As you can see, I don't generate any whitespace in this template.

The white space is not the problem, and it is not causing the spaces in your rendered tree. The reason for that appears to be that you are nesting ul elements directly inside uls, which isn't strictly speaking valid: they should be inside lis.
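To make the valid structure concrete, here is a small Python sketch that renders a tree so that every nested ul sits inside the li of its parent node. It is not the Django template itself: the function names are hypothetical, and only the keys name, content, and decendent are taken from the template in the question.

```python
from html import escape


def render_node(node):
    """Return one <li> for node, with any children in a <ul> inside that <li>."""
    parts = [
        '<li class="Node">',
        f'<a href="/help/{escape(node["name"])}">{escape(node["content"])}</a>',
    ]
    children = node.get("decendent", [])
    if children:
        # The nested list is a child of this <li>, never a direct child of a <ul>.
        parts.append('<ul class="Container">')
        parts.extend(render_node(child) for child in children)
        parts.append("</ul>")
    parts.append("</li>")
    return "".join(parts)


def render_tree(nodes):
    """Wrap the top-level nodes in a single <ul>."""
    return '<ul class="Container">' + "".join(render_node(n) for n in nodes) + "</ul>"


tree = [{"name": "a", "content": "A", "decendent": [{"name": "b", "content": "B"}]}]
print(render_tree(tree))
```

In template terms, the same shape is reached when the include that emits a ul is always placed inside an li, including at the top level in index.html.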
