Scrapy: scrape content having the same class name - Python

I am using Scrapy to crawl and scrape data from a particular website. The crawler works fine, but I'm having an issue when scraping content from divs that share the same class name. For example:
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
I want to retrieve only this is the 1st div. The code I've used is:
desc = hxs.select('//div[@class="same_name"]/text()').extract()
But it returns all of the contents. Any help would be appreciated!

OK, this one worked for me:
print desc[0]
It returned this is the 1st div, which is what I wanted.
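Instead of indexing the result list in Python, the XPath itself can be limited to the first match. A minimal sketch with lxml (the old hxs.select API is long deprecated; response.xpath in modern Scrapy accepts the same expression):

```python
from lxml import html

doc = html.fromstring(
    '<html><body>'
    '<div class="same_name">this is the 1st div</div>'
    '<div class="same_name">this is the 2nd div</div>'
    '<div class="same_name">this is the 3rd div</div>'
    '</body></html>'
)

# Parentheses group all matches first; [1] then keeps only the first (XPath is 1-based)
first = doc.xpath('(//div[@class="same_name"])[1]/text()')
print(first)  # ['this is the 1st div']
```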

You can use BeautifulSoup. It's a great HTML parser.
from BeautifulSoup import BeautifulSoup
html = """
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
"""
soup = BeautifulSoup(html)
# find() returns only the first matching div; .string works here
# because the div contains nothing but text
print soup.find('div', {'class': 'same_name'}).string
That should do the trick.
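For anyone on the maintained bs4 package, find() returns only the first match, which avoids pulling in the text of every div. A sketch using the markup from the question:

```python
from bs4 import BeautifulSoup

html = """
<div class="same_name">this is the 1st div</div>
<div class="same_name">this is the 2nd div</div>
<div class="same_name">this is the 3rd div</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching element; find_all() would return all three
first = soup.find("div", class_="same_name")
print(first.get_text(strip=True))  # this is the 1st div
```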

Using XPath you will get all of the divs with the same class; you can then loop over them to narrow down the result (in Scrapy):
divs = response.xpath('//div[@class="full class name"]')
for div in divs:
    # e.g. keep only the divs that also match some further CSS selector
    if div.css("div.class"):
        print(div.xpath("./text()").extract_first())

Related

How to find the second div from HTML in Python with BeautifulSoup

I'm trying to find the second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I try to select -->
My code shows nothing in the terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()
Your code first finds all of the container divs and picks the second one, which is the one you are trying to select. You are then searching for <p> tags within it; your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text)  # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)

Delete block in HTML based on text

I have the HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I could do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose() here because the blocks have different attributes and tag names, while the text within follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!
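Since the asker mentions that the blocks can differ in tag name and attributes while sharing the same text pattern, matching on the text alone and decomposing the parent works regardless of the tag. A sketch, where the `<section>` variant is an invented example of a differently-tagged block:

```python
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">Name:<br><b>John</b></div>
<section>Name:<br><b>Jane</b></section>
<p>Keep me</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Each matching text node's parent is the enclosing block, whatever its tag
for text_node in soup.find_all(string=re.compile(r'Name:')):
    text_node.parent.decompose()

print(soup.get_text())  # only "Keep me" survives
```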

Get all links from DOM except from a certain div tag selenium python

How can I get all links in the DOM except those inside a certain div?
This is the div I don't want links from:
<div id="yii-debug-toolbar">
<div class="yii-debug-toolbar_bar">
<div class="yii-debug-toolbar_block">
<a>...</a>
</div>
<div class="yii-debug-toolbar_block">
<a>...</a>
</div>
</div>
</div>
I get the links in my code like this:
links = driver.find_elements_by_xpath("//a[@href]")
But I don't want to get the ones from that div, how can I do that?
I'm not sure if there is a simple way to do this with just Selenium's XPath capabilities. However, a simple solution is to parse the HTML with something like BeautifulSoup, remove all the <div id="yii-debug-toolbar">...</div> elements, and then select the remaining links.
from bs4 import BeautifulSoup
...
soup = BeautifulSoup(driver.page_source, 'html.parser')
for div in soup.find_all("div", {'id': 'yii-debug-toolbar'}):
    div.decompose()
links = soup.find_all('a', href=True)
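For completeness, XPath alone can express the exclusion with an ancestor check, so no second parser is needed; the same expression should work with Selenium's find_elements_by_xpath. A sketch against a reduced copy of the markup, using lxml so it runs standalone:

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <a href="/keep">keep</a>
  <div id="yii-debug-toolbar">
    <div class="yii-debug-toolbar_block"><a href="/skip">skip</a></div>
  </div>
</body></html>
""")

# Keep only <a href> elements that have no yii-debug-toolbar ancestor
links = page.xpath('//a[@href and not(ancestor::div[@id="yii-debug-toolbar"])]')
print([a.get("href") for a in links])  # ['/keep']
```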

BeautifulSoup.find_all for nested divs without class attribute

I am working with Python 2 and I want to get the content of a div in an HTML page.
<div class="lts-txt2">
Some Content
</div>
If the div class is like the one above, then I can get the content using
BeautifulSoup.find_all('div', attrs={"class": 'lts-txt2'})
But if the div is like,
<div class="lts-txt2">
<div align="justify">
Some Content
</div>
</div>
then using
BeautifulSoup.find_all('div', attrs={"class": 'lts-txt2'})
doesn't return the content.
So I tried
BeautifulSoup.find_all('div', attrs={"align": 'justify'})
but that didn't work either.
How can I solve this?
You can extract all text from the node, including nested nodes, with the get_text() method:
[el.get_text() for el in soup.find_all('div', attrs={"class": 'lts-txt2'})]
This produces a list with the textual content of each such div, whether or not there is a nested div inside.
You could also use a CSS selector via the select() method to select the nested div:
soup.select('div.lts-txt2 > div')
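On the markup from the question, the two approaches look like this (a sketch):

```python
from bs4 import BeautifulSoup

html = """<div class="lts-txt2">
<div align="justify">
Some Content
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() gathers text from the div and everything nested inside it
outer_text = soup.find("div", class_="lts-txt2").get_text(strip=True)
print(outer_text)  # Some Content

# select() with a child combinator targets the nested div directly
inner = soup.select("div.lts-txt2 > div")
print(inner[0].get_text(strip=True))  # Some Content
```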

How to select spans under one div but not another via XPath?

Say I have this page:
<div class="top">
<span class="strings">asdf</span>
<span class="strings">qwer</span>
<span class="strings">zxcv</span>
</div>
<div id="content">
some other text
<span class="strings">1234</span>
<span class="strings">5678</span>
<span class="strings">1234</span>
</div>
How do I get the script to scrape only the span class="strings" elements inside the div id="content", not the div class="top"? The results should be '1234', '5678', '1234'.
Here is my code so far:
from lxml import html
import requests
url = 'http://www.amazon.com/dp/B00SGGQRNO'
response = requests.get(url)
tree = html.fromstring(response.content)
bullets = tree.xpath('//span[@class="strings"]/text()')
print('Bullets: ', bullets)
To select only the text of those span elements (with class="strings") that are children of the div element with id="content", use this XPath expression:
//div[@id="content"]/span[@class="strings"]/text()
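Applied to the sample markup (sketched with lxml so no network request to the Amazon URL is needed):

```python
from lxml import html

page = html.fromstring("""
<html><body>
<div class="top">
  <span class="strings">asdf</span>
</div>
<div id="content">
  some other text
  <span class="strings">1234</span>
  <span class="strings">5678</span>
  <span class="strings">1234</span>
</div>
</body></html>
""")

# Only spans that are direct children of the #content div are matched
bullets = page.xpath('//div[@id="content"]/span[@class="strings"]/text()')
print(bullets)  # ['1234', '5678', '1234']
```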
