Say I have this page:
<div class="top">
<span class="strings">asdf</span>
<span class="strings">qwer</span>
<span class="strings">zxcv</span>
</div>
<div id="content">
some other text
<span class="strings">1234</span>
<span class="strings">5678</span>
<span class="strings">1234</span>
</div>
How do I get the script to scrape only the <span class="strings"> elements inside the <div id="content">, not those in <div class="top">? The result should be '1234', '5678', '1234'.
Here is my code so far:
from lxml import html
import requests
url = 'http://www.amazon.com/dp/B00SGGQRNO'
response = requests.get(url)
tree = html.fromstring(response.content)
bullets = tree.xpath('//span[@class="strings"]/text()')
print('Bullets: ', bullets)
To select only the text of those span elements (with @class="strings") that are children of the div element with @id="content", use this XPath expression:
//div[@id="content"]/span[@class="strings"]/text()
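Plugged into the code from the question, a minimal sketch (assuming the page really uses these class and id values; the expected output shown is for the sample markup at the top, not the live Amazon page):

from lxml import html
import requests

url = 'http://www.amazon.com/dp/B00SGGQRNO'
response = requests.get(url)
tree = html.fromstring(response.content)

# Restrict the search to spans inside the div with id="content"
bullets = tree.xpath('//div[@id="content"]/span[@class="strings"]/text()')
print('Bullets: ', bullets)  # -> ['1234', '5678', '1234'] for the sample markup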
I'm currently scraping elements from a webpage. Let's say I'm iterating over an HTML response and a part of that response looks like this:
<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>
I know I can access the first title attribute within the span class like so:
row[-1].find('span')['title']
"SLT-4 2435"
But I would like to select the second title under the span class (if it exists) as a string too, like so: "SLT-4 2435, SLT-6 2631"
Any ideas?
You can use the find_all() function to find all the span elements with class material-part:
titles = []
for material_part in row[-1].find_all('span', class_='material-part'):
    titles.append(material_part['title'])

result = ', '.join(titles)
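With the markup shown above, result should come out as 'SLT-4 2435, SLT-6 2631'.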
As an alternative to find() / find_all(), you could use CSS selectors:
soup.select('span.material-part[title]')
then iterate over the ResultSet with a list comprehension and join() the titles into a single string:
','.join([t.get('title') for t in soup.select('span.material-part[title]')])
Example
from bs4 import BeautifulSoup
html = '''<div class="col-sm-12 col-md-5">
<div class="material">
<div class="material-parts">
<span class="material-part" title="SLT-4 2435">
<img src="/images/train-material/mat_slt4.png"/> </span>
<span class="material-part" title="SLT-6 2631">
<img src="/images/train-material/mat_slt6.png"/> </span>
</div>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
','.join([t.get('title') for t in soup.select('span.material-part[title]')])
Output
SLT-4 2435,SLT-6 2631
I need to get the hrefs from <a> tags on a website, but not all of them, only the ones that are in the spans located in the <div>s with class arm.
<html>
<body>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="footnote">
<span>
<a href="...">anotherLink</a>
<a href="...">anotherLink</a>
<a href="...">anotherLink</a>
</span>
</div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs
request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet, which means you will need another loop to get the hrefs. Here is how those final lines should look if you want minimal modification to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
    for x in anchor:
        print(x.attrs['href'])
And you should get the hrefs. All the best.
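Equivalently, the hrefs can be collected in one pass with a list comprehension; a small sketch reusing the html soup object from the question:

hrefs = [a.attrs['href'] for arm in html.select(".arm") for a in arm.select("span > a")]
print(hrefs)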
Try using the find_all() method to obtain the values from specific tags and classes.
I have replicated your HTML file and obtained the values from the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
Code:
import requests
from bs4 import BeautifulSoup as bs
# Reading the replicated HTML file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')

# Using the find_all method to find specific tags and classes
job_elements = results.find_all("div", class_="arm")

for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
Reference: https://realpython.com/beautiful-soup-web-scraper-python/
I want to scrape the salary of a job, but many elements that don't relate to the salary have the same tag and class names. How can I scrape it with beautifulsoup4, or must I find another web scraping library like Selenium? I think the XPath would be the same for each element as well. How can I scrape only the salary, without the other elements about the skills and description?
html = '''
<div class="the-same-div">
<span class="header-span">Salary</span>
<span class="key-span">
<span class="css-8888">1000 Dollar</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Skills</span>
<span class="key-span">
<span class="css-8888">Web scraping</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Description</span>
<span class="key-span">
<span class="css-8888">This is a web scraping Job with good salary</span>
</span>
</div>'''
Now this is the Python code to scrape the salary element:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
salary = soup.find_all("span", {"class": "css-8888"})
Now, how can I scrape only the salary of this job? Thank you.
I am not sure that Selenium is a good choice for such a task; Selenium's main purpose is a little different.
To get all the salaries, I would do it the following way:
from bs4 import BeautifulSoup as bs

html_file = open("test.html", "r")
soup = bs(html_file.read(), "html.parser")

same_div_list = soup.find_all("div", {"class": "the-same-div"})
jobs_salary_list = []

for div in same_div_list:
    if div.find("span", {"class": "header-span"}).text == "Salary":
        jobs_salary_list.append(div.find("span", {"class": "css-8888"}).text)

print(jobs_salary_list)
So basically bs4 gives you the ability to search locally (inside other objects): first you get all the "the-same-div" divs, iterate over them and look at each one's "header-span" value, and if it equals "Salary" you take the value of that div's "css-8888" span.
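With the HTML snippet from the question saved as test.html, jobs_salary_list should come out as ['1000 Dollar'].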
Since Selenium is tagged, this is what I would do in Selenium:
//span[text() = 'Salary']/following-sibling::span/span
and get the text out of it using the .text property,
something like this:
print(driver.find_element_by_xpath("//span[text() = 'Salary']/following-sibling::span/span").text)
If there's more than one salary, use find_elements.
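For reference, a minimal sketch of the same idea on newer Selenium releases, where the find_element_by_* helpers are replaced by find_element(By.XPATH, ...); the URL and driver setup here are placeholders, not from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()              # assumes a working local ChromeDriver
driver.get("https://example.com/job")    # hypothetical job-posting URL

# Same XPath as above: the span that follows the 'Salary' header span
salary = driver.find_element(By.XPATH, "//span[text() = 'Salary']/following-sibling::span/span")
print(salary.text)  # -> 1000 Dollar for the snippet in the question

driver.quit()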
You can grab the tag that has the "Salary" text and then use .find_next() to get the subsequent <span> tag holding the salary:
html = '''
<div class="the-same-div">
<span class="header-span">Salary</span>
<span class="key-span">
<span class="css-8888">1000 Dollar</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Skills</span>
<span class="key-span">
<span class="css-8888">Web scraping</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Description</span>
<span class="key-span">
<span class="css-8888">This is a web scraping Job with good salary</span>
</span>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
span = soup.find_all("span", {"class": "header-span"}, text='Salary')
for each in span:
    salary = each.find_next('span',{'class':'css-8888'})
    print(salary.text)
Output:
1000 Dollar
My HTML file contains the same tag (<span class="fna">) multiple times. If I want to differentiate these tags, I need to look at the previous tag: I only want the <span class="fna"> that sits under <span id="field-value-reporter">.
In Beautiful Soup I can only apply a condition on the tag itself, like soup.find_all("span", {"class": "fna"}). That call extracts the data for every <span class="fna">, but I need only the ones contained under <span id="field-value-reporter">.
Example HTML:
<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422" >
<a class="email " href="/user_profile?user_id=287422" >
<span class="fna">Chris Pearce (:cpearce)
</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780" >
<a class="email " href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]
</span>
</a>
</div>
</span>
</div>
Use soup.select:
soup.select('#field-value-reporter a > span') # select the span tags that are children of an <a> inside the element whose id is field-value-reporter
>>> [<span class="fna">Chris Pearce (:cpearce)</span>]
soup.select uses CSS selectors, which are, in my opinion, much more capable than the default element search that comes with BeautifulSoup. Note that the results are returned as a list containing everything that matches.
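For completeness, a minimal runnable sketch built from the HTML in the question (nothing here beyond the markup shown above):

from bs4 import BeautifulSoup

html = '''
<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422">
<a class="email" href="/user_profile?user_id=287422">
<span class="fna">Chris Pearce (:cpearce)</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780">
<a class="email" href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]</span>
</a>
</div>
</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Only the .fna span nested under #field-value-reporter is matched
for span in soup.select('#field-value-reporter a > span'):
    print(span.get_text(strip=True))  # -> Chris Pearce (:cpearce)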
I am using Scrapy to crawl and scrape data from a particular website. The crawl works fine, but I'm having an issue when scraping content from divs that share the same class name. For example:
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
I want to retrieve only this is the 1st div. The code I've used is:
desc = hxs.select('//div[@class = "same_name"]/text()').extract()
But it returns all the contents. Any help would be really appreciated!
OK, this one worked for me:
print(desc[0])
It returned this is the 1st div, which is what I wanted.
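As a side note, on newer Scrapy versions the same thing can be done without indexing, since .get() returns just the first match and .getall() returns all of them (a sketch assuming a Selector-based response object):

first = response.xpath('//div[@class="same_name"]/text()').get()       # first matching text node only
all_texts = response.xpath('//div[@class="same_name"]/text()').getall()  # every matching text node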
You can use BeautifulSoup. It's a great HTML parser.
from bs4 import BeautifulSoup
html = """
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.text)
That should do the job.
Using XPath you will get all the divs with the same class; you can then loop over them to get the result (for Scrapy):
divs = response.xpath('//div[@class="full class name"]')
for div in divs:
    if div.css("div.class"):
        # do something with the matching div, e.g. extract its text
        print(div.xpath('text()').get())