Difficulty web scraping with beautiful soup - python

I'm trying to use Beautiful Soup to extract the title of a job. The title attribute of the span tag is the same as its text, e.g. the text is 'Barista' and so is the title. So far I've been using .find_all(), but I don't know how to make it work for this.
Sample html:
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>

Try something like this.
# Imports.
from bs4 import BeautifulSoup
# HTML code.
html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>'''
# Parsing HTML.
soup = BeautifulSoup(html_str, 'lxml')
# Searching for `span` tags with `title` attributes.
list_html_titles = soup.find_all('span', attrs={'title': True})
# Getting titles from HTML code blocks.
list_titles = [x.text for x in list_html_titles]

You can take advantage of the recursive argument of Beautiful Soup's find() to get just the direct child of h2.
I tested the following code sample and it works:
from bs4 import BeautifulSoup
html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>'''
soup = BeautifulSoup(html_str, 'lxml')
title = soup.h2.find('span', recursive=False).text
print(title)
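Since the title attribute holds the same string as the text, it can also be read straight from the attribute. The following is only a sketch of that idea using a CSS selector on the sample HTML from the question; the selector string and the variable names are illustrative assumptions, and the built-in html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>'''

soup = BeautifulSoup(html_str, "html.parser")

# Select the span that is a direct child of the h2 and has a title attribute.
span = soup.select_one("h2.jobTitle > span[title]")

title_from_attr = span["title"]               # value of the title attribute
title_from_text = span.get_text(strip=True)   # visible text of the span
print(title_from_attr, title_from_text)
```

The nested span with class "label" is skipped because it has no title attribute and is not a direct child of the h2.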

Related

Get hrefs from <a> Tags Located in the Divs with a Specific Classes Using BeautifulSoup

I need to get hrefs from <a> tags on a website, not all of them, only the ones inside the spans located in the <div>s with class arm:
<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs
request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
However, anchor is a ResultSet (a list of tags), so you need another loop to get the hrefs. Here is what those final lines should look like with minimal changes to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
    for x in anchor:
        print(x.attrs['href'])
And you should get the hrefs. All the best.
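As a self-contained illustration of the same idea, here is a small sketch that collects the hrefs with a single selector. The markup and the href values are hypothetical placeholders (the links in the question's HTML are not shown), and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these hrefs are placeholders, not taken from the real page.
sample = '''<div class="arm"><span>
<a href="/a1">link</a>
<a href="/a2">link</a>
</span></div>
<div class="footnote"><span>
<a href="/f1">anotherLink</a>
</span></div>'''

soup = BeautifulSoup(sample, "html.parser")

# Only anchors that are direct children of a span inside a div with class "arm".
hrefs = [a["href"] for a in soup.select("div.arm span > a[href]")]
print(hrefs)
```

The anchor inside the "footnote" div is left out because the selector only descends from divs with class "arm".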
Try using the find_all() method to get the values inside specific tags and classes.
I have replicated your HTML file and extracted the values from the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
Code:
from bs4 import BeautifulSoup as bs
# Reading the replicated HTML file
with open("demo.html", "r") as demo:
    results = bs(demo, 'html.parser')
# Using the find_all() method to find specific tags and classes
job_elements = results.find_all("div", class_="arm")
for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
reference:
https://realpython.com/beautiful-soup-web-scraper-python/

Missing parts in Beautiful Soup results

I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns empty and it seems like the farthest Beautiful Soup was able to get was the ul tag, but none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think your issue is that your format is incorrect for pulling the div with the attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
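For comparison, on static markup like the snippet in the question, the class_ keyword and the attrs dictionary select the same elements. The sketch below checks this on a trimmed copy of the sample HTML (the variable names are assumptions). If both lookups come back empty on the live page, the list items are probably not present in the HTML that requests received, for example because a script fills them in later, as the question suspects.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question.
html = '''<ul class='list' id='js_list'>
<li class="first">
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>'''

soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways of filtering on the class attribute.
by_keyword = [d.get_text(strip=True) for d in soup.find_all("div", class_="winrate")]
by_attrs = [d.get_text(strip=True) for d in soup.find_all("div", attrs={"class": "winrate"})]
print(by_keyword, by_attrs)
```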

How to find the second div in HTML with Python BeautifulSoup

I'm trying to find the second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I am trying to select -->
My code shows nothing in the terminal.
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()
Your code first finds all the container divs and picks the second one, which is the one you are trying to select. You are then searching for <p> tags within it. Your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup
html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text) # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)
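If relying on the index [1] feels fragile, one possible alternative is to select the container that follows the section-heading-page div as a sibling. This is only a sketch built on the example HTML above; the selector and variable names are assumptions, and it presumes the two divs really are siblings on the real page.

```python
from bs4 import BeautifulSoup

html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""

soup = BeautifulSoup(html, "html.parser")

# "~" matches a container div that comes after the heading div at the same level,
# so the container nested inside the heading div is not selected.
second = soup.select_one("div.section-heading-page ~ div.container")
texts = [p.get_text() for p in second.find_all("p")]
print(texts)
```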

How to scrape an element when other elements with the same tag name and class name hold different data in Beautifulsoup4?

I want to scrape the salary of the job, but many elements unrelated to salary share the same tag name and class names. How can I scrape it with beautifulsoup4, or must I use another web scraping library such as Selenium? I think the XPath would be the same as well. How can I scrape only the salary, without the other elements for skills and description?
html = '''
<div class="the-same-div">
<span class="header-span">Salary</span>
<span class="key-span">
<span class="css-8888">1000 Dollar</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Skills</span>
<span class="key-span">
<span class="css-8888">Web scraping</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Description</span>
<span class="key-span">
<span class="css-8888">This is a web scraping Job with good salary</span>
</span>
</div>'''
This is the Python code I use to scrape the salary element:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
salary = soup.find_all("span", {"class": "css-8888"})
How can I scrape just the salary of this job? Thank you.
I am not sure that Selenium is a good choice for this task; its main purpose is somewhat different.
To get all salaries, I would do the following:
from bs4 import BeautifulSoup as bs
with open("test.html", "r") as html_file:
    soup = bs(html_file.read(), "html.parser")
same_div_list = soup.find_all("div", {"class": "the-same-div"})
jobs_salary_list = []
for div in same_div_list:
    if div.find("span", {"class": "header-span"}).text == "Salary":
        jobs_salary_list.append(div.find("span", {"class": "css-8888"}).text)
print(jobs_salary_list)
bs4 lets you search locally (inside other objects): first get all "the-same-div" divs, then iterate over them and check each "header-span" value; if it equals "Salary", take the value of the "css-8888" span.
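The same local search can be pushed one step further by collecting every header/value pair into a dictionary, so that the salary is looked up by its label. This is a sketch under the assumption that each "the-same-div" holds exactly one header span and one value span, as in the question's HTML; the name fields is hypothetical.

```python
from bs4 import BeautifulSoup

html = '''
<div class="the-same-div">
<span class="header-span">Salary</span>
<span class="key-span">
<span class="css-8888">1000 Dollar</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Skills</span>
<span class="key-span">
<span class="css-8888">Web scraping</span>
</span>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Map each header label to the text of the value span in the same div.
fields = {}
for div in soup.find_all("div", class_="the-same-div"):
    header = div.find("span", class_="header-span")
    value = div.find("span", class_="css-8888")
    if header and value:  # skip blocks that lack either span
        fields[header.get_text(strip=True)] = value.get_text(strip=True)

print(fields["Salary"])
```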
Since Selenium is tagged, this is what I would do in Selenium:
//span[text() = 'Salary']/following-sibling::span/span
and get the text out of it using the .text attribute, something like this:
from selenium.webdriver.common.by import By
# Selenium 4 style; older releases used driver.find_element_by_xpath(...)
print(driver.find_element(By.XPATH, "//span[text() = 'Salary']/following-sibling::span/span").text)
If there is more than one salary, use find_elements.
You can grab the tag that has the "Salary" text and then use .find_next() to get the following <span> tag that holds the salary:
html = '''
<div class="the-same-div">
<span class="header-span">Salary</span>
<span class="key-span">
<span class="css-8888">1000 Dollar</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Skills</span>
<span class="key-span">
<span class="css-8888">Web scraping</span>
</span>
</div>
<div class="the-same-div">
<span class="header-span">Description</span>
<span class="key-span">
<span class="css-8888">This is a web scraping Job with good salary</span>
</span>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
span = soup.find_all("span", {"class": "header-span"}, string='Salary')
for each in span:
    salary = each.find_next('span',{'class':'css-8888'})
    print(salary.text)
Output:
1000 Dollar

Beautiful Soup - Ignore child divs with same name as parent div

the html is structured as so:
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
...
Basically, there are many divs with the same class name as their child divs, and ultimately I want to find the "important text", which is found under the parent div only.
When I try to find all divs with class="my_class", I obviously get both the parents and the children. How can I get only the parent divs?
Here is my code for getting all divs with class = "my_class" and finding the important text:
my_div_list = soup.find_all('div', attrs={'class': 'my_class'})
for my_div in my_div_list:
    text_item = my_div.find('div') # to get to the div that contains the important text
    print(text_item.getText())
Obviously, the output is:
important text
not important
important text
not important
...
When I want:
important text
important text
...
From the find_all() documentation:
recursive is a boolean argument (defaulting to True) which tells Beautiful Soup
whether to go all the way down the parse tree, or whether to only look at the
immediate children of the Tag or the parser object.
So, assuming the first level of divs sits directly under <body> (inside <html>), you can set recursive=False:
parents = soup.html.body.find_all('div', attrs={'class': 'my_class'},
                                  recursive=False)
print([parent.div.text for parent in parents])
Output:
['important text', 'important text']
You can iterate over soup.contents:
from bs4 import BeautifulSoup as soup
r = [i.div.text for i in soup(html, 'html.parser').contents if i != '\n']
Output:
['important text', 'important text']
With bs4 4.7.1 or later, you can use :has and :first-child:
from bs4 import BeautifulSoup as bs
html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''
soup = bs(html, 'lxml')
print([i.text for i in soup.select('.my_class:has(>.my_class) > div:first-child')])
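Another option that does not depend on where the outer divs sit in the tree is to keep only the "my_class" divs that have no "my_class" ancestor. The following is a sketch of that filter on the sample HTML; the names outer and texts are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Keep a div only if none of its ancestors is itself a "my_class" div.
outer = [
    d for d in soup.find_all("div", class_="my_class")
    if d.find_parent("div", class_="my_class") is None
]

# The first div inside each outer div holds the important text.
texts = [d.find("div").get_text(strip=True) for d in outer]
print(texts)
```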
