How to extract class name as string from first element only? - python

I'm new to Python and have been using this piece of code to get the class name as text for my CSV, but I can't get it to extract only the first one. Do you have any idea how to?
for x in book_url_soup.findAll('p', class_="star-rating"):
    for k, v in x.attrs.items():
        review = v[1]
        reviews.append(review)
        del reviews[1]
    print(review)
The URL is: http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html
The output is:
Two
Two
One
One
Three
Five
Five
I only need the first output and don't know how to prevent the code from picking up the star ratings from further down the page that share the same class name.

Instead of find_all(), which creates a ResultSet, you could use find() or select_one() to select only the first occurrence of the element and then pick the last item from its list of class names:
soup.find('p', class_='star-rating').get('class')[-1]
or with css selector
soup.select_one('p.star-rating').get('class')[-1]
In newer code, also avoid the old syntax findAll() and use find_all() instead - for more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
import requests

url = 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')

print(soup.find('p', class_='star-rating').get('class')[-1])
Output
Two
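If you want to be extra sure you only pick up the book's own rating and not the ratings further down the page, you can also scope the selector to the main product block. A minimal sketch, assuming the class name product_main taken from the books.toscrape.com page markup:
from bs4 import BeautifulSoup
import requests

url = 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# scope the selector to the main product block (class name assumed from the page source)
rating = soup.select_one('div.product_main p.star-rating').get('class')[-1]
print(rating)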

Related

Python Beautifulsoup get previous element using find_all_previous

I would like to extract certain amounts under specific categories; for example, I would like to scrape '(2)募入決定額' under the categories '6.価格競争入札について' and '7.非競争入札について'.
But the structure is a little tricky, as there is no hierarchy between these elements.
The website I use is:
https://www.mof.go.jp/jgbs/auction/calendar/nyusatsu/resul20211101.htm
I tried the following code, but nothing prints out.
rows = soup.findAll('span')
for cell in r:
    if "募入決定額" in cell:
        a = rows[0].find_all_previous("td")
        for i in a:
            print(a.get('text'))
I'd much appreciate any help!
In newer code, avoid the old syntax findAll() and use find_all() or select() with CSS selectors instead - for more, take a minute to check the docs.
You could select all <td> that contain 募入決定額 and, from each, the nearest sibling <td> that contains a <span>:
soup.select('td:-soup-contains("募入決定額") ~ td>span')
To get its category, iterate over all previous <tr> elements:
[x.td.text for x in e.find_all_previous('tr') if x.td.span][0]
Read more about bs4 and CSS selectors in the docs and at developer.mozilla.org.
Example
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.mof.go.jp/jgbs/auction/calendar/nyusatsu/resul20211101.htm'
soup = BeautifulSoup(requests.get(base_url).content, 'html.parser')

for e in soup.select('td:-soup-contains("募入決定額") ~ td>span'):
    print(e.text)
    # or
    print([x.td.text for x in e.find_all_previous('tr') if x.td.span][0], e.text)
Output
2兆1,205億円
4億8,500万円
4,785億円
or
6. 2兆1,205億円
7. 4億8,500万円
8. 4,785億円
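If you want each category and its amount together in one structure, you can combine the two expressions above. A minimal sketch, assuming the same page layout as in the example:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.mof.go.jp/jgbs/auction/calendar/nyusatsu/resul20211101.htm'
soup = BeautifulSoup(requests.get(base_url).content, 'html.parser')

results = {}
for e in soup.select('td:-soup-contains("募入決定額") ~ td>span'):
    # the category label comes from the nearest previous row whose first <td> contains a <span>
    category = [x.td.text for x in e.find_all_previous('tr') if x.td.span][0]
    results[category] = e.text

# keys are the category cells, values are the amounts
print(results)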

How do you get specific HTML text from a bs4 parsed response?

I am using this code to parse some content from a URL:
response = requests.get(url)
cnbeta_article_content = BeautifulSoup(response.content, "html.parser").find("div", {"class": "cnbeta-article-body"})
return cnbeta_article_content.contents
But cnbeta_article_content.contents is a list. How do you get the plain HTML of the cnbeta-article-body element from the URL? cnbeta_article_content.text is not the original HTML.
Does cnbeta_article_content.prettify() render what you expect?
You are getting multiple results for the class, so you will have to work out which one to pick. If possible, use a unique selector for the specific element, or extract it from the current list (cnbeta_article_content.contents).
Go to the website and find out the position of the element you want for the class you mentioned (since you are getting multiple elements with the same class, work out which position your expected element sits at). Then you can get the text like this:
cnbeta_article_content.contents[4].text
Here index 4 means the fifth element (zero-based indexing).
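If what you actually want is the raw markup rather than the text, a bs4 Tag can be turned back into HTML directly. A minimal sketch, assuming the page still has a single div with class cnbeta-article-body; the article URL below is hypothetical:
import requests
from bs4 import BeautifulSoup

url = 'https://www.cnbeta.com.tw/articles/tech/example.htm'  # hypothetical article URL
response = requests.get(url)
body = BeautifulSoup(response.content, 'html.parser').find('div', {'class': 'cnbeta-article-body'})

outer_html = str(body)               # the element including its own <div> tag
inner_html = body.decode_contents()  # only the markup inside the element
print(inner_html)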

How to get all tag elements with the class "href--2RDqa"?

I'm trying to get all tag elements with the class href--2RDqa.
SpanishDict - Definition of 'que'
The ideal result would be like:
(keh)
conjunction
pronoun
but my current code only catches a single result for 'part of speech'.
Current search result for que:
(keh)
conjunction
Code:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.spanishdict.com/translate/"
search_keyword = input("input the keyword : ")
url = base_url + search_keyword
spanishdict_r = requests.get(url)
spanishdict_soup = BeautifulSoup(spanishdict_r.text, 'html.parser')
# Phonetic Alphabet
print(spanishdict_soup.find("span", {"id": "dictionary-link-es"}).text)
# Part of Speech
print(spanishdict_soup.find("a", {"class": "href--2RDqa"}).text)
# Meaning
I have tried changing soup.find to soup.findAll in the # Part of Speech part, but I got an AttributeError.
AttributeError: ResultSet object has no attribute 'text'. You're
probably treating a list of elements like a single element. Did you
call find_all() when you meant to call find()?
Please help!
Thank you.
The difference between the .find() method and the .findAll() method is that the first one returns a bs4.element.Tag object type while the latter returns a bs4.element.ResultSet. Each element of your bs4.element.ResultSet is a bs4.element.Tag. Thus, you need to iterate over it:
[element.text for element in spanishdict_soup.findAll("a", {"class": "href--2RDqa"})]
which renders
['conjunction', 'pronoun', 'pronoun', 'conjunction']
When you use soup.findAll (better to use soup.find_all as it's in snake_case), you are returned with a ResultSet, which is basically like a list. The error helpfully points that out. Therefore, to extract the text, you'll need to iterate over every element of the list.
Also, given that you can have multiple tags for the same part of speech, you can convert the resulting list into a set based on the .text of each tag to remove duplicates:
part_of_speech = set([x.text for x in spanishdict_soup.find_all("a", {"class": "href--2RDqa"})])
for part in part_of_speech:
    print(part)
Additionally, as the OP asked, to preserve the order you could use a dict (insertion-ordered in Python 3.7 and higher), which will act as a pseudo-set:
part_of_speech = dict.fromkeys([x.text for x in spanishdict_soup.find_all("a", {"class": "href--2RDqa"})]).keys()
for part in part_of_speech:
    print(part)
Or, if you're using a version of Python below 3.7, then from collections import OrderedDict and using that instead of dict will also do the job.
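For completeness, a minimal sketch of the OrderedDict variant mentioned above, using the same URL pattern as the question:
from collections import OrderedDict

import requests
from bs4 import BeautifulSoup

url = "https://www.spanishdict.com/translate/que"
spanishdict_soup = BeautifulSoup(requests.get(url).text, "html.parser")

# deduplicate while keeping the order in which the parts of speech appear on the page
parts = [x.text for x in spanishdict_soup.find_all("a", {"class": "href--2RDqa"})]
part_of_speech = OrderedDict.fromkeys(parts).keys()

for part in part_of_speech:
    print(part)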

How to get text which has no HTML tag | Add multiple delimiters in split

The following code selects the div element with class ajaxcourseindentfix, splits its text on Prerequisite, and gives me all the content after the prerequisite.
div = soup.select("div.ajaxcourseindentfix")[0]
" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]
My div can have not only prerequisite but also the following splitting points:
Prerequisites
Corerequisite
Corerequisites
Now, whenever the page has Prerequisite, the code above works fine, but whenever any of the other three appears instead, it fails and gives me the whole text.
Is there a way to use multiple delimiters in the split? Or how else do I solve this?
Sample pages:
Corequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96106&show
Prerequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show
Both: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show
[Old Thread] - How to get text which has no HTML tag
This code solves your problem unless you need XPath specifically. I would also suggest reviewing the BeautifulSoup documentation for the methods I've used.
.next_element and .next_sibling can be very useful in these cases.
With .next_elements we get a generator, which we either have to convert to a list or consume in a way that works with generators.
from bs4 import BeautifulSoup
import requests

url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show'
makereq = requests.get(url).text
soup = BeautifulSoup(makereq, 'lxml')

whole = soup.find('td', {'class': 'custompad_10'})
# we select the whole table cell (td); not strictly needed in this case
thedivs = whole.find_all('div')
# list of all divs and the elements within them
title_h3 = thedivs[2]
# we pick the third div (index 2) and save it in a variable
mytitle = title_h3.h3
# using .h3 we traverse to the child <h3> element
mylist = list(mytitle.next_elements)
# title_h3.h3 is still part of the tree, so we save all the following elements
the_text = mylist[3]
# we can then select specific elements
# from the generator that we've converted into a list (i.e. list(...))
prequisite = mylist[6]
which_cpsc = mylist[8]
other_text = mylist[11]

print(the_text, ' is the text')
print(which_cpsc, other_text, ' is the cpsc and other text')
# this is for testing purposes
This solves both issues: we don't have to use CSS selectors or those awkward list manipulations. Everything is organic and works well.
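If all you need is the multiple-delimiter split itself, the standard library's re.split accepts an alternation pattern. A minimal sketch, using one of the sample pages from the question; the exact labels in the pattern are assumptions and may need adjusting to the wording on the page:
import re

import requests
from bs4 import BeautifulSoup

# the sample page from the question that has both labels
url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
div = soup.select("div.ajaxcourseindentfix")[0]

text = " ".join(div.stripped_strings)
# split on any of the labels; adjust the alternatives to the exact wording used on the page
after = re.split(r"Prerequisites?: |Co-?requisites?: ", text)[-1]
print(after)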

BeautifulSoup won't parse Article element

I'm working on parsing this web page.
I've got table = soup.find("div",{"class","accordions"}) to get just the fixtures list (and nothing else). However, now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">.
However for some reason when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that {"role","article"} is a set, not a dictionary, so findAll doesn't treat it as an attribute filter. Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, this is also the same:
matches = soup.find_all('article', role='article')
But both these methods return some extra article tags that don't contain the Arsenal fixtures. So, if you want to find them using /fixture/arsenal, you can use CSS selectors. (A plain string match with find_all won't work, as you need a partial match.)
matches = soup.select('article[about^=/fixture/arsenal]')
print(len(matches))
# 13
Also, have a look at the keyword arguments. It'll help you get what you want.
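As a variant of the keyword-argument approach, find_all also accepts a compiled regular expression as an attribute value, which gives the same partial match without a CSS selector. A minimal sketch, building the soup the same way as above:
import re

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')

# match <article> tags whose about attribute starts with /fixture/arsenal
matches = soup.find_all('article', about=re.compile(r'^/fixture/arsenal'))
print(len(matches))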
