Python BeautifulSoup: get previous element using find_all_previous

I would like to extract certain amounts within specific categories. For example, I would like to scrape '(2)募入決定額' (roughly, "amount of bids accepted") under the categories '6.価格競争入札について' ("price-competitive auction") and '7.非競争入札について' ("non-competitive auction").
The structure is a little tricky, because these elements are not nested under their category; there is no hierarchy between them.
The website I use is:
https://www.mof.go.jp/jgbs/auction/calendar/nyusatsu/resul20211101.htm
I tried the following code, but nothing is printed.
rows = soup.findAll('span')
for cell in r:
    if "募入決定額" in cell:
        a=rows[0].find_all_previous("td")
        for i in a:
            print(a.get('text'))
Any help is much appreciated!

In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
You could select every <td> that contains 募入決定額 and, from there, the <span> inside a following sibling <td>:
soup.select('td:-soup-contains("募入決定額") ~ td>span')
To get the category it belongs to, iterate over all previous <tr> elements and take the first one whose first <td> contains a <span>:
[x.td.text for x in e.find_all_previous('tr') if x.td.span][0]
Read more about CSS selectors in the bs4 docs and on MDN (developer.mozilla.org).
Example
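import requests
from bs4 import BeautifulSoup
base_url = 'https://www.mof.go.jp/jgbs/auction/calendar/nyusatsu/resul20211101.htm'
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(requests.get(base_url).content, 'html.parser')
for e in soup.select('td:-soup-contains("募入決定額") ~ td>span'):
    # amount only
    print(e.text)
    # or: nearest preceding category number together with the amount
    print([x.td.text for x in e.find_all_previous('tr') if x.td.span][0],e.text)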
Output
2兆1,205億円
4億8,500万円
4,785億円
or
6. 2兆1,205億円
7. 4億8,500万円
8. 4,785億円
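The same selection can be tried offline. The following is a minimal sketch under an assumed, simplified table layout: category rows whose first <td> holds a <span> with the category number, and data rows whose first <td> is empty. The markup and the amounts are invented for illustration and are not taken from the real page. It also assumes a soupsieve version recent enough to support :-soup-contains().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: rows are flat siblings, not nested under a category.
html = """
<table>
  <tr><td><span>6.</span></td><td>価格競争入札について</td><td></td></tr>
  <tr><td></td><td>(1)応募額</td><td><span>100億円</span></td></tr>
  <tr><td></td><td>(2)募入決定額</td><td><span>60億円</span></td></tr>
  <tr><td><span>7.</span></td><td>非競争入札について</td><td></td></tr>
  <tr><td></td><td>(2)募入決定額</td><td><span>5億円</span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for e in soup.select('td:-soup-contains("募入決定額") ~ td > span'):
    # find_all_previous() walks backwards through the document, nearest first,
    # so the first row whose first <td> has a <span> is the enclosing category.
    category = next(
        tr.td.get_text(strip=True)
        for tr in e.find_all_previous("tr")
        if tr.td is not None and tr.td.span is not None
    )
    results.append((category, e.get_text(strip=True)))

print(results)
```

The explicit `is not None` checks guard against rows without a <td>, which would otherwise raise an AttributeError on `tr.td.span`.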

Related

How to extract class name as string from first element only?

I'm new to Python and have been using this piece of code to get the class name as text for my CSV, but I can't make it extract only the first one. Do you have any idea how to do this?
for x in book_url_soup.findAll('p', class_="star-rating"):
    for k, v in x.attrs.items():
        review = v[1]
        reviews.append(review)
        del reviews[1]
        print(review)
The URL is: http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html
The output is:
Two
Two
One
One
Three
Five
Five
I only need the first output and don't know how to prevent the code from also picking up the star ratings further down the page that share the same class name.
Instead of find_all(), which returns a ResultSet of every match, you could use find() or select_one() to select only the first occurrence of the element, and then pick the last item from its list of class names:
soup.find('p', class_='star-rating').get('class')[-1]
or with a CSS selector:
soup.select_one('p.star-rating').get('class')[-1]
In newer code, also avoid the old findAll() syntax and use find_all() instead. For more, take a minute to check the docs.
Example
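from bs4 import BeautifulSoup
import requests
url = 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'
page = requests.get(url).text
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(page, 'html.parser')
# find() returns only the first matching <p>; the rating is the last class name
print(soup.find('p', class_='star-rating').get('class')[-1])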
Output
Two
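To see the difference between find() and find_all() without a network request, here is a small sketch. The markup is a hypothetical stand-in for the page (one main rating followed by ratings of other books that reuse the same class); it is not copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> is the book's own rating,
# the later ones belong to other products further down the page.
html = """
<div class="product_main"><p class="star-rating Two"></p></div>
<ol>
  <li><p class="star-rating One"></p></li>
  <li><p class="star-rating Five"></p></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match; "class" is a list such as ['star-rating', 'Two'].
first_rating = soup.find("p", class_="star-rating").get("class")[-1]

# find_all() returns every match, which is why the original loop printed them all.
all_ratings = [p["class"][-1] for p in soup.find_all("p", class_="star-rating")]

print(first_rating)
print(all_ratings)
```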

Beautifulsoup4, BS4, Python Parsing Question

I am parsing a webpage using bs4. There is more than one kind of data I would like to select, and they share the same class name.
My parsing code:
rows_ranking = soup_ranking.select('#current-poll tbody tr .left')
The page I want to parse has two different ".left" cells in each table row. How can I choose which one I want? Here is an example of two of these cells (one I would like my program to parse, the other I would like to ignore):
1 - <td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
2 - <td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
As you can see they have the same class identifier. Is there a way I can have bs4 look only for the first of the two?
I hope my question makes sense, thanks in advance!
I haven't used BS4 or Python for a while, but if I remember correctly, something like this should get all <td> elements whose data-stat attribute is school_name. Note that the attribute name is written with a hyphen, data-stat, exactly as in the HTML.
results = soup.find_all("td", {"data-stat" : "school_name"})
Or, if you want every <td> that has a data-stat attribute, whatever its value, use:
results = soup.find_all("td", {"data-stat" : True})
You have a couple of options:
1 - Use soup.find_all and loop through your results.
2 - Use a CSS selector that matches only the first element.
3 - Inspect the element in your browser and copy its selector.
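As a sketch of these approaches, the two cells from the question can be told apart by their data-stat attribute rather than by class. The wrapping table markup below is an assumption reconstructed from the selector in the question; only the two <td> lines come from the original post.

```python
from bs4 import BeautifulSoup

html = """
<table id="current-poll"><tbody><tr>
<td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
<td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
</tr></tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute filter with find_all(): the key must match the HTML attribute name.
by_find_all = [td.get_text() for td in soup.find_all("td", attrs={"data-stat": "school_name"})]

# The same filter as a CSS attribute selector, keeping the original path.
by_select = [
    td.get_text()
    for td in soup.select('#current-poll tbody tr td.left[data-stat="school_name"]')
]

# select_one() returns only the first ".left" match in document order.
first_left = soup.select_one("#current-poll tbody tr .left").get_text()

print(by_find_all, by_select, first_left)
```

With several rows, the attribute filter is likely the more robust choice, since select_one() only ever returns the first match in the whole table.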

BeautifulSoup won't parse Article element

I'm working on parsing this web page.
I've got table = soup.find("div",{"class","accordions"}) to get just the fixtures list (and nothing else), but now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag: <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">
However, for some reason, when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse <article> tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that {"role","article"} is a set, not a dictionary, so findAll does not read it as an attribute name with a value. Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, this is also the same:
matches = soup.find_all('article', role='article')
But both of these methods also return some extra article tags that aren't Arsenal fixtures. So, if you want to find them using /fixture/arsenal, you can use CSS selectors. (find_all with a plain string won't work here, as you need a partial match.)
matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
Also, have a look at the keyword arguments section of the bs4 docs. It'll help you get what you want.
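The approaches in this thread can be compared offline with a minimal sketch. The markup is a hypothetical reduction of the fixtures page: the first about value is the one quoted in the question, while /news/example-story is invented to stand in for a non-fixture article. The regular-expression variant is an addition that shows one way to get a partial match from find_all() itself.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="accordions">
  <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">Stoke City</article>
  <article role="article" about="/news/example-story">Not a fixture</article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", class_="accordions")

# Attributes passed as a dictionary: matches every article with role="article".
by_role = table.find_all("article", attrs={"role": "article"})

# CSS "starts with" attribute selector; the value is quoted because it contains slashes.
by_prefix = table.select('article[about^="/fixture/arsenal"]')

# find_all() also accepts a compiled regular expression as an attribute value.
by_regex = table.find_all("article", about=re.compile(r"^/fixture/arsenal"))

print(len(by_role), len(by_prefix), len(by_regex))
```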

Python 3 Beautiful Soup Data type incompatibility issue

Hello there stack community!
I'm having an issue that I can't seem to resolve since it looks like most of the help out there is for Python 2.7.
I want to pull a table from a webpage and then just get the linktext and not the whole anchor.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = 'http://www.craftcount.com/category.php?cat=5'
html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")
## This bit captures the input from the previous sequence
results=[]
for link in alltables:
    rows = link.findAll('a')
    ## Find just the names
    top100 = re.findall(r">(.*?)<\/a>",rows)
    print(top100)
When I run it, I get: "TypeError: expected string or buffer"
Up to the second to the last line, it does everything correctly (when I swap out 'print(top100)' for 'print(rows)').
As an example of the response I get:
<a href="...">michellechangjewelry</a>
And I just need to get:
michellechangjewelry
According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone out there knew how to do that. As an additional issue, it looks like most people like to go the other way, that is, from having the full text and only wanting the URL part.
Finally, I'm using BeautifulSoup out of "convenience", but I'm not beholden to it if you can suggest a better package to narrow down the parsing to the linktext.
Many thanks in advance!!
BeautifulSoup results are not strings; they are mostly Tag objects, which is why re.findall() raises a TypeError when given them.
To get the text of the <a> tags, use the .string attribute:
for table in alltables:
    link = table.find('a')
    top100 = link.string
    print(top100)
This finds the first <a> link in a table. To find all text of all links:
for table in alltables:
    links = table.find_all('a')
    top100 = [link.string for link in links]
    print(top100)
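Here is a self-contained sketch of the same idea that runs without fetching the page. The table markup and href values are made up for illustration. It also shows get_text() as an alternative: .string is None when a link contains nested tags, whereas get_text() joins all the text inside the link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the href values are placeholders, not real shop URLs.
html = """
<table>
  <tr><td><a href="/shop/example-1">michellechangjewelry</a></td></tr>
  <tr><td><a href="/shop/example-2"><b>another</b> shop</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")  # a ResultSet of Tag objects, not a string

# .string works only when the tag has a single text child.
strings = [a.string for a in links]

# get_text() concatenates all text inside the tag, nested tags included.
texts = [a.get_text() for a in links]

print(strings)
print(texts)
```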

Pull Tag Value using BeautifulSoup

Can someone direct me as to how to pull the value of a tag attribute using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case, you can find the tag by using its CSS class and then extract its contents (this returns the text "Fun Text", not the title attribute):
from bs4 import BeautifulSoup
soup=BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the details necessary
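For bs4 (rather than the version 3.2.1 mentioned in the question), a minimal sketch of reading the title attribute itself might look like the following. It uses the span from the question. The name data-missing is a made-up attribute, used only to show how .get() behaves when an attribute is absent.

```python
from bs4 import BeautifulSoup

html = '<span title="Funstuff" class="thisClass">Fun Text</span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", class_="thisClass")

# Attributes are read like dictionary keys on the Tag.
title = span["title"]

# The text between the tags is a separate thing from the attributes.
text = span.get_text()

# .get() returns None instead of raising KeyError for a missing attribute.
missing = span.get("data-missing")

print(title, text, missing)
```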
