python BeautifulSoup find span id name without using string\re methods

I'm trying to get the id name of my span tags.
<td vAlign="top" colSpan="2"><IMG height="25" src="images/spacer.gif" width="1"><br>
<!--start table details-->
<table cellSpacing="1" cellPadding="5" width="100%" bgColor="#a18c42" border="0" id="compDetails">
<tr bgColor="white">
<td class="rowName" noWrap>מספר תאגיד:</td>
<td width="100%" colSpan="3"><span id="lblCompanyNumber">520000472</span></td>
</tr>
<tr bgColor="white">
<td class="rowName" noWrap>שם תאגיד (עברית):</td>
<td width="50%"><span id="lblCompanyNameHeb">חברת החשמל לישראל בעמ</span></td>
<td class="rowName" noWrap>שם תאגיד (אנגלית):</td>
<td width="50%"><span id="lblCompanyNameEn"></span></td>
</tr>
<tr bgColor="white">
<td class="rowName" noWrap>סטטוס:</td>
<td width="50%"><span id="lblStatus">פעילה</span></td>
<td class="rowName" noWrap>סוג תאגיד:</td>
<td width="50%"><span id="lblCorporationType">חברה ציבורית</span></td>
</tr>
<tr bgColor="white">
<td class="rowName" noWrap>סוג חברה ממשלתית:</td>
<td width="50%"><span id="lblGovCompanyType">חברה ממשלתית</span></td>
<td class="rowName" noWrap>סוג מגבלות:</td>
<td width="50%"><span id="lblLimitType">מוגבלת</span></td>
Let's say htmlSpan contains the HTML above:
soup = BeautifulSoup(htmlSpan, fromEncoding="windows-1255")  # I want to use windows-1255, not UTF-8
spans = soup('span', limit=30)
This is the output:
[<span class="mainTitle">╫¿╫⌐╫¥ ╫פ╫ק╫ס╫¿╫ץ╫¬</span>,
<span class="subTitle">╫ñ╫¿╫ר╫ש
╫ק╫ס╫¿╫פ/╫⌐╫ץ╫¬╫ñ╫ץ╫¬</span>,
<span id="lblCompanyNumber">514568245</span>,
<span id="lblCompanyNameHeb">╫£╫ס╫ש╫נ ╫נ╫ש╫á╫ר╫ע╫¿╫ª╫ש╫פ ╫ץ╫á╫ש╫¬╫ץ╫ק ╫₧╫ó╫¿╫¢╫
ץ╫¬ ╫ס╫ó"╫₧</span>,
<span id="lblCompanyNameEn">LAVI INTEGRATION &SYSTEM; ANALYSIS LTD</span>,
<span id="lblStatus">╫ñ╫ó╫ש╫£╫פ</span>,
<span id="lblCorporationType">╫ק╫ס╫¿╫פ ╫ñ╫¿╫ר╫ש╫¬</span>,
<span id="lblGovCompanyType">╫ק╫ס╫¿╫פ ╫£╫נ ╫₧╫₧╫⌐╫£╫¬╫ש╫¬</span>,
<span id="lblLimitType">╫₧╫ץ╫ע╫ס╫£╫¬</span>,
<span id="lblStatusMafera"><b><font color="Red"></font></b></span>,
<span id="lblMaferaDate"></span>,
<span id="lblStatusMafera1"><b><font color="Red"></font></b></span>,
<span id="lblCountry">╫ש╫⌐╫¿╫נ╫£</span>,
<span id="lblCity">╫ק╫ף╫¿╫פ</span>,
<span id="lblStreet">╫פ╫£╫£ ╫ש╫ñ╫פ</span>,
<span id="lblStreetNumber">34</span>,
<span id="lblZipCode">38424</span>,
<span id="lblPOB"></span>,
<span id="lblLocatedAt"></span>,
<span id="lblCompanyGoal">╫£╫ó╫í╫ץ╫º ╫ס╫¢╫£ ╫ó╫ש╫í╫ץ╫º ╫ק╫ץ╫º╫ש</span>,
<span id="lblCompanyDesc"></span>,
<span id="lblDochShana"></span>]
I know how to get the span content, but I can't get the span id name ('lblStatus', for example).
How can I get it with BeautifulSoup's methods?
I'm also having trouble saving the span contents without BeautifulSoup converting the charset to UTF-8 (or gibberish). In the end I need to save the span id name and content into a CSV file, and I'm having UTF-8 problems with it.
Thanks

I can't get the span id name ('lblStatus' for ex').
Using spans as set by your own code:
for span in spans:
    # .get() returns None for spans that have no id (e.g. the title spans)
    print(span.get('id'))
I'm also having trouble saving the spans content without BeautifulSoup converting to utf8 or gibberish
I could not replicate this: the output of spans for me is not gibberish, but the same characters as in the HTML. Are you sure the page you are trying to parse is encoded in "windows-1255"? Do you have a proper UTF-8 encoding declaration (# -*- coding: UTF-8 -*-) in your Python file?
UTF-8 is pretty much the standard in Python nowadays, and BeautifulSoup converts documents to Unicode internally and writes UTF-8 by default. My suggestion would be to work in UTF-8 in all your code and change encoding (if you truly need to) only when you output or dump data.
in the end I need to save the the span id name and content into a csv...
This is just a rough idea that you should tweak as per your need:
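import csv
# 'with' closes the file even if a row fails to write
with open('output.csv', 'w') as file_:
    writer = csv.writer(file_)
    for span in spans:
        # .get() avoids a KeyError on spans without an id
        writer.writerow([span.get('id'), span.string])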
...and I'm having utf8 problems with it.
Could you specify what your problems are? On my system (GNU/Linux) it works just fine.
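As a rough illustration of the CSV idea on a current setup, here is a minimal sketch assuming Python 3 and the bs4 package (the question itself uses the older BeautifulSoup 3 API). The sample HTML is a shortened stand-in for the page in the question, and write_spans_csv is a hypothetical helper name, not part of any library.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption).
HTML = """
<table>
<tr><td><span id="lblCompanyNumber">520000472</span></td></tr>
<tr><td><span id="lblStatus">פעילה</span></td></tr>
<tr><td><span id="lblCompanyNameEn"></span></td></tr>
</table>
"""


def write_spans_csv(html, fileobj):
    """Write one (id, text) row per <span> that has an id attribute."""
    soup = BeautifulSoup(html, "html.parser")
    writer = csv.writer(fileobj)
    for span in soup.find_all("span", id=True):
        # get_text() yields "" for an empty span, where .string would be None.
        writer.writerow([span["id"], span.get_text(strip=True)])


# In-memory demo; for a real file, something like
# open("output.csv", "w", newline="", encoding="utf-8") would be passed instead.
buffer = io.StringIO()
write_spans_csv(HTML, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In Python 3 the text stays as Unicode until it is written, so the output encoding is chosen once, in the open() call, rather than inside the parsing code.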

You can access a tag's attributes by treating the tag like a dict, keyed by attribute name:
for span in spans:
    print(span['id'])
gives what you want: lblCompanyNumber lblCompanyNameHeb lblCompanyNameEn lblStatus lblCorporationType lblGovCompanyType lblLimitType...
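The dict-style lookup can be sketched as follows, assuming Python 3 and bs4 rather than the BeautifulSoup 3 calls used in the question. The three-span sample is made up for illustration; it includes one span without an id, like the title spans in the question's output.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first span has no id attribute.
HTML = """
<span class="mainTitle">title</span>
<span id="lblCompanyNumber">520000472</span>
<span id="lblStatus"></span>
"""

soup = BeautifulSoup(HTML, "html.parser")

# id=True keeps only the spans that actually carry an id attribute.
span_ids = [span["id"] for span in soup.find_all("span", id=True)]
print(span_ids)

# .get() returns None instead of raising KeyError when the attribute is missing.
all_ids = [span.get("id") for span in soup.find_all("span")]
print(all_ids)
```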
I'm also having trouble saving the spans content into a csv without BeautifulSoup converting (charset) it to utf8 (or gibberish)
mac's answer to use decode() is correct. This is unrelated to sys.getdefaultencoding(), which defaults to 'ascii'; that setting doesn't matter here.

Related

BeautifulSoup how to only return class objects

I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
I have used the code below, but it also picks up the text from the first tr, which has no class, and I need to ignore that row:
soup.findAll('tr').text
Also, when I try to match on just a class, this doesn't seem to be valid Python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.

To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output :
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
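An alternative to excluding the first row by position is to keep only the rows that have a class attribute at all, which may be closer to what the question asks for. This is a minimal sketch assuming bs4 on Python 3; the HTML is a shortened copy of the sample in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's sample (assumption: two classed rows suffice).
HTML = """
<div class='product'>
<table>
<tr><td>random stuff here</td></tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
</table>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# class_=True matches any <tr> that has a class attribute, whatever its value.
texts = [tr.get_text(strip=True) for tr in soup.find_all("tr", class_=True)]
print(texts)
```

Note that find_all returns a list, so .text has to be read from each row rather than from the result itself.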

How to parse HTML file in .TXT format (un-tabbed) in Python?

I have encountered a problem in my programming that has me stumped.
I'm trying to access data stored in a wealth of old HTML files that were saved as text. However, when they were saved, the HTML lost its indentation, tabs, hierarchy, whatever you wish to call it. An example can be found below.
......
<tr class="ro">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_RevenueFromContractWithCustomerExcludingAssessedTax', window );">Net sales</a></td>
<td class="nump">$ 123,897<span></span>
</td>
<td class="nump">$ 122,136<span></span>
</td>
<td class="nump">$ 372,586<span></span>
</td>
<td class="nump">$ 360,611<span></span>
</td>
</tr>
<tr class="re">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherIncome', window );">Membership and other income</a></td>
<td class="nump">997<span></span>
</td>
<td class="nump">1,043<span></span>
</td>
<td class="nump">3,026<span></span>
</td>
<td class="nump">3,465<span></span>
</td>
</tr>
<tr class="rou">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Revenues', window );">Total revenues</a></td>
<td class="nump">124,894<span></span>
</td>
<td class="nump">123,179<span></span>
</td>
<td class="nump">375,612<span></span>
</td>
<td class="nump">364,076<span></span>
</td>
</tr>
I would typically employ Beautiful Soup here and parse the data that way, but I've not found a good workflow, since technically there is no hierarchy here; I can't tell BS to look within anything other than the document itself, which is huge and might be far too time-consuming to search (see next statement).
I also need a thorough solution rather than a quick fix, because I have hundreds, if not thousands, of these HTML-to-text files to parse.
So my question is: if I want to return, in all the files, the first number for "Membership and other income" (997 in this case), how could I go about doing that?
Two sample files can be found here:
(https://www.sec.gov/Archives/edgar/data/1800/0001104659-18-065076.txt) (https://www.sec.gov/Archives/edgar/data/1084869/0001437749-18-020205.txt)
EDIT - 4/16
Thanks for the replies everyone! I've written some code that returns the tags I'm looking for.
import requests
from bs4 import BeautifulSoup
data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')
# load the data
soup = BeautifulSoup(data.text, 'html.parser')
# get the data
for tr in soup.find_all('tr', {'class':['rou','ro','re','reu']}):
    db = [td.text.strip() for td in tr.find_all('td')]
    print(db)
The problem is that there are a TON of results, and most contain nothing of use. Is there a way to filter based on these tags' grandparent? I've tried the same approach as above using head, title, body, etc., but I can't quite get BS to identify the FILENAME:
<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
**<FILENAME>R2.htm**
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
<head>
<title></title>
.....removed for brevity
</head>
<body>
.....removed for brevity
<td class="text"> <span></span>
</td>
.....removed for brevity
</tr>

Just so you are aware, HTML does not care about indentation. If you really wanted to, you could put it all on one line with no spaces in between. An HTML parser only looks at the structure of the tags.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all is a method, so call it with parentheses, then index the result list
soup.find_all('<tag you are looking for>')[0]
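To make the point concrete for the "Membership and other income" row, here is a minimal sketch assuming bs4 on Python 3. The HTML is a trimmed copy of the rows in the question, and first_value_for is a hypothetical helper; it assumes the label is the exact text of the link in the row's first cell and that numeric cells use the "nump" class, as they do in the sample.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two rows from the question (attributes shortened).
HTML = """
<tr class="ro">
<td class="pl "><a class="a" href="javascript:void(0);">Net sales</a></td>
<td class="nump">$ 123,897<span></span>
</td>
<td class="nump">$ 122,136<span></span>
</td>
</tr>
<tr class="re">
<td class="pl "><a class="a" href="javascript:void(0);">Membership and other income</a></td>
<td class="nump">997<span></span>
</td>
<td class="nump">1,043<span></span>
</td>
</tr>
"""


def first_value_for(html, label):
    """Return the first numeric cell in the row whose link text equals label."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", string=label)
    if link is None:
        return None
    row = link.find_parent("tr")
    cell = row.find("td", class_="nump") if row else None
    return cell.get_text(strip=True) if cell else None


print(first_value_for(HTML, "Membership and other income"))
```

Starting from the label and walking up to its row avoids scanning every tr in the file, which is one way to cut down the number of irrelevant results.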

Beautiful Soup Parse Python

I've captured the following html using BS4, but can't seem to search for the artist tag.
I've assigned this block of code to a variable called container, and then tried
print container.tr.td["artist"]
without luck.
Any advice appreciated?
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">kool as the gang</td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>

Your syntax is wrong: "artist" is the value of the "class" attribute, not an attribute name. Try this:
from bs4 import BeautifulSoup
html = """
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">
kool as the gang </td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td',{'class': 'artist'})
print (td.text.strip())
Outputs:
kool as the gang

Another way.
Look for the element within container whose class is 'artist' with the select method. select returns a list because several elements could match; since you know there is only one here, take the first item in the list and read its text attribute.
>>> HTML = open('sven.htm').read()
>>> import bs4
>>> container = bs4.BeautifulSoup(HTML, 'lxml')
>>> container.select('.artist')[0].text
'\n kool as the gang '
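A slightly more defensive variant of the same idea, sketched under the assumption of bs4 on Python 3 with a trimmed copy of the row from the question: select_one returns a single element (or None), and get_text(strip=True) removes the surrounding whitespace seen in the output above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question.
HTML = """
<tr class="item">
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">
kool as the gang </td>
<td class="venue">100 club</td>
</tr>
"""

container = BeautifulSoup(HTML, "html.parser")

# select_one returns the first match, or None when nothing matches.
cell = container.select_one("td.artist")
artist = cell.get_text(strip=True) if cell is not None else None
print(artist)

# Subscripting a tag reads an attribute; class is multi-valued, so it is a list.
print(cell["class"])
```

The second print shows why container.tr.td["artist"] fails: the subscript looks up an attribute named "artist", and the tag only has an attribute named "class".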

Access value in BeautifulSoup4

I have made an HTML request from which I would like to retrieve specific elements, but I don't know how to access them with BeautifulSoup4.
Here is an example of the returned html:
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
I would like to access the element AF118324 (which is the name after the Identifiers span class).
How could I access it? (without using a substring method of course)

Does this work for you?
from bs4 import BeautifulSoup
html = '''
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
obj = soup.find('span', string='Identifiers:').next_sibling  # next_sibling is the bs4 name for nextSibling
print(obj)
Which prints:
AF118324[sampleid]
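If all three labelled fields are needed, the same sibling lookup can be repeated for every span. This is only a sketch, assuming bs4 on Python 3 and that each value is the text node immediately following its span, as in the sample; the record dict and its key cleanup are illustrative choices, not something the original answer proposes.

```python
from bs4 import BeautifulSoup

HTML = """
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
"""

soup = BeautifulSoup(HTML, "html.parser")

record = {}
for span in soup.find_all("span", class_="recordAttribute"):
    # Some labels carry the colon inside the span, others right after it.
    label = span.get_text(strip=True).rstrip(":")
    value = span.next_sibling  # the text node that follows the span
    if isinstance(value, str):
        record[label] = value.strip().lstrip(":").strip()

print(record["Identifiers"])
```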

Use beautiful soup to find elements by textual contents, not text?

Similar to the use of .renderContents in the question "Beautiful Soup [Python] and the extracting of text in a table", I want to search by that value.
Sample HTML:
<table>
<tr>
<td>
This is garbage
</td>
<td>
<td class="thead" style="font-weight:normal">
<!-- status icon and date -->
<a name="post1"><img class="inlineimg" src="img.gif" alt="Old" border="0" title="Old"></a>
19-11-2010, 04:25 PM
<!-- / status icon and date -->
</td>
<td>
This is garbage
</td>
</tr>
</table>
What I tried:
soup.find_all("td", text = re.compile('(AM|PM)'))[0].get_text().strip()
However, the text parameter of find_all seems to not work for this application: IndexError: list index out of range
What do I need to do?

Don't specify the tag name at all and let it find the desired text node. Works for me:
soup.find(text=re.compile('(AM|PM)')).strip()
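The attempt in the question likely fails because, when a tag name is combined with a text filter, Beautiful Soup compares the pattern against the tag's .string, and .string is None for a td that has several children (comments, a link, and text). A minimal sketch of the text-node approach, assuming bs4 on Python 3 and a simplified copy of the sample table:

```python
import re

from bs4 import BeautifulSoup

# Simplified copy of the sample table from the question.
HTML = """
<table><tr>
<td>This is garbage</td>
<td class="thead" style="font-weight:normal">
<!-- status icon and date -->
<a name="post1"><img class="inlineimg" src="img.gif" alt="Old"></a>
19-11-2010, 04:25 PM
<!-- / status icon and date -->
</td>
</tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the text node first, then climb to the enclosing <td> if it is needed.
node = soup.find(string=re.compile(r"\b(AM|PM)\b"))
timestamp = node.strip()
cell = node.find_parent("td")

print(timestamp)
print(cell["class"])
```

The word boundaries in the pattern are an added precaution against matching "AM" or "PM" inside a longer word; the original pattern works on this sample as well.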
