I am trying to pull some financial data for city governments using BeautifulSoup (I had to convert the files from PDF). I just want to get the data as a CSV file and then analyze it in Excel or SAS. My problem is that I do not want to print the &nbsp; that is in the original HTML, just the numbers and the row heading. Any suggestions on how I can do this without using regex?
Below is a sample of the HTML I am looking at. Next is my code (currently just in proof-of-concept mode; I need to prove I can get clean data before moving on). I'm new to Python and programming, so any help is appreciated.
<TD class="td1629">Investments (Note 2)</TD>
<TD class="td1605"> </TD>
<TD class="td479"> </TD>
<TD class="td1639">-</TD>
<TD class="td386"> </TD>
<TD class="td116"> </TD>
<TD class="td1634">2,207,592</TD>
<TD class="td479"> </TD>
<TD class="td1605"> </TD>
<TD class="td1580">2,207,592</TD>
<TD class="td301"> </TD>
<TD class="td388"> </TD>
<TD class="td1637">2,882,018</TD>
CODE
import htmllib
import urllib
import urllib2
import re
from BeautifulSoup import BeautifulSoup
CAFR = open("C:/Users/snown/Documents/CAFR2004 BFS Statement of Net Assets.html", "r")
soup = BeautifulSoup(CAFR)
assets_table = soup.find(True, id="page_27").find(True, id="id_1").find('table')
rows = assets_table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
It converts &nbsp; and other HTML entities to appropriate characters.
To write it to a csv file:
>>> import csv
>>> import sys
>>> csv_file = sys.stdout
>>> writer = csv.writer(csv_file, delimiter="|")
>>> soup = BeautifulSoup("<tr><td>1<td>&nbsp;<td>3",
... convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> writer.writerows([''.join(t.encode('utf-8') for t in td(text=True))
... for td in tr('td')] for tr in soup('tr'))
1| |3
I've used t.encode('utf-8') because &nbsp; is translated to the non-ASCII U+00A0 (no-break space) character.
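If you would rather avoid the UTF-8 encoding altogether, a minimal sketch (assuming the same BeautifulSoup 3 soup and csv writer objects from the snippet above) is to swap the no-break space for a regular space after extraction:
rows = []
for tr in soup('tr'):
    row = []
    for td in tr('td'):
        # Join the cell's text nodes, then turn U+00A0 into a plain
        # space so the cell stays ASCII; strip() drops it entirely.
        cell = u''.join(td(text=True))
        row.append(cell.replace(u'\xa0', u' ').strip())
    rows.append(row)
writer.writerows(rows)
For the sample row above this writes 1||3, i.e. the filler cells come out empty rather than as no-break spaces.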
I seem to have hit a wall and I am looking for some help/guidance.
I am trying to extract data from an HTML page - I can extract the text or the image file alone, but not together.
Within the HTML file there are multiple occurrences of a heading and the associated text:
Example:
<h2>Builder ind=BOB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image117.png" width=997 height=601>
<h2>Builder ind=ROB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- EXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image118.png" width=997 height=601>
In the example above I am trying to extract the text contained within the h2 tags and the associated img src tags and export them to a CSV file.
This is the code I have for extracting the image sources:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
fname = r'C:\TEMP\PAGE.htm'
html = open(fname)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile('.png')})
for image in images:
    print(image['src'] + '\n')
How would I go about looping through the file, extracting both the heading text and the image source, and exporting them to a file?
In the final output I am trying to achieve the following in a csv file:
ind=BOB,image117.png
ind=ROB,image118.png
The output that I get currently is:
gfx/image117.png
gfx/image118.png
Try this approach:
from bs4 import BeautifulSoup
import re
fname = r'C:\TEMP\PAGE.htm'
html = open(fname)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile('.png')})
headings = bs.find_all('h2')
for i in range(len(images)):
    print(headings[i].text.split(" ")[1]+", "+images[i]['src'])
Output:
ind=BOB, gfx/image117.png
ind=ROB, gfx/image118.png
Or, if you want to store your output in a CSV file, you should try this approach:
from bs4 import BeautifulSoup
import re
import csv
fname = 'PAGE.htm'
html = open(fname)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile('.png')})
headings = bs.find_all('h2')
with open('data.csv', 'w') as file:
    writer = csv.writer(file)
    for i in range(len(images)):
        #headingPlusImage = list(headings[i].text.split(" ")[1]+", "+images[i]['src'])
        heading = headings[i].text.split(" ")[1]
        image = images[i]['src']
        print(heading, ",", image)
        writer.writerow([heading, image])
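Note that indexing headings and images in parallel assumes every h2 is followed by exactly one image. A variation I would suggest (not part of the answer above) pairs each heading with the first img that follows it, which keeps the rows aligned even if a block has no image:
with open('data.csv', 'w', newline='') as file:  # newline='' avoids blank lines on Windows
    writer = csv.writer(file)
    for heading in bs.find_all('h2'):
        # Walk forward from the heading to the next <img> in the document.
        img = heading.find_next('img')
        if img is not None:
            writer.writerow([heading.text.split(" ")[1], img['src']])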
from bs4 import BeautifulSoup
html = """
<h2>Builder ind=BOB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image117.png" width=997 height=601>
<h2>Builder ind=ROB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- EXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image118.png" width=997 height=601>
"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.findAll("h2"):
    print("Text: {}, Image: {}".format(
        item.text, item.find_next("img").get("src")))
Output:
Text: Builder ind=BOB, Image: gfx/image117.png
Text: Builder ind=ROB, Image: gfx/image118.png
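If the goal is exactly the ind=BOB,image117.png format from the question, a short follow-up sketch (my addition; it reuses the soup object above and uses os.path.basename to strip the gfx/ directory) would be:
import csv
import os

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in soup.findAll("h2"):
        ind = item.text.split(" ")[1]                            # "ind=BOB"
        image = os.path.basename(item.find_next("img")["src"])   # "image117.png"
        writer.writerow([ind, image])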
I scrape the page like this:
s1 = bs4DerivativePage.find_all('table', class_='not-clickable zebra')
With output:
[<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
I need to retrieve the value from Financieringsniveau.
The following gives a match:
finNiveau = re.search('Financieringsniveau', LineIns1)
However I need the numerical value 135,05188. How does one do this?
You can use .findNext()
Ex:
from bs4 import BeautifulSoup
s = """<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td></tr></tbody></table>"""
soup = BeautifulSoup(s, "html.parser")
print(soup.find(text="Financieringsniveau").findNext("td").text)  # Search by text, then use findNext
Output:
135,05188
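Since the goal is a numerical value, one possible follow-up (my addition, assuming the Dutch number format shown, with a decimal comma) converts the extracted string to a float:
# Strip any dot thousands separators, then swap the decimal comma for a dot.
value_text = soup.find(text="Financieringsniveau").findNext("td").text
value = float(value_text.replace(".", "").replace(",", "."))
print(value)  # 135.05188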
Assuming that the data-stream-id attribute value is unique (in combination with the table tag), you can use CSS selectors and avoid re. This is a fast retrieval method.
from bs4 import BeautifulSoup
html = '''
<table class="not-clickable zebra" data-price-format="{price}" data-quote-detail="0" data-stream-id="723288" data-stream-quote-option="Standard">
<tbody><tr>
<td><strong>Stop loss-niveau</strong></td>
<td>141,80447</td>
<td class="align-left"><strong>Type</strong></td>
<td>Turbo's</td>
</tr>
<tr>
<td><strong>Financieringsniveau</strong></td>
<td>135,05188</td>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('table[data-stream-id="723288"] td:nth-of-type(6)').text)
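If you would rather not count cell positions, newer BeautifulSoup releases (4.7+, which pull in soupsieve) also support selecting by text; a sketch under that assumption:
# :-soup-contains matches the <td> holding the label; "+ td" then
# selects the adjacent cell with the value (needs bs4 4.7+ / soupsieve).
cell = soup.select_one('td:-soup-contains("Financieringsniveau") + td')
print(cell.text)  # 135,05188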
Let's say I have an HTML table like this:
<tr>
<td class="Klasse gerade">12A<br></td>
<td class="Stunde gerade">4<br></td>
<td class="Fach gerade">GEO statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">603<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
<tr>
<td class="Klasse gerade">10A<br></td>
<td class="Stunde gerade">2<br></td>
<td class="Fach gerade">MA statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">406<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
If I parse the HTML in Python (2.7) with:
link = "http://www.test.com/vplan.html"
f = urllib.urlopen(link)
vplan = f.read()
print vplan
how can I do this: if a td equals 10A, then print the complete tr containing 10A?
Sorry for the bad formulation, but in my opinion this is the easiest way to explain my question. Don't be confused by the German words (I'm German).
You need an HTML parser like BeautifulSoup. Assuming the table in question is the only one or the first one in the document, the program may look like this:
#!/usr/bin/env python
import urllib
from bs4 import BeautifulSoup
def main():
    link = 'http://www.test.com/vplan.html'
    soup = BeautifulSoup(urllib.urlopen(link), 'lxml')
    table = soup.find('table')
    rows = [x.find_parent('tr') for x in table.find_all(text='10A')]
    for row in rows:
        for cell in row.find_all('td'):
            print cell.text
        print '-' * 10

if __name__ == '__main__':
    main()
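One caveat worth adding (my note, not part of the answer): text='10A' only matches strings that are exactly '10A'. If a cell might contain extra whitespace or surrounding text, a more forgiving sketch, assuming the same table object, is:
# Match cells whose stripped text contains "10A" rather than
# requiring an exact string match.
rows = [td.find_parent('tr')
        for td in table.find_all('td')
        if '10A' in td.get_text(strip=True)]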
I am a beginner in Python and BeautifulSoup and I am trying to make a web scraper. However, I am facing some issues and can't figure out a way out. Here is my issue:
This is part of the HTML from which I want to scrape:
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>
Now, I want to extract "Charizard" from the first table row and "Mega Charizard X" from the second row. Right now, I am able to extract "Charizard" from both rows.
Here is my code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.html"), "lxml")
poke_boxes = soup.findAll('a', attrs = {'class': 'ent-name'})
for poke_box in poke_boxes:
    poke_name = poke_box.text.strip()
    print(poke_name)
import bs4
html = '''<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>'''
soup = bs4.BeautifulSoup(html, 'lxml')
in:
[tr.get_text(strip=True) for tr in soup('tr')]
out:
['Charizard', 'CharizardMega Charizard X']
You can use get_text() to concatenate all the text in the tag; strip=True will strip all the whitespace in the string.
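As a side note (my addition), get_text() also accepts a separator argument, which keeps the two names from running together as in 'CharizardMega Charizard X':
# A space separator keeps nested strings visually distinct.
[tr.get_text(" ", strip=True) for tr in soup('tr')]
# ['Charizard', 'Charizard Mega Charizard X']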
You'll need to change your logic to go through the rows and check whether the small element exists; if it does, print out that text, otherwise print out the anchor text as you are now.
soup = BeautifulSoup(html, 'lxml')
trs = soup.findAll('tr')
for tr in trs:
    smalls = tr.findAll('small')
    if smalls:
        print(smalls[0].text)
    else:
        poke_box = tr.findAll('a')
        print(poke_box[0].text)
I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers after class="numeric" to a specific variable. I want to do this conditionally on the string in the td just before it (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
    abc = c.string
But of course it returns the string "abc" and not the number in the tag afterwards.
So basically my question is how to address the string after class="numeric" conditional on the string beforehand.
Thanks for your help!!!
Once you find the correct td, which I presume is what you meant to have in place of a, then get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h, "html.parser")
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling("td", class_="numeric"))
If the numeric td is always next, you can just call find_next_sibling():
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling())
For your input both would give you:
<td class="numeric">50,00%</td>
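To land the percentage in a variable rather than printing the whole tag (a small addition of mine), take .text on the sibling:
for td in soup.find_all("td", text="abc"):
    # .text pulls out just the cell contents, e.g. "50,00%".
    abc = td.find_next_sibling("td", class_="numeric").text
    print(abc)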
If I understand your question correctly, and if I assume your HTML code will always follow your sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    result[table_columns[0].text] = table_columns[1].text
print result  #### {u'xyz': u'5,00%', u'abc': u'50,00%', u'ghf': u'2,50%'}
You eventually get a dictionary whose keys are 'xyz', 'abc', etc., and whose values are the strings in the class="numeric" cells.
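So the conditional lookup from the question (a usage note of mine) is just a dictionary access on a label:
# Read the numeric string for one specific row label.
abc = result['abc']
print abc  # 50,00%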
So, as I understand your question, you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))
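An equivalent that sidesteps the whitespace siblings entirely (my variation, not the answer's) uses find_next_sibling, which skips over text nodes:
for tr in soup.find_all("tr"):
    first_td = tr.td
    # find_next_sibling("td") ignores the whitespace strings between
    # tags, so no double next_sibling is needed.
    print((first_td.string, first_td.find_next_sibling("td").string))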