An application that a friend uses depends on daily exchange rate figures from a particular site (link to source of rates).
The problem is that there is no set time when the rate changes, and this is affecting business: if she is out when the rate changes, any transaction that happens before she comes back uses the last rate entered. Sometimes she wins, other times she loses out. I'm trying to create an automated client that will scrape the site and update the exchange rate for her independently.
So far I have been able to strip the content of the site down to a list:
[
<td style="text-align: left;">U.S Dollar</td>,
<td>USDGHS</td>, <td>1.8673</td>, <td>1.8994</td>,
<td style="text-align: left;">Pound Sterling</td>,
<td>GBPGHS</td>, <td>3.0081</td>, <td>3.0599</td>,
<td style="text-align: left;">Swiss Franc</td>,
<td>CHFGHS</td>, <td>2.0034</td>, <td>2.0375</td>,
<td style="text-align: left;">Australian Dollar</td>,
<td>AUDGHS</td>, <td>1.9667</td>, <td>2.0009</td>,
<td style="text-align: left;">Canadian Dollar</td>,
<td>CADGHS</td>, <td>1.8936</td>, <td>1.9259</td>,
<td style="text-align: left;">Danish Kroner</td>,
<td>DKKGHS</td>, <td>0.3255</td>, <td>0.3311</td>,
<td style="text-align: left;">Japanese Yen</td>,
<td>JPYGHS</td>, <td>0.0226</td>, <td>0.0230</td>,
<td style="text-align: left;">New Zealand Dollar</td>,
<td>NZDGHS</td>, <td>1.5690</td>, <td>1.5964</td>,
<td style="text-align: left;">Norwegian Kroner</td>,
<td>NOKGHS</td>, <td>0.3307</td>, <td>0.3363</td>]
I'm now struggling a bit to create a dictionary like so
{USDGHS: [1.8673, 1.8994], GBPGHS: [3.0081, 3.0599], ...}
I'll then use the dictionary to update the appropriate table in the database.
I got to this stage by using beautifulsoup4 and urllib2
[Edit]
Code that got me to this point
from bs4 import BeautifulSoup
import urllib2
url = "http://bog.gov.gh/data/bankindrate.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
td = soup.find_all('td')
another_soup = BeautifulSoup(td[:-3])
print another_soup
You need to first find the rows (tr tags) and use those to then get the columns (td tags):
currencies = {}
trs = soup.find_all('tr') # find rows
for tr in trs[1:-3]:  # skip first and last 3 (or whatever)
    text = list(tr.strings)  # content of all text stuff in tr (works in this case)
    # [u'U.S Dollar', u'USDGHS', u'1.8673', u'1.8994']
    currencies[text[1]] = [float(text[2]), float(text[3])]
And put those into a dictionary using the appropriate key with a value of the two numbers converted to floats...
>>> currencies
{u'USDGHS': [1.8673, 1.8994], u'JPYGHS': [0.0226, 0.023], u'CHFGHS': [2.0034, 2.0375], u'CADGHS': [1.8936, 1.9259], ...}
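If you only have the flat list of cells from the question rather than the rows, the same dictionary can be built by walking the list four cells at a time. This is a minimal sketch, assuming every currency contributes exactly four cells (name, code, and two rates); rates_from_cells and cells are hypothetical names, and in practice cells would hold the text of each td rather than hard-coded strings:

```python
def rates_from_cells(cells, width=4):
    """Group a flat list of cell texts into {code: [first_rate, second_rate]}."""
    rates = {}
    # Step through the list one row's worth of cells at a time;
    # a trailing partial group (fewer than `width` cells) is ignored.
    for i in range(0, len(cells) - width + 1, width):
        name, code, first, second = cells[i:i + width]
        rates[code] = [float(first), float(second)]
    return rates


# Hypothetical input: the text of the first eight cells shown in the question.
cells = ["U.S Dollar", "USDGHS", "1.8673", "1.8994",
         "Pound Sterling", "GBPGHS", "3.0081", "3.0599"]
print(rates_from_cells(cells))
```

Going row by row, as in the answer above, is more robust when a row can have a different number of cells; the chunking approach only works while the four-cell layout holds.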
I am a beginner in Python and BeautifulSoup and I am trying to make a web scraper. However, I am facing some issues and can't figure out a way out. Here is my issue:
This is part of the HTML that I want to scrape:
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>
Now, I want to extract "Charizard" from 1st table row and "Mega Charizard X" from the second row. Right now, I am able to extract "Charizard" from both rows.
Here is my code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.html"), "lxml")
poke_boxes = soup.findAll('a', attrs = {'class': 'ent-name'})
for poke_box in poke_boxes:
    poke_name = poke_box.text.strip()
    print(poke_name)
import bs4
html = '''<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>'''
soup = bs4.BeautifulSoup(html, 'lxml')
in:
[tr.get_text(strip=True) for tr in soup('tr')]
out:
['Charizard', 'CharizardMega Charizard X']
You can use get_text() to concatenate all the text in a tag; strip=True strips the surrounding whitespace from each string.
You'll need to change your logic to go through the rows and check to see if the small element exists, if it does print out that text, otherwise print out the anchor text as you are now.
soup = BeautifulSoup(html, 'lxml')
trs = soup.findAll('tr')
for tr in trs:
    smalls = tr.findAll('small')
    if smalls:
        print(smalls[0].text)
    else:
        poke_box = tr.findAll('a')
        print(poke_box[0].text)
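For comparison, the same row-by-row rule (prefer the small text, fall back to the anchor text) can be sketched without BeautifulSoup, using the standard library's html.parser. This is only an illustration of the logic, not a replacement for the answer above; RowNameParser and sample are hypothetical names:

```python
from html.parser import HTMLParser


class RowNameParser(HTMLParser):
    """Collect one name per <tr>: the <small> text if present, else the <a> text."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._current = None  # 'a' or 'small' while inside that tag, else None
        self._row = {}        # text seen so far in the current row, keyed by tag

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag in ("a", "small"):
            self._current = tag

    def handle_data(self, data):
        if self._current and data.strip():
            self._row[self._current] = data.strip()

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None
        elif tag == "tr" and self._row:
            # Prefer the <small> text; fall back to the anchor text.
            self.names.append(self._row.get("small", self._row.get("a")))


sample = """<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>"""

parser = RowNameParser()
parser.feed(sample)
parser.close()
print(parser.names)
```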
I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers after class="numeric" to a specific variable. I want to do this conditional on the string above the class statement (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
    abc=c.string
But of course it returns the string "abc" and not the number in the tag afterwards.
So basically my question is how to address the string after class="numeric" conditional on the string beforehand.
Thanks for your help!!!
Once you find the correct td, which I presume is what you meant to have in place of a, get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
for td in soup.find_all("td",text="abc"):
    print(td.find_next_sibling("td",class_="numeric"))
If the numeric td is always next you can just call find_next_sibling():
for td in soup.find_all("td",text="abc"):
    print(td.find_next_sibling())
For your input both would give you:
<td class="numeric">50,00%</td>
If I understand your question correctly, and if I assume your html code will always follow your sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    result[table_columns[0].text] = table_columns[1].text
print result  # for well-formed rows, e.g. {u'xyz': u'5,00%', u'abc': u'50,00%', u'ghf': u'2,50%'}
You end up with a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings in the class="numeric" cells.
So as I understand your question you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))
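Once you have the (label, value) pairs, the values are still strings such as '50,00%'. If you want real numbers in your variables, a small conversion step helps. The sketch below assumes the comma is a decimal separator and that no thousands separators occur; parse_percent, pairs and values are hypothetical names:

```python
def parse_percent(text):
    """Turn a string such as '50,00%' into the float 50.0."""
    # Drop whitespace and the trailing percent sign, then use '.' as the decimal mark.
    return float(text.strip().rstrip("%").replace(",", "."))


# Hypothetical input: the pairs produced by iterating over the rows.
pairs = [("xyz", "5,00%"), ("abc", "50,00%"), ("ghf", "2,50%")]
values = {label: parse_percent(number) for label, number in pairs}

abc = values["abc"]
print(abc)
```

Keeping the numbers in a dictionary keyed by the label avoids creating one variable per row, while still letting you pick out a single value by name.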
I'm trying to extract tags that are nested in a tr tag, but the identifier that I'm using to find the correct value is nested in another td within the tr tag.
That is, I'm using the website LoLKing and trying to scrape it for statistics based on a name, for example, Ahri.
The HTML is:
<tr>
<td data-sorttype="string" data-sortval="Ahri" style="text-align: left;">
<div style="display: table-cell;">
<div class="champion-list-icon" style="background:url(//lkimg.zamimg.com/shared/riot/images/champions/103_32.png)">
<a style="display: inline-block; width: 28px; height: 28px;" href="/champions/ahri"></a>
</div>
</div>
<div style="display: table-cell; vertical-align: middle; padding-top: 3px; padding-left: 5px;">Ahri</div>
</td>
<td style="text-align: center;" data-sortval="975"><img src='//lkimg.zamimg.com/images/rp_logo.png' width='18' class='champion-price-icon'>975</td>
<td style="text-align: center;" data-sortval="6300"><img src='//lkimg.zamimg.com/images/ip_logo.png' width='18' class='champion-price-icon'>6300</td>
<td style="text-align: center;" data-sortval="10.98">10.98%</td>
<td style="text-align: center;" data-sortval="48.44">48.44%</td>
<td style="text-align: center;" data-sortval="18.85">18.85%</td>
<td style="text-align: center;" data-sorttype="string" data-sortval="Middle Lane">Middle Lane</td>
<td style="text-align: center;" data-sortval="1323849600">12/14/2011</td>
</tr>
I'm having problems extracting the statistics, which are nested in td tags outside of the data-sortval. I imagine that I want to pull ALL the tr tags, but I don't know how to pull the tr tag based off of the one that contains the td tag with data-sortval="Ahri". At that point, I would want to step through the tr tag x times until I reach the first statistic I want, 10.98
At the moment, I'm trying to do a find for the td with data-sortval Ahri, but it doesn't return the rest of the tr.
It might be important to note that all of this is nested inside of a larger tag:
<table class="clientsort champion-list" width="100%" cellspacing="0" cellpadding="0">
<thead>
<tr><th>Champion</th><th>RP Cost</th><th>IP Cost</th><th>Popularity</th><th>Win Rate</th><th>Ban Rate</th><th>Meta</th><th>Released</th></tr>
</thead>
<tbody>
I apologize for the lack of clarity, I'm new with this scraping terminology, but I hope that makes enough sense.
Right now, I'm also doing:
main = soup.find('table', {'class':'clientsort champion-list'})
To get only that table
edit:
I typed this for the variable:
for champ in champs:
    a = str(champ)
    print type(a) is str
    td_name = soup.find('td',{"data-sortval":a})
It confirms that a is a string.
But it throws this error:
File "lolrec.py", line 82, in StatScrape
tr = td_name.parent
AttributeError: 'NoneType' object has no attribute 'parent'
GO LOL!
For commercial purposes, please read the terms of service before scraping.
(1) To scrape a list of heroes, you can do this, which follows a similar logic as you described.
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html)
# locate the cell that contains hero name: Ahri
hero_list = ["Blitzcrank", "Ahri", "Akali"]
for hero in hero_list:
    td_name = soup.find('td', {"data-sortval":hero})
    tr = td_name.parent
    popularity = tr.find_all('td', recursive=False)[3].text
    print hero, popularity
Output
Blitzcrank 12.58%
Ahri 10.98%
Akali 7.52%
(2) To scrape all the heroes.
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html)
# find the table first
table = soup.find('table', {"class":"clientsort champion-list"})
# find all the rows
for row in table.find('tbody').find_all("tr", recursive=False):
    cols = row.find_all("td")
    hero = cols[0].text.strip()
    popularity = cols[3].text
    print hero, popularity
Output:
Aatrox 6.86%
Ahri 10.98%
Akali 7.52%
Alistar 4.9%
Amumu 8.75%
...
I am trying to pull some financial data for city governments using BeautifulSoup (had to convert the files from pdf). I just want to get the data as a csv file and then I'll analyze it in Excel or SAS. My problem is that I do not want to print the "&nbsp;" that is in the original HTML, just the numbers and the row heading. Any suggestions on how I can do this without using regex?
Below is a sample of the html I am looking at. Next is my code (currently just in proof of concept mode, need to prove I can get clean data before moving on). New to Python and programming so any help is appreciated.
<TD class="td1629">Investments (Note 2)</TD>
<TD class="td1605">&nbsp;</TD>
<TD class="td479">&nbsp;</TD>
<TD class="td1639">-</TD>
<TD class="td386">&nbsp;</TD>
<TD class="td116">&nbsp;</TD>
<TD class="td1634">2,207,592</TD>
<TD class="td479">&nbsp;</TD>
<TD class="td1605">&nbsp;</TD>
<TD class="td1580">2,207,592</TD>
<TD class="td301">&nbsp;</TD>
<TD class="td388">&nbsp;</TD>
<TD class="td1637">2,882,018</TD>
CODE
import htmllib
import urllib
import urllib2
import re
from BeautifulSoup import BeautifulSoup
CAFR = open("C:/Users/snown/Documents/CAFR2004 BFS Statement of Net Assets.html", "r")
soup = BeautifulSoup(CAFR)
assets_table = soup.find(True, id="page_27").find(True, id="id_1").find('table')
rows = assets_table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
It converts &nbsp; and other HTML entities to the appropriate characters.
To write it to a csv file:
>>> import csv
>>> import sys
>>> csv_file = sys.stdout
>>> writer = csv.writer(csv_file, delimiter="|")
>>> soup = BeautifulSoup("<tr><td>1<td>&nbsp;<td>3",
...                      convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> writer.writerows([''.join(t.encode('utf-8') for t in td(text=True))
...                   for td in tr('td')] for tr in soup('tr'))
1| |3
I've used t.encode('utf-8') because &nbsp; is translated to the non-ASCII U+00A0 (no-break space) character.
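If the goal is a csv without the spacer cells at all, the cleanup can also be done on the cell strings after extraction, with no regex. The following is a rough sketch in Python 3 using only the standard library; clean_cell and raw_row are hypothetical names, and raw_row stands in for the cell texts you would pull out of each tr:

```python
import csv
import html
import io


def clean_cell(raw):
    """Decode HTML entities, then turn no-break spaces into plain spaces and trim."""
    return html.unescape(raw).replace("\xa0", " ").strip()


# Hypothetical input: cell texts from one row of the sample table.
raw_row = ["Investments (Note 2)", "&nbsp;", "&nbsp;", "-", "&nbsp;", "2,207,592"]

cells = [clean_cell(cell) for cell in raw_row]
cells = [cell for cell in cells if cell]  # drop cells that held only &nbsp;

buffer = io.StringIO()
csv.writer(buffer, delimiter="|", lineterminator="\n").writerow(cells)
print(buffer.getvalue(), end="")
```

Dropping the empty cells is a design choice: it gives compact rows, but the columns only stay aligned if every row has its spacer cells in the same positions. If that is not guaranteed, keep the empty strings instead of filtering them out.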
I want to scrape the following information except the last row and the class="Region" row:
...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="left">New York</td>
<td bgcolor="" align="left" class="Region">N/A</td>
<td bgcolor="" align="left">1,863</td>
<td bgcolor="" align="left">565</td>
<td bgcolor="" align="left">1,133</td>
<td bgcolor="" align="left">$160,000</td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class="small" bgcolor="#FFFFFF">
...
I tested with this handler:
class TestUrlOpen(webapp.RequestHandler):
    def get(self):
        soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
        link_list = []
        for a in soup.findAll('a',href=True):
            link_list.append(a["href"])
        self.response.out.write("""<p>link_list: %s</p>""" % link_list)
This works but it also gets the "View Profile" link which I don't want:
link_list: [u'http://www.ilrg.com/', u'http://www.ilrg.com/', u'http://www.ilrg.com/nations/', u'http://www.ilrg.com/gov.html', ......]
I can easily remove the "u'http://www.ilrg.com/'" after scraping the site but it would be nice to have a list without it. What is the best way to do this? Thanks.
I think this may be what you are looking for. The attrs argument can be helpful for isolating the sections you want.
from BeautifulSoup import BeautifulSoup
import urllib
soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
rows = soup.findAll(name='tr',attrs={'class':'small'})
for row in rows:
    number = row.find('td').text
    tds = row.findAll(name='td',attrs={'align':'left'})
    link = tds[0].find('a')['href']
    firm = tds[0].text
    office = tds[1].text
    attorneys = tds[3].text
    partners = tds[4].text
    associates = tds[5].text
    salary = tds[6].text
    print number, firm, office, attorneys, partners, associates, salary
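The values scraped this way are strings such as '1,863' and '$160,000'. If you want to sort or sum them, you may want to convert them to integers first. A minimal sketch, assuming the only decorations are a leading '$' and comma thousands separators; to_int, row and numbers are hypothetical names, and a value like 'N/A' would need separate handling because int() raises ValueError on it:

```python
def to_int(text):
    """Strip '$' and thousands separators from strings like '$160,000'."""
    return int(text.replace("$", "").replace(",", "").strip())


# Hypothetical input: the text of the numeric cells from the sample row.
row = {"attorneys": "1,863", "partners": "565",
       "associates": "1,133", "salary": "$160,000"}
numbers = {key: to_int(value) for key, value in row.items()}
print(numbers)
```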
I would get each tr in the table with class=listings. Your search is obviously too broad for the information you want. Because HTML has a structure, you can easily get just the table data. This is easier in the long run than getting all the hrefs and filtering out the ones you don't want. BeautifulSoup has plenty of documentation on how to do this: http://www.crummy.com/software/BeautifulSoup/documentation.html
not exact code:
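for tr in soup.findAll('tr'):
    data_list = tr.findAll('td')
    if len(data_list) < 4:
        continue  # skip rows without enough cells (e.g. header rows)
    data_list[0].text  # 7
    data_list[1].text  # White and Case
    data_list[2].text  # New York
    data_list[3].text  # Region <-- ignore this
    # etc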