How to select some urls with BeautifulSoup? - python

I want to scrape the following information, except the last row and the class="Region" cells:
...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="left">New York</td>
<td bgcolor="" align="left" class="Region">N/A</td>
<td bgcolor="" align="left">1,863</td>
<td bgcolor="" align="left">565</td>
<td bgcolor="" align="left">1,133</td>
<td bgcolor="" align="left">$160,000</td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class="small" bgcolor="#FFFFFF">
...
I tested with this handler:
class TestUrlOpen(webapp.RequestHandler):
    def get(self):
        soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
        link_list = []
        for a in soup.findAll('a', href=True):
            link_list.append(a["href"])
        self.response.out.write("""<p>link_list: %s</p>""" % link_list)
This works, but it also gets the "View Profile" links, which I don't want:
link_list: [u'http://www.ilrg.com/', u'http://www.ilrg.com/', u'http://www.ilrg.com/nations/', u'http://www.ilrg.com/gov.html', ......]
I can easily remove the unwanted u'http://www.ilrg.com/' entries after scraping the site, but it would be nice to have a list without them in the first place. What is the best way to do this? Thanks.

I think this may be what you are looking for. The attrs argument can be helpful for isolating the sections you want.
from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
rows = soup.findAll(name='tr', attrs={'class': 'small'})
for row in rows:
    number = row.find('td').text
    tds = row.findAll(name='td', attrs={'align': 'left'})
    link = tds[0].find('a')['href']
    firm = tds[0].text
    office = tds[1].text
    attorneys = tds[3].text  # tds[2] is the class="Region" cell, which is skipped
    partners = tds[4].text
    associates = tds[5].text
    salary = tds[6].text
    print number, firm, office, attorneys, partners, associates, salary
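If the goal is still the original link_list (the firm URLs only, without the "View Profile" links), a minimal sketch along the same lines, assuming the same row and cell structure as above, would be:
# Hypothetical sketch: collect only the href from the firm cell of each row,
# so the "/nlj250/firmDetail/..." profile links are never picked up.
from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
link_list = []
for row in soup.findAll('tr', attrs={'class': 'small'}):
    tds = row.findAll('td', attrs={'align': 'left'})
    if not tds:
        continue  # skip rows without left-aligned data cells
    a = tds[0].find('a', href=True)
    if a:
        link_list.append(a['href'])
print link_list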

I would get each tr in the table with class=listings. Your search is obviously too broad for the information you want. Because HTML has structure, you can easily get just the table data. This is easier in the long run than getting all hrefs and filtering out the ones you don't want. BeautifulSoup has plenty of documentation on how to do this: http://www.crummy.com/software/BeautifulSoup/documentation.html
Not exact code, but something along these lines:
for tr in soup.findAll('tr'):
    data_list = tr.findAll('td')
    data_list[0].text  # 7
    data_list[1].text  # White and Case
    data_list[2].text  # New York
    data_list[3].text  # Region <-- ignore this
    # etc

Related

Get a specific HTML table row in Python

Let's say I have an HTML table like this:
<tr>
<td class="Klasse gerade">12A<br></td>
<td class="Stunde gerade">4<br></td>
<td class="Fach gerade">GEO statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">603<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
<tr>
<td class="Klasse gerade">10A<br></td>
<td class="Stunde gerade">2<br></td>
<td class="Fach gerade">MA statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">406<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
If I parse the HTML in Python (2.7) with:
link = "http://www.test.com/vplan.html"
f = urllib.urlopen(link)
vplan = f.read()
print vplan
How can I do this: if a td contains "10A", then print the complete tr that contains it?
Sorry for the bad formulation, but in my opinion this is the easiest way to explain my question. Don't be confused by the German words (I'm German).
You need an HTML parser like BeautifulSoup. Assuming the table in question is the only one or the first one in the document, the program may look like this:
#!/usr/bin/env python
import urllib
from bs4 import BeautifulSoup

def main():
    link = 'http://www.test.com/vplan.html'
    soup = BeautifulSoup(urllib.urlopen(link), 'lxml')
    table = soup.find('table')
    rows = [x.find_parent('tr') for x in table.find_all(text='10A')]
    for row in rows:
        for cell in row.find_all('td'):
            print cell.text
        print '-' * 10

if __name__ == '__main__':
    main()
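If the page contains several tables, or "10A" could also appear in other columns, a more targeted sketch, assuming the cell classes from your sample ("Klasse gerade" etc.) and the same hypothetical URL, would be to match on the class column only:
# Hypothetical variant: only look at the "Klasse" cell of each row.
import urllib
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen('http://www.test.com/vplan.html'), 'lxml')
for row in soup.find_all('tr'):
    klasse = row.find('td', class_='Klasse')  # matches class="Klasse gerade"
    if klasse and klasse.get_text(strip=True) == '10A':
        print ' | '.join(td.get_text(strip=True) for td in row.find_all('td'))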

How to use BeautifulSoup for parsing data in the following example?

I am a beginner in Python and BeautifulSoup and I am trying to make a web scraper. However, I am facing some issues and can't figure out a way out. Here is my issue:
This is part of the HTML I want to scrape from:
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>
Now, I want to extract "Charizard" from 1st table row and "Mega Charizard X" from the second row. Right now, I am able to extract "Charizard" from both rows.
Here is my code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("data.html"), "lxml")
poke_boxes = soup.findAll('a', attrs={'class': 'ent-name'})
for poke_box in poke_boxes:
    poke_name = poke_box.text.strip()
    print(poke_name)
import bs4
html = '''<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a></td>
</tr>
<tr>
<td class="num cell-icon-string" data-sort-value="6">
<td class="cell-icon-string"><a class="ent-name" href="/pokedex/charizard" title="View pokedex for #006 Charizard">Charizard</a><br>
<small class="aside">Mega Charizard X</small></td>
</tr>'''
soup = bs4.BeautifulSoup(html, 'lxml')
in:
[tr.get_text(strip=True) for tr in soup('tr')]
out:
['Charizard', 'CharizardMega Charizard X']
You can use get_text() to concatenate all the text in the tag; strip=True will strip the whitespace from each piece of text.
You'll need to change your logic to go through the rows and check whether the small element exists; if it does, print out that text, otherwise print out the anchor text as you are now.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
trs = soup.findAll('tr')
for tr in trs:
    smalls = tr.findAll('small')
    if smalls:
        print(smalls[0].text)
    else:
        poke_box = tr.findAll('a')
        print(poke_box[0].text)
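If you want both pieces of information per row (the base name from the anchor and, when present, the variant name from the small tag), a small sketch along the same lines, reading the same data.html as in the question, might be:
# Hypothetical sketch: pair the anchor text with the optional <small> text.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("data.html"), "lxml")
for tr in soup.find_all('tr'):
    anchor = tr.find('a', class_='ent-name')
    if anchor is None:
        continue  # skip rows without a Pokémon link
    small = tr.find('small', class_='aside')
    base = anchor.text
    variant = small.text if small else None
    print((base, variant))  # ('Charizard', None), then ('Charizard', 'Mega Charizard X')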

Python Beautiful Soup find string and extract following string

I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers after class="numeric" to a specific variable. I want to do this conditional on the string above the class statement (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
    abc = c.string
But of course it returns the string "abc" and not the number in the tag afterwards.
So basically my question is how to address the string after class="numeric" conditional on the string beforehand.
Thanks for your help!!!
Once you find the correct td (which I presume is what you meant to have in place of a), get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h)
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling("td", class_="numeric"))
If the numeric td is always next you can just call find_next_sibling():
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling())
For your input both would give you:
<td class="numeric">50,00%</td>
If I understand your question correctly, and if I assume your html code will always follow your sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    result[table_columns[0].text] = table_columns[1].text
print result  # {u'xyz': u'5,00%', u'abc': u'50,00%', u'ghf': u'2,50%'}
You eventually get a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings from the class="numeric" cells.
So as I understand your question you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))

python parse specific data on html table using lxml and xpath

First of all I am new to python and Stack Overflow so please be kind.
This is the source code of the html page I want to extract data from.
Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168
The table is at the bottom of the page
<html>
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>
</td>
<td>Guldhedens IK - IF Warta</td>
<td>Guldheden Södra 1 Konstgräs </td>
</tr>
<tr class="clTrEven">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>
</td>
<td>Romelanda UF - IK Virgo</td>
<td>Romevi 1 Gräs </td>
</tr>
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>
</td>
<td>Kode IF - IK Kongahälla</td>
<td>Kode IP 1 Gräs </td>
</tr>
<tr class="clTrEven">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>
</td>
<td>Floda BoIF - Partille IF FK </td>
<td>Flodala IP 1 </td>
</tr>
</tbody>
</table>
</html>
I need to extract the time (19:30) and the team names (Guldhedens IK - IF Warta), i.e. the first and second table cells (not the third) from the first table row, then 13:00 / Romelanda UF - IK Virgo from the second table row, and so on for all the table rows there are.
As you can see, every table row has a date right before the time, so here comes the tricky part: I only want to get the time and the team names, as described above, from those table rows where the date is equal to the date on which I run this code.
The only thing I have managed to do so far is not much; I can only get the time and the team name using this code:
import lxml.html

html = lxml.html.parse("http://gbgfotboll.se/information/?scr=table&ftid=51168")
test = html.xpath("//*[@id='content-primary']/table[3]/tbody/tr[1]/td[1]/span/span//text()")
print test
which gives me the result ['2014-09-26', ' 19:30']. After this I'm lost on how to iterate through the different table rows, taking only the specific table cells where the date matches the date on which I run the code.
I hope you can answer as much as you can.
If I understood you, try something like this:
import lxml.html

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i+1)
    print html.xpath(xpath1)[1], html.xpath(xpath2)[0]
I know this is fragile and there are better solutions, but it works. ;)
Edit:
A better way, using BeautifulSoup:
from bs4 import BeautifulSoup
import requests

respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text)
l = soup.find_all('table')
t = l[2].find_all('tr')  # change this to [0] to parse the first table
for i in t:
    try:
        print i.find('span').get_text()[-5:], i.find('a').get_text()
    except AttributeError:
        pass
Edit2:
The page is not responding at the moment, but this should work:
from bs4 import BeautifulSoup
import requests

respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text)
l = soup.find_all('table')
t = l[2].find_all('tr')
time = ""
for i in t:
    try:
        dateTime = i.find('span').get_text()
        teamName = i.find('a').get_text()
        if time == dateTime[:-5]:
            print dateTime[-5:], teamName
        else:
            print dateTime, teamName
        time = dateTime[:-5]
    except AttributeError:
        pass
lxml:
import lxml.html

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
dateTemp = ""
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i+1)
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == dateTemp:
        print time, teamName
    else:
        print date, time, teamName
    dateTemp = date  # remember the date we last saw, as in the BeautifulSoup version
So thanks to @CodeNinja's help, I just tweaked it a little bit to get exactly what I wanted.
I imported time to get the date on which I run the code. Anyway, here is the code for what I wanted. Thank you for the help!!
import lxml.html
import time

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
currentDate = (time.strftime("%Y-%m-%d"))
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i+1)
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == currentDate:
        print time, teamName
So here is the FINAL version of how to do it the correct way. This will parse through all the table rows it has without using "range" in the for loop. I got this answer from my other post here: Iterate through all the rows in a table using python lxml xpath
import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'

rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)
for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print time, team
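If you want the same row filter keyed to the day the script runs (as in the earlier tweak) rather than a hard-coded date, a small variation on this, under the same assumptions about the page structure and XPath expressions, would be:
# Hypothetical variation: substitute today's date into the row filter.
import time
import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
today = time.strftime("%Y-%m-%d")

rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % today)
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)
for row in rows_xpath(html):
    print time_xpath(row)[0].strip(), team_xpath(row)[0]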

python scraping exchange rate data

An application that a friend uses depends on daily exchange rate figures from a particular site (link to source of rates).
The problem is that there is no set time when the rate changes, which is affecting business: sometimes when the rate changes she might be out, and until she comes back any transaction that happens uses the last rate entered. Sometimes she wins, other times she loses out. I'm trying to create an automated client that will scrape and update the exchange rate for her independently.
So far I have been able to strip the content of the site down to a list:
[
<td style="text-align: left;">U.S Dollar</td>,
<td>USDGHS</td>, <td>1.8673</td>, <td>1.8994</td>,
<td style="text-align: left;">Pound Sterling</td>,
<td>GBPGHS</td>, <td>3.0081</td>, <td>3.0599</td>,
<td style="text-align: left;">Swiss Franc</td>,
<td>CHFGHS</td>, <td>2.0034</td>, <td>2.0375</td>,
<td style="text-align: left;">Australian Dollar</td>,
<td>AUDGHS</td>, <td>1.9667</td>, <td>2.0009</td>,
<td style="text-align: left;">Canadian Dollar</td>,
<td>CADGHS</td>, <td>1.8936</td>, <td>1.9259</td>,
<td style="text-align: left;">Danish Kroner</td>,
<td>DKKGHS</td>, <td>0.3255</td>, <td>0.3311</td>,
<td style="text-align: left;">Japanese Yen</td>,
<td>JPYGHS</td>, <td>0.0226</td>, <td>0.0230</td>,
<td style="text-align: left;">New Zealand Dollar</td>,
<td>NZDGHS</td>, <td>1.5690</td>, <td>1.5964</td>,
<td style="text-align: left;">Norwegian Kroner</td>,
<td>NOKGHS</td>, <td>0.3307</td>, <td>0.3363</td>]
I'm now struggling a bit to create a dictionary like so:
{USDGHS: [1.8673, 1.8994], GBPGHS: [3.0081, 3.0599], ...}
I'll then use the dictionary to update the appropriate table in the database.
I got to this stage by using beautifulsoup4 and urllib2
[Edit]
Code that got me to this point
from bs4 import BeautifulSoup
import urllib2
url = "http://bog.gov.gh/data/bankindrate.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
td = soup.find_all('td')
another_soup = BeautifulSoup(td[:-3])
print another_soup
You need to first find the rows (tr tags) and use those to then get the columns (td tags):
currencies = {}
trs = soup.find_all('tr')  # find rows
for tr in trs[1:-3]:  # skip the first row and the last 3 (or whatever)
    text = list(tr.strings)  # content of all the text inside the tr (works in this case)
    # [u'U.S Dollar', u'USDGHS', u'1.8673', u'1.8994']
    currencies[text[1]] = [float(text[2]), float(text[3])]
And put those into a dictionary using the appropriate key with a value of the two numbers converted to floats...
>>> currencies
{u'USDGHS': [1.8673, 1.8994], u'JPYGHS': [0.0226, 0.023], u'CHFGHS': [2.0034, 2.0375], u'CADGHS': [1.8936, 1.9259], ...}
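To then push the figures into the database, one possible sketch is below; the sqlite3 module, the rates.db file and the table/column names are all hypothetical stand-ins for whatever the application actually uses:
# Hypothetical sketch: write the scraped rates from the currencies dict into a local table.
import sqlite3

conn = sqlite3.connect('rates.db')  # assumed database file
conn.execute("""CREATE TABLE IF NOT EXISTS rates
                (pair TEXT PRIMARY KEY, buying REAL, selling REAL)""")
for pair, (buying, selling) in currencies.items():
    conn.execute("INSERT OR REPLACE INTO rates (pair, buying, selling) VALUES (?, ?, ?)",
                 (pair, buying, selling))
conn.commit()
conn.close()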
