I am attempting to use BeautifulSoup to parse an HTML table which I uploaded to http://pastie.org/8070879 in order to get the three columns (0 to 735, 0.50 to 1.0, and 0.5 to 0.0) as lists. To explain why: I want the integers 0-735 to be keys and the decimal numbers to be values.
From reading many of the other posts on SO, I have come up with the following, which does not come close to creating the lists I want. All it does is display the text in the table, as seen here: http://i1285.photobucket.com/albums/a592/TheNexulo/output_zps20c5afb8.png
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("fide.html"))
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text + "|",
    print
I'm new to Python and BeautifulSoup, so please be gentle with me! Thanks
HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the input HTML structure. But sometimes (like in this case) that model gets in the way more than it helps. Pyparsing includes some HTML parsing features that are more robust than just using raw regexes, but otherwise work in a similar fashion, letting you define the snippets of HTML you are interested in and just ignore the rest. Here is a parser that reads through your posted HTML source:
from pyparsing import makeHTMLTags,withAttribute,Suppress,Regex,Group
""" looking for this recurring pattern:
<td valign="top" bgcolor="#FFFFCC">00-03</td>
<td valign="top">.50</td>
<td valign="top">.50</td>
and want a dict with keys 0, 1, 2, and 3 all with values (.50,.50)
"""
td,tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td,tdend,keytd = map(Suppress,(td,tdend,keytd))
realnum = Regex(r'1?\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t:int(t[0]))
DASH = Suppress('-')
# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
             Group(2*(td + realnum + tdend))("vals"))
This parser not only picks out the matching triples, it also extracts the start-end integers and the pairs of real numbers (and also already converts from string to integers or floats at parse time).
Looking at the table, I'm guessing you actually want a lookup that will take a key like 700, and return the pair of values (0.99, 0.01), since 700 falls in the range of 620-735. This bit of code searches the source HTML text, iterates over the matched entries and inserts key-value pairs into the dict lookup:
# search the input HTML for matches to the entryExpr expression, and build up lookup dict
lookup = {}
for entry in entryExpr.searchString(sourcehtml):
    for i in range(entry.start, entry.end+1):
        lookup[i] = tuple(entry.vals)
And now to try out some lookups:
# print out some test values
for test in (0,20,100,700):
    print (test, lookup[test])
prints:
0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)
I think the above answer is better than what I would offer, but I have a BeautifulSoup answer that can get you started. This is a bit hackish, but I figured I would offer it nevertheless.
With BeautifulSoup, you can find all the tags with certain attributes in the following way (assuming you have a soup.object already set up):
soup.find_all('td', attrs={'bgcolor':'#FFFFCC'})
That will find all of your keys. The trick is to associate these with the values you want, which all show up immediately afterward and which are in pairs (if these things change, by the way, this solution won't work).
Thus, you can try the following to access what follows your key entries and put those into your_dictionary:
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling
The problem is that the "next_sibling" is actually a '\n', so you have to do the following to capture the next value (the first value you want):
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling.next_sibling.string
And if you want the two following values, you have to double this:
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = [node.next_sibling.next_sibling.string, node.next_sibling.next_sibling.next_sibling.next_sibling.string]
Disclaimer: that last line is pretty ugly to me.
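If that chain of next_sibling calls feels too fragile, a slightly cleaner sketch (assuming BeautifulSoup 4 and the same table layout as above) is to ask for the next two td siblings directly; find_next_siblings skips the '\n' text nodes because it filters on the tag name:
# A sketch, not tested against the original file: same idea as above, but
# find_next_siblings('td', limit=2) skips the '\n' text nodes for us.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("fide.html"))
your_dictionary = {}
for node in soup.find_all('td', attrs={'bgcolor': '#FFFFCC'}):
    values = node.find_next_siblings('td', limit=2)  # the two value cells after the key cell
    your_dictionary[node.string] = [td.string for td in values]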
I've used BeautifulSoup 3, but it probably will work under 4.
# Import System libraries
import re
# Import Custom libraries
from BeautifulSoup import BeautifulSoup
# This may be different between BeautifulSoup 3 and BeautifulSoup 4
with open("fide.html") as file_h:
# Read the file into the BeautifulSoup class
soup = BeautifulSoup(file_h.read())
tr_location = lambda x: x.name == u"tr" # Row location
key_location = lambda x: x.name == u"td" and bool(set([(u"bgcolor", u"#FFFFCC")]) & set(x.attrs)) # Integer key location
td_location = lambda x: x.name == u"td" and not dict(x.attrs).has_key(u"bgcolor") # Float value location
str_key_dict = {}
num_key_dict = {}
for tr in soup.findAll(tr_location): # Loop through all found rows
for key in tr.findAll(key_location): # Loop through all found Integer key tds
key_list = []
key_str = key.text.strip()
for td in key.findNextSiblings(td_location)[:2]: # Loop through the next 2 neighbouring Float values
key_list.append(td.text)
key_list = map(float, key_list) # Convert the text values to floats
# String based dictionary section
str_key_dict[key_str] = key_list
# Number based dictionary section
num_range = map(int, re.split("\s*-\s*", key_str)) # Extract a value range to perform interpolation
if(len(num_range) == 2):
num_key_dict.update([(x, key_list) for x in range(num_range[0], num_range[1] + 1)])
else:
num_key_dict.update([(num_range[0], key_list)])
for x in num_key_dict.items():
print x
I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a p class of "style4", and the third value is the next item on the website with a p class of "style5". The logic I'm trying to introduce is to look for the first "style4" entry and write the associated string into the text file, then find the next "style5" entry and write its associated string into the text file. Then, look at the next p class: if it's "style4", start a new line; if it's another "style5", write it into the text file with the first style5 entry but separated with a comma (alternatively, the program could just skip the next style5).
I'm stuck on the part of the program that looks for the next p class and evaluates it against style4 and style5. Since I was having problems with finding and evaluating the p class tag, I chose to pull my code out of the loop and just try to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure what output format you are expecting. I am assuming that you are trying to get the data in the webpage in the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the td of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now, we will create a nested output dictionary with alphabets as the keys and an inner dict as the value. The inner dict has the vendor name as key and category information as its value:
output_dict = {}
current_alphabet = ""
current_vendor = ""
for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
  u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
          u'Wenger Corporation': [u'Musical Instruments and Equipment'],
          u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
          u'Witt Company': [u'Interactive Technology']
        },
  u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers']
        }
}
Skipping the earlier parts for brevity. Now it is just a matter of accessing this dictionary and writing out to a tab separated file.
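For example, a minimal sketch of that last step (assuming the output_dict built above; the file name kcda.txt and the one-line-per-category layout are just one choice) could be:
# Sketch: write one "Alphabet \t Vendor \t Category" line per category entry,
# using the nested output_dict from above. kcda.txt is just an example name.
with open('kcda.txt', 'w') as out_file:
    for alphabet, vendors in output_dict.items():
        for vendor, categories in vendors.items():
            for category in categories:
                out_file.write('\t'.join([alphabet, vendor, category]) + '\n')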
Hope this helps.
Agree with #shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv
url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)
primary_line = soup.find_all("p", {"class":["style4","style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)

with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate the style4 and style5 tags right away. I did not bother going for the style6 tags or the alphabet headers. We then get the text inside each tag. If the text is not a whitespace of sorts (these are all over the tables, probably obfuscation or bad mark-up), we then check whether it's style4 or style5. If it's the former, we assign it as a key to a blank list. If it's the latter, we append it to the blank list of the most recent key. Obviously the key only changes when we hit a new style4, so it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of the key-value pair to concatenate the list as one string. We then write it to a CSV file.
Because the dictionary is unordered, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you. That's simple enough. Hope this helps!
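If you do go tab-delimited instead, a small sketch of the change (same final_list as above; the output file name is just an example):
# Sketch: same loop as above, but let csv.writer emit tabs instead of
# joining the values with ", ".
with open("products.tsv", "wb") as ofile:
    f = csv.writer(ofile, delimiter="\t")
    for item in final_list:
        f.writerow([item] + final_list[item])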
As the title suggests, I am trying to find values in a dict within a string. This relates to my post here: Python dictionary - value
My code is something like the following:
import mechanize
from bs4 import BeautifulSoup
leaveOut = {
    'a':'cat',
    'b':'dog',
    'c':'werewolf',
    'd':'vampire',
    'e':'nightmare'
}
br = mechanize.Browser()
r = br.open("http://<a_website_containing_a_list_of_movie_titles/")
html = r.read()
soup = BeautifulSoup(html)
table = soup.find_all('table')[0]
for row in table.find_all('tr'):
    # Find all table data
    for data in row.find_all('td'):
        code_handling_the_assignment_of_movie_title_to_var_movieTitle
        if any(movieTitle.find(leaveOut[c]) < 1 for c in 'abcde'):
            do_this_set_of_instructions
        else:
            pass
I want to skip the program contained under the if block (identified above as do_this_set_of_instructions) if the string stored in movieTitle contains any of the strings (or values if you like) in the leaveOut dict.
So far, I have had no luck with any(movieTitle.find(leaveOut[c]) < 1 for c in 'abcde'): it always returns True and the do_this_set_of_instructions always executes regardless.
Any ideas?
.find() returns -1 if the substring isn't in the string that you're working on, so your any() call will return True if any of the words aren't in the title.
You may want to do something like this instead:
if any(leaveOut[c] in movieTitle for c in 'abcde'):
    # One of the words was in the title
Or the opposite:
if all(leaveOut[c] not in movieTitle for c in 'abcde'):
    # None of the words were in the title
Also, why are you using a dictionary like this? Why don't you just store the words in a list?
leave_out = ['dog', 'cat', 'wolf']
...
if all(word not in movieTitle for word in leave_out):
    # None of the words were in the title
I need to categorize this HTML page http://gnats.netbsd.org/summary/year/2012-perf.html ; I need to make a list of the top issues just from the big table. This is my code in Python. I would be really grateful if you could give me some advice.
import urllib.request
from bs4 import BeautifulSoup
# overall input
inputpage = urllib.request.urlopen("http://gnats.netbsd.org/summary/year/2012-perf.html")
page = inputpage.read()
soup = BeautifulSoup(page)
# checking tables
table = soup.findAll('table')
rows = soup.findAll('tr')
colomns = soup.findAll('td')
# inputing the lists
name = []
first = []
second = []
sum = []
# the main part
for tr in rows:
    if (tr==1):
        element = tr.split("<td>")
        name.append(element)
    elif (tr==2):
        element = tr.split("<td>")
        first.append(element)
    elif (tr==3):
        element = tr.split("<td>")
        second.append(element)

# combining the open and closed issue lists
length = len(first)
for i in range(length):
    sum = first[i] + second [i]

# printing the lists
length = len(sum)
for i in range(length):
    print (name[i] + '|' + sum[i])
BeautifulSoup has some nice methods for accessing child nodes and so on. You could for example use tables = soup.findAll('table'). Assuming you want to combine the data of the second table in the link you posted (tables[1]), you could do something like the following
names = []
cdict = {0:[], 1:[]} # dictionary of "td positions to contents"
tables = soup.findAll('table')
for tt in tables[1].find_all('tr')[1:]: # skip first <tr> since it is the header
    names.append(tt.find_all('th')[0]) # 1st column is a th with the name
    for k, v in cdict.items():
        # append the <td>text</td> of column k to the corresponding list
        v.append(tt.find_all('td')[k].text)
So, what you'll end up with is a dictionary of columns -> lists, so that each list contains the td text elements (the main reason for using a dictionary is that you may want to grab the elements from columns 1, 2 and 4, in which case you'll only need to change what is in the cdict).
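For instance, only the dictionary definition would change (a sketch; the loop above stays the same):
cdict = {1:[], 2:[], 4:[]} # grab columns 1, 2 and 4 instead of 0 and 1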
To make the sums you can do something like:
for i in xrange(len(names)):
    print names[i], int(cdict[0][i]) + int(cdict[1][i])
If you have a look at each element's methods you'll see some really nice functionality you can use to make your task easier.
Have worked in dozens of languages but new to Python.
My first (maybe second) question here, so be gentle...
Trying to efficiently convert HTML-like markdown text to wiki format (specifically, Linux Tomboy/GNote notes to Zim) and have gotten stuck on converting lists.
For a 2-level unordered list like this...
First level
    Second level
Tomboy/GNote uses something like...
<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>
However, the Zim personal wiki wants that to be...
* First level
	* Second level
... with leading tabs.
I've explored the regex module functions re.sub(), re.match(), re.search(), etc. and found the cool Python ability to code repeating text as...
count * "text"
Thus, it looks like there should be a way to do something like...
newnote = re.sub("<list>", LEVEL * "\t", oldnote)
Where LEVEL is the ordinal (occurrence) of <list> in the note. It would thus be 0 for the first <list> encountered, 1 for the second, etc.
LEVEL would then be decremented each time </list> was encountered.
<list-item> tags are converted to the asterisk for the bullet (preceded by newline as appropriate) and </list-item> tags dropped.
Finally... the question...
How do I get the value of LEVEL and use it as a tabs multiplier?
You should really use an xml parser to do this, but to answer your question:
import re
def next_tag(s, tag):
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
a = a.replace("<list-item>", "* ")
for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, 1)
a = a.replace("</list-item>", "")
a = a.replace("</list>", "")
print a
This will work for your example, and your example ONLY. Use an XML parser. You can use xml.dom.minidom (it's included in Python (2.7 at least), no need to download anything):
import xml.dom.minidom
def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tags
        for subitem in item.childNodes:
            if subitem.nodeType is xml.dom.minidom.Element.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)
a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print parseXML(a)
Output:
* First level
	* Second level
	* Second level 2
		* Third level
Use BeautifulSoup; it allows you to iterate over the tags even if they are custom. Very practical for doing this type of operation.
from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print [[ item.text for item in list_tag('list-item')] for list_tag in soup('list')]
Output : [[u'First level'], [u'Second level']]
I used a nested list comprehension but you can use a nested for loop
for list_tag in soup('list'):
    for item in list_tag('list-item'):
        print item.text
I hope that helps you.
In my example I used BeautifulSoup 3, but it should also work with BeautifulSoup 4; only the import changes:
from bs4 import BeautifulSoup
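If you also want the tab-indented Zim output rather than the nested lists, here is a small sketch along the same lines (assuming BeautifulSoup 4 and the markup shown above; the indent level is just the number of enclosing <list> tags):
# A sketch, assuming BeautifulSoup 4 and the Tomboy/GNote markup shown above:
# one "* " bullet per <list-item>, indented by one tab per enclosing <list>.
from bs4 import BeautifulSoup

tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
lines = []
for item in soup.find_all('list-item'):
    depth = len(item.find_parents('list')) - 1  # 0 for items in the outermost list
    text = item.find(text=True).strip()         # the text that comes before any nested <list>
    lines.append("\t" * depth + "* " + text)
print "\n".join(lines)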
I'm looking for the keyword "sales" and I want to get the nearest "http://www.somewebsite.com", even if there are multiple links in the file. I want the nearest link, not the first link. This means I need to search for the link that comes just before the keyword match.
This doesn't work...
regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales
What's the best way to find a link that is closest to a keyword?
It is generally much easier and more robust to use an HTML parser rather than regex.
Using the third-party module lxml:
import lxml.html as LH
content = '''<html>
<p>other stuff</p><a href="http://www.somewebsite.com">a link</a><p>sales</p>
</html>
'''
doc = LH.fromstring(content)
for url in doc.xpath('''
        //*[contains(text(),"sales")]
        /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)
yields
http://www.somewebsite.com
I find lxml (and XPath) a convenient way to express what elements I'm looking for. However, if installing a third-party module is not an option, you could also accomplish this particular job with HTMLParser from the standard library:
import HTMLParser
import contextlib
class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html>
<p>other stuff</p><a href="http://www.somewebsite.com">a link</a><p>sales</p>
</html>
'''
idx = content.find('sales')

with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
Regarding the XPath used in the lxml solution: The XPath has the following meaning:
//*                            # Find all elements
[contains(text(),"sales")]     # whose text content contains "sales"
/preceding::*                  # search the preceding elements
[starts-with(@href,"http")]    # such that it has an href attribute that starts with "http"
[1]                            # select the first such <a> tag only
/@href                         # return the value of the href attribute
I don't think you can do this one with regex alone (especially looking before the keyword match) as it has no sense of comparing distances.
I think you're best off doing something like this:
1. Find all occurrences of sales and get the substring index, called salesIndex.
2. Find all occurrences of https?://[-A-Za-z0-9./]+ and get the substring index, called urlIndex.
3. Loop through salesIndex. For each location i in salesIndex, find the closest urlIndex.
Depending on how you want to judge "closest" you may need to get the start and end indices of the sales and http... occurrences to compare. That is, find the end index of a URL that is closest to the start index of the current occurrence of sales, and find the start index of a URL that is closest to the end index of the current occurrence of sales, and pick the one that is closer.
You can use matches = re.finditer(pattern,string,re.IGNORECASE) to get a list of matches, and then match.span() to get the start/end substring indices for each match in matches.
Building on what mathematical.coffee suggested, you could try something along these lines:
import re
myString = "" ## the string you want to search
link_matches = re.finditer('(http|https)://[-A-Za-z0-9./]+',myString,re.IGNORECASE)
sales_matches = re.finditer('sales',myString,re.IGNORECASE)
link_locations = []
for match in link_matches:
    link_locations.append([match.span(), match.group()])

for match in sales_matches:
    match_loc = match.span()
    distances = []
    for link_loc in link_locations:
        if match_loc[0] > link_loc[0][1]: ## if the link is behind your keyword
            ## append the distance between the END of the link and the START of the keyword
            distances.append(match_loc[0] - link_loc[0][1])
        else:
            ## append the distance between the END of the keyword and the START of the link
            distances.append(link_loc[0][0] - match_loc[1])
    for d in range(len(distances)):
        if distances[d] == min(distances):
            print ("Closest Link: " + link_locations[d][1] + "\n")
            break
I tested out this code and it seems to be working...
import re

def closesturl(keyword, website):
    keylist = []
    urllist = []
    closest = []
    urls = []
    urlregex = "(http|https)://[-A-Za-z0-9\\./]+"
    urlmatches = re.finditer(urlregex, website, re.IGNORECASE)
    keymatches = re.finditer(keyword, website, re.IGNORECASE)
    for n in keymatches:
        keylist.append([n.start(), n.end()])
    if(len(keylist) > 0):
        for m in urlmatches:
            urllist.append([m.start(), m.end()])
    if((len(keylist) > 0) and (len(urllist) > 0)):
        for i in range(0, len(keylist)):
            closest.append([abs(urllist[0][0]-keylist[i][0])])
            urls.append(website[urllist[0][0]:urllist[0][1]])
            if(len(urllist) >= 1):
                for j in range(1, len(urllist)):
                    if((abs(urllist[j][0]-keylist[i][0]) < closest[i])):
                        closest[i] = abs(keylist[i][0]-urllist[j][0])
                        urls[i] = website[urllist[j][0]:urllist[j][1]]
                    if((abs(urllist[j][0]-keylist[i][0]) > closest[i])):
                        break # local minimum / inflection point break from url list
    if((len(keylist) > 0) and (len(urllist) > 0)):
        return urls # return website[urllist[index[0]][0]:urllist[index[0]][1]]
    else:
        return ""

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
keyword = "mykeyword"
print closesturl(keyword, somestring)
The above when run shows... http://www.secondlink.com.
If someone's got ideas on how to speed up this code, that would be awesome!
Thanks
V$H.