Beautiful Soup: Copy a table to a new file - python

I'm trying to use Beautiful Soup to isolate a specific <table> element and put it in a new file. The table has an id, ModelTable, and I can find it using soup.select("#ModelTable") ("soup" being the imported file).
However, I'm having trouble figuring out how to get the element into a new file. Simply writing it to a new file (as in write(soup.select("#ModelTable"))) doesn't work, since select() returns a list rather than a string, and converting it with str() results in a string enclosed in brackets.
Ideally I'd like to be able to export the isolated element after running it through .prettify() so that I can get a good HTML file right off the bat. I know I must be missing something obvious... any hints?

You need to iterate over the list returned by select(). Your question also taught me that BS4's .select uses CSS selectors, which is fantastic.
with open('file_output.html', 'w') as f:
    for tag in soup.select("#ModelTable"):
        f.write(tag.prettify())
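If you expect only one match, newer versions of BS4 also provide select_one, which returns the first matching tag directly (or None); a minimal sketch:
# select_one returns the first matching tag, or None if nothing matches
table = soup.select_one("#ModelTable")
if table is not None:
    with open('file_output.html', 'w') as f:
        f.write(table.prettify())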

Related

Extract certain values from a line with BeautifulSoup

For a school course we are learning advanced Python, to get a first idea about web scraping and similar topics. I got an exercise where I have to extract the values v1 and v2 from the following line of an HTML page. I tried looking this up but couldn't find anything really specific. If this is inappropriate for SO, just delete it.
The HTML part
{"v1":"first","ex":"first_soup","foo":"0","doo":"0","v1":["second"]}
so afterwards when i want to show the values it should look like
print(v1)
first
print(v2)
second
I tried to get the values just by slicing the whole line like this:
v1 = htmltext[7:12]
v2 = htmltext[60:66]
but in this case I am not using the bs4 module, which we are recommended to use. I would be very grateful if someone could teach me.
What you are seeing there is not HTML but JSON. In this case it makes no sense to use BeautifulSoup's HTML parser; you may want to use the standard json library instead, like so:
import json
json_Dict = json.loads(str(soup))
Then you can index it using the keys:
json_Dict["v1"]
>>>"first"
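For completeness, a self-contained sketch of the whole idea, assuming the second key in the line really is v2 (as the expected output suggests):
import json

# the raw line from the question, with the second key assumed to be "v2"
raw = '{"v1":"first","ex":"first_soup","foo":"0","doo":"0","v2":["second"]}'

data = json.loads(raw)
print(data["v1"])     # first
print(data["v2"][0])  # second (the value is a list, so index into it)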

Extracting HTML tag content with xpath from a specific website

I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h

xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, the print commands return empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated, since the page loads normally with JavaScript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
which yields:
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
select the text of the child span, /span[@class='company']/text(), to get the company name
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag; we can simply select any descendant text using //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
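Putting it together, a minimal sketch against the locally saved file from the question (assuming the saved copy actually contains the jobHeader markup):
import lxml.html as h

# parse the locally saved copy of the page, as in the question
tree = h.parse("Temp/IndeedPosition.html")

company = tree.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = tree.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")

print(company)   # e.g. ['The Habitat Company']
print(position)  # e.g. ['Janitor-A (Scattered Sites)']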
Despite your assumption, it seems that the content on the page is loaded dynamically and is thus not present at load time.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, try looking for job-content in the actual file on your computer, which will only contain placeholders and descriptors).
It seems you would have to use technologies like Selenium to perform this task.
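A rough sketch of the Selenium route (the URL here is a hypothetical placeholder, and the selectors are my assumptions based on the markup discussed above):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.indeed.com/viewjob?jk=example")  # hypothetical job URL

# let the browser render the page, then query the live DOM
header = driver.find_element(By.CSS_SELECTOR, "div[data-tn-component='jobHeader']")
company = header.find_element(By.CSS_SELECTOR, "span.company").text
position = header.find_element(By.CSS_SELECTOR, "b.jobtitle").text
print(company, position)
driver.quit()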
Again, I want to stress that whatever you are doing automatically is likely a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.

Scraping Text from table using Soup / Xpath / Python

I need help in extracting data from: http://agmart.in/crop.aspx?ccid=1&crpid=1&sortby=QtyHigh-Low
Using the filter, there are about 4 pages of data (under rice crops) in tables that I need to store.
I'm not quite sure how to proceed. I've been reading all the documentation I can find, but as someone who just started Python, I'm very confused at the moment. Any help is appreciated.
Here's a code snippet I'm basing it on:
Example website: http://www.uscho.com/rankings/d-i-mens-poll/
from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())
for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print
I can't seem to understand any of the code above. I only understood that the URL is being read. :(
Thank you for any help!
Just like we have CSS selectors like .window or #rankings, XPath is used to navigate through elements and attributes in XML.
So in the for loop, you're first searching for an element called section, with the condition that it has an attribute id whose value is "rankings". But remember, you are not done yet. This section also contains the heading "Final USCHO.com Division I Men's Poll", the date, and extra elements in the table. Here there is only one such element, so the loop runs only once. Inside it, you're extracting the text (everything within the tags) of h1 (the heading) and h3 (the date).
The next part extracts the table tag, with a condition on each row's class: it can be even or odd. Because you need all the rows in this table, that condition is not doing anything useful here.
You could replace the line
for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
with
for row in section.xpath('table/tr'):
Now when we are inside the loop, row.xpath('td') returns each td element, i.e. each cell in that row. When you iterate over them, you'll receive multiple cell elements, e.g. 1, Providence, 49, 26-13-2, 997, 15. Check the first row in the webpage table.
Try this for yourself. Replace the last loop block with this much easier to read alternative:
for row in section.xpath('table/tr'):
    print row.xpath('td//text()')
You will see that it presents all the table data as Python lists, each list item containing one cell. Your code is just a fancier way to write these list items converted into a string with spaces between them. The xpath() method returns objects of Element type, which are representations of each XML/HTML element, while xpath('something//text()') produces the actual text content within that tag.
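For reference, a Python 3 translation of the same loop might look like this sketch (urllib2 became urllib.request, and print is now a function):
from urllib.request import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())
for section in tree.xpath('//section[@id="rankings"]'):
    for row in section.xpath('table/tr'):
        # each row comes back as a plain list of cell texts
        print(row.xpath('td//text()'))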
Here are a few helpful references:
Easy-to-understand tutorial: http://www.w3schools.com/xpath/xpath_examples.asp
Stack Overflow question: Extract text between tags with XPath including markup
Another tutorial: http://www.tutorialspoint.com/xpath/

How to properly replace the contents of a text file

I am trying to make an offline copy of this website: ieeghn. Part of this task is to download all CSS/JS being referred to, using Beautiful Soup, and modify any external link to point to the newly downloaded resource.
At the moment I simply use the string replace method. But I don't think this is effective, as I do this inside a loop; snippet below:
local_content = ''
for res in soup.findAll('link', {'rel': 'stylesheet'}):
    if not str(res['href']).startswith('data:'):
        original_res = res['href']
        res['href'] = some_function_to_download_css()
        local_content = local_content.replace(original_res, res['href'])
I only save non-embedded resources, i.e. those whose href does not start with data:. But the problem is that local_content = local_content.replace(original_res, res['href']) leaves me only able to modify one external resource into a local one; the rest still refer to the online version of the resource.
I am guessing that because local_content is a very long string (have a look at the ieeghn source), this didn't work out well.
How do you properly replace the content of a string for a given pattern?
Or do I have to store it to a file first and modify it there?
EDITED
I found the problem was in this line of code:
original_res = res['href']
BeautifulSoup will somehow sanitize the href string. In my case, &amp; in the original HTML is changed to & in the parsed attribute. As I am trying to replace the original href with a newly downloaded local file, str.replace() simply won't find the original value. Either I have to find a way to get the original href or simply handle this case. I have to say, having the original href would be the best way.
You're already replacing the content, in a way...
res['href'] = some_function_to_download_css()
...updates the href attribute of the res node in BeautifulSoup's representation of the HTML tree.
To make it more efficient, you could cache the URLs of CSS files you've already downloaded, and consult the cache before downloading the file. Once you're done (and if you're OK with BS's attribute ordering/indentation/etc.), you can get the string representation of the tree with str(soup).
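A minimal sketch of that caching idea, assuming some_function_to_download_css takes the URL and returns the local path (in the question it takes no arguments):
css_cache = {}  # maps original URL -> local file path

for res in soup.findAll('link', {'rel': 'stylesheet'}):
    href = res['href']
    if href.startswith('data:'):
        continue  # skip embedded resources
    if href not in css_cache:
        css_cache[href] = some_function_to_download_css(href)
    res['href'] = css_cache[href]

# serialize the whole tree once, instead of patching a long string in a loop
with open('local_copy.html', 'w') as f:
    f.write(str(soup))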
Reference: http://beautiful-soup-4.readthedocs.org/en/latest/#changing-tag-names-and-attributes

How can I iterate over specific elements in HTML file and replace them?

I need to do a seemingly simple thing in Python which turned out to be quite complex. What I need to do is:
Open an HTML file.
Match all instances of a specific HTML element, for example table.
For each instance, extract the element as a string, pass that string to an external command which will do some modifications, and finally replace the original element with a new string returned from the external command.
I can't simply do a re.sub(), because in each case the replacement string is different and based on the original string.
Any suggestions?
You could use Beautiful Soup to do this.
Although for what you need, something simpler like lxml.etree would work fine.
Sounds like you want BeautifulSoup. Likely, you'd want to do something like:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
tables = soup.find_all('table')
for table in tables:
    contents = str(table)  # the whole element as a string; table.contents is a list
    new_contents = transform(contents)
    table.replaceWith(new_contents)
Alternatively, you may be looking for something closer to soup.replace_with
EDIT: Updated to the eventual solution.
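One caveat worth adding (my note, not part of the original answer): replace_with on a plain string inserts escaped text, so if the external command returns HTML you may want to re-parse it first, e.g.:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
for table in soup.find_all('table'):
    new_html = transform(str(table))  # transform is the external command from the question
    # re-parse so the replacement is inserted as markup, not escaped text
    table.replace_with(BeautifulSoup(new_html, 'html.parser'))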
I have found that parsing HTML via BeautifulSoup or any other such parser gets complex when you need to handle different pages with different structures, which are sometimes not well-formed, use JavaScript manipulation, etc. The best solution in this case is to directly access the browser DOM and query and modify nodes. You can do that easily in a headless browser like PhantomJS.
For example, here is a PhantomJS script:
var page = require('webpage').create();
page.content = '<html><body><table><tr><td>1</td><td>2</td></tr></table></html>';
page.evaluate(function () {
    var elems = document.getElementsByTagName('td');
    for (var i = 0; i < elems.length; i++) {
        elems[i].innerHTML = '!' + elems[i].innerHTML + '!';
    }
});
console.log(page.content);
phantom.exit();
It changes the text of every td, and the output is:
<html><head></head><body><table><tbody><tr><td>!1!</td><td>!2!</td></tr></tbody></table></body></html>
