Issue creating an HTML file with BeautifulSoup - Python

Here is my Python code using BeautifulSoup. The main issue is with the attributes: each element of the list should go into its own th, but for some reason they all keep ending up inside a single tag.
from BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup()
mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']
tag1 = Tag(soup, "html")
tag2 = Tag(soup, "table")
tag3 = Tag(soup, "tr")
tag4 = Tag(soup, "th")
tag5 = Tag(soup, "td")
soup.insert(0, tag1)
tag1.insert(0, tag2)
tag2.insert(0, tag3)
for i in range(0, len(mem_attr)):
    tag3.insert(0, tag4)
    tag4.insert(i, mem_attr[i])
print soup.prettify()
Here is its output:
<html>
<table>
<tr>
<th>
Description
PhysicalID
Slot
Size
Width
</th>
</tr>
</table>
</html>
What I am looking for is this:
<html>
<table>
<tr>
<th>
Description
</th>
<th>
PhysicalID
</th>
<th>
Slot
</th>
<th>
Size
</th>
<th>
Width
</th>
</tr>
</table>
</html>
Can anyone tell me what is missing in the code?

You're putting everything into the same th; you never told it to create more than one.
Here is code closer to what you want:
from BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup()
mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']
html = Tag(soup, "html")
table = Tag(soup, "table")
tr = Tag(soup, "tr")
soup.append(html)
html.append(table)
table.append(tr)
for attr in mem_attr:
    th = Tag(soup, "th")
    tr.append(th)
    th.append(attr)
print soup.prettify()
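For reference, the same idea ports to BeautifulSoup 4 (the `bs4` package), where `soup.new_tag()` replaces the old `Tag(soup, ...)` constructor. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("", "html.parser")
mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']

html = soup.new_tag("html")
table = soup.new_tag("table")
tr = soup.new_tag("tr")
soup.append(html)
html.append(table)
table.append(tr)

for attr in mem_attr:
    th = soup.new_tag("th")  # a fresh <th> on every iteration
    th.string = attr
    tr.append(th)

print(soup.prettify())
```

The key point is the same in both versions: a new `<th>` tag object must be created inside the loop, since a single tag object can only live in one place in the tree.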

Related

Extracting raw HTML content (with tags) via beautifulsoup

Using BeautifulSoup and Pandas, I am writing a module where I wish to extract full, raw HTML from pages/files and export the results to a spreadsheet. Here's an example:
Content.html file
<table>
<tbody>
<tr>
<td>Item 1</td>
</tr>
<tr data-name="item">
<td data-name="heading">Item 1</td>
<td data-name="content">Tagless Text in a cell.</td>
</tr>
<tr data-name="item">
<td data-name="heading">Item 2</td>
<td data-name="content">
<p>Item with child elements.</p>
<div>Second element.</div>
<p>Third Element</p>
</td>
</tr>
<tr data-name="item">
<td data-name="heading">Item 3</td>
<td data-name="content">
<p>Item with direct and indirect child elements.
<ul>
<li>Nested element 1</li>
<li>Nested element 2</li>
</ul>
</td>
</tr>
</tbody>
</table>
Python Script
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np
import requests
import lxml

with open("content.html", "r") as source:
    #req = requests.get(url, headers)
    soup = BeautifulSoup(source, 'lxml')
    output = soup.findAll("td", attrs={"data-name": "content"})
    data = []
    for item in output:
        data.append(''.join(str(item)))
    df = pd.DataFrame(data, columns=["Content"])
    df.to_csv("data.csv", index=False)
    # Output HTML code in file
    #with open("code.html", "w") as f:
    #    f.write(stuff)
print("Project Finished!")
This script currently works, but the output contains the parent td element along with all of its content.
data.csv
Content
"<td data-name=""content"">Tagless Text in a cell.</td>"
"<td data-name=""content"">
<p>Item with child elements.</p>
<div>Second element.</div>
<p>Third Element</p>
</td>"
"<td data-name=""content"">
<p>Item with direct and indirect child elements.
</p><ul>
<li>Nested element 1</li>
<li>Nested element 2</li>
</ul>
</td>"
My ideal output would look like the following:
Content
"Tagless Text in a cell."
"<p>Item with child elements.</p>
<div>Second element.</div>
<p>Third Element</p>"
"
<p>Item with direct and indirect child elements.
</p><ul>
<li>Nested element 1</li>
<li>Nested element 2</li>
</ul>"
How can I achieve this? The closest I've been able to get either strips out all of the tags, or keeps the tags but outputs every child element as a list item (which throws "ValueError: X columns passed, passed data had Y columns" for elements with multiple items in the list).
You could try iterating over the .contents for each <td> tag, for example:
from bs4 import BeautifulSoup
import pandas as pd

with open("content.html", "r") as source:
    soup = BeautifulSoup(source, 'lxml')
    data = []
    for td in soup.find_all("td", attrs={"data-name": "content"}):
        data.append(''.join(str(el) for el in td.contents))
    df = pd.DataFrame(data, columns=["Content"])
    df.to_csv("data.csv", index=False)
print("Project Finished!")
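As an aside, bs4's `Tag.decode_contents()` returns a tag's inner HTML as a single string, so it should give the same result as joining `str(el)` over `.contents`. A small self-contained sketch:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one cell of the content.html example above.
html = '<td data-name="content"><p>Item with child elements.</p><div>Second element.</div></td>'
soup = BeautifulSoup(html, "html.parser")

td = soup.find("td", attrs={"data-name": "content"})
inner = td.decode_contents()  # inner HTML only, without the parent <td>
print(inner)  # → <p>Item with child elements.</p><div>Second element.</div>
```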

How to continue filtering beyond BeautifulSoup find_all ResultSet?

Imagine you're trying to parse something like this with bs4:
<table>
<tbody>
<tr>
<th attr="attr" class="title">
<a href="link.com">Title Text</a>
</th>
<th attr="attr" class="title">
<a href="link.com">Title Text 2</a>
</th>
<th attr="attr" class="title">
<a href="link.com">Title Text 3</a>
</th>
</tr>
</tbody>
<a href="otherlink.com">Other link to throw you off</a>
</table>
Currently I am able to get to a list of all the th elements with
html_content = BeautifulSoup(requests.get("parsingwebsite.com").content, "html.parser")
res = html_content.find_all("th", {"attr": "attr"}, class_="title")
But I only want the title text inside the <a> tags (ideally ["Title Text", "Title Text 2", "Title Text 3"]).
Is there a way to continue filtering down by html element or otherwise modify the original query to filter down to the text inside the link, without having to use regex?
You can use CSS selector for selecting <a> tags under specific <th> tags.
For example th[attr="attr"].title a will select all <a> tags under <th> tags with attr="attr" and class="title":
from bs4 import BeautifulSoup

txt = '''<table>
<tbody>
<tr>
<th attr="attr" class="title">
<a href="link.com">Title Text</a>
</th>
<th attr="attr" class="title">
<a href="link.com">Title Text 2</a>
</th>
<th attr="attr" class="title">
<a href="link.com">Title Text 3</a>
</th>
</tr>
</tbody>
<a href="otherlink.com">Other link to throw you off</a>
</table>'''

soup = BeautifulSoup(txt, 'html.parser')
print([a.text for a in soup.select('th[attr="attr"].title a')])
Prints:
['Title Text', 'Title Text 2', 'Title Text 3']
Or using BeautifulSoup's own API:
print( [th.a.text for th in soup.find_all("th", {"attr": "attr"}, class_="title") if th.a] )
You can try this:
import requests
from bs4 import BeautifulSoup
html = '''<table>
<tbody>
<tr>
<th attr="attr" class="title">
<a href="link.com">Title Text</a>
</th>
<th attr="attr" class="title">
<a href="link.com">Title Text 2</a>
</th>
<th attr="attr" class="title">
<a href="link.com">Title Text 3</a>
</th>
</tr>
</tbody>
</table>'''
html_code = BeautifulSoup(html, 'html.parser')
a = html_code.find_all('a')
text_a = [i.text for i in a]
print(text_a)

Beautifulsoup iterate to get either <td>sometext</td> or url

I want to create a list of key-value pairs, with the <thead> items as keys. For the values I want the text of each <td>, except for the <td> items that contain an <a href='url'>, where I want the url instead.
Currently I am only able to get the text from all items. How do I get '/someurl' instead of Makulerad and Detaljer?
<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>
My code:
raw_html = simple_get('https://example.com/')
soup = BeautifulSoup(raw_html, 'html.parser')

table = soup.find("table", attrs={"class": "table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)
Try this: simply get the text of each <td> if it doesn't contain an <a>; otherwise, get its href value.
from bs4 import BeautifulSoup
raw_html = '''<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={"class":"table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head, [td.get_text() if not td.a else td.a['href'] for td in row.find_all("td")]))
    datasets.append(dataset)
print(datasets)
OUTPUT:
[{'Volymsenhet': 'Antal', 'Pris': '5,40', 'Valuta': 'SEK', 'Handelsplats': 'NASDAQ STOCKHOLM AB', 'url1': '/someurl', 'url2': '/someurl'}]

How to extract 2nd column in html table in python?

<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
I need to grab all the td elements in column 2 in Python. I am new to Python.
import urllib2
from bs4 import BeautifulSoup

url = "http://ccdsiu.byethost33.com/magento/adamo-13.html"
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div', attrs={'class': 'madhu'})
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    for trr in trdata:
        print trr
The code above prints every td element in the table. I am trying to achieve this with XPath.
I don't think you can use XPath like that with BeautifulSoup. However, the third-party lxml module can do it.
from lxml import etree
table = '''
<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
'''
parser = etree.HTMLParser()
tree = etree.fromstring(table, parser)
results = tree.xpath('//tr/td[position()=2]')
print 'Column 2\n========'
for r in results:
    print r.text
Which when run prints
Column 2
========
Doe
abcd
efc
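If you'd rather stay within BeautifulSoup, recent versions (4.7+, via the soupsieve backend) support the CSS `:nth-of-type` pseudo-class, which can express the same column selection as the XPath above. A sketch using the same table:

```python
from bs4 import BeautifulSoup

table = """
<table style="width:300px" border="1">
<tr><td>John</td><td>Doe</td><td>80</td></tr>
<tr><td>ABC</td><td>abcd</td><td>80</td></tr>
<tr><td>EFC</td><td>efc</td><td>80</td></tr>
</table>
"""
soup = BeautifulSoup(table, "html.parser")

# Select the second <td> of every row -- the same cells as //tr/td[position()=2]
column2 = [td.get_text() for td in soup.select("tr > td:nth-of-type(2)")]
print(column2)  # → ['Doe', 'abcd', 'efc']
```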
You don't have to iterate over your td elements. Use this:
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    if len(tddata) >= 2:
        print tddata[1]
Lists are indexed starting from 0; I check the length of the list to make sure the second td exists.
It's not really clear what you want, since your HTML example isn't the page you're actually scraping and "just the second column tds" isn't much to go on. Anyway, I modified Elmo's answer to give you the Importance title and then the actual importance level of each item.
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    count = 0
    for i in range(0, len(tddata)):
        if count % 6 == 0:
            print tddata[count + 1]
        count += 1

How to grab these values with BeautifulSoup?

I'm trying to parse the following HTML:
<div class="content">
<h3>
Kontaktuppgifter</h3>
<table>
<tr>
<th>
Postadress:
</th>
<td>
Platteb....
<br/>44497 SVE....
</td>
</tr>
<tr>
<th>
Telefon:
</th>
<td>
01-.......
</td>
</tr>
</table>
I want to grab td 1, td 2, and td 3. However, td 3 is not always present.
This is what I've got so far:
def ParsePage(threadName, page_url):
    r = requests.get(page_url)
    print "\n--------------------\n"
    print "Parsing page: " + r.url
    data = r.text
    soup = BeautifulSoup(data)
    divs = soup.findAll('div', {"class": "content"})
    for tag in divs:
        divds = tag.findAll('td')
        print divds
For some reason this just prints the whole div
You must have a typo somewhere; the code works for me:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html)
div = soup.findAll("div", {"class": "content"})
for tag in div: print tag.findAll("td")
#printed:
[<td>
Platteb....
<br/>44497 SVE....
</td>, <td>
01-.......
</td>]
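Since the third td isn't guaranteed to exist, one way to make the missing-cell case explicit is to walk the table row by row and pair each row's <th> label with its <td>, skipping incomplete rows. A sketch using a trimmed-down version of the markup above (the truncated values stand in for the real data):

```python
from bs4 import BeautifulSoup

html = '''<div class="content">
<table>
<tr><th>Postadress:</th><td>Platteb....<br/>44497 SVE....</td></tr>
<tr><th>Telefon:</th><td>01-.......</td></tr>
</table>
</div>'''

soup = BeautifulSoup(html, "html.parser")
info = {}
for tr in soup.select("div.content tr"):
    if tr.th and tr.td:  # skip rows missing either cell
        # get_text(" ", strip=True) keeps the <br/>-separated address on one line
        info[tr.th.get_text(strip=True)] = tr.td.get_text(" ", strip=True)
print(info)
```

This returns a dict keyed by the row labels, so a missing row simply results in a missing key rather than an IndexError.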
