Bad output with Python HTML parsing

I have an HTML file in C:\temp.
I want to extract this text:
Death - Individual Thought Patterns (1993), Progressive Death Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
Alfadog (1994), Black Metal
from this block of HTML:
<td width='99%' style='word-wrap:break-word;'><div><img src='style_images/1/nav_m.gif' border='0' alt='>' width='8' height='8' /> <b>Death - Individual Thought Patterns (1993)</b>, Progressive Death Metal</div></td>
<!--HideBegin--><div class='hidetop'>Hidden text</div><div class='hidemain'><!--HideEBegin--><!--coloro:#FF0000--><span style="color:#FF0000"><!--/coloro--><b>Download:</b><!--colorc--></span><!--/colorc--><br />Download from ifolder.ru <i>*Death - Individual Thought Patterns (1993)* <b>by Dissident God</b></i><br /><!--HideEnd--></div><!--HideEEnd--><br /><!--HideBegin--><div class='hidetop'>Hidden text</div><div class='hidemain'><!--HideEBegin--><!--coloro:#ff0000--><span style="color:#ff0000"><!--/coloro--><b>Download (mp3#VBR230kbps) (67 MB):</b><!--colorc--></span><!--/colorc--><br />Download from rapidshare.com <i>*Death - Individual Thought Patterns (Remastered) (2008)* <b>by smashter</b></i><!--HideEnd--></div><!--HideEEnd-->
<td width='99%' style='word-wrap:break-word;'><div><img src='style_images/1/nav_m.gif' border='0' alt='>' width='8' height='8' /> <b>Alfadog (1994)</b>, Black Metal</div></td>
The extracted text must be saved in a file called links.txt.
Despite my changes, my script only ever extracts this text:
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
But I want it to extract the text like this:
Death - Individual Thought Patterns (1993), Progressive Death Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
Alfadog (1994), Black Metal
This is the script:
import requests
from bs4 import BeautifulSoup

# Open the HTML file in read mode
with open("C:/temp/pagina.html", "r") as f:
    html = f.read()

# Create a Beautiful Soup object from the HTML code
soup = BeautifulSoup(html, "html.parser")

# Initialize a list to contain the extracted text
extracted_text = []

# Find all td's with style "word-wrap:break-word"
tds = soup.find_all("td", style="word-wrap:break-word")

# For each td found, look for the div tag and the b tag inside
# and extract the text contained in these tags
for td in tds:
    div_tag = td.find("div")
    b_tag = div_tag.find("b")
    if b_tag:
        text = b_tag.text
        # Also add the text after the b tag
        text += td.text[td.text.index(b_tag.text) + len(b_tag.text):]
        extracted_text.append(text)

# Find all divs with class "hidemain"
divs = soup.find_all("div", class_="hidemain")

# For each div found, look for the a tag inside
# and extract the link contained in this tag
for div in divs:
    a_tag = div.find("a")
    if a_tag:
        link = a_tag.get("href")
        extracted_text.append(link)

# Save the extracted text to a text file
with open("links.txt", "w") as f:
    for line in extracted_text:
        f.write(line + "\n")
I can't understand why it doesn't return the text I'm asking for.

After @Barmar's code edit, I get this output:
Death - Individual Thought Patterns (1993), Progressive Death Metal
Alfadog (1994), Black Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
To make the desired lines appear in the order in which they occur in the HTML file:
Death - Individual Thought Patterns (1993), Progressive Death Metal
http://xxxxxxx.bb/1196198
http://yyyyyyyyyyyy.com/files/153576607/d-xxx_xxx_xxx-xxxxx-xxxxx.rar
Alfadog (1994), Black Metal
I modified the code to extract the links first and then the titles.
Specifically, I now use two for loops, one to extract the links and the other to extract the titles. I also used the extend function instead of the append function to add the extracted items to the extracted_text list.
This is the fixed code:
import requests
from bs4 import BeautifulSoup

# Open the HTML file in read mode
with open("C:/temp/pagina.html", "r") as f:
    html = f.read()

# Create a Beautiful Soup object from the HTML code
soup = BeautifulSoup(html, "html.parser")

# Initialize a list to contain the extracted text
extracted_text = []

# Find all divs with class "hidemain"
divs = soup.find_all("div", class_="hidemain")

# For each div found, look for the a tag inside
# and extract the link contained in this tag
for div in divs:
    a_tag = div.find("a")
    if a_tag:
        link = a_tag.get("href")
        extracted_text.extend([link])

# Find all td's with style "word-wrap:break-word;"
tds = soup.find_all("td", style="word-wrap:break-word;")

# For each td found, look for the div tag and the b tag inside
# and extract the text contained in these tags
for td in tds:
    div_tag = td.find("div")
    b_tag = div_tag.find("b")
    if b_tag:
        text = b_tag.text
        # Also add the text after the b tag
        text += td.text[td.text.index(b_tag.text) + len(b_tag.text):]
        extracted_text.extend([text])

# Save the extracted text to a text file
with open("links.txt", "w") as f:
    for line in extracted_text:
        f.write(line + "\n")
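
If strict document order matters, an alternative is a single pass that visits titles and links as they occur in the file. This is only a sketch; it assumes the hidden divs contain the download links as <a href=...> tags, which are elided in the HTML sample above:

from bs4 import BeautifulSoup

# walk title cells and hidden-link divs in one pass, in document order
with open("C:/temp/pagina.html", "r") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

lines = []
for el in soup.find_all(["td", "div"]):
    style = el.get("style") or ""
    classes = el.get("class") or []
    if el.name == "td" and "word-wrap:break-word" in style:
        div = el.find("div")
        if div and div.find("b"):
            # title cell, e.g. "Death - Individual Thought Patterns (1993), ..."
            lines.append(div.get_text(strip=True))
    elif el.name == "div" and "hidemain" in classes:
        a = el.find("a")  # assumed: the real page wraps each link in an <a> tag
        if a and a.get("href"):
            lines.append(a["href"])

with open("links.txt", "w") as f:
    f.write("\n".join(lines) + "\n")

The substring check on the style attribute also side-steps the exact-match pitfall between "word-wrap:break-word" and "word-wrap:break-word;".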

Related

BeautifulSoup Extract Text from a Paragraph and Split Text by <br/>

I am very new to BeautifulSoup.
How would I be able to extract the text in a paragraph from HTML source code, split the text whenever there is a <br/>, and store it into an array such that each element in the array is a chunk of the paragraph text (as split by the <br/> tags)?
For example, for the following paragraph:
<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>
I would like it to be stored into the following array:
['Pancakes', 'A delicious type of food']
What I have tried is:
import bs4 as bs
soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>")
p = soup.findAll('p')
p[0] = p[0].getText()
print(p)
but this outputs an array with only one element:
['Pancakes A delicious type of food']
What is a way to code it so that I can get an array that contains the paragraph text split by any <br/> in the paragraph?
Try this:
from bs4 import BeautifulSoup, NavigableString

html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.findAll('p')
result = [str(child).strip() for child in p[0].children
          if isinstance(child, NavigableString)]
Update, for deep recursive text extraction:
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p').find_all(text=True, recursive=True)
Update again, for text split only by <br>:
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        text += str(child)
    elif isinstance(child, Tag):
        if child.name != 'br':
            text += child.text
        else:
            text += '\n'  # turn each <br> into a newline marker
# strip per chunk, so spaces around nested tags like <strong> survive
result = [chunk.strip() for chunk in text.split('\n') if chunk.strip()]
print(result)
I stumbled across this whilst having a similar issue. This was my solution...
A simple way is to replace the line
p[0] = p[0].getText()
with
p[0].getText('#').split('#')
Result is:
['Pancakes', ' A delicious type of food']
Obviously, choose a character (or characters) that won't appear in the text.
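
For completeness, a minimal runnable version of that separator trick, using a newline as the separator on the assumption that none occurs inside the paragraph (note that nested tags such as <strong> would also introduce separators):

from bs4 import BeautifulSoup

html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

# get_text(separator) inserts the separator between every pair of
# text nodes, so splitting on it recovers the <br/>-delimited chunks
chunks = [s.strip() for s in p.get_text('\n').split('\n') if s.strip()]
print(chunks)  # ['Pancakes', 'A delicious type of food']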

Parsing RSS with different tags to extract images

Hi, I am trying to extract the images from multiple sites' RSS feeds.
First RSS:
<enclosure type="image/jpeg" length="321742" url="http://www.sitio.com.uy//uploads/2014/10/19/54441d68e01af.jpg"/>
Second RSS:
<g:image_link>http://img.sitio2.com/imagenes/314165_20150422201743_635653477836873822w.jpg</g:image_link>
I need to extract the URL of the image.
My code uses BeautifulSoup in Python:
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').get_text().encode('utf-8')
    description = item.find('description').get_text().encode('utf-8')
    category = item.find('category').get_text().encode('utf-8')
    image = item.find('enclosure')
    print(image)
You can search for multiple tags using a tag list.
item.find(['enclosure', 'g:image_link'])
This will return the first tag it finds. If there are multiple tags, use find_all:
item.find_all(['enclosure', 'g:image_link'])
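
Putting it together, a minimal sketch (hedged: it assumes the default parser, as in the question, which keeps the literal tag name g:image_link). The URL sits in the url attribute for <enclosure> and in the text content for <g:image_link>:

import bs4
import requests

response = requests.get(url)  # url as defined in the question
soup = bs4.BeautifulSoup(response.text, 'html.parser')

for item in soup.find_all('item'):
    image = item.find(['enclosure', 'g:image_link'])
    if image is None:
        continue  # this item carries no image tag at all
    # <enclosure> holds the URL in its "url" attribute;
    # <g:image_link> holds it as text content
    image_url = image.get('url') or image.get_text(strip=True)
    print(image_url)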

Select a specific set of cells under a set of tables using Python and BeautifulSoup

Consider there are N web pages.
Each web page has one or more tables. What the tables have in common is that they share the same class; call it "table_class".
We need the contents of the same column (the third column, whose heading is "Title") in every table.
"Contents" meaning the href links in column three, from all rows.
Some rows might be just plain text and some might have an href link in them.
Each href link should be printed on a separate line, one after the other.
Filtering by attributes is not valid, as some tags have different attributes; the position of the cell is the only hint available.
How would you code this?
Consider these two links for the web pages:
http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2014
http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2013
Consider the table: wikitable
Required content: href links of the "Title" column
Code I tried for one page:
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
for sp in soup.find_all('tr'):
    for bt in sp.find_all('td'):
        for link in bt.find_all('a'):
            print(link.get("href"))
    print()
The idea is to iterate over every table with the wikitable class; for every table, find links directly inside an i tag directly inside a td directly inside a tr:
import requests
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2014"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# iterate over tables
for table in soup.select('table.wikitable.sortable'):
    # get the table header/description, continue if not found
    h3 = table.find_previous_sibling('h3')
    if h3 is None:
        continue
    print(h3.text)
    # get the links
    for link in table.select('tr > td > i > a'):
        print(link.text, "|", link.get('href', ''))
    print("------")
Prints (also printing table names for clarity):
January 2014–june 2014[edit]
Celebrity | /wiki/Celebrity
Kshatriya | /wiki/Kshatriya
1: Nenokkadine | /wiki/1:_Nenokkadine
...
Oohalu Gusagusalade | /wiki/Oohalu_Gusagusalade
Autonagar Surya | /wiki/Autonagar_Surya
------
July 2014 – December 2014[edit]
...
O Manishi Katha | /wiki/O_Manishi_Katha
Mukunda | /wiki/Mukunda
Chinnadana Nee Kosam | /wiki/Chinnadana_Nee_Kosam
------
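
Since the question insists that the cell position is the only reliable hint, here is a hedged sketch that indexes the third <td> of every row instead of relying on the <i> markup (note that rowspan cells can shift column positions in some tables):

import requests
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2014"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for table in soup.select("table.wikitable"):
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 3:  # skip header rows and short rows
            for link in cells[2].find_all("a"):  # third column, zero-indexed
                print(link.get("href", ""))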

Parsing HTML with lxml (python)

I'm trying to save the content of an HTML page to a .html file, but I only want to save the content under the "table" tag. In addition, I'd like to remove all empty tags like <b></b>.
I did all of this already with BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()

soup = BeautifulSoup(html)
txt = ""
for text in soup.find_all("table", {'class': 'main'}):
    txt += str(text)

text = BeautifulSoup(text)
empty_tags = text.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None
                           and (tag.string is None or tag.string.strip() == ""))
[empty_tag.extract() for empty_tag in empty_tags]
My question is: is this also possible with lxml? If yes, how would this look, more or less?
Thanks a lot for any help.
import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

# extract HTML content from all tables
# (use lxml.html.tostring(t, method="text", encoding="unicode")
# to get text content without tags)
tables_html = "\n".join([lxml.html.tostring(t) for t in tables])

# removing only specific empty tags, here <b></b> and <i></i>
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# removing all empty tags (tags that do not have child nodes)
for empty in root.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)

# root does not contain those empty tags anymore
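
To actually save the result to a .html file, as the question asks, a minimal follow-up sketch (assuming Python 3, where tostring(..., encoding="unicode") returns str rather than bytes):

import lxml.html

root = lxml.html.parse('http://test.xyz').getroot()

# drop empty <b>/<i> tags first, then serialize the remaining tables
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

tables_html = "\n".join(lxml.html.tostring(t, encoding="unicode")
                        for t in root.cssselect('table.main'))

with open("tables.html", "w", encoding="utf-8") as f:
    f.write(tables_html)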

Retrieve all content between a closing and opening html tag using Beautiful Soup

I am parsing content using Python and Beautiful Soup, then writing it to a CSV file, and have run into a bugger of a problem getting a certain set of data. The data is run through an implementation of TidyHTML that I have crafted, and then other unneeded data is stripped out.
The issue is that I need to retrieve all the data between a set of <h3> tags.
Sample Data:
<h3>Pages 1-18</h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
September 14 1880. Discussion of curricular matters. Students are
debarred from taking algebra until they have completed both mental
and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
<ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
President's room of the University building; 11 October 1880. All
members present; 18 October 1880. Regular meeting 2. Moved that the
President wait on the property holders on 12th street and request
them to abate the nuisance on their property; 25 October 1880.
Moved that the senior and junior classes for rhetoricals be...</li></ul>
<h3>Pages 19-33</h3>
I need to retrieve all of the content between the first closing </h3> tag and the next opening <h3> tag. This shouldn't be hard, but my thick head isn't making the necessary connections. I can grab all of the <ul> tags, but that doesn't work because there is not a one-to-one relationship between <h3> tags and <ul> tags.
The output I am looking to achieve is:
Pages 1-18|Vol-1-pages-001.pdf|content between the </h3> and <h3> tags.
The first two parts have not been a problem, but the content between a set of tags is difficult for me.
My current code is as follows:
import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque

html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'

html_cleanup = {'\r\r\n': '', '\n\n': '', '\n': '', '\r': '', '\r\r': '',
                '<img src="UOSymbol1.jpg" alt="" />': ''}

for infile in glob.glob(os.path.join(html_path, '*.html')):
    print "current file is: " + infile
    html = open(infile).read()
    for i, j in html_cleanup.iteritems():
        html = html.replace(i, j)

    # parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)

    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                             quoting=csv.QUOTE_NONE, escapechar=' ')

    # retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
        if title['href'].startswith('V'):
            volume.append(title.string)
            i += 1
            fileName.append(soup('a')[i]['href'])

    # retrieve the summary of each archive and store it
    # for body in soup.findAll('ul') or soup.findAll('ol'):
    #     summary.append(body)
    for body in soup.findAll('h3'):
        body.findNextSibling(text=True)
        summary.append(body)

    # print out each field into the csv file
    for c in range(i):
        pages = volume.popleft()
        path = fileName.popleft()
        notes = summary
        if not summary:
            notes = "help"
        if summary:
            notes = summary.popleft()
        html_to_csv.writerow([pages, path, notes])
Extract content between </h3> and <h3> tags:
from itertools import takewhile

h3s = soup('h3')  # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
    # get elements in between
    between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
    # extract text
    print(''.join(getattr(el, 'text', el) for el in between_it))
The code assumes that all <h3> elements are siblings. If that is not the case, you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
If you are trying to extract data between <ul><li></li></ul> tags, lxml provides great functionality via CSS selectors:
import lxml.html
import urllib

# st.html contains your html snippet
data = urllib.urlopen('file:///C:/Users/ranveer/st.html').read()
doc = lxml.html.fromstring(data)
# CSS path (found using the Firebug extension)
elements = doc.cssselect('ul li')
for element in elements:
    print element.text_content()
After executing the above code you will get all the text between the <ul>/<li> tags. It is much cleaner than Beautiful Soup.
If you by any chance plan to use lxml, then you can evaluate XPath expressions in the following way:
import urllib
from lxml import etree

content = etree.HTML(urllib.urlopen("file:///C:/Users/ranveer/st.html").read())
content_text = content.xpath("html/body/h3[1]/a/@href | //ul[1]/li/text() | //ul[2]/li/text() | //h3[2]/a/@href")
print content_text

You can change the XPath according to your needs.
