Using BeautifulSoup to parse HTML files in Python

I am using BeautifulSoup to replace all the commas in an HTML file with ‚ (U+201A). Here is my code for that:
import re
import sys
from bs4 import BeautifulSoup

f = open(sys.argv[1], "r")
data = f.read()
soup = BeautifulSoup(data)
comma = re.compile(',')
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '‚'))
This code works, except when there is some JavaScript included in the HTML file. In that case it replaces the commas within the JavaScript code as well, which is not what I want. I only want to replace commas in the text content of the HTML file.

soup.findAll can take a callable for its text argument, so you can skip text nodes that sit inside <script> (or any other unwanted) tags:
tags_to_skip = set(["script", "style"])
# Add to this set as needed.

def valid_text(text):
    """Filter text nodes on the basis of their parent's tag name.

    If the parent's tag name is found in ``tags_to_skip`` then
    the text node is dropped. Otherwise, it is kept.
    """
    if comma.search(text) is None:
        return False
    return text.parent.name.lower() not in tags_to_skip

for t in soup.findAll(text=valid_text):
    t.replaceWith(t.replace(',', '‚'))
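For what it's worth, the same idea written against BeautifulSoup 4's current method names might look like this (a sketch, assuming bs4 with the stdlib html.parser):
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
for t in soup.find_all(string=comma):
    # Skip text that lives inside the unwanted tags.
    if t.parent.name.lower() not in tags_to_skip:
        t.replace_with(t.replace(',', '‚'))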

Related

Python: how to match a URL string in HTML content

I'm trying to pull a URL out of a function call that comes from an HTML element.
Content:
content = 'memoizeFetch("/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/")'
My code:
content = 'memoizeFetch("/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/")'
match = re.search(r'memoizeFetch("(.*?)"', content).group(0)
print(match)
It doesn't work. I need to get the following string from that function:
"/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/"
How can I do that?
You need to escape the parenthesis inside the pattern and select group(1).
Change your code to:
import re

content = 'memoizeFetch("/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/")'
match = re.search(r'memoizeFetch\("(.*?)"', content).group(1)
print(match)
Output:
/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/
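One caveat worth adding: re.search() returns None when the pattern does not match, so chaining .group(1) directly raises AttributeError on unexpected input. A guarded variant of the same pattern:
import re

match = re.search(r'memoizeFetch\("(.*?)"', content)
if match:
    print(match.group(1))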

Get a clean string from HTML, CSS and JavaScript

Currently, I'm trying to scrape 10-K submission text files on sec.gov.
Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt
The text document contains things like HTML tags, CSS styles, and JavaScript. Ideally, I'd like to scrape only the content after removing all the tags and styling.
First, I tried the obvious get_text() method from BeautifulSoup. That didn't work out.
Then I tried using regex to remove everything between < and >. Unfortunately, this also didn't work out entirely: it keeps some tags, styles, and scripts.
Does anyone have a clean solution for me to accomplish my goal?
Here is my code so far:
import requests
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)
Let's set up a dummy string based on the example:
original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""
Now let's remove all the JavaScript.
from lxml.html.clean import Cleaner  # removes javascript

# Delete javascript tags (some other options are shown for the sake of example).
cleaner = Cleaner(
    comments=True,   # True = remove comments
    meta=True,       # True = remove meta tags
    scripts=True,    # True = remove script tags
    embedded=True,   # True = remove embedded tags
)
clean_dom = cleaner.clean_html(original_content)
(From https://stackoverflow.com/a/46371211/1204332)
And then we can either remove the HTML tags (extract the text) with the HTMLParser library:
from HTMLParser import HTMLParser  # Python 2; in Python 3 this module is html.parser

# Strip HTML.
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

text_content = strip_tags(clean_dom)
print text_content
(From: https://stackoverflow.com/a/925630/1204332)
Or we could get the text with the lxml library:
from lxml.html import fromstring
print fromstring(original_content).text_content()
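Putting the pieces together for the original URL, one possible end-to-end sketch (assuming requests and lxml are installed; the User-Agent header is an assumption here, since sec.gov may reject clients that don't identify themselves):
import requests
from lxml.html import fromstring
from lxml.html.clean import Cleaner

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
# Assumed header: SEC EDGAR may block anonymous user agents.
response = requests.get(url, headers={'User-Agent': 'research-script contact@example.com'})

cleaner = Cleaner(comments=True, meta=True, scripts=True, style=True, embedded=True)
text = fromstring(cleaner.clean_html(response.text)).text_content()
print(text)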

Python adding a string to a match list with multiple items

The code I am working on retrieves a list from an HTML page with two fields, URL and title...
The URLs all start with /URL..., and I need to prepend "http://website.com" to every value returned by re.findall.
The code so far is this:
soup = bs(html)
tag = soup.find('div', {'class': 'item'})
reg = re.compile('<a href="(.+?)" rel=".+?" title="(.+?)"')
links = re.findall(reg, str(tag))
# (prepend "http://website.com" to the href "(.+?)" field)
return links
Try:
for link in tag.find_all('a'):
    link['href'] = 'http://website.com' + link['href']
Then use one of these output methods:
return str(soup) gets you the document after the changes are applied.
return tag.find_all('a') gets you all the link elements.
return [str(i) for i in tag.find_all('a')] gets you all the link elements converted to strings.
And don't try to parse HTML with regex when you already have an HTML parser working.
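An end-to-end sketch of that approach (assuming bs4 is installed and that every matched <a> carries both href and title attributes):
from bs4 import BeautifulSoup as bs

soup = bs(html)
tag = soup.find('div', {'class': 'item'})
for link in tag.find_all('a'):
    link['href'] = 'http://website.com' + link['href']
# Collect (url, title) pairs, mirroring the two groups in the regex.
links = [(a['href'], a['title']) for a in tag.find_all('a')]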

How to use lxml.html.document_fromstring() to remove tag content using Python and lxml

I want to remove the contents present between <pre><code> and </code></pre> tags. I looked at remove(), strip_elements(), as well as BeautifulSoup methods, but all the examples I saw contained only a single tag, like only <pre> or only <code>. How can I use them when both are present together (as in the case of a formatted code block in Stack Overflow posts)?
EDIT: if I have something like this (this is how formatted code appears in Stack Overflow posts):
<pre><code> some code stuff </code></pre>
then I want to remove all the content between <pre><code> and </code></pre>, including the tags themselves.
EDITED CODE:
The code that I have is given below, but it throws lxml.etree.XMLSyntaxError: Extra content at the end of the document at the line doc = etree.fromstring(record[1]):
import lxml.html
from lxml import etree

cur.execute('SELECT Title, Body FROM posts')
for item in cur:
    record = list(item)
    doc = etree.fromstring(record[1])  # error thrown here
    for node in doc.xpath('pre[code]'):
        doc.remove(node)
    record[1] = etree.tostring(doc)
    page = lxml.html.document_fromstring(record[1])
    record[0] = str(record[0])
    record[1] = str(page.text_content())  # stripping HTML tags
    print record[1]
UPDATE: I understand that the format of the XML that I have is not standard, and as such I will need to use lxml.html.document_fromstring() to remove the tag contents instead of etree.fromstring(). Can anyone provide an example, as I cannot find any use of lxml.html.document_fromstring() to remove the contents of a tag?
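Here is a minimal sketch of that approach (assuming, as in the code above, that record[1] holds the post body). document_fromstring() wraps the fragment in a full <html><body> document, so the XPath needs the // prefix, and removal has to go through the node's parent:
import lxml.html

page = lxml.html.document_fromstring(record[1])
for node in page.xpath('//pre[code]'):
    # Removes the <pre>, its nested <code>, and everything inside them.
    node.getparent().remove(node)
record[1] = page.text_content()  # remaining text with tags stripped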

Python & lxml / xpath: Parsing XML

I need to get the FLVPath value from this link: http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite
import requests
import lxml.html

sub_r = requests.get("http://www.testpage.co/v2/videoConfigXmlCode.php?pg=video_%s_no_0_extsite" % list[6])
sub_root = lxml.html.fromstring(sub_r.content)
for sub_data in sub_root.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print sub_data.text
But no data is returned.
You're using lxml.html to parse the document, which causes lxml to lowercase all element and attribute names (since case doesn't matter in HTML). That means you'll have to use:
sub_root.xpath('//player_settings[@name="FLVPath"]/@value')
Or, as you're parsing an XML file anyway, you could use lxml.etree.
Or, selecting the PLAYER_SETTINGS element itself rather than its attribute, you could try:
for sub_data in sub_root.xpath('//player_settings[@name="FLVPath"]'):
    print sub_data.attrib['value']
url = "http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite"
response = requests.get(url)
# Use `lxml.etree` rathern than `lxml.html`,
# and unicode `response.text` instead of `response.content`
doc = lxml.etree.fromstring(response.text)
for path in doc.xpath('//PLAYER_SETTINGS[#Name="FLVPath"]/#Value'):
print path
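One caveat on that last snippet: if the document starts with an XML declaration that names an encoding, lxml.etree.fromstring() rejects unicode input (ValueError: Unicode strings with encoding declaration are not supported). Passing the raw bytes instead lets lxml honour the declared encoding:
# Bytes input avoids the unicode-with-encoding-declaration error.
doc = lxml.etree.fromstring(response.content)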
