HTML code processing - Python

I want to process some HTML code and remove the tags as in the example:
"<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting paragraph."
I'm using Python; do you know of any framework or library I could use to remove the HTML tags?
Thanks!

This question may help you: Strip HTML from strings in Python
No matter what solution you choose, I'd recommend avoiding regular expressions. They can be slow when processing large strings, they might not work due to invalid HTML, and stripping HTML with regex isn't always secure or reliable.

BeautifulSoup
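For example, a minimal sketch with BeautifulSoup (assuming the bs4 package is installed):
from bs4 import BeautifulSoup

text = "<p><b>This</b> is a very interesting paragraph.</p>"
soup = BeautifulSoup(text, "html.parser")  # stdlib parser, no extra dependency
print(soup.get_text())
# 'This is a very interesting paragraph.'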

import libxml2

text = "<p><b>This</b> is a very interesting paragraph.</p>"
doc = libxml2.parseDoc(text)  # parse the markup into a document tree
print(doc.content)            # concatenated text of all nodes
doc.freeDoc()                 # libxml2 documents must be freed manually
# 'This is a very interesting paragraph.'

Depending on your needs, you could just use the regular expression /<(.|\n)*?>/ and replace all matches with empty strings. That works for quick manual cases, but if you're building this as an application feature you'll need a more robust and secure option.
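In Python, that substitution looks like the following (a minimal sketch; fine for one-off cleanup, with the caveats above):
import re

text = "<p><b>This</b> is a very interesting paragraph.</p>"
print(re.sub(r'<(.|\n)*?>', '', text))  # non-greedy: matches each tag separately
# 'This is a very interesting paragraph.'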

You can use lxml.

Related

How can I remove every "<span" from the items in my list? What should I change in my regex?

Code:
text2=re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
Output:
['https://m.facebook.com/people/Vick-Arcadia/100009629167118/', 'https://m.facebook.com<span', 'https://m.facebook.com<span',
In general, regexes aren't powerful enough to handle HTML, which is tree-structured and has matching openers and closers.
The preferred technique is to use a parser designed for HTML. In the Python world, lxml and BeautifulSoup are popular choices.
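For instance, a sketch with BeautifulSoup that pulls the hrefs out directly instead of regex-scanning the raw markup (the sample HTML is made up to resemble the output in the question):
from bs4 import BeautifulSoup

html = '<a href="https://m.facebook.com/people/Vick-Arcadia/100009629167118/"><span>name</span></a>'
soup = BeautifulSoup(html, "html.parser")
print([a["href"] for a in soup.find_all("a", href=True)])
# ['https://m.facebook.com/people/Vick-Arcadia/100009629167118/']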
This regex should work better:
'https?:\/\/[\w\.]+(\/[\/\w-]+)?'
For regex work, I recommend testing at https://regex101.com/
But for operating on HTML you're better off using the BeautifulSoup library; if you add more details I can help you with this.
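One caveat with that suggested pattern: it contains a capturing group, so re.findall would return only the group's contents. Making the group non-capturing returns the full URLs (a sketch on made-up input):
import re

text = 'https://m.facebook.com/people/Vick-Arcadia/100009629167118/<span>...'
# (?:...) is non-capturing, so findall yields whole matches rather than the group
print(re.findall(r'https?://[\w.]+(?:/[/\w-]+)?', text))
# ['https://m.facebook.com/people/Vick-Arcadia/100009629167118/']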

Use Python re to get rid of links

Say I have a string where the text Boston–Cambridge–Quincy, MA–NH MSA is wrapped in HTML anchor tags.
How can I use re to get rid of the links and keep only the Boston–Cambridge–Quincy, MA–NH MSA part?
I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp), but it's not working.
re.sub('<a[^>]+>(.*?)</a>', '\\1', text)
Note that parsing HTML in general is rather dangerous. However, it seems that you are parsing MediaWiki-generated links, where it is safe to assume that the links are always similarly formatted, so you should be fine with that regular expression.
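For example, on a hypothetical MediaWiki-style string (the actual hrefs from the question were not preserved, so these are made up):
import re

name_tmp = '<a href="/wiki/Boston">Boston</a>–<a href="/wiki/Cambridge">Cambridge</a>–Quincy, MA–NH MSA'
print(re.sub('<a[^>]+>(.*?)</a>', '\\1', name_tmp))  # keep each link's inner text
# 'Boston–Cambridge–Quincy, MA–NH MSA'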
You can also use the bleach module (https://pypi.python.org/pypi/bleach), which wraps HTML-sanitizing tools and lets you quickly strip HTML from text.
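Roughly like this (a sketch; the exact clean() arguments may vary between bleach versions):
import bleach

html = 'Visit <a href="https://example.com">our site</a> today'
# with no allowed tags and strip=True, markup is removed rather than escaped
print(bleach.clean(html, tags=[], strip=True))
# 'Visit our site today'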

Regexp to parse HTML imgs

I'm crawling through an HTML page and I want to extract the img srcs and the a hrefs.
On the particular site, all of them are encapsulated in double quotes.
I've tried a wide variety of regexps with no success. Assume the characters inside the double quotes will be [-\w/.] (letters, digits, _, -, /, and .).
In python:
re.search(r'img\s+src="(?P<src>[\w-/]+_"', line)
Doesn't return anything, but
re.search(r'img\s+src="(?P[-\w[/]]+)"', line)
Returns way too much (i.e., it does not stop at the ").
I need help creating the right regexp. Thanks in advance!
I need help creating the right regexp.
No, you need help in finding the right tool.
Try BeautifulSoup.
(If you insist on using regular expressions - and I'd advise against it - try changing the greedy + to non-greedy +?).
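A sketch of the BeautifulSoup route for this exact task (the sample markup is made up):
from bs4 import BeautifulSoup

html = '<a href="/page"><img src="/images/cat-1.png"></a>'
soup = BeautifulSoup(html, "html.parser")
print([img["src"] for img in soup.find_all("img", src=True)])  # img srcs
print([a["href"] for a in soup.find_all("a", href=True)])      # a hrefs
# ['/images/cat-1.png']
# ['/page']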
Here's an example of a better way to do it than with regex, using the excellent lxml library and xpath
In [1]: import lxml.html
In [2]: doc = lxml.html.parse('http://www.google.com/search?q=kittens&tbm=isch')
In [3]: doc.xpath('//img/@src')
Out[3]:
['/images/nav_logo_hp2.png',
'http://t1.gstatic.com/images?q=tbn:ANd9GcQhajNZimPGLw9iTfzrAF_HV5UogY-KGep5WYgw-VHZ15oaAwGquNb5Q2I',
'http://t2.gstatic.com/images?q=tbn:ANd9GcS1LgVIlDgoIfNzwU4xBz9fL32ZJjZU26aB4aynRsEcz2VuXmjCtvxUonM',
'http://t1.gstatic.com/images?q=tbn:ANd9GcRgouJt5Moe8uTnDPUFTo4csZOcBtEDA_B7WdRPe8pdZroR5QB2q_-LT59G',
[...]
]
A good trick for finding things inside quotes is "([^"]+)": you search for any characters except a quote, sitting between two quotes.
For help with creating regular expressions I can strongly recommend Expresso (http://www.ultrapico.com/Expresso.htm).
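Applied to the question, the quote trick looks like this (the input line is hypothetical):
import re

line = '<img src="/images/cat-1.png" alt="cat">'
m = re.search(r'img\s+src="([^"]+)"', line)
if m:
    print(m.group(1))
# '/images/cat-1.png'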

Processing a HTML file using Python

I want to remove all the tags in an HTML file. For that I used the re module of Python.
For example, consider the line <h1>Hello World!</h1>. I want to retain only "Hello World!". In order to remove the tags, I used re.sub('<.*>','',string). For obvious reasons the result I get is an empty string (the regexp matches from the first angle bracket to the last and removes everything in between). How can I get over this issue?
You can make the match non-greedy: '<.*?>'
You also need to be careful: HTML is a crafty beast and can thwart your regexes.
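A quick illustration of the difference (sketch):
import re

s = '<h1>Hello World!</h1>'
print(re.sub('<.*>', '', s))   # greedy: matches from the first < to the last >, leaves ''
print(re.sub('<.*?>', '', s))  # non-greedy: removes each tag, leaves 'Hello World!'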
Parse the HTML using BeautifulSoup, then only retrieve the text.
Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy
Off-topic: the approach that uses regular expressions is error-prone; it cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
Use a parser, either lxml or BeautifulSoup:
import lxml.html
print(lxml.html.fromstring(mystring).text_content())
Related questions:
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
Beautiful Soup is great for parsing HTML!
You might not require it now, but it's worth learning to use; it will help you in the future too.

How to remove tags from a string in python using regular expressions? (NOT in HTML)

I need to remove tags from a string in python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in Python. I'm using this particularly for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
This should work:
import re
re.sub('<[^>]*>', '', mystring)
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
Please avoid using regex. Even though regex will work on your simple string, you'd run into problems in the future if you hit a more complex one.
You can use BeautifulSoup's get_text() feature.
from bs4 import BeautifulSoup
text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())
Searching for this regex and replacing matches with an empty string should work.
/<[A-Za-z\/][^>]*>/
Example (from the Python shell):
>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub('<[A-Za-z/][^>]*>', '', my_string))
Title
If it's only for parsing and retrieving value, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # 'Title'
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.
