Substring any kind of HTML String - python

I need to split any kind of HTML code (a string) into a list of tokens.
For example:
"<abc/><abc/>" #INPUT
["<abc/>", "<abc/>"] #OUTPUT
or
"<abc comfy><room /></abc> <br /> <abc/> " # INPUT
["<abc comfy><room /></abc>", "<br />", "<abc/>"] # OUTPUT
or
"""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""" # INPUT
[
'<meta charset="utf-8" />',
"<title> test123 </title>",
'<meta name="test" content="index,follow" />',
'<meta name="description" content="Description" />',
'<link rel="stylesheet" href="../layout/css/default.css" />',
] # OUTPUT
What I tried to do:
from typing import List

def split(html: str) -> List[str]:
    if html == "":
        return []
    delimiter = "/>"
    split_name = html.split(" ", maxsplit=1)[0]
    name = split_name[1:]
    delimited_list = [character + delimiter for character in html.split(delimiter) if character]
    rest = html.split(" ", maxsplit=1)[1]
    char_delim = html.find("</")
    ### Help
    print(delimited_list)
    return delimited_list
My output:
['<abc/>', '<abc/>']
['<abc comfy><room />', '</abc> <br />', ' <abc/>', ' />']
['<meta charset="utf-8" />', '<title> test123 </title><meta name="test" content="index,follow" />', '<meta name="description" content="Description" />', '<link rel="stylesheet" href="../layout/css/default.css" />']
So I tried to split at "/>", which works for the first case. Then I tried several other things, such as identifying the "name", i.e. the first identifier in the HTML string, like "abc".
Do you have any idea how to continue?
Thanks!
Greetings
Nick

You will need a stack. Iterate over the string and push the position of each opening tag onto the stack. When you encounter a closing or self-closing tag, assume one of the following:
1) it is a closing tag whose name matches the name of the tag beginning at the position on top of the stack
2) it is a self-closing tag
We also maintain a result list to save the parsed substrings.
For 1), we pop the position on top of the stack and save the substring from that position to the end of the closing tag to the result list.
For 2), we do not modify the stack and only save the self-closing tag substring to the result list.
After encountering any tag (opening, closing, or self-closing), we advance the iterator (i.e. the current position pointer) by the length of that tag (from < to the corresponding >).
If the HTML string from the iterator onward does not start with a tag, we advance the iterator by one character (we crawl until we can match a tag again).
Here is my attempt:
import re

def split(html):
    if html == "":
        return []
    openingTagPattern = r"<([a-zA-Z]+)(?:\s[^>]*)*(?<!\/)>"
    closingTagPattern = r"<\/([a-zA-Z]+).*?>"
    selfClosingTagPattern = r"<([a-zA-Z]+).*?\/>"
    result = []
    stack = []
    i = 0
    while i < len(html):
        match = re.match(openingTagPattern, html[i:])
        if match: # opening tag
            stack.append(i) # push position of start of opening tag onto stack
            i += len(match[0])
            continue
        match = re.match(closingTagPattern, html[i:])
        if match: # closing tag
            i += len(match[0])
            result.append(html[stack.pop():i]) # pop position of start of corresponding opening tag from stack
            continue
        match = re.match(selfClosingTagPattern, html[i:])
        if match: # self-closing tag
            start = i
            i += len(match[0])
            result.append(html[start:i])
            continue
        i+=1 # otherwise crawl until we can match a tag
    return result # reached the end of the string
Usage:
delimitedList = split("""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""")
for item in delimitedList:
    print(item)
Output:
<meta charset="utf-8" />
<title> test123 </title>
<meta name="test" content="index,follow" />
<meta name="description" content="Description" />
<link rel="stylesheet" href="../layout/css/default.css" />
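Note that the function above also records nested elements: for the question's second example it would emit "<room />" on its own before the enclosing "<abc comfy>...</abc>". A minimal sketch of one way to keep only top-level elements is to track the nesting depth and record a substring only when the depth is back at zero. The name split_top_level and the simplified tag pattern are assumptions made for this illustration, not part of the answer, and the sketch assumes well-formed input in which void elements are written in self-closing form (for example "<br />"):

```python
import re

# Simplified tag pattern (an assumption for this sketch): an optional "/" for
# closing tags, a tag name, attribute text without ">", an optional trailing "/".
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>")


def split_top_level(html):
    """Return the top-level elements of a well-formed HTML fragment."""
    result = []
    depth = 0      # number of elements currently open
    start = None   # index where the current top-level element began
    for match in TAG.finditer(html):
        is_closing = match.group(1) == "/"
        is_self_closing = match.group(3) == "/"
        if is_self_closing:
            # Only record self-closing tags that are not nested.
            if depth == 0:
                result.append(match.group(0))
        elif is_closing:
            depth -= 1
            # Back at depth zero: the top-level element is complete.
            if depth == 0:
                result.append(html[start:match.end()])
        else:
            if depth == 0:
                start = match.start()
            depth += 1
    return result


print(split_top_level("<abc comfy><room /></abc> <br /> <abc/> "))
# ['<abc comfy><room /></abc>', '<br />', '<abc/>']
```

Because it only counts depth, this sketch does not check that closing tag names match their opening tags, and unclosed tags such as a bare "<br>" would throw the depth count off.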
References:
The openingTagPattern is inspired by @Kobi's answer here: https://stackoverflow.com/a/1732395/12109043

Related

How To Parse Improper JSON in Python?

I'm trying to parse the testObj in the html into JSON, but it includes so much formatting.
I already tried to remove the non-ascii characters in the object, but json.loads() and yaml still can't parse the string into an object.
How can I parse the string into an object?
html
<!DOCTYPE html>
<html lang="en">
<head>
<title>Sample Document</title>
</head>
<body></body>
<script>
const testObj = {
a: 1,
b: 2,
c: 3,
};
</script>
</html>
Python Script
import lxml.html
import urllib.request
import os
import json
import yaml
def removeNonAscii(str):
    return ''.join(i for i in str if ord(i)>31 and ord(i)<126)

with urllib.request.urlopen('file:///'+os.path.abspath('./test.html')) as url:
    page = url.read()
tree = lxml.html.fromstring(page)
x = tree.xpath("//script")[0].text_content()
json_str = x.strip().split('testObj = ')[1][:-1]
str = removeNonAscii(json_str)
print(str)
# >>> {a: 1,b: 2,c: 3,}
# Attempt 1 - This doesn't work as object doesn't originally have double quotes
# data = json.loads(str)
# >>> json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes
# Attempt 2 - Not sure how to detect or get rid of formatting
# data = yaml.load(str, yaml.SafeLoader)
# >>> ScannerError: While scanning for the next token found character '\t' that cannot start any token
print(data.a)
# >>> Should return 1
Edit: In my actual use case, the JSON object is very large and I cannot recreate the string. I need to remove the formatting and/or add double quotes to make it proper JSON so it can be parsed, but I am not sure how to do it. I can get it as far as {a: 1,b: 2,c: 3,}, but it still won't parse.
If it is as shown (not minified), you can use the following regex to extract the string and then hjson to parse it, since hjson accepts unquoted keys.
import hjson, re
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
<title>Sample Document</title>
</head>
<body></body>
<script>
const testObj = {
a: 1,
b: 2,
c: 3,
};
</script>
</html>'''
s = re.search(r'const testObj = ([\s\S]+?);', html).group(1)
res = hjson.loads(s)
print(res)
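If installing hjson is not an option, a rough standard-library alternative is to quote the bare keys and drop trailing commas with regular expressions before calling json.loads. This is only a sketch under the assumption that the object looks like the sample (simple identifier keys, and no string values containing braces, commas, or colons that the patterns could mistake for structure); the script variable below is a made-up stand-in for the extracted script text:

```python
import json
import re

# Hypothetical sample mirroring the <script> content in the question.
script = """
const testObj = {
    a: 1,
    b: 2,
    c: 3,
};
"""

# Same extraction regex as in the answer above.
literal = re.search(r'const testObj = ([\s\S]+?);', script).group(1)

# Quote bare keys: an identifier that follows "{" or "," and precedes ":".
quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', literal)

# Drop trailing commas that sit directly before a closing brace or bracket.
cleaned = re.sub(r',(\s*[}\]])', r'\1', quoted)

data = json.loads(cleaned)
print(data["a"])  # 1
```

For large or irregular objects a dedicated parser such as hjson is the safer choice, because regular expressions cannot tell keys apart from similar-looking text inside string values.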

How to extract tags from HTML file and write them to a new file?

My HTML file has the format shown below
<unit id="2" status="FINISHED" type="pe">
<S producer="Alice_EN">CHAPTER I Down the Rabbit-Hole</S>
<MT producer="ALICE_GG">CAPÍTULO I Abaixo do buraco de coelho</MT>
<annotations revisions="1">
<annotation r="1">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
CAPÍTULO I Descendo pela toca do coelho
</body>
</html></PE>
I need to extract ALL the content from two tags in the entire HTML file. The content of one of the tags, which starts with <unit id ...>, is on one line, but the content of the other tag, which starts with "<PE producer ..." and ends with "</PE>", is spread over several lines. I need to extract the content within these two tags and write it to a new file, one after another. My output should be:
<unit id="2" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
CAPÍTULO I Descendo pela toca do coelho
</body>
</html></PE>
My code does not extract the content from all the tags in the file. Does anyone have a clue what is going on and how I can make this code work properly?
import codecs
import re
t=codecs.open('ALICE.per1_replaced.html','r')
t=t.read()
unitid=re.findall('<unit.*?"pe">', t)
PE=re.findall('<PE.*?</PE>', t, re.DOTALL)
for i in unitid:
    for j in PE:
        a=i + '\n' + j + '\n'
        with open('PEtags.txt','w') as fi:
            fi.write(a)
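The problem is in the part of the code where you loop through the matches and write them to the file. The nested loops combine every unitid match with every PE match, and because the file is opened in 'w' mode, earlier content is overwritten, so only the last combination ends up in the file.
If your unitid and PE match counts are the same, you can pair the matches with zip and open the output file only once:
import re
with open('ALICE.per1_replaced.html','r') as t:
    contents = t.read()
unitid=re.findall('<unit.*?"pe">', contents)
PE=re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtags.txt','w') as fi:
    for i, p in zip(unitid, PE):
        fi.write( "{}\n{}\n".format(i, p) )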
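As a quick check of the pairing logic, here is a minimal sketch that runs the same two patterns against an in-memory sample instead of a file. The sample string and the helper name pair_units_with_pe are made up for illustration:

```python
import re

# Hypothetical two-unit sample in the same shape as the question's file.
sample = '''<unit id="1" status="FINISHED" type="pe">
<PE producer="A1"><html>
<body>
first
</body>
</html></PE>
<unit id="2" status="FINISHED" type="pe">
<PE producer="A1"><html>
<body>
second
</body>
</html></PE>'''


def pair_units_with_pe(text):
    """Pair the n-th <unit ...> opening tag with the n-th <PE>...</PE> block."""
    units = re.findall(r'<unit.*?"pe">', text)
    pes = re.findall(r'<PE.*?</PE>', text, re.DOTALL)
    # zip stops at the shorter list, so unmatched extras are silently dropped.
    return ["{}\n{}\n".format(u, p) for u, p in zip(units, pes)]


chunks = pair_units_with_pe(sample)
print(len(chunks))  # 2
print(chunks[1])
```

If the two counts can differ in the real file, comparing len(units) and len(pes) before zipping would make the mismatch visible instead of silently losing data.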

Using Templates Safe Substitute Function in a Loop

I have this simple code:
html_string = '''<html lang="en-US">
'<head>
<title>My Python articles</title>
</head>
<body>'''
for i in range(2):
    html_string += '''
<p>
<span style="white-space: pre-line">$''' + str(i) + '''</span>
</p>'''
html_string += '''</body>
</html>'''
html_template = Template(html_string)
output_dir = "./html/"
output_path = os.path.join(output_dir, 'my_page.html')
with io.open(output_path, 'w+', encoding='UTF-8', errors='replace') as html_output:
    for i in range(2):
        html_output.write(html_template.safe_substitute(i="Hello"))
    html_output.truncate()
It looks like the i in the html_output.write(html_template.safe_substitute(i="Hello")) doesn't correspond to the i in the for loop and all I get is:
$0
$1
$0
$1
$0 and $1 need to appear only once, and each of them has to be replaced with the word Hello. Later I'll be replacing $0 and $1 each with a different input.
The docs for template strings have this to say about substitution identifiers:
By default, "identifier" is restricted to any case-insensitive ASCII alphanumeric string (including underscores) that starts with an underscore or ASCII letter.
Identifiers like "$0" and "$1" don't satisfy this condition, because they start with an ASCII digit.
Inserting a letter between the "$" and the digit like this ought to work:
import os
from string import Template

html_string = '''<html lang="en-US">
'<head>
<title>My Python articles</title>
</head>
<body>'''
# Make substitution identifiers like "$Ti"
for i in range(2):
    html_string += '''
<p>
<span style="white-space: pre-line">$T''' + str(i) + '''</span>
</p>'''
html_string += '''</body>
</html>'''
html_template = Template(html_string)
# Map identifiers to values
mapping = {'T' + str(i): 'Hello' for i in range(2)}
output_dir = "./html/"
output_path = os.path.join(output_dir, 'my_page.html')
with open(output_path, 'w+', encoding='UTF-8', errors='replace') as html_output:
    html_output.write(html_template.safe_substitute(mapping))
    html_output.truncate()
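The identifier rule quoted from the docs can be checked in isolation. The following minimal sketch uses a made-up one-line template to show that safe_substitute leaves "$0" untouched while replacing "$T0", and that the stricter substitute raises an error on the invalid placeholder:

```python
from string import Template

template = Template("<span>$0</span> <span>$T0</span>")

# "$0" is not a valid placeholder, so safe_substitute leaves it as-is;
# "$T0" is valid and gets replaced.
result = template.safe_substitute({"0": "ignored", "T0": "Hello"})
print(result)  # <span>$0</span> <span>Hello</span>

# substitute() is stricter and raises ValueError on the invalid "$0".
try:
    template.substitute({"0": "ignored", "T0": "Hello"})
    raised = False
except ValueError as exc:
    raised = True
    print("substitute() failed:", exc)
```

This is why the original loop appeared to do nothing: safe_substitute does not complain about placeholders it cannot use, it simply returns them unchanged.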

Boolean not registering as true for string equivalence

I'm parsing some XML with BeautifulSoup and have data that looks like:
soup.findAll('title')
<title>main title</title>
<title>other title</title>
<title>another title</title>
When iterating over the tags, I want to skip the first title. So I have:
for e in soup.findAll('title'):
    if e == '<title>main title</title>':
        pass
    else:
        print (e)
This still returns all the titles, including main title. I've tried removing the title tags as well.
Thanks for any help.
Your comparison does not work because you are comparing an object of <class 'bs4.element.Tag'> to a string, which is always False. You could convert the tag to a string and then compare them.
Try this:
for e in soup.find_all("title"):
    if str(e) == '<title>main title</title>':
        pass
    else:
        print (e)
Output:
<title>other title</title>
<title>another title</title>
If you want to skip the first title, then a better solution is to slice the list:
>>> soup.findAll('title')[1:]
[<title>other title</title>, <title>another title</title>]
You can check the text attribute of the node, instead of the node itself.
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<title>main title</title>
<title>other title</title>
<title>another title</title>""", "html.parser")
for e in soup.find_all("title"):
    if e.text != 'main title':
        print(e)
#<title>other title</title>
#<title>another title</title>

Line numbers in Pygments code highlight in xampp on Windows

I have configured XAMPP on Windows to work with Python 2.7 and Pygments. My PHP code is highlighted properly by Pygments on the website: the output has colors, span elements, and classes.
But I cannot get line numbers.
From the tutorials I have read, this depends on the linenos value in the Python script, which should be 'table', 'inline', 1, or True.
But that does not work for me. I still get the same final code:
<!doctype html>
<html lang="pl">
<head>
<meta charset="UTF-8">
<title>Document</title>
<link rel="stylesheet" href="gh.css">
</head>
<body>
<div class="highlight highlight-php"><pre><code><span class="nv">$name</span> <span class="o">=</span> <span class="s2">"Jaś"</span><span class="p">;</span>
<span class="k">echo</span> <span class="s2">"Zażółć gęślą jaźń, "</span> <span class="o">.</span> <span class="nv">$name</span> <span class="o">.</span> <span class="s1">'.'</span><span class="p">;</span>
<span class="k">echo</span> <span class="s2">"hehehe#jo.io"</span><span class="p">;</span>
</code></pre></div>
</html>
How to add line numbers? I put two files of the website below:
index.py
import sys
from pygments import highlight
from pygments.formatters import HtmlFormatter
# If there isn't only 2 args something weird is going on
expecting = 2;
if ( len(sys.argv) != expecting + 1 ):
    exit(128)
# Get the code
language = (sys.argv[1]).lower()
filename = sys.argv[2]
f = open(filename, 'rb')
code = f.read()
f.close()
# PHP
if language == 'php':
    from pygments.lexers import PhpLexer
    lexer = PhpLexer(startinline=True)
# GUESS
elif language == 'guess':
    from pygments.lexers import guess_lexer
    lexer = guess_lexer( code )
# GET BY NAME
else:
    from pygments.lexers import get_lexer_by_name
    lexer = get_lexer_by_name( language )
# OUTPUT
formatter = HtmlFormatter(linenos='table', encoding='utf-8', nowrap=True)
highlighted = highlight(code, lexer, formatter)
print highlighted
index.php
<?php
define('MB_WPP_BASE', dirname(__FILE__));
function mb_pygments_convert_code($matches)
{
    $pygments_build = MB_WPP_BASE . '/index.py';
    $source_code = isset($matches[3]) ? $matches[3] : '';
    $class_name = isset($matches[2]) ? $matches[2] : '';
    // Creates a temporary filename
    $temp_file = tempnam(sys_get_temp_dir(), 'MB_Pygments_');
    // Populate temporary file
    $filehandle = fopen($temp_file, "w");
    fwrite($filehandle, html_entity_decode($source_code, ENT_COMPAT, 'UTF-8'));
    fclose($filehandle);
    // Creates pygments command
    $language = $class_name ? $class_name : 'guess';
    $command = sprintf('C:\Python27/python %s %s %s', $pygments_build, $language, $temp_file);
    // Executes the command
    $retVal = -1;
    exec($command, $output, $retVal);
    unlink($temp_file);
    // Returns Source Code
    $format = '<div class="highlight highlight-%s"><pre><code>%s</code></pre></div>';
    if ($retVal == 0)
        $source_code = implode("\n", $output);
    $highlighted_code = sprintf($format, $language, $source_code);
    return $highlighted_code;
}
// This prevent throwing error
libxml_use_internal_errors(true);
// Get all pre from post content
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding('
<pre class="php">
<code>
$name = "Jaś";
echo "Zażółć gęślą jaźń, " . $name . \'.\';
echo "<address>hehehe#jo.io</address>";
</code>
</pre>', 'HTML-ENTITIES', "UTF-8"), LIBXML_HTML_NODEFDTD);
$pres = $dom->getElementsByTagName('pre');
foreach ($pres as $pre) {
    $class = $pre->attributes->getNamedItem('class')->nodeValue;
    $code = $pre->nodeValue;
    $args = array(
        2 => $class, // Element at position [2] is the class
        3 => $code // And element at position [3] is the code
    );
    // convert the code
    $new_code = mb_pygments_convert_code($args);
    // Replace the actual pre with the new one.
    $new_pre = $dom->createDocumentFragment();
    $new_pre->appendXML($new_code);
    $pre->parentNode->replaceChild($new_pre, $pre);
}
// Save the HTML of the new code.
$newHtml = "";
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $newHtml .= $dom->saveHTML($child);
}
?>
<!doctype html>
<html lang="pl">
<head>
<meta charset="UTF-8">
<title>Document</title>
<link rel="stylesheet" href="gh.css">
</head>
<body>
<?= $newHtml ?>
</body>
</html>
Thank you
When reading the file, try readlines:
f = open(filename, 'rb')
code = f.readlines()
f.close()
That way the following formatter call should receive multiple lines:
formatter = HtmlFormatter(linenos='table', encoding='utf-8', nowrap=True)
Suggestion:
A more Pythonic way of opening files is:
with open(filename, 'rb') as f:
    code = f.readlines()
That's it; the context manager closes the file for you.
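Solved! The cause was the nowrap option. From the Pygments documentation:
nowrap
If set to True, don't wrap the tokens at all, not even inside a <pre> tag. This disables most other options (default: False).
http://pygments.org/docs/formatters/#HtmlFormatter
Since nowrap=True disables most other options, linenos had no effect while it was set; removing nowrap=True from the HtmlFormatter call lets the line numbers appear.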
