Speed up parser: HTML into Database

Speed up parser: HTML into Database - python

I need insert all html tags and attributes into database
el.driver.get(url_page)
txthtml = el.driver.page_source
soup = BeautifulSoup(txthtml, "html.parser")
body = soup.find('html')
html_parse(body, el, url_page_id, 0, 0, 0,url_page)
def html_parse(html, el, url_page_id, level, i, parent_id, url_page):
txt = ""
if len(html.text) > 0:
txt = html.text.replace("\n","").replace("\t","").replace("\r","")
ta = tag_list()
ta.p_id = el.id
ta.page_id = url_page_id
ta.level = level
ta.number = i
ta.txt = txt
ta.name = html.name
ta.parent_id = parent_id
ta.html = str(html)
ta.save()
insert_attr(html, el.id, url_page_id, ta.id, url_page)
children = list(html.children)
j = 0
for child in children:
if child.name is None:
continue
j = j + 1
html_parse(child, el, url_page_id, level + 1, j, ta.id, url_page)
When I have recursive function html_parse
html - current html object
el - driver class
url_page_id - id of page
level - level in DOM
i - childe number
parent_id - id of parent
url_page - current URL
tag_list - insert current tag
insert_attr - insert into database attrs of tag
Every html_parse function run fast, but full html parsing run about 4-5 minutes per big html page.
How I can speed up the code?

Related

What is the best way to parse large XML and genarate a dataframe with the data in the XML (with python or else)?

I try to make a table (or csv, I'm using pandas dataframe) from the information of an XML file.
The file is here (.zip is 14 MB, XML is ~370MB), https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information of different languages - node.js, python, java etc. aka, CPE 2.3 list by the US government org NVD.
this is how it looks like in the first 30 rows:
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
<generator>
<product_name>National Vulnerability Database (NVD)</product_name>
<product_version>4.9</product_version>
<schema_version>2.3</schema_version>
<timestamp>2022-03-17T03:51:01.909Z</timestamp>
</generator>
<cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
<title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
<references>
<reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
<reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
</references>
<cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
</cpe-item>
The tree structure of the XML file is quite simple, the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.
From 'title', I want the text in the element;
From 'cpe23-item', I want the attribute 'name';
From 'references', I want the attributes 'href' from its great-grandchildren, 'reference'.
The dataframe should look like this:
| cpe23_name | title_text | ref1 | ref2 | ref3 | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN | NaN
1 | 'cpe23name 2'| 'this is a java pkg' | 'url1'| 'url2'| NaN | NaN
...
my code is here，finished in ~100sec：
import xml.etree.ElementTree as et
xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()
import time
start_time = time.time()
df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']
title = '{http://cpe.mitre.org/dictionary/2.0}title'
ref = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []
i = 0
while i < len(xroot):
for elm in xroot[i]:
if elm.tag == title:
p_text = elm.text
#assign p_text
elif elm.tag == ref:
for nn in elm:
s = nn.text.lower()
#check the lower text in refs
if 'version' in s:
p_vers = nn.attrib.get('href')
#assign p_vers
elif 'advisor' in s:
p_advi = nn.attrib.get('href')
#assign p_advi
elif 'product' in s:
p_prod = nn.attrib.get('href')
#assign p_prod
elif 'vendor' in s:
p_vend = nn.attrib.get('href')
#assign p_vend
elif 'change' in s:
p_chan = nn.attrib.get('href')
#assign p_vend
else:
p_othe = nn.attrib.get('href')
elif elm.tag == cpe_item:
p_cpe = elm.attrib.get("name")
#assign p_cpe
else:
print(elm.tag)
row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
rows.append(row)
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
print(len(rows)) #this shows how far I got during the running time
i+=1
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent
print("---853k rows take %s seconds ---" % (time.time() - start_time))
updated: the faster way is to move the 2nd last row out side the loop. Since 'rows' already get info in each loop, there is no need to make a new dataframe every time.
the running time now is 136.0491042137146 seconds. yay!

Since your XML is fairly flat, consider the recently added IO module, pandas.read_xml introduced in v1.3. Given XML uses a default namespace, to reference elements in xpath use namespaces argument:
url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)
If you do not have the default parser, lxml, installed, use the etree parser:
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)

How to fetch Total No of pages count in Python-docx?

I am learning Python-docx and I am using this solution to add page number as given on stackoverflow by #Utkarsh Dalal
def create_element(name):
return OxmlElement(name)
def create_attribute(element, name, value):
element.set(nsqn(name), value)
def add_page_number(paragraph):
paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
page_run = paragraph.add_run()
t1 = create_element('w:t')
create_attribute(t1, 'xml:space', 'preserve')
t1.text = 'Page '
page_run._r.append(t1)
page_num_run = paragraph.add_run()
fldChar1 = create_element('w:fldChar')
create_attribute(fldChar1, 'w:fldCharType', 'begin')
instrText = create_element('w:instrText')
create_attribute(instrText, 'xml:space', 'preserve')
instrText.text = "PAGE"
fldChar2 = create_element('w:fldChar')
create_attribute(fldChar2, 'w:fldCharType', 'end')
page_num_run._r.append(fldChar1)
page_num_run._r.append(instrText)
page_num_run._r.append(fldChar2)
of_run = paragraph.add_run()
t2 = create_element('w:t')
create_attribute(t2, 'xml:space', 'preserve')
t2.text = ' of '
of_run._r.append(t2)
fldChar3 = create_element('w:fldChar')
create_attribute(fldChar3, 'w:fldCharType', 'begin')
instrText2 = create_element('w:instrText')
create_attribute(instrText2, 'xml:space', 'preserve')
instrText2.text = "NUMPAGES"
fldChar4 = create_element('w:fldChar')
create_attribute(fldChar4, 'w:fldCharType', 'end')
num_pages_run = paragraph.add_run()
num_pages_run._r.append(fldChar3)
num_pages_run._r.append(instrText2)
num_pages_run._r.append(fldChar4)
for pg in num_pages_run.element:
print(pg.text)
doc = Document()
add_page_number(doc.sections[0].footer.paragraphs[0])
doc.save("your_doc.docx")
I receive the Page number in format of Page x of y.
But when I try to access the value of 'y' for total number of pages, I could not do it. I have tried to access it getting the text attribute of num_pages_run as pg.text. But all I get is NUMPAGES as output instead of number of pages.
I am looking for this feature because I would like to do some actions whenever a new page is added to the document.
Is there a way to get total pages from python-docx or any other alternative?

Cyclic relations between nodes A>B>C>A

This is how it looks like before and after, for the problem am trying to solve using Python. I have been trying for weeks. And am failing so miserable to tell Python to do the following:
STEP1: If you find on this document: "LinkedTo=" * (Example value: Node_3)*
STEP2: Then Stop
STEP3: Go to the previous NodePosX= and copy the value * (Example value: 10)*
STEP4: Go to the previous NotePosY= and copy the value * (Example value: 100)*
STEP5: Then find the next "Node_3" on the document
STEP6: And replace inside the NodePosX=30 and NodePosY=300 for the copied values 10 and 100
STEP7: Then look for the next "LinkedTo=" * (Example value: Node_5)* and repeat the STEP2 to STEP5
This is how it looks like Before running the Python script:
Begin
Name="Node_1"
NodePosX=10
NodePosY=100
LinkedTo=Node_3
LinkedTo=Node_5
End Object
Begin
Name="Node_2"
NodePosX=20
NodePosY=200
End Object
Begin
Name="Node_3"
NodePosX=30
NodePosY=300
End Object
Begin
Name="Node_4"
NodePosX=40
NodePosY=400
End Object
Begin
Name="Node_5"
NodePosX=50
NodePosY=500
End Object
This is how it should look like AFTER running the Python script:
Begin
Name="Node_1"
NodePosX=10
NodePosY=100
LinkedTo=Node_3
LinkedTo=Node_5
End Object
Begin
Name="Node_2"
NodePosX=20
NodePosY=200
End Object
Begin
Name="Node_3"
NodePosX=10
NodePosY=100
End Object
Begin
Name="Node_4"
NodePosX=40
NodePosY=400
End Object
Begin
Name="Node_5"
NodePosX=10
NodePosY=100
End Object
Do you think am asking to much from Python to do?
Any better suggestions for the title to this problem?

I hired a developer and this is the code they wrote
'''
By: Alex Reichenbach
'''
import re
begin_regex = re.compile("Begin")
name_regex = "(?<=Name=\"Node_).*(?=\")"
posX_regex = "(?<=NodePosX=).*"
posY_regex = "(?<=NodePosY=).*"
linkedTo_regex = "(?<=LinkedTo=).*"
end_regex = re.compile("End Object")
## Reading the contents of the file
text = open("1-Example-Original.txt", "r").read()
class Node:
def __init__(self):
self.name = ""
self.nodePosX = 0
self.nodePosY = 0
self.linked_to = []
def __str__(self):
linked = ""
for l in self.linked_to:
linked += "\nLinkedTo="+l
return """Begin
Name="%s"
NodePosX=%s
NodePosY=%s%s
End Object
"""%(self.name, self.nodePosX, self.nodePosY, linked)
## Read the text into the node objects
nodes = []
current_node = None
for line in text.split('\n'): ## Iterate through each line
if begin_regex.match(line): ## Begin
current_node = Node()
nodes.append(current_node)
elif re.findall(name_regex, line): ## Name
name = re.findall(name_regex, line)[0]
current_node.name = name
elif re.findall(posX_regex, line): ## PosX
posX = re.findall(posX_regex, line)[0]
current_node.nodePosX = posX
elif re.findall(posY_regex, line): ## PosY
posY = re.findall(posY_regex, line)[0]
current_node.nodePosY = posY
elif re.findall(linkedTo_regex, line): ## LinkedTo
name = re.findall(linkedTo_regex, line)[0]
current_node.linked_to.append(name)
## Copy the linked_to attributes
for i in range(len(nodes)):
for j in range(i, len(nodes)):
node1 = nodes[i]
node2 = nodes[j]
if node2.name in node1.linked_to:
node2.nodePosX = node1.nodePosX
node2.nodePosY = node1.nodePosY
## Print it all out
s = ""
for node in nodes:
s += str(node)
print(s)
## Write to File?
open("_edited.txt", "w").write(s)

Get contents from Qtable which is a child of QTreeWidget

I've created a treewidget with children which contain tables. I'd like to acces the contents of the QtableWidget but I cannot find how to do this?
The treewidget looks like:
I've generated the treewidget like:
software = QTreeWidgetItem(['Software'])
hardware = QTreeWidgetItem(['Hardware'])
beide = QTreeWidgetItem(['Beide'])
andere = QTreeWidgetItem(['Andere'])
i = 0
for key, value in sorted(data.items()):
if value['Subtype'] == 'Software':
sub = software
if value['Subtype'] == 'Hardware':
sub = hardware
if value['Subtype'] == 'Beide':
sub = beide
if value['Subtype'] == 'Andere':
sub = andere
l1 = QTreeWidgetItem(sub)
if value['Privacy'] == 'Voorzichtig':
l1.setBackgroundColor(0, QColor('orange'))
if value['Privacy'] == 'Vertrouwelijk':
l1.setBackgroundColor(0, QColor('red'))
l1.setTextColor(0, QColor('white'))
l1.setText(0, value['sDesc'])
self.treeMainDisplay.addTopLevelItem(l1)
l1_child = QTreeWidgetItem(l1)
self.item_table = QTableWidget()
self.item_table.verticalHeader().setVisible(False)
self.item_table.horizontalHeader().setVisible(False)
self.item_table.setColumnCount(5)
self.item_table.setRowCount(5)
c1_item = QTableWidgetItem("%s" % value['sDesc'].encode('utf-8'))
self.item_table.setItem(0, 0, c1_item)
c2_item = QTableWidgetItem("%s" % value['Type'].encode('utf-8'))
self.item_table.setItem(1,0, c2_item)
qt_child = self.treeMainDisplay.setItemWidget(l1_child, 0, self.item_table)
self.treeMainDisplay.addTopLevelItem(software)
self.treeMainDisplay.addTopLevelItem(hardware)
self.treeMainDisplay.addTopLevelItem(beide)
self.treeMainDisplay.addTopLevelItem(andere)
I'm iterating over the treewidgetitems but don't know how to access the table contents:
def testItems(self):
iterator = QTreeWidgetItemIterator(self.treeMainDisplay)
while iterator.value():
item = iterator.value()
if not item.text(0):
#Get Table Object?
# item.item(0,0).text()
else:
print item.text(0)
iterator += 1
It seems I can't get acces to the QTableWidget object, I only get the QTreeWidgetItem object.
All feedback is highly appreciated!

The item widgets must be access via the tree-widget using the itemWidget method:
def testItems(self):
iterator = QTreeWidgetItemIterator(self.treeMainDisplay)
while iterator.value():
item = iterator.value()
if not item.text(0):
# Get Table Object
table = self.treeMainDisplay.itemWidget(item, 0)
else:
print item.text(0)
iterator += 1

Generate a table of contents from HTML with Python

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.
My plan so far was to:
Extract a list of headers using beautifulsoup
Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- There might be a method for replacing inside beautifulsoup?
Output a nested list of links to the headers in a predefined spot.
It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.
Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?
A example:
<p>This is an introduction</p>
<h2>This is a sub-header</h2>
<p>...</p>
<h3>This is a sub-sub-header</h3>
<p>...</p>
<h2>This is a sub-header</h2>
<p>...</p>

Some quickly hacked ugly piece of code:
soup = BeautifulSoup(html)
toc = []
header_id = 1
current_list = toc
previous_tag = None
for header in soup.findAll(['h2', 'h3']):
header['id'] = header_id
if previous_tag == 'h2' and header.name == 'h3':
current_list = []
elif previous_tag == 'h3' and header.name == 'h2':
toc.append(current_list)
current_list = toc
current_list.append((header_id, header.string))
header_id += 1
previous_tag = header.name
if current_list != toc:
toc.append(current_list)
def list_to_html(lst):
result = ["<ul>"]
for item in lst:
if isinstance(item, list):
result.append(list_to_html(item))
else:
result.append('<li>%s</li>' % item)
result.append("</ul>")
return "\n".join(result)
# Table of contents
print list_to_html(toc)
# Modified HTML
print soup

Use lxml.html.
It can deal with invalid html just fine.
It is very fast.
It allows you to easily create the missing elements and move elements around between the trees.

I have come with an extended version of the solution proposed by Łukasz's.
def list_to_html(lst):
result = ["<ul>"]
for item in lst:
if isinstance(item, list):
result.append(list_to_html(item))
else:
result.append('<li>{}</li>'.format(item[0], item[1]))
result.append("</ul>")
return "\n".join(result)
soup = BeautifulSoup(article, 'html5lib')
toc = []
h2_prev = 0
h3_prev = 0
h4_prev = 0
h5_prev = 0
for header in soup.findAll(['h2', 'h3', 'h4', 'h5', 'h6']):
data = [(slugify(header.string), header.string)]
if header.name == "h2":
toc.append(data)
h3_prev = 0
h4_prev = 0
h5_prev = 0
h2_prev = len(toc) - 1
elif header.name == "h3":
toc[int(h2_prev)].append(data)
h3_prev = len(toc[int(h2_prev)]) - 1
elif header.name == "h4":
toc[int(h2_prev)][int(h3_prev)].append(data)
h4_prev = len(toc[int(h2_prev)][int(h3_prev)]) - 1
elif header.name == "h5":
toc[int(h2_prev)][int(h3_prev)][int(h4_prev)].append(data)
h5_prev = len(toc[int(h2_prev)][int(h3_prev)][int(h4_prev)]) - 1
elif header.name == "h6":
toc[int(h2_prev)][int(h3_prev)][int(h4_prev)][int(h5_prev)].append(data)
toc_html = list_to_html(toc)

How do I generate a table of contents for HTML text in Python?
But I think you are on the right track and reinventing the wheel will be fun.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Speed up parser: HTML into Database - python

Related

What is the best way to parse large XML and genarate a dataframe with the data in the XML (with python or else)?

How to fetch Total No of pages count in Python-docx?

Cyclic relations between nodes A>B>C>A

Get contents from Qtable which is a child of QTreeWidget

Generate a table of contents from HTML with Python

Categories

Resources