Clean up HTML using lxml and XPath in Python

I'm learning Python and the lxml toolkit. I need to process multiple .htm files in a local directory (recursively) and remove unwanted tags together with their content (divs with the IDs "box", "columnRight", "adbox", "footer", divs with class="box", plus all stylesheets and scripts).
I can't figure out how to do this. I have code that lists all .htm files in the directory:
#!/usr/bin/python
import os
from lxml import html
import lxml.html as lh

path = '/path/to/directory'
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            doc = lh.parse(os.path.join(root, name))  # join root and name to get a valid path
So I need to add the part that creates a tree, processes the HTML, and removes the unwanted divs, like:
for element in tree.xpath('//div[@id="header"]'):
    element.getparent().remove(element)
How do I adjust the code to do this?

It's hard to tell without seeing your actual files, but try the following and see if it works:
First, you don't need both
from lxml import html
import lxml.html as lh
So you can drop the first import. Then:
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            tree = lh.parse(os.path.join(root, name))
            doc_root = tree.getroot()  # renamed so it doesn't clobber os.walk's root
            for element in doc_root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)
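To cover everything listed in the question (the divs with those IDs, div class="box", plus stylesheets and scripts) and save the result, a sketch along these lines should work. The id list, the single combined XPath, and the choice to overwrite each file in place are assumptions, not part of the original answer:

import os
import lxml.html as lh

path = '/path/to/directory'
unwanted_ids = ["box", "columnRight", "adbox", "footer"]  # ids taken from the question

# one XPath union covering the unwanted divs, class="box" divs, stylesheets and scripts
xpath = (
    '//div[' + ' or '.join('@id="%s"' % i for i in unwanted_ids) + ']'
    ' | //div[@class="box"]'
    ' | //style | //link[@rel="stylesheet"] | //script'
)

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            fname = os.path.join(root, name)
            tree = lh.parse(fname)
            for element in tree.xpath(xpath):
                element.getparent().remove(element)
            tree.write(fname, method="html")  # overwrite the original file (assumption)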

Related

Removing hyperlinks in PowerPoint with python-pptx

Quite new to XML and the python-pptx module, I want to remove a single hyperlink that is present on every slide.
My own attempt so far has been to retrieve my files, change them to zip format, and unzip them into separate folders.
I then locate the following attribute <a:hlinkClick r:id="RelId4">
and remove it, while also removing the Relationship entry within the xml.rels file which corresponds to this slide.
I then rezip and change the extension to pptx, and this loads fine. I then tried to replicate this in Python so I can create an ongoing automation.
My attempt:
from pathlib import Path
import zipfile as zf
from pptx import Presentation
import re
import xml.etree.ElementTree as ET

path = 'mypath'
ppts = [files for files in Path(path).glob('*.pptx')]
for file in ppts:
    file.rename(file.with_suffix('.zip'))

zip_files = ppts = [files for files in Path(path).glob('*.zip')]
for zips in zip_files:
    with zf.ZipFile(zips, 'r') as zip_ref:
        zip_ref.extractall(Path(path).joinpath('zipFiles', zips.stem))
I then do some further filtering and end up with my XML files from the rels folder and the ppt/slides folder.
It's here that I get stuck: I can read my XML with the ElementTree module, but I cannot find the relevant tag to remove.
for file in normal_xmls:
    tree = ET.parse(file).getroot()
    y = tree.findall('a')
    print(y)
This yields nothing. I tried to use the python-pptx module, but the .Action.Hyperlink part doesn't seem to be a complete feature, unless I am misunderstanding the API.
To remove a hyperlink from a shape (the kind where clicking on the shape navigates somewhere), set the hyperlink address to None:
shape.click_action.hyperlink.address = None
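For the batch scenario in the question, a minimal sketch might look like the following. The folder path and the decision to overwrite each file in place are assumptions, it assumes every shape type on the slides exposes click_action, and it only clears click-hyperlinks on shapes (hyperlinks on text runs would need run.hyperlink.address instead):

from pathlib import Path
from pptx import Presentation

path = 'mypath'  # assumed folder holding the .pptx files

for pptx_file in Path(path).glob('*.pptx'):
    prs = Presentation(str(pptx_file))
    for slide in prs.slides:
        for shape in slide.shapes:
            # clearing the address removes the hlinkClick element and its relationship
            shape.click_action.hyperlink.address = None
    prs.save(str(pptx_file))  # overwrite in place (assumption)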

How do I read all HTML files in a directory recursively?

I am trying to get the Doctype of every HTML file printed to a txt file. I have no experience in Python, so bear with me a bit. :)
The final script is supposed to delete elements from each HTML file depending on the HTML version given in the file's Doctype. I've attempted to list the files in PHP as well, and it works to some extent, but I think Python is a better choice for this task.
The script below is what I've got right now, but I can't figure out how to write a "for each" that gets the Doctype of every HTML file in the arkivet folder recursively. It currently only prints the file name and extension, and I don't know how to get the path to the file, or how to use BeautifulSoup to edit and get info out of the files.
import fnmatch
from urllib.request import urlopen as uReq
import os
from bs4 import BeautifulSoup as soup
from bs4 import Doctype

files = ['*.html']
matches = []
for root, dirnames, filenames in os.walk("arkivet"):
    for extensions in files:
        for filename in fnmatch.filter(filenames, extensions):
            matches.append(os.path.join(root, filename))
            print(filename)
matches is a list, but I am not sure how to handle it properly in Python. I would like to print the folder names, the file names with extension, and each file's Doctype into a text file in the root.
The script runs in a CLI on a local Vagrant Debian server with Python 3.5 (Python 2.x is present too). All files and folders exist in a folder called arkivet (archive) under the server's public root.
Any help appreciated! I'm stuck here :)
As you did not mark any of the answers as solutions, I'm guessing you never quite got your answer. Here's a chunk of code that recursively searches for files, prints the full file path, and shows the Doctype string in the HTML file if it exists.
import os
from bs4 import BeautifulSoup, Doctype

directory = '/home/brian/Code/sof'
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            print('Filename: {}'.format(fname))
            with open(fname) as handle:
                soup = BeautifulSoup(handle.read(), 'html.parser')
                for item in soup.contents:
                    if isinstance(item, Doctype):
                        print('Doctype: {}'.format(item))
                        break
Vikas's answer is probably what you are asking for, but in case he interpreted the question incorrectly, it's worth knowing that you have access to all three of those variables as you loop: root, dirnames, and filenames. You are currently printing just the base file name:
print(filename)
It is also possible to print the full path instead:
print(os.path.join(root, filename))
Vikas solved the lack of directory name by using a different function (os.listdir), but I think that loses the ability to recurse.
A combination of os.walk, as you posted, and reading the contents of each file with open, as Vikas posted, is perhaps what you are going for? A sketch of that combination follows.
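A minimal sketch of that combination, writing one line per file (folder, file name, Doctype) to a text file in the current directory; the output file name doctypes.txt and the 'no doctype found' placeholder are assumptions:

import os
from bs4 import BeautifulSoup, Doctype

directory = 'arkivet'

with open('doctypes.txt', 'w') as out:  # output file name is an assumption
    for root, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            if filename.endswith('.html'):
                fname = os.path.join(root, filename)
                with open(fname) as handle:
                    soup = BeautifulSoup(handle.read(), 'html.parser')
                # first Doctype node in the document, if any
                doctype = next(
                    (item for item in soup.contents if isinstance(item, Doctype)),
                    'no doctype found'  # placeholder (assumption)
                )
                out.write('{}\t{}\t{}\n'.format(root, filename, doctype))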
If you want to read all the HTML files in a particular directory, you can try this:
import os
from bs4 import BeautifulSoup

directory = '/Users/xxxxx/Documents/sample/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            # parse the html as you wish

Walking through a directory path and opening files with trimesh

I have the following code:
import os
import trimesh

# Core settings
rootdir = 'path'
extension = ".zip"

for root, dirs, files in os.walk(rootdir):
    if not root.endswith(".zip"):
        for file in files:
            if file.endswith(".stl"):
                mesh = trimesh.load(file)
And I get the following error:
ValueError: File object passed as string that is not a file!
When I open the files one by one, however, it works. What could be the reason?
That's because file is just the file name, not the full file path.
Fix that by joining it with the containing directory using os.path.join:
mesh = trimesh.load(os.path.join(root,file))
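In context, the corrected loop would look something like the sketch below; the .zip check from the question is left out here for brevity:

import os
import trimesh

rootdir = 'path'  # root directory from the question

for root, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(".stl"):
            # build the full path so trimesh can actually find the file
            mesh = trimesh.load(os.path.join(root, file))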
This is not a direct answer to your question. However, you might be interested in noting that there is now a less complicated paradigm for this situation. It involves using the pathlib module.
I don't use trimesh. I will process pdf documents instead.
First, you can identify all of the pdf files in a directory and its subdirectories recursively with just a single line.
>>> from pathlib import Path
>>> path = Path('C:/Quantarctica2')  # directory inferred from the output below
>>> for item in path.glob('**/*.pdf'):
...     item
...
WindowsPath('C:/Quantarctica2/Quantarctica-Get_Started.pdf')
WindowsPath('C:/Quantarctica2/Quantarctica2_GetStarted.pdf')
WindowsPath('C:/Quantarctica2/Basemap/Terrain/BEDMAP2/tc-7-375-2013.pdf')
WindowsPath('C:/Quantarctica2/Scientific/Glaciology/ALBMAP/1st_ReadMe_ALBMAP_LeBrocq_2010_EarthSystSciData.pdf')
WindowsPath('C:/Quantarctica2/Scientific/Glaciology/ASAID/Bindschadler2011TC_GroundingLines.pdf')
WindowsPath('C:/Quantarctica2/Software/CIA_WorldFactbook_Antarctica.pdf')
WindowsPath('C:/Quantarctica2/Software/CIA_WorldFactbook_SouthernOcean.pdf')
WindowsPath('C:/Quantarctica2/Software/QGIS-2.2-UserGuide-en.pdf')
You will have noticed that (a) the complete paths are made available, and (b) the paths are available within object instances. Fortunately, it's easy to recover the full paths using str.
>>> import fitz
>>> for item in path.glob('**/*.pdf'):
...     doc = fitz.Document(str(item))
...
The output below shows that the final PDF document has been loaded as a fitz document, ready for subsequent processing.
>>> doc
fitz.Document('C:\Quantarctica2\Software\QGIS-2.2-UserGuide-en.pdf')
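Transferring the same pattern back to the trimesh question, and assuming trimesh.load accepts a plain path string for .stl files, the whole walk collapses to a couple of lines:

from pathlib import Path
import trimesh

rootdir = Path('path')  # assumed root directory from the question

# '**/*.stl' recurses into every subdirectory, so os.walk is no longer needed
meshes = [trimesh.load(str(item)) for item in rootdir.glob('**/*.stl')]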

Parse XML files in subdirectories using BeautifulSoup in Python

I have more than 5000 XML files in multiple subdirectories named f1, f2, f3, f4, ...
Each folder contains more than 200 files. At the moment I want to process all the files using BeautifulSoup only, as I have already tried lxml, ElementTree, and minidom, but am struggling to get it done through BeautifulSoup.
I can parse a single file in a subdirectory, but I am not able to process all the files through BeautifulSoup.
I have checked the below posts:
XML parsing in Python using BeautifulSoup (Extract Single File)
Parsing all XML files in directory and all subdirectories (This is minidom)
Reading 1000s of XML documents with BeautifulSoup (Unable to get the files through this post)
Here is the code which I have written to extract a single file:
from bs4 import BeautifulSoup
file = BeautifulSoup(open('./Folder/SubFolder1/file1.XML'),'lxml-xml')
print(file.prettify())
When I try to get all the files in all folders, I am using the code below:
from bs4 import BeautifulSoup
file = BeautifulSoup('//Folder/*/*.XML','lxml-xml')
print(file.prettify())
Then I only get the XML version and nothing else. I know that I have to use a for loop, but I am not sure how to use it to parse all the files.
I know that it will be very slow, but for the sake of learning I want to use BeautifulSoup to parse all the files; if a for loop is not recommended, I would be grateful for a better solution, as long as it uses BeautifulSoup.
Regards,
If I understood you correctly, then you do need to loop through the files, as you had already thought:
from bs4 import BeautifulSoup
from pathlib import Path

for filepath in Path('./Folder').glob('*/*.XML'):
    with filepath.open() as f:
        soup = BeautifulSoup(f, 'lxml-xml')
        print(soup.prettify())
pathlib is just one approach to handling paths, on a higher level using objects. You could achieve the same with glob and string paths.
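If the folders ever nest more than one level deep, the recursive '**' pattern should cover it; this is an assumption beyond the single-level f1, f2, f3 layout described in the question:

from bs4 import BeautifulSoup
from pathlib import Path

# '**' descends into every level of subdirectory, not just the first
for filepath in Path('./Folder').glob('**/*.XML'):
    with filepath.open() as f:
        soup = BeautifulSoup(f, 'lxml-xml')
        print(soup.prettify())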
Use glob.glob to find the XML documents:
import glob
from bs4 import BeautifulSoup

for filename in glob.glob('//Folder/*/*.XML'):
    # open the file first; passing the bare filename string to BeautifulSoup
    # would make it parse the filename itself as markup
    with open(filename) as f:
        content = BeautifulSoup(f, 'lxml-xml')
    print(content.prettify())
Note: don't shadow the built-in function/class file.
Read the BeautifulSoup Quick Start

Retrieving a tag value from multiple XML files in a directory using Python

I am currently learning Python to automate a few things in my job. I need to retrieve a tag value from multiple XML files in a directory. The directory has many subfolders, too.
I tried the following code and understood what is missing, but I am not able to fix it. Here is my code:
from xml.dom.minidom import parse, parseString
import os

def jarv(dir):
    for r, d, f in os.walk(dir):
        for files in f:
            if files.endswith(".xml"):
                print files
                dom = parse(files)
                name = dom.getElementsByTagName('rev')
                print name[0].firstChild.nodeValue

jarv("/path")
I understand that when executing the dom = parse(files) line, it only has the file name without the path, so it says no such file or directory.
I don't know how to fix this.
You have to use os.path.join() to build the correct path from the dirname and the filename:
dom=parse(os.path.join(r, files))
should do it
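Put back into the function, and switched to Python 3 print() calls to match the rest of this page, a corrected sketch might look like this; the variable renames are only there to avoid shadowing built-ins:

from xml.dom.minidom import parse
import os

def jarv(directory):                        # renamed from dir to avoid the built-in
    for r, d, f in os.walk(directory):
        for name in f:
            if name.endswith(".xml"):
                full_path = os.path.join(r, name)   # directory part + file name
                print(full_path)
                dom = parse(full_path)
                rev = dom.getElementsByTagName('rev')
                print(rev[0].firstChild.nodeValue)

jarv("/path")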
