I have more than 5000 XML files in multiple subdirectories named f1, f2, f3, f4, ...
Each folder contains more than 200 files. At the moment I want to parse all the files using BeautifulSoup only, as I have already tried lxml, ElementTree and minidom, but I am struggling to get it done with BeautifulSoup.
I can parse a single file in a subdirectory, but I am not able to get all the files through BeautifulSoup.
I have checked the below posts:
XML parsing in Python using BeautifulSoup (Extract Single File)
Parsing all XML files in directory and all subdirectories (This is minidom)
Reading 1000s of XML documents with BeautifulSoup (Unable to get the files through this post)
Here is the code which I have written to extract a single file:
from bs4 import BeautifulSoup
file = BeautifulSoup(open('./Folder/SubFolder1/file1.XML'),'lxml-xml')
print(file.prettify())
When I try to get all files in all folders, I am using the code below:
from bs4 import BeautifulSoup
file = BeautifulSoup('//Folder/*/*.XML','lxml-xml')
print(file.prettify())
Then I only get the XML version declaration and nothing else. I know that I have to use a for loop, but I am not sure how to use it to parse all the files.
I know that it will be very slow, but for the sake of learning I want to use BeautifulSoup to parse all the files. If a for loop is not recommended, I would be grateful for a better solution, but in BeautifulSoup only.
If I understood you correctly, then you do need to loop through the files, as you had already thought:
from bs4 import BeautifulSoup
from pathlib import Path
for filepath in Path('./Folder').glob('*/*.XML'):
    with filepath.open() as f:
        soup = BeautifulSoup(f, 'lxml-xml')
        print(soup.prettify())
pathlib is just one approach to handling paths, working at a higher level with path objects. You could achieve the same with glob and string paths.
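If the XML files can sit more than one level below ./Folder, the same Path-based approach also covers deeper nesting. Here is a sketch using Path.rglob, assuming the folder layout from the question:
from bs4 import BeautifulSoup
from pathlib import Path

# rglob() descends into every level of subfolders below ./Folder
for filepath in Path('./Folder').rglob('*.XML'):
    with filepath.open() as f:
        soup = BeautifulSoup(f, 'lxml-xml')
    print(soup.prettify())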
Use glob.glob to find the XML documents:
import glob
from bs4 import BeautifulSoup
for filename in glob.glob('./Folder/*/*.XML'):
    with open(filename) as f:
        content = BeautifulSoup(f, 'lxml-xml')
    print(content.prettify())
Note: don't shadow the built-in function/class file.
Read the BeautifulSoup Quick Start
Related
I'm learning Python and the lxml toolkit. I need to process multiple .htm files in a local directory (recursively) and remove unwanted tags including their content (divs with the IDs "box", "columnRight", "adbox" and "footer", divs with class="box", plus all stylesheets and scripts).
I can't figure out how to do this. I have code that lists all .htm files in the directory:
#!/usr/bin/python
import os
from lxml import html
import lxml.html as lh
path = '/path/to/directory'
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            doc = lh.parse(os.path.join(root, name))
So I need to add the part that creates a tree, processes the HTML and removes the unnecessary divs, like:
for element in tree.xpath('//div[@id="header"]'):
    element.getparent().remove(element)
How do I adjust the code for this?
html page example.
It's hard to tell without seeing your actual files, but try the following and see if it works:
First, you don't need both
from lxml import html
import lxml.html as lh
So you can drop the first. Then
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            tree = lh.parse(os.path.join(root, name))
            doc_root = tree.getroot()
            for element in doc_root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)
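Extending that to everything listed in the question, a sketch could look like the following. The id and class names come from the question; writing the cleaned tree back over the original file is an assumption, so adjust the output path if you want to keep the originals:
import os
import lxml.html as lh

path = '/path/to/directory'

# XPath expressions for everything the question wants removed
UNWANTED = [
    '//div[@id="box"]',
    '//div[@id="columnRight"]',
    '//div[@id="adbox"]',
    '//div[@id="footer"]',
    '//div[@class="box"]',
    '//script',                   # all scripts
    '//style',                    # inline stylesheets
    '//link[@rel="stylesheet"]',  # linked stylesheets
]

for dirpath, dirs, files in os.walk(path):
    for name in files:
        if not name.endswith(".htm"):
            continue
        fullpath = os.path.join(dirpath, name)
        tree = lh.parse(fullpath)
        root = tree.getroot()
        for xpath in UNWANTED:
            for element in root.xpath(xpath):
                element.getparent().remove(element)
        # assumption: overwrite the original file with the cleaned markup
        tree.write(fullpath, method="html")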
I'm pretty new to Python and I'm using the Python-docx module to manipulate some docx files.
I'm importing the docx files using this code:
doc = docx.Document('filename.docx')
The thing is that I need to work with many docx files, and to avoid writing the same line of code for each file, I was wondering: if I create a folder in my working directory, is there a way to import all the docx files in a more efficient way?
Something like:
import docx
from glob import glob

def edit_document(path):
    document = docx.Document(path)
    # --- do things on document ---

for path in glob("./*.docx"):
    edit_document(path)
You'll need to adjust the glob expression to suit.
There are plenty of other ways to do that part, like os.walk() if you want to recursively descend directories, but this is maybe a good place to start.
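For instance, a recursive variant with os.walk might look like this sketch; the folder name is hypothetical and saving the document back over the original file is an assumption:
import os
import docx

def edit_document(path):
    document = docx.Document(path)
    # --- do things on document ---
    document.save(path)  # assumption: save changes back to the same file

for dirpath, dirs, files in os.walk("./my_folder"):  # hypothetical folder name
    for name in files:
        if name.endswith(".docx"):
            edit_document(os.path.join(dirpath, name))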
Quite new to XML and the python-pptx module, I want to remove a single hyperlink that is present on every slide.
My own attempt so far has been to retrieve my files, change them to zip format and unzip them into separate folders.
I then locate the following element: <a:hlinkClick r:id="RelId4">
and remove it, while also removing the Relationship entry in the .xml.rels file which corresponds to that slide.
I then rezip, change the extension back to pptx, and this loads fine. I then tried to replicate this in Python so I can create an ongoing automation.
My attempt:
from pathlib import Path
import zipfile as zf
from pptx import Presentation
import re
import xml.etree.ElementTree as ET
path = 'mypath'
ppts = [files for files in Path(path).glob('*.pptx')]
for file in ppts:
    file.rename(file.with_suffix('.zip'))
zip_files = ppts = [files for files in Path(path).glob('*.zip')]
for zips in zip_files:
    with zf.ZipFile(zips, 'r') as zip_ref:
        zip_ref.extractall(Path(path).joinpath('zipFiles', zips.stem))
I then do some further filtering and end up with my XML files from the rels folder and the ppt/slides folder.
It is here that I get stuck: I can read my XML with the ElementTree module, but I cannot find the relevant tag to remove.
for file in normal_xmls:
    tree = ET.parse(file).getroot()
    y = tree.findall('a')
    print(y)
This yields nothing. I tried to use the python-pptx module, but Action.Hyperlink doesn't seem to be a complete feature, unless I am misunderstanding the API.
To remove a hyperlink from a shape (the kind where clicking on the shape navigates somewhere), set the hyperlink address to None:
shape.click_action.hyperlink.address = None
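For the batch case in the question, a minimal sketch could walk every shape on every slide and clear any click hyperlink it finds. The getattr guard is a defensive assumption (not every shape type necessarily exposes click_action), and saving over the original file is also an assumption:
from pathlib import Path
from pptx import Presentation

path = 'mypath'  # same folder variable as in the question

for pptx_file in Path(path).glob('*.pptx'):
    prs = Presentation(str(pptx_file))
    for slide in prs.slides:
        for shape in slide.shapes:
            # guard the access in case a shape type has no click_action
            click_action = getattr(shape, 'click_action', None)
            if click_action is not None and click_action.hyperlink.address is not None:
                # clearing the address removes the hyperlink and its
                # relationship entry, like the manual unzip/edit did
                click_action.hyperlink.address = None
    prs.save(str(pptx_file))  # assumption: overwrite the original presentation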
Hello :) This is my first Python program, but it doesn't work.
What I want to do:
import an XML file and grab only Example.swf from
<page id="Example">
  <info>
    <title>page 1</title>
  </info>
  <vector_file>Example.swf</vector_file>
</page>
(the text inside <vector_file>)
then download the associated file from a website (https://website.com/.../.../Example.swf)
then rename it 1.swf (or page 1.swf)
and loop until I reach the last file, at the end of the page (Exampleaa_idontknow.swf → 231.swf)
then convert all the files to PDF
What I have done (but it is useless, because of AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'):
import re
import urllib.request
import requests
import time
import lxml
import lxml.html
import os
from xml.etree import ElementTree as ET
DIR="C:/Users/mypath.../"
for filename in os.listdir(DIR):
if filename.endswith(".xml"):
with open(file=DIR+".xml",mode='r',encoding='utf-8') as file:
_tree = ET.fromstring(text=file.read())
_all_metadata_tags = _tree.xpath('.//vector_file')
for i in _all_metadata_tags:
print(i.text + '\n')
else:
print("skipping for filename")
First of all, you need to make up your mind about what module you're going to use. lxml or xml? Import only one of them. lxml has more features, but it's an external dependency. xml is more basic, but it is built-in. Both modules share a lot of their API, so they are easy to confuse. Check that you're looking at the correct documentation.
For what you want to do, the built-in module is good enough. However, the .xpath() method is not supported there, the method you are looking for here is called .findall().
Then you need to remember never to parse XML files by opening them as plain text files, reading them into a string, and parsing that string. Not only is this wasteful, it's fundamentally the wrong thing to do. XML parsers have built-in automatic encoding detection. This mechanism makes sure you never have to worry about file encodings, but you have to use it, too.
It's not only better, but less code to write: Use ET.parse() and pass a filename.
import os
from xml.etree import ElementTree as ET
DIR = r'C:\Users\mypath'
for filename in os.listdir(DIR):
    if not filename.lower().endswith(".xml"):
        print("skipping for filename")
        continue

    fullname = os.path.join(DIR, filename)
    tree = ET.parse(fullname)

    for vector_file in tree.findall('.//vector_file'):
        print(vector_file.text + '\n')
If you only expect a single <vector_file> element per file, or if you only care for the first such element, use .find() instead of .findall():
vector_file = tree.find('.//vector_file')
if vector_file is None:
    print('Nothing found')
else:
    print(vector_file.text + '\n')
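From there, the download-and-rename step described in the question could build on the same loop. This is only a rough sketch: BASE_URL and the XML file name are hypothetical placeholders, since the real URL and file name are elided in the question.
import os
from urllib.request import urlretrieve
from xml.etree import ElementTree as ET

BASE_URL = 'https://website.com/path/to/'  # hypothetical: the real directory URL is not given
DIR = r'C:\Users\mypath'
XML_NAME = 'pages.xml'                     # hypothetical file name

tree = ET.parse(os.path.join(DIR, XML_NAME))
for number, vector_file in enumerate(tree.findall('.//vector_file'), start=1):
    swf_name = vector_file.text.strip()
    # download Example.swf and save it locally as 1.swf, 2.swf, ...
    urlretrieve(BASE_URL + swf_name, os.path.join(DIR, '{}.swf'.format(number)))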
There is a URL where a colleague has set up a large number of files for me to download,
url = "http://www.some.url.edu/some/dirname/"
Inside this directory, there are a large number of files with different filename patterns that are known to me in advance, e.g., "subvol1_file1.tar.gz", "subvol1_file2.tar.gz", etc. I am going to selectively download these files based on their filename patterns using fnmatch.
What I need is a simple list or generator of all filenames located in dirname. Is there a simple way to use, for example, BeautifulSoup or urllib2 to retrieve such a list?
Once I have the list/iterable, let's call it filename_sequence, I plan to download the files matching a pattern filepat with the following pseudocode:
filename_sequence = code_needed
filepat = "*my.pattern*"
import os, fnmatch
for basename in fnmatch.filter(filename_sequence, filepat):
    os.system("wget " + os.path.join(url, basename))
Not sure this is applicable in your case, but you can apply a regular expression pattern on the href attribute values:
import re
pattern = re.compile(r"subvol1_file\d+\.tar\.gz")
links = [a["href"] for a in soup.find_all("a", href=pattern)]
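For completeness, here is one way the soup above could be built. This is a sketch that assumes the directory index is a plain HTML page of links; it uses requests, but urllib works just as well:
import re
import requests
from bs4 import BeautifulSoup

url = "http://www.some.url.edu/some/dirname/"

# fetch the directory index page and parse the anchor tags
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# keep only the hrefs that match the expected filename pattern
pattern = re.compile(r"subvol1_file\d+\.tar\.gz")
filename_sequence = [a["href"] for a in soup.find_all("a", href=pattern)]
print(filename_sequence)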