Grab specific text from XML - python

Hello :) This is my first python program but it doesn't work.
What I want to do :
import a XML file and grab only Example.swf from
<page id="Example">
<info>
<title>page 1</title>
</info>
<vector_file>Example.swf</vector_file>
</page>
(the text inside <vector_file>)
than download the associated file on a website (https://website.com/.../.../Example.swf)
than rename it 1.swf (or page 1.swf)
and loop until I reach the last file, at the end of the page (Exampleaa_idontknow.swf → 231.swf)
convert all the files in pdf
What i have done (but useless, because of AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'):
import re
import urllib.request
import requests
import time
import requests
import lxml
import lxml.html
import os
from xml.etree import ElementTree as ET
DIR="C:/Users/mypath.../"
for filename in os.listdir(DIR):
if filename.endswith(".xml"):
with open(file=DIR+".xml",mode='r',encoding='utf-8') as file:
_tree = ET.fromstring(text=file.read())
_all_metadata_tags = _tree.xpath('.//vector_file')
for i in _all_metadata_tags:
print(i.text + '\n')
else:
print("skipping for filename")

First of all, you need to make up your mind about what module you're going to use. lxml or xml? Import only one of them. lxml has more features, but it's an external dependency. xml is more basic, but it is built-in. Both modules share a lot of their API, so they are easy to confuse. Check that you're looking at the correct documentation.
For what you want to do, the built-in module is good enough. However, the .xpath() method is not supported there, the method you are looking for here is called .findall().
Then you need to remember to never parse XML files by opening them as plain text files, reading them into into string, and parsing that string. Not only is this wasteful, it's fundamentally the wrong thing to do. XML parsers have built-in automatic encoding detection. This mechanism makes sure you never have to worry about file encodings, but you have to use it, too.
It's not only better, but less code to write: Use ET.parse() and pass a filename.
import os
from xml.etree import ElementTree as ET
DIR = r'C:\Users\mypath'
for filename in os.listdir(DIR):
if not filename.lower().endswith(".xml"):
print("skipping for filename")
continue
fullname = os.path.join(DIR, filename)
tree = ET.parse(fullname)
for vector_file in tree.findall('.//vector_file'):
print(vector_file.text + '\n')
If you only expect a single <vector_file> element per file, or if you only care for the first such element, use .find() instead of .findall():
vector_file = tree.find('.//vector_file')
if vector_file is None:
print('Nothing found')
else:
print(vector_file.text + '\n')

Related

Python 2.7.16 - ImportError: No module named etree.ElementTree

I am making a script to perform creating and writing data to XML file. The error is no module no module name
I refer to this stackoverflow link, Python 2.5.4 - ImportError: No module named etree.ElementTree. I refer to this tutorial, https://stackabuse.com/reading-and-writing-xml-files-in-python/. I still do not understand on what is the solution. I tried to replace
"from elementtree import ElementTree"
to
"from xml.etree import ElementTree"
It still did not work.
#!/usr/bin/python
import xml.etree.ElementTree as xml
root = xml.Element("FOLDER")
child = xml.Element("File")
root.append(child)
fn = xml.SubElement(child, "PICTURE")
fn.text = "he32dh32rf43hd23"
md5 = xml.SubElement(child, "CONTENT")
md5.text = "he32dh32rf43hd23"
tree = xml.ElementTree(root)
with open(xml.xml, "w") as fh:
tree.write(fh)
"""
I expect the result to be that data is written to xml file. But I received an error shown below,
File "./xml.py", line 2, in <module>
import xml.etree.ElementTree as xml
File "/root/Desktop/virustotal/testxml/xml.py", line 2, in <module>
import xml.etree.ElementTree as xml
```ImportError: No module named etree.ElementTree
import xml.etree.ElementTree as xml
and make sure you have the __init__.py file within the same folder if you use your own xml module and please avoid the path conflict.
then it will work.
etree package is provided by "ElementTree" and "lxml" both are similar but it is reported that ElementTree have bugs in python 2.7 and works great in python3.
I see you are using python 2.7 so lxml will work fine for you.
try this
from lxml import etree
from io import StringIO
tree = etree.parse(StringIO(xml_file))
# incase you need to read an XML.
print(tree.getroot())
And the StringIO is from default python io package.
StringIO is neccessary when you are passing file to it (I mean putting XML in a file and passing that file to parser).
It's good to keep it even tough you are passing XML as a big string.
all the writing operations will be same for both.

How do I read all html-files in a directory recursively?

I am trying to get all html-files Doctype printed to a txt-file. I have no experience in Python, so bear with me a bit. :)
Final script is supposed to delete elements from the html-file depending on the html-version given in the Doctype set in the html-file. I've attempted to list files in PHP as well, and it works to some extent. I think Python is a better choice for this task.
The script below is what I've got right now, but I cant figure out how to write a "for each" to get the Doctype of every html-file in the arkivet folder recursively. I currently only prints the filename and extension, and I don't know how to get the path to it, or how to utilize BeautifulSoup to edit and get info out of the files.
import fnmatch
from urllib.request import urlopen as uReq
import os
from bs4 import BeautifulSoup as soup
from bs4 import Doctype
files = ['*.html']
matches = []
for root, dirnames, filenames in os.walk("arkivet"):
for extensions in files:
for filename in fnmatch.filter(filenames, extensions):
matches.append(os.path.join(root, filename))
print(filename)
matches is an array, but I am not sure how to handle it properly in Python. I would like to print the foldernames, filenames with extension and it's doctype into a text-file in root.
Script runs in CLI on a local Vagrant Debian server with Python 3.5 (Python 2.x present too). All files and folders exist in folder called arkivet (archive) under servers public root.
Any help appreciated! I'm stuck here :)
As you did not mark any of the answers solutions, I'm guessing you never quite got you answer. Here's a chunk of code that recursively searches for files, prints the full filepath, and shows the Doctype string in the html file if it exists.
import os
from bs4 import BeautifulSoup, Doctype
directory = '/home/brian/Code/sof'
for root, dirnames, filenames in os.walk(directory):
for filename in filenames:
if filename.endswith('.html'):
fname = os.path.join(root, filename)
print('Filename: {}'.format(fname))
with open(fname) as handle:
soup = BeautifulSoup(handle.read(), 'html.parser')
for item in soup.contents:
if isinstance(item, Doctype):
print('Doctype: {}'.format(item))
break
Vikas's answer is probably what you are asking for, but in case he interpreted the question incorrectly, it's worth it to know that you have access to all three of those variables as you're looping through, root, dirnames, and filenames. You are currently printing just the basefile name:
print(filename)
It is also possible to print the full path instead:
print(os.path.join(root, filename))
Vikas solved the lack of directory name by using a different function (os.listdir), but I think that will lose the ability to recurse.
A combination of os.walk as you posted, and reading the interior of the file with open as Vikas posted is perhaps what you are going for?
If you want to read all the html files in a particular directory you can try this one :
import os
from bs4 import BeautifulSoup
directory ='/Users/xxxxx/Documents/sample/'
for filename in os.listdir(directory):
if filename.endswith('.html'):
fname = os.path.join(directory,filename)
with open(fname, 'r') as f:
soup = BeautifulSoup(f.read(),'html.parser')
# parse the html as you wish

Using a python program to rename all XML files within a linux directory

Currently, my code uses the name of an XML file as a parameter in order to take that file, parse some of its content and use it to rename said file, what I mean to do is actually run my program once and that program will search for every XML file (even if its inside a zip) inside the directory and rename it using the same parameters which is what I am having problems with.
#encoding:utf-8
import os, re
from sys import argv
script, nombre_de_archivo = argv
regexFecha = r'\d{4}-\d{2}-\d{2}'
regexLocalidad = r'localidad=\"[\w\s.,-_]*\"'
regexNombre = r'nombre=\"[\w\s.,-_]*\"'
regexTotal = r'total=\"\d+.?\d+\"'
fechas = []; localidades = []; nombres = []; totales = []
archivo = open(nombre_de_archivo)
for linea in archivo.readlines():
fechas.append(re.findall(regexFecha, linea))
localidades.append(re.findall(regexLocalidad, linea))
nombres.append(re.findall(regexNombre, linea))
totales.append(re.findall(regexTotal, linea))
fecha = str(fechas[1][0])
localidad = str(localidades[1][0]).strip('localidad=\"')
nombre = str(nombres[1][0]).strip('nombre=\"')
total = str(totales[1][0]).strip('total=\"')
nombre_nuevo_archivo = fecha+"_"+localidad+"_"+nombre+"_"+total+".xml"
os.rename(nombre_de_archivo, nombre_nuevo_archivo)
EDIT: an example of this would be.
directory contains only 3 files as well as the program.
silly.xml amusing.zip feisty.txt
So, you run the program and it ignores feisty as it is a .txt file and it reads silly.xml, ti then parses "fechas, localidad, nombre, total" concatenate or append or whatever and use that as the new file for silly.xml, then the program checks if zip has an xml file, if it does then it does the same thing.
so in the end we would have
20141211_sonora_walmart_2033.xml 20141008_sonora_starbucks_102.xml feisty txt amusing.zip
Your question is not clear, and the code you posted is too broad.
I can't debug regular expressions with the power of my sight, but there's a number of things you can do to simplify the code. Simple code means less errors, and an easier time debugging.
To locate your target files, use glob.glob:
files = glob.glob('dir/*.xml')
To parse them, ditch regular expressions and use the ElementTree API.
import xml.etree.ElementTree as ET
tree = ET.parse('target.xml')
root = tree.getroot()
There's also modules to navigate XML files with CSS notation and XPATH. Extracting fields form the filename using regex is okay, but check out named groups.

Python Manipulate and save XML without third-party libraries

I have a xml in which I have to search for a tag and replace the value of tag with a new values. For example,
<tag-Name>oldName</tag-Name>
and replace oldName to newName like
<tag-Name>newName</tag-Name>
and save the xml. How can I do that without using thirdparty lib like BeautifulSoup
Thank you
The best option from the standard lib is (I think) the xml.etree package.
Assuming that your example tag occurs only once somewhere in the document:
import xml.etree.ElementTree as etree
# or for a faster C implementation
# import xml.etree.cElementTree as etree
tree = etree.parse('input.xml')
elem = tree.find('//tag-Name') # finds the first occurrence of element tag-Name
elem.text = 'newName'
tree.write('output.xml')
Or if there are multiple occurrences of tag-Name, and you want to change them all if they have "oldName" as content:
import xml.etree.cElementTree as etree
tree = etree.parse('input.xml')
for elem in tree.findall('//tag-Name'):
if elem.text == 'oldName':
elem.text = 'newName'
# some output options for example
tree.write('output.xml', encoding='utf-8', xml_declaration=True)
Python has 'builtin' libraries for working with xml. For this simple task I'd look into minidom. You can find the docs here:
http://docs.python.org/library/xml.dom.minidom.html
If you are certain, I mean, completely 100% positive that the string <tag-Name> will never appear inside that tag and the XML will always be formatted like that, you can always use good old string manipulation tricks like:
xmlstring = xmlstring.replace('<tag-Name>oldName</tag-Name>', '<tag-Name>newName</tag-Name>')
If the XML isn't always as conveniently formatted as <tag>value</tag>, you could write something like:
a = """<tag-Name>
oldName
</tag-Name>"""
def replacetagvalue(xmlstring, tag, oldvalue, newvalue):
start = 0
while True:
try:
start = xmlstring.index('<%s>' % tag, start) + 2 + len(tag)
except ValueError:
break
end = xmlstring.index('</%s>' % tag, start)
value = xmlstring[start:end].strip()
if value == oldvalue:
xmlstring = xmlstring[:start] + newvalue + xmlstring[end:]
return xmlstring
print replacetagvalue(a, 'tag-Name', 'oldName', 'newName')
Others have already mentioned xml.dom.minidom which is probably a better idea if you're not absolutely certain beyond any doubt that your XML will be so simple. If you can guarantee that, though, remember that XML is just a big chunk of text that you can manipulate as needed.
I use similar tricks in some production code where simple checks like if "somevalue" in htmlpage are much faster and more understandable than invoking BeautifulSoup, lxml, or other *ML libraries.

How to display a XML file in a folder like system

I have searched quite a bit for some help on this but most information about XML and Python is about how to read and parse files, not how to write them. I currently have a script that outputs the file system of a folder as an XML file but I want this file to include HTML tags that can be read on the web.
Also, I am looking for a graphical folder style display of the XML information. I have seen this solved in javascript but I don't have any experience in javascript so is this possible in python?
import os
import sys
from xml.sax.saxutils import quoteattr as xml_quoteattr
dirname = sys.argv[1]
a = open("output2.xml", "w")
def DirAsLessXML(path):
result = '<dir name = %s>\n' % xml_quoteattr(os.path.basename(path))
for item in os.listdir(path):
itempath = os.path.join(path, item)
if os.path.isdir(itempath):
result += '\n'.join(' ' + line for line in
DirAsLessXML(os.path.join(path, item)).split('\n'))
elif os.path.isfile(itempath):
result += ' <file name=%s />\n' % xml_quoteattr(item)
result += '</dir>'
return result
if __name__ == '__main__':
a.write('<structure>\n' + DirAsLessXML(dirname))
Generally it's not a good idea to code HTML directly in Python by building strings as you do it. You won't understand what kind of HTML you produce if look at your code in a month or so. I recommend using a templating engine.
Alternatively, as you are converting from XML to XML, XSLT (an XML transformation language) seems like a good solution.
Concerning JavaScript, you'll need it if you want an interactive folder view. But this does not influence how to generate the HTML page. Whatever approach you use, just add a script include tag to the HTML header and implement the JavaScript part in an extra file (the one you include, of course).

Categories

Resources