I have searched quite a bit for some help on this but most information about XML and Python is about how to read and parse files, not how to write them. I currently have a script that outputs the file system of a folder as an XML file but I want this file to include HTML tags that can be read on the web.
Also, I am looking for a graphical, folder-style display of the XML information. I have seen this solved in JavaScript, but I don't have any experience with JavaScript, so is this possible in Python?
import os
import sys
from xml.sax.saxutils import quoteattr as xml_quoteattr

dirname = sys.argv[1]
a = open("output2.xml", "w")

def DirAsLessXML(path):
    result = '<dir name = %s>\n' % xml_quoteattr(os.path.basename(path))
    for item in os.listdir(path):
        itempath = os.path.join(path, item)
        if os.path.isdir(itempath):
            result += '\n'.join('  ' + line for line in
                                DirAsLessXML(os.path.join(path, item)).split('\n'))
        elif os.path.isfile(itempath):
            result += '  <file name=%s />\n' % xml_quoteattr(item)
    result += '</dir>'
    return result

if __name__ == '__main__':
    a.write('<structure>\n' + DirAsLessXML(dirname))
Generally it's not a good idea to produce HTML directly in Python by building strings the way you do here. You won't understand what kind of HTML you produce if you look at your code in a month or so. I recommend using a templating engine.
Alternatively, as you are converting from XML to XML, XSLT (an XML transformation language) seems like a good solution.
Concerning JavaScript, you'll need it if you want an interactive folder view, but this does not influence how you generate the HTML page. Whatever approach you use, just add a script include tag to the HTML header and implement the JavaScript part in a separate file (the one you include, of course).
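If you want to stay in the standard library, building the tree with xml.etree.ElementTree instead of string concatenation also sidesteps the quoting issues. A minimal sketch (my own rewrite, not the question's code) might look like this:

import os
import sys
import xml.etree.ElementTree as ET

def dir_as_element(path):
    # one <dir> element per folder, recursing into subfolders
    elem = ET.Element('dir', name=os.path.basename(path))
    for item in sorted(os.listdir(path)):
        itempath = os.path.join(path, item)
        if os.path.isdir(itempath):
            elem.append(dir_as_element(itempath))
        elif os.path.isfile(itempath):
            # attribute values are escaped automatically on serialization
            ET.SubElement(elem, 'file', name=item)
    return elem

if __name__ == '__main__':
    root = ET.Element('structure')
    root.append(dir_as_element(sys.argv[1]))
    ET.ElementTree(root).write('output2.xml', encoding='utf-8')

From there, an XSLT stylesheet or a template can turn the XML into nested HTML lists, with the folder-view JavaScript kept in its own included file.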
I need to access the source code of locally saved files, and I need to automate this because there are multiple files in one folder. I've looked at the inspect module and the selenium module, but I still don't understand what to do. After accessing the source code, I need to use bs4 to extract content from it.
I've read several posts on here and elsewhere with similar problems, but the thing is that my file does not open as source code (it's written in XML, and as far as I can tell everything needs to be in source-code form before you can use these modules). If I open the file, my browser just renders it as a regular page and then I have to click 'view page source'.
How can I automate this so that it will open the page, go to the source code, and save it so I can stick it into a soup for later parsing?
path_g_jurt = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt'
file = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt' + "/" + file

for file in path_g_jurt:
    if file.endswith(".xhtml"):
        with open(file, encoding = "utf-8") as mdata_jurt:
            soup = BeautifulSoup(mdata_jurt)
        main = file.find("jcid").get_text()
        misc_links = []
        for item in file.find_all("regelgeving"):
            misc = item.find("misc:link")
            misc_links.append(misc.get("misc:jcid"))
Any help would be appreciated.
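A minimal sketch of the kind of loop being asked for (not from the original post; the tag names are copied from the snippet above, and the "xml" parser assumes lxml is installed):

import os
from bs4 import BeautifulSoup

folder = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt'

for name in os.listdir(folder):
    if name.endswith(".xhtml"):
        # opening the file in Python reads its raw markup directly,
        # with no browser or 'view page source' step involved
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "xml")
        main = soup.find("jcid").get_text()
        misc_links = [item.find("misc:link").get("misc:jcid")
                      for item in soup.find_all("regelgeving")]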
For example, I would like to save the .pdf file at http://arxiv.org/pdf/1506.07825 with the filename 'Data Assimilation- A Mathematical Introduction' at the location 'D://arXiv'.
But I have many such files. So, my input is of the form of a .csv file with rows given by (semi-colon is the delimiter):
url; file name; location.
I found some code here: https://github.com/ravisvi/IDM
But that is a bit advanced for me to parse; I want to start with something simpler. The above seems to have more functionality than I need right now (threading, pausing, etc.).
So can you please write me a very minimal code to do the above:
save the file 'Data Assimilation- A Mathematical Introduction'
from 'http://arxiv.org/pdf/1506.07825'
at 'D://arXiv'?
I think I will be able to generalize it to deal with a .csv file.
Or, point me to a place to get started. (The GitHub repository already has a solution, and it is too perfect! I want something simpler.) My guess is that with Python, a task like the above should be possible in no more than 10 lines of code. So tell me the important ingredients of the code, and perhaps I can figure it out.
Thanks!
I would use the requests module; you can just pip install requests.
Then, the code is simple:
import requests

response = requests.get(url)
if response.ok:
    file = open(file_path, "wb+")  # write, binary, allow creation
    file.write(response.content)
    file.close()
else:
    print("Failed to get the file")
Using Python 3.6.5
Here is a method that can create a folder and save the file into it.
data_url - complete URL of the file
data_path - folder where the file should be saved
tgz_path - name of the data file, with the extension
import os
import tarfile
import urllib.request

def fetch_data_from_tar(data_url, data_path, tgz_path):
    if not os.path.isdir(data_path):
        os.mkdir(data_path)
        print("Data folder created at path", data_path)
    else:
        print("Folder path already exists")
    tgz_path = os.path.join(data_path, tgz_path)
    urllib.request.urlretrieve(data_url, filename=tgz_path)
    data_tgz = tarfile.open(tgz_path)
    data_tgz.extractall(path=data_path)
    data_tgz.close()
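A hypothetical call, with placeholder URL and paths (everything below is illustrative, not from the original post):

fetch_data_from_tar(
    "https://example.com/datasets/housing.tgz",  # data_url (placeholder)
    "datasets/housing",                          # data_path: folder to create
    "housing.tgz",                               # tgz_path: local file name
)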
Let me preface this by saying I'm not exactly sure what is happening with my code; I'm fairly new to programming.
I've been working on an individual final project for my Python CS class that checks my teacher's website on a daily basis and determines whether he has changed any of the pages on his website since the last time the program ran.
The step I'm working on right now is as follows:
def write_pages_files():
    '''
    Writes the various page files from the website's links
    '''
    links = get_site_links()
    for page in links:
        site_page = requests.get(root_url + page)
        soup = BeautifulSoup(site_page.text)
        with open(page + ".txt", mode='wt', encoding='utf-8') as out_file:
            out_file.write(str(soup))
The links look similar to this:
/site/sitename/class/final-code
And the error I get is as follows:
with open(page + ".txt", mode='wt', encoding='utf-8') as out_file:
FileNotFoundError: [Errno 2] No such file or directory: '/site/sitename/class.txt'
How can I write the site pages with these types of names (/site/sitename/nameofpage.txt)?
You cannot have / in a file basename on Unix or Windows, but you could replace / with .:
page.replace("/",".") + ".txt"
Python presumes /site etc. is a directory.
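Applied to the loop from the question, the fix could look like this (the lstrip('.') is an extra touch so the name does not start with a dot):

safe_name = page.replace("/", ".").lstrip(".") + ".txt"
with open(safe_name, mode='wt', encoding='utf-8') as out_file:
    out_file.write(str(soup))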
On Unix/Mac OS, for the middle slashes, you can use :, which is displayed as / when viewed but does not trigger the subfolder behavior that / does.
site/sitename/class/final-code -> a final-code file in a class folder in a sitename folder in a site folder in the current folder
site:sitename:class:final-code -> a file named site/sitename/class/final-code in the current folder
Related to the title of the question, though not the specifics: if you really want your file names to include something that looks like a slash, you can use the Unicode character "∕" (DIVISION SLASH), aka u'\u2215'.
This isn't useful in most circumstances (and could be confusing), but can be useful when the standard nomenclature for a concept you wish to include in a filename includes slashes.
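For illustration (the file name is made up):

# U+2215 DIVISION SLASH looks like / but is a legal filename character
name = "site/sitename/class/final-code".replace("/", "\u2215") + ".txt"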
Currently, my code takes the name of an XML file as a parameter, parses some of its content, and uses that content to rename the file. What I mean to do is run my program once and have it search for every XML file inside the directory (even if it's inside a zip) and rename each one using the same parameters, which is what I am having problems with.
#encoding:utf-8
import os, re
from sys import argv
script, nombre_de_archivo = argv
regexFecha = r'\d{4}-\d{2}-\d{2}'
regexLocalidad = r'localidad=\"[\w\s.,-_]*\"'
regexNombre = r'nombre=\"[\w\s.,-_]*\"'
regexTotal = r'total=\"\d+.?\d+\"'
fechas = []; localidades = []; nombres = []; totales = []
archivo = open(nombre_de_archivo)
for linea in archivo.readlines():
    fechas.append(re.findall(regexFecha, linea))
    localidades.append(re.findall(regexLocalidad, linea))
    nombres.append(re.findall(regexNombre, linea))
    totales.append(re.findall(regexTotal, linea))
fecha = str(fechas[1][0])
localidad = str(localidades[1][0]).strip('localidad=\"')
nombre = str(nombres[1][0]).strip('nombre=\"')
total = str(totales[1][0]).strip('total=\"')
nombre_nuevo_archivo = fecha+"_"+localidad+"_"+nombre+"_"+total+".xml"
os.rename(nombre_de_archivo, nombre_nuevo_archivo)
EDIT: an example of this would be:
The directory contains only 3 files, as well as the program:
silly.xml amusing.zip feisty.txt
So, you run the program and it ignores feisty.txt because it is a .txt file; it reads silly.xml, parses fechas, localidad, nombre, and total, concatenates them, and uses the result as the new name for silly.xml. Then the program checks whether the zip contains an XML file; if it does, it does the same thing.
so in the end we would have
20141211_sonora_walmart_2033.xml 20141008_sonora_starbucks_102.xml feisty.txt amusing.zip
Your question is not clear, and the code you posted is too broad.
I can't debug regular expressions with the power of my sight, but there are a number of things you can do to simplify the code. Simple code means fewer errors and an easier time debugging.
To locate your target files, use glob.glob:
files = glob.glob('dir/*.xml')
To parse them, ditch regular expressions and use the ElementTree API.
import xml.etree.ElementTree as ET
tree = ET.parse('target.xml')
root = tree.getroot()
There are also modules to navigate XML files with CSS notation and XPath. Extracting fields from the filename using regex is okay, but check out named groups.
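Putting those pieces together, the rename pass might be sketched like this; the attribute names (fecha, localidad, nombre, total) are guesses based on the regexes in the question, and the zip case (which zipfile.ZipFile can handle by extracting the inner .xml files first) is left out:

import glob
import os
import xml.etree.ElementTree as ET

for path in glob.glob('dir/*.xml'):
    root = ET.parse(path).getroot()
    # assumption: one element carries all four attributes; adjust to the
    # real structure of your files (elem will be None if nothing matches)
    elem = root.find('.//*[@localidad]')
    nuevo = '{}_{}_{}_{}.xml'.format(elem.get('fecha'), elem.get('localidad'),
                                     elem.get('nombre'), elem.get('total'))
    os.rename(path, os.path.join(os.path.dirname(path), nuevo))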
When I try to download a file using this code:
import urllib
urllib.urlretrieve("http://e4ftl01.cr.usgs.gov/MOLT/MOD11A1.005/2012.07.11/MOD11A1.A2012193.h22v10.005.2012196013617.hdf","1.hdf")
the file is correctly downloaded.
But my objective is to build a function that will download files depending on some inputs that are parts of the file name.
There are many files on the webpage. Some parts of the file names are the same for every file (e.g. "/MOLT/MOD11A1.005/"), so this is not a problem. Other parts change from file to file following well-defined rules (e.g. "h22v10"), and I have solved this using %s (e.g. h%sv%s), so this isn't a problem either. The problem is that some parts of the names change without any rule (e.g. "2012196013617"). These parts of the name do not matter, and I want to ignore them. So, I want to download files whose names contain the first two parts (the part that does not change, and the part that changes under a rule) and WHATEVER else.
I thought, I could use wildcards for WHATEVER, so I tried this:
import urllib

def download(url, date, h, v):
    urllib.urlretrieve("%s/MOLT/MOD11A1.005/%s/MOD11A1.*.h%sv%s.005.*.hdf" %
                       (url, date, h, v), "2.hdf")

download("http://e4ftl01.cr.usgs.gov", "2012.07.11", "22", "10")
This does not download the requested file, but instead generates an error file that says:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>404 Not Found</title>
</head>
<body>
<h1>Not Found</h1>
<p>The requested URL /MOLT/MOD11A1.005/2012.07.11/MOD11A1\*\h22v10.005\*\.hdf was not found on this server.</p>
</body>
</html>
It seems like wildcards do not work with HTTP. Do you have any idea how to solve this?
The problem is that some parts of the names change without any rule (e.g. "2012196013617", ). These parts of the name does not matter, and I want to ignore these parts
That is not possible. HTTP URLs do not support 'wildcards'. You must provide an existing URL.
Here is a solution. It assumes that PartialName is a string with the first part of the filename (as much as is known and constant), that URLtoSearch is the URL where the file can be found (also a string), and that FileExtension is a string of the form ".ext", ".mp3", ".zip", etc.
def findURLFile(PartialName, URLtoSearch, FileExtension):
    import urllib2

    sourceURL = urllib2.urlopen(URLtoSearch)
    readURL = sourceURL.read()
    #find the first instance of PartialName and get the index
    #of its first character in the string (an integer)
    fileIndexStart = readURL.find(PartialName)
    #find the first instance of the file extension after the first
    #instance of the string and add 4 to get past the extension
    fileIndexEnd = readURL[fileIndexStart:].find(FileExtension) + 4
    #get the filename
    fileName = readURL[fileIndexStart:fileIndexStart + fileIndexEnd]
    #stop reading the url - not sure if this is necessary
    sourceURL.close()
    #output the URL to download the file from
    downloadURL = URLtoSearch + fileName
    return downloadURL
I am rather new to coding in Python, and this could probably benefit from some exception handling and perhaps a while loop. It works for what I need, but I will likely refine the code and make it more elegant.
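For reference, a hypothetical call built from the names in the question (urllib2 means this is Python 2 code):

import urllib

# partial name and directory URL taken from the question; illustration only
downloadURL = findURLFile("MOD11A1.A2012193.h22v10.005",
                          "http://e4ftl01.cr.usgs.gov/MOLT/MOD11A1.005/2012.07.11/",
                          ".hdf")
urllib.urlretrieve(downloadURL, "2.hdf")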