I'm trying to combine html files (text) and xml files (metadata) into a single flat dictionary which I'll write as JSON. The files are contained in the same folder and have the following name structure:
abcde.html
abcde.html.xml
Here's my simplified code; my issue is that I had to separate the xml meta-data writing into a second loop (described below) to get it to work.
### Create a list of dicts with one dict per file, first write file content then meta-data
list_of_dict = []

for path, subdirs, files in os.walk("."):
    for fname in files:
        docname, extension = os.path.splitext(fname)
        filename = os.path.join(path, fname)
        file_dict = {}
        if extension == ".html":
            file_dict['type'] = 'circulaire'
            file_dict['filename'] = fname
            html_dict = parse_html_to_dict(filename)
            file_dict.update(html_dict)
            list_of_dict.append(file_dict)
        #elif extension == ".xml":
        #    if not any(d['filename'] == docname for d in list_of_dict):
        #        print("Well Well Well, there's no html file in the list yet !")
        #        continue
        #    else:
        #        index = next((i for i, element in enumerate(list_of_dict) if element['filename'] == docname), None)
        #        metadata_dict = extract_metadata_xml(filename)
        #        list_of_dict[index].update(metadata_dict)
        else:
            continue

json.dump(list_of_dict, outfile, indent=3)
outfile.close()
############# Extract Metadata from XML FILES #############
import xmltodict

def extract_metadata_xml(filename):
    """ Returns xml file to dict """
    with open(filename, encoding='utf-8', errors='ignore') as xml_file:
        temp_dict = xmltodict.parse(xml_file.read())
    # the 'with' block closes the file automatically, no explicit close() needed
    metadata_dict = temp_dict.get('doc', {}).get('fields', {})
    return metadata_dict
Normally, I would add an elif condition (now commented out) below the if branch for html files, which checks for xml files and updates the corresponding dictionary (matching on identical filenames) with the metadata, processing everything sequentially.
But unfortunately, it seems that for most files the list of dicts isn't fully up to date by then: I can't find a match for about 40% of my filenames.
The work-around I use seems a little silly to me: I wrote a second os.walk loop after the first one, so that the first loop handles html files exclusively, and the second loop then checks for xml extensions and updates list_of_dict, which by then is fully up to date. This way I get 100% of my html filenames matched with xml metadata.
Can I introduce some forced timing to make sure my html is done writing before I start to match any xml filename? Is it possible that the if/elif branches are executed in parallel for different files?
Or else, what is the best way, in terms of processing, to have all my html files processed before my xml files (just ordering my list of files by type before proceeding with the if/elif branches)?
I'm quite new to this forum, so please let me know if I can improve my question writing style, be mindful though that I'm trying my best ;).
Thanks for your help!
The way it is now, you check all files and handle them differently depending on whether they are html or xml, hoping that the corresponding html for each xml you encounter has already been processed – but os.walk gives no such guarantee about file order, which I suspect is the cause of your issues.
Instead, you should only look for html files and retrieve the corresponding xml right away:
if extension == ".html":
file_dict['type'] = 'circulaire'
file_dict['filename'] = fname
html_dict = parse_html_to_dict(filename)
file_dict.update(html_dict)
# process the corresponding xml file
# this file should always exist, according to your description of the file structure
xml_filename = os.path.join(path, fname+'.xml')
# extract the meta data
metadata_dict = extract_metadata_xml(xml_filename)
# put it in the file dict we still have
file_dict.update(metadata_dict)
# and finally store it
list_of_dict.append(file_dict)
Alternatively, though less efficiently, you could also iterate over the sorted file list – for fname in sorted(files): – and proceed as you did. Since 'abcde.html' sorts before 'abcde.html.xml', each html file would then be processed before its corresponding xml file.
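Putting it together, the whole loop might look like this (a sketch: parse_html_to_dict and extract_metadata_xml are your existing functions, and the output filename is a placeholder):

import os
import json

list_of_dict = []
for path, subdirs, files in os.walk("."):
    for fname in files:
        if os.path.splitext(fname)[1] != ".html":
            continue  # skip everything that isn't an html file
        filename = os.path.join(path, fname)
        file_dict = {'type': 'circulaire', 'filename': fname}
        file_dict.update(parse_html_to_dict(filename))
        # the metadata sits right next to the html file as <name>.html.xml
        file_dict.update(extract_metadata_xml(filename + '.xml'))
        list_of_dict.append(file_dict)

with open("output.json", "w", encoding="utf-8") as outfile:
    json.dump(list_of_dict, outfile, indent=3)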
Related
Let's say I have a folder with a bunch of files (with different file extensions). I want to create a list of files from this folder. However, I want to create a list of files with SPECIFIC file extensions.
These file extensions are categorized into groups.
File Extensions: .jpg, .png, .gif, .pdf, .raw, .docx, .pptx, .xlsx, .js, .html, .css
Group "image" contains .jpg, .png, .gif.
Group "adobe" contains .pdf, .raw. (yes, I'm listing '.raw' as an adobe file for this example :P)
Group "microsoft" contains .docx, .pptx, .xlsx.
Group "webdev" contains .js, .html, .css.
I want to be able to add these file types to a list. That list will be generated in a ".txt" file and would contain ALL files with the chosen file extensions.
So if my folder has 5 image files, 10 adobe files, 5 microsoft files, 3 webdev files and I select the "image" and "microsoft" groups, this application in Python would create a .txt file that contains a list of filenames with file extensions that belong only in the image and microsoft groups.
The text file would have a list like below:
picture1.jpg
picture2.png
picture3.gif
picture4.jpg
picture5.jpg
powerpoint.pptx
powerpoint2.pptx
spreadsheet.xlsx
worddocument.docx
worddocument2.docx
As of right now, my code creates a text file that generates a list of ALL files in a specified folder.
I could use an "if" statement to match specific file extensions, but I don't think this achieves the results I want. In this case, I would have to create a function for each group (i.e. a function for the image, adobe, microsoft and webdev groups respectively), and I want to be able to combine these groups freely (i.e. image and microsoft files in one list).
Example of an "if" statement:
for elem in os.listdir(filepath):
    if elem.endswith('.jpg'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    if elem.endswith('.png'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    if elem.endswith('.gif'):
        listItem = elem + '\n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()
    else:
        continue
Full code without the "if" statement (generates a .txt file with all filenames from a specified folder):
import os

def enterFilePath():
    global filepath
    filepath = input("Please enter your file path. ")
    os.chdir(filepath)

enterFilePath()

def enterFileName():
    global name
    global listName
    name = str(input("Name the txt file. "))
    listName = name + ".txt"

enterFileName()

def listGenerator():
    for filename in os.listdir(filepath):
        listItem = filename + ' \n'
        listName = filepath + (r"\{}List.txt".format(name))
        writeFile = open(listName, 'a')
        writeFile.write(listItem)
        writeFile.close()

listGenerator()
A pointer before getting into the answer - avoid using global in favor of function parameters and return values. It will make debugging significantly less of a headache and make it easier to follow data flow through your program.
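For example, enterFilePath from your code could return the path instead of setting a global (a sketch):

import os

def enter_file_path():
    # return the value instead of assigning to a global
    filepath = input("Please enter your file path. ")
    os.chdir(filepath)
    return filepath

filepath = enter_file_path()  # the data flow is now explicit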
nostradamus is correct in his comment: a dict will be the ideal way to solve your problem here. I've solved similar problems before using itertools.chain.from_iterable and pathlib.Path, which I'll be using here as well.
First, the dict:
groups = {
    'image': {'jpg', 'png', 'gif'},
    'adobe': {'pdf', 'raw'},
    'microsoft': {'docx', 'pptx', 'xlsx'},
    'webdev': {'js', 'html', 'css'}
}
This sets up your extension groups that you mentioned, which you can then access easily with groups['image'], groups['adobe'], etc.
Then, using the Path.glob method, itertools.chain.from_iterable, and a comprehension, you can get your list of desired files in a single statement (or function).
from itertools import chain
from pathlib import Path

# Set up some variables
target_groups = ['adobe', 'webdev']

# Initialize generator
files = chain.from_iterable(
    # Glob pattern for the current extension
    Path(filepath).glob(f'*.{ext}')
    # Each group in target_groups
    for group in target_groups
    # Each extension in current group
    for ext in groups[group]
)

# Then, just iterate the files
for fpath in files:
    # Do stuff with file here
    print(fpath.name)
My test directory has one file of each extension you listed, named a, b, etc. within each group. Using the above code, my output is:
a.pdf
b.raw
a.js
b.html
c.css
The way the file list/generator is set up, the files come out grouped by extension-group, then by extension, then by name. Note that because the groups are sets, the order of extensions within a group is arbitrary; use lists or tuples instead if that order matters. If you want to change which groups are listed, just add/remove values in the target_groups list above (it works with a single option as well).
You'll also want to consider parameterizing your targets, such as through input or script arguments, as well as handling cases where a requested group doesn't exist in the groups dictionary. The code above would probably also be more useful as a function, but I'll leave that implementation up to you :)
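As a rough sketch of that function form (the names, the output file, and the unknown-group handling are illustrative assumptions, not the only way to do it):

from itertools import chain
from pathlib import Path

def list_files(filepath, groups, target_groups):
    """Yield Path objects for every file whose extension is in one of the target groups."""
    return chain.from_iterable(
        Path(filepath).glob(f'*.{ext}')
        for group in target_groups
        if group in groups  # silently skip unknown group names
        for ext in groups[group]
    )

# example usage: write the matched filenames to a txt file
with open('FileList.txt', 'w') as out:
    for fpath in list_files('.', groups, ['image', 'microsoft']):
        out.write(fpath.name + '\n')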
Using Python, I am trying to take a list of emails in .txt format and delete anything after a specific keyword, "Original message", in order to remove the part that I sent. All of the emails were saved to a directory via a VBScript in Outlook. Each email is its own .txt file, and I would like to cycle through them all.
If there is a way to replace all text between two keywords that would also work as I have a program that combines the emails into one long .txt file.
I apologize if I left out any important information, this is my first time posting here
You can use os.listdir() to iterate over the files in your directory:
import os

directory = "Path/to/directory/storing/emails"
files = [i for i in os.listdir(directory) if i.endswith("txt")]
for name in files:
    path = os.path.join(directory, name)  # os.listdir gives bare names, so join with the directory
    with open(path) as f:
        lines = [line.strip('\n') for line in f]
    if "Original message" in lines:
        # this list slicing removes the line containing "Original message" and everything below it
        lines = lines[:lines.index("Original message")]
    final_message = '\n'.join(lines)
    with open(path, 'w') as f:
        f.write(final_message)
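If you instead want to remove all text between two keywords (for the combined file you mention), here is a small regex sketch, assuming both keywords appear literally in the text:

import re

def remove_between(text, start_kw, end_kw):
    # drop everything from start_kw through end_kw, keywords included;
    # DOTALL lets '.' match newlines, and '.*?' keeps the match non-greedy
    pattern = re.escape(start_kw) + r".*?" + re.escape(end_kw)
    return re.sub(pattern, "", text, flags=re.DOTALL)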
I am trying to create a python script that will iterate through a folder structure, find folders named 'bravo', and modify the xml files contained within them.
In the xml files, I want to modify the 'location' attribute of a tag, called 'file'. Such as:
<file location="e:\one\two"/>
I just need to change the drive letter of the file path from ‘e’ to ‘f’. So that it will read:
<file location="f:\one\two"/>
However...
The name of these xml files are unique, so I cannot search for the exact xml file name. Instead I am searching by the xml file type.
Also, there are other xml files in my folder structure, without the ‘file’ tag reference, that I wish to ignore.
The only constant is that the xml files I want to modify are all stored in folders named, ‘bravo’.
I also wish to create a log file that lists all the xml files and their filepaths which have successfully been updated (and preferably the ones that failed).
Using answers to similar questions on this site, I have cobbled together the following script.
In its current state, the script tries to modify every xml file it finds; I have not been able to successfully add code that only searches folders called 'bravo'.
When the script reaches an xml file that is not in a 'bravo' folder, it errors, because those files do not contain a 'file' tag.
Please could someone help me to correct my script (or create a new one).
Here is an example of the folder structure...
[screenshot: my folder structure]
And my script so far...
from xml.dom import minidom
import os

# enter the directory where to start search for xml files...
for root, dirs, files in os.walk("c:/temp"):
    for file in files:
        # search for xml files...
        if file.endswith(".xml"):
            xml_file = file
            xmldoc = minidom.parse(os.path.join(root, xml_file))
            # in the xml file look for tag called "file"...
            file_location = xmldoc.getElementsByTagName("file")
            # i don't understand the next line of code, but it's needed
            file_location = file_location[0]
            # 'location_string' is a variable for the 'location' path of the file tag in the xml document
            location_string = (file_location.attributes["location"].value)
            # the new drive letter is added to the location_string to create 'new_location'
            new_location = "f" + location_string[1:]
            # replace the 'location' value of the file tag with the new location...
            file_location.attributes["location"].value = new_location
            # write the change to the original file
            with open((os.path.join(root, xml_file)), 'w') as f:
                f.write(xmldoc.toxml())
            print "%s has been updated!" % (os.path.join(root, xml_file))
            # add updated file name to log...
            log_file = open("filepath_update_log.txt", "a")
            log_file.write("%s\n" % (os.path.join(root, xml_file)))
            log_file.close()  # note: close() was missing its parentheses, so the file was never actually closed
Test if the directory name fits, before your second loop. You'd have to get the last directory in the path first. As in: How to get only the last part of a path in Python?
if os.path.basename(os.path.normpath(root)) == "bravo":
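For instance, dropped into your walk loop it could look like this (a sketch based on your code above):

for root, dirs, files in os.walk("c:/temp"):
    # skip everything that is not inside a folder named 'bravo'
    if os.path.basename(os.path.normpath(root)) != "bravo":
        continue
    for file in files:
        if file.endswith(".xml"):
            pass  # parse and update the xml as before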
You could use the https://docs.python.org/3/library/logging.html module for logging.
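A minimal setup might look like this (the filename and format are just examples):

import logging

logging.basicConfig(filename="filepath_update_log.txt", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

# inside the loop, instead of print and manual file handling:
logging.info("%s has been updated!", os.path.join(root, xml_file))
# and on failure, e.g. when no 'file' tag is found:
logging.error("no 'file' tag in %s", os.path.join(root, xml_file))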
If you only want to replace a single letter, then maybe you can directly replace it instead of parsing xml. As suggested in: https://stackoverflow.com/a/17548459/7062162
def inplace_change(filename, old_string, new_string):
    # Safely read the input filename using 'with'
    with open(filename) as f:
        s = f.read()
        if old_string not in s:
            print('"{old_string}" not found in {filename}.'.format(**locals()))
            return

    # Safely write the changed content, if found in the file
    with open(filename, 'w') as f:
        print('Changing "{old_string}" to "{new_string}" in {filename}'.format(**locals()))
        s = s.replace(old_string, new_string)
        f.write(s)
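Called from your walk loop, usage might look like this (the old/new strings are assumptions based on your example attribute):

xml_path = os.path.join(root, file)
inplace_change(xml_path, 'location="e:', 'location="f:')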
I'm creating a corpus from a repository. I download the texts from the repository as pdf, convert them to text files, and save them. However, I'm trying to find a good way to name these files.
To get the filenames I do this: (the records generator is an object from the Sickle package that I use to get access to all the records in the repository)
for record in records:
    record_data = []  # data is stored in record_data
    for name, metadata in record.metadata.items():
        for i, value in enumerate(metadata):
            if value:
                record_data.append(value)

    file_path = ''
    fulltext = ''
    for data in record_data:
        if 'Fulltext' in data:
            fulltext = data.replace('Fulltext ', '')
            file_path = '/' + os.path.basename(data) + '.txt'
            print fulltext
            print file_path
The output of the print statements on the two last lines:
https://www.duo.uio.no/bitstream/handle/10852/34910/1/Bertelsen-Master.pdf
/Bertelsen-Master.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34912/1/thesis-output.pdf
/thesis-output.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9976/1/gartmann.pdf
/gartmann.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34174/1/thesis-mariusno.pdf
/thesis-mariusno.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9285/1/thesis2.pdf
/thesis2.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9360/1/OMyhre.pdf
As you can see I add a .txt to the end of the original filename and want to use that name to save the file. However, a lot of the files have the same filename, like thesis.pdf. One way I thought about solving this was to add a few random numbers to the name, or have a number that gets incremented on each record and use that, like this: thesis.pdf.124.txt (adding 124 to the name).
But that does not look very good, and the repository is huge, so in the end I would have quite large numbers appended to each filename. Any smart suggestions on how I can solve this?
I have seen suggestions like using the time module. I was thinking maybe I can use regex or another technique to extract part of the name (so every name is equally long), and then create a method that adds a string to each file based on the url of the file, which should be unique.
One thing you could do is to compute a unique hash of the files, e.g. with MD5 or SHA1 (or any other), cf. this article. For a large number of files this can become quite slow, though.
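A sketch of the hashing idea (reading in chunks so large files don't have to fit in memory):

import hashlib

def file_hash(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()  # e.g. use the first 8-10 hex chars in the filename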
But you don't really seem to touch the files in this piece of code. For generating some unique id, you could use uuid and put it somewhere in the name.
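A quick sketch of the uuid idea (fulltext is assumed to hold the download url from your loop):

import os
import uuid

base = os.path.basename(fulltext)         # e.g. 'thesis.pdf'
unique_suffix = uuid.uuid4().hex[:8]      # short random id, collisions are extremely unlikely
file_path = '/{}.{}.txt'.format(base, unique_suffix)   # e.g. '/thesis.pdf.3f2a9c1b.txt'
# alternatively, the handle id in the url (e.g. '10852/34910') is itself unique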
So I'm importing a list of names, e.g. the text file would include:
Eleen
Josh
Robert
Nastaran
Miles
my_list = ['Eleen','Josh','Robert','Nastaran','Miles']
Then I'm assigning each name to a list, and I want to write a new Excel file for each name in that list.
#1. Is there any way I can create a for loop on the line:
temp = os.path.join(dir,'...'.xls')
_________________________
def high_throughput(names):
    import os
    import re

    # Reading file
    in_file = open(names, 'r')
    dir, file = os.path.split(names)
    temp = os.path.join(dir, '***this is where i want to put a for loop
                              for each name in the input list of names***.xls')
    out_file = open(temp, 'w')
    data = []
    for line in in_file:
        data.append(line)
    in_file.close()
I'm still not sure what you're trying to do (and by "not sure", I mean "completely baffled"), but I think I can explain some of what you're doing wrong, and how to do it right:
in_file = open(names, 'r')
dir, file = os.path.split(names)
temp = os.path.join(dir, '***this is where i want to put a for loop
                          for each name in the input list of names***.xls')
At this point, you don't have the input list of names. That's what you're reading from in_file, and you haven't read it yet. Later on, you read those names into data, after which you can use them. So:
in_file = open(names, 'r')
dir, file = os.path.split(names)

data = []
for line in in_file:
    data.append(line)
in_file.close()

for name in data:
    temp = os.path.join(dir, '{}.xls'.format(name))
    out_file = open(temp, 'w')
Note that I put the for loop outside the function call, because you have to do that. And that's a good thing, because you presumably want to open each path (and do stuff to each file) inside that loop, not open a single path made out of a loop of files.
But if you don't insist on using a for loop, there is something that may be closer to what you were looking for: a list comprehension. You have a list of names. You can use that to build a list of paths. And then you can use that to build a list of open files. Like this:
paths = [os.path.join(dir, '{}.xls'.format(name)) for name in data]
out_files = [open(path, 'w') for path in paths]
Then, later, after you've built up the string you want to write to all the files, you can do this:
for out_file in out_files:
    out_file.write(stuff)
However, this is kind of an odd design. Mainly because you have to close each file. They may get closed automatically by the garbage collection, and even if they don't, they may get flushed… but unless you get lucky, all that data you wrote is just sitting around in buffers in memory and never gets written to disk. Normally you don't want to write programs that depend on getting lucky. So, you want to close your files. With this design, you'd have to do something like:
for out_file in out_files:
    out_file.close()
It's probably a lot simpler to go back to the one big loop I suggested in the first place, so you can do this:
for name in data:
    temp = os.path.join(dir, '{}.xls'.format(name))
    out_file = open(temp, 'w')
    out_file.write(stuff)
    out_file.close()
Or, even better:
for name in data:
    temp = os.path.join(dir, '{}.xls'.format(name))
    with open(temp, 'w') as out_file:
        out_file.write(stuff)
A few more comments, while we're here…
First, you really shouldn't be trying to generate .xls files manually out of strings. You can use a library like openpyxl. Or you can create .csv files instead—they're easy to create with the csv library that comes built in with Python, and Excel can handle them just as easily as .xls files. Or you can use win32com or pywinauto to take control of Excel and make it create your files. Really, anything is better than trying to generate them by hand.
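For example, a minimal csv sketch (the column names and values are made up):

import csv

# write one csv per name; Excel opens .csv files directly
with open('Eleen.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'value'])  # header row
    writer.writerow(['Eleen', 42])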
Second, the fact that you can write for line in in_file: means that an in_file is some kind of sequence of lines. So, if all you want to do is convert it to a list of lines, you can do that in one step:
data = list(in_file)
But really, the only reason you want this list in the first place is so you can loop around it later, creating the output files, right? So why not just hold off, and loop over the lines in the file in the first place?
Whatever you do to generate the output stuff, do that first. Then loop over the file with the list of filenames and write stuff. Like this:
stuff = # whatever you were doing later, in the code you haven't shown

dir = os.path.dirname(names)
with open(names, 'r') as in_file:
    for line in in_file:
        # strip the trailing newline so it doesn't end up in the filename
        temp = os.path.join(dir, '{}.xls'.format(line.strip()))
        with open(temp, 'w') as out_file:
            out_file.write(stuff)
That replaces all of the code in your sample (except for that function named high_throughput that imports some modules locally and then does nothing).
Take a look at openpyxl, especially if you need to create .xlsx files. The example below assumes the Excel workbooks are created blank.
from openpyxl import Workbook

names = ['Eleen', 'Josh', 'Robert', 'Nastaran', 'Miles']

for name in names:
    wb = Workbook()
    wb.save('{0}.xlsx'.format(name))
Try this:
in_file = open(names, 'r')
dir, file = os.path.split(names)
for name in in_file:
    name = name.strip()  # drop the trailing newline before using it in the filename
    temp = os.path.join(dir, name + '.xls')
    with open(temp, 'w') as out_file:
        pass  # write data to out_file
in_file.close()