Editing all text in childNodes of XML file with Python - python

I'm trying to edit the text inside of all of the tags named "Volume" in an XML file by multiplying that text by a number entered by the user. The text inside of the "Volume" tag will always be a number. My code works so far, but only on the first instance of the "Volume" text.
Here's an example of the XML:
<blah>
<moreblah> sometext </moreblah> ;
<blah2>
<blah3> <blah4> 30 </blah4> <Volume> 15 </Volume> </blah3>
</blah2>
</blah>
<blah>
<moreblah> sometext </moreblah> ;
<blah2>
<blah3> <blah4> 30 </blah4> <Volume> 25 </Volume> </blah3>
</blah2>
</blah>
And here's my Python code:
#import modules
import xml.dom.minidom
from xml.dom.minidom import parse
import os
import fileinput
#create a backup of original file
new_file_name = 'blah.xml'
old_file_name = new_file_name + "_old"
os.rename(new_file_name, old_file_name)
#find all instances of "Volume"
doc = parse(old_file_name)
volume = doc.getElementsByTagName('Volume')[0]
child = volume.childNodes[0]
txt = child.nodeValue
#ask for percentage input
print
percentage = raw_input("Set Volume Percentage (1 - 100): ")
if percentage.isdigit():
if int(percentage) <101 >1:
print 'Thank You'
#append text of <Volume> tag
child.nodeValue = str(int(float(txt) * (int(percentage)/100.0)))
#persist changes to new file
xml_file = open(new_file_name, "w")
doc.writexml(xml_file)
xml_file.close()
#remove XML Declaration
text = open("blah.xml", "r").read()
text = text.replace('<?xml version="1.0" ?>', '')
open("blah.xml", "w").write(text)
else:
print
print 'Please enter a number between 1 and 100.'
print
print 'Try again.'
print
print 'Exiting.'
xml_file = open(new_file_name, "w")
doc.writexml(xml_file)
xml_file.close()
os.remove(old_file_name)
I know that in my code, I have "doc.getElementsByTagName('Volume')[0]" which denotes the first instance of the "Volume" tag, but I was just doing that as a test to see if it would work. So I'm aware that the code is working exactly as it should. But I'm wondering if anyone has any suggestions, or could tell me the easiest way to apply the user input percentage to all of the instances of the "Volume" tag.
This is also my first attempt at Python, so if you see anything else that seems weird, please let me know.
Thank you for your help!

You'll be much happier if you use a more modern XML API, like ElementTree (in the standard library) or lxml (more advanced).
In ElementTree or lxml you get access to XPath (or something close), which allows for a much more flexible syntax in finding elements and attributes in XML documents.
In ElementTree:
volumes = my_parsed_xml_file.find('.//Volume')
...will find all occurrences of the Volume element.
If you stick with the current syntax, by doing:
doc.getElementsByTagName('Volume')[0]
...you're specifically asking for the zero-th (first) Volume. If you want to process them all, you want a loop:
for volume in doc.getElementsByTagName('Volume'):
child = volume.childNodes[0]
// ... rest of your code inside the loop
If constructs like loops are unfamiliar to you, you should probably step back and read an introductory programming guide, as things will get pretty complicated quickly without some fundamentals. Best of luck!

Related

How can I extract unformated, table-like text from PDF's using python?

I have scenario where I have PDFs with a letterhead and table-like body of text. I have tried using pdfminer but I'm struggling to figure out how to approach my problem
An example of the format for one my PDFs
In specific, pdf miner reads the data starting from the letterhead up until the table header. It then reads the table header in a row like fashion from left to right. From there it's just beyond messy.
Here is python to convert pdf to text:
import pdfminer
import sys
from pdfminer.high_level import extract_text
text = extract_text('./quote2.pdf')
print((text))
f = open("results2.txt", "w")
f.write(text)
And here is a snippet of what the output looks like:
... letter head info
ITEM�#
DESCRIPTION
561347
55�PCs-792.00�LB
6061-T651�PLATE�AMS�4027
4�S/C�6"�SQUARE
CUTTING�PLATE�SAW�ALUM
PACKAGING�SKIDDING
SHIP�VIA�:�OUR�TRUCK
Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
The idea was to use regex to extract relevant numbers. As you can see it read the first 2 records for columns ITEM and DESCRIPTION, but from there it starts back up from the letterhead, and it's even more messy below
Is there perhaps a way to seperate the letterhead from the rest of the body as a starting step? Very new to python, not sure how to get what I want, help much appreciated!

How to in-place parse and edit huge xml file [duplicate]

This question already has an answer here:
How to use xml sax parser to read and write a large xml?
(1 answer)
Closed 3 years ago.
I have huge XML datasets (2-40GB). Some of the data is confidential, so I am trying to edit the dataset to mask all of the confidential information. I have a long list of each value that needs to be masked, so for example if I have ID 'GYT-1064' I need to find and replace every instance of it. These values can be in different fields/levels/subclasses, so in one object it might have 'Order-ID = GYT-1064' whereas another might say 'PO-Name = GYT-1064'. I have looked into iterparse but cannot figure out how to in-place edit the xml file instead of building the entire new tree in memory, because I have to loop through it multiple times to find each instance of each ID.
Ideal functionality:
For each element, if a given string is in element, replace the text and change the line in the XML file.
I have a solution that works if the dataset is small enough to load into memory, but I can't figure out how to correctly leverage iterparse. I've also looked into every answer that talks about lxml iterparse, but since I need to iterate through the entire file multiple times, I need to be able to edit it in place
Simple version that works, but has to load the whole xml into memory (and isn't in-place)
values_to_mask = ['val1', 'GMX-103', 'etc-555'] #imported list of vals to mask
with open(dataset_name, encoding='utf8') as f:
tree = ET.parse(f)
root = tree.getroot()
for old in values_to_mask:
new = mu.generateNew(old, randomnumber) #utility to generate new amt
for elem in root.iter():
try:
elem.text = elem.text.replace(old, new)
except AttributeError:
pass
tree.write(output_name, encoding='utf8')
What I attempted with iterparse:
with open(output_name, mode='rb+') as f:
context = etree.iterparse( f )
for old in values_to_mask:
new = mu.generateNew(old, randomnumber)
mu.fast_iter(context, mu.replace_if_exists, old, new, f)
def replace_if_exists(elem, old, new, xf):
try:
if(old in elem.text):
elem.text = elem.text.replace(old, new)
xf.write(elem)
except AttributeError:
pass
It runs but doesn't replace any text, and I get print(context.root) = 'Null'. Additionally, it doesn't seem like it would correctly write back to the file in place.
Basically how the XML data looks (hierarchical objects with subclasses)
It looks generally like this:
<Master_Data_Object>
<Package>
<PackageNr>1000</PackageNr>
<Quantity>900</Quantity>
<ID>FAKE_CONFIDENTIALGYO421</ID>
<Item_subclass>
<ItemType>C</ItemType>
<MasterPackageID>FAKE_CONFIDENTIALGYO421</MasterPackageID>
<Package>
<Other_Types>
Since Lack of Dataset , I would like to suggest you to
1) use readlines() in loop to read substantial amount of data at a time
2) use a regular expression for identifying confidential information (if Possible) then replace it.
Let me know if it works
You can pretty much use SAX parser for big xml files.
Here is your answer -
Editing big xml files using sax parser

is there a method to read pdf files line by line?

I have a pdf file over 100 pages. There are boxes and columns of text. When I extract the text using PyPdf2 and tika parser, I get a string of of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the pdf file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want the line of text displayed as it would be read left to right.
I've tried:
PyPDF2 - the only tool is extracttext(). Fast but does not give gaps in the elements. Results are jumbled.
Pdfminer - PDFPageInterpeter() method with LAParams. This works well but is slow. At least 2 seconds per page and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(),filetypes=[("PDF files", "*.pdf")])
print(file_path) # print that path
file_data = parser.from_file(file_path) # Parse data from file
text = file_data['content'] # Get files text content
by_page = text.split('... Information') # split up the document into pages by string that always appears on the
# top of each page
for i in range(1,len(by_page)): # loop page by page
info = by_page[i] # get one page worth of data from the pdf
reformated = info.replace("\n", "&") # I replace the new lines with "&" to make it more readable
print("Page: ",i) # print page number
print(reformated,"\n\n") # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
I did this for .docx with this code. Where txt is the .docx. Hope this help link
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)

Problems with PyPDF ignoring some data

Hoping for some help, as I can't find a solution.
We currently have a lot of manual data inputs through people reading PDF files, and I have been asked to find a way to cut this time down. My solution would be to transform the PDF to a much easier readable format, then using grep to get rid of the standard fields (Just leaving the data behind). This would then be uploaded into a template, then into SAP.
However, then main problem has come at the first hurdle - transforming the PDF into a txt file. The code I use is as follows -
import sys
import pyPdf
def getPDFContent(path):
content = ""
pdf = pyPdf.PdfFileReader(file(path, "rb"))
for i in range(0, pdf.getNumPages()):
content += pdf.getPage(i).extractText() + "\n"
content = " ".join(content.replace(u"\xa0", " ").strip().split())
return content
f = open('test.txt', 'w+')
f.write(getPDFContent("Adminform.pdf").encode("ascii", "ignore"))
f.close()
This works, however it ignores some data from the PDF files. To show you what I mean, this PDF page -
http://s23.postimg.org/6dqykomqj/error.png
From the first section (gender, title, name) produces the below -
*Title: *Legal First Name (s): *Your forename and second name (if applicable) as it appears on your passport or birth certificate. Address: *Legal Surname: *Your surname as it appears on your passport or birth certificate
Basically, the actual data that I want to capture is not being converted.
Anyone have a fix for this?
Thanks,
Generally speaking converting pdfs to text is a bad idea. It almost always is messy.
There are linux utilities to do what you have implemented, but I don't expect them to do any better.
I can suggest tabula you can find it at.
http://tabula.technology/
It is meant for extracting tables out of pdfs by manually delineating the boundaries of the table. But running on a pdf with no tables would output text with some formatting retained.
There is some automation, although, limited.
Refer
https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool
Also, may not entirely relevant here, you can use openrefine to manage messy data. Refer
http://openrefine.org/

Reading an xml response and printing a required data in python

I have got an xml data as a output for my code. And Now I wanted to get an element value from the obtained xml data.
I have used following commands
data1 = r1.read()
dom = xml.dom.minidom.parseString(data1)
conference=dom.getElementsByTagName('totalResults')
print conference.node value
But I was unable get the value.
My xml code will be
<first:totalresults>100</first:totalresults>
and so on
So now I want the value 100 to be printed
So can any one help me in solving this. I have been trying for this since last night please any one kindly help me.
I'd recommend you'd use etree for an easier XML parsing :
from lxml import etree
myFile = open("file.xml", 'r')
tree = etree.parse(myFile)
data = tree.xpath('//ns:totalresults', namespaces={'ns': 'http://api.com'})
print data

Categories

Resources