Splitting an XML file into smaller files - Python

The XML file contains information about movies. How do I split it into smaller files, so that each small file holds a single movie?

Without knowing the details, here is a broad outline of a possible approach:
Parse the XML using a suitable library (BeautifulSoup, lxml etc.)
Find the element corresponding to each movie. This can be done using a plain findAll or may require an XPath expression.
Pretty-print the subtree corresponding to each movie element into a separate file.
Of course a more detailed answer is not possible unless you post some sample XML and provide more details.
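The outline above could be sketched with the standard library's ElementTree as follows. Since the question shows no sample XML, the element name `movie`, the function name `split_movies` and the output file naming are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_movies(source, out_dir, tag="movie"):
    """Write each <tag> element found in `source` to its own XML file.

    `tag` defaults to "movie", which is an assumed element name.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    tree = ET.parse(source)
    written = []
    # iter() walks the whole tree, so the movie elements may sit at any depth.
    for index, element in enumerate(tree.getroot().iter(tag), start=1):
        target = out_path / f"{tag}_{index:04d}.xml"
        # Wrap the subtree in its own ElementTree so it becomes a standalone document.
        ET.ElementTree(element).write(str(target), encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written
```

This loads the whole input into memory, which is fine for moderately sized files; a very large input would call for a streaming parser instead.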

Related

What's the best way to handle XML in Python?

I am working on a very large XML file (about 45 GB, I think) of OpenStreetMap data. I have to search through the document and find a particular path that satisfies certain criteria. I am currently using ElementTree, but I think it is a little slow, since I need to search the XML for specific geographic coordinates. Is there a better way to do this?
I have also looked a little at lxml, but I'd like to know whether one of the two is the better choice. Thank you!

The sax package is well suited for parsing huge XML files that can't be loaded into memory all at once. The parser will go "line-by-line" and notify you when it encounters the beginning or end of an element, instead of loading the file into memory, parsing it, and giving you the whole tree.
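A minimal sketch of that event-driven style with the standard library's `xml.sax` might look like this. It assumes the coordinates live in `lat` and `lon` attributes of `<node>` elements that also carry an `id`; the handler name, the function name and the bounding-box criterion are illustrative choices, not something the question specifies.

```python
import xml.sax


class NodeInBoxHandler(xml.sax.ContentHandler):
    """Collect the ids of <node> elements whose lat/lon fall inside a bounding box."""

    def __init__(self, min_lat, max_lat, min_lon, max_lon):
        super().__init__()
        self.box = (min_lat, max_lat, min_lon, max_lon)
        self.matches = []

    def startElement(self, name, attrs):
        # Called once per opening tag; no tree is built, so memory use stays flat.
        if name != "node":
            return
        # Assumption: every <node> has "id", "lat" and "lon" attributes.
        lat, lon = float(attrs["lat"]), float(attrs["lon"])
        min_lat, max_lat, min_lon, max_lon = self.box
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            self.matches.append(attrs["id"])


def find_nodes(source, min_lat, max_lat, min_lon, max_lon):
    """Stream `source` (a filename or file-like object) and return matching node ids."""
    handler = NodeInBoxHandler(min_lat, max_lat, min_lon, max_lon)
    xml.sax.parse(source, handler)
    return handler.matches
```

Only the list of matching ids is kept, so the size of the input file does not determine how much memory the search needs.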

Remove ASCII-encoded binary blobs from .txt files

I want to parse 10-K files (financial statements of firms). An example of Apple's can be found here (look for the .txt file). Now, I was reading this research paper (see pages 30-31) on how to parse these files. Step one is described as removing all ASCII-encoded segments, and that is what I want to figure out how to do.
I see several questions on StackOverflow on how to remove non-ASCII codes, but this is different. ASCII-encoded segments are: all document segments with <TYPE> tags of GRAPHIC, ZIP, EXCEL and PDF. I want to delete them.
So if I load a txt file as follows:
fil = open('F:\\file.txt','r')
x = fil.read()
How can I remove all ASCII Encoded segments from this txt file? To remove HTML tags, I use the procedure here, but what about ASCII Encoded segments?

If I understand you correctly, the format you are processing is somehow related to the SEC EDGAR process.
I have not taken the time to look it up formally. Perhaps you should.
From inspecting the Apple statement you link to, it looks like you want to replace anything matching the regular expression <DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT> with an empty string.
Disclaimer: A proper implementation would use an XML parser and extract the elements you want, instead of attempting to lexically zap things you don't want. This should not be hard in lxml.
I first thought this was XBRL but it's not. Attempting to parse it with ETree throws an exception because the close tags for some elements (including <TYPE>) appear to be optional. The best way forward would be to find out what format this is (the EDGAR site has several specifications; one of them, perhaps?) and locate a proper DTD, then proceed from there.
Once you have that sorted out, you want to see how to remove elements with XPath and perhaps how to use regex in (lxml) XPath. Then probably reimplement the other extractions you have already done using XML and XPath.
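Until the format question is settled, the regular-expression replacement suggested earlier in this answer could be sketched as below. The `re.DOTALL` flag is an addition not stated in the answer; it is assumed to be needed because the segments span many lines. The names `BINARY_DOC` and `strip_encoded_segments` are illustrative.

```python
import re

# The pattern quoted in the answer. DOTALL lets ".*?" cross line breaks,
# which is an assumption about how the segments are laid out in the file.
BINARY_DOC = re.compile(
    r"<DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT>",
    re.DOTALL,
)


def strip_encoded_segments(text):
    """Return `text` without <DOCUMENT> blocks of type GRAPHIC, ZIP, EXCEL or PDF."""
    return BINARY_DOC.sub("", text)
```

With the text read as in the question, `x = strip_encoded_segments(x)` would drop the matching segments and leave every other document untouched.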

Map xml file with extracted information

I am trying to map one xml file to another, based on a configuration file (that too can be an xml file).
Input
<ia>
<ib>...</ib>
<ic>...</ic>
</ia>
Output
<oa>
<ob>...</ob>
<oc>...</oc>
</oa>
Config
<config>
<conf>
<input>ia</input>
<output>oa</output>
</conf>
<conf>
<input>ib</input>
<output>ob</output>
</conf>
.....
</config>
So, the intention is to parse an xml file and retrieve the information interesting to me, and write into another xml file where the mapping information is specified in the config file.
Because of its scripting nature (and the option of extending it with plugins later on) and its support for XML processing, I was considering Python. I have just learned the syntax and basics of the language, and came to know about lxml.
One way of doing this
parse the config file (where the <input> and <output> tags can hold an XPath to the node I am interested in)
read the input file
write into output, using etbuilder based on the config file
Being new to Python, and not seeing XPath support in etbuilder, I wonder whether this is the best approach. I am also not sure about all the exceptional cases. Is there an easier way, or native support in any other library? If possible, I do not want to spend too much time on this task, so that I can focus on the core task.
Thanks in advance.

If you wish to transform an XML file into another XML file then XSLT was made for this purpose. You have to define a .xslt file that describes the transformation of XML content and what the eventual output should look like. That's one way to do it.
You can also read the XML file using lxml and generate the output XML with lxml.etree.ElementTree. I'm not familiar with etbuilder but I don't think generating the desired output is that difficult. Once you have parsed the input files, you can build the output XML and write it to a file.
XPath is primarily for reading XML content; you don't need it for constructing XML files. In fact, if you use a proper XML parser, you don't need XPath to read the file contents either, although XPath could make life a bit easier.
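As a rough illustration of the parse-and-rebuild route, the sketch below uses the standard library's ElementTree (lxml offers a largely compatible API). It treats each `<input>` value in the config as a plain tag name rather than an XPath, and it drops any element that has no mapping; both are simplifying assumptions, and `load_mapping` and `translate` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def load_mapping(config_xml):
    """Read the <conf> entries of the config into an {input_tag: output_tag} dict."""
    root = ET.fromstring(config_xml)
    return {c.findtext("input"): c.findtext("output") for c in root.iter("conf")}


def translate(element, mapping):
    """Copy `element`, renaming tags found in `mapping` and dropping unmapped ones."""
    if element.tag not in mapping:
        return None
    copy = ET.Element(mapping[element.tag], element.attrib)
    copy.text = element.text
    for child in element:
        mapped = translate(child, mapping)
        if mapped is not None:
            copy.append(mapped)
    return copy
```

Calling `ET.ElementTree(translate(root, mapping)).write("out.xml")` would then produce the output file; anything beyond simple renaming is where XSLT starts to pay off.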

How to parse just the text from a Word Doc using Python?

When you open an MS Word document, or for that matter most Windows file formats, as plain text, you see gibberish like the sample below, broken intermittently by the actual text. I need to extract the text and ignore the gibberish. How do I extract only the text that matters and ignore the rest? Please advise.
Here's a sample of open("sample.doc", "r").read() of a Word doc. Thanks
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00In an Interesting news,his is the first time we polled Indian channel community for their preferred memory supplier. Transcend came a close second, was seen to be more popular among class A city based resellers, was also the most recalled memory brand among customers according to resellers. However Transcend channels complained of parallel imports and constant unavailability of the products in grey x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x

The tool that seems the most viable, particularly if you need an all-Python solution, is OleFileIO.

doc is a binary format, not a markup language or anything like it.
Specs: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

There is no generic way to extract information from every file format. You need to know the format to know how to extract the information.
Just wanted to state that first. So what you should look for is libraries and software that can convert or extract the information you want. And, as Ofir mentioned, Microsoft has tools for that for its formats.
But if you cannot do this, and want to take the chance that the file contains visible text you find interesting, you could do a normal read and look for sequences of bytes that form text. Then comes the question of which languages and character sets to support in the hunt for text. Is it multi-byte text?
The easy start is to loop through the data and look for sequences of [a-zA-Z0-9_ -] to find the text. But Word text is probably stored as multi-byte characters, so you should treat each pair of bytes as one character.
Note: some of the newer formats, such as OpenOffice documents and docx, are multiple files in a compressed container. So you need to decompress the file first and then scan the XML documents for the text you are looking for.
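The byte-scanning idea could be sketched like this. It widens the character class to all printable ASCII, assumes a minimum run length of 6 to cut down on noise, and assumes that double-byte text is UTF-16LE (a printable byte followed by a zero byte); none of those choices come from the answer itself, and the function names are illustrative.

```python
import re


def ascii_runs(data, min_len=6):
    """Return runs of at least `min_len` printable ASCII bytes found in `data`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]


def utf16_runs(data, min_len=6):
    """Return runs of printable characters stored as UTF-16LE (char + zero byte)."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.decode("utf-16-le") for m in pattern.findall(data)]
```

The file has to be opened in binary mode, for example `open("sample.doc", "rb").read()`, before being passed to either function. This is a heuristic in the spirit of the Unix `strings` tool and will pick up formatting residue along with real text.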

A Word document in the newer .docx format is a compressed container. You need to uncompress it first to get the real data (try opening a .docx file in a program like WinRAR, and you'll see it contains multiple files).
The contents even seem to be XML, so reading the format should not be that hard, although I'm not sure whether you get all the data this way.

I had a similar problem, needing to query hundreds of Word documents. I converted the Word files to text files and used normal text parsing tools. Worked well.

How to programmatically insert comments into a Microsoft Word document?

Looking for a way to programmatically insert comments (using the comments feature in Word) into a specific location in a MS Word document. I would prefer an approach that is usable across recent versions of MS Word standard formats and implementable in a non-Windows environment (ideally using Python and/or Common Lisp). I have been looking at the OpenXML SDK but can't seem to find a solution there.

Here is what I did:
Create a simple document with Word (i.e. a very small one).
Add a comment in Word.
Save as docx.
Use the zipfile module of Python to access the archive (docx files are ZIP archives).
Dump the content of the entry "word/document.xml" in the archive. This is the XML of the document itself.
This should give you an idea of what you need to do. After that, you can use one of the XML libraries in Python to parse the document, change it and add it back to a new ZIP archive with the extension ".docx". Simply copy every other entry from the original ZIP and you have a new, valid Word document.
There is also a library which might help: openxmllib
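The archive handling described in this answer could look roughly like the following with the standard `zipfile` module. The function names are illustrative, and the sketch only covers reading and replacing "word/document.xml"; which XML needs to change for a comment is exactly what the dump is meant to reveal, so that part is left out.

```python
import zipfile


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def rewrite_document_xml(src_path, dst_path, new_xml):
    """Copy every entry of `src_path` to `dst_path`, replacing word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == "word/document.xml":
                data = new_xml
            else:
                data = src.read(info.filename)
            dst.writestr(info.filename, data)
```

A typical round trip would read the XML, modify it with an XML library, and pass the serialized result as `new_xml`.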

If this is server-side (non-interactive), use of the Word application itself is unsupported (but I see this is not applicable). So either take that route or use the OpenXML SDK to learn the markup needed to create a comment. With that knowledge it is all about manipulating data.
The .docx format is a ZIP of XML files with a defined structure, so once you get into the ZIP and find the right XML file, it mostly becomes a matter of modifying an XML DOM.
The best route might be to take a docx, copy it, add a comment (using Word) to one, and compare. A diff will show you the kind of elements/structures you need to be looking up in the SDK (or ISO/Ecma standard).
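One way to run that comparison from Python is sketched below, using only the standard library. The helper names `docx_diff` and `new_entries` are hypothetical, and the XML is re-indented with `minidom` on the assumption that the raw entries are written on very few lines, which would make a plain diff unreadable.

```python
import difflib
import zipfile
from xml.dom import minidom


def docx_diff(before_path, after_path, entry="word/document.xml"):
    """Return unified-diff lines for one XML entry of two .docx archives."""
    def pretty_lines(path):
        with zipfile.ZipFile(path) as archive:
            raw = archive.read(entry)
        # Re-indent the XML so that the diff shows one element per line.
        return minidom.parseString(raw).toprettyxml(indent="  ").splitlines()

    return list(difflib.unified_diff(
        pretty_lines(before_path), pretty_lines(after_path),
        "before", "after", lineterm=""))


def new_entries(before_path, after_path):
    """Return the names of archive entries present only in the second file."""
    with zipfile.ZipFile(before_path) as a, zipfile.ZipFile(after_path) as b:
        return sorted(set(b.namelist()) - set(a.namelist()))
```

Checking `new_entries` first is worthwhile because a change made in Word may add whole files to the archive in addition to editing existing ones.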
