Remove ASCII-encoded binary blobs from .txt files - python

I want to parse 10-K files (financial statements of firms). Example of Apple's can be found here (look for the .txt file). Now, I was reading this research paper (look on page 30-31) on how to parse these files. The step one is described as removing all ASCII-Encoded segments ... that's what I want to figure out on how to remove them.
I see several questions on StackOverflow on how to remove non-ASCII codes, but this is different. ASCII-Encoded segments are: All document segments with <TYPE> tags of GRAPHIC, ZIP, EXCEL and PDF - I want to delete them.
So if I load a txt file as follow:
fil = open('F:\\file.txt','r')
x = fil.read()
How can I remove all ASCII Encoded segments from this txt file? To remove HTML tags, I use the procedure here, but what about ASCII Encoded segments?

If I understand you correctly, the format you are processing is somehow related to the SEC EDGAR process.
I have not taken the time to look it up formally. Perhaps you should.
From inspecting the Apple statement you link to, it looks like you want to replace anything matching the regular expression <DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT> with an empty string.
Disclaimer: A proper implementation would use an XML parser and extract the elements you want, instead of attempting to lexically zap things you don't want. This should not be hard in lxml.
I first thought this was XBLR but it's not. Attempting to parse it with ETree throws an exception because the close tags for some elements (including <TYPE>) appear to be optional. The best way forward would be to find out what format this is (the EDGAR site has several specifications; one of them, perhaps?) and locate a proper DTD, then proceed from there.
Once you have that sorted out, you want to see how to remove elements with XPath and perhaps how to use regex in (lxml) XPath. Then probably reimplement the other extractions you have already done using XML and XPath.

Related

python-docx - replacing characters

I am trying to build a small program in which I open a docx document and replace characters by others, to do some old school caesar-style encrypting, after checking the documentation: [ https://python-docx.readthedocs.io ] I am afraid I can't find the object methods and attributes, the documentation just kind-of explains how to do certain stuff like creating paragraphs and sections but I can't find anything on retrieving document data and parsing. I would like to find a list of the objects in the document so I can parse through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
paragraph.append(i)
for i in paragraph:
for y in i:
y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If maybe I am using the incorrect library I would also appreciate it if you could point it out.
The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
I think I found something useful in case future readers might be interested. The problem with python-docx is I could get paragraphs individually and it would take a lot of time. I don't even know if titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files, it integrates with python-docx, or at least that's what the short documentation says. But what I can do, is save my docx file to PDF and use:
text = textract.process(
'path/to/norwegian.pdf',
method='pdftofile',
language='nor',
)
This allows you to get all the text as a string and save it preserving the layout of the pdf. Haven't tested it yet, will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package

Python -- Parsing files (docx, pdf and odt) and converting the content into my data model

I'm writing an import/export tool for importing docx, pdf, and odt files; in which a book has been written.
We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.
So far I've looked at PDFMiner and also found out that docx is just based on the openxml format, so the word/document.xml is essentially the file containing the whole thing, and I can parse it with lxml.
The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:
Book --> o2m --> Chapter --> o2m --> Image
Clearly, PDFMiner has a .get_outlines() function that will return the TOC for me. But it can't link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.
Even more problematic is that with docx/odt; those are just paragraphs -- <\w:sdt> -- elements, with attrs and child elements.
I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.
Textract is the best tool that i have encountered so far for parsing different file formats.
It can parse most of the file formats.
You can find the project on Github
Here is the official documentation
(Python 3 answer)
When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/
What it does is simply get the text from a .docx file and return it as a string; separate paragraphs are still clearly separate, as there are the new lines between, but all other formatting is lost. I think this may include the loss of end- and foot-notes, but if you want the body of a text, it works great.
I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:
import zipfile
try:
from xml.etree.cElementTree import XML
print("cElementTree")
except ImportError:
from xml.etree.ElementTree import XML
print("ElementTree")
EDIT:
If, in the body of the function, you replace
'word/document.xml'
with
'word/footnotes.xml'
or
'word/endnotes.xml'
you can get the footnotes and endnotes, respectively.
The markers for where they were in the text are lost, however.

Extracting Data from a .txt file using python

I many, many .xml files and i need to extract some co-ordinates from them.
Extracting data straight from .xml files seems to be very, very complicated - so i am working saving the .xml files as .txt files and extracting the data that way. However, when i open the .txt file, my data is all bunched together on about 6 lines.. And all the scripts i have found so far select the data by reading the first word on each line.. but obviously that won't work for me!
I need to extract the numbers inbetween these comments:
<gml:lowerCorner>137796 483752</gml:lowerCorner> <gml:upperCorner>138178 484222</gml:upperCorner>
In the text file they are all grouped together! Does anyone know how to extract this data? Thank you!
This is absolutely the wrong approach. Leave it alone and improve your ways :-)
Seriously, if the file is XML, then just use an XML parser to read it. Learning how to do it in Python isn't hard and will make your life easier now and much easier in the future, when you may find yourself facing more complex parsing needs, and you won't have to re-learn it.
Look at xml.etree.ElementTree.ElementTree. Here's some sample code:
>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("your_xml_file.xml")
Now just read the documentation of the module and see what you can do with tree. You'll be surprised to find out how simple it is to get to information this way. If you have specific questions about extracting data, I suggest you open another question in which you specify the format of the XML file you have to parse, and what data you have to take out of there. I'm sure you will have working code suggested to you in matters of minutes.
You can also open through the python script .xml file as you open a .txt file.
data = open("file.xml")
xml = data.read()
Then you can use regular expressions to find those numbers you want so badly.
The top answer is still the top answer. However, I've been doing just this with HTML and this link lxml and xpath ideal.
Open your browser to the site (or data) which is of interest. In Chrome, right click and 'Inspect Element'. In the Developer window on the highlighted text right click again and 'Copy XPath'. For google.com and clicking on the main search box I get the following XPath.
//*[#id="lst-ib"]
You can use lxml to grab various data from this item. See what you get when you append 'text()' or '#value' or '#href' on the end.
For really simple xml i just use a regex, can't be botherd to start an slow xml parser for a simple xml packet.
In [1]: data = open("file.txt","r").read()
In [2]: import re
In [3]: re.compile("([\d]+)").findall(data)
Out[3]: ['137796', '483752', '138178', '484222']

splitting an exml file into smaller files

the xml file contains information about movies. how do i split the xml file into smaller files? ( so each small file is a separate movie)
Without knowing the details, here is a broad outline of a possible approach:
Parse the XML using a suitable library (BeautifulSoup, lxml etc.)
Find the element corresponding to each movie. This can be done using a plain findAll or may require using an XPATH expression.
Pretty print the subtree starting corresponding to each movie element into separate files.
Of course a more detailed answer is not possible unless you post some sample XML and provide more details.

How to parse just the text from a Word Doc using Python?

When you try opening a MS Word document or for that matter most Windows file formats, you will see gibberish as given below broken intermittently by the actual text. I need to extract the text that goes in and want to ignore the gibberish -- which is something like given below. How do I extract only the text that matters, and ignore rest of the stuff. Please advise.
Here's a sample of open("sample.doc",r").read() of a word doc. Thanks
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00In an Interesting news,his is the first time we polled Indian channel community for their preferred memory supplier. Transcend came a close second, was seen to be more popular among class A city based resellers, was also the most recalled memory brand among customers according to resellers. However Transcend channels complained of parallel imports and constant unavailability of the products in grey x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
The tool that seems the most viable, particularly if you need an all python solution is OleFileIO.
doc is a binary format, it's not a markup language or something.
Specs: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
There is no generic why to extract
information from every file format.
You need to know the format to know
how to extract the information.
Just wanted to state that first. So what you should look for is libraries and software that can convert/extract the information you want. And as mentioned by Ofir MicroSoft have tools for that for their formats.
But if you can not do this and want to take the chance that there is text visible in the file that you think is interesting to read you could do a normal read and look for sequences of bytes that will build text. Then comes the question, what languages/charset should I support support in my hunt for text. Is it multi-byte text?
The easy start is to loop through the data and look for sequences of [a-zA-z0-9_- ] to find the text. But word is probably multi-byte. So you should scan double byte as one char.
Note: some of the new formats like open office and docx is multiple files in a compressed container. So you need to de-compress the file first, and scan XML documents after the text you looking for.
Word doc is a compressed format. You need to uncompress it first to get the real data (try open a doc file in a program like winrar, and you'll see it contains multiple files.
It even seems to be XML, so reading the format should not be that hard, although I'm not sure if you get all the data this way.
I had a similar problem, needing to query hundreds of Word documents. I converted the Word files to text files and used normal text parsing tools. Worked well.

Categories

Resources