Open and read each file in a directory separately [duplicate] - python

I'd like to search a Word 2007 file (.docx) for a text string, e.g., "some special phrase", that could be found by a search within Word.
Is there a way from Python to see the text? I have no interest in formatting - I just want to classify documents as having or not having "some special phrase".

After reading your post above, I made a 100% native Python docx module to solve this specific problem.
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
The docx module is at https://python-docx.readthedocs.org/en/latest/
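If you need to classify a whole folder of documents rather than a single file, a minimal sketch using the same opendocx/search calls shown above might look like this (the directory path and phrase are placeholders, not part of the original answer):
# Hedged sketch: classify every .docx in a folder with the calls shown above.
import glob
from docx import *

for path in glob.glob('docs/*.docx'):
    document = opendocx(path)
    print('%s: %s' % (path, search(document, 'some special phrase')))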

More exactly, a .docx document is a Zip archive in OpenXML format: you first have to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping it I found some folders. The word folder contains the document itself, in the file document.xml.
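A minimal sketch of that first step, assuming the standard library's zipfile module (the filename is a placeholder):
# Treat the .docx as a zip archive and look inside; the body is word/document.xml.
import zipfile

with zipfile.ZipFile('A document.docx') as z:
    print(z.namelist())                      # lists the word/ folder and friends
    xml_body = z.read('word/document.xml')   # the document text lives in here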

A problem with searching inside a Word document's XML is that the text can be split into elements at any character. It will certainly be split if the formatting changes, for example if "Hello" is bold and "World" is not, as below. But it can be split at any point, and that is valid in OOXML, so you can end up dealing with XML like this even if formatting does not change in the middle of the phrase!
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
  <w:r w:rsidRPr="003F6D7A">
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>Hello</w:t>
  </w:r>
  <w:r>
    <w:t xml:space="preserve">World.</w:t>
  </w:r>
</w:p>
You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other "dead ends" just because the OOXML spec is around 6000 pages long and MS Word can write lots of "stuff" you don't expect. So you could end up writing your own document processing library.
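For the simple has-the-phrase-or-not case, here is a hedged sketch with the standard library (zipfile plus ElementTree) that joins all the w:t runs back together, so the split-run problem above mostly goes away. The namespace URI is the standard WordprocessingML one; the filename and phrase are placeholders.
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

with zipfile.ZipFile('A document.docx') as z:
    root = ET.fromstring(z.read('word/document.xml'))

# Join every text run, regardless of how Word chopped the phrase into <w:r> elements.
text = ''.join(node.text or '' for node in root.iter(W + 't'))
print('some special phrase' in text)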
Alternatively, you can try Aspose.Words.
It is available as .NET and Java products, and both can be used from Python: the .NET one via COM Interop, the Java one via JPype. See the Aspose.Words Programmers Guide, "Utilize Aspose.Words in Other Programming Languages" (sorry, I can't post a second link, Stack Overflow does not let me yet).

In this example, "Course Outline.docx" is a Word 2007 document, which does contain the word "Windows", and does not contain the phrase "random other string".
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()
Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the 'document.xml' file in the 'word' folder. If you wanted to be more sophisticated, you could then parse the XML, but if you're just looking for a phrase (which you know won't be a tag), then you can just look in the XML for the string.
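One caveat (an assumption about newer Python versions, not part of the original answer): on Python 3, z.read() returns bytes, so either decode the XML or compare against a bytes literal:
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml").decode("utf-8")
True
>>> z.close()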

You can use docx2txt to get the text inside the docx, then search in that text:
npm install -g docx2txt
docx2txt input.docx # This will print the text to stdout
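If you want to drive that from Python rather than the shell, here is a hedged sketch using subprocess (it assumes the npm tool above is installed and on PATH):
import subprocess

# docx2txt prints the extracted text to stdout; capture it and search it.
text = subprocess.check_output(['docx2txt', 'input.docx'])
print(b'some special phrase' in text)   # bytes on Python 3; decode if you prefer str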

A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that, you probably have to find a library that understands the Word format so that you can filter out the things you're not interested in.
A second choice would be to interop with Word and do the search through it.

A docx file is essentially a zip file with XML inside it.
The XML contains the formatting, but it also contains the text.

OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:
<b>Looking <i>for</i> this <u>phrase</u>
There's no easy way to find that using a simple text scan.
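Here is a hedged sketch of the OLE route using pywin32 (it assumes Word and pywin32 are installed; the path and phrase are placeholders). Word itself reassembles the runs, so the split-formatting problem above disappears:
import win32com.client

word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(r'C:\docs\A document.docx')
try:
    found = 'some special phrase' in doc.Content.Text   # the document's full text
finally:
    doc.Close(False)   # close without saving changes
    word.Quit()
print(found)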

You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.

You may also consider using the library from OpenXMLDeveloper.org

Related

Beautifulsoup4, parsing Tableau XML file, and writing to file

I'm having an issue where I'm using BeautifulSoup to parse the XML generated from a Tableau workbook, and when I write the results to file it doesn't behave as expected. I chose bs4 and its standard XML parser because I find it easiest to comprehend and I don't need the speed of the lxml parser/package.
Background: I have a calculated field in my Tableau workbook that will programmatically change during publish depending on the server and site location that the template workbook will go to. I've already built some functions and scripted out everything I need to get the data to do this, but when my script writes the XML to file it adds an extra ampersand encoding (amp;) in front of my &apos; entities. The resulting file is valid and can be opened in Tableau, but the field is considered invalid despite looking like it is valid. I'm thinking the XML is somehow getting malformed somewhere in my process.
Code so far for where I think the issue is occurring:
from bs4 import BeautifulSoup as bs

twb = open(Script_config['local_file_location'], 'r')
bs_content = bs(twb, 'xml')
# formula_final below comes from another script that handles getting the data I need
# to programmatically generate the formula.
# Here is what I use to generate the bulk of the formula for Tableau:
# 'When &apos;[{}]&apos; then {} '.format(rows['Column_Name'], rows['Formatted_ColumnName'])
# Some other code then slaps together the formula I need as a string that can be written into my XML.
# Verified that my result is coming over correctly and only changes once I do the replacement
# here and/or the writing of the file.
for calculation in bs_content.find_all('column', {'caption': 'Group By', 'datatype': 'string', 'name': '[Calculation_12345678910]'}):
    calculation.find('calculation')['formula'] = formula_final

with open('test.twb', 'w') as file:
    file.write(str(bs_content))
Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<workbook source-build="2021.1.4 (20211.21.0712.0907)" source-platform="win" version="18.1" xml:base="https://localhost" xmlns:user="http://www.tableausoftware.com/xml/user">
...
<column caption="Group By" datatype="string" name="[Calculation_12345678910]" role="dimension" type="nominal">
<calculation class="tableau" formula="Case [Parameters].[Location External ID Parameter] When &apos;[Territory]&apos; then [Territory] End"/>
</column>
Problem:
In the sample XML, Tableau expects the formula attribute to contain just &apos;, but after my script writes the file the apostrophe entity comes out double-escaped, with an extra amp; in front of the apos; (i.e. &amp;apos; instead of &apos;).
What I've tried:
Thinking that I could just escape the & character, I put the necessary backslashes in place before the apos; portion, but to no avail: I can't figure out how to get my XML written so that it doesn't put the extra ampersand encoding in front of the other special characters in my XML.
Any help would be much appreciated!
Good problem description.
Your problem is known as "double escaping". Your program is reading data which has already been serialized by an XML processor; that's why it contains &apos;[{}]&apos; and not '[{}]'.
I think your program reads that XML value from a file as a simple string and assigns it to the value of a tag. But when BeautifulSoup's XML processor encounters the & in the tag value, it must replace it with &amp;. So you end up with &amp;apos; instead of &apos; in the XML output.
The quick and dirty solution is to write some code to replace all XML entities with the equivalent text. A better solution would be to read the XML data using an XML parser - that way, your program will receive the intended string value automatically.
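To make the difference concrete, here is a minimal sketch (assuming bs4 with the 'xml' parser, as in the question; the tag and formula are simplified stand-ins):
from bs4 import BeautifulSoup

doc = BeautifulSoup('<calculation formula=""/>', 'xml')

# Wrong: the value is already escaped, so bs4 re-escapes the & on output
# and the file ends up containing &amp;apos;.
doc.calculation['formula'] = "When &apos;[Territory]&apos; then [Territory]"
print(doc)

# Right: hand bs4 the plain string and let it do any escaping itself.
doc.calculation['formula'] = "When '[Territory]' then [Territory]"
print(doc)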

Python3 Mutagen not outputting unicode tags

I'm attempting to automate some ID3 tagging with Mutagen, but whenever I insert Unicode characters they get replaced by question marks.
The smallest test code that reproduces the problem is as follows:
from mutagen.id3 import ID3, TALB
audio = ID3()
audio['TALB'] = TALB(encoding=3, text=u'test祥さtest')
audio.save('test.mp3', v1=2)
When run, test.mp3's album tag shows up as test??test in both my file manager and music player. If I manually enter unicode tags via the file manager the unicode characters display normally without issue.
Things I have already tried in order to fix this problem:
Trying both with and without the u string prefix
Using the alternate Mutagen tagging syntax (audio.add(TALB(encoding=3, text=u'test祥さtest')))
I'm using the v1=2 argument for the save function as leaving it out results in around half the files not having their tags written (and unicode still being outputted as question marks), and other values refuse to write ID3 tags for any files.
I'm using Windows 10 64bit. My Python environments are Anaconda3 (Python3.4) and Python2.7, both result in the same problem with same code.
I think your main problem is that the way you are checking whether the tags are correct is itself misleading you. Let me explain.
For me, this code works:
from mutagen.id3 import ID3, TALB
audio = ID3()
audio['TALB'] = TALB(encoding=3, text=u'test祥さtest')
audio.save("test.mp3",v1=0)
Checking the file in a text editor shows the tags correctly written in Unicode.
So why can't you see the tags? Likely because mutagen defaults to writing ID3v2.4 tags, which neither Windows File Explorer nor any of the standard Windows media players will read. However, by adding the v1=2 argument you have forced mutagen to also write ID3v1 tags. These are readable by File Explorer but unfortunately do not support Unicode, which is why you are seeing the question marks. So when you want Unicode it is useful to add v1=0 (as I have done) to prevent any ID3v1 tags being written and distracting from the main issue of getting the ID3v2 tags working.
So now move to ID3v2.3 instead of ID3v2.4 and see if that helps:
from mutagen.id3 import ID3, TALB
audio = ID3()
audio.update_to_v23()
audio['TALB'] = TALB(encoding=3, text=u'test祥さtest')
audio.save("test.mp3",v1=0,v2_version=3)
Finally, the best way to see what tags are really in the file is to use a dedicated tag editor which comprehensively follows the spec, like Mp3tag. This helps to find out if the problem is how you are writing the tags, or how your player is reading them.
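A hedged way to check what actually landed in the file, independent of File Explorer, is to read it back with mutagen itself:
from mutagen.id3 import ID3

tags = ID3('test.mp3')
print(tags.pprint())   # should show TALB=test祥さtest if the ID3v2 frame is intact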

read error on Python in mac

I'm new to Python and am using Python 2.7 on Mac OS X 10.8.3.
Today I hit a problem: Python doesn't get the right data when reading a file.
My input file includes two website URLs, like this:
www.google.com
www.facebook.com
and my Python code, which just prints the input, is below:
f = open("weblist.rtf","r")
print f.read()
f.close()
But after running it, the output looks like this:
{\rtf1\ansi\ansicpg1252\cocoartf1187\cocoasubrtf370
{\fonttbl\f0\fnil\fcharset134 STHeitiSC-Medium;}
{\colortbl;\red255\green255\blue255;}
\paperw11900\paperh16840\margl1440\margr1440\vieww12200\viewh12840\viewkind1
\pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0\b\fs36 \cf0 www.google.com\
www.facebook.com}
How can I solve this problem? Does anyone have a suggestion?
RTF files are not simple text files (like Windows .txt files); they carry RTF-specific headers and control words, which is exactly what you are seeing in your output.
Try saving your list as a plain text file instead.
You cannot treat RTF files like normal text files and read them line-by-line.
You could look at the following Stack Overflow question, which deals with converting RTF files to plain text:
Is there a Python module for converting RTF to plain text?
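Once the list is re-saved as plain text (e.g. in TextEdit: Format > Make Plain Text), the original code works; here is a minimal sketch assuming the new file is called weblist.txt (Python 2, matching the question):
f = open('weblist.txt', 'r')
for line in f:
    url = line.strip()
    if url:
        print url
f.close()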

Python parsing xml directly from web address

I tried to find a way but I can't. I have set up an xml.sax parser in Python and it works perfectly when I read a local file (for example calendar.xml), but I need to read an XML file from a web address.
I figured it would work if I did this:
toursxml='http://api.songkick.com/api/3.0/artists/mbid:'+mbid+'/calendar.xml?apikey=---------'
toursurl=urllib2.urlopen(toursxml)
toursurl=toursurl.read()
parser.parse(toursurl)
but it doesn't. I'm sure there's an easy way, but I can't find it.
I can easily go to the URL, download the file, and open it by doing
parser.parse("calendar.xml")
As a workaround I've set it up to download the file, save it locally, close it, and then read it. But as you can guess that's slow as hell.
Is there any way to read the XML directly? Also note that the URL does not end in ".xml", so that may be a problem later.
First, your example is mixed up. Please don't reuse variables.
toursurl = urllib2.urlopen(toursxml)
toursurl_string = toursurl.read()
xml.sax.parseString(toursurl_string, handler)  # handler is the ContentHandler you registered
This reads the entire response into a string named toursurl_string.
To parse a string, you use the module-level xml.sax.parseString(data, handler) function.
http://docs.python.org/library/xml.sax.html#xml.sax.parseString
If you want to combine reading and parsing, you have to pass the "stream" or filename to parse.
toursurl= urllib2.urlopen(toursxml)
parser.parse(toursurl)
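Putting it together, here is a hedged sketch (Python 2, matching the question; mbid and the apikey placeholder are the question's own) with a trivial ContentHandler:
import urllib2
import xml.sax

class ShowElements(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        print name          # replace with whatever handling you actually need

parser = xml.sax.make_parser()
parser.setContentHandler(ShowElements())

toursxml = 'http://api.songkick.com/api/3.0/artists/mbid:' + mbid + '/calendar.xml?apikey=---------'
parser.parse(urllib2.urlopen(toursxml))   # parse() accepts file-like objects directly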
parser.parse(xyz)
expects xyz to be a filename or file-like object; you are looking for
xml.sax.parseString(xyz, handler)
which expects xyz to be a string containing XML (with handler being your ContentHandler).

How to separate content from a file that is a container for binary and other forms of content

I am trying to parse some .txt files. These files serve as containers for a variable number of "children" files that are set off, or identified, within the container by SGML tags. With Python I can easily separate the child files. However, I am having trouble writing the binary content back out as a binary file (say a GIF or JPG). In the simplest case the container might have an embedded HTML file followed by a graphic that is referenced by the HTML. I am assuming my problem comes from reading the original .txt file with open(filename, 'r'), but that seems to be the only option for finding the SGML tags that split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions, but I am still struggling with the most basic questions. For example, when I open the file with WordPad and scroll down to the section tagged as a GIF, I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can find the section easily enough, but where does the GIF file begin? Does the header start with 644, with the blanks after the word begin, or with the line beginning with M1TE?
Next, when the file is read into Python, does Python do anything to the binary content that has to be undone when it is written back out?
I can find the lines where the graphics begin:
filerefbin = open('myfile.txt', 'rb')
wholeFile = filerefbin.read()
import re
graphicReg = re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics = graphicReg.finditer(wholeFile)
graphicsTags = []
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename on the "first" line. I have also successfully gotten to the end of the embedded GIF file. But I can't seem to write out the right combination of things: when I isolate and save h65803h6580301.gif and double-click it, I don't get to see the graphic.
Interestingly, when I open the file in 'rb' mode, the line endings appear to still be present even though they don't seem to have any effect in Notepad. So that is clearly one of my problems; I might need to use readlines and join the lines together after stripping out the \n.
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin, save it in a .txt file, and then run the following:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work on some other stuff for the rest of the day, but I will post more here as I look at this more closely. The first thing I need to figure out is how to use something other than a file: since I read the whole .txt file into memory and clipped out the section that has the image, I'd like to work with the clipped section directly instead of writing it out to test2.txt first. I am sure that can be done; it's just a matter of figuring out how.
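For working with the clipped section directly (no test2.txt), here is a hedged sketch: uu.decode accepts open file objects as well as filenames, so a StringIO wrapper around the clipped string should do (Python 2, matching the question; 'clipped' is assumed to run from the "begin 644 ..." line through the closing "end" line):
import uu
from cStringIO import StringIO

# 'clipped' is the uuencoded section already sitting in memory.
uu.decode(StringIO(clipped), open(r'c:\test.gif', 'wb'))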
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The uu module's interface is file-oriented, so the straightforward approach goes through temporary files for encoding and decoding. You can avoid that entirely by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser: http://docs.python.org/library/sgmllib.html .
There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you are interested in.
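A minimal sketch of that idea (Python 2 only; sgmllib was removed in Python 3), just showing where the do_ methods and the text between tags land; the tag names come from the snippet above:
import sgmllib

class ContainerScanner(sgmllib.SGMLParser):
    def do_filename(self, attrs):
        print 'hit a <FILENAME> tag'

    def do_description(self, attrs):
        print 'hit a <DESCRIPTION> tag'

    def handle_data(self, data):
        pass    # the text between tags (including the uuencoded body) arrives here

scanner = ContainerScanner()
scanner.feed('<FILENAME>h65803h6580301.gif\n<DESCRIPTION>GRAPHIC\n<TEXT>\n')
scanner.close()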
You need to open(filename, 'rb') to open the file in binary mode. Be aware that on some operating systems (notably Windows) this means Python will hand you the raw two-byte \r\n line endings instead of translating them to \n.
