I'm new to Python, using Python 2.7 on Mac OS X 10.8.3.
Today I ran into a problem: Python doesn't read the data I expect when reading a file.
My input file contains two website URLs, like this:
www.google.com
www.facebook.com
and the Python code below just prints the input:
f = open("weblist.rtf","r")
print f.read()
f.close()
But after running it, the output looks like this:
{\rtf1\ansi\ansicpg1252\cocoartf1187\cocoasubrtf370
{\fonttbl\f0\fnil\fcharset134 STHeitiSC-Medium;}
{\colortbl;\red255\green255\blue255;}
\paperw11900\paperh16840\margl1440\margr1440\vieww12200\viewh12840\viewkind1
\pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0\b\fs36 \cf0 www.google.com\
www.facebook.com}
How can I solve this problem? Does anyone have a suggestion?
RTF files are not simple text files (like Windows .txt files); they carry RTF-specific headers and formatting codes.
Try your code on a plain text file first.
You cannot treat RTF files like normal text files and read them line by line.
You could look at the following Stack Overflow question, which deals with converting RTF files to text:
Is there a Python module for converting RTF to plain text?
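If installing a converter library isn't an option, here is a very rough standard-library-only sketch of stripping simple RTF. The regexes are my own assumptions and only cover files as simple as the one above; real RTF needs a proper parser.

```python
import re

def rtf_to_text(rtf):
    # Drop font/color tables, hex escapes (\'ab), control words
    # (\fs36, \ansicpg1252, ...), turn "\" + newline into a real
    # line break, then drop the remaining group braces.
    text = re.sub(r'\{\\(?:fonttbl|colortbl)[^{}]*\}', '', rtf)
    text = re.sub(r"\\'[0-9a-fA-F]{2}", '', text)
    text = re.sub(r'\\[a-zA-Z]+-?\d* ?', '', text)
    text = text.replace('\\\n', '\n')
    text = re.sub(r'[{}]', '', text)
    return text.strip()

# A cut-down version of the RTF file shown above:
sample = ('{\\rtf1\\ansi\\ansicpg1252\n'
          '{\\fonttbl\\f0\\fnil\\fcharset134 STHeitiSC-Medium;}\n'
          '{\\colortbl;\\red255\\green255\\blue255;}\n'
          '\\f0\\b\\fs36 \\cf0 www.google.com\\\n'
          'www.facebook.com}')
print(rtf_to_text(sample))
```

For anything beyond a toy file like this, saving the document as plain text or using a real RTF-to-text converter is the safer route.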
I am trying to use Python to read a Japanese PDF or HTML file as input, and I want to get the Unicode code point of each Japanese character in the file.
Someone suggested that I could use the tika library to read a PDF file. I ran the following code and got a series of garbled text, shown below.
import tika
from tika import parser
parsed = parser.from_file('jpn.pdf')
print(parsed["content"])
result:
��������������������������������
�1948.12.10
������
����������������
… (many more lines of similarly garbled text) …
Is there a recommended Python library or approach for dealing with this problem?
This is my first time asking a question on this platform. Please help.
You can solve your problem by using the tika-python library.
You can also try this code:
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('fileName.pdf')
print(parsed["metadata"])
print(parsed["content"])
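Since the stated goal is each character's Unicode code point, ord() gives that once you have properly decoded text. The string below is only a stand-in for parsed["content"] (running tika itself needs a Java runtime), so the sketch is self-contained:

```python
# Stand-in for the extracted content once decoding works:
text = u"\u4e16\u754c"  # the two characters 世界
for ch in text:
    # ord() returns the code point; %04X prints it in U+XXXX style
    print("U+%04X %s" % (ord(ch), ch))
```

If the output still comes out as replacement characters like the ones above, the problem is usually the font/encoding in the PDF itself rather than the Python side.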
I tried searching Stack Overflow and other sources, but could not find a solution.
I am trying to run/execute a Python script (that parses the data) using a text file as input.
How do I go about doing it?
Thanks in advance.
These basics can be found using Google :)
http://pythoncentral.io/execute-python-script-file-shell/
http://www.python-course.eu/python3_execute_script.php
Since you are new to Python make sure that you read Python For Beginners
Sample code Read.py:
import sys
with open(sys.argv[1], 'r') as f:
    contents = f.read()
    print contents
To execute this program in Windows:
C:\Users\Username\Desktop>python Read.py sample.txt
You can also save the input in the desired line-wise format and redirect the file to the script's standard input:
python source.py < input.txt
Note: to use input or source files in another directory, give the full file path instead of just the file name.
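A sketch of that redirection route: if the script is structured around a stream, the same code works with an open file or with sys.stdin. io.StringIO stands in for the redirected file here so the example runs on its own:

```python
import io
import sys

def read_input(stream):
    # Works the same whether stream is an open file or sys.stdin
    return stream.read()

# In a real script you would call read_input(sys.stdin) and run:
#   python source.py < input.txt
# Demo with an in-memory stream standing in for the redirected file:
demo = io.StringIO("www.google.com\nwww.facebook.com\n")
print(read_input(demo))
```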
I'd like to search a Word 2007 file (.docx) for a text string, e.g., "some special phrase" that could/would be found from a search within Word.
Is there a way from Python to see the text? I have no interest in formatting - I just want to classify documents as having or not having "some special phrase".
After reading your post above, I made a 100% native Python docx module to solve this specific problem.
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
The docx module is at https://python-docx.readthedocs.org/en/latest/
More exactly, a .docx document is a Zip archive in OpenXML format: you first have to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.
A problem with searching inside a Word document's XML is that the text can be split into elements at any character. It will certainly be split where formatting changes, for example when "Hello" is bold in "Hello World". But it can be split at any point, and that is valid in OOXML, so you can end up dealing with XML like the following even when formatting does not change in the middle of the phrase:
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
<w:r w:rsidRPr="003F6D7A">
<w:rPr>
<w:b />
</w:rPr>
<w:t>Hello</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">World.</w:t>
</w:r>
</w:p>
You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other "dead ends" just because the OOXML spec is around 6000 pages long and MS Word can write lots of "stuff" you don't expect. So you could end up writing your own document processing library.
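In Python, that "load it into a DOM tree and ask for text only" step can be sketched with the standard library's xml.etree.ElementTree, joining every w:t node so text split across runs is reassembled. The fragment is adapted from the one above (the namespace URI is the standard WordprocessingML one):

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, as used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

fragment = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:rPr><w:b /></w:rPr><w:t>Hello</w:t></w:r>
  <w:r><w:t xml:space="preserve"> World.</w:t></w:r>
</w:p>"""

root = ET.fromstring(fragment)
# Concatenate all w:t elements to undo the run splitting
text = "".join(t.text or "" for t in root.iter(W + "t"))
print(text)
```

This handles split runs, but as noted it will not cover everything Word can write (fields, text boxes, footnotes, and so on).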
Or you can try using Aspose.Words.
It is available as .NET and Java products. Both can be used from Python: one via COM Interop, the other via JPype. See the Aspose.Words Programmers Guide, "Utilize Aspose.Words in Other Programming Languages" (sorry, I can't post a second link; Stack Overflow does not let me yet).
In this example, "Course Outline.docx" is a Word 2007 document, which does contain the word "Windows", and does not contain the phrase "random other string".
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()
Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the 'document.xml' file in the 'word' folder. If you wanted to be more sophisticated, you could then parse the XML, but if you're just looking for a phrase (which you know won't be a tag), then you can just look in the XML for the string.
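On Python 3 the same approach needs one extra step, because z.read() returns bytes rather than str. A self-contained sketch, using an in-memory stand-in archive since no real .docx is at hand:

```python
import io
import zipfile

# Build a tiny stand-in "docx" (really just a zip holding word/document.xml)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:t>This course covers Windows.</w:t>")

with zipfile.ZipFile(buf) as z:
    xml = z.read("word/document.xml").decode("utf-8")  # bytes -> str
    print("Windows" in xml)
    print("random other string" in xml)
```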
You can use docx2txt (a Node.js tool, installed via npm) to extract the text inside the docx, then search that text:
npm install -g docx2txt
docx2txt input.docx # This will print the text to stdout
A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you're not interested in.
A second choice would be to interop with word and do the search through it.
A docx file is essentially a zip file with XML inside it.
The XML contains the formatting, but it also contains the text.
OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:
<b>Looking <i>for</i> this <u>phrase</u>
There's no easy way to find that using a simple text scan.
You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.
You may also consider using the library from OpenXMLDeveloper.org
I'm trying to open a PDF with pyPdf. I get the following error:
pyPdf.utils.PdfReadError: EOF marker not found
I thought that I should add the EOF myself. However, I don't want to write bytes. Isn't it OS specific? I want to call something like os.eof(). What do I write? This thread is not helpful.
PDF's EOF marker is a special string (%%EOF) that needs to appear in your PDF file. If it doesn't, you have a malformed PDF. This string separates the actual PDF contents from any additional data (embedded files etc.).
It has nothing to do with the EOF event you run into when reading any file up to its end.
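A quick sanity check for the marker itself, under the assumption that %%EOF sits within the last kilobyte of a well-formed file (it usually does, possibly followed by a newline). The demo writes a throwaway stand-in, not a real PDF:

```python
import tempfile

def has_eof_marker(path):
    # Look for %%EOF near the end of the file
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to end of file
        f.seek(max(0, f.tell() - 1024))   # back up at most 1 KB
        return b"%%EOF" in f.read()

# Throwaway file standing in for a PDF:
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n...objects...\n%%EOF\n")
print(has_eof_marker(tmp.name))
```

If the marker really is missing, the file is truncated or corrupt, and the fix is to repair or re-download the PDF, not to append bytes yourself.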
I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough, but where does the gif file begin? Does the header start with 644, with the blanks after the word begin, or with the line beginning with M1TE?
Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?
I can find the lines where the graphics begin:
filerefbin = open('myfile.txt', 'rb')
wholeFile = filerefbin.read()
import re
graphicReg = re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics = graphicReg.finditer(wholeFile)
graphicsTags = []
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename on the 'first' line. I have also successfully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things, so that when I double-click the isolated and saved h65803h6580301.gif I actually see the graphic.
Interestingly, when I open the file in 'rb' mode, the line endings appear to still be present, even though they don't seem to have any effect in Notepad. So that is clearly one of my problems: I might need to use readlines() and join the lines back together after stripping out the \n.
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day, but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file: since I read the whole .txt file into memory and clipped out the section that has the image, I need to work with the clipped section directly instead of writing it out to test2.txt. I am sure that can be done; it's just a matter of figuring out how.
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The module uu requires the use of temporary files for encoding and decoding. You can accomplish this without resorting to temporary files by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
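Another temp-file-free route is binascii, which the uu machinery uses under the hood: decode the body lines between begin and end one at a time. The payload below is made up for the demo; in the real case the lines would come from the clipped section of the .txt container:

```python
import binascii

def uu_decode_lines(lines):
    # Decode the body of a uuencoded section, skipping the framing lines
    out = b""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line == "`" or line == "end" or line.startswith("begin "):
            continue
        out += binascii.a2b_uu(line)   # one 45-byte chunk per line
    return out

payload = b"GIF89a stand-in bytes"          # pretend gif data
body = binascii.b2a_uu(payload).decode("ascii")
print(uu_decode_lines(["begin 644 demo.gif", body, "end"]) == payload)
```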
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html .
There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you want.
You need to open(filename, 'rb') to open the file in binary mode. Be aware that on some operating systems this means Python will hand you the raw two-byte \r\n line endings, which can be confusing.