appending %%EOF to a PDF file in python

appending %%EOF to a PDF file in python - python

I'm trying to open a PDF with pyPdf. I get the following error:
pyPdf.utils.PdfReadError: EOF marker not found
I thought that I should add the EOF myself. However, I don't want to write bytes. Isn't it OS specific? I want to call something like os.eof(). What do I write? This thread is not helpful.

PDF's EOF marker is a special string (%%EOF) that needs to appear in your PDF file. If it doesn't, you have a malformed PDF. This string separates the actual PDF contents from any additional data (embedded files etc.).
It has nothing to do with the EOF event you run into when reading any file up to its end.

Related

lldb crashlog - substring not found

I want to export my crash from Xcode to file. I've created at desktop file crash.log but I have an error when I'm trying to export file.
crashlog -c ~/Desktop/crash.log
Error:
error: python exception: substring not found

Does your crashlog start with the following banner by any chance?
-------------------------------------
Translated Report (Full Report Below)
-------------------------------------
If so, then you're dealing with the new JSON crashlog format. Nobody wants to read the raw JSON, so Xcode.app and Console.app are showing you a textual, user-readable version of the crash, even though the actual file contains just JSON. The textual representation of crashlog you're seeing is not exactly same as the old textual crashlog format and therefore LLDB doesn't know how to parse it. That means you cannot copy/paste it into a file and symbolicate it in LLDB.
As I mentioned before, even though Xcode and Console are showing you a textual representation, the real file is just plain JSON. If you don't believe me, try opening the file with TextEdit.
LLDB is fully capable of parsing the new JSON format. So if you pass the crashlog file to LLDB, it will be able to parse it. Alternatively, if you must, you can look for the "Full Report" banner and copy all the JSON below it into a file and pass that into LLDB.
-----------
Full Report
-----------

How to clear a file that is being used in a script and always writing on it? Strange symbols appear

I got a python script in crontab writing in a file 24/7 because is always getting information and writing on file FILE.txt.
I run another script on crontab to get that content and send to a database (just what I want) and then truncate file.
But what happen? Lets suppose the file have 1000 bytes and I clear with that second script, the file after writing on it will have 1000 bytes + new bytes.
That 1000 bytes are strange symbols on file something like (?)(?)(?)(?)(?)
I tried the following:
file.seek(0)
file.truncate(0)
But without success.

Python: How to get the URL to a file when the file is received from a pipe?

I created, in Python, an executable whose input is the URL to a file and whose output is the file, e.g.,
file:///C:/example/folder/test.txt --> url2file --> the file
Actually, the URL is stored in a file (url.txt) and I run it from a DOS command line using a pipe:
type url.txt | url2file
That works great.
I want to create, in Python, an executable whose input is a file and whose output is the URL to the file, e.g.,
a file --> file2url --> URL
Again, I am using DOS and connecting executables via pipes:
type url.txt | url2file | file2url
Question: file2url is receiving a file. How do I get the file's URL (or path)?

In general, you probably can't.
If the url is not stored in the file, I seems very difficult to get the url. Imagine someone reads a text to you. Without further information you have no way to know what book it comes from.
However there are certain usecases where you can do it.
Pipe the url together with the file.
If you need the url and you can do that, try to keep the url together with the file. Make url2file pipe your url first and then the file.
Restructure your pipeline
Maybe you don't need to find the url for the file, if you restructure your pipeline.
Index your files
If only a certain files could potentially be piped into file2url, you could precalculate a hash for all files and store it in your program together with the url. In python you would do this using a dict where the key is the file (as a string) and the value is the url. You could use pickle to write the dict object to a file and load it at the start of your program.
Then you could simply lookup the url from this dict.
You might want to research how databases or search functions in explorers handle indexing or alternative solutions.
Searching for the file
You could use one significant line of the file and use something like grep or head on linux to search all files of your computer for this line. Note that grep and head are programs, not python functions. For DOS, you might need to google the equivalent programs.
FYI: grep searches for one line of text inside a file.
head puts out the first few lines of a file. I suggest comparing only the first few lines of files to avoid searching through huge file.
Searching all files on the computer might take very long.
You could only search files with the same size as your piped input.
Use url.txt
If file2url knows the location of the file url.txt, then you could look up all files in url.txt until you find a file identical to the file that was piped into your program. You could combine this with the hashing/ indexing solution.

'file2url' receives the data via standard input (like keyboard).
The data is transferred by the kernel and it doesn't necessarily have to have any file-system representation. So if there's no file there's no URL or path to that for you to get.

Let's try to do it by obvious way:
$ cat test.py | python test.py
import sys
print ''.join(sys.stdin.readlines())
print sys.stdin.name
<stdin>
So, filename is "< stdin>" because, for the python there is no filename - only input.
Another way is a system-dependent. Find a command line, which was used, for example, but no garantee that is will be works.

How to append EOF to file using Perl or Python?

I’m trying to bulk insert data to SQL server express database. When doing bcp from Windows XP command prompt, I get the following error:
C:\temp>bcp in -T -f -S
Starting copy...
SQLState = S1000, NativeError = 0
Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-file
0 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 4391
So, there is a problem with EOF. How to append a correct EOF character to this file using Perl or Python?

EOF is End Of File. What probably occurred is that the file is not complete; the software expects data, but there is none to be had anymore.
These kinds of things happen when:
the export is interrupted (quit dump software while dumping)
while copying the dumpfile aborting the copy
disk full during dump
these kinds of things.
By the way, though EOF is usually just an end of file, there does exist an EOF character. This is used because terminal (command line) input doesn't really end like a file does, but it sometimes is necessary to pass an EOF to such a utility. I don't think it's used in real files, at least not to indicate an end of file. The file system knows perfectly well when the file has ended, it doesn't need an indicator to find that out.
EDIT shamelessly copied from a comment provided by John Machin
It can happen (uninentionally) in real files. All it needs is (1) a data-entry user to type Ctrl-Z by mistake, see nothing on the screen, type the intended Shift-Z, and keep going and (2) validation software (written by e.g. the company president's nephew) which happily accepts Ctrl-anykey in text fields and your database has a little bomb in it, just waiting for someone to produce a query to a flat file.

Unexpected EOF means that the bcp reader found an EOF when it was expecting more data. This EOF can be:
(1) the actual physical end-of-file (no more bytes to be read). This means that you have mis-formatted data. Check near the end of your file for an incomplete record.
OR
(2) on Windows, where you are, programs reading a file in text mode honour the ancient convention inherited via MS-DOS from CP/M of regarding Ctrl-Z (aka ^Z aka \'x1A' aka SUB aka SUBSTITUTE) as an end-of-file marker when reading from ANY file, not just a terminal. This includes Python -- the behaviour is determined by the C stdlib. Check for '\x1A' in your data.
Update responding to comments in a legible fashion:
In Notepad++, you can make it display unusual characters by doing View / Show Symbol / Show All Characters. You can search by doing Ctrl-F, typing \x1a in the Find What box, and selecting the Extended radio button in the Search panel.
Or you can with a little bit of Python get the line number of the first Ctrl-Z:
bytes = open('bcp.dat', 'rb').read()
zpos = bytes.find('\x1a')
# if zpos is -1, no Ctrl-Z in file
print 1 + bytes[:zpos].count('\r\n')
Where your .dat was created doesn't matter. An unintentional Ctrl-Z can happen anywhere in a file created on any operating system. It is where it is being read as a text file that matters -- Windows? Bang!

This is not a problem with missing EOF, but with EOF that is there and is not expected by bcp.
I am not a bcp tool expert, but it looks like there is some problem with format of your data files.

How to separate content from a file that is a container for binary and other forms of content

I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough but where does the gif file begin. Does the header start with 644, the blanks after the word begin or the line beginning with MITE?
Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?
I can find the lines where the graphics begin:
filerefbin=file('myfile.txt','rb')
wholeFile=filerefbin.read()
import re
graphicReg=re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics=graphicReg.finditer(wholeFile)
graphicsTags=[]
for match in locationGraphics:
graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successefully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things so when I double click on h65803h6580301.gif when it has been isolated and saved I get to see the graphic.
Interestingly, when I open the file in rb, the line endings appear to still be present even though they don't seem to have any effect in notebpad. So that is clearly one of my problems I might need to readlines and join the lines together after stripping out the \n
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file, that is since I read the whole .txt file into memory and clipped out the section that has the image I need to work with the clipped section instead of writing it out to test2.txt. I am sure that can be done its just figuring out how to do it.

What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The module uu requires the use of temporary files for encoding and decoding. You can accomplish this without resorting to temporary files by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()

You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html .
There is no example there, but all you need to do is setup do_ methods to handle the sgml tags you wish.

You need to open(filename,'rb') to open the file in binary mode. Be aware that this will cause python to give You confusing, two-byte line endings on some operating systems.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.