Modifying files containing SUB/escape characters


I am beginning to learn Python and want to use it to automate a process.
The process consists of the following steps:
(1) modify a few lines of a file
(2) use the file as the input for an executable
(3) save, move, etc.
(4) repeat
The problem is that the file I'm trying to modify was written in a language that utilizes the SUB character to run. Therefore, when I try
    with open(myFile,'r') as file:
        data = list(file)
data does not contain any information beyond the SUB character.
Therefore, I need to be able to do two things:
(1) Read the whole file in Python (without exiting prematurely at the SUB character locations) so that I can modify it.
(2) Be able to run it on the executable (that is, the SUB characters need to be back at their respective places).
Any suggestions on how to go about solving this problem?
Thanks

Open the file in binary mode.
    with open(myFile,'rb') as file:
        for line in file:
            print line

Are you on Windows? Quoted from your link to the SUB character:
In CP/M, 86-DOS, MS-DOS, PC DOS, DR-DOS and their various derivatives, character 26 was also used to indicate the end of a character stream, and thereby used to terminate user input in an interactive command line window (and as such, often used to finish console input redirection, e.g. as instigated by COPY CON: TYPEDTXT.TXT).
While no longer technically required to indicate the end of a file many text editors and program languages up to the present still support this convention...
Python 2.7 in text mode will stop at a CTRL-Z character (hex 1A), so open the file in binary mode:
Example:
    # Create a file with embedded character 1Ah
    with open('sub.txt','wb') as f:
        f.write(b'abc\x1adef')
    # Open in default (text) mode and read as much as possible
    with open('sub.txt','r') as f:
        print repr(f.read())
    # Open in binary mode
    with open('sub.txt','rb') as f:
        print repr(f.read())
Output:
'abc'
'abc\x1adef'
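The full read-modify-write cycle from the question could look roughly like the sketch below. It assumes Python 3 and keeps everything as bytes, so every SUB byte is written back exactly where it was. The helper name `replace_line`, the file name, and the sample contents are made up for illustration.

```python
import os
import tempfile

def replace_line(path, line_index, new_line):
    """Replace one line of a file, treating the contents as raw bytes."""
    with open(path, "rb") as f:
        lines = f.read().split(b"\n")
    lines[line_index] = new_line
    with open(path, "wb") as f:
        f.write(b"\n".join(lines))

# Demo on a throwaway file that contains a SUB byte (0x1A).
demo_path = os.path.join(tempfile.mkdtemp(), "sub_demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"header\x1a\nvalue=1\nfooter\n")

replace_line(demo_path, 1, b"value=2")

with open(demo_path, "rb") as f:
    result = f.read()
print(repr(result))
```

If the real file uses \r\n line endings, splitting on b"\n" leaves the \r at the end of each line, so the replacement line would need a trailing b"\r" as well.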

Related

python file error while execution

I am creating a simple python file in unix, just to open and write some test in it, but getting error while execution. Using Python 2.4.3
file = open(“testfile.txt”, “w”)
file.write(“This is a test”)
file.write(“To add more lines.”)
file.close()
Error:
./test.py: line 1: syntax error near unexpected token `('
./test.py: line 1: `file = open(“testfile.txt”, “w”)'

I believe you are using curly quotes “” (e.g. from Microsoft Word, etc..) rather than actual single and double quote chars '' "".
Make sure you are using a regular text editor, not a rich text editor. That's the problem.

The problem is that “ is not a valid quote in Python. Try copying and pasting this code into your file/terminal and you should then realise the difference.
file = open("testfile.txt", "w")
file.write("This is a test")
file.write("To add more lines.")
file.close()
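If the script has already been saved with typographic quotes, they can be converted back mechanically. The following is a minimal sketch for Python 3; the helper name `straighten_quotes` and the mapping table are illustrative, not part of any library.

```python
# Typographic quote characters mapped to their plain ASCII equivalents.
QUOTE_MAP = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}

def straighten_quotes(source):
    """Return source with curly quotes replaced by ASCII quotes."""
    return source.translate(str.maketrans(QUOTE_MAP))

broken = "file = open(\u201ctestfile.txt\u201d, \u201cw\u201d)"
fixed = straighten_quotes(broken)
print(fixed)
```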

In addition to the "smart" quotes that need to be plain ASCII " characters you need a "shebang" line as the first line of the script. Otherwise it is likely to be treated as a shell script and handed to /bin/sh for execution. You should insert this as the first line of the file:
#!/usr/bin/env python
Or run it via python ./x.py.

I think the quotes are the problem.
You could also try a context manager:
    with open('testfile.txt', 'w') as output_file:
        output_file.write("Your Text Here")
A context manager exists to manage resources properly. Opening a file consumes a resource (called a file descriptor), and this resource is limited by your OS. The with block closes the file when it exits, whether you were reading or writing.
That is to say, there is a maximum number of files a process can have open at one time.

File with "|"s in Atom editor has smiley faces printed from Python; split("|") doesn't work

I have an input file I'm trying to process with Python, which appears to have content like the following:
# This works, when run at a REPL
line = 'aababasdf|75=2|asdfa|150=17|asdfasdf'
date = line.split('|75=')[1].split('|',1)[0]
When I run the above by hand, or copy-and-paste the file's contents from Atom, it works. However, when I have Python open the file and read the line itself, it fails:
    # This fails, reading from the file from which contents were copy-and-pasted
    with open(filename) as curfile:
        for line in curfile:
            date = line.split('|75=')[1].split('|',1)[0]
This code fails with an IndexError: the split() creates only a single segment, so no [1] segment exists.
When I print the line from the file-based code, it prints smiley faces where the |s should be.
What could be going wrong here? How can I better debug this scenario?

If you're running this from the Windows console (code page 437) there are two vertical bar characters: b'\x7c' and b'\xb3'. The first is part of the ASCII character set, and the second is one of the line-drawing characters that were part of the original PC.
>>> print(b'\x7c\xb3'.decode('cp437'))
|│
In addition you appear to be using a text editor that shows b'\x01' as a vertical bar as well. That's a non-standard way of displaying that character, which is generally invisible since it's an ASCII/Unicode control character.
Once you've figured out the actual character in the file, you can substitute it in your split call.
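One way to find the actual character is to read the line as bytes and look at its repr(), which shows control characters as escapes instead of rendering them. The sketch below is for Python 3 and assumes, as guessed above, that the separator is byte 0x01; the sample line is made up to mirror the one in the question.

```python
# Assumed sample: the fields from the question, separated by 0x01 instead of "|".
raw = b"aababasdf\x0175=2\x01asdfa\x01150=17\x01asdfasdf\n"

# repr() makes invisible control characters visible as \x.. escapes.
print(repr(raw))

# Split on the real delimiter once it has been identified.
delimiter = b"\x01"
date = raw.split(delimiter + b"75=")[1].split(delimiter, 1)[0]
print(date)
```

If the repr() output shows a different byte between the fields, substitute that byte for `delimiter`.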

What does the ^@ sign in a .txt file suggest?

I was concurrently manipulating a txt file (some r/w operations) with multiple processes, and I saw traces of special signs such as ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ spread across some lines now and then. What does this suggest? Under what circumstances will these symbols appear? Does it mean some binary content was written by mistake where there should be text?
UPDATE
I read through the documentation. Some suggest it's due to a newline issue between Linux and Windows platforms, while others suggest it's because of big-endian/little-endian differences in a networked environment. The fact is I was running multiple processes on a networked filesystem and manipulating one common txt file. So I guess the encoding format might be the major reason. Can anyone suggest how to avoid this issue? I don't want to edit files (like manually doing text substitution). A clean way of producing the right file without any null characters is preferred.
UPDATE2
This is the Python pseudocode that implements my project. The fcntl.lockf call locks the shared file across the multiple machines that run multiple processes on it.
    import fcntl
    import os

    # manipulatedfile is the path of the shared file; do_sth processes one batch
    while os.path.getsize(manipulatedfile) != 0:
        with open(manipulatedfile, 'r+') as fh:
            fcntl.lockf(fh, fcntl.LOCK_EX)  # take an exclusive lock
            all_lines = fh.readlines()
            listing = all_lines[0:50]  # get the first 50 lines
            rest_lines = all_lines[50:]  # get remaining lines
            fh.seek(0)
            fh.truncate()
            fh.writelines(rest_lines)  # write remaining lines back to file
            fcntl.lockf(fh, fcntl.LOCK_UN)  # release the lock
        listing = map(lambda s: s.strip(), listing)
        do_sth(listing)
Thanks

In caret notation, ^@ is the NUL character (a zero byte, ASCII code 0).
Data containing ^@ between each ASCII character has often been converted incorrectly between a multi-byte Unicode encoding (such as UTF-16, 2 bytes per character, or UTF-32, 4 bytes per character) and ASCII (1 byte per character).
To remove the ^@ characters, run vi file.txt, type :%s/ then press Ctrl+V Ctrl+@, type //g and hit ↵ Return.
See this detailed article for more information.
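The encoding explanation is easy to check in Python 3: encoding plain ASCII text as UTF-16 or UTF-32 pads every character with zero bytes, which is what shows up as NUL characters when the file is then viewed as ASCII. The sample string below is only an illustration.

```python
text = "abc"

# Little-endian variants are used so that no byte-order mark is added.
utf16 = text.encode("utf-16-le")  # 2 bytes per character here
utf32 = text.encode("utf-32-le")  # 4 bytes per character

print(repr(utf16))
print(repr(utf32))

# Decoding with the matching codec recovers the text without any NULs.
recovered = utf16.decode("utf-16-le")
print(recovered)
```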

These are "file holes" and contain null characters. The null character (or NUL char) has an ASCII code of 0 and appears as ^@ when viewed in vi or less.
I usually see these when I am nearly out of disk space and processes are trying to write to log files.
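A hole of this kind can be reproduced in a few lines of Python 3: writing at an offset beyond the current end of a file leaves a gap that reads back as zero bytes. This is only a sketch of one way such a gap can arise (for example, a process writing at a stale offset after another process has truncated the file); whether that is what happened in the question is an assumption. The file name and offsets are arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hole_demo.txt")

# The file starts out 7 bytes long.
with open(path, "wb") as f:
    f.write(b"line 1\n")

# A later write lands at offset 20, well past the end of the file.
with open(path, "r+b") as f:
    f.seek(20)
    f.write(b"late write\n")

with open(path, "rb") as f:
    data = f.read()

# The gap between offset 7 and offset 20 reads back as NUL bytes.
nul_count = data.count(b"\x00")
print(repr(data))
print(nul_count)
```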

How to append EOF to file using Perl or Python?

I’m trying to bulk insert data to SQL server express database. When doing bcp from Windows XP command prompt, I get the following error:
C:\temp>bcp in -T -f -S
Starting copy...
SQLState = S1000, NativeError = 0
Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-file
0 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 4391
So, there is a problem with EOF. How to append a correct EOF character to this file using Perl or Python?

EOF is End Of File. What probably occurred is that the file is not complete; the software expects data, but there is none to be had anymore.
These kinds of things happen when:
- the export is interrupted (the dump software is quit while dumping)
- the copy of the dump file is aborted
- the disk fills up during the dump
- and similar situations.
By the way, though EOF is usually just an end of file, there does exist an EOF character. This is used because terminal (command line) input doesn't really end like a file does, but it sometimes is necessary to pass an EOF to such a utility. I don't think it's used in real files, at least not to indicate an end of file. The file system knows perfectly well when the file has ended, it doesn't need an indicator to find that out.
EDIT shamelessly copied from a comment provided by John Machin
It can happen (unintentionally) in real files. All it needs is (1) a data-entry user to type Ctrl-Z by mistake, see nothing on the screen, type the intended Shift-Z, and keep going and (2) validation software (written by e.g. the company president's nephew) which happily accepts Ctrl-anykey in text fields, and your database has a little bomb in it, just waiting for someone to produce a query to a flat file.

Unexpected EOF means that the bcp reader found an EOF when it was expecting more data. This EOF can be:
(1) the actual physical end-of-file (no more bytes to be read). This means that you have mis-formatted data. Check near the end of your file for an incomplete record.
OR
(2) on Windows, where you are, programs reading a file in text mode honour the ancient convention inherited via MS-DOS from CP/M of regarding Ctrl-Z (aka ^Z aka '\x1A' aka SUB aka SUBSTITUTE) as an end-of-file marker when reading from ANY file, not just a terminal. This includes Python -- the behaviour is determined by the C stdlib. Check for '\x1A' in your data.
Update responding to comments in a legible fashion:
In Notepad++, you can make it display unusual characters by doing View / Show Symbol / Show All Characters. You can search by doing Ctrl-F, typing \x1a in the Find What box, and selecting the Extended radio button in the Search panel.
Or you can with a little bit of Python get the line number of the first Ctrl-Z:
    data = open('bcp.dat', 'rb').read()
    zpos = data.find('\x1a')
    if zpos == -1:
        print 'no Ctrl-Z in file'
    else:
        # line number = 1 + number of line breaks before the first Ctrl-Z
        print 1 + data[:zpos].count('\r\n')
Where your .dat was created doesn't matter. An unintentional Ctrl-Z can happen anywhere in a file created on any operating system. It is where it is being read as a text file that matters -- Windows? Bang!
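To list every stray Ctrl-Z rather than just the first, something like the following Python 3 sketch could be used. The function name `find_ctrl_z` and the sample data are made up for illustration; in practice the bytes would come from reading the data file in 'rb' mode.

```python
def find_ctrl_z(data):
    """Return (byte offset, 1-based line number) for every 0x1A byte in data."""
    hits = []
    pos = data.find(b"\x1a")
    while pos != -1:
        # Count the newlines that occur before this position.
        line_no = 1 + data.count(b"\n", 0, pos)
        hits.append((pos, line_no))
        pos = data.find(b"\x1a", pos + 1)
    return hits

# Assumed sample: three CRLF-terminated rows with a Ctrl-Z inside the second.
sample = b"row1,a\r\nrow2,b\x1ac\r\nrow3,d\r\n"
print(find_ctrl_z(sample))
```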

This is not a problem with missing EOF, but with EOF that is there and is not expected by bcp.
I am not a bcp tool expert, but it looks like there is some problem with the format of your data files.

How to separate content from a file that is a container for binary and other forms of content

I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough, but where does the gif file begin? Does the header start with 644, the blanks after the word begin, or the line beginning with M1TE?
Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?
I can find the lines where the graphics begin:
    import re

    # read the whole container as raw bytes
    filerefbin = open('myfile.txt', 'rb')
    wholeFile = filerefbin.read()
    filerefbin.close()
    graphicReg = re.compile('<DESCRIPTION>GRAPHIC')
    locationGraphics = graphicReg.finditer(wholeFile)
    graphicsTags = []
    for match in locationGraphics:
        graphicsTags.append(match.span())  # (start, end) offsets of each tag
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successfully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things so that, once h65803h6580301.gif has been isolated and saved, I get to see the graphic when I double click on it.
Interestingly, when I open the file in rb mode, the line endings appear to still be present even though they don't seem to have any effect in Notepad. So that is clearly one of my problems: I might need to readlines and join the lines together after stripping out the \n
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file. That is, since I read the whole .txt file into memory and clipped out the section that has the image, I need to work with the clipped section instead of writing it out to test2.txt. I am sure that can be done; it's just figuring out how to do it.

What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The uu module works on files or file-like objects, so it is usually used with temporary files. You can avoid temporary files by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
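For decoding a clipped section directly in memory, another option is to feed the lines between the begin and end markers to binascii.a2b_uu one at a time. The sketch below assumes Python 3 and a section held as bytes; the function name `decode_uu_block` and the demo payload are made up for illustration.

```python
import binascii

def decode_uu_block(block):
    """Decode one uuencoded section (bytes); return (filename, payload)."""
    lines = block.splitlines()
    # The header line has the form: begin <mode> <filename>
    start = next(i for i, ln in enumerate(lines) if ln.startswith(b"begin "))
    _begin, _mode, name = lines[start].split(None, 2)
    payload = bytearray()
    for ln in lines[start + 1:]:
        if ln.strip() == b"end":
            break
        if ln:  # skip blank lines
            payload += binascii.a2b_uu(ln)
    return name.decode("ascii"), bytes(payload)

# Build a small demo section: uuencode 100 bytes in 45-byte chunks.
original = bytes(range(100))
body = b"".join(
    binascii.b2a_uu(original[i:i + 45]) for i in range(0, len(original), 45)
)
section = b"<TEXT>\nbegin 644 demo.gif\n" + body + b"`\nend\n"

name, decoded = decode_uu_block(section)
print(name, len(decoded))
```

The returned payload can then be written with open(name, 'wb') to produce the image file.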

You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html .
There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you wish.

You need to open(filename,'rb') to open the file in binary mode. Be aware that in binary mode Python performs no newline translation, so on some operating systems (notably Windows) you will see two-byte '\r\n' line endings.
