Reading path names from a file in Python under Windows - python

I have a Python script that reads a list of path names from a file and opens them using the gzip module. It works well under Linux, but when I ran it under Windows, I got an error when calling the gzip.open function. The error message is as follows:
  File "C:\dev_tools\Python27\lib\gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "C:\dev_tools\Python27\lib\gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
The filename should be something like
'G:\ext_pt1\cfx33_50instr4_testset\cfx33_50instr4_0-99\cfx33_50instr4_cov\cfx33_50instr4_id0_cov\cfx33_50instr4_id0.detail.rpt.gz'
But when I printed the filename, it printed out something like
' ■G : \ e x t _ p t 1 \ c f x 3 3 _ 5 0 i n s t r 4 _ t e s t s e t \
c f x 3 3 _ 5 0 i n s t r 4 _ 0 - 9 9 \ c f x 3 3 _ 5 0 i n s t r 4 _
c o v \ c f x 3 3 _ 5 0 i n s t r 4 _ i d 0 _ c o v \ c f x 3 3 _ 5 0
i n s t r 4 _ i d 0 . d e t a i l . r p t . g z'
And when I printed repr(filename), it printed out something like
'\xff\xfeG\x00:\x00\\x00e\x00x\x00t\x00_\x00p\x00t\x001\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00t\x00e\x00s\x00t\x00s\x00e\x00t\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00\x00t\x
00r\x004\x00_\x000\x00-\x009\x009\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00\x00t\x00r\x004\x00_\x00c\x00o\x00v\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00i\x00d\x000\x00_\x00c\x00o\x00v\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00i\x00d\x000\x00.\x00d\x00e\x00t\x00a\x00i\x00l\x00.\x00r\x00p\x00t\x00.\x00g\x00z\x00'
I don't know why Python added those spaces (possibly the NULL bytes?) when it read the file. Does anyone have any clue?

Python has not added anything; it has merely read what is in the file. You have a little-endian UTF-16 string there, as you can plainly tell by the byte-order mark in the first two bytes. If you are not expecting this, you could convert it to ASCII (assuming it doesn't have any non-ASCII characters).
# convert mystring from little-endian UTF-16 with optional BOM to ASCII
mystring = unicode(mystring, encoding="utf-16le").encode("ascii", "ignore")
Or just convert it to proper Unicode and use it that way, if Windows will tolerate it:
mystring = unicode(mystring, encoding="utf-16le").lstrip(u"\ufeff")
Above, I have manually specified the byte order and then stripped off the BOM, rather than specifying "utf-16" as the encoding and letting Python figure out the byte order. This is because the BOM is going to be found once at the beginning of the file, not at the beginning of each line, so if you are converting the lines to Unicode one at a time, you won't have a BOM most of the time.
However, it might make more sense to go back to the source of that file and figure out why it's being saved in little-endian UTF-16 if you expected ASCII. Is the file generated the same way on Linux and Windows, for instance? Has it been touched by a text editor that defaults to saving as Unicode? Etc.
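If the whole file is decoded in one go rather than line by line, the plain "utf-16" codec can consume the BOM itself. The following is a minimal sketch under that assumption; read_path_list, the temporary demo file, and the shortened paths are hypothetical and only illustrate the idea:

```python
import io
import os
import tempfile


def read_path_list(list_file, encoding="utf-16"):
    """Return the non-empty lines of a text file as a list of paths."""
    # "utf-16" reads the BOM once at the start of the file and
    # takes the byte order from it, so no manual stripping is needed.
    with io.open(list_file, "r", encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


# Build a small UTF-16 file (with BOM) to stand in for the real path list.
demo_file = os.path.join(tempfile.mkdtemp(), "paths.txt")
with io.open(demo_file, "w", encoding="utf-16") as handle:
    handle.write(u"G:\\ext_pt1\\first.rpt.gz\nG:\\ext_pt1\\second.rpt.gz\n")

paths = read_path_list(demo_file)
print(paths)
```

Each returned entry is a text string without NUL bytes or a BOM, so it could then be handed to gzip.open.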

It seems that there is a problem with the encoding of your file. The file name printed in your question does not consist of normal characters. Did you save your path-list file in Unicode format?

I had the same problem. I replaced \ with / and it was OK. I just wanted to remind you of this possibility before you go into more advanced remedies.

Related

Create an ISO9660 compliant filename using pycdlib

I'm trying to implement the pycdlib example-creating-new-basic-iso example shown below. About halfway down there is a line that reads iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.;1'). This writes a new file to the ISO that will be named "FOO" in the root directory of the ISO. This example works for me.
Building on the example, I'm trying to change the filename inside the iso from "/FOO", to "/FOO.txt" but I keep getting the error, PyCdlibInvalidInput: ISO9660 filenames must consist of characters A-Z, 0-9, and _. How do I write an ISO9660 compliant filename with pycdlib with ".txt" in it?
Example code:
try:
    from cStringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO
import pycdlib
iso = pycdlib.PyCdlib()
iso.new()
foostr = b'foo\n'
iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.;1')
iso.add_directory('/DIR1')
iso.write('new.iso')
iso.close()
The key here is in the error: PyCdlibInvalidInput: ISO9660 filenames must consist of characters A-Z, 0-9, and _, but there is a more complete [explanation](https://wiki.osdev.org/ISO_9660#Filenames):
d-characters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _
Filenames must use d-character encoding (strD), plus dot and semicolon which have to occur exactly once per filename. Filenames are composed of a File Name, a dot, a File Name Extension, a semicolon; and a version number in decimal digits. The latter two are usually not displayed to the user.
There are three Levels of Interchange defined. Level 1 allows filenames with a File Name length of 8 and an extension length of 3 (like MS-DOS). Levels 2 and 3 allow File Name and File Name Extension to have a combined length of up to 30 characters.
The ECMA-119 Directory Record format can hold composed names of up to 222 characters. This would violate the specs but must nevertheless be handled by a reader of the filesystem.
You can't name the file FOO.txt because lowercase letters aren't included in the d-characters. You need to capitalize the extension in order to be ISO9660-compliant.
iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.TXT;1')
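The naming rules quoted above can be checked before the name is passed to pycdlib. This is a rough sketch of a Level 1 (8.3) check only; iso9660_level1_name is a hypothetical helper, not part of pycdlib, and it assumes the file goes into the root directory:

```python
import re

# d-characters allowed by ISO9660: A-Z, 0-9 and underscore.
D_CHARS = re.compile(r"^[A-Z0-9_]*$")


def iso9660_level1_name(filename, version=1):
    """Build a root-level ISO9660 Level 1 path such as '/FOO.TXT;1'."""
    # Upper-case first, then split at the first dot into name and extension.
    name, _, ext = filename.upper().partition(".")
    if not (name or ext):
        raise ValueError("empty filename")
    if not D_CHARS.match(name) or not D_CHARS.match(ext):
        raise ValueError("only A-Z, 0-9 and _ are allowed: %r" % filename)
    if len(name) > 8 or len(ext) > 3:
        raise ValueError("Level 1 allows at most 8.3 characters: %r" % filename)
    # The dot and the ';version' suffix are always present, even with no extension.
    return "/{}.{};{}".format(name, ext, version)


print(iso9660_level1_name("foo.txt"))
print(iso9660_level1_name("foo"))
```

A name with a hyphen or a second dot, such as "my-file.txt" or "foo.tar.gz", raises ValueError in this sketch.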

How to print text file content with line breaks in python?

The content of my text file is:
5 7 6 6 15
4 3
When I do
fs.open('path',mode='rb').read()
I get
b'5 7 6 6 15\r\n4 3'
But because I want it to compare to string output
5 7 6 6 15
4 3
I want to do this comparison like :
if fs.open('path',mode='rb').read() == output:
    print("yes")
How should I convert it so that line breaks, spaces, and everything else are preserved?
PS: output is just the string that I am getting through json.
Using Python 3, fs.open('path',mode='rb').read() yields a bytes object, moreover containing a carriage return (windows text file)
(and using Python 2 doesn't help, because of this extra \r which isn't removed because of binary mode)
You're comparing a bytes object with a str object: that is always false.
Moreover, it's unclear whether the output string has a line terminator on its last line. I would open the file in text mode and strip blanks/newlines at the end (the file doesn't seem to contain one, but better safe than sorry):
with open('path') as f:
    if f.read().rstrip() == output.rstrip():
        print("yes")
Change the read mode from rb to r: rb gives back binary, r puts out text.
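To make the bytes-versus-str point concrete, here is a small sketch that uses the byte string from the question in place of the real file. The same_text helper is hypothetical, and UTF-8 is an assumed encoding:

```python
raw = b"5 7 6 6 15\r\n4 3"       # what read() returns in 'rb' mode
output = "5 7 6 6 15\n4 3"       # the string to compare against


def same_text(raw_bytes, expected, encoding="utf-8"):
    """Compare file bytes with a str, ignoring CRLF vs LF and trailing blanks."""
    text = raw_bytes.decode(encoding).replace("\r\n", "\n")
    return text.rstrip() == expected.rstrip()


print(raw == output)            # bytes and str never compare equal in Python 3
print(same_text(raw, output))   # the decoded, normalized text does match
```

Opening the file in text mode, as suggested above, performs the decoding and newline translation automatically.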

Not able to read rpt file using Python 3

I am trying to read a .rpt file using the python code:
>>> with open(r'C:\Users\lenovo-pc\Desktop\training2.rpt','r',encoding = 'utf-8', errors = 'replace') as d:
...     count = 0
...     for i in d.readlines():
...         count = count + 1
...         print(i+"\n")
...
...
u
i
d
|
e
x
p
i
d
|
n
a
m
e
|
d
o
m
a
i
n
And I am getting the following result as mentioned above.
Kindly, let me know how I can read the .rpt file using python3.
This is, indeed, strange behavior. While I cannot easily reproduce the error without knowing the format of the .rpt file, here are some hints about what might be going wrong. I assume it looks something like this:
uid|expid|name|domain
...
Which can be read and printed with the following code:
with open(r'C:\Users\lenovo-pc\Desktop\training2.rpt','r',encoding = 'utf-8', errors = 'replace') as rfile:
    count = 0
    for line in rfile:
        count += 1
        print(line.strip()) # this removes white spaces, line breaks etc.
However, the problem seems to be that you iterate over the string of the first line in your file instead of over the lines in the file. That would produce the pattern you see, as the print() function adds a line break (in addition to the one you add manually). This leaves you with one character per line (followed by two line breaks).
>>> for i in "foo":
...     print(i+"\n")
f
o
o
Make sure you did not reuse variable names from earlier in the session and do not overwrite the file object.

Ignore newline character in binary file with Python?

I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data 4 bytes at a time (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n, which is used as the first byte of the next int I read. I don't want to simply use .splitlines() because some data could have a \n inside and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just mis-interpreted values. This can happen when writing 10 to the file for example.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
...     d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
...     diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")
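The earlier point that the \n may simply be data can be illustrated with a short sketch: the integer 10 packs to a byte sequence that starts with \n, and reading fixed 4-byte chunks recovers it without any newline handling. The sample values, the read_int32s helper, and the explicit little-endian format "<i" are assumptions for illustration; the question uses native-order "i".

```python
import io
import struct

# Three sample 32-bit integers; 10 is the one whose first byte is b"\n".
values = [9, 10, 11]
blob = b"".join(struct.pack("<i", v) for v in values)


def read_int32s(stream):
    """Read consecutive little-endian 32-bit ints until the stream runs out."""
    result = []
    while True:
        chunk = stream.read(4)
        if len(chunk) < 4:      # end of data (or a truncated trailing record)
            break
        result.append(struct.unpack("<i", chunk)[0])
    return result


print(repr(struct.pack("<i", 10)))
decoded = read_int32s(io.BytesIO(blob))
print(decoded)
```

In this reading, f.readline() is the call that misleads, because it splits the binary stream wherever a byte with value 10 happens to occur.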

Facing issue while listing file contents in cygwin

Context: I want to install a ".msi" file on a remote Windows machine via a Python script.
I have installed Cygwin on the remote Windows machine, and the ssh service is running. I execute commands on the remote Windows machine over ssh from a Linux host using a Python script. To install the msi file I used the command below:
msiexec /package "msi file name" /quiet /norestart /log "log file name (say instlog.log)"
Now, to verify that the installation succeeded, I list the contents of the log file (instlog.log) and check for the string "Installation success or error status: 0".
Problem:
The "type" command does not work in Cygwin, so I tried "cd {0}; cat {1} | tail -5".format(FileLocation, FileName) to list the file contents. However, the output comes back in a different format, and the Python script is unable to match the string above in it. This is what I want to display on the console:
MSI (s) (64:74) [18:03:51:360]: Windows Installer installed the product. Product Name: pkg-name. Product Version: 0.2.24-10891. Product Language: 1033. Manufacturer: XYZ Company. Installation success or error status: 0.
And what I am actually getting is:
M S I ( s ) ( 6 4 : 7 4 ) [ 1 8 : 0 3 : 5 1 : 3 6 0 ] : W i n d o w s I n s t a l l e r i n s t a l l e d t h e p r o d u c t . P r o d u c t N a m e : p k g - n a m e . P r o d u c t V e r s i o n : 0 . 2 . 2 4 - 1 0 8 9 1 . P r o d u c t L a n g u a g e : 1 0 3 3 . M a n u f a c t u r e r : X Y Z C o m p a n y . I n s t a l l a t i o n s u c c e s s o r e r r o r s t a t u s : 0 .
So somehow an extra space is introduced after each character in the output. How can I get the output in the normal format rather than space-separated? Thank you.
The problem is that msiexec saved its log file in Unicode format, which on Windows usually means little-endian UTF-16. In UTF-16, each of these characters is stored as a 2-byte code unit; for ASCII characters one byte holds the ASCII code and the other byte is 0 (\0, \x00, or NUL). That is why a NUL byte follows every character, and tools that treat the file as single-byte text show it as a gap. Some popular editors are smart enough to detect the encoding and display only the characters (leaving the interleaved NUL bytes aside). There are several ways to deal with this.
Upgrade cygwin. On my computer (I also have Cygwin installed) I don't experience this problem (my Cygwin is using: GNU coreutils 8.15 - this can be seen for example by typing tail --version). Here are some outputs (I included the hexdump at the end to show you that the file is in unicode format):
cat unicode.txt
yields: unicode chars
tail unicode.txt
yields: unicode chars
hexdump unicode.txt
yields:
0000000 0075 006e 0069 0063 006f 0064 0065 0020
0000010 0063 0068 0061 0072 0073 000d 000a
000001e
Convert the msiexec logs to ASCII format. I am not aware of any native tool that does that, but you can search for a Unicode-to-ASCII converter and download such a tool. As I mentioned earlier, there are also editors that understand Unicode; one that I've already tried and that can convert files from Unicode to ASCII is TextPad. Or you can write the tool yourself.
If you're reading the msi log file from python you could handle the unicode files from the script. I assume that you have some code that reads the file contents like (!!! I didn't include any exception handling !!!):
f = open("some_msi_log_file.log", "rb")
text = f.read()
f.close()
and you're doing the processing on text. If you modify the code above to:
f = open("some_msi_log_file.log", "rb")
unicode_text = f.read()
f.close()
text = "".join([char for char in unicode_text if char != '\x00'])
text won't contain the \x00s anymore (and will also work with regular ASCII files).
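An alternative to filtering out the NUL bytes is to decode the log properly, which also removes the byte-order mark. The following Python 3 sketch assumes the log starts with a UTF-16 BOM and falls back to UTF-8 otherwise; log_reports_success and the sample bytes are hypothetical stand-ins for the real log:

```python
import codecs

SUCCESS_MARKER = "Installation success or error status: 0"


def log_reports_success(raw_bytes):
    """Decode raw log bytes and look for the msiexec success marker."""
    if raw_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and uses it to pick the byte order.
        text = raw_bytes.decode("utf-16")
    else:
        # Assumption: anything without a UTF-16 BOM is treated as UTF-8 or ASCII.
        text = raw_bytes.decode("utf-8", errors="replace")
    return SUCCESS_MARKER in text


# Sample bytes standing in for the tail of a real log; encode("utf-16") adds a BOM.
sample = "Product Language: 1033. Installation success or error status: 0.\r\n".encode("utf-16")
print(log_reports_success(sample))
```

With a real log, the bytes would come from open(..., "rb").read() as in the snippet above.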
The log file should be converted to an 8-bit format like UTF-8. This can be done with the iconv command. Install it with the Cygwin installer, and then use the following command:
iconv -f ucs2 -t utf8 instlog.log > instlog2.log
