Problems with cross-platform tell / seek in Python

Having a weird bug with Python 2.7.3 file reading. If I do this sort of thing:
end_of_header = f.tell()
print f.readline()
f.seek(end_of_header)
print f.readline()
the results are different. The file was written on Linux or Mac (not sure which) and I'm trying to run the script on Windows 7. If I run it on Linux it works. I have tried opening the file with both the 'b' and 'U' mode flags and it's not working. I have also tried various encodings by opening the file with the codecs module.
Is the readline() causing the problem?
Some context: there is a header, after which there is a long trajectory (the file can be in the GB range). I need to read the header and process it, then read the rest of the file one line at a time. I may need to seek back to the start of the data (the end of the header) at any time, though.

Since you mention Windows versus Linux/Mac, I think you have a problem of different newlines (http://www.editpadpro.com/tricklinebreak.html) used by the operating system the file was written on and the one it is read on. The problem arises because you opened the file in a non-binary mode: on Windows, text mode translates '\r\n' to '\n' as it reads, so the offsets reported by tell() don't correspond to what readline() actually consumed.
Try opening the file in binary mode, that is, with 'rb' or 'rb+' or 'ab' or 'ab+', according to what you want to do.
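A minimal sketch of that approach with made-up file contents: in binary mode, tell() returns exact byte offsets, so a seek() back to a saved position reproduces the same line on any platform.

```python
import os
import tempfile

# Hypothetical data file: a commented header, then data rows,
# written with Windows '\r\n' line endings.
path = os.path.join(tempfile.mkdtemp(), "trajectory.txt")
with open(path, "wb") as f:
    f.write(b"# header 1\r\n# header 2\r\n1.0 2.0 3.0\r\n4.0 5.0 6.0\r\n")

with open(path, "rb") as f:  # binary mode: tell() is a real byte offset
    while True:
        pos = f.tell()
        line = f.readline()
        if not line.startswith(b"#"):
            break  # 'line' is now the first data row
    end_of_header = pos

    f.seek(end_of_header)  # jump back to the start of the data
    again = f.readline()
    print(again == line)  # True: the saved offset round-trips exactly
```

The cost is that you get raw bytes back (including the '\r' characters), so any line processing has to strip the terminators itself.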

Related

Reading a CSV file to pandas works in windows, not in ubuntu

I have written a script in Python on Windows and want to run it on my Raspberry Pi with Ubuntu.
I am reading a csv file whose line separator is a newline. When I load the df I use the following code:
dfaux = pd.read_csv(r'/home/ubuntu/Downloads/data.csv', sep=';')
which loads a df with just one row. I have also tried including the argument lineterminator = '\n\t' which throws this error message:
ValueError: Only length-1 line terminators supported
In Windows I see the line breaks in the csv file, whereas when I open it with Mousepad in Ubuntu I don't see the line breaks, but I do see the columns color-coded.
How could I read properly the csv?
Thanks!
Well, in the end I solved it by changing the browser I was using to download the csv file with webdriver from Firefox to Chrome. Not sure what the reason behind it is, but maybe this will help if you have the same issue in the future.
This is almost certainly an issue with the difference in line endings between Windows and... well, everything else. Windows uses a two-character line terminator, "\r\n" (carriage return, followed by newline), whereas Linux and Mac and everything else use just "\n".
Two easy fixes:
Using read_csv(..., engine='python') should remedy the issue. You may also need to specify read_csv(..., lineterminator='\r\n'), but based on the error message you're getting, it looks like it's auto-detecting that anyway. (Function docs)
Fix the file before sending it to Pandas. Something like:
import io
import pandas as pd

# Normalize Windows '\r\n' line endings to '\n' before parsing;
# sep=';' matches the file's semicolon delimiter.
csv_data = open(r'/home/ubuntu/Downloads/data.csv').read().replace('\r\n', '\n')
dfaux = pd.read_csv(io.StringIO(csv_data), sep=';')
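The first fix can be sketched as follows (assuming pandas is installed; the sample file is invented). The data uses bare '\r' row separators, classic-Mac style line endings that would explain Mousepad showing no line breaks; the Python parsing engine opens the file with universal newlines, so it recovers the rows:

```python
import os
import tempfile

import pandas as pd

# Hypothetical semicolon-separated file whose rows end in a lone '\r'.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "wb") as f:
    f.write(b"a;b\r1;2\r3;4\r")

# engine='python' reads through Python's universal-newline text layer,
# so '\r' is treated as a line break and the rows are split correctly.
df = pd.read_csv(path, sep=";", engine="python")
print(df.shape)  # (2, 2): two data rows, two columns
```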

Why does setting the "hidden file" attribute on a file make it read only for Python (3.5) on Windows 7 [duplicate]

I want to replace the contents of a hidden file, so I attempted to open it in w mode so it would be erased/truncated:
>>> import os
>>> ini_path = '.picasa.ini'
>>> os.path.exists(ini_path)
True
>>> os.access(ini_path, os.W_OK)
True
>>> ini_handle = open(ini_path, 'w')
But this resulted in a traceback:
IOError: [Errno 13] Permission denied: '.picasa.ini'
However, I was able to achieve the intended result with r+ mode:
>>> ini_handle = open(ini_path, 'r+')
>>> ini_handle.truncate()
>>> ini_handle.write(ini_new)
>>> ini_handle.close()
Q. What is the difference between the w and r+ modes, such that one has "permission denied" but the other works fine?
UPDATE: I am on win7 x64 using Python 2.6.6, and the target file has its hidden attribute set. When I tried turning off the hidden attribute, w mode succeeds. But when I turn it back on, it fails again.
Q. Why does w mode fail on hidden files? Is this known behaviour?
It's just how the Win32 API works. Under the hood, Python's open function is calling the CreateFile function, and if that fails, it translates the Windows error code into a Python IOError.
The r+ open mode corresponds to a dwAccessMode of GENERIC_READ|GENERIC_WRITE and a dwCreationDisposition of OPEN_EXISTING. The w open mode corresponds to a dwAccessMode of GENERIC_WRITE and a dwCreationDisposition of CREATE_ALWAYS.
If you carefully read the remarks in the CreateFile documentation, it says this:
If CREATE_ALWAYS and FILE_ATTRIBUTE_NORMAL are specified, CreateFile fails and sets the last error to ERROR_ACCESS_DENIED if the file exists and has the FILE_ATTRIBUTE_HIDDEN or FILE_ATTRIBUTE_SYSTEM attribute. To avoid the error, specify the same attributes as the existing file.
So if you were calling CreateFile directly from C code, the solution would be to add in FILE_ATTRIBUTE_HIDDEN to the dwFlagsAndAttributes parameter (instead of just FILE_ATTRIBUTE_NORMAL). However, since there's no option in the Python API to tell it to pass in that flag, you'll just have to work around it by either using a different open mode or making the file non-hidden.
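A small cross-platform helper built on that workaround (replace_contents is an invented name, not a library function): use 'r+' when the file already exists, so Python never issues CREATE_ALWAYS against a hidden file, and fall back to 'w' only when creating a new one.

```python
import os
import tempfile

def replace_contents(path, new_text):
    """Overwrite path without tripping the hidden-attribute restriction
    on Windows: 'r+' maps to OPEN_EXISTING, which is allowed on hidden
    files, while 'w' maps to CREATE_ALWAYS, which is not."""
    mode = "r+" if os.path.exists(path) else "w"
    with open(path, mode) as fh:
        fh.write(new_text)
        fh.truncate()  # drop any leftover tail of the old, longer contents

# usage sketch with a made-up ini file
p = os.path.join(tempfile.mkdtemp(), ".picasa.ini")
replace_contents(p, "[Picasa]\nname=old-and-longer\n")
replace_contents(p, "[Picasa]\nname=new\n")
print(open(p).read())
```

The truncate() call after the write matters: without it, replacing a long file with shorter text would leave the old tail in place.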
Here are the detailed differences:

``r''   Open text file for reading. The stream is positioned at the beginning of the file.

``r+''  Open for reading and writing. The stream is positioned at the beginning of the file.

``w''   Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.

``w+''  Open for reading and writing. The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file.

``a''   Open for writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.

``a+''  Open for reading and writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.
From the Python documentation - http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files:
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
So if you use w mode, you are actually trying to create a file, and you may not have permission to do so. r+ is the appropriate choice.
If you do not yet know whether your .picasa.ini exists, your Windows user has file-creation permissions in that directory, and you want to append new information instead of overwriting from the beginning of the file, then a+ is the appropriate choice.
It has nothing to do with whether your file is hidden or not.
Thanks for this thread; I had the same issue today. My workaround is as follows. Works with Python 3.7
import os
import json

GuiPanelDefaultsFileName = 'panelDefaults.json'
GuiPanelValues = {
    '-FileName-'      : os.getcwd() + '\\_AcMovement.xlsx',
    '-DraftEmail-'    : True,
    '-MonthComboBox-' : 'Jun',
    '-YearComboBox-'  : '2020'
}

# Unhide the file via the OS so 'w' mode can open it
if os.path.isfile(GuiPanelDefaultsFileName):
    os.system(f'attrib -h {GuiPanelDefaultsFileName}')

# Write dict values to json
with open(GuiPanelDefaultsFileName, 'w') as fp:
    json.dump(GuiPanelValues, fp, indent=4)

# Make it hidden again
os.system(f'attrib +h {GuiPanelDefaultsFileName}')

Modifying files containing SUB/escape characters

I am beginning to learn Python and want to use it to automate a process.
The process consists in
modifying a few lines of a file
use the file as the input for an executable
save, move, etc
repeat
The problem is that the file I'm trying to modify was written in a language that utilizes the SUB character to run. Therefore, when I try
with open(myFile, 'r') as file:
    data = list(file)
data does not contain any information beyond the SUB character.
Therefore, I need to be able to do two things:
Read the whole file in python (without exiting prematurely at the SUB character locations) so that I can modify it.
Be able to run it on the executable (that is, the SUB characters need to be back at their respective places).
Any suggestions on how to go about solving this problem?
Thanks
Use binary mode to open the file.
with open(myFile, 'rb') as file:
    for line in file:
        print line
Are you on Windows? Quoted from your link to the SUB character:
In CP/M, 86-DOS, MS-DOS, PC DOS, DR-DOS and their various derivatives, character 26 was also used to indicate the end of a character stream, and thereby used to terminate user input in an interactive command line window (and as such, often used to finish console input redirection, e.g. as instigated by COPY CON: TYPEDTXT.TXT).
While no longer technically required to indicate the end of a file many text editors and program languages up to the present still support this convention...
Python 2.7 in text mode will stop at a CTRL-Z character (hex 1A), so open the file in binary mode:
Example:
# Create a file with an embedded character 1Ah
with open('sub.txt', 'wb') as f:
    f.write(b'abc\x1adef')

# Open in default (text) mode and read as much as possible
with open('sub.txt', 'r') as f:
    print repr(f.read())

# Open in binary mode
with open('sub.txt', 'rb') as f:
    print repr(f.read())
Output:
'abc'
'abc\x1adef'
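To cover the asker's full round trip (read everything, edit it, write it back with the SUB characters intact), binary mode works end to end; a Python 3 sketch with invented file contents:

```python
import os
import tempfile

# Hypothetical input file whose fields are separated by SUB (0x1A) bytes.
path = os.path.join(tempfile.mkdtemp(), "input.dat")
with open(path, "wb") as f:
    f.write(b"KEEP\x1aparam=1\x1aEND")

# Read the whole file, SUB bytes included; text mode would stop early.
with open(path, "rb") as f:
    data = f.read()

# Modify it as bytes; the \x1a separators are preserved untouched.
data = data.replace(b"param=1", b"param=2")

with open(path, "wb") as f:
    f.write(data)

print(open(path, "rb").read())  # b'KEEP\x1aparam=2\x1aEND'
```

Because nothing is ever decoded to text, the file handed to the executable is byte-for-byte identical apart from the intended edit.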

Do different OS's affect the md5 checksum of a file in Python?

So I have a Python script that uses the pyserial library to send a file over serial to another computer. I wrote some code to calculate the md5 checksum of the file before and after it is sent over serial, and I have encountered some problems.
Example:
I sent a simple file named third.txt containing a list of numbers 1 through 10. Simple file, nothing fancy or large. The checksum of the file before transmitting is completely different than the checksum of the file after transmitting on the other computer, even though the files are clearly the same.
I checked to see if there was something wrong with my code by simply moving the file over on a USB drive and doing the checksum calculations that way. This time it worked.
Any ideas why this is happening and how I might possibly fix it?
Here is my checksum code before sending. This is not the exact code, but basically what I did.
<<Code that waits for command from client>>
with open(file_loc) as file_to_read:
    data = file_to_read.read()
md5a = hashlib.md5(data).hexdigest()
ser.write('\n' + md5a + '\n')
Here is my checksum code after sending.
with open(file_loc) as file_to_read:
    data = file_to_read.read()
md5b = hashlib.md5(data).hexdigest()
print('Sending Checksum Command')
ser.write("\n<<SENDCHECKSUM>>\n")
md5a = ser.readline()
print(md5a)
print(md5b)
if md5a == md5b:
    print("Correct File Transmission")
else:
    print("The checksum indicated incorrect file transmission, please check.")
ser.flush()
Yes, opening a file in text mode potentially can result in different data being read as newlines are translated for you from the platform native format to \n. Thus, files containing \r\n will give you a different checksum when read on Windows vs. a POSIX platform.
Open files in binary mode instead:
with open(file_loc, 'rb') as file_to_read:
Note that the same applies when writing a file. If you receive data from a POSIX system using \n line endings, and you write this to a file opened for writing in text mode on Windows, you'll end up with \r\n line endings in the written file.
If you are using Python 3, you are complicating matters some more. When you are opening files in text mode, you are translating the data from encoded bytes to decoded Unicode values. What codec is used for that can also differ from OS to OS, and even from machine to machine. The default is locale-defined (using locale.getpreferredencoding(False)), and as long as the data is decodable by the default locale, you can get very different results from reading a file using a different codec. You really want to ensure you use the same codec by setting it explicitly, or better still, open files in binary mode.
Since hashlib requires you to feed it byte strings, this is less of a problem when calculating the digest (you would have run into that problem already and at least had to think about codecs there), but it applies to file transfers too; writing to a text file will encode the data with the default codec.
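A sketch of the binary-mode hashing suggested above (file_md5 and the sample file are invented for illustration; reading in chunks also keeps memory use flat for large files):

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=64 * 1024):
    """Hash a file's raw bytes; binary mode guarantees the same digest
    on Windows and POSIX, because no newline translation or codec
    decoding ever touches the data."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# usage sketch: a small file with Windows line endings
path = os.path.join(tempfile.mkdtemp(), "third.txt")
with open(path, "wb") as f:
    f.write(b"1\r\n2\r\n3\r\n")
print(file_md5(path))
```

Run on both ends of the serial link, this digests the identical byte stream regardless of which OS wrote the file.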

opening and writing a large binary file python

I have a homebrew web based file system that allows users to download their files as zips; however, I found an issue while dev'ing on my local box not present on the production system.
In linux this is a non-issue (the local dev box is a windows system).
I have the following code
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
temp_file = open(temp_file_path, 'wb+')
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    temp_file.write(dec_data)
    data = file.read(settings.READ_SIZE)
# Takes a dump right here!
# error in cipher operation (wrong final block length)
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
The above code opens a file, and (using the key for the current file share) decrypts the file and writes it to a temporary location (that will later be stuffed into a zip file).
My issue is with the file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb') line. Since Windows cares a great deal about binary files, if I don't specify 'rb' the file will not read to the end in the data read loop; however, for some reason, since I am also writing to temp_file, it never completely reads to the end of the file... UNLESS I add a + after the b: 'rb+'.
If I change the code to file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb+'), everything works as desired and the code successfully scrapes the entire binary file and decrypts it. If I do not add the plus, it fails and cannot read the entire file...
Another section of the code (for downloading individual files) reads (and works flawlessly no matter the OS):
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
filename = smart_str(cur_file.name, errors='replace')
response = HttpResponse(mimetype='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    response.write(dec_data)
    data = file.read(settings.READ_SIZE)
# no dumps to be taken when finishing up the decrypt process...
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
Clarification
The cipher error is likely because the file was not read in its entirety. For example, I have a 500MB file I am reading in at 64*1024 bytes at a time. I read until I receive no more bytes; when I don't specify b on Windows, it cycles through the loop twice and returns some garbage data (because Python thinks it is interacting with a text file, not a binary file).
When I specify b, it takes 10-15 seconds to completely read in the file, but it does so successfully, and the code completes normally.
When I am concurrently writing to another file as I read in from the source file (as in the first example), if I do not specify rb+ it displays the same behavior as not specifying b at all: it only reads a couple of segments from the file before closing the handle and moving on, and I end up with an incomplete file and the decryption fails.
I'm going to take a guess here:
You have some other program that's continually replacing the files you're trying to read.
On linux, this other program works by atomically replacing the file (that is, writing to a temporary file, then moving the temporary file to the path). So, when you open a file, you get the version from 8 seconds ago. A few seconds later, someone comes along and unlinks it from the directory, but that doesn't affect your file handle in any way, so you can read the entire file at your leisure.
On Windows, there is no such thing as atomic replacement. There are a variety of ways to work around that problem, but what many people do is to just rewrite the file in-place. So, when you open a file, you get the version from 8 seconds ago, start reading it… and then suddenly someone else blanks the file to rewrite it. That does affect your file handle, because they've rewritten the same file. So you hit an EOF.
Opening the file in r+ mode doesn't do anything to solve the problem, but it adds a new problem that hides it: You're opening the file with sharing settings that prevent the other program from rewriting the file. So, now the other program is failing, meaning nobody is interfering with this one, meaning this one appears to work.
In fact, it could be even more subtle and annoying than this. Later versions of Windows try to be smart. If I try to open a file while someone else has it locked, instead of failing immediately, it may wait a short time and try again. The rules for exactly how this works depend on the sharing and access you need, and aren't really documented anywhere. And effectively, whenever it works the way you want, it means you're relying on a race condition. That's fine for interactive stuff like dragging a file from Explorer to Notepad (better to succeed 99% of the time instead of 10% of the time), but obviously not acceptable for code that's trying to work reliably (where succeeding 99% of the time just means the problem is harder to debug). So it could easily work differently between r and r+ modes for reasons you will never be able to completely figure out, and wouldn't want to rely on if you could…
Anyway, if any variation of this is your problem, you need to fix that other program, the one that rewrites the file, or possibly both programs in cooperation, to properly simulate atomic file replacement on Windows. There's nothing you can do from just this program to solve it.*
* Well, you could do things like optimistic check-read-check and start over whenever the modtime changes unexpectedly, or use the filesystem notification APIs, or… But it would be much more complicated than fixing it in the right place.
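For completeness, a common way to approximate atomic replacement from Python 3.3+ is to write a temporary file in the same directory and swap it in with os.replace (which uses MoveFileEx with MOVEFILE_REPLACE_EXISTING on Windows); rewrite_file here is an invented helper, not part of any library:

```python
import os
import tempfile

def rewrite_file(path, new_bytes):
    """Write to a temp file in the same directory, then swap it into
    place with a single rename, so readers see either the old file or
    the complete new one, never a half-written file."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_bytes)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes hit the disk first
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise

# usage sketch
target = os.path.join(tempfile.mkdtemp(), "shared.bin")
rewrite_file(target, b"version 1")
rewrite_file(target, b"version 2")
print(open(target, "rb").read())  # b'version 2'
```

On POSIX this is fully atomic; on Windows it is close but, as the answer notes, sharing and locking semantics still differ, so a reader holding the old file open can still collide with the rename.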
