ftp sending python bytesio stream - python

I want to send a file with Python ftplib from one FTP site to another, so as to avoid file read/write processes.
I create a BytesIO stream:
myfile=BytesIO()
And I successfully retrieve an image file from FTP site one with retrbinary:
ftp_one.retrbinary('RETR P1090080.JPG', myfile.write)
I can save this memory object to a regular file:
fot = open('casab.jpg', 'wb')
fot.write(myfile.getvalue())
But I am not able to send this stream via FTP with storbinary. I thought this would work:
ftp_two.storbinary('STOR magnafoto.jpg', myfile.getvalue())
But it doesn't. I get a long error message ending with:
buf = fp.read(blocksize)
AttributeError: 'str' object has no attribute 'read'
I also tried many absurd combinations, but with no success. As an aside, I am also quite puzzled by what I am really doing with myfoto.write. Shouldn't it be myfoto.write()?
I am also quite clueless as to what this buffer thing does or requires. Is what I want too complicated to achieve? Should I just ping-pong the files with an intermediate write/read on my system? Thanks all.
EDIT: Thanks to abarnert I got things straight. For the record, the storbinary arguments were wrong and a myfile.seek(0) was needed to 'rewind' the stream before sending it. This is a working snippet that moves a file between two FTP addresses without intermediate physical file writes:
import ftplib as ftp
from io import BytesIO
ftp_one = ftp.FTP(address1, user1, pass1)
ftp_two = ftp.FTP(address2, user2, pass2)
myfile = BytesIO()
ftp_one.retrbinary('RETR imageoldname.jpg', myfile.write)
myfile.seek(0)
ftp_two.storbinary('STOR imagenewname.jpg', myfile)
ftp_one.close()
ftp_two.close()
myfile.close()

The problem is that you're calling getvalue(). Just don't do that:
ftp_two.storbinary('STOR magnafoto.jpg', myfile)
storbinary requires a file-like object that it can call read on.
Fortunately, you have just such an object, myfile, a BytesIO. (It's not entirely clear from your code what the sequence of things is here—if this doesn't work as-is, you may need to myfile.seek(0) or create it in a different mode or something. But a BytesIO will work with storbinary unless you do something wrong.)
But instead of passing myfile, you pass myfile.getvalue(). And getvalue "Returns bytes containing the entire contents of the buffer."
So, instead of giving storbinary a file-like object that it can call read on, you're giving it a bytes object, which is of course the same as str in Python 2.x, and you can't call read on that.
For your aside:
As an aside, I am also quite puzzled by what I am really doing with myfoto.write. Shouldn't it be myfoto.write()?
Look at the docs. The second parameter isn't a file, it's a callback function.
The callback function is called for each block of data received, with a single string argument giving the data block.
What you want is a function that appends each block of data to the end of myfoto. While you could write your own function to do that:
def callback(block_of_data):
    myfoto.write(block_of_data)
… it should be pretty obvious that this function does exactly the same thing as the myfoto.write method. So, you can just pass that method itself.
If you don't understand about bound methods, see Method Objects in the tutorial.
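For instance (a throwaway illustration, not part of your code), a bound method can be stored and called later exactly like the standalone callback above:
from io import BytesIO

buf = BytesIO()
append = buf.write        # a bound method; note there are no parentheses here
append(b'first block')    # exactly equivalent to buf.write(b'first block')
append(b'second block')
print(buf.getvalue())     # b'first blocksecond block'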
This flexibility, as weird as it seems, lets you do something even better than downloading the whole file into a buffer to send to another server. You can actually open the two connections at the same time, and use callbacks to send each buffer from the source server to the destination server as it's received, without ever storing anything more than one buffer.
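For example, something along these lines might work (an untested sketch, reusing the placeholder addresses, credentials and filenames from the snippet in the question's EDIT): transfercmd() hands back the raw data socket for the upload, and the retrbinary() callback forwards each block straight into it.
import ftplib

ftp_one = ftplib.FTP(address1, user1, pass1)
ftp_two = ftplib.FTP(address2, user2, pass2)

ftp_two.voidcmd('TYPE I')                            # binary mode for the upload side
dest = ftp_two.transfercmd('STOR imagenewname.jpg')  # raw data socket for the upload
try:
    # every block retrbinary hands us goes straight out on the upload socket
    ftp_one.retrbinary('RETR imageoldname.jpg', dest.sendall)
finally:
    dest.close()
    ftp_two.voidresp()   # read the final transfer reply (undocumented ftplib helper)

ftp_one.quit()
ftp_two.quit()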
But, unless you really need that, you probably don't want to go through all that complexity.
In fact, in general, ftplib is kind of low-level. And it has some designs (like the fact that storbinary takes a file, while retrbinary takes a callback) that make total sense at that low level but seem very odd from a higher level. So, you may want to look at some of the higher-level libraries by doing a search at PyPI.

Related

Python Compressed file ended before the end-of-stream marker was reached. But file is not Corrupted

I made a simple requests script that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I now extract the file manually in the directory with 7-Zip, everything is fine and the file decompresses as normal.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
This one gives me the error "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
The second way I tried was with py7zr, because by hand 7-Zip worked fine:
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error OSError 22 Invalid Argument at the "with py7zr..." line.
I really don't understand where the problem is here. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)
Note: request.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means the underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) is doing nothing (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you'd called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
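If the file really is a bare LZMA stream (as the .lzma extension suggests) rather than a 7z archive, the standard lzma module may be the simplest way to read it once the download has actually been closed; a rough sketch, using the same placeholder path as above:
import lzma

# placeholder path from the question; assumes a raw LZMA stream, not a 7z archive
with lzma.open(r'C:\...\index_en.txt.lzma', 'rt') as uncompressed:
    for line in uncompressed:
        print(line, end='')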

add header to stdout of a subprocess in python

I am merging several dataframes into one and sorting them using unix sort. Before I write the final sorted data I would like to add a prefix/header to that output.
So, my code is something like:
my_cols = '\t'.join(['CHROM', 'POS', "REF" ....])
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    subprocess.run(my_cmd, stdout=sort_data)
But the above code puts my_cols at the end of the final output file (i.e. mergedAndSorted.txt).
I also tried substituting:
sort_data=io.StringIO(my_cols)
but this gives me an error as I had expected.
How can I add that header to the beginning of the subprocess output? I believe this can be achieved by a simple code change.
The problem with your code is a matter of buffering; the tldr is that you can fix it like this:
sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
If you want to understand why it happens, and how the fix solves it:
When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream-encoding from Unicode to bytes going on, but that doesn't really add a new problem, it just adds two layers where the same thing can happen, so let's ignore that.)
As long as all of your writes are to the buffered file object, that's fine—they get sequenced properly in the buffer, so they get sequenced properly on the disk.
But if you write to the underlying sort_data.buffer.raw disk file, or to the sort_data.fileno() OS file descriptor, those writes may get ahead of the ones that went to sort_data.
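You can see that reordering with nothing but a plain file (demo.txt is just a throwaway name):
import os

with open('demo.txt', 'w') as f:
    f.write('via the buffered file object\n')               # sits in Python's buffer
    os.write(f.fileno(), b'via the raw file descriptor\n')  # hits the file immediately
# Once the with block flushes the buffer on close, demo.txt shows the raw-fd
# line first, even though it was written second.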
And that's exactly what happens when you use the file as a pipe in subprocess. This doesn't seem to be explained directly, but can be inferred from Frequently Used Arguments:
stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None.
This implies pretty strongly—if you know enough about the way piping works on *nix and Windows—that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it is calling fileno or msvcrt.get_osfhandle on the file objects.

How to save incoming file in bottle api to hdfs

I am defining a bottle API where I need to accept a file from the client and then save that file to HDFS on the local system.
The code looks something like this.
@route('/upload', method='POST')
def do_upload():
    import pdb; pdb.set_trace()
    upload = request.files.upload
    name, ext = os.path.splitext(upload.filename)
    save_path = "/data/{user}/{filename}".format(user=USER, filename=name)
    hadoopy.writetb(save_path, upload.file.read())
    return "File successfully saved to '{0}'.".format(save_path)
The issue is that request.files.upload.file is an object of type cStringIO.StringO, which can be converted to a str with its .read() method. But hadoopy.writetb(path, content) expects the content in some other format, and the server gets stuck at that point. It doesn't raise an exception, it doesn't give an error or any result. It just stands there as if it were in an infinite loop.
Does anyone know how to write an incoming file in a bottle API to HDFS?
From the hadoopy documentation, it looks like the second parameter to writetb is supposed to be an iterable of pairs; but you're passing in bytes.
...the hadoopy.writetb command which takes an iterator of key/value pairs...
Have you tried passing in a pair? Instead of what you're doing,
hadoopy.writetb(save_path, upload.file.read()) # 2nd param is wrong
try this:
hadoopy.writetb(save_path, (path, upload.file.read()))
(I'm not familiar with Hadoop so it's not clear to me what the semantics of path are, but presumably it'll make sense to someone who knows HDFS.)
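If a single tuple doesn't do it, an explicit iterator of one key/value pair would match the quoted documentation more literally; for example (the choice of the uploaded filename as the key is only a guess):
def kv_pairs(upload):
    # one (key, value) pair; using the uploaded filename as the key is a guess
    yield upload.filename, upload.file.read()

hadoopy.writetb(save_path, kv_pairs(upload))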

How do I get the snapshot length of a .pcap file using dpkt?

I am trying to get the snapshot length of a .pcap file. I have gone to the man page for pcap and pcap_snapshot but have not been able to get the function to work.
I am running a Fedora 20 VM and the code is written in Python.
First I try to import the file that the man page says to include, but I get a syntax error on the import and on the pcap_snapshot() call.
I am new to Python so I imagine it's something simple, but I'm not sure what it is. Any help is much appreciated!
import <pcap/pcap.h>
import dpkt
myPcap = open('mycapture.pcap')
myFile = dpkt.pcap.Reader(myPcap)
print "Snapshot length = ", myFile.pcap_snapshot()
Don't read the man page first unless you're writing code in C, C++, or Objective-C.
If you're not using a C-flavored language, you'll need to use a wrapper for libpcap, and should read the documentation for the wrapper first, as you won't be calling the C functions from libpcap, you'll be calling functions from the wrapper. If you try to import a C-language header file, such as pcap/pcap.h, in Python, that will not work. If you try to directly call a C-language function, such as pcap_snapshot(), that won't work, either.
Dpkt is not a wrapper; it is, instead, a library to parse packets and to read pcap files, with the code to read pcap files being independent of libpcap. Therefore, it won't offer wrappers for libpcap APIs such as pcap_snapshot().
Dpkt's documentation is, well, rather limited. A quick look at its pcap.py module seems to suggest that
print "Snapshot length = ", myFile.snaplen
would work; give that a try.
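Putting that together, something like this might do it (snaplen is assumed from that reading of pcap.py, and mycapture.pcap is your capture file):
import dpkt

myPcap = open('mycapture.pcap', 'rb')        # pcap files are binary, so open with 'rb'
myFile = dpkt.pcap.Reader(myPcap)
print "Snapshot length = ", myFile.snaplen   # snaplen as suggested by dpkt's pcap.py
myPcap.close()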

python File read buffer boundary

When we use the open() function to open a file, we may set the buffer size for performance. But I wonder: if we set it to 1024, and the data in the file looks like this:
1999999999 3232344 54354364576 2343243254 6453623453245r3245235 5342453245233333333333333333 534545454364536 4355545...
I don't know whether this will cut a number in half, e.g. on the first read the buffer will be 1999999999 3232344 54354364576 2343243254 6453623453245r3245235 53424532,
and on the next read the buffer will be 45233333333333333333 534545454364536 4355545, and so on.
Or does Python's buffering implementation already handle this? Can anyone give me some pointers? Thanks.
If you use the read method without any arguments it will return the entire file content. You can use the size parameter if you only want to read a part of the file at a time.
See the documentation for more info http://docs.python.org/2.7/tutorial/inputoutput.html#reading-and-writing-files
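If you do end up reading in fixed-size chunks yourself, a read(size) call can indeed stop in the middle of a number, but you can carry the partial token over to the next chunk. A rough sketch (numbers.txt and the 1024-byte chunk size are just placeholders):
def read_numbers(path, chunk_size=1024):
    leftover = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            parts = chunk.split()
            if chunk[-1].isspace():
                leftover = ''            # the chunk ended cleanly on whitespace
            else:
                leftover = parts.pop()   # last token may be cut off; keep it for next time
            for token in parts:
                yield token
    if leftover:
        yield leftover

# usage: print(list(read_numbers('numbers.txt')))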
