Creating a new results/logging file using paramiko and sftp - python

I am using Python and paramiko to read some files over SFTP. The get is working fine. When I am done processing a file, I would like to put a file summarizing the results. I would rather not have to save the file locally first in order to do this; I have a dict of the results, and I just want to create a file on the SFTP server to put it into. Below is my code, with (I hope) all of the relevant bits in and the unrelated parts removed for readability.
Note that I am successfully reading the file and processing it, and creating the dict of results, without a problem, and I can print it to my terminal when I run csv_import. When I try to add the final step of putting the dict of results into a file on the same sftp server, though, it hangs forever. Any help is appreciated.
def csv_import():
    we_are_live = True
    host = "111.111.111.111"
    port = 22
    password = "cleverpwd"
    username = "cleverun"

    t = paramiko.Transport((host, port))
    t.connect(username=username, password=password)

    if we_are_live and t.is_authenticated():
        sftp = paramiko.SFTPClient.from_transport(t)
        sftp.chdir('.' + settings.REMOTE_SFTP_DIRECTORY)
        files_to_pick_from = sftp.listdir()

        # ...file processing code happens here, get back a dictionary of the results...

        results_file_name = 'results' + client_file_name
        results_file = paramiko.SFTPClient.from_transport(t)
        results_file.file(results_file_name, mode='w', bufsize=-1)
        results_file.write(str(sftp_results_of_import))
        results_file.close()
        t.close()

I did something similar a while ago, but I used files on disk; maybe you'll find something useful here:
http://code.activestate.com/recipes/576810-copy-files-over-ssh-using-paramiko/
And if you only need to create files in memory, you could try StringIO:
http://docs.python.org/library/stringio.html
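For the original question, a rough sketch of that in-memory idea (untested; it reuses the sftp client and the results_file_name / sftp_results_of_import names from the question's code):

import io

# Open the remote file through the SFTPClient you already have and write the
# stringified results dict straight into it, no local temp file needed.
with sftp.open(results_file_name, mode='w') as remote_file:
    remote_file.write(str(sftp_results_of_import))

# Alternatively, build the content in memory and upload it in one call.
buf = io.BytesIO(str(sftp_results_of_import).encode('utf-8'))
sftp.putfo(buf, results_file_name)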

Related

Reading large Parquet file from SFTP with Pyspark is slow

I'm having some issues reading data (Parquet) from an SFTP server with SQLContext.
The Parquet file is quite large (6M rows).
I found some solutions to read it, but it's taking almost an hour.
Below is the script that works, but it is too slow.
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem
fs = SFTPFileSystem(host = SERVER_SFTP, port = SERVER_PORT, username = USER, password = PWD)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem=fs)
When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?
df = sqlContext.read.parquet('PATH/file')
Things that I tried: using an SFTP library to open the file, but that seems to lose all the advantages of SparkSQL.
df = sqlContext.read.parquet(sftp.open('PATH/file'))
I also tried to use the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark
fsspec uses Paramiko under the hood, and this is a known problem with Paramiko:
Reading file opened with Python Paramiko SFTPClient.open method is slow
In fsspec, it does not seem to be possible to change the buffer size.
But you can derive your own implementation from SFTPFileSystem that does:
class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        return super().open(path, mode, bufsize=32768)
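A usage sketch of the subclass (same placeholder connection variables as in the question; it assumes the subclass takes the same constructor arguments as SFTPFileSystem, which it should, since it only overrides open):

fs = BufferedSFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem=fs)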
By adding the buffer_size parameter to the pyarrow.parquet call, the computation time went from 51 to 21 minutes :)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem=fs, buffer_size=32768)
Thanks @Martin Prikryl for your help ;)

Need to connect to ftp through the ftputil module, open an existing file with records and add new records to the end of those records

I used the ftputil module for this, but ran into the problem that it doesn't support 'a' (append) mode for writing to a file, and if you write via 'w' it overwrites the contents.
This is what I tried, and where I'm stuck:
with ftputil.FTPHost(host, ftp_user, ftp_pass) as ftp_host:
    with ftp_host.open("my_path_to_file_on_the_server", "a") as fobj:
        cupone_wr = input('Enter coupons with a space: ')
        cupone_wr = cupone_wr.split(' ')
        for x in range(0, len(cupone_wr)):
            cupone_str = '<p>Your coupon %s</p>\n' % cupone_wr[x]
            data = fobj.write(cupone_str)
            print(data)
The goal is to leave the old entries in the file and add fresh entries to the end of the file every time the script is called again.
Indeed, ftputil does not support appending. So either you will have to download the complete file and re-upload it with the appended records, or you will have to use another FTP library.
For example, the built-in Python ftplib supports appending. On the other hand, it does not (at least not easily) support streaming. Instead, it's easier to construct the new records in memory and upload/append them at once:
from ftplib import FTP
from io import BytesIO

flo = BytesIO()

cupone_wr = input('Enter coupons with a space: ')
cupone_wr = cupone_wr.split(' ')
for x in range(0, len(cupone_wr)):
    cupone_str = '<p>Your coupon %s</p>\n' % cupone_wr[x]
    flo.write(cupone_str.encode('utf-8'))

ftp = FTP('ftp.example.com', 'username', 'password')
flo.seek(0)
ftp.storbinary('APPE my_path_to_file_on_the_server', flo)
ftputil author here :-)
Martin is correct in that there's no explicit append mode. That said, you can open file-like objects with a rest argument. In your case, rest would need to be the original length of the file you want to append to.
The documentation warns against using a rest argument that points after the file because I'm quite sure rest isn't expected to be used that way. However, if you use your program only against a specific server and can verify its behavior, it might be worthwhile to experiment with rest. I'd be interested whether it works for you.
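A minimal sketch of that experiment, based on the question's code (untested; it assumes the server accepts a REST offset equal to the current file size, which is exactly the behavior the documentation warns you to verify; binary mode is used so the offset is counted in bytes):

with ftputil.FTPHost(host, ftp_user, ftp_pass) as ftp_host:
    remote_path = "my_path_to_file_on_the_server"
    # Length of the existing remote file; the upload starts at this offset,
    # which effectively appends the new records.
    offset = ftp_host.path.getsize(remote_path)
    with ftp_host.open(remote_path, "wb", rest=offset) as fobj:
        cupone_wr = input('Enter coupons with a space: ').split(' ')
        for cupone in cupone_wr:
            fobj.write(('<p>Your coupon %s</p>\n' % cupone).encode('utf-8'))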

RedVox Python SDK | Not Reading in .rdvxz Files

I'm attempting to read in a series of files for processing contained in a single directory using RedVox:
input_directory = "/home/ben/Documents/Data/F1D1/21"  # file location
rdvx_data = DataWindow(input_dir=input_directory, apply_correction=False, debug=True)  # using RedVox to read in the files

print(os.listdir(input_directory))  # verifying the files actually exist...
# returns "['file1.rdvxz', 'file2.rdvxz', 'file3.rdvxz', ...etc]", they exist

# write audio portion to file
rdvx_data.to_json_file(base_dir=output_rpd_directory,
                       file_name=output_filename)

# this never runs, because rdvx_data.stations = [] (verified through debugging)
for station in rdvx_data.stations:
    # some code here
Enabling debugging through arguments, as seen above, does not provide any extra details. In fact, there is no error message whatsoever. It writes the JSON file and pickle to disk, but the JSON file is full of null values and the pickle object is just a shell, with no contents. So the files definitely exist, os.listdir() sees them, but RedVox does not.
I assume this is some very silly error or lack of understanding on my part. Any help is greatly appreciated. I have not worked with RedVox previously, nor do I have much understanding of what these files contain other than some audio data and some other data. I've simply been tasked with opening them to work on a model to analyze the data within.
SOLVED: I'm not sure why the previous code doesn't work (it was handed to me); however, I worked around the DataWindow call and went straight to the "redvox.api900.reader" module:
from glob import glob
from redvox.api900 import reader

dataset_dir = "/home/*****/Documents/Data/F1D1/21/"
rdvx_files = glob(dataset_dir + "*.rdvxz")

for file in rdvx_files:
    wrapped_packet = reader.read_rdvxz_file(file)
From here I can view all of the sensor data within:
if wrapped_packet.has_microphone_sensor():
    microphone_sensor = wrapped_packet.microphone_sensor()
    print("sample_rate_hz", microphone_sensor.sample_rate_hz())
Hope this helps anyone else who's confused.

writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

I have looked at many related answers here on Stack Overflow, and this question seems most related: How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar; however, I want to compress the file when I send it to the SFTP location, so I end up with a .csv.gz file essentially. The files I am working with are 15-40 MB in size uncompressed, but there are lots of them sometimes, so I need to keep the footprint small.
I have been using code like this to move the dataframe to the destination, after pulling it from another location as a csv and doing some transformations on the data itself:
fileList = source_sftp.listdir('/Inbox/')
dataList = []
for item in fileList:  # for each file in the list...
    print(item)
    if item[-3:] == u'csv':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item))  # read the csv directly from the sftp server into a pd Dataframe
    elif item[-3:] == u'zip':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='zip')
    elif item[-3:] == u'.gz':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='gzip')
    else:
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='infer')
    dataList.append(temp)  # keep each

# ... Some transformations in here on the data

FL = [(x.replace('.csv', '')) + suffix  # just swap out to suffix
      for x in fileList]

locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0
for item in dataList:
    with dest_sftp.open(locpath + FL[i], 'w') as f:
        item.to_csv(f, index=False, compression='gzip')
    i = i + 1
It seems like I should be able to get this to work, but I am guessing there is something being skipped over when I use to_csv to convert the dataframe back and then compress it on the SFTP file object. Should I be streaming this somehow, or is there a solution I am missing somewhere in the documentation for pysftp or pandas?
If I can avoid saving the csv file somewhere local first, I would like to, but I don't think I should have to, right? I am able to get the file to be compressed in the end if I just save the file locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip'), and after transferring this local file to the destination it is still compressed, so I don't think it has to do with the transfer, just with how pandas.DataFrame.to_csv and pysftp.Connection.open are used together.
I should probably add that I still consider myself a newbie to much of Python, but I have been working with local-to-SFTP and SFTP-to-local transfers, and have not had to do much in the way of transferring (directly or indirectly) between two remote servers.
Make sure you have the latest version of Pandas.
It supports compression with a file-like object only since 0.24:
GH21227: df.to_csv ignores compression when provided with a file handle
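If upgrading is not an option, a workaround sketch (it reuses the dest_sftp, locpath, FL, and item names from the question) is to gzip the CSV text yourself and write the compressed bytes to the remote file object:

import gzip

csv_bytes = item.to_csv(index=False).encode('utf-8')
with dest_sftp.open(locpath + FL[i], 'wb') as f:
    # Everything written to gz is gzip-compressed before it reaches the
    # remote file object f.
    with gzip.GzipFile(fileobj=f, mode='wb') as gz:
        gz.write(csv_bytes)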

App Engine - Save response from an API in the data store as file (blob)

I'm banging my head against the wall with this one:
What I want to do is store a file that is returned from an API in the data store as a blob.
Here is the code that I use on my local machine (which of course works due to an existing file system):
client.convertHtml(html, open('html.pdf', 'wb'))
Since I cannot write to a file on App Engine I tried several ways to store the response, without success.
Any hints on how to do this? I was trying to do it with StringIO and managed to store the response, but then I wasn't able to store it as a blob in the data store.
Thanks,
Chris
Found the error. Here is how it looks right now (simplified):
output = StringIO.StringIO()
try:
    client.convertURI("example.com", output)
    Report.pdf = db.Blob(output.getvalue())
    Report.put()
except pdfcrowd.Error, why:
    logging.error('PDF creation failed %s' % why)
I was trying to save the output without calling "getvalue()", that was the problem. Perhaps this is of use to someone in the future :)
