Python ftplib: low download & upload speeds compared to other FTP clients

I was wondering if anyone has observed that the time taken to download or upload a file over FTP using Python's ftplib is much longer than performing an FTP get/put from the Windows command prompt or with Perl's Net::FTP module.
I created a simple FTP client similar to http://code.activestate.com/recipes/521925-python-ftp-client/ but I am unable to attain the speed I get when running FTP at the Windows DOS prompt or using Perl. Is there something I am missing, or is it a problem with the Python ftplib module?
I would really appreciate it if you could throw some light on why I am getting low throughput with Python.

The problem was with the block size: I was using a block size of 1024 bytes, which was too small. After increasing the block size to 256 KB (262144 bytes), the speeds are similar across all the different platforms.
import ftplib

def putfile(file, site, user=()):
    # user is a (username, password) tuple
    upFile = open(file, 'rb')
    handle = ftplib.FTP(site)
    handle.login(*user)
    print("Upload started")
    # 262144 bytes = 256 KB per block, instead of the default 8192
    handle.storbinary('STOR ' + file, upFile, blocksize=262144)
    print("Upload completed")
    handle.quit()
    upFile.close()
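The download side benefits the same way: retrbinary takes the same blocksize argument. A minimal download counterpart under the same assumptions (the getfile name is just for illustration):

def getfile(file, site, user=()):
    handle = ftplib.FTP(site)
    handle.login(*user)
    with open(file, 'wb') as downFile:
        # same 256 KB blocks as the upload
        handle.retrbinary('RETR ' + file, downFile.write, blocksize=262144)
    handle.quit()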

I had a similar issue with the default blocksize of 8192 when using FTP_TLS:
from ftplib import FTP_TLS

site = 'ftp.siteurl.com'
user = 'username-here'
upass = 'supersecretpassword'

ftp = FTP_TLS(host=site, user=user, passwd=upass)
with open(newfilename, 'wb') as f:
    def callback(data):
        f.write(data)
    ftp.retrbinary('RETR filename.txt', callback, blocksize=262144)
Increasing the block size increased the speed 10x. Thanks @Tanmoy Dube

Related

Reading large Parquet file from SFTP with Pyspark is slow

I'm having some issues reading data (Parquet) from an SFTP server with SQLContext.
The Parquet file is quite large (6M rows).
I found some solutions to read it, but it's taking almost 1 hour.
Below is the script that works, but it is too slow.
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem

fs = SFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table('SERVER_LOCATION/FILE.parquet', filesystem=fs)
When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?
df = sqlContext.read.parquet('PATH/file')
Things that I tried: using an SFTP library to open the file, but that seems to lose all the advantages of SparkSQL.
df = sqlContext.read.parquet(sftp.open('PATH/file'))
I also tried to use the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark
fsspec uses Paramiko under the hood, and this is a known problem with Paramiko:
Reading file opened with Python Paramiko SFTPClient.open method is slow
In fsspec, it does not seem to be possible to change the buffer size.
But you can derive your own implementation from SFTPFileSystem that does:
class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        # bufsize is passed through to Paramiko's SFTPClient.open
        return super().open(path, mode, bufsize=32768)
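A hypothetical usage, dropping the buffered subclass in where SFTPFileSystem was constructed above:

fs = BufferedSFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table('SERVER_LOCATION/FILE.parquet', filesystem=fs)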
Additionally, by adding the buffer_size parameter to the pyarrow.parquet call, the computation time went from 51 to 21 minutes :)
df = pq.read_table('SERVER_LOCATION/FILE.parquet', filesystem=fs, buffer_size=32768)
Thanks @Martin Prikryl for your help ;)

send gzip data without unzipping

I am currently working on a script for a Raspberry Pi that uses a SIM module to send data to an FTP server. The problem is that some of the data is quite large; I formatted it into CSV files, but the files are still a bit too large to send over GPRS. Compressing them into .gz files reduces the size by a factor of 5, which is great, but the only way to send data is line by line. I was wondering if there is a way to send the contents of a gzip file without sending the uncompressed data. Here is my code so far:
import glob
import gzip

list_of_files = glob.glob('/home/pi/src/git/RPI/DATA/*.gz')
print(list_of_files)
for file_data in list_of_files:
    zipp = gzip.GzipFile(file_data, 'rb')
    file_content = zipp.read()
    # array = np.fromstring(file_content, dtype='f4')
    print(len(file_content))
    # AT commands to send the file_content to the FTP server
Here the length returned is the length of the uncompressed data, but I want to retrieve the compressed content of the gzip file instead. Is that doable?
Thanks for your help.
zipp = gzip.GzipFile(file_data,'rb')
specifically requests unzipping. If you just want to read the raw binary gzip data, use a regular open:
zipp = open(file_data,'rb')
You don't need to read the file into memory to fetch its size, though. The os.stat function lets you get information about a file's metadata without opening it.
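A minimal sketch, reusing the file_data variable from the loop above:

import os

size = os.stat(file_data).st_size  # compressed size in bytes, no open/read needed
print(size)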

Python requests not utilizing network

I'm trying to download a medium-sized APK file (10-300 MB) and save it locally. My connection speed should be about 90 Mbps, yet the process rarely surpasses 1 Mbps, and my network doesn't seem to be anywhere near its cap.
I've verified with cProfile that the part that's getting stuck is indeed the SSL download, and I've tried various advice from Stack Overflow, like reducing or increasing the chunk size, to no avail. I'd love a way to either test whether this could be a server-side issue, or advice on what I am doing wrong on the client side.
Relevant code:
import requests

session = requests.Session()  # I've heard a Session is better due to the persistent HTTP connection
session.trust_env = False
r = session.get(<url>, headers=REQUEST_HEADERS, stream=True, timeout=TIMEOUT)  # TIMEOUT = 60
r.raise_for_status()
filename = 'myFileName'
i = 0
with open(filename, 'wb') as result:
    for chunk in r.iter_content(chunk_size=1024*1024):
        if chunk:
            i += 1
            if i % 5 == 0:
                print(f'At chunk {i} with timeout {TIMEOUT}')
            result.write(chunk)
I was trying to download many different files; upon printing the URLs I was downloading and testing them in Chrome, I saw that some of the URLs were significantly slower to download than others.
It seems like a server-side issue, which I ended up working around by picking a good timeout for the requests.
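A minimal sketch of that workaround, with hypothetical values (requests accepts a (connect, read) timeout tuple, so a stalled server errors out quickly instead of dragging the whole run down):

import requests

TIMEOUT = (5, 30)  # hypothetical: 5 s to connect, 30 s per read
session = requests.Session()
r = session.get('https://example.com/app.apk', stream=True, timeout=TIMEOUT)  # placeholder URL
r.raise_for_status()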

Is there any way to detect if an excel workbook is password protected in Python or Robot framework?

I am working on a BOT that needs to know whether an Excel workbook is password protected (using Python or Robot Framework). Is there any library or trick in either of them to get this done? I have been doing R&D on it for many days but have found nothing. Every solution I have encountered tells me how to read a password-protected Excel file, but I don't want to read the content, because the BOT just needs to send an email if the given Excel file is password protected.
Thanks in advance.
I have found a solution to this issue.
Python has a library, msoffcrypto-tool, which helped me achieve what I needed. The following is the code snippet for it.
import msoffcrypto

def isExcelEncrypted(excelPath):
    try:
        fileHandle = open(excelPath, "rb")
        ofile = msoffcrypto.OfficeFile(fileHandle)
        isEncrypted = ofile.is_encrypted()
        fileHandle.close()
        return isEncrypted
    except Exception as err:
        return "Exception: " + str(err)
Although the library is for decrypting MS Office files, I used only its is_encrypted() function (which returns True/False), and it worked for both .xls and .xlsx formats.
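A hypothetical usage, with a placeholder path, matching the BOT's notify-only requirement:

result = isExcelEncrypted('report.xlsx')
if result is True:
    print('Workbook is password protected; send the notification email')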

Is there a rich ftp client library for Python?

I'm familiar with ftplib and it works great as a simple interface, but I need file properties and basically a rich FTP client. Does anyone know of a good FTP client library?
Use the MLSD command. You have to parse it yourself but it's fairly easy.
import ftplib

# Note that portions of MLSD data are case insensitive...
def parseinfo(info):
    for fact in info.split(';'):
        if not fact:
            continue
        name, value = fact.split('=', 1)
        yield name.lower(), value

ftp = ftplib.FTP(host, user, passwd)
dirinfo = {}

def callback(line):
    info, fname = line.split(' ', 1)
    dirinfo[fname] = dict(parseinfo(info))

ftp.retrlines('MLSD {}'.format(path), callback)
print(dirinfo)
That's about as rich as FTP gets.
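Worth adding: on Python 3.3+, ftplib can do this parsing for you; FTP.mlsd() yields (name, facts) pairs directly.

for fname, facts in ftp.mlsd(path):
    print(fname, facts.get('size'), facts.get('modify'))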
ftputil could be what you are looking for:
FTPHost objects generated with ftputil allow many operations similar to those of os and os.path.
The API supports file-information gathering well.
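A minimal sketch, with hypothetical host and credentials, showing the os-like API:

import ftputil

with ftputil.FTPHost('ftp.example.com', 'user', 'password') as host:
    for name in host.listdir(host.curdir):
        if host.path.isfile(name):
            # size and mtime via an os.path-style interface
            print(name, host.path.getsize(name), host.path.getmtime(name))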
