Reading large Parquet file from SFTP with Pyspark is slow - python

I'm having some issues reading data (Parquet) from an SFTP server with SQLContext.
The Parquet file is quite large (6M rows).
I found some solutions to read it, but it's taking almost an hour.
Below is the script that works, but it is too slow.
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem
fs = SFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table(SERVER_LOCATION + '/FILE.parquet', filesystem=fs)
When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use Spark SQL to read a remote file over SFTP?
df = sqlContext.read.parquet('PATH/file')
Things that I tried: using an SFTP library to open the file, but this seems to lose all the advantages of Spark SQL.
df = sqlContext.read.parquet(sftp.open('PATH/file'))
I also tried to use the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark

fsspec uses Paramiko under the hood, and this is a known problem with Paramiko:
Reading file opened with Python Paramiko SFTPClient.open method is slow
In fsspec, it does not seem to be possible to change the buffer size.
But you can derive your own implementation from SFTPFileSystem that does:
class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        return super().open(path, mode, bufsize=32768)
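For instance, the derived filesystem could then be used in place of the plain one. A minimal sketch, reusing the placeholders from the question:
fs = BufferedSFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT, username=USER, password=PWD)
df = pq.read_table(SERVER_LOCATION + '/FILE.parquet', filesystem=fs)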

By passing the buffer_size parameter to the pyarrow.parquet call, the computation time went from 51 to 21 minutes :)
df = pq.read_table(SERVER_LOCATION + '/FILE.parquet', filesystem=fs, buffer_size=32768)
Thanks @Martin Prikryl for your help ;)

Related

Parquet File re-write has slightly larger size in both Pandas / PyArrow

So I am trying to read a Parquet file into memory, select chunks of the file, and upload them to an AWS S3 bucket. I want to write sanity tests to check whether a file was uploaded correctly, through either a size check or an MD5 hash check between the local file and the cloud file on the bucket.
One thing I noticed is that reading a file into memory, either as bytes or as a pd.DataFrame / Table, and then re-writing the same object into a new file changes the file size, in my case increasing it compared to the original. Here's some sample code:
import pandas as pd
df = pd.read_parquet("data/example.parquet")
Then I simply write:
from io import BytesIO
buffer = BytesIO()
df.to_parquet(buffer)  # this can be done directly without BytesIO; I use it for clarity
with open('copy.parquet', 'wb') as f:
    f.write(buffer.getvalue())
Now running ls -l on both files gives me different sizes:
37089 Oct 28 16:57 data/example.parquet
37108 Dec 7 14:17 copy.parquet
Interestingly enough, I tried using a tool such as xxd paired with diff, and to my surprise the binary differences were scattered all across the file, so I think it's safe to assume that this is not just a metadata difference. Reloading both files into memory using pandas gives me the same table. It might also be worth mentioning that the Parquet file contains both NaN and NaT values. Unfortunately I cannot share the file, but I will see if I can replicate the behavior with a small sample.
I also tried using Pyarrow's file reading functionality which resulted in the same file size:
import pyarrow as pa
import pyarrow.parquet as pq
with open('data/example.parquet', 'rb') as f:
    buffer = pa.BufferReader(f.read())
table = pq.read_table(buffer)
pq.write_table(table, 'copy.parquet')
I have also tried turning on compression='snappy' in both versions, but it did not change the output.
Is there some configuration I'm missing when writing back to disk?
Pandas uses pyarrow to read/write Parquet, so it is unsurprising that the results are the same. I am not sure what clarity using buffers gives compared to saving the files directly, so I have left it out in the code below.
What was used to write the example file? If it was not pandas but e.g. pyarrow directly, that would show up as a mostly metadata difference, as pandas adds its own schema in addition to the normal Arrow metadata.
Though you say this is not the case here, so the likely reason is that this file was written by another system with a different version of pyarrow. As Michael Delgado mentioned in the comments, snappy compression is turned on by default, and snappy is not deterministic between systems:
not across library versions (and possibly not even across architectures)
This explains why you see the difference all over the file. You can try the code below to see that, on the same machine, the MD5 is the same between files, but the pandas version is larger due to the added metadata.
Currently the Arrow S3 writer does not check for integrity, but the S3 API has such functionality. I have opened an issue to make this accessible via Arrow.
import pandas as pd
import pyarrow as pa
import numpy as np
import pyarrow.parquet as pq
arr = pa.array(np.arange(100))
table = pa.Table.from_arrays([arr], names=["col1"])
pq.write_table(table, "original.parquet")
pd_copy = pd.read_parquet("original.parquet")
copy = pq.read_table("original.parquet")
pq.write_table(copy, "copy.parquet")
pd_copy.to_parquet("pd_copy.parquet")
$ md5sum original.parquet copy.parquet pd_copy.parquet
fb70a5b1ca65923fec01a54f85f17260 original.parquet
fb70a5b1ca65923fec01a54f85f17260 copy.parquet
dcb93cb89426a948e885befdbee204ff pd_copy.parquet
1092 copy.parquet
1092 original.parquet
2174 pd_copy.parquet
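As a side note on the sanity check mentioned in the question, a minimal sketch of comparing files by MD5 from Python (hypothetical file names, standard library only):
import hashlib

def md5_of(path, chunk_size=1 << 20):
    # hash the file in chunks so large Parquet files do not need to fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5_of('original.parquet') == md5_of('copy.parquet'))  # True on the same machine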

writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

I have looked at many related answers here on Stack Overflow, and this question seems most related: How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar; however, I want to compress the file when I send it to the SFTP location, so I end up with a .csv.gz file, essentially. The files I am working with are 15-40 MB in size uncompressed, but sometimes there are lots of them, so I need to keep the footprint small.
I have been using code like this to move the dataframe to the destination, after pulling it from another location as a CSV and doing some transformations on the data itself:
fileList = source_sftp.listdir('/Inbox/')
dataList = []
for item in fileList:  # for each file in the list...
    print(item)
    if item[-3:] == u'csv':
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item))  # read the csv directly from the sftp server into a pd Dataframe
    elif item[-3:] == u'zip':
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item), compression='zip')
    elif item[-3:] == u'.gz':
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item), compression='gzip')
    else:
        temp = pd.read_csv(source_sftp.open('/Inbox/' + item), compression='infer')
    dataList.append(temp)  # keep each
# ... Some transformations in here on the data
FL = [(x.replace('.csv', '')) + suffix  # just swap out to suffix
      for x in fileList]
locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0
for item in dataList:
    with dest_sftp.open(locpath + FL[i], 'w') as f:
        item.to_csv(f, index=False, compression='gzip')
    i = i + 1
It seems like I should be able to get this to work, but I am guessing there is something being skipped over when I use to_csv to convert the dataframe back and then compress it on the SFTP file object. Should I be streaming this somehow, or is there a solution I am missing somewhere in the documentation on pysftp or pandas?
If I can avoid saving the CSV file somewhere local first, I would like to, but I don't think I should have to, right? I am able to get the file to be compressed in the end if I just save it locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip'), and after transferring this local file to the destination it is still compressed, so I don't think it has to do with the transfer, just with how pandas.DataFrame.to_csv and pysftp.Connection.open are used together.
I should probably add that I still consider myself a newbie to much of Python, but I have been working with local to sftp and sftp to local, and have not had to do much in the way of transferring (directly or indirectly) between them.
Make sure you have the latest version of Pandas.
It supports compression with a file-like object only since 0.24:
GH21227: df.to_csv ignores compression when provided with a file handle
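With that in place, the pattern from the question should work directly against the SFTP file object. A minimal sketch, assuming a pandas version where to_csv honours compression for file handles (0.24+ per the linked issue) and reusing names from the question:
# sketch only: dest_sftp is the pysftp connection from the question,
# and df stands in for one of the transformed dataframes
with dest_sftp.open('/some/new/dir/myfile.csv.gz', 'wb') as f:
    df.to_csv(f, index=False, compression='gzip')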

pyarrow hdfs reads more data than requested

I am using pyarrow's HdfsFilesystem interface. When I call a read of n bytes, I often get 0%-300% more data sent over the network. My suspicion is that pyarrow is reading ahead.
The pyarrow Parquet reader doesn't have this behavior, and I am looking for a way to turn off read-ahead for the general HDFS interface.
I am running on Ubuntu 14.04. This issue is present in pyarrow 0.10 - 0.13 (the newest released version). I am on Python 2.7.
I have been using Wireshark to track the packets passed on the network.
I suspect it is read-ahead since the time for the first read is much greater than the time for the second read.
The regular pyarrow reader
import pyarrow as pa
fs = pa.hdfs.connect(hostname)
file_path = 'dataset/train/piece0000'
f = fs.open(file_path)
f.seek(0)
n_bytes = 3000000
f.read(n_bytes)
Parquet code without the same issue
parquet_path = 'dataset/train/parquet/part-22e3'
pf = fs.open(parquet_path)
pqf = pa.parquet.ParquetFile(pf)
data = pqf.read_row_group(0, columns=['col_name'])
Discussed in the JIRA ticket: https://issues.apache.org/jira/browse/ARROW-5432
A read_at function is being added to the pyarrow API that will allow you to read a file at an offset for a certain length with no read-ahead.
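If your pyarrow version already exposes it, a minimal sketch of this might look like the following (assuming NativeFile.read_at(nbytes, offset) is available; hostname and the path are placeholders from the question):
import pyarrow as pa

fs = pa.hdfs.connect(hostname)
with fs.open('dataset/train/piece0000') as f:
    # read exactly 3,000,000 bytes starting at offset 0, without read-ahead
    data = f.read_at(3000000, 0)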

Python ftplib: low download & upload speeds when using python ftplib

I was wondering if anyone has observed that the time taken to download or upload a file over FTP using Python's ftplib is very large compared to performing FTP get/put at the Windows command prompt or using Perl's Net::FTP module.
I created a simple FTP client similar to http://code.activestate.com/recipes/521925-python-ftp-client/, but I am unable to attain the speed I get when running FTP at the Windows DOS prompt or using Perl. Is there something I am missing, or is it a problem with the Python ftplib module?
I would really appreciate it if you could throw some light on why I am getting low throughput with Python.
The problem was with the block size: I was using a block size of 1024, which was too small. After increasing the block size to 256 KB (262144 bytes), the speeds are similar across all the different platforms.
import ftplib

def putfile(file, site, user=()):
    upFile = open(file, 'rb')
    handle = ftplib.FTP(site)
    handle.login(*user)
    print("Upload started")
    handle.storbinary('STOR ' + file, upFile, 262144)  # 256 KB block size
    print("Upload completed")
    handle.quit()
    upFile.close()
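A hypothetical call, with placeholder host and credentials, would then look like:
putfile('report.csv', 'ftp.example.com', user=('myuser', 'mypassword'))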
I had a similar issue with the default blocksize of 8192 using FTP_TLS
from ftplib import FTP_TLS

site = 'ftp.siteurl.com'
user = 'username-here'
upass = 'supersecretpassword'
newfilename = 'filename.txt'  # local destination file

ftp = FTP_TLS(host=site, user=user, passwd=upass)
with open(newfilename, 'wb') as f:
    def callback(data):
        f.write(data)
    ftp.retrbinary('RETR filename.txt', callback, blocksize=262144)
Increasing the block size increased speed 10x. Thanks @Tanmoy Dube

Creating a new results/logging file using paramiko and sftp

I am using Python and paramiko to read some files using SFTP. The get is working fine. When I am done processing the file, I would like to put a file summarizing the results. I would rather not have to save the file locally first in order to do this; I have a dict of the results, and I just want to create a file on the SFTP server to put that into. Below is my code, with, I hope, all of the relevant bits in and the unrelated parts removed for readability.
Note that I am successfully reading the file, processing it, and creating the dict of results without a problem, and I can print it to my terminal when I run csv_import. When I try to add the final step of putting the dict of results into a file on the same SFTP server, though, it hangs forever. Any help is appreciated.
import paramiko

def csv_import():
    we_are_live = True
    host = "111.111.111.111"
    port = 22
    password = "cleverpwd"
    username = "cleverun"
    t = paramiko.Transport((host, port))
    t.connect(username=username, password=password)
    if we_are_live and t.is_authenticated():
        sftp = paramiko.SFTPClient.from_transport(t)
        sftp.chdir('.' + settings.REMOTE_SFTP_DIRECTORY)
        files_to_pick_from = sftp.listdir()
        # ...file processing code happens here, get back a dictionary of the results...
        results_file_name = 'results' + client_file_name
        results_file = paramiko.SFTPClient.from_transport(t)
        results_file.file(results_file_name, mode='w', bufsize=-1)
        results_file.write(str(sftp_results_of_import))
        results_file.close()
    t.close()
I did something similar a while ago, but I used disk files; maybe you will find something useful:
http://code.activestate.com/recipes/576810-copy-files-over-ssh-using-paramiko/
And if you need to only create files in memory, you could try StringIO:
http://docs.python.org/library/stringio.html
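Building on the in-memory idea, here is a sketch of what I would try (not from the original answer; it reuses the existing sftp client and the names from the question instead of creating a second SFTPClient from the same transport):
import io

results_text = str(sftp_results_of_import)
results_file_name = 'results' + client_file_name

# write directly to a remote file via the existing sftp client
with sftp.open(results_file_name, mode='w') as remote_f:
    remote_f.write(results_text)

# or build the content in memory first and upload it with putfo()
sftp.putfo(io.BytesIO(results_text.encode('utf-8')), results_file_name)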
