pyarrow hdfs reads more data than requested

I am using pyarrow's HdfsFilesystem interface. When I call a read on n bytes, I often get 0%-300% more data sent over the network. My suspicion is that pyarrow is reading ahead.
The pyarrow parquet reader doesn't have this behavior, and I am looking for a way to turn off read ahead for the general HDFS interface.
I am running on Ubuntu 14.04. This issue is present in pyarrow 0.10 through 0.13 (the newest released version). I am on Python 2.7.
I have been using Wireshark to track the packets passed over the network.
I suspect it is read-ahead because the first read takes much longer than the second read.
The regular pyarrow reader
import pyarrow as pa
fs = pa.hdfs.connect(hostname)
file_path = 'dataset/train/piece0000'
f = fs.open(file_path)
f.seek(0)
n_bytes = 3000000
f.read(n_bytes)
Parquet code without the same issue
import pyarrow.parquet as pq
parquet_path = 'dataset/train/parquet/part-22e3'
pf = fs.open(parquet_path)
pqf = pq.ParquetFile(pf)
data = pqf.read_row_group(0, columns=['col_name'])

Discussed in the JIRA ticket: https://issues.apache.org/jira/browse/ARROW-5432
A read_at function is being added to the pyarrow API that will allow you to read a given number of bytes at a given offset, with no read-ahead.

Related

Parquet File re-write has slightly larger size in both Pandas / PyArrow

So I am trying to read a parquet file into memory, choose chunks of the file and upload it to AWS S3 Bucket. I want to write sanity tests to check if a file was uploaded correctly through either size check or MD5 hash check between the local and cloud file on the bucket.
One thing I noticed is that reading a file into memory, either as bytes or pd.DataFrame / Table, and then re-writing the same object into a new file would change the file size, in my case increasing it compared to the original. Here's some sample code:
import pandas as pd
df = pd.read_parquet("data/example.parquet")
Then I simply write:
from io import BytesIO
buffer = BytesIO()
df.to_parquet(buffer) # this can be done straight without BytesIO. I use it for clarity.
with open('copy.parquet', 'wb') as f:
    f.write(buffer.getvalue())
Now using ls -l on both files gives me different sizes:
37089 Oct 28 16:57 data/example.parquet
37108 Dec 7 14:17 copy.parquet
Interestingly enough, I tried using a tool such as xxd paired with diff, and to my surprise the binary differences were scattered all across the file, so I think it's safe to assume that this is not limited to a metadata difference. Reloading both files into memory using pandas gives me the same table. It might also be worth mentioning that the parquet file contains both NaN and NaT values. Unfortunately I cannot share the file, but I will see if I can replicate the behavior with a small sample.
I also tried using Pyarrow's file reading functionality which resulted in the same file size:
import pyarrow as pa
import pyarrow.parquet as pq
with open('data/example.parquet', 'rb') as f:
    buffer = pa.BufferReader(f.read())
table = pq.read_table(buffer)
pq.write_table(table, 'copy.parquet')
I have also tried turning on the compression='snappy' in both versions, but it did not change the output.
Is there some configuration I'm missing when writing back to disk?

Pandas uses pyarrow to read/write parquet so it is unsurprising that the results are the same. I am not sure what clarity using buffers gives compared to saving the files directly so I have left it out in the code below.
What was used to write the example file? If it was not pandas but, for example, pyarrow directly, that would show up mostly as a metadata difference, since pandas adds its own schema on top of the normal Arrow metadata.
You say that is not the case here, though, so the likely reason is that the file was written by another system with a different version of pyarrow. As Michael Delgado mentioned in the comments, snappy compression is turned on by default, and snappy is not deterministic between systems:
not across library versions (and possibly not even across architectures)
This explains why you see differences all over the file. You can try the code below to see that, on the same machine, the MD5 is the same for the files written with pyarrow, while the pandas version is larger because of the added metadata.
Currently the Arrow S3 writer does not check for integrity, but the S3 API has such functionality. I have opened an issue to make this accessible via Arrow.
import pandas as pd
import pyarrow as pa
import numpy as np
import pyarrow.parquet as pq
arr = pa.array(np.arange(100))
table = pa.Table.from_arrays([arr], names=["col1"])
pq.write_table(table, "original.parquet")
pd_copy = pd.read_parquet("original.parquet")
copy = pq.read_table("original.parquet")
pq.write_table(copy, "copy.parquet")
pd_copy.to_parquet("pd_copy.parquet")
$ md5sum original.parquet copy.parquet pd_copy.parquet
fb70a5b1ca65923fec01a54f85f17260 original.parquet
fb70a5b1ca65923fec01a54f85f17260 copy.parquet
dcb93cb89426a948e885befdbee204ff pd_copy.parquet
1092 copy.parquet
1092 original.parquet
2174 pd_copy.parquet
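For the upload sanity check the question asks about, the comparison only makes sense against the exact bytes that were uploaded, not against a re-written copy. Below is a minimal sketch of a size-plus-MD5 comparison between two local files using only the standard library. The helper names file_md5 and same_bytes are made up for illustration; how you obtain the digest of the remote object is left open.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True only if both files have the same size and the same MD5 digest."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_md5(path_a) == file_md5(path_b)
```

Reading in chunks keeps memory use flat, so the same helper works for large parquet files.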

How to read HDF5 files in R without the memory error?

Goal
Read the data component of an HDF5 file in R.
Problem
I am using rhdf5 to read HDF5 files in R. Out of 75 files, it successfully read 61. For the rest it throws a memory error, even though some of these files are shorter than files that were read successfully.
I have tried running individual files in a fresh R session, but get the same error.
Following is an example:
# Exploring the contents of the file:
library(rhdf5)
h5ls("music_0_math_0_simple_12_2022_08_08.hdf5")
  group                name       otype  dclass       dim
0     /                data   H5I_GROUP
1 /data           ACC_State H5I_DATASET INTEGER     1 x 1
2 /data    ACC_State_Frames H5I_DATASET INTEGER         1
3 /data         ACC_Voltage H5I_DATASET   FLOAT 24792 x 1
4 /data AUX_CACC_Adjust_Gap H5I_DATASET INTEGER 24792 x 1
... CONTINUES ----
# Reading the file:
rhdf5::h5read("music_0_math_0_simple_12_2022_08_08.hdf5", name = "data")
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter.
In addition: Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
Error: Error in h5checktype(). H5Identifier not valid.
I can read the file via python:
import h5py
filename = "music_0_math_0_simple_12_2022_08_08.hdf5"
hf = h5py.File(filename, "r")
hf.keys()
data = hf.get('data')
data['SCC_Follow_Info']
#<HDF5 dataset "SCC_Follow_Info": shape (9, 24792), type "<f4">
How can I successfully read the file in R?

When you ask to read the data group, rhdf5 will read all the underlying datasets into R's memory. It's not clear from your example exactly how much data this is, but maybe for some of your files it really is more than the available memory on your computer. I don't know how Python works under the hood, but perhaps it doesn't do any reading of datasets until you run data['SCC_Follow_Info']?
One option is to be more selective: rather than reading the entire data group, read only the specific dataset you're interested in at that moment. In the Python example that seems to be /data/SCC_Follow_Info.
You can do that with something like:
follow_info <- h5read(file = "music_0_math_0_simple_12_2022_08_08.hdf5",
                      name = "/data/SCC_Follow_Info")
Once you've finished working with that dataset remove it from your R session e.g. rm(follow_info) and read the next dataset or file you need.

Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex library.
I tried to convert the CSV to an HDF5 file, to make it readable for the vaex library:
import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time()
print("Time:",(end-start),"Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?

I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
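One way to read that huge number (an interpretation, not something the error message states): it is exactly 2^64 - 1, which is how -1 appears when printed as an unsigned 64-bit integer. That would suggest the underlying write call returned an error code rather than actually writing that many bytes. A quick check with the standard library:

```python
import errno
import os

reported = 18446744073709551615  # "bytes actually written" from the error message

# The value is exactly 2**64 - 1, the unsigned 64-bit representation of -1.
assert reported == 2**64 - 1

# Reinterpret the unsigned value as a signed 64-bit integer.
signed = reported - 2**64 if reported >= 2**63 else reported
print(signed)  # -1

# errno = 22 from the same message is the code for "Invalid argument".
print(errno.errorcode[22], "-", os.strerror(22))
```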
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
As for your question about an alternative way to convert a CSV to HDF5: why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV. (You can also use loadtxt() if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading an 84 GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternatively, you can use csv.DictReader(). It reads a line at a time, so you avoid memory issues, but loading the HDF5 file could be very slow.
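As a middle ground between loading everything and handling one row at a time, the rows from csv.DictReader() can be grouped into fixed-size batches. This is a minimal standard-library sketch; the function name iter_csv_batches and the sample columns are made up for illustration, and writing each batch to HDF5 is left out.

```python
import csv
import io
from itertools import islice


def iter_csv_batches(lines, batch_size):
    """Yield lists of row dicts, with at most batch_size rows per list."""
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory stand-in for a large CSV file (hypothetical columns).
sample = io.StringIO("Year,Origin,Dest\n2001,LAX,JFK\n2002,LAX,JFK\n2003,SFO,JFK\n")
batch_sizes = [len(batch) for batch in iter_csv_batches(sample, batch_size=2)]
print(batch_sizes)  # [2, 1]
```

With a real file, pass open(path, newline="") instead of the StringIO object; only one batch is held in memory at a time.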
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
import numpy as np
import h5py
csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows. The problem is the columns (field names). It turns out the data isn't that clean; there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered that many rows only have values for the first 56 fields. In other words, fields 57-111 are not very useful. One solution is to add the usecols=() parameter. The code below reflects this modification and works with this test file. (I have not tried testing with your large file airline.csv. Given its size, you will likely need to read and load it incrementally.)
import numpy as np
import h5py
csv_name = 'airline_2m'
# Only read the first 56 columns; the remaining fields are mostly empty.
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes',
                        usecols=tuple(range(56)))
with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)

I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types", and pandas (used underneath vaex's read_csv or from_csv) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of like a database).
So how do you get around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has 100+ columns and that's annoying, but that is also kind of the price to pay when using a format such as CSV.
Another thing I noticed is the encoding. Using pure pandas.read_csv failed at some point because of encoding, and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact, if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like the following (this is pseudo code, but I hope close to the real thing):
import pandas as pd
import vaex
# `file` (path to the CSV) and `dtype` (column name -> type mapping) are placeholders to fill in.
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
    # Assert or check or do whatever needs doing to ensure column types are as they should be
    # Pass the data to vaex (this does not take extra RAM):
    df_vaex = vaex.from_pandas(df_tmp)
    # Export this chunk into HDF5
    df_vaex.export_hdf5(f'chunk_{i}.hdf5')
# When the above loop finishes, just concat and export the data to a single file if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen a potentially much better and faster way of doing this with vaex, but it is not released yet (I saw it in the code repo on GitHub), so I will not go into it. If you can install from source and want me to elaborate further, feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In the last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so you can just export to hdf5/arrow directly and it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats

Reading large Parquet file from SFTP with Pyspark is slow

I'm having an issue reading data (Parquet) from an SFTP server with SQLContext.
The Parquet file is quite large (6M rows).
I found some solutions to read it, but it takes almost 1 hour.
Below is the script that works but is too slow.
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem
fs = SFTPFileSystem(host = SERVER_SFTP, port = SERVER_PORT, username = USER, password = PWD)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem = fs)
When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?
df = sqlContext.read.parquet('PATH/file')
Things that I tried: using an SFTP library to open the file, but that seems to lose all the advantages of SparkSQL.
df = sqlContext.read.parquet(sftp.open('PATH/file'))
I also tried to use spark-sftp library, following this article without success: https://www.jitsejan.com/sftp-with-spark

fsspec uses Paramiko under the hood, and this is a known problem with Paramiko:
Reading file opened with Python Paramiko SFTPClient.open method is slow
In fsspec, it does not seem to be possible to change the buffer size.
But you can derive your own implementation from SFTPFileSystem that does:
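from fsspec.implementations.sftp import SFTPFileSystem
class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        # Depending on the fsspec version, the buffering keyword may be block_size rather than bufsize.
        return super().open(path, mode, bufsize=32768)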

After adding the buffer_size parameter to the pyarrow.parquet read_table call, the run time went from 51 to 21 minutes :)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem = fs, buffer_size = 32768)
Thanks @Martin Prikryl for your help ;)

writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

I have looked at many related answers here on Stack Overflow, and this question seems the most related: How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar; however, I want to compress the file when I send it to the SFTP location, so I essentially end up with a .csv.gz file. The files I am working with are 15-40 MB uncompressed, but sometimes there are lots of them, so I need to keep the footprint small.
I have been using code like this to move the dataframe to the destination, after pulling it from another location as a csv, doing some transformations on the data itself:
fileList = source_sftp.listdir('/Inbox/')
dataList = []
for item in fileList: # for each file in the list...
    print(item)
    if item[-3:] == u'csv':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item)) # read the csv directly from the sftp server into a pd Dataframe
    elif item[-3:] == u'zip':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='zip')
    elif item[-3:] == u'.gz':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='gzip')
    else:
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='infer')
    dataList.append(temp) # keep each
#... Some transformations in here on the data
FL = [(x.replace('.csv',''))+suffix # just swap out to suffix
      for x in fileList]
locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0
for item in dataList:
    with dest_sftp.open(locpath + FL[i], 'w') as f:
        item.to_csv(f, index=False,compression='gzip')
    i = i+1
It seems like I should be able to get this to work, but I am guessing something is being skipped when I use to_csv to convert the dataframe back and compress it on the SFTP file object. Should I be streaming this somehow, or is there a solution I am missing somewhere in the documentation for pysftp or pandas?
If I can avoid saving the CSV file somewhere local first, I would like to, and I don't think I should have to, right? I am able to get a compressed file in the end if I just save it locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip'), and after transferring this local file to the destination it is still compressed, so I don't think it has to do with the transfer, just with how pandas.DataFrame.to_csv and pysftp.Connection.open are used together.
I should probably add that I still consider myself a newbie to much of Python, but I have been working with local to sftp and sftp to local, and have not had to do much in the way of transferring (directly or indirectly) between them.

Make sure you have the latest version of Pandas.
It supports the compression with a file-like object since 0.24 only:
GH21227: df.to_csv ignores compression when provided with a file handle
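If upgrading is not an option, one possible workaround (not part of the answer above) is to do the compression yourself and write the resulting bytes to the file handle. The sketch below uses only the standard library: io.BytesIO stands in for the remote handle (for example one returned by dest_sftp.open(path, 'wb')), and the CSV text is built with the csv module where df.to_csv(index=False) would be used with pandas. The helper name write_gzipped_text is made up for illustration.

```python
import csv
import gzip
import io


def write_gzipped_text(fileobj, text, encoding="utf-8"):
    """Gzip-compress text and write it to a binary file-like object."""
    # GzipFile writes the gzip stream into fileobj and leaves fileobj open on exit.
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        gz.write(text.encode(encoding))


# Build some CSV text; with pandas this would be df.to_csv(index=False).
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["id", "value"])
writer.writerows([[1, "a"], [2, "b"]])
csv_text = csv_buffer.getvalue()

# io.BytesIO is a stand-in for the remote SFTP file handle (an assumption).
remote = io.BytesIO()
write_gzipped_text(remote, csv_text)

# The bytes written form a valid gzip stream that round-trips to the CSV text.
assert gzip.decompress(remote.getvalue()).decode("utf-8") == csv_text
```

Because the compression happens before the bytes reach the handle, this does not depend on how pandas treats the compression argument for file objects.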
