Reading a Feather file from a URL in Python

I am using an API Gateway proxy for S3 to read Feather files. Below is the simplest form of the code I am using.
import pandas as pd
s3_data = pd.read_feather('https://<api_gateway>/<bucket_name>/data.feather')
This gives an error:
reader = _feather.FeatherReader(source, use_memory_map=memory_map)
File "pyarrow\_feather.pyx", line 75, in pyarrow._feather.FeatherReader.__cinit__
File "pyarrow\error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 114, in pyarrow.lib.check_status
OSError: Verification of flatbuffer-encoded Footer failed.
If I keep the feather file on my local machine and read it like below, everything works fine.
s3_data = pd.read_feather("file://localhost//C://Users//<Username>//Desktop//data.feather")
How do I make this work?

Maybe the gateway proxy does some redirection that makes it fail. I would have done something like this:
import pandas as pd
from s3fs import S3FileSystem

fs = S3FileSystem(anon=True)
with fs.open("<bucket>/data.feather") as f:
    df = pd.read_feather(f)
s3fs is part of the Dask ecosystem. There are also other similar filesystem layers that you can use.
PS: if you are using Feather for long-term data storage, the Apache Arrow project (which maintains Feather) advises against it. You should probably use Parquet.
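If you do need to go through the HTTP endpoint, it is worth checking whether the gateway returns the Feather bytes intact: a truncated or transcoded response would explain the failed footer verification. A minimal sketch, assuming the URL serves the raw binary file (the placeholders are the same hypothetical ones as in the question) and that the file fits in memory:
import io

import pandas as pd
import requests

# Download the raw bytes and wrap them in a seekable in-memory buffer
resp = requests.get('https://<api_gateway>/<bucket_name>/data.feather')
resp.raise_for_status()
s3_data = pd.read_feather(io.BytesIO(resp.content))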

Related

Reading a large Parquet file from SFTP with PySpark is slow

I'm having some issues reading data (Parquet) from an SFTP server with SQLContext.
The Parquet file is quite large (6M rows).
I found some solutions to read it, but it's taking almost an hour.
Below is the script that works, but it is too slow.
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem
fs = SFTPFileSystem(host = SERVER_SFTP, port = SERVER_PORT, username = USER, password = PWD)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem=fs)
When the data is not on an SFTP server, I use the code below, which usually works well even with large files. So how can I use SparkSQL to read a remote file over SFTP?
df = sqlContext.read.parquet('PATH/file')
Things that I tried: using an SFTP library to open the file, but that seems to lose all the advantages of SparkSQL.
df = sqlContext.read.parquet(sftp.open('PATH/file'))
I also tried to use the spark-sftp library, following this article, without success: https://www.jitsejan.com/sftp-with-spark
fsspec uses Paramiko under the hood, and this is a known problem with Paramiko:
Reading file opened with Python Paramiko SFTPClient.open method is slow
In fsspec, it does not seem to be possible to change the buffer size.
But you can derive your own implementation from SFTPFileSystem that does:
class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        return super().open(path, mode, bufsize=32768)
By also passing the buffer_size parameter to pyarrow.parquet, the computation time went from 51 to 21 minutes :)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem=fs, buffer_size=32768)
Thanks @Martin Prikryl for your help ;)
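For completeness, a minimal sketch of how the two pieces above might fit together, using the subclass from this answer and the same hypothetical host, credential, and path placeholders as in the question:
import pyarrow.parquet as pq
from fsspec.implementations.sftp import SFTPFileSystem

# Buffered variant of the SFTP filesystem, as defined in the answer above
class BufferedSFTPFileSystem(SFTPFileSystem):
    def open(self, path, mode='rb'):
        return super().open(path, mode, bufsize=32768)

fs = BufferedSFTPFileSystem(host=SERVER_SFTP, port=SERVER_PORT,
                            username=USER, password=PWD)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet",
                   filesystem=fs, buffer_size=32768)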

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.
In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes).
In my case, I have an 82GB file and 288 workers (12 physical nodes; there's an HDFS data node on each).
On all 12 nodes, I can access HDFS and execute a simple Python script that displays info on a file:
import pyarrow as pa
fs = pa.hdfs.connect([url], 8022)
print(str(fs.info('/path/to/file.csv')))
If I create a single-node cluster (only 24 workers) using only the machine running Dask Scheduler, I can read the .csv from HDFS and print the length:
import dask
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
dask.config.set(hdfs_backend='pyarrow')
df = dd.read_csv('hdfs://[url]:8022/path/to/file.csv')
df = client.persist(df)
print(str(len(df)))
That last line gives 1046250873 (nice!) and takes 3m17s to run.
However, when I use the full cluster, that last line calling len(df) dies and I get this error:
KilledWorker: ("('pandas_read_text-read-block-from-delayed-9ad3beb62f0aea4a07005d5c98749d7e', 1201)", 'tcp://[url]:42866')
This is similar to an issue mentioned here which has a solution here involving Dask Yarn and a config (?) that looks like: worker_env={'ARROW_LIBHDFS_DIR': ...}
However, I'm not using Yarn, although my guess is that the Dask Workers are somehow not configured with the HDFS/Arrow paths they need in order to connect.
I don't see any documentation on this, hence my question as to what I'm missing.
Edit:
Here's the error traceback I'm seeing in the output of the Dask Workers:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95N\x05\x00\x00\x00\x00\x00\x00(\x8c\x14dask.dataframe.utils\x94\x8c\ncheck_meta\x94\x93\x94(\x8c\x12dask
.compatibility\x94\x8c\x05apply\x94\x93\x94\x8c\x15dask.dataframe.io.csv\x94\x8c\x10pandas_read_text\x94\x93\x94]\x94(\x8c\x11pandas.io.parsers\x94\x8c\x08read_csv\x94\x93\x94(
\x8c\x0fdask.bytes.core\x94\x8c\x14read_block_from_file\x94\x93\x94h\r\x8c\x08OpenFile\x94\x93\x94(\x8c\x12dask.bytes.pyarrow\x94\x8c\x17PyArrowHadoopFileSystem\x94\x93\x94)\x8
1\x94}\x94(\x8c\x02fs\x94\x8c\x0cpyarrow.hdfs\x94\x8c\x10HadoopFileSystem\x94\x93\x94(\x8c\r10.255.200.91\x94MV\x1fNN\x8c\x07libhdfs\x94Nt\x94R\x94\x8c\x08protocol\x94\x8c\x04h
dfs\x94ub\x8c\x1a/path/to/file.csv\x94\x8c\x02rb\x94NNNt\x94R\x94K\x00J\x00\x90\xd0\x03C\x01\n\x94t\x94C\x12animal,weight,age\n\x94\x8c\x08builtins\x94\x8c\x04dict\x94
\x93\x94]\x94\x86\x94h*]\x94(]\x94(\x8c\x06animal\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff
\xff\xff\xffK?t\x94be]\x94(\x8c\x06weight\x94h2\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94be]\x94(\x8c\x03age\x94h<e
e\x86\x94]\x94(h/h9h#eeh*]\x94(]\x94(\x8c\x0cwrite_header\x94\x89e]\x94(\x8c\x07enforce\x94\x89e]\x94(\x8c\x04path\x94Nee\x86\x94t\x94\x8c\x11pandas.core.frame\x94\x8c\tDataFra
me\x94\x93\x94)\x81\x94}\x94(\x8c\x05_data\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)\x81\x94(]\x94(\x8c\x18pandas.core.indexes.base\x94\x8c\n_new_In
dex\x94\x93\x94hW\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94h0\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01
b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h5\x89]\x94(h/h9h#et\x94b\x8c\x04name\x94Nu\x86\x94R\x94hY\x8c\x19pandas.core.indexes.range\x94\x8c\nRangeIndex\x94\x93\x94}\x94(hjN\x8c\x0
5start\x94K\x00\x8c\x04stop\x94K\x00\x8c\x04step\x94K\x01u\x86\x94R\x94e]\x94(h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02K\x00\x86\x94h<\x89C\x00\x94t\x94bh`hbK\x00\x85\x94hd\x
87\x94R\x94(K\x01K\x01K\x00\x86\x94h5\x89]\x94t\x94be]\x94(hYh[}\x94(h]h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02\x85\x94h5\x89]\x94(h9h#et\x94bhjNu\x86\x94R\x94hYh[}\x94(h]h`
hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x01\x85\x94h5\x89]\x94h/at\x94bhjNu\x86\x94R\x94e}\x94\x8c\x060.14.1\x94}\x94(\x8c\x04axes\x94hV\x8c\x06blocks\x94]\x94(}\x94(\x8c\x06valu
es\x94hy\x8c\x08mgr_locs\x94h(\x8c\x05slice\x94\x93\x94K\x01K\x03K\x01\x87\x94R\x94u}\x94(h\x9dh\x7fh\x9eh\xa0K\x00K\x01K\x01\x87\x94R\x94ueust\x94b\x8c\x04_typ\x94\x8c\tdatafr
ame\x94\x8c\t_metadata\x94]\x94ub\x8c\x0cfrom_delayed\x94t\x94.'
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
File "/usr/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 38, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
Again, I can successfully read a file from HDFS with pyarrow from any of the 12 nodes.
Looking at the traceback, my guess is that PyArrow isn't correctly installed on the worker nodes. I might ask on the PyArrow issue tracker to see if they can help you diagnose that traceback.
Ho boy! After building libhdfs3 from scratch, deploying it to part of the cluster, and getting the exact same result (ImportError: Can not find the shared library: libhdfs3.so), I realized the issue is that I've been starting the Dask workers via pssh, so they aren't picking up the environment variables they should.
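A quick way to confirm this kind of environment mismatch is to ask every worker what it actually sees. A minimal sketch, assuming a dask.distributed Client connected to the scheduler (the scheduler address is hypothetical; JAVA_HOME, ARROW_LIBHDFS_DIR and HADOOP_HOME are the variables the pyarrow HDFS driver typically relies on):
import os
from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')  # hypothetical scheduler address

def check_env():
    # Return the JVM/libhdfs-related variables as each worker process sees them
    return {k: os.environ.get(k) for k in ('JAVA_HOME', 'ARROW_LIBHDFS_DIR', 'HADOOP_HOME')}

# client.run executes the function on every worker and returns a dict keyed by worker address
print(client.run(check_env))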

Unseekable HTTP Response

I am accessing some .mat files hosted on a SimpleHTTP server and trying to load them using SciPy's loadmat:
from scipy.io import loadmat
import requests
r = requests.get(requestURL,stream=True)
print loadmat(r.raw)
However, I am getting an error
io.UnsupportedOperation: seek
After looking around it seems that r.raw is a file object which is not seekable, and therefore I can't use loadmat on it. I tried the solution here (modified slightly to work with Python 2.7), but maybe I did something wrong because when I try
seekfile = SeekableHTTPFile(requestURL,debug=True)
print 'closed', seekfile.closed
It tells me that the file is closed, so I can't use loadmat on that either. I can provide the code for my modified version if that would be helpful, but I was hoping there is a better solution to my problem.
I can't copy/download the .mat file because I don't have write permission on the server where my script is hosted. So I'm looking for a way to get a seekable file object that loadmat can use.
Edit: attempt using StringIO
import StringIO
stringfile = StringIO.StringIO()
stringfile.write(repr(r.raw.read())) # repr needed because of null characters
loaded = loadmat(stringfile)
stringfile.close()
r.raw.close()
results in:
ValueError: Unknown mat file type, version 48, 92
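The usual fix for this class of problem is to read the whole response into a seekable in-memory buffer rather than passing the raw stream to loadmat (note that repr() in the attempt above also mangles the binary data, which is why loadmat reports a nonsense version). A minimal sketch, assuming the URL returns the raw .mat bytes and the file fits in memory:
import io

import requests
from scipy.io import loadmat

r = requests.get(requestURL)
r.raise_for_status()
# r.content is the full binary payload; BytesIO wraps it in a seekable file-like object
data = loadmat(io.BytesIO(r.content))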

Using PySpark to read a JSON file directly from a website

Is it possible to use sqlContext to read a JSON file directly from a website?
For instance, I can read a file like this:
myRDD = sqlContext.read.json("sample.json")
but I get an error when I try something like this:
myRDD = sqlContext.read.json("http://192.168.0.13:9200/sample.json")
I'm using Spark 1.4.1
Thanks in advance!
It is not possible. The paths you use should point either to the local file system or to another file system supported by Hadoop. As long as sample.json has the expected format (a single object per line), you can try something like this:
import json
import requests
r = requests.get("http://192.168.0.13:9200/sample.json")
df = sqlContext.createDataFrame([json.loads(line) for line in r.iter_lines()])
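An alternative sketch along the same lines, assuming a SparkContext is available as sc: download the file in the driver but let Spark parse the JSON by distributing the lines as an RDD of JSON strings (sqlContext.jsonRDD is the pre-2.0 API that accepts such an RDD):
import requests

r = requests.get("http://192.168.0.13:9200/sample.json")
# One JSON object per line; Spark parses the JSON instead of the driver
lines = sc.parallelize(r.text.splitlines())
df = sqlContext.jsonRDD(lines)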

Python read file as stream from HDFS

Here is my problem: I have a file in HDFS which can potentially be huge (i.e. too big to fit in memory).
What I would like to do is avoid having to cache this file in memory, and instead process it line by line as I would with a regular file:
for line in open("myfile", "r"):
    # do some processing
I am looking to see if there is an easy way to get this done right without using external libraries. I can probably make it work with libpyhdfs or python-hdfs, but if possible I'd like to avoid introducing new dependencies and untested libraries into the system, especially since both of these don't seem heavily maintained and state that they shouldn't be used in production.
I was thinking of doing this with the standard "hadoop" command-line tools and the Python subprocess module, but I can't seem to do what I need, since there are no command-line tools that would do my processing and I would like to execute a Python function for every line, in a streaming fashion.
Is there a way to apply Python functions as right operands of the pipes using the subprocess module? Or even better, open it like a file as a generator so I could process each line easily?
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
If there is another way to achieve what I described above without using an external library, I'm open to that as well.
Thanks for any help !
You want xreadlines; it reads lines from a file without loading the whole file into memory.
Edit:
Now that I see your question, you just need to get the stdout pipe from your Popen object:
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:
    print line
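To get the "open it like a file as a generator" behaviour asked about above, the pipe can be wrapped in a small helper. A minimal sketch using only the standard library (the HDFS path is the same placeholder used above):
import subprocess

def hdfs_lines(path):
    # Yield lines of an HDFS file by streaming the output of `hadoop fs -cat`
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        for line in cat.stdout:
            yield line
    finally:
        cat.stdout.close()
        cat.wait()

for line in hdfs_lines("/path/to/myfile"):
    # do some processing
    pass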
If you want to avoid adding external dependencies at any cost, Keith's answer is the way to go. Pydoop, on the other hand, could make your life much easier:
import pydoop.hdfs as hdfs
with hdfs.open('/user/myuser/filename') as f:
    for line in f:
        do_something(line)
Regarding your concerns, Pydoop is actively developed and has been used in production for years at CRS4, mostly for computational biology applications.
Simone
In the last two years, there has been a lot of movement around Hadoop Streaming. It is pretty fast, according to Cloudera: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ I've had good success with it.
You can use the WebHDFS Python library (built on top of urllib3):
from json import dump  # pickle.dump also works, per the comment below
from hdfs import InsecureClient

client_hdfs = InsecureClient('http://host:port', user='root')
with client_hdfs.write(access_path) as writer:
    dump(records, writer)  # tested for pickle and json (doesn't work for joblib)
Or you can use the requests package in Python:
import requests
from json import dumps
params = (('op', 'CREATE'),
          ('buffersize', 256))
data = dumps(file) # some file or object - also tested for pickle library
response = requests.put('http://host:port/path', params=params, data=data) # response 200 = successful
Hope this helps!
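Since the question is about reading rather than writing, here is a sketch of the reading side with the same hdfs library; the connection details and path are hypothetical, and the delimiter argument is what makes read() yield one line at a time instead of the whole file:
from hdfs import InsecureClient

client_hdfs = InsecureClient('http://host:port', user='root')
# Streaming read: with a delimiter set, read() yields chunks split on newlines
with client_hdfs.read('/path/to/myfile', encoding='utf-8', delimiter='\n') as reader:
    for line in reader:
        do_something(line)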
