I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.
In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes).
In my case, I have an 82GB file and 288 workers (12 physical nodes; there's an HDFS data node on each).
On all 12 nodes, I can access HDFS and execute a simple Python script that displays info on a file:
import pyarrow as pa
fs = pa.hdfs.connect([url], 8022)
print(str(fs.info('/path/to/file.csv')))
If I create a single-node cluster (only 24 workers) using only the machine running Dask Scheduler, I can read the .csv from HDFS and print the length:
import dask
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
dask.config.set(hdfs_backend='pyarrow')
df = dd.read_csv('hdfs://[url]:8022/path/to/file.csv')
df = client.persist(df)
print(str(len(df)))
That last line gives 1046250873 (nice!) and takes 3m17s to run.
However, when I use the full cluster, that last line calling len(df) dies and I get this error:
KilledWorker: ("('pandas_read_text-read-block-from-delayed-9ad3beb62f0aea4a07005d5c98749d7e', 1201)", 'tcp://[url]:42866')
This is similar to an issue mentioned here, which has a solution here involving Dask-Yarn and a config option that looks like: worker_env={'ARROW_LIBHDFS_DIR': ...}
However, I'm not using Yarn, although my guess is that the Dask Workers are somehow not configured with the HDFS/Arrow paths they need in order to connect.
I don't see any documentation on this, hence my question as to what I'm missing.
Edit:
Here's the error traceback I'm seeing in the output of the Dask Workers:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95N\x05\x00\x00\x00\x00\x00\x00(\x8c\x14dask.dataframe.utils\x94\x8c\ncheck_meta\x94\x93\x94(\x8c\x12dask
.compatibility\x94\x8c\x05apply\x94\x93\x94\x8c\x15dask.dataframe.io.csv\x94\x8c\x10pandas_read_text\x94\x93\x94]\x94(\x8c\x11pandas.io.parsers\x94\x8c\x08read_csv\x94\x93\x94(
\x8c\x0fdask.bytes.core\x94\x8c\x14read_block_from_file\x94\x93\x94h\r\x8c\x08OpenFile\x94\x93\x94(\x8c\x12dask.bytes.pyarrow\x94\x8c\x17PyArrowHadoopFileSystem\x94\x93\x94)\x8
1\x94}\x94(\x8c\x02fs\x94\x8c\x0cpyarrow.hdfs\x94\x8c\x10HadoopFileSystem\x94\x93\x94(\x8c\r10.255.200.91\x94MV\x1fNN\x8c\x07libhdfs\x94Nt\x94R\x94\x8c\x08protocol\x94\x8c\x04h
dfs\x94ub\x8c\x1a/path/to/file.csv\x94\x8c\x02rb\x94NNNt\x94R\x94K\x00J\x00\x90\xd0\x03C\x01\n\x94t\x94C\x12animal,weight,age\n\x94\x8c\x08builtins\x94\x8c\x04dict\x94
\x93\x94]\x94\x86\x94h*]\x94(]\x94(\x8c\x06animal\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff
\xff\xff\xffK?t\x94be]\x94(\x8c\x06weight\x94h2\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94be]\x94(\x8c\x03age\x94h<e
e\x86\x94]\x94(h/h9h#eeh*]\x94(]\x94(\x8c\x0cwrite_header\x94\x89e]\x94(\x8c\x07enforce\x94\x89e]\x94(\x8c\x04path\x94Nee\x86\x94t\x94\x8c\x11pandas.core.frame\x94\x8c\tDataFra
me\x94\x93\x94)\x81\x94}\x94(\x8c\x05_data\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)\x81\x94(]\x94(\x8c\x18pandas.core.indexes.base\x94\x8c\n_new_In
dex\x94\x93\x94hW\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94h0\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01
b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h5\x89]\x94(h/h9h#et\x94b\x8c\x04name\x94Nu\x86\x94R\x94hY\x8c\x19pandas.core.indexes.range\x94\x8c\nRangeIndex\x94\x93\x94}\x94(hjN\x8c\x0
5start\x94K\x00\x8c\x04stop\x94K\x00\x8c\x04step\x94K\x01u\x86\x94R\x94e]\x94(h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02K\x00\x86\x94h<\x89C\x00\x94t\x94bh`hbK\x00\x85\x94hd\x
87\x94R\x94(K\x01K\x01K\x00\x86\x94h5\x89]\x94t\x94be]\x94(hYh[}\x94(h]h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02\x85\x94h5\x89]\x94(h9h#et\x94bhjNu\x86\x94R\x94hYh[}\x94(h]h`
hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x01\x85\x94h5\x89]\x94h/at\x94bhjNu\x86\x94R\x94e}\x94\x8c\x060.14.1\x94}\x94(\x8c\x04axes\x94hV\x8c\x06blocks\x94]\x94(}\x94(\x8c\x06valu
es\x94hy\x8c\x08mgr_locs\x94h(\x8c\x05slice\x94\x93\x94K\x01K\x03K\x01\x87\x94R\x94u}\x94(h\x9dh\x7fh\x9eh\xa0K\x00K\x01K\x01\x87\x94R\x94ueust\x94b\x8c\x04_typ\x94\x8c\tdatafr
ame\x94\x8c\t_metadata\x94]\x94ub\x8c\x0cfrom_delayed\x94t\x94.'
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
File "/usr/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 38, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
Again, I can use pyarrow to successfully read a file from HDFS from any of the 12 nodes.
Looking at the traceback, my guess is that PyArrow isn't correctly installed on the worker nodes. I would ask on the PyArrow issue tracker to see if they can help you diagnose that traceback.
Ho boy! After building libhdfs3 from scratch, deploying it to part of the cluster, and getting the exact same result (ImportError: Can not find the shared library: libhdfs3.so), I realized the issue is that I've been starting the Dask workers via pssh, so they aren't picking up the environment variables they need.
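For anyone hitting the same thing, here's a minimal diagnostic sketch (assuming a Client already connected to the scheduler; the variable names are just the ones libjvm/libhdfs typically need) that shows what each worker actually sees:
import os
from dask.distributed import Client

client = Client('[scheduler-address]:8786')  # assumed scheduler address

def check_env():
    # report the variables pyarrow's HDFS driver needs in order to find libjvm/libhdfs
    return {k: os.environ.get(k) for k in ('JAVA_HOME', 'HADOOP_HOME', 'ARROW_LIBHDFS_DIR')}

# client.run executes the function on every worker and returns {worker_address: result}
print(client.run(check_env))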
Related
I am using api gateway proxy for s3 to read feather files. Below is the simplest form of the code I am using.
import pandas as pd
s3_data = pd.read_feather('https://<api_gateway>/<bucket_name>/data.feather')
This gives an error -
reader = _feather.FeatherReader(source, use_memory_map=memory_map)
File "pyarrow\_feather.pyx", line 75, in pyarrow._feather.FeatherReader.__cinit__
File "pyarrow\error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 114, in pyarrow.lib.check_status
OSError: Verification of flatbuffer-encoded Footer failed.
If I keep the feather file on my local and read it like below, all works fine.
s3_data=pd.read_feather("file://localhost//C://Users//<Username>//Desktop//data.feather")
How do I make this work?
Maybe the gateway proxy needs to do some redirection, and that is what makes it fail. I would have done something like this:
import pandas as pd
from s3fs import S3FileSystem

fs = S3FileSystem(anon=True)
with fs.open("<bucket>/data.feather") as f:
    df = pd.read_feather(f)
s3fs is part of Dask. There are also other similar layers that you can use.
PS: if you are using feather for long-term data storage, the Apache Arrow project advises against it (according to the maintainer of feather). You should probably use Parquet instead.
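For example, a minimal sketch of reading Parquet through the same s3fs layer (the object key below is just a placeholder):
import pandas as pd
from s3fs import S3FileSystem

fs = S3FileSystem(anon=True)
# pandas.read_parquet accepts a file-like object, so the same pattern as above works
with fs.open("<bucket>/data.parquet") as f:
    df = pd.read_parquet(f)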
※ I used Google Translate; if you have any questions, let me know!
I am trying to run a Python script on 4 huge datasets using SageMaker Processing. My current situation is as follows:
the script runs with 3 of the datasets
it fails with just 1 dataset (the biggest one, which has the same structure as the others)
for all 4 datasets the script itself finishes, so I suspect the error happens in S3, i.e. when the SageMaker results are copied to S3
The error I got is this InternalServerError.
Traceback (most recent call last):
File "sagemaker_train_and_predict.py", line 56, in <module>
outputs=outputs
File "{xxx}/sagemaker_constructor.py", line 39, in run
outputs=outputs
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/processing.py", line 408, in run
self.latest_job.wait(logs=logs)
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/processing.py", line 723, in wait
self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/session.py", line 3111, in logs_for_processing_job
self._check_job_status(job_name, description, "ProcessingJobStatus")
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/session.py", line 2615, in _check_job_status
actual_status=status,
sagemaker.exceptions.UnexpectedStatusException: Error for Processing job sagemaker-vm-train-and-predict-2020-04-12-04-15-40-655: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
There may be some issue transferring the output data to S3 if the output is generated at a high rate and the size is too large.
You can 1) try to slow down writing the output a bit, or 2) call S3 from your algorithm container to upload the output directly using the boto client (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html).
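For option 2, a minimal sketch using boto3 from inside the processing container (the local path, bucket, and key below are placeholders, not your actual values):
import boto3

s3 = boto3.client("s3")

# upload the result file directly, instead of relying on the ProcessingOutput copy step
s3.upload_file(
    Filename="/opt/ml/processing/output/result.csv",  # placeholder local path inside the container
    Bucket="my-output-bucket",                         # placeholder bucket name
    Key="outputs/result.csv",                          # placeholder object key
)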
I have a client sending me Snappy-compressed Hadoop sequence files for analysis. What I ultimately want to do is put this data into a pandas DataFrame. The format looks like the following:
>>> body_read
b'SEQ\x06!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable\x01\x01)org.apache.hadoop.io.compress.SnappyCodec\x00\x00\x00\x00\x0b\xabZ\x92f\xceuAf\xa1\x9a\xf0-\x1d2D\xff\xff\xff\xff\x0b\xabZ\x92f\xceuAf\xa1\x9a\xf0-\x1d2D\x8e\x05^N\x00\x00\x05^\x00\x00\x00F\xde\n\x00\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00r\x01\x00\x04\x00\x00\x00\x00\x8e\x08\x92\x00\x00\x10\x1a\x00\x00\x08\x8a\x9a h\x8e\x02\xd6\x8e\x02\xc6\x8e\x02T\x8e\x02\xd4\x8e\x02\xdb\x8e\x02\xd8\x8e\x02\xdf\x8e\x02\xd9\x8e\x02\xd3\x05\x0c0\xd9\x8e\x02\xcc\x8e\x02\xfc\x8e\x02\xe8\x8e\x02\xd0\x05!\x00\xdb\x05\x06\x0c\xd1\x8e\x02\xd7\x05\'\x04\xde\x8e\x01\x03\x18\xce\x8e\x02\xe7\x8e\x02\xd2\x05<\x00\xd4\x05\x1b\x04\xdc\x8e
I think what I need to do is first decompress the file using python-snappy, and then read the sequence files. I'm not sure what the best method is for reading Hadoop sequence files in Python. I am also getting an error when trying to decompress this file:
>>> body_decomp = snappy.uncompress(body_read)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/snappy/snappy.py", line 91, in uncompress
return _uncompress(data)
snappy.UncompressError: Error while decompressing: invalid input
What do I need to do in order to read these files?
Thanks to @cricket_007's helpful comments and some more digging, I was able to solve this. PySpark will accomplish the tasks I need, and it can read Hadoop sequence files directly from S3 locations, which is great. The tricky part was setting up PySpark, and I found this guide really helpful once I had downloaded Apache Spark - https://markobigdata.com/2017/04/23/manipulating-files-from-s3-with-apache-spark/.
One weird discrepancy I have though is that my spark-shell automatically decompresses the file:
scala> val fRDD = sc.textFile("s3a://bucket/file_path")
fRDD: org.apache.spark.rdd.RDD[String] = s3a://bucket/file_path MapPartitionsRDD[5] at textFile at <console>:24
scala> fRDD.first()
res4: String = SEQ?!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable??)org.apache.hadoop.io.compress.SnappyCodec???? �Z�f�uAf���- 2D���� �Z�f�uAf���- 2D�?^N???^???F�
but PySpark does not:
>>> from pyspark import SparkContext, SparkConf
>>> sc = SparkContext()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/06 23:00:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> fRDD = sc.textFile("s3a://bucket/file_path")
>>> fRDD.first()
'SEQ\x06!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable\x01\x01)org.apache.hadoop.io.compress.SnappyCodec\x00\x00\x00\x00\x0b�Z�f�uAf���-\x1d2D����\x0b�Z�f�uAf���-\x1d2D�\x05^N\x00\x00\x05^\x00\x00\x00F�'
Any ideas how I get PySpark to do this?
EDIT: Thanks again to cricket_007, I started using .sequenceFile() instead. This was initially giving me the error
>>> textFile = sc.sequenceFile("s3a://bucket/file_path")
18/02/07 18:13:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
I was able to solve that issue by following this guide - https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html. I am now able to read the sequence file and parse the protobuf message:
>>> seqs = sc.sequenceFile("s3a://bucket/file_path").values()
>>> feed = protobuf_message_pb2.feed()
>>> row = bytes(seqs.first())
>>> feed.ParseFromString(row)
>>> feed.user_id_64
3909139888943208259
This is exactly what I needed. What I want to do now is find an efficient way to parse the entire sequence file and turn it into a DataFrame, rather than doing it one record at a time as I have done above.
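Here is the direction I'm leaning towards - a rough sketch, not tested at scale; any field beyond user_id_64 is a placeholder, and the generated protobuf module has to be importable on the executors:
from pyspark.sql import Row, SparkSession
import protobuf_message_pb2  # the generated protobuf module used above

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def parse_record(raw):
    # decode one protobuf-encoded value into a Row of the fields we care about
    feed = protobuf_message_pb2.feed()
    feed.ParseFromString(bytes(raw))
    return Row(user_id_64=feed.user_id_64)

rows = sc.sequenceFile("s3a://bucket/file_path").values().map(parse_record)
df = spark.createDataFrame(rows)   # Spark DataFrame over all records
pandas_df = df.toPandas()          # pull into pandas only if it fits in memory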
I have a sample data set on my local machine and I'm trying to do some basic operations on a cluster.
import dask.dataframe as ddf
from dask.distributed import Client

client = Client('IP address of the scheduler')
csvdata = ddf.read_csv('Path to the CSV file')
The client is connected to a scheduler, which in turn is connected to two workers (on other machines).
My questions may be pretty trivial.
Should this csv file be present on other worker nodes?
I seem to get file not found errors.
Using:
futures = client.scatter(csvdata)
x = ddf.from_delayed([future], meta=df)
# Price is a column in the data
df.Price.sum().compute(get=client.get)  # returns "dd.Scalar<series-..., dtype=float64>" - how do I access it?
client.submit(sum, x.Price)  # returns "distributed.utils - ERROR - 6dc5a9f58c30954f77913aa43c792cc8"
Also, I did refer to
Loading local file from client onto dask distributed cluster and http://distributed.readthedocs.io/en/latest/manage-computation.html
I think I'm mixing up a lot of things here and my understanding is muddled.
Any help would be really appreciated.
Yes, dask.dataframe is assuming that the files you refer to in your client code are also accessible by your workers. If this is not the case then you will have to read in your data explicitly on your local machine and scatter it out to your workers.
It looks like you're trying to do exactly this, except that you're scattering dask dataframes rather than pandas dataframes. You will actually have to concretely load the pandas data from disk before you scatter it. If your data fits in memory then you should be able to do exactly what you're doing now, but replace the dd.read_csv call with pd.read_csv:
import pandas as pd

csvdata = pd.read_csv('Path to the CSV file')
[future] = client.scatter([csvdata])
x = ddf.from_delayed([future], meta=csvdata).repartition(npartitions=10).persist()
# Price is a column in the data
x.Price.sum().compute(get=client.get)  # should return a single number
If your data is too large then you might consider using dask locally to read and scatter data to your cluster piece by piece.
import dask
import dask.dataframe as dd

ddf = dd.read_csv('filename')
# compute with the single-threaded local scheduler: each partition is read locally and scattered to the cluster
futures = ddf.map_partitions(lambda part: client.scatter([part])[0]).compute(get=dask.get)
ddf = dd.from_delayed(list(futures), meta=ddf.meta)
Attempting to batch create nodes & relationships - batch creation is failing - traceback at the end of the post.
Note: the code works with a smaller subset of nodes, but fails once it gets into a massive number of relationships; it's unclear at what limit this occurs.
Wondering if I need to increase the ulimit above 40,000 open files.
I read somewhere that people were running into XStream issues with the REST API while doing batch creates - it's unclear whether the problem is on the py2neo end, in the Neo4j server tuning/configuration, or on the Python end.
Any guidance would be greatly appreciated.
One cluster within the data set ends up with around 625,525 relationships across 700+ nodes.
Total relationships will be 1M+. Hardware is an Apple MacBook Pro Retina (x86_64) running Ubuntu 13.04, with an SSD and 8GB of memory.
Neo4j: configured auto_indexing & auto_relationships set to ON
Nodes clustered/grouped via Python Pandas DataFrame.groupby()
Nodes: contain 3 properties
Relationship properties: 1 -> IN & OUT relationships created
ulimit set to 40,000 files open
Code
https://github.com/alienone/OSINT/blob/master/MANDIANTAPT/spitball.py
Operating System: Ubuntu 13.04
Python version: 2.7.5
py2neo Version: 1.5.1
Java version: 1.7.0_25-b15
Neo4j version: Community Edition 1.9.2
Traceback
Traceback (most recent call last):
File "/home/alienone/Programming/Python/OSINT/MANDIANTAPT/spitball.py", line 63, in
main()
File "/home/alienone/Programming/Python/OSINT/MANDIANTAPT/spitball.py", line 59, in main
graph_db.create(*sorted_nodes)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 420, in create
return batch.submit()
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 2123, in submit
for response in self._submit()
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 2092, in submit
for id, request in enumerate(self.requests)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 428, in _send
return self._client().send(request)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 365, in send
return Response(request.graph_db, rs.status, request.uri, rs.getheader("Location", None), rs_body)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 279, in init
raise SystemError(body)
SystemError: None
Process finished with exit code 1
I had a similar issue. One way to deal with it is to do the batch.submit() for chunks of your data rather than the whole data set. This is slower, of course, but splitting one million nodes into chunks of 5000 is still faster than adding every node separately.
I use a small helper class to do this; note that all my nodes are indexed: https://gist.github.com/anonymous/6293739
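For completeness, the chunking itself can be as bare-bones as this sketch (reusing the graph_db and sorted_nodes names from the question; 5000 is just the chunk size mentioned above):
def chunks(items, size=5000):
    # yield successive fixed-size slices of the full list of abstract nodes/relationships
    for i in range(0, len(items), size):
        yield items[i:i + size]

for chunk in chunks(sorted_nodes, 5000):
    graph_db.create(*chunk)  # one smaller batch submit per chunk instead of one huge batch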