Compressed Hadoop Sequence Files Python

I have a client sending me Snappy-compressed Hadoop sequence files for analysis. What I ultimately want to do is put this data into a pandas DataFrame. The format looks like the following:
>>> body_read
b'SEQ\x06!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable\x01\x01)org.apache.hadoop.io.compress.SnappyCodec\x00\x00\x00\x00\x0b\xabZ\x92f\xceuAf\xa1\x9a\xf0-\x1d2D\xff\xff\xff\xff\x0b\xabZ\x92f\xceuAf\xa1\x9a\xf0-\x1d2D\x8e\x05^N\x00\x00\x05^\x00\x00\x00F\xde\n\x00\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00\xfe\x01\x00r\x01\x00\x04\x00\x00\x00\x00\x8e\x08\x92\x00\x00\x10\x1a\x00\x00\x08\x8a\x9a h\x8e\x02\xd6\x8e\x02\xc6\x8e\x02T\x8e\x02\xd4\x8e\x02\xdb\x8e\x02\xd8\x8e\x02\xdf\x8e\x02\xd9\x8e\x02\xd3\x05\x0c0\xd9\x8e\x02\xcc\x8e\x02\xfc\x8e\x02\xe8\x8e\x02\xd0\x05!\x00\xdb\x05\x06\x0c\xd1\x8e\x02\xd7\x05\'\x04\xde\x8e\x01\x03\x18\xce\x8e\x02\xe7\x8e\x02\xd2\x05<\x00\xd4\x05\x1b\x04\xdc\x8e
I think what I need to do is first decompress the file using python-snappy and then read the sequence files. I'm not sure what the best method is for reading Hadoop sequence files in Python. I am also getting an error when trying to decompress this file:
>>> body_decomp = snappy.uncompress(body_read)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/snappy/snappy.py", line 91, in uncompress
return _uncompress(data)
snappy.UncompressError: Error while decompressing: invalid input
What do I need to do in order to read these files?

Thanks to cricket_007's helpful comments and some more digging, I was able to solve this. PySpark will accomplish the tasks that I need, and it can read Hadoop sequence files directly from S3 locations, which is great. The tricky part was setting up PySpark, and I found this guide really helpful once I had downloaded Apache Spark: https://markobigdata.com/2017/04/23/manipulating-files-from-s3-with-apache-spark/.
One weird discrepancy I have, though, is that my spark-shell automatically decompresses the file:
scala> val fRDD = sc.textFile("s3a://bucket/file_path")
fRDD: org.apache.spark.rdd.RDD[String] = s3a://bucket/file_path MapPartitionsRDD[5] at textFile at <console>:24
scala> fRDD.first()
res4: String = SEQ?!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable??)org.apache.hadoop.io.compress.SnappyCodec???? �Z�f�uAf���- 2D���� �Z�f�uAf���- 2D�?^N???^???F�
but PySpark does not:
>>> from pyspark import SparkContext, SparkConf
>>> sc = SparkContext()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/06 23:00:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> fRDD = sc.textFile("s3a://bucket/file_path")
>>> fRDD.first()
'SEQ\x06!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritable\x01\x01)org.apache.hadoop.io.compress.SnappyCodec\x00\x00\x00\x00\x0b�Z�f�uAf���-\x1d2D����\x0b�Z�f�uAf���-\x1d2D�\x05^N\x00\x00\x05^\x00\x00\x00F�'
Any ideas how I get PySpark to do this?
EDIT: Thanks again to cricket_007, I started using .sequenceFile() instead. This was initially giving me an error:
>>> textFile = sc.sequenceFile("s3a://bucket/file_path")
18/02/07 18:13:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
I was able to solve that issue by following this guide: https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html. I am now able to read the sequence file and parse the protobuf message:
>>> seqs = sc.sequenceFile("s3a://bucket/file_path").values()
>>> feed = protobuf_message_pb2.feed()
>>> row = bytes(seqs.first())
>>> feed.ParseFromString(row)
>>> feed.user_id_64
3909139888943208259
This is exactly what I needed. What I want to do now is find an efficient way to parse the entire sequence file and turn it into a DataFrame, rather than doing it one record at a time as I have done above.
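One approach that seems promising (a sketch only, not tested against this data; the module name protobuf_message_pb2 and the user_id_64 field come from above, while the SparkSession setup and the single-column schema are assumptions): parse each BytesWritable value with the generated protobuf class inside a map, build a Spark DataFrame from the results, and convert to pandas on the driver.
from pyspark.sql import SparkSession
import protobuf_message_pb2  # the generated protobuf module used above

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def parse_record(raw):
    # Decode one BytesWritable value into the protobuf message and pull out the needed fields
    feed = protobuf_message_pb2.feed()
    feed.ParseFromString(bytes(raw))
    return (feed.user_id_64,)

rows = sc.sequenceFile("s3a://bucket/file_path").values().map(parse_record)
df = spark.createDataFrame(rows, ["user_id_64"])  # Spark DataFrame, one column per extracted field
pandas_df = df.toPandas()                         # pandas DataFrame, if the result fits on the driver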

Related

"pd.read_excel(filename, sheet_name=None" causes UserWarning: Slicer List extension is not supported

I noticed this message only today and could not find any mention of it in the pandas documentation.
I use a simple way to load all sheets into a dictionary of DataFrames:
filename = "data.xlsx"
sheets_dict = pd.read_excel(filename, sheet_name=None)
and it started to cause the warning shown below.
Is it a bug, or should I start using a different method?
If it's not a bug, please advise which option to use.
openpyxl\worksheet\_reader.py:312: UserWarning: Slicer List extension is not supported and will be removed
It's a warning from openpyxl. If you use the default engine in pandas to load each Excel file, the warning goes away, but the time it takes per file goes up to 5 or even 6 seconds. You can safely ignore these warnings.
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='openpyxl')
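Putting the two together, a minimal sketch that suppresses the openpyxl warning before loading every sheet:
import warnings
import pandas as pd

# Silence openpyxl's "Slicer List extension is not supported" UserWarning
warnings.filterwarnings('ignore', category=UserWarning, module='openpyxl')

sheets_dict = pd.read_excel("data.xlsx", sheet_name=None)  # dict of DataFrames, one per sheet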

Read OpenAir File using Python GDAL

I need to read OpenAir files in Python.
According to the following vector driver description, GDAL has built-in OpenAir functionality:
https://gdal.org/drivers/vector/openair.html
However, there is no example code for reading such OpenAir files.
So far I have tried to read a sample file using the following lines:
from osgeo import gdal
airspace = gdal.Open('export.txt')
However, it returns the following error:
ERROR 4: `export.txt' not recognized as a supported file format.
I already looked at vectorio however no OpenAir functionality has been implemented.
Why do I get the error above?
In case anyone wants to reproduce the problem: sample OpenAir files can easily be generated using XContest:
https://airspace.xcontest.org/
Since you're dealing with vector data, you need to use ogr instead of gdal (it's normally packaged along with gdal).
So you can do:
from osgeo import ogr
ds = ogr.Open('export.txt')
layer = ds.GetLayer(0)
featureCount = layer.GetFeatureCount()
print(featureCount)
There's plenty of info out there on using ogr, but this cookbook might be helpful.
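For example, a short sketch (the attribute fields depend on the file, so feature.items() is used here just to dump whatever attributes the OpenAir driver exposes):
from osgeo import ogr

ds = ogr.Open('export.txt')  # the OpenAir file from the question
layer = ds.GetLayer(0)

# Iterate over the airspace features and print their attributes and geometry type
for feature in layer:
    print(feature.items(), feature.GetGeometryRef().GetGeometryName())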

Read files from hdfs - pyspark

I am new to PySpark. When I execute the code below, I get an AttributeError.
I am using Apache Spark 2.4.3.
t=spark.read.format("hdfs:\\test\a.txt")
t.take(1)
I expect the output to be 1, but it throws an error:
AttributeError: 'DataFrameReader' object has no attribute 'take'
You're not using the API properly:
format is used to specify the input data source format you want
Here, you're reading a text file, so all you have to do is:
t = spark.read.text("hdfs://test/a.txt")
t.collect()
See related doc
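For reference, the same read expressed with format()/load(), which is what format is actually meant for (same path assumption as above):
t = spark.read.format("text").load("hdfs://test/a.txt")  # format names the source, load supplies the path
t.show(1)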

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.
In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes).
In my case, I have an 82GB file and 288 workers (12 physical nodes; there's an HDFS data node on each).
On all 12 nodes, I can access HDFS and execute a simple Python script that displays info on a file:
import pyarrow as pa
fs = pa.hdfs.connect([url], 8022)
print(str(fs.info('/path/to/file.csv')))
If I create a single-node cluster (only 24 workers) using only the machine running Dask Scheduler, I can read the .csv from HDFS and print the length:
import dask
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
dask.config.set(hdfs_backend='pyarrow')
df = dd.read_csv('hdfs://[url]:8022/path/to/file.csv')
df = client.persist(df)
print(str(len(df)))
That last line gives 1046250873 (nice!) and takes 3m17s to run.
However, when I use the full cluster, that last line calling len(df) dies and I get this error:
KilledWorker: ("('pandas_read_text-read-block-from-delayed-9ad3beb62f0aea4a07005d5c98749d7e', 1201)", 'tcp://[url]:42866')
This is similar to an issue mentioned here which has a solution here involving Dask Yarn and a config (?) that looks like: worker_env={'ARROW_LIBHDFS_DIR': ...}
However, I'm not using Yarn, although my guess is that the Dask Workers are somehow not configured with the HDFS/Arrow paths they need in order to connect.
I don't see any documentation on this, hence my question as to what I'm missing.
Edit:
Here's the error traceback I'm seeing in the output of the Dask Workers:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95N\x05\x00\x00\x00\x00\x00\x00(\x8c\x14dask.dataframe.utils\x94\x8c\ncheck_meta\x94\x93\x94(\x8c\x12dask
.compatibility\x94\x8c\x05apply\x94\x93\x94\x8c\x15dask.dataframe.io.csv\x94\x8c\x10pandas_read_text\x94\x93\x94]\x94(\x8c\x11pandas.io.parsers\x94\x8c\x08read_csv\x94\x93\x94(
\x8c\x0fdask.bytes.core\x94\x8c\x14read_block_from_file\x94\x93\x94h\r\x8c\x08OpenFile\x94\x93\x94(\x8c\x12dask.bytes.pyarrow\x94\x8c\x17PyArrowHadoopFileSystem\x94\x93\x94)\x8
1\x94}\x94(\x8c\x02fs\x94\x8c\x0cpyarrow.hdfs\x94\x8c\x10HadoopFileSystem\x94\x93\x94(\x8c\r10.255.200.91\x94MV\x1fNN\x8c\x07libhdfs\x94Nt\x94R\x94\x8c\x08protocol\x94\x8c\x04h
dfs\x94ub\x8c\x1a/path/to/file.csv\x94\x8c\x02rb\x94NNNt\x94R\x94K\x00J\x00\x90\xd0\x03C\x01\n\x94t\x94C\x12animal,weight,age\n\x94\x8c\x08builtins\x94\x8c\x04dict\x94
\x93\x94]\x94\x86\x94h*]\x94(]\x94(\x8c\x06animal\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff
\xff\xff\xffK?t\x94be]\x94(\x8c\x06weight\x94h2\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94be]\x94(\x8c\x03age\x94h<e
e\x86\x94]\x94(h/h9h#eeh*]\x94(]\x94(\x8c\x0cwrite_header\x94\x89e]\x94(\x8c\x07enforce\x94\x89e]\x94(\x8c\x04path\x94Nee\x86\x94t\x94\x8c\x11pandas.core.frame\x94\x8c\tDataFra
me\x94\x93\x94)\x81\x94}\x94(\x8c\x05_data\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)\x81\x94(]\x94(\x8c\x18pandas.core.indexes.base\x94\x8c\n_new_In
dex\x94\x93\x94hW\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94h0\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01
b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h5\x89]\x94(h/h9h#et\x94b\x8c\x04name\x94Nu\x86\x94R\x94hY\x8c\x19pandas.core.indexes.range\x94\x8c\nRangeIndex\x94\x93\x94}\x94(hjN\x8c\x0
5start\x94K\x00\x8c\x04stop\x94K\x00\x8c\x04step\x94K\x01u\x86\x94R\x94e]\x94(h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02K\x00\x86\x94h<\x89C\x00\x94t\x94bh`hbK\x00\x85\x94hd\x
87\x94R\x94(K\x01K\x01K\x00\x86\x94h5\x89]\x94t\x94be]\x94(hYh[}\x94(h]h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02\x85\x94h5\x89]\x94(h9h#et\x94bhjNu\x86\x94R\x94hYh[}\x94(h]h`
hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x01\x85\x94h5\x89]\x94h/at\x94bhjNu\x86\x94R\x94e}\x94\x8c\x060.14.1\x94}\x94(\x8c\x04axes\x94hV\x8c\x06blocks\x94]\x94(}\x94(\x8c\x06valu
es\x94hy\x8c\x08mgr_locs\x94h(\x8c\x05slice\x94\x93\x94K\x01K\x03K\x01\x87\x94R\x94u}\x94(h\x9dh\x7fh\x9eh\xa0K\x00K\x01K\x01\x87\x94R\x94ueust\x94b\x8c\x04_typ\x94\x8c\tdatafr
ame\x94\x8c\t_metadata\x94]\x94ub\x8c\x0cfrom_delayed\x94t\x94.'
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
File "/usr/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 38, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
Again, I can use pyarrow to successfully read a file from HDFS from any of the 12 nodes.
Looking at the traceback, my guess is that PyArrow isn't correctly installed on the worker nodes. I might ask on the PyArrow issue tracker to see if they can help you diagnose that traceback.
Ho boy! After building libhdfs3 from scratch, deploying it to part of the cluster, and finding the exact same result (ImportError: Can not find the shared library: libhdfs3.so), I realized the issue is that I've been starting the Dask workers via pssh, so they aren't picking up the environment variables they should.
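One quick way to confirm that kind of problem (a sketch, assuming a running Client and a placeholder scheduler address): run a small function on every worker and check whether the environment variables pyarrow's HDFS driver needs are actually set.
import os
from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder address

def hdfs_env():
    # Variables pyarrow needs to locate the JVM and libhdfs on each worker
    return {k: os.environ.get(k) for k in ('JAVA_HOME', 'ARROW_LIBHDFS_DIR')}

print(client.run(hdfs_env))  # one dict per worker; missing values point at how the workers were launched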

Error: Line magic function

I'm trying to read a file using Python and I keep getting this error:
ERROR: Line magic function `%user_vars` not found.
My code is very basic just
names = read_csv('Combined data.csv')
names.head()
I get this any time I try to read or open a file. I tried using this thread for help:
ERROR: Line magic function `%matplotlib` not found
I'm using Enthought Canopy and I have IPython version 2.4.1. I made sure to update following the IPython installation page. I'm not sure what's wrong, because it should be very simple to open/read files. I even get this error when opening text files.
EDIT:
I imported traceback and used
print(traceback.format_exc())
But all I get is None printed. I'm not sure what that means.
Looks like you are using pandas. Try the following (assuming your CSV file is in the same directory as your script), entering it one line at a time if you are using the IPython shell:
import pandas as pd
names = pd.read_csv('Combined data.csv')
names.head()
