HBase on AWS EMR slow to retrieve data - python

I am running an AWS EMR cluster with HBase installed. I followed these instructions for setting up the cluster, using S3 as the HBase datastore. The cluster is up and running, and I am able to SSH in and use the HBase shell with no problems.
The data we are trying to store is genomic data and very wide. For each row-key, there can be up to 250,000 column keys. We have experimented with different numbers of column families, from grouping all the keys in 1 column family, to using 42 different column families with the column keys spread out amongst them.
To interact with HBase, we are using happybase in Python, which uses Thrift to communicate with the primary node. Retrieving a single row-key takes around 2.7 s to return the result; I was expecting millisecond retrieval times for this type of operation. Our configuration is very simple, with no additional optimizations done. We are trying to decide if HBase is the right fit for our database needs, but given the slow retrieval times, we are leaning away from it.
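For context, the retrieval we are timing is essentially the following happybase call (host and table names are placeholders):

import happybase

# Thrift server runs on the EMR primary node (default port 9090).
connection = happybase.Connection("emr-primary-node", port=9090)
table = connection.table("genomic_data")

# Single row-key fetch: this is the call that takes ~2.7 s for us.
row = table.row(b"sample-row-key")                       # all column families
subset = table.row(b"sample-row-key", columns=[b"cf1"])  # restrict to one family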
I am aware that other large industry players use HBase for their needs, and I was wondering if anyone knows what we can try to optimize performance. While these times aren't terrible, the application will eventually need to put thousands of row-keys and retrieve thousands of row-keys across all columns, and given the scaling we have seen so far, that would be untenable for our needs.
I have minimal experience with distributed NoSQL technologies like HBase so any suggestions or help would be appreciated.
Cluster setup:
1 Master node, 3 Core nodes
m4.large instances
Things we have tried:
Adjusting number of column families
Using HDFS instead of S3 as the datastore

Related

Firestore read and update taking 1-12 sec to complete in Python

I am using the Firebase Python client to write data to Firestore. Any read/write operation takes at least 1 second to complete. The Firestore DB is in us-central and our server is in Singapore.
Is that what is causing the issue?
For reads, I use a where query with a limit, like below.
collection_ref.where(
    u"field", u"==", u"field_value"
).limit(1).get()
During write, I use set and update(dict)
Sometimes the lag is around 10 to 12 sec
Did anyone face similar issues?
Any pointers will be appreciated
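For reference, the set/update pattern described above looks roughly like this (collection, document, and field names are made up):

from google.cloud import firestore

db = firestore.Client()
doc_ref = db.collection(u"my_collection").document(u"doc_id")

doc_ref.set({u"field": u"field_value", u"counter": 1})   # create / overwrite the document
doc_ref.update({u"counter": 2})                          # partial update of existing fields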
This article on why Cloud Firestore queries are slow lists the following reasons:
If you are downloading a bunch of data, you probably don't need all of it. The solution is to limit how much comes back.
Your offline cache is too big. Cloud Firestore does some amazing offline caching, but the local cache does not apply the same indexes the server does. This means that when you query documents in the offline cache, Firestore needs to unpack every document stored locally for the collection being queried and compare it against your query. The solution is to limit how much data is stored in the offline cache.
Without a composite index, Firestore would have to do a lot of searching to get the result set. Instead, create a composite index so Firestore can do a quick lookup.
You are used to the Realtime Database. The Realtime Database generally has lower latency; you are not really going to notice the difference, but if your app is sensitive to every bit of latency you are probably better off using the Realtime Database in these scenarios.
The laws of physics are keeping you down. Your customer might be too far away from your Firestore database, and the raw network latency is simply too long. To work around this, use realtime listeners, a technique called latency compensation.
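As a rough sketch of that last suggestion, using the Python client and the same made-up names as the query above:

from google.cloud import firestore

db = firestore.Client()
query = db.collection(u"my_collection").where(u"field", u"==", u"field_value").limit(1)

def on_snapshot(col_snapshot, changes, read_time):
    # Fires from the local cache first, then again once the server responds.
    for doc in col_snapshot:
        print(doc.id, doc.to_dict())

watch = query.on_snapshot(on_snapshot)
# watch.unsubscribe()  # stop listening when done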

Extracting data continuously from RDS MySQL schemas in parallel

I have got a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to make it a data lake for analytics purposes. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in near real-time, capturing the DML operations periodically.
There may arise the question of using dedicated services like Data Migration or a copy activity provided by AWS, but I can't use them, since the plan is to keep the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned that it doesn't support JDBC as a source in Structured Streaming. I have read about multi-threading and multiprocessing techniques in Python but still have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes; a rough sketch is shown below). The data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not strictly sequential and may skip a few numbers where rows were removed due to the inactivity of the corresponding entity, say customers. It is not necessary to pull every column of a record; only a few columns, predefined in the configuration, are pulled. The solution must be reliable, sustainable, and automatable.
Now, I'm quite unsure which approach to use and how to implement the solution once decided. Hence, I seek the help of people who have dealt with this or know of a solution to this problem statement. I'm happy to provide more info if it is required to get to the right solution. Any help on this would be greatly appreciated.
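A minimal sketch of the daemon-thread idea mentioned above, assuming SQLAlchemy + PyMySQL for the source connection and pandas + s3fs/pyarrow for writing Parquet to S3; schema, table, column, and bucket names are placeholders, and note this naive high-water-mark polling only captures new rows, not updates or deletes:

import threading
import time

import pandas as pd
from sqlalchemy import create_engine

SCHEMAS = ["schema_a", "schema_b", "schema_c"]      # placeholder schema names
TABLES = {"customers": ["id", "name", "email"]}     # selective columns per table (from config)
CYCLE_SECONDS = 300                                  # 5-minute cycle

def poll_schema(schema):
    engine = create_engine(f"mysql+pymysql://user:password@aurora-host:3306/{schema}")
    last_seen = {table: 0 for table in TABLES}       # high-water mark per table
    while True:
        for table, cols in TABLES.items():
            sql = (f"SELECT {', '.join(cols)} FROM {table} "
                   f"WHERE id > {last_seen[table]} ORDER BY id")
            df = pd.read_sql(sql, engine)
            if not df.empty:
                last_seen[table] = int(df["id"].max())
                # Requires s3fs + pyarrow; one file per schema/table/cycle.
                df.to_parquet(f"s3://my-datalake/{schema}/{table}/{int(time.time())}.parquet")
        time.sleep(CYCLE_SECONDS)

# One daemon thread per schema; the main thread must stay alive for them to keep running.
for schema in SCHEMAS:
    threading.Thread(target=poll_schema, args=(schema,), daemon=True).start()

while True:
    time.sleep(60)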

Multi-user efficient time-series storing for Django web app

I'm developing a Django app. Use-case scenario is this:
50 users, each one can store up to 300 time series, and each time series has around 7,000 rows.
Each user can ask at any time to retrieve all of their 300 time series and, for each of them, to perform some advanced data analysis on the last N rows. The data analysis cannot be done in SQL, only in Pandas, where it doesn't take much time... but retrieving 300,000 rows into separate dataframes does!
Users can also ask results of some analysis that can be performed in SQL (like aggregation+sum by date) and that is considerably faster (to the point where I wouldn't be writing this post if that was all of it).
Browsing and thinking around, I've figured storing time series in SQL is not a good solution (read here).
Ideal deployment architecture looks like this (each bucket is a separate server!): [diagram omitted]
Problem: time series in SQL are too slow to retrieve in a multi-user app.
Researched solutions (from this article):
PyStore: https://github.com/ranaroussi/pystore
Arctic: https://github.com/manahl/arctic
Here are some problems:
1) Although these solutions are massively faster at pulling a multi-million-row time series into a single dataframe, I might need to pull around 500,000 rows into 300 different dataframes. Would that still be as fast?
This is the current db structure I'm using:
class TimeSerie(models.Model):
    ...

class TimeSerieRow(models.Model):
    date = models.DateField()
    timeserie = models.ForeignKey(TimeSerie, on_delete=models.CASCADE)
    number = ...
    another_number = ...
And this is the bottleneck in my application:
for t in TimeSerie.objects.filter(user=user):
    q = TimeSerieRow.objects.filter(timeserie=t).order_by("date")
    q = q.filter( ... time filters ... )
    df = pd.DataFrame(q.values())
    # ... analysis on df
2) Even if PyStore or Arctic can do that faster, it would mean I lose the ability to decouple my DB from the Django instances, effectively using the resources of one machine much better but being stuck with only one and losing scalability (or having as many separate databases as machines). Can PyStore/Arctic avoid this and provide an adapter for remote storage?
Is there a Python/Linux solution that can solve this problem? Which architecture can I use to overcome it? Should I drop the scalability of my app and/or accept that for every N new users I'll have to spawn a separate database?
The article you refer to in your post is probably the best answer to your question. It is clearly well researched, and a few good solutions are proposed (don't forget to take a look at InfluxDB).
Regarding the decoupling of the storage solution from your instances, I don't see the problem:
Arctic uses MongoDB as a backing store
PyStore uses a file system as a backing store
InfluxDB is a database server in its own right
So as long as you decouple the backing store from your instances and make it shared among instances, you'll have the same setup as for your PostgreSQL database: MongoDB or InfluxDB can run on a separate centralised instance, and the file storage for PyStore can be shared, e.g. using a shared mounted volume. The Python libraries that access these stores of course run on your Django instances, like psycopg2 does.
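As a rough sketch of the Arctic route with a shared MongoDB host (hostname, library, and symbol names are made up):

import pandas as pd
from arctic import Arctic

# Connect to a shared MongoDB instance rather than a local one,
# so every Django/worker instance uses the same backing store.
store = Arctic("mongo-host:27017")
store.initialize_library("timeseries")      # one-off setup
library = store["timeseries"]

# Write one user's series under its own symbol, then read it back as a DataFrame.
df = pd.DataFrame(
    {"number": [1.0, 2.0], "another_number": [3.0, 4.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)
library.write("user_42/series_1", df)
df_back = library.read("user_42/series_1").data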

How to send a query or stored procedure execution request to a specific location/region of cosmosdb?

I'm trying to multi-thread some tasks that use Cosmos DB in order to optimize ETL time, and I can't figure out how, using the Python API (though I could use REST if required), to call a stored procedure twice for two partition keys and send each call to a different region (namely 'West Europe' and 'France Central').
I defined those as PreferredLocations in the connection policy, but I don't know how to attach to a query the instruction to route it to a specific location.
The only place you could specify that would be the options object of the request; however, there is nothing in it related to regions.
What you can do is initialize multiple clients that have a different order in the preferred locations and then spread the load that way in different regions.
However, unless your apps are deployed in those different regions and benefit from the lower latency, there is no point in doing so, since Cosmos DB will be able to cope with all the requests in a single region as long as you have the RUs it needs.
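For illustration, a sketch along the lines of that suggestion using the azure-cosmos v4 Python SDK; the account, database, container, and stored-procedure names are placeholders, and note that with a single write region the stored procedure still executes against the write region, so preferred locations mainly affect reads:

from azure.cosmos import CosmosClient

URL = "https://myaccount.documents.azure.com:443/"   # placeholder account
KEY = "<primary-key>"

# One client per preferred region; route each partition key's work
# through the client whose region should be tried first.
client_we = CosmosClient(URL, credential=KEY, preferred_locations=["West Europe"])
client_fc = CosmosClient(URL, credential=KEY, preferred_locations=["France Central"])

container_we = client_we.get_database_client("mydb").get_container_client("mycoll")
container_fc = client_fc.get_database_client("mydb").get_container_client("mycoll")

container_we.scripts.execute_stored_procedure("my_sproc", partition_key="pk-1")
container_fc.scripts.execute_stored_procedure("my_sproc", partition_key="pk-2")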

Extract a few million records from Teradata to Python (pandas)

I have data from 6 months of emails (email properties like send date and subject line, plus recipient details like age, gender, etc., altogether around 20 columns) in my Teradata table. It comes to around 20 million rows in total, which I want to bring into Python for further predictive modelling.
I tried to run the selection query using the 'pyodbc' connector, but it just runs for hours and hours. I then stopped it and modified the query to fetch just 1 month of data (maybe 3-4 million rows), but it still takes a very long time.
Is there any better (faster) option than 'pyodbc', or a different approach altogether?
Any input is appreciated. Thanks.
When communicating between Python and Teradata, I recommend using the teradata package (pip install teradata; https://developer.teradata.com/tools/reference/teradata-python-module). It leverages ODBC (or REST) to connect.
Besides this, you could use JDBC via JayDeBeApi. JDBC can sometimes be somewhat faster than ODBC.
Both options support the Python Database API Specification, so the rest of your code doesn't have to be touched; e.g. pandas.read_sql works fine with the connections above.
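For illustration, a minimal sketch of that setup; the system name, credentials, and the table/columns in the query are placeholders:

import teradata
import pandas as pd

# UdaExec handles configuration and logging for the teradata module.
udaExec = teradata.UdaExec(appName="email_extract", version="1.0", logConsole=False)

# ODBC under the hood; "tdprod" and the credentials are placeholders.
with udaExec.connect(method="odbc", system="tdprod",
                     username="user", password="pass") as session:
    df = pd.read_sql(
        "SELECT send_date, subject_line, age, gender "
        "FROM email_events WHERE send_date >= DATE '2018-01-01'",
        session,
    )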
Your performance issues most likely stem from other things:
network connectivity
Python (Pandas) memory handling
ad 1) throughput can only be replaced with more throughput
ad 2) you could try to do as much as possible within the database (feature engineering), and your local machine should have enough RAM ("pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset"). Maybe Apache Arrow can relieve some of your local RAM issues.
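On the RAM point, one option worth trying (a sketch, not from the answer above) is to stream the result set in chunks via pandas' chunksize argument so the full 20 million rows never sit in memory at once; connection details and the query are placeholders:

import teradata
import pandas as pd

udaExec = teradata.UdaExec(appName="email_extract_chunks", version="1.0", logConsole=False)
session = udaExec.connect(method="odbc", system="tdprod",
                          username="user", password="pass")

# read_sql with chunksize returns an iterator of DataFrames instead of one big frame.
row_count = 0
for chunk in pd.read_sql("SELECT send_date, age, gender FROM email_events",
                         session, chunksize=500_000):
    row_count += len(chunk)   # replace with real per-chunk aggregation / feature engineering
print(row_count)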
Check:
https://developer.teradata.com/tools/reference/teradata-python-module
http://wesmckinney.com/blog/apache-arrow-pandas-internals/
