Big Data Retrieval and Processing Python and PostgreSQL - python

Just for some background: I am developing a hotel data analytics dashboard, much like the one [here](https://my.infocaptor.com/free_data_visualization.php "D3 Builder"), using d3.js and dc.js (with crossfilter). It is a Django project and the database I am using is PostgreSQL. I am currently working on a universal bar graph series; it will eventually allow the user to choose the fields (from the data set provided) that they would like to see plotted against each other in a bar chart format.
My database consists of 10 million entries with 54 fields each (a single table). Retrieving the three fields used to plot the time-based bar chart takes over a minute. Processing the data in Python (altering column key names to match those of the universal bar chart) and putting the data into JSON format for the graph takes a further few minutes, which is unacceptable for my desired application.
Would it be possible to "parallelise" the querying of the database, and would this be faster than what I am doing currently (a normal query)? I have looked around a bit and not found much. Also, is there a library or optimised function I could use to parse my data into the desired format quickly?

I have worked with tables of a similar size. For what you are looking for, you would need to switch to something like a distributed Postgres environment, i.e. Greenplum, which has an MPP architecture and supports columnar storage. That is ideal for a table with a large number of columns and this table size.
http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
If you do not intend to switch to Greenplum, you can try table partitioning in your current Postgres database. Your dashboard queries should be written so that they query individual partitions; that way you end up querying much smaller partitions (tables) and the query time will be far faster.
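To make the partitioning idea concrete, here is a rough sketch driven from Python with psycopg2, assuming PostgreSQL 10+ declarative range partitioning; the table and column names are made up for illustration.

import psycopg2

conn = psycopg2.connect("dbname=hoteldata user=postgres")
cur = conn.cursor()

# Range-partition the large table by month on the date column the dashboard filters on.
cur.execute("""
    CREATE TABLE bookings (
        booking_id   bigint,
        booking_date date NOT NULL,
        room_rate    numeric
        -- ... the remaining fields ...
    ) PARTITION BY RANGE (booking_date);
""")
cur.execute("""
    CREATE TABLE bookings_2016_01 PARTITION OF bookings
        FOR VALUES FROM ('2016-01-01') TO ('2016-02-01');
""")

# A dashboard query constrained on the partition key is pruned to a single partition.
cur.execute("""
    SELECT booking_date, avg(room_rate)
    FROM bookings
    WHERE booking_date >= %s AND booking_date < %s
    GROUP BY booking_date;
""", ("2016-01-01", "2016-02-01"))
rows = cur.fetchall()
conn.commit()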

Related

Should we make a complex query in PySpark, or a simple one and then use .filter / .select?

I have a question. Suppose I run a Python script on the server where my data is stored. Which is the faster way to get a Spark DataFrame of my data:
Make a complex query with lots of conditions that returns exactly the DataFrame I need, or
Make a simple query and build the DataFrame I need with .filter / .select?
You can also assume that the DataFrame I need is small enough to fit in RAM.
Thanks
IIUC, everything depends on where you are reading the data from, so here are some scenarios.
DataSource: RDBMS (Oracle, Postgres, MySQL, ...)
If you want to read data from an RDBMS system, you have to establish a JDBC connection to the database and then fetch the results.
Remember that Spark is slow when fetching data from relational databases over JDBC, and it is recommended that you filter most of your records on the database side itself, as this keeps the amount of data transferred over the network to a minimum.
You can control the read speed with some tuning parameters, but it is still slow.
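For example, a rough sketch of pushing the filter into the database by passing a subquery as the dbtable option, and parallelising the fetch with the partition-read options; the connection URL, table and column names here are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-pushdown").getOrCreate()

# The subquery runs on the database side, so only the filtered rows and the
# needed columns travel over the network.
pushdown_query = "(SELECT id, category, amount FROM sales WHERE category = 'A') AS sales_a"

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", pushdown_query)
      .option("user", "reader")
      .option("password", "secret")
      # optional tuning: split the read into parallel fetches on a numeric column
      .option("partitionColumn", "id")
      .option("lowerBound", 1)
      .option("upperBound", 1000000)
      .option("numPartitions", 8)
      .load())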
DataSource: Redshift, Snowflake
In this scenario, if your cluster is large and relatively free, push the query down to the cluster itself. If you instead read the data over JDBC, it is also fast, because behind the scenes it unloads the data to a temporary location and Spark then reads it as a file source.
DataSource: Files
Always try to push down the filters; they are there for a reason, so that your cluster needs to do the minimum amount of work and you read only the required data.
The bottom line is that you should always try to push down the filters to the source to make your Spark jobs faster.
The key points to keep in mind are:
- Restrict/filter the data as much as possible while loading it into the DataFrame, so that only the needed data resides in the DataFrame.
- For non-file sources: filter the data at the source using a native filter and fetch only the needed columns (aim for minimum data transfer).
- For file sources: restricting/modifying the data at the file source is not feasible, so the first operation should be to filter the data once it is loaded.
- In complex operations, first perform the narrow transformations (filters, selecting only the needed columns) and then the wide transformations (joins, ordering) that involve a shuffle towards the end, so that less data is shuffled between worker nodes (see the sketch after this list).
The fewer the shuffles, the faster your end DataFrame will be.
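As a rough illustration of that ordering (the orders and customers DataFrames and their column names are just placeholders):

from pyspark.sql import functions as F

# Narrow transformations first: restrict rows and columns as early as possible.
orders_small = (orders
                .filter(F.col("order_date") >= "2020-01-01")
                .select("order_id", "customer_id", "amount"))
customers_small = customers.select("customer_id", "country")

# Wide transformations last: the join and ordering now shuffle far less data.
result = (orders_small
          .join(customers_small, on="customer_id", how="inner")
          .orderBy("amount", ascending=False))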
First of all, I think we should be careful when dealing with small data in our Spark programs; Spark was designed to give you parallel processing for big data.
Second, we have the Catalyst query optimizer and lazy evaluation, which are good tools that let Spark optimize everything expressed either as a SQL query or as API-call transformations.

partitionBy & overwrite strategy in an Azure DataLake using PySpark in Databricks

I have a simple ETL process in an Azure environment:
blob storage > data factory > data lake raw > databricks > data lake curated > data warehouse (main ETL).
The datasets for this project are not very big (~1 million rows, 20 columns, give or take); however, I would like to keep them properly partitioned in my data lake as Parquet files.
Currently I run some simple logic to figure out where in my lake each file should sit, based on business calendars.
The files look vaguely like this:
Year Week Data
2019 01 XXX
2019 02 XXX
I then partition a given file into the following format, replacing data that exists and creating new folders for new data:
curated ---
dataset --
Year 2019
- Week 01 - file.pq + metadata
- Week 02 - file.pq + metadata
- Week 03 - file.pq + metadata #(pre-existing file)
The metadata are the success and commit files that are auto-generated.
To this end I use the following query in PySpark 2.4.3:
pyspark_dataframe.write.mode('overwrite')\
.partitionBy('Year','Week').parquet('\curated\dataset')
Now if I use this command on its own, it will overwrite any existing data in the target partition, so Week 03 will be lost.
Using spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") seems to stop the issue and only overwrite the target files, but I wonder if this is the best way to handle files in my data lake?
I have also found it hard to find any documentation on this feature.
My first instinct was to loop over a single Parquet dataset and write each partition manually, which, although it gives me greater control, would be slow.
My next thought was to write each partition to a /tmp folder, move the Parquet files and then replace/create files as needed using the query above, then purge the /tmp folder while creating some sort of metadata log.
Is there a better way/method to do this?
Any guidance would be much appreciated.
The end goal here is to have a clean and safe area for all 'curated' data, whilst having a log of Parquet files I can read into a data warehouse for further ETL.
I see that you are using Databricks in the Azure stack. I think the most viable and recommended method for you would be to make use of the new Delta Lake project in Databricks:
It provides options for various upserts, merges and ACID transactions to object stores like S3 or Azure Data Lake Storage. It basically brings the management, safety, isolation and upserts/merges provided by data warehouses to data lakes. For one pipeline, Apple actually replaced its data warehouses to run solely on Delta Databricks because of its functionality and flexibility. For your use case, and for many others who use Parquet, it is just a simple change of replacing 'parquet' with 'delta' in order to use its functionality (if you have Databricks). Delta is basically a natural evolution of Parquet, and Databricks has done a great job by providing added functionality as well as open-sourcing it.
For your case, I would suggest you try the replaceWhere option provided in Delta. Before making this targeted update, the target table has to be in Delta format.
Instead of this:
dataset.repartition(1).write.mode('overwrite')\
.partitionBy('Year','Week').parquet('\curated\dataset')
From https://docs.databricks.com/delta/delta-batch.html:
'You can selectively overwrite only the data that matches predicates over partition columns'
You could try this:
dataset.repartition(1).write\
.format("delta")\
.mode("overwrite")\
.partitionBy('Year', 'Week')\
.option("replaceWhere", "Year == '2019' AND Week >= '01' AND Week <= '02'")\
.save('\curated\dataset')  # the replaceWhere predicate avoids overwriting Week 03
Also, if you wish to bring the partitions down to 1, why don't you use coalesce(1), as it will avoid a full shuffle?
From https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/:
'replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions'
Therefore, I personally think that using replaceWhere to manually specify your overwrite will be more targeted and computationally efficient than just relying on:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
Databricks provides optimizations on Delta tables that make them a faster and much more efficient option than Parquet (hence a natural evolution), through bin-packing and Z-ordering:
From https://docs.databricks.com/spark/latest/spark-sql/language-manual/optimize.html:
WHERE (bin-packing)
'Optimize the subset of rows matching the given partition predicate. Only filters involving partition key attributes are supported.'
ZORDER BY
'Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read'.
Faster query execution with indexing, statistics, and auto-caching support
Data reliability with rich schema validation and transactional guarantees
Simplified data pipeline with flexible UPSERT support and unified Structured Streaming + batch processing on a single data source
You could also check out the complete documentation of the open source project: https://docs.delta.io/latest/index.html
I also want to say that I do not work for Databricks/Delta Lake; I have just seen their improvements and functionality benefit me in my work.
UPDATE:
The gist of the question is "replacing data that exists and creating new folders for new data", and doing it in a highly scalable and effective manner.
Using dynamic partition overwrite in Parquet does the job, however I feel the natural evolution of that method is to use Delta table merge operations, which were basically created to 'integrate data from Spark DataFrames into the Delta Lake'. They provide you with extra functionality and optimizations for merging your data based on how you would want that to happen, and they keep a log of all actions on a table so you can roll back versions if needed.
Delta Lake Python API (for merge):
https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaMergeBuilder
Databricks optimization: https://kb.databricks.com/delta/delta-merge-into.html#discussion
Using a single merge operation you can specify the condition to merge on, which in this case could be a combination of the year, week and an id. Then, if the records match (meaning they exist in both your Spark DataFrame and the Delta table, i.e. week 1 and week 2), update them with the data from your Spark DataFrame and leave other records unchanged:
# you can also add an additional condition for when the records match, but it is not required
.whenMatchedUpdateAll(condition=None)
In some cases, if nothing matches, you might want to insert and create new rows and partitions; for that you can use:
.whenNotMatchedInsertAll(condition=None)
You can use the convertToDelta operation (https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.convertToDelta) to convert your Parquet table to a Delta table, so that you can perform Delta operations on it using the API.
'You can now convert a Parquet table in place to a Delta Lake table without rewriting any of the data. This is great for converting very large Parquet tables which would be costly to rewrite as a Delta table. Furthermore, this process is reversible.'
Your merge case(replacing data where it exists and creating new records when it does not exist) could go like this:
(have not tested, refer to examples + api for syntax)
%python
from delta.tables import DeltaTable

deltaTable = DeltaTable.convertToDelta(spark, "parquet.`\curated\dataset`")
(deltaTable.alias("target")
    .merge(dataset.alias("source"),
           "target.Year = source.Year AND target.Week = source.Week")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
If the Delta table is partitioned correctly (Year, Week) and you use the whenMatched clause correctly, these operations will be highly optimized and could take seconds in your case. They also give you consistency, atomicity and data integrity, with the option to roll back.
Some more functionality provided: you can specify the set of columns to update when a match is made, if you only need to update certain columns (see the sketch below). You can also enable spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true"), so that Delta uses minimal targeted partitions to carry out the merge (update, delete, create).
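A rough sketch of that column-level update with the Delta Lake Python API; the table path, the source DataFrame and the column names are placeholders, and whenMatchedUpdate takes a dict mapping target columns to expressions:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/curated/dataset")
(deltaTable.alias("target")
    .merge(dataset.alias("source"),
           "target.Year = source.Year AND target.Week = source.Week")
    .whenMatchedUpdate(set={"Data": "source.Data"})  # update only the Data column on a match
    .whenNotMatchedInsertAll()
    .execute())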
Overall, I think using this approach is a very new and innovative way of carrying out targeted updates, as it gives you more control over them while keeping operations highly efficient. Using Parquet with dynamic partitionOverwriteMode will also work fine, however Delta Lake features bring a level of data quality to your data lake that is unmatched.
My recommendation:
I would say, for now, use dynamic partition overwrite mode for Parquet files to do your updates, and experiment with the Delta merge on just one table, using the Databricks optimization spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true") together with .whenMatchedUpdateAll(), and compare the performance of both (your files are small, so I do not think there will be a big difference). The Databricks partition-pruning optimization for merges came out in February, so it is really new and could be a game changer for the overhead that Delta merge operations incur (under the hood they just create new files, but partition pruning could speed that up).
Merge examples in Python, Scala and SQL: https://docs.databricks.com/delta/delta-update.html#merge-examples
https://databricks.com/blog/2019/10/03/simple-reliable-upserts-and-deletes-on-delta-lake-tables-using-python-apis.html
Instead of writing to the table directly, we can use saveAsTable with append, and remove the partitions before that.
dataset.repartition(1).write.mode('append')\
.partitionBy('Year','Week').saveAsTable("tablename")
For removing previous partitions
partitions = [(x["Year"], x["Week"]) for x in dataset.select("Year", "Week").distinct().collect()]
for year, week in partitions:
    spark.sql(f'ALTER TABLE tablename DROP IF EXISTS PARTITION (Year = "{year}", Week = "{week}")')
Correct me if I missed something crucial in your approach, but it seems like you want to write new data on top of the existing data, which is normally done with
write.mode('append')
instead of 'overwrite'
If you want to keep the data separated by batch, so you can select it for upload to the data warehouse or for audit, there is no sensible way to do it other than including this information in the dataset and partitioning by it during the save, e.g.
dataset.write.mode('append')\
.partitionBy('Year','Week', 'BatchTimeStamp').parquet('curated\dataset')
Any other manual intervention into the parquet file format will be at best hacky, at worst risk making your pipeline unreliable or corrupting your data.
The Delta Lake that Mohammad mentions is also a good suggestion overall for reliably storing data in data lakes, and a golden industry standard right now. For your specific use case you could use its ability to run historical queries (append everything and then query for the difference between the current dataset and the one after the previous batch); however, the audit log is limited in time by how you configure your Delta Lake and can be as short as 7 days, so if you want full information in the long term you need to follow the approach of saving batch information anyway.
On a more strategic level, when following raw -> curated -> DW, you can also consider adding another 'hop' and putting your ready data into a 'preprocessed' folder, organized by batch, and then appending it to both the curated and DW sets.
As a side note, .repartition(1) doesn't make much sense when using Parquet, as Parquet is a multi-file format anyway, so the only effect of doing this is a negative impact on performance. But please do let me know if there is a specific reason you are using it.

I want to write a 75000x10000 matrix of float values efficiently into a database

Thanks for hearing me out.
I have a dataset that is a matrix of shape 75000x10000 filled with float values; think of it like a heatmap/correlation matrix. I want to store this in an SQLite database (SQLite because I am modifying an existing Django project). The source data file is 8 GB in size and I am trying to use Python to carry out the task.
I have tried using pandas chunking to read the file into Python, transform it into unstacked pairwise-indexed data, and write it out to a JSON file, but this approach is eating up my compute budget: for a chunk of size 100x10000 it generates a 200 MB JSON file.
This JSON file would then be used as a fixture to populate the SQLite database in the Django backend.
Is there a better way to do this? A faster/smarter way? I don't think writing a roughly 90 GB JSON file over a full day is the way to go, and I'm not even sure a Django database can take that load.
Any help is appreciated!
SQLite is quite impressive for what it is, but it's probably not going to give you the performance you are looking for at that scale, so even though your existing project is Django on SQLite, I would recommend simply writing a Python wrapper for a different data backend and using that from within Django.
More importantly, forget about using Django models for something like this; they are an abstraction layer built for convenience (mapping database records to Python objects), not for performance. Django would very quickly choke trying to build hundreds of millions of objects, since it doesn't understand what you're trying to achieve.
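For instance, here is a rough sketch of skipping the JSON fixture and the ORM entirely and streaming the matrix into SQLite in chunks with pandas; the file name, chunk size and table name are assumptions.

import sqlite3
import pandas as pd

conn = sqlite3.connect("matrix.db")

for chunk in pd.read_csv("matrix.csv", index_col=0, chunksize=100):
    # Unstack each 100 x 10000 block into long form: (row_label, col_label, value).
    long_form = (chunk.stack()
                      .rename_axis(["row_label", "col_label"])
                      .reset_index(name="value"))
    long_form.to_sql("matrix_values", conn, if_exists="append", index=False)

conn.close()

The same pattern works against a beefier backend by swapping the sqlite3 connection for an SQLAlchemy engine pointing at whatever store you settle on.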
Instead, you'll want to use a database type/engine that's suited to the kind of queries you want to make:
- If a typical query consists of a hundred point queries to get the data in particular 'cells', a key-value store might be ideal.
- If you're typically pulling ranges of values from individual 'rows' or 'columns', that's something to optimize for.
- If your queries typically involve taking sub-matrices and performing predictable operations on them, you might improve performance significantly by precalculating certain cumulative values.
- If you want to use the full dataset to train machine learning models, you're probably better off not using a database for your primary storage at all (since databases by nature sacrifice fast retrieval of the full raw data for fast calculations on interesting subsets), especially if your ML models can be parallelised using something like Spark.
No DB will handle everything well, so it would be useful if you could elaborate on the workload you'll be running on top of that data -- the kind of questions you want to ask of it?

Convert CSV table to Redis data structures

I am looking for a method/data structure to implement an evaluation system for a binary matcher used for verification.
This system will be distributed over several PCs.
The basic idea is described in many places on the internet, for example in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf
The matcher that I am testing takes two data items as input and calculates a matching score that reflects their similarity (a threshold will then be chosen, depending on the desired false match / false non-match rate).
Currently I store matching scores along with labels in a CSV file, like the following:
label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
...
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...
(I've got a labelled database.)
Then I run a Python script that loads this table into a pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in figure 2 in the link above. The processing is rather simple: just sorting the DataFrame, scanning the rows from top to bottom, and calculating the number of impostors/genuines in the rows above and below each row.
The system should also support finding outliers, to help improve the matching algorithm (labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores). This is also pretty easy with DataFrames (just sort and take the head rows); see the sketch below.
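For reference, a rough pandas sketch of that processing; the column names follow the CSV layout shown above, and the exact FMR/FNMR boundary conventions may differ slightly from the white paper.

import pandas as pd

df = pd.read_csv("scores.csv", names=["label1", "label2", "kind", "score"],
                 skipinitialspace=True)
df = df.sort_values("score", ascending=False).reset_index(drop=True)

n_genuine = (df["kind"] == "genuine").sum()
n_impostor = (df["kind"] == "impostor").sum()

# Treat each row's score as a threshold: FMR = share of impostors at or above it,
# FNMR = share of genuines below it.
df["fmr"] = (df["kind"] == "impostor").cumsum() / n_impostor
df["fnmr"] = 1 - (df["kind"] == "genuine").cumsum() / n_genuine

# Outliers: lowest-scoring genuine pairs and highest-scoring impostor pairs.
genuine_outliers = df[df["kind"] == "genuine"].nsmallest(10, "score")
impostor_outliers = df[df["kind"] == "impostor"].nlargest(10, "score")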
Now I'm thinking about how to store the comparison data in RAM instead of CSV files on HDD.
I am considering Redis in this regard: the amount of data is large, several PCs are involved in the computations, and Redis has a master-slave feature that allows it to quickly sync data over the network, so that several PCs have exact clones of the data.
It is also free.
However, Redis does not seem to me to be well suited to storing such tabular data.
Therefore, I would need to change the data structures and the algorithms for processing them.
However, it is not obvious to me how to translate this table into Redis data structures.
Another option would be using some other data storage system instead of Redis. However, I am unaware of such systems and will be grateful for suggestions.
You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then think about your questions.
TL;DR - Redis isn't a "tabular data" store, it is a store for data structures. It is up to you to use the data structure(s) that serves your query(ies) in the most optimal way.
IMO what you want to do is actually keep the large data (how big is it anyway?) on slower storage and just store the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more optimally with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))
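As a rough sketch of the Sorted Set idea with redis-py (the key names and the label-pair encoding are placeholders): one sorted set per class, with the label pair as the member and the matching score as the score.

import redis

r = redis.Redis(host="localhost", port=6379)

# Store comparison results: member -> score.
r.zadd("scores:genuine", {"label1|label2": 0.1, "label1|label4": 0.2})
r.zadd("scores:impostor", {"label_2|label_n+1": 0.8, "label_2|label_n+3": 0.9})

# Outliers: lowest-scoring genuine pairs and highest-scoring impostor pairs.
low_genuine = r.zrange("scores:genuine", 0, 9, withscores=True)
high_impostor = r.zrevrange("scores:impostor", 0, 9, withscores=True)

# Counting scores above a threshold, useful for FMR/FNMR style tallies.
impostors_above = r.zcount("scores:impostor", 0.5, "+inf")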
Disclaimer: I work at Redis Labs, home of the open source Redis and provider of commercial solutions that leverage it, including the above-mentioned module (open source, AGPL licensed).

How to store multi-dimensional data

I am building a couple of web applications to store data using Django. This data would be generated from lab tests and might have up to 100 parameters being logged against time. This would leave me with an NxN matrix of data.
I'm struggling to see how this would fit into a Django model as the number of parameters logged may change each time, and it seems inefficient to create a new model for each dataset.
What would be a good way of storing data like this? Would it be best to store it as a separate file and then just use a model to link a test to a datafile? If so what would be the best format for fast access and being able to quickly render and search through data, generate graphs etc in the application?
In answer to the question below:
It would be useful to search through datasets generated from the same test for trend analysis etc.
As I'm still in the early stages with this site I'm using SQLite, but I plan to move to a full SQL database as it grows.
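For what it's worth, a rough sketch of the "separate file linked from a model" idea raised in the question: keep the test metadata in a Django model and write the parameters-vs-time matrix to a columnar file such as Parquet. The model, field and path names are made up, and to_parquet needs a Parquet engine such as pyarrow installed.

from django.db import models
import pandas as pd


class LabTest(models.Model):
    name = models.CharField(max_length=200)
    run_at = models.DateTimeField(auto_now_add=True)
    # Path of the file holding this test's logged data.
    data_path = models.CharField(max_length=500, blank=True)


def save_results(test, results):
    """Write a DataFrame of logged parameters (columns) against time (index)."""
    path = f"lab_tests/test_{test.pk}.parquet"
    results.to_parquet(path)
    test.data_path = path
    test.save()


def load_results(test):
    """Reload the matrix for graphing, searching or trend analysis."""
    return pd.read_parquet(test.data_path)

Because each test's DataFrame can carry its own set of columns, the number of logged parameters can vary from test to test without changing the model.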
