partitionBy & overwrite strategy in an Azure DataLake using PySpark in Databricks - python
I have a simple ETL process in an Azure environment
blob storage > Data Factory > data lake raw > Databricks > data lake
curated > data warehouse (main ETL).
The datasets for this project are not very big (~1 million rows, 20 columns, give or take), but I would like to keep them properly partitioned in my data lake as Parquet files.
Currently I run some simple logic to figure out where in my lake each file should sit, based on business calendars.
The files look roughly like this:
Year Week Data
2019 01 XXX
2019 02 XXX
I then partition a given file into the following layout, replacing data that already exists and creating new folders for new data:
curated ---
dataset --
Year 2019
- Week 01 - file.pq + metadata
- Week 02 - file.pq + metadata
- Week 03 - file.pq + metadata #(pre-existing file)
The metadata files are the auto-generated _SUCCESS and commit files.
To this end I use the following command in PySpark 2.4.3:
pyspark_dataframe.write.mode('overwrite')\
.partitionBy('Year','Week').parquet('\curated\dataset')
Now, if I use this command on its own, it will overwrite all existing data under the target path,
so Week 03 will be lost.
Using spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") seems to stop the issue and only overwrites the target partitions, but I wonder whether this is the best way to handle files in my data lake.
I've also found it hard to find much documentation on the feature above.
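For clarity, the combined version being described looks roughly like this (untested sketch; same path as above):
# set the overwrite mode so only partitions present in the incoming DataFrame
# are replaced; pre-existing partitions (e.g. Week 03) are left untouched
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

pyspark_dataframe.write.mode('overwrite')\
    .partitionBy('Year', 'Week')\
    .parquet('\curated\dataset')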
My first instinct was to loop over the DataFrame and write each partition manually, which would give me greater control, but looping will be slow.
My next thought was to write each partition to a /tmp folder, move each Parquet file, and then replace/create files as needed using the command above, then purge the /tmp folder while creating some sort of metadata log.
Is there a better way/method to do this?
Any guidance would be much appreciated.
The end goal here is to have a clean and safe area for all 'curated' data, while having a log of Parquet files I can read into a data warehouse for further ETL.
I saw that you are using Databricks in the Azure stack. I think the most viable and recommended method would be to make use of the new Delta Lake project in Databricks.
It provides options for various upserts, merges and ACID transactions on top of object stores like S3 or Azure Data Lake Storage. It basically brings the management, safety, isolation and upserts/merges provided by data warehouses to data lakes. For one pipeline, Apple actually replaced its data warehouse to run solely on Delta in Databricks because of its functionality and flexibility. For your use case, and for many others who use Parquet, it is just a simple change of replacing 'parquet' with 'delta' in order to use its functionality (if you have Databricks). Delta is basically a natural evolution of Parquet, and Databricks has done a great job by adding functionality and open-sourcing it.
For your case, I would suggest you try the replaceWhere option provided in Delta. Before making this targeted update, the target table has to be in Delta format.
Instead of this:
dataset.repartition(1).write.mode('overwrite')\
.partitionBy('Year','Week').parquet('\curated\dataset')
From https://docs.databricks.com/delta/delta-batch.html:
'You can selectively overwrite only the data that matches predicates over partition columns'
You could try this:
dataset.repartition(1).write\
    .format("delta")\
    .mode("overwrite")\
    .partitionBy('Year','Week')\
    .option("replaceWhere", "Year == '2019' AND Week >= '01' AND Week <= '02'")\
    .save("\curated\dataset")  # the replaceWhere predicate avoids overwriting Week 03
Also, if you wish to bring the number of output files down to one, why don't you use coalesce(1), as it will avoid a full shuffle.
From https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/:
'replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions'
Therefore, I personally think that using replaceWhere to manually specify your overwrite will be more targeted and computationally efficient than just relying on:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
Databricks provides optimizations on Delta tables that make it a faster and much more efficient option than Parquet (hence a natural evolution), through bin-packing and Z-ordering:
From https://docs.databricks.com/spark/latest/spark-sql/language-manual/optimize.html:
WHERE (bin-packing)
'Optimize the subset of rows matching the given partition predicate. Only filters involving partition key attributes are supported.'
ZORDER BY
'Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read'.
Faster query execution with indexing, statistics, and auto-caching support
Data reliability with rich schema validation and transactional guarantees
Simplified data pipeline with flexible UPSERT support and unified Structured Streaming + batch processing on a single data source
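As a rough, untested sketch of running those optimizations on Databricks (the path is illustrative, and the Z-order column Id is a hypothetical non-partition column, since Z-ordering targets frequently filtered columns rather than the partition keys themselves):
# compact only the partitions that were just rewritten, and Z-order by a
# frequently filtered non-partition column (hypothetical `Id` column)
spark.sql("""
    OPTIMIZE delta.`/curated/dataset`
    WHERE Year = '2019'
    ZORDER BY (Id)
""")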
You could also check out the complete documentation of the open source project: https://docs.delta.io/latest/index.html
I also want to say that I do not work for Databricks/Delta Lake; I have just seen their improvements and functionality benefit me in my work.
UPDATE:
The gist of the question is "replacing data that exists and creating new folders for new data", and doing it in a highly scalable and effective manner.
Using dynamic partition overwrite in Parquet does the job, however I feel the natural evolution of that method is to use Delta table merge operations, which were basically created to 'integrate data from Spark DataFrames into the Delta Lake'. They provide you with extra functionality and optimizations in merging your data based on how you would want that to happen, and keep a log of all actions on a table so you can roll back versions if needed.
Delta Lake Python API (for merge):
https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaMergeBuilder
Databricks optimization: https://kb.databricks.com/delta/delta-merge-into.html#discussion
Using a single merge operation you can specify the condition to merge on, in this case a combination of the year, the week and an id, and then if the records match (meaning they exist in both your Spark DataFrame and the Delta table, e.g. Week 01 and Week 02), update them with the data in your Spark DataFrame and leave the other records unchanged:
#you can also add an additional condition for when the records match, but it is not required
.whenMatchedUpdateAll(condition=None)
In some cases, if nothing matches, you might want to insert and create new rows and partitions; for that you can use:
.whenNotMatchedInsertAll(condition=None)
You can use the convertToDelta operation (https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.convertToDelta) to convert your Parquet table to a Delta table, so that you can perform Delta operations on it using the API.
'You can now convert a Parquet table in place to a Delta Lake table without rewriting any of the data. This is great for converting very large Parquet tables which would be costly to rewrite as a Delta table. Furthermore, this process is reversible'
Your merge case (replacing data where it exists and creating new records when it does not) could go like this:
(not tested; refer to the examples and the API docs for exact syntax)
%python
from delta.tables import DeltaTable

deltaTable = DeltaTable.convertToDelta(spark, "parquet.`\curated\dataset`")

deltaTable.alias("target").merge(
        dataset.alias("source"),
        "target.Year = source.Year AND target.Week = source.Week") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
If the Delta table is partitioned correctly (Year, Week) and you use the whenMatched clause correctly, these operations will be highly optimized and could take seconds in your case. It also gives you consistency, atomicity and data integrity, with the option to roll back.
Some more functionality provided is that you can specify the set of columns to update when a match is made (if you only need to update certain columns). You can also enable spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true"), so that Delta uses the minimal set of targeted partitions to carry out the merge (update, delete, create).
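For example (untested sketch, reusing the deltaTable from the snippet above; the column names follow the toy Year/Week/Data layout from the question), updating only the Data column on a match while still inserting brand-new rows might look like:
deltaTable.alias("target").merge(
        dataset.alias("source"),
        "target.Year = source.Year AND target.Week = source.Week") \
    .whenMatchedUpdate(set={"Data": "source.Data"}) \
    .whenNotMatchedInsertAll() \
    .execute()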
Overall, I think using this approach is a new and innovative way of carrying out targeted updates, as it gives you more control while keeping operations highly efficient. Using Parquet with dynamic partition overwrite mode will also work fine; however, Delta Lake features bring a level of data quality to your data lake that is unmatched.
My recommendation:
I would say: for now, use dynamic partition overwrite mode for Parquet files to do your updates, and experiment with the Delta merge on just one table, using the Databricks optimization spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true") and .whenMatchedUpdateAll(), and compare the performance of both (your files are small, so I do not think it will be a big difference). The Databricks partition-pruning optimization for merges article came out in February, so it is really new and could be a game changer for the overhead that Delta merge operations incur (under the hood they just create new files, but partition pruning could speed this up).
Merge examples in Python, Scala and SQL: https://docs.databricks.com/delta/delta-update.html#merge-examples
https://databricks.com/blog/2019/10/03/simple-reliable-upserts-and-deletes-on-delta-lake-tables-using-python-apis.html
Instead of writing to the table path directly, we can use saveAsTable with append mode, and drop the affected partitions before that.
dataset.repartition(1).write.mode('append')\
.partitionBy('Year','Week').saveAsTable("tablename")
For removing the previous partitions first:
partitions = [(x["Year"], x["Week"]) for x in dataset.select("Year", "Week").distinct().collect()]
for year, week in partitions:
    spark.sql(f'ALTER TABLE tablename DROP IF EXISTS PARTITION (Year = "{year}", Week = "{week}")')
Correct me if I missed something crucial in your approach, but it seems like you want to write new data on top of the existing data, which is normally done with
write.mode('append')
instead of 'overwrite'
If you want to keep the data separated by batch, so you can select it for upload to the data warehouse or for auditing, there is no sensible way to do it besides including this information in the dataset and partitioning by it on save, e.g.
dataset.write.mode('append')\
.partitionBy('Year','Week', 'BatchTimeStamp').parquet('curated\dataset')
Any other manual intervention in the Parquet files will be hacky at best, and at worst risks making your pipeline unreliable or corrupting your data.
Delta Lake, which Mohammad mentions, is also a good suggestion overall for reliably storing data in data lakes, and a golden industry standard right now. For your specific use case you could use its ability to run historical queries (append everything and then query for the difference between the current dataset and the one after the previous batch); however, the audit log is limited in time by how you configure your Delta Lake and can be as short as 7 days, so if you want full information in the long term, you need to follow the approach of saving batch information anyway.
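As a hedged sketch of that historical-query idea (untested; the path and version number are illustrative), Delta time travel lets you diff the current table against an earlier version:
# read the current state and an earlier version of the same table, then diff
current = spark.read.format("delta").load("/curated/dataset")
previous = spark.read.format("delta").option("versionAsOf", 12).load("/curated/dataset")
new_rows = current.subtract(previous)   # rows added since version 12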
On a more strategic level, when following raw -> curated -> DW, you can also consider adding another 'hop' and putting your ready data into a 'preprocessed' folder organized by batch, then appending it to both the curated and DW sets.
As a side note, .repartition(1) doesn't make much sense when writing Parquet, as Parquet is a multi-file format anyway, so the only effect it has is a negative impact on performance. But please do let me know if there is a specific reason you are using it.
Related
Read last N rows of S3 parquet table
If I apply what was discussed here to read parquet files in an S3 bucket into a pandas dataframe, particularly:
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()
When the table grows larger and larger as time goes by and I need to make this retrieval regularly, I want to just read the last N rows into the data frame. Is this possible?
Yes, this is entirely possible. S3 allows for partial object reads. Parquet files allow for partial reads based on row groups (and pyarrow exposes this capability). In addition, pyarrow allows for partial reads if you have multiple files (regardless of file format). However, these approaches will put some requirements on how the input file(s) are created (see the aside at the bottom).

The easy way

The easiest thing will be to use the newer datasets API (which is worth a read on its own and obsoletes some of the question you referenced) and filter on some kind of column.

import pyarrow.dataset as ds
from datetime import datetime, timedelta

two_days_ago = datetime.now() - timedelta(days=2)
dataset = ds.dataset('s3://your-bucket').to_table(filter=ds.field('sample_date') > two_days_ago)

The pyarrow datasets API supports "push down filters", which means that the filter is pushed down into the reader layer. If the reader is capable of reducing the amount of data read using the filter then it will. For simple filters like this the parquet reader is capable of optimizing reads by looking first at the row group metadata, which should have a "statistics" section containing the min/max for each column.

However, that isn't quite "the last N rows", as it requires you to craft some kind of filter. If you have complete control over the data then you could create a row_num column. You could then create a filter on that if you knew the total # of rows (which you could store separately or access via the metadata, see below).

The slightly less easy way

Alternatively, you can use ParquetFile, which has the metadata attribute. Accessing this will only trigger a read for the metadata itself (which is not the whole file). From this you can get some information such as how many row groups are in the file and how many rows they contain. You can use this to determine how many row groups you need, and you can use read_row_group or read_row_groups to access just those row groups (this will not trigger a full file read).

Neither of these solutions is ideal. The first option requires you to have more control over the data and the second option requires you to do quite a bit of work yourself. The Arrow project is working towards simplifying this sort of operation (see, for example, ARROW-3705). However, this answer is based only on features that are available today.

One last aside: all of these approaches (and even any future approaches developed by Arrow) will require the data to either be stored as multiple files or multiple row groups. If you have one giant file stored as a single row group then there is not much that can be done. Parquet does not support partial row group reads.
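As a hedged sketch of the "slightly less easy way" (untested; the bucket/key and N are placeholders), you could walk the row groups backwards until you have enough rows and read only those groups:
import pyarrow.parquet as pq
import s3fs

N = 1000                                              # number of trailing rows wanted
fs = s3fs.S3FileSystem()
with fs.open('your-bucket/path/to/file.parquet') as f:   # placeholder key
    pf = pq.ParquetFile(f)
    needed, groups = N, []
    for i in reversed(range(pf.num_row_groups)):
        groups.insert(0, i)                           # keep row groups in file order
        needed -= pf.metadata.row_group(i).num_rows
        if needed <= 0:
            break
    last_n = pf.read_row_groups(groups).to_pandas().tail(N)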
No, it is not possible solely with S3. S3 is an object store, which lets you store, retrieve, update etc. only 'whole' objects, i.e. files. Having said that, you should have a look at Athena, which is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. It should let you do what you want. Best, Stefan
Should we make a complex query in PySpark, or a simple one and then use .filter / .select?
I have a question. Suppose I run a Python script on the server where my data are stored. Which is the faster way to get a Spark DataFrame of my data:
Make a complex query with lots of conditions that returns exactly the DataFrame I need, or
Make a simple query and build the DataFrame I need with .filter / .select?
You can also assume that the DataFrame I need is small enough to fit in my RAM. Thanks.
IIUC, everything depends on where you are reading the data from, so here are some scenarios.
Data source: RDBMS (Oracle, Postgres, MySQL, ...)
If you want to read data from an RDBMS, you have to establish a JDBC connection to the database and then fetch the results. Remember that Spark is slow when fetching data from relational databases over JDBC, and it is recommended that you filter most of your records on the database side itself, as that allows the minimum amount of data to be transferred over the network. You can control the read speed using some tuning parameters, but it is still slow.
Data source: Redshift, Snowflake
In this scenario, if your cluster is large and relatively free, then push the query down to the cluster itself; or, if you want to read data using JDBC, it is also fast, as behind the scenes it unloads the data to a temp location and then Spark reads it as a file source.
Data source: files
Always try to push down the filters, as they are there for a reason, so that your cluster needs to do the minimum work because you are reading only the required data.
The bottom line is that you should always try to push down the filters to the source to make your Spark jobs faster.
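A hedged sketch of the RDBMS point (untested; the URL, credentials and table/column names are placeholders), pushing the filter into the database versus filtering in Spark:
# push the heavy filtering into the database by wrapping the query as a derived table
pushed_down = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "(SELECT id, amount FROM sales WHERE sale_date >= '2020-01-01') AS t")
    .option("user", "user").option("password", "****")
    .load())

# versus reading the whole table and filtering in Spark: more data over the wire,
# although simple filters and column pruning are often pushed down automatically
whole_table = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "sales")
    .option("user", "user").option("password", "****")
    .load()
    .filter("sale_date >= '2020-01-01'")
    .select("id", "amount"))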
The key points to keep in mind are:
Restrict/filter data to the maximum possible level while loading into the DataFrame, so that only the needed data resides in the DataFrame.
For non-file sources: filter data at the source by using a native filter and fetch only the needed columns (aim for minimum data transfer).
For file sources: restricting/modifying data in the file source is not feasible, so the first operation should be to filter the data once loaded.
In complex operations, first perform narrow transformations (filters, selecting only the needed columns) and then perform wide transformations (joins, ordering) that involve a shuffle towards the end, so that less data is shuffled between worker nodes. The fewer the shuffles, the faster your end DataFrame will be.
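A small illustrative sketch of the last point (untested; the DataFrame and column names are made up): filter and prune columns before the shuffle-heavy join and aggregation:
# narrow transformations first: cut rows and columns early
orders_small = orders.filter("order_date >= '2021-01-01'").select("order_id", "cust_id", "amount")
custs_small = customers.select("cust_id", "region")

# wide transformations last: the shuffle now moves far less data
result = (orders_small
          .join(custs_small, "cust_id")
          .groupBy("region")
          .sum("amount"))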
First of all, I think we should be careful when dealing with small data in our Spark programs; Spark was designed to give you parallel processing for big data. Second, we have the Catalyst query optimizer and lazy evaluation, which are good tools for Spark to optimize everything that was put either into a SQL query or into API-call transformations.
Python large dataset feature engineering workflow using dask hdf/parquet
There is already a nice question about this on SO, but the best answer is now 5 years old, so I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for a larger-than-memory dataset (using suitable dtypes). The initial file is a csv that doesn't fit in memory. Here are my needs:
1. Create features (mainly using groupby operations on multiple columns).
2. Merge the new feature into the previous data (on disk, because it doesn't fit in memory).
3. Use a subset (or all) columns/index for some ML applications.
4. Repeat 1/2/3 (this is an iterative process, like day 1: create 4 features, day 2: create 4 more...).

Attempt with parquet and dask:
First, I split the big csv file into multiple small "parquet" files. With this, dask is very efficient for the calculation of new features, but then I need to merge them into the initial dataset and, at the moment, we cannot add new columns to parquet files. Reading the csv by chunk, merging and re-saving to multiple parquet files is too time consuming, as feature engineering is an iterative process in this project.

Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries, and it is still a binary file storage. Once again I split the big csv file into multiple HDF files with the same key='base' for the base features, in order to use concurrent writing with dask (not allowed by HDF).

data = data.repartition(npartitions=10)  # otherwise it was saving 8MB files using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)

(Annex question: specifying data_columns seems useless for dask, as there is no "where" in dask.read_hdf?)

Unlike what I expected, I am not able to merge the new feature into the multiple small files with code like this:

data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)

With dask.threaded I get "python stopped working" after 2%. With dask.multiprocessing.get it takes forever and creates new files.

What are the most appropriate tools (storage and processing) for this workflow?
I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either. Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. In addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
I would seriously consider using a database (indexed access) as storage, or even using Apache Spark (for processing data in a distributed/clustered way) with Hive / Impala as a backend...
HDF5 with Python, Pandas: Data Corruption and Read Errors
So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s... etc).

I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like:

1. Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't:
2. Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead.
3. Run the associated query on the external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
4. Store the result of query 1 as the new "QUERY1" in the HDF5 store.
5. Compute several transformations of one or more of QUERY1, QUERY2, ... QUERYn which will have hierarchical (Pandas MultiIndex) columns.
6. Overwrite item "Derived_Frame1" ... etc. with its update/replacement in the HDF5 store.

Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.

Some things I suspect could be part of the problem:

Using the default format (df.to_hdf(store, key)) instead of insisting on "Table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both the read and the write according to %timeit.
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.

Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.

Any ideas?
"Large data" workflows using pandas [closed]
I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this: what are some best-practice workflows for accomplishing the following?

1. Loading flat files into a permanent, on-disk database structure
2. Querying that database to retrieve data to feed into a pandas data structure
3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

1. Iteratively import a large flat file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
2. In order to use pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
3. I would create new columns by performing various operations on the selected columns.
4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.

Edit -- responding to Jeff's questions specifically:

I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.

Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.

Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics, trying to find interesting, intuitive relationships to model.

A typical project file is usually about 1GB. Files are organized in such a manner that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.

It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say retail credit cards. To do this, I would select only those records where line of business = retail, in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.

The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of, say, 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).
I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from and append back.
It's worth reading the docs and late in this thread for several suggestions on how to store your data.
Details which will affect how you store your data, like: give as much detail as you can, and I can help you develop a structure.
Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
What will typical operations look like. E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these. (Giving a toy example could enable us to offer more specific recommendations.)
After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
Input flat files: how many, rough total size in GB. How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
Do you 'work on' all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicitly until final results time)?

Solution

Ensure you have pandas at least 0.10.1 installed.
Read iterating files chunk-by-chunk and multiple table queries.
Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which will work with a big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',...... ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns = v['fields'], copy = False)

            # append it
            store.append(g, frame, index=False, data_columns = v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but probably this isn't necessary).

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1, groups_2, .....], where = ['field_1>0', 'field_1000=foo'], selector = group_1)

About data_columns, you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).

You also might want to:
create a function which takes a list of fields, looks up the groups in the group_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
indexes on certain data columns (makes row-subsetting much faster).
enable compression.
Let me know when you have questions!
I think the answers above are missing a simple approach that I've found very useful.
When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by rows or columns).
Example: in the case of 30 days' worth of trading data of ~30GB size, I break it into a file per day of ~1GB size. I subsequently process each file separately and aggregate the results at the end.
One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes); a rough sketch follows below.
The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished by regular shell commands, which is not possible in more advanced/complicated file formats.
This approach doesn't cover all scenarios, but is very useful in a lot of them.
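A minimal sketch of that idea (untested; process_one_day, the column names and the glob pattern are placeholders), using one process per daily file and aggregating at the end:
import glob
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_one_day(path):
    # per-file work: whatever feature/aggregation you need for that day
    df = pd.read_csv(path)
    return df.groupby("symbol")["volume"].sum()

if __name__ == "__main__":
    files = glob.glob("trading_2023-*.csv")              # one ~1GB file per day
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(process_one_day, files))
    result = pd.concat(partials).groupby(level=0).sum()  # aggregate across days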
There is now, two years after the question, an 'out-of-core' pandas equivalent: dask. It is excellent! Though it does not support all of pandas' functionality, you can get really far with it. Update: in the past two years it has been consistently maintained and there is a substantial user community working with Dask. And now, four years after the question, there is another high-performance 'out-of-core' pandas equivalent: Vaex. It "uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted)." It can handle data sets of billions of rows and does not store them in memory (making it even possible to do analysis on suboptimal hardware).
If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then pandas can hold the entire dataset in RAM. I know it's not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.
I know this is an old thread but I think the Blaze library is worth checking out. It's built for these types of situations. From the docs: Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark. Edit: By the way, it's supported by ContinuumIO and Travis Oliphant, author of NumPy.
This is the case for pymongo. I have also prototyped using SQL Server, sqlite, HDF, ORM (SQLAlchemy) in python. First and foremost, pymongo is a document-based DB, so each person would be a document (a dict of attributes). Many people form a collection, and you can have many collections (people, stock market, income).

pd.dataframe -> pymongo. Note: I use the chunksize in read_csv to keep it to 5 to 10k records (pymongo drops the socket if larger):

aCollection.insert((a[1].to_dict() for a in df.iterrows()))

Querying (gt = greater than...):

pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))

.find() returns an iterator, so I commonly use ichunked to chop it into smaller iterators.

How about a join, since I normally get 10 data sources to paste together:

aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))

then (in my case sometimes I have to agg on aJoinDF first before it is "mergeable"):

df = pandas.merge(df, aJoinDF, on=aKey, how='left')

And you can then write the new info to your main collection via the update method below (logical collection vs physical data sources):

collection.update({primarykey:foo},{key:change})

On smaller lookups, just denormalize. For example, you have a code in the document and you just add the field code text and do a dict lookup as you create documents.

Now you have a nice dataset based around a person, you can unleash your logic on each case and make more attributes. Finally you can read your 3-to-memory-max key indicators into pandas and do pivots/agg/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/...

You can also use the two methods built into MongoDB (MapReduce and the aggregate framework). See here for more info about the aggregate framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, python toolset, MongoDB helps me just get to work :)
One trick I found helpful for large data use cases is to reduce the volume of the data by reducing float precision to 32-bit. It's not applicable in all cases, but in many applications 64-bit precision is overkill and the 2x memory savings are worth it. To make an obvious point even more obvious:

>>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float64(5)
memory usage: 3.7 GB

>>> df.astype(np.float32).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float32(5)
memory usage: 1.9 GB
I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.

My basic workflow is to first get a CSV file from the database. I gzip it, so it's not as huge. Then I convert that to a row-oriented HDF5 file, by iterating over it in python, converting each row to a real data type, and writing it to a HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it's only operating row-by-row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.

The table transpose looks like:

def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name,
                            filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp, col_name,
                                 tables.Atom.from_dtype(col_data.dtype),
                                 col_data.shape)
        # Store the data.
        arr[:] = col_data
    h_out.flush()

Reading it back in then looks like:

def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
    else:
        hf = tables.openFile(hdf5_path)

    grp = hf.getNode(group_path)
    if columns is None:
        data = [(child.name, child[:]) for child in grp]
    else:
        data = [(child.name, child[:]) for child in grp
                if child.name in columns]

    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))

    if not isinstance(hdf5_path, tables.file.File):
        hf.close()

    return pd.DataFrame.from_items(data)

Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.

This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.

Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.
As noted by others, after some years an 'out-of-core' pandas equivalent has emerged: dask. Though dask is not a drop-in replacement of pandas and all of its functionality, it stands out for several reasons:

Dask is a flexible parallel computing library for analytic computing that is optimized for dynamic task scheduling for interactive computational workloads of "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments and scales from laptops to clusters.

Dask emphasizes the following virtues:
Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
Native: Enables distributed computing in Pure Python with access to the PyData stack.
Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
Scales up: Runs resiliently on clusters with 1000s of cores
Scales down: Trivial to set up and run on a laptop in a single process
Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans

and to add a simple code sample:

import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

replaces some pandas code like this:

import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

and, especially noteworthy, provides through the concurrent.futures interface a general infrastructure for the submission of custom tasks:

from dask.distributed import Client

client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()
It is worth mentioning Ray here as well; it's a distributed computation framework that has its own implementation for pandas in a distributed way.

Just replace the pandas import, and the code should work as is:

# import pandas as pd
import ray.dataframe as pd

# use pd as usual

You can read more details here: https://rise.cs.berkeley.edu/blog/pandas-on-ray/

Update: the part that handles the pandas distribution has been extracted to the modin project. The proper way to use it is now:

# import pandas as pd
import modin.pandas as pd
One more variation: many of the operations done in pandas can also be done as a DB query (SQL, Mongo). Using an RDBMS or MongoDB allows you to perform some of the aggregations in the DB query (which is optimized for large data, and uses caches and indexes efficiently). Later, you can perform post-processing using pandas. The advantage of this method is that you gain the DB optimizations for working with large data, while still defining the logic in a high-level declarative syntax - and not having to deal with the details of deciding what to do in memory and what to do out of core. And although the query language and pandas are different, it's usually not complicated to translate part of the logic from one to another.
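A small hedged sketch of this pattern (untested; the connection string and schema are placeholders): let the database do the heavy aggregation, then reshape the much smaller result in pandas:
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:****@host/db")
monthly = pd.read_sql(
    """
    SELECT customer_id, date_trunc('month', order_ts) AS month, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id, date_trunc('month', order_ts)
    """,
    engine,
)
# pandas takes over for the last-mile reshaping
pivot = monthly.pivot(index="customer_id", columns="month", values="total").fillna(0)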
Consider Ruffus if you go the simple path of creating a data pipeline which is broken down into multiple smaller files.
I'd like to point out the Vaex package. Vaex is a python library for lazy out-of-core DataFrames (similar to pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc. on an N-dimensional grid at up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted). Have a look at the documentation: https://vaex.readthedocs.io/en/latest/ The API is very close to the API of pandas.
I recently came across a similar issue. I found that simply reading the data in chunks and appending it as I write it in chunks to the same csv works well. My problem was adding a date column based on information in another table, using the value of certain columns as follows. This may help those confused by dask and hdf5 but more familiar with pandas like myself.

def addDateColumn():
    """Adds time to the daily rainfall data. Reads the csv as chunks of 100k
    rows at a time and outputs them, appending as needed, to a single csv.
    Uses the column of the raster names to get the date.
    """
    df = pd.read_csv(pathlist[1] + "CHIRPS_tanz.csv", iterator=True,
                     chunksize=100000)  # read csv file as 100k chunks

    '''Do some stuff'''

    count = 1  # for indexing item in time list
    for chunk in df:  # for each 100k rows
        newtime = []  # empty list to append repeating times for different rows
        toiterate = chunk[chunk.columns[2]]  # ID of raster nums to base time
        while count <= toiterate.max():
            for i in toiterate:
                if i == count:
                    newtime.append(newyears[count])
            count += 1
        print "Finished", str(chunknum), "chunks"
        chunk["time"] = newtime  # create new column in dataframe based on time
        outname = "CHIRPS_tanz_time2.csv"
        # append each output to same csv, using no header
        chunk.to_csv(pathlist[2] + outname, mode='a', header=None, index=None)
The parquet file format is ideal for the use case you described. You can efficiently read in a specific subset of columns with
pd.read_parquet(path_to_file, columns=["foo", "bar"])
https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
At the moment I am working "like" you, just on a lower scale, which is why I don't have a proof of concept for my suggestion. However, I seem to find success in using pickle as a caching system and outsourcing the execution of various functions into files, executing these files from my command/main file; for example, I use a prepare_use.py to convert object types and split a data set into test, validation and prediction data sets.

How does your caching with pickle work? I use strings in order to access pickle files that are dynamically created, depending on which parameters and data sets were passed (with that I try to capture and determine whether the program was already run, using .shape for the data set and a dict for the passed parameters). Respecting these measures, I get a string to try to find and read a .pickle file and can, if found, skip processing time in order to jump to the execution I am working on right now. A rough sketch of this idea follows below.

Using databases I encountered similar problems, which is why I found joy in using this solution; however, there are many constraints for sure, for example storing huge pickle sets due to redundancy. Updating a table from before to after a transformation can be done with proper indexing; validating information opens up a whole other book (I tried consolidating crawled rent data and stopped using a database after 2 hours, basically, as I would have liked to jump back after every transformation process).

I hope my 2 cents help you in some way. Greetings.
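A rough sketch of the pickle-cache idea described above (untested; expensive_transform is a placeholder for the real work): key the cache file on the step name, the dataset shape and the parameters, and skip recomputation when the file already exists:
import os, pickle, hashlib

def cached_step(df, params, step_name, cache_dir="cache"):
    # build a cache key from the step name, the data shape and the parameters
    key = hashlib.md5(f"{step_name}-{df.shape}-{sorted(params.items())}".encode()).hexdigest()
    path = os.path.join(cache_dir, f"{step_name}_{key}.pickle")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)          # cache hit: skip the expensive work
    result = expensive_transform(df, **params)   # placeholder for the real step
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result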