Low-end GPU vs mid-range CPU for data processing - Python

I currently have a fairly simple data processing job involving groupby, merge, and parallel column-to-column operations. The not-so-simple part is the massive number of rows involved (it's detailed cost/financial data), 300-400 GB in total.
Due to limited RAM, I'm currently doing out-of-core computing with Dask. However, it's really slow.
I've previously read that cuDF improves performance on map_partitions and groupby, but most examples use a mid- to high-end GPU (a 1050 Ti at least; most are run on GV-based cloud VMs) and data that fits in GPU RAM.
My machine specs are an E5-2620 v3 (6C/12T), 128 GB of RAM, and a K620 (only 2 GB of dedicated VRAM).
Intermediate dataframes are stored as Parquet.
Would a low-end GPU with cuDF make this faster? And is out-of-core computing possible on a GPU? (I've looked all around for an example but have yet to find one.)
Below is simplified pseudocode of what I'm trying to do.
a.csv is ~300 GB in size, consisting of 4 columns (Hier1, Hier2, Hier3, value). Hier1-3 are hierarchy levels, as strings; value is the sales value.
b.csv is ~50 GB in size, consisting of 4 columns (Hier1, Hier2, valuetype, cost). Hier1-2 are hierarchy levels, as strings; valuetype is the cost type, as a string; cost is the cost value.
Basically, I need to prorate each cost in b.csv top-down based on the sales values in a.csv. The end goal is to have every cost available at the Hier3 level (the more detailed level).
The first step is to create the proration ratio:
import dask.dataframe as dd
# read raw data, repartition, convert to parquet for both file
raw_reff = dd.read_csv('data/a.csv')
raw_reff = raw_reff.map_partitions(lambda df: df.assign(PartGroup=df['Hier1']+df['Hier2']))
raw_reff = raw_reff.set_index('PartGroup')
raw_reff.to_parquet("data/raw_a.parquet")
cost_reff = dd.read_csv('data/b.csv')
cost_reff = cost_reff.map_partitions(lambda df: df.assign(PartGroup=df['Hier1']+df['Hier2']))
cost_reff = cost_reff.set_index('PartGroup')
cost_reff.to_parquet("data/raw_b.parquet")
# create reference ratio
ratio_reff = dd.read_parquet("data/raw_a.parquet").reset_index()
# to reduce RAM usage, I group within each partition instead of using a Dask-level groupby. This should be OK since the data was already partitioned by group above.
ratio_reff = ratio_reff.map_partitions(lambda df: df.groupby(['PartGroup'])['value'].sum().reset_index())
ratio_reff = ratio_reff.set_index('PartGroup')
ratio_reff = ratio_reff.map_partitions(lambda df: df.rename(columns={'value':'value_on_group'}))
ratio_reff.to_parquet("data/reff_a.parquet")
Then do the merge to get the ratio:
raw_data = dd.read_parquet("data/raw_a.parquet").reset_index()
reff_data = dd.read_parquet("data/reff_a.parquet").reset_index()
ratio_data = raw_data.merge(reff_data, on=['PartGroup'], how='left')
ratio_data['RATIO'] = ratio_data['value'].fillna(0)/ratio_data['value_on_group'].fillna(0)
ratio_data = ratio_data[['PartGroup','Hier3','RATIO']]
ratio_data = ratio_data.set_index('PartGroup')
ratio_data.to_parquet("data/ratio_a.parquet")
Then merge the cost data with the ratio on PartGroup and multiply to get the prorated value:
reff_stg = dd.read_parquet("data/ratio_a.parquet").reset_index()
cost_stg = dd.read_parquet("data/raw_b.parquet").reset_index()
final_stg = reff_stg.merge(cost_stg, on=['PartGroup'], how='left')
final_stg['allocated_cost'] = final_stg['RATIO']*final_stg['cost']
final_stg = final_stg.set_index('PartGroup')
final_stg.to_parquet("data/result_pass1.parquet")
In the real case there will be residual values caused by missing reference data etc., and the allocation will be done in several passes using several references, but the above is basically the process.
Even with strictly Parquet-to-Parquet operations, it still takes ~80 GB of my 128 GB of RAM, all of my cores run at 100%, and it needs 3-4 days to finish. I'm looking for ways to get this done faster with my current hardware. As you can see, it's a massively parallel problem that fits the definition of GPU-based processing.
Thanks

@Ditto, unfortunately this cannot be done with your current hardware. Your K620 is below the minimum architecture requirements for RAPIDS; you will need a Pascal card or better to run it. The good news is that if purchasing a RAPIDS-compatible video card is not a viable option, there are many inexpensive cloud provisioning options. Honestly, for what you're asking to do, I'd want a little extra GPU processing speed and would recommend a multi-GPU setup.
As for a dataset larger than GPU RAM, you can use dask_cudf so that your dataset can be processed. There are several examples in our docs and notebooks. Be advised that the resulting dataset after dask.compute() needs to fit in GPU RAM.
https://rapidsai.github.io/projects/cudf/en/0.12.0/10min.html#10-Minutes-to-cuDF-and-Dask-cuDF
https://rapidsai.github.io/projects/cudf/en/0.12.0/dask-cudf.html#multi-gpu-with-dask-cudf
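To make the dask_cudf suggestion concrete, here is a minimal sketch of what the reference-ratio step from your pseudocode might look like (untested, and assuming a RAPIDS-capable GPU plus the Parquet files and column names from your question); the API deliberately mirrors dask.dataframe, and partitions are streamed through GPU memory rather than needing to fit there all at once:
import dask_cudf

# each partition becomes a cuDF DataFrame that lives on the GPU while it is processed
raw_reff = dask_cudf.read_parquet("data/raw_a.parquet")

# per-group totals, written back to Parquet exactly as in the Dask version
ratio_reff = (raw_reff.groupby("PartGroup")
                      .agg({"value": "sum"})
                      .rename(columns={"value": "value_on_group"}))
ratio_reff.to_parquet("data/reff_a.parquet")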
Once you get a working, RAPIDS-compatible, multi-GPU setup and use dask_cudf, you should see a very worthwhile speedup, especially for data exploration at that size.
Hope this helps!

Related

Looking for a solution to speed up `pyspark.sql.GroupedData.applyInPandas` processing on a large dataset

I'm working with a dataset stored in an S3 bucket (Parquet files) consisting of a total of ~165 million records (with ~30 columns). Now, the requirement is to first group by a certain ID column and then generate 250+ features for each of these grouped records based on the data. Building these features is quite complex, using multiple pandas functions along with 10+ supporting functions. The groupby should produce ~5-6 million records, hence the final output should be a 6M x 250 shaped dataframe.
Now, I've tested the code on a smaller sample and it works fine. The issue is, when I'm implementing it on the entire dataset, it takes a very long time - the progress bar in Spark display doesn't change even after 4+ hours of running. I'm running this in AWS EMR Notebook connected to a Cluster (1 m5.xlarge Master & 2 m5.xlarge Core Nodes).
I've tried with 1 m5.4xlarge Master & 2 m5.4xlarge Core Nodes, 1 m5.xlarge Master & 8 m5.xlarge Core Nodes combinations among others. None of them have shown any progress.
I've tried running it with pandas in-memory on my local machine for ~650k records; the progress was ~3.5 iterations/sec, which came to an ETA of ~647 hours.
So, the question is: can anyone share a better solution to reduce the time consumption and speed up the processing? Should another cluster type be used for this use case? Should this be refactored, should the pandas dataframe usage be removed, or would any other pointer be helpful?
Thanks much in advance!
First things first: is your data partitioned enough to take advantage of all of your workers? If some part of your process causes it to coalesce to e.g. a single partition, then you're basically running single-threaded.
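For example, a quick check of the partition count before the groupBy might look like this (a sketch only; spark_df and the "ID" column stand in for your actual DataFrame and grouping column):
# if the DataFrame has collapsed to a handful of partitions, most executors sit idle
num_parts = spark_df.rdd.getNumPartitions()
print(f"partitions before groupBy: {num_parts}")

if num_parts < 200:                        # the threshold is illustrative only
    spark_df = spark_df.repartition(200, "ID")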
Beyond that, I don't know for certain without seeing the code, but here's a subtle behaviour that can cause runtimes to become massive:
source_df = # some pandas dataframe with a lot of features in columns
flattened_df = source_df.stack().reset_index().unstack() # Turn the features into rows
spark_df = spark.createDataFrame(flattened_df) # 'index' is the column that contains the feature name

from sklearn.linear_model import LinearRegression

# a function to do a linear regression and calculate the residual
def your_good_pandas_function(key, slice):
    clf = LinearRegression()
    X = slice[subset,of,columns]
    y = slice[key]
    clf.fit(X, y)
    predicted = clf.predict(X)
    return y - predicted

def your_bad_pandas_function(key, slice):
    clf = LinearRegression()
    X = slice[subset,of,columns]
    y = slice[key]
    clf.fit(X, y)
    predicted = clf.predict(X)
    return source_df[key] - predicted  # references the out-of-scope source_df

spark_df.groupBy('index').applyInPandas(your_good_pandas_function, schema=some_schema)  # fast
spark_df.groupBy('index').applyInPandas(your_bad_pandas_function, schema=some_schema)   # slow
These two applyInPandas functions do the same thing - they linear-regress some characteristics against a feature and calculate the residual. The first uses only variables that are in scope within the pandas UDF. The second uses a variable that is out of scope of the pandas UDF. In the second case, Spark will help you out by broadcasting source_df to every single invocation of your pandas UDF. This will cause enormous memory usage and will definitely kill your job.
Your data don't seem large enough to take that long, so my guess is that the reason why it works on a small subset and not the larger set may be because you're inadvertently broadcasting the larger set to your applyInPandas function calls.

Import and work with large dataset (Python beginners)

Since I couldn't find the best way to deal with my issue, I came here to ask.
I'm a beginner with Python, but I have to handle a large dataset.
However, I don't know the best way to handle the "MemoryError" problem.
I already have a 64-bit Python 3.7.3.
I saw that we can use TensorFlow, specify chunks in the pandas call, or use the Dask library, but I don't know which one best fits my problem, and as a beginner it's not very clear.
I have a huge dataset (over 100M observations), and I don't think reducing the dataset would decrease the memory usage much.
What I want to do is test multiple ML algorithms with train and test samples. I don't know how to deal with the problem.
Thanks!
This question is high level, so I'll provide some broad approaches for reducing memory usage in Dask:
Use a columnar file format like Parquet so you can leverage column pruning (see the sketch after this list)
Use column dtypes that require less memory, e.g. int8 instead of int64
Strategically persist in memory, where appropriate
Use a cluster that's sized well for your data (running an analysis on 2 GB of data requires different amounts of memory than 2 TB)
Split data into multiple files so it's easier to process in parallel
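A rough sketch of the first two points (the file name, column names, and dtypes are made up for illustration):
import dask.dataframe as dd

# column pruning: only read the columns you actually need from the Parquet file
ddf = dd.read_parquet("data/observations.parquet", columns=["id", "amount", "label"])

# smaller dtypes: downcast where the value range allows it
ddf = ddf.astype({"label": "int8", "amount": "float32"})

# check the per-column memory footprint
print(ddf.memory_usage(deep=True).compute())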
Your data has 100 million rows, which isn't that big (unless it had thousands of columns). Big data typically has billions or trillions of rows.
Feel free to add questions that are more specific and I can provide more specific advice. You can provide the specs of your machine / cluster, the memory requirements of the DataFrame (via ddf.memory_usage(deep=True)) and the actual code you're trying to run.

Dask crashing when saving to file?

I'm trying to one-hot encode a dataset and then group by a specific column, so I get one row for each item in that column with an aggregated view of which one-hot columns are true for that specific row. It seems to work on small data, and using Dask seems to work for large datasets, but I'm having problems when I try to save the file. I've tried CSV and Parquet files. I want to save the results so I can open them later in chunks.
Here's code to show the issue (the script below generates 2M rows and up to 30k unique values to one-hot encode).
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
sizeOfRows = 2000000
columnsForDF = 30000
partitionsforDask = 500
print("partition is ", partitionsforDask)
cluster = LocalCluster()
client = Client(cluster)
print(client)
df = pd.DataFrame(np.random.randint(0,columnsForDF,size=(sizeOfRows, 2)), columns=list('AB'))
ddf = dd.from_pandas(df, npartitions=partitionsforDask)
# ddf = ddf.persist()
wait(ddf)
# %%time
# need to globally know the categories before one hot encoding
ddf = ddf.categorize(columns=["B"])
one_hot = dd.get_dummies(ddf, columns=['B'])
print("starting groupby")
# result = one_hot.groupby('A').max().persist() # or to_parquet/to_csv/compute/etc.
# result = one_hot.groupby('A', sort=False).max().to_csv('./daskDF.csv', single_file = True)
result = one_hot.groupby('A', sort=False).max().to_parquet('./parquetFile')
wait(result)
It seems to work until it does the groupby and writes to CSV or Parquet. At that point, I get many errors about workers exceeding 95% of their memory budget, and then the program exits with a KilledWorker exception:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
KilledWorker: ("('dataframe-groupby-max-combine-3ddcd8fc854613101b4bdc7fccde32cd', 1, 0, 0)", <Worker 'tcp://127.0.0.1:33815', name: 6, memory: 0, processing: 22>)
Monitoring my machine, I never get close to exceeding memory, and my drive space is over 300 GB, which is never used (no file is created during this process, although it is in the groupby section).
What can I do?
Update - I thought I'd add a bounty. I'm having the same problem with .to_csv as well; since someone else had a similar problem, I hope this has value for a wide audience.
Let's first think of the end result: it will be a dataframe with 30'000 columns and 30'000 rows. This object will take about 6.7 GB in memory. (there's scope in playing around with dtypes to reduce memory footprint and also not all combinations might appear in the data, but let's ignore these points for simplicity)
Now, imagine we only had two partitions and each partition contained all possible dummy variable combinations. That would mean that each worker would need at least 6.7 GB to store the .groupby().max() object, but the final step would require 13.4 GB because the final worker would need to find the .max of those two objects. Naturally, if you have more partitions, the memory requirements on the final worker will grow. There is a way of controlling that in dask by specifying split_every in the relevant function. For example, if you specify .max(split_every=2), then any worker will receive at most 2 objects (the default value of split_every is 8).
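Applied to the code from the question, that is a one-argument change (a sketch, untested at this scale):
# lower split_every reduces how many intermediate groupby results any single
# worker has to combine at once, at the cost of a deeper aggregation tree
result = one_hot.groupby('A', sort=False).max(split_every=2).to_parquet('./parquetFile')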
Early on in the processing of 500 partitions, it's likely that each partition will contain only a subset of possible dummy values, so memory requirements are low. However, as dask marches on in computing the final result, it will combine objects with different dummy value combinations, so memory requirements will grow towards the end of the pipeline.
In principle, you could also use resources to restrict how many tasks a worker will undertake at one time, but that's not going to help if the worker doesn't have sufficient memory to handle the tasks.
What are potential ways out of this? At least a few options:
use workers with bigger resources;
simplify the task (e.g. split the task into several sub-tasks based on subsets of possible categories);
develop a custom workflow with delayed/futures that will sort the data and implement custom priorities, ensuring that workers complete a subset of work before proceeding to the final aggregation.
If worker memory is a constraint, then the subsetting will have to be very fine-grained. For example, at the limit, subsetting to only one possible dummy variable combination will have very low memory requirements (the initial data load and filter will still require enough memory to fit a partition), but of course that's an extreme example that would generate tens of thousands of tasks, so larger category groups are recommended (balancing the number of tasks and memory requirements). To see an example, you can check this related answer.
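As an illustration of the second option, here is a sketch that slices the category space of B and writes one partial result per slice (untested; it assumes you start from ddf before the global categorize call, so each slice only carries its own dummy columns):
import numpy as np
import dask.dataframe as dd

n_slices = 10
all_vals = np.arange(columnsForDF)                    # every possible value of B
for i, vals in enumerate(np.array_split(all_vals, n_slices)):
    # keep only the rows whose B falls in this slice, then categorize just that slice
    subset = ddf[ddf['B'].isin(list(vals))]
    subset = subset.categorize(columns=['B'])
    one_hot_slice = dd.get_dummies(subset, columns=['B'])
    (one_hot_slice.groupby('A', sort=False)
                  .max(split_every=2)
                  .to_parquet(f'./parquet_slice_{i}'))
Each slice produces a disjoint set of dummy columns for (a subset of) the same A values, so the per-slice outputs can be joined column-wise afterwards.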

How does dask work for larger than memory datasets

Would anyone be able to tell me, in simple terms, how Dask works for larger-than-memory datasets? For example, I have a 6 GB dataset and 4 GB of RAM with 2 cores. How would Dask go about loading the data and doing a simple calculation such as the sum of a column?
Does Dask automatically check the size of the memory and chunk the dataset into smaller-than-memory pieces? Then, once requested to compute, does it bring the chunks into memory one by one and do the computation using each of the available cores? Am I right on this?
Thanks
Michael
By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to try to make a decent guess; "100MB" may be a fine size to try.
For Parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern when loading the data.
In both cases, each worker will load one chunk at a time, and calculate the column sum you have asked for. Then, the loaded data will be discarded to make space for the next one, only keeping the results of the sum in memory (a single number for each partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the sums are added together.
Thus, each partition should comfortably fit into memory - not be too big - but the time it takes to load and process each should be much longer than the overhead imposed by scheduling the task to run on a worker (the latter <1ms) - not be too small.
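A tiny sketch of that flow (the file and column names are placeholders):
import dask.dataframe as dd

# ~100MB partitions: only as many of them as you have workers are in memory at once
ddf = dd.read_csv("data.csv", blocksize="100MB")

# each partition contributes a partial sum, and the partial sums are combined at the end
total = ddf["some_column"].sum().compute()
print(total)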

"Large data" workflows using pandas [closed]

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.
One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.
My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:
What are some best-practice workflows for accomplishing the following:
Loading flat files into a permanent, on-disk database structure
Querying that database to retrieve data to feed into a pandas data structure
Updating the database after manipulating pieces in pandas
Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".
Edit -- an example of how I would like this to work:
Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
I would create new columns by performing various operations on the selected columns.
I would then have to append these new columns into the database structure.
I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables it seems that appending a new column could be a problem.
Edit -- Responding to Jeff's questions specifically:
I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc... The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.
Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
A typical project file is usually about 1GB. Files are organized into such a manner where a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).
I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from, and append back to.
It's worth reading the docs and late in this thread for several suggestions on how to store your data.
Give as much detail as you can, and I can help you develop a structure. Details that will affect how you store your data include:
Size of data, number of rows, columns, types of columns; are you appending rows, or just columns?
What will typical operations look like? E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, and save these. (Giving a toy example could enable us to offer more specific recommendations.)
After that processing, what do you do then? Is step 2 ad hoc, or repeatable?
Input flat files: how many, and what rough total size in GB? How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
Do you 'work on' all of your columns (in groups), or is there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in those columns explicitly until final results time)?
Solution
Ensure you have pandas at least 0.10.1 installed.
Read about iterating through files chunk-by-chunk and about multiple table queries.
Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (this will work with one big table too, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... it's more intuitive anyhow):
(The following is pseudocode.)
import numpy as np
import pandas as pd
# create a store
store = pd.HDFStore('mystore.h5')
# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',...... ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))
Reading in the files and creating the storage (essentially doing what append_to_multiple does):
for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns = v['fields'], copy = False)

            # append it
            store.append(g, frame, index=False, data_columns = v['dc'])
Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that probably isn't necessary).
This is how you get columns and create new ones:
frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
# select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows
# do calculations on this frame
new_frame = cool_function_on_frame(frame)
# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)
When you are ready for post_processing:
# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)
About data_columns, you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:
store.select(group, where = ['field_1000=foo', 'field_1001>0'])
They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).
You also might want to:
create a function which takes a list of fields, looks up the groups in the groups_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
indexes on certain data columns (makes row-subsetting much faster).
enable compression.
Let me know when you have questions!
I think the answers above are missing a simple approach that I've found very useful.
When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by row or cols)
Example: In case of 30 days worth of trading data of ~30GB size, I break it into a file per day of ~1GB size. I subsequently process each file separately and aggregate results at the end
One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes)
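A sketch of that pattern (the paths, the per-file work, and the pool size are illustrative only):
import glob
from multiprocessing import Pool

import pandas as pd

def process_one_day(path):
    # whatever per-file work is needed; here, a small aggregate per daily file
    df = pd.read_csv(path)
    return df.groupby("symbol")["volume"].sum()

if __name__ == "__main__":
    files = sorted(glob.glob("trades_2013-05-*.csv"))   # one ~1GB file per day
    with Pool(processes=4) as pool:
        partials = pool.map(process_one_day, files)
    # combine the per-file results at the end
    result = pd.concat(partials).groupby(level=0).sum()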
The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished with regular shell commands, which is not possible with more advanced/complicated file formats.
This approach doesn't cover all scenarios, but is very useful in a lot of them
There is now, two years after the question, an 'out-of-core' pandas equivalent: dask. It is excellent! Though it does not support all of pandas' functionality, you can get really far with it. Update: in the past two years it has been consistently maintained, and there is a substantial user community working with Dask.
And now, four years after the question, there is another high-performance 'out-of-core' pandas equivalent in Vaex. It "uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted)." It can handle data sets of billions of rows and does not store them into memory (making it even possible to do analysis on suboptimal hardware).
If your datasets are between 1 and 20 GB, you should get a workstation with 48 GB of RAM. Then pandas can hold the entire dataset in RAM. I know it's not the answer you're looking for here, but doing scientific computing on a notebook with 4 GB of RAM isn't reasonable.
I know this is an old thread but I think the Blaze library is worth checking out. It's built for these types of situations.
From the docs:
Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark.
Edit: By the way, it's supported by ContinuumIO and Travis Oliphant, author of NumPy.
This is the case for pymongo. I have also prototyped using sql server, sqlite, HDF, ORM (SQLAlchemy) in python. First and foremost pymongo is a document based DB, so each person would be a document (dict of attributes). Many people form a collection and you can have many collections (people, stock market, income).
pd.DataFrame -> pymongo. Note: I use chunksize in read_csv to keep it to 5 to 10k records (pymongo drops the socket if the batches are larger):
aCollection.insert((a[1].to_dict() for a in df.iterrows()))
querying: gt = greater than...
pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))
.find() returns an iterator so I commonly use ichunked to chop into smaller iterators.
How about a join since I normally get 10 data sources to paste together:
aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))
then (in my case sometimes I have to agg on aJoinDF first before its "mergeable".)
df = pandas.merge(df, aJoinDF, on=aKey, how='left')
And you can then write the new info to your main collection via the update method below. (logical collection vs physical datasources).
collection.update({primarykey:foo},{key:change})
On smaller lookups, just denormalize. For example, you have code in the document and you just add the field code text and do a dict lookup as you create documents.
Now you have a nice dataset based around a person, you can unleash your logic on each case and make more attributes. Finally you can read into pandas your 3 to memory max key indicators and do pivots/agg/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/...
You can also use the two methods built into MongoDB (MapReduce and aggregate framework). See here for more info about the aggregate framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, python toolset, MongoDB helps me just get to work :)
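For example, a small sketch of the aggregation framework doing that kind of pre-aggregation inside MongoDB (the field names are made up):
# group and sum on the server, then pull only the aggregated result into pandas
pipeline = [
    {"$match": {"anAttribute": {"$gt": 2887000}}},
    {"$group": {"_id": "$code", "total": {"$sum": "$amount"}, "n": {"$sum": 1}}},
]
agg_df = pd.DataFrame(list(mongoCollection.aggregate(pipeline)))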
One trick I found helpful for large data use cases is to reduce the volume of the data by reducing float precision to 32-bit. It's not applicable in all cases, but in many applications 64-bit precision is overkill and the 2x memory savings are worth it. To make an obvious point even more obvious:
>>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float64(5)
memory usage: 3.7 GB
>>> df.astype(np.float32).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float32(5)
memory usage: 1.9 GB
I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.
My basic workflow is to first get a CSV file from the database. I gzip it, so it's not as huge. Then I convert that to a row-oriented HDF5 file, by iterating over it in python, converting each row to a real data type, and writing it to a HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it's only operating row-by-row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.
The table transpose looks like:
def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp,
                                 col_name,
                                 tables.Atom.from_dtype(col_data.dtype),
                                 col_data.shape)
        # Store the data.
        arr[:] = col_data
    h_out.flush()
Reading it back in then looks like:
def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
    else:
        hf = tables.openFile(hdf5_path)
    grp = hf.getNode(group_path)
    if columns is None:
        data = [(child.name, child[:]) for child in grp]
    else:
        data = [(child.name, child[:]) for child in grp if child.name in columns]
    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))
    if not isinstance(hdf5_path, tables.file.File):
        hf.close()
    return pd.DataFrame.from_items(data)
Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.
This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.
Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.
As noted by others, after some years an 'out-of-core' pandas equivalent has emerged: dask. Though dask is not a drop-in replacement for pandas and all of its functionality, it stands out for several reasons:
Dask is a flexible parallel computing library for analytic computing that is optimized for dynamic task scheduling for interactive computational workloads of
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments and scales from laptops to clusters.
Dask emphasizes the following virtues:
Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
Native: Enables distributed computing in Pure Python with access to the PyData stack.
Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
Scales up: Runs resiliently on clusters with 1000s of cores Scales down: Trivial to set up and run on a laptop in a single process
Responsive: Designed with interactive computing in mind it provides rapid feedback and diagnostics to aid humans
and to add a simple code sample:
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()
replaces some pandas code like this:
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()
and, especially noteworthy, provides through the concurrent.futures interface a general infrastructure for the submission of custom tasks:
from dask.distributed import Client
client = Client('scheduler:port')
futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()
It is worth mentioning Ray here as well; it's a distributed computation framework that has its own implementation of pandas in a distributed way.
Just replace the pandas import, and the code should work as is:
# import pandas as pd
import ray.dataframe as pd
# use pd as usual
You can read more details here:
https://rise.cs.berkeley.edu/blog/pandas-on-ray/
Update:
The part that handles the pandas distribution has been extracted into the modin project.
The proper way to use it now is:
# import pandas as pd
import modin.pandas as pd
One more variation
Many of the operations done in pandas can also be done as a db query (sql, mongo)
Using a RDBMS or mongodb allows you to perform some of the aggregations in the DB Query (which is optimized for large data, and uses cache and indexes efficiently)
Later, you can perform post processing using pandas.
The advantage of this method is that you gain the DB optimizations for working with large data, while still defining the logic in a high level declarative syntax - and not having to deal with the details of deciding what to do in memory and what to do out of core.
And although the query language and pandas are different, it's usually not complicated to translate part of the logic from one to the other.
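For example, a sketch using SQLite through pandas (the table and column names are hypothetical):
import sqlite3

import pandas as pd

conn = sqlite3.connect("warehouse.db")

# let the database do the heavy grouping, then post-process the small result in pandas
query = """
    SELECT line_of_business, COUNT(*) AS n, SUM(balance) AS total_balance
    FROM accounts
    GROUP BY line_of_business
"""
summary = pd.read_sql_query(query, conn)
summary["avg_balance"] = summary["total_balance"] / summary["n"]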
Consider Ruffus if you go the simple path of creating a data pipeline which is broken down into multiple smaller files.
I'd like to point out the Vaex package.
Vaex is a python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10⁹) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).
Have a look at the documentation: https://vaex.readthedocs.io/en/latest/
The API is very close to the API of pandas.
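A minimal sketch of that API (the file and column names are placeholders):
import vaex

# memory-maps the file; nothing is loaded into RAM up front
df = vaex.open("big_data.hdf5")

# expressions create virtual columns that are evaluated lazily and out of core
df["ratio"] = df.value / df.total
print(df.mean(df.ratio), df.count())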
I recently came across a similar issue. I found that simply reading the data in chunks and appending it, chunk by chunk, to the same CSV as I write works well. My problem was adding a date column based on information in another table, using the values of certain columns as follows. This may help those confused by Dask and HDF5 but more familiar with pandas, like myself.
def addDateColumn():
    """Adds time to the daily rainfall data. Reads the csv as chunks of 100k
    rows at a time and outputs them, appending as needed, to a single csv.
    Uses the column of the raster names to get the date.
    """
    df = pd.read_csv(pathlist[1]+"CHIRPS_tanz.csv", iterator=True,
                     chunksize=100000)  # read csv file as 100k chunks

    '''Do some stuff'''

    count = 1  # for indexing item in time list
    for chunk in df:  # for each 100k rows
        newtime = []  # empty list to append repeating times for different rows
        toiterate = chunk[chunk.columns[2]]  # ID of raster nums to base time
        while count <= toiterate.max():
            for i in toiterate:
                if i == count:
                    newtime.append(newyears[count])
            count += 1
        print("Finished", str(chunknum), "chunks")
        chunk["time"] = newtime  # create new column in dataframe based on time
        outname = "CHIRPS_tanz_time2.csv"
        # append each output to same csv, using no header
        chunk.to_csv(pathlist[2]+outname, mode='a', header=None, index=None)
The parquet file format is ideal for the use case you described. You can efficiently read in a specific subset of columns with pd.read_parquet(path_to_file, columns=["foo", "bar"])
https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
At the moment I am working "like" you, just at a smaller scale, which is why I don't have a PoC for my suggestion.
However, I've had success using pickle as a caching system and outsourcing the execution of various functions into files, executing those files from my command/main file; for example, I use a prepare_use.py to convert object types and split a dataset into test, validation, and prediction sets.
How does my caching with pickle work?
I use strings to access pickle files that are created dynamically, depending on which parameters and datasets were passed (with that I try to capture and determine whether the program has already been run, using .shape for the dataset and a dict for the passed parameters).
With these measures, I get a string I can use to try to find and read a .pickle file and, if found, skip the processing time and jump straight to the step I'm currently working on.
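A compact sketch of that idea (the key derivation and file naming are just one way to do it):
import hashlib
import os
import pickle

def cached_step(name, params, df_shape, compute):
    """Run compute() once per (name, params, input shape) and cache the result."""
    key = hashlib.md5(repr((name, sorted(params.items()), df_shape)).encode()).hexdigest()
    path = f"cache_{name}_{key}.pickle"
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)      # found: skip the recomputation
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result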
Using databases I encountered similar problems, which is why I found joy in using this solution; however, there are certainly many constraints, for example having to store huge pickle sets due to redundancy.
Updating a table from before to after a transformation can be done with proper indexing; validating information opens up a whole other book (I tried consolidating crawled rent data and basically stopped using a database after 2 hours, as I would have liked to jump back after every transformation process).
I hope my 2 cents help you in some way.
Greetings.
