I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.
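E.g.:
from joblib import Parallel, delayed

# df is a module-level global that every worker reads
df = db.query("select id, a_lot_of_data from table")

def process(id):
    # look up the row(s) for this id and process them
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())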
Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).
The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow and also requires many times the memory of the original DataFrame, since every process holds its own copy.
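To get a feel for that cost, a rough sketch can measure how large one pickled copy is. The toy DataFrame and its size are illustrative assumptions, not the asker's data; the point is only that each worker receiving the frame pays for one serialized payload and one independent in-memory copy.

```python
import pickle

import numpy as np
import pandas as pd

# Toy stand-in for the large frame (assumption: 100,000 rows, two columns).
df = pd.DataFrame({
    "id": np.arange(100_000),
    "a_lot_of_data": np.random.default_rng(0).random(100_000),
})

# One pickled copy is roughly what has to be shipped to a worker process.
payload = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)
in_memory_bytes = int(df.memory_usage(deep=True).sum())

print(f"in memory: {in_memory_bytes:,} bytes, pickled: {len(payload):,} bytes")

# Unpickling produces an independent copy, so n workers hold n copies.
copy = pickle.loads(payload)
```

Multiplying the pickled size by the number of workers gives a first estimate of the extra traffic and memory involved.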
One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to select subsets of data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays not Pandas objects, so it also has some complexity costs.
In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
Python multiprocessing is typically done using separate processes, as you noted, meaning that the processes don't share memory. There's a potential workaround if you can get things to work with np.memmap as mentioned a little farther down the joblib docs, though dumping to disk will obviously add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
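A minimal sketch of the memmap idea, using plain NumPy only: the array is written to disk once, and each worker would reopen it read-only instead of receiving a pickled copy. The file name, the array contents, and the helper `open_shared` are hypothetical; this does not show joblib's own memmapping machinery, only the underlying mechanism.

```python
import os
import tempfile

import numpy as np

# Hypothetical numeric payload standing in for the DataFrame's values.
data = np.arange(1_000_000, dtype=np.float64)

# Write the array to disk once, in the parent process.
path = os.path.join(tempfile.mkdtemp(), "shared_data.dat")
writer = np.memmap(path, dtype=data.dtype, mode="w+", shape=data.shape)
writer[:] = data
writer.flush()
del writer

def open_shared(path, dtype, shape):
    """Open the on-disk array read-only; pages are loaded lazily by the OS."""
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)

# Each worker would call this instead of receiving a pickled copy.
view = open_shared(path, data.dtype, data.shape)
chunk_mean = float(view[:1000].mean())
```

Because the file is opened with mode="r", an accidental write from a worker raises an error rather than silently changing the shared data.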
Related
I'm trying to load a dask dataframe from a MySQL table which takes about 4 GB of space on disk. I'm using a single machine with 8 GB of memory, but as soon as I call drop_duplicates and try to get the length of the dataframe, I run into an out-of-memory error.
Here's a snippet of my code:
import dask.dataframe as dd
from sqlalchemy import sql

df = dd.read_sql_table("testtable", db_uri, npartitions=8, index_col=sql.func.abs(sql.column("id")).label("abs(id)"))
df = df[['gene_id', 'genome_id']].drop_duplicates()
print(len(df))
I have tried more partitions for the dataframe (as many as 64), but they also failed. I'm confused about why this would cause an OOM error. The dataframe should fit in memory even without any parallel processing.
"which takes about 4gb space on disk"
It is very likely to be much bigger than this in memory. Disk storage is optimised for compactness, with various encoding and compression mechanisms.
"The dataframe should fit in memory"
So, have you measured its size as a single pandas dataframe?
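One way to take that measurement is `DataFrame.memory_usage(deep=True)`. The sketch below uses a small toy frame whose column names are borrowed from the question; the data itself is invented for illustration.

```python
import pandas as pd

# Toy frame with column names borrowed from the question (data is made up).
df = pd.DataFrame({
    "gene_id": [f"gene_{i % 1000}" for i in range(50_000)],
    "genome_id": [i % 50 for i in range(50_000)],
})

# deep=True also counts the memory held by object (e.g. string) values.
shallow = int(df.memory_usage(deep=False).sum())
deep = int(df.memory_usage(deep=True).sum())

print(f"shallow: {shallow / 1e6:.1f} MB, deep: {deep / 1e6:.1f} MB")
```

Running this on a representative sample of the real table and scaling up by the row count gives a rough in-memory size to compare against the 8 GB available.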
You should also keep in mind that any processing you do on your data often involves making temporary copies within functions. For example, you can only drop duplicates by first finding the duplicates, which must happen before you can discard any data.
Finally, in a parallel framework like dask, there may be multiple threads and processes (you don't specify how you are running dask) which need to marshal their work and assemble the final output while the client and scheduler also take up some memory. In short, you need to measure your situation, perhaps tweak worker config options.
You don't want to read an entire DataFrame into a Dask DataFrame and then perform filtering in Dask. It's better to perform filtering at the database level and then read a small subset of the data into a Dask DataFrame.
MySQL can select columns and drop duplicates with distinct. The resulting data is what you should read in the Dask DataFrame.
See here for more information on syntax. It's easiest to query databases that have official connectors, like dask-snowflake.
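A small sketch of pushing the work into the database: an in-memory SQLite database stands in for the MySQL server (an assumption made so the example is self-contained), and the table contents are invented. The database selects the two columns and removes duplicates, so only the reduced result is transferred.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL server (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtable (id INTEGER, gene_id TEXT, genome_id INTEGER)")
conn.executemany(
    "INSERT INTO testtable VALUES (?, ?, ?)",
    [(1, "g1", 10), (2, "g1", 10), (3, "g2", 10), (4, "g2", 20), (5, "g2", 20)],
)

# Let the database pick the columns and remove duplicates before any transfer.
query = "SELECT DISTINCT gene_id, genome_id FROM testtable"
distinct_df = pd.read_sql_query(query, conn)

print(len(distinct_df))
```

If the de-duplicated result is still needed as a Dask DataFrame, it could then be wrapped with `dd.from_pandas`, assuming it fits in memory at that point.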
I have a VERY large data structure on which I need to run multiple functions, none of which mutate it (so there is no risk of a race condition). I simply want to get the results faster by running these functions in parallel, for example, getting percentile values for a large data set.
How would I achieve this with multiprocessing without having to create a copy of the data each time a process starts, which would end up making things slower than if I hadn’t bothered in the first place?
(The absence of a code example is on purpose, as I don’t think the details of the data structure and functions are in any way relevant.)
TL;DR
I want to allow workers to use a scattered Pandas Dataframe, but not allow them to mutate any data. Look below for sample code. Is this possible? (or is this a pure Pandas question?)
Full question
I'm reading a Pandas Dataframe, and scattering it to the workers. I then use this future when I submit new tasks, and store it in a Variable for easy access.
Sample code:
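import pyarrow.parquet as pq
from dask.distributed import Client, Variable

client = Client()  # assumption: connects to (or starts) the cluster in use
df = pq.read_table('/data/%s.parq' % dataset).to_pandas()
df = client.scatter(df, broadcast=True, direct=True)
v = Variable('dataset')
v.set(df)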
When I submit a job I use:
def top_ten_x_based_on_y(dataset, column):
    return dataset.groupby(dataset['something'])[column].mean().sort_values(ascending=False)[0:10].to_dict()

a = client.submit(top_ten_x_based_on_y, df, 'a_column')
Now, I want to run 10-20 QPS on this dataset, which all workers have in memory (data < RAM), but I want to protect against accidental changes to the dataset, such as one worker "corrupting" its own memory, which can lead to inconsistencies. Preferably, an exception would be raised on any attempt to modify it.
The data set is roughly 2GB.
I understand this might be problematic since a Pandas DataFrame itself is not immutable (although a NumPy array can be made read-only).
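For reference, the NumPy mechanism mentioned here looks like the following sketch. It only demonstrates the array-level flag; whether a DataFrame built on top of such an array stays protected depends on whether pandas copies the data, so this is not a guarantee for the DataFrame case.

```python
import numpy as np

arr = np.arange(5, dtype=np.float64)

# Mark the buffer read-only; later in-place writes raise ValueError.
arr.setflags(write=False)

try:
    arr[0] = 99.0
    write_blocked = False
except ValueError:
    write_blocked = True

print(write_blocked)
```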
Other ideas:
Copy the dataset on each query, but a 2GB copy takes time even in RAM (roughly 1.4 seconds).
Devise a way to hash a dataframe (problematic in itself, even though hash_pandas_object now exists), and check before and after (or every minute) whether the dataframe is the same as expected. Running hash_pandas_object takes roughly 5 seconds.
Unfortunately Dask currently offers no additional features on top of Python to avoid mutation in this way. Dask just runs Python functions, and those Python functions can do anything they like.
Your suggestions of copying or checking before running operations seem sensible to me.
You might also consider raising this as a question or feature request to Pandas itself.
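The check-before-and-after idea from the question could be sketched roughly as follows. The helper names `fingerprint` and `run_checked` are hypothetical, the toy frame is invented, and this detects mutation after the fact rather than preventing it; on a 2GB frame the hashing cost quoted in the question would apply to every call.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

def fingerprint(df: pd.DataFrame) -> str:
    """Return a hex digest summarising the frame's index and values."""
    row_hashes = hash_pandas_object(df, index=True).to_numpy()
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()

def run_checked(func, df, *args):
    """Run func(df, *args) and raise if the frame was modified in place."""
    before = fingerprint(df)
    result = func(df, *args)
    if fingerprint(df) != before:
        raise RuntimeError("dataset was mutated by the task")
    return result

df = pd.DataFrame({"something": ["a", "b", "a"], "a_column": [1.0, 2.0, 3.0]})

# A well-behaved, read-only task passes the check.
means = run_checked(
    lambda d, col: d.groupby("something")[col].mean().to_dict(), df, "a_column"
)
```

A task that assigns into the frame in place would trip the RuntimeError instead of silently leaving a worker with inconsistent data.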
I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables with certain columns indexed. I will be reading subsets of these data in a pandas DataFrame and performing computations.
I have attempted the following:
Download the table into a csv file using a utility, read the csv file chunk by chunk using pandas, and append to an HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
However, now when I am trying to download data directly from Oracle DB and post it to HDF5 file via pandas.HDFStore, I run into some problems.
pandas.io.sql.read_frame does not support chunked reading. I don't have enough RAM to be able to download the entire data to memory first.
If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages, as the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns = ['a','b','c'], dtype=my_dtype); however, the created DataFrame always infers the dtype rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch, e.g. a float64 column may be interpreted as int64 in one chunk.
I would appreciate it if you could offer your thoughts and point me in the right direction.
Well, the only practical solution for now is to use PyTables directly, since it's designed for out-of-core operation (data that does not fit in memory). It's a bit tedious, but not that bad:
http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
Another approach, using Pandas, is here:
"Large data" work flows using pandas
Okay, so I don't have much experience with Oracle databases, but here are some thoughts:
Your access time for any particular record from Oracle is slow because of the lack of indexing and the fact that you want the data in timestamp order.
Firstly, can't you enable indexing on the database?
If you can't manipulate the database, you can presumably request a found set that only includes the ordered unique ids for each row?
You could potentially store this data as a single array of unique ids, which should fit into memory. If you allow 4k for every unique key (a conservative estimate that includes overhead etc.), and you don't keep the timestamps, so it's just an array of integers, it might use up about 1.1GB of RAM for 3 million records. That's not a whole heap, and presumably you only want a small window of active data, or perhaps you are processing row by row?
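As a rough cross-check on whether the ids fit in memory, and assuming they are stored as 64-bit integers in a NumPy array (an assumption; a Python list of int objects costs noticeably more per element), the raw array size is easy to compute:

```python
import numpy as np

n_records = 3_000_000

# Assumption: ids fit in signed 64-bit integers (8 bytes each).
ids = np.zeros(n_records, dtype=np.int64)

array_bytes = ids.nbytes
print(f"{array_bytes / 1e6:.0f} MB for {n_records:,} ids")
```

Under that assumption the id array is small compared with typical RAM, which leaves plenty of headroom for per-key overhead.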
Make a generator function to do all of this. That way, once you complete iteration it should free up the memory, without having to del anything, and it also makes your code easier to follow and avoids bloating the actual important logic of your calculation loop.
If you can't store it all in memory, or this doesn't work for some other reason, then the best thing you can do is work out how much you can store in memory. You can potentially split the job into multiple requests and use multithreading to send the next request once the last one has finished, while you process the data into your new file. It shouldn't use up memory until you ask for the data to be returned. Try to work out whether the delay comes from the request being fulfilled or from the data being downloaded.
From the sounds of it, you might be abstracting the database, and letting pandas make the requests. It might be worth looking at how it's limiting the results. You should be able to make the request for all the data, but only load the results one row at a time from the database server.
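The generator idea and the incremental loading can be combined in a sketch like the one below. An in-memory SQLite database stands in for the Oracle connection (an assumption so the example runs on its own), the function name `iter_chunks`, the table, and the column names are hypothetical, and `astype` is used here to force the declared dtypes on every chunk instead of relying on per-chunk inference.

```python
import sqlite3

import pandas as pd

def iter_chunks(cursor, columns, dtypes, chunk_size=2):
    """Yield DataFrames of at most chunk_size rows with fixed dtypes."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        # astype enforces the declared dtypes instead of per-chunk inference.
        yield pd.DataFrame(rows, columns=columns).astype(dtypes)

# In-memory SQLite stands in for the Oracle connection (demo assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b REAL, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1.0, "x"), (2, 2.5, "y"), (3, 3.0, "z")],
)

cursor = conn.execute("SELECT a, b, c FROM t")
my_dtype = {"a": "int64", "b": "float64", "c": "object"}
chunks = list(iter_chunks(cursor, ["a", "b", "c"], my_dtype))
```

In a real run each chunk would be appended to the HDF5 table as it arrives rather than collected in a list. Columns that contain NULLs cannot be cast to int64 and would need a float or nullable dtype instead.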
I want to use Pandas to work with series in real-time. Every second, I need to add the latest observation to an existing series. My series are grouped into a DataFrame and stored in an HDF5 file.
Here's how I do it at the moment:
>>> from pandas import Series
>>> existing_series = Series([7,13,97], [0,1,2])
>>> updated_series = existing_series.append( Series([111], [3]) )
Is this the most efficient way? I've read countless posts but cannot find any that focuses on efficiency with high-frequency data.
Edit: I just read about the shelve and pickle modules. It seems like they would achieve what I'm trying to do, which is basically to save lists on disk. Because my lists are large, is there any way to avoid loading the full list into memory and instead efficiently append values one at a time?
Take a look at the new PyTables docs in 0.10 (coming soon) or you can get from master. http://pandas.pydata.org/pandas-docs/dev/whatsnew.html
PyTables is actually pretty good at appending, and writing to an HDFStore every second will work. You want to store a DataFrame table. You can then select data in a query-like fashion, e.g.
store.append('df', the_latest_df)
store.append('df', the_latest_df)
....
store.select('df', [ 'index>12:00:01' ])
If this is all from the same process, then this will work great. If you have a writer process and then another process is reading, this is a little tricky (but will work correctly depending on what you are doing).
Another option is to use messaging to transmit from one process to another (and then append in memory), this avoids the serialization issue.
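For the in-memory append, one hedged sketch is to buffer incoming observations in plain Python lists and build a Series only when it is needed, since growing a Series one element at a time copies the whole Series on each call. The class name `ObservationBuffer` is hypothetical, and the messaging layer that would deliver the observations is left out. Note also that `Series.append` was removed in pandas 2.0, where `pd.concat` is the replacement.

```python
import pandas as pd

class ObservationBuffer:
    """Collect (index, value) pairs cheaply and build a Series on demand."""

    def __init__(self):
        self._index = []
        self._values = []

    def add(self, idx, value):
        # list.append is amortised O(1); no Series is copied here.
        self._index.append(idx)
        self._values.append(value)

    def to_series(self):
        return pd.Series(self._values, index=self._index)

buf = ObservationBuffer()
for idx, value in enumerate([7, 13, 97, 111]):
    buf.add(idx, value)

updated_series = buf.to_series()
```

The buffer could be flushed to the HDFStore in batches, which keeps the per-second cost to a list append.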