Bootstrap Samples of a Dask Dataframe - python

I have a large data frame with all binary variables (a sparse matrix that was converted into pandas so that I can later convert to Dask). The dimensions are 398,888 x 52,034.
I am trying to create a much larger data frame that consists of 10,000 different bootstrap samples from the original data frame. Each sample is the same size as the original data. The final data frame will also have a column that keeps track of which bootstrap sample that row is from.
Here is my code:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# sample df (4 data columns, so 4 column names)
df_pd = pd.DataFrame(np.array([[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 1]]),
                     columns=['a', 'b', 'c', 'd'])
# convert into Dask dataframe
df_dd = dd.from_pandas(df_pd, npartitions=4)

B = 2  # eventually 10,000
big_df = dd.from_pandas(pd.DataFrame([]), npartitions=1000)
for i in range(B + 1):
    data = df_dd.sample(frac=1, replace=True, random_state=i)
    data["sample"] = i
    big_df.append(data)
The data frame produced by the loop is empty, but I cannot figure out why. To be more specific, if I look at big_df.head() I get: UserWarning: Insufficient elements for 'head'. 5 elements requested, only 0 elements available. Try passing larger 'npartitions' to 'head'. If I try print(big_df), I get: ValueError: No objects to concatenate.
My guess is there is at least a problem with this line, big_df = dd.from_pandas(pd.DataFrame([]), npartitions = 1000), but I have no idea.
Let me know if I need to clarify anything. I am somewhat new to Python and even newer to Dask, so even small tips or feedback that don't fully answer the question would be greatly appreciated. Thanks!

You are probably better off using dask.dataframe.concat and concatenating the sampled dataframes together -- still, there are a few problems:
append creates a new object, so you will have to save that object -> big_df = big_df.append(data)
Try calling big_df.head(npartitions=-1); it uses all partitions to get 5 rows (the appending/concatenating here can create small partitions with fewer than 5 rows).
It would also be good to write this first with plain Pandas before jumping to Dask. You might also be interested in reading through: https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask
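For reference, here is a minimal sketch of the concat-based approach (not part of the original answer; it reuses df_dd and B from the question):

import dask.dataframe as dd

samples = []
for i in range(B):
    # each bootstrap sample is the same size as the original data
    data = df_dd.sample(frac=1, replace=True, random_state=i)
    data["sample"] = i  # track which bootstrap sample each row came from
    samples.append(data)

# one concatenation at the end instead of repeated appends
big_df = dd.concat(samples)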

Related

How to extract the labels from sns.clustermap

If I'm plotting a (correlation) dataframe with sns.clustermap, it automatically takes the dataframe's MultiIndex as labels and plots them to the right of and below the clustermap.
How do I access these labels? I'm using clustermaps as an exploratory tool for large-ish datasets (100-200 entries) and I need the names of the entries in various clusters.
EXAMPLE:
import numpy as np
import pandas as pd
import seaborn as sns

elev = [1, 100, 10, 1000, 100, 10]
number = [1, 2, 3, 4, 5, 6]
name = ['foo', 'bar', 'baz', 'qux', 'quux', 'quuux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
data = np.random.rand(20, 6)
df = pd.DataFrame(data=data, columns=idx)
clustermap = sns.clustermap(df.corr())
This produces a clustermap (the plotted figure is not reproduced here).
Now I'd say that there are two distinct clusters: the first two rows and the last four rows, i.e. [foo-1-1, bar-100-2] and [baz-10-3, qux-1000-4, quux-100-5, quuux-10-6].
How can I extract these (or the whole [foo-1-1, bar-100-2, baz-10-3, qux-1000-4, quux-100-5, quuux-10-6] list)? With 100+ entries, just writing them down by hand isn't really an option.
The documentation offers clustergrid.dendrogram_row.reordered_ind, but that just gives me the index numbers of the original dataframe; I'm looking for something more like the output of df.columns.
That seems to be heading in the right direction, but I can only extract which cluster a given row belongs to when I let it form clusters automatically, whereas I'd like to define the clusters myself, visually.
As always with such things, the answer is out there, I just overlooked it.
This answer (pointed out by Trenton McKinney in comments) has the needed snippet in it:
ax_heatmap.yaxis.get_majorticklabels()
(I wouldn't have looked into ax_heatmap to get to that...). So, continuing the MWE from the question:
labels = clustermap.ax_heatmap.yaxis.get_majorticklabels()
However, that's a list of matplotlib.text.Text objects:
type(labels[0])
> matplotlib.text.Text
so unless I'm missing something (again), it's not exactly straightforward to use. However, it can simply be looped into something more useful. Let's say I'm interested in the whole name (i.e. the complete former df MultiIndex) and the number:
labels_list = []
number_list = []
for i in labels:
    i = str(i)
    name_start = i.find('\'') + 1
    name_end = i.rfind('\'')
    name = i[name_start:name_end]
    number_start = name.rfind('-') + 1
    number = name[number_start:]
    number = int(number)
    labels_list.append(name)
    number_list.append(number)
Now I've got two easily workable lists, one with full strings and one with ints.
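As a side note (not from the original answer): each matplotlib.text.Text object also has a get_text() method, so the string parsing above can probably be avoided. A minimal sketch, reusing labels from above:

# get_text() returns the tick label string, e.g. 'foo-1-1'
labels_list = [t.get_text() for t in labels]
# the trailing number after the last '-'
number_list = [int(s.rsplit('-', 1)[-1]) for s in labels_list]

Similarly, clustermap.dendrogram_row.reordered_ind combined with the original index, e.g. df.corr().index[clustermap.dendrogram_row.reordered_ind], should give the labels in the order they are plotted.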

Python numpy array filter

I have the following numpy array named data. It consists of 15,118 rows and 2 columns. The first column mostly consists of 0.01 steps, but sometimes there is an extra step in between (marked in red in the original screenshot, not shown here) which I would like to remove/filter out.
I achieved this with the following code:
import numpy as np

# Create array [0, 0.01, ..., 140], rounded to 2 decimals to prevent floating point error
b = np.round(np.arange(0, 140.01, 0.01), 2)
# New empty data array
new_data = np.empty(shape=[0, 2])
# Loop over values to remove/filter out data
for x in b:
    Index = np.where(x == data[:, 0])[0][0]
    new_data = np.vstack([new_data, data[Index]])
I feel like this code is far from optimal and I was wondering if anyone knows a faster/better way of achieving this?
Here's a solution using pandas for resampling. You can probably achieve the same result in pure numpy, but there are a number of floating point and rounding pitfalls you would face; it may be better to let a trusted library do the work for you.
Let's say arr is your data array and assume your index to be in fractions of seconds. You can convert your array to a dataframe with a timedelta index:
import pandas as pd

df = pd.DataFrame(arr[:, 1], index=arr[:, 0])
df.index = pd.to_timedelta(df.index, unit="s")
Then resampling is pretty easy: "10ms" is the frequency you want, and first() should give you the expected result, dropping everything but the records at the 10ms ticks, but feel free to experiment with other aggregation functions.
df = df.resample("10ms").first()
Eventually you could get back to your array with something like:
np.vstack([pd.to_numeric(df.index, downcast="float").values / 1e9,
           df.values.squeeze()]).T
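If you would rather stay in pure numpy, a vectorized version of the original loop could look like the sketch below. Note that, unlike the loop, np.isin keeps every row whose first column matches a value in b rather than only the first match per value, so it assumes those values are unique in data:

import numpy as np

b = np.round(np.arange(0, 140.01, 0.01), 2)
mask = np.isin(data[:, 0], b)   # True for rows sitting exactly on a 0.01 tick
new_data = data[mask]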

Updating the bunch of rows in panda in loop

I am running analytics on an edge device, and to compute everything I need pandas dataframes. Here is my problem: every 10 seconds I update a master pandas dataframe with a new set of rows. Some disagree with this approach, as it might hurt performance. append is the only way I know to add the rows; is there a more efficient way to update the dataframe? All I need is something like the list.append(x) or list.extend(x) API, but for pandas. Am I using the right API, or is there a more efficient alternative?
I do not have a memory issue, since I discard the data after some time.
snippet
df.append(self.__get_pd_frame(tracker_data), ignore_index=True)
# tracker_data - another panda data frame contains 100-200 rows
I changed from the append method to the from_records API, something like below:
data = np.array([[1, 3], [2, 4], [4, 5]])
pd.DataFrame.from_records(data, columns=("a", "b"))
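Another common pattern worth mentioning (a sketch, not from the original post): collect the incoming chunks in a plain Python list and concatenate them in one go with pd.concat, which avoids repeatedly copying a growing dataframe; DataFrame.append was deprecated and later removed in pandas 2.0 anyway. The incoming_batches iterable below is hypothetical:

import pandas as pd

chunks = []
for tracker_data in incoming_batches:   # hypothetical stream of small DataFrames
    chunks.append(tracker_data)         # cheap list append, no copying

master_df = pd.concat(chunks, ignore_index=True)  # single concatenation at the end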

How to get around slow groupby for a sparse matrix?

I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.
Here is an example dataset:
import pandas as pd

data = {'date': ['01', '01', '01', '02', '02', '03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns=['date', 'action'])
I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.
# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)
# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()
# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)
I've also tried:
1. getting the dummies in non-sparse format;
2. using groupby for aggregation;
3. going to sparse format before matrix multiplication.
But I fail in step 1, since there is not enough RAM to create such a large matrix.
I would greatly appreciate your help.
I came up with an answer using only sparse matrices based on this post. The code is fast, taking about 10 seconds for 10 million rows (my previous code took 6 minutes for 5000 rows and was not scalable).
The time and memory savings come from working with sparse matrices until the very last step when it is necessary to unravel the (already small) co-occurrence matrix before export.
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

## Get unique values for date and action
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)
## Add an auxiliary variable
df['count'] = 1
## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                           shape=(date_c.categories.size, action_c.categories.size))
## Compute dot product with sparse matrix
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)
## Unravel co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(),
                    index=action_c.categories, columns=action_c.categories)
There are a couple of fairly straightforward simplifications you can consider.
One of them is that you can call max() directly on the GroupBy object; you don't need the fancy indexing over all the columns, since that's what it returns by default:
df = df.groupby('date').max()
Second is that you can disable sorting of the GroupBy. As the Pandas reference for groupby() says:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
So try that as well:
df = df.groupby('date', sort=False).max()
Third is you can also use a simple pivot_table() to produce the same result.
df = df.pivot_table(index='date', aggfunc='max')
Yet another approach is going back to your "actions" DataFrame, turning it into a MultiIndex, using that for a simple Series, and then calling unstack() on it. That should get you the same result without the get_dummies() step (though I'm not sure whether it will drop some of the sparseness properties you're currently relying on).
actions_df = pd.DataFrame(data, columns = ['date', 'action'])
actions_index = pd.MultiIndex.from_frame(actions_df, names=['date', ''])
actions_series = pd.Series(1, index=actions_index)
df = actions_series.unstack(fill_value=0)
Your supplied sample DataFrame is quite useful for checking that these are all equivalent and produce the same result, but unfortunately not that great for benchmarking it... I suggest you take a larger dataset (but still smaller than your real data, like 10x smaller or perhaps 40-50x smaller) and then benchmark the operations to check how long they take.
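For instance, a synthetic stand-in could be generated along these lines (a sketch; the sizes and value ranges are illustrative, not taken from the question):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 2_000_000   # adjust towards your real size
bench_df = pd.DataFrame({'date': rng.integers(0, 365, n_rows).astype(str),
                         'action': rng.integers(0, 10_000, n_rows)})
# run the same get_dummies / concat preparation on bench_df before timing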
If you're using Jupyter (or another IPython shell), you can use the %timeit command to benchmark an expression.
So you can enter:
%timeit df.groupby('date').max()
%timeit df.groupby('date', sort=False).max()
%timeit df.pivot_table(index='date', aggfunc='max')
%timeit actions_series.unstack(fill_value=0)
And compare results, then scale up and check whether the whole run will complete in an acceptable amount of time.

Efficient read and write of pandas dataframe

I have a pandas dataframe that I want to split into several smaller pieces of 100k rows each, then save to disk so that I can read the data back in and process it piece by piece. I have tried using dill and HDF storage, as csv and raw text appear to take a lot of time.
I am trying this out on a subset of the data with ~500k rows and five columns of mixed data. Two contain strings, one integers, one floats, and the final one contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.
It is the last column that I am having problems with. Dumping and loading the data goes without issue, but when I try to actually access the data it is instead a pandas.Series object. Secondly, each element in that Series is a tuple which contains the whole dataset instead.
import dill
import numpy as np
import pandas
import sklearn.feature_extraction.text

# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts' which has 1400.
# Meaning that df['counts'] gives me a sparse matrix that is 100k x 1400.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts

df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])

dill.dump(df, open(file[i], 'wb'))
df = dill.load(open(file[i], 'rb'))

print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400) # 496718 is the number of rows in my complete data set.
print(type(df['counts'][0]))
> <type 'tuple'>
Am I making any obvious mistake, or is there a better way to store this data in this format, one which isn't very time consuming? It has to be scalable to my full data containing 100 million rows.
df['counts'] = counts
this will produce a Pandas Series (column) with the number of elements equal to len(df), where each element is a sparse matrix, as returned by vectorizer.fit_transform(df['string_data'])
you can try to do the following:
df = df.join(pd.DataFrame(counts.A, columns=vectorizer.get_feature_names(), index=df.index))
NOTE: be aware that this will explode your sparse matrix into a dense (not sparse) DataFrame, so it will use much more memory and you can end up with a MemoryError
CONCLUSION:
That's why I'd recommend you store your original DF and the counts sparse matrix separately
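A minimal sketch of that recommendation (not from the original answer; the file names are illustrative): save the plain columns with pandas and the sparse matrix with scipy, then recombine after loading.

import pandas as pd
import scipy.sparse

# plain columns (strings, ints, floats) go into a fast binary format
df.drop(columns=['counts']).to_hdf('chunk_0.h5', key='df', mode='w')
# the sparse bigram counts are stored on their own
scipy.sparse.save_npz('chunk_0_counts.npz', counts[0:100000])

# reading back
df_chunk = pd.read_hdf('chunk_0.h5', key='df')
counts_chunk = scipy.sparse.load_npz('chunk_0_counts.npz')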
