How to get around slow groupby for a sparse matrix? - python

I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.
Here is an example dataset:
import pandas as pd

data = {'date': ['01', '01', '01', '02', '02', '03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns=['date', 'action'])
I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.
# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)
# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()
# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)
I've also tried:
1) getting the dummies in non-sparse format;
2) using groupby for aggregation;
3) going to sparse format before the matrix multiplication.
But I fail at step 1, since there is not enough RAM to create such a large matrix.
I would greatly appreciate your help.

I came up with an answer using only sparse matrices based on this post. The code is fast, taking about 10 seconds for 10 million rows (my previous code took 6 minutes for 5000 rows and was not scalable).
The time and memory savings come from working with sparse matrices until the very last step when it is necessary to unravel the (already small) co-occurrence matrix before export.
## Imports
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

## Get unique values for date and action
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

## Add an auxiliary variable
df['count'] = 1

## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                           shape=(date_c.categories.size, action_c.categories.size))

## Compute dot product with sparse matrix
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)

## Unravel co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(),
                    index=action_c.categories, columns=action_c.categories)
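For reference, on the sample DataFrame from the question (and assuming each (date, action) pair occurs at most once per day, so the entries stay binary), the diagonal of cooc counts the number of days each action occurred and the off-diagonal entries count the days on which two actions occurred together:

print(cooc)
# Expected values (display formatting may differ):
#             100  101  777  989855552
# 100           2    1    0          2
# 101           1    1    0          1
# 777           0    0    1          0
# 989855552     2    1    0          2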

There are a couple of fairly straightforward simplifications you can consider.
One of them is that you can call max() directly on the GroupBy object; you don't need the fancy indexing on all the columns, since that's what it returns by default:
df = df.groupby('date').max()
Second, you can disable sorting in the GroupBy. As the Pandas reference for groupby() says:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
So try that as well:
df = df.groupby('date', sort=False).max()
Third, you can also use a simple pivot_table() to produce the same result:
df = df.pivot_table(index='date', aggfunc='max')
Yet another approach is to go back to your "actions" DataFrame, turn it into a MultiIndex, use it for a simple Series, and then call unstack() on it. That should get you the same result without the get_dummies() step (though I'm not sure whether this will drop some of the sparseness properties you're currently relying on):
actions_df = pd.DataFrame(data, columns = ['date', 'action'])
actions_index = pd.MultiIndex.from_frame(actions_df, names=['date', ''])
actions_series = pd.Series(1, index=actions_index)
df = actions_series.unstack(fill_value=0)
Your supplied sample DataFrame is quite useful for checking that these are all equivalent and produce the same result, but unfortunately not that great for benchmarking. I suggest you build a larger dataset (still smaller than your real data, like 10x smaller or perhaps 40-50x smaller; see the sketch at the end of this answer) and then benchmark the operations to check how long they take.
If you're using Jupyter (or another IPython shell), you can use the %timeit command to benchmark an expression.
So you can enter:
%timeit df.groupby('date').max()
%timeit df.groupby('date', sort=False).max()
%timeit df.pivot_table(index='date', aggfunc='max')
%timeit actions_series.unstack(fill_value=0)
And compare results, then scale up and check whether the whole run will complete in an acceptable amount of time.
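If you need a larger test dataset for these benchmarks, here is a minimal sketch for generating one synthetically (the row/date/action counts and the random generator settings are just placeholder assumptions, scale them to taste):

import numpy as np
import pandas as pd

n_rows = 1_000_000      # assumed size, well below the real ~200 million rows
n_actions = 10_000
n_dates = 365

rng = np.random.default_rng(0)
big = pd.DataFrame({
    'date': rng.integers(0, n_dates, size=n_rows).astype(str),
    'action': rng.integers(0, n_actions, size=n_rows),
})

After running big through the same get_dummies/concat preprocessing as in the question, you can point the %timeit lines above at it and extrapolate to the full dataset.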

Related

Use Dask to Drop Highly Correlated Pairwise Features in Dataframe?

Having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise correlated columns if their correlation is above a 0.99 threshold. I CAN'T use Pandas' correlation function, as my dataset is too large and it eats up my memory in a hurry. What I have now is a slow, double for loop that starts with the first column and finds the correlation between it and all the other columns one by one; if it's above 0.99, it drops that second column, then starts over at the new second column, and so on and so forth, KIND OF like the solution found here. However, doing this iteratively across all columns is unbearably slow, although it is at least possible to run it without hitting memory issues.
I've read the API here, and see how to drop columns using Dask here, but I need some assistance in getting this figured out. I'm wondering if there's a faster, yet memory-friendly, way of dropping highly correlated columns in a Pandas Dataframe using Dask? I'd like to feed a Pandas dataframe into the function, and have it return a Pandas dataframe after the correlation dropping is done.
Anyone have any resources I can check out, or have an example of how to do this?
Thanks!
UPDATE
As requested, here is my current correlation dropping routine as described above:
print("Checking correlations of all columns...")
cols_to_drop_from_high_corr = []
corr_threshold = 0.99
for j in df.iloc[:,1:]: # Skip column 0
try: # encompass the below in a try/except, cuz dropping a col in the 2nd 'for' loop below will screw with this
# original list, so if a feature is no longer in there from dropping it prior, it'll throw an error
for k in df.iloc[:,1:]: # Start 2nd loop at first column also...
# If comparing the same column to itself, skip it
if (j == k):
continue
else:
try: # second try/except mandatory
correlation = abs(df[j].corr(df[k])) # Get the correlation of the first col and second col
if correlation > corr_threshold: # If they are highly correlated...
cols_to_drop_from_high_corr.append(k) # Add the second col to list for dropping when round is done before next round.")
except:
continue
# Once we have compared the first col with all of the other cols...
if len(cols_to_drop_from_high_corr) > 0:
df = df.drop(cols_to_drop_from_high_corr, axis=1) # Drop all the 2nd highly corr'd cols
cols_to_drop_from_high_corr = [] # Reset the list for next round
# print("Dropped all cols from most recent round. Continuing...")
except: # Now, if the first for loop tries to find a column that's been dropped already, just continue on
continue
print("Correlation dropping completed.")
UPDATE
Using the solution below, I'm running into a few errors and due to my limited dask syntax knowledge, I'm hoping to get some insight. Running Windows 10, Python 3.6 and the latest version of dask.
Using the code as is on MY dataset (the dataset in the link says "file not found"), I ran into the first error:
ValueError: Exactly one of npartitions and chunksize must be specified.
So I specify npartitions=2 in the from_pandas, then get this error:
AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'
I tried changing that to .rechunk('auto'), but then got error:
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes
My original dataframe is in the shape of 1275 rows, and 3045 columns. The dask array shape says shape=(nan, 3045). Does this help to diagnose the issue at all?
I'm not sure if this helps, but maybe it could be a starting point.
Pandas
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = pd.read_csv(url)

# we check correlation for these columns only
cols = df.columns[-8:]

# columns in this df don't have a big
# correlation coefficient
corr_threshold = 0.5

corr = df[cols].corr().abs().values
# we take the upper triangular only
corr = np.triu(corr)

# we want high correlation but not diagonal elements
# it returns a bool matrix
out = (corr != 1) & (corr > corr_threshold)

# for every row we want only the True columns
cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
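As a side note, since out is already a boolean matrix, the row loop can be collapsed into a single reduction that should be equivalent (a column is dropped if it is flagged in any row):

# any column flagged as highly correlated with an earlier column
cols_to_remove = cols[out.any(axis=0)].to_list()
df = df.drop(cols_to_remove, axis=1)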
Dask
Here I comment only on the steps that are different from pandas:
import dask.dataframe as dd
import dask.array as da

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"
df = dd.read_csv(url)

cols = df.columns[-8:]
corr_threshold = 0.5

corr = df[cols].corr().abs().values
# with dask we need to rechunk
corr = corr.compute_chunk_sizes()
corr = da.triu(corr)

out = (corr != 1) & (corr > corr_threshold)
# dask is lazy
out = out.compute()

cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()
cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
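Regarding the errors in the question's update: this is only an assumption on my part (I haven't run it against that dataset), but the nan chunk sizes usually come from a dask array built from a dask DataFrame whose partition lengths are unknown. One way to sidestep both the compute_chunk_sizes() call and the rechunking error is to ask dask to compute the lengths when converting, e.g.:

import dask.dataframe as dd
import dask.array as da

# `pdf` is your in-memory pandas DataFrame (placeholder name)
ddf = dd.from_pandas(pdf, npartitions=2)   # npartitions must be given explicitly
cols = ddf.columns[-8:]

corr = ddf[cols].corr().abs()
# to_dask_array(lengths=True) computes the partition lengths,
# so the resulting array has known (non-nan) chunks
corr = corr.to_dask_array(lengths=True)
corr = da.triu(corr)
out = ((corr != 1) & (corr > 0.99)).compute()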

Computation between two large columns in a Pandas Dataframe

I have a dataframe that has 2 columns of zipcodes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes with no completion, so I feel what I'm doing is extremely inefficient.
Here is the code
import pgeocode

dist = pgeocode.GeoDistance('us')

def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency wise that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
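For intuition, this is roughly what a vectorized haversine looks like in numpy. It's a sketch of the general idea rather than pgeocode's exact implementation; the 6371 km Earth radius and the degree inputs are assumptions:

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # inputs are numpy arrays of degrees; every row is processed in one pass
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

Because everything stays inside numpy, there is no per-row Python overhead, which is where the speedup over apply() comes from.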

Simple Pandas df calculation lost in infinity

Setup
I have a medium-sized df (600K by 40) and I am trying to add Series values together element-wise (by index) and then create a new column with the resulting values. However, it is taking over 24 hours and has not yet finished.
First I make two series (from the original df with some constraints):
Series1 = df.loc[df['ColumnX'] == 5, 'ColumnY']
Series2 = df.loc[df['ColumnX'] == 6, 'ColumnY']
Second, I add them together and insert the result as a new column in the original df:
df['column1plus2'] = Series1 + Series2
It simply shouldn't take longer than 24 hours on a weak-medium powered server to compute, should it? Am I doing something fundamentally wrong?
Because of the mutually exclusive nature of your selection (5 vs 6), the indexes of all rows in Series1 are different from the indexes of all rows in Series2. The operator + uses data alignment to find the matching rows, and it cannot. So, instead it creates matching dummies with the values of NaN and adds them to the values from your series. (The result is also a NaN, of course.) For example, if you had row #10 in Series1 (with the value of, say, 3.14), there will be no row with the same number in Series2. Pandas will create row #10 in Series2 and set its value to NaN, because it doesn't know any better. The result of the summation in row #10 is now 3.14+NaN=NaN.
This explains why your code is wrong but not necessarily why it is slow. I would guess that data alignment is a very slow operation in the presence of so many missing values.
Did you mean to stack Series1 and Series2 instead of arithmetically adding them? If so, you should do pd.concat([Series1, Series2]).
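A quick illustration of the alignment behaviour described above, on toy data (not from the original question):

import pandas as pd

s1 = pd.Series([3.14, 2.72], index=[10, 11])  # e.g. rows where ColumnX == 5
s2 = pd.Series([1.0, 2.0], index=[20, 21])    # e.g. rows where ColumnX == 6

print(s1 + s2)              # all NaN: the indexes are disjoint, nothing aligns
print(pd.concat([s1, s2]))  # stacks the two selections into one longer Series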

Efficient read and write of pandas dataframe

I have a pandas dataframe that I want to split into several smaller pieces of 100k rows each, then save to disk so that I can read the data back in and process it piece by piece. I have tried using dill and HDF storage, as csv and raw text appear to take a lot of time.
I am trying this out on a subset of the data with ~500k rows and five columns of mixed data. Two contain strings, one integers, one floats, and the final one contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.
It is the last column that I am having problems with. Dumping and loading the data goes without issue, but when I try to actually access the data it is instead a pandas.Series object. Secondly, each row in that Series is a tuple which contains the whole dataset instead.
# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts' which has 1400.
# Meaning that df['counts'] gives me a sparse matrix that is 100k x 1400.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts

df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])

dill.dump(df, open(file[i], 'w'))
df = dill.load(file[i])

print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400) # 496718 is the number of rows in my complete data set.
print(type(df['counts']))
> <type 'tuple'>
Am I making any obvious mistake, or is there a better way to store this data in this format, one which isn't very time consuming? It has to be scalable to my full data containing 100 million rows.
df['counts'] = counts
This will produce a Pandas Series (column) with the number of elements equal to len(df), where each element is a sparse matrix, as returned by vectorizer.fit_transform(df['string_data']).
You can try to do the following instead:
df = df.join(pd.DataFrame(counts.A, columns=vectorizer.get_feature_names(), index=df.index))
NOTE: be aware this will explode your sparse matrix into a dense (not sparse) DataFrame, so it will use much more memory and you may end up with a MemoryError.
CONCLUSION:
That's why I'd recommend storing your original DF and the counts sparse matrix separately.
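A minimal sketch of that "store them separately" idea (the file names here are placeholders, and save_npz/load_npz need a reasonably recent scipy):

import pandas as pd
from scipy import sparse

# save: tabular columns and the sparse counts go to separate files
df.drop(columns=['counts']).to_pickle('df_part.pkl')
sparse.save_npz('counts_part.npz', counts)

# load
df = pd.read_pickle('df_part.pkl')
counts = sparse.load_npz('counts_part.npz')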

Creation of large pandas DataFrames from Series

I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.
As of now, to conserve memory I've taken the approach of creating a list of Series (each column is a series, so ~15K-20K Series each containing ~250K rows). Then I create a SparseDataFrame containing every index within these series (because as you notice, this is a large but not very dense dataset). The issue is this becomes extremely slow, and appending each column to the dataset takes several minutes. To overcome this I've tried batching the merges as well (select a subset of the data, merge these to a DataFrame, which is then merged into my main DataFrame), but this approach is still too slow. Slow meaning it only processed ~4000 columns in a day, with each append causing subsequent appends to take longer as well.
One part which struck me as odd is why my column count of the main DataFrame affects the append speed. Because my main index already contains all entries it will ever see, I shouldn't have to lose time due to re-indexing.
In any case, here is my code:
import time
import sys
import numpy as np
import pandas as pd

precision = 6
df = []
for index, i in enumerate(raw):
    if i is None:
        break
    if index % 1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision), int(intensity)) for mz, intensity in i.scans]),
                        dtype='uint16', name=i.rt))

all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())
print len(all_indices)

t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices

while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind = 0
    while subset:
        t2 = time.time()
        ind += 1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind, time.time()-t, time.time()-t2
    df2.reindex(main_df.index)
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time()-t, time.time()-t2
    print len(df), 'entries remain'
Any advice on how I can load this large dataset quickly is appreciated, even if it means writing it to disk to some other format first/etc.
Some additional info:
1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.
2) The data will be queried across columns and rows when it is in use. So main_df.loc[row:row_end, col:col_end]. These aren't predictable block sizes so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 a second to be realistically useful.
3) I have 32G of memory, so a SparseDataFrame I think is the best option since it fits in memory and allows fast lookups as needed. Just the creation of it is a pain at the moment.
Update:
I ended up using scipy sparse matrices and handling the indexing on my own for the time being. This results in appends at a constant rate of ~0.2 seconds, which is acceptable (versus Pandas taking ~150 seconds per append for my full dataset). I'd love to know how to make Pandas match this speed.
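A rough sketch of what "scipy sparse matrices plus my own indexing" can look like; this is my reconstruction of the idea rather than the poster's actual code, and the class and method names are made up for illustration:

from scipy import sparse

class SparseTable:
    """Wrap a CSR matrix with label lookups to emulate .loc-style slicing."""
    def __init__(self, matrix, row_labels, col_labels):
        self.m = sparse.csr_matrix(matrix)
        self.rows = {label: i for i, label in enumerate(row_labels)}
        self.cols = {label: i for i, label in enumerate(col_labels)}

    def loc(self, row_start, row_end, col_start, col_end):
        # translate labels to positions, then slice the CSR matrix directly
        r0, r1 = self.rows[row_start], self.rows[row_end] + 1
        c0, c1 = self.cols[col_start], self.cols[col_end] + 1
        return self.m[r0:r1, c0:c1].toarray()

CSR slicing is cheap for row ranges, which is what makes lookups on the order of ~10 per second realistic.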
