Efficient read and write of pandas dataframe - python

I have a pandas dataframe that I want to split into several smaller pieces of 100k rows each, then save to disk so that I can read the data back in and process it piece by piece. I have tried using dill and HDF storage, since CSV and raw text appear to take a lot of time.
I am trying this out on a subset of the data with ~500k rows and five columns of mixed data. Two contain strings, one integers, one floats, and the final one contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.
It is the last column that I am having problems with. Dumping and loading the data works without issue, but when I actually access the column it comes back as a pandas.Series object rather than a sparse matrix. Worse, each element of that Series is a tuple that contains the whole dataset instead of just that row's counts.
# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts', which has 1400.
# Meaning that df['counts'] gives me a sparse matrix that is 100k x 1400.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts
df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])
dill.dump(df, open(file[i], 'wb'))
df = dill.load(open(file[i], 'rb'))
print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400)  # 496718 is the number of rows in my complete data set.
print(type(df['counts'][0]))
> <type 'tuple'>
Am I making an obvious mistake, or is there a better way to store data in this format that isn't very time consuming? It has to scale to my full data set of 100 million rows.

df['counts'] = counts
This will produce a Pandas Series (column) with the number of elements equal to len(df), where each element is the entire sparse matrix returned by vectorizer.fit_transform(df['string_data']).
You can try the following instead:
df = df.join(pd.DataFrame(counts.A, columns=vectorizer.get_feature_names(), index=df.index))
NOTE: be aware that this will expand your sparse matrix into a dense (not sparse) DataFrame, so it will use much more memory and you may end up with a MemoryError
CONCLUSION:
That's why I'd recommend storing your original DF and the counts sparse matrix separately
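A minimal sketch of that split, assuming the frame is df, the CSR matrix is counts (as in the question), and the file names are placeholders:
import pandas as pd
from scipy import sparse
# Save one 100k-row chunk: the frame without 'counts', plus the matching rows of the sparse matrix.
df.drop(columns=['counts'], errors='ignore').iloc[0:100000].to_pickle('df_part0.pkl')
sparse.save_npz('counts_part0.npz', counts[0:100000])
# Read them back later:
df_chunk = pd.read_pickle('df_part0.pkl')
counts_chunk = sparse.load_npz('counts_part0.npz')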


Creating Dummy Variable with 200k unique values

I am trying to create dummy variables for a categorical dataset, but the problem is that Python runs out of RAM because the number of unique values is too large. It is a large dataset with 500k rows and 200k unique values. Is it possible to create dummy variables for 200k unique values?
Indeed, performing this operation takes a lot of RAM.
As far as programming solutions go, I can think of:
Dimensionality reduction: if there is some relation between your 200K categories, they may be reducible (e.g. hierarchy levels for those categories, so you can group categories and perform analyses by level: lvl1 = 10 categories, lvl2 = 100, etc.). May I ask: what type of data do you have that contains 200K unique category values?
Splitting up the dataset and combining results: I have the code below working with numpy. You end up with smaller subsets, each encoded against all 200K categories (even if certain categories aren't present in a given subset). Then you need to decide how to further process those subsets.
Somehow import statements were breaking formatting so I have them separate here:
import numpy as np
import random
And the rest of the code:
def np_one_hot_encode(n_categories: int, arr: np.array):
    # Performs one-hot encoding of arr based on n_categories.
    # Allows encoding smaller chunks of a bigger array,
    # even if the chunks do not contain one occurrence of each category,
    # while still producing n_categories columns for each chunk.
    result = np.zeros((arr.size, n_categories))
    result[np.arange(arr.size), arr] = 1
    return result

# Testing our encoding function:
# even if our input array doesn't contain all categories,
# the output does cater for all categories
encoded = np_one_hot_encode(3, np.array([1, 0]))
print('test np_one_hot_encode\n', encoded)
assert np.array_equal(encoded, np.array([[0, 1, 0], [1, 0, 0]]))

# Generating 500K rows with 200K unique categories present at least once
total = int(5e5)
nunique = int(2e5)
uniques = list(range(0, nunique))
random.shuffle(uniques)
values = uniques + (uniques * 2)[:total - nunique]
print('Rows count', len(values))
print('Uniques count', len(list(set(values))))

# Produces subsets of the data in (~500K/50 x nunique) shape:
n_chunks = 50
for i, chunk in enumerate(np.array_split(values, n_chunks)):
    print('chunk', i, 'shape', chunk.shape)
    encoded = np_one_hot_encode(nunique, chunk)
    print('encoded', encoded.shape)
And the output:
test np_one_hot_encode
[[0. 1. 0.]
[1. 0. 0.]]
Rows count 500000
Uniques count 200000
chunk 0 shape (10000,)
encoded (10000, 200000)
chunk 1 shape (10000,)
encoded (10000, 200000)
Distributed processing: with tools like Dask, Spark, etc. you can handle processing of the subsets.
Database: another solution I can think of is to normalize your model into a database (either relational or a "big" flat data model) where you could leverage indices to filter and process only part of the data (only certain rows and certain categories), thus allowing you to handle a smaller output in memory.
But in the end there is no magic: if you're ultimately trying to load an N x M matrix into memory with N = 500K and M = 200K, it will take the RAM it needs to take, and there is no way around that. So the most likely gains to be had are dimensionality reduction OR a different approach to data processing altogether (e.g. distributed computing).
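As a side note that is not part of the original answer: if a sparse representation is acceptable downstream, pandas can produce sparse-backed dummies, which store only the non-zero entries instead of a dense 500K x 200K block. A minimal sketch:
import pandas as pd
s = pd.Series([3, 1, 2, 1, 3])            # stand-in for the 200K-category column
dummies = pd.get_dummies(s, sparse=True)  # columns use a sparse dtype, so zeros are not stored
coo = dummies.sparse.to_coo()             # scipy COO matrix, if you need it for further processing
print(dummies.shape, coo.nnz)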

How to select n equally spaced rows from Dask dataframe?

I have a number of parquet files, where all of the chunks together are too big to fit into memory. I would like to load them into a Dask dataframe, compute some results (a cumsum), and then display the cumsum as a plot. For this reason I wanted to select an equally spaced subset of the data (some k rows) from the cumsum column, and then plot this subset. How would I do that?
You could try:
slices = 10  # or whatever
slice_point = int(df.shape[0] / slices)
for i in range(slices):
    current_sliced_df = df.loc[i * slice_point:(i + 1) * slice_point]
    # ... and do whatever you want with the current slice
I think that using df[serie].sample(...) (see the docs) would let you avoid having to code the selection of a representative subset of rows yourself.
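If you want exactly equally spaced rows rather than a random sample, one rough sketch (assuming dask.dataframe, a hypothetical 'value' column, and a placeholder parquet path) is to number the rows lazily and keep every k-th one:
import dask.dataframe as dd
ddf = dd.read_parquet('data/*.parquet')   # hypothetical path to the parquet files
ddf['cum'] = ddf['value'].cumsum()        # lazy cumulative sum over all rows
ddf['_row'] = 1
ddf['_row'] = ddf['_row'].cumsum() - 1    # lazy global row number
k = 1000                                  # keep every k-th row
subset = ddf[ddf['_row'] % k == 0][['_row', 'cum']].compute()   # small result, fits in memory
subset.plot(x='_row', y='cum')            # plot the equally spaced cumsum values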

How to get around slow groupby for a sparse matrix?

I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.
Here is an example dataset:
data = {'date': ['01', '01', '01', '02', '02', '03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns=['date', 'action'])
I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.
# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)
# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()
# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)
I've also tried:
getting the dummies in non-sparse format;
using groupby for aggregation;
going to sparse format before matrix multiplication.
But I fail in step 1, since there is not enough RAM to create such a large matrix.
I would greatly appreciate your help.
I came up with an answer using only sparse matrices based on this post. The code is fast, taking about 10 seconds for 10 million rows (my previous code took 6 minutes for 5000 rows and was not scalable).
The time and memory savings come from working with sparse matrices until the very last step when it is necessary to unravel the (already small) co-occurrence matrix before export.
## Imports used by this snippet
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

## Get unique values for date and action
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

## Add an auxiliary variable
df['count'] = 1

## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                           shape=(date_c.categories.size, action_c.categories.size))

## Compute dot product with sparse matrix
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)

## Unravel co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(),
                    index=action_c.categories, columns=action_c.categories)
There are a couple of fairly straightforward simplifications you can consider.
One of them is that you can call max() directly on the GroupBy object; you don't need the fancy indexing on all columns, since that's what it returns by default:
df = df.groupby('date').max()
Second is that you can disable sorting of the GroupBy. As the Pandas reference for groupby() says:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
So try that as well:
df = df.groupby('date', sort=False).max()
Third, you can also use a simple pivot_table() to produce the same result.
df = df.pivot_table(index='date', aggfunc='max')
Yet another approach is going back to your "actions" DataFrame, turning that into a MultiIndex and using it for a simple Series, then using unstack() on it. That should get you the same result without the get_dummies() step (though I'm not sure whether this will drop some of the sparseness properties you're currently relying on).
actions_df = pd.DataFrame(data, columns = ['date', 'action'])
actions_index = pd.MultiIndex.from_frame(actions_df, names=['date', ''])
actions_series = pd.Series(1, index=actions_index)
df = actions_series.unstack(fill_value=0)
Your supplied sample DataFrame is quite useful for checking that these are all equivalent and produce the same result, but unfortunately not that great for benchmarking it... I suggest you take a larger dataset (but still smaller than your real data, like 10x smaller or perhaps 40-50x smaller) and then benchmark the operations to check how long they take.
If you're using Jupyter (or another IPython shell), you can use the %timeit command to benchmark an expression.
So you can enter:
%timeit df.groupby('date').max()
%timeit df.groupby('date', sort=False).max()
%timeit df.pivot_table(index='date', aggfunc='max')
%timeit actions_series.unstack(fill_value=0)
And compare results, then scale up and check whether the whole run will complete in an acceptable amount of time.

How to read a CSV file subset by subset with Pandas?

I have a data frame with 13000 rows and 3 columns:
('time', 'rowScore', 'label')
I want to read subset by subset:
[[1..360], [360..712], ..., [12640..13000]]
I used a list too, but it's not working:
import pandas as pd
import math
import datetime

result = "data.csv"
dataSet = pd.read_csv(result)
TP = 0
count = 0
x = 0
df = pd.DataFrame(dataSet, columns=['rawScore', 'label'])

for i, row in df.iterrows():
    data = row.to_dict()
    ScoreX = data['rawScore']
    labelX = data['label']
    for i in range(1, 13000, 360):
        x = x + 1
        for j in range(i, 360 * x, 1):
            if (ScoreX > 0.3) and (labelX == 0):
                count = count + 1
print("count=", count)
You can also use the parameters nrows or skiprows to break the file up into chunks as you read it. I would recommend against using iterrows, since that is typically very slow. If you split the data up when reading in the values and save those chunks separately, you can skip the iterrows section entirely. This covers the file-reading side, if you want to split the data into chunks (which seems to be an intermediate step in what you're trying to do); a minimal sketch follows.
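A sketch of the chunked-reading idea, using read_csv's chunksize parameter (the column names come from the question's code; the 360-row chunk size is the subset size the question describes):
import pandas as pd
count = 0
for chunk in pd.read_csv('data.csv', chunksize=360):
    # each chunk is an ordinary DataFrame with up to 360 rows
    count += ((chunk['rawScore'] > 0.3) & (chunk['label'] == 0)).sum()
print('count =', count)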
Another way is to subset using generators by seeing if the values belong to each set:
[[1..360], [360..712], ..., [12640..13000]]
So write a function that yields chunks whose indices fall into each 360-row range, and pick the particular subset you need from it; a rough sketch follows.
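For example, a sketch assuming the 360-row subsets and the already-loaded df from the question:
def subsets(frame, size=360):
    # Yield consecutive slices of `size` rows each.
    for start in range(0, len(frame), size):
        yield frame.iloc[start:start + size]

for piece in subsets(df, size=360):
    print(len(piece))   # process each ~360-row subset here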
I just wrote these approaches down as alternative ideas you might want to play around with, since in some cases you may only want a subset and not all of the chunks for calculation purposes.

Creation of large pandas DataFrames from Series

I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.
As of now, to conserve memory I've taken the approach of creating a list of Series (each column is a series, so ~15K-20K Series each containing ~250K rows). Then I create a SparseDataFrame containing every index within these series (because as you notice, this is a large but not very dense dataset). The issue is this becomes extremely slow, and appending each column to the dataset takes several minutes. To overcome this I've tried batching the merges as well (select a subset of the data, merge these to a DataFrame, which is then merged into my main DataFrame), but this approach is still too slow. Slow meaning it only processed ~4000 columns in a day, with each append causing subsequent appends to take longer as well.
One part which struck me as odd is why my column count of the main DataFrame affects the append speed. Because my main index already contains all entries it will ever see, I shouldn't have to lose time due to re-indexing.
In any case, here is my code:
import time
import sys
import numpy as np
import pandas as pd

precision = 6

df = []
for index, i in enumerate(raw):
    if i is None:
        break
    if index % 1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision), int(intensity))
                              for mz, intensity in i.scans]),
                        dtype='uint16', name=i.rt))

all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())
print len(all_indices)

t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices

while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind = 0
    while subset:
        t2 = time.time()
        ind += 1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind, time.time() - t, time.time() - t2
    df2.reindex(main_df.index)
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time() - t, time.time() - t2
    print len(df), 'entries remain'
Any advice on how I can load this large dataset quickly is appreciated, even if it means writing it to disk to some other format first/etc.
Some additional info:
1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.
2) The data will be queried across columns and rows when it is in use. So main_df.loc[row:row_end, col:col_end]. These aren't predictable block sizes so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 a second to be realistically useful.
3) I have 32G of memory, so I think a SparseDataFrame is the best option, since it fits in memory and allows fast lookups as needed. It's just the creation of it that is a pain at the moment.
Update:
I ended up using scipy sparse matrices and handling the indexing on my own for the time being. This results in appends at a constant rate of ~0.2 seconds each, which is acceptable (versus Pandas taking ~150 seconds per append for my full dataset). I'd love to know how to make Pandas match this speed.
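For reference, a minimal sketch of the kind of approach the update describes: handle the row indexing yourself, collect (row, column, value) triplets, and build a single scipy sparse matrix at the end instead of appending columns to a DataFrame. The raw / scans / rt names come from the question's code; the rest is an assumption about the final approach:
import numpy as np
from scipy import sparse

row_lookup = {}                                  # maps rounded m/z value -> row position
rows, cols, vals, col_names = [], [], [], []
for col_idx, i in enumerate(raw):
    if i is None:
        break
    col_names.append(i.rt)
    for mz, intensity in i.scans:
        key = np.round(mz, precision)
        row = row_lookup.setdefault(key, len(row_lookup))   # assign a row on first sight
        rows.append(row)
        cols.append(col_idx)
        vals.append(int(intensity))

matrix = sparse.csr_matrix((vals, (rows, cols)),
                           shape=(len(row_lookup), len(col_names)),
                           dtype=np.uint16)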
