Creating dummy variables with 200k unique values - python

I am trying to create dummy variables for a categorical dataset, but I run out of RAM because the number of unique values is too large. The dataset has 500k rows and 200k unique values. Is it possible to create dummy variables for 200k unique values?

Indeed, performing this operation takes a lot of RAM.
As far as programming solutions go, here is what I can think of:
Dimensionality reduction: if there is some relation between your 200K categories, they may be reducible (e.g. hierarchy levels for those categories, so you can group categories and perform analyses by level: lvl1 = 10 categories, lvl2 = 100, etc.). May I ask: what type of data do you have that contains 200K unique category values?
Splitting up the dataset and combining results: I have the code below working with numpy. You end up with smaller subsets, each encoded for the 200K categories (even if certain categories aren't present in a subset). Then you need to decide how to further process those subsets.
The code:
import numpy as np
import random

def np_one_hot_encode(n_categories: int, arr: np.ndarray):
    # Performs one-hot encoding of arr based on n_categories.
    # Allows encoding smaller chunks of a bigger array,
    # even if the chunks do not contain an occurrence of each category,
    # while still producing n_categories columns for each chunk.
    result = np.zeros((arr.size, n_categories))
    result[np.arange(arr.size), arr] = 1
    return result
# Testing our encoding function
# even if our input array doesn't contain all categories
# the output does cater for all categories
encoded = np_one_hot_encode(3, np.array([1, 0]))
print('test np_one_hot_encode\n', encoded)
assert np.array_equal(encoded, np.array([[0, 1, 0], [1, 0, 0]]))
# Generating 500K rows with 200K unique categories present at least once
total = int(5e5)
nunique = int(2e5)
uniques = list(range(0, nunique))
random.shuffle(uniques)
values = uniques+(uniques*2)[:total-nunique]
print('Rows count', len(values))
print('Uniques count', len(list(set(values))))
# Produces subsets of the data in (~500K/50 x nuniques) shape:
n_chunks = 50
for i, chunk in enumerate(np.array_split(values, n_chunks)):
    print('chunk', i, 'shape', chunk.shape)
    encoded = np_one_hot_encode(nunique, chunk)
    print('encoded', encoded.shape)
And the output:
test np_one_hot_encode
[[0. 1. 0.]
[1. 0. 0.]]
Rows count 500000
Uniques count 200000
chunk 0 shape (10000,)
encoded (10000, 200000)
chunk 1 shape (10000,)
encoded (10000, 200000)
Distributed processing: tools like Dask, Spark, etc. let you distribute the processing of the subsets.
Database: another solution I can think of is to normalize your model into a database (either relational or a "big" flat data model) where you can leverage indices to filter and process only part of the data (only certain rows and certain categories), thus allowing you to handle a smaller output in memory.
But in the end there is no magic: if you are ultimately trying to load an N x M matrix into memory with N = 500K and M = 200K, it will take the RAM it needs to take, and there is no way around that. The most likely gains are therefore from dimensionality reduction OR a different approach to data processing altogether (e.g. distributed computing).
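One more option worth sketching (not part of the original answer; a minimal sketch assuming the categories are already mapped to integer codes, as in the numpy example above): a scipy.sparse one-hot matrix stores only the non-zero entries, so even the full 500K x 200K encoding stays small, because each row contains exactly one 1.
import numpy as np
import scipy.sparse

def sparse_one_hot_encode(n_categories, codes):
    # Sketch: build a CSR matrix with one 1 per row instead of a dense array.
    # A 500K x 200K dense float64 matrix would need ~800 GB; the sparse version
    # holds only 500K non-zeros plus index arrays (a few MB).
    codes = np.asarray(codes)
    data = np.ones(codes.size, dtype=np.int8)
    rows = np.arange(codes.size)
    return scipy.sparse.csr_matrix((data, (rows, codes)),
                                   shape=(codes.size, n_categories))

encoded = sparse_one_hot_encode(3, [1, 0])
assert np.array_equal(encoded.toarray(), np.array([[0, 1, 0], [1, 0, 0]]))
Whether downstream tooling accepts a sparse input is the deciding factor here; if it requires dense arrays, the chunking approach above is still needed.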

Related

How to do parallelization while needing to check all rows in the dataframe

I have a dataframe with a million rows and almost 100 features. I first need to cast one of the features to string, then drop almost 17 features. Then I need to add a column called pred to the dataframe. The methodology for adding this column is to group the rows by class; if a -1 is found in a class's "Reta" values, all rows of that class get a pred value of -1, otherwise 1. This can be done with this code:
#getting the prediction
hs_p = {}
for i in range(len(classes)):
    class_name = classes[i]
    # this can be rewritten so that once a -1 is found we stop instead of checking everything
    check = df.loc[df['CLUSTER'] == class_name]['Reta'].values.tolist()
    if (-1 in check):
        hs_p[class_name] = -1
    else:
        hs_p[class_name] = 1
hs_p_col = []
print("prediction done")
#Adding the prediction column to the df
for i in hs_p:
    df.loc[df['CLUSTER'] == i, 'pred'] = hs_p[i]
The problem is that the data is huge, so this takes a very long time to run and still produces no result. I thought about parallelizing with Python's multiprocessing library. However, I understand that multiprocessing divides the dataframe into chunks, so one chunk would get some of a class's rows and another chunk would get the rest, and the pred column would not be computed accurately. Any ideas about how to do this?
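As a hedged sketch of one alternative direction (assuming the column names CLUSTER and Reta from the snippet above): the per-class check can be expressed as a single vectorized groupby.transform, which removes the Python loop and may make multiprocessing unnecessary.
import numpy as np
import pandas as pd

# Hypothetical toy data using the assumed column names.
df = pd.DataFrame({'CLUSTER': ['a', 'a', 'b', 'b', 'c'],
                   'Reta':    [1, -1, 1, 1, -1]})

# For each CLUSTER, pred is -1 if any Reta value in the group equals -1, else 1.
has_neg = df.groupby('CLUSTER')['Reta'].transform(lambda s: (s == -1).any())
df['pred'] = np.where(has_neg, -1, 1)
print(df)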

Out-of-core processing of sparse CSR arrays

How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done e.g. by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the chunks of rows one by one. Could this be done more efficiently with dask?
In particular, assuming one doesn't need all the possible out of core operations on sparse arrays, but just the ability to load row chunks in parallel (each chunk is a CSR array) and apply some function to them (in my case it would be e.g. estimator.predict(X) from scikit-learn).
Besides, is there a file format on disk that would be suitable for this task? Joblib works but I'm not sure about the (parallel) performance of CSR arrays loaded as memory maps; spark.mllib appears to use either some custom sparse storage format (that doesn't seem to have a pure Python parser) or LIBSVM format (the parser in scikit-learn is, in my experience, much slower than joblib.dump)...
Note: I have read documentation, various issues about it on https://github.com/dask/dask/ but I'm still not sure how to best approach this problem.
Edit: to give a more practical example, below is the code that works in dask for dense arrays but fails when using sparse arrays with this error,
import numpy as np
import scipy.sparse
import joblib
import dask.array as da
from sklearn.utils import gen_batches
np.random.seed(42)
joblib.dump(np.random.rand(100000, 1000), 'X_dense.pkl')
joblib.dump(scipy.sparse.random(10000, 1000000, format='csr'), 'X_csr.pkl')
fh = joblib.load('X_dense.pkl', mmap_mode='r')
# computing the results without dask
results = np.vstack((fh[sl, :].sum(axis=1)) for sl in gen_batches(fh.shape[0], batch_size))
# computing the results with dask
x = da.from_array(fh, chunks=(2000))
results = x.sum(axis=1).compute()
Edit2: following the discussion below, the example below overcomes the previous error but gets one about IndexError: tuple index out of range in dask/array/core.py:L3413,
import dask
# +imports from the example above
dask.set_options(get=dask.get) # disable multiprocessing
fh = joblib.load('X_csr.pkl', mmap_mode='r')
def func(x):
    if x.ndim == 0:
        # dask does some heuristics with dummy data; if x is a 0d array
        # the sum command would fail
        return x
    res = np.asarray(x.sum(axis=1, keepdims=True))
    return res
Xd = da.from_array(fh, chunks=(2000))
results_new = Xd.map_blocks(func).compute()
So I don't know anything about joblib or dask, let alone your application specific data format. But it is actually possible to read sparse matrices from disk in chunks while retaining the sparse data structure.
While the Wikipedia article for the CSR format does a great job explaining how it works, I'll give a short recap:
Some sparse Matrix, e.g.:
1 0 2
0 0 3
4 5 6
is stored by remembering each nonzero-value and the column it resides in:
sparse.data = 1 2 3 4 5 6 # actual values
sparse.indices = 0 2 2 0 1 2 # column index (0-indexed)
Now we are still missing the rows. The compressed format just stores how many non-zero values there are in each row, instead of storing every single value's row.
Note that the non-zero count is also accumulated, so the following array contains the number of non-zero values up until and including this row. To complicate things even further, the array always starts with a 0 and thus contains num_rows+1 entries:
sparse.indptr = 0 2 3 6
so up until and including the second row there are 3 nonzero values, namely 1, 2 and 3.
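As a quick sanity check (not from the original answer), scipy yields exactly these three arrays for the example matrix above:
import numpy as np
import scipy.sparse

# Build the 3x3 example matrix and inspect its CSR representation.
m = scipy.sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 0, 3],
                                      [4, 5, 6]]))
print(m.data)     # [1 2 3 4 5 6]
print(m.indices)  # [0 2 2 0 1 2]
print(m.indptr)   # [0 2 3 6]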
Since we got this sorted out, we can start 'slicing' the matrix. The goal is to construct the data, indices and indptr arrays for some chunks. Assume the original huge matrix is stored in three binary files, which we will incrementally read. We use a generator to repeatedly yield some chunk.
For this we need to know how many non-zero values are in each chunk, and read the corresponding number of values and column indices. The non-zero count can be conveniently read from the indptr array: we read the number of entries from the huge indptr file that corresponds to the desired chunk size, and the last entry of that portion minus the number of non-zeros seen before gives the number of non-zeros in that chunk. The chunk's data and indices arrays are then just sliced from the big data and indices files. The chunk's indptr array needs to be artificially prepended with a zero (that's what the format wants, don't ask me :D).
Then we can just construct a sparse matrix with the chunk data, indices and indptr to get a new sparse matrix.
It has to be noted that the matrix width cannot be reconstructed from the three arrays alone: the best you can infer is the maximum column index present, and if you are unlucky and a chunk contains no data it is completely undetermined. So we also need to pass the column count.
I probably explained things in a rather complicated way, so just read the following as an opaque piece of code that implements such a generator:
import numpy as np
import scipy.sparse

def gen_batches(batch_size, sparse_data_path, sparse_indices_path,
                sparse_indptr_path, dtype=np.float32, column_size=None):
    data_item_size = dtype().itemsize
    with open(sparse_data_path, 'rb') as data_file, \
            open(sparse_indices_path, 'rb') as indices_file, \
            open(sparse_indptr_path, 'rb') as indptr_file:
        # first indptr entry is always 0: the non-zero count before the first row
        nnz_before = np.frombuffer(indptr_file.read(4), dtype=np.int32)
        while True:
            # read batch_size further indptr entries, prepending the carried-over count
            indptr_batch = np.frombuffer(nnz_before.tobytes() +
                                         indptr_file.read(4 * batch_size), dtype=np.int32)
            if len(indptr_batch) == 1:
                break
            # rebase so the chunk's indptr starts at 0
            batch_indptr = indptr_batch - nnz_before
            nnz_before = indptr_batch[-1]
            batch_nnz = batch_indptr[-1].item()
            batch_data = np.frombuffer(data_file.read(
                data_item_size * batch_nnz), dtype=dtype)
            batch_indices = np.frombuffer(indices_file.read(
                4 * batch_nnz), dtype=np.int32)
            dimensions = (len(indptr_batch) - 1, column_size)
            matrix = scipy.sparse.csr_matrix((batch_data,
                                              batch_indices, batch_indptr), shape=dimensions)
            yield matrix

if __name__ == '__main__':
    sparse = scipy.sparse.random(5, 4, density=0.1, format='csr', dtype=np.float32)
    sparse.data.tofile('sparse.data')        # dtype as specified above (np.float32)
    sparse.indices.tofile('sparse.indices')  # dtype=int32
    sparse.indptr.tofile('sparse.indptr')    # dtype=int32
    print(sparse.toarray())
    print('========')
    for batch in gen_batches(2, 'sparse.data', 'sparse.indices',
                             'sparse.indptr', column_size=4):
        print(batch.toarray())
numpy.ndarray.tofile() just stores the raw binary arrays, so you need to remember the data format. scipy.sparse represents the indices and indptr as int32, so that is a limitation on the total matrix size.
Also, I benchmarked the code and found that the scipy csr matrix constructor is the bottleneck for small matrices. Your mileage may vary though; this is just a 'proof of principle'.
If there is need for a more sophisticated implementation, or something is too abstruse, just hit me up :)

Efficient read and write of pandas dataframe

I have a pandas dataframe that I want to split into several smaller pieces of 100k rows each, then save to disk so that I can read the data back and process it piece by piece. I have tried using dill and HDF storage, as csv and raw text appear to take a lot of time.
I am trying this out on a subset of data with ~500k rows and five columns of mixed data. Two contain strings, one integers, one floats, and the final one contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.
It is the last column that I am having problems with. Dumping and loading the data works without issue, but when I try to actually access the data it is a pandas.Series object instead. Secondly, each row in that Series is a tuple which contains the whole dataset.
# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts' which has 1400.
# Meaning that df['counts'] give me a sparse matrix that is 100k x 1400.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2,2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts
df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])
dill.dump(df, open(file[i], 'w'))
df = dill.load(file[i])
print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400) # 496718 is the number of rows in my complete data set.
print(type(df['counts'][0]))
> <type 'tuple'>
Am I making an obvious mistake, or is there a better way to store this data in this format, one which isn't very time consuming? It has to scale to my full dataset of 100 million rows.
df['counts'] = counts
This will produce a Pandas Series (column) with the number of elements equal to len(df), where each element is a sparse matrix, as returned by vectorizer.fit_transform(df['string_data']).
You can try to do the following:
df = df.join(pd.DataFrame(counts.A, columns=vectorizer.get_feature_names(), index=df.index))
NOTE: be aware that this will explode your sparse matrix into a dense (not sparse) DataFrame, so it will use much more memory and you can end up with a MemoryError
CONCLUSION:
That's why I'd recommend storing your original DF and the count sparse matrix separately
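A minimal sketch of that recommendation (not from the original answer; the file names are placeholders): save the sparse counts with scipy.sparse.save_npz and the rest of the DataFrame with its own serializer, then reload both and keep them aligned by row position.
import pandas as pd
import scipy.sparse

# Assumes df and counts from the question's code; file names are arbitrary.
scipy.sparse.save_npz('counts.npz', counts)                       # sparse matrix on its own
df.drop(columns=['counts'], errors='ignore').to_pickle('df.pkl')  # DataFrame without the sparse column

# Later: reload both parts; row i of df_loaded corresponds to row i of counts_loaded.
df_loaded = pd.read_pickle('df.pkl')
counts_loaded = scipy.sparse.load_npz('counts.npz')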

How can I sample equally from a dataframe?

Suppose I have some observations, each with an indicated class from 1 to n. Each of these classes may not necessarily occur equally in the data set.
How can I equally sample from the dataframe? Right now I do something like...
frames = []
classes = df.classes.unique()
for i in classes:
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)
equally_sampled = pd.concat(frames)
Is there a pandas function to equally sample?
For more elegance you can do this:
df.groupby('classes').apply(lambda x: x.sample(sample_size))
Extension:
You can make the sample_size a function of group size to sample with equal probabilities (or proportionately):
nrows = len(df)
total_sample_size = 1e4
df.groupby('classes').\
    apply(lambda x: x.sample(int(len(x) / nrows * total_sample_size)))
It won't result in exactly total_sample_size rows, but the sampling will be more proportional than the naive method.
While the accepted answer is awesome, another approach when the dataset is highly imbalanced:
For example: a dataset has 100K data points (rows), of which 16K are label 0 (-ve class) and the remaining 84K are label 1 (+ve class). To extract a sample of 50K data points with all 16K of the -ve class and the remaining space filled with the +ve class, we can do the steps below:
from sklearn import utils
# Pick all -ve class, fill the sample with +ve class and shuffle.
df = utils.shuffle(df.groupby("class_label").head(50000 - 16000))
# Reset index by dropping old index if not required.
df.reset_index(drop=True, inplace=True) # Optional step.

Merge CountVectorizer output from 4 text columns back into one dataset

I have a collection of ~100,000 documents in a dataset with a unique doc_id and four columns containing text (like below).
(image: original dataset)
I want to vectorize each of the four text columns individually and then combine all of those features back together to create one large dataset for the purpose of building a model for prediction. I approached the vectorization for each text feature using code like below:
stopwords = nltk.corpus.stopwords.words("english")
subject_transformer = CountVectorizer(stop_words=stopwords)
subject_vectorized = subject_transformer.fit_transform(full_docs['subject'])
body_transformer = CountVectorizer(stop_words=stopwords)
body_vectorized = body_transformer.fit_transform(full_docs['body'])
excerpt_transformer = CountVectorizer(stop_words=stopwords)
excerpt_vectorized = excerpt_transformer.fit_transform(full_docs['excerpt'])
regex_transformer = CountVectorizer(stop_words=stopwords)
regex_vectorized = regex_transformer.fit_transform(full_docs['regex'])
Each vectorization yields a sparse matrix like below where column one is the document number, column two is the column number (one for each word in the original text column), and the last column is the actual count.
(image: sparse matrix)
What I want to do is the following:
Convert each sparse matrix to a full dataframe of dimensions n x p, where n = number of documents and p = number of words in that corpus
Merge each of these matrices/dataframes back together for the purpose of building a model for prediction
I initially tried the following:
regex_vectorized_df = pd.DataFrame(regex_vectorized.toarray())
Then I could merge the four individual dataframes back together. This doesn't work because toarray() is too memory intensive. What is the best way to merge these four sparse matrices into one dataset with one unique line per document?
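One possible direction, sketched here under the assumption that the four *_vectorized matrices from the code above are available: scipy.sparse.hstack concatenates sparse matrices column-wise without densifying them, so the combined result keeps one row per document and never calls toarray().
import scipy.sparse

# Column-wise concatenation of the four sparse count matrices; rows stay aligned
# with the documents, columns are the concatenated vocabularies.
combined = scipy.sparse.hstack(
    [subject_vectorized, body_vectorized, excerpt_vectorized, regex_vectorized],
    format='csr')
print(combined.shape)  # (number of documents, total vocabulary size across the four columns)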
