Merge CountVectorizer output from 4 text columns back into one dataset - python

I have a collection of ~100,000 documents in a dataset with a unique doc_id and four columns containing text (like below).
[image: original dataset]
I want to vectorize each of the four text columns individually and then combine all of those features back together to create one large dataset for the purpose of building a model for prediction. I approached the vectorization for each text feature using code like below:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words("english")

subject_transformer = CountVectorizer(stop_words=stopwords)
subject_vectorized = subject_transformer.fit_transform(full_docs['subject'])

body_transformer = CountVectorizer(stop_words=stopwords)
body_vectorized = body_transformer.fit_transform(full_docs['body'])

excerpt_transformer = CountVectorizer(stop_words=stopwords)
excerpt_vectorized = excerpt_transformer.fit_transform(full_docs['excerpt'])

regex_transformer = CountVectorizer(stop_words=stopwords)
regex_vectorized = regex_transformer.fit_transform(full_docs['regex'])
Each vectorization yields a sparse matrix like the one below, where the first column is the document index, the second is the term index (one per word in that text column's vocabulary), and the last column is the actual count.
[image: sparse matrix]
What I want to do is the following:
1. Convert each sparse matrix to a full dataframe of dimensions n x p, where n = number of documents and p = number of words in that column's corpus
2. Merge these matrices/dataframes back together for the purpose of building a model for prediction
I initially tried the following:
regex_vectorized_df = pd.DataFrame(regex_vectorized.toarray())
Then I could merge the four individual dataframes back together. This doesn't work because toarray() is too memory intensive. What is the best way to merge these four sparse matrices into one dataset with one unique line per document?
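For reference, a minimal sketch of one way to combine the four matrices without calling toarray(): scipy.sparse.hstack concatenates sparse matrices column-wise, so the result stays sparse, assuming all four matrices keep the same row (document) order as full_docs:
from scipy.sparse import hstack
# Column-wise concatenation of the four document-term matrices;
# each row still corresponds to one document.
combined = hstack([subject_vectorized, body_vectorized,
                   excerpt_vectorized, regex_vectorized]).tocsr()
print(combined.shape)  # (number of documents, sum of the four vocabulary sizes)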

Related

How do I aggregate 50 datasets within an HDF5 file

I have an HDF5 file with 2 groups, each containing 50 datasets of 4D numpy arrays of the same type. I want to combine the 50 datasets in each group into a single dataset; in other words, instead of 2 x 50 datasets I want 2 x 1 datasets. How can I accomplish this? The file is 18.4 GB in size. I am a novice at working with large datasets, and I am working in Python with h5py.
Thanks!
Look at this answer: How can I combine multiple .h5 file? - Method 3b: Merge all data into 1 Resizeable Dataset. It describes a way to copy data from multiple HDF5 files into a single dataset. You want to do something similar. The only difference is all of your datasets are in 1 HDF5 file.
You didn't say how you want to stack the 4D arrays. In my first answer I stacked them along axis=3. As noted in my comment, it's easier (and cleaner) to create the merged dataset as a 5D array and stack the data along the 5th axis (axis=4). I like this for 2 reasons: 1) the code is simpler and easier to follow, and 2) it's more intuitive (to me) that axis=4 represents a unique dataset (instead of slicing on axis=3).
I wrote a self-contained example to demonstrate the procedure. First it creates some data and closes the file. Then it reopens the file (read only) and creates a new file for the copied datasets. It loops over the groups and datasets in the first file and copies the data into a merged dataset in the second file. The 5D example is first, and my original 4D example follows.
Note: this is a simple example that will work for your specific case. If you are writing a general solution, it should check for consistent shapes and dtypes before blindly merging the data (which I don't do).
Code to create the Example data (2 groups, 5 datasets each):
import h5py
import numpy as np

# Create a simple H5 file with 2 groups and 5 datasets (shape=a0,a1,a2,a3)
with h5py.File('SO_69937402_2x5.h5','w') as h5f1:
    a0, a1, a2, a3 = 100, 20, 20, 10
    grp1 = h5f1.create_group('group1')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp1.create_dataset(f'dset_{ds:02d}', data=arr)
    grp2 = h5f1.create_group('group2')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp2.create_dataset(f'dset_{ds:02d}', data=arr)
Code to merge the data (2 groups, 1 5D dataset each -- my preference):
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5','w') as h5f2:
    # Loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        ds_cnt = len(h5f1[grp].keys())
        for i, ds in enumerate(h5f1[grp].keys()):
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', dtype=h5f1[grp][ds].dtype,
                                                    shape=(ds_shape + (ds_cnt,)), maxshape=(ds_shape + (None,)))
            # Now add data to the merged dataset
            merge_ds[:, :, :, :, i] = h5f1[grp][ds]
Code to merge the data (2 groups, 1 4D dataset each):
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_4d.h5','w') as h5f2:
    # Loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        for ds in h5f1[grp].keys():
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', data=h5f1[grp][ds],
                                                    maxshape=[ds_shape[0], ds_shape[1], ds_shape[2], None])
            else:
                # Otherwise, resize the merged dataset to hold the new values
                ds1_shape = h5f1[grp][ds].shape
                ds2_shape = merge_ds.shape
                merge_ds.resize(ds1_shape[3] + ds2_shape[3], axis=3)
                merge_ds[:, :, :, ds2_shape[3]:ds2_shape[3] + ds1_shape[3]] = h5f1[grp][ds]
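As a quick sanity check, a short read-back sketch (using the file and dataset names created above) to confirm the merged shapes; given the example shapes, the 5D file should report (100, 20, 20, 10, 5) and the 4D file (100, 20, 20, 50):
import h5py
for fname in ['SO_69937402_2x1_5d.h5', 'SO_69937402_2x1_4d.h5']:
    with h5py.File(fname, 'r') as h5f:
        for grp in h5f.keys():
            # Print the shape of each group's merged dataset
            print(fname, grp, h5f[grp]['merged_ds'].shape)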

Creating Dummy Variables with 200k unique values

I am trying to create dummy variables for a categorical column, but Python runs out of RAM because the number of unique values is too large. It is a large dataset with 500k rows and 200k unique values. Is it possible to create dummy variables for 200k unique values?
Indeed, performing this operation takes a lot of RAM.
As far as programming solutions go, here is what I can think of:
Dimensionality reduction: if there is some relationship between your 200K categories, they may be reducible (e.g. hierarchy levels for those categories, so you can group categories and perform analyses by level: lvl1 = 10 categories, lvl2 = 100, etc.). May I ask: what type of data do you have that contains 200K unique category values?
Splitting up the dataset and combining results: I have the code below working with numpy. You end up with smaller subsets, each encoded for the 200K categories (even if certain categories aren't present in a subset). Then you need to decide how to further process those subsets.
Somehow import statements were breaking formatting so I have them separate here:
import numpy as np
import random
And the rest of the code:
def np_one_hot_encode(n_categories: int, arr: np.array):
    # Performs one-hot encoding of arr based on n_categories
    # Allows encoding smaller chunks of a bigger array
    # even if the chunks do not contain 1 occurrence of each category
    # while still producing n_categories columns for each chunk
    result = np.zeros((arr.size, n_categories))
    result[np.arange(arr.size), arr] = 1
    return result

# Testing our encoding function:
# even if our input array doesn't contain all categories
# the output still caters for all categories
encoded = np_one_hot_encode(3, np.array([1, 0]))
print('test np_one_hot_encode\n', encoded)
assert np.array_equal(encoded, np.array([[0, 1, 0], [1, 0, 0]]))

# Generating 500K rows with 200K unique categories present at least once
total = int(5e5)
nunique = int(2e5)
uniques = list(range(0, nunique))
random.shuffle(uniques)
values = uniques + (uniques*2)[:total-nunique]
print('Rows count', len(values))
print('Uniques count', len(list(set(values))))

# Produces subsets of the data in (~500K/50 x nunique) shape:
n_chunks = 50
for i, chunk in enumerate(np.array_split(values, n_chunks)):
    print('chunk', i, 'shape', chunk.shape)
    encoded = np_one_hot_encode(nunique, chunk)
    print('encoded', encoded.shape)
And the output:
test np_one_hot_encode
[[0. 1. 0.]
[1. 0. 0.]]
Rows count 500000
Uniques count 200000
chunk 0 shape (10000,)
encoded (10000, 200000)
chunk 1 shape (10000,)
encoded (10000, 200000)
Distributed processing with tools like Dask, Spark, etc., so you can handle the processing of the subsets.
Database: another solution I can think of is to normalize your model into a database (either relational or a "big" flat data model), where you could leverage indices to filter and process only part of the data (only certain rows and certain categories), thus allowing you to handle a smaller output in memory.
But in the end there is no magic: if you are ultimately trying to load an N x M matrix into memory with N = 500K and M = 200K, it will take the RAM it needs to take, and there is no way around that. The most likely gains to be had are therefore dimensionality reduction OR a different approach to data processing altogether (e.g. distributed computing).
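If the downstream model accepts sparse input, a sparse one-hot representation avoids materializing the dense matrix, since only the non-zero entries (one per row) are stored; a minimal sketch with a hypothetical helper sparse_one_hot:
import numpy as np
from scipy.sparse import csr_matrix
def sparse_one_hot(n_categories: int, codes: np.ndarray):
    # One non-zero per row: row i has a 1 in column codes[i]
    rows = np.arange(codes.size)
    data = np.ones(codes.size, dtype=np.int8)
    return csr_matrix((data, (rows, codes)), shape=(codes.size, n_categories))
encoded = sparse_one_hot(int(2e5), np.random.randint(0, int(2e5), size=int(5e5)))
print(encoded.shape, encoded.nnz)  # (500000, 200000) with only 500000 stored values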

On efficiently separating one data file into two files for model building

There is a data file consisting of 100K rows, where each row stores a data point. I would like to randomly select 10K rows and save them into a validation file, and save the remaining rows into a training file. Instead of writing code to do this, is there any existing function in scikit-learn, pandas, or generic Python to efficiently separate a data file into two?
There is really only one reason not to use sklearn's train_test_split method: you probably don't want to take the labels out from the features. You simply want to split the DataFrame into two sections without separating the features and labels.
If you don't want to use train_test_split from sklearn, you can do it in pandas too:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
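If an exact 10K/90K split is wanted, sklearn's train_test_split can also do that directly; a minimal sketch, assuming the data file has already been read into a DataFrame df (the file names here are hypothetical):
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('data.csv')  # hypothetical input file
# test_size=0.1 holds out exactly 10% of the rows (10K of 100K) for validation
train_df, valid_df = train_test_split(df, test_size=0.1, random_state=42)
train_df.to_csv('train.csv', index=False)
valid_df.to_csv('validation.csv', index=False)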

Extracting tf-idf values and features from TfidfVectorizer and making them into a pandas Series

I was extracting tf-idf values for each feature name from a text document (in .csv format where each row entry represents a text message (dtype=str)) using TfidfVectorizer with default parameters. This is what I did:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from pandas import Series
# .csv document has been converted into pandas format
tf = TfidfVectorizer()
X_tf = tf.fit_transform(document)
# get feature names and tf-idf values
feature_names = tf.get_feature_names()
tfidf = tf.idf_
I also used the last two lines to extract feature names and tf-idf values. However, I was also asked to (1) sort features by their tf-idf values in both ascending and descending orders, then by alphabetical order (if multiple features' tf-idf values are tied) and (2) make the output into a pandas Series object using feature name as index, such that the output looks like this (this one in descending order):
feature tf-idf
he 0.031
she 0.047
i 0.068
a 0.084
the 1.527
It seems that I can achieve this simply by matching 'feature_names' with 'tfidf' and sorting them, but I am not sure whether their sequences actually match, since 'feature_names' is a list object while 'tfidf' is a numpy array, and I don't really know what sklearn is doing under the hood.
If I want to compile a sorted Series in descending (and ascending) order with the exact feature names as the index (ties sorted in alphabetical order), how should I proceed from the last line of my code? It will be really appreciated if someone could enlighten me on this.
Thank you.
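For what it's worth, idf_ and get_feature_names() are both ordered by the vectorizer's internal feature index (note that idf_ holds the per-term IDF weights), so they can be combined into a Series directly; a minimal sketch continuing from the code above:
import pandas as pd
# feature_names[i] corresponds to tfidf[i], so build the Series directly
scores = pd.Series(tfidf, index=feature_names)
# Sort alphabetically first, then apply a stable sort on the values
# so that tied values keep their alphabetical order.
ascending = scores.sort_index().sort_values(ascending=True, kind='mergesort')
descending = scores.sort_index().sort_values(ascending=False, kind='mergesort')
print(descending.head())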

Efficient read and write of pandas dataframe

I have a pandas dataframe that I want to split into several smaller pieces of 100k rows each, then save onto disk so that I can read the data in and process it piece by piece. I have tried using dill and HDF storage, as CSV and raw text appear to take a lot of time.
I am trying this out on a subset of the data with ~500k rows and five columns of mixed data. Two contain strings, one integers, one floats, and the final one contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.
It is the last column that I am having problems with. Dumping and loading the data goes without issue, but when I try to actually access the data it is instead a pandas.Series object. Secondly, each row in that Series is a tuple which contains the whole dataset instead.
# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts' which has 1400.
# Meaning that df['counts'] give me a sparse matrix that is 100k x 1400.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2,2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts
df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1','string2','float','integer','counts'])
dill.dump(df, open(file[i], 'w'))
df = dill.load(file[i])
print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400) # 496718 is the number of rows in my complete data set.
print(type(df['counts']))
> <type 'tuple'>
Am I making any obvious mistake, or is there a better way to store this data in this format, one which isn't very time consuming? It has to be scalable to my full data containing 100 million rows.
df['counts'] = counts
This will produce a Pandas Series (column) with a number of elements equal to len(df), where each element is a sparse matrix (the one returned by vectorizer.fit_transform(df['string_data'])).
You can try the following instead:
df = df.join(pd.DataFrame(counts.A, columns=vectorizer.get_feature_names(), index=df.index))
NOTE: be aware this will explode your sparse matrix into a dense (not sparse) DataFrame, so it will use much more memory and you can end up with a MemoryError
CONCLUSION:
That's why I'd recommend storing your original DF and the counts sparse matrix separately
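A minimal sketch of storing the two parts separately (file names are hypothetical; save_npz requires scipy >= 0.19):
import pandas as pd
import scipy.sparse
# Save the DataFrame (without the 'counts' column) and the sparse matrix separately
df.drop(columns=['counts']).to_pickle('df_part.pkl')
scipy.sparse.save_npz('counts_part.npz', counts)
# ...and load them back later
df_loaded = pd.read_pickle('df_part.pkl')
counts_loaded = scipy.sparse.load_npz('counts_part.npz')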
