Machine learning handling new data in every batch - python

I have nearly a TB of data to process. One of the fields is a list of tags that each video is linked to. The problem is that there are a lot of distinct tags and a single video is linked to many of them. How can I convert (clean) this field before processing? One-hot encoding and the other approaches I tried don't fit this case.
Example:
{"user_id":1, "vid_id":101, "name":"abc", "tags":["night", "horror"], "gender":"Male"}
{"user_id":2, "vid_id":192, "name":"xyz", "tags":["action", "twins"], "gender":"Male"}
and so on
The above JSON data has many other params too, but I want to take the tags param into consideration.
Now I want to predict the gender field. Help me out with algorithms or ideas. I am currently using Python, with Spark to load the big data.

You can read all of your data into a sparse matrix. The code below was built around the brief data example you provided and produces a sparse matrix in which each record is a row and each column counts how many times a given tag appears in that record's tag list. The vocabulary dict provides a mapping from tags to their column index in the final matrix. While looping over the dataset to count the tags, a separate list, targets, is built with the outcome variable. In the end you should be able to use mat and targets to train your classifier.
import scipy.sparse

idx_pointer = [0]
indices = []
mat_data = []
vocabulary = {}
targets = []
for d in data:
    targets.append(d['gender'])
    for t in d['tags']:
        # assign each previously unseen tag the next free column index
        index = vocabulary.setdefault(t, len(vocabulary))
        indices.append(index)
        mat_data.append(1)
    idx_pointer.append(len(indices))
mat = scipy.sparse.csr_matrix((mat_data, indices, idx_pointer), dtype=int)
Using the example input you provided, the dense output would be a matrix like the one below.
night  horror  action  twins
    1       1       0      0
    0       0       1      1
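From there, a minimal sketch of feeding mat and targets to a scikit-learn classifier could look like the following (LogisticRegression is just an illustrative choice; any estimator that accepts sparse input would do):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hold out part of the data to check accuracy; both functions accept sparse matrices
X_train, X_test, y_train, y_test = train_test_split(mat, targets, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))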

Related

Fast/efficient way to extract data from multiple large NetCDF files

I need to extract data from a global grid only for a specific set of nodes, given by lat/lon coordinates (in the order of 5000-10000). The data are time-series of hydraulic parameters, for example wave height.
The global data set is huge so it is divided into many NetCDF files. Each NetCDF file is around 5 GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). Say I want to extract the full time series (42 years) of 6 variables at a certain location, I need to extract data from 6x42 = 252 NC files, each 5 GB in size.
My current approach is a triple loop through years, variables, and nodes. I use Xarray to open each NC file, extract the data for all the required nodes and store it in a dictionary. Once I've extracted all the data in the dictionary I create one pd.dataframe for each location, which I store as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB for each location (so not very large actually).
My approach works perfectly fine if I have a small number of locations, but as soon as it grows to a few hundred, this approach takes extremely long. My gut feeling is that it is a memory problem (since all the extracted data is first stored in a single dictionary until every year and variable has been extracted). But one of my colleagues said that Xarray is actually quite inefficient and that this might lead to the long duration.
Does anyone here have experience with similar issues or know of an efficient way to extract data from a multitude of NC files? I put the code I currently use below. Thanks for any help!
# set conditions
vars = {...}  # dictionary which contains variables
years = np.arange(y0, y1 + 1)  # year range
ndata = {}  # dictionary which will contain all data

# loop through all the desired variables
for v in vars.keys():
    ndata[v] = {}
    # For each variable, loop through each year, open the nc file and extract the data
    for y in years:
        # Open file with xarray
        fname = 'xxx.nc'
        data = xr.open_dataset(fname)
        # loop through the locations and load the data for each node as temp
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            lon = nodes.lon.iloc[n]
            lat = nodes.lat.iloc[n]
            temp = data.sel(longitude=lon, latitude=lat)
            # For the first year, store the data into the ndata dict
            if y == years[0]:
                ndata[v][node] = temp
            # For subsequent years, concatenate the existing array in ndata
            else:
                ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')

# merge the variables for the current location into one dataset
for n in range(len(nodes)):
    node = nodes.node_id.iloc[n]
    dset = xr.merge(ndata[v][node] for v in vars.keys())
    df = dset.to_dataframe()
    # save dataframe as pickle file, named by the node id
    df.to_pickle('%s.xz' % node)
This is a pretty common workflow, so I'll give a few pointers. Suggested changes, with the most important ones first:
Use xarray's advanced indexing to select all points at once
It looks like you're using a pandas DataFrame nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in Python, remove an inner for loop whenever possible and leverage array-based operations written in C. In this case:
# create an xr.Dataset indexed by node_id with arrays `lat` and `lon`
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()
# select all points from each file simultaneously, reshaping to be
# indexed by `node_id`
node_data = data.sel(lat=node_indexer.lat, lon=node_indexer.lon)
# dump this reshaped data to pandas, with each variable becoming a column
node_df = node_data.to_dataframe()
Only reshape arrays once
In your code, you are looping over many years, and every year after
the first one you are allocating a new array with enough memory to
hold as many years as you've stored so far.
# For the first year, store the data into the ndata dict
if y == years[0]:
    ndata[v][node] = temp
# For subsequent years, concatenate the existing array in ndata
else:
    ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')
Instead, just gather all the years' worth of data and concatenate them at the end. This allocates the array needed to hold all the data only once.
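A minimal sketch of that pattern for a single node, reusing the placeholder file name and the lon/lat variables from the question:
per_year = []
for y in years:
    fname = 'xxx.nc'  # placeholder path, as in the question
    with xr.open_dataset(fname) as data:
        # .load() pulls the small selection into memory before the file is closed
        per_year.append(data.sel(longitude=lon, latitude=lat).load())

# a single concatenation at the end allocates the full time series only once
full_series = xr.concat(per_year, dim='time')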
Use dask, e.g. with xr.open_mfdataset, to leverage multiple cores. If you do this, you may want to consider using a format that supports multithreaded writes, e.g. zarr.
All together, this could look something like this:
# build nested filepaths
filepaths = [
    ['xxx.nc'.format(year=y, variable=v) for y in years]
    for v in variables
]
# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()
# I'm not sure if you have conflicting variable names - you'll need to
# tailor this line to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
    filepaths,
    combine='nested',
    concat_dim=['variable', 'year'],
    parallel=True,
)
# this will only schedule the operation - no work is done until the next line
ds_nodes = ds.sel(lat=node_indexer.lat, lon=node_indexer.lon)
# this triggers the operation using a dask LocalCluster, leveraging
# multiple threads on your machine (or a distributed Client if you have
# one set up)
ds_nodes.to_netcdf('all_the_data.zarr')
# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()

Concatenating and appending to a DataFrame inside the For Loop in Python

I have the following problem.
There is a fairly big dataset with features and IDs. Due to the task definition, I'm not clustering the whole dataset at once; instead I take each ID and train the model on the feature data for that particular ID. Here is how that looks in detail:
Imagine, that we have our initial dataframe df_init
Then I create an array with the unique IDs:
dd = df_init['ID'].unique()
After that, a dict comprehension groups the data by ID:
dds = {x:y for x,y in df_init.groupby('ID')}
Iterating over dds in a for loop, I take the data and use it to train the clustering algorithm. After that, pd.concat() is used to build the dataframe back up (for this example, only two IDs are shown):
df = pd.DataFrame()
d = {}
n = 5
for i in dd[:2]:
    d[i] = dds[i].iloc[:, 1:5].values
    ac = AgglomerativeClustering(n_clusters=n, linkage='complete').fit(d[i])
    labels = ac.labels_
    labels = pd.DataFrame(labels)
    df = pd.concat([df, labels])
    print(i)
    print('Labels: ', labels)
The loop prints the labels for each ID, and the resulting df contains the concatenated labels (shown only for the first ID; the rest of the labels are there as well).
My question is the following: how can I add an additional column to this dataframe inside the loop that matches each ID to its corresponding labels (4 labels for ID_1, another 4 labels for ID_2, etc.)? Is there a pandas solution for achieving that?
Many thanks in advance!
Below this line:
labels = pd.DataFrame(labels)
Add the following:
labels['ID']=i
This will give you the extra column with the proper ID for each subset.
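Put together, the loop would look something like this (ignore_index=True is optional; it just keeps the row index of the concatenated frame unique):
df = pd.DataFrame()
d = {}
n = 5
for i in dd[:2]:
    d[i] = dds[i].iloc[:, 1:5].values
    ac = AgglomerativeClustering(n_clusters=n, linkage='complete').fit(d[i])
    labels = pd.DataFrame(ac.labels_)
    labels['ID'] = i                      # tag every label row with the ID it came from
    df = pd.concat([df, labels], ignore_index=True)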

Keeping track of the output columns in sklearn preprocessing

How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every bit of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the same approach, based on this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentations and further model-building steps. I've been searching for a systematic approach but with no luck.
The answer you mentioned is based on this example in Sklearn.
You can get the answer to your first two questions using the following snippet.
def get_feature_names(columnTransformer):
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    trans_features.extend(i.get_feature_names(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features

import pandas as pd
pd.DataFrame(preprocessor.fit_transform(X_train),
             columns=get_feature_names(preprocessor))
transformed_cols = get_feature_names(preprocessor)

def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]

get_original_column(3)
# 'embarked'
get_original_column(0)
# 'age'

def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]

print(get_category(3))
# 'Q'
print(get_category(0))
# 'no category'
Tracking whether there has been some imputation or scaling done on a feature is not trivial with the current version of Sklearn.
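That said, for your third and fourth questions the fitted transformers do expose the values they used. A sketch, assuming the numeric pipeline inside preprocessor is named 'num' and its steps are named 'imputer' and 'scaler' (adjust these names to your own setup):
num_pipe = preprocessor.named_transformers_['num']

# value each numeric column's missing entries were filled with (SimpleImputer)
imputed_values = num_pipe.named_steps['imputer'].statistics_

# mean and stdev used by StandardScaler, computed after imputation
means = num_pipe.named_steps['scaler'].mean_
stdevs = num_pipe.named_steps['scaler'].scale_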

Appending to h5py groups

I have files with the following structure:
time 1
index 1
value x
value y
time 1
index 2
value x
value y
time 2
index 1
...
I wish to convert the file to the hdf5 format using h5py, and sort the values from each index into separate groups.
My approach is
f = h5py.File(filename1, 'a')
trajfile = open(filename2, 'rb')
for i in range(length_of_file):
    time = struct.unpack('>d', trajfile.read(8))[0]
    index = struct.unpack('>i', trajfile.read(4))[0]
    x = struct.unpack('>d', trajfile.read(8))[0]
    y = struct.unpack('>d', trajfile.read(8))[0]
    f.create_dataset('/' + str(index), data=[time, x, y])
But in this way I am not able to append to the groups (I am only able to write to each group once...). The error message is "RuntimeError: Unable to create link (name already exists)".
Is there a way to append to the groups?
You can write to a dataset as many times as you want - you just can't create two datasets with the same name, which is the error you're getting. Note that you are creating a dataset and at the same time putting some data inside it. In order to write other data to it later, it has to be large enough to accommodate that data.
Anyway, I believe you are confusing groups and datasets.
Groups are created with e.g.
grp = f.create_group('bar')  # this creates the group '/bar'
and you store your data in datasets, created as you did with:
dst = f.create_dataset('foo', shape=(100,))  # this creates the dataset 'foo', with enough space for 100 elements
You only need to create groups and datasets once, but you can keep referring to them through their handles (grp and dst) in order to write into them.
I suggest you first go through your file once to create your desired groups and datasets, using the shape parameter to size them properly, and then populate the datasets with the actual data.
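A sketch of that two-pass idea (the counts dict is a hypothetical result of the first pass, and records stands in for your struct-unpacking loop):
import h5py

# first pass (not shown): count how many records belong to each index
counts = {1: 1000, 2: 800}  # hypothetical counts per index

with h5py.File(filename1, 'a') as f:
    # create one properly sized dataset per index, exactly once
    dsets = {idx: f.create_dataset(str(idx), shape=(n, 3), dtype='f8')
             for idx, n in counts.items()}
    cursor = {idx: 0 for idx in counts}

    # second pass: write each (time, x, y) record into the next free row
    for time, index, x, y in records:  # `records` stands in for your unpacking loop
        dsets[index][cursor[index]] = (time, x, y)
        cursor[index] += 1
Alternatively, datasets created with maxshape=(None, 3) can be grown later with dataset.resize(), which avoids the first pass at the cost of repeated resizes.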

How to get the indices for randomly selected rows in a list (Python)

Okay, I don't know if I phrased it badly or something, but I can't seem to find anything similar here for my problem.
So I have a 2D list, each row representing a case and each column representing a feature (for machine learning). In addition, I have a separate list (column) of labels.
I want to randomly select the rows from the 2D list to train a classifier while using the rest to test for accuracy. Thus I want to be able to know all the indices of rows I used for training to avoid repeats.
I think there are 2 parts of the question:
1) how to randomly select
2) how to get indices
Again, I have no idea why I can't find good info here by searching (maybe I just suck).
Sorry, I'm still new to the community so I might have made a lot of formatting mistakes. If you have any suggestions, please let me know.
Here's the part of code I'm using to get the 2D list
# 273 = number of cases
feature_list = [[0] * len(mega_list)] * 273
# create counters to use for index later
link_count = 0
feature_count = 0
# print len(mega_list)
for link in url_list[:-1]:
    # setup the url
    samp_url = 'http://www.mtsamples.com' + link
    samp_url = "%20".join(samp_url.split())
    # soup it for keywords
    samp_soup = BeautifulSoup(urllib2.urlopen(samp_url).read())
    keywords = samp_soup.find('meta')['content']
    keywords = keywords.split(',')
    for keys in keywords:
        # print 'megalist: ' + str(mega_list.index(keys))
        if keys in mega_list:
            feature_list[link_count][mega_list.index(keys)] = 1
mega_list: a list with all keywords
feature_list: the 2D list; for any keyword that appears in mega_list, the corresponding cell is set to 1, otherwise 0
I would store the data in a pandas data frame instead of a 2D list. If I understand your data right you could do that like this:
import pandas as pd
df = pd.DataFrame(feature_list, columns = mega_list)
I don't see any mention of a dependent variable, but I'm assuming you have one because you mentioned a classifier algorithm. If your dependent variable is called "Y" and is in a list format with indices that align with your features, then this code will work for you:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df, Y, test_size=0.8, random_state=0)
As I understand the problem, you have a list and you want to sample the list and save the indices for future use.
See: https://docs.python.org/2/library/random.html
You could do random.sample(xrange(sizeoflist), sizeofsample), which will return the indices of your sample. You can then use those indices for training and skip over them (or get fancy and do a set difference) for validation.
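A sketch of that index-based split (the 80/20 ratio is just an example; the same index lists can also be used to split your labels so rows and labels stay aligned):
import random

n_rows = len(feature_list)
train_idx = random.sample(range(n_rows), int(0.8 * n_rows))    # e.g. 80% of rows for training
test_idx = sorted(set(range(n_rows)) - set(train_idx))         # the held-out rows

train_X = [feature_list[i] for i in train_idx]
test_X = [feature_list[i] for i in test_idx]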
Hope this helps
