Removing fraction of a dataset - python

I am fairly new to Python, and to the NumPy and SciPy packages in particular.
I am doing regression analysis for a class assignment, which involves trying different regression techniques on a data set and seeing which one works. This involves deleting values from the data set and seeing which algorithm performs well with the reduced data. Right now I am indexing up to a fraction of the length of the data set.
Something like:
data = np.loadtxt("filename")
to_be_used = data[0:int(0.6 * len(data))]
Is there any other way I can do this? Say, I want to randomly select 60% of the data instead of the first 60%.

You can grab a random subset of a one-dimensional array using the numpy.random.choice function:
subset = np.random.choice(data, int(len(data)*0.6), replace=False)
However, if you want to create multiple non-overlapping random sets, you should instead shuffle your array, then use regular slices to get the amount you want in each chunk. For instance, to randomly split your data in half:
np.random.shuffle(data)
one_random_half = data[:len(data)//2]
other_random_half = data[len(data)//2:]
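If the data has several columns (as an array loaded with np.loadtxt often does), one way to sketch the same idea is to shuffle row indices instead of the array itself. The following is a minimal illustration, assuming a two-dimensional array; the names split_rows, fraction and seed are made up for this example, and the seed is fixed only so the result is reproducible.

```python
import numpy as np

def split_rows(data, fraction=0.6, seed=0):
    """Randomly split the rows of `data` into two non-overlapping parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))   # shuffled row indices
    cut = int(fraction * len(data))      # number of rows in the first part
    return data[order[:cut]], data[order[cut:]]

# Stand-in for np.loadtxt("filename"): 10 rows, 5 columns.
data = np.arange(50).reshape(10, 5)
kept, held_out = split_rows(data, fraction=0.6)
print(kept.shape, held_out.shape)
```

Because the rows are selected through an index array, the original data is left in its original order.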

Related

Optimizing random sampling for scaling permutations

I have a two-fold problem here.
1) My initial problem, which was trying to improve the computational time of my algorithm
2) My next problem, which is that one of my 'improvements' appears to consume over 500 GB of RAM, and I don't really know why.
I've been writing a permutation for a bioinformatics pipeline. Basically, the algorithm starts off sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is not met again, this up-scales to 100000, and then 1000000 nulls per variant.
I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly looking comprehension:
output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))
Basically all this is doing is calling my random_sampling_for_variants function on each variant in my variant list, and throwing the two outputs from that function into a list of lists (so, I end up with two lists of lists, output_names_df, and output_p_values_df). I then turn these list of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:
def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    # Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    # If number of permutations >= number of nulls, sample with replacement
    if num_permutations >= len(permuted_null_variant_table):
        replace = True
    else:
        replace = False
    # Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return temp_names_df, temp_p_values_df
When defining permuted_null_variant_table, I'm just querying a pre-defined grouped up DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows in there determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually being done. temp_names_df and temp_p_values_df get the values that I want out of the null table, and then I send my returns out of the function.
The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to get to, my process got killed because it was apparently going to take up more RAM than the cluster had.
I'm stumped on how to proceed. I've turned all my ints into 8-bit uints and all my floats into 16-bit floats. I've dropped columns that I don't need where I don't need them, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the comprehension takes about 3/5 of the time the loop did. Any feedback is appreciated.

Most orthodox way of splitting matrix and list in Python

If I have ratios to split a dataset into training, validation, and test sets, what is the most orthodox and elegant way of doing this in Python?
For instance, I split my data into 60% training, 20% testing, and 20% validation. I have 1000 rows of data with 10 features each, and a label vector of size 1000. The training set matrix should be of size (600, 10), and so on.
If I create new matrices of features and lists of labels, it wouldn't be memory efficient, right? Let's say I did something like this:
TRAIN_PORTION = int(datasetSize * tr)
VALIDATION_PORTION = int(datasetSize * va)
# Whatever is left will be for testing
TEST_PORTION = datasetSize - TRAIN_PORTION - VALIDATION_PORTION
trainingSet = dataSet[0:TRAIN_PORTION]
validationSet = dataSet[TRAIN_PORTION:
                        TRAIN_PORTION + VALIDATION_PORTION]
testSet = dataSet[TRAIN_PORTION + VALIDATION_PORTION:datasetSize]
That would leave me with double the amount of used memory, right?
Sorry for any incorrect Python syntax, and thank you for any help.
That's correct: you will double the memory usage that way. To avoid doubling the memory usage, you need to do one of two things:
Release the memory from one sub-matrix before you create the next; this reduces your memory high-water mark to 1.6x the main matrix;
Write your processing routines to stop at the proper row, always working on the original matrix.
You can achieve the first one by passing list slices to your processing routines, such as
model_test(data_set[:TRAIN_PORTION])
Remember, when you refer to a slice, the interpreter will build a temporary object that results from the given limits.
RESPONSE TO OP COMMENT
The reference I gave you does create a new list. To avoid using more memory, pass the entire list and the desired limits, such as
process_function(data_set, 0, TRAIN_PORTION)
process_function(data_set, TRAIN_PORTION,
                 TRAIN_PORTION + VALIDATION_PORTION)
process_function(data_set,
                 TRAIN_PORTION + VALIDATION_PORTION,
                 len(data_set))
If you want to do this with just list slices, then please explain where you're having trouble, and why the various pieces of documentation and the tutorials aren't satisfying your needs.
If you use NumPy arrays (your code actually looks like that), you can use views, which share memory with the original array. It is not always easy to tell which operations return a view and which return a copy; as a rule of thumb, basic slicing returns a view, while indexing with an integer or boolean array returns a copy.
Short example:
import numpy as np
a = np.random.normal(size=(1000, 10))
b = a[:600]
print (b.flags['OWNDATA'])
# False
print(b[3,2])
# 0.373994992467 (some random-val)
a[3,2] = 88888888
print(b[3,2])
# 88888888.0
print(a.shape)
# (1000, 10)
print(b.shape)
# (600, 10)
This will probably allow you to do some in-place shuffle at the beginning and then use those linear-segments of your data to obtain views of train, val, test.
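The shuffle-then-slice idea can be sketched as follows. This is only an illustration under assumptions: the 60/20/20 ratios come from the question, the random data and the seed are placeholders, and the variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
data = rng.normal(size=(1000, 10))        # placeholder for the real data set

rng.shuffle(data)                         # shuffles the rows in place

# Integer arithmetic avoids any float rounding in the split points.
n_train = len(data) * 60 // 100
n_val = len(data) * 20 // 100

# Basic slices are views: no row is copied.
train = data[:n_train]
val = data[n_train:n_train + n_val]
test = data[n_train + n_val:]

print(train.shape, val.shape, test.shape)
print(np.shares_memory(train, data))
```

If the labels live in a separate array, they would have to be reordered with the same permutation, which an in-place shuffle of the features alone does not do.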

Is there a way to perform this subsampling algorithm in numpy?

The algorithm just builds up a new list from an input data array. It only appends a new element from the input array once the element has crossed the visibleDelta threshold of the previous stored element:
def subsample(data, visibleDelta):
    subsampled = [data[0]]
    for point in data[1:]:
        if abs(point - subsampled[-1]) > visibleDelta:
            subsampled.append(point)
    return subsampled
Problem is I need this to run on very large datasets (~1B values), and I'd like to use numpy or some other numerical library to do this if possible.
I should probably mention that the 'real' function won't just deal with a 1D array of data. The input data will be a pandas dataframe, with the first column being x values, and the second being y values (I'll be comparing the y values).
Any way to do this efficiently?
If you want to track the data in this way, NumPy is not the right tool; see Numba or Cython for efficiency.
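As a rough sketch of what the Numba route might look like: the loop below keeps the same logic as the question's function but writes into a preallocated array, which is a form Numba can compile. It assumes a one-dimensional numeric NumPy array as input. The fallback decorator is an addition for this example so that the code still runs (slowly) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the plain Python loop.
    def njit(func):
        return func

@njit
def subsample(data, visible_delta):
    out = np.empty_like(data)   # worst case: every point is kept
    out[0] = data[0]
    n = 1                       # number of points kept so far
    for i in range(1, data.size):
        # Compare against the last *kept* point, not the previous input point.
        if abs(data[i] - out[n - 1]) > visible_delta:
            out[n] = data[i]
            n += 1
    return out[:n].copy()

data = np.sin(np.arange(1e5) / 3e3)
print(subsample(data, 0.2).size)
```

For a DataFrame, the y column could be passed in as a NumPy array, and a variant of the loop could record the kept indices instead of the values so that the matching x values can be looked up afterwards.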
A slightly different approach is to define thresholds and record where the data crosses them:
import numpy as np
import matplotlib.pyplot as plt
data = np.sin(np.arange(1e6) / 3e4)
visibledelta = 0.2
cat = np.floor(data / visibledelta)
subsample = np.arange(data.size - 1)[np.diff(cat).astype(bool)]
plt.plot(data)
plt.plot(subsample, data[subsample], 'o')
plt.show()
This plots the data with the selected points marked as circles.
Some adjustment may be needed, but the data is split into chunks.

Python: nearest neighbour (or closest match) filtering on data records (list of tuples)

I am trying to write a function that will filter a list of tuples (mimicking an in-memory database), using a "nearest neighbour" or "nearest match" type algorithm.
I want to know the best (i.e. most Pythonic) way to go about doing this. The sample code below hopefully illustrates what I am trying to do.
datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)  # record that we are seeking to retrieve from 'database' (or nearest match)
weights = (1, 1, 1, 1)  # weights to apportion to each field in the filter
def get_nearest_neighbour(data, criteria, weights):
    for row in data:
        # calculate 'distance metric' (e.g. simple differencing) and multiply by relevant weight
        # determine the row which was either an exact match or was 'least dissimilar'
        # return the match (or nearest match)
        pass
if __name__ == '__main__':
    result = get_nearest_neighbour(datarows, filter_record, weights)
    print(result)
For the snippet above, the output should be:
(10,2.0,3.4,100)
since it is the 'nearest' to the sample data passed to the function get_nearest_neighbour().
My question then is, what is the best way to implement get_nearest_neighbour()? For the purpose of brevity etc., assume that we are only dealing with numeric values, and that the 'distance metric' we use is simply an arithmetic subtraction of the input data from the current row.
Simple out-of-the-box solution:
import math
def distance(row_a, row_b, weights):
    diffs = [math.fabs(a - b) for a, b in zip(row_a, row_b)]
    return sum(v * w for v, w in zip(diffs, weights))
def get_nearest_neighbour(data, criteria, weights):
    def sort_func(row):
        return distance(row, criteria, weights)
    return min(data, key=sort_func)
If you need to work with huge datasets, you should consider switching to NumPy and using SciPy's KDTree (scipy.spatial.KDTree) to find nearest neighbours. The advantage is that you get a more advanced algorithm and that NumPy and SciPy do their heavy lifting in optimized compiled code (NumPy's linear algebra routines, for example, are built on LAPACK).
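Before reaching for a tree structure, the same brute-force search can be vectorized with NumPy alone. The sketch below is not a k-d tree; it computes the weighted sum of absolute differences for all rows at once and picks the smallest, assuming every field is numeric and all rows have the same length.

```python
import numpy as np

def get_nearest_neighbour(data, criteria, weights):
    rows = np.asarray(data, dtype=float)
    # Absolute difference per field, one row of differences per record.
    diffs = np.abs(rows - np.asarray(criteria, dtype=float))
    # Weighted sum across fields gives one distance per record.
    distances = diffs @ np.asarray(weights, dtype=float)
    return data[int(np.argmin(distances))]

datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
print(get_nearest_neighbour(datarows, (9, 1.9, 2.9, 99), (1, 1, 1, 1)))
# (10, 2.0, 3.4, 100)
```

This is still O(N) per query, so it only moves the loop into compiled code; a tree pays off when many queries are run against the same data.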
About naive-NN:
Many of these other answers propose "naive nearest-neighbor", which is an O(N*d)-per-query algorithm (d is the dimensionality, which in this case seems constant, so it's O(N)-per-query).
While an O(N)-per-query algorithm is pretty bad, you might be able to get away with it if your workload is smaller than any of (for example):
10 queries and 100000 points
100 queries and 10000 points
1000 queries and 1000 points
10000 queries and 100 points
100000 queries and 10 points
Doing better than naive-NN:
Otherwise you will want to use one of the techniques (especially a nearest-neighbor data structure) listed in:
http://en.wikipedia.org/wiki/Nearest_neighbor_search (most likely linked off from that page), some examples linked:
http://en.wikipedia.org/wiki/K-d_tree
http://en.wikipedia.org/wiki/Locality_sensitive_hashing
http://en.wikipedia.org/wiki/Cover_tree
especially if you plan to run your program more than once. There are most likely libraries available. Not using an NN data structure would take too much time if the product #queries * #points is large. As user 'dsign' points out in the comments, you can probably squeeze out a large additional constant factor of speed by using the NumPy library.
However, if you can get away with the simple-to-implement naive-NN, you should use it.
Use heapq.nsmallest on a generator calculating the distance*weight for each record (the nearest records are the ones with the smallest distance).
Something like:
heapq.nsmallest(N, ((row, dist_function(row, criteria, weight)) for row in data), key=operator.itemgetter(1))
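Since the nearest records are those with the smallest weighted distance, heapq.nsmallest is the call that returns them. A minimal runnable sketch, where weighted_distance and nearest_rows are names invented for this example and the distance is the weighted sum of absolute differences described in the question:

```python
import heapq

def weighted_distance(row, criteria, weights):
    # Sum of weighted absolute differences, field by field.
    return sum(w * abs(a - b) for a, b, w in zip(row, criteria, weights))

def nearest_rows(data, criteria, weights, n=1):
    # Return the n rows with the smallest weighted distance, nearest first.
    return heapq.nsmallest(
        n, data, key=lambda row: weighted_distance(row, criteria, weights)
    )

datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
print(nearest_rows(datarows, (9, 1.9, 2.9, 99), (1, 1, 1, 1), n=2))
# [(10, 2.0, 3.4, 100), (11, 2.0, 5.4, 120)]
```

Because data can be any iterable, this also works when the rows are produced lazily, and only n candidates are held at a time.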

Numpy histogram of large arrays

I have a bunch of csv datasets, about 10Gb in size each. I'd like to generate histograms from their columns. But it seems like the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.
Does numpy support online binning? I'm hoping for something that iterates over my csv line by line and bins values as it reads them. This way at most one line is in memory at any one time.
Wouldn't be hard to roll my own, but wondering if someone already invented this wheel.
As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:
import numpy as np
datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')
for i in range(100):
    d = np.random.randn(1000, 1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp
I'm guessing performance will be an issue with such large files, and the overhead of calling histogram on each line might be too high. @doug's suggestion of a generator seems like a good way to address that problem.
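One way to keep the per-call overhead down is to feed np.histogram a chunk of lines at a time instead of a single line. The sketch below is one possible arrangement, assuming a CSV file without a header row and a numeric column; histogram_csv_column, column and chunk_size are names chosen for this example.

```python
import csv
from itertools import islice

import numpy as np

def histogram_csv_column(path, column, bin_edges, chunk_size=100_000):
    """Accumulate a histogram of one CSV column, reading chunk_size rows at a time."""
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        while True:
            # Pull the next chunk_size rows; only this chunk is held in memory.
            chunk = [float(row[column]) for row in islice(reader, chunk_size)]
            if not chunk:
                break
            chunk_counts, _ = np.histogram(chunk, bins=bin_edges)
            counts += chunk_counts
    return counts
```

It would be called as, for instance, histogram_csv_column("mydata.csv", 2, np.linspace(-5, 5, 20)), where the file name and column index are placeholders. Values outside the bin edges are ignored by np.histogram, so the edges need to cover the range of interest.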
Here's a way to bin your values directly:
import numpy as NP
column_of_values = NP.random.randint(10, 99, 10)
# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])
binned_values = NP.digitize(column_of_values, bins)
'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs.
'bincount' will give you (obviously) the bin counts:
NP.bincount(binned_values)
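A small worked example with fixed values may make the index convention clearer. The numbers here are made up for illustration; the minlength argument is added so that counts from successive chunks of a file always have the same length and can simply be summed.

```python
import numpy as np

values = np.array([15, 30, 60, 80, 45])
bins = np.array([0.0, 20.0, 50.0, 75.0])

# Index i means bins[i-1] <= value < bins[i]; len(bins) means "at or above the last edge".
bin_index = np.digitize(values, bins)
print(bin_index)
# [1 2 3 4 2]

# One slot per possible index (0 .. len(bins)), so every chunk yields the same shape.
counts = np.bincount(bin_index, minlength=len(bins) + 1)
print(counts)
# [0 1 2 1 1]
```

Index 0 counts values below the first edge, which is why the first slot is empty here.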
Given the size of your data set, using NumPy's loadtxt to build a generator might be useful:
data_array = NP.loadtxt("data_file.txt", delimiter=",")
def fnx():
    for i in range(0, data_array.shape[1]):
        yield data_array[:, i]
Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)
I'm posting a second answer to the same question since this approach is very different, and addresses different issues.
What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up in to quartiles or deciles.
For small datasets, the answer is easy: load the data in to an array, then sort, then read off the values at any given percentile by jumping to the index that percentage of the way through the array.
For large datasets where the memory size to hold the array is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".
I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.
I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)
More on Fenwick Trees:
http://en.wikipedia.org/wiki/Fenwick_tree
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees
Are interval, segment, fenwick trees the same?
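To make the idea concrete, here is a minimal sketch of a Fenwick tree used as a streaming counter with a k-th smallest lookup. It is not the code used for the dataset mentioned above; it assumes the values have already been shifted and scaled to integers in the range 1..size, and the class and method names are invented for this example.

```python
class FenwickTree:
    """Counts of integer values in 1..size, with prefix sums in O(log size)."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)   # 1-based internal array

    def add(self, value, count=1):
        """Record `count` occurrences of `value`."""
        i = value
        while i <= self.size:
            self.tree[i] += count
            i += i & -i                # move to the next node covering this value

    def prefix_count(self, value):
        """Number of recorded samples that are <= value."""
        total = 0
        i = value
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def find_kth(self, k):
        """Smallest value v such that prefix_count(v) >= k (k is 1-based)."""
        pos = 0
        step = 1 << self.size.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] < k:
                pos = nxt
                k -= self.tree[nxt]
            step >>= 1
        return pos + 1

tree = FenwickTree(10)
samples = [5, 1, 3, 3, 9, 7, 2]
for sample in samples:                 # in practice, values streamed from the file
    tree.add(sample)
median = tree.find_kth((len(samples) + 1) // 2)
print(median)
# 3
```

Memory use depends on the number of distinct integer values (size), not on the number of samples, which is what makes the approach workable for very long streams. Other percentiles follow by changing k.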
Binning with Generators (large dataset; fixed-width bins; float data)
If you know the width of your desired bins ahead of time -- even if there are hundreds or thousands of buckets -- then I think rolling your own solution would be fast (both to write, and to run). Here's some Python that assumes you have an iterator that gives you the next value from the file:
from math import floor
binwidth = 20
counts = dict()
filename = "mydata.csv"
for val in next_value_from_file(filename):
    binname = int(floor(val/binwidth)*binwidth)
    if binname not in counts:
        counts[binname] = 0
    counts[binname] += 1
print(counts)
The values can be floats, but this is assuming you use an integer binwidth; you may need to tweak this a bit if you want to use a binwidth of some float value.
As for next_value_from_file(), as mentioned earlier, you'll probably want to write a custom generator or an object with an __iter__() method to do this efficiently. The pseudocode for such a generator would be this:
def next_value_from_file(filename):
    with open(filename) as f:
        for line in f:
            # parse out from the line the value or values you need
            val = parse_the_value_from_the_line(line)
            yield val
If a given line has multiple values, then make parse_the_value_from_the_line() either return a list or itself be a generator, and use this pseudocode:
def next_value_from_file(filename):
    with open(filename) as f:
        for line in f:
            for val in parse_the_values_from_the_line(line):
                yield val
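For a float bin width, one possible tweak is to key the counts by the integer bin index rather than by the bin's starting value, which avoids using rounded floats as dictionary keys. This is a sketch under that assumption; bin_counts and bin_start are names made up for the example, and values can be any iterable, including a generator like the ones above.

```python
from collections import Counter
from math import floor

def bin_counts(values, binwidth):
    """Count values per fixed-width bin; keys are integer bin indices."""
    counts = Counter()
    for val in values:
        counts[floor(val / binwidth)] += 1
    return counts

def bin_start(index, binwidth):
    """Lower edge of the bin with the given index."""
    return index * binwidth

counts = bin_counts([0.1, 0.4, 0.6, 1.2, -0.1], binwidth=0.5)
for index in sorted(counts):
    print(bin_start(index, 0.5), counts[index])
```

Values that sit exactly on a bin edge can still land on either side because of floating-point rounding in the division, so edges that matter should be checked explicitly.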
