Scipy: parallel computing in ipython notebook?

Scipy: parallel computing in ipython notebook? - python

I'm doing a kernel density estimation of a dataset (a collection of points).
The estimation process is ok, the problem is that, when I'm trying to get the density value for each point, the speed is very slow:
from sklearn.neighbors import KernelDensity
# this speed is ok
kde = KernelDensity(bandwidth=2.0,atol=0.0005,rtol=0.01).fit(sample)
# this is very slow
kde_result = kde.score_samples(sample)
The sample is consist of 300,000 (x,y) points.
I'm wondering if it's possible to make it run parallely, so the speed would be quicker?
For example, maybe I can divide the sample in to smaller sets and run the score_samples for each set at the same time? Specifically:
I'm not familliar with parallel computing at all. So I'm wondering if it's applicable in my case?
If this can really speed up the process, what should I do? I'm just running the script in ipython notebook, and have no prior expereince in this, is there any good and simple example for my case?
I'm reading http://ipython.org/ipython-doc/dev/parallel/parallel_intro.html now.
UPDATE:
import cProfile
cProfile.run('kde.score_samples(sample)')
64 function calls in 8.653 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 8.653 8.653 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 _methods.py:31(_sum)
2 0.000 0.000 0.000 0.000 base.py:870(isspmatrix)
1 0.000 0.000 8.653 8.653 kde.py:133(score_samples)
4 0.000 0.000 0.000 0.000 numeric.py:464(asanyarray)
2 0.000 0.000 0.000 0.000 shape_base.py:60(atleast_2d)
2 0.000 0.000 0.000 0.000 validation.py:105(_num_samples)
2 0.000 0.000 0.000 0.000 validation.py:126(_shape_repr)
6 0.000 0.000 0.000 0.000 validation.py:153(<genexpr>)
2 0.000 0.000 0.000 0.000 validation.py:268(check_array)
2 0.000 0.000 0.000 0.000 validation.py:43(_assert_all_finite)
6 0.000 0.000 0.000 0.000 {hasattr}
4 0.000 0.000 0.000 0.000 {isinstance}
12 0.000 0.000 0.000 0.000 {len}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 8.652 8.652 8.652 8.652 {method 'kernel_density' of 'sklearn.neighbors.kd_tree.BinaryTree' objects}
2 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
2 0.000 0.000 0.000 0.000 {method 'sum' of 'numpy.ndarray' objects}
6 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}

Here is a simple example of parallelization using multiprocessing built-in module :
import numpy as np
import multiprocessing
from sklearn.neighbors import KernelDensity
def parrallel_score_samples(kde, samples, thread_count=int(0.875 * multiprocessing.cpu_count())):
with multiprocessing.Pool(thread_count) as p:
return np.concatenate(p.map(kde.score_samples, np.array_split(samples, thread_count)))
kde = KernelDensity(bandwidth=2.0,atol=0.0005,rtol=0.01).fit(sample)
kde_result = parrallel_score_samples(kde, sample)
As you can see from code above, multiprocessing.Pool allows you to map a pool of worker processes executing kde.score_samples on a subset of your samples.
The speedup will be significant if your processor have enough cores.

Related

efficiently read one file from a zip containing a lot of files in python

I am storing an index in a compressed zip on disk and wanted to extract a single file from this zip. Doing this in python seems to be very slow, is it possible to solve this.
with zipfile.ZipFile("testoutput/index_doc.zip", mode='r') as myzip:
with myzip.open("c0ibtxf_i.txt") as mytxt:
txt = mytxt.read()
txt = codecs.decode(txt, "utf-8")
print(txt)
Is the python code I use. Running this script in python takes a noticably long time
python3 testunzip.py 1.22s user 0.06s system 98% cpu 1.303 total
Which is annoying, especially since I know it can go much faster:
unzip -p testoutput/index_doc.zip c0ibtxf_i.txt 0.01s user 0.00s system 69% cpu 0.023 total
as per request: profiling
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.051 0.051 1.492 1.492 <string>:1(<module>)
127740 0.043 0.000 0.092 0.000 cp437.py:14(decode)
1 0.000 0.000 1.441 1.441 testunzip.py:69(toprofile)
1 0.000 0.000 0.000 0.000 threading.py:72(RLock)
1 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 zipfile.py:1065(__enter__)
1 0.000 0.000 0.000 0.000 zipfile.py:1068(__exit__)
1 0.692 0.692 1.441 1.441 zipfile.py:1085(_RealGetContents)
1 0.000 0.000 0.000 0.000 zipfile.py:1194(getinfo)
1 0.000 0.000 0.000 0.000 zipfile.py:1235(open)
1 0.000 0.000 0.000 0.000 zipfile.py:1591(__del__)
2 0.000 0.000 0.000 0.000 zipfile.py:1595(close)
2 0.000 0.000 0.000 0.000 zipfile.py:1713(_fpclose)
1 0.000 0.000 0.000 0.000 zipfile.py:191(_EndRecData64)
1 0.000 0.000 0.000 0.000 zipfile.py:234(_EndRecData)
127739 0.180 0.000 0.220 0.000 zipfile.py:320(__init__)
127739 0.046 0.000 0.056 0.000 zipfile.py:436(_decodeExtra)
1 0.000 0.000 0.000 0.000 zipfile.py:605(_check_compression)
1 0.000 0.000 0.000 0.000 zipfile.py:636(_get_decompressor)
1 0.000 0.000 0.000 0.000 zipfile.py:654(__init__)
3 0.000 0.000 0.000 0.000 zipfile.py:660(read)
1 0.000 0.000 0.000 0.000 zipfile.py:667(close)
1 0.000 0.000 0.000 0.000 zipfile.py:708(__init__)
1 0.000 0.000 0.000 0.000 zipfile.py:821(read)
1 0.000 0.000 0.000 0.000 zipfile.py:854(_update_crc)
1 0.000 0.000 0.000 0.000 zipfile.py:901(_read1)
1 0.000 0.000 0.000 0.000 zipfile.py:937(_read2)
1 0.000 0.000 0.000 0.000 zipfile.py:953(close)
1 0.000 0.000 1.441 1.441 zipfile.py:981(__init__)
127740 0.049 0.000 0.049 0.000 {built-in method _codecs.charmap_decode}
1 0.000 0.000 0.000 0.000 {built-in method _codecs.decode}
1 0.000 0.000 0.000 0.000 {built-in method _codecs.utf_8_decode}
127743 0.058 0.000 0.058 0.000 {built-in method _struct.unpack}
127739 0.016 0.000 0.016 0.000 {built-in method builtins.chr}
1 0.000 0.000 1.492 1.492 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
255484 0.020 0.000 0.020 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method builtins.max}
1 0.000 0.000 0.000 0.000 {built-in method builtins.min}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
1 0.000 0.000 0.000 0.000 {built-in method io.open}
2 0.000 0.000 0.000 0.000 {built-in method zlib.crc32}
1 0.000 0.000 0.000 0.000 {function ZipExtFile.close at 0x101975620}
127741 0.011 0.000 0.011 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.BufferedReader' objects}
127740 0.224 0.000 0.317 0.000 {method 'decode' of 'bytes' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
127739 0.024 0.000 0.024 0.000 {method 'find' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
7 0.006 0.001 0.006 0.001 {method 'read' of '_io.BufferedReader' objects}
510956 0.071 0.000 0.071 0.000 {method 'read' of '_io.BytesIO' objects}
8 0.000 0.000 0.000 0.000 {method 'seek' of '_io.BufferedReader' objects}
4 0.000 0.000 0.000 0.000 {method 'tell' of '_io.BufferedReader' objects}
it seems to be something that happens in the constructor? Can I avoid this overhead somehow?

I figured out what the problem was:
pythons zipfile library builds a list of information object for each file in the zip
this causes zipfile to be quite fast once it's loaded.
but when there are a lot of files in the zip and you only need a small portion of this files each time you load the zip, the overhead of creating the info-list costs a lot of time.
To solve this, I adapted the source of python's zipfile. It has all the default functionalities you need, but when you give the constructor a list of the filenames to extract, it will not build the entire information list.
In the particular use case that you only need a few files from a zip, this will make a big difference in performance and memory usage.
for the particular case in the example above (namely extracting only one file from a zip containing 128K files, the speed of the new implementation now approaches the speed of the unzip method)
A test case:
def original_zipfile():
import zipfile
with zipfile.ZipFile("testoutput/index_doc.zip", mode='r') as myzip:
with myzip.open("c6kn5pu_i.txt") as mytxt:
txt = mytxt.read()
def my_zipfile():
import zipfile2
with zipfile2.ZipFile("testoutput/index_doc.zip", to_extract=["c6kn5pu_i.txt"], mode='r') as myzip:
with myzip.open("c6kn5pu_i.txt") as mytxt:
txt = mytxt.read()
if __name__ == "__main__":
import time
time1 = time.time()
original_zipfile()
print("running time of original_zipfile = "+str(time.time()-time1))
time1 = time.time()
my_zipfile()
print("running time of my_new_zipfile = "+str(time.time()-time1))
print(myStopwatch.getPretty())
resulted in the following time readings
running time of original_zipfile = 1.0871901512145996
running time of my_new_zipfile = 0.07036209106445312
I will include the source code, but notice that there are 2 small flaws to my implementation (once you give an extract list, when you don't the behaviour will be the same as mentioned before):
it assumes all filenames to be encoded in the same encoding (which is an optimisation I included for my own purposes)
other functionality might be altered (for example extract_all might fail or only extract the files you gave to the the constructor)
github link

Python Profiling - discrepancy between inlining and calling function

I have this code:
def getNeighbors(cfg, cand, adj):
c_nocfg = np.setdiff1d(cand, cfg)
deg = np.sum(adj[np.ix_(cfg, c_nocfg)], axis=0)
degidx = np.where(deg) > 0
nbs = c_nocfg[degidx]
deg = deg[degidx]
return nbs, deg
which retrieves neighbors (and their degree in a subgraph spanned by nodes in cfg) from an adjacency matrix.
Inlining the code gives reasonable performance (~10kx10k adjacency matrix as boolean array, 10k candidates in cand, subgraph cfg spanning 500 nodes): 0.02s
However, calling the function getNeighbors results in an overhead of roughly 0.5s.
Mocking
deg = np.sum(adj[np.ix_(cfg, c_nocfg)], axis=0)
with
deg = np.random.randint(500, size=c_nocfg.shape[0])
drives down the runtime of the function call to 0.002s.
Could someone explain me what causes the enormous overhead in my case - after all, the sum-operation itself is not all too costly.
Profiling output
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.466 0.466 0.488 0.488 /home/x/x/x/utils/benchmarks.py:15(getNeighbors)
1 0.000 0.000 0.019 0.019 /usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:1621(sum)
1 0.000 0.000 0.019 0.019 /usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py:23(_sum)
1 0.019 0.019 0.019 0.019 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.002 0.002 /usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py:410(setdiff1d)
1 0.000 0.000 0.001 0.001 /usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py:284(in1d)
2 0.001 0.000 0.001 0.000 {method 'argsort' of 'numpy.ndarray' objects}
2 0.000 0.000 0.001 0.000 /usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py:93(unique)
2 0.000 0.000 0.000 0.000 {method 'sort' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.where}
4 0.000 0.000 0.000 0.000 {numpy.core.multiarray.concatenate}
2 0.000 0.000 0.000 0.000 {method 'flatten' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 /usr/local/lib/python2.7/dist-packages/numpy/lib/index_tricks.py:26(ix_)
5 0.000 0.000 0.000 0.000 /usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py:392(asarray)
5 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'reshape' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {range}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
2 0.000 0.000 0.000 0.000 {issubclass}
2 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
6 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
inline version:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.019 0.019 /usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:1621(sum)
1 0.000 0.000 0.019 0.019 /usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py:23(_sum)
1 0.019 0.019 0.019 0.019 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.002 0.002 /usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py:410(setdiff1d)
1 0.000 0.000 0.001 0.001 /usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py:284(in1d)
2 0.001 0.000 0.001 0.000 {method 'argsort' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 /usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py:93(unique)
2 0.000 0.000 0.000 0.000 {method 'sort' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.where}
4 0.000 0.000 0.000 0.000 {numpy.core.multiarray.concatenate}
2 0.000 0.000 0.000 0.000 {method 'flatten' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 /usr/local/lib/python2.7/dist-packages/numpy/lib/index_tricks.py:26(ix_)
5 0.000 0.000 0.000 0.000 /usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py:392(asarray)
5 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'reshape' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {range}
2 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
2 0.000 0.000 0.000 0.000 {issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
6 0.000 0.000 0.000 0.000 {len}
sample data for testing:
np.random.seed(0)
adj = np.zeros(10000*10000, dtype=np.bool)
adj[np.random.randint(low=0, high=10000*10000+1, size=100000)] = True
adj = adj.reshape((10000, 10000))
cand = np.arange(adj.shape[0])
cfgs = np.random.choice(cand, size=500)

Slow conversion of binary data

I have a binary file with a particular format, described here for those that are interested. The format isn't the import thing. I can read and convert this data into the form that I want but the problem is these binary files tend to have a lot of information in them. If I am just returning the bytes as read, this is very quick (less than 1 second), but I can't do anything useful with those bytes, they need to be converted into genotypes first and that is the code that appears to be slowing things down.
The conversion for a series of bytes into genotypes is as follows
h = ['%02x' % ord(b) for b in currBytes]
b = ''.join([bin(int(i, 16))[2:].zfill(8)[::-1] for i in h])[:nBits]
genotypes = [b[i:i+2] for i in range(0, len(b), 2)]
map = {'00': 0, '01': 1, '11': 2, '10': None}
return [map[i] for i in genotypes]
What I am hoping is that there is a faster way to do this? Any ideas? Below are the results of running python -m cProfile test.py where test.py is calling a reader object I have written to read these files.
vlan1711:src davykavanagh$ python -m cProfile test.py
183, 593483, 108607389, 366, 368, 46
that took 93.6410450935
86649088 function calls in 96.396 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.248 1.248 2.753 2.753 plinkReader.py:13(__init__)
1 0.000 0.000 0.000 0.000 plinkReader.py:47(plinkReader)
1 0.000 0.000 0.000 0.000 plinkReader.py:48(__init__)
1 0.000 0.000 0.000 0.000 plinkReader.py:5(<module>)
1 0.000 0.000 0.000 0.000 plinkReader.py:55(__iter__)
593484 77.634 0.000 91.477 0.000 plinkReader.py:58(next)
1 0.000 0.000 0.000 0.000 plinkReader.py:71(SNP)
593483 1.123 0.000 1.504 0.000 plinkReader.py:75(__init__)
1 0.000 0.000 0.000 0.000 plinkReader.py:8(plinkFiles)
1 0.000 0.000 0.000 0.000 plinkReader.py:85(Person)
183 0.000 0.000 0.001 0.000 plinkReader.py:89(__init__)
1 2.166 2.166 96.396 96.396 test.py:5(<module>)
27300218 5.909 0.000 5.909 0.000 {bin}
593483 0.080 0.000 0.080 0.000 {len}
1 0.000 0.000 0.000 0.000 {math.ceil}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
593483 0.531 0.000 0.531 0.000 {method 'join' of 'str' objects}
593485 0.588 0.000 0.588 0.000 {method 'read' of 'file' objects}
593666 0.257 0.000 0.257 0.000 {method 'rsplit' of 'str' objects}
593666 0.125 0.000 0.125 0.000 {method 'rstrip' of 'str' objects}
27300218 4.098 0.000 4.098 0.000 {method 'zfill' of 'str' objects}
3 0.000 0.000 0.000 0.000 {open}
27300218 1.820 0.000 1.820 0.000 {ord}
593483 0.817 0.000 0.817 0.000 {range}
2 0.000 0.000 0.000 0.000 {time.time}

You are slowing things down by creating lists and large strings you don't need. You are just examining bits of the bytes and convert two-bit groups into numbers. That can be achieved much simpler, e. g. by this code:
def convert(currBytes, nBits):
for byte in currBytes:
for p in range(4):
bits = (ord(byte) >> (p*2)) & 3
yield None if bits == 1 else 1 if bits == 2 else 2 if bits == 3 else 0
nBits -= 2
if nBits <= 0:
raise StopIteration()
In case you really need a list in the end, just use
list(convert(currBytes, nBits))
But I guess there can be cases in which you just want to iterate over the results:
for blurp in convert(currBytes, nBits):
# handle your blurp (0, 1, 2, or None)

Profile Python cProfile vs unix time

I am profiling a python code ; why does it spend more time in the user space ?
user#terminal$ time python main.py
1964 function calls in 0.003 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.003 0.003 :1()
1 0.000 0.000 0.000 0.000 ConfigParser.py:218(init)
1 0.000 0.000 0.001 0.001 ConfigParser.py:266(read)
30 0.000 0.000 0.000 0.000 ConfigParser.py:354(optionxform)
1 0.000 0.000 0.000 0.000 ConfigParser.py:434(_read)
15 0.000 0.000 0.000 0.000 ConfigParser.py:515(get)
15 0.000 0.000 0.000 0.000 ConfigParser.py:611(_interpolate)
15 0.000 0.000 0.000 0.000 ConfigParser.py:619(_interpolate_some)
1 0.000 0.000 0.000 0.000 config.py:32(read_config_data)
1 0.000 0.000 0.001 0.001 config.py:9(init)
6 0.000 0.000 0.000 0.000 entity.py:108(add_to_filter)
1 0.000 0.000 0.002 0.002 entity.py:24(init)
1 0.001 0.001 0.002 0.002 entity.py:39(create_inverted_index)
493 0.000 0.000 0.001 0.000 entity.py:80(beautify)
1 0.000 0.000 0.000 0.000 entity.py:84(create_bucket_lookup)
1 0.000 0.000 0.000 0.000 main.py:15()
2 0.000 0.000 0.000 0.000 main.py:18()
1 0.000 0.000 0.003 0.003 main.py:23(main)
1 0.000 0.000 0.000 0.000 main.py:9(get_bag_of_words)
19 0.000 0.000 0.000 0.000 {built-in method group}
34 0.000 0.000 0.000 0.000 {built-in method match}
1 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {len}
28 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of 'file' objects}
15 0.000 0.000 0.000 0.000 {method 'copy' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
15 0.000 0.000 0.000 0.000 {method 'find' of 'str' objects}
19 0.000 0.000 0.000 0.000 {method 'isspace' of 'str' objects}
24 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
49 0.000 0.000 0.000 0.000 {method 'lower' of 'str' objects}
20 0.000 0.000 0.000 0.000 {method 'readline' of 'file' objects}
6 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
24 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
47 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
9 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
1030 0.000 0.000 0.000 0.000 {method 'strip' of 'str' objects}
15 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
10 0.000 0.000 0.000 0.000 {open}
2 0.000 0.000 0.000 0.000 {range}
3 0.000 0.000 0.000 0.000 {reduce}
Done
real 0m0.063s
user 0m0.050s
sys 0m0.010s
While the cProfile says it took only 0.003 seconds, why is unix (sys) time saying it runs in 0.01 seconds?

time(1) is measuring the execution time of the whole process, whereas the profiler excludes Python interpreter startup time, bytecode compilation time, etc.

how to avoid substrings

I currently process sections of a string like this:
for (i, j) in huge_list_of_indices:
process(huge_text_block[i:j])
I want to avoid the overhead of generating these temporary substrings. Any ideas? Perhaps a wrapper that somehow uses index offsets? This is currently my bottleneck.
Note that process() is another python module that expects a string as input.
Edit:
A few people doubt there is a problem. Here are some sample results:
import time
import string
text = string.letters * 1000
def timeit(fn):
t1 = time.time()
for i in range(len(text)):
fn(i)
t2 = time.time()
print '%s took %0.3f ms' % (fn.func_name, (t2-t1) * 1000)
def test_1(i):
return text[i:]
def test_2(i):
return text[:]
def test_3(i):
return text
timeit(test_1)
timeit(test_2)
timeit(test_3)
Output:
test_1 took 972.046 ms
test_2 took 47.620 ms
test_3 took 43.457 ms

I think what you are looking for are buffers.
The characteristic of buffers is that they "slice" an object supporting the buffer interface without copying its content, but essentially opening a "window" on the sliced object content. Some more technical explanation is available here. An excerpt:
Python objects implemented in C can export a group of functions called the “buffer interface.” These functions can be used by an object to expose its data in a raw, byte-oriented format. Clients of the object can use the buffer interface to access the object data directly, without needing to copy it first.
In your case the code should look more or less like this:
>>> s = 'Hugely_long_string_not_to_be_copied'
>>> ij = [(0, 3), (6, 9), (12, 18)]
>>> for i, j in ij:
... print buffer(s, i, j-i) # Should become process(...)
Hug
_lo
string
HTH!

A wrapper that uses index offsets to a mmap object could work, yes.
But before you do that, are you sure that generating these substrings are a problem? Don't optimize before you have found out where the time and memory actually goes. I wouldn't expect this to be a significant problem.

If you are using Python3 you can use protocol buffer and memory views. Assuming that the text is stored somewhere in the filesystem:
f = open(FILENAME, 'rb')
data = bytearray(os.path.getsize(FILENAME))
f.readinto(data)
mv = memoryview(data)
for (i, j) in huge_list_of_indices:
process(mv[i:j])
Check also this article. It might be useful.

Maybe a wrapper that uses index offsets is indeed what you are looking for. Here is an example that does the job. You may have to add more checks on slices (for overflow and negative indexes) depending on your needs.
#!/usr/bin/env python
from collections import Sequence
from timeit import Timer
def process(s):
return s[0], len(s)
class FakeString(Sequence):
def __init__(self, string):
self._string = string
self.fake_start = 0
self.fake_stop = len(string)
def setFakeIndices(self, i, j):
self.fake_start = i
self.fake_stop = j
def __len__(self):
return self.fake_stop - self.fake_start
def __getitem__(self, ii):
if isinstance(ii, slice):
if ii.start is None:
start = self.fake_start
else:
start = ii.start + self.fake_start
if ii.stop is None:
stop = self.fake_stop
else:
stop = ii.stop + self.fake_start
ii = slice(start,
stop,
ii.step)
else:
ii = ii + self.fake_start
return self._string[ii]
def initial_method():
r = []
for n in xrange(1000):
r.append(process(huge_string[1:9999999]))
return r
def alternative_method():
r = []
for n in xrange(1000):
fake_string.setFakeIndices(1, 9999999)
r.append(process(fake_string))
return r
if __name__ == '__main__':
huge_string = 'ABCDEFGHIJ' * 100000
fake_string = FakeString(huge_string)
fake_string.setFakeIndices(5,15)
assert fake_string[:] == huge_string[5:15]
t = Timer(initial_method)
print "initial_method(): %fs" % t.timeit(number=1)
which gives:
initial_method(): 1.248001s
alternative_method(): 0.003416s

The example the OP gives, will give nearly biggest performance difference between slicing and not slicing possible.
If processing actually does something that takes significant time, the problem may hardly exist.
Fact is OP needs to let us know what process does. The most likely scenario is it does something significant, and therefore he should profile his code.
Adapted from op's example:
#slice_time.py
import time
import string
text = string.letters * 1000
import random
indices = range(len(text))
random.shuffle(indices)
import re
def greater_processing(a_string):
results = re.findall('m', a_string)
def medium_processing(a_string):
return re.search('m.*?m', a_string)
def lesser_processing(a_string):
return re.match('m', a_string)
def least_processing(a_string):
return a_string
def timeit(fn, processor):
t1 = time.time()
for i in indices:
fn(i, i + 1000, processor)
t2 = time.time()
print '%s took %0.3f ms %s' % (fn.func_name, (t2-t1) * 1000, processor.__name__)
def test_part_slice(i, j, processor):
return processor(text[i:j])
def test_copy(i, j, processor):
return processor(text[:])
def test_text(i, j, processor):
return processor(text)
def test_buffer(i, j, processor):
return processor(buffer(text, i, j - i))
if __name__ == '__main__':
processors = [least_processing, lesser_processing, medium_processing, greater_processing]
tests = [test_part_slice, test_copy, test_text, test_buffer]
for processor in processors:
for test in tests:
timeit(test, processor)
And then the run...
In [494]: run slice_time.py
test_part_slice took 68.264 ms least_processing
test_copy took 42.988 ms least_processing
test_text took 33.075 ms least_processing
test_buffer took 76.770 ms least_processing
test_part_slice took 270.038 ms lesser_processing
test_copy took 197.681 ms lesser_processing
test_text took 196.716 ms lesser_processing
test_buffer took 262.288 ms lesser_processing
test_part_slice took 416.072 ms medium_processing
test_copy took 352.254 ms medium_processing
test_text took 337.971 ms medium_processing
test_buffer took 438.683 ms medium_processing
test_part_slice took 502.069 ms greater_processing
test_copy took 8149.231 ms greater_processing
test_text took 8292.333 ms greater_processing
test_buffer took 563.009 ms greater_processing
Notes:
Yes I tried OP's original test_1 with [i:] slice and it's much slower, making his test even more bunk.
Interesting that buffer almost always performs slightly slower then slicing. This time there is one where it does better though! The real test though is below and buffer seems to do better for larger substrings while slicing does better for smaller substrings.
And, yes, I do have some randomness in this test so test away and see the different results :). It also may be interesting to changes the size of the 1000's.
So, maybe some others believe you, but I don't, so I'd like to know something about what processing does and how you came to the conclusion: "slicing is the problem."
I profiled medium processing in my example and upped the string.letters multiplier to 100000 and raised the length of the slices to 10000. Also below is one with slices of length 100. I used cProfile for these (much less overhead then profile!).
test_part_slice took 77338.285 ms medium_processing
31200019 function calls in 77.338 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 77.338 77.338 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 iostream.py:63(write)
5200000 8.208 0.000 43.823 0.000 re.py:139(search)
5200000 9.205 0.000 12.897 0.000 re.py:228(_compile)
5200000 5.651 0.000 49.475 0.000 slice_time.py:15(medium_processing)
1 7.901 7.901 77.338 77.338 slice_time.py:24(timeit)
5200000 19.963 0.000 69.438 0.000 slice_time.py:31(test_part_slice)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
2 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5200000 3.692 0.000 3.692 0.000 {method 'get' of 'dict' objects}
5200000 22.718 0.000 22.718 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
4 0.000 0.000 0.000 0.000 {time.time}
test_buffer took 58067.440 ms medium_processing
31200103 function calls in 58.068 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 58.068 58.068 <string>:1(<module>)
3 0.000 0.000 0.000 0.000 __init__.py:185(dumps)
3 0.000 0.000 0.000 0.000 encoder.py:102(__init__)
3 0.000 0.000 0.000 0.000 encoder.py:180(encode)
3 0.000 0.000 0.000 0.000 encoder.py:206(iterencode)
1 0.000 0.000 0.001 0.001 iostream.py:37(flush)
2 0.000 0.000 0.001 0.000 iostream.py:63(write)
1 0.000 0.000 0.000 0.000 iostream.py:86(_new_buffer)
3 0.000 0.000 0.000 0.000 jsonapi.py:57(_squash_unicode)
3 0.000 0.000 0.000 0.000 jsonapi.py:69(dumps)
2 0.000 0.000 0.000 0.000 jsonutil.py:78(date_default)
1 0.000 0.000 0.000 0.000 os.py:743(urandom)
5200000 6.814 0.000 39.110 0.000 re.py:139(search)
5200000 7.853 0.000 10.878 0.000 re.py:228(_compile)
1 0.000 0.000 0.000 0.000 session.py:149(msg_header)
1 0.000 0.000 0.000 0.000 session.py:153(extract_header)
1 0.000 0.000 0.000 0.000 session.py:315(msg_id)
1 0.000 0.000 0.000 0.000 session.py:350(msg_header)
1 0.000 0.000 0.000 0.000 session.py:353(msg)
1 0.000 0.000 0.000 0.000 session.py:370(sign)
1 0.000 0.000 0.000 0.000 session.py:385(serialize)
1 0.000 0.000 0.001 0.001 session.py:437(send)
3 0.000 0.000 0.000 0.000 session.py:75(<lambda>)
5200000 4.732 0.000 43.842 0.000 slice_time.py:15(medium_processing)
1 5.423 5.423 58.068 58.068 slice_time.py:24(timeit)
5200000 8.802 0.000 52.645 0.000 slice_time.py:40(test_buffer)
7 0.000 0.000 0.000 0.000 traitlets.py:268(__get__)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 uuid.py:101(__init__)
1 0.000 0.000 0.000 0.000 uuid.py:197(__str__)
1 0.000 0.000 0.000 0.000 uuid.py:531(uuid4)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method now}
18 0.000 0.000 0.000 0.000 {isinstance}
4 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {locals}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {method 'count' of 'list' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
5200001 3.025 0.000 3.025 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.StringIO' objects}
3 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
5200000 21.418 0.000 21.418 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
2 0.000 0.000 0.000 0.000 {method 'strftime' of 'datetime.date' objects}
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
4 0.000 0.000 0.000 0.000 {time.time}
Smaller slices (100 length).
test_part_slice took 54916.153 ms medium_processing
31200019 function calls in 54.916 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 54.916 54.916 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 iostream.py:63(write)
5200000 6.788 0.000 38.312 0.000 re.py:139(search)
5200000 8.014 0.000 11.257 0.000 re.py:228(_compile)
5200000 4.722 0.000 43.034 0.000 slice_time.py:15(medium_processing)
1 5.594 5.594 54.916 54.916 slice_time.py:24(timeit)
5200000 6.288 0.000 49.322 0.000 slice_time.py:31(test_part_slice)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
2 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5200000 3.242 0.000 3.242 0.000 {method 'get' of 'dict' objects}
5200000 20.268 0.000 20.268 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
4 0.000 0.000 0.000 0.000 {time.time}
test_buffer took 62019.684 ms medium_processing
31200103 function calls in 62.020 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 62.020 62.020 <string>:1(<module>)
3 0.000 0.000 0.000 0.000 __init__.py:185(dumps)
3 0.000 0.000 0.000 0.000 encoder.py:102(__init__)
3 0.000 0.000 0.000 0.000 encoder.py:180(encode)
3 0.000 0.000 0.000 0.000 encoder.py:206(iterencode)
1 0.000 0.000 0.001 0.001 iostream.py:37(flush)
2 0.000 0.000 0.001 0.000 iostream.py:63(write)
1 0.000 0.000 0.000 0.000 iostream.py:86(_new_buffer)
3 0.000 0.000 0.000 0.000 jsonapi.py:57(_squash_unicode)
3 0.000 0.000 0.000 0.000 jsonapi.py:69(dumps)
2 0.000 0.000 0.000 0.000 jsonutil.py:78(date_default)
1 0.000 0.000 0.000 0.000 os.py:743(urandom)
5200000 7.426 0.000 41.152 0.000 re.py:139(search)
5200000 8.470 0.000 11.628 0.000 re.py:228(_compile)
1 0.000 0.000 0.000 0.000 session.py:149(msg_header)
1 0.000 0.000 0.000 0.000 session.py:153(extract_header)
1 0.000 0.000 0.000 0.000 session.py:315(msg_id)
1 0.000 0.000 0.000 0.000 session.py:350(msg_header)
1 0.000 0.000 0.000 0.000 session.py:353(msg)
1 0.000 0.000 0.000 0.000 session.py:370(sign)
1 0.000 0.000 0.000 0.000 session.py:385(serialize)
1 0.000 0.000 0.001 0.001 session.py:437(send)
3 0.000 0.000 0.000 0.000 session.py:75(<lambda>)
5200000 5.399 0.000 46.551 0.000 slice_time.py:15(medium_processing)
1 5.958 5.958 62.020 62.020 slice_time.py:24(timeit)
5200000 9.510 0.000 56.061 0.000 slice_time.py:40(test_buffer)
7 0.000 0.000 0.000 0.000 traitlets.py:268(__get__)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 uuid.py:101(__init__)
1 0.000 0.000 0.000 0.000 uuid.py:197(__str__)
1 0.000 0.000 0.000 0.000 uuid.py:531(uuid4)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method now}
18 0.000 0.000 0.000 0.000 {isinstance}
4 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {locals}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {method 'count' of 'list' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
5200001 3.158 0.000 3.158 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.StringIO' objects}
3 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
5200000 22.097 0.000 22.097 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
2 0.000 0.000 0.000 0.000 {method 'strftime' of 'datetime.date' objects}
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
4 0.000 0.000 0.000 0.000 {time.time}

process(huge_text_block[i:j])
I want to avoid the overhead of generating these temporary substrings.
(...)
Note that process() is another python module
that expects a string as input.
Completely contradictory.
How can you imagine to find a way for not creating what the function requires ?!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scipy: parallel computing in ipython notebook? - python

Related

efficiently read one file from a zip containing a lot of files in python

Python Profiling - discrepancy between inlining and calling function

Slow conversion of binary data

Profile Python cProfile vs unix time

how to avoid substrings

Categories

Resources