I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, and the 25th, 50th, and 75th percentiles.
Normally I would use the code attached below, but I need a more efficient way to compute these metrics because I cannot store every value p in a list. How can I calculate these values more effectively in Python?
import numpy as np
np.average(mylist)
np.min(mylist)
np.max(mylist)
np.std(mylist)
np.percentile(mylist, 25)
np.percentile(mylist, 50)
np.percentile(mylist, 75)
maxx = float('-inf')
minx = float('+inf')
sumz = 0
for index, p in enumerate(open("foo.txt", "r")):
    z = float(p)
    maxx = max(maxx, z)
    minx = min(minx, z)
    sumz += z
index += 1   # enumerate() starts at 0, so the total count is index + 1
my_max = maxx
my_min = minx
my_avg = sumz / index
Use a binary file. Then you can use numpy.memmap to map it into memory and run all sorts of algorithms on it, even if the dataset is larger than RAM.
You can even use numpy.memmap to create a memory-mapped array and read your data into it from the text file: you can work on it, and when you are done you also have the data in binary format.
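A minimal sketch of that conversion, assuming one value per line in foo.txt and a known value count (the file names, dtype, and count are placeholders, not from the original post):

import numpy as np

n = 65_000_000                                    # known number of values
out = np.memmap("foo.bin", dtype=np.float64, mode="w+", shape=(n,))
for i, line in enumerate(open("foo.txt")):        # one-off conversion pass
    out[i] = float(line)
out.flush()

data = np.memmap("foo.bin", dtype=np.float64, mode="r", shape=(n,))
print(data.min(), data.max(), data.mean(), data.std())
print(np.percentile(data, [25, 50, 75]))          # note: percentile works on an in-memory copy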
I think you are on the right track by iterating over the file and keeping track of the max and min values. To calculate the std, keep a sum of squares inside the loop, sum_of_squares += z**2, and then compute std = sqrt(sum_of_squares / n - (sumz / n)**2) after the loop; see the formula here (though this formula can suffer from numerical problems). For performance, you might want to iterate over the file in decently sized chunks of data.
To calculate the median and percentiles in a 'continuous' way, you could build up a histogram inside your loop. After the loop, you can get approximate percentiles and the median by converting the histogram to a CDF; the error will depend on the number of bins.
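A sketch of that single-pass approach, assuming the value range for the histogram bins is roughly known up front (the range and bin count below are placeholders):

import numpy as np

LO, HI, NBINS = 0.0, 1000.0, 10_000             # assumed value range and bin count
edges = np.linspace(LO, HI, NBINS + 1)
hist = np.zeros(NBINS, dtype=np.int64)

n, sumz, sum_sq = 0, 0.0, 0.0
maxx, minx = float('-inf'), float('inf')
for p in open("foo.txt"):
    z = float(p)
    n += 1
    sumz += z
    sum_sq += z * z
    maxx = max(maxx, z)
    minx = min(minx, z)
    # clamp so values exactly at (or outside) the range edges land in a valid bin
    idx = min(max(int(np.searchsorted(edges, z, side='right')) - 1, 0), NBINS - 1)
    hist[idx] += 1

mean = sumz / n
std = np.sqrt(sum_sq / n - mean ** 2)

cdf = np.cumsum(hist) / n                       # approximate percentiles from the CDF
centers = (edges[:-1] + edges[1:]) / 2
p25, p50, p75 = (centers[np.searchsorted(cdf, q)] for q in (0.25, 0.50, 0.75))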
As Antti Haapala says, the easiest and most efficient way to do this is to stick with numpy and just use a memmapped binary file instead of a text file. Yes, converting from one format to the other will take a bit of time, but it will almost certainly save more time than it costs (because you can use numpy's vectorized operations instead of loops), and it will also make your code a lot simpler.
If you can't do that, Python 3.4 will come with a statistics module. A backport to 2.6+ will hopefully be available at some point after the PEP is finalized; at present I believe you can only get stats, the earlier module it's based on, which requires 3.1+. Unfortunately, while stats does do single-pass algorithms on iterators, it doesn't have any convenient way to run multiple algorithms in parallel on the same iterator, so you have to be clever with itertools.tee and zip to force it to interleave the work instead of pulling the whole thing into memory.
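If you do end up rolling this by hand, the interleaving pattern looks roughly like the sketch below (hand-written running reducers rather than the stats module's own API, which isn't assumed here); in Python 3, zip is lazy, so it pulls one value through each copy of the iterator per step and tee never has to buffer the whole file:

import itertools

def running_min(values):
    current = float('inf')
    for v in values:
        current = min(current, v)
        yield current

def running_max(values):
    current = float('-inf')
    for v in values:
        current = max(current, v)
        yield current

def running_mean(values):
    total, n = 0.0, 0
    for v in values:
        total += v
        n += 1
        yield total / n

values = (float(line) for line in open("foo.txt"))
mins, maxes, means = itertools.tee(values, 3)
for mn, mx, mean in zip(running_min(mins), running_max(maxes), running_mean(means)):
    pass                                   # lockstep iteration keeps tee's buffers tiny
print(mn, mx, mean)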
And of course there are plenty of other modules out there if you search PyPI for "stats" and/or "statistics" and/or "statistical".
Either way, using a pre-built module will mean someone's already debugged all the problems you're going to run into, and they may have also optimized the code (maybe even ported it to C) to boot.
To get the percentiles, sort the text file using a command-line program. Use the line count (index in your program) to find the line numbers of the percentiles (index // 4, etc.), then retrieve those lines from the sorted file.
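A sketch of that idea, assuming the file has already been sorted numerically beforehand (for example with GNU sort: sort -g foo.txt > sorted.txt):

with open("sorted.txt") as f:
    n = sum(1 for _ in f)                      # total line count, like `index` in the question

targets = {n // 4: "p25", n // 2: "p50", (3 * n) // 4: "p75"}
percentiles = {}
with open("sorted.txt") as f:
    for i, line in enumerate(f):
        if i in targets:
            percentiles[targets[i]] = float(line)
            if len(percentiles) == len(targets):
                break
print(percentiles)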
Most of these operations can be expressed easily in terms of simple arithmetic. In that case, it can actually (surprisingly) be quite efficient to compute simple statistics directly from the Linux command line using awk and sed, e.g. as in this post: http://www.unixcl.com/2008/09/sum-of-and-group-by-using-awk.html
If you need to generalize to more advanced operations, like weighted percentiles, then I'd recommend using Python Pandas (notably the HDFStore capabilities for later retrieval). I've used Pandas with a DataFrame of over 25 million records before (10 columns by 25 million distinct rows). If you're more memory constrained, you could read the data in chunks, calculate partial contributions from each chunk, store the intermediate results, and then finish off the calculation by just loading the intermediate results, in a serialized, map-reduce sort of framework.
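A sketch of that chunked, map-reduce style pass for the basic statistics (the file name, column name, and chunk size are placeholders; exact percentiles still need a different strategy, such as the histogram approach above):

import numpy as np
import pandas as pd

n = 0
total = 0.0
total_sq = 0.0
mn, mx = np.inf, -np.inf
for chunk in pd.read_csv("foo.txt", header=None, names=["x"], chunksize=1_000_000):
    x = chunk["x"].to_numpy()
    n += x.size                   # accumulate partial contributions from each chunk
    total += x.sum()
    total_sq += np.square(x).sum()
    mn = min(mn, x.min())
    mx = max(mx, x.max())

mean = total / n
std = np.sqrt(total_sq / n - mean ** 2)
print(mn, mx, mean, std)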
I need to read a huge text file with space-separated values with Python; values may wrap onto a new line arbitrarily. If there is a series of identical values, they are written as N*value, so a data file looks like this:
71*0.00000000 547.89916992 133*0.00000000 193.83299255 293.70559692 471.63159180
481.01190186 570.33251953 701.80609131 856.66857910 1037.59094238 1385.04602051
126*0.00000000 128.15339661 158.62240601 175.11259460 169.34010315 193.85580444
248.73970032 268.69958496 385.36999512 405.28909302 848.83941650 874.15411377
1034.21594238 1144.12500000 1378.19799805 935.37860107 1028.68994141 118*0.00000000
546.53228760 269.04510498 259.84869385 347.39208984 311.84609985 270.91180420
In this example, there are 71 zeros at the beginning of the data set, followed by the value 547.89..., then another 133 zeros, and so on. The resulting dataset is a flat 1D array.
The dimension (i.e. the total number of "expanded" values) is known.
The repeated values are not necessarily zeros; they just happen to be in this example.
So, is there an efficient way, maybe a Python library, to read this kind of dataset?
I tried a straightforward check of whether a value has an asterisk, and then populating a chunk of a pre-allocated numpy array with the value, but with tens of millions of points it is way too slow.
I thought about writing a C++ extension, but the project is used across multiple systems (both Linux and Windows) and it may be difficult to maintain, so I would prefer a pure Python solution (or at least an extra Python library, if one exists).
Thanks!
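For what it's worth, a pure-NumPy sketch of one way this run-length format could be expanded, assuming the whole file fits in memory as text (the file name is a placeholder; the per-token parsing is still a Python loop, but the expensive expansion step is done in C by np.repeat):

import numpy as np

tokens = open("data.txt").read().split()        # hypothetical file name
values = np.empty(len(tokens))
counts = np.ones(len(tokens), dtype=np.int64)
for i, tok in enumerate(tokens):                # parse "N*value" runs and plain values
    if '*' in tok:
        n, v = tok.split('*')
        counts[i] = int(n)
        values[i] = float(v)
    else:
        values[i] = float(tok)

expanded = np.repeat(values, counts)            # flat 1D array of the known total dimension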
The large file is 12 million lines of text such as this:
81.70, 89.86, 717.985
81.74, 89.86, 717.995
81.78, 89.86, 718.004
81.82, 89.86, 718.014
81.86, 89.86, 718.024
81.90, 89.86, 718.034
This is latitude, longitude, and distance from the nearest coastline (respectively).
My code uses coordinates of known places (for example, Mexico City: "-99.1, 19.4") and searches the large file, line by line, to output the distance from the nearest coastline for that coordinate.
I put each line into a list because many lines meet the long/lat criteria. I later average the distances from the coastline.
Each coordinate takes about 12 seconds to retrieve. My entire script takes 14 minutes to complete.
Here's what I have been using:
long = "-99.1"
lat = "19.4"
country_d2s = []

# outputs all list items with specified long and lat values
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O", 'r') as dist2sea:
    for line in dist2sea:
        if long in line and lat in line and line.startswith(long):
            country_d2s.append(line)
I am looking for a way to search through the file much quicker and/or rewrite the file to make it easier to work with.
Use a database with a key comprised of the latitude and longitude. If you're looking for a lightweight DB that can be shared as a file, there's SqliteDict or bsddb3. That would be much faster than reading a text file each time the program is run.
Import your data into an SQLite database, then create an index on (latitude, longitude). An index lookup should take milliseconds. To read the data, use Python's built-in sqlite3 module.
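A rough sketch of that setup, with illustrative file and column names (the columns follow the lat, lon, distance order of the sample lines):

import sqlite3

con = sqlite3.connect("dist2sea.db")
con.execute("CREATE TABLE IF NOT EXISTS dist (lat REAL, lon REAL, d REAL)")

with open("dist2sea.txt") as f:                 # one-off import of the 12-million-line file
    rows = (tuple(float(v) for v in line.split(",")) for line in f if line.strip())
    con.executemany("INSERT INTO dist VALUES (?, ?, ?)", rows)

con.execute("CREATE INDEX IF NOT EXISTS idx_lat_lon ON dist (lat, lon)")
con.commit()

# Each lookup then becomes an indexed range query instead of a full file scan.
cur = con.execute(
    "SELECT AVG(d) FROM dist WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
    (19.35, 19.45, -99.15, -99.05))
print(cur.fetchone()[0])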
Comments:
It's unclear whether you are relying on the fact that your query long/lat are XX.Y while the file values are XX.YY as some kind of fuzzy-matching technique.
I also cannot tell how you plan to execute this: load + [run] x 1000 vs [load + run] x 1000, which would inform the solution you want to use.
That being said, if you want to do very fast exact lookups, one option is to load the entire thing into memory as a mapping, e.g. {(long, lat): coast_distance, ...}. Since floats are not good keys, it would be better to use strings, integers, or fractions for this.
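A minimal sketch of that exact-lookup mapping, keyed on the coordinate strings exactly as they appear in the file (the file name is a placeholder):

coast = {}
with open("dist2sea.txt") as f:
    for line in f:
        lat, lon, d = (v.strip() for v in line.split(","))
        coast[(lat, lon)] = float(d)

print(coast.get(("19.40", "-99.10")))           # None if that exact pair isn't in the file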
If you want to do fuzzy matching, there are data structures (and a number of packages) that would solve that issue:
1D: https://pypi.org/project/intervaltree/
2D: https://pypi.org/project/Quadtree/
3+D: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree
If you want the initial load time to be faster you can do things like writing a binary pickle and loading that directly instead of parsing a file. A database is also a simple solution to this.
You could partition the file into 10-by-10-degree patches. This would reduce the search space by a factor of 648 (since 18 × 36 = 648), yielding 648 files with about 18,500 lines each, and would bring the search time down to about 0.02 seconds.
As you are doing exact matches of lat-long, you could instead use any on-disk key-value store; Python has at least one built in. If you were doing nearest-neighbor or metric-space searches, there are spatial databases that support those.
If you are using Python, I recommend PySpark.
In this particular case you can use the mapPartitions function and join the results.
This could help: How does the pyspark mapPartitions function work?
PySpark is useful for working with huge amounts of data because it makes N partitions and uses your processor's full power.
Hope it helps you.
I'm currently processing some ocean model outputs. At each time step, it has 42*1800*3600 grid points.
I found that the bottleneck in my program is the slicing and calling the xarray built-in method to extract the values. What's more interesting, the same syntax sometimes requires a vastly different amount of time.
ds = xarray.open_dataset(filename, decode_times=False)
vvel0=ds.VVEL.sel(lat=slice(-60,-20),lon=slice(0,40))/100 #in CCSM output, unit is cm/s convert to m/s
uvel0=ds.UVEL.sel(lat=slice(-60,-20),lon=slice(0,40))/100 ## why the speed is that different? now it's regional!!
temp0=ds.TEMP.sel(lat=slice(-60,-20),lon=slice(0,40)) #de
Take this for example: reading VVEL and UVEL took ~4 s each, while reading TEMP only needed ~6 ms. Without slicing, VVEL and UVEL took ~1 s, and TEMP needed 120 nanoseconds.
I always thought that when I only request part of the full array, I would need less memory and therefore less time. It turned out that xarray loads in the full array, and any extra slicing takes more time. But could somebody please explain why reading different variables from the same netCDF file takes such different amounts of time?
The program is designed to extract a stepwise section and calculate the cross-sectional heat transport, so I need to pick out either UVEL or VVEL and multiply it by TEMP along the section. So it may seem that loading TEMP that fast is a good thing, isn't it?
Unfortunately, that's not the case. When I loop through about ~250 grid points along the prescribed section...
# Calculate VT flux orthogonal to the chosen grid cells, which is the heat transport across GOODHOPE line
vtflux = []
utflux = []
vap = vtflux.append
uap = utflux.append
#for i in range(idx_north, idx_south+1):
for i in range(10):
    yidx = gh_yidx[i]
    xidx = gh_xidx[i]
    lon_next = ds_lon[i+1].values
    lon_current = ds_lon[i].values
    lat_next = ds_lat[i+1].values
    lat_current = ds_lat[i].values
    tt = np.squeeze(temp[:, yidx, xidx].values)  #<< calling .values is slow
    if (lon_next < lon_current) and (lat_next == lat_current):  # The condition is incorrect
        dxlon = Re*np.cos(lat_current*np.pi/180.)*0.1*np.pi/180.
        vv = np.squeeze(vvel[:, yidx, xidx].values)
        vt = vv*tt
        vtdxdz = np.dot(vt[~np.isnan(vt)], layerdp[0:len(vt[~np.isnan(vt)])])*dxlon
        vap(vtdxdz)
        #del vtdxdz
    elif (lon_next == lon_current) and (lat_next < lat_current):
        #ut = np.array(uvel[:,gh_yidx[i],gh_xidx[i]].squeeze().values*temp[:,gh_yidx[i],gh_xidx[i]].squeeze().values) # slow
        uu = np.squeeze(uvel[:, yidx, xidx]).values  # slow
        ut = uu*tt
        utdxdz = np.dot(ut[~np.isnan(ut)], layerdp[0:len(ut[~np.isnan(ut)])])*dxlat
        uap(utdxdz)  # m/s * degC * m * m ## looks fine, something wrong with the sign
        #del utdxdz
total_trans = (np.nansum(vtflux) - np.nansum(utflux))*3996*1026/1e15
Especially this line:
tt=np.squeeze(temp[:,yidx,xidx].values)
It takes ~3.65 s, but now it has to be repeated ~250 times. If I remove .values, this time reduces to ~4 ms. But I need to multiply tt with vv to get vt, so I have to extract the values. What's weird is that the similar expression, vv=np.squeeze(vvel[:,yidx,xidx].values), requires much less time, only about ~1.3 ms.
To summarize my questions:
Why does loading different variables from the same netCDF file take different amounts of time?
Is there a more efficient way to pick out a single column from a multidimensional array? (not necessarily the xarray structure, also a plain numpy.ndarray)
Why does extracting values from xarray structures take a different amount of time for the exact same syntax?
Thank you!
When you index a variable loaded from a netCDF file, xarray doesn't load it into memory immediately. Instead, we create a lazy array that supports any number of further deferred indexing operations. This is true even if you aren't using dask.array (triggered by setting chunks= in open_dataset or using open_mfdataset).
This explains the surprising performance you observe. Calculating temp0 is fast, because it doesn't load any data from disk. vvel0 is slow, because dividing by 100 requires loading the data into memory as a numpy array.
Later, it's slower to index temp0 because each operation loads data from disk, instead of indexing a numpy array already in memory.
The work-around is to explicitly load the portion of your dataset that you need into memory first, e.g., by writing temp0.load(). The netCDF section of the xarray docs also gives this tip.
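A sketch of that work-around applied to the variables above (the file name is a placeholder, and yidx/xidx come from the question's loop; VVEL and UVEL are already loaded as a side effect of the division by 100):

import numpy as np
import xarray

ds = xarray.open_dataset("model_output.nc", decode_times=False)

temp0 = ds.TEMP.sel(lat=slice(-60, -20), lon=slice(0, 40)).load()   # load the subset once
vvel0 = ds.VVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)) / 100
uvel0 = ds.UVEL.sel(lat=slice(-60, -20), lon=slice(0, 40)) / 100

# Indexing temp0 inside the loop now hits an in-memory numpy array instead of the file.
tt = np.squeeze(temp0[:, yidx, xidx].values)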
Problem
I have a large (> 500e6 rows) dataset that I've put into a pytables database.
Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row among the 500e6 rows that I'm trying to find.
As a starter I've done something like this:
index1 = th.cols.id.create_index()
index2 = th.cols.counts.create_index()

for row in th:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.readWhere(query)
    if len(result) > 1:
        print row
It's a brute force method I'll admit. Any suggestions on improvements?
update
current brute force runtime is 8421 minutes.
solution
Thanks for the input everyone. I managed to get the runtime down to 2364.7 seconds using the following method:
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.setOutput(th.cols.hash)
ex.eval()
indexrows = th.cols.hash.create_csindex(filters=filters)
ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))
I can generate a perfect hash because my maximum values are less than 2^16. I am effectively bit packing the two columns into a 32 bit int.
Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.
This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.
Two obvious techniques come to mind: hashing and sorting.
A) Define a hash function to combine ID and Counter into a single, compact value.
B) Count how often each hash code occurs.
C) Select from your data everything that has hash collisions (this should be a much smaller data set).
D) Sort this data set to find duplicates.
The hash function in A) needs to be chosen such that it fits into main memory while still providing enough selectivity. Maybe use two bitsets of 2^30 size or so for this. You can afford to have 5-10% collisions; this should still reduce the data set size enough to allow fast in-memory sorting afterwards.
This is essentially a Bloom filter.
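A rough sketch of that two-pass idea, using one packed bit array plus a small set of colliding codes rather than two bitsets (the sizes and the hash are illustrative, and th is assumed to be the pytables table from the question):

import numpy as np

NBITS = 1 << 30                                 # ~128 MB as a packed bit array
seen = np.zeros(NBITS // 8, dtype=np.uint8)

def hash_code(idval, count):
    return (int(idval) * 65536 + int(count)) % NBITS

collided = set()
for row in th:                                  # pass 1: record hash codes seen more than once
    h = hash_code(row['id'], row['counts'])
    byte, bit = divmod(h, 8)
    if seen[byte] & (1 << bit):
        collided.add(h)
    else:
        seen[byte] |= 1 << bit

candidates = []
for row in th:                                  # pass 2: keep only rows whose hash collided
    if hash_code(row['id'], row['counts']) in collided:
        candidates.append((int(row['id']), int(row['counts'])))

candidates.sort()                               # small enough to sort in memory
dups = [a for a, b in zip(candidates, candidates[1:]) if a == b]
print(dups)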
The brute force approach that you've taken appears to require you to execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is already supposedly built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.
I contend that the simple code you have written most likely does not yet take best advantage of the capabilities that pytables already makes available to you.
In the documentation for create_index(), it says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and you can decrease the entropy of your index to its minimum possible value (zero) either by choosing non-default values of optlevel=9 and kind='full', or equivalently, by generating the index with a call to create_csindex() instead. According to the documentation, you have to pay a little more upfront by taking a longer time to create a better optimized index to begin with, but then it pays you back later by saving you time on the series of queries that you have to repeat 500e6 times.
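In code, that would be something along these lines (a sketch; the column names follow the question):

# Fully sorted, zero-entropy indexes on both query columns.
th.cols.id.create_csindex()          # equivalent to create_index(optlevel=9, kind='full')
th.cols.counts.create_csindex()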
If optimizing your pytables column indices fails to speed up your code sufficiently, and you want to just simply perform a massive sort on all of the rows, and then just search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log(N)) time using relatively modest amounts of memory by sorting the data in chunks and then saving the chunks in temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
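For completeness, a sketch of such an external merge sort over the (id, counts) pairs, assuming chunks of CHUNK pairs fit comfortably in memory (the chunk size and the use of plain text temporary files are illustrative choices, not from the original answer):

import heapq
import itertools
import tempfile

CHUNK = 5_000_000

def sorted_chunk_files(pairs):
    files = []
    while True:
        chunk = sorted(itertools.islice(pairs, CHUNK))   # sort one chunk in memory
        if not chunk:
            break
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.writelines("%d %d\n" % p for p in chunk)       # spill the sorted chunk to disk
        f.seek(0)
        files.append(f)
    return files

def read_pairs(f):
    for line in f:
        a, b = line.split()
        yield int(a), int(b)

pairs = ((int(row['id']), int(row['counts'])) for row in th)
files = sorted_chunk_files(pairs)
merged = heapq.merge(*(read_pairs(f) for f in files))    # k-way merge of the sorted chunks
prev = None
for p in merged:                                         # duplicates are adjacent after the merge
    if p == prev:
        print("duplicate:", p)
    prev = p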
Right, I'm iterating through a large binary file
I need to minimise the time of this loop:
def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    blockReturn = np.zeros((num_samples, num_receivers, num_channels))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            for smpl in range(0, num_samples):
                r2_iq = np.fromfile(ReadFile.fid, np.int16, 2)
                blockReturn[smpl, rec, chl] = np.sqrt(math.fabs(r2_iq[0])*math.fabs(r2_iq[0]) + math.fabs(r2_iq[1])*math.fabs(r2_iq[1]))
    return blockReturn
So, what's going on is as follows:
r1 is the header of the file, dTypes.NB_HDR is a type I made:
NB_HDR= np.dtype([('f3',np.uint32),('f4',np.uint32),('f5',np.uint32),('f6',np.int32),('f7',np.int32),('f8',np.uint32)])
That gets all the information about the forthcoming data block, and nicely puts us in the right position within the file (the start of the data block!).
In this data block there is:
4096 samples per channel,
4 channels per receiver,
9 receivers.
So num_receivers, num_channels and num_samples will always be the same (at the moment, anyway), but as you can see this is a fairly large amount of data. Each 'sample' is a pair of int16 values whose magnitude I want to find (hence Pythagoras).
This NB2 code is executed for each 'Block' in the file; for a 12GB file (which is how big they are) there are about 20,900 Blocks, and I've got to iterate through 1000 of these files (so, 12TB overall). Any speed advantage, even if it's milliseconds, would be massively appreciated.
EDIT: Actually it might be of help to know how I'm moving around inside the file. I have a function as follows:
def navigateTo(self, blockNum, indexNum):
    ReadFile.fid.seek(ReadFile.fileIndex[blockNum][indexNum], 0)
    ReadFile.currentBlock = blockNum
    ReadFile.index = indexNum
Before I run all this code I scan the file and make a list of index locations at ReadFile.fileIndex that I browse using this function and then 'seek' to the absolute location - is this efficient?
Cheers
Because you know the length of a block after you read the header, read the whole block at once. Then reshape the array (very fast, it only affects metadata) and use the np.hypot ufunc:
blockData = np.fromfile(ReadFile.fid, np.int16, num_receivers*num_channels*num_samples*2)
blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
return np.hypot(blockData[:,:,:,0], blockData[:,:,:,1])
On my machine it runs in 11ms per block.
import numpy as np

def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    # first, match your array bounds to the way you are walking the file
    blockReturn = np.zeros((num_receivers, num_channels, num_samples))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            # second, read in all the samples at once if you have enough memory
            r2_iq = np.fromfile(ReadFile.fid, np.int16, 2*num_samples)
            r2_iq.shape = (-1, 2)  # tell numpy that it is an array of two values
            # widen the type so squaring int16 values cannot overflow
            r2_iq = r2_iq.astype(np.float64)
            # create dot product vector by squaring data elementwise, and then
            # adding those elements together. Result is of length num_samples
            r2_iq = r2_iq * r2_iq
            r2_iq = r2_iq[:, 0] + r2_iq[:, 1]
            # get the distance by performing the square root "into" blockReturn
            np.sqrt(r2_iq, out=blockReturn[rec, chl, :])
    return blockReturn
This should help your performance. There are two main ideas at work in numpy. First, your result array's dimensions should match how your loop dimensions are crafted, for memory locality.
Second, numpy is FAST. I've beaten hand-coded C with numpy, simply because it uses LAPACK and vector acceleration. However, to get that power you have to let it manipulate more data at a time. That is why your sample loop has been collapsed to read in the full set of samples for the receiver and channel in one large read. Then use the supreme vector powers of numpy to calculate your magnitude by dot product.
There is a little more optimization to be had in the magnitude calculation, but numpy recycles buffers for you, making it less important than you might think. I hope this helps!
I'd try to use as few loops and as many constants as possible. Everything that can be done in a linear fashion should be done so. If values don't change, use constants to reduce lookups and such, because that eats up CPU cycles.
This is from a theoretical point of view ;-)
If possible, use highly optimised libraries. I don't exactly know what you are trying to achieve, but I'd rather use an existing FFT lib than write one myself :>
One more thing: http://en.wikipedia.org/wiki/Big_O_notation (can be an eye-opener)
Most importantly, you shouldn't do file access at the lowest level of a triple nested loop, whether you do this in C or Python. You've got to read in large chunks of data at a time.
So to speed this up, read in large chunks of data at a time, and process that data using numpy indexing (that is, vectorize your code). This is particularly easy in your case since all your data is int16. Just read in big chunks of data, reshape the data into an array that reflects the (receiver, channel, sample) structure, and then use the appropriate indexing to multiply and add things for Pythagoras, and the 'sum' command to add up the terms in the resulting array.
This is more of an observation than a solution, but porting that function to C++ and loading it in with the Python API would get you a lot of speed gain to begin with before loop optimization.