For a certain task, I have too many repeated calls to a complex function, call it f(x) where x is a float. I do not have very large floats and not much precision is required, so I thought: why not use a lookup table for f(x), where x is a float16? The maximum size of the lookup table is 2**16. I was planning on making a small Python demo using np.float16. I am a bit stuck on how to iterate over the range of all floats. In C/C++, I would have used a uint16_t and kept incrementing it. How do I create this table using Python?
You can generate all the possible values using arange and then reinterpret the values as float16 values using view. Here is an example:
np.arange(65536, dtype=np.uint16).view(np.float16)
This should give you all possible float16 values. Note that many are NaN values.
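For a quick demo of how the table could be built and queried, here is a sketch with a stand-in f (the cheap formula below is just a placeholder for your real function):

import numpy as np

def f(x):
    # Stand-in for the expensive function.
    return 1.0 / (1.0 + x * x)

# All 65536 bit patterns, reinterpreted as float16 (this includes +/-inf and the NaNs).
all_f16 = np.arange(2**16, dtype=np.uint16).view(np.float16)
table = f(all_f16.astype(np.float64)).astype(np.float16)

def f_lookup(x):
    # Reinterpret the float16 bit pattern(s) of x as uint16 indices into the table.
    idx = np.asarray(x, dtype=np.float16).view(np.uint16)
    return table[idx]

print(f_lookup(1.5), f(1.5))

Whether this is actually faster than calling f directly depends on how expensive f is and how you batch the lookups, so it is worth benchmarking against the real function.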
Once I've retrieved data from MongoDB and loaded it into a Pandas dataframe, what is the recommended practice with regard to storing hexadecimal ObjectIDs?
I presume that, stored as strings, they take a lot of memory, which can be limiting in very large datasets. Is it a good idea to convert them to integers (from hex to dec)? Wouldn't that decrease memory usage and speed up processing (merges, lookups...)?
And BTW, here's how I'm doing it. Is this the best way? It unfortunately fails with NaN.
tank_hist['id'] = pd.to_numeric(tank_hist['id'].apply(lambda x: int(str(x), base=16)))
First of all, I think it's NaN because ObjectIDs are bigger than a 64-bit integer. Python can handle that, but the underlying pandas/numpy may not.
I think you want to use a map to extract some useful fields that you could later do multi-level sorting on. I'm not sure you'll see the performance improvement you're expecting, though.
I'd start by creating new "oid_*" series in your frame and checking your results.
https://docs.mongodb.com/manual/reference/method/ObjectId/
breaks down the object id into components with:
timestamp (4 bytes),
random (5 bytes - used to be host identifier), and
counter (3 bytes)
These are sizes that numpy handles comfortably as integers.
tank_hist['oid_timestamp'] = tank_hist['id'].map(lambda x: int(str(x)[:8], 16))
tank_hist['oid_random'] = tank_hist['id'].map(lambda x: int(str(x)[8:18], 16))
tank_hist['oid_counter'] = tank_hist['id'].map(lambda x: int(str(x)[18:], 16))
This would let you sort primarily on the timestamp series, secondarily on some other series in the frame, and third on the counter.
Maps are a super helpful (though slow) way to touch every record in your series. Realize that you are adding compute time here in exchange for saving compute time later.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
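For reference, here is roughly what that slicing does to a single id (the value below is just an illustrative 24-character hex string, not one of your real ids):

oid = "507f1f77bcf86cd799439011"   # illustrative 24-character hex ObjectId

timestamp = int(oid[:8], 16)    # first 4 bytes  -> 8 hex characters
random    = int(oid[8:18], 16)  # next 5 bytes   -> 10 hex characters
counter   = int(oid[18:], 16)   # last 3 bytes   -> 6 hex characters

print(timestamp, random, counter)

Each piece then fits comfortably in a 64-bit (or smaller) integer, unlike the full 12-byte id.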
I have some Fortran code that I would like to convert to Python. Said code makes extensive use of a 'type' data structure to describe data files that have long headers containing multiple variables, as well as sub-headers, and the actual data itself, which is stored in a five-dimensional array; the usable dimensions of the array are defined by other variables in the header. In Fortran I use an include file to define the type in each of the suite of programs I use.
In the below I've called the type 'SEQUENCE':
TYPE SEQUENCE
INTEGER NHEADER ! Number of items recorded from file header
CHARACTER*(256) FILE
CHARACTER*(8) UTC
CHARACTER*(10) UTDATE
CHARACTER*(24) OBJ_NAME
CHARACTER*(24) OBJ_CLASS
DOUBLE PRECISION MOD_TEMP
DOUBLE PRECISION MOD_FREQ
CHARACTER*(1) PMT
CHARACTER*(6) MOD
INTEGER FILNUM
CHARACTER*(24) FILTER
DOUBLE PRECISION MOD_AMP
INTEGER HT1
INTEGER GAIN1
INTEGER HT2
INTEGER GAIN2
DOUBLE PRECISION WP_DELTA
DOUBLE PRECISION WPA
DOUBLE PRECISION WPB
CHARACTER*(MAX_WP) WPSEQ
DOUBLE PRECISION ROTA
DOUBLE PRECISION ROTB
CHARACTER*(MAX_ROT) ROTSEQ
DOUBLE PRECISION TELESCOPE_PA
CHARACTER*(7) WAVEFORM
CHARACTER*(256) SKY_SUB
CHARACTER*(256) OS_SUB1
CHARACTER*(256) OS_SUB2
CHARACTER*(256) NOTES
INTEGER REPEATS
INTEGER WP_ROTATIONS
INTEGER ROTATIONS
INTEGER INTEGRATIONS
INTEGER CHANNELS
INTEGER POINTS
DOUBLE PRECISION ROT_POS(MAX_ROT)
DOUBLE PRECISION PRG_MOD_TEMP(MAX_REP,MAX_ROT,MAX_INT)
CHARACTER*(8) PRG_UTC(MAX_REP,MAX_ROT,MAX_INT)
DOUBLE PRECISION DAT(MAX_REP,MAX_ROT,MAX_INT,N_CH,MAX_PT)
DOUBLE PRECISION EXP_TIME
DOUBLE PRECISION INT_TIME
DOUBLE PRECISION TOTAL_SEQ_TIME
END TYPE
In case you're wondering, these files represent data acquired by an instrument that goes through a number of configurations for each data block. Some of the configurations are nominally equivalent; others measure an opposite state of the system.
As you can see there are a number of different data types in the structure, and times are currently being stored as strings, which is not ideal. Operations involving time are a total pain in Fortran; being able to deal with this more easily in Python is one of the motivations for switching.
In some of the programs I make arrays of multiple SEQUENCEs. In others I need to perform mathematical operations along particular combinations of the dimensions in SEQUENCE.DAT(). Being Fortran, this is typically done with lots of loops and if statements.
I'm moderately new to Python, so before I rewrite a lot of code I need to figure out the best way to do this, or else I'll be doing it again in a year.
Initially I hoped that pandas would provide the answer, but it seems you can't create panels of more than 3 dimensions, and the dataframes have to be all the same size. I don't really want to have to create my own classes from scratch as this seems like a lot of work to build functionality that might be more easily attained in another way. Should I be using records? Or something else?
What would you recommend? What advantages are there in terms of simplicity/ease of getting started and/or functionality?
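If you do go the records route, a numpy structured dtype is one way to mirror the Fortran type. The sketch below only covers a handful of the SEQUENCE fields, and the MAX_* sizes are placeholder values standing in for your parameters:

import numpy as np

# Placeholder sizes standing in for the Fortran MAX_* parameters.
MAX_REP, MAX_ROT, MAX_INT, N_CH, MAX_PT = 4, 8, 16, 2, 1024

# A structured dtype mirroring a few of the SEQUENCE fields (not all of them).
sequence_dtype = np.dtype([
    ('FILE',     'U256'),
    ('UTC',      'datetime64[ms]'),   # times as real datetimes rather than strings
    ('OBJ_NAME', 'U24'),
    ('MOD_TEMP', 'f8'),
    ('REPEATS',  'i4'),
    ('ROT_POS',  'f8', (MAX_ROT,)),
    ('DAT',      'f8', (MAX_REP, MAX_ROT, MAX_INT, N_CH, MAX_PT)),
])

# An array of several SEQUENCE records, analogous to the Fortran arrays of the type.
sequences = np.zeros(3, dtype=sequence_dtype)
sequences[0]['OBJ_NAME'] = 'example'
print(sequences['DAT'].shape)   # (3, MAX_REP, MAX_ROT, MAX_INT, N_CH, MAX_PT)

Slicing sequences['DAT'] along any combination of those axes then replaces a lot of the Fortran loops; a plain Python class holding separate numpy arrays would be the other common route.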
I've written a script that gives me the result of dividing two variables ("A" and "B"), and the output of each variable is a numpy array with 26 elements. Usually, with any two elements from "A" and "B," the result of the operation is a float, and the element in the output array that corresponds to that operation shows up as a float. But strangely, even if the output is supposed to be an integer (almost always 0 or 1), the integer will show up as "0." or "1." in the output array. Is there any way to turn these specific elements of the array back into integers, rather than keep them as floats?
I'd like to write a simple if statement that will convert any output elements that are supposed to be integers back into integers (i.e., make "0." into "0"). But I'm having some trouble with that. Any ideas?
You will probably want to read about data types:
http://docs.scipy.org/doc/numpy/user/basics.types.html
An entire numpy array has a single datatype. For the operation you are doing, it would not make sense to ask that A/B sometimes be integer and sometimes be float: the division of two float arrays is a float array.
Complication: it is possible to specify mixed-type arrays:
http://docs.scipy.org/doc/numpy/user/basics.rec.html#structured-arrays
The strength of numpy arrays is that many low-level operations can be performed quickly on the data because most (not all) types used by these arrays have a fixed size in memory. For instance, the floats you are using probably require 8 bytes each. The most important thing in that case is that all data share the same type and fit in the same amount of memory. You can play around with that a little if you really want (and need) to, but I would not suggest starting with such special cases. Try to learn the strength of these arrays when used with this requirement (which means accepting that you can't mix integers and floats in the same array).
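As a small illustration of that dtype behavior (the arrays here are just made-up examples):

import numpy as np

A = np.array([2.0, 3.0, 5.0])
B = np.array([2.0, 1.5, 2.0])

result = A / B
print(result)        # [1.  2.  2.5]  -- whole-number results still print with a trailing dot
print(result.dtype)  # float64: the whole array shares a single dtype

# If you really want integers where the result happens to be whole, you need a
# separate integer array; you cannot mix dtypes inside one plain ndarray.
whole = result == np.floor(result)
print(result[whole].astype(int))   # [1 2]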
I want to use BDB as a time-series data store, and am planning to use microseconds since the epoch as the key values. I am using BTREE as the data store type.
However, when I try to store integer keys, bsddb3 gives an error saying TypeError: Integer keys only allowed for Recno and Queue DB's.
What is the best workaround? I can store them as strings, but that will probably make it unnecessarily slower.
Given that BDB itself can handle any kind of data, why is there a restriction? Can I sort of hack the bsddb3 implementation? Has anyone used any other methods?
You can't store integers directly, since bsddb doesn't know how to represent integers or which kind of integer it is.
If you convert your integer to a string you will break the lexicographic ordering of keys of bsddb: 10 > 2 but as strings "10" < "2".
You have to use Python's struct to convert your integers into a string (or, in Python 3, into bytes) and then store them in bsddb. You have to use big-endian packing or the ordering will not be correct.
Then you can use bsddb's Cursor.set_range(key) to query for information in a given slice of time.
For instance, Cursor.set_range(struct.pack('>Q', 123456789)) will set the cursor at the key of the event happening at 123456789, or the first that happens after.
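Here is a minimal sketch of the packing idea, assuming microseconds-since-epoch keys; the helper name is just for illustration:

import struct

def make_key(microseconds):
    # Big-endian unsigned 64-bit: byte order then matches numeric order.
    return struct.pack('>Q', microseconds)

# Lexicographic order of the packed keys follows numeric order of the timestamps.
timestamps = [2, 10, 123456789]
keys = [make_key(t) for t in timestamps]
assert sorted(keys) == keys

# With bsddb3 you would then position a BTREE cursor with something like
# cursor.set_range(make_key(t0)) to find the first record at or after time t0.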
Well, there's no real workaround, but you can use two approaches:
Store the integers as strings using str or repr. If the ints are big, you can even use string formatting.
Use the cPickle/pickle module to store and retrieve data. This is a good way if you have data types other than the basic types. For basic ints and floats this is actually slower and takes more space than just storing strings.
I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, and standard deviation, as well as the 25th, 50th, and 75th percentiles.
Normally I would use the attached code, but I need a more efficient way to compute these metrics because I cannot store every value p in a list. How can I calculate these values more effectively in Python?
import numpy as np
np.average(values)
np.min(values)
np.max(values)
np.std(values)
np.percentile(values, 25)
np.percentile(values, 50)
np.percentile(values, 75)
maxx = float('-inf')
minx = float('+inf')
sumz = 0.0
count = 0
for line in open("foo.txt", "r"):
    x = float(line)           # convert once per line
    maxx = max(maxx, x)
    minx = min(minx, x)
    sumz += x
    count += 1
my_max = maxx
my_min = minx
my_avg = sumz / count
Use a binary file. Then you can use numpy.memmap to map it into memory and perform all sorts of algorithms, even if the dataset is larger than RAM.
You can even use numpy.memmap to create a memory-mapped array and read your data in from the text file... you can work on it, and when you are done, you also have the data in binary format.
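A rough sketch of that approach (the file names, dtype, and chunk size below are assumptions):

import numpy as np
from itertools import islice

CHUNK = 1000000

# One-time conversion: stream the text file into a flat binary file of float64 values.
n = 0
with open("foo.txt") as src, open("foo.bin", "wb") as dst:
    while True:
        lines = list(islice(src, CHUNK))
        if not lines:
            break
        block = np.array(lines, dtype=np.float64)
        block.tofile(dst)
        n += len(block)

# Memory-map the binary file; numpy pages data in on demand instead of loading it all.
data = np.memmap("foo.bin", dtype=np.float64, mode="r", shape=(n,))
print(data.min(), data.max(), data.mean(), data.std())
print(np.percentile(data, [25, 50, 75]))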
I think you are on the right track, iterating over the file and keeping track of the max and min values. To calculate the std, you should keep a sum of squares inside the loop: sum_of_squares += x**2. You can then calculate std = sqrt(sum_of_squares / n - (sumz / n)**2) after the loop; see the formula here (but note this formula can suffer from numerical problems). For performance, you might want to iterate over the file in decent-sized chunks of data.
To calculate the median and percentiles in a 'continuous' way, you could build up a histogram inside your loop. After the loop, you can get approximate percentiles and the median by converting the histogram to the CDF; the error will depend on the number of bins.
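Here is a sketch of that streaming idea; the bin range and bin count are assumptions you would tune to your data:

import numpy as np

NBINS, LO, HI = 10000, -1e6, 1e6              # assumed range of the data; adjust to yours
edges = np.linspace(LO, HI, NBINS + 1)
hist = np.zeros(NBINS, dtype=np.int64)

n = 0
total = 0.0
total_sq = 0.0
with open("foo.txt") as f:
    for line in f:
        x = float(line)
        n += 1
        total += x
        total_sq += x * x
        # Drop the value into its histogram bin (clipped so outliers stay in range).
        b = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, NBINS - 1)
        hist[b] += 1

mean = total / n
std = np.sqrt(total_sq / n - mean ** 2)       # naive formula; can be numerically unstable

# Approximate percentiles and median from the histogram's CDF.
cdf = np.cumsum(hist) / n
centers = (edges[:-1] + edges[1:]) / 2
p25, p50, p75 = (centers[np.searchsorted(cdf, q)] for q in (0.25, 0.50, 0.75))
print(mean, std, p25, p50, p75)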
As Antti Haapala says, the easiest and most efficient way to do this will be to stick with numpy, and just use a memmapped binary file instead of a text file. Yes, converting from one format to the other will take a bit of time, but it'll almost certainly save more time than it costs (because you can use numpy vectorized operations instead of loops), and it will also make your code a lot simpler.
If you can't do that, Python 3.4 will come with a statistics module. A backport to 2.6+ will hopefully be available at some point after the PEP is finalized; at present I believe you can only get stats, the earlier module it's based on, which requires 3.1+. Unfortunately, while stats does do single-pass algorithms on iterators, it doesn't have any convenient way to run multiple algorithms in parallel on the same iterator, so you have to be clever with itertools.tee and zip to force it to interleave the work instead of pulling the whole thing into memory.
And of course there are plenty of other modules out there if you search PyPI for "stats" and/or "statistics" and/or "statistical".
Either way, using a pre-built module will mean someone's already debugged all the problems you're going to run into, and they may have also optimized the code (maybe even ported it to C) to boot.
To get the percentiles, sort the text file using a command-line program. Use the line count (index in your program) to find the line numbers of the percentiles (index // 4, etc.), then retrieve those lines from the file.
Most of these operations can be expressed easily in terms of simple arithmetic. In that case, it can actually (surprisingly) be quite efficient to process simple statistics directly from the Linux command line using awk and sed, e.g. as in this post: < http://www.unixcl.com/2008/09/sum-of-and-group-by-using-awk.html >.
If you need to generalize to more advanced operations, like weighted percentiles, then I'd recommend using Python Pandas (notably the HDFStore capabilities for later retrieval). I've used Pandas with a DataFrame of over 25 million records before (10 columns by 25 million distinct rows). If you're more memory constrained, you could read the data in chunks, calculate partial contributions from each chunk, and store out intermediate results, then finish off the calculation by just loading the intermediate results, in a sort of serialized map-reduce framework.
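A rough sketch of the chunked pandas approach (the file name, column name, and chunk size are assumptions):

import numpy as np
import pandas as pd

count = 0
total = 0.0
total_sq = 0.0
current_min = np.inf
current_max = -np.inf

# Stream the text file in manageable chunks instead of loading all 65M values at once.
for chunk in pd.read_csv("foo.txt", header=None, names=["value"], chunksize=1000000):
    v = chunk["value"]
    count += len(v)
    total += v.sum()
    total_sq += (v ** 2).sum()
    current_min = min(current_min, v.min())
    current_max = max(current_max, v.max())

mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)   # same naive variance formula noted above
print(current_min, current_max, mean, std)
# Exact percentiles would still need a second pass (or the histogram trick above).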