Pandas Group 2-D NumPy Data by Range of Values - python

I have a large data set in the form of a 2D array. The 2D array represents continuous intensity data, and I want to use it to create another 2D array of the same size, only this time with the values grouped into discrete levels. In other words, if I have a 2D array like this,
[(11, 23, 33, 12),
(21, 31, 13, 19),
(33, 22, 26, 31)]
The output would be as shown below, with values from 10 to 19 assigned to 1, 20 to 29 assigned to 2, and 30 to 39 assigned to 3.
[(1, 2, 3, 1),
(2, 3, 1, 1),
(3, 2, 2, 3)]
More ideally, I would like to make these assignments based on percentiles. That is, values that fall into the top ten percent get assigned 5, values in the top 20 percent get 4, and so on.
My data set is in NumPy format. I have looked at the pandas groupby function, but it does not seem to let me specify ranges. I have also looked at cut, but cut only works on 1-D arrays. I have considered running cut in a loop over each row of the data, but I am concerned that this may take too much time; my matrices could be as big as 4000 rows by 4000 columns.

You need to stack the dataframe to get a 1-D representation, then apply cut. After that you can unstack it.
[tuple(x) for x in (pd.cut(pd.DataFrame(a).stack(), bins=[10,20,30,40], labels=False)+1).unstack().values]
Or (using @user3483203's magic):
[tuple(x) for x in np.searchsorted([10, 20, 30, 40], np.array(a))]
Output:
[(1, 2, 3, 1),
(2, 3, 1, 1),
(3, 2, 2, 3)]
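For the percentile-based grouping mentioned at the end of the question, the same searchsorted trick works once you compute the bin edges with np.percentile. A minimal sketch, assuming five equal 20% bands (adjust the percentile list for uneven bands such as a separate top-10% bin):
import numpy as np

a = np.random.rand(4000, 4000)  # continuous intensity data

# Edges at the 20th, 40th, 60th and 80th percentiles of the data.
edges = np.percentile(a, [20, 40, 60, 80])

# Values below the 20th percentile map to 1, ..., above the 80th to 5.
ranks = np.searchsorted(edges, a) + 1
pd.qcut(pd.DataFrame(a).stack(), 5, labels=False) would do the same via pandas, at the cost of the stack/unstack round trip.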

Related

Python: more efficient data structure than a nested dictionary of dictionaries of arrays?

I'm writing a python-3.10 program that predicts time series of various properties for a large number of objects. My current choice of data structure for collecting results internally in the code and then for writing to files is a nested dictionary of dictionaries of arrays. For example, for two objects with time series of 3 properties:
properties = {'obj1':{'time':np.arange(10),'x':np.random.randn(10),'vx':np.random.randn(10)},
'obj2': {'time':np.arange(15),'x':np.random.randn(15),'vx':np.random.randn(15)}}
The reason I like this nested dictionary format is because it is intuitive to access -- the outer key is the object name, and the inner keys are the property names. The elements corresponding to each of the inner keys are numpy arrays giving the value of some property as a function of time. My actual code generates a dict of ~100,000s of objects (outer keys) each having ~100 properties (inner keys) recorded at ~1000 times (numpy float arrays).
I have noticed that when I do np.savez('filename.npz',**properties) on my own huge properties dictionary (or subsets of it), it takes a while and the output file sizes are a few GB (probably because np.savez is calling pickle under the hood since my nested dict is not an array).
Is there a more efficient data structure widely applicable for my use case? Is it worth switching from my nested dict to pandas dataframes, numpy ndarrays or record arrays, or a list of some kind of Table-like objects? It would be nice to be able to save/load the file in a binary output format that preserves the mapping from object names to their dict/array/table/dataframe of properties, and of course the names of each of the property time series arrays.
Let's look at your obj2 value, a dict:
In [307]: dd={'time':np.arange(15),'x':np.random.randn(15),'vx':np.random.randn(15)}
In [308]: dd
Out[308]:
{'time': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]),
'x': array([-0.48197915, 0.15597792, 0.44113401, 1.38062753, -1.21273378,
-1.27120008, 1.53072667, 1.9799255 , 0.13647925, -1.37056793,
-2.06470784, 0.92314969, 0.30885371, 0.64860014, 1.30273519]),
'vx': array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
1.86391024, 1.006901 , -0.16168439, 1.5180135 , -1.16436363,
-0.20254291, -1.60280149, -1.91749387, 0.25366602, -1.61993012])}
It's easy to make a dataframe from that:
In [309]: df = pd.DataFrame(dd)
In [310]: df
Out[310]:
time x vx
0 0 -0.481979 -1.602281
1 1 0.155978 -1.491630
2 2 0.441134 -1.170610
3 3 1.380628 -0.092675
4 4 -1.212734 -0.941331
5 5 -1.271200 1.863910
6 6 1.530727 1.006901
7 7 1.979926 -0.161684
8 8 0.136479 1.518014
9 9 -1.370568 -1.164364
10 10 -2.064708 -0.202543
11 11 0.923150 -1.602801
12 12 0.308854 -1.917494
13 13 0.648600 0.253666
14 14 1.302735 -1.619930
We could also make a structured array from that frame. I could also make the array directly from your dict, defining the same compound dtype, but since I already have the frame, I'll go this route. The distinction between a structured array and a recarray is minor.
In [312]: arr = df.to_records()
In [313]: arr
Out[313]:
rec.array([( 0, 0, -0.48197915, -1.60228105),
( 1, 1, 0.15597792, -1.49163002),
( 2, 2, 0.44113401, -1.17061046),
( 3, 3, 1.38062753, -0.09267467),
( 4, 4, -1.21273378, -0.94133092),
( 5, 5, -1.27120008, 1.86391024),
( 6, 6, 1.53072667, 1.006901 ),
( 7, 7, 1.9799255 , -0.16168439),
( 8, 8, 0.13647925, 1.5180135 ),
( 9, 9, -1.37056793, -1.16436363),
(10, 10, -2.06470784, -0.20254291),
(11, 11, 0.92314969, -1.60280149),
(12, 12, 0.30885371, -1.91749387),
(13, 13, 0.64860014, 0.25366602),
(14, 14, 1.30273519, -1.61993012)],
dtype=[('index', '<i8'), ('time', '<i4'), ('x', '<f8'), ('vx', '<f8')])
Now let's compare the pickle strings:
In [314]: import pickle
In [315]: len(pickle.dumps(dd))
Out[315]: 561
In [316]: len(pickle.dumps(df)) # df.to_pickle makes a 1079 byte file
Out[316]: 1052
In [317]: len(pickle.dumps(arr)) # arr.nbytes is 420
Out[317]: 738 # np.save writes a 612 byte file
And another encoding, a list:
In [318]: alist = list(dd.items())
In [319]: alist
Out[319]:
[('time', array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])),
('x',
array([-0.48197915, 0.15597792, 0.44113401, 1.38062753, -1.21273378,
-1.27120008, 1.53072667, 1.9799255 , 0.13647925, -1.37056793,
-2.06470784, 0.92314969, 0.30885371, 0.64860014, 1.30273519])),
('vx',
array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
1.86391024, 1.006901 , -0.16168439, 1.5180135 , -1.16436363,
-0.20254291, -1.60280149, -1.91749387, 0.25366602, -1.61993012]))]
In [320]: len(pickle.dumps(alist))
Out[320]: 567
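If you want to keep np.savez but avoid the pickle fallback entirely, one option is to flatten the two key levels into a single compound key per array, so that every value passed in is a plain ndarray. A sketch under that assumption; the '/' separator is my own choice, so pick any character that cannot appear in your object or property names:
import numpy as np

properties = {
    'obj1': {'time': np.arange(10), 'x': np.random.randn(10)},
    'obj2': {'time': np.arange(15), 'x': np.random.randn(15)},
}

# Flatten {'obj1': {'time': ...}} into {'obj1/time': ...}.
flat = {f'{obj}/{prop}': arr
        for obj, inner in properties.items()
        for prop, arr in inner.items()}
np.savez_compressed('properties.npz', **flat)

# Rebuild the nested dict on load.
with np.load('properties.npz') as data:
    loaded = {}
    for key in data.files:
        obj, prop = key.split('/', 1)
        loaded.setdefault(obj, {})[prop] = data[key]
Since each value is now a raw array, the .npz entries are stored as plain .npy records rather than pickles, and np.savez_compressed can shrink them further.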

Get top 5 max reoccurring values in a Python list? [duplicate]

This question already has answers here:
How to find most common elements of a list? [duplicate]
(11 answers)
How do I count the occurrences of a list item?
(29 answers)
Closed 4 years ago.
Let's say I have a list like so:
list = [0,0,1,1,1,1,1,1,1,1,3,3,3,3,5,9,9,9,9,9,9,22,22,22,22,22,22,22,22,22,22,45]
The top 5 reoccurring values would be:
22, 1, 9, 3, and 0.
What is the best way to get these values, as well as the number of times they reoccur? I was thinking of pushing the values into a new list, so that I get something like:
new_list = [22,10, 1,8, 9,6, 3,4, 0,2]
With the list value being the odd index entry, and the reoccurred value being the even index entry.
EDIT: What is the simplest way to do this without using a library?
Use collections.Counter:
from collections import Counter
l = [0,0,1,1,1,1,1,1,1,1,3,3,3,3,5,9,9,9,9,9,9,22,22,22,22,22,22,22,22,22,22,45]
print(Counter(l).most_common())
[(22, 10), (1, 8), (9, 6), (3, 4), (0, 2), (5, 1), (45, 1)]
You feed it an iterable and it counts it for you. The resulting Counter maps each counted value to how often it occurred (i.e. 22 was counted 10 times).
Docs: collections.Counter(iterable)
Side note:
Don't name variables after types or built-ins; you shadow them and run into problems later. Never name anything
list, tuple, dict, set, max, min, abs, ...
See: https://docs.python.org/3/library/functions.html
Use collections.Counter from the standard library.
import collections
list = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 5, 9, 9, 9, 9, 9, 9, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 45]
ctr = collections.Counter(list)
print(ctr.most_common(5))
outputs
[
(22, 10),
(1, 8),
(9, 6),
(3, 4),
(0, 2),
]
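For the edit asking to do this without any library (not even collections), a minimal sketch with a plain dict:
data = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 5, 9, 9, 9, 9, 9, 9,
        22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 45]

# Tally occurrences by hand.
counts = {}
for value in data:
    counts[value] = counts.get(value, 0) + 1

# Sort by count, descending, and keep the top 5.
top5 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top5)  # [(22, 10), (1, 8), (9, 6), (3, 4), (0, 2)]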

Apply function to an array of tuples

I have a function that I would like to apply to an array of tuples and I am wondering if there is a clean way to do it.
Normally, I could use np.vectorize to apply the function to each item in the array; however, in this case "each item" is a tuple, so numpy interprets the array as a 3-D array and applies the function to each item within the tuples.
So I can assume that the incoming array is one of:
tuple
1 dimensional array of tuples
2 dimensional array of tuples
I can probably write some looping logic but it seems like numpy most likely has something that does this more efficiently and I don't want to reinvent the wheel.
This is an example. I am trying to apply the tuple_converter function to each tuple in the array.
array_of_tuples1 = np.array([
    [(1,2,3),(2,3,4),(5,6,7)],
    [(7,2,3),(2,6,4),(5,6,6)],
    [(8,2,3),(2,5,4),(7,6,7)],
])
array_of_tuples2 = np.array([
    (1,2,3),(2,3,4),(5,6,7),
])
plain_tuple = (1,2,3)

# Convert each set of tuples
def tuple_converter(tup):
    return tup[0]**2 + tup[1] + tup[2]

# Vectorizing applies the formula to each integer rather than each tuple
tuple_converter_vectorized = np.vectorize(tuple_converter)
print(tuple_converter_vectorized(array_of_tuples1))
print(tuple_converter_vectorized(array_of_tuples2))
print(tuple_converter_vectorized(plain_tuple))
Desired Output for array_of_tuples1:
[[ 6 11 38]
[54 14 37]
[69 13 62]]
Desired Output for array_of_tuples2:
[ 6 11 38]
Desired Output for plain_tuple:
6
But the code above produces this error, because it is trying to apply the function to an integer rather than a tuple:
<ipython-input-209-fdf78c6f4b13> in tuple_converter(tup)
10
11 def tuple_converter(tup):
---> 12 return tup[0]**2 + tup[1] + tup[2]
13
14
IndexError: invalid index to scalar variable.
array_of_tuples1 and array_of_tuples2 are not actually arrays of tuples, but just 3- and 2-dimensional arrays of integers:
In [1]: array_of_tuples1 = np.array([
...: [(1,2,3),(2,3,4),(5,6,7)],
...: [(7,2,3),(2,6,4),(5,6,6)],
...: [(8,2,3),(2,5,4),(7,6,7)],
...: ])
In [2]: array_of_tuples1
Out[2]:
array([[[1, 2, 3],
[2, 3, 4],
[5, 6, 7]],
[[7, 2, 3],
[2, 6, 4],
[5, 6, 6]],
[[8, 2, 3],
[2, 5, 4],
[7, 6, 7]]])
So, instead of vectorizing your function (which will basically for-loop through the elements of the array, i.e. the integers), you should apply it along the suitable axis (the axis of the "tuples") and not care about the type of the sequence:
In [6]: np.apply_along_axis(tuple_converter, 2, array_of_tuples1)
Out[6]:
array([[ 6, 11, 38],
[54, 14, 37],
[69, 13, 62]])
In [9]: np.apply_along_axis(tuple_converter, 1, array_of_tuples2)
Out[9]: array([ 6, 11, 38])
The other answer above is certainly correct, and probably what you're looking for. But I noticed you put the word "clean" into your question, and so I'd like to add this answer as well.
If we can assume that all the tuples are 3-element tuples (or that they have some constant number of elements), then there's a nice little trick so that the same piece of code works on any single tuple, 1-D array of tuples, or 2-D array of tuples without an if/else for the 1-D/2-D cases. I'd argue that avoiding switches is always cleaner (although I suppose this could be contested).
import numpy as np

def map_to_tuples(x):
    x = np.array(x)
    flattened = x.flatten().reshape(-1, 3)
    return np.array([tup[0]**2 + tup[1] + tup[2]
                     for tup in flattened]).reshape(x.shape[:-1])
Outputs the following for your inputs (respectively), as desired:
[[ 6 11 38]
[54 14 37]
[69 13 62]]
[ 6 11 38]
6
If you are serious about the tuples bit, you could define a structured dtype.
In [535]: dt=np.dtype('int,int,int')
In [536]: x1 = np.array([
   .....:     [(1,2,3),(2,3,4),(5,6,7)],
   .....:     [(7,2,3),(2,6,4),(5,6,6)],
   .....:     [(8,2,3),(2,5,4),(7,6,7)],
   .....:     ], dtype=dt)
In [537]: x1
Out[537]:
array([[(1, 2, 3), (2, 3, 4), (5, 6, 7)],
[(7, 2, 3), (2, 6, 4), (5, 6, 6)],
[(8, 2, 3), (2, 5, 4), (7, 6, 7)]],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
Note that the display uses tuples. x1 is a 3x3 array of type dt. The elements, or records, are displayed as tuples. This is more useful when the tuple elements differ: float, integer, string, etc.
Now define a function that works with fields of such an array:
In [538]: def foo(tup):
   .....:     return tup['f0']**2 + tup['f1'] + tup['f2']
It applies neatly to x1.
In [539]: foo(x1)
Out[539]:
array([[ 6, 11, 38],
[54, 14, 37],
[69, 13, 62]])
It also applies to a 1d array of the same dtype.
In [540]: x2=np.array([(1,2,3),(2,3,4),(5,6,7) ],dtype=dt)
In [541]: foo(x2)
Out[541]: array([ 6, 11, 38])
And a 0d array of matching type:
In [542]: foo(np.array(plain_tuple,dtype=dt))
Out[542]: 6
But foo(plain_tuple) won't work, since the function is written to work with named fields, not indexed ones.
The function could be modified to cast the input to the correct dtype if needed:
In [545]: def foo1(tup):
   .....:     temp = np.asarray(tup, dtype=dt)
   .....:     return temp['f0']**2 + temp['f1'] + temp['f2']
In [548]: plain_tuple
Out[548]: (1, 2, 3)
In [549]: foo1(plain_tuple)
Out[549]: 6
In [554]: foo1([(1,2,3),(2,3,4),(5,6,7)]) # list of tuples
Out[554]: array([ 6, 11, 38])
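If you later need the plain 2-D integer view of such a structured array, numpy.lib.recfunctions.structured_to_unstructured (available since NumPy 1.16) does the conversion; a brief sketch:
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

dt = np.dtype('int,int,int')
x2 = np.array([(1, 2, 3), (2, 3, 4), (5, 6, 7)], dtype=dt)

plain = structured_to_unstructured(x2)  # shape (3, 3), plain integers
print(plain[:, 0]**2 + plain[:, 1] + plain[:, 2])  # [ 6 11 38]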

What does "dimensionality" mean for a numpy array?

I am still new to scikit-learn and numpy.
I read the tutorial, but I can't understand how they define array dimensions.
In the following example:
>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
The array has five variables in each row, so I expect it to have 5 dimensions.
Why is a.ndim equal to 2?
Given that you're using scikit-learn, I'll explain this in the context of machine learning, as it may make more sense...
Your feature matrix (which I assume is what you're talking about here) is typically going to be 2-dimensional (hence ndim = 2), because you have rows (which occupy one dimension) and columns (which occupy a second dimension).
In machine learning cases, I typically think of the rows as the samples and columns as the features.
Note, however, that each dimension can have multiple entries (e.g. you will have multiple samples/rows, and multiple columns/features). This tells you the size along that dimension.
So in your case:
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
You have one dimension that has a length/size of 3. And a second dimension that has 5 entries. You can think of this as a feature matrix containing 3 samples and 5 features/variables, for example.
All in all, you have 2 dimensions (ndim = 2), but the specific size of the array is represented by the shape tuple, which tells you how large each of the 2 dimensions are.
Furthermore, an array of shape (3, 5, 2) would have 3 dimensions, with the third dimension holding 2 values.
I think the key here, at least in the 2-dimensional case, is not to think of it as nested lists or nested vectors (which is what it looks like when you consider the []), but to just think of it as a table with rows and columns. The shape tuple and ndim will make more sense when you think of the data structure that way.
The number of dimensions is simply the length of the a.shape tuple.
The shape of the ndarray is (3, 5) since it has 3 rows and 5 columns. That is exactly what you were trying to find, isn't it?
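A quick check of that relationship:
import numpy as np

a = np.arange(15).reshape(3, 5)
print(a.ndim, len(a.shape))  # 2 2 -- ndim is just the length of shape

b = np.arange(30).reshape(3, 5, 2)  # add a third axis of size 2
print(b.ndim, b.shape)  # 3 (3, 5, 2)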
I used to confuse frames and arrays. An array is characterized by how you locate a number, i.e. how many index steps it takes; in this case you locate the row and then the column. In a frame, each instance owns a row, and the columns describe its attributes, also called variables, or "dimensions" in the statistical sense. So a 2-dimensional array can represent instances with many statistical "dimensions".
Array
1 4 5
2 3 6
It takes two steps to find a number; you can locate it as a[i][j].
But in a frame:
Length Height Weight
1 23 34 56
2 89 87 63
This is a frame; statistically it describes 3 "dimensions" (the three attributes), but as a data structure it is a 2-dimensional table, not a 3-dimensional array.

Iterate over numpy array in a specific order based on values

I want to iterate over a numpy array, starting at the index of the highest value and working through to the lowest value.
import numpy as np  # imports numpy package

elevation_array = np.random.rand(5, 5)  # creates a random 5-by-5 array
print(elevation_array)  # prints the array out

ravel_array = np.ravel(elevation_array)
sorted_array_x = np.argsort(ravel_array)
sorted_array_y = np.argsort(sorted_array_x)
sorted_array = sorted_array_y.reshape(elevation_array.shape)

for index, rank in np.ndenumerate(sorted_array):
    print(index, rank)
I want it to print out:
index of the highest value
index of the next highest value
index of the next highest value etc
If you want numpy doing the heavy lifting, you can do something like this:
>>> a = np.random.rand(100, 100)
>>> sort_idx = np.argsort(a, axis=None)
>>> np.column_stack(np.unravel_index(sort_idx[::-1], a.shape))
array([[13, 62],
[26, 77],
[81, 4],
...,
[83, 40],
[17, 34],
[54, 91]], dtype=int64)
You first get an index that sorts the whole array, and then convert that flat index into pairs of indices with np.unravel_index. The call to np.column_stack simply joins the two arrays of coordinates into a single one, and could be replaced by zip(*np.unravel_index(sort_idx[::-1], a.shape)) (wrap it in list() on Python 3) to get a list of tuples instead of an array.
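Putting that together as the requested iteration, highest value first (a usage sketch):
import numpy as np

a = np.random.rand(5, 5)
sort_idx = np.argsort(a, axis=None)  # flat indices, ascending order

for flat in sort_idx[::-1]:  # walk from highest to lowest
    index = np.unravel_index(flat, a.shape)
    print(index, a[index])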
Try this:
>>> from operator import itemgetter
>>> a = np.array([[2, 7], [1, 4]])
>>> a
array([[2, 7],
       [1, 4]])
>>> sorted(np.ndenumerate(a), key=itemgetter(1), reverse=True)
[((0, 1), 7),
((1, 1), 4),
((0, 0), 2),
((1, 0), 1)]
You can iterate over this list if you wish. Essentially, I am telling sorted to order the elements of np.ndenumerate(a) according to the key itemgetter(1). itemgetter(1) picks out the second element (index 1), i.e. the value, from each of the tuples ((0, 1), 7), ((1, 1), 4), ... generated by np.ndenumerate(a).
