Python - Selecting pairs of objects from a data frame

I have a data frame that contains information about the positions of various objects, and a unique index for each object (index in this case is not related to the data frame). Here is some example data:
                ind  pos
x    y   z
-1.0 7.0 0.0    21   [-2.76788330078, 217.786453247, 26.6822681427]
         0.0    22   [-7.23852539062, 217.274139404, 26.6758270264]
     0.0 1.0    152  [-0.868591308594, 2.48404550552, 48.4036369324]
     6.0 2.0    427  [-0.304443359375, 182.772140503, 79.4475860596]
The actual data frame is quite long. I have written a function that takes two vectors as inputs and outputs the distance between them:
def dist(a, b):
    diff = N.array(a) - N.array(b)
    d = N.sqrt(N.dot(diff, diff))
    return d
and a function that, given two arrays, will output all the unique combinations of elements between these arrays:
def getPairs(a, b):
    if N.array_equal(a, b):
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(i+1, len(b))]
    else:
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
    return pairs
I want to take my data frame and find all the pairs of elements whose distance between them is less than some value, say 30. For the pairs that meet this requirement, I also need to store the distance I calculated in some other data frame. Here is my attempt at solving this, but this turned out to be extremely slow.
pairs = [getPairs(list(group.ind), list(boxes.get_group((name[0]+i, name[1]+j, name[2]+k)).ind)) \
for i in [0,1] for j in [0,1] for k in [0,1] if name[0]+i != 34 and name[1]+j != 34 and name[2]+k != 34]
pairs = list(itertools.chain(*pairs))
subInfo = pandas.DataFrame()
subInfo['pairs'] = pairs
subInfo['r'] = subInfo.pairs.apply(lambda x: dist(df_yz.query('ind == @x[0]').pos[0], df_yz.query('ind == @x[1]').pos[0]))
Don't worry about what I'm iterating over in this for loop; it works for the system I'm dealing with and isn't where I'm getting slowed down. The step where I use .query() is where the major slowdown happens.
The output I am looking for is something like:
pair       distance
(21, 22)   22.59
(21, 152)  15.01
(22, 427)  19.22
I made the distances up, and the pair list would be much longer, but that's the basic idea.

Took me a while, but here are three possible solutions. I hope they are self-explanatory. Written in Python 3.x in a Jupyter Notebook. One remark: if your coordinates are world coordinates, you may want to use the Haversine (great-circle) distance instead of the Euclidean distance, which is a straight line.
First, create your data
import pandas as pd
import numpy as np
values = [
    { 'x':-1.0, 'y':7.0, 'z':0.0, 'ind':21, 'pos':[-2.76788330078, 217.786453247, 26.6822681427] },
    { 'z':0.0, 'ind':22, 'pos':[-7.23852539062, 217.274139404, 26.6758270264] },
    { 'y':0.0, 'z':1.0, 'ind':152, 'pos':[-0.868591308594, 2.48404550552, 48.4036369324] },
    { 'y':6.0, 'z':2.0, 'ind':427, 'pos':[-0.304443359375, 182.772140503, 79.4475860596] }
]
def dist(a, b):
    """
    Calculates the Euclidean distance between two 3D vectors.
    """
    diff = np.array(a) - np.array(b)
    d = np.sqrt(np.dot(diff, diff))
    return d
df_initial = pd.DataFrame(values)
The following three solutions will generate this output:
      pairs       distance
1   (21, 22)      4.499905
3   (21, 427)    63.373886
7   (22, 427)    63.429709
The first solution is based on a full join of the data with itself. The downside is that it may exceed your memory if the dataset is huge. The advantages are the easy readability of the code and that it uses only Pandas:
#%%time
df = df_initial.copy()
# join data with itself, each line will contain two geo-positions
df['tmp'] = 1
df = df.merge(df, on='tmp', suffixes=['1', '2']).drop('tmp', axis=1)
# remove rows with similar index
df = df[df['ind1'] != df['ind2']]
# calculate distance for all
df['distance'] = df.apply(lambda row: dist(row['pos1'], row['pos2']), axis=1)
# filter only those within a specific distance
df = df[df['distance'] < 70]
# combine original indices into a tuple
df['pairs'] = list(zip(df['ind1'], df['ind2']))
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
The second solution tries to avoid the memory issue of the first version by iterating over the original data line by line and calculating the distance between the current line and the original data, keeping only values that satisfy the distance constraint. I was expecting poor performance, but it wasn't bad at all (see the summary at the end).
#%%time
df = df_initial.copy()
results = list()
for index, row1 in df.iterrows():
    # calculate distance between current coordinate and all original rows in the data
    df['distance'] = df.apply(lambda row2: dist(row1['pos'], row2['pos']), axis=1)
    # filter only those within a specific distance and drop rows with same index as current coordinate
    df_tmp = df[(df['distance'] < 70) & (df['ind'] != row1['ind'])].copy()
    # prepare final data
    df_tmp['ind2'] = row1['ind']
    df_tmp['pairs'] = list(zip(df_tmp['ind'], df_tmp['ind2']))
    # remember data
    results.append(df_tmp)
# combine all into one dataframe
df = pd.concat(results)
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
The third solution is based on spatial operations using the KDTree from Scipy.
#%%time
from scipy import spatial
tree = spatial.KDTree(list(df_initial['pos']))
# calculate distances (returns a sparse matrix)
distances = tree.sparse_distance_matrix(tree, max_distance=70)
# convert the sparse matrix to COOrdinate (coo) format
coo = distances.tocoo(copy=False)
def get_cell_value(idx: int, column: str = 'ind'):
    return df_initial.iloc[idx][column]

def extract_indices(row):
    # map the positional indices from the sparse matrix back to the 'ind' values
    return get_cell_value(int(row['idx1'])), get_cell_value(int(row['idx2']))
df = pd.DataFrame({'idx1': coo.row, 'idx2': coo.col, 'distance': coo.data})
df['pairs'] = df.apply(extract_indices, axis=1)
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
So what about performance? If you just want to know which rows of your original data are within the desired distance, then the KDTree version (the third version) is super fast. It took just 4 ms to generate the sparse matrix. But since I then used the indices from that matrix to extract the data from the original dataframe, the performance dropped. Of course this should be tested on your full dataset.
version 1: 93.4 ms
version 2: 42.2 ms
version 3: 52.3 ms (4 ms)
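If the extraction step is what slows the third version down, one possible fix is to map the sparse-matrix positions to the ind values in a single vectorized step instead of per-row .iloc lookups. A minimal sketch, assuming coo and df_initial from the third version above:
import numpy as np
import pandas as pd
# vectorized lookup: matrix positions -> object ids, all at once
ind = df_initial['ind'].to_numpy()
df = pd.DataFrame({'pairs': list(zip(ind[coo.row], ind[coo.col])),
                   'distance': coo.data})
# keep each unordered pair once and drop the self-pairs
df = df[[a < b for a, b in df['pairs']]]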

Related

python interpolation of some datapoints in dataset / merging lists

In an .xlsx file, machine data is logged in a way that is not suitable for further calculations. I've got a file that contains depth data of a cutting tool. Each depth increment comes with several further pieces of information like pressure, rotational speed, forces and many more.
As you can see, in some datapoints the resolution of the depth parameter (0.01) is insufficient, as other parameters are updated more often. So I want to interpolate between two consecutive depth datapoints.
What is important to know is that this effect doesn't occur at every depth. When the cutting tool moves fast, everything is fine.
Here is also an example file.
So I only need to interpolate the depth values when the difference between two consecutive depth datapoints is 0.01.
I've tried the following approach:
1. Open as dataframe, rename, drop NaN, convert to list
2. Count identical depths in the list and transfer them to a dataframe
3. Calculate the delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with 0
4. Divide delta depth by the number of time steps if 0.009 < delta depth < 0.011 --> interpolated depth
5. Build an empty list of lists, with the number of elements of each sublist corresponding to the duration
6. Pass values from the interpolated depth to the respective sublists --> List 1
7. Transfer elements from delta_depth to the sublists --> List 2
8. Merge List 1 and List 2
9. Flatten the lists
10. Replace the original depth values by the interpolated values in the dataframe
It looks like this, but at point 8 (merging) I don't get what I need:
import pandas as pd
from itertools import groupby
from itertools import zip_longest
import matplotlib.pyplot as plt
import numpy as np
#open and rename of some columns
df_raw=pd.read_excel(open('---.xlsx', 'rb'), sheet_name='---')
df_raw=df_raw.rename(columns={"---"})
#drop NaN
df_1=df_raw.dropna(subset=['depth'])
#convert to list
li = df_1['depth'].tolist()
#count identical depths in list and transfer them to dataframe
df_count = pd.DataFrame.from_records([[i, len([*group])] for i, group in groupby(li)])
df_count = df_count.rename(columns={0: "depth", 1: "duration"})
#calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0".
df_count["delta_depth"] = df_count["depth"].diff()
df_count=df_count.fillna(0)
#Divide delta depth by number of time steps if 0.009 < delta depth < 0.011
df_count["inter_depth"] = np.where(np.logical_and(df_count['delta_depth'] > 0.009, df_count['delta_depth'] < 0.011),df_count["delta_depth"] / df_count["duration"],0)
li2=df_count.values.tolist()
li_depth = df_count['depth'].tolist()
li_delta = df_count['delta_depth'].tolist()
li_duration = df_count['duration'].tolist()
li_inter = df_count['inter_depth'].tolist()
#empty List of Lists with the number of elements of the sublist corresponding to the duration
out=[]
for number in li_duration:
    out.append(li_inter[:number])
#Pass values from interpolated depth to the respective sublists --> Liste 1
out = [[i]*j for i, j in zip(li_inter, [len(j) for j in out])]
#Transfer elements from delta_depth to sublists --> Liste 2
def extractDigits(lst):
    return list(map(lambda el: [el], lst))
lst=extractDigits(li_delta)
#Merge list 1 and list 2
list1 = out
list2 = lst
new_list = []
for l1, l2 in zip_longest(list1, list2, fillvalue=[]):
    new_list.append([y if y else x for x, y in zip_longest(l1, l2)])
new_list
After merging, the first element of each sublist is the original depth value, followed by the interpolated values. But the sublists should contain only interpolated values.
Now I have the following questions:
Is there in general a better approach to this problem?
How could I solve the problem with merging, or...
... find a way to overwrite the wrong first elements in the sublists?
The desired result would look something like this.
Any help would be much appreciated, as I'm very inexperienced in Python and totally stuck.
I am sure someone could write something prettier, but I think this will work just fine. Edited to somewhat messy scripting, but I think it will do what you need:
_list_helper1 = df["Depth [m]"].to_list()
_list_helper1.insert(0, 0)
_list_helper1.insert(0, 0)
_list_helper1 = _list_helper1[:-2]
df["helper1"] = _list_helper1
_list = df["Depth [m]"].to_list() # grab all depth values
_list.insert(0, 0) # insert a value at the beginning to offset from original col
_list = _list[0:-1] # Delete the very last item
df["helper"] = _list # add the list to a helper col which is now offset
df["delta depth"] = df["Depth [m]"] - df["helper"] # subtract helper col from original
_id = 0
for i in range(len(df)):
    if df.loc[i, "Depth [m]"] == df.loc[i, "helper"]:
        break_val = df.loc[i, "Depth [m]"]
        break_val_2 = df.loc[i+1, "Depth [m]"]
        if break_val_2 == break_val:
            df.loc[i, "IDcol"] = _id
            df.loc[i+1, "IDcol"] = _id
        else:
            _id += 1
depth = df["IDcol"].to_list()
depth = list(dict.fromkeys(depth))
depth = [x for x in depth if str(x) != 'nan']
increments = []
for i in depth:
    _df = df.copy()
    _df = _df[_df["IDcol"] == i]
    _df.reset_index(inplace=True, drop=True)
    div_by = len(_df)
    increment = _df.loc[0, "helper"] - _df.loc[0, "helper1"]
    _df["delta depth"] = increment / div_by
    _increment = increment / div_by
    base_value = _df.loc[0, "Depth [m]"]
    for y in range(div_by):
        _df.loc[y, "Depth [m]"] = base_value + ((y + 1) * _increment)
    increments.append(_df)
df["IDcol"] = df["IDcol"].fillna("KEEP")
df = df[df["IDcol"] == "KEEP"]
increments.append(df)
df = pd.concat(increments)
df = df.fillna(0)
df = df[["index", "Depth [m]", "delta depth", "IDcol"]] # and whatever other cols u want

add a column to a pandas.dataframe that holds the index of the closest point with a certain condition

I have a huge number of points stored with x and y coordinates and an additional value ('value_P') in a pandas DataFrame, so the dataframe looks like:
   x-coordinate  y-coordinate  value_P
0             0             3        1
1            40            58        1
2             5             4        2
3            76            98        2
4            15            35        3
5             5             4        3
but with around 250000 entries, so I am looking for an efficient solution. I am trying to add a column that holds the row index of the closest other point, but only distances from points with value_P != 1 to points with value_P == 1 should be considered. Also, I am only interested in the index for points where value_P != 1. It's difficult to explain, but the desired output should be:
   x-coordinate  y-coordinate  value_P  index
0             0             3        1    NaN
1            40            58        1    NaN
2             5             4        2      0
3            76            98        2      1
4            15            35        3      1
5             5             4        3      0
For row 1 the index is NaN because I am not interested in it, since value_P == 1. For row 2 it's 0, because the point from row 0 is the closest point with a value_P of 1.
I hope it's understandable.
I found a solution that involves 2 DataFrame.apply(lambda x: ...) functions, but it takes a long time. Even if you don't have a concrete solution but just an idea of how to improve the performance, it would be highly appreciated.
My current code is: (P_sort is the data and 'zuord' is the added column)
def index2(x_1, y_1, x_2, y_2, last_1):
    h = math.sqrt((x_1 - x_2) ** 2 + (y_1 - y_2) ** 2)
    return h

def index(x_1, y_1, x_v, y_v, last_1):
    df2 = pnd.DataFrame()
    df3 = pnd.DataFrame()
    df2['x-coordinate'] = x_v
    df2['y-coordinate'] = y_v
    df3['distances'] = df2.apply(
        lambda x: index2(x['x-coordinate'], x['y-coordinate'], x_1, y_1, last_1), axis=1)
    k = df3.idxmin()
    print(k)
    return k
last_1 = np.count_nonzero(P_sort[:, 2] == 1) - 1
df = pnd.DataFrame(P_sort,
columns=['x-coordinate', 'y-coordinate', 'value_P'])
number_columnx = df.loc[:, 'x-coordinate']
number_columny = df.loc[:, 'y-coordinate']
x_v = number_columnx.values
y_v = number_columny.values
x_v = x_v[0:last_1]
y_v = y_v[0:last_1]
df['zuord'] = df.apply(lambda x: index(x['x-coordinate'],x['y-coordinate'],x_v,y_v,last_1),axis=1)
I am new to programming so the code is kind of ugly
I benchmarked four solutions, and the fastest approach is a KD Tree.
Test Dataset
I randomly generated dataframes of various sizes to test the performance of each method.
def generate_spots(n, p=0.005):
    x_pos = np.random.uniform(0, 100, n)
    y_pos = np.random.uniform(0, 100, n)
    value_P = np.random.binomial(size=n, n=1, p=(1 - p)) + 1
    df = pd.DataFrame({
        'x-coordinate': x_pos,
        'y-coordinate': y_pos,
        'value_P': value_P
    })
    df = df.sort_values('value_P').reset_index(drop=True)
    return df
This generates a dataframe with n rows, with a probability p that each row is class 1. I also sorted it, because the original method seems to assume that the dataframe is sorted by P.
Method 1: Original
I made some small changes to your code to get it to work for me:
def method1(df):
    df = df.copy()
    last_1 = np.count_nonzero(df.loc[:, 'value_P'] == 1)
    number_columnx = df.loc[:, 'x-coordinate']
    number_columny = df.loc[:, 'y-coordinate']
    x_v = number_columnx.values
    y_v = number_columny.values
    x_v = x_v[0:last_1]
    y_v = y_v[0:last_1]
    df['index'] = df.apply(lambda x: index(x['x-coordinate'], x['y-coordinate'], x_v, y_v, last_1), axis=1)
    df.loc[0:last_1 - 1, 'index'] = -1
    return df
index() and index2() are defined the same way as your question. I also use -1 as a placeholder instead of NaN. No deep reason for this, just personal preference.
Method 2: cdist
Scipy has a function called cdist() which computes the distance between every pair of points drawn from two arrays of points.
import scipy.spatial.distance
def method2(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    nearest_point = scipy.spatial.distance.cdist(source_df, target_df).argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
The cdist function is pretty much the same as what you're doing - it's just implemented in C rather than Python.
Method 3: KD Tree
A KD Tree is a data structure designed to efficiently search for nearby points. You can use SciKit Learn to implement this.
import sklearn.neighbors
def method3(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    tree = sklearn.neighbors.KDTree(target_df)
    nearest_point = tree.query(source_df, k=1, return_distance=False)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point.flatten()
    return df
Method 4: fastdist
The Python package fastdist bills itself as a faster alternative to scipy's distance calculation methods. Ironically, I found this solution to be slower than cdist at all problem sizes.
from fastdist import fastdist
def method4(df):
    df = df.copy()
    first_P_class = df['value_P'] == 1
    target_df = df.loc[first_P_class][['x-coordinate', 'y-coordinate']]
    target_array = target_df.to_numpy()
    source_df = df.loc[~first_P_class][['x-coordinate', 'y-coordinate']]
    source_array = source_df.to_numpy()
    nearest_point = fastdist.matrix_to_matrix_distance(source_array, target_array, fastdist.euclidean, "euclidean").argmin(axis=1)
    df['index'] = -1
    df.loc[source_df.index, 'index'] = nearest_point
    return df
Benchmarks
Each method was run ten times, with various sizes of dataframe, in random order. Here are the results of the benchmark. Note that both the X and Y axes are log-scale.
I didn't benchmark fastdist or the original method for more than 30,000 points, because it took too long.
The fastest methods, in this benchmark, are the cdist method for fewer than 1000 points and the KD Tree method for more than 1000 points. At 250K points, the fastest solution is the KD Tree, taking only 0.2 seconds.
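For reference, a sketch of how such a benchmark could be set up; the loop below is hypothetical and not the exact harness used for the plot:
import timeit
# reuses generate_spots() and method1..method4 from above
sizes = [1_000, 10_000, 100_000]
methods = {'original': method1, 'cdist': method2, 'kdtree': method3, 'fastdist': method4}
for n in sizes:
    df = generate_spots(n)
    for name, fn in methods.items():
        t = timeit.timeit(lambda: fn(df), number=1)
        print(f"{name:>10} n={n:>6}: {t:.3f} s")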

fast way to iterate through list, find duplicates and perform calculations

I have two lists, one of areas and one of prices which are the same size.
For example:
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
For each entry, I need to return the list of prices of all other entries that share the same area (excluding the entry itself).
So for 1500, the lists that I need returned are: [750] and [200]
For the 2000, the lists that I need returned are [600,1000], [800,1000] and [800,600]
For the 1800 and 500, the lists I need returned are both empty lists [].
The goal is then to flag a price as an outlier when the absolute difference between the price and the mean (calculated excluding the price itself) exceeds 5 times the population standard deviation (also calculated excluding the price itself).
import statistics
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
outlier_idx = []
for idx, val in enumerate(area):
    comp_idx = [i for i, x in enumerate(area) if x == val]
    comp_idx.remove(idx)
    comp_price = [price[i] for i in comp_idx]
    if len(comp_price) > 2:
        sigma = statistics.stdev(comp_price)
        p_m = statistics.mean(comp_price)
        if abs(price[idx] - p_m) > 5 * sigma:
            outlier_idx.append(idx)
area = [i for j, i in enumerate(area) if j not in outlier_idx]
price = [i for j, i in enumerate(price) if j not in outlier_idx]
The problem is that this calculation takes up a lot of time and I am dealing with arrays that can be quite large.
I am stuck as to how I can increase the computational efficiency.
I am open to using numpy, pandas or any other common packages.
Additionally, I have tried the problem in pandas:
df['p-p_m'] = ''
df['sigma'] = ''
df['outlier'] = False
for name, group in df.groupby('area'):
    if len(group) > 1:
        idx = list(group.index)
        for i in range(len(idx)):
            tmp_idx = idx.copy()
            tmp_idx.pop(i)
            df['p-p_m'][idx[i]] = abs(group.price[idx[i]] - group.price[tmp_idx].mean())
            df['sigma'][idx[i]] = group.price[tmp_idx].std(ddof=0)
            if df['p-p_m'][idx[i]] > 3*df['sigma'][idx[i]]:
                df['outlier'][idx[i]] = True
Thanks.
This code is how you can create the lists for each area:
df = pd.DataFrame({'area': area, 'price': price})
price_to_delete = [item for idx_array in df.groupby('price').groups.values() for item in idx_array[1:]]
df.loc[price_to_delete, 'price'] = None
df = df.groupby('area').agg(lambda x: [] if all(x.isnull()) else x.tolist())
df
I don't understand exactly what you want, but this part calculates the outliers for each price in each area:
df['outlier'] = False
df['outlier'] = df['price'].map(lambda x: abs(np.array(x) - np.mean(x)) > 3*np.std(x) if len(x) > 0 else [])
df
I hope this helps you in some way!
Here is a solution that combines Numpy and Numba. Although correct, I did not test it against alternative approaches regarding efficiency, but Numba usually results in significant speedups for tasks that require looping through data. I have added one extra point which is an outlier, according to your definition.
import numpy as np
from numba import jit
# data input
price = np.array([200,800,600,800,1000,750,200, 2000])
area = np.array([1500,2000,2000,1800,2000,1500,500, 1500])
@jit(nopython=True)
def outliers(price, area):
    is_outlier = np.full(len(price), False)
    for this_area in set(area):
        indexes = area == this_area
        these_prices = price[indexes]
        for this_price in set(these_prices):
            arr2 = these_prices[these_prices != this_price]
            if arr2.size > 1:
                std = arr2.std()
                mean = arr2.mean()
                indices = (this_price == price) & (this_area == area)
                is_outlier[indices] = np.abs(mean - this_price) > 5 * std
    return is_outlier
> outliers(price, area)
> array([False, False, False, False, False, False, False, True])
The code should be fast in case you have several identical price levels for each area, as they will be updated all at once.
I hope this helps.

Different results for linalg.norm in numpy

I am trying to create a feature matrix based on certain features and then find the distance between the items.
For testing purposes I am using only 2 points right now.
data: list of items I have
specs: feature dict of the items (I am using the values of their keys as the features of an item)
features: list of features
This is my code using a numpy zeros matrix:
import numpy as np

matrix = np.zeros((len(data), len(features)), dtype=bool)
for dataindex, item in enumerate(data):
    if dataindex > 5:
        break
    specs = item['specs']
    values = [value.lower() for value in specs.values()]
    for idx, feature in enumerate(features):
        if feature in values:
            matrix[dataindex, idx] = 1
            print dataindex, idx

v1 = matrix[0]
v2 = matrix[1]
# print v1.shape
diff = v2 - v1
dist = np.linalg.norm(diff)
print dist
The value for dist I am getting is 1.0
This is my code using Python lists:
matrix = []
for dataindex, item in enumerate(data):
    if dataindex > 5:
        f = open("Matrix.txt", 'w')
        f.write(str(matrix))
        f.close()
        break
    print "Item" + str(dataindex)
    row = []
    specs = item['specs']
    values = [value.lower() for value in specs.values()]
    for idx, feature in enumerate(features):
        if feature in values:
            print dataindex, idx
            row.append(1)
        else:
            row.append(0)
    matrix.append(row)

v1 = np.array(matrix[0])
v2 = np.array(matrix[1])
diff = v2 - v1
print diff
dist = np.linalg.norm(diff)
print dist
The value of dist in this case is 4.35889894354
I have checked many times that the value 1 is being set at the same positions in both cases, but the answer is different.
Maybe I am not using numpy properly or there is an issue with the logic.
I am using the numpy zeros matrix because of its memory efficiency.
What is the issue?
It's a type issue:
In [9]: norm(ones(3).astype(bool))
Out[9]: 1.0
In [10]: norm(ones(3).astype(float))
Out[10]: 1.7320508075688772
You must decide what the right norm is for your problem and, if necessary, cast your data with astype.
norm(M) is sqrt(dot(M.ravel(), M.ravel())), so for a boolean matrix, norm(M) is 0.0 if M is the all-False matrix and 1.0 otherwise. Use the ord parameter of norm to tune the function.
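For instance, casting the boolean vectors to integers before subtracting restores the usual Euclidean distance. A small sketch with made-up vectors:
import numpy as np
v1 = np.array([1, 0, 1, 1, 0], dtype=bool)
v2 = np.array([0, 1, 1, 0, 1], dtype=bool)
# cast to int so the difference can hold -1, 0 and 1 instead of being clipped to bool
diff = v2.astype(int) - v1.astype(int)
print(np.linalg.norm(diff))  # 2.0, the Euclidean distance between the 0/1 vectors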

Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on stack overflow the last few days. The following code is very similar to the implementation of numpy.unique and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.
import numpy as np

def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
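For example, with the arrays from the question (a usage sketch, assuming numpy arrays as input):
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_max(groups, data))  # [  7  10   1]
print(group_min(groups, data))  # [  2   8 -10]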
In pure Python:
from itertools import groupby, imap, izip
from operator import itemgetter as ig
print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
# -> [7, 10, 1]
A variation:
print [data[id==i].max() for i, _ in groupby(id)]
# -> [7, 10, 1]
Based on @Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print data[np.r_[np.diff(id), True].astype(np.bool)]
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print df.groupby('id')['data'].max()
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and Numpy, but it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.empty(data_id[-1]+1) # might want to use max(data_id) and zeros instead
np.maximum.at(ans,data_id,data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.98969952, 0.84044947, 0.63460516, 0.92042078, 0.75738113,
0.37976055])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge...presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
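If the data can contain arbitrary (possibly negative) values, one variant is to initialize the output with -inf so that every slot is guaranteed to be overwritten by a real value; a sketch:
import numpy as np
data_id = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5])
data_val = np.random.rand(len(data_id))
ans = np.full(data_id.max() + 1, -np.inf)  # safe starting point for a running maximum
np.maximum.at(ans, data_id, data_val)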
with only numpy and without loops:
id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface; plus it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster though, considering the max/min isn't explicitly computed. The ability to ignore nans of the accepted solution is neat; but one may also simply assign nan values a dummy key.
import numpy as np

def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    # number of counts per key is the difference of our slice points; cap off with the number of keys for the last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print 'per group mean'
print reduced_values / key_count
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print 'per group min'
print reduced_values
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print 'per group max'
print reduced_values
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) collects the unique ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
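One way to do that preprocessing is to bucket the values by id in a single pass, e.g. with a dict, so the data is traversed only once; a sketch using the arrays from the question:
from collections import defaultdict
buckets = defaultdict(list)
for k, v in zip(id, data):   # one pass over the data
    buckets[k].append(v)
print([max(buckets[k]) for k in sorted(buckets)])  # -> [7, 10, 1]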
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r then r[o] = x will fill r with the latest value x for each value of o, such that r[[0, 0]] = [1, 2] will return r[0] = 2. It requires that your groups are integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
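A possible usage sketch, remapping the question's 1-based ids to the required 0-based group labels:
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3]) - 1   # groups must run from 0 to n_groups - 1
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_max(groups, data))  # [ 7. 10.  1.]
print(group_min(groups, data))  # [  2.   8. -10.]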
