Cluster Rows in Data Subgroups


I have a dataset df of object components in 3-d space - each ID represents an object which has various components:
ID Comp x y z
A 1 2 2 1
A 2 2 1 -1
A 3 -10 1 -10
A 4 -1 3 -5
B 1 3 0 0
B 2 3 0 -5
...
I would like to loop through each ID, using a clustering technique in sklearn to create clusters of components (Comp) based on each component's (x,y,z) co-ordinates - to achieve something like this:
ID Comp x y z cluster
A 1 2 2 1 1
A 2 2 1 -1 1
A 3 -10 1 -10 2
A 4 -1 3 -5 3
B 1 3 0 0 1
B 2 3 0 -5 1
...
As an example - ID:A, Comp:1 is in cluster 1, whereas ID:A, Comp:4 is in cluster 3. (I plan to then concatenate ID and cluster later).
I'm having no luck with the following groupby + apply:
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation()
df['cluster']=df.groupby(['ID','Comp']).apply(lambda x: ap.fit_predict(np.array([x.x,x.y,x.z]).T))
I could brute-force it by using a for loop over the ID but my dataset is large (~ 150k ID) and I'm worried about resource and time constraints. Any help would be great!

IIUC, I think you could try something like this:
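import pandas as pd
from sklearn.cluster import AffinityPropagation
def ap_fit_pred(x):
    # fit a fresh model on one ID's components and label each of its rows
    ap = AffinityPropagation()
    return pd.Series(ap.fit_predict(x.loc[:,['x','y','z']]))
# assumes df is sorted by ID with a default 0..n-1 index, so the flattened labels line up with the rows
df['cluster'] = df.groupby('ID').apply(ap_fit_pred).reset_index(drop=True)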
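If df is not sorted by ID, or its index is not a plain 0..n-1 range, dropping the index as above can attach labels to the wrong rows. Here is a minimal sketch of a variant that writes labels back by index instead. The names label_groups and sign_of_z are hypothetical, and sign_of_z is only a toy stand-in clusterer so the example runs without sklearn. In practice you would presumably pass something like AffinityPropagation().fit_predict in its place.

```python
import pandas as pd

def label_groups(df, fit_predict, cols=("x", "y", "z"), key="ID"):
    """Return per-group cluster labels aligned to df's own index."""
    labels = pd.Series(0, index=df.index, dtype=int)
    # .groups maps each ID to the index labels of its rows
    for _, idx in df.groupby(key).groups.items():
        points = df.loc[idx, list(cols)].to_numpy()
        labels.loc[idx] = fit_predict(points)
    return labels

def sign_of_z(points):
    # Hypothetical stand-in for a real clusterer: label 1 if z < 0, else 0
    return (points[:, 2] < 0).astype(int)

df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B"],
    "Comp": [1, 2, 3, 4, 1, 2],
    "x": [2, 2, -10, -1, 3, 3],
    "y": [2, 1, 1, 3, 0, 0],
    "z": [1, -1, -10, -5, 0, -5],
})
df["cluster"] = label_groups(df, sign_of_z)
```

This still visits every ID once, as groupby + apply does internally, so the cost is dominated by the clustering call itself.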

Related

Calculate a np.arange within a Panda dataframe from other columns

I want to create a new column with all the coordinates the car needs to pass through on the way to a certain goal. Each entry should be a list stored in a pandas DataFrame.
To start with I have this:
import numpy as np
import pandas as pd
cars = pd.DataFrame({'x_now': np.repeat(1,5),
                     'y_now': np.arange(5,0,-1),
                     'x_1_goal': np.repeat(1,5),
                     'y_1_goal': np.repeat(10,5)})
output would be:
x_now y_now x_1_goal y_1_goal
0 1 5 1 10
1 1 4 1 10
2 1 3 1 10
3 1 2 1 10
4 1 1 1 10
I have tried to add new columns like this, and it does not work
for xy_index in range(len(cars)):
    if cars.at[xy_index, 'x_now'] == cars.at[xy_index,'x_1_goal']:
        cars.at[xy_index, 'x_car_move_route'] = np.repeat(cars.at[xy_index, 'x_now'].astype(int),(
            abs(cars.at[xy_index, 'y_now'].astype(int)-cars.at[xy_index, 'y_1_goal'].astype(int))))
    else:
        cars.at[xy_index, 'x_car_move_route'] = \
            np.arange(cars.at[xy_index,'x_now'], cars.at[xy_index,'x_1_goal'],
                      (cars.at[xy_index,'x_1_goal'] - cars.at[xy_index,'x_now']) / (
                          abs(cars.at[xy_index,'x_1_goal'] - cars.at[xy_index,'x_now'])))
In the end I want the columns x_car_move_route and y_car_move_route so I can loop over the coordinates that the cars need to pass. I will show it with tkinter. I will also add more goals, since this is actually only the first turn that they need to make.
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]

You can apply() something like this route() function along axis=1, which means route() will receive rows from cars. It generates either x or y coordinates depending on what's passed into var (from args).
You can tweak/fix as needed, but it should get you started:
def route(row, var):
    # var is the axis to generate ('x' or 'y'); var2 is the other axis
    var2 = 'y' if var == 'x' else 'x'
    now, now2 = row[f'{var}_now'], row[f'{var2}_now']
    goal, goal2 = row[f'{var}_1_goal'], row[f'{var2}_1_goal']
    diff, diff2 = goal - now, goal2 - now2
    if diff == 0:
        # no movement on this axis: repeat the position once per step on the other axis
        result = np.array([now] * abs(diff2)).astype(int)
    else:
        # one unit per step toward the goal (the 1 + offset assumes goal > now)
        result = 1 + np.arange(now, goal, diff / abs(diff)).astype(int)
    return result
cars['x_car_move_route'] = cars.apply(route, args=('x',), axis=1)
cars['y_car_move_route'] = cars.apply(route, args=('y',), axis=1)
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
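The route() function above adds 1 to every generated step, which only suits goals that lie ahead of the current position. As one possible tweak, here is a small direction-aware sketch. axis_route is a hypothetical helper name, not part of the answer above, and it assumes integer grid coordinates.

```python
import numpy as np

def axis_route(now, goal, other_steps):
    """Coordinates visited along one axis: start excluded, goal included."""
    step = int(np.sign(goal - now))
    if step == 0:
        # No movement on this axis: hold position for each step on the other axis
        return np.full(other_steps, now, dtype=int)
    # Walk one unit at a time in the direction of the goal
    return np.arange(now + step, goal + step, step)

forward = axis_route(5, 10, 0)    # moving up from 5 to 10
backward = axis_route(10, 7, 0)   # moving down from 10 to 7
parked = axis_route(1, 1, 5)      # x stays at 1 while y takes 5 steps
```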

How to change a certain count of numpy matrix elements?

I have a square numpy 2D matrix.
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
And I need to set a certain count of random matrix values to 0. Let's say it is 5 elements. That means any 5 from 16 matrix values must be set to 0. For example new matrix could be
2 2 0 0
0 2 2 2
2 2 2 2
0 2 0 2
or
2 0 2 2
2 2 0 2
2 2 0 2
0 2 2 0
or any other such arrangement.
How can I do this efficiently?

This will do it:
import random
arr1d = arr.ravel()
randidx = random.sample(range(len(arr1d)), 5)
arr1d[randidx] = 0
This modifies arr in place because ravel() returns a view rather than a copy whenever it can, which is the case for a contiguous array like this one.
For more on how the random numbers can be generated, see: Non-repetitive random number in numpy
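The same idea can be written with NumPy's own random generator. This sketch assumes a 4x4 array filled with 2 as in the question; the seed is fixed only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for repeatability
arr = np.full((4, 4), 2)

# Draw 5 distinct flat positions out of arr.size, without replacement
flat_idx = rng.choice(arr.size, size=5, replace=False)

# arr.flat indexes the array as if it were 1-D and writes into arr itself
arr.flat[flat_idx] = 0
```

Because the positions are drawn without replacement, exactly 5 elements end up as 0.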

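Let's say your matrix is called matrix:
import random
rows, cols = matrix.shape
zeroed = set()
# keep drawing until 5 distinct cells are zeroed (a repeated draw would otherwise leave fewer)
while len(zeroed) < 5:
    # randrange excludes the upper bound, so the indices stay in range
    cell = (random.randrange(rows), random.randrange(cols))
    zeroed.add(cell)
    matrix[cell] = 0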

How to calculate amounts that row values greater than a specific value in pandas?

For example, I have a Pandas DataFrame dff. I want to count row values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am using an inefficient way. How to be more efficient?
dff['count'] = 0
for m in range(len(dff)):
    og = 0
    for i in dff.columns:
        if dff[i][m] > 0:
            og += 1
    dff['count'][m] = og
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2

You can create a boolean mask of your DataFrame that is True wherever a value is greater than your threshold (in this case 0), and then sum along axis 1, i.e. across the columns of each row.
dff.gt(0).sum(1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64
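To store the result as a new column, as in the question, the mask-and-sum can be assigned directly. Here is a small sketch on made-up values; the numbers and the threshold name are illustrative only.

```python
import pandas as pd

dff = pd.DataFrame({
    "a": [-0.5, 0.6, 1.9],
    "b": [-1.2, -0.8, 0.3],
    "c": [0.4, -0.2, 0.0],
})
threshold = 0

# gt() builds the boolean mask; summing booleans along axis=1 counts the True values per row
dff["count"] = dff[["a", "b", "c"]].gt(threshold).sum(axis=1)
```

Note that the comparison is strict, so a value equal to the threshold (the 0.0 in the last row) is not counted.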

is there a 2D dictionary in python?

I was about to create a matrix like :
33 12 23 42 11 32 43 22
33 − 1 1 1 0 0 1 1
12 1 − 1 1 0 0 1 1
23 1 1 − 1 1 1 0 0
42 1 1 1 − 1 1 0 0
11 0 0 1 1 − 1 1 1
32 0 0 1 1 1 − 1 1
43 1 1 0 0 1 1 − 1
22 1 1 0 0 1 1 1 −
I want to query by horizontal or vertical titles, so I created the matrix by:
a = np.matrix('99 33 12 23 42 11 32 43 22;33 99 1 1 1 0 0 1 1;12 1 99 1 1 0 0 1 1;23 1 1 99 1 1 1 0 0;42 1 1 1 99 1 1 0 0;11 0 0 1 1 99 1 1 1;32 0 0 1 1 1 99 1 1;43 1 1 0 0 1 1 99 1;22 1 1 0 0 1 1 1 99')
I want to have the certain data if I query a[23][11] = 1
so is there a way we can create a 2D dictionary, so that a[23][11] = 1?
Thanks

You're clearly asking for something outside of numpy.
A defaultdict with dict as its default_factory gives a sense of the 2D dictionary you want:
>>> from collections import defaultdict
>>> a = defaultdict(dict)
>>> a[23][11] = 1
>>> a[23]
{11: 1}
>>> a[23][11]
1
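For the table in the question, the whole nested dictionary can be built in one pass from the labels and rows. This is a minimal sketch using plain dicts; it uses 99 for the diagonal, as the question's np.matrix does.

```python
labels = [33, 12, 23, 42, 11, 32, 43, 22]
rows = [
    [99, 1, 1, 1, 0, 0, 1, 1],
    [1, 99, 1, 1, 0, 0, 1, 1],
    [1, 1, 99, 1, 1, 1, 0, 0],
    [1, 1, 1, 99, 1, 1, 0, 0],
    [0, 0, 1, 1, 99, 1, 1, 1],
    [0, 0, 1, 1, 1, 99, 1, 1],
    [1, 1, 0, 0, 1, 1, 99, 1],
    [1, 1, 0, 0, 1, 1, 1, 99],
]

# Outer key is the row label, inner key is the column label
a = {r: dict(zip(labels, row)) for r, row in zip(labels, rows)}

value = a[23][11]
```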

Another possibility is to use tuples as the dictionary keys:
{(33, 12): 1, (23, 12): 1, ...}
scipy.sparse has a sparse matrix format (dok_matrix, "dictionary of keys") that stores its values in such a dictionary. With your values such a matrix would represent a 50x50 matrix with mostly 0 values, and just 1's at these selected coordinates.
Keep in mind that the keys of an ordinary dictionary are not sorted; at most they keep insertion order (guaranteed from Python 3.7).
What are you going to be doing with this data? A dictionary, whether tuple-keyed or nested, is good for one kind of usage, but bad for others. A matrix such as your sample is better for other things, like operations along rows or columns. The dictionary format largely obscures that kind of ordered layout.

Are you looking for a dictionary with pairs as keys?
d = {}
d[33, 12] = 1
d[33, 23] = 1
# etc
Note that in python d[a, b] is just syntactic sugar for d[(a, b)]

If I understand correctly you just want to label your row/columns. To stay within the numpy array framework, a simple solution would be to create a mapping between the labels and the array order. I am also going to assume that it is OK to convert the labels into strings as they can be anything (though integers would also be fine).
import scipy as sp
import scipy.linalg
# map each label (as a string) to its row/column position
l = {str(x) : ind for ind , x in enumerate((33,12,23,42,11,32,43,22))}
a = sp.linalg.circulant([99,1,1,1,0,0,1,1])
a[l['32'],l['23']]
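If leaving plain numpy is acceptable, a labelled table gives the same lookup without a hand-made mapping. Using pandas is an assumption here, since the question only mentions numpy. The sketch uses a 4-label subset (33, 23, 11, 43) of the question's table to stay short.

```python
import pandas as pd

labels = [33, 23, 11, 43]
table = pd.DataFrame(
    [[99, 1, 0, 1],
     [1, 99, 1, 0],
     [0, 1, 99, 1],
     [1, 0, 1, 99]],
    index=labels,
    columns=labels,
)

value = table.loc[23, 11]   # row label 23, column label 11
row_23 = table.loc[23]      # a whole row by label
col_11 = table[11]          # a whole column by label
```

Unlike a dictionary, this keeps the row/column layout, so operations along either axis remain available.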

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1

One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination (see footnote 1) to mark the rows that fall within the first 2 positions of each consecutive run (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
1 See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
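The one-liner above can be easier to follow when split into its intermediate steps. In this sketch, run_id, pos_in_run and limit are illustrative names that do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2  # maximum allowed run length

# A new run starts wherever the value differs from the previous row
run_id = (df["A"] != df["A"].shift()).cumsum()

# Position of each row inside its run, counting from 0
pos_in_run = df.groupby(run_id).cumcount()

# Keep a 1 only while its position in the run is still below the limit
df["B"] = ((pos_in_run < limit) & (df["A"] == 1)).astype(int)
```

For this data, run_id is 1,1,1,2,3,3,3,3,4,5 and pos_in_run is 0,1,2,0,0,1,2,3,0,0, so only the third 1 of the first run and the third and fourth 1's of the second run are cleared.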

Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = [int(df['A'].iloc[x] == 1 and (x < limit or not (df['A'].iloc[x - limit:x] == 1).all())) for x in range(len(df))]
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1

If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
import numpy
a = df['A'].to_numpy(copy=True)
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
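def trim_runs(array, cutoff):
    # copy so the caller's data is left untouched; cutoff must be at least 1
    a = numpy.array(array)
    # a trailing window of cutoff + 1 ones means this element lies beyond the cutoff
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a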
