Pandas create multiple aggregations - python

Trying to see how hard or easy this is to do with Pandas.
Let's say one has two columns with data such as:
Cat1 Cat2
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
D 4
As you see A and C have three common elements 1, 2, 3. B however has only two elements 1 and 2. D has only one element: 4.
How would one programmatically get to this same result? The idea is to have each group returned somehow: one would be [A, C] with [1, 2, 3], then [B] with [1, 2], and [D] with [4].
I know a program can be written to do this so I am trying to figure out if there is something on Pandas to do it without having to build stuff from scratch.
Thanks!

You can use groupby twice to achieve this.
df = df.groupby('Cat1')['Cat2'].apply(lambda x: tuple(set(x))).reset_index()
df = df.groupby('Cat2')['Cat1'].apply(lambda x: tuple(set(x))).reset_index()
I'm using tuple because pandas needs elements to be hashable in order to do a groupby. The code above doesn't distinguish between (1, 2, 3) and (1, 1, 2, 3). If you want to make this distinction, replace set with sorted.
The resulting output:
Cat2 Cat1
0 (1, 2) (B,)
1 (1, 2, 3) (A, C)
2 (4,) (D,)
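For reference, the two groupby steps can be run end to end as a minimal sketch. The sample frame is rebuilt from the question's table, and the names members and result are only illustrative. It uses sorted(set(...)) rather than set alone, on the assumption that you want a key that does not depend on row order or on how a set happens to iterate.

```python
import pandas as pd

# Sample data rebuilt from the question's table.
df = pd.DataFrame({
    "Cat1": list("AAABBCCCD"),
    "Cat2": [1, 2, 3, 1, 2, 1, 2, 3, 4],
})

# Step 1: collapse each Cat1 group into a sorted, de-duplicated tuple.
members = (
    df.groupby("Cat1")["Cat2"]
      .apply(lambda s: tuple(sorted(set(s))))
      .reset_index()
)

# Step 2: group by that tuple to collect the Cat1 labels that share it.
result = (
    members.groupby("Cat2")["Cat1"]
           .apply(lambda s: tuple(sorted(s)))
           .reset_index()
)
print(result)
```

Each row of result pairs one set of Cat2 values with the Cat1 labels that have exactly that set.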

You could also:
df = df.set_index('Cat1', append=True).unstack().loc[:, 'Cat2']
df = pd.Series({col: tuple(values.dropna()) for col, values in df.items()})
df = df.groupby(df.values).apply(lambda x: list(x.index))
to get
Cat1
(1.0, 2.0) [B]
(1.0, 2.0, 3.0) [A, C]
(4.0,) [D]

Related

Pandas: Get list of shared values of column B that two different values from column B have in common

I have a table like this:
image user
1 1
2 1
3 1
1 2
3 2
2 3
3 3
... ...
Now I want to find shared images between two users. For example, when I ask for the images shared between user 1 and user 2, I want [1, 3] as a result.
How could I achieve this?
Expanding on @BrendanA's answer, you can use itertools.combinations to get all 2-combinations of users, find their shared images and cast the result to a DataFrame:
from itertools import combinations
users_to_images = df.groupby('user')['image'].agg(set)
data = {(i,j): [list(users_to_images[i].intersection(users_to_images[j]))] for i,j in combinations(users_to_images.index, 2)}
out = pd.DataFrame.from_records(data, index=['shared_images']).T
Output:
shared_images
(1, 2) [1, 3]
(1, 3) [2, 3]
(2, 3) [3]
Then users 1,2 share images 1,3; users 1,3 share 2,3, etc.
You can do that with the following
seen = df.groupby("user")["image"].apply(set)
shared = list(seen[1].intersection(seen[2]))
print(shared)
[1,3]
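As a self-contained sketch of the same idea, the snippet below rebuilds the visible rows of the question's table and wraps the set intersection in a small helper. The names images_by_user and shared_images are hypothetical, chosen only for this example.

```python
from itertools import combinations

import pandas as pd

# The rows visible in the question's table.
df = pd.DataFrame({
    "image": [1, 2, 3, 1, 3, 2, 3],
    "user":  [1, 1, 1, 2, 2, 3, 3],
})

# One set of images per user.
images_by_user = df.groupby("user")["image"].apply(set)

def shared_images(u, v):
    """Sorted list of the images that both users have."""
    return sorted(images_by_user[u] & images_by_user[v])

# Shared images for every pair of users.
all_pairs = {
    (int(u), int(v)): shared_images(u, v)
    for u, v in combinations(images_by_user.index, 2)
}

print(shared_images(1, 2))
print(all_pairs)
```

Building the per-user sets once and reusing them keeps each pairwise lookup cheap, which matters if many pairs are queried.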

Choose the best of three columns

I have a dataset with three columns A, B and C. I want to create a column where I select the two columns closest to each other and take the average. Take the table below as an example:
A B C Best of Three
3 2 5 2.5
4 3 1 3.5
1 5 2 1.5
For the first row, A and B are the closest pair, so the best of three column is (3+2)/2 = 2.5; for the third row, A and C are the closest pair, so the best of three column is (1+2)/2 = 1.5. Below is my code. It is quite unwieldy and quickly becomes too long if there are more columns. I look forward to suggestions!
import numpy as np
import pandas as pd

data = {'A': [3, 4, 1],
        'B': [2, 3, 5],
        'C': [5, 1, 2]}
df = pd.DataFrame(data)
df['D'] = abs(df['A'] - df['B'])
df['E'] = abs(df['A'] - df['C'])
df['F'] = abs(df['C'] - df['B'])
# row-wise minimum of the three differences
df['G'] = df[['D', 'E', 'F']].min(axis=1)
# average whichever pair has the smallest difference
df['Best of Three'] = np.where(
    df['G'] == df['D'], (df['A'] + df['B']) / 2,
    np.where(df['G'] == df['E'], (df['A'] + df['C']) / 2,
             (df['B'] + df['C']) / 2))
First you need a function that finds the minimum difference between any two elements of a list. It also returns the mean of those two values, so the result is a tuple (diff, mean):
def min_list(values):
    return min((abs(x - y), (x + y) / 2)
               for i, x in enumerate(values)
               for y in values[i + 1:])
Then apply it to each row:
df = pd.DataFrame([[3, 2, 5, 6], [4, 3, 1, 10], [1, 5, 10, 20]],
                  columns=['A', 'B', 'C', 'D'])
df['best'] = df.apply(lambda x: min_list(x)[1], axis=1)
print(df)
Functions are your friends. You want to write a function that finds the two closest integers in a list, then pass it the list of the values of the row. Store those results and pass them to a second function that returns the average of two values.
(Also, your code would be much more readable if you replaced D, E, F, and G with descriptively named variables.)
Solve by using itertools combinations generator:
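import itertools
def get_closest_avg(s):
    # all unordered pairs of values in the row
    c = list(itertools.combinations(s, 2))
    # average the pair with the smallest absolute difference
    return sum(c[pd.Series(c).apply(lambda x: abs(x[0]-x[1])).idxmin()])/2
df['B3'] = df.apply(get_closest_avg, axis=1)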
df:
A B C B3
0 3 2 5 2.5
1 4 3 1 3.5
2 1 5 2 1.5
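The answers above compare every pair of values in each row. As a hedged alternative, assuming all columns are numeric, you can sort each row first: in sorted order the closest pair is always two neighbours, so one np.diff per row is enough. The helper name closest_pair_mean is hypothetical.

```python
import numpy as np
import pandas as pd

def closest_pair_mean(frame):
    """Row-wise mean of the two values that are closest to each other."""
    values = np.sort(frame.to_numpy(dtype=float), axis=1)
    gaps = np.diff(values, axis=1)   # differences between sorted neighbours
    pos = gaps.argmin(axis=1)        # position of the left value of the tightest pair
    rows = np.arange(len(values))
    means = (values[rows, pos] + values[rows, pos + 1]) / 2
    return pd.Series(means, index=frame.index)

df = pd.DataFrame({"A": [3, 4, 1], "B": [2, 3, 5], "C": [5, 1, 2]})
df["Best of Three"] = closest_pair_mean(df[["A", "B", "C"]])
print(df)
```

This avoids a Python-level loop over rows and works unchanged when more columns are added.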

Get index from rows with matching values in different columns

I have a set like this:
N1 N2
0 a b
1 b f
2 c d
3 d a
4 e b
I want to get the indexes with the repeated values between the two columns, and the value itself.
From the example, I should get something like these shortlists:
(value, idx(N1), idx(N2))
(a, 0, 3)
(b, 1, 0)
(b, 1, 4)
(d, 3, 2)
I have been able to do it with two for-loops, but for a half-million-row dataframe it took hours...
Use a numpy broadcasting comparison and then use argwhere to find the indices where the values were equal:
import numpy as np
# make a broadcasted comparison
mat = df['N2'].values == df['N1'].values[:, None]
# find the indices where the values are True
where = np.argwhere(mat)
# select the values
values = df['N1'][where[:, 0]]
# create the DataFrame
res = pd.DataFrame(data=[[val, *row] for val, row in zip(values, where)], columns=['values', 'idx_N1', 'idx_N2'])
print(res)
Output
values idx_N1 idx_N2
0 a 0 3
1 b 1 0
2 b 1 4
3 d 3 2
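One caveat: the broadcast comparison builds an n x n boolean matrix, so for half a million rows it would need roughly 250 GB at one byte per element. A possible alternative, sketched here under the assumption that the frame has a unique index, is to merge the two columns on their values. The names left, right and res are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"N1": list("abcde"), "N2": list("bfdab")})

# Turn each column into a (row index, value) table.
left = df["N1"].rename("value").rename_axis("idx_N1").reset_index()
right = df["N2"].rename("value").rename_axis("idx_N2").reset_index()

# An inner merge keeps only the values present in both columns,
# with one output row per matching (idx_N1, idx_N2) pair.
res = left.merge(right, on="value")[["value", "idx_N1", "idx_N2"]]
print(res)
```

The merge only materialises the matching pairs, so its memory use scales with the number of matches rather than with the square of the row count.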

Find rows in pandas dataframe, where different rows have common values in lists in columns storing lists

I can solve my task by writing a for loop, but I wonder, how to do this in a more pandorable way.
So I have this dataframe storing some lists and want to find all the rows that have any common values in these lists,
(This code is just to obtain a df with lists:
>>> df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]})
>>> df
a b
0 A 1
1 A 2
2 B 5
3 B 1
4 B 4
5 C 6
>>> d = df.groupby('a')['b'].apply(list)
)
Here we start:
>>> d
A [1, 2]
B [5, 1, 4]
C [6]
Name: b, dtype: object
I want to select rows with index 'A' and 'B', because their lists overlap by the value 1.
I could write now a for loop or expand the dataframe at these lists (reversing the way I got it above) and have multiple rows copying other values.
What would you do here? Or is there some way to use something like df.groupby(by=lambda x, y: not set(x).isdisjoint(y)) that compares two rows?
But groupby and also boolean masking just look at one element at once...
I have now tried overloading the equality operator for lists and, because lists are not hashable, for tuples and sets (I set the hash to 1 to avoid identity comparison). I then used groupby and merge on the frame with itself, but it seems to check off the indexes it has already matched.
import pandas as pd
import numpy as np
from operator import itemgetter
class IndexTuple(set):
    def __hash__(self):
        #print(hash(str(self)))
        return hash(1)
    def __eq__(self, other):
        #print("eq ")
        is_equal = not set(self).isdisjoint(other)
        return is_equal
l = IndexTuple((1,7))
l1 = IndexTuple((4, 7))
print (l == l1)
df = pd.DataFrame(np.random.randint(low=0, high=4, size=(10, 2)), columns=['a','b']).reset_index()
d = df.groupby('a')['b'].apply(IndexTuple).to_frame().reset_index()
print (d)
print (d.groupby('b').b.apply(list))
print (d.merge (d, on = 'b', how = 'outer'))
This outputs the following (it works fine for the first element, but at [{3}] there should be [{3},{0,3}] instead):
True
a b
0 0 {1}
1 1 {0, 2}
2 2 {3}
3 3 {0, 3}
b
{1} [{1}]
{0, 2} [{0, 2}, {0, 3}]
{3} [{3}]
Name: b, dtype: object
a_x b a_y
0 0 {1} 0
1 1 {0, 2} 1
2 1 {0, 2} 3
3 3 {0, 3} 1
4 3 {0, 3} 3
5 2 {3} 2
Using a merge on df:
v = df.merge(df, on='b')
common_cols = set(
np.sort(v.iloc[:, [0, -1]].query('a_x != a_y'), axis=1).ravel()
)
common_cols
{'A', 'B'}
Now, pre-filter and call groupby:
df[df.a.isin(common_cols)].groupby('a').b.apply(list)
a
A [1, 2]
B [5, 1, 4]
Name: b, dtype: object
I understand you are asking for a "pandorable" solution, but in my opinion this is not a task ideally suited to pandas.
Below is one solution using collections.defaultdict and itertools.combinations which provides your result without using a dataframe.
from collections import defaultdict
from itertools import combinations
data = {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]}
d = defaultdict(set)
for i, j in zip(data['a'], data['b']):
    d[i].add(j)
res = {frozenset({i, j}) for i, j in combinations(d, 2) if not d[i].isdisjoint(d[j])}
# {frozenset({'A', 'B'})}
Explanation
Group the values into sets with collections.defaultdict, in a single O(n) pass.
Iterate with itertools.combinations to find pairs of sets that are not disjoint, collecting them in a set comprehension.
Use frozenset (or a sorted tuple) for each pair, because the elements of a set must be hashable and mutable types such as lists and sets are not.
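If you would rather stay in pandas, one possible sketch is to count, for each value of b, how many distinct groups in a contain it, and keep the groups that own at least one value seen in more than one group. The names pairs, groups_per_value, overlapping and selected are illustrative, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({"a": ["A", "A", "B", "B", "B", "C"], "b": [1, 2, 5, 1, 4, 6]})
d = df.groupby("a")["b"].apply(list)

# Drop exact duplicates so a value repeated inside one group is not counted twice.
pairs = df.drop_duplicates()

# For every row, the number of distinct groups in "a" that contain its "b" value.
groups_per_value = pairs.groupby("b")["a"].transform("nunique")

# Groups that share at least one value with some other group.
overlapping = sorted(pairs.loc[groups_per_value > 1, "a"].unique())
selected = d.loc[overlapping]
print(selected)
```

Note that this reports which groups overlap with something, not which specific pairs overlap; the combinations approach above gives the pairs.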

Concatenating a large number of dataframes

I have a dictionary D that contains many dataframes.
I can access every dataframe with D[0], D[1]...D[i], with the integers as keys/identifier of the respective dataframe.
I now want to concat all the dataframes in this fashion into a new dataframe:
new_df = pd.concat([D[0],D[1],...D[i]], axis= 1)
How would you suggest solving this (concat still needs to be used)?
I tried generating a list of the D's and passing that in, but received an error message.
I think the easiest thing to do is to use a list comprehension over the dict items:
In [14]:
d = {'a':pd.DataFrame(np.random.randn(5,3), columns=list('abc')), 'b':pd.DataFrame(np.random.randn(5,3), columns=list('def'))}
d
Out[14]:
{'a': a b c
0 0.030358 1.523752 1.040409
1 -0.220019 -1.579467 -0.312059
2 1.019489 -0.272261 1.182399
3 0.580368 1.985362 -0.835338
4 0.183974 -1.150667 1.571003, 'b': d e f
0 -0.911246 0.721034 -0.347018
1 0.483298 -0.553996 0.374566
2 -0.041415 -0.275874 -0.858687
3 0.105171 -1.509721 0.265802
4 -0.788434 0.648109 0.688839}
In [29]:
pd.concat([df for k,df in d.items()], axis=1)
Out[29]:
a b c d e f
0 0.030358 1.523752 1.040409 -0.911246 0.721034 -0.347018
1 -0.220019 -1.579467 -0.312059 0.483298 -0.553996 0.374566
2 1.019489 -0.272261 1.182399 -0.041415 -0.275874 -0.858687
3 0.580368 1.985362 -0.835338 0.105171 -1.509721 0.265802
4 0.183974 -1.150667 1.571003 -0.788434 0.648109 0.688839
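For the integer-keyed dictionary described in the question, a minimal sketch might look like the following. The frames stored in D here are made-up stand-ins, since the question does not show the real ones.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dictionary D from the question.
D = {
    i: pd.DataFrame(np.arange(6).reshape(3, 2) + 10 * i,
                    columns=[f"x{i}", f"y{i}"])
    for i in range(3)
}

# Concatenate side by side, in key order.
new_df = pd.concat([D[k] for k in sorted(D)], axis=1)

# pd.concat also accepts the mapping directly; the keys then become
# an outer level of the column index.
labelled = pd.concat(D, axis=1)

print(new_df)
print(labelled)
```

The second form is handy when you later need to know which source frame each column came from, for example via labelled[1].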
