pandas groupby to list of dicts - python

this is my data:
data = [
{'shape': 'circle', 'width': 10, 'height': 8},
{'shape': 'circle', 'width': 7, 'height': 2},
{'shape': 'square', 'width': 4, 'height': 6}
]
I am trying to group the rows by shape, with each group holding its width and height as x and y.
My final output should be a dict in the following format:
{
'circle': [
{'x': 10, 'y': 8},
{'x': 7, 'y': 2}
],
'square': [
{'x': 4, 'y': 6}
],
}
here is what I tried, which does not work
df = pd.DataFrame(data)
df = df.rename({'width': 'x', 'height': 'y'}, axis='columns')
df.groupby('shape').apply(
lambda s: s.do_dict()).to_dict()
What is the correct way to do it? Also, is there a way to do it without renaming the columns first, something like:
df.groupby('shape').apply(
lambda s: {'x': s['width'], 'y': s['height']}).to_dict()

I could not do it without renaming the columns, but how about something like this?
(df.rename(columns={'width': 'x', 'height': 'y'})
.groupby('shape')
.apply(lambda s: s[['x', 'y']].to_dict(orient='records'))
.to_dict())

It can be done with a dict comprehension (assuming df already has the renamed x and y columns):
>>> res = {i: df[df['shape'] == i][['x', 'y']].to_dict(orient='records') for i in set(df['shape'])}
>>> print(res)
{'circle': [{'x': 10, 'y': 8}, {'x': 7, 'y': 2}], 'square': [{'x': 4, 'y': 6}]}
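For the second part of the question (skipping the rename), one possible sketch is to iterate over the groups and build the dicts directly from the width and height columns. This assumes the data from the question; the name result is illustrative, and this is one option rather than necessarily the fastest.

```python
import pandas as pd

data = [
    {'shape': 'circle', 'width': 10, 'height': 8},
    {'shape': 'circle', 'width': 7, 'height': 2},
    {'shape': 'square', 'width': 4, 'height': 6},
]
df = pd.DataFrame(data)

# Iterating over a groupby yields (key, sub-frame) pairs.
# .tolist() converts NumPy integers to plain Python ints.
result = {
    shape: [
        {'x': w, 'y': h}
        for w, h in zip(group['width'].tolist(), group['height'].tolist())
    ]
    for shape, group in df.groupby('shape')
}
print(result)
```

The original width and height columns are left untouched, since the x and y keys are only introduced while building the dicts.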

Related

Pandas: Pivot multi-index, with one 'shared' column

I have a pandas dataframe that can be represented like:
test_dict = {('a', 1) : {'shared':0,'x':1, 'y':2, 'z':3},
('a', 2) : {'shared':1,'x':2, 'y':4, 'z':6},
('b', 1) : {'shared':0,'x':10, 'y':20, 'z':30},
('b', 2) : {'shared':1,'x':100, 'y':200, 'z':300}}
example = pd.DataFrame.from_dict(test_dict).T
I am trying to figure out a way to turn this into a dataframe that looks like this dictionary representation:
res_dict = {1 : {'shared':0,'a':{'x':1, 'y':2, 'z':3}, 'b':{'x':10, 'y':20, 'z':30}},
2 : {'shared':1,'a':{'x':2, 'y':4, 'z':6},'b':{'x':100, 'y':200, 'z':300}}}
Any suggestions appreciated!
Thanks
A possible solution, which uses only dataframe manipulations and then converts to dictionary:
xyz = ['x', 'y', 'z']
out = (example.assign(xyz=example[xyz].apply(list, axis=1)).reset_index()
.pivot(index='level_0', columns=['level_1', 'shared'], values='xyz')
.applymap(lambda x: dict(zip(xyz, x))))
out.columns = out.columns.rename(None, level=0)
out.index = out.index.rename(None)
(pd.concat([out.droplevel(1, axis=1),
out.columns.to_frame().reset_index(drop=True).iloc[:,1]
.to_frame().T.set_axis(out.columns.get_level_values(0), axis=1)])
.iloc[np.arange(-1, len(out))].to_dict())
Output:
{
1: {
'shared': 0,
'a': {'x': 1, 'y': 2, 'z': 3},
'b': {'x': 10, 'y': 20, 'z': 30}
},
2: {
'shared': 1,
'a': {'x': 2, 'y': 4, 'z': 6},
'b': {'x': 100, 'y': 200, 'z': 300}
}
}
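If the end goal is only the nested dictionary, a plain-Python sketch that skips the pivot may be easier to follow. It assumes that 'shared' has the same value for every outer key ('a', 'b') under a given inner key, as in the sample data; the name res is illustrative.

```python
test_dict = {
    ('a', 1): {'shared': 0, 'x': 1, 'y': 2, 'z': 3},
    ('a', 2): {'shared': 1, 'x': 2, 'y': 4, 'z': 6},
    ('b', 1): {'shared': 0, 'x': 10, 'y': 20, 'z': 30},
    ('b', 2): {'shared': 1, 'x': 100, 'y': 200, 'z': 300},
}

res = {}
for (outer, inner), values in test_dict.items():
    # Create the entry for this inner key once, seeded with 'shared'.
    entry = res.setdefault(inner, {'shared': values['shared']})
    # Store the remaining columns under the outer key.
    entry[outer] = {k: v for k, v in values.items() if k != 'shared'}

print(res)
```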

pandas include grouped value in to dict convert

here is my data
data = [
{'shape': 'circle', 'width': 10, 'height': 8},
{'shape': 'circle', 'width': 7, 'height': 2},
{'shape': 'square', 'width': 4, 'height': 6}
]
I am using pandas to aggregate the min and max height of each group.
My final result should be:
[
{'shape': 'circle', 'min': 2, 'max': 8},
{'shape': 'square', 'min': 6, 'max': 6}
]
here is what I tried:
df = pd.DataFrame(data)
my_dict = df.groupby('shape').height.agg(['min', 'max']).to_dict('records')
but this results in records without the 'shape' column:
[
{'min': 2, 'max': 8},
{'min': 6, 'max': 6}
]
How can I include the grouped-by column?
The group key is set as the index; try resetting it:
df.groupby('shape').height.agg(['min', 'max']).reset_index().to_dict('records')
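An alternative sketch, assuming pandas 0.25 or later (where named aggregation is available): pass as_index=False so the group key stays a regular column, and name the output columns in agg. The name records is illustrative.

```python
import pandas as pd

data = [
    {'shape': 'circle', 'width': 10, 'height': 8},
    {'shape': 'circle', 'width': 7, 'height': 2},
    {'shape': 'square', 'width': 4, 'height': 6},
]
df = pd.DataFrame(data)

records = (
    df.groupby('shape', as_index=False)          # keep 'shape' as a column
      .agg(min=('height', 'min'), max=('height', 'max'))
      .to_dict('records')
)
print(records)
```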

Cartesian (cross) products and np.unique()

Depending on your view of my approach, this is either a question about using np.unique() on awkward1 arrays or a call for a better approach:
Let a and b be two awkward1 arrays of the same outer length (number of events) but different inner lengths. For example:
a = [[1, 2], [3] , [] , [4, 5, 6]]
b = [[7] , [3, 5], [6], [8, 9]]
Let f: (x, y) -> z be a function that acts on two numbers x and y and results in the number z. For example:
f(x, y):= y - x
The idea is to compare every element in a with every element in b via f for each event and filter out the matches of a and b pairs that survive some cut applied to f. For example:
f(x, y) < 4
My approach for this is:
a = ak.from_iter(a)
b = ak.from_iter(b)
c = ak.cartesian({'x':a, 'y':b})
#c= [[{'x': 1, 'y': 7}, {'x': 2, 'y': 7}], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 4, 'y': 8}, {'x': 4, 'y': 9}, {'x': 5, 'y': 8}, {'x': 5, 'y': 9}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]
i = ak.argcartesian({'x':a, 'y':b})
#i= [[{'x': 0, 'y': 0}, {'x': 1, 'y': 0}], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}], [], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}, {'x': 1, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 0}, {'x': 2, 'y': 1}]]
diff = c['y'] - c['x']
#diff= [[6, 5], [0, 2], [], [4, 5, 3, 4, 2, 3]]
cut = diff < 4
#cut= [[False, False], [True, True], [], [False, False, True, False, True, True]]
new = c[cut]
#new= [[], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 5, 'y': 8}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]
new_i = i[cut]
#new_i= [[], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}], [], [{'x': 1, 'y': 0}, {'x': 2, 'y': 0}, {'x': 2, 'y': 1}]]
It is possible that pairs with the same element from a but different elements from b survive the cut. (e.g. {'x': 3, 'y': 3} and {'x': 3, 'y': 5})
My goal is to group those pairs with the same element from a together and therefore reshape the new array into:
new = [[], [{'x': 3, 'y': [3, 5]}], [], [{'x': 5, 'y': 8}, {'x': 6, 'y': [8, 9]}]]
My only idea how to achieve this is to create a list of the indexes from a that are still present after the cut by using new_i:
i = new_i['x']
#i= [[], [0, 0], [], [1, 2, 2]]
However, I need a unique version of this list so that every index appears only once. In NumPy this could be achieved with np.unique(), but that doesn't work in awkward1:
np.unique(i)
<__array_function__ internals> in unique(*args, **kwargs)
TypeError: no implementation found for 'numpy.unique' on types that implement __array_function__: [<class 'awkward1.highlevel.Array'>]
My question:
Is there a np.unique() equivalent in awkward1, and/or would you recommend a different approach to my problem?
Okay, I still don't know how to use np.unique() on my arrays, but I found a solution for my own problem:
In my previous approach I used the following code to pair up both arrays.
c = ak.cartesian({'x':a, 'y':b})
#c= [[{'x': 1, 'y': 7}, {'x': 2, 'y': 7}], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 4, 'y': 8}, {'x': 4, 'y': 9}, {'x': 5, 'y': 8}, {'x': 5, 'y': 9}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]
However, with the nested = True parameter from ak.cartesian() I get a list grouped by the elements of a:
c = ak.cartesian({'x':a, 'y':b}, axis = 1, nested = True)
#c= [[[{'x': 1, 'y': 7}], [{'x': 2, 'y': 7}]], [[{'x': 3, 'y': 3}, {'x': 3, 'y': 5}]], [], [[{'x': 4, 'y': 8}, {'x': 4, 'y': 9}], [{'x': 5, 'y': 8}, {'x': 5, 'y': 9}], [{'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]]
After the cut (recomputed from the nested c, so that it has the same nesting) I end up with:
new = c[cut]
#new= [[[], []], [[{'x': 3, 'y': 3}, {'x': 3, 'y': 5}]], [], [[], [{'x': 5, 'y': 8}], [{'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]]
I extract the y values and reduce the innermost layer of the nested lists of new to only one element:
y = new['y']
#y= [[[], []], [[3, 5]], [], [[], [8], [8, 9]]]
new = ak.firsts(new, axis = 2)
#new= [[None, None], [{'x': 3, 'y': 3}], [], [None, {'x': 5, 'y': 8}, {'x': 6, 'y': 8}]]
(I tried to use ak.firsts() with axis = -1, but it does not seem to be implemented yet.)
Now every innermost entry in new belongs to exactly one element from a. By replacing the current y of new with the previously extracted y I end up with my desired result:
new['y'] = y
#new= [[None, None], [{'x': 3, 'y': [3, 5]}], [], [None, {'x': 5, 'y': [8]}, {'x': 6, 'y': [8, 9]}]]
Anyway, should you know a better solution, I'd be pleased to hear it.
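For checking the result on small inputs, the intended grouping can also be written as a plain-Python reference. This is only an illustrative sketch: it does not use awkward1 and is not vectorized, and the helper name group_matches and its max_diff parameter are made up for this example. Unlike the awkward result above, it drops elements of a with no surviving partner instead of keeping None.

```python
def group_matches(a, b, max_diff=4):
    """For each event, pair every x in a with all y in b where y - x < max_diff."""
    grouped = []
    for xs, ys in zip(a, b):
        event = []
        for x in xs:
            matches = [y for y in ys if y - x < max_diff]
            if matches:  # skip x values with no surviving partner
                event.append({'x': x, 'y': matches})
        grouped.append(event)
    return grouped


a = [[1, 2], [3], [], [4, 5, 6]]
b = [[7], [3, 5], [6], [8, 9]]
print(group_matches(a, b))
```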

Scatter plot line to out of order data

I am tracking the movements of an avian animal. I have detection points on an xy plot. I want to connect the previous detected point to the next detection, regardless of direction. This will assist with removing extraneous detections.
Data sample:
[image: sample input]
The goal is to have a line from the previous data point to the next point.
[image: sample output]
Unsuccessful method 1:
plt.figure('Frame',figsize=(16,12))
plt.imshow(frame)
plt.plot(x, y, '-ro', 'd',markersize=2.5, color='orange')
[image: method 1 output]
Unsuccessful method 2:
plt.plot(np.sort(x), y[np.argsort(x)], '-bo', ms = 2)
[image: method 2 output]
I used your sample data and made a plot with method 1 (but with pandas), and the output was as you expected. I don't understand why you got an unsuccessful result.
data = [{'frame': 1, 'x': 5, 'y': 15},
{'frame': 4, 'x': 10, 'y': 15},
{'frame': 5, 'x': 15, 'y': 15},
{'frame': 6, 'x': 20, 'y': 15},
{'frame': 7, 'x': 23, 'y': 20},
{'frame': 8, 'x': 25, 'y': 25},
{'frame': 11, 'x': 20, 'y': 23},
{'frame': 15, 'x': 15, 'y': 20},
{'frame': 18, 'x': 8, 'y': 18},
{'frame': 19, 'x': 8, 'y': 10},
{'frame': 20, 'x': 12, 'y': 7}]
df = pd.DataFrame(data).sort_values('frame')
df.plot(x='x', y='y')
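The key step in this answer is ordering the points by frame (detection order) rather than by x, which is what method 2 in the question does. A minimal sketch of that ordering without pandas, using a shuffled subset of the sample data; the names detections, ordered, xs and ys are illustrative, and the resulting lists could then be passed to plt.plot(xs, ys, '-o').

```python
from operator import itemgetter

# A few points from the sample data, deliberately out of order.
detections = [
    {'frame': 8, 'x': 25, 'y': 25},
    {'frame': 1, 'x': 5, 'y': 15},
    {'frame': 11, 'x': 20, 'y': 23},
    {'frame': 4, 'x': 10, 'y': 15},
]

# Sort by frame so consecutive points are consecutive detections.
ordered = sorted(detections, key=itemgetter('frame'))
xs = [d['x'] for d in ordered]
ys = [d['y'] for d in ordered]
print(xs, ys)
```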

Select highest value from python list of dicts

In a list of lists of dicts:
A = [
[{'x': 1, 'y': 0}, {'x': 2, 'y': 3}, {'x': 3, 'y': 4}, {'x': 4, 'y': 7}],
[{'x': 1, 'y': 0}, {'x': 2, 'y': 2}, {'x': 3, 'y': 13}, {'x': 4, 'y': 0}],
[{'x': 1, 'y': 20}, {'x': 2, 'y': 4}, {'x': 3, 'y': 0}, {'x': 4, 'y': 8}]
]
I need to retrieve the dict with the highest 'y' value from each list of dicts, so the resulting list would contain:
Z = [(4, 7), (3,13), (1,20)]
In A, the 'x' is the key of each dict while 'y' is the value of each dict.
Any ideas? Thank you.
max accepts an optional key parameter.
A = [
[{'x': 1, 'y': 0}, {'x': 2, 'y': 3}, {'x': 3, 'y': 4}, {'x': 4, 'y': 7}],
[{'x': 1, 'y': 0}, {'x': 2, 'y': 2}, {'x': 3, 'y': 13}, {'x': 4, 'y': 0}],
[{'x': 1, 'y': 20}, {'x': 2, 'y': 4}, {'x': 3, 'y': 0}, {'x': 4, 'y': 8}]
]
Z = []
for a in A:
    # pick the dict with the largest 'y' in this inner list
    d = max(a, key=lambda d: d['y'])
    Z.append((d['x'], d['y']))
print(Z)
UPDATE
suggested by – J.F. Sebastian:
from operator import itemgetter
Z = [itemgetter(*'xy')(max(lst, key=itemgetter('y'))) for lst in A]
I'd use itemgetter and max's key argument:
from operator import itemgetter
pair_getter = itemgetter('x', 'y')
[pair_getter(max(d, key=itemgetter('y'))) for d in A]
[max(((d['x'], d['y']) for d in l), key=lambda t: t[1]) for l in A]
The solution to your stated problem has been given, but I suggest changing your underlying data structure. Tuples are much faster for small elements such as a point. You may retain the clarity of a dictionary by using namedtuple if you so desire.
>>> from collections import namedtuple
>>> A = [
[{'x': 1, 'y': 0}, {'x': 2, 'y': 3}, {'x': 3, 'y': 4}, {'x': 4, 'y': 7}],
[{'x': 1, 'y': 0}, {'x': 2, 'y': 2}, {'x': 3, 'y': 13}, {'x': 4, 'y': 0}],
[{'x': 1, 'y': 20}, {'x': 2, 'y': 4}, {'x': 3, 'y': 0}, {'x': 4, 'y': 8}]
]
Making a Point namedtuple is simple
>>> Point = namedtuple('Point', 'x y')
This is what an instance looks like
>>> Point(x=1, y=0) # Point(1, 0) also works
Point(x=1, y=0)
A would then look like this
>>> A = [[Point(**y) for y in x] for x in A]
>>> A
[[Point(x=1, y=0), Point(x=2, y=3), Point(x=3, y=4), Point(x=4, y=7)],
[Point(x=1, y=0), Point(x=2, y=2), Point(x=3, y=13), Point(x=4, y=0)],
[Point(x=1, y=20), Point(x=2, y=4), Point(x=3, y=0), Point(x=4, y=8)]]
Now working like this is much easier:
>>> from operator import attrgetter
>>> [max(row, key=attrgetter('y')) for row in A]
[Point(x=4, y=7), Point(x=3, y=13), Point(x=1, y=20)]
To retain the speed advantages of tuples it's better to access by index:
>>> from operator import itemgetter
>>> [max(row, key=itemgetter(1)) for row in A]
[Point(x=4, y=7), Point(x=3, y=13), Point(x=1, y=20)]
result = []
for item in A:
    # sort each inner list by 'y', largest first, and take the first entry
    new = sorted(item, key=lambda k: k['y'], reverse=True)
    result.append((new[0]['x'], new[0]['y']))
print(result)
Note: this is not the most efficient way to do this, but it is one way to get the required result.
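As a rough check, here is a Python 3 sketch comparing the max-based and sorted-based approaches on the data from the question. Both give the same pairs here, but max makes a single pass over each inner list, while sorted orders the whole list first. The names get_pair, by_max and by_sort are illustrative.

```python
from operator import itemgetter

A = [
    [{'x': 1, 'y': 0}, {'x': 2, 'y': 3}, {'x': 3, 'y': 4}, {'x': 4, 'y': 7}],
    [{'x': 1, 'y': 0}, {'x': 2, 'y': 2}, {'x': 3, 'y': 13}, {'x': 4, 'y': 0}],
    [{'x': 1, 'y': 20}, {'x': 2, 'y': 4}, {'x': 3, 'y': 0}, {'x': 4, 'y': 8}],
]

get_pair = itemgetter('x', 'y')  # returns the tuple (d['x'], d['y'])

# Single pass per inner list.
by_max = [get_pair(max(row, key=itemgetter('y'))) for row in A]

# Full sort per inner list, then take the first element.
by_sort = [get_pair(sorted(row, key=itemgetter('y'), reverse=True)[0]) for row in A]

print(by_max)
```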
