Dictionary-style x.update(y) for pandas DataFrames?

I am sitting in front of a probably very simple problem. I have two pandas DataFrames with some common indices, like so:

import pandas as pd

x = pd.DataFrame(index=[1, 2, 3, 4],
                 data={'d': [5, 5, 5, 5]})
y = pd.DataFrame(index=[3, 4, 5, 6],
                 data={'d': [6, 6, 6, 6]})
What I now want to do is update x with y. To me, this means three things:
The indices 1, 2 are only in x and not in y. Keep the values from x.
The indices 3, 4 are common indices in x and y. Update the values with the new info from y.
The indices 5, 6 are only in y. Add them with their respective values to x.
In total, the result should look like this:
x = pd.DataFrame(index=[1, 2, 3, 4, 5, 6],
                 data={'d': [5, 5, 6, 6, 6, 6]})
Thinking in terms of Python dictionaries, I tried x.update(y), which does steps 1 and 2 but not step 3.
I am confident that this is a one-liner, but I just cannot find it.
Addendum
I mentioned dictionaries (with the index as key); the approach there would look like this:

a = {1: 5,
     2: 5,
     3: 5,
     4: 5}
b = {3: 6,
     4: 6,
     5: 6,
     6: 6}
a.update(b)

After this, a is:

{1: 5, 2: 5, 3: 6, 4: 6, 5: 6, 6: 6}

You can call combine_first with y as the calling frame: y's values take priority on the shared indices, and indices missing from y are filled in from x:
In [75]:
y.combine_first(x)

Out[75]:
   d
1  5
2  5
3  6
4  6
5  6
6  6
You can't use update to achieve what you want, as update only modifies values for indices that already exist:

In [79]:
x.update(y)
x

Out[79]:
   d
1  5
2  5
3  6
4  6
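For reference, here is the whole exchange as one runnable snippet (a minimal sketch using the question's data):

import pandas as pd

x = pd.DataFrame(index=[1, 2, 3, 4], data={'d': [5, 5, 5, 5]})
y = pd.DataFrame(index=[3, 4, 5, 6], data={'d': [6, 6, 6, 6]})

# y's values win on the shared indices 3 and 4; indices unique to either
# frame (1, 2 from x and 5, 6 from y) are carried over unchanged.
result = y.combine_first(x)
print(result)  # d column: 5, 5, 6, 6, 6, 6 (possibly as floats, since the alignment introduces NaNs)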

Related

Dataframe age column grouping in pandas [duplicate]

It seems like a simple question, but I need your help.
For example, I have this df:

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 1, 8, 9, 6, 7, 4, 6]

How can I group 'x' into the ranges 1 to 5 and 6 to 10 and calculate the mean 'y' value for these two bins?
I expect to get a new df like:

x_grpd = [5, 10]
y_grpd = [3, 6.4]

The range of 'x' is given as an example. Ideally I want to be able to set any int value to get a different number of bins.
You can use cut and groupby.mean:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'y': [2, 1, 3, 1, 8, 9, 6, 7, 4, 6]})

bins = [5, 10]
df2 = (df
       .groupby(pd.cut(df['x'], [0] + bins,
                       labels=bins,
                       right=True))
       ['y'].mean()
       .reset_index()
      )
Output:

    x    y
0   5  3.0
1  10  6.4
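If the bin width should be configurable (as the question asks), the edges can be generated from a single integer instead of being hard-coded; a minimal sketch, where step is an assumed parameter name:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'y': [2, 1, 3, 1, 8, 9, 6, 7, 4, 6]})

step = 5  # assumed parameter: desired bin width
bins = list(range(step, df['x'].max() + step, step))  # [5, 10] for step=5

df2 = (df
       .groupby(pd.cut(df['x'], [0] + bins, labels=bins, right=True))
       ['y'].mean()
       .reset_index())
print(df2)  # step=5 reproduces the output above; step=2 would give five bins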

Python - Group(Cluster/Sort) arrays based on ranking information

I have a dataframe looks like this:
A B C D
0 5 4 3 2
1 4 5 3 2
2 3 5 2 1
3 4 2 5 1
4 4 5 2 1
5 4 3 5 1
...
I converted the dataframe into 2D arrays like this:
[[5 4 3 2]
[4 5 3 2]
[3 5 2 1]
[4 2 5 1]
[4 5 2 1]
[4 3 5 1]
...]
Each row holds the scores (1 to 5) that one person gives to items A, B, C and D. I would like to identify the people who share the same ranking, for example everyone who thinks A > B > C > D, and then regroup the arrays based on that ranking information, like this:
2DArray1: [[5 4 3 2]]
2DArray2: [[4 5 3 2]
[3 5 2 1]
[4 5 2 1]]
2DArray3: [[4 2 5 1]
[4 3 5 1]]
For example, 2DArray2 contains the people who think B > A > C > D, and 2DArray3 the people who think C > A > B > D. I tried different sort functions in numpy but could not find a suitable one. How should I do this?
Numpy doesn't have a groupby function, because a groupby would return a list of lists of different sizes, whereas numpy mostly deals with "rectangular" arrays.
A workaround would be to sort the rows so that similar rows are adjacent, then produce an array of the indices of the beginning of each group.
Since I'm too lazy to do that, here is a solution without numpy instead:
Index by the permutation directly
For each row, we compute the corresponding permutation of 'ABCD'. Then, we add the row to a dict of lists of rows, where the dictionary keys are the corresponding permutations.
from collections import defaultdict

a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
    (0, 1, 2, 3): [[5, 4, 3, 2]],
    (1, 0, 2, 3): [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
    (2, 0, 1, 3): [[4, 2, 5, 1], [4, 3, 5, 1]]
})
Note that with this solution, the results might not be what you expect if some users give the same score to two different items: sorted doesn't preserve ties as ties; it breaks them by order of appearance (in this case, ties between two items are broken alphabetically).
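A small illustration of that tie-breaking (the scores are made up):

row = [5, 5, 3, 2]  # the first two items (A and B) tie
# sorted() is stable, so equal scores keep their original (alphabetical) order
print(tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)))
# (0, 1, 2, 3) -- read as A > B > C > D, even though A and B are tied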
Index by the index of the permutation
The permutations of 'ABCD' can be ordered lexicographically: 'ABCD' comes first, then 'ABDC' comes second, then 'ACBD' comes third...
As it turns out, there is an algorithm to compute the index at which a given permutation would come in that sequence! And that algorithm is implemented in python module more_itertools:
more_itertools.permutation_index
So, we can replace our tuple key tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)) by a simple number key permutation_index(row, sorted(row, reverse=True)).
from collections import defaultdict
from more_itertools import permutation_index

a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[permutation_index(row, sorted(row, reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
    0: [[5, 4, 3, 2]],
    6: [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
    8: [[4, 2, 5, 1], [4, 3, 5, 1]]
})
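As a quick sanity check on those keys (my illustration, using strings): index 6 is the 0-based lexicographic position of the B > A > C > D pattern among the permutations of 'ABCD':

from more_itertools import permutation_index

print(permutation_index('BACD', 'ABCD'))  # 6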
Mixing permutation_index and pandas
Since the output of permutation_index is a simple number, we can easily include it in a numpy array or a pandas dataframe as a new column:
import pandas as pd
from more_itertools import permutation_index
df = pd.DataFrame({'A': [5,4,3,4,4,4], 'B': [4,5,5,2,5,3], 'C': [3,3,2,5,2,5], 'D': [2,2,1,1,1,1]})
df['perm_idx'] = df.apply(lambda row: permutation_index(row, sorted(row, reverse=True)), axis=1)
print(df)
   A  B  C  D  perm_idx
0  5  4  3  2         0
1  4  5  3  2         6
2  3  5  2  1         6
3  4  2  5  1         8
4  4  5  2  1         6
5  4  3  5  1         8
for idx, sub_df in df.groupby('perm_idx'):
    print(idx)
    print(sub_df)
0
   A  B  C  D  perm_idx
0  5  4  3  2         0
6
   A  B  C  D  perm_idx
1  4  5  3  2         6
2  3  5  2  1         6
4  4  5  2  1         6
8
   A  B  C  D  perm_idx
3  4  2  5  1         8
5  4  3  5  1         8
You can
(i) transpose df and convert it to a dictionary,
(ii) sort each person's dictionary by value and get the keys,
(iii) join the sorted keys for each "person" and assign the result to df['ranks'],
(iv) collect each row's scores into a list and assign it to df['pref'],
(v) groupby('ranks') and create lists from pref:
df = pd.DataFrame({'A': {0: 5, 1: 4, 2: 3, 3: 4, 4: 4, 5: 4},
                   'B': {0: 4, 1: 5, 2: 5, 3: 2, 4: 5, 5: 3},
                   'C': {0: 3, 1: 3, 2: 2, 3: 5, 4: 2, 5: 5},
                   'D': {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}})

df['ranks'] = pd.Series({k: ''.join(list(zip(*sorted(v.items(), key=lambda d: d[1],
                                                     reverse=True)))[0])
                         for k, v in df.T.to_dict().items()})
df['pref'] = df.loc[:, 'A':'D'].values.tolist()
out = df[['ranks', 'pref']].groupby('ranks').agg(list).to_dict()['pref']
Output:
{'ABCD': [[5, 4, 3, 2]],
'BACD': [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
'CABD': [[4, 2, 5, 1], [4, 3, 5, 1]]}

Return a list of Element by discarding the Sequential Occurrence of Elements

Input 1:
test_list = [1, 1, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 6]
Output:
[3, 5, 7, 6]
Explanation: (1 1), (4 4 4), (6 6) and (8 8) are consecutive runs, so they are dropped. The 6 following the (8 8) run is not part of any consecutive run, so it is kept.
Input 2:
test_list = [1, 1, 3, 4, 4, 4, 5, 4, 6, 6, 7, 8, 8, 6]
Output:
[3, 5, 4, 7, 6]
Likewise, for the second input the run (4 4 4) is dropped, but the 4 between 5 and 6 stands alone and is kept.
Any suggestion for achieving the expected output?
(I am looking for a somewhat elaborated algorithm.)
You can use itertools.groupby to group adjacent identical values, then keep only the values whose group has length 1.
>>> from itertools import groupby
>>> test_list = [1, 1, 3, 4, 4, 4, 5,6, 6, 7, 8, 8, 6]
>>> [k for k, g in groupby(test_list) if len(list(g)) == 1]
[3, 5, 7, 6]
>>> test_list = [1, 1, 3, 4, 4, 4, 5,4,6, 6, 7, 8, 8, 6]
>>> [k for k, g in groupby(test_list) if len(list(g)) == 1]
[3, 5, 4, 7, 6]
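A small variant of the same idea (my note, not from the answer) that avoids materializing each group as a list, by counting the group instead:

>>> [k for k, g in groupby(test_list) if sum(1 for _ in g) == 1]
[3, 5, 4, 7, 6]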
First of all, you need to know that reassigning i inside your for loop does not affect the iteration. You can check this by running the following code:

for i in range(5):
    print(i)
    i = 2

This code will print 0 1 2 3 4, not 0 2 2 2 2 as you might expect.
Going back to your question: I would use groupby from itertools, but since you specified you don't want to use it, I would do something like this:

res_list = []
if test_list[0] != test_list[1]:  # <-- check if the first element belongs in the result
    res_list.append(test_list[0])
for i in range(len(test_list[1:-1])):  # walk the input list without its first and last element
    if test_list[i+1] == test_list[i+2] or test_list[i+1] == test_list[i]:
        continue
    else:
        res_list.append(test_list[i+1])
if test_list[-2] != test_list[-1]:  # <-- check if the last element belongs in the result
    res_list.append(test_list[-1])
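Wrapped in a function for a quick check (same logic; drop_runs is an illustrative name, and an input of at least two elements is assumed):

def drop_runs(test_list):
    # keep only the elements that differ from both neighbours
    res_list = []
    if test_list[0] != test_list[1]:
        res_list.append(test_list[0])
    for i in range(1, len(test_list) - 1):
        if test_list[i] != test_list[i - 1] and test_list[i] != test_list[i + 1]:
            res_list.append(test_list[i])
    if test_list[-2] != test_list[-1]:
        res_list.append(test_list[-1])
    return res_list

print(drop_runs([1, 1, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 6]))      # [3, 5, 7, 6]
print(drop_runs([1, 1, 3, 4, 4, 4, 5, 4, 6, 6, 7, 8, 8, 6]))   # [3, 5, 4, 7, 6]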

Maximum of an array constituting a pandas dataframe cell

I have a pandas dataframe in which a column is formed by arrays. So every cell is an array.
Say there is a column A in dataframe df, such that
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     ...]

I want to operate on each array and get, e.g., the maximum of each array, and store it in another column.
In the example, I would like to obtain another column

B = [3,
     6,
     9,
     ...]
I have tried these approaches so far, none of which gives what I want:

df['B'] = np.max(df['A'])
df.applymap(lambda B: A.max())
df['B'] = df.applymap(lambda B: np.max(np.array(df['A'].tolist()), 0))
How should I proceed? And is this the best way to have my dataframe organized?
You can just apply(max). It doesn't matter if the values are lists or np.array.
df = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
df['b'] = df['a'].apply(max)
print(df)
Output:

           a  b
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
Here is one way without apply:

df['B'] = np.max(df['A'].values.tolist(), axis=1)

           A  B
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
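One caveat on the list-based version (my note, not from the answer): np.max first stacks the lists into a rectangular 2-D array, so it assumes every list has the same length; the apply(max) version has no such restriction.

import numpy as np

print(np.max([[1, 2, 3], [4, 5, 6], [7, 8, 9]], axis=1))  # [3 6 9]
# a ragged column such as [[1, 2], [3]] cannot be stacked into a
# rectangular array and would raise an error here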

picking values from columns [duplicate]

This question already has answers here:
Vectorized lookup on a pandas dataframe
(3 answers)
Closed 3 years ago.
I have a pandas DataFrame with values in a number of columns, make it two for simplicity, and a column of column names I want to use to pick values from the other columns:
import pandas as pd
import numpy as np

np.random.seed(1337)
df = pd.DataFrame(
    {"a": np.arange(10), "b": 10 - np.arange(10), "c": np.random.choice(["a", "b"], 10)}
)
which gives
> df['c']
0    b
1    b
2    a
3    a
4    b
5    b
6    b
7    a
8    a
9    a
Name: c, dtype: object
That is, I want the first and second elements to be picked from column b, the third from a and so on.
This works:
def pick_vals_from_cols(df, col_selector):
    condlist = np.row_stack(col_selector.map(lambda x: x == df.columns))
    values = np.select(condlist.transpose(), df.values.transpose())
    return values
> pick_vals_from_cols(df, df["c"])
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9], dtype=object)
But it just feels so fragile and clunky. Is there a better way to do this?
lookup
df.lookup(df.index, df.c)
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9])
Comprehension
But why, when you have lookup?
[df.at[t] for t in df.c.items()]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Bonus Hack
Not intended for actual use
[*map(df.at.__getitem__, zip(df.index, df.c))]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Because df.get_value is deprecated
[*map(df.get_value, df.index, df.c)]
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
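A note for newer pandas (not part of the original answer): DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A common NumPy-based replacement, sketched under that assumption:

import numpy as np

rows = np.arange(len(df))
cols = df.columns.get_indexer(df['c'])  # map the 'a'/'b' labels to column positions
vals = df.to_numpy()[rows, cols]
# array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9], dtype=object)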
