I have a dataframe:
ID T Q1 P Q2
10 xy 1 pq 2
20 yz 1 rs 1
20 ab 1 tu 2
30 cd 2 cu 2
30 xy 1 mu 1
30 bb 1 bc 1
Now I need a dictionary with ID as the key and the rest of each row's column values, as lists, for the dictionary's value.
Output so far (only the last row per ID is kept):
{10:['xy',1,'pq',2]}
{20:['ab',1,'tu',2]}
{30:['bb',1,'bc',1]}
Expected result:
{10:[['xy',1,'pq',2]]}
{20:[['yz',1,'rs',1],['ab',1,'tu',2]]}
{30:[['cd',2,'cu',2],['xy',1,'mu',1],['bb',1,'bc',1]]}
Try:
x = (
    df.groupby("ID")
    .apply(lambda g: g.iloc[:, 1:].values.tolist())  # one inner list per row, ID column dropped
    .to_dict()
)
print(x)
Prints:
{10: [['xy', 1, 'pq', 2]],
20: [['yz', 1, 'rs', 1], ['ab', 1, 'tu', 2]],
30: [['cd', 2, 'cu', 2], ['xy', 1, 'mu', 1], ['bb', 1, 'bc', 1]]}
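The same dictionary can also be built without apply by iterating the groups directly; an equivalent sketch (not the only way):
x = {k: g.iloc[:, 1:].values.tolist() for k, g in df.groupby("ID")}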
Let's assume I have the following DataFrame:
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df
Out[10]:
a b c d
0 1 1 f 10
1 1 1 f 20
2 2 1 f 30
3 2 1 e 40
4 2 2 f 50
5 2 2 f 60
6 3 1 f 70
7 3 1 e 80
8 3 2 f 90
9 3 2 f 100
In the following, I want to take the values of columns a and b where c == 'e' and use those values to select the corresponding rows of df (which would select rows 2, 3, 6 and 7). The idea is to create a list of tuples and index df by that list:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new
Out[13]:
c d
a b
1 1 f 10
1 f 20
2 1 f 30
1 e 40
2 f 50
2 f 60
3 1 f 70
1 e 80
2 f 90
2 f 100
list_tup
Out[14]: [(2, 1), (3, 1)]
df.loc[list_tup]
This results in a TypeError: unhashable type: 'writeable void-scalar', which I don't understand. Any suggestions? I'm pretty new to Python and pandas, so I assume I'm missing something fundamental.
I believe it's better to use groupby().transform() and boolean indexing in this use case:
valids = (df['c'].eq('e')              # check if `c` is 'e'
          .groupby([df['a'], df['b']]) # group by `a` and `b`
          .transform('any')            # check if `True` occurs in the group;
                                       # use the same label for all rows in group
)
# filter with boolean indexing
df[valids]
Output:
a b c d
2 2 1 f 30
3 2 1 e 40
6 3 1 f 70
7 3 1 e 80
A similar idea with groupby().filter(), which is more readable but can be slightly slower:
df.groupby(['a','b']).filter(lambda x: x['c'].eq('e').any())
You could try an inner join.
import pandas as pd
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df.merge(df.loc[df['c']=='e', ['a','b']], on=['a','b'])
Output
a b c d
0 2 1 f 30
1 2 1 e 40
2 3 1 f 70
3 3 1 e 80
Maybe try a MultiIndex:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Full code:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Outputs:
c d
a b
2 1 f 30
1 e 40
3 1 f 70
1 e 80
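As an aside, the TypeError in the question comes from to_records() producing writeable NumPy void scalars, which are unhashable. Converting each record to a plain tuple before indexing df_new also works; a sketch of that fix:
# make the keys hashable by converting each record to a plain tuple
list_tup = [tuple(r) for r in df.loc[df['c'] == 'e', ['a', 'b']].to_records(index=False)]
df_new.loc[list_tup]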
Let us try a boolean mask:
keys = df.loc[df['c'] == 'e', ['a', 'b']].to_records(index=False).tolist()
m = df[['a', 'b']].apply(tuple, axis=1).isin(keys)
out = df[m].copy()
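A variant of the same mask that avoids building Python tuples row by row; a sketch assuming pandas >= 0.24 for MultiIndex.from_frame:
# (a, b) pairs where c == 'e', as plain tuples
keys = list(df.loc[df['c'] == 'e', ['a', 'b']].itertuples(index=False, name=None))
# vectorized membership test against a MultiIndex built from the two columns
m = pd.MultiIndex.from_frame(df[['a', 'b']]).isin(keys)
out = df[m].copy()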
I'm trying to aggregate a DataFrame such that for each from and each to given in the mappings table (e.g. .iloc[0], where a maps to b), we take the corresponding f# (feature) columns from the labels table and count the number of times that feature mapping occurred.
The expected output is given in the output table.
Example: in the output table we can see there are 4 cases where a from element with feature f1 mapped to a to element with feature f2. We can deduce these as being a->b, a->c, d->e, and d->g.
Mappings
from to
0 a b
1 a c
2 d e
3 d f
4 d g
Labels
name f1 f2 f3
0 a 1 0 0
1 b 0 1 0
2 c 0 1 0
3 d 1 1 0
4 e 0 1 0
5 f 0 0 1
6 g 1 1 0
Output
f1 f2 f3
f1 1 4 1
f2 1 2 1
f3 0 0 0
Table construction code
# dataframe 1 - the mappings
mappings = pd.DataFrame({
'from': ['a', 'a', 'd', 'd', 'd'],
'to': ['b', 'c', 'e', 'f', 'g']
})
# dataframe 2 - the labels
labels = pd.DataFrame({
'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'f1': [1, 0, 0, 1, 0, 0, 1],
'f2': [0, 1, 1, 1, 1, 0, 1],
'f3': [0, 0, 0, 0, 0, 1, 0],
})
# dataframe 3 - the expected output
output = pd.DataFrame(
index = ['f1', 'f2', 'f3'],
data = {
'f1': [1, 1, 0],
'f2': [4, 2, 0],
'f3': [1, 1, 0],
})
First we melt the labels dataframe from columns to rows so we can easily match on them. Then we merge these values onto the mappings and finally use crosstab to get the final result:
labels = labels.set_index('name').where(lambda x: x > 0).melt(ignore_index=False).dropna()
df = (
mappings.merge(labels.add_suffix('_from'), left_on='from', right_on='name')
.merge(labels.add_suffix('_to'), left_on='to', right_on='name')
)
final = pd.crosstab(index=df['variable_from'], columns=df['variable_to'])
final = (
final.reindex(index=final.columns, fill_value=0)
.rename_axis(index=None, columns=None)
).convert_dtypes()
Output
f1 f2 f3
f1 1 4 1
f2 1 2 1
f3 0 0 0
Note:
melt(ignore_index=False) requires pandas >= 1.1.0
convert_dtypes requires pandas >= 1.0.0
For pandas < 1.1.0 we can use stack instead of melt:
(
labels.set_index('name')
.where(lambda x: x > 0)
.stack()
.reset_index(level=1)
.rename(columns={'level_1': 'variable', 0: 'value'})
)
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the top N values per group.
However, I prefer the dictionary method with the agg function, and I would like to know whether it is possible to get an equivalent of the dictionary method for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': [1, 1, 2, 2, 1],
'C': [10, 20, 30, 40, 50],
'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
A B C D
0 1 1 10 X
1 1 1 20 Y
2 1 2 30 X
3 2 2 40 Y
4 2 1 50 Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
A C
0 1 30
1 1 20
2 2 50
3 2 40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2)}  # something like this
)
Is it possible to use the dictionary here?
If you want a dictionary mapping each value of A to its 2 top values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
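For what it's worth, the dict-of-column agg syntax from the question also appears to work when the lambda returns a list, much like the common groupby(...).agg(list) idiom; a sketch, not guaranteed across all pandas versions:
out = df.groupby('A').agg({'C': lambda s: s.nlargest(2).tolist()})
print(out['C'].to_dict())
# {1: [30, 20], 2: [50, 40]}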
I'm working with a data frame containing 582,260 rows and 24 columns. Each row is a time series of length 24 (one value per hour), and every 20 rows (days) correspond to one id: 20 rows to id_1, 20 to id_2, and so on up to id_N. I would like to concatenate the 20 rows of id_1 into a single row, so that the concatenated time series becomes a vector of length 480 (20 days * 24 hrs/day), and to repeat this operation from id_1 to id_N.
A very reduced and reproducible version of my data frame is shown below (the ID column should be an index, but I reset it for iteration purposes):
df = pd.DataFrame([['id1', 1, 1, 3, 4, 1], ['id1', 0, 1, 5, 2, 1], ['id1', 3, 4, 5, 0, 0],
['id2', 1, 1, 8, 0, 6], ['id2', 5, 3, 1, 1, 2], ['id2', 5, 4, 5, 2, 7]],
columns = ['ID', 'h0', 'h1', 'h2', 'h3', 'h4'] )
I've tried the following function to iterate over the rows of the data frame, but it doesn't give me the expected output.
def concatenation(df):
    for i, row in df.iterrows():
        if df.loc[i, 'ID'] == df.loc[i + 1, 'ID']:
            pd.concat([df], axis=1)
    return df
concatenation(df)
The expected output should look like this:
df = pd.DataFrame([['id1', 1, 1, 3, 4, 1, 0, 1, 5, 2, 1, 3, 4, 5, 0, 0],
['id2', 1, 1, 8, 0, 6, 5, 3, 1, 1, 2, 5, 4, 5, 2, 7]],
columns = ['ID', 'h0', 'h1', 'h2', 'h3', 'h4',
'h0', 'h1', 'h2', 'h3', 'h4',
'h0', 'h1', 'h2', 'h3', 'h4'])
Is there a compact and elegant way of programming this task with pandas tools?
Thank you in advance for your help.
First add a day column, then create a hierarchical index of ID and day, which then gets unstacked:
df['day'] = df.groupby('ID').cumcount()
df = df.set_index(['ID','day'])
res = df.unstack()
Intermediate result:
h0 h1 h2 h3 h4
day 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
ID
id1 1 0 3 1 1 4 3 5 5 4 2 0 1 1 0
id2 1 5 5 1 3 4 8 1 5 0 1 2 6 2 7
Now we flatten the index and re-order the columns as requested:
res.columns = [f"{y}{x}" for x, y in res.columns]
res = res.reindex(sorted(res.columns), axis=1)
Final result:
0h0 0h1 0h2 0h3 0h4 1h0 1h1 1h2 1h3 1h4 2h0 2h1 2h2 2h3 2h4
ID
id1 1 1 3 4 1 0 1 5 2 1 3 4 5 0 0
id2 1 1 8 0 6 5 3 1 1 2 5 4 5 2 7
You can use defaultdict(list) and the .extend() method to store all the values in exact order and produce the same output you defined.
But this requires a crude loop, which is not recommended for large dataframes.
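A minimal sketch of that idea on the example df from the question (order is preserved because rows are visited top to bottom):
from collections import defaultdict

flat = defaultdict(list)
for _, row in df.iterrows():                       # crude row-by-row loop
    flat[row['ID']].extend(row.iloc[1:].tolist())  # append this day's hourly values

print(dict(flat))
# {'id1': [1, 1, 3, 4, 1, 0, 1, 5, 2, 1, 3, 4, 5, 0, 0],
#  'id2': [1, 1, 8, 0, 6, 5, 3, 1, 1, 2, 5, 4, 5, 2, 7]}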
I need to do a fuzzy groupby where a single record can be in one or more groups.
I have a DataFrame like this:
test = pd.DataFrame({'score1': pd.Series(['a', 'b', 'c', 'd', 'e']), 'score2': pd.Series(['b', 'a', 'k', 'n', 'c'])})
Output:
score1 score2
0 a b
1 b a
2 c k
3 d n
4 e c
I wish to have groups like this:
The group keys should be the union of the unique values between score1 and score2. Record 0 should be in groups a and b because it contains both score values. Similarly record 1 should be in groups b and a; record 2 should be in groups c and k and so on.
I've tried doing a groupby on two columns like this:
In [192]: score_groups = test.groupby(['score1', 'score2'])
However, I get the group keys as tuples - ('a', 'b'), ('b', 'a'), ('c', 'k'), etc. - instead of unique group keys where records can be in multiple groups. The output is shown below:
In [192]: score_groups.groups
Out[192]: {('a', 'b'): [0],
('b', 'a'): [1],
('c', 'k'): [2],
('d', 'n'): [3],
('e', 'c'): [4]}
Also, I need the indexes preserved because I'm using them for another operation later.
Please help!
Combine the two columns into a single column using e.g. pd.concat():
s = pd.concat([test['score1'], test['score2']]).reset_index()
s.columns = ['val', 'grp']
val grp
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 0 b
6 1 a
7 2 k
8 3 n
9 4 c
And then .groupby() on 'grp' and collect 'val' in a list:
s = s.groupby('grp').apply(lambda x: x.val.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
Or, if you prefer a dict:
s.to_dict()
{'e': [4], 'd': [3], 'n': [3], 'k': [2], 'a': [0, 1], 'c': [2, 4], 'b': [1, 0]}
Or, to the same effect in a single step, skipping renaming the columns:
test.unstack().reset_index(-1).groupby(0).apply(lambda x: x.level_1.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
Using Stefan's help, I solved it like this.
In [283]: frame1 = test[['score1']]
frame2 = test[['score2']]
frame2 = frame2.rename(columns={'score2': 'score1'})
test = pd.concat([frame1, frame2])
test
Out[283]:
score1
0 a
1 b
2 c
3 d
4 e
0 b
1 a
2 k
3 n
4 c
Notice the duplicate indices. The indexes have been preserved, which is what I wanted. Now, let's get to business - the group by operation.
In [283]: groups = test.groupby('score1')
groups.get_group('a') # Get group with key a
Out[283]:
score1
0 a
1 a
In [283]: groups.get_group('b') # Get group with key b
Out[283]:
score1
1 b
0 b
In [283]: groups.get_group('c') # Get group with key c
Out[283]:
score1
2 c
4 c
In [283]: groups.get_group('k') # Get group with key k
Out[283]:
score1
2 k
I'm baffled by how pandas retrieves rows with the correct index even though the indices are duplicated. As I understand it, the group by operation uses an inverted index data structure to store the references (indexes) to rows. Any insights would be greatly appreciated. Anyone who answers this will have their answer accepted :)
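Not the actual internals, but you can see why the duplicated labels are harmless: GroupBy exposes both a label-based and a position-based view of the groups, and the grouping machinery works on integer positions:
print(groups.groups)   # group key -> index labels (these may repeat)
print(groups.indices)  # group key -> integer positions, e.g. {'a': array([0, 6]), ...}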
Reorganize your data for ease of manipulation; having multiple value columns for the same kind of data will always cause you headaches:
import pandas as pd
test = pd.DataFrame({'score1' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series([2, 1, 8, 9, 3], index=['a', 'b', 'c', 'd', 'e'])})
test['name'] = test.index
result = pd.melt(test, id_vars=['name'], value_vars=['score1', 'score2'])
name variable value
0 a score1 1
1 b score1 2
2 c score1 3
3 d score1 4
4 e score1 5
5 a score2 2
6 b score2 1
7 c score2 8
8 d score2 9
9 e score2 3
Now you have only one column for your values and it's easy to group by score or select by your name column:
hey = result.groupby('value')
hey.groups
#below are the indices that you care about
{1: [0, 6], 2: [1, 5], 3: [2, 9], 4: [3], 5: [4], 8: [7], 9: [8]}
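From there, an illustrative lookup: get_group() pulls every row carrying a given score, regardless of which original column it came from:
hey.get_group(2)
#   name variable  value
# 1    b   score1      2
# 5    a   score2      2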