I'm trying to aggregate a DataFrame so that, for each from/to pair in the mappings table (e.g. .iloc[0], where a maps to b), we take the corresponding f# (feature) columns from the labels table and count how many times each feature-to-feature mapping occurs.
The expected output is given in the output table.
Example: in the output table there are 4 cases where a from element with an f1 feature mapped to a to element with an f2 feature. These are a->b, a->c, d->e, and d->g.
Mappings
from to
0 a b
1 a c
2 d e
3 d f
4 d g
Labels
name f1 f2 f3
0 a 1 0 0
1 b 0 1 0
2 c 0 1 0
3 d 1 1 0
4 e 0 1 0
5 f 0 0 1
6 g 1 1 0
Output
f1 f2 f3
f1 1 4 1
f2 1 2 1
f3 0 0 0
Table construction code
import pandas as pd

# dataframe 1 - the mappings
mappings = pd.DataFrame({
    'from': ['a', 'a', 'd', 'd', 'd'],
    'to': ['b', 'c', 'e', 'f', 'g']
})
# dataframe 2 - the labels
labels = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'f1': [1, 0, 0, 1, 0, 0, 1],
    'f2': [0, 1, 1, 1, 1, 0, 1],
    'f3': [0, 0, 0, 0, 0, 1, 0],
})
# dataframe 3 - the expected output
output = pd.DataFrame(
    index=['f1', 'f2', 'f3'],
    data={
        'f1': [1, 1, 0],
        'f2': [4, 2, 0],
        'f3': [1, 1, 0],
    })
First we melt your labels dataframe from columns to rows, so we can easily match on them. Then we merge these values to our mapping and finally use crosstab to get your final result:
labels = labels.set_index('name').where(lambda x: x > 0).melt(ignore_index=False).dropna()
df = (
mappings.merge(labels.add_suffix('_from'), left_on='from', right_on='name')
.merge(labels.add_suffix('_to'), left_on='to', right_on='name')
)
final = pd.crosstab(index=df['variable_from'], columns=df['variable_to'])
final = (
final.reindex(index=final.columns, fill_value=0)
.rename_axis(index=None, columns=None)
).convert_dtypes()
Output
f1 f2 f3
f1 1 4 1
f2 1 2 1
f3 0 0 0
Note:
melt(ignore_index=False) requires pandas >= 1.1.0
convert_dtypes requires pandas >= 1.0.0
For pandas < 1.1.0 we can use stack instead of melt:
(
labels.set_index('name')
.where(lambda x: x > 0)
.stack()
.reset_index(level=1)
.rename(columns={'level_1': 'variable', 0: 'value'})
)
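As a cross-check on the result above, the same counts can also be obtained with a matrix product instead of melt and crosstab. This is a different technique from the answer, sketched under the assumption that every name in mappings appears exactly once in labels and that the feature columns hold only 0/1 values; feats, src, dst and counts are names introduced here for illustration.

```python
import pandas as pd

mappings = pd.DataFrame({
    'from': ['a', 'a', 'd', 'd', 'd'],
    'to': ['b', 'c', 'e', 'f', 'g'],
})
labels = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'f1': [1, 0, 0, 1, 0, 0, 1],
    'f2': [0, 1, 1, 1, 1, 0, 1],
    'f3': [0, 0, 0, 0, 0, 1, 0],
})

# One row of 0/1 features per name (assumes names are unique)
feats = labels.set_index('name')

# Feature rows for the source and target of every mapping, in mapping order
src = feats.loc[mappings['from']].to_numpy()
dst = feats.loc[mappings['to']].to_numpy()

# Entry (i, j) counts the mappings whose source has feature i and target has feature j
counts = pd.DataFrame(src.T @ dst, index=feats.columns, columns=feats.columns)
print(counts)
```

Each mapping contributes the outer product of its source and target feature rows, so summing over all mappings is exactly src.T @ dst.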
Related
I have two DataFrames, given here as dicts:
df1 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df2 = {'col_1': [3, 2, 1, 3]}
I want the result as follows
df3 = {'col_1': [3, 2, 1, 3], 'col_2': ['a', 'b', 'c', 'a']}
col_2 of the new df should be taken from col_2 of df1, looked up by the matching col_1 value.
Add the new column by mapping the values from df1 after setting its first column as the index (this assumes df1 and df2 have already been converted with pd.DataFrame()):
df3 = df2.copy()
df3['col_2'] = df2['col_1'].map(df1.set_index('col_1')['col_2'])
output:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
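Because df1 and df2 are given as plain dicts in the question, a self-contained version of the mapping step might look like the sketch below. It assumes the col_1 values in df1 are unique, so that each value maps to a single col_2 entry; frame1 and frame2 are names introduced here.

```python
import pandas as pd

df1 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df2 = {'col_1': [3, 2, 1, 3]}

# Convert the dicts to DataFrames first
frame1 = pd.DataFrame(df1)
frame2 = pd.DataFrame(df2)

# Look up each col_1 value of frame2 in a Series indexed by frame1's col_1
df3 = frame2.copy()
df3['col_2'] = frame2['col_1'].map(frame1.set_index('col_1')['col_2'])
print(df3)
```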
You can do it with merge after converting the dicts to df with pd.DataFrame():
output = pd.DataFrame(df2)
output = output.merge(pd.DataFrame(df1),on='col_1',how='left')
Or in a one-liner:
output = pd.DataFrame(df2).merge(pd.DataFrame(df1),on='col_1',how='left')
Outputs:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
This could be a simple way of doing it, assuming df1 and df2 are DataFrames:
# use df1 to create a lookup dictionary
lookup = df1.set_index("col_1").to_dict()["col_2"]
# look up each value from df2's "col_1" in the lookup dict
df2["col_2"] = df2["col_1"].apply(lambda d: lookup[d])
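One thing to watch with a plain dict lookup is that lookup[d] raises KeyError for a value that is missing from df1. A hedged variant, assuming unmatched values should become NaN instead, passes the dict to Series.map; the extra value 9 is made up for the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col_1': [3, 2, 1, 9]})  # 9 has no match in df1 (made-up value)

# Build the lookup dictionary from df1
lookup = df1.set_index('col_1')['col_2'].to_dict()

# map leaves a missing value where the key is not in the dict, instead of raising
df2['col_2'] = df2['col_1'].map(lookup)
print(df2)
```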
I have one dataframe -
ID T Q1 P Q2
10 xy 1 pq 2
20 yz 1 rs 1
20 ab 1 tu 2
30 cd 2 cu 2
30 xy 1 mu 1
30 bb 1 bc 1
Now I need a dictionary with ID as the key and the remaining column values of each row as a list in the dictionary's value.
Current output (only the last row per ID is kept):
{10:['xy',1,'pq',2]}
{20:['ab',1,'tu',2]}
{30:['bb',1,'bc',1]}
Expected result:
{10:[['xy',1,'pq',2]]}
{20:[['yz',1,'rs',1],['ab',1,'tu',2]]}
{30:[['cd',2,'cu',2],['xy',1,'mu',1],['bb',1,'bc',1]]}
Try:
x = (
    df.groupby("ID")
    .apply(lambda x: x.iloc[:, 1:].agg(list).values.tolist())
    .to_dict()
)
print(x)
Prints:
{10: [['xy', 1, 'pq', 2]],
20: [['yz', 1, 'rs', 1], ['ab', 1, 'tu', 2]],
30: [['cd', 2, 'cu', 2], ['xy', 1, 'mu', 1], ['bb', 1, 'bc', 1]]}
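Since groupby.apply may treat the grouping column differently depending on the pandas version, the same dictionary can also be built by iterating over the groups directly. This is a sketch that selects the value columns by name instead of by position; the frame is rebuilt from the table in the question, and value_cols and result are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [10, 20, 20, 30, 30, 30],
    'T': ['xy', 'yz', 'ab', 'cd', 'xy', 'bb'],
    'Q1': [1, 1, 1, 2, 1, 1],
    'P': ['pq', 'rs', 'tu', 'cu', 'mu', 'bc'],
    'Q2': [2, 1, 2, 2, 1, 1],
})

# Columns that go into each inner list (everything except the key)
value_cols = ['T', 'Q1', 'P', 'Q2']

# One dict entry per ID; each row of the group becomes one inner list
result = {key: group[value_cols].values.tolist() for key, group in df.groupby('ID')}
print(result)
```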
I am trying to find the record with the maximum value among the first records of each group after a groupby, and delete that record from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
cost
item_id
d 5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index label from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
# Row order follows the order of gb.groups and may differ between versions.
df2 = df.loc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
t=df.drop_duplicates(subset=['item_id'],keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or using isin with negation.
Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
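Since the question also asks to repeat the process, here is one possible sketch of a single step wrapped in a function. It assumes that ties are broken by taking the earliest row and that the index labels are unique; pop_max_first is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def pop_max_first(df):
    """Return (index label of the removed row, remaining frame)."""
    # head(1) keeps the first row of each group together with its original index label
    firsts = df.groupby('item_id').head(1)
    # idxmax returns the label of the first occurrence of the maximum cost
    label = firsts['cost'].idxmax()
    return label, df.drop(index=label)

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})

removed = []
for _ in range(3):  # repeat the step three times
    label, df = pop_max_first(df)
    removed.append(label)
print(removed)
```

Unlike the cost-based isin filter, this drops by index label, so other rows that happen to share the same cost are left in place.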
Overview: create a dataframe from a dictionary. Group by item_id and find the max cost. Enumerate over the grouped series and use the numeric position to look up the corresponding item_id in its index. Build a result_df dataframe from the collected rows if you need one.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped = df_temp.groupby(['item_id'])['cost'].max()
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for key, value in enumerate(grouped):
    index = grouped.index[key]
    rows.append({'item_id': index, 'cost': value})
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])
print(result_df.head(5))
I am working with a pandas dataframe and trying to concatenate multiple strings and numbers into one string.
This works
df1 = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'c']})
df1.apply(lambda x: ', '.join(x), axis=1)
0 a, a
1 b, b
2 c, c
How can I make this work just like df1?
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x), axis=1)
TypeError: ('sequence item 0: expected str instance, int found', 'occurred at index 2')
Consider the dataframe df
import numpy as np
import pandas as pd
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(3, 3)),
columns=list('abc')
)
print(df)
a b c
0 0 2 7
1 3 8 7
2 0 6 8
You can call astype(str) before applying the join:
df.astype(str).apply(', '.join, 1)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
Using a comprehension
pd.Series([', '.join(l) for l in df.values.astype(str).tolist()], df.index)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
In [75]: df2
Out[75]:
Col1 Col2 Col3
0 a a x
1 b b y
2 1 1 2
In [76]: df2.astype(str).add(', ').sum(1).str[:-2]
Out[76]:
0 a, a, x
1 b, b, y
2 1, 1, 2
dtype: object
You have to convert column types to strings.
import pandas as pd
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x.astype('str')), axis=1)
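The same conversion can be done once for the whole frame rather than per row. This sketch, using the df2 from the question, casts every cell to a string and then joins along each row with agg; joined is a name introduced here.

```python
import pandas as pd

df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})

# Cast all cells to str, then join the values of each row with ', '
joined = df2.astype(str).agg(', '.join, axis=1)
print(joined)
```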
In pandas, I have (app_categ_events is a dataframe):
> print(app_categ_events.label_id.unique().shape)
> print(app_categ_events.category.unique().shape)
Out:
(492,)
(458,)
I want to look at the categories that have more than one label_id each (because I thought there was supposed to be a one-to-one mapping).
In R's data.table, I can do:
app_categ_events[, count_rows := .N, by = list(category, label_id)]
# (or something of that sort...)
print(app_categ_events[count_rows > 1])
What’s the best way of doing that in pandas?
We transform the dataset to create the 'count_rows' column after grouping by 'category', 'label_id'
app_categ_events['count_rows'] = app_categ_events.groupby(['category',
'label_id'])['label_id'].transform('count')
print(app_categ_events)
# category label_id count_rows
#0 a 1 2
#1 a 1 2
#2 b 2 1
#3 b 3 1
Now, the equivalent of the data.table code shown in the OP's post would be
print(app_categ_events[app_categ_events.count_rows>1])
# category label_id count_rows
#0 a 1 2
#1 a 1 2
data
import pandas as pd
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'], 'label_id': [1, 1, 2, 3]})
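You can use groupby filter to return the desired results. Note that this keeps categories with more than one row, which is not necessarily more than one distinct label_id.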
df = pd.DataFrame({'label_id': [1, 1, 2, 3],
'category': ['a', 'b', 'b', 'c']})
df.groupby(['category']).filter(lambda group: len(group) > 1)
category label_id
1 b 1
2 b 2
Given:
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
'label_id': [1, 1, 2, 3]})
Solution:
# identify categories with more than one distinct label_id
cat_mask = app_categ_events.groupby('category')['label_id'].nunique().gt(1)
cats = cat_mask[cat_mask]
# filter data
app_categ_events[app_categ_events.category.isin(cats.index)]
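A more compact variant of the same check uses transform('nunique'), so the per-category count lines up with the original rows. It is a sketch that assumes the goal is to flag categories that map to more than one distinct label_id; n_labels and multi are names introduced here.

```python
import pandas as pd

app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                                 'label_id': [1, 1, 2, 3]})

# Number of distinct label_id values per category, broadcast back to every row
n_labels = app_categ_events.groupby('category')['label_id'].transform('nunique')

# Keep only the rows whose category has more than one distinct label_id
multi = app_categ_events[n_labels > 1]
print(multi)
```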