Functional approach to group DataFrame columns into MultiIndex - python

Is there a simpler functional way to group columns into a MultiIndex?
# Setup
l = [...]
l2, l3, l4 = do_things(l, [2, 3, 4])
d = {2: l2, 3: l3, 4: l4}
# Or,
l = l2 = l3 = l4 = list(range(20))
Problems with my approaches:
# Cons:
# * Complicated
# * Requires multiple iterations over the dictionary to occur
# in the same order. This is guaranteed as the dictionary is
# unchanged but I'm not happy with the implicit dependency.
df = pd.DataFrame\
    ( zip(*d.values())
    , index=l
    , columns=pd.MultiIndex.from_product([["group"], d.keys()])
    ).rename_axis("x").reset_index().reset_index()
# Cons:
# * Complicated
# * Multiple assignments
df = pd.DataFrame(d, index=l).rename_axis("x")
df.columns = pd.MultiIndex.from_product([["group"],df.columns])
df = df.reset_index().reset_index()
I'm looking for something like:
df = \
    ( pd.DataFrame(d, index=l)
      .rename_axis("x")
      .group_columns("group")
      .reset_index().reset_index()
    )
Result:
index x group
2 3 4
0 0 2 0 0 0
1 1 2 0 0 0
2 2 2 0 0 0
3 3 2 0 0 0
4 4 1 0 0 0
5 5 2 0 0 0
6 6 1 0 0 0
7 7 2 0 0 0
8 8 4 0 1 1
9 9 4 0 1 1
10 10 4 0 1 1
11 11 0 0 1 1
12 12 1 0 1 1
13 13 1 0 1 1
14 14 3 1 2 2
15 15 1 1 2 2
16 16 1 1 2 3
17 17 1 1 2 3
18 18 4 1 2 3
19 19 3 1 2 3
20 20 4 1 2 3
21 21 4 1 2 3
22 22 4 1 2 3
23 23 4 1 2 3

It is probably easiest just to reformat the dictionary and pass it to the DataFrame constructor:
import numpy as np
import pandas as pd

# Sample Data
size = 5
lst = np.arange(size) + 10
d = {2: lst, 3: lst + size, 4: lst + (size * 2)}

df = pd.DataFrame(
    # Add group level by changing keys to tuples
    {('group', k): v for k, v in d.items()},
    index=lst
)
Output:
group
2 3 4
10 10 15 20
11 11 16 21
12 12 17 22
13 13 18 23
14 14 19 24
Notice that the tuple keys get interpreted as a MultiIndex automatically.
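A quick way to confirm the resulting column structure (the exact repr below may vary slightly across pandas versions):
print(df.columns)
# MultiIndex([('group', 2),
#             ('group', 3),
#             ('group', 4)],
#            )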
This can be followed with whatever chain of operations is desired:
df = pd.DataFrame(
    {('group', k): v for k, v in d.items()},
    index=lst
).rename_axis('x').reset_index().reset_index()
df:
index x group
2 3 4
0 0 10 10 15 20
1 1 11 11 16 21
2 2 12 12 17 22
3 3 13 13 18 23
4 4 14 14 19 24
It is also possible to combine steps and generate the complete DataFrame directly:
df = pd.DataFrame({
    ('index', ''): pd.RangeIndex(len(lst)),
    ('x', ''): lst,
    **{('group', k): v for k, v in d.items()}
})
df:
index x group
2 3 4
0 0 10 10 15 20
1 1 11 11 16 21
2 2 12 12 17 22
3 3 13 13 18 23
4 4 14 14 19 24
Naturally any combination of dictionary comprehension and pandas operations can be used.
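If you want the fully chained style from the question, a small helper passed to DataFrame.pipe gets close. Note that group_columns below is a hypothetical helper written for this sketch, not an existing pandas method:
def group_columns(df, name):
    # Return a copy with all columns nested under a single top level
    out = df.copy()
    out.columns = pd.MultiIndex.from_product([[name], df.columns])
    return out

df = (
    pd.DataFrame(d, index=lst)
    .rename_axis('x')
    .pipe(group_columns, 'group')
    .reset_index()
    .reset_index()
)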

Related

Conditional Cumulative Count pandas while preserving values before first change

I work with pandas and I am trying to create a column whose value increments and, in particular, resets on a condition based on the Time column.
Input data:
Out[73]:
    ID  Time Job Level Counter
0    1    17         a
1    1    18         a
2    1    19         a
3    1    20         a
4    1    21         a
5    1    22         b
6    1    23         b
7    1    24         b
8    2    10         a
9    2    11         a
10   2    12         a
11   2    13         a
12   2    14         b
13   2    15         b
14   2    16         b
15   2    17         c
16   2    18         c
I want to create a new column 'Counter' where, within each ID group, the value stays equal to Time before the first Job Level change, and restarts from zero (counting up) every time a change in the Job Level is encountered.
What I would like to have:
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
This is what I tried (column and frame names normalized to the sample data above):
df = df.sort_values(['ID']).reset_index(drop=True)
# Flag rows where the Job Level changes within each ID
df['Counter'] = df.groupby('ID')['Job Level'].apply(lambda x: x.shift() != x)

def func(group):
    group.loc[group.index[0], 'Counter'] = group.loc[group.index[0], 'Time']
    return group

df = df.groupby('ID').apply(func)
df['Counter'] = df['Counter'].replace(True, 'a')
df['Counter'] = np.where(df.Counter == False, df['Time'], df['Counter'])
df['Counter'] = df['Counter'].replace('a', 0)
This does not produce a cumulative count after the first change while preserving the Time values before it.
Use GroupBy.cumcount for the counter, while the first group within each ID keeps its values from the Time column:
# if consecutive duplicates need to be treated as separate runs
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
print (df)
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
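For reference, a sketch that rebuilds the sample frame and exposes the two helper series; the reconstruction of df from the question's table is an assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':        [1]*8 + [2]*9,
    'Time':      [17, 18, 19, 20, 21, 22, 23, 24, 10, 11, 12, 13, 14, 15, 16, 17, 18],
    'Job Level': list('aaaaabbb') + list('aaaabbbcc'),
})

# s labels each run of consecutive identical Job Levels with an integer
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
# m is True only on the first run within each ID
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())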
Or:
# if each Job Level forms a single block within its ID (no group repeats)
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
The difference shows up in modified data, where an ID's first Job Level reappears later: the second solution then keeps the Time values for those rows again, while the first one starts a new count:
print (df)
ID Time Job Level
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
10 2 12 a
11 2 18 a
12 2 19 b
13 2 20 b
# if consecutive duplicates need to be treated as separate runs
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter1'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter2'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
print (df)
ID Time Job Level Counter1 Counter2
12 2 14 b 14 14
13 2 15 b 15 15
14 2 16 b 16 16
15 2 17 c 0 0
16 2 18 c 1 1
10 2 12 a 0 0
11 2 18 a 1 1
12 2 19 b 0 19
13 2 20 b 1 20

Python Dataframe GroupBy Function

I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (Weight, Age) and (Weight, Height); however, when I ran this example, I found out it was doing something else.
import pandas as pd

data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data, columns=["Lbl", "Weight", "Age", "Height"])
print(df)
def group_fea(df, key, target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key + target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp
# Add feature combinations
feature_key = ['Weight']
feature_target = ['Age', 'Height']

for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df, key, target)
        df = df.merge(tmp, on=key, how='left')

print(df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique and WeightHeight_nunique mean.
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
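A quick way to verify this by hand, using the df defined in the question:
print(df.loc[df['Weight'] == 44, 'Age'].nunique())     # 1 -> only Age 12
print(df.loc[df['Weight'] == 44, 'Height'].nunique())  # 2 -> Heights 3 and 5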
Let us try transform:
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
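A side note: the dict-renaming form of .agg used inside group_fea was deprecated in pandas 0.20 and removed in 1.0, so on a modern install it raises a SpecificationError. Named aggregation (available since pandas 0.25) is the usual replacement; a sketch of an equivalent helper:
def group_fea_modern(df, key, target):
    '''Count unique target values per key value, using named aggregation.'''
    tmp = df.groupby(key, as_index=False).agg(
        **{key + target + '_nunique': (target, 'nunique')}
    )
    print("****{}****".format(target))
    return tmp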

Pivot column and column values in pandas dataframe

I have a dataframe that looks like this, but with 26 rows and 110 columns:
index/io 1 2 3 4
0 42 53 23 4
1 53 24 6 12
2 63 12 65 34
3 13 64 23 43
Desired output:
index io value
0 1 42
0 2 53
0 3 23
0 4 4
1 1 53
1 2 24
1 3 6
1 4 12
2 1 63
2 2 12
...
I have tried with dicts and lists by transforming the dataframe to a dict, then creating a new list with the index values and updating a new dict with io.
indx = []
for key, value in mydict.items():
    for k, v in value.items():
        indx.append(key)

indxio = {}
for element in indx:
    for key, value in mydict.items():
        for k, v in value.items():
            indxio.update({element: k})
I know this is probably way off, but it's the only thing I could think of. The process was taking too long, so I stopped.
You can use set_index, stack, and reset_index().
df.set_index("index/io").stack().reset_index(name="value")\
  .rename(columns={'index/io': 'index', 'level_1': 'io'})
Output:
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
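One caveat for a larger frame with missing cells: stack() drops NaN values by default, so any (index, io) pair holding a NaN would silently disappear. Passing dropna=False keeps them (a sketch; whether you want this depends on your data):
out = (df.set_index("index/io")
         .stack(dropna=False)
         .reset_index(name="value")
         .rename(columns={'index/io': 'index', 'level_1': 'io'}))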
You need set_index + stack + rename_axis + reset_index:
df = df.set_index('index/io').stack().rename_axis(('index','io')).reset_index(name='value')
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
A solution with melt and rename; it produces a different order of values, so sort_values is necessary:
d = {'index/io': 'index'}
df = df.melt('index/io', var_name='io', value_name='value') \
       .rename(columns=d).sort_values(['index', 'io']).reset_index(drop=True)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
And an alternative solution for numpy lovers:
df = df.set_index('index/io')
a = np.repeat(df.index, len(df.columns))  # each index value repeated once per column
b = np.tile(df.columns, len(df.index))    # column labels cycled once per row
c = df.values.ravel()                     # cell values flattened row by row
cols = ['index', 'io', 'value']
df = pd.DataFrame(np.column_stack([a, b, c]), columns=cols)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43

Permute groups in Pandas

Say I have a Pandas DataFrame whose data look like
import numpy as np
import pandas as pd
n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})
Question: how do I permute the groups (grouped by the b column)? Not permutation within each group, but permutation at the group level.
Example
Before
a b c
1 0 1
2 0 2
3 1 3
4 1 4
5 2 5
6 2 6
After
a b c
3 1 3
4 1 4
1 0 1
2 0 2
5 2 5
6 2 6
Basically, before the permutation df['b'].unique() == [0, 1, 2]; after the permutation df['b'].unique() == [1, 0, 2].
Here's an answer inspired by the accepted answer to this SO post, which uses a temporary Categorical column as a sorting key to do custom sort orderings. In this answer, I produce all permutations, but you can just take the first one if you are looking for only one.
import itertools

df_results = list()
orderings = itertools.permutations(df["b"].unique())

for ordering in orderings:
    df_2 = df.copy()
    df_2["b_key"] = pd.Categorical(df_2["b"], list(ordering))
    df_2.sort_values("b_key", inplace=True)
    df_2.drop(["b_key"], axis=1, inplace=True)
    df_results.append(df_2)

for df in df_results:
    print(df)
The idea here is that we create a new categorical variable each time, with a slightly different enumerated order, then sort by it. We discard it at the end once we no longer need it.
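If only one random group ordering is needed rather than all of them, a shorter sketch; the use of numpy's default_rng and the key argument of sort_values (pandas >= 1.1) are assumptions beyond the original answer:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
order = rng.permutation(df['b'].unique())   # e.g. array([1, 0, 2])
rank = {v: i for i, v in enumerate(order)}  # group label -> position in the new order

# A stable sort keeps the original row order within each group
df_shuffled = df.sort_values('b', key=lambda col: col.map(rank), kind='stable')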
If I understood your question correctly, you can do it this way:
n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})
order = pd.Series([1,0,2])
cols = df.columns
df['idx'] = df.b.map(order)
index = df.index
df = df.reset_index().sort_values(['idx', 'index'])[cols]
Step by step:
In [103]: df['idx'] = df.b.map(order)
In [104]: df
Out[104]:
a b c idx
0 0 2 0 2
1 1 0 1 1
2 2 1 2 0
3 3 0 3 1
4 4 1 4 0
5 5 1 5 0
6 6 1 6 0
7 7 2 7 2
8 8 0 8 1
9 9 1 9 0
10 10 0 10 1
11 11 1 11 0
12 12 0 12 1
13 13 2 13 2
14 14 0 14 1
15 15 2 15 2
16 16 1 16 0
17 17 2 17 2
18 18 1 18 0
19 19 1 19 0
20 20 0 20 1
21 21 0 21 1
22 22 1 22 0
23 23 1 23 0
24 24 2 24 2
25 25 0 25 1
26 26 0 26 1
27 27 0 27 1
28 28 1 28 0
29 29 1 29 0
In [105]: df.reset_index().sort_values(['idx', 'index'])
Out[105]:
index a b c idx
2 2 2 1 2 0
4 4 4 1 4 0
5 5 5 1 5 0
6 6 6 1 6 0
9 9 9 1 9 0
11 11 11 1 11 0
16 16 16 1 16 0
18 18 18 1 18 0
19 19 19 1 19 0
22 22 22 1 22 0
23 23 23 1 23 0
28 28 28 1 28 0
29 29 29 1 29 0
1 1 1 0 1 1
3 3 3 0 3 1
8 8 8 0 8 1
10 10 10 0 10 1
12 12 12 0 12 1
14 14 14 0 14 1
20 20 20 0 20 1
21 21 21 0 21 1
25 25 25 0 25 1
26 26 26 0 26 1
27 27 27 0 27 1
0 0 0 2 0 2
7 7 7 2 7 2
13 13 13 2 13 2
15 15 15 2 15 2
17 17 17 2 17 2
24 24 24 2 24 2

Changing structure of pandas dataframe

Is there a function that can swap between the following dataframes (df1, df2)?
import random
import pandas as pd
numbers = random.sample(range(1, 50), 10)
d = {'num': list(range(1, 6)) + list(range(1, 6)), 'values': numbers, 'type': ['a']*5 + ['b']*5}
df = pd.DataFrame(d)

e = {'num': list(range(1, 6)), 'a': numbers[:5], 'b': numbers[5:]}
df2 = pd.DataFrame(e)
Dataframe df1:
#df1
num type values
0 1 a 18
1 2 a 26
2 3 a 34
3 4 a 21
4 5 a 48
5 1 b 1
6 2 b 19
7 3 b 36
8 4 b 42
9 5 b 30
Dataframe df2:
a b num
0 18 1 1
1 26 19 2
2 34 36 3
3 21 42 4
4 48 30 5
I take the first df and the type column values become column names holding the corresponding values. Is there a function that can do this (from df1 to df2), as well as the reverse (from df2 to df1)?
You can use stack and pivot:
print df
num type values
0 1 a 20
1 2 a 25
2 3 a 2
3 4 a 27
4 5 a 29
5 1 b 39
6 2 b 40
7 3 b 6
8 4 b 17
9 5 b 47
print df2
a b num
0 20 39 1
1 25 40 2
2 2 6 3
3 27 17 4
4 29 47 5
df1 = df2.set_index('num').stack().reset_index()
df1.columns = ['num','type','values']
df1 = df1.sort_values('type')
print df1
num type values
0 1 a 20
2 2 a 46
4 3 a 21
6 4 a 33
8 5 a 10
1 1 b 45
3 2 b 39
5 3 b 38
7 4 b 37
9 5 b 34
df3 = df.pivot(index='num', columns='type', values='values').reset_index()
df3.columns.name = None
df3 = df3[['a','b','num']]
print df3
a b num
0 46 23 1
1 38 6 2
2 36 47 3
3 33 34 4
4 15 1 5
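For a quick sanity check that the two operations invert each other, a sketch with a fixed (non-random) sample; the variable names here are mine, not from the answer:
import pandas as pd

df1 = pd.DataFrame({'num': [1, 2, 3, 1, 2, 3],
                    'type': ['a']*3 + ['b']*3,
                    'values': [10, 20, 30, 40, 50, 60]})

# long -> wide
wide = df1.pivot(index='num', columns='type', values='values').reset_index()
wide.columns.name = None

# wide -> long
long = wide.set_index('num').stack().reset_index()
long.columns = ['num', 'type', 'values']
long = long.sort_values('type', kind='stable').reset_index(drop=True)

print(long.equals(df1))  # should print True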
