Pandas DataFrame: expand tuple values as columns with MultiIndex - python

I have a dataframe df:
                A    B
first second
bar   one     0.0  0.0
      two     0.0  0.0
foo   one     0.0  0.0
      two     0.0  0.0
I transform it to another one where values are tuples:
                      A          B
first second
bar   one     (6, 1, 0)  (0, 9, 3)
      two     (9, 3, 4)  (6, 2, 1)
foo   one     (1, 9, 0)  (4, 0, 0)
      two     (6, 1, 5)  (8, 3, 5)
My question is: how can I expand it so that the tuple values become columns with a MultiIndex, as below? Can I do it during transform, or should I do it as an additional step after transform?
              A        B
              m  n  k  m  n  k
first second
bar   one     6  1  0  0  9  3
      two     9  3  4  6  2  1
foo   one     1  9  0  4  0  0
      two     6  1  5  8  3  5
Code for the above:
import numpy as np
import pandas as pd

np.random.seed(123)

def expand(s):
    # complex logic of `result` has been replaced with `np.random`
    result = [tuple(np.random.randint(10, size=3)) for i in s]
    return result

index = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame(np.zeros((4, 2)), index=index, columns=['A', 'B'])
print(df)

expanded = df.groupby(['second']).transform(expand)
print(expanded)

Try this:
df_lst = []
for col in df.columns:
    expanded_splt = expanded.apply(lambda x: pd.Series(x[col]), axis=1)
    columns = pd.MultiIndex.from_product([[col], ['m', 'n', 'k']])
    expanded_splt.columns = columns
    df_lst.append(expanded_splt)
pd.concat(df_lst, axis=1)
Output:
              A        B
              m  n  k  m  n  k
first second
bar   one     6  1  0  0  9  3
      two     9  3  4  6  2  1
foo   one     1  9  0  4  0  0
      two     6  1  5  8  3  5

I finally found the time to work out an answer that suits me.
expanded_data = expanded.agg(lambda x: np.concatenate(x), axis=1).to_numpy()
expanded_data = np.stack(expanded_data)
column_index = pd.MultiIndex.from_product([expanded.columns, ['m', 'n', 'k']])
exploded = pd.DataFrame(expanded_data, index=expanded.index, columns=column_index)
print(exploded)
              A        B
              m  n  k  m  n  k
first second
bar   one     6  1  0  0  9  3
      two     9  3  4  6  2  1
foo   one     1  9  0  4  0  0
      two     6  1  5  8  3  5
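For reference, here is a shorter sketch of the same expansion (not from the original answers); it assumes the `expanded` frame and the illustrative m, n, k labels used above:
# Build one three-column frame per original column from the tuple values,
# then concatenate; the dict keys become the top level of the column MultiIndex.
parts = {
    col: pd.DataFrame(expanded[col].tolist(),
                      index=expanded.index,
                      columns=['m', 'n', 'k'])
    for col in expanded.columns
}
exploded = pd.concat(parts, axis=1)
print(exploded)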

Related

Compare dataframes and only use unmatched values

I have two dataframes that I want to compare, but only want to use the values that are not in both dataframes.
Example:
DF1:
   A  B  C
0  1  2  3
1  4  5  6
DF2:
    A   B   C
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
So, from this example I want to work with row index 2 and 3 ([7, 8, 9] and [10, 11, 12]).
The code I currently have (it only removes duplicates) is below.
df = pd.concat([di_old, di_new])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
print(df.reindex(idx))
I would do:
df_n = df2[~df2.isin(df1).all(axis=1)]
Output:
    A   B   C
2   7   8   9
3  10  11  12
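An alternative sketch (not part of the original answer), assuming df1 and df2 hold the two frames shown above (the question's own code uses di_old and di_new): an outer merge with indicator=True flags rows that exist in only one frame.
merged = df1.merge(df2, how='outer', indicator=True)
only_in_df2 = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')
print(only_in_df2)  # the rows 7 8 9 and 10 11 12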

Can Pandas use a list for groupby?

import pandas as pd
import numpy as np

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
df
  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
L = [0, 1, 0, 1, 2, 0]
print(df.groupby(L).sum())
The output is:
   data1  data2
0      7     17
1      4      3
2      4      7
I need a clear explanation, please. What are the 0s, 1s and 2 in L? Are they values from the key column of the df, or are they index labels of the df? And how does groupby group based on L?
L is a list of integers in your example. When you group by L, you are simply saying: look at this list of integers and group my df based on its unique values.
I think visualizing it will make sense (note that the df doesn't have a column L - I just added it for visualization):
Grouping by L means: take the unique values (in this case 0, 1 and 2) and sum data1 and data2 within each group. So the result for data1 when L = 0 would be 0 + 2 + 5 = 7, etc.
and the end result would be:
df.groupby(L).sum()

   data1  data2
0      7     17
1      4      3
2      4      7
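To make the positional alignment explicit, here is a minimal sketch (not from the original answer): grouping by the plain list L is the same as grouping by a Series built from L with df's index.
grouper = pd.Series(L, index=df.index)
print(df.groupby(grouper)[['data1', 'data2']].sum())  # same result as df.groupby(L).sum()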
You can use a list to group observations in your dataframe. For instance, say you have the heights of a few people:
import pandas as pd

df = pd.DataFrame({'names': ['John', 'Mark', 'Fred', 'Julia', 'Mary'],
                   'height': [180, 180, 180, 160, 160]})
print(df)
   names  height
0   John     180
1   Mark     180
2   Fred     180
3  Julia     160
4   Mary     160
And elsewhere, you received their assigned groups:
sex = ['man', 'man', 'man', 'woman', 'woman']
You won't need to concatenate a new column to your dataframe just to group them. You can use the list to do the work:
df.groupby(sex).mean()
       height
man       180
woman     160
You can see here how it's working:
In [6006]: df.groupby(L).agg(list)
Out[6006]:
         key      data1      data2
0  [A, C, C]  [0, 2, 5]  [5, 3, 9]
1     [B, A]     [1, 3]     [0, 3]
2        [B]        [4]        [7]
In [6002]: list(df.groupby(L))
Out[6002]:
[(0,   key  data1  data2
  0     A      0      5
  2     C      2      3
  5     C      5      9),
 (1,   key  data1  data2
  1     B      1      0
  3     A      3      3),
 (2,   key  data1  data2
  4     B      4      7)]
With L, group 0 holds the key values A, C, C (index 0, 2, 5), group 1 holds B, A (index 1, 3), and group 2 holds B (index 4).
This is due to the alignment of the L key:
df['L'] = L
  key  data1  data2  L
0   A      0      5  0
1   B      1      0  1
2   C      2      3  0
3   A      3      3  1
4   B      4      7  2
5   C      5      9  0
I hope this makes sense

Why pd.MultiIndex.from_tuples changes the order of the tuples

When creating a MultiIndex using from_tuples, the created index object has a different order than the input tuples.
I am trying to add a column level to a data frame using the pd.MultiIndex.from_tuples method, but the levels are different from what I expected.
df = pd.DataFrame({'x_1':[1, 2], 'x_2':[3, 4], 'x_10':[3, 4], 'y_1':[5, 6], 'y_2':[7, 8], 'y_10':[1, 2]})
df = df.reindex(columns=['x_1', 'x_2', 'x_10', 'y_1', 'y_2', 'y_10'])
index = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
print(index)
MultiIndex(levels=[['x', 'y'], ['1', '10', '2']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 2, 1, 0, 2, 1]])
When I add the level to the dataframe and perform stacking, the order is not what I want.
df.columns = index
df.stack()
      x  y
0 1   1  5
  10  3  1
  2   3  7
1 1   2  6
  10  4  2
  2   4  8
I expect the index levels to look like:
MultiIndex(levels=[['x', 'y'], ['1', '2', '10']])
and stacking will look like the following:
df.stack()
      x  y
0 1   1  5
  2   3  7
  10  3  1
1 1   2  6
  2   4  8
  10  4  2
You can reindex at a specific level, passing the level values from your column prior to the call to stack:
In [177]: df.stack().reindex(df.columns.get_level_values(1).unique(), level=1)
Out[177]:
      x  y
0 1   1  5
  2   3  7
  10  3  1
1 1   2  6
  2   4  8
  10  4  2
Note that this has performance implications, because an index is expected to be sorted for fast lookups.
The index you have constructed is actually ordered as specified. When you print(index) you are seeing how Pandas stores the index internally. Using index.values unravels this representation to give an array of indices aligned with your dataframe.
print(index.values)
# array([('x', '1'), ('x', '2'), ('x', '10'), ('y', '1'), ('y', '2'),
# ('y', '10')], dtype=object)
df.columns = index
print(df)
#    x         y
#    1  2  10  1  2  10
# 0  1  3   3  5  7   1
# 1  2  4   4  6  8   2
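A quick check (a sketch, not in the original answer) that from_tuples itself keeps the input order; only the sorted levels shown in the repr look reordered:
tuples = [('x', '1'), ('x', '2'), ('x', '10'), ('y', '1'), ('y', '2'), ('y', '10')]
print(list(pd.MultiIndex.from_tuples(tuples)) == tuples)  # True - the entries keep their input order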
The real issue is pd.DataFrame.stack applies sorting and, since you have defined strings, '10' comes before '2'. To maintain ordering as you desire after stack, make sure you use integers:
def splitter(x):
    strng, num = x.split('_')
    return strng, int(num)
index = pd.MultiIndex.from_tuples(df.columns.map(splitter))
df.columns = index
print(df.stack())
#       x  y
# 0 1   1  5
#   2   3  7
#   10  3  1
# 1 1   2  6
#   2   4  8
#   10  4  2

Count appearances of a value until it changes to another value

I have the following DataFrame:
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
I want to calculate the frequency of each value, but not an overall count - the count of each value until it changes to another value.
I tried:
df['values'].value_counts()
but it gives me
10    6
9     3
23    2
12    1
The desired output is
10:2
23:2
9:3
10:4
12:1
How can I do this?
Use:
df = df.groupby(df['values'].ne(df['values'].shift()).cumsum())['values'].value_counts()
Or:
df = df.groupby([df['values'].ne(df['values'].shift()).cumsum(), 'values']).size()
print (df)
values  values
1       10        2
2       23        2
3       9         3
4       10        4
5       12        1
Name: values, dtype: int64
Last, to remove the first level:
df = df.reset_index(level=0, drop=True)
print (df)
values
10    2
23    2
9     3
10    4
12    1
dtype: int64
Explanation:
Compare the original column with the shifted one using ne (not equal), and then apply cumsum to build a helper Series. Spelling out the intermediate variables (a, b, c are the shifted, compared and cumulated Series):
a = df['values'].shift()
b = df['values'].ne(a)
c = b.cumsum()
print(pd.concat([df['values'], a, b, c],
                keys=('orig', 'shifted', 'not_equal', 'cumsum'), axis=1))
    orig  shifted  not_equal  cumsum
0     10      NaN       True       1
1     10     10.0      False       1
2     23     10.0       True       2
3     23     23.0      False       2
4      9     23.0       True       3
5      9      9.0      False       3
6      9      9.0      False       3
7     10      9.0       True       4
8     10     10.0      False       4
9     10     10.0      False       4
10    10     10.0      False       4
11    12     10.0       True       5
You can keep track of where the changes in df['values'] occur, then group by the changes and also by df['values'] (to keep them as the index), computing the size of each group:
changes = df['values'].diff().ne(0).cumsum()
df.groupby([changes,'values']).size().reset_index(level=0, drop=True)
values
10    2
23    2
9     3
10    4
12    1
dtype: int64
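A compact variant (a sketch, not from the original answers) labels each run and then aggregates the run's value and its length in one groupby via named aggregation (pandas 0.25+); the value/count column names are just illustrative:
runs = df['values'].ne(df['values'].shift()).cumsum()
print(df.groupby(runs)['values'].agg(value='first', count='size'))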
itertools.groupby
from itertools import groupby
pd.Series(*zip(*[[len([*v]), k] for k, v in groupby(df['values'])]))
10    2
23    2
9     3
10    4
12    1
dtype: int64
Here's a generator:
def f(x):
    count = 1
    for this, that in zip(x, x[1:]):
        if this == that:
            count += 1
        else:
            yield count, this
            count = 1
    yield count, [*x][-1]

pd.Series(*zip(*f(df['values'])))
10    2
23    2
9     3
10    4
12    1
dtype: int64
Using crosstab
df['key']=df['values'].diff().ne(0).cumsum()
pd.crosstab(df['key'],df['values'])
Out[353]:
values  9   10  12  23
key
1       0    2   0   0
2       0    0   0   2
3       3    0   0   0
4       0    4   0   0
5       0    0   1   0
Slightly modifying the result above:
pd.crosstab(df['key'],df['values']).stack().loc[lambda x:x.ne(0)]
Out[355]:
key  values
1    10        2
2    23        2
3    9         3
4    10        4
5    12        1
dtype: int64
Based on Python's groupby:
from itertools import groupby
[ (k,len(list(g))) for k,g in groupby(df['values'].tolist())]
Out[366]: [(10, 2), (23, 2), (9, 3), (10, 4), (12, 1)]
This is far from the most time/memory efficient method in this thread, but here's an iterative approach that is pretty straightforward. Please feel encouraged to suggest improvements.
import pandas as pd

df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])

dict_count = {}
for v in df['values'].unique():
    dict_count[v] = 0

curr_val = df.iloc[0]['values']
count = 1
for i in range(1, len(df)):
    if df.iloc[i]['values'] == curr_val:
        count += 1
    else:
        if count > dict_count[curr_val]:
            dict_count[curr_val] = count
        curr_val = df.iloc[i]['values']
        count = 1
if count > dict_count[curr_val]:
    dict_count[curr_val] = count

df_count = pd.DataFrame(dict_count, index=[0])
print(df_count)
The groupby function in itertools can help you. For a str:
>>> from itertools import groupby
>>> string = 'aabbaacc'
>>> for char, freq in groupby(string):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
a:2
b:2
a:2
c:2
This function also works for list:
>>> df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
>>> for char, freq in groupby(df['values'].tolist()):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
10:2
23:2
9:3
10:4
12:1
Note: for a df, always use df['values'] to access the 'values' column, because DataFrame already has a values attribute.

Applying functions to data frame columns

I have the following function to calculate a value for two parameters x, y:
import numpy as np
import math
def some_func(pt1, pt2):
    return math.sqrt((pt2[0] - pt1[0]) * (pt2[0] - pt1[0]) + (pt2[1] - pt1[1]) * (pt2[1] - pt1[1]))
usage:
a = 1, 2
b = 4, 5
some_func(a,b)
#outputs = 4.24264
#or some_func((1,2), (4,5)) would give the same output too
I have a following df:
seq   x   y  points
1     2   3  (2, 3)
1    10   5  (10, 5)
1     6   7  (6, 7)
2     8   9  (8, 9)
2    10  11  (10, 11)
column "points" was obtained using the below piece of code:
df["points"] = list(zip(df.loc[:, "x"], df.loc[:, "y"]))
I want to apply the some_func function to the whole df, also grouping by "seq".
I tried:
df["value"] = some_func(df["points"].values, df["points"].shift(1).values)
#without using groupby
and
df["value"] = df.groupby("seq").points.apply(some_func) #with groupby
but both of them show a TypeError saying 1 missing argument or unsupported data type.
Expected df
seq   x   y  points    value
1     2   3  (2, 3)    NaN
1    10   5  (10, 5)   8.24
1     6   7  (6, 7)    4.47
2     8   9  (8, 9)    NaN
2    10  11  (10, 11)  2.82
You can use groupby with DataFrameGroupBy.shift first, but then you need to replace the NaNs with tuples - one possible solution is fillna. Last, use apply:
s = pd.Series([(np.nan, np.nan)], index=df.index)
df['shifted'] = df.groupby('seq').points.shift().fillna(s)
df['values'] = df.apply(lambda x: some_func(x['points'], x['shifted']), axis=1)
print (df)
   seq   x   y    points     shifted    values
0    1   2   3    (2, 3)  (nan, nan)       NaN
1    1  10   5   (10, 5)      (2, 3)  8.246211
2    1   6   7    (6, 7)     (10, 5)  4.472136
3    2   8   9    (8, 9)  (nan, nan)       NaN
4    2  10  11  (10, 11)      (8, 9)  2.828427
Another solution is to filter out NaNs in apply:
df['shifted'] = df.groupby('seq').points.shift()
f = lambda x: some_func(x['points'], x['shifted']) if pd.notnull(x['shifted']) else np.nan
df['values'] = df.apply(f, axis=1)
print (df)
   seq   x   y    points  shifted    values
0    1   2   3    (2, 3)      NaN       NaN
1    1  10   5   (10, 5)   (2, 3)  8.246211
2    1   6   7    (6, 7)  (10, 5)  4.472136
3    2   8   9    (8, 9)      NaN       NaN
4    2  10  11  (10, 11)   (8, 9)  2.828427
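A vectorized sketch (not from the original answers), working on the numeric x/y columns directly instead of the tuple column; rows with no previous point in their group come out as NaN automatically:
dx = df['x'] - df.groupby('seq')['x'].shift()
dy = df['y'] - df.groupby('seq')['y'].shift()
df['value'] = np.sqrt(dx**2 + dy**2)
print(df[['seq', 'points', 'value']])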
# guard against the NaN that shift produces at the start of each group
f = lambda x, y: some_func(x, y) if isinstance(y, tuple) else np.nan
df["value"] = [f(p, q) for p, q in zip(df["points"], df.groupby("seq")["points"].shift())]
