How to combine repeated header columns for a multi-index pandas dataframe?

Current dataframe:
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
Desired dataframe:
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
It's a multi-index dataframe, and I want a dynamic way to group the repeated top-level headers into one wherever they repeat across columns.

The two dataframes are exactly the same. If you want to change the display style, you can do the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 9, 1, 4],
                            [2, 3, 9, 2, 4],
                            [3, 8, 7, 8, 3],
                            [8, 8, 9, 0, 0]]),
                  columns=pd.MultiIndex.from_arrays([list('aabbc'), list('klmno')]),
                  index=list('abcd'))
Default print style:
>>> print(df)
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
Alternative style:
>>> with pd.option_context('display.multi_sparse', False):
...     print(df)
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
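Either way, you can confirm the underlying columns are identical by inspecting the MultiIndex itself (repr from a recent pandas version):
>>> df.columns
MultiIndex([('a', 'k'),
            ('a', 'l'),
            ('b', 'm'),
            ('b', 'n'),
            ('c', 'o')],
           )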

Related

Pandas Split lists into multiple rows

I have a Dataframe like this
pd.DataFrame([(1,'a','i',[1,2,3],['a','b','c']),(2,'b','i',[4,5],['d','e','f']),(3,'a','j',[7,8,9],['g','h'])])
Output:
0 1 2 3 4
0 1 a i [1, 2, 3] [a, b, c]
1 2 b i [4, 5] [d, e, f]
2 3 a j [7, 8, 9] [g, h]
I want to explode columns 3 and 4, matching their indices and preserving the rest of the columns, like this. I went through this question, but the answer there creates a new dataframe and defines all the columns again, which is memory-inefficient (I have 18L (1.8 million) rows and 19 columns):
0 1 2 3 4
0 1 a i 1 a
1 1 a i 2 b
2 1 a i 3 c
3 2 b i 4 d
4 2 b i 5 e
5 2 b i NaN f
6 3 a j 7 g
7 3 a j 8 h
8 3 a j 9 NaN
Update: I forgot to mention that for missing indices, the other column should get NaN.
Another solution, assuming the lists in each row have equal lengths (the output below reflects such an input; see the edit further down for the uneven case):
df_out = df.explode(3)
df_out[4] = df[4].explode()
print(df_out)
Prints:
0 1 2 3 4
0 1 a i 1 a
0 1 a i 2 b
0 1 a i 3 c
1 2 b i 4 d
1 2 b i 5 e
1 2 b i 6 f
2 3 a j 7 g
2 3 a j 8 h
EDIT: To handle uneven cases:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        (1, "a", "i", [1, 2, 3], ["a", "b", "c"]),
        (2, "b", "i", [4, 5], ["d", "e", "f"]),
        (3, "a", "j", [7, 8, 9], ["g", "h"]),
    ]
)

def fn(x):
    # pad the shorter list with NaN so both list columns match in length
    if len(x[3]) < len(x[4]):
        x[3].extend([np.nan] * (len(x[4]) - len(x[3])))
    elif len(x[3]) > len(x[4]):
        x[4].extend([np.nan] * (len(x[3]) - len(x[4])))
    return x

# "even-out" the lists:
df = df.apply(fn, axis=1)

# explode them:
df_out = df.explode(3)
df_out[4] = df[4].explode()
print(df_out)
Prints:
0 1 2 3 4
0 1 a i 1 a
0 1 a i 2 b
0 1 a i 3 c
1 2 b i 4 d
1 2 b i 5 e
1 2 b i NaN f
2 3 a j 7 g
2 3 a j 8 h
2 3 a j 9 NaN
You can use pd.Series.explode (this likewise assumes the lists in each row have equal lengths):
df = df.apply(pd.Series.explode).reset_index(drop=True)
output:
0 1 2 3 4
0 1 a i 1 a
1 1 a i 2 b
2 1 a i 3 c
3 2 b i 4 d
4 2 b i 5 e
5 2 b i 6 f
6 3 a j 7 g
7 3 a j 8 h
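On pandas 1.3 or newer, DataFrame.explode also accepts a list of columns, so the even-length case is a single call (it raises a ValueError when the lists in a row have mismatched lengths, so for the uneven case even them out first as shown above):
df_out = df.explode([3, 4]).reset_index(drop=True)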

Split dataframe by two repeated values

I have a dataframe which describes the status of a person:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
                   'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
                   'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
This dataframe records two trips of the person. I want to split it based on the values of column C, 'start' and 'end'. The other values in column C do not matter.
I could divide the dataframe with the following code:
x = []
y = []
for i in range(len(df)):
    if df['C'][i] == 'start':
        x.append(i)
    elif df['C'][i] == 'end':
        y.append(i)

for i, j in zip(x, y):
    new_df = df.iloc[i:j+1, :]
    print(new_df)
However, I'm wondering whether there is a more efficient way to divide it without a loop, since I have a pretty large dataframe.
I would create a dict using GroupBy.__iter__()
Method 1
start = df['C'].eq('start')
dfs = dict(df.loc[(start.add(df['C'].shift().eq('end')).cumsum() % 2).eq(1)]
             .groupby(start.cumsum())
             .__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
Method 2
start = df['C'].eq('start')
dfs = dict(df.loc[start.where(start)
                       .groupby(df['C'].shift()
                                       .eq('end')
                                       .cumsum())
                       .ffill().notna()]
             .groupby(start.cumsum())
             .__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
Accessing DataFrame
print(dfs[1])
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
print(dfs[2])
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
We can use groupby.get_group:
dfs = (df.loc[start.where(start)
                   .groupby(df['C'].shift()
                                   .eq('end')
                                   .cumsum())
                   .ffill().notna()]
         .groupby(start.cumsum()))
df1 = dfs.get_group(1)
df2 = dfs.get_group(2)
print(df1)
print(df2)
Details Method 2
start.where(start)
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Name: C, dtype: float64
df['C'].shift().eq('end').cumsum()
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 1
Name: C, dtype: int64
As you can see, row 4 falls within group 1, but when using groupby.ffill its value remains NaN, so the notna() mask excludes it.
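Details Method 1
For comparison, a sketch of how the Method 1 mask picks out the rows between each start/end pair (using the df defined above):
start = df['C'].eq('start')            # True at each 'start'
after_end = df['C'].shift().eq('end')  # True on the row right after each 'end'

# the running count is odd inside a trip and even between trips
inside = (start.add(after_end).cumsum() % 2).eq(1)
print(inside.tolist())
# [True, True, True, True, False, True, True, True, True]
Row 4, the stray 'running' between the two trips, is the only row excluded.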
Based on the comments, the starting dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
                   'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
                   'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
Then:
for g in df.groupby(df.assign(tmp=(df['C'] == 'start'))['tmp'].cumsum()):
    m = (g[1]['C'] == 'end').shift().fillna(False).cumsum() == 0
    print(g[1][m])
Prints:
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
You can use:
idx = zip(df[df['C'] == 'start'].index, df[df['C'] == 'end'].index)
dfs = [df.loc[i:j] for i, j in idx]
Using str.extract | cumsum and groupby, then holding your results in a dictionary. (Note: this answer targets a variant of the question where the start/end markers in column C are the letters 'A' and 'C', as its outputs below show.)
df_dict = {}
counter = 0
for group, data in df.assign(
    g=df["C"].str.extract("(A|C)").bfill().apply(lambda x: x.ne("C")).cumsum()
).groupby("g"):
    counter += 1
    df_dict[counter] = data.drop('g', axis=1)
df_dict[1]
A B C
0 1 6 A
1 2 7 B
2 3 8 D
3 4 9 C
df_dict[2]
A B C
4 5 10 A
5 6 11 B
6 7 12 E
7 8 13 C
Try:
import numpy as np

df["group"] = df.groupby("C").cumcount()
df.loc[df["C"].ne("start"), "group"] = None
df["group"] = np.where(np.logical_and(df["C"].shift(1).eq("end"), df["C"].ne("start")), -1, df["group"])
df["group"] = df["group"].ffill()
dfs = [df.loc[df["group"].eq(grp)] for grp in df.groupby("group").groups]
Outputs:
#dfs[0]
A B C group
4 5 10 running -1.0
#dfs[1]
A B C group
0 1 6 start 0.0
1 2 7 running 0.0
2 3 8 running 0.0
3 4 9 end 0.0
#dfs[2]
A B C group
5 6 23 start 1.0
6 7 11 running 1.0
7 8 12 resting 1.0
8 3 13 end 1.0
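Note that group -1 collects the rows that fall between trips (row 4 above); if only complete trips are wanted, they can be dropped before building the list:
df = df[df["group"].ne(-1)]
dfs = [df.loc[df["group"].eq(grp)] for grp in df.groupby("group").groups]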
I think you can do it with this line of code:
dfs = [df[start:end+1]
       for start, end in zip(df.index[df['C'] == 'start'],
                             df.index[df['C'] == 'end'])]
Output:
dfs[0]
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
dfs[1]
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end

Pandas Split DataFrame using row index

I want to split a dataframe into unevenly sized chunks using row indices.
The code below:
groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))
works only for a uniform number of rows.
df
a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
l = [2, 5, 7]
df1
1 1 1
2 2 2
df2
3 3 3
4 4 4
5 5 5
df3
6 6 6
7 7 7
df4
8 8 8
You could use a list comprehension after first making small modifications to your list, l.
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
l = [2,5,7]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
a b c
0 1 1 1
1 2 2 2
list_of_dfs[1]
a b c
2 3 3 3
3 4 4 4
4 5 5 5
list_of_dfs[2]
a b c
5 6 6 6
6 7 7 7
list_of_dfs[3]
a b c
7 8 8 8
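A slightly more defensive variant of the same idea uses len(df) as the final boundary, so any rows after the last index in l are always captured even when max(l)+1 != len(df) (a sketch):
bounds = [0] + l + [len(df)]
list_of_dfs = [df.iloc[a:b] for a, b in zip(bounds, bounds[1:])]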
I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
                   'b': np.arange(1, 8),
                   'c': np.arange(1, 8)})
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
last_check = 0
dfs = []
for ind in [2, 5, 7]:
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind
Although list comprehensions are much more efficient than a for loop, the last_check variable is necessary when there is no pattern in your list of indices.
dfs[0]
a b c
0 1 1 1
1 2 2 2
dfs[2]
a b c
5 6 6 6
6 7 7 7
I think this is what you are looking for:
l = [2, 5, 7]
dfs = []
i = 0
for val in l:
    if i == 0:
        temp = df.iloc[:val]
    else:
        temp = df.iloc[l[i-1]:val]
    dfs.append(temp)
    i += 1
# note: rows after the last index in l (here row 7) are not captured
Output:
a b c
0 1 1 1
1 2 2 2
a b c
2 3 3 3
3 4 4 4
4 5 5 5
a b c
5 6 6 6
6 7 7 7
Another Solution:
l = [2, 5, 7]
t = np.arange(l[-1])
l.reverse()
for val in l:
    t[:val] = val
temp = pd.DataFrame(t)
temp = pd.concat([df, temp], axis=1)
for u, v in temp.groupby(0):
    print(v)
Output:
a b c 0
0 1 1 1 2
1 2 2 2 2
a b c 0
2 3 3 3 5
3 4 4 4 5
4 5 5 5 5
a b c 0
5 6 6 6 7
6 7 7 7 7
You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
idx = np.cumsum(np.in1d(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
    print(chunk, '\n')
a b c
0 0 1 2
1 3 4 5
a b c
2 6 7 8
3 9 10 11
4 12 13 14
a b c
5 15 16 17
6 18 19 20
a b c
7 21 22 23
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1]) # print second groupby value
a b c
2 6 7 8
3 9 10 11
4 12 13 14
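Newer NumPy releases deprecate np.in1d in favor of np.isin; the same indexing array can be built with:
idx = np.cumsum(np.isin(np.arange(len(df.index)), L))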

I want to split a dataframe into train and test sets with ranges

import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8],
        [1, 2, 4], [1, 3, 4], [3, 8, 12], [4, 10, 12], [6, 7, 12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two dataframes, df_train and df_test, such that no value of column 'C' appears in both sets. E.g., in column C the element 5 should be either in the training set or the testing set, so the rows [0, 10, 5], [0, 12, 5], [2, 34, 13] will go to the training set or the testing set, but not both. The choice of which elements of column C go where should be made randomly.
I am stuck on this step and cannot proceed.
First sample your df, then groupby C and use cumcount to distinguish the duplicated values within the same group.
s = df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test = df.loc[s[s == 1].index]
train = df.loc[s[s == 0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
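Note that with this approach rows sharing a C value can still land in different sets; in fact every C value appears in both outputs above. If the requirement is that all rows with the same C value stay together, a minimal sketch that randomly samples the unique C values instead (variable names are illustrative):
test_vals = pd.Series(df['C'].unique()).sample(frac=0.5, random_state=0)
df_test = df[df['C'].isin(test_vals)]
df_train = df[~df['C'].isin(test_vals)]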
The question is not entirely clear about what the expected train and test set dataframes should look like. Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
unique_c = df_sorted['C'].unique().tolist()
print(unique_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= unique_c[2]]
val_set = df[df['C'] > unique_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
Of the 11 samples, 6 go to the train set and 5 go to the validation set after the split, so no samples are missing from the two combined dataframes.

Explode multiple columns lists into rows

How to explode the list into rows?
I have the following data frame:
df = pd.DataFrame([
    (1, [1, 2, 3], ['a', 'b', 'c']),
    (2, [4, 5, 6], ['d', 'e', 'f']),
    (3, [7, 8], ['g', 'h'])
])
It is shown in the output as follows:
0 1 2
0 1 [1, 2, 3] [a, b, c]
1 2 [4, 5, 6] [d, e, f]
2 3 [7, 8] [g, h]
I want to have the following output:
0 1 2
0 1 1 a
1 1 2 b
2 1 3 c
3 2 4 d
4 2 5 e
5 2 6 f
6 3 7 g
7 3 8 h
You can use str.len to get the lengths of the lists, repeat the scalar column with numpy.repeat, and flatten the list columns:
from itertools import chain
import numpy as np

df2 = pd.DataFrame({
    0: np.repeat(df.iloc[:, 0].values, df.iloc[:, 1].str.len()),
    1: list(chain.from_iterable(df.iloc[:, 1])),
    2: list(chain.from_iterable(df.iloc[:, 2]))})
print(df2)
0 1 2
0 1 1 a
1 1 2 b
2 1 3 c
3 2 4 d
4 2 5 e
5 2 6 f
6 3 7 g
7 3 8 h
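On pandas 1.3 or newer, since the lists in each row here have matching lengths, the same result is a one-liner with DataFrame.explode:
df2 = df.explode([1, 2]).reset_index(drop=True)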
