I have a dataframe which describes the status of a person:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
                   'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
                   'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
This dataframe records two trips of the person. I want to split it based on the values of column C, 'start' and 'end'. The other values in column C do not matter.
I can divide the dataframe with the following code:
x = []
y = []
for i in range(len(df)):
    if df['C'][i] == 'start':
        x.append(i)
    elif df['C'][i] == 'end':
        y.append(i)
for i, j in zip(x, y):
    new_df = df.iloc[i:j+1, :]
    print(new_df)
However, I'm wondering whether there is a more efficient way to divide it without a loop, since I have a pretty large dataframe.
I would create a dict using GroupBy.__iter__()
Method 1
start = df['C'].eq('start')
dfs = dict(df.loc[(start.add(df['C'].shift().eq('end')).cumsum() % 2).eq(1)]
             .groupby(start.cumsum())
             .__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
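To see why Method 1 drops the stray row between the two trips, it can help to print the intermediate pieces; this is just the expression above taken apart, with its values shown as comments:
start = df['C'].eq('start')            # True at rows 0 and 5
after_end = df['C'].shift().eq('end')  # True at row 4, the row right after an 'end'
flag = (start.add(after_end).cumsum() % 2).eq(1)
# 0     True
# 1     True
# 2     True
# 3     True
# 4    False   <- the 'running' row between the trips is masked out
# 5     True
# 6     True
# 7     True
# 8     True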
Method 2
start = df['C'].eq('start')
dfs = dict(df.loc[start.where(start)
                       .groupby(df['C'].shift().eq('end').cumsum())
                       .ffill()
                       .notna()]
             .groupby(start.cumsum())
             .__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
Accessing DataFrame
print(dfs[1])
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
print(dfs[2])
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
We can use groupby.get_group
dfs = (df.loc[start.where(start)
.groupby(df['C'].shift()
.eq('end')
.cumsum())
.ffill().notna()]
.groupby(start.cumsum()))
df1=dfs.get_group(1)
df2=dfs.get_group(2)
print(df1)
print(df2)
Details of Method 2
start.where(start)
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Name: C, dtype: float64
df['C'].shift().eq('end').cumsum()
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 1
Name: C, dtype: int64
As you can see, row 4 falls within group 1, and when using groupby.ffill its value remains NaN (no 'start' precedes it within its group), so notna drops it.
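Completing the breakdown, the grouped ffill then turns that lone NaN into the row mask; only row 4 comes out False:
mask = (start.where(start)
             .groupby(df['C'].shift().eq('end').cumsum())
             .ffill()
             .notna())
# 0     True
# 1     True
# 2     True
# 3     True
# 4    False   <- excluded: no 'start' precedes it within its group
# 5     True
# 6     True
# 7     True
# 8     True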
Based on the comments, the starting dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
                   'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
                   'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
Then:
for g in df.groupby(df.assign(tmp=(df['C'] == 'start'))['tmp'].cumsum()):
    m = (g[1]['C'] == 'end').shift().fillna(False).cumsum() == 0
    print(g[1][m])
Prints:
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
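If you want to keep the pieces rather than print them, the same loop can fill a dict; this is just a collected variant of the code above (the assign/tmp step is equivalent to calling cumsum on the boolean mask directly):
trips = {}
for key, grp in df.groupby((df['C'] == 'start').cumsum()):
    m = (grp['C'] == 'end').shift().fillna(False).cumsum() == 0
    trips[key] = grp[m]
print(trips[1])  # first trip, rows 0-3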
You can use:
idx = zip(df[df['C'] == 'start'].index, df[df['C'] == 'end'].index)
dfs = [df.loc[i:j] for i, j in idx]
Using str.extract, cumsum and groupby, then holding your results in a dictionary. (Note: this answer assumes an earlier version of the example where the start marker is 'A' and the end marker is 'C', which matches the output below.)
df_dict = {}
counter = 0
for group, data in df.assign(
    g=df["C"].str.extract("(A|C)", expand=False).bfill().ne("C").cumsum()
).groupby("g"):
    counter += 1
    df_dict[counter] = data.drop('g', axis=1)
df_dict[1]
A B C
0 1 6 A
1 2 7 B
2 3 8 D
3 4 9 C
df_dict[2]
A B C
4 5 10 A
5 6 11 B
6 7 12 E
7 8 13 C
Try:
import numpy as np

df["group"] = df.groupby("C").cumcount()
df.loc[df["C"].ne("start"), "group"] = None
df["group"] = np.where(np.logical_and(df["C"].shift(1).eq("end"), df["C"].ne("start")), -1, df["group"])
df["group"] = df["group"].ffill()
dfs = [df.loc[df["group"].eq(grp)] for grp in df.groupby("group").groups]
Outputs:
#dfs[0]
A B C group
4 5 10 running -1.0
#dfs[1]
A B C group
0 1 6 start 0.0
1 2 7 running 0.0
2 3 8 running 0.0
3 4 9 end 0.0
#dfs[2]
A B C group
5 6 23 start 1.0
6 7 11 running 1.0
7 8 12 resting 1.0
8 3 13 end 1.0
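If you only want the actual trips, you can skip the placeholder group when collecting; group -1 only marks rows that fall outside any start/end pair, like the stray 'running' row in dfs[0]:
trips = [frame.drop(columns="group")
         for grp, frame in df.groupby("group")
         if grp != -1]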
I think you can do it with this line of code:
dfs = [ df[start:end+1]
for start, end in zip(df.index[df['C'] == 'start'],
df.index[df['C'] == 'end'])]
Output:
dfs[0]
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
dfs[1]
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
Related
I would like to replace values in a column, but only the values seen after a specific value.
For example, I have the following dataset:
In [108]: df = pd.DataFrame([[12, 13, 14, 15, 16, 17], [4, 10, 5, 6, 1, 3],
                             [1, 3, 5, 4, 9, 1], [2, 4, 1, 8, 3, 4], [4, 2, 6, 7, 1, 8]],
                            index=['ID', 'time', 'A', 'B', 'C']).T
In [109]: df
Out[109]:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 4 8 7
4 16 1 9 3 1
5 17 3 1 4 8
and I want to change, for column "A", all the values that come after a 5 to a 1; for column "B", all the values that come after a 1 to a 6; and for column "C", all the values that come after a 7 to a 5. So it will look like this:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
I know that I could use where to get sort of a similar effect with a condition like df["A"] = np.where(x != 5, 1, x), but obviously this will change the values before the 5 as well. I can't think of anything else at the moment.
Thanks for the help.
Use DataFrame.mask with values shifted by DataFrame.shift and compared against the dictionary; DataFrame.cummax then propagates each first True down to all following rows:
df=pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],
[1, 3,5,4,9,1],[2, 4, 1,8,3,4], [4, 2, 6,7,1,8]],
index=['ID','time','A', 'B', 'C']).T
after = {'A':5, 'B':1, 'C': 7}
new = {'A':1, 'B':6, 'C': 5}
cols = list(after.keys())
s = pd.Series(new)
df[cols] = df[cols].mask(df[cols].shift().eq(after).cummax(), s, axis=1)
print (df)
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
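To unpack the chain, printing the shifted comparison after cummax shows exactly which cells get replaced; row 3 is the first row after the trigger values in A and B, while C only triggers one row later:
print(df[cols].shift().eq(after).cummax())
#        A      B      C
# 0  False  False  False
# 1  False  False  False
# 2  False  False  False
# 3   True   True  False
# 4   True   True   True
# 5   True   True   True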
Current dataframe:
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
desired dataframe:
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
It's a multi-index dataframe, and I want a dynamic way to collapse the repeated column headers into one wherever they repeat.
The two dataframes are exactly the same. If you want to change the style of the display you can do the following:
df = pd.DataFrame(np.array([[1, 2, 9, 1, 4],
[2, 3, 9, 2, 4],
[3, 8, 7, 8, 3],
[8, 8, 9, 0, 0]]),
columns=pd.MultiIndex.from_arrays([list('aabbc'), list('klmno')]),
index =list('abcd')
)
default print style:
>>> print(df)
a b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
Alternative style:
>>> with pd.option_context('display.multi_sparse', False):
... print (df)
a a b b c
k l m n o
a 1 2 9 1 4
b 2 3 9 2 4
c 3 8 7 8 3
d 8 8 9 0 0
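If you want the non-sparse style for the whole session rather than a single print, pd.set_option applies the same display option globally:
import pandas as pd
pd.set_option('display.multi_sparse', False)  # affects every subsequent print
print(df)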
I want to split a dataframe by uneven numbers of rows using the row index.
The below code:
groups = df.groupby((np.arange(len(df.index)) / l[1]).astype(int))
works only for a uniform number of rows.
df
a  b  c
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
l = [2, 5, 7]
df1
1  1  1
2  2  2
df2
3  3  3
4  4  4
5  5  5
df3
6  6  6
7  7  7
df4
8  8  8
You could use a list comprehension, with a few small modifications to your list, l, first.
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
l = [2,5,7]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
a b c
0 1 1 1
1 2 2 2
list_of_dfs[1]
a b c
2 3 3 3
3 4 4 4
4 5 5 5
list_of_dfs[2]
a b c
5 6 6 6
6 7 7 7
list_of_dfs[3]
a b c
7 8 8 8
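One small caveat on l_mod: max(l) + 1 equals len(df) here only because the last cut point happens to sit one row before the end. Using len(df) as the final bound is a touch more robust and gives the same output in this example:
l_mod = [0] + l + [len(df)]  # final bound always covers all remaining rows
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod) - 1)]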
I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
'b': np.arange(1, 8),
'c': np.arange(1, 8)})
df
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
last_check = 0
dfs = []
for ind in [2, 5, 7]:
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind
Although a list comprehension would be more concise than the for loop, the last_check variable is necessary since there is no fixed pattern in your list of indices.
dfs[0]
a b c
0 1 1 1
1 2 2 2
dfs[2]
a b c
5 6 6 6
6 7 7 7
I think this is what you are looking for:
l = [2, 5, 7]
dfs = []
for i, val in enumerate(l):
    if i == 0:
        dfs.append(df.iloc[:val])
    else:
        dfs.append(df.iloc[l[i-1]:val])
Output:
a b c
0 1 1 1
1 2 2 2
a b c
2 3 3 3
3 4 4 4
4 5 5 5
a b c
5 6 6 6
6 7 7 7
Another Solution:
l = [2, 5, 7]
t = np.arange(l[-1])
l.reverse()
for val in l:
    t[:val] = val
temp = pd.DataFrame(t)
temp = pd.concat([df, temp], axis=1)
for u, v in temp.groupby(0):
    print(v)
Output:
a b c 0
0 1 1 1 2
1 2 2 2 2
a b c 0
2 3 3 3 5
3 4 4 4 5
4 5 5 5 5
a b c 0
5 6 6 6 7
6 7 7 7 7
You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
idx = np.cumsum(np.in1d(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
    print(chunk, '\n')
a b c
0 0 1 2
1 3 4 5
a b c
2 6 7 8
3 9 10 11
4 12 13 14
a b c
5 15 16 17
6 18 19 20
a b c
7 21 22 23
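Printing idx shows the chunk label assigned to each row; the cut positions 2, 5 and 7 are where the label steps up:
print(idx)
# [0 0 1 1 1 2 2 3]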
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1]) # print second groupby value
a b c
2 6 7 8
3 9 10 11
4 12 13 14
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8],
        [1, 2, 4], [1, 3, 4], [3, 8, 12], [4, 10, 12], [6, 7, 12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two dataframes, df_train and df_test, such that no value of column 'C' appears in both sets. E.g. the element 5 in column C should be in either the training set or the testing set, so the rows [0, 10, 5] and [0, 12, 5] (which share C=5) must go into the same set, not be split across both. The choice of which elements of column C go where should be made randomly.
I am stuck on this step and cannot proceed.
First shuffle your df with sample, then groupby C and take the cumcount, which numbers the duplicated values within each group (note that rows with a cumcount of 2 or more, like index 9 here, end up in neither set):
s=df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test=df.loc[s[s==1].index]
train=df.loc[s[s==0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
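If the requirement is read strictly, so that no value of C may appear in both sets at all, a simple variant is to shuffle the distinct values of C and split those instead of the rows. A minimal sketch; the 0.7 ratio and the seed are arbitrary choices, not from the question:
import numpy as np
rng = np.random.default_rng(0)        # seeded only for reproducibility
vals = df['C'].unique()
rng.shuffle(vals)                     # random order of the distinct C values
cut = int(len(vals) * 0.7)            # e.g. a 70/30 split of the C values
train = df[df['C'].isin(vals[:cut])]
test = df[df['C'].isin(vals[cut:])]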
The question is not so clear about what the expected output of the train and test set dataframes should look like.
Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
unique_c = df_sorted['C'].unique().tolist()
print(unique_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= unique_c[2]]
val_set = df[df['C'] > unique_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
From 11 samples, after the split, 6 samples go to the train set and 5 samples go to the validation set, so no samples are missing from the two dataframes combined. Note that this split is deterministic, following the sorted order of C, rather than random; shuffle unique_c first if the choice must be random.
If I have a pandas dataframe with 4 columns like this:
A B C D
0 2 4 1 9
1 3 2 9 7
2 1 6 9 2
3 8 6 5 4
is it possible to apply df.cumsum() in some way to get the results in a new column next to the existing column, like this:
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
You can create new columns using assign:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
and order the columns with sort_index:
result.sort_index(axis=1)
# A AA B BB C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
Note that depending on the column names, sorting may not produce the desired order. In that case, using reindex is a more robust way of ensuring you obtain the desired column order:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
Here is an example which demonstrates the difference:
import pandas as pd
df = pd.DataFrame({'A': [2, 3, 1, 8], 'A A': [4, 2, 6, 6], 'C': [1, 9, 9, 5], 'D': [9, 7, 2, 4]})
result = df.assign(**{col*2:df[col].cumsum() for col in df})
print(result.sort_index(axis=1))
# A A A A AA A AA C CC D DD
# 0 2 4 4 2 1 1 9 9
# 1 3 2 6 5 9 10 7 16
# 2 1 6 12 6 9 19 2 18
# 3 8 6 18 14 5 24 4 22
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
print(result)
# A AA A A A AA A C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
@unutbu's way certainly works, but using insert reads better to me. Plus you don't need to worry about sorting/reindexing!
for i, col_name in enumerate(df):
    df.insert(i * 2 + 1, col_name * 2, df[col_name].cumsum())
df
returns
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
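One subtlety worth noting: the loop inserts new columns into df while iterating over it. This appears to be safe because iterating a DataFrame walks the column Index captured when the loop starts, but snapshotting the names first makes that explicit; a defensive variant with the same output:
for i, col_name in enumerate(list(df)):  # list(df) freezes the column names up front
    df.insert(i * 2 + 1, col_name * 2, df[col_name].cumsum())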