Split dataframe by certain values in first column? - python

I have a dataframe like this one:
A C1 C2 Total
PRODUCT1 8 11 19
rs1 5 9 14
rs2 2 2 4
rs3 1 0 1
PRODUCT2 21 12 33
rs7 11 7 18
rs2 7 3 10
rs1 3 1 4
rs9 0 1 1
PRODUCT3 2 11 13
rs9 1 6 7
rs5 1 5 6
Column A is made of strings. I want to split my dataframe by the values in this column, specifically at every uppercase word, like this:
df1 =
PRODUCT1 8 11 19
rs1 5 9 14
rs2 2 2 4
rs3 1 0 1
df2 =
PRODUCT2 21 12 33
rs7 11 7 18
rs2 7 3 10
rs1 3 1 4
rs9 0 1 1
df3 =
PRODUCT3 2 11 13
rs9 1 6 7
rs5 1 5 6
Is there an easy way to achieve this?

import pandas as pd
df = pd.DataFrame({'A': ['PRODUCT1', 'rs1', 'rs2', 'rs3', 'PRODUCT2', 'rs7', 'rs2', 'rs1', 'rs9', 'PRODUCT3', 'rs9', 'rs5'], 'C1': [8, 5, 2, 1, 21, 11, 7, 3, 0, 2, 1, 1], 'C2': [11, 9, 2, 0, 12, 7, 3, 1, 1, 11, 6, 5], 'Total': [19, 14, 4, 1, 33, 18, 10, 4, 1, 13, 7, 6]})
for key, group in df.groupby(df['A'].str.isupper().cumsum()):
    print(group)
prints
A C1 C2 Total
0 PRODUCT1 8 11 19
1 rs1 5 9 14
2 rs2 2 2 4
3 rs3 1 0 1
A C1 C2 Total
4 PRODUCT2 21 12 33
5 rs7 11 7 18
6 rs2 7 3 10
7 rs1 3 1 4
8 rs9 0 1 1
A C1 C2 Total
9 PRODUCT3 2 11 13
10 rs9 1 6 7
11 rs5 1 5 6
The idea here is to identify rows which are uppercase:
In [95]: df['A'].str.isupper()
Out[95]:
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
Name: A, dtype: bool
then use cumsum to take a cumulative sum, where True is treated as 1 and False is treated as 0:
In [96]: df['A'].str.isupper().cumsum()
Out[96]:
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 2
9 3
10 3
11 3
Name: A, dtype: int64
These values can be used as group numbers. Pass them to df.groupby to group the DataFrame according to these group numbers. df.groupby(...) returns an iterable, which lets you loop through the sub-groups.
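For example, a minimal sketch of collecting the sub-DataFrames into a dict keyed by group number (the names key and groups are just for illustration):
import pandas as pd

df = pd.DataFrame({'A': ['PRODUCT1', 'rs1', 'rs2', 'rs3',
                         'PRODUCT2', 'rs7', 'rs2', 'rs1', 'rs9',
                         'PRODUCT3', 'rs9', 'rs5'],
                   'Total': [19, 14, 4, 1, 33, 18, 10, 4, 1, 13, 7, 6]})

# running count of uppercase marker rows: 1 for PRODUCT1's block, 2 for PRODUCT2's, ...
key = df['A'].str.isupper().cumsum()

# one sub-DataFrame per product block
groups = {k: g for k, g in df.groupby(key)}
df1, df2, df3 = groups[1], groups[2], groups[3]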

Related

Split dataframe by two repeated values

I have a dataframe which describes the status of a person:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
                   'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
                   'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
This dataframe records two trips of the person. I want to split it based on the values of column C, 'start' and 'end'. The other values in column C do not matter.
I can divide the dataframe with the following code:
x = []
y = []
for i in range(len(df)):
    if df['C'][i] == 'start':
        x.append(i)
    elif df['C'][i] == 'end':
        y.append(i)
for i, j in zip(x, y):
    new_df = df.iloc[i:j+1, :]
    print(new_df)
However, I'm wondering whether there is a more efficient way to divide it without a loop, since I have a pretty large dataframe.
I would create a dict using GroupBy.__iter__()
Method 1
start = df['C'].eq('start')
dfs = dict(df.loc[(start.add(df['C'].shift().eq('end')).cumsum() % 2).eq(1)]
             .groupby(start.cumsum())
             .__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
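A sketch of how the Method 1 mask works: start contributes 1 on each 'start' row and the shifted check contributes 1 on the row after each 'end', so the cumulative total is odd exactly on the rows from a 'start' through its matching 'end' (the booleans are cast to int here to make the addition explicit):
import pandas as pd

df = pd.DataFrame({'C': ['start', 'running', 'running', 'end',
                         'running', 'start', 'running', 'resting', 'end']})
start = df['C'].eq('start')

marker = (start.astype(int) + df['C'].shift().eq('end').astype(int)).cumsum()
print(marker.tolist())
# [1, 1, 1, 1, 2, 3, 3, 3, 3]  -> the stray 'running' row lands in even group 2
print((marker % 2).eq(1).tolist())
# [True, True, True, True, False, True, True, True, True]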
Method 2
start = df['C'].eq('start')
dfs = dict(df.loc[start.where(start)
                       .groupby(df['C'].shift()
                                       .eq('end')
                                       .cumsum())
                       .ffill().notna()]
             .groupby(start.cumsum())
             .__iter__())
#{1: A B C
# 0 1 6 start
# 1 2 7 running
# 2 3 8 running
# 3 4 9 end, 2: A B C
# 5 6 23 start
# 6 7 11 running
# 7 8 12 resting
# 8 3 13 end}
Accessing DataFrame
print(dfs[1])
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
print(dfs[2])
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
We can also use groupby.get_group:
dfs = (df.loc[start.where(start)
                   .groupby(df['C'].shift()
                                   .eq('end')
                                   .cumsum())
                   .ffill().notna()]
         .groupby(start.cumsum()))
df1 = dfs.get_group(1)
df2 = dfs.get_group(2)
print(df1)
print(df2)
Details for Method 2
start.where(start)
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Name: C, dtype: float64
df['C'].shift().eq('end').cumsum()
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 1
Name: C, dtype: int64
As you can see, row 4 falls within group 1, and when using the per-group ffill its value remains NaN, so notna() excludes it.
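A self-contained sketch of that per-group ffill (rebuilding only column C from the question):
import pandas as pd

df = pd.DataFrame({'C': ['start', 'running', 'running', 'end',
                         'running', 'start', 'running', 'resting', 'end']})
start = df['C'].eq('start')

# within each group, leading NaNs are not filled, so the stray
# 'running' row (index 4) stays NaN and is dropped by notna()
filled = (start.where(start)
               .groupby(df['C'].shift().eq('end').cumsum())
               .ffill())
print(filled.notna().tolist())
# [True, True, True, True, False, True, True, True, True]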
Based on the comments, the starting dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 3],
                   'B': [6, 7, 8, 9, 10, 23, 11, 12, 13],
                   'C': ['start', 'running', 'running', 'end', 'running', 'start', 'running', 'resting', 'end']})
Then:
for g in df.groupby(df.assign(tmp=(df['C'] == 'start'))['tmp'].cumsum()):
    m = (g[1]['C'] == 'end').shift().fillna(False).cumsum() == 0
    print(g[1][m])
Prints:
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end
You can use:
idx = zip(df[df['C'] == 'start'].index, df[df['C'] == 'end'].index)
dfs = [df.loc[i:j] for i, j in idx]
Using str.extract, cumsum and groupby, then holding your results in a dictionary. (Note: this answer uses its own sample dataframe, in which 'A' marks a start row and 'C' an end row.)
df_dict = {}
counter = 0
for group, data in df.assign(
        g=df["C"].str.extract("(A|C)", expand=False).bfill().ne("C").cumsum()
).groupby("g"):
    counter += 1
    df_dict[counter] = data.drop('g', axis=1)
df_dict[1]
A B C
0 1 6 A
1 2 7 B
2 3 8 D
3 4 9 C
df_dict[2]
A B C
4 5 10 A
5 6 11 B
6 7 12 E
7 8 13 C
Try:
import numpy as np

df["group"] = df.groupby("C").cumcount()
df.loc[df["C"].ne("start"), "group"] = None
df["group"] = np.where(np.logical_and(df["C"].shift(1).eq("end"), df["C"].ne("start")), -1, df["group"])
df["group"] = df["group"].ffill()
dfs = [df.loc[df["group"].eq(grp)] for grp in df.groupby("group").groups]
Outputs:
#dfs[0]
A B C group
4 5 10 running -1.0
#dfs[1]
A B C group
0 1 6 start 0.0
1 2 7 running 0.0
2 3 8 running 0.0
3 4 9 end 0.0
#dfs[2]
A B C group
5 6 23 start 1.0
6 7 11 running 1.0
7 8 12 resting 1.0
8 3 13 end 1.0
I think you can do it with this one line of code:
dfs = [df[start:end+1]
       for start, end in zip(df.index[df['C'] == 'start'],
                             df.index[df['C'] == 'end'])]
Output:
dfs[0]
A B C
0 1 6 start
1 2 7 running
2 3 8 running
3 4 9 end
dfs[1]
A B C
5 6 23 start
6 7 11 running
7 8 12 resting
8 3 13 end

Pandas: join dataframe composed of different iterations

I have a dataframe containing multiple data series with 2 columns (0, 1). The data is composed of different iterations of a measurement and is structured like so:
df = pd.DataFrame({
    0: ['user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10],
    1: ['iteration=0', 'y', 5, 7, 9, 12, 'iteration=1', 'y', 20, 8, 12, 12, 'iteration=2', 'y', 3, 17, 19, 112]
})
0 user iteration=0
1 x y
2 1 5
3 4 7
4 7 9
5 10 12
6 user iteration=1
7 x y
8 1 20
9 4 8
10 7 12
11 10 12
12 user iteration=2
13 x y
14 1 3
15 4 17
16 7 19
17 10 112
I want to plot x vs y grouped by iteration.
I am trying to do this by first creating a single dataframe with the iteration as a column to perform the groupby on:
1 x y iteration
2 1 5 0
3 4 7 0
4 7 9 0
5 10 12 0
8 1 20 1
9 4 8 1
10 7 12 1
11 10 12 1
14 1 3 2
15 4 17 2
16 7 19 2
17 10 112 2
To create this joined dataframe, I implemented this code:
meta = df.loc[df[0] == 'user']
lst = []
ind = 0
for index, row in meta.iterrows():
    if index == 0:  # skip the first marker row so the loop starts from the second
        continue
    splitvalue = meta.loc[ind][1].split('=')[1]
    print(splitvalue)
    temp = df.iloc[ind:index]  # slice the block between two 'user' rows (was: temp = temp.iloc[...])
    temp['iteration'] = splitvalue
    ind = index
    lst.append(temp)
pd.concat(lst)  # note: the block after the last 'user' row is never appended here
Is there a way to create this joined dataframe without creating lists of subdataframes ? Or is there a way to directly plot from the original dataframe ?
You can use:
numeric = ~pd.Series([isinstance(key, str) for key in df[0]])
iterations = df[1].where(df[1].str.contains('=').fillna(False)).ffill()
iterations = [int(key.replace('iteration=', '')) for key in iterations]
df['iterations'] = iterations
df = df.loc[numeric]
df.columns = ['x', 'y', 'iteration']
df.reset_index(drop=True, inplace=True)
print(df)
x y iteration
0 1 5 0
1 4 7 0
2 7 9 0
3 10 12 0
4 1 20 1
5 4 8 1
6 7 12 1
7 10 12 1
8 1 3 2
9 4 17 2
10 7 19 2
11 10 112 2
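From there, a minimal plotting sketch (assuming matplotlib is available; the frame is rebuilt here so the snippet runs standalone):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': [1, 4, 7, 10] * 3,
                   'y': [5, 7, 9, 12, 20, 8, 12, 12, 3, 17, 19, 112],
                   'iteration': [0] * 4 + [1] * 4 + [2] * 4})

# one line per iteration, x vs y
for it, grp in df.groupby('iteration'):
    plt.plot(grp['x'], grp['y'], marker='o', label=f'iteration={it}')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()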

Python Group BY Cumsum

I have this DataFrame:
Value Month
0 1
1 2
8 3
11 4
12 5
17 6
0 7
0 8
0 9
0 10
1 11
2 12
7 1
3 2
1 3
0 4
0 5
And I want to create a new variable "Cumsum" like this:
Value Month Cumsum
0 1 0
1 2 1
8 3 9
11 4 20
12 5 32
17 6
0 7
0 8 ...
0 9
0 10
1 11
2 12
7 1 7
3 2 10
1 3 11
0 4 11
0 5 11
Sorry if my code is not clean; I failed to include my dataframe...
My problem is that I do not have only 12 lines (1 line per month) but many more lines.
On the other hand, I know that my table is tidy, and I want the cumulative amount up to the 12th month, restarting whenever month 1 appears.
Thank you for your help.
Try:
df['Cumsum'] = df.groupby((df.Month == 1).cumsum())['Value'].cumsum()
print(df)
Value Month Cumsum
0 0 1 0
1 1 2 1
2 8 3 9
3 11 4 20
4 12 5 32
5 17 6 49
6 0 7 49
7 0 8 49
8 0 9 49
9 0 10 49
10 1 11 50
11 2 12 52
12 7 1 7
13 3 2 10
14 1 3 11
15 0 4 11
16 0 5 11
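To see why this works, here is a sketch of the intermediate group key: every row where Month equals 1 bumps a running counter, so each "year" gets its own group number and the cumulative sum restarts there.
import pandas as pd

df = pd.DataFrame({'Value': [0, 1, 8, 11, 12, 17, 0, 0, 0, 0, 1, 2, 7, 3, 1, 0, 0],
                   'Month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5]})

key = (df.Month == 1).cumsum()
print(key.tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

df['Cumsum'] = df.groupby(key)['Value'].cumsum()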
code:
df = pd.DataFrame({'value': [0, 1, 8, 11, 12, 17, 0, 0, 0, 0, 1, 2, 7, 3, 1, 0, 0],
                   'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5]})
temp = int(len(df) / 12)
for i in range(temp + 1):
    start = i * 12
    if i < temp:
        end = (i + 1) * 12 - 1
        df.loc[start:end, 'cumsum'] = df.loc[start:end, 'value'].cumsum()
    else:
        df.loc[start:, 'cumsum'] = df.loc[start:, 'value'].cumsum()
print(df)
print(df)
output:
value month cumsum
0 0 1 0.0
1 1 2 1.0
2 8 3 9.0
3 11 4 20.0
4 12 5 32.0
5 17 6 49.0
6 0 7 49.0
7 0 8 49.0
8 0 9 49.0
9 0 10 49.0
10 1 11 50.0
11 2 12 52.0
12 7 1 7.0
13 3 2 10.0
14 1 3 11.0
15 0 4 11.0
16 0 5 11.0

I want to split dataframe into train and test set with ranges

import pandas as pd
import numpy as np
data=[]
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12],[4,10,12],[6,7,12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two data frames, df_train and df_test, such that no value of column 'C' appears in both sets. E.g., the element 5 in column C should be either in the training set or the testing set, so the rows [0, 10, 5], [0, 12, 5], [2, 34, 13] will go either into the training set or the testing set, but not into both. This choice of elements of column C should be made randomly.
I am stuck on this step and cannot proceed.
First sample your df to shuffle it, then groupby 'C' and take the cumcount, which distinguishes the duplicated values within each group.
s = df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test = df.loc[s[s == 1].index]
train = df.loc[s[s == 0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
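Note that this split places duplicated C values in different sets. If instead every distinct C value should land wholly in one set (as the example rows with C equal to 5 suggest), a minimal sketch of sampling the unique C values as whole groups (the seed and the half/half split are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13],
                   [4, 5, 8], [2, 4, 8], [1, 2, 4], [1, 3, 4],
                   [3, 8, 12], [4, 10, 12], [6, 7, 12]],
                  columns=['A', 'B', 'C'])

rng = np.random.default_rng(0)  # seeded only for reproducibility

# send half of the distinct C values to the test set, whole groups at a time
unique_c = df['C'].unique()
test_c = rng.choice(unique_c, size=len(unique_c) // 2, replace=False)

df_test = df[df['C'].isin(test_c)]
df_train = df[~df['C'].isin(test_c)]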
The question is not entirely clear about what the expected output of the train and test set dataframes should look like. Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
unique_c = df_sorted['C'].unique().tolist()
print(unique_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= unique_c[2]]
val_set = df[df['C'] > unique_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
From 11 samples, after the split, 6 samples go to the train set and 5 to the validation set, so no samples are missing when the two dataframes are combined.

Multiple pandas columns

If I have a pandas dataframe with 4 columns like this:
A B C D
0 2 4 1 9
1 3 2 9 7
2 1 6 9 2
3 8 6 5 4
is it possible to apply df.cumsum() in some way to get the results in a new column next to each existing column, like this:
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
You can create new columns using assign:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
and order the columns with sort_index:
result.sort_index(axis=1)
# A AA B BB C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
Note that depending on the column names, sorting may not produce the desired order. In that case, using reindex is a more robust way of ensuring you obtain the desired column order:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
Here is an example which demonstrates the difference:
import pandas as pd
df = pd.DataFrame({'A': [2, 3, 1, 8], 'A A': [4, 2, 6, 6], 'C': [1, 9, 9, 5], 'D': [9, 7, 2, 4]})
result = df.assign(**{col*2:df[col].cumsum() for col in df})
print(result.sort_index(axis=1))
#    A  A A  A AA A  AA  C  CC  D  DD
# 0  2    4       4   2  1   1  9   9
# 1  3    2       6   5  9  10  7  16
# 2  1    6      12   6  9  19  2  18
# 3  8    6      18  14  5  24  4  22
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
print(result)
#    A  AA  A A  A AA A  C  CC  D  DD
# 0  2   2    4       4  1   1  9   9
# 1  3   5    2       6  9  10  7  16
# 2  1   6    6      12  9  19  2  18
# 3  8  14    6      18  5  24  4  22
@unutbu's way certainly works, but using insert reads better to me. Plus, you don't need to worry about sorting/reindexing!
for i, col_name in enumerate(df):
    df.insert(i * 2 + 1, col_name * 2, df[col_name].cumsum())
df
returns
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
