If I have a pandas dataframe with 4 columns like this:
A B C D
0 2 4 1 9
1 3 2 9 7
2 1 6 9 2
3 8 6 5 4
is it possible to apply df.cumsum() in some way to get the results in new columns, each next to its existing column, like this:
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
You can create new columns using assign:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
and order the columns with sort_index:
result.sort_index(axis=1)
# A AA B BB C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
Note that depending on the column names, sorting may not produce the desired order. In that case, using reindex is a more robust way of ensuring you obtain the desired column order:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
Here is an example which demonstrates the difference:
import pandas as pd
df = pd.DataFrame({'A': [2, 3, 1, 8], 'A A': [4, 2, 6, 6], 'C': [1, 9, 9, 5], 'D': [9, 7, 2, 4]})
result = df.assign(**{col*2:df[col].cumsum() for col in df})
print(result.sort_index(axis=1))
# A A A A AA A AA C CC D DD
# 0 2 4 4 2 1 1 9 9
# 1 3 2 6 5 9 10 7 16
# 2 1 6 12 6 9 19 2 18
# 3 8 6 18 14 5 24 4 22
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
print(result)
# A AA A A A AA A C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
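A side note: col*2 doubles the whole name ('A' becomes 'AA', but a longer name like 'Val' would become 'ValVal'). If a fixed suffix reads better, a variant like this (my naming, not from the original) works the same way:
result = df.assign(**{col + '_cum': df[col].cumsum() for col in df})
result = result.reindex(columns=[c for col in df for c in (col, col + '_cum')])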
@unutbu's way certainly works, but using insert reads better to me. Plus, you don't need to worry about sorting/reindexing!
for i, col_name in enumerate(df):
    # place each cumsum column directly after its source column
    df.insert(i * 2 + 1, col_name * 2, df[col_name].cumsum())
df
returns
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
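One caveat: insert modifies df in place and raises a ValueError if a column with that name already exists, so running the loop twice fails. Starting from a fresh copy of the original four-column frame keeps it intact (a minimal sketch):
result = df.copy()  # work on a copy so the original frame stays unchanged
for i, col_name in enumerate(df):
    result.insert(i * 2 + 1, col_name * 2, result[col_name].cumsum())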
I would like to replace values in a column, but only the values seen after a specific value.
for example, I have the following dataset:
In [108]: df = pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],[1,3,5,4,9,1],[2,4,1,8,3,4],[4,2,6,7,1,8]], index=['ID','time','A','B','C']).T
In [109]: df
Out[109]:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 4 8 7
4 16 1 9 3 1
5 17 3 1 4 8
and I want to change, for column "A", all the values that come after a 5 to a 1; for column "B", all the values that come after a 1 to a 6; and for column "C", all the values that come after a 7 to a 5, so it will look like this:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
I know that I could use where to get a similar effect, e.g. a condition like df["A"] = np.where(x!=5,1,x), but obviously this changes the values before the 5 as well. I can't think of anything else at the moment.
Thanks for the help.
Use DataFrame.mask with values shifted by DataFrame.shift and compared against the dictionary; DataFrame.cummax then keeps the mask True for every row after the first match:
import pandas as pd

df = pd.DataFrame([[12, 13, 14, 15, 16, 17], [4, 10, 5, 6, 1, 3],
                   [1, 3, 5, 4, 9, 1], [2, 4, 1, 8, 3, 4], [4, 2, 6, 7, 1, 8]],
                  index=['ID', 'time', 'A', 'B', 'C']).T
after = {'A':5, 'B':1, 'C': 7}
new = {'A':1, 'B':6, 'C': 5}
cols = list(after.keys())
s = pd.Series(new)
df[cols] = df[cols].mask(df[cols].shift().eq(after).cummax(), s, axis=1)
print(df)
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
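To see what drives the replacement, it helps to print the intermediate mask on the original frame (before it is overwritten): shift pushes each value down one row, eq flags the rows right after the trigger values, and cummax keeps those flags True from there on. For this frame I'd expect:
print(df[cols].shift().eq(after).cummax())
#        A      B      C
# 0  False  False  False
# 1  False  False  False
# 2  False  False  False
# 3   True   True  False
# 4   True   True   True
# 5   True   True   True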
I was wondering if there was a way (possibly using Pandas) to cut a dataset (table below) into groups when the value of B transitions either above or below a set value:
A   B
1   10
2   15
3   12
4   2
5   5
6   3
7   4
8   2
9   14
10  11
For instance if the transition value was 6 then they would be grouped like:
A   B   Group
1   10  A
2   15  A
3   12  A
4   2   B
5   5   B
6   3   B
7   4   B
8   2   B
9   14  C
10  11  C
It's important that there are distinct groups, not just everything above/below 6 ending up in a single group.
Try:
s = df['B'].gt(6)
df['Group'] = s.ne(s.shift()).cumsum()
>>> df
A B Group
0 1 10 1
1 2 15 1
2 3 12 1
3 4 2 2
4 5 5 2
5 6 3 2
6 7 4 2
7 8 2 2
8 9 14 3
9 10 11 3
If you want letters: df['Group'].add(64).apply(chr)
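Breaking that down: s flags rows above the threshold, s.ne(s.shift()) marks the rows where the flag changes (the first row compares against NaN, so it always starts a group), and cumsum turns each change into a new group number:
s = df['B'].gt(6)
print(s.ne(s.shift()).tolist())
# [True, False, False, True, False, False, False, False, True, False]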
Here is my suggestion. Not the shortest, but it works:
import itertools

# 1 where B is above the threshold, 0 otherwise
l = [(i > 6) * 1 for i in df.B.to_list()]
# collapse consecutive equal flags into runs
m = [(i, list(k)) for i, k in itertools.groupby(l)]
vals = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
res = []
for i in range(len(m)):
    # one letter per run, repeated for the run's length
    res.extend([vals[i]] * len(m[i][1]))
df['Group'] = res
>>> print(df)
A B Group
0 1 10 A
1 2 15 A
2 3 12 A
3 4 2 B
4 5 5 B
5 6 3 B
6 7 4 B
7 8 2 B
8 9 14 C
9 10 11 C
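If more than seven groups can occur, the hardcoded vals list runs out; string.ascii_uppercase yields the same letters without spelling them out (a small tweak, not part of the original answer):
import string
vals = list(string.ascii_uppercase)  # 'A' through 'Z'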
I have a dataframe generated by pandas, as follows:
NO CODE
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 a
10 a
11 a
12 a
13 b
14 a
15 a
16 a
I want to derive the NUM column from the CODE column. The rule: within each consecutive run of 'a' values, NUM counts up from 1, while 'b' rows keep their code:
NO CODE NUM
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 a 1
15 a 2
16 a 3
Thank you!
Try:
import numpy as np

a_group = df.CODE.eq('a')
df['NUM'] = np.where(a_group,
                     df.groupby(a_group.ne(a_group.shift()).cumsum())
                       .CODE.cumcount() + 1,
                     df.CODE)
on
df = pd.DataFrame({'CODE':list('baaaaaabbaaaabbaa')})
yields
CODE NUM
-- ------ -----
0 b b
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 b b
15 a 1
16 a 2
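For intuition, the grouper built inside the groupby assigns one id per consecutive run of equal CODE values; on the 'baaaaaabbaaaabbaa' frame I'd expect:
a_group = df.CODE.eq('a')
print(a_group.ne(a_group.shift()).cumsum().tolist())
# [1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6]
cumcount then numbers the rows within each run starting at 0, and the +1 makes the count 1-based.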
IIUC
s = df.CODE.eq('b').cumsum()
df['NUM'] = df.CODE.where(df.CODE.eq('b'),
                          s[~df.CODE.eq('b')].groupby(s).cumcount() + 1)
df
Out[514]:
NO CODE NUM
0 1 a 1
1 2 a 2
2 3 a 3
3 4 a 4
4 5 a 5
5 6 a 6
6 7 b b
7 8 b b
8 9 a 1
9 10 a 2
10 11 a 3
11 12 a 4
12 13 b b
13 14 a 1
14 15 a 2
15 16 a 3
I want to split a dataframe into unevenly sized chunks using row indices.
The code below:
groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))
works only for chunks of uniform size.
df
a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
l = [2, 5, 7]
df1
1 1 1
2 2 2
df2
3 3 3
4 4 4
5 5 5
df3
6 6 6
7 7 7
df4
8 8 8
You could use a list comprehension, after a small modification to your list l:
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
l = [2,5,7]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
a b c
0 1 1 1
1 2 2 2
list_of_dfs[1]
a b c
2 3 3 3
3 4 4 4
4 5 5 5
list_of_dfs[2]
a b c
5 6 6 6
6 7 7 7
list_of_dfs[3]
a b c
7 8 8 8
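One thing to watch: max(l)+1 equals len(df) only because the last split index happens to be the second-to-last row. Building the final boundary from len(df) is more robust for arbitrary index lists (my tweak, not part of the original answer):
l_mod = [0] + l + [len(df)]  # always covers the tail rows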
I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
                   'b': np.arange(1, 8),
                   'c': np.arange(1, 8)})
df
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
last_check = 0
dfs = []
for ind in [2, 5, 7]:
    # slice from the end of the previous chunk up to row ind-1 (inclusive)
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind
Although a list comprehension is generally more concise than a for loop, the last_check variable is needed when there is no pattern in your list of indices.
dfs[0]
a b c
0 1 1 1
1 2 2 2
dfs[2]
a b c
5 6 6 6
6 7 7 7
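If the frame has rows past the last index (like the 8-row frame in the question), one extra append after the loop picks them up:
dfs.append(df.loc[last_check:])  # any rows after the last split index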
I think this is what you are looking for.
l = [2, 5, 7]
dfs = []
prev = 0
for val in l:
    # slice from the previous boundary up to (but not including) val
    dfs.append(df.iloc[prev:val])
    prev = val
Output:
a b c
0 1 1 1
1 2 2 2
a b c
2 3 3 3
3 4 4 4
4 5 5 5
a b c
5 6 6 6
6 7 7 7
Another Solution:
l = [2, 5, 7]
t = np.arange(l[-1])
l.reverse()
for val in l:
    # overwrite the first val entries with that boundary as the label
    t[:val] = val
temp = pd.DataFrame(t)
temp = pd.concat([df, temp], axis=1)
for u, v in temp.groupby(0):
    print(v)
Output:
a b c 0
0 1 1 1 2
1 2 2 2 2
a b c 0
2 3 3 3 5
3 4 4 4 5
4 5 5 5 5
a b c 0
5 6 6 6 7
6 7 7 7 7
You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
# flag the split positions; cumsum turns the flags into a chunk id per row
idx = np.cumsum(np.in1d(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
    print(chunk, '\n')
a b c
0 0 1 2
1 3 4 5
a b c
2 6 7 8
3 9 10 11
4 12 13 14
a b c
5 15 16 17
6 18 19 20
a b c
7 21 22 23
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1]) # print second groupby value
a b c
2 6 7 8
3 9 10 11
4 12 13 14
I could not figure out how to solve the following problem.
consider the following data set:
df = pd.DataFrame(data=np.array([['a', 1, 2, 3], ['a', 4, 5, 6],
                                 ['b', 7, 8, 9], ['b', 10, 11, 12]]),
                  columns=['id', 'A', 'B', 'C'])
id A B C
a 1 2 3
a 4 5 6
b 7 8 9
b 10 11 12
I need to group the data by id and, for each group, append the group's first row as additional columns, like the following dataset:
id A B C A B C
a 1 2 3 1 2 3
a 4 5 6 1 2 3
b 7 8 9 7 8 9
b 10 11 12 7 8 9
I would really appreciate your help.
I tried the following, but could not generalize it:
df1 = df.loc[0:0, 'A':'C']
df3 = pd.concat([df, df1], axis=1)
Use groupby + transform('first'), and then concatenate df with the result:
v = df.groupby('id').transform('first')
pd.concat([df, v], axis=1)
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
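Since the concat leaves duplicated column names (A B C A B C), it can help to tag the copied block, for example with add_suffix (the _first suffix is just an illustrative choice):
pd.concat([df, v.add_suffix('_first')], axis=1)
#   id   A   B   C A_first B_first C_first
# 0  a   1   2   3       1       2       3
# ...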
cumcount + where + ffill:
v = df.groupby('id').cumcount() == 0
pd.concat([df, df.iloc[:, 1:].where(v).ffill()], axis=1)
Out[57]:
id A B C A B C
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
One can also try drop_duplicates and merge.
df_unique = df.drop_duplicates("id")
df.merge(df_unique, on="id", how="left")
id A_x B_x C_x A_y B_y C_y
0 a 1 2 3 1 2 3
1 a 4 5 6 1 2 3
2 b 7 8 9 7 8 9
3 b 10 11 12 7 8 9
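The _x/_y suffixes are merge's defaults; passing suffixes keeps the original names on the left-hand columns (the _first suffix is again just an illustrative choice):
df.merge(df_unique, on="id", how="left", suffixes=("", "_first"))
#   id   A   B   C  A_first  B_first  C_first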