I have a DataFrame and I want to group it by its "First" and "Second" columns and then produce the expected output shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
print(df)
Output:
First Second Value_1 Value_2
0 a q 17 70
1 b e 44 47
2 c e 5 56
3 a e 23 58
4 b e 10 76
5 a q 11 67
6 b q 21 84
7 c q 42 67
8 b e 36 53
9 c q 16 63
When I group this DataFrame using groupby, I get the output below:
def func(arr, columns):
    return arr.sort_values(by=columns).drop(columns, axis=1)
df.groupby(['First','Second']).apply(func, columns = ['First','Second'])
Value_1 Value_2
First Second
a e 3 23 58
q 0 17 70
5 11 67
b e 1 44 47
4 10 76
8 36 53
q 6 21 84
c e 2 5 56
q 7 42 67
9 16 63
However, I want the output below:
Expected output:
Value_1 Value_2
First Second
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
It's not necessary to print the "All" label; the point is to print the sum of each group's rows.
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
First Second Value_1 Value_2
0 a q 4 69
1 b e 20 74
2 c e 13 82
3 a e 9 41
4 b e 11 79
5 a q 32 77
6 b q 6 75
7 c q 39 62
8 b e 26 80
9 c q 26 42
def lambda_t(x):
    df = x.sort_values(['First','Second']).drop(['First','Second'], axis=1)
    df.loc['all'] = df.sum()
    return df
df.groupby(['First','Second']).apply(lambda_t)
Value_1 Value_2
First Second
a e 3 9 41
all 9 41
q 0 4 69
5 32 77
all 36 146
b e 1 20 74
4 11 79
8 26 80
all 57 233
q 6 6 75
all 6 75
c e 2 13 82
all 13 82
q 7 39 62
9 26 42
all 65 104
You can try this:
Reset the index in your groupby result:
d1 = df.groupby(['First','Second']).apply(func, columns = ['First','Second']).reset_index()
Then group by 'First' and 'Second' and sum the value columns:
d2 = d1.groupby(['First', 'Second']).sum().reset_index()
Create the 'level_2' column in the new dataframe and concatenate it with the initial one to get the desired result:
d2.loc[:, 'level_2'] = 'All'
pd.concat([d1, d2], axis=0).sort_values(by=['First', 'Second'])
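If you also want the grouped MultiIndex layout shown in the expected output, you could (as a rough sketch, not part of the steps above) set those columns back as the index afterwards:
result = pd.concat([d1, d2], axis=0).sort_values(by=['First', 'Second'])
# level_2 holds the original row label, or 'All' for the subtotal rows
result = result.set_index(['First', 'Second', 'level_2'])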
Not sure about your function; however, you could break this into a few steps:
Create an indexed dataframe, where you append the First and Second columns to the existing index:
df.index = df.index.astype(str).rename("Total")
indexed = df.set_index(["First", "Second"], append=True).reorder_levels(
["First", "Second", "Total"]
)
indexed
Value_1 Value_2
First Second Total
a q 0 17 70
b e 1 44 47
c e 2 5 56
a e 3 23 58
b e 4 10 76
a q 5 11 67
b q 6 21 84
c q 7 42 67
b e 8 36 53
c q 9 16 63
Create an aggregation, grouped by First and Second:
summary = (
df.groupby(["First", "Second"])
.sum()
.assign(Total="All")
.set_index("Total", append=True)
)
summary
Value_1 Value_2
First Second Total
a e All 23 58
q All 28 137
b e All 90 176
q All 21 84
c e All 5 56
q All 58 130
Combine indexed and summary dataframes:
pd.concat([indexed, summary]).sort_index(level=["First", "Second"])
Value_1 Value_2
First Second Total
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values across only the columns whose names match a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried the solution recommended in this question:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me, since it sums columns with exactly the same name (so a simple groupby can accomplish the result), whereas I am trying to sum columns that only match a specific string.
Code to recreate above sample dataset:
import pandas as pd

data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
         ['3', 22,12,15,42], ['4', 46,45,66,54],
         ['5', 16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns=['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Let us do filter
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other columns contain CAP elsewhere in the name, anchor the match at the end with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
    else:
        pass
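The loop works, but the same thing can be written as a single vectorized expression, along the lines of the filter answers above (a sketch that matches any column name containing '_CAP'):
df['sum'] = df.filter(like='_CAP').sum(axis=1)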
I have call log data for customers, which looks something like the below, where ID is the customer ID and A and B are log attributes:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'),
index = ['A','A','A','B','B','C','C','C','D','D'])
df['ID']=df.index
df = df[['ID','A','B']]
ID A B
A A 46 31
A A 99 54
A A 34 9
B B 46 48
B B 7 75
C C 1 25
C C 71 40
C C 74 53
D D 57 17
D D 19 78
I want to replicate each set of events for each ID based on a number of slots. For example, if the slot value is 2, then all events for ID "A" should be replicated slots - 1 more times, so each event appears slots times in total:
ID A B
A A 46 31
A A 99 54
A A 34 9
A A 46 31
A A 99 54
A A 34 9
and a new Index column should be created indicating which slot each replicated row belongs to:
ID A B Index
A 46 31 A-1
A 99 54 A-1
A 34 9 A-1
A 46 31 A-2
A 99 54 A-2
A 34 9 A-2
I have tried the following solution:
slots = 2
nba_data = pd.DataFrame()
idx = pd.Index(list(range(1, slots + 1)))
# unique_rec_counts_dict maps each ID to its number of rows,
# e.g. {'A': 3, 'B': 2, 'C': 3, 'D': 2}; built with something like:
unique_rec_counts_dict = df['ID'].value_counts().to_dict()
for i in unique_rec_counts_dict:
    b = df.loc[df.ID == i, :]
    b = b.append([b] * (slots - 1), ignore_index=True)
    b['Index'] = str(i) + '-' + idx.repeat(unique_rec_counts_dict[i]).astype(str)
    nba_data = nba_data.append(b)
It gives me the expected output, but it is not scalable when the number of slots increases and the number of customers grows to the order of 10k.
ID A B Index
0 A 46 31 A-1
1 A 99 54 A-1
2 A 34 9 A-1
3 A 46 31 A-2
4 A 99 54 A-2
5 A 34 9 A-2
0 B 46 48 B-1
1 B 7 75 B-1
2 B 46 48 B-2
3 B 7 75 B-2
0 C 1 25 C-1
1 C 71 40 C-1
2 C 74 53 C-1
3 C 1 25 C-2
4 C 71 40 C-2
5 C 74 53 C-2
0 D 57 17 D-1
1 D 19 78 D-1
2 D 57 17 D-2
3 D 19 78 D-2
I think it's taking a long time because of the loop. Any vectorized solution would be really helpful.
You can try:
slots = 2
new_df = pd.concat(df.assign(Index=f'_{i}') for i in range(1, slots+1))
new_df['Index'] = new_df['ID'] + new_df['Index']
Output:
ID A B Index
A A 48 61 A_1
A A 70 13 A_1
A A 36 23 A_1
B B 22 66 B_1
B B 92 95 B_1
C C 53 9 C_1
C C 41 57 C_1
C C 88 93 C_1
D D 76 82 D_1
D D 11 36 D_1
A A 48 61 A_2
A A 70 13 A_2
A A 36 23 A_2
B B 22 66 B_2
B B 92 95 B_2
C C 53 9 C_2
C C 41 57 C_2
C C 88 93 C_2
D D 76 82 D_2
D D 11 36 D_2
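If you need the rows grouped by ID (as in the expected output) and with a hyphen separator, one possibility (a rough sketch, not part of the answer above) is a stable sort after the concat:
slots = 2
new_df = pd.concat(df.assign(Index=f'-{i}') for i in range(1, slots + 1))
new_df['Index'] = new_df['ID'] + new_df['Index']
# mergesort is stable, so the original row order is kept within each ID
new_df = new_df.sort_values('ID', kind='mergesort').reset_index(drop=True)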
The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means .append can't be used.
You can use concat with sliced df by loc:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#insert df2 between index values 4 and 5
print (pd.concat([df1.loc[:4], df2, df1.loc[5:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
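For the general case from the question (inserting between rows 10 and 11 of a bigger frame), the same idea can be wrapped in a small helper; the name insert_rows is just illustrative:
def insert_rows(big, small, pos):
    # insert `small` after positional row `pos` of `big`,
    # e.g. pos=10 inserts it between rows 10 and 11
    return pd.concat([big.iloc[:pos + 1], small, big.iloc[pos + 1:]], ignore_index=True)
# e.g. insert_rows(df1, df2, 2) inserts df2 between rows 2 and 3 of df1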
If we have the following data:
import pandas as pd

X = pd.DataFrame({"t":[1,2,3,4,5],"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
X
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
How can I shift the data in a cyclical fashion so that the next step is:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
And then:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
etc.
This should also shift the index values with the row.
I know about pandas' X.shift(), but it doesn't wrap the data around cyclically.
You can combine reindex with np.roll:
X = X.reindex(np.roll(X.index, 1))
Another option is to combine concat with iloc:
shift = 1
X = pd.concat([X.iloc[-shift:], X.iloc[:-shift]])
The resulting output:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
Timings
Using the following setup to produce a larger DataFrame and functions for timing:
df = pd.concat([X]*10**5, ignore_index=True)
def root1(df, shift):
    return df.reindex(np.roll(df.index, shift))

def root2(df, shift):
    return pd.concat([df.iloc[-shift:], df.iloc[:-shift]])

def ed_chum(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)

def divakar1(df, shift):
    return df.iloc[np.roll(np.arange(df.shape[0]), shift)]

def divakar2(df, shift):
    idx = np.mod(np.arange(df.shape[0]) - 1, df.shape[0])
    for _ in range(shift):
        df = df.iloc[idx]
    return df
I get the following timings:
%timeit root1(df.copy(), 25)
10 loops, best of 3: 61.3 ms per loop
%timeit root2(df.copy(), 25)
10 loops, best of 3: 26.4 ms per loop
%timeit ed_chum(df.copy(), 25)
10 loops, best of 3: 28.3 ms per loop
%timeit divakar1(df.copy(), 25)
10 loops, best of 3: 177 ms per loop
%timeit divakar2(df.copy(), 25)
1 loop, best of 3: 4.18 s per loop
You can use np.roll in a custom func:
In [83]:
def roll(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)
roll(X,1)
Out[83]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [84]:
roll(X,2)
Out[84]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
Here we return a new df built from the rolled df values, with the index rolled as well.
You can use numpy.roll:
import numpy as np

nb_iterations = 3  # number of steps you want
for i in range(nb_iterations):
    for col in X.columns:
        X[col] = np.roll(X[col], 1)
Which is equivalent to:
for col in X.columns:
    X[col] = np.roll(X[col], nb_iterations)
See the numpy.roll documentation for details on this useful function.
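Note that this rolls only the column values; the row index stays in place. Since the question also wants the index shifted with the rows, you could roll the index separately as well (a small sketch building on the loop above):
X.index = np.roll(X.index, nb_iterations)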
One approach would be to create such a shifted-down indexing array once and re-use it over and over to index into the rows with .iloc, like so -
idx = np.mod(np.arange(X.shape[0])-1,X.shape[0])
X = X.iloc[idx]
Another way to create idx would be with np.roll: np.roll(np.arange(X.shape[0]), 1).
Sample run -
In [113]: X # Starting version
Out[113]:
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
In [114]: idx = np.mod(np.arange(X.shape[0])-1,X.shape[0]) # Creating once
In [115]: X = X.iloc[idx] # Using idx
In [116]: X
Out[116]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [117]: X = X.iloc[idx] # Re-using idx
In [118]: X
Out[118]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3 ## and so on
Suppose I had this large data frame:
In [31]: df
Out[31]:
A B C D E F G H I J ... Q R S T U V W X Y Z
0 0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
1 26 27 28 29 30 31 32 33 34 35 ... 42 43 44 45 46 47 48 49 50 51
2 52 53 54 55 56 57 58 59 60 61 ... 68 69 70 71 72 73 74 75 76 77
[3 rows x 26 columns]
which you can create using
import numpy as np
import pandas as pd

alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z')+1)]
df = pd.DataFrame(np.arange(3*26).reshape(3, 26), columns=alphabet)
What's the best way to drop all columns between column 'D' and 'R' using the labels of the columns?
I found one ugly way to do it:
df.drop(df.columns[df.columns.get_loc('D'):df.columns.get_loc('R')+1], axis=1)
Here's my entry:
>>> df.drop(df.columns.to_series()["D":"R"], axis=1)
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
By converting df.columns from an Index to a Series, we can take advantage of the ["D":"R"]-style selection:
>>> df.columns.to_series()["D":"R"]
D D
E E
F F
G G
H H
I I
J J
... ...
Q Q
R R
dtype: object
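A similar label-based slice also works directly through loc, without converting to a Series (just another option, not from the original answer):
df.drop(columns=df.loc[:, 'D':'R'].columns)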
Here you are:
print(df.loc[:, 'A':'C'].join(df.loc[:, 'S':'Z']))
Out[1]:
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
Here's another way ...
low, high = df.columns.slice_locs('D', 'R')
drops = df.columns[low:high]
print(df.drop(drops, axis=1))
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
Use numpy for more flexibility ... numpy allows element-wise (lexicographic) comparison of strings:
import numpy as np

arr = np.array(['A', 'B', 'C', 'D'])
print(arr)
print(arr > 'B')
gives:
['A' 'B' 'C' 'D']
[False False  True  True]
More difficult selections are also easily possible:
arr[np.logical_and(arr > 'B', arr < 'D')]
gives:
array(['C'], dtype='<U1')
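Applied to the question's DataFrame, the same element-wise comparison works on the column index, so dropping everything between 'D' and 'R' could look like this (a sketch using the idea above):
cols = df.columns
mask = (cols >= 'D') & (cols <= 'R')
df.drop(columns=cols[mask])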