I have call log data for customers that looks something like the below, where ID is the customer ID and A and B are log attributes:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 2)), columns=list('AB'),
                  index=['A','A','A','B','B','C','C','C','D','D'])
df['ID']=df.index
df = df[['ID','A','B']]
ID A B
A A 46 31
A A 99 54
A A 34 9
B B 46 48
B B 7 75
C C 1 25
C C 71 40
C C 74 53
D D 57 17
D D 19 78
I want to replicate each set of events for each ID based on a number of slots. For example, if the slot value is 2, then all events for ID "A" should be replicated an additional slots-1 times (so each event appears slots times in total):
ID A B
A A 46 31
A A 99 54
A A 34 9
A A 46 31
A A 99 54
A A 34 9
and a new Index column should be created indicating which slot the replicated values belong to:
ID A B Index
A 46 31 A-1
A 99 54 A-1
A 34 9 A-1
A 46 31 A-2
A 99 54 A-2
A 34 9 A-2
I have tried the following solution:
slots = 2
nba_data = pd.DataFrame()
idx = pd.Index(list(range(1, slots+1)))
# record counts per ID, e.g. {'A': 3, 'B': 2, 'C': 3, 'D': 2}
unique_rec_counts_dict = df['ID'].value_counts().to_dict()
for i in unique_rec_counts_dict:
    b = df.loc[df.ID == i, :]
    b = b.append([b]*(slots-1), ignore_index=True)
    b['Index'] = str(i) + '-' + idx.repeat(unique_rec_counts_dict[i]).astype(str)
    nba_data = nba_data.append(b)
It gives me the expected output, but it is not scalable as the number of slots increases and the number of customers grows into the tens of thousands.
ID A B Index
0 A 46 31 A-1
1 A 99 54 A-1
2 A 34 9 A-1
3 A 46 31 A-2
4 A 99 54 A-2
5 A 34 9 A-2
0 B 46 48 B-1
1 B 7 75 B-1
2 B 46 48 B-2
3 B 7 75 B-2
0 C 1 25 C-1
1 C 71 40 C-1
2 C 74 53 C-1
3 C 1 25 C-2
4 C 71 40 C-2
5 C 74 53 C-2
0 D 57 17 D-1
1 D 19 78 D-1
2 D 57 17 D-2
3 D 19 78 D-2
I think it's taking a long time because of the loop. Any vectorized solution would be really helpful.
You can try:
slots = 2
new_df = pd.concat(df.assign(Index=f'_{i}') for i in range(1, slots+1))
new_df['Index'] = new_df['ID'] + new_df['Index']
Output:
ID A B Index
A A 48 61 A_1
A A 70 13 A_1
A A 36 23 A_1
B B 22 66 B_1
B B 92 95 B_1
C C 53 9 C_1
C C 41 57 C_1
C C 88 93 C_1
D D 76 82 D_1
D D 11 36 D_1
A A 48 61 A_2
A A 70 13 A_2
A A 36 23 A_2
B B 22 66 B_2
B B 92 95 B_2
C C 53 9 C_2
C C 41 57 C_2
C C 88 93 C_2
D D 76 82 D_2
D D 11 36 D_2
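If you need the rows grouped by ID as in your expected output (and the dash-separated 'A-1' labels), a small variation works; a minimal sketch, where slot is a temporary helper column I'm introducing and dropping afterwards:
slots = 2
# tag each full copy of the frame with its slot number, then stack the copies
parts = [df.assign(slot=i) for i in range(1, slots + 1)]
new_df = pd.concat(parts, ignore_index=True)
new_df['Index'] = new_df['ID'] + '-' + new_df['slot'].astype(str)
# a stable sort keeps the original event order within each (ID, slot) pair
new_df = (new_df.sort_values(['ID', 'slot'], kind='stable')
                .drop(columns='slot')
                .reset_index(drop=True))
This stays vectorized: the only Python-level loop is over the slots, not over the 10k customers.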
I have a dataframe and I want to group by its "First" and "Second" columns and then produce the expected output shown below:
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
print(df)
Output:
First Second Value_1 Value_2
0 a q 17 70
1 b e 44 47
2 c e 5 56
3 a e 23 58
4 b e 10 76
5 a q 11 67
6 b q 21 84
7 c q 42 67
8 b e 36 53
9 c q 16 63
When I group this DataFrame using groupby, I get the output below:
def func(arr, columns):
    return arr.sort_values(by=columns).drop(columns, axis=1)

df.groupby(['First','Second']).apply(func, columns=['First','Second'])
Value_1 Value_2
First Second
a e 3 23 58
q 0 17 70
5 11 67
b e 1 44 47
4 10 76
8 36 53
q 6 21 84
c e 2 5 56
q 7 42 67
9 16 63
However, I want the output below:
Expected output:
Value_1 Value_2
First Second
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
It's not necessary to print the "All" label; the point is to print the sum of each group's rows.
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
First Second Value_1 Value_2
0 a q 4 69
1 b e 20 74
2 c e 13 82
3 a e 9 41
4 b e 11 79
5 a q 32 77
6 b q 6 75
7 c q 39 62
8 b e 26 80
9 c q 26 42
def lambda_t(x):
    df = x.sort_values(['First','Second']).drop(['First','Second'], axis=1)
    df.loc['all'] = df.sum()
    return df

df.groupby(['First','Second']).apply(lambda_t)
Value_1 Value_2
First Second
a e 3 9 41
all 9 41
q 0 4 69
5 32 77
all 36 146
b e 1 20 74
4 11 79
8 26 80
all 57 233
q 6 6 75
all 6 75
c e 2 13 82
all 13 82
q 7 39 62
9 26 42
all 65 104
You can try this:
Reset the index in your groupby:
d1 = df.groupby(['First','Second']).apply(func, columns=['First','Second']).reset_index()
Then group by 'First' and 'Second' and sum the value columns:
d2 = d1.groupby(['First', 'Second']).sum().reset_index()
Create the 'level_2' column in the new dataframe and concatenate it with the initial one to get the desired result:
d2.loc[:, 'level_2'] = 'All'
pd.concat([d1, d2], axis=0).sort_values(by=['First', 'Second'])
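If you also want the display to match the expected output's MultiIndex layout, one extra step should do it; a sketch, relying on 'level_2' being the name reset_index assigns to the original unnamed row labels:
out = (pd.concat([d1, d2], axis=0)
         .sort_values(by=['First', 'Second'])
         .set_index(['First', 'Second', 'level_2']))
print(out)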
Not sure about your function; however, you could break this into steps:
Create an indexed dataframe, where you append the First and Second columns to the existing index:
df.index = df.index.astype(str).rename("Total")
indexed = df.set_index(["First", "Second"], append=True).reorder_levels(
    ["First", "Second", "Total"]
)
indexed
Value_1 Value_2
First Second Total
a q 0 17 70
b e 1 44 47
c e 2 5 56
a e 3 23 58
b e 4 10 76
a q 5 11 67
b q 6 21 84
c q 7 42 67
b e 8 36 53
c q 9 16 63
Create an aggregation, grouped by First and Second:
summary = (
    df.groupby(["First", "Second"])
    .sum()
    .assign(Total="All")
    .set_index("Total", append=True)
)
summary
Value_1 Value_2
First Second Total
a e All 23 58
q All 28 137
b e All 90 176
q All 21 84
c e All 5 56
q All 58 130
Combine indexed and summary dataframes:
pd.concat([indexed, summary]).sort_index(level=["First", "Second"])
Value_1 Value_2
First Second Total
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
I have a dataframe:
A B C D E
0 a 34 55 43 aa
1 b 53 77 65 bb
2 c 23 100 34 cc
3 d 54 43 23 dd
4 e 23 67 54 ee
5 f 43 98 23 ff
I need to get the maximum difference between columns B, C, and D and store the value in column A. In row 'a', the maximum difference between columns is 55 - 34 = 21. The data is in a dataframe.
The expected result is
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
Use np.ptp:
# df['A'] = np.ptp(df.loc[:, 'B':'D'], axis=1)
df['A'] = np.ptp(df[['B', 'C', 'D']], axis=1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
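A caveat: DataFrame.ptp was deprecated and later removed (around pandas 1.0), so on recent versions calling np.ptp directly on a DataFrame may fail; if it does, passing the underlying array sidesteps the issue:
# operate on the raw NumPy array instead of the DataFrame
df['A'] = np.ptp(df[['B', 'C', 'D']].to_numpy(), axis=1)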
Or, find the max and min yourself:
df['A'] = df[['B', 'C', 'D']].max(axis=1) - df[['B', 'C', 'D']].min(axis=1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
If performance is important, you can do this in NumPy space:
v = df[['B', 'C', 'D']].values
df['A'] = v.max(1) - v.min(1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
I am trying to perform an aggregate calculation, but I want the calculation to apply to every other category.
So,
df.groupby(['index']).agg({'data': [func1, func2]})
This will perform the aggregate calculations func1 and func2 on the data grouped by index, but I want to perform the calculations on all the data that isn't in each group.
For example:
index data
A 1
A 2
A 1
B 2
B 2
B 4
B 4
C 1
C 3
D 4
D 1
I would want the results for A to be computed from the data in B, C, and D.
Is there a novel way to accomplish this?
Well, I actually think I figured this out. Basically, I created a new dataframe and re-indexed it.
value
original_index
A 44
A 65
A 88
B 69
B 11
B 52
C 56
C 42
C 85
D 66
D 77
D 9
Loop through each index value and copy everything not in that index to a new dataframe, then concat them all together.
l = []
for i in df.index.unique():
    d = df[~df.index.isin([i])].copy()
    d['new_index'] = i
    # set_index replaces the old 'original_index' index, so no separate drop is needed
    d.set_index('new_index', inplace=True)
    l.append(d)
df2 = pd.concat(l, axis=0)
Output:
value
new_index
A 69
A 11
A 52
A 56
A 42
A 85
A 66
A 77
A 9
B 44
B 65
B 88
B 56
B 42
B 85
B 66
B 77
B 9
C 44
C 65
C 88
C 69
C 11
C 52
C 66
C 77
C 9
D 44
D 65
D 88
D 69
D 11
D 52
D 56
D 42
D 85
Now we can apply our groupby function on the new index, and it will return results computed from the values that were originally not in that group.
group_df = df2.groupby(['new_index']).agg({'value': [func1, func2]})[['value']]
It works, but I'm sure there must be a better way.
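For what it's worth, when the aggregations are decomposable (sum, count, and anything built from them, like mean), the loop can be avoided entirely: aggregate per group, then subtract each group's figures from the grand total. A minimal sketch on the question's data, assuming func1/func2 are sum and count:
import pandas as pd

df = pd.DataFrame({'index': list('AAABBBBCCDD'),
                   'data': [1, 2, 1, 2, 2, 4, 4, 1, 3, 4, 1]})

per_group = df.groupby('index')['data'].agg(['sum', 'count'])
# grand total minus each group's figures = aggregate over everything outside that group
complement = per_group.rsub(per_group.sum(), axis=1)
complement['mean'] = complement['sum'] / complement['count']
For aggregations that don't decompose this way (median, mode, ...), the concat approach above still applies.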
The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means .append can't be used.
You can use concat with the df sliced by loc:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
# .loc label slices include both endpoints, so split at :2 and 3:
# to insert between index values 2 and 3 without duplicating a row
print (pd.concat([df1.loc[:2], df2, df1.loc[3:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 1 4 7 1 5 7
4 2 5 8 3 3 4
5 3 6 9 5 6 3
6 58 16 9 93 86 2
7 27 4 31 1 13 83
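A hypothetical general-purpose helper (my naming, not from the answer above) that splices by integer position with iloc, avoiding the inclusive-endpoint quirk of label slicing:
def insert_rows(df, other, pos):
    # return a copy of df with other's rows spliced in before integer position pos
    return pd.concat([df.iloc[:pos], other, df.iloc[pos:]], ignore_index=True)

result = insert_rows(df1, df2, 3)  # e.g. insert before the row at position 3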
If we have the following data:
X = pd.DataFrame({"t":[1,2,3,4,5],"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
X
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
How can I shift the data in a cyclical fashion so that the next step is:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
And then:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
etc.
This should also shift the index values with the row.
I know of pandas' X.shift(), but it doesn't wrap around cyclically.
You can combine reindex with np.roll:
X = X.reindex(np.roll(X.index, 1))
Another option is to combine concat with iloc:
shift = 1
X = pd.concat([X.iloc[-shift:], X.iloc[:-shift]])
The resulting output:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
Timings
Using the following setup to produce a larger DataFrame and functions for timing:
df = pd.concat([X]*10**5, ignore_index=True)
def root1(df, shift):
    return df.reindex(np.roll(df.index, shift))

def root2(df, shift):
    return pd.concat([df.iloc[-shift:], df.iloc[:-shift]])

def ed_chum(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)

def divakar1(df, shift):
    return df.iloc[np.roll(np.arange(df.shape[0]), shift)]

def divakar2(df, shift):
    idx = np.mod(np.arange(df.shape[0]) - 1, df.shape[0])
    for _ in range(shift):
        df = df.iloc[idx]
    return df
I get the following timings:
%timeit root1(df.copy(), 25)
10 loops, best of 3: 61.3 ms per loop
%timeit root2(df.copy(), 25)
10 loops, best of 3: 26.4 ms per loop
%timeit ed_chum(df.copy(), 25)
10 loops, best of 3: 28.3 ms per loop
%timeit divakar1(df.copy(), 25)
10 loops, best of 3: 177 ms per loop
%timeit divakar2(df.copy(), 25)
1 loop, best of 3: 4.18 s per loop
You can use np.roll in a custom func:
In [83]:
def roll(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)

roll(X, 1)
Out[83]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [84]:
roll(X,2)
Out[84]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
Here we return a DataFrame built from the rolled underlying array, with the index rolled as well.
You can use numpy.roll:
import numpy as np

nb_iterations = 3  # number of steps you want
for i in range(nb_iterations):
    for col in X.columns:
        X[col] = np.roll(X[col], 1)
Which is equivalent to:
for col in X.columns:
    X[col] = np.roll(X[col], nb_iterations)
Here is a link to the documentation of this useful function.
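Note that this rolls the column values but leaves the index labels in place; since the question wants the index to travel with the rows, rolling it as well is a one-line addition (a sketch under that assumption):
X.index = np.roll(X.index, nb_iterations)  # roll the labels to match the rolled values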
One approach would be to create such a shifted-down indexing array once and re-use it over and over to index into rows with .iloc, like so -
idx = np.mod(np.arange(X.shape[0])-1,X.shape[0])
X = X.iloc[idx]
Another way to create idx would be with np.roll : np.roll(np.arange(X.shape[0]),1).
Sample run -
In [113]: X # Starting version
Out[113]:
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
In [114]: idx = np.mod(np.arange(X.shape[0])-1,X.shape[0]) # Creating once
In [115]: X = X.iloc[idx] # Using idx
In [116]: X
Out[116]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [117]: X = X.iloc[idx] # Re-using idx
In [118]: X
Out[118]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3 ## and so on
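More generally, the same indexing trick handles an arbitrary shift k (a hypothetical parameter generalizing the hard-coded 1):
k = 2  # roll down by 2 rows, wrapping around
X_rolled = X.iloc[np.roll(np.arange(X.shape[0]), k)]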