Getting aggregate from Pandas for other categories - python

I am trying to perform an aggregate calculation, but I want the calculation to apply to every other category.
For example,
df.groupby(['index']).agg({'data': [func1, func2]})
performs the aggregate calculations func1 and func2 on the data grouped by index, but for each group I want to perform the calculations on all the data that isn't in that group.
For example:
index data
A 1
A 2
A 1
B 2
B 2
B 4
B 4
C 1
C 3
D 4
D 1
For A, I would want the results to be computed from the data in B, C and D.
Is there a novel way to accomplish this?

Well, I actually think I figured this out. Basically, I created a new dataframe and re-indexed it.
value
original_index
A 44
A 65
A 88
B 69
B 11
B 52
C 56
C 42
C 85
D 66
D 77
D 9
Loop through each index value and copy everything not in that index to a new dataframe. Then concat them all together.
l = []
for i in df.index.unique():
    # keep every row whose index label is not i
    d = df[~df.index.isin([i])].copy()
    d['new_index'] = i
    # set_index replaces the old original_index, so no separate drop is needed
    d.set_index('new_index', inplace=True)
    l.append(d)
df2 = pd.concat(l, axis=0)
Output:
value
new_index
A 69
A 11
A 52
A 56
A 42
A 85
A 66
A 77
A 9
B 44
B 65
B 88
B 56
B 42
B 85
B 66
B 77
B 9
C 44
C 65
C 88
C 69
C 11
C 52
C 66
C 77
C 9
D 44
D 65
D 88
D 69
D 11
D 52
D 56
D 42
D 85
Now we can apply our groupby function on the new index and it will return results from values that were originally not in the index.
group_df = df2.groupby(['new_index']).agg({'value' :[func1,func2]})[['value']]
It works, but I'm sure there must be a better way.
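One possible shorter route, sketched here under the assumption that the data sits in two ordinary columns named index and data as in the question: filter out each group in turn and aggregate what remains. The helper name complement_agg is hypothetical, and the string aggregations "sum" and "mean" stand in for func1 and func2.

```python
import pandas as pd

# Sample data from the question; column names "index" and "data" follow it.
df = pd.DataFrame({
    "index": list("AAABBBBCCDD"),
    "data": [1, 2, 1, 2, 2, 4, 4, 1, 3, 4, 1],
})


def complement_agg(frame, key_col, value_col, funcs):
    """Hypothetical helper: aggregate value_col over the rows NOT in each group."""
    rows = {
        key: frame.loc[frame[key_col] != key, value_col].agg(funcs)
        for key in frame[key_col].unique()
    }
    # One row per group, one column per aggregation function.
    return pd.DataFrame(rows).T


result = complement_agg(df, "index", "data", ["sum", "mean"])
print(result)
```

This still loops over the groups, but it avoids building the large duplicated frame. For aggregations that decompose arithmetically, such as a sum, the loop can be skipped entirely with df["data"].sum() - df.groupby("index")["data"].sum().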

Related

average of one wrt another or averageifs in python

I have a pandas df as displayed. I would like to calculate the average Rate by DC and by Brand, which is similar to AVERAGEIF in Excel.
I have tried methods like groupby with mean(), but that does not give the correct results.
Your question is not clear but you may be looking for:
df.groupby(['DC','Brand'])['Rate'].mean()
AVERAGEIF in Excel returns a column which is the same size as your original data. So I think you're looking for groupby(...).transform():
# Sample DF
Brand Rate
0 A 45
1 B 100
2 C 28
3 A 92
4 B 2
5 C 79
6 A 48
7 B 97
8 C 72
9 D 14
10 D 16
11 D 64
12 E 85
13 E 22
Result:
df['Avg Rate by Brand'] = df.groupby('Brand')['Rate'].transform('mean')
print(df)
Brand Rate Avg Rate by Brand
0 A 45 61.666667
1 B 100 66.333333
2 C 28 59.666667
3 A 92 61.666667
4 B 2 66.333333
5 C 79 59.666667
6 A 48 61.666667
7 B 97 66.333333
8 C 72 59.666667
9 D 14 31.333333
10 D 16 31.333333
11 D 64 31.333333
12 E 85 53.500000
13 E 22 53.500000
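To make the difference between the two answers concrete, here is a small sketch using the sample data above. groupby(...).mean() returns one value per group, while transform('mean') returns one value per original row. The DC column from the question is not in the sample, so only Brand is used here; grouping by ['DC', 'Brand'] would work the same way if that column exists.

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": list("ABCABCABCDDDEE"),
    "Rate": [45, 100, 28, 92, 2, 79, 48, 97, 72, 14, 16, 64, 85, 22],
})

# One row per brand: the shape shrinks to the number of groups.
per_brand = df.groupby("Brand")["Rate"].mean()

# One value per original row: the group mean is repeated for every member.
df["Avg Rate by Brand"] = df.groupby("Brand")["Rate"].transform("mean")

print(per_brand)
print(df)
```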

Replicate multiple rows of events for specific IDs multiple times

I have call log data for customers. It looks something like the example below, where ID is the customer ID and A and B are log attributes:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'),
index = ['A','A','A','B','B','C','C','C','D','D'])
df['ID']=df.index
df = df[['ID','A','B']]
ID A B
A A 46 31
A A 99 54
A A 34 9
B B 46 48
B B 7 75
C C 1 25
C C 71 40
C C 74 53
D D 57 17
D D 19 78
I want to replicate each set of events for each ID based on a number of slots. For example, if the slot value is 2, then all events for ID "A" should be replicated slot-1 times, giving two copies in total:
ID A B
A A 46 31
A A 99 54
A A 34 9
A A 46 31
A A 99 54
A A 34 9
and a new Index column should be created indicating which slot the replicated values belong to:
ID A B Index
A 46 31 A-1
A 99 54 A-1
A 34 9 A-1
A 46 31 A-2
A 99 54 A-2
A 34 9 A-2
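I have tried the following solution:
slots = 2
# assumption: maps each ID to its number of rows (not defined in the original post)
unique_rec_counts_dict = df.groupby('ID').size().to_dict()
idx = pd.Index(list(range(1, slots + 1)))
parts = []
for i in unique_rec_counts_dict:
    b = df.loc[df.ID == i, :]
    # DataFrame.append was removed in pandas 2.0; pd.concat builds the same copies
    b = pd.concat([b] * slots, ignore_index=True)
    b['Index'] = str(i) + '-' + idx.repeat(unique_rec_counts_dict[i]).astype(str)
    parts.append(b)
nba_data = pd.concat(parts)
It gives me the expected output, but it does not scale when the number of slots increases and the number of customers grows to the order of 10k.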
ID A B Index
0 A 46 31 A-1
1 A 99 54 A-1
2 A 34 9 A-1
3 A 46 31 A-2
4 A 99 54 A-2
5 A 34 9 A-2
0 B 46 48 B-1
1 B 7 75 B-1
2 B 46 48 B-2
3 B 7 75 B-2
0 C 1 25 C-1
1 C 71 40 C-1
2 C 74 53 C-1
3 C 1 25 C-2
4 C 71 40 C-2
5 C 74 53 C-2
0 D 57 17 D-1
1 D 19 78 D-1
2 D 57 17 D-2
3 D 19 78 D-2
I think it's taking a long time because of the loop. Any vectorized solution would be really helpful.
You can try:
slots = 2
new_df = pd.concat(df.assign(Index=f'_{i}') for i in range(1, slots+1))
new_df['Index'] = new_df['ID'] + new_df['Index']
Output:
ID A B Index
A A 48 61 A_1
A A 70 13 A_1
A A 36 23 A_1
B B 22 66 B_1
B B 92 95 B_1
C C 53 9 C_1
C C 41 57 C_1
C C 88 93 C_1
D D 76 82 D_1
D D 11 36 D_1
A A 48 61 A_2
A A 70 13 A_2
A A 36 23 A_2
B B 22 66 B_2
B B 92 95 B_2
C C 53 9 C_2
C C 41 57 C_2
C C 88 93 C_2
D D 76 82 D_2
D D 11 36 D_2
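If the exact layout from the question matters (labels such as A-1 and all copies of one ID kept together), the same idea can be extended with a sort. This is only a sketch: the helper columns slot and row are hypothetical names introduced here and dropped again at the end, and the sample values are the ones printed in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": list("AAABBCCCDD"),
    "A": [46, 99, 34, 46, 7, 1, 71, 74, 57, 19],
    "B": [31, 54, 9, 48, 75, 25, 40, 53, 17, 78],
})
slots = 2

# One full copy of the frame per slot; "row" remembers the original row order.
copies = [df.assign(slot=s, row=np.arange(len(df))) for s in range(1, slots + 1)]
new_df = pd.concat(copies, ignore_index=True)

# Build the label column in one vectorized string operation.
new_df["Index"] = new_df["ID"] + "-" + new_df["slot"].astype(str)

# Group the copies by ID, then by slot, preserving the original order inside each copy.
new_df = (
    new_df.sort_values(["ID", "slot", "row"])
    .drop(columns=["slot", "row"])
    .reset_index(drop=True)
)
print(new_df)
```

The only Python-level loop runs once per slot rather than once per customer, so the cost of the loop no longer grows with the number of IDs.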

Looping a function with Pandas DataFrames

I have some function that takes a DataFrame and an integer as arguments:
func(df, int)
The function returns a new DataFrame, e.g.:
df2 = func(df,2)
I'd like to write a loop for integers 2-10, resulting in 9 DataFrames. If I do this manually it would look like this:
df2 = func(df,2)
df3 = func(df2,3)
df4 = func(df3,4)
df5 = func(df4,5)
df6 = func(df5,6)
df7 = func(df6,7)
df8 = func(df7,8)
df9 = func(df8,9)
df10 = func(df9,10)
Is there a way to write a loop that does this?
This type of thing is what lists are for.
data_frames = [df]
for i in range(2, 11):
    data_frames.append(func(data_frames[-1], i))
It's a sign of brittle code when you see variable names like df1, df2, df3, etc. Use lists when you have a sequence of related objects to build.
To clarify, this data_frames is a list of DataFrames that can be concatenated with data_frames = pd.concat(data_frames, sort=False), resulting in one DataFrame that combines the original df with everything that results from the loop, correct?
Yup, that's right. If your goal is one final data frame, you can concatenate the entire list at the end to combine the information into a single frame.
Do you mind explaining why data_frames[-1], which takes the last item of the list, returns a DataFrame? Not clear on this.
Because as you're building the list, at all times each entry is a data frame. data_frames[-1] evaluates to the last element in the list, which in this case, is the data frame you most recently appended.
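Putting the answer and the follow-up comments together, a self-contained sketch could look like this. The body of func is a hypothetical stand-in (it just adds the integer to every cell), since the real function is not shown in the question.

```python
import pandas as pd


def func(frame, n):
    """Hypothetical stand-in for the question's func: adds n to every cell."""
    return frame + n


df = pd.DataFrame({"a": [75, 48], "b": [18, 56], "c": [17, 3]})

# data_frames[-1] is always the most recently appended frame,
# so each call feeds the previous result into the next step.
data_frames = [df]
for i in range(2, 11):
    data_frames.append(func(data_frames[-1], i))

# Optional: stack everything into one frame, labelling each step 1..10.
combined = pd.concat(data_frames, keys=list(range(1, 11)))
print(combined)
```

Here step 1 is the original df and steps 2 to 10 correspond to df2 through df10 in the question, so combined.loc[10] retrieves what would have been df10.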
You may try using itertools.accumulate as follows:
sample data
df:
a b c
0 75 18 17
1 48 56 3
import itertools
def func(x, y):
    return x + y
dfs = list(itertools.accumulate([df] + list(range(2, 11)), func))
[ a b c
0 75 18 17
1 48 56 3, a b c
0 77 20 19
1 50 58 5, a b c
0 80 23 22
1 53 61 8, a b c
0 84 27 26
1 57 65 12, a b c
0 89 32 31
1 62 70 17, a b c
0 95 38 37
1 68 76 23, a b c
0 102 45 44
1 75 83 30, a b c
0 110 53 52
1 83 91 38, a b c
0 119 62 61
1 92 100 47, a b c
0 129 72 71
1 102 110 57]
dfs is the list of resulting dataframes, where each one is the previous result with the next integer from 2 to 10 added to it.
If you want to concat them all into one dataframe, use pd.concat:
pd.concat(dfs)
Out[29]:
a b c
0 75 18 17
1 48 56 3
0 77 20 19
1 50 58 5
0 80 23 22
1 53 61 8
0 84 27 26
1 57 65 12
0 89 32 31
1 62 70 17
0 95 38 37
1 68 76 23
0 102 45 44
1 75 83 30
0 110 53 52
1 83 91 38
0 119 62 61
1 92 100 47
0 129 72 71
1 102 110 57
You can use exec with a formatted string:
for i in range(2, 11):
    exec("df{0} = func(df{1}, {0})".format(i, i - 1 if i > 2 else ''))

Dataframe pandas how to pass list as columns

I have two lists, such as:
list_columns = ['a','b','c','d','e','f','g','h','k','l','m','n']
and a list of values
list_values = [11,22,33,44,55,66,77,88,99,100, 111, 222]
I want to create a Pandas dataframe using list_columns as columns.
I tried with df = pd.DataFrame(list_values, columns=list_columns)
but it doesn't work
I get this error: ValueError: Shape of passed values is (1, 12), indices imply (12, 12)
A dataframe is a two-dimensional object. To reflect this, you need to feed a nested list. Each sublist, in this case the only sublist, represents a row.
df = pd.DataFrame([list_values], columns=list_columns)
print(df)
# a b c d e f g h k l m n
# 0 11 22 33 44 55 66 77 88 99 100 111 222
If you supply an index with length greater than 1, Pandas broadcasts for you:
df = pd.DataFrame([list_values], columns=list_columns, index=[0, 1, 2])
print(df)
# a b c d e f g h k l m n
# 0 11 22 33 44 55 66 77 88 99 100 111 222
# 1 11 22 33 44 55 66 77 88 99 100 111 222
# 2 11 22 33 44 55 66 77 88 99 100 111 222
If I understand your question correctly just wrap list_values in brackets so it's a list of lists
list_columns = ['a','b','c','d','e','f','g','h','k','l','m','n']
list_values = [[11,22,33,44,55,66,77,88,99,100, 111, 222]]
pd.DataFrame(list_values, columns=list_columns)
a b c d e f g h k l m n
0 11 22 33 44 55 66 77 88 99 100 111 222
Starting from your flat list, you can also build a single-column frame and transpose it:
df = pd.DataFrame(list_values)
df = df.T
df.columns = list_columns
print(df)
a b c d e f g h k l m n
0 11 22 33 44 55 66 77 88 99 100 111 222
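The answers above all come down to the shape pandas infers from the input. A short sketch that makes this visible: a flat list is read as one column with 12 rows, while a nested list is read as one row with 12 columns. The dict(zip(...)) variant is an extra option not mentioned in the answers, and the variable names are chosen here for illustration.

```python
import pandas as pd

list_columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'l', 'm', 'n']
list_values = [11, 22, 33, 44, 55, 66, 77, 88, 99, 100, 111, 222]

# A flat list becomes a single column, which is why 12 column names do not fit.
column_shape = pd.DataFrame(list_values).shape

# Wrapping the list makes it one row with 12 columns.
df_nested = pd.DataFrame([list_values], columns=list_columns)

# Alternative: pair each name with its value and pass a one-element list of dicts.
df_from_dict = pd.DataFrame([dict(zip(list_columns, list_values))])

print(column_shape)
print(df_nested)
```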

Pandas Keyerror

I have some very simple code:
stats2 = {'a':[1,2,3,4,5,6],
'b':[43,34,65,56,29,76],
'c':[65,67,78,65,45,52],
'cac':['mns','ab','cd','cd','ab','k']}
f2 = pd.DataFrame(stats2)
f2.set_index(['cac'], inplace = True)
print(f2.ix['mns'])
print(f2['mns'])
f2.ix['mns'] works just fine. However, f2['mns'] reports a KeyError. I am trying to understand why it does that. Is that how pandas works? Do I have to use ix even though I have set the index before?
This is your original dataframe:
>>> df
a b c cac
0 1 43 65 mns
1 2 34 67 ab
2 3 65 78 cd
3 4 56 65 cd
4 5 29 45 ab
5 6 76 52 k
>>> df.set_index(['cac'], inplace=True)
>>> df
a b c
cac
mns 1 43 65
ab 2 34 67
cd 3 65 78
cd 4 56 65
ab 5 29 45
k 6 76 52
So, setting the index in pandas simply replaces the previous counter values (0, 1, 2, ..., 5) with the row values (mns, ab, ..., k) of the cac column.
>>> df.ix['mns']
a 1
b 43
c 65
This command searches the index, cac, for rows whose label equals mns and retrieves the corresponding elements.
Note: As mns is not a column name of the dataframe, df['mns'] throws a key error.
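The .ix indexer was deprecated in pandas 0.20 and removed in 1.0, so on a current installation the same distinction can be sketched with .loc. Square brackets on a DataFrame look up column names, while .loc looks up index labels. The variable names row, col, has_col and dup are chosen here for illustration.

```python
import pandas as pd

stats2 = {'a': [1, 2, 3, 4, 5, 6],
          'b': [43, 34, 65, 56, 29, 76],
          'c': [65, 67, 78, 65, 45, 52],
          'cac': ['mns', 'ab', 'cd', 'cd', 'ab', 'k']}
f2 = pd.DataFrame(stats2).set_index('cac')

row = f2.loc['mns']            # label lookup in the index: returns the row as a Series
col = f2['a']                  # bracket lookup: selects a column by name
has_col = 'mns' in f2.columns  # 'mns' is an index label, not a column name
dup = f2.loc['ab']             # a repeated label returns every matching row

print(row)
print(dup)
```

Because 'mns' is not among the column names, f2['mns'] raises a KeyError, exactly as the note above describes.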
