Aggregating dataframe to give sum of elements and string of grouped indices - python

I'm trying to use groupby to give me the sum or mean of a number of elements, and a string of the original row indices for each group. So for instance, the dataframe:
>>> df = pd.DataFrame([[1,2,3],[1,3,4],[2,3,4],[2,5,6],[7,8,3],[11,12,13],[11,2,3]],index = ['p','q','r','s','t','u','v'],columns =['a','b','c'])
a b c
p 1 2 3
q 1 3 4
r 2 3 4
s 2 5 6
t 7 8 3
u 11 12 13
v 11 2 3
I would then like df to be grouped by 'a', to give:
b c indices
1 5 7 p,q
2 8 10 r,s
7 8 3 t
11 14 16 u,v
So far, I've tried:
df.groupby('a').agg({'score' : np.sum, 'indices' : lambda x: ",".join(list(x.index.values))})
But I am receiving an error because 'indices' does not exist as a column. Can anyone advise how to accomplish what I'm trying to do?
Thanks

The way aggregation works is that you pass a dict whose keys are pre-existing column names and whose values are the functions to apply to those columns.
So to get the sums the way you want, you do the following:
>>> grouped = df.groupby('a')
>>> grouped.agg({'b' : np.sum, 'c' : np.sum}).head()
c b
a
1 7 5
2 10 8
7 3 8
11 16 14
But you also want the combined row labels in a third column, so you need to add that column before you group. The values you put in it don't actually matter: the lambda below reads the group's index labels, so the column only has to exist for agg to map over. Here is the full code:
df['indices'] = range(len(df))  # placeholder values; only the column name matters
grouped = df.groupby('a')
final = grouped.agg({'b': np.sum, 'c': np.sum,
                     'indices': lambda x: ",".join(list(x.index.values))})
Then you get the following result:
>>> final.head()
indices c b
a
1 p,q 7 5
2 r,s 10 8
7 t 3 8
11 u,v 16 14
If you have any further questions, feel free to comment.
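As a side note, a minimal sketch of an alternative (assuming pandas >= 0.25 for named aggregation): move the row labels into a real column with reset_index, and then no placeholder column is needed at all.
>>> out = (df.reset_index()
...          .groupby('a')
...          .agg(b=('b', 'sum'), c=('c', 'sum'),
...               indices=('index', lambda s: ','.join(s))))  # 'index' is the column reset_index creates
>>> out
     b   c indices
a
1    5   7     p,q
2    8  10     r,s
7    8   3       t
11  14  16     u,v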

Related

How to sum columns with a duplicate name with Pandas?

I have a dataframe with duplicate column names and I would like to sum these columns.
>df
A B A B
1 12 2 4 1
2 10 5 4 9
3 2 1 4 8
4 2 4 3 8
What I would like is something like this:
A B
1 16 3
2 14 14
3 6 9
4 5 12
I can select the duplicate columns in a loop, but I don't know how to remove those columns and recreate a new column with the summed values. Is there a more elegant way?
col = list(df.columns)
dup = list(set([x for x in col if col.count(x) > 1]))
for d in dup:
    sum = df[d].sum(axis=1)
Let us try
sum_df = df.sum(level=0, axis=1)
(note that sum(level=...) was deprecated in pandas 1.x and removed in 2.0, which is why the groupby form below is preferred)
Try this
df.groupby(lambda x:x, axis=1).sum()
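Both one-liners group the columns by their (duplicated) name and sum within each group. A minimal runnable sketch; since axis=1 groupbys are deprecated in recent pandas, the transpose form is the more future-proof spelling:
import pandas as pd

df = pd.DataFrame([[12, 2, 4, 1],
                   [10, 5, 4, 9],
                   [2, 1, 4, 8],
                   [2, 4, 3, 8]],
                  columns=['A', 'B', 'A', 'B'], index=[1, 2, 3, 4])

# Transpose so the duplicated names become the row index, group on it, sum, transpose back.
summed = df.T.groupby(level=0).sum().T
print(summed)
#     A   B
# 1  16   3
# 2  14  14
# 3   6   9
# 4   5  12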

How do I convert my 2D numpy array to a pandas dataframe with given categories?

I have an array called 'values' which features 2 columns of mean reaction time data from 10 individuals. The first column refers to data collected for a single individual in condition A, the second for that same individual in condition B:
array([[451.75 , 488.55555556],
[552.44444444, 590.40740741],
[629.875 , 637.62962963],
[454.66666667, 421.88888889],
[637.16666667, 539.94444444],
[538.83333333, 516.33333333],
[463.83333333, 448.83333333],
[429.2962963 , 497.16666667],
[524.66666667, 458.83333333]])
I would like to plot these data using seaborn, to display the mean values and connected single values for each individual across the two conditions. What is the simplest way to convert the array 'values' into a 3 column DataFrame, whereby one column features all the values, another features a label distinguishing that value as condition A or condition B, and a final column which provides a number for each individual (i.e., 1-10)? For example, as follows:
Value Condition Individual
451.75 A 1
488.56 B 1
552.44 A 2
...etc
melt
You can do that using pd.melt:
pd.DataFrame(values, columns=['A','B']).reset_index().melt(id_vars='index')\
    .rename(columns={'index':'Individual'})
Individual variable value
0 0 A 451.750000
1 1 A 552.444444
2 2 A 629.875000
3 3 A 454.666667
4 4 A 637.166667
5 5 A 538.833333
6 6 A 463.833333
7 7 A 429.296296
8 8 A 524.666667
9 0 B 488.555556
10 1 B 590.407407
11 2 B 637.629630
12 3 B 421.888889
13 4 B 539.944444
14 5 B 516.333333
15 6 B 448.833333
16 7 B 497.166667
17 8 B 458.833333
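If you want the exact column names and 1-based individual numbers from the question, a small follow-up sketch (assuming the array is bound to values, as in the question):
out = (pd.DataFrame(values, columns=['A', 'B'])
         .reset_index()
         .melt(id_vars='index', var_name='Condition', value_name='Value'))
out['Individual'] = out['index'] + 1   # 1-based instead of 0-based
out = out[['Value', 'Condition', 'Individual']]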
This should work:
import pandas as pd
import numpy as np
np_array = np.array([[451.75, 488.55555556],
                     [552.44444444, 590.40740741],
                     [629.875, 637.62962963],
                     [454.66666667, 421.88888889],
                     [637.16666667, 539.94444444],
                     [538.83333333, 516.33333333],
                     [463.83333333, 448.83333333],
                     [429.2962963, 497.16666667],
                     [524.66666667, 458.83333333]])
pd_df = pd.DataFrame(np_array, columns=["A", "B"])
num_individuals = len(pd_df.index)
pd_df = pd_df.melt()
pd_df["INDIVIDUAL"] = [(i)%(num_individuals) + 1 for i in pd_df.index]
pd_df
variable value INDIVIDUAL
0 A 451.750000 1
1 A 552.444444 2
2 A 629.875000 3
3 A 454.666667 4
4 A 637.166667 5
5 A 538.833333 6
6 A 463.833333 7
7 A 429.296296 8
8 A 524.666667 9
9 B 488.555556 1
10 B 590.407407 2
11 B 637.629630 3
12 B 421.888889 4
13 B 539.944444 5
14 B 516.333333 6
15 B 448.833333 7
16 B 497.166667 8
17 B 458.833333 9
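For the seaborn part of the question, a hedged sketch (assuming a long-format frame out with columns Value, Condition and Individual, as built above): draw one line per individual and overlay the condition means.
import seaborn as sns
import matplotlib.pyplot as plt

# One thin line per individual across the two conditions ...
sns.lineplot(data=out, x='Condition', y='Value',
             units='Individual', estimator=None, color='grey', alpha=0.5)
# ... and the condition means (with confidence intervals) on top.
sns.pointplot(data=out, x='Condition', y='Value', color='black')
plt.show()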

Selecting the top 50% of names from the columns of a pandas dataframe

I have a pandas dataframe that looks like this. The rows and the columns have the same name.
name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10
I can get the 5 largest values with df['column_name'].nlargest(n=5), but if I have to return the top 50% in descending order, is there anything inbuilt in pandas for it, or do I have to write a function? How can I get them? I am quite new to python. Please help me out.
UPDATE: So let's take column a into consideration; it has the values 10, 5, -, 6, 8, 3 and 4. I have to sum them all up and take the top values that make up 50% of that sum. The total in this case is 36, and 50% of it is 18, so from column a I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%.
Sorting is flexible :)
df.sort_values('column_name',ascending=False).head(int(df.shape[0]*.5))
Update: the frac argument is available only on .sample(), not on .head or .tail. df.sample(frac=.5) does give 50%, but head and tail expect only an int; df.head(frac=.5) fails with TypeError: head() got an unexpected keyword argument 'frac'.
Note: on int() vs round()
int(3.X) == 3   # True where 0 <= X <= 9
round(3.45) == 3 # True
round(3.5) == 4 # True
So when doing .head(int/round ...) do think of what behaviour fits your need.
Updated: Requirements
So let's take column a into consideration and it has values like 10,
5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of
them. so the total, in this case, is 36. 50% of these values would be
18. So from column a, I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%. -Matt
A silly hack would be to sort, take the cumulative sum, divide it by the grand total, and then use that ratio to select the top part of your sorted column, e.g.
import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in newer pandas

data = pd.read_csv(
    StringIO("""name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10"""),
    sep=' ', index_col='name'
).dropna(axis=1).apply(
    pd.to_numeric, errors='coerce', downcast='signed')
a_sorted = data[['a']].sort_values(by='a', ascending=False)
x = a_sorted[(a_sorted.cumsum() / a_sorted.sum()) <= .5].dropna()
print(x)
Outcome:
         a
name
a     10.0
e      8.0
You could sort the data frame and display only 90% of the data
df.sort_values('column_name',ascending=False).head(round(0.9*len(df)))
data.csv
name,a,b,c,d,e,f,g
a,10,5,4,8,5,6,4
b,5,10,6,5,4,3,3
c,-,4,9,3,6,5,7
d,6,9,8,6,6,8,2
e,8,5,4,4,14,9,6
f,3,3,-,4,5,14,7
g,4,5,8,9,6,7,10
test.py
#!/bin/python
import pandas as pd
def percentageOfList(l, p):
    return l[0:int(len(l) * p)]
df = pd.read_csv('data.csv')
print(percentageOfList(df.sort_values('b', ascending=False)['b'], 0.9))
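Following the updated requirement (per column, keep the largest values until they cover 50% of that column's total), the cumulative-sum trick generalizes with a plain loop. A sketch, reusing the cleaned data frame from the StringIO example above:
# For each column, keep the largest values whose running total stays within
# half of the column's sum ('-' entries were coerced to NaN and are skipped).
top_half = {}
for col in data.columns:
    s = data[col].sort_values(ascending=False)
    top_half[col] = s[s.cumsum() / s.sum() <= 0.5]
print(top_half['a'])   # selects 10 and 8, as in the worked example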

groupby, sum and count to one table

I have a dataframe below
df=pd.DataFrame({"A":np.random.randint(1,10,9),"B":np.random.randint(1,10,9),"C":list('abbcacded')})
A B C
0 9 6 a
1 2 2 b
2 1 9 b
3 8 2 c
4 7 6 a
5 3 5 c
6 1 3 d
7 9 9 e
8 3 4 d
I would like to get the grouping result below (with column "C" as the key); the rows for c, d and e are dropped intentionally.
number A_sum B_sum
a 2 16 15
b 2 3 11
This is a 2-row, 3-column dataframe, and the grouping key is column C.
The column "number" represents the count of each letter (a and b).
A_sum and B_sum represent the per-group sums of A and B for the letters in column C.
I guess we should use the groupby method, but how can I get this summary table?
You can do this using a single groupby with
res = df.groupby(df.C).agg({'A': 'sum', 'B': ['sum', 'count']})  # nested renaming dicts were removed in newer pandas
res.columns = ['A_sum', 'B_sum', 'count']
One option is to count the size and sum the columns for each group separately and then join them by index:
df.groupby("C")['A'].agg({"number": 'size'}).join(df.groupby('C').sum())
#    number   A   B
# C
# a        2  11   8
# b        2  14  12
# c        2   8   5
# d        2  11  12
# e        1   7   2
You can also do df.groupby('C').agg(["sum", "size"]) which gives an extra duplicated size column, but if you are fine with that, it should also work.
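On recent pandas (>= 0.25), named aggregation gives the asked-for table directly, with no column flattening afterwards; a minimal sketch:
res = df.groupby('C').agg(number=('C', 'size'),
                          A_sum=('A', 'sum'),
                          B_sum=('B', 'sum'))
res = res.loc[['a', 'b']]   # keep only the groups of interest, as in the question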

Filtering Pandas Dataframe Aggregate

I have a pandas dataframe that I groupby, and then perform an aggregate calculation to get the mean for:
grouped = df.groupby(['year_month', 'company'])
means = grouped.agg({'size':['mean']})
Which gives me a dataframe back, but I can't seem to filter it to the specific company and year_month that I want:
means[(means['year_month']=='201412')]
gives me a KeyError
The issue is that you are grouping on 'year_month' and 'company', so in the means DataFrame those two are part of the index (a MultiIndex); you cannot access them the way you access other columns.
One method is to match against the values of the index level 'year_month'. Example -
means.loc[means.index.get_level_values('year_month') == '201412']
Demo -
In [38]: df
Out[38]:
A B C
0 1 2 10
1 3 4 11
2 5 6 12
3 1 7 13
4 2 8 14
5 1 9 15
In [39]: means = df.groupby(['A','B']).mean()
In [40]: means
Out[40]:
C
A B
1 2 10
7 13
9 15
2 8 14
3 4 11
5 6 12
In [41]: means.loc[means.index.get_level_values('A') == 1]
Out[41]:
C
A B
1 2 10
7 13
9 15
As already pointed out, you will end up with a 2-level index. You could instead unstack the aggregated dataframe:
means = df.groupby(['year_month', 'company']).agg({'size':['mean']}).unstack(level=1)
This should give you a single 'year_month' index, 'company' as columns and your aggregate size as values. You can then slice by the index:
means.loc['201412']
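Alternatively, keeping the 2-level index from the first answer, you can select one specific (year_month, company) pair directly; a sketch, where 'Acme' is a placeholder company name:
# Both index levels at once:
means.loc[('201412', 'Acme')]
# Or take a cross-section on one named level:
means.xs('201412', level='year_month')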
