pandas dataframe groupby and get arbitrary member of each group - python

This question is similar to this other question.
I have a pandas dataframe. I want to split it into groups, and select an arbitrary member of each group, defined elsewhere.
Example: I have a dataframe that can be divided into 6 groups of 4 observations each. I want to extract one observation per group, according to:
selected = [0,3,2,3,1,3]
This is very similar to
df.groupby('groupvar').nth(n)
But, crucially, n varies for each group according to the selected list.
Thanks!

Typically, everything you do within a groupby should be independent of the other groups: inside any groupby.apply(), you only get the group itself, not its context. An alternative is to compute position values for the whole sample (index, below) out of the per-group positions (here, selected). Note that the dataset must be sorted by group for the following to work.
I use test, out of which I want to select selected:
In[231]: test
Out[231]:
score
name
0 A -0.208392
1 A -0.103659
2 A 1.645287
0 B 0.119709
1 B -0.047639
2 B -0.479155
0 C -0.415372
1 C -1.390416
2 C -0.384158
3 C -1.328278
selected = [0, 2, 1]
c = test.groupby(level=1).count()
In[242]: index = c.shift(1).cumsum().add(np.array([selected]).T, fill_value=0)  # assumes numpy is imported as np
In[243]: index
Out[243]:
score
name
A 0
B 5
C 7
In[255]: test.iloc[index.values[:,0].astype(int)]
Out[255]:
score
name
0 A -0.208392
2 B -0.479155
1 C -1.390416
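For completeness, here is a self-contained sketch of the same idea (the variable names are illustrative); it uses size()/cumsum() rather than count()/shift() so the computed positions stay integers:
import numpy as np
import pandas as pd

# Toy data: groups A, B, C with 3, 3 and 4 rows, sorted by group.
test = pd.DataFrame(
    {'score': np.random.randn(10)},
    index=pd.MultiIndex.from_arrays(
        [[0, 1, 2, 0, 1, 2, 0, 1, 2, 3], list('AAABBBCCCC')],
        names=[None, 'name']))
selected = [0, 2, 1]  # nth member to take from A, B, C respectively

# Start position of each group = cumulative size of the preceding groups.
starts = test.groupby(level='name').size().cumsum().shift(1, fill_value=0)

# Absolute positions in the sorted frame, then select with iloc.
positions = starts + np.asarray(selected)
print(test.iloc[positions.to_numpy()])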

Related

How to access pandas groupby groups in decreasing order of group count in a for loop?

I used pandas groupby to group my data using multiple columns. Now I want to access the groups in a for loop, in decreasing order of group count.
groups.size().sort_values(ascending=False).head(10) shows me the groups in decreasing order of group count, but I want to access each group as a dataframe (like get_group() returns) in a for loop. How do I do that?
As hinted in your question, use get_group:
import pandas as pd

df = pd.DataFrame({'col': list('ABBCACBBC')})

# assign the GroupBy object to a variable
g = df.groupby('col', sort=False)

# get the desired order of the groups
order = g['col'].size().sort_values(ascending=False).index

for x in order:
    print(f'# group {x}')
    print(g.get_group(x))
Output:
# group B
col
1 B
2 B
6 B
7 B
# group C
col
3 C
5 C
8 C
# group A
col
0 A
4 A
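As a side note, a GroupBy object is itself iterable, yielding (name, frame) pairs, so (assuming the same g as above) you could also sort those pairs by group size directly:
# Alternative: iterate the (name, frame) pairs sorted by group size.
for x, frame in sorted(g, key=lambda kv: len(kv[1]), reverse=True):
    print(f'# group {x}')
    print(frame)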

How to use groupby on multiple indexes and then use count aggregate function and then use one of the multiple indexes to get the sum of count?

I have created a dataframe in Python, let's say:
testingdf = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2],
                          'B': [1, 2, 1, 2, 3, 3],
                          'C': [9, 8, 7, 6, 5, 6]})
Now I want to get the count of column 'C' grouped by 'A' and 'B'; for that I am performing
testingdf.groupby(['A','B']).count()
to get:
     C
A B
1 1  2
  3  1
2 2  2
  3  1
Now I want to get the sum of this count of 'C' with respect to 'A', like:
A  C
1  3
2  3
After grouping by 'A' and 'B' I can select the count column and apply the sum aggregate function to it, so I wanted to know what the most efficient way of doing this is.
Note: this sum is just an example; I also want to apply other aggregate functions, e.g. to get the max and min of the count of C with respect to A after grouping A and B together.
P.S.: Sorry, I should have mentioned this earlier, but I don't want to use groupby twice. I want the most efficient way to get the results, even if that means not using groupby at all.
You can use the sum() method with the level parameter after groupby() + count():
out = testingdf.groupby(['A','B']).count().sum(level=0).reset_index()
Another way is to group by twice:
out = testingdf.groupby(['A','B']).count().groupby(level=0).sum().reset_index()
Output for your given data:
   A  C
0  1  3
1  2  3
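Note that sum(level=0) is deprecated in recent pandas versions in favor of the explicit groupby(level=0).sum(). For the max and min mentioned in the note, the same single extra pass over the (already small) count table works; a minimal sketch:
import pandas as pd

testingdf = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2],
                          'B': [1, 2, 1, 2, 3, 3],
                          'C': [9, 8, 7, 6, 5, 6]})

# Count per (A, B) pair, then aggregate that count per A in one chain.
counts = testingdf.groupby(['A', 'B']).count()
print(counts.groupby(level=0)['C'].agg(['sum', 'max', 'min']).reset_index())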

I applied a sum on a groupby and I want to sort the result

I did g = df.groupby('name of the column') and got a GroupBy object. Now, for every distinct 'name of the column', I want to sum the values specified in another column, so that when I run the function I get a series (sorted by the sum of values) with each 'name of the column' and its respective sum of the values. What I did was:
for name, dfaux in g:
    print(name, dfaux['name of the column where the values are specified'].sum())
I did get the series that I wanted, but I don't know how to sort it. Any help? Thanks!
Do you want the kind of sorting shown below? If so, you can do it like this.
Your dataframe:
0 a 1
1 b 2
2 a 3
3 c 4
4 b 5
If you expect the output to be:
a 4
c 4
b 7
import pandas as pd

d = {'col1': ['a', 'b', 'a', 'c', 'b'], 'col2': [1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
print(df.groupby(['col1']).sum().sort_values(by=['col2']))
Here groupby returns a dataframe with the column names specified before, so you can just sort the returned dataframe.
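A slightly more direct route (assuming the same illustrative column names) skips the intermediate dataframe: selecting the value column before summing yields a Series, which sorts directly:
# One pass: group, sum the value column, sort the resulting Series.
print(df.groupby('col1')['col2'].sum().sort_values())
# col1
# a    4
# c    4
# b    7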

How to use python pandas groupby or .DataFrameGroupBy objects to create unique list of combinations

Is there a more efficient way to use pandas groupby or a pandas.core.groupby.DataFrameGroupBy object to create a unique list, series, or dataframe of combinations of 2 of N columns? E.g., if I have the columns Date, Name, and Item Purchased, and I just want to know the unique Name and Date combinations, this works fine:
y = x.groupby(['Date','Name']).count()
y = y.reset_index()[['Date', 'Name']]
but I feel like there should be a cleaner way using
y = x.groupby(['Date','Name'])
but y.index gives me an error, although y.keys works. This actually leads me to the general question: what are pandas.core.groupby.DataFrameGroupBy objects convenient for?
Thanks!
You don't need to use -- and in fact shouldn't use -- groupby here. You could use drop_duplicates to get unique rows instead:
x.drop_duplicates(['Date','Name'])
Demo:
In [156]: x = pd.DataFrame({'Date':[0,1,2]*2, 'Name':list('ABC')*2})
In [158]: x
Out[158]:
Date Name
0 0 A
1 1 B
2 2 C
3 0 A
4 1 B
5 2 C
In [160]: x.drop_duplicates(['Date','Name'])
Out[160]:
Date Name
0 0 A
1 1 B
2 2 C
You shouldn't use groupby here because:
- x.groupby(['Date','Name']).count() performs a count of the number of elements in each group, but the count is never used -- it's a wasted computation.
- x.groupby(['Date','Name']).count() raises an AttributeError if x has only the Date and Name columns.
- drop_duplicates is much, much faster for this purpose.
Use groupby when you want to perform some operation on each group, such as counting the number of elements in each group, or computing some statistic (e.g. a sum or mean, etc.) per group.
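For instance, with a hypothetical 'Item Purchased' column like the one mentioned in the question, counting purchases per Date/Name pair is exactly the kind of per-group work groupby is meant for:
import pandas as pd

x = pd.DataFrame({'Date': [0, 1, 2] * 2,
                  'Name': list('ABC') * 2,
                  'Item Purchased': ['pen', 'ink', 'pad', 'pen', 'ink', 'pad']})

# A per-group statistic, computed once per (Date, Name) group.
print(x.groupby(['Date', 'Name'])['Item Purchased'].count())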

Only allow one to one mapping between two columns in pandas dataframe

I have a two-column dataframe df where each row is distinct. One element in one column can map to one or more elements in the other column; I want to filter OUT those multi-mapping elements, so that in the final dataframe each element in one column maps to a unique element in the other column.
What I am doing is to group by one column, count the duplicates, remove the rows with counts greater than 1, and then do the same for the other column. I am wondering if there is a better, simpler way.
Thanks
Edit: I just realized my solution is INCORRECT: removing the multi-mapping elements in column A changes the mapping counts in column B. Consider the following example:
A B
1 4
1 3
2 4
1 maps to 3 and 4, so the first two rows should be removed; likewise 4 maps to 1 and 2, so the final table should be empty. However, my solution keeps the last row.
Can anyone provide a fast and simple solution? Thanks!
Well, you could do something like the following:
>>> df
A B
0 1 4
1 1 3
2 2 4
3 3 5
You only want to keep a row if no other row has its value of 'A' and no other row has its value of 'B'. Only row three meets both conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone,on=['A','B'],how='inner')
A B
0 3 5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
A B
2 2 4
3 3 5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
A B
1 1 3
3 3 5
The above grabs the rows that may be allowed based on looking at column 'B' alone. Merging the two then leaves only the rows that meet both conditions:
>>> Aone.merge(Bone,on=['A','B'],how='inner')
Note: you could do something similar using groupby/transform, but transform tends to be slow, so I didn't present it as an alternative.
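For reference, here is a minimal sketch of that transform-based variant (same toy data as above):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3], 'B': [4, 3, 4, 5]})

# Keep a row only if its 'A' value and its 'B' value each occur exactly once.
mask = (df.groupby('A')['A'].transform('size') == 1) & \
       (df.groupby('B')['B'].transform('size') == 1)
print(df[mask])
#    A  B
# 3  3  5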
