I'm sure this is a basic question, but I am unable to find the correct path here.
Let's suppose a dataframe like this, telling how many fruits each person eats per week:
Name Fruit Amount
1 Jack Lemon 3
2 Mary Banana 6
3 Sophie Lemon 1
4 Sophie Cherry 10
5 Daniel Banana 2
6 Daniel Cherry 4
Let's suppose now that I just want to create a bar plot with matplotlib, showing the total amount of each fruit eaten per week in the whole town. To do that, I must group by the fruits.
In his book, the pandas author describes groupby as the first part of a split-apply-combine operation:
So, first of all, groupby transforms the DataFrame into a DataFrameGroupBy object. Then, using a method such as sum, the result is combined into a new DataFrame object. Perfect, I can create my fruit plot now.
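For instance, that working case looks something like this on the example data above (a minimal sketch; the column names follow the table, and matplotlib is assumed to be available):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Name':   ['Jack', 'Mary', 'Sophie', 'Sophie', 'Daniel', 'Daniel'],
    'Fruit':  ['Lemon', 'Banana', 'Lemon', 'Cherry', 'Banana', 'Cherry'],
    'Amount': [3, 6, 1, 10, 2, 4],
})

# split by Fruit, apply sum, combine the result back into a DataFrame
totals = df.groupby('Fruit', as_index=False)['Amount'].sum()

totals.plot.bar(x='Fruit', y='Amount')
plt.show()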
But the problem I'm facing is what happens when I do not want to sum, diff, or apply any operation at all to the members of each group. What happens when I just want to use groupby to keep a DataFrame with only one row per fruit type? (Of course, for an example as simple as this one, I could just get a list of fruits with unique, but that's not the point.)
If I do that, the return of groupby is a DataFrameGroupBy object, and many operations that work on a DataFrame do not work on a DataFrameGroupBy.
This problem, which I'm sure is pretty simple to avoid, is giving me a lot of headaches. How can I get a DataFrame from groupby without having to apply any aggregation function? Is there a different workaround, without even using groupby, which I'm missing due to being lost in translation?
If you just want some row per group, you can use a combination of groupby + first() + reset_index(); it will retain the first row of each group:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 3]})
>>> df.groupby(df.a).first().reset_index()
a b
0 1 1
1 2 3
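Applied to the fruit example in the question (assuming the dataframe is named df), that would be roughly:
>>> df.groupby('Fruit').first().reset_index()
    Fruit    Name  Amount
0  Banana    Mary       6
1  Cherry  Sophie      10
2   Lemon    Jack       3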
This bit makes me think this could be the answer you are looking for:
Is there a different workaround without even using groupby
If you just want to drop duplicated rows based on Fruit, .drop_duplicates is the way to go.
df.drop_duplicates(subset='Fruit')
Name Fruit Amount
1 Jack Lemon 3
2 Mary Banana 6
4 Sophie Cherry 10
You have limited control over which rows are preserved; see the docstring.
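For example, the keep parameter picks which duplicate to retain; this small sketch keeps the last row per fruit instead of the first:
df.drop_duplicates(subset='Fruit', keep='last')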
This is faster and more readable than groupby + first.
IIUC you could use pivot_table, which will return a DataFrame (note that the default aggfunc is mean, so the Amount values below are per-fruit averages):
In [140]: df.pivot_table(index='Fruit')
Out[140]:
Amount
Fruit
Banana 4
Cherry 7
Lemon 2
In [141]: type(df.pivot_table(index='Fruit'))
Out[141]: pandas.core.frame.DataFrame
If you want to keep the first element, you could define your own function and pass it to the aggfunc argument:
In [144]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0])
Out[144]:
Amount Name
Fruit
Banana 6 Mary
Cherry 10 Sophie
Lemon 3 Jack
If you don't want Fruit to be the index, you could also use reset_index:
In [147]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0]).reset_index()
Out[147]:
Fruit Amount Name
0 Banana 6 Mary
1 Cherry 10 Sophie
2 Lemon 3 Jack
Related
I'd like to group by a specific column, 'Fruit', within a data frame, and calculate the percentage of each fruit that is 'Good'.
See below for my initial dataframe
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple','Apple','Banana'], 'Condition': ['Good','Bad','Good']})
Dataframe
Fruit Condition
0 Apple Good
1 Apple Bad
2 Banana Good
See below for my desired output data frame
Fruit Percentage
0 Apple 50%
1 Banana 100%
Note: Because there is 1 "Good" Apple and 1 "Bad" Apple, the percentage of Good Apples is 50%.
See below for my attempt, which overwrites all the columns:
groupedDF = df.groupby('Fruit')
groupedDF.apply(lambda x: x[(x['Condition'] == 'Good')].count()/x.count())
See below for the resulting table, which seems to calculate the percentage, but within the existing columns instead of a new column:
Fruit Condition
Fruit
Apple 0.5 0.5
Banana 1.0 1.0
We can compare Condition with eq and take advantage of the fact that True is 1 and False is 0 when processed as numbers, then take the groupby mean over Fruit:
new_df = (
    df['Condition'].eq('Good').groupby(df['Fruit']).mean().reset_index()
)
new_df:
Fruit Condition
0 Apple 0.5
1 Banana 1.0
We can further map to a format string and rename to match the desired output shown above:
new_df = (
    df['Condition'].eq('Good')
    .groupby(df['Fruit']).mean()
    .map('{:.0%}'.format)   # Change to Percent Format
    .rename('Percentage')   # Rename Column to Percentage
    .reset_index()          # Restore RangeIndex and make Fruit a Column
)
new_df:
Fruit Percentage
0 Apple 50%
1 Banana 100%
Naturally, further manipulations can be done as well.
I am cleaning a dataset and I need to remove formatting errors in column A if the value in column B matches a specific string.
A B
foo//, cherry
bar//, orange
bar//, cherry
bar apple
So in this situation, if column B is 'cherry', I want to replace "//," with "," in column A. The final result would look like this:
A B
foo, cherry
bar//, orange
bar, cherry
bar apple
Any advice is much appreciated
You can simply write a function that takes in a row as a Series, checks the cherry condition, fixes the string with str.replace, and returns the row. Then you can use df.apply over axis=1.
def fix(s):
    if s['B'] == 'cherry':
        s['A'] = s['A'].replace('//,', ',')
    return s

df.apply(fix, axis=1)
A B
0 foo, cherry
1 bar//, orange
2 bar, cherry
3 bar apple
I would first check which rows contain cherry in the B column:
rows = df['B'].str.contains('cherry')
and then replace "//" with "" in those rows, but only in column A.
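A minimal sketch of that idea, assuming the dataframe is named df:
rows = df['B'].str.contains('cherry')
# replace only in column A of the matching rows; regex=False treats the pattern literally
df.loc[rows, 'A'] = df.loc[rows, 'A'].str.replace('//,', ',', regex=False)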
I have a dataframe with values spread over several columns. I want to calculate the mean value of all items from specific columns.
All the solutions I looked up end up giving me either the separate means of each column or the mean of the means of the selected columns.
E.g. my Dataframe looks like this:
Name a b c d
Alice 1 2 3 4
Alice 2 4 2
Alice 3 2
Alice 1 5 2
Ben 3 3 1 3
Ben 4 1 2 3
Ben 1 2 2
And I want to see the mean of the values in columns b & c for each "Alice":
When I try:
df[df["Name"]=="Alice"][["b","c"]].mean()
The result is:
b 2.00
c 4.00
dtype: float64
In another post I found a suggestion to try a "double" mean, one for each axis, e.g.:
df[df["Name"]=="Alice"][["b","c"]].mean(axis=1).mean()
But the result was then:
3.00
which is the mean of the means of both columns.
I am expecting a way to calculate:
(2 + 3 + 4 + 5) / 4 = 3.50
Is there a way to do this in Python?
You can use numpy's np.nanmean here; this will simply treat your selection of the dataframe as an array and calculate the mean over the entire selection by default:
>>> np.nanmean(df.loc[df['Name'] == 'Alice', ['b', 'c']])
3.5
Or if you want to group by name, you can first stack the dataframe, like:
>>> df[['Name','b','c']].set_index('Name').stack().reset_index().groupby('Name').agg('mean')
0
Name
Alice 3.500000
Ben 1.833333
You can use groupby to sum all values and get their respective counts, then divide to get the mean.
This way you get the result for all Names at once.
g = df.groupby('Name')[['b', 'c']]
g.sum().sum(1)/g.count().sum(1)
Name
Alice 3.500000
Ben 1.833333
dtype: float64
PS: In your example, it looks like you have empty strings in some cells. That's not advisable, since you'll have dtypes set to object for your columns. Try to have NaNs instead, to take full advantage of vectorized operations.
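One way to do that cleanup (a sketch, assuming the value columns should be numeric):
import numpy as np

# turn empty strings into NaN and make the value columns numeric again
df = df.replace('', np.nan)
df[['a', 'b', 'c', 'd']] = df[['a', 'b', 'c', 'd']].apply(pd.to_numeric)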
Assuming all your columns are of numeric type and the empty cells are NaN, a simple set_index + stack and a direct mean will do:
df.set_index('Name')[['b','c']].stack().mean(level=0)
Out[117]:
Name
Alice 3.500000
Ben 1.833333
dtype: float64
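Note that newer pandas versions dropped the level argument of mean; there the equivalent is to group by the index level explicitly:
df.set_index('Name')[['b','c']].stack().groupby(level=0).mean()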
I have to count and sum totals over a dataframe, but with a condition:
fruit days_old
apple 4
apple 5
orange 1
orange 5
I have to count with the condition that a fruit is over 3 days old. So the output I need is
2 apples and 1 orange
I thought I would have to use an apply function, but I would have to save each fruit type to a variable or something. I'm sure there's an easier way.
PS: I've been looking, but I don't see a clear way to create tables here with proper spacing. The only thing that's clear is not to copy and paste with tabs!
One way is to use pd.Series.value_counts:
res = df.loc[df['days_old'] > 3, 'fruit'].value_counts()
# apple 2
# orange 1
# Name: fruit, dtype: int64
Using pd.DataFrame.apply is inadvisable as this will result in an inefficient loop.
You can use value_counts():
In [120]: df[df.days_old > 3]['fruit'].value_counts()
Out[120]:
apple 2
orange 1
Name: fruit, dtype: int64
I wanted in on the variation party.
pd.factorize + np.bincount
f, u = pd.factorize(df.fruit)
pd.Series(
    np.bincount(f, df.days_old > 3).astype(int), u
)
apple 2
orange 1
dtype: int64
The value_counts() methods described by #jpp and #chrisz are great. Just to post another strategy, you can use groupby:
df[df.days_old > 3].groupby('fruit').size()
# fruit
# apple 2
# orange 1
# dtype: int64
If I have a pandas DataFrame like this,
and I want to transform it in a way that results in this,
is there a correct way to achieve it? A good pattern?
Use a pivot table:
pd.pivot_table(df, index='name', columns=['property'], aggfunc=sum).fillna(0)
Output:
price
property boat dog house
name
Bob 0 5 4
Josh 0 2 0
Sam 3 0 0
Sidenote: Pasting in your df as text helps, so people can use pd.read_clipboard instead of generating the df themselves.
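For reference, a small input consistent with the output above (the original dataframes were posted as images, so these values are only a reconstruction from the result shown):
import pandas as pd

df = pd.DataFrame({
    'name':     ['Bob', 'Bob', 'Josh', 'Sam'],
    'property': ['dog', 'house', 'dog', 'boat'],
    'price':    [5, 4, 2, 3],
})

pd.pivot_table(df, index='name', columns=['property'], aggfunc=sum).fillna(0)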