sum occurrences of a string in pandas dataframe - python

I have to count and sum totals over a dataframe, but with a condition:
fruit   days_old
apple   4
apple   5
orange  1
orange  5
I have to count with the condition that a fruit is over 3 days old. So the output I need is
2 apples and 1 orange
I thought I would have to use an apply function, but I would have to save each fruit type to a variable or something. I'm sure there's an easier way.
ps. I've been looking but I don't see a clear way to create tables here with proper spacing. The only thing that's clear is to not copy and paste with tabs!

One way is to use pd.Series.value_counts:
res = df.loc[df['days_old'] > 3, 'fruit'].value_counts()
# apple 2
# orange 1
# Name: fruit, dtype: int64
Using pd.DataFrame.apply is inadvisable as this will result in an inefficient loop.
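For reference, a minimal setup that reproduces the question's frame (values taken from the table above, column names assumed), so the snippet runs as-is:
import pandas as pd

# example data from the question
df = pd.DataFrame({'fruit': ['apple', 'apple', 'orange', 'orange'],
                   'days_old': [4, 5, 1, 5]})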

You can use value_counts():
In [120]: df[df.days_old > 3]['fruit'].value_counts()
Out[120]:
apple 2
orange 1
Name: fruit, dtype: int64

I wanted in on the variation party:
pd.factorize + np.bincount
import numpy as np

# f: integer codes per fruit, u: the unique fruit labels
f, u = pd.factorize(df.fruit)
# bincount with Boolean weights sums the condition per code
pd.Series(np.bincount(f, df.days_old > 3).astype(int), u)
apple 2
orange 1
dtype: int64

The value_counts() methods described by @jpp and @chrisz are great. Just to post another strategy, you can use groupby:
df[df.days_old > 3].groupby('fruit').size()
# fruit
# apple 2
# orange 1
# dtype: int64

Related

How to groupby and calculate new field with python pandas?

I'd like to group by a specific column called 'Fruit' within a data frame and calculate the percentage of that particular fruit that is 'Good'.
See below for my initial dataframe
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple','Apple','Banana'], 'Condition': ['Good','Bad','Good']})
Dataframe:
    Fruit Condition
0   Apple      Good
1   Apple       Bad
2  Banana      Good
See below for my desired output data frame:
    Fruit Percentage
0   Apple        50%
1  Banana       100%
Note: Because there is 1 "Good" Apple and 1 "Bad" Apple, the percentage of Good Apples is 50%.
See below for my attempt, which overwrites all the columns:
groupedDF = df.groupby('Fruit')
groupedDF.apply(lambda x: x[(x['Condition'] == 'Good')].count()/x.count())
See below for the resulting table, which seems to calculate the percentage but within the existing columns instead of a new column:
        Fruit  Condition
Fruit
Apple     0.5        0.5
Banana    1.0        1.0
We can compare Condition with eq, take advantage of the fact that True is 1 and False is 0 when treated as numbers, and take the groupby mean over Fruit:
new_df = (
    df['Condition'].eq('Good').groupby(df['Fruit']).mean().reset_index()
)
new_df:
    Fruit  Condition
0   Apple        0.5
1  Banana        1.0
We can further map to a format string and rename to match the desired output shown above:
new_df = (
    df['Condition'].eq('Good')
    .groupby(df['Fruit']).mean()
    .map('{:.0%}'.format)  # Change to Percent Format
    .rename('Percentage')  # Rename Column to Percentage
    .reset_index()         # Restore RangeIndex and make Fruit a Column
)
new_df:
    Fruit Percentage
0   Apple        50%
1  Banana       100%
Naturally, further manipulations can be done as well.
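For example, a variation that keeps Fruit as an ordinary column throughout (a sketch, not from the original answers; the helper column name Good is an assumption):
out = (
    df.assign(Good=df['Condition'].eq('Good'))        # helper Boolean column
      .groupby('Fruit', as_index=False)['Good'].mean()
      .rename(columns={'Good': 'Percentage'})
)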

How to replace a string based upon the value in a different pandas column

I am cleaning a dataset and I need to remove formatting errors in column A if the value in column B matches a specific string.
A       B
foo//,  cherry
bar//,  orange
bar//,  cherry
bar     apple
So in this situation if column B is 'cherry' I want to replace "//," with "," in column A. The final result would look like this.
A       B
foo,    cherry
bar//,  orange
bar,    cherry
bar     apple
Any advice is much appreciated
You can simply write a function that takes in a row as a Series, checks the cherry condition, fixes the string with str.replace, and returns the row. Then you can use df.apply over axis=1.
def fix(s):
    if s['B'] == 'cherry':
        s['A'] = s['A'].replace('//,', ',')
    return s

df.apply(fix, axis=1)
        A       B
0    foo,  cherry
1  bar//,  orange
2    bar,  cherry
3     bar   apple
I would first check which rows contain cherry in the B column:
rows = df['B'].str.contains('cherry')
and then replace "//," with "," in those rows, but only in column A.
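A minimal sketch completing that idea (the exact call is an assumption, not part of the original answer):
# update only column A on the rows flagged above
df.loc[rows, 'A'] = df.loc[rows, 'A'].str.replace('//,', ',', regex=False)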

Count number of matching values from pandas groupby

I have created a pandas dataframe for a store
I have columns Transaction and Item_Type
import pandas as pd
data = {'Transaction':[1, 2, 2, 2, 3], 'Item_Type':['Food', 'Drink', 'Food', 'Drink', 'Food']}
df = pd.DataFrame(data, columns=['Transaction', 'Item_Type'])
Transaction  Item_Type
1            Food
2            Drink
2            Food
2            Drink
3            Food
I am trying to group by transaction and count the number of drinks per transaction, but cannot find the right syntax to do it.
df = df.groupby(['Transaction','Item_Type']).size()
This sort of works, but gives me a multi-index Series, and I cannot yet figure out how to select the drinks per transaction from it.
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
This seems clunky - is there a better way?
This Stack Overflow question seemed the most similar: Adding a 'count' column to the result of a groupby in pandas?
Another way is possible with pivot_table:
s = df.pivot_table(index='Transaction',
                   columns='Item_Type', aggfunc=len).stack().astype(int)
Or:
s = df.pivot_table(index=['Transaction', 'Item_Type'], aggfunc=len)  # thanks @Ch3steR
s.index = s.index.map("{0[0]}/{0[1]}".format)
print(s)
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
Or if you wish to filter a particular category:
to_filter = 'Drink'
(df.pivot_table(index='Transaction', columns='Item_Type', aggfunc=len, fill_value=0)
   .filter(items=[to_filter]))
Item_Type    Drink
Transaction
1                0
2                2
3                0
Edit: replacing original xs approach with unstack after seeing anky's answer.
>>> df.groupby('Transaction')['Item_Type'].value_counts().unstack(fill_value=0)['Drink']
Transaction
1 0
2 2
3 0
Name: Drink, dtype: int64
For a particular category, you can check the condition and then sum the resulting Boolean Series within each group.
df['Item_Type'].eq('Drink').groupby(df['Transaction']).sum()
#Transaction
#1 0.0
#2 2.0
#3 0.0
#Name: Item_Type, dtype: float64
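If whole-number counts are preferred, a small variation (an illustration, not from the original answer) is to cast the Boolean Series before grouping:
# cast True/False to 1/0 so the grouped sum comes out as integers
df['Item_Type'].eq('Drink').astype(int).groupby(df['Transaction']).sum()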
I found a solution, I think:
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
df = df.groupby(['Transaction','Item_Type']).size().reset_index(name='counts')
This gives me the information I need:
Transaction  Item_Type  counts
1            Food       1
2            Drink      2
2            Food       1
3            Food       1
You may use agg and value_counts
s = df.astype(str).agg('/'.join, axis=1).value_counts(sort=False)
Out[61]:
3/Food 1
2/Drink 2
1/Food 1
2/Food 1
dtype: int64
If you want to keep the original order, chain an additional sort_index:
s = df.astype(str).agg('/'.join, axis=1).value_counts().sort_index(kind='mergesort')
Out[62]:
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
dtype: int64

Pandas assignment vs inplace=True on .loc? [duplicate]

I have tried many times, but it seems that 'replace' does not work well after using 'loc'.
For example, I want to replace 'conlumn_b' using a regex for the rows where the 'conlumn_a' value is 'apple'.
Here is my sample code:
df.loc[df['conlumn_a'] == 'apple', 'conlumn_b'].replace(r'^11*', 'XXX',inplace=True, regex=True)
Example:
conlumn_a  conlumn_b
apple      123
banana     11
apple      11
orange     33
The result that I expected for the 'df' is:
conlumn_a  conlumn_b
apple      123
banana     11
apple      XXX
orange     33
Has anyone else met this issue of needing 'replace' with a regex after 'loc'?
Or do you have some other good solutions?
Thank you so much for your help!
inplace=True works on the object that it was applied on.
When you call .loc, you're slicing your dataframe object to return a new one.
>>> id(df)
4587248608
And,
>>> id(df.loc[df['conlumn_a'] == 'apple', 'conlumn_b'])
4767716968
Now, calling an in-place replace on this new slice will apply the replace operation, updating the new slice itself, and not the original.
Now, note that you're calling replace on a column of int, and nothing is going to happen, because regular expressions work on strings.
Here's what I offer you as a workaround. Don't use regex at all.
m = df['conlumn_a'] == 'apple'
df.loc[m, 'conlumn_b'] = df.loc[m, 'conlumn_b'].replace(11, 'XXX')
df
  conlumn_a conlumn_b
0     apple       123
1    banana        11
2     apple       XXX
3    orange        33
Or, if you need regex-based substitution, then:
df.loc[m, 'conlumn_b'] = df.loc[m, 'conlumn_b'] \
    .astype(str).replace('^11$', 'XXX', regex=True)
Although, this converts your column to an object column.
I'm going to borrow from a recent answer of mine. This technique is a general purpose strategy for updating a dataframe in place:
df.update(
    df.loc[df['conlumn_a'] == 'apple', 'conlumn_b']
      .replace(r'^11$', 'XXX', regex=True)
)
df
  conlumn_a conlumn_b
0     apple       123
1    banana        11
2     apple       XXX
3    orange        33
Note that all I did was remove the inplace=True and instead wrapped it in the pd.DataFrame.update method.
I think you need to filter on both sides:
m = df['conlumn_a'] == 'apple'
df.loc[m, 'conlumn_b'] = df.loc[m, 'conlumn_b'].astype(str).replace(r'^(11+)', 'XXX', regex=True)
print (df)
  conlumn_a conlumn_b
0     apple       123
1    banana        11
2     apple       XXX
3    orange        33

Using Pandas' groupby just to drop repeated items

I'm sure this is a basic question, but I am unable to find the correct path here.
Let's suppose a dataframe like this, telling how many fruits each person eats per week:
   Name    Fruit   Amount
1  Jack    Lemon   3
2  Mary    Banana  6
3  Sophie  Lemon   1
4  Sophie  Cherry  10
5  Daniel  Banana  2
6  Daniel  Cherry  4
Let's suppose now that I just want to create a bar plot with matplotlib, to show the total amount of each fruit eaten per week in the whole town. To do that, I must group by the fruits.
In his book, the pandas author describes groupby as the first part of a split-apply-combine operation.
So, first of all, groupby transforms the DataFrame into a DataFrameGroupBy object. Then, using a method such as sum, the result is combined into a new DataFrame object. Perfect, I can create my fruit plot now.
But the problem I'm facing is what happens when I do not want to sum, diff or apply any operation at all to each group members. What happens when I just want to use groupby to keep a DataFrame with only one row per fruit type (of course, for an example as simple as this one, I could just get a list of fruits with unique, but that's not the point).
If I do that, the return of groupby is a DataFrameGroupBy object, and many operations which work with DataFrame do not with DataFrameGroupBy.
This problem, which I'm sure is pretty simple to avoid, is giving me a lot of headaches. How can I get a DataFrame from groupby without having to apply any aggregation function? Is there a different workaround without even using groupby which I'm missing due to being lost in translation?
If you just want some row per group, you can use a combination of groupby + first() + reset_index; it will retain the first row per group:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 3]})
>>> df.groupby(df.a).first().reset_index()
   a  b
0  1  1
1  2  3
This bit makes me think this could be the answer you are looking for:
Is there a different workaround without even using groupby
If you just want to drop duplicated rows based on Fruit, .drop_duplicates is the way to go.
df.drop_duplicates(subset='Fruit')
   Name    Fruit   Amount
1  Jack    Lemon   3
2  Mary    Banana  6
4  Sophie  Cherry  10
You have limited control over which rows are preserved; see the docstring.
This is faster and more readable than groupby + first.
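For instance, a minimal sketch of the keep parameter mentioned above (this particular call is an illustration, not from the original answer):
# keep the last occurrence of each fruit instead of the first
df.drop_duplicates(subset='Fruit', keep='last')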
IIUC you could use pivot_table, which will return a DataFrame (note that the default aggfunc is 'mean', which is why Amount is averaged below):
In [140]: df.pivot_table(index='Fruit')
Out[140]:
Amount
Fruit
Banana 4
Cherry 7
Lemon 2
In [141]: type(df.pivot_table(index='Fruit'))
Out[141]: pandas.core.frame.DataFrame
If you want to keep the first element, you could define your own function and pass it to the aggfunc argument:
In [144]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0])
Out[144]:
Amount Name
Fruit
Banana 6 Mary
Cherry 10 Sophie
Lemon 3 Jack
If you don't want your Fruit to be an index you could also use reset_index:
In [147]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0]).reset_index()
Out[147]:
Fruit Amount Name
0 Banana 6 Mary
1 Cherry 10 Sophie
2 Lemon 3 Jack
