I have created a pandas dataframe for a store, with columns Transaction and Item_Type:
import pandas as pd
data = {'Transaction':[1, 2, 2, 2, 3], 'Item_Type':['Food', 'Drink', 'Food', 'Drink', 'Food']}
df = pd.DataFrame(data, columns=['Transaction', 'Item_Type'])
Transaction Item_Type
1 Food
2 Drink
2 Food
2 Drink
3 Food
I am trying to group by transaction and count the number of drinks per transaction, but cannot find the right syntax to do it.
df = df.groupby(['Transaction','Item_Type']).size()
This sort of works, but gives me a multi-index Series, and I cannot yet figure out how to select the drink counts per transaction from it.
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
This seems clunky - is there a better way?
This Stack Overflow question seemed most similar: Adding a 'count' column to the result of a groupby in pandas?
Another way is possible with pivot_table:
s = df.pivot_table(index='Transaction',
                   columns='Item_Type', aggfunc=len).stack().astype(int)
Or:
s = df.pivot_table(index=['Transaction', 'Item_Type'], aggfunc=len)  # thanks @Ch3steR
s.index = s.index.map("{0[0]}/{0[1]}".format)
print(s)
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
Or if you wish to filter a particular category:
to_filter = 'Drink'
(df.pivot_table(index='Transaction', columns='Item_Type', aggfunc=len, fill_value=0)
   .filter(items=[to_filter]))
Item_Type Drink
Transaction
1 0
2 2
3 0
Edit: replacing original xs approach with unstack after seeing anky's answer.
>>> df.groupby('Transaction')['Item_Type'].value_counts().unstack(fill_value=0)['Drink']
Transaction
1 0
2 2
3 0
Name: Drink, dtype: int64
For a particular category, you can check the condition first and then sum the resulting Boolean Series within each group.
df['Item_Type'].eq('Drink').groupby(df['Transaction']).sum()
#Transaction
#1 0.0
#2 2.0
#3 0.0
#Name: Item_Type, dtype: float64
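If you prefer integer counts from this approach, a minimal self-contained sketch (assuming the same df as in the question) just casts the summed Booleans:
import pandas as pd

df = pd.DataFrame({'Transaction': [1, 2, 2, 2, 3],
                   'Item_Type': ['Food', 'Drink', 'Food', 'Drink', 'Food']})

# compare each row to 'Drink', then sum the Booleans within each Transaction group
drinks = df['Item_Type'].eq('Drink').groupby(df['Transaction']).sum().astype(int)
print(drinks)
#Transaction
#1    0
#2    2
#3    0
#Name: Item_Type, dtype: int64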
I think I found a solution:
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
df = df.groupby(['Transaction','Item_Type']).size().reset_index(name='counts')
This gives me the information I need:
Transaction Item_Type counts
1 Food 1
2 Drink 2
2 Food 1
3 Food 1
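From that counts frame, selecting just the drink rows is a plain Boolean filter; a small follow-up sketch, assuming the df from the question:
counts = df.groupby(['Transaction', 'Item_Type']).size().reset_index(name='counts')
drinks = counts[counts['Item_Type'] == 'Drink']  # keep only the rows that count drinks
print(drinks)
#   Transaction Item_Type  counts
#1            2     Drink       2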
You may use agg and value_counts
s = df.astype(str).agg('/'.join, axis=1).value_counts(sort=False)
Out[61]:
3/Food 1
2/Drink 2
1/Food 1
2/Food 1
dtype: int64
If you want to keep the original order, chain an additional sort_index:
s = df.astype(str).agg('/'.join, axis=1).value_counts().sort_index(kind='mergesort')
Out[62]:
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
dtype: int64
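If you then only want the drink entries from this '/'-joined series, one small option (a sketch, not part of the original answer) is to filter on the index labels:
print(s.filter(like='/Drink'))
2/Drink    2
dtype: int64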
Related
I have a dataframe:
df1 = pd.DataFrame({'id': ['1','2','2','3','3','4','4'],
'name': ['James','Jim','jimy','Daniel','Dane','Ash','Ash'],
'event': ['Basket','Soccer','Soccer','Basket','Soccer','Basket','Soccer']})
I want to count the occurrences of each id but also keep the unique names; the result I expect is:
id name count
1 James 1
2 Jim, jimy 2
3 Daniel, Dane 2
4 Ash 2
I tried to group by id and name but it doesn't count as I expected.
You could try:
df1.groupby('id').agg(
name=('name', lambda x: ', '.join(x.unique())),
count=('name', 'count')
)
We are basically grouping by id and then joining the unique names into a comma-separated list!
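For reference, a runnable sketch of this approach with the df1 from the question (the printed values are worked out from that sample data):
import pandas as pd

df1 = pd.DataFrame({'id': ['1','2','2','3','3','4','4'],
                    'name': ['James','Jim','jimy','Daniel','Dane','Ash','Ash'],
                    'event': ['Basket','Soccer','Soccer','Basket','Soccer','Basket','Soccer']})

out = df1.groupby('id').agg(
    name=('name', lambda x: ', '.join(x.unique())),  # join the unique names per id
    count=('name', 'count')                          # count the (non-null) name rows per id
)
print(out)
            name  count
id
1          James      1
2      Jim, jimy      2
3   Daniel, Dane      2
4            Ash      2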
Here is a solution:
groups = df1[["id", "name"]].groupby("id")
a = groups.agg(lambda x: ", ".join(set(x)))
b = groups.size().rename("count")
c = pd.concat([a,b], axis=1)
I'm not an expert when it comes to pandas but I thought I might as well post my solution because I think that it's straightforward and readable.
In your example, the groupby is done on the id column and not by id and name. The name column you see in your expected DataFrame is the result of an aggregation done after a groupby.
Here, it is obvious that the groupby was done on the id column.
My solution is maybe not the most straightforward but I still find it to be more readable:
Create a groupby object groups by grouping by id
Create a DataFrame a from groups by aggregating with a comma join (you also need to remove duplicates using set(...)): lambda x: ", ".join(set(x))
The DataFrame a will thus have the following data:
name
id
1 James
2 Jim, jimy
3 Daniel, Dane
4 Ash
Create a Series b by computing the size of each group in groups: groups.size() (you should also rename it, e.g. to count)
id
1 1
2 2
3 2
4 2
Name: count, dtype: int64
Concatenate a and b horizontally and you get what you wanted:
name count
id
1 James 1
2 Jim, jimy 2
3 Daniel, Dane 2
4 Ash 2
I have a dataframe, which I use groupby on for further data aggregation:
import pandas as pd
test_df = pd.DataFrame(data={"id": [1,2,2,3,3], "review_id": [1,2,3,4,5], "text": ["good", "bad", "nice", "awesome", "dont buy"]})
grouped_df = test_df.groupby(by=["id", "review_id"]).apply(lambda x: [x["text"]])
This gives me the following series:
id review_id
1 1 [[good]]
2 2 [[bad]]
3 [[nice]]
3 4 [[awesome]]
5 [[dont buy]]
dtype: object
Now I need a way to further reduce this series, as I only want ids with more than one review, so id 1 should be dropped.
I just don't know how I could use aggregate() or apply() for this task.
How can I achieve this?
Let us do transform
out = grouped_df[grouped_df.groupby(level=0).transform('size')>1]
id review_id
2 2 [[bad]]
3 [[nice]]
3 4 [[awesome]]
5 [[dont buy]]
dtype: object
Or let us do duplicated
out = grouped_df[grouped_df.index.get_level_values(0).duplicated(keep=False)]
id review_id
2 2 [[bad]]
3 [[nice]]
3 4 [[awesome]]
5 [[dont buy]]
dtype: object
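A third option along the same lines (a sketch, assuming the grouped_df built in the question) is SeriesGroupBy.filter, which keeps whole groups that satisfy a condition:
out = grouped_df.groupby(level=0).filter(lambda g: len(g) > 1)  # keep ids with more than one review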
I have a very large DataFrame that looks like this:
A B
SPH2008 3/21/2008 1 2
3/21/2008 1 2
3/21/2008 1 2
SPM2008 6/21/2008 1 2
6/21/2008 1 2
6/21/2008 1 2
And I have the following code, which is intended to flatten the two index levels and collect their unique pairs into a new DataFrame:
indeces = [df.index.get_level_values(0), df.index.get_level_values(1)]
tmp = pd.DataFrame(data=indeces).T.drop_duplicates()
tmp.columns = ['ID', 'ExpirationDate']
tmp.sort_values('ExpirationDate', inplace=True)
However, this operation takes a remarkably long time. Is there a more efficient way to do this?
pandas.DataFrame.index.drop_duplicates
pd.DataFrame([*df.index.drop_duplicates()], columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008
With older versions of Python that can't unpack in that way:
pd.DataFrame(df.index.drop_duplicates().tolist(), columns=['ID', 'ExpirationDate'])
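A related option (not from the original answer, just a sketch against the same index) is MultiIndex.to_frame, which avoids the unpacking entirely:
out = df.index.drop_duplicates().to_frame(index=False)  # one column per index level
out.columns = ['ID', 'ExpirationDate']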
IIUC, you can also group by the levels of your MultiIndex, then create a dataframe from that with your desired columns:
>>> pd.DataFrame(df.groupby(level=[0,1]).groups.keys(), columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008
I've heard that in pandas there are often multiple ways to do the same thing, but I was wondering:
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?
There is a difference. value_counts returns:
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
but count does not; it sorts the output by the index (created from the column in groupby('col')).
df.groupby('colA').count()
aggregates all columns of df with the count function, so it counts values excluding NaNs.
So if you need to count only one column, use:
df.groupby('colA')['colA'].count()
Sample:
df = pd.DataFrame({'colB':list('abcdefg'),
'colC':[1,3,5,7,np.nan,np.nan,4],
'colD':[np.nan,3,6,9,2,4,np.nan],
'colA':['c','c','b','a',np.nan,'b','b']})
print (df)
colA colB colC colD
0 c a 1.0 NaN
1 c b 3.0 3.0
2 b c 5.0 6.0
3 a d 7.0 9.0
4 NaN e NaN 2.0
5 b f NaN 4.0
6 b g 4.0 NaN
print (df['colA'].value_counts())
b 3
c 2
a 1
Name: colA, dtype: int64
print (df.groupby('colA').count())
colB colC colD
colA
a 1 1 1
b 3 2 2
c 2 2 1
print (df.groupby('colA')['colA'].count())
colA
a 1
b 3
c 2
Name: colA, dtype: int64
groupby and value_counts are quite different functions. You cannot perform value_counts on a DataFrame (at least not in older pandas versions; DataFrame.value_counts was added later, as a later answer shows).
value_counts is limited to a single column or Series, and its sole purpose is to return a Series of the frequencies of the values.
groupby returns a GroupBy object, so one can perform statistical computations over it. So when you do df.groupby(col).count(), it returns the number of non-NaN values present in each column with respect to the column specified in groupby.
When should value_counts be used, and when should groupby.count be used?
Let's take an example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]})
Groupby count:
df.groupby('color').count()
id size
color
b 2 2
g 2 2
r 3 3
Groupby count is generally used for getting the number of valid (non-NaN)
values present in all the columns with respect to the one or more columns
specified, so NaN values will be excluded.
To find the frequency using groupby you need to aggregate against the specified column itself, like @jez did (perhaps value_counts was implemented to avoid this and make developers' lives easier).
Value Counts:
df['color'].value_counts()
r 3
g 2
b 2
Name: color, dtype: int64
value_counts is generally used for finding the frequency of the values
present in one particular column.
In conclusion:
.groupby(col).count() should be used when you want to find the frequency of valid values present in the other columns with respect to the specified col.
.value_counts() should be used to find the frequencies of a Series.
In simple words: .value_counts() returns a Series containing counts of unique rows in the DataFrame, which means it counts each combination of column values and reports how many times that combination occurs:
Imagine we have a dataframe like:
df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
first_name middle_name
0 John Smith
1 Anne <NA>
2 John <NA>
3 Beth Louise
then we apply value_counts on it:
df.value_counts()
first_name middle_name
Beth Louise 1
John Smith 1
dtype: int64
As you can see, it didn't count rows with NA values.
However, count() counts non-NA cells for each column or row.
In our example:
df.count()
first_name 4
middle_name 2
dtype: int64
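As a side note (this relies on a newer pandas feature, so treat it as an assumption about your version): DataFrame.value_counts accepts dropna=False from pandas 1.3 onward, which keeps the rows containing <NA>:
df.value_counts(dropna=False)
# all four rows are now counted, including the two with <NA> middle names,
# each with a count of 1 (the ordering of the ties may differ between versions)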
The following is a subset of a data frame:
drug_id WD
lexapro.1 flu-like symptoms
lexapro.1 dizziness
lexapro.1 headache
lexapro.14 Dizziness
lexapro.14 headaches
lexapro.23 extremely difficult
lexapro.32 cry at anything
lexapro.32 Anxiety
I need to generate a column id based on the values in drug_id as follows:
id drug_id WD
1 lexapro.1 flu-like symptoms
1 lexapro.1 dizziness
1 lexapro.1 headache
2 lexapro.14 Dizziness
2 lexapro.14 headaches
3 lexapro.23 extremely difficult
4 lexapro.32 cry at anything
4 lexapro.32 Anxiety
I think I need to group them based on drug_id and then generate id based on the size of each group, but I do not know how to do it.
The shift+cumsum pattern mentioned by Boud is good; just make sure to sort by drug_id first. So something like:
df = df.sort_values('drug_id')
df['id'] = (df['drug_id'] != df['drug_id'].shift()).cumsum()
A different approach that does not involve sorting your dataframe would be to map a number to each unique drug_id.
uid = df['drug_id'].unique()
id_map = dict(zip(uid, range(1, len(uid) + 1)))
df['id'] = df['drug_id'].map(id_map)
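A closely related shortcut, not in the original answers, is groupby().ngroup(), which numbers the groups directly; a minimal sketch (the group numbers follow the sorted order of drug_id, which matches the desired ids here):
df['id'] = df.groupby('drug_id').ngroup() + 1  # 0-based group number, shifted to start at 1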
Use the shift+cumsum pattern:
(df.drug_id!=df.drug_id.shift()).cumsum()
Out[5]:
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
Name: drug_id, dtype: int32
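To attach this as the id column (a small follow-up sketch, assuming the frame is already ordered by drug_id as in the question):
df['id'] = (df.drug_id != df.drug_id.shift()).cumsum()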