Count and sort pandas dataframe - python

I have a dataframe with column 'code' which I have sorted based on frequency.
In order to see what each code means, there is also a column 'note'.
For each group of 'code', I also display the first note attached to that code:
df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
Now my question is, how do I display only those rows that have frequency of e.g. >= 30?

Add a query call before you sort. Also, if you only want the rows that exactly equal < insert frequency here >, sort_values isn't needed:
df.groupby('code')['note'].agg(['count', 'first']).query('count == 30')
If the question is for all groups with AT LEAST < insert frequency here >, then
(
df.groupby('code')
.note.agg(['count', 'first'])
.query('count >= 30')
.sort_values('count', ascending=False)
)
Why do I use query? It's a lot easier to pipe and chain with it.
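For comparison, the same filter without query needs an intermediate variable; a minimal sketch (not from the original answer):
agg = df.groupby('code')['note'].agg(['count', 'first'])
agg = agg[agg['count'] >= 30].sort_values('count', ascending=False)
The query version above stays one unbroken chain, which is the point.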

You can just filter your result accordingly (here grp is the aggregated groupby result):
grp = grp[grp['count'] >= 30]
Example with data
import pandas as pd
df = pd.DataFrame({'code': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
'note': ['A', 'B', 'A', 'A', 'C', 'C', 'C', 'A', 'A',
'B', 'B', 'C', 'A', 'B'] })
res = df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
# count first
# code
# 2 5 C
# 3 5 B
# 1 4 A
res2 = res[res['count'] >= 5]
# count first
# code
# 2 5 C
# 3 5 B

Related

select element in a column A while column B does not have a value

I want to select products as long as they do not contain a 0 in x.
Input:
test = pd.DataFrame(
[
['a', 0],
['a', 3],
['a', 4],
['b', 3],
['b', 2],
['c', 1],
['d', 0]
]
)
test.columns = ['product', 'x']
test.query("select distinct (product) where x not in (0) ")
Expected outcome:
b,c
How to do this in both pandas and SQL?
In SQL, you would use:
select product
from t
group by product
having min(x) > 0;
This works assuming x is never negative. A more general formulation is:
having sum(case when x = 0 then 1 else 0 end) = 0
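The same having logic can be written directly in pandas; a minimal sketch (not from the original answers, and again assuming x is never negative):
ok = test.groupby('product')['x'].min().gt(0)
ok[ok].index.tolist()
# ['b', 'c']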
In pandas you can do this with isin:
test.loc[~test['product'].isin(test.loc[test.x.eq(0),'product']),'product'].unique()
Out[41]: array(['b', 'c'], dtype=object)
Or do it with set operations:
set(test['product'].tolist())-set(test.loc[test.x.eq(0),'product'].tolist())
Out[47]: {'b', 'c'}
If you want to filter your dataframe, you can use groupby with .any():
test[~test.groupby('product')['x'].transform(lambda x: x.eq(0).any())]
Output:
  product  x
3       b  3
4       b  2
5       c  1
If you only want to see the unique values, you can add ['product'].unique().tolist() at the end of the code above.
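Spelled out, that combined expression is (a sketch of the suggestion above):
test[~test.groupby('product')['x'].transform(lambda x: x.eq(0).any())]['product'].unique().tolist()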
Then we have the output:
['b', 'c']

Python pandas: elegant division within dataframe

I'm new on Stack Overflow and have switched from R to Python. I'm trying to do something that is probably not too difficult, and while I can hack it together, I am wondering what the most pythonic way to do it is. I want to divide certain values in a column (E where F=a) by values further down the same column (E where F=b), using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1], 'E':[10,20,30,40,50,100, 250, 250, 360, 567, 400],'F':['a', 'a', 'a', 'a', 'a', 'b','b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12, 0.1111, 0.0881]})
print(out)
Can anyone help write this nicely?
I'm not entirely sure what you mean by "using D column as lookup" since there is no need for such lookup in the example you provided.
However the quick and dirty way to achieve the output you did provide is
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which makes output to be
a/b D
0 0.100000 1
1 0.080000 2
2 0.120000 3
3 0.111111 4
4 0.088183 5
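Note that this quick version relies on the 'a' and 'b' rows sharing the same D order. A pivot-based sketch (not from the original answers) aligns on D explicitly:
# one row per D value, one column per F value; missing combinations become NaN
p = df.pivot(index='D', columns='F', values='E')
out = (p['a'] / p['b']).rename('a/b').reset_index()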
Lookup with .loc on the dataframe, as df.loc[rows, columns], selecting where the row and column conditions are True:
import numpy as np
# get the unique keys from column D; sorted() gives a deterministic order
idx = sorted(set(df['D']))
# A is an array of values with 'F'=a
A = np.array([df.loc[(df['F']=='a') & (df['D']==i),'E'].values[0] for i in idx])
# B is an array of values with 'F'=b
B = np.array([df.loc[(df['F']=='b') & (df['D']==i),'E'].values[0] for i in idx])
# Now divide and build the new dataframe of ratios
out = pd.DataFrame(np.vstack([A/B,idx]).T, columns = ['a/b','D'])
Instead of using numpy.vstack, you can use:
out = pd.DataFrame([A/B, idx]).T
out.columns = ['a/b','D']
with the same result. I tried to do it in a single line (for no reason whatsoever)
Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
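One small follow-up (not part of the original answer): the ratio column keeps the name E after the division, so a rename reproduces the desired header:
out = out.rename(columns={'E': 'a/b'})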
Thanks for your thoughts - I got inspired.

Subset data by group within pandas dataframe

I need to subset a dataframe using groups and three conditional rules. If within a group all values in the Value column are none, I need to retain the first row for that group. If within a group all values in the Value column are not none, I need to retain all the values. If within a group some of the values in the Value column are none and others not none, I need to drop all rows where there is a none. Columns Region and ID together define a unique group within the dataframe.
My first approach was to separate the dataframe into two chunks. The first chunk is rows where for a group there are all nulls. The second chunk is everything else. For the chunk of data where rows for a group contained all nulls, I would create a rownumber using a cumulative count of rows by group and query rows where the cumulative count = 1. For the second chunk, I would drop all rows where Value is null. Then I would append the dataframes.
Sample source dataframe
dfInput = pd.DataFrame({
'Region': [1, 1, 2, 2, 2, 2, 2],
'ID': ['A', 'A', 'B', 'B', 'B', 'A', 'A'],
'Value':[0, 1, 1, None, 2, None, None],
})
Desired output dataframe:
dfOutput = pd.DataFrame({
'Region': [1, 1, 2, 2, 2],
'ID': ['A', 'A', 'B', 'B', 'A'],
'Value':[0, 1, 1, 2, None],
})
Just follow your logic, using groupby:
dfInput.groupby(['Region','ID']).Value.apply(lambda x : x.head(1) if x.isnull().all() else x.dropna()).\
reset_index(level=[0,1]).sort_index()
Out[86]:
Region ID Value
0 1 A 0.0
1 1 A 1.0
2 2 B 1.0
4 2 B 2.0
5 2 A NaN
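An equivalent sketch (not from the original answer) that builds the same subset with boolean masks instead of apply:
# True for rows whose (Region, ID) group contains only missing Values
all_na = dfInput.groupby(['Region', 'ID'])['Value'].transform(lambda s: s.isna().all())
# True for the first row of each (Region, ID) group
first_in_group = ~dfInput.duplicated(['Region', 'ID'])
# keep non-missing rows, plus the first row of the all-missing groups
dfInput[dfInput['Value'].notna() | (all_na & first_in_group)]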

How to get back the index after groupby in pandas

I am trying to find the record with the maximum value among the first records of each group after groupby, and to delete that row from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
cost
item_id
d 5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Try this?
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
t=df.drop_duplicates(subset=['item_id'],keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or using not in
Consider this df with few more rows
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5,1,7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
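Another sketch that sidesteps the lost index entirely (not from the original answers): groupby(...).head(1) returns the first row of each group with the original index preserved, so the row to delete can be located directly:
t = df.groupby('item_id').head(1)   # first row per item_id, original index kept
row_to_drop = t['cost'].idxmax()    # original index label of the max-cost first row
df_remaining = df.drop(row_to_drop)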
Overview: create a dataframe from a dictionary, group by item_id and find the max value, then enumerate over the grouped result, using the numeric key to recover the alphabetic index value. Build a result_df dataframe if you wish.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped=df_temp.groupby(['item_id'])['cost'].max()
rows = []
for key, value in enumerate(grouped):
    index = grouped.index[key]  # map the numeric position back to the item_id label
    rows.append({'item_id': index, 'cost': value})
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])
print(result_df.head(5))

How to get number of rows in a grouped-by category in pandas

In pandas, I have (app_categ_events is a dataframe):
> print(app_categ_events.label_id.unique().shape)
> print(app_categ_events.category.unique().shape)
Out:
(492,)
(458,)
I want to look at the categories that have more than one label_id each (because I thought there was supposed to be a one-to-one mapping).
In r data.table, I can do:
app_categ_events[, count_rows := .N, by = list(category, label_id)]
# (or smth of that sort...)
print(app_categ_events[count_rows > 1])
What’s the best way of doing that in pandas?
We use transform to create the 'count_rows' column after grouping by 'category' and 'label_id':
app_categ_events['count_rows'] = app_categ_events.groupby(['category',
'label_id'])['label_id'].transform('count')
print(app_categ_events)
# category label_id count_rows
#0 a 1 2
#1 a 1 2
#2 b 2 1
#3 b 3 1
Now, the equivalent of data.table as showed in the OP's post would be
print(app_categ_events[app_categ_events.count_rows>1])
# category label_id count_rows
#0 a 1 2
#1 a 1 2
data
import pandas as pd;
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'], 'label_id': [1, 1, 2, 3]})
You can use filtration to return the desired results.
df = pd.DataFrame({'label_id': [1, 1, 2, 3],
'category': ['a', 'b', 'b', 'c']})
df.groupby(['category']).filter(lambda group: len(group) > 1)
category label_id
1 b 1
2 b 2
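If the goal is specifically more than one distinct label_id per category (rather than more than one row), the same filter idea works with nunique; a sketch (not from the original answer):
app_categ_events.groupby('category').filter(lambda g: g['label_id'].nunique() > 1)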
Given:
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
'label_id': [1, 1, 2, 3]})
Solution:
# identify categories with greater than 1 number of related label_id's
cat_mask = app_categ_events.groupby('category')['label_id'].nunique().gt(1)
cats = cat_mask[cat_mask]
# filter data
app_categ_events[app_categ_events.category.isin(cats.index)]
