Pandas group by operations on a data frame - python

I have a pandas data frame like the one below.
UsrId JobNos
1 4
1 56
2 23
2 55
2 41
2 5
3 78
1 25
3 1
I group the data frame by UsrId. Conceptually, the grouped data frame looks like this:
UsrId JobNos
1 [4,56,25]
2 [23,55,41,5]
3 [78,1]
Now, I'm looking for a built-in API that will give me the UsrId with the maximum job count. For the above example, UsrId 2 has the maximum count.
UPDATE:
Instead of the single UsrId with the maximum job count, I want the n UsrIds with the highest job counts. For the above example, if n=2 then the output is [2, 1]. Can this be done?

Something like df.groupby('UsrId').JobNos.sum().idxmax() should do it:
In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: data = """UsrId JobNos
...: 1 4
...: 1 56
...: 2 23
...: 2 55
...: 2 41
...: 2 5
...: 3 78
...: 1 25
...: 3 1"""
In [4]: df = pd.read_csv(StringIO(data), sep='\s+')
In [5]: grouped = df.groupby('UsrId')
In [6]: grouped.JobNos.sum()
Out[6]:
UsrId
1 85
2 124
3 79
Name: JobNos
In [7]: grouped.JobNos.sum().idxmax()
Out[7]: 2
If you want your results based on the number of items in each group:
In [8]: grouped.size()
Out[8]:
UsrId
1 3
2 4
3 2
In [9]: grouped.size().idxmax()
Out[9]: 2
Update: To get ordered results you can use the .order method (renamed to sort_values in later pandas versions):
In [10]: grouped.JobNos.sum().order(ascending=False)
Out[10]:
UsrId
2 124
1 85
3 79
Name: JobNos
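Note that this transcript uses Python 2's StringIO and the since-removed .order method; on a recent pandas you would use io.StringIO and sort_values instead. For the updated question (the n UsrIds with the highest job counts), here is a minimal sketch assuming a reasonably recent pandas, using Series.nlargest:
import pandas as pd

df = pd.DataFrame({'UsrId': [1, 1, 2, 2, 2, 2, 3, 1, 3],
                   'JobNos': [4, 56, 23, 55, 41, 5, 78, 25, 1]})

# n UsrIds with the largest summed JobNos (n=2 -> [2, 1])
top_by_sum = df.groupby('UsrId').JobNos.sum().nlargest(2).index.tolist()

# n UsrIds with the most rows (n=2 -> [2, 1])
top_by_count = df.groupby('UsrId').size().nlargest(2).index.tolist()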


Keep column order at DataFrame creation

I'd like to keep the columns in the order they were defined with pd.DataFrame. In the example below, df.info() shows that GroupId is the first column and print(df.iloc[:,0]) also prints GroupId, even though Id was defined first.
I'm using Python version 3.6.3
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id' : np.random.randint(1,100,10),
                   'GroupId' : np.random.randint(1,5,10) })
df.info()
print(df.iloc[:,0])
One way is to use collections.OrderedDict, as below. Note that the OrderedDict object takes a list of tuples as an input.
from collections import OrderedDict
df = pd.DataFrame(OrderedDict([('Id', np.random.randint(1,100,10)),
                               ('GroupId', np.random.randint(1,5,10))]))
# Id GroupId
# 0 37 4
# 1 10 2
# 2 42 1
# 3 97 2
# 4 6 4
# 5 59 2
# 6 12 2
# 7 69 1
# 8 79 1
# 9 17 1
Unless you're using python-3.6+ where dictionaries are ordered, this just isn't possible with a (standard) dictionary. You will need to zip your items together and pass a list of tuples:
np.random.seed(0)
a = np.random.randint(1, 100, 10)
b = np.random.randint(1, 5, 10)
df = pd.DataFrame(list(zip(a, b)), columns=['Id', 'GroupId'])
Or,
data = [a, b]
df = pd.DataFrame(list(zip(*data)), columns=['Id', 'GroupId'])
df
Id GroupId
0 45 3
1 48 1
2 65 1
3 68 1
4 68 3
5 10 2
6 84 3
7 22 4
8 37 4
9 88 3
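As a side note: on Python 3.6+ with a recent pandas (0.23 or later, if I recall correctly) a plain dict already preserves insertion order when constructing a DataFrame. Otherwise, you can make the order explicit with the columns argument; a small sketch:
import numpy as np
import pandas as pd

np.random.seed(0)

# The columns argument fixes the column order regardless of dict ordering
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)},
                  columns=['Id', 'GroupId'])
print(df.columns.tolist())  # ['Id', 'GroupId']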

Filtering grouped df in Dask

Related to this similar question for Pandas: filtering grouped df in pandas
Action
To eliminate groups based on an expression applied to a different column than the groupby column.
Problem
Filter is not implemented for grouped dataframes.
Tried
Groupby and apply to eliminate certain groups, which raises an error, presumably because the applied function is supposed to always return something?
In [16]:
def filter_empty(df):
    if not df.label.values.all(4):
        return df

df_nonempty = df_norm.groupby('hash').apply(filter_empty, meta=meta)
In [17]:
len(df_nonempty.hash.unique())
...
<ipython-input-16-6da6d9b6c069> in filter_empty()
      1 def filter_empty(df):
----> 2     if not df.label.values.all(4):
      3         return df
      4
      5 df_nonempty = df_norm.groupby('hash').apply(filter_empty, meta=meta)
/opt/conda/lib/python3.5/site-packages/numpy/core/_methods.py in _all()
39
40 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 41 return umr_all(a, axis, dtype, out, keepdims)
42
43 def _count_reduce_items(arr, axis):
ValueError: 'axis' entry is out of bounds
Question
Is there another way to achieve the Dask equivalent of Pandas' grouped.filter(lambda x: len(x) > 1)? Or is the groupby apply simply implemented wrongly?
Example
import numpy as np
import pandas as pd
import dask.dataframe as dd
In [3]:
df = pd.DataFrame({'A':list('aacaaa'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')})
df = dd.from_pandas(df, npartitions=1)
In [8]:
df.A.unique().compute()
Out[8]:
0 a
1 c
Name: A, dtype: object
In [6]:
def filter_4(df):
    if not df.B.values.all(4):
        return df

df_notall4 = df.groupby('A').apply(filter_4, meta=df)
In [10]:
df_notall4.A.unique().compute()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-894a491faa57> in <module>()
----> 1 df_notall4.A.unique().compute()
...
<ipython-input-6-ef10326ae42a> in filter_4(df)
      1 def filter_4(df):
----> 2     if not df.B.values.all(4):
      3         return df
      4
      5 df_notall4 = df.groupby('A').apply(filter_4, meta=df)
/opt/conda/lib/python3.5/site-packages/numpy/core/_methods.py in _all(a, axis, dtype, out, keepdims)
39
40 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 41 return umr_all(a, axis, dtype, out, keepdims)
42
43 def _count_reduce_items(arr, axis):
ValueError: 'axis' entry is out of bounds
I think you can use groupby + size first, then map the result onto the Series (this is like transform, which is not implemented in dask either), and finally filter by boolean indexing:
df = pd.DataFrame({'A':list('aacaaa'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbc')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 a 5 8 3 3 a
2 c 4 9 5 6 a
3 a 5 4 7 9 b
4 a 5 2 1 2 b
5 a 4 3 0 4 c
a = df.groupby('F')['A'].size()
print (a)
F
a 3
b 2
c 1
Name: A, dtype: int64
s = df['F'].map(a)
print (s)
0 3
1 3
2 3
3 2
4 2
5 1
Name: F, dtype: int64
df = df[s > 1]
print (df)
A B C D E F
0 a 4 7 1 5 a
1 a 5 8 3 3 a
2 c 4 9 5 6 a
3 a 5 4 7 9 b
4 a 5 2 1 2 b
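The same idea should carry over to the dask dataframe itself. A rough sketch, not tied to a particular dask version: compute the group sizes into a small pandas Series, map them back onto the grouping column, and filter with boolean indexing.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'A': list('aacaaa'),
                    'B': [4, 5, 4, 5, 5, 4],
                    'F': list('aaabbc')})
ddf = dd.from_pandas(pdf, npartitions=1)

# One size per group key, small enough to compute into pandas
sizes = ddf.groupby('F')['A'].size().compute()

# Map the sizes back onto F and keep groups with more than one row
mask = ddf['F'].map(sizes, meta=('F', 'int64')) > 1
result = ddf[mask].compute()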
EDIT:
I think groupby is not necessary here:
df_notall4 = df[df.C != 4].drop_duplicates(subset=['A','D'])['D'].compute()
But if you really need it:
def filter_4(x):
    return x[x.C != 4]

df_notall4 = df.groupby('A').apply(filter_4, meta=df).D.unique().compute()
print (df_notall4)
0 1
1 3
2 0
3 5
Name: D, dtype: int64
Thanks to @jezrael I reviewed my implementation and created the following solution (based on my example above).
df_notall4 = []
for d in list(df[df.C != 4].D.unique().compute()):
    df_notall4.append(df.groupby('D').get_group(d))
df_notall4 = dd.concat(df_notall4, interleave_partitions=True)
Which results in
In [8]:
df_notall4.D.unique().compute()
Out[8]:
0 1
1 3
2 5
3 0
Name: D, dtype: object
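If the goal is just to keep the D-groups that survive the C != 4 filter, the loop over get_group can probably be replaced by a single isin filter. A sketch, assuming df is the dask dataframe from the example above:
# Group keys to keep, computed once into a plain Python list
keep = df[df.C != 4].D.unique().compute().tolist()

# Boolean indexing instead of looping over get_group and concatenating
df_notall4 = df[df.D.isin(keep)]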

pandas split timeseries in groups

I have a pandas dataframe
>>> df = pd.DataFrame()
>>> df['a'] = np.random.choice(range(0,100), 200)
>>> df['b'] = np.random.choice([0,1], 200)
>>> df.head()
a b
0 69 1
1 49 1
2 79 1
3 88 0
4 57 0
>>>
Some of the variables (in this example 'a') have a lot of unique values.
I would like to replace 'a' with a2, where a2 has 5 unique values. In other words, I want to define 5 groups and assign each value of a to one of them.
For example a2=1 if 0<=df['a']<20 and a2=2 if 20<=df['a']<40 and so on.
Note:
I used group of size 20 because 100/5 = 20
How can I do that using numpy or pandas or something else?
EDIT:
Possible solution
def group_array(a):
    a = a - a.min()
    a = 100 * a/a.max()
    a = (a.apply(int)//20)+1
    return a
You could use pd.cut to categorize the values in df['a']:
import pandas as pd
df = pd.DataFrame({'a':[69,49,79,88,57], 'b':[1,1,1,0,0]})
df['a2'] = pd.cut(df['a'], bins=range(0,101,20), labels=range(1,6))
print(df)
yields
a b a2
0 69 1 4
1 49 1 3
2 79 1 4
3 88 0 5
4 57 0 3
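One caveat: by default pd.cut uses right-closed intervals, i.e. (0, 20], (20, 40], ..., so a value of exactly 0 would fall outside the bins (NaN) and 20 would land in the first group. To match the half-open intervals from the question (a2=1 if 0 <= a < 20, and so on), you can pass right=False; a small sketch:
import pandas as pd

df = pd.DataFrame({'a': [0, 19, 20, 57, 99]})

# right=False gives bins [0, 20), [20, 40), ..., [80, 100)
df['a2'] = pd.cut(df['a'], bins=range(0, 101, 20), labels=range(1, 6), right=False)
print(df)
#     a a2
# 0   0  1
# 1  19  1
# 2  20  2
# 3  57  3
# 4  99  5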

Pandas: Replace a set of rows by their sum

I'm sure there is a neat way of doing this, but haven't had any luck finding it yet.
Suppose I have a data frame:
f = pd.DataFrame({'A':[1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C':[100, 200, 300, 400]}).T
that is, with rows indexed A, B and C.
Now suppose I want to take rows A and B, and replace them both by a single row that is their sum; and, moreover, that I want to assign a given index (say 'sum') to that replacement row (note the order of indices doesn't matter).
At the moment I'm having to do:
f.append(pd.DataFrame(f.ix[['A','B']].sum()).T).drop(['A','B'])
followed by something equally clunky to set the index of the replacement row. However, I'm curious to know if there's an elegant, one-line way of doing both of these steps?
Do this:
In [79]: f.append(f.loc[['A', 'B']].sum(), ignore_index=True).drop([0, 1]).set_index(pd.Index(['C', 'sumAB']))
Out[79]:
0 1 2 3
C 100 200 300 400
sumAB 11 22 33 44
Alternatively you can use Index.get_indexer for an even uglier one-liner:
In [96]: f.append(f.loc[['A', 'B']].sum(), ignore_index=True).drop(f.index.get_indexer(['A', 'B'])).set_index(Index(['C', 'sumAB']))
Out[96]:
0 1 2 3
C 100 200 300 400
sumAB 11 22 33 44
Another option is to use concat:
In [11]: AB = list('AB')
First select the rows you wish to sum:
In [12]: f.loc[AB]
Out[12]:
0 1 2 3
A 1 2 3 4
B 10 20 30 40
In [13]: f.loc[AB].sum()
Out[13]:
0 11
1 22
2 33
3 44
dtype: int64
and as a row in a DataFrame (Note: this step may not be necessary in future versions...):
In [14]: pd.DataFrame({'sumAB': f.loc[AB].sum()}).T
Out[14]:
0 1 2 3
sumAB 11 22 33 44
and we want to concat with all the remaining rows:
In [15]: f.loc[f.index - AB]
Out[15]:
0 1 2 3
C 100 200 300 400
In [16]: pd.concat([pd.DataFrame({'sumAB': f.loc[AB].sum()}).T,
                    f.loc[f.index - AB]],
                   axis=0)
Out[16]:
0 1 2 3
sumAB 11 22 33 44
C 100 200 300 400
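In more recent pandas versions, set operations with - on an Index were removed (use Index.difference instead), and append was eventually deprecated as well. A roughly equivalent sketch that should work on current versions:
import pandas as pd

f = pd.DataFrame({'A': [1, 2, 3, 4],
                  'B': [10, 20, 30, 40],
                  'C': [100, 200, 300, 400]}).T

# Sum rows A and B into a single row named 'sumAB', then concatenate
# it with the remaining rows (row C).
summed = f.loc[['A', 'B']].sum().rename('sumAB')
result = pd.concat([f.drop(['A', 'B']), summed.to_frame().T])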

Python Pandas: remove entries based on the number of occurrences

I'm trying to remove entries from a data frame which occur less than 100 times.
The data frame data looks like this:
pid tag
1 23
1 45
1 62
2 24
2 45
3 34
3 25
3 62
Now I count the number of tag occurrences like this:
bytag = data.groupby('tag').aggregate(np.count_nonzero)
But then I can't figure out how to remove those entries which have low count...
New in 0.12, groupby objects have a filter method, allowing you to do these types of operations:
In [11]: g = data.groupby('tag')
In [12]: g.filter(lambda x: len(x) > 1) # pandas 0.13.1
Out[12]:
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
The function (the first argument of filter) is applied to each group (subframe), and the results include elements of the original DataFrame belonging to groups which evaluated to True.
Note: in 0.12 the ordering is different than in the original DataFrame; this was fixed in 0.13+:
In [21]: g.filter(lambda x: len(x) > 1) # pandas 0.12
Out[21]:
pid tag
1 1 45
4 2 45
2 1 62
7 3 62
Edit: Thanks to @WesMcKinney for showing this much more direct way:
data[data.groupby('tag').pid.transform(len) > 1]
import pandas
import numpy as np
data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
    })
bytag = data.groupby('tag').aggregate(np.count_nonzero)
tags = bytag[bytag.pid >= 2].index
print(data[data['tag'].isin(tags)])
yields
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
Here are some run times for a couple of the solutions posted here, along with one that was not posted (using value_counts()) and that is much faster than the other solutions:
Create the data:
import pandas as pd
import numpy as np
# Generate some 'users'
np.random.seed(42)
df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})
# Prove that some entries are 1
print "{:,} users only occur once in dataset".format(sum(df.uid.value_counts() == 1))
Output:
171 users only occur once in dataset
Time a few different ways of removing users with only one entry. These were run in separate cells in a Jupyter Notebook:
%%timeit
df.groupby(by='uid').filter(lambda x: len(x) > 1)
%%timeit
df[df.groupby('uid').uid.transform(len) > 1]
%%timeit
vc = df.uid.value_counts()
df[df.uid.isin(vc.index[vc.values > 1])].uid.value_counts()
These gave the following outputs:
10 loops, best of 3: 46.2 ms per loop
10 loops, best of 3: 30.1 ms per loop
1000 loops, best of 3: 1.27 ms per loop
df = pd.DataFrame([(1, 2), (1, 3), (1, 4), (2, 1),(2,2,)], columns=['col1', 'col2'])
In [36]: df
Out[36]:
col1 col2
0 1 2
1 1 3
2 1 4
3 2 1
4 2 2
gp = df.groupby('col1').aggregate(np.count_nonzero)
In [38]: gp
Out[38]:
col2
col1
1 3
2 2
Let's get the groups where the count > 2:
tf = gp[gp.col2 > 2].reset_index()
df[df.col1.isin(tf.col1)]
Out[41]:
col1 col2
0 1 2
1 1 3
2 1 4
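For the original question (dropping tags that occur fewer than 100 times), a compact modern spelling uses transform('size'). A sketch on the sample data, with the threshold lowered to 2 so something survives:
import pandas as pd

data = pd.DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3],
                     'tag': [23, 45, 62, 24, 45, 34, 25, 62]})

# Replace 2 with 100 for the real threshold
filtered = data[data.groupby('tag')['tag'].transform('size') >= 2]
print(filtered)
#    pid  tag
# 1    1   45
# 2    1   62
# 4    2   45
# 7    3   62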
