Pandas groupby two columns then get dict for values - python

I have a pandas dataframe:
banned_titles =
   TitleId  RelatedTitleId
0    89989           32598
1    89989         3085083
2    95281         3085083
When I apply groupby as follows:
In [84]: banned_titles.groupby('TitleId').groups
Out[84]: {89989: [0, 1], 95281: [2]}
This is so close, but not what I want.
What I want is:
{89989: [32598, 3085083], 95281: [3085083]}
Is there a way to do this?

Try this:
In [8]: x.groupby('TitleId')['RelatedTitleId'].apply(lambda x: x.tolist()).to_dict()
Out[8]: {89989: [32598, 3085083], 95281: [3085083]}
or as series of lists:
In [10]: x.groupby('TitleId')['RelatedTitleId'].apply(lambda x: x.tolist())
Out[10]:
TitleId
89989    [32598, 3085083]
95281           [3085083]
Name: RelatedTitleId, dtype: object
data:
In [9]: x
Out[9]:
   TitleId  RelatedTitleId
0    89989           32598
1    89989         3085083
2    95281         3085083

Or try it in one line, passing list directly (no lambda):
dict(df.groupby('TitleId')['RelatedTitleId'].apply(list))
# {89989: [32598, 3085083], 95281: [3085083]}
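On recent pandas versions the same thing can be written without a lambda by aggregating with the built-in list; a minimal sketch, assuming the banned_titles frame from the question:

import pandas as pd

banned_titles = pd.DataFrame({
    'TitleId': [89989, 89989, 95281],
    'RelatedTitleId': [32598, 3085083, 3085083],
})

# agg(list) collects each group's values into a list; to_dict() then
# turns the resulting Series into the desired mapping
result = banned_titles.groupby('TitleId')['RelatedTitleId'].agg(list).to_dict()
print(result)  # {89989: [32598, 3085083], 95281: [3085083]}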

Related

Pandas groupby then drop groups below specified size

I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).
Here's what I've tried:
df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)
But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.
You can use len:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
   A  B
0  1  2
1  1  4
The number of rows is in the attribute .shape[0]:
df.groupby('A').filter(lambda x: x.shape[0] >= min_size)
NB: If you want to remove the groups below the minimum size, keep those at or above it (>=, not >).
groupby.filter can be very slow for larger datasets or a large number of groups. A faster approach is groupby.transform.
Here's an example; first, create the dataset:
import pandas as pd
import numpy as np
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
             y    A
0     1.375980    1
1    -0.023861    1
2    -0.474707    1
3    -0.151859    2
4    -1.696823    2
...        ...  ...
2424  0.276737  998
2425 -0.142171  999
2426 -0.718891  999
2427 -0.621315  999
2428  1.335450  999

[2429 rows x 2 columns]
Time it:
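A sketch of the transform-based filter referenced above, using an assumed min_size threshold; transform('size') broadcasts each group's size back to its rows, so a plain boolean mask does the filtering:

min_size = 2  # assumed threshold for illustration

# for every row, transform('size') returns the size of that row's group,
# so the mask keeps only rows belonging to sufficiently large groups
fast = df[df.groupby('A')['A'].transform('size') >= min_size]

# the equivalent groupby.filter version, typically far slower here
slow = df.groupby('A').filter(lambda g: len(g) >= min_size)

assert fast.equals(slow)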

Pandas drop duplicates on elements made of lists

Say my dataframe is:
df = pandas.DataFrame([[[1,0]],[[0,0]],[[1,0]]])
which yields:
0
0 [1, 0]
1 [0, 0]
2 [1, 0]
I want to drop duplicates, and only get elements [1,0] and [0,0], if I write:
df.drop_duplicates()
I get the following error: TypeError: unhashable type: 'list'
How can I call drop_duplicates()?
More generally:
df = pandas.DataFrame([[[1,0],"a"],[[0,0],"b"],[[1,0],"c"]], columns=["list", "letter"])
And I want to call df["list"].drop_duplicates(), so that drop_duplicates applies to a Series rather than a DataFrame?
You can use the numpy.unique() function:
>>> import numpy as np
>>> df = pandas.DataFrame([[[1,0]],[[0,0]],[[1,0]]])
>>> pandas.DataFrame(np.unique(df), columns=df.columns)
        0
0  [0, 0]
1  [1, 0]
If you want to preserve the order, check out: numpy.unique with order preserved.
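For reference, the usual order-preserving recipe passes return_index to np.unique and re-sorts by first occurrence; a sketch, which relies on the lists being mutually comparable so np.unique can sort them:

import numpy as np
import pandas as pd

df = pd.DataFrame([[[1, 0]], [[0, 0]], [[1, 0]]])

arr = df[0].values
# return_index yields the position of each unique value's first occurrence;
# sorting those positions restores the original row order
_, first_pos = np.unique(arr, return_index=True)
ordered = pd.DataFrame(arr[np.sort(first_pos)], columns=df.columns)
print(ordered)
#         0
# 0  [1, 0]
# 1  [0, 0]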
drop_duplicates
Call drop_duplicates on tuplized data:
df[0].apply(tuple).drop_duplicates().apply(list).to_frame()
        0
0  [1, 0]
1  [0, 0]
collections.OrderedDict
However, I'd much prefer something that doesn't involve apply...
from collections import OrderedDict

pd.Series(map(
    list, OrderedDict.fromkeys(map(tuple, df[0].tolist()))
)).to_frame()
Or,
pd.Series(
    list(k) for k in OrderedDict.fromkeys(map(tuple, df[0].tolist()))
).to_frame()
        0
0  [1, 0]
1  [0, 0]
I tried the other answers but they didn't solve what I needed (large dataframe with multiple list columns).
I solved it this way:
df = df[~df.astype(str).duplicated()]
Here is one way, by turning your series of lists into separate columns, and only keeping the non-duplicates:
df[~df[0].apply(pandas.Series).duplicated()]
        0
0  [1, 0]
1  [0, 0]
Explanation:
df[0].apply(pandas.Series) returns:
   0  1
0  1  0
1  0  0
2  1  0
From which you can find duplicates:
>>> df[0].apply(pd.Series).duplicated()
0    False
1    False
2     True
And finally, index using that mask.

Pandas str.count

Consider the following dataframe. I want to count the number of '$' that appear in a string. I use the str.count function in pandas (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.count.html).
>>> import pandas as pd
>>> df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
>>> df['A'].str.count('$')
0    1
1    1
2    1
Name: A, dtype: int64
I was expecting the result to be [2,2,1]. What am I doing wrong?
In plain Python, the built-in str.count method returns the correct result.
>>> a = "$$$$abcd"
>>> a.count('$')
4
>>> a = '$abcd$dsf$'
>>> a.count('$')
3
$ has a special meaning in regex: it matches end-of-line. So try this:
In [21]: df.A.str.count(r'\$')
Out[21]:
0    2
1    2
2    1
Name: A, dtype: int64
As the other answers have noted, the issue here is that $ denotes the end of the line. If you do not intend to use regular expressions, you may find that str.count (that is, the method of the built-in str type) is faster than its pandas counterpart:
In [39]: df['A'].apply(lambda x: x.count('$'))
Out[39]:
0    2
1    2
2    1
Name: A, dtype: int64
In [40]: %timeit df['A'].str.count(r'\$')
1000 loops, best of 3: 243 µs per loop
In [41]: %timeit df['A'].apply(lambda x: x.count('$'))
1000 loops, best of 3: 202 µs per loop
Try the pattern [$]. Placing $ inside square brackets [] makes the regex engine treat it as a literal character rather than the end-of-line anchor (see this cheatsheet):
In [3]:
df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
df['A'].str.count('[$]')
Out[3]:
0    2
1    2
2    1
Name: A, dtype: int64
Taking a cue from @fuglede:
pd.Series([x.count('$') for x in df.A.values.tolist()], df.index)
As pointed out by @jezrael, the above fails when there is a null value, so...
def tc(x):
    try:
        return x.count('$')
    except AttributeError:  # missing values have no .count method
        return 0

pd.Series([tc(x) for x in df.A.values.tolist()], df.index)
Timings:
np.random.seed([3, 1415])
df = pd.Series(np.random.randint(0, 100, 100000)) \
    .apply(lambda x: '$' * x).to_frame('A')
df.A.replace('', np.nan, inplace=True)
def tc(x):
    try:
        return x.count('$')
    except AttributeError:
        return 0
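Note that the try/except is only needed on the list-comprehension route; the pandas .str.count method handles missing values natively, as a small sketch shows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['$$a', np.nan, '$c']})

# .str methods propagate missing values as NaN instead of raising
print(df['A'].str.count(r'\$'))
# 0    2.0
# 1    NaN
# 2    1.0
# Name: A, dtype: float64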

Pandas - how to slice value_counts?

I would like to slice a pandas value_counts():
> sur_perimetre[col].value_counts()
44341006.0    610
14231009.0    441
12131001.0    382
12222009.0    364
12142001.0    354
But I get an error:
> sur_perimetre[col].value_counts()[:5]
KeyError: 5.0
The same with ix:
> sur_perimetre[col].value_counts().ix[:5]
KeyError: 5.0
How would you deal with that?
EDIT
Maybe:
pd.DataFrame(sur_perimetre[col].value_counts()).reset_index()[:5]
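The KeyError happens because the index holds floats: with a non-integer index, value_counts()[:5] and .ix[:5] are interpreted as label slices, and there is no label 5.0. Purely positional slicing sidesteps the lookup; a minimal sketch, using a stand-in series since sur_perimetre is not shown:

import pandas as pd

s = pd.Series([610, 441, 382, 364, 354],
              index=[44341006.0, 14231009.0, 12131001.0, 12222009.0, 12142001.0])

# .iloc always slices by position, regardless of the index dtype;
# s.head(5) is equivalent
print(s.iloc[:5])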
Method 1:
You need to observe that value_counts() returns a Series object. You can process it like any other Series and get its values. You can even construct a new dataframe out of it.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: vc = df.C1.value_counts()
In [4]: type(vc)
Out[4]: pandas.core.series.Series
In [5]: vc.values
Out[5]: array([2, 1, 1, 1, 1])
In [6]: vc.values[:2]
Out[6]: array([2, 1])
In [7]: vc.index.values
Out[7]: array([3, 5, 4, 2, 1])
In [8]: df2 = pd.DataFrame({'value':vc.index, 'count':vc.values})
In [8]: df2
Out[8]:
   count  value
0      2      3
1      1      5
2      1      4
3      1      2
4      1      1
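On recent pandas versions, the same two-column frame can be built straight from the Series, without assembling a dict; a small sketch using vc from above, where rename_axis names the index so reset_index can turn it into a proper column:

df2 = vc.rename_axis('value').reset_index(name='count')
# columns: value, count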
Method 2:
I then tried to reproduce the error you mentioned but, using a single column in a DataFrame, I didn't get any error with the same notation:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: df['C1'].value_counts()[:3]
Out[3]:
3    2
5    1
4    1
Name: C1, dtype: int64
In [4]: df.C1.value_counts()[:5]
Out[4]:
3    2
5    1
4    1
2    1
1    1
Name: C1, dtype: int64
In [5]: pd.__version__
Out[5]: u'0.17.1'
Hope it helps!

Extract on "-" delimited in pandas

I have a pandas series with some values like 19.99-20.99 (i.e. two numbers separated by a dash).
How would you just take the left or right value?
Use split("-") on the string and then access the result with index notation, i.e. split_result[1].
Here's an example:
In [5]: my_series = pandas.Series(['19.22-20.11','18.55-34.22','12.33-22.00','13.33-34.23'])
In [6]: my_series[0]
Out[6]: '19.22-20.11'
In [7]: my_series[0].split("-")
Out[7]: ['19.22', '20.11']
In [8]: my_series[0].split("-")[0]
Out[8]: '19.22'
In [9]: my_series[0].split("-")[1]
Out[9]: '20.11'
In [1]: s = pd.Series(['19.99-20.99', '20.99-21.99'])
In [2]: s.str.split('-').str[0]
Out[2]:
0    19.99
1    20.99
dtype: object
In [3]: s.str.split('-').str[1]
Out[3]:
0    20.99
1    21.99
dtype: object
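If both sides are needed at once, str.split with expand=True returns the parts as separate columns; a small sketch with s from above:

parts = s.str.split('-', expand=True)
print(parts)
#        0      1
# 0  19.99  20.99
# 1  20.99  21.99
left, right = parts[0], parts[1]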
