I have a dataframe like this:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
Calling
In [10]: print(df.groupby("A")["B"].sum())
will return
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
I have been trying to find ways to do this.
Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although
df.groupby("A")["B"]
is a
pandas.core.groupby.SeriesGroupBy object
so I was hoping any Series method would work. Any ideas?
In [4]: df = pd.read_csv(StringIO(data), sep=r'\s+')
In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
When you apply your own function, there is no automatic exclusion of non-numeric columns. This is slower, though, than applying .sum() directly to the groupby:
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random
On a string column, sum concatenates by default:
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can do pretty much what you want:
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
You can also do this on the whole frame, one group at a time. The key is to return a Series:
def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))
In [14]: df.groupby('A').apply(f)
Out[14]:
A B C
A
1 2 1.615586 {This, string}
2 4 0.421821 {is, !}
3 3 0.463468 {a}
4 4 0.643961 {random}
You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.
>>> d
A B
0 1 This
1 2 is
2 3 a
3 4 random
4 1 string
5 2 !
>>> d.groupby('A')['B'].apply(list)
A
1 [This, string]
2 [is, !]
3 [a]
4 [random]
dtype: object
If you want something else, just write a function that does what you want and then apply that.
You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)
df.groupby('A')['B'].agg(lambda col: ''.join(col))
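Since str.join accepts any iterable directly, the lambda isn't strictly necessary; the bound method can be passed to agg on its own (same caveat: untested here):
df.groupby('A')['B'].agg(''.join)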
You could try this:
df.groupby('A').agg({'B':'sum','C':'-'.join})
Named aggregations with pandas >= 0.25.0
Since pandas version 0.25.0 we have named aggregations, where we can group by, aggregate, and assign new names to our columns all at once. This way we won't get MultiIndex columns, and the column names make more sense given the data they contain:
aggregate and get a list of strings
grp = df.groupby('A').agg(B_sum=('B','sum'),
C=('C', list)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 [This, string]
1 2 0.421821 [is, !]
2 3 0.463468 [a]
3 4 0.643961 [random]
aggregate and join the strings
grp = df.groupby('A').agg(B_sum=('B','sum'),
C=('C', ', '.join)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 This, string
1 2 0.421821 is, !
2 3 0.463468 a
3 4 0.643961 random
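If you want actual Python sets, as in the desired output of the original question, passing the builtin set as the aggregation function should also work (a sketch in the same style, same df assumed):
grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', set)).reset_index()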
A simple solution would be:
>>> df.groupby('A')['C'].unique().reset_index()
If you'd like to overwrite column B in the dataframe, this should work:
df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
Following Erfan's good answer: most of the time, when analyzing aggregated values, you want the unique combinations of the existing string values:
unique_chars = lambda x: ', '.join(x.unique())
(df
.groupby(['A'])
.agg({'C': unique_chars}))
Related
Similar to a previous question, I need to apply a function to grouped data after using groupby.
Suppose we have the following dataset with column heads A, B, C:
A B C
a 0 2
b 0 3
c 0 4
a 1 1
b 1 2
c 1 5
a 0 3
b 1 2
c 0 1
and the following group by is applied:
mydf=df.groupby(['A','B'])['C'].max()
I need to filter the groupby result to show only the groups where C is 3, but I get an error:
print(mydf.apply(lambda g: g[g['C'] == 3]))
Thank you in advance.
You can use .loc[] with a lambda here:
df.groupby(['A','B'])['C'].max().loc[lambda x: x==3]
A B
a 0 3
b 0 3
Name: C, dtype: int64
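Equivalently, if the lambda feels opaque, you can bind the intermediate result to a name and filter it with an ordinary boolean mask; this is the same idea written out in two steps:
mydf = df.groupby(['A','B'])['C'].max()
mydf[mydf == 3]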
I want to repeat the values of a pandas Series, each value a given number of times. It would not be difficult if I did it very inefficiently.
Example)
a = pd.Series([3,4,5])
b = pd.Series([1,2,3])
Using a and b, I want the following result.
pd.Series([3,4,4,5,5,5])
The values come from a and the repeat counts from b. I would like an efficient way to do this with pandas using the two Series. If you know how to solve this problem, please share it.
Use numpy.repeat with the Series constructor; the only requirement is that both Series have the same length:
c = pd.Series(np.repeat(a.values, b))
#pandas 0.24+
#c = pd.Series(np.repeat(a.to_numpy(), b))
print (c)
0 3
1 4
2 4
3 5
4 5
5 5
dtype: int64
Not the best answer though:
>>> pd.Series([i for x,y in zip(a,b) for i in [x]*y])
0 3
1 4
2 4
3 5
4 5
5 5
dtype: int64
Alternatively, you can do:
a.loc[a.index.repeat(b)]
# to reset the index: a.loc[a.index.repeat(b)].reset_index(drop=True)
0 3
1 4
1 4
2 5
2 5
2 5
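Note that Series also has a repeat method of its own, so the same idea can be written without going through .loc; a small sketch, passing the counts as a plain array:
a.repeat(b.to_numpy()).reset_index(drop=True)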
I am trying to select single rows from a bunch of dataframes and make a new dataframe by concatenating them together.
Here is a simple example
x=pd.DataFrame([[1,2,3],[1,2,3]],columns=["A","B","C"])
A B C
0 1 2 3
1 1 2 3
a=x.loc[0,:]
A 1
B 2
C 3
Name: 0, dtype: int64
b=x.loc[1,:]
A 1
B 2
C 3
Name: 1, dtype: int64
c=pd.concat([a,b])
I end up with this:
A 1
B 2
C 3
A 1
B 2
C 3
Name: 0, dtype: int64
Whereas I would expect the original dataframe:
A B C
0 1 2 3
1 1 2 3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to stack two Series vertically as the rows of a DataFrame, one option is to concat along the columns and transpose, as shown below.
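A minimal sketch of that first option, using the a and b from the question:
pd.concat([a, b], axis=1).T
#   A  B  C
# 0  1  2  3
# 1  1  2  3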
Another is using np.vstack:
pd.DataFrame(np.vstack([a, b]), columns=a.index)
A B C
0 1 2 3
1 1 2 3
Since you are slicing by index, I'd use .iloc, and note the difference between [[]] and [], which return a DataFrame and a Series respectively.*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
# A B C
#0 1 2 3
#1 1 2 3
To still use .loc, you'd do something like
a = x.loc[[0]]
b = x.loc[[1]]
*There's a small caveat that if index 0 is duplicated in x then x.loc[0,:] will return a DataFrame and not a Series.
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1,2,3],[1,2,3]], columns=["A","B","C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
# A B C
# 0 1 2 3
# 1 1 2 3
I have a dataframe on which I'm using pandas.groupby on a specific column, then running aggregate statistics on the resulting groups (mean, median, count). I want certain column values to be treated as members of the same group, rather than each distinct value forming its own group. I was looking into how to accomplish such a thing.
For example:
>> my_df
ID SUB_NUM ELAPSED_TIME
1 1 1.7
2 2 1.4
3 2 2.1
4 4 3.0
5 6 1.8
6 6 1.2
So instead of the typical behavior:
>> my_df.groupby(['SUB_NUM']).agg(['count'])
ID SUB_NUM Count
1 1 1
2 2 2
4 4 1
5 6 2
I want certain values (SUB_NUM in [1, 2]) to be counted as one group, so that something like this is produced instead:
>> # Some mystery pandas function calls
ID SUB_NUM Count
1 1, 2 3
4 4 1
5 6 2
Any help would be much appreciated, thanks!
This works for me:
#to join the values, convert them to strings
df['SUB_NUM'] = df['SUB_NUM'].astype(str)
#create mapping dict by dict comprehension
L = ['1','2']
d = {x: ','.join(L) for x in L}
print (d)
{'2': '1,2', '1': '1,2'}
#replace values by dict
a = df['SUB_NUM'].replace(d)
print (a)
0 1,2
1 1,2
2 1,2
3 4
4 6
5 6
Name: SUB_NUM, dtype: object
#groupby by mapping column and aggregating `first` and `size`
print (df.groupby(a)
.agg({'ID':'first', 'ELAPSED_TIME':'size'})
.rename(columns={'ELAPSED_TIME':'Count'})
.reset_index())
SUB_NUM ID Count
0 1,2 1 3
1 4 4 1
2 6 5 2
What is the difference between size and count in pandas?
You can create another column mapping the SUB_NUM values to actual groups and then group by it.
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)
my_df.groupby(['SUB_GROUP']).agg(['count'])
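A runnable sketch of this approach on the question's my_df, using named aggregation (so it assumes pandas >= 0.25); note the group label comes out as 1 rather than '1, 2':
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)
print(my_df.groupby('SUB_GROUP').agg(ID=('ID', 'first'),
                                     Count=('ID', 'size')))
#            ID  Count
# SUB_GROUP
# 1           1      3
# 4           4      1
# 6           5      2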
I got lost in the pandas docs and features trying to figure out a way to group a DataFrame's columns by the value of their sums.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped, since their sums are all equal to 1. The resulting DataFrame would have column labels equal to the sum of the columns it aggregated, like this:
1 9
0 2 2
1 1 3
2 0 4
Any ideas to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1), and take the sum of each group.
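Note that groupby(..., axis=1) is deprecated in recent pandas versions; assuming the same df, an equivalent that avoids it is to transpose, group the rows by the column sums, and transpose back:
df.T.groupby(df.sum()).sum().T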
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem that way:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(axis=1)
print(df.groupby('totals').sum().transpose())
#totals 1 9
#0 2 2
#1 1 3
#2 0 4