Is there a way to slice a DataFrameGroupBy object?
For example, if I have:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})
A B
0 2 x
1 1 y
2 1 z
3 3 r
4 3 p
dfg = df.groupby('A')
Now, the returned GroupBy object is indexed by values from A, and I would like to select a subset of it, e.g. to perform aggregation. It could be something like
dfg.loc[1:2].agg(...)
or, for a specific column,
dfg['B'].loc[1:2].agg(...)
EDIT. To make it clearer: by slicing the GroupBy object I mean accessing only a subset of groups. In the above example, the GroupBy object will contain 3 groups, for A = 1, A = 2, and A = 3. For some reason, I may only be interested in the groups for A = 1 and A = 2.
It seems you need a custom function with iloc; note that if you use agg, the function must return an aggregated value:
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[0:3]))
print (df)
A
1 y,z
2 x
3 r,p
Name: B, dtype: object
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
A
1 z
2
3 p
Name: B, dtype: object
For multiple columns:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3],
'B': ['x', 'y', 'z', 'r', 'p'],
'C': ['g', 'y', 'y', 'u', 'k']})
print (df)
A B C
0 2 x g
1 1 y y
2 1 z y
3 3 r u
4 3 p k
df = df.groupby('A').agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
B C
A
1 z y
2
3 p k
If I understand correctly, you only want some of the groups, but those should be returned in full:
A B
1 1 y
2 1 z
0 2 x
You can solve your problem by extracting the keys and then selecting groups based on those keys.
Assuming you already know the groups:
pd.concat([dfg.get_group(1), dfg.get_group(2)])
If you don't know the group names and just want some n groups, this might work:
pd.concat([dfg.get_group(n) for n in list(dfg.groups)[:2]])
The output in both cases is a normal DataFrame, not a DataFrameGroupBy object, so it might be smarter to first filter your DataFrame and only aggregate afterwards:
df[df['A'].isin([1,2])].groupby('A')
The same for unknown groups:
df[df['A'].isin(list(set(df['A']))[:2])].groupby('A')
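For example, with the sample frame from the question, filtering first and then aggregating works like any other GroupBy operation, restricted to the groups of interest:
df[df['A'].isin([1, 2])].groupby('A')['B'].agg(','.join)
A
1    y,z
2      x
Name: B, dtype: object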
I believe there are some Stack Overflow answers referring to this, like How to access pandas groupby dataframe by key
Related
I have several columns that contain specific diseases. Here is an example of a piece of it:
I want to make all possible combinations so I can check which combinations of diseases occur together most often. So I want to make all combinations of 2 columns (A&B, A&C, A&D, B&C, B&D, C&D), but also combinations of 3 and 4 columns (A&B&C, B&C&D and so on). I have the following script for this:
from itertools import combinations
df.join(pd.concat({'_'.join(x): df[x[0]].str.cat(df[list(x[1:])].astype(str),
sep='')
for i in (2, 3, 4)
for x in combinations(df, i)}, axis=1))
But that generates a lot of extra columns in my dataset, and I still don't have the frequencies of all the combinations. This is the output that I would like to get:
What script can I use for this?
Use DataFrame.stack with an aggregating join, and finally count with Series.value_counts:
s = df.stack().groupby(level=0).agg(','.join).value_counts()
print (s)
artritis,asthma 2
cancer,artritis,heart_failure,asthma 1
cancer,heart_failure 1
dtype: int64
If need 2 columns DataFrame:
df = s.rename_axis('vals').reset_index(name='count')
print (df)
vals count
0 artritis,asthma 2
1 cancer,artritis,heart_failure,asthma 1
2 cancer,heart_failure 1
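For reference, a runnable version of this approach, assuming the question's data (posted only as an image) matches the frame used in the pivot-table answer below:
import pandas as pd

# assumed sample data; the question's actual table was an image
df = pd.DataFrame({'A': ['cancer', 'cancer', None, None],
                   'B': ['artritis', None, 'artritis', 'artritis'],
                   'C': ['heart_failure', 'heart_failure', None, None],
                   'D': ['asthma', None, 'asthma', 'asthma']})
# stack() drops the NaNs; grouping by the original row index (level=0)
# joins each row's remaining diseases into one string before counting
s = df.stack().groupby(level=0).agg(','.join).value_counts()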
You can create a pivot table:
def index_agg_fn(x):
x = [e for e in x if e != '']
return ','.join(x)
df = pd.DataFrame({'A': ['cancer', 'cancer', None, None],
'B': ['artritis', None, 'artritis', 'artritis'],
'C': ['heart_failure', 'heart_failure', None, None],
'D': ['asthma', None, 'asthma', 'asthma']})
df['count'] = 1
ptable = pd.pivot_table(df.fillna(''), index=['A', 'B', 'C', 'D'], values=['count'], aggfunc='sum')
ptable.index = list(map(index_agg_fn, ptable.index))
print(ptable)
Result
count
artritis,asthma 2
cancer,heart_failure 1
cancer,artritis,heart_failure,asthma 1
I have n variables. Suppose n equals 3 in this case. I want to apply one function to all of the combinations (or permutations, depending on how you want to solve this) of the variables and store each result in the corresponding row and column of a dataframe.
import numpy as np
import pandas as pd

a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x: np.nan for x in indexes}, index=indexes)
If I apply sum(the function can be anything), then the result that I want to get is like this:
a b c
a 2 3 4
b 3 4 5
c 4 5 6
I can only think of iterating over all the variables, applying the function one pair at a time, and using the iterator indices to set the value in the dataframe. Is there any better solution?
You can use apply and return a pd.Series for each element. In such cases, pandas uses the series indices as columns in the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x+s)
Just note that the operation you do is between an element and a series.
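With the assumed inputs above, this produces the desired symmetric frame:
print(s.apply(lambda x: x + s))
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6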
If performance is important, I believe you need a broadcast sum over an array created from the variables:
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
arr = np.array([a,b,c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print (df)
a b c
a 2 3 4
b 3 4 5
c 4 5 6
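If your function is some other binary operation with a NumPy ufunc equivalent, the same broadcasting idea generalizes via .outer; a sketch, assuming multiplication as the example function:
import numpy as np
import pandas as pd

indexes = ['a', 'b', 'c']
arr = np.array([1, 2, 3])
# np.multiply.outer applies the ufunc to every (row, column) pair
df = pd.DataFrame(np.multiply.outer(arr, arr), index=indexes, columns=indexes)
print(df)
   a  b  c
a  1  2  3
b  2  4  6
c  3  6  9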
I have a pandas dataframe and I need to modify all values in a given string column. Each column contains string values of the same length. The user provides the index range they want replaced for each value,
for example the slice [1:3] and the replacement value "AAA".
This would replace that slice of each string with the value AAA.
How can I use the applymap(), map() or apply() function to get this done?
SOLUTION: Here is the final solution I went with, based on the answer marked below:
import pandas as pd

df = pd.DataFrame({'A': ['ffgghh', 'ffrtss', 'ffrtds'],
                   #'B': ['ffrtss', 'ssgghh', 'd'],
                   'C': ['qqttss', ' 44', 'f']})
print(df)
old = ['g', 'r', 'z']
new = ['y', 'b', 'c']
vals = dict(zip(old, new))
pos = 2
for old, new in vals.items():
    # .ix is deprecated; use .loc with a boolean mask on the character at pos
    df.loc[df['A'].str[pos] == old, 'A'] = df['A'].str.slice_replace(pos, pos + len(new), new)
print(df)
Use str.slice_replace:
df['B'] = df['B'].str.slice_replace(1, 3, 'AAA')
Sample Input:
A B
0 w abcdefg
1 x bbbbbbb
2 y ccccccc
3 z zzzzzzzz
Sample Output:
A B
0 w aAAAdefg
1 x bAAAbbbb
2 y cAAAcccc
3 z zAAAzzzzz
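Since the question says the user supplies the positions, a thin wrapper can make the parameters explicit (replace_span is a hypothetical helper name, not a pandas API):
def replace_span(col, start, stop, replacement):
    # replace characters in [start, stop) of every string in the column
    return col.str.slice_replace(start, stop, replacement)

df['B'] = replace_span(df['B'], 1, 3, 'AAA')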
IMO the most straightforward solution:
In [7]: df
Out[7]:
col
0 abcdefg
1 bbbbbbb
2 ccccccc
3 zzzzzzzz
In [9]: df.col = df.col.str[:1] + 'AAA' + df.col.str[4:]
In [10]: df
Out[10]:
col
0 aAAAefg
1 bAAAbbb
2 cAAAccc
3 zAAAzzzz
I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (which cannot be loaded fully into RAM) and query the data in fractions of one month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
So far, I have only found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest add and specify a fill value of 0. This has an advantage over simply summing the two series in that it will work when the two DataFrames have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument will handle any NaN values that would otherwise arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b, fill_value=0)
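The result keeps the 'd' key (which appears only in df1). Note that the alignment step upcasts to float, so cast back if you want integer counts:
counts = a.add(b, fill_value=0).astype(int)
# a    4
# b    3
# c    5
# d    1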
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "split-apply-combine". It can be done in one line, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace 3x label with the name of the column whose values you are counting (case-sensitive)
3️⃣ run print(df.head()) to check it worked correctly
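With a modern pandas, the lambda can also be replaced by the built-in 'count' aggregation, which is shorter and faster:
df['total_for_this_label'] = df.groupby('label')['label'].transform('count')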
All I'm trying to do is: add columns data1 and data2 if, in the same row, letters is a; subtract if it is c; multiply if it is b. Here is my code.
import pandas as pd
a=[['Date', 'letters', 'data1', 'data2'], ['1/2/2014', 'a', 6, 1], ['1/2/2014', 'a', 3, 1], ['1/3/2014', 'c', 1, 3],['1/3/2014', 'b', 3, 5]]
df = pd.DataFrame.from_records(a[1:],columns=a[0])
df['result']=df['data1']
for i in range(0, len(df)):
    if df['letters'][i] == 'a':
        df.loc[i, 'result'] = df['data1'][i] + df['data2'][i]
    if df['letters'][i] == 'b':
        df.loc[i, 'result'] = df['data1'][i] * df['data2'][i]
    if df['letters'][i] == 'c':
        df.loc[i, 'result'] = df['data1'][i] - df['data2'][i]
>>> df
Date letters data1 data2 result
0 1/2/2014 a 6 1 7
1 1/2/2014 a 3 1 4
2 1/3/2014 c 1 3 -2
3 1/3/2014 b 3 5 15
My question: is there a way to do it in one line without looping? Something in the spirit of this (pseudocode):
df['result'] = df['result'].map(lambda x: df['data1'][i]+df['data2'][i] if x == 'a' df['data1'][i]-df['data2'][i] elif x == 'c' else x)
You can use df.apply in combination with a lambda function. You have to use the keyword argument axis=1 to ensure you work on rows as opposed to the columns.
import pandas as pd
a=[['Date', 'letters', 'data1', 'data2'], ['1/2/2014', 'a', 6, 1], ['1/2/2014', 'a', 3, 1], ['1/3/2014', 'c', 1, 3]]
df = pd.DataFrame.from_records(a[1:],columns=a[0])
from operator import add, sub, mul
d = dict(a=add, b=mul, c=sub)
df['result'] = df.apply(lambda r: d[r['letters']](r['data1'], r['data2']), axis=1)
This will use the dictionary d to get the function you wish to use (add, sub, or mul).
Original solution below
df['result'] = df.apply(lambda r: r['data1'] + r['data2'] if r['letters'] == 'a'
else r['data1'] - r['data2'] if r['letters'] == 'c'
else r['data1'] * r['data2'], axis=1)
print(df)
Date letters data1 data2 result
0 1/2/2014 a 6 1 7
1 1/2/2014 a 3 1 4
2 1/3/2014 c 1 3 -2
The lambda function is a bit complex now so I'll go into it in a bit more detail...
The lambda function uses a so-called ternary expression to express the conditions in one line; a typical ternary expression is of the form
a if b else c
Unfortunately you can't have an elif in a ternary expression, but what you can do is place another one inside the else branch, so it becomes
a if b else c if d else e
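As an aside, a vectorized alternative that avoids the row-wise apply entirely is numpy.select, which maps each condition to a precomputed column (a sketch, not part of the original answer):
import numpy as np

conditions = [df['letters'] == 'a', df['letters'] == 'b', df['letters'] == 'c']
choices = [df['data1'] + df['data2'],
           df['data1'] * df['data2'],
           df['data1'] - df['data2']]
df['result'] = np.select(conditions, choices)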
You can use the .where method:
where(cond, other=nan, inplace=False, axis=None, level=None, try_cast=False, raise_on_error=True) method of pandas.core.series.Series instance
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
as in:
>>> df['data1'] + df['data2'].where(df['letters'] == 'a', - df['data2'])
0 7
1 4
2 -2
dtype: int64
alternatively, numpy.where can build a ±1 sign vector (note that this covers the add/subtract cases only, not the multiply for 'b'):
>>> df['data1'] + np.where(df['letters'] == 'a', 1, -1) * df['data2']
0 7
1 4
2 -2
dtype: int64