Python pandas: Need to know how many people have met two criteria

With this data set I want to know which people (id) have made payments of both type a and type b, and to create a subset of the data containing only those people. (This is just an example data set; the one I'm actually using is much larger.)
I've tried grouping by the id and then making a subset of the data where type.len >= 2, then creating another subset based on the condition df.loc[(df.type == 'a') & (df.type == 'b')]. I thought that if I grouped by the id first and then ran that df.loc code it would work, but it doesn't.
Any help is much appreciated.
Thanks.

Separate the dataframe into two, one with type a payments and the other with type b payments, then merge them:
df_typea = df[(df['type'] == 'a')]
df_typeb = df[(df['type'] == 'b')]
df_merge = pd.merge(df_typea, df_typeb, how='outer', on='id', suffixes=('_a', '_b'))
This will create a separate column for each payment type.
Now, you can find the ids for which both payments have been made:
df_payments = df_merge[(df_merge['type_a'] == 'a') & (df_merge['type_b'] == 'b')]
Note that this will create two records for ids like id 9, for which there are more than two payments. I am assuming that you simply want to check whether any payments of type 'a' and 'b' have been made for each id. In that case, you can simply drop any duplicates:
df_payments_no_duplicates = df_payments['id'].drop_duplicates()

You first split your DataFrame into two DataFrames:
one with type a payments only
one with type b payments only
You then join both DataFrames on id.
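A minimal sketch of that approach, assuming the same column names id and type as in the question:
type_a = df[df['type'] == 'a'][['id']].drop_duplicates()
type_b = df[df['type'] == 'b'][['id']].drop_duplicates()
# an inner join keeps only the ids present in both subsets
both_ids = type_a.merge(type_b, on='id')['id']
df_both = df[df['id'].isin(both_ids)]
The inner join is what enforces the "both types" requirement; an outer join would also keep ids that made only one type of payment.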

You can use groupby to solve this problem. First group by id and type, then group again by id to see whether the id had both types.
import pandas as pd
df = pd.DataFrame({"id" : [1, 1, 2, 3, 4, 4, 5, 5], 'payment' : [10, 15, 5, 20, 35, 30, 10, 20], 'type' : ['a', 'b', 'a','a','a','a','b', 'a']})
df_group = df.groupby(['id', 'type']).nunique()
#print(df_group)
'''
         payment
id type
1  a           1
   b           1
2  a           1
3  a           1
4  a           2
5  a           1
   b           1
'''
# if the value in this series is 2, the id has both a and b
data = df_group.groupby('id').size()
#print(data)
'''
id
1 2
2 1
3 1
4 1
5 2
dtype: int64
'''
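If you want the actual ids rather than the counts, you can filter this series for a value of 2, for example:
# ids that have made both type 'a' and type 'b' payments
ids_with_both = data[data == 2].index.tolist()
#print(ids_with_both)  # [1, 5] for the example frame above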

You can use groupby and nunique to get the count of unique payment types made per id.
print (df.groupby('id')['type'].agg(['nunique']))
This will give you:
    nunique
id
1         2
2         1
3         1
4         1
5         1
6         2
7         1
8         1
9         2
If you want to list out only the rows for the ids that had both a and b types:
df['count'] = df.groupby('id')['type'].transform('nunique')
print (df[df['count'] > 1])
By using groupby.transform, each row will be populated with the unique count value. Then you can use count > 1 to filter out the rows that have both a and b.
This will give you:
id payment type count
0 1 10 a 2
1 1 15 b 2
7 6 10 b 2
8 6 15 a 2
11 9 35 a 2
12 9 30 a 2
13 9 10 b 2

You may also use the length of the set of 'type' values returned for a given id:
len(set(df[df['id']==1]['type'])) # returns 2
len(set(df[df['id']==2]['type'])) # returns 1
Thus, the following would give you an answer to your question
paid_both = []
for i in set(df['id']):
    if len(set(df[df['id']==i]['type'])) == 2:
        paid_both.append(i)
## paid_both = [1, 6, 9]  # the ids who paid both
The loop iterates through the unique id values; if the set length for an id is 2, that person has made payments of both type (a) and type (b).

Sort pandas grouped items with the highest count overall

Say I have the following dataframe:
d = {'col1': ["8","8","8","8","8","2","2","2","2","3","3"],
     'col2': ['a','b','b','b','b','a','b','a','a','a','b'],
     'col3': ['m','n','z','b','a','ac','b1','ad','a1','a','b1'],
     'col4': ['m','n','z','b1','a','ac1','b31','a1d','3a1','a3','b1']}
test = pd.DataFrame(data=d)
In order to sort each group by count, I could do the following:
test.groupby(["col1","col2"])['col4'].count().reset_index(name="count").sort_values(["col1","count"], ascending=[True,False])
It returns this table:
However, I want the group with 8 in col1 to be the first item because this particular group has the highest count (i.e., 4).
How do I achieve this?
Edit: This is the expected output:
col1 col2 count
8 b 4
8 a 1
2 a 3
2 b 1
3 a 1
3 b 1
The expected output is unclear, but I assume you want to sort the rows within each group by decreasing order of count, and also to order the groups relative to each other by decreasing order of their max (or total) count.
(test.groupby(["col1","col2"])['col4'].count()
     .reset_index(name="count")
     # using the max count per group; for the total use transform('sum')
     .assign(maxcount=lambda d: d.groupby('col1')['count'].transform('max'))
     .sort_values(['maxcount', 'count'], ascending=False)
     .drop(columns='maxcount')
)
Output:
col1 col2 count
5 8 b 4
4 8 a 1
0 2 a 3
1 2 b 1
2 3 a 1
3 3 b 1
In that case you need to fix your sorting.
Your description is a bit unclear, so here is a general guideline for solving your problem.
sort_values sorts from left to right: the first key defines the order of the groups, and the following keys only define the order where the first key is equal.
Therefore, choose the order of the columns you would like to sort by and set the ascending parameter accordingly.
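A small sketch of that left-to-right behaviour on a hypothetical frame (demo and its columns are made up purely for illustration):
import pandas as pd

demo = pd.DataFrame({'grp': ['x', 'x', 'y', 'y'], 'count': [1, 4, 3, 2]})
# 'grp' decides the overall order; 'count' only breaks ties within equal 'grp' values
print(demo.sort_values(['grp', 'count'], ascending=[True, False]))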

Python pandas: Is there a way to flag all records of a category which has one particular sub-category?

I am currently working on a dataset which has information on total sales for each product id and product sub-category. For example, let us consider three products 1, 2 and 3 and three product sub-categories - A, B, C - one, two or all of which may make up a product. For instance, I have included a sample table below:
Now, I would like to add a flag column 'Flag' which assigns 1 or 0 to each product id depending on whether that product id contains a record of product sub-category 'C'. If it does contain 'C', assign 1 to the flag column; otherwise, assign 0. Below is the desired output.
I am currently not able to do this in pandas. Could you help me out? Thank you so much!
Use str.contains and apply: apply runs the lambda function on every value of the ID column.
import pandas as pd
from io import StringIO

txt="""ID,Sub-category,Sales
1,A,100
1,B,101
1,C,102
2,B,100
2,C,101
3,A,102
3,B,100"""
df = pd.read_table(StringIO(txt), sep=',')
#print(df)
list_id=list(df[df['Sub-category'].str.contains('C')]['ID'])
df['flag'] = df['ID'].apply(lambda x: 1 if x in list_id else 0)
print(df)
output:
ID Sub-category Sales flag
0 1 A 100 1
1 1 B 101 1
2 1 C 102 1
3 2 B 100 1
4 2 C 101 1
5 3 A 102 0
6 3 B 100 0
Try this:
Flag = []
for i in dataFrame["Product sub-category"]:
    if i == "C":
        Flag.append(1)
    else:
        Flag.append(0)
So you have a list called "Flag" and can add it to your dataframe.
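For instance, assuming dataFrame is the original frame and the column is literally named "Product sub-category":
# attach the flag list as a new column (len(Flag) matches len(dataFrame))
dataFrame["Flag"] = Flag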
You can add a temporary column, isC, to check for your condition, then count the number of isC values inside every "Product Id" group (with .groupby(...).transform).
check = (
    df.assign(isC=lambda df: df["Product Sub-category"] == "C")
      .groupby("Product Id").isC.transform("sum")
)
df["Flag"] = (check > 0).astype(int)

Problem in Appending dictionary to Dataframe

I have this dataframe (User is the index; also, this is part of a bigger dataframe, so a Series cannot be used):
User Score
A 5
B 10
A 4
C 8
I want to add Score of duplicate Users.
So first I calculate the sum of Score for the duplicate User:
sum = df.loc['A'].sum()
Then I drop the duplicate rows:
df.drop('A',inplace=True)
Then I append the new values as a dictionary to the dataframe:
dic = {'User':'A','Score':10}
df = df.append(dic,ignore_index=True)
But I get this dataframe:
Score User
0 10 NaN
1 8 NaN
2 10 A
The default autoincrement values have replaced User as the index, and the original User values are now NaN.
The expected dataframe would be:
User Score
B 10
C 8
A 9
What you are attempting will not work because your original dataframe has an Index called 'User', not a column. You're trying to append a dict that defines a column called 'User' to a DataFrame that has no such column.
Compare the result of your
df = df.append(dic,ignore_index=True)
with
df = df.reset_index().append(dic,ignore_index=True)
However, this begs the question of why you're doing it this way. What you want to do can be simply achieved with a groupby, a la
import pandas as pd
data = {'User': ['A', 'B', 'A', 'C'], 'Score': [5, 10, 4, 8]}
data
df = pd.DataFrame(data)
df
Out[6]:
User Score
0 A 5
1 B 10
2 A 4
3 C 8
df.set_index('User')
Out[7]:
Score
User
A 5
B 10
A 4
C 8
df = df.set_index('User')
df
Out[10]:
Score
User
A 5
B 10
A 4
C 8
df.groupby('User').sum()
Out[30]:
Score
User
A 9
B 10
C 8
You can try this. Let's say you have the dataframe shown above as df; then you can use the code below to get the result:
df.groupby('User')['Score'].sum().reset_index().set_index('User')

Addition with nested columns in python

I have a pandas groupby object that I made from a larger dataframe, in which amounts are grouped under a person ID variable as well as whether the transaction was ingoing or outgoing. Here's an example:
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8
(sorry I don't know how to put actual sample data in). Note that some folks can have one or the other (e.g., maybe they have some going out but nothing coming in).
All I want to do is get the difference in the amounts, collapsed under the person. So the ideal output would be, perhaps, a dictionary or other dataframe containing the difference in amounts for each person, like this:
ID Difference
1 -3
2 2
3 -6
4 -8
I have tried a handful of different ways to do this but am not sure how to work with these nested lists in python.
Thanks!
We could select the rows that are Out, convert their amounts to negative, and then use sum().
import pandas as pd
from io import StringIO
s = '''\
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8'''
# Recreate dataframe
df = pd.read_csv(StringIO(s), sep=r'\s+')
# Select rows where In_Out == 'Out' and multiply their Amount by -1
df.loc[df['In_Out'] == 'Out', 'Amount'] *= -1
# Convert to dict
d = df.groupby('ID')['Amount'].sum().to_dict()
print(d)
Returns:
{1: -3, 2: 2, 3: -6, 4: -8}
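If you would rather have a DataFrame than a dict, a minimal variant of the same idea (df here already has the Out amounts negated by the code above):
diff = (df.groupby('ID', as_index=False)['Amount'].sum()
          .rename(columns={'Amount': 'Difference'}))
print(diff)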

Pandas: conditional group-specific computations

Let's say I have a table with a key (e.g. customer ID) and two numeric columns C1 and C2. I would like to group rows by the key (customer) and run some aggregators like sum and mean on its columns. After computing group aggregators I would like to assign the results back to each customer row in a DataFrame (as some customer-wide features added to each row).
I can see that I can do something like
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and be able to add the result back to the DataFrame.
Can I make it conditional - can I add up the C1 column in a group only for rows whose C2 column is equal to some number X, and still be able to add the results back to the DataFrame?
How can I run an aggregator on a combination of columns, like:
np.sum(C1 + C2)?
What would be the simplest and most elegant way to implement this? What is the most efficient way to do it? Can those aggregations be done in one pass?
Thank you in advance.
Here's some setup with dummy data.
In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                            'C1': [1,2,3,4,5,6],
                            'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)
In [83]: df
Out[83]:
C1 C2 Key F1
0 1 7 a 3
1 2 8 a 3
2 3 9 b 7
3 4 10 b 7
4 5 11 c 11
5 6 12 c 11
If you want to do a conditional GroupBy, you can just filter the dataframe as it's passed to .groupby. For example, if you wanted the group sum of 'C1' where C2 is less than 8 or greater than 9:
In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)
In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)
In [89]: df
Out[89]:
C1 C2 Key F1 F2
0 1 7 a 3 1
1 2 8 a 3 NaN
2 3 9 b 7 NaN
3 4 10 b 7 4
4 5 11 c 11 11
5 6 12 c 11 11
This works because the transform operation preserves the index, so it will still align with the original dataframe correctly.
If you want to sum the group totals for two columns, it's probably easiest to do something like this (someone may have something more clever):
In [93]: gb = df.groupby('Key')
In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)
Edit:
Here's one other way to get group totals for multiple columns. The syntax isn't really any cleaner, but it may be more convenient for a large number of columns.
df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))
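A sketch of an equivalent computation using transform on both columns at once (same df and grouping key as above):
# per-group totals of each column, aligned to the original rows, then added together
df['C1_C2'] = df.groupby('Key')[['C1', 'C2']].transform('sum').sum(axis=1)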
I found another approach that uses apply() instead of transform(), but you need to join the result table with the input DataFrame and I just haven't figured out yet how to do it. Would appreciate help to finish the table joining part or any better alternatives.
df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})
# Group g will be given as a DataFrame
def group_feature_extractor(g):
    feature_1 = (g['C1'] + g['C2']).sum()
    even_C1_filter = g['C1'] % 2 == 0
    feature_2 = g[even_C1_filter]['C2'].sum()
    return pd.Series([feature_1, feature_2], index=['F1', 'F2'])
# Group once
group = df.groupby(['Key'])
# Extract features from each group
group_features = group.apply(group_feature_extractor)
#
# Join with the input data frame ...
#
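One way to finish that joining step (a sketch, assuming group_features is indexed by Key, as produced by the apply above):
# broadcast the per-group features back onto every row of the original frame
df_with_features = df.join(group_features, on='Key')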
