Summing multiple rows with duplicate values in Python

I have what is for me a wide data frame: 67 columns, of which 30 are float and 37 are object or date. I'm finding duplicate column values for one of the object columns that should be a foreign key for joining to another data set.
I'm looking for a groupby/agg solution that keeps the first instance of all object/date columns while summing all float columns.
There must be a way to optimize the code to avoid df.groupby(['insert 37 variables'], as_index=False)['insert 30 variables'].sum()
The initial data set and the desired resulting data set are shown below.
Here is the basic code I was starting with, but I think there must be a better way. It does not appear I can use a lambda function given the mix of variables, and a pivot would also require listing all the variables. I looked at iloc and loc as well. This must be something others deal with regularly, but I have not been able to find an online solution.
df = df.groupby(['account_number', 'policy_number', 'other variables up to 37'],
                as_index=False)[['internal_expense', 'external_expense', 'other variables up to 30']].sum()

Use select_dtypes:
obj = df.select_dtypes(exclude='number').columns.tolist()   # object/date columns -> group keys
num = df.select_dtypes(include='number').columns.tolist()   # float columns -> summed
out = df.groupby(obj, as_index=False)[num].sum()
print(out)
# Output
  account_number policy_number  revenue  internal_expense  external_expense
0           1234          1234       26                 3                 1
1           1235          1237       10                 0                 0
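One caveat worth noting (my addition, not part of the original answer): groupby drops rows whose key columns contain NaN by default, which can matter with 37 object/date keys. A hedged sketch, assuming pandas >= 1.1:
out = df.groupby(obj, as_index=False, dropna=False)[num].sum()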
Update: if the object and numeric columns are neatly organized (object columns to the left, numeric columns to the right), you can simply slice the dataframe:
left = df.columns[:2].tolist()    # the object/key columns (2 here, 37 in the real data)
right = df.columns[2:].tolist()   # the numeric columns
out = df.groupby(left, as_index=False)[right].sum()
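A related sketch (my own variant, not from the answer): if some object/date columns should not be group keys but should be kept as their first value, you can build the agg spec programmatically. The key column names here are assumptions for illustration.
keys = ['account_number', 'policy_number']  # assumed key columns
agg_spec = {c: 'sum' if pd.api.types.is_numeric_dtype(df[c]) else 'first'
            for c in df.columns if c not in keys}
out = df.groupby(keys, as_index=False).agg(agg_spec)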
Input:
import pandas as pd

data = {'account_number': ['1234', '1234', '1234', '1235'],
        'policy_number': ['1234', '1234', '1234', '1237'],
        'revenue': [20, 5, 1, 10],
        'internal_expense': [0, 1, 2, 0],
        'external_expense': [0, 1, 0, 0]}
df = pd.DataFrame(data)
print(df)
# Output
  account_number policy_number  revenue  internal_expense  external_expense
0           1234          1234       20                 0                 0
1           1234          1234        5                 1                 1
2           1234          1234        1                 2                 0
3           1235          1237       10                 0                 0


Python dataframe: if conditions are true, column value = 1, otherwise 0

I have a dataframe and I want to make a new column that is 0 if any one of about 20 conditions is true, and 1 otherwise. My current code is:
conditions = [(lastYear_sale['Low_Balance_Flag'==0]) & (lastYear_sale.another_Flag==0),
              (lastYear_sale['Low_Balance_Flag'==1]) & (lastYear_sale.another_Flag==1)]
choices = [1, 0]
lastYear_sale['eligible'] = np.select(conditions, choices, default=0)
Here is a simplified version of the dataframe I have:
data = {'ID': ['a', 'b', 'c', 'd'],
        'Low_Balance_Flag': [1, 0, 1, 0],
        'another_Flag': [0, 0, 1, 1]}
dfr = pd.DataFrame(data)
I would like to add a column called eligible that is 0 if Low_Balance_Flag or another_Flag is 1, and 1 if all the flag columns are 0. My attempt fails with KeyError: False, but I can't see what the error is. Thanks for any suggestions! :)
Edit: so the output I'd be looking for in this case would be:
ID  Low_Balance  another_Flag  Eligible
a             1             0         0
b             0             0         1
c             1             1         0
d             0             1         0
So the conditions approach is basically what you need; you just need to write the conditions properly. The KeyError: False comes from lastYear_sale['Low_Balance_Flag'==0]: the expression 'Low_Balance_Flag'==0 evaluates to False, and the dataframe is then indexed with that boolean, so the comparison belongs outside the brackets, as in lastYear_sale['Low_Balance_Flag']==0. I assume the conditions you have are two condition clauses separated by a comma, and that lastYear_sale is supposed to be dfr. Then:
conditions=((lastYear_sale['Low_Balance_Flag']==0)&(lastYear_sale.another_Flag==0))|((lastYear_sale['Low_Balance_Flag']==1)&(lastYear_sale.another_Flag==1))
print((~conditions).astype(int))
0 1
1 0
2 0
3 0
dtype: int64
If your conditions are somewhat dynamic and you need to build them within code, you can use pandas.DataFrame.eval (or query, for filtering) to evaluate a string expression you have built.
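For instance, a minimal sketch of building such a string (the flag column names are just illustrative):
flags = ['Low_Balance_Flag', 'another_Flag']   # could be ~20 flag columns
expr = ' | '.join(f'({f} == 1)' for f in flags)
lastYear_sale['eligible'] = (~lastYear_sale.eval(expr)).astype(int)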
Edit: I still assume dfr is the same as lastYear_sale; note the expected output's column names (Low_Balance, Eligible) differ slightly from those in dfr.
# Use either of these; note the parentheses in the eval string, since ==
# binds more loosely than | under Python precedence rules
conditions = ~((lastYear_sale['Low_Balance_Flag']==1) | (lastYear_sale.another_Flag==1))
conditions = ~lastYear_sale.eval('(Low_Balance_Flag==1) | (another_Flag==1)')
dfr['eligible'] = conditions.astype(int)

Pandas grouping with filtering on other columns

I have the following dataframe in Pandas:
name  value  in  out
A        50   1    0
A       -20   0    1
B       150   1    0
C        10   1    0
D       500   1    0
D      -250   0    1
E       800   1    0
There are at most 2 observations for each name: one for in and one for out.
If a name has only an in record, there is just that single observation for it.
You can create this dataset with this code:
import pandas as pd

data = {
    'name': ['A', 'A', 'B', 'C', 'D', 'D', 'E'],
    'values': [50, -20, 150, 10, 500, -250, 800],
    'in': [1, 0, 1, 1, 1, 0, 1],
    'out': [0, 1, 0, 0, 0, 1, 0]
}
df = pd.DataFrame.from_dict(data)
I want to sum the values column for each name, but only if the name has both an in and an out record; in other words, only when a unique name has exactly 2 rows.
The result should look like this:
name  value
A        30
D       250
If I run the following code, I get all the sums without any filtering based on in and out:
df.groupby('name').sum()
name  value
A        30
B       150
C        10
D       250
E       800
How do I add the aforementioned filtering based on the in and out columns?
Maybe you can try something with groupby, agg, and query (like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also use .query('name == 2') above if you like; but since a name can occur at most twice, .query('name > 1') returns the same result.
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
.loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250
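A third idiom (my addition, not from the answers above) is groupby.filter, which keeps only the rows of groups satisfying a predicate:
out = (df.groupby('name')
         .filter(lambda g: len(g) == 2)   # keep names that have both rows
         .groupby('name', as_index=False)['values'].sum())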

Python: How to do conditional selection after groupby

I have a large dataframe that mostly has unique values, but there are multiple rows with the same ID and different values stored. I want to group the rows with the same ID and then apply a logic to select one row per group, removing the others.
df = pd.DataFrame({'ID': [11, 11, 11, 11, 22, 22, 33],
                   'Source': [2, 2, 4, 3, 3, 2, 3],
                   'Price': [10, 20, 30, 40, 50, 60, 70]})
The logic is: if the group contains a row with Source == 4, keep it and remove the others; else, if the group contains a row with Source == 2, keep it and remove the others; else, if the group contains a row with Source == 3, keep it and remove the others.
So the hierarchy is based on the Source column, and it is 4 > 2 > 3.
Expected output:
expected = pd.DataFrame({'ID': [11, 22, 33],
                         'Source': [4, 2, 3],
                         'Price': [30, 60, 70]})
A possible solution is creating a new hierarchy column (if Source == 4 then hierarchy == 1, and so on), then sorting on it and selecting the first row of each group. However, what I wonder about most is how I can do a conditional select after a groupby.
d = {4: 1, 2: 2, 3: 3}  # dict encoding the keep hierarchy
new = (df.assign(rank=df.Source.map(d))               # create a rank column mapping the hierarchy
         .sort_values(by='rank')                      # sort the dataframe by rank
         .drop_duplicates(subset='ID', keep='first')  # keep the best-ranked row per ID
         .drop(columns='rank')                        # drop the temp sorting column
      )
print(new)
   ID  Source  Price
2  11       4     30
5  22       2     60
6  33       3     70
I feel like you are hunting for even and odd numbers, hence the 4, 2, 3 order. The code below should suffice, avoids an anonymous function, and may offer some speed-up (depending on the data size); it is quite verbose in my opinion, though:
(df.assign(even_odd=np.where(df.Source % 2 == 0, 'even', 'odd'))
   .groupby(['ID', 'even_odd'], as_index=False)
   .max()                                # highest Source within each parity group
   .drop_duplicates('ID', keep='first')  # 'even' sorts before 'odd', so even wins
   .filter([*df.columns])                # restore the original columns
)
   ID  Source  Price
0  11       4     30
2  22       2     60
4  33       3     70
Of course, this would fail if you had 5, 9, 6, 12, and so on, in which case another logic is required. This only works because the numbers are restricted to 4, 2, and 3.
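To answer the literal question of a conditional select after groupby, a sketch of one more idiom (my addition, not from the answers above): map the hierarchy to a rank and take each group's idxmin, which selects one row per ID without sorting the whole frame.
rank = df['Source'].map({4: 1, 2: 2, 3: 3})    # lower rank = higher priority
new = df.loc[rank.groupby(df['ID']).idxmin()]  # best-ranked row per ID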

Adding two values in column A for different rows given the condition that their values in column B are similar in Python

This is my first question on StackOverflow. If there is anything wrong with the formatting, please do let me know as I'm still not familiar with it.
I'm having trouble adding two values from different rows given a certain condition. The table was shown as an image in the original post.
I would like to add the valuation prices when they fall in the same quarter. E.g. quarter = '20173' returns 200,000 + 150,000 = 350,000. If there is only one value for a quarter, I'd like it to display just that value, e.g. quarter = '20192' returns 100,000.
I've tried this out
A = raw_data['QUARTER'].unique()
values = np.array(A)
raw_data.loc[(raw_data['QUARTER'] == values[i] ), 'Valuation price'].sum()
which returns the error below
only integer scalar arrays can be converted to a scalar index
Any help would be appreciated. Thank you.
Maybe this can be an answer to your question:
import pandas as pd

data = {'size': [623, 623, 623, 640], 'value': [13, 10, 16, 4],
        'quarter': [20144, 20152, 20152, 19903], 'name': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
result = df.groupby('quarter')['value'].sum()  # select the column before summing
where df:
size value quarter name
0 623 13 20144 a
1 623 10 20152 b
2 623 16 20152 c
3 640 4 19903 d
and result:
quarter
19903 4
20144 13
20152 26
Name: value, dtype: int64
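Mapped back to the column names in the question (a sketch, assuming raw_data is structured as described there):
per_quarter = raw_data.groupby('QUARTER')['Valuation price'].sum()
print(per_quarter.get('20173'))  # 350000 per the question's example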

Python pandas: Need to know how many people have met two criteria

With this data set I want to know the people (id) who have made payments of both types a and b, and to create a subset of the data with those people. (This is just an example set of data; the one I'm using is much larger.)
I've tried grouping by the id and then making a subset of the data where type.len >= 2, then creating another subset based on the conditions df.loc[(df.type == 'a') & (df.type == 'b')]. I thought that if I grouped by the id first and then ran that df.loc code it would work, but it doesn't.
Any help is much appreciated.
Thanks.
A single row can never satisfy df.type == 'a' and df.type == 'b' at the same time, which is why the df.loc attempt returns nothing. Instead, separate the dataframe into two, one with type a payments and the other with type b payments, then merge them:
df_typea = df[df['type'] == 'a']
df_typeb = df[df['type'] == 'b']
df_merge = pd.merge(df_typea, df_typeb, how='outer', on='id', suffixes=('_a', '_b'))
This will create a separate column for each payment type.
Now, you can find the ids for which both payments have been made,
df_payments = df_merge[(df_merge['type_a'] == 'a') & (df_merge['type_b'] == 'b')]
Note that this will create multiple records for ids like id 9, which has more than two payments. I am assuming you simply want to check whether any payments of types 'a' and 'b' have been made for each id. In that case, you can simply drop any duplicates,
df_payments_no_duplicates = df_payments['id'].drop_duplicates()
You first split your DataFrame into two DataFrames:
one with type a payments only
one with type b payments only
You then join both DataFrames on id.
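A minimal sketch of this split-and-join idea (column names assumed from the other answers):
a_ids = df.loc[df['type'] == 'a', ['id']].drop_duplicates()
b_ids = df.loc[df['type'] == 'b', ['id']].drop_duplicates()
both = a_ids.merge(b_ids, on='id')       # ids that appear in both frames
subset = df[df['id'].isin(both['id'])]   # rows for those people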
You can use groupby to solve this problem. First group by id and type; then you can group again to see whether the id had both types.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 4, 4, 5, 5],
                   'payment': [10, 15, 5, 20, 35, 30, 10, 20],
                   'type': ['a', 'b', 'a', 'a', 'a', 'a', 'b', 'a']})
df_group = df.groupby(['id', 'type']).nunique()
#print(df_group)
'''
payment
id type
1 a 1
b 1
2 a 1
3 a 1
4 a 2
5 a 1
b 1
'''
# if the value in this series is 2, the id has both a and b
data = df_group.groupby('id').size()
#print(data)
'''
id
1 2
2 1
3 1
4 1
5 2
dtype: int64
'''
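Completing the idea (a small addition of mine): ids whose group size is 2 have both types and can be used to slice the original frame.
both_ids = data[data == 2].index        # [1, 5] for this sample df
subset = df[df['id'].isin(both_ids)]    # the requested subset of rows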
You can use groupby and nunique to get the count of unique payment types done.
print (df.groupby('id')['type'].agg(['nunique']))
This will give you:
    nunique
id
1         2
2         1
3         1
4         1
5         1
6         2
7         1
8         1
9         2
If you want to list out only the rows that had both a and b types:
df['count'] = df.groupby('id')['type'].transform('nunique')
print (df[df['count'] > 1])
By using groupby.transform, each row is populated with its group's unique count. Then you can use count > 1 to keep only the rows whose id has both a and b.
This will give you:
id payment type count
0 1 10 a 2
1 1 15 b 2
7 6 10 b 2
8 6 15 a 2
11 9 35 a 2
12 9 30 a 2
13 9 10 b 2
You may also use the length of the set of 'type' values returned for a given id:
len(set(df[df['id']==1]['type'])) # returns 2
len(set(df[df['id']==2]['type'])) # returns 1
Thus, the following would give you an answer to your question
paid_both = []
for i in set(df['id']):
    if len(set(df[df['id']==i]['type'])) == 2:
        paid_both.append(i)
## paid_both = [1, 6, 9]  # the ids who paid both
Iterating through the unique id values in this way returns the result for all ids; whenever 2 is returned, that person has made payments of both types a and b.
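To get the requested subset of rows for those people (my small addition):
subset = df[df['id'].isin(paid_both)]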
