Python pandas: adding two DataFrames on specific columns

I have two DataFrames. The first one looks like this:
Date id name amount period
2011-06-30 1 A 10000 1
2011-06-30 2 B 10000 1
2011-06-30 3 C 10000 1
The other one looks like this:
id amount period
1 10000 1
3 10000 0
The result that I want looks like this:
id amount period
1 20000 2
2 10000 1
3 20000 1
How can I do that in Python pandas?

Use concat on the filtered columns, then group by id and aggregate with sum:
df = pd.concat([df1[['id','amount','period']], df2]).groupby('id', as_index=False).sum()
print (df)
id amount period
0 1 20000 2
1 2 10000 1
2 3 20000 1
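For anyone who wants to run this end to end, here is a minimal self-contained sketch. The two frames are rebuilt by hand from the tables in the question (the variable name result is just illustrative), so treat it as a reproduction under that assumption rather than the asker's real data:

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "Date": ["2011-06-30"] * 3,
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "amount": [10000, 10000, 10000],
    "period": [1, 1, 1],
})
df2 = pd.DataFrame({"id": [1, 3], "amount": [10000, 10000], "period": [1, 0]})

# Stack the shared columns, then sum the rows that share an id.
result = (
    pd.concat([df1[["id", "amount", "period"]], df2], ignore_index=True)
    .groupby("id", as_index=False)
    .sum()
)
print(result)
```

Ids that appear in only one frame (id 2 here) simply keep their own values, because the sum runs over a single row.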
EDIT:
If you need to subtract by id instead, set id as the index in both frames and then use DataFrame.sub:
df11 = df1[['id','amount','period']].set_index('id')
df22 = df2.set_index('id')
df3 = df11.sub(df22, fill_value=0).reset_index()
print (df3)
id amount period
0 1 0.0 0.0
1 2 10000.0 1.0
2 3 0.0 1.0
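Note that the output above is float, since index alignment introduces missing values before fill_value is applied. If integer output is wanted, a cast afterwards should be enough, as in this small sketch (the names left, right and diff are illustrative, and the data is rebuilt from the question). It assumes every id is present in at least one frame for each column, so no NaN is left after the subtraction:

```python
import pandas as pd

# Frames rebuilt from the question, indexed by id so that sub aligns on it.
left = pd.DataFrame(
    {"id": [1, 2, 3], "amount": [10000, 10000, 10000], "period": [1, 1, 1]}
).set_index("id")
right = pd.DataFrame(
    {"id": [1, 3], "amount": [10000, 10000], "period": [1, 0]}
).set_index("id")

# fill_value=0 treats an id missing from one frame as zero;
# the cast restores integer dtype once no NaN remains.
diff = left.sub(right, fill_value=0).astype("int64").reset_index()
print(diff)
```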

Related

Python - Count duplicate user Id's occurence in a given month

If I create a Dataframe from
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26','2022-08-30','2022-09-3','2022-09-8','2022-09-13'],
"id": ['A','B','C','D','E','B','A','F','G','F','H']})
df['date'] = pd.to_datetime(df['date'])
(Table 1 below shows the data.)
I am interested in counting how many times an ID appears in a given month. For example, in a given month A, B and F all occur twice while everything else occurs once. The difficulty with this data is that the dates are not evenly spaced.
I attempted to resample on date by month, with the hope of counting duplicates.
df.resample('M', on='date')['id']
But all the functions that can be used on resample just give me the number of unique occurrences rather than how many times each ID occurred.
A rough example of the output is below [Table 2]
All of the examples I have seen merely count the total or unique occurrences for a given month; this question is about finding out how many occurrences each ID had in a month.
Thank you for your time.
[Table 1] - Data
idx date id
0 2022-08-10 A
1 2022-08-18 B
2 2022-08-18 C
3 2022-08-20 D
4 2022-08-20 E
5 2022-08-24 B
6 2022-08-26 A
7 2022-08-30 F
8 2022-09-03 G
9 2022-09-08 F
10 2022-09-13 H
[Table 2] - Rough example of desired output
id occurences in a month
A 2
B 2
C 1
D 1
E 1
F 2
G 1
H 1
Use Series.dt.to_period for month periods and count values per id by GroupBy.size, then aggregate sum:
df1 = (df.groupby(['id', df['date'].dt.to_period('M')])
.size()
.groupby(level=0)
.sum()
.reset_index(name='occurences in a month'))
print (df1)
id occurences in a month
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 1
Or use Grouper:
df1 = (df.groupby(['id',pd.Grouper(freq='M', key='date')])
.size()
.groupby(level=0)
.sum()
.reset_index(name='occurences in a month'))
print (df1)
EDIT:
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26',
'2022-08-30','2022-09-3','2022-09-8','2022-09-13','2050-12-15'],
"id": ['A','B','C','D','E','B','A','F','G','F','H','H']})
df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
print (df)
Counting first per month (or per day, or per date) and then summing those counts per id gives the same result as counting per id directly:
df1 = df.groupby('id').size().reset_index(name='occurences')
print (df1)
id occurences
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 2
The per-period counts below add up to the same totals per id, whichever period is used:
df1 = (df.groupby(['id', df['date'].dt.to_period('M')])
.size())
print (df1)
id date
A 2022-08 2
B 2022-08 2
C 2022-08 1
D 2022-08 1
E 2022-08 1
F 2022-08 1
2022-09 1
G 2022-09 1
H 2022-09 1
2050-12 1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.to_period('D')])
.size())
print (df1)
id date
A 2022-08-10 1
2022-08-26 1
B 2022-08-18 1
2022-08-24 1
C 2022-08-18 1
D 2022-08-20 1
E 2022-08-20 1
F 2022-08-30 1
2022-09-08 1
G 2022-09-03 1
H 2022-09-13 1
2050-12-15 1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.day])
.size())
print (df1)
id date
A 10 1
26 1
B 18 1
24 1
C 18 1
D 20 1
E 20 1
F 8 1
30 1
G 3 1
H 13 1
15 1
dtype: int64
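If the per-month breakdown itself is what is wanted (one count per id and month rather than a single total per id), one possible sketch is pd.crosstab on the id and a month label. This is an alternative to the groupby above, not part of the original answer; the names month and table are illustrative, and the data is the original 11-row sample from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2022-08-10", "2022-08-18", "2022-08-18", "2022-08-20",
             "2022-08-20", "2022-08-24", "2022-08-26", "2022-08-30",
             "2022-09-03", "2022-09-08", "2022-09-13"],
    "id": ["A", "B", "C", "D", "E", "B", "A", "F", "G", "F", "H"],
})
df["date"] = pd.to_datetime(df["date"])

# Label each row with its year-month, then count rows per (id, month).
month = df["date"].dt.strftime("%Y-%m").rename("month")
table = pd.crosstab(df["id"], month)
print(table)
```

Each row of table is an id and each column a month; combinations that never occur show up as 0.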

Finding the maximum value in a group with differentiation

I have a Pandas DataFrame that looks like this:
index ID value_1 value_2
0 1 200 126
1 1 200 127
2 1 200 128.1
3 1 200 125.7
4 2 300.1 85
5 2 289.4 0
6 2 0 76.9
7 2 199.7 0
My aim is to find all rows in each ID-group (1,2 in this example) which have the max value for value_1 column. The second condition is if there are multiple maximum values per group, the row where the value in column value_2 is maximum should be taken.
So the target table should look like this:
index ID value_1 value_2
0 1 200 128.1
1 2 300.1 85
Use DataFrame.sort_values by all 3 columns and then DataFrame.drop_duplicates:
df1 = (df.sort_values(['ID', 'value_1', 'value_2'], ascending=[True, False, False])
.drop_duplicates('ID'))
print (df1)
ID value_1 value_2
2 1 200.0 128.1
4 2 300.1 85.0
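A self-contained version of the same approach, with the frame rebuilt from the question's table, might look like the sketch below. The name best and the final reset_index are additions for illustration; reset_index is only there to match the 0, 1 index of the target table:

```python
import pandas as pd

# Data rebuilt from the table in the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "value_1": [200, 200, 200, 200, 300.1, 289.4, 0, 199.7],
    "value_2": [126, 127, 128.1, 125.7, 85, 0, 76.9, 0],
})

# Sort so the best row of each ID comes first (highest value_1,
# ties broken by highest value_2), then keep the first row per ID.
best = (
    df.sort_values(["ID", "value_1", "value_2"], ascending=[True, False, False])
    .drop_duplicates("ID")
    .reset_index(drop=True)
)
print(best)
```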

How to create a ranking index based on row values in pandas

I need to create a ranking index for each row, as in the example below, based on the average of the last three months and the Client ID column. The ranking index needs to be unique to each client:
Ranking Index Client ID Month 3 Month 2 Month 1 Avg
1 Client 2 6 5 3 4,66
1 Client 1 4 6 2 4
2 Client 1 5 2 1 2,66
2 Client 2 1 5 2 2,66
3 Client 2 4 2 1 2,33
3 Client 1 1 3 2 2
You need to group by the client ID column and rank the Avg column, passing ascending=False to match your expected output.
With some example data:
df = pd.DataFrame({'clientID':list('baabba'), 'Avg':[4.66,4,2.66,2.66,2.33,2]})
# create the column ranking
df['ranking'] = df.groupby('clientID')['Avg'].rank(ascending=False)
print(df)
clientID Avg ranking
0 b 4.66 1.0
1 a 4.00 1.0
2 a 2.66 2.0
3 b 2.66 2.0
4 b 2.33 3.0
5 a 2.00 3.0
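One caveat: with the default method='average', two rows of the same client with equal Avg would share a fractional rank (for example 2.5). If each rank has to be a distinct integer within a client, a possible variation is method='first' plus an integer cast. The sketch below uses made-up data with a deliberate tie for client a; it is an assumption about what "unique" should mean, not something the question confirms:

```python
import pandas as pd

# Illustrative data: client "a" has two rows with the same Avg.
df = pd.DataFrame({
    "clientID": list("aaabbb"),
    "Avg": [4.0, 2.66, 2.66, 4.66, 2.66, 2.33],
})

# method="first" breaks ties so every row gets its own rank within
# its client; the cast turns 1.0, 2.0, ... into 1, 2, ...
df["ranking"] = (
    df.groupby("clientID")["Avg"]
    .rank(method="first", ascending=False)
    .astype(int)
)
print(df)
```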

How to find sum and count of a column based on a grouping condition on a Pandas dataset?

I have a Pandas dataset with 3 columns. I need to group by the ID column while finding the sum and count of the other two columns. Also, I have to ignore the zeros in columns 'A' and 'B'.
The dataset looks like -
ID A B
1 0 5
2 10 0
2 20 0
3 0 30
What I need -
ID A_Count A_Sum B_Count B_Sum
1 0 0 1 5
2 2 30 0 0
3 0 0 1 30
I have tried this using one column but wasn't able to get both the aggregations in the final dataset.
(df.groupby('ID').agg({'A':'sum', 'A':'count'}).reset_index().rename(columns = {'A':'A_sum', 'A': 'A_count'}))
If you don't pass columns explicitly, the aggregation is applied to the numeric columns automatically.
Since you don't want to count zeros, replace them with NaN first:
df.replace(0, np.nan, inplace=True)
print(df)
ID A B
0 1 NaN 5.0
1 2 10.0 NaN
2 2 20.0 NaN
3 3 NaN 30.0
df = df.groupby('ID').agg(['count', 'sum'])
print(df)
A B
count sum count sum
ID
1 0 0.0 1 5.0
2 2 30.0 0 0.0
3 0 0.0 1 30.0
To remove the MultiIndex in the columns, you can use a list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print(df)
A_count A_sum B_count B_sum
ID
1 0 0.0 1 5.0
2 2 30.0 0 0.0
3 0 0.0 1 30.0
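As a side note, the attempt in the question fails because a Python dict cannot hold the key 'A' twice, so only the last entry survives. A possible alternative that produces the flat column names directly is named aggregation, sketched below. The names masked and out are illustrative, and the zeros are masked only in A and B so that ID values are never touched:

```python
import numpy as np
import pandas as pd

# Data rebuilt from the question.
df = pd.DataFrame({"ID": [1, 2, 2, 3], "A": [0, 10, 20, 0], "B": [5, 0, 0, 30]})

# Turn zeros into NaN so that count skips them; sum ignores NaN anyway.
masked = df.assign(
    A=df["A"].replace(0, np.nan),
    B=df["B"].replace(0, np.nan),
)

# Named aggregation: each keyword is an output column,
# each tuple is (input column, aggregation function).
out = masked.groupby("ID").agg(
    A_Count=("A", "count"),
    A_Sum=("A", "sum"),
    B_Count=("B", "count"),
    B_Sum=("B", "sum"),
).reset_index()
print(out)
```

The sums come back as floats because the masked columns contain NaN; a group with no non-zero values sums to 0.0.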

Dataframe Sum contiguous columns following Duplicates

I have a DataFrame df as follows:
Item Size
0 .decash 1
1 .decash 2
2 usdjpy 1
3 .decash 1
4 usdjpy 1
I would like to transform it into df2 as follows (drop duplicates and sum Size):
Item Size
0 .decash 4
1 usdjpy 2
You can use groupby(..., as_index=False) and sum():
In [270]: df.groupby('Item', as_index=False)['Size'].sum()
Out[270]:
Item Size
0 .decash 4
1 usdjpy 2
