I have a dataframe in Pandas which has data like the below:
charge id  charge  payment  adjustment  type
123        $45     $30      0           P
123        $45     0        $10         A
124        $50     $20      0           P
I then use a pivot table to aggregate the data and produce sums and counts of various fields. However, summing the charge column gives the wrong total, because the same charge can appear in the data multiple times (once per payment or adjustment row). Is it possible to sum the charge only once per charge id? In this example, charge id 123 would be included only once in the total.
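One possible approach, sketched here under the assumption that the dollar amounts have already been converted to numbers and that every row of a given charge id carries the same charge: drop the duplicate charge ids before summing the charge, or take the first charge per id while summing the other columns. The frame below is just the sample data from the question.

```python
import pandas as pd

# Sample data from the question, with the "$" amounts assumed to be numeric already.
df = pd.DataFrame({
    "charge id": [123, 123, 124],
    "charge": [45, 45, 50],
    "payment": [30, 0, 20],
    "adjustment": [0, 10, 0],
    "type": ["P", "A", "P"],
})

# Count each charge once: keep one row per charge id, then sum.
total_charge = df.drop_duplicates("charge id")["charge"].sum()

# Per-id summary: the charge is taken once, payments and adjustments are summed.
summary = df.groupby("charge id").agg(
    charge=("charge", "first"),
    payment=("payment", "sum"),
    adjustment=("adjustment", "sum"),
)

print(total_charge)
print(summary)
```

The per-id summary can then be fed to a pivot table or summed directly, since each charge id now appears on exactly one row.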
Related
I have two dataframes: one with an account number, a purchase ID, a total cost, and a date
and another with account number, money paid, and date:
To make it clear there are two accounts, 11111 and 33333, but there are some typos in the dataframes.
AccountNumber Purchase ID Total Cost Date
11111 100 10 1/1/2020
333333 100 10 1/1/2020
33333 200 20 2/2/2020
11111 300 30 4/2/2020
AccountNumber Date Money Paid:
11111 1/1/2020 5
111111 1/2/2020 2
33333 1/2/2020 1
33333 2/3/2020 15
1111 4/2/2020 30
Each Purchase ID is an identifier for a single purchase, however multiple accounts may be involved within the purchase, such as account 11111 and 33333. Moreover, an account may be used for two different purchases such as account 11111 with Purchase ID 100 and 300. In the second dataframe, payments can be made in increments, so I need to use the date to make sure that the payment is associated with the correct Purchase ID. Moreover, there may be some slight errors in the account numbers so I need to use a fuzzy match. In the end, I want to get a dataframe that is grouped by Purchase ID and compares how much the accounts paid vs. the cost of the item:
Purchase ID Date Total Cost Amount Paid $Owed
100 1/1/2020 10 8 2
200 2/2/2020 20 15 5
300 4/2/2020 30 30 0
As you can see, this is a fairly complicated question. I first tried simply joining the two dataframes on AccountNumber, but I ran into two problems: the slight differences between account numbers, and matching each payment to the correct Purchase ID by date. Because an account can be involved in several purchases, a plain merge can end up summing money paid against the wrong purchase.
I'm thinking about iterating through the rows and using if statements/regex, but I feel like that would take too long.
What's the simplest and most efficient solution to this problem? I'm a beginner at pandas and Python.
The pandas-dedupe library can help you link the two dataframes by using a combination of active learning and clustering. Have a look at the repo.
Here is the sample code (and step by step explanation):
import pandas as pd
import pandas_dedupe
#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
# At this point pandas_dedupe will ask you to label a sample of records according
# to whether they are distinct or the same observation.
# After that, pandas-dedupe uses its knowledge to cluster together similar records.
#send output to csv
df_final.to_csv('linkage_output.csv')
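For the small example in the question, a rough alternative that needs no extra library might look like the sketch below. It rests on two assumptions that are not in the original post: the list of valid account numbers (known_accounts) is known in advance, and each payment belongs to the most recent purchase made by that account on or before the payment date. The typos are repaired with difflib from the standard library, and the date matching uses pd.merge_asof.

```python
import difflib
import pandas as pd

purchases = pd.DataFrame({
    "AccountNumber": ["11111", "333333", "33333", "11111"],
    "Purchase ID": [100, 100, 200, 300],
    "Total Cost": [10, 10, 20, 30],
    "Date": ["1/1/2020", "1/1/2020", "2/2/2020", "4/2/2020"],
})
payments = pd.DataFrame({
    "AccountNumber": ["11111", "111111", "33333", "33333", "1111"],
    "Date": ["1/1/2020", "1/2/2020", "1/2/2020", "2/3/2020", "4/2/2020"],
    "Money Paid": [5, 2, 1, 15, 30],
})

# Assumption: the valid account numbers are known up front.
known_accounts = ["11111", "33333"]

def closest_account(value):
    """Map a possibly mistyped account number to the closest known one."""
    matches = difflib.get_close_matches(value, known_accounts, n=1, cutoff=0.6)
    return matches[0] if matches else value

for frame in (purchases, payments):
    frame["AccountNumber"] = frame["AccountNumber"].map(closest_account)
    frame["Date"] = pd.to_datetime(frame["Date"], format="%m/%d/%Y")

# Attach each payment to the latest purchase by the same account
# dated on or before the payment (both sides must be sorted by Date).
matched = pd.merge_asof(
    payments.sort_values("Date"),
    purchases.sort_values("Date")[["AccountNumber", "Date", "Purchase ID"]],
    on="Date",
    by="AccountNumber",
    direction="backward",
)

paid = (matched.groupby("Purchase ID")["Money Paid"].sum()
               .rename("Amount Paid").reset_index())

# One row per purchase, compared against what was paid.
summary = (purchases.drop_duplicates("Purchase ID")[["Purchase ID", "Date", "Total Cost"]]
           .merge(paid, on="Purchase ID", how="left"))
summary["Owed"] = summary["Total Cost"] - summary["Amount Paid"]
print(summary)
```

On real data the 0.6 cutoff and the "latest purchase on or before the payment" rule would both need checking, since either can silently attach a payment to the wrong account or purchase.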
I have two dataframes - one which is a micro level containing all line items purchased across all transactions (DF1). The other dataframe will be built, with the intention to be a higher level aggregation that summarizes the revenue generated per transaction, essentially summing up all line items for each transaction (DF2).
df1
Out[df1]:
transaction_id item_id amount
0 AJGDO-12304 120 $120
1 AJGDO-12304 40 $10
2 AJGDO-12304 01 $10
3 ODSKF-99130 120 $120
4 ODSKF-99130 44 $30
5 ODSKF-99130 03 $50
df2
Out[df2]
transaction_id location_id customer_id revenue(THIS WILL BE THE ADDED COLUMN!)
0 AJGDO-12304 2131234 1234 $140
1 ODSKF-99130 213124 1345 $200
How would I go about linking the output of a groupby.sum() and assigning it to df2? The revenue column will essentially be the revenue aggregation of df1['transaction_id'] and I want to link it to df2['transaction_id']
Here is what I currently have tried but am struggling with putting together,
results = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2['transaction_id'].merge(results,how='left').value
Use map:
lookup = df1.groupby(['transaction_id'])['amount'].sum()
df2['revenue'] = df2.transaction_id.map(lookup)
print(df2)
Output
transaction_id location_id customer_id revenue
0 AJGDO-12304 2131234 1234 140
1 ODSKF-99130 213124 1345 200
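One caveat: if amount is stored as text such as "$120", as the printed df1 suggests, summing it would concatenate strings rather than add numbers. A minimal sketch of the same map approach with a conversion step first, assuming the only non-numeric character is a leading dollar sign:

```python
import pandas as pd

df1 = pd.DataFrame({
    "transaction_id": ["AJGDO-12304"] * 3 + ["ODSKF-99130"] * 3,
    "item_id": ["120", "40", "01", "120", "44", "03"],
    "amount": ["$120", "$10", "$10", "$120", "$30", "$50"],
})
df2 = pd.DataFrame({
    "transaction_id": ["AJGDO-12304", "ODSKF-99130"],
    "location_id": [2131234, 213124],
    "customer_id": [1234, 1345],
})

# Strip the "$" and convert to numbers before aggregating.
amount_num = pd.to_numeric(df1["amount"].str.replace("$", "", regex=False))

# Total per transaction, indexed by transaction_id.
lookup = amount_num.groupby(df1["transaction_id"]).sum()

# Look up each df2 transaction in the totals; unmatched ids would become NaN.
df2["revenue"] = df2["transaction_id"].map(lookup)
print(df2)
```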
I have a rather "cross platformed" question. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like my dataframe to have the shops as columns, with each customer's total amount spent at each shop as the values.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
Such that the last columns are the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal due to the fact that it is strings and not integers inside the []. Hence, it involves a lot of manipulation and looping in python to get it on the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
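To get the shop_2, shop_3, ... column names asked for in the question, the pivoted columns can be prefixed before merging. A small self-contained sketch on the sample data; the frame names customers and transactions are made up for illustration, and the left merge with fillna is an assumption about wanting to keep customers who have no purchases:

```python
import pandas as pd

customers = pd.DataFrame({
    "id": [1, 2],
    "age": [35, 57],
    "gender": ["MALE", "FEMALE"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "shop": [2, 2, 3, 7, 11],
    "amount": [250, 500, 100, 200, 125],
})

# One row per customer, one column per shop, summed amounts, 0 where no purchase.
baskets = (transactions
           .pivot_table(index="customer_id", columns="shop",
                        values="amount", aggfunc="sum", fill_value=0)
           .add_prefix("shop_")
           .reset_index())

# Left merge keeps customers without transactions; their shop columns become 0.
result = (customers
          .merge(baskets, left_on="id", right_on="customer_id", how="left")
          .drop(columns="customer_id")
          .fillna(0))
print(result)
```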
I currently have 2 datasets
1 = Drugs prescribed per hospital
2 = Crimes committed
I have been able to assign the located hospital ID to the various crimes so therefore I can identify which hospital is closer.
What I would really like to do is attach the amount of drugs prescribed (obtained with the value_counts method) to the hospital ID in the Crime data, so that I can then plot a scatter matrix of where the crimes took place and the total quantity of drugs prescribed by the closest hospital.
I have tried using the following
df = Crimes.merge(hosp[['hosp no', 'Total Quantity']],
left_on='hosp_no', right_on='hosp no').drop('hosp no', 1)
df
However, when I use the above code, the Hosp ID associated with each crime changes, and I don't want it to!
I am new to jupyter notebook so I would be most grateful for any help!!
Thank you in advance
Crimes df
ID Type Hosp No
0 Anti-Social 222
Hosp df
Hosp no Total Quantity Drug name
222 1000 Paracetamol
So basically Hosp 222 has prescribed 1000 Paracetamol drugs how can I assign the number 1000 to the Crime df where Hosp No = 222 to look like this:
Crimes df
ID Type Hosp No Total Quantity
0 Anti-Social 222 1000
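If the columns you are merging on share the same name in both dataframes, you don't need the on parameter. (In your sample the names differ only in case, 'Hosp No' versus 'Hosp no', so rename one of them first.) Since you want the column added to Crimes, use how='left' so that every crime row is kept: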
Crimes = Crimes.merge(Hosp[['Hosp No', 'Total Quantity']], how = 'left')
ID Type Hosp No Total Quantity
0 0 Anti-Social 222 1000
Let me know if this is the desired output or you need anything else
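A self-contained sketch of the same idea on the sample rows. The rename step and the per-hospital sum are assumptions added here: the sum collapses the Hosp table to one row per hospital, so that a hospital with several drug rows cannot duplicate crime rows in the merge.

```python
import pandas as pd

crimes = pd.DataFrame({"ID": [0], "Type": ["Anti-Social"], "Hosp No": [222]})
hosp = pd.DataFrame({
    "Hosp no": [222],
    "Total Quantity": [1000],
    "Drug name": ["Paracetamol"],
})

# Align the key name, then reduce to a single total per hospital.
hosp_totals = (hosp.rename(columns={"Hosp no": "Hosp No"})
                   .groupby("Hosp No", as_index=False)["Total Quantity"].sum())

# Left merge: every crime row is kept and its Hosp No is left unchanged.
crimes_with_qty = crimes.merge(hosp_totals, on="Hosp No", how="left")
print(crimes_with_qty)
```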
I have a pandas dataframe of grocery transactions containing ['customer_id', 'date', 'item_code', and 'amount'].
I want to group multiple transactions from the same day into 1 transaction, with a sum of those individual transactions. For example, if I bought 3 items on 1-1-16, for $5, $10, and $15 each, I want that to be represented as a single row with a value of $30.
That part is a simple groupby
df.groupby(['customer_id', 'date'])['amount'].sum()
My problem is that I want to create a new column called "transaction_type" that assigns a code ('grpd') to a row if that row was grouped, and the corresponding value of item_code if it was not grouped.
So if I purchased 3 items on 1-1-16, but purchased a single new item on 1-2-16, I want my customer_id to show 2 rows in the dataframe. One for 1-1-16 with the custom 'grpd' value in the new transaction_type column, and one for 1-2-16 with the original value from the item_code column reproduced into the transaction_type column. So my dataframe would look like this in the end for my transactions:
customer_id date transaction_type amount
4231 1-1-16 grpd $30
4231 1-2-16 candy $5
Create dummy data:
import pandas as pd
df = pd.DataFrame({'customer_id': ['4231'] * 4,
                   'date': ['1-1-2016', '1-1-2016', '1-1-2016', '1-2-2016'],
                   'items': ['gum', 'candy', 'soda', 'candy'],
                   'amount': [9, 11, 10, 5]})
Input:
amount customer_id date items
0 9 4231 1-1-2016 gum
1 11 4231 1-1-2016 candy
2 10 4231 1-1-2016 soda
3 5 4231 1-2-2016 candy
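Use .agg with a check on the group size (a plain conditional returns a scalar per group, whereas np.where would return a zero-dimensional array):
df_out = (df.groupby(['customer_id', 'date'])
            # 'grpd' when the group has several rows, otherwise the single item
            .agg({'items': lambda x: 'grpd' if x.size > 1 else x.iloc[0],
                  'amount': 'sum'})
            .reset_index()
            .rename(columns={'items': 'transaction_type'}))
Output:
customer_id date transaction_type amount
0 4231 1-1-2016 grpd 30
1 4231 1-2-2016 candy 5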
You can groupby the transaction_type too:
df.groupby(['date', 'customer_id', 'transaction_type'])['amount'].sum()