I have a pandas dataframe of grocery transactions with the columns ['customer_id', 'date', 'item_code', 'amount'].
I want to group multiple transactions from the same day into one row whose amount is the sum of the individual transactions. For example, if I bought 3 items on 1-1-16 for $5, $10, and $15, I want that represented as a single row with a value of $30.
That part is a simple groupby:
df.groupby(['customer_id', 'date'])['amount'].sum()
My problem is that I want to create a new column called "transaction_type" that assigns a code ('grpd') to a row if that row was grouped, and the corresponding value of item_code if it was not grouped.
So if I purchased 3 items on 1-1-16 but a single new item on 1-2-16, I want my customer_id to show 2 rows in the dataframe: one for 1-1-16 with the custom 'grpd' value in the new transaction_type column, and one for 1-2-16 with the original value from the item_code column copied into the transaction_type column. So my dataframe would look like this in the end for my transactions:
customer_id date transaction_type amount
4231 1-1-16 grpd $30
4231 1-2-16 candy $5
Create dummy data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'customer_id': ['4231']*4,
                   'date': ['1-1-2016', '1-1-2016', '1-1-2016', '1-2-2016'],
                   'items': ['gum', 'candy', 'soda', 'candy'],
                   'amount': [9, 11, 10, 5]})
Input:
amount customer_id date items
0 9 4231 1-1-2016 gum
1 11 4231 1-1-2016 candy
2 10 4231 1-1-2016 soda
3 5 4231 1-2-2016 candy
Use .agg, np.where, and size:
df_out = (df.groupby(['customer_id', 'date'])
            .agg({'items': lambda x: np.where(x.size > 1, 'grpd', x.min()),
                  'amount': 'sum'})
            .reset_index()
            .rename(columns={'items': 'transaction_type'}))
Output:
customer_id date amount transaction_type
0 4231 1-1-2016 30 grpd
1 4231 1-2-2016 5 candy
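An equivalent variant (a sketch using the same dummy data) replaces np.where with a plain Python conditional inside the lambda, which sidesteps np.where returning a 0-d NumPy array:
df_out = (df.groupby(['customer_id', 'date'])
            .agg({'items': lambda x: 'grpd' if x.size > 1 else x.iloc[0],
                  'amount': 'sum'})
            .reset_index()
            .rename(columns={'items': 'transaction_type'}))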
You can also group by the new transaction_type column on the result:
df_out.groupby(['date', 'customer_id', 'transaction_type'])['amount'].sum()
Related
I have two dataframes. One is at the micro level, containing all line items purchased across all transactions (DF1). The other (DF2) is to be built as a higher-level aggregation that summarizes the revenue generated per transaction, essentially summing up all line items for each transaction.
df1
Out[df1]:
transaction_id item_id amount
0 AJGDO-12304 120 $120
1 AJGDO-12304 40 $10
2 AJGDO-12304 01 $10
3 ODSKF-99130 120 $120
4 ODSKF-99130 44 $30
5 ODSKF-99130 03 $50
df2
Out[df2]
transaction_id location_id customer_id revenue (this will be the added column)
0 AJGDO-12304 2131234 1234 $140
1 ODSKF-99130 213124 1345 $200
How would I go about linking the output of a groupby.sum() to df2? The revenue column should be the per-transaction sum of df1['amount'], keyed by transaction_id, and I want to attach it to df2 via df2['transaction_id'].
Here is what I have tried so far, but I am struggling to put it together:
results = df1.groupby('transaction_id')['amount'].sum()
df2['revenue'] = df2['transaction_id'].merge(results,how='left').value
Use map:
lookup = df1.groupby(['transaction_id'])['amount'].sum()
df2['revenue'] = df2.transaction_id.map(lookup)
print(df2)
Output
transaction_id location_id customer_id revenue
0 AJGDO-12304 2131234 1234 140
1 ODSKF-99130 213124 1345 200
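As an aside, the merge you attempted can also be made to work with a small fix; a sketch, assuming amount is numeric as in the map-based answer above:
results = df1.groupby('transaction_id')['amount'].sum().reset_index(name='revenue')
df2 = df2.merge(results, on='transaction_id', how='left')
map is usually the simpler choice here, since the lookup is keyed by a single column.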
I have a dataset and want to multiply a value column by 2.5 for rows whose ID appears in a list.
My data frame looks like this
Name ID Salary
James 21 25,000
Sam 12 15,000
My list is a Series; let's call it s = ["21", "36"]. These values are ID numbers (as strings).
How do I multiply the salary by 2.5 for the rows whose ID is in that list?
The goal is to have something like this
Name ID Salary
James 21 62,500
Sam 12 15,000
First convert Salary to numeric, then convert the ID values to strings, test membership with Series.isin, and multiply via DataFrame.loc, selecting rows by the mask and the column by the name Salary:
s = ["21", "36"]
# if values of Salary are strings:
# df = pd.read_csv(file, thousands=',')
# or
# df['Salary'] = df['Salary'].str.replace(',', '').astype(int)

# ID is converted to strings by `astype`, because the values in list s are strings
df.loc[df['ID'].astype(str).isin(s), 'Salary'] *= 2.5

# if s were numeric:
# df.loc[df['ID'].isin(s), 'Salary'] *= 2.5
print(df)
Name ID Salary
0 James 21 62500.0
1 Sam 12 15000.0
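For completeness, a minimal reproducible sketch that builds the dataframe from the table above and applies the same steps (the Salary strings include the thousands separators shown in the question):
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Sam'],
                   'ID': [21, 12],
                   'Salary': ['25,000', '15,000']})
s = ["21", "36"]

df['Salary'] = df['Salary'].str.replace(',', '').astype(int)
df.loc[df['ID'].astype(str).isin(s), 'Salary'] *= 2.5
print(df)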
I have a question about pandas dataframes. There are two tables: the first is a mapping table and the second holds transactional data.
In the mapping table, two columns define a From-To range.
Below are the two dataframes:
1). The df1 is the mapping table with a range of account numbers to map to a specific tax type.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Category':['FBT Tax','CIT','GST','Stamp Duty','Sales Tax'],
'GL From':['10000000','20000000','30000000','40000000','50000000'],
'GL To':['10009999','20009999','30009999','40009999','50009999']})
Category GL From GL To
0 FBT Tax 10000000 10009999
1 CIT 20000000 20009999
2 GST 30000000 30009999
3 Stamp Duty 40000000 40009999
4 Sales Tax 50000000 50009999
2). The df2 is the transactional table (there are more columns that I skipped for this demo), with the account number that I want to look up against the ranges in df1.
df2 = pd.DataFrame({'Date':['1/10/19','2/10/19','3/10/19','10/11/19','12/12/19','30/08/19','01/07/19'],
'GL Account':['20000456','30000199','20004689','40008900','50000876','10000325','70000199'],
'Product LOB':['Computer','Mobile Phone','TV','Fridge','Dishwasher','Tablet','Table']})
Date GL Account Product LOB
0 1/10/19 20000456 Computer
1 2/10/19 30000199 Mobile Phone
2 3/10/19 20004689 TV
3 10/11/19 40008900 Fridge
4 12/12/19 50000876 Dishwasher
5 30/08/19 10000325 Tablet
6 01/07/19 70000199 Table
In both df1 and df2 the account numbers are strings, so I created a simple function to convert them to integers.
def to_integer(col):
    return pd.to_numeric(col, downcast='integer')
I have tried both np.dot and .loc to map the Category column, but I encountered this error:
ValueError: Can only compare identically-labeled Series objects
result = np.dot((to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
(to_integer(df2['GL Account']) <= to_integer(df1['GL To'])),df1['Category'])
result = df1.loc[(to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
(to_integer(df2['GL Account']) <= to_integer(df1['GL To'])),"Category"]
What I want to achieve is like below:
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 2/10/19 30000199 Mobile Phone GST
2 3/10/19 20004689 TV CIT
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
6 01/07/19 70000199 Table NaN
Is there any way to map between two dataframes based on a From-To range?
Pandas >= 0.25.0
We can do a cartesian merge by first assigning an artificial column called key to each frame and joining on it. Then we can use query to filter everything that falls between the correct ranges. Notice that we use backticks (`) to refer to columns with spaces in their names; this requires pandas >= 0.25.0:
df2.assign(key=1).merge(df1.assign(key=1), on='key')\
.drop(columns='key')\
.query('`GL Account`.between(`GL From`, `GL To`)')\
.drop(columns=['GL From', 'GL To'])\
.reset_index(drop=True)
If you use a left join instead, replace the .query part with the following to keep the rows which didn't match in the join:
.query('`GL Account`.between(`GL From`, `GL To`) | `GL From`.isna()')
Or
Pandas < 0.25.0
Simple boolean indexing
mrg = df2.assign(key=1).merge(df1.assign(key=1), on='key')\
.drop(columns='key')
mrg[mrg['GL Account'].between(mrg['GL From'], mrg['GL To'])]\
.drop(columns=['GL From', 'GL To'])\
.reset_index(drop=True)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 2/10/19 30000199 Mobile Phone GST
2 3/10/19 20004689 TV CIT
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
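Note that GL Account, GL From and GL To are all strings in the sample data; the comparisons above happen to work because every value has the same number of digits, but converting the columns to integers first (a small sketch on the dataframes from the question) avoids relying on lexicographic string comparison:
import pandas as pd

for col in ['GL From', 'GL To']:
    df1[col] = pd.to_numeric(df1[col], downcast='integer')
df2['GL Account'] = pd.to_numeric(df2['GL Account'], downcast='integer')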
In case your data follows the pattern provided, you can create a column that has the lower bound value of each account and then merge on it:
df1['GL From'] = df1['GL From'].astype(int) #make it integer
### create lower bound
df2['lbound'] = df2['GL Account'].astype(int)//10000000*10000000
### merge
df2.merge(df1, left_on='lbound', right_on='GL From')\
.drop(['lbound','GL From','GL To'], axis=1)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 3/10/19 20004689 TV CIT
2 2/10/19 30000199 Mobile Phone GST
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
Added
In case the data does not follow a specific pattern, you can use np.intersect1d with np.where to find the intersection of the lower-bound and upper-bound conditions, and therefore the index of the matched range.
For instance:
### func to get the index where the account is greater than or equal to `GL From` and less than or equal to `GL To`
@np.vectorize
def match_ix(acc_no):
    return np.intersect1d(np.where(acc_no >= df1['GL From'].values),
                          np.where(acc_no <= df1['GL To'].values))
## Apply to dataframe
df2['right_ix'] = match_ix(df2['GL Account'])
## Merge using the index. Use 'how=left' for the left join to preserve unmatched
df2.merge(df1, left_on='right_ix', right_on=df1.index, how='left')\
.drop(['right_ix','GL From','GL To'], axis=1)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 3/10/19 20004689 TV CIT
2 2/10/19 30000199 Mobile Phone GST
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
In terms of performance, this is quicker and avoids the MemoryError you might hit with a full cartesian join:
### Using 100* the sample provided
tempdf2 = pd.concat([df2]*100)
tempdf1 = pd.concat([df1]*100)
#23 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have an Excel file with product names. The first row contains the categories (A1: Water, B1: Soft Drinks, etc.), and each cell below a category is a product in that category (A2: Sparkling, A3: Still, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma separated, etc.) because that makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also save the Excel file in CSV format, and I can move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the Excel file it should not be replaced (e.g. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
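The answer below assumes the category sheet has already been read into a dataframe df1 with the categories as column headers. A minimal sketch of what that table might look like (the Snacks column is inferred from the expected output, and the file name is hypothetical):
import pandas as pd

# In practice: df1 = pd.read_excel('categories.xlsx')
df1 = pd.DataFrame({'Water': ['Sparkling', 'Still', None],
                    'Soft Drinks': ['Coca Cola', 'Orange Juice', 'Lemonade'],
                    'Snacks': ['Chips', None, None]})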
Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
I have a rather "cross-platform" question. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like a dataframe with the shops as columns and, for each customer, the summed amount spent at each shop.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
The last columns are thus the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal because the array contains strings rather than integers, so it takes a lot of manipulation and looping in Python to get it into the format I want.
Is there any way I can do the aggregation in SQL so that it is easier for Python to read and pivot into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(index='customer_id', columns='shop', values='amount',
                      aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
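If you also want the shop_ prefixed column names from the desired output, a small variation (a sketch; df2 here refers to the original transaction table, before the pivot above) renames the pivoted columns before the merge:
import pandas as pd

pivoted = (df2.pivot_table(index='customer_id', columns='shop',
                           values='amount', aggfunc='sum', fill_value=0)
              .add_prefix('shop_')   # 2 -> shop_2, 3 -> shop_3, ...
              .reset_index())
df = pd.merge(df1, pivoted, left_on='id', right_on='customer_id')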