I want to group by a particular column in a dataframe and calculate the sum of each subgroup thus created, while retaining (displaying) all the records in each subgroup.
I am trying to create my own credit card expense tracking program. (I know there are several already available, but the idea is to learn Python.)
I have the usual fields of 'Merchant', 'Date', 'Type' and 'Amount'.
I would like to do one of the following:
Group items by merchant, then within each such group, split the amount under two new columns, 'debit' and 'credit'. I also want to be able to sum the amounts under these columns. Repeat this for every merchant group.
If it is not possible to split based on the 'Type' of the transaction (that is, as 'debit' and 'credit'), then I want to be able to sum the debits and credits SEPARATELY and also retain the line items (while displaying, that is). Doing a sum() on the 'Amount' column gives just one number for each merchant, and I verified that it is an incorrect amount.
My data frame looks like this:
  Posted_Date  Amount   Type       Merchant
0  04/20/2019  -89.70  Debit            UNI
1  04/20/2019   -6.29  Debit          BOOKM
2  04/20/2019  -36.42  Debit       BROOKLYN
3  04/18/2019  -20.95  Debit  MTA*METROCARD
4  04/15/2019  -29.90  Debit           ZARA
5  04/15/2019   -7.70  Debit         STILES
The code I have, after reading into a data frame and marking a transaction as credit or debit, is:
merch_new = df_new.groupby(['Merchant','Type'])
merch_new.groups
for key, values in merch_new.groups.items():
    df_new['Amount'].sum()  # note: this sums the entire column, not this subgroup
    print(df_new.loc[values], "\n\n")
I was able to split it the way below:
    Posted_Date  Amount   Type Merchant
217  05/23/2019  -41.70  Debit       AT
305  04/27/2019  -12.40  Debit       AT

    Posted_Date  Amount    Type Merchant
127  07/08/2019   69.25  Credit       AT
162  06/21/2019  139.19  Credit       AT
Ideally, I would like something like the below: the line items are displayed along with a total for the given subgroup, in this case for merchant 'AT', and ideally sorted by date.
          Date Merchant  Credit  Debit
305  4/27/2019       AT       0  -12.4
217  5/23/2019       AT       0  -41.7
162  6/21/2019       AT  139.19      0
127   7/8/2019       AT   69.25      0
                         208.44  -54.1
It appears simple, but I am unable to format it in this way.
EDIT:
I get an error for rename_axis():
rename_axis() got an unexpected keyword argument 'index'
and if I delete the index argument, I get the same error for 'columns'
I searched a lot for the usage (like Benoit showed) but I cannot find any; they all used strings or lists. I tried using:
rename_axis(None,None)
and I get the error:
ValueError: No axis named None for object type <class 'pandas.core.frame.DataFrame'>
I don't know if this is because of the Python version I am using (3.6.6). I tried in both Spyder and Jupyter, but I get the same error.
I used:
rename_axis(None, axis=1) and I seem to get the desired results (sort of)
But I am unable to understand how this is being interpreted, since there is no keyword specifying which parameter that None is bound to. Can anyone please explain?
Any help is appreciated!
Thanks a lot!
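A note on the rename_axis behaviour, since it explains both errors: the index=/columns= keywords were only added to rename_axis in pandas 0.24, so on an older pandas they raise the "unexpected keyword argument" error regardless of the Python version. On those older versions the first positional argument is the mapper and the second is axis, which is why passing None as the second argument fails. A small sketch of the difference, assuming df is any DataFrame:

# pandas < 0.24: signature is rename_axis(mapper, axis=0, ...)
df.rename_axis(None, axis=1)   # None is the mapper; axis=1 targets the columns-axis name
df.rename_axis(None, None)     # second positional is axis -> ValueError: No axis named None

# pandas >= 0.24: dedicated keywords, as used in the answer below
df.rename_axis(index=None, columns=None)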
I think you're trying to achieve something like this:
In [1]:
## Create example
import pandas as pd

cols = ['Posted_Date', 'Amount', 'Type', 'Merchant']
data = [['04/20/2019', -89.70, 'Debit', 'UNI'],
        ['04/20/2019', -6.29, 'Credit', 'BOOKM'],
        ['04/20/2019', -36.42, 'Debit', 'BROOKLYN'],
        ['04/20/2019', -6.29, 'Credit', 'BOOKM'],
        ['04/20/2019', -54.52, 'Credit', 'BROOKLYN'],
        ['04/18/2019', -20.95, 'Credit', 'BROOKLYN']]
df = pd.DataFrame(columns=cols, data=data)

## Pivot table with aggregation function 'sum'
## (the index=/columns= keywords of rename_axis need pandas >= 0.24)
df_final = (pd.pivot_table(df, values='Amount', index=['Posted_Date', 'Merchant'],
                           columns=['Type'], aggfunc='sum')
            .fillna(0).reset_index()
            .rename_axis(index=None, columns=None))
df_final['Total'] = df_final['Debit'] + df_final['Credit']
Out [1]:
  Posted_Date  Merchant  Credit  Debit  Total
0  04/18/2019  BROOKLYN  -20.95   0.00 -20.95
1  04/20/2019     BOOKM  -12.58   0.00 -12.58
2  04/20/2019  BROOKLYN  -54.52 -36.42 -90.94
3  04/20/2019       UNI    0.00 -89.70 -89.70
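If you also want the totals line from the desired output (208.44 / -54.1), one option is pivot_table's margins argument, which appends a row of aggregates. A sketch on the same df (note it also adds a 'Total' column, which you can drop):

df_totals = pd.pivot_table(df, values='Amount', index=['Posted_Date', 'Merchant'],
                           columns=['Type'], aggfunc='sum', margins=True,
                           margins_name='Total').fillna(0)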
Related
My Python code takes a bank statement from Excel and creates a dataframe that categorises each transaction based on description.
Example code:
import pandas as pd
import openpyxl
import datetime as dt
import numpy as np

dff = pd.DataFrame({'Date': ['20221003', '20221005'],
                    'Tran Type': ['BOOK TRANSFER CREDIT', 'ACH DEBIT'],
                    'Debit Amount': [0.00, -220000.00],
                    'Credit Amount': [182.90, 0.0],
                    'Description': ['BOOK TRANSFER CREDIT FROM ACCOUNT 98754877', 'USREF2548 ACH OFFSET'],
                    'Amount': [-220000.00, 182.90]})
Then the bit that adds a column that categorises it if certain words appear in the description:
import re
dff['Category'] = dff['Description'].str.findall('Ref|BCA|Fund|Transfer', flags=re.IGNORECASE)
Output:
(screenshot of dff: each row's new Category column holds a list, e.g. ['TRANSFER'] and ['REF'])
But this code will not work. Any ideas why?
pivotf = dff
pivotf = pd.pivot_table(pivotf,
                        index=["Date"], columns="Category",
                        values=['Amount'],
                        margins=False, margins_name="Total")
The error message is TypeError: unhashable type: 'list'
When I change columns from "Category" to anything else, it works fine. I have tried converting the column from object to string but this doesn't actually convert it.
You can use the explode() function:
dff['Category'] = dff['Description'].str.findall('Ref|BCA|Fund|Transfer', flags=re.IGNORECASE).explode()
print(dff)
Returning:
       Date             Tran Type  ...    Amount  Category
0  20221003  BOOK TRANSFER CREDIT  ... -220000.0  TRANSFER
1  20221005             ACH DEBIT  ...     182.9       REF
If the Category column can contain multiple values per row, you have to use it like this:
dff['Category'] = dff['Description'].str.findall('Ref|BCA|Fund|Transfer', flags=re.IGNORECASE)
dff=dff.explode('Category')
This is because the output of findall() is a list:
Return all non-overlapping matches of pattern in string, as a list of strings or tuples
which is not a valid type for a pivot_table() column. Add str[0] at the end if you expect only one match per Description/row, and it should work:
dff['Category'] = dff['Description'].str.findall('Ref|BCA|Fund|Transfer', flags=re.IGNORECASE).str[0]
However, it may be more efficient under this scenario to use str.extract(), which returns just the first match per row and fits this case well:
dff['Category'] = dff['Description'].str.extract('(Ref|BCA|Fund|Transfer)', flags=re.IGNORECASE, expand=False)
Regardless of the approach, you will get as final output:
           Amount
Category      REF   TRANSFER
Date
20221003      NaN  -220000.0
20221005    182.9        NaN
Otherwise, you can use explode() as #Clenage suggests, but as shown here it can lead to duplicated values when a row matches several Categories:
pivotf = pd.pivot_table(pivotf.explode('Category'),
                        index=["Date"], columns="Category", values=['Amount'],
                        margins=False, margins_name="Total")
Outputting for example:
Category        BCA       FUND    REF   TRANSFER
Date
20221003  -220000.0  -220000.0    NaN  -220000.0
20221005        NaN        NaN  182.9        NaN
In which case it would wrongly suggest that 2022-10-03 saw three separate movements of -220000 each.
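Another way to keep exactly one row per transaction when several patterns match is to join the matches into a single combined label with str.join, so each Amount is pivoted only once. A sketch:

dff['Category'] = (dff['Description']
                   .str.findall('Ref|BCA|Fund|Transfer', flags=re.IGNORECASE)
                   .str.join('|'))   # e.g. 'REF|TRANSFER' when several patterns match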
I work with a lot of CSV data for my job. I am trying to use pandas to copy a member's 'Email' into the 'PrimaryMemberEmail' column of their spouse's row. Here is a sample of what I mean:
import pandas as pd
user_data = {'FirstName': ['John', 'Jane', 'Bob'],
             'Lastname': ['Snack', 'Snack', 'Tack'],
             'EmployeeID': ['12345', '12345S', '54321'],
             'Email': ['John#issues.com', 'NaN', 'Bob#issues.com'],
             'DOB': ['09/07/1988', '12/25/1990', '07/13/1964'],
             'Role': ['Employee On Plan', 'Spouse On Plan', 'Employee Off Plan'],
             'PrimaryMemberEmail': ['NaN', 'NaN', 'NaN'],
             'PrimaryMemberEmployeeId': ['NaN', '12345', 'NaN']
             }
df = pd.DataFrame(user_data)
I have thousands of rows like this. I need to populate 'PrimaryMemberEmail' only when the user is a spouse, using the 'Email' of their associated primary holder. So in this case I would want to auto-populate the 'PrimaryMemberEmail' for Jane Snack with that of her spouse, John Snack, which is 'John#issues.com'. I cannot find a good way to do this. Currently I am using:
p = -1  # row pointer
for i in df['EmployeeID']:
    p = p + 1
    EEID = df['EmployeeID'].iloc[p]
    if 'S' in EEID:
        # spouse rows carry an 'S' suffix; copy the email from the row above
        df['PrimaryMemberEmail'].iloc[p] = df['Email'].iloc[p - 1]
What bothers me is that this only works if my file comes in correctly, like how I showed in the example DataFrame. Also my NaN values do not work with dropna() or other methods, but that is a question for another time.
I am new to python and programming. I am trying to add value to myself in my current health career and I find this all very fascinating. Any help is appreciated.
IIUC, map the values and fillna:
df['PrimaryMemberEmail'] = (df['PrimaryMemberEmployeeId']
                            .map(df.set_index('EmployeeID')['Email'])
                            .fillna(df['PrimaryMemberEmail'])
                            )
Alternatively, if you have real NaNs (not strings), use boolean indexing:
df.loc[df['PrimaryMemberEmployeeId'].notna(),
       'PrimaryMemberEmail'] = df['PrimaryMemberEmployeeId'].map(df.set_index('EmployeeID')['Email'])
output:
  FirstName Lastname EmployeeID            Email         DOB               Role PrimaryMemberEmail PrimaryMemberEmployeeId
0      John    Snack      12345  John#issues.com  09/07/1988   Employee On Plan                NaN                     NaN
1      Jane    Snack     12345S              NaN  12/25/1990     Spouse On Plan    John#issues.com                   12345
2       Bob     Tack      54321   Bob#issues.com  07/13/1964  Employee Off Plan                NaN                     NaN
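A merge-based alternative is a self-merge that looks up each row's primary holder by EmployeeID; PrimaryHolderEmail below is just a hypothetical helper column name. A sketch:

primary = df[['EmployeeID', 'Email']].rename(columns={'EmployeeID': 'PrimaryMemberEmployeeId',
                                                      'Email': 'PrimaryHolderEmail'})
out = df.merge(primary, on='PrimaryMemberEmployeeId', how='left')
# fill PrimaryMemberEmail only where a primary holder was found
out['PrimaryMemberEmail'] = out['PrimaryHolderEmail'].fillna(out['PrimaryMemberEmail'])
out = out.drop(columns='PrimaryHolderEmail')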
To summarize as concisely as I can: I have a data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function along with pandas' df.map to obtain the properties of each compound (there is one compound per row), using the "CID" number as the identifier. The parameters of pubchempy.get_properties are an identifier (the "CID" number in this case) and the property of the chemical that you want to obtain from the PubChem website (molecular weight in this case).
This is the code that I have written currently:
import pandas as pd
import numpy as np
from pubchempy import get_properties

df = pd.read_csv("Data.tsv.txt", sep="\t")

# CID was read as a float (e.g. '5339.0'); strip the trailing '.0'
df['CID'] = df['CID'].astype(str).str.replace('.0', '', regex=False)
df = df.drop(df[df.CID == 'nan'].index)        # drop rows with a missing CID
df = df.drop(df.index.to_list()[5:], axis=0)   # keep only the first rows while testing
df['CID'] = df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight')
                          if float(x) > 0 else pd.NA)
df = df.rename(columns={'CID': 'MolecularWeight'})
print(df)
This is the output that I was initially getting for that column (only including a few rows; in reality, the dataset is very big):
MolecularWeight
[{'CID': 5339, 'MolecularWeight': '398.4'}]
[{'CID': 3889, 'MolecularWeight': '520.5'}]
[{'CID': 2788, 'MolecularWeight': '305.50'}]
[{'CID': 1422517, 'MolecularWeight': '440.5'}]
.
.
.
Now, the code was somewhat working in that it provides me with the molecular weight of the compound (398.4), but I didn't want all that extra bit of writing, nor did I want the quote marks around the molecular weight number (both of these get in the way of the next bit of code that I plan to write).
So I then added this bit of code:
df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')
This is the output that I am now getting:
MolecularWeight
398.4
398.4
398.4
398.4
.
.
.
What I want to do is pretty much exactly the same; it's just that instead of taking the molecular weight of the first row in the MolecularWeight column and copying it onto all the other rows, I want the molecular weight value of each individual row in that column as the output.
What I was hoping to get is something like this:
MolecularWeight
398.4
520.5
305.50
440.5
.
.
.
Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!
A few lines of the text file:
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments
1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25) R2|R2|R25|R46| A
2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32) R25| A
3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H R18|R26|R27| A
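For what it's worth, the repeated 398.4 comes from df.MolecularWeight[0][0]: it always reads the first row's list and broadcasts that single scalar to the whole column. A per-row version, as a sketch assuming each cell holds a list of dicts like the output above:

df['MolecularWeight'] = df['MolecularWeight'].map(
    lambda cell: cell[0].get('MolecularWeight') if isinstance(cell, list) and cell else pd.NA)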
If you cast the column to float, that should help you: df['MolecularWeight'] = df['MolecularWeight'].astype(float).
It appears that you may want multiple properties from each CID:
props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))
print(df2)
Output:
    CID MolecularWeight  HBondDonorCount  RotatableBondCount
0  5339           398.4                3                   6
1  3889           520.5                4                   9
2  2788          305.50                1                   0
You can then merge this information onto the original dataframe:
df = df.merge(df2) # df = df.merge(pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props)))
print(df)
...
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments MolecularWeight HBondDonorCount RotatableBondCount
0 1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]... O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18... NaN R2|R2|R25|R46| A NaN 398.4 3 6
1 2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]... COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)... 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8... NaN R25| A NaN 520.5 4 9
2 3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1... NaN R18|R26|R27| A NaN 305.50 1 0
I have the following two dataframes:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values. I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, year. I made a test earlier wherein a df1.catcode with only one value per cell was matched with the values in another df.catcode that had more than one value per cell and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with the regex method, I'm looking at the documentation now.
Cleaning the catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
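On newer pandas (0.25+), Series.explode() does the same job as the apply/stack/droplevel combination above. A sketch starting from the original df:

catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}').explode()  # one catcode per row
catcode_fixed.name = 'catcode'
df = df.drop('catcode', axis=1).join(catcode_fixed)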
I have a dataset whose first three columns are Basket ID (a unique identifier), Sale amount (dollars), and Date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:
Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)
Basket  Sale        Date  PrevSale  SaleCount  MeanToDate  MaxToDate
    88   $15   3/01/2012                    1
    88   $30  11/02/2012       $15          2         $23        $30
    88   $16  16/08/2012       $30          3         $20        $30
   123   $90  18/06/2012                    1
   477   $77  19/08/2012                    1
   477   $57  11/12/2012       $77          2         $67        $77
   566   $90   6/07/2012                    1
I'm pretty new to Python, and I really struggle to find a fancy way to do this. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. But I have no clue how to get the MeanToDate and MaxToDate in an efficient way apart from looping... any ideas?
This should do the trick:
from pandas import concat
# note: pandas.stats.moments was removed in later pandas versions;
# see the updated snippet below for the modern .expanding() equivalent
from pandas.stats.moments import expanding_mean, expanding_count

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),  # cumulative mean
            'MaxToDate': se.cummax(),          # cumulative max
            'SaleCount': expanding_count(se),  # cumulative count
            'Sale': se,                        # simple copy
            'PrevSale': se.shift(1)            # previous sale
        },
        axis=1
    )

# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating in the pandas groupby documentation.
import pandas as pd
pd.__version__  # u'0.24.2'
from pandas import concat

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),  # cumulative mean
            'MaxToDate': se.expanding().max(),    # cumulative max
            'SaleCount': se.expanding().count(),  # cumulative count
            'Sale': se,                           # simple copy
            'PrevSale': se.shift(1)               # previous sale
        },
        axis=1
    )

###########################
from datetime import datetime

df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})
#########
new_df = df.groupby('Basket').apply(handler).reset_index()
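On recent pandas the same running statistics can also be computed column-wise without apply. A sketch, assuming the frame is sorted by Basket and Date as the question describes:

df = df.sort_values(['Basket', 'Date'])
g = df.groupby('Basket')['Sale']
df['PrevSale'] = g.shift(1)                      # previous sale within the basket
df['SaleCount'] = g.cumcount() + 1               # running count
df['MeanToDate'] = g.cumsum() / df['SaleCount']  # running mean = running sum / count
df['MaxToDate'] = g.cummax()                     # running max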