Pandas: Retrieving an Index from a dataframe to populate another df - python

I have tried to find a solution to this but have failed.
I have my master df with transactional data, specifically including credit card names:
transactionId  amount  type          person
1              -30     Visa          john
2              -100    Visa Premium  john
3              -12     Mastercard    jenny
I am grouping by person and then aggregating by number of records and amount.
person  numbTrans  Amount
john    2          -130
jenny   1          -12
This is fine, but I need to add the dimension of credit card type to my df. I have grouped a df of the credit cards in use:
index  CreditCardName
0      Visa
1      Visa Premium
2      Mastercard
So, what I can't do is create a new column in my master dataframe, called 'CreditCardId', which uses the type string ('Visa'/'Visa Premium'/'Mastercard') to pull in the corresponding index from the credit card df:
transactionId  amount  type          CreditCardId  person
1              -30     Visa          0             john
2              -100    Visa Premium  1             john
3              -12     Mastercard    2             jenny
I need this as I am doing some simple k-means clustering and require ints, not strings (or at least I think I do).
Thanks in advance
Rob

If you set the 'CreditCardName' as the index of the second df then you can just call map:
In [80]:
# set up dummy data
import io
import pandas as pd
temp = """transactionId,amount,type,person
1,-30,Visa,john
2,-100,Visa Premium,john
3,-12,Mastercard,jenny"""
temp1 = """index,CreditCardName
0,Visa
1,Visa Premium
2,Mastercard"""
df = pd.read_csv(io.StringIO(temp))
# crucially, set the index column to be the credit card name
df1 = pd.read_csv(io.StringIO(temp1), index_col=[1])
df
Out[80]:
transactionId amount type person
0 1 -30 Visa john
1 2 -100 Visa Premium john
2 3 -12 Mastercard jenny
In [81]:
df1
Out[81]:
                index
CreditCardName
Visa                0
Visa Premium        1
Mastercard          2
In [82]:
# now we can call map, passing the series; map will align on index and return the index value for our new column
df['CreditCardId'] = df['type'].map(df1['index'])
df
Out[82]:
transactionId amount type person CreditCardId
0 1 -30 Visa john 0
1 2 -100 Visa Premium john 1
2 3 -12 Mastercard jenny 2
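As an aside (not part of the answer above), if the goal is just integer codes for k-means, pd.factorize can derive them straight from the strings without needing the separate lookup df at all. A minimal sketch:
# integer codes in order of first appearance, plus the unique card names
codes, uniques = pd.factorize(df['type'])
df['CreditCardId'] = codes
# uniques doubles as the lookup table: uniques[code] recovers the card name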

Related

How do I use medical codes to determine what disease a person has using Jupyter?

I'm trying to use a number of medical codes to find out whether a person has a certain disease, and I need some help; I've searched for a couple of days but couldn't find anything. Given that I've imported excel file 1 into df1 and excel file 2 into df2, how do I use excel file 2 to identify which diseases the patients in excel file 1 have and indicate them with a header? Below is an example of what the data looks like. I'm using pandas in a Jupyter notebook.
Excel file 1:
Patient  Primary Diagnosis  Secondary Diagnosis  Secondary Diagnosis 2  Secondary Diagnosis 3
Alex     50322              50111
John     50331              60874                50226                  74444
Peter    50226              74444
Peter    50233              88888
Excel File 2:
Primary Diagnosis       Medical Code
Diabetes Type 2         50322
Diabetes Type 2         50331
Diabetes Type 2         50233
Cardiovescular Disease  50226
Hypertension            50111
AIDS                    60874
HIV                     74444
HIV                     88888
Intended output:
Patient  Positive for Diabetes Type 2  Positive for Cardiovascular Disease  Positive for Hypertension  Positive for AIDS  Positive for HIV
Alex     1                             1                                    0                          0                  0
John     1                             1                                    0                          1                  1
Peter    1                             1                                    0                          0                  1
You can use melt, merge and pivot_table:
out = (
    df1.melt('Patient', var_name='Diagnosis', value_name='Medical Code').dropna()
       .merge(df2, on='Medical Code').assign(dummy=1)
       .pivot_table('dummy', 'Patient', 'Primary Diagnosis', fill_value=0)
       .add_prefix('Positive for ').rename_axis(columns=None).reset_index()
)
Output:
  Patient  Positive for AIDS  Positive for Cardiovescular Disease  Positive for Diabetes Type 2  Positive for HIV  Positive for Hypertension
0 Alex     0                  0                                    1                             0                 1
1 John     1                  1                                    1                             1                 0
2 Peter    0                  1                                    1                             1                 0
IIUC, you could melt df1, then map the codes from reshaped df2, finally pivot_table on the output:
diseases = df2.set_index('Medical Code')['Primary Diagnosis']
(df1
 .reset_index()
 .melt(id_vars=['index', 'Patient'])
 .assign(disease=lambda d: d['value'].map(diseases),
         value=1)
 .pivot_table(index='Patient', columns='disease', values='value', fill_value=0)
)
output:
disease AIDS Cardiovescular Disease Diabetes Type 2 HIV Hypertension
Patient
Alex 0 0 1 0 1
John 1 1 1 1 0
Peter 0 1 1 1 0
Maybe you could convert your excel file 2 into some form of key-value pair and then replace the diagnosis columns in file 1 with the corresponding disease names, and later apply some form of encoding like one-hot to file 1. Not sure if this approach would definitely help, but just sharing my thoughts; a rough sketch of the idea follows.
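A minimal sketch of that idea, assuming df1 and df2 hold the two excel files with the column names shown above (the variable names here are illustrative):
import pandas as pd
# build the key-value mapping from excel file 2: medical code -> disease name
code_to_disease = dict(zip(df2['Medical Code'], df2['Primary Diagnosis']))
# replace each code in the diagnosis columns with its disease name,
# then one-hot encode and collapse to one row per patient
diag = df1.set_index('Patient').replace(code_to_disease)
flags = (pd.get_dummies(diag.stack())    # one indicator column per disease
           .groupby(level=0).max()       # any diagnosis column counts, per patient
           .astype(int)                  # True/False -> 1/0
           .add_prefix('Positive for ')
           .reset_index())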

Python pandas: create new column based on max value within group, but using value from additional (string) column

I have a pandas DataFrame with the following:
import pandas as pd
df = pd.DataFrame({'group_id': [1, 1, 2, 2],
                   'name': ['Arthur', 'Bob', 'Caroline', 'Denise'],
                   'income': [40000, 20000, 50000, 60000]})
df
Out[94]:
group_id name income
0 1 Arthur 40000
1 1 Bob 20000
2 2 Caroline 50000
3 2 Denise 60000
My desired output is to have, within group_id, the name of the person with the highest income, e.g.:
df
Out[94]:
group_id name income highest_income_name
0 1 Arthur 40000 Arthur
1 1 Bob 20000 Arthur
2 2 Caroline 50000 Denise
3 2 Denise 60000 Denise
Based on the data-generating process for my actual data, there will always be only one name within a group with the highest income.
What is the best practice way for generating the above?
If I try to fill in the max income and then find the name, I'm stuck with NaN, which I could potentially try to fill in, but that would add complexity.
df['max_income'] = df.groupby('group_id')['income'].transform('max')
df['highest_income_name'] = df['name'][df['income']==df['max_income']]
df
Out[105]:
group_id name income max_income highest_income_name
0 1 Arthur 40000 40000 Arthur
1 1 Bob 20000 40000 NaN
2 2 Caroline 50000 60000 NaN
3 2 Denise 60000 60000 Denise
Use numpy.where with GroupBy.transform:
In [287]: import numpy as np
In [302]: df['highest_income_name'] = np.where(df.income.eq(df.groupby('group_id')['income'].transform(max)), df.name, np.nan)
In [308]: df['highest_income_name'] = df.groupby('group_id')['highest_income_name'].transform('first')
In [309]: df
Out[309]:
group_id name income highest_income_name
0 1 Arthur 40000 Arthur
1 1 Bob 20000 Arthur
2 2 Caroline 50000 Denise
3 2 Denise 60000 Denise
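An alternative sketch (not from the answer above) that avoids the intermediate NaN column: pick the row label of each group's maximum with idxmax and map it back.
# row labels of the max-income row in each group
winners = df.loc[df.groupby('group_id')['income'].idxmax(), ['group_id', 'name']]
# map every row's group_id to that group's winner
df['highest_income_name'] = df['group_id'].map(winners.set_index('group_id')['name'])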

Conditionally align two dataframes in order to derive a column passed in as a condition in numpy where

I come from a SQL background and am new to Python. I have been trying to figure out how to solve this particular problem for a while now and am unable to come up with anything.
Here are my dataframes
from pandas import DataFrame
import numpy as np
Names1 = {'First_name': ['Jon','Bill','Billing','Maria','Martha','Emma']}
df = DataFrame(Names1,columns=['First_name'])
print(df)
names2 = {'name': ['Jo', 'Bi', 'Ma']}
df_2 = DataFrame(names2,columns=['name'])
print(df_2)
Results to this:
First_name
0 Jon
1 Bill
2 Billing
3 Maria
4 Martha
5 Emma
name
0 Jo
1 Bi
2 Ma
This code helps me identify which First_name in df starts with one of the prefixes from df_2:
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), 'true', df['First_name'])
results to this:
First_name like_flg
0 Jon true
1 Bill true
2 Billing true
3 Maria true
4 Martha true
5 Emma Emma
I would like the final output to set like_flg to the value of the prefix from df_2 that each First_name was matched against. See below for the final desired output:
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Here's what I've tried so far
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), tuple(list(df_2['name'])), df['First_name'])
results to this error:
`ValueError: operands could not be broadcast together with shapes (6,) (3,) (6,)`
I've also tried aligning both dataframes; however, that won't work for the use case I'm trying to achieve.
Is there a way to conditionally align dataframes to fill in the columns that start with the tuple?
I believe the issue I'm facing is that the tuple or dataframe I'm using for the comparison is not the same size as the dataframe I want to append it to. Please see above for the desired output.
Thank you all in advance!
If your starting strings differ in length, you can use .str.extract
df['like_flag'] = df['First_name'].str.extract('^('+'|'.join(df_2.name)+')')
df['like_flag'] = df['like_flag'].fillna(df.First_name) # Fill non matches.
I modified df_2 to be
name
0 Jo
1 Bi
2 Mar
which leads to:
First_name like_flag
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Mar
4 Martha Mar
5 Emma Emma
You can use np.where,
df['like_flg'] = np.where(df.First_name.str[:2].isin(df_2.name), df.First_name.str[:2], df.First_name)
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Do it with numpy find:
v=df.First_name.values.astype(str)
s=df_2.name.values.astype(str)
df_2.name.dot((np.char.find(v,s[:,None])==0))
array(['Jo', 'Bi', 'Bi', 'Ma', 'Ma', ''], dtype=object)
Then we just assign it back
df['New']=df_2.name.dot((np.char.find(v,s[:,None])==0))
df.loc[df['New']=='','New']=df.First_name
df
First_name New
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma

Counting the occurrences of a substring from one column within another column

I have two dataframes I am working with, one which contains a list of players and another that contains play by play data for the players from the other dataframe. Portions of the rows of interest within these two dataframes are shown below.
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the occurrences of the string (df['Name'] + ' scored') in the Play column of the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains for this type of thing, but it only seems to work with an explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!
I think you need findall with a regex that joins all values of Name, then create indicator columns with MultiLabelBinarizer and add any missing columns with reindex:
s = df1['Name'] + ' scored'
pat = r'\b{}\b'.format('|'.join(s))
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_,
                  index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name Matt Carpenter scored Jason Heyward scored Peter Bourjos scored \
0 0 0 0
1 0 0 0
2 0 1 0
Name Matt Holliday scored Jhonny Peralta scored Matt Adams scored
0 0 0 0
1 0 0 0
2 0 0 0
Last, if necessary, join it back to df2:
df = df2.join(df)
print (df)
Play Matt Carpenter scored \
0 Matt Carpenter grounded out to second (Grounder). 0
1 Jason Heyward doubled to right (Liner). 0
2 Matt Holliday singled to right (Liner). Jason ... 0
Jason Heyward scored Peter Bourjos scored Matt Holliday scored \
0 0 0 0
1 0 0 0
2 1 0 0
Jhonny Peralta scored Matt Adams scored
0 0 0
1 0 0
2 0 0
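If you only want one total per player rather than per-play indicator columns, here is a simpler (if less vectorized) sketch using str.count, assuming the dataframe and column names from the question:
import re
# count occurrences of '<Name> scored' across all plays for each player
batter_game_logs_df['R vs SP'] = batter_game_logs_df['Name'].map(
    lambda name: play_by_play_SP_df['Play'].str.count(re.escape(name) + ' scored').sum()
)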

Merge two dataframes based on a column

I want to compare the name column in two dataframes df1 and df2, output the matching rows from df1, and store the result in a new dataframe df3. How do I do this in pandas?
df1
place name qty unit
NY Tom 2 10
TK Ron 3 15
Lon Don 5 90
Hk Sam 4 49
df2
place name price
PH Tom 7
TK Ron 5
Result:
df3
place name qty unit
NY Tom 2 10
TK Ron 3 15
Option 1
Using df.isin:
In [1362]: df1[df1.name.isin(df2.name)]
Out[1362]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 2
Performing an inner-join with df.merge:
In [1365]: df1.merge(df2.name.to_frame())
Out[1365]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 3
Using df.eq (note this compares positionally by index, so it only works here because the matching names happen to sit in the same rows of both frames):
In [1374]: df1[df1.name.eq(df2.name)]
Out[1374]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
You want something called an inner join.
df1.merge(df2,on = 'name')
place_x name qty unit place_y price
NY Tom 2 10 PH 7
TK Ron 3 15 TK 5
The _x and _y suffixes appear when a column exists in both dataframes being merged.
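As an aside (not part of the original answer), merge's suffixes parameter lets you pick those names yourself instead of the default _x/_y:
# label the overlapping 'place' columns explicitly
df3 = df1.merge(df2, on='name', suffixes=('_df1', '_df2'))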
