Pandas - Split and refactor an overloaded ID column - python

I have a pandas DataFrame with columns patient_id, patient_sex, patient_dob (and other less relevant columns). Rows can have duplicate patient_ids, as each patient may have more than one entry in the data for multiple medical procedures. I discovered, however, that a great many of the patient_ids are overloaded, i.e. more than one patient has been assigned to the same id (evidenced by many instances of a single patient_id being associated with multiple sexes and multiple dates of birth).
To refactor the ids so that each patient has a unique one, my plan was to group the data not only by patient_id, but by patient_sex and patient_dob as well. I figure this must be sufficient to separate the data into individual patients (and if two patients with the same sex and dob just happened to be assigned the same id, then so be it).
Here is the code I currently use:
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()
# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))
# Merge new ids into original patients frame
patients_with_new_ids = patients.merge(indv_patients, left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)
# Remove byproduct columns, and original id column
drop_columns = [col for col in patients_with_new_ids.columns if col not in patients.columns or col == new_patient_id]
drop_columns.append('patient_id')
patients_with_new_ids = patients_with_new_ids.drop(columns=drop_columns)
patients = patients_with_new_ids.rename(columns={new_patient_id : 'patient_id'})
The problem is that with over 7 million patients, this solution is far too slow, the biggest bottleneck being the for-loop. So my question is: is there a better way to fix these overloaded ids? (The actual id doesn't matter, so long as it's unique for each patient.)

I don't know what the values for the columns are but have you tried something like this?
patients['new_patient_id'] = patients.apply(lambda x: x['patient_id'] + x['patient_sex'] + x['patient_dob'],axis=1)
This should create a new column, and you can then use groupby with new_patient_id.
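A loop-free alternative, sketched here under the assumption that any unique integer is an acceptable id, is to let groupby number the distinct (patient_id, patient_sex, patient_dob) combinations with ngroup(). The sample rows and the procedure column are made up for illustration.

```python
import pandas as pd

# Made-up sample: id 1 is shared by two different people
patients = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "patient_sex": ["F", "F", "M", "M"],
    "patient_dob": ["1980-01-01", "1980-01-01", "1975-05-05", "1975-05-05"],
    "procedure": ["a", "b", "c", "d"],  # hypothetical extra column
})

keys = ["patient_id", "patient_sex", "patient_dob"]

# ngroup() labels every row with the number of the group it belongs to,
# so rows sharing all three key values receive the same integer.
# dropna=False keeps rows with a missing sex or dob in a group of their own
# instead of labelling them -1.
patients["patient_id"] = patients.groupby(keys, dropna=False).ngroup()

print(patients)
```

Because the numbering is computed in one vectorized pass, there is no per-row assignment and no merge step; the original frame keeps all of its other columns.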

Related

Performance issues with adding a column to Pandas Groupby object from a second data frame

I have two dataframes with name information.
df_good_ssn contains real member information where the member can have several rows with the same SSN. (For example they opened two accounts, one row will have account number 00123, the second 00456, but both will have the same name and an SSN of 111-22-3333)
df_random_name contains randomly generated names.
I am systematically assigning a row of information from df_random_name to each member (other columns and information are assigned as well; they have been removed to simplify the example). Since df_good_ssn can have multiple rows with the same SSN, I need to group the real member information on the SSN column.
I am using the following code, which works but takes a very long time. df_good_ssn contains over 900k rows and about 100k unique SSN groups. Each group can take upwards of 1 second. If anyone can think of a faster way to accomplish this, please let me know. This cannot use a SQL server, so if pandas is unable to perform the groupby faster, my next step will most likely be to write a sqlite file and go from there.
ssn_groups = df_good_ssn.groupby('SSN')
new_ssn_number=666000001
df_good_ssn_random_name_row_count=0
for ssn, ssn_group in ssn_groups:
    df_good_ssn.loc[ssn_group.index, 'NEW_FIRST'] = df_random_name.loc[df_good_ssn_random_name_row_count,'NEW_FIRST'].upper()
    df_good_ssn.loc[ssn_group.index, 'NEW_MIDDLE'] = ""
    df_good_ssn.loc[ssn_group.index, 'NEW_LAST'] = df_random_name.loc[df_good_ssn_random_name_row_count,'NEW_LAST'].upper()
    # Removed other columns for this example
    df_good_ssn.loc[ssn_group.index, 'NEW_SSN'] = str(new_ssn_number)
    df_good_ssn_random_name_row_count += 1
    new_ssn_number += 1
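One possible way to avoid the per-group .loc assignments, sketched under two assumptions: df_random_name has a default 0..n-1 index, and it has at least as many rows as there are unique SSNs. The idea is to compute a group number for every row once and use it as a positional lookup into the random names. The sample values and the ACCOUNT column are made up.

```python
import pandas as pd

# Made-up sample data
df_good_ssn = pd.DataFrame({
    "SSN": ["222-33-4444", "111-22-3333", "111-22-3333"],
    "ACCOUNT": ["00789", "00123", "00456"],  # hypothetical column
})
df_random_name = pd.DataFrame({
    "NEW_FIRST": ["alice", "bob"],
    "NEW_LAST": ["smith", "jones"],
})

# One integer per row: 0 for the first SSN in sorted order, 1 for the next, ...
# This is the same order in which the loop over groupby('SSN') visits groups.
group_no = df_good_ssn.groupby("SSN").ngroup().to_numpy()

# Upper-case the random names once, then pick row k for every member of group k
first = df_random_name["NEW_FIRST"].str.upper().to_numpy()
last = df_random_name["NEW_LAST"].str.upper().to_numpy()

df_good_ssn["NEW_FIRST"] = first[group_no]
df_good_ssn["NEW_MIDDLE"] = ""
df_good_ssn["NEW_LAST"] = last[group_no]
df_good_ssn["NEW_SSN"] = (666000001 + group_no).astype(str)

print(df_good_ssn)
```

Each new column is written in a single assignment, so the cost no longer grows with the number of groups times the cost of a .loc write.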

How could I create a column with matching values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are in total 5 Regions. Every Region consists of an x amount of ZipCodes. I would like to use the two different datasets to create a new column.
I have already tried some code; however, it failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets: one of them has 1518 rows x 3 columns and the other one has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset with the Postcode and Regio columns, which are the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset where the Regio column is missing as you can see. I would like to add a new column into the df2 dataset which contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the ZipCode column of df2 to the Regio column of df1, assuming Postcode and ZipCode hold the same kind of values.
First create a dictionary from df1, then use it to translate the ZipCode values into a new Regio column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
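A variant of the same lookup, sketched with Series.map instead of replace. With map, zip codes that have no entry in df1 become NaN rather than keeping their original value, which makes unmatched rows easy to spot. The sample postcodes and region names are invented, and the sketch assumes each Postcode should map to a single Regio.

```python
import pandas as pd

# Invented sample data
df1 = pd.DataFrame({
    "Postcode": [1011, 1012, 3011],
    "Regio": ["Noord", "Noord", "Zuid"],
})
df2 = pd.DataFrame({"ZipCode": [3011, 1011, 9999]})

# Series indexed by Postcode; drop_duplicates guards against repeated postcodes
zip_to_regio = df1.drop_duplicates("Postcode").set_index("Postcode")["Regio"]

# Look up each ZipCode; codes missing from df1 yield NaN
df2["Regio"] = df2["ZipCode"].map(zip_to_regio)

print(df2)
```

The two frames do not need the same length or the same index for this to work, because the lookup is done by value rather than by row label.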

Compare two date columns in pandas DataFrame to validate third column

Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match and instead had to be done by their names. An example match of the name column from two databases to merge as one is the following
long_name name
L. Messi Lionel Andrés Messi Cuccittini
As part of the validation process of an 18,000-row database, I want to check the two date of birth columns in the merged DataFrame, df, ensuring that the columns match like the example below
dob birth_date
1987-06-24 1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged. However, if the two date columns don't match, I want the data in this value column to be changed to null. This is something I can do quite easily in Excel with a date_diff calculation, but I'm unsure how to do it in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I also need to work on the matching algorithm to ensure a perfect date match; however, my current algorithm works quite well as it is and the project is nearly complete. I'd be happy to hear any advice on this, though, if it is something you know about.
Many thanks in advance!
IIUC:
Please try np.where.
It works as follows:
np.where(condition, x, y)  # take x where condition is True, otherwise y
Here,
condition = df['birth_date'] != df['dob'],
x = np.nan and
y = the existing df['value']
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
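A small self-contained check of the np.where approach. The second row and the value column are made-up sample data; the condition passed to np.where is the boolean comparison of the two date columns itself.

```python
import numpy as np
import pandas as pd

# Made-up sample: the second row has mismatching dates
df = pd.DataFrame({
    "dob": pd.to_datetime(["1987-06-24", "1985-02-05"]),
    "birth_date": pd.to_datetime(["1987-06-24", "1985-02-06"]),
    "value": [100.0, 90.0],
})

# True where the two date columns disagree
mismatch = df["birth_date"] != df["dob"]

# Keep value where the dates agree, null it out where they do not
df["value"] = np.where(mismatch, np.nan, df["value"])

print(df)
```

The same result can be had without NumPy via df['value'].mask(mismatch), which replaces the entries where the mask is True with NaN.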

Pivot across multiple columns with repeating values in each column

I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date     Location  Action1  Quantity1  Action2    Quantity2  ...  ActionN    QuantityN
<date>   1         Lights   10         CFloor     1          ...  Null       Null
<date2>  2         CFloor   2          CWalls     4          ...  CBasement  15
<date3>  2         CWalls   7          CBasement  4          ...  Null       Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
   Lights  CFloor  CBasement  CWalls
1  10      1       0          0
2  0       2       19         11
The index of the rows becomes the location, while the columns become the unique actions found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
   Lights  CFloor  CBasement  CWalls
1  0       0       0          0
2  0       0       0          0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
# Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template

# Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed I will end up with multiple dataframes that all have the same row counts, but potentially different column counts. I solved this issue by simply concatenating the columns and if any columns are repeated, I then sum them to get the final result.
def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    # Build one Location x activity table per activity/quantity column pair
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act], values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    # Sum columns that share an activity name; grouping the transposed frame
    # avoids groupby(axis=1), which newer pandas versions deprecate
    finalDf = finalDf.T.groupby(level=0).sum().T
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from about 10 s to about 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
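One possible "one shot" variant, offered as a sketch rather than a benchmarked replacement: stack every activity/quantity pair into a single long frame with two columns, then let pivot_table do the summing. The sample frame mirrors the example rows from the question with three column pairs; the long-format column names Activity and Quantity are chosen here for illustration.

```python
import pandas as pd

# Sample rows modelled on the question (three activity/quantity pairs)
df = pd.DataFrame({
    "Location": [1, 2, 2],
    "Activity01": ["Lights", "CFloor", "CWalls"],
    "Quantity01": [10, 2, 7],
    "Activity02": ["CFloor", "CWalls", "CBasement"],
    "Quantity02": [1, 4, 4],
    "Activity03": [None, "CBasement", None],
    "Quantity03": [None, 15, None],
})

pairs = [
    ("Activity01", "Quantity01"),
    ("Activity02", "Quantity02"),
    ("Activity03", "Quantity03"),
]

# Stack each pair under common column names, dropping rows with no activity
long_df = pd.concat(
    [
        df[["Location", act, quant]].rename(columns={act: "Activity", quant: "Quantity"})
        for act, quant in pairs
    ],
    ignore_index=True,
).dropna(subset=["Activity"])

# One pivot over the long frame: rows are locations, columns are activities
result = long_df.pivot_table(
    index="Location",
    columns="Activity",
    values="Quantity",
    aggfunc="sum",
    fill_value=0,
)

print(result)
```

Since all pairs end up in the same two columns before pivoting, there is no need to sum duplicated column names afterwards or to rely on the partial frames sharing an index.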

How to create a new python DataFrame with multiple columns of differing row lengths?

I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I sorted the dataset alphabetically by country name and created an alphabetical list of the individual countries. (new_data.tail() shows Zimbabwe listed last; there are 80336 rows, hence the sorting.)
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (one per country name), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[(new_data['country'] == country)]
    df[country] = pink.trust
Output here
As you can see, the data does not get included for any column after the first. I believe this is because the number of rows of 'trust' data varies by country. While the first column has 1000 rows, some countries have as many as 2500 data points and others as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact data structure for its template data, so that is why I'm attempting to put it in a dataframe. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = df.combine_first(pink.trust.to_frame(country))
which ensures that your index is always the union of all added columns.
I think in this case df.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function turns the values of one particular column into column names. See the documentation for more information. I hope that helps.
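Another way to get one column per country, sketched on made-up data: give every country's trust values a fresh 0..n-1 index and concatenate the resulting series side by side. pd.concat aligns on the union of the indexes, so shorter columns are padded with NaN at the bottom instead of being dropped.

```python
import pandas as pd

# Made-up sample: Albania has three rows, Zimbabwe has two
new_data = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania", "Zimbabwe", "Zimbabwe"],
    "trust": [1, 2, 3, 4, 5],
})

# One Series per country, each re-indexed from 0 so the columns line up at the top
columns = {
    country: trust.reset_index(drop=True)
    for country, trust in new_data.groupby("country")["trust"]
}

# Dict keys become the column names; missing positions are filled with NaN
df = pd.concat(columns, axis=1)

print(df)
```

The original loop fails because df[country] = pink.trust aligns on the index labels that the first country established; rows of later countries carry different labels and therefore come out as NaN.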
