I apologize for the uninformative title, but I need help with a pandas problem that I could not summarize in a short title.
So I have a dataframe of some orders containing columns for
OrderId
ClientId
OrderDate
ReturnQuantity
I would like to add a boolean column HasReturnedBefore, which is True only if a customer with the same ClientId has made one or more previous orders (with an earlier OrderDate) with a ReturnQuantity greater than 0.
I don't know how to approach this problem; I am not yet familiar enough with all the subtleties of pandas.
If I understand your question correctly, this is what you need:
df = df.sort_values(by=['ClientId', 'OrderDate'])
df['HasReturnedBefore'] = (
    df.groupby('ClientId')['ReturnQuantity']
      .transform(lambda s: s.gt(0).cummax().shift(fill_value=False))
)
First you need to sort_values by the columns that distinguish and order the records - ClientId and OrderDate in this case - so that each client's orders appear chronologically.
Then, per client, what I did was:
Build a boolean series marking orders where ReturnQuantity is greater than 0,
Take its cumulative maximum, which becomes True from the first return onward, and
Shift it down one row, so each order only looks at strictly earlier orders.
The first order of each client is False because fill_value=False fills the row that has no predecessor - it is treated as if it had no previous purchases (which it didn't).
Additional functions:
shift - shifts all records down by the given number of rows (here 1), filling the gap with fill_value
cummax - cumulative maximum; on booleans, stays True from the first True onward
groupby - groups the dataframe by the desired columns
transform - broadcasts the per-group result back onto the original dataframe's index
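As a self-contained sketch of one way to compute "any strictly earlier order of the same client had a return" (made-up toy data; assumes a pandas version where shift accepts fill_value):

```python
import pandas as pd

# Toy data standing in for the real orders frame (made-up values)
df = pd.DataFrame({
    'OrderId': [1, 2, 3, 4, 5],
    'ClientId': ['a', 'a', 'a', 'b', 'b'],
    'OrderDate': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01',
                                 '2020-01-15', '2020-02-15']),
    'ReturnQuantity': [0, 2, 0, 1, 0],
})

# Sort so each client's orders are chronological, then per client:
# mark returns, carry True forward with cummax, and shift one row down
# so each order only sees strictly earlier orders.
df = df.sort_values(['ClientId', 'OrderDate'])
df['HasReturnedBefore'] = (
    df.groupby('ClientId')['ReturnQuantity']
      .transform(lambda s: s.gt(0).cummax().shift(fill_value=False))
)
```

Client 'a' returned on its second order, so only its third order is flagged; client 'b' returned on its first order, so its second order is flagged.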
I have two dataframes with name information.
df_good_ssn contains real member information where the member can have several rows with the same SSN. (For example they opened two accounts, one row will have account number 00123, the second 00456, but both will have the same name and an SSN of 111-22-3333)
df_random_name contains randomly generated names.
I am systematically assigning a row of information from df_random_name to each member (there are other columns and information being assigned; they have been removed to simplify the example). Since df_good_ssn can have multiple rows with the same SSN, I need to group the real member information on the SSN column.
I am using the following code, which works but takes a very long time. df_good_ssn contains over 900k rows and about 100k unique SSN groups, and each group can take upwards of 1 second. If anyone can think of a faster way to accomplish this, please let me know. This cannot use a SQL server, so if pandas is unable to perform the groupby faster, my next step will most likely be to write a sqlite file and go from there.
ssn_groups = df_good_ssn.groupby('SSN')
new_ssn_number=666000001
df_good_ssn_random_name_row_count=0
for ssn, ssn_group in ssn_groups:
    df_good_ssn.loc[ssn_group.index, 'NEW_FIRST'] = df_random_name.loc[df_good_ssn_random_name_row_count, 'NEW_FIRST'].upper()
    df_good_ssn.loc[ssn_group.index, 'NEW_MIDDLE'] = ""
    df_good_ssn.loc[ssn_group.index, 'NEW_LAST'] = df_random_name.loc[df_good_ssn_random_name_row_count, 'NEW_LAST'].upper()
    # Removed other columns for this example
    df_good_ssn.loc[ssn_group.index, 'NEW_SSN'] = str(new_ssn_number)
    df_good_ssn_random_name_row_count += 1
    new_ssn_number += 1
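For reference, a vectorized sketch of the same assignment using pd.factorize (toy frames with made-up rows, reusing the question's column names; factorize numbers each unique SSN in order of appearance, which replaces both manual counters - it assumes df_random_name has at least as many rows as there are unique SSNs):

```python
import pandas as pd

# Toy stand-ins for the real frames (made-up values)
df_good_ssn = pd.DataFrame({
    'SSN': ['111-22-3333', '111-22-3333', '222-33-4444'],
    'FIRST': ['Ann', 'Ann', 'Bob'],
})
df_random_name = pd.DataFrame({
    'NEW_FIRST': ['carol', 'dave'],
    'NEW_LAST': ['smith', 'jones'],
})

# codes[i] is the 0-based index of row i's SSN among the unique SSNs,
# in order of appearance
codes, uniques = pd.factorize(df_good_ssn['SSN'])

# Index the random-name columns by those codes, all at once
df_good_ssn['NEW_FIRST'] = df_random_name['NEW_FIRST'].str.upper().to_numpy()[codes]
df_good_ssn['NEW_MIDDLE'] = ""
df_good_ssn['NEW_LAST'] = df_random_name['NEW_LAST'].str.upper().to_numpy()[codes]
df_good_ssn['NEW_SSN'] = (666000001 + codes).astype(str)
```

Every row sharing an SSN gets the same replacement name and the same new SSN, with no Python-level loop over the 100k groups.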
Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match, so the join instead had to be done on the players' names. An example match of the name columns from the two databases is the following:
long_name name
L. Messi Lionel Andrés Messi Cuccittini
As part of the validation process of an 18,000-row database, I want to check the two date of birth columns in the merged DataFrame df, ensuring that the columns match like the example below:
dob birth_date
1987-06-24 1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged. However, if the two date columns don't match, I want the data in this value column to be changed to null. This is something I can do quite easily in Excel with a date_diff calculation, but I'm unsure how in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I would also work on the matching algorithm to ensure a perfect date match; however, my current algorithm works quite well in its current state and the project is nearly complete. I'd be happy to hear any advice on this, though, if it's something you know about.
Many thanks in advance!
IIUC:
Please try np.where.
It works as follows:
np.where(condition, x, y) - assign x where the condition holds, else assign y
Here the condition is df['birth_date'] != df['dob'],
x is np.nan, and
y is the prevailing df['value']:
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
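A self-contained sketch of the effect (made-up rows); note that a missing date compares unequal, so rows where either date is NaT would also be nulled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dob': pd.to_datetime(['1987-06-24', '1990-01-01']),
    'birth_date': pd.to_datetime(['1987-06-24', '1991-05-05']),
    'value': [100.0, 80.0],
})

# Null out 'value' wherever the two date columns disagree
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
```

The first row's dates match, so its value survives; the second row's value becomes NaN.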
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find their lowest score up until that date in their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min') as this would give the min of all values, not just from the current row back to the first.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem does boil down to the following: you need to select the range of data you want to work with (rows for the date range, columns for the user/speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['user', 'speed']] (assuming a sorted DatetimeIndex).
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a bit, but you're looking for how to slice DataFrames.
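For the running "lowest score so far per user" specifically, a minimal sketch (made-up data) using a grouped cumulative minimum, assuming the frame is sorted by date within each user:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03',
                            '2018-01-01', '2018-01-02']),
    'speed': [5, 3, 4, 7, 6],
})

# cummin gives, for each row, the minimum of all rows of the same user
# from the first row up to and including the current one
df = df.sort_values(['user', 'date'])
df['min_so_far'] = df.groupby('user')['speed'].cummin()
```

Unlike transform('min'), which repeats the overall per-user minimum on every row, cummin only looks backwards.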
I have excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use Excel to do this – if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue blank, because that's the first period for that particular ID). I don't know how to do this using python. Can anyone help?
Assuming the dataframe is already sorted
df.groupby('ID').Cash.pct_change()
However, you can speed things up if you rely on the data being sorted, because then it's not necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you’ll need to assign to a column or create a new dataframe with the new column
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
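A quick self-contained check of the grouped version (made-up numbers; pct_change leaves the first row of each ID as NaN, matching the blue cells in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Cash': [100.0, 110.0, 121.0, 50.0, 55.0],
})

# First row of each ID has no prior period, so it stays NaN;
# the rest are (current / previous) - 1 within each ID
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
```

Here each ID grows 10% per period, so every non-first row comes out as 0.1.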
I have a pandas DataFrame with columns patient_id, patient_sex, patient_dob (and other less relevant columns). Rows can have duplicate patient_ids, as each patient may have more than one entry in the data for multiple medical procedures. I discovered, however, that a great many of the patient_ids are overloaded, i.e. more than one patient has been assigned to the same id (evidenced by many instances of a single patient_id being associated with multiple sexes and multiple days of birth).
To refactor the ids so that each patient has a unique one, my plan was to group the data not only by patient_id, but by patient_sex and patient_dob as well. I figure this must be sufficient to separate the data into individual patients (and if two patients with the same sex and dob just happened to be assigned the same id, then so be it).
Here is the code I currently use:
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()
# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))
# Merge new ids into original patients frame
patients_with_new_ids = patients.merge(indv_patients, left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)
# Remove byproduct columns, and original id column
drop_columns = [col for col in patients_with_new_ids.columns if col not in patients.columns or col == new_patient_id]
drop_columns.append('patient_id')
patients_with_new_ids = patients_with_new_ids.drop(columns=drop_columns)
patients = patients_with_new_ids.rename(columns={new_patient_id : 'patient_id'})
The problem is that with over 7 million patients this is way too slow a solution, the biggest bottleneck being the for-loop. So my question is: is there a better way to fix these overloaded ids? (The actual id doesn't matter, so long as it's unique for each patient.)
I don't know what the values for the columns are, but have you tried something like this (vectorized string concatenation rather than a row-wise apply, which would itself be slow at 7 million rows)?
patients['new_patient_id'] = patients['patient_id'].astype(str) + patients['patient_sex'].astype(str) + patients['patient_dob'].astype(str)
This creates a new column in a single vectorized pass, and you can then use groupby with the new_patient_id.
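If an opaque integer id is acceptable, groupby(...).ngroup() assigns one number per unique (id, sex, dob) combination in a single vectorized pass, with no per-group Python loop. A sketch with made-up rows:

```python
import pandas as pd

# Toy stand-in for the real frame: 'p1' is an overloaded id shared by
# two different patients (different sex/dob)
patients = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p1', 'p2'],
    'patient_sex': ['F', 'F', 'M', 'M'],
    'patient_dob': ['1980-01-01', '1980-01-01', '1975-03-03', '1990-07-07'],
})

# One integer per unique (id, sex, dob) triple; rows of the same
# patient share the same new id
patients['new_patient_id'] = patients.groupby(
    ['patient_id', 'patient_sex', 'patient_dob']
).ngroup()
```

The two 'p1'/'F' rows get one id, the 'p1'/'M' row another, and 'p2' a third, which replaces both the first()/iterrows() pass and the merge.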