Python Pandas - merge rows if some values are blank

I have a dataset that looks a little like this:
ID  Name             Address      Zip    Cost
1   Bob the Builder  123 Main St  12345
1   Bob the Builder                      $99,999.99
2   Bob the Builder  123 Sub St   54321  $74,483.01
3   Nigerian Prince  Area 51      33333  $999,999.99
3   Pinhead Larry    Las Vegas    31333  $11.00
4   Fox Mulder       Area 51             $0.99
Missing data is okay, unless it's obvious that rows can be merged. What I mean is that I want to merge the rows where both the ID and Name are the same and the other columns can fill in each other's blanks. For example, the dataset above would become:
ID  Name             Address      Zip    Cost
1   Bob the Builder  123 Main St  12345  $99,999.99
2   Bob the Builder  123 Sub St   54321  $74,483.01
3   Nigerian Prince  Area 51      33333  $999,999.99
3   Pinhead Larry    Las Vegas    31333  $11.00
4   Fox Mulder       Area 51             $0.99
I've thought about using df.groupby(["ID", "Name"]) and then concatenating the strings, since the missing values are empty strings, but had no luck with it.
The data has been scraped off websites, so it has had to go through a lot of cleaning to end up here. I can't think of an elegant way of figuring this out!
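For reference, the groupby idea can be made to work without concatenating strings. A minimal sketch (my own, assuming the blanks really are empty strings, and using the df built in the setup of the first answer below): treat '' as missing and take the first non-missing value per (ID, Name) group. Note it simply keeps the first value when a column is populated in more than one row of a group, and blanks that cannot be filled come back as NaN.
import numpy as np

merged = (
    df.replace('', np.nan)                            # treat empty strings as missing
      .groupby(['ID', 'Name'], as_index=False, sort=False)
      .first()                                        # first non-missing value per column
)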

The approach below only works if the rows we are potentially merging are next to each other.
setup
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    ID=[1, 1, 2, 3, 3, 4],
    Name=['Bob the Builder'] * 3 + ['Nigerian Prince', 'Pinhead Larry', 'Fox Mulder'],
    Address=['123 Main St', '', '123 Sub St', 'Area 51', 'Las Vegas', 'Area 51'],
    Zip=['12345', '', '54321', '33333', '31333', ''],
    Cost=['', '$99,999.99', '$74,483.01', '$999,999.99', '$11.00', '$0.99']
))[['ID', 'Name', 'Address', 'Zip', 'Cost']]
fill in missing values
replace '' with np.nan, then forward fill, then back fill
df_ = df.replace('', np.nan).ffill().bfill()
concat
take the filled-in rows from df_ where they are duplicates
take the original rows from df where df_ is not duplicated
pd.concat([
    df_[df_.duplicated()],
    df.loc[df_.drop_duplicates(keep=False).index]
])

I'll describe an algorithm:
Put aside all the rows where all fields are populated. We don't need to touch these.
Create a boolean DataFrame shaped like the input, where empty fields are False and populated fields are True. Since the blanks here are empty strings, this is df.ne('') (or df.notnull() after replacing '' with np.nan).
For each name in df.Name.unique():
Take df[df.Name == name] as the working set.
Sum each pair (or tuple) of boolean rows, which gives an integer vector as wide as the input minus the columns that are always populated. In the example this means [True, True, False] and [False, False, True], so the sum is [1, 1, 1].
If the sum is equal to 1 everywhere, that pair (or tuple) of rows can be merged.
But there are a ton of possible edge cases here, such as what to do if you have three rows A, B, C where you could merge either A+B or A+C. It will help if you can narrow down the constraints that exist in the data before implementing the merging algorithm.
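A rough sketch of that pairwise check (my own illustration, not part of the answer above; it reuses the df from the setup earlier, groups on both ID and Name since the question wants both to match, and only handles the simple case where a whole group collapses into one row):
def try_merge(group):
    # Boolean mask of populated fields over the columns that may be blank
    mask = group.drop(columns=['ID', 'Name']).ne('')
    # Merge only when every such column is populated in exactly one row of the group
    if len(group) > 1 and (mask.sum() == 1).all():
        merged = group.iloc[0].copy()
        for col in mask.columns:
            merged[col] = group.loc[mask[col], col].iloc[0]
        return merged.to_frame().T
    return group                                      # otherwise leave the group untouched

result = df.groupby(['ID', 'Name'], group_keys=False, sort=False).apply(try_merge)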

Related

Update an existing column in one dataframe based on the value of a column in another dataframe

I have two csv files as my raw data to read into different dataframes. One is called 'employee' and the other is called 'origin'. However, I cannot upload the files here, so I have hardcoded the data into the dataframes below. The task I'm trying to solve is to update the 'Eligible' column in employee_details with 'Yes' or 'No' based on the value of the 'Country' column in origin_details. If Country = UK, then put 'Yes' in the Eligible column for that Personal_ID; else, put 'No'.
import pandas as pd
import numpy as np

employee = {
    'Personal_ID': ['1000123', '1100258', '1104682', '1020943'],
    'Name': ['Tom', 'Joseph', 'Krish', 'John'],
    'Age': ['40', '35', '43', '51'],
    'Eligible': ' '}

origin = {
    'Personal_ID': ['1000123', '1100258', '1104682', '1020943', '1573482', '1739526'],
    'Country': ['UK', 'USA', 'FRA', 'SWE', 'UK', 'AU']}

employee_details = pd.DataFrame(employee)
origin_details = pd.DataFrame(origin)
employee_details['Eligible'] = np.where((origin_details['Country']) == 'UK', 'Yes', 'No')
print(employee_details)
print(origin_details)
The output of above code shows the below error message:
ValueError: Length of values (6) does not match length of index (4)
However, I am expecting to see the below as my output.
Personal_ID Name Age Eligible
0 1000123 Tom 40 Yes
1 1100258 Joseph 35 No
2 1104682 Krish 43 No
3 1020943 John 51 No
I also don't want to delete anything in my dataframes to match the size specified in the ValueError message, because I may need the extra Personal_IDs in origin_details later. Alternatively, I can keep all the existing Personal_IDs in the raw data (employee_details, origin_details) and create a new dataframe from those to extract the records that have the same Personal_IDs and apply the np.where() condition from there.
Please advise! Any help is appreciated, thank you!!
You can merge the two dataframes on Personal_ID and then use np.where.
Merge with how='outer' to keep all personal IDs
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='outer')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
4 1573482 NaN NaN Yes UK
5 1739526 NaN NaN No AU
If you don't want to keep all personal IDs, you can merge with how='inner' and you won't see the NaNs:
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='inner')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
You are using a Pandas Series object inside a Numpy method, np.where((origin_details['Country'])). I believe this is the problem.
try:
employee_details['Eligible'] = origin_details['Country'].apply(lambda x:"Yes" if x=='UK' else "No")
It is always much easier and faster to use the pandas library to analyze dataframes instead of converting them back to NumPy arrays.
Well, the first thing I want to address is the exception, and how lucky you are that it was raised: if your tables had happened to be the same length, your code would have run.
But there is an assumption in that code that I don't think you considered: the IDs may not be in the same order, or, as in the example, one table may have more IDs than the other. If the tables had the same length but not the same order, you would have got incorrect Eligible values for each row. The correct way to do this is as follows.
First, join the tables into one on Personal_ID, using a left join so you don't lose data when there is no origin info for a Personal_ID.
combine_df = pd.merge(employee_details, origin_details, on='Personal_ID', how='left')
Then use the apply function to fill the new column:
combine_df['Eligible'] = combine_df['Country'].apply(lambda x:'Yes' if x=='UK' else 'No')

How to drop rows in one DataFrame based on one similar column in another Dataframe that has a different number of rows

I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1#email.com 30
1 John Brown email2#email.com 35
2 Joe Max email3#email.com 40
3 Will Bill email4#email.com 25
4 Johnny Jacks email5#email.com 50
df2
ID Location Contact
0 5435 Austin email5#email.com
1 4234 Atlanta email1#email.com
2 7896 Toronto email3#email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2#email.com 35
3 Will Bill email4#email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, it identifies the matches, but df still contains all the original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped or creating a new DataFrame that contains only the rows with dissimilar values when comparing the Email column from df and the Contact column in df2. Appreciate any suggestions.
As mentioned in the comments (@Arkadiusz), it is enough to filter your data using the following:
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)

Pandas: creating a new column conditional on substring searches of one column and inverse of another column

I'd like to create a new column in a Pandas data frame based on a substring search of one column and an inverse of another column. Here is some data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Manufacturer': ['ABC-001', 'ABC-002', 'ABC-003', 'ABC-004', 'DEF-123', 'DEF-124', 'DEF-125', 'ABC-987', 'ABC-986', 'ABC-985'],
                   'Color': ['04-Red', 'vs Red - 07', 'Red', 'Red--321', np.nan, np.nan, np.nan, 'Blue', 'Black', 'Orange'],
                   })
Manufacturer Color
0 ABC-001 04-Red
1 ABC-002 vs Red - 07
2 ABC-003 Red
3 ABC-004 Red--321
4 DEF-123 NaN
5 DEF-124 NaN
6 DEF-125 NaN
7 ABC-987 Blue
8 ABC-986 Black
9 ABC-985 Orange
I would like to be able to create a new column named Country based on the following logic:
a) if the Manufacturer column contains the substring 'ABC' and the Color column contains the substring 'Red', then write 'United States' to the Country column
b) if the Manufacturer column contains the substring 'DEF', then write 'Canada' to the Country column
c) if the Manufacturer column contains the substring 'ABC' and the Color column does NOT contain the substring 'Red', then write 'England' to the Country column.
My attempt is as follows:
df['Country'] = np.where((df['Manufacturer'].str.contains('ABC')) & (df['Color'].str.contains('Red', na=False)), 'United States', # the 'a' case
np.where(df['Manufacturer'].str.contains('DEF', na=False), 'Canada', # the 'b' case
np.where((df['Manufacturer'].str.contains('ABC')) & (df[~df['Color'].str.contains('Red', na=False)]), 'England', # the 'c' case
'ERROR')))
But, this gets the following error:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
The error message suggests that it might be a matter of operator precedence, as mentioned in:
pandas comparison raises TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
Python error: TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
I believe I'm using parentheses properly here (although maybe I'm not).
Does anyone see the cause of this error? (Or know of a more elegant way to accomplish this?)
Thanks in advance!
You don't want to index into df here, so just do this:
Just change: (df[~df['Color'].str.contains('Red', na=False)])
to: ~df['Color'].str.contains('Red', na=False)
and it should work.
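Applied to the attempt in the question, the corrected expression would look something like this (just that one change, everything else as in the question):
df['Country'] = np.where(
    df['Manufacturer'].str.contains('ABC') & df['Color'].str.contains('Red', na=False),
    'United States',                                                   # the 'a' case
    np.where(
        df['Manufacturer'].str.contains('DEF', na=False),
        'Canada',                                                      # the 'b' case
        np.where(
            df['Manufacturer'].str.contains('ABC') & ~df['Color'].str.contains('Red', na=False),
            'England',                                                 # the 'c' case
            'ERROR')))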
Also, if you want to break this up for readability and to eliminate some repetition, I would suggest something like this:
# define the parameters that define the Country variable in another table
df_countries = pd.DataFrame(
{'letters': ['ABC', 'DEF', 'ABC'],
'is_red': [True, False, False],
'Country': ['United States', 'Canada', 'England']})
# add those identifying parameters to your current table as temporary columns
df['letters'] = df.Manufacturer.str.replace('-.*', '', regex=True)
df['is_red'] = df.Color.str.contains('Red', na=False)
# merge the tables together and drop the temporary key columns
df = df.merge(df_countries, how='left', on=['letters', 'is_red'])
df = df.drop(columns=['letters', 'is_red'])
Or more concise:
in_col = lambda col, string: df[col].str.contains(string, na=False)
conds = {'United States': in_col('Manufacturer', 'ABC') & in_col('Color', 'Red'),
'Canada': in_col('Manufacturer', 'DEF'),
'England': in_col('Manufacturer', 'ABC') & ~in_col('Color', 'Red')}
df['Country'] = np.select(condlist=conds.values(), choicelist=conds.keys())
Another way out is to use np.select(list of conditions, list of choices):
conditions = [
    df['Manufacturer'].str.contains('ABC') & df['Color'].str.contains('Red', na=False),
    df['Manufacturer'].str.contains('DEF', na=False),
    df['Manufacturer'].str.contains('ABC') & ~df['Color'].str.contains('Red', na=False)
]
choices = ['United States', 'Canada', 'England']
df['Country'] = np.select(conditions, choices)
Manufacturer Color Country
0 ABC-001 04-Red United States
1 ABC-002 vs Red - 07 United States
2 ABC-003 Red United States
3 ABC-004 Red--321 United States
4 DEF-123 NaN Canada
5 DEF-124 NaN Canada
6 DEF-125 NaN Canada
7 ABC-987 Blue England
8 ABC-986 Black England
9 ABC-985 Orange England
This is an easy and straightforward way to do it:
country = []
for index, row in df.iterrows():
    if 'DEF' in row['Manufacturer']:
        country.append('Canada')
    elif 'ABC' in row['Manufacturer']:
        if 'Red' in row['Color']:
            country.append('United States')
        else:
            country.append('England')
    else:
        country.append('')
df['Country'] = country
Of course there are more efficient ways to go about this without looping through the entire dataframe, but in almost all cases this should be sufficient.

Pandas groupby results - move some grouped column values into rows of new data frame

I have seen a number of similar questions but cannot find a straightforward solution to my issue.
I am working with a pandas dataframe containing contact information for constituent donors to a nonprofit. The data has Households and Individuals. Most Households have member Individuals, but not all Individuals are associated with a Household. There is no data that links the Household to the container Individuals, so I am attempting to match them up based on other data - Home Street Address, Phone Number, Email, etc.
A simplified version of the dataframe looks something like this:
Constituent Id  Type        Home Street
1234567         Household   123 Main St.
2345678         Individual  123 Main St.
3456789         Individual  123 Main St.
4567890         Individual  433 Elm Rd.
0123456         Household   433 Elm Rd.
1357924         Individual  500 Stack Ln.
1344444         Individual  500 Stack Ln.
I am using groupby in order to group the constituents. In this case, by Home Street. I'm trying to ensure that I only get groupings with more than one record (to exclude Individuals unassociated with a Household). I am using something like:
df1 = df.groupby('Home Street').filter(lambda x: len(x) > 1)
What I would like to do is somehow export the grouped dataframe to a new dataframe that includes the Household Constituent Id first, then any Individual Constituent Ids. And in the case that there is no Household in the grouping, place the Individual Constituents in the appropriate locations. The output for my data set above would look like:
Household  Individual  Individual
1234567    2345678     3456789
0123456    4567890
           1357924     1344444
I have toyed with iterating through the groupby object, but I feel like I'm missing some easy way to accomplish my task.
This should do it
df['Type'] = df['Type'] + '_' + (df.groupby(['Home Street','Type']).cumcount().astype(str))
df.pivot_table(index='Home Street', columns='Type', values='Constituent Id', aggfunc=lambda x: ' '.join(x)).reset_index(drop=True)
Output
Type Household_0 Individual_0 Individual_1
0 1234567 2345678 3456789
1 0123456 4567890 NaN
2 NaN 1357924 1344444
IIUC, we can use groupby agg(list) and some re-shaping using .join & explode:
s = df.groupby(["Home Street", "Type"]).agg(list).unstack(1).reset_index(
    drop=True
).droplevel(level=0, axis=1).explode("Household")
df1 = s.join(pd.DataFrame(s["Individual"].tolist()).add_prefix("Individual_")).drop(
    "Individual", axis=1
)
print(df1.fillna(' '))
  Household Individual_0 Individual_1
0   1234567      2345678      3456789
1   0123456      4567890
2             1357924      1344444
Or we can ditch the join and make Household the index:
df1 = pd.DataFrame(s["Individual"].tolist(), index=s["Household"])\
.add_prefix("Individual_")
print(df1)
Individual_0 Individual_1
Household
1234567 2345678 3456789
0123456 4567890 None
NaN 1357924 1344444

Update missing values in a column using pandas

I have a dataframe df with two of the columns being 'city' and 'zip_code':
df = pd.DataFrame({'city': ['Cambridge','Washington','Miami','Cambridge','Miami',
'Washington'], 'zip_code': ['12345','67891','23457','','','']})
As shown above, a particular city has a zip code in one of its rows, but the zip_code is missing for the same city in some other row. I want to fill those missing values based on the zip_code value for that city in another row. Basically, wherever there is a missing zip_code, it should check the zip_code for that city in other rows and, if found, fill in the value. If not found, fill in 'NA'.
How do I accomplish this task using pandas?
You can go for:
import numpy as np
df['zip_code'] = df.replace('', np.nan).groupby('city')['zip_code'].ffill().bfill()
>>> df
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
You can find the rows with missing zip codes by checking the string length using str.len. For those rows, filter the main df to the rows with valid zip_codes, set the index to 'city', and call map on the 'city' column, which will perform the lookup and fill in those values:
In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df
Out[255]:
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
If your real data has lots of repeating values then you'll need to additionally call drop_duplicates first:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])
The reason you need to do this is that it will raise an error if there are duplicate index entries.
My suggestion would be to first create a dictionary that maps from the city to the zip code. You can build this dictionary from the DataFrame itself.
Then use that dictionary to fill in all the missing zip code values.
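A minimal sketch of that suggestion (my own illustration, assuming the blanks are empty strings as in the question's df):
import numpy as np

# Build a city -> zip_code mapping from the rows that already have a zip code
zip_map = (df[df['zip_code'] != '']
           .drop_duplicates(subset='city')
           .set_index('city')['zip_code']
           .to_dict())

# Fill the blanks from the mapping; cities with no known zip code are left as NaN
df['zip_code'] = df['zip_code'].replace('', np.nan).fillna(df['city'].map(zip_map))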
