The problem is as follows:
The columns are: first name, last name, department (consulting or sales, abbreviated to C and S respectively), employee id, and salary. The salary column doesn't have any function in this example; it's just there to emphasize that there are actually lots of other columns.
Certain names are duplicated between departments.
Not sure if this helps, but first_name + last_name + id forms a unique identifier for each row. I have to use this because it's the shortest unique identifier that catches the most duplicates in previous duplicate-removal scenarios (see rows 1 and 2). I could go one step further and concatenate this identifier with even more columns, but that's just not a very elegant solution.
The initial dataframe is as follows:
first_name | last_name | id | dept | salary
-------------------------------------------
sarah | jones | C1 | C | 60000
sarah | jones | C2 | C | 55000
robert | jones | C3 | C | 50000
alice | clarke | C4 | C | 40000
alice | clarke | S1 | S | 40000
thomas | roberts | S2 | S | 45000
I'd like to remove row 4 (the alice clarke row that's associated with the consulting dept) and keep row 5, but retain the consulting dept ID. That is, I should have:
first_name | last_name | id | dept | salary
-------------------------------------------
sarah | jones | C1 | C | 60000
sarah | jones | C2 | C | 55000
robert | jones | C3 | C | 50000
alice | clarke | C4 | S | 40000
thomas | roberts | S2 | S | 45000
(IRL: I have two data sources, D1 and D2. D2 data is of a higher quality, but the ID used by D1 is more widely recognized, like an ISO standard in my field. So whenever D1 and D2 happen to give me the same row, I want to use the D1 ID, and the actual data from D2. )
The actual problem is a little more complicated than this MVWE (several duplicate-removal scenarios). I've tried cutting up the problem with some of my previous questions on duplicate removal or conditionally overriding values, but haven't been able to successfully tackle the whole thing, mostly because I've been unable to modularize the problem properly. This question on conditionally updating rows might help.
Per some of the commenters, your example is a little short on detail, but if I understand correctly, you basically have two data frames and want to keep some info from one and other info from the other. Assuming you're actually starting with two dataframes and are in control of merging them, combine_first() should do the trick:
import io
import pandas as pd

csv = io.StringIO('''
first last id dept salary
sarah jones C1 C 60
sarah jones C2 C 55
robert jones C3 C 50
alice clarke C4 C 40
thomas roberts S2 S 45
''')
df = pd.read_csv(csv, sep=r'\s+')

csv2 = io.StringIO('''
first last id dept salary
alice clarke S1 S 43
''')
df2 = pd.read_csv(csv2, sep=r'\s+')

# drop df2's id so the id from df (the D1-style id) survives combine_first()
df2 = df2.drop('id', axis=1)

print(df2.set_index(['first', 'last'])
         .combine_first(df.set_index(['first', 'last']))
         .reset_index())
Output:
first last dept id salary
0 alice clarke S C4 43.0
1 robert jones C C3 50.0
2 sarah jones C C1 60.0
3 sarah jones C C2 55.0
4 thomas roberts S S2 45.0
And of course you can sort as you see fit at that point.
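For example, to restore an id-based ordering, store the combined frame and sort it (combined is just an illustrative variable name):
combined = (df2.set_index(['first', 'last'])
               .combine_first(df.set_index(['first', 'last']))
               .reset_index())
print(combined.sort_values('id').reset_index(drop=True))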
If the starting point is the initial data frame you provided, and given that there are only two dept types, you can groupby name and then apply a selection/swapping function:
# using the initial data frame provided, copied to the clipboard
df = pd.read_clipboard().drop(index=0).drop(columns=['|', '|.1', '|.2', '|.3'])

def choose_data(data, chosen_field, chosen_value, swap_field):
    # if the group contains more than one dept, keep the chosen rows
    # but take swap_field (the id) from the other rows
    if len(data[chosen_field].unique()) > 1:
        chosen = data[data[chosen_field] == chosen_value].copy()
        chosen[swap_field] = data.loc[data[chosen_field] != chosen_value, swap_field].values
        return chosen
    return data
(df.groupby(['first_name', 'last_name'], as_index=False)
   .apply(choose_data,
          chosen_field='dept',
          chosen_value='S',
          swap_field='id')
   .reset_index(drop=True)
   .sort_values('id')
)
Yields:
first_name last_name id dept salary
0 sarah jones C1 C 60000.0
1 sarah jones C2 C 55000.0
2 robert jones C3 C 50000.0
3 alice clarke C4 S 40000.0
4 thomas roberts S2 S 45000.0
Note that the reset_index() and sort_values() are basically cosmetic; all that's really necessary is groupby() and apply().
I have two dataframes that I have to join. I am trying to merge on the columns ID and STATUS. The problem is that if, for a row in df2, STATUS is NULL, I still want it to match on just ID and bring in the name.
In other words: if STATUS has a value, match on it; otherwise match on ID alone and bring in the name.
mer_col_list = ['ID','STATUS']
df_out = pd.merge(df1,df2, on=mer_col_list, how='left')
df1
|ID | STATUS | NAME |
|11 | ACTIVE | John |
|22 | DORMANT| NICK |
|33 | NOT_ACTIVE| HARRY|
df2
|ID | STATUS | BRANCH |
|11 | DORMANT| USA |
|11 | | USA |
|22 | | UK |
|33 | NOT_ACTIVE | AUS|
df_out:
|ID | NAME | BRANCH|
|11 | John | USA |
|22 | NICK | UK |
|33 | HARRY| AUS |
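For reference, a minimal reproduction of the frames above (an assumption on my part: the blank STATUS cells are missing values):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [11, 22, 33],
                    'STATUS': ['ACTIVE', 'DORMANT', 'NOT_ACTIVE'],
                    'NAME': ['John', 'NICK', 'HARRY']})
df2 = pd.DataFrame({'ID': [11, 11, 22, 33],
                    'STATUS': ['DORMANT', np.nan, np.nan, 'NOT_ACTIVE'],
                    'BRANCH': ['USA', 'USA', 'UK', 'AUS']})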
You can create another left join on ID only, for the rows where STATUS is missing, and then combine both results with DataFrame.fillna:
df_out1 = df1.merge(df2, on=['ID','STATUS'], how='left')
df_out2 = df1.merge(df2[df2['STATUS'].isna()].drop('STATUS',axis=1), on=['ID'],how='left')
df_out = df_out1.fillna(df_out2)
print (df_out)
ID STATUS NAME BRANCH
0 11 ACTIVE John USA
1 22 DORMANT NICK UK
2 33 NOT_ACTIVE HARRY AUS
You can also handle duplicates, if they exist, by de-duplicating df2 on ['ID','STATUS'] after dropping the missing STATUS values, and on ID alone for the rows where STATUS is missing:
df21 = df2.dropna(subset=['STATUS']).drop_duplicates(['ID','STATUS'])
df_out1 = df1.merge(df21, on=['ID','STATUS'], how='left')
df22 = df2[df2['STATUS'].isna()].drop('STATUS',axis=1).drop_duplicates(['ID'])
df_out2 = df1.merge(df22, on='ID',how='left')
df_out = df_out1.fillna(df_out2)
print (df_out)
Dynamic solution, assuming ID is always present in mer_col_list:
import numpy as np

mer_col_list = ['ID','STATUS']
df21 = df2.dropna(subset=mer_col_list).drop_duplicates(mer_col_list)
df_out1 = df1.merge(df21, on=mer_col_list, how='left')
no_id_cols = np.setdiff1d(mer_col_list, ['ID'])
print (no_id_cols)
['STATUS']
df22=df2[df2[no_id_cols].isna().any(axis=1)].drop(no_id_cols,axis=1).drop_duplicates(['ID'])
df_out2 = df1.merge(df22, on='ID',how='left')
df_out = df_out1.fillna(df_out2)
print (df_out)
ID STATUS NAME BRANCH
0 11 ACTIVE John USA
1 22 DORMANT NICK UK
2 33 NOT_ACTIVE HARRY AUS
This is my first question on Stack Overflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.
I have two pandas dataframes formatted as follows
df1
+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2 | Descr 1 |
| 3 | Descr 2 |
| 2,3,5 | Descr 3 |
+------------+-------------+
df2
+--------+--------------+
| Ref_ID | ShortRef |
+--------+--------------+
| 1 | Smith (2006) |
| 2 | Mike (2009) |
| 3 | John (2014) |
| 4 | Cole (2007) |
| 5 | Jill (2019) |
| 6 | Tom (2007) |
+--------+--------------+
Basically, Ref_ID in df2 contains the IDs that make up the comma-separated string in the References field of df1.
What I would like to do is to replace values in the References field in df1 so it looks like this:
+-------------------------------------+-------------+
| References | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009) | Descr 1 |
| John (2014) | Descr 2 |
| Mike (2009);John (2014);Jill (2019) | Descr 3 |
+-------------------------------------+-------------+
So far I have only had to deal with columns and IDs that have a 1-1 relationship, and this works perfectly:
Pandas - Replacing Values by Looking Up in an Another Dataframe
But I cannot get my mind around this slightly different problem. The only solution I could think of is to loop with nested for and if statements, comparing every string in df1 against df2 and making the substitution.
This would be, I am afraid, very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation on several columns similar to the References one.
Is anyone willing to point me in the right direction?
Many thanks in advance.
Let's try this:
df1 = pd.DataFrame({'Reference': ['1,2', '3', '1,3,5'],
                    'Description': ['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID': [1, 2, 3, 4, 5, 6],
                    'ShortRef': ['Smith (2006)',
                                 'Mike (2009)',
                                 'John (2014)',
                                 'Cole (2007)',
                                 'Jill (2019)',
                                 'Tom (2007)']})
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                            .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(list))
Output:
Reference Description Reference2
0 1,2 Descr 1 [Smith (2006), Mike (2009)]
1 3 Descr 2 [John (2014)]
2 1,3,5 Descr 3 [Smith (2006), John (2014), Jill (2019)]
@Datanovice, thanks for the update. To get strings instead of lists, aggregate with ';'.join instead:
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                            .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(';'.join))
Output:
Reference Description Reference2
0 1,2 Descr 1 Smith (2006);Mike (2009)
1 3 Descr 2 John (2014)
2 1,3,5 Descr 3 Smith (2006);John (2014);Jill (2019)
You can use a list comprehension and dict lookups; I don't think this will be too slow.
First, make a fast-to-access mapping from Ref_ID to ShortRef:
mapping_dict = df2.set_index('Ref_ID')['ShortRef'].to_dict()
Then, let's split the references on commas:
df1_values = [v.split(',') for v in df1['References']]
Finally, iterate over the lists and do the dictionary lookups before concatenating back into strings (note the int() cast, since the Ref_ID keys are integers while the split values are strings):
df1['References'] = pd.Series([';'.join([mapping_dict[int(v)] for v in values]) for values in df1_values])
Is this usable or is it too slow?
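Since the question mentions repeating this for several columns similar to References, the same dictionary lookup can be wrapped in a loop. A minimal sketch; the list ref_like_cols is hypothetical and should name your actual columns:
ref_like_cols = ['References']  # extend with the other comma-separated ID columns

for col in ref_like_cols:
    df1[col] = df1[col].apply(
        lambda s: ';'.join(mapping_dict[int(v)] for v in s.split(',')))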
Another solution is to use str.get_dummies and dot:
df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values + ';').str.strip(';')
               .rename('References')
               .reset_index())
Out[462]:
Description References
0 Descr 1 Smith (2006);Mike (2009)
1 Descr 2 John (2014)
2 Descr 3 Mike (2009);John (2014);Jill (2019)
I am matching two large data sets and trying to perform update, remove, and create operations on the original data set by comparing it with the other data set. How can I update 2 or 3 columns out of 10 in the original data set while keeping the other columns' values the same as before?
I tried merge, but to no avail.
Original data:
id | full_name | date
1 | John | 02-23-2006
2 | Paul Elbert | 09-29-2001
3 | Donag | 11-12-2013
4 | Tom Holland | 06-17-2016
other data:
id | full_name | date
1 | John | 02-25-2018
2 | Paul | 03-09-2001
3 | Donag | 07-09-2017
4 | Tom | 05-09-2016
Is it possible to update date column of original data on the basis of ID?
Answering your question:
"When IDs match, update all values in the date column without changing any value in the name column of the original data set."
original = pd.DataFrame({'id': ['1', '2'],
                         'full_name': ['John', 'Paul Elbert'],
                         'date': ['02-23-2006', '09-29-2001']})
other = pd.DataFrame({'id': ['1', '2'],
                      'full_name': ['John', 'Paul'],
                      'date': ['02-25-2018', '03-09-2001']})
original = original[['id','full_name']].merge(other[['id','date']],on='id')
print(original)
id full_name date
0 1 John 02-25-2018
1 2 Paul Elbert 03-09-2001
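The merge above keeps only the ids that appear in both frames. If you also need to keep rows of original whose id has no match in other, DataFrame.update (which aligns on the index and only overwrites where the other frame provides values) is an alternative, sketched here:
original = original.set_index('id')
original.update(other.set_index('id')[['date']])  # overwrites only the date column
original = original.reset_index()
print(original)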
Let's say I have three dataframes as follows, and I would like to find out in which of them a particular record exists.
this is dataframe1 (df1)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | rider | 223344 | Mexico
This is dataframe2 (df2)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | keith | 993344 | Brazil
This is dataframe3 (df3)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | hopper | 444444 | Canada
So, if I run the following code, I can find all the information about acct_no 112233 for a single dataframe.
p = df1.loc[df1['acct_no'] == 112233]
But I would like to know what code will help me find out whether acct_no 112233 exists in df1, df2, and df3.
One way to know whether the element is in the 'acct_no' column of a dataframe is:
>>> (df1['acct_no'] == 112233).any()
True
You could check whether it exists in all of them at the same time by doing:
>>> all([(df['acct_no'] == 112233).any() for df in [df1, df2, df3]])
True
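Since the question asks which of the frames contain the record, rather than whether all of them do, you can also collect the matching names explicitly, for example:
frames = {'df1': df1, 'df2': df2, 'df3': df3}
matches = [name for name, df in frames.items() if (df['acct_no'] == 112233).any()]
print(matches)  # ['df1', 'df2', 'df3'] here, since all three contain 112233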
I have a data frame with the structure below:
ID | Name | Role
1 | John | Owner
1 | Bob | Driver
2 | Jake | Owner
2 | Tom | Driver
2 | Sally | Owner
3 | Mary | Owner
3 | Sue | Driver
I'd like to pivot the Role column and use the Name column as the values, but since some IDs (the index in this case) have more than one person in the Owner role and some don't, the pivot_table function doesn't work. Is there a way to create a new column for each additional owner a particular ID may have? Some may have 2, 3, 4+ owners. Thanks!
Sample output below:
ID | Owner_1 | Owner_2 | Driver
1 | John | NaN | Bob
2 | Jake | Sally | Tom
3 | Mary | NaN | Sue
This is what I tried:
pd.pivot_table(df,values='Name',index='ID',columns='Role')
DataError: No numeric types to aggregate
You can create an additional key for the duplicated role within each ID by using cumcount, and then simply use pivot:
df.Role = df.Role + '_' + df.groupby(['ID', 'Role']).cumcount().add(1).astype(str)
df.pivot(index='ID', columns='Role', values='Name')
Out[432]:
Role Driver_1 Owner_1 Owner_2
ID
1 Bob John None
2 Tom Jake Sally
3 Sue Mary None
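If you want the column names to match the desired output exactly (a plain Driver column rather than Driver_1), a rename on the pivoted result works; out is just an illustrative variable name:
out = (df.pivot(index='ID', columns='Role', values='Name')
         .rename(columns={'Driver_1': 'Driver'}))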
You need to change the default aggregation function from mean (which fails on strings) to one that joins the names, for example ' '.join:
pivoted = pd.pivot_table(df, values='Name',
                         index='ID', columns='Role', aggfunc=' '.join)
#Role Driver Owner
#ID
#1 Bob John
#2 Tom Jake Sally
#3 Sue Mary
Now, some owners are represented as multiword strings. Split them into individual words:
result = (pivoted.join(pivoted['Owner'].str.split().apply(pd.Series))
                 .drop("Owner", axis=1))
# Driver 0 1
#ID
#1 Bob John NaN
#2 Tom Jake Sally
#3 Sue Mary NaN
result.columns = "Driver", "Owner_1", "Owner_2"
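If the number of owners is not known in advance, the Owner_1..Owner_n names can be generated instead of hard-coded. A sketch under that assumption, applied to the joined frame before renaming:
joined = pivoted.join(pivoted['Owner'].str.split().apply(pd.Series)).drop('Owner', axis=1)
n_owners = joined.shape[1] - 1  # every column except Driver came from the split
joined.columns = ['Driver'] + [f'Owner_{i + 1}' for i in range(n_owners)]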