Replace comma-separated values in a dataframe with values from another dataframe - python

This is my first question on StackOverflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.
I have two pandas dataframes formatted as follows:
df1
+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2 | Descr 1 |
| 3 | Descr 2 |
| 2,3,5 | Descr 3 |
+------------+-------------+
df2
+--------+--------------+
| Ref_ID | ShortRef |
+--------+--------------+
| 1 | Smith (2006) |
| 2 | Mike (2009) |
| 3 | John (2014) |
| 4 | Cole (2007) |
| 5 | Jill (2019) |
| 6 | Tom (2007) |
+--------+--------------+
Basically, Ref_ID in df2 contains the IDs that make up the comma-separated string in the References field of df1.
What I would like to do is to replace values in the References field in df1 so it looks like this:
+-------------------------------------+-------------+
| References | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009) | Descr 1 |
| John (2014) | Descr 2 |
| Mike (2009);John (2014);Jill (2019) | Descr 3 |
+-------------------------------------+-------------+
So far I have only had to deal with columns and IDs that have a 1-to-1 relationship, and this works perfectly:
Pandas - Replacing Values by Looking Up in an Another Dataframe
But I cannot get my mind around this slightly different problem. The only solution I could think of is nested for loops and if statements that compare every string in df1 against df2 and make the substitution.
This would be, I am afraid, very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation on several columns similar to References.
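To make the problem concrete, this is roughly the loop-and-compare approach I was hoping to avoid (just a rough sketch using the columns above):
# naive sketch: for every ID in the comma-separated string, scan df2 for a matching row
new_refs = []
for refs in df1['References']:
    parts = []
    for ref_id in refs.split(','):
        for _, row in df2.iterrows():
            if str(row['Ref_ID']) == ref_id:
                parts.append(row['ShortRef'])
    new_refs.append(';'.join(parts))
df1['References'] = new_refs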
Is anyone willing to point me in the right direction?
Many thanks in advance.

Let's try this:
import pandas as pd

df1 = pd.DataFrame({'Reference': ['1,2', '3', '1,3,5'],
                    'Description': ['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID': [1, 2, 3, 4, 5, 6],
                    'ShortRef': ['Smith (2006)', 'Mike (2009)', 'John (2014)',
                                 'Cole (2007)', 'Jill (2019)', 'Tom (2007)']})
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                            .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(list))
Output:
Reference Description Reference2
0 1,2 Descr 1 [Smith (2006), Mike (2009)]
1 3 Descr 2 [John (2014)]
2 1,3,5 Descr 3 [Smith (2006), John (2014), Jill (2019)]
@Datanovice thanks for the update.
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                            .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(';'.join))
Output:
Reference Description Reference2
0 1,2 Descr 1 Smith (2006);Mike (2009)
1 3 Descr 2 John (2014)
2 1,3,5 Descr 3 Smith (2006);John (2014);Jill (2019)
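Since the question mentions having to repeat this for several columns similar to References, here is a hedged sketch that wraps the same logic in a small helper (only 'Reference' exists in the sample data; any further column names would be specific to your data):
ref_map = df2.assign(Ref_ID=df2.Ref_ID.astype(str)).set_index('Ref_ID')['ShortRef']

def replace_refs(col):
    # split the comma-separated IDs, map each one to its ShortRef, join back with ';'
    return (col.str.split(',')
               .explode()
               .map(ref_map)
               .groupby(level=0)
               .agg(';'.join))

for c in ['Reference']:   # add your other reference-like columns here
    df1[c + '_short'] = replace_refs(df1[c])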

You can use a list comprehension and dict lookups, and I don't think this will be too slow.
First, build a fast-to-access mapping from Ref_ID to ShortRef:
mapping_dict = df2.set_index('Ref_ID')['ShortRef'].to_dict()
Then, let's split the references on commas:
df1_values = [v.split(',') for v in df1['References']]
Finally, we can iterate over the lists and do the dictionary lookups before concatenating back into strings (casting each split value back to int, since Ref_ID is numeric):
df1['References'] = pd.Series([';'.join([mapping_dict[int(v)] for v in values]) for values in df1_values])
Is this usable or is it too slow?
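If you prefer to avoid the int cast, the same idea can be written with string keys built up front; a small equivalent sketch, starting again from the original comma-separated column:
# build the lookup with string keys so the split values match directly
mapping_dict = df2.assign(Ref_ID=df2['Ref_ID'].astype(str)).set_index('Ref_ID')['ShortRef'].to_dict()
df1['References'] = [';'.join(mapping_dict[v] for v in refs.split(',')) for refs in df1['References']]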

Another solution is to use str.get_dummies and dot:
df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values + ';').str.strip(';')
               .rename('References')
               .reset_index())
Out[462]:
Description References
0 Descr 1 Smith (2006);Mike (2009)
1 Descr 2 John (2014)
2 Descr 3 Mike (2009);John (2014);Jill (2019)
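To see why this works: get_dummies(',') builds a 0/1 indicator matrix with one column per Ref_ID, and the dot product against the ShortRef strings (each with a ';' appended) relies on Python's string arithmetic, where 1 * 'Smith (2006);' is the string itself and 0 * 'Mike (2009);' is the empty string, so the matrix product concatenates exactly the matched references and strip(';') removes the trailing separator. A minimal illustration of that string arithmetic:
import numpy as np

# indicator row for the References string '1,3' over IDs 1..3
indicator = np.array([[1, 0, 1]], dtype=object)
labels = np.array(['Smith (2006);', 'Mike (2009);', 'John (2014);'], dtype=object)
print(indicator.dot(labels))   # ['Smith (2006);John (2014);'] -- strip the trailing ';'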

Related

DataFrame groupby on each item within a column of lists

I have a dataframe (df):
| A | B | C |
| --- | ----- | ----------------------- |
| CA | Jon | [sales, engineering] |
| NY | Sarah | [engineering, IT] |
| VA | Vox | [services, engineering] |
I am trying to group by each item in the C column list (sales, engineering, IT, etc.).
Tried:
df.groupby('C')
but got an error that a list is not hashable, which is expected. I came across another post where it was recommended to convert the C column to tuples, which are hashable, but I need to group by each item and not the combination.
My goal is to get the count of each row in the df for each item in the C column list. So:
sales: 1
engineering: 3
IT: 1
services: 1
While there is probably a simpler way to obtain this than using groupby, I am still curious if groupby can be used in this case.
You can use explode and value_counts:
out = df.explode("C").value_counts("C")
Output:
print(out)
C
engineering 3
IT 1
sales 1
services 1
dtype: int64
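Since the question specifically asks whether groupby can be used here, the same counts can be obtained by exploding and then grouping; a small equivalent sketch:
out = df.explode("C").groupby("C").size()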

Group by and choose the string value of a column based on a condition using pandas

I have a dataframe consisting of people: ('id','name','occupation').
| id | name | occupation |
|:--:|:--------:|:----------:|
| 1 | John | artist |
| 1 | John | painter |
| 2 | Mary | consultant |
| 3 | Benjamin | architect |
| 3 | Benjamin | manager |
| 4 | Alice | intern |
| 4 | Alice | architect |
Task:
Some people have multiple occupations; however, I need each person to have only one. For this I am trying to use the pandas groupby function.
Issue:
So far so good; however, I need to apply a condition based on their occupation, and this is where I got stuck.
The condition is simple:
if "architect" is in the 'occupation' of the group (person):
   keep the 'occupation' as "architect"
else:
   keep any/last/first (it doesn't matter) 'occupation'
The desired output would be:
| id | name | occupation |
|:--:|:--------:|:----------:|
| 1 | John | artist |
| 2 | Mary | consultant |
| 3 | Benjamin | architect |
| 4 | Alice | architect |
Attempt:
def one_occupation_per_person(occupation):
    if "architect" in occupation:
        return "architect"
    else:
        return ???

df.groupby(['id','name'])['occupation'].apply(lambda x: one_occupation_per_person(x['occupation']), axis=1)
I hope this describes the issue clear enough. Any hints and ideas are appreciated!
Since architect comes out as the first item in a natural sort, you can simply sort on occupation and then groupby:
df.sort_values("occupation").groupby("id", as_index=False).first()
If you somehow had another occupation that sorts before architect, you can convert the column to pd.Categorical before sorting:
s = ["architect"] + df.loc[df["occupation"].ne("architect"),"occupation"].unique().tolist()
df["occupation"] = pd.Categorical(df["occupation"], ordered=True, categories=s)
print (df.sort_values("occupation").groupby("id", as_index=False).first())
Result:
id name occupation
0 1 John artist
1 2 Mary consultant
2 3 Benjamin architect
3 4 Alice architect
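If you would rather stay close to the helper function from the attempt, here is a hedged sketch of a groupby/agg version that fills in the missing return (assuming df still holds the original string occupation column):
def one_occupation_per_person(occupations):
    # keep "architect" if it appears anywhere in the group, otherwise take the first occupation
    if "architect" in occupations.values:
        return "architect"
    return occupations.iloc[0]

out = df.groupby(['id', 'name'])['occupation'].agg(one_occupation_per_person).reset_index()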

Copy data from 1 data-set to another on the basis of Unique ID

I am matching two large data sets and trying to perform update, remove, and create operations on the original data set by comparing it with the other data set. How can I update 2 or 3 columns out of 10 in the original data set while keeping the other columns' values the same as before?
I tried merge, but to no avail.
Original data:
id | full_name | date
1 | John | 02-23-2006
2 | Paul Elbert | 09-29-2001
3 | Donag | 11-12-2013
4 | Tom Holland | 06-17-2016
other data:
id | full_name | date
1 | John | 02-25-2018
2 | Paul | 03-09-2001
3 | Donag | 07-09-2017
4 | Tom | 05-09-2016
Is it possible to update date column of original data on the basis of ID?
Answering your question:
"When ID match code update all values in date column without changing any value in name column of original data set"
original = pd.DataFrame({'id': ['1', '2'],
                         'full_name': ['John', 'Paul Elbert'],
                         'date': ['02-23-2006', '09-29-2001']})
other = pd.DataFrame({'id': ['1', '2'],
                      'full_name': ['John', 'Paul'],
                      'date': ['02-25-2018', '03-09-2001']})
original = original[['id', 'full_name']].merge(other[['id', 'date']], on='id')
print(original)
id full_name date
0 1 John 02-25-2018
1 2 Paul Elbert 03-09-2001
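Since the real data set has about 10 columns, a hedged alternative that only overwrites the date column for matching ids and leaves every other column untouched (a sketch using map on the original frame, before the merge above):
# map each id to its new date; rows whose id has no match keep their old date
date_map = other.set_index('id')['date']
original['date'] = original['id'].map(date_map).fillna(original['date'])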

Find a record across multiple Python pandas dataframes

Let's say, I have three dataframes as follows, and I would like to find in which dataframes a particular record exists.
this is dataframe1 (df1)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | rider | 223344 | Mexico
This is dataframe2 (df2)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | keith | 993344 | Brazil
This is dataframe3 (df3)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | hopper | 444444 | Canada
So, if I run the following code, I can find all the information about acct_no 112233 for a single dataframe.
p = df1.loc[df1['acct_no'] == 112233]
But I would like to know what code will help me find out whether acct_no 112233 exists in df1, df2, and df3.
One way to know if the element is in the column 'acct_no' of the dataframe is:
>> (df1['acct_no']==112233).any()
True
You could check all at the same time by doing:
>> all([(df['acct_no']==112233).any() for df in [df1, df2, df3]])
True
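Since the goal is to know which dataframes contain the record, not only whether all of them do, a small sketch that names them explicitly (the dict of names is only for illustration):
frames = {'df1': df1, 'df2': df2, 'df3': df3}
found_in = [name for name, frame in frames.items() if (frame['acct_no'] == 112233).any()]
print(found_in)   # e.g. ['df1', 'df2', 'df3'] for the sample data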

Adding new columns based on number of unique row values in Pandas

I have a data frame with the structure below:
ID | Name | Role
1 | John | Owner
1 | Bob | Driver
2 | Jake | Owner
2 | Tom | Driver
2 | Sally | Owner
3 | Mary | Owner
3 | Sue | Driver
I'd like to pivot the Role column and have the Name column as the value, but since some IDs (the index in this case) have more than one person in the owner role and some don't, the pivot_table function doesn't work. Is there a way to create a new column for each additional owner a particular ID may have? Some may have 2, 3, 4+ owners. Thanks!
Sample output below:
ID | Owner_1 | Owner_2 | Driver
1 | John | NaN | Bob
2 | Jake | Sally | Tom
3 | Mary | NaN | Sue
This is what I tried:
pd.pivot_table(df,values='Name',index='ID',columns='Role')
DataError: No numeric types to aggregate
You can create an additional key for the duplicated roles within each ID by using cumcount, then simply use pivot:
df.Role = df.Role + '_' + df.groupby(['ID', 'Role']).cumcount().add(1).astype(str)
df.pivot(index='ID', columns='Role', values='Name')
Out[432]:
Role Driver_1 Owner_1 Owner_2
ID
1 Bob John None
2 Tom Jake Sally
3 Sue Mary None
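To match the column names in the desired output exactly, the lone driver column can be renamed afterwards; a small follow-up sketch (run after the cumcount step above):
out = df.pivot(index='ID', columns='Role', values='Name').rename(columns={'Driver_1': 'Driver'})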
You need to change the default aggregation function from mean to sum:
pivoted = pd.pivot_table(df, values='Name', index='ID',
                         columns='Role', aggfunc='sum')
#Role Driver Owner
#ID
#1 Bob John
#2 Tom Jake Sally
#3 Sue Mary
Now, some owners are represented as multiword strings. Split them into individual words:
result = pivoted.join(pivoted['Owner'].str.split().apply(pd.Series))\
                .drop("Owner", axis=1)
# Driver 0 1
#ID
#1 Bob John NaN
#2 Tom Jake Sally
#3 Sue Mary NaN
result.columns = "Driver", "Owner_1", "Owner_2"
