Is there a way to groupby concatenating multiple strings? - python

I have the following df:
Name Role Company [other columns]
John Admin GM
John Director Kodak
John Partner McDonalds
Mark Director Gerdau
Mark Partner Kibon
I want to turn it into:
Name Companies [other columns]
John GM (Admin), Kodak (Director), McDonalds (Partner)
Mark Gerdau (Director), Kibon (Partner)
I think the answer is somewhere in groupby; this question is almost there, but I need a way to do it while iterating over two columns and putting the parentheses in place.

IIUC, assign and groupby:
df1 = df.assign(companies=df['Company'] + ' (' + df['Role'] + ')')\
        .groupby('Name')['companies'].agg(','.join)
print(df1)
Name
John GM (Admin),Kodak (Director),McDonalds (Partner)
Mark Gerdau (Director),Kibon (Partner)
Name: companies, dtype: object
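If you want the ', ' separator from the desired output and a regular DataFrame back instead of a Series, a minimal end-to-end sketch could look like the following; the frame is rebuilt here from the example above, so the column names and values are assumptions:
import pandas as pd

# Hypothetical reconstruction of the example frame from the question
df = pd.DataFrame({
    'Name': ['John', 'John', 'John', 'Mark', 'Mark'],
    'Role': ['Admin', 'Director', 'Partner', 'Director', 'Partner'],
    'Company': ['GM', 'Kodak', 'McDonalds', 'Gerdau', 'Kibon'],
})

# Build "Company (Role)" per row, then join those strings within each Name group
out = (
    df.assign(Companies=df['Company'] + ' (' + df['Role'] + ')')
      .groupby('Name', as_index=False)['Companies']
      .agg(', '.join)
)
print(out)
# One row per Name, e.g.
# John -> GM (Admin), Kodak (Director), McDonalds (Partner)
# Mark -> Gerdau (Director), Kibon (Partner)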


In python/pandas is there a way to find a row that has a duplicate value in one column and a unique value in another? [duplicate]

For example, say I have a dataframe like:
Date Registered  Name        Gift
2021-10-30       John Doe    Money
2021-10-30       John Doe    Food
2021-11-02       Tyler Blue  Gift Card
2021-11-02       Tyler Blue  Food
2021-12-01       John Doe    Supplies
I want to locate all indexes where an entry in Name has a unique value in Date Registered, like so:
Date Registered  Name        Gift
2021-10-30       John Doe    Money
2021-11-02       Tyler Blue  Gift Card
2021-12-01       John Doe    Supplies
I tried this:
name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep='last')
def extract_name(TableName):
    return TableName.duplicated(subset=['Name'])
extract_name(name_view)
But this does not get rid of all the indexes with duplicate dates. Any suggestions? I'm fine with it simply returning a list of the indexes as well; it isn't required to output the full row.
You were already there with pd.drop_duplicates():
>>> df.drop_duplicates(subset=['Date Registered', 'Name'])
Date Registered Name Gift
0 2021-10-30 John Doe Money
2 2021-11-02 Tyler Blue Gift Card
4 2021-12-01 John Doe Supplies
The indices are therefore:
>>> df.drop_duplicates(subset=['Date Registered', 'Name']).index
Int64Index([0, 2, 4], dtype='int64')
This will give the requested output as a list of index values:
print(df.reset_index().groupby(['Date Registered','Name']).first()['index'].tolist())
Output:
[0, 2, 4]
Just use pd.drop_duplicates() on what you already have:
import pandas as pd

df = pd.read_csv(csv_file, encoding='latin-1')
df = df.drop_duplicates(subset=['Date Registered', 'Name'])
print(df)
Adding .reset_index() to the end of your "name_view" variable removed all excess rows when I ran it against your example data (I had to change the name of the first column to make it work).
name_view = df.drop_duplicates(subset=['Name', 'DateTime'], keep='last').reset_index()
def extract_name(TableName):
    return TableName.duplicated(subset=['Name'])
extract_name(name_view)
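The duplicated() route the question started with also works if it is pointed at both columns at once. A small sketch, assuming the date column is named 'Date Registered' as in the example:
# Rows that are NOT a repeat of an earlier (date, name) pair
mask = ~df.duplicated(subset=['Date Registered', 'Name'], keep='first')
first_per_day = df[mask]
print(first_per_day.index.tolist())   # [0, 2, 4] for the example data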

Pandas Number of Unique Values from 2 Fields

I am trying to find the number of unique values across a combination of 2 fields. A typical example would be last name and first name. I have a data frame.
When I do the following, I just get the number of unique fields for each column, in this case, Last and First. Not a composite.
df[['Last Name','First Name']].nunique()
Thanks!
Group by both columns first, then count the resulting groups with ngroups:
>>> df.groupby(['First Name', 'Last Name']).ngroups
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the length of that Series will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
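If you prefer to avoid value_counts(), two equivalent one-liners for the same extended frame are dropping duplicate pairs and counting the remaining rows, or counting groupby groups:
# Number of unique (Last Name, First Name) pairs
len(df[['Last Name', 'First Name']].drop_duplicates())   # 5
df.groupby(['Last Name', 'First Name']).ngroups          # 5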

Match two columns in dataframe

I have two columns in dataframe df
ID Name
AXD2 SAM S
AXD2 SAM
SCA4 JIM
SCA4 JIM JONES
ASCQ JOHN
I need the output to keep each unique ID, matched with the first name only:
ID Name
AXD2 SAM S
SCA4 JIM
ASCQ JOHN
Any suggestions?
You can use groupby with agg and take the first Name:
df.groupby(['ID']).agg(first_name=('Name', 'first')).reset_index()
Use drop_duplicates:
out = df.drop_duplicates('ID', ignore_index=True)
print(out)
# Output
ID Name
0 AXD2 SAM S
1 SCA4 JIM
2 ASCQ JOHN
You can use cumcount() to keep only the first occurrence of each ID:
df['RN'] = df.groupby(['ID']).cumcount() + 1
df = df.loc[df['RN'] == 1]
df[['ID', 'Name']]
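Another equivalent, shown here only as a sketch: take the first row of each ID group with head(1), which keeps the original row order and column layout:
out = df.groupby('ID', sort=False).head(1).reset_index(drop=True)
print(out)
#      ID   Name
# 0  AXD2  SAM S
# 1  SCA4    JIM
# 2  ASCQ   JOHN
All of these keep whichever row happens to appear first for each ID, so sort the frame beforehand if "first" has a more specific meaning in your data.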

Break up a data-set into separate excel files based on a certain row value in a given column in Pandas?

I have a fairly large dataset that I would like to split into separate excel files based on the names in column A ("Agent" column in the example provided below). I've provided a rough example of what this data-set looks like in Ex1 below.
Using pandas, what is the most efficient way to create a new excel file for each of the names in column A, or the Agent column in this example, preferably with the name found in column A used in the file title?
For example, in the given example, I would like separate files for John Doe, Jane Doe, and Steve Smith containing the information that follows their names (Business Name, Business ID, etc.).
Ex1
Agent Business Name Business ID Revenue
John Doe Bobs Ice Cream 12234 $400
John Doe Car Repair 445848 $2331
John Doe Corner Store 243123 $213
John Doe Cool Taco Stand 2141244 $8912
Jane Doe Fresh Ice Cream 9271499 $2143
Jane Doe Breezy Air 0123801 $3412
Steve Smith Big Golf Range 12938192 $9912
Steve Smith Iron Gyms 1231233 $4133
Steve Smith Tims Tires 82489233 $781
I believe python / pandas would be an efficient tool for this, but I'm still fairly new to pandas, so I'm having trouble getting started.
I would loop over the groups of names, then save each group to its own Excel file:
s = df.groupby('Agent')
for name, group in s:
    group.to_excel(f"{name}.xlsx")
Use a list comprehension with groupby on the Agent column:
dfs = [d for _, d in df.groupby('Agent')]
for df in dfs:
    print(df, '\n')
Output
Agent Business Name Business ID Revenue
4 Jane Doe Fresh Ice Cream 9271499 $2143
5 Jane Doe Breezy Air 123801 $3412
Agent Business Name Business ID Revenue
0 John Doe Bobs Ice Cream 12234 $400
1 John Doe Car Repair 445848 $2331
2 John Doe Corner Store 243123 $213
3 John Doe Cool Taco Stand 2141244 $8912
Agent Business Name Business ID Revenue
6 Steve Smith Big Golf Range 12938192 $9912
7 Steve Smith Iron Gyms 1231233 $4133
8 Steve Smith Tims Tires 82489233 $781
Grouping is what you are looking for here. You can iterate over the groups, which gives you the grouping attributes and the data associated with that group. In your case, the Agent name and the associated business columns.
Code:
import pandas as pd

# make up some data
ex1 = pd.DataFrame([['A', 1], ['A', 2], ['B', 3], ['B', 4]], columns=['letter', 'number'])

# iterate over the grouped data and export the data frames to excel workbooks
for group_name, data in ex1.groupby('letter'):
    # you probably have more complicated naming logic
    # use index=False if you have not set an index on the dataframe to avoid an extra column of indices
    data.to_excel(group_name + '.xlsx', index=False)
Use the unique values in the column to subset the data and write it to CSV using the name:
import pandas as pd

for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_csv(f"{unique_val}.csv")
If you need Excel:
import pandas as pd

for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_excel(f"{unique_val}.xlsx")
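One practical refinement, sketched here under the assumption that the openpyxl engine is installed: the agent names go straight into file names above, so it can help to strip characters the filesystem will not accept:
import re
import pandas as pd

for agent, group in df.groupby('Agent'):
    # keep letters, digits, spaces, hyphens and underscores in the file name
    safe_name = re.sub(r'[^\w\- ]', '_', str(agent)).strip()
    group.to_excel(f"{safe_name}.xlsx", index=False, engine='openpyxl')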

Pandas merge two dataframes without duplicating column

My question is similar to Pandas Merge - How to avoid duplicating columns but I cannot find a solution for the specific example below.
I have DataFrame df:
Customer Address
J. Smith 10 Sunny Rd Timbuktu
and Dataframe emails:
Name Email
J. Smith j.smith#myemail.com
I want to merge the two dataframes to produce:
Customer Address Email
J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com
I am using the following code:
data_names = {'Name':data_col[1], ...}
mapped_name = data_names['Name']
df = df.merge(emails, how='inner', left_on='Customer', right_on=mapped_name)
The result is:
Customer Address Email Name
J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com J. Smith
While I could just delete the column named mapped_name, there is the possibility that mapped_name could be 'Customer', and in that case I don't want to remove both Customer columns.
Any ideas?
I think you can rename the first column in the emails dataframe to Customer; how='inner' can be omitted because it is the default value:
emails.columns = ['Customer'] + emails.columns[1:].tolist()
df = df.merge(emails, on='Customer')
print (df)
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com
And a similar solution to the other answer - it is possible to rename the first column, selected by [0]:
df = df.merge(emails.rename(columns={emails.columns[0]:'Customer'}), on='Customer')
print (df)
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com
You can just rename your email name column to 'Customer' and then merge. This way, you don't need to worry about dropping the column at all.
df.merge(emails.rename(columns={mapped_name:'Customer'}), how='inner', on='Customer')
Out[53]:
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com
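An alternative that sidesteps the duplicate column entirely, whatever mapped_name turns out to be, is to merge against the emails frame's index, so the right-hand key never becomes a column in the result; a sketch using the names from the question:
merged = df.merge(
    emails.set_index(mapped_name),  # mapped_name as defined in the question
    how='inner',
    left_on='Customer',
    right_index=True,
)
# merged has Customer, Address and Email columns only - no second key column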
