Pandas aggregate data by same ID and comma separate values in column

I have data such as the following:
ID  Category
1   Finance
2   Computer Science
3   Data Science
1   Marketing
2   Finance
My goal is to aggregate rows that share an ID into one row and combine their differing categories into a single column, separated by commas, such as the following:
ID  Category
1   Finance, Marketing
2   Computer Science, Finance
3   Data Science
How would I go about this using Pandas?
Edit:
I also have other columns for the IDs that I would like to keep. For example:
ID  Category          Location
1   Finance           New York
2   Computer Science  Los Angeles
3   Data Science      Austin
1   Marketing         New York
2   Finance           Los Angeles
Since the additional data in the other columns is the same for all rows with the same ID (e.g., ID 1 has the same location in every instance, as does ID 2), I would like to keep every column and its data, like this:
ID  Category                   Location
1   Finance, Marketing         New York
2   Computer Science, Finance  Los Angeles
3   Data Science               Austin
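One possible approach, sketched here under the assumption that the column names are exactly ID, Category, and Location and that Location really is constant within each ID: group by ID, join the categories with ", ", and take the first value of every other column. The sample frame below is rebuilt from the tables in the question.

```python
import pandas as pd

# Sample data rebuilt from the question's tables
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2],
    "Category": ["Finance", "Computer Science", "Data Science", "Marketing", "Finance"],
    "Location": ["New York", "Los Angeles", "Austin", "New York", "Los Angeles"],
})

# Join the categories per ID; keep the first Location
# (assumed to be identical for all rows of the same ID)
result = df.groupby("ID", as_index=False).agg(
    {"Category": ", ".join, "Location": "first"}
)
print(result)
```

If there are many extra columns, the aggregation dictionary could be built programmatically, mapping every column other than ID and Category to "first".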

Related

How can I compare one column of a dataframe to multiple other columns using SequenceMatcher?

I have a dataframe with 6 columns, the first two are an id and a name column, the remaining 4 are potential matches for the name column.
id | name | match1 | match2 | match3 | match4
1 | NXP Semiconductors | NaN | NaN | NaN | NaN
2 | Cincinnati Children's Hospital Medical Center | Montefiore Medical center | Children's Hospital Los Angeles | Cincinnati Children's Hospital Medical Center | SSM Health SLU Hospital
3 | Seminole Tribe of Florida | The State Board of Administration of Florida | NaN | NaN | NaN
4 | Miami-Dade County | County of Will | County of Orange | NaN | NaN
5 | University of California | California Teacher's Association | Yale University | University of Toronto | University System of Georgia
6 | Bon Appetit Management | Waste Management | Sculptor Capital | NaN | NaN
I'd like to use SequenceMatcher to compare the name column with each match column if there is a value and return the match value with the highest ratio, or closest match, in a new column at the end of the dataframe.
So the output would be something like this:
id | name | match1 | match2 | match3 | match4 | best match
1 | NXP Semiconductors | NaN | NaN | NaN | NaN | NaN
2 | Cincinnati Children's Hospital Medical Center | Montefiore Medical center | Children's Hospital Los Angeles | Cincinnati Children's Hospital Medical Center | SSM Health SLU Hospital | Cincinnati Children's Hospital Medical Center
3 | Seminole Tribe of Florida | The State Board of Administration of Florida | NaN | NaN | NaN | The State Board of Administration of Florida
4 | Miami-Dade County | County of Will | County of Orange | NaN | NaN | County of Orange
5 | University of California | California Teacher's Association | Yale University | University of Toronto | University System of Georgia | California Teacher's Association
6 | Bon Appetit Management | Waste Management | Sculptor Capital | NaN | NaN | Waste Management
I've gotten the data into the dataframe and have been able to compare one column to a single other column using the apply method:
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x[0].strip(), x[1].strip()).ratio(), axis=1)
However, I'm not sure how to loop over multiple columns in the same row. I also thought about trying to reformat my data so that the method above would work, something like this:
name match
name1 match1
name1 match2
name1 match3
However, I was running into issues dealing with the NaN values. Open to suggestions on the best route to accomplish this.

I ended up solving this using the second idea of reformatting the table. Using the melt function I was able to get a two-column table of the name field with each possible match. I then used the original lambda function to compare the two columns and output a ratio. After that it was relatively easy to go through and see the most likely matches, although it did require some manual effort.
import difflib as diff
import pandas as pd
df = pd.read_csv('output.csv')
# Reshape to one row per (name, candidate) pair, dropping empty candidates
df1 = df.melt(id_vars=['id', 'name'], var_name='match').dropna().drop(columns='match').sort_values('name')
df1['diff'] = df1.apply(lambda x: diff.SequenceMatcher(None, x['name'].strip(), x['value'].strip()).ratio(), axis=1)
df1.to_csv('comparison-output.csv', encoding='utf-8')
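If the manual step is not wanted, the best match could also be picked row by row without reshaping. The following is only a sketch: the helper name best_match and the match_cols list are made up for illustration, and the small frame reuses two rows from the question with only two match columns.

```python
import difflib
import pandas as pd

# Two rows taken from the question, trimmed to two match columns
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["NXP Semiconductors", "Cincinnati Children's Hospital Medical Center"],
    "match1": [None, "Montefiore Medical center"],
    "match2": [None, "Cincinnati Children's Hospital Medical Center"],
})

match_cols = ["match1", "match2"]  # hypothetical list of candidate columns

def best_match(row):
    """Return the candidate with the highest SequenceMatcher ratio, or None."""
    # Skip missing candidates (None/NaN are not strings)
    candidates = [c for c in row[match_cols] if isinstance(c, str)]
    if not candidates:
        return None
    name = row["name"].strip()
    return max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, name, c.strip()).ratio(),
    )

df["best match"] = df.apply(best_match, axis=1)
print(df[["name", "best match"]])
```

Rows with no candidates end up with a missing value in the new column. Ties are resolved in favour of the leftmost match column, because max returns the first maximal item.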

Creating new variable by aggregation in python

I'm pretty new to Python and pandas and know only the basics. I'm currently conducting research and need your help.
Let’s say I have data on births, containing 2 variables: Date and Country.
Date Country
1.1.20 USA
1.1.20 USA
1.1.20 Italy
1.1.20 England
2.1.20 Italy
2.1.20 Italy
3.1.20 USA
3.1.20 USA
Now I want to create a third variable, let's call it 'Births', which contains the number of births in each country on each date. In other words, I want just one row for each date+country combination, counting the rows for each country on each date, so I end up with something like this:
Date Country Births
1.1.20 USA 2
1.1.20 Italy 1
1.1.20 England 1
2.1.20 Italy 2
3.1.20 USA 2
I’ve tried many things, but nothing seemed to work. Any help will be much appreciated.
Thanks,
Eran

You can use the groupby method of your DataFrame, then the size method to count the number of rows in each group:
df.groupby(by=['Date', 'Country']).size().reset_index(name='Births')
Output:
Date Country Births
0 1.1.20 England 1
1 1.1.20 Italy 1
2 1.1.20 USA 2
3 2.1.20 Italy 2
4 3.1.20 USA 2
Also, the pandas documentation has several examples related to group-by operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
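The output above is sorted by the group keys, which differs from the row order in the question. As a small self-contained sketch (the frame is rebuilt from the question's sample), passing sort=False to groupby should keep the groups in order of first appearance:

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Date": ["1.1.20", "1.1.20", "1.1.20", "1.1.20", "2.1.20", "2.1.20", "3.1.20", "3.1.20"],
    "Country": ["USA", "USA", "Italy", "England", "Italy", "Italy", "USA", "USA"],
})

# sort=False keeps each (Date, Country) group in order of first appearance
births = (
    df.groupby(["Date", "Country"], sort=False)
      .size()
      .reset_index(name="Births")
)
print(births)
```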

Python: Compare 2 columns and write data to Excel sheets

I need to compare two columns together: "EMAIL" and "LOCATION".
I'm using Email because it's more accurate than name for this issue.
My objective is to find the total number of locations each person worked at, use that total to select which sheet the data will be written to, and copy the original data over to the new sheet (tab).
I need the original data copied over with all the duplicate locations, which is where this problem stumps me.
Full Excel Sheet
Had to use images because it flagged post as spam
The Excel sheet (SAMPLE) I'm reading in as a data frame:
Excel Sample Spreadsheet
Example:
TOMAPPLES#EXAMPLE.COM worked at WENDYS, FRANKS HUT, and WALMART. That sums up to 3 different locations, which I would add to a new sheet called SHEET: 3 Different Locations
SJONES22#GMAIL.COM worked at LONDONS TENT and YOUTUBE. That's 2 different locations, which I would add to a new sheet called SHEET: 2 Different Locations
MONTYJ#EXAMPLE.COM worked only at WALMART. This user would be added to SHEET: 1 Location
Outcome:
data copied to new sheets
Sheet 2
Sheet 2: different locations
Sheet 3
Sheet 3: different locations
Sheet 4
Sheet 4: different locations
Thanks for taking your time looking at my problem =)

Hi, check whether the lines below work for you:
import pandas as pd
df = pd.read_excel('sample.xlsx')
df1 = df.groupby(['Name','Location','Job']).count().reset_index()
# this is long line
df2 = df.groupby(['Name','Location','Job','Email']).agg({'Location':'count','Email':'count'}).rename(columns={'Location':'Location Count','Email':'Email Count'}).reset_index()
print(df1)
print('\n\n')
print(df2)
Below is the output. Change the grouping columns to check more variations.
df1
Name Location Job Email
0 Monty Jakarta Manager 1
1 Monty Mumbai Manager 1
2 Sahara Jonesh Paris Cook 2
3 Tom App Jakarta Buser 1
4 Tom App Paris Buser 2
df2 all columns
Name Location ... Location Count Email Count
0 Monty Jakarta ... 1 1
1 Monty Mumbai ... 1 1
2 Sahara Jonesh Paris ... 2 2
3 Tom App Jakarta ... 1 1
4 Tom App Paris ... 2 2
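The counts above are per name, location, and job, whereas the question asks for the number of distinct locations per email. A rough sketch of that variant follows. It assumes the columns are named EMAIL and LOCATION as in the question; the N_LOCATIONS helper column, the write_sheets function, and the sample rows are made up for illustration. Sheet names omit the colon because Excel does not allow ":" in sheet names.

```python
import pandas as pd

# Hypothetical sample based on the emails and locations named in the question
df = pd.DataFrame({
    "EMAIL": [
        "TOMAPPLES#EXAMPLE.COM", "TOMAPPLES#EXAMPLE.COM",
        "TOMAPPLES#EXAMPLE.COM", "TOMAPPLES#EXAMPLE.COM",
        "SJONES22#GMAIL.COM", "SJONES22#GMAIL.COM",
        "MONTYJ#EXAMPLE.COM",
    ],
    "LOCATION": [
        "WENDYS", "WENDYS", "FRANKS HUT", "WALMART",
        "LONDONS TENT", "YOUTUBE",
        "WALMART",
    ],
})

# Number of distinct locations per email, broadcast back to every original row
df["N_LOCATIONS"] = df.groupby("EMAIL")["LOCATION"].transform("nunique")

# One frame per distinct-location count; duplicate rows are kept untouched
sheets = {
    int(n): part.drop(columns="N_LOCATIONS")
    for n, part in df.groupby("N_LOCATIONS")
}

def write_sheets(path):
    """Write each group to its own sheet (needs an Excel engine such as openpyxl)."""
    with pd.ExcelWriter(path) as writer:
        for n, part in sheets.items():
            name = "1 Location" if n == 1 else f"{n} Different Locations"
            part.to_excel(writer, sheet_name=name, index=False)
```

Calling write_sheets("locations.xlsx") would then produce one tab per count; the file name is only an example.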

Fill in the missing values in one table by using a column in another table - Python

I have a table (df1) that contains 3 columns: ID, Industry, Job. I also have a second table (df2), which contains some of the values that are missing from df1.
DF1:
ID | Industry | Job
1 | Tech | Data Engineer
2 | N/A | N/A
3 | Blah | Blah
4 | N/A | Police Officer
8 | Transport | N/A
DF2:
ID | Industry | Job
1 | Tech | Data Engineer
2 | Oil | Engineer
4 | Government | Police Officer
10 | E-Sports | Gamer
I want to transfer the values that are missing in df1 from df2. Note that I DO NOT want to fully replace any values that are in df1 already and I only want to take values from df2 if they are missing in df1. Also note that ID 8 is missing in DF2, so I would want to keep the N/A in df1.
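One way this might be done, assuming the N/A entries are real missing values (NaN/None) rather than the literal string "N/A": index both frames by ID and pass df2 to fillna, which aligns on index and columns and only touches cells that are missing in df1. The frames below are rebuilt from the tables in the question.

```python
import pandas as pd

# Frames rebuilt from the question; None stands in for N/A
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 8],
    "Industry": ["Tech", None, "Blah", None, "Transport"],
    "Job": ["Data Engineer", None, "Blah", "Police Officer", None],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 4, 10],
    "Industry": ["Tech", "Oil", "Government", "E-Sports"],
    "Job": ["Data Engineer", "Engineer", "Police Officer", "Gamer"],
})

# fillna with a DataFrame aligns on ID and column name:
# existing df1 values are kept, and IDs found only in df2 (10) are not added
filled = df1.set_index("ID").fillna(df2.set_index("ID"))
result = filled.reset_index()
print(result)
```

If the data holds the literal string "N/A", it could be converted first, for example with df1.replace("N/A", pd.NA).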

Add a value to a new column on Data frame that depends on the value on another Data frame [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two data frames, df1 and df2. df1 has entries of amounts spent by users, and each user can have several entries with different amounts.
The second data frame just holds information about every user (each user is unique in this data frame).
I want to create a new column in df1 that holds the country value of each user from df2.
Any help will be appreciated.
df1
name_id Dept amt_spent
0 Alex-01 Engineering 5
1 Bob-01 Finance 5
2 Charles-01 HR 10
3 David-01 HR 6
4 Alex-01 Engineering 50
df2
name_id Country
0 Alex-01 UK
1 Bob-01 USA
2 Charles-01 GHANA
3 David-01 BRAZIL
Result
name_id Dept amt_spent Country
0 Alex-01 Engineering 5 UK
1 Bob-01 Finance 5 USA
2 Charles-01 HR 10 GHANA
3 David-01 HR 6 BRAZIL
4 Alex-01 Engineering 50 UK

This should work:
df = pd.merge(df1, df2)
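With no arguments, pd.merge joins on the columns the two frames share (here name_id) using an inner join, so users missing from df2 would be dropped. A slightly more explicit sketch, rebuilt from the question's sample data, names the key and uses a left join so every row of df1 is kept:

```python
import pandas as pd

# Frames rebuilt from the question
df1 = pd.DataFrame({
    "name_id": ["Alex-01", "Bob-01", "Charles-01", "David-01", "Alex-01"],
    "Dept": ["Engineering", "Finance", "HR", "HR", "Engineering"],
    "amt_spent": [5, 5, 10, 6, 50],
})
df2 = pd.DataFrame({
    "name_id": ["Alex-01", "Bob-01", "Charles-01", "David-01"],
    "Country": ["UK", "USA", "GHANA", "BRAZIL"],
})

# Left join on the shared key: df1's rows and order are preserved
df = df1.merge(df2, on="name_id", how="left")
print(df)
```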
