Problems with concatenating rows using pd.groupby().agg() - python

For example, let's say that I have the dataframe:
import pandas as pd
import numpy as np

NAME = ['BOB', 'BOB', 'BOB', 'SUE', 'SUE', 'MARY', 'JOHN', 'JOHN', 'MARK', 'MARK', 'MARK', 'MARK']
STATE = ['CA','CA','CA','DC','DC','PA','GA','GA','NY','NY','NY','NY']
MAJOR = ['MARKETING','BUSINESS ADM',np.nan,'ECONOMICS','MATH','PSYCHOLOGY','HISTORY','BUSINESS ADM','MATH', 'MEDICAL SCIENCES',np.nan,np.nan]
SCHOOL = ['UCLA','UCSB','CAL STATE','HARVARD','WISCONSIN','YALE','CHICAGO','MIT','UCSD','UCLA','CAL STATE','COMMUNITY']
data = {'NAME':NAME, 'STATE':STATE,'MAJOR':MAJOR, 'SCHOOL':SCHOOL}
df = pd.DataFrame(data)
I want to concatenate the rows that have multiple unique values for the same name.
I tried:
gr_columns = [x for x in df.columns if x not in ['MAJOR','SCHOOL']]
df = df.groupby(gr_columns).agg(lambda col: '|'.join(col))
and expected the MAJOR and SCHOOL values to be concatenated for each NAME. Conveniently, the STATE field is static for each NAME, so I would like the output to look like:
NAME  STATE  MAJOR                   SCHOOL
BOB   CA     MARKETING,BUSINESS ADM  UCLA,UCSB,CAL STATE
SUE   DC     ECONOMICS,MATH          HARVARD,WISCONSIN
MARY  PA     PSYCHOLOGY              YALE
JOHN  GA     HISTORY,BUSINESS ADM    CHICAGO,MIT
MARK  NY     MATH,MEDICAL SCIENCES   UCSD,UCLA,CAL STATE,COMMUNITY
but instead, I get a single column containing the concatenated schools.

It is because np.nan is a float, and str.join fails on non-string values, so the column containing it is dropped automatically by pandas. You need to convert the values to str first:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.astype(str)))
To drop the NaN values instead, and keep NAME and STATE as columns rather than the index:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.dropna())).reset_index()
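For reference, a minimal sketch of that second call applied to the df defined in the question (note that groupby sorts the group keys, so the rows come out in alphabetical NAME order rather than the order shown above):
out = (df.groupby(['NAME', 'STATE'])
         .agg(lambda x: ','.join(x.dropna()))  # drop NaN before joining
         .reset_index())                       # keep NAME and STATE as columns
print(out)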

Related

Check if a pandas.core.series.Series contains a specific string in Python

I have a dataframe (df1) that looks like this:
first_name  last_name  affiliation
jean        dulac      University of Texas
peter       huta       University of Maryland
I want to match this dataframe to another one that contains several potential matches. Each potential match has a first and last name and also a list of all the affiliations this person was associated with, and I want to use the information in this affiliation column to differentiate between my potential matches and keep only the most likely one.
The second dataframe has the following form:
first_name  last_name  affiliations_all
jean        dulac      [{'city_name': 'Kyoto', 'country_name': 'Japan', 'name': 'Kyoto University'}]
jean        dulac      [{'city_name': 'Texas', 'country_name': 'USA', 'name': 'University of Texas'}]
The column affiliations_all is apparently saved as a pandas.core.series.Series (and I can't change that since it comes from an API query).
I am thinking that one way to match the two dataframes would be to remove words like "university" and "of" from the affiliation column of the first dataframe (that's easy), do the same for the affiliations_all column of the second dataframe (I don't know how to do that), and then run some version of
test.apply(lambda x: str(x.affiliation) in str(x.affiliations_all), axis=1)
adapted to the fact that affiliations_all is a series.Series.
Any idea how to do that?
Thanks!
One possible solution would be to transform df2 (expand the columns) and then merge df1 with df2:
# transform df2
df2 = df2.explode("affiliations_all")  # one row per affiliation dict
df2 = pd.concat([df2, df2.pop("affiliations_all").apply(pd.Series)], axis=1)  # expand dict keys into columns
df2 = df2.rename(columns={"name": "affiliation"})  # align the column name with df1
print(df2)
This prints:
first_name last_name city_name country_name affiliation
0 jean dulac Kyoto Japan Kyoto University
1 jean dulac Texas USA University of Texas
And the second step is to merge df1 with the transformed df2:
df_out = pd.merge(df1, df2, on=["first_name", "last_name", "affiliation"])
print(df_out)
Prints:
first_name last_name affiliation city_name country_name
0 jean dulac University of Texas Texas USA
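For the expansion step, an alternative sketch using pd.json_normalize instead of .apply(pd.Series); it assumes every entry of affiliations_all is a list of dicts with a 'name' key, as in the sample data:
exploded = df2.explode("affiliations_all").reset_index(drop=True)
expanded = pd.json_normalize(exploded.pop("affiliations_all").tolist())  # dict keys -> columns
df2_flat = pd.concat([exploded, expanded], axis=1).rename(columns={"name": "affiliation"})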

Why doesn't it get rid of duplicates when merging column values in a dataframe?

I have a dataframe values like this:
name foreign_name acronym alias
United States États-Unis USA USA
I want to merge all four of those columns in a row into one single column 'names', so I do:
merge = lambda x: '|'.join([a for a in x.unique() if a])
df['names'] = df[['name', 'foreign_name', 'acronym', 'alias',]].apply(merge, axis=1)
The problem with this code is that it doesn't remove the duplicate 'USA'; instead it gives:
names = 'United States|États-Unis|USA|USA'
Where am I wrong?
Aggregate each row to a set to eliminate duplicates, turn the set into a list, then apply str.join('|') to concatenate the strings with a | separator:
df['names'] = df.agg(set, 1).map(list).str.join('|')
MCVE:
import pandas as pd
import numpy as np
d= {'name': {0: 'United States'},
'foreign_name': {0: 'États-Unis'},
'acronym': {0: 'USA'},
'alias': {0: 'USA'}}
df = pd.DataFrame(d)
merge = lambda x: '|'.join([a for a in x.unique() if a])
df['names'] = df[['name', 'foreign_name', 'acronym', 'alias',]].apply(merge, axis=1)
print(df)
Output:
name foreign_name acronym alias names
0 United States États-Unis USA USA United States|États-Unis|USA
You just need to tell it to operate along the row axis with axis=1:
df.apply(lambda r: "|".join(r.unique()), axis=1)
output
United States|États-Unis|USA
dtype: object
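One caveat: a Python set does not preserve order, so the set-based answer may shuffle the values. If first-appearance order matters, a sketch using pd.unique instead (assuming all values are strings):
df['names'] = df[['name', 'foreign_name', 'acronym', 'alias']].apply(
    lambda r: '|'.join(pd.unique(r)), axis=1)  # pd.unique keeps first-appearance order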

Turning repeated row labels into column headers in pandas

I have a questionnaire in this format
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
As you can see the same 'Question' appears repeatedly, and I need to reformat this so that the result is as follows
df2 = pd.DataFrame({'Name': ['Bob', 'Michelle'],
'Age': [ 50, 42],
'Income': [42000,62000]})
Use numpy.reshape:
print (pd.DataFrame(df["Answer"].to_numpy().reshape((2,-1)), columns=df["Question"][:3]))
Or transpose and pd.concat:
s = df.set_index("Question").T
print (pd.concat([s.iloc[:, n:n+3] for n in range(0, len(s.columns), 3)]).reset_index(drop=True))
Both yield the same result:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000
You can create a new column group with .assign, using .groupby and .cumcount (Bob falls in the first group and Michelle in the second, with the groups determined by the repetition of Name, Age, and Income).
Then .pivot the dataframe with the index being the group.
code:
df3 = (df.assign(group=df.groupby('Question').cumcount())
.pivot(index='group', values='Answer', columns='Question')
.reset_index(drop=True)[['Name','Age','Income']]) #[['Name','Age','Income']] at the end reorders the columns.
df3
Out[76]:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000
Here is a solution! It assumes that each respondent answers the same fixed number of questions (three rows each for Bob and Michelle):
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
df=df.set_index("Question")
pd.concat([df.iloc[i:i+3,:].transpose() for i in range(0,len(df),3)],axis=0).reset_index(drop=True)
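For comparison, a minimal sketch of the same reshape using .unstack() instead of .pivot(); like the cumcount/pivot answer above, it does not assume a fixed block size of three:
df3 = (df.assign(group=df.groupby('Question').cumcount())  # 0 for Bob's rows, 1 for Michelle's
         .set_index(['group', 'Question'])['Answer']
         .unstack()[['Name', 'Age', 'Income']])            # columns reordered at the end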

Groupby one column and count another column with a condition?

I was wondering if it is possible to groupby one column while counting the values of another column that fulfill a condition. Because my dataset is a bit weird, I created a similar one:
import pandas as pd
raw_data = {'name': ['John', 'Paul', 'George', 'Emily', 'Jamie'],
'nationality': ['USA', 'USA', 'France', 'France', 'UK'],
'books': [0, 15, 0, 14, 40]}
df = pd.DataFrame(raw_data, columns = ['name', 'nationality', 'books'])
Say, I want to groupby the nationality and count the number of people that don't have any books (books == 0) from that country.
I would therefore expect something like the following as output:
nationality
USA 1
France 1
UK 0
I tried most variations of groupby, using filter and agg, but don't seem to get anything that works.
Thanks in advance,
BBQuercus :)
IIUC:
df.books.eq(0).astype(int).groupby(df.nationality).sum()
nationality
France 1
UK 0
USA 1
Name: books, dtype: int64
Use:
df.groupby('nationality')['books'].apply(lambda x: x.eq(0).any().astype(int))
nationality
France 1
UK 0
USA 1
Name: books, dtype: int64
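Note that .any() only reports whether at least one person from a country has zero books, so each value is capped at 1; if an actual count per country is needed, a sketch using .sum() instead:
df.groupby('nationality')['books'].apply(lambda x: x.eq(0).sum())  # count of people with 0 books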

How to check each time-series entry if name/id is in previous years entries?

I'm stuck.
I have a dataframe where rows are created at the time a customer quotes cost of a product.
My (truncated) data:
import pandas as pd
d = {'Quote Date': pd.to_datetime(['3/10/2016', '3/10/2016', '3/10/2016',
'3/10/2016', '3/11/2017']),
'Customer Name': ['Alice', 'Alice', 'Bob', 'Frank', 'Frank']
}
df = pd.DataFrame(data=d)
I want to check, for each row, if this is the first interaction I have had with this customer in over a year. My thought is to check each row's customer name against the customer names in the preceding year's worth of rows. If a row's customer name is not in the previous-year subset, then I will append a True value to the new column:
df['Is New']
In practice, the dataframe's shape will be close to (150000000, 5) and I fear adding a calculated column will not scale well.
I also thought to create a multi-index with the date and then customer name, but I was not sure how to execute the necessary search with this indexing.
Please apply any method you believe would be more efficient at checking for the first instance of a customer in the preceding year.
Here is the first approach that came to mind. I don't expect it to scale that well to 150M rows, but give it a try. Also, your truncated data does not produce a very interesting output, so I created some test data in which some users are new, and some are not:
# Create example data
d = {'Quote Date': pd.to_datetime(['3/10/2016',
'3/10/2016',
'6/25/2016',
'1/1/2017',
'6/25/2017',
'9/29/2017']),
'Customer Name': ['Alice', 'Bob', 'Alice', 'Frank', 'Bob', 'Frank']
}
df = pd.DataFrame(d)
df.set_index('Quote Date', inplace=True)
# Solution
day = pd.DateOffset(days=1)
is_new = [s['Customer Name'] not in df.loc[i - 365*day:i-day]['Customer Name'].values
for i, s in df.iterrows()]
df['Is New'] = is_new
df.reset_index(inplace=True)
# Result
df
Quote Date Customer Name Is New
0 2016-03-10 Alice True
1 2016-03-10 Bob True
2 2016-06-25 Alice False
3 2017-01-01 Frank True
4 2017-06-25 Bob True
5 2017-09-29 Frank False
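If the row-by-row scan turns out to be too slow at 150M rows, here is a vectorized sketch of the same check: sort by date, take each customer's gap to their previous quote, and flag the row as new when there is no prior quote or the gap exceeds a year (this assumes "within a year" means within 365 days of the most recent prior quote, matching the loop above):
df = df.sort_values('Quote Date')
gap = df.groupby('Customer Name')['Quote Date'].diff()       # time since this customer's previous quote
df['Is New'] = gap.isna() | (gap > pd.Timedelta(days=365))   # no prior quote, or prior quote over a year ago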
