I was wondering if it is possible to groupby one column while counting the values of another column that fulfill a condition. Because my dataset is a bit weird, I created a similar one:
import pandas as pd
raw_data = {'name': ['John', 'Paul', 'George', 'Emily', 'Jamie'],
            'nationality': ['USA', 'USA', 'France', 'France', 'UK'],
            'books': [0, 15, 0, 14, 40]}
df = pd.DataFrame(raw_data, columns=['name', 'nationality', 'books'])
Say I want to group by nationality and count, for each country, the number of people who don't have any books (books == 0).
I would therefore expect something like the following as output:
nationality
USA 1
France 1
UK 0
I have tried most variations of groupby, using filter and agg, but can't seem to get anything that works.
Thanks in advance,
BBQuercus :)
IIUC:
df.books.eq(0).astype(int).groupby(df.nationality).sum()
nationality
France 1
UK 0
USA 1
Name: books, dtype: int64
Use:
df.groupby('nationality')['books'].apply(lambda x: x.eq(0).sum())
nationality
France 1
UK 0
USA 1
Name: books, dtype: int64
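For completeness, here is a self-contained sketch of the boolean-sum idea, using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Paul', 'George', 'Emily', 'Jamie'],
                   'nationality': ['USA', 'USA', 'France', 'France', 'UK'],
                   'books': [0, 15, 0, 14, 40]})

# eq(0) gives a boolean mask; summing it per country counts the zero-book people
zero_counts = df['books'].eq(0).groupby(df['nationality']).sum()
```

For this data the result is France 1, UK 0, USA 1; a True counts as 1, so the astype(int) step isn't strictly needed before the sum.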
For example, let's say that I have the dataframe:
import numpy as np
import pandas as pd

NAME = ['BOB', 'BOB', 'BOB', 'SUE', 'SUE', 'MARY', 'JOHN', 'JOHN', 'MARK', 'MARK', 'MARK', 'MARK']
STATE = ['CA','CA','CA','DC','DC','PA','GA','GA','NY','NY','NY','NY']
MAJOR = ['MARKETING','BUSINESS ADM',np.nan,'ECONOMICS','MATH','PSYCHOLOGY','HISTORY','BUSINESS ADM','MATH','MEDICAL SCIENCES',np.nan,np.nan]
SCHOOL = ['UCLA','UCSB','CAL STATE','HARVARD','WISCONSIN','YALE','CHICAGO','MIT','UCSD','UCLA','CAL STATE','COMMUNITY']
data = {'NAME': NAME, 'STATE': STATE, 'MAJOR': MAJOR, 'SCHOOL': SCHOOL}
df = pd.DataFrame(data)
I want to concatenate rows that have multiple unique values for the same name.
I tried:
gr_columns = [x for x in df.columns if x not in ['MAJOR', 'SCHOOL']]
df1 = df.groupby(gr_columns).agg(lambda col: '|'.join(col))
and expected the rows to be concatenated wherever the NAME field is the same (conveniently, the STATE field is static for each NAME). I would like the output to look like:
NAME  STATE  MAJOR                   SCHOOL
BOB   CA     MARKETING,BUSINESS ADM  UCLA,UCSB,CAL STATE
SUE   DC     ECONOMICS,MATH          HARVARD,WISCONSIN
MARY  PA     PSYCHOLOGY              YALE
JOHN  GA     HISTORY,BUSINESS ADM    CHICAGO,MIT
MARK  NY     MATH,MEDICAL SCIENCES   UCSD,UCLA,CAL STATE,COMMUNITY
but instead, I get a single column containing the concatenated schools.
This is because np.nan cannot be joined as a str: the lambda raises for those columns, and pandas drops them silently. You need to convert the values to str first:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.astype(str)))
To drop the NaNs instead and keep NAME and STATE as regular columns:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.dropna())).reset_index()
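For reference, a runnable sketch of both variants on a trimmed-down version of the question's data (three rows are enough to show the NaN handling):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'NAME': ['BOB', 'BOB', 'SUE'],
    'STATE': ['CA', 'CA', 'DC'],
    'MAJOR': ['MARKETING', np.nan, 'ECONOMICS'],
    'SCHOOL': ['UCLA', 'CAL STATE', 'HARVARD'],
})

# Variant 1: cast everything to str, so NaN survives as the literal string 'nan'
kept = df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.astype(str)))

# Variant 2: drop NaN before joining, keeping NAME/STATE as regular columns
dropped = df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.dropna())).reset_index()
```

Variant 1 yields 'MARKETING,nan' for BOB's MAJOR, while variant 2 yields just 'MARKETING'.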
I have a dataframe with values like this:
name foreign_name acronym alias
United States États-Unis USA USA
I want to merge those four columns in each row into one single column 'names', so I do:
merge = lambda x: '|'.join([a for a in x.unique() if a])
df['names'] = df[['name', 'foreign_name', 'acronym', 'alias',]].apply(merge, axis=1)
The problem is that this code doesn't remove the duplicate 'USA'; instead it produces:
names = 'United States|États-Unis|USA|USA'
Where am I wrong?
Aggregate each row to a set to eliminate duplicates, turn the set into a list, then apply str.join('|') to concatenate the strings with a | separator:
df['names'] = df.agg(set, axis=1).map(list).str.join('|')
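One caveat worth noting (my addition, not part of the answer above): Python sets do not preserve order, so the set-based version may shuffle the names. Deduplicating with dict.fromkeys keeps the first-seen order:

```python
import pandas as pd

df = pd.DataFrame({'name': ['United States'],
                   'foreign_name': ['États-Unis'],
                   'acronym': ['USA'],
                   'alias': ['USA']})

# dict.fromkeys deduplicates while preserving insertion order
df['names'] = df.apply(lambda r: '|'.join(dict.fromkeys(r)), axis=1)
```

This yields 'United States|États-Unis|USA' with the values in their original column order.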
MCVE:
import pandas as pd
import numpy as np
d = {'name': {0: 'United States'},
     'foreign_name': {0: 'États-Unis'},
     'acronym': {0: 'USA'},
     'alias': {0: 'USA'}}
df = pd.DataFrame(d)
merge = lambda x: '|'.join([a for a in x.unique() if a])
df['names'] = df[['name', 'foreign_name', 'acronym', 'alias',]].apply(merge, axis=1)
print(df)
Output:
name foreign_name acronym alias names
0 United States États-Unis USA USA United States|États-Unis|USA
You just need to tell it to operate along the row axis with axis=1:
df.apply(lambda r: "|".join(r.unique()), axis=1)
output
United States|États-Unis|USA
dtype: object
I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population holding, for each country, the sum of the populations of its cities. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this doesn't work; could I have some advice on the problem?
Sounds like you're looking for groupby:
import pandas as pd
data = {
'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
You can then append this Series to the DataFrame (arguably discouraged, though, since it stores information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
Based on @Vaishali's comment, a one-liner:
df['Country population'] = df.groupby('country')['city_population'].transform('sum')
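A minimal sketch of the transform approach: unlike sum(), transform('sum') returns a result aligned to the original index, so it can be assigned straight back as a column:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
    'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
    'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000],
})

# transform('sum') broadcasts each group's total back to every row of that group
df['country_population'] = df.groupby('country')['city_population'].transform('sum')
```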
I have a questionnaire in this format
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
As you can see, the same 'Question' appears repeatedly, and I need to reshape this so that the result is as follows:
df2 = pd.DataFrame({'Name': ['Bob', 'Michelle'],
'Age': [ 50, 42],
'Income': [42000,62000]})
Use numpy.reshape:
print (pd.DataFrame(df["Answer"].to_numpy().reshape((2,-1)), columns=df["Question"][:3]))
Or transpose and pd.concat:
s = df.set_index("Question").T
print (pd.concat([s.iloc[:, n:n+3] for n in range(0, len(s.columns), 3)]).reset_index(drop=True))
Both yield the same result:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000
You can create a new column group with .assign, using .groupby and .cumcount (Bob falls in the first group and Michelle in the second, with the groups determined by the repetition of Name, Age, and Income).
Then .pivot the dataframe with the index being the group.
code:
df3 = (df.assign(group=df.groupby('Question').cumcount())
.pivot(index='group', values='Answer', columns='Question')
.reset_index(drop=True)[['Name','Age','Income']]) #[['Name','Age','Income']] at the end reorders the columns.
df3
Out[76]:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000
Here is a solution! It assumes that every observation spans the same number of rows (three each for Bob and Michelle):
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
df=df.set_index("Question")
pd.concat([df.iloc[i:i+3,:].transpose() for i in range(0,len(df),3)],axis=0).reset_index(drop=True)
I have a dataframe which I group by as follows:
Name Nationality age
Peter UK 28
John US 29
Wiley UK 28
Aster US 29
grouped = self_ex_df.groupby(['Nationality', 'age'])
Now I want to attach a unique ID to each of these groups.
I am trying this, though I'm not sure it works:
uniqueID = 'ID_'+ grouped.groups.keys().astype(str)
uniqueID Name Nationality age
ID_UK28 Peter UK 28
ID_US29 John US 29
ID_UK28 Wiley UK 28
ID_US29 Aster US 29
I now want to combine this into a new DataFrame, something like this:
uniqueID Nationality age Text
ID_UK28 UK 28 Peter and Wiley have a combined age of 56
ID_US29 US 29 John and Aster have a combined age of 58
How do I achieve the above?
Hopefully this is close enough:
import pandas as pd
#create dataframe
df = pd.DataFrame({'Name': ['Peter', 'John', 'Wiley', 'Aster'], 'Nationality': ['UK', 'US', 'UK', 'US'], 'age': [28, 29, 28, 29]})
#make uniqueID
df['uniqueID'] = 'ID_' + df['Nationality'] + df['age'].astype(str)
# groupby's agg method can take a dict and perform multiple aggregations
df = df.groupby(['uniqueID', 'Nationality']).agg({'age': 'sum', 'Name': lambda x: ' and '.join(x)})
# to get Text, combine the joined names with the summed age
df['Text'] = df['Name'] + ' have a combined age of ' + df['age'].astype(str)
You don't need the groupby to create the uniqueID, and you can group by that uniqueID later to get the groups based on age and nationality. I used a custom function to build the Text string. This is one way of doing it:
df1 = df.assign(uniqueID='ID_'+df.Nationality+df.age.astype(str))
def myText(x):
    s = ' and '.join(x.Name)
    s += ' have a combined age of {}.'.format(x.age.sum())
    return s
df2 = df1.groupby(['uniqueID', 'Nationality','age']).apply(myText).reset_index().rename(columns={0:'Text'})
print(df2)
Output:
uniqueID Nationality age Text
0 ID_UK28 UK 28 Peter and Wiley have a combined age of 56.
1 ID_US29 US 29 John and Aster have a combined age of 58.
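As a variant (my sketch, not taken from the answers above), named aggregation can build the same summary without a custom apply function:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Peter', 'John', 'Wiley', 'Aster'],
                   'Nationality': ['UK', 'US', 'UK', 'US'],
                   'age': [28, 29, 28, 29]})
df['uniqueID'] = 'ID_' + df['Nationality'] + df['age'].astype(str)

# named aggregation: one output column per (input column, aggregation) pair
summary = (df.groupby(['uniqueID', 'Nationality'], as_index=False)
             .agg(names=('Name', ' and '.join), total_age=('age', 'sum')))
summary['Text'] = summary['names'] + ' have a combined age of ' + summary['total_age'].astype(str)
```

For the data above this produces 'Peter and Wiley have a combined age of 56' for ID_UK28 and 'John and Aster have a combined age of 58' for ID_US29.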