I have a pandas dataframe (df) with the following fields:
id   name    category
01   Eddie   magician
01   Eddie   plumber
02   Martha  actress
03   Jeremy  dancer
03   Jeremy  actor
I want to create a dataframe (df2) like the following:
id   name    categories
01   Eddie   magician, plumber
02   Martha  actress
03   Jeremy  dancer, actor
So, first of all, I create df2 and add an additional column with the following commands:
df2 = df.groupby("id", as_index= False).count()
df2["categories"] = str()
(this also counts the occurrences of each category, which is useful for what I intend to do)
Then, I use this loop:
for i in df2.itertuples():
    for entries in df.itertuples():
        if i.id == entries.id:
            df2["categories"].iloc[i.Index] += entries.category
        else:
            pass
Using this code, I get the dataframe that I wanted. However, this implementation has several problems:
Doesn't look optimal.
If there are duplicate rows (such as another row with "Eddie" and "magician"), the entry for Eddie in df2 would end up with "magician, plumber, magician" in categories.
Therefore I would like to ask the community: is there a better way to do this?
Also keep in mind that this is my first question on this website!
Thanks in advance!
You can groupby your id and name columns and apply a function to the category one like this:
import pandas as pd
data = {
'id': ['01', '01', '02', '03', '03'],
'name': ['Eddie', 'Eddie', 'Martha', 'Jeremy', 'Jeremy'],
'category': ['magician', 'plumber', 'actress', 'dancer', 'actor']
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'name'])['category'].apply(lambda x: ', '.join(x)).reset_index()
df2
Output:
id name category
0 01 Eddie magician, plumber
1 02 Martha actress
2 03 Jeremy dancer, actor
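If you are also worried about exact duplicate rows (your second point), one option is to drop them before grouping. A minimal sketch, reusing the df above:
df2 = df.drop_duplicates().groupby(['id', 'name'])['category'].apply(', '.join).reset_index()
drop_duplicates removes rows where id, name, and category are all identical, so a second "Eddie / magician" row would no longer produce "magician, plumber, magician".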
Related
I managed to group rows in a dataframe, given one column (id).
The problem is that one column consists of parts of sentences, and when I add them together, the spaces are missing.
An example probably makes it easier to understand...
My dataframe looks something like this:
import pandas as pd

# create dataframe
df = pd.DataFrame({'id': [101, 101, 102, 102, 102],
                   'text': ['The government changed', 'the legislation on import control.',
                            'Politics cannot solve all problems', 'but it should try to do its part.',
                            'That is the reason why these elections are important.'],
                   'date': [1990, 1990, 2005, 2005, 2005]})
id text date
0 101 The government changed 1990
1 101 the legislation on import control. 1990
2 102 Politics cannot solve all problems 2005
3 102 but it should try to do its part. 2005
4 102 That is the reason why these elections are imp... 2005
Then I used the aggregation function:
aggregation_functions = {'id': 'first','text': 'sum', 'date': 'first'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)
which returns:
id text date
0 101 The government changedthe legislation on import control. 1990
2 102 Politics cannot solve all problemsbut it should try to... 2005
So, for example I need a space in between ' The government changed' and 'the legislation...'. Is that possible?
If you need to put a space between the two phrases/rows, use str.join:
ujoin = lambda s: " ".join(dict.fromkeys(s.astype(str)))
out = df.groupby(["id", "date"], as_index=False).agg(**{"text": ("text", ujoin)})[df.columns]
# Output :
print(out.to_string())
id text date
0 101 The government changed the legislation on import control. 1990
1 102 Politics cannot solve all problems but it should try to do its part. That is the reason why these elections are important. 2005
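The dict.fromkeys call does double duty here: besides feeding " ".join, it drops duplicate rows while preserving their first-seen order, which a plain set would not guarantee. A quick standalone illustration with made-up values:
parts = ["The government changed", "the legislation.", "The government changed"]
print(" ".join(dict.fromkeys(parts)))
# The government changed the legislation.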
I have two csv files as my raw data to read into different dataframes. One is called 'employee' and another is called 'origin'. However, I cannot upload the files here so I hardcoded the data into the dataframes below. The task I'm trying to solve is to update the 'Eligible' column in employee_details with 'Yes' or 'No' based on the value of the 'Country' column in origin_details. If Country = UK, then put 'Yes' in the Eligible column for that Personal_ID. Else, put 'No'.
import pandas as pd
import numpy as np
employee = {
    'Personal_ID': ['1000123', '1100258', '1104682', '1020943'],
    'Name': ['Tom', 'Joseph', 'Krish', 'John'],
    'Age': ['40', '35', '43', '51'],
    'Eligible': ' '}
origin = {
    'Personal_ID': ['1000123', '1100258', '1104682', '1020943', '1573482', '1739526'],
    'Country': ['UK', 'USA', 'FRA', 'SWE', 'UK', 'AU']}
employee_details = pd.DataFrame(employee)
origin_details = pd.DataFrame(origin)
employee_details['Eligible'] = np.where((origin_details['Country']) == 'UK', 'Yes', 'No')
print(employee_details)
print(origin_details)
The output of above code shows the below error message:
ValueError: Length of values (6) does not match length of index (4)
However, I am expecting to see the below as my output.
Personal_ID Name Age Eligible
0 1000123 Tom 40 Yes
1 1100258 Joseph 35 No
2 1104682 Krish 43 No
3 1020943 John 51 No
I also don't want to delete anything in my dataframes to match the size in the ValueError message, because I may need the extra Personal_IDs in origin_details later. Alternatively, I could keep all the existing Personal_IDs in the raw data (employee_details, origin_details) and create a new dataframe that extracts only the records sharing the same Personal_IDs, then apply the np.where() condition there.
Please advise! Any help is appreciated, thank you!!
You can merge the two dataframes on Personal_ID and then use np.where.
Merge with how='outer' to keep all Personal_IDs:
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='outer')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
4 1573482 NaN NaN Yes UK
5 1739526 NaN NaN No AU
If you don't want to keep all Personal_IDs, then you can merge with how='inner' and you won't see the NaNs:
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='inner')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
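As an alternative sketch that avoids the merge entirely (assuming Personal_ID is unique in origin_details), you can map each employee's ID to its country and feed that into np.where:
country = employee_details['Personal_ID'].map(origin_details.set_index('Personal_ID')['Country'])
employee_details['Eligible'] = np.where(country == 'UK', 'Yes', 'No')
IDs with no match in origin_details map to NaN and therefore get 'No'.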
You are using a Pandas Series object inside a Numpy method, np.where((origin_details['Country'])). I believe this is the problem.
try:
employee_details['Eligible'] = origin_details['Country'].apply(lambda x:"Yes" if x=='UK' else "No")
It is always much easier and faster to use the pandas library to analyze dataframes than to convert them back to numpy arrays.
First, about the exception: you were actually lucky it was raised. If your tables had been the same length, your code would have run without complaint.
But there is an assumption in the code that I don't think you thought about: the IDs may not be in the same order, or, as in your example, one table may have more IDs than the other. With same-length tables in a different order, you would have gotten incorrect Eligible values for each row. The usual way to do this is as follows.
First, join the tables into one on Personal_ID, using a left join so you don't lose employees that have no origin info:
combine_df = pd.merge(employee_details, origin_details, on='Personal_ID', how='left')
Then use apply to fill the new column:
combine_df['Eligible'] = combine_df['Country'].apply(lambda x:'Yes' if x=='UK' else 'No')
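With the sample data from the question, this approach should produce something like:
  Personal_ID    Name Age Eligible Country
0     1000123     Tom  40      Yes      UK
1     1100258  Joseph  35       No     USA
2     1104682   Krish  43       No     FRA
3     1020943    John  51       No     SWE
(Only the four employee rows survive the left join; the extra origin rows remain untouched in origin_details.)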
I have the following df (just for example):
data={'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df=pd.DataFrame(data)
print(df)
city={"New York": "123",
"LA":"456",
"Miami":"789"}
Output:
Name
0 Tom
1 Joseph
2 Krish
3 John
I would like to add another column to the df which will be based on the city dictionary.
I would like to do it by the following conditions:
If the Name is Tom or Krish then they should get 123 (New York).
If the Name is John then he should get 456 (LA).
If the Name is Joseph then he should get 789 (Miami).
Thanks in advance :)
Try via loc and boolean masking:
df.loc[df['Name'].isin(['Tom', 'Krish']), 'City'] = 'New York'
df.loc[df['Name'].eq('John'), 'City'] = 'LA'
df.loc[df['Name'].eq('Joseph'), 'City'] = 'Miami'
Finally:
df['Value'] = df['City'].map(city)
# you can also use replace() in place of map()
OR
# import numpy as np
cond = [
    df['Name'].isin(['Tom', 'Krish']),
    df['Name'].eq('John'),
    df['Name'].eq('Joseph')
]
df['City'] = np.select(cond, ['New York', 'LA', 'Miami'])
df['Value'] = df['City'].map(city)
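Another sketch, if you prefer to keep the name-to-city rules in one place, is a plain name-to-city dictionary mapped twice (name_to_city is my own helper name, not from the question):
name_to_city = {'Tom': 'New York', 'Krish': 'New York', 'John': 'LA', 'Joseph': 'Miami'}
df['City'] = df['Name'].map(name_to_city)
df['Value'] = df['City'].map(city)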
I'm sorry if I can't explain the issue properly, since I don't fully understand it myself. I'm starting to learn Python, and to practice I redo projects from my day-to-day job in Python. Right now I'm stuck on one of them and would like some help or guidance. I have a dataframe that looks like this:
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize that I can't format this as a table in the post, since the separator of the IDs is a |.) But you get the idea: every person has 4 IDs, all in the same "cell" of the dataframe, each ID separated from its value by a pipe. I need to split those IDs from their values and put them in separate columns, so I get something like this:
index  Country  Name  PERSID  SSO      STARTDATE  WAVE
0      USA      John  12345   John123  20210101   WAVE39
1      UK       Jane  25478   Jane123  20210101   WAVE40
Now, adding to the complexity of the table itself, there is another issue: the order of the IDs won't be the same for everyone, and some people will be missing some of the IDs.
I honestly had no idea where to begin. The first thing I thought of was to split the IDs column by spaces, then split the result by pipes, create a dictionary from that, convert it to a dataframe, and join it to my original dataframe using the index.
But as I said, my knowledge of Python is quite pathetic, so that failed catastrophically. I only got to the first step of that plan with Client_ids = df.IDs.str.split(), which returns a series with the IDs separated from each other, like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'], but I can't find a way to split it again because I keep getting an error saying that the list object doesn't have attribute 'split'.
How should I approach this? What alternatives do I have?
Thank you in advance for any help or recommendation
You have a few options to consider to do this. Here's how I would do it.
I will split the values in IDs by \n and |, build a dictionary of key:value pairs from the alternating pieces, then join it back to the dataframe and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
df['temp'] = df['IDs'].str.split(r'\n|\|').apply(lambda x: {k: v for k, v in zip(x[::2], x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs', 'temp'])
print (df)
With this approach, it does not matter if a row is missing some of the IDs; it will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
First, generate your dataframe:
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using a lambda:
df2 = pd.DataFrame(list(df1.apply(lambda r: {p: q for p, q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly, concat the dataframes together:
df = pd.concat([df1, df2], axis=1)
Quick solution:
remove_word = ["PERSID", "SSO", "STARTDATE", "WAVE"]
for i, col in enumerate(remove_word):
    df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]
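Note that this quick version assumes every row contains all four IDs in the same fixed order, and because the replacement regex matches anywhere in the string, it also strips the literal WAVE out of a value like WAVE39, leaving 39. With the varying order and missing IDs mentioned in the question, the dictionary-based answers above are safer.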
Use regex named capture groups with Series.str.extract:
def ng(x):
    return rf'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
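For reference, with these four fields the generated pat expands to one optional group per field:
(?:PERSID\|(?P<PERSID>[^\n]+))?\n?(?:SSO\|(?P<SSO>[^\n]+))?\n?(?:STARTDATE\|(?P<STARTDATE>[^\n]+))?\n?(?:WAVE\|(?P<WAVE>[^\n]+))?\n?
Each group is optional, so missing IDs simply extract as NaN, but note the pattern still expects the fields in this order when they are present.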
Setup
Credit to @JoeFerndz for the sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
I have a dataframe (below) that I group using groupby, as follows:
Name Nationality age
Peter UK 28
John US 29
Wiley UK 28
Aster US 29
grouped = self_ex_df.groupby(['Nationality', 'age'])
Now I want to attach a unique ID to each of these groups.
I am trying this, though I'm not sure it works:
uniqueID = 'ID_'+ grouped.groups.keys().astype(str)
uniqueID Name Nationality age
ID_UK28 Peter UK 28
ID_US29 John US 29
ID_UK28 Wiley UK 28
ID_US29 Aster US 29
I now want to combine this into a new DF, something like this:
uniqueID Nationality age Text
ID_UK28 UK 28 Peter and Wiley have a combined age of 56
ID_US29 US 29 John and Aster have a combined age of 58
How do i achieve the above?
Hopefully close enough, couldn't get average age:
import pandas as pd
#create dataframe
df = pd.DataFrame({'Name': ['Peter', 'John', 'Wiley', 'Aster'], 'Nationality': ['UK', 'US', 'UK', 'US'], 'age': [28, 29, 28, 29]})
#make uniqueID
df['uniqueID'] = 'ID_' + df['Nationality'] + df['age'].astype(str)
#groupby has an agg method that can take a dict and perform multiple aggregations
df = df.groupby(['uniqueID', 'Nationality']).agg({'age': 'sum', 'Name': lambda x: ' and '.join(x)})
#to get text you just combine new Name and sum of age
df['Text'] = df['Name'] + ' have a combined age of ' + df['age'].astype(str)
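If you then want uniqueID and Nationality back as regular columns (the groupby above moves them into the index), a reset_index call should do it:
df = df.reset_index()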
You don't need the groupby to create the uniqueID; you can groupby that uniqueID later to get the groups based on age and nationality. I used a custom function to build the text string. This is one way of doing it.
df1 = df.assign(uniqueID='ID_'+df.Nationality+df.age.astype(str))
def myText(x):
    s = ' and '.join(x.Name)
    s += ' have a combined age of {}.'.format(x.age.sum())
    return s
df2 = df1.groupby(['uniqueID', 'Nationality','age']).apply(myText).reset_index().rename(columns={0:'Text'})
print(df2)
Output:
uniqueID Nationality age Text
0 ID_UK28 UK 28 Peter and Wiley have a combined age of 56.
1 ID_US29 US 29 John and Aster have a combined age of 58.