I need to get the first and last names of people who work in HR department.
FirstName LastName Year Department
83 Joe Faulk 2 Austin Public Library
84 Bryce Benton 5 HR
85 Sarah Cronin 7 Austin Public Library
86 Gabriel Montgomery 2 Austin Resource Recovery
87 Patricia Genty-Andrade 3 HR
This is my code it shows me error AttributeError: 'DataFrame' object has no attribute 'unique'
names = df.iloc[:, 0:4][df['Department'] == 'HR'].unique()
I need the output to be like this
FirstName LastName Department
0 Joe Faulk HR
1 Bryce Benton HR
2 Sarah Cronin HR
3 Gabriel Montgomery HR
4 Patricia Genty-Andrade HR
use drop_duplicates instead of unique, as unique is for Series.
df.loc[df['Department'] == 'HR',
['FirstName', 'LastName', 'Department']].drop_duplicates()
Related
Sorry for the confusing title. I am practicing how to manipulate dataframes in Python through pandas. How do I make this kind of table:
id role name
0 11 ACTOR Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 11 DIRECTOR Christian Schwochow
2 22 ACTOR Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
3 22 DIRECTOR Andrew Baird
4 33 ACTOR Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
5 33 DIRECTOR Saron Sakina
Into this kind:
id director actors name
0 11 Christian Schwochow Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 22 Andrew Baird Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...d
2 33 Saron Sakina Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
Try this way
df.pivot(index='id', columns='role', values='name')
You can do in addition to #Tejas's answer:
df = (df.pivot(index='id', columns='role', values='name').
reset_index().
rename_axis('',axis=1).
rename(columns={'ACTOR':'actors name','DIRECTOR':'director'}))
House Number
Street
First Name
Surname
Age
Relationship to Head of House
Marital Status
Gender
Occupation
Infirmity
Religion
0
1
Smith Radial
Grace
Patel
46
Head
Widowed
Female
Petroleum engineer
None
Catholic
1
1
Smith Radial
Ian
Nixon
24
Lodger
Single
Male
Publishing rights manager
None
Christian
2
2
Smith Radial
Frederick
Read
87
Head
Divorced
Male
Retired TEFL teacher
None
Catholic
3
3
Smith Radial
Daniel
Adams
58
Head
Divorced
Male
Therapist, music
None
Catholic
4
3
Smith Radial
Matthew
Hall
13
Grandson
NaN
Male
Student
None
NaN
5
3
Smith Radial
Steven
Fletcher
9
Grandson
NaN
Male
Student
None
NaN
6
4
Smith Radial
Alison
Jenkins
38
Head
Single
Female
Physiotherapist
None
Catholic
7
4
Smith Radial
Kelly
Jenkins
12
Daughter
NaN
Female
Student
None
NaN
8
5
Smith Radial
Kim
Browne
69
Head
Married
Female
Retired Estate manager/land agent
None
Christian
9
5
Smith Radial
Oliver
Browne
69
Husband
Married
Male
Retired Merchandiser, retail
None
None
Hello,
I have a dataset that you can see below. When I tried to convert Age to int. I got that error: ValueError: invalid literal for int() with base 10: '43.54302670766108'
This means there is float data inside that data. I tried to replace '.' to '0' then tried to convert but I failed. Could you help me to do that?
df['Age'] = df['Age'].replace('.','0')
df['Age'] = df['Age'].astype('int')
I still got the same error. I think replace line is not working. Do you know why?
Thanks
Try:
df['Age'] = df['Age'].replace('\..*$', '', regex=True).astype(int)
Or, more drastic:
df['Age'] = df['Age'].replace('^(?:.*\D.*)?$', '0', regex=True).astype(int)
You do not need to manipulate the strings; you might first convert values to float then to int like:
df["Age"] = df["Age"].astype('float').astype('int')
I've three datasets:
dataset 1
Customer1 Customer2 Exposures + other columns
Nick McKenzie Christopher Mill 23450
Nick McKenzie Stephen Green 23450
Johnny Craston Mary Shane 12
Johnny Craston Stephen Green 12
Molly John Casey Step 1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer Age
Nick McKenzie 53
Johnny Craston 75
Molly John 34
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
dataset 3
Customer1 Customer2 Exposures + other columns
Mick Sale Johnny Craston
Mick Sale Stephen Green
Exposures refers to Customer 1 only.
There are other columns omitted for brevity. Dataset 2 is built by getting unique customer 1 and unique customer 2: no duplicates are in that dataset. Dataset 3 has the same column of dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer Age Exposures + other columns
Nick McKenzie 53 23450
Johnny Craston 75 12
Molly John 34 1000021
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows
result = pd.concat([df2,df1,df3], axis=1)
but the result is not that one I'd expect.
Something wrong is in my way of concatenating the datasets and I'd appreciate it if you can let me know what is wrong.
After concatenating the dataframe df1 and df2 (assuming they have same columns), we can remove the duplicates using df1.drop_duplicates(subset=['customer1']) and then we can join with df2 like this
df1.set_index('Customer1').join(df2.set_index('Customer'))
In case df1 and df2 has different columns based on the primary key we can join using the above command and then again join with the age table.
This would give the result. You can concatenate dataset 1 and datatset 3 because they have same columns. And then run this operation to get the desired result. I am joining specifying the respective keys.
Note: Though not related to the question but for the concatenation one can use this code pd.concat([df1, df3],ignore_index=True) (Here we are ignoring the index column)
I have two pandas df with the exact same column names. One of these columns is named id_number which is unique to each table (What I mean is an id_number can only appear once in each df). I want to find all records that have the same id_number but have at least one different value in any column and store these records in a new pandas df.
I've tried merging (more specifically inner join), but it keeps only one record with that specific id_number so I can't look for any differences between the two dfs.
Let me provide some example to provide a clearer explanation:
Example dfs:
First DF:
id_number name type city
1 John dev Toronto
2 Alex dev Toronto
3 Tyler dev Toronto
4 David dev Toronto
5 Chloe dev Toronto
Second DF:
id_number name type city
1 John boss Vancouver
2 Alex dev Vancouver
4 David boss Toronto
5 Chloe dev Toronto
6 Kyle dev Vancouver
I want the resulting df to contain the following records:
id_number name type city
1 John dev Toronto
1 John boss Vancouver
2 Alex dev Toronto
2 Alex dev Vancouver
4 David dev Toronto
4 David Boss Toronto
NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.
In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again to summarize, I want the resulting df to contain records with same id_numbers, but a different value in any of the other columns. Thanks in advance for any help!
Here is one way using nunique then we pick those id_number more than 1 and slice them out
s = pd.concat([df1, df2])
s = s.loc[s.id_number.isin(s.groupby(['id_number']).nunique().gt(1).any(1).loc[lambda x : x].index)]
s
Out[654]:
id_number name type city
0 1 John dev Toronto
1 2 Alex dev Toronto
3 4 David dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Vancouver
2 4 David boss Toronto
Here is, a way using pd.concat, drop_duplicates and duplicated:
pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('id_number')\
.loc[lambda x: x.id_number.duplicated(keep=False)]
Output:
id_number name type city
0 1 John dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Toronto
1 2 Alex dev Vancouver
3 4 David dev Toronto
2 4 David boss Toronto
I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do i compute the mean hours of exercise of female students with a 'status' of passing...?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Anyone please help me in solving this.
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))