import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
pd.pivot_table(df, index='Employee ID', values=['Member ID', 'Firstname', 'Lastname'], aggfunc='first')
The format seems to work, but only for one value. How do I display everything?
Any help is appreciated.
You can use set_index() and unstack(), but you will need to fix the columns, e.g.:
In []:
df = pd.read_csv("data.csv")
df['ID'] = df['MemberID'] # Copy because you want it in the values too
df = df.set_index(['EmployeeID', 'MemberID']).unstack(level=1, fill_value='').sort_index(level=1, axis=1)
df.columns = df.columns.to_series().apply(lambda x: 'Member{}{}'.format(x[1], x[0]))
print(df)
Out[]:
Member1ID Member1Lastname Member1firstname Member2ID Member2Lastname Member2firstname Member3ID Member3Lastname Member3firstname
EmployeeID
1 1 Ann Anu 2 Ann Aju 3 Ann Abi
2 1 John Cini 2 John Biju
3 1 Peter Mathew 2 Peter Joseph
However, if you don't need MemberID in the values (it is already in the column name), or if you don't mind MultiIndex columns, you can simplify this:
In []:
df.set_index(['EmployeeID', 'MemberID']).unstack(level=1, fill_value='').swaplevel(axis=1).sort_index(axis=1)
Out[]:
MemberID 1 2 3
Lastname firstname Lastname firstname Lastname firstname
EmployeeID
1 Ann Anu Ann Aju Ann Abi
2 John Cini John Biju
3 Peter Mathew Peter Joseph
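The set_index() / unstack() approach above can be tried without a CSV file. The following is a minimal sketch, assuming the input has the columns EmployeeID, MemberID, firstname and Lastname; the rows are reconstructed from the outputs shown above, and the name wide is only illustrative.

```python
import pandas as pd

# Sample rows reconstructed from the outputs above (column names are assumed)
df = pd.DataFrame({
    "EmployeeID": [1, 1, 1, 2, 2, 3, 3],
    "MemberID":   [1, 2, 3, 1, 2, 1, 2],
    "firstname":  ["Anu", "Aju", "Abi", "Cini", "Biju", "Mathew", "Joseph"],
    "Lastname":   ["Ann", "Ann", "Ann", "John", "John", "Peter", "Peter"],
})

# One row per employee, one group of columns per member
wide = df.set_index(["EmployeeID", "MemberID"]).unstack(level=1, fill_value="")

# Put MemberID on the outer column level and sort so each member's fields sit together
wide = wide.swaplevel(axis=1).sort_index(axis=1)

# Flatten the two column levels into names such as "Member1Lastname"
wide.columns = [f"Member{member}{field}" for member, field in wide.columns]

print(wide)
```

Employees with fewer members get empty strings in the unused columns because of fill_value=''.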
You can use pandas' pivot_table:
df = df.pivot_table(index='EmployeeID',
                    columns='MemberID',
                    values=['firstname', 'Lastname'],
                    aggfunc='first')
To install pandas, run pip install pandas.
Then load the CSV into a DataFrame with read_csv() and apply the call above.
I have two columns in dataframe df
ID Name
AXD2 SAM S
AXD2 SAM
SCA4 JIM
SCA4 JIM JONES
ASCQ JOHN
I need the output to keep one row per unique ID, with the first Name that appears for that ID:
ID Name
AXD2 SAM S
SCA4 JIM
ASCQ JOHN
Any suggestions?
You can use groupby with agg and take the first Name for each ID:
df.groupby(['ID']).agg(first_name=('Name', 'first')).reset_index()
Use drop_duplicates:
out = df.drop_duplicates('ID', ignore_index=True)
print(out)
# Output
ID Name
0 AXD2 SAM S
1 SCA4 JIM
2 ASCQ JOHN
You can use cumcount() to number the rows within each ID and keep only the first occurrence:
df['RN'] = df.groupby(['ID']).cumcount() + 1
df = df.loc[df['RN'] == 1]
df[['ID', 'Name']]
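One difference between these answers is row order. The sketch below uses the sample data from the question and compares drop_duplicates with groupby(...).first(); the variable names are illustrative. By default groupby sorts the group keys, so passing sort=False is one way to keep the order in which the IDs first appear.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   ["AXD2", "AXD2", "SCA4", "SCA4", "ASCQ"],
    "Name": ["SAM S", "SAM", "JIM", "JIM JONES", "JOHN"],
})

# Keeps the first row for each ID, in order of first appearance
by_drop = df.drop_duplicates("ID", ignore_index=True)

# Same first Name per ID, but the IDs come back sorted alphabetically
by_group_sorted = df.groupby("ID", as_index=False).first()

# sort=False keeps the order of first appearance, matching drop_duplicates
by_group_ordered = df.groupby("ID", as_index=False, sort=False).first()

print(by_drop)
print(by_group_sorted)
print(by_group_ordered)
```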
For example, there is a column 'ID' in a dataframe.
One of the entries is, for example, '13245993, 3004992'.
I only want to keep '13245993'.
The same applies to every row in column 'ID'.
How can I change the data in each row of column 'ID'?
You can slice the ID column to get the required result. Here I keep the first 3 characters:
import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'ID':[90877, 10909, 12223, 12334]}
df=pd.DataFrame(data)
print('Before change')
print(df)
df["ID"]=df["ID"].apply(lambda x: (str(x)[:3]))
print('After change')
print(df)
output
Before change
Name ID
0 Tom 90877
1 nick 10909
2 krish 12223
3 jack 12334
After change
Name ID
0 Tom 908
1 nick 109
2 krish 122
3 jack 123
You could do something like
data[data['ID'] == '13245993']
This will give you the rows where ID is exactly '13245993'.
More Indepth Code
I hope this answers your question. If not, please let me know.
With best regards
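For the comma-separated entries described in the question, a minimal sketch is to split each string on the comma and keep the first piece. Only the value '13245993, 3004992' comes from the question; the other two rows are made up for illustration, and the sketch assumes the ID column holds strings.

```python
import pandas as pd

df = pd.DataFrame({
    # First value is from the question; the other two are hypothetical
    "ID": ["13245993, 3004992", "555", "42, 7, 9"],
})

# Split on the comma, keep the first piece, and trim stray whitespace
df["ID"] = df["ID"].str.split(",").str[0].str.strip()

print(df)
```

Rows without a comma are left unchanged, since splitting them yields a single piece.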
I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545' since these names appear in my df2. I tried using pd.contains, but I can only verify if a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] together with Series.str.contains to select the usernames that need to be replaced with the real names. In my opinion, this solution is clear and readable.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
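A loop-free variant is also possible. The sketch below is an alternative to the loop above, not the answer's method: it builds one regular expression from all real names and uses Series.str.extract, falling back to the original user name when nothing matches. The names pattern and found are illustrative. If one real name is a prefix of another, the one listed first wins, so the order of df2 may matter.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'userName': ['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName': ['alice', 'peter', 'john', 'francis', 'joe', 'carol']})

# One capture group containing every real name, escaped so it is matched literally
pattern = "(" + "|".join(re.escape(name) for name in df2['realName']) + ")"

# First real name found inside each user name, NaN when there is none
found = df1['userName'].str.extract(pattern, expand=False)

# Keep the original user name where no real name was found
df1['userName'] = found.fillna(df1['userName'])

print(df1)
```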
I'm sorry if I can't explain the issue properly, since I don't fully understand it myself. I'm learning Python, and to practice I try to redo projects from my day-to-day job in Python. Right now I'm stuck on one and would like some help or guidance. I have a dataframe that looks like this:
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize for not formatting this as a table; the IDs use | as their separator.) Every person has 4 IDs, all in the same "cell" of the dataframe, with each ID name separated from its value by a pipe. I need to split the IDs from their values and put them in separate columns, so I get something like this:
index  Country  Name  PERSID  SSO      STARTDATE  WAVE
0      USA      John  12345   John123  20210101   WAVE39
1      UK       Jane  25478   Jane123  20210101   WAVE40
To add to the complexity, there are other issues: the order of the IDs won't be the same for everyone, and some people will be missing some of the IDs.
I honestly have no idea where to begin. My first thought was to split the IDs column by spaces and then split the result by pipes to create a dictionary, convert it to a dataframe, and join it to my original dataframe using the index.
But as I said, my Python knowledge is limited, so that failed. I only got through the first step of that plan, with Client_ids = df.IDs.str.split(), which returns a Series of lists like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39']. I can't find a way to split those again, because I keep getting an error saying that the list object doesn't have the attribute 'split'.
How should I approach this? What alternatives do I have?
Thank you in advance for any help or recommendation
You have a few options to consider to do this. Here's how I would do it.
I split the values in IDs on both \n and |, which yields an alternating list of keys and values. Then I build a key:value dictionary from that list for each row, join the result back to the dataframe, and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
# Raw string so that \| is passed to the regex engine unchanged
df['temp'] = df['IDs'].str.split(r'\n|\|').apply(lambda x: {k: v for k, v in zip(x[::2], x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), index=df.index))
df = df.drop(columns=['IDs', 'temp'])
print(df)
With this approach, it does not matter if a row is missing some of the IDs or lists them in a different order. Values are matched by key, and missing keys become NaN.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have an SSO value, so it was set to NaN by default.
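The question's own first step, df.IDs.str.split(), produces a Series of lists, which is why calling .split on its elements fails. One way to continue from that step, shown here as a sketch rather than as any answer's method, is to explode the lists into one row per "KEY|VALUE" string and then split again with the .str accessor. The Jill row below is a hypothetical variation with its IDs out of order; pairs, wide and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["USA", "John", "PERSID|12345\nSSO|John123\nSTARTDATE|20210101\nWAVE|WAVE39"],
        # Hypothetical row: IDs out of order and two of them missing
        ["CA", "Jill", "WAVE|WAVE41\nPERSID|12345"],
    ],
    columns=["Country", "Name", "IDs"],
)

# Split on whitespace, then give every "KEY|VALUE" string its own row.
# explode() keeps the original row label, so rows can be matched up later.
pairs = df["IDs"].str.split().explode().str.split("|", n=1, expand=True)
pairs.columns = ["key", "value"]

# Turn the keys into columns; the existing index identifies the person
wide = pairs.pivot(columns="key", values="value")

result = df.drop(columns="IDs").join(wide)
print(result)
```

Because the values are matched by key, the order of the IDs within a cell does not matter, and IDs that are absent show up as NaN.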
First generate your dataframe
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using a lambda
df2 = pd.DataFrame(list(df1.apply(lambda r: {p:q for p,q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly, concat the dataframes together.
df = pd.concat([df1, df2], axis=1)
Quick solution
remove_word = ["PERSID", "SSO", "STARTDATE", "WAVE"]
for i, col in enumerate(remove_word):
    # Positional: this only lines up when every row has all four IDs in this order
    df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1].str.strip()
Use regex named capture groups with pd.Series.str.extract
def ng(x):
    # Optional "<field>|<value>" line; the value is captured under the field's name
    return rf'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to @JoeFerndz for sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
I have two dataframes of different length like those:
DataFrame A:
FirstName LastName
Adam Smith
John Johnson
DataFrame B:
First Last Value
Adam Smith 1.2
Adam Smith 1.5
Adam Smith 3.0
John Johnson 2.5
Imagine that what I want to do is to create a new column in "DataFrame A" summing all the values with matching last names, so the output in "A" would be:
FirstName LastName Sums
Adam Smith 5.7
John Johnson 2.5
If I were in Excel, I'd use
=SUMIF(dfB!B:B, B2, dfB!C:C)
In Python I've tried multiple approaches using np.where, df.sum(), dropping indexes, etc., but I'm lost. The code below raises "ValueError: Can only compare identically-labeled Series objects", and I don't think it's written correctly anyway.
df_a['Sums'] = df_a[df_a['LastName'] == df_b['Last']].sum()['Value']
Huge thanks in advance for any help.
Use boolean indexing with Series.isin for filtering and then aggregate sum:
df = (df_b[df_b['Last'].isin(df_a['LastName'])]
.groupby(['First','Last'], as_index=False)['Value']
.sum())
If you want to match on both first and last name:
df = (df_b.merge(df_a, left_on=['First','Last'], right_on=['FirstName','LastName'])
.groupby(['First','Last'], as_index=False)['Value']
.sum())
df_b_a = (pd.merge(df_b, df_a, left_on=['First', 'Last'], right_on=['FirstName', 'LastName'], how='left')
          .groupby(by=['First', 'Last'], as_index=False)['Value'].sum())
print(df_b_a)
First Last Value
0 Adam Smith 5.7
1 John Johnson 2.5
Use DataFrame.merge + DataFrame.groupby:
new_df=( dfa.merge(dfb.groupby(['First','Last'],as_index=False).Value.sum() ,
left_on='LastName',right_on='Last',how='left')
.drop('Last',axis=1) )
print(new_df)
To join on both columns:
new_df=( dfa.merge(dfb.groupby(['First','Last'],as_index=False).Value.sum() ,
left_on=['FirstName','LastName'],right_on=['First','Last'],how='left')
.drop(['First','Last'],axis=1) )
print(new_df)
Output:
FirstName LastName Value
0 Adam Smith 5.7
1 John Johnson 2.5
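For a result that stays in df_a, closer to what SUMIF does in Excel, one option is to sum Value per last name in df_b and map those totals onto df_a['LastName']. This is a minimal sketch using the data from the question; totals is an illustrative name.

```python
import pandas as pd

df_a = pd.DataFrame({"FirstName": ["Adam", "John"], "LastName": ["Smith", "Johnson"]})
df_b = pd.DataFrame({
    "First": ["Adam", "Adam", "Adam", "John"],
    "Last":  ["Smith", "Smith", "Smith", "Johnson"],
    "Value": [1.2, 1.5, 3.0, 2.5],
})

# Total Value for each last name in df_b (a Series indexed by Last)
totals = df_b.groupby("Last")["Value"].sum()

# Look up each LastName in those totals, like SUMIF keyed on the last name
df_a["Sums"] = df_a["LastName"].map(totals)

print(df_a)
```

Last names that do not occur in df_b come back as NaN; appending .fillna(0) would mirror SUMIF returning 0 in that case.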