Reshape python dataframe - python

I have a dataframe like this (a single description column; a blank row separates each person's record):

description
Brian
No.22
Tel:+00123456789
email:brain@email.com

Sandra
No:43
Tel:+00312456789

Michel
No:593

Kent
No:13
Engineer
Tel:04512367890
email:kent@yahoo.com
and I want it like this:

name    address  designation  telephone         email
Brian   No:22    null         Tel:+00123456789  email:brain@email.com
Sandra  No:43    null         Tel:+00312456789  null
Michel  No:593   null         null              null
Kent    No:13    Engineer     Tel:04512367890   email:kent@yahoo.com
How can I do this in Python?

Use np.select to label each row, then pivot your dataframe.
Step 1. Label each row:

import numpy as np

condlist = [df['description'].shift(fill_value='').eq(''),
            df['description'].str.contains('^No[:.]'),
            df['description'].str.startswith('Tel:'),
            df['description'].str.startswith('email:')]
choicelist = ['name', 'address', 'telephone', 'email']
df['column'] = np.select(condlist, choicelist, default='designation')
print(df)
# Output:
              description       column
0                   Brian         name
1                   No.22      address
2        Tel:+00123456789    telephone
3   email:brain@email.com        email
4                          designation
5                  Sandra         name
6                   No:43      address
7        Tel:+00312456789    telephone
8                          designation
9                  Michel         name
10                 No:593      address
11                         designation
12                   Kent         name
13                  No:13      address
14               Engineer  designation
15        Tel:04512367890    telephone
16   email:kent@yahoo.com        email
Step 2. Now remove empty rows and create an index to allow the pivot:
df = df[df['description'].ne('')].assign(index=df['column'].eq('name').cumsum())
print(df)
# Output:
              description       column  index
0                   Brian         name      1
1                   No.22      address      1
2        Tel:+00123456789    telephone      1
3   email:brain@email.com        email      1
5                  Sandra         name      2
6                   No:43      address      2
7        Tel:+00312456789    telephone      2
9                  Michel         name      3
10                 No:593      address      3
12                   Kent         name      4
13                  No:13      address      4
14               Engineer  designation      4
15        Tel:04512367890    telephone      4
16   email:kent@yahoo.com        email      4
Step 3. Pivot your dataframe:
cols = ['name', 'address', 'designation', 'telephone', 'email']
out = df.pivot(index='index', columns='column', values='description')[cols] \
        .rename_axis(index=None, columns=None)
print(out)
# Output:
     name address designation         telephone                  email
1   Brian   No.22         NaN  Tel:+00123456789  email:brain@email.com
2  Sandra   No:43         NaN  Tel:+00312456789                    NaN
3  Michel  No:593         NaN               NaN                    NaN
4    Kent   No:13    Engineer   Tel:04512367890   email:kent@yahoo.com
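For reference, here is a minimal sketch of how the input frame above could be rebuilt for testing. It assumes the single column is named description and that empty strings separate the records, as in the question; adjust if your real data differs.

import pandas as pd

# Sample input: blank entries mark the boundary between two people.
df = pd.DataFrame({'description': [
    'Brian', 'No.22', 'Tel:+00123456789', 'email:brain@email.com', '',
    'Sandra', 'No:43', 'Tel:+00312456789', '',
    'Michel', 'No:593', '',
    'Kent', 'No:13', 'Engineer', 'Tel:04512367890', 'email:kent@yahoo.com',
]})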
Edit
There is an error at the final step: "ValueError: Index contains duplicate entries, cannot reshape". How can I overcome this?
There is no magic fix here, because your data is messy. The designation label is the fallback for any row that was not tagged as name, address, telephone or email, so there is a good chance you have multiple rows labelled designation for the same person.
At the end of step 2, check whether you have duplicate (person, label) pairs, i.e. duplicate (index, column) combinations, with this command:
df.value_counts(['index', 'column']).loc[lambda x: x > 1]
Hopefully the output only flags the designation label in the column column, unless one person can have several telephone numbers or emails. You can then adjust condlist to catch as many patterns as possible. I don't know anything about your data, so I can't help you much more than that.
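If the duplicates turn out to be unavoidable (for example several designation rows for the same person), one possible workaround is to aggregate them instead of pivoting directly. A sketch, assuming df and cols are the objects from steps 2 and 3 and that joining the duplicate texts with ' / ' is acceptable for your use case:

# Collapse duplicate (index, column) pairs by joining their text, then reshape.
out = (df.pivot_table(index='index', columns='column', values='description',
                      aggfunc=' / '.join)[cols]
         .rename_axis(index=None, columns=None))
print(out)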

Related

reshape a dataframe with internal headers as new columns

I have a data frame as below:

             col          value
0      companyId         123456
1   company_name  small company
2     department             IT
3  employee_name           Jack
4           rank        Grade 8
5     department        finance
6  employee_name            Tim
7           rank        Grade 6
and I would like the data frame to be reshaped into tabular format, ideally like this:

  companyId   company_name department employee_name     rank
0    123456  small company         IT          Jack  Grade 8
1    123456  small company    finance           Tim  Grade 6

Can anyone help me please? Thanks.
Making two assumptions, you can reshape your data:
1- the companies are defined by header rows (companyId, company_name) and all subsequent rows are data for that company's employees
2- there is a known first item that starts each employee record (here department)
headers = ['companyId', 'company_name']
first_item = 'department'

masks = {h: df['col'].eq(h) for h in headers}

df2 = (df
       # move headers as new columns
       .assign(**{h: df['value'].where(m).ffill().bfill() for h, m in masks.items()})
       # and drop their rows
       .loc[~pd.concat(masks, axis=1).any(axis=1)]
       # compute a unique identifier per employee
       .assign(idx=lambda d: d['col'].eq(first_item).cumsum())
       # pivot the data
       .pivot(index=['idx'] + headers, columns='col', values='value')
       .reset_index(headers)
      )
output:

  companyId   company_name department employee_name     rank
1    123456  small company         IT          Jack  Grade 8
2    123456  small company    finance           Tim  Grade 6
Example on a more complex input:

              col          value
0       companyId         123456
1    company_name  small company
2      department             IT
3   employee_name           Jack
4            rank        Grade 8
5      department        finance
6   employee_name            Tim
7            rank        Grade 6
8       companyId          67890
9    company_name  other company
10     department             IT
11  employee_name           Jane
12           rank        Grade 9
13     department     management
14  employee_name           Tina
15           rank       Grade 12
output:

  companyId   company_name  department employee_name      rank
1    123456  small company          IT          Jack   Grade 8
2    123456  small company     finance           Tim   Grade 6
3     67890  other company          IT          Jane   Grade 9
4     67890  other company  management          Tina  Grade 12
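To try the snippet end to end, here is a sketch of how the simple sample input could be built (column names col and value as in the question):

import pandas as pd

df = pd.DataFrame({
    'col':   ['companyId', 'company_name', 'department', 'employee_name', 'rank',
              'department', 'employee_name', 'rank'],
    'value': ['123456', 'small company', 'IT', 'Jack', 'Grade 8',
              'finance', 'Tim', 'Grade 6'],
})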

Pandas - Expand table based on different email with same key from another table

I have a quick one that I am struggling with.
Table 1 has a lot of user information in addition to an email column and a unique ID column.
Table 2 has only a unique ID column and an email column. These emails can be different from table 1, but do not have to be.
I am attempting to merge them such that table 1 expands only to include new rows when there is a new email from table 2 on the same unique id.
Example:
Table 1:

id  email    first_name  last_name
1   jo@      joe         king
2   john@    johnny      maverick
3   Tom@     Tom         J

Table 2:

id  email
2   johnmk@
3   TomT@
8   Jared@

Desired Output:

id  email    first_name  last_name
1   jo@      joe         king
2   john@    johnny      maverick
2   johnmk@  johnny      maverick
3   Tom@     Tom         J
3   TomT@    Tom         J
I would have expected pd.merge(table1, table2, on='id', how='left') to do this, but it just generates two email columns with the suffixes _x and _y.
How can I make the merge?
IIUC, you can try pd.concat with a boolean isin mask on df2, followed by groupby.ffill:

out = pd.concat((df1, df2[df2['id'].isin(df1['id'])]), sort=False)
out.update(out.groupby("id").ffill())
out = out.sort_values("id")  # .reset_index(drop=True)
   id    email first_name last_name
0   1      jo@        joe      king
1   2    john@     johnny  maverick
0   2  johnmk@     johnny  maverick
2   3     Tom@        Tom         J
1   3    TomT@        Tom         J
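An alternative sketch, assuming the frames are named df1 and df2 as above: merge table 2 against table 1's non-email columns to pick up the names, then append those rows to table 1.

# New-email rows inherit first_name/last_name from df1 via the shared id column.
extra = df2.merge(df1.drop(columns='email'), on='id', how='inner')
out = (pd.concat([df1, extra], sort=False)
         .sort_values('id')
         .reset_index(drop=True))
print(out)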

How to find records with same value in one column but different value in another column

I have two pandas dfs with the exact same column names. One of these columns is named id_number, which is unique within each table (what I mean is that an id_number can only appear once in each df). I want to find all records that have the same id_number but at least one different value in any column, and store these records in a new pandas df.
I've tried merging (more specifically inner join), but it keeps only one record with that specific id_number so I can't look for any differences between the two dfs.
Let me give an example for a clearer explanation:
Example dfs:
First DF:

id_number  name   type  city
1          John   dev   Toronto
2          Alex   dev   Toronto
3          Tyler  dev   Toronto
4          David  dev   Toronto
5          Chloe  dev   Toronto

Second DF:

id_number  name   type  city
1          John   boss  Vancouver
2          Alex   dev   Vancouver
4          David  boss  Toronto
5          Chloe  dev   Toronto
6          Kyle   dev   Vancouver

I want the resulting df to contain the following records:

id_number  name   type  city
1          John   dev   Toronto
1          John   boss  Vancouver
2          Alex   dev   Toronto
2          Alex   dev   Vancouver
4          David  dev   Toronto
4          David  boss  Toronto
NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.
In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again, to summarize, I want the resulting df to contain records with the same id_number but a different value in any of the other columns. Thanks in advance for any help!
Here is one way: use nunique, pick the id_number values that have more than one distinct value in any column, and slice those rows out.

s = pd.concat([df1, df2])
s = s.loc[s.id_number.isin(s.groupby(['id_number']).nunique().gt(1).any(axis=1).loc[lambda x: x].index)]
s
s
Out[654]:
   id_number   name  type       city
0          1   John   dev    Toronto
1          2   Alex   dev    Toronto
3          4  David   dev    Toronto
0          1   John  boss  Vancouver
1          2   Alex   dev  Vancouver
2          4  David  boss    Toronto
Here is a way using pd.concat, drop_duplicates and duplicated:

pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('id_number') \
    .loc[lambda x: x.id_number.duplicated(keep=False)]
Output:
   id_number   name  type       city
0          1   John   dev    Toronto
0          1   John  boss  Vancouver
1          2   Alex   dev    Toronto
1          2   Alex   dev  Vancouver
3          4  David   dev    Toronto
2          4  David  boss    Toronto
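Another possible sketch uses an outer merge on all columns with indicator=True (again assuming df1 and df2 as in the question): rows present in both frames come back flagged as 'both', so keeping the remaining rows whose id_number appears on both sides gives exactly the changed records.

m = df1.merge(df2, how='outer', indicator=True)
changed = m[m['_merge'] != 'both'].drop(columns='_merge')
result = (changed[changed['id_number'].duplicated(keep=False)]
          .sort_values('id_number'))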

How to combine two dataframes and have unique key column using Pandas?

I have two dataframes with the same columns that I need to combine:
  first_name last_name
0       Alex  Anderson
1        Amy  Ackerman
2      Allen       Ali

and

  first_name last_name
0      Billy    Bonder
1      Brian     Black
2       Bran   Balwner
When I do this:
df_new = pd.concat([df1, df2])
I get this:
  first_name last_name
0       Alex  Anderson
1        Amy  Ackerman
2      Allen       Ali
0      Billy    Bonder
1      Brian     Black
2       Bran   Balwner
Is there a way to have the left column have a unique number like this?
  first_name last_name
0       Alex  Anderson
1        Amy  Ackerman
2      Allen       Ali
3      Billy    Bonder
4      Brian     Black
5       Bran   Balwner
If not, how can I add a new key column with numbers from 1 to whatever the row count is?
As said earlier by @MaxU, you can use ignore_index=True: pass it as a parameter after the [dataframe1, dataframe2] list and the result gets a fresh 0..n-1 index instead of repeating the indexes of the original tables.
You can check whether the indexes would be repeated with the parameter verify_integrity=True; it raises a ValueError if they overlap (you never know when you'll have to check).
But be careful, because that check can be a little slow depending on the size of your DataFrame.
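A minimal sketch of that call, assuming the two frames are named df1 and df2 as in the question:

import pandas as pd

df_new = pd.concat([df1, df2], ignore_index=True)   # fresh 0..n-1 index

# Or, to fail loudly if the original indexes overlap instead of keeping them:
# df_new = pd.concat([df1, df2], verify_integrity=True)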

Duplicated rows when merging dataframes in Python

I am currently merging two dataframes with an outer join. However, after merging, I see all the rows are duplicated even when the columns that I merged upon contain the same values.
Specifically, I have the following code.
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
Here are the two dataframes and the results.
df1

          email_address   name surname
0  john.smith@email.com   john   smith
1  john.smith@email.com   john   smith
2       elvis@email.com  elvis presley

df2

          email_address   street city
0  john.smith@email.com  street1   NY
1  john.smith@email.com  street1   NY
2       elvis@email.com  street2   LA

merged_df

          email_address   name surname   street city
0  john.smith@email.com   john   smith  street1   NY
1  john.smith@email.com   john   smith  street1   NY
2  john.smith@email.com   john   smith  street1   NY
3  john.smith@email.com   john   smith  street1   NY
4       elvis@email.com  elvis presley  street2   LA
5       elvis@email.com  elvis presley  street2   LA
My question is, shouldn't it be like this?
This is what I would like my merged_df to look like:
          email_address   name surname   street city
0  john.smith@email.com   john   smith  street1   NY
1  john.smith@email.com   john   smith  street1   NY
2       elvis@email.com  elvis presley  street2   LA
Are there any ways I can achieve this?
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])
The duplicate rows are expected. Each john smith in list_1 matches with each john smith in list_2. I had to drop the duplicates in one of the lists. I chose list_2.
DO NOT drop duplicates BEFORE the merge, but after!
The best solution is to do the merge and then drop the duplicates.
In your case:
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)
