reshape a dataframe with internal headers as new columns - python

I have a data frame as below:
col value
0 companyId 123456
1 company_name small company
2 department IT
3 employee_name Jack
4 rank Grade 8
5 department finance
6 employee_name Tim
7 rank Grade 6
and I would like the data frame reshaped into a tabular format, ideally like this:
companyId company_name department employee_name rank
0 123456 small company IT Jack Grade 8
1 123456 small company finance Tim Grade 6
Can anyone help me, please? Thanks.

You could reshape your data by making two assumptions:
1. the companies are identified by header rows, and all subsequent rows are data for employees of that company
2. there is a fixed first item that starts each employee record (here department)

headers = ['companyId', 'company_name']
first_item = 'department'

masks = {h: df['col'].eq(h) for h in headers}
df2 = (df
       # move the headers into new columns
       .assign(**{h: df['value'].where(m).ffill().bfill() for h, m in masks.items()})
       # and drop their rows
       .loc[~pd.concat(masks, axis=1).any(axis=1)]
       # compute a unique identifier per employee
       .assign(idx=lambda d: d['col'].eq(first_item).cumsum())
       # pivot the data
       .pivot(index=['idx'] + headers, columns='col', values='value')
       .reset_index(headers)
)
output:
companyId company_name department employee_name rank
1 123456 small company IT Jack Grade 8
2 123456 small company finance Tim Grade 6
Example on a more complex input:
col value
0 companyId 123456
1 company_name small company
2 department IT
3 employee_name Jack
4 rank Grade 8
5 department finance
6 employee_name Tim
7 rank Grade 6
8 companyId 67890
9 company_name other company
10 department IT
11 employee_name Jane
12 rank Grade 9
13 department management
14 employee_name Tina
15 rank Grade 12
output:
companyId company_name department employee_name rank
1 123456 small company IT Jack Grade 8
2 123456 small company finance Tim Grade 6
3 67890 other company IT Jane Grade 9
4 67890 other company management Tina Grade 12

Related

Python Multiple key in a text file

My Data is as below:
Name: Joe
Age: 26
Property: 1 of 3
Item : Car
Make: Toyota
Model: Corolla
Year:2006
Property: 2 of 3
Item : House
Address : new Street
Cost : 20000
Property: 3 of 3
Item: Stocks
Investment: 1000
Name: Blogg
Age: 28
Property: 1 of 2
Item : Bike
BikeMake: Harley
BikeModel: IronRod
BikeYear:2018
Property: 2 of 2
Item: Stocks
Investment: 2000
I need the result to look like below (blank cells where a record has no value for that column; one row per Property):

Name   Age  Property  Item    Make    Model    Year  Address     Cost   Investment  BikeMake  BikeModel  BikeYear
Joe    26   1 of 3    Car     Toyota  Corolla  2006
Joe    26   2 of 3    House                          new Street  20000
Joe    26   3 of 3    Stocks                                            1000
Blogg  28   1 of 2    Bike                                                          Harley    Ironrod    2018
Blogg  28   2 of 2    Stocks                                            2000
My code is currently:

for line in t:
    print(line)
    key, _, value = line.partition(": ")
    if not value:  # separator was not found
        value = "NA"
    if "Name" in key:
        stuff[index] = {"Reference": [value]}  # Always use a list as value
        current_key = index
        index += 1
    elif key not in stuff[current_key]:  # If key does not exist
        stuff[current_key][key] = [value]  # Create key with value in a list.
    else:
        stuff[current_key][key].append(value)
My current results are being pivoted by the Name key, e.g.:

Name   Age  Property              Item              Make    Model    Year  Address     Cost   Investment  BikeMake  BikeModel  BikeYear
Joe    26   1 of 3,2 of 3,3 of 3  Car,House,Stocks  Toyota  Corolla  2006  New Street  20000  1000
Blogg  28   1 of 2,2 of 2         Bike,Stocks                                                 2000        Harley    IronRod    2018
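No answer is quoted here for this related question; one possible sketch (the input text is built inline below rather than read from a file) is to start a fresh record at each Property line, inheriting the current person's Name and Age:

```python
import pandas as pd

text = """Name: Joe
Age: 26
Property: 1 of 3
Item : Car
Make: Toyota
Model: Corolla
Year:2006
Property: 2 of 3
Item : House
Address : new Street
Cost : 20000
Property: 3 of 3
Item: Stocks
Investment: 1000
Name: Blogg
Age: 28
Property: 1 of 2
Item : Bike
BikeMake: Harley
BikeModel: IronRod
BikeYear:2018
Property: 2 of 2
Item: Stocks
Investment: 2000"""

person, records = {}, []
for line in text.splitlines():
    key, _, value = line.partition(":")
    key, value = key.strip(), value.strip()
    if key == "Name":                     # a new person starts here
        person = {"Name": value}
    elif key == "Age":
        person["Age"] = value
    elif key == "Property":               # a new record, inheriting Name/Age
        records.append({**person, "Property": value})
    else:                                 # any other field belongs to the current record
        records[-1][key] = value

df = pd.DataFrame(records)               # missing fields become NaN
```

Each record then becomes one row, with NaN in the columns that record never set.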

Reshape python dataframe

I have a dataframe like this:
description
Brian
No.22
Tel:+00123456789
email:brain#email.com
Sandra
No:43
Tel:+00312456789
Michel
No:593
Kent
No:13
Engineer
Tel:04512367890
email:kent#yahoo.com
and I want it like this:

name    address  designation  telephone         email
Brian   No:22    null         Tel:+00123456789  email:brain#email.com
Sandra  No:43    null         Tel:+00312456789  null
Michel  No:593   null         null              null
Kent    No:13    Engineer     Tel:04512367890   email:kent#yahoo.com
How can I do this in Python?
Use np.select to label each row, then pivot your dataframe.
Step 1.
import numpy as np

condlist = [df['description'].shift(fill_value='').eq(''),
            df['description'].str.contains('^No[:.]'),
            df['description'].str.startswith('Tel:'),
            df['description'].str.startswith('email:')]
choicelist = ['name', 'address', 'telephone', 'email']
df['column'] = np.select(condlist, choicelist, default='designation')
print(df)
# Output:
description column
0 Brian name
1 No.22 address
2 Tel:+00123456789 telephone
3 email:brain#email.com email
4 designation
5 Sandra name
6 No:43 address
7 Tel:+00312456789 telephone
8 designation
9 Michel name
10 No:593 address
11 designation
12 Kent name
13 No:13 address
14 Engineer designation
15 Tel:04512367890 telephone
16 email:kent#yahoo.com email
Step 2. Now remove empty rows and create an index to allow the pivot:
df = df[df['description'].ne('')].assign(index=df['column'].eq('name').cumsum())
print(df)
# Output:
description column index
0 Brian name 1
1 No.22 address 1
2 Tel:+00123456789 telephone 1
3 email:brain#email.com email 1
5 Sandra name 2
6 No:43 address 2
7 Tel:+00312456789 telephone 2
9 Michel name 3
10 No:593 address 3
12 Kent name 4
13 No:13 address 4
14 Engineer designation 4
15 Tel:04512367890 telephone 4
16 email:kent#yahoo.com email 4
Step 3. Pivot your dataframe:
cols = ['name', 'address', 'designation', 'telephone', 'email']
out = df.pivot(index='index', columns='column', values='description')[cols] \
        .rename_axis(index=None, columns=None)
print(out)
# Output:
name address designation telephone email
1 Brian No.22 NaN Tel:+00123456789 email:brain#email.com
2 Sandra No:43 NaN Tel:+00312456789 NaN
3 Michel No:593 NaN NaN NaN
4 Kent No:13 Engineer Tel:04512367890 email:kent#yahoo.com
Edit
There is an error at the final step: "ValueError: Index contains duplicate entries, cannot reshape". How can I overcome this?
There is no magic to solve this, because your data are messy. The designation label is the fallback used when a row matched none of name, address, telephone or email, so there is a good chance you have multiple rows labelled designation for the same person.
At the end of the labelling step, check whether you have duplicates (person/label -> index/column) with this command:
df.value_counts(['index', 'column']).loc[lambda x: x > 1]
Hopefully the output only flags the designation label under the column column, unless one person can have several telephone or email rows. You can then adjust the condlist to catch as many patterns as possible. I don't know anything about your data, so I can't help you much more than that.
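If the duplicates turn out to be genuine (e.g. several designation rows for one person), one workaround, a sketch rather than a fix for the labelling itself, is to aggregate them during the pivot with pivot_table instead of pivot:

```python
import pandas as pd

# hypothetical labelled data with two 'designation' rows for the same person
df = pd.DataFrame({
    'description': ['Kent', 'No:13', 'Engineer', 'Senior', 'Tel:04512367890'],
    'column': ['name', 'address', 'designation', 'designation', 'telephone'],
    'index': [1, 1, 1, 1, 1],
})

# pivot_table tolerates duplicate (index, column) pairs by aggregating them;
# here duplicates are joined into one comma-separated string
out = df.pivot_table(index='index', columns='column', values='description',
                     aggfunc=', '.join)
```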

Given one dataframe with country codes and country names, and another with phone numbers of different lengths (11 to 15 digits), find the country code for each phone number

I have 2 data frames.
1st data frame, country codes (country name & country code):

Country Name    Country Code  Country Code Length
India           91            2
Nepal           977           3
American Samoa  1             1

2nd data frame, user details (username, phone numbers with country code):

User name  Phone Numbers with country code  Phone number length
Jay        919988665500                     12
XYZ        9771234665500                    13
abc        12233445500                      11
cvv        9779988665500                    13
I need a final table like the one below:

User name  Clean Phone Number  Country code  Country Name
Jay        91-9988665500       91            India
XYZ        977-1234665500      977           Nepal
abc        1-2233445500        1             American Samoa
cvv        977-9988665500      977           Nepal
My Python script, which is not giving me the right output:

for number in df['Phone']:
    for code in country_df['Country Code']:
        if code in number[:4]:
            df["new_no"] = f"{code}-{number[len(code_to_check):]}"
            df['Country'] = country_df['Country']
        elif code not in number[:4]:
            df["new_no"] = df['Phone']
            df['Country'] = country_df['Country']
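This question carries no answer in the thread; one possible approach (a sketch, with column names assumed from the sample tables) is to try the codes longest-first, so overlapping prefixes such as '1' vs a longer code resolve to the longest match:

```python
import pandas as pd

# sample frames from the question (column names assumed)
country_df = pd.DataFrame({
    'Country Name': ['India', 'Nepal', 'American Samoa'],
    'Country Code': ['91', '977', '1'],
})
df = pd.DataFrame({
    'User name': ['Jay', 'XYZ', 'abc', 'cvv'],
    'Phone': ['919988665500', '9771234665500', '12233445500', '9779988665500'],
})

code_to_name = dict(zip(country_df['Country Code'], country_df['Country Name']))

def split_code(number):
    # test longer codes first so e.g. '977' is checked before '91' and '1'
    for code in sorted(code_to_name, key=len, reverse=True):
        if number.startswith(code):
            return code, number[len(code):]
    return None, number        # no known prefix

parts = df['Phone'].apply(split_code)            # Series of (code, rest) tuples
df['Country code'] = parts.str[0]
df['Clean Phone Number'] = df['Country code'] + '-' + parts.str[1]
df['Country Name'] = df['Country code'].map(code_to_name)
```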

Pandas - Expand table based on different email with same key from another table

I have a quick one that I am struggling with.
Table 1 has a lot of user information in addition to an email column and a unique ID column.
Table 2 has only a unique ID column and an email column. These emails can be different from table 1, but do not have to be.
I am attempting to merge them such that table 1 expands only to include new rows when there is a new email from table 2 on the same unique id.
Example:
Table 1:
id email first_name last_name
1 jo# joe king
2 john# johnny maverick
3 Tom# Tom J
Table 2:
id email
2 johnmk#
3 TomT#
8 Jared#
Desired Output:
id email first_name last_name
1 jo# joe king
2 john# johnny maverick
2 johnmk# johnny maverick
3 Tom# Tom J
3 TomT# Tom J
I would have expected pd.merge(table1, table2, on = 'id', how = 'left') to do this, but this just generates the email columns with the suffix _x, _y.
How can I make the merge?
IIUC, you can try pd.concat with a boolean mask using isin on df2, followed by groupby.ffill:
out = pd.concat((df1,df2[df2['id'].isin(df1['id'])]),sort=False)
out.update(out.groupby("id").ffill())
out = out.sort_values("id")#.reset_index(drop=True)
id email first_name last_name
0 1 jo# joe king
1 2 john# johnny maverick
0 2 johnmk# johnny maverick
2 3 Tom# Tom J
1 3 TomT# Tom J
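A self-contained sketch of the same approach, built from the sample tables (emails shortened as in the question); ignore_index keeps the index unique so update aligns cleanly, and a stable sort keeps the df1 row before the df2 row within each id:

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'email': ['jo#', 'john#', 'Tom#'],
    'first_name': ['joe', 'johnny', 'Tom'],
    'last_name': ['king', 'maverick', 'J'],
})
df2 = pd.DataFrame({'id': [2, 3, 8], 'email': ['johnmk#', 'TomT#', 'Jared#']})

# keep only df2 rows whose id exists in df1, stack them under df1,
# then fill the missing name columns from the df1 row with the same id
out = pd.concat((df1, df2[df2['id'].isin(df1['id'])]), sort=False,
                ignore_index=True)
out.update(out.groupby('id').ffill())
out = out.sort_values('id', kind='stable').reset_index(drop=True)
```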

Creating a pandas column from a dictionary of regular expressions

I want to create a column which shows the data type of the data within an Excel spreadsheet, i.e. whether the data within any given cell is a string, an integer, a float, etc. Currently I'm working with mocked-up data to test with, and I hope to eventually use this for larger Excel files with more field headers.
My current high-level method is as follows:
1. Read the Excel file and create a dataframe.
2. Re-format this table to create a column of all the data I wish to label with a data type (i.e. whether it is a string, integer or float), alongside the respective field headers.
3. Create a 'Data Type' column containing these labels for each piece of data, populated by matching against a dictionary of regular expressions.
import os
from glob import glob
import pandas as pd
from os import path
import re
sample_file = 'C:/Users/951297/Documents/Python Scripts/DD\\Fund_Data.xlsx'
dataf = pd.read_excel(sample_file)
dataf
FUND ID FUND NAME AMOUNT
0 10101 Holdings company A 10000
1 20202 Holdings company B 2000.5
2 30303 Holdings company C 3000
# Create column list of data attributes
stackdf= dataf.stack().reset_index()
stackdf = stackdf.rename(columns={'level_0':'index','level_1':'fh',0:'attribute'})
# Create a duplicate column of attribute to apply regex
stackdf_regex = stackdf.iloc[:,2:].rename(columns = {'attribute':'Data Type'})
# Dictionary of regex to replace values within the 'Data Type' column depending on the attribute
repl_dict = {re.compile(r'^[\d]+$'): 'Integer',
             re.compile(r'^[a-zA-Z0-9_ ]*$'): 'String',
             re.compile(r'[\d]+\.'): 'Float'}
# concatenate tables
pd.concat([stackdf, stackdf_regex], axis=1)
This is the reformatted table I wish to apply my regular expressions onto:
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A Holdings company A
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B Holdings company B
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C Holdings company C
8 2 AMOUNT 3000 3000
This is the desired output:
index fh attribute Data Type
0 0 FUND ID 10101 Integer
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Integer
3 1 FUND ID 20202 Integer
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Integer
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Integer
However the following code produces the table below:
stackdf_regex = stackdf_regex.replace({'Data Type':repl_dict}, regex=True)
pd.concat([stackdf, stackdf_regex], axis=1)
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 3000
Perhaps my regular expressions are incorrect or my understanding is lacking in applying the regular expressions on the dataframe. Happy to receive any suggestions on this current method or another suitable/efficient method I have not considered.
Note: I hope to eventually expand the regex dictionary to account for more data types and I understand it may not be efficient to check every cell for a pattern for larger datasets but I'm still in the early stages.
You can use np.select, where each condition tests a regex against the Data Type column with Series.str.contains, and choices maps each condition to its label:
import numpy as np

s = df['Data Type'].astype(str)  # cast to str so the regexes also see numeric cells
conditions = [
    s.str.contains(r'^\d+$'),
    s.str.contains(r'^[\w\s]+$'),
    s.str.contains(r'^\d+\.\d+$')]
choices = ['Integer', 'String', 'Float']
df['Data Type'] = np.select(conditions, choices, default=None)
# print(df)
index fh attribute Data Type
0 0 FUND ID 10101 Integer
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Integer
3 1 FUND ID 20202 Integer
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Integer
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Integer
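For completeness, a self-contained sketch of the labelling step; the attribute column is built inline instead of being stacked from Excel, and it is cast to str first since such a column is typically mixed-type:

```python
import numpy as np
import pandas as pd

# mixed-type column as it would come out of the stacked Excel frame
df = pd.DataFrame({
    'attribute': [10101, 'Holdings company A', 10000,
                  20202, 'Holdings company B', 2000.5,
                  30303, 'Holdings company C', 3000],
})

s = df['attribute'].astype(str)            # regexes need strings
conditions = [s.str.contains(r'^\d+$'),    # digits only -> Integer (checked first)
              s.str.contains(r'^[\w\s]+$'),
              s.str.contains(r'^\d+\.\d+$')]
choices = ['Integer', 'String', 'Float']
df['Data Type'] = np.select(conditions, choices, default=None)
```

Note that np.select takes the first matching condition, so the stricter integer pattern must come before the broad word/space pattern.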
