I have two dataframes. The first, df1, has the columns ISIN, Name, Currency, Value, % Weight, Asset Type, and Comments/ Assumptions.
It looks like this:
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NaN Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NaN Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NaN Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NaN Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NaN Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
(241, 7)
The second dataframe, df2, has the columns Short Name and ISIN.
It looks like this:
Short Name ISIN
0 ABU DHABI COMMER AEA000201011
1 ABU DHABI NATION AEA002401015
2 ABU DHABI NATION AEA006101017
3 ADNOC DRILLING C AEA007301012
4 ALPHA DHABI HOLD AEA007601015
(66987, 2)
I wrote some logic that compares Name (from df1) with Short Name (from df2) and, on a match, copies the corresponding ISIN from df2 into df1's ISIN column, which is currently empty.
Here is the code:
def strMergeData(strColumnDf1):
    strColumnDf1 = strColumnDf1.split()[0]
    for strColumnDf2 in df2['Short Name']:
        if strColumnDf1 in strColumnDf2:
            return df2[df2['Short Name'] == strColumnDf2]['ISIN'].values[0]
            break
        else:
            pass
df1['ISIN'] = df1.apply(lambda x: strMergeData(x['Name']),axis=1)
print(df1)
which gives the output as :
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NA Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NA Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NA Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NA Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NA Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
This is not the result I want. Because the logic only checks whether the first word of Name appears in a Short Name, it takes the first occurrence in df2 and returns that row's ISIN, which is often incorrect. For example, for the Name Bank of Scotland the ISIN should be 1324fdd, but 1345o is written instead.
As a result, I wrote new logic using the fuzzywuzzy module. It keeps a match only when it is close enough to Name and leaves the cell empty otherwise. Here is the code:
from fuzzywuzzy import fuzz, process
mat1 = []
mat2 = []
p = []
# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()
# taking the threshold as 93
threshold = 93
# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1
# iterating through the closest matches
# to keep only those at or above the threshold
for j in df1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
print(df1.tail())
and the output that I get is this:
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions matches
236 NaN Partnerre Ltd 4.875% Perp Sr:J USD 1.684069e+05 0.0004 NaN NaN
237 NaN Berkley (Wr) Corporation 5.700% 03/30/58 USD 6.955837e+04 0.0002 NaN NaN
238 NaN Tc Energy Corp Flt Perp Sr:11 USD 6.380262e+04 0.0001 NaN NaN TC ENERGY CORP
239 NaN Cash and Equivalents USD 2.166579e+07 0.0499 NaN NaN
240 NaN AUM NaN 4.338766e+08 0.9999 NaN NaN AUM IND BARC US
This adds a matches column to df1 showing which Short Name (from df2) matches each Name (from df1), but it does not add any ISIN.
How do I add the ISIN from df2 to df1 based on the fuzzywuzzy logic above, so that the new dataframe (df3) looks like this:
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NA Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NA Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NA Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NA Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NA Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
Please help.
One option is to use recordlinkage: https://recordlinkage.readthedocs.io/en/latest/
The code below is a quick hack, so will probably need fixing:
import recordlinkage
# Indexation step
indexer = recordlinkage.Index()
indexer.add(recordlinkage.index.Full())
candidate_links = indexer.index(df1, df2)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('Name', 'Short Name', label='name_similarity', method='jarowinkler', threshold=0.85)
matches = compare_cl.compute(candidate_links, df1, df2)
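Whichever matcher is used, the missing step is to carry the ISIN along with the matched Short Name. A minimal sketch of that step, assuming the standard library's difflib as a stand-in for fuzzywuzzy or recordlinkage; the toy rows, the PLACEHOLDER0001 ISIN, the best_isin helper and the 0.7 cutoff are illustrative assumptions, not part of the original data:

```python
import difflib
import pandas as pd

# Toy stand-ins for df1 and df2 (hypothetical rows; PLACEHOLDER0001 is not a real ISIN)
df1 = pd.DataFrame({
    "ISIN": [None, None, None],
    "Name": ["Tc Energy Corp Flt Perp", "Abu Dhabi Commer Bank", "Cash and Equivalents"],
})
df2 = pd.DataFrame({
    "Short Name": ["TC ENERGY CORP", "ABU DHABI COMMER", "ALPHA DHABI HOLD"],
    "ISIN": ["PLACEHOLDER0001", "AEA000201011", "AEA007601015"],
})

# Map each lower-cased Short Name to its ISIN so the match can be looked up directly
lookup = dict(zip(df2["Short Name"].str.lower(), df2["ISIN"]))
candidates = list(lookup)

def best_isin(name, cutoff=0.7):
    """Return the ISIN of the closest Short Name, or None if nothing clears the cutoff."""
    hit = difflib.get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
    return lookup[hit[0]] if hit else None

df3 = df1.copy()
df3["ISIN"] = df3["Name"].map(best_isin)
print(df3)
```

The same pattern should carry over to the fuzzywuzzy version: keep the matched Short Name, then look its ISIN up in a Short Name to ISIN mapping built from df2.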
Given the following pandas DataFrame:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  NaN      NaN    NaN    NaN
832932  France   NaN    30     NaN
31728#  NaN      NaN    NaN    NaN
I would like to make the following modifications for each row:
If the ID contains a '#' character, the row is left unchanged.
If the ID contains no '#' and country is NaN, country is set to "Other" and other is set to 0.
Finally, only if money is NaN and other has a value, money and money_add are filled in from the following table, matching other against other_ID:
other_ID  money  money_add
19        4532   723823
50        1213   238232
18        1813   273283
30        1313   83293
0         8932   3920
Example of the resulting table:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  Other    8932   0      3920
832932  France   1313   30     83293
31728#  NaN      NaN    NaN    NaN
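First, set both columns from a list for the rows that satisfy both conditions. Then index df by other, masking the rows whose ID contains '#' so that they cannot match, and call DataFrame.update with overwrite=False so that only missing values in matched rows are filled (here df1 is the lookup table):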
m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
df.loc[~m1 & m2, ['country','other']] = ['Other',0]
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1 217#8# NaN NaN NaN NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
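The three rules can also be written as explicit boolean masks with Series.map for the lookup, which avoids relying on index alignment. This is a sketch of an alternative technique rather than the method above; it assumes ID is stored as strings, and the name lookup is introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["832932", "217#8#", "1329T2", "832932", "31728#"],
    "country": ["France", np.nan, np.nan, "France", np.nan],
    "money": [12131, np.nan, np.nan, np.nan, np.nan],
    "other": [19, np.nan, np.nan, 30, np.nan],
    "money_add": [82932, np.nan, np.nan, np.nan, np.nan],
})
lookup = pd.DataFrame({
    "other_ID": [19, 50, 18, 30, 0],
    "money": [4532, 1213, 1813, 1313, 8932],
    "money_add": [723823, 238232, 273283, 83293, 3920],
}).set_index("other_ID")

# Rule 1: rows whose ID contains '#' are never touched
has_hash = df["ID"].str.contains("#")

# Rule 2: no '#' and missing country -> country "Other", other 0
no_country = ~has_hash & df["country"].isna()
df.loc[no_country, "country"] = "Other"
df.loc[no_country, "other"] = 0

# Rule 3: money missing but other present -> look both values up by other
fill = ~has_hash & df["money"].isna() & df["other"].notna()
for col in ["money", "money_add"]:
    df.loc[fill, col] = df.loc[fill, "other"].astype(int).map(lookup[col])

print(df)
```

Row 0 keeps its original money and money_add because its money value is not missing, so rule 3 does not apply to it.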
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
from datetime import datetime
def extract_source(url):
    agent = {"User-Agent":"Mozilla/5.0"}
    source=requests.get(url, headers=agent).text
    return source
html_text = extract_source('https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')
for a in soup.find_all('a', class_ = 'button button--link button--fluid catalog-list-item__actions-primary-button', href=True):
    # print ("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml')
    table_rows = soup.find_all('tr')
    first_columns = []
    third_columns = []
    for row in table_rows:
        # for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])
    for first, third in zip(first_columns, third_columns):
        print(first.text, third.text)
Basically, I am trying to scrape data from tables on multiple pages of a website and save it all in one CSV file. Each page has a table in the following format:
SKU 07DE9922
Analyte / Target Corticosterone
Base Catalog Number DE9922
Diagnostic Platforms EIA/ELISA
Diagnostic Solutions Endocrinology
Disease Screened Corticosterone
Evaluation Quantitative
Pack Size 96 Wells
Sample Type Plasma, Serum
Sample Volume 10 uL
Species Reactivity Mouse, Rat
Usage Statement For Research Use Only, not for use in diagnostic procedures.
I want to convert it to the format below in the Excel file:
SKU Analyte/Target Base Catalog Number Pack Size Sample Type
data data data data data
I am having difficulty converting the data into the proper format.
I made a small modification to your code. Instead of printing the data, I created a dictionary for each page and added it to a list. Then I used this list to create a DataFrame:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
from datetime import datetime
def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source
html_text = extract_source(
    "https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids"
)
soup = BeautifulSoup(html_text, "lxml")
data = []
for a in soup.find_all(
    "a",
    class_="button button--link button--fluid catalog-list-item__actions-primary-button",
    href=True,
):
    urlof = a["href"]
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, "lxml")
    table_rows = soup.find_all("tr")
    first_columns = []
    third_columns = []
    for row in table_rows:
        first_columns.append(row.findAll("td")[0])
        third_columns.append(row.findAll("td")[1])
    # create dictionary with values and add to the list
    d = {}
    for first, third in zip(first_columns, third_columns):
        d[first.get_text(strip=True)] = third.get_text(strip=True)
    data.append(d)
df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
SKU Alternate Names Base Catalog Number CAS # EC Number Format Molecular Formula Molecular Weight Personal Protective Equipment Usage Statement Application Notes Beilstein Registry Number Optical Rotation Purity UV Visible Absorbance Hazard Statements RTECS Number Safety Symbol Auto Ignition Biochemical Physiological Actions Density Melting Point pH pKa Solubility Vapor Pressure Grade Boiling Point Isoelectric Point
0 02100078-CF 2-Acetamido-5-Guanidinovaleric acid 100078 210545-23-6 205-846-6 Powder C8H16N4O3· 2H2O 216.241 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 02100142-CF Acetyltryptophan; DL-α-Acetylamino-3-indolepro... 100142 87-32-1 201-739-3 Powder C13H14N2O3 246.266 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... N-acetyl-DL-tryptophan, is used as stabilizer ... 89478 0° ± 2° (c=1, 1N NaOH, 24 hrs.) ~99% λ max (water)=280 ± 2 nm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 02100421-CF L-2,5-Diaminopentanoic acid; 2,5-Diaminopentan... 100421 3184-13-2 221-678-6 Powder C5H12N2O2 • HCl 168.621 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 3625847 NaN ~99% NaN H319 RM2985000 GHS07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 02100520-CF Phosphocreatine Disodium Salt Tetrahydrate; So... 100520 922-32-7 213-074-6 Powder C4H8N3Na2O5P·4H2O 255.077 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN ≥98% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 02100769-CF Vitamin C; Ascorbate; Sodium ascorbate; L-Xylo... 100769 50-81-7 200-066-2 NaN C6H8O6 176.124 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... L-Ascorbic Acid is used as an Antimicrobial an... 84272 +18° to +32° (c=1, water) ≥98% NaN NaN CI7650000 NaN 1220° F (NTP, 1992) Ascorbic Acid, also known as Vitamin C, is a s... 1.65 (NTP, 1992) 374 to 378° F (decomposes) (NTP, 1992) Between 2,4 and 2,8 (2 % aqueous solution) pK1: 4.17; pK2: 11.57 greater than or equal to 100 mg/mL at 73° F (... 9.28X10-11 mm Hg at 25 deg C (est) NaN NaN NaN
5 02101003-CF Lycine; Oxyneurine; (Carboxymethyl)trimethylam... 101003 107-43-7 203-490-6 Powder C5H11NO2 117.148 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Betaine is a reagent that is used in soldering... 3537113 NaN NaN NaN NaN DS5900000 NaN NaN End-product of oxidative metabolism of choline... NaN Decomposes around 293 deg C NaN 1.83 (Lit.) Solubility (g/100 g solvent): <a class="pubche... 1.36X10-8 mm Hg at 25 deg C (est) Anhydrous NaN NaN
6 02101806-CF (S)-2,5-Diamino-5-oxopentanoic acid; L-Glutami... 101806 56-85-9 200-292-1 NaN C5H10N2O3 146.146 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... L-glutamine is an essential amino acid, which ... 1723797 +30 ± 5° (c = 3.5, 1N HCl) ≥99% NaN NaN MA2275100 NaN NaN L-Glutamine is an essential amino acid that is... 1.364 g/cu cm 185.5 dec °C pH = 5-6 at 14.6 g/L at 25 deg C NaN Water Solubility41300 mg/L (at 25 °C) 1.9X10-8 mm Hg at 25 deg C (est) NaN NaN NaN
7 02102158-CF L-2-Amino-4-methylpentanoic acid; Leu; L; α-am... 102158 61-90-5 200-522-0 Powder C6H13NO2 131.175 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Leucine has been used as a molecular marker in... 1721722 +14.5 ° to +16.5 ° (Lit.) NaN NaN NaN OH2850000 NaN NaN NaN 1.293 g/cu cm at 18 deg C 293 °C NaN 2.33 (-COOH), 9.74 (-NH2)(Lit.) Water Solubility21500 mg/L (at 25 °C) 5.52X10-9 mm Hg at 25 deg C (est) NaN Sublimes at 145-148 deg C. Decomposes at 293-2... 6.04(Lit.)
8 02102576-CF 4-Hydroxycinnamic acid; 3-(4-Hydroxphenyl)-2-p... 102576 7400-08-0 231-000-0 Powder C9H8O3 164.16 g/mol Dust mask, Eyeshields, Gloves Unless specified otherwise, MP Biomedical's pr... p-Coumaric acid was used as a substrate to stu... NaN NaN ≥98% NaN NaN GD9094000 NaN NaN NaN NaN 211.5 °C NaN NaN NaN NaN NaN NaN NaN
9 02102868-CF DL-2-Amino-3-hydroxypropionic acid; (±)-2-Amin... 102868 302-84-1 206-130-6 Powder C3H7NO3 105.093 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 1721405 -1° to + 1° (c = 5, 1N HCl) ≥98% NaN NaN NaN NaN NaN NMDA agonist acting at the glycine site; precu... 1.6 g/cu cm # 22 deg C 228 deg C (decomposes) NaN NaN SOL IN <a class="pubchem-internal-link CID-962... NaN NaN NaN NaN
...and so on.
And saves data.csv (screenshot from LibreOffice):
Try this -
import pandas as pd
import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup
def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    return requests.get(url, headers=agent).text
html_text = extract_source(
    'https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')
result = []
for index, a in enumerate(soup.find_all('a', class_='button button--link button--fluid catalog-list-item__actions-primary-button', href=True)):
    # if index >= 10:
    #     break
    # print ("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml')
    table_rows = soup.find_all('tr')
    first_columns = []
    third_columns = []
    for row in table_rows:
        # for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])
    temp = {}
    for first, third in zip(first_columns, third_columns):
        third = str(third.text).strip('\n')
        first = str(first.text).strip('\n')
        temp[first] = third
    result.append(temp)
df = pd.DataFrame(result)
# please drop useless column before saving the output to csv.
df.to_csv('out.csv', index=False)
This will give output -
SKU | Alternate Names | Base Catalog Number | CAS # | EC Number | Format | Molecular Formula | Molecular Weight | Personal Protective Equipment | Usage Statement | Application Notes | Beilstein Registry Number | Optical Rotation | Purity | UV Visible Absorbance | Hazard Statements | RTECS Number | Safety Symbol
02100078-CF | 2-Acetamido-5-Guanidinovaleric acid | 100078 | 210545-23-6 | 205-846-6 | Powder | C8H16N4O3· 2H2O | 216.241 g/mol | Eyeshields, Gloves, respirator filter | Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. | | | | | | | |
02100142-CF | Acetyltryptophan; DL-α-Acetylamino-3-indolepropionic acid | 100142 | 87-32-1 | 201-739-3 | Powder | C13H14N2O3 | 246.266 g/mol | Eyeshields, Gloves, respirator filter | Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. | N-acetyl-DL-tryptophan, is used as stabilizer in the human blood-derived therapeutic products normal serum albumin and plasma protein fraction. | 89478 | 0° ± 2° (c=1, 1N NaOH, 24 hrs.) | ~99% | λ max (water)=280 ± 2 nm | | |
02100421-CF | L-2,5-Diaminopentanoic acid; 2,5-Diaminopentanoic acid monohydrochloride | 100421 | 3184-13-2 | 221-678-6 | Powder | C5H12N2O2 • HCl | 168.621 g/mol | Eyeshields, Gloves, respirator filter | Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department. | | 3625847 | | ~99% | | H319 | RM2985000 | GHS07
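Because each page lists a different set of properties, the combined DataFrame ends up with many sparse columns. A small sketch of keeping only the columns of interest before saving, using DataFrame.reindex; the two records are cut-down rows from the output above, and the wanted list (including "Pack Size", which these pages do not provide) is an assumption for illustration:

```python
import io
import pandas as pd

# One dict per scraped page; keys differ from page to page
records = [
    {"SKU": "02100078-CF", "Base Catalog Number": "100078", "Format": "Powder"},
    {"SKU": "02100769-CF", "Base Catalog Number": "100769", "Purity": "≥98%"},
]
df = pd.DataFrame(records)  # missing keys become NaN

# reindex keeps the listed columns in this order and creates
# an all-NaN column for any name that was never scraped
wanted = ["SKU", "Base Catalog Number", "Format", "Pack Size"]
out = df.reindex(columns=wanted)

# Write to an in-memory buffer here; pass a file path instead to save to disk
buffer = io.StringIO()
out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Using reindex rather than df[wanted] means a column that is absent from every page does not raise a KeyError.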
I have two dataframes, as given below:
import pandas as pd
restaurant = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx")
restaurant
Restaurant StartYear Capex inflation_adjusted_capex
Bawarchi Restaurant 1986 6000 Nan
Ks Baker's 1988 2000 Nan
Rajesh Restaurant 1989 1050 Nan
Ahmed Steak House 1990 9000 Nan
Absolute Barbique 1997 9500 Nan
inflation = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx", sheet_name="Sheet2")
inflation
Years Inflation_Factor
1985 0.111
1986 0.134
1987 0.191
1988 0.2253
1989 0.265
1990 0.304
Aim: fill "inflation_adjusted_capex" with "Capex" divided by the "Inflation_Factor" of the corresponding year from the second dataframe.
The code I wrote is:
for i in restaurant["StartYear"]:
    restaurant["inflation_adjusted_capex"] = (restaurant["inflation_adjusted_capex"])/(inflation[inflation["Years"] == i]["Inflation_Factor"])
print(restaurant["inflation_adjusted_capex"])
0 Nan
1 Nan
2 Nan
3 Nan
4 Nan
Name: Inflation adjusted Capex to current year, dtype: float64
Unfortunately this code is returning Nan values, kindly help me. Thanks in advance.
There are a couple of ways to do this. The first is to merge the dataframes so that the inflation factors are in the first dataframe, and then do the calculation:
#add Inflation_Factor column to first dataframe
#(the default inner merge drops rows whose StartYear has no match; pass how='left' to keep them)
restaurant = restaurant.merge(inflation, left_on='StartYear', right_on='Years')
#do division
restaurant['inflation_adjusted_capex'] = restaurant['Capex']/restaurant['Inflation_Factor']
The other is to apply a function that behaves like an Excel VLOOKUP:
#set year as index for inflation so we can look up based on it
inflation = inflation.set_index('Years')
#look up inflation factor and divide with a lambda function
#(raises KeyError if a StartYear is missing from inflation)
restaurant['inflation_adjusted_capex'] = restaurant.apply(lambda row: row['Capex']/inflation['Inflation_Factor'][row['StartYear']], axis=1)
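A third variant of the same lookup idea, sketched here with Series.map on the sample data from the question. It is vectorised and leaves NaN where a start year (1997 in the sample) has no inflation factor; the intermediate name factor is introduced only for this example:

```python
import pandas as pd

restaurant = pd.DataFrame({
    "Restaurant": ["Bawarchi Restaurant", "Ks Baker's", "Rajesh Restaurant",
                   "Ahmed Steak House", "Absolute Barbique"],
    "StartYear": [1986, 1988, 1989, 1990, 1997],
    "Capex": [6000, 2000, 1050, 9000, 9500],
})
inflation = pd.DataFrame({
    "Years": [1985, 1986, 1987, 1988, 1989, 1990],
    "Inflation_Factor": [0.111, 0.134, 0.191, 0.2253, 0.265, 0.304],
})

# Series indexed by year: map() looks each StartYear up in it,
# giving NaN for years that are not present
factor = restaurant["StartYear"].map(inflation.set_index("Years")["Inflation_Factor"])
restaurant["inflation_adjusted_capex"] = restaurant["Capex"] / factor
print(restaurant)
```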
This is very similar to the question I asked yesterday. The aim is to add functionality that creates a column depending on the values found in another column. For example, when the script finds a country code in a specified file, I would like it to create a column named 'Country Code Total' (for example 'AR Total') and sum the units for every row with that country code.
This is what my script outputs at the moment:
What i want to see:
My Script:
df['Sum of Revenue'] = df['Units Sold'] * df['Dealer Price']
df['AR Revenue'] = df[]
df = df.sort_values(['End Consumer Country', 'Currency Code'])
# Sets first value of index by position
df.loc[df.index[0], 'Unit Total'] = df['Units Sold'].sum()
# Sets first value of index by position
df.loc[df.index[0], 'Total Revenue'] = df['Sum of Revenue'].sum()
# Sums the amout of Units with the End Consumer Country AR
df['AR Total'] = df.loc[df['End Consumer Country'] == 'AR', 'Units Sold'].sum()
# Sums the amount of Units with the End Consumer Country AU
df['AU Total'] = df.loc[df['End Consumer Country'] == 'AU', 'Units Sold'].sum()
# Sums the amount of Units with the End Consumer Country NZ
df['NZ Total'] = df.loc[df['End Consumer Country'] == 'NZ', 'Units Sold'].sum()
However, since I know which countries will come up in this file, I have hard-coded them in my script. How would I write the script so that if it finds another country code, for example GB, it creates a column called 'GB Total' and sums the units for every row with the country code GB?
Any help would be greatly appreciated!
If you truly need that format, then here is how I would proceed (starting data below):
# Get those first two columns
d = {'Sum of Revenue': 'Total Revenue', 'Units Sold': 'Total Sold'}
for col, newcol in d.items():
    df.loc[df.index[0], newcol] = df[col].sum()
# Add the rest for every country:
s = df.groupby('End Consumer Country')['Units Sold'].sum().to_frame().T.add_suffix(' Total')
s.index = [df.index[0]]
df = pd.concat([df, s], axis=1, sort=False)
Output: df:
End Consumer Country Sum of Revenue Units Sold Total Revenue Total Sold AR Total AU Total NZ Total US Total
a AR 13.486216 1 124.007334 28.0 3.0 7.0 11.0 7.0
b AR 25.984073 2 NaN NaN NaN NaN NaN NaN
c AU 21.697871 3 NaN NaN NaN NaN NaN NaN
d AU 10.962232 4 NaN NaN NaN NaN NaN NaN
e NZ 16.528398 5 NaN NaN NaN NaN NaN NaN
f NZ 29.908619 6 NaN NaN NaN NaN NaN NaN
g US 5.439925 7 NaN NaN NaN NaN NaN NaN
As you can see, pandas added a bunch of NaN values as we only assigned something to the first row, and a DataFrame must be rectangular
It's far simpler to have a separate DataFrame that summarizes the overall totals and the totals within each country. If this is fine, then everything simplifies to a single .pivot_table call:
df.pivot_table(index='End Consumer Country',
               values=['Sum of Revenue', 'Units Sold'],
               margins=True,
               aggfunc='sum').T.add_suffix(' Total')
Output:
End Consumer Country AR Total AU Total NZ Total US Total All Total
Sum of Revenue 39.470289 32.660103 46.437018 5.439925 124.007334
Units Sold 3.000000 7.000000 11.000000 7.000000 28.000000
Same information, much simpler to code.
Sample data:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'End Consumer Country': ['AR', 'AR', 'AU', 'AU', 'NZ', 'NZ', 'US'],
                   'Sum of Revenue': np.random.normal(20,6,7),
                   'Units Sold': np.arange(1,8,1)},
                  index = list('abcdefg'))
End Consumer Country Sum of Revenue Units Sold
a AR 13.486216 1
b AR 25.984073 2
c AU 21.697871 3
d AU 10.962232 4
e NZ 16.528398 5
f NZ 29.908619 6
g US 5.439925 7
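To see that the groupby step picks up country codes that were not known in advance, here is a short sketch with an extra GB row added to the unit counts of the sample data. The GB row and its value of 8 are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "End Consumer Country": ["AR", "AR", "AU", "AU", "NZ", "NZ", "US", "GB"],
    "Units Sold": [1, 2, 3, 4, 5, 6, 7, 8],  # the GB row is a made-up addition
})

# One total per country code actually present in the data;
# add_suffix turns the group labels into '<code> Total'
totals = df.groupby("End Consumer Country")["Units Sold"].sum().add_suffix(" Total")
print(totals.to_dict())
```

No country code appears in the code itself, so a new code in the file simply produces one more entry.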
I have four Excel files that I have to merge into one Excel file:
Demography file containing ID, Initials, Age, and Sex.
Laboratory file containing ID, Initials, Test name, Test date, and Test Value.
Medical History file containing ID, Initials, Medical condition, Start and Stop Dates.
Medication given file containing ID, Initials, Drug name, dose, frequency, start and stop dates.
There are 50 patients. The demography file contains one row for each of the 50 patients. The other files cover the same 50 patients but have between 100 and 400 rows, because each patient has multiple lab tests or multiple drugs.
When I merge in pandas, I get duplicates or entries assigned to the wrong patients. The challenge is to merge in such a way that, where a patient has more medications than lab tests, the lab test columns are left blank instead of repeating the lab tests.
This is a shortened representation:
import pandas as pd
lab = pd.read_excel('data/data.xlsx', sheet_name='lab')
drugs = pd.read_excel('data/data.xlsx', sheet_name='drugs')
merged_data = pd.merge(drugs, lab, on='ID', how='left')
merged_data.to_excel('merged_data.xls')
You get this result: Pandas merge result
I would prefer this result: Prefered output
Consider using cumcount() on a groupby() to number the rows within each ID, and then merge on both ID and that counter:
drugs['GrpCount'] = (drugs.groupby(['ID'])).cumcount()
lab['GrpCount'] = (lab.groupby(['ID'])).cumcount()
merged_data = pd.merge(drugs, lab, on=['ID', 'GrpCount'], how='left').drop(['GrpCount'], axis=1)
# ID Initials_x Drug Name Frequency Route Start Date End Date Initials_y Name Result Date Result
# 0 1 AB AMPICLOX NaN Oral 21-Jun-2016 21-Jun-2016 AB Rapid Diagnostic Test 30-May-16 Abnormal
# 1 1 AB CIPROFLOXACIN Daily Oral 30-May-2016 03-Jun-2016 AB Microscopy 30-May-16 Normal
# 2 1 AB Ibuprofen Tablet 400 mg Two Times a Day Oral 06-Oct-2016 10-Oct-2016 NaN NaN NaN NaN
# 3 1 AB COARTEM NaN Oral 17-Jun-2016 17-Jun-2016 NaN NaN NaN NaN
# 4 1 AB INJECTABLE ARTESUNATE 12 Hourly Intravenous 01-Jun-2016 02-Jun-2016 NaN NaN NaN NaN
# 5 1 AB COTRIMOXAZOLE Daily Oral 30-May-2016 12-Jun-2016 NaN NaN NaN NaN
# 6 1 AB METRONIDAZOLE Two Times a Day Oral 30-May-2016 03-Jun-2016 NaN NaN NaN NaN
# 7 2 SS GENTAMICIN Daily Intravenous 04-Jun-2016 04-Jun-2016 SS Microscopy 6-Jun-16 Abnormal
# 8 2 SS METRONIDAZOLE 8 Hourly Intravenous 04-Jun-2016 06-Jun-2016 SS Complete Blood Count 6-Oct-16 Recorded
# 9 2 SS Oral Rehydration Salts Powder PRN Oral 06-Jun-2016 06-Jun-2016 NaN NaN NaN NaN
# 10 2 SS ZINC 8 Hourly Oral 06-Jun-2016 06-Jun-2016 NaN NaN NaN NaN
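A left merge keeps every drug row but would drop lab rows when a patient has more lab tests than drugs. A minimal sketch of the same cumcount() idea with how='outer', which keeps both sides, followed by fillna('') to blank out the gaps; the two small frames are cut-down, illustrative versions of the data above:

```python
import pandas as pd

drugs = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Drug Name": ["AMPICLOX", "CIPROFLOXACIN", "COARTEM", "GENTAMICIN"],
})
lab = pd.DataFrame({
    "ID": [1, 2, 2],
    "Name": ["Microscopy", "Microscopy", "Complete Blood Count"],
})

# Number the rows within each patient: 0, 1, 2, ...
drugs["GrpCount"] = drugs.groupby("ID").cumcount()
lab["GrpCount"] = lab.groupby("ID").cumcount()

# Pair the n-th drug with the n-th lab test of the same patient;
# an outer merge keeps unmatched rows from both frames
merged = (
    pd.merge(drugs, lab, on=["ID", "GrpCount"], how="outer")
    .drop(columns="GrpCount")
    .fillna("")  # blank cells instead of NaN
)
print(merged)
```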