I'm trying to enrich a dataframe with data collected from an API.
So, I'm going like this:
for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k = get_financials_hnwi(df.cnpj[i])  # this is my API requesting function, working fine
        df = df.merge(k, on=["cnpj"], how="left")  # here is my problem <-------------------------------
Since I'm running that merge inside a for loop, it keeps appending suffixes (_x, _y). So I found this alternative here:
Pandas: merge dataframes without creating new columns
for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k = get_financials_hnwi(df.cnpj[i])  # this is my requesting function, working fine
        val = np.intersect1d(df.cnpj, k.cnpj)
        df_temp = pd.concat([df, k], ignore_index=True)
        df = df_temp[df_temp.cnpj.isin(val)]
However it creates a new df, killing the original index, so the line if pd.isnull(df.cnpj[i]) == True: can no longer run correctly.
Is there a nice way to run a merge/join/concat inside a for loop without creating new columns with _x and _y? Or is there a way to combine the _x and _y columns afterwards, getting rid of the suffixes and condensing everything into a single column? I just want one column with all of it.
Sample data and reproducible code
df=pd.DataFrame({'cnpj':[12,32,54,65],'co_name':['Johns Market','T Bone Gril','Superstore','XYZ Tech']})
#first API request:
k=pd.DataFrame({'cnpj':[12],'average_revenues':[687],'years':['2019,2018,2017']})
df=df.merge(k,on="cnpj", how='left')
#second API request:
k=pd.DataFrame({'cnpj':[32],'average_revenues':[456],'years':['2019,2017']})
df=df.merge(k,on="cnpj", how='left')
#third API request:
k=pd.DataFrame({'cnpj':[53],'average_revenues':[None],'years':[None]})
df=df.merge(k,on="cnpj", how='left')
#fourth API request:
k=pd.DataFrame({'cnpj':[65],'average_revenues':[4142],'years':['2019,2018,2015,2013,2012']})
df=df.merge(k,on="cnpj", how='left')
print(df)
Result:
cnpj co_name average_revenues_x years_x average_revenues_y \
0 12 Johns Market 687.0 2019,2018,2017 NaN
1 32 T Bone Gril NaN NaN 456.0
2 54 Superstore NaN NaN NaN
3 65 XYZ Tech NaN NaN NaN
years_y average_revenues_x years_x average_revenues_y \
0 NaN None None NaN
1 2019,2017 None None NaN
2 NaN None None NaN
3 NaN None None 4142.0
years_y
0 NaN
1 NaN
2 NaN
3 2019,2018,2015,2013,2012
Desired result:
cnpj co_name average_revenues years
0 12 Johns Market 687.0 2019,2018,2017
1 32 T Bone Gril 456.0 2019,2017
2 54 Superstore None None
3 65 XYZ Tech 4142.0 2019,2018,2015,2013,2012
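For the second half of the question (condensing existing _x/_y pairs into one column), one possible sketch: strip the merge suffixes to recover each column's base name, then take the first non-null value per row within each base-name group. This assumes at most one non-null value per row in each suffix group, as in the sample data:

```python
import pandas as pd

# a frame that already carries merge suffixes, like the Result above
df = pd.DataFrame({
    'cnpj': [12, 32],
    'average_revenues_x': [687.0, None],
    'average_revenues_y': [None, 456.0],
})

# strip the merge suffixes to get each column's base name
base = df.columns.str.replace(r'(_x|_y)$', '', regex=True)
# for each base name, take the first non-null value in each row
condensed = df.T.groupby(base).first().T
print(condensed)
```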
As you're joining on a single column and mapping values, we can take advantage of the cnpj column and set it as the index; we can then use combine_first, update, or map to add your values into your dataframe.
This assumes k will look like the frame below. If not, just update the function to return a dictionary that you can use with map.
cnpj average_revenues years
0 12 687 2019,2018,2017
Let's hold this in a tidy function:
def update_api_call(dataframe, api_call):
    if dataframe.index.name != 'cnpj':
        dataframe = dataframe.set_index('cnpj')
    return dataframe.combine_first(
        api_call.set_index('cnpj')
    )
Assuming your k variables are numbered k1 to k4 for our test:
df1 = update_api_call(df,k1)
print(df1)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 NaN T Bone Gril NaN
54 NaN Superstore NaN
65 NaN XYZ Tech NaN
df2 = update_api_call(df1,k2)
print(df2)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 456.0 T Bone Gril 2019,2017
54 NaN Superstore NaN
65 NaN XYZ Tech NaN
df3 = update_api_call(df2, k3)
df4 = update_api_call(df3, k4)
print(df4)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 456.0 T Bone Gril 2019,2017
53 NaN NaN NaN
54 NaN Superstore NaN
65 4142.0 XYZ Tech 2019,2018,2015,2013,2012
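The map alternative mentioned above could look roughly like this; the reshaping into a per-column lookup keyed by cnpj is my own sketch, not code from the answer:

```python
import pandas as pd

df = pd.DataFrame({'cnpj': [12, 32, 54, 65],
                   'co_name': ['Johns Market', 'T Bone Gril', 'Superstore', 'XYZ Tech']})

# pretend this frame came back from one API call
k = pd.DataFrame({'cnpj': [12], 'average_revenues': [687], 'years': ['2019,2018,2017']})

# build a per-column lookup keyed by cnpj, then map it onto df
lookup = k.set_index('cnpj')
for col in lookup.columns:
    mapped = df['cnpj'].map(lookup[col])
    if col in df:
        # later calls: only fill gaps, keep earlier results
        df[col] = df[col].fillna(mapped)
    else:
        df[col] = mapped

print(df)
```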
Related
Given the following pandas DataFrame:
       ID country  money  other money_add
0  832932  France  12131     19     82932
1  217#8#     NaN    NaN    NaN       NaN
2  1329T2     NaN    NaN    NaN       NaN
3  832932  France    NaN     30       NaN
4  31728#     NaN    NaN    NaN       NaN
I would like to make the following modifications for each row:
If the ID column contains a '#' character, the row is left unchanged.
If the ID column has no '#' and country is NaN, "Other" is added to the country column and 0 to the other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following lookup table (matching other against other_ID):
   other_ID  money  money_add
0        19   4532     723823
1        50   1213     238232
2        18   1813     273283
3        30   1313      83293
4         0   8932       3920
Example of the resulting table:
       ID country  money  other money_add
0  832932  France  12131     19     82932
1  217#8#     NaN    NaN    NaN       NaN
2  1329T2   Other   8932      0      3920
3  832932  France   1313     30     83293
4  31728#     NaN    NaN    NaN       NaN
First set values in both columns where the row matches both conditions, then align both frames on the lookup key and update only the matched rows with DataFrame.update:
m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
df.loc[~m1 & m2, ['country','other']] = ['Other',0]
df1 = df1.set_index(df1['other_ID'])  # df1 is the lookup table above
df = df.set_index(df['other'].mask(m1))
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print(df)
       ID country    money  other  money_add
0  832932  France    12131   19.0    82932.0
1  217#8#     NaN      NaN    NaN        NaN
2  1329T2   Other   8932.0    0.0     3920.0
3  832932  France   1313.0   30.0    83293.0
4  31728#     NaN      NaN    NaN        NaN
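As a self-contained illustration of the DataFrame.update(overwrite=False) step, here is a toy version with made-up frames (only the money column is shared between them, so only it gets filled):

```python
import pandas as pd
import numpy as np

# toy stand-ins: df has gaps in money, df1 is the lookup table
df = pd.DataFrame({'other': [19.0, 30.0, np.nan],
                   'money': [12131.0, np.nan, np.nan]})
df1 = pd.DataFrame({'other_ID': [19.0, 30.0], 'money': [4532.0, 1313.0]})

# align both frames on the lookup key
df = df.set_index(df['other'])
df1 = df1.set_index(df1['other_ID'])

# overwrite=False fills only the cells that are NaN in df
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print(df)
```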
I have a dataframe that looks like this:
   ID  Name   Major1    Major2   Major3
0  12  Dave  English       NaN      NaN
1  12  Dave      NaN   Biology      NaN
2  12  Dave      NaN       NaN  History
3  13  Nate  Spanish       NaN      NaN
4  13  Nate      NaN  Business      NaN
I need to merge rows resulting in this:
   ID  Name   Major1    Major2   Major3
0  12  Dave  English   Biology  History
1  13  Nate  Spanish  Business      NaN
I know this is possible with groupby but I haven't been able to get it to work correctly. Can anyone help?
If you are intent on using groupby, you could do something like this:
dataframe = dataframe.melt(['ID', 'Name']).dropna()
dataframe = dataframe.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')
You may have to mess with the column names a bit, but this is what comes to me as a possible solution using groupby.
Use melt and pivot
>>> df.melt(['ID', 'Name']).dropna() \
       .pivot(index=['ID', 'Name'], columns='variable', values='value') \
       .reset_index().rename_axis(columns=None)
ID Name Major1 Major2 Major3
0 12 Dave English Biology History
1 13 Nate Spanish Business NaN
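If you don't need the reshaped intermediate at all, a shorter route to the same result is groupby with first(), which takes the first non-null value per column within each group (a sketch on the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [12, 12, 12, 13, 13],
    'Name': ['Dave', 'Dave', 'Dave', 'Nate', 'Nate'],
    'Major1': ['English', None, None, 'Spanish', None],
    'Major2': [None, 'Biology', None, None, 'Business'],
    'Major3': [None, None, 'History', None, None],
})

# first() skips missing values, so each group collapses to one row
out = df.groupby(['ID', 'Name'], as_index=False).first()
print(out)
```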
I have extracted this data frame from an Excel spreadsheet using the pandas library. After getting the needed columns, I have a table formatted like this:
REF PLAYERS
0 103368 Andrés Posada Sanmiguel
1 300552 Diego Posada Sanmiguel
2 103304 Roberto Motta Stanziola
3 NaN NaN
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 REF PLAYERS
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 REF PLAYERS
I want to create multiple dataframes, converting each [['REF', 'PLAYERS']] header row into the columns of a new dataframe.
Suggestions are welcome. I also need to preserve the blank spaces. A pandas newbie here.
For this to work, you must first read the dataframe from the file differently: set the argument header=None in your pd.read_excel() call. Right now your columns are called "REF" and "PLAYERS", but we want to group by those values.
With header=None the first column name will be 0, and the first line of the solution becomes (where df is the name of your dataframe):
# Set unique index for each group
df["group_id"] = (df[0] == "REF").cumsum()
Solution:
# Set unique index for each group
df["group_id"] = (df["name_of_first_column"] == "REF").cumsum()

# Iterate over groups
dataframes = []
for name, group in df.groupby("group_id"):
    df_ = group
    # promote 1st row to column names
    df_.columns = df_.iloc[0]
    # and drop it
    df_ = df_.iloc[1:]
    # keep only the two data columns
    df_ = df_[["REF", "PLAYERS"]]
    # append to the list of dataframes
    dataframes.append(df_)
All your multiple dataframes are now stored in an array dataframes.
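The grouping trick can be run end-to-end on toy data (the values below are illustrative, not the asker's full spreadsheet):

```python
import pandas as pd

# columns are 0 and 1, as read with header=None
raw = pd.DataFrame([['REF', 'PLAYERS'],
                    ['103368', 'Andres Posada Sanmiguel'],
                    ['300552', 'Diego Posada Sanmiguel'],
                    ['REF', 'PLAYERS'],
                    ['1047012', 'ANABELLA EISMANN DE AMAYA']])

# every "REF" row starts a new group
raw['group_id'] = (raw[0] == 'REF').cumsum()

dataframes = []
for _, group in raw.groupby('group_id'):
    part = group.drop(columns='group_id')
    part.columns = part.iloc[0]  # promote the header row to column names
    part = part.iloc[1:]         # then drop it
    dataframes.append(part)

print(len(dataframes))
```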
You can split your dataframe into equal lengths (in your case 4 rows for each df) using np.split.
Since you want 4 rows per dataframe, you can split it into 5 different dataframes:
import numpy as np
dfs = [df.loc[idx] for idx in np.split(df.index,5)]
And then create your individual dataframes:
df1 = dfs[1]
df1
REF PLAYERS
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
df2 = dfs[2]
df2
REF PLAYERS
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN
I want to add "NSW" to the end of each town name in a pandas data frame. The dataframe currently looks like this:
0 Parkes NaN
1 Forbes NaN
2 Yanco NaN
3 Orange NaN
4 Narara NaN
5 Wyong NaN
I need every town to also have the word NSW added to it
Try with (note the leading space so the names come out as "Parkes NSW"):
df['Name'] = df['Name'] + ' NSW'
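One thing worth knowing: vectorised string concatenation propagates missing values, so any NaN town names stay NaN rather than becoming "nan NSW". A small sketch (with an illustrative None entry):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Parkes', 'Forbes', None]})

# vectorised string concatenation; missing names stay NaN
df['Name'] = df['Name'] + ' NSW'
print(df)
```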
I have 4 Excel files that I have to merge into one Excel file.
Demography file containing ID, Initials, Age, and Sex.
Laboratory file containing ID, Initials, Test name, Test date, and Test Value.
Medical History containing ID, Initials, Medical condition, Start and Stop Dates.
Medication given containing ID, Initials, Drug name, dose, frequency, start and stop dates.
There are 50 patients. The demography file contains all 50 rows for the 50 patients. The rest of the files cover the same 50 patients but run between 100 and 400 rows, because each patient has multiple lab tests or multiple drugs.
When I merge in pandas, I get duplicates, or entries assigned to the wrong patients. The challenge is to do this in a way such that, where a patient has more medications than lab tests, the extra rows show blanks for the lab-test fields instead of duplicated values.
This is a shortened representation:
import pandas as pd
lab = pd.read_excel('data/data.xlsx', sheet_name='lab')
drugs = pd.read_excel('data/data.xlsx', sheet_name='drugs')
merged_data = pd.merge(drugs, lab, on='ID', how='left')
merged_data.to_excel('merged_data.xls')
You get this result: Pandas merge result
I would prefer this result: Preferred output
Consider using cumcount() on a groupby() and then merging on both that field and ID:
drugs['GrpCount'] = (drugs.groupby(['ID'])).cumcount()
lab['GrpCount'] = (lab.groupby(['ID'])).cumcount()
merged_data = pd.merge(drugs, lab, on=['ID', 'GrpCount'], how='left').drop(['GrpCount'], axis=1)
# ID Initials_x Drug Name Frequency Route Start Date End Date Initials_y Name Result Date Result
# 0 1 AB AMPICLOX NaN Oral 21-Jun-2016 21-Jun-2016 AB Rapid Diagnostic Test 30-May-16 Abnormal
# 1 1 AB CIPROFLOXACIN Daily Oral 30-May-2016 03-Jun-2016 AB Microscopy 30-May-16 Normal
# 2 1 AB Ibuprofen Tablet 400 mg Two Times a Day Oral 06-Oct-2016 10-Oct-2016 NaN NaN NaN NaN
# 3 1 AB COARTEM NaN Oral 17-Jun-2016 17-Jun-2016 NaN NaN NaN NaN
# 4 1 AB INJECTABLE ARTESUNATE 12 Hourly Intravenous 01-Jun-2016 02-Jun-2016 NaN NaN NaN NaN
# 5 1 AB COTRIMOXAZOLE Daily Oral 30-May-2016 12-Jun-2016 NaN NaN NaN NaN
# 6 1 AB METRONIDAZOLE Two Times a Day Oral 30-May-2016 03-Jun-2016 NaN NaN NaN NaN
# 7 2 SS GENTAMICIN Daily Intravenous 04-Jun-2016 04-Jun-2016 SS Microscopy 6-Jun-16 Abnormal
# 8 2 SS METRONIDAZOLE 8 Hourly Intravenous 04-Jun-2016 06-Jun-2016 SS Complete Blood Count 6-Oct-16 Recorded
# 9 2 SS Oral Rehydration Salts Powder PRN Oral 06-Jun-2016 06-Jun-2016 NaN NaN NaN NaN
# 10 2 SS ZINC 8 Hourly Oral 06-Jun-2016 06-Jun-2016 NaN NaN NaN NaN
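The cumcount alignment can be seen on a tiny toy pair of frames (the names and values here are illustrative):

```python
import pandas as pd

drugs = pd.DataFrame({'ID': [1, 1, 1, 2], 'Drug': ['A', 'B', 'C', 'D']})
lab = pd.DataFrame({'ID': [1, 1, 2], 'Test': ['T1', 'T2', 'T3']})

# number each patient's rows 0, 1, 2, ... in both frames
drugs['GrpCount'] = drugs.groupby('ID').cumcount()
lab['GrpCount'] = lab.groupby('ID').cumcount()

# pair the n-th drug row with the n-th lab row per patient;
# extra drug rows get NaN lab fields instead of duplicated values
merged = (pd.merge(drugs, lab, on=['ID', 'GrpCount'], how='left')
            .drop('GrpCount', axis=1))
print(merged)
```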