Merging several pandas.DataFrames with specific constraints - python

Let's say I have 3 different pandas DataFrames:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'ID': [20016, 50048, 13478, 68493, 57483],
...                     'Sex': ['F', 'M', 'F', 'F', 'M'],
...                     'Var1': [3, 3, 3, 3, 2],
...                     'Var2': [2, 3, np.nan, 3, 2],
...                     'Var3': [-0.25, 0, 4, np.nan, 0.14]})
>>> df1.set_index('ID')
Sex Var1 Var2 Var3
ID
20016 F 3 2.0 -0.25
50048 M 3 3.0 0.00
13478 F 3 NaN 4.00
68493 F 3 3.0 NaN
57483 M 2 2.0 0.14
The 2nd DF is basically an updated version of DF1: it has more rows, possibly additional columns, and possibly changed values in some of the existing columns, e.g.
>>> df2 = pd.DataFrame({'PERSID': [20016, 50048, 13478, 68493, 57483, 45623],
...                     'Sex': ['F', 'M', 'F', 'F', 'M', 'M'],
...                     'Var1': [3, 1, 3, 3, 2, np.nan],    # value for 50048 changed
...                     'Var2': [3, 3, np.nan, 3, 2, 0],    # value for 20016 changed
...                     'Var3': [-0.25, 0, 4, np.nan, 0.14, 0.28]})
>>> df2.set_index('PERSID')
Sex Var1 Var2 Var3
PERSID
20016 F 3.0 3.0 -0.25
50048 M 1.0 3.0 0.00
13478 F 3.0 NaN 4.00
68493 F 3.0 3.0 NaN
57483 M 2.0 2.0 0.14
45623 M NaN 0.0 0.28
And the last DataFrame, as an example, is rather different, with something like:
SUBJECT Var4 Var5 Var6
200 1640.345 345.0 -0.250000
6700 14236.430 1713.0 -0.050735
6702 1345.400 NaN 0.034450
1330__201805 345.750 335.0 0.140000
4786__201805 NaN 0.0 NaN
And the goal is to merge all 3 DataFrames into one containing all non-redundant information. This means:
if there is a new ID, simply add the row
if there is a new column, add the column
if the exact same ID occurs in two different DFs, the rows need to be merged in such a way that if the cell content is the same, the content of the 2nd DF can be neglected. If, however, the cell content differs, a new column named columnName.y needs to be added and the existing column renamed to columnName.x (a rough sketch of this rule follows below)
Considering only the merge of DF1 and DF2, it should look something like this:
ID Sex_x Var1_x Var2_x Var3 Var1_y Var2_y
20016 F 3.0 2.0 -0.25 NaN 3.0
50048 M 3.0 3.0 0.00 1.0 NaN
13478 F 3.0 NaN 4.00 NaN NaN
68493 F 3.0 3.0 NaN NaN NaN
57483 M 2.0 2.0 0.14 NaN NaN
45623 M NaN NaN 0.28 NaN 0.0
The 3rd DF should then be merged too, which would basically result in only adding its rows and columns.
All cells which are not present in the other DFs should be filled with NaN,
and it would be great if corresponding columns like name.x and name.y ended up next to each other for readability.
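For illustration, here is a rough sketch of the merge rule described above. The helper name merge_keep_diffs is made up for this question, and the NaN-tolerant equality check is an assumption about what "same content" means; it also keeps the full .y column rather than blanking the individual cells that agree:
import numpy as np
import pandas as pd

def merge_keep_diffs(left, right, key):
    # outer merge so new IDs from either side are added as rows
    merged = left.merge(right, on=key, how='outer', suffixes=('.x', '.y'))
    shared = (set(left.columns) & set(right.columns)) - {key}
    for col in shared:
        x, y = f'{col}.x', f'{col}.y'
        # treat equal values, and NaN versus anything, as "same content"
        same = merged[x].eq(merged[y]) | merged[x].isna() | merged[y].isna()
        if same.all():
            # the 2nd DF adds nothing for this column: collapse back to one
            merged[col] = merged[x].combine_first(merged[y])
            merged = merged.drop(columns=[x, y])
    # keep name.x next to name.y for readability
    return merged[[key] + sorted(c for c in merged.columns if c != key)]

# the ID columns are named differently, so align them first
combined = merge_keep_diffs(df1, df2.rename(columns={'PERSID': 'ID'}), key='ID')
The 3rd DF shares no columns with the first two, so it could then be attached with another outer join on the ID.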
I have tried things like pandas merge, join, and concat, and tried to do it by hand, but nothing works the way it needs to.
This is an example how I did the adding of the columns if they are not present:
df_combined = df_1.copy()
for column in df_2:
    if column not in list(df_combined):
        df_combined.insert(len(df_combined.columns), column,
                           value=pd.Series(np.nan),
                           allow_duplicates=False)
frame = [df_combined, df_2]
df_combined = pd.concat(frame)
which is probably already not a good solution.
Thanks for any help on how to implement this!

Related

Create a new column to show the month of last spend in Pandas

I am working with spend data where I want to see the last month in which a spend was made in the current and previous year. If there is no spend in those years, I assume the last month of spend to be Dec-2020.
In my data (not shown here), the months already appear as columns.
I want to create a new column, last_txn_month, which gives the last month in which a spend was made.
Let's say your DataFrame looks like:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, np.nan, 3, np.nan],
                   [10, 11, 12, 13, 14],
                   [101, 102, np.nan, np.nan, np.nan],
                   [110, np.nan, np.nan, 111, np.nan]],
                  columns=[*'abcde'])
Then you could use notna to create a boolean DataFrame, and apply a lambda function that picks, for each row, the name of the last column holding a non-NaN value:
df['last'] = df.notna().apply(lambda x: df.columns[x][-1], axis=1)
Output:
a b c d e last
0 1 NaN NaN 3.0 NaN d
1 10 11.0 12.0 13.0 14.0 e
2 101 102.0 NaN NaN NaN b
3 110 NaN NaN 111.0 NaN d
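As a side note (an alternative not in the original answer), pandas also provides Series.last_valid_index, which returns the label of the last non-NaN entry and can be applied row-wise:
# equivalent result via last_valid_index; restrict to the original
# columns so the new 'last' column is not picked up on re-runs
df['last'] = df[list('abcde')].apply(pd.Series.last_valid_index, axis=1)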

Group latest values in pandas columns for a given id

I have a pandas dataframe containing some metrics for a given date and user.
>>> pd.DataFrame({"user": ['juan','juan','juan','gonzalo'], "date": [1, 2, 3, 1], "var1": [1, 2, None, 1], "var2": [None, 4, 5, 6]})
user date var1 var2
0 juan 1 1.0 NaN
1 juan 2 2.0 4.0
2 juan 3 NaN 5.0
3 gonzalo 1 1.0 6.0
Now, for each user, I want to extract the 2 most recent values for each variable (var1, var2), ignoring NaNs unless there aren't enough values to fill the data.
For reference, this should be the resulting dataframe for the data depicted above
user var1_0 var1_1 var2_0 var2_1
juan 2.0 1.0 5.0 4.0
gonzalo 1.0 NaN 6.0 NaN
each "historical" value is added as a new column with a _0 or _1 suffix.
First sort by both columns with DataFrame.sort_values if necessary, reshape with DataFrame.melt and remove the missing values, keep the most recent 2 rows per group with GroupBy.tail (the data is sorted ascending by date), then create a counter column with GroupBy.cumcount and pivot with DataFrame.pivot, flattening the MultiIndex:
df1 = (df.sort_values(['user', 'date'])
         .melt(id_vars='user', value_vars=['var1', 'var2'])
         .dropna(subset=['value']))
# keep the last 2 (most recent) rows per user/variable group
df1 = df1.groupby(['user', 'variable']).tail(2)
df1['g'] = df1.groupby(['user', 'variable']).cumcount(ascending=False)
df1 = df1.pivot(index='user', columns=['variable', 'g'], values='value')
# older pandas versions:
# df1 = df1.set_index(['user', 'variable', 'g'])['value'].unstack([1, 2])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index()
print(df1)
user var1_0 var1_1 var2_0 var2_1
0 gonzalo 1.0 NaN 6.0 NaN
1 juan 2.0 1.0 5.0 4.0
You could group by user and aggregate to get the 2 most recent values. That gets you almost all the way there - but instead of columns you have a list of elements. If you want the actual 2 columns, you have to split the newly created list into columns. Full code:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "user": ["juan", "juan", "juan", "gonzalo"],
        "date": [1, 2, 3, 1],
        "var1": [1, 2, None, 1],
        "var2": [None, 4, 5, 6],
    }
)

# This almost gets you there
df = (
    df.sort_values(by="date")
    .groupby("user")
    .agg({"var1": lambda x: x.dropna().head(2), "var2": lambda x: x.dropna().head(2)})
)

# Split the columns and get the correct column names
df[["var1_0", "var2_0"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[0] if isinstance(el, np.ndarray) else el for el in row),
    axis=1,
)
df[["var1_1", "var2_1"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[-1] if isinstance(el, np.ndarray) else None for el in row),
    axis=1,
)
print(df)
>>
var1 var2 var1_0 var2_0 var1_1 var2_1
user
gonzalo 1.0 6.0 1.0 6.0 NaN NaN
juan [1.0, 2.0] [4.0, 5.0] 1.0 4.0 2.0 5.0

Combination of two dataframes - still show NaN values

I would like to fill my first dataframe with data from the second dataframe. Since I don't need any special condition, I suppose the combine_first function is the right choice for me.
Unfortunately when I try to combine two dataframes result is still the original dataframe.
My code:
import pandas as pd

df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
                    'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
                    'Gen2': [None, None, 3, None, None]})
df1.combine_first(df2)
Then, when I print(df1), I get df1 exactly as I initialized it above.
Where did I make a mistake?
It works fine if you assign the output back; the very similar method DataFrame.update works in place:
df = df1.combine_first(df2)
print (df)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
df1.update(df2)
print (df1)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
combine_first returns a new DataFrame containing the changes rather than updating the existing one, so you need to assign the returned DataFrame:
df1 = df1.combine_first(df2)

Squeezing pandas DataFrame to have non-null values and modify column names

I have the following sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: Find the non-null value, and append the corresponding index to the column name and then create a new DataFrame
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the Series to a DataFrame, and flatten the MultiIndex; for the correct column order, DataFrame.swaplevel is used together with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1, 0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print (df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0

Python 3 / Pandas | Using Function with IF & ELIF statements to populate a column

Python 3 / Pandas
I am trying to use a function to check the values of various columns in a dataframe, and select only the value from the column that is not NaN.
The data is structured so there is one main column df['C1'] that I want to populate based on the value in one of the next four columns, df['C2'], df['C3'], df['C4'] and df['C5']. When I observe the data, I see that in every row the columns df['C2'] through df['C5'] are all NaN except for one, which has a text value; this holds for all rows in the dataframe. I am trying to write a function that will be applied to the dataframe to find the column which has the text value and copy that value into df['C1'].
Here is the function I wrote:
def get_component(df):
    if ~df['C2'].isna():
        return df['C2']
    elif ~df['C3'].isna():
        return df['C3']
    elif ~df['C4'].isna():
        return df['C4']
    elif ~df['C5'].isna():
        return df['C5']

df['C1'] = df.apply(get_component, axis=1)
But I get the following error:
AttributeError: ("'float' object has no attribute 'isna'", 'occurred at index 0')
Any ideas on how to fix this error so I can achieve this objective? Is there another method to achieve the same result?
Thanks for the help!
Never mind, I figured it out: I just stumbled upon np.where and used the following code to solve the problem:
df['C1'] = np.where(~df['C2'].isna(), df['C2'],
           np.where(~df['C3'].isna(), df['C3'],
           np.where(~df['C4'].isna(), df['C4'],
           np.where(~df['C5'].isna(), df['C5'], None))))
A solution that makes use of pandas' stack method:
import pandas as pd
import numpy as np
# Initialize example dataframe
df = pd.DataFrame({
    "C2": [np.nan, 3, np.nan, np.nan, np.nan],
    "C3": [5, np.nan, np.nan, np.nan, np.nan],
    "C4": [np.nan, np.nan, np.nan, 7, 3],
    "C5": [np.nan, np.nan, 2, np.nan, np.nan],
})
df["C1"] = df.stack().to_numpy()
print(df)
# Output:
# C2 C3 C4 C5 C1
# 0 NaN 5.0 NaN NaN 5.0
# 1 3.0 NaN NaN NaN 3.0
# 2 NaN NaN NaN 2.0 2.0
# 3 NaN NaN 7.0 NaN 7.0
# 4 NaN NaN 3.0 NaN 3.0
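As a side note (not from the original thread), the same one-non-NaN-per-row structure can also be collapsed with a row-wise back-fill; this is a sketch assuming C2 through C5 are the only candidate columns:
# bfill(axis=1) shifts each row's single non-NaN value into the first column
df["C1"] = df[["C2", "C3", "C4", "C5"]].bfill(axis=1).iloc[:, 0]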
