I have a DataFrame with multiple columns, some of which are equivalent (same key at the trailing end, e.g. column1 = 'a/first', column2 = 'b/first'). I want to merge these pairs of columns. Please help me solve this problem.
My Dataframe looks like
name g1/column1 g1/column2 g1/g2/column1 g2/column2
AAAA 10 20 nan nan
AAAA nan nan 30 40
My expected result is as follows
name g1/column1 g1/column2
AAAA 10 20
AAAA 30 40
Thanks in advance
Use:
# set the column that should not be merged as the index
df = df.set_index('name')
# build a MultiIndex by splitting each column name on the last '/'
df.columns = df.columns.str.rsplit('/', n=1, expand=True)
# aggregate the first non-NaN value per second level of the MultiIndex
df = df.groupby(level=1, axis=1).first()
print (df)
column1 column2
name
AAAA 10.0 20.0
AAAA 30.0 40.0
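Note that groupby(..., axis=1) is deprecated in newer pandas releases. A sketch of a possible replacement for that last step, grouping the transposed frame instead (same assumptions as above, only checked against this sample):
# group rows of the transposed frame on the second index level,
# take the first non-NaN value per group, then transpose back
df = df.T.groupby(level=1).first().T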
You can use df.combine_first:
col1 = ['g1/column1', 'g1/column2']
col2 = ['g1/g2/column1', 'g2/column2']
df[col1] = df[col1].combine_first(pd.DataFrame(df[col2].values, columns=col1))
df = df.drop(col2, axis=1)
print(df)
# name g1/column1 g1/column2
#0 AAAA 10.0 20.0
#1 AAAA 30.0 40.0
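For reference, the same idea can be sketched with set_axis, relabeling the second group of columns instead of rebuilding a DataFrame from raw values (assuming the row indices of both column sets already line up):
# relabel col2 as col1 so combine_first aligns the columns
df[col1] = df[col1].combine_first(df[col2].set_axis(col1, axis=1))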
One possible solution:
df = pd.DataFrame([[10, 20, np.nan, np.nan],
                   [np.nan, np.nan, 30, 40]],
                  columns=['g1/column1', 'g1/column2', 'g1/g2/column1', 'g2/column2'])
df
g1/column1 g1/column2 g1/g2/column1 g2/column2
0 10.0 20.0 NaN NaN
1 NaN NaN 30.0 40.0
df = df.fillna(0) # <- replacing all NaN with 0
ndf = pd.DataFrame()
unique_cols = ['column1', 'column2']
for i in range(len(unique_cols)):
    val = df.columns[df.columns.str.contains(unique_cols[i])]
    ndf[val[0]] = df.loc[:, val].sum().reset_index(drop=True)
ndf # <- You can add index if you need (AAAA, AAAA)
g1/column1 g1/column2
0 10.0 20.0
1 30.0 40.0
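The explicit loop can also be sketched as a dict comprehension with a row-wise sum per group of sibling columns, reusing the loop's naming logic (untested beyond this sample):
ndf = pd.DataFrame({
    df.columns[df.columns.str.contains(c)][0]: df.filter(like=c).sum(axis=1)
    for c in unique_cols
})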
import pandas as pd
import numpy as np
g1 = [20, np.nan, 30, np.nan]
g1_2 = [10, np.nan, 20, np.nan]
g2 = [np.nan, 30, np.nan, 40]
g2_2 = [np.nan, 10, np.nan, 30]
dataList = list(zip(g1, g1_2, g2, g2_2))
df = pd.DataFrame(data = dataList, columns=['g1/column1', 'g1/column2', 'g1/g2/column1', 'g2/column2'])
df.fillna(0, inplace=True)
df['g1Combined'] = df['g1/column1'] + df['g1/g2/column1']
df['g2Combined'] = df['g1/column2'] + df['g2/column2']
df.drop(['g1/column1', 'g1/column2', 'g1/g2/column1', 'g2/column2'],
        axis=1, inplace=True)
df
I want to add a series of columns whose values are determined by date offsets from the current year, based on a selection of boolean columns (in this case y0, y1, y2, y3).
Consider the following dataframe
import pandas as pd
import numpy as np
# Raw Data
years = ["2000", "2001", "2002", "2003"]
num_combos = len(years)
products = ["A"] * num_combos
bools = [True, False, True, False]
bools1 = [False, True, False, np.nan]
bools2 = [True, False, np.nan, np.nan]
bools3 = [False, np.nan, np.nan, np.nan]
values = [100, 97, 80, np.nan]
cols = {"years": years,
"products": products,
"y0": bools,
"y1": bools1,
"y2": bools2,
"y3": bools3,
"value": values}
df = pd.DataFrame(cols)
df[["y0", "y1", "y2", "y3"]] = df[["y0", "y1", "y2", "y3"]].astype(float)
Consider the year 2000
y0 is 1 so the value at year 2000 (value_0) is 100
y1 is 0 so the value at year 2000 one year into the future (value_1) is NaN
y2 is 1 so the value two years into the future from 2000 (value_2) is the value at 2002, which is 80, etc.
This would yield the following dataframe.
df["value_0"] = [100, np.nan, 80, np.nan]
df["value_1"] = [np.nan, 80, np.nan, np.nan]
df["value_2"] = [80, np.nan, np.nan, np.nan]
df["value_3"] = [np.nan, np.nan, np.nan, np.nan]
Is there a clever way of determining these columns using apply or np.where? (or alternative)
With the dataframe df you provided, here is one way to do it using Pandas shift, concat, and apply:
# Setup
counter = range(df.shape[0])
# Add new columns and rows
temp_df = pd.DataFrame(
    data=[df["value"].shift(-i) for i in counter],
)
temp_df.columns = [f"value_{i}" for i in counter]
temp_df.index = [i for i in counter]
df = pd.concat([df, temp_df], axis=1)
# Update values according to "y0", "y1", ... columns
for i in counter:
    df[f"value_{i}"] = df.apply(
        lambda x: x[f"value_{i}"] if x[f"y{i}"] else None, axis=1
    )
print(df)
# Output
years products y0 y1 y2 y3 value value_0 value_1 value_2 \
0 2000 A 1.0 0.0 1.0 0.0 100.0 100.0 NaN 80.0
1 2001 A 0.0 1.0 0.0 NaN 97.0 NaN 80.0 NaN
2 2002 A 1.0 0.0 NaN NaN 80.0 80.0 NaN NaN
3 2003 A 0.0 NaN NaN NaN NaN NaN NaN NaN
value_3
0 NaN
1 NaN
2 NaN
3 NaN
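As an aside, since the question mentions np.where: the apply loop can, I believe, be replaced by a vectorized mask with Series.where, reusing counter from the setup above (a sketch, only checked against this sample):
# keep the value shifted i years ahead only where y{i} is 1
for i in counter:
    df[f"value_{i}"] = df["value"].shift(-i).where(df[f"y{i}"] == 1)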
Hello, I have the following dataframe
df = pd.DataFrame(data={'grade_1': ['A','B','C'],
                        'grade_1_count': [19, 28, 32],
                        'grade_2': ['pass', 'fail', np.nan],
                        'grade_2_count': [39, 18, np.nan]})
where some grades are missing and need to be inserted into the grade_n column according to the values in this dictionary
grade_dict = {'grade_1': ['A','B','C','D','E','F'],
              'grade_2': ['pass','fail','not present','borderline']}
and the corresponding row value in the _count column should be filled with np.nan
so the expected output is like this
expected_df = pd.DataFrame(data={'grade_1': ['A','B','C','D','E','F'],
                                 'grade_1_count': [19, 28, 32, 0, 0, 0],
                                 'grade_2': ['pass','fail','not present','borderline', np.nan, np.nan],
                                 'grade_2_count': [39, 18, 0, 0, np.nan, np.nan]})
So far I have this rather inelegant code that creates a column containing all the correct grade categories, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nans just reflect empty cells caused by coercing columns with different row lengths). I hope that makes sense; any advice would be great. Thanks.
x = []
for k, v in grade_dict.items():
    out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
    x = pd.concat([out], axis=1)
    x[k] = x.index
    x = x.reset_index(drop=True)
    df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
# merge
(df2.merge(df.filter(like='grade_1'),
           on='grade_1', how='left')
    .merge(df.filter(like='grade_2'),
           on='grade_2', how='left')
    .sort_index(axis=1)
)
output:
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D NaN borderline NaN
4 E NaN None NaN
5 F NaN None NaN
Generalized to multiple merges:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
    df2 = df2.merge(df.filter(like=col),
                    on=col, how='left')
df2
If you only need to merge on grade_1 without updating the non-NaNs of grade_2, you can cast grade_dict into a df and then use combine_first:
print (df.set_index("grade_1").combine_first(pd.DataFrame(grade_dict.values(),
index=grade_dict.keys()).T.set_index("grade_1"))
.fillna({"grade_1_count": 0}).reset_index())
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D 0.0 borderline NaN
4 E 0.0 None NaN
5 F 0.0 None NaN
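As a side note, the transposed helper frame above can also be built with from_dict, which pads the unequal-length lists with missing values in the same way (a sketch, equivalent as far as I can tell):
helper = pd.DataFrame.from_dict(grade_dict, orient='index').T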
I have the following sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: find the non-null value, append the corresponding index to the column name, and then create a new DataFrame
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the resulting Series to a one-row DataFrame, and then flatten the MultiIndex; for the correct column order, use DataFrame.swaplevel with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1,0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print (df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Let's say that I have the following data-frame:
df = pd.DataFrame({"unique_id": [1, 1, 1], "att1_amr": [11, 11, 11], "att2_nominal": [1, np.nan, np.nan], "att3_nominal": [np.nan, 1, np.nan], "att4_bok": [33.33, 33.33, 33.33], "att5_nominal": [np.nan, np.nan, np.nan], "att6_zpq": [22.22, 22.22, 22.22]})
What I want to do is group the rows of the data-frame by unique_id, such that I can apply one group-by operation to the columns that contain the word nominal and a separate one to all others. To be more specific, I want to aggregate the columns that contain nominal using sum(min_count=1) and the others with first() or last(). The result should be the following:
df_result = pd.DataFrame({"unique_id": [1], "att1_amr": [11], "att2_nominal": [1],
                          "att3_nominal": [1], "att4_bok": [33.33],
                          "att5_nominal": [np.nan], "att6_zpq": [22.22]})
Thank you!
You can create the dictionary dynamically: first map all columns containing nominal to a lambda function, then map all other columns to 'last', merge the two dictionaries together, and finally call DataFrameGroupBy.agg:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('nominal')],
                   lambda x: x.sum(min_count=1))
d2 = dict.fromkeys(df.columns.difference(['unique_id'] + list(d1)), 'last')
d = {**d1, **d2}
df = df.groupby('unique_id').agg(d)
print (df)
att2_nominal att3_nominal att5_nominal att1_amr att4_bok \
unique_id
1 1.0 1.0 NaN 11 33.33
att6_zpq
unique_id
1 22.22
Another, cleaner solution:
d = {k: (lambda x: x.sum(min_count=1))
     if 'nominal' in k
     else 'last'
     for k in df.columns.difference(['unique_id'])}
df = df.groupby('unique_id').agg(d)
print (df)
att1_amr att2_nominal att3_nominal att4_bok att5_nominal \
unique_id
1 11 1.0 1.0 33.33 NaN
att6_zpq
unique_id
1 22.22
Why not just:
>>> df.ffill().bfill().drop_duplicates()
att1_amr att2_nominal att3_nominal att4_bok att5_nominal att6_zpq \
0 11 1.0 1.0 33.33 NaN 22.22
unique_id
0 1
>>>
The solution provided by @jezrael works just fine and is the most elegant one; however, I ran into severe performance issues. Surprisingly, I found the following to be a much faster solution that achieves the same goal.
nominal_cols = df.filter(like="nominal").columns.values
other_cols = [col for col in df.columns.values if col not in nominal_cols and col != "unique_id"]
df1 = df.groupby('unique_id', as_index=False)[nominal_cols].sum(min_count=1)
df2 = df.groupby('unique_id', as_index=False)[other_cols].first()
pd.merge(df1, df2, on=["unique_id"], how="inner")
I am trying to update a DataFrame
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
by another DataFrame
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]}).
Now, my aim is to update df1 by df2 and overwrite all values (NaN values too) using
df1.update(df2)
In contrast to the common use case, it is important to me that the NaN values end up in df1.
But as far as I can see, the update returns
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
Is there a way to get
>>> df1
A B
0 1 9
1 2 NaN
2 3 11
3 4 NaN
without building df1 manually?
I am late to the party, but I was recently confronted with the same issue, i.e. trying to update a dataframe without ignoring NaN values the way the Pandas built-in update method does.
For two dataframes sharing the same column names, a workaround would be to concatenate both dataframes and then remove duplicates, only keeping the last entry:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [9, np.nan, 11, np.nan]})
frames = [df1, df2]
df_concatenated = pd.concat(frames)
df1=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]
Depending on indexing, it might be necessary to sort the indices of the output dataframe:
df1=df1.sort_index()
To address your very specific example, in which df2 does not have a column A, you could run:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
frames = [df1, df2]
df_concatenated = pd.concat(frames)
df1['B']=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]['B']
It also works fine for me. You could perhaps use np.nan instead of 'nan'?
I guess you meant [9, np.nan, 11, np.nan], not the string "nan".
If it's not mandatory to use update(), you can do df1.B = df2.B instead, so that the new df1.B will contain NaN.
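A minimal sketch of that direct assignment:
df1['B'] = df2['B']  # plain column assignment keeps df2's NaNs, unlike update()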
DataFrame.update() only updates non-NA values. See docs
Approach 1: Drop all affected columns
I achieved this by dropping the new columns and joining the data from the replacement DataFrame:
df1 = df1.drop(columns=df2.columns).join(df2)
This tells Pandas to remove the columns from df1 that you're about to recreate using the values from df2. Note that the column order changes since the new columns are appended to the end.
Approach 2: Preserve column order
Loop over all columns in the replacement DataFrame, inserting affected columns in the target DataFrame in their original place after dropping the original. If the replacement DataFrame includes a column not in the target DataFrame, it will be appended to the end.
for col in df2.columns:
    try:
        col_pos = list(df1.columns).index(col)
        df1.drop(columns=[col], inplace=True)
        df1.insert(col_pos, col, df2[col])
    except ValueError:
        df1[col] = df2[col]
Caveat
With both of these approaches, if the indices of df1 and df2 do not match, any df1 row whose index is missing from df2 will end up as NaN in your output DataFrame (plain update, shown first below, does not have this problem):
df1 = pd.DataFrame(data = {'B' : [1,2,3,4,5], 'A' : [5,6,7,8,9]}) # Note the additional row
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df1.update(df2)
Output:
>>> df1
B A
0 9.0 5
1 2.0 6
2 11.0 7
3 4.0 8
4 5.0 9
My version 1:
df1 = pd.DataFrame(data = {'A' : [1,2,3,4,5], 'B' : [5,6,7,8,9]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df1 = df1.drop(columns=df2.columns).join(df2)
Output:
>>> df1
A B
0 5 9.0
1 6 NaN
2 7 11.0
3 8 NaN
4 9 NaN
My version 2:
df1 = pd.DataFrame(data = {'A' : [1,2,3,4,5], 'B' : [5,6,7,8,9]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
for col in df2.columns:
    try:
        col_pos = list(df1.columns).index(col)
        df1.drop(columns=[col], inplace=True)
        df1.insert(col_pos, col, df2[col])
    except ValueError:
        df1[col] = df2[col]
Output:
>>> df1
B A
0 9.0 5
1 NaN 6
2 11.0 7
3 NaN 8
4 NaN 9
A usable trick is to fill with a string like 'n/a', then replace 'n/a' with np.nan, and convert the column type back to float:
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, 'n/a', 11, 'n/a']})
df1.update(df2)
df1['B'] = df1['B'].replace({'n/a':np.nan})
df1['B'] = df1['B'].apply(pd.to_numeric, errors='coerce')
Some explanation about the type conversion: after the call to replace, the result is:
A B
0 1 9.0
1 2 NaN
2 3 11.0
3 4 NaN
This looks acceptable, but actually the type of column B has changed from float to object.
df1.dtypes
will give
A int64
B object
dtype: object
To set it back to float, you can use:
df1['B'] = df1['B'].apply(pd.to_numeric, errors='coerce')
And then, you shall have the expected result:
df1.dtypes
will give the expected type:
A int64
B float64
dtype: object
pandas.DataFrame.update doesn't replace values with NaN by default, so to circumvent this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df2.replace(np.nan, 'NAN', inplace = True)
df1.update(df2)
df1.replace('NAN', np.nan, inplace = True)
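As in the previous answer, the round trip through the 'NAN' string leaves column B with object dtype; a short sketch to restore the numeric dtype, using the same approach as above:
df1['B'] = pd.to_numeric(df1['B'], errors='coerce')  # back to float64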