Problem
I have a dataframe with some NaNs that I am trying to fill intelligently based on values from another dataframe. I have not found an efficient way to do this, but I suspect pandas has one.
Minimal Example
import numpy as np
import pandas as pd

index1 = [1, 1, 1, 2, 2, 2]
index2 = ['a', 'b', 'a', 'b', 'a', 'b']
# dataframe to fillna
df = pd.DataFrame(
    np.asarray([[np.nan, 90, 90, 100, 100, np.nan], index1, index2]).T,
    columns=['data', 'index1', 'index2']
)
# dataframe to look up fill values from
multi_index = pd.MultiIndex.from_product([sorted(set(index1)), sorted(set(index2))])
fill_val_lookup = pd.DataFrame([89, 91, 99, 101], index=multi_index,
                               columns=['fill_vals'])
Starting data (df):
  data index1 index2
0  nan      1      a
1   90      1      b
2   90      1      a
3  100      2      b
4  100      2      a
5  nan      2      b
Lookup table to find values to fill NaNs:
     fill_vals
1 a         89
  b         91
2 a         99
  b        101
Desired output:
  data index1 index2
0   89      1      a
1   90      1      b
2   90      1      a
3  100      2      b
4  100      2      a
5  101      2      b
Ideas
The closest post I have found is about filling NaNs with values from one level of a multiindex.
I've also tried setting the index of df to be a multiindex using columns index1 and index2 and then calling df.fillna, but this does not work.
combine_first is the function you need. But first, update the index names and column name of the other dataframe, and fix df's dtypes:
# give the lookup table the same index names and column name as df
fill_val_lookup.index.names = ["index1", "index2"]
fill_val_lookup.columns = ["data"]
# the np.asarray construction above made every column object dtype
df.index1 = df.index1.astype(int)
df.data = df.data.astype(float)
df.set_index(["index1", "index2"]).combine_first(fill_val_lookup)\
    .reset_index()
#   index1 index2   data
# 0      1      a   89.0
# 1      1      a   90.0
# 2      1      b   90.0
# 3      2      a  100.0
# 4      2      b  100.0
# 5      2      b  101.0
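For reference, a fillna-based alternative (a minimal sketch, assuming the renames and dtype fixes above have already been applied) aligns the lookup Series against a temporary MultiIndex and, unlike combine_first, preserves the original row order:
# align df's data against the lookup by (index1, index2); fillna only
# touches the NaNs, and .to_numpy() drops the MultiIndex so the
# assignment back to df is positional
s = df.set_index(["index1", "index2"])["data"]
df["data"] = s.fillna(fill_val_lookup["data"]).to_numpy()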
Related
Two sample dataframes with different index values but identical column names and order:
df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2, 4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7, 9])
df1
   A  B  C
2  1     3
4     2

df2
   A  B  C
7     4
9  5     6
I know how to concat the two dataframes, but that gives this, omitting the non-matching indexes from the other df:
   A  B  C
2  1     3
4     2
The result I am trying to achieve is:
   A  B  C
0  1  4  3
1  5  2  6
I want to combine the rows with the same index values from each df so that missing values in one df are replaced by the corresponding value in the other.
I have found that concat and merge are not up to the job.
I assume I need identical indexes in each df corresponding to the rows I want to merge. But so far, no luck getting it to come out correctly. Any pandas transformational wisdom is appreciated.
This merge attempt did not do the trick:
df1.merge(df2, on='A', how='outer')
The solutions below were all offered before I edited the question. My fault there, I neglected to point out that my actual data has different indexes in the two dataframes.
Let us try mask:
out = df1.mask(df1 == '', df2)
Out[428]:
   A  B  C
0  1  4  3
1  5  2  6
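Note that mask aligns on the index, so with the differently indexed frames from the edited question you would reset both indexes first. A minimal sketch:
# align both frames on a default 0..n-1 index, then mask positionally
a = df1.reset_index(drop=True)
b = df2.reset_index(drop=True)
out = a.mask(a == '', b)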
# element-wise fill; the vectorized mask above is preferable for large frames
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        if df1.iloc[i, j] == "":
            df1.iloc[i, j] = df2.iloc[i, j]
print(df1)
   A  B  C
0  1  4  3
1  5  2  6
Since the indexes of your two dataframes are different, it's easier to give them the same index first.
index = list(range(len(df1)))
df1.index = index
df2.index = index
ddf = df1.replace('', np.nan).fillna(df2)
This still works even if df1 and df2 have different sizes:
df1 = pd.DataFrame([[1, '', 3], ['', 2, ''], [7, 8, 9], [10, 11, 12]],
                   columns=['A', 'B', 'C'], index=[7, 8, 9, 10])
index1 = list(range(len(df1)))
index2 = list(range(len(df2)))
df1.index = index1
df2.index = index2
df1.replace('', np.nan).fillna(df2)
You get:
Out[17]:
      A     B     C
0   1.0     4   3.0
1     5   2.0     6
2   7.0   8.0   9.0
3  10.0  11.0  12.0
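For completeness, combine_first (the tool from the first question above) also fits here once the blanks are NaN and the indexes are aligned; a minimal sketch:
# turn '' into NaN, align on a common index, then patch df1's holes from df2
a = df1.replace('', np.nan).reset_index(drop=True)
b = df2.replace('', np.nan).reset_index(drop=True)
out = a.combine_first(b)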
I want to add the suffix '_nan' to columns that are all NaN. I have the following code, which prints what I want but does not reassign the columns in the actual dataframe, and I am not sure why. Does anyone have any ideas why this is happening? Thanks in advance.
df = pd.DataFrame({'a': [1, 0, 0, 0],
                   'b': [np.nan, np.nan, np.nan, np.nan],
                   'c': [np.nan, np.nan, np.nan, np.nan]})
a = df.loc[:,df.isna().all()].columns
df[[*a]] = df[[*a]].add_suffix('_nan')
You can use a list comprehension:
df.columns = [x + '_nan' if df[x].isna().all() else x for x in df.columns]
Output:
a b_nan c_nan
0 1 NaN NaN
1 0 NaN NaN
2 0 NaN NaN
3 0 NaN NaN
Why is this happening?
After some experiments, I found that when you assign a pandas.DataFrame slice to a slice of a pandas.DataFrame, pandas apparently cares solely about the order of the columns (as given in the list), not their names. Consider the following example:
import pandas as pd
df_1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6],'c':[7,8,9]})
df_2 = pd.DataFrame({'x':[10,20,30],'y':[400,500,600]})
df_1[['a','b']] = df_2[['x','y']]
print(df_1)
output
a b c
0 10 400 7
1 20 500 8
2 30 600 9
whilst
...
df_1[['a','b']] = df_2[['y','x']]
print(df_1)
produces
a b c
0 400 10 7
1 500 20 8
2 600 30 9
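Given that assignment is positional rather than by name, a rename-based fix for the original question is a safer pattern; a sketch using the df from the question:
# rename the all-NaN columns instead of assigning a relabeled slice back
all_nan = df.columns[df.isna().all()]
df = df.rename(columns={c: c + '_nan' for c in all_nan})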
I have two data sets as follows:
df1 = pd.DataFrame(np.array([[10, 20, 30, 40],
                             [11, 21, 31, 41]]), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.array([0, 1, 0, 1]).reshape(1, -1), columns=['A', 'B', 'C', 'D'])
What I want is: if any item of df2 is greater than 0.5, the corresponding item of df1 should become 0. After running the code, df1 will be:
print(df1)
    A  B   C  D
0  10  0  30  0
1  11  0  31  0
I tried using
df1[df2 >= 0.5] = 0
I think you should use pandas.DataFrame.where(), after you have brought df2 to the same shape as df1. Note that df.where() replaces values where the condition is False, which is the reason why >= is changed to <.
df1 = df1.where(df2 < 0.5, 0)
>>> df1
A B C D
0 10 0 30 0
1 11 0 31 0
If you have trouble extending df2, you can use this:
df2 = pd.DataFrame([[0, 1, 0, 1]], columns = ['A', 'B', 'C', 'D'])
>>>df2
A B C D
0 0 1 0 1
n = 1  # df1.shape[0] - 1
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df2 = pd.concat([df2] + [df2.loc[[0]]] * n, ignore_index=True)
>>> df2
A B C D
0 0 1 0 1
1 0 1 0 1
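Alternatively, a numpy-based sketch avoids extending df2 at all (assuming df2 is a single row to broadcast over df1; passing an ndarray as the condition bypasses index alignment):
import numpy as np

# broadcast the one-row condition to df1's shape; ndarray conditions
# are taken positionally, so no index alignment is needed
cond = np.broadcast_to(df2.to_numpy() < 0.5, df1.shape)
df1 = df1.where(cond, 0)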
Since both data frames have the same columns, the DataFrame.where() method can get the job done, i.e.
>>> df1.where(df2 < 0.5)
A B C D
0 10.0 NaN 30.0 NaN
1 NaN NaN NaN NaN
By default, where the condition evaluates to False, the where() method replaces the value with NaN, and not in place. We can change the replacement by setting the other argument to 0, and make the changes in place with inplace=True. Note that because df2 here has only one row, the unmatched row 1 of df1 also evaluates to False and is zeroed, as the output below shows; extend df2 first (as in the previous answer) to reproduce the desired result.
>>> df1.where(df2 < 0.5, other=0, inplace=True)
>>> df1
A B C D
0 10 0 30 0
1 0 0 0 0
I would like to perform some operation (e.g. x*apples^y) on the values of column apples, based on their color. The corresponding values are in a separate dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'apples': [2, 1, 5, 6, 7], 'color': [1, 1, 1, 2, 2]})
df2 = pd.DataFrame({'x': [100, 200], 'y': [0.5, 0.3]},
                   index=pd.Index([1, 2], name='color'))
I am looking for the following result:
apples color
0 100*2^0.5 1
1 100*1^0.5 1
2 100*5^0.5 1
3 200*6^0.3 2
4 200*7^0.3 2
Use DataFrame.join with default left join first and then operate with appended columns:
df = df1.join(df2, on='color')
df['apples'] = df['x'] * df['apples'] ** df['y']
print (df)
apples color x y
0 141.421356 1 100 0.5
1 100.000000 1 100 0.5
2 223.606798 1 100 0.5
3 342.353972 2 200 0.3
4 358.557993 2 200 0.3
Because it is a left join, assigning the result to a new column of df1 also works:
df = df1.join(df2, on='color')
df1['apples'] = df['x'] * df['apples'] ** df['y']
print (df1)
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2
Another idea is to use a double map, which looks each color up in df2's index directly:
df1['apples'] = df1['color'].map(df2['x']) * df1['apples'] ** df1['color'].map(df2['y'])
print (df1)
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2
I think you need pandas.merge:
temp = df1.merge(df2, left_on='color', right_index= True, how='left')
df1['apples'] = (temp['x']*(temp['apples'].pow(temp['y'])))
Output
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2
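A reindex-based sketch in the same spirit (the params name is mine): repeating df2's rows once per color yields plain arrays to operate on:
# look up (x, y) for each row's color by reindexing df2, then vectorize
params = df2.reindex(df1['color']).to_numpy()
df1['apples'] = params[:, 0] * df1['apples'] ** params[:, 1]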
I have a table which contains intervals
dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
I have another table which contains timestamps and values
dfb = pd.DataFrame({'Timestamp': [102, 145, 113], 'ValueA': [1, 2, 21],
                    'ValueB': [1, 2, 21]})
I need to create a dataframe the same size as dfa, with added columns containing the result of some aggregation of ValueA/ValueB over all the rows in dfb whose Timestamp falls between Start and Stop.
So if I define my aggregation as
{'ValueA': [np.nanmean, np.nanmin],
 'ValueB': [np.nanmax]}
my desired output would be:
 ValueA  ValueA  ValueB
nanmean  nanmin  nanmax  Start  Stop
    nan     nan     nan      0   100
      8       1      21    101   200
    nan     nan     nan    666  1000
Use merge to build a cross join, with a helper column created by assign:
d = {'ValueA': [np.nanmean, np.nanmin],
     'ValueB': [np.nanmax]}

df = dfa.assign(A=1).merge(dfb.assign(A=1), on='A', how='outer')
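With pandas 1.2+ the helper column is unnecessary, since merge supports a native cross join:
# pandas >= 1.2: same cross join without the dummy column
df = dfa.merge(dfb, how='cross')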
Then filter by Start and Stop and aggregate by dictionary:
df = (df[(df.Timestamp >= df.Start) & (df.Timestamp <= df.Stop)]
.groupby(['Start','Stop']).agg(d))
Flatten the MultiIndex columns by map with join:
df.columns = df.columns.map('_'.join)
print (df)
            ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
Start Stop
101   200                8              1             21
And lastly, join back to the original:
df = dfa.join(df, on=['Start','Stop'])
print (df)
Start Stop ValueA_nanmean ValueA_nanmin ValueB_nanmax
0 0 100 NaN NaN NaN
1 101 200 8.0 1.0 21.0
2 666 1000 NaN NaN NaN
EDIT:
Solution with cut:
d = {'ValueA': [np.nanmean, np.nanmin],
     'ValueB': [np.nanmax]}
#if not default index create it
dfa = dfa.reset_index(drop=True)
print (dfa)
Start Stop
0 0 100
1 101 200
2 666 1000
#add to bins the first value of Start (note: this assumes the intervals are
#contiguous; a timestamp in the 200-666 gap here would land in the last bin)
bins = np.insert(dfa['Stop'].values, 0, dfa.loc[0, 'Start'])
print (bins)
[ 0 100 200 1000]
#binning
dfb['id'] = pd.cut(dfb['Timestamp'], bins=bins, labels = dfa.index)
print (dfb)
Timestamp ValueA ValueB id
0 102 1 1 1
1 145 2 2 1
2 113 21 21 1
#aggregate and flatten
df = dfb.groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
#add to dfa
df = pd.concat([dfa, df], axis=1)
print (df)
Start Stop ValueA_nanmean ValueA_nanmin ValueB_nanmax
0 0 100 NaN NaN NaN
1 101 200 8.0 1.0 21.0
2 666 1000 NaN NaN NaN
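Since pd.cut needs contiguous bins, an IntervalIndex-based sketch (assuming the intervals do not overlap) handles the gaps directly:
# build closed intervals from dfa and locate each timestamp's interval;
# get_indexer returns the positional row in dfa, or -1 for no match
intervals = pd.IntervalIndex.from_arrays(dfa['Start'], dfa['Stop'], closed='both')
dfb['id'] = intervals.get_indexer(dfb['Timestamp'])

df = dfb[dfb['id'] >= 0].groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
df = pd.concat([dfa, df], axis=1)  # ids align with dfa's default index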