Here is a working example that reproduces my problem. First, some random data is generated, along with the data that will be used to fill the NaNs:
import numpy as np
import pandas as pd

#Generate some random data and the data that will be used to fill the NaNs
data = np.random.random((100, 6))
fill_data = np.vstack((np.ones(200), np.ones(200) * 2, np.ones(200) * 3,
                       np.ones(200), np.ones(200) * 2, np.ones(200) * 3)).T
#Generate the indices of the NaNs that we will put in
nan_rows = np.random.randint(0, 100, 50)
nan_cols = np.random.randint(0, 6, 50)
nan_idx = np.vstack((nan_rows, nan_cols)).T
#Put in the NaN values
for r, c in nan_idx:
    data[r, c] = np.nan
#Generate a MultiIndex for the columns and a DatetimeIndex for both the data and fill_data
multi = pd.MultiIndex.from_product([['A', 'B'], ['one', 'two', 'three']])
idx1 = pd.date_range(start='1990-01-01', periods=100, freq='d')
idx2 = pd.date_range(start='1989-12-01', periods=200, freq='d')
#Construct the dataframes
df1 = pd.DataFrame(data, idx1, multi)
df2 = pd.DataFrame(fill_data, idx2, multi)
#Fill the NaNs in df1 from df2
df1 = df1.fillna(df2, axis=1)
Here is what the resulting frames look like:
In [167]:
df1.head()
Out[167]:
A B
one two three one two three
1990-01-01 1.000000 0.341803 0.694128 0.382164 0.326956 0.506616
1990-01-02 0.439024 0.552746 0.538489 0.003906 0.968498 0.816289
1990-01-03 0.200252 0.838014 0.805633 0.008980 0.269189 0.016243
1990-01-04 0.735120 0.384871 0.579268 0.561657 0.630314 0.361932
1990-01-05 0.938185 0.335212 0.678310 2.000000 0.819046 0.482535
In [168]:
df2.head()
Out[168]:
A B
one two three one two three
1989-12-01 1 2 3 1 2 3
1989-12-02 1 2 3 1 2 3
1989-12-03 1 2 3 1 2 3
1989-12-04 1 2 3 1 2 3
1989-12-05 1 2 3 1 2 3
So the key here is that the dataframes have different lengths but share labels: the MultiIndexed columns are identical, and every timestamp in df1 also appears in df2.
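A quick check of both claims, using the frames built above:

assert df1.columns.equals(df2.columns)    # identical MultiIndex columns
assert df1.index.isin(df2.index).all()    # every df1 timestamp exists in df2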
Here is the result:
In [165]:
df1
Out[165]:
A B
one two three one two three
1990-01-01 1.000000 0.341803 0.694128 0.382164 0.326956 0.506616
1990-01-02 0.439024 0.552746 0.538489 0.003906 0.968498 0.816289
1990-01-03 0.200252 0.838014 0.805633 0.008980 0.269189 0.016243
1990-01-04 0.735120 0.384871 0.579268 0.561657 0.630314 0.361932
1990-01-05 0.938185 0.335212 0.678310 2.000000 0.819046 0.482535
1990-01-06 0.609736 0.164815 0.295003 0.784388 3.000000 3.000000
1990-01-07 1.000000 0.394105 0.430608 0.782029 0.327485 0.855130
1990-01-08 0.573780 0.525845 0.147302 0.091022 3.000000 3.000000
1990-01-09 0.591646 0.651251 0.649255 0.205926 3.000000 0.606428
1990-01-10 0.988085 0.524769 0.481834 0.486241 0.629223 0.575843
1990-01-11 1.000000 0.586813 0.592252 0.309429 0.877121 0.547193
1990-01-12 0.853000 0.097981 0.970053 0.519838 0.828266 0.618965
1990-01-13 0.579778 0.805140 0.050559 0.432795 0.036241 0.081218
1990-01-14 0.055462 1.000000 0.159151 0.538137 3.000000 0.296754
1990-01-15 0.848238 0.697454 0.519403 0.232734 0.612487 0.891230
1990-01-16 0.808238 0.182904 0.480846 0.052806 0.900373 0.860274
1990-01-17 0.890997 0.346767 0.265168 0.486746 0.983999 0.104035
1990-01-18 0.673155 0.248853 0.245246 2.000000 0.965884 0.295021
1990-01-19 0.074864 0.714846 2.000000 0.046031 0.105930 0.641538
1990-01-20 1.000000 0.486893 0.464024 0.499484 0.794107 0.868002
If you look closely you can see that there are values equal to 1 in columns ('A','one') and ('A','two'), values equal to 2 in ('A','three') and ('B','one'), and values equal to 3 in ('B','two') and ('B','three').
The expected output would be values of 1 in the 'one' columns, 2 in the 'two' columns, etc.
Am I doing something wrong here? To me this seems like some kind of bug.
This issue was fixed in pandas 0.15.0. From that version onward you can do this:
import pandas as pd
import numpy as np
from numpy import nan

df = pd.DataFrame({'a': [nan, 1, 2, nan, nan],
                   'b': [1, 2, 3, nan, nan],
                   'c': [nan, 1, 2, 3, 4]},
                  index=list('VWXYZ'))
#     a   b    c
# V NaN   1  NaN
# W   1   2    1
# X   2   3    2
# Y NaN NaN    3
# Z NaN NaN    4

# df2 may have different index and columns
df2 = pd.DataFrame({'a': [10, 20, 30, 40, 50],
                    'b': [50, 60, 70, 80, 90],
                    'c': list('ABCDE')},
                   index=list('VWXYZ'))
#     a   b  c
# V  10  50  A
# W  20  60  B
# X  30  70  C
# Y  40  80  D
# Z  50  90  E
Now, passing a DataFrame to fillna
result = df.fillna(df2)
yields
print(result)
#     a   b  c
# V  10   1  A
# W   1   2  1
# X   2   3  2
# Y  40  80  3
# Z  50  90  4
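Back on the original example: on a recent pandas version, fillna with a DataFrame argument aligns on both the index and the columns, so the axis=1 argument should simply be dropped (as far as I know, fillna with a DataFrame filler and axis=1 is not supported). A minimal sketch, assuming the df1/df2 built in the question:

# fillna aligns df2 on index and columns; df2's extra rows are ignored
# and each NaN gets the value at the matching timestamp/column
filled = df1.fillna(df2)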
I am trying to append two pandas DataFrames that are of different shape. Here are sample DataFrames for reproducibility:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'val': ['x', 'y', 'w', 'z'],
                    'val1': ['x1', 'y1', 'w1', 'z1']})
df2 = pd.DataFrame({'id': [5, 6, 7, 8],
                    'val2': ['x1', 'y1', 'w1', 'z1'],
                    'val': ['t', 's', 'v', 'l']})
I'd like to append df2 to df1. Expected behavior: non-matching columns (here, val2) would simply be dropped, the column ordering of df1 would be retained, and the index would be reset in the appended DataFrame.
id val val1
0 1 x x1
1 2 y y1
2 3 w w1
3 4 z z1
4 5 t NaN
5 6 s NaN
6 7 v NaN
7 8 l NaN
To clarify, I am not looking for an inner join. I need all the columns from df1 and, additionally, want to print the columns that didn't intersect between the dataframes.
You can use pd.concat with df.reindex (edited to match your expected output):
pd.concat([df1, df2], ignore_index=True).reindex(df1.columns, axis='columns')
output:
id val val1
0 1 x x1
1 2 y y1
2 3 w w1
3 4 z z1
4 5 t NaN
5 6 s NaN
6 7 v NaN
7 8 l NaN
For the columns that didn't intersect, you can either use:
df1.columns.symmetric_difference(df2.columns).tolist()
to get the columns that appear in only one of the two frames; output:
['val1', 'val2']
or:
df2.columns.difference(df1.columns).tolist()
to get only the columns of df2 that are missing from df1; output:
['val2']
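As an alternative sketch (assuming the same df1/df2), you could drop the columns of df2 that df1 doesn't have before concatenating; the result should be identical:

# Keep only the df2 columns that also exist in df1, then concatenate.
# Columns of df1 that df2 lacks (here val1) are filled with NaN.
shared = df2.columns.intersection(df1.columns)
out = pd.concat([df1, df2[shared]], ignore_index=True)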
Problem
I have a dataframe with some NaNs that I am trying to fill intelligently based on values from another dataframe. I have not found an efficient way to do this, but I suspect there is one with pandas.
Minimal Example
index1 = [1, 1, 1, 2, 2, 2]
index2 = ['a', 'b', 'a', 'b', 'a', 'b']
# dataframe to fillna
df = pd.DataFrame(
    np.asarray([[np.nan, 90, 90, 100, 100, np.nan], index1, index2]).T,
    columns=['data', 'index1', 'index2']
)
# dataframe to look up fill values from
multi_index = pd.MultiIndex.from_product([sorted(set(index1)),
                                          sorted(set(index2))])
fill_val_lookup = pd.DataFrame([89, 91, 99, 101], index=multi_index,
                               columns=['fill_vals'])
Starting data (df):
data index1 index2
0 nan 1 a
1 90 1 b
2 90 1 a
3 100 2 b
4 100 2 a
5 nan 2 b
Lookup table to find values to fill NaNs:
fill_vals
1 a 89
b 91
2 a 99
b 101
Desired output:
data index1 index2
0 89 1 a
1 90 1 b
2 90 1 a
3 100 2 b
4 100 2 a
5 101 2 b
Ideas
The closest post I have found is about filling NaNs with values from one level of a multiindex.
I've also tried setting the index of df to a MultiIndex built from columns index1 and index2 and then using df.fillna; however, this does not work.
combine_first is the function you need. But first, update the index names and column name of the lookup dataframe, and restore the numeric dtypes that were lost when the mixed-type array was built:
fill_val_lookup.index.names = ['index1', 'index2']
fill_val_lookup.columns = ['data']
# np.asarray coerced everything to strings, so fix the dtypes
df.index1 = df.index1.astype(int)
df.data = df.data.astype(float)
df.set_index(['index1', 'index2']).combine_first(fill_val_lookup)\
    .reset_index()
# index1 index2 data
#0 1 a 89.0
#1 1 a 90.0
#2 1 b 90.0
#3 2 a 100.0
#4 2 b 100.0
#5 2 b 101.0
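For comparison, once the index names and column name have been aligned as above, a plain fillna should produce the same fills; a sketch under the same setup:

# Align on (index1, index2) and fill the missing entries from the lookup
res = df.set_index(['index1', 'index2'])['data']\
    .fillna(fill_val_lookup['data']).reset_index()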
I have a dataframe with columns A, B, and value, where the value column has missing entries, and a Series indexed by the A and B columns of the dataframe. How can I fill the missing values in the dataframe with the corresponding values in the Series?
I think you need fillna with set_index and reset_index:
df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [2, 3, 4],
                   'value': [2, np.nan, np.nan]})
print (df)
A B value
0 1 2 2.0
1 1 3 NaN
2 3 4 NaN
idx = pd.MultiIndex.from_product([[1,3],[2,3,4]])
s = pd.Series([5,6,0,8,9,7], index=idx)
print (s)
1 2 5
3 6
4 0
3 2 8
3 9
4 7
dtype: int64
df = df.set_index(['A','B'])['value'].fillna(s).reset_index()
print (df)
A B value
0 1 2 2.0
1 1 3 6.0
2 3 4 7.0
Consider the dataframe and series df and s
df = pd.DataFrame(dict(
    A=list('aaabbbccc'),
    B=list('xyzxyzxyz'),
    value=[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9]
))
s = pd.Series(range(1, 10)[::-1])
s.index = [df.A, df.B]
We can fillna with a clever join:
df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
#                 \_______________/                 \_________/
#                  make series same                  get old
#                  name as column                    column out
#                  we are filling                    of the way
A B value
0 a x 1.0
1 a y 2.0
2 a z 7.0
3 b x 4.0
4 b y 5.0
5 b z 4.0
6 c x 7.0
7 c y 8.0
8 c z 9.0
Timing
join is cute, but #jezrael's set_index is quicker
%timeit df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
100 loops, best of 3: 3.56 ms per loop
%timeit df.set_index(['A','B'])['value'].fillna(s).reset_index()
100 loops, best of 3: 2.06 ms per loop
I need to group-transform all columns in a DataFrame except the one holding the output variable.
df = pd.DataFrame({
    'Branch': ['A', 'A', 'A', 'B', 'B', 'B'],
    'M1': [1, 3, 5, 8, 9, 3],
    'M2': [2, 4, 5, 9, 2, 1],
    'Output': [1, 5, 5, 8, 1, 3]
})
Right now, I am centering all columns except the output column manually, by listing them explicitly in the grouping function.
def group_center(df):
    df['M1'] = df['M1'] - df['M1'].mean()
    df['M2'] = df['M2'] - df['M2'].mean()
    return df

centered = df.groupby('Branch').apply(group_center)
Is there a way to do this in a more dynamic fashion, as the number of variables I am analyzing keeps increasing?
You can define a list of the cols of interest and pass this to the groupby, which will operate on each of these cols via a lambda and apply:
In [53]:
cols = ['M1','M2']
df[cols] = df.groupby('Branch')[cols].apply(lambda x: x - x.mean())
df
Out[53]:
Branch M1 M2 Output
0 A -2.000000 -1.666667 1
1 A 0.000000 0.333333 5
2 A 2.000000 1.333333 5
3 B 1.333333 5.000000 8
4 B 2.333333 -2.000000 1
5 B -3.666667 -3.000000 3
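A variant sketch using transform instead of apply (same df and cols as above): transform('mean') returns a frame aligned with df, so a plain subtraction centers each listed column within its Branch and avoids the lambda entirely:

# Subtract each group's mean, broadcast back to the original rows
df[cols] = df[cols] - df.groupby('Branch')[cols].transform('mean')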
Here is a more vectorized way, where you do not need to list your columns anywhere.
means = df.groupby('Branch').mean()
df.set_index('Branch', inplace=True)
output = df['Output']   # save the output column
df = df - means         # subtract the per-group means, aligned on Branch
df['Output'] = output   # restore the untouched output column
M1 M2 Output
Branch
A -2.000000 -1.666667 1
A 0.000000 0.333333 5
A 2.000000 1.333333 5
B 1.333333 5.000000 8
B 2.333333 -2.000000 1
B -3.666667 -3.000000 3
Suppose I have a Pandas DataFrame like the following. These values are based on a distance matrix.
A = pd.DataFrame([(1.0, 0.8, 0.6708203932499369, 0.6761234037828132, 0.7302967433402214),
                  (0.8, 1.0, 0.6708203932499369, 0.8451542547285166, 0.9128709291752769),
                  (0.6708203932499369, 0.6708203932499369, 1.0, 0.5669467095138409, 0.6123724356957946),
                  (0.6761234037828132, 0.8451542547285166, 0.5669467095138409, 1.0, 0.9258200997725514),
                  (0.7302967433402214, 0.9128709291752769, 0.6123724356957946, 0.9258200997725514, 1.0)])
Output:
Out[65]:
0 1 2 3 4
0 1.000000 0.800000 0.670820 0.676123 0.730297
1 0.800000 1.000000 0.670820 0.845154 0.912871
2 0.670820 0.670820 1.000000 0.566947 0.612372
3 0.676123 0.845154 0.566947 1.000000 0.925820
4 0.730297 0.912871 0.612372 0.925820 1.000000
I want only the upper triangle.
c2 = A.copy()
c2.values[np.tril_indices_from(c2)] = np.nan
Output:
Out[67]:
0 1 2 3 4
0 NaN 0.8 0.67082 0.676123 0.730297
1 NaN NaN 0.67082 0.845154 0.912871
2 NaN NaN NaN 0.566947 0.612372
3 NaN NaN NaN NaN 0.925820
4 NaN NaN NaN NaN NaN
Now I want to get row and column index pairs based on some criteria.
E.g.: get the row and column indexes where the value is greater than 0.8. For this the output should be [1,3], [1,4], [3,4]. Any help on this?
You can use numpy's argwhere:
In [11]: np.argwhere(c2 > 0.8)
Out[11]:
array([[1, 3],
[1, 4],
[3, 4]])
To get the index/columns (rather than their integer locations), you could use a list comprehension:
[(c2.index[i], c2.columns[j]) for i, j in np.argwhere(c2 > 0.8)]
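A pandas-only alternative sketch for the same lookup: stacking the frame gives a Series indexed by (row label, column label) pairs, and filtering it (the NaNs fail the comparison) yields the label pairs directly:

s = c2.stack()                      # Series indexed by (row, column) labels
pairs = s[s > 0.8].index.tolist()   # [(1, 3), (1, 4), (3, 4)]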