Suppose I have a Pandas DataFrame like the following. The values are based on a distance matrix.
import numpy as np
import pandas as pd

A = pd.DataFrame([(1.0, 0.8, 0.6708203932499369, 0.6761234037828132, 0.7302967433402214),
                  (0.8, 1.0, 0.6708203932499369, 0.8451542547285166, 0.9128709291752769),
                  (0.6708203932499369, 0.6708203932499369, 1.0, 0.5669467095138409, 0.6123724356957946),
                  (0.6761234037828132, 0.8451542547285166, 0.5669467095138409, 1.0, 0.9258200997725514),
                  (0.7302967433402214, 0.9128709291752769, 0.6123724356957946, 0.9258200997725514, 1.0)])
Output:
Out[65]:
0 1 2 3 4
0 1.000000 0.800000 0.670820 0.676123 0.730297
1 0.800000 1.000000 0.670820 0.845154 0.912871
2 0.670820 0.670820 1.000000 0.566947 0.612372
3 0.676123 0.845154 0.566947 1.000000 0.925820
4 0.730297 0.912871 0.612372 0.925820 1.000000
I want only the upper triangle.
c2 = A.copy()
c2.values[np.tril_indices_from(c2)] = np.nan
Output:
Out[67]:
0 1 2 3 4
0 NaN 0.8 0.67082 0.676123 0.730297
1 NaN NaN 0.67082 0.845154 0.912871
2 NaN NaN NaN 0.566947 0.612372
3 NaN NaN NaN NaN 0.925820
4 NaN NaN NaN NaN NaN
Now I want to get column and row index pairs based on some criteria.
E.g.: get the column and row indexes where the value is greater than 0.8. For this, the output should be [1,3], [1,4], [3,4]. Any help on this?
You can use numpy's argwhere:
In [11]: np.argwhere(c2 > 0.8)
Out[11]:
array([[1, 3],
[1, 4],
[3, 4]])
To get the index/columns (rather than their integer locations), you could use a list comprehension:
[(c2.index[i], c2.columns[j]) for i, j in np.argwhere(c2 > 0.8)]
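As a quick, self-contained check using the rounded values from the question (a sketch, not part of the original answer; like the question's snippet, it assumes that writing through .values mutates the frame, which holds for a single-dtype float frame without copy-on-write):
import numpy as np
import pandas as pd

vals = [[1.000000, 0.800000, 0.670820, 0.676123, 0.730297],
        [0.800000, 1.000000, 0.670820, 0.845154, 0.912871],
        [0.670820, 0.670820, 1.000000, 0.566947, 0.612372],
        [0.676123, 0.845154, 0.566947, 1.000000, 0.925820],
        [0.730297, 0.912871, 0.612372, 0.925820, 1.000000]]
c2 = pd.DataFrame(vals)
c2.values[np.tril_indices_from(c2)] = np.nan   # keep only the upper triangle

# NaN > 0.8 is False, so the masked lower triangle is ignored automatically
print(np.argwhere(c2 > 0.8))                   # integer positions
print([(c2.index[i], c2.columns[j])
       for i, j in np.argwhere(c2 > 0.8)])     # labels: [(1, 3), (1, 4), (3, 4)]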
I have two pandas DataFrames (df1, df2) with different numbers of rows and columns and some matching values in a specific column in each df, with the caveats that (1) there are some unique values in each df, and (2) there are different numbers of matching values across the DataFrames.
Baby example:
df1 = pd.DataFrame({'id1': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({'id2': [1, 1, 2, 2, 2, 2, 3, 4, 5],
'var1': ['B', 'B', 'W', 'W', 'W', 'W', 'H', 'B', 'A']})
What I am seeking to do is create df3 where df2['id2'] is aligned/indexed to df1['id1'], such that:
NaN is added to df3[id2] when df2[id2] has fewer (or missing) matches to df1[id1]
NaN is added to df3[id2] & df3[var1] if df1[id1] exists but has no match to df2[id2]
'var1' is filled in for all cases of df3[var1] where df1[id1] and df2[id2] match
rows are dropped when df2[id2] has more matching values than df1[id1] (or no matches at all)
The resulting DataFrame (df3) should look as follows (notice that id2 = 5 and var1 = 'A' are gone):
id1  id2 var1
  1    1    B
  1    1    B
  1  NaN    B
  2    2    W
  2    2    W
  3    3    H
  3  NaN    H
  3  NaN    H
  3  NaN    H
  4    4    B
  6  NaN  NaN
  6  NaN  NaN
I cannot find a combination of merge/join/concatenate/align that correctly solves this problem. Currently, everything I have tried stacks the rows in sequence without adding NaN in the proper cells/rows and instead adds all the NaN values at the bottom of df3 (so id1 and id2 never align). Any help is greatly appreciated!
You can first assign a helper column for id1 and id2 based on groupby.cumcount, then merge on the id and the helper key. Finally, fill in var1 by mapping id1 to the var1 values from df2:
def helper(data, col):
    return data.groupby(col).cumcount()

out = df1.assign(k=helper(df1, ['id1'])).merge(
          df2.assign(k=helper(df2, ['id2'])),
          left_on=['id1', 'k'], right_on=['id2', 'k'], how='left').drop(columns='k')
out['var1'] = out['id1'].map(dict(df2[['id2', 'var1']].drop_duplicates().to_numpy()))
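To see why this aligns duplicate ids row by row, it can help to print the helper key on the question's data (just an illustrative check, not part of the answer):
import pandas as pd

df1 = pd.DataFrame({'id1': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({'id2': [1, 1, 2, 2, 2, 2, 3, 4, 5],
                    'var1': ['B', 'B', 'W', 'W', 'W', 'W', 'H', 'B', 'A']})

# The key is the running occurrence count of each id within its group, so the
# n-th 1 in df1 pairs with the n-th 1 in df2; unmatched occurrences become NaN.
print(df1.groupby('id1').cumcount().tolist())   # [0, 1, 2, 0, 1, 0, 1, 2, 3, 0, 0, 1]
print(df2.groupby('id2').cumcount().tolist())   # [0, 1, 0, 1, 2, 3, 0, 0, 0]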
Or, similarly but without assign, as HenryEcker suggests:
out = df1.merge(df2, left_on=['id1', helper(df1, ['id1'])],
                right_on=['id2', helper(df2, ['id2'])], how='left').drop(columns='key_1')
# the array passed as the second merge key appears as a 'key_1' column in the result, hence the drop
out['var1'] = out['id1'].map(dict(df2[['id2', 'var1']].drop_duplicates().to_numpy()))
print(out)
id1 id2 var1
0 1 1.0 B
1 1 1.0 B
2 1 NaN B
3 2 2.0 W
4 2 2.0 W
5 3 3.0 H
6 3 NaN H
7 3 NaN H
8 3 NaN H
9 4 4.0 B
10 6 NaN NaN
11 6 NaN NaN
My DataFrame looks like this:
COL1 COL2 COL3
A M X
B F Y
NaN M Y
A NaN Y
I am trying to label encode these columns while leaving the nulls as-is. My result should look like:
COL1_ COL2_ COL3_
0 0 0
1 1 1
NaN 0 1
0 NaN 1
The code I tried:
modified_l2 = {}
for val in list(df_obj.columns):
    modified_l2[val] = {k: i for i, k in enumerate(df_obj[val].unique(), 0)}

for cols in modified_l2.keys():
    df_obj[cols + '_'] = df_obj[cols].map(modified_l2[cols], na_action='ignore')
Try the code below. I first use apply; within each column I drop the NaNs, convert the remaining values to a list, and call list.index for each value (list.index gives the position of the first occurrence). I then wrap the result in a Series whose index is the column's index without the NaNs; I do that because after dropping the NaNs the index goes from 0, 1, 2, 3 to something like 0, 2, 3, so the missing positions come back as NaN on alignment. Finally, I add an underscore suffix to each new column and join it with the original DataFrame:
print(df.join(df.apply(lambda x: pd.Series(map(x.dropna().tolist().index, x.dropna()),
                                           index=x.dropna().index)).add_suffix('_')))
Output:
COL1 COL2 COL3 COL1_ COL2_ COL3_
0 A M X 0.0 0.0 0
1 B F Y 1.0 1.0 1
2 NaN M Y NaN 0.0 1
3 A NaN Y 0.0 NaN 1
Here it is best to use factorize with replace; factorize encodes missing values as -1, which replace converts back to NaN:
df = df.join(df.apply(lambda x : pd.factorize(x)[0]).replace(-1, np.nan).add_suffix('_'))
print (df)
COL1 COL2 COL3 COL1_ COL2_ COL3_
0 A M X 0.0 0.0 0
1 B F Y 1.0 1.0 1
2 NaN M Y NaN 0.0 1
3 A NaN Y 0.0 NaN 1
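For reference, a standalone check of how factorize handles missing values (a small illustrative sketch, not part of the answer above):
import numpy as np
import pandas as pd

codes, uniques = pd.factorize(pd.Series(['A', 'M', np.nan, 'A']))
print(codes)     # [ 0  1 -1  0]  -> -1 marks the missing value
print(uniques)   # Index(['A', 'M'], dtype='object')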
I have a DataFrame like this:
df = pd.DataFrame(np.random.randn(6, 6),
                  columns=pd.MultiIndex.from_arrays((['A', 'A', 'A', 'B', 'B', 'B'],
                                                     ['a', 'b', 'c', 'a', 'b', 'c'])))
df
A B
a b c a b c
0 -0.089902 -2.235642 0.282761 0.725579 1.266029 -0.354892
1 -1.753303 1.092057 0.484323 1.789094 -0.316307 0.416002
2 -0.409028 -0.920366 -0.396802 -0.569926 -0.538649 -0.844967
3 1.789569 -0.935632 0.004476 -1.873532 -1.136138 -0.867943
4 0.244112 0.298361 -1.607257 -0.181820 0.577446 0.556841
5 0.903908 -1.379358 0.361620 1.290646 -0.523404 -0.518992
I would like to select only the rows that have a value larger than 0 in column c. I figured that I will have to use pd.IndexSlice to select only the second level index c.
idx = pd.IndexSlice
df.loc[:,idx[:,['c']]] > 0
A B
c c
0 True False
1 True True
2 False False
3 True False
4 False True
5 True False
So, now I would expect that I could simply do df[df.loc[:,idx[:,['c']]] > 0], however that gives me an unexpected result:
df[df.loc[:,idx[:,['c']]] > 0]
A B
a b c a b c
0 NaN NaN 0.282761 NaN NaN NaN
1 NaN NaN 0.484323 NaN NaN 0.416002
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN 0.004476 NaN NaN NaN
4 NaN NaN NaN NaN NaN 0.556841
5 NaN NaN 0.361620 NaN NaN NaN
What I would like is to keep all values (no NaNs) and only the rows where any of the c columns is greater than 0.
A B
a b c a b c
0 -0.089902 -2.235642 0.282761 0.725579 1.266029 -0.354892
1 -1.753303 1.092057 0.484323 1.789094 -0.316307 0.416002
3 1.789569 -0.935632 0.004476 -1.873532 -1.136138 -0.867943
4 0.244112 0.298361 -1.607257 -0.181820 0.577446 0.556841
5 0.903908 -1.379358 0.361620 1.290646 -0.523404 -0.518992
So, I would probably need to sneak an any() somewhere in there, however, I am not sure how to do that. Any hints?
Another version, using get_level_values:
df[(df.iloc[:, df.columns.get_level_values(1) == 'c'] > 0).any(axis=1)]
You are looking for any: indexing with a boolean DataFrame masks cell by cell (which is why you got NaNs), while the boolean Series produced by .any(axis=1) selects whole rows:
df[(df.loc[:,idx[:,['c']]]>0).any(axis = 1)]
Out[133]:
A B
a b c a b c
1 -0.423313 0.459464 -1.457655 -0.559667 -0.056230 1.338850
3 -0.072396 1.305868 -1.239441 -0.708834 0.348704 0.260532
4 -1.415575 1.229508 0.148254 -0.812806 1.379552 -1.195062
5 -0.336973 -0.469335 1.345719 0.847943 1.465100 -0.285792
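Since the outputs above come from unseeded random data, the numbers differ between the question and the answer. Here is a minimal, seeded sketch of the same idea (assuming the column layout from the question):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 6)),
                  columns=pd.MultiIndex.from_arrays((['A', 'A', 'A', 'B', 'B', 'B'],
                                                     ['a', 'b', 'c', 'a', 'b', 'c'])))

idx = pd.IndexSlice
row_mask = (df.loc[:, idx[:, 'c']] > 0).any(axis=1)   # one boolean per row
print(df[row_mask])                                   # full rows where any 'c' column is > 0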
This is a possible duplicate, but the solution provided there does not fit my problem because of the information I have available.
The idea is quite simple. I have a matrix with a MultiIndex (in my case I didn't build the index; I only receive the DataFrame):
#test = (('2','C'),('2','B'),('1','A'))
#test = pd.MultiIndex.from_tuples(test)
#pd.DataFrame(index=test, columns=test)
2 1
C B A
2 C NaN NaN NaN
B NaN NaN NaN
1 A NaN NaN NaN
I would like to add a sublevel on both axes as a function of A, B, C. E.g.:
2 1
C B A
kg kg m3
2 C kg NaN NaN NaN
B kg NaN NaN NaN
1 A m3 NaN NaN NaN
In reality the index is available through the DataFrame (I didn't build it), and I only know this: {'C':'kg', 'B':'kg', 'A':'m3'}. I can get the index series and use an approach similar to the link above, but it is very slow, and I imagine there must be something simpler and more effective.
Source DF:
In [303]: df
Out[303]:
2 1
C B A
2 C NaN NaN NaN
B NaN NaN NaN
1 A NaN NaN NaN
Solution:
In [304]: cols = df.columns
In [305]: new_lvl = [d[c] for c in df.columns.get_level_values(1)]
In [306]: df.columns = pd.MultiIndex.from_arrays([cols.get_level_values(0),
cols.get_level_values(1),
new_lvl])
In [307]: df
Out[307]:
2 1
C B A
kg kg m3
2 C NaN NaN NaN
B NaN NaN NaN
1 A NaN NaN NaN
where d is:
In [308]: d = {'C':'kg', 'B':'kg', 'A':'m3'}
In [309]: d
Out[309]: {'A': 'm3', 'B': 'kg', 'C': 'kg'}
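The question asks for the new sublevel on both axes; the row index can be rebuilt the same way (a sketch assuming the frame df and the mapping d from above):
rows = df.index
df.index = pd.MultiIndex.from_arrays([rows.get_level_values(0),
                                      rows.get_level_values(1),
                                      [d[r] for r in rows.get_level_values(1)]])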
You can use set_index(..., append=True) to add the new index level:
test = (('2','C'),('2','B'),('1','A'))
test = pd.MultiIndex.from_tuples(test)
x = pd.DataFrame(index=test, columns=test)
# add new index
x['new'] = pd.Series(x.index.get_level_values(-1), index=x.index).replace({'C':'kg', 'B':'kg', 'A':'m3'})
x.set_index('new', append=True, inplace=True)
x.index.names = [None] * 3
# transpose dataframe and do the same thing
x = x.T
x['new'] = pd.Series(x.index.get_level_values(-1), index=x.index).replace({'C':'kg', 'B':'kg', 'A':'m3'})
x.set_index('new', append=True, inplace=True)
x.index.names = [None] * 3
x = x.T
Here is a working example that reproduces my problem. First, some random data is generated, along with the data that will be used to fill the NaNs:
import numpy as np
import pandas as pd

#Generate some random data and data that will be used to fill the nans
data = np.random.random((100,6))
fill_data = np.vstack((np.ones(200), np.ones(200)*2, np.ones(200)*3,np.ones(200), np.ones(200)*2, np.ones(200)*3)).T
#Generate indices of nans that we will put in
nan_rows = np.random.randint(0,100,50)
nan_cols = np.random.randint(0,6,50)
nan_idx = np.vstack((nan_rows,nan_cols)).T
#Put in nan values
for r, c in nan_idx:
    data[r, c] = np.nan
#Generate multiindex and datetimeindex for both the data and fill_data
multi = pd.MultiIndex.from_product([['A','B'],['one','two','three']])
idx1 = pd.date_range(start='1990-01-01', periods=100, freq='d')
idx2 = pd.date_range(start='1989-12-01', periods=200, freq='d')
#Construct dataframes
df1 = pd.DataFrame(data, idx1, multi)
df2 = pd.DataFrame(fill_data, idx2, multi)
#fill nans from df1 with df2
df1 = df1.fillna(df2, axis=1)
Here is what the resulting frames look like:
In [167]:
df1.head()
Out[167]:
A B
one two three one two three
1990-01-01 1.000000 0.341803 0.694128 0.382164 0.326956 0.506616
1990-01-02 0.439024 0.552746 0.538489 0.003906 0.968498 0.816289
1990-01-03 0.200252 0.838014 0.805633 0.008980 0.269189 0.016243
1990-01-04 0.735120 0.384871 0.579268 0.561657 0.630314 0.361932
1990-01-05 0.938185 0.335212 0.678310 2.000000 0.819046 0.482535
In [168]:
df2.head()
Out[168]:
A B
one two three one two three
1989-12-01 1 2 3 1 2 3
1989-12-02 1 2 3 1 2 3
1989-12-03 1 2 3 1 2 3
1989-12-04 1 2 3 1 2 3
1989-12-05 1 2 3 1 2 3
So the key here is that the DataFrames are different lengths but have common labels: the MultiIndexed columns are the same, and the timestamp labels in df1 are contained within df2.
Here is the result:
In [165]:
df1
Out[165]:
A B
one two three one two three
1990-01-01 1.000000 0.341803 0.694128 0.382164 0.326956 0.506616
1990-01-02 0.439024 0.552746 0.538489 0.003906 0.968498 0.816289
1990-01-03 0.200252 0.838014 0.805633 0.008980 0.269189 0.016243
1990-01-04 0.735120 0.384871 0.579268 0.561657 0.630314 0.361932
1990-01-05 0.938185 0.335212 0.678310 2.000000 0.819046 0.482535
1990-01-06 0.609736 0.164815 0.295003 0.784388 3.000000 3.000000
1990-01-07 1.000000 0.394105 0.430608 0.782029 0.327485 0.855130
1990-01-08 0.573780 0.525845 0.147302 0.091022 3.000000 3.000000
1990-01-09 0.591646 0.651251 0.649255 0.205926 3.000000 0.606428
1990-01-10 0.988085 0.524769 0.481834 0.486241 0.629223 0.575843
1990-01-11 1.000000 0.586813 0.592252 0.309429 0.877121 0.547193
1990-01-12 0.853000 0.097981 0.970053 0.519838 0.828266 0.618965
1990-01-13 0.579778 0.805140 0.050559 0.432795 0.036241 0.081218
1990-01-14 0.055462 1.000000 0.159151 0.538137 3.000000 0.296754
1990-01-15 0.848238 0.697454 0.519403 0.232734 0.612487 0.891230
1990-01-16 0.808238 0.182904 0.480846 0.052806 0.900373 0.860274
1990-01-17 0.890997 0.346767 0.265168 0.486746 0.983999 0.104035
1990-01-18 0.673155 0.248853 0.245246 2.000000 0.965884 0.295021
1990-01-19 0.074864 0.714846 2.000000 0.046031 0.105930 0.641538
1990-01-20 1.000000 0.486893 0.464024 0.499484 0.794107 0.868002
If you look closely you can see that there are values equal to 1 in columns ('A','one') and ('A','two'), values equal to 2 in ('A','three') and ('B','one') and values equal to 3 in ('B','two') and ('B','three').
The expected output would be values of 1 in the 'one' columns, 2 in the 'two' columns, etc.
Am I doing something wrong here? To me this seems like some kind of bug.
This issue has been fixed in the latest version of Pandas.
Using version 0.15.0 you will be able to do this:
import pandas as pd
import numpy as np
from numpy import nan
df = pd.DataFrame({'a': [nan, 1, 2, nan, nan],
                   'b': [1, 2, 3, nan, nan],
                   'c': [nan, 1, 2, 3, 4]},
                  index=list('VWXYZ'))
# a b c
# V NaN 1 NaN
# W 1 2 1
# X 2 3 2
# Y NaN NaN 3
# Z NaN NaN 4
# df2 may have different index and columns
df2 = pd.DataFrame({'a': [10, 20, 30, 40, 50],
                    'b': [50, 60, 70, 80, 90],
                    'c': list('ABCDE')},
                   index=list('VWXYZ'))
# a b c
# V 10 50 A
# W 20 60 B
# X 30 70 C
# Y 40 80 D
# Z 50 90 E
Now, passing a DataFrame to fillna
result = df.fillna(df2)
yields
print(result)
# a b c
# V 10 1 A
# W 1 2 1
# X 2 3 2
# Y 40 80 3
# Z 50 90 4
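On a current pandas version, the same label-based alignment carries over to the MultiIndexed setup from the question: fillna matches on both the DatetimeIndex and the column MultiIndex, so df2 being longer does not matter. A minimal sketch (simplified data, not the asker's exact frames):
import numpy as np
import pandas as pd

multi = pd.MultiIndex.from_product([['A', 'B'], ['one', 'two', 'three']])
idx1 = pd.date_range('1990-01-01', periods=100, freq='D')
idx2 = pd.date_range('1989-12-01', periods=200, freq='D')

df1 = pd.DataFrame(np.random.random((100, 6)), idx1, multi)
df1.iloc[0, 0] = np.nan                            # one NaN at ('A', 'one'), 1990-01-01
df2 = pd.DataFrame(np.tile([1.0, 2.0, 3.0], (200, 2)), idx2, multi)

# Each NaN is replaced by the value at the same row/column *label* in df2.
print(df1.fillna(df2).iloc[0, 0])                  # -> 1.0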