I have a pandas.DataFrame like
B1 B2 B3
A1 0 1 2
A2 3 4 5
I also have index = pd.Index(['A2', 'A1']) and columns = pd.Index(['B2', 'B3']). What I want to get is [4, 2], that is, the elements at A2-B2 and A1-B3, respectively.
Is there a clever built-in operation to perform this in pandas?
I searched with different phrasings for a while but found nothing. This may be a duplicate question; sorry if so. Thank you for taking a look.
Use Index.get_indexer to get the positions of the index and columns values, then select with NumPy integer indexing (after converting the values of df to a NumPy array):
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]],
                  index=['A1', 'A2'], columns=['B1', 'B2', 'B3'])
index = pd.Index(['A2', 'A1'])
columns = pd.Index(['B2', 'B3'])

i = df.index.get_indexer(index)
c = df.columns.get_indexer(columns)
L = df.to_numpy()[i, c].tolist()
print(L)
[4, 2]
Or reshape with DataFrame.stack and select with DataFrame.loc using MultiIndex.from_tuples:
L = df.stack().loc[pd.MultiIndex.from_tuples(zip(index, columns))].tolist()
print (L)
[4, 2]
If there are only a few values, use a list comprehension with zip and DataFrame.at:
L = [df.at[i, c] for i, c in zip(index, columns)]
print (L)
[4, 2]
Another option would be to zip the index and columns (same idea as @jezrael); however, you can just pass the zipped pairs to loc (internally, it takes care of finding the right values):
temp = df.stack()
zipped = [*zip(index, columns)]
temp.loc(axis=0)[zipped].array
<PandasArray>
[4, 2]
Length: 2, dtype: int64
Related
I have a DataFrame (temp) and an array (x) whose elements correspond to some of the rows of the DataFrame. I want to get the indices of the DataFrame whose rows are identical to the elements of the array.
For example:
temp = pd.DataFrame({"A": [1,2,3,4], "B": [4,5,6,7], "C": [7,8,9,10]})
A B C
0 1 4 7
1 2 5 8
2 3 6 9
3 4 7 10
x = np.array([[1,4,7], [3,6,9]])
It should return the indexes: 0 and 2.
I was trying unsuccessfully with this:
temp.loc[temp.isin(x[0])].index
Using numpy broadcasting:
array = temp.to_numpy()[:, None]               # shape (n, 1, 3)
mask = (array == x).all(axis=-1).any(axis=-1)  # row equals any row of x
temp.index[mask]
I would convert to a MultiIndex and then use isin with np.where:
i = pd.MultiIndex.from_frame(temp[['A','B','C']])
out = np.where(i.isin(pd.MultiIndex.from_arrays(x.T)))[0]
print(out)
#[0 2]
Or with merge:
cols = ['A','B','C']
out = temp.reset_index().merge(pd.DataFrame(x, columns=cols)).loc[:, 'index'].tolist()
Or with np.isin and all (careful: this tests element membership, not exact row equality, so it can give false positives):
out = temp.index[np.isin(temp[['A','B','C']],x).all(1)]
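The caveat above is worth demonstrating: np.isin tests element membership, so a row whose elements each appear somewhere in x passes even when the row itself is not in x. A small sketch (the bad DataFrame is a made-up example, not from the question):

```python
import numpy as np
import pandas as pd

x = np.array([[1, 4, 7], [3, 6, 9]])

# [1, 6, 9] is NOT a row of x, but each of its elements appears
# somewhere in x, so the element-wise check wrongly accepts it.
bad = pd.DataFrame({'A': [1], 'B': [6], 'C': [9]})
print(np.isin(bad[['A', 'B', 'C']], x).all(1))  # [ True]
```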
Since you need to match entire rows of the DataFrame to rows in the numpy array, you can convert the DataFrame to an array and use enumerate to loop and print the matching indices:
temp_arr = temp.to_numpy()
for idx, row in enumerate(temp_arr):
if row in x:
print(idx)
Output:
0
2
A more elegant way, using a list comprehension:
idx_list = [i for i, row in enumerate(temp_arr) if row in x]
print(idx_list)
Output:
[0, 2]
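One caution: `row in x` on a NumPy array is an element-wise test (equivalent to `(x == row).any()`), so like the np.isin variant it can accept rows that are not actually in x. For guaranteed exact row matching, one simple sketch is a set of row tuples:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 5, 6, 7], "C": [7, 8, 9, 10]})
x = np.array([[1, 4, 7], [3, 6, 9]])

# Exact row matching: compare whole rows as tuples, not individual elements.
targets = set(map(tuple, x))
idx = [i for i, row in enumerate(temp.to_numpy()) if tuple(row) in targets]
print(idx)  # [0, 2]
```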
I have two data frames. Both have one column of numpy arrays with 3 elements per entry, like so:
0 [0.552347, 0.762896, 0.336009]
1 [0.530716, 0.808313, 0.254895]
2 [0.528786, 0.734991, 0.424469]
3 [0.202799, 0.669395, -0.714691]
4 [0.791936, -0.100072, -0.602347]
6 [0.428896, -0.122712, 0.89498]
How do I take the dot product of each row of one data frame with the corresponding row of the other data frame? Meaning, I want to calculate the dot product of the first element of df1 with the first element of df2, then the second element of df1 with the second element of df2, then third, and so on.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([(np.array([0.552347, 0.762896, 0.336009]), ),
                    (np.array([0.530716, 0.808313, 0.254895]), )], columns=['v1'])
df2 = pd.DataFrame([(np.array([0.528786, 0.734991, 0.424469]), ),
                    (np.array([0.202799, 0.669395, -0.714691]), )], columns=['v2'])
pd.concat((df1, df2), axis=1).apply(lambda row: row.v1.dot(row.v2), axis=1)
0 0.995420
1 0.466538
Assuming df1 and df2 have the same length:
[x.dot(y) for x, y in zip(df1.col1.values,df2.col1.values)]
Out[648]: [0.9999995633060001, 1.00000083965]
It's pretty fast to compute dot products manually. For this you can use mul and sum if the two dataframes share the same index:
df1.col.mul(df2.col).apply(sum)
If they don't share the same index (but are the same length), use reset_index first:
df1.reset_index().col.mul(df2.reset_index().col).apply(sum)
Example:
>>> df1
col
0 [0, 1, 2]
1 [3, 4, 5]
>>> df2
col
0 [5, 6, 7]
1 [1, 2, 3]
>>> df1.col.mul(df2.col).apply(sum)
0 20
1 26
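For longer frames, a vectorized sketch (assuming a column of equal-length NumPy arrays, using the same toy data as above) stacks the object column into a 2-D array and uses np.einsum for the row-wise dot products:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': [np.array([0, 1, 2]), np.array([3, 4, 5])]})
df2 = pd.DataFrame({'col': [np.array([5, 6, 7]), np.array([1, 2, 3])]})

# Stack the object column into an (n, 3) array, then take row-wise dots.
a = np.stack(df1['col'].to_numpy())
b = np.stack(df2['col'].to_numpy())
print(np.einsum('ij,ij->i', a, b))  # [20 26]
```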
How do I get values from a dataframe based on a list of indices and headers?
These are the dataframes I have:
a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
referencingDf = pd.DataFrame(['c','c','b'])
Based on the same index, I am trying to get the following dataframe output:
outputDf = pd.DataFrame([3,6,8])
Currently, I tried this, but I would then need to take the diagonal values. I'm pretty sure there is a better way of doing so:
a.loc[referencingDf.index.values, referencingDf[:][0].values]
You need lookup:
b = a.lookup(a.index, referencingDf[0])
print (b)
[3 6 8]
df1 = pd.DataFrame({'vals':b}, index=a.index)
print (df1)
vals
0 3
1 6
2 8
Another way is to use a list comprehension:
vals = [a.loc[i,j] for i,j in enumerate(referencingDf[0])]
# [3, 6, 8]
IIUC, you can use df.get_value in a list comprehension:
vals = [a.get_value(*x) for x in referencingDf.reset_index().values]
# a simplification would be [ ... for x in enumerate(referencingDf[0])]
print(vals)
[3, 6, 8]
And then, construct a dataframe.
df = pd.DataFrame(vals)
print(df)
0
0 3
1 6
2 8
Here's one vectorized approach that uses a column_index helper (mapping column labels to integer positions) and then NumPy advanced indexing to extract those values from each row of the dataframe:
In [177]: col_idx = column_index(a, referencingDf.values.ravel())
In [178]: a.values[np.arange(len(col_idx)), col_idx]
Out[178]: array([3, 6, 8])
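Note that DataFrame.lookup (used above) was deprecated in pandas 1.2 and removed in 2.0. On modern pandas, an equivalent can be sketched from Index.get_indexer plus NumPy advanced indexing:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])
referencingDf = pd.DataFrame(['c', 'c', 'b'])

# Map column labels to integer positions, then pick one value per row.
col_pos = a.columns.get_indexer(referencingDf[0])
vals = a.to_numpy()[np.arange(len(a)), col_pos]
print(vals.tolist())  # [3, 6, 8]
```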
I want to aggregate the indices of a dataframe with the groupby function.
word count
0 a 3
1 the 5
2 a 3
3 an 2
4 the 1
What I want is a pd.Series consisting of lists (in descending order) of indices:
word
a [2, 0]
an [3]
the [4, 1]
I've tried some built-in functions with groupby; however, I couldn't find a way to aggregate the indices. Could you provide a hint or solution for this problem?
I think you can first reverse the row order with [::-1], then groupby and apply, converting each group's index to a list. Finally, sort_index:
print (df[::-1].groupby('word', sort=False).apply(lambda x: x.index.tolist()).sort_index())
word
a [2, 0]
an [3]
the [4, 1]
dtype: object
Another similar solution:
print (df.sort_index(ascending=False)
.groupby('word', sort=False)
.apply(lambda x: x.index.tolist())
.sort_index())
word
a [2, 0]
an [3]
the [4, 1]
dtype: object
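An alternative sketch that groups the index itself, so the frame never has to be reversed up front (each collected list is reversed at the end instead):

```python
import pandas as pd

df = pd.DataFrame({'word': ['a', 'the', 'a', 'an', 'the'],
                   'count': [3, 5, 3, 2, 1]})

# Group the index by the 'word' column, collect to lists, reverse each list.
out = df.index.to_series().groupby(df['word']).apply(lambda s: s.tolist()[::-1])
print(out)
```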
import pandas as pd
dafr = pd.DataFrame({'a': [1,2,3], 'b': [[1,2,3],[2,3,4],[3,4,5]]})
I am trying to do something like:
dafr[dafr['b'].isin(2)]
which should return the rows that have the lists [1,2,3] and [2,3,4].
I wonder if this is possible?
isin returns whether the column value is in what you pass. You want to check if what you pass is in the column value.
As far as I know there is no direct shortcut for this, but you can do it using map:
>>> dafr[dafr.b.map(lambda x: 2 in x)]
a b
0 1 [1, 2, 3]
1 2 [2, 3, 4]
dafr[dafr['b'].apply(lambda x: 2 in x)]
If you store b as a column of tuples rather than a column of lists, dafr[dafr['b'].apply(lambda x: 2 in x)] will execute quite fast.
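On pandas 0.25+, another sketch avoids the Python-level lambda entirely by exploding the list column and collapsing the membership test back to one boolean per original row:

```python
import pandas as pd

dafr = pd.DataFrame({'a': [1, 2, 3], 'b': [[1, 2, 3], [2, 3, 4], [3, 4, 5]]})

# One row per list element, test equality, then any() per original row label.
mask = dafr['b'].explode().eq(2).groupby(level=0).any()
print(dafr[mask])
```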