I want to aggregate indices of a dataframe with groupby function.
   word  count
0     a      3
1   the      5
2     a      3
3    an      2
4   the      1
What I want is a pd.Series whose values are lists of indices in descending order:
word
a [2, 0]
an [3]
the [4, 1]
I've tried some built-in functions with groupby; however, I couldn't find a way to aggregate indices. Could you provide any hint or solution for this problem?
I think you can first reverse the row order with [::-1], then groupby and convert each group's index to a list. Finally, apply sort_index:
print (df[::-1].groupby('word', sort=False).apply(lambda x: x.index.tolist()).sort_index())
word
a [2, 0]
an [3]
the [4, 1]
dtype: object
Another similar solution:
print (df.sort_index(ascending=False)
.groupby('word', sort=False)
.apply(lambda x: x.index.tolist())
.sort_index())
word
a [2, 0]
an [3]
the [4, 1]
dtype: object
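A variant along the same lines that avoids apply: expose the original index as a column after reversing, then aggregate it with agg(list). A sketch, assuming the sample data above:

```python
import pandas as pd

df = pd.DataFrame({"word": ["a", "the", "a", "an", "the"],
                   "count": [3, 5, 3, 2, 1]})

# Reverse the rows, turn the original index into a column,
# then collect it per word; groupby sorts the keys by default.
out = df[::-1].reset_index().groupby("word")["index"].agg(list)
print(out)
```

This keeps the whole aggregation in a single built-in reducer rather than a Python lambda.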
Related
Let's say I have the following pandas DataFrame:
index  A  B  C
    0  2  1  4
    1  1  2  3
    2  4  3  2
    3  3  4  1
I want to get the index of the row in each column where the value of the column at that row is greater than all subsequent rows. So in this example, my desired output would be
A  B  C
2  3  0
What is the most efficient way to do this?
In that case, I guess I would use:
df.idxmax()
Or to get it formatted to your desired output:
pd.DataFrame(df.idxmax()).T
df[::-1].idxmax(axis=0)
Explanation: this returns the indices of the last maximum values. Reversing the row order first means the first occurrence of the maximum in the reversed frame corresponds to the last occurrence in the original (the documentation for DataFrame.idxmax says it returns the index of the first occurrence of the maximum). The following code produces the desired result (as a pd.DataFrame):
df = pd.DataFrame(
[[2, 1, 4], [1, 2, 3], [4, 3, 2], [3, 4, 1]],
index=[0, 1, 2, 3], columns=['A', 'B', 'C']
)
pd.DataFrame(df[::-1].idxmax(axis=0)).T
"index of the first value greater than all subsequent rows" <-> "index of last occurrence of maximum value"
I have a DataFrame:
COL1  COL2
   1     1
   3     1
   1     3
I need to sort by COL1 + COL2.
The key=lambda col: f(col) argument of sort_values(...) lets you sort by a transformed column, but in my case I need to sort on the basis of two columns. It would be nice if a key function could receive two or more columns, but I don't know whether such an option exists.
So, how can I sort its rows by sum COL1 + COL2?
Thank you for your time!
Assuming a unique index, you can also conveniently use the key parameter of sort_values to pass a callable to apply to the by column. Here we can add the other column:
df.sort_values(by='COL1', key=df['COL2'].add)
We can even generalize to any number of columns using sort_index:
df.sort_index(key=df.sum(axis=1).get)
Output:
   COL1  COL2
0     1     1
2     1     3
1     3     2
Used input:
data = {"COL1": [1, 3, 1], "COL2": [1, 2, 3]}
df = pd.DataFrame(data)
This does the trick:
data = {"Column 1": [1, 3, 1], "Column 2": [1, 2, 3]}
df = pd.DataFrame(data)
sorted_indices = (df["Column 1"] + df["Column 2"]).sort_values().index
df.loc[sorted_indices, :]
I just created a series that has the sum of both the columns, sorted it, got the sorted indices, and printed those indices out for the dataframe.
(I changed the data a little so you can see the sorting in action. Using the data you provided, you wouldn't have been able to see the sorted data as it would have been the same as the original one.)
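The key-based tricks above assume a unique index, since they rely on label alignment. When the index may contain duplicates, a purely positional variant works too; a sketch with the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"COL1": [1, 3, 1], "COL2": [1, 2, 3]})

# argsort on the row sums gives positional order, so duplicate
# index labels are not a problem.
out = df.iloc[np.argsort(df.sum(axis=1).to_numpy())]
print(out)
```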
I have a pandas.DataFrame like
    B1  B2  B3
A1   0   1   2
A2   3   4   5
Also, index=pd.Index(['A2', 'A1']), and columns=pd.Index(['B2', 'B3']). What I want to get is [4, 2], that is, elements in A2-B2 and A1-B3, respectively.
Is there a clever built-in operation to perform this in pandas?
I searched with different expressions for a while but no clue. There could be duplicate questions, sorry for that case. Thank you for taking a look at this.
Use Index.get_indexer to get the positions of the index and column values, then select them with NumPy advanced indexing (after converting df to a NumPy array):
index=pd.Index(['A2', 'A1'])
columns=pd.Index(['B2', 'B3'])
i = df.index.get_indexer(index)
c = df.columns.get_indexer(columns)
L = df.to_numpy()[i, c].tolist()
print (L)
[4, 2]
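Put together as a self-contained, runnable sketch (the df construction here is an assumption matching the question's table):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]],
                  index=["A1", "A2"], columns=["B1", "B2", "B3"])
index = pd.Index(["A2", "A1"])
columns = pd.Index(["B2", "B3"])

i = df.index.get_indexer(index)      # row positions [1, 0]
c = df.columns.get_indexer(columns)  # column positions [1, 2]
L = df.to_numpy()[i, c].tolist()
print(L)  # [4, 2]
```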
Or reshape by DataFrame.stack and select by DataFrame.loc with MultiIndex.from_tuples:
L = df.stack().loc[pd.MultiIndex.from_tuples(zip(index, columns))].tolist()
print (L)
[4, 2]
If there are only a few values, it is possible to use a list comprehension with zip and DataFrame.at:
L = [df.at[i, c] for i, c in zip(index, columns)]
print (L)
[4, 2]
Another option would be to zip the index and columns (same idea as @Jezrael); however, you could just pass the unpacked values to loc (internally, it takes care of finding the right values):
temp = df.stack()
zipped = [*zip(index, columns)]
temp.loc(axis=0)[zipped].array
<PandasArray>
[4, 2]
Length: 2, dtype: int64
I have two data frames. Both have one column of numpy arrays with 3 elements per entry, like so:
0 [0.552347, 0.762896, 0.336009]
1 [0.530716, 0.808313, 0.254895]
2 [0.528786, 0.734991, 0.424469]
3 [0.202799, 0.669395, -0.714691]
4 [0.791936, -0.100072, -0.602347]
6 [0.428896, -0.122712, 0.89498]
How do I take the dot product of each row of one data frame with the corresponding row of the other data frame? Meaning, I want to calculate the dot product of the first element of df1 with the first element of df2, then the second element of df1 with the second element of df2, then third, and so on.
df1 = pd.DataFrame([(np.array([0.552347, 0.762896, 0.336009]), ),
(np.array([0.530716, 0.808313, 0.254895]), )], columns=['v1'])
df2 = pd.DataFrame([(np.array([0.528786, 0.734991, 0.424469]), ),
(np.array([0.202799, 0.669395, -0.714691]), )], columns=['v2'])
pd.concat((df1, df2), axis=1).apply(lambda row: row.v1.dot(row.v2), axis=1)
0 0.995420
1 0.466538
Assuming df1 and df2 have the same length:
[x.dot(y) for x, y in zip(df1.col1.values,df2.col1.values)]
Out[648]: [0.9999995633060001, 1.00000083965]
It's pretty fast to calculate dot products manually. For this you can use mul and sum if the 2 dataframes share the same index:
df1.col.mul(df2.col).apply(sum)
If they don't share the same index (but are the same length), use reset_index first:
df1.reset_index().col.mul(df2.reset_index().col).apply(sum)
Example:
>>> df1
col
0 [0, 1, 2]
1 [3, 4, 5]
>>> df2
col
0 [5, 6, 7]
1 [1, 2, 3]
>>> df1.col.mul(df2.col).apply(sum)
0 20
1 26
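For larger frames, stacking the object column into a plain 2-D NumPy array avoids per-row Python calls entirely. A sketch, assuming every cell holds an equal-length array and the frames share row order:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col": [np.array([0, 1, 2]), np.array([3, 4, 5])]})
df2 = pd.DataFrame({"col": [np.array([5, 6, 7]), np.array([1, 2, 3])]})

# Stack the per-row arrays into (n, 3) matrices, then take
# row-wise dot products in one vectorized step.
a = np.stack(df1["col"].to_numpy())
b = np.stack(df2["col"].to_numpy())
dots = (a * b).sum(axis=1)
print(dots)  # [20 26]
```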
How do I get the value from a dataframe based on a list of indices and headers?
These are the dataframes i have:
a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
referencingDf = pd.DataFrame(['c','c','b'])
Based on the same index, i am trying to get the following dataframe output:
outputDf = pd.DataFrame([3,6,8])
Currently, I tried this, but it would require taking the diagonal values. I'm pretty sure there is a better way of doing this:
a.loc[referencingDf.index.values, referencingDf[:][0].values]
You need lookup:
b = a.lookup(a.index, referencingDf[0])
print (b)
[3 6 8]
df1 = pd.DataFrame({'vals':b}, index=a.index)
print (df1)
vals
0 3
1 6
2 8
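Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on current pandas an equivalent (a sketch using the question's data) is:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])
referencingDf = pd.DataFrame(["c", "c", "b"])

# Positions of the requested column for each row, then NumPy
# advanced indexing: one (row, column) pair per row.
cols = a.columns.get_indexer(referencingDf[0])
b = a.to_numpy()[np.arange(len(a)), cols]
print(b)  # [3 6 8]
```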
Another way to use list comprehension:
vals = [a.loc[i,j] for i,j in enumerate(referencingDf[0])]
# [3, 6, 8]
IIUC, you can use df.get_value (deprecated in later pandas versions) in a list comprehension.
vals = [a.get_value(*x) for x in referencingDf.reset_index().values]
# a simplification would be [ ... for x in enumerate(referencingDf[0])] - DYZ
print(vals)
[3, 6, 8]
And then, construct a dataframe.
df = pd.DataFrame(vals)
print(df)
0
0 3
1 6
2 8
Here's one vectorized approach that uses a column_index helper (which maps column names to their integer positions, e.g. via df.columns.get_indexer) and then NumPy's advanced indexing to extract those values from each row of the dataframe:
In [177]: col_idx = column_index(a, referencingDf.values.ravel())
In [178]: a.values[np.arange(len(col_idx)), col_idx]
Out[178]: array([3, 6, 8])