DataFrame: How to select a different column for each row - python

Let's consider a DataFrame A with three columns: a, b and c. Suppose we also have a Series B of the same size as A; each row of B contains the name of one of A's columns. I want to construct a Series that contains the values from A at the columns specified by B.
The simplest example would be the following:
import numpy as np
import pandas as pd

idxs = np.arange(0, 5)
A = pd.DataFrame({
    'a': [3, 1, 5, 7, 8],
    'b': [5, 6, 7, 3, 1],
    'c': [2, 7, 8, 2, 1],
}, index=idxs)
B = pd.Series(['b', 'c', 'c', 'a', 'a'], index=idxs)
I need to apply some operation that gives a result identical to the following Series:
C = pd.Series([5, 7, 8, 7, 8], index=idxs)
In such a simple example one can perform 'broadcasting' on pure NumPy arrays as follows:
d = {'a': 0, 'b': 1, 'c': 2}
AA = A.rename(columns=d).to_numpy()      # .as_matrix() was removed in pandas 1.0
BB = B.apply(lambda x: d[x]).to_numpy()
CC = AA[idxs, BB]
That works, but in my real problem I have a MultiIndexed DataFrame, and things become more complicated.
Is it possible to do so, using pandas tools?
The first thing that comes to mind is:
A['idx'] = B
C = A.apply(lambda x: x[x['idx']], axis=1)
It works!

You can use DataFrame.lookup:
pd.Series(A.lookup(B.index, B), index=B.index)
0    5
1    7
2    8
3    7
4    8
dtype: int64
A NumPy solution involving broadcasting is:
A.values[B.index, (A.columns.values == B.values[:, None]).argmax(1)]
# array([5, 7, 8, 7, 8])
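Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a rough equivalent (just a sketch, reusing A and B from the question and assuming every value of B is a valid column label of A) is plain positional indexing:
col_pos = A.columns.get_indexer(B)                        # positions of B's labels among A's columns
C = pd.Series(A.to_numpy()[np.arange(len(A)), col_pos],   # pick one value per row
              index=B.index)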

Related

Creating a list in a DataFrame column which is a range of values from two other DataFrame columns

I need to create a list in a DataFrame column which is a range of numbers. The range limits should be the values in two other DataFrame columns.
df = pd.DataFrame({'A': [3, 7, 2, 8], 'B': [1, 3, 9, 3]}, index=[1, 2, 3, 4])
Now I need a DataFrame column which will be a series of lists like below:
[1,2,3]
[3,4,5,6,7]
[2,3,4,5,6,7,8,9]
[3,4,5,6,7,8]
I'm able to create a list in a dataframe column this way.
df['C'] = (df[['A','B']]).to_numpy().tolist()
This gives a column as below
[3,1]
[7,3]
[2,9]
[8,3]
But I'm not able to figure out how to create a list that is the range of these values in a DataFrame column.
I have also defined a function which will generate a list spanning the range between any two given numbers:
def createlist(r1, r2):
    if r1 == r2:
        return [r1]          # return a list in this case too, not a bare number
    elif r1 < r2:
        res = []
        while r1 < r2 + 1:
            res.append(r1)
            r1 += 1
        return res
    else:
        res = []
        while r1 + 1 > r2:
            res.append(r2)
            r2 += 1
        return res
But I'm struggling to apply this function to generate a DataFrame column while taking inputs from the other columns. Can you please help out? Thanks in advance.
You can try DataFrame.apply on rows:
df['C'] = df.apply(lambda row: list(range(row.min(), row.max()+1)), axis=1)
print(df)
   A  B                         C
1  3  1                 [1, 2, 3]
2  7  3           [3, 4, 5, 6, 7]
3  2  9  [2, 3, 4, 5, 6, 7, 8, 9]
4  8  3        [3, 4, 5, 6, 7, 8]
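Row-wise apply can be slow on large frames. A possible alternative (just a sketch, building the same lists with a plain comprehension over the two columns) is:
df['C'] = [list(range(min(a, b), max(a, b) + 1)) for a, b in zip(df['A'], df['B'])]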

How can I return a larger dataframe from two dataframes

e.g. I have two dataframes:
a = pd.DataFrame({'A':[1,2,3],'B':[6,5,4]})
b = pd.DataFrame({'A':[3,2,1],'B':[4,5,6]})
I want to get a dataframe c consisting of the larger value in each position of a & b:
c = max_function(a,b) = pd.DataFrame(max(a.iloc[i,j], b.iloc[i,j]))
c = pd.DataFrame({'A':[3,2,3],'B':[6,5,6]})
I don't want to generate c by comparing each value in a & b one at a time, because the real dataframes in my work are very large.
So I wonder if there's a ready-made pandas function which can do this? Thanks!
You could use numpy.maximum:
import pandas as pd
import numpy as np
a = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]})
b = pd.DataFrame({'A': [3, 2, 1], 'B': [4, 5, 6]})
c = np.maximum(a, b)
print(c)
Output
   A  B
0  3  6
1  2  5
2  3  6
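If you'd rather stay within pandas, DataFrame.where expresses the same element-wise maximum (a sketch, assuming a and b share the same index and columns):
c = a.where(a > b, b)   # keep a's value where it is larger, otherwise take b's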

Maximum of an array constituting a pandas dataframe cell

I have a pandas dataframe in which a column is formed by arrays. So every cell is an array.
Say there is a column A in dataframe df, such that
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     ...]
I want to operate on each array and get, e.g., the maximum of each array, and store it in another column.
In the example, I would like to obtain another column
B = [3,
     6,
     9,
     ...]
I have tried these approaches so far, none of them giving what I want.
df['B'] = np.max(df['A'])
df.applymap(lambda B: A.max())
df['B'] = df.applymap(lambda B: np.max(np.array(df['A'].tolist()), 0))
How should I proceed? And is this the best way to have my dataframe organized?
You can just apply(max). It doesn't matter whether the values are lists or np.arrays.
df = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
df['b'] = df['a'].apply(max)
print(df)
Outputs
           a  b
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
Here is one way without apply:
df['b'] = np.max(df['a'].values.tolist(), axis=1)
           a  b
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
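One caveat worth noting: np.max(..., axis=1) on the .tolist() result only works when every list has the same length, while apply(max) also handles lists of different lengths. A small sketch of the ragged case:
df = pd.DataFrame({'a': [[1, 2], [4, 5, 6], [7]]})   # lists of unequal length
df['b'] = df['a'].apply(max)                         # works: 2, 6, 7
# np.max(df['a'].values.tolist(), axis=1) fails here because the rows are ragged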

How to get index for all the duplicates in a dataframe (pandas - python)

I have a data frame with multiple columns, and I want to find the duplicates in some of them. My columns go from A to Z. I want to know which lines have the same values in columns A, D, F, K, L, and G.
I tried:
df = df[df.duplicated(keep=False)]
df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
However, this uses all of the columns.
I also tried
print(df[df.duplicated(['A', 'D', 'F', 'K', 'L', 'P'])])
This only returns the indices of the duplicated rows. I want the indices of all rows that share the same values, including the first occurrence.
Your final attempt is close. Instead of grouping by all columns, just use a list of the ones you want to consider:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})
res = df.groupby(['A', 'B']).apply(lambda x: (x.index).tolist()).reset_index()
print(res)
#    A  B          0
# 0  1  3  [0, 1, 2]
# 1  2  4     [3, 4]
# 2  2  5        [5]
A different layout using groupby:
df.index.to_series().groupby([df['A'],df['B']]).apply(list)
Out[449]:
A  B
1  3    [0, 1, 2]
2  4       [3, 4]
   5          [5]
dtype: object
You can have .groupby return a dict whose keys are the group labels (tuples for multiple columns) and whose values are the Index of the rows in each group:
df.groupby(['A', 'B']).groups
#{(1, 3): Int64Index([0, 1, 2], dtype='int64'),
# (2, 4): Int64Index([3, 4], dtype='int64'),
# (2, 5): Int64Index([5], dtype='int64')}
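To answer the original question directly, keep only the groups that actually repeat, using the asker's column subset (a sketch, assuming your DataFrame really has these columns):
cols = ['A', 'D', 'F', 'K', 'L', 'G']          # the columns to compare
dupes = {key: idx.tolist()                     # group label -> list of row indices
         for key, idx in df.groupby(cols).groups.items()
         if len(idx) > 1}                      # keep only real duplicates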

Convert pandas dataframe to a column-based order list

I want to convert a pandas DataFrame into a list.
For example, I have a DataFrame like below, and I want to make a list out of all the columns.
Dataframe (df)
A  B   C
0  4   8
1  5   9
2  6  10
3  7  11
Expected result
[[0,1,2,3], [4,5,6,7], [8,9,10,11]]
If I use df.values.tolist(), it will return a row-based list like below.
[[0,4,8], [1,5,9], [2,6,10], [3,7,11]]
It is possible to transpose the dataframe, but I want to know whether there are better solutions.
I think the simplest is to transpose.
Use T or numpy.ndarray.transpose:
df1 = df.T.values.tolist()
print (df1)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Or:
df1 = df.values.transpose().tolist()
print (df1)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Another answer with list comprehension, thank you John Galt:
L = [df[x].tolist() for x in df.columns]
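On newer pandas versions .values is usually written as .to_numpy(); the same column-based list can also be built without transposing the DataFrame itself (a sketch):
L = df.to_numpy().T.tolist()             # transpose the underlying array only
# or via a column-keyed dict (insertion order preserves the column order):
L = list(df.to_dict('list').values())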
