import pandas as pd
df1 = pd.DataFrame({'ID':['i1', 'i2', 'i3'],
'A': [2, 3, 1],
'B': [1, 1, 2],
'C': [2, 1, 0],
'D': [3, 1, 2]})
df1 = df1.set_index('ID')
df1.head()
A B C D
ID
i1 2 1 2 3
i2 3 1 1 1
i3 1 2 0 2
df2 = pd.DataFrame({'ID':['i1-i2', 'i1-i3', 'i2-i3'],
'A': [2, 1, 1],
'B': [1, 1, 1],
'C': [1, 0, 0],
'D': [1, 1, 1]})
df2 = df2.set_index('ID')
df2
A B C D
ID
i1-i2 2 1 1 1
i1-i3 1 1 0 1
i2-i3 1 1 0 1
Given a data frame like df1, I want to compare every pair of distinct rows, take the smaller value in each column, and output the result to a new data frame like df2.
For example, comparing row i1 with row i2 gives the new row i1-i2 as 2, 1, 1, 1.
What is the best way to do this in pandas?
Try this:
import numpy as np
from itertools import combinations

# for every pair of row positions, take the element-wise minimum of the two rows
v = df1.values
r = pd.DataFrame([np.minimum(v[i], v[j])
                  for i, j in combinations(range(len(df1)), 2)],
                 columns=df1.columns,
                 index=list(combinations(df1.index, 2)))
Result:
In [72]: r
Out[72]:
A B C D
(i1, i2) 2 1 1 1
(i1, i3) 1 1 0 2
(i2, i3) 1 1 0 1
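If you want the 'i1-i2' style labels from df2 instead of tuple labels, a minimal follow-up sketch (assuming the string index set on df1 above):
r.index = ['-'.join(pair) for pair in r.index]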
Please see this pandas df:
df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3],
                   'pay_date': ['Jul1', 'Jul2', 'Jul8', 'Aug5', 'Aug7', 'Aug22'],
                   'id_ind': [1, 2, 1, 2, 3, 1]})
I am trying to groupby 'id' and 'pay_date'. I only want to keep df['id_ind'].nlargest(2) in the dataframe after grouping by 'id' and 'pay_date'. Here is my code:
df = pd.DataFrame(df.groupby(['id', 'pay_date'])['id_ind'].apply(
    lambda x: x.nlargest(2)).reset_index())
This does not work, as the new df returns all the records. If it worked, 'id'==2 would only appear twice in the df, as there are 3 records and I only want the 2 largest by 'id_ind'.
My desired output:
pd.DataFrame({'id': [1, 1, 2, 2, 3],
'pay_date': ['Jul1', 'Jul2', 'Aug5', 'Aug7', 'Aug22'],
'id_ind': [1, 2, 2, 3, 1]})
Your groupby doesn't do what you expect because grouping by both 'id' and 'pay_date' puts every row in its own group, so nlargest(2) keeps them all. Instead, sort on id_ind and use groupby.tail:
df_final = (df.sort_values('id_ind').groupby('id').tail(2)
.sort_index()
.reset_index(drop=True))
Out[29]:
id id_ind pay_date
0 1 1 Jul1
1 1 2 Jul2
2 2 2 Aug5
3 2 3 Aug7
4 3 1 Aug22
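An alternative sketch using SeriesGroupBy.nlargest and the original row labels (this assumes df still has its default RangeIndex):
idx = df.groupby('id')['id_ind'].nlargest(2).index.get_level_values(1)
df_final = df.loc[idx].sort_index().reset_index(drop=True)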
I'm trying to select values of a column based on a value of a row using pandas.
For example:
names A B C D
A 1 0 2 0
A 2 1 0 0
C 1 0 4 0
A 2 0 0 0
D 0 1 2 5
As output I want something like:
name value
A 1
A 2
C 4
A 2
D 5
Is there a fast (efficient) way to do such an operation?
Take a look at lookup:
df.lookup(df.index,df.names)
Out[390]: array([1, 2, 4, 2, 5], dtype=int64)
#df['value']=df.lookup(df.index,df.names)
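Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a minimal replacement sketch using factorize and NumPy indexing:
import numpy as np
# map each row's 'names' entry to a column position, then index row-by-row
idx, cols = pd.factorize(df['names'])
df['value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]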
You can for example use the multi-purpose apply function:
import pandas
data = {
'names': ['A', 'A', 'C', 'A', 'D'],
'A': [1, 2, 1, 2, 0],
'B': [0, 1, 0, 0, 1],
'C': [2, 0, 4, 0, 2],
'D': [0, 0, 0, 0, 5],
}
df = pandas.DataFrame(data)
print(df)
def process_it(row):
    # pick the value in the column named by this row's 'names' entry
    return row[row['names']]

df['selection'] = df.apply(process_it, axis=1)
print(df[['names', 'selection']])
I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B' and then change the 2 to 4:
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.
You can use append to append the rows where B == 2, which you can extract using loc, reassigning B to 4 with assign. If order matters, you can then sort by C to reproduce your desired frame:
>>> df.append(df[df.B.eq(2)].assign(B=4)).sort_values('C')
B C Date
0 1 A 1
1 2 B 2
1 4 B 2
2 3 C 3
3 2 D 4
3 4 D 4
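Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same idea works with pd.concat (a sketch):
out = pd.concat([df, df[df.B.eq(2)].assign(B=4)]).sort_values('C')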
I have a pandas.Series named matches. When I call the pandas.Series.str.get method on it, it returns a new Series whose values are all NaN.
I have read the documentation for pandas.Series.str.get, but I still can't understand it.
It returns the second element of each iterable; it is the same as str[1]:
df = pd.DataFrame({"A": [[1,2,3], [0,1,3]], "B":['aswed','yuio']})
print(df)
A B
0 [1, 2, 3] aswed
1 [0, 1, 3] yuio
df['C'] = df['A'].str.get(1)
df['C1'] = df['A'].str[1]
df['D'] = df['B'].str.get(1)
df['D1'] = df['B'].str[1]
print(df)
A B C C1 D D1
0 [1, 2, 3] aswed 2 2 s s
1 [0, 1, 3] yuio 1 1 u u
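An all-NaN result usually means the requested position doesn't exist: str.get returns NaN when the index is out of range for an element. A minimal sketch with a throwaway series s:
s = pd.Series(['ab', 'c'])
print(s.str.get(5))   # NaN for both values: neither string has position 5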
I am trying to write a function that checks for the presence of a value in a row across columns. I have a script that does this by iterating through columns, but I am worried that this will be inefficient when used on large datasets.
Here is my current code:
import pandas as pd
a = [1, 2, 3, 4]
b = [2, 3, 3, 2]
c = [5, 6, 1, 3]
d = [1, 0, 0, 99]
df = pd.DataFrame({'a': a,
'b': b,
'c': c,
'd': d})
cols = ['a', 'b', 'c', 'd']
df['e'] = 0
for col in cols:
    df['e'] = df['e'] + df[col] == 1
print(df)
result:
a b c d e
0 1 2 5 1 True
1 2 3 6 0 False
2 3 3 1 0 True
3 4 2 3 99 False
As you can see, column e keeps record of whether the value "1" exists in that row. I was wondering if there was a better/more efficient way of achieving these results.
You can check whether each value in the data frame equals one, then test whether any value in a row is true (with axis=1):
df['e'] = df.eq(1).any(axis=1)
df
# a b c d e
#0 1 2 5 1 True
#1 2 3 6 0 False
#2 3 3 1 0 True
#3 4 2 3 99 False
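If you need to test membership of several values at once, isin is a handy variant (a sketch, reusing cols from the question):
df['e'] = df[cols].isin([1]).any(axis=1)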
Python supports the in and not in operators.
EXAMPLE:
>>> a = [1, 2, 5, 1]
>>> b = [2, 3, 6, 0]
>>> c = [5, 6, 1, 3]
>>> d = [1, 0, 0, 99]
>>> 1 in a
True
>>> 1 not in a
False
>>> 99 in d
True
>>> 99 not in d
False
By using these, you don't have to iterate over the list yourself in this case.
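To apply the same membership test row-wise across the DataFrame's columns, a sketch with apply (slower than the vectorized eq(1).any(axis=1) above, but it reads naturally):
df['e'] = df[cols].apply(lambda row: 1 in row.values, axis=1)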