Taking rows from dataframe until a condition is met - python

I have a dataframe with two columns:
A B
0 1 3
1 2 2
2 3 2
3 9 3
4 1 1
...
For a given index i, I want the rows from row i up to and including the first row j for which df.at[j, 'A'] - df.at[i, 'B'] > 5. I don't want any rows after row j.
For example, let i=1, the output should be:
[out]
A B
2 2
3 2
9 3
Is there a simple way to do this without using loops?

df = pd.DataFrame({'A': [10, 1, 2, 3, 9], 'B': [1, 3, 2, 2, 3]})
i = 2
base = df.at[i, 'B']
df = df.iloc[i:]
j = df[df['A'] - base > 5]
if not j.empty:
    print(df.loc[:j.index[0]])  # .loc slices by label and includes row j
else:
    print('Condition not found')
Prints:
A B
2 2 2
3 3 2
4 9 3

You could try as follows:
import pandas as pd

data = {'A': {0: 10, 1: 2, 2: 3, 3: 9}, 'B': {0: 3, 1: 2, 2: 2, 3: 3}}
df = pd.DataFrame(data)
i = 1
s = df.loc[i:, 'A'] - df.loc[i, 'B'] > 5
trues = s[s]  # keep only the True entries
if not trues.empty:
    subset = df.loc[i:trues.index[0]]  # label-based slice up to the first True
else:
    subset = pd.DataFrame()
print(subset)
A B
1 2 2
2 3 2
3 9 3
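The boolean-mask approach can also be written compactly with idxmax, which returns the label of the first True. A minimal sketch, using the question's original data (the variable names are illustrative, not from the answers above):

```python
import pandas as pd

# The question's example data.
df = pd.DataFrame({'A': [1, 2, 3, 9, 1], 'B': [3, 2, 2, 3, 1]})
i = 1

# True where A (from row i onward) exceeds B at row i by more than 5.
mask = df.loc[i:, 'A'] - df.at[i, 'B'] > 5
if mask.any():
    j = mask.idxmax()      # label of the first row meeting the condition
    subset = df.loc[i:j]   # .loc slicing includes both endpoints
else:
    subset = pd.DataFrame()
print(subset)
```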

Fill column based on subsets of array

I have a dataframe like this
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'A': [1, 2, 3, 2, 3, 1],
        'B': [5, 2, 4, 1, 4, 5],
        'C': list('abcdef')
    }
)
and an array like that
a = np.array([
    [1, 5],
    [3, 4]
])
I would now like to add an additional column D to df which contains the word "found" based on whether the values of A and B are contained as a subset in a.
A straightforward implementation would be
for li in a.tolist():
    m = (df['A'] == li[0]) & (df['B'] == li[1])
    df.loc[m, 'D'] = "found"
which gives the desired outcome
A B C D
0 1 5 a found
1 2 2 b NaN
2 3 4 c found
3 2 1 d NaN
4 3 4 e found
5 1 5 f found
Is there a solution which would avoid the loop?
One option is to use merge with the indicator parameter:
out = df.merge(pd.DataFrame(a, columns=['A', 'B']), how='left', indicator="D")
out['D'] = np.where(out['D'].eq("both"), "Found", "Not Found")
print(out)
A B C D
0 1 5 a Found
1 2 2 b Not Found
2 3 4 c Found
3 2 1 d Not Found
4 3 4 e Found
5 1 5 f Found
Here is one way of doing it, using numpy broadcasting:
m = (df[['A', 'B']].values[:, None] == a).all(-1).any(-1)
df['D'] = np.where(m, 'Found', 'Not found')
A B C D
0 1 5 a Found
1 2 2 b Not found
2 3 4 c Found
3 2 1 d Not found
4 3 4 e Found
5 1 5 f Found
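Another possible sketch, not from the answers above: build a MultiIndex from the (A, B) pairs and test membership with isin, which may avoid materializing the full broadcasted comparison for larger inputs:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 2, 3, 1],
                   'B': [5, 2, 4, 1, 4, 5],
                   'C': list('abcdef')})
a = np.array([[1, 5], [3, 4]])

# Each row's (A, B) pair is checked against the rows of a.
m = pd.MultiIndex.from_frame(df[['A', 'B']]).isin([tuple(r) for r in a.tolist()])
df['D'] = np.where(m, 'Found', 'Not found')
print(df)
```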
Here is another way:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'A': [1, 2, 3, 2, 3, 1],
        'B': [5, 2, 4, 1, 4, 5],
        'C': list('abcdef')
    }
)
a = np.array([
    [1, 5],
    [3, 4]
])
df = df.merge(pd.DataFrame(a, columns=['A', 'B']), 'left', indicator="D")
D = df.pop("D")
df['D'] = 'found'
df['D'] = df['D'].where(D.eq('both'), other=np.nan)
print(df)
Output:
A B C D
0 1 5 a found
1 2 2 b NaN
2 3 4 c found
3 2 1 d NaN
4 3 4 e found
5 1 5 f found

Remove multivalued columns

I have a dataframe:
A B
0 2 3
1 2 4
2 3 5
If a column has more than 2 different values, I want to remove it.
Expected output:
A
0 2
1 2
2 3
You can use .nunique() and .loc, passing a boolean mask:
df = pd.DataFrame({'A': {0: 2, 1: 2, 2: 3}, 'B': {0: 3, 1: 4, 2: 5}})
df.loc[:, (df.nunique() <= 2)]
A
0 2
1 2
2 3
An alternative approach (credit to this answer):
criteria = df.nunique() <= 2
df[criteria.index[criteria]]
Use a for loop and value_counts to get the result:
df = pd.DataFrame(data={'A': [2, 2, 3], 'B': [3, 4, 5]})
for var in df.columns:
    result = df[var].value_counts()
    if len(result) > 2:
        df.drop(var, axis=1, inplace=True)
df
Output
A
0 2
1 2
2 3

Python Panda - concatenate two column values into a single column with label name [duplicate]

I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized the result was overwriting with the last metric any time all four metrics didn't meet the conditions, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
You could do:
import pandas as pd

data = [[4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [''.join(k for k in df.columns if record[k] > th[k])
                for record in df.to_dict('records')]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot:
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, axis=1).dot(df.columns)
Out:
0      AC
1      BC
2    ABCD
dtype: object
# to assign: df['New'] = df.gt(s, axis=1).dot(df.columns)
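To unpack the dot trick: df.gt(s, axis=1) yields a boolean frame, and the matrix product of a boolean row with the string column names concatenates the names wherever the row holds True (in Python, False * 'A' is '' and True * 'A' is 'A'). A self-contained sketch of the same answer:

```python
import pandas as pd

df = pd.DataFrame([[4, 3, 3, 1], [2, 5, 2, 2], [3, 5, 2, 4]],
                  columns=['A', 'B', 'C', 'D'])
s = pd.Series({'A': 2, 'B': 3, 'C': 1, 'D': 3})

mask = df.gt(s, axis=1)                # True where the threshold is exceeded
df['NewCol'] = mask.dot(mask.columns)  # bool row @ string names -> concatenation
print(df)
```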
Another option that operates in an array-oriented fashion; it would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame(
    [
        [4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]
    ],
    columns=['A', 'B', 'C', 'D']
)
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index=['A', 'B', 'C', 'D'])
# Subtract the series from the data (broadcasting), then use sum to concatenate the strings.
data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis=1)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD

how to convert header row into new columns in python pandas?

I am having following dataframe:
A,B,C
1,2,3
I have to convert above dataframe like following format:
cols,vals
A,1
B,2
C,3
How to create column names as a new column in pandas?
You can transpose with T:
import pandas as pd
df = pd.DataFrame({'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}})
print (df)
A B C
0 1 2 3
print (df.T)
0
A 1
B 2
C 3
df1 = df.T.reset_index()
df1.columns = ['cols','vals']
print (df1)
cols vals
0 A 1
1 B 2
2 C 3
If the DataFrame has more rows, you can use:
import pandas as pd
df = pd.DataFrame({'A': {0: 1, 1: 9, 2: 1},
                   'B': {0: 2, 1: 4, 2: 8},
                   'C': {0: 3, 1: 6, 2: 7}})
print (df)
A B C
0 1 2 3
1 9 4 6
2 1 8 7
df.index = 'vals' + df.index.astype(str)
print (df.T)
vals0 vals1 vals2
A 1 9 1
B 2 4 8
C 3 6 7
df1 = df.T.reset_index().rename(columns={'index':'cols'})
print (df1)
cols vals0 vals1 vals2
0 A 1 9 1
1 B 2 4 8
2 C 3 6 7
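A further option, not in the answer above: melt produces the two-column shape directly, without transposing (for the single-row input, the column names become the cols column):

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
out = df.melt(var_name='cols', value_name='vals')
print(out)
```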

Pandas multiindex boolean indexing

Given a multiindexed dataframe, I would like to return only the rows of the outer-index groups in which every row satisfies a condition. Here is a small working example:
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [0, 2, 2, 2]})
df = df.set_index(['a', 'b'])
print(df)
out:
c
a b
1 1 0
2 2
2 3 2
4 2
Now, I would like to return the entries for which c > 1. For instance, I could do something like
df[df['c'] > 1]
out:
c
a b
1 2 2
2 3 2
4 2
But I want to get
out:
c
a b
2 3 2
4 2
Any thoughts on how to do this in the most efficient way?
I ended up using groupby:
df.groupby(level=0).filter(lambda x: all(v > 1 for v in x['c']))
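As a vectorized alternative to filter (a sketch of the same logic, not from the original thread): compute each group's minimum of c with transform and keep the groups whose minimum exceeds 1:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [0, 2, 2, 2]})
df = df.set_index(['a', 'b'])

# A group passes only if every c in it is > 1, i.e. its minimum is > 1.
keep = df['c'].groupby(level=0).transform('min') > 1
out = df[keep]
print(out)
```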
